Parse HTML code

cordoprod · August 16, 2009

Hi! I need to parse some HTML and I'm not quite sure how to do it.

I fetch the HTML with curl; that part already works and I have the code for it.

But that gives me the source of the whole page, and I only want a small part of it.

Check this source out:

 <div>
    <div class="bransjenavn">
        
            
        
        <div class="sort">
            
            <form name="form" action="">
                <select onchange="MM_jumpMenu('parent',this,0)" name="jumpMenu">
                    <option value="/gs/companyList.c?bc=0&q=elektronikk">Standard sortering</option>
                    
                        <option value="/gs/companyList.c?bc=0&q=elektronikk&sort=2">Sorter alfabetisk</option>
                    
                    
                        
                            <option value="/gs/companyList.c?bc=0&q=elektronikk&sort=4">Sorter etter omtaler</option>
                        
                    
                </select>
            </form>
            <div class="treffKart">
                <a href="/kart/#tab%3Dyellow%26autozoom%3Dtrue%26id%3Dc_Z001UNJL%26id%3Dc_Z0HPLR6L">Vis treff i kart</a>
            </div>
            
        </div>
        <h1>Treff i firmanavn:
            
                    <span>
                            2
                        av
                            254
                        treff
                        -
                        <a href="/gs/companyList.c?bc=0&q=elektronikk">
                            Vis alle
                        </a>
                    </span>
                
        </h1>
    </div>
</div>

All I want from that code is "Treff i firmanavn".

Is it possible to remove all the other code? (Be aware that there will be more than one of these: "Treff i firmanavn" is one category among several, and this is not the whole source code of the page.)

Here is my function, which just outputs the whole page:

function getCategory($q) {
    $url = "http://www.gulesider.no/gs/categoryList.c?q=$q";
    $html = curlGet($url);
    $start = strpos($html, "<h1>");
    $end = strpos($html, ":");
    $html = substr($html, $start, $end-$start);

    preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $html, $matches);

    return $matches[1];
}

DEVILofDARKNESS · August 16, 2009

Maybe you should try splitting the code twice: first cut off all the text up to <h1>, then cut off all the text after </h1>.

cordoprod · August 16, 2009

Could you please show me some code?
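
For reference, the two-step split suggested above could look roughly like this in PHP. This is only a sketch: extractFirstHeading() is a made-up helper name, not something from the thread, and it assumes the opening tag is a bare <h1> with no attributes and that the first colon after it ends the label.

```php
<?php
// Hypothetical helper (not from the thread): returns the label that
// precedes the first colon inside the first <h1> element, or null.
function extractFirstHeading(string $html): ?string
{
    // Step 1: drop everything up to and including the first "<h1>".
    $parts = explode('<h1>', $html, 2);
    if (count($parts) < 2) {
        return null; // no <h1> in the input
    }

    // Step 2: drop everything from the first "</h1>" onward.
    $inner = explode('</h1>', $parts[1], 2)[0];

    // Keep only the text before the first colon.
    $label = explode(':', $inner, 2)[0];

    return trim($label);
}

// Shortened version of the markup from the first post.
$sample = '<div><h1>Treff i firmanavn:
    <span>2 av 254 treff</span>
</h1></div>';

echo extractFirstHeading($sample), "\n"; // prints "Treff i firmanavn"
```

One thing to watch in the original getCategory(): strpos($html, ":") searches from the start of the page, so any colon that appears before the <h1> (for example in a URL) would be found first. Searching only inside the text after <h1>, as above, avoids that.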

PatrickMc · August 31, 2009

cordoprod wrote: All I want from that code is "Treff i firmanavn".

Cordoprod, are you allowed to use anything in addition to PHP? If so, the following server-side biterscripting script will work.

# Get source into a str variable. Source may be a
# document on the internet, or in a local file. Use
# correct URL (starting with http:// ) or
# file path instead of "source" below.
var str source ; cat "source" > $source

# Extract and remove portion up to <h1...>.
stex -c -r "^<h1&\>^]" $source > null

# Extract and output portion up to colon .
stex "]^:^" $source
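
If only PHP is available, the same two steps described in the script comments (skip past an <h1...> tag, then take everything up to the colon) can probably be expressed as a single regular expression, which also picks up every category on the page rather than just the first. A minimal sketch, assuming each category heading is an <h1> whose label ends in a colon; extractHeadingLabels() is a hypothetical name, and the second heading in the sample is invented for illustration.

```php
<?php
// Hypothetical helper (not from the thread): collects the label before
// the colon in every <h1> element of the given HTML.
function extractHeadingLabels(string $html): array
{
    // <h1[^>]*>  an opening h1 tag, with or without attributes
    // \s*        optional whitespace after the tag
    // ([^:<]+):  the label: text up to the colon, not crossing into a tag
    preg_match_all('/<h1[^>]*>\s*([^:<]+):/i', $html, $matches);

    return array_map('trim', $matches[1]);
}

// First heading is from the thread; the second is made-up sample data.
$sample = '<h1>Treff i firmanavn: <span>2 av 254 treff</span></h1>'
        . '<h1 class="x">Treff i bransje: <span>10 treff</span></h1>';

print_r(extractHeadingLabels($sample));
```

A regular expression is fine for a fixed, known page layout like this one, but it is not a general HTML parser; if the markup varies a lot, a real parser is the safer choice.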
