Using DOMDocument and DOMXPath to scrape a site

cags · January 19, 2010

Ok, it would appear that my expectations and reality don't match up. I don't know if I'm going about this completely wrong or simply making a small mistake, but I'm getting nothing but a headache from it. I have an HTML page that has been fetched from a site using cURL, and I am attempting to extract various bits of information from it. Since the HTML is so large I won't post it (at least for now). Hopefully it will suffice to say that the page contains many divs with class="slot". I am attempting to loop through them, and within each one I am (currently) trying to fetch the href attribute of an a tag that sits inside a div with class="something" (the real class name, used in the code below, is "titlelogo"). A basic example of the HTML structure...

<div id="slots">
   <div class="slot">
      <div class="something">
         <a href="http://www.google.com">Google</a>
      </div>
   </div>
   <div class="slot">
      <div class="something">
         <a href="http://www.yahoo.com">Yahoo</a>
      </div>
   </div>
</div>

This is the core of the code I've been trying. I've tried *many* variations, but it seems that somewhere along the line I'm making an assumption I shouldn't be.

$dom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

$query = $xpath->query('//div[@class="slot"]');

foreach($query as $node) {
    $q = $xpath->query('//div[@class="titlelogo"]/a', $node);
    echo $q->item(0)->attributes->getNamedItem('href')->value;
}

Before somebody mentions it, yes, I know I could just do something like //div[@class="slot"]/*/a (or whatever the exact syntax is for that) or even build a full relative path, but the point is that the contents of each 'slot' div are all related, so I need to work on each 'slot' individually.

salathe · January 19, 2010

The second XPath query (within the loop) is evaluated from the root of the document even though you're passing a context node, because a path that starts with // is absolute. To make the path relative to the context node, start it with a dot: .//div[@class="titlelogo"]/a
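
For anyone landing here later, a minimal sketch of the loop with the relative path applied. It assumes the simplified markup from the question (so the inner class is "something" rather than "titlelogo"), and the $hrefs array is just an illustrative name for collecting the results, not part of the original code.

```php
<?php
// Sample markup copied from the question; the real page is assumed to follow the same shape.
$html = <<<HTML
<div id="slots">
   <div class="slot">
      <div class="something">
         <a href="http://www.google.com">Google</a>
      </div>
   </div>
   <div class="slot">
      <div class="something">
         <a href="http://www.yahoo.com">Yahoo</a>
      </div>
   </div>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$hrefs = [];                        // illustrative container for the results

foreach ($xpath->query('//div[@class="slot"]') as $slot) {
    // The leading dot anchors the search at $slot instead of the document root.
    $links = $xpath->query('.//div[@class="something"]/a', $slot);
    if ($links->length > 0) {
        $hrefs[] = $links->item(0)->getAttribute('href');
    }
}

print_r($hrefs);
```

With the dot in place, each pass through the loop only sees the links inside its own slot, so $hrefs should end up holding one URL per slot. Without the dot, item(0) would be the first matching link in the whole document on every iteration. The length check is there so a slot with no matching link is skipped rather than causing an error on a null item.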

cags · January 19, 2010

Thanks salathe, that works much better. I was so near yet so far.
