Scraping links/content

Omzy · January 30, 2010

I've created a scrape script which fetches all links on a page:

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every <a> whose class attribute is exactly "listinglink"
$links = $xpath->query("//a[@class='listinglink']");

foreach($links as $item)
{
    // $item is already the current DOMElement, so no index counter is needed
    $url = $item->getAttribute('href');
    echo '<a href="'.$url.'">'.$url.'</a><br/>';
}

I now need to extend this further: it needs to follow each link and get, for example, the content of all the <p> tags on that page. The page output should look like this:

Link 1

P tag 1 content

P tag 2 content

P tag 3 content

Link 2

P tag 1 content

P tag 2 content

...and so on. Can anyone assist me with this?

The Little Guy · January 30, 2010

I would do it a little something like this:

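foreach($links as $item)
{
    // $item is a DOMElement, so read its href (relative URLs would need resolving first)
    $ch = curl_init($item->getAttribute('href'));
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $opt = curl_exec($ch);
    curl_close($ch);
    if ($opt === false) continue; // skip pages that failed to load
    // "s" lets "." span newlines; [^>]* allows attributes on the <p> tag
    preg_match_all('~<p\b[^>]*>(.+?)</p>~is', $opt, $matches);
    print_r($matches);
}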
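
Since the first post already parses pages with DOMDocument and DOMXPath, the paragraphs could be pulled out the same way instead of with a regular expression. A minimal sketch, assuming the fetched page is held in a string; extractParagraphs() is a hypothetical helper name, not something from the posts above:

```php
<?php
// Hypothetical helper (not from the thread): returns the trimmed text
// of every non-empty <p> element found in an HTML string.
function extractParagraphs(string $html): array
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $paragraphs = [];

    foreach ($xpath->query('//p') as $p) {
        // textContent drops nested tags such as <b> or <a> but keeps their text.
        $text = trim($p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    return $paragraphs;
}
```

Inside the loop this would take the place of the preg_match_all() call: echo the link, then echo each string returned by extractParagraphs($opt) underneath it.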
