
Scraping links/content


Omzy


I've created a scrape script which fetches all links on a page:

 

$dom = new DOMDocument();
// Suppress warnings raised by malformed real-world HTML.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@class='listinglink']");

foreach ($links as $item)
{
    // $item is already the current <a> element, so no index counter is needed.
    $url = htmlspecialchars($item->getAttribute('href'));
    echo '<a href="'.$url.'">'.$url.'</a><br/>';
}

 

I now need to extend this further: the script should follow each link and fetch, for example, the content of all the <p> tags on that page. The output should look like this:

 

Link 1

P tag 1 content

P tag 2 content

P tag 3 content

 

Link 2

P tag 1 content

P tag 2 content

 

...and so on. Can anyone assist me with this?

https://forums.phpfreaks.com/topic/190376-scraping-linkscontent/

I would do it a little something like this:

 

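foreach ($links as $item)
{
    // Open a fresh handle per link; $item is a DOMElement, so read its href.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $item->getAttribute('href'));
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $opt = curl_exec($ch);
    curl_close($ch);
    // Match <p> tags with or without attributes, including across line breaks.
    preg_match_all("~<p\b[^>]*>(.+?)</p>~is", $opt, $matches);
    // $matches[1] holds the captured paragraph contents.
    print_r($matches[1]);
}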
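
If the regex turns out to be brittle (nested markup, unusual whitespace), one alternative is to reuse the DOMDocument/DOMXPath approach from the first post on each fetched page as well. The following is only a rough sketch under some assumptions: the function names extractParagraphs, fetchPage and printLinksWithParagraphs are made up for illustration, the href values are assumed to be absolute URLs, and character encoding is not handled.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <p> element in $html.
function extractParagraphs(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Collect parser errors internally instead of emitting warnings for malformed HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ((new DOMXPath($dom))->query('//p') as $p) {
        $paragraphs[] = trim($p->textContent);
    }
    return $paragraphs;
}

// Hypothetical helper: fetches a URL with cURL and returns the body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    // The handle is released when $ch goes out of scope at the end of the function.
    return $body === false ? null : $body;
}

// Hypothetical helper: prints each link followed by the <p> text of the page it points to.
// Assumes every item is a DOMElement whose href is an absolute URL.
function printLinksWithParagraphs(iterable $links): void
{
    foreach ($links as $item) {
        $url = htmlspecialchars($item->getAttribute('href'));
        echo '<a href="' . $url . '">' . $url . '</a><br/>';
        $body = fetchPage($item->getAttribute('href'));
        foreach (extractParagraphs($body ?? '') as $text) {
            echo htmlspecialchars($text) . '<br/>';
        }
    }
}
```

With $links from the XPath query in the first post, calling printLinksWithParagraphs($links) should print each link followed by its paragraphs, roughly in the "Link 1 / P tag 1 content" layout asked for. Relative links would need to be resolved against the base URL first, which this sketch does not attempt.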

