
Link parser that feeds the results into a DOMDocument


dilbertone


Good evening, dear PHPFreaks, and hello to everybody.

I want to create a link parser, and I have chosen to do it with cURL. I have a few lines together now and would love to hear your review. Since I am new to programming, I would appreciate some hints from experienced devs.

 

Here are some details: we have several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

 

Note: I want to iterate over the result pages with a loop. Two example target pages:

 

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

 

 

I take this loop:

for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}

 

Dear PHPFreaks, what do you think? Is this loop over the target URLs a sound approach?

 

By the way, as you can see, some of the pages will be empty. Note that the empty pages should be thrown away; I do not want to store "empty" stuff.

 

Well, that is what I want to do. Now I need a good parser script.

 

Note: this is a three-part job:

 

1. Fetching the sub-pages
2. Parsing them
3. Storing the data in a MySQL DB
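
For orientation, here is a rough skeleton of how those three parts might fit together. fetchPage() and extractLinks() are hypothetical helper names I made up for the sketch; storeLink() is the function from the script further down:

// a minimal sketch of the three-part pipeline; fetchPage() and
// extractLinks() are hypothetical helpers, not existing functions
for ($id = 1; $id <= 10000; $id++) {
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$id}";
    $html = fetchPage($target_url);           // 1. fetch via cURL
    if ($html === false) {
        continue;                             // skip failed or empty pages
    }
    foreach (extractLinks($html) as $url) {   // 2. parse via DOMDocument
        storeLink($url, $target_url);         // 3. store in the MySQL DB
    }
}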

 

Well, the problem: some of the above-mentioned pages are empty, so I need a way to leave them aside, since I do not want to populate my MySQL DB with useless records. One idea for detecting them is sketched below.
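
One possible way to detect the empty pages, assuming they contain essentially no visible text once the markup is stripped (the threshold below is a guess and would need tuning against the real educa.ch template):

// heuristic sketch: treat a page as empty when hardly any text
// remains after stripping the tags; 100 chars is a made-up threshold
function looksEmpty($html) {
    $text = trim(strip_tags($html));
    return strlen($text) < 100;
}

If all pages share a navigation template, checking for the presence of the actual detail content would be more reliable than a raw length check.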

 

By the way, the parsing should be something that can be done with DOMDocument. What do you think? I need to combine the first part with the second; can you give me some starting points and hints on how to do that?

 

The fetching job should be done with cURL, and the fetched data should then go into a DOMDocument parsing job.
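
To connect the two parts: once cURL has returned the page into a string, DOMDocument can load it directly, and DOMXPath can pull out whatever nodes are needed. A minimal sketch, assuming $html holds the fetched page and using a placeholder //td query (the real educa.ch selectors will differ):

// $html is assumed to come from curl_exec() with CURLOPT_RETURNTRANSFER
$dom = new DOMDocument();
@$dom->loadHTML($html);        // @ suppresses warnings about sloppy HTML
$xpath = new DOMXPath($dom);

// placeholder query: every table cell; swap in the selector that
// matches the actual detail markup on the educa.ch pages
foreach ($xpath->query('//td') as $cell) {
    echo trim($cell->textContent), "\n";
}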

 

 

Note: I have taken the script from this place:

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

 

 

function storeLink($url, $gathered_from) {
    // escape the values so quotes in a URL cannot break the query
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}
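
As a side note: the mysql_* functions are old, and string-built queries are fragile. If your PHP setup allows it, a prepared statement via mysqli avoids the escaping problem entirely. A sketch with placeholder connection values:

// hypothetical connection values; replace with your own
$db = new mysqli('localhost', 'user', 'password', 'linkdb');

function storeLinkSafe(mysqli $db, $url, $gathered_from) {
    // the ? placeholders keep quotes in URLs from breaking the query
    $stmt = $db->prepare('INSERT INTO links (url, gathered_from) VALUES (?, ?)');
    $stmt->bind_param('ss', $url, $gathered_from);
    $stmt->execute() or die('Error, insert query failed');
    $stmt->close();
}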

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

for ($i = 1; $i <= 10000; $i++) {
    // access the sub-page for this id and extract the necessary data
    $target_url = "http://www.educa.ch/dyn/79376.asp?id={$i}";

    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);

    if (!$html) {
        // report the error but continue with the next id; exiting here
        // would abort the whole 10000-page run on the first bad page
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch);
        curl_close($ch);
        continue;
    }
    curl_close($ch);

    // parse the HTML into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    // use a second loop variable so the outer $i is not clobbered
    for ($j = 0; $j < $hrefs->length; $j++) {
        $href = $hrefs->item($j);
        $url  = $href->getAttribute('href');
        storeLink($url, $target_url);
        echo "<br />Link stored: $url";
    }
}
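
And to tie in the empty-page requirement: the looksEmpty() check sketched earlier could sit right after the curl_exec() call, so nothing gets parsed or stored for blank pages:

$html = curl_exec($ch);
if (!$html || looksEmpty($html)) {
    curl_close($ch);
    continue;   // nothing worth parsing or storing on this page
}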

 


 

Love to hear from you!
