
about my crawl, help plz


webtuto


hey,

I made a crawler and I have a problem.

My simple crawler only gets the links from the front page.

I want it to go through those links and get the links from the other pages too, not just the index page. Here is the code I'm using:

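$site = "http://www.zik4.com/";
$html = file_get_contents($site);

// Real-world HTML is rarely valid; keep libxml warnings out of the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select every <a> element inside <body>.
$xpath = new DOMXPath($dom);
$links = $xpath->evaluate("/html/body//a");

foreach ($links as $link) {
    echo '<br />' . $link->getAttribute('href');
}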


In your foreach loop, instead of echoing the link, put it into an array. I suggest validating the link before adding it, though. For instance, you can check whether the href is just "#", or whether it isn't a real link at all, like when someone calls a JavaScript function in the href. You'd also want to decide what to do with relative links: disregard them, or try to convert them into absolute URLs? You'll also want to check for duplicate URLs and external URLs.
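
Here is a rough sketch of those checks, not a complete URL resolver. The function names normalize_link() and collect_links() are made up for this example. It assumes you only want http/https links on the same host as the page you fetched, and it does not resolve "../" segments in relative paths.

```php
<?php
// Hypothetical helper: turn a raw href into an absolute same-site URL,
// or return null if the link should be skipped.
function normalize_link(string $href, string $base): ?string
{
    $href = trim($href);

    // Skip empty hrefs and in-page anchors such as "#" or "#top".
    if ($href === '' || $href[0] === '#') {
        return null;
    }

    // Skip anything with a scheme other than http/https (javascript:, mailto:, ...).
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)
        && !in_array(strtolower($m[1]), ['http', 'https'], true)) {
        return null;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (preg_match('#^https?://#i', $href)) {
        $url = $href;                            // already absolute
    } elseif (strpos($href, '//') === 0) {
        $url = $parts['scheme'] . ':' . $href;   // protocol-relative
    } elseif ($href[0] === '/') {
        $url = $root . $href;                    // relative to the site root
    } else {
        // Relative to the directory of the base page ("../" is NOT resolved).
        $path = $parts['path'] ?? '/';
        $dir  = substr($path, 0, strrpos($path, '/') + 1);
        $url  = $root . $dir . $href;
    }

    // Drop the fragment so "page.html#a" and "page.html#b" count as one URL.
    $url = preg_replace('/#.*$/', '', $url);

    // Reject external links: the host must match the base page's host.
    $host = (string) parse_url($url, PHP_URL_HOST);
    if (strcasecmp($host, $parts['host']) !== 0) {
        return null;
    }

    return $url;
}

// Hypothetical helper: validate a list of hrefs and remove duplicates.
function collect_links(array $hrefs, string $base): array
{
    $seen = [];
    foreach ($hrefs as $href) {
        $url = normalize_link($href, $base);
        if ($url !== null) {
            $seen[$url] = true;   // array keys cannot repeat, so duplicates collapse
        }
    }
    return array_keys($seen);
}
```

In the original loop you would collect each $link->getAttribute('href') into an array and pass that array, together with $site, to collect_links().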

 

From there, you basically need to run the same DOM code on every link in your array. One way to do that is to wrap it in a function and use some recursion. Note that crawlers can very quickly hit the script timeout and the memory limit, so you'll need to set up your script to handle that. You can set a limit on links: tell the script to stop running after X links. Or set the timeout limit to something higher (or remove it) if you have the access. You can also store the links in a db or flat file instead of an array. It will make your script take longer to run, but that's the trade-off for not maxing out memory.
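
As a minimal sketch of the recursive approach with a page limit: the function name crawl() and its parameters are invented for this example. The fetcher is passed in as a callable so you can swap file_get_contents for something else. To keep it short, it only follows absolute links that start with the site prefix, so relative links are ignored here.

```php
<?php
// Hypothetical recursive crawler.
//   $url      page to visit
//   $site     prefix a link must start with to be followed
//   $fetch    callable: takes a URL, returns the HTML string or false
//   $visited  filled with visited URLs as array keys (passed by reference)
//   $maxPages stop once this many pages have been visited
function crawl(string $url, string $site, callable $fetch, array &$visited, int $maxPages = 50): void
{
    // Stop on duplicates and once the page limit is reached.
    if (isset($visited[$url]) || count($visited) >= $maxPages) {
        return;
    }
    $visited[$url] = true;

    $html = $fetch($url);
    if (!is_string($html) || $html === '') {
        return;   // fetch failed or the page is empty
    }

    libxml_use_internal_errors(true);   // keep warnings about invalid HTML quiet
    $dom = new DOMDocument();
    if (!$dom->loadHTML($html)) {
        return;
    }

    // Same DOM code as for the front page, run on the current page.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $site) === 0) {
            crawl($href, $site, $fetch, $visited, $maxPages);
        }
    }
}

// Possible usage against the live site (needs network access):
//   set_time_limit(0);   // remove the time limit, if your host allows it
//   $visited = [];
//   crawl('http://www.zik4.com/', 'http://www.zik4.com/', 'file_get_contents', $visited, 50);
//   print_r(array_keys($visited));
```

If memory becomes a problem, the $visited array could be replaced by a db table or a flat file, at the cost of a slower run.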

