Hey,

I made a crawler and ran into a problem.

My simple crawler only gets the links from the front page.

I want it to follow those links and collect links from the other pages too, not just the index page. Here is the code I'm using:

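$site = "http://www.zik4.com/";
$html = file_get_contents($site);
if ($html === false) {
    die("Could not fetch $site");
}

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// Select every anchor that actually has an href attribute.
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@href]");

foreach ($links as $link) {
    echo '<br />' . htmlspecialchars($link->getAttribute('href'));
}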


In your foreach loop, instead of echoing the link, put it into an array. I'd suggest validating the link before adding it, though. For instance, skip it if the href is just "#", or if it isn't a real link at all, such as a javascript: call in the href. You'll also need to decide what to do with relative links: disregard them, or convert them into absolute URLs? And you'll want to check for duplicate URLs and external URLs.
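To make those checks concrete, here is a rough sketch of a filter you could call on each href before storing it. The function name normalizeLink() is made up for illustration, it assumes PHP 7.1 or newer, and the relative-link handling is deliberately simplified (it does not resolve "../" segments or honor a <base> tag), so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper: returns an absolute URL on the same host as $base,
// or null if the link should be skipped.
function normalizeLink(string $href, string $base): ?string
{
    $href = trim($href);

    // Skip empty links and pure in-page anchors such as "#" or "#top".
    if ($href === '' || $href[0] === '#') {
        return null;
    }

    $baseParts = parse_url($base);
    $root = $baseParts['scheme'] . '://' . $baseParts['host'];

    if (preg_match('#^([a-z][a-z0-9+.\-]*):#i', $href, $m)) {
        // The href has a scheme: keep http/https, drop javascript:, mailto:, etc.
        if (!in_array(strtolower($m[1]), ['http', 'https'], true)) {
            return null;
        }
        $url = $href;
    } elseif (strpos($href, '//') === 0) {
        // Protocol-relative link: reuse the scheme of the base URL.
        $url = $baseParts['scheme'] . ':' . $href;
    } elseif ($href[0] === '/') {
        // Root-relative link.
        $url = $root . $href;
    } else {
        // Relative to the directory of the base URL (simplified: "../" is not resolved).
        $path = $baseParts['path'] ?? '/';
        $dir  = substr($path, 0, strrpos($path, '/') + 1);
        $url  = $root . $dir . $href;
    }

    // Drop the fragment so page.html#a and page.html#b count as one URL.
    $url = preg_replace('/#.*$/', '', $url);

    // Skip external URLs.
    $host = (string) parse_url($url, PHP_URL_HOST);
    if (strcasecmp($host, $baseParts['host']) !== 0) {
        return null;
    }

    return $url;
}
```

For the duplicate check, one simple option is to use the URL as the array key, e.g. $seen[$url] = true; so that adding the same URL twice has no effect and isset($seen[$url]) tells you whether you already have it.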

 

From there, you basically need to run the same DOM code on every link in your array. One way to do that is to wrap it in a function and use recursion. Note that crawlers can very quickly hit the script timeout and the memory limit, so you'll need to set up your script to handle that. You can cap the number of links and tell the script to stop after X links, or raise the time limit (or remove it) if you have the access. You can also store the links in a database or flat file instead of an array. It will make the script take longer to run, but it keeps memory use from maxing out.
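A minimal sketch of that recursive approach, with a page cap, might look like the following. The names extractHrefs() and crawl() and the $maxPages parameter are made up for illustration. To keep it short, this version only follows absolute links on the same host and skips relative ones, and it keeps the visited list in an array. The fetcher is passed in as a callable so the demo at the bottom can run against an in-memory fake site (the site.test pages are invented) instead of the network.

```php
<?php
// Hypothetical helper: extract the raw href values from an HTML string.
function extractHrefs(string $html): array
{
    if ($html === '') {
        return [];                      // loadHTML() rejects empty input
    }
    libxml_use_internal_errors(true);   // keep warnings from malformed markup quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ((new DOMXPath($dom))->query('//a[@href]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Recursive crawl. $fetch is any callable that returns the page body or false.
// $seen is passed by reference so every level of recursion shares one visited list.
function crawl(string $url, callable $fetch, array &$seen, int $maxPages): void
{
    // Stop on pages we already visited, or once the cap is reached.
    if (isset($seen[$url]) || count($seen) >= $maxPages) {
        return;
    }
    $seen[$url] = true;

    $html = $fetch($url);
    if ($html === false) {
        return;                         // fetch failed, nothing to parse
    }

    foreach (extractHrefs($html) as $href) {
        // Simplified: follow only absolute links that stay on the same host.
        if (parse_url($href, PHP_URL_HOST) === parse_url($url, PHP_URL_HOST)) {
            crawl($href, $fetch, $seen, $maxPages);
        }
    }
}

// Demo against an in-memory fake site, so nothing is downloaded.
$pages = [
    'http://site.test/'  => '<a href="http://site.test/a">a</a><a href="http://other.test/x">x</a>',
    'http://site.test/a' => '<a href="http://site.test/">home</a><a href="http://site.test/b">b</a>',
    'http://site.test/b' => '',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};

$seen = [];
crawl('http://site.test/', $fetch, $seen, 10);
print_r(array_keys($seen));
```

For a real run you could pass 'file_get_contents' as the fetcher and your own start URL, and call set_time_limit(0) first if your host allows it. On a large site, deep recursion and a growing $seen array are exactly where the memory problems show up, which is the point at which moving the visited list into a database or flat file starts to pay off.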
