webtuto Posted July 11, 2009

Hey, I made a crawler and ran into a problem: my simple crawler only gets the links from the front page. I want it to follow those links and collect the links from the other pages too, not just the index page. Here is the code I'm using:

$site = "http://www.zik4.com/";
$html = file_get_contents($site);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->evaluate("/html/body//a");
foreach($links as $link) {
    echo '<br />' . $link->getAttribute('href');
}
webtuto (Author) Posted July 11, 2009

TOP
.josh Posted July 11, 2009

In your foreach loop, instead of echoing the link out, you have to put it into an array. I suggest validating the link before putting it into the array, though. For instance, you can check whether the link is "#", or whether it isn't a real link at all, like when someone calls a JavaScript function in the href. You'd also want to decide what to do with relative links: disregard them, or try to convert them into absolute URLs? You'll also want to check for duplicate URLs and external URLs.

From there, you basically need to run the same DOM code on every link in your array. One way to do that is to wrap it in a function and use some recursion.

Note that crawlers can very quickly hit the timeout and max out memory limits, so you'll need to set up your script to handle that. You can set a limit on links: tell the script to stop running after X links. Or set the timeout limit to something higher (or remove it) if you have the access. You can also store the links in a db or flat file instead of an array. It will make your script take longer to run, but that's the trade-off for not maxing out memory.
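For what it's worth, the steps above could be sketched roughly like this. It is only a sketch under some assumptions: the function names (normalizeLink, extractLinks, crawl) and the $maxPages and $fetch parameters are made up for illustration, it uses a simple queue loop rather than the recursion suggested above (that makes the "stop after X pages" limit easier to enforce), it keeps everything in an array rather than a db or flat file, and the relative-link handling is simplified (it does not resolve "../" or "./" segments).

```php
<?php
// Hypothetical helper: turn an href into an absolute URL on the same host,
// or return null if the link should be skipped.
function normalizeLink(string $href, string $baseUrl): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty link or bare fragment such as "#"
    }
    $base = parse_url($baseUrl);
    if (!isset($base['scheme'], $base['host'])) {
        return null; // cannot resolve anything against a malformed base URL
    }
    $root = $base['scheme'] . '://' . $base['host'];

    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        // The href carries its own scheme: keep http(s), skip javascript:, mailto:, etc.
        if (!preg_match('#^https?://#i', $href)) {
            return null;
        }
        $url = $href;
    } elseif (substr($href, 0, 2) === '//') {
        $url = $base['scheme'] . ':' . $href;   // protocol-relative link
    } elseif ($href[0] === '/') {
        $url = $root . $href;                   // root-relative link
    } else {
        // Relative to the directory of the current page (simplified).
        $path = $base['path'] ?? '/';
        $dir  = substr($path, 0, strrpos($path, '/') + 1);
        $url  = $root . $dir . $href;
    }

    $url  = preg_replace('/#.*$/', '', $url);   // drop any fragment
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || strcasecmp($host, $base['host']) !== 0) {
        return null; // external (or unparsable) URL
    }
    return $url;
}

// Collect the unique, same-host, absolute links found in one page of HTML.
function extractLinks(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $found = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $a) {
        $url = normalizeLink($a->getAttribute('href'), $baseUrl);
        if ($url !== null) {
            $found[$url] = true; // array keys take care of duplicates
        }
    }
    return array_keys($found);
}

// Visit at most $maxPages pages, starting from $startUrl.
// $fetch is any callable that returns the HTML for a URL, or false on failure.
function crawl(string $startUrl, int $maxPages, callable $fetch): array
{
    $queue   = [$startUrl];
    $seen    = [$startUrl => true];
    $visited = 0;
    while ($queue && $visited < $maxPages) {
        $url = array_shift($queue);
        $visited++;
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // fetch failed, move on to the next URL
        }
        foreach (extractLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return array_keys($seen);
}

// Possible usage against a live site (not run here):
// $links = crawl('http://www.zik4.com/', 50, function ($u) { return @file_get_contents($u); });
```

Passing the fetcher in as a callable is just a convenience so the crawl logic can be tried against canned HTML before pointing it at a real site. If memory becomes a problem, the $seen array is the part you would swap for a db table or flat file.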