
[SOLVED] Web Crawler > Preventing never ending loop


drewbee


Hello everyone,

I have designed a crawler in PHP. It works fairly well and is actually pretty quick. It has the ability to "multithread" in its own way; PHP does not offer true multi-threading, but it can be rigged to approximate it. Currently it crawls around a website with no problem, obtaining the necessary data from each page and retrieving new links.
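The post doesn't say how the "multithreading" is rigged, but one common way to get concurrent fetches in plain PHP is curl_multi, which runs several HTTP requests in parallel inside a single process. A rough sketch of that approach (not necessarily what this crawler does):

<?php
// Sketch only: parallel page fetching with curl_multi. This is one way to
// fake "multithreading" in PHP, not necessarily the author's method.
function fetch_all(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Run all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status == CURLM_OK);

    // Collect each page body and clean up the handles.
    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $pages;
}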

 

I have noticed one little problem, though, dealing with a never-ending page, for instance a calendar. A calendar that is driven by links (e.g. "click here for next month") will literally never end, because the pages are dynamically generated. If it is currently on March, the next link will be "April". On that page load the next link will be "May", and so on forever.

 

Does anyone have any ideas for logic I can place in the code to prevent this? The only way I can even think of is literally telling the spider that the page is recursive, e.g. "DO NOT INDEX /calendar.php" or "ONLY INDEX calendar.php ITSELF, DO NOT INDEX ANYTHING ELSE FROM THIS FILE".

 

If one (or many) of the spider threads happened to land on this file, it would never leave, because it would have an infinite number of links to follow.

 

Any thoughts / ideas? I wonder how Yahoo / Google etc. take care of this issue.


Well, the only thing I can come up with is doing a comparison across the pages (a duplicate content check). My algorithm will just have to notice when it is finding a lot of duplicate content from calendar.php and maybe only keep the original page when duplicates turn up.
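A minimal sketch of what that duplicate-content check could look like: hash each page body and skip any URL whose content has already been seen. The function and variable names are made up for illustration, not taken from the actual crawler.

<?php
// Sketch: skip pages whose (normalised) body we've already indexed.
function is_duplicate($html, array &$seen_hashes)
{
    // Collapse whitespace so trivial formatting differences don't defeat the check.
    $hash = md5(preg_replace('/\s+/', ' ', $html));

    if (isset($seen_hashes[$hash])) {
        return true;             // an identical page has already been indexed
    }
    $seen_hashes[$hash] = true;  // remember this page for later comparisons
    return false;
}

// Usage inside the crawl loop:
// $seen = array();
// if (is_duplicate($html, $seen)) { continue; }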

 

Anyone else have any thoughts or ideas?


Yeah I am.

 

If I do something other than a comparison of 'duplicate content', I may have to settle for only indexing URLs with up to a certain number of URL parameters. I think Google will only index up to three of them.
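For what it's worth, a parameter-count cutoff is only a few lines. A sketch, assuming the limit of three mentioned above (which is the poster's guess about Google, not a documented figure):

<?php
// Sketch: refuse to queue URLs carrying more than $max query-string parameters.
function too_many_params($url, $max = 3)
{
    $query = parse_url($url, PHP_URL_QUERY);
    if ($query === null || $query === '') {
        return false;            // no query string at all
    }
    parse_str($query, $params);  // split "a=1&b=2" into an array
    return count($params) > $max;
}

// e.g. too_many_params('calendar.php?year=2008&month=4&day=1&view=week') => true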


When you follow a link from the starting point, you're at a depth of 1. Every subsequent link you follow in that "thread" will increase the current depth. Set a limit for how deep the crawler is allowed to go.
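To make that concrete, here is a minimal sketch of a depth-limited crawl: every queued URL carries the depth it was discovered at, and links are only followed while the depth stays under a cap. fetch_page(), index_page() and extract_links() are stand-ins for the crawler's own routines.

<?php
// Sketch: breadth-first crawl with a hard depth limit.
function crawl($start_url, $max_depth = 5)
{
    $queue   = array(array($start_url, 0)); // pairs of (url, depth)
    $visited = array();

    while ($queue) {
        list($url, $depth) = array_shift($queue);

        if (isset($visited[$url]) || $depth > $max_depth) {
            continue;                        // already seen, or too deep
        }
        $visited[$url] = true;

        $html = fetch_page($url);            // crawler's own fetch routine
        index_page($url, $html);             // ...and its indexing routine

        foreach (extract_links($html, $url) as $link) {
            $queue[] = array($link, $depth + 1);
        }
    }
}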


PC Nerd, that is not the type of loop we are looking at. The loop we are talking about is a valid loop as far as the spider is concerned; it's just that the spider will get stuck on certain pages (it will still be running correctly / indexing etc.), but it will never leave the current file it is on. These loops are generated by the site my bot is crawling. A calendar is a perfect example of something that has the potential to "never end".


Lur,

 

That is a very good idea. A depth limit will be set, but it will be tied to matching a specific page.

 

So what I am thinking is that the depth counter will only be incremented when a page links to itself (the current page we are on). The spider will then limit itself to only going so deep on that page.

 

If more than one page links to it, the spider will still have the chance to index other parts of it, but it will never index any deeper than the allowed depth.
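A rough sketch of that per-page rule, assuming "same page" means "same script path" (so calendar.php?month=4 linking to calendar.php?month=5 counts as a self-link). The helper names and the depth-tracking variable are illustrative only:

<?php
// Sketch: only count depth against a limit when a page links back to itself.
function same_script($url_a, $url_b)
{
    return parse_url($url_a, PHP_URL_PATH) === parse_url($url_b, PHP_URL_PATH);
}

function should_follow($current_url, $link, $self_depth, $max_self_depth = 3)
{
    if (!same_script($current_url, $link)) {
        return true;                         // different file: normal crawling rules
    }
    return $self_depth < $max_self_depth;    // same file: obey the self-link cap
}

// The crawler would increment $self_depth each time it follows a link from a
// page back into the same script, and reset it when it moves to a new file.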

 

Thank you very much


Yes, I would agree with limiting the number of parameters on the URL.

 

But I also like PC Nerd's idea. Not sure exactly how you would go about it; it might take some thinking, but every time you move on to a new file, i.e. news.php or index.php, make a timestamp with time() or something. Then in your while loop's condition have something check that you haven't been processing the same file for more than 30 seconds or so; if you have, move to the next file and reset the timestamp. It's an interesting thought!
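A sketch of how that timeout could sit inside the crawl loop, assuming the queue and the 30-second figure above; the variable names are made up for illustration:

<?php
// Sketch: bail out of a file if the crawler has been stuck on it too long.
$current_file = null;
$started_at   = null;

foreach ($queue as $url) {
    $file = parse_url($url, PHP_URL_PATH);

    if ($file !== $current_file) {
        $current_file = $file;
        $started_at   = time();      // new file: reset the clock
    } elseif (time() - $started_at > 30) {
        continue;                    // stuck on this file for 30+ seconds: skip it
    }

    // ... fetch and index $url as usual ...
}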

 

Andy

 

EDIT: OK, sorry, wrong kind of loop.

