
[SOLVED] Web Crawler > Preventing never ending loop


drewbee


Hello everyone,

I have designed a crawler in PHP. It works fairly well and is actually pretty quick. It has the ability to "multithread" in its own way; PHP does not offer true multi-threading, but it can be rigged to approximate it. Currently it crawls around a website with no problem, obtaining the necessary data from each page and retrieving new links.
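The post doesn't say how the "multithreading" is rigged, but one common way to get concurrent fetches in plain PHP is curl_multi, which runs several HTTP requests in parallel inside a single process. A rough sketch of that approach (not necessarily what this crawler does):

<?php
// Sketch only: parallel page fetching with curl_multi. This is one way to
// fake "multithreading" in PHP, not necessarily the author's method.
function fetch_all(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Run all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status == CURLM_OK);

    // Collect each page body and clean up the handles.
    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $pages;
}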

 

I have noticed one little problem, though, dealing with a never-ending page, for instance a calendar. A calendar that is driven by links (e.g. "click here for next month") will literally never end, because the pages are dynamically generated. If it is currently on March, the next link will be "April". On that page load the next link will be "May", and so on forever.

 

Does anyone have any ideas for logic I can place in the code to prevent this? The only way I can even think of is literally telling the spider that the page is recursive, e.g. "DO NOT INDEX /calendar.php" or "ONLY INDEX calendar.php ITSELF, DO NOT INDEX ANYTHING ELSE FROM THIS FILE".

 

If one (or many) of the spider threads happened to land on this file, it would never leave, because it would have an infinite number of links to follow.

 

Any thoughts / ideas? I wonder how Yahoo / Google etc. take care of this issue.


Well, the only thing I can come up with is doing a comparison across the pages (a duplicate content check). My algorithm will just have to notice when it is finding a lot of duplicate content from calendar.php and maybe only keep the original page when duplicates turn up.
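A minimal sketch of what that duplicate-content check could look like: hash each page body and skip any URL whose content has already been seen. The function and variable names are made up for illustration, not taken from the actual crawler.

<?php
// Sketch: skip pages whose (normalised) body we've already indexed.
function is_duplicate($html, array &$seen_hashes)
{
    // Collapse whitespace so trivial formatting differences don't defeat the check.
    $hash = md5(preg_replace('/\s+/', ' ', $html));

    if (isset($seen_hashes[$hash])) {
        return true;             // an identical page has already been indexed
    }
    $seen_hashes[$hash] = true;  // remember this page for later comparisons
    return false;
}

// Usage inside the crawl loop:
// $seen = array();
// if (is_duplicate($html, $seen)) { continue; }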

 

Anyone else have any thoughts or ideas?


Yeah I am.

 

If I do something other than a comparison of 'duplicate content', I may have to settle for only indexing URLs with up to a certain number of URL parameters. I think Google will only index up to three of them.
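For what it's worth, a parameter-count cutoff is only a few lines. A sketch, assuming the limit of three mentioned above (which is the poster's guess about Google, not a documented figure):

<?php
// Sketch: refuse to queue URLs carrying more than $max query-string parameters.
function too_many_params($url, $max = 3)
{
    $query = parse_url($url, PHP_URL_QUERY);
    if ($query === null || $query === '') {
        return false;            // no query string at all
    }
    parse_str($query, $params);  // split "a=1&b=2" into an array
    return count($params) > $max;
}

// e.g. too_many_params('calendar.php?year=2008&month=4&day=1&view=week') => true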


When you follow a link from the starting point, you're at a depth of 1. Every subsequent link you follow in that "thread" will increase the current depth. Set a limit for how deep the crawler is allowed to go.
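To make that concrete, here is a minimal sketch of a depth-limited crawl: every queued URL carries the depth it was discovered at, and links are only followed while the depth stays under a cap. fetch_page(), index_page() and extract_links() are stand-ins for the crawler's own routines.

<?php
// Sketch: breadth-first crawl with a hard depth limit.
function crawl($start_url, $max_depth = 5)
{
    $queue   = array(array($start_url, 0)); // pairs of (url, depth)
    $visited = array();

    while ($queue) {
        list($url, $depth) = array_shift($queue);

        if (isset($visited[$url]) || $depth > $max_depth) {
            continue;                        // already seen, or too deep
        }
        $visited[$url] = true;

        $html = fetch_page($url);            // crawler's own fetch routine
        index_page($url, $html);             // ...and its indexing routine

        foreach (extract_links($html, $url) as $link) {
            $queue[] = array($link, $depth + 1);
        }
    }
}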


PC Nerd, that is not the type of loop we are looking at. The loop we are talking about is a valid loop as far as the spider is concerned; it's just that the spider will get stuck on certain pages (it will still be running correctly / indexing etc.), but it will never leave the current file it is on. These loops are generated by the site my bot is crawling. A calendar is a perfect example of something that has the potential to "never end".


Lur,

 

That is a very good idea. A depth limit will be set, but it will be tied to matching a specific page.

 

So what I am thinking is that the depth counter will only be incremented when a page links to itself (the current page we are on). The spider will then limit itself to only going so deep on that page.

 

If more than one page links to it, the spider will still have the chance to index other parts of it, but it will never index any deeper than the allowed depth.
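A rough sketch of that per-page rule, assuming "same page" means "same script path" (so calendar.php?month=4 linking to calendar.php?month=5 counts as a self-link). The helper names and the depth-tracking variable are illustrative only:

<?php
// Sketch: only count depth against a limit when a page links back to itself.
function same_script($url_a, $url_b)
{
    return parse_url($url_a, PHP_URL_PATH) === parse_url($url_b, PHP_URL_PATH);
}

function should_follow($current_url, $link, $self_depth, $max_self_depth = 3)
{
    if (!same_script($current_url, $link)) {
        return true;                         // different file: normal crawling rules
    }
    return $self_depth < $max_self_depth;    // same file: obey the self-link cap
}

// The crawler would increment $self_depth each time it follows a link from a
// page back into the same script, and reset it when it moves to a new file.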

 

Thank you very much


Yes, I would agree with limiting the number of parameters on the URL.

 

But I also like PC Nerd's idea. Not sure exactly how you would go about it; it might take some thinking, but every time you move on to a new file, i.e. news.php or index.php, make a timestamp with time() or something. Then in your while loop's condition have something check that you haven't been processing the same file for more than 30 seconds or so; if you have, move to the next file and reset the timestamp. It's an interesting thought!
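A sketch of how that timeout could sit inside the crawl loop, assuming the queue and the 30-second figure above; the variable names are made up for illustration:

<?php
// Sketch: bail out of a file if the crawler has been stuck on it too long.
$current_file = null;
$started_at   = null;

foreach ($queue as $url) {
    $file = parse_url($url, PHP_URL_PATH);

    if ($file !== $current_file) {
        $current_file = $file;
        $started_at   = time();      // new file: reset the clock
    } elseif (time() - $started_at > 30) {
        continue;                    // stuck on this file for 30+ seconds: skip it
    }

    // ... fetch and index $url as usual ...
}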

 

Andy

 

EDIT: OK, sorry, wrong kind of loop.

