drewbee Posted July 17, 2007
Hello everyone, I have designed a crawler in PHP. It works fairly well and is actually pretty quick. It has the ability to "multithread" in its own way; PHP does not offer true multi-threading, but it can be rigged to approximate it. Currently it crawls around a website with no problem, obtaining the necessary data from each page and retrieving new links. I have noticed one little problem, though, with never-ending pages, for instance a calendar. A calendar that is driven by links (e.g. "click here for next month") will literally never end, because the pages are generated dynamically: if it is currently on March, the next link will be "April"; on that page load the next link will be "May", and so on. Does anyone have any ideas for logic I could add to the code to prevent this? The only thing I can think of is literally telling the spider that the page is recursive, e.g. "DO NOT INDEX /calendar.php" or "ONLY INDEX calendar.php ITSELF, DO NOT INDEX ANY OTHER URL OF THIS FILE". If one (or many) of the spider threads fell onto this file, it would never leave, because it would have an infinite number of links to follow. Any thoughts / ideas? I wonder how Yahoo / Google etc. take care of this issue.
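A minimal sketch of what that manual exclusion rule could look like in PHP. The rule list, pattern format, and function name are invented for illustration; the real crawler's queue handling is not shown here.

<?php
// Hand-maintained exclusion list, checked before a URL is queued.
// $denyPatterns and shouldCrawl() are illustrative names only.
$denyPatterns = array(
    '#/calendar\.php#i',   // skip the calendar script entirely
);

function shouldCrawl($url, array $denyPatterns) {
    foreach ($denyPatterns as $pattern) {
        if (preg_match($pattern, $url)) {
            return false;  // matches a deny rule, so do not queue it
        }
    }
    return true;
}

// Example: only enqueue links that pass the check.
// if (shouldCrawl($link, $denyPatterns)) { $queue[] = $link; }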
drewbee Posted July 18, 2007
Well, the only thing I can come up with is doing a comparison across the pages (a duplicate-content check). My algorithm will just have to notice when it is finding a lot of duplicate content from calendar.php and perhaps only keep the original page if duplicates are found. Anyone else have any thoughts or ideas?
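A rough sketch of that duplicate-content check, assuming the crawler can hand each fetched body to a helper before indexing it. An exact md5 comparison only catches identical pages; a calendar whose months differ slightly would need a fuzzier similarity measure.

<?php
// Exact-duplicate check: hash each page body and skip bodies already seen.
// $seenHashes is an illustrative name; a real crawler would probably keep
// these hashes in its database rather than in memory.
function isDuplicate($html, array &$seenHashes) {
    $hash = md5($html);
    if (isset($seenHashes[$hash])) {
        return true;            // exact copy of something already indexed
    }
    $seenHashes[$hash] = true;  // remember it for later comparisons
    return false;
}

$seenHashes = array();
// if (isDuplicate($html, $seenHashes)) { /* skip indexing this page */ }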
AbydosGater Posted July 18, 2007
Are you indexing dynamic links, e.g. ?action=goToNextData? If so, that could be a problem. Most spiders don't index these, for the very reason of getting caught in an endless loop. Andy
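If you did want to refuse query-string URLs outright, the bluntest possible check is something like this (just a sketch; it throws away every dynamic link, not only the looping ones):

<?php
// Skip any URL that carries a query string at all.
function hasQueryString($url) {
    $parts = parse_url($url);
    return isset($parts['query']) && $parts['query'] !== '';
}

// if (hasQueryString($link)) { continue; } // do not queue dynamic links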
PC Nerd Posted July 18, 2007
I'm not too sure, but maybe run a timeout? Have two conditions in the loop, something like: while (not all pages have been catalogued && time running on this page < 30 seconds). That might work.
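As a sketch, a time-budgeted crawl loop could look like the following. The $queue variable and the 30-second figure are placeholders, and the fetching/indexing step is only indicated by a comment.

<?php
// Time-budgeted crawl loop: stop working through the queue after 30 seconds
// even if links remain, so one pathological file cannot hold a thread forever.
$queue   = array('http://example.com/'); // placeholder starting point
$start   = time();
$timeout = 30; // seconds

while (!empty($queue) && (time() - $start) < $timeout) {
    $url = array_shift($queue);
    // fetch $url, index its content, and push any newly found links onto $queue
}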
drewbee Posted July 18, 2007
Yeah, I am. If I do something other than a duplicate-content comparison, I may have to limit indexing to a certain number of URL parameters. I think Google will only index up to three of them.
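Counting the parameters is straightforward with parse_url() and parse_str(). A sketch of a cap on parameter count follows; the limit of 3 is just the figure mentioned above, not a verified rule about what Google does.

<?php
// Reject URLs that carry more than $limit query-string parameters.
function tooManyParams($url, $limit = 3) {
    $query = parse_url($url, PHP_URL_QUERY);
    if (!$query) {
        return false;               // no query string at all
    }
    parse_str($query, $params);     // split "a=1&b=2" into an array
    return count($params) > $limit;
}

// if (tooManyParams($link)) { continue; } // too many parameters, skip it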
lur Posted July 18, 2007
When you follow a link from the starting point, you're at a depth of 1. Every subsequent link you follow in that "thread" increases the current depth. Set a limit on how deep the crawler is allowed to go.
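A depth-limited queue is easy to sketch: store each URL together with the depth at which it was found and stop following links past a maximum. The regex link extraction below is deliberately bare-bones (no relative-URL resolution, no deduplication), and $maxDepth is an arbitrary example value.

<?php
// Depth-limited crawl: each queue entry carries the depth at which the URL
// was discovered, and links found at or beyond $maxDepth are not queued.
$maxDepth = 5;
$queue    = array(array('http://example.com/', 1)); // start at depth 1

while (!empty($queue)) {
    list($url, $depth) = array_shift($queue);
    $html = @file_get_contents($url);   // or whatever fetcher the crawler uses
    if ($html === false) {
        continue;                       // fetch failed, move on
    }
    // ... index $html here ...
    if ($depth >= $maxDepth) {
        continue;                       // index the page, but follow no further links
    }
    preg_match_all('#href=["\']([^"\']+)["\']#i', $html, $matches);
    foreach ($matches[1] as $link) {
        $queue[] = array($link, $depth + 1);
    }
}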
drewbee Posted July 18, 2007
PC Nerd, that is not the type of loop we are looking at. The loop we are talking about is a valid one as far as the spider is concerned; the spider just gets stuck on certain pages. It will still be running correctly and indexing, but it will never leave the current file it is on. These loops are generated by the site my bot is crawling; a calendar is a perfect example of something that has the potential to never end.
drewbee Posted July 18, 2007
lur, that is a very good idea. The depth limit will be set, but on specific page matching. What I am thinking is that the depth will only be incremented when a page links to itself (the current page we are on), and the spider will then only go so deep within that page. If more than one page links to it, other parts of it can still be indexed, but never any deeper than the allowed depth. Thank you very much.
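A rough sketch of that per-page variant: only bump the depth counter when a link points back at the same script, so ordinary cross-page links reset to depth 1. The path comparison below is deliberately simplistic and the threshold name is invented.

<?php
// Per-page depth: increment only when a link targets the same script
// (e.g. calendar.php linking to calendar.php?month=next), so a calendar can
// be sampled a few levels deep without trapping a thread forever.
function nextDepth($currentUrl, $linkUrl, $currentDepth) {
    $currentPath = parse_url($currentUrl, PHP_URL_PATH);
    $linkPath    = parse_url($linkUrl, PHP_URL_PATH);
    return ($currentPath === $linkPath) ? $currentDepth + 1 : 1;
}

// if (nextDepth($url, $link, $depth) <= $maxSelfDepth) { /* queue it */ }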
AbydosGater Posted July 18, 2007
Yes, I would agree with limiting the parameters in the URL, but I also like PC Nerd's idea. I'm not sure exactly how you would go about it; it might take some thinking. Every time you move on to a new file, i.e. news.php or index.php, make a timestamp with time() or something, then in your while loop's condition check that you haven't been processing the same file for more than 30 seconds. If you have, move to the next file and reset the timestamp. It's an interesting thought! Andy EDIT: Ok, sorry, wrong kind of loop.