TheStudent2023 Posted May 12, 2023 (edited)

Security Features

Programmers, I have some challenging questions for you. Is there anything I need to program my link crawler to watch out for? When I ask below "How do I code this?", I mean "Which PHP functions should I look into?"

I do not want my web crawler to get trapped on a domain while crawling it, going around in a loop for some reason. So what should I look out for to prevent loops?

1. I know crawlers should not spider dynamic URLs, as they can lead to a never-ending loop. Apart from that, what other dangers are there?
2. I know I have to program the crawler to avoid crawling pages that are dead, so it has to look out for 404 pages. What other status codes should it look out for? I need a list of error codes to feed my crawler.
3. I do not want any hacker, crook, or fraudster calling (pinging) my crawler to make it crawl malicious pages, such as phishing pages. How do I write code for my crawler to identify phishing pages, so that it does not crawl them or index them on my search engine?
4. I do not want any hacker, crook, or fraudster calling (pinging) my crawler to make it crawl pages infected with a virus, worm, ant, spyware, etc., which would infect my crawler and let it carry the infection to other domains it crawls afterwards. How do I write code for my crawler to identify infected pages, so that it neither crawls nor indexes them on my search engine, nor carries the infection to third-party domains?

Would you like to add your own points as number 5?

Edited May 12, 2023 by TheStudent2023
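For questions 1 and 2, here is a rough sketch of one common way to guard against loops and dead pages. It is not a complete crawler. All function names (`normalizeUrl`, `shouldQueue`, `classifyStatus`) and the limits (`$maxDepth`, `$maxPerHost`) are made up for illustration. The idea is to reduce every URL to a canonical key, remember which keys have been seen, cap the crawl depth and the number of pages per host, and decide what to do with a response based on its HTTP status class.

```php
<?php
// Hypothetical helper: reduce a URL to a canonical key so the same page
// is not queued twice under slightly different spellings.
function normalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null; // relative or malformed URLs are skipped in this sketch
    }
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null; // ignore mailto:, javascript:, ftp:, etc.
    }
    $key = $scheme . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $key .= ':' . $parts['port'];
    }
    $key .= $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        // Sort query parameters so ?a=1&b=2 and ?b=2&a=1 map to the same key.
        parse_str($parts['query'], $params);
        ksort($params);
        if ($params !== []) {
            $key .= '?' . http_build_query($params, '', '&');
        }
    }
    return $key; // the #fragment is dropped on purpose
}

// Hypothetical helper: returns true (and records the URL) only if the URL
// is new, not too deep, and its host has not exceeded the page budget.
function shouldQueue(string $url, int $depth, array &$seen, array &$perHost,
                     int $maxDepth = 5, int $maxPerHost = 1000): bool
{
    $key = normalizeUrl($url);
    if ($key === null || $depth > $maxDepth || isset($seen[$key])) {
        return false;
    }
    $host = parse_url($key, PHP_URL_HOST);
    $count = $perHost[$host] ?? 0;
    if ($count >= $maxPerHost) {
        return false; // per-host budget guards against endless generated URLs
    }
    $seen[$key] = true;
    $perHost[$host] = $count + 1;
    return true;
}

// Hypothetical helper: map an HTTP status code to a crawler action.
function classifyStatus(int $status): string
{
    if ($status >= 200 && $status < 300) {
        return 'index';       // success
    }
    if ($status >= 300 && $status < 400) {
        return 'redirect';    // follow, but cap the number of hops
    }
    if ($status === 429 || $status >= 500) {
        return 'retry_later'; // rate limited or server error
    }
    return 'skip';            // other 4xx, e.g. 404 Not Found, 410 Gone
}
```

Rather than feeding the crawler a list of individual error numbers, it is usually enough to branch on the status class as above. The sketch does not handle relative links, default ports, or trailing-slash variants; those would need extra normalization.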
TheStudent2023 (Author) Posted May 12, 2023

Which one of these should I stick to?

    $xml = file_get_contents($sitemap); // Should I stick to this line or the line below?

    // Parse the sitemap content into an object
    $xml = simplexml_load_string($sitemap); // Should I stick to this line or the line above?
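A note on the two lines in this post: they are not really alternatives. `file_get_contents()` fetches the raw XML text, and `simplexml_load_string()` parses XML text into an object. It expects the XML itself, not a URL, so passing `$sitemap` (the URL) to it would fail to parse. (`simplexml_load_file()` is the variant that accepts a filename or URL.) Below is a minimal sketch of the two steps together. The function name `sitemapUrls` and the sample XML are assumptions for illustration, and the fetch is left as a comment so the example runs without network access.

```php
<?php
// Hypothetical helper: extract the <loc> URLs from sitemap XML text.
function sitemapUrls(string $xmlText): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xmlText);
    if ($doc === false) {
        return []; // not well-formed XML
    }
    $urls = [];
    foreach ($doc->url as $entry) {
        $urls[] = trim((string) $entry->loc);
    }
    return $urls;
}

// In the crawler, the text would come from the fetch step, for example:
//   $xmlText = file_get_contents($sitemap); // returns false on failure
// Here an inline sample stands in for the downloaded sitemap.
$sampleXml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
XML;

foreach (sitemapUrls($sampleXml) as $loc) {
    echo $loc, PHP_EOL;
}
```

This sketch only reads `<url>` entries. A sitemap index file lists `<sitemap>` entries instead and would need a second pass.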