Security Features Programmers,

I have some challenging questions for you.

Is there anything I should program my link crawler to watch out for?

When I ask below "How do I code this?", I mean "Which PHP functions do you want me to look into?".


I do not want my web crawler to get trapped on a domain while crawling it, going round in a loop for some reason. So what should I look out for to prevent loops?

1.

I know crawlers should not spider dynamic URLs, as they can end up in a never-ending loop. Apart from that, what other dangers are there?

 

2.

I know I have to program the crawler to avoid crawling dead pages, so it has to look out for 404 responses. Which other HTTP status codes should it look out for? I need a list of error codes to feed my crawler.

 

3.

I do not want any hacker, crook or fraudster calling (pinging) my crawler to make it crawl malicious pages such as phishing pages. So how do I write code for my crawler to identify phishing pages, so that it does not crawl them or index them on my search engine?

 

4.

I do not want any hacker, crook or fraudster calling (pinging) my crawler to make it crawl pages infected with viruses, worms, spyware, etc., pages that would infect my crawler so that it carries the infection to other domains it crawls afterwards. So how do I write code for my crawler to identify infected pages, so that it neither crawls or indexes them on my search engine nor carries the infections to third-party domains?

 

Would you like to add your own points as number 5?
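On the loop question, one common approach is to normalize every URL, remember which ones have already been queued, and cap both the link depth and the number of pages taken from a single host. The following is only a minimal sketch, assuming PHP 7.4 or later. The names `normalize_url` and `CrawlGuard` and the default limits are made up for illustration and do not come from any existing crawler.

```php
<?php
// Hypothetical helpers for illustration only.

/** Reduce an absolute http(s) URL to one canonical form, or return null. */
function normalize_url(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // relative or malformed URL: skipped in this sketch
    }
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null; // ignore mailto:, javascript:, ftp:, ...
    }
    $host = strtolower($parts['host']);
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path = $parts['path'] ?? '/';
    $query = '';
    if (isset($parts['query'])) {
        // Sort parameters so ?b=2&a=1 and ?a=1&b=2 count as the same page.
        // Note: parse_str() rewrites some key characters (e.g. dots).
        parse_str($parts['query'], $params);
        if ($params !== []) {
            ksort($params);
            $query = '?' . http_build_query($params);
        }
    }
    return $scheme . '://' . $host . $port . $path . $query; // fragment dropped
}

/** Decides whether a URL may be queued, so the crawler cannot loop forever. */
final class CrawlGuard
{
    private array $visited = [];
    private array $perHost = [];
    private int $maxDepth;
    private int $maxPagesPerHost;

    public function __construct(int $maxDepth = 5, int $maxPagesPerHost = 1000)
    {
        $this->maxDepth = $maxDepth;
        $this->maxPagesPerHost = $maxPagesPerHost;
    }

    public function shouldCrawl(string $url, int $depth): bool
    {
        $normalized = normalize_url($url);
        if ($normalized === null || $depth > $this->maxDepth) {
            return false;
        }
        if (isset($this->visited[$normalized])) {
            return false; // already queued once
        }
        $host = parse_url($normalized, PHP_URL_HOST);
        $count = $this->perHost[$host] ?? 0;
        if ($count >= $this->maxPagesPerHost) {
            return false; // host budget used up (guards against endless URL spaces)
        }
        $this->visited[$normalized] = true;
        $this->perHost[$host] = $count + 1;
        return true;
    }
}
```

The per-host budget is what protects against "infinite" sites such as calendars or session IDs in the query string, where every URL is technically new. The depth cap and the visited set handle ordinary link cycles.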


 

Edited by TheStudent2023
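On question 2, rather than feeding the crawler a long list of individual codes, it may be easier to group HTTP status codes into a handful of actions. This is a rough sketch under that assumption. The function name `classify_status` and the action labels are invented here, and the grouping is one reasonable policy rather than a standard.

```php
<?php
/**
 * Map an HTTP status code to a crawler action (hypothetical labels):
 *   index     - page fetched, safe to parse
 *   redirect  - follow the Location header (with a hop limit)
 *   unchanged - 304 Not Modified, keep the stored copy
 *   retry     - temporary problem, try again later
 *   drop      - dead or forbidden, remove from the queue
 */
function classify_status(int $code): string
{
    if ($code === 200) {
        return 'index';
    }
    if (in_array($code, [301, 302, 303, 307, 308], true)) {
        return 'redirect';
    }
    if ($code === 304) {
        return 'unchanged';
    }
    if ($code === 429 || ($code >= 500 && $code < 600)) {
        return 'retry'; // rate limited or server error (500, 502, 503, 504, ...)
    }
    return 'drop'; // 400, 401, 403, 404, 410 and anything unexpected
}
```

With cURL, the code can be read after the request with `curl_getinfo($ch, CURLINFO_RESPONSE_CODE)`. Redirects deserve a hop limit of their own, since a chain of redirects is another way for a crawler to loop.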

Which one of these two lines should I stick with?

 

1

	$xml = file_get_contents($sitemap); // Should I stick with this line or the one below?
	

 

2

	// Parse the sitemap content into an object
	$xml = simplexml_load_string($sitemap); // Should I stick with this line or the one above?
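The two lines do different jobs, so it is probably not an either/or choice: `file_get_contents()` downloads the raw text, and `simplexml_load_string()` parses XML text that has already been downloaded. Passing a URL to `simplexml_load_string()` would make it try to parse the URL itself as XML. Here is a small sketch of the parsing half, assuming `$sitemap` holds the sitemap's URL and the sitemap is a plain `<urlset>` file (not a sitemap index). The function name `sitemap_urls` is made up for this example.

```php
<?php
/** Extract the <loc> values from sitemap XML that has already been downloaded. */
function sitemap_urls(string $xmlContent): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlContent);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return []; // not well-formed XML
    }

    $urls = [];
    foreach ($xml->url as $entry) {
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $urls[] = $loc;
        }
    }
    return $urls;
}

// Usage, step 1 then step 2 (needs network access, so left commented out):
// $content = file_get_contents($sitemap);
// $urls = ($content === false) ? [] : sitemap_urls($content);
```

`simplexml_load_file($sitemap)` does both steps in one call, but keeping them separate makes it possible to check the download result and the response size before parsing.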
	
