Crawler help

mbk · November 19, 2009

I am trying to create a crawler and have worked out how to extract the data I need from the site using regex; however, I haven't yet worked out how to access the site! I know, wrong way round, but still...

The pages I need to crawl all have the following form:

 http://www.blah.co.uk/applications/blah/details.asp?prdno=xyz

The issue is that I don't know how many different prdnos there are, so I am looking for a way to crawl through them starting from 0: if a page exists, access its data; if not, move on to the next.

Can someone help me please?

sKunKbad · November 19, 2009

If you need an example of a crawler, you might check out phpSitemapNG.

mbk · November 19, 2009

sKunKbad - thanks for the quick reply.

I tried using the crawler you suggested to see if it worked on the site in question so I could then look at the source; however, it didn't work. It merely returned the root URL.

Any other suggestions, please?

JonnoTheDev · November 19, 2009

If you want to write your own spider then take a look at cURL.

Read thoroughly and look through examples.

http://uk2.php.net/curl

mbk · November 19, 2009

Thanks Neil,

I just want to clarify: by using cURL, will I in effect be able to search for the prdno values from my original post and act on them if found?

Sorry if this is a simple question; I just want to make sure I am reading the correct info.

At present the directories beyond the domain are all virtual listings and thus inaccessible, so I need to find a way of generating the prdno and trying to access the page.

Is cURL the way to go?

Thanks

JonnoTheDev · November 19, 2009

cURL is used for making HTTP requests just as your web browser does. The response from a request will contain the source of the target URL. From this you can extract the data you require. Be careful when scraping websites!
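To make the idea of generating each prdno and requesting the page concrete, here is a minimal sketch. It is written in Python with the standard library's urllib rather than PHP cURL, purely for illustration; the same request-and-check loop carries over to cURL. The URL is the placeholder from the first post, and the names `build_url`, `fetch` and `probe` are made up for this example. It assumes a missing prdno produces an HTTP error (such as a 404). If the site instead returns a normal page saying "not found", you would need to check the page body.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Placeholder URL from the first post; swap in the real site.
BASE_URL = "http://www.blah.co.uk/applications/blah/details.asp"


def build_url(prdno):
    """Return the details-page URL for one product number."""
    return BASE_URL + "?" + urlencode({"prdno": prdno})


def fetch(url, timeout=10):
    """Return the page source as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError (404, 500, ...) and timeouts are all OSError subclasses.
        return None


def probe(start, stop, fetch_page=fetch, delay=1.0):
    """Yield (prdno, html) for each product number in [start, stop) that loads."""
    for prdno in range(start, stop):
        html = fetch_page(build_url(prdno))
        if html is not None:
            yield prdno, html
        time.sleep(delay)  # pause between requests to go easy on the server
```

Looping over `probe(0, 1000)` would hand each page's source to whatever regex extraction you already have. The upper bound is a guess you would have to tune, since the thread does not say how many prdnos exist.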

mbk · November 19, 2009

OK, but what I am initially looking for is a way to retrieve the URLs I wish to visit. I guess that is the sitemap, isn't it...

I think I need to look into the sitemap more.

Thanks for the info re scraping too; it's OK though.

JonnoTheDev · November 19, 2009

but what I am initially looking for is a way to retrieve the URLs

Extract the URLs from your source target and then get your spider to follow them. Set up what is called a penetration level (a depth limit), i.e. stop following links once you are, say, 4 levels deep from the initial source.
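A rough sketch of that approach, again in Python's standard library rather than PHP cURL: pull the links out of each page, follow them breadth-first, and stop following once a page is `max_depth` levels from the start. The names `LinkParser`, `extract_links`, `fetch` and `crawl` are invented for this example, and it is a starting point under those assumptions, not a finished spider (it does not check robots.txt or stay on one domain, for instance).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
import urllib.request


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs, without #fragments, found in html."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urldefrag(urljoin(base_url, href)).url for href in parser.links)
    return [url for url in absolute if url.startswith(("http://", "https://"))]


def fetch(url, timeout=10):
    """Return the page source as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        return None


def crawl(start_url, max_depth=4, fetch_page=fetch):
    """Breadth-first crawl; return {url: html} for every page that loaded."""
    depth_of = {start_url: 0}  # every URL queued so far, with its depth
    pages = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        pages[url] = html
        if depth_of[url] >= max_depth:
            continue  # deep enough: keep the page but do not follow its links
        for link in extract_links(html, url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return pages
```

The `depth_of` dictionary doubles as the "already seen" set, so a page that links back to the start does not send the spider round in circles.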
