
Crawler help


mbk


I am trying to create a crawler and have worked out how to extract the data I need from the site using regex; however, I haven't yet worked out how to access the site! I know, wrong way round, but still...

 

The pages I need to crawl all have the following form:

 

 http://www.blah.co.uk/applications/blah/details.asp?prdno=xyz

 

The issue is that I don't know how many different prdnos there are, so I am looking for a way to step through them starting from 0: if a page exists for that prdno, extract the data; if not, move on to the next one.

 

Can someone help me please?
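
The loop described above can be sketched roughly as follows. This is only an illustration in Python, not something from the thread; the same idea carries over to PHP's cURL functions. The names `fetch_page`, `crawl_products`, `max_misses` and `delay` are made up for the example. It assumes a missing prdno produces an HTTP error such as 404; many ASP sites instead return a normal page containing an error message, in which case the check inside `fetch_page` would need to inspect the body. Because the highest prdno is unknown, the sketch gives up after a run of consecutive misses.

```python
import time
import urllib.request

# URL pattern from the post; "xyz" is replaced by the numeric prdno.
BASE_URL = "http://www.blah.co.uk/applications/blah/details.asp?prdno={prdno}"


def fetch_page(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Assumption: a non-existent product yields an HTTP error (e.g. 404).
    HTTPError, URLError and socket timeouts are all subclasses of OSError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        return None


def crawl_products(fetch=fetch_page, start=0, max_misses=50, delay=1.0):
    """Yield (prdno, html) for every product page found.

    Stops once max_misses consecutive prdnos in a row return nothing,
    since there is no way to know the highest prdno in advance.
    """
    misses = 0
    prdno = start
    while misses < max_misses:
        html = fetch(BASE_URL.format(prdno=prdno))
        if html is None:
            misses += 1
        else:
            misses = 0
            yield prdno, html  # hand the page to the regex extraction step
        prdno += 1
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the server
```

Each `(prdno, html)` pair it yields could then be passed to the regex extraction that is already working. The value of `max_misses` is a guess and would need tuning if the prdnos have large gaps.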

 

 


Thanks Neil,

 

I just want to clarify: by using cURL, will I in effect be able to search for the prdnos from my original post and act on them if they are found?

 

Sorry if this is a simple question; I just want to make sure I am reading the correct info.

 

At present the directories beyond the domain are all virtual listings and thus inaccessible, so I need to find a way of generating each prdno and trying to access the corresponding page.

 

Is cURL the way to go?

 

Thanks


but what I am initially looking for is a way to retrieve the URLs

Extract the URLs from your source target and then get your spider to follow them. Set up what is called a penetration level (a depth limit), i.e. stop following links once you are, say, 4 levels deep from the initial source.
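
A depth-limited spider of the kind described here might look something like the sketch below. It is written in Python purely as an illustration; `LinkExtractor`, `extract_links` and `crawl` are hypothetical names, and `fetch` is assumed to be any function that takes a URL and returns the page HTML, or `None` on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) for all links in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_depth=4):
    """Breadth-first crawl from start_url.

    Returns a dict mapping each discovered URL to its depth. Pages at
    max_depth are still fetched, but their links are not followed.
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        html = fetch(url)
        if html is None or depth >= max_depth:
            continue  # fetch failed, or the depth limit has been reached
        for link in extract_links(html, url):
            if link not in seen:  # never queue the same URL twice
                seen[link] = depth + 1
                queue.append(link)
    return seen
```

The `seen` dictionary doubles as the visited set and the depth record, so a page that links back to the start does not cause a loop. This sketch follows every link it finds, including links to other sites; a real spider would probably want to filter on the host name as well.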

