
I am trying to create a crawler and have worked out how to extract the data I need from the site using regex. However, I haven't yet worked out how to access the site! I know, wrong way round, but still...

 

The pages I need to crawl all have the following form:

 

 http://www.blah.co.uk/applications/blah/details.asp?prdno=xyz

 

The issue is that I don't know how many different prdnos there are, so I am looking for a way to crawl through them starting from 0: if a page exists, access its data; if not, move on to the next.

 

Can someone help me please?

 

 

Link to comment
https://forums.phpfreaks.com/topic/182163-crawler-help/
Share on other sites

Thanks Neil,

 

I just want to clarify: by using cURL, will I in effect be able to search for the prdno values from my original post and act on them if found?

 

Sorry if this is a simple question; I just want to make sure I am reading the correct info.

 

At present the directories beyond the domain are all virtual listings and thus inaccessible, so I need to find a way of generating each prdno and trying to access the page.

 

Is cURL the way to go?

 

Thanks
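
The probing loop described above (generate a prdno, request the page, keep it if it exists, otherwise move on) could be sketched roughly as follows. This is only an illustration in Python using the standard library rather than cURL; the same loop could be written with PHP's cURL functions. `BASE_URL` is the placeholder URL from the original post, and `fetch`, `probe_products`, `stop` and `max_misses` are hypothetical names and limits chosen for the example.

```python
import urllib.error
import urllib.request

# Placeholder URL pattern from the original post; {} is filled with the prdno.
BASE_URL = "http://www.blah.co.uk/applications/blah/details.asp?prdno={}"


def fetch(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as connection problems.
        return None


def probe_products(start, stop, fetch_page=fetch, max_misses=50):
    """Try prdno = start, start + 1, ... and collect the pages that exist.

    Stops at `stop` (exclusive) or after `max_misses` consecutive failures,
    whichever comes first. Both limits are assumptions for this sketch.
    """
    found = {}
    misses = 0
    for prdno in range(start, stop):
        body = fetch_page(BASE_URL.format(prdno))
        if body is None:
            misses += 1
            if misses >= max_misses:
                break
            continue
        misses = 0  # a hit resets the run of consecutive misses
        found[prdno] = body
    return found
```

Something like `probe_products(0, 10000)` would then return a dict of prdno to page body, ready for the regex step. One caveat: some sites answer a missing product with a normal page saying "not found" rather than an HTTP error, in which case the body would need checking as well.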


but what I am initially looking for is a way to retrieve the URLs

Extract the URLs from your source target and then get your spider to follow them. Set up what is called a penetration level (a depth limit), i.e. stop following links once you are, say, 4 levels deep from the initial source.
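
As a rough illustration of that idea, here is a minimal breadth-first sketch in Python that extracts links from each page and stops following them past a depth limit. The names `LinkParser`, `extract_links`, `crawl` and `max_depth` are made up for this example, and `fetch_page` is assumed to be any function you supply that returns a page's HTML as a string, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in `html`."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch_page, max_depth=4):
    """Breadth-first crawl that stops following links at `max_depth`."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # deep enough: keep the page but do not follow its links
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The `seen` set keeps the spider from requesting the same URL twice when pages link back to each other. A real crawler would probably also want to restrict links to the target domain and pause between requests.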

