mbk Posted November 19, 2009 I am trying to create a crawler and have worked out how to extract the data I need from the site using regex; however, I haven't yet worked out how to access the site! I know, wrong way round, but still... The pages I need to crawl all have the following form: http://www.blah.co.uk/applications/blah/details.asp?prdno=xyz The issue is that I don't know how many different prdnos there are, so I am looking for a way to crawl through them starting from 0: if a page exists, extract its data; if not, move on to the next. Can someone help me please? Link to comment https://forums.phpfreaks.com/topic/182163-crawler-help/
sKunKbad Posted November 19, 2009 If you need an example of a crawler, you might check out phpSitemapNG.
mbk Posted November 19, 2009 sKunKbad, thanks for the quick reply. I tried the crawler you suggested on the site in question, so that I could then look at its source, but it didn't work: it merely returned the root URL. Any other suggestions, please?
JonnoTheDev Posted November 19, 2009 If you want to write your own spider then take a look at cURL. Read the documentation thoroughly and look through the examples. http://uk2.php.net/curl
mbk Posted November 19, 2009 Thanks Neil. I just want to clarify: by using cURL, will I in effect be able to search for the prdno values from my original post and act on them if found? Sorry if this is a simple question; I just want to make sure I am reading the correct info. At present the directories beyond the domain are all virtual listings and thus inaccessible, so I need to find a way of generating each prdno and trying to access the page. Is cURL the way to go? Thanks
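The "generate each prdno and try the page" idea described above might look roughly like the sketch below. It is written in Python rather than PHP, and several things in it are assumptions that are not in the thread: the `probe_products` name, the `fetch` callable (anything that returns the page source, or None for a missing page), and the rule of giving up after `max_misses` consecutive missing pages, which is only a guess at a sensible stopping condition since the real range of prdno is unknown.

```python
from typing import Callable, Iterator, Optional, Tuple

# URL pattern from the first post; "blah" is the poster's placeholder.
BASE_URL = "http://www.blah.co.uk/applications/blah/details.asp?prdno={}"


def probe_products(
    fetch: Callable[[str], Optional[str]],
    start: int = 0,
    max_misses: int = 50,
) -> Iterator[Tuple[int, str]]:
    """Yield (prdno, page_source) for every product number that exists.

    `fetch` returns the page source, or None when the page is missing.
    Probing stops after `max_misses` consecutive missing pages (an assumed
    stopping rule, because the upper bound of prdno is not known).
    """
    prdno = start
    misses = 0
    while misses < max_misses:
        source = fetch(BASE_URL.format(prdno))
        if source is None:
            misses += 1      # page missing: count it and move on
        else:
            misses = 0       # page found: reset the gap counter
            yield prdno, source
        prdno += 1
```

Passing the fetcher in as an argument keeps the numbering logic separate from the HTTP code, so it can be tried against a plain dictionary of fake pages before it is pointed at a real site.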
JonnoTheDev Posted November 19, 2009 cURL is used for making HTTP requests, just as your web browser does. The response to a request will contain the source of the target URL. From this you can extract the data you require. Be careful when scraping websites!
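As a rough illustration of that request-then-extract flow, here is a minimal sketch using Python's standard library in place of PHP's cURL functions; the pattern (send a request, read the response body, run a regex over it) is the same. The `fetch` and `extract_title` names, the User-Agent string, and the idea that the wanted data sits in the page `<title>` are all assumptions made for the example. Note also that some sites answer a missing product with a normal 200 page carrying an error message, in which case the check would have to look at the page content instead of the status code.

```python
import re
import urllib.error
import urllib.request
from typing import Optional


def fetch(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page source, or None if the server answers with an HTTP error."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}  # hypothetical UA string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError:
        # 404, 500 and similar status codes: treat the page as missing.
        return None


# Hypothetical pattern: assumes the data of interest is the page <title>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(source: str) -> Optional[str]:
    """Pull the <title> text out of the page source, or None if absent."""
    match = TITLE_RE.search(source)
    return match.group(1).strip() if match else None
```

Connection failures (DNS errors, timeouts) are deliberately not caught here, so they surface as exceptions instead of being mistaken for a missing page.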
mbk Posted November 19, 2009 OK - but what I am initially looking for is a way to retrieve the URLs I wish to visit, which I guess is the sitemap, isn't it... I think I need to look into the sitemap more. Thanks for the info re scraping too; it's OK though.
JonnoTheDev Posted November 19, 2009 "but what I am initially looking for is a way to retrieve the URLs" Extract the URLs from your source target and then get your spider to follow them. Set up what is called a penetration level (a maximum crawl depth), i.e. stop following links once you are, say, 4 levels deep from the initial source.
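A depth-limited spider of the kind described above could be sketched as follows, again in Python. The `LinkParser`, `extract_links`, and `crawl` names are made up for this example, and `fetch` is assumed to be any callable that returns a page's source or None. The sketch follows every link it finds, so a real crawler would probably also want to skip links that leave the target domain.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(source: str, base_url: str) -> List[str]:
    """Return absolute URLs for all links in `source`, without #fragments."""
    parser = LinkParser()
    parser.feed(source)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(
    start_url: str,
    fetch: Callable[[str], Optional[str]],
    max_depth: int = 4,
) -> List[str]:
    """Breadth-first crawl that stops following links at `max_depth` levels."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: List[str] = []
    while queue:
        url, depth = queue.popleft()
        source = fetch(url)
        if source is None:
            continue                 # missing page: nothing to extract
        visited.append(url)
        if depth >= max_depth:
            continue                 # depth limit reached: do not follow links
        for link in extract_links(source, url):
            if link not in seen:     # avoid fetching the same page twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

The `seen` set matters as much as the depth limit: without it, two pages that link to each other would keep the spider going round in circles until the limit is hit.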