zero_cool Posted February 1, 2013
I don't know if this is the right place to ask, but I am wondering if someone knows which technique you could use to make a program follow links on a website. For example, if I wanted my "robot program" to "click" (follow) a link on my site, how could I do that? It is possible to do that, right?
requinix Posted February 1, 2013
This would be the right place if you wanted to use regular expressions for it, but I would suggest you not go that route. Are you talking about spidering a site (crawling all the links you can find, on every page you can find) or about making a bot to "click" one particular link or set of links?
.josh Posted February 4, 2013
There are a lot of tools out there that do this already, but whether they'd be useful depends on what you want to do. For example, I normally use Xenu (free) for basic crawling. My company purchased a license for Wasp, which is a bit more robust but costs money. But if you are looking to write your own tool, I agree with requinix: using regex to scrape pages for links is not advisable, since regex is not designed to parse HTML. Basically you would need to:
1. Start with the "seed" URL (usually something like http://www.yoursite.com/).
2. Get the contents of the URL with either cURL (harder to implement, but more advisable, since you can take advantage of cURL's multi-threading; here is a nice class for multi-threaded cURLing) or file_get_contents (easier to implement, but not advisable, since you can't multi-thread).
3. Use something like DOM or simplehtmldom to pull all links from the page and store them in an array. NOTE: neither method will let you get links that are generated dynamically on the client side! There really isn't much you can do about that short of getting complicated about it: you'd have to set up your server to request the page through a "browser" program that executes JavaScript, and get the links that way.
4. The overall "loop" is then to repeat the fetch and extract steps for each link in the array until there are no more links left.
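For anyone who wants to see the seed / fetch / extract / loop idea from the post above in code, here is a minimal sketch. It is not from the thread and makes several assumptions: it is written in Python rather than PHP, it uses the standard library's `html.parser` in place of DOM or simplehtmldom, `urlopen` in place of cURL or file_get_contents, and it fetches pages one at a time with no parallel requests. The names `LinkParser`, `extract_links`, `crawl`, `FAKE_SITE` and `fake_fetch` are all made up for illustration, and the in-memory "site" only exists so the example runs without touching the network. As the post notes, links generated by JavaScript would not be found.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag with a real HTML parser, not regex."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (with #fragments removed) for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def fetch(url):
    """Download one page over the network (the simple, one-at-a-time approach)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=100):
    """Crawl pages on the seed's host, breadth first, and return the URLs visited."""
    host = urlparse(seed).netloc
    queue = deque([seed])  # links still waiting to be fetched
    seen = {seed}          # every URL ever queued, so nothing is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(html, url):
            # Stay on the same host and ignore anything already queued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory site, so the sketch can be tried without a network.
FAKE_SITE = {
    "http://www.yoursite.com/": '<a href="/about">About</a> <a href="http://other.example/">Out</a>',
    "http://www.yoursite.com/about": '<a href="/">Home</a> <a href="contact#form">Contact</a>',
    "http://www.yoursite.com/contact": "<p>No links here</p>",
}


def fake_fetch(url):
    """Stand-in for fetch() that reads pages from FAKE_SITE."""
    return FAKE_SITE[url]


print(crawl("http://www.yoursite.com/", fetch_page=fake_fetch))
```

To crawl a real site you would call `crawl("http://www.yoursite.com/")` and let it use the default `fetch`. The `seen` set is what stops the loop from going around forever when pages link back to each other, and `max_pages` is just a safety limit.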