technique for following links


zero_cool

I don't know if this is the right place to ask, but I am wondering if someone knows which technique you could use to make a program follow links on a website. For example, if I wanted my "robot program" to "click" (follow) a link on my site, how could I do that? It is possible to do that, right?

This would be the right place if you wanted to use regular expressions for it, but I would suggest you not go that route.

Are you talking about spidering a site (crawling all the links you can find, for every page you can find) or about making a bot to "click" a very particular link or set of links?

There are a lot of tools out there that already do this, but whether they'd be useful depends on what you want to do. For example, I normally use Xenu (free) for basic crawling. My company purchased a license for Wasp, which is a bit more robust but costs money.

If you want to write your own tool, though, I agree with requinix: using regex to scrape pages for links is not advisable, since regex is not designed to parse HTML. Basically you would need to:

  • Start with the "seed" URL (usually something like http://www.yoursite.com/).
  • Use cURL or file_get_contents to get the contents of the URL. cURL is harder to implement but more advisable, since you can take advantage of cURL's multi-threading, meaning parallel requests through the curl_multi_* functions (here is a nice class for multi-threaded cURLing). file_get_contents is easier to implement but not advisable, since you can't multi-thread.
  • Use something like DOM or simplehtmldom to pull all links from the page and store them in an array.
    • NOTE: neither method will get you links that are generated dynamically on the client side! There really isn't much you can do about that short of getting complicated: you'd have to set up your server to request the page through a "browser" program that executes the JavaScript, and get the links that way.
  • Then the overall "loop" is to repeat the fetch and extract steps for each link in the array (skipping any URL you have already fetched) until there are no more links left.
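
The steps above can be sketched roughly as follows. This is only an illustration, and it uses Python's standard library in place of the PHP tools named above: `urllib` stands in for cURL/file_get_contents and `html.parser` stands in for DOM/simplehtmldom. The names `LinkExtractor`, `extract_links`, `fetch_page`, and `crawl`, the `max_pages` limit, and the choice to stay on the seed's host are all assumptions made for the example. It fetches one page at a time (no parallel requests) and, like the approach described above, it will not see links generated by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Parse the HTML with a real parser (not regex) and return its links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_page(url):
    """Default fetcher: plain HTTP GET, decoded leniently as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl that stays on the seed's host.

    Returns the list of URLs that were fetched, in order.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])   # the "array" of links still to visit
    seen = {seed_url}           # every URL ever queued, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(html, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://www.yoursite.com/")` would return the URLs it managed to fetch, starting from the seed. The `fetch` parameter is there so you can swap in a different downloader (or a fake one for testing) without touching the loop, and the `seen` set is what keeps the loop from running forever when pages link back to each other.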
