Scraping page for link, then scraping that link's page for another link

Hello all,

 

I'm not new to coding, but I'm pretty new to PHP. What I need to do is scrape A.php for a link to B.php, then scrape B.php for a link to C.php, so that I can scrape data from C.php and push that data out to X.php. I'm starting from A.php because it is dynamic.

 

Here is an example of my current code with which I am obtaining the desired link to B.php from A.php:

<?php
    require('simple_html_dom.php');

    $html = file_get_html('http://test.com/');

    $i = 1;
    foreach ($html->find('tr') as $desiredItem)
    {
        if ($i > 2) {
            break;
        }

        // Find link element
        $desiredItemDetails = $desiredItem->find('a.tag', 0);

        // Get href attribute
        $desiredItemUrl = 'test.com/' . $desiredItemDetails->href;

        $i++;
    }

    echo ($desiredItemUrl);
?>

I've tried re-initializing $html by passing $desiredItemUrl to file_get_html. This doesn't seem to work, even if I pass it as a string.

 

Is this not possible? Is there simply an easier/more efficient way of doing this? Any help is greatly appreciated. Thanks!
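
One possible cause, offered as a guess rather than a confirmed diagnosis: the URL built in the loop above starts with 'test.com/' and has no 'http://' scheme, so PHP would likely treat it as a local file path instead of a web address. Below is a minimal sketch of the two-step lookup under that assumption. It uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it runs on its own, and it works on inline sample HTML rather than live pages. The helper names firstHref and absoluteUrl, the "external" class, and the sample markup are all made up for illustration.

```php
<?php
// Hypothetical helper: href of the first element matching an XPath query, or null.
function firstHref(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is often not well-formed; suppress parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('href') : null;
}

// Hypothetical helper: prefix a site-relative href with the base URL.
// Deliberately simplistic; it does not handle "../" or protocol-relative links.
function absoluteUrl(string $base, string $href): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Sample stand-ins for A.php and B.php (assumed markup).
$pageA = '<table><tr><td><a class="tag" href="client.php?id=7">Client</a></td></tr></table>';
$pageB = '<p><a class="external" href="http://example.org/profile/7">Profile</a></p>';

$hrefB = firstHref($pageA, '//tr//a[@class="tag"]');
if ($hrefB === null) {
    exit("No link found on page A\n");
}
$linkToB = absoluteUrl('http://test.com/', $hrefB);
echo $linkToB, "\n";

// With live pages, the second fetch would go here, e.g.:
// $pageB = file_get_contents($linkToB);
$linkToC = firstHref($pageB, '//a[@class="external"]');
echo $linkToC, "\n";
```

The point of the sketch is only that the second request needs a full URL including the scheme; the same idea applies when calling file_get_html a second time.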

Thanks for the quick reply,

 

A.php and B.php are on my site, but the link I need to follow from B.php points to an external domain I have no control over.

 

And X.php is actually X.html right now as I'm unsure how to go about this. Because A.php is dynamic, I was under the impression I would need to begin from there.

Edited by L1GH7

So I'm about 95% sure that this process you're describing is either way convoluted or flat out the wrong approach to this.

 

Can you be more precise than these A/B/C/X.php files and scraping links and pushing data?

 

Basically, if you control A.php and B.php then there's no reason why B should have to scrape anything from A - you could just copy the code or logic or whatever that A is using into B. B gets what it needs naturally. Not sure what C or X are supposed to be.

Ok, my apologies. The main page (A.php) is essentially a dynamic table which lists ranked clients 1 through n. I'm simply grabbing the first client's link in the table (which may be different on any given day). This link leads to another page (B.php) which houses bio/information on the client. On this page, there is an external link (C.php) from which I need to retrieve a name and profile image so that I can display them on a greeting/information page (X).

 

I need data from the external link, but only if the external link correlates to the rank 1 client found on A.php.

 

I hope this is more precise, thanks for your time.

A.php has the logic to list clients. You control this page so you can find that logic and replicate it in X.php in order to get the first client in that list.

B.php has the logic to display information. You control this page so you can find that logic and replicate it in X.php in order to get that external link.

I don't know if you control C.php. If so then I'm sure you can guess what I would say.

 

X.php doesn't have to do any scraping from any pages that you control because you can just copy the logic driving each page.
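
As a rough illustration of what "copying the logic" could look like, here is a sketch in which X.php asks the database directly for the top-ranked client instead of scraping A.php and B.php. Everything about the schema is an assumption: the table name clients, the columns name, rank_position and external_url, and the function topClient are invented for the example, and an in-memory SQLite database stands in for whatever the site really uses.

```php
<?php
// In-memory SQLite stands in for the site's real database (an assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema and sample rows, for illustration only.
$db->exec('CREATE TABLE clients (name TEXT, rank_position INTEGER, external_url TEXT)');
$db->exec("INSERT INTO clients VALUES
    ('Alice', 2, 'http://example.org/alice'),
    ('Bob', 1, 'http://example.org/bob')");

// Return the rank 1 client as an associative array, or null if the table is empty.
function topClient(PDO $db): ?array
{
    $stmt = $db->query(
        'SELECT name, external_url FROM clients ORDER BY rank_position ASC LIMIT 1'
    );
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

$client = topClient($db);
if ($client !== null) {
    echo $client['name'], ' -> ', $client['external_url'], "\n";
}
```

On the real site the query would be whatever A.php and B.php already run; the sketch only shows that one query can replace two rounds of scraping.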

 

So as far as I'm concerned the only unanswered issue is with C.php...
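
For the external page, which is the one part that may genuinely need scraping, a minimal sketch of pulling out a name and a profile image could look like the following. It uses PHP's built-in DOMDocument and DOMXPath on an inline sample. The markup, the class names profile-name and avatar, and the function extractProfile are assumptions, since the real structure of C.php is not known here.

```php
<?php
// Hypothetical helper: extract a display name and avatar URL from profile HTML.
function extractProfile(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings from markup that is not well-formed.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Assumed selectors; adjust to the actual markup of the external page.
    $nameNode = $xpath->query('//h1[@class="profile-name"]')->item(0);
    $imgNode  = $xpath->query('//img[@class="avatar"]')->item(0);

    return [
        'name'  => $nameNode !== null ? trim($nameNode->textContent) : null,
        'image' => $imgNode instanceof DOMElement ? $imgNode->getAttribute('src') : null,
    ];
}

// Sample stand-in for the external page; a live version would fetch it first,
// e.g. with file_get_contents($externalUrl).
$sample = '<html><body>'
    . '<h1 class="profile-name"> Bob Example </h1>'
    . '<img class="avatar" src="http://example.org/img/bob.png">'
    . '</body></html>';

$profile = extractProfile($sample);
echo $profile['name'], "\n";
echo $profile['image'], "\n";
```

Both fields come back as null when the expected element is missing, so the greeting page can fall back to a default instead of failing.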

Ok I see what you're saying, but after looking into the main html and php files, not only are they extremely confusing, but there is a database that's being read from, a common.php and a bunch of other includes which all seem to require each other.

 

So, I've essentially been trying to replicate the entire site in a sub directory for testing, trying to fix error after error, and it seems to be heavily cumbersome when all that's needed is a name and an avatar. I guess I still don't know if it's just a matter of me being inexperienced with php, or if it really would be better to have some kind of method of traversing through a few pages and grabbing what I need.

Scraping your own site is definitely not the right answer. I can tell you that right now.

 

As much of a burden as it may be, learning how your site works is the best thing you can do. You should do it regardless of this project.
