Scraping page for link, then scraping that link's page for another link

L1GH7 · February 6, 2018

Hello all,

I'm not new to coding, but I'm pretty new to PHP. What I need to do is scrape A.php for a link to B.php, then scrape B.php for a link to C.php, so that I can scrape data from C.php and push that data out to X.php. I'm starting from A.php because it is dynamic.

Here is an example of my current code with which I am obtaining the desired link to B.php from A.php:

<?php
    require('simple_html_dom.php');

    $html = file_get_html('http://test.com/');

    $i = 1;
    foreach ($html->find('tr') as $desiredItem)
    {
        if ($i > 2) {
            break;
        }

        // Find link element
        $desiredItemDetails = $desiredItem->find('a.tag', 0);

        // Get href attribute
        $desiredItemUrl = 'test.com/' . $desiredItemDetails->href;

        $i++;
    }

    echo ($desiredItemUrl);
?>

I've tried re-initializing $html by passing $desiredItemUrl to file_get_html(). This doesn't seem to work, even if I explicitly pass it as a string.

Is this not possible? Is there simply an easier/more efficient way of doing this? Any help is greatly appreciated. Thanks!
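
For reference, one likely reason a second file_get_html() call fails here: the URL is built as 'test.com/...' with no scheme, and (assuming file_get_html() hands the string to file_get_contents(), as simple_html_dom normally does) PHP would then treat it as a local file path rather than a web address. Below is a minimal sketch of turning an href into an absolute URL. The helper name toAbsoluteUrl() and the sample href are hypothetical, and the logic is deliberately simplified (it ignores ports and "../" segments).

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turn an href found on
// a page into an absolute URL that includes the scheme.
function toAbsoluteUrl(string $baseUrl, string $href): string
{
    // Already absolute (has a scheme such as http:// or https://).
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Protocol-relative href (//host/path): reuse the base URL's scheme.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative href (/path): append to the origin.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Otherwise treat it as relative to the base URL's directory.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Sample href is made up for illustration.
echo toAbsoluteUrl('http://test.com/', 'client.php?id=7'), "\n";
// prints http://test.com/client.php?id=7
```

The result can then be passed to file_get_html() in place of the scheme-less string. It is also worth checking that the first call actually found a link before reading ->href from it.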

requinix · February 6, 2018

Are all these pages on your site? Why are there so many pages involved that are scraping each other? Why can't X.php do everything?

L1GH7 · February 6, 2018

Thanks for the quick reply,

A.php and B.php are on my site, but the link I need to traverse to in B.php is actually an external domain I have no control over.

And X.php is actually X.html right now as I'm unsure how to go about this. Because A.php is dynamic, I was under the impression I would need to begin from there.

Edited February 6, 2018 by L1GH7

requinix · February 6, 2018

So I'm about 95% sure that this process you're describing is either way convoluted or flat out the wrong approach to this.

Can you be more precise than these A/B/C/X.php files and scraping links and pushing data?

Basically, if you control A.php and B.php then there's no reason why B should have to scrape anything from A - you could just copy the code or logic or whatever that A is using into B. B gets what it needs naturally. Not sure what C or X are supposed to be.

L1GH7 · February 6, 2018

OK, my apologies. The main page (A.php) is essentially a dynamic table that lists ranked clients 1 through n. I'm simply grabbing the first client's link in the table (which may be different on any given day). This link leads to another page (B.php) that houses bio/information on the client. On this page there is an external link (C.php) from which I need to retrieve a name and profile image, so that I can display them on a greeting/information page (X).

I need data from the external link, but only if the external link correlates to the rank 1 client found on A.php.

I hope this is more precise, thanks for your time.
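
The page-to-page traversal described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it is self-contained, the function names firstHref() and followLinks() are hypothetical, the "external" class on B.php's link is invented, and canned HTML strings stand in for the real pages so the sketch runs offline.

```php
<?php
// Return the href of the first <a> matched by $xpathQuery, or null if none.
function firstHref(string $html, string $xpathQuery): ?string
{
    if ($html === '') {
        return null; // Nothing to parse (e.g. the fetch failed).
    }
    libxml_use_internal_errors(true); // Tolerate non-well-formed markup.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($xpathQuery);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('href') : null;
}

// Follow one link per page: fetch the page, find the link, move on.
// $fetch is any callable that maps a URL to an HTML string.
function followLinks(callable $fetch, string $startUrl, array $xpathQueries): ?string
{
    $url = $startUrl;
    foreach ($xpathQueries as $query) {
        $href = firstHref($fetch($url), $query);
        if ($href === null) {
            return null; // Chain is broken; stop instead of fetching garbage.
        }
        $url = $href; // Assumes absolute hrefs; resolve relative ones first.
    }
    return $url;
}

// Canned pages stand in for A.php and B.php (markup is made up).
$pages = [
    'http://test.com/A.php' => '<table><tr><th>Rank</th></tr>'
        . '<tr><td><a class="tag" href="http://test.com/B.php?id=7">One</a></td></tr></table>',
    'http://test.com/B.php?id=7' =>
        '<p><a class="external" href="http://example.org/C.php">Profile</a></p>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};

echo followLinks($fetch, 'http://test.com/A.php', [
    '//tr//a[@class="tag"]',   // first ranked client's link on A.php
    '//a[@class="external"]',  // external profile link on B.php
]), "\n";
// prints http://example.org/C.php
```

Against live pages, $fetch could be a thin wrapper around file_get_contents(). Passing the fetcher in as a callable keeps the parsing logic testable without network access.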

requinix · February 6, 2018

A.php has the logic to list clients. You control this page so you can find that logic and replicate it in X.php in order to get the first client in that list.

B.php has the logic to display information. You control this page so you can find that logic and replicate it in X.php in order to get that external link.

I don't know if you control C.php. If so then I'm sure you can guess what I would say.

X.php doesn't have to do any scraping from any pages that you control because you can just copy the logic driving each page.

So as far as I'm concerned the only unanswered issue is with C.php...
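
For the one page that may be outside your control, a minimal sketch of pulling a name and avatar out of C.php's HTML might look like the following. It assumes the external page marks them up as h1.profile-name and img.avatar, which is purely hypothetical, as are the function name extractProfile() and the sample data. The real selectors depend on that site's markup.

```php
<?php
// Extract a display name and avatar URL from an external profile page.
// Returns null when either element is missing.
// The selectors below are assumptions, not the real site's markup.
function extractProfile(string $html): ?array
{
    if ($html === '') {
        return null;
    }
    libxml_use_internal_errors(true); // Tolerate non-well-formed markup.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $name = $xpath->query('//h1[@class="profile-name"]')->item(0);
    $img  = $xpath->query('//img[@class="avatar"]')->item(0);
    if ($name === null || !($img instanceof DOMElement)) {
        return null;
    }

    return [
        'name'   => trim($name->textContent),
        'avatar' => $img->getAttribute('src'),
    ];
}

// Made-up sample standing in for the HTML fetched from C.php.
$sample = '<html><body><h1 class="profile-name"> Jane Doe </h1>'
        . '<img class="avatar" src="https://example.org/img/jane.png"></body></html>';

$profile = extractProfile($sample);

// Escape before printing: the values come from a site you do not control.
echo htmlspecialchars($profile['name']), ' ',
     htmlspecialchars($profile['avatar']), "\n";
```

X.php would call this only after getting the external link from its own data, so C.php is the single page being scraped.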

L1GH7 · February 7, 2018

Ok I see what you're saying, but after looking into the main HTML and PHP files, not only are they extremely confusing, but there is also a database being read from, a common.php, and a bunch of other includes that all seem to require each other.

So I've essentially been trying to replicate the entire site in a subdirectory for testing, fixing error after error, and it seems very cumbersome when all that's needed is a name and an avatar. I guess I still don't know whether it's just a matter of me being inexperienced with PHP, or whether it really would be better to have some method of traversing a few pages and grabbing what I need.

requinix · February 7, 2018

Scraping your own site is definitely not the right answer. I can tell you that right now.

As much of a burden as it may be, learning how your site works is the best thing you can do. You should do it regardless of this project.

L1GH7 · February 7, 2018

Gotcha.

I'll keep you posted, and return if I run into any issues.

Thanks again for your help thus far!
