Scraping page for link, then scraping that link's page for another link


L1GH7

Hello all,

 

I'm not new to coding, but I'm pretty new to PHP. What I need to do is scrape A.php for a link to B.php, then scrape B.php for a link to C.php, so I can scrape data from C.php and push that data out to X.php. I'm starting from A.php because it is dynamic.

 

Here is an example of my current code with which I am obtaining the desired link to B.php from A.php:

<?php
    require 'simple_html_dom.php';

    $html = file_get_html('http://test.com/');

    $desiredItemUrl = null;
    $i = 1;
    foreach ($html->find('tr') as $desiredItem)
    {
        // Only look at the first two rows of the table
        if ($i > 2) {
            break;
        }

        // Find link element (null when the row has no matching link)
        $desiredItemDetails = $desiredItem->find('a.tag', 0);

        if ($desiredItemDetails) {
            // Get href attribute
            $desiredItemUrl = 'test.com/' . $desiredItemDetails->href;
        }

        $i++;
    }

    echo $desiredItemUrl;
?>

I've tried re-initializing $html by passing $desiredItemUrl through file_get_html. This doesn't seem to work, even if I call it as a string.
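
One likely cause, judging only from the example as posted: $desiredItemUrl is built as 'test.com/' plus the href, with no "http://" scheme. PHP's stream functions treat a string without a scheme as a local file path, so a second fetch with that string would look for a file on disk rather than a web page. The sketch below shows one way to turn an href into an absolute URL first. The function name absoluteUrl is hypothetical, and it only covers the common cases (it does not resolve "../" segments).

```php
<?php
// Hypothetical helper: turn an href found on a page into an absolute URL.
// Covers absolute, protocol-relative, root-relative and simple relative
// hrefs; it does not resolve "../" segments.
function absoluteUrl(string $base, string $href): string
{
    // Already absolute, e.g. "http://other.example/page"
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative, e.g. "//cdn.example/x"
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative, e.g. "/clients/42"
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Relative to the directory part of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// The href value here is made up for illustration.
echo absoluteUrl('http://test.com/', 'B.php?id=7'), "\n";
// prints "http://test.com/B.php?id=7"
```

The returned string can then be passed to file_get_html() (or file_get_contents()) the same way the first URL was.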

 

Is this not possible? Is there simply an easier/more efficient way of doing this? Any help is greatly appreciated. Thanks!

Thanks for the quick reply,

 

A.php and B.php are on my site, but the link I need to traverse to in B.php is actually an external domain I have no control over.

 

And X.php is actually X.html right now as I'm unsure how to go about this. Because A.php is dynamic, I was under the impression I would need to begin from there.

So I'm about 95% sure that this process you're describing is either way convoluted or flat out the wrong approach to this.

 

Can you be more precise than these A/B/C/X.php files and scraping links and pushing data?

 

Basically, if you control A.php and B.php then there's no reason why B should have to scrape anything from A - you could just copy the code or logic or whatever that A is using into B. B gets what it needs naturally. Not sure what C or X are supposed to be.

Ok, my apologies. The main page (A.php) is essentially a dynamic table which lists ranked clients 1 through n. I'm simply grabbing the first client's link in the table (which may be different on any given day). This link leads to another page which houses bio/information on the client (B.php). On this page, there is an external link (C.php) from which I need to retrieve a name and profile image so that I can display them on a greeting/information page (X).

 

I need data from the external link, but only if the external link correlates to the rank 1 client found on A.php.
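
The chain described here (first client link on A, then the external link on B) can be sketched as two lookups with the same helper. This is only an illustration using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom. The helper name firstHref, the "external" class, and the sample markup are all assumptions, since the real pages were not posted.

```php
<?php
// Hypothetical helper: return the href of the first <a> matching an XPath
// query, or null when nothing matches.
function firstHref(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; suppress parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

// Stand-ins for the fetched pages (assumed markup). In practice each string
// would come from file_get_contents() on an absolute URL.
$pageA = '<table><tr><th>Rank</th></tr>'
       . '<tr><td><a class="tag" href="B.php?id=7">Client 1</a></td></tr>'
       . '<tr><td><a class="tag" href="B.php?id=9">Client 2</a></td></tr></table>';
$pageB = '<div><a class="external" href="http://example.org/C.php?u=7">Profile</a></div>';

// Step 1: the first ranked client's link on A.
$linkToB = firstHref($pageA, '//tr//a[@class="tag"]');
// Step 2: the external profile link on that client's page B.
$linkToC = firstHref($pageB, '//a[@class="external"]');

echo $linkToB, "\n", $linkToC, "\n";
```

Because the second lookup only ever runs on the page reached from the first row of A, the external link always belongs to the rank 1 client.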

 

I hope this is more precise, thanks for your time.

A.php has the logic to list clients. You control this page so you can find that logic and replicate it in X.php in order to get the first client in that list.

B.php has the logic to display information. You control this page so you can find that logic and replicate it in X.php in order to get that external link.

I don't know if you control C.php. If so then I'm sure you can guess what I would say.

 

X.php doesn't have to do any scraping from any pages that you control because you can just copy the logic driving each page.

 

So as far as I'm concerned the only unanswered issue is with C.php...

Ok, I see what you're saying, but after looking into the main HTML and PHP files, not only are they extremely confusing, but there is a database being read from, a common.php, and a bunch of other includes which all seem to require each other.

 

So, I've essentially been trying to replicate the entire site in a sub directory for testing, trying to fix error after error, and it seems to be heavily cumbersome when all that's needed is a name and an avatar. I guess I still don't know if it's just a matter of me being inexperienced with php, or if it really would be better to have some kind of method of traversing through a few pages and grabbing what I need.
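
For the last step, pulling just a name and an avatar out of the external page and printing them on the greeting page, a minimal sketch might look like the following. The selectors (an h1 with class "name", an img with class "avatar") and the function name extractProfile are assumptions, since the external page's markup was not posted.

```php
<?php
// Hypothetical: pull a display name and avatar URL out of a profile page.
// The class names used in the queries are assumed, not taken from a real page.
function extractProfile(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // tolerate invalid markup
    $xpath = new DOMXPath($doc);

    $name = $xpath->query('//h1[@class="name"]')->item(0);
    $img  = $xpath->query('//img[@class="avatar"]')->item(0);

    return [
        'name'   => $name ? trim($name->textContent) : null,
        'avatar' => $img ? $img->getAttribute('src') : null,
    ];
}

// Stand-in for the fetched external page (assumed markup).
$pageC = '<div><h1 class="name"> Jane Doe </h1>'
       . '<img class="avatar" src="http://example.org/img/7.png"></div>';

$profile = extractProfile($pageC);

// Escape scraped values before printing them into the greeting page.
if ($profile['name'] !== null && $profile['avatar'] !== null) {
    echo '<p>Welcome, ' . htmlspecialchars($profile['name']) . '</p>', "\n";
    echo '<img src="' . htmlspecialchars($profile['avatar']) . '" alt="avatar">', "\n";
}
```

Since the external domain is outside your control, its markup can change without notice, so the null checks matter more here than on your own pages.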

Archived

This topic is now archived and is closed to further replies.
