Hi all,

I've been pondering this one for a while.

I need to make a script that can visit a website, pull the HTML file, extract some information, and then visit the next page.

E.g.:

Visit page.php?id=1

Extract the data between <EM> and </EM>

Write it to a file on the server

Then visit page.php?id=2

And repeat the process.

Does anybody have an idea?

https://forums.phpfreaks.com/topic/141341-extract-info-from-website/

Hopefully this will help some:

 

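<?php
$subject = "abcdef<em>huh</em>uighiug";
$pattern = '/em>/';
// PREG_OFFSET_CAPTURE stores each match as array(text, byte offset)
preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
// the wanted text starts right after the first "em>" (3 characters long)
$start = $matches[0][0][1] + 3;
// and ends 2 characters ("</") before the second "em>"
$length = $matches[0][1][1] - 2 - $start;
echo substr($subject, $start, $length); // prints huh
?>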
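
For comparison, a capture group avoids the offset arithmetic altogether. This is only a minimal sketch: extractEm is a made-up helper name, and the i and s modifiers are an assumption that the real page may use <EM> in upper case or spread the text over several lines.

```php
<?php
// Hypothetical helper: return the text inside every <em>...</em> pair.
function extractEm(string $html): array
{
    // i: match <EM> as well as <em>; s: let . cross line breaks.
    preg_match_all('~<em>(.*?)</em>~is', $html, $matches);
    return $matches[1]; // group 1 holds only the text between the tags
}

print_r(extractEm("abcdef<em>huh</em>uig<EM>wat</EM>hiug"));
```

The function returns an empty array when the page has no matching tags, so the caller can test for that with empty().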

Not tested, but here's the idea:

 

<?php

$url = "http://www.somesite.com/page.php?id="; // site to scrape
$page = 1; // page to start at
$prevContent = null; // previous page's HTML (nothing fetched yet)
$matchedPages = array(); // results, keyed by page number

// begin loop
while (true) {
   // get contents of target page
   $content = file_get_contents($url . $page);
   // stop if the request failed or the current page is the same as the prev page
   if ($content === false || $content === $prevContent) {
      break;
   }

   // grab everything between all em tags
   // (i = also match <EM>, s = let . span line breaks)
   preg_match_all('~<em>(.*?)</em>~is', $content, $matches);
   // assign results to an array of $page position
   $matchedPages[$page] = $matches;

   // assign current page as prev page for next loop iteration
   $prevContent = $content;
   // inc page
   $page++;
} // end while

// echo out results
echo "<pre>"; print_r($matchedPages); echo "</pre>";
?>

 

CAUTION: This can easily turn into an infinite loop! The idea is that you keep incrementing $page and detect that there are no more pages to scrape by comparing the current page against the previous one. But if the page contains dynamic content that changes on every load (the current time, a random quote, ad rotations, etc.), this will not work, because every response is technically unique. So this code is a starting block. Your task will be to look at the 'page not found' or 'error page', or whatever page the site falls back to when id=xx doesn't exist, and find an identifier on it that doesn't change. You will probably have to preg_match for it.
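
A rough sketch of that idea follows, with a hard page cap as a second safety net and the results appended to a file as the original question asked. Everything named here is an assumption for illustration: isMissingPage and scrapePages are made-up helpers, 'Page not found' stands in for whatever fixed text the real error page contains, and the 500-page cap is arbitrary.

```php
<?php
// Hypothetical helper: true when the HTML looks like the site's error page.
// $marker is an assumption; pick a string that only the real error page contains.
function isMissingPage(string $html, string $marker): bool
{
    return stripos($html, $marker) !== false;
}

// Hypothetical scraper. $fetch is any callable that takes a URL and returns
// the HTML or false, so 'file_get_contents' can be passed in directly.
function scrapePages(callable $fetch, string $url, string $marker, string $outFile, int $maxPages = 500): int
{
    $saved = 0;
    for ($page = 1; $page <= $maxPages; $page++) { // hard cap: no infinite loop
        $html = $fetch($url . $page);
        if ($html === false || isMissingPage($html, $marker)) {
            break; // request failed or we ran past the last id
        }
        preg_match_all('~<em>(.*?)</em>~is', $html, $matches);
        foreach ($matches[1] as $text) {
            // Append one line per match: page number, tab, extracted text.
            file_put_contents($outFile, $page . "\t" . trim($text) . "\n", FILE_APPEND);
            $saved++;
        }
    }
    return $saved; // number of lines written
}

// Usage (URL, marker and file name are placeholders):
// scrapePages('file_get_contents', 'http://www.somesite.com/page.php?id=', 'Page not found', 'em.txt');
```

Passing the fetcher in as a callable is a design choice that lets you try the loop against canned HTML before pointing it at a live site.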

Hi Crayon,

Thanks for posting that, but the only output I get is:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                )

            [1] => Array
                (
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                )

            [1] => Array
                (
                )

        )

)

 

...and...? There are a million reasons that could happen. You could have forgotten to put the correct URL in $url. You could have made a typo when putting the URL in $url. You could be trying to scrape a page that doesn't have <em>...</em> tags. Your server could be set up to not even allow file_get_contents on URLs. In other words, be more specific when posting a problem. You can start by posting the URL you are trying to scrape and how you specifically integrated that code into your script, because I'm not psychic.
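
Several of those possibilities can be checked from code before posting. The following is only a sketch under stated assumptions: diagnoseScrape is a made-up helper, the messages are illustrative, and it only covers a failed request, a page with no em tag at all, and em tags that a plain lowercase <em>...</em> pattern would miss.

```php
<?php
// Hypothetical checker: returns a list of human-readable problems.
// $html is whatever file_get_contents() returned for the page (string or false).
function diagnoseScrape($html): array
{
    if ($html === false) {
        return array('request failed: check the URL and the allow_url_fopen setting');
    }
    $problems = array();
    if (!preg_match('~<em[\s>]~i', $html)) {
        $problems[] = 'no <em> tag anywhere in the page';
    } elseif (!preg_match('~<em>.*?</em>~s', $html)) {
        // The tag exists, but a case-sensitive, attribute-free pattern misses it.
        $problems[] = 'em tags found, but not as lowercase <em>...</em> (upper case or attributes?)';
    }
    return $problems;
}

// allow_url_fopen must be on for file_get_contents() to open http:// URLs.
var_dump((bool) ini_get('allow_url_fopen'));
print_r(diagnoseScrape('x<EM>y</EM>z'));
```

An empty array from diagnoseScrape means none of these particular checks found a problem, not that the scrape is guaranteed to work.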

Hi, sorry.

I tested the URL and it loads correctly.

I tested file_get_contents on its own and it output the page.

It just doesn't seem to be able to read the tag in the page.

 

 

Fixed it.

I used explode() to get everything from the <em> to the end of the page into an array.

Then I used explode() again to get just the data I need on its own.

It's a bit of a resource hog, really. But it works :D

Thank you, Crayon.
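
For anyone landing here later, the two-step explode() approach described above might look roughly like this. It is a sketch, not the poster's actual code: betweenTags is a made-up helper, and the upper-case tags in the usage line are an assumption based on the <EM> in the original question. Note that explode() is case-sensitive, so the tags passed in must match the page exactly.

```php
<?php
// Hypothetical helper reproducing the two-step explode() idea.
// Returns the text between the first $open and the next $close, or null.
function betweenTags(string $html, string $open, string $close): ?string
{
    // Limit 2: split once only, giving [before tag, rest of page].
    $afterOpen = explode($open, $html, 2);
    if (count($afterOpen) < 2) {
        return null; // opening tag not present
    }
    // Split the remainder once, giving [wanted text, rest of page].
    $beforeClose = explode($close, $afterOpen[1], 2);
    if (count($beforeClose) < 2) {
        return null; // closing tag not present
    }
    return $beforeClose[0];
}

echo betweenTags('abc<EM>huh</EM>def', '<EM>', '</EM>'); // prints huh
```

The limit argument of 2 keeps each explode() from splitting the whole page into pieces, which should take some of the edge off the resource cost.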
