Extract Info from website.

zerofool2005 · January 18, 2009

Hi all,

I've been pondering this one for a while.

I need to make a script that can visit a website, pull the HTML, extract some information, and then move on to the next page.

E.G:

page.php?id=1

Extract the data between <em> and </em>

Write it to a file on the server

Then visit

page.php?id=2

And repeat the process.

Anybody have an idea?

dawsba · January 18, 2009

hopefully this will help some

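<?php
$subject = "abcdef<em>huh</em>uighiug";
$pattern = '/em>/';
// find every "em>" along with its byte offset in $subject
preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
/*foreach($matches[0] as $k=>$v){echo $matches[0][$k][1];}*/
$start = $matches[0][0][1] + 3; // first character after "<em>"
$end = $matches[0][1][1] - 2;   // position of the "<" in "</em>"
echo substr($subject, $start, $end - $start);
?>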

.josh · January 18, 2009

Not tested but, here's the idea:

<?php

$url = "http://www.somesite.com/page.php?id="; // site to scrape
$page = 1; // page to start at
$exists = true; // loop control
$prevContent = null; // nothing fetched yet
$matchedPages = array(); // results, keyed by page number

// begin loop
while ($exists == true) {
   // get contents of target page
   $content = file_get_contents($url . $page);
   // stop if the request failed or the current page is same as prev page
   $exists = ($content !== false && $content !== $prevContent);
   if (!$exists) {
      break; // don't record the failed or repeated page
   }

   // grab everything between all em tags
   preg_match_all('~<em>(.*?)</em>~', $content, $matches);
   // assign results to an array of $page position
   $matchedPages[$page] = $matches;

   // assign current page as prev page for next loop iteration
   $prevContent = $content;
   // inc page
   $page++;
} // end while

// echo out results
echo "<pre>"; print_r($matchedPages); echo "</pre>";
?>

CAUTION: This could easily turn into an infinite loop! The idea is that you keep incrementing $page and work out that there are no more pages to scrape by comparing the current page's content against the previous one. But if there is dynamic content on the page that changes on each page load (like the current time, some random quote, ad rotations, etc.), this will not work, because every response would technically be unique. So this code is a starting block. Your task will be to look at the 'page not found' or 'error page' or whatever page is served by default when id=xx doesn't exist, and find an identifier that doesn't change. You will probably have to preg_match for it.
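The stop condition described above could be sketched roughly like this. It is a minimal, untested-against-any-real-site sketch: the function name scrapePages, the 'Page not found' marker text, the $maxPages cap, and the injectable $fetch callable are all assumptions made for illustration, and the fake pages exist only so the loop can run offline.

```php
<?php
/**
 * Hypothetical helper: walks page ids until a fixed "not found" marker appears.
 * $fetch is any callable that takes a URL and returns HTML or false.
 */
function scrapePages(string $baseUrl, callable $fetch, string $notFoundMarker, int $maxPages = 100): array
{
    $results = [];
    // the cap guarantees the loop ends even if the marker never shows up
    for ($page = 1; $page <= $maxPages; $page++) {
        $content = $fetch($baseUrl . $page);
        // stop on a failed request or when the error page's marker is present
        if ($content === false || strpos($content, $notFoundMarker) !== false) {
            break;
        }
        // s modifier lets . match newlines, in case the em content spans lines
        preg_match_all('~<em>(.*?)</em>~s', $content, $matches);
        $results[$page] = $matches[1];
    }
    return $results;
}

// Stand-in for file_get_contents, using made-up pages (assumption)
$fakeSite = [
    'page.php?id=1' => '<p><em>first</em></p>',
    'page.php?id=2' => "<p><em>second\nline</em></p>",
];
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? '<h1>Page not found</h1>';
};

$found = scrapePages('page.php?id=', $fetch, 'Page not found');
print_r($found);
```

Against a real site, passing 'file_get_contents' as $fetch should work, and each page's results could be appended to a file on the server with file_put_contents() and the FILE_APPEND flag.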

zerofool2005 · January 18, 2009

Hi Crayon.

Thanks for posting that, but the only output I get is:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                )
            [1] => Array
                (
                )
        )
    [1] => Array
        (
            [0] => Array
                (
                )
            [1] => Array
                (
                )
        )
)

.josh · January 18, 2009

...and...? There are a million reasons that could happen. You could have forgotten to put the correct url in $url. You could have made a typo when putting the url in $url. You could be trying to scrape a page that doesn't have <em>...</em> tags. Your server could be set up to not even allow file_get_contents. In other words, be more specific when posting a problem. You can start by posting the url you are trying to scrape and how you specifically integrated that code into your script, because I'm not psychic.
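A few of the possibilities listed above can be ruled out with a quick check before touching the regex. This is only a sketch: diagnoseScrape is a hypothetical helper name, and the default '<em>' tag is an assumption carried over from the example code.

```php
<?php
// Hypothetical checker: reports why a scrape of $content might come back empty.
function diagnoseScrape($content, string $openTag = '<em>'): string
{
    if ($content === false) {
        return 'fetch failed'; // bad url, or remote access not allowed
    }
    if (stripos($content, $openTag) === false) {
        return 'tag not present'; // the page has no such tag at all
    }
    return 'ok';
}

// allow_url_fopen must be enabled for file_get_contents to open URLs
echo 'allow_url_fopen: ', ini_get('allow_url_fopen') ? 'on' : 'off', "\n";
echo diagnoseScrape('<p>no emphasis here</p>'), "\n";
```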

zerofool2005 · January 18, 2009

Hi sorry,

I tested the url and that outputs correctly.

I tested with just file_get_contents on its own and it output the page.

It just seems unable to read the tag in the page.

Fixed it.

I used explode() to get the data from the opening tag to the end of the page into an array.

Then explode() again to get just the data I need on its own.

Bit of a resource hog really. But it works.

Thank you crayon
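For anyone following along, the two-step explode() approach described above might look something like this. The actual code was not posted, so this is a guess at the idea: the helper name between() and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: text between the first $open and the next $close, or null.
function between(string $html, string $open, string $close): ?string
{
    // first explode: keep everything from the opening tag to the end of the page
    $parts = explode($open, $html, 2);
    if (count($parts) < 2) {
        return null; // opening tag not found
    }
    // second explode: cut that remainder off at the closing tag
    $inner = explode($close, $parts[1], 2);
    if (count($inner) < 2) {
        return null; // closing tag not found
    }
    return $inner[0];
}

$html = "<p>intro</p><em>line one\nline two</em><p>rest</p>";
var_dump(between($html, '<em>', '</em>'));
```

Passing a limit of 2 to explode() keeps each split to two pieces, which avoids building a large array when the page contains the tag many times.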

.josh · January 18, 2009

ah you know what, in the code I posted you can try adding an s modifier to the preg_match_all:

preg_match_all('~<em>(.*?)</em>~s', $content, $matches);

problem might be that your em content spans across multiple lines...
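A small check of what the s modifier changes, using a made-up string where the em content contains a newline (the sample text is an assumption, not taken from the page being scraped):

```php
<?php
$content = "<em>spans\ntwo lines</em>";

// without s, . does not match the newline, so the pattern cannot reach </em>
$without = preg_match_all('~<em>(.*?)</em>~', $content, $m1);
// with s, . matches any character including newlines
$with = preg_match_all('~<em>(.*?)</em>~s', $content, $m2);

echo "without s: $without match(es)\n";
echo "with s: $with match(es)\n";
```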
