zerofool2005 Posted January 18, 2009

Hi all, I've been pondering this one for a while, but I need to make a script that can visit a website, pull the HTML file, and extract information, then visit the next page. E.g.:

page.php?id=1
Extract the data between <EM> and </EM>
Write it to a file on the server
Then visit page.php?id=2 and repeat the process.

Anybody have an idea?

Link to topic: https://forums.phpfreaks.com/topic/141341-extract-info-from-website/
dawsba Posted January 18, 2009

Hopefully this will help some:

<?php
$subject = "abcdef<em>huh</em>uighiug";
$pattern = '/em>/';
// Record every "em>" together with its byte offset in $subject.
preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
/*foreach($matches[0] as $k=>$v){echo $matches[0][$k][1];}*/
// The first match is the "em>" of "<em>", the second is the "em>" of "</em>".
$start  = $matches[0][0][1] + 3;          // skip past "em>"
$length = $matches[0][1][1] - 2 - $start; // stop just before "</"
echo substr($subject, $start, $length);   // prints "huh"
?>
.josh Posted January 18, 2009

Not tested, but here's the idea:

<?php
$url = "http://www.somesite.com/page.php?id="; // site to scrape
$page = 1;               // page to start at
$exists = true;          // loop control
$prevContent = null;     // nothing fetched yet
$matchedPages = array(); // results, keyed by page number

// begin loop
while ($exists == true) {
  // get contents of target page
  $content = file_get_contents($url . $page);
  // check if current page is same as prev page
  $exists = ($content == $prevContent) ? false : true;
  // grab everything between all em tags
  preg_match_all('~<em>(.*?)</em>~', $content, $matches);
  // assign results to an array of $page position
  $matchedPages[$page] = $matches;
  // assign current page as prev page for next loop iteration
  $prevContent = $content;
  // inc page
  $page++;
} // end while

// echo out results
echo "<pre>";
print_r($matchedPages);
echo "</pre>";
?>

CAUTION: This could easily become an infinite loop! The idea is that you keep incrementing $page and detect that there are no more pages to scrape by comparing the current page against the previous one. But if the page has dynamic content that changes on each load (current time, a random quote, ad rotations, etc.), this will not work, because every response would technically be unique.

So this code is a starting block. Your task will be to look at the 'page not found' or 'error page' or whatever page is served by default when id=xx doesn't exist, and find an identifier that doesn't change. You will probably have to preg_match for it.
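One way the "find an identifier that doesn't change" advice could look in practice is sketched here. It is a minimal illustration, not the poster's code: the helper name `scrapePages`, the injected `$fetch` callback, the `$maxPages` cap, and the marker text `Page not found` are all assumptions, and the real marker has to be read off the target site's own error page.

```php
<?php
// Sketch only: scrapePages(), $fetch, $errorPattern and $maxPages are
// illustrative names, not part of the code posted in this thread.
function scrapePages(callable $fetch, string $errorPattern, int $maxPages = 1000): array
{
    $matchedPages = array();
    // The hard cap guarantees the loop ends even if the marker never appears.
    for ($page = 1; $page <= $maxPages; $page++) {
        $content = $fetch($page);
        // Stop on a failed request or when the stable error marker is found.
        if ($content === false || preg_match($errorPattern, $content) === 1) {
            break;
        }
        preg_match_all('~<em>(.*?)</em>~', $content, $matches);
        $matchedPages[$page] = $matches[1]; // captured text only
    }
    return $matchedPages;
}

// Possible usage against a live site (URL and marker are placeholders):
// $results = scrapePages(
//     function ($page) {
//         return @file_get_contents("http://www.somesite.com/page.php?id=" . $page);
//     },
//     '~Page not found~'
// );
```

Passing the fetch step in as a callback keeps the stop condition separate from the HTTP call, so the loop can be tried against canned strings before it is pointed at a real site.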
zerofool2005 (Author) Posted January 18, 2009

Hi Crayon. Thanks for posting that, but the only output I get is:

Array ( [0] => Array ( [0] => Array ( ) [1] => Array ( ) ) [1] => Array ( [0] => Array ( ) [1] => Array ( ) ) )
.josh Posted January 18, 2009

...and...? There's a million reasons that could happen. You could have forgotten to put the correct URL in $url. You could have made a typo when putting the URL in $url. You could be trying to scrape a page that doesn't have <em>...</em> tags. Your server could be set up to not even allow file_get_contents.

In other words, be more specific when posting a problem. You can start by posting the URL you are trying to scrape and how you specifically integrated that code into your script, because I'm not psychic.
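A few of the failure causes listed above can be told apart with a quick check before posting. The following is only a rough sketch: `describeFetchProblem` is a hypothetical helper, and it assumes the "server not allowing file_get_contents" case refers to PHP's `allow_url_fopen` setting, which controls whether file functions may open URLs.

```php
<?php
// Sketch only: describeFetchProblem() is a hypothetical helper name.
// Pass it the return value of file_get_contents($url . $page).
function describeFetchProblem($content): string
{
    if ($content === false) {
        // ini_get() returns "1" when URL access is enabled, "" or "0" otherwise.
        return ini_get('allow_url_fopen')
            ? 'request failed: check the URL for typos'
            : 'allow_url_fopen is disabled on this server';
    }
    // stripos() is case-insensitive, so <EM> counts as well as <em>.
    if (stripos($content, '<em>') === false) {
        return 'page fetched, but it contains no <em> tag';
    }
    return 'page fetched and it contains an <em> tag';
}
```

The tag check looks for the literal text `<em>`, so a tag written with attributes would be reported as missing.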
zerofool2005 (Author) Posted January 18, 2009

Hi, sorry. I tested the URL and that outputs correctly. I tested with just file_get_contents on its own and it output the page. It just seems unable to read the tag in the page.

Fixed it. I used explode() to get the data from the <em> to the end of the page into the array, then explode() again to get just the data I need on its own. Bit of a resource hog really, but it works.

Thank you, Crayon.
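The actual explode() code was not posted, so the following is only a guess at the shape of that approach. `extractBetween` is a hypothetical helper that returns the first piece of text between an opening and a closing tag.

```php
<?php
// Sketch only: extractBetween() is a hypothetical helper, not the poster's code.
function extractBetween(string $content, string $open, string $close): ?string
{
    // Split once at the opening tag; element [1] is everything after it.
    $afterOpen = explode($open, $content, 2);
    if (count($afterOpen) < 2) {
        return null; // opening tag not found
    }
    // Split the remainder once at the closing tag; element [0] is the data.
    $beforeClose = explode($close, $afterOpen[1], 2);
    if (count($beforeClose) < 2) {
        return null; // closing tag not found
    }
    return $beforeClose[0];
}

// Example: extractBetween("abcdef<em>huh</em>uighiug", '<em>', '</em>')
// returns "huh".
```

Because explode() compares plain strings, line breaks inside the tag are not a problem, but the tags must be passed in exactly the case the page uses (`<EM>` versus `<em>`). The limit of 2 keeps each split from scanning for further delimiters it does not need.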
.josh Posted January 18, 2009

Ah, you know what, in the code I posted you can try adding an s modifier to the preg_match_all:

preg_match_all('~<em>(.*?)</em>~s', $content, $matches);

The problem might be that your em content spans multiple lines...
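For reference, here is a small sketch of the s modifier in use, combined with the "write it to a file" step from the original question. The helper names `extractEm` and `appendResults` are made up for illustration, and the extra i modifier is an assumption based on the first post writing the tag as `<EM>`; drop it if the page uses lowercase tags.

```php
<?php
// Sketch only: extractEm() and appendResults() are hypothetical helpers.
function extractEm(string $content): array
{
    // s: "." also matches newlines. i: match <EM> as well as <em>.
    preg_match_all('~<em>(.*?)</em>~is', $content, $matches);
    return $matches[1]; // captured text only, without the tags
}

function appendResults(string $file, array $items): void
{
    // One extracted item per line, appended so earlier pages are kept.
    foreach ($items as $item) {
        file_put_contents($file, trim($item) . PHP_EOL, FILE_APPEND);
    }
}

// Possible usage inside the scraping loop (file name is a placeholder):
// appendResults('results.txt', extractEm($content));
```

Without the s modifier, `.` stops at a newline, so an `<em>` block that spans several lines produces no match at all, which would be consistent with the empty arrays reported earlier in the thread.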