sanchez77 Posted July 28, 2011

I used to use file_get_contents to display a page from another site, but my ISP changed and I had to switch to a getFile() function. It grabs the file and displays it on my page, which is great, but does anyone know a way to trim the header and footer of the file and display just the meat in the middle?

My code:

<?php
function getFile($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    $file = curl_exec($ch);
    curl_close($ch);
    return $file;
}

$filename = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");
echo "" . $filename . "<br>";

A summary of the page result:

--Begin--
HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html; charset=ISO-8859-1
Date: Thu, 28 Jul 2011 15:38:54 GMT
Content-Length: 1845
Connection: keep-alive

FZUS51 KPHI 281408 CWFPHI COASTAL WATERS FORECAST

--footer--
WINDS AND SEAS HIGHER IN AND NEAR TSTMS.
--------------------------------------------------------------------------------
National Weather Service
Generated 1452 UTC, Thursday, Jul 28, 2011
Document URL http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt

So I want to display the data from the page after "keep-alive", and I want to stop it from displaying the "----" line and everything below it. Any ideas? Something to point me in the direction I need to go?

Thanks,
sanchez
mikesta707 Posted July 28, 2011

You have two options. You can use a regular expression to grab the string you want, or, if the header and footer are guaranteed to be the same every time, you can use the simpler str_replace or substr functions.

regex: http://php.net/manual/en/function.preg-match.php
regex replace: http://www.php.net/manual/en/function.preg-replace.php
substr: http://php.net/manual/en/function.substr.php
str_replace: http://php.net/manual/en/function.str-replace.php

If you need an example of how to use one of these functions, just ask.
silkfire Posted July 28, 2011

Use regex, mate!

$text = preg_replace('#.*?Connection: keep-alive(.*?)\s*---.*#s', '$1', $filename);
xyph Posted July 28, 2011

IMO, using the string functions is faster and simpler in this case.

<?php
$data = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");

$start = strpos( $data, '<PRE>' );
$length = strpos( $data, '</PRE>' ) - $start + 6; // Add 6 to include the trailing </PRE>

echo substr( $data, $start, $length );

function getFile($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    $file = curl_exec($ch);
    curl_close($ch);
    return $file;
}
?>
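A minimal sketch of the same strpos/substr idea with guards added, for the case where a marker is missing (strpos returns false, which would otherwise silently become position 0). The helper name extractBetween and the sample string are assumptions made for illustration, not part of the thread, and unlike the snippet above this version returns the text without the surrounding tags. Separately, the HTTP headers show up in the output only because CURLOPT_HEADER is set to 1; setting it to 0 should leave just the response body.

```php
<?php
// Hypothetical helper (not from the thread): returns the text between the
// first $open marker and the next $close marker, or null if either is missing.
function extractBetween(string $data, string $open, string $close): ?string
{
    // stripos() is case-insensitive, so '<pre>' and '<PRE>' both match.
    $start = stripos($data, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);               // skip past the opening marker
    $end = stripos($data, $close, $start); // search only after the opening marker
    if ($end === false) {
        return null;
    }
    return trim(substr($data, $start, $end - $start));
}

// Made-up sample input, loosely modeled on the summary in the first post.
$sample = "HTTP/1.1 200 OK\r\nConnection: keep-alive\r\n\r\n"
        . "<html><body><PRE>\nFZUS51 KPHI 281408\nCOASTAL WATERS FORECAST\n</PRE>"
        . "<hr>National Weather Service</body></html>";

$forecast = extractBetween($sample, '<PRE>', '</PRE>');
echo $forecast === null ? "Markers not found\n" : $forecast . "\n";
```

With real data, the call would be extractBetween(getFile($url), '<PRE>', '</PRE>'), and a null result is the signal that the page layout has changed.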
mikesta707 Posted July 28, 2011

Assuming that the header and footer are always the same (which they appear to be), I would agree with xyph's approach.
sanchez77 Posted July 28, 2011 (Author)

Thanks a lot, xyph. This is perfect, exactly what I was looking to do. Thanks for the lesson.

Cheers,
sanchez
xyph Posted July 28, 2011

mikesta707 wrote: "assuming that the header and footer are always the same (which they appear to be) I would agree with this"

If they weren't, you'd have to use an HTML parser. And even then, if that section isn't consistent, who's to say the content will be consistent?

Make the scrape as efficient as possible - check the data and make sure it's what you want. If you get oddly formed returns, have data you can fall back on, let the user know it's old data, and set up a notification of some sort that tells you to change the scrape code.
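The check-and-fall-back advice above could be sketched roughly like this. Every name here (looksLikeForecast, forecastOrFallback, the cache file) is hypothetical, and the validity check assumes that "COASTAL WATERS FORECAST" appears in every bulletin, which the thread does not confirm; error_log() stands in for whatever notification you actually use.

```php
<?php
// All names below are hypothetical; the thread only describes the idea.

// Cheap sanity check: long enough and contains a phrase we ASSUME is in every bulletin.
function looksLikeForecast(?string $text): bool
{
    return $text !== null
        && strlen($text) > 20
        && strpos($text, 'COASTAL WATERS FORECAST') !== false;
}

// Returns [text, isStale]. Good data refreshes the cache; bad data falls back to it.
function forecastOrFallback(?string $scraped, string $cacheFile): array
{
    if (looksLikeForecast($scraped)) {
        file_put_contents($cacheFile, $scraped);
        return [$scraped, false];
    }
    // Tell the maintainer the scrape code probably needs changing.
    error_log('Forecast scrape returned unexpected content; check the scrape code.');
    // Serve the last good copy if there is one, otherwise an empty string.
    $cached = is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';
    return [$cached, true];
}
```

A caller would pass in the extracted text and a writable path, then show a "this is old data" note to the user whenever the second element comes back true.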
silkfire Posted July 28, 2011

Use whatever is more convenient for you. I prefer regex in this case because it's exact, I can create a regex really fast, and I don't have to think about string positions. If you're not that good at regex, use a method that is easier to understand.

/silkfire
xyph Posted July 28, 2011

RegEx is MUCH slower than string functions, even if the expression is quite efficient. The one you've provided is actually very inefficient, and requires a backtrack on nearly every character matched.

Your advice is pretty odd as well. The snippet I've provided is both simpler and more exact than yours (yours matches his summary, not what he's actually going to be scraping) and was probably coded faster. I don't understand how you could find your solution more elegant or simple when the thought process goes like this:

strpos
- find position of starting tag
- find position of end tag, and subtract position of starting tag to get a length
- return part of body that starts from our first position, and goes to the found length

preg_replace .*?Connection: keep-alive(.*?)\s*---.*
- Match any amount of characters as few times as possible
- Match the exact string, returning to the previous step if variances are found
- Match any amount of characters as few times as possible and store them
- Match a white space character 0 or more times followed by three dashes followed by anything 0 or more times, returning to the previous step if variances are found

PHEW! Not so exact or "really fast" when you look at it that way.

Use RegEx when it's needed. This problem does not NEED RegEx. You should code for efficiency and cleanliness, not whatever is more convenient. There are best practices while coding, and convenience is generally not taken into account.
silkfire Posted July 28, 2011

The beauty of PHP is that I can whip up some code that does what I want in seconds. A time difference of a hundredth of a second is negligible. I matched according to the sample text he provided.

Everyone programs in his or her own way. If the speed difference is negligible, then I don't think "best practices" are needed. I hate it when people come with suggestions that burden the programmer without adding noticeable speed to the execution of the script. If my solution took 5-10 seconds longer than his, I'd understand; otherwise I don't. PHP can parse a normal web page in less than a second; what takes time is retrieving it with file_get_contents or cURL.

A bad programmer is someone who doesn't produce anything, not one who doesn't follow rules that some perfectionist geek has invented.
xyph Posted July 29, 2011

Those perfectionist geeks are the reason we aren't using tubes and abaci.

Bad habits are bad habits, no matter how you justify them. Bad advice is bad advice, no matter how hard you try to spin it.

preg_match( '/(string)/', $subject, $matches ) takes nearly 5x longer than strpos( $subject, 'string' ). I'd imagine comparing something like your regular expression to using strpos would be more like 50x longer. Milliseconds or not, your solution was bad.

Then again, programming habits like this help keep me employed. A good chunk of my work involves charging clients a second time to fix efficiency and bloat issues caused by lazy programmers.

Encouraging this behavior in a place of learning is terrible. If I ever ran into a professor with your attitude, I'd drop the class and demand my money back.
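For anyone who wants to check the timing claim on their own setup, a rough micro-benchmark could look like the sketch below. The timeIt helper, the subject string, and the iteration count are arbitrary choices made for illustration; the ratio will vary with PHP version, PCRE JIT settings, and input, so treat the output as a local measurement rather than confirmation of any particular figure.

```php
<?php
// Hypothetical micro-benchmark; the subject string and iteration count are arbitrary.

// Runs $fn $iterations times and returns the elapsed wall-clock time in seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Filler text with the needle at the very end, so both searches scan the whole string.
$subject = str_repeat('lorem ipsum dolor sit amet ', 200) . 'string';
$n = 100000;

$strposTime = timeIt(function () use ($subject) {
    return strpos($subject, 'string');
}, $n);

$pregTime = timeIt(function () use ($subject) {
    return preg_match('/(string)/', $subject, $matches);
}, $n);

printf("strpos:     %.4f s\n", $strposTime);
printf("preg_match: %.4f s\n", $pregTime);
printf("ratio:      %.1fx\n", $pregTime / max($strposTime, 1e-9));
```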
silkfire Posted July 29, 2011

You're comparing this code sample with a professional project. Nvm, cba to discuss things with someone so stubborn.