Read page and trim header/footer

sanchez77 · July 28, 2011

I used to use file_get_contents() to display a page from another site, but my ISP changed and I had to switch to a cURL-based getFile() function. It grabs the file and displays it on my page, which is great, but does anyone know a way to trim the header and footer of the file and just display the meat in the middle?

My Code:

<?php

function getFile($url) { 
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
$file = curl_exec($ch);
curl_close($ch);
return $file;
}

$filename = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");
echo "" . $filename . "<br>";

A summary of page result:

--Begin--

HTTP/1.1 200 OK Server: Apache Content-Type: text/html; charset=ISO-8859-1 Date: Thu, 28 Jul 2011 15:38:54 GMT Content-Length: 1845 Connection: keep-alive

FZUS51 KPHI 281408

CWFPHI

COASTAL WATERS FORECAST

--footer--

WINDS AND SEAS HIGHER IN AND NEAR TSTMS.

--------------------------------------------------------------------------------

National Weather Service

Generated 1452 UTC, Thursday, Jul 28, 2011

Document URL http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt

So I want to display the data from the page after "keep-alive", and I want to stop it from displaying the ---- line and everything below it.

Any ideas? Something to point me in the direction I need to go?

Thanks,

sanchez

mikesta707 · July 28, 2011

You have two options. You can use a regular expression to grab the string you want, or you can use the simpler str_replace function or the substr function if the header and footer are always guaranteed to be the same every time.

regex: http://php.net/manual/en/function.preg-match.php

regex replace: http://www.php.net/manual/en/function.preg-replace.php

substr: http://php.net/manual/en/function.substr.php

str_replace: http://php.net/manual/en/function.str-replace.php

If you need an example of how to use one of these functions, just ask.
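As a small illustration of the string-function route for the header part: the status line and headers show up in the output because getFile() sets CURLOPT_HEADER to 1, so setting it to 0 should leave only the body. If the raw response is all you have, a minimal sketch could split on the first blank line. The function name stripHttpHeaders() and the sample response are made up for this example, and it assumes a single header block (no redirects or "100 Continue" responses).

```php
<?php

// Hypothetical helper: drop the HTTP header block from a raw response.
// Assumes the response starts with one header block followed by a blank line.
function stripHttpHeaders(string $raw): string
{
    // Leave the input alone if it does not look like a raw HTTP response.
    if (strncmp($raw, 'HTTP/', 5) !== 0) {
        return $raw;
    }

    // Headers end at the first empty line (CRLF CRLF, or LF LF on lenient servers).
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);

    return count($parts) === 2 ? $parts[1] : $raw;
}

// Made-up sample modelled on the summary in the question.
$raw = "HTTP/1.1 200 OK\r\nServer: Apache\r\nConnection: keep-alive\r\n\r\nFZUS51 KPHI 281408\n";
echo stripHttpHeaders($raw);
```

This only deals with the HTTP headers; the footer still needs one of the approaches discussed in the replies.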

silkfire · July 28, 2011

Use regex, mate!

$text = preg_replace('#.*?Connection: keep-alive(.*?)\s*---.*#s', '$1', $filename);
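For reference, here is a sketch of how that pattern behaves on text shaped like the summary in the question. The sample string and the wrapper name trimWithRegex() are invented for the example; the real page may differ from the summary, and the pattern relies on both "Connection: keep-alive" being the last header and no "---" appearing inside the forecast.

```php
<?php

// Hypothetical wrapper around the pattern from the post above.
function trimWithRegex(string $raw): string
{
    // The s modifier lets "." match newlines, so the pattern spans the whole response.
    $result = preg_replace('#.*?Connection: keep-alive(.*?)\s*---.*#s', '$1', $raw);

    // preg_replace() returns null on error; trim() removes the blank line left after the headers.
    return $result === null ? '' : trim($result);
}

// Made-up sample modelled on the summary in the question.
$sample = "HTTP/1.1 200 OK\r\nServer: Apache\r\nConnection: keep-alive\r\n\r\n"
    . "FZUS51 KPHI 281408\nCWFPHI\nCOASTAL WATERS FORECAST\n\n"
    . "WINDS AND SEAS HIGHER IN AND NEAR TSTMS.\n\n"
    . "--------------------\n\nNational Weather Service\n";

echo trimWithRegex($sample);
```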

xyph · July 28, 2011

IMO, using the string functions is faster and simpler in this case.

<?php

$data = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");

$start = strpos( $data, '<PRE>' );
$length = strpos( $data, '</PRE>' ) - $start + 6;
// Add 6 to include the trailing </PRE>

echo substr( $data, $start, $length );

function getFile($url) { 
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
$file = curl_exec($ch);
curl_close($ch);
return $file;
}

?>
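One caveat with the snippet above: strpos() returns false when a tag is missing, and it is case-sensitive. A slightly more defensive sketch of the same idea might look like this. The function name extractPreBlock() and the sample HTML are hypothetical, and it assumes the page contains a single PRE block.

```php
<?php

// Hypothetical variant of the strpos/substr approach with basic error handling.
// Returns the first <PRE>...</PRE> block (tags included), or null if it is not found.
function extractPreBlock(string $html): ?string
{
    $open  = '<PRE>';
    $close = '</PRE>';

    // stripos() is the case-insensitive counterpart of strpos().
    $start = stripos($html, $open);
    if ($start === false) {
        return null;
    }

    // Search for the closing tag only after the opening tag.
    $end = stripos($html, $close, $start);
    if ($end === false) {
        return null;
    }

    // Adding strlen($close) keeps the trailing tag, like the "+ 6" above.
    return substr($html, $start, $end - $start + strlen($close));
}

// Made-up sample page.
$page = "<html><body><h1>Header</h1><pre>\nCOASTAL WATERS FORECAST\n</pre><hr>Footer</body></html>";
var_dump(extractPreBlock($page));
```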

mikesta707 · July 28, 2011

Assuming that the header and footer are always the same (which they appear to be), I would agree with this.

sanchez77 · July 28, 2011

Thanks a lot xyph. This is perfect. exactly what I was looking to do. Thanks for the lesson.

Cheers,

sanchez

xyph · July 28, 2011

mikesta707 wrote: "assuming that the header and footer are always the same (which they appear to be) I would agree with this"

If they weren't, you'd have to use an HTML parser. And even then, if that section isn't consistent, who's to say the content will be consistent? Make the scrape as efficient as possible - check the data and make sure it's what you want. If you get oddly formed returns, have data you can fall back on, let the user know it's old data, and set up a notification of some sort that tells you to change the scrape code.
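A rough sketch of that check-and-fall-back idea, under some loud assumptions: the function name forecastWithFallback(), the cache file, and the "COASTAL WATERS FORECAST" sanity check are all invented for illustration, and error_log() stands in for whatever notification you would actually use.

```php
<?php

// Hypothetical helper: validate freshly scraped text, cache it when it looks right,
// and fall back to the last good copy (flagged as stale) when it does not.
function forecastWithFallback(?string $fresh, string $cacheFile): array
{
    // Assumed sanity check: the bulletin title must be present.
    $looksValid = $fresh !== null
        && strpos($fresh, 'COASTAL WATERS FORECAST') !== false;

    if ($looksValid) {
        file_put_contents($cacheFile, $fresh);
        return ['text' => $fresh, 'stale' => false];
    }

    // Placeholder notification telling the maintainer to look at the scrape code.
    error_log('Forecast scrape returned unexpected data; check the scrape code.');

    if (is_readable($cacheFile)) {
        return ['text' => file_get_contents($cacheFile), 'stale' => true];
    }

    return ['text' => null, 'stale' => true];
}

// Usage with a temporary cache file.
$cache = tempnam(sys_get_temp_dir(), 'wx');
$good  = forecastWithFallback("<PRE>COASTAL WATERS FORECAST ...</PRE>", $cache);
$bad   = forecastWithFallback(null, $cache);

echo $bad['stale'] ? "Showing old data\n" : "Showing fresh data\n";
```

The 'stale' flag is what lets the page tell the user the data is old.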

silkfire · July 28, 2011

Use whatever is more convenient to you. I prefer regex in this case because it's exact and I can create a regex really fast; nor do I have to think about string positions. If you're not that good at regex use a method that is easier to understand.

/silkfire

xyph · July 28, 2011

RegEx is MUCH slower than string functions, even if the expression is quite efficient. The one you've provided is actually very inefficient, and requires a backtrack on nearly every character matched.

Your advice is pretty odd as well. The snippet I've provided is both simpler and more exact than yours (yours matches his summary, and not actually what he's going to be scraping) and was probably coded faster. I don't understand how you could find your solution more elegant or simple when the thought process goes like this:

strpos

- find position of starting tag

- find position of end tag, and subtract position of starting tag to get a length

- return part of body that starts from our first position, and goes to the found length

preg_replace .*?Connection: keep-alive(.*?)\s*---.*

- Match any amount of characters as few times as possible

- Match the exact string, returning to the previous step if variances are found

- Match any amount of characters as few times as possible and store them

- Match a white space character 0 or more times followed by three dashes followed by anything 0 or more times, returning to the previous step if variances are found

Phew! Not so exact or 'really fast' when you look at it that way.

Use RegEx when it's needed. This problem does not NEED RegEx. You should code for efficiency and cleanliness, not whatever is more convenient. There are best practices while coding, and convenience is generally not taken into account.

silkfire · July 28, 2011

The beauty of PHP is that I can whip up some code that does what I want to do in seconds. A time difference of a hundredth of a second is negligible.

I matched according to the sample text he provided. Everyone programs in his/her own way. If the speed difference is negligible, then I think "best practices" aren't needed.

I hate when people come with suggestions that burden the programmer without adding noticeable speed to the execution of the script.

If my solution took 5-10 seconds longer than his, I'd understand; otherwise I don't.

PHP can parse a normal web page in less than a second; what takes time is retrieving it with file_get_contents or cURL.

A bad programmer is someone who doesn't produce anything, not one who doesn't follow some rules that some perfectionist geek has invented.

xyph · July 29, 2011

Those perfectionist geeks are the reason we aren't using tubes and abaci.

Bad habits are bad habits, no matter how you justify them. Bad advice is bad advice, no matter how hard you try to spin it.

preg_match( '/(string)/', $subject, $matches ) takes nearly 5x longer than strpos( $subject, 'string' ). I'd imagine comparing something like your regular expression to using strpos would be more like 50x longer. Milliseconds or not, your solution was bad.
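The ratios quoted above are the poster's own estimates. A rough way to measure the difference on your own setup is a micro-benchmark along these lines. The helper name timeIt(), the subject string, and the iteration count are arbitrary choices for this sketch, and results will vary with PHP version, PCRE JIT settings, and input.

```php
<?php

// Hypothetical helper: run a callable repeatedly and return the elapsed seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Arbitrary test input: filler text with the needle at the very end.
$subject    = str_repeat('lorem ipsum ', 1000) . 'string';
$iterations = 10000;

$strposTime = timeIt(function () use ($subject) {
    return strpos($subject, 'string');
}, $iterations);

$pregTime = timeIt(function () use ($subject) {
    return preg_match('/(string)/', $subject, $matches);
}, $iterations);

printf("strpos:     %.4f s\n", $strposTime);
printf("preg_match: %.4f s\n", $pregTime);
```

Whatever the ratio turns out to be, it is worth comparing it with the time the cURL request itself takes.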

Then again, programming habits like this help keep me employed. A good chunk of my work involves charging clients a second time to fix efficiency and bloat issues caused by lazy programmers. Encouraging this behavior in a place of learning is terrible. If I ever ran into a professor with your attitude, I'd drop the class and demand my money back.

silkfire · July 29, 2011

You're comparing this code sample with a professional project. Never mind, I can't be bothered to discuss things with someone so stubborn.
