Read page and trim header/footer


sanchez77


I used to use file_get_contents() to display a page from another site, but my ISP changed and I had to switch to a cURL-based getFile() function. It grabs the file and displays it on my page, which is great, but does anyone know a way to trim the header and footer of the file and just display the meat in the middle?

 

My Code:

 

<?php
function getFile($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    $file = curl_exec($ch);
    curl_close($ch);
    return $file;
}

$filename = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");
echo $filename . "<br>";

 

A summary of the page result:

 

--Begin--

HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html; charset=ISO-8859-1
Date: Thu, 28 Jul 2011 15:38:54 GMT
Content-Length: 1845
Connection: keep-alive

FZUS51 KPHI 281408

CWFPHI

COASTAL WATERS FORECAST

 

--footer--

WINDS AND SEAS HIGHER IN AND NEAR TSTMS.

 

--------------------------------------------------------------------------------

National Weather Service

Generated 1452 UTC, Thursday, Jul 28, 2011

Document URL http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt

 

 

So I want to display the data from the page after "keep-alive", and I want to stop it from displaying the ---- line and everything below it.

 

Any ideas? Something to point me in the direction I need to go?

 

Thanks,

sanchez


You have two options. You can use a regular expression to grab the string you want, or, if the header and footer are guaranteed to be the same every time, you can use the simpler str_replace() or substr() functions.

 

regex: http://php.net/manual/en/function.preg-match.php

regex replace: http://www.php.net/manual/en/function.preg-replace.php

 

substr: http://php.net/manual/en/function.substr.php

str_replace: http://php.net/manual/en/function.str-replace.php

 

If you need an example of how one of these functions is used, just ask.
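For instance, here is a quick sketch of both approaches. The $data string below is an illustrative stand-in for the fetched page, not the real NOAA response; both snippets assume the forecast text sits between <PRE> and </PRE> tags.

```php
<?php
// Illustrative sample of a fetched page (not the real NOAA response).
$data = "HTTP/1.1 200 OK\nConnection: keep-alive\n<PRE>\nCOASTAL WATERS FORECAST\n</PRE>\nNational Weather Service";

// substr approach: locate the <PRE>...</PRE> block by byte offsets.
$start = strpos($data, '<PRE>');
$end   = strpos($data, '</PRE>');
if ($start !== false && $end !== false) {
    echo substr($data, $start, $end - $start + 6); // +6 keeps the closing </PRE>
}

// regex approach: capture everything between the tags.
if (preg_match('#<PRE>(.*?)</PRE>#s', $data, $m)) {
    echo trim($m[1]);
}
```

Note the !== false checks: strpos() returns false when the needle is missing, and a plain == comparison would confuse that with position 0.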


IMO, using the string functions is faster and simpler in this case.

 

<?php
function getFile($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    $file = curl_exec($ch);
    curl_close($ch);
    return $file;
}

$data = getFile("http://weather.noaa.gov/cgi-bin/fmtbltn.pl?file=forecasts/marine/coastal/an/anz451.txt");

$start = strpos($data, '<PRE>');
$end   = strpos($data, '</PRE>');

// Guard against strpos() returning false if either tag is missing.
if ($start !== false && $end !== false) {
    // Add 6 to the length to include the trailing </PRE>
    echo substr($data, $start, $end - $start + 6);
}
?>



Assuming that the header and footer are always the same (which they appear to be), I would agree with this.



If they weren't, you'd have to use an HTML parser. And even then, if that section isn't consistent, who's to say the content will be? Make the scrape as efficient as possible: check the data and make sure it's what you want. If you get oddly formed returns, have data you can fall back on, let the user know it's old data, and set up a notification of some sort that tells you to change the scrape code :)


Use whatever is more convenient to you. I prefer regex in this case because it's exact, I can create a regex really fast, and I don't have to think about string positions. If you're not that good at regex, use a method that is easier to understand.

 

/silkfire


RegEx is MUCH slower than string functions, even if the expression is quite efficient. The one you've provided is actually very inefficient, and requires a backtrack on nearly every character matched :P

 

Your advice is pretty odd as well. The snippet I've provided is both simpler and more exact than yours (yours matches his summary, not what he's actually going to be scraping), and it was probably coded faster. I don't understand how you could find your solution more elegant or simple when the thought process goes like this:

 

strpos

- find position of starting tag

- find position of end tag, and subtract position of starting tag to get a length

- return part of body that starts from our first position, and goes to the found length

 

preg_match .*?Connection: keep-alive(.*?)\s*---.*

- Match any amount of characters as few times as possible

- Match the exact string, returning to the previous step if variances are found

- Match any amount of characters as few times as possible and store them

- Match a white space character 0 or more times followed by three dashes followed by anything 0 or more times, returning to the previous step if variances are found

 

FEWF! Not so exact or 'really fast' when you look at it that way.
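The two thought processes above can be sketched side by side. The $body string here is a cut-down stand-in for the scraped page, and the pattern is the one quoted above, trimmed to its essential parts:

```php
<?php
// Cut-down stand-in for the scraped page body.
$body = "Connection: keep-alive\nCOASTAL WATERS FORECAST\n--------\nNational Weather Service";

// strpos/substr: three cheap byte-offset operations, no backtracking.
$start  = strpos($body, 'keep-alive') + strlen('keep-alive');
$length = strpos($body, '---') - $start;
echo trim(substr($body, $start, $length)); // prints "COASTAL WATERS FORECAST"

// preg_match: a lazy quantifier that must backtrack toward the "---" anchor.
if (preg_match('/keep-alive(.*?)\s*---/s', $body, $m)) {
    echo trim($m[1]); // prints "COASTAL WATERS FORECAST"
}
```

Both produce the same result; the difference is in how much work the engine does to get there.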

 

Use RegEx when it's needed. This problem does not NEED RegEx. You should code for efficiency and cleanliness, not whatever is more convenient. There are best practices while coding, and convenience is generally not taken into account.


The beauty of PHP is that I can whip up some code that does what I want in seconds. A time difference of a hundredth of a second is negligible.

 

I matched according to the sample text he provided. Everyone programs in his/her own way. If the speed difference is negligible, then I think "best practices" aren't needed.

I hate when people come with suggestions that burden the programmer without adding noticeable speed to the execution of the script.

If my solution took 5-10 seconds longer than his, I'd understand; otherwise I don't.

PHP can parse a normal web page in less than a second; what takes time is retrieving it with file_get_contents or cURL.

A bad programmer is someone who doesn't produce anything, not one who doesn't follow some rules that some perfectionist geek has invented.


Those perfectionist geeks are the reason we aren't using tubes and abaci.

 

Bad habits are bad habits, no matter how you justify them. Bad advice is bad advice, no matter how hard you try to spin it.

 

preg_match( '/(string)/', $subject, $matches ) takes nearly 5x longer than strpos( $subject, 'string' ). I'd imagine comparing something like your regular expression to strpos would be more like 50x longer. Milliseconds or not, your solution was bad.
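Claims like this are easy to check with a micro-benchmark. The exact multiplier varies by PHP version, pattern, and input, so treat this as a sketch for measuring on your own machine rather than a definitive figure:

```php
<?php
// Build a haystack with the needle buried deep, so both functions do real work.
$subject = str_repeat('x', 10000) . 'needle' . str_repeat('y', 1000);
$iterations = 10000;

// Time strpos().
$t0 = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    strpos($subject, 'needle');
}
$strposTime = microtime(true) - $t0;

// Time an equivalent preg_match().
$t0 = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    preg_match('/(needle)/', $subject, $matches);
}
$pregTime = microtime(true) - $t0;

printf("strpos: %.4fs, preg_match: %.4fs\n", $strposTime, $pregTime);
```

Run it a few times and compare the two totals; whichever multiplier you see on your setup is the one that matters for your code.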

 

Then again, programming habits like this help keep me employed. A good chunk of my work involves charging clients a second time to fix efficiency and bloat issues caused by lazy programmers. Encouraging this behavior in a place of learning is terrible. If I ever ran into a professor with your attitude, I'd drop the class and demand my money back.

