
[SOLVED] Difficulty with fgets--do some websites block fgets from working properly?


dunkhippo33


Hi everyone,

 

I'm a beginner at PHP and would appreciate any advice.  While learning PHP, I've been trying out scripts that read the source code of webpages and save it to text files.  I have gotten various example scripts found on the web to work some of the time, but none of the scripts I've tried works properly all of the time.  One example:

 

<html>

<head><title>Test</title></head>

<body>

<?php

// Open (or create) the local file the page source will be written to.
$myFile = "test.txt";

$fh = fopen($myFile, 'w') or die("can't open file");

// Open the remote page for reading (requires allow_url_fopen to be enabled).
$fextpg = fopen("http://www.yahoo.com", "r");

if ($fextpg) {

    // Read the remote page line by line, up to 4096 bytes at a time.
    while (!feof($fextpg)) {

        $buffer = fgets($fextpg, 4096);

        echo $buffer;

        fwrite($fh, $buffer);

    }

    fclose($fextpg);

}

fclose($fh);

?>

</body>

</html>

 

In this script, I've used someone else's example code to read each line of the source code at yahoo.com and save it to test.txt.  However, if you open the page built from the copied source, you get a very different Yahoo! page from the "real" yahoo.com main page.  When I try other websites such as google.com, the script works perfectly.

 

Are there some sites that block functions such as fgets?  Or is this script not reading in every line of yahoo's source code page?

 

Any help would be much appreciated!  Thanks so much!

Best,

Elizabeth

 


Thanks, FD.

 

Unfortunately, even after increasing this number significantly, there is still a parsing problem.  In particular, pages with huge amounts of JavaScript or CSS seem to be problematic: the JavaScript and CSS simply get cut out after the page has been parsed!  I don't know why, because I thought fgets would work for any string.
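
 

For reference, a variant that reads the whole response in one call with stream_get_contents() instead of looping over fgets() (still assuming allow_url_fopen is enabled) would look something like this:

<?php
// Sketch: read the entire remote response at once, so no per-line
// buffer size is involved.
$fh = fopen("test.txt", 'w') or die("can't open file");
$fextpg = fopen("http://www.yahoo.com", "r");
if ($fextpg) {
    $contents = stream_get_contents($fextpg);
    echo $contents;
    fwrite($fh, $contents);
    fclose($fextpg);
}
fclose($fh);
?>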

 

Interestingly, though, if I save the source code of a webpage locally, parsing the saved copy works just fine regardless of how much JS or CSS the page contains.  What is the difference between parsing pages locally and remotely?

 

Thanks!

Elizabeth


I would guess the problem is that the content of Yahoo's pages depends significantly on the information it gathers about the user. For instance, I would expect that Yahoo can be accessed from a mobile phone, but it would look very different. Yahoo is probably using information such as the browser, OS, etc. to configure the page for optimal display. I don't think fetching the page with fgets gives Yahoo any of that information, so I would expect you get the cut-down version, perhaps something similar to what would be displayed on a mobile phone.

 

This also explains why it works when you try it with the locally saved copy: the HTML, CSS, and JavaScript have all already been generated in the saved page.
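
 

If the missing headers are the issue, one thing that might help without moving away from fopen()/fgets() is attaching a User-Agent through a stream context. A rough sketch (the User-Agent string here is only an example):

<?php
// Sketch: send a browser-like User-Agent via the http stream wrapper.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (compatible; TestScript/1.0)\r\n"
    )
));

$fextpg = fopen("http://www.yahoo.com", "r", false, $context);
if ($fextpg) {
    while (!feof($fextpg)) {
        echo fgets($fextpg, 4096);
    }
    fclose($fextpg);
}
?>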

 

I wonder if you would achieve better results using cURL, as you can pass along many of the things the site might want (I seem to remember you can send a user agent, for instance). Try looking into the uses of cURL in PHP.
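
 

A rough sketch of what that might look like (assuming the curl extension is installed; the User-Agent string below is only an example):

<?php
// Sketch: fetch the page with cURL and send a browser-like User-Agent.
$ch = curl_init("http://www.yahoo.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow any redirects
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1)");

$page = curl_exec($ch);
if ($page !== false) {
    file_put_contents("test.txt", $page);  // save the fetched source
}
curl_close($ch);
?>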

