
[SOLVED] Difficulty with fgets--do some websites block fgets from working properly?


dunkhippo33


Hi everyone,

 

I'm a beginner at PHP and would appreciate any advice.  While learning PHP, I've been trying out scripts that read the source code of webpages and save it to text files.  I have gotten various example scripts found on the web to work some of the time, but none of the scripts I've tried works properly all of the time.  One example:

 

<html>

<head><title>Test</title></head>

<body>

<?php

// Open (or create) the local file the page source will be written to.
$myFile = "test.txt";

$fh = fopen($myFile, 'w') or die("can't open file");

// Open the remote page for reading (requires allow_url_fopen to be enabled).
$fextpg = fopen("http://www.yahoo.com", "r");

if ($fextpg) {

    // Read the remote page line by line, up to 4096 bytes at a time.
    while (!feof($fextpg)) {

        $buffer = fgets($fextpg, 4096);

        echo $buffer;

        fwrite($fh, $buffer);

    }

    fclose($fextpg);

}

fclose($fh);

?>

</body>

</html>

 

In this script, I've used someone else's example code to read each line of the source code at yahoo.com and save it to test.txt.  However, if you open the page built from the copied source, you get a very different Yahoo! page from the "real" yahoo.com main page.  When I try other websites such as google.com, the script works perfectly.

 

Are there some sites that block functions such as fgets?  Or is this script not reading in every line of yahoo's source code page?

 

Any help would be much appreciated!  Thanks so much!

Best,

Elizabeth

 


Thanks, FD.

 

Unfortunately, even after increasing this number significantly, there is still a parsing problem.  In particular, pages with huge amounts of JavaScript or CSS seem to be problematic: the JavaScript and CSS simply get cut out after the page has been parsed!  I don't know why, because I thought fgets would work for any string.
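
 

For reference, a variant that reads the whole response in one call with stream_get_contents() instead of looping over fgets() (still assuming allow_url_fopen is enabled) would look something like this:

<?php
// Sketch: read the entire remote response at once, so no per-line
// buffer size is involved.
$fh = fopen("test.txt", 'w') or die("can't open file");
$fextpg = fopen("http://www.yahoo.com", "r");
if ($fextpg) {
    $contents = stream_get_contents($fextpg);
    echo $contents;
    fwrite($fh, $contents);
    fclose($fextpg);
}
fclose($fh);
?>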

 

Interestingly, though, if I save the source code of a webpage locally, parsing the saved copy works just fine regardless of how much JS or CSS the page contains.  What is the difference between parsing pages locally and remotely?

 

Thanks!

Elizabeth


I would guess the problem is that the content of Yahoo's pages depends significantly on the information it gathers about the user. For instance, I would expect that Yahoo can be accessed from a mobile phone, but it would look very different. Yahoo is probably using information such as the browser, OS, etc. to configure the page for optimal display. I don't think fetching the page with fgets gives Yahoo any of that information, so I would expect you get the cut-down version, perhaps something similar to what would be displayed on a mobile phone.

 

This also explains why it works when you try it with the locally saved copy: the HTML, CSS, and JavaScript have all already been generated in the saved page.
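
 

If the missing headers are the issue, one thing that might help without moving away from fopen()/fgets() is attaching a User-Agent through a stream context. A rough sketch (the User-Agent string here is only an example):

<?php
// Sketch: send a browser-like User-Agent via the http stream wrapper.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (compatible; TestScript/1.0)\r\n"
    )
));

$fextpg = fopen("http://www.yahoo.com", "r", false, $context);
if ($fextpg) {
    while (!feof($fextpg)) {
        echo fgets($fextpg, 4096);
    }
    fclose($fextpg);
}
?>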

 

I wonder if you would achieve better results using cURL, as you can pass along many of the things the site might want (I seem to remember you can send a user agent, for instance). Try looking into the uses of cURL in PHP.
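
 

A rough sketch of what that might look like (assuming the curl extension is installed; the User-Agent string below is only an example):

<?php
// Sketch: fetch the page with cURL and send a browser-like User-Agent.
$ch = curl_init("http://www.yahoo.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow any redirects
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1)");

$page = curl_exec($ch);
if ($page !== false) {
    file_put_contents("test.txt", $page);  // save the fetched source
}
curl_close($ch);
?>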

