
I have been using file_get_contents to retrieve the HTML of web pages without issue until now.  I prefer this method over curl due to the simpler syntax, but when scraping a more complex page, perhaps one with a form submission, curl is the way to go.
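To illustrate what I mean by a more complex page, here is a minimal sketch of a curl form submission (the URL and field names are made up for illustration):

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,            'https://example.com/search');  // hypothetical form target
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);   // return the body instead of echoing it
curl_setopt($curl, CURLOPT_POST,           TRUE);   // submit the form via POST
curl_setopt($curl, CURLOPT_POSTFIELDS,     http_build_query(array(
    'query' => 'Avro Lancaster',  // hypothetical form fields
    'page'  => 1,
)));
$html = curl_exec($curl);  // FALSE on failure, otherwise the response body
curl_close($curl);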

 

So I started scraping a new site today and discovered some odd behavior.  Here is a code snippet to give some context...

 

unlink($temporaryFile);  // delete any previous copy; $temporaryFile is a relative path to a local text file
$fd = fopen($temporaryFile, 'w');
$context = stream_context_create(array('http' => array('follow_location' => false)));
$contents = file_get_contents($selectedURL, FALSE, $context); // $selectedURL is the URL of the page being scraped
fwrite($fd, $contents);
fclose($fd);
header("Location: " . $redirect);
// $redirect is the intended destination page, which parses $contents from the local file, but we never reach that page
exit();

 

This script runs as part of a web page, not from the command line.  The code snippet is placed before the HTML <body> tag, and there has been no other output to the web page (no echos, etc.).

 

Normally, when file_get_contents stuffs the HTML of the page pointed to by $selectedURL into the variable $contents, there is no issue: the contents of $contents have no effect on the behavior of the scraping script or on the rendering of the web page hosting it.  HOWEVER, in this case either the actual contents, or the act of retrieving them, does affect the rendering of the scraping script/page.  You can see I write $contents to a file for post-analysis.  When the page loads I expect it to be the page specified by the URL in $redirect; in this particular case, however, the page rendered is the page being scraped (not the page doing the scraping, as expected).  How do I know that?  I examined the contents of the written file and confirmed the rendered page comes from the contents of the written file, or at least the same source.  Very odd.  I have not seen this before.
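As a diagnostic, one thing worth logging is the raw response headers: after a file_get_contents() call over HTTP, PHP populates the local variable $http_response_header automatically, which would show whether the server itself sent a redirect or anything else unexpected.  A sketch:

$contents = file_get_contents($selectedURL, FALSE, $context);
// $http_response_header is created in this scope by the http:// wrapper;
// log it to see the status line and any Location header from the server.
foreach ($http_response_header as $hdr) {
    error_log($hdr);  // e.g. "HTTP/1.1 200 OK", "Content-Type: text/html", ...
}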

 

I suspect there is JavaScript in the page retrieved by file_get_contents that is overriding the rendering of the scraping page.  There is definitely JavaScript in the page being scraped (I can see it in the captured file), but it appears complex and I am not a JavaScript expert.  Does it make sense that the JavaScript of the page being scraped is affecting the page doing the scraping?  I ask here since there are likely people with more expertise on the subject.  I have spent all day on this and I can now say I need help.

 

Clearly I don't want this unintended page redirection; I just want the HTML of the subject page, without it having any effect on the page/script doing the scraping.  Has anyone here seen this before?  If it is JavaScript in the page being scraped that is having an effect, how can I disable the JavaScript in the scraped page?

 

Any thoughts or comments would be greatly appreciated.

 

Mike Henniger

P.S.  When I use curl to get the page by substituting the following for file_get_contents...

 

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, FALSE);
curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl, CURLOPT_TIMEOUT,        30);
curl_setopt($curl, CURLOPT_MAXREDIRS,      0);
curl_setopt($curl, CURLOPT_USERAGENT,      sprintf("Mozilla/%d.0",rand(4,5)));
curl_setopt($curl, CURLOPT_URL,            $selectedURL);
curl_setopt($curl, CURLOPT_HEADER,         0);  // do not include response headers in the output
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
$contents = trim(curl_exec($curl));  // curl_exec() returns FALSE on failure
curl_close($curl);

 

...the same thing happens, including the unintended page redirection.


I've used file_get_contents() many times on pages with JavaScript and never seen that issue, so I don't believe that is the problem. Looking at the code provided I see no way that would happen - but anything is possible. Are you absolutely sure that the page you are loading is the same one you are editing? I've done it before where I copy a page/folder and edit the wrong page, wondering why I am not seeing any changes when loading the page. I'm thinking you may have had the page initially output the contents to the page to verify you were getting the right results before changing the code to write to a file. Or, perhaps, that is what you did and the page is showing the cache instead of getting new content. Try Ctrl-F5.

 

Or, if you are certain that is not the problem, provide the URL of the page you are hitting so we can see the problem for ourselves.

JavaScript has the ability to redirect. If you are scraping content from a location that has JavaScript that redirects, and then output that scraped content on your page, and the JavaScript is executed, then yes, it's going to redirect. If you don't want that to happen, in principle, you're going to have to do what dalecosp said: find and strip it from the scraped content. Assuming you want to preserve everything else except the redirect, his suggestion of replacing the opening script tag with something else is probably too broad a stroke. But we can't really give you a more definitive answer than this without seeing the content ourselves.
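For instance, a rough sketch of that stripping approach, assuming the scraped markup is in $contents (a regex will not survive every malformed page, but it handles ordinary script blocks):

// Remove whole <script>...</script> blocks, plus any stray opening tags,
// before the scraped markup is stored or echoed anywhere. The /i flag makes
// the match case-insensitive (catches <SCRIPT>), and /s lets . span newlines.
$contents = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $contents);
$contents = preg_replace('#<script\b[^>]*>#i', '', $contents);  // unclosed tags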

That is brilliant... Don't bother with shutting down the machinery nicely... just throw a wrench into the gears!!!

 

So I modified the snippet as follows...

$fd = fopen($temporaryFile, 'w');
$context = stream_context_create(array('http' => array('follow_location' => false)));
fwrite($fd, str_replace("<script", "[script", file_get_contents($selectedURL, FALSE, $context)));
fclose($fd);
header("Location: " . $redirect);
exit();

 

The undesired redirection still happens, although the resulting page is now messed up.

 

I am still investigating.  I'll post again if I discover something interesting.

 

Mike

> Are you absolutely sure that the page you are loading is the same one you are editing?
> I've done it before where I copy a page/folder and edit the wrong page, wondering why
> I am not seeing any changes when loading the page.

 

Been there, done that, as well.  I am absolutely sure I am editing and loading the correct scraping page.  Minor changes are correctly reflected.

 

> I'm thinking you may have had the page initially output the contents to the page to verify you
> were getting the right results before changing the code to write to a file.

 

Good idea as well, but the number of lines between getting the page contents and writing them to a file is minimal.  Only a few lines.  Definitely no output.  See the revised code snippet below.

 

> If you are scraping content from a location that has JavaScript that redirects, and then
> output that scraped content on your page, and the JavaScript is executed, then yes, it's
> going to redirect.  If you don't want that to happen, in principle, you're going to have
> to do what dalecosp said: find and strip it from the scraped content.

 

OK, a bit of review first.  Here is what I have developed since my original post...

 

$fd = fopen($temporaryFileFlickr, 'w');
$context = stream_context_create(array('http' => array('follow_location' => false)));
$contents = file_get_contents($selectedURL, FALSE, $context);
$contents = str_replace("<script",    "", $contents);
$contents = str_replace("al:ios:url", "", $contents);
// ...many more lines like this to strip unneeded content...
$contents = str_replace("url",        "", $contents);
fwrite($fd, $contents);
fclose($fd);
header("Location: " . $redirect);
exit();

 

These are the actual lines in my code, minus the many str_replace calls that are unnecessary to show here.  As you can see, I produce no output before the redirect line.  If any JavaScript survived the purge, I am not displaying it or writing it into the scraping page to be executed.

 

If I comment out the "fwrite" line, the redirect in the "header" line just before the "exit" line executes as expected.  HOWEVER, if I put the fwrite line in as it should be, then the page does the unwanted redirect, although with corrupted results.  THAT is weird.  Somehow, writing the scraped and mangled HTML & JavaScript contents to a local file results in the unexpected redirect.
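To narrow this down, the next diagnostic I may try is checking headers_sent() right before the redirect, since header() is silently ignored once any output has been sent.  A sketch:

// If headers_sent() reports TRUE, the Location header is ignored and the
// browser renders whatever was already emitted, which would explain seeing
// the scraped page instead of the $redirect page.
if (headers_sent($file, $line)) {
    error_log("Output already started at $file:$line - the redirect will fail");
} else {
    header("Location: " . $redirect);
}
exit();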

 

Here is what I think it comes down to... despite the "damage" I have done to the HTML & JavaScript, I don't think I have mangled it enough.  I am going to try cutting out large portions of the text I know I don't need.  I'll report my results tomorrow.

 

> Or, if you are certain that is not the problem, provide the URL of the page you are hitting
> so we can see the problem for ourselves.

 

Well, that is a dilemma.  Let me explain.  

 

I developed a website that documents the locations of displayed vintage aircraft and their histories (www.aerialvisuals.ca).  One thing that we (a small group of admins) do is contact photographers on a popular photo hosting service and ask their permission to use their photos on our site as part of documenting the histories of airframes.  For those who are willing to grant us permission, we access their photos directly from the photo hosting service by scraping the page to get the ID info and the photos themselves.  We have permission from the owners of the photos to use them.  However, discussing the methods this popular photo hosting service uses to secure its website, and essentially how to defeat that security, may cross an ethical boundary.  I would therefore hesitate to post a link to the page I am trying to scrape on an open forum like this.  Perhaps if one of you is really interested in this odd but interesting problem you could send me a private message here.

 

I will report any progress I make on this issue.

 

Mike

> That is brilliant... Don't bother with shutting down the machinery nicely... just throw a wrench into the gears!!!

Hmm ... under the cover of a married, middle-aged, convertible-driving, mild-mannered, Midwestern, conservative, church-going wanna-be record producer lies the heart of a saboteur ;)

 

At any rate, glad you've resolved this, and "hats off" for learning in advance of the next time.

 

Cheers,

 

:)
