file_get_contents and CURL not working for specific URL


RokNStoK

Hi Freaks...

  I have an online service that needs to read member URLs and mine the data. I started out using file_get_contents, until I found that it didn't work on all URLs. Then I tried cURL, but still had trouble getting content for some URLs. Along the way I've also picked up little tricks, like setting the user agent so servers think I'm a browser, and using urlencode for special characters. Now I'm using a combination of all of the above: if file_get_contents fails, I try cURL. Alas, there are still URLs that fail. The latest URL to fail is:

 

http://suicide-prevention.dummipedia.com/Suicide-prevention:Backlinks

 

I thought it might be easier for you to show me how you can read it vs. me showing you how I can't.

 

Thanks for any / all help!!!

 

Ken

Can you tell us what you have tried? I mean your code... :P

 

I believe you want to get the contents of the site. If so, try this script.

 

This version echoes the content back to the browser, but you could write it to a file instead, so that you can fetch whatever you want from it with file_get_contents.

 

<?php


$url = 'http://php.net';

// disguises the curl using fake headers and a fake user agent.
function disguise_curl($url)
{
  $curl = curl_init();

  // Setup headers - I used the same headers from Firefox version 2.0.0.6
  // below was split up because php.net said the line was too long. :/
  $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
  $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
  $header[] = "Cache-Control: max-age=0";
  $header[] = "Connection: keep-alive";
  $header[] = "Keep-Alive: 300";
  $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
  $header[] = "Accept-Language: en-us,en;q=0.5";
  $header[] = "Pragma: "; // browsers keep this blank.

  curl_setopt($curl, CURLOPT_URL, $url);
  curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
  curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
  curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
  curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
  curl_setopt($curl, CURLOPT_AUTOREFERER, true);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($curl, CURLOPT_TIMEOUT, 10);

  $html = curl_exec($curl); // execute the curl command
  curl_close($curl); // close the connection

  return $html; // and finally, return $html
}

// uses the function and displays the text off the website
$text = disguise_curl($url);
echo $text;
?> 
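
If a URL still comes back empty with the script above, it may help to find out why it fails before adding more disguise. Below is a rough sketch, not a definitive fix: it follows redirects (the script above does not set CURLOPT_FOLLOWLOCATION) and reports the HTTP status and cURL error instead of silently returning false. The function names fetch_with_diagnostics and diagnose_fetch, the result array keys, and the user agent string are all made up for this example, and the status ranges used for the diagnosis are only a guess at the common cases.

```php
<?php

// Hypothetical helper: turns an HTTP status and a cURL error string
// into a short, human-readable reason. Pure function, no network access.
function diagnose_fetch(int $status, string $curlError): string
{
    if ($curlError !== '') {
        // DNS failure, timeout, refused connection, TLS problem, etc.
        return 'transport error: ' . $curlError;
    }
    if ($status >= 200 && $status < 300) {
        return 'ok';
    }
    if ($status >= 300 && $status < 400) {
        return 'redirect not followed (HTTP ' . $status . ')';
    }
    if ($status === 403 || $status === 429) {
        return 'blocked or rate limited (HTTP ' . $status . ')';
    }
    return 'HTTP error ' . $status;
}

// Hypothetical helper: fetches a URL and returns the body together with
// the details needed to see why a request failed.
function fetch_with_diagnostics(
    string $url,
    string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)' // placeholder value
): array {
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow 301/302 redirects
        CURLOPT_MAXREDIRS      => 5,    // but not forever
        CURLOPT_AUTOREFERER    => true, // only has an effect when redirects are followed
        CURLOPT_ENCODING       => '',   // empty string: accept every encoding libcurl supports
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);

    $body   = curl_exec($curl);
    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $final  = (string) curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);
    $error  = curl_error($curl);

    return [
        'body'      => $body === false ? '' : $body,
        'status'    => $status,
        'final_url' => $final, // where the request ended up after redirects
        'reason'    => diagnose_fetch($status, $error),
    ];
}
```

To try it, call fetch_with_diagnostics() with the failing URL and print the 'reason' and 'final_url' entries of the returned array before looking at 'body'. If the reason is a transport error, changing headers will probably not help; if it is a 403, the server may be rejecting the user agent or the IP address.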

 

