Web Scraping From Protected Site


gritano

I administer a site where we bet on Danish Superliga matches every week. To avoid entering the odds for the weekly matches by hand, I am trying to grab them from this site:

http://www.bold.dk/o...ndex.php?liga=1

 

... but whatever method I try, all I get is request timeouts and absolutely no data! My code works fine on every other site I have tried. It is only this site, which seems to be protected in some way. Can any of you supply working code or suggestions for grabbing data from this site?

 

I have tried these solutions (both work on any site other than bold.dk):

 

$url = "http://www.bold.dk/odds/index.php?liga=1";

SOLUTION 1:
$source = file_get_contents($url);

SOLUTION 2:
function url_get_contents($url) {
    if (!function_exists('curl_init')) {
        die('CURL is not installed!');
    }
    $timeout = 5; // seconds allowed for establishing the connection
    $agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";
    // Extra request headers that mimic a browser.
    $header = array(
        "Cache-Control: max-age=0",
        "Connection: keep-alive",
        "Keep-Alive: 300",
        "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Accept-Language: en-us,en;q=0.5",
        "Pragma: ", // browsers keep this blank.
    );
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, false);            // do not include response headers in the output
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $output = curl_exec($ch);                           // false on failure
    $info = curl_getinfo($ch);                          // transfer details (HTTP code, timings); not used yet
    curl_close($ch);
    return $output;
}
$source = url_get_contents($url);
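
Since the symptom is a timeout with no data, it may help to look at what cURL itself reports before guessing at the cause. The following is only a sketch, assuming PHP 7 or later; the helper names `describe_curl_result` and `fetch_with_diagnostics` are made up for illustration, and `CURLOPT_TIMEOUT` (a limit on the whole transfer) is an addition that the code above does not set.

```php
<?php
// Hypothetical helper: turn cURL's error state into a readable message.
function describe_curl_result(int $errno, string $error, array $info): string
{
    if ($errno !== 0) {
        // For example, errno 28 is cURL's "operation timed out" code.
        return sprintf('cURL error %d: %s', $errno, $error);
    }
    $code = $info['http_code'] ?? 0;
    $time = $info['total_time'] ?? 0.0;
    return sprintf('HTTP %d in %.2Fs', $code, $time);
}

// Hypothetical helper: fetch a URL and return the body plus a diagnostic message.
function fetch_with_diagnostics(string $url, int $timeout = 5): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // limit on connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);    // limit on the whole transfer
    $body = curl_exec($ch);
    $message = describe_curl_result(curl_errno($ch), curl_error($ch), curl_getinfo($ch));
    // The handle is released when $ch goes out of scope.
    return array('body' => $body === false ? null : $body, 'message' => $message);
}

// Usage (performs a real request, so it is left commented out here):
// $result = fetch_with_diagnostics($url);
// echo $result['message'], "\n";
```

Whether the message shows a connection timeout, a transfer timeout, or an HTTP error code should narrow down whether the problem sits on the local web server or on the remote site.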


Ignace: Yes, it still doesn't work. Maybe some settings on my web server need to be changed.

Christian F.: No, I do not have specific permission. I did not realize that I needed it, so I guess I need to acquire it once I get this working. Thank you for making me aware. Don't wanna break any laws, of course :-)


Try this:

var_dump(ini_get('allow_url_fopen')); // string(1) "1"
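
Building on that check, here is a small sketch that reports both prerequisites at once: whether `allow_url_fopen` is on (needed for `file_get_contents()` on URLs) and whether the cURL extension is loaded. It assumes PHP 7 or later, and the function names are hypothetical.

```php
<?php
// Hypothetical helper: interpret an ini value such as "1", "On", "0" or "" as a boolean.
function ini_flag_enabled($value): bool
{
    return filter_var($value, FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: report which HTTP fetching methods this PHP setup supports.
function http_fetch_capabilities(): array
{
    return array(
        'allow_url_fopen' => ini_flag_enabled(ini_get('allow_url_fopen')),
        'curl' => function_exists('curl_init'),
    );
}

var_dump(http_fetch_capabilities());
```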

 

@CF I presume Google asked you for permission to scrape your website? All jokes aside, I scrape quite a few websites for my own personal use (Belgian job sites, even steampoweredgames.com), and I make sure my crawler does not generate any serious load on any of them (it sleeps between requests).

 

I use a customized UA string that has a unique name and also clearly communicates that it honors the robots.txt.

 

User-Agent: ItsieBitsieSpider (honors robots.txt)

 

:P

 

So if a website wants my crawler to stop crawling it, that is easy to arrange. So far, no one is blocking the crawler.

 

--

 

@OP Follow the leader (=Google). Do as they do:

 

1. Look for a robots.txt file (at http://domain.top/robots.txt), read it, and act accordingly

2. Look for a robots HTTP header (X-Robots-Tag) in the response and act accordingly

3. Look for a robots meta tag (<meta name="robots" ..) in the page and act accordingly
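
The first step could look roughly like the sketch below. It is a deliberately simplified robots.txt check, not a complete implementation: it assumes plain prefix matching (no `*` or `$` wildcards), lets the longest matching rule win (Allow wins a tie), and treats a group as ours if its User-agent token is `*` or appears in our agent name. The function name `robots_allows` is hypothetical, and PHP 7 or later is assumed.

```php
<?php
// Simplified robots.txt check (see the assumptions in the text above).
function robots_allows(string $robotsTxt, string $agent, string $path): bool
{
    $specific = array();   // rules from groups that name our agent
    $generic = array();    // rules from "User-agent: *" groups
    $hasSpecific = false;  // true once any group names our agent
    $groupAgents = array();
    $inRules = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $groupAgents = array();
                $inRules = false;
            }
            $groupAgents[] = strtolower($value);
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            $rule = array('allow' => $field === 'allow', 'prefix' => $value);
            foreach ($groupAgents as $token) {
                if ($token === '*') {
                    if ($value !== '') { // an empty value restricts nothing
                        $generic[] = $rule;
                    }
                } elseif ($token !== '' && stripos($agent, $token) !== false) {
                    $hasSpecific = true;
                    if ($value !== '') {
                        $specific[] = $rule;
                    }
                }
            }
        }
    }

    $rules = $hasSpecific ? $specific : $generic; // a specific group overrides "*"
    $allowed = true;
    $bestLength = -1;
    foreach ($rules as $rule) {
        $length = strlen($rule['prefix']);
        if (strncmp($path, $rule['prefix'], $length) !== 0) {
            continue;
        }
        if ($length > $bestLength || ($length === $bestLength && $rule['allow'])) {
            $allowed = $rule['allow'];
            $bestLength = $length;
        }
    }
    return $allowed;
}

// Usage with an inline example file (the rules below are invented for illustration):
$robots = "User-agent: *\nDisallow: /private/\n";
var_dump(robots_allows($robots, 'ItsieBitsieSpider', '/odds/index.php'));
```

For real crawling, the file would first be fetched from the site's /robots.txt, and a maintained parser library is probably a safer choice than this sketch.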

 

ALWAYS clearly communicate who you are:

 

User-Agent: A Unique Name (Extra Info)
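
With `file_get_contents()`, one way to send such a User-Agent is through a stream context. This is a minimal sketch, assuming PHP 7 or later; the bot name, the contact URL and the helper name `build_crawler_context` are placeholders made up for the example.

```php
<?php
// Hypothetical helper: build a stream context that identifies the crawler.
function build_crawler_context(string $agent, float $timeoutSeconds = 10.0)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $agent,       // sent as the User-Agent request header
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ),
    ));
}

// Placeholder identity; replace with your own unique name and contact details.
$context = build_crawler_context('ExampleOddsBot/0.1 (honors robots.txt; +http://example.com/bot)');

// Usage (performs a real request, so it is left commented out here):
// $source = file_get_contents($url, false, $context);
```

With cURL, the equivalent is the `CURLOPT_USERAGENT` option.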

 

Do NOT open many simultaneous threads. You may get the info faster, but you are then effectively "DDoS attacking" their server, which is a crime.
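
A sequential crawl with a pause between requests might be sketched like this. The function name `crawl_politely`, the two-second default delay and the injected `$fetch` callback are assumptions made for the example; PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: fetch URLs one at a time, pausing between requests.
// $fetch is any callable that takes a URL and returns the response body.
function crawl_politely(array $urls, callable $fetch, float $delaySeconds = 2.0): array
{
    $results = array();
    $first = true;
    foreach ($urls as $url) {
        if (!$first && $delaySeconds > 0) {
            usleep((int) round($delaySeconds * 1000000)); // pause before the next request
        }
        $first = false;
        $results[$url] = $fetch($url);
    }
    return $results;
}

// Usage (performs real requests, so it is left commented out here):
// $pages = crawl_politely($urls, 'file_get_contents', 5.0);
```

Passing the fetcher in as a callback keeps the throttling separate from the HTTP code, so the same loop works with either `file_get_contents()` or a cURL wrapper.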

 

Do NOT sell the info or make visitors pay for it, obviously.

 

--

 

If someone were to sue you for crawling them, you would then have sufficient proof that your intentions were honest. For example, I crawl steampoweredgames.com because their search sux*! And I am not the only one who thinks so. The only thing that works in a usable manner is their checkout.

 

* I regularly want to browse for a yet unknown game that satisfies certain criteria (a simple search on release date is impossible, and combining game modes or genres is also impossible, ugh), but Steam does not give you enough tools to do so.

