I administer a site where we bet on Danish Superliga matches every week. To avoid manually entering the odds for each week's matches, I am trying to grab them from this site:

http://www.bold.dk/o...ndex.php?liga=1

 

... but whatever method I have come up with so far, all I get is request timeouts and absolutely no data! My code works fine on every other site I try it on. It is just this site, which seems to be protected in some way. Can any of you supply me with working code or suggestions for grabbing data from this site?

 

I have tried these solutions (both work on any site other than bold.dk):

 

$url = "http://www.bold.dk/odds/index.php?liga=1";

SOLUTION 1:
$source = file_get_contents($url);

SOLUTION 2:
function url_get_contents($url) {
    if (!function_exists('curl_init')) {
        die('CURL is not installed!');
    }
    $timeout = 5; // seconds allowed for establishing the connection
    $agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";
    // Extra request headers, imitating a browser.
    $header = array();
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);          // do not include response headers in the output
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $output = curl_exec($ch);
    if ($output === false) {
        // curl_exec() returns false on failure; log the reason (e.g. a timeout).
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }
    curl_close($ch);
    return $output;
}
$source = url_get_contents($url);

Edited by gritano

First and foremost: Do you have permission to scrape their site like that? As quoted from their TOS:

You will not publish, or in any other way make available, bold.dk material that you do not hold the rights to without first obtaining permission from the owner of the material.

Ignace: Yes, it still doesn't work. It might be that some settings on my web server need to be changed.

Christian F.: No, I do not have specific permission. I did not notice that I needed it, so I guess I need to acquire it once I get this working. Thank you for making me aware. Don't wanna break any laws, of course :-)

Try this:

var_dump(ini_get('allow_url_fopen')); // string(1) "1"
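The check above can be extended into a quick report of which of the two fetch methods from the first post the server supports at all. This is only a sketch: the function name `fetch_capabilities` is made up for illustration, and a positive result does not rule out a firewall or the remote site blocking the request.

```php
<?php
// Hypothetical helper: reports whether each fetch method from the first post
// is usable on this server.
function fetch_capabilities(): array
{
    return [
        // file_get_contents() can only open URLs when allow_url_fopen is on.
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // The cURL functions exist only when the curl extension is loaded.
        'curl' => function_exists('curl_init'),
    ];
}

var_dump(fetch_capabilities());
```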

 

@CF I presume Google asked you for your permission to scrape your website? All jokes aside, I scrape quite a few websites for my own personal use (Belgian job sites, even steampoweredgames.com), and I make sure my crawler does not generate any serious load on any of them (it sleeps between requests).

 

I use a customized UA string that has a unique name and also clearly communicates that it honors the robots.txt.

 

User-Agent: ItsieBitsieSpider (honors robots.txt)

 

:P

 

So if a website wants my crawler to stop crawling it, that is easy for them to do. So far, no one is blocking the crawler.

 

--

 

@OP Follow the leader (=Google). Do as they do:

 

1. Look for a robots.txt file (at http://domain.top/robots.txt), read it, and act accordingly

2. Look for a robots HTTP header (X-Robots-Tag), act accordingly

3. Look for a robots meta tag (<meta name="robots" ..), act accordingly
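Step 1 could be sketched roughly as follows. This is a deliberately simplified check, not a full robots.txt parser: it only honors `User-agent` and `Disallow` lines with plain prefix matching, and ignores `Allow`, wildcards and `Crawl-delay`. The function name `is_path_allowed` and the sample rules are made up for illustration.

```php
<?php
// Hypothetical helper: returns true if $path may be fetched by $agent
// according to the robots.txt text in $robotsTxt (simplified rules only).
function is_path_allowed(string $robotsTxt, string $agent, string $path): bool
{
    $rules = ['*' => [], 'own' => []]; // Disallow prefixes for "*" and for our own bot
    $current = [];                     // which groups the current record applies to
    $lastWasAgent = false;
    $sawOwn = false;                   // did any record name our bot explicitly?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $current = [];         // a new record starts here
            }
            if ($value === '*') {
                $current[] = '*';
            } elseif ($value !== '' && stripos($agent, $value) !== false) {
                $current[] = 'own';
                $sawOwn = true;
            }
            $lastWasAgent = true;
        } else {
            $lastWasAgent = false;
            if ($field === 'disallow' && $value !== '') {
                foreach ($current as $group) {
                    $rules[$group][] = $value;
                }
            }
        }
    }

    // A record naming our bot takes precedence over the "*" record.
    $applicable = $sawOwn ? $rules['own'] : $rules['*'];
    foreach ($applicable as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Made-up robots.txt content, for illustration only.
$robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: ItsieBitsieSpider\nDisallow: /odds/\n";
var_dump(is_path_allowed($robots, 'ItsieBitsieSpider (honors robots.txt)', '/odds/index.php'));
```

In a real crawler the text would come from fetching the site's /robots.txt once and caching it, rather than from a hard-coded string.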

 

ALWAYS clearly communicate who you are:

 

User-Agent: A Unique Name (Extra Info)
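As a rough sketch, such a User-Agent can be sent with `file_get_contents()` through a stream context. The helper name `crawler_context_options` and the default timeout are assumptions for illustration; the bot name is the example from earlier in this post.

```php
<?php
// Hypothetical helper: builds stream context options that send a
// descriptive User-Agent and apply a read timeout (in seconds).
function crawler_context_options(string $botName, string $info, int $timeoutSeconds = 10): array
{
    return [
        'http' => [
            'user_agent' => sprintf('%s (%s)', $botName, $info),
            'timeout'    => $timeoutSeconds,
        ],
    ];
}

$options = crawler_context_options('ItsieBitsieSpider', 'honors robots.txt');
echo $options['http']['user_agent'], "\n";

// Usage (performs a real request, so it is left commented out):
// $html = file_get_contents($url, false, stream_context_create($options));
```

With cURL, the same string can be passed via `CURLOPT_USERAGENT` instead.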

 

Do NOT open many simultaneous threads. You may get the info faster, but you are then effectively "DDoS attacking" their server, which is a crime.
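One way to keep the load low is to fetch pages strictly one after another, with a pause in between. The sketch below assumes a hypothetical helper `fetch_sequentially`; the two-second default delay is an arbitrary choice, not a recommendation from any site.

```php
<?php
// Hypothetical helper: fetches URLs one at a time, pausing between requests.
// $fetch is any callable that takes a URL and returns the body (or false).
function fetch_sequentially(array $urls, callable $fetch, int $delaySeconds = 2): array
{
    $results = [];
    $remaining = count($urls);
    foreach ($urls as $url) {
        $results[$url] = $fetch($url);
        if (--$remaining > 0) {
            sleep($delaySeconds); // wait before the next request
        }
    }
    return $results;
}

// Usage (performs real requests, so it is left commented out):
// $pages = fetch_sequentially($urls, 'file_get_contents', 2);
```

Passing the fetcher in as a callable keeps the throttling separate from the transport, so the same loop works with `file_get_contents` or a cURL wrapper.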

 

Do NOT sell the info or make visitors pay for it, obviously.

 

--

 

If someone were to sue you for crawling them, you would then have sufficient proof that your intentions were honest. For example, I crawl steampoweredgames.com because their search sux*! And I am not the only one who thinks so. The only thing that works in a usable manner is their checkout.

 

* I regularly want to browse for a yet unknown game that satisfies certain criteria, but Steam does not give you enough tools to do so (a simple search on release date is impossible, and combining game modes or genres is also impossible.. ugh).

Edited by ignace