sloth456 Posted January 22, 2010

This has been really frustrating me for about 2 days now.

$url="http://www.goldpoll.com";
$agent="Firefox/3.5.7";
$referer="";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
$source=curl_exec($ch); // run the request first
curl_close($ch); // close the handle only after curl_exec()
echo $source;

As you can see, all it does is scrape http://www.goldpoll.com. I'm running the scraper locally, and every time I run it my browser redirects to localhost/public/j?kdwN+HG+V30X1eO0TNripy8= The characters at the end are random every time.

I thought that when I echo the source I might also be echoing out some redirection code, so I commented the echo out, but I still get exactly the same thing happening.

I thought maybe some setting on my server wasn't right, so I changed the URL to google.com. It seems to work fine for Google.

I thought maybe goldpoll is blocking my IP, but if I navigate there through my browser it works fine.

So I just don't get it, it's really confusing me. Does Goldpoll.com have some kind of advanced protection against scrapers?

Any help would be massively appreciated!
oni-kun Posted January 22, 2010

Your code for me (at least on my *nix server) returns nothing. Reading the headers of the site I get this:

Array
(
    [0] => HTTP/1.0 200 OK
    [Content-type] => text/html
    [Cache-Control] => no-cache, no-store, must-revalidate, max-age=0
    [Expires] => Thu, 01 Jan 1970 00:00:00 GMT
    [Connection] => close
)
1

And this as content:

<html><head><meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate, max-age=0"><meta http-equiv="Expires" content="Thu, 01 Jan 1970 00:00:00 GMT"></head><body><script language="JavaScript">var strbuf = new Array();strbuf[15]='y8';strbuf[14]='X';strbuf[13]='V';strbuf[12]='i';strbuf[11]='1';strbuf[10]='?mB';strbuf[9]='/j';strbuf[8]='=';strbuf[7]='hjl';strbuf[6]='2';strbuf[5]='kdp';strbuf[4]='k';strbuf[3]='js';strbuf[2]='19';strbuf[1]='D';strbuf[0]='Od';var arr=[9,10,3,5,13,2,4,1,14,12,0,11,6,7,15,8];var b='';for (q = 0;q<16;q++){b+=strbuf[arr[q]];}window.location.href=b;</script></body></html>

What is with the JS? That is probably the error. It's supposed to redirect them to /local? Apparently, pulling the JS and displaying it with cURL just redirects you incorrectly to localhost or whatnot. It's not advanced security, just a poor site or not very effective obfuscation.
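For anyone reading along, the script in that response only joins the strbuf fragments in the order given by arr and assigns the result to window.location.href. A minimal PHP sketch of the same assembly is below; the function name assemble_redirect is made up for illustration, and the fragments are copied from the response quoted above. The result is a site-relative path, so when the scraped HTML is echoed from a page served on localhost, the browser resolves that path against localhost rather than goldpoll.com, which would fit the redirect described in the first post.

```php
<?php
// Fragments and their ordering, copied from the JavaScript in the response above.
$strbuf = [
    0 => 'Od', 1 => 'D', 2 => '19', 3 => 'js', 4 => 'k', 5 => 'kdp',
    6 => '2', 7 => 'hjl', 8 => '=', 9 => '/j', 10 => '?mB', 11 => '1',
    12 => 'i', 13 => 'V', 14 => 'X', 15 => 'y8',
];
$arr = [9, 10, 3, 5, 13, 2, 4, 1, 14, 12, 0, 11, 6, 7, 15, 8];

// Hypothetical helper: concatenates the fragments in the given order,
// which is all the obfuscated loop in the page does.
function assemble_redirect(array $fragments, array $order): string
{
    $target = '';
    foreach ($order as $index) {
        $target .= $fragments[$index];
    }
    return $target;
}

$redirect = assemble_redirect($strbuf, $arr);
echo $redirect, PHP_EOL; // prints /j?mBjskdpV19kDXiOd12hjly8=
```

The token changes on every request, so this particular path is only an example of the shape of the redirect, not something to hard-code.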
GingerRobot Posted January 22, 2010

At first glance, the characters at the end look like some form of token. Try either not following the redirects, or allowing cookies to be set.
sloth456 (Author) Posted January 22, 2010

Aha! Thank you for all your help, guys. GingerRobot, yep, it was a cookie thing. I just turned off cookies in my browser to test and got the same result as cURL.

How do I get cURL to use cookies?
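The thread never answers this, so here is a minimal sketch of the usual approach rather than a confirmed fix for this site: PHP's cURL extension stores and replays cookies when CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE point at a writable file. The helper names cookie_curl_options and fetch_with_cookies and the cookie file path are hypothetical. Note that cURL does not execute JavaScript, so it is only an assumption that cookies alone are enough here; the token path built by the script may also need to be requested with the same cookie file before the real page is returned.

```php
<?php
// Hypothetical helper: cURL options that keep cookies between requests.
function cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Firefox/3.5.7',
        CURLOPT_COOKIEJAR      => $cookieFile, // received cookies are saved here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // saved cookies are read from here and sent with requests
    ];
}

// Hypothetical helper: fetches a URL using the options above.
// Returns the response body as a string, or false on failure.
function fetch_with_cookies(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_curl_options($url, $cookieFile));
    $source = curl_exec($ch); // run the request before closing the handle
    curl_close($ch);          // required on older PHP; a no-op on PHP 8+
    return $source;
}
```

A call would look like fetch_with_cookies('http://www.goldpoll.com', sys_get_temp_dir() . '/cookies.txt'), reusing the same cookie file for every request in the session.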
sloth456 (Author) Posted January 22, 2010

Decided to give up and scrape another site with essentially the same information.