Nightwolf629 Posted June 24, 2009

I am trying to use cURL with PHP to screen scrape a page that requires authentication. I have tried the following, and I get nothing:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "$url");
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");
    curl_exec($ch);

I do not control the page I am trying to scrape, and I can't be sure which authentication it uses. Is there some way to tell? When I log into the page with IE, it uses the generic Windows login popup. When I log into the page with Firefox, it gives me a popup that is unique to Firefox. So I am assuming that it is HTTP authentication, and CURLAUTH_ANY should pretty much cover it... or so I thought. There is also no proxy involved, so that is not a factor.

I have no clue what else I can do to access this. Can anybody help me?
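One way to tell which scheme a server asks for is to look at the `WWW-Authenticate` headers it sends with a 401 response, since those are what trigger the browser login popup. The following is a minimal sketch, not a definitive check: the function names `fetch_response_headers` and `auth_schemes_from_headers` are made up for illustration, and it assumes the server answers a HEAD request the same way it answers a GET.

```php
<?php
// Fetch only the response headers for a URL (no credentials sent).
// Hypothetical helper; $url is whatever page is being scraped.
function fetch_response_headers(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, skip the body
    curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
    $headers = curl_exec($ch);
    curl_close($ch);
    return $headers === false ? '' : $headers;
}

// List the scheme named first on each WWW-Authenticate header line, lowercased.
// Simplification: a single header carrying several challenges is not split up.
function auth_schemes_from_headers(string $rawHeaders): array
{
    $schemes = [];
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        if (preg_match('/^WWW-Authenticate:\s*([A-Za-z0-9_-]+)/i', $line, $m)) {
            $schemes[] = strtolower($m[1]);
        }
    }
    return array_values(array_unique($schemes));
}
```

Calling `auth_schemes_from_headers(fetch_response_headers($url))` and getting an empty array back would suggest the site is not using HTTP authentication at all, in which case no `CURLAUTH_*` setting will help.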
flyhoney Posted June 24, 2009

If the page is served over SSL, you might need to add:

    <?php
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
thebadbad Posted June 24, 2009

Your script doesn't return or output anything, so you shouldn't get anything. Setting CURLOPT_RETURNTRANSFER to true makes curl_exec() return the source code of the page you're accessing. You should also set the user agent string for compatibility with more sites:

    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11');
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");
    $contents = curl_exec($ch);
    curl_close($ch);
    ?>

Now the page source will be in $contents. I also closed the cURL connection for you.
Nightwolf629 (Author) Posted June 25, 2009

Thank you flyhoney and thebadbad for replying.

@flyhoney There is no SSL. I tried setting CURLOPT_SSL_VERIFYPEER to false anyway, and it still came back with nothing.

@thebadbad Setting CURLOPT_RETURNTRANSFER to true is what returns the result to a variable. It is false by default, in which case curl_exec() will just output the scraped HTML to the page. That is what I am trying to do, just to make sure it works. I will worry about extracting the data I want after that. For example, if I change $url to "http://www.google.com", the page outputs something that looks like Google (except the images don't load).

My script does close the cURL handle; I just did not include that line above.
Nightwolf629 (Author) Posted June 25, 2009

Ok, it seems like I figured this out. The problem was this line:

    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);

According to php.net, CURLAUTH_ANY is an alias for CURLAUTH_BASIC | CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM. But when I tried each of these individually, some worked and some did not:

    CURLAUTH_BASIC = did not work
    CURLAUTH_DIGEST = worked
    CURLAUTH_GSSNEGOTIATE = worked
    CURLAUTH_NTLM = worked

I am still not sure why CURLAUTH_ANY is not working, but at this point I really do not care.
thebadbad Posted June 25, 2009

Nightwolf629 wrote: "in which case curl_exec() will just output the scraped html to the page."

Sorry, you're right, I forgot that. But did you try adding the user agent string?

Edit: Oh, that's good. cURL must have tried to use one of the methods that didn't work.
Nightwolf629 (Author) Posted June 25, 2009

I did try the user agent string, but no luck...

And I was mistaken. The options that I said "worked" did return something, but they are still not logging in. The request is simply defaulting to a page that has a link to log in.

I guess my real problem is how to enter a username and password into a login dialog, like these: http://polpo.org/tmp/httpauth.png

Does anyone know how to do that in PHP? I wouldn't even care whether it used cURL or not.
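Getting some HTML back does not by itself show that the login succeeded. The HTTP status code and the final URL say more than the body does. Here is a rough sketch of collecting those with `curl_error()` and `curl_getinfo()`; the helper names `fetch_with_diagnostics` and `summarize_fetch` are invented for this example, and `$url`, `$user`, and `$pass` stand in for the real values.

```php
<?php
// Fetch a page with HTTP authentication and keep the details needed to
// judge the outcome. Hypothetical helper, not part of the original script.
function fetch_with_diagnostics(string $url, string $user, string $pass): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");
    $body = curl_exec($ch);                       // false on a transport failure
    $result = [
        'body'   => $body,
        'error'  => curl_error($ch),
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'url'    => (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
    curl_close($ch);
    return $result;
}

// Turn the collected details into a one-line description.
function summarize_fetch(array $result): string
{
    if ($result['body'] === false) {
        return 'transport error: ' . $result['error'];
    }
    if ($result['status'] === 401) {
        return 'HTTP 401: credentials or auth scheme rejected';
    }
    if ($result['status'] >= 300 && $result['status'] < 400) {
        return 'HTTP ' . $result['status'] . ': redirect not followed';
    }
    return 'HTTP ' . $result['status'] . ' from ' . $result['url'];
}
```

A plain 200 for a page that only offers a login link, with no 401 ever seen, would point toward a login handled by the page itself (a form, cookies, or script) instead of HTTP authentication.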
thebadbad Posted June 25, 2009

If you can't log in with cURL, I don't know what else to try. But you can try allowing cURL to follow any redirect headers and to keep sending the username and password along the way:

    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_UNRESTRICTED_AUTH, true);
Nightwolf629 (Author) Posted June 26, 2009

Ok, it looks like the page creates a cookie and relies on that for the authentication. I have tried to use the cookie options in cURL to capture it, but with no luck. I can pull the external JavaScript file, and it looks like the cookie is being built and set by the JavaScript.

Can cURL get a cookie if it is set by an external JavaScript file?
Nightwolf629 (Author) Posted June 26, 2009

Well, I just found a page on the cURL site stating that embedded JavaScript is not supported by cURL: http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or

So I can't get past the authentication because I can't get the cookie, and I can't get the cookie because cURL cannot run the JavaScript that builds it.

I did try logging into the page manually and then copying the cookie I got from it. I put it into my code using the CURLOPT_COOKIE option. This did work, but it is not a permanent fix. The cookie does not stay valid forever, so I would need to manually update the cookie data every time it stopped working... not very good automation if you ask me.

I am not sure what else to do at this point. Any ideas?
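The manual workaround described above, passing a copied cookie through CURLOPT_COOKIE, could look roughly like this. It is only a sketch: `cookie_header` and `fetch_with_cookie` are hypothetical helpers, and the cookie names and values a caller passes in would have to be copied from a real browser session.

```php
<?php
// Build the "name=value; name2=value2" string that CURLOPT_COOKIE expects.
// Values are passed through unchanged, so copy them exactly as the browser shows them.
function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Request a page while sending the given cookies.
// Returns the page source, or false if the transfer failed.
function fetch_with_cookie(string $url, array $cookies)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, cookie_header($cookies));
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}
```

As noted in the post, this only works for as long as the copied cookie stays valid.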
thebadbad Posted June 26, 2009

If possible, you could generate the cookie data yourself, simulating what the JavaScript does. Or, if certain values need to be grabbed from the JavaScript in order to generate the cookie, that could be done with cURL and regular expressions. If the only thing changing within the cookie data is something like a timestamp, it's easily done.
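As a sketch of that idea: fetch the script with cURL, pull the needed value out with a regular expression, and assemble the cookie string by hand. Everything specific here is an assumption made for illustration, since the thread never shows the real script: the variable name `token`, the cookie name `auth`, and the `token:timestamp` layout are all hypothetical and would have to be replaced with whatever the actual JavaScript does.

```php
<?php
// Pull a quoted string assigned to a JavaScript variable out of script source.
// Returns null when the variable is not found.
function extract_js_string(string $js, string $varName): ?string
{
    $pattern = '/\b' . preg_quote($varName, '/') . '\s*=\s*(["\'])(.*?)\1/';
    return preg_match($pattern, $js, $m) ? $m[2] : null;
}

// Hypothetical cookie layout: auth=<token>:<unix timestamp>.
// The real layout must be read from the site's own script.
function build_auth_cookie(string $token, int $timestamp): string
{
    return 'auth=' . $token . ':' . $timestamp;
}
```

The resulting string could then be handed to CURLOPT_COOKIE, using `time()` for the timestamp if that is the only part that changes between logins.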