I am trying to use cURL with PHP to screen scrape a page that requires authentication.

 

I have tried the following, and I get nothing:

 

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "$url");
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");
curl_exec($ch);

 

 

 

I do not control the page I am trying to scrape, and I can't be sure which authentication method it uses.  Is there some way to tell?

 

When I log into the page with IE, it uses the generic Windows login popup.  And when I log into the page with Firefox, it gives me a popup that is unique to Firefox.  So I am assuming that it is HTTP authentication, and CURLAUTH_ANY should pretty much cover it... or so I thought.
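
One way to check that assumption is to look at the WWW-Authenticate headers the server sends with its 401 response, since each one names a scheme (Basic, Digest, NTLM, Negotiate). The following is only a sketch: parseAuthSchemes() and fetchAuthSchemes() are made-up helper names, and it assumes each header line carries a single challenge. An empty result would suggest the login is not HTTP authentication at all.

```php
<?php
// Hypothetical helper: list the auth schemes advertised in the
// WWW-Authenticate headers of a raw response header block.
function parseAuthSchemes(string $rawHeaders): array
{
    $schemes = [];
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; the scheme is the first token of the value.
        if (preg_match('/^WWW-Authenticate:\s*([A-Za-z0-9_-]+)/i', $line, $m)) {
            $schemes[] = $m[1];
        }
    }
    return array_values(array_unique($schemes));
}

// Hypothetical helper: request only the headers of $url, without credentials
// (needs the cURL extension; returns [] if the request itself fails).
function fetchAuthSchemes(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);  // include response headers in the return value
    curl_setopt($ch, CURLOPT_NOBODY, true);  // headers only, no body
    $headers = curl_exec($ch);
    curl_close($ch);
    return $headers === false ? [] : parseAuthSchemes($headers);
}
```

Calling fetchAuthSchemes($url) and printing the result with print_r() would show what the server offers before any credentials are sent.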

 

There is also no proxy involved, so that is not a factor.

 

I have no clue what else I can do to access this.  Can anybody help me?

Your script doesn't return or output anything. So you shouldn't get anything. Setting CURLOPT_RETURNTRANSFER to true makes curl_exec() return the source code of the page you're accessing. You should also set the user agent string for compatibility with more sites:

 

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, "$user:$pass");
$contents = curl_exec($ch);
curl_close($ch);
?>

 

Now the contents will be in $contents. I also closed the cURL connection for you.
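
When the result is empty, it can help to ask cURL what happened instead of guessing. This is a rough sketch, not a definitive check: describeTransfer() is a made-up helper that takes the return value of curl_exec(), the status from curl_getinfo($ch, CURLINFO_HTTP_CODE) and the message from curl_error($ch), all of which have to be read before curl_close() is called.

```php
<?php
// Hypothetical helper: turn the outcome of a cURL request into a one-line diagnosis.
function describeTransfer($body, int $status, string $error): string
{
    if ($body === false) {
        // curl_exec() returns false on a transport-level failure.
        return "cURL failed: $error";
    }
    if ($status === 401) {
        return 'HTTP 401: the server rejected or ignored the credentials';
    }
    return sprintf('HTTP %d, %d bytes received', $status, strlen($body));
}

// Usage sketch, assuming $ch is set up as in the code above and not yet closed:
// $contents = curl_exec($ch);
// echo describeTransfer(
//     $contents,
//     (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
//     curl_error($ch)
// );
```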

Thank you flyhoney and thebadbad for replying

 

 

 

@flyhoney

 

There is no SSL, and I have tried setting CURLOPT_SSL_VERIFYPEER to false and it still came back with nothing.

 

 

 

@thebadbad

 

Setting CURLOPT_RETURNTRANSFER to true is what returns it to a variable.  It is false by default, in which case curl_exec() will just output the scraped HTML to the page.  That is what I am trying to do, just to make sure it works.  I will worry about extracting the data I want after that.

 

For example, if I change the $url to "http://www.google.com" , the page will output something that looks like google (except the images don't load).

 

In my script I am closing the cURL, I just did not include that above.

Ok, it seems like I figured this out.

 

The problem was this:

 

curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);

 

According to php.net,

CURLAUTH_ANY  is an alias for CURLAUTH_BASIC | CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM.

 

But, when I tried each of these individually, some worked and some did not:

 

CURLAUTH_BASIC = did not work

CURLAUTH_DIGEST = worked

CURLAUTH_GSSNEGOTIATE = worked

CURLAUTH_NTLM = worked

 

 

I am still not sure why CURLAUTH_ANY is not working, but at this point I really do not care.

in which case curl_exec() will just output the scraped html to the page.

 

Sorry, you're right, forgot that.

 

But did you try to add the user agent string?

 

Edit: Oh, that's good. cURL must've tried to use one of the methods that didn't work.

I did try the agent string, but no luck...

 

And I was mistaken.  The ones that I said "worked" did return something, but they are still not logging in.  The request is simply defaulting to a page that has a link to log in.

 

 

I guess my real problem is how to enter a username/password into a login dialogue, like these:

http://polpo.org/tmp/httpauth.png

 

Does anyone know how to do that in PHP?  I wouldn't even care if it was cURL or not.

If you can't log in with cURL, I don't know what else to try. But you can try to allow cURL to follow any redirect headers and keep sending the username and password:

 

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_UNRESTRICTED_AUTH, true);

Ok, it looks like the page creates a cookie, and relies on that for the authentication.

 

I have tried to use the options in cURL to capture the cookies, but with no luck.
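
For reference, the usual way to capture cookies is CURLOPT_COOKIEJAR (the file cURL writes cookies to when the handle is closed) together with CURLOPT_COOKIEFILE (the file it reads them back from). To see what was actually captured, the jar file can be parsed. This is a sketch only: parseCookieJar() is a made-up helper and assumes the tab-separated Netscape cookie file format that cURL writes.

```php
<?php
// Hypothetical helper: read name => value pairs out of a Netscape-format cookie file.
function parseCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\n|\r/', $contents) as $line) {
        // "#HttpOnly_" lines are real cookies; any other "#" line is a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;
        }
        $fields = explode("\t", $line);
        if (count($fields) === 7) {
            // Fields: domain, subdomain flag, path, secure, expiry, name, value
            $cookies[$fields[5]] = $fields[6];
        }
    }
    return $cookies;
}

// Usage sketch (not executed here), assuming $ch is an open cURL handle:
// $jar = tempnam(sys_get_temp_dir(), 'jar');
// curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // written on curl_close()
// curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // sent back on later requests
// curl_exec($ch);
// curl_close($ch);
// print_r(parseCookieJar(file_get_contents($jar)));
```

If the jar comes back empty, the server never sent a Set-Cookie header, which would fit a cookie that is created on the client side.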

 

I can pull the external javascript file, and it looks like the cookie is being built and set by the javascript.

 

 

 

Can cURL get a cookie if it is set by an external javascript file?

Well, I just found a cURL site that stated embedded Javascript is not supported by cURL.

 

http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or

 

So I can't get past the authentication because I can't get the cookie.  And I can't get the cookie because cURL cannot run the Javascript that builds it.

 

I did try manually entering the page, then copying the cookie I got from it.  I put it into my code using the CURLOPT_COOKIE option.

 

This did work, but it is not a permanent fix.  The cookie does not work forever, so I would need to manually update the data for the cookie every time it stopped working.... not very good automation if you ask me.

 

 

I am not sure what else to do at this point, any ideas?

If possible, you could generate the cookie data yourself, simulating what the javascript does.

 

Otherwise, if certain values need to be grabbed from the JavaScript to generate the cookie, that could be done with cURL and regular expressions. If the only thing changing within the cookie data is something like a timestamp, it's easily done.
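
A rough sketch of that idea, with everything in it assumed: extractJsValue() and buildCookieHeader() are made-up helpers, and the variable name token and cookie names auth and ts are placeholders, since the real script isn't shown here. The first helper pulls a quoted value out of the downloaded JavaScript, and the second formats name/value pairs the way CURLOPT_COOKIE expects them.

```php
<?php
// Hypothetical helper: pull a quoted value assigned to a variable in JavaScript source.
function extractJsValue(string $js, string $name): ?string
{
    $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/';
    return preg_match($pattern, $js, $m) ? $m[2] : null;
}

// Hypothetical helper: format name/value pairs for CURLOPT_COOKIE ("a=1; b=2").
// Values are passed through unchanged; any encoding must match what the script does.
function buildCookieHeader(array $pairs): string
{
    $parts = [];
    foreach ($pairs as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

// Usage sketch (placeholder names, not executed here):
// $token = extractJsValue($jsSource, 'token');
// curl_setopt($ch, CURLOPT_COOKIE, buildCookieHeader(['auth' => $token, 'ts' => time()]));
```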
