
Okay, starting a new thread as I can't modify the old one.  I had some pretty stable code scraping a Yahoo site for a while, but they've made a change and I can no longer log in.  It's not telling me my password is incorrect; instead, it forces a human-verification (CAPTCHA) page.  The flow is: load page -> detect login form -> log in via that form -> Bob's your uncle.  This all worked, but now, after I log in, I get the human confirmation thing.

 

After tinkering my entire Father's Day away, I have been able to replicate this in a browser (Firefox/Chrome) by doing the following:

 

1: Turning off JavaScript

2: Going directly to the 3rd step above (login via that form) and removing some of the GET variables from the URL.

 

Things of note:

The following line is in the page, which I ignore.

<noscript><input type="hidden" name=".nojs" value="1"></noscript>

 

However, ignoring it is apparently not enough.

 

Using HttpFox, I'm able to see what I'm sending.  I've tried to perfectly emulate my Firefox browser.  I'm sending the same headers AND POST data with only one exception: Content-Length (which cURL sets automatically).  cURL's value is always about 20 bytes smaller than what the browser reports.  I'm kind of curious what accounts for the difference, though I'm not sure it's the issue.  I'm following the form pretty much to a T.  The form sends 3 values that are used for authentication, and I appear to be handling them right.  On a successful login, about 5 cookies are written and the page is redirected.
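One way to chase down a Content-Length gap like the one described above is to decode both request bodies and compare them field by field. The sketch below is only an illustration: `parseFormBody` and `diffPostBodies` are hypothetical helpers, not part of the class in this thread, and it assumes the form is sent as `application/x-www-form-urlencoded`. It parses the body by hand because PHP's `parse_str` rewrites dots in field names to underscores, which would mangle names such as `.nojs`.

```php
<?php
// Hypothetical helpers for comparing a browser-captured POST body with
// the fields a script is about to send. Not part of the original class.

// Decode an application/x-www-form-urlencoded body without parse_str(),
// which would turn dots in field names into underscores.
function parseFormBody(string $body): array
{
    $fields = [];
    foreach (explode('&', $body) as $pair) {
        if ($pair === '') {
            continue;
        }
        $parts = explode('=', $pair, 2);
        $fields[urldecode($parts[0])] = urldecode($parts[1] ?? '');
    }
    return $fields;
}

// Report length and field differences between the two bodies.
function diffPostBodies(string $browserBody, array $scriptFields): array
{
    $browserFields = parseFormBody($browserBody);
    // This is the string one would hand to CURLOPT_POSTFIELDS.
    $scriptBody = http_build_query($scriptFields);

    return [
        'browser_length'    => strlen($browserBody),
        'script_length'     => strlen($scriptBody),
        'missing_in_script' => array_keys(array_diff_key($browserFields, $scriptFields)),
        'extra_in_script'   => array_keys(array_diff_key($scriptFields, $browserFields)),
        'different_values'  => array_keys(array_diff_assoc(
            array_intersect_key($browserFields, $scriptFields),
            array_intersect_key($scriptFields, $browserFields)
        )),
    ];
}
```

If the field lists match and the lengths still differ, the gap may come from encoding (for example `+` versus `%20` for spaces). Also worth checking: when `CURLOPT_POSTFIELDS` is given an array instead of a string, PHP's cURL sends the body as `multipart/form-data`, which changes both the Content-Type and the size.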

 

The anti-phishing page itself says to make sure JavaScript is enabled, and also to check your network settings.  Is there anything else in cURL I might be missing?  An SSL setting, since it uses a secure login?  I'm mainly looking for a brainstorm more than anything here, but if someone spots a glaring error, I'm all ears.  Perhaps there's another way that it's detecting that JavaScript is disabled.

 

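class CURL {
// Optional callback applied to the next response; reset after one use.
public $callback = false;
// Path of the cookie jar file shared by every request.
public $cookie;

// PHP 8 no longer treats a method named after the class as a constructor.
function __construct( $cookie = "" ) {

	if ( !strlen( $cookie ) ) {
		$this->cookie = "default_cookie.txt";
	} else {
		$this->cookie = $cookie;
	}
}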

function setCallback($func_name) {
	$this->callback = $func_name;
}

function doRequest($method, $url, $vars, $referer ) {

	$ch = curl_init();

	// Request headers copied from the Firefox session.
	$header = array(
		"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
		"Accept-Language: en-us,en;q=0.5",
		"Accept-Encoding: gzip, deflate",
		"Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7",
		"Keep-Alive: 115",
		"Connection: keep-alive",
	);

	if ( $method == 'GET' ) {
		$header[] = "Cache-Control: max-age=0";
	}

	curl_setopt($ch, CURLOPT_VERBOSE, 1);
	if ( $referer != "" ) {
		curl_setopt($ch, CURLOPT_REFERER, $referer);
	}
	// Peer verification is off, so the server certificate is not checked against a CA bundle.
	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
	curl_setopt($ch, CURLOPT_ENCODING, "" );
	curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE );
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_TIMEOUT, 5);
	curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64;rv:2.0) Gecko/20110411 Firefox/4.0');
	curl_setopt($ch, CURLOPT_HEADER, 1);
	curl_setopt($ch, CURLOPT_HTTPHEADER, $header );
	curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookie);
	curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookie);

	if ( $method == 'POST' ) {
		curl_setopt($ch, CURLOPT_POST, 1);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $vars);
	}

	$data = curl_exec($ch);

	// Read the error text and transfer info before closing the handle.
	$error = curl_error($ch);
	$info = curl_getinfo( $ch );

	curl_close($ch);
	if ($data) {
		if ($this->callback) {
			$callback = $this->callback;
			$this->callback = false;
			return call_user_func($callback, $data);
		} else {
			return $data;
		}
	} else {
		return "There was an error in getting page.  Try refreshing your browser.<br>";
		// For debugging: return $error;
	}
}

function get($url, $referer) {
	// No POST body for GET requests; pass a real null rather than the string 'NULL'.
	return $this->doRequest('GET', $url, null, $referer);
}

function post($url, $vars, $referer ) {
	return $this->doRequest('POST', $url, $vars, $referer);
}
}


What if you add another parameter,

.nojs=0

to what you're passing? That line you're ignoring appears to be telling the page receiving the request that JavaScript is turned off, so you'll need to lie to it to prevent that.

 

Sorry, should have mentioned.

1) I tried that

2) In the browser, you're sending nothing (as it's within a <noscript> tag) and that's what I'm trying to emulate at the moment.
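To emulate a browser with JavaScript enabled, the form fields can be collected while skipping anything wrapped in `<noscript>`. The sketch below is one possible way to do that with `DOMDocument` and XPath. `collectHiddenFields` is a hypothetical helper, and it assumes the parser exposes the contents of `<noscript>` as ordinary child elements, which is how PHP's libxml-based `DOMDocument::loadHTML` generally behaves.

```php
<?php
// Hypothetical helper: gather hidden inputs from the forms in a page,
// leaving out any input nested inside <noscript>, since a browser with
// JavaScript enabled would never submit those.
function collectHiddenFields(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//form//input[@type="hidden"][not(ancestor::noscript)]';

    $fields = [];
    foreach ($xpath->query($query) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}
```

The resulting array can be merged with the username and password and passed through `http_build_query` before posting. Note that the XPath test is case-sensitive, so a page that writes `type="HIDDEN"` would need a looser match.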

 

I think I've tried just about every sane approach here.  I'm just curious how they're detecting it's not a browser if I'm sending the same headers & post data.  Frustrating.

I've done some digging.  My cURL session is now sending byte for byte the same information as Firefox (according to HttpFox), and the values look perfectly fine.  I believe the server is detecting whether JavaScript is enabled or disabled the first time I view the page.  Is this possible?  The flow, again, is:

 

1: Slurp page (Cookie set here)

2: Read form

3: Follow form to login page

4: Login using that form (this requires said cookies, but the server isn't satisfied -- forces Captcha)

 

I've been able to spoof the other side into thinking JavaScript was off by deleting the cookie set in step 1.  So, my assumption is that the magic all starts here, which seems very tricky to me.  The page is loaded via a GET request, and I'm sending the exact same headers.  There are some <noscript> tags in the page, but these couldn't have an effect on the cookie, could they?  I'm not all up to date, but last I checked, cookies are sent in the response headers (before the page contents).
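Since the class above sets `CURLOPT_HEADER`, the raw response already contains every header block, including those from followed redirects, so the cookies set in step 1 can be listed directly. A minimal sketch, assuming the header length is taken from the `header_size` entry of `curl_getinfo()`; `extractSetCookies` is a hypothetical helper:

```php
<?php
// Hypothetical helper: list cookie names and values from the Set-Cookie
// headers in a raw cURL response (CURLOPT_HEADER enabled).
// $headerSize is the combined length of all header blocks, as reported
// by curl_getinfo($ch)['header_size'].
function extractSetCookies(string $rawResponse, int $headerSize): array
{
    // Only look at the header portion so body text cannot match.
    $headers = substr($rawResponse, 0, $headerSize);

    $cookies = [];
    if (preg_match_all('/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi', $headers, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // A later cookie with the same name overwrites the earlier one.
            $cookies[$match[1]] = $match[2];
        }
    }
    return $cookies;
}
```

Comparing this list against the cookies the browser holds after the same GET would show whether any cookie is being created by script rather than by a response header.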

 

Okay -- I'm comparing packets (again) and the magic is happening within <noscript> tags... or within <script> tags.  One of the two.  The page loads some scripts (very, very convoluted ones), and there are alternatives within the <noscript> tags.  Not being a JS pro, I've yet to decipher exactly what they do.  My headers & POST data match exactly, byte for byte.  However, they could be storing something server side, denoting that I might not be human.  Can JavaScript set something on the server side?
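One thing cURL does differently from a browser is that it fetches only the URL it is given: it never requests the scripts a page references, nor any images inside `<noscript>`. A server can notice which of those follow-up requests arrive for a given session cookie. The sketch below lists those URLs so they can be compared with what HttpFox shows the browser requesting. `listSubresources` is a hypothetical helper, and it does not account for requests the scripts themselves make once they run.

```php
<?php
// Hypothetical helper: list the script URLs and <noscript> image URLs a
// page references. A browser would request one set or the other; a
// plain cURL fetch of the page requests neither.
function listSubresources(string $html): array
{
    $doc = new DOMDocument();

    // Suppress warnings about malformed markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $scripts = [];
    foreach ($xpath->query('//script[@src]') as $node) {
        $scripts[] = $node->getAttribute('src');
    }

    $images = [];
    foreach ($xpath->query('//noscript//img[@src]') as $node) {
        $images[] = $node->getAttribute('src');
    }

    return ['scripts' => $scripts, 'noscript_images' => $images];
}
```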

 

Also, is there an alternative to cURL that executes JavaScript?  I guess I should go through the script, but it's a few thousand lines long. :(

 

Cheers

 
