CrimpJiggler Posted September 20, 2013

Okay, so here's an example of a page I want to download with cURL: http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html. As you can see, there are over 15,000 pages of these user_agent entries, and there are no URL variables. So I used tamper_data to capture the POST data, and here it is:

    14:08:58.925 [559ms] [total 559ms]
    Status: 200 [OK]
    POST http://myip.ms/ajax_table/comp_browseragents/3/
    Load Flags [LOAD_BYPASS_CACHE LOAD_BACKGROUND]
    Content Size [3904]
    Mime Type [text/html]

    Request Headers:
    Host[myip.ms]
    User-Agent[Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0]
    Accept[text/html, */*; q=0.01]
    Accept-Language[en-US,en;q=0.5]
    Accept-Encoding[gzip, deflate]
    Content-Type[application/x-www-form-urlencoded; charset=UTF-8]
    X-Requested-With[XMLHttpRequest]
    Referer[http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html]
    Content-Length[19]
    Cookie[s2_csrf_cookie_name=23f5312826ba7a316f70bcf2555c1e94; s2_csrf_cookie_name=23f5312826ba7a316f70bcf2555c1e94; sw=141.6; sh=65.2; __utma=126509969.298336339.1379680552.1379680552.1379682469.2; __utmc=126509969; __utmz=126509969.1379682469.2.2.utmcsr=localhost|utmccn=(referral)|utmcmd=referral|utmcct=/dummy_page/test.php; __utmb=126509969.2.10.1379682469]
    DNT[1]
    Connection[keep-alive]
    Pragma[no-cache]
    Cache-Control[no-cache]

    Post Data:
    getpage[yes]
    lang[en]

    Response Headers:
    Server[nginx]
    Date[Fri, 20 Sep 2013 13:08:58 GMT]
    Content-Type[text/html; charset=utf-8]
    Content-Length[3904]
    Connection[keep-alive]
    Content-Encoding[gzip]
    Vary[Accept-Encoding]
    X-Powered-By[PleskLin]

So the only thing in there that identifies the page number is this URL: http://myip.ms/ajax_table/comp_browseragents/3/

I'm guessing I need to replicate that AJAX POST, so here's what I tried:

    $ch = curl_init('http://myip.ms/ajax_table/comp_browseragents/3/');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13');
    curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');

    $data = array(
        'getpage' => 'yes',
        'lang'    => 'en'
    );

    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    echo $curl_scraped_page;

But when I ran the script, here's the response I got:

    Invalid Webpage URL
    Go Home

I am trying to learn how to use cURL to scrape sites effectively, but this is a problem I keep running into: I don't know how to replicate whatever it is the website is doing to get the data.
kicken Posted September 20, 2013 (Solution)

You need to add the X-Requested-With: XMLHttpRequest header. The site is checking for that to validate that it is an AJAX request. See CURLOPT_HTTPHEADER.
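For reference, a minimal sketch of that fix applied to the script from the first post (the URL, user agent, referer, and form fields are all taken from the original post; the only changes are the extra header and building the body with http_build_query so the request stays application/x-www-form-urlencoded, as in the captured headers):

    $ch = curl_init('http://myip.ms/ajax_table/comp_browseragents/3/');

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13');
    curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');

    // The server checks for this header before returning the table data.
    // CURLOPT_HTTPHEADER takes an array of raw "Name: value" strings.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'X-Requested-With: XMLHttpRequest',
    ));

    // Same form fields as in the captured request.
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'getpage' => 'yes',
        'lang'    => 'en',
    )));

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    echo $curl_scraped_page;

Note that passing an array directly to CURLOPT_POSTFIELDS makes cURL send the body as multipart/form-data; using http_build_query keeps it urlencoded, which matches the Content-Type that tamper_data recorded.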
CrimpJiggler Posted September 21, 2013 (Author)

That worked, thanks a lot. For some reason the $curl_scraped_page variable only contained the data output by the AJAX call, rather than the full web page. This is exactly what I needed, but I'm trying to figure out how it works, since the script still uses the same commands I'd use to scrape a whole page.
CrimpJiggler Posted September 21, 2013 (Author)

Ah wait, sorry, I see it was only the AJAX URL I was loading, not the main page.
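Since the page number is just the trailing segment of the AJAX URL (the /3/ above), the same request can presumably be repeated for each page. A rough sketch, assuming the pages are numbered sequentially from 1 (the page range, output filenames, and one-second delay are arbitrary choices, not anything from the thread):

    // Hypothetical loop over the first few pages; the real site has 15,000+.
    for ($page = 1; $page <= 5; $page++) {
        $ch = curl_init("http://myip.ms/ajax_table/comp_browseragents/{$page}/");

        // Same header and form fields as the working single-page request.
        curl_setopt($ch, CURLOPT_HTTPHEADER, array(
            'X-Requested-With: XMLHttpRequest',
        ));
        curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
            'getpage' => 'yes',
            'lang'    => 'en',
        )));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

        $html = curl_exec($ch);
        curl_close($ch);

        // Each response is just the table fragment for that page,
        // not the full web page, as noted above.
        file_put_contents("page_{$page}.html", $html);

        sleep(1); // be polite to the server; the delay length is an assumption
    }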