CrimpJiggler Posted September 20, 2013

Okay, so here's an example of a page I want to download with cURL: http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html

As you can see, there are over 15,000 pages of these user agent entries, and there are no URL variables. So I used Tamper Data to capture the POST data, and here it is:

```
14:08:58.925 [559ms] [total 559ms]
Status: 200 [OK]
POST http://myip.ms/ajax_table/comp_browseragents/3/
Load Flags [LOAD_BYPASS_CACHE LOAD_BACKGROUND]
Content Size [3904]
Mime Type [text/html]

Request Headers:
Host[myip.ms]
User-Agent[Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0]
Accept[text/html, */*; q=0.01]
Accept-Language[en-US,en;q=0.5]
Accept-Encoding[gzip, deflate]
Content-Type[application/x-www-form-urlencoded; charset=UTF-8]
X-Requested-With[XMLHttpRequest]
Referer[http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html]
Content-Length[19]
Cookie[s2_csrf_cookie_name=23f5312826ba7a316f70bcf2555c1e94; s2_csrf_cookie_name=23f5312826ba7a316f70bcf2555c1e94; sw=141.6; sh=65.2; __utma=126509969.298336339.1379680552.1379680552.1379682469.2; __utmc=126509969; __utmz=126509969.1379682469.2.2.utmcsr=localhost|utmccn=(referral)|utmcmd=referral|utmcct=/dummy_page/test.php; __utmb=126509969.2.10.1379682469]
DNT[1]
Connection[keep-alive]
Pragma[no-cache]
Cache-Control[no-cache]

Post Data:
getpage[yes]
lang[en]

Response Headers:
Server[nginx]
Date[Fri, 20 Sep 2013 13:08:58 GMT]
Content-Type[text/html; charset=utf-8]
Content-Length[3904]
Connection[keep-alive]
Content-Encoding[gzip]
Vary[Accept-Encoding]
X-Powered-By[PleskLin]
```

The only thing in there that identifies the page number is the URL: http://myip.ms/ajax_table/comp_browseragents/3/

I'm guessing I need to replicate that AJAX POST, so here's what I tried:

```php
$ch = curl_init('http://myip.ms/ajax_table/comp_browseragents/3/');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13');
curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');

$data = array(
    'getpage' => 'yes',
    'lang'    => 'en'
);

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
```

But when I ran the script, here's the response I got:

```
Invalid Webpage URL
Go Home
```

I'm trying to learn how to use cURL to scrape sites effectively, but this is a problem I keep running into: I don't know how to replicate whatever the website is doing to fetch the data.
kicken Posted September 20, 2013

You need to add the X-Requested-With: XMLHttpRequest header. The site checks for that header to verify the request is an AJAX request. See CURLOPT_HTTPHEADER.
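For illustration, here's a minimal sketch of that fix applied to the script above. The URL, referer, and POST fields come from the original post; wrapping the fields in http_build_query() is an added tweak so the body goes out as application/x-www-form-urlencoded (matching the captured request), since cURL switches to multipart/form-data when CURLOPT_POSTFIELDS is given an array:

```php
<?php
// Same request as before, plus the X-Requested-With header so the
// server treats it as an AJAX request.
$ch = curl_init('http://myip.ms/ajax_table/comp_browseragents/3/');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'X-Requested-With: XMLHttpRequest',
));
curl_setopt($ch, CURLOPT_POST, true);
// http_build_query() keeps the body url-encoded, as in the captured request.
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'getpage' => 'yes',
    'lang'    => 'en',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
```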
CrimpJiggler Posted September 21, 2013

That worked, thanks a lot. For some reason, the $curl_scraped_page variable only contained the data output by the AJAX call rather than the full web page. This is exactly what I needed, but I'm trying to figure out how it works, since the script still uses the same commands I'd use to scrape a whole page.
CrimpJiggler Posted September 21, 2013

Ah wait, sorry. I see it now: the URL I was loading was the AJAX endpoint itself, not the main page, so the response is just the table fragment.
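Following that realization, the natural next step is to loop over the page numbers. The sketch below rests on an assumption the thread never confirms explicitly: that the trailing path segment of the AJAX URL (the /3/ above) is the page number.

```php
<?php
// Sketch: fetch several pages of the table by varying the trailing
// path segment, assumed (not confirmed in the thread) to be the page number.
for ($page = 1; $page <= 5; $page++) {
    $ch = curl_init("http://myip.ms/ajax_table/comp_browseragents/{$page}/");
    curl_setopt($ch, CURLOPT_REFERER, 'http://myip.ms/browse/comp_browseragents/Computer_Browser_Agents.html');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'getpage' => 'yes',
        'lang'    => 'en',
    )));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $html = curl_exec($ch);
    curl_close($ch);

    // Each response is an HTML fragment of table rows; save it for later parsing.
    file_put_contents("page_{$page}.html", $html);
    sleep(1); // pause between requests to avoid hammering the server
}
```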