Hey guys, I'm new here. English is not my native language, but I'll try my best.

First, what I would like to do. The purpose of the code is to get data (an HTML file) from a website, but you need an account to access the page. After several hours of discovering the cURL library and a lot of trial and error, everything I tried has failed. I tried to solve this step by step, and I'm afraid something goes wrong at step 1, but I can't tell what, or how to fix it.

This is my code:

<?php
/*
  Here is a script that is useful to:
  - log in to a POST form,
  - store a session cookie,
  - download a file once logged in.
*/

// INIT CURL
$ch = curl_init();

// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'https://mywebsite.com/user/login');

// ENABLE HTTP POST
curl_setopt($ch, CURLOPT_POST, 1);

// SET POST PARAMETERS: FORM VALUES FOR EACH FIELD
curl_setopt($ch, CURLOPT_POSTFIELDS, 'name=myname&pass=mypass&form_id=user_login');

// IMITATE CLASSIC BROWSER'S BEHAVIOUR: HANDLE COOKIES
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/cookieFileName.txt");

//curl_setopt($ch, CURLOPT_REFERER, 'http://mywebsite.com');
//curl_setopt($ch, CURLOPT_HEADER, TRUE);
//curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

# Setting CURLOPT_RETURNTRANSFER to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec($ch);
$info = curl_getinfo($ch);

/*
  I might already have a problem here, since $info contains:

  Array
  (
      [url] => https://mywebsite.com/user/login
      [content_type] =>
      [http_code] => 0
      [header_size] => 0
      [request_size] => 0
      [filetime] => -1
      [ssl_verify_result] => 0
      [redirect_count] => 0
      [total_time] => 0
      [namelookup_time] => 0
      [connect_time] => 0.171
      [pretransfer_time] => 0
      [size_upload] => 0
      [size_download] => 0
      [speed_download] => 0
      [speed_upload] => 0
      [download_content_length] => -1
      [upload_content_length] => -1
      [starttransfer_time] => 0
      [redirect_time] => 0
  )
*/

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://mywebsite.com/users/en/myfile/1/');
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookieFileName.txt");

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec($ch);

// CLOSE CURL
curl_close($ch);
?>

$content contains a "you must be logged in" page instead of the "this is your data" page.

Second possible problem: the cookie file contains:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
mywebsite.com FALSE / FALSE 0 LOL_TRIB p4epeqgp9tfijl0evi91rsl225

and not all the cookies that are stored in my browser when I log in manually.

Could someone explain where my errors are, or give me a hint, please? Thanks.
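
One way to narrow down step 1: an http_code of 0 in curl_getinfo() generally means that no HTTP response was received at all, so the reason has to come from curl_errno() and curl_error() rather than from the returned page. The following is only a minimal sketch under assumptions, not a confirmed fix. The helper names (execWithDiagnostics, loginThenFetch, parseCookieJar) are hypothetical, the URLs and form fields are the placeholders from the script above, and whether this site redirects after login is a guess. It also switches the handle back to GET before the second request, since CURLOPT_POST otherwise stays set, and it parses the cookie file so its contents can be compared with the browser's cookies.

```php
<?php
// Sketch only. Helper names are hypothetical; URLs, form fields and the
// cookie path are the placeholders used in the script above.

/** Run the request on $ch and collect transport-level diagnostics. */
function execWithDiagnostics($ch): array
{
    $body = curl_exec($ch); // string on success, false on failure
    return [
        'body'      => $body,
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response arrived
        'errno'     => curl_errno($ch), // 0 means no cURL error
        'error'     => curl_error($ch), // human-readable reason, '' if none
    ];
}

/** Log in, then fetch a protected page on the same handle. */
function loginThenFetch(string $loginUrl, string $postFields, string $pageUrl, string $jar): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postFields,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // assumption: the login answers with a redirect
        CURLOPT_COOKIEJAR      => $jar, // file the cookies are saved to
        CURLOPT_COOKIEFILE     => $jar, // file the cookies are read from
    ]);
    $login = execWithDiagnostics($ch);

    // The handle still has POST set from the login; switch back to GET.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $page = execWithDiagnostics($ch);

    return ['login' => $login, 'page' => $page];
}

/**
 * Parse a Netscape-format cookie file (7 tab-separated fields per line:
 * domain, subdomain flag, path, secure, expiry, name, value).
 */
function parseCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // curl prefixes HttpOnly cookies with "#HttpOnly_"; they are not comments.
        $httpOnly = strpos($line, '#HttpOnly_') === 0;
        if ($httpOnly) {
            $line = substr($line, strlen('#HttpOnly_'));
        }
        if ($line === '' || $line[0] === '#') {
            continue; // blank line or real comment
        }
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a cookie line
        }
        [$domain, , $path, $secure, $expires, $name, $value] = $fields;
        $cookies[$name] = [
            'domain'   => $domain,
            'path'     => $path,
            'secure'   => $secure === 'TRUE',
            'expires'  => (int) $expires, // 0 means a session cookie
            'value'    => $value,
            'httpOnly' => $httpOnly,
        ];
    }
    return $cookies;
}
```

Calling loginThenFetch('https://mywebsite.com/user/login', 'name=myname&pass=mypass&form_id=user_login', 'http://mywebsite.com/users/en/myfile/1/', '/tmp/cookieFileName.txt') and printing the 'errno' and 'error' entries of the 'login' result should show why the first request got no response. A certificate verification failure on the https URL is one common cause, but that is only a guess for this case. Passing file_get_contents('/tmp/cookieFileName.txt') to parseCookieJar() then lists what cURL actually stored.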