Nuv Posted December 20, 2010

Hi guys, I am making a bot that scrapes the source code of a site only AFTER logging into it. The script to log in is:

<?php
$username="xxx";
$password="iwonttellyou";
$url="http://internet.com/login.php";
$cookie="cookie.txt";
$postdata = "name=".$username."&password=".$password;
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt ($ch, CURLOPT_REFERER, $url);
curl_setopt ($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt ($ch, CURLOPT_POST, 1);
$result = curl_exec ($ch);
echo $result;
?>

I can see a different session ID in cookie.txt every time I run this code, which makes me believe it's working. However, what next? How should I request the site again, already logged in, and scrape the data? Some suggestions would be nice.
btherl Posted December 20, 2010

Have you tried doing another curl request using curl_exec() immediately afterwards, using the same $ch?

If Perl is an option, WWW::Mechanize is more suited for this kind of task.
Nuv (Author) Posted December 20, 2010

"Have you tried doing another curl request using curl_exec() immediately afterwards, using the same $ch?"

Yes, I have. It doesn't work.

"If Perl is an option, WWW::Mechanize is more suited for this kind of task."

I'll look into it. I've never worked with Perl before.
btherl Posted December 21, 2010

"Have you tried doing another curl request using curl_exec() immediately afterwards, using the same $ch?"
"Yes, I have. It doesn't work."

What happens? And how do you determine the correct request to make after logging in? Did you find it from the HTML source, from a snooping add-on like LiveHTTPHeaders, or by some other method?

"If Perl is an option, WWW::Mechanize is more suited for this kind of task."
"I'll look into it. I've never worked with Perl before."

The awesome thing about WWW::Mechanize is that it not only keeps track of your cookies, it also parses the HTML and lets you select links by name or link text, and lets you choose and submit a form without having to parse it yourself. People have tried to make an equivalent for PHP, but there's still no real alternative. At my workplace we call Perl scripts to do this sort of work, then pass the result back to PHP.
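
A minimal sketch of the "second request on the same $ch" idea discussed above, in PHP. It is an illustration under assumptions, not a confirmed fix for the problem Nuv reports: the function names build_login_body() and login_and_fetch() are hypothetical, the field names "name" and "password" are taken from the original script, and the target URL you pass in is whatever page you found by inspecting the site. Compared with the original script, it sets CURLOPT_COOKIEFILE alongside CURLOPT_COOKIEJAR, URL-encodes the credentials with http_build_query(), and switches the handle back to GET with CURLOPT_HTTPGET before the second request, since a handle that has been set to POST would otherwise keep posting.

```php
<?php
// Sketch only: function names and the URLs passed in are placeholders.

/** Build the URL-encoded login body; field names assumed from the original post. */
function build_login_body(string $username, string $password): string
{
    return http_build_query(['name' => $username, 'password' => $password]);
}

/** Log in, then fetch a second page over the same handle so its cookies are reused. */
function login_and_fetch(string $loginUrl, string $targetUrl, string $body, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,        // follow a redirect after login, if any
        CURLOPT_TIMEOUT        => 60,
        CURLOPT_COOKIEJAR      => $cookieFile, // where cookies are written when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // where cookies are read from
    ]);

    // Step 1: POST the credentials.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    if (curl_exec($ch) === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Login request failed: $error");
    }

    // Step 2: switch the same handle back to GET and request the page to scrape.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    $page = curl_exec($ch);
    if ($page === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Page request failed: $error");
    }

    curl_close($ch);
    return $page;
}
```

A call would look like login_and_fetch($url, $pageUrl, build_login_body($username, $password), "cookie.txt"), where $pageUrl is the logged-in page you want to scrape. If the login form also carries hidden fields or a token, those would need to be included in the body as well; the sketch does not handle that.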