Dustin013 Posted June 4, 2009

I am trying to get this to work. I would like to log into my forum and then access the content within the forums using cURL, then convert the data into something simple_html_dom can understand so that I can parse and scrape data.

The script below successfully logs in to an Invision Power Board I have installed on my testing machine. If I echo out any of the content, it shows me the "logging in" page, then forwards me to the domain. So before I get any of the parsing worked out, I simply need to be able to tell the script how to log in and then go to the correct forum, so that it can begin collecting data for parsing.

I am really not sure where I am going wrong, but it does log you in and then forwards the page to the targeted domain from wherever it is run. Any suggestions? I was thinking about possibly using stream_context_create, but I could use some advice.

Thanks in advance!

<?php
/*
 *
 * The idea of this script is to scrape / parse data from a member protected forum run on Invision Power Board
 * This is not a hack, but rather an information gathering tool to a forum you already have access to.
 * The script uses simple_html_dom and cURL. First the user is logged into the site using cURL,
 * then directed to the correct forum ID, where data can then be scraped / parsed and organized.
 *
 */

ini_set('display_errors',1); // Turn error reporting on
error_reporting(E_ALL|E_STRICT); // All errors displayed
include_once('simple_html_dom.php'); // Simple_HTML_DOM (http://simplehtmldom.sourceforge.net/)

// Config
//////////////
$url = "http://****.com/ipb/"; // Target URL with a forward slash!
$username = "testing"; // Your forum username
$password = "testing"; // Your forum password
$forum_id = "25"; // The ID of the target forum you want to be logged into
//////////////
// End Config

// Post Data
//////////////
$curlPost = "index.php?act=Login&CODE=01&referer=".urlencode($url)."index.php%3F&UserName=".$username."&PassWord=".$password."&CookieDate=1&showforum=".$forum_id;
echo "curlPost :".$curlPost."<br />";

// Start cURL
//////////////
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // $url is target URL
curl_setopt($ch, CURLOPT_HEADER, 1); // return headers
curl_setopt($ch, CURLOPT_USERAGENT, 'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); // Use cookie.txt for STORING cookies
curl_setopt($ch, CURLOPT_POST, true); // Tell curl that we are posting data
curl_setopt($ch, CURLOPT_POSTFIELDS, $curlPost); // send post data
$html = curl_exec($ch); // Execute! $html contains curl data!
curl_close($ch); // Free the memory
//////////////
// End cURL

// Convert $html into $content for use with simple_html_dom
$content = str_get_html($html); // Convert String

// Find all article blocks
foreach($content->find('div.maintitle') as $content) { // so we are looking for anything in the content of a div with a class of maintitle
    $item['title'] = $article->find('div.maintitle', 0)->plaintext; // all this data is to be in plain text
    $content[] = $item; // place into array
}

// Testing Output
////////////////////////////
// You guessed it.. if you echo either of these out you get forwarded
// to the index.php of the target url, however you get logged in
// echo $item;
// echo $content;
// echo '<pre>';
// print_r($content);
// echo '</pre>';
////////////////////////////
// End Testing Output

// Loop through the array to get your data
foreach($content as $item){ // ya ya ya
    foreach ($item as $key => $value){ // output contents of array
        echo $key.' : '.$value.'<br />'; // returns "title : %contents of div%"
    }
    echo '<br />'; // space em out
}

$content->clear(); // free up memory
unset($content); // cause you got to
?>

Run as is, the script outputs the following before redirecting to the next page:

curlPost :index.php?act=Login&CODE=01&referer=http%3A%2F%2Fwww.****.com%2Findex.php%3F&UserName=******&PassWord=******&CookieDate=1&showforum=25

nodetype : 5
tag : root
attr : Array
children : Array
nodes : Array
parent :
_ : Array
0 : HTTP/1.1 200 OK
Date: Thu, 04 Jun 2009 02:29:23 GMT
Server: Apache
Set-Cookie: ipbsession_id=3ea9501cb4aeb7c3aaa5f6044558ac71; path=/; domain=.*****.com; httponly
Set-Cookie: ipbipb_stronghold=cdd5a86c4b0dd10032f8ddf41d7443c4; expires=Fri, 04-Jun-2010 02:29:23 GMT; path=/; domain=.*****.com; httponly
Set-Cookie: ipbmember_id=******; expires=Fri, 04-Jun-2010 02:29:23 GMT; path=/; domain=******.com; httponly
Set-Cookie: ipbpass_hash=92942a37*****257d721b86d6abf6d55; expires=Sat, 04-Jul-2009 02:29:23 GMT; path=/; domain=.*****.com; httponly
Set-Cookie: ipbcoppa=0; path=/; domain=.*****.com
Set-Cookie: ipbsession_id=2c7ce04ec93d904219a08c31526256d3; path=/; domain=.******.com; httponly
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
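A side note on the output above: because CURLOPT_HEADER is set, the string returned by curl_exec() begins with the raw response headers, and these are handed to str_get_html() along with the page. One way to separate the two, sketched here with a hypothetical helper called splitResponse(), is to cut the string at the header length that cURL reports through curl_getinfo($ch, CURLINFO_HEADER_SIZE), read before curl_close(). The sample response in the demonstration is made up so that the snippet runs without a live request.

```php
<?php
// Hypothetical helper: split a raw cURL response (fetched with CURLOPT_HEADER on)
// into its header text and its body, given the header length in bytes.
function splitResponse(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Self-contained demonstration with a made-up response instead of a live request.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>hi</body></html>";

// With a real handle, use curl_getinfo($ch, CURLINFO_HEADER_SIZE) instead.
$headerSize = strpos($raw, "\r\n\r\n") + 4;

$parts = splitResponse($raw, $headerSize);
echo $parts['body'], "\n";
```

Only the body part would then be passed to str_get_html(). Alternatively, leaving CURLOPT_HEADER off avoids the issue when the headers are not needed.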
Dustin013 Posted June 4, 2009

Anyone any ideas?
JonnoTheDev Posted June 4, 2009

Your parameters are in the format of a GET request, not a POST. Here is a function so you can use an array for the POST fields: the key is the field name and the value is the input value. The file index.php should be part of the URL, not the POST data.

<?php
// Build a URL-encoded POST body from an array of field name => value pairs
function postString($dataArray) {
    $tempString = array(); // start empty so join() always receives an array
    foreach($dataArray as $key => $value) {
        if(strlen(trim($value)) > 0) {
            $tempString[] = $key . "=" . urlencode($value); // values are expected to be strings
        } else {
            $tempString[] = $key; // empty value: send the field name on its own
        }
    }
    $queryString = join('&', $tempString);
    return $queryString;
}

// url
$target = "http://****.com/ipb/index.php";

// post data
$postArray['UserName'] = "joe";
$postArray['PassWord'] = "bloggs";
$postArray['act'] = "Login";

$ch = curl_init(); // the handle must exist before any options are set
curl_setopt($ch, CURLOPT_URL, $target);
curl_setopt($ch, CURLOPT_POSTFIELDS, postString($postArray));
curl_setopt($ch, CURLOPT_POST, TRUE);
?>
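Putting the advice above together with the cookie handling from the original script, a fuller version might look like the sketch below. It rests on several assumptions: the field names (act, CODE, UserName, PassWord, CookieDate) are taken from the original post and have not been checked against any particular IPB version; buildLoginBody(), newSession() and fetchPage() are hypothetical helper names; and PHP's built-in http_build_query() is swapped in for the hand-written postString(). The original script only set CURLOPT_COOKIEJAR, which stores cookies; this sketch also sets CURLOPT_COOKIEFILE so that the stored cookies are sent back on later requests.

```php
<?php
// Sketch only: helper names are hypothetical; field names come from the original post.

/** Build the URL-encoded POST body for the login form. */
function buildLoginBody(string $username, string $password): string
{
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return http_build_query([
        'act'        => 'Login',
        'CODE'       => '01',
        'UserName'   => $username,
        'PassWord'   => $password,
        'CookieDate' => '1',
    ], '', '&');
}

/** Create a cURL handle that stores cookies in, and sends them from, $cookieFile. */
function newSession(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the response as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow the post-login redirect
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies here when the handle is closed
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // turn on cookie handling and read cookies from here
    return $ch;
}

/** Request $url on an existing handle: POST if $postBody is given, otherwise GET. */
function fetchPage($ch, string $url, ?string $postBody = null)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($postBody !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET after an earlier POST
    }
    return curl_exec($ch); // response as a string, or false on failure
}
```

Usage would then be along these lines, with $base standing in for the forum URL: call $ch = newSession('cookie.txt'), log in with fetchPage($ch, $base . 'index.php', buildLoginBody($username, $password)), and then request the forum with $html = fetchPage($ch, $base . 'index.php?showforum=25'). Reusing the same handle for both requests is what lets the session cookies from the login carry over to the forum page, which is the step the original script never reached.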