euroiplayer Posted December 1, 2006

Hello everybody,

I need help making a scraper that pulls the comments off a MySpace comments page. I have tried many times and have failed miserably. Please help me out.

Here is an example request:

http://www.website123.com/scrapes.php?action=friends&profile=64100451

and this would be the output from that website, in XML format:

[quote]<?xml version="1.0" encoding="UTF-8"?>
<first_child>
  <commentData>
    <userID>64100451</userID>
    <thumbnail>http://myspace-308.vo.llnwd.net/01133/80/39/1133389308_s.jpg</thumbnail>
    <postedDate>28.Nov.2006 13:05</postedDate>
    <userName>natalie</userName>
    <commentText></commentText>
  </commentData>
</first_child>
[/quote]

That is an example of one comment being scraped from a whole list. This is the page that the comments would be scraped from:

[url=http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451]http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451[/url]

So basically, whatever ID number I enter in place of ##### here:

http://www.website123.com/scrapes.php?action=friends&profile=#####

the script would scrape the comments from that MySpace profile.

Any questions, feel free to ask. Any help would be GREATLY appreciated. Thanks

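The request format described above maps a numeric profile ID onto the comments-page URL. A minimal sketch of that step, assuming PHP 7.1 or later; `build_comments_url` is a hypothetical helper name, not something from the thread, and rejecting non-numeric IDs is an added precaution:

```php
<?php
// Hypothetical helper: build the MySpace comments URL for a numeric profile ID.
// Returns null when the ID is not a plain run of digits.
function build_comments_url(string $profileId): ?string
{
    // \A and \z anchor the whole string, so "123\n" or "12abc" are rejected
    if (!preg_match('/\A[0-9]+\z/', $profileId)) {
        return null;
    }

    return 'http://comment.myspace.com/index.cfm?'
        . http_build_query([
            'fuseaction' => 'user.viewComments',
            'friendID'   => $profileId,
        ], '', '&');
}

// Inside scrapes.php: read ?profile=##### from the query string.
$profile = (isset($_GET['profile']) && is_string($_GET['profile'])) ? $_GET['profile'] : '';
$url = build_comments_url($profile); // null when the parameter is missing or invalid
```

Validating the ID before building the URL keeps arbitrary text out of the outgoing request.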
HuggieBear Posted December 1, 2006

Are you able to provide us with any code that you've created?

Regards
Huggie

euroiplayer Posted December 1, 2006

The code is messed up, and I am very sure that after seeing the results

euroiplayer Posted December 1, 2006

This is the code that I started with. I modified it over and over and, well, got no results. If possible, could you please modify the code for me?

[quote]<?php
// Screen scraping your way into RSS
// Example script, by Dennis Pallett
// http://www.phpit.net/tutorials/screenscrap-rss

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all ("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);

// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <channel>
    <title>PHPit Latest Content</title>
    <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
    <link>http://www.phpit.net</link>
    <language>en-us</language>
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match ("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
    $title = $temp['1'];
    $title = strip_tags($title);
    $title = trim($title);

    // Second, get url
    preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
    $url = $temp['1'];
    $url = trim($url);

    // Third, get text
    preg_match ("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
    $text = $temp['1'];
    $text = trim($text);

    // Fourth, and finally, get author
    preg_match ("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
    $author = $temp['1'];
    $author = trim($author);

    // Echo RSS XML
    echo "<item>\n";
    echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
    echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
    echo "\t\t\t<content:encoded><![CDATA[ \n";
    echo $text . "\n";
    echo " ]]></content:encoded>\n";
    echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "\t\t</item>\n";
}
?>
  </channel>
</rss>[/quote]

Note: it displays the output as RSS (I need it to be in XML like the example in my first post).

Thanks

HuggieBear Posted December 1, 2006

I'll take a look at this for you this evening.

Just a quick question: you have to be logged into MySpace to be able to access that content. How do you expect to get around that?

Regards
Huggie

euroiplayer Posted December 1, 2006

I am not sure, but I believe it can be done with cURL? Is there a way I can put a username and a password in scrapes.php itself, or maybe read them from a MySQL database, so that the script logs into MySpace first and then goes to the user's comments page?

Here, check this website out:

[url=http://scrapes.php?action=comments&profile=23432]scrapes.php?action=comments&profile=23432[/url]

Note: you can change the #### (the profile number). It might display as an error, but the output does the job anyway, because when you view source, everything is there.

Tell me what you think... is it possible?

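One way the log-in-first idea could work with cURL is to POST the credentials, keep the session cookies, and then request the comments page on the same handle. The sketch below assumes PHP 7.1 or later with the cURL extension; the function names, the `email` and `password` field names, and whatever login URL is passed in are hypothetical, since the thread never shows MySpace's actual login form.

```php
<?php
// Sketch only: the form field names below are hypothetical placeholders,
// not MySpace's real login form.

// Encode the login form fields as an application/x-www-form-urlencoded body.
function build_login_body(string $email, string $password): string
{
    return http_build_query(['email' => $email, 'password' => $password], '', '&');
}

// Log in, then fetch $pageUrl with the same session cookies.
// Returns the page HTML, or null if either request fails at the cURL level.
function fetch_logged_in(string $loginUrl, string $body, string $pageUrl, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // turn on cookie handling
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies when the handle closes

    // Step 1: POST the credentials
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return null;
    }

    // Step 2: GET the protected page on the same handle, so the cookies carry over
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    $html = curl_exec($ch);
    curl_close($ch);

    return $html === false ? null : $html;
}
```

The credentials could be read from a configuration file or a MySQL row before calling `build_login_body()`. A successful cURL transfer does not prove the login itself succeeded, so the returned HTML would still need checking.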
HuggieBear Posted December 1, 2006

OK, I'm still looking. I like this little challenge :)

Huggie

euroiplayer Posted December 1, 2006

Here, I found a website that should be helpful ::)

http://makedatamakesense.com/

Look under:
MySpace RSS Creator
MySpace Events RSS Creator
MySpace Comments RSS Creator

HuggieBear Posted December 4, 2006

The website didn't really help, but I did come up with the following code, so give it a try. You may have problems with the XML, but you can probably fix them better than I can.

Regards
Huggie

P.S. Admins/Mods: I didn't use [nobbc][code][/code][/nobbc] tags for this as it throws the syntax colouring out. I would normally use it ;)

[code=php:0]<?php
// Turn on all error reporting
error_reporting(E_ALL);

// Create a new CURL handle
$ch = curl_init();

// Set the CURL options
curl_setopt($ch, CURLOPT_URL, "http://www.myspace.com/tom");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

// Put the data received from the page into a variable
$data = curl_exec($ch);

// Output an error to the browser if execute errors
if (curl_errno($ch)){
    // curl_error() needs the handle it should report on
    echo "Error: " . curl_error($ch);
}

// Close the object
curl_close($ch);

// Regular expression to capture the data into the comments array
$pattern = '/bgcolor="FF9933" style="word-wrap: break-word">.*?id=(\d+)".*?>(.*?)<.*?src="(.*?)".*?text10">(.*?)<.*?<br>.*?<br>(.*?)<\/td>/ims';
preg_match_all($pattern, $data, $comments, PREG_SET_ORDER);

// Dispose of the full pattern matches
foreach ($comments as $k => $v){
    array_shift($comments[$k]);
}

// Strip out all the white space from 'userName', 'postedDate' and 'commentText'
foreach ($comments as $k => $v){
    $comments[$k][1] = trim($comments[$k][1]);
    $comments[$k][3] = trim($comments[$k][3]);
    $comments[$k][4] = trim($comments[$k][4]);
}

// Echo the XML
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<first_child>\n";
foreach ($comments as $k => $v){
    echo "  <commentData>\n";
    echo "    <userID>{$comments[$k][0]}</userID>\n";
    echo "    <thumbnail>{$comments[$k][2]}</thumbnail>\n";
    echo "    <postedDate>{$comments[$k][3]}</postedDate>\n";
    echo "    <userName>{$comments[$k][1]}</userName>\n";
    echo "    <commentText>{$comments[$k][4]}</commentText>\n";
    echo "  </commentData>\n";
}
echo "</first_child>\n";
?>[/code]

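The capture step in the script relies on `PREG_SET_ORDER` grouping results per match, followed by `array_shift()` to drop the full-match entry. A small sketch of that behaviour on a made-up HTML fragment; the markup and the simplified pattern are illustrative assumptions, not MySpace's real page structure:

```php
<?php
// Made-up markup for illustration only; MySpace's real HTML differs.
$html = '<td><a href="/profile?id=42">natalie</a><img src="http://example.com/n.jpg"></td>'
      . '<td><a href="/profile?id=7">tom</a><img src="http://example.com/t.jpg"></td>';

// Three capture groups: user ID, user name, thumbnail URL
$pattern = '/id=(\d+)">(.*?)<\/a><img src="(.*?)">/is';

// PREG_SET_ORDER returns one row per match: $rows[0] is the first comment, and so on
preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);

// Element 0 of each row is the full matched text; drop it so index 0 is the first group
foreach ($rows as $k => $row) {
    array_shift($rows[$k]);
}

// $rows[0] is now ['42', 'natalie', 'http://example.com/n.jpg']
```

Without `PREG_SET_ORDER`, `preg_match_all()` groups by capture group instead (all IDs together, all names together), which is less convenient when emitting one XML element per comment.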
euroiplayer Posted December 4, 2006

:) Thanks for all the great help, Huggie

HuggieBear Posted December 4, 2006

No problem. This is of course liable to break the moment MySpace makes any changes, but it should just be a case of changing the regular expression.

I found that it worked OK for normal alphanumeric characters, but when someone had a username with special characters in it, the XML didn't like it. You'll have to figure that one out for yourself.

Regards
Huggie

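A likely cause of the special-character problem is that characters such as `&` and `<` are written into the XML unescaped. A minimal sketch of escaping each scraped value before output, assuming PHP 7.1 or later and that the scraped text is already UTF-8; `xml_escape` and `comment_to_xml` are hypothetical helper names, and the array keys simply mirror the tag names used in the thread:

```php
<?php
// Escape a scraped value so it is safe as XML element text.
// ENT_SUBSTITUTE replaces invalid UTF-8 sequences instead of returning an empty string.
function xml_escape(string $value): string
{
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Build one <commentData> element from an associative array of scraped values.
function comment_to_xml(array $c): string
{
    return "  <commentData>\n"
        . "    <userID>" . xml_escape($c['userID']) . "</userID>\n"
        . "    <thumbnail>" . xml_escape($c['thumbnail']) . "</thumbnail>\n"
        . "    <postedDate>" . xml_escape($c['postedDate']) . "</postedDate>\n"
        . "    <userName>" . xml_escape($c['userName']) . "</userName>\n"
        . "    <commentText>" . xml_escape($c['commentText']) . "</commentText>\n"
        . "  </commentData>\n";
}
```

Note that any HTML markup inside a comment is escaped too, so it arrives as literal text rather than nested elements. If the page is served in another encoding, it would need converting to UTF-8 first.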
chrishawkins Posted November 8, 2007

Hey Huggie,

I am a noob here as well as in the world of PHP and cURL. However, I was wondering if you would be willing to lend me some of your expertise with a project that I am working on. It is somewhat similar to the topics posted in this thread. Let me know if you would be willing to help out. Like I said, I am new here, so I am still unable to post private messages. So, if you're interested, I guess you can either post back here or send me a private message with some contact info.

Cheers
Chris
