Need help (Making a scraper)

euroiplayer · December 1, 2006

Hello everybody..

I need help making a scraper (scraping comments off of a myspace comments page). I have tried many times and have failed miserably. Please help me out.

Here is an example...

http://www.website123.com/scrapes.php?action=friends&profile=64100451
...and this would be the output from that website, in XML format...
[quote]<?xml version="1.0" encoding="UTF-8"?>
<first_child>

<commentData>
<userID>64100451</userID>
<thumbnail>http://myspace-308.vo.llnwd.net/01133/80/39/1133389308_s.jpg</thumbnail>
<postedDate>28.Nov.2006 13:05</postedDate>
<userName>natalie</userName>
<commentText></commentText>
</commentData>

</first_child> [/quote]

That's an example of one comment being scraped from a whole list. This is the page that the comments would be scraped from: [url=http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451]http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451[/url]

So basically, whatever ID number I enter here (#####)... http://www.website123.com/scrapes.php?action=friends&profile=#####... it would scrape the comments from that MySpace profile.

Any questions, feel free to ask..
Any help would be GREATLY appreciated. Thanks
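A minimal sketch of the ID handling described above: take the `profile` value from the query string, accept it only if it is all digits, and build the comments-page URL from it. The function name `build_comments_url` is made up for illustration; the URL format is the one quoted in the post.

```php
<?php
// Hypothetical helper: map the ?profile= value to the MySpace comments URL.
function build_comments_url($profile)
{
    // Accept digits only, so nothing unexpected ends up in the outgoing URL.
    if (!is_string($profile) || !preg_match('/^[0-9]+$/D', $profile)) {
        return null;
    }
    return 'http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=' . $profile;
}

// In scrapes.php the value would come from the query string.
$profile = isset($_GET['profile']) ? $_GET['profile'] : '';
$url = build_comments_url($profile); // null when the ID is missing or not numeric
```

Rejecting anything that is not a plain number up front keeps the rest of the script from having to worry about what ends up in the request.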

HuggieBear · December 1, 2006

Are you able to provide us with any code that you've created?

Regards
Huggie

euroiplayer · December 1, 2006

The code is messed up, and I am very sure that after seeing the results

euroiplayer · December 1, 2006

This is the code that I started with. I modified it over and over and got no results. If possible, could you please modify the code for me?

[quote]<?php

// Screen scraping your way into RSS
// Example script, by Dennis Pallett
// http://www.phpit.net/tutorials/screenscrap-rss

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all ("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);
// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
// First, get title
preg_match ("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
$title = $temp['1'];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp['1'];
$url = trim($url);

// Third, get text
preg_match ("/([^`]*?)/", $match, $temp);
$text = $temp['1'];
$text = trim($text);

// Fourth, and finally, get author
preg_match ("/By ([^`]*?)<\/span>/", $match, $temp);
$author = $temp['1'];
$author = trim($author);

// Echo RSS XML
echo "<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[ \n";
echo $text . "\n";
echo " ]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
echo "\t\t</item>\n";
}
?>
</channel>
</rss>[/quote]

Note: it outputs RSS (I need it in the XML format shown above)

Thanks

HuggieBear · December 1, 2006

I'll take a look at this for you this evening.

Just a quick question: you have to be logged into MySpace to be able to access that content. How do you expect to get around that?

Regards
Huggie

euroiplayer · December 1, 2006

I am not sure, but I believe it can be done with cURL? Is there a way I can put a username and a password in scrapes.php itself, or maybe fetch them from a MySQL database, so that it logs in to MySpace first and then goes to the user's comments page?

Here check this website out..
[url=http://scrapes.php?action=comments&profile=23432]scrapes.php?action=comments&profile=23432[/url]
Note: You can change the ID number.
It might display as an error, but the output does the job anyway, because when you view the source, everything is there.

Tell me what you think... is it possible?
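For what it's worth, the usual cURL approach to the login question is a cookie jar: POST the login form once, let cURL store the session cookies in a file, then reuse that file for the next request. The sketch below only shows that mechanism. The function name `fetch_with_cookies`, the login URL, and the form field names are all assumptions; the real ones would have to be read from the site's actual login form.

```php
<?php
// Hypothetical helper: fetch a URL while storing and re-sending cookies.
// Pass an array as $postFields to send a form POST (e.g. a login form).
function fetch_with_cookies($url, $cookieFile, $postFields = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // where cookies get saved
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // cookies to send back
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    curl_close($ch); // releasing the handle is what flushes the cookie jar
    return $body === false ? null : $body;
}

// Assumed usage (URL and field names are placeholders, not the real form):
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// fetch_with_cookies('http://www.example.com/login', $jar,
//     array('email' => 'me@example.com', 'password' => 'secret'));
// $page = fetch_with_cookies('http://www.example.com/comments?id=123', $jar);
```

Whether this is enough depends on the login form; sites often add hidden fields or tokens that would need to be scraped and posted back as well.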

HuggieBear · December 1, 2006

OK, I'm still looking, I like this little challenge :)

Huggie

euroiplayer · December 1, 2006

Here I found a website, this should be helpful ::)

http://makedatamakesense.com/
Look under:
MySpace RSS Creator
MySpace Events RSS Creator
MySpace Comments RSS Creator

HuggieBear · December 4, 2006

The website didn't really help, but I did come up with the following code, so give it a try. You may have problems with the XML but you can probably fix them better than I can.

Regards
Huggie

P.S. Admins/Mods: I didn't use [nobbc][code][/code][/nobbc] tags for this as it throws the syntax colouring out. I would normally use it ;)

[code=php:0]<?php

// Turn on all error reporting
error_reporting(E_ALL);

// Create a new CURL handle
$ch = curl_init();

// Set the CURL options
curl_setopt($ch, CURLOPT_URL, "http://www.myspace.com/tom");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

// Put the data received from the page into a variable
$data = curl_exec($ch);

// Output an error to the browser if execute errors
if (curl_errno($ch)){
echo "Error: " . curl_error($ch);
}

// Close the object
curl_close($ch);

// Regular expression to capture the data into the comments array
$pattern = '/bgcolor="FF9933" style="word-wrap: break-word">.*?id=(\d+)".*?>(.*?)<.*?src="(.*?)".*?text10">(.*?)<.*? .*? (.*?)<\/td>/ims';
preg_match_all($pattern, $data, $comments, PREG_SET_ORDER);

// Dispose of the full pattern matches
foreach ($comments as $k => $v){
array_shift($comments[$k]);
}

// Strip out all the white space from 'userName', 'postedDate' and 'commentText'
foreach ($comments as $k => $v){
$comments[$k][1] = trim($comments[$k][1]);
$comments[$k][3] = trim($comments[$k][3]);
$comments[$k][4] = trim($comments[$k][4]);
}

// Echo the XML
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<first_child>\n";
foreach ($comments as $k => $v){
echo " <commentData>\n";
echo " <userID>{$comments[$k][0]}</userID>\n";
echo " <thumbnail>{$comments[$k][2]}</thumbnail>\n";
echo " <postedDate>{$comments[$k][3]}</postedDate>\n";
echo " <userName>{$comments[$k][1]}</userName>\n";
echo " <commentText>{$comments[$k][4]}</commentText>\n";
echo " </commentData>\n";
}
echo "</first_child>\n";

?>[/code]
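To make the `PREG_SET_ORDER` and `array_shift` steps in the script easier to follow, here is a small sketch run against made-up markup. The HTML fragment and the shortened pattern are assumptions for illustration only; the real MySpace page and the pattern above are more involved.

```php
<?php
// Made-up markup, only to show the shape of the result; the real page differs.
$html = '<td><a href="profile?id=11">alice</a> <img src="a.jpg"></td>'
      . '<td><a href="profile?id=22">bob</a> <img src="b.jpg"></td>';

// Three capture groups: user ID, user name, thumbnail URL.
$pattern = '/id=(\d+)">(.*?)<\/a>.*?src="(.*?)"/is';
preg_match_all($pattern, $html, $comments, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $comments is one complete match:
// index 0 holds the full matched text, indexes 1..3 hold the groups.
foreach ($comments as $k => $v) {
    array_shift($comments[$k]); // drop the full match; groups re-index from 0
}

// $comments[0] is now array('11', 'alice', 'a.jpg')
// $comments[1] is now array('22', 'bob', 'b.jpg')
```

Grouping by match rather than by capture group is what lets the later loops treat each `$comments[$k]` as one comment record.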

euroiplayer · December 4, 2006

:) Thanks for all the great help Huggie

HuggieBear · December 4, 2006

No problem, this is of course liable to break the moment MySpace make any changes, but it should just be a case of changing the regular expression.

I found that it worked OK for normal alphanumeric characters, but when someone had a username that had special characters in, the XML didn't like it.

You'll have to figure that one out for yourself.

Regards
Huggie
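One likely cause of the special-character problem is that characters such as `&` and `<` are written into the XML unescaped. A minimal sketch of one way to handle that, assuming the scraped strings are UTF-8 and using the same array positions as the script above (0 = userID, 1 = userName, 2 = thumbnail, 3 = postedDate, 4 = commentText). The helper names `xml_escape` and `comment_to_xml` are made up here.

```php
<?php
// Hypothetical helper: escape XML special characters in a scraped value.
function xml_escape($text)
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Hypothetical helper: build one <commentData> element from a comment row.
function comment_to_xml(array $c)
{
    return " <commentData>\n"
         . "  <userID>" . xml_escape($c[0]) . "</userID>\n"
         . "  <thumbnail>" . xml_escape($c[2]) . "</thumbnail>\n"
         . "  <postedDate>" . xml_escape($c[3]) . "</postedDate>\n"
         . "  <userName>" . xml_escape($c[1]) . "</userName>\n"
         . "  <commentText>" . xml_escape($c[4]) . "</commentText>\n"
         . " </commentData>\n";
}

// Example: a user name containing "&" and "<" no longer breaks the XML.
echo comment_to_xml(array('1', 'Tom & <Jerry>', 't.jpg', '28.Nov.2006 13:05', 'hi'));
```

If the page is not actually UTF-8, the text would need converting first; that part is left out of this sketch.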

chrishawkins · November 8, 2007

Hey Huggie,

I am a noob here as well as to the world of PHP and cURL. However, I was wondering if you would be willing to lend me some of your expertise with a project that I am working on. It is somewhat similar to the topics posted in this thread. Let me know if you would be willing to help out. Like I said, I am new here, so I am still unable to send private messages. So, if you're interested, I guess you can either post back here or send me a private message with some contact info.

Cheers

Chris
