Need help (Making a scraper)


euroiplayer

Recommended Posts

Hello everybody,

I need help making a scraper (scraping comments off of a MySpace comments page). I have tried many times and have failed miserably. Please help me out.

Here is an example...

http://www.website123.com/scrapes.php?action=friends&profile=64100451
...and this would be the output from that website to an XML format...
[quote]<?xml version="1.0" encoding="UTF-8"?>
<first_child>

<commentData>
<userID>64100451</userID>
<thumbnail>http://myspace-308.vo.llnwd.net/01133/80/39/1133389308_s.jpg</thumbnail>
<postedDate>28.Nov.2006 13:05</postedDate>
<userName>natalie</userName>
<commentText></commentText>
</commentData>

</first_child> [/quote]

That's an example above of one comment being scraped from a whole list.  This is the page that the comments would be scraped from: [url=http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451]http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=64100451[/url]


So basically, whatever ID number I enter here (#####)... http://www.website123.com/scrapes.php?action=friends&profile=##### ... it would scrape the comments from that MySpace profile.
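To make the idea concrete, here is a rough sketch of how scrapes.php might read its query string. The function name and validation rules are my assumptions; only the parameter names ("action", "profile") come from the example URL above.

```php
<?php
// Hypothetical helper for scrapes.php: turn the query-string parameters
// into the MySpace comments URL to scrape. Returns null on bad input.

function buildCommentsUrl($action, $profile)
{
    // Only act on the expected action and a purely numeric friend ID.
    if ($action !== 'friends' || !ctype_digit($profile)) {
        return null;
    }
    // Point at the MySpace comments page for that friend ID.
    return 'http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=' . $profile;
}

// Typically called with the request parameters, e.g.:
// $url = buildCommentsUrl($_GET['action'], $_GET['profile']);
```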

If you have any questions, feel free to ask.
Any help would be GREATLY appreciated.  Thanks

This is the code that I started with and have modified over and over -- with no results.  So if possible, could you please modify the code for me?

[quote]<?php

// Screen scraping your way into RSS
// Example script, by Dennis Pallett
// http://www.phpit.net/tutorials/screenscrap-rss

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all ("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);
// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
                <title>PHPit Latest Content</title>
                <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
                <link>http://www.phpit.net</link>
                <language>en-us</language>
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
        // First, get title
        preg_match ("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
        $title = $temp[1];
        $title = strip_tags($title);
        $title = trim($title);

        // Second, get url
        preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
        $url = $temp[1];
        $url = trim($url);

        // Third, get text
        preg_match ("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
        $text = $temp[1];
        $text = trim($text);

        // Fourth, and finally, get author
        preg_match ("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
        $author = $temp[1];
        $author = trim($author);

        // Echo RSS XML
        echo "<item>\n";
                echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
                echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
                echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
                echo "\t\t\t<content:encoded><![CDATA[ \n";
                echo $text . "\n";
                echo " ]]></content:encoded>\n";
                echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
        echo "\t\t</item>\n";
}
?>
</channel>
</rss>[/quote]

Note: it displays the output as RSS; I need it to be XML like the example above.


Thanks


I am not sure, but I believe it can be done with cURL?  Is there a way I can put a username and password in scrapes.php itself, or maybe pull them from a MySQL database, so that it logs into MySpace first and then goes to the user's comments page?
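For what it's worth, the usual cURL approach is a cookie jar: POST the credentials once, save the session cookies, then re-use them for the comments request. A sketch only; the login URL and the "email"/"password" form-field names below are guesses and would need to be taken from MySpace's actual login form.

```php
<?php
// Encode hypothetical login form fields for CURLOPT_POSTFIELDS.
function buildLoginPostFields($email, $password)
{
    return http_build_query(array('email' => $email, 'password' => $password));
}

// Log in first, then fetch the comments page with the same cookies.
function fetchCommentsWithLogin($email, $password, $friendId)
{
    $cookieFile = tempnam(sys_get_temp_dir(), 'msp');

    // Step 1: POST the credentials and store the session cookies.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.myspace.com/login'); // assumed URL
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildLoginPostFields($email, $password));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    curl_exec($ch);

    // Step 2: request the comments page, sending those cookies back.
    curl_setopt($ch, CURLOPT_URL, 'http://comment.myspace.com/index.cfm?fuseaction=user.viewComments&friendID=' . $friendId);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    $data = curl_exec($ch);

    curl_close($ch);
    return $data;
}
```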

Here, check this website out..
[url=http://scrapes.php?action=comments&profile=23432]scrapes.php?action=comments&profile=23432[/url]
Note: you can change the ####.
It might display an error, but the output does the job anyway; for example, when you view the source, everything is there.

Tell me what you think... is it possible?
Link to comment
Share on other sites

The website didn't really help, but I did come up with the following code, so give it a try.  You may have problems with the XML but you can probably fix them better than I can.

Regards
Huggie

P.S. Admins/Mods: I didn't use [nobbc][code][/code][/nobbc] tags for this as it throws the syntax colouring out.  I would normally use it ;)

[code=php:0]<?php

// Turn on all error reporting
error_reporting(E_ALL);

// Create a new CURL handle
$ch = curl_init();

// Set the CURL options
curl_setopt($ch, CURLOPT_URL, "http://www.myspace.com/tom");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

// Put the data received from the page into a variable
$data = curl_exec($ch);

// Output an error to the browser if execute errors
if (curl_errno($ch)){
  echo "Error: " . curl_error($ch);
}

// Close the object
curl_close($ch);

// Regular expression to capture the data into the comments array
$pattern = '/bgcolor="FF9933" style="word-wrap: break-word">.*?id=(\d+)".*?>(.*?)<.*?src="(.*?)".*?text10">(.*?)<.*?<br>.*?<br>(.*?)<\/td>/ims';
preg_match_all($pattern, $data, $comments, PREG_SET_ORDER);

// Dispose of the full pattern matches
foreach ($comments as $k => $v){
  array_shift($comments[$k]);
}

// Strip out all the white space from 'userName', 'postedDate' and 'commentText'
foreach ($comments as $k => $v){
  $comments[$k][1] = trim($comments[$k][1]);
  $comments[$k][3] = trim($comments[$k][3]);
  $comments[$k][4] = trim($comments[$k][4]);
}

// Echo the XML
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<first_child>\n";
foreach ($comments as $k => $v){
  echo " <commentData>\n";
  echo "  <userID>{$comments[$k][0]}</userID>\n";
  echo "  <thumbnail>{$comments[$k][2]}</thumbnail>\n";
  echo "  <postedDate>{$comments[$k][3]}</postedDate>\n";
  echo "  <userName>{$comments[$k][1]}</userName>\n";
  echo "  <commentText>{$comments[$k][4]}</commentText>\n";
  echo " </commentData>\n";
}
echo "</first_child>\n";

?>[/code]

No problem, this is of course liable to break the moment MySpace make any changes, but it should just be a case of changing the regular expression.

I found that it worked OK for normal alphanumeric characters, but when someone had a username that had special characters in, the XML didn't like it.

You'll have to figure that one out for yourself.
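One possible fix for the special-character problem is to escape each captured value before it is echoed into the XML. A sketch, assuming the $comments array from the script above; ENT_QUOTES also covers quotes, in case a value ever ends up inside an attribute:

```php
<?php
// Escape a scraped value so &, <, >, " and ' can't break the XML.
function xmlEscape($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// In the echo loop above, each field would then be wrapped, e.g.:
// echo "  <userName>" . xmlEscape($comments[$k][1]) . "</userName>\n";
```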

Regards
Huggie

Hey Huggie,

I am a noob here as well as to the world of PHP and cURL, but I was wondering if you would be willing to lend me some of your expertise with a project that I am working on. It is somewhat similar to the topics posted in this thread. Let me know if you would be willing to help out. Like I said, I am new here, so I am still unable to send private messages. So, if you're interested, I guess you can either post back here or send me a private message with some contact info.

Cheers

Chris
