Page Scraping

Yakooza · August 13, 2009

Hello, I'm currently developing a clan site for Battlefield Heroes.

I wanted to have a page where our members' stats, such as kills/deaths, are posted.

The official BFH website has a ranking system implemented, so I thought this should be possible. I searched Google for help and found an article on page scraping which gave me this code, but it did not work.

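<html>
<head>
  <title>PHP Test</title>
</head>
<body>

<?php
    $url = "http://www.amazon.com/exec/obidos/ASIN/1904151191/";

    // Initialise the buffer so the .= below does not raise an "undefined variable" notice
    $file = '';

    $filepointer = fopen($url,"r");

    if($filepointer){
        // Read the remote page in chunks of up to 4096 bytes until end of file
        while(!feof($filepointer)){
            $buffer = fgets($filepointer, 4096);
            $file .= $buffer;
        }
        fclose($filepointer);
    } else {
        die("Could not create a connection to Amazon.com");
    }
?>

<?php
    // Capture the text that follows the "Amazon.com Sales Rank:" label
    preg_match("/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i",$file,$match);

    $result = $match[1];

    echo $result;
?>

</body>
</html>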

The outcome of this is

 Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i",$file,$match); $result = $match[1]; echo $result; ?>

...

I'm coming here for help. Is there any way I can do this?

Btw, here is the rank page:

http://www.battlefieldheroes.com/heroes/191233146

oni-kun · August 13, 2009

Thankfully I've had the same problem before. There's a very simple single PHP file that can parse HTML/DOM for you; you just need to include it.

http://simplehtmldom.sourceforge.net/

You can make the coding as simple as this!

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Collect one entry per matching block; separate names keep $html from being overwritten
$items = array();
foreach($html->find('div.scores') as $scores) {
    $item['title'] = $scores->find('div.title', 0)->plaintext;
    $item['rank']  = $scores->find('div.rank', 0)->plaintext;
    $items[] = $item;
}

Something like that. It can find divs and output their inner HTML and so on.

echo $element->src . '<br>';

Yakooza · August 13, 2009

Sorry, but I'm new and did not fully understand what you are saying.

Can you explain more thoroughly how to use the code to extract data from another site and put the output on a page?

trq · August 13, 2009

Page scraping is probably easier / better done client side using jQuery these days.

Yakooza · August 13, 2009

This really shouldn't be complicated.

All I need to do is pull some info from another site.

I also tried starting a new document, putting this code in it, and saving it as a .php file.

<?php
$data = file_get_contents('http://www.warbeats.com/Default.aspx');

$regex = '/been working on this (.+?) day/';

preg_match($regex,$data,$match);

var_dump($match);

echo $match[1];

?>

It should work... but I see no output.

Do I have to call a function or something?

oni-kun · August 13, 2009

Oh, you simply download the DOM parser and open it up. All you need is the one PHP file, 'simple_html_dom.php', but you can browse the examples if you wish. For example, let's say you wanted to grab Google's logo.

<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
$logo = $html->find('#logo');
echo $logo;

That'll return the div #logo with the image. You can easily traverse to someone's stats like you wanted and parse the content afterwards. There are plenty of examples here:

http://simplehtmldom.sourceforge.net/manual.htm

Yakooza · August 13, 2009

Ok I tried that

The outcome was

Array

How come?

trq · August 13, 2009

Yakooza wrote: "Ok I tried that. The outcome was: Array. How come?"

Because you're echoing an array, perhaps? Post some code.
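To illustrate that guess: in PHP, echoing an array prints the literal word "Array" rather than its contents. The following is only a minimal sketch, assuming the lookup returned a list of matches. The `$matches` array and its contents are hypothetical stand-ins, not real output from simple_html_dom.

```php
<?php
// Hypothetical stand-in for a lookup that returns every match as an array.
$matches = array('<img src="logo.gif" alt="logo">');

// echo $matches;  // would print "Array" (plus a notice), not the markup

// Option 1: pick a single element before echoing.
$first = isset($matches[0]) ? $matches[0] : '';
echo $first, "\n";

// Option 2: loop over all matches.
foreach ($matches as $match) {
    echo $match, "\n";
}

// For debugging, print_r() shows the structure of an array.
print_r($matches);
```

With simple_html_dom, passing an index as the second argument, as in the `find('div.title', 0)` call shown earlier in the thread, returns a single element instead of an array.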

RichardRotterdam · August 13, 2009

trq wrote: "Page scraping is probably easier / better done client side using jQuery these days."

A while ago I stumbled upon phpQuery, which is a port of jQuery to PHP. That might also help: http://code.google.com/p/phpquery/

However, I don't think you really need a third-party script. The DOMDocument class is enough to easily do this sort of task. Here is a recent thread about regex and DOMDocument which might help you:

http://www.phpfreaks.com/forums/index.php/topic,264032.0.html
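As a rough sketch of the DOMDocument approach, the example below parses an inline HTML string and pulls two values out with DOMXPath. The markup, the class names `kills` and `deaths`, and the helper `first_text()` are all made up for illustration. The real stats page will have a different structure that you would need to inspect first.

```php
<?php
// Hypothetical markup standing in for a downloaded stats page.
$page = '<html><body>
  <div class="stats">
    <span class="kills">120</span>
    <span class="deaths">45</span>
  </div>
</body></html>';

// Hypothetical helper: trimmed text of the first node matching the query, or null.
function first_text(DOMXPath $xpath, $query) {
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$kills  = first_text($xpath, '//span[@class="kills"]');
$deaths = first_text($xpath, '//span[@class="deaths"]');

echo "Kills: $kills, Deaths: $deaths\n";
```

For a live page, `$page` would come from something like `file_get_contents($url)`, as used earlier in the thread, provided the host allows opening remote URLs.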

trq · August 13, 2009

I still don't see the point of doing this with a server side script when jQuery alone is perfectly capable of doing this.

Doing it client-side means no overhead at all on your server; the client simply makes the request to the external source themselves.

Yakooza · August 13, 2009

I'm hosting this on Byethost.

I can't do it through the client.

Yakooza · August 13, 2009

I tried copying the examples exactly from the simplehtmldom parser, except I changed the directory of the include file, but none of them returned any output...

What's wrong here?

RichardRotterdam · August 13, 2009

trq wrote: "I still don't see the point of doing this with a server side script when jQuery alone is perfectly capable of doing this. Doing it client-side means no overhead at all on your server; the client simply makes the request to the external source themselves."

Totally true. However, if you are doing this server-side you could create a cron job / scheduled task and store the required data locally (in a database or XML, for example). After that it would be a simple task of fetching the required data locally, which increases performance.
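A simpler variation on the same idea, sketched here as a time-based file cache instead of a cron job plus database: the remote page is only fetched when the local copy is older than a chosen age. The function name `cached_fetch()`, the cache file name, and the `$fetchStats` stand-in are hypothetical, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Return cached data if it is fresh enough; otherwise call $fetch and store the result.
// $fetch is any callable that returns a string (for example, a scraper function).
function cached_fetch($cacheFile, $maxAgeSeconds, $fetch) {
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }
    $data = call_user_func($fetch);
    file_put_contents($cacheFile, $data, LOCK_EX);
    return $data;
}

// Hypothetical scraper stand-in; counts how often the "remote" fetch really runs.
$calls = 0;
$fetchStats = function () use (&$calls) {
    $calls++;
    return "kills=120;deaths=45";
};

$cacheFile = sys_get_temp_dir() . '/clan_stats_' . uniqid() . '.cache';

$first  = cached_fetch($cacheFile, 3600, $fetchStats); // runs the fetch
$second = cached_fetch($cacheFile, 3600, $fetchStats); // served from the cache file

echo $second, "\n";
unlink($cacheFile); // clean up the demo file
```

The trade-off is that the first visitor after the cache expires waits for the remote request, whereas a scheduled task would refresh the data in the background.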

I haven't tried to use jQuery yet for scraping a remote site. Wouldn't that run into cross-domain scripting restrictions? I want to try that out. So far I've been getting JSONP to work for cross-domain scripting, but not reading a whole remote site as a string.

Yakooza wrote: "I tried copying the examples exactly from the simplehtmdom parser, except I changed the directory of the include file, but none of them returned an output... Whats wrong here?"

What code exactly did you use? Just saying "but none of them returned an output" is too little to work with for giving you help.

Yakooza · August 13, 2009

These are the code samples I tried:

<?php
include_once('/simple_html_dom.php');

function scraping_slashdot() {
    // create HTML DOM
    $html = file_get_html('http://slashdot.org/');

    // get article block
    foreach($html->find('div[id^=firehose-]') as $article) {
        // get title
        $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
        // get body
        $item['body'] = trim($article->find('div.body', 0)->plaintext);

        $ret[] = $item;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}

// -----------------------------------------------------------------------------
// test it!
$ret = scraping_slashdot();

foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['body'].'</li>';
    echo '</ul>';
}
?>

<?php
include_once('../../simple_html_dom.php');

function scraping_IMDB($url) {
    // create HTML DOM
    $html = file_get_html($url);

    // get title
    $ret['Title'] = $html->find('title', 0)->innertext;

    // get rating
    $ret['Rating'] = $html->find('div[class="general rating"] b', 0)->innertext;

    // get overview
    foreach($html->find('div[class="info"]') as $div) {
        // skip user comments
        if($div->find('h5', 0)->innertext=='User Comments:')
            return $ret;

        $key = '';
        $val = '';

        foreach($div->find('*') as $node) {
            if ($node->tag=='h5')
                $key = $node->plaintext;

            if ($node->tag=='a' && $node->plaintext!='more')
                $val .= trim(str_replace("\n", '', $node->plaintext));

            if ($node->tag=='text')
                $val .= trim(str_replace("\n", '', $node->plaintext));
        }

        $ret[$key] = $val;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}


// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB('http://imdb.com/title/tt0335266/');

foreach($ret as $k=>$v)
    echo '<strong>'.$k.' </strong>'.$v.'<br>';
?>

<?php
include_once('../../simple_html_dom.php');

function scraping_digg() {
    // create HTML DOM
    $html = file_get_html('http://digg.com/');

    // get news block
    foreach($html->find('div.news-summary') as $article) {
        // get title
        $item['title'] = trim($article->find('h3', 0)->plaintext);
        // get details
        $item['details'] = trim($article->find('p', 0)->plaintext);
        // get intro
        $item['diggs'] = trim($article->find('li a strong', 0)->plaintext);

        $ret[] = $item;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}


// -----------------------------------------------------------------------------
// test it!

// "http://digg.com" will check user_agent header...
ini_set('user_agent', 'My-Application/2.5');

$ret = scraping_digg();

foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['details'].'</li>';
    echo '<li>Diggs: '.$v['diggs'].'</li>';
    echo '</ul>';
}

?>

None of them produced any output. When I opened the page containing this code, it was completely blank. I'm guessing it's outdated?

RichardRotterdam · August 14, 2009

The first code snippet works fine for me. What PHP version are you running? And do you have error reporting turned on?

Yakooza · August 15, 2009

It's running 5.2.10 and I'm not sure about error reporting. I'm not hosting the site myself o.O

RichardRotterdam · August 15, 2009

If, for example, you create an error on purpose, do you see an error in your browser?

Check the manual for error_reporting.

Also, you might want to test things on a local PC first.
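A minimal sketch of turning error output on at the top of a script while debugging. Whether a shared host honours these runtime settings is an assumption you would need to check with the host.

```php
<?php
// Report every kind of error and show them in the page output.
// Only suitable while debugging; turn display_errors off on a live site.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Print the active settings to confirm they took effect.
echo 'error_reporting level: ', error_reporting(), "\n";
echo 'display_errors: ', ini_get('display_errors'), "\n";
```

Note that a parse error in the same file will still give a blank page, because the script never gets far enough to run these lines.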
