Page Scraping


Yakooza

Hello, I'm currently developing a clan site for Battlefield Heroes.

I wanted to have a page where our members' stats, such as kills/deaths, are posted.

 

The official BFH website has a ranking system implemented, so I thought this should be possible. I searched Google for help and found an article on page scraping which gave me this code, but it did not work.

 

<html>
<head>
  <title>PHP Test</title>
</head>
<body>

<?php

    $url = "http://www.amazon.com/exec/obidos/ASIN/1904151191/";

    // Initialise the buffer so the .= below does not raise a notice
    $file = "";

    $filepointer = fopen($url, "r");

    if ($filepointer) {

        // Read the remote page line by line into $file
        while (!feof($filepointer)) {

            $buffer = fgets($filepointer, 4096);

            $file .= $buffer;

        }

        fclose($filepointer);

    } else {

        die("Could not create a connection to Amazon.com");

    }

    ?>

    <?php

          preg_match("/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i",$file,$match);

         $result = $match[1];

         echo $result;   

     ?>



</body>
</html>

 

The outcome of this is

 Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i",$file,$match); $result = $match[1]; echo $result; ?>  

...

 

I'm coming here for help. Is there any way I can do this?

 

Btw, here is the rank page:

http://www.battlefieldheroes.com/heroes/191233146

Link to comment
Share on other sites

Thankfully I've had the same problem before. There's a very simple single-file PHP library that can parse HTML/DOM for you; you just need to include it:

http://simplehtmldom.sourceforge.net/

 

You can make the code as simple as this:

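// Load the parser, then create a DOM from a URL or file
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');

// Collect results in their own array rather than overwriting $html
$items = array();
foreach($html->find('div.scores') as $block) {
    $item = array();
    $item['title'] = $block->find('div.title', 0)->plaintext;
    $item['rank']  = $block->find('div.rank', 0)->plaintext;
    $items[] = $item;
}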

 

Something like that. You can find divs and output their inner HTML, attributes and so on, for example an image's src:

echo $element->src . '<br>';

Link to comment
Share on other sites

This really shouldn't be complicated.

All I need to do is pull some info from another site.

 

I also tried starting a new document, putting this code in it and saving it as a .php file.

 

<?php

$data = file_get_contents('http://www.warbeats.com/Default.aspx');

$regex = '/been working on this (.+?) day/';

preg_match($regex,$data,$match);

var_dump($match);

echo $match[1];

?>

 

It should work...but I see no output.

 

Do I have to call the function or something?

Link to comment
Share on other sites

Oh, you simply download the DOM parser and open it up. All you need is the one PHP file, 'simple_html_dom.php', but you can browse the examples if you wish. For example, let's say you wanted to grab Google's logo.

 

<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
// Passing an index returns a single element (or null) instead of an array
$logo = $html->find('#logo', 0);
echo $logo;

 

That'll return the #logo div with the image. You can easily traverse to someone's stats like you wanted and parse the content afterwards. There are plenty of examples here:

http://simplehtmldom.sourceforge.net/manual.htm

Link to comment
Share on other sites

Page scraping is probably easier / better done client side using jQuery these days.

Just a while ago I stumbled upon phpQuery, which is a port of jQuery to PHP. That might also help: http://code.google.com/p/phpquery/

 

However, I don't think you really need a third-party script. The DOMDocument class is enough to do this sort of task easily. Here is a recent thread about regex and DOMDocument which might help you:

http://www.phpfreaks.com/forums/index.php/topic,264032.0.html
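
To give an idea of what the DOMDocument route could look like, here is a minimal sketch. It parses an inline HTML string instead of a live page, and the class names "stat", "label" and "value" as well as the function name extract_stats() are made up for illustration; the real stats page will use different markup, so the XPath queries would need adjusting.

```php
<?php
// Sample markup standing in for a fetched stats page.
// The class names below are assumptions, not the real site's markup.
$page = <<<HTML
<html><body>
  <div class="stat"><span class="label">Kills</span><span class="value">1204</span></div>
  <div class="stat"><span class="label">Deaths</span><span class="value">987</span></div>
</body></html>
HTML;

// Hypothetical helper: returns an array of label => value pairs.
function extract_stats(string $page): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $stats = [];
    foreach ($xpath->query('//div[@class="stat"]') as $row) {
        // Query relative to the current row for its label and value.
        $label = $xpath->query('.//span[@class="label"]', $row)->item(0);
        $value = $xpath->query('.//span[@class="value"]', $row)->item(0);
        if ($label !== null && $value !== null) {
            $stats[trim($label->textContent)] = trim($value->textContent);
        }
    }
    return $stats;
}

$stats = extract_stats($page);
foreach ($stats as $name => $value) {
    echo $name . ': ' . $value . "\n";
}
```

For a live page you would pass the result of file_get_contents($url) to the function instead of the sample string, assuming your host allows remote URLs to be opened.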

 

 

Link to comment
Share on other sites

I still don't see the point of doing this with a server side script when jQuery alone is perfectly capable of doing this.

 

Doing it client-side means no overhead at all on your server, the client simply makes the request to the external source themselves.

Link to comment
Share on other sites

I still don't see the point of doing this with a server side script when jQuery alone is perfectly capable of doing this.

 

Doing it client-side means no overhead at all on your server, the client simply makes the request to the external source themselves.

Totally true. However, if you do this server-side you could create a cron job or scheduled task and store the required data locally (in a database or an XML file, for example). After that it is a simple matter of fetching the data locally, which improves performance.

 

I haven't tried using jQuery to scrape a remote site yet. Wouldn't that run into cross-domain scripting restrictions? I want to try it out; so far I've got JSONP working for cross-domain requests, but not reading a whole remote page as a string.
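
A rough sketch of the "store it locally" idea, under a few assumptions: it uses a JSON file with a maximum age rather than a real cron job or database, and the function name cached_stats(), the cache file name and the sample numbers are all invented for illustration. The $fetch callback is where the actual scraping would go.

```php
<?php
// Hypothetical helper: return the cached data if it is fresh enough,
// otherwise call $fetch and store its result for next time.
function cached_stats(string $cacheFile, int $maxAge, callable $fetch): array
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    // Cache is missing, stale or unreadable: fetch again and save.
    $fresh = $fetch();
    file_put_contents($cacheFile, json_encode($fresh), LOCK_EX);
    return $fresh;
}

// Usage: the closure stands in for the real scraping code.
$cacheFile = sys_get_temp_dir() . '/clan_stats_example.json';
@unlink($cacheFile); // start from a clean state for this demo

$calls = 0;
$fetch = function () use (&$calls): array {
    $calls++;
    return ['kills' => 1204, 'deaths' => 987]; // made-up sample data
};

$first  = cached_stats($cacheFile, 3600, $fetch); // triggers the fetch
$second = cached_stats($cacheFile, 3600, $fetch); // served from the file
```

With a cron job you would run the fetching part on a schedule instead, and the page itself would only ever read the stored file.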

 

I tried copying the examples exactly from the simplehtmldom parser, except I changed the directory of the include file, but none of them returned an output...

 

What's wrong here?

What code exactly did you use? Just saying "but none of them returned an output" gives us too little to work with to help you.

Link to comment
Share on other sites

This is the code I tried:

 

<?php
include_once('/simple_html_dom.php');

function scraping_slashdot() {
    // start with an empty result so the function always returns an array
    $ret = array();

    // create HTML DOM
    $html = file_get_html('http://slashdot.org/');

    // get article block
    foreach($html->find('div[id^=firehose-]') as $article) {
        // get title
        $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
        // get body
        $item['body'] = trim($article->find('div.body', 0)->plaintext);

        $ret[] = $item;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}

// -----------------------------------------------------------------------------
// test it!
$ret = scraping_slashdot();

foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['body'].'</li>';
    echo '</ul>';
}
?>

<?php
include_once('../../simple_html_dom.php');

function scraping_IMDB($url) {
    // create HTML DOM
    $html = file_get_html($url);

    // get title
    $ret['Title'] = $html->find('title', 0)->innertext;

    // get rating
    $ret['Rating'] = $html->find('div[class="general rating"] b', 0)->innertext;

    // get overview
    foreach($html->find('div[class="info"]') as $div) {
        // skip user comments
        if($div->find('h5', 0)->innertext=='User Comments:')
            return $ret;

        $key = '';
        $val = '';

        foreach($div->find('*') as $node) {
            if ($node->tag=='h5')
                $key = $node->plaintext;

            if ($node->tag=='a' && $node->plaintext!='more')
                $val .= trim(str_replace("\n", '', $node->plaintext));

            if ($node->tag=='text')
                $val .= trim(str_replace("\n", '', $node->plaintext));
        }

        $ret[$key] = $val;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}


// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB('http://imdb.com/title/tt0335266/');

foreach($ret as $k=>$v)
    echo '<strong>'.$k.' </strong>'.$v.'<br>';
?>

 

 

<?php
include_once('../../simple_html_dom.php');

function scraping_digg() {
    // start with an empty result so the function always returns an array
    $ret = array();

    // create HTML DOM
    $html = file_get_html('http://digg.com/');

    // get news block
    foreach($html->find('div.news-summary') as $article) {
        // get title
        $item['title'] = trim($article->find('h3', 0)->plaintext);
        // get details
        $item['details'] = trim($article->find('p', 0)->plaintext);
        // get intro
        $item['diggs'] = trim($article->find('li a strong', 0)->plaintext);

        $ret[] = $item;
    }
    
    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}


// -----------------------------------------------------------------------------
// test it!

// "http://digg.com" will check user_agent header...
ini_set('user_agent', 'My-Application/2.5');

$ret = scraping_digg();

foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['details'].'</li>';
    echo '<li>Diggs: '.$v['diggs'].'</li>';
    echo '</ul>';
}

?>

 

None of them produced any output. When I opened the page containing this code, it was completely blank. I'm guessing it's outdated?
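
A completely blank page in PHP often means a fatal error occurred while error display is switched off, for example when an include path is wrong. As a hedged starting point for narrowing it down, something like the following could go at the top of the script. The helper name locate_library() is made up for this sketch, and it assumes simple_html_dom.php is meant to sit in the same directory as the script.

```php
<?php
// Show errors while debugging; switch this off again on a live site.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: build the full path to the library inside $dir
// and return it, or null if no such file exists.
function locate_library(string $dir, string $name = 'simple_html_dom.php'): ?string
{
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
    return is_file($path) ? $path : null;
}

$library = locate_library(__DIR__);
if ($library === null) {
    echo "simple_html_dom.php was not found next to this script\n";
} else {
    include_once $library;
    echo "Loaded $library\n";
}

// Opening remote URLs with the file functions needs allow_url_fopen.
echo 'allow_url_fopen is ' . (ini_get('allow_url_fopen') ? 'on' : 'off') . "\n";
```

If the file is reported missing, fix the include path first. If allow_url_fopen is off, the remote fetch will fail no matter what the parsing code looks like.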

 

Link to comment
Share on other sites
