[SOLVED] Scraping With Regex

tail · June 7, 2009

I've been reading some tutorials trying to understand regex but it's not making sense to me. I'm trying to scrape the name of a game but I'm not sure if I'm doing it correctly. This is the code I'm using:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Output:

array(2) { [0]=>  string(40) "
Doom<" [1]=> string(39) "
Doom" }

Why is it that I'm getting two results? And why does one have a "<" in it?

.josh · June 7, 2009

Element 0 contains the full regex match. Element 1 contains the first captured match (what you have in parentheses). Element 2 would contain the second captured match, and so on.
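A minimal sketch of that behaviour, assuming a one-line stand-in for the page source ($html here is made up for illustration; the real script fetches it with file_get_contents()). The output is escaped so the captured tags stay visible in a browser:

```php
<?php
// Hypothetical stand-in for the fetched page source.
$html = '<div class="picTitleCentre"><b>Doom</b></div>';

// Same pattern as in the question: one capturing group, then one extra wildcard.
$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';

preg_match($regex, $html, $matches);

// $matches[0] is the whole match, including the "<" consumed by the trailing dot.
echo strlen($matches[0]), ': ', htmlspecialchars($matches[0]), "\n";

// $matches[1] is only what the parentheses captured.
echo strlen($matches[1]), ': ', htmlspecialchars($matches[1]), "\n";
```

The two lengths differ by exactly one character, which is the "<" picked up by the dot.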

tail · June 7, 2009

Where is the full regex match coming from? Is there a way to just capture what I have in the parenthesis?

.josh · June 7, 2009

Well, you have your parentheses wrapped around your whole regex, except for the one wildcard at the end. You could remove that dot and the parentheses, and then element 0 would hold what element 1 holds now. But in general, element 0 is always the full regex match, so that would just coincidentally work for you. There's nothing wrong with using element 1 in the first place.

.josh · June 7, 2009

And by the way, you might want to right-click > View Source on that output of yours. You do know that you are capturing the div and b tags along with the title, and not just the title itself, right?

tail · June 7, 2009

No, I didn't know that. What I'm trying to do is something similar to what I scraped off another website using this:

preg_match('~<title>([^<]*).+(/Games/[^.]+.swf).+Categories:\s*(.*?)<br />.+?GameDescription">([^<]*)~s', $html, $matches);

Which returned the name of the game, category, description, and link to the .swf. What I'm trying to accomplish is the same thing except for the category because this site doesn't list it. I got the div and b tag out like this:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*)</b>~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

But how do I retrieve the rest of the info I'm looking for?

tail · June 7, 2009

I managed to get the link to the .swf, the name, and the description:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Is that the best way to do it?

thebadbad · June 7, 2009

Seems fine if it works for you. But instead of using [^<>"] it would be sufficient to use [^"], meaning any character NOT a double quote (that's what the ^ does, when it begins a character class).
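A quick sketch of that point, assuming a short made-up fragment modelled on the page's markup. Both character classes stop at the closing double quote, so the capture is the same:

```php
<?php
// Hypothetical fragment; only the shape of the <param> tag matters here.
$html = '<param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf"><param name=quality value=high>';

// Original class: stop at "<", ">" or a double quote.
preg_match('~<param name=movie value="([^<>"]*)~', $html, $a);

// Simpler class: stop at the first double quote.
preg_match('~<param name=movie value="([^"]*)~', $html, $b);

var_dump($a[1] === $b[1]);
echo $b[1], "\n";
```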

nrg_alpha · June 7, 2009

Usually, .+ (or even .*) is frowned upon, because of potential speed issues and, even worse, accuracy issues. You can view posts #11 and #14 from this thread for a more complete explanation. At the very least, use .+? in these cases (if it's possible to use negated character classes, that would be preferable).

These parts could prove problematic:

$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';

Using the first of those parts, ([^<]*).+, as an example: you are telling the regex engine to match anything that is not a <, zero or more times, so at this point the engine matches everything up to (but not including) the < character. Then the .+ forces the engine to match the next character (which is now that <), and since it is greedy, it keeps going to the end of the string and then backtracks until it finds a <param... it can match, which is the last one in the page rather than the first. So what happens to the first tag it ran into? Is that the <param tag you are looking for?
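To make the accuracy concern concrete, here is a small sketch. The markup is made up (two movie params, with file names first.swf and second.swf chosen purely for illustration) and is not taken from the real page:

```php
<?php
// Hypothetical markup with TWO <param name=movie ...> tags.
$html = '<div class="picTitleCentre"><b>Doom</b></div>'
      . '<param name=movie value="first.swf">'
      . '<p>filler</p>'
      . '<param name=movie value="second.swf">';

$greedy = '~<b>([^<]*).+<param name=movie value="([^"]*)~s';
$lazy   = '~<b>([^<]*).+?<param name=movie value="([^"]*)~s';

preg_match($greedy, $html, $g);
preg_match($lazy, $html, $l);

// Greedy .+ runs to the end of the string and backtracks to the LAST param tag.
echo $g[2], "\n"; // second.swf

// Lazy .+? expands one character at a time and stops at the FIRST param tag.
echo $l[2], "\n"; // first.swf
```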

Is this something along the lines of what you are looking for?

Example:

// obviously, I just used this heredoc to simulate the site.. you won't use this...
$html = <<<HTML
<div class="picTitleCentre"><b>Doom</b></div>
        <div class="picTitleRight">

         

        </div>
    </div>
    <div id="nonVidContentDiv">
    <div align="center" style="margin-bottom:12px"><br /><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=10,0,0,0" id=intro width="600" height="375"  align="top"><param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf"><param name=AllowScriptAccess value="always"><param name=quality value=high><param name=bgcolor value=""><embed src="http://208.116.9.205/10/content/17474/games_Doom.swf" AllowScriptAccess="always" quality=high bgcolor="" width="600"  height="375" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object><br /></div>
    </div>
    <div class="picTitle" style="text-align:center;">
HTML;

$regex = '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
preg_match($regex, $html, $match); 
echo $match[1] . "<br />\n"; // $match[1] = Doom
echo $match[2]; //$match[2] = http://208.116.9.205/10/content/17474/games_Doom.swf

nrg_alpha · June 7, 2009

@OP: I just noticed this thread was in the PHP help forum and moved this to the more appropriate PHP Regex board.

thebadbad · June 7, 2009

@nrg

Cheers for pointing me to your explanation, didn't know too much about the engine's method.

Curious as I am, I just ran some tests to see how fast a greedy vs. a lazy search is in our example.

Lazy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si', $html, $matches);

Greedy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si', $html, $matches);

$html is the full source of http://www.kontraband.com/games/17474 hard-copied into the script. The only difference in the above lines is the single question mark in the middle of the regex pattern.

Results:

100 iterations of the lazy code took ~ 0.010 seconds, while 100 iterations of the greedy code took a 'whopping' ~ 0.895 seconds.

But then I added another part to the end of the regex, .+?end of google analytics code([^;]+), and ran the same test. This time the greedy method was as fast as the lazy. Anyone know why?
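For anyone who wants to repeat the timing test, a rough sketch follows. It assumes a synthetic page (a title, one param tag, then filler paragraphs) instead of the real source, so the absolute numbers will not match the ones quoted above:

```php
<?php
// Synthetic stand-in for the page; the real test used the full page source.
$html = '<div class="picTitleCentre"><b>Doom</b></div>'
      . '<param name=movie value="http://example.com/games_Doom.swf">'
      . str_repeat("<p>filler text</p>\n", 2000);

$patterns = [
    'lazy'   => '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si',
    'greedy' => '#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 100; $i++) {
        preg_match($pattern, $html, $matches);
    }
    $elapsed = microtime(true) - $start;

    // Remember the captured URL so the two variants can be compared afterwards.
    $results[$label] = $matches[2];
    printf("%-6s %.4f seconds\n", $label, $elapsed);
}
```

Both variants capture the same URL on this input; only the amount of work the engine does to get there differs.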

nrg_alpha · June 7, 2009

Yeah, definitely, on a single pass (or on small amounts of data), the speed difference won't be huge at all (in fact, perhaps even infinitesimal). There are additional factors to consider, such as where the .+ / .+? is located in the pattern: if it's close to (or at) the very end, the amount of backtracking on, say, .+ will obviously be less than if it were located early on in the pattern. But perhaps even more important than speed is the issue of accuracy (as CV illustrated in the link I posted). In hindsight, that is the bigger issue, I would think (although if I can squeeze out some extra speed while I'm at it, why not?). Using character classes when possible typically beats both the .+ and .+? methods speed-wise (I think there was the odd time where it didn't, but overall it was faster).
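As a small illustration of the character class suggestion, assuming the same one-line stand-in markup used for illustration elsewhere: the negated class can never run past the next "<", so the engine has no alternative positions to try, while the lazy dot has to test for </b> after every character it adds.

```php
<?php
// Hypothetical stand-in markup.
$html = '<div class="picTitleCentre"><b>Doom</b></div>';

// Lazy dot: grows one character at a time, checking for "</b>" after each step.
preg_match('~<b>(.+?)</b>~s', $html, $lazy);

// Negated class: consumes everything up to the next "<" in one go.
preg_match('~<b>([^<]+)</b>~', $html, $negated);

echo $lazy[1], "\n";
echo $negated[1], "\n";
```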

As for the greedy version running as fast as the lazy one once you added .+?end of google analytics code([^;]+) to the end of the regex: I'm not sure I quite follow on that one, TBH.

If you want to really understand how the regex engine *thinks*, my suggestion is the book Mastering Regular Expressions (at least, it helped me out a hell of a lot).

tail · June 8, 2009

Wow thanks for all the help everyone! I got it working now.

thebadbad · June 8, 2009

Thanks for the Mastering Regular Expressions suggestion, I'd better check it out.

tail · June 9, 2009

I have one more question. When the script scrapes the site, the flash file URL comes up in a different format every time. In order to get the thumbnail image, I have to replace a certain part of the filename with "t.jpg".

Ex:

Flash File - http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf

Thumbnail - http://208.116.9.205/1/graphics/games/11623/t.jpg

Now that I have the flash URL, how can I get "http://208.116.9.205/1/graphics/games/" into a string? The folders come up different sometimes.

thebadbad · June 9, 2009

There are simple string functions to deal with something like this:

<?php
$game = 'http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf';
$thumb = substr($game, 0, strrpos($game, '/')) . '/t.jpg';
?>
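The same idea wrapped in a function, as a sketch only: thumbnail_url() is a hypothetical helper name, and the guard for a missing slash is an addition. It also assumes every game keeps its thumbnail as t.jpg in the same folder as the .swf, which the thread only shows for one example:

```php
<?php
// thumbnail_url() is a hypothetical helper name, not part of any library.
function thumbnail_url($swfUrl)
{
    $lastSlash = strrpos($swfUrl, '/');
    if ($lastSlash === false) {
        return null; // no slash at all, so there is no folder part to reuse
    }

    // Keep everything up to and including the last slash, then append the thumbnail name.
    return substr($swfUrl, 0, $lastSlash + 1) . 't.jpg';
}

echo thumbnail_url('http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf'), "\n";
echo thumbnail_url('http://208.116.9.205/10/content/17474/games_Doom.swf'), "\n";
```

Because strrpos() finds the last slash, the number of folders in the URL does not matter.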
