[SOLVED] Scraping With Regex


tail

I've been reading some tutorials trying to understand regex but it's not making sense to me. I'm trying to scrape the name of a game but I'm not sure if I'm doing it correctly. This is the code I'm using:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Output:

array(2) { [0]=>  string(40) "
Doom<" [1]=> string(39) "
Doom" }

 

Why is it that I'm getting two results? And why does one have a "<" in it?

Link to comment
Share on other sites

Well, you have your parentheses wrapped around your whole regex, except for the one wildcard (the dot) thrown in at the end. Element 0 is always the full regex match, so it includes the character that the dot matched (the "<"), while element 1 holds only what the parentheses captured. You could remove that dot and the parentheses, and then element 0 would look like element 1, but that would only work for you by coincidence. There's nothing wrong with using element 1 in the first place.
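
To make the element 0 vs. element 1 point concrete, here is a minimal sketch. The one-line $html string is a made-up stand-in for the real page (an assumption, the live markup may differ), and the parentheses are placed around the title only:

```php
<?php
// Hypothetical stand-in for the fetched page; the real markup may differ.
$html = '<div class="picTitleCentre"><b>Doom</b></div>';

// Parentheses only around the part we actually want to keep.
$regex = '~<div class="picTitleCentre"><b>([^<]*)</b>~s';

if (preg_match($regex, $html, $matches)) {
    $fullMatch = $matches[0]; // everything the pattern matched, tags included
    $title     = $matches[1]; // only the text captured by the first group

    // Escape the tags so they stay visible when viewed in a browser.
    echo htmlspecialchars($fullMatch), "\n";
    echo $title, "\n";
}
```

Printing through htmlspecialchars() is just one way to see the captured tags without having to view the page source.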

Link to comment
Share on other sites

and by the way, you might wanna rightclick > viewsource that output of yours.  You do know that you are capturing that div and b tag, along with the title, and not just the title itself, right?

No, I didn't know that. What I'm trying to do is something similar to what I scraped off another website using this:

preg_match('~<title>([^<]*).+(/Games/[^.]+.swf).+Categories:\s*(.*?)<br />.+?GameDescription">([^<]*)~s', $html, $matches);

Which returned the name of the game, category, description, and link to the .swf. What I'm trying to accomplish is the same thing except for the category because this site doesn't list it. I got the div and b tag out like this:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*)</b>~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

But how do I retrieve the rest of the info I'm looking for?

Link to comment
Share on other sites

I managed to get the link to the .swf, the name, and the description:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Is that the best way to do it?

Link to comment
Share on other sites

Usually, .+ (or even .*) is frowned upon, both for potential speed issues and, even worse, accuracy issues. You can view post#11 and 14 from this thread for a more complete explanation. At the very least, use .+? in these cases (if it's possible to use negated character classes, that would be preferable).

 

These parts could be problematic:

$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';

 

Using the first of those parts, ([^<]*).+, as an example: you are telling the regex engine to match anything that is not a <, zero or more times, so at that point it has matched everything up to (but not including) the < character. The .+ then forces the regex to match the next character (which is now the <), and since it is greedy, it keeps going: it grabs everything to the end of the string and then backtracks until it finds a <param... that lets the rest of the pattern match, which means it settles on the last such one in the source. So what happens to the first tag it ran into? Is that the <param tag you are looking for?
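
A small sketch of that accuracy problem. The markup below is invented for illustration (two movie params in a row, which the real page may not have), and the patterns are shortened versions of the one above:

```php
<?php
// Hypothetical, trimmed-down markup with two <param name=movie ...> tags,
// made up here so the difference between greedy and lazy is visible.
$html = '<b>Doom</b><br />'
      . '<param name=movie value="first.swf">'
      . '<param name=movie value="second.swf">';

// Greedy: .+ runs to the end of the string, then backtracks to the LAST <param.
preg_match('~<b>([^<]*).+<param name=movie value="([^"]*)~s', $html, $greedy);

// Lazy: .+? grows one character at a time and stops at the FIRST <param.
preg_match('~<b>([^<]*).+?<param name=movie value="([^"]*)~s', $html, $lazy);

echo $greedy[2], "\n"; // second.swf
echo $lazy[2], "\n";   // first.swf
```

Both patterns capture the same title in group 1; only the URL in group 2 differs.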

 

Is this something along the lines of what you are looking for?

 

Example:

// obviously, I just used this heredoc to simulate the site.. you won't use this...
$html = <<<HTML
<div class="picTitleCentre"><b>Doom</b></div>
        <div class="picTitleRight">

         

        </div>
    </div>
    <div id="nonVidContentDiv">
    <div align="center" style="margin-bottom:12px"><br /><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=10,0,0,0" id=intro width="600" height="375"  align="top"><param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf"><param name=AllowScriptAccess value="always"><param name=quality value=high><param name=bgcolor value=""><embed src="http://208.116.9.205/10/content/17474/games_Doom.swf" AllowScriptAccess="always" quality=high bgcolor="" width="600"  height="375" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object><br /></div>
    </div>
    <div class="picTitle" style="text-align:center;">
HTML;

$regex = '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
preg_match($regex, $html, $match); 
echo $match[1] . "<br />\n"; // $match[1] = Doom
echo $match[2]; //$match[2] = http://208.116.9.205/10/content/17474/games_Doom.swf

Link to comment
Share on other sites

@nrg

Cheers for pointing me to your explanation, didn't know too much about the engine's method.

 

Curious as I am, I just ran some tests to see how fast a greedy vs. a lazy search is in our example.

 

Lazy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si', $html, $matches);

Greedy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si', $html, $matches);

$html is the full source of http://www.kontraband.com/games/17474 hard-copied into the script. The only difference in the above lines is the single question mark in the middle of the regex pattern.
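
For anyone wanting to repeat the comparison, a timing loop along these lines should do. This is only a sketch: time_pattern() is a hypothetical helper, and the padded $html string is an assumption standing in for the full page source, so the absolute numbers will not match the ones reported here.

```php
<?php
// Hypothetical helper, not from the thread: times $iterations runs of preg_match().
function time_pattern(string $pattern, string $subject, int $iterations = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject, $m);
    }
    return microtime(true) - $start;
}

// Stand-in subject (an assumption): the title and param tag followed by filler,
// so the greedy .+ has plenty of text to backtrack through.
$html = '<div class="picTitleCentre"><b>Doom</b></div>'
      . '<param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf">'
      . str_repeat('<p>filler</p>', 2000);

$lazy   = '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
$greedy = '#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si';

printf("lazy:   %.4f s\n", time_pattern($lazy, $html));
printf("greedy: %.4f s\n", time_pattern($greedy, $html));
```

How far apart the two timings land depends heavily on how much text follows the match, so results on a different page can look quite different.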

 

Results:

100 iterations of the lazy code took ~ 0.010 seconds, while 100 iterations of the greedy code took a 'whopping' ~ 0.895 seconds.

 

But then I added another part to the end of the regex, .+?end of google analytics code([^;]+), and ran the same test. This time the greedy method was as fast as the lazy. Anyone know why?

Link to comment
Share on other sites


Yeah, definitely: on a single pass (or on small amounts of data), the speed difference won't be huge at all (perhaps even negligible). There are other factors to consider, such as where .+ / .+? sits in the pattern: if it's close to (or at) the very end, the amount of backtracking for, say, .+ will obviously be less than if it sat early in the pattern. But perhaps even more important than speed is accuracy (as CV illustrated in the link I posted). In hindsight, that would be the bigger issue, I would think (although if I can squeeze out some extra speed while I'm at it, why not?). Using character classes when possible typically beats both .+ and .+? speed-wise (I think there was the odd time where it didn't, but overall it was faster).
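
As a rough illustration of the character class point, here are the two ways of grabbing the value attribute side by side. The $html string is trimmed from the sample markup posted earlier; the variable names are just for this sketch:

```php
<?php
// Two tags copied from the sample markup earlier in the thread.
$html = '<param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf">'
      . '<param name=AllowScriptAccess value="always">';

// Lazy dot: checks after every single character whether the closing quote follows.
preg_match('~<param name=movie value="(.+?)"~s', $html, $viaLazy);

// Negated class: consumes up to the closing quote in one go and can never
// run past it, because a quote is not allowed inside the group.
preg_match('~<param name=movie value="([^"]+)"~', $html, $viaClass);

echo $viaLazy[1], "\n";
echo $viaClass[1], "\n";
```

Here both capture the same URL; the difference is in how the engine gets there, and in the fact that the negated class version stays inside the attribute even if more pattern is added after it.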

 

Quote: "But then I added another part to the end of the regex, .+?end of google analytics code([^;]+), and ran the same test. This time the greedy method was as fast as the lazy. Anyone know why?"

 

I'm not sure I quite follow on that one TBH.

 

If you want to really understand how the regex engine *thinks*, my suggestion is the book Mastering Regular Expressions (at least, it helped me out a hell of a lot).

Link to comment
Share on other sites

I have one more question. When the script scrapes the site, the flash file URL comes up in a different format every time. In order to get the thumbnail image, I have to replace a certain part of the filename with "t.jpg".

Ex:

Flash File - http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf

Thumbnail - http://208.116.9.205/1/graphics/games/11623/t.jpg

 

Now that I have the flash URL, how can I get "http://208.116.9.205/1/graphics/games/" into a string? The folders come up different sometimes.
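
One possible approach, sketched under the assumption (based only on the single example above) that the thumbnail is always called t.jpg and sits in the same folder as the .swf: replace everything after the last slash. The variable names are made up for this sketch.

```php
<?php
// Example URL taken from the post above.
$swfUrl = 'http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf';

// Swap everything after the last slash (the filename) for t.jpg.
$thumbUrl = preg_replace('~[^/]+$~', 't.jpg', $swfUrl);

// Or keep just the folder part, trailing slash included.
$baseUrl = preg_replace('~[^/]+$~', '', $swfUrl);

echo $thumbUrl, "\n"; // http://208.116.9.205/1/graphics/games/11623/t.jpg
echo $baseUrl, "\n";  // http://208.116.9.205/1/graphics/games/11623/
```

Because the pattern only looks at what follows the last slash, it does not matter how many folders come before it or what they are called.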

Link to comment
Share on other sites
