tail, posted June 7, 2009:

I've been reading some tutorials trying to understand regex but it's not making sense to me. I'm trying to scrape the name of a game but I'm not sure if I'm doing it correctly. This is the code I'm using:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Output:

array(2) { [0]=> string(40) " Doom<" [1]=> string(39) " Doom" }

Why is it that I'm getting two results? And why does one have a "<" in it?

Link to thread: https://forums.phpfreaks.com/topic/161293-solved-scraping-with-regex/
.josh, posted June 7, 2009:

Element 0 contains the full regex match. Element 1 contains the first captured match (what you have in parentheses). Element 2 would contain the second captured match, and so on.
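A minimal sketch of the layout .josh describes, run against a one-line stand-in for the page (the `$html` string here is an assumption, not the real site source). It uses the pattern from the first post, so element 0 carries one extra character (the `<` eaten by the trailing dot) compared with element 1.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div class="picTitleCentre"><b>Doom</b></div>';

// Pattern from the first post: one capture group followed by a single dot.
$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';

preg_match($regex, $html, $matches);

// $matches[0]: everything the whole pattern matched (group plus the dot).
// $matches[1]: only what the first pair of parentheses captured.
echo strlen($matches[0]), "\n"; // 40
echo strlen($matches[1]), "\n"; // 39
echo substr($matches[0], -1), "\n"; // <
```

The lengths 40 and 39 line up with the `string(40)` and `string(39)` in the var_dump output above.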
tail, posted June 7, 2009:

Where is the full regex match coming from? Is there a way to capture just what I have in the parentheses?
.josh, posted June 7, 2009:

Well, you have your parentheses wrapped around your whole regex, except for one wildcard thrown in at the end. You could always remove that dot at the end and the parentheses, and then element 0 will hold what element 1 holds now. But in general, element 0 is always the full regex match; this would just coincidentally work for you. There's nothing wrong with using element 1 in the first place.
.josh, posted June 7, 2009:

And by the way, you might want to right-click > View Source on that output of yours. You do know that you are capturing the div and b tags along with the title, and not just the title itself, right?
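An alternative to opening View Source is to escape the captured string before printing it, so the browser shows the tags instead of rendering them. This is only a sketch: the `$html` string is a made-up stand-in for the page, and `htmlspecialchars()` is used purely for display.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div class="picTitleCentre"><b>Doom</b></div>';

$regex = '~(<div class="picTitleCentre"><b>[^<]*</b>).~s';
preg_match($regex, $html, $matches);

// Escape the capture so the tags become visible text in a browser.
$visible = htmlspecialchars($matches[1]);
echo $visible, "\n";
// &lt;div class=&quot;picTitleCentre&quot;&gt;&lt;b&gt;Doom&lt;/b&gt;
```

Seen this way, it is clear that the group holds the surrounding markup as well as the title.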
tail, posted June 7, 2009:

.josh wrote:
> and by the way, you might wanna rightclick > viewsource that output of yours. You do know that you are capturing that div and b tag, along with the title, and not just the title itself, right?

No, I didn't know that. What I'm trying to do is similar to something I scraped off another website using this:

preg_match('~<title>([^<]*).+(/Games/[^.]+.swf).+Categories:\s*(.*?)<br />.+?GameDescription">([^<]*)~s', $html, $matches);

That returned the name of the game, the category, the description, and the link to the .swf. I'm trying to accomplish the same thing here, except for the category, because this site doesn't list it. I got the div and b tags out like this:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*)</b>~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

But how do I retrieve the rest of the info I'm looking for?
tail, posted June 7, 2009:

I managed to get the link to the .swf, the name, and the description:

<?php
$autofill_site = 'http://www.kontraband.com/games/17474';
$html = file_get_contents($autofill_site);
$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';
preg_match($regex, $html, $matches);
var_dump($matches);
?>

Is that the best way to do it?
thebadbad, posted June 7, 2009:

Seems fine if it works for you. But instead of [^<>"], it would be sufficient to use [^"], meaning any character that is NOT a double quote (that's what the ^ does when it begins a character class).
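A small sketch of thebadbad's point, using a made-up `<param>` tag (the example.com URL is an assumption). Because the value is delimited by double quotes, excluding `<` and `>` as well changes nothing for a well-formed attribute.

```php
<?php
// Hypothetical markup; the URL is a placeholder, not the real game file.
$html = '<param name=movie value="http://example.com/games/demo.swf"><param name=quality value=high>';

// Negated class that only excludes the closing delimiter.
preg_match('~<param name=movie value="([^"]*)~', $html, $simple);

// Negated class from the original pattern, which also excludes < and >.
preg_match('~<param name=movie value="([^<>"]*)~', $html, $verbose);

echo $simple[1], "\n";                    // http://example.com/games/demo.swf
var_dump($simple[1] === $verbose[1]);     // bool(true)
```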
nrg_alpha, posted June 7, 2009:

Usually, .+ (or even .*) is frowned upon, for reasons of potential speed and, even worse, accuracy issues. You can view posts #11 and #14 from this thread for a more complete explanation. At the very least, use .+? in these cases (if it's possible to use negated character classes, that would be preferable).

These parts (the .+ and the .+?) could be problematic:

$regex = '~<div class="picTitleCentre"><b>([^<]*).+<param name=movie value="([^<>"]*).+?<p style="margin-top: 5px;">([^<]*)~s';

Take the first one (the .+ right after ([^<]*)) as an example. You are telling the regex engine to match anything that is not a <, zero or more times, so at this point the engine has matched everything up to (but not including) the < character. The .+ then forces the engine to match the next character (which is now the <), but since it is greedy, it keeps going: it consumes the rest of the input and backtracks until <param name=movie value=" can match, which means it settles on the last such tag in the page. So what happens to the first tag it ran past? Which <param tag are you actually looking for?

Is this something along the lines of what you are looking for?

Example:

// obviously, I just used this heredoc to simulate the site.. you won't use this...
$html = <<<HTML
<div class="picTitleCentre"><b>Doom</b></div>
<div class="picTitleRight"> </div>
</div>
<div id="nonVidContentDiv">
<div align="center" style="margin-bottom:12px"><br /><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://active.macromedia.com/flash2/cabs/swflash.cab#version=10,0,0,0" id=intro width="600" height="375" align="top"><param name=movie value="http://208.116.9.205/10/content/17474/games_Doom.swf"><param name=AllowScriptAccess value="always"><param name=quality value=high><param name=bgcolor value=""><embed src="http://208.116.9.205/10/content/17474/games_Doom.swf" AllowScriptAccess="always" quality=high bgcolor="" width="600" height="375" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object><br /></div>
</div>
<div class="picTitle" style="text-align:center;">
HTML;

$regex = '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
preg_match($regex, $html, $match);
echo $match[1] . "<br />\n"; // $match[1] = Doom
echo $match[2]; // $match[2] = http://208.116.9.205/10/content/17474/games_Doom.swf
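The accuracy issue can be shown with a reduced sketch. The markup below is invented (two `<object>` blocks with example.com URLs) and the patterns are shortened versions of the ones in this thread; the only difference between them is the question mark after `.+`.

```php
<?php
// Hypothetical page fragment containing two movie params.
$html = <<<HTML
<div class="picTitleCentre"><b>Doom</b></div>
<object><param name=movie value="http://example.com/first.swf"></object>
<object><param name=movie value="http://example.com/second.swf"></object>
HTML;

$lazy   = '#<b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
$greedy = '#<b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si';

preg_match($lazy, $html, $lazyMatch);
preg_match($greedy, $html, $greedyMatch);

// Lazy stops at the first <param name=movie ...> after the title.
echo $lazyMatch[2], "\n";   // http://example.com/first.swf

// Greedy runs to the end and backtracks, so it lands on the last one.
echo $greedyMatch[2], "\n"; // http://example.com/second.swf
```

Both patterns capture the same title in group 1; only the second group differs.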
nrg_alpha, posted June 7, 2009:

@OP: I just noticed this thread was in the PHP Help forum and moved it to the more appropriate PHP Regex board.
thebadbad, posted June 7, 2009:

@nrg Cheers for pointing me to your explanation, didn't know too much about the engine's method. Curious as I am, I just ran some tests to see how fast a greedy vs. a lazy search is in our example.

Lazy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si', $html, $matches);

Greedy code:

preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si', $html, $matches);

$html is the full source of http://www.kontraband.com/games/17474 hard-copied into the script. The only difference in the above lines is the single question mark in the middle of the regex pattern.

Results: 100 iterations of the lazy code took ~0.010 seconds, while 100 iterations of the greedy code took a 'whopping' ~0.895 seconds.

But then I added another part to the end of the regex, .+?end of google analytics code([^;]+), and ran the same test. This time the greedy method was as fast as the lazy. Anyone know why?
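The post does not show the timing code, so the following is only one way such a comparison could be set up. `time_pattern()` is a hypothetical helper, and `$html` is a synthetic stand-in (title and param near the top, a long tail of filler after them) rather than the real page, so the absolute numbers will differ from the ones quoted above.

```php
<?php
// Hypothetical helper: seconds taken by $iterations runs of preg_match().
function time_pattern(string $pattern, string $subject, int $iterations = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject, $matches);
    }
    return microtime(true) - $start;
}

// Synthetic page: the interesting tags come early, followed by lots of markup.
$html = '<div class="picTitleCentre"><b>Doom</b></div>'
      . '<param name=movie value="http://example.com/games_Doom.swf">'
      . str_repeat('<p>filler</p>', 2000);

$lazy   = '#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si';
$greedy = '#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si';

$lazyTime   = time_pattern($lazy, $html);
$greedyTime = time_pattern($greedy, $html);

printf("lazy:   %.4f s\n", $lazyTime);
printf("greedy: %.4f s\n", $greedyTime);
```

With this layout the greedy `.+` has to walk to the end of the input and back, while the lazy `.+?` only advances a few characters, which is the asymmetry the measurement is probing.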
nrg_alpha, posted June 7, 2009:

thebadbad wrote:
> @nrg Cheers for pointing me to your explanation, didn't know too much about the engine's method. Curious as I am, I just ran some tests to see how fast a greedy vs. a lazy search is in our example.
> Lazy code: preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+?<param name=movie value="([^"]+)"#si', $html, $matches);
> Greedy code: preg_match('#<div class="picTitleCentre"><b>([^<]+)</b>.+<param name=movie value="([^"]+)"#si', $html, $matches);
> $html is the full source of http://www.kontraband.com/games/17474 hard-copied into the script. The only difference in the above lines is the single question mark in the middle of the regex pattern.
> Results: 100 iterations of the lazy code took ~ 0.010 seconds, while 100 iterations of the greedy code took a 'whopping' ~ 0.895 seconds.

Yeah, definitely. On a single pass (or on small amounts of data), the speed difference won't be huge at all (in fact, perhaps even infinitesimal). There are additional factors to consider, such as where the .+ or .+? is located in the pattern: if it's close to (or at) the very end, then the amount of backtracking for, say, .+ will obviously be less than if it were located early in the pattern. But perhaps even more important than speed is the issue of accuracy (as CV illustrated in the link I posted). In hindsight, this would be the bigger issue, I would think (although if I can squeeze out some extra speed while I'm at it, why not?). Using character classes when possible typically beats both the .+ and .+? methods speed-wise (I think there was the odd time where it didn't, but overall it was faster).

thebadbad wrote:
> But then I added another part to the end of the regex, .+?end of google analytics code([^;]+), and ran the same test. This time the greedy method was as fast as the lazy. Anyone know why?

I'm not sure I quite follow on that one, TBH.

If you want to really understand how the regex engine *thinks*, my suggestion is the book Mastering Regular Expressions (at least, it helped me out a hell of a lot).
tail, posted June 8, 2009:

Wow, thanks for all the help, everyone! I got it working now.
thebadbad, posted June 8, 2009:

nrg_alpha wrote:
> If you want to really understand how the regex engine *thinks*, my suggestion is the book Mastering Regular Expressions (at least, it helped me out a hell of a lot).

Thanks, I'd better check it out.
tail, posted June 9, 2009:

I have one more question. When the script scrapes the site, the flash file URL comes up in a different format every time. To get the thumbnail image, I have to replace the filename at the end of the URL with "t.jpg". Example:

Flash file: http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf
Thumbnail: http://208.116.9.205/1/graphics/games/11623/t.jpg

Now that I have the flash URL, how can I get "http://208.116.9.205/1/graphics/games/" into a string? The folders come up different sometimes.
thebadbad, posted June 9, 2009:

There are simple string functions to deal with something like this:

<?php
$game = 'http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf';
$thumb = substr($game, 0, strrpos($game, '/')) . '/t.jpg';
?>
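The same idea can be wrapped in a small function so it works whatever folders the URL contains. `thumbnail_url()` is a hypothetical name, and the guard for a URL without any slash is an addition that the post above does not include. Whether a t.jpg actually exists next to every .swf on the site is an assumption carried over from the question.

```php
<?php
// Hypothetical helper: replace the file name at the end of a URL with t.jpg.
function thumbnail_url(string $flashUrl): string
{
    $lastSlash = strrpos($flashUrl, '/');
    if ($lastSlash === false) {
        // No slash found: there is no folder part to keep.
        return 't.jpg';
    }
    // Keep everything before the last slash, then append the thumbnail name.
    return substr($flashUrl, 0, $lastSlash) . '/t.jpg';
}

echo thumbnail_url('http://208.116.9.205/1/graphics/games/11623/games_BillySuicide.swf'), "\n";
// http://208.116.9.205/1/graphics/games/11623/t.jpg

echo thumbnail_url('http://208.116.9.205/10/content/17474/games_Doom.swf'), "\n";
// http://208.116.9.205/10/content/17474/t.jpg
```

Because only the last slash matters, the number and names of the folders in between make no difference.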