tippy_102 Posted January 1, 2007

I don't know if this is a general coding problem or a regular expression problem. I've been struggling with this all weekend, and I have no idea what's going on.

Here is my code:

[code]
<?php
// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match_all ('/<td align=\"right\">(.*?)<\/span><\/td><\/tr>/', $data, $matches);

// Grab data
// temperature
preg_match ('/Temperature:<\/td><td><span class=\"bold\"> (.*?) <\/span>/', $data, $temp);
$temperature = $temp['1'];
// $temperature = trim($temperature);

//rain
preg_match ('/Rain:<\/td><td><span class=\"bold\"> (.*?) <\/span>/', $match, $temp);
$rain = $temp[1];
$rain = trim($rain);

echo "Temperature:" . strip_tags($temperature) . "<br />\n";
echo "Rain:" . strip_tags($rain) . "<br />\n";
echo $data;
?>
[/code]

Here is the relevant data I am reading (with permission):

[code]
<table class="current_obs realtime">
<tr><td align="right">Date:</td><td><span class="bold">2007/01/01, 11:37</span></td></tr>
<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br><span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>
<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>
<tr><td align="right">Pressure:</td><td><span class="bold">1016 hPa</span><img src="http://www.victoriaweather.ca/images/baro_s.png" alt=""></td></tr>
<tr><td align="right">Insolation:</td><td><span class="bold">11 W/m<sup>2</sup></span></td></tr>
<tr><td align="right">UV Index:</td><td><span class="bold">0</span></td></tr>
<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>
<tr><td align="right">Wind Speed:</td><td><span class="bold">2 km/hr NE</span><br><span class="smaller">Max: 14.5 km/hr </span></td></tr>
</table>
[/code]

Here is my result:

[code]
Temperature:
Rain:
[/code]

When I echo $data, I get the entire page, so I *think* this is a regular expression problem with my preg_match_all?
effigy Posted January 2, 2007

The Temperature and Rain regexes are not matching because the source does not have spaces inside the span tag: remove the spaces around [tt](.*?)[/tt]. Also, why is Rain being pulled from [tt]$match[/tt] and not [tt]$data[/tt]?

Always print out your array of matches when debugging: [tt]echo '<pre>', print_r($matches, true), '</pre>';[/tt]
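
A minimal sketch of the two fixes effigy describes, run against two rows copied from the table in the first post rather than the live page (the assumption is that the live markup still looks like that sample). The variable name $rain_match is made up for illustration.

```php
<?php
// Two rows copied from the sample table in the first post (assumed unchanged).
$data = '<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br>'
      . '<span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>'
      . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// Fix 1: no literal spaces around the capture group.
// Fix 2: both patterns read from $data, not from an undefined $match.
preg_match('/Temperature:<\/td><td><span class="bold">(.*?)<\/span>/', $data, $temp);
preg_match('/Rain:<\/td><td><span class="bold">(.*?)<\/span>/', $data, $rain_match);

// Guard against a failed match so a layout change does not raise a notice.
$temperature = isset($temp[1]) ? trim($temp[1]) : '';
$rain        = isset($rain_match[1]) ? trim($rain_match[1]) : '';

echo "Temperature: $temperature\n";
echo "Rain: $rain\n";
```

Printing $temp and $rain_match with print_r, as suggested above, is the quickest way to see whether a pattern matched at all.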
c4onastick Posted January 2, 2007

effigy's right: you're not using the right variables for the different matches, and you have literal white space around '(.*?)' without the 'x' modifier.

The usual technique for scraping like this is:

One, narrow down the HTML with a single preg_match. In this case, pull out that table (this step is optional and has pros and cons).

[code]
preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);
[/code]

Then you can pull out all the Temp, Rain, etc. with a more general preg_match_all, since you know that you've already got the data narrowed down.

[code]
preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

Optionally, you can pick up those "smaller" span tags with something like this:

[code]
preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

I used the 'PREG_SET_ORDER' flag so that you get an array of arrays, one per match (just personal preference; easier to work with in my opinion).

The final $matches array will look like this:

[code]
Array(
    Array(full_match, variable, bold_value, small_value) // First Match
    Array(full_match, variable, bold_value, small_value) // Second Match
    Array(full_match, variable, bold_value, small_value) // Third Match
    ...
)
[/code]

Here full_match is everything the pattern matched, including the '<td>', '<span>', etc. tags; variable is the label ('Date:', 'Temperature:', etc.); and bold_value and small_value are the values within those spans.

Then you can re-list the data with something like:

[code]
foreach ($matches as $match) {
    echo $match[1];
    echo $match[2];
    // small_value is missing for rows without a "smaller" span
    if (!empty($match[3])) {
        echo $match[3];
    }
}
[/code]

Here's a great little site for testing regexes: http://regexlib.com/RETester.aspx

Also, you might want to check out curl: http://us3.php.net/manual/en/ref.curl.php

It's generally quicker and more robust than calling file() on a URL.
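
The narrow-then-extract technique can be sketched as follows. This is an illustration under assumptions, not the poster's exact code: the HTML is a trimmed copy of the table from the first post, and $observations is a made-up name for the resulting label-to-value map.

```php
<?php
// Trimmed copy of the table from the first post; the live page may differ.
$data = '<html><body><table class="current_obs realtime">'
      . '<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>'
      . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>'
      . '</table></body></html>';

$observations = array();

// Step 1: narrow the page down to the table. The s modifier lets . match newlines.
if (preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match)) {
    // Step 2: preg_match filled $big_match with an array; index 1 is the table body.
    preg_match_all(
        '/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/',
        $big_match[1],
        $matches,
        PREG_SET_ORDER
    );

    // With PREG_SET_ORDER each $match is one row:
    // [0] full match, [1] label, [2] bold value.
    foreach ($matches as $match) {
        $observations[rtrim($match[1], ':')] = $match[2];
    }
}

print_r($observations);
```

Keying the result by label makes later lookups such as $observations['Rain'] independent of row order.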
tippy_102 Posted January 3, 2007

Wow! Thank you [b]very[/b] much!

Thanks to you, I now (finally!) have this thing working! WooHoo!

One question: if I use the "optional" pattern you supplied to grab the smaller tags, I receive this error message:

"Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 't' in C:\Program Files\xampp\htdocs\test\2.php on line 12"

Line 12 is:

[code]
preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

but print_r($big_match[1], true) displays the proper data:

[quote]
Date:2007/01/02, 21:21
Temperature:6.1 C
L: 5.5 C, H: 12.0 C
Humidity:96 %
Pressure:1004 hPa
Insolation:0 W/m2
UV Index:0
Rain:41.4 mm
Wind Speed:6 km/hr NW
Max: 48.3 km/hr
[/quote]

I don't see a 't' modifier in that line, unless it is referring to the pipe, but according to the reading I've done, the pipe is valid. Any idea what this error is referring to?
effigy Posted January 3, 2007

You cannot use an unescaped [tt]/[/tt] inside the pattern if your delimiter is a [tt]/[/tt]. Either escape it ([tt]\/[/tt]) or, preferably, change the delimiter ([tt]%pattern%[/tt], [tt]#pattern#[/tt], etc.).
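
Both options can be compared side by side. The snippet below is a small sketch using one row copied from the sample table in the first post; the variable names are illustrative only.

```php
<?php
// One row copied from the sample table in the first post.
$row = '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// Option 1: keep / as the delimiter and escape every literal slash.
$escaped = '/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>/';

// Option 2: use # as the delimiter so the closing tags need no escaping.
$hashed = '#<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>#';

preg_match($escaped, $row, $a);
preg_match($hashed, $row, $b);

// Both patterns describe the same regex, so the captures are identical.
var_dump($a === $b);
echo $b[1], ' ', $b[2], "\n";
```

For HTML, where closing tags are everywhere, a non-slash delimiter tends to keep the pattern easier to read.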
c4onastick Posted January 3, 2007

[quote author=effigy link=topic=120594.msg496080#msg496080 date=1167834617]
You cannot use an unescaped [tt]/[/tt] inside the pattern if your delimiter is a [tt]/[/tt]. Either escape it ([tt]\/[/tt]) or, preferably, change the delimiter ([tt]%pattern%[/tt], [tt]#pattern#[/tt], etc.).
[/quote]

Yeah... oops! I tested it with another delimiter. This should be fixed:

[code]
preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

Since I used a slash as the opening delimiter, PHP treats the slash in the first '</td>' as the end of the regex and reads whatever follows as modifiers. The first character after that slash is 't', which is why it complains about an unknown modifier 't'.

Glad to help!
tippy_102 Posted January 4, 2007

Thank you very much c4onastick! :) All your efforts on this are greatly appreciated!

I should have caught the missing escapes. :-[

For those who are wondering, here is the finished code:

[code]
<?php
// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);

// Grab the weather conditions, including data in the smaller span tags
preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);

// Display
foreach ($matches as $match) {
    echo $match[1];
    echo "<br>";
    echo $match[2];
    echo "<br>";
    // The third group is missing for rows without a "smaller" span
    if (!empty($match[3])) {
        echo $match[3];
    }
    echo "<br>";
}

//echo '<pre>', print_r($big_match[1], true), '</pre>';
?>
[/code]
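
For readers who want to try the cURL route mentioned earlier in the thread, here is a hedged sketch that splits fetching from parsing. The helper names fetch_page and parse_observations, the 10-second timeout, and the '#' delimiter are assumptions made for illustration; only the pattern itself comes from the finished code. The parser is exercised offline on two rows copied from the first post, and the live call is left commented out.

```php
<?php
// Hypothetical helper: fetch a page with cURL; returns the body or null on failure.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: turn the observations table into a list of rows.
function parse_observations($html)
{
    $rows = array();
    if (!preg_match('#<table class="current_obs realtime">(.*?)</table>#s', $html, $table)) {
        return $rows; // table not found, e.g. the page layout changed
    }
    preg_match_all(
        '#<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>'
        . '(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)#',
        $table[1],
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $match) {
        // Group 3 only exists for rows that carry a "smaller" span.
        $has_detail = isset($match[3]) && $match[3] !== '';
        $rows[] = array(
            'label'  => rtrim($match[1], ':'),
            'value'  => trim($match[2]),
            'detail' => $has_detail ? trim($match[3]) : null,
        );
    }
    return $rows;
}

// Offline check against two rows copied from the first post.
$sample = '<table class="current_obs realtime">'
        . '<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br>'
        . '<span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>'
        . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>'
        . '</table>';
$parsed = parse_observations($sample);
print_r($parsed);

// Live use (needs network access and the cURL extension):
// $html = fetch_page('http://www.nanaimoweather.ca/current.php?id=85');
// if ($html !== null) { print_r(parse_observations($html)); }
```

Note that rows whose bold span is followed by something else, such as the Pressure row with its image or the Insolation row with its superscript, do not match this pattern and are skipped.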