
[SOLVED] Scraping for dummies


tippy_102


I don't know if this is a general coding problem, or a regular expression problem.  I've been struggling with this all weekend, and I have no idea what's going on.

Here is my code:
[code]
<?php

// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match_all ('/<td align=\"right\">(.*?)<\/span><\/td><\/tr>/', $data, $matches);


// Grab data

    // temperature
preg_match ('/Temperature:<\/td><td><span class=\"bold\"> (.*?)  <\/span>/', $data, $temp);
    $temperature = $temp['1'];
    // $temperature = trim($temperature);
   
//rain
    preg_match ('/Rain:<\/td><td><span class=\"bold\"> (.*?)  <\/span>/', $match, $temp);
    $rain = $temp[1];
    $rain = trim($rain);

 
    echo "Temperature:" . strip_tags($temperature) . "<br />\n";
    echo "Rain:" . strip_tags($rain) . "<br />\n";
echo $data;
?>

[/code]

Here is the relevant data I am reading (with permission):
[code]
<table class="current_obs realtime">
<tr><td align="right">Date:</td><td><span class="bold">2007/01/01, 11:37</span></td></tr>
<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br><span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>
<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>
<tr><td align="right">Pressure:</td><td><span class="bold">1016 hPa</span><img src="http://www.victoriaweather.ca/images/baro_s.png" alt=""></td></tr>
<tr><td align="right">Insolation:</td><td><span class="bold">11 W/m<sup>2</sup></span></td></tr>
<tr><td align="right">UV Index:</td><td><span class="bold">0</span></td></tr>
<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>
<tr><td align="right">Wind Speed:</td><td><span class="bold">2 km/hr NE</span><br><span class="smaller">Max: 14.5 km/hr </span></td></tr>
</table>
[/code]

Here is my result:
[code]
Temperature:
Rain:
[/code]

When I echo $data, I get the entire page, so I *think* this is a regular expression problem with my preg_match_all ?




The Temperature and Rain regexes are not matching because the source does not have spaces inside the span tag: remove the spaces around[tt] (.*?)[/tt]. Also, why is Rain being pulled from[tt] $match [/tt]and not[tt] $data[/tt]?

Always print out your array of matches when debugging:
[tt] echo '<pre>', print_r($matches, true), '</pre>';[/tt]
effigy's right, you're not using the right variables for the different matches, and you're using a bunch of white space around '(.*?)' without the 'x' modifier.
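
To make the whitespace point concrete, here is a minimal sketch run against one row copied from the table posted above. The variable names are only for illustration. It compares the original spaced pattern, the same pattern with the spaces removed, and a spaced pattern using the 'x' modifier (where unescaped whitespace in the pattern is ignored, so a space that really must match is written as \s).

```php
<?php
// One row copied from the table markup posted above.
$row = '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// Original style: the spaces around (.*?) are literal characters,
// so the pattern demands "> 12.4 mm  <" and finds nothing.
$with_spaces = preg_match('/Rain:<\/td><td><span class="bold"> (.*?)  <\/span>/', $row, $m1);

// Spaces removed: the pattern now mirrors the real markup.
$without_spaces = preg_match('/Rain:<\/td><td><span class="bold">(.*?)<\/span>/', $row, $m2);

// With the x modifier, unescaped whitespace in the pattern is ignored,
// so the one space that must match (before class=) is written as \s.
$extended = preg_match('/Rain: <\/td><td><span\sclass="bold"> (.*?) <\/span>/x', $row, $m3);

var_dump($with_spaces, $without_spaces, $extended);
echo $m2[1], "\n", $m3[1], "\n";
```

Only the first pattern should fail to match; the other two should both capture the rain value.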

The usual technique for scraping like this is:
One, narrow down the html with a single preg_match. In this case pull out that table (this step is optional, has pros/cons).
[code]preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);[/code]

Then you can pull out all the Temp, Rain, etc with a more general preg_match_all since you know that you've already got the data narrowed down.
[code]preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/', $big_match[0], $matches, PREG_SET_ORDER);[/code]

Optionally you can pick up those "smaller" span tags with something like this:
[code]preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match, $matches, PREG_SET_ORDER);[/code]

I used the 'PREG_SET_ORDER' flag so that you get an array of arrays with all the matches (just personal preference, easier to work with in my opinion).

The final $matches array will look like this:
[code]Array(
  Array(full_match, variable, bold_value, small_value) // First Match
  Array(full_match, variable, bold_value, small_value) // Second Match
  Array(full_match, variable, bold_value, small_value) // Third Match
  ...
)[/code]
Here full_match is the whole matched string, including the '<td>', '<span>', etc. tags. Variable is the label ('Date:', 'Temperature:', etc.), and bold_value and small_value are the values inside those spans. Rows without a "smaller" span have no small_value entry.
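
If the difference between the two result layouts is hard to picture, this small sketch may help. It uses two rows shaped like the table posted above and the simpler two-group pattern. The names $by_group and $by_row are made up for the example.

```php
<?php
// Two rows in the same shape as the table posted above.
$html = '<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>'
      . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

$pattern = '/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/';

// Default layout (PREG_PATTERN_ORDER): one array per capture group.
// $by_group[1] holds every label, $by_group[2] holds every value.
preg_match_all($pattern, $html, $by_group);

// PREG_SET_ORDER: one array per match, i.e. one per table row.
// $by_row[1] holds the full match, label and value of the second row.
preg_match_all($pattern, $html, $by_row, PREG_SET_ORDER);

print_r($by_group[1]);
print_r($by_row[1]);
```

With PREG_SET_ORDER each element of the result is one complete row, which is why a plain foreach over it works without any index juggling.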

Then you can re-list the data with something like:
[code]foreach($matches as $match)
{
  echo $match[1];
  echo $match[2];
  // Group 3 is missing for rows without a "smaller" span, so test before using it
  if(!empty($match[3])) {echo $match[3];}
}
[/code]

Here's a great little site for testing regexes:
http://regexlib.com/RETester.aspx

Also you might want to check out curl:
http://us3.php.net/manual/en/ref.curl.php
It's generally quicker and more robust than calling file() on a URL.
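
For anyone who wants to try the curl route, a rough sketch might look like the following. It assumes the curl extension is installed. fetch_page() and its 10-second default timeout are hypothetical, not something from this thread or from the PHP manual.

```php
<?php
// Hypothetical helper (not from the thread): fetch a page body with cURL.
function fetch_page($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up after $timeout seconds

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; curl_error() describes why
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage (makes a real network request, so it is left commented out here):
// $data = fetch_page("http://www.nanaimoweather.ca/current.php?id=85");
```

The returned string can be passed to preg_match() exactly like the $data built with file().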
Wow! Thank you [b]very[/b] much!

Thanks to you, I now (finally!) have this thing working!  WooHoo!

One question - if I use the "optional" option you supplied to grab the smaller tags, I receive an error message of "Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 't' in C:\Program Files\xampp\htdocs\test\2.php on line 12"

Line 12 is:
[code]
preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

but the print_r($big_match[1], true) displays the proper data:
[quote]
Date:2007/01/02, 21:21
Temperature:6.1 C
L: 5.5 C, H: 12.0 C
Humidity:96 %
Pressure:1004 hPa
Insolation:0 W/m2 UV Index:0 Rain:41.4 mm Wind Speed:6 km/hr NW Max: 48.3 km/hr
[/quote]

I don't see a 't' modifier in that line, unless it is referring to the pipe, but according to the reading I've done, this is valid.  Any idea what this error is referring to?

[quote author=effigy link=topic=120594.msg496080#msg496080 date=1167834617]
You cannot use a[tt] / [/tt]if your delimiter is a[tt] /[/tt]. Either escape it ([tt]\/[/tt]) or, preferably, change the delimiter ([tt]%pattern%[/tt], [tt]#pattern#[/tt], etc.).
[/quote]
Yeah... oops! I tested it with another delimiter; this should be fixed:
[code]preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);[/code]
It treats the slash in the first '</td>' as the end of the regex, since I used a slash as the opening delimiter. That's why it's complaining about the modifier 't': it reads whatever follows that slash as modifiers.
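
As a quick illustration of the delimiter suggestion, here is a small sketch on one row from the table posted above. It shows the same pattern written once with escaped slashes and once with '#' as the delimiter. The variable names are made up for the example.

```php
<?php
$row = '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// "/" as the delimiter: every "/" inside the pattern has to be escaped.
preg_match('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>/', $row, $a);

// "#" as the delimiter: closing tags can be written exactly as they appear in the HTML.
preg_match('#<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>#', $row, $b);

// Both calls should produce identical match arrays.
var_dump($a === $b);
echo $b[1], ' ', $b[2], "\n";
```

The second form tends to be easier to read when the pattern is mostly HTML, at the cost of having to escape any literal '#' instead.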

Glad to help!
Thank you very much c4onastick!  :)  All your efforts on this are greatly appreciated!

I should have caught the missing escapes.  :-[

For those who are wondering, here is the finished code:
[code]
<?php

// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);

// Grab the weather conditions, including data in the smaller span tags
preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);

// Display
foreach($matches as $match)
{
  echo $match[1];
  echo "<br>";
  echo $match[2];
  echo "<br>";
  // Group 3 only exists for rows that have a "smaller" span
  if(!empty($match[3])) {echo $match[3];}
  echo "<br>";
}

//echo '<pre>', print_r($big_match[1], true), '</pre>';

?>
[/code]

Archived

This topic is now archived and is closed to further replies.
