
[SOLVED] Scraping for dummies


tippy_102


I don't know if this is a general coding problem, or a regular expression problem.  I've been struggling with this all weekend, and I have no idea what's going on.

Here is my code:
[code]
<?php

// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match_all ('/<td align=\"right\">(.*?)<\/span><\/td><\/tr>/', $data, $matches);


// Grab data

    // temperature
preg_match ('/Temperature:<\/td><td><span class=\"bold\"> (.*?)  <\/span>/', $data, $temp);
    $temperature = $temp['1'];
    // $temperature = trim($temperature);
   
//rain
    preg_match ('/Rain:<\/td><td><span class=\"bold\"> (.*?)  <\/span>/', $match, $temp);
    $rain = $temp[1];
    $rain = trim($rain);

 
    echo "Temperature:" . strip_tags($temperature) . "<br />\n";
    echo "Rain:" . strip_tags($rain) . "<br />\n";
echo $data;
?>

[/code]

Here is the relevant data I am reading (with permission):
[code]
<table class="current_obs realtime">
<tr><td align="right">Date:</td><td><span class="bold">2007/01/01, 11:37</span></td></tr>
<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br><span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>
<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>
<tr><td align="right">Pressure:</td><td><span class="bold">1016 hPa</span><img src="http://www.victoriaweather.ca/images/baro_s.png" alt=""></td></tr>
<tr><td align="right">Insolation:</td><td><span class="bold">11 W/m<sup>2</sup></span></td></tr>
<tr><td align="right">UV Index:</td><td><span class="bold">0</span></td></tr>
<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>
<tr><td align="right">Wind Speed:</td><td><span class="bold">2 km/hr NE</span><br><span class="smaller">Max: 14.5 km/hr </span></td></tr>
</table>
[/code]

Here is my result:
[code]
Temperature:
Rain:
[/code]

When I echo $data, I get the entire page, so I *think* this is a regular expression problem with my preg_match_all ?




The Temperature and Rain regexes are not matching because the source does not have spaces within the span tag: remove the spaces around [tt](.*?)[/tt]. Also, why is Rain being pulled from [tt]$match[/tt] and not [tt]$data[/tt]?

Always print out your array of matches when debugging:
[tt]echo '<pre>', print_r($matches, true), '</pre>';[/tt]
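
A minimal sketch of both fixes, run against two rows copied from the table in the question rather than the live page. The hard-coded [tt]$data[/tt] string and the [tt]$rainMatch[/tt] name are assumptions for illustration only:

```php
<?php
// Sample rows copied from the question; stands in for the fetched page.
$data = '<tr><td align="right">Temperature:</td><td><span class="bold">3.4 C</span><br><span class="smaller"> L: 2.9 C, H: 3.6 C</span></td></tr>'
      . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// No spaces around the capture group, and both patterns read from $data.
preg_match('/Temperature:<\/td><td><span class="bold">(.*?)<\/span>/', $data, $temp);
$temperature = $temp[1];

preg_match('/Rain:<\/td><td><span class="bold">(.*?)<\/span>/', $data, $rainMatch);
$rain = $rainMatch[1];

// Dump the raw match array while debugging (escaped so the tags stay visible).
echo '<pre>', htmlspecialchars(print_r($temp, true)), '</pre>';

echo "Temperature: $temperature<br />\n";
echo "Rain: $rain<br />\n";
```

If a dump like this comes back as an empty array, the pattern is not matching at all, which points at the regex rather than at the code that reads the result.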

effigy's right, you're not using the right variables for the different matches, and you've put a bunch of whitespace around '(.*?)' that isn't in the page. Without the 'x' modifier, that whitespace has to match literally.

The usual technique for scraping like this is:
One, narrow down the html with a single preg_match. In this case pull out that table (this step is optional, has pros/cons).
[code]preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);[/code]

Then you can pull out all the Temp, Rain, etc with a more general preg_match_all since you know that you've already got the data narrowed down.
[code]preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/', $big_match[0], $matches, PREG_SET_ORDER);[/code]

Optionally you can pick up those "smaller" span tags with something like this:
[code]preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match, $matches, PREG_SET_ORDER);[/code]

I used the 'PREG_SET_ORDER' flag so that you get an array of arrays with all the matches (just personal preference, easier to work with in my opinion).

The final $matches array will look like this:
[code]Array(
  Array(full_match, variable, bold_value, small_value) // First Match
  Array(full_match, variable, bold_value, small_value) // Second Match
  Array(full_match, variable, bold_value, small_value) // Third Match
  ...
)[/code]
Where full_match will be everything including '<td>, <span>, ... etc'. Variable will be 'Date:, Temperature:, ... etc.'. And bold_value and small_value will be the values within those spans.
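
To make the difference concrete, here is a small sketch comparing the default ordering with 'PREG_SET_ORDER' on two rows copied from the question. The variable names [tt]$byGroup[/tt] and [tt]$bySet[/tt] are just illustrative:

```php
<?php
// Two rows copied from the table in the question.
$html = '<tr><td align="right">Humidity:</td><td><span class="bold">99 %</span></td></tr>'
      . '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';
$pattern = '/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)/';

// Default (PREG_PATTERN_ORDER): one array per capture group.
// $byGroup[1] holds every label, $byGroup[2] holds every value.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one array per match.
// $bySet[0] holds the full match, label, and value for the first row.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);

print_r($byGroup[1]); // all labels
print_r($bySet[1]);   // everything captured for the second row
```

With the default ordering you would loop over the labels and look up the value at the same index. With 'PREG_SET_ORDER' each row arrives as one self-contained array, which is why the foreach stays short.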

Then you can re-list the data with something like:
[code]foreach ($matches as $match)
{
  echo $match[1];
  echo $match[2];
  // Group 3 is missing from the array when the row has no "smaller" span
  if (isset($match[3]) && $match[3] !== '') {echo $match[3];}
}
[/code]

Here's a great little site for testing regexes:
http://regexlib.com/RETester.aspx

Also you might want to check out curl:
http://us3.php.net/manual/en/ref.curl.php
It's generally quicker and more robust than calling file() on a URL.
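
A rough sketch of what the curl version of the fetch might look like. The helper name [tt]fetch_page[/tt] and the 10 second timeout are my own choices, not anything from the manual, and it assumes the curl extension is enabled:

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch); // false on failure

    return $body === false ? null : $body;
}

// Usage (not run here, since it needs network access):
// $data = fetch_page("http://www.nanaimoweather.ca/current.php?id=85");
```

The timeout is the main practical gain for a scraper: a slow or dead server returns null after a few seconds instead of hanging the script.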

Wow! Thank you [b]very[/b] much!

Thanks to you, I now (finally!) have this thing working!  WooHoo!

One question - if I use the "optional" option you supplied to grab the smaller tags, I receive an error message of "Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 't' in C:\Program Files\xampp\htdocs\test\2.php on line 12"

Line 12 is:
[code]
preg_match_all('/<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>(?:</td></tr>|<br><span class="smaller">([^<]+)</span></td></tr>)/', $big_match[0], $matches, PREG_SET_ORDER);
[/code]

but the print_r($big_match[1], true) displays the proper data:
[quote]
Date:2007/01/02, 21:21
Temperature:6.1 C
L: 5.5 C, H: 12.0 C
Humidity:96 %
Pressure:1004 hPa
Insolation:0 W/m2
UV Index:0
Rain:41.4 mm
Wind Speed:6 km/hr NW
Max: 48.3 km/hr
[/quote]

I don't see a 't' modifier in that line, unless it is referring to the pipe, but according to the reading I've done, this is valid.  Any idea what this error is referring to?


[quote author=effigy link=topic=120594.msg496080#msg496080 date=1167834617]
You cannot use a [tt]/[/tt] if your delimiter is a [tt]/[/tt]. Either escape it ([tt]\/[/tt]) or, preferably, change the delimiter ([tt]%pattern%[/tt], [tt]#pattern#[/tt], etc.).
[/quote]
Yeah... oops! I tested it with another delimiter, this should be fixed:
[code]preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);[/code]
It sees the slash in '</td>', a dozen or so characters in, as the end of the regex, since I used a slash as the opening delimiter. That's why it's complaining about an unknown modifier 't': it reads the 't' of 'td' as a modifier.
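
For comparison, here is a quick sketch of the same kind of pattern written both ways, against one row copied from the question. The variable names are only for illustration:

```php
<?php
$row = '<tr><td align="right">Rain:</td><td><span class="bold">12.4 mm</span></td></tr>';

// Slash delimiter: every literal "/" in the pattern has to be escaped.
$withSlashes = '/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>/';

// Hash delimiter: the slashes in the closing tags need no escaping.
$withHashes = '#<td align="right">([^<]+)</td><td><span class="bold">([^<]+)</span>#';

preg_match($withSlashes, $row, $a);
preg_match($withHashes, $row, $b);

var_dump($a === $b); // both patterns capture the same label and value
```

For HTML, where nearly every closing tag contains a slash, a delimiter like # or % keeps the pattern a lot easier to read.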

Glad to help!

Thank you very much c4onastick!  :)  All your efforts on this are greatly appreciated!

I should have caught the missing escapes.  :-[

For those who are wondering, here is the finished code:
[code]
<?php

// Page to read
$url = "http://www.nanaimoweather.ca/current.php?id=85";
$data = implode("", file($url));

// Get content items
preg_match('/<table class="current_obs realtime">(.*?)<\/table>/s', $data, $big_match);

// Grab the weather conditions, including data in the smaller span tags
preg_match_all('/<td align="right">([^<]+)<\/td><td><span class="bold">([^<]+)<\/span>(?:<\/td><\/tr>|<br><span class="smaller">([^<]+)<\/span><\/td><\/tr>)/', $big_match[0], $matches, PREG_SET_ORDER);

// Display
foreach ($matches as $match)
{
  echo $match[1];
  echo "<br>";
  echo $match[2];
  echo "<br>";
  // Group 3 is missing from the array when the row has no "smaller" span
  if (isset($match[3]) && $match[3] !== '') {echo $match[3];}
  echo "<br>";
}

//echo '<pre>', print_r($big_match[1], true), '</pre>';

?>
[/code]

