Screen Scraping


mangy1983

Hi all

I am developing a website for a fishing organisation which was to include a weather widget. The only weather widget that blended into the colours of the website I am creating was the second transparent widget on this web page: http://www.weatherforecastmap.com/getwidget.phtml/

It has been fine for the last month, except that it is not as accurate as a local website. On top of this, members are complaining that wind speeds are shown in m/s instead of mph, which is the measurement used in the UK.
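
As a side note on the m/s versus mph complaint: once the wind speed is available as a number, the conversion is a single multiplication. Below is a minimal sketch; the function name `msToMph` is made up for illustration and is not part of any widget or library.

```php
<?php
// Convert a wind speed from metres per second to miles per hour.
// 1 mile = 1609.344 m and 1 hour = 3600 s, so 1 m/s = 3600 / 1609.344 mph.
function msToMph(float $metresPerSecond): float
{
    return $metresPerSecond * 3600 / 1609.344;
}

// 10 m/s is roughly 22.4 mph.
echo round(msToMph(10), 1), " mph\n";
```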

 

I have been scouring the net for screen-scraping code that will scrape the information I want from the local website, and found the code supplied on this webpage: http://www.bradino.com/php/screen-scraping/

 

This looks like it does what I want, as the website I want to scrape displays the weather in a table, with one row per 3-hour interval. I would only like to extract the information from the first row, but the code in the link does not work for me. If anyone could help me with this it would be great!

 

Below is the code I have, as I wanted to get the example working before I customised it for my own use.

Thanks for any replies, Callum

 

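<?php
$url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";

$raw = file_get_contents($url);

// Strip tabs, newlines and double spaces so the markup sits on one line.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// NOTE: the HTML markers in this listing did not survive posting.
// '<table' is a generic stand-in for the lost start marker; '</table>' follows from the "+ 8".
$start = strpos($content,'<table');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);

// Split the table into rows, then each data row into cells.
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){
    // Skip header rows (those containing <th> cells).
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);

        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);

        echo "{$position} – {$name} – Number {$number} <br>\n";
    }
}
?>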


I would suggest using the DOMDocument object; I think this will be the easiest method for you. Since the forecast doesn't change that often, I would also write the wanted contents to a file and only update it once per day.

 

$doc = new DOMDocument();
$doc->loadHTMLFile('http://server.com/some/file');
$widget = $doc->getElementById('widgetId'); //returns a single element, or null if the id is not found.
if ($widget !== null) {
echo $widget->nodeValue;
}
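
The "write it to a file and only update it once per day" idea could look roughly like the sketch below. The helper name `cachedFetch`, the cache file path and the closure are all assumptions for illustration; the closure stands in for whatever scraping code ends up being used.

```php
<?php
// Return the cached content if it is younger than $maxAge seconds,
// otherwise call $fetch to rebuild it and store the result.
function cachedFetch(string $cacheFile, int $maxAge, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }
    $fresh = (string) $fetch();
    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}

// Hypothetical usage: refresh at most once per day (86400 seconds).
$cacheFile = sys_get_temp_dir() . '/weather_widget.cache';
echo cachedFetch($cacheFile, 86400, function () {
    return 'Forecast fetched at ' . date('H:i');
});
```

The cache age is checked with the file's modification time, so no extra bookkeeping is needed; a shorter `$maxAge` such as 3 hours would work the same way.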


I have done it this way before when trying to get a single element, which I did not have too much of a problem with. The problem with the code I need to grab is that it is inside a nameless table, and the tds all have the same class name. There are also several instances of these class names in different rows, as the weather is displayed for every three hours on a separate row, whereas I only need a few of the tds from the first row.

For instance, I would like to grab the first row of the weather table at this link: http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html

Hope this makes sense.

Thanks, Callum

Link to comment
Share on other sites

That's not a very good tutorial; it has a few errors in the code to begin with.

Use cURL and go from there.

 

Try this: http://devtrench.com/posts/screen-scrape-with-php-curl
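
For anyone unsure what "use cURL" means in practice, fetching a page comes down to a handful of calls. This is only a sketch: the function name `fetchPage` and the user agent string are made up, and it assumes the PHP cURL extension is installed.

```php
<?php
// Fetch a page with cURL and return its body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 10,    // give up after 10 seconds
        CURLOPT_USERAGENT      => 'WeatherWidget/1.0', // hypothetical user agent string
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (commented out so the sketch does not hit the network when pasted):
// $html = fetchPage('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html');
// if ($html !== null) {
//     $doc = new DOMDocument();
//     @$doc->loadHTML($html); // same DOM calls as with loadHTMLFile() from here on
// }
```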

 

Thanks for your reply. Unfortunately, I don't know how to use cURL, and it doesn't make much sense to me at the moment. I work from example code and, once I get it working, I learn what each line does so that I understand the whole thing. I can understand bits of the code from the example in my first post, and I like the way you can turn the contents of the td elements into separate variables to use as you wish later.

Thanks again, Callum


Try this:

<?php
echo '<style type="text/css">
	table {
		border-collapse: collapse;
	}
	table, th, td {
		border: 1px solid black;
		padding: 2px;
	}

</style>';
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.
echo '<table id="weather"><tr>
									<th rowspan="2">Date</th>
									<th rowspan="2">Time</th>
									<th rowspan="2">Weather</th>

									<th rowspan="2">Temp</th>
									<th colspan="3">Wind</th>
									<th rowspan="2">Visibility</th>
								</tr>
								<tr>
									<th>Dir</th>
									<th>Speed</th>

									<th>Gust</th>
								</tr>'; //mock up of the original table headers.
for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.
 echo '<tr>'; //start row.
  $columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
  $columnCount = $columns->length;
  for($n=0;$n<$columnCount;$n++) { //go through the columns.
	if($n == 2) {
		$img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
		$value = $img->item(0)->getAttribute('title');
	} else {
		$value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
	}
	echo '<td>' . htmlspecialchars($value) . '</td>'; //escape the scraped text, then push the column to the screen.
  }
  echo '</tr>'; //end the row.
}
echo '</table>'; //end the table.

?>
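
An alternative to counting rows by index is to let XPath pick "the first row of the second table that has td cells", which sidesteps the missing table name and the shared class names. The sketch below runs against made-up markup that only imitates the shape of the forecast table; the real page's structure is an assumption here and the expression may need adjusting.

```php
<?php
// Made-up markup standing in for the forecast page: an unnamed second table,
// two header rows, and data cells that all share one class name.
$html = '<html><body>
<table><tr><td>layout</td></tr></table>
<table>
  <tr><th>Date</th><th>Time</th><th>Weather</th><th>Temp</th></tr>
  <tr><th colspan="4">sub-header</th></tr>
  <tr><td class="c">Mon 01</td><td class="c">0900</td><td class="c"><img src="s.gif" title="Sunny"></td><td class="c">12 C</td></tr>
  <tr><td class="c">1200</td><td class="c"><img src="r.gif" title="Rain"></td><td class="c">11 C</td></tr>
</table>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// First row of the second table that actually contains <td> cells.
$row = $xpath->query('((//table)[2]//tr[td])[1]')->item(0);

$cells = [];
if ($row !== null) {
    foreach ($xpath->query('td', $row) as $td) {
        // Prefer the image title when the cell holds an icon, else the text.
        $img = $xpath->query('.//img/@title', $td)->item(0);
        $cells[] = $img !== null ? $img->nodeValue : trim($td->nodeValue);
    }
}

print_r($cells);
```

Because the header rows contain only th cells, the `tr[td]` filter skips them without needing a hard-coded starting index.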

 

Thank you so much for the work you did on this; it is much appreciated! As in my previous post, I was wondering whether it is possible to have the contents of each td saved as a variable in order to save them to a database. I would then run this script every 3 hours using a cron job to update the database table. If you or one of the other great members on here can help, I would be immensely grateful. Once the td contents have been turned into variables, I will be on familiar territory saving the information to the database.

Thanks again, Callum


I managed to assign the values from the code supplied by jcbones (thanks again) to incrementally named variables and echo them separately outside of the loop, ready to insert into my database, so here is the complete code with my own additions. Theoretically this topic is solved, unless anyone can think of a more efficient way of producing my inserted code shown below.

Thanks again guys, Callum

 

<?php

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.

for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.

  $columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
  $columnCount = $columns->length;
  for($n=0;$n<$columnCount;$n++) { //go through the columns.
	if($n == 2) {
		$img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
		$value = $img->item(0)->getAttribute('title');
	} else {
		$value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
	}
${'a' . $n} = $value; //creates $a0, $a1, ... (the variable name must be built from a quoted string).
  }
}

//Keep only the digits in the numeric columns.
$a3 = preg_replace('/[^0-9]/', '', $a3);
$a5 = preg_replace('/[^0-9]/', '', $a5);
$a6 = preg_replace('/[^0-9]/', '', $a6);

echo $a0, '<br>', $a1, '<br>', $a2, '<br>', $a3, '<br>', $a4, '<br>', $a5, '<br>', $a6, '<br>', $a7, '<br>', $a8;

?>
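
On the "more efficient way" question, one option is to keep the cells in an array and give them names with `array_combine()` instead of creating `$a0`, `$a1` and so on. This is only a sketch: the cell values are invented, and the column names are assumptions that would need to match the real table.

```php
<?php
// Hypothetical cell values, in the column order of the forecast table.
$cells = ['Mon 01', '0900', 'Sunny', '12 C', 'SW', '15 mph', '28 mph', 'Good'];

// Column names are assumptions; rename them to match the database table.
$names = ['day', 'time', 'description', 'temperature',
          'windDir', 'windSpeed', 'windGust', 'visibility'];

// Pair each name with the cell at the same position.
$row = array_combine($names, $cells);

// Keep only the digits for the numeric columns.
foreach (['temperature', 'windSpeed', 'windGust'] as $key) {
    $row[$key] = preg_replace('/[^0-9]/', '', $row[$key]);
}

print_r($row);
```

One thing to watch: stripping everything except digits also removes the minus sign from sub-zero temperatures. A pattern such as `/[^0-9-]/` would keep it.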


I had some code for you that puts the information into a SQL query. The forums wouldn't work for me yesterday, but here it is anyway.

This will give you a well-formed MySQL query. I just pushed every column into a separate database column.

 

<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile('weather.html');
$desired_rows = 100; //maximum number of data rows to read.

$table = $doc->getElementsByTagName('table');
$rows = $table->item(1)->getElementsByTagName('tr');
$count = $rows->length;
$sql = 'INSERT INTO weather (day,`time`,description,tempature,windDir,windSpeed,windGust,visibility) VALUES ';
for($i=2,$start=$i;$i<($start + $desired_rows) && $i < ($count - 1);$i++) {
$values = array();
 $columns = $rows->item($i)->getElementsByTagName('td');
  $columnCount = $columns->length;
  if($columnCount == 8) { $retainDate = true; } //8 columns means this row includes the date cell.
  for($n=0;$n<$columnCount;$n++) {
	$value = $columns->item($n)->nodeValue;//go through the columns.
		$img = $columns->item($n)->getElementsByTagName('img');
		for($ii = 0; $ii < $img->length; $ii++) {
			$value = $img->item($ii)->getAttribute('title');
		}
	if($retainDate == true && $n == 0) {
		$date = $value;
	}
	elseif($n == 0) {
		$value = $date . '\',\'' . $value;
	}
	$values[] = $value;
  }
  $queryValueArray[] =  implode('\',\'',$values);
  $retainDate = false;
}
$sql .= '(\'' . implode("'),\n('",$queryValueArray) . '\')';

echo "<pre>$sql</pre>";
?>
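
Since the values above are spliced straight into the SQL string, a quote inside any scraped cell would break the query. A possible variation is to build the statement with placeholders and let PDO bind the values. The sketch below is not the original code: `buildWeatherInsert` is a made-up helper, the column list is copied from the INSERT above, and an existing PDO connection is assumed for the final step.

```php
<?php
// Build a multi-row INSERT with "?" placeholders instead of quoted values.
// Returns [$sql, $params], ready for PDO::prepare() and PDOStatement::execute().
function buildWeatherInsert(array $rows): array
{
    $columns = ['day', '`time`', 'description', 'tempature',
                'windDir', 'windSpeed', 'windGust', 'visibility'];
    if (count($rows) === 0) {
        throw new InvalidArgumentException('At least one row is required');
    }
    $rowPlaceholder = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';

    // Flatten all row values into one parameter list, in row order.
    $params = [];
    foreach ($rows as $row) {
        if (count($row) !== count($columns)) {
            throw new InvalidArgumentException('Each row needs ' . count($columns) . ' values');
        }
        foreach ($row as $value) {
            $params[] = $value;
        }
    }

    $sql = 'INSERT INTO weather (' . implode(',', $columns) . ') VALUES '
         . implode(',', array_fill(0, count($rows), $rowPlaceholder));

    return [$sql, $params];
}

list($sql, $params) = buildWeatherInsert([
    ['Mon 01', '0900', 'Sunny', '12', 'SW', '15', '28', 'Good'],
]);
echo $sql, "\n";

// Hypothetical usage with an existing PDO connection in $pdo:
// $pdo->prepare($sql)->execute($params);
```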

 


Thank you so much for the code, jcbones. It helped me immensely, and hopefully it will help others looking for something similar too.

Thanks again, Callum
