
Xpath Page Scraping Issue. Rather Hard To Work Out.


jamesxg1


Hi guys,

 

Been a while! I was wondering if anyone out there could help with an issue I'm having. Basically I am trying to scrape a page for football fixtures, and everything is going very well except the football "dates".

 

If you visit http://www.bbc.co.uk...league/fixtures you will see a grey tr row containing the date on which each group of matches is played. As you will see, West Ham V Arsenal is the last match on Saturday the 6th of Oct, and Southampton V Fulham is the first match on the 27th. To grab all this data I have written the following (please excuse some of the crap in it, it's there for debugging purposes).

 


<?php
$url = "http://www.bbc.co.uk/sport/football/premier-league/fixtures";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

$query = $xpath->query('.//div[@id="fixtures-data"]');
$matches = array();

foreach ($query as $node) {
    $DateData = $xpath->query('//h2[@class="table-header"]', $node);
    $MatchParent = $xpath->query('//tbody/tr[@class="preview"]/td[@class="match-details"]', $node);
    $Kick = $xpath->query('//tbody/tr[@class="preview"]/td[@class="kickoff"]', $node);
    $Date = trim($DateData->item(0)->nodeValue);

    echo '**' . $DateData->length. '**<br />';
    echo '**' . $Kick->length. '**<br />';
    echo '**' . $MatchParent->length. '**';

    for ($i = 0; $i <= ($MatchParent->length - 1) AND $i <= ($Kick->length - 1); $i++) {
        $Teams = str_replace(array(' V ', '\n'), '-', trim($MatchParent->item($i)->nodeValue));
        $TeamPeices = explode('-', $Teams);

        $matches[$Date][] = array('Home' => trim($TeamPeices[0]), 'Away' => trim($TeamPeices[1]), 'KickOff' => trim($Kick->item($i)->nodeValue));
    }
}
echo '<pre>' . print_r($matches, true) . '</pre>';
?>

 

When I run the script I get the following results.....

**52**
**319**
**320**
Array
(
[saturday 6th October 2012] => Array
 (
	 [0] => Array
		 (
			 [Home] => Man City
			 [Away] => Sunderland
			 [KickOff] => 12:45
		 )
	 [1] => Array
		 (
			 [Home] => Chelsea
			 [Away] => Norwich
			 [KickOff] => 15:00
		 )
	 [2] => Array
		 (
			 [Home] => Swansea
			 [Away] => Reading
			 [KickOff] => 15:00
		 )
	 [3] => Array
		 (
			 [Home] => West Brom
			 [Away] => QPR
			 [KickOff] => 15:00
		 )
	 [4] => Array
		 (
			 [Home] => Wigan
			 [Away] => Everton
			 [KickOff] => 15:00
		 )
	 [5] => Array
		 (
			 [Home] => West Ham
			 [Away] => Arsenal
			 [KickOff] => 17:30
		 )
	 [6] => Array
		 (
			 [Home] => Southampton
			 [Away] => Fulham
			 [KickOff] => 13:30
		 )
	 [7] => Array
		 (
			 [Home] => Liverpool
			 [Away] => Stoke
			 [KickOff] => 15:00
		 )
	 [8] => Array
		 (
			 [Home] => Tottenham
			 [Away] => Aston Villa
			 [KickOff] => 15:00
		 )
	 [9] => Array
		 (
			 [Home] => Newcastle
			 [Away] => Man Utd
			 [KickOff] => 16:00
		 )
	 [10] => Array
		 (
			 [Home] => Tottenham
			 [Away] => Chelsea
			 [KickOff] => 12:45
		 )
...
	 [310] => Array
		 (
			 [Home] => Chelsea
			 [Away] => Everton
			 [KickOff] => 15:00
		 )
	 [311] => Array
		 (
			 [Home] => Liverpool
			 [Away] => QPR
			 [KickOff] => 15:00
		 )
	 [312] => Array
		 (
			 [Home] => Man City
			 [Away] => Norwich
			 [KickOff] => 15:00
		 )
	 [313] => Array
		 (
			 [Home] => Newcastle
			 [Away] => Arsenal
			 [KickOff] => 15:00
		 )
	 [314] => Array
		 (
			 [Home] => Southampton
			 [Away] => Stoke
			 [KickOff] => 15:00
		 )
	 [315] => Array
		 (
			 [Home] => Swansea
			 [Away] => Fulham
			 [KickOff] => 15:00
		 )
	 [316] => Array
		 (
			 [Home] => Tottenham
			 [Away] => Sunderland
			 [KickOff] => 15:00
		 )
	 [317] => Array
		 (
			 [Home] => West Brom
			 [Away] => Man Utd
			 [KickOff] => 15:00
		 )
	 [318] => Array
		 (
			 [Home] => West Ham
			 [Away] => Reading
			 [KickOff] => 15:00
		 )
 )
)

 

All very good, except it is putting them all under one date. What I need is for each date to be a top-level array key (the array "head", as it were), with that day's fixtures as another array inside it. Does anyone know how I can do this?

 

Any help appreciated as I've tried everything.

 

Thanks,

 

James.

Edited by jamesxg1

Your problem lies here:

$Date = trim($DateData->item(0)->nodeValue);

If you re-read your code, you should be able to spot why this is happening. It's quite obvious, really, if you pay attention to the names of stuff. ;)

 

Thanks for responding, Christian :) Regarding the date collector: I know how to pull the data down, but I'm stumped as to how to make it "break" on each new date. There could be 5 matches on one day and 10 on another, so there is no fixed break point :S and I'm very, very new to DOM at the minute :S

 

Thanks Christian :)

 

James.


Seems you didn't quite understand what I was trying to say, and I see I was a bit too vague. So let me go into a bit more detail.

 

In the source page you have only one node with the class "fixtures-table", which means that this loop will only run once.

foreach ($query as $node) {

 

Inside this loop you're fetching all of the dates with another query, saving the results to the $DateData variable. A few lines below, still in the same one-time loop, you're running the line I quoted in my post above. In other words, you're setting $Date to the date from the first matched table header, and it never changes after that.

Then you're running through all of the matches found in the fixtures table, which you fetched right after fetching all of the date headers. That means the $MatchParent variable contains every single match in the table, regardless of what date it belongs to. Finally, inside the inner for loop over $MatchParent, you're assigning every individual match to that same $Date index.
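A related detail to watch for once you start scoping queries to a particular node: an XPath expression that begins with // is evaluated from the document root, even when a context node is passed as the second argument to query(). A leading .// keeps the search inside the context node. The snippet below is only a small illustration on made-up markup (the two divs and their ids are assumptions, not taken from the BBC page):

```php
<?php
// Made-up markup for illustration only; not the real page structure.
$html = '<div id="a"><h2>One</h2></div><div id="b"><h2>Two</h2></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$first = $xpath->query('//div[@id="a"]')->item(0);

// Leading "//" ignores the context node and searches the whole document.
$absolute = $xpath->query('//h2', $first);

// Leading ".//" searches only below the context node.
$relative = $xpath->query('.//h2', $first);

echo $absolute->length, "\n"; // 2
echo $relative->length, "\n"; // 1
```

In your script the context node is the one div that wraps everything, so the counts come out the same either way, but the difference matters as soon as you try to query "only the rows under this header".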

 

The core flaw in your logic is that it does not take into consideration the relation between the headers and the actual data. What you need to do is figure out how to run through the contents of the fixtures-table div in such a way that each match is filed under the immediately preceding date header.

Then you can grab the contents of said header and use it as the first-dimension index.
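One possible way to do that, as a rough sketch rather than a drop-in fix: select the date headers and the match rows together in document order, remember the most recent header seen, and file each row under it. The sample HTML here is hypothetical (I'm assuming each h2.table-header is followed by the rows for that date, and that team names are separated by " V "), and groupFixturesByDate() is just a name made up for this example, so adjust the expressions to the real markup.

```php
<?php
// Hypothetical sample markup; the real page layout is an assumption.
$html = <<<HTML
<div id="fixtures-data">
  <h2 class="table-header">Saturday 6th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Man City V Sunderland</td><td class="kickoff">12:45</td></tr>
    <tr class="preview"><td class="match-details">West Ham V Arsenal</td><td class="kickoff">17:30</td></tr>
  </tbody></table>
  <h2 class="table-header">Saturday 27th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Southampton V Fulham</td><td class="kickoff">13:30</td></tr>
  </tbody></table>
</div>
HTML;

// Illustrative helper (not part of any library): groups match rows by the
// date header that most recently preceded them in the document.
function groupFixturesByDate(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Headers and match rows in one node list, in document order.
    $nodes = $xpath->query(
        '//div[@id="fixtures-data"]//*[self::h2[@class="table-header"] or self::tr[@class="preview"]]'
    );

    $matches = array();
    $date = null;
    foreach ($nodes as $node) {
        if ($node->nodeName === 'h2') {
            $date = trim($node->nodeValue);   // new date: later rows belong to it
            continue;
        }
        if ($date === null) {
            continue;                         // row before any header: skip it
        }
        // "./" keeps these lookups inside the current row.
        $teams = $xpath->query('./td[@class="match-details"]', $node)->item(0);
        $kick  = $xpath->query('./td[@class="kickoff"]', $node)->item(0);
        if ($teams === null || $kick === null) {
            continue;
        }
        $pieces = explode(' V ', trim($teams->nodeValue), 2);
        if (count($pieces) !== 2) {
            continue;                         // unexpected format: skip the row
        }
        $matches[$date][] = array(
            'Home'    => trim($pieces[0]),
            'Away'    => trim($pieces[1]),
            'KickOff' => trim($kick->nodeValue),
        );
    }
    return $matches;
}

$matches = groupFixturesByDate($html);
print_r($matches);
```

With the sample markup above this gives two top-level keys, one per date, holding two fixtures and one fixture respectively. Because the date is picked up as the loop goes, it doesn't matter whether a day has 5 matches or 10.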

Edited by Christian F.
