jamesxg1 Posted October 5, 2012 (edited)

Hi guys,

Been a while! I was wondering if anyone out there could help with an issue I'm having. Basically I am trying to scrape a page for football fixtures, and everything is going very well except the football "dates".

If you visit http://www.bbc.co.uk...league/fixtures you will see a grey tr column containing the dates for when matches start. As you will see, West Ham V Arsenal is the last match on Saturday the 6th of Oct, then Southampton V Fulham is the first match for the 27th.

So to grab all this data I have written the following (please excuse some of the crap in it - debugging purposes).

<?php
$url = "http://www.bbc.co.uk/sport/football/premier-league/fixtures";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$query = $xpath->query('.//div[@id="fixtures-data"]');
$matches = array();

foreach ($query as $node) {
    $DateData = $xpath->query('//h2[@class="table-header"]', $node);
    $MatchParent = $xpath->query('//tbody/tr[@class="preview"]/td[@class="match-details"]', $node);
    $Kick = $xpath->query('//tbody/tr[@class="preview"]/td[@class="kickoff"]', $node);

    $Date = trim($DateData->item(0)->nodeValue);

    echo '**' . $DateData->length. '**<br />';
    echo '**' . $Kick->length. '**<br />';
    echo '**' . $MatchParent->length. '**';

    for ($i = 0; $i <= ($MatchParent->length - 1) AND $i <= ($Kick->length - 1); $i++) {
        $Teams = str_replace(array(' V ', '\n'), '-', trim($MatchParent->item($i)->nodeValue));
        $TeamPeices = explode('-', $Teams);
        $matches[$Date][] = array(
            'Home' => trim($TeamPeices[0]),
            'Away' => trim($TeamPeices[1]),
            'KickOff' => trim($Kick->item($i)->nodeValue)
        );
    }
}

echo '<pre>' . print_r($matches, true) . '</pre>';
?>

When I run the script I get the following results...
**52**
**319**
**320**
Array
(
    [saturday 6th October 2012] => Array
        (
            [0] => Array ( [Home] => Man City [Away] => Sunderland [KickOff] => 12:45 )
            [1] => Array ( [Home] => Chelsea [Away] => Norwich [KickOff] => 15:00 )
            [2] => Array ( [Home] => Swansea [Away] => Reading [KickOff] => 15:00 )
            [3] => Array ( [Home] => West Brom [Away] => QPR [KickOff] => 15:00 )
            [4] => Array ( [Home] => Wigan [Away] => Everton [KickOff] => 15:00 )
            [5] => Array ( [Home] => West Ham [Away] => Arsenal [KickOff] => 17:30 )
            [6] => Array ( [Home] => Southampton [Away] => Fulham [KickOff] => 13:30 )
            [7] => Array ( [Home] => Liverpool [Away] => Stoke [KickOff] => 15:00 )
            [8] => Array ( [Home] => Tottenham [Away] => Aston Villa [KickOff] => 15:00 )
            [9] => Array ( [Home] => Newcastle [Away] => Man Utd [KickOff] => 16:00 )
            [10] => Array ( [Home] => Tottenham [Away] => Chelsea [KickOff] => 12:45 )
            ...
            [310] => Array ( [Home] => Chelsea [Away] => Everton [KickOff] => 15:00 )
            [311] => Array ( [Home] => Liverpool [Away] => QPR [KickOff] => 15:00 )
            [312] => Array ( [Home] => Man City [Away] => Norwich [KickOff] => 15:00 )
            [313] => Array ( [Home] => Newcastle [Away] => Arsenal [KickOff] => 15:00 )
            [314] => Array ( [Home] => Southampton [Away] => Stoke [KickOff] => 15:00 )
            [315] => Array ( [Home] => Swansea [Away] => Fulham [KickOff] => 15:00 )
            [316] => Array ( [Home] => Tottenham [Away] => Sunderland [KickOff] => 15:00 )
            [317] => Array ( [Home] => West Brom [Away] => Man Utd [KickOff] => 15:00 )
            [318] => Array ( [Home] => West Ham [Away] => Reading [KickOff] => 15:00 )
        )
)

All very good, except it is putting them all under one date. I need each correct date to be the array "head", as it were, with that date's fixture data as another array inside.

Does anyone know how I can do this? Any help appreciated as I've tried everything.

Thanks,
James.
Edited October 5, 2012 by jamesxg1
Link to comment: https://forums.phpfreaks.com/topic/269142-xpath-page-scraping-issue-rather-hard-to-work-out/
Christian F. Posted October 6, 2012

Your problem lies here:

$Date = trim($DateData->item(0)->nodeValue);

If you re-read your code, you should be able to spot why this is happening. It's quite obvious, really, if you pay attention to the names of stuff.
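For anyone following along, here is a minimal sketch of what that hint points at. The markup is made up for illustration (it is not the BBC page's real HTML); it only shows that `item(0)` on the node list returns the first matched header, however many headers the query found.

```php
<?php
// Hypothetical markup: two date headers standing in for the real page.
$html = '<div id="fixtures-data">'
      . '<h2 class="table-header">Saturday 6th October 2012</h2>'
      . '<h2 class="table-header">Saturday 27th October 2012</h2>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$DateData = $xpath->query('//h2[@class="table-header"]');

// The query matched every header on the page...
$count = $DateData->length;

// ...but item(0) is only ever the first one, so a $Date built from it never changes.
$first = trim($DateData->item(0)->nodeValue);

echo $count . ' headers found, item(0) is: ' . $first . PHP_EOL;
```

So the name `$DateData` holds all the dates, while `$Date` only ever holds one of them.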
jamesxg1 Posted October 6, 2012

Thanks for responding, Christian. Regarding the date collector: I know how to pull the data down, but I'm stumped as to how to make it "break" upon each new date. There could be 5 matches on one day and 10 on another, so there is no default break :S and I'm very, very new to DOM at the minute :S

Thanks Christian,
James.
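One possible way to get that "break", sketched under assumptions: ask XPath for the date headers and the match rows in a single query, then walk the result in page order and switch the current date whenever a header turns up. The sample markup is invented (the real page's structure may differ), and the sketch relies on DOMXPath returning a union in document order, which is worth verifying on your own PHP build.

```php
<?php
// Invented sample markup; the element and class names are assumptions
// carried over from the queries in the original script.
$html = <<<HTML
<div id="fixtures-data">
  <h2 class="table-header">Saturday 6th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Man City V Sunderland</td><td class="kickoff">12:45</td></tr>
    <tr class="preview"><td class="match-details">West Ham V Arsenal</td><td class="kickoff">17:30</td></tr>
  </tbody></table>
  <h2 class="table-header">Saturday 27th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Southampton V Fulham</td><td class="kickoff">13:30</td></tr>
  </tbody></table>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$container = $xpath->query('//div[@id="fixtures-data"]')->item(0);

// One node list holding both the headers and the match rows.
$nodes = $xpath->query('.//h2[@class="table-header"] | .//tr[@class="preview"]', $container);

$matches = array();
$Date = null;
foreach ($nodes as $node) {
    if ($node->nodeName === 'h2') {
        // A new header is the "break": later rows belong to this date.
        $Date = trim($node->nodeValue);
        continue;
    }
    if ($Date === null) {
        continue;   // a row before any header has no date to go under
    }
    $teams = $xpath->query('./td[@class="match-details"]', $node)->item(0);
    $kick  = $xpath->query('./td[@class="kickoff"]', $node)->item(0);
    if ($teams === null || $kick === null) {
        continue;
    }
    // Collapse whitespace, then split "Home V Away" into two pieces.
    $pieces = explode(' V ', preg_replace('/\s+/', ' ', trim($teams->nodeValue)), 2);
    if (count($pieces) !== 2) {
        continue;
    }
    $matches[$Date][] = array(
        'Home'    => trim($pieces[0]),
        'Away'    => trim($pieces[1]),
        'KickOff' => trim($kick->nodeValue),
    );
}

print_r($matches);
```

Because the date is simply whichever header was seen last, it does not matter whether a day has 5 matches or 10.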
Christian F. Posted October 6, 2012 (edited)

Seems you didn't quite understand what I was trying to say; I see I was a bit too vague. So let me go into a bit more detail.

In the source page there is only one node that matches your first query (the div with the id "fixtures-data"), which means that this loop will only run once:

foreach ($query as $node) {

Inside this loop you're fetching all of the dates with another query, saving the results to the $DateData variable. Three lines below, still in the same one-time loop, you're running the code I quoted in my post above. In other words, you're setting $Date to the date from the first matched table header.

Then you're running through all of the matches found in the fixtures table, which you fetched right after fetching all of the date headers. That means the $MatchParent variable contains every single match in the table, regardless of what date they belong to. Then, inside the inner for loop over $MatchParent, you're assigning every match's details to that same $Date index.

In other words your entire logic is flawed, as it does not take into consideration the relation between the headers and the actual data. What you need to do is figure out how to run through the contents of that div in such a way that each match's details are filed under the immediately preceding date header. Then you can grab the contents of said header and use it as the first-dimension index.

Edited October 6, 2012 by Christian F.
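The "immediate preceding date header" idea can be sketched with XPath's reverse `preceding` axis: for each match row, `preceding::h2[...][1]` selects the nearest header that comes before it. This is only a sketch on invented sample markup, not the BBC page's confirmed structure, so the element and class names in the queries are assumptions carried over from the original script.

```php
<?php
// Invented sample markup standing in for the real page.
$html = <<<HTML
<div id="fixtures-data">
  <h2 class="table-header">Saturday 6th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Man City V Sunderland</td><td class="kickoff">12:45</td></tr>
    <tr class="preview"><td class="match-details">West Ham V Arsenal</td><td class="kickoff">17:30</td></tr>
  </tbody></table>
  <h2 class="table-header">Saturday 27th October 2012</h2>
  <table><tbody>
    <tr class="preview"><td class="match-details">Southampton V Fulham</td><td class="kickoff">13:30</td></tr>
  </tbody></table>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$container = $xpath->query('//div[@id="fixtures-data"]')->item(0);

$matches = array();
// ".//" keeps the search inside $container.
foreach ($xpath->query('.//tr[@class="preview"]', $container) as $row) {
    // Nearest date header before this row; [1] on a reverse axis means "closest".
    $header = $xpath->query('preceding::h2[@class="table-header"][1]', $row)->item(0);
    $teams  = $xpath->query('./td[@class="match-details"]', $row)->item(0);
    $kick   = $xpath->query('./td[@class="kickoff"]', $row)->item(0);
    if ($header === null || $teams === null || $kick === null) {
        continue;   // skip rows that do not have all three parts
    }
    // Collapse whitespace, then split "Home V Away" into two pieces.
    $pieces = explode(' V ', preg_replace('/\s+/', ' ', trim($teams->nodeValue)), 2);
    if (count($pieces) !== 2) {
        continue;
    }
    $Date = trim($header->nodeValue);
    $matches[$Date][] = array(
        'Home'    => trim($pieces[0]),
        'Away'    => trim($pieces[1]),
        'KickOff' => trim($kick->nodeValue),
    );
}

print_r($matches);
```

Note the `.//` in the row query: a path that starts with `//` is evaluated from the document root even when a context node is passed as the second argument to `query()`.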