
Scraping data from a website


mark103


Hi,

I am having a problem scraping data from a website. After scraping the page, I can't output the data from my PHP script; the script just shows an empty page.

Here is the HTML source I want to scrape:

<span id="row3Time" class="zc-ssl-pg-time">11:00 AM</span>
<a id="rowTitle3" class="zc-ssl-pg-title" href='http://tvlistings.zap2it.com/tv/sportscenter/EP00019917'>SportsCenter</a>
<ul class="zc-icons">
<li class="zc-ic zc-ic-span"><span class="zc-ic-live">LIVE</span></li></ul>
</li>
<li class="zc-ssl-pg" id="row1-4" style="">

<span id="row4Time" class="zc-ssl-pg-time">12:00 PM</span>
<a id="rowTitle4" class="zc-ssl-pg-title" href='http://tvlistings.zap2it.com/tv/sportscenter/EP00019917'>SportsCenter</a>
<ul class="zc-icons">
<li class="zc-ic zc-ic-span"><span class="zc-ic-live">LIVE</span></li></ul>
</li>
<li class="zc-ssl-pg" id="row1-5" style="">

<span id="row5Time" class="zc-ssl-pg-time">1:00 PM</span>
<a id="rowTitle5" class="zc-ssl-pg-title" href='http://tvlistings.zap2it.com/tv/sportscenter/EP00019917'>SportsCenter</a>
<ul class="zc-icons">
<li class="zc-ic zc-ic-span"><span class="zc-ic-live">LIVE</span></li></ul>

Here is the PHP source:


<?php

$contents = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');
preg_match('/<a id="rowTitle3" class="zc-ssl-pg-title"[.*]<\/a>/i', $data, $matches);
$rowtitle = $matches[1];
echo $rowtitle."<br>\n";
?>



And here is the PHP output:

<br>



Does anyone know how I can scrape the data from that website, starting at <a id="rowTitle3" and continuing to the end of the page?

Any advice would be much appreciated.

Thanks in advance.

Edited by mark103

Try changing


$contents = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');

to


$data = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');

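Also remember that $matches[0] holds the text matched by the whole pattern, while $matches[1] onward hold the captured groups. Your pattern has no capturing parentheses, so $matches[1] is never set; you are looking for $matches[0].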
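
To make the difference between the two indexes concrete, here is a minimal sketch that runs against one line of the sample markup from the first post instead of the live site. The shortened pattern and the capturing group are my own additions for illustration, not something the site requires.

```php
<?php
// Sample markup copied from the first post; no network request is made.
$data = '<a id="rowTitle3" class="zc-ssl-pg-title" href=\'http://tvlistings.zap2it.com/tv/sportscenter/EP00019917\'>SportsCenter</a>';

// One capturing group (the parentheses) around the link text.
preg_match('/<a id="rowTitle3"[^>]*>([^<]+)<\/a>/i', $data, $matches);

$wholeMatch = $matches[0]; // the entire <a ...>...</a> element
$linkText   = $matches[1]; // only the text captured by the parentheses

echo $linkText, "\n"; // prints "SportsCenter"
```

Without the parentheses, only $matches[0] would exist.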


Thank you very much for your help, but there is still a problem. There is no output when I use this:

<?php

$data = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');
$p = "/a id='rowTitle1' class='zc-ssl-pg-title'>(.*)<\/a>/";
preg_match($p, $html, $match);
echo $match[0];
?>

I am not really sure whether I have done something wrong.

Can you help?

Edited by mark103

The problem is in your regular expression. In your first post, square brackets define a character class, so [.*] matches exactly one character that is either a literal dot or a literal asterisk. You can fix the regex by simply removing the square brackets ([]), leaving the characters inside. That matches the sample input you gave in your first post, but your newest expression is completely different, so I'm not sure what exactly you are trying to match.
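
A small sketch of the difference, run against one line of the sample markup from the first post rather than the live page. The variable names are only for illustration.

```php
<?php
// One line of the sample markup from the first post.
$line = '<a id="rowTitle3" class="zc-ssl-pg-title" href=\'http://tvlistings.zap2it.com/tv/sportscenter/EP00019917\'>SportsCenter</a>';

// [.*] is a character class: exactly one character, either "." or "*".
$withBrackets = '/<a id="rowTitle3" class="zc-ssl-pg-title"[.*]<\/a>/i';

// .* means "any run of characters (except newlines)".
$withoutBrackets = '/<a id="rowTitle3" class="zc-ssl-pg-title".*<\/a>/i';

$bracketResult = preg_match($withBrackets, $line);    // 0: no match
$plainResult   = preg_match($withoutBrackets, $line); // 1: match

echo "with brackets: $bracketResult, without brackets: $plainResult\n";
```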

 

You probably want to do something like this:

$data = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');
preg_match_all('/<a id="rowTitle\d+" class="zc-ssl-pg-title".*<\/a>/im', $data, $matches);
$titles = $matches[0];

print_r($titles);

If you are NOT trying to get all the titles, which ones do you want?


You can use parentheses to capture segments:

$data = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');
preg_match_all('/<a id="rowTitle\d+" class="zc-ssl-pg-title"[^>]*>([^<]+)<\/a>/im', $test, $matches);
$titles = $matches[1];

print_r($titles);

Sorry, I switched the variables by accident. This should work:

$data = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');
preg_match_all('/<a id="rowTitle\d+" class="zc-ssl-pg-title"[^>]*>([^<]+)<\/a>/im', $data, $matches);
$titles = $matches[1];

print_r($titles);
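
If you also want the time next to each title, one possible approach is to capture the row number from both ids and pair them up. This is only a sketch against sample rows copied from the first post (the live page is not fetched), and the array names $times and $listings are my own.

```php
<?php
// Sample rows copied from the first post; the live page is not fetched here.
$data = <<<'HTML'
<span id="row3Time" class="zc-ssl-pg-time">11:00 AM</span>
<a id="rowTitle3" class="zc-ssl-pg-title" href='http://tvlistings.zap2it.com/tv/sportscenter/EP00019917'>SportsCenter</a>
<span id="row4Time" class="zc-ssl-pg-time">12:00 PM</span>
<a id="rowTitle4" class="zc-ssl-pg-title" href='http://tvlistings.zap2it.com/tv/sportscenter/EP00019917'>SportsCenter</a>
HTML;

// Capture the row number and the text for both the times and the titles.
preg_match_all('/<span id="row(\d+)Time" class="zc-ssl-pg-time">([^<]+)<\/span>/i', $data, $timeRows, PREG_SET_ORDER);
preg_match_all('/<a id="rowTitle(\d+)" class="zc-ssl-pg-title"[^>]*>([^<]+)<\/a>/i', $data, $titleRows, PREG_SET_ORDER);

// Index the times by row number so each title can look up its own time.
$times = [];
foreach ($timeRows as $row) {
    $times[$row[1]] = trim($row[2]);
}

$listings = [];
foreach ($titleRows as $row) {
    $listings[] = [
        'time'  => $times[$row[1]] ?? '', // empty string if no time was found
        'title' => trim($row[2]),
    ];
}

foreach ($listings as $listing) {
    echo $listing['time'], ' ', $listing['title'], "\n";
}
```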

Thanks for your help. I can't output the correct data for the current time. For example, my local time is 9:00pm, while the current time on the site is 4:00pm. I can only output the data from before the current time.

Can you help?


You could either use cURL (or similar) to send the correct cookie for your timezone (I see that is an option on the site), or you could combine the day headers with the time and use strtotime() with the correct time addition to create a timestamp of the correct date/time.
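
As a rough sketch of the second option, the scraped date and time can be interpreted in the site's timezone and then converted to yours. The helper name toLocalTime() is hypothetical, and the two zones are assumptions: I am guessing the listing is in US Eastern time and the viewer is in the UK. The date is a fixed sample value rather than scraped data.

```php
<?php
// Hypothetical helper: interpret a scraped date and time in the site's
// timezone and convert it to the viewer's timezone.
function toLocalTime(string $date, string $time, string $siteTz, string $localTz): DateTimeImmutable
{
    $scraped = new DateTimeImmutable("$date $time", new DateTimeZone($siteTz));
    return $scraped->setTimezone(new DateTimeZone($localTz));
}

// Assumed zones: the listing is in US Eastern time, the viewer is in the UK.
$local = toLocalTime('2011-08-01', '10:00 PM', 'America/New_York', 'Europe/London');

echo $local->format('M j, Y g:iA'), "\n"; // prints "Aug 2, 2011 3:00AM"
```

Using named zones instead of a fixed offset means daylight saving changes are handled by PHP's timezone database.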


Thanks. Could you please post code showing how I could use cURL or strtotime() to get the time 5 hours back from my current time, so that I get the correct data from that website? For example, my current time is 10pm, so I look 5 hours back to 5pm and get the data shown at 5pm.

Edited by mark103

You might have to choose a timezone that is in the US for the cURL method to work. I'm not sure where you are, but I could only get 4:00 and 6:00 by trying Hawaii and Alaska respectively. If you can get the website to show the correct time while you are browsing it, let me know how and I can help. Otherwise, you might have to use the other method.


I am in the UK, so I don't know how to use cURL to set the timezone before scraping the data in the row whose time matches my current time. There is no other website I can use; this is the only one. Could you please help?


What you have to do is find a relationship between the dates and the times. Usually the only way is to relate them by their physical position in the page. Fortunately, the HTML actually has matching numbers in the ids, so I've adjusted the regex accordingly. After putting all the values into a format where they can be related, they can be iterated through. Since you want to do date math, a listing's date can change when the adjusted time carries over to a different day. Because of this, the output probably shouldn't be done until after all the time adjustments are complete.

Here is an example of how this works. I've included the original scraped text in parentheses in the output so you can see what it was converted from. You should be able to take this code and adjust the output to meet your needs.

$test = file_get_contents('http://tvlistings.zap2it.com/tvlistings/ZCSGrid.do?stnNum=10179');

//Find all header dates
preg_match_all('/<li class="zc-ssl-sp" id="dayLabel(\d+-\d+)">([^<]+)<\/li>/mi', $test, $matches);
//Find all listings
preg_match_all('/<li class="zc-ssl-pg" id="row(\d+-\d+)" style="">[^<]+<span id="row\d+Time" class="zc-ssl-pg-time">([^<]+)<\/span>[^>]+>([^<]+)<\/a>/mi', $test, $matches2);

//Set arrays
$days = $matches[2];
$day_nums = $matches[1];
$listing_nums = $matches2[1];
$listing_times = $matches2[2];
$listing_titles = $matches2[3];

$j=0;	//listings pointer
$listing_count = count($listing_nums);
foreach ($day_nums as $i => $day_num)
{
	$date = fixDate($days[$i]);	//Change words that strtotime can't parse
	$next = $i+1;
	if (!isset($day_nums[$next]))
		break;
	//loop through until the header number matches the listing number, stopping at the last listing
	while ($j < $listing_count && $listing_nums[$j] != $day_nums[$next])
	{
		$time = trim($listing_times[$j]);
		$datetime = date('M j, Y g:iA', strtotime($date . ' ' . $time . ' -5 hours'));
		echo '('.$days[$i].'-'.$listing_times[$j].') '.$datetime . ' - ' . $listing_titles[$j] .'<br/>';
		$j++;
	}
}

function fixDate($date)
{
	$find = array(
		'/Last Night/',
		'/(?:^[^,]+,)|(?:Night)/',
		'/Tonight/'
	);
	$replace = array(
		'Yesterday',
		'',
		'Today',
	);
	
	return preg_replace($find, $replace, $date);
}

I hope that helps.
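
One caveat: strtotime() returns false when it cannot parse its input, and date() then formats that as the start of the Unix epoch instead of reporting a problem. A minimal sketch of a guard follows; the helper name formatListingTime() and its default shift are hypothetical, and the inputs are fixed sample values rather than scraped data.

```php
<?php
// Hypothetical helper: format a scraped date/time after applying a shift,
// or return null when strtotime() cannot parse the combined text.
function formatListingTime(string $date, string $time, string $shift = '+5 hours'): ?string
{
    $timestamp = strtotime(trim($date) . ' ' . trim($time) . ' ' . $shift);
    if ($timestamp === false) {
        return null; // unparseable input; the caller decides how to report it
    }
    return date('M j, Y g:iA', $timestamp);
}

$good = formatListingTime('2011-08-01', '10:00 PM');
$bad  = formatListingTime('xyzzy', '7:00 PM');

echo $good ?? 'could not parse', "\n";
echo $bad ?? 'could not parse', "\n";
```

Checking for null in the loop makes it obvious which header text still needs handling in fixDate().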


Thanks, I have put the code in my PHP page and I saw the list of titles including the times. But you have got it wrong; you don't understand what I want to achieve. Let me explain again. I want to scrape the data starting from the current time in the USA, which is 5 hours behind my current time: my current time is 3:00am and the USA time is 10:00pm.

Please see the data shown from the programme's current time, like this:

10:00 PM Baseball Tonight

    LIVE

11:00 PM SportsCenter

    LIVE

Tomorrow
12:00 AM SportsCenter

    LIVE

1:00 AM SportsCenter

    LIVE

2:00 AM SportsCenter

    LIVE

3:00 AM SportsCenter

4:00 AM SportsCenter

Now I hope you get my point?

Edited by mark103

I thought you said you were five hours behind. Just change the minus (-) to a plus (+) in strtotime() and it will add five hours instead of subtracting it.

$datetime = date('M j, Y g:iA', strtotime($date . ' ' . $time . ' +5 hours'));
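
If the goal is also to hide everything that started before the current time, one possible sketch is to collect the listings with their timestamps first and filter them before printing. The helper name upcomingListings(), the array layout, and the sample times are all made up for illustration; UTC is used only to keep the example independent of the server's timezone.

```php
<?php
// Hypothetical helper: keep only listings that start at or after $now.
// Each listing is an array with a Unix timestamp under 'start'.
function upcomingListings(array $listings, int $now): array
{
    return array_values(array_filter(
        $listings,
        function (array $listing) use ($now): bool {
            return $listing['start'] >= $now;
        }
    ));
}

// Made-up sample data, not scraped from the site.
$listings = [
    ['start' => strtotime('2011-08-01 21:00 UTC'), 'title' => 'NFL Live'],
    ['start' => strtotime('2011-08-01 22:00 UTC'), 'title' => 'Baseball Tonight'],
    ['start' => strtotime('2011-08-01 23:00 UTC'), 'title' => 'SportsCenter'],
];
$now = strtotime('2011-08-01 22:00 UTC'); // in real use this would be time()

$upcoming = upcomingListings($listings, $now);
foreach ($upcoming as $listing) {
    echo gmdate('g:i A', $listing['start']), ' ', $listing['title'], "\n";
}
```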

Yes, but I said I want to scrape the titles that are on today, from the current time to the end of the page, and not yesterday's. I want to display them in my PHP page like this:

The USA current time is 10:00PM

10:00 PM Baseball Tonight

    LIVE

11:00 PM SportsCenter

    LIVE

Tomorrow
12:00 AM SportsCenter

    LIVE

1:00 AM SportsCenter

    LIVE

2:00 AM SportsCenter

    LIVE

3:00 AM SportsCenter

4:00 AM SportsCenter

Not like this:

( Yesterday-7:00 PM) Dec 31, 1969 7:00PM - Around the Horn
( Yesterday-7:00 PM) Dec 31, 1969 7:00PM - Pardon the Interruption
( Yesterday-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Yesterday-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter Special
(Last Night-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter Special: On the Clock
(Last Night-7:00 PM) Dec 31, 1969 7:00PM - NFL Live
(Last Night-7:00 PM) Dec 31, 1969 7:00PM - Baseball Tonight
(Last Night-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - SportsCenter
( Today-7:00 PM) Dec 31, 1969 7:00PM - Outside the Lines
( Today-7:00 PM) Dec 31, 1969 7:00PM - College Football Live

Are you thick???????


This topic is now closed to further replies.