parsing HTML page


ravi181229

Hi,

I would like to parse the HTML page at http://rivals.yahoo.com/ncaa/baseball/collegebroadcast and get the info for all events on a particular date.

For example:

Sat, Dec 6 (get this date and all the events' info for this date)

San Francisco vs. Long Beach St.  - Men's Basketball   10:00 pm EST

http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487697

Mon, Dec 8 (get this date and all the events' info for this date), and so on.

I need help.

Thanks

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/
Share on other sites

From the following code:

 

<table border="0" cellspacing="0" cellpadding="2" class="ysptblclbg4" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Sat, Dec 6</td></tr>

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; San Francisco <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487697','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_video_box.gif" width="16" height="12" alt="Free Video"></a> </span> vs. Long Beach St. <span class='yspscores'></span> - Men's Basketball </td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  10:00 pm EST  </td>

 

</tr>

</table></td></tr>

</table><table border="0" cellspacing="0" cellpadding="2" class="ysptblclbg4" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Mon, Dec 8</td></tr>

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Lehigh <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10603183','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_video_box.gif" width="16" height="12" alt="Subscription Video"></a> </span> vs. Albany <span class='yspscores'></span> - Men's Basketball </td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  6:30 pm EST  </td>

 

</tr>

</table></td></tr>

</table>

</table>

 

I would like to display:

 

Sat, Dec 6

• San Francisco Free Video  vs. Long Beach St.  - Men's Basketball   10:00 pm EST

Mon, Dec 8

• Lehigh Subscription Video  vs. Albany  - Men's Basketball   6:30 pm EST

 

 

 

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-707101
Share on other sites

I would also like to have the link (sorry, I missed it in my previous post):

 

Sat, Dec 6

• San Francisco Free Video  vs. Long Beach St.  - Men's Basketball          10:00 pm EST

    http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487697

Mon, Dec 8

• Lehigh Subscription Video  vs. Albany  - Men's Basketball          6:30 pm EST

    http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10603183

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-707128
Share on other sites

<?php
$string = '<table border="0" cellspacing="0" cellpadding="2" class="ysptblclbg4" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Sat, Dec 6</td></tr>
<tr><td>
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">
<tr>
<td nowrap>  San Francisco <span class=\'yspscores\'><a href="javascript:void(window.open(\'http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487697\',\'playerWindow\',\'width=793,height=608,scrollbars=no\'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_video_box.gif" width="16" height="12" alt="Free Video"></a> </span> vs. Long Beach St. <span class=\'yspscores\'></span> - Men\'s Basketball </td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  10:00 pm EST  </td>

</tr>
</table></td></tr>
</table><table border="0" cellspacing="0" cellpadding="2" class="ysptblclbg4" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Mon, Dec 8</td></tr>
<tr><td>
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">
<tr>
<td nowrap>  Lehigh <span class=\'yspscores\'><a href="javascript:void(window.open(\'http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10603183\',\'playerWindow\',\'width=793,height=608,scrollbars=no\'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_video_box.gif" width="16" height="12" alt="Subscription Video"></a> </span> vs. Albany <span class=\'yspscores\'></span> - Men\'s Basketball </td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  6:30 pm EST  </td>

</tr>
</table></td></tr>
</table>
</table>';

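// Date headers, e.g. "Sat, Dec 6". The \s* and the lazy (.*?) trim the padding around the text.
preg_match_all('~<td class="yspdetailttl" valign="bottom" height="12">\s*(.*?)\s*</td>~', $string, $matches);
$dates = $matches[1];

//print_r($dates);

// Event cells. Requiring a non-space character (\S) skips the empty <td nowrap>  </td>
// spacer cell in the sample, which would otherwise put $vs out of step with $dates and $time.
preg_match_all('~<td nowrap>\s*(\S.*?)\s*</td>~', $string, $matches);
$vs = $matches[1];

//print_r($vs);

// Start times, e.g. "10:00 pm EST".
preg_match_all('~<td width="100%" nowrap class="yspscores">\s*(.*?)\s*</td>~', $string, $matches);
$time = $matches[1];

//print_r($time);
$count = count($dates);

// Pairs one event with each date, so this only lines up when every date has exactly one event.
for ($i = 0; $i < $count; $i++) {
    echo $dates[$i] . "<br />" . $vs[$i] . "\t\t" . $time[$i] . "<br /><br />";
}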

?>

 

I will let you figure out how to display them. I'm not sure this is the most efficient way, but it works.
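The link asked for earlier in the thread sits inside the javascript: href, as the first argument to window.open(). A minimal sketch of pulling it out of one event cell, and of turning the icon into its alt text ("Free Video"), might look like this. The variable names ($cell, $links, $text) are made up for the example, and the cell is copied from the sample markup rather than fetched from the live page.

```php
<?php
// One event cell, copied from the sample markup posted above.
$cell = '<td nowrap> San Francisco <span class=\'yspscores\'>'
      . '<a href="javascript:void(window.open(\'http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487697\',\'playerWindow\',\'width=793,height=608,scrollbars=no\'));"  >'
      . '<img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_video_box.gif" width="16" height="12" alt="Free Video"></a> </span>'
      . ' vs. Long Beach St. <span class=\'yspscores\'></span> - Men\'s Basketball </td>';

// The first quoted argument of window.open() is the stream URL.
preg_match_all("~window\\.open\\('([^']+)'~", $cell, $m);
$links = $m[1];

// Swap each <img ... alt="..."> for its alt text, drop the remaining tags,
// then collapse runs of whitespace.
$text = preg_replace('~<img[^>]*\balt="([^"]*)"[^>]*>~i', '$1', $cell);
$text = trim(preg_replace('~\s+~', ' ', strip_tags($text)));

echo $text, "\n";
foreach ($links as $link) {
    echo "    ", $link, "\n";
}
```

A cell with two icons (one per team) would simply yield two entries in $links.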

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-707134
Share on other sites

This code works perfectly, but it does not display all the events for a particular date.

For example (there can be many events under a particular date):

 

Fri, Dec 5

• Stony Brook  vs. Lehigh Subscription Audio    6:30 pm EST

• Hope Free Audio  vs. Carthage    6:40 pm EST

• Pennsylvania Subscription Audio  vs. Navy Subscription Audio    7:00 pm EST

• Iowa Subscription Audio  vs. Bryant    7:30 pm EST

• Texas A&M  vs. Arizona Free Audio    8:30 pm EST

Sat, Dec 6

• Davidson  vs. N.C. State Subscription Audio    12:00 pm EST

• Holy Cross  vs. W. Michigan Subscription Audio    12:30 pm EST

• Indiana Subscription Audio  vs. Gonzaga    12:30 pm EST

• Iowa St. Subscription Audio  vs. Oregon St. Subscription Audio    1:30 pm EST

• Kansas  vs. Jackson St. Free Audio    2:00 pm EST

 

 

HTML code:

 

<table border="0" cellspacing="0" cellpadding="2" class="ysprow1" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Fri, Dec 5</td></tr>

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Hope <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10092771','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span> vs. Carthage <span class='yspscores'></span></td>

 

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  5:40 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Stony Brook <span class='yspscores'></span> vs. Lehigh <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10603182','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  6:30 pm EST  </td>

 

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Pennsylvania <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10838785','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span> vs. Navy <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10689020','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  7:00 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Iowa <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10219999','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span> vs. Bryant <span class='yspscores'></span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  7:30 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

 

<td nowrap> &#149; Texas A&M <span class='yspscores'></span> vs. Arizona <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10590979','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  8:30 pm EST  </td>

</tr>

</table></td></tr>

</table><table border="0" cellspacing="0" cellpadding="2" class="ysprow1" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Sat, Dec 6</td></tr>

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Davidson <span class='yspscores'></span> vs. N.C. State <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10663073','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>

 

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  12:00 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Holy Cross <span class='yspscores'></span> vs. W. Michigan <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10592143','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  12:30 pm EST  </td>

 

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Indiana <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10331117','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span> vs. Gonzaga <span class='yspscores'></span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  12:30 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Iowa St. <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10345634','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span> vs. Oregon St. <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10580288','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  1:30 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

 

<td nowrap> &#149; Providence <span class='yspscores'></span> vs. Rhode Island <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10487157','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  2:00 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Kansas <span class='yspscores'></span> vs. Jackson St. <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10530559','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span></td>

 

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  2:00 pm EST  </td>

</tr>

</table></td></tr>
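One possible way to keep every event under its date, sketched against a shortened copy of the markup above and not tested on the live page: split the page at each date header cell, then match the event rows inside each piece. The names cellText(), $rowPattern and $schedule are invented for this example, and it assumes each event cell sits on a single line, as it does in the pasted HTML.

```php
<?php
// Shortened sample in the same shape as the markup above: two dates, three events.
$html = <<<'HTML'
<table class="ysprow1"><tr><td class="yspdetailttl" valign="bottom" height="12"> Fri, Dec 5</td></tr>
<tr><td>
<table>
<tr>
<td nowrap> &#149; Hope <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10092771','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span> vs. Carthage <span class='yspscores'></span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  5:40 pm EST  </td>
</tr>
</table></td></tr>
<tr><td>
<table>
<tr>
<td nowrap> &#149; Stony Brook <span class='yspscores'></span> vs. Lehigh <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10603182','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  6:30 pm EST  </td>
</tr>
</table></td></tr>
</table><table class="ysprow1"><tr><td class="yspdetailttl" valign="bottom" height="12"> Sat, Dec 6</td></tr>
<tr><td>
<table>
<tr>
<td nowrap> &#149; Davidson <span class='yspscores'></span> vs. N.C. State <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10663073','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  12:00 pm EST  </td>
</tr>
</table></td></tr>
</table>
HTML;

// Hypothetical helper: plain text of an event cell, with each icon replaced by its alt text.
function cellText($cell)
{
    $cell = str_replace('&#149;', '', $cell);                                  // drop the bullet entity
    $cell = preg_replace('~<img[^>]*\balt="([^"]*)"[^>]*>~i', '$1', $cell);
    return trim(preg_replace('~\s+~', ' ', strip_tags($cell)));
}

// With PREG_SPLIT_DELIM_CAPTURE the result is [before, date1, block1, date2, block2, ...].
$parts = preg_split('~<td class="yspdetailttl"[^>]*>\s*(.*?)\s*</td>~', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

// One row = event cell, spacer cell, time cell.
$rowPattern = '~<td nowrap>\s*(\S.*?)\s*</td>\s*<td nowrap>[^<]*</td>\s*'
            . '<td[^>]*class="yspscores"[^>]*>\s*(.*?)\s*</td>~';

$schedule = [];
for ($i = 1; $i + 1 < count($parts); $i += 2) {
    $date = $parts[$i];
    $schedule[$date] = [];
    preg_match_all($rowPattern, $parts[$i + 1], $rows, PREG_SET_ORDER);
    foreach ($rows as $row) {
        preg_match_all("~window\\.open\\('([^']+)'~", $row[1], $urls);
        $schedule[$date][] = [
            'event' => cellText($row[1]),
            'time'  => $row[2],
            'links' => $urls[1],
        ];
    }
}

foreach ($schedule as $date => $events) {
    echo $date, "\n";
    foreach ($events as $event) {
        echo "  ", $event['event'], "    ", $event['time'], "\n";
        foreach ($event['links'] as $url) {
            echo "      ", $url, "\n";
        }
    }
}
```

Keying $schedule by date means any number of events can hang off one date, which is the part the index-by-index loop in the earlier reply cannot do.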

 

 

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-707171
Share on other sites

The DOM works great as long as the page obeys the standards. I tried to do this with the DOM, but since Yahoo's page is not properly formed, it does not populate the DOM in PHP.

I could be doing it wrong, but if you have a working example using that page, I would love to see it, Curtis. Thanks!

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-708030
Share on other sites

No, it was my mistake, you're absolutely right. The only times I've tried this, I happened to be working with standards conforming documents. Obviously, that's a rare luxury with (X)HTML. Sorry about that.
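For what it is worth, DOMDocument::loadHTML() (as opposed to load() or loadXML()) goes through libxml's HTML parser, which tolerates a fair amount of non-conforming markup once its warnings are silenced with libxml_use_internal_errors(true). The following is only a sketch against a fragment shaped like the markup posted in this thread; whether it copes with the full live Yahoo page is an assumption that has not been checked here.

```php
<?php
// Sloppy on purpose: valueless "nowrap" attributes, a raw "&", and a stray closing </table>.
$html = <<<'HTML'
<table class="ysprow1"><tr><td class="yspdetailttl" valign="bottom" height="12"> Fri, Dec 5</td></tr>
<tr><td>
<table>
<tr>
<td nowrap> Texas A&M <span class='yspscores'></span> vs. Arizona <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10590979','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  8:30 pm EST  </td>
</tr>
</table></td></tr>
</table>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Date header cells.
$dates = [];
foreach ($xpath->query('//td[@class="yspdetailttl"]') as $td) {
    $dates[] = trim($td->textContent);
}

// Time cells (the <span> elements share the class name, so restrict the query to <td>).
$times = [];
foreach ($xpath->query('//td[@class="yspscores"]') as $td) {
    $times[] = trim($td->textContent);
}

// Alt text of the media icons.
$alts = [];
foreach ($xpath->query('//img') as $img) {
    $alts[] = $img->getAttribute('alt');
}

print_r($dates);
print_r($times);
print_r($alts);
```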

 

There are a couple of possibilities here that avoid reinventing the wheel. One is the PEAR package HTML_Common2, which uses PHP 5 OOP. A PHP 4 version is available, but not recommended. I briefly looked at some of the class members, and they seem to use regex as well.

 

At the very least, when writing complicated regexes, I prefer to use the /x modifier to allow whitespace and comments, because they are much easier to maintain that way.
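As a small illustration of the /x modifier, here is a sketch that picks apart one time cell from the markup in this thread. The pattern itself is made up for the example. Note that under /x a literal space has to be written as \s (or escaped), and the delimiter character must not appear inside a comment.

```php
<?php
$html = '<td width="100%" nowrap class="yspscores">  10:00 pm EST  </td>';

// With /x, unescaped whitespace in the pattern is ignored and "#" starts a comment.
$pattern = '~
    <td [^>]* class="yspscores" [^>]* >   # opening tag of the time cell
    \s*                                   # padding before the time
    ( \d{1,2} : \d{2} )                   # group 1: clock time, e.g. 10:00
    \s+
    ( am | pm )                           # group 2: am or pm
    \s+
    ( [A-Z]{2,4} )                        # group 3: time zone abbreviation
    \s* </td>
~xi';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], ' ', $m[2], ' ', $m[3], "\n";
}
```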

 

Also, there's a PECL extension called html_parse, which seems a better solution for handling HTML, but at the cost of some portability.

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-708193
Share on other sites

OK... can we write a regex to fetch the data shown in red below?

 

<table border="0" cellspacing="0" cellpadding="2" class="ysprow1" width="100%" style="border-collapse: collapse" bordercolor="#111111"><tr><td class="yspdetailttl" valign="bottom" height="12"> Wed, Dec 10</td></tr>

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Indiana <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10331118','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/pay_audio_box.gif" width="16" height="12" alt="Subscription Audio"></a> </span> vs. TCU <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10644764','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span></td>

 

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  6:00 pm EST  </td>

</tr>

</table></td></tr>

 

<tr><td>

<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="100%">

<tr>

<td nowrap> &#149; Long Island <span class='yspscores'><a href="javascript:void(window.open('http://cosmos.bcst.yahoo.com/up/collegetest/?cl=10730677','playerWindow','width=793,height=608,scrollbars=no'));"  ><img border="0" src="http://l.yimg.com/a/i/us/sp/ed/ic/free_audio_box.gif" width="16" height="12" alt="Free Audio"></a> </span> vs. Iona <span class='yspscores'></span></td>

<td nowrap>  </td>

<td width="100%" nowrap class="yspscores">  7:00 pm EST  </td>

 

</tr>

</table></td></tr>

</table>

 

I was trying with:

//$html contains above code

preg_match_all("~<tr><td class=\"yspdetailttl\" valign=\"bottom\" height=\"12\"> Wed, Dec 10</td></tr><tr><td>(.*)</td></tr>~", $html, $matches);
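Two things probably keep that attempt from matching: the pattern expects </tr><tr><td> to be directly adjacent, while the source has line breaks between the tags, and "." does not match a newline unless the s modifier is set. A hedged sketch of one alternative follows. The red highlighting did not survive in the post, so this assumes the goal is every event row under the given date. eventsForDate() is a hypothetical helper, and the sample is trimmed (media links removed) from markup posted in this thread.

```php
<?php
// Trimmed sample: one date with a single event, then the date being asked about.
$html = <<<'HTML'
<table class="ysptblclbg4"><tr><td class="yspdetailttl" valign="bottom" height="12"> Mon, Dec 8</td></tr>

<tr><td>
<table>
<tr>
<td nowrap> &#149; Lehigh <span class='yspscores'></span> vs. Albany <span class='yspscores'></span> - Men's Basketball </td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  6:30 pm EST  </td>
</tr>
</table></td></tr>
</table><table class="ysprow1"><tr><td class="yspdetailttl" valign="bottom" height="12"> Wed, Dec 10</td></tr>

<tr><td>
<table>
<tr>
<td nowrap> &#149; Indiana <span class='yspscores'></span> vs. TCU <span class='yspscores'></span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  6:00 pm EST  </td>
</tr>
</table></td></tr>

<tr><td>
<table>
<tr>
<td nowrap> &#149; Long Island <span class='yspscores'></span> vs. Iona <span class='yspscores'></span></td>
<td nowrap>  </td>
<td width="100%" nowrap class="yspscores">  7:00 pm EST  </td>
</tr>
</table></td></tr>
</table>
HTML;

// Hypothetical helper: returns the event/time pairs listed under one date header.
function eventsForDate($html, $date)
{
    // Capture everything after the date header up to the next date header (or the end).
    // The s modifier lets "." cross line breaks; the lazy (.*?) stops at the first boundary.
    $pattern = '~<td class="yspdetailttl"[^>]*>\s*' . preg_quote($date, '~')
             . '\s*</td>\s*</tr>(.*?)(?=<td class="yspdetailttl"|\z)~s';
    if (!preg_match($pattern, $html, $m)) {
        return [];
    }

    // Inside the block, event cells start with the bullet entity; time cells carry the class.
    preg_match_all('~<td nowrap>\s*&#149;\s*(.*?)\s*</td>~', $m[1], $cells);
    preg_match_all('~<td[^>]*class="yspscores"[^>]*>\s*(.*?)\s*</td>~', $m[1], $times);

    $events = [];
    foreach ($cells[1] as $i => $cell) {
        $events[] = [
            'event' => trim(preg_replace('~\s+~', ' ', strip_tags($cell))),
            'time'  => isset($times[1][$i]) ? $times[1][$i] : '',
        ];
    }
    return $events;
}

print_r(eventsForDate($html, 'Wed, Dec 10'));
```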

 

Link to comment
https://forums.phpfreaks.com/topic/135709-parsing-html-page/#findComment-711470
Share on other sites

Archived

This topic is now archived and is closed to further replies.
