
Help with parsing this HTML




I've got some HTML that I just need a couple of strings from. It's driving me mad, and I've tried a lot of things.


Here is the HTML:

<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">
<table style="width:930px;background-color:#deded5; border-style:dotted; border-width:1px; border-color:#79796F; border-top:none"><tr>
<td style="width:40px"> </td><td style="width:50px"><font class="overskrift2-ruteoversikt">Rutenr.:</td><td style="360px"><font class="overskrift2-ruteoversikt">Rutenavn:</td>
<td style="width:40px"> </td><td style="width:50px"><font class="overskrift2-ruteoversikt">Rutenr.:</td><td style="360px"><font class="overskrift2-ruteoversikt">Rutenavn:</td>
<table style="background-color:#f3f3de;width:924px;height:369px;border-style:dotted;border-width:1px;border-color:#79796F;border-top:none">
<tr valign="top">
<table style="width:462px">
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-100.htm">01-100</a></td><td style="width:360px">Moss-Fredrikstad-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-102.htm">01-102</a></td><td style="width:360px">Halden-Parken-Tistedal-Vold skog-Parken-Stenrød-Brekkerød-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif"
align="center"></td><td style="width:50px"><a
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-105.htm">01-105</a></td><td style="width:360px">Halden-Stangeløkka-Refne-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-110.htm">01-110</a></td><td style="width:360px">Halden-Sørli-Isebakke</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-111.htm">01-111</a></td><td style="width:360px">Halden-Svinesund-Strömstad</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-112.htm">01-112</a></td><td style="width:360px">Halden-Knardal-Hov-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-113.htm">01-113</a></td><td style="width:360px">Halden-Holtet-Bakke-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-114.htm">01-114</a></td><td style="width:360px">Halden-Aspedammen-Prestebakke-Kornsjø-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-115.htm">01-115</a></td><td style="width:360px">Halden-Elgklev</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-118.htm">01-118</a></td><td style="width:360px">Halden-Isebakke-Svinesund-Sponvika-Halden</td></tr>
<table style="width:462px">
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-121.htm">01-121</a></td><td style="width:360px">Halden-Torpedal-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-123.htm">01-123</a></td><td style="width:360px">Fjeld bru-Jørkebekk-Østkroken-Aremark</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-124.htm">01-124</a></td><td style="width:360px">Aremark-strømsfoss-Vestsida-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-131.htm">01-131</a></td><td style="width:360px">Ørje-Kasbo-Buer-Engsødegård-Granerud</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-132.htm">01-132</a></td><td style="width:360px">Ørje-Damholtet-Strømsfoss</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-140.htm">01-140</a></td><td style="width:360px">Halden-Aremark-Strømsfoss-Granerud-Ørje</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/FERJE-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-150.htm">01-150</a></td><td style="width:360px">Skjærhalden-Hvaler</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BAAT-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-191.htm">01-191</a></td><td style="width:360px">Strømsfoss-Tistedal/Ørje</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-192.htm">01-192</a></td><td style="width:360px">Halden-Brekkerød-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-194.htm">01-194</a></td><td style="width:360px">Halden ringbuss</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-199.htm">01-199</a></td><td style="width:360px">Skolebuss Halden</td></tr>


I want to parse this HTML and put the results into a MySQL database as two fields.

The first one is the bus number, the second is the place.


These are the bus numbers: <a href="../t/01-100.htm">01-100</a>

And these are the places: <td style="width:360px">Moss-Fredrikstad-Halden</td>


But there's one more thing. Each bus number belongs to one place, so I need to parse them in a way that keeps each pair together and doesn't mix them up.


Any help please?


Use preg_match().


Something like:

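// $htmlCode must be an array of lines here, e.g. from file() or explode()
foreach ($htmlCode as $lineNum => $htmlLine) {
    // Only lines that contain a route row will match
    if (preg_match('/(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)/', $htmlLine, $matches)) {
        $busNumber = $matches[2];
        $busName = $matches[4];
        $busArray[$busNumber] = $busName;
    }
}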


The regular expression is probably wrong, because I never get them right the first time and have to muck around changing them. I'll leave that up to you.


There are lots of ways this could be done. Generally speaking, when parsing HTML the best approach is to use some kind of document model such as DOMDocument. It can also be achieved with a regular expression, something along the lines of...


$pattern = '#<tr valign="top"><td style="width:40px"><IMG SRC="\.\./images/[a-z]+?-s\.gif" align="center"></td><td style="width:50px"><a href="\.\./t/[0-9]{2}-[0-9]{3}\.htm">([0-9]{2}-[0-9]{3})</a></td><td style="width:360px">([^<]+)</td></tr>#i';

preg_match_all($pattern, $input, $out);

...but it's not really the best way. I'd have given you an example of using the document model, but to be honest, I've never actually used it myself.
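For reference, a minimal sketch of what the document-model approach might look like. The two sample rows are copied from the question, the variable names are made up for illustration, and the XPath expressions assume every route row keeps the link cell immediately before the name cell. Names with non-ASCII characters (Ørje and friends) may need the page encoding declared before loading, which is not shown here.

```php
<?php
// Two rows copied from the question; the real page would come from file_get_contents().
$html = <<<'HTML'
<table style="width:462px">
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-100.htm">01-100</a></td><td style="width:360px">Moss-Fredrikstad-Halden</td></tr>
<tr valign="top"><td style="width:40px"><IMG SRC="../images/BUSS-s.gif" align="center"></td><td style="width:50px"><a href="../t/01-115.htm">01-115</a></td><td style="width:360px">Halden-Elgklev</td></tr>
</table>
HTML;

// The markup is not valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$routes = [];

// Every row that has a link inside one of its cells is treated as a route row.
foreach ($xpath->query('//tr[td/a]') as $row) {
    $link = $xpath->query('td/a', $row)->item(0);
    // The route name sits in the cell right after the cell holding the link.
    $nameCell = $xpath->query('td[a]/following-sibling::td[1]', $row)->item(0);
    if ($link !== null && $nameCell !== null) {
        $routes[trim($link->textContent)] = trim($nameCell->textContent);
    }
}

print_r($routes);
```

Because the number and the name are read from the same row, the pairs cannot get mixed up.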


Faux Edit: Catfish replied whilst I was typing this.


Here's my code now:

$htmlCode = file_get_contents("http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm");

preg_match('/(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)/', $htmlCode, $matches);
$busNumber = $matches[2];
$busName = $matches[4];
$busArray[$busNumber] = $busName;

echo $busNumber;
echo $busName;


It works, but it only outputs one of the entries I want to parse.

$busNumber is 01-100 and $busName is Moss-Fredrikstad, which is correct.


But there are more bus numbers and bus names than that.


That's because the code you are using was designed to operate on a per-line basis, hence the foreach loop in their code. The preg_match function stops after the first match of the pattern, so running it once over the whole page only ever gives you one entry. I'm also not entirely sure why they used five capture groups when you only want two bits of information, but that's by the by. You will need to either split the content into lines and run the array through the loop like Catfish did in their example, or use preg_match_all like I did in my example. I still stand by my suggestion that regular expressions are probably not the best solution though.
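To illustrate the difference, here is a small sketch. The input is a cut-down version of two rows from the question and the pattern is a shortened stand-in with only two capture groups, so treat both as assumptions rather than something tested against the real page.

```php
<?php
// Two simplified rows in the shape of the question's HTML (illustrative input).
$input = '<a href="../t/01-100.htm">01-100</a></td><td style="width:360px">Moss-Fredrikstad-Halden</td>' . "\n"
       . '<a href="../t/01-115.htm">01-115</a></td><td style="width:360px">Halden-Elgklev</td>';

// Only two capture groups: the route number and the route name.
$pattern = '#\.htm">([0-9]{2}-[0-9]{3})</a></td><td style="width:360px">([^<]+)</td>#';

// preg_match() stops at the first match and returns 1 (or 0 if nothing matched).
$firstOnly = preg_match($pattern, $input, $single);

// preg_match_all() keeps going and returns the total number of matches.
$allFound = preg_match_all($pattern, $input, $all);

echo $firstOnly, ': ', $single[1], "\n";
echo $allFound, ': ', implode(', ', $all[1]), "\n";
```

With preg_match_all, $all[1] holds every number and $all[2] holds every name, in the same order.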


That's because (as with most solutions provided by this forum) it's not perfectly custom-tailored to be copy/pasted into your code, mainly because you didn't post any. The foreach syntax is a construct for iterating through an array. I'm going to go ahead and guess you are passing it $htmlCode, which is a string, not an array, and as such can't be iterated through. You would need to turn the file into an array first, for example with explode, and then pass the returned array to the foreach loop. In my opinion this approach requires more work than is necessary; using preg_match_all to find the matches seems much more sensible. Depending on what you need to do with the information, you would then use a foreach loop to iterate through $matches to display it or do whatever else you need.
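If you did want to go the line-by-line route, a rough sketch could look like the following. The sample string stands in for the downloaded page, and the shortened pattern is an assumption based on the rows shown in the question.

```php
<?php
// Stand-in for the string returned by file_get_contents() (illustrative input).
$htmlCode = '<table style="width:462px">' . "\r\n"
    . '<tr valign="top"><td style="width:50px"><a href="../t/01-100.htm">01-100</a></td><td style="width:360px">Moss-Fredrikstad-Halden</td></tr>' . "\r\n"
    . '<tr valign="top"><td style="width:50px"><a href="../t/01-115.htm">01-115</a></td><td style="width:360px">Halden-Elgklev</td></tr>';

// Turn the single string into an array of lines so foreach has something to iterate.
$lines = explode("\n", $htmlCode);

$busArray = [];
foreach ($lines as $htmlLine) {
    // Strip a trailing carriage return in case the page uses Windows line endings.
    $htmlLine = rtrim($htmlLine, "\r");
    // Lines without a route row (such as the <table> tag) simply do not match.
    if (preg_match('/\.htm">(\d\d-\d\d\d)<\/a>.*px">([^<]*)<\/td>/', $htmlLine, $m)) {
        $busArray[$m[1]] = $m[2];
    }
}

print_r($busArray);
```

Reading the page with file() and the FILE_IGNORE_NEW_LINES flag instead of file_get_contents() would give you the array of lines directly.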


OK, so I tried preg_match_all:

$htmlCode = file_get_contents("http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm");
preg_match_all('/(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)/', $htmlCode, $matches);
$busNumber = $matches[2];
$busName = $matches[4];

echo $busNumber." ".$busName."<br>";


And the output is:

Array Array


As I said in my previous post, you need to loop through the results. You should really read the manual for the functions you use (I handily provided links in my earlier posts). It's outputting the word Array because $matches[2] holds an array of all the numbers and $matches[4] holds an array of all the names. One solution for outputting them would be...


foreach ($matches[2] as $k => $v) {
    echo 'Bus Number: ' . $v . ', Bus Name: ' . $matches[4][$k] . '<br/>';
}
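Another option is the PREG_SET_ORDER flag, which groups the results per match instead of per capture group, so each number already sits next to its name. A small sketch, using simplified sample input and a shortened two-group pattern (both are assumptions, not the real page):

```php
<?php
// Simplified sample rows (illustrative input).
$input = '<a href="../t/01-100.htm">01-100</a></td><td style="width:360px">Moss-Fredrikstad-Halden</td>' . "\n"
       . '<a href="../t/01-115.htm">01-115</a></td><td style="width:360px">Halden-Elgklev</td>';

$pattern = '#\.htm">([0-9]{2}-[0-9]{3})</a></td><td style="width:360px">([^<]+)</td>#';

// With PREG_SET_ORDER, $matches[0] is the first match, $matches[1] the second, and so on.
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$output = [];
foreach ($matches as $match) {
    // $match[0] is the full matched text, $match[1] the number, $match[2] the name.
    $output[] = 'Bus Number: ' . $match[1] . ', Bus Name: ' . $match[2];
}

echo implode("<br/>\n", $output);
```

Either layout works; this one just avoids having to line up two arrays by index.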


So what if I want to add these to a MySQL table?


I've tried this, but it outputs nothing and doesn't add anything to the table.

$host = "***"; 
$user = "***"; 
$pass = "***"; 
$database = "***"; 

$linkID = mysql_connect($host, $user, $pass) or die("Could not connect to host."); 
mysql_select_db($database, $linkID) or die("Could not find database."); 

$query = "SELECT * FROM ruteinfo_linjer ORDER BY bussnavn DESC"; 
$resultID = mysql_query($query, $linkID) or die("Data not found."); 

$htmlCode = file_get_contents("http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm");
preg_match_all('/(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)/', $htmlCode, $matches);
$busNumber = $matches[2];
$busName = $matches[4];

while ($row = mysql_fetch_array($resultID)) {
    foreach($matches[2] as $k=>$v) {
        if($v == $row['bussnummer']) {
            //this is just a test
            echo 'Bus Number: ' . $v . 'Bus Name: ' . $matches[4][$k] . '<br/>';
        } else {
            $sql = "INSERT INTO ruteinfo_linje(fylke,bussnummer,bussnavn) VALUES('Østfold', '".$v."', '".$matches[4][$k]."'";
            $result = mysql_query($sql, $linkID) or die("Error");

            echo "hello";
        }
    }
}


Please excuse me, I've almost forgotten PHP. It's been a while since I've developed with PHP, because I'm an Objective-C and Cocoa programmer now.
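A few things in the code above look like likely culprits, though this is a guess from reading it rather than from running it. The foreach sits inside the while loop, so if the table is empty the loop body never runs and nothing is printed or inserted. The INSERT statement is missing its closing parenthesis. The SELECT reads from ruteinfo_linjer while the INSERT writes to ruteinfo_linje. Note also that the mysql_* functions were removed in PHP 7.

A minimal sketch of one way to do the insert with PDO and prepared statements follows. An in-memory SQLite database stands in for MySQL so the example runs on its own, the function name importRoutes is made up, and the table name ruteinfo_linjer is an assumption because the post uses two spellings.

```php
<?php
// Insert every route that is not already in the table; returns how many rows were added.
// importRoutes() is a hypothetical helper, not part of any library.
function importRoutes(PDO $pdo, array $routes): int
{
    $exists = $pdo->prepare('SELECT COUNT(*) FROM ruteinfo_linjer WHERE bussnummer = ?');
    $insert = $pdo->prepare('INSERT INTO ruteinfo_linjer (fylke, bussnummer, bussnavn) VALUES (?, ?, ?)');

    $added = 0;
    foreach ($routes as $number => $name) {
        // Loop over the parsed routes, not over the table rows, so an empty table still works.
        $exists->execute([$number]);
        if ((int) $exists->fetchColumn() === 0) {
            // Placeholders keep the values out of the SQL string itself.
            $insert->execute(['Østfold', $number, $name]);
            $added++;
        }
    }
    return $added;
}

// Stand-in database (assumption). For MySQL the DSN would look something like
// 'mysql:host=...;dbname=...;charset=utf8mb4', with a username and password.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE ruteinfo_linjer (fylke TEXT, bussnummer TEXT, bussnavn TEXT)');

// In the real script this array would be built from the preg_match_all results.
$routes = ['01-100' => 'Moss-Fredrikstad-Halden', '01-115' => 'Halden-Elgklev'];

$firstRun = importRoutes($pdo, $routes);
$secondRun = importRoutes($pdo, $routes); // nothing new to add the second time

echo "Added $firstRun rows, then $secondRun rows\n";
```

Setting the error mode to exceptions means a broken query is reported with a message instead of failing silently.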
