
preg_match_all help


cordoprod


Hi,

I'm trying to parse some HTML code.

It's a whole webpage, and I need to start parsing at a particular opening tag and stop at that tag's closing tag.

 

This is an example:

<div id="1"> // start parse here
blah blah
blah blah
</div> // end parsing here

 

Here is my regex:

$tabeller = preg_match_all('/^<div id="Tab01">(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)<\/div>$/mu', $htmlCode, $matches);
die(var_dump($matches));

 

The output is just empty arrays when I try that code.

 

And here is the site I'm trying to do it with:

http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm


At a guess, the HTML source you are trying to match contains vertical whitespace (i.e. newline characters). By default the dot doesn't match newlines, so you get empty arrays because nothing in the HTML matches your pattern. Try adding the s modifier to fix that. Having said that, because you are using greedy quantifiers, each match is likely to swallow a lot more than you want, so you'll end up with fewer matches being returned. What I mean is that everywhere you have .* it will consume as many characters as it can and only give characters back until the rest of the pattern can match. You will probably need to make them lazy (.*?).
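To make the two points above concrete, here is a minimal sketch run against a made-up two-tab string (the sample HTML is an assumption, not the real page). It compares the pattern without the s modifier, with s and a greedy .*, and with s and a lazy .*?.

```php
<?php
// Hypothetical sample input: two divs whose content spans several lines.
$html = "<div id=\"Tab01\">\nfirst\n</div>\n<div id=\"Tab02\">\nsecond\n</div>";

// Without the s modifier "." does not match newlines, so nothing matches.
$noDotAll = preg_match('#<div id="Tab01">(.*)</div>#', $html, $unused);

// With s, "." crosses newlines, but the greedy .* runs on to the LAST </div>.
preg_match('#<div id="Tab01">(.*)</div>#s', $html, $greedy);

// With s and a lazy .*?, the match stops at the FIRST </div>.
preg_match('#<div id="Tab01">(.*?)</div>#s', $html, $lazy);

var_dump($noDotAll);                         // 0: no match at all
var_dump(strpos($greedy[1], 'Tab02') !== false); // true: swallowed the second tab
var_dump(trim($lazy[1]));                    // "first"
```

The lazy version only behaves this way when the tab contains no nested div, since it stops at the first closing tag it meets.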


As I'm on my way to bed... Complete guess off the top of my head...

 

$tabeller = preg_match_all('/^<div id="Tab01">(.*?\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*?px">)(.*?)(<\/td>.*)<\/div>$/su', $htmlCode, $matches);


I want output like this:

http://www.cordoproduction.com/x.png

 

As you can see, Halden is one of the tabs on that page. The tabs are JavaScript-driven, so the content of every tab is in the same HTML source.

 

I want to separate the content of the tabs, because if I parse from the beginning of the page to the end, I get the content from all the tabs at once.

 

That's why I need to start at <div id="Tab0x"> and end at </div>.


I'm sure Regular Expressions aren't the best solution for parsing this HTML, but I'm also sure you've been told that before, so I'm not sure why you put aside XPath. Matching the info in the div will probably be easier if you do two matches: one to isolate the div, then one to pull the routes out of it...

 

$div_pattern = '#<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">(.*?)</div>#s';
$info_pattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td><td style="width:360px">([^<]*)</td></tr>#s';
preg_match($div_pattern, $input, $out);
preg_match_all($info_pattern, $out[1], $out);
print_r($out);
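For anyone who wants to try the two-step idea without fetching the page, here is a rough sketch against a made-up sample. The sample markup, the route names and the helper routesInTab() are all assumptions for illustration. It also loosens the div pattern to accept any attributes after the id, and checks that the first match succeeded before running the second.

```php
<?php
// Hypothetical sample mimicking the structure discussed in the thread.
$input = <<<'HTML'
<div id="Tab01" style="height: 100px">
<table><tr><td><a href="../t/01-101.htm">01-101</a></td><td style="width:360px">Halden - Oslo</td></tr>
<tr><td><a href="../t/01-102.htm">01-102</a></td><td style="width:360px">Halden - Sarpsborg</td></tr></table>
</div>
<div id="Tab02" style="height: 100px">
<table><tr><td><a href="../t/02-201.htm">02-201</a></td><td style="width:360px">Moss - Horten</td></tr></table>
</div>
HTML;

// Hypothetical helper: returns the routes found inside one tab.
function routesInTab(string $html, string $tabId): array
{
    // Step 1: isolate one tab. [^>]* tolerates extra attributes; the lazy
    // (.*?) stops at the first </div>, so this assumes no nested divs.
    $divPattern = '#<div id="' . preg_quote($tabId, '#') . '"[^>]*>(.*?)</div>#s';
    if (!preg_match($divPattern, $html, $div)) {
        return array();
    }

    // Step 2: pull route number and name out of that tab only.
    $infoPattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td>'
                 . '<td style="width:360px">([^<]*)</td></tr>#s';
    preg_match_all($infoPattern, $div[1], $rows, PREG_SET_ORDER);

    $routes = array();
    foreach ($rows as $row) {
        $routes[] = array('number' => $row[1], 'name' => trim($row[2]));
    }
    return $routes;
}

print_r(routesInTab($input, 'Tab01'));
```

Calling routesInTab($input, 'Tab02') returns only the second tab's route, and an unknown id gives an empty array.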


[ot]

I'm sure Regular Expressions aren't the best solution for parsing this HTML

Not particularly.  Just in case anyone was wondering, here is one way to parse the required information using the DOM/XPath.

 

<?php

$url  = 'http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm';

$dom = new DOMDocument;

// HTML will have lots of XML errors, ignore them when loading
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);

// Grab the location (Halden) from the JavaScript
$script   = $dom->getElementsByTagName('script')->item(2)->textContent;
$location = 'Unknown';
if (preg_match('/^\["([^"]+)", "Tab01"/m', $script, $match)) {
    $location = $match[1];
}

// Query the first tab for the routes
$xpath  = new DOMXPath($dom);
$tab    = $xpath->query('//div[@id="Tab01"]')->item(0);
$rows   = $xpath->query('./table[2]/tr/td/table/tr', $tab);
$routes = array();
foreach ($rows as $row) {
    $cells    = $row->getElementsByTagName("td");
    $routes[] = array(
        'number' => $cells->item(1)->textContent,
        'name'   => str_replace("\r\n", "", $cells->item(2)->textContent)
    );
}


// Output routes
header('Content-Type: text/html; charset=utf-8');
?>
<h2><?php echo $location ?></h2>
<?php if (empty($routes)) : ?>
<p>No routes found :-(</p>
<?php else : ?>
<?php foreach ($routes as $route) : ?>
<strong><?php echo $route['number'] ?></strong>
<?php echo $route['name'] ?>
<strong><?php echo $location ?></strong>
<br>
<?php endforeach; ?>
<?php endif; ?>

 

[Edited to fix super-long HTML line][/ot]
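If you want to experiment with the DOM/XPath approach offline, the same idea can be sketched against an in-memory string. The markup below is a simplified, made-up stand-in for the real page (so the cell positions differ from the script above), and it needs PHP's DOM extension.

```php
<?php
// Hypothetical markup standing in for the real page.
$html = '<html><body>'
      . '<div id="Tab01"><table>'
      . '<tr><td><a href="../t/01-101.htm">01-101</a></td><td>Route A</td></tr>'
      . '<tr><td><a href="../t/01-102.htm">01-102</a></td><td>Route B</td></tr>'
      . '</table></div>'
      . '<div id="Tab02"><table>'
      . '<tr><td><a href="../t/02-201.htm">02-201</a></td><td>Route C</td></tr>'
      . '</table></div>'
      . '</body></html>';

$dom = new DOMDocument;

// Collect parser warnings internally instead of printing them,
// then restore whatever setting was active before.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Only rows inside the Tab01 div are selected; other tabs are ignored.
$xpath  = new DOMXPath($dom);
$routes = array();
foreach ($xpath->query('//div[@id="Tab01"]//tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue; // skip rows that don't have both cells
    }
    $routes[] = array(
        'number' => trim($cells->item(0)->textContent),
        'name'   => trim($cells->item(1)->textContent),
    );
}

print_r($routes);
```

Because the query is scoped to the div with the right id, nested elements inside the tab are not a problem here, which is the main advantage over the lazy-regex approach.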
