preg_match_all help

cordoprod · January 9, 2010

Hi,

I'm trying to parse some HTML code.

It's a whole webpage, and I need to start parsing at a particular opening tag and stop at its closing tag.

This is an example:

<div id="1"> // start parse here
blah blah
blah blah
</div> // end parsing here

Here is my regex:

$tabeller = preg_match_all('/^<div id="Tab01">(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)<\/div>$/mu', $htmlCode, $matches);
die(var_dump($matches));

The output is just empty arrays when I try that code.

And here is the site I'm trying to do it with:

http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm

cags · January 9, 2010

At a guess, the HTML source you are matching against contains vertical whitespace (i.e. newline characters). By default the dot doesn't match newlines, so you get empty arrays because nothing in the HTML matches your pattern. Try adding the s modifier to fix that.

Having said that, because you are using greedy quantifiers you are likely to match a lot more than you want in each match, which means you'll end up with fewer matches being returned. What I mean is that everywhere you have .* it will consume as many characters as it can, giving them back only as far as needed for the rest of the pattern to match. You will probably need to make them lazy (.*?).
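
To make both points concrete, here is a minimal sketch. The $html string is a made-up fragment for illustration, not the markup of the real page, and the variable names are arbitrary. It compares the same pattern with and without the s modifier, and then a greedy against a lazy quantifier.

```php
<?php
// Hypothetical sample input; the real page's markup will differ.
$html = "<div id=\"Tab01\">\n<td>01-101</td>\n<td>01-102</td>\n</div>";

// Without /s the dot stops at each newline, so the pattern cannot span lines.
$withoutS = preg_match('/<div id="Tab01">(.*)<\/div>/', $html, $m1);

// With /s the dot also matches newlines, so the whole div can be captured.
$withS = preg_match('/<div id="Tab01">(.*)<\/div>/s', $html, $m2);

// Greedy: .* runs on to the last </td>. Lazy: .*? stops at the first one.
preg_match('/<td>(.*)<\/td>/s', $html, $greedy);
preg_match('/<td>(.*?)<\/td>/s', $html, $lazy);

var_dump($withoutS, $withS, $greedy[1], $lazy[1]);
```

The first call finds nothing and the second finds one match. The greedy capture swallows both cells in one go, while the lazy capture holds only the first cell.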

cordoprod · January 9, 2010

Can you please show me how to do this in my code so I can understand it correctly?

cags · January 9, 2010

As I'm on my way to bed... Complete guess off the top of my head...

$tabeller = preg_match_all('/^<div id="Tab01">(.*?\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*?px">)(.*?)(<\/td>.*)<\/div>$/su', $htmlCode, $matches);

cordoprod · January 9, 2010

I tried it, but unfortunately I still get empty arrays.

I tried changing the first part to <div.*> and also tried <div\sid="Tab01">, but still no luck.

cags · January 9, 2010

On that page what information do you actually want?

cordoprod · January 9, 2010

I want output like this:

http://www.cordoproduction.com/x.png

As you can see, Halden is one of the tabs on that page. The tabs are JavaScript-driven, so the content of every tab is in one HTML source.

I want to separate the content of the tabs, because if I parse from the beginning of the page to the end I get the content from all the tabs.

That's why I need to start at <div id="Tab0x"> and end at </div>.

cags · January 9, 2010

I'm sure Regular Expressions aren't the best solution for parsing this HTML, but I'm also sure you've been told that before, so I'm not sure why you put aside XPath. Matching the info in the div will probably be easier if you do two matches: one to isolate the div, then one to pull the rows out of it.

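// Step 1: isolate the contents of the first tab's div.
$div_pattern = '#<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">(.*?)</div>#s';
// Step 2: pull each route number and name out of that fragment.
$info_pattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td><td style="width:360px">([^<]*)</td></tr>#s';
// Only run the second match if the div was actually found.
if (preg_match($div_pattern, $htmlCode, $div)) {
    preg_match_all($info_pattern, $div[1], $out);
    print_r($out);
}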
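
For anyone who wants to try the two-step idea without fetching the page, here is a rough self-contained sketch. The HTML fragment and the route names ("Route A" and so on) are invented for the example and only imitate the structure the patterns above expect. $routes is a hypothetical helper array.

```php
<?php
// Hypothetical fragment shaped like the patterns expect; the live page may differ.
$htmlCode = <<<HTML
<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">
<table><tr><td><a href="../t/01-101.htm">01-101</a></td><td style="width:360px">Route A</td></tr>
<tr><td><a href="../t/01-102.htm">01-102</a></td><td style="width:360px">Route B</td></tr></table>
</div>
<div id="Tab02" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">
<table><tr><td><a href="../t/02-201.htm">02-201</a></td><td style="width:360px">Route C</td></tr></table>
</div>
HTML;

$div_pattern  = '#<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">(.*?)</div>#s';
$info_pattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td><td style="width:360px">([^<]*)</td></tr>#s';

$routes = array();
// Step 1: cut out the first tab. Step 2: match rows inside that cut only.
if (preg_match($div_pattern, $htmlCode, $div)) {
    preg_match_all($info_pattern, $div[1], $rows, PREG_SET_ORDER);
    foreach ($rows as $row) {
        // $row[1] is the route number, $row[2] the text of the name cell.
        $routes[$row[1]] = trim($row[2]);
    }
}
print_r($routes);
```

Only the two Tab01 routes end up in $routes, and the Tab02 row is left out. One thing to keep in mind: the lazy (.*?) stops at the first </div> it meets, so this only works as long as the tab itself contains no nested div.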

cordoprod · January 9, 2010

Excellent! Finally got it working. Thanks so much.

salathe · January 9, 2010

[ot]

I'm sure Regular Expressions aren't the best solution for parsing this HTML

Not particularly. Just in case anyone was wondering, here is one way to parse the required information using the DOM/XPath.

<?php

$url  = 'http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm';

$dom = new DOMDocument;

// HTML will have lots of XML errors, ignore them when loading
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);

// Grab the location (Halden) from the JavaScript
$script   = $dom->getElementsByTagName('script')->item(2)->textContent;
$location = 'Unknown';
if (preg_match('/^\["([^"]+)", "Tab01"/m', $script, $match)) {
    $location = $match[1];
}

// Query the first tab for the routes
$xpath  = new DOMXPath($dom);
$tab    = $xpath->query('//div[@id="Tab01"]')->item(0);
$rows   = $xpath->query('./table[2]/tr/td/table/tr', $tab);
$routes = array();
foreach ($rows as $row) {
    $cells    = $row->getElementsByTagName("td");
    $routes[] = array(
        'number' => $cells->item(1)->textContent,
        'name'   => str_replace("\r\n", "", $cells->item(2)->textContent)
    );
}


// Output routes
header('Content-Type: text/html; charset=utf-8');
?>
<h2><?php echo $location ?></h2>
<?php if (empty($routes)) : ?>
<p>No routes found :-(</p>
<?php else : ?>
<?php foreach ($routes as $route) : ?>
<strong><?php echo $route['number'] ?></strong>
<?php echo $route['name'] ?>
<strong><?php echo $location ?></strong>
<br>
<?php endforeach; ?>
<?php endif; ?>

[Edited to fix super-long HTML line][/ot]
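
If you want to experiment with the DOM/XPath approach offline, the following is a cut-down sketch that loads a string instead of the URL. The markup and route names are invented and much flatter than the real page, so the XPath expression is simplified to match and would need adjusting for the live site.

```php
<?php
// Hypothetical markup; the table nesting on the live page is deeper than this.
$html = '<html><body>'
      . '<div id="Tab01"><table><tr><td>x</td><td>01-101</td><td>Route A</td></tr></table></div>'
      . '<div id="Tab02"><table><tr><td>x</td><td>02-201</td><td>Route C</td></tr></table></div>'
      . '</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
// Scope the query to one tab so rows from other tabs are never visited.
$rows = $xpath->query('//div[@id="Tab01"]//tr');

$routes = array();
foreach ($rows as $row) {
    $cells    = $row->getElementsByTagName('td');
    $routes[] = array(
        'number' => $cells->item(1)->textContent,
        'name'   => trim($cells->item(2)->textContent),
    );
}
print_r($routes);
```

Because the query starts from the div with id Tab01, the row in Tab02 never shows up, which is the same separation the regex version gets by cutting the div out first.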
