cordoprod Posted January 9, 2010

Hi, I'm trying to parse some HTML code. It's a whole web page, and I need to start parsing at a particular tag and stop at the end of that tag. This is an example:

<div id="1"> // start parsing here
blah blah blah blah
</div> // end parsing here

Here is my regex:

$tabeller = preg_match_all('/^<div id="Tab01">(.*\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*px">)(.*)(<\/td>.*)<\/div>$/mu', $htmlCode, $matches);
die(var_dump($matches));

The output is just empty arrays when I try that code. Here is the site I'm trying to do it with: http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm

cags Posted January 9, 2010

At a guess, the HTML source you are attempting to match against contains vertical white space (i.e. newline characters). By default the full stop (.) doesn't match these characters, so you get empty arrays because nothing in the HTML matches your pattern. Try adding the s modifier to fix that problem.

Having said that, because you are using greedy quantifiers you are likely to match a lot more than you want in each match, meaning you'll end up with fewer matches being returned. What I mean is that everywhere you have .* it will keep consuming characters for as long as the rest of the regex can still match afterwards. You are probably going to need to make them lazy (.*?).

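A minimal sketch of the two effects described above, run against made-up markup rather than the real page. The sample string and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample markup with newlines inside each div (not the real page).
$html = "<div id=\"a\">\nfirst\n</div>\n<div id=\"b\">\nsecond\n</div>";

// No s modifier: "." cannot cross a newline, so the pattern never reaches </div>.
$withoutS = preg_match_all('#<div id="\w+">(.*)</div>#', $html, $m1);

// With s, "." matches newlines too, but the greedy .* runs on to the LAST </div>,
// swallowing both divs into a single match.
$greedy = preg_match_all('#<div id="\w+">(.*)</div>#s', $html, $m2);

// The lazy .*? stops at the FIRST </div> it can, giving one match per div.
$lazy = preg_match_all('#<div id="\w+">(.*?)</div>#s', $html, $m3);

var_dump($withoutS, $greedy, $lazy);
```

The three calls differ only in the modifier and the quantifier, so the match counts show each change in isolation.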
cordoprod Posted January 9, 2010 (Author)

Can you please show me how to do this in my code so I can understand it correctly?

cags Posted January 9, 2010

As I'm on my way to bed... Complete guess off the top of my head...

$tabeller = preg_match_all('/^<divid="Tab01">(.*?\d\d\-\d\d\d\.htm">)(\d\d\-\d\d\d)(.*?px">)(.*?)(<\/td>.*)<\/div>$/su',$htmlCode, $matches);

cordoprod Posted January 9, 2010 (Author)

I tried it, but unfortunately I still get empty arrays. I tried changing the first part to <div.*> and also tried <div\sid="Tab01">, but still no luck.

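One likely contributor to the empty result, offered as a guess rather than a diagnosis: without the m modifier, ^ and $ anchor to the start and end of the whole subject string, so a div that sits in the middle of a page can never satisfy them. The sketch below uses a made-up document to show the difference; the sample string is an assumption. The pattern cags posts later in the thread also suggests that the real opening tag carries a style attribute, so a literal `<div id="Tab01">` would not match the page either.

```php
<?php
// Hypothetical sample: the target div sits in the middle of a larger document.
$html = "<html><body>\n<div id=\"Tab01\">\nroute list\n</div>\n</body></html>";

// Anchored, no m modifier: ^ only matches at offset 0, where the string starts
// with "<html>", so the match fails before the div is ever considered.
$anchored = preg_match('/^<div id="Tab01">(.*?)<\/div>$/s', $html, $m1);

// Same pattern without the anchors: the engine is free to find the div anywhere.
$unanchored = preg_match('/<div id="Tab01">(.*?)<\/div>/s', $html, $m2);

var_dump($anchored, $unanchored);
```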
cags Posted January 9, 2010

On that page, what information do you actually want?

cordoprod Posted January 9, 2010 (Author)

I want output like this: http://www.cordoproduction.com/x.png

As you can see, Halden is one of the tabs on that page. The tabs are JavaScript driven, so the content of every tab is in one HTML source. I want to separate the content of the tabs, because if I parse from the beginning of the page to the end I get the content from all tabs at once. That's why I need to start at <div id="Tab0x"> and end at </div>.

cags Posted January 9, 2010

I'm sure Regular Expressions aren't the best solution for parsing this HTML, but I'm also sure you've been told that before, so I'm not sure why you put aside XPath. Matching the info in the div will probably be easier if you do two matches...

$div_pattern = '#<div id="Tab01" style="overflow: auto; overflow-x:hidden; height: 2800px; width:930px">(.*?)</div>#s';
$info_pattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td><td style="width:360px">([^<]*)</td></tr>#s';
preg_match($div_pattern, $input, $out);
preg_match_all($info_pattern, $out[1], $out);
print_r($out);

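The two-step idea above can be wrapped in a small function, sketched here under a few assumptions: `extractRoutes` is a hypothetical helper name, the sample markup is invented to fit the row pattern, and `[^>]*` is swapped in for the hard-coded style string so the match does not depend on exact attribute text. It also checks that the div was found before running the second match.

```php
<?php
// Hypothetical helper: returns [route number => route name] for one tab,
// or an empty array when the tab's div is not found.
function extractRoutes(string $html, string $tabId): array
{
    // Step 1: isolate the tab. [^>]* tolerates any attributes after the id.
    $divPattern = '#<div id="' . preg_quote($tabId, '#') . '"[^>]*>(.*?)</div>#s';
    if (!preg_match($divPattern, $html, $div)) {
        return [];
    }

    // Step 2: pull the rows out of that tab only (row pattern as in the post above).
    $rowPattern = '#<a href="\.\./t/(\d{2}-\d{3})\.htm">\1</a></td>'
        . '<td style="width:360px">([^<]*)</td></tr>#s';
    preg_match_all($rowPattern, $div[1], $rows, PREG_SET_ORDER);

    $routes = [];
    foreach ($rows as $row) {
        $routes[$row[1]] = trim($row[2]);
    }
    return $routes;
}

// Invented sample with two tabs, shaped to fit the row pattern.
$html = '<div id="Tab01" style="width:930px"><table><tr><td>'
    . '<a href="../t/01-101.htm">01-101</a></td>'
    . '<td style="width:360px">Sample route A</td></tr></table></div>'
    . '<div id="Tab02" style="width:930px"><table><tr><td>'
    . '<a href="../t/02-201.htm">02-201</a></td>'
    . '<td style="width:360px">Sample route B</td></tr></table></div>';

$tab1 = extractRoutes($html, 'Tab01');
$missing = extractRoutes($html, 'Tab99');
print_r($tab1);
```

Because the lazy `(.*?)</div>` stops at the first closing tag, this only works while the tab contains no nested div elements.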
cordoprod Posted January 9, 2010 (Author)

Excellent! Finally got it working. Thanks so much.

salathe Posted January 9, 2010

[ot]
cags wrote: "I'm sure Regular Expressions aren't the best solution for parsing this HTML"

Not particularly. Just in case anyone was wondering, here is one way to parse the required information using the DOM/XPath.

<?php
$url = 'http://www.rutebok.no/NRIIISStaticTables/Tables/ruter/index/Avd_01.htm';
$dom = new DOMDocument;

// HTML will have lots of XML errors, ignore them when loading
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);

// Grab the location (Halden) from the JavaScript
$script = $dom->getElementsByTagName('script')->item(2)->textContent;
$location = 'Unknown';
if (preg_match('/^\["([^"]+)", "Tab01"/m', $script, $match)) {
    $location = $match[1];
}

// Query the first tab for the routes
$xpath = new DOMXPath($dom);
$tab = $xpath->query('//div[@id="Tab01"]')->item(0);
$rows = $xpath->query('./table[2]/tr/td/table/tr', $tab);
$routes = array();
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName("td");
    $routes[] = array(
        'number' => $cells->item(1)->textContent,
        'name' => str_replace("\r\n", "", $cells->item(2)->textContent)
    );
}

// Output routes
header('Content-Type: text/html; charset=utf-8');
?>
<h2><?php echo $location ?></h2>
<?php if (empty($routes)) : ?>
<p>No routes found :-(</p>
<?php else : ?>
<?php foreach ($routes as $route) : ?>
<strong><?php echo $route['number'] ?></strong>
<?php echo $route['name'] ?>
<strong><?php echo $location ?></strong>
<br>
<?php endforeach; ?>
<?php endif; ?>

[Edited to fix super-long HTML line]
[/ot]

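One practical difference between the two approaches shows up with nested markup. The sketch below uses an invented string with a div inside the tab (an assumption; the real page may not nest divs this way) and compares what the XPath query and a lazy regex each return.

```php
<?php
// Hypothetical markup: the tab contains a nested div.
$html = '<html><body>'
    . '<div id="Tab01"><div class="inner">nested</div><p>after</p></div>'
    . '<div id="Tab02"><p>other</p></div>'
    . '</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// The DOM knows where the element really ends, nested children included.
$xpath = new DOMXPath($dom);
$tab = $xpath->query('//div[@id="Tab01"]')->item(0);
$domText = $tab->textContent;

// The lazy regex stops at the first </div>, which belongs to the inner div.
preg_match('#<div id="Tab01">(.*?)</div>#s', $html, $m);
$regexCapture = $m[1];

var_dump($domText, $regexCapture);
```

The regex capture ends at the inner closing tag and loses the paragraph that follows it, while the DOM query returns the whole tab.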