DOMDocument or preg_match_all

Andy-H · June 14, 2012

I'm writing a simple templating system for a website I'm building. Basically, I want to retrieve content from files and insert it into a document using jQuery-style selectors. I'm unsure whether to use DOMDocument or regular expressions; which one would be faster for this? I'm leaning towards regex at the moment, but I'm not too clued up on it:


<?php


$tag = 'span';
$pat = <<<PAT
   ~(.*?)<{$tag}.*id\s*=\s*["?|'?]testing["?|'?][^>]*>(.*?)</\s*{$tag}\s*>|\s*>|\s*/>(.*?)~i
PAT;
$htm = <<<HTM
test
<span id="test">Test</span>
<span class="testing" id="testing">Testing</span>
tredst
HTM;
preg_match_all($pat, $htm, $matches);
echo '<pre>'. htmlentities(print_r($matches, 1), ENT_QUOTES, 'UTF-8');


?>

OUTPUT:

Array
(
    [0] => Array
        (
            [0] => >
            [1] => >
            [2] => <span class="testing" id="testing">Testing</span>
        )

    [1] => Array
        (
            [0] => 
            [1] => 
            [2] => 
        )

    [2] => Array
        (
            [0] => 
            [1] => 
            [2] => Testing
        )

    [3] => Array
        (
            [0] => 
            [1] => 
            [2] => 
        )

)

Desired:

Array
(
    [0] => Array
        (
            [0] => test
<span id="test">Test</span>
            [1] => <span class="testing" id="testing">
            [2] => Testing
            [3] => </span>
tredst
        )
)

Any help appreciated.

P.S. Sorry if this should be in regex help, I was unsure as I also wanted advice on whether it was the right decision not to go with DOMDocument.

kicken · June 15, 2012

If you want to parse HTML, you should use DOMDocument. Your HTML will have to be mostly valid for it to work properly, though. Parsing HTML with regex is generally considered a bad idea. It works sometimes, and I'll do it for one-off re-format scripts or small personal-use scrapers, but for something like a template engine you'd be better off with a real HTML parsing solution like DOMDocument.
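
As a rough sketch of what that could look like, assuming the sample markup from the first post: the target span can be located with DOMXPath (part of the same DOM extension, though not mentioned above) instead of a pattern. The XPath expression and variable names here are illustrative only.

```php
<?php
// Sample markup taken from the first post.
$htm = <<<HTM
test
<span id="test">Test</span>
<span class="testing" id="testing">Testing</span>
tredst
HTM;

$doc = new DOMDocument();
// loadHTML() wraps the snippet in <html><body>; @ hides warnings about loose markup.
@$doc->loadHTML($htm);

// Select every <span> whose id attribute is exactly "testing".
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@id="testing"]');

foreach ($nodes as $node) {
   echo $doc->saveHTML($node), "\n"; // the matched element, serialized as HTML
   echo $node->textContent, "\n";    // its text content
}
```

Because the parser builds a tree, nested or repeated spans don't confuse the lookup the way greedy and lazy quantifiers do in a regex.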

Andy-H · June 15, 2012

I've now got:



$tag = 'span';
$pat = <<<PAT
   ~(.*)(<{$tag}.*class\s*=\s*["?|'?]testing["?|'?][^>]*>)(.*?)(</\s*{$tag}\s*>)(.*)~is
PAT;
$htm = <<<HTM
<span>
   <span class="testing" id="test">Test</span>
   <span class="testing" id="testing">Testing</span>
</span>
HTM;
preg_match_all($pat, $htm, $matches, PREG_SET_ORDER);
echo '<pre>'. htmlentities(print_r($matches, 1));

Outputs:


Array
(
    [0] => Array
        (
            [0] => <span>
   <span class="testing" id="test">Test</span>
   <span class="testing" id="testing">Testing</span>
</span>
            [1] => <span>
   <span class="testing" id="test">Test</span>
   
            [2] => <span class="testing" id="testing">
            [3] => Testing
            [4] => </span>
            [5] => 
</span>
        )
)

Desired output:


Array
(
    [0] => Array
        (
            [0] => <span>
   <span class="testing" id="test">Test</span>
   <span class="testing" id="testing">Testing</span>
</span>
            [1] => <span>
   
            [2] => <span class="testing" id="test">
            [3] => Test
            [4] => </span>
            [5] => <span class="testing" id="testing">Testing</span>
</span>
        )
    [1] => Array
        (
            [0] => <span>
   <span class="testing" id="test">Test</span>
   <span class="testing" id="testing">Testing</span>
</span>
            [1] => <span>
   <span class="testing" id="test">Test</span>
   
            [2] => <span class="testing" id="testing">
            [3] => Testing
            [4] => </span>
            [5] => 
</span>
        )
)

Andy-H · June 15, 2012

Ok, so scrap the regex idea, thanks.

I also have another problem, I want to be able to call templates like so:

$Page = (new Template('default', ['site' => 'b2c']))->getContent('slider')->insertAfter('#header');

However, I want getContent to return another object that provides insertAfter, rather than updating a class member to hold the content. That way the insertBefore/insertAfter and appendTo/prependTo methods are only exposed once content has been loaded. Is this the right way to go?

Here's what I have so far.

Template.class.php


namespace phantom\classes\templating;
class Template {
   protected $_pageContent;
   
   public function __construct($template, array $data = array())
   {
      $this->_pageContent = $this->_getContent($template, $data);
   }
   public function getContent($file_name, array $data = array())
   {
      return new Content($this->_getContent($file_name, $data), $this);
   }
   public function querySelector($selector)
   {
      $selector = explode('#', $selector);
      $tag      = $selector[0];
      $match    = $selector[1];
   }
   protected function _getContent($file_name, array $data = array())
   {
      ob_start();
      extract($data, EXTR_SKIP);
      include 'templates'. DIRECTORY_SEPARATOR . $file_name .'.tmpl.php';
      return ob_get_clean();
   }
}

Content.class.php


namespace phantom\classes\templating;
class Content {
   protected $_content;
   protected $_template;
   
   public function __construct($content, Template $tmpl)
   {
      $this->_content  = $content;
      $this->_template = $tmpl;
   }
   public function insertBefore($tag)
   {
      
   }
   public function insertAfter($tag)
   {
      
   }
   public function appendTo($tag)
   {
      
   }
   public function prependTo($tag)
   {
      
   }
}

But now I'm unsure how to update the template contents without exposing public methods to set the content.

Andy-H · June 15, 2012

OK, I now have:

Template.class.php


<?php
namespace phantom\classes\templating;
class Template {
   protected $_document;
   
   public function __construct($tmpl_file, array $data = array(), $ver = '4.01', $enc = 'UTF-8')
   {
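      // Leading backslash: DOMDocument lives in the global namespace, not in phantom\classes\templating
      $this->_document = new \DOMDocument($ver, $enc);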
      $this->_document->loadHTML($this->_getContent($tmpl_file, $data));
   }
   public function getContent($file_name, array $data = array())
   {
      return new Content($this->_getContent($file_name, $data), $this->_document);
   }
   protected function _getContent($file_name, array $data = array())
   {
      ob_start();
      extract($data, EXTR_SKIP);
      include 'templates'. DIRECTORY_SEPARATOR . $file_name .'.tmpl.php';
      return ob_get_clean();
   }
}

Content.class.php


<?php
namespace phantom\classes\templating;
class Content {
   protected $_content;
   protected $_document;
   
   public function __construct($content, \DOMDocument $document)
   {
      $this->_content  = $content;
      $this->_document = $document;
   }
   public function insertBefore($tag)
   {
      $this->_getElement($tag);
   }
   public function insertAfter($tag)
   {
      
   }
   public function appendTo($tag)
   {
      
   }
   public function prependTo($tag)
   {
      
   }
   protected function _getElement($tag)
   {
      if ( substr($tag, 0, 1) == '#' )
         return $this->_document->getElementById(substr($tag, 1));
   }
}

I'm now stuck on how to convert an HTML string into a document fragment. I know you can do this with well-formed XHTML, but I'm using HTML 4.01. Has anyone got any ideas how I could do something along the lines of:

$DOMDocument->loadFragment('<div class="slider"><h1>Tracking vehicles</h1><p>Blah blah blah</p></div>')->insertAfter(DOMNode);

Thanks for any help.
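
One possible approach, sketched under assumptions: DOMDocument has no loadFragment method, so the function htmlToFragment below is a hypothetical helper, not a built-in. It parses the string with loadHTML in a scratch document, then copies the resulting nodes into the target document with importNode. The wrapper div and the sample markup are illustrative.

```php
<?php
/**
 * Hypothetical helper: turn an HTML string (not necessarily well-formed XML)
 * into a DOMDocumentFragment owned by $doc.
 */
function htmlToFragment(DOMDocument $doc, $html)
{
   $tmp = new DOMDocument();
   // Wrap the markup in a <div> so the parsed nodes are easy to find again.
   @$tmp->loadHTML('<div>' . $html . '</div>');
   $wrap = $tmp->getElementsByTagName('div')->item(0);

   $frag = $doc->createDocumentFragment();
   foreach ($wrap->childNodes as $child) {
      // importNode() makes a deep copy that belongs to $doc.
      $frag->appendChild($doc->importNode($child, true));
   }
   return $frag;
}

$doc = new DOMDocument();
@$doc->loadHTML('<html><body><div id="header"></div><div id="footer"></div></body></html>');

// Note the unclosed <br>, which appendXML() would reject.
$frag = htmlToFragment($doc, '<div class="slider"><h1>Tracking vehicles</h1><p>Blah<br>blah</p></div>');

// There is no native insertAfter(); inserting before the next sibling is equivalent.
$header = $doc->getElementById('header');
$header->parentNode->insertBefore($frag, $header->nextSibling);

echo $doc->saveHTML();
```

This only relies on the bundled DOM extension, at the cost of parsing the fragment in a second document.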

trq · June 15, 2012

You might want to take a look at phpQuery. http://code.google.com/p/phpquery/

Andy-H · June 15, 2012

I was looking at that yesterday, but I'd rather do it using SPL if possible, as I might reuse this code in several environments.

trq · June 15, 2012

You've lost me; you're not using anything from SPL.

Andy-H · June 15, 2012

Oh, sorry, I just mean that I only want to use things that come pre-packaged with PHP, i.e. things available on servers where PHP was built with the default configuration.

Andy-H · June 15, 2012

$doc = new DOMDocument('4.01', 'UTF-8');
$doc->loadHTML('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
   <head>
      <title>Phantom - Tracking Ststems and Accessories</title>
      <!-- META //-->
      <meta name="description"
         content="" >
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >
      <!-- LINKS AND SCRIPTS //-->
      <link rel="stylesheet" href="css/layout.css" type="text/css" >
   </head>
   <body>
      <div id="clouds"></div>
      <div id="container">
         <!-- HEADER //-->
         <div id="header">
            <h1>
               <a href="/">
                  <img src="images/layout/b2c/phantom.png" alt="Phantom vehicle tracking and accessories" >
               </a>
            </h1>
            <div id="header_right">
               <ul id="header_nav" class="navigation">
                  <li class="first"><a href="#">About</a></li>
                  <li><a href="#">News</a></li>
               </ul>
               <img class="telephone" src="images/layout/b2c/tel.png" alt="Telephone number" >
            </div>
         </div>
         <!-- SLIDER //-->
         <div id="slider">
         </div>
         <!-- PRODUCT NAVLIGATION IMAGES //-->
         <ul id="product_navigation">
            <li class="first"><a href="#" id="remap">Engine ECU remapping</a></li>
            <li><a href="#" id="tyre-pro">Tyre protector</a></li>
            <li><a href="#" id="sat-dish">Caravan and motorhome satellite dish</a></li>
            <li><a href="#" id="reverse-sensor">Reverse sensor</a></li>
            <li><a href="#" id="in-car-cam">In car camera</a></li>
            <li><a href="#" id="alarms">Caravan and motorhome alarms</a></li>
            <li><a href="#" id="tracking">Caravan and motorhome tracking</a></li>
            <li><a href="#" id="subs">Renew tracking subscription</a></li>
         </ul>
         <!-- CONTENT //-->
         <div id="content">
            <div class="clr"></div>
         </div>
      </div>
      <!-- FOOTER //-->
      <div id="footer">
         <div id="logo"></div>
         <div class="green_banner">
            <div id="motto">Protect, Secure, Enjoy</div>
         </div>
         <div class="blue_banner"></div>
      </div>
   </body>
</html>');
$frag = $doc->createDocumentFragment();
$frag->appendXML('
         <!-- MAIN NAVIGATION //-->
         <ul id="main_nav" class="navigation">
            <li class="first"><a href="#">Home</a></li>
            <li><a href="#">Tracking</a></li>
            <li><a href="#">Remapping</a></li>
            <li><a href="#">Tyre protector</a></li>
            <li><a href="#">Alarms</a></li>
            <li><a href="#">Cameras and sensors</a></li>
            <li><a href="#">Insurance</a></li>
            <li><a href="#">Contact us</a></li>
         </ul>');
$doc->getElementById('container')->insertBefore($frag, $doc->getElementById('slider'));
echo $doc->saveHTML();

This seems to work quite well, as long as I add the /> to non-closing tags in the fragment (appendXML expects well-formed XML). saveHTML still outputs them correctly as HTML.
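
A small sketch of that restriction, using a made-up one-element document (the id "box" and the sample paragraph are illustrative only): appendXML returns false when a void element such as br is left unclosed, and succeeds once it is self-closed.

```php
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<html><body><div id="box"></div></body></html>');

// Not well-formed XML: <br> is never closed, so appendXML() fails.
$bad   = $doc->createDocumentFragment();
$badOk = @$bad->appendXML('<p>one<br>two</p>');

// Well-formed: the void element is self-closed.
$good   = $doc->createDocumentFragment();
$goodOk = $good->appendXML('<p>one<br/>two</p>');

$doc->getElementById('box')->appendChild($good);

var_dump($badOk, $goodOk);
// Serialize just the container to see how the inserted markup is written out.
echo $doc->saveHTML($doc->getElementById('box')), "\n";
```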

Cheers
