Regex for HTML


1 reply to this topic

#1 Flinch


Posted 09 March 2006 - 03:43 PM

Hey all, long time no post.

So anyway, I'm writing a function that will parse user input from text forms based on a set of instructions passed to it in a multi-dimensional array. Right now, I'm working on a section that will allow me to specify which HTML tags are allowed through the array ($instruct), and parse them accordingly. The function starts off by checking what HTML is allowed, and replaces the < and > of any allowed tags with [[ and ]], so that my next section, which converts the < and > of non-allowed HTML tags into &lt; and &gt;, will not touch the wanted HTML. I'm not very good at explaining, so here's an example:

//Wanted HTML: <br>
$text = "Here is a <br> <b>new</b> line!";

//my function will find any occurrences of <br> and replace them with [[br]].
$return = parser($text, $instruct);

echo $return; //prints the same string, with the <br> tag intact, and the <b> & </b> tags replaced with &lt;b&gt; & &lt;/b&gt;.

I think that demonstrates what I'm trying to do. So far, I haven't had too much problem, but I fear that the regular expression I wrote to do this checking is a little sub-par, and may not work.

Here's my regular expression being used in a foreach loop.

(<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)

The ".$k." parts are there because all the allowed tags are broken out into another array and cycled through, replacing matches in the input text where needed. So for the <br> in our example above, it would look like this:

(<|</)(\s)*br(\s)*([^>br].+?)?(>|/>)

This seems to work like I want, but when it comes to closing tags (</td></tr>) they just get replaced with [[td]] and [[tr]] instead of [[/td]] and [[/tr]]. I'm wondering if anyone has any help or suggestions that I could use to tweak this regex and make my script work. It's been troubling me for a few days now.

Here's the concerned area of the script I'm talking about:

//use the normal regex
if(preg_match("#(<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)#im", $txt, $matches)) {

    /*+------------THIS IS VERY IMPORTANT Y'ALL!-------------+
      + First and fifth elements are the < and > respectively
      + Second and third will ALWAYS be spaces
      + Fourth will be any markup inside the tag
      +------------------------------------------------------+*/

    //now, begin the replacement technique.
    print_r($matches);
    if(preg_match("#/#is", $matches[1]) ) {
        $add_in1 = "/";
    }

    $add_in2 = ($matches[4] != "") ? " ".$matches[4] : "";

    $this_tag = "[[".$add_in1.$k.$add_in2."]]";

    $txt = preg_replace("#(<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)#im", $this_tag, $txt);
    unset($this_tag, $add_in1, $add_in2);

}
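
A stripped-down run of the same logic reproduces the symptom. This is only a sketch: replaceAllowed() is a hypothetical wrapper that is not part of the script above, it handles a single allowed tag, and $add_in1 is initialised to an empty string so the snippet runs without warnings. It shows that preg_match() inspects only the first occurrence, while preg_replace() then applies that one replacement string to every occurrence.

```php
<?php
// Hypothetical wrapper around the logic above, for one allowed tag $k.
function replaceAllowed($txt, $k) {
    $pattern = "#(<|</)(\s)*" . $k . "(\s)*([^>" . $k . "].+?)?(>|/>)#im";

    if (preg_match($pattern, $txt, $matches)) {
        // Only the FIRST match in $txt is looked at here.
        $add_in1 = (strpos($matches[1], "/") !== false) ? "/" : "";
        $add_in2 = ($matches[4] != "") ? " " . $matches[4] : "";
        $this_tag = "[[" . $add_in1 . $k . $add_in2 . "]]";

        // The same fixed string then replaces EVERY match,
        // opening and closing tags alike.
        $txt = preg_replace($pattern, $this_tag, $txt);
    }
    return $txt;
}

echo replaceAllowed("<td>cell</td>", "td");
// prints: [[td]]cell[[td]]
```

The first match is the opening <td>, so the closing </td> ends up with the same [[td]] replacement.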

Thanks.

#2 wickning1


Posted 09 March 2006 - 06:56 PM

I just did something very similar. Here's what I wrote. It's not well tested but should be working.

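<?php
// Allowed tags, written as regex fragments ('h\d' matches h plus a digit, e.g. h1).
$tags = array('b', 'i', 'h\d');

// Swap the < and > of allowed tags for [ and ] so a later escaping step leaves them alone.
function preserveHTML($text, $tags) {
    foreach ($tags as $tag) {
        // $1 is the optional closing slash, $2 the tag name plus any attributes.
        $text = preg_replace('/<(\/)?(' . $tag . '(\s.*?)?)>/i', '[$1$2]', $text);
    }
    return $text;
}

// Reverse of preserveHTML(): turn the [ ] placeholders back into real tags.
function restoreHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/\[(\/)?(' . $tag . '(\s.*?)?)\]/i', '<$1$2>', $text);
    }
    return $text;
}
?>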
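
The two functions are meant to wrap an escaping step. Here is a rough sketch of how they might be chained. It assumes that step is htmlspecialchars(), which the post does not specify, and filterHTML() is a hypothetical helper name. The two functions are repeated so the snippet runs on its own.

```php
<?php
// Repeated from the post above so this snippet is self-contained.
function preserveHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/<(\/)?(' . $tag . '(\s.*?)?)>/i', '[$1$2]', $text);
    }
    return $text;
}

function restoreHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/\[(\/)?(' . $tag . '(\s.*?)?)\]/i', '<$1$2>', $text);
    }
    return $text;
}

// Hypothetical helper (not in the original post) chaining the three steps.
function filterHTML($text, $tags) {
    // 1. Hide the allowed tags behind [ ] placeholders.
    $text = preserveHTML($text, $tags);
    // 2. Assumption: escape everything else with htmlspecialchars().
    //    ENT_NOQUOTES leaves quotes alone so attributes in preserved tags survive.
    $text = htmlspecialchars($text, ENT_NOQUOTES);
    // 3. Turn the placeholders back into real tags.
    return restoreHTML($text, $tags);
}

echo filterHTML('Here is a <br> <b>new</b> line!', array('br'));
// prints: Here is a <br> &lt;b&gt;new&lt;/b&gt; line!
```

One caveat with single-bracket placeholders: text the user typed literally as [br] would also come back as <br>. That is harmless for tags that are allowed anyway, but worth knowing.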




