Regex for HTML

Flinch · March 9, 2006

Hey all, long time no post.

So anyway, I'm writing a function that will parse user input from text forms based on a set of instructions passed to it in a multi-dimensional array. Right now, I'm working on a section that will let me specify which HTML tags are allowed through the array ($instruct), and parse them accordingly. The function starts off by checking what HTML is allowed, and replaces the < & > of any allowed tags with [[ & ]], so that my next section, which converts the < & > of non-allowed HTML tags into &lt; & &gt;, will not touch the wanted HTML. I'm not very good at explaining, so here's an example:

[code]//Wanted HTML: <br>
$text = "Here is a new line!<br>";

//my function will find any occurrences of <br> and replace them with [[br]].
$return = parser($text, $instruct);

echo $return; //will return the same string, with the <br> tag intact, and the < & > of any other tags replaced with &lt; & &gt;.
[/code]

I think that demonstrates what I'm trying to do. So far, I haven't had too many problems, but I fear that the regular expression I wrote to do this checking is a little sub-par, and may not work.

Here's my regular expression being used in a foreach loop.

[code](<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)[/code]

The ".$k." parts are there because all the allowed tags are split into another array, which is cycled through, replacing matches in the input text where needed. So for the <br> in our example above, it would look like this:

[code](<|</)(\s)*br(\s)*([^>br].+?)?(>|/>)[/code]

This seems to work like I want, but when it comes to closing tags (</td></tr>) they just get replaced with [[td]] and [[tr]] instead of [[/td]] and [[/tr]]. I'm wondering if anyone has any help or suggestions I could use to tweak this regex and make my script work. It's been troubling me for a few days now.

Here's the concerned area of the script I'm talking about:

[code]
//use the normal regex
 if(preg_match("#(<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)#im", $txt, $matches)) {


 /*+------------THIS IS VERY IMPORTANT Y'ALL!-------------+
 + First and fifth elements are the < and > respectively
 + Second and third will ALWAYS be spaces
 + Fourth will be any markup inside the tag
 +------------------------------------------------------+*/

 //now, begin the replacement technique.
 print_r($matches);
 if(preg_match("#/#is", $matches[1]) ) {
 $add_in1 = "/";
 }

 $add_in2 = ($matches[4] != "") ? " ".$matches[4] : "";

 $this_tag = "[[".$add_in1.$k.$add_in2."]]";

 $txt = preg_replace("#(<|</)(\s)*".$k."(\s)*([^>".$k."].+?)?(>|/>)#im", $this_tag, $txt);
 unset($this_tag, $add_in1, $add_in2);

 }
[/code]

Thanks.

wickning1 · March 9, 2006

I just did something very similar. Here's what I wrote. It's not well tested but should be working.

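[code]<?php
// Each entry is a regex fragment, so 'h\d' covers h1 through h6.
$tags = array('b', 'i', 'h\d');

// Swap the < and > of allowed tags (opening or closing) for [ and ].
function preserveHTML($text, $tags) {
    foreach ($tags as $tag) {
        // $1 is the optional slash, $2 the tag name plus any attributes.
        $text = preg_replace('/<(\/)?(' . $tag . '(\s.*?)?)>/i', '[$1$2]', $text);
    }
    return $text;
}

// Reverse of preserveHTML(): turn [tag] and [/tag] back into <tag> and </tag>.
function restoreHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/\[(\/)?(' . $tag . '(\s.*?)?)\]/i', '<$1$2>', $text);
    }
    return $text;
}
?>[/code]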
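For what it's worth, here's a rough sketch of how the two functions might be chained together, assuming the step in between is a plain htmlspecialchars() call. escapeWithAllowedTags() is just a made-up name for illustration, and the two helpers are repeated (taking $tags as a parameter, without the global line) so the snippet runs on its own.

```php
<?php
// Sketch only: same patterns as above; $tags is passed in as a parameter.
function preserveHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/<(\/)?(' . $tag . '(\s.*?)?)>/i', '[$1$2]', $text);
    }
    return $text;
}

function restoreHTML($text, $tags) {
    foreach ($tags as $tag) {
        $text = preg_replace('/\[(\/)?(' . $tag . '(\s.*?)?)\]/i', '<$1$2>', $text);
    }
    return $text;
}

// Hypothetical wrapper: protect allowed tags, escape the rest, then restore.
function escapeWithAllowedTags($text, $tags) {
    $text = preserveHTML($text, $tags);  // <b> becomes [b]
    $text = htmlspecialchars($text);     // remaining < and > become &lt; and &gt;
    return restoreHTML($text, $tags);    // [b] becomes <b> again
}

$tags = array('b', 'i', 'h\d');
echo escapeWithAllowedTags('<b>bold</b> and <u>underlined</u>', $tags);
// prints: <b>bold</b> and &lt;u&gt;underlined&lt;/u&gt;
```

Because the optional slash is captured in the pattern itself, closing tags keep their slash on every match. Two caveats with this sketch: htmlspecialchars() also encodes double quotes, so attributes inside preserved tags would come back with &quot; in them, and anything the user types as [b] gets turned into <b> on the way out.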
