Closing all html tags

tijmenamsing · February 22, 2012

Hello,

I'm working on a page where users can add articles by writing text in textareas with a WYSIWYG editor. When they submit the form it's saved in a database.

As a summary of the article I grab the first 800 characters, but as you can imagine, the summary might contain HTML tags like <div> or <span> that are never closed. To prevent this from ruining my page layout when the articles are posted on the website I could use strip_tags, but I'd like to keep the formatting, and strip_tags would also delete images.

I couldn't think of any solution other than a function that checks for open tags and, if it finds any, adds closing tags at the end of the summary.

I already made a similar function a while back, but that one only checks for <div> and <span>, as those are the worst.. The nasty part is that I kind of deleted that function accidentally, and I can't fully remember how I wrote that.. :facepalm:

So what I would like to have is a function that checks for all unclosed HTML tags and adds the associated closing tags, in the right order, at the end of the summary. Any help getting on the right track is appreciated.

scootstah · February 22, 2012

An easy way would be to count all opening tags, and then count all closing tags. If there are fewer closing tags than opening tags, add as many closing tags as you need.

It might screw up the layout of what they posted but at least it will be confined to that area.

tijmenamsing · February 22, 2012

That's the way my function worked as far as I can remember. I called it like this: tags("<div>", "</div>", $summary), the parameters being (opening tag, closing tag, haystack).

I used to call it only for div and span, but now I would like a function that checks for all occurring opening tags automatically.
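For reference, the helper described above could be reconstructed roughly as follows. This is only a sketch: the original function was lost, so the name closeTag and its two-argument signature are assumptions, and it takes a tag name rather than the literal "<div>" / "</div>" strings so that opening tags with attributes are counted too.

```php
<?php
// Hypothetical reconstruction of the lost helper, not the original code.
// Counts opening and closing tags of ONE tag name and appends the missing closers.
function closeTag($tagName, $haystack)
{
    $name = preg_quote($tagName, '#');
    // Opening tags: "<div>" as well as "<div class=...>"
    $opened = preg_match_all('#<' . $name . '(\s[^>]*)?>#i', $haystack);
    // Closing tags: "</div>"
    $closed = preg_match_all('#</' . $name . '\s*>#i', $haystack);
    // Never append a negative number of closers
    $missing = max(0, $opened - $closed);
    return $haystack . str_repeat('</' . $tagName . '>', $missing);
}

echo closeTag('div', '<div class="a"><span>text');
// prints: <div class="a"><span>text</div>
```

Note that it has to be called once per tag name, and that the closers are simply appended, so nested tags are not necessarily closed in the right order.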

Psycho · February 22, 2012

Since the tags aren't displayed I wouldn't consider them part of the "first 800" characters. So, I'd build a process to strip out all the characters after the first 800 that aren't tags. So, you might have some empty tags towards the end but since there would be no content between them they wouldn't do anything to the display. It's a little late, otherwise I might tinker with this.

Psycho · February 22, 2012

OK, I lied. I found this interesting and wanted to give it a shot. This is pretty sloppy, but it works with the testing I did. I'll leave it to you to pare it down as needed and clean it up.

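function getTextPart($text, $maxCount=800)
{
    // Split the input into tokens: each match is either a whole tag or a run of text
    preg_match_all('#<[^>]*>|[^<]+#', $text, $matches);
    $output = '';
    $letterCount = 0;    // visible (non-tag) characters emitted so far
    $maxLength = false;  // becomes true once the limit has been reached

    foreach($matches[0] as $line)
    {
        // Tags are always copied, even past the limit; text is copied word by word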
        if($line[0]=="<")
        {
            $output .= $line;
        }
        elseif(!$maxLength)
        {
            $space = 0;
            foreach(explode(' ', $line) as $word)
            {
                if(!$maxLength && (strlen($word) + $letterCount + $space) <= $maxCount)
                {
                    if($space) { $output .= ' '; }
                    $output .= $word;
                    $letterCount += strlen($word) + $space;
                    $space = 1;
                }
                else
                {
                    $maxLength = true;
                }
            }
        }
    }
    return $output;
}

echo getTextPart($input);

xyph · February 22, 2012

Here's my version

<?php

$void_tags = array_fill_keys(array(
'area','base','br','col','command','embed','hr','img',
'input','keygen','link','meta','param','source'
),'');

$content = <<<HEREDOC
This thing of text <b><i>will have some</i>
missing tags. Others will be<br>complete. Still
others <input type="text" value="won't need to be">
completed, and shouldn't close at the end. <div>
<img src="foobar.jpg" alt="baz" /> A tag will
even be closed without being opened</table>
HEREDOC;


// RegEx solution
$pattern = '#<(/?)([a-z]+)[^>]*>#i';
// Capture 1 will be empty for opening tag, '/' for closing tags
// Capture 2 will be the type of tag

// preg_match_all() returns 0 when there are simply no tags, false only on error
if( preg_match_all($pattern, $content, $matches, PREG_SET_ORDER) === false )
die( 'RegEx error' );

// This will hold the counts of each tag.
$counts = array();
// This will hold a string of the tags to prepend
$before = '';
// This will hold a string of the tags to append
$after = '';

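foreach( $matches as $match ) {
// The pattern is case-insensitive, so normalise the tag name first
$match[2] = strtolower($match[2]);
// Skip void tags such as <br> or <img>; they never take a closing tag
if( isset($void_tags[$match[2]]) )
	continue;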
// Check if this is a closing tag
if( $match[1] == '/' ) {
	if( isset($counts[$match[2]]) ) $counts[$match[2]]--;
	// If this happens, someone has closed a tag before opening one.
	else {
		$before .= '<'.$match[2].'>';
		$counts[$match[2]] = 0;
	}
// This must be an opening tag
} else {
	if( isset($counts[$match[2]]) ) $counts[$match[2]]++;
	else $counts[$match[2]] = 1;
}
}

// Now we should have an array containing tags for keys, and integers for
// values. If a tag's value is 0, there are as many opening tags as closing tags
// If it is negative, there are that many missing opening tags; positive, missing
// closing tags.

echo '<h3>Preview of $counts array</h3><pre>';
print_r( $counts );
echo '</pre>';

foreach( $counts as $tag=>$count ) {
while( $count > 0 ) {
	$after .= '</'.$tag.'>';
	$count--;
}
while( $count < 0 ) {
	$before .= '<'.$tag.'>';
	$count++;
}
}

echo '<h3>Contents</h3><pre>';
echo htmlspecialchars( $before.$content.$after );
echo '</pre>';


?>

The only downside is that the closing tags aren't added in the order the tags were opened. Getting that right is possible, with a little more effort.
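One way to get the order right is to replace the per-tag counters with a stack. The following is a minimal sketch under a few assumptions: the function name closeOpenTags is made up for illustration, the void-tag list is a shortened version of the one above, and stray closing tags are left in place rather than repaired.

```php
<?php
// Sketch only: regex-based, so it does not understand comments or <script> bodies.
function closeOpenTags($html)
{
    $voidTags = array('area','base','br','col','embed','hr','img',
                      'input','link','meta','param','source');
    $stack = array(); // names of the tags that are currently open, innermost last

    // Capture 1: '/' for closing tags, 2: tag name, 3: '/' for self-closed tags
    preg_match_all('#<(/?)([a-z][a-z0-9]*)[^>]*?(/?)>#i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $tag = strtolower($m[2]);
        if (in_array($tag, $voidTags) || $m[3] === '/') {
            continue; // void or self-closed: nothing to track
        }
        if ($m[1] === '/') {
            // Closing tag: drop the innermost matching opener, if there is one
            $idx = array_search($tag, array_reverse($stack, true), true);
            if ($idx !== false) {
                array_splice($stack, $idx, 1);
            }
        } else {
            $stack[] = $tag;
        }
    }

    // Whatever is left is unclosed; close the innermost tag first
    $closers = '';
    while ($stack) {
        $closers .= '</' . array_pop($stack) . '>';
    }
    return $html . $closers;
}

echo closeOpenTags('This <b><i>will have some</i> text<br> <div><img src="a.jpg" />');
// prints: This <b><i>will have some</i> text<br> <div><img src="a.jpg" /></div></b>
```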

Let me know if you have any questions.

kicken · February 22, 2012

Here is a function I use that closes up tags. You could adapt it to also count the 800 non-tag characters and use it for extracting your summary.


function fixupHtml($html){
        //Void elements: these never get a closing tag
        static $noClosers = array(
                'input',
                'br',
                'link',
                'base',
                'img',
                'hr',
                'meta'
        );
        $tagStack = array();
        $len=strlen($html);

        $pos = 0;
        while (($pos=strpos($html, '<', $pos)) !== false){
                $pos++;

                $isEnding = false;
                $isSelfClose = false;
                $foundTagEnd= false;
                for ($i=$pos; !$foundTagEnd && $i<$len; $i++){
                        $ch = $html[$i];
                        switch ($ch){
                                case "/":
                                        $isEnding = true;
                                        $isSelfClose = !($i==$pos);
                                        break;
                                case '>':
                                case " ":
                                case "\r":
                                case "\n":
                                        $foundTagEnd=true;
                                        break;
                                default:
                        }
                }

                $tagEnd = strpos($html, '>', $pos);
                if ($tagEnd === false){ $foundTagEnd = false;   }

                if ($foundTagEnd){
                        $i--;
                        $tag = substr($html, $pos, $i-($pos));
                        if (!$isEnding){
                                if (!in_array($tag, $noClosers)){
                                        array_push($tagStack, $tag);
                                }
                        }
                        else if (!$isSelfClose){
                                $tag=ltrim($tag, '/');
                                $tslen = count($tagStack);
                                //Guard against an empty stack (end tag with nothing open)
                                if ($tslen > 0 && $tagStack[$tslen-1] == $tag){
                                        array_pop($tagStack);
                                }
                                else {
                                        //Try and find it earlier in the stack
                                        $found=false;
                                        for ($i=$tslen-1; !$found && $i>=0; $i--){
                                                if ($tagStack[$i] == $tag){
                                                        unset($tagStack[$i]);
                                                        $tagStack = array_values($tagStack);
                                                        $found=true;
                                                }
                                        }

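                                        if (!$found){
                                                //Bad end tag found.  Lets remove it.
                                                $tagStart = $pos-1;
                                                $endOfTag = strpos($html, '>', $tagStart);

                                                $html = substr($html, 0, $tagStart).substr($html, $endOfTag+1);
                                                //The string just got shorter: rescan from where the tag was
                                                //and refresh the cached length so strpos() stays in range.
                                                $pos = $tagStart;
                                                $len = strlen($html);
                                        }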
                                }
                        }
                }
                else {
                        //No closing '>' for this tag: cut the partial tag off and stop scanning
                        $html = substr($html, 0, $pos-1);
                        break;
                }
        }

        while (count($tagStack) > 0){
                $tag = array_pop($tagStack);
                $html .= '</'.$tag.'>';
        }

        return $html;
}
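
The adaptation suggested above (counting only non-tag characters while tracking open tags) might look something like the sketch below. It is a separate, simplified function rather than a modification of fixupHtml; the name summarizeHtml is hypothetical, it counts bytes with strlen() (so multibyte text or entities like &amp; can be cut in the middle), and it assumes the input is the complete article rather than an already truncated string.

```php
<?php
// Sketch only: keep the first $maxChars non-tag characters, then close open tags.
function summarizeHtml($html, $maxChars = 800)
{
    $voidTags = array('area','base','br','col','embed','hr','img',
                      'input','link','meta','param','source');
    $stack = array(); // currently open tags, innermost last
    $out   = '';
    $count = 0;       // non-tag characters copied so far

    // Split into alternating text runs and tags, keeping the tags as tokens
    $tokens = preg_split('#(<[^>]*>)#', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        if (!preg_match('#^<[^>]*>$#', $token)) {
            // Text run: copy as much as still fits, stop once the limit is hit
            $remaining = $maxChars - $count;
            if (strlen($token) >= $remaining) {
                $out .= substr($token, 0, $remaining);
                break;
            }
            $out   .= $token;
            $count += strlen($token);
            continue;
        }

        // Tag: always copied, never counted
        $out .= $token;
        if (!preg_match('#^<(/?)([a-z][a-z0-9]*)[^>]*?(/?)>$#i', $token, $m)) {
            continue; // comment, doctype, etc.
        }
        $tag = strtolower($m[2]);
        if (in_array($tag, $voidTags) || $m[3] === '/') {
            continue; // void or self-closed
        }
        if ($m[1] === '/') {
            // Closing tag: drop the innermost matching opener, if any
            $idx = array_search($tag, array_reverse($stack, true), true);
            if ($idx !== false) {
                array_splice($stack, $idx, 1);
            }
        } else {
            $stack[] = $tag;
        }
    }

    // Close whatever is still open, innermost first
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo summarizeHtml('<div><b>Hello world</b> and more</div>', 5);
// prints: <div><b>Hello</b></div>
```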

tijmenamsing · February 22, 2012

wow thanks all for the replies!

I don't have the knowledge yet to understand all of the code, but I think kicken's function is most complete and works out best for me.

I get one warning though, which occurs when the input contains a closing tag that has no matching opening tag.

Warning: strpos() [function.strpos]: Offset not contained in string in #/test.php on line 12

line 12 being: while (($pos=strpos($html, '<', $pos)) !== false){

Any idea how to fix this? And if it's not too much to ask, could you add some comments to the function?

The Little Guy · February 22, 2012

If you have tidy installed, you should use that:

tidy_get_output

The Little Guy · February 22, 2012

Here is an example. As you can see, HTML Tidy adds the missing closing </b> tag:

<?php
$html = <<<HTML
<div>
<b>Hello world
</div>
HTML;
$config = array(
"show-body-only" => true
);

$tidy = new tidy();
$out = $tidy->repairString($html, $config, 'UTF8');

echo htmlentities($out);
?>

More config options: http://tidy.sourceforge.net/docs/quickref.html#show-body-only
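
Applied to the summary problem from the first post, this could be wrapped up roughly as follows. It is only a sketch: tidySummary is a made-up name, the 800-byte cut is a plain substr() that may land inside a tag (Tidy is generally tolerant of that, but the exact repair is up to Tidy), and the function returns null when the tidy extension is not loaded.

```php
<?php
// Sketch only: needs the tidy extension; returns null when it is missing.
function tidySummary($html, $maxBytes = 800)
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    // A raw cut can leave tags open; Tidy is asked to repair the fragment
    $fragment = substr($html, 0, $maxBytes);
    // show-body-only: return just the repaired fragment, without <html>/<body>
    $config = array('show-body-only' => true);
    return tidy_repair_string($fragment, $config, 'utf8');
}

$summary = tidySummary('<div><b>Hello world and more text</b></div>', 20);
echo htmlentities((string) $summary);
```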

tijmenamsing · February 22, 2012

Neat, I'd never heard of it, and it works like a charm without installing anything new.

I have no idea how it works though ;p
