
Closing all html tags


tijmenamsing


Hello,

 

I'm working on a page where users can add articles by writing text in textareas with a WYSIWYG editor. When they submit the form it's saved in a database.

As a summary of the article I grab the first 800 characters, but as you can imagine, the summary may contain HTML tags like <div> or <span> that are never closed. To stop this from ruining my page layout when the articles are posted on the website I could use strip_tags, but I'd like to keep the formatting, and it would also delete images.

I couldn't think of a better solution than a function that checks for open tags and, if it finds any, adds the closing tags at the end of the summary.

I made a similar function a while back, but that one only checked for <div> and <span>, as those are the worst. The nasty part is that I accidentally deleted that function, and I can't fully remember how I wrote it. :facepalm:

 

So what I would like is a function that checks for all unclosed HTML tags and adds the matching closing tags, in the right order, at the end of the summary. Any help getting on the right track is appreciated.


That's the way my function worked, as far as I can remember. I called it like tags("<div>", "</div>", $summary), the parameters being (opening tag, closing tag, haystack).

I used to call it only for div and span, but now I would like a function that checks for all occurring opening tags automatically.
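
For reference, a per-tag helper along those lines might look like the sketch below. It is a reconstruction, not the lost function: the name closeTag, the choice to pass a bare tag name instead of the opening and closing strings, and the regular expressions are all assumptions (as is PHP 7 or later, for the type declarations).

```php
<?php
// Hypothetical reconstruction: append the missing closing tags for ONE tag name.
function closeTag(string $tagName, string $haystack): string
{
    $name = preg_quote($tagName, '#');
    // Opening tags: "<div>" or "<div class=...>", but not "<divider>"
    $opened = preg_match_all('#<' . $name . '(\s[^>]*)?>#i', $haystack);
    // Closing tags: "</div>"
    $closed = preg_match_all('#</' . $name . '\s*>#i', $haystack);
    // Only ever add closers; surplus closing tags are left alone
    $missing = max(0, $opened - $closed);
    return $haystack . str_repeat('</' . $tagName . '>', $missing);
}

echo closeTag('span', closeTag('div', '<div class="a"><span>text'));
// prints <div class="a"><span>text</div></span>
```

The printed result shows the weakness of counting per tag: the closers come out in call order, not nesting order, which is why a version that handles every tag at once needs to remember the order in which tags were opened.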


Since the tags aren't displayed, I wouldn't count them toward the "first 800" characters. I'd build a process that strips every character after the first 800 that isn't part of a tag. You might end up with some empty tags towards the end, but with no content between them they wouldn't affect the display. It's a little late, otherwise I might tinker with this.


OK, I lied. I found this interesting and wanted to give it a shot. This is pretty sloppy, but it works with the testing I did. I'll leave it to you to pare it down as needed and clean it up.

 

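// Cap the visible text at $maxCount characters. Tags are not counted and are
// always copied through, so markup that was balanced in the input stays balanced.
function getTextPart($text, $maxCount=800)
{
    // Split the input into tags and runs of plain text
    preg_match_all('#<[^>]*>|[^<]+#', $text, $matches);
    $output = '';
    $letterCount = 0;
    $maxLength = false; // becomes true once the limit has been reached

    foreach($matches[0] as $line)
    {
        if($line[0]=="<")
        {
            // Tags are kept even after the limit
            $output .= $line;
        }
        elseif(!$maxLength)
        {
            // Add whole words until the next one would exceed the limit
            $space = 0;
            foreach(explode(' ', $line) as $word)
            {
                if(!$maxLength && (strlen($word) + $letterCount + $space) <= $maxCount)
                {
                    if($space) { $output .= ' '; }
                    $output .= $word;
                    $letterCount += strlen($word) + $space;
                    $space = 1;
                }
                else
                {
                    $maxLength = true;
                }
            }
        }
    }
    return $output;
}

// $input holds the article HTML
echo getTextPart($input);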


Here's my version

 

<?php

$void_tags = array_fill_keys(array(
'area','base','br','col','command','embed','hr','img',
'input','keygen','link','meta','param','source'
),'');

$content = <<<HEREDOC
This thing of text <b><i>will have some</i>
missing tags. Others will be<br>complete. Still
others <input type="text" value="won't need to be">
completed, and shouldn't close at the end. <div>
<img src="foobar.jpg" alt="baz" /> A tag will
even be closed without being opened</table>
HEREDOC;


// RegEx solution
$pattern = '#<(/?)([a-z]+)[^>]*>#i';
// Capture 1 will be empty for opening tag, '/' for closing tags
// Capture 2 will be the type of tag

// preg_match_all() returns false only on a real error; 0 just means no tags
if( preg_match_all($pattern, $content, $matches, PREG_SET_ORDER) === false )
die( 'RegEx error' );

// This will hold the counts of each tag.
$counts = array();
// This will hold a string of the tags to prepend
$before = '';
// This will hold a string of the tags to append
$after = '';

foreach( $matches as $match ) {
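// Tag names are case-insensitive, so normalise before any lookups
$match[2] = strtolower($match[2]);
// Void tags never take a closing tag, so skip them
if( isset($void_tags[$match[2]]) )
	continue;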
// Check if this is a closing tag
if( $match[1] == '/' ) {
	if( isset($counts[$match[2]]) ) $counts[$match[2]]--;
	// If this happens, someone has closed a tag before opening one.
	else {
		$before .= '<'.$match[2].'>';
		$counts[$match[2]] = 0;
	}
// This must be an opening tag
} else {
	if( isset($counts[$match[2]]) ) $counts[$match[2]]++;
	else $counts[$match[2]] = 1;
}
}

// Now we should have an array containing tags for keys, and integers for
// values. If a tag's value is 0, there are as many opening tags as closing tags
// If it is negative, there are that many missing opening tags; positive, missing
// closing tags.

echo '<h3>Preview of $counts array</h3><pre>';
print_r( $counts );
echo '</pre>';

foreach( $counts as $tag=>$count ) {
while( $count > 0 ) {
	$after .= '</'.$tag.'>';
	$count--;
}
while( $count < 0 ) {
	$before .= '<'.$tag.'>';
	$count++;
}
}

echo '<h3>Contents</h3><pre>';
echo htmlspecialchars( $before.$content.$after );
echo '</pre>';


?>

 

The only downside is that the closing tags are appended grouped by tag name, not in the order the tags were nested. Getting the order right is possible, though, with a little more effort.
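
One way to get the order right is to track open tags on a stack and not in a per-tag counter. The sketch below is only an illustration of that idea: the function name closeOpenTags, the void-tag list, and the decision to leave stray closing tags untouched are all assumptions, and it assumes PHP 7 or later. Like the regex above, it will be confused by a '>' inside an attribute value.

```php
<?php
// Sketch: append closing tags for everything still open, innermost first.
function closeOpenTags(string $html): string
{
    // Elements that never take a closing tag (this list is an assumption)
    $voidTags = array_fill_keys(array(
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
        'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
    ), true);

    // Capture 1 is '/' for closing tags, capture 2 is the tag name
    preg_match_all('#<(/?)([a-z][a-z0-9]*)[^>]*>#i', $html, $matches, PREG_SET_ORDER);

    $stack = array(); // currently open tags, outermost first
    foreach ($matches as $match) {
        $tag = strtolower($match[2]);
        // Skip void tags and XHTML-style self-closed tags such as <img />
        if (isset($voidTags[$tag]) || substr($match[0], -2) === '/>') {
            continue;
        }
        if ($match[1] === '') {
            $stack[] = $tag; // opening tag
            continue;
        }
        // Closing tag: look for the most recent matching opener
        $index = array_search($tag, array_reverse($stack, true), true);
        if ($index !== false) {
            // Drop it together with anything opened inside it and never closed
            array_splice($stack, $index);
        }
        // A closing tag without an opener is simply ignored here
    }

    foreach (array_reverse($stack) as $tag) {
        $html .= '</' . $tag . '>';
    }
    return $html;
}

echo closeOpenTags('<div><span>text');
// prints <div><span>text</span></div>
```

Dropping the inner tags when their parent closes is a design choice: appending their closers at the very end would put them outside the parent, which is not valid nesting either.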

 

Let me know if you have any questions.


Here is a function I use that closes up tags.  You could adapt it to also count the 800 non-tag characters and use it for extracting your summary.

 


function fixupHtml($html){
        static $noClosers = array(
                'input',
                'br',
                'link',
                'base'
        );
        $tagStack = array();
        $len=strlen($html);

        $pos = 0;
        while (($pos=strpos($html, '<', $pos)) !== false){
                $pos++;

                $isEnding = false;
                $isSelfClose = false;
                $foundTagEnd= false;
                for ($i=$pos; !$foundTagEnd && $i<$len; $i++){
                        $ch = $html[$i];
                        switch ($ch){
                                case "/":
                                        $isEnding = true;
                                        $isSelfClose = !($i==$pos);
                                        break;
                                case '>':
                                case " ":
                                case "\r":
                                case "\n":
                                        $foundTagEnd=true;
                                        break;
                                default:
                        }
                }

                $tagEnd = strpos($html, '>', $pos);
                if ($tagEnd === false){ $foundTagEnd = false;   }

                if ($foundTagEnd){
                        $i--;
                        $tag = substr($html, $pos, $i-($pos));
                        if (!$isEnding){
                                if (!in_array($tag, $noClosers)){
                                        array_push($tagStack, $tag);
                                }
                        }
                        else if (!$isSelfClose){
                                $tag=ltrim($tag, '/');
                                $tslen = count($tagStack);
                                if ($tagStack[$tslen-1] == $tag){
                                        array_pop($tagStack);
                                }
                                else {
                                        //Try and find it earlier in the stack
                                        $found=false;
                                        for ($i=$tslen-1; !$found && $i>=0; $i--){
                                                if ($tagStack[$i] == $tag){
                                                        unset($tagStack[$i]);
                                                        $tagStack = array_values($tagStack);
                                                        $found=true;
                                                }
                                        }

                                        if (!$found){
                                                //Bad end tag found.  Lets remove it.
                                                $tagStart = $pos-1;
                                                $endOfTag = strpos($html, '>', $tagStart);

                                                $html = substr($html, 0, $tagStart).substr($html, $endOfTag+1);
                                        }
                                }
                        }
                }
                else {
                        $html = substr($html, 0, $pos-1);
                }
        }

        while (count($tagStack) > 0){
                $tag = array_pop($tagStack);
                $html .= '</'.$tag.'>';
        }

        return $html;
}
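
The adaptation mentioned above, counting only the non-tag characters while tracking open tags, could look roughly like this. It is a separate sketch and not a modification of fixupHtml: the name summarizeHtml, the void-tag list, and the rule of cutting at the last space are assumptions. It also assumes PHP 7 or later and input that was properly nested before the cut, and it counts bytes (strlen), so multibyte text would need the mb_* functions.

```php
<?php
// Sketch: keep at most $maxCount characters of text, then close open tags.
function summarizeHtml(string $html, int $maxCount = 800): string
{
    $voidTags = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                      'input', 'link', 'meta', 'param', 'source', 'track', 'wbr');
    // Split into tags and text runs, keeping the tags as tokens
    $tokens = preg_split('#(<[^>]*>)#', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $output = '';
    $count = 0;       // text characters emitted so far
    $stack = array(); // tags that are currently open

    foreach ($tokens as $token) {
        $isTag = $token[0] === '<' && substr($token, -1) === '>';
        if (!$isTag) {
            $remaining = $maxCount - $count;
            if (strlen($token) > $remaining) {
                $cut = substr($token, 0, $remaining);
                $space = strrpos($cut, ' ');
                // Back up to the last space unless the cut fell between words
                if ($token[$remaining] !== ' ' && $space !== false) {
                    $cut = substr($cut, 0, $space);
                }
                $output .= $cut;
                break; // limit reached: drop everything after this point
            }
            $output .= $token;
            $count += strlen($token);
            continue;
        }
        $output .= $token;
        if (!preg_match('#^<(/?)([a-z][a-z0-9]*)#i', $token, $m)) {
            continue; // comment, doctype, or something that is not a tag
        }
        $tag = strtolower($m[2]);
        if (in_array($tag, $voidTags, true) || substr($token, -2) === '/>') {
            continue; // never needs a closing tag
        }
        if ($m[1] === '') {
            $stack[] = $tag;
        } elseif (end($stack) === $tag) {
            array_pop($stack);
        }
    }

    while ($stack) {
        $output .= '</' . array_pop($stack) . '>';
    }
    return $output;
}

echo summarizeHtml('<p>Hello <b>big world</b> and more</p>', 9);
// prints <p>Hello <b>big</b></p>
```

Unlike the approach of keeping every tag after the limit, this stops at the cut-off, so images and other void elements that appear later in the article do not leak into the summary.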

 

 


Wow, thanks all for the replies!

 

I don't have the knowledge yet to understand all of the code, but I think kicken's function is most complete and works out best for me.

 

I get one warning though, which occurs when the text contains a closing tag that has no matching opening tag.

 

Warning: strpos() [function.strpos]: Offset not contained in string in #/test.php on line 12

 

line 12 being: while (($pos=strpos($html, '<', $pos)) !== false){

 

Any idea how to fix this? And if it's not too much to ask, could you add some comments to the function?
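
A likely cause: when the function removes an orphaned closing tag, $html gets shorter but $pos and $len are not updated. If the removed tag was at the very end of the string, $pos ends up past the end and the next strpos() call complains. The same can happen after the truncation branch for a tag that is never terminated. Below is a commented rewrite that keeps the same stack approach but rewinds $pos after a removal and stops after a truncation. It is a sketch and not kicken's original code: the parsing is simplified, the longer void-tag list is an assumption, and it assumes PHP 7 or later.

```php
<?php
// Sketch of a corrected fixupHtml(): closes open tags, removes orphaned closers.
function fixupHtml(string $html): string
{
    // Tags that never take a closing tag (extended list is an assumption)
    static $noClosers = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                              'input', 'link', 'meta', 'param', 'source', 'track', 'wbr');
    $tagStack = array(); // tags opened so far and not yet closed
    $pos = 0;

    // Walk the string from one '<' to the next
    while (($pos = strpos($html, '<', $pos)) !== false) {
        $tagStart = $pos;
        $tagEnd = strpos($html, '>', $tagStart);
        if ($tagEnd === false) {
            // Tag was cut off before its '>': drop the fragment and stop
            $html = substr($html, 0, $tagStart);
            break;
        }
        // Everything between '<' and '>', e.g. '/div' or 'img src="x" /'
        $inner = substr($html, $tagStart + 1, $tagEnd - $tagStart - 1);
        $pos = $tagEnd + 1; // continue scanning after this tag

        $isEnding = isset($inner[0]) && $inner[0] === '/';
        $isSelfClose = !$isEnding && substr($inner, -1) === '/';
        if (!preg_match('#^/?([a-z][a-z0-9]*)#i', $inner, $m)) {
            continue; // comment, doctype or a stray '<'
        }
        $tag = strtolower($m[1]);
        if ($isSelfClose || in_array($tag, $noClosers, true)) {
            continue; // nothing to track for these
        }
        if (!$isEnding) {
            $tagStack[] = $tag; // opening tag: remember it
            continue;
        }
        // Closing tag: find the most recent matching opener on the stack
        $index = array_search($tag, array_reverse($tagStack, true), true);
        if ($index !== false) {
            unset($tagStack[$index]);
            $tagStack = array_values($tagStack);
        } else {
            // Orphaned closing tag: remove it from the string ...
            $html = substr($html, 0, $tagStart) . substr($html, $tagEnd + 1);
            // ... and rewind, so $pos never points past the shortened string
            $pos = $tagStart;
        }
    }

    // Close whatever is still open, innermost first
    while ($tagStack) {
        $html .= '</' . array_pop($tagStack) . '>';
    }
    return $html;
}

echo fixupHtml('text</table>');
// prints text
```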


Here is an example. As you can see, HTML Tidy adds the missing closing </b> tag:

 

<?php
$html = <<<HTML
<div>
<b>Hello world
</div>
HTML;
$config = array(
"show-body-only" => true
);

$tidy = new tidy();
$out = $tidy->repairString($html, $config, 'UTF8');

echo htmlentities($out);
?>

 

More config options: http://tidy.sourceforge.net/docs/quickref.html#show-body-only

