Fetching Contents - Issue With substr()

Mko · February 4, 2013

Hello all,

I've recently been writing a script for the homepage that displays certain threads from certain forum categories.

My current SQL query and fetching the contents work well, except I encounter an odd issue when using the substring method on the fetched contents to limit the characters displayed.

Just so you're aware, I'm parsing the contents of the thread's post through vBulletin's BBCodeParser, yet that's not the issue.

Here's a bit of background regarding my code/issue.

Current Code (only included the important stuff):

$parsed_text = $parser->do_parse($body);

$message_pre = substr($parsed_text, 0, 500);
$message = substr($message_pre, 0, strrpos($message_pre, ' '));

echo '<div id="a1">';
echo '<div class="b">';
echo $message.'...';
echo '<div class="c"></div>';
echo '<div class="d">[<a href="">Read More...</a>]';
echo '</div>';
echo '<div class="e"></div></div></div>';

So, that's all fine. However, let's get some example database contents:

[b]bold[/b] [i]italic[/i] [u]underline[/u] 
[center] center [/center]
 
[left]left [/left]
 
[right]right [/right]
[url="http://google.com"]google.com[/url] [url="http://google.com"]url1[/url] [url="http://google.com"]url2[/url] [email="1@2.com"]1@2.com[/email] [email=1@2.com]1@2.com2[/email] [img=http://google.com] [size=4]yo[/size] [size="4"]yo2[/size] [font="Book Antiqua"]test[/font] [font=Book Antiqua]test2[/font] [color="Red"]hey[/color] [color="#0048C0"]hey2[/color] [list] [*]hello [*]world [/list] [list=1] [*]list2 [*]list2_1 [/list]

Now, the BBCodeParser successfully parses the BBCode like it should and spits back some HTML, which I store inside the $parsed_text variable.

However, I have an odd issue with the $message variable. Some of the HTML that is parsed seems to not terminate correctly, thus messing up my style.

Here's an example of the issue in action (HTML output):

<b>bold</b><br />
<i>italic</i><br />
<u>underline</u><br />
<div align="[url=""]center[/url]"> center<br />
</div><div align="[url=""]left[/url]">left<br />
</div><div align="[url=""]right[/url]">right<br />
</div><a href="[url="view-source:http://google.com/"]http://google.com[/url]" target="[url=""]_blank[/url]">google.com</a><br />
<a href="[url="view-source:http://google.com/"]http://google.com[/url]" target="[url=""]_blank[/url]">url1</a><br />
<a href="[url="view-source:http://google.com/"]http://google.com[/url]" target="[url=""]_blank[/url]">url2</a><br />
<a href="[email="mark@mko.com"]mailto:1@2.com[/email]">1@2.com</a><br />
<a href="[email="mark@mko.com"]mailto:1@2.com[/email]">1@2.com2</a><br />
<img...<div class="[url=""]clear[/url]"></div><div class="[url=""]news_bottom[/url]">[<a href="">Read More...</a>]</div>

As you can most likely see, the contents of $message end with <img, because of the space before the src in <img src.
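The effect can be reproduced in isolation. The snippet below is only a sketch: the sample string is a made-up stand-in for the parser's output, and the cut length is 20 rather than 500 to keep it small.

```php
<?php
// Hypothetical stand-in for the HTML returned by the BBCode parser.
$parsed_text = '<b>bold</b> <img src="http://google.com" alt="" /> more text';

// Blind cut at a fixed length; this lands in the middle of the <img> tag.
$message_pre = substr($parsed_text, 0, 20);

// Backing up to the last space avoids splitting a word, but the last space
// here is the one inside the tag, between "<img" and "src".
$message = substr($message_pre, 0, strrpos($message_pre, ' '));

echo $message, "\n"; // <b>bold</b> <img
```

Both substr() and strrpos() treat the markup as plain characters, so nothing stops the cut from falling inside a tag.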

My question is: What would be the correct way to go about limiting the amount of characters displayed AND preventing unclosed HTML tags from being displayed on the last line of the $message variable's content?

Thanks for any and all help,

Mark

Edited February 4, 2013 by Mko

kicken · February 4, 2013

Rather than doing a blind substr() to a specific length, you'd need to create a sort of mini-parser that goes through the string and determines whether you're inside an HTML tag or not. Only count characters when you're not inside a tag, and also keep track of which tags have been opened.

When you reach your target character count you can substr() to that position, then close any tags that are still open.

I believe I posted a function that does something like this quite a while ago; you could try searching for it. If I can find it, I'll post the link.
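A minimal sketch of that idea follows. It is not the function mentioned above; truncate_html() and its list of void tags are hypothetical names chosen for illustration, and it assumes reasonably well-formed HTML in a single-byte encoding (entities such as &amp;amp; are counted as several characters).

```php
<?php
// Hypothetical helper: walk the string, count only characters outside tags,
// remember which tags are open, then cut and close whatever is still open.
function truncate_html($html, $limit) {
    $void = array('img', 'br', 'hr', 'input', 'meta', 'link'); // no closing tag
    $open = array(); // stack of currently open tag names
    $count = 0;      // visible characters seen so far
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        if ($html[$i] === '<') {
            $end = strpos($html, '>', $i);
            if ($end === false) {
                break; // unterminated tag: stop just before it
            }
            $tag = substr($html, $i + 1, $end - $i - 1);
            if ($tag !== '' && $tag[0] === '/') {
                array_pop($open); // closing tag
            } elseif (preg_match('/^([a-z][a-z0-9]*)/i', $tag, $m)
                    && substr($tag, -1) !== '/'
                    && !in_array(strtolower($m[1]), $void)) {
                $open[] = $m[1]; // opening tag that needs a closing tag later
            }
            $i = $end; // jump to the '>'; the loop's $i++ steps past it
            continue;
        }
        if ($count === $limit) {
            break; // the next visible character would exceed the limit
        }
        $count++;
    }

    $out = substr($html, 0, $i);
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncate_html('<b>bold <i>italic</i></b> <img src="x.png"> tail', 7), "\n";
// <b>bold <i>it</i></b>
```

One side effect of this sketch: tags that directly follow the last counted character are still copied, since the loop only stops at the next visible character.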

requinix · February 4, 2013

The only problem is the cut-off <img> tag? Not the [url] things?

The technique I use is a preg_split() to alternate strings that can be cut (regular text) with strings that cannot (i.e., HTML tags). As you go, you keep track of which HTML tags you've opened and closed.

$parts = preg_split('#capture opening *and closing* html tags#', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
$cut = true; // first in $parts is regular text
$length = 0; // so far
$opentags = array(); // stack of tags needing to close
$output = ""; // shortened version

foreach ($parts as $p) {
   if ($cut) {
       // if you need to trim then go ahead, then break out of the loop
       // otherwise add to $length
   } else {
       // look at the captured html tag
       // if it opens and doesn't self-close then
       // - add the tag name to $opentags
       // if it closes then
       // - optionally check that it agrees with the top of the $opentags stack
       // - pop off $opentags
   }

   $output .= $p;
   $cut = !$cut;
}

// now close off the remaining open tags
foreach ($opentags as $tag) {
   $output .= "</{$tag}>";
}
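To see why the $cut flag can simply be toggled on every iteration, it helps to look at what preg_split() returns with PREG_SPLIT_DELIM_CAPTURE. The pattern and input below are assumptions for illustration, not the placeholder expression from the outline above.

```php
<?php
$input = 'Hello <b>bold</b> world<br />';

// One capture group around the whole tag, so each matched tag is returned
// as its own element between the pieces of text.
$parts = preg_split('#(</?[a-z]+[^>]*>)#i', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
// Even indexes hold text, odd indexes hold tags:
// 'Hello ', '<b>', 'bold', '</b>', ' world', '<br />', ''
```

Two adjacent tags produce an empty string between them, so the text/tag alternation holds as long as PREG_SPLIT_NO_EMPTY is not used.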

Mko · February 5, 2013

The [url] things were just an issue with View Source for some reason :s

Anyways, I implemented your code, but for some reason, I get this error when I run my script:

Fatal error: Maximum execution time of 30 seconds exceeded in /home/mko/public_html/home.php on line 189

My current code:

<?php
$conn = new DB();
$query = $conn->query("query here;");

if (mysqli_num_rows($query) > 0) {
while ($result = mysqli_fetch_array($query)) {
	$body = $result['pagetext'];
	$parser = new vB_BbCodeParser($vbulletin, fetch_tag_list(), true);
	$parsed_text = $parser->do_parse($body);


	$parts = preg_split("/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/", $parsed_text, -1, PREG_SPLIT_DELIM_CAPTURE);
	$cut = true; // first in $parts is regular text
	$length = 0; // so far
	$opentags = array(); // stack of tags needing to close
	$output = ""; // shortened version

	foreach ($parts as $p) {
		if ($cut) {
			// if you need to trim then go ahead, then break out of the loop
			// otherwise add to $length
			if ($length > 250) {
				break;
			} else {
				$length .= $p;
			}
		} else {
			// look at the captured html tag
			// if it opens and doesn't self-close then
			// - add the tag name to $opentags
			// if it closes then
			// - optionally check that it agrees with the top of the $opentags stack
			// - pop off $opentags
			if ($p.substr($p, 1, 1) != "/") {
				$opentags .= $p;
			} else if ($p.substr($p, 1, 1) == "/") {
				unset($opentags[$p]);
			}
		}

		$output .= $p;
		$cut = !$cut;
	}

	// now close off the remaining open tags
	foreach ($opentags as $tag) {
 		 $output .= "</{$tag}>";
	}

echo '<div id="a1">';
echo '<div class="b">';
echo $output.'...';
echo '<div class="c"></div>';
echo '<div class="d">[<a href="">Read More...</a>]';
echo '</div>';
echo '<div class="e"></div></div></div>';

}
} else {
echo 'No news!';
}
?>

Am I implementing this correctly?

Thanks,

Mark

Edited February 5, 2013 by Mko

requinix · February 6, 2013

I don't see anything that would cause an infinite loop but I do see a few things to fix. Can you do some debugging to find out where the problem is?
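A few of the things to fix can be shown in isolation. This is only a sketch using a made-up $p value; it illustrates how the operators in the posted code behave, not where the time is being spent.

```php
<?php
$p = '</b>';

// The dot is the concatenation operator, not a method call, so
// $p.substr($p, 1, 1) glues the tag and its second character together.
var_dump($p.substr($p, 1, 1));        // string(5) "</b>/"
var_dump($p.substr($p, 1, 1) != "/"); // bool(true), for every tag

// Calling substr() on its own gives the character that was intended.
var_dump(substr($p, 1, 1) == "/");    // bool(true)

// .= appends text, so the counter turns into a string instead of growing.
$length = 0;
$length .= 'abc';
var_dump($length);                    // string(4) "0abc"

// A stack is pushed with [] and popped with array_pop(), not .= and unset().
$opentags = array();
$opentags[] = 'b';
array_pop($opentags);
var_dump($opentags);                  // empty array
```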

I can improve upon what I said earlier but now it's probably getting hard to follow me. So I'll just throw the whole thing at you.

function shorten($text, $limit) {
   $selfclosing = array("img", "br");
 
   $parts = preg_split('#(</?([a-z]+)[^>]*>)#i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
   $what = "text"; // "text", "html", or "tag"
   $action = "add"; // "add", "remove", or "ignore"
   $length = 0; // so far
   $opentags = array(); // stack of tags needing to close
   $output = ""; // shortened version
 
   foreach ($parts as $p) {
       // just regular text
       if ($what == "text") {
           // if the new $p pushes the $length too long, cut it and stop
           // this is a good place for an ellipsis
           $l = strlen($p);
           if ($length + $l >= $limit) {
               // keep only as many characters as still fit
               $output .= substr($p, 0, $limit - $length) . "...";
               break;
           }
           // otherwise add it
           else {
               $output .= $p;
               $length += $l;
           }
 
           $what = "html"; // next step
       }
 
       // the entire html tag. see if it needs a separate closing tag
       else if ($what == "html") {
           // if it's a closing tag then it needs to be removed from the stack
           if ($p[1] == "/") {
               $action = "remove";
           }
           // if it explicitly closes itself then ignore it
           else if (substr($p, -2, 1) == "/") {
               $action = "ignore";
           }
           // otherwise it's an opening tag so add it
           else {
               $action = "add";
           }
 
           $output .= $p;
           $what = "tag"; // next step
       }
 
       // just the tag name
       else {
           // maybe add the tag to the top (beginning) of the stack (array)
           if ($action == "add" && !in_array(strtolower($p), $selfclosing)) {
               array_unshift($opentags, $p);
           }
           // remove whatever's on top
           else if ($action == "remove") {
               array_shift($opentags);
           }
 
           $what = "text"; // reset
       }
   }
 
   // now close off the remaining open tags
   foreach ($opentags as $tag) {
        $output .= "</{$tag}>";
   }
 
   return $output;
}

If I run that on

<div itemprop="commentText" class='post entry-content '>
Hello all,<br />
I'm recently writing a script on the homepage that would display certain threads from certain forum categories.<br />
My current SQL query and fetching the contents work well, except I encounter an odd issue when using the substring method on the fetched contents to limit the characters displayed.<br />
Just so you're aware, I'm parsing the contents of the thread's post through vBulletin's BBCodeParser, yet that's not the issue.<br />
<br />
Here's a bit of background regarding my code/issue.<br />
Current Code (only included the important stuff):<br />
<pre class='prettyprint'>
$parsed_text = $parser->do_parse($body);
 
$message_pre = substr($parsed_text, 0, 500);
$message = substr($message_pre, 0, strrpos($message_pre, ' '));
 
echo '<div id="a1">
echo '<div class="b">';
echo $message.'...';
echo '<div class="c"></div>';
echo '<div class="d">[<a href="">Read More...</a>]';
echo '</div>';
echo '<div class="e"></div></div></div>';
</pre>
<br />
So, that's all fine. However, let's get some example database contents:<br />

(a modified piece of the HTML source of your post) with a length of 250 I get

<div itemprop="commentText" class='post entry-content '>
Hello all,<br>
I'm recently writing a script on the homepage that would display certain threads from certain forum categories.<br />
My current SQL query and fetching the contents work well, except I encounter an odd issue when using the substring method on t...</div>

Mko · February 6, 2013

I did some debugging with my previous version. From what I could tell, the Regular Expression I had (/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/) wasn't functioning properly -- yielding the 30 second execution time warning.

Your example worked! I can follow everything you posted, except for one hiccup, regarding the Regular Expression. My question is: can you explain to me what the functionality of the # and #i before and after the Regular Expression ('#(</?([a-z]+)[^>]*>)#i') is?

Thanks a bunch for your continued help,

Mark

Edited February 6, 2013 by Mko

requinix · February 6, 2013

PCRE expressions need delimiters but you've got a lot of freedom as to what they are. Slashes are traditional. However if you want to use slashes in the expression itself, like I did with the /?, then you'd have to escape it lest it be interpreted as a delimiter. I don't like needlessly escaping things so I just changed to a different delimiter: # (another popular one).

Between the two delimiters is the expression itself and after the delimiter comes optional "flags" (or "modifiers"). The /i flag (the shorthand tends to be written with the slash delimiter) means case-insensitivity. A [a-z] by itself is literally "a lowercase letter a-z" and would thus only match HTML tags written in lowercase. Of course they may all be lowercase for you, but it's cheap enough to do just in case that's not true.

The manual has everything listed out if you'd like to keep reading.
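As a small illustration of both points, here is a sketch with a made-up input string: the same expression written with slash and with hash delimiters, and the effect of the i modifier.

```php
<?php
$html = 'a <B>bold</B> b';

// Slash delimiters: the slash inside the expression has to be escaped.
$with_slashes = preg_split('/(<\/?[a-z]+[^>]*>)/i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

// Hash delimiters: the same expression with no escaping needed.
$with_hashes = preg_split('#(</?[a-z]+[^>]*>)#i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

var_dump($with_slashes === $with_hashes); // bool(true)

// Without the i modifier, [a-z] does not match the uppercase tag names.
var_dump(preg_match('#</?[a-z]+[^>]*>#', $html));  // int(0)
var_dump(preg_match('#</?[a-z]+[^>]*>#i', $html)); // int(1)
```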

Edited February 6, 2013 by requinix
