Getting Summary of a Blog Post

imperialized · May 4, 2015

Alright, so I have gotten myself into a predicament. Let's say, for example, I have a blog post that has 500 words (not including any HTML markup within).

The post stored in the DB could be something like this:

<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: right'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: center'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: right'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: center'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>

Doing a word count, a substr, or a split on spaces could end in disaster if the cut lands in the middle of a style attribute or leaves out a closing tag in the HTML markup.

I've thought about doing a

substr($post, 0, 200)

and pulling the first 200 characters, but that leaves room for the issues mentioned above.

Doing a slice also leaves the issue:

$postSummary = implode(" ", array_slice(explode(" ", $post), 0, 100));

Any ideas?

Barand · May 4, 2015

Alright, so I have gotten myself into a predicament.

Brought on by storing the markup with the data.

Solution: Don't.

fastsol · May 4, 2015

I have run into the same thing before and this is how I did it. Granted, with this method you would always need to wrap the portion you want to show in its own <div> or <p>.

preg_match("/<p>(.*)<\/p>/U", $a['art_body'], $matches);

echo '<p>'.$matches[1].'..... </p>';

This basically finds the first <p> and </p>, grabs everything between them, and stores it in $matches[1]. If you want it to find the div instead then just change that in the preg_match expression.
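A slightly more defensive take on the same idea, as a rough sketch only. The function name firstParagraph() is made up for illustration, and the pattern assumes the paragraph may carry attributes or span several lines. It returns null when no paragraph is found, so nothing is read from an empty $matches.

```php
<?php
// Hypothetical helper (not from the original post): returns the inner HTML
// of the first <p>...</p> in $html, or null when there is no paragraph.
function firstParagraph(string $html): ?string
{
    // U = ungreedy, s = dot also matches newlines, i = case-insensitive.
    // \b keeps <pre> from matching; [^>]* allows attributes on the tag.
    if (preg_match('/<p\b[^>]*>(.*)<\/p>/Usi', $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

$body  = "<p class=\"intro\">First paragraph.</p>\n<p>Second paragraph.</p>";
$first = firstParagraph($body);

// Only print a summary when a paragraph was actually found.
echo $first !== null ? '<p>' . $first . '...</p>' : '';
```

Swap the p in the pattern for div if the post is wrapped in divs instead.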

fastsol · May 4, 2015

Brought on by storing the markup with the data.

Solution: Don't.

There isn't really any other way when using something like the TinyMCE text editor, at least from what I know.

Barand · May 4, 2015

$summary = substr(strip_tags($text),0, 200);
echo $summary;

maybe?
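One thing to watch with a plain substr() is that it can chop the last word in half. A minimal sketch that backs up to the previous space instead; the name plainSummary() and the 200 default are just placeholders, and it counts bytes, so multibyte text would want the mb_* equivalents.

```php
<?php
// Hypothetical helper: plain-text summary of at most $limit bytes,
// cut at a word boundary where possible.
function plainSummary(string $html, int $limit = 200): string
{
    // Drop the markup and collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    if (strlen($text) <= $limit) {
        return $text; // short enough, nothing to cut
    }

    // Look one character past the limit so a word that ends exactly
    // at the limit is kept whole.
    $cut       = substr($text, 0, $limit + 1);
    $lastSpace = strrpos($cut, ' ');

    // No space at all means one very long word: fall back to a hard cut.
    $cut = $lastSpace !== false ? substr($cut, 0, $lastSpace) : substr($cut, 0, $limit);

    return $cut . '...';
}

echo plainSummary('<div style="text-align: left"> post post post post </div>', 12);
```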

requinix · May 4, 2015

I derived this monstrosity for work*: we had arbitrary HTML and I needed to cut it down to a certain number of words, not counting headings, and preserving as much markup as possible. Slightly redacted. For regular use, one would probably want to make a couple of adjustments, like removing the check for "foo_" and "foohead" (both indicating heading markup).

/**
 * Truncate raw content
 *
 * @param mixed $content
 * @param int $words
 * @param bool $headings
 * @return array
 */
private static function truncateRaw($content, $words, $headings = false) {
	if(is_array($content)) {
		$ret = array('content' => array(), 'remaining' => $words);
		foreach($content as $key => $value) {
			if($ret['remaining'] <= 0) {
				if($ret['remaining'] == 0) {
					$ret['content'][$key] = '...';
				}
				break;
			}

			$remaining = $value->truncate($ret['remaining'], $headings);
			$ret['content'][$key] = $value;
			$ret['remaining'] = $remaining;
		}
		return $ret;
	} else {
		$ret = array('content' => '', 'remaining' => $words);
		if($words <= 0) {
			return $ret;
		}

		// some special content has html comments explicitly marking a preview area
		if(($pos1 = strpos($content, '<!-- BEGIN PREVIEW -->')) !== false && ($pos2 = strpos($content, '<!-- END PREVIEW -->', $pos1)) !== false) {
			// 22 for strlen('<!-- BEGIN PREVIEW -->')
			$len = $pos2 - $pos1 - 22;
			$ret['content'] = trim(substr($content, $pos1 + 22, $len));
			$ret['remaining'] = 0;
		} else {
			// cut the text in a way to preserve html tag structure

			$pieces = preg_split('#(</?(\w+).*?>)#is', $content, -1, PREG_SPLIT_DELIM_CAPTURE); // comes in triplets
			$excerpt = array();
			$tags = array();
			$header = 0; // counter for tag depth in headers
			$state = 0;
			// piece A: 0=text
			// piece B: 1=html tag
			// piece C: 2=open tag name, 3=opening tag name of header, 4=self-closing tag, 5=closing tag name
			//
			//   A      B      C
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 2 if opening tag    -|-> 0 and header++ (if header>0)
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 3 if opening header -|-> 0 and header++
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 4 if self-closing   -|-> 0
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 5 if closing tag    -|-> 0 and header-- (if header>0)
			// +----+------+------------------------+
			//
			// break text if header==0, otherwise only increase word count
			//
			//   h1                           h2     h1   h0
			// 013                    0      12 0   15  015  014    012 0      15  0
			//  <p class="foo_header">Header <b>Text</b> </p> <br /> <p>Content</p>
			//            ^ header                                     ^ not header
			foreach($pieces as $piece) {
				// 0. text
				if($state == 0) {
					if($header) {
						if($headings) {
							$cut = self::cutWords($piece, $ret['remaining']);
							$excerpt[] = $cut['content'];
							$ret['remaining'] = $cut['remaining'];
						} else {
							$excerpt[] = $piece;
							$ret['remaining'] -= self::countWords($piece);
						}
					} else {
						$cut = self::cutWords($piece, $ret['remaining']);
						$excerpt[] = $cut['content'];
						$ret['remaining'] = $cut['remaining'];
						if($ret['remaining'] <= 0) {
							break;
						}
					}
					$state = 1;
				}

				// 1. html tag
				else if($state == 1) {
					// logic is easier to write in reverse order
					if($piece[1] == '/') { // closing
						// closing tag logic will decide when to add itself to the excerpt
						$state = 5;
					} else if(substr($piece, -2, 1) == '/') { // self-closing
						$excerpt[] = $piece;
						$state = 4;
					} else if(strlen($piece) >= 3 && $piece[1] == 'h' && ctype_digit($piece[2])) { // normal header
						$excerpt[] = $piece;
						$state = 3;
					} else if(strpos($piece, 'foo_') !== false || strpos($piece, 'foohead') !== false) { // old header
						$excerpt[] = $piece;
						$state = 3;
					} else { // normal opening tag
						$excerpt[] = $piece;
						$state = 2;
					}
				}

				// 2. opening tag
				else if($state == 2) {
					$header && $header++;
					$tags[] = $piece;
					$state = 0;
				}

				// 3. opening header
				else if($state == 3) {
					$header++;
					$tags[] = $piece;
					$state = 0;
				}

				// 4. self-closing
				else if($state == 4) {
					$state = 0;
				}

				// 5. closing tag
				else if($state == 5) {
					$header && $header--;
					while($tags && $tag = array_pop($tags)) {
						$excerpt[] = "</{$tag}>";
						if($tag == $piece) {
							break;
						}
					}
					$state = 0;
				}
			}
			// clean up any unclosed tags
			while($tags && $tag = array_pop($tags)) {
				$excerpt[] = "</{$tag}>";
			}

			$ret['content'] = implode('', $excerpt);
		}
		return $ret;
	}
}

The most important thing was that it put tags onto a stack in order to close them out properly when the content gets cut inside multiple tags - I had to deal with things like ULs and tables.
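The cutWords() and countWords() helpers called in that method were not part of what was posted. A guess at what stand-ins might look like, written as plain functions rather than static methods. The only thing taken from the code above is the return shape, an array with 'content' and 'remaining' keys; the trailing '...' and the whitespace handling are assumptions.

```php
<?php
// Hypothetical stand-in: number of whitespace-separated words in $text.
function countWords(string $text): int
{
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

// Hypothetical stand-in: keep at most $limit words of $text.
// Returns the kept text plus how much of the word budget is left.
function cutWords(string $text, int $limit): array
{
    // Capture the whitespace too so the original spacing is preserved.
    $tokens = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $content = '';
    $used    = 0;
    foreach ($tokens as $token) {
        $isWord = preg_match('/\S/', $token) === 1;
        if ($isWord) {
            if ($used >= $limit) {
                // Budget spent and more words follow: mark the cut and stop.
                $content = rtrim($content) . '...';
                break;
            }
            $used++;
        }
        $content .= $token;
    }

    return ['content' => $content, 'remaining' => $limit - $used];
}

print_r(cutWords('post post post post', 2));
```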

OR, instead of all this crazy work deciding how to get a summary: ask the writer to write one. Seriously. It's so much easier and their summary will be nicer than one you come up with automatically.

* I don't normally like sharing stuff done for my job, IP rights and whatnot, but we're fairly lax and this is one of those times when there can be significant benefit to the community.

imperialized · May 5, 2015

As fastsol mentioned above, I am using TinyMCE, so I don't have much choice when it comes to storing the tags with the data.

I used a combination of techniques to accomplish what I was trying to achieve. It seems to be working. Albeit not the #1 solution, it will suffice for the purpose of this project.

Thanks for the help.

The following code is what was used to accomplish the intended result:

$summary = strip_tags($this->blogPost);
$summary = implode(' ', array_slice(explode(' ', $summary), 0, 100));

The above code gets the first 100 words of the blog post.
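One caveat with explode(' ', ...) is that line breaks and doubled spaces left behind by strip_tags() can throw the count off. A possible variant that splits on any whitespace, sketched as a standalone function; wordSummary() and the trailing '...' are illustrative additions, not what the project actually uses.

```php
<?php
// Hypothetical helper: first $words words of the post as plain text.
function wordSummary(string $html, int $words = 100): string
{
    // Split on any run of whitespace and skip empty pieces, so newlines
    // between blocks and doubled spaces do not count as words.
    $all     = preg_split('/\s+/', strip_tags($html), -1, PREG_SPLIT_NO_EMPTY);
    $summary = implode(' ', array_slice($all, 0, $words));

    // Only add the ellipsis when something was actually cut off.
    return count($all) > $words ? $summary . '...' : $summary;
}

$post = "<div style='text-align: left'> post post post </div>\n"
      . "<div style='text-align: right'> post post post </div>";
echo wordSummary($post, 4);
```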

Psycho · May 5, 2015

How important is it to maintain the markup - for the summary? I understand you want the markup for the actual output, but why not remove tags to generate the summary?

imperialized · May 5, 2015

Psycho, after rethinking it and looking at the posts provided, that is exactly what I did. It wasn't important to keep the formatting for the summary.
