
Getting Summary of a Blog Post


imperialized


Alright, so I have landed myself in a predicament. Let's say, for example, I have a blog post that has 500 words (not including any HTML markup within).

The post stored in the DB could be something like this: 

<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: right'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: center'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: right'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: center'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>
<div style='text-align: left'> post post post post post post </div>

Doing a word count, a substr, or a split on spaces could end in disaster if the cut lands in the middle of a style attribute or leaves out a closing HTML tag.

I've thought about doing a

substr($post, 0, 200) 

 and pulling the first 200 characters but that leaves the possibility for the above mentioned issues. 

 

Doing a slice also leaves the issue:

$postSummary = implode(" ", array_slice(explode(" ", $post), 0, 100));
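
To make the problem concrete, here is a small demonstration of both naive cuts. The sample post is made up for illustration and is shaped like the rows above.

```php
<?php
// Hypothetical sample post, shaped like the rows shown above.
$post = "<div style='text-align: left'> post post post </div>\n"
      . "<div style='text-align: right'> post post post </div>";

// Cutting by characters can stop inside a tag's attribute.
$byChars = substr($post, 0, 20);

// Cutting by words keeps whole tokens but can drop the closing tag.
$byWords = implode(' ', array_slice(explode(' ', $post), 0, 5));

echo $byChars, "\n"; // <div style='text-ali
echo $byWords, "\n"; // <div style='text-align: left'> post post
```

Either result, echoed into a page, leaves the browser to guess where the div ends.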

Any ideas?


I have run into the same thing before, and this is how I did it. Granted, with this method you always need to wrap the portion you want to show in its own <div> or <p>.

// U makes .* stop at the first </p>; s lets . match newlines inside the paragraph
if (preg_match("/<p>(.*)<\/p>/Us", $a['art_body'], $matches)) {
	echo '<p>'.$matches[1].'..... </p>';
}

This basically finds the first <p> and </p> and grabs everything inside of it and assigns it to $matches.  If you want it to find the div instead then just change that in the preg_match expression.
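
A slightly more general sketch of the same idea, in case the opening tag carries attributes (like the style='...' divs in the question). The name firstBlock() is made up for this example, and the pattern assumes the chosen tag is not nested inside itself.

```php
<?php
/**
 * Return the inner HTML of the first <$tag>...</$tag> block, or null if none.
 * firstBlock() is a hypothetical helper name, not part of the original post.
 */
function firstBlock(string $html, string $tag = 'p'): ?string
{
    $tag = preg_quote($tag, '/');
    // \b[^>]* allows attributes such as style='...'.
    // U stops .* at the first closing tag, i ignores case, s lets . span newlines.
    $pattern = "/<{$tag}\\b[^>]*>(.*)<\\/{$tag}>/Uis";
    if (preg_match($pattern, $html, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

$body = "<div style='text-align: left'> first block </div>\n<div> second </div>";
$summary = firstBlock($body, 'div');
echo $summary === null ? '' : "<div>{$summary}...</div>"; // <div>first block...</div>
```

Returning null when nothing matches lets the caller fall back to some other way of building the summary.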


I derived this monstrosity for work*: we had arbitrary HTML and I needed to cut it down to a certain number of words, not counting headings, and preserving as much markup as possible. Slightly redacted. For regular use, one would probably want to make a couple adjustments like removing the check for "foo_" and "foohead" (both indicating heading markup).

/**
 * Truncate raw content
 *
 * @param mixed $content
 * @param int $words
 * @param bool $headings
 * @return array
 */
private static function truncateRaw($content, $words, $headings = false) {
	if(is_array($content)) {
		$ret = array('content' => array(), 'remaining' => $words);
		foreach($content as $key => $value) {
			if($ret['remaining'] <= 0) {
				if($ret['remaining'] == 0) {
					$ret['content'][$key] = '...';
				}
				break;
			}

			$remaining = $value->truncate($ret['remaining'], $headings);
			$ret['content'][$key] = $value;
			$ret['remaining'] = $remaining;
		}
		return $ret;
	} else {
		$ret = array('content' => '', 'remaining' => $words);
		if($words <= 0) {
			return $ret;
		}

		// some special content has html comments explicitly marking a preview area
		if(($pos1 = strpos($content, '<!-- BEGIN PREVIEW -->')) !== false && ($pos2 = strpos($content, '<!-- END PREVIEW -->', $pos1)) !== false) {
			// 22 for strlen('<!-- BEGIN PREVIEW -->')
			$len = $pos2 - $pos1 - 22;
			$ret['content'] = trim(substr($content, $pos1 + 22, $len));
			$ret['remaining'] = 0;
		} else {
			// cut the text in a way to preserve html tag structure

			$pieces = preg_split('#(</?(\w+).*?>)#is', $content, -1, PREG_SPLIT_DELIM_CAPTURE); // comes in triplets
			$excerpt = array();
			$tags = array();
			$header = 0; // counter for tag depth in headers
			$state = 0;
			// piece A: 0=text
			// piece B: 1=html tag
			// piece C: 2=open tag name, 3=opening tag name of header, 4=self-closing tag, 5=closing tag name
			//
			//   A      B      C
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 2 if opening tag    -|-> 0 and header++ (if header>0)
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 3 if opening header -|-> 0 and header++
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 4 if self-closing   -|-> 0
			// +----+------+------------------------+
			// | 0 -|-> 1 -|-> 5 if closing tag    -|-> 0 and header-- (if header>0)
			// +----+------+------------------------+
			//
			// break text if header==0, otherwise only increase word count
			//
			//   h1                           h2     h1   h0
			// 013                    0      12 0   15  015  014    012 0      15  0
			//  <p class="foo_header">Header <b>Text</b> </p> <br /> <p>Content</p>
			//            ^ header                                     ^ not header
			foreach($pieces as $piece) {
				// 0. text
				if($state == 0) {
					if($header) {
						if($headings) {
							$cut = self::cutWords($piece, $ret['remaining']);
							$excerpt[] = $cut['content'];
							$ret['remaining'] = $cut['remaining'];
						} else {
							$excerpt[] = $piece;
							$ret['remaining'] -= self::countWords($piece);
						}
					} else {
						$cut = self::cutWords($piece, $ret['remaining']);
						$excerpt[] = $cut['content'];
						$ret['remaining'] = $cut['remaining'];
						if($ret['remaining'] <= 0) {
							break;
						}
					}
					$state = 1;
				}

				// 1. html tag
				else if($state == 1) {
					// logic is easier to write in reverse order
					if($piece[1] == '/') { // closing
						// closing tag logic will decide when to add itself to the excerpt
						$state = 5;
					} else if(substr($piece, -2, 1) == '/') { // self-closing
						$excerpt[] = $piece;
						$state = 4;
					} else if(strlen($piece) >= 3 && $piece[1] == 'h' && ctype_digit($piece[2])) { // normal header
						$excerpt[] = $piece;
						$state = 3;
					} else if(strpos($piece, 'foo_') !== false || strpos($piece, 'foohead') !== false) { // old header
						$excerpt[] = $piece;
						$state = 3;
					} else { // text
						$excerpt[] = $piece;
						$state = 2;
					}
				}

				// 2. opening tag
				else if($state == 2) {
					$header && $header++;
					$tags[] = $piece;
					$state = 0;
				}

				// 3. opening header
				else if($state == 3) {
					$header++;
					$tags[] = $piece;
					$state = 0;
				}

				// 4. self-closing
				else if($state == 4) {
					$state = 0;
				}

				// 5. closing tag
				else if($state == 5) {
					$header && $header--;
					while($tags && $tag = array_pop($tags)) {
						$excerpt[] = "</{$tag}>";
						if($tag == $piece) {
							break;
						}
					}
					$state = 0;
				}
			}
			// clean up any unclosed tags
			while($tags && $tag = array_pop($tags)) {
				$excerpt[] = "</{$tag}>";
			}

			$ret['content'] = implode('', $excerpt);
		}
		return $ret;
	}
}
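
truncateRaw() above calls self::cutWords() and self::countWords(), which are not included in the post. The following is only a guess at what such helpers might look like, written as plain functions. The return shape (an array with 'content' and 'remaining') is inferred from how truncateRaw() uses the result. Everything else is an assumption.

```php
<?php
/** Count whitespace-separated words in a text fragment (hypothetical helper). */
function countWords(string $text): int
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

/**
 * Keep at most $limit words of $text (hypothetical helper).
 * Returns the kept text and how much of the word budget is left.
 */
function cutWords(string $text, int $limit): array
{
    // Split on whitespace but keep the whitespace runs, so spacing survives.
    $parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $kept = '';
    $remaining = $limit;
    foreach ($parts as $part) {
        if (trim($part) !== '') {          // this part is a word
            if ($remaining <= 0) {
                // Budget spent and another word follows: cut here.
                return array('content' => rtrim($kept), 'remaining' => $remaining);
            }
            $remaining--;
        }
        $kept .= $part;
    }
    return array('content' => $kept, 'remaining' => $remaining);
}

$cut = cutWords("one two  three four", 2);
echo $cut['content'], "\n"; // one two
```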
The most important thing about truncateRaw() was that it puts opening tags onto a stack so it can close them out properly when the content gets cut inside multiple nested tags - I had to deal with things like ULs and tables.
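
Stripped of the heading logic, the stack idea can be sketched in a much smaller function. This is a simplified illustration, not the production code: truncateHtml() and its list of void elements are assumptions made for the example. It treats HTML comments as text and assumes reasonably well-formed markup.

```php
<?php
/**
 * Truncate HTML to $limit words, closing any tags left open at the cut.
 * truncateHtml() is a hypothetical, simplified illustration.
 */
function truncateHtml(string $html, int $limit): string
{
    // Void elements never get a closing tag, so they stay off the stack.
    $void = array('br', 'hr', 'img', 'input', 'meta', 'link');
    // With one capture group, even-indexed pieces are text and odd-indexed are tags.
    $pieces = preg_split('#(</?\w+[^>]*>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $open = array(); // stack of currently open tag names
    $out = '';

    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            // Tag: copy it through and update the stack.
            preg_match('#^<(/?)(\w+)#', $piece, $m);
            $name = strtolower($m[2]);
            if ($m[1] === '/') {
                if (end($open) === $name) {
                    array_pop($open);
                }
            } elseif (!in_array($name, $void, true) && substr($piece, -2) !== '/>') {
                $open[] = $name;
            }
            $out .= $piece;
            continue;
        }
        // Text: spend the word budget.
        $words = preg_split('/\s+/', $piece, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) > $limit) {
            $out .= implode(' ', array_slice($words, 0, $limit)) . '...';
            break;
        }
        $out .= $piece;
        $limit -= count($words);
    }
    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

$html = "<div style='text-align: left'><b>one two three</b> four</div><p>five</p>";
echo truncateHtml($html, 2), "\n";
// <div style='text-align: left'><b>one two...</b></div>
```

The cut happens inside both the b and the div, and the stack is what lets the function emit the two closing tags in the right order.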

 

OR, instead of all this crazy work deciding how to get a summary: ask the writer to write one. Seriously. It's so much easier and their summary will be nicer than one you come up with automatically.

 

* I don't normally like sharing stuff done for my job, IP rights and whatnot, but we're fairly lax and this is one of those times when there can be significant benefit to the community.


Like the guy above mentioned, I am using TinyMCE, so I don't have much choice when it comes to storing the tags with the data.

I used a combination of techniques to accomplish what I was trying to achieve. It seems to be working. Albeit not the #1 solution, it will suffice for the purpose of this project.

 

Thanks for the help.

 

The following code is what was used to accomplish the intended result:

$summary = strip_tags($this->blogPost);
$summary = implode(' ', array_slice(explode(' ', $summary), 0, 100));

The above code strips the markup and keeps the first 100 space-separated words of the remaining plain text.
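
One caveat with splitting on a single space: newlines between blocks end up counted as words, and text from adjacent tags can get glued together. A slightly more defensive sketch of the same approach follows. The function name summarize() and the trailing "..." are assumptions added for illustration.

```php
<?php
/**
 * Plain-text summary: the first $limit words of $html with all markup removed.
 * summarize() is a hypothetical helper name.
 */
function summarize(string $html, int $limit = 100): string
{
    // Put a space before each tag so "</div><div>" does not glue two words together.
    // Trade-off: inline markup such as "<b>bold</b>ly" is split into two words.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Split on any whitespace run, not only single spaces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $summary = implode(' ', array_slice($words, 0, $limit));
    // Mark the cut only when something was actually left out.
    return count($words) > $limit ? $summary . '...' : $summary;
}

$post = "<div style='text-align: left'> one two </div>\n<div>three four five</div>";
echo summarize($post, 3), "\n"; // one two three...
```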

