Just the <p> tags? Or are you stripping out the <h1>?

 

How are you getting $content? Is it just a string? Do you know if it's always valid HTML? What if it isn't? Is there always just the <h1> and <p>? How else can $content vary?

As the others have stated, we need to know exactly how the content can vary. But IF the variable will only contain one pair of P tags, then a simple regex will suffice. In fact, you can use regex if there are multiple P tag pairs, as long as they are properly paired and not nested.

 

//If only one P tag pair in content
if(preg_match("#<p>[^<]+</p>#i", $content, $match))
{
  //Assign the paragraph to a variable
  $para = array_shift($match);
}
else
{
  $para = false;
}


//If multiple P tag pairs in content
if(preg_match_all("#<p>[^<]+</p>#i", $content, $matches))
{
  //Assign the paragraphs to an array
  $paraAry = array_shift($matches);
}
else
{
  $paraAry = false;
}
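
One caveat with both patterns above, shown as a minimal sketch rather than a fix: because the body is `[^<]+` and the opening tag is the literal `<p>`, a paragraph that contains any inline tag, or whose opening tag carries an attribute, will not match. The sample strings below are made up for illustration.

```php
<?php
$regex = '#<p>[^<]+</p>#i';

// Sample inputs are assumptions for illustration only
$plain  = '<p>page content</p>';
$inline = '<p>page <b>content</b></p>';
$attr   = '<p class="intro">page content</p>';

// Plain text between the tags: matches
var_dump(preg_match($regex, $plain));   // int(1)
// Inline tag inside the paragraph: [^<]+ stops at "<b", so no match
var_dump(preg_match($regex, $inline));  // int(0)
// Attribute on the opening tag: the literal "<p>" is never found
var_dump(preg_match($regex, $attr));    // int(0)
```

If the content can contain either of those, the pattern needs loosening before it is safe to rely on.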

Edited by Psycho

In fact, you can use regex if there are multiple P tag pairs, as long as they are properly paired and not nested.

Properly paired is a definite requirement, but with that expression

#<p>[^<]+</p>#i

it's quite easy to turn it into something that can handle nested <p>s. You know, as an academic exercise.

#<p>([^<]+|(?!<p>)<|(?R))+</p>#i

Same as before but the contents of the tag are either a] normal-looking text, b] the start of an HTML tag that isn't "<p>", or c] the entire expression matched recursively.

Edited by requinix

Properly paired is a definite requirement, but with that expression

#<p>[^<]+</p>#i

it's quite easy to turn it into something that can handle nested <p>s. You know, as an academic exercise.

#<p>([^<]+|(?!<p>)<|(?R))+</p>#i

Same as before but the contents of the tag are either a] normal-looking text, b] the start of an HTML tag that isn't "<p>", or c] the entire expression matched recursively.

 

That's beyond my skillset. But in testing that code, in the hope of breaking it down, it doesn't seem to work for nested content. Using this as the content:

$content = '<h1>heading</h1><p>page content</p> <p>outer content 1 <p>Nested Content</p> outer content 2 </p>';

 

The regex is succeeding, but with 0 matches. I'm actually quite interested in this possible solution as I had to implement a workaround to a similar problem in some previous code and I'd like to go back and refactor if there is a simpler solution.

Edited by Psycho

Succeeding? I tried and it does not, even though I can (thought I could) see how it should be able to match something, even if it's the wrong text.

 

Anyway, the middle part in the list was to exclude the delimiters. I made sure it wasn't "<p>" but didn't include "</p>". Together they're "</?p>".

#<p>([^<]+|(?!</?p>)<|(?R))+</p>#i

$content = '<h1>heading</h1><p>page content</p> <p>outer content 1 <p>Nested Content</p> outer content 2 </p>';
$regex = '#<p>([^<]+|(?!</?p>)<|(?R))+</p>#i';

preg_match_all($regex, $content, $matches);
var_dump($matches);

array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(19) "<p>page content</p>"
    [1]=>
    string(61) "<p>outer content 1 <p>Nested Content</p> outer content 2 </p>"
  }
  [1]=>
  array(2) {
    [0]=>
    string(12) "page content"
    [1]=>
    string(17) " outer content 2 "
  }
}

 

Without trying to hijack the topic, the basic form is

beginning delimiter ( valid content that isn't either delimiter | (?R) )+ ending delimiter

In this case your original expression defined the valid content to be "not a <". When trying to match paired parentheses the regex would look like

/
\(   # beginning delimiter
(
	[^()]+   # valid content is everything, besides a parenthesis (the delimiter)
	| (?R)   # recursion
)+
\)   # ending delimiter
/ix
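
To see the parentheses version run, here is a small sketch that drops the same pattern into PHP. The input string is made up for illustration; only the full matches in `$matches[0]` are printed.

```php
<?php
// Same pattern as above; the x flag allows the layout and the # comments
$regex = '/
    \(            # beginning delimiter
    (
        [^()]+    # valid content: anything that is not a parenthesis
        | (?R)    # or a complete nested group, matched recursively
    )+
    \)            # ending delimiter
/x';

// Hypothetical sample input
$input = 'f(a (b) c) + g(d)';

preg_match_all($regex, $input, $matches);
print_r($matches[0]);   // the two top-level groups: "(a (b) c)" and "(d)"
```

The nested "(b)" is consumed by the recursion, so it is part of the first match and is not reported separately.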

The variable is coming from a database. It is actually search results: I am trying to build a snippet that gives a brief description of the page. So, thinking about it, I don't actually want the tags, just the content inside, and I want to limit it to e.g. 300 characters.

Does that make sense?
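
For that requirement (text inside the paragraphs only, capped at about 300 characters), one possible sketch follows. The function name `snippetFromParas` and the sample content are hypothetical, and it assumes the paragraphs are properly paired and not nested. It uses a lazy match, so inline tags inside a paragraph do not break it, and `strip_tags()` removes whatever markup remains.

```php
<?php
// Hypothetical helper: build a plain-text snippet from the <p> contents of $html
function snippetFromParas($html, $maxLen = 300)
{
    // Lazy match per paragraph; the "s" flag lets "." span newlines.
    // Assumes paragraphs are properly paired and not nested.
    if (!preg_match_all('#<p\b[^>]*>(.*?)</p>#is', $html, $matches)) {
        return '';
    }
    // Join the paragraph bodies, drop any remaining tags, collapse whitespace
    $text = strip_tags(implode(' ', $matches[1]));
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $maxLen) {
        return $text;
    }
    // Byte-based cut; multibyte text would want mb_substr() instead
    return rtrim(substr($text, 0, $maxLen)) . '...';
}

// Sample content made up for illustration; a short limit makes the cut visible
$content = '<h1>heading</h1><p>page <b>content</b> 1</p><h1>heading 2</h1><p>page content 2</p>';
echo snippetFromParas($content, 20);   // page content 1 page...
```

The cut can land in the middle of a word; trimming back to the last space is an easy refinement if that matters for the search results.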

Sans regex method

 

<?php
$content = '<h1>heading1</h1><p>page content 1</p><h1>heading 2</h1><p>page content 2</p><h1>heading 3</h1><p>page content 3</p>';
$new = parasOnly($content);
echo htmlentities($new);

// Collect every <p>...</p> pair using plain string searching (no regex)
function parasOnly($html)
{
    $pos1 = 0;
    $res = '';
    // Stop when there is no further <p>, or no </p> left to pair it with
    while (($pos2 = strpos($html, '<p>', $pos1)) !== false
        && ($pos3 = strpos($html, '</p>', $pos2)) !== false) {
        // +4 keeps the closing </p> in the copied piece
        $res .= substr($html, $pos2, $pos3 - $pos2 + 4);
        $pos1 = $pos3;
    }
    return $res;
}
?>

RESULT:

<p>page content 1</p><p>page content 2</p><p>page content 3</p>
