1internet Posted November 27, 2012

I have $content = '<h1>heading</h1><p>page content</p>' in a variable. How can I create another variable containing just the p tags, i.e. $newContent = '<p>page content</p>'?

Jessica Posted November 27, 2012

You'll want to use a DOM parser. HTML is too complex to be handled reliably by string functions or by most regular expressions.

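A minimal sketch of the DOM parser approach, assuming PHP's bundled DOM extension is available. The function name extractParagraphs is made up for illustration, and the sketch only covers the simple fragment from the question.

```php
<?php
// Hypothetical helper (not from the thread): return the outer HTML of
// every <p> element found in an HTML fragment.
function extractParagraphs(string $html): array
{
    // loadHTML() rejects an empty string in PHP 8, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them,
    // since stored content is not guaranteed to be valid HTML.
    $previous = libxml_use_internal_errors(true);
    // Caveat: without an encoding hint, loadHTML() assumes ISO-8859-1,
    // so non-ASCII content may need extra handling.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // saveHTML($node) serialises just that node, tags included.
        $paragraphs[] = $doc->saveHTML($p);
    }

    return $paragraphs;
}

$content    = '<h1>heading</h1><p>page content</p>';
$newContent = implode('', extractParagraphs($content));
echo $newContent, "\n"; // <p>page content</p>
```

To get the text without the tags, $p->textContent could be collected instead of $doc->saveHTML($p).
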
requinix Posted November 27, 2012

Just the <p> tags? Or are you stripping out the <h1>? How are you getting $content? Is it just a string? Do you know if it's always valid HTML? What if it isn't? Is there always just the <h1> and <p>? How else can $content vary?

Psycho Posted November 27, 2012

As the others have stated, we need to know exactly how the content can vary. But IF the variable will only contain one pair of P tags, then a simple regex will suffice. In fact, you can use regex if there are multiple P tag pairs as long as they are not nested and properly paired.

    //If only one P tag pair in content
    if(preg_match("#<p>[^<]+</p>#i", $content, $match))
    {
        //Assign the paragraph to a variable
        $para = array_shift($match);
    }
    else
    {
        $para = false;
    }

    //If multiple P tag pairs in content
    if(preg_match_all("#<p>[^<]+</p>#i", $content, $matches))
    {
        //Assign the paragraphs to an array
        $paraAry = array_shift($matches);
    }
    else
    {
        $paraAry = false;
    }

Edited November 27, 2012 by Psycho

requinix Posted November 27, 2012

> In fact, you can use regex if there are multiple P tag pairs as long as they are not nested and properly paired.

Properly paired is a definite requirement, but with that expression

    #<p>[^<]+</p>#i

it's quite easy to turn it into something that can handle nested <p>s. You know, as an academic exercise.

    #<p>([^<]+|(?!<p>)<|(?R))+</p>#i

Same as before, but the contents of the tag are either (a) normal-looking text, (b) the start of an HTML tag that isn't "<p>", or (c) the entire expression matched recursively.

Edited November 27, 2012 by requinix

Psycho Posted November 27, 2012

> Properly paired is a definite requirement, but with that expression
>
>     #<p>[^<]+</p>#i
>
> it's quite easy to turn it into something that can handle nested <p>s. You know, as an academic exercise.
>
>     #<p>([^<]+|(?!<p>)<|(?R))+</p>#i
>
> Same as before, but the contents of the tag are either (a) normal-looking text, (b) the start of an HTML tag that isn't "<p>", or (c) the entire expression matched recursively.

That's beyond my skillset. But when I tested that code in the hope of breaking it down, it didn't seem to work for nested content. Using this as the content:

    $content = '<h1>heading</h1><p>page content</p> <p>outer content 1 <p>Nested Content</p> outer content 2 </p>';

The regex is succeeding, but with 0 matches. I'm actually quite interested in this possible solution, as I had to implement a workaround for a similar problem in some previous code and I'd like to go back and refactor if there is a simpler solution.

Edited November 27, 2012 by Psycho

requinix Posted November 27, 2012

Succeeding? I tried and it does not, even though I can (thought I could) see how it should be able to match something, even if it's the wrong text.

Anyway, the middle part in the list was to exclude the delimiters. I made sure it wasn't "<p>" but didn't include "</p>". Together they're "</?p>".

    #<p>([^<]+|(?!</?p>)<|(?R))+</p>#i

    $content = '<h1>heading</h1><p>page content</p> <p>outer content 1 <p>Nested Content</p> outer content 2 </p>';
    $regex = '#<p>([^<]+|(?!</?p>)<|(?R))+</p>#i';
    preg_match_all($regex, $content, $matches);
    var_dump($matches);

    array(2) {
      [0]=>
      array(2) {
        [0]=>
        string(19) "<p>page content</p>"
        [1]=>
        string(61) "<p>outer content 1 <p>Nested Content</p> outer content 2 </p>"
      }
      [1]=>
      array(2) {
        [0]=>
        string(12) "page content"
        [1]=>
        string(17) " outer content 2 "
      }
    }

Without trying to hijack the topic, the basic form is

    beginning delimiter
    (
        valid content that isn't either delimiter
      | (?R)
    )+
    ending delimiter

In this case your original expression defined the valid content to be "not a <". When trying to match paired parentheses the regex would look like

    /
        \(          # beginning delimiter
        (
            [^()]+  # valid content is everything, besides a parenthesis (the delimiter)
          |
            (?R)    # recursion
        )+
        \)          # ending delimiter
    /ix

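The paired-parentheses form above can be tried out with a short sketch like the one below. The function name findBalancedGroups is hypothetical, and the pattern differs slightly from the one in the post: the group is non-capturing, it repeats with * so that an empty "()" also matches, the inner text uses a possessive ++ to limit backtracking on unbalanced input, and the i flag is dropped because the pattern contains no letters.

```php
<?php
// Hypothetical helper (not from the thread): return every outermost
// balanced parenthesised group found in $text.
function findBalancedGroups(string $text): array
{
    $regex = '/
        \(              # beginning delimiter
        (?:
            [^()]++     # any run of characters that are not parentheses
          |
            (?R)        # recursion: a complete nested group
        )*
        \)              # ending delimiter
    /x';

    preg_match_all($regex, $text, $matches);

    // $matches[0] holds the full text of each match.
    return $matches[0];
}

print_r(findBalancedGroups('f(a, g(b, c)) + h(d)'));
// Array
// (
//     [0] => (a, g(b, c))
//     [1] => (d)
// )
```

Nested groups are consumed by the recursion, so only the outermost groups are reported as matches.
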
1internet Posted November 28, 2012 (Author)

The variable is coming from a database. It is actually for search results: I am trying to build a snippet that gives a brief description of the page. So, thinking about it, I don't actually want the tags, just the content inside them, and I want to limit it to e.g. 300 characters. Does that make sense?

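One possible way to build such a snippet is sketched below, assuming the paragraphs are properly paired and not nested (the same restriction discussed earlier in the thread) and that the mbstring extension is installed. The function name makeSnippet, the default limit, and the "..." suffix are illustrative choices, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): build a plain-text snippet
// from the <p> contents of $html, cut to at most $limit characters.
function makeSnippet(string $html, int $limit = 300): string
{
    // Capture the inside of each <p>...</p> pair. \b keeps <pre> and
    // similar tags from matching; the s flag lets . span line breaks.
    preg_match_all('#<p\b[^>]*>(.*?)</p>#is', $html, $matches);

    // Remove any inline tags (e.g. <b>, <a>) left inside the paragraphs.
    $parts = array_map('strip_tags', $matches[1]);
    $text  = implode(' ', $parts);

    // Decode entities only after the tags are gone, then collapse
    // runs of whitespace into single spaces.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }

    // mb_substr counts characters, not bytes, so multibyte text is not
    // cut in the middle of a character.
    return rtrim(mb_substr($text, 0, $limit, 'UTF-8')) . '...';
}

echo makeSnippet('<h1>heading</h1><p>page content</p>'), "\n"; // page content
```

The result is plain text, so it should be passed through htmlspecialchars() when it is echoed into the search results page.
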
Barand Posted November 28, 2012

Sans regex method

    <?php
    $content = '<h1>heading1</h1><p>page content 1</p><h1>heading 2</h1><p>page content 2</p><h1>heading 3</h1><p>page content 3</p>';

    $new = parasOnly ($content);

    echo htmlentities($new);

    function parasOnly($html)
    {
        $pos1 = 0;                              // where to resume searching
        $res = '';
        $k = substr_count($html, '<p>');        // number of paragraphs to copy
        for ($i=0; $i<$k; $i++) {
            $pos2 = strpos($html, '<p>', $pos1);    // start of the next <p>
            $pos3 = strpos($html, '</p>', $pos2);   // start of its closing </p>
            $res .= substr($html, $pos2, $pos3-$pos2+4);    // +4 keeps the "</p>"
            $pos1 = $pos3;
        }
        return $res;
    }
    ?>

RESULT:

    <p>page content 1</p><p>page content 2</p><p>page content 3</p>