preg_match_all: passing through possible classes

Lexas · July 10, 2009

Hello guys. I'm trying to use the WordPress plugin WordPress Easy Contents, but it has a "bug" that I'm trying to fix.

The following expression is meant to catch HTML tags, like h1 for example:

preg_match_all('#\<'.$element.'>(.+?)\</'.$element.'>#si', $content, $matches, PREG_SET_ORDER);

$element is the tag element that must be caught; $content is the input text to be searched.

The problem is that this expression only works if the tag has no ID and no class.

For example, <h1> works, but <h1 class="anything"> doesn't.

I've tried a lot of combinations to mean "anything from here until '>'", but nothing worked. Any idea what can be used here?

thebadbad · July 10, 2009

preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

Added an optional subpattern: 1 or more whitespace characters followed by 1 or more characters not a >.
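
A minimal sketch of that pattern in use, to show where each piece ends up in $matches. The $content string here is made up for illustration and is not taken from the plugin:

```php
<?php
// Hypothetical sample input; not taken from the plugin.
$content = '<h1>Plain</h1><h1 class="title" id="top">With attributes</h1>';
$element = 'h1';

// Group 1: optional attributes (whitespace, then anything up to ">").
// Group 2: the text between the opening and closing tags.
preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the attribute string (empty when the tag has none), $m[2] the contents.
    echo trim($m[1]), ' | ', $m[2], "\n";
}
```

Note that the tag contents now sit at index 2 rather than index 1, because the attribute group comes first.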

nrg_alpha · July 10, 2009

Conversely, you could also simply use:

preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

In your case, should there be some attribute(s) after the $element tag name, you will be capturing them. If there is no need to capture, you can use non-capturing parentheses: (?: ... ), but I find simply using the negated character class easier.
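
A rough side-by-side of the three variants, assuming a made-up $content string, to show which index of each match holds the tag contents. One PHP detail worth noting: inside a double-quoted string, a variable followed directly by [ is read as an array offset, so the negated-class version wraps the variable in braces.

```php
<?php
// Hypothetical sample input for illustration.
$content = '<h2 class="sub">Section</h2>';
$element = 'h2';

// Capturing group for the attributes: the contents land in index 2.
preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $capturing, PREG_SET_ORDER);

// Non-capturing group (?: ... ): the contents land in index 1.
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $nonCapturing, PREG_SET_ORDER);

// Negated character class, no group at all: the contents also land in index 1.
// {$element} keeps PHP from parsing "[^>]" as an array offset.
preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $negated, PREG_SET_ORDER);

echo $capturing[0][2], "\n";
echo $nonCapturing[0][1], "\n";
echo $negated[0][1], "\n";
```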

thebadbad · July 11, 2009

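You're right that I should have used non-capturing parentheses; I simply forgot. Consider this sample string to see why I added the whitespace:

<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>

When $element = 'a', your pattern would (wrongfully) start matching at <acronym title="PHP Freaks"> and capture <a href="http://php.net/">PHP.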
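
A small sketch of that false match, using the sample string from this post. The variable names $loose and $strict are just labels chosen for this example:

```php
<?php
$content = '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>';
$element = 'a';

// Negated character class right after the tag name:
// "<a" also matches the first two characters of "<acronym".
preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $loose, PREG_SET_ORDER);

// Requiring whitespace before any attributes rules out "<acronym".
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $strict, PREG_SET_ORDER);

echo $loose[0][0], "\n";   // whole match starts at the <acronym ...> tag
echo $strict[0][0], "\n";  // whole match is just the <a> element
```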

nrg_alpha · July 11, 2009

thebadbad wrote:

You're right that I should have used non-capturing parentheses; I simply forgot. Consider this sample string to see why I added the whitespace:

<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>

When $element = 'a', your pattern would (wrongfully) start matching at <acronym title="PHP Freaks"> and capture <a href="http://php.net/">PHP.

Right, I see what you're saying now. In that case, we could simply insert a \b word boundary inside the opening tag in the pattern:

<$element\b[^>]*>

This way, if $element = 'a', it will ignore tags like <acronym> or <abbr> for example and will find the actual anchor tags (and thus bypass the need for a group checking for a space, then anything not a >, all of which is optional).
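
As a quick check of the word-boundary version, here is a sketch run against a made-up string that mixes <abbr>, <acronym>, and two anchor tags (the input is an assumption for illustration only):

```php
<?php
// Hypothetical sample input mixing tags whose names start with "a".
$content = '<abbr title="x">A</abbr> '
         . '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym> '
         . '<a>bare</a>';
$element = 'a';

// \b requires the tag name to end right after "a",
// so "<abbr" and "<acronym" cannot start a match.
preg_match_all("#<$element\b[^>]*>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] holds the contents of each real <a> element.
    echo $m[1], "\n";
}
```

Both the anchor with attributes and the bare <a> are found, since \b is satisfied by either a space or the closing > after the tag name.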

thebadbad · July 11, 2009

True, using a word boundary would be more appropriate.

nrg_alpha · July 11, 2009

It was a good catch on your part, though. Looking at the OP's pattern, then looking at yours, I wasn't sure what you were getting at (hindsight has 20/20 vision, they say).

thebadbad · July 11, 2009

I didn't go into detail on purpose, to let you figure it out yourself.
