
preg_match_all: passing through possible classes


Lexas


Hello guys. I'm trying to use the WordPress plugin WordPress Easy Contents, but it has a "bug" that I'm trying to fix.

 

The following expression is meant to catch HTML tags, like h1 for example:

preg_match_all('#\<'.$element.'>(.+?)\</'.$element.'>#si', $content, $matches, PREG_SET_ORDER);

 

$element is the tag name that must be caught, and $content is the input text to be searched.

 

The problem is that this expression only works if the tag has no ID and no class.

For example, <h1> works, but <h1 class="anything"> doesn't.

 

I've tried a lot of combinations to mean "anything from here until '>'" but nothing worked. Any idea of what can be used here?

preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

I added an optional subpattern: one or more whitespace characters (\s+) followed by one or more characters that are not a > ([^>]+).
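To make the difference concrete, here is a minimal sketch comparing the original pattern with the one using the optional subpattern. The sample $content string is an assumption made up for illustration, not something taken from the plugin.

```php
<?php
// Hypothetical input: one bare tag and one tag with a class attribute.
$content = '<h1>Plain</h1> <h1 class="anything">With class</h1>';
$element = 'h1';

// Original pattern: only matches tags that have no attributes at all.
preg_match_all('#<' . $element . '>(.+?)</' . $element . '>#si', $content, $strict, PREG_SET_ORDER);

// Pattern with the optional attribute subpattern.
// Group 1 holds the attributes (empty if none), group 2 holds the tag content.
preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $loose, PREG_SET_ORDER);

echo count($strict), "\n"; // 1
echo count($loose), "\n";  // 2
echo $loose[1][2], "\n";   // With class
```

Note that the tag content has moved from group 1 to group 2, so any code reading $matches afterwards would need to be adjusted.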

 

Alternatively, you could simply use:

 

preg_match_all("#<{$element}[^>]*>(.+?)</{$element}>#si", $content, $matches, PREG_SET_ORDER);

 

In your case, if there are attributes after the $element tag name, you will be capturing them. If there is no need to capture, you can use non-capturing parentheses, (?: ... ), but I find simply using the negated character class easier.
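As a rough sketch of the two options just mentioned, the snippet below runs the non-capturing variant and the negated character class variant side by side. The input string is hypothetical. The braces around $element are there because PHP would otherwise try to read the following [ as an array index inside a double-quoted string.

```php
<?php
// Hypothetical input for illustration.
$content = '<h1 class="anything">Title</h1>';
$element = 'h1';

// Optional attribute group made non-capturing with (?: ... ).
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $a, PREG_SET_ORDER);

// Negated character class: zero or more non-">" characters, no group needed.
preg_match_all("#<{$element}[^>]*>(.+?)</{$element}>#si", $content, $b, PREG_SET_ORDER);

// In both cases group 1 is the tag content, as in the original pattern.
echo $a[0][1], "\n"; // Title
echo $b[0][1], "\n"; // Title
```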

You're right that I should have used non-capturing parentheses; I simply forgot. Consider this sample string to see why I added the whitespace requirement:

 

<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>

 

When $element = 'a', your pattern would (wrongly) start matching at <acronym, because <a[^>]* also accepts the rest of the acronym tag.
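The false match can be reproduced with the sample string above. This is only a sketch; the variable names $neg and $ws are made up for the comparison.

```php
<?php
$content = '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>';
$element = 'a';

// Negated character class only: "<a" also matches the start of "<acronym".
preg_match_all("#<{$element}[^>]*>(.+?)</{$element}>#si", $content, $neg, PREG_SET_ORDER);

// Requiring whitespace before any attributes rules out "<acronym".
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $ws, PREG_SET_ORDER);

echo $neg[0][1], "\n"; // <a href="http://php.net/">PHP
echo $ws[0][1], "\n";  // PHP
```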

Right... I see what you're saying now. In that case, we could simply insert a \b word boundary inside the opening tag in the pattern:

<$element\b[^>]*>

 

This way, if $element = 'a', it will ignore tags like <acronym> or <abbr> and find only the actual anchor tags. It also removes the need for an optional group that checks for whitespace followed by anything that is not a >.
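Put together, the word boundary version might look like the sketch below, run against the same sample string. The preg_quote() call is an addition not discussed in the thread; it is included on the assumption that $element could contain regex metacharacters.

```php
<?php
$content = '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>';
$element = 'a';

// Assumption: escape $element (and the "#" delimiter) before embedding it in the pattern.
$tag = preg_quote($element, '#');

// \b requires the tag name to end right after $tag, so "<acronym" no longer matches.
preg_match_all("#<{$tag}\b[^>]*>(.+?)</{$tag}>#si", $content, $matches, PREG_SET_ORDER);

echo count($matches), "\n"; // 1
echo $matches[0][1], "\n";  // PHP
```

As with the original expression, the tag content stays in group 1, so the rest of the plugin code should not need to change.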

 
