
preg_match_all: passing through possible classes


Lexas


Hello guys. I'm trying to use the WordPress plugin WordPress Easy Contents, but it has a "bug" that I'm trying to fix.

 

The following expression is meant to catch HTML tags, like h1 for example:

preg_match_all('#\<'.$element.'>(.+?)\</'.$element.'>#si', $content, $matches, PREG_SET_ORDER);

 

$element is the tag name that must be caught; $content is the input text to be searched.

 

The problem is that this expression only works if the tag has no ID and no class.

For example, <h1> works, but <h1 class="anything"> doesn't.

 

I've tried a lot of combinations to mean "anything from here until '>'", but nothing worked. Any idea what can be used here?
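
For anyone who wants to reproduce the problem, here is a minimal sketch. The sample $content string is made up for illustration; only the pattern is the one quoted above.

```php
<?php
// Hypothetical sample input; only the pattern comes from the post above.
$content = '<h1>Plain</h1><h1 class="anything">Styled</h1>';
$element = 'h1';

// The pattern requires ">" immediately after the tag name,
// so any opening tag that carries attributes is skipped.
preg_match_all('#\<'.$element.'>(.+?)\</'.$element.'>#si', $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], "\n"; // prints "Plain" only
}
```

The second heading is never reported because `<h1 class="anything">` does not contain the literal text `<h1>`.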


preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

Added an optional subpattern: 1 or more whitespace characters followed by 1 or more characters not a >.
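
A short sketch of what this does to the match array, assuming a made-up $content string. Because the optional subpattern is a capturing group, the attributes land in index 1 and the tag content moves to index 2.

```php
<?php
// Hypothetical sample input for illustration.
$content = '<h1>Plain</h1><h1 class="anything">Styled</h1>';
$element = 'h1';

// (\s+[^>]+)? optionally matches whitespace followed by the attribute text.
preg_match_all("#<$element(\s+[^>]+)?>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] holds the attributes (empty string when there are none),
    // $m[2] holds the text between the tags.
    echo '[', $m[1], '] ', $m[2], "\n";
}
// prints:
// [] Plain
// [ class="anything"] Styled
```

Any code that previously read the content from index 1 would need to read index 2 with this pattern.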

 

Conversely, you could also simply use:

 

preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $matches, PREG_SET_ORDER);

 

In your case, should there be any attribute(s) after the $element tag name, you will be capturing them. If there is no need to capture, you can use non-capturing parentheses: (?: ... ), but I find simply using the negated character class easier.
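
To compare the two options side by side, here is a minimal sketch with a made-up $content string. Both variants keep the tag content in index 1. The curly braces around $element in the second pattern are there because, inside a double-quoted PHP string, a variable followed directly by [ is parsed as an array index.

```php
<?php
// Hypothetical sample input for illustration.
$content = '<h1 class="anything">Styled</h1>';
$element = 'h1';

// Non-capturing group: the attributes are matched but not stored.
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $a, PREG_SET_ORDER);

// Negated character class: zero or more characters that are not ">".
preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $b, PREG_SET_ORDER);

echo $a[0][1], "\n"; // prints "Styled"
echo $b[0][1], "\n"; // prints "Styled"
```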


You're right that I should have used non-capturing parentheses; I simply forgot. Consider this sample string to see why I added the whitespace:

 

<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>

 

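When $element = 'a', your pattern would (wrongly) match from <acronym title="PHP Freaks"> through </a>, because [^>]* also consumes the rest of the acronym tag name and its attributes.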
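
A minimal sketch that runs both patterns against the sample string above, so the difference is visible. The variable names $loose and $strict are only for illustration.

```php
<?php
$content = '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>';
$element = 'a';

// Negated character class only: "<a" also matches the start of "<acronym".
preg_match_all("#<{$element}[^>]*>(.+?)</$element>#si", $content, $loose, PREG_SET_ORDER);

// Whitespace required before any attributes: "<acronym" no longer qualifies.
preg_match_all("#<$element(?:\s+[^>]+)?>(.+?)</$element>#si", $content, $strict, PREG_SET_ORDER);

echo $loose[0][1], "\n";  // prints: <a href="http://php.net/">PHP
echo $strict[0][1], "\n"; // prints: PHP
```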


Right, I see what you're saying now. In that case, we could simply insert a \b word boundary inside the opening tag in the pattern:

<$element\b[^>]*>

 

This way, if $element = 'a', it will ignore tags like <acronym> or <abbr> for example and will find the actual anchor tags (and thus bypass the need for a group checking for a space, then anything not a >, all of which is optional).
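
Putting the word boundary into a complete call could look like the sketch below. The helper name find_elements is hypothetical and not part of the plugin, and running the tag name through preg_quote() is my own addition, assuming $element might one day contain regex metacharacters.

```php
<?php
// Hypothetical helper for illustration; not part of the plugin.
function find_elements(string $element, string $content): array
{
    // preg_quote() escapes regex metacharacters and the "#" delimiter.
    $tag = preg_quote($element, '#');

    // \b demands a word boundary right after the tag name,
    // so "a" cannot match the start of "acronym" or "abbr".
    preg_match_all("#<$tag\b[^>]*>(.+?)</$tag>#si", $content, $matches, PREG_SET_ORDER);

    return $matches;
}

$content = '<acronym title="PHP Freaks"><a href="http://php.net/">PHP</a>F</acronym>';

$found = find_elements('a', $content);
echo $found[0][1], "\n"; // prints: PHP
```

One caveat: \b also sees a boundary before a hyphen, so with $element = 'a' a hyphenated tag name such as <a-b> would still match.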

 
