How can I replace a part of a match

ryy705 · August 27, 2008

Hello,

I need to remove '3D' from malformed strings like <key=3D"value">. I assume that I could match this by using "<.*(3D).*>", but how can I replace 3D with an empty string? Many thanks in advance for helping me.

nrg_alpha · August 27, 2008

You could just use a simple str_replace:

$str = '<key=3D"value">';
$str = str_replace('3D', '', $str);
echo $str;
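
One caveat worth keeping in mind with this route: str_replace has no notion of position, so it removes every literal '3D' in the string, including any that happen to sit inside a value. A quick sketch (the input string here is made up for illustration):

```php
<?php
// Hypothetical input where '3D' also appears inside the quoted value.
$str = '<model=3D"3D_view">';

// str_replace removes every occurrence, not just the one after '='.
$result = str_replace('3D', '', $str);
echo $result, "\n"; // <model="_view">
```

If the values can never contain '3D', this is not a problem and str_replace stays the simplest option.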

nrg_alpha · August 27, 2008

Or you could go this route, which basically accepts your <key=3D"value"> and has preg_replace simply reconstruct the statement without the second capture ($2). Using this expression, it doesn't care what's in the double quotes (as long as it is made of word characters), so in theory you could pass in anything with the format <key=*"anyname"> (* = any word characters).

$str = '<key=3D"value">';
$str = preg_replace('#(<key=)(\w+)("\w+">)#e', '"$1$3"', $str);
echo $str;

ryy705 · August 28, 2008

Thank you. Could you please explain the regular expression a bit further.

key and value could be any html and css tag. So they are arbitrary.

So I guess I am looking for something like preg_replace("(<.*=)(3D)(.*>)", '"$1$3"', $str). Let me explain what I am trying to do.

(<.*=) means start with a <, then a bunch of characters, then end with a =. (.*>) means a bunch of characters followed by a >. But how do I write it? Sorry, I am not all that great with regular expressions.

Doesn't \w represent digits? What does # represent? Tried googling it but I could not find anything.

nrg_alpha · August 28, 2008

Ah.. it should have been explained that way from the beginning

Ok, so here is the newer version..

$str = '<H1=2D"some_value">'; // try something other than 'key'..
$str = preg_replace('#(<[^=]+=)(\w+)("\w+">)#e', '"$1$3"', $str);
echo $str;

Here's the breakdown of the pattern..

The first capture (which is the first set of parentheses..which is automatically stored as variable $1) is:

(<[^=]+=)

So basically, this is saying: start with a <, then in a character class '[ ]', match anything that is NOT an equal sign (the 'not' part is due to the caret '^' inside the class). So match anything that is not an equal sign one or more times (the plus sign), until you reach an equal sign (the last equal sign inside the parentheses).

Next, we have the second capture (which is the second set of parentheses, $2).. this is what we will ultimately not include..

(\w+)

the \w is a word character, which by standard definition matches a-zA-Z0-9_ (although depending on your locale, it might actually match more.. but for this sake, not important).. so since the third part starts with an equal sign (which is NOT matched by \w), the regex engine will match all characters which fall into the \w category one or more times (till it arrives at the equal sign), and stores this into variable $2 automatically.

Finally, we get to the last set of parenthesis ($3):

("\w+">)

And this basically says: a double quote, then any word character one or more times, then a closing double quote, then a '>' character. And that's the pattern..
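
If it helps to see the three captures separately, here is a small sketch using preg_match with the same pattern (minus the e modifier, which matching doesn't need). The variable name $m is just my choice for the array of matches:

```php
<?php
$str = '<H1=2D"some_value">';

// $m[0] holds the whole match; $m[1], $m[2], $m[3] hold the three captures.
if (preg_match('#(<[^=]+=)(\w+)("\w+">)#', $str, $m)) {
    echo $m[1], "\n"; // <H1=
    echo $m[2], "\n"; // 2D
    echo $m[3], "\n"; // "some_value">
}
```

Gluing $m[1] and $m[3] back together gives the tag without the unwanted part, which is what the replacement does.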

You may notice the 'e' after the last delimiter.. this is a modifier for preg_replace.. it allows the replace aspect to utilise php code.. so, looking at the replacement part, what has been basically done is:

'"$1$3"'

Which is a set of single quotes with a set of double quotes nested inside. Inside those quotes are the first and third captures that we want (remember, we don't want the second capture).. and lastly, in this example, we tell regex that we are using the $str string as the source for all of this to be matched in..
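
A note for anyone trying this on a newer PHP: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the pattern above will not run there as written. When PHP code really is needed in the replacement, preg_replace_callback is the usual substitute. A minimal sketch, assuming the same pattern and input as above:

```php
<?php
$str = '<H1=2D"some_value">';

$result = preg_replace_callback(
    '#(<[^=]+=)(\w+)("\w+">)#',
    function ($m) {
        // $m[1] and $m[3] are the first and third captures;
        // leaving out $m[2] drops the unwanted part.
        return $m[1] . $m[3];
    },
    $str
);

echo $result, "\n"; // <H1="some_value">
```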

Hopefully, this helps you in understanding this a little better.. regex can be tricky at first.. but if you keep hacking away at it, it starts to slowly make sense

Cheers,

NRG

nrg_alpha · August 28, 2008

Oops.. I re-read it afterwards and made some errors in my explanation.. so I'll revise what I didn't get right..

The first capture (which is the first set of parentheses..which is automatically stored as variable $1) is:

should be:

The first capture (which is the first set of parentheses..which (if matched) is automatically stored as variable $1) is:

And in the explanation of the second capture, (\w+), this part:

so since the third part starts with an equal sign (which is NOT matched by \w), the regex engine will match all characters which fall into the \w category one or more times (till it arrives at the equal sign), and stores this into variable $2 automatically.

should be:

so since the third part starts with a double quote (which is NOT matched by \w), the regex engine will match all characters which fall into the \w category one or more times (till it arrives at the double quote), and stores this into variable $2 automatically.

effigy · August 28, 2008

$str = preg_replace('#(<[^=>]+=)(\w+)("\w+">)#', '$1$3', $str);

I added > to the character class so it will not match into another tag should the originating one not have an equals sign.
There's no need for /e.
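
To illustrate what the extra > in the character class changes, here is a rough sketch. The input string is made up: its first tag has no equals sign and is followed by plain text that merely looks like an attribute.

```php
<?php
// Hypothetical input: '<b>' has no '=', and the text after it is not a tag.
$str = '<b>note: width=3D"wide"> is text, not a tag</b>';

// Without '>' in the class, [^=]+ runs past the end of '<b>' into the text.
$loose = preg_replace('#(<[^=]+=)(\w+)("\w+">)#', '$1$3', $str);

// With '>' in the class, the match cannot leave the tag it started in.
$strict = preg_replace('#(<[^=>]+=)(\w+)("\w+">)#', '$1$3', $str);

echo $loose, "\n";  // <b>note: width="wide"> is text, not a tag</b>
echo $strict, "\n"; // <b>note: width=3D"wide"> is text, not a tag</b>
```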

Another approach with fewer captures, should it fit the context of the data:

$str = preg_replace('#(?<==)\w+(?=")#', '', $str);
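
Spelled out as a small sketch with the pieces of the pattern commented (the input is the one used earlier in the thread):

```php
<?php
$str = '<H1=2D"some_value">';

// (?<==) : lookbehind, the match must be preceded by '=' (not consumed)
// \w+    : the word characters that get removed
// (?=")  : lookahead, the match must be followed by '"' (not consumed)
$result = preg_replace('#(?<==)\w+(?=")#', '', $str);

echo $result, "\n"; // <H1="some_value">
```

Since the lookarounds consume nothing, only the \w+ part is replaced, and no captures need to be put back.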

nrg_alpha · August 28, 2008

effigy wrote:

$str = preg_replace('#(<[^=>]+=)(\w+)("\w+">)#', '$1$3', $str);

I added > to the character class so it will not match into another tag should the originating one not have an equals sign.

There's no need for /e.

Ah, I forgot about excluding the closing '>' character.. oops.. yes, that would be bad if the string had multiple tags.. my bad :-[

As for the e modifier, that was a slip-up on my part from mixing single and double quotes (it's those little things here and there that end up snagging you!)

effigy wrote:

Another approach with fewer captures, should it fit the context of the data:

$str = preg_replace('#(?<==)\w+(?=")#', '', $str);

As for your newest pattern involving fewer captures.. I immediately wondered why there was a lookahead assertion after the \w+ (after all, \w does not encompass " characters, and thus should stop there). While your example works, I tested it without the lookahead assertion.. and it still works:

$str = '<H1=2D"some_value">';
$str = preg_replace('#(?<==)\w+#', '', $str);
echo $str;

output (via right-click - view source):

<H1="some_value">

Is the forward assertion absolutely necessary? Or am I missing something?

effigy · August 28, 2008

It depends on the data. I think it's a safer approach because we're making two verifications rather than one: (1) the data must appear after an equals sign; and (2) the data must appear before a double quote.

Otherwise, it could botch up something like this:

<pre>
<?php
$str = '<H1=2D"some_value">The following is an emoticon: =o) The following is a formula: a=b*c';
$str = preg_replace('#(?<==)\w+#', '', $str);
echo $str;
?>
</pre>

nrg_alpha · August 28, 2008

Given your last code snippet, when I run it, I get:

source code output:

<H1="some_value">The following is an emoticon: =) The following is a formula: a=*c

This still achieves what we want, no? (meaning, remove any word characters after = but before "). Naturally, the rest gets output on screen. But as far as tags are concerned, would this still not suffice?

Is there another example using the last preg pattern that could yield some unpredictable results?

I do agree that the additional lookahead assertion would certainly 'strengthen' the 'conditions' required to match. I suppose better to be safe than sorry. Just as someone still learning regex, it has piqued my curiosity as to the 'why' aspect of it (not sure if I'm explaining myself correctly or not).

effigy · August 28, 2008

It made unwanted modifications; observe:

The following is an emoticon: =o) The following is a formula: a=b*c

The following is an emoticon: =) The following is a formula: a=*c
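
For comparison, a sketch running the same string through both patterns side by side (the variable names $without and $with are only for illustration):

```php
<?php
$str = '<H1=2D"some_value">The following is an emoticon: =o) The following is a formula: a=b*c';

// Lookbehind only: anything word-like after any '=' is removed.
$without = preg_replace('#(?<==)\w+#', '', $str);

// Lookbehind plus lookahead: removed only when a '"' follows.
$with = preg_replace('#(?<==)\w+(?=")#', '', $str);

echo $without, "\n"; // <H1="some_value">The following is an emoticon: =) The following is a formula: a=*c
echo $with, "\n";    // <H1="some_value">The following is an emoticon: =o) The following is a formula: a=b*c
```

With the lookahead in place, the emoticon and the formula come through untouched, and only the tag is changed.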

nrg_alpha · August 28, 2008

Touché

Without the tag, it becomes very apparent! It's all clear now.
