New to regex but can't solve this problem

jrw4 · October 6, 2009

If I have a string that looks like:

<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">

or

<content type="xhtml" xmlns:x="http://www.w3.org/1999/xhtml">

I am trying to get the value that is between xmlns: and the equal sign.

So I have tried the following code:

$string = '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">';

preg_match("xmlns:[\w]+\=", $string, $matches);

var_dump($matches);

This returns null

So I am not sure how to find that. What am I doing wrong and what should I do to find that part of the string?

nrg_alpha · October 6, 2009

The problem is that your pattern is missing delimiters, and you are not isolating what you are looking for in your pattern.

Here is an example of how I would tackle it (putting both tag examples you listed into an array):

$html = array('<content type="xhtml" xmlns:x="http://www.w3.org/1999/xhtml">', '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">');
foreach($html as $val){
    preg_match('#xmlns:\K[^=]+#', $val, $match);
    echo $match[0] . "<br />\n";
}

Output:

x
xhtml

This way, $match[0] will contain only what is between xmlns: and =.

EDIT - On our resources page, you can read up on delimiters in the 'Why delimiters?' thread, as well as in the PHP manual.
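
For comparison, here is a minimal sketch of the same extraction using delimiters plus an ordinary capturing group instead of \K. The variable name $prefixMatch is made up for illustration; the sample string is the one from the question.

```php
<?php
// Sample tag taken from the question above.
$string = '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">';

// The leading and trailing '/' are the pattern delimiters that the original
// attempt was missing. The parentheses capture only the namespace prefix.
// $prefixMatch is an illustrative name, not something from the thread.
if (preg_match('/xmlns:(\w+)=/', $string, $prefixMatch)) {
    echo $prefixMatch[0], "\n"; // whole match: xmlns:xhtml=
    echo $prefixMatch[1], "\n"; // captured group: xhtml
}
```

With a capturing group, the wanted value lands in index 1 rather than index 0, which is the main practical difference from the \K version.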

cags · October 6, 2009

Sorry to hijack the thread slightly, but it looks like nrg_alpha has solved it anyway. What does the \K modifier do? I tried looking it up and came up with 'Named Capturing Groups', but the references I found seemed to indicate \k was a .NET syntax.

EDIT: Never mind, I found it in one of the links you provided.

nrg_alpha · October 6, 2009

http://www.phpfreaks.com/blog/pcre-regex-spotlight-k

cags · October 6, 2009

Excellent, thanks for the link. I really should get around to checking out the articles here, I've only been around a few days and only focused on the forum.

nrg_alpha · October 6, 2009

No problem. Take your time, you're doing great.

jrw4 · October 7, 2009

Thanks for asking that as I had the same question on \K.

Now if I wanted to match something like:

<content[ANYTHING]>

My pattern would be:

$pattern = "/<content\K[\w]*>/";

nrg_alpha · October 7, 2009

Keep in mind that in that case, $0 (or index [0] of the matches array if using preg_match; either one holds the text matched by the entire pattern) would be 'ANYTHING>'. I'm assuming that the [] brackets surrounding ANYTHING aren't actually in the source string and are only there for illustrative purposes. Note that the > is included, so chances are this is not what you want.

In this case, you have a few options. You can put the > into a lookahead assertion like so:

$pattern = "/<content\s?\K\w*(?=>)/";

Since assertions don't consume any text, the > part is not included in the overall match.

Or, depending on the string's circumstances (let's assume that after '<content ' and the sequence of \w characters, the tag closes with >), you might even be able to omit the > completely:

$pattern = "/<content\s?\K\w*/";

This way, in either sample, the > character is not included in $0 (or index [0] if using preg_match), which cleans things up a bit.

Also note that I didn't use [\w]. \w is already a character class shorthand in and of itself, which for all intents and purposes (without delving into the topic of locales) is understood as [a-zA-Z0-9_]. So if something like \w or \d is the only thing placed inside a character class, the character class is redundant.

And finally, if stuff follows <content, you probably don't want to include the initial space in your match, so I threw in the \s? just in case.

However, I don't think I would even use \w. Instead, I would use a negated character class to grab everything up to > like so:
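
As a quick check of the point about [\w], this sketch runs the shorthand, the wrapped form, and the explicit range against the same text. The sample text and variable names are made up for illustration, and it assumes plain ASCII input with no locale or Unicode settings in play.

```php
<?php
// Hypothetical ASCII-only sample text.
$text = 'foo_1 bar-2 Baz';

preg_match_all('/\w+/', $text, $shorthand);          // bare shorthand
preg_match_all('/[\w]+/', $text, $wrapped);          // shorthand inside a class
preg_match_all('/[a-zA-Z0-9_]+/', $text, $explicit); // spelled-out range

var_dump($shorthand[0] === $wrapped[0]);  // bool(true)
var_dump($shorthand[0] === $explicit[0]); // bool(true)
print_r($shorthand[0]);                   // foo_1, bar, 2, Baz
```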

$pattern = "/<content\s?\K[^>]*/";
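
To see how these variants differ, here is a small sketch that runs them against two sample strings. The inputs and variable names are hypothetical, chosen only to illustrate the behaviour described above.

```php
<?php
// Hypothetical sample inputs, made up for illustration.
$simple    = '<content ANYTHING>';
$withAttrs = '<content type="xhtml">';

// With > inside the pattern after \K, the > ends up in the match.
preg_match('/<content\s?\K\w*>/', $simple, $withBracket);
echo $withBracket[0], "\n"; // ANYTHING>

// With > moved into a lookahead, it is checked but not consumed.
preg_match('/<content\s?\K\w*(?=>)/', $simple, $lookahead);
echo $lookahead[0], "\n"; // ANYTHING

// \w cannot match = or the quotes, so the lookahead version finds nothing
// once real attributes are present.
$found = preg_match('/<content\s?\K\w*(?=>)/', $withAttrs, $unused);
var_dump($found); // int(0)

// The negated class grabs everything up to, but not including, the >.
preg_match('/<content\s?\K[^>]*/', $withAttrs, $negated);
echo $negated[0], "\n"; // type="xhtml"
```

The third call is the reason to prefer [^>]* here: as soon as the tag carries attributes, \w* stops at the first character that is not a letter, digit, or underscore.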

jrw4 · October 7, 2009

Well I was trying along the lines of:

$input = preg_replace("/<content\K[^\>]+/", "", $input);

That just takes <content[ANYTHING]> and turns it into <content>, which is alright, but then I need a second line of code to remove that too:

$input = str_replace("<content>", "", $input);

What I meant to ask is how to do that in one line.

nrg_alpha · October 7, 2009

If I understand correctly, you are wiping out [ANYTHING] from <content[ANYTHING]> if applicable, but then you want to wipe out <content> itself?

I'll provide three samples which remove various levels of the <content> tags:

example:

$input = <<<EOF
Some text. <content class="whatever">Some content</content> Some more text yet again! <content>And yes, some more content!</content>
EOF;

# example1: remove <content[ANYTHING]> only!
$input1 = preg_replace('#<content[^>]*>#i', '', $input);
echo $input1 . "<br />\n"; // Output: Some text. Some content</content> Some more text yet again! And yes, some more content!</content>

# example2: remove complete content tags
$input2 = preg_replace('#<content[^>]*>.*?</content>#is', '', $input);
echo $input2 . "<br />\n"; // Output: Some text.  Some more text yet again!

# example3: remove only the content tags (yet leave the text inside those tags in place)
$input3 = preg_replace('#<content[^>]*>(.*?)</content>#is', '$1', $input);
echo $input3 . "<br />\n"; // Output: Some text. Some content Some more text yet again! And yes, some more content!

thebadbad · October 7, 2009

Lol, I was about to post this, but then you beat me to it, haha:

Then you don't need \K at all, but just
$input = preg_replace('~<content[^>]*>~i', '', $input);
And my advance apologies go to nrg, who is probably in the process of writing an elaborate answer (no offence - you do a great job explaining things in detail, while my answers often just aren't that long).

nrg_alpha · October 7, 2009

lol no harm, no foul
