
New to regex but can't solve this problem


jrw4


If I have a string that looks like:

 

<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">

 

or

 

<content type="xhtml" xmlns:x="http://www.w3.org/1999/xhtml">

 

I am trying to get the value that is between xmlns: and the equal sign.

 

<content type="xhtml" xmlns:[THIS IS WHAT IM LOOKING FOR]="http://www.w3.org/1999/xhtml">

 

So I have tried the following code:

 

$string = '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">';

preg_match("xmlns:[\w]+\=", $string, $matches);

var_dump($matches);

 

This returns null

 

So I am not sure how to find that.  What am I doing wrong and what should I do to find that part of the string?


The problem is that your pattern is missing delimiters, and you are not isolating the part you are looking for in the pattern.

Here is an example of how I would tackle it (putting both example strings you listed into an array):

 

$html = array('<content type="xhtml" xmlns:x="http://www.w3.org/1999/xhtml">', '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">');
foreach($html as $val){
    preg_match('#xmlns:\K[^=]+#', $val, $match);
    echo $match[0] . "<br />\n";
}

 

Output:

x
xhtml

 

This way, $match[0] will only contain what is between xmlns: and =.

 

EDIT - On our resources page, you can read the 'Why delimiters?' thread, as well as the section on delimiters in the PHP manual.
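
For comparison, here is a minimal sketch of two ways to isolate the prefix once the pattern has delimiters: a capture group, and the \K approach used above. The helper name xmlns_prefix is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper (not from the thread): returns the prefix after "xmlns:", or null.
function xmlns_prefix(string $tag): ?string
{
    // '#' is the delimiter; the parentheses capture only the prefix.
    if (preg_match('#xmlns:(\w+)=#', $tag, $m) === 1) {
        return $m[1];
    }
    return null;
}

$tag = '<content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">';

// Capture group: $m[0] is the whole match, $m[1] is the captured prefix.
preg_match('#xmlns:(\w+)=#', $tag, $m);

// \K: everything matched before \K is dropped from the reported match $k[0].
preg_match('#xmlns:\K[^=]+#', $tag, $k);

echo xmlns_prefix($tag), "\n"; // xhtml
echo $m[0], "\n";              // xmlns:xhtml=
echo $k[0], "\n";              // xhtml
```

With the capture group you read index 1 instead of index 0; with \K the part you want is the whole match, which can be handy in preg_replace.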


Sorry to hijack the thread slightly, but it looks like nrg_alpha has solved it anyway. What does the \K modifier do? I tried looking it up and came up with 'Named Capturing Groups', but the references I found seemed to indicate that \k was .NET syntax.

 

EDIT: Never mind, I found it in one of the links you provided.


Thanks for asking that as I had the same question on \K.

 

Now if I wanted to match something like:

 

<content[ANYTHING]>

 

My pattern would be:

 

$pattern = "/<content\K[\w]*>/";

 

Keep in mind that in that case, $0 (or, if using preg_match, index [0], which is where the overall match is stored) would be 'ANYTHING>'. (I'm assuming the [] brackets surrounding ANYTHING aren't actually in the source string and are only there for illustrative purposes.) Note that the > is included, so chances are this is not what you want.

 

In this case, you have a few options. You can put the > into a lookahead assertion like so:

$pattern = "/<content\s?\K\w*(?=>)/";

Since assertions don't consume any text, the > is not included in the base match.

 

Or, depending on the string's circumstances (say, if after '<content ' and the sequence of \w characters the tag always closes with >), you might even be able to omit the > completely:

$pattern = "/<content\s?\K\w*/";

 

This way, in either sample, the > character is not included in $0 (or, if using preg_match, index [0]), which cleans things up a bit. Also note that I didn't use [\w]: \w is already a character class shorthand in and of itself, which for all intents and purposes (without delving into the topic of locales) means [a-zA-Z0-9_]. So if something like \w or \d is the only thing inside a character class, the surrounding brackets are redundant. And finally, if stuff follows <content, you probably don't want to include the initial space in your match, so I threw in the \s? just in case.

 

However, I don't think I would even use \w. Instead, I would use a negated character class to grab everything up to the > like so:

 

$pattern = "/<content\s?\K[^>]*/";
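
To make the differences concrete, here is a small sketch that runs the pattern variants discussed above against two sample tags. The sample strings and the first_match helper are made up for illustration.

```php
<?php
// Hypothetical helper: returns the overall match (index 0), or null if nothing matched.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// Made-up sample tags.
$plain = '<content foo>';
$attrs = '<content class="whatever">';

// 1. > is part of the pattern after \K, so it ends up in the match.
$a = first_match('/<content\s?\K\w*>/', $plain);      // 'foo>'
// 2. > is only asserted by the lookahead, so it is not consumed.
$b = first_match('/<content\s?\K\w*(?=>)/', $plain);  // 'foo'
// 3. > is left out of the pattern entirely.
$c = first_match('/<content\s?\K\w*/', $plain);       // 'foo'

// \w* stops at the first non-word character, so attributes get cut short...
$d = first_match('/<content\s?\K\w*/', $attrs);       // 'class'
// ...while [^>]* grabs everything up to the closing >.
$e = first_match('/<content\s?\K[^>]*/', $attrs);     // 'class="whatever"'
// The lookahead version finds no match here, because = follows 'class', not >.
$f = first_match('/<content\s?\K\w*(?=>)/', $attrs);  // null

var_dump($a, $b, $c, $d, $e, $f);
```

The last three calls are the reason the negated character class is the safer choice once the tag can carry attributes.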


Well I was trying along the lines of:

 

$input = preg_replace("/<content\K[^\>]+/", "", $input);

 

That just takes <content[ANYTHING]> and turns it into <content>, which is all right, but then I need a second line of code to remove that too:

 

$input = str_replace("<content>", "", $input);

 

I meant to ask: how do I do that in one line?


If I understand correctly, you are wiping out [ANYTHING] from <content[ANYTHING]> if applicable, but then you want to wipe out <content> itself?

I'll provide three samples which remove various levels of the <content> tags:

 

example:

$input = <<<EOF
Some text. <content class="whatever">Some content</content> Some more text yet again! <content>And yes, some more content!</content>
EOF;

# example1: remove <content[ANYTHING]> only!
$input1 = preg_replace('#<content[^>]*>#i', '', $input);
echo $input1 . "<br />\n"; // Output: Some text. Some content</content> Some more text yet again! And yes, some more content!</content>

# example2: remove complete content tags
$input2 = preg_replace('#<content[^>]*>.*?</content>#is', '', $input);
echo $input2 . "<br />\n"; // Output: Some text.  Some more text yet again!

# example3: remove only the content tags (yet leave the text inside those tags in place)
$input3 = preg_replace('#<content[^>]*>(.*?)</content>#is', '$1', $input);
echo $input3 . "<br />\n"; // Output: Some text. Some content Some more text yet again! And yes, some more content!
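
The examples above rely on the i and s pattern modifiers without spelling out what they do. As a rough sketch (the input strings here are made up), i makes the match case-insensitive and s lets the dot match newlines, which matters as soon as the tag content spans more than one line:

```php
<?php
// Made-up input where the tag body spans two lines.
$input = "Before <content>line one\nline two</content> after";

// Without s, "." stops at the newline, so the pattern never reaches </content>
// and preg_replace returns the input unchanged.
$withoutS = preg_replace('#<content[^>]*>.*?</content>#i', '', $input);

// With s, "." also matches newlines, so the whole element is removed.
$withS = preg_replace('#<content[^>]*>.*?</content>#is', '', $input);

// With i, an upper-case tag name is matched as well.
$upper = preg_replace('#<content[^>]*>#i', '', '<CONTENT id="a">text');

var_dump($withoutS === $input); // bool(true)
var_dump($withS);               // string(13) "Before  after"
var_dump($upper);               // string(4) "text"
```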


Lol, I was about to post this, but then you beat me to it, haha:

 

Then you don't need \K at all, but just

 

$input = preg_replace('~<content[^>]*>~i', '', $input);

And my apologies in advance go to nrg, who is probably in the process of writing an elaborate answer (no offence - you do a great job explaining things in detail, while my answers often just aren't that long). ;)
