
Convert "&" to "&amp;"


dbrimlow


As a self-acknowledged idiot when it comes to regex, I spent a good half hour or so yesterday searching here and on Google for a simple way to convert "&" to "&amp;", so that the database text in my firm's dynamically generated web pages would conform to XHTML web standards.

 

I couldn't find it.

 

But, then I realized that I already use a great email filter function that I found on the manual site, and it has a section that does this conversion (included within it), so I tried it:

 

// Convert ampersands to named or numbered entities.
// Use regex to skip any that might be part of existing entities.
function makeAmpersandEntities($str, $useNamedEntities = 1) {
  return preg_replace("/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,5};)/m", $useNamedEntities ? "&amp;" : "&#38;", $str);
}

 

It worked like a charm when I called the function while initializing the variables for my select query:

$description = makeAmpersandEntities($description);

The complete variable setup is below:

 

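// Quote the array keys; bare keys are treated as undefined constants.
$description = $result_row['comment1'] . $result_row['comment2']
             . $result_row['comment3'] . $result_row['comment4']
             . $result_row['comment5'] . $result_row['comment6']
             . $result_row['comment7'] . $result_row['comment8'];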
$description = makeAmpersandEntities($description);
$description = strtolower($description);
$description = ucfirst($description);

 

Now, since I am an idiot, I would like to try to actually UNDERSTAND the preg_replace call it uses. I THINK I understand what it does. Do I have the following right?

 

&(?![A-Za-z]{0,4}\w{2,3}; - find the ampersand and ignore any alpha characters that might be immediately after it that occur at least 0 times but not more than 4 times, AND any word characters that occur at least 2 times but not more than 3 times?

 

the pipe - | - says to check (before and after) both the sub-patterns within the parentheses.

 

Is this - #[0-9]{2,5};) - find the hash mark and at least 2 but no more than 5 numbers?

 

Lastly, I simply can't "get it" why the multiline modifier - /m - is necessary, unless it is there because this was originally part of an email filter, so it WOULD need to handle any new line - \n - within the email message.

 

Thanks to anyone who answers.

 

Idiot Dave

 


Thanks,

 

I wanted to try something simple and elegant like that, but it doesn't take into account any pre-existing instances of "&amp;" or "&#38;".

 

Therefore, your solution would recode "&amp;" as "I like pizza &amp;amp; pasta" or "&#38;" as "I like pizza &amp;#38; pasta".

 

This is why the check for characters after the & was necessary.
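To make the double-encoding problem concrete, here is a minimal sketch. The sample string and the variable names ($text, $naive, $careful) are made up for illustration, and the replacement strings "&amp;" and "&#38;" in the function are an assumption about what the original manual snippet used.

```php
<?php
// Hypothetical input: one bare "&", two existing entities, and "AT&T".
$text = 'pizza & pasta &amp; wine &#38; AT&T';

// Naive approach: every ampersand is replaced, even inside entities.
$naive = str_replace('&', '&amp;', $text);

// Lookahead approach from the first post (replacement strings assumed).
function makeAmpersandEntities($str, $useNamedEntities = 1) {
    return preg_replace(
        '/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,5};)/',
        $useNamedEntities ? '&amp;' : '&#38;',
        $str
    );
}

$careful = makeAmpersandEntities($text);

echo $naive, "\n";   // existing entities get encoded a second time
echo $careful, "\n"; // existing entities are left alone
```

With the naive version, "&amp;" turns into "&amp;amp;", which is exactly the problem described above.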

 

(Nice websites, BTW. Clean css and markup).

 

Dave


  • 2 weeks later...

<?php

$var = "I like pizza & pasta";

echo preg_replace("/ & /"," &amp; ",$var);

?>

 

can't you just add spaces around the ampersand and get what you want?

That won't work if somebody uses an ampersand that is not surrounded by spaces.

And if you don't put the spaces in the pattern, it will mess up existing entities.

So there is no simple REGEX solution for this.

There are some already existing functions that will do the job easily, though. Like the one neel listed above.
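One built-in candidate is htmlspecialchars() with its fourth parameter, $double_encode, set to false. This is only a suggestion: the function referred to above is not shown in this thread, so this is not necessarily the same one. Keep in mind that htmlspecialchars() also converts "<" and ">" (and quotes, depending on the flags), so this sketch only suits text that contains no markup. The sample string is made up.

```php
<?php
// Hypothetical input with bare ampersands and existing entities.
$text = 'pizza & pasta &amp; wine &#38; AT&T';

// ENT_NOQUOTES leaves quotes untouched; the final "false" tells PHP
// not to re-encode anything that already looks like a valid entity.
$encoded = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8', false);

echo $encoded, "\n";
```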

 

And dbrim, I think you understood it correctly. And the multi-line is probably there so that if there is an entity that starts on one line and ends on another, the regex will not goof up.


Neel, it didn't work.

 

As Azu pointed out, most of the errors in that field are made by some 9-to-5 underpaid data entry person more concerned with punching entries out fast than with proper content, so the &s tend to get tacked onto a word or placed right after one.

 

Besides, I had to create a function to do this because I only want to make the & conversion for one dynamic variable (pulled from the DB).

 


(?!...) is a negative lookahead assertion.  It means: match the preceding pattern only when it is NOT followed by the stuff inside (?!...).  So "&(?!amp;)" means find "&" unless it is the start of "&amp;".
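A small sketch of that lookahead in action. The sample string and variable names are hypothetical.

```php
<?php
// One bare ampersand and one that is already encoded.
$sample = 'fish & chips &amp; peas';

// Count ampersands that are NOT followed by "amp;".
$bareCount = preg_match_all('/&(?!amp;)/', $sample);

// Replace only those bare ampersands. The lookahead consumes nothing,
// so only the "&" itself is swapped out.
$fixed = preg_replace('/&(?!amp;)/', '&amp;', $sample);

echo $bareCount, "\n";
echo $fixed, "\n";
```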

 

The pipe (|) means "or."  "/abc|123/" matches "abc" or "123".  "/abc(?:123|xyz)/" finds "abc123" or "abcxyz".

 

{2,5} is the quantifier "at least two, no more than 5."  {0,4} is just what you think it is: zero to four.  Keep the 0 explicit, since PCRE has traditionally treated {,4} as literal text rather than as a quantifier.
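The alternation and quantifier rules can be checked with a few preg_match() calls. This is just a sketch; the test strings and variable names are arbitrary.

```php
<?php
// Top-level alternation: the whole pattern is "abc" OR "123".
$plainHit = preg_match('/abc|123/', 'xx123');             // matches "123"

// Grouped alternation: "abc" must come first, then "123" or "xyz".
$groupHit  = preg_match('/abc(?:123|xyz)/', 'abcxyz');    // matches
$groupMiss = preg_match('/abc(?:123|xyz)/', '123');       // no "abc" prefix

// {2,5}: at least two and at most five digits (anchored to the whole string).
$tooShort = preg_match('/^[0-9]{2,5}$/', '7');
$inRange  = preg_match('/^[0-9]{2,5}$/', '38');
$tooLong  = preg_match('/^[0-9]{2,5}$/', '123456');

// {0,4}: zero repetitions are allowed, so the empty string matches.
$emptyOk = preg_match('/^[A-Za-z]{0,4}$/', '');

var_dump($plainHit, $groupHit, $groupMiss, $tooShort, $inRange, $tooLong, $emptyOk);
```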

 

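/m is unnecessary.  It only changes where ^ and $ match (at every line break instead of just the start and end of the string), and this pattern uses neither.  See Pattern modifiers.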
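A quick way to convince yourself: run the pattern with and without /m on multi-line text and compare. The sample text and variable names below are made up for the sketch, and the "&amp;" replacement is assumed.

```php
<?php
// Hypothetical three-line input.
$text = "salt & pepper\nfish &amp; chips\nR&D";

// The pattern body from the first post, without delimiters.
$body = '&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,5};)';

$withM    = preg_replace('/' . $body . '/m', '&amp;', $text);
$withoutM = preg_replace('/' . $body . '/',  '&amp;', $text);

// The pattern has no ^ or $, so the modifier changes nothing.
var_dump($withM === $withoutM);

// For contrast, /m does matter once an anchor is involved:
$anchoredM     = preg_match('/^fish/m', $text); // "fish" starts line 2
$anchoredPlain = preg_match('/^fish/', $text);  // not at start of string
var_dump($anchoredM, $anchoredPlain);
```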

