
Remove & and other entities for ATOM feed


BizLab


I can't figure this one out. I have this so far, which I figured would work, but it doesn't.

I need to remove all html entities from the "content" for the ATOM description.

 

$content = preg_replace('/\&(.*);/','',$content);

 

This IS identifying the entities, but it is erasing all the content after the first instance. If I use the "$" metacharacter to signify the end of the string, the regex does nothing.


The problem with the regex is that the * quantifier is being "greedy" and gobbling up much more than you want. You have a few options: make it ungreedy (.*?) or change the item being quantified ([^;]*).

 

Also, why do you want to convert the entities? Can you give an example of the ATOM feed that you're working with? I have a feeling this really should be a non-issue.
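To make the greedy versus ungreedy difference described above concrete, here is a minimal sketch comparing the original pattern with the two suggested fixes. The sample string is made up for illustration and is not from the feed in question. Note that the backslash before & in the original pattern is harmless but unnecessary.

```php
<?php
// Sample text is invented for illustration; it is not from the thread.
$content = 'Fish &amp; chips&trade; are tasty';

// Greedy: .* runs to the LAST semicolon in the string, so everything
// between the first ampersand and that semicolon is removed.
$greedy = preg_replace('/&(.*);/', '', $content);

// Ungreedy: .*? stops at the first semicolon after each ampersand.
$lazy = preg_replace('/&(.*?);/', '', $content);

// Negated class: [^;]* cannot cross a semicolon at all.
$negated = preg_replace('/&[^;]*;/', '', $content);

echo $greedy, "\n";   // Fish  are tasty
echo $lazy, "\n";     // Fish  chips are tasty
echo $negated, "\n";  // Fish  chips are tasty
```

The last two give the same result here. The negated class is usually the safer habit, since it cannot backtrack across a semicolon.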


The only problem here is that I would actually like to take the &amp; and other entities out of the content, not replace them with the actual & symbol.

 


 

When validating the feed, the only issue I had was that the W3C validator didn't recognize the &trade; entity. Then I thought: there really is no reason for me to leave any of these entities in the content summary at all, so I will remove them. If people want to read the entire feed, they can come over to the website. I limit the summary to 500 chars to promote this.

 

A sample of the result from the /\&(.*?);/ regex is "The QuickFittrade; buckle ensures a prefect fit..." Notice the "trade;": the & has been removed, but the word remains. The TM entity is backed up against the word it applies to, like so: MyCoolness&trade;

 

It would probably work to just remove the &trade; entity. I used /\&(.*?)(trade);/, which works to remove the TM symbols, and I can use /\&(.*?)(amp);/ to replace those entities with 'and', so I guess I'm all good.
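For reference, the approach described above (spell out ampersands as 'and', drop the other entities, cap the summary at 500 characters) could be sketched roughly as follows. The function name `feed_summary` and the sample string are hypothetical, and the entity pattern is an assumption about what the content contains, not something confirmed in this thread.

```php
<?php
// Hypothetical helper, not from the thread: builds a plain-text feed summary.
function feed_summary(string $content, int $limit = 500): string
{
    // Spell out ampersand entities as the word "and".
    $text = str_replace('&amp;', 'and', $content);

    // Drop any remaining named or numeric entity, e.g. &trade; or &#8482;.
    $text = preg_replace('/&#?[a-z0-9]+;/i', '', $text);

    // Truncate to the summary limit. substr() counts bytes, so multibyte
    // text would need mb_substr() instead.
    return substr($text, 0, $limit);
}

echo feed_summary('Salt &amp; pepper by MyCoolness&trade;'), "\n";
// Salt and pepper by MyCoolness
```

Matching only the characters that can appear in an entity name keeps the pattern from swallowing ordinary text between a stray & and a later semicolon.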

 

Thanks for the help


Your description of the &trade; not being replaced properly makes no sense given the regex that you're using: it will never replace just an ampersand.
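A quick check of that claim, as a sketch: on a normally encoded entity the pattern removes the name and semicolon along with the ampersand. One input that would reproduce the reported "trade;" leftover is double-encoded content (`&amp;trade;`). That is only an assumption for illustration; nothing in this thread confirms how the content is actually stored.

```php
<?php
$pattern = '/&(.*?);/';

// A single-encoded entity is removed whole, name and semicolon included.
$single = preg_replace($pattern, '', 'The QuickFit&trade; buckle');
echo $single, "\n";  // The QuickFit buckle

// ASSUMPTION: the stored content is double-encoded (&amp;trade;).
// The match stops at the first semicolon, so only "&amp;" is removed.
$double = preg_replace($pattern, '', 'The QuickFit&amp;trade; buckle');
echo $double, "\n";  // The QuickFittrade; buckle
```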

 

As for having entities in the XML document, they would be fine if you used a CDATA block like

<blah><![CDATA[This is my &trade; text isn't it &copy;?!]]></blah>
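If the feed is built in PHP, wrapping the content in a CDATA section could look something like the sketch below. The helper name `cdata` is hypothetical, and `<blah>` is just the placeholder element from the example above. The one thing to watch for is a literal `]]>` inside the content, which would otherwise close the section early.

```php
<?php
// Hypothetical helper, not from the thread: wraps text in a CDATA section.
function cdata(string $text): string
{
    // "]]>" would end the section early, so split it across two sections.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

echo '<blah>', cdata('This is my &trade; text'), '</blah>', "\n";
// <blah><![CDATA[This is my &trade; text]]></blah>
```

Inside CDATA the XML parser leaves `&trade;` as literal text, so whether a reader displays it as a TM sign depends on whether it treats the element's content as HTML.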

