Remove &amp; and other entities for ATOM feed

BizLab · July 15, 2010

I can't figure this one out. I have this so far, which I figured would work, but it doesn't.

I need to remove all HTML entities from the "content" for the ATOM description.

$content = preg_replace('/\&(.*);/','',$content);

This IS identifying the entities, but it is erasing all the content after the first match. If I use the "$" metacharacter to anchor the end of the string, the regex does nothing.

trq · July 15, 2010

html_entity_decode

JAY6390 · July 15, 2010

My thoughts exactly lol

salathe · July 15, 2010

The problem with the regex is that the * quantifier is being "greedy" and gobbling up much more than you want. You have a few options: make it ungreedy (.*?) or change the item being quantified ([^;]*).
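To make the difference concrete, here is a minimal sketch comparing the original greedy pattern with the two alternatives. The sample string is made up for illustration and is not from the actual feed.

```php
<?php
// Made-up sample text; note the plain semicolon late in the sentence.
$content = 'Fish &amp; chips &copy; 2010, all rights reserved; more text.';

// Greedy: .* runs as far as it can, so the match ends at the LAST semicolon.
$greedy = preg_replace('/&(.*);/', '', $content);

// Ungreedy: .*? stops at the first semicolon after each ampersand.
$lazy = preg_replace('/&(.*?);/', '', $content);

// Negated class: [^;]* can never step over a semicolon.
$negated = preg_replace('/&[^;]*;/', '', $content);

echo $greedy, "\n";   // Fish  more text.
echo $lazy, "\n";     // Fish  chips  2010, all rights reserved; more text.
echo $negated, "\n";  // same as $lazy
```

Both alternatives still match anything between an ampersand and the next semicolon, so a bare "&" in ordinary text followed later by a ";" would also be removed.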

Also, why do you want to convert the entities? Can you give an example of the ATOM feed that you're working with? I have a feeling this really should be a non-issue.

BizLab · July 15, 2010

trq wrote:
> html_entity_decode

The only problem here is that I would actually like to take the &amp; and other entities out of the content, not replace them with the actual & symbol.
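For comparison, a rough sketch of the two behaviours side by side, using a made-up sample string. The stricter entity pattern and the re-escaping step with htmlspecialchars() are my own additions here, not something suggested earlier in the thread; ENT_XML1 assumes PHP 5.4 or later.

```php
<?php
// Made-up sample text, not taken from the real feed.
$content = 'Fish &amp; chips &copy; 2010';

// Decoding turns each entity into the character it stands for.
$decoded = html_entity_decode($content, ENT_QUOTES, 'UTF-8');

// Stripping removes named and numeric entities entirely.
$stripped = preg_replace('/&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i', '', $content);

// Assumption: if decoded text goes back into XML, it needs re-escaping,
// which leaves only the entities that XML itself predefines.
$xmlSafe = htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";   // Fish & chips © 2010
echo $stripped, "\n";  // Fish  chips  2010
echo $xmlSafe, "\n";   // Fish &amp; chips © 2010
```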

salathe wrote:
> The problem with the regex is that the * quantifier is being "greedy" and gobbling up much more than you want. You have a few options: make it ungreedy (.*?) or change the item being quantified ([^;]*).
>
> Also, why do you want to convert the entities? Can you give an example of the ATOM feed that you're working with? I have a feeling this really should be a non-issue.

When validating the feed, the only issue I had was that the W3C validator didn't recognize the &trade; entity. Then I thought: there really is no reason for me to leave any of these entities in the content summary at all, so I will remove them. If people want to read the entire feed, they can come over to the website. I limit the summary to 500 chars to promote this.

A sample of the result from the /\&(.*?);/ regex is "The QuickFittrade; buckle ensures a prefect fit..." Notice the "trade;": the & has been removed, but the word remains. The TM sign is backed up to the word it applies to like so: MyCoolness&trade;

It would probably work to just remove the &trade; entity. I used /\&(.*?)(trade);/ which works to remove the TM symbols, and I can use /\&(.*?)(amp);/ to replace those entities with 'and', so I guess I'm all good.
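One caveat with those two patterns, sketched below on a made-up string: the ungreedy (.*?) still starts at the first ampersand it finds and keeps going until "trade;" appears, so an earlier entity and the text after it can get swallowed. A plain str_replace() on the literal entities avoids that. The sample text is illustrative only.

```php
<?php
// Made-up sample containing both entities.
$content = 'Nuts &amp; bolts for the QuickFit&trade; buckle';

// Pattern from the post: the match begins at "&amp;" and runs through "&trade;".
$overmatched = preg_replace('/\&(.*?)(trade);/', '', $content);

// Literal replacements touch only the named entities.
$clean = str_replace(array('&trade;', '&amp;'), array('', 'and'), $content);

echo $overmatched, "\n";  // Nuts  buckle
echo $clean, "\n";        // Nuts and bolts for the QuickFit buckle
```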

Thanks for the help

salathe · July 15, 2010

Your description of the &trade; not being replaced properly makes no sense, given the regex that you're using... it will never replace just an ampersand.

As for having entities in the XML document, they would be fine if you used a CDATA block like

<blah><![CDATA[This is my &trade; text isn't it &copy;?!]]></blah>
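In PHP, building such a block might look roughly like this. The cdata_wrap() helper name, the sample text, and the <summary type="html"> element are assumptions for illustration; use whatever element the feed actually has.

```php
<?php
// Hypothetical helper, not part of PHP or of the original feed code.
function cdata_wrap($text)
{
    // A literal "]]>" would end the section early, so split it across two sections.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Made-up summary text with HTML entities left in place.
$summary = 'The QuickFit&trade; buckle &copy; 2010';

// Assumed element name and type attribute.
$xml = '<summary type="html">' . cdata_wrap($summary) . '</summary>';

echo $xml, "\n";
```

Inside the CDATA section the XML parser treats &trade; as plain characters, so it no longer needs to know the entity; a consumer that renders the summary as HTML can still interpret it.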
