BizLab, posted July 15, 2010:
I can't figure this one out. I have this so far, which I figured would work, but it doesn't. I need to remove all HTML entities from the "content" for the Atom description.
$content = preg_replace('/\&(.*);/','',$content);
This IS identifying the entities, but it is erasing all the content after the first entity. If I use the "$" metacharacter to signify the end of the string, the regex does nothing.
trq, posted July 15, 2010:
html_entity_decode
JAY6390, posted July 15, 2010:
My thoughts exactly lol
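For reference, a minimal sketch of what the html_entity_decode suggestion does. The sample string and variable names are made up for illustration, and the re-escaping step with htmlspecialchars is an assumption about how the decoded text would then be made safe for an XML feed, not something stated in the thread:

```php
<?php
// Hypothetical sample text containing two HTML entities.
$content = 'Fish &amp; chips&trade;';

// Decode entities into the characters they stand for (UTF-8 output).
$decoded = html_entity_decode($content, ENT_QUOTES, 'UTF-8');

// Assumption: before the text goes into XML, escape only the characters
// that are special in XML (&, <, >, quotes). The TM sign stays a plain character.
$xmlSafe = htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $decoded, "\n";
echo $xmlSafe, "\n";
```

Note that this converts entities rather than removing them, so it only fits if the literal characters are acceptable in the feed.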
salathe, posted July 15, 2010:
The problem with the regex is that the * quantifier is "greedy": it matches as much as it can, so it runs from the first & all the way to the last ; in the string. You have a few options: make it ungreedy (.*?) or change the item being quantified ([^;]*).
Also, why do you want to convert the entities? Can you give an example of the Atom feed that you're working with? I have a feeling this really should be a non-issue.
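A small sketch of the three variants salathe mentions, run against a made-up sample string (the string and variable names are assumptions, not taken from BizLab's feed):

```php
<?php
// Hypothetical sample content containing two entities.
$content = 'A&amp;B and C&trade; done';

// Greedy: .* runs from the first & to the LAST ; in the string.
$greedy = preg_replace('/\&(.*);/', '', $content);

// Ungreedy: .*? stops at the first ; after each &.
$lazy = preg_replace('/\&(.*?);/', '', $content);

// Negated class: [^;]* cannot cross a ; at all.
$negated = preg_replace('/\&[^;]*;/', '', $content);

echo $greedy, "\n";   // A done
echo $lazy, "\n";     // AB and C done
echo $negated, "\n";  // AB and C done
```

The ungreedy and negated-class versions give the same result here; the negated class simply rules out crossing a semicolon instead of relying on the quantifier mode.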
BizLab (author), posted July 15, 2010:
Quoting trq: "html_entity_decode"
The only problem here is that I would actually like to take the &amp; and other entities out of the content, not replace them with the actual & symbol.
Quoting salathe: "The problem with the regex is that the * quantifier is being "greedy" and gobbling up much more than you want. You have a few options: make it ungreedy (.*?) or change the item being quantified ([^;]*). Also, why do you want to convert the entities? Can you give an example of the ATOM feed that you're working with, I have a feeling this really should be a non-issue."
When validating the feed, the only issue I had was that the W3C validator didn't recognize the &trade; entity. Then I thought: there really is no reason for me to leave any of these entities in the content summary at all, so I will remove them. If people want to read the entire feed, they can come over to the website. I limit the summary to 500 chars to promote this.
A sample of the result from the /\&(.*?);/ regex is "The QuickFittrade; buckle ensures a prefect fit..." Notice the "trade;": the & has been removed, but the word remains. The TM sign is backed up to the word it applies to, like so: MyCoolness&trade;
It would probably work to just remove the &trade; entity. I used /\&(.*?)(trade);/ which works to remove the TM symbols, and I can use /\&(.*?)(amp);/ to replace those entities with 'and', so I guess I'm all good.
Thanks for the help.
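One possible way to get the result BizLab describes without writing a separate pattern per entity name. This is only a sketch: strip_entities is a hypothetical helper name, the sample strings are invented, and the pattern is an assumption about what should count as an entity (named, decimal, or hex references that end in a semicolon):

```php
<?php
// Hypothetical helper; the name and the sample text are illustrative only.
function strip_entities(string $content): string
{
    // Turn &amp; into the word "and" first, as described in the post.
    $content = str_replace('&amp;', 'and', $content);

    // Remove any remaining named (&trade;), decimal (&#8482;) or
    // hex (&#x2122;) entity. A bare "&" with no entity after it is left alone.
    return preg_replace('/&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i', '', $content);
}

echo strip_entities('The QuickFit&trade; buckle &amp; strap'), "\n";
// The QuickFit buckle and strap
```

Because the pattern only allows entity-name characters between & and ;, it cannot swallow ordinary text the way /\&(.*?)(trade);/ can when an earlier & appears in the content.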
salathe, posted July 15, 2010:
Your description of the &trade; not being replaced properly makes no sense given the regex that you're using: it will never replace just an ampersand.
As for having entities in the XML document, they would be fine if you used a CDATA block like <blah><![CDATA[This is my &trade; text isn't it &copy;?!]]></blah>
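A rough sketch of the CDATA approach in PHP. The cdata helper is a hypothetical name, and using an Atom <summary type="html"> element is an assumption about BizLab's feed; the sample text is adapted from salathe's example:

```php
<?php
// Hypothetical helper: wrap text in a CDATA section.
function cdata(string $text): string
{
    // A literal "]]>" would end the section early, so split it
    // across two adjacent CDATA sections.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$summary = 'This is my &trade; text isn\'t it &copy;?!';

// Assumption: the summary is emitted as HTML content in an Atom entry.
echo '<summary type="html">' . cdata($summary) . '</summary>', "\n";
```

Inside a CDATA section the XML parser does not try to resolve &trade; as an XML entity, so the undefined-entity complaint goes away while the text is passed through unchanged.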