NotionCommotion Posted December 9, 2015

Starting with some user-provided HTML, I wish to purify it as best as I reasonably can, then use DOMDocument to replace some tags, and finally email it. To do so, I created the following script:

<?php
//The following message is generated when the user cuts-and-pastes something from an Outlook email into a TinyMCE editor.
$message = <<<EOT
<div>Start</div>
<div>&nbsp;</div>
<div>foo bar</div>
<div>&nbsp;</div>
<p>&nbsp;</p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>
EOT;

echo("Raw message: $message\n\n");

//$message= str_replace('&nbsp;', ' ', $message); //Hack to prevent line spaces <p>&nbsp;</p> from being converted to <p> </p> and then to <p>Â </p>

require('../../../application/classes_3rd/htmlpurifier/library/HTMLPurifier.auto.php');
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$message=$purifier->purify($message);
echo("Purified message: $message\n\n");

//While not shown, I use DOMDocument to replace some tags.
$doc = new DOMDocument();
$doc->loadHTML($message);
$body = $doc->getElementsByTagName('body')->item(0);
$message=$doc->saveHTML($body);
echo("Modified message: $message\n\n");

//email the message (not shown)

The output is as follows. Notice the Â and â symbols. When emailed, they cause even more havoc.
Raw message: <div>Start</div>
<div>&nbsp;</div>
<div>foo bar</div>
<div>&nbsp;</div>
<p>&nbsp;</p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Purified message: <div>Start</div>
<div> </div>
<div>foo bar</div>
<div> </div>
<p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Modified message: <body>
<div>Start</div>
<div>Â </div>
<div>foo bar</div>
<div>Â </div>
<p>Â </p>
<p>bla bla bla: âsomething in quotesâ bla bla bla</p>
<div>End</div>
</body>

I could make some progress by un-commenting the str_replace() line so that &nbsp; is replaced with a blank space. I now get the following output, which no longer has the Â symbols but still has the â symbols. What is the best way to deal with this? Thanks

Raw message: <div>Start</div>
<div>&nbsp;</div>
<div>foo bar</div>
<div>&nbsp;</div>
<p>&nbsp;</p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Purified message: <div>Start</div>
<div> </div>
<div>foo bar</div>
<div> </div>
<p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Modified message: <body>
<div>Start</div>
<div> </div>
<div>foo bar</div>
<div> </div>
<p> </p>
<p>bla bla bla: âsomething in quotesâ bla bla bla</p>
<div>End</div>
</body>

Link to comment: https://forums.phpfreaks.com/topic/299686-character-issues-when-using-domdocument/
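For readers wondering where the stray symbols come from: a likely explanation, assuming the purified string is UTF-8 and DOMDocument::loadHTML() falls back to ISO-8859-1 because the fragment declares no charset, is that each byte of a multi-byte character is read as its own Latin-1 character. The sketch below only imitates that misreading with mb_convert_encoding(); the helper name misreadAsLatin1 is made up for illustration and is not part of the script above.

```php
<?php
// Hypothetical helper (not from the original post): pretend the UTF-8 bytes
// are ISO-8859-1 text and re-encode them as UTF-8, which is roughly what
// happens when a parser guesses the wrong input charset.
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$nbsp      = "\u{00A0}"; // non-breaking space, bytes C2 A0 in UTF-8
$leftQuote = "\u{201C}"; // left curly quote, bytes E2 80 9C in UTF-8

$badNbsp  = misreadAsLatin1($nbsp);      // two characters instead of one
$badQuote = misreadAsLatin1($leftQuote); // three characters instead of one

// The first character of each misread string is the symbol seen in the output.
echo mb_substr($badNbsp, 0, 1, 'UTF-8'), "\n";
echo mb_substr($badQuote, 0, 1, 'UTF-8'), "\n";
```

Under that assumption, replacing &nbsp; with a plain space removes one source of multi-byte characters, which would explain why only the curly quotes were still mangled afterwards.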
NotionCommotion Posted December 10, 2015 (Author)

Any reason this shows zero views? Any help would be appreciated. Thank you
Solution

NotionCommotion Posted December 10, 2015 (Author)

Solved by using

$message = mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8');

Note that

$message= str_replace('&nbsp;', ' ', $message);

is also no longer needed.
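A minimal sketch of where such a conversion would sit, assuming it is applied to the purified string just before loadHTML(). Because the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2, this sketch swaps in mb_encode_numericentity(), which is a different call from the one in the post but has the same aim: hand the parser pure ASCII so it cannot misread multi-byte characters. The helper name loadUtf8Fragment and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): load a UTF-8 HTML
// fragment into DOMDocument without relying on charset detection.
function loadUtf8Fragment(string $html): DOMDocument
{
    // Encode every code point from U+0080 upward as a numeric entity,
    // leaving plain ASCII (including the markup itself) untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii   = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

// Sample input assumed to resemble the purified message: a real
// non-breaking space plus curly quotes, all UTF-8.
$message = "<p>\u{00A0}</p><p>bla bla bla: \u{201C}something in quotes\u{201D} bla bla bla</p>";

$doc  = loadUtf8Fragment($message);
$body = $doc->getElementsByTagName('body')->item(0);

// textContent is returned as UTF-8, so the quotes should survive intact.
$text = $body->textContent;
echo $text, "\n";
```

On PHP versions before 8.2, the one-liner from the post can be dropped in at the same point instead.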