Character issues when using DOMDocument


NotionCommotion

Go to solution Solved by NotionCommotion,


Starting with some user-provided HTML, I wish to purify it as well as I reasonably can, then use DOMDocument to replace some tags, and finally email it. To do so, I created the following script:

<?php

//The following message is generated when the user copies and pastes something from an Outlook email into a TinyMCE editor.
$message = <<<EOT
<div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>
EOT;

echo("Raw message: $message\n\n");
//$message = str_replace('&nbsp;', ' ', $message);  //Hack to prevent blank lines such as <p>&nbsp;</p> from being converted to <p> </p> (a literal non-breaking space) and then to <p>Â </p>
require('../../../application/classes_3rd/htmlpurifier/library/HTMLPurifier.auto.php');
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$message=$purifier->purify($message);
echo("Purified message: $message\n\n");
//While not shown, I use DOMDocument to replace some tags.
$doc = new DOMDocument();
$doc->loadHTML($message);
$body = $doc->getElementsByTagName('body')->item(0);
$message=$doc->saveHTML($body);
echo("Modified message: $message\n\n");
//email the message (not shown)

The output is as follows. Notice the Â and â symbols in the modified message. When emailed, they cause even more havoc.

Raw message: <div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Purified message: <div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Modified message: <body>
<div>Start</div>
  <div>Â </div>
<div>foo bar</div>
<div>Â </div>
   <p>Â </p>
<p>bla bla bla: âsomething in quotesâ bla bla bla</p>
<div>End</div>
</body>
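
A likely explanation for these symbols, offered as a sketch rather than a confirmed diagnosis: the purified HTML is UTF-8, but the fragment declares no charset, so DOMDocument::loadHTML() may read each byte as a separate ISO-8859-1 character. The snippet below imitates that misread with mb_convert_encoding() (requires the mbstring extension). The helper name misreadAsLatin1 is hypothetical and is not part of the original script.

```php
<?php
// Assumption: loadHTML() treated the UTF-8 input as ISO-8859-1.
// We imitate that by re-reading the UTF-8 bytes as Latin-1 characters.

/** Reinterpret a UTF-8 string as if every byte were one ISO-8859-1 character. */
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$nbsp  = "\u{00A0}"; // non-breaking space, UTF-8 bytes C2 A0
$quote = "\u{201C}"; // left curly quote, UTF-8 bytes E2 80 9C

// Byte C2 becomes the character Â (U+00C2); byte A0 stays a non-breaking space.
echo bin2hex($nbsp), ' -> ', bin2hex(misreadAsLatin1($nbsp)), "\n";   // c2a0 -> c382c2a0

// Byte E2 becomes â (U+00E2); bytes 80 and 9C become invisible control characters.
echo bin2hex($quote), ' -> ', bin2hex(misreadAsLatin1($quote)), "\n"; // e2809c -> c3a2c280c29c
```

If this is what happens, it would account for both symptoms: a stray Â in front of each non-breaking space, and a lone â where each curly quote used to be.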

I could make some progress by un-commenting line 15 (the str_replace() hack) and replacing &nbsp; with a blank space. I now get the following output, which no longer has the Â symbols but still has the â symbols.

What is the best way to deal with this?  Thanks

Raw message: <div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Purified message: <div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: “something in quotes” bla bla bla</p>
<div>End</div>

Modified message: <body>
<div>Start</div>
  <div> </div>
<div>foo bar</div>
<div> </div>
   <p> </p>
<p>bla bla bla: âsomething in quotesâ bla bla bla</p>
<div>End</div>
</body>
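
One commonly suggested workaround, sketched here under assumptions rather than taken from the thread: convert every non-ASCII character to a numeric entity before calling loadHTML(), so the markup is pure ASCII and the parser's charset guess no longer matters. The helper name loadUtf8Fragment and the sample $purified string are hypothetical. mb_encode_numericentity() requires the mbstring extension.

```php
<?php
/**
 * Load a UTF-8 HTML fragment into DOMDocument without mangling characters.
 * Hypothetical helper, not part of the original script.
 */
function loadUtf8Fragment(string $html): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric entity (e.g. &#8220;).
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

// Stand-in for the HTML Purifier output: a literal non-breaking space and curly quotes.
$purified = "<p>\u{00A0}</p>\n"
          . "<p>bla bla bla: \u{201C}something in quotes\u{201D} bla bla bla</p>";

$doc  = loadUtf8Fragment($purified);
$body = $doc->getElementsByTagName('body')->item(0);

// ... tag replacement would go here ...

$message = $doc->saveHTML($body);
echo "Modified message: $message\n";
```

Depending on the PHP and libxml2 versions, saveHTML($body) may write these characters as raw UTF-8 or as entities. Either form should survive emailing, provided the message declares charset=UTF-8 in its Content-Type header.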