XML and Character Encoding Hell


Mike521

I am in character encoding hell, and I hope someone can get me out!

I have a web form encoded in ISO-8859-1. It posts to another ISO-8859-1 page.

That page takes the POST data and sends it to a script that runs in the background.

The script's job is to convert the POST data into XML and then post it to yet another script that will process it.

The problem I run into is when there are Spanish characters in the input. No matter how I try to encode them, the final receiving script always either ignores all the incoming data or ignores the fields with Spanish characters.

It seems to me that the problem is happening in the last post. For example, here is what my XML data might look like right before I send it:

<?xml version="1.0" encoding="utf-8"?>
<data>
<spanishStuff>here+are+some+span+chars+%26Ntilde%3B+%26ntilde%3B%26euml%3B%26oacute%3B</spanishStuff>
</data>

The very first thing I do in the final script is email the POST data to myself. Here's what it looks like:

<?xml version="1.0" encoding="utf-8"?>
<data>
<spanishStuff>here are some span chars &Ntilde; &ntilde;&euml;&oacute;</spanishStuff>
</data>

See how the %26's have been replaced with &? When I then call simplexml_load_string, it gives me warnings such as "parser error : Entity 'Ntilde' not defined". After that, either all the input is ignored or the fields with Spanish characters are ignored, depending on which variation of encoding I've tried this time around.
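
Some background on that warning: XML itself predefines only five named entities (&amp; &lt; &gt; &quot; &apos;). Names such as &Ntilde; come from HTML, so an XML parser rejects them unless a DTD declares them. A minimal sketch that reproduces the error, assuming PHP with the SimpleXML extension loaded (the sample strings are invented for illustration):

```php
<?php
// Collect libxml errors in a list instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// HTML named entity: not defined in XML, so the parse fails.
$bad = '<data><spanishStuff>&Ntilde;</spanishStuff></data>';
$badResult = simplexml_load_string($bad);
$errors = libxml_get_errors();
libxml_clear_errors();

// Numeric character reference: valid in any XML document.
$good = '<data><spanishStuff>&#209;</spanishStuff></data>';
$goodResult = simplexml_load_string($good);

var_dump($badResult === false);                 // did the first parse fail?
echo trim($errors[0]->message), "\n";           // libxml's error text
echo (string) $goodResult->spanishStuff, "\n";  // the decoded character
```

The numeric form (&#209;) and the literal UTF-8 character both parse, which suggests the named entity is the problem rather than the character itself.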

 

I don't know what to do at this point. I've spent a lot of time trying tons of ways to encode the data, either before I send it or after I receive it, and nothing seems to help.

For what it's worth, one of the first things I do is utf8_encode the incoming POST data, since the web form is in ISO-8859-1.

Here is the process step by step, if you want further clarification:

1. User enters data on an ISO-8859-1 page.

2. Data is posted to a receiving ISO-8859-1 page.

3. Receiving page spawns a background process (using http_build_query on the POST data, and fsockopen / fwrite to send it).

    -- background process ignores user disconnect

4. Background process takes the POST data and forms it into XML.

    -- as it does so, it converts the data to UTF-8, then applies htmlentities and urlencode

5. Background process uses cURL to post the XML string to the final receiving script.

6. Receiving script grabs the data and does whatever it needs to do.
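
Step 4 is where values like %26Ntilde%3B come from. A minimal sketch, assuming the three calls run in the order listed and that the XML arrives at the final script as a URL-encoded form field. The input string is invented, and iconv is swapped in for utf8_encode, which is deprecated as of PHP 8.2:

```php
<?php
// Simulated ISO-8859-1 form input: byte 0xD1 is Ñ in that charset.
$latin1 = "span chars \xD1";

// Convert to UTF-8 (stand-in for utf8_encode).
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// htmlentities turns Ñ into the HTML-only named entity &Ntilde;.
$entities = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

// urlencode then hides the & and ; as %26 and %3B.
$encoded = urlencode($entities);

// PHP URL-decodes form fields when it fills $_POST; urldecode mimics that.
$received = urldecode($encoded);

echo $encoded, "\n";
echo $received, "\n";
```

So the URL encoding round-trips cleanly, and what reaches the XML parser is the named entity that htmlentities produced.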

 

The background process could technically be skipped, but we don't want the user waiting around while all this other stuff happens, so I simply tell them thank you and let the system do the rest.

Hope someone can help, thanks.

It seems like the problem is the entity references, though.

%26Ntilde%3B becomes &Ntilde; when it's received. Then simplexml gives the error "Entity: line 23: parser error : Entity 'Ntilde' not defined"

Is there a way to tell simplexml to expect those types of entities, perhaps?
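
One possible workaround, sketched under assumptions rather than offered as the definitive answer: translate the HTML named entities back into UTF-8 characters before the string reaches SimpleXML, while leaving XML's own escapes alone. The helper name html_entities_to_utf8 is hypothetical; get_html_translation_table, array_flip and strtr are standard PHP functions.

```php
<?php
// Hypothetical helper: replace HTML named entities with UTF-8 characters,
// but keep the escapes XML itself needs (&amp; &lt; &gt; &quot; &#039;).
function html_entities_to_utf8(string $xml): string
{
    // Map of character => entity for the HTML 4.01 entity set.
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Flip it to entity => character.
    $map = array_flip($table);
    // Decoding these would corrupt the markup, so leave them untouched.
    foreach (['&amp;', '&lt;', '&gt;', '&quot;', '&#039;'] as $keep) {
        unset($map[$keep]);
    }
    return strtr($xml, $map);
}

$received = '<data><spanishStuff>&Ntilde; &ntilde; &amp; more</spanishStuff></data>';
$xml = simplexml_load_string(html_entities_to_utf8($received));
echo (string) $xml->spanishStuff, "\n";
```

This assumes the document is declared as, and really is, UTF-8 once the entities are replaced.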

I think I finally figured it out.

Instead of converting miscellaneous characters to entities, I figured, what the hell: if I can get them into UTF-8, why should I entity-encode them at all?

So I just convert the incoming data to UTF-8, replace only the characters XML treats as markup ( & < > ) with their entities, urlencode, and send.

Seems to work fine so far.
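
A minimal sketch of that approach, not the poster's actual script: the function name build_xml_payload and the element name are placeholders, iconv stands in for utf8_encode, and the receiving side is simulated with urldecode.

```php
<?php
// Hypothetical helper: build a UTF-8 XML payload from one ISO-8859-1 value.
function build_xml_payload(string $latin1Value): string
{
    // 1. Convert the form input from ISO-8859-1 to UTF-8.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1Value);
    // 2. Escape only the markup characters & < > (ENT_NOQUOTES skips quotes).
    $escaped = htmlspecialchars($utf8, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
    // 3. Wrap in XML; accented characters stay as literal UTF-8.
    return '<?xml version="1.0" encoding="utf-8"?>'
        . '<data><spanishStuff>' . $escaped . '</spanishStuff></data>';
}

$xml  = build_xml_payload("\xD1 \xF1 & <b>");  // Ñ ñ & <b> in ISO-8859-1
$body = urlencode($xml);                        // what would be POSTed

// Receiving side: URL-decode (PHP does this for $_POST fields) and parse.
$parsed = simplexml_load_string(urldecode($body));
echo (string) $parsed->spanishStuff, "\n";
```

Because the accented characters travel as plain UTF-8 bytes, the parser never sees an entity it does not know.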
