Where does ÿþ come from?

Kane250 · June 30, 2009

Hi,

I'm getting contents of a text file and displaying as plain text, but am starting to get some of these bad boys: ÿþ

I am using utf8_encode after getting the contents, which cleaned it up a bit compared to before, but I'm still getting these. Can someone tell me what I should be using?

Thanks!

pkedpker · June 30, 2009

I used to get the ÿþ when I used a crappy lightweight webserver (ZazouMiniWebServer), but idk, it might not be related. Could be something with PHP.

Kane250 · June 30, 2009

I'm assuming it's a PHP issue... some weird invisible HTML characters maybe? Never seen it before though.

PFMaBiSmAd · June 30, 2009

Define "contents of a text file". Where is this file coming from, and what produced it? If it is already a text file, you should not need to do anything to it to display it as plain text.

The two characters are FF (ÿ) and FE (þ) in hex - http://en.wikipedia.org/wiki/ISO_8859-1

And using utf8_encode won't fix anything; it just changes the bytes from one form into a different one.
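
One thing worth checking here: the byte pair FF FE at the very start of a file is also the UTF-16 little-endian byte-order mark (BOM), which some Windows editors write when saving as "Unicode". Whether that is what happened to these files is only a guess. A minimal sketch for looking at the leading bytes, where `detect_bom()` is a hypothetical helper and not a built-in PHP function:

```php
<?php
// Hypothetical helper: report which byte-order mark (if any) a string starts with.
// Returns an encoding name, or null when no known BOM is found.
function detect_bom(string $data): ?string
{
    $signatures = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];
    foreach ($signatures as $encoding => $bom) {
        // strncmp() is binary-safe, so raw bytes can be compared directly.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

For a quick manual check, `echo bin2hex(substr(file_get_contents($titlepath), 0, 4));` dumps the first four bytes as hex; output starting with `fffe` would point to a UTF-16LE BOM.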

Kane250 · July 1, 2009

Quote from PFMaBiSmAd: Define: "contents of a text file" Where is this file coming from, what produced it, and if it is already a text file, you should not need to do anything to it to display it as plain text.

Hi, thanks. I created the text files manually, and they each contain a short phrase in them. If I pull them without any additional work, I get many more symbols such as: &, #, þ, ; and some others. I should also note that I am pulling these into a script that is generating XML.

This is what I'm doing

$title = file_get_contents($titlepath);
$nameValue = $dom->createTextNode($title);
$name->appendChild($nameValue);

Does this change anything? Should I be passing it a different way?
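
For context, PHP's DOM extension expects the strings passed to it to be UTF-8, so bytes in any other encoding tend to come out as stray symbols or numeric entities. Below is a rough sketch of the snippet above with a validity check added. The wrapper `build_name_xml()` and the element name `name` are assumptions made for illustration, since the original script is not shown in full:

```php
<?php
// Hypothetical wrapper around the snippet above; the element name 'name' is assumed.
function build_name_xml(string $title): string
{
    // Reject input that is not valid UTF-8 before handing it to the DOM.
    if (!mb_check_encoding($title, 'UTF-8')) {
        throw new InvalidArgumentException('Title is not valid UTF-8');
    }

    $dom = new DOMDocument('1.0', 'UTF-8');
    $name = $dom->createElement('name');
    $name->appendChild($dom->createTextNode($title));
    $dom->appendChild($name);

    return $dom->saveXML();
}
```

Failing loudly like this at least shows which files need attention, instead of letting damaged text end up in the XML.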

PFMaBiSmAd · July 1, 2009

I'm going to take a wild guess that you are using some Windows or Mac word processor to create the files?

Kane250 · July 1, 2009

ahh, yeah I think some were generated from another program by someone else - and yes, probably a word processor program. Hidden characters?

Kane250 · July 1, 2009

Is there anything I can do in PHP to strip out the invisible characters?

PFMaBiSmAd · July 1, 2009

The characters are not 'hidden characters'. They are formatting and document information that the native application that was used to create them put into the file so that when it opens the file it can restore the document to the form it was created as. The files are also not text files. They are documents that have a specific native data format.

The proper way to publish the content of the documents as plain text is to open the documents in their native application and save them as plain text files. This will remove all the formatting and save just the content of the files in a format that you can then use.

There are a small number of "content stripper" applications that know enough about the native format of some of the popular word processors that you could use them instead of the actual native application.

I can only imagine that someone gave you a bunch of files that they want "published" or indexed/searchable on a web page, and you have been fighting all kinds of strange characters and broken text. Even if you were to strip out all characters that are not human-readable, you would still have some amount of broken text, because some of the information that the native application stores in the file uses data values that are also valid readable characters. You are not going to be 100% successful unless you first save the file as plain text, then attempt to publish it as XML.
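
If some of the files turn out to be plain text that was merely saved as UTF-16 (which the leading ÿþ hints at, though that is an assumption and not something confirmed in this thread), converting them to UTF-8 in PHP may be enough. A minimal sketch, where `to_utf8()` is a hypothetical helper and the mbstring extension is assumed to be available:

```php
<?php
// Hypothetical helper: return UTF-8 text, converting from UTF-16 when a BOM is present.
// Real word processor documents will not be fixed by this; it only handles BOM-marked text.
function to_utf8(string $data): string
{
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian: drop the 2-byte BOM, then convert.
        return mb_convert_encoding(substr($data, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($data, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian: drop the 2-byte BOM, then convert.
        return mb_convert_encoding(substr($data, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        // Already UTF-8: just drop the 3-byte BOM.
        return substr($data, 3);
    }
    // No BOM: the encoding is unknown, so leave the data untouched.
    return $data;
}
```

In the script from earlier in the thread this would be used as `$title = to_utf8(file_get_contents($titlepath));` before calling `createTextNode()`.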
