Where does ÿþ come from?


Kane250

Hi,

 

I'm getting contents of a text file and displaying as plain text, but am starting to get some of these bad boys:  ÿþ

 

I am using utf8_encode after getting the contents, which cleaned it up a bit from what it was doing before, but I'm still getting these.  Can someone tell me what I should be using?

 

Thanks!


Define "contents of a text file". Where is this file coming from, and what produced it? If it is already a text file, you should not need to do anything to it to display it as plain text.

 

The two characters are FF (ÿ) and FE (þ) in hex, in that order - http://en.wikipedia.org/wiki/ISO_8859-1

 

And using utf8_encode won't fix anything; it just changes the data from one form into a different one.
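
To see which bytes are really at the start of the file, you can dump the first few in hex before doing anything else. This is only a minimal sketch: leading_bytes_hex is a made-up helper name, and the sample string stands in for whatever file_get_contents returns for your file.

```php
<?php
// Hypothetical helper (not from the original post): returns the first $n bytes
// of a string as space-separated uppercase hex pairs.
function leading_bytes_hex(string $data, int $n = 4): string
{
    $hex = bin2hex(substr($data, 0, $n));        // e.g. "fffe4800"
    return strtoupper(implode(' ', str_split($hex, 2)));
}

// Stand-in for file_get_contents($titlepath): bytes FF FE followed by "H" and a zero byte.
$sample = "\xFF\xFE" . "H\x00i\x00";
echo leading_bytes_hex($sample), "\n";           // prints "FF FE 48 00"
```

If the output starts with FF FE, those are the two bytes that show up as ÿþ when the data is read as ISO 8859-1.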


Define "contents of a text file". Where is this file coming from, and what produced it? If it is already a text file, you should not need to do anything to it to display it as plain text.

 

Hi, thanks.  I created the text files manually, and they each contain a short phrase in them.  If I pull them without any additional work, I get many more symbols such as: &, #, þ, ; and some others.  I should also note that I am pulling these into a script that is generating XML.

 

This is what I'm doing

$title = file_get_contents($titlepath);
$nameValue = $dom->createTextNode($title);
$name->appendChild($nameValue);

 

Does this change anything?  Should I be passing it a different way?


The characters are not 'hidden characters'. They are formatting and document information that the native application that was used to create them put into the file so that when it opens the file it can restore the document to the form it was created as. The files are also not text files. They are documents that have a specific native data format.

 

The proper way to publish the content of the documents as plain text is to open the documents in their native application and save them as plain text files. This will remove all the formatting and save just the content of the files in a format that you can then use.

 

There are a small number of "content stripper" applications that know enough about the native format of some of the popular word processors that you could use instead of the actual native application.

 

I can only imagine that someone gave you a bunch of files that they want "published" or indexed/searchable on a web page and you have been fighting all kinds of strange characters and broken text. Even if you were to strip out all characters that are not human-readable, you will still have some amount of broken text, because some of the information that the native application stores in the file uses data values that are also valid as readable characters. You are not going to be 100% successful unless you first save the file as plain text, then attempt to publish it as XML.
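
One other possibility worth ruling out: the bytes FF FE are also the UTF-16 little-endian byte order mark, which some editors write when a file is saved as "Unicode". If that is what happened here, the file is plain text but not UTF-8, and DOMDocument expects UTF-8 strings. The sketch below is only an illustration under that assumption: to_utf8 is a hypothetical helper name, it needs the mbstring extension, and it only looks at the BOM rather than guessing the encoding.

```php
<?php
// Hypothetical helper (not from the original post): returns UTF-8 text,
// converting from UTF-16 only when a byte order mark says so.
function to_utf8(string $data): string
{
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian BOM: drop it and convert the rest.
        return mb_convert_encoding(substr($data, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($data, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian BOM.
        return mb_convert_encoding(substr($data, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        // UTF-8 BOM: the text is already UTF-8, so just strip the marker.
        return substr($data, 3);
    }
    // No BOM: assumed (not verified) to be UTF-8 already.
    return $data;
}
```

With something like this in place, the code from the earlier post would become $title = to_utf8(file_get_contents($titlepath)); before the createTextNode call. If the files really are word processor documents, this will not help, and saving them as plain text first is still the way to go.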
