Kane250 Posted June 30, 2009
Hi, I'm getting the contents of a text file and displaying them as plain text, but I'm starting to get some of these bad boys: ÿþ
I am using utf8_encode after getting the contents, which cleaned it up a bit from what it was doing before, but I'm still getting these. Can someone tell me what I should be using? Thanks!
Topic: https://forums.phpfreaks.com/topic/164318-where-does-%C3%BF%C3%BE-come-from/
pkedpker Posted June 30, 2009
I used to get the ÿþ when I used a crappy lightweight web server (ZazouMiniWebServer), but I don't know, it might not be related. It could be something with PHP.
Kane250 (Author) Posted June 30, 2009
I'm assuming it's a PHP issue... some weird invisible HTML characters, maybe? I've never seen it before, though.
PFMaBiSmAd Posted June 30, 2009
Define: "contents of a text file". Where is this file coming from, and what produced it? If it is already a text file, you should not need to do anything to it to display it as plain text.
The two characters are ÿ (00FF) and þ (00FE) in hex - http://en.wikipedia.org/wiki/ISO_8859-1
And using utf8_encode won't fix anything; it will just change the bytes from one form into a different one.
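The point about utf8_encode only changing the form can be checked by dumping the bytes before and after re-encoding. The following is a minimal sketch, not code from the thread: `hexBytes` is a hypothetical helper, the two-byte sample stands in for the start of the real file, and `iconv` with ISO-8859-1 as the source encoding stands in for utf8_encode, which performs the same ISO-8859-1 to UTF-8 conversion.

```php
<?php
// Hypothetical helper (not from the thread): show every byte of a string
// as an uppercase hex pair, separated by spaces.
function hexBytes(string $data): string
{
    return strtoupper(implode(' ', str_split(bin2hex($data), 2)));
}

// Stand-in for the first two bytes of the file: the characters shown in
// the thread, ÿ (FF) and þ (FE), as single ISO-8859-1 bytes.
$raw = "\xFF\xFE";

// Same conversion that utf8_encode performs: ISO-8859-1 to UTF-8.
$reencoded = iconv('ISO-8859-1', 'UTF-8', $raw);

echo hexBytes($raw), PHP_EOL;        // FF FE
echo hexBytes($reencoded), PHP_EOL;  // C3 BF C3 BE
```

The two bytes are still there after the conversion, just written as two-byte UTF-8 sequences, which is why the characters keep showing up on the page. Running `hexBytes(substr(file_get_contents($titlepath), 0, 16))` on a real file is a quick way to see what it actually starts with.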
Kane250 (Author) Posted July 1, 2009
Quote (PFMaBiSmAd): Define: "contents of a text file" Where is this file coming from, what produced it, and if it is already a text file, you should not need to do anything to it to display it as plain text.
Hi, thanks. I created the text files manually, and each one contains a short phrase. If I pull them in without any additional work, I get many more symbols, such as: &, #, þ, ; and some others.
I should also note that I am pulling these into a script that is generating XML. This is what I'm doing:
$title = file_get_contents($titlepath);
$nameValue = $dom->createTextNode($title);
$name->appendChild($nameValue);
Does this change anything? Should I be passing it a different way?
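One possibility the thread does not settle: the byte pair FF FE at the very start of a file is also the byte order mark that UTF-16 little-endian text files begin with. If that is what these files are (an assumption, not something confirmed here), the contents could be converted to UTF-8 before being handed to createTextNode. A minimal sketch, where `utf16LeToUtf8` is a hypothetical helper and the sample string stands in for `file_get_contents($titlepath)`:

```php
<?php
// Hypothetical helper (not from the thread). If the data starts with the
// bytes FF FE, assume it is UTF-16LE text, drop the mark, and convert the
// rest to UTF-8. Anything else is returned unchanged.
function utf16LeToUtf8(string $raw): string
{
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $converted = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
        if ($converted !== false) {
            return $converted;
        }
    }
    return $raw;
}

// Stand-in for file_get_contents($titlepath): FF FE followed by "Hi"
// stored as two bytes per character.
$raw = "\xFF\xFEH\x00i\x00";

$title = utf16LeToUtf8($raw);
echo $title, PHP_EOL; // Hi

// $title would then be passed on as in the post:
// $nameValue = $dom->createTextNode($title);
```

This only helps if the files really are UTF-16 text. It does nothing for files saved in a word processor's native document format, which is the case the later replies discuss.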
PFMaBiSmAd Posted July 1, 2009
I'm going to take a wild guess that you are using some Windows or Mac word processor to create the files?
Kane250 (Author) Posted July 1, 2009
Ahh, yeah, I think some were generated in another program by someone else, and yes, probably a word processor. Hidden characters?
Kane250 (Author) Posted July 1, 2009
Is there anything I can do in PHP to strip out the invisible characters?
PFMaBiSmAd Posted July 1, 2009
The characters are not 'hidden characters'. They are formatting and document information that the native application used to create the files put into them, so that when it opens a file it can restore the document to the form in which it was created. The files are also not text files. They are documents that have a specific native data format.

The proper way to publish the content of the documents as plain text is to open the documents in their native application and save them as plain text files. This will remove all the formatting and save just the content of the files in a format that you can then use. There are a small number of "content stripper" applications that know enough about the native format of some of the popular word processors that you could use instead of the actual native application.

I can only imagine that someone gave you a bunch of files that they want "published" or indexed/searchable on a web page, and you have been fighting all kinds of strange characters and broken text. Even if you were to strip out all characters that are not 'human' readable, you would still have some amount of broken text, because some of the information that the native application stores in the file uses data values that are valid as readable characters. You are not going to be 100% successful unless you first save the file as plain text, then attempt to publish it as XML.
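The caveat about stripping unreadable characters can be illustrated with a short sketch. Everything here is made up for illustration: `stripNonPrintable` is a hypothetical helper, and the sample bytes only imitate a binary document whose stored data happens to include printable values.

```php
<?php
// Hypothetical helper (not from the thread): remove every byte that is
// not printable ASCII, a tab, a carriage return, or a line feed.
// Note that this also removes any non-ASCII text, such as accented letters.
function stripNonPrintable(string $raw): string
{
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $raw) ?? '';
}

// Made-up bytes imitating a binary document: control bytes, a stored value
// that happens to be the printable characters "A8", then the real text.
$raw = "\x01\x00A8\x00Hello";

echo stripNonPrintable($raw), PHP_EOL; // A8Hello
```

The control bytes are gone, but the stray "A8" survives because a byte filter cannot tell document data from content. That is the broken text the post warns about, and why saving the file as plain text from its native application is the more reliable route.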
This topic is now archived and is closed to further replies.