gmcmudder Posted July 10, 2008
I'm working on a project using a text document that contains some special characters I need to remove. I know I used to have a Word or Excel document that listed all the characters, but now I can't find it. Does anyone know where a similar list might be on the web? A Google search doesn't turn up any tables that contain all of the characters I need to remove, and I can't see the characters in the text document for reference. Most of the lists I've found online contain only about 125 of the characters that can be typed, and none of the ones I need to scan for. Some of the characters in the original text document that I need to remove are (hard return, unknown), etc. Any help would be greatly appreciated.
effigy Posted July 10, 2008
Such as this? Have you tried a hex editor? Are you on Unix?
discomatt Posted July 10, 2008
http://www.danshort.com/ASCIImap/ ?
gmcmudder Posted July 10, 2008 Author
Quoting effigy: Such as this? Have you tried a hex editor? Are you on Unix?
That's the first table I used, and it doesn't list the characters I'm looking for. I'm using a hex editor, but for some reason my function isn't finding and removing the characters. I think the problem is that the original text file was created on a Mac and the web server is a Windows-based machine. Here's one of the special characters in the text document that I can't seem to remove. I know that if I can find the hex equivalent for it, I can replace it with what needs to be there instead. One of the characters I need to remove: Ê
DarkWater Posted July 10, 2008
That is 0xCA. I just did: <?php echo dechex(ord("Ê")); ?>
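One caveat on the one-liner above: `ord()` looks only at the first byte of the string, so the result depends on how the script file itself is saved (in a UTF-8 file, "Ê" is two bytes and the answer would be c3, not ca). A 0xCA byte showing up as Ê on Windows is also consistent with a Mac Roman file, where 0xCA is a non-breaking space, although the thread never confirms the encoding. When the characters can't be seen at all, it may be easier to ask PHP which unusual bytes the file contains. The following is only a sketch: `listUnusualBytes` is a made-up helper name and the sample string is invented for illustration.

```php
<?php
// Hypothetical helper (not from the thread): report every byte that is not
// printable ASCII, tab, LF or CR, together with how often it occurs.
function listUnusualBytes(string $data): array
{
    $counts = [];
    // count_chars() in mode 1 returns [byteValue => occurrences]
    // for the bytes that actually appear in the string.
    foreach (count_chars($data, 1) as $byte => $count) {
        $isPrintable  = $byte >= 0x20 && $byte <= 0x7E;
        $isCommonCtrl = in_array($byte, [0x09, 0x0A, 0x0D], true);
        if (!$isPrintable && !$isCommonCtrl) {
            $counts[sprintf('0x%02X', $byte)] = $count;
        }
    }
    return $counts;
}

// Invented sample: a vertical tab (0x0B) and a 0xCA byte hidden in the text.
$sample = "First line" . chr(0x0B) . "Second" . chr(0xCA) . "line";
print_r(listUnusualBytes($sample));
```

In practice the input would come from something like `file_get_contents()` on the exported text file, and the reported values can be fed straight into `chr()`.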
gmcmudder Posted July 10, 2008 Author
Quoting DarkWater: That is 0xCA. I just did: <?php echo dechex(ord("Ê")); ?>
Thanks DarkWater, that's twice now you've helped me. The project would be much easier if their editors would simply listen to me and not use things like that and hard returns. They can tell me what the code is and what it's used for on the print machine, but they have no clue how to avoid using it in the text document for their data. Now the question is, if I use that echo command to find the hex code, do I simply put 0x before whatever it echoes as the result? Maybe my hex editor isn't that great, because I thought the one I had before this one gave me all that information. Any recommendations on a hex editor?
DarkWater Posted July 10, 2008
Yes, you put 0x in front of it to represent it as hex in PHP. Then you can do:
$string = str_replace(chr(0xCA), '', $string);
I use GHex, but then again, I'm on Ubuntu. I'm pretty sure XVI32 is what I used on Windows.
gmcmudder Posted July 10, 2008 Author
Quoting DarkWater: Yes, you put 0x in front of it to represent it as HEX in php. Then you can do: $string = str_replace(chr(0xCA), '', $string); I use GHex, but then again, I'm on Ubuntu. I'm pretty sure XVI32 is what I used on Windows.
When working with more than one hex value, would this statement work?
$cleantext = str_replace(chr(0x0B), "\n", $contents);
$cleantext = str_replace('chr(0xCA)|chr(somehexvaluehere)', ' ', $cleantext);
DarkWater Posted July 10, 2008
$find = array(chr(0xCA), chr(0x0B));
$cleantext = str_replace($find, '', $cleantext);
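`str_replace()` also accepts an array for the replacement argument, pairing each search value with the replacement at the same position. That allows the mapping asked about earlier (vertical tab to a newline, 0xCA to a space) instead of deleting both bytes. A minimal sketch; the sample `$contents` string is invented, and the mapping is taken from the poster's description rather than from any documented format.

```php
<?php
// Parallel arrays: $find[0] is replaced by $replace[0], and so on.
$find    = [chr(0x0B), chr(0xCA)];
$replace = ["\n",      ' '];

// Invented sample input standing in for the exported article text.
$contents = "Headline" . chr(0x0B) . "Body" . chr(0xCA) . "text";

$cleantext = str_replace($find, $replace, $contents);
echo $cleantext;
```

Note that quoting the `chr()` calls inside a string, as in the question, searches for the literal text `chr(0xCA)|chr(...)` rather than the bytes themselves.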
discomatt Posted July 10, 2008
If you want to remove all accented/special characters... regex might be an easier solution
DarkWater Posted July 10, 2008
No, I remember his last thread. He gets some articles or something, and the place he gets them from has all these weird print characters in them. Last time he had random vertical tabs in there (0x0B), and this solution worked. Should work again. =P
effigy Posted July 10, 2008
Actually, do you have a list of which characters are valid? In regex you could use ranges to cover those bases.
DarkWater Posted July 10, 2008
I think there are only 2 or 3 characters that are throwing him off, so it would be easier to just replace them so he doesn't miss anything with the regex.
discomatt Posted July 10, 2008
preg_replace( '/[^\\041-\\176\\s]/', '', $subject )
will remove all characters not on a US keyboard. If you want to strip vertical tabs too, replace it with this:
preg_replace( '/[^\\040-\\176\\r\\n\\t]/', '', $subject )
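The octal ranges above (041 to 176, or 040 to 176) are the printable ASCII characters. For anyone working from hex editor output, the same whitelist may be easier to read in hex notation. The sketch below is an assumption-laden variant, not discomatt's code: `stripNonAscii` is a made-up name, and the pattern deliberately lists tab, CR and LF explicitly instead of relying on `\s`, whose treatment of the vertical tab has differed between PCRE versions.

```php
<?php
// Hypothetical helper: keep printable ASCII (0x20-0x7E) plus tab, CR and LF,
// and drop every other byte. Without the /u modifier the pattern works on
// single bytes, so multi-byte UTF-8 characters are removed byte by byte.
function stripNonAscii(string $subject): string
{
    return preg_replace('/[^\x20-\x7E\r\n\t]/', '', $subject);
}

// Invented sample: an accented byte (0xE9) and a vertical tab (0x0B).
$sample = "Caf" . chr(0xE9) . " menu" . chr(0x0B) . "\tok\n";
echo stripNonAscii($sample);
```

Because this deletes everything outside the whitelist, any byte that should become a space or a line break needs to be replaced before the pattern runs.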
gmcmudder Posted July 10, 2008 Author
Quoting DarkWater: No, I remember his last thread. He gets some articles or something and the place where he gets them from has all this weird print characters in it or something. Last time he had random vertical tabs in there (0x0B), and this solution worked. Should work again. =P
Yep, you remembered right. The random vertical tabs play the role that line breaks play in a normal text file. Their editing software uses the special characters to insert the articles for printing into the actual news printing machine. They want to use the same file they use for the news print to enter the article data into a database. I spoke with one of the editors today, and the developer of the news print software used key values (characters) that he felt weren't used by other programs anymore to create the text document that is fed into the print machine. Finding and removing all of those characters from the text file has proven to be a bit of a challenge, though. Like finding out what the hex value for a hard return is without actually seeing one.
gmcmudder Posted July 10, 2008 Author
Let's try this: each character that I need to remove from the text document performs a specific task on the news print machine. For example, chr(0x0B) is a line break in an article, while chr(0xCA) is an added space in the line. So I'm figuring out what each special character is supposed to represent and then replacing it accordingly. Anything else will be considered trailing garbage and removed from the text file, for which I would use something like:
preg_replace( '/[^\\041-\\176\\s]/', '', $subject )
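The plan described above has two stages, and their order matters: translate the bytes whose meaning is known first, then strip whatever is left. A rough sketch of that order follows, assuming only the two mappings named in the thread. `cleanArticle` is a hypothetical function name, the sample input is invented, and the whitelist pattern is a hex-notation variant rather than the exact pattern quoted above.

```php
<?php
// Hypothetical two-stage cleanup following the plan in the post above.
function cleanArticle(string $raw): string
{
    // Stage 1: translate bytes with a known meaning on the print machine.
    // Only these two mappings appear in the thread; extend as more are found.
    $known = [
        [chr(0x0B), "\n"], // vertical tab used as a line break
        [chr(0xCA), ' '],  // byte used as an added space
    ];
    $text = str_replace(array_column($known, 0), array_column($known, 1), $raw);

    // Stage 2: anything still outside printable ASCII, tab, CR and LF is
    // treated as trailing garbage and removed.
    return preg_replace('/[^\x20-\x7E\r\n\t]/', '', $text);
}

// Invented sample with two known bytes and two unknown ones (0x01, 0xFF).
$raw = "Line one" . chr(0x0B) . "Line" . chr(0xCA) . "two" . chr(0x01) . chr(0xFF);
echo cleanArticle($raw);
```

Running the stages the other way round would delete the 0x0B and 0xCA bytes before they could be turned into line breaks and spaces.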
gmcmudder Posted July 10, 2008 Author
Got it, thanks for the help DarkWater and discomatt. I ended up using both suggestions and got the text file's information into the database without any errors at all (finally). XVI32 worked great as well. Now if I can just get them to edit their own text file, check for line errors and make sure that the lines match up. So an article's line would look like "Earth's top scientist are working on the problem, but have no ideal when they will find a solution to the global warming crisis that we are currently facing." instead of - "Earth's top scientist are working on the problem, but have no ideal when they will find a solution to the global warming crisis that we are currently facing." This one is solved, thanks again everyone.
DarkWater Posted July 10, 2008
No problem. Glad I could help. Please mark the topic as solved. Edit: You already did. Good. =P