wrathican Posted November 12, 2008

Hey people,

I have recently had a bit of trouble with character sets and PHP/MySQL. A client is pasting text from Word documents into a form field, and the pasted data seems to contain 'funny' characters. When displaying the data on a page, I see a diamond '?' where the character should be.

I've run some tests with strpos and preg_replace to see if I can replace the characters with normal characters, but to no avail. I first did this little test:

    <?php
    $needle = '“';
    $haystack = "“We felt it was important to contribute something useful.”";
    $pos = strpos($haystack, $needle);
    if ($pos !== false) {
        echo "The string '{$needle}' was found in the string '{$haystack}'";
        echo " and exists at position {$pos}<br><br><br><br><br>";
        $string = preg_replace('~(“)|(”)~', '"', $haystack);
        echo $string;
    } else {
        echo "The string '{$needle}' was not found in the string '{$haystack}'";
    }
    ?>

This worked, and converted the 'funny' character to what I wanted. Then I tried the same thing using a database selection, and apparently there was no match. The only difference was that $haystack was set to a database field.

What can I do to convert these characters?

Thanks,
Wrath
Lumio Posted November 12, 2008

Try utf8_encode($str) (or was it utf8_decode?)
wrathican Posted November 12, 2008 (Author)

Thanks for the reply. I tried that, but again to no avail. I echoed three variations of my output:

1 - normal echo
2 - utf8_encode
3 - utf8_decode

1 output the normal string with inline 'funny' characters.
2 output my string, but with an extra character in front of all my 'funny' characters.
3 did the same as 1.

My database collation is latin1_swedish_ci, if that helps.
DarkWater Posted November 12, 2008

Your database should be UTF-8, and so should the charset for your page.
wrathican Posted November 12, 2008 (Author)

I tried changing the collation to utf8_unicode_ci and I still had the same problem.
haku Posted November 12, 2008

The problem is that Word documents add a lot of extra non-visible data to the text, which plays havoc when it is pasted into a browser.
wrathican Posted November 12, 2008 (Author)

Yeah, I know that pasting from MS Word isn't ideal. What I am asking is whether there is a way to detect the charset of a string and then convert the string to a normal charset.
premiso Posted November 12, 2008

> Yeah, I know that pasting from MS Word isn't ideal. What I am asking is whether there is a way to detect the charset of a string and then convert the string to a normal charset.

I honestly do not think you can. The best you can do is assume that you have to replace certain characters. I hate MS Word for this exact reason: you have to check for those smart quotes, the - and a bunch of other nonsense and replace them.

The easiest way I found was to create two arrays, one with the bad values and one with the good values, and use them to replace the bad values with the good ones.

The worst part was that this happened to me after I had had my site running for about 6 months, so changing charsets was not feasible. I wish I had known to use a different charset back then. Oh well.

Hope that helps.
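The two-array approach premiso describes can be sketched roughly as follows. This is only an illustration under assumptions: the function name `replace_word_punctuation` is hypothetical, the list of characters is a small sample rather than a complete one, the input is assumed to be UTF-8, and `str_replace` is used here instead of a regular expression. The "bad" values are written as byte escapes so that the encoding of the PHP source file does not matter.

```php
<?php
// Hypothetical helper (not from the thread): maps a few common Word
// punctuation marks, given as UTF-8 byte sequences, to plain ASCII.
function replace_word_punctuation(string $text): string
{
    $bad = [
        "\xE2\x80\x98", // left single quotation mark
        "\xE2\x80\x99", // right single quotation mark
        "\xE2\x80\x9C", // left double quotation mark
        "\xE2\x80\x9D", // right double quotation mark
        "\xE2\x80\x93", // en dash
        "\xE2\x80\x94", // em dash
        "\xE2\x80\xA6", // horizontal ellipsis
    ];
    $good = ["'", "'", '"', '"', '-', '-', '...'];

    // str_replace pairs each entry in $bad with the entry at the same
    // index in $good.
    return str_replace($bad, $good, $text);
}

$input = "\xE2\x80\x9CWe felt it was important.\xE2\x80\x9D";
echo replace_word_punctuation($input), "\n";
```

If the stored data is in a single-byte encoding instead of UTF-8, the byte sequences in `$bad` would need to be the ones used by that encoding, or the text would need converting first.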
haku Posted November 12, 2008

> Yeah, I know that pasting from MS Word isn't ideal. What I am asking is whether there is a way to detect the charset of a string and then convert the string to a normal charset.

You can do it with multi-byte charsets, but I don't know if you can do it with single-byte charsets.

That being said, I don't think the problem is with your charset. Word deals with charsets just fine (I actually use it to detect charsets sometimes when document encoding gets screwed up, as almost every site I deal with is in Japanese). The problem is that Word adds extra non-visible characters to the text, which show up as weird characters when you paste them into the browser. This means that Word isn't just 'not an ideal solution'; it's the wrong solution.

If your client insists on doing it this way, have them paste the text into Notepad or WordPad, then copy it again and paste it into the browser. That *should* strip out all the extra characters.
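As a rough illustration of the multi-byte case, PHP's mbstring extension can at least check whether a byte string is valid UTF-8. This is a sketch, not a full charset detector: the helper name `looks_like_utf8` is hypothetical, and the check cannot tell single-byte charsets apart from one another, since almost any byte sequence is valid in them.

```php
<?php
// Hypothetical helper (not from the thread). Requires the mbstring extension.
function looks_like_utf8(string $bytes): bool
{
    // True only if every byte sequence in $bytes is well-formed UTF-8.
    return mb_check_encoding($bytes, 'UTF-8');
}

// Curly double quotes encoded as UTF-8 (three bytes each).
var_dump(looks_like_utf8("\xE2\x80\x9Cquoted\xE2\x80\x9D"));

// The same quotes as single bytes 0x93 and 0x94, as Windows-1252 encodes them.
var_dump(looks_like_utf8("\x93quoted\x94"));
```

The first call reports valid UTF-8 and the second does not, because a lone 0x93 byte is not a well-formed UTF-8 sequence. A failed check suggests the text needs converting from some other charset, but does not say which one.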
wrathican Posted November 12, 2008 (Author)

Thanks for the advice. I will certainly take this up with my client.

I managed to figure out a workaround for the items that had already been inserted. I tried converting the string (using PHP's iconv function) into different charsets to find the correct one. Once I had it, I did exactly what you said, premiso: I set up arrays of good and bad characters and used preg_replace to sort it out.

I hope this will be of use to someone in the future. Thanks for the advice, people!
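The iconv step might look something like the sketch below. The thread does not say which source charset turned out to be the right one, so Windows-1252 (`CP1252`) is purely an assumption here, chosen because it is a common encoding for Word's smart quotes on a latin1 setup. The helper name `cp1252_to_utf8` is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread). Assumes the stored bytes are
// Windows-1252; the actual source charset was not stated.
function cp1252_to_utf8(string $bytes): string
{
    $converted = iconv('CP1252', 'UTF-8', $bytes);

    // iconv returns false if the input contains bytes that are not
    // defined in the source charset; fall back to the original string.
    return $converted === false ? $bytes : $converted;
}

// 0x93 and 0x94 are the curly double quotes in Windows-1252.
$fromDb = "\x93We felt it was important.\x94";
$utf8 = cp1252_to_utf8($fromDb);

// Show the bytes of the first converted character.
echo bin2hex(substr($utf8, 0, 3)), "\n"; // e2809c
```

Once the text is in a known encoding, the good/bad replacement arrays can be written against that encoding's byte sequences.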