maxbenjamin Posted June 26, 2007

Hello forum,

I've written a regex that captures words in an HTML document. It works almost perfectly, except for "typographer's" quotes, contractions, and possessives, which I can't seem to capture.

My app downloads a webpage to a file using curl, processes the text from that file, and puts the processed text into a SQLite3 DB. The app can then display the processed text found in the DB. I'm serving the pages with ISO-8859-1 encoding, and the quotes and apostrophes look fine when viewed with a browser. The original HTML downloaded is also served as ISO-8859-1. Looking at the HTML source in Firefox, the characters I'm trying to capture are ’ ‘ “ ”. Should be easy...

However, looking at the text directly in the DB or in the original files, the characters are replaced with question marks. I checked, and the files created using curl are encoded as UTF-8. I believe the SQLite3 DB is also UTF-8. I'm guessing this is what is giving me the problem.

Is there a way to set curl to encode using ISO-8859-1, or is there some other fix anyone can suggest? I've been attempting to figure this out for two days and haven't gotten anywhere.

Thanks.

Wildbug Posted June 26, 2007

You may be able to use Unicode character properties (cf. PCRE Syntax, about 1/3 of the way down the page).

/\p{Pi}smart quotes\p{Pf}/iu

Here \p{Pi} and \p{Pf} match initial and final quote punctuation, and the u modifier makes PCRE treat the pattern and subject as UTF-8.

I've also seen some regexes around that remove "smart" quotes and the curly apostrophes, I think even on the PHP site, but I can't find them now. Try Googling "smart quotes" + PHP + preg_replace.

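As a rough illustration of the Unicode-property idea applied to the original goal (capturing words with contractions and possessives), here is a minimal sketch. It assumes PHP 7 or later, PCRE built with Unicode support, and input that is valid UTF-8. The function name `extract_words` is hypothetical and not from the thread.

```php
<?php
// Hypothetical helper: return the words in a UTF-8 string, keeping
// contractions and possessives ("it's", "dog’s") as single words.
function extract_words(string $utf8): array
{
    // \p{L}+            one or more letters in any script
    // ['\x{2019}]\p{L}+ an ASCII or curly apostrophe followed by more letters
    // The "u" modifier makes PCRE read the pattern and subject as UTF-8.
    preg_match_all('/\p{L}+(?:[\'\x{2019}]\p{L}+)*/u', $utf8, $matches);
    return $matches[0];
}
```

Surrounding curly quotes are not letters, so they are left out of the captured words, while an apostrophe inside a word is kept.
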
effigy Posted June 26, 2007

Smart quotes are typically from Windows-1252. I couldn't get iconv to convert between that and ISO-8859-1, so your best bet is to replace the characters. Another solution is to match their hex values: [\x91\x92] for the single quotes and [\x93\x94] for the double quotes.

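The hex-value suggestion could look something like the sketch below. It assumes the input is single-byte Windows-1252 text, not UTF-8, because in UTF-8 the bytes 0x91 to 0x94 can appear as continuation bytes of unrelated characters. The function name `replace_cp1252_quotes` is made up for illustration.

```php
<?php
// Hypothetical helper: replace Windows-1252 smart-quote bytes with ASCII quotes.
// Assumes $bytes is single-byte Windows-1252 text, NOT UTF-8.
function replace_cp1252_quotes(string $bytes): string
{
    // No "u" modifier here, so PCRE matches the pattern byte by byte.
    $bytes = preg_replace('/[\x91\x92]/', "'", $bytes); // curly single quotes
    return preg_replace('/[\x93\x94]/', '"', $bytes);   // curly double quotes
}
```
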
maxbenjamin Posted June 26, 2007

OK, but how can I replace the characters if they are not recognized in the character set? That is my main problem. Is there an easy way to convert UTF-8 to ISO-8859-1?

Thanks.

effigy Posted June 26, 2007

Try iconv. If that doesn't work, decode the string, make the replacement, then re-encode the string.

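One way to read "make the replacement, then re-encode" is sketched below, assuming the downloaded text really is valid UTF-8, that the iconv extension is available, and PHP 7 or later (for the `\u{...}` string escapes). The function name `utf8_to_latin1_plain_quotes` is hypothetical.

```php
<?php
// Hypothetical helper: UTF-8 in, ISO-8859-1 out, with curly quotes
// straightened first. Assumes $utf8 is valid UTF-8.
function utf8_to_latin1_plain_quotes(string $utf8)
{
    // ISO-8859-1 has no curly quotes, so swap them for ASCII before converting.
    $map = [
        "\u{2018}" => "'", "\u{2019}" => "'",
        "\u{201C}" => '"', "\u{201D}" => '"',
    ];
    $plain = strtr($utf8, $map);

    // iconv() returns false if a remaining character has no
    // ISO-8859-1 equivalent, so callers should check the result.
    return iconv('UTF-8', 'ISO-8859-1', $plain);
}
```

Doing the replacement before the conversion matters: once the quotes are ASCII, the conversion no longer has to deal with characters the target encoding lacks.
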
maxbenjamin Posted July 8, 2007

Thanks for the help. I tried iconv (no luck), so I ended up converting the smart quotes to regular quotes during the initial file save with a function like the following:

function convert_quotes($string)
{
    // Windows-1252 bytes: 145/146 are the curly single quotes,
    // 147/148 are the curly double quotes.
    $search = array(chr(145), chr(146), chr(147), chr(148));
    $replace = array("'", "'", '"', '"');
    return str_replace($search, $replace, $string);
}

Hope this helps someone!
