Character Encoding Problems and Regex Sadness


maxbenjamin

Hello forum,

I've written a regex that captures words in an HTML document. It works almost perfectly, except for "typographer's" quotes and the apostrophes in contractions and possessives, which I can't seem to capture.

 

My app downloads a webpage to a file using curl, then processes the text from that file and stores the processed text in a sqlite3 DB. The app can then display the processed text from the DB. I'm serving the pages with iso-8859-1 encoding, and the quotes and apostrophes look fine when viewed in a browser. The original HTML is also served as iso-8859-1.

 

Looking at the HTML source in Firefox, the characters I'm trying to capture are ’ ‘ “ ”. Should be easy...

 

However, looking at the text directly in the DB or in the original files, the characters are replaced with question marks. I checked, and the files created using curl are encoded as utf-8; I believe the sqlite3 DB is also utf-8. I'm guessing this mismatch is what is giving me the problem.
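One way to test that guess is to look at the raw bytes rather than at how an editor renders them. The following is a minimal sketch, not a definitive check: `describe_quote_bytes()` is a hypothetical helper name, and it assumes the only candidate encodings are UTF-8 and Windows-1252 (where the curly quotes are the single bytes 0x91 to 0x94).

```php
<?php
// Hypothetical diagnostic helper (not from the original post):
// report how curly quotes are stored in a raw byte string.
function describe_quote_bytes(string $bytes): string
{
    // An empty pattern with the u modifier only succeeds on valid UTF-8.
    $isUtf8 = preg_match('//u', $bytes) === 1;

    if ($isUtf8) {
        // U+2018, U+2019, U+201C, U+201D are the four curly quotes.
        return preg_match('/[\x{2018}\x{2019}\x{201C}\x{201D}]/u', $bytes) === 1
            ? 'utf-8 curly quotes'
            : 'no curly quotes';
    }

    // Assumption: non-UTF-8 input is Windows-1252, where the same
    // four characters are the single bytes 0x91 through 0x94.
    return preg_match('/[\x91-\x94]/', $bytes) === 1
        ? 'windows-1252 curly quotes'
        : 'no curly quotes';
}
```

If the file really contains literal `?` bytes (0x3F) where the quotes should be, the characters were already lost before the file was written, and no later conversion can bring them back.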

 

Is there a way to set curl to encode using iso-8859-1 or is there some other fix anyone can suggest? I've been attempting to figure this out for two days and haven't gotten anywhere.

 

Thanks.


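You may be able to use Unicode character properties (cf. PCRE Syntax, about 1/3 of the way down the page). The curly quotes fall in the initial-quote (Pi) and final-quote (Pf) classes, and the pattern needs the u modifier so that PCRE treats the subject as UTF-8.

 

/\p{Pi}smart quotes\p{Pf}/iu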

 

I've also seen some regexes around that remove "smart" quotes and the curly apostrophes, I think even on the PHP site, but I can't see it now.  Try Googling "smart quotes" + PHP + preg_replace.
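As a rough sketch of both ideas, assuming the text has already been decoded to UTF-8: the first function swaps curly quotes for ASCII ones with `preg_replace`, and the second matches words while keeping contractions and possessives in one piece. The names `straighten_quotes_utf8()` and `extract_words()` are made up for illustration.

```php
<?php
// Hypothetical helper: replace curly quotes in a UTF-8 string with ASCII quotes.
function straighten_quotes_utf8(string $text): string
{
    // U+2018 / U+2019 are the single curly quotes.
    // preg_replace returns null on invalid UTF-8, so fall back to the input.
    $text = preg_replace('/[\x{2018}\x{2019}]/u', "'", $text) ?? $text;
    // U+201C / U+201D are the double curly quotes.
    return preg_replace('/[\x{201C}\x{201D}]/u', '"', $text) ?? $text;
}

// Hypothetical helper: extract words, treating a straight or curly
// apostrophe between letters as part of the word.
function extract_words(string $text): array
{
    preg_match_all('/\p{L}+(?:[\'\x{2019}]\p{L}+)*/u', $text, $matches);
    return $matches[0];
}
```

With the input “Don’t” touch Max’s code, `extract_words()` returns the four words Don’t, touch, Max’s and code, with the surrounding double quotes left out.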


  • 2 weeks later...

Thanks for the help. I tried iconv (no luck), so I ended up converting the smart quotes to regular quotes during the initial file save, with a function like the following:

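// Replace Windows-1252 "smart" quote bytes with plain ASCII quotes.
// This only matches single-byte Windows-1252 input, not UTF-8.
function convert_quotes($string)
{
    $search = array(
        chr(145), // left single quote
        chr(146), // right single quote / apostrophe
        chr(147), // left double quote
        chr(148)  // right double quote
    );
    $replace = array(
        '\'',
        '\'',
        '"',
        '"'
    );
    return str_replace($search, $replace, $string);
}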
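The function above works on single bytes, so it only helps while the data is still in its original single-byte encoding. If the text has already been converted to UTF-8, each curly quote is a three-byte sequence instead. A possible UTF-8 counterpart, offered as a sketch with the hypothetical name `convert_quotes_utf8()`:

```php
<?php
// Hypothetical UTF-8 counterpart to convert_quotes().
function convert_quotes_utf8(string $string): string
{
    // Keys are the raw UTF-8 byte sequences of the curly quotes.
    $map = array(
        "\xE2\x80\x98" => "'", // U+2018 left single quote
        "\xE2\x80\x99" => "'", // U+2019 right single quote / apostrophe
        "\xE2\x80\x9C" => '"', // U+201C left double quote
        "\xE2\x80\x9D" => '"', // U+201D right double quote
    );
    // strtr with an array replaces every key in a single pass.
    return strtr($string, $map);
}
```

Because it matches raw byte sequences, this version needs no regex support and leaves any other multi-byte characters untouched.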

Hope this helps someone!
