Regex Troubles with Character Encoding

Milosz · October 31, 2008

Hello,

I'm trying to implement a "looser" version of in_string(). If anyone's used Launchy before, they can skip the next paragraph because I'm implementing that exact search algorithm.

The algorithm determines whether the characters of the needle occur in the haystack in the order given, ignoring case and ignoring any characters in between. A search for the string 'php', in my algorithm, is a search using the regular expression '/.*p.*h.*p.*/i'.

For testing, I have created an HTML form that takes a needle and a haystack and outputs true or false based on the result of my function. The form has the attribute accept-charset="utf8". My matching function is as follows:

function in_string_loose( $needle, $haystack ) {
    if ( $needle == '' ) {
        return true;
    }
    
    $metacharacters = array( '\\', '/', '|', '(', ')', '[', ']', '{', '}', '^', '$', '*', '+', '?', '.', '-' );

    /* Beginning slash */
    $loose_needle = '/.*';
    
    /* Create Launchy-like regex from string */
    $array = preg_split( '//u', $needle );
    foreach( $array as $char ) {
        if ( in_array( $char, $metacharacters ) ) {
            $loose_needle .= '\\' . $char . '.*';            // Escape metacharacters
        } else if ( $char != '' ) {
            $loose_needle .= $char . '.*';
        }
    }

    /* End slash, modifiers */
    $loose_needle .= '/iu';

    return ( preg_match( $loose_needle, $haystack ) > 0 );
}
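
For comparison, here is a minimal sketch of the same idea that lets preg_quote() do the escaping instead of a hand-written metacharacter list. The function name in_string_loose_quoted is hypothetical and the 's' modifier is an addition of this sketch, not part of the function above; treat it as one possible variant rather than a drop-in replacement.

```php
<?php
// Hypothetical variant of in_string_loose(); the name is illustrative only.
function in_string_loose_quoted( $needle, $haystack ) {
    if ( $needle === '' ) {
        return true;
    }

    // Split the needle into UTF-8 characters. PREG_SPLIT_NO_EMPTY drops the
    // empty strings that the '//u' pattern yields at both ends.
    $chars = preg_split( '//u', $needle, -1, PREG_SPLIT_NO_EMPTY );
    if ( $chars === false ) {
        return false; // the needle is not valid UTF-8
    }

    // Escape every character for use inside a '/'-delimited pattern.
    $parts = array();
    foreach ( $chars as $char ) {
        $parts[] = preg_quote( $char, '/' );
    }

    // 'i' ignores case, 'u' treats pattern and subject as UTF-8,
    // 's' (an addition in this sketch) lets '.' match newlines too.
    $pattern = '/' . implode( '.*', $parts ) . '/isu';

    // preg_match() returns false on error, so compare strictly against 1.
    return preg_match( $pattern, $haystack ) === 1;
}

var_dump( in_string_loose_quoted( 'php', 'Pretty Home Page' ) );
var_dump( in_string_loose_quoted( 'a.b', 'axb' ) );
```

The leading and trailing '.*' are left out here because an unanchored pattern already matches anywhere in the haystack.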

A problem occurs when I attempt to use my function on a file name copied from Windows (untested for other OS's) that contains non-ASCII characters. The function always returns false, even if I copy the file name into both the needle and into the haystack, and despite my setting of the 'u' flag in preg_match(). I gather that this has to do with a special character encoding method used in Windows file management that is not UTF-8.

Does anyone have suggestions that would enable proper handling of Windows file name strings containing non-ASCII characters?
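
One way to test the hypothesis above is to check whether the incoming bytes are valid UTF-8 at all, since preg_match() with the 'u' modifier returns false (not 0) for a subject that is not valid UTF-8. The sketch below assumes the mbstring extension is available and, purely as an assumption, that the non-UTF-8 input is Windows-1252; the helper name to_utf8_assuming_cp1252 is made up for illustration.

```php
<?php
// Hypothetical helper: returns $string as UTF-8, converting only when the
// input is not already valid UTF-8.
// ASSUMPTION: anything that is not valid UTF-8 is Windows-1252.
function to_utf8_assuming_cp1252( $string ) {
    if ( mb_check_encoding( $string, 'UTF-8' ) ) {
        return $string;
    }
    return mb_convert_encoding( $string, 'UTF-8', 'Windows-1252' );
}

// "\xE9" is 'é' in Windows-1252; on its own it is not a valid UTF-8 sequence.
$raw = "caf\xE9.txt";

// With the 'u' modifier, an invalid subject makes preg_match() fail outright.
$matched = preg_match( '/caf/u', $raw );
$error   = preg_last_error();

var_dump( $matched === false );                // the match did not run
var_dump( $error === PREG_BAD_UTF8_ERROR );    // and this is the reason

// After conversion the same pattern can be applied normally.
$fixed = to_utf8_assuming_cp1252( $raw );
var_dump( preg_match( '/caf/u', $fixed ) );
echo bin2hex( $fixed ), "\n";
```

If preg_last_error() reports PREG_BAD_UTF8_ERROR for the real file names, the input is not reaching the function as UTF-8, and the source encoding still has to be confirmed before choosing a conversion.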

Milosz · November 2, 2008

Hello,

I see this thread got a few views, so someone may be interested in hearing the solution. I found a neat function, apparently used in SquirrelMail (a webmail application), called charset_decode_utf_8. It turns every two- or three-byte UTF-8 sequence (that is, the non-ASCII characters it recognizes) into a numeric HTML entity of the form '&#xxx;'. The code is:

function charset_decode_utf_8 ($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
        return $string;

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
    "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
    $string);

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
    "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
    $string);

    return $string;
}

I am running my needles and haystacks through this function prior to comparison and all is well. As is implied, I can do without the '/u' modifier in my regexes now.
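
For anyone reading this on a newer PHP: ereg() and the '/e' modifier were both removed in PHP 7, so the function above will not run there as written. Below is a rough sketch of the same arithmetic using preg_replace_callback(). The name utf8_to_numeric_entities is hypothetical, this is not SquirrelMail's code, and the early-exit check is simplified to a single byte range.

```php
<?php
// Hypothetical rewrite of the conversion above; not SquirrelMail's code.
function utf8_to_numeric_entities( $string ) {
    // Only do the conversion if there are 8-bit bytes at all.
    if ( ! preg_match( '/[\x80-\xFF]/', $string ) ) {
        return $string;
    }

    // Three-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx
    $string = preg_replace_callback(
        '/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ( $m ) {
            $code = ( ord( $m[1] ) - 224 ) * 4096
                  + ( ord( $m[2] ) - 128 ) * 64
                  + ( ord( $m[3] ) - 128 );
            return '&#' . $code . ';';
        },
        $string
    );

    // Two-byte sequences: 110xxxxx 10xxxxxx
    $string = preg_replace_callback(
        '/([\xC0-\xDF])([\x80-\xBF])/',
        function ( $m ) {
            $code = ( ord( $m[1] ) - 192 ) * 64 + ( ord( $m[2] ) - 128 );
            return '&#' . $code . ';';
        },
        $string
    );

    return $string;
}

echo utf8_to_numeric_entities( "caf\xC3\xA9" ), "\n";    // caf&#233;
echo utf8_to_numeric_entities( "\xE2\x82\xAC 5" ), "\n"; // &#8364; 5
```

Like the original, this leaves four-byte sequences and bytes that are not part of a two- or three-byte sequence untouched.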
