
Regex Troubles with Character Encoding


Milosz


Hello,

 

I'm trying to implement a "looser" version of in_string(). If anyone's used Launchy before, they can skip the next paragraph because I'm implementing that exact search algorithm.

 

The algorithm determines whether the characters of the needle all occur in the haystack, in the order given in the needle, regardless of case and of any characters in between. A search for the string 'php', in my algorithm, would therefore be a search using the regular expression '/.*p.*h.*p.*/i'.
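To make the matching rule concrete, here is a small sketch that runs the 'php' pattern from the paragraph above against a few haystacks. The sample haystacks and variable names are made up for illustration.

```php
<?php
// The pattern quoted above for the needle 'php'.
$pattern = '/.*p.*h.*p.*/i';

// p, h, p appear in order (case-insensitively), so this matches.
$match_full  = preg_match( $pattern, 'PHP: Hypertext Preprocessor' );

// "Photoshop" has p, then h, then a later p, so it also matches.
$match_loose = preg_match( $pattern, 'Photoshop' );

// "graph" has p then h, but no second p afterwards, so it does not match.
$match_none  = preg_match( $pattern, 'graph' );
```

preg_match() returns 1 for a match and 0 for no match, so the first two results are 1 and the last is 0.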

 

For testing, I have created an HTML form that takes a needle and a haystack and outputs true or false based on the result of my function. The form has the attribute accept-charset="utf8". My matching function is as follows:

 

function in_string_loose( $needle, $haystack ) {
    if ( $needle == '' ) {
        return true;
    }
    
    $metacharacters = array( '\\', '/', '|', '(', ')', '[', ']', '{', '}', '^', '$', '*', '+', '?', '.', '-' );

    /* Beginning slash */
    $loose_needle = '/.*';
    
    /* Create Launchy-like regex from string */
    $array = preg_split( '//u', $needle );
    foreach( $array as $char ) {
        if ( in_array( $char, $metacharacters ) ) {
            $loose_needle .= '\\' . $char . '.*';            // Escape metacharacters
        } else if ( $char != '' ) {
            $loose_needle .= $char . '.*';
        }
    }

    /* End slash, modifiers */
    $loose_needle .= '/iu';

    return ( preg_match( $loose_needle, $haystack ) > 0 );
}
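As a possible alternative to the hand-written metacharacter list, the pattern could be built with PHP's preg_quote(). This is only a sketch: build_loose_pattern() is a hypothetical helper name, and it leaves out the leading and trailing '.*', which do not change whether an unanchored pattern matches.

```php
<?php
// Hypothetical helper, not the function from the post above.
function build_loose_pattern( string $needle ): string {
    // Split into UTF-8 characters; PREG_SPLIT_NO_EMPTY drops the empty
    // pieces that '//u' produces at the start and end of the string.
    $chars = preg_split( '//u', $needle, -1, PREG_SPLIT_NO_EMPTY );
    if ( $chars === false ) {
        // preg_split() returns false when the needle is not valid UTF-8.
        throw new InvalidArgumentException( 'Needle is not valid UTF-8' );
    }

    // preg_quote() escapes regex metacharacters; '/' is our delimiter.
    $escaped = array_map(
        function ( string $c ): string {
            return preg_quote( $c, '/' );
        },
        $chars
    );

    // Allow any run of characters between consecutive needle characters.
    return '/' . implode( '.*', $escaped ) . '/iu';
}
```

For example, build_loose_pattern( 'a.b' ) matches 'xaz.yb' but not 'ab', because the dot is treated literally.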

 

A problem occurs when I use my function on a file name copied from Windows (untested on other operating systems) that contains non-ASCII characters. The function always returns false, even if I paste the same file name into both the needle and the haystack, and even though the pattern carries the 'u' modifier. I gather that this has to do with a character encoding used by Windows file management that is not UTF-8.

 

Does anyone have suggestions that would enable proper handling of Windows file name strings containing non-ASCII characters?
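
One way to check whether the encoding guess holds is to test the input for UTF-8 validity. With the 'u' modifier, preg_match() returns false (not 0) when the pattern or subject is not valid UTF-8, and preg_last_error() reports PREG_BAD_UTF8_ERROR. The sketch below shows that check plus a conversion step. The helper name is hypothetical, it needs the mbstring extension, and the choice of Windows-1252 as the source encoding is purely an assumption; the actual encoding of the submitted strings is not known from this thread.

```php
<?php
// Hypothetical helper: returns a UTF-8 copy of $s.
// ASSUMPTION: input that is not valid UTF-8 is Windows-1252. This is a guess.
function to_utf8_assuming_cp1252( string $s ): string {
    if ( mb_check_encoding( $s, 'UTF-8' ) ) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding( $s, 'UTF-8', 'Windows-1252' );
}

// "café.txt" with é as the single byte 0xE9 (not valid UTF-8).
$raw = "caf\xE9.txt";

// The u modifier makes PCRE reject the subject: the result is false, not 0.
$result = preg_match( '/caf/u', $raw );
$failed_on_utf8 = ( $result === false
    && preg_last_error() === PREG_BAD_UTF8_ERROR );

// After conversion, é is the two-byte sequence 0xC3 0xA9 and matching works.
$fixed   = to_utf8_assuming_cp1252( $raw );
$matched = preg_match( "/.*c.*\xC3\xA9.*/iu", $fixed );
```

If $failed_on_utf8 is true for the real form input, the strings are reaching PHP in some non-UTF-8 encoding; if it is false, the cause lies elsewhere.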


Hello,

 

I see this thread got a few views, so someone may be interested in hearing the solution. I found a neat function, apparently used in SquirrelMail (a webmail application), called charset_decode_utf_8. It turns every two- or three-byte UTF-8 sequence (that is, every non-ASCII character up to U+FFFF) into a numeric HTML entity of the form '&#nnn;'. The code is:

 

function charset_decode_utf_8 ($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
        return $string;

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
    "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
    $string);

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
    "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
    $string);

    return $string;
} 
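
Note that ereg() and the '/e' modifier used above were removed in PHP 7, so the function as quoted only runs on older PHP versions. A rough equivalent for current PHP might look like the sketch below, which swaps in preg_match() and preg_replace_callback(). The name charset_decode_utf_8_modern is made up here and this is not SquirrelMail's code.

```php
<?php
// Hypothetical PHP 7+ port of the function above; the name is invented.
function charset_decode_utf_8_modern( string $string ): string {
    // Only do the conversion if there are 8-bit bytes at all.
    if ( ! preg_match( '/[\x80-\xFF]/', $string ) ) {
        return $string;
    }

    // Three-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx
    $string = preg_replace_callback(
        '/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ( array $m ): string {
            $code = ( ord( $m[1] ) - 224 ) * 4096
                  + ( ord( $m[2] ) - 128 ) * 64
                  + ( ord( $m[3] ) - 128 );
            return '&#' . $code . ';';
        },
        $string
    );

    // Two-byte sequences: 110xxxxx 10xxxxxx
    $string = preg_replace_callback(
        '/([\xC0-\xDF])([\x80-\xBF])/',
        function ( array $m ): string {
            $code = ( ord( $m[1] ) - 192 ) * 64 + ( ord( $m[2] ) - 128 );
            return '&#' . $code . ';';
        },
        $string
    );

    return $string;
}
```

Like the original, this sketch ignores four-byte sequences, so characters above U+FFFF pass through unchanged.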

 

I am running my needles and haystacks through this function prior to comparison and all is well. As is implied, I can do without the '/u' modifier in my regexes now.
