mraza Posted November 6, 2009

Hi! I am working on a script that uses regular expressions to fetch data. These are the patterns I use in several places. How can I make them accept Greek words as well? Currently they accept only English:

preg_match_all('/<([a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/is', $html, $m);
preg_match('/<[a-z0-9\-]+.*?>/is', $m[4][$t])
preg_match_all('/\$([a-z0-9_\-]+)/i', $if_term, $m);
preg_replace('/[^a-z0-9_\-]/i', '', $k);

Thanks for any support.

Link to comment https://forums.phpfreaks.com/topic/180539-solved-greek-words-in-regular-expression/

cags Posted November 6, 2009

I wasn't sure, so I tried looking it up for you. The PHP manual says:

"Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE."

Source: http://www.php.net/manual/en/regexp.reference.unicode.php

mraza (Author) Posted November 6, 2009

Thanks, cags! I searched a little about this and found the following in an answer about Arabic support:

The PHP manual mentions: "Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE." But that's not entirely true anymore. PCRE release 6.5 added support for script names.

http://stackoverflow.com/questions/1011841/check-the-language-of-string-based-on-glyphs-in-php

cags Posted November 6, 2009

Indeed. If you search around a bit, it does appear that PCRE supports Greek as a script name for \p. I had a quick play around and it does indeed work:

"~[\p{Greek}]+~u";

...will match Greek words.

Edit: Assuming I checked it right, lol. Not knowing Greek, I had to rely on Google and Wikipedia to find a test string.
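
To illustrate the pattern in the post above, here is a minimal sketch. The sample sentence and variable names are made up for the example, and it assumes the string is UTF-8 encoded and that the PCRE library in use supports script names.

```php
<?php
// Hypothetical sample text mixing English and Greek (must be UTF-8 encoded).
$text = 'Hello καλημέρα world κόσμος';

// \p{Greek} matches one character of the Greek script; the u modifier
// makes PCRE treat both the pattern and the subject as UTF-8.
preg_match_all('~[\p{Greek}]+~u', $text, $matches);

// $matches[0] holds every run of Greek characters found in $text.
print_r($matches[0]);
```

Only the two Greek words are captured here; the English words are skipped because the character class contains nothing but \p{Greek}.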

mraza (Author) Posted November 6, 2009

Thanks, cags, you are always a big help. I am not sure where to include this in my patterns. Could you please show me ~[\p{Greek}]+~u included in one of the lines above, so that it matches both English and Greek?

preg_match_all('/<([a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/is', $html, $m);

Thank you.

cags Posted November 6, 2009

Assuming I'm understanding it correctly, [\p{Greek}] represents 'any Greek character' in much the same way as [a-z] represents 'any "English" character'. So adding \p{Greek} to your character class should allow Greek characters. The only thing to remember is to use the u modifier, so that the pattern and the subject are treated as UTF-8.

preg_match_all('/<([\p{Greek}a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/isu', $html, $m);
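
As a rough sketch of the combined character class described above, the following applies it to two of the simpler patterns from the first post. The input strings ($key, $ifTerm) are hypothetical and only serve to show the effect; both are assumed to be UTF-8.

```php
<?php
// Hypothetical attribute key containing Greek and Latin letters plus junk.
$key = 'τίτλος-title!? 01';

// Keep Greek letters, a-z (case-insensitive), digits, underscore and hyphen;
// everything else is removed.
$clean = preg_replace('/[^\p{Greek}a-z0-9_\-]/iu', '', $key);
echo $clean, "\n";

// Hypothetical WHERE-style term with one Greek and one Latin variable name.
$ifTerm = '$τάξη == "red" and $id == "one"';

// Capture the names that follow each dollar sign.
preg_match_all('/\$([\p{Greek}a-z0-9_\-]+)/iu', $ifTerm, $m);
print_r($m[1]);
```

The only changes compared with the English-only patterns are the added \p{Greek} inside the brackets and the added u modifier.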

mraza (Author) Posted November 6, 2009

Thanks, cags. I tried it, but I am still missing something. Please have a look at the lines I edited; I am not sure if this is the right way:

preg_match_all('/<([\p{Greek}a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/isu', $html, $m)
preg_match('/<[\p{Greek}a-z0-9\-]+.*?>/isu', $m[4][$t])
preg_match_all('/\$([\p{Greek}a-z0-9_\-]+)/iu', $if_term, $m)
preg_replace('/[^\p{Greek}a-z0-9_\-]/iu', '', $k);

cags Posted November 6, 2009

It's difficult for me to say, as I don't know what string you are trying to match.

mraza (Author) Posted November 6, 2009

Here is the complete class I am using.

<?php

class htmlsql {

    // configuration:

    // htmlSQL version:
    var $version = '0.5';

    // referer and user agent:
    var $referer = '';
    var $user_agent = 'htmlSQL/0.5';

    // these are filled on runtime:
    // (don't touch them)

    // holds snoopy object:
    var $snoopy = NULL;

    // the results array is stored in here:
    var $results = array();

    // the results objects are stored in here:
    var $results_objects = NULL;

    // the error message gets stored in here:
    var $error = '';

    // the downloaded page is stored in here:
    var $page = '';

    /*
    ** init_snoopy
    **
    ** initializes the snoopy class
    */
    function init_snoopy(){
        $this->snoopy = new Snoopy();
        $this->snoopy->agent = $this->user_agent;
        $this->snoopy->referer = $this->referer;
    }

    /*
    ** set_user_agent
    **
    ** set a custom user agent
    */
    function set_user_agent($u){
        $this->user_agent = $u;
    }

    /*
    ** set_referer
    **
    ** sets the referer
    */
    function set_referer($r){
        $this->referer = $r;
    }

    /*
    ** _get_between
    **
    ** returns the content between $start and $end
    */
    function _get_between($content,$start,$end){
        $r = explode($start, $content);
        if (isset($r[1])){
            $r = explode($end, $r[1]);
            return $r[0];
        }
        return '';
    }

    /*
    ** connect
    **
    ** connects to a data source (url, file or string)
    */
    function connect($type, $resource){
        if ($type == 'url'){
            return $this->_fetch_url($resource);
        } else if ($type == 'file') {
            if (!file_exists($resource)){
                $this->error = 'The given file "'.$resource.' does not exist!';
                return false;
            }
            $this->page = file_get_contents($resource);
            return true;
        } else if ($type == 'string') {
            $this->page = $resource;
            return true;
        }
        return false;
    }

    /*
    ** _fetch_url
    **
    ** downloads the given URL with snoopy
    */
    function _fetch_url($url){
        $parsed_url = parse_url($url);
        if (!isset($parsed_url['scheme']) or $parsed_url['scheme'] != 'http'){
            $this->error = 'Unsupported URL sheme given, please just use "HTTP".';
            return false;
        }
        if (!isset($parsed_url['host']) or $parsed_url['host'] == ''){
            $this->error = 'Invalid URL given!';
            return false;
        }
        $host = $parsed_url['host'];
        $host .= (isset($parsed_url['port']) and !empty($parsed_url['port'])) ? ':'.$parsed_url['port'] : '';
        $path = (isset($parsed_url['path']) and !empty($parsed_url['path'])) ? $parsed_url['path'] : '/';
        $path .= (isset($parsed_url['query']) and !empty($parsed_url['query'])) ? '?'.$parsed_url['query'] : '';
        $url = 'http://' . $host . $path;
        $this->init_snoopy();
        if($this->snoopy->fetch($url)){
            $this->page = $this->snoopy->results;
            // empty buffer:
            $this->snoopy->results = '';
        } else {
            $this->error = 'Could not establish a connection to the given URL!';
            return false;
        }
        return true;
    }

    /*
    ** _extract_all_tags
    **
    **
    */
    function _extract_all_tags($html, &$tag_names, &$tag_attributes, &$tag_values, $depth=0){
        // stop endless loops:
        if ($depth > 99999){ return; }
        preg_match_all('/<([\p{Greek}a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/isu', $html, $m);
        if (count($m[0]) != 0){
            for ($t=0; $t < count($m[0]); $t++){
                $tag_names[] = trim($m[1][$t]);
                $tag_attributes[] = trim($m[2][$t]);
                $tag_values[] = trim($m[4][$t]);
                // go deeper:
                if (trim($m[4][$t]) != '' and preg_match('/<[\p{Greek}a-z0-9\-]+.*?>/isu', $m[4][$t])){
                    $this->_extract_all_tags($m[4][$t], $tag_names, $tag_attributes, $tag_values, $depth+1);
                }
            }
        }
    }

    /*
    ** isolate_content
    **
    ** isolates the content to a specific part
    */
    function isolate_content($start,$end){
        $this->page = $this->_get_between($this->page, $start, $end);
    }

    /*
    ** select
    **
    ** restricts the content of a specific tag
    */
    function select($tagname, $num=0){
        $num++;
        if ($tagname != ''){
            preg_match('/<'.$tagname.'.*?>(.*?)<\/'.$tagname.'>/is', $this->page, $m);
            if (isset($m[$num]) and !empty($m[$num])){
                $this->page = $m[$num];
            } else {
                $this->error = 'Could not select tag: "'.$tagname.'('.$num.')"!';
                return false;
            }
        }
        return true;
    }

    /*
    ** get_content
    **
    ** returns the content of an request
    */
    function get_content(){
        return $this->page;
    }

    /*
    ** _clean_array
    **
    **
    */
    function _clean_array($arr){
        $new = array();
        for ($x=0; $x < count($arr); $x++){
            $arr[$x] = trim($arr[$x]);
            if ($arr[$x] != ''){ $new[] = $arr[$x]; }
        }
        return $new;
    }

    /*
    ** _test_tag
    **
    **
    */
    function _test_tag($tag_attributes, $if_term){
        preg_match_all('/\$([\p{Greek}a-z0-9_\-]+)/iu', $if_term, $m);
        if (isset($m[1])){
            for ($x=0; $x < count($m[1]); $x++){
                $varname = $m[1][$x];
                $$varname = '';
            }
        }
        $new_list = array();
        while (list($k,$v) = each($tag_attributes)){
            $k = preg_replace('/[^\p{Greek}a-z0-9_\-]/iu', '', $k);
            if ($k != ''){ $new_list[$k] = $v; }
        }
        unset($tag_attributes);
        extract($new_list);
        $r = false;
        if (@eval('$r = ('.$if_term.');') === false){
            $this->error = 'The WHERE statement is invalid (eval() failed)!';
            return false;
        }
        return $r;
    }

    /*
    ** _match_tags
    **
    **
    */
    function _match_tags(&$results, &$return_values, &$where_term, &$tag_attributes, &$tag_values, &$tag_names){
        $search_mode = '';
        $search_attribute = '';
        $search_term = '';

        /*
        ** parse:
        **
        ** href LIKE ".htm"
        ** class = "foo"
        */
        $where_term = trim($where_term);
        $search_mode = ($where_term == '') ? 'match_all' : 'eval';
        for ($x=0; $x < count($tag_attributes); $x++){
            $tag_attributes[$x] = $this->parse_attributes($tag_attributes[$x]);
            if (is_array($tag_names)){
                $tag_attributes[$x]['tagname'] = isset($tag_names[$x]) ? $tag_names[$x] : '';
            } else {
                $tag_attributes[$x]['tagname'] = $tag_names;
            } // string
            $tag_attributes[$x]['text'] = isset($tag_values[$x]) ? $tag_values[$x] : '';
            if ($search_mode == 'eval'){
                if ($this->_test_tag($tag_attributes[$x], $where_term)){
                    $this->_add_result($results, $return_values, $tag_attributes[$x]);
                }
            } else if ($search_mode == 'match_all'){
                $this->_add_result($results, $return_values, $tag_attributes[$x]);
            }
        }
    }

    /*
    ** query
    **
    ** performs a query
    */
    function query($term){
        // query results are stored in here:
        $results = array();
        $this->results = NULL;
        $this->results_objects = NULL;
        $term = trim($term);
        if ($term == ''){
            $this->error = 'Empty query given!';
            return false;
        }

        // match query:
        preg_match('/^SELECT (.*?) FROM (.*)$/i', $term, $m);

        // parse returns values
        // SELECT * FROM ...
        // SELECT foo,bar FROM ...
        $return_values = isset($m[1]) ? trim($m[1]) : '*';
        if ($return_values != '*'){
            $return_values = explode(',', strtolower($return_values));
            $return_values = $this->_clean_array($return_values);
        }

        // match from and where part:
        //
        // ... FROM * WHERE $id=="one"
        // ... FROM a WHERE $class=="red"
        // ... FROM a
        // ... FROM *
        $last = isset($m[2]) ? trim($m[2]) : '';
        $search_term = '';
        $where_term = '';
        if (preg_match('/^(.*?) WHERE (.*?)$/i', $last, $m)){
            $search_term = trim($m[1]);
            $where_term = trim($m[2]);
        } else {
            $search_term = $last;
        }

        /*
        ** find tags:
        */
        if ($search_term == '*'){
            // search all
            $tag_names = array();
            $tag_attributes = array();
            $tag_values = array();
            $html = $this->page;
            $this->_extract_all_tags($html, $tag_names, $tag_attributes, $tag_values);
            $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tag_names);
        } else {
            // search term is a tag
            $tagname = trim($search_term);
            $tag_attributes = array();
            $tag_values = array();
            $regexp = '<'.$tagname.'([ \t].*?|)>((.*?)<\/'.$tagname.'>)?';
            preg_match_all('/'.$regexp.'/is', $this->page, $m);
            if (count($m[0]) != 0){
                $tag_attributes = $m[1];
                $tag_values = $m[3];
            }
            $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tagname);
        }
        $this->results = $results;

        // was there a error during the search process?
        if ($this->error != ''){
            return false;
        }
        return true;
    }

    /*
    ** convert_tagname_to_key
    **
    ** converts the tagname to the array key
    */
    function convert_tagname_to_key(){
        $new_array = array();
        while(list($key,$val) = each($this->results)){
            if (isset($val['tagname'])){
                $tag_name = $val['tagname'];
                unset($val['tagname']);
            } else {
                $tag_name = '(empty)';
            }
            $new_array[$tag_name] = $val;
        }
        $this->results = $new_array;
    }

    /*
    ** fetch_array
    **
    ** returns the results as an array
    */
    function fetch_array(){
        return $this->results;
    }

    /*
    ** _array2object
    **
    ** converts an array to an object
    */
    function _array2object($array) {
        if (is_array($array)) {
            $obj = new StdClass();
            foreach ($array as $key => $val){
                $obj->$key = $val;
            }
        } else {
            $obj = $array;
        }
        return $obj;
    }

    /*
    ** fetch_objects
    **
    ** returns the results as objects
    */
    function fetch_objects(){
        if ($this->results_objects == NULL){
            $results = array();
            reset($this->results);
            while(list($key,$val) = each($this->results)){
                $results[$key] = $this->_array2object($val);
            }
            $this->results_objects = $results;
            return $this->results_objects;
        } else {
            return $this->results_objects;
        }
    }

    /*
    ** get_result_count
    **
    ** returns the number of results
    */
    function get_result_count(){
        return count($this->results);
    }

    /*
    ** _add_result
    **
    **
    */
    function _add_result(&$results, $return_values, $tag_attributes){
        if ($return_values == '*'){
            $results[] = $tag_attributes;
        } else if (is_array($return_values)){
            $new_result = array();
            reset($return_values);
            for ($t=0; $t < count($return_values); $t++){
                $_tagname = explode(' as ', $return_values[$t]);
                $_caption = $return_values[$t];
                if (count($_tagname) != 1){
                    $_caption = trim($_tagname[1]);
                    $_tagname = trim($_tagname[0]);
                } else {
                    $_tagname = $_caption;
                }
                $new_result[$_caption] = isset($tag_attributes[$_tagname]) ? $tag_attributes[$_tagname] : '';
            }
            $results[] = $new_result;
        }
    }

    /*
    ** parse_attributes
    **
    ** parses HTML attributes and returns an array
    */
    function parse_attributes($attrib){
        $attrib .= '>';
        $mode = 'search_key';
        $tmp = '';
        $current_key = '';
        $attributes = array();
        for ($x=0; $x < strlen($attrib); $x++){
            $char = $attrib[$x];
            if ($char == '=' and $mode == 'search_key'){
                $current_key = trim($tmp);
                $tmp = '';
                $mode = 'value';
            } else if ($mode == 'search_key' and preg_match('/[ \t\s\r\n>]/', $char)){
                $current_key = strtolower(trim($tmp));
                if ($current_key != ''){ $attributes[$current_key] = ''; }
                $tmp = '';
                $current_key = '';
            } else if ($mode == 'value' and $char == '"'){
                $mode = 'find_value_ending_a';
            } else if ($mode == 'value' and $char == '\''){
                $mode = 'find_value_ending_b';
            } else if ($mode == 'value'){
                $tmp .= $char;
                $mode = 'find_value_ending_c';
            } else if (
                ($mode == 'find_value_ending_a' and $char == '"') or
                ($mode == 'find_value_ending_b' and $char == '\'') or
                ($mode == 'find_value_ending_c' and preg_match('/[ \t\s\r\n>]/', $char))
            ){
                $mode = 'search_key';
                if ($current_key != ''){
                    $current_key = strtolower($current_key);
                    $attributes[$current_key] = $tmp;
                }
                $tmp = '';
            } else {
                $tmp .= $char;
            }
        }
        if ($mode != 'search_key' and $current_key != ''){
            $current_key = strtolower($current_key);
            $attributes[$current_key] = trim(preg_replace('/>+$/', '', $tmp));
        }
        return $attributes;
    }

}

?>

I know it's a big file.

cags Posted November 6, 2009

Yes, it's a big file, and it's also in no way relevant. What difference does the rest of your script make to your regular expression?! In order to create a successful, accurate regular expression pattern, you need to know exactly what information you wish to match. Most people are extremely poor at describing exactly what needs to be matched, so the easiest solution is to give an example input string and say which part of it you wish to be matched.

nrg_alpha Posted November 6, 2009

cags wrote: "...Most people are extremely poor at describing exactly what needs to be matched. So the easiest solution is to give an example input string and saying which part you wish to be matched."

Which is precisely why we have a sticky on how to ask a regex question. Oh, how many times have I seen a thread drag on because the original post was poorly communicated in the first place? There's a reason why people requested I create a 'not psychic' smiley.

salathe Posted November 6, 2009

cags wrote: "I wasn't sure so I tried looking it up for you... Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE. Source: http://www.php.net/manual/en/regexp.reference.unicode.php"

This needs amending: as of PHP 5.1.3 the bundled PCRE library was 6.6 (previously 6.2, if I'm reading the logs correctly). From that version onwards, script names like Arabic, Greek and Latin are available. Of course, servers/hosts might not be using the bundled library, so check which version of the library you're using (phpinfo() will tell you).

Please file a "Documentation problem" bug report (http://bugs.php.net/report.php) and someone will look into changing it.
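
Besides phpinfo(), the version check suggested above can also be done from a script. The following is only a sketch: PCRE_VERSION is a built-in PHP constant, while the variable name $supportsScripts and the feature test itself are an illustration, not something taken from the thread.

```php
<?php
// PCRE_VERSION is a built-in constant holding the version (and release
// date) of the PCRE library that PHP is actually using.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// Feature test instead of comparing version numbers: preg_match() returns
// false (with a warning, suppressed here by @) when the pattern cannot be
// compiled, and 1 when it compiles and matches the Greek letter alpha.
$supportsScripts = @preg_match('/\p{Greek}/u', 'α') === 1;

echo $supportsScripts
    ? "Script names such as Greek are supported\n"
    : "Script names such as Greek are NOT supported\n";
```

Testing the feature directly avoids having to know which library version a particular host has installed.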

Daniel0 Posted November 6, 2009

cags wrote: "Assuming I'm understanding it correctly [\p{Greek}] represents 'any Greek character'"

If you just need the Greek characters, \p{Greek} on its own will suffice, as it represents a character class in itself.
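
A short sketch of that point, using a made-up sample string (assumed to be UTF-8): the bare and the bracketed form return the same matches, and the brackets only become necessary once \p{Greek} is combined with other ranges.

```php
<?php
// Hypothetical sample text with Latin and Greek words.
$text = 'alpha άλφα beta βήτα';

// \p{Greek} on its own already behaves like a character class...
preg_match_all('/\p{Greek}+/u', $text, $bare);

// ...so wrapping it in brackets changes nothing.
preg_match_all('/[\p{Greek}]+/u', $text, $bracketed);
var_dump($bare[0] === $bracketed[0]);

// Brackets are needed only when combining it with other ranges.
preg_match_all('/[\p{Greek}a-z]+/iu', $text, $mixed);
print_r($mixed[0]);
```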

mraza (Author) Posted November 6, 2009

Hi all! Thanks for such great support. I have finally sorted out my problem: I needed to change the encoding of my pages to get it to work. All works well now.

Best wishes
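
The thread does not say which encoding the pages used before the change, so the following is only a sketch under the assumption that the input was ISO-8859-7 (a common legacy Greek encoding). It shows why the encoding matters: with the u modifier, a subject that is not valid UTF-8 makes the preg functions fail rather than match. The variable names are illustrative.

```php
<?php
// The letters alpha, beta, gamma encoded as ISO-8859-7 (assumed legacy
// encoding). These three bytes are not a valid UTF-8 sequence.
$legacy = "\xE1\xE2\xE3";

// With the u modifier, an invalid UTF-8 subject makes preg_match()
// return false, and preg_last_error() reports the reason.
$result = preg_match('/\p{Greek}+/u', $legacy);
$badUtf8 = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR);
var_dump($badUtf8);

// Convert to UTF-8 first, then the same pattern matches.
$utf8 = iconv('ISO-8859-7', 'UTF-8', $legacy);
preg_match('/\p{Greek}+/u', $utf8, $m);
echo $m[0], "\n";
```

Checking preg_last_error() after a preg call that returns false is a quick way to tell an encoding problem apart from a pattern that simply does not match.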