lococobra Posted July 16, 2007

First, an example:

    <?php
    $code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
    ?>

Here's the problem: let's say this is the code for a web page, and I'm trying to determine which parts are PHP and which parts are HTML. If I just split the string at every occurrence of <?, <?php, or ?>, there are obviously going to be problems, because those sequences can also appear inside PHP strings.

I highly doubt this could be done in a single regular expression, but maybe with several. The first step seems to be to detect where the strings are in $code and ignore those areas. But then again, what if the HTML contains something like this?

    <form method="POST" action="<?php echo $_SERVER['PHP_SELF'] ?>">

As you can see, if all string areas are ignored, some valid PHP code may be ignored too. Any ideas, anyone?

Wildbug Posted July 16, 2007

http://www.cs.vu.nl/~dick/PTAPG.html

effigy Posted July 16, 2007

Perhaps something like this? It was borrowed from this topic.

    <pre>
    <?php

    $mixture = <<<MIX
    <html>
    <?php \$title = 'ABC?>'; ?>
    <head>
    <title><?php echo \$title; ?></title>
    </head>
    <body>
    Today is <?php \$date = getdate(); echo \$date['weekday']; ?>.
    </body>
    </html>
    <?php echo '<?php "test!" ?>'; ?>
    MIX;

    $pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);
    $revised_pieces = array();
    $num_pieces = count($pieces);

    ### Loop through and fix the matches.
    for ($i = 0; $i < $num_pieces; $i++) {
        $piece = $pieces[$i];

        ### Count the number of non-backslashed quotes.
        $quotes = 0;
        preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);

        ### Always add the current piece being processed.
        $revised_pieces[$i] = $piece;

        ### If the quotes are uneven...
        if ($quotes % 2) {
            ### Split apart the next piece.
            list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
            ### Add the missing end to this piece.
            $revised_pieces[$i] .= $before;
            ### Add the rest to the next piece.
            $revised_pieces[$i+1] = $after;
            ### Skip processing of the next piece.
            $i++;
        }
    }

    print_r($revised_pieces);

    ?>
    </pre>

lococobra (Author) Posted July 16, 2007

It does not work completely. Most of the time it does, but not in a really brutal case like this:

    html"<?php?>"html<?php"?>"php?>html

That gets turned into:

    Array
    (
        [0] => html"<?php?>
        [1] => "html<?php"?>
        [2] => "php?>html
    )

The output should be:

    Array
    (
        [0] => html"
        [1] => <?php?>
        [2] => "html
        [3] => <?php"?>"php?>
        [4] => html
    )

effigy Posted July 16, 2007

Do you have any realistic examples that cause problems?

lococobra (Author) Posted July 16, 2007

If it works for that one, it should work for anything. I do have a real example, but I'm not sure exactly which parts of it cause the failure, and it's about 300 lines long. One thing I know for sure needs to be modified is that

    $pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

should be changed to

    $pieces = preg_split('/(<\?.+?\?>)/s', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

so that the dot also matches newlines and PHP blocks spanning several lines are captured. However, changing that does not fix the problems.

effigy Posted July 16, 2007

After a quick look, adding the following before the if ($quotes % 2) { line works; I'm not sure how solid this is yet.

    ### There's no need to analyze the quotes if we're not in PHP.
    if (strpos($piece, '<?') === FALSE) {
        continue;
    }

lococobra (Author) Posted July 17, 2007

Here's an example that shows the above function still does not work correctly.

    <?php
    function findPHP($input){
        $pieces = preg_split('/(<\?.+?\?>)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
        $revised_pieces = array();
        for($i=0;$i<count($pieces);$i++){
            $piece = $pieces[$i];
            $quotes = 0;
            preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
            $revised_pieces[$i] = $piece;
            if (strpos($piece, '<?') === FALSE) continue;
            if ($quotes % 2) {
                list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
                $revised_pieces[$i] .= $before;
                $revised_pieces[$i+1] = $after;
                $i++;
            }
        }
        // Drop empty pieces; initialise first so an empty result is still an array.
        $output = array();
        foreach($revised_pieces as $piece)
            if(strlen($piece)!=0) $output[] = $piece;
        return $output;
    }
    print_r(findPHP('html"<?php?>"html<?php"?><?"php?>html ?> end'));
    ?>

Output is:

    Array
    (
        [0] => html"
        [1] => <?php?>
        [2] => "html
        [3] => <?php"?>
        [4] => <?"php?>html ?>
        [5] => end
    )

Should be:

    Array
    (
        [0] => html"
        [1] => <?php?>
        [2] => "html
        [3] => <?php"?><?"php?>
        [4] => html ?> end
    )

lococobra (Author) Posted July 17, 2007

One idea I had: if the data is parsed linearly, one can safely assume that the first <? encountered is valid. From that point on, even-numbered sections are HTML and odd-numbered sections are PHP (counting from 0). You could also safely assume that all strings inside PHP code blocks can be discarded. The only thing I can't figure out is how to discard a string's contents only when it is known to be within a PHP block. Here's some code I was working on. It's a bit brute force, but it may be the only way to do it.

    <?php
    function findPHP($input){
        $output = $input;
        $splitLoc = array();
        for($i=0;strpos($input, '<?')!==FALSE;$i++){
            if(intval($i/2)==($i/2)){ //Even = Html
                $splitLoc[] = strpos($input, '<?');
                $input = substr_replace($input,'xx',strpos($input,'<?'),2);
            } else { //Odd = PHP
                // Blank out every quoted string so a ?> inside one is not seen.
                preg_match_all('/(["\'])(.*?)(?<!\\\)\1/s', $input, $strings);
                $strings = $strings[0];
                foreach($strings as $string){
                    $replacement = "";
                    for($j=0;$j<strlen($string);$j++) $replacement .= "x";
                    $input = substr_replace($input, $replacement, strpos($input, $string), strlen($string));
                }
                $splitLoc[] = strpos($input, '?>');
                $input = substr_replace($input,'xx',strpos($input,'?>'),2);
            }
            //Magic happens here...
        }
        return $output;
    }
    ?>

I just can't seem to fit all the pieces together.

effigy Posted July 17, 2007

Throw some more tests at this:

    <pre>
    <?php

    function findPHP($input){

        $pieces = preg_split('/(<\?.+?\?>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        //my_print_r($pieces);
        $revised_pieces = array();
        $num_pieces = count($pieces);

        ### Loop through and fix the matches.
        for ($i = 0; $i < $num_pieces; $i++) {
            $piece = $pieces[$i];

            ### Count the number of non-backslashed quotes.
            $quotes = 0;
            preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);

            ### Always add the current piece being processed.
            $revised_pieces[$i] = $piece;

            ### If we're in PHP and the quotes are uneven...
            if (strpos($piece, '<?') !== FALSE && $quotes % 2) {
                ### Split apart the next piece.
                list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
                ### Add the missing end to this piece.
                $revised_pieces[$i] .= $before;
                ### Add the rest to the next piece if it's not empty.
                if (! empty($after)) {
                    $revised_pieces[$i+1] = $after;
                }
                ### Skip processing of the next piece.
                $i++;
            }
        }

        return $revised_pieces;
    }

    function my_print_r($array) {
        foreach ($array as $key => &$value) {
            $value = htmlspecialchars($value);
        }
        print_r($array);
    }

    $tests = array(
        'html"<?php?>"html<?php"?>"php?>html',
        'html"<?php?>"html<?php"?><?"php?>html ?> end',
    );

    foreach ($tests as $test) {
        my_print_r(findPHP($test));
    }

    ?>
    </pre>

lococobra (Author) Posted July 17, 2007

I was hopeful after seeing that the test lines had worked, but other tests are still showing failures. I can email you the test I'm running if you want.

lococobra (Author) Posted July 23, 2007

Just bumping this because the problem is still unsolved.

effigy Posted July 23, 2007

I still haven't had time to move forward on this. Basically, you cannot use a "big picture" regex to count the quotes due to nesting. You've got to step through each character to determine when you're in a set of quotes. I'll post something if I get around to it. Are you using any multibyte encodings?
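
To illustrate the "step through each character" idea, here is a minimal sketch, not code that anyone in the thread posted. The function name splitMixedSource and the chunk layout are hypothetical. It assumes single-byte scanning is acceptable (the bytes of <?, ?>, quotes, and backslash are unambiguous in UTF-8) and, as a simplification, it ignores PHP comments, heredocs, and backtick strings, so a quote or ?> inside a comment would still confuse it.

```php
<?php
// Hypothetical helper: split mixed source into alternating HTML / PHP
// chunks by walking the input one byte at a time.
// Simplifying assumption: only '...' and "..." strings are tracked;
// comments, heredocs and backticks are NOT handled.
function splitMixedSource(string $src): array
{
    $chunks = [];      // list of ['type' => 'html'|'php', 'text' => string]
    $buf    = '';      // text collected for the current chunk
    $inPhp  = false;   // true between an opening tag and its closing tag
    $quote  = null;    // the active quote character while inside a PHP string
    $len    = strlen($src);

    for ($i = 0; $i < $len; $i++) {
        $ch  = $src[$i];
        $two = substr($src, $i, 2);

        if (!$inPhp) {
            if ($two === '<?') {
                // An opening tag ends the HTML chunk and starts a PHP chunk.
                if ($buf !== '') {
                    $chunks[] = ['type' => 'html', 'text' => $buf];
                }
                $buf   = '<?';
                $inPhp = true;
                $i++;  // skip the '?' that was just consumed
            } else {
                $buf .= $ch;
            }
            continue;
        }

        if ($quote !== null) {
            // Inside a string: a backslash protects the following byte.
            $buf .= $ch;
            if ($ch === '\\' && $i + 1 < $len) {
                $buf .= $src[++$i];
            } elseif ($ch === $quote) {
                $quote = null;  // string closed
            }
            continue;
        }

        if ($ch === '"' || $ch === "'") {
            $quote = $ch;       // string opened
            $buf  .= $ch;
        } elseif ($two === '?>') {
            // A closing tag outside any string ends the PHP chunk.
            $chunks[] = ['type' => 'php', 'text' => $buf . '?>'];
            $buf   = '';
            $inPhp = false;
            $i++;  // skip the '>' that was just consumed
        } else {
            $buf .= $ch;
        }
    }

    // Flush whatever is left (an unterminated PHP block stays 'php').
    if ($buf !== '') {
        $chunks[] = ['type' => $inPhp ? 'php' : 'html', 'text' => $buf];
    }
    return $chunks;
}
```

Calling array_column(splitMixedSource('html"<?php?>"html<?php"?>"php?>html'), 'text') gives the five pieces lococobra listed as the desired output for that case, because the ?> between the two double quotes is seen while the scanner is inside a string and is therefore not treated as a closing tag.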

rea|and Posted July 23, 2007

Hmm, I've adapted a class I wrote for parsing PHP code to work as requested (I hope). It works the way lococobra deduced: the result is an array where even elements contain HTML and odd ones contain PHP code. I've only run a few tests, so I can't guarantee anything.

    <?php
    include_once 'cl.split.code.php';
    $code=file_get_contents('some_mixed_code.php');
    $hcode = new lh_splitCode() ;
    $hcode->lh_splitting( $code ) ;
    print_r( $hcode->lh_get_code() );
    ?>

Here is the class:

    <?php
    /**
     * Andrea Ponzi, b 1.0, 23/07/2007
     *
     */
    class lh_splitCode {
        var $original_code ;
        var $hliteCode ;
        var $parsedCode ;
        var $endphptag='[ENDPHPTAG]';
        var $re_open_tag_php = '/(?>^(.*?)<\?(?:(?i)php)?(.*)$)/sS' ;
        var $re_parse_mixed_code='/(?:("|\')(?:(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*)(\1|$))|(?:(?:#|\/\/)(?m-s).*\r?\n)|(?:\/\*.*?(?:\*\/|$))|(?:\?>.*$)|<\?/sS';

        function __lh_initialize($code) {
            $this->original_code = $code ;
            $this->hliteCode = $this->original_code ;
            $this->parsedCode = array() ;
        }

        function lh_splitting( $code=false ) {
            $this->__lh_initialize($code);
            if ($this->original_code==false) return false;
            $this->__lh_parsing_code();
            // Odd elements are PHP: restore any <? that was masked inside them.
            for($i=1,$c=count($this->parsedCode);$i<$c;$i+=2)
                $this->parsedCode[$i]=str_replace('[OPENPHP]','<?',$this->parsedCode[$i]);
        }

        function lh_get_code(){
            return $this->parsedCode ;
        }

        function __lh_parsing_code(){
            while(preg_match($this->re_open_tag_php, $this->hliteCode, $mth)){
                $this->parsedCode[] = $mth[1] ;
                $this->hliteCode = preg_replace_callback (
                    $this->re_parse_mixed_code
                    ,array( &$this,'__lh_parsing_engine_cback' )
                    ,$mth[2]
                );
                if ( strpos($this->hliteCode,$this->endphptag)!==false ) {
                    $tmp = explode($this->endphptag, $this->hliteCode) ;
                    $this->parsedCode[] = $tmp[0] ;
                    $this->hliteCode = $tmp[1] ;
                }
            }
            if (trim($this->hliteCode)!='') $this->parsedCode[] = $this->hliteCode ;
        }

        function __lh_parsing_engine_cback($mths) {
            if( $mths[0]=='' ) return '';
            if( $mths[0]=='<?' ) return '[OPENPHP]';
            // A match starting with '?' is a real closing tag: mark the end of the PHP block.
            $str=($mths[0][0]=='?')?$this->endphptag.substr($mths[0],2):$mths[0];
            return $str ;
        }
    }

EDIT: I forgot to say that the PHP tags themselves are consumed by the split, so they do not appear in the results.

lococobra (Author) Posted July 25, 2007

Awesome code, no idea how it works... but it does.
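
For comparison, the same HTML/PHP split can probably be had from PHP's bundled tokenizer instead of hand-written regexes. This is only a sketch of an alternative that nobody in the thread proposed; the function name splitWithTokenizer is hypothetical. It assumes the tokenizer extension is loaded, and it follows the real parser's rules, so a bare <? only counts as an opening tag when short_open_tag is enabled, and contrived inputs such as <?php?> (no whitespace after the tag) may be treated as plain HTML.

```php
<?php
// Hypothetical helper: group the tokens from token_get_all() into
// alternating HTML / PHP chunks.
function splitWithTokenizer(string $src): array
{
    $chunks = [];  // list of ['type' => 'html'|'php', 'text' => string]

    foreach (token_get_all($src) as $token) {
        // A token is either an [id, text, line] array or a one-character string.
        $id   = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;
        $type = ($id === T_INLINE_HTML) ? 'html' : 'php';

        // An opening tag always starts a fresh PHP chunk, so that
        // back-to-back blocks are kept separate.
        $startsBlock = in_array($id, [T_OPEN_TAG, T_OPEN_TAG_WITH_ECHO], true);

        $last = count($chunks) - 1;
        if (!$startsBlock && $last >= 0 && $chunks[$last]['type'] === $type) {
            $chunks[$last]['text'] .= $text;  // extend the current chunk
        } else {
            $chunks[] = ['type' => $type, 'text' => $text];
        }
    }
    return $chunks;
}

$parts = splitWithTokenizer('Hello <?php echo "a ?> b"; ?> world');
```

Here $parts should hold three chunks: the HTML before the block, the whole PHP block including its tags, and the HTML after it. The ?> inside the double-quoted string stays part of the PHP chunk because the tokenizer reads it as string content, which is exactly the case the regex attempts above struggle with.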