
[SOLVED] Extremely complicated regex/coding problem.


lococobra


First an example:

<?php
$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
?>

 

Here's the problem... let's say this is the code for a web page and I'm trying to determine which parts are PHP and which parts are HTML. If I just split the string at every occurrence of <? or <?php or ?>, obviously there are going to be problems, because those sequences can also appear inside quoted strings (as in the echo above).

 

I highly doubt this could be done in a single regular expression, but multiple ones maybe. First step seems to be to detect where strings are in $code and ignore those areas, but then again, what if html contains something like...

<form method="POST" action="<?php echo$_SERVER['PHP_SELF']?>">

 

As you can see, if all string areas are ignored, some valid php code may also be ignored. Any ideas anyone?

 

Perhaps something like this? It was borrowed from this topic.

 

<pre>
<?php

$mixture = <<<MIX
		<html>
			<?php \$title = 'ABC?>'; ?>
			<head>
				<title><?php echo \$title; ?></title>
			</head>
			<body>
				Today is <?php \$date = getdate(); echo \$date['weekday']; ?>.
			</body>
		</html>
		<?php echo '<?php "test!" ?>'; ?>
MIX;

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
$num_pieces = count($pieces);
### Loop through and fix the matches.
for ($i = 0; $i < $num_pieces; $i++) {
	$piece = $pieces[$i];
	### Count the number of non-backslashed quotes.
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	### Always add the current piece being processed.
	$revised_pieces[$i] = $piece;
	### If the quotes are uneven...
	if ($quotes % 2) {
		### Split apart the next piece.
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		### Add the missing end to this piece.
		$revised_pieces[$i] .= $before;
		### Add the rest to the next piece.
		$revised_pieces[$i+1] = $after;
		### Skip processing of the next piece.
		$i++;
	}
}
print_r($revised_pieces);	
?>
</pre>

It does not work completely... most of the time it does, but not in a really brutal case like this:

 

html"<?php?>"html<?php"?>"php?>html

 

Gets turned into...

 

Array
(
    [0] => html"<?php?>
    [1] => "html<?php"?>
    [2] => "php?>html
)

 

Output should be...

 

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>"php?>
    [4] => html
)

If it works for that one, it should work for anything... I do have an example but I'm not exactly sure what parts of it cause failure and it's about 300 lines long.

 

One thing that I know for sure needs to be modified is that

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

 

Should be changed to

$pieces = preg_split('/(<\?.+?\?>)/s', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);
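As a minimal sketch of what the s modifier changes (the two-line $mixture string here is a made-up example, not the test data from this thread): without s, the dot does not match a newline, so a PHP block that spans several lines is never matched at all.

```php
<?php
// A PHP block that spans two lines (made-up sample input).
$mixture = "a<?php echo 1;\necho 2; ?>b";

// Without /s the dot stops at the newline, so the block is never matched
// and preg_split() returns the whole string as a single piece.
$without = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

// With /s the dot also matches newlines, so the block is captured whole.
$with = preg_split('/(<\?.+?\?>)/s', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($without);
print_r($with);
```

The second call yields three pieces (HTML, PHP block, HTML); the first yields only one.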

 

However, changing that does not fix the problems.

Here's an example that shows how the above function is still not functioning correctly.

 

<?php
function findPHP($input){
$pieces = preg_split('/(<\?.+?\?>)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
for($i=0;$i<count($pieces);$i++){
	$piece = $pieces[$i];
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	$revised_pieces[$i] = $piece;
	if (strpos($piece, '<?') === FALSE)
		continue;
	if ($quotes % 2) {
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		$revised_pieces[$i] .= $before;
		$revised_pieces[$i+1] = $after;
		$i++;
	}
}
foreach($revised_pieces as $piece)
	if(strlen($piece)!=0)
		$output[] = $piece;
return $output;
}

print_r(findPHP('html"<?php?>"html<?php"?><?"php?>html ?> end'));
?>

 

Output is:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>
    [4] => <?"php?>html ?>
    [5] =>  end
)

 

Should be:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?><?"php?>
    [4] => html ?> end
)

One idea I had was, if the data is parsed linearly, one could safely assume that the first <? encountered would be valid. At that point, even numbered sections would be HTML, and odd numbered sections would be PHP (if you were starting at 0). Also, you could safely assume that the contents of all strings inside PHP code blocks could be discarded. The only thing I can't seem to figure out is how to discard a string's contents only if it's known to be within a PHP block.
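For what it's worth, PHP's own tokenizer already parses the source linearly in this way, so one option (a different technique from the regex attempts in this thread) is to let token_get_all() do the work. The sketch below assumes the tokenizer extension is available; splitWithTokenizer() and the $sample string are hypothetical names made up for illustration. Each chunk is tagged with its type instead of relying on even/odd positions, since a file may start with PHP.

```php
<?php
/* Sketch only: splitWithTokenizer() is a hypothetical helper name. */
function splitWithTokenizer(string $source): array
{
    $chunks  = array();  // list of array('type' => 'html' or 'php', 'text' => string)
    $current = '';       // text of the PHP block being collected
    $inPhp   = false;

    foreach (token_get_all($source) as $token) {
        // A token is either array(id, text, line) or a one-character string.
        $id   = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
            $inPhp   = true;
            $current = $text;
        } elseif ($id === T_CLOSE_TAG) {
            $chunks[] = array('type' => 'php', 'text' => $current . $text);
            $current  = '';
            $inPhp    = false;
        } elseif ($inPhp) {
            $current .= $text;
        } else {
            // Inline HTML; merge with a preceding HTML chunk if there is one.
            $last = count($chunks) - 1;
            if ($last >= 0 && $chunks[$last]['type'] === 'html') {
                $chunks[$last]['text'] .= $text;
            } else {
                $chunks[] = array('type' => 'html', 'text' => $text);
            }
        }
    }
    if ($inPhp) {
        // The source ended inside a PHP block with no closing tag.
        $chunks[] = array('type' => 'php', 'text' => $current);
    }
    return $chunks;
}

$sample = 'html"<?php $t = \'ABC?>\'; ?>"html<?php echo "?>"; ?>tail';
print_r(splitWithTokenizer($sample));
```

One caveat: whether a bare <? counts as an opening tag depends on the short_open_tag setting, and something like <?php?> (no whitespace after php) is not a full opening tag, so the brutal test strings in this thread may tokenize differently from how they are meant to be read.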

 

Here's some code I was working on, bit of a brute force, but it may be the only way to do it...

<?php
function findPHP($input){
$output = $input;
$splitLoc = array();
for($i=0;strpos($input, '<?')!==FALSE;$i++){
	if(intval($i/2)==($i/2)){
		//Even = Html
		$splitLoc[] = strpos($input, '<?');
		$input = substr_replace($input,'xx',strpos($input,'<?'),2);
	} else {
		//Odd = PHP
		preg_match_all('/(["\'])(.*?)(?<!\\\)\1/s', $input, $strings);
		$strings = $strings[0];
		foreach($strings as $string){
			// Mask the string's contents with x's of the same length.
			$replacement = str_repeat('x', strlen($string));
			$input = substr_replace($input, $replacement, strpos($input, $string), strlen($string));
		}
		$splitLoc[] = strpos($input, '?>');
		$input = substr_replace($input,'xx',strpos($input,'?>'),2);
	}
//Magic happens here...
}
return $output;
}
?>

 

I just can't seem to fit all the pieces together.

Throw some more tests at this:

 

<pre>
<?php

function findPHP($input){
	$pieces = preg_split('/(<\?.+?\?>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
	//my_print_r($pieces);
	$revised_pieces = array();
	$num_pieces = count($pieces);
	### Loop through and fix the matches.
	for ($i = 0; $i < $num_pieces; $i++) {
		$piece = $pieces[$i];
		### Count the number of non-backslashed quotes.
		$quotes = 0;
		preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
		### Always add the current piece being processed.
		$revised_pieces[$i] = $piece;
		### If we're in PHP and the quotes are uneven...
		if (strpos($piece, '<?') !== FALSE && $quotes % 2) {
			### Split apart the next piece.
			list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
			### Add the missing end to this piece.
			$revised_pieces[$i] .= $before;
			### Add the rest to the next piece if it's not empty.
			if (! empty($after)) {
				$revised_pieces[$i+1] = $after;
			}
			### Skip processing of the next piece.
			$i++;
		}
	}
	return $revised_pieces;
}

function my_print_r($array) {
	foreach ($array as $key => &$value) {
		$value = htmlspecialchars($value);
	}
	print_r($array);
}

$tests = array(
	'html"<?php?>"html<?php"?>"php?>html',
	'html"<?php?>"html<?php"?><?"php?>html ?> end',
);

foreach ($tests as $test) {
	my_print_r(findPHP($test));
}
?>
</pre>

I still haven't had time to move forward on this. Basically, you cannot use a "big picture" regex to count the quotes due to nesting. You've got to step through each character to determine when you're in a set of quotes. I'll post something if I get around to it. Are you using any multibyte encodings?
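A rough sketch of that character-by-character approach might look like the following. splitByScanning() is a hypothetical name, and the sketch assumes only single- and double-quoted strings with backslash escapes matter; comments and heredocs are deliberately not handled.

```php
<?php
/* Sketch only: splitByScanning() is a hypothetical name. Comments and
   heredocs are not handled. */
function splitByScanning(string $input): array
{
    $pieces = array();
    $buffer = '';
    $inPhp  = false;  // true while inside a PHP block
    $quote  = null;   // the quote character we are inside, if any
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        $pair = substr($input, $i, 2);

        if (!$inPhp) {
            if ($pair === '<?') {
                // Opening tag: flush the HTML collected so far.
                if ($buffer !== '') {
                    $pieces[] = $buffer;
                }
                $buffer = '<?';
                $inPhp  = true;
                $i++;
            } else {
                $buffer .= $char;
            }
            continue;
        }

        if ($quote !== null) {
            // Inside a string: only look for escapes and the matching quote.
            $buffer .= $char;
            if ($char === '\\' && $i + 1 < $length) {
                $buffer .= $input[++$i];  // copy the escaped character as-is
            } elseif ($char === $quote) {
                $quote = null;
            }
            continue;
        }

        if ($char === '"' || $char === "'") {
            $quote   = $char;
            $buffer .= $char;
        } elseif ($pair === '?>') {
            // Closing tag outside any string: the PHP block ends here.
            $pieces[] = $buffer . '?>';
            $buffer   = '';
            $inPhp    = false;
            $i++;
        } else {
            $buffer .= $char;
        }
    }
    if ($buffer !== '') {
        $pieces[] = $buffer;
    }
    return $pieces;
}

print_r(splitByScanning('html"<?php?>"html<?php"?>"php?>html'));
print_r(splitByScanning('html"<?php?>"html<?php"?><?"php?>html ?> end'));
```

On the two test strings from this thread, this yields the pieces listed under "Output should be" and "Should be" above. Scanning byte by byte like this is safe for UTF-8, since none of the characters being tested can occur inside a multibyte sequence; that is not necessarily true for encodings such as Shift_JIS, where a backslash byte can be the second half of a character.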

Hmm, I've adapted a class I wrote for parsing PHP code so that it works as requested (I hope). It works the way lococobra deduced: the result is an array where even elements contain HTML code and odd ones PHP code. I've only run a few tests, so I can't guarantee anything :)

 

<?php
include_once 'cl.split.code.php';  
$code=file_get_contents('some_mixed_code.php');

$hcode = new lh_splitCode() ;
$hcode->lh_splitting( $code ) ;
print_r( $hcode->lh_get_code() );
?>

 

 

Here's the class:

 


<?php 

/**
* Andrea Ponzi, b 1.0, 23/07/2007
*
*/


class lh_splitCode {

var $original_code ;
var $hliteCode ;
var $parsedCode ;
var $endphptag='[ENDPHPTAG]';
var $re_open_tag_php = '/(?>^(.*?)<\?(?:(?i)php)?(.*)$)/sS' ;
var $re_parse_mixed_code='/(?:("|\')(?:(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*)(\1|$))|(?:(?:#|\/\/)(?m-s).*\r?\n)|(?:\/\*.*?(?:\*\/|$))|(?:\?>.*$)|<\?/sS';	

function __lh_initialize($code)
  	{
  		$this->original_code = $code ;
	$this->hliteCode = $this->original_code ;
	$this->parsedCode = array() ;
  	}	
  	
  	function lh_splitting( $code=false )
{
	$this->__lh_initialize($code);

	if ($this->original_code==false) 
		return false;

	$this->__lh_parsing_code();

	for($i=1,$c=count($this->parsedCode);$i<$c;$i+=2)
	$this->parsedCode[$i]=str_replace('[OPENPHP]','<?',$this->parsedCode[$i]);
}

function lh_get_code(){ return $this->parsedCode ; }	

function __lh_parsing_code(){

		while(preg_match($this->re_open_tag_php, $this->hliteCode, $mth)){
			$this->parsedCode[] = $mth[1] ;
			$this->hliteCode = preg_replace_callback
							  (
								$this->re_parse_mixed_code
							   ,array( &$this,'__lh_parsing_engine_cback' )
							   ,$mth[2]
							);

	if ( strpos($this->hliteCode,$this->endphptag)!==false )
			{
				$tmp = explode($this->endphptag, $this->hliteCode) ;
				$this->parsedCode[] = $tmp[0] ;
				$this->hliteCode = $tmp[1] ;
			}

		}

	if (trim($this->hliteCode)!='')
		$this->parsedCode[] = $this->hliteCode ;
}

function __lh_parsing_engine_cback($mths) 
{
	if( $mths[0]=='' ) return '';
	if( $mths[0]=='<?' ) return '[OPENPHP]';

	$str=($mths[0][0]=='?')?$this->endphptag.substr($mths[0],2):$mths[0];
return $str ;
}	

}


 

EDIT: I forgot to say that the PHP tags are stripped during splitting, so they do not appear in the results.

 

 

Archived

This topic is now archived and is closed to further replies.
