[SOLVED] Extremely complicated regex/coding problem.


First an example:

<?php
$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
?>

 

Here's the problem: let's say this is the code for a web page and I'm trying to determine which parts are PHP and which parts are HTML. If I just split the string at every occurrence of <? or <?php or ?>, there are obviously going to be problems, because those tags can also appear inside PHP string literals.
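To make the failure concrete, here is a minimal sketch of the naive split applied to the example string above. The variable name $naive is only illustrative. The point is that the one PHP block is cut into several fragments, because the tags inside the quoted string are treated as real delimiters.

```php
<?php
// The example string from the first post.
$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';

// Naive approach: split wherever an opening or closing tag appears,
// without checking whether the tag sits inside a PHP string literal.
$naive = preg_split('/<\?(?:php)?|\?>/', $code);

// Five tags are found, so the text falls apart into six fragments,
// and the echo statement is no longer in one piece.
print_r($naive);
```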

 

I highly doubt this can be done with a single regular expression, though several might manage it. The first step seems to be detecting where the strings in $code are and ignoring those areas, but then again, what if the HTML contains something like...

<form method="POST" action="<?php echo$_SERVER['PHP_SELF']?>">

 

As you can see, if all string areas are ignored, some valid php code may also be ignored. Any ideas anyone?

 

Perhaps something like this? It was borrowed from this topic.

 

<pre>
<?php

$mixture = <<<MIX
		<html>
			<?php \$title = 'ABC?>'; ?>
			<head>
				<title><?php echo \$title; ?></title>
			</head>
			<body>
				Today is <?php \$date = getdate(); echo \$date['weekday']; ?>.
			</body>
		</html>
		<?php echo '<?php "test!" ?>'; ?>
MIX;

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
$num_pieces = count($pieces);
### Loop through and fix the matches.
for ($i = 0; $i < $num_pieces; $i++) {
	$piece = $pieces[$i];
	### Count the number of non-backslashed quotes.
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	### Always add the current piece being processed.
	$revised_pieces[$i] = $piece;
	### If the quotes are uneven...
	if ($quotes % 2) {
		### Split apart the next piece.
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		### Add the missing end to this piece.
		$revised_pieces[$i] .= $before;
		### Add the rest to the next piece.
		$revised_pieces[$i+1] = $after;
		### Skip processing of the next piece.
		$i++;
	}
}
print_r($revised_pieces);	
?>
</pre>

It doesn't work completely. Most of the time it does, but not in a really brutal case like:

 

html"<?php?>"html<?php"?>"php?>html

 

Gets turned into...

 

Array
(
    [0] => html"<?php?>
    [1] => "html<?php"?>
    [2] => "php?>html
)

 

Output should be...

 

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>"php?>
    [4] => html
)

If it works for that one, it should work for anything... I do have an example but I'm not exactly sure what parts of it cause failure and it's about 300 lines long.

 

One thing that I know for sure needs to be modified is that

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

 

Should be changed to

$pieces = preg_split('/(<\?.+?\?>)/s', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

 

However, changing that does not fix the problems.
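For reference, a small sketch of what the /s modifier changes. The sample string $block is made up for illustration. Without /s the dot does not match a newline, so a PHP block that spans several lines is never matched at all.

```php
<?php
// A made-up snippet whose PHP block spans several lines.
$block = "<html><?php\n\$x = 1;\n?></html>";

// Without /s, "." stops at the first newline, so no match is found.
$withoutS = preg_match('/<\?.+?\?>/', $block);

// With /s, "." also matches newlines, so the whole block is matched.
$withS = preg_match('/<\?.+?\?>/s', $block, $m);

var_dump($withoutS, $withS, $m[0]);
```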

After a quick look, adding the following before the if ($quotes % 2) { works; I'm not sure how solid this is yet...

 

### There's no need to analyze the quotes if we're not in PHP.
if (strpos($piece, '<?') === FALSE) {
	continue;
}

Here's an example showing that the function above still doesn't work correctly.

 

<?php
function findPHP($input){
$pieces = preg_split('/(<\?.+?\?>)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
for($i=0;$i<count($pieces);$i++){
	$piece = $pieces[$i];
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	$revised_pieces[$i] = $piece;
	if (strpos($piece, '<?') === FALSE)
		continue;
	if ($quotes % 2) {
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		$revised_pieces[$i] .= $before;
		$revised_pieces[$i+1] = $after;
		$i++;
	}
}
foreach($revised_pieces as $piece)
	if(strlen($piece)!=0)
		$output[] = $piece;
return $output;
}

print_r(findPHP('html"<?php?>"html<?php"?><?"php?>html ?> end'));
?>

 

Output is:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>
    [4] => <?"php?>html ?>
    [5] =>  end
)

 

Should be:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?><?"php?>
    [4] => html ?> end
)

One idea I had: if the data is parsed linearly, one can safely assume that the first <? encountered is valid. From that point on, even-numbered sections would be HTML and odd-numbered sections would be PHP (counting from 0). You could also safely discard the contents of all strings inside PHP code blocks. The only thing I can't figure out is how to discard a string's contents only when it's known to be within a PHP block.
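One possible way to sketch that idea is to let the pattern itself start at <? and consume each quoted string as a single unit, so a ?> is only accepted when it sits outside a string. Quotes in the HTML parts are never examined, because the pattern only begins at an opening tag. The function name splitMixed is hypothetical, and this is a sketch under assumptions: it ignores comments, heredoc/nowdoc syntax, and unterminated strings.

```php
<?php
// Hypothetical helper, not from the thread: split on PHP blocks whose
// quoted strings are consumed as whole units.
function splitMixed($input) {
    $block = '~(<\?(?:[^"\'?]++'            // ordinary code characters
           . '|"(?:[^"\\\\]|\\\\.)*+"'       // a double-quoted string
           . '|\'(?:[^\'\\\\]|\\\\.)*+\''    // a single-quoted string
           . '|\?(?!>)'                      // a "?" that does not close the block
           . ')*+\?>)~s';
    // Keep the matched PHP blocks in the result and drop empty pieces.
    return preg_split($block, $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

print_r(splitMixed('html"<?php?>"html<?php"?><?"php?>html ?> end'));
```

With PREG_SPLIT_NO_EMPTY the even/odd alternation is not guaranteed (for example when the input starts with a PHP block), so callers that rely on index parity would need to drop that flag.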

 

Here's some code I was working on, bit of a brute force, but it may be the only way to do it...

<?php
function findPHP($input){
$output = $input;
$splitLoc = array();
for($i=0;strpos($input, '<?')!==FALSE;$i++){
	if(intval($i/2)==($i/2)){
		//Even = Html
		$splitLoc[] = strpos($input, '<?');
		$input = substr_replace($input,'xx',strpos($input,'<?'),2);
	} else {
		//Odd = PHP
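		// Collect every quoted string so its contents can be masked out.
		preg_match_all('/(["\'])(.*?)(?<!\\\)\1/s', $input, $strings);
		$strings = $strings[0];
		foreach($strings as $string){
			// Mask the string with an equal-length run of "x" so offsets stay valid.
			$replacement = str_repeat('x', strlen($string));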
			$input = substr_replace($input, $replacement, strpos($input, $string), strlen($string));
		}
		$splitLoc[] = strpos($input, '?>');
		$input = substr_replace($input,'xx',strpos($input,'?>'),2);
	}
//Magic happens here...
}
return $output;
}
?>

 

I just can't seem to fit all the pieces together.

Throw some more tests at this:

 

<pre>
<?php

function findPHP($input){
	$pieces = preg_split('/(<\?.+?\?>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
	//my_print_r($pieces);
	$revised_pieces = array();
	$num_pieces = count($pieces);
	### Loop through and fix the matches.
	for ($i = 0; $i < $num_pieces; $i++) {
		$piece = $pieces[$i];
		### Count the number of non-backslashed quotes.
		$quotes = 0;
		preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
		### Always add the current piece being processed.
		$revised_pieces[$i] = $piece;
		### If we're in PHP and the quotes are uneven...
		if (strpos($piece, '<?') !== FALSE && $quotes % 2) {
			### Split apart the next piece.
			list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
			### Add the missing end to this piece.
			$revised_pieces[$i] .= $before;
			### Add the rest to the next piece if it's not empty.
			if (! empty($after)) {
				$revised_pieces[$i+1] = $after;
			}
			### Skip processing of the next piece.
			$i++;
		}
	}
	return $revised_pieces;
}

function my_print_r($array) {
	foreach ($array as $key => &$value) {
		$value = htmlspecialchars($value);
	}
	print_r($array);
}

$tests = array(
	'html"<?php?>"html<?php"?>"php?>html',
	'html"<?php?>"html<?php"?><?"php?>html ?> end',
);

foreach ($tests as $test) {
	my_print_r(findPHP($test));
}
?>
</pre>

I still haven't had time to move forward on this. Basically, you cannot use a "big picture" regex to count the quotes due to nesting. You've got to step through each character to determine when you're in a set of quotes. I'll post something if I get around to it. Are you using any multibyte encodings?
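A minimal sketch of that character-by-character approach could look like the following. The function name scanSegments and its state variables are hypothetical, and the sketch deliberately ignores comments and heredoc/nowdoc syntax. It works on bytes, which should be safe for UTF-8 input because the bytes of a multibyte character never coincide with ASCII characters such as quotes or angle brackets. Other multibyte encodings are not covered.

```php
<?php
// Hypothetical helper: walk the input one byte at a time and track
// whether we are in HTML, in PHP code, or inside a PHP string literal.
function scanSegments($input) {
    $segments = array();
    $buffer   = '';
    $inPhp    = false;  // true while between <? and ?>
    $quote    = null;   // quote character of the currently open string, if any
    $len      = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $ch   = $input[$i];
        $pair = substr($input, $i, 2);

        if (!$inPhp) {
            if ($pair === '<?') {
                // Flush the HTML collected so far and enter PHP mode.
                if ($buffer !== '') { $segments[] = $buffer; }
                $buffer = '<?';
                $inPhp  = true;
                $i++;
            } else {
                $buffer .= $ch;
            }
            continue;
        }

        if ($quote !== null) {
            // Inside a string: tags are ignored until the closing quote.
            $buffer .= $ch;
            if ($ch === '\\' && $i + 1 < $len) {
                $buffer .= $input[++$i];  // copy the escaped character as is
            } elseif ($ch === $quote) {
                $quote = null;
            }
            continue;
        }

        if ($ch === '"' || $ch === "'") {
            $quote   = $ch;
            $buffer .= $ch;
        } elseif ($pair === '?>') {
            // A closing tag outside any string ends the PHP block.
            $segments[] = $buffer . '?>';
            $buffer = '';
            $inPhp  = false;
            $i++;
        } else {
            $buffer .= $ch;
        }
    }
    if ($buffer !== '') { $segments[] = $buffer; }
    return $segments;
}

print_r(scanSegments('html"<?php?>"html<?php"?><?"php?>html ?> end'));
```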

Hmm, I've adapted a class I wrote for parsing PHP code so that it works as requested (I hope). It works the way lococobra deduced: the result is an array where even elements contain HTML code and odd ones contain PHP code. I've only run a few tests, so I can't guarantee anything :)

 

<?php
include_once 'cl.split.code.php';  
$code=file_get_contents('some_mixed_code.php');

$hcode = new lh_splitCode() ;
$hcode->lh_splitting( $code ) ;
print_r( $hcode->lh_get_code() );
?>

 

 

Here is the class:

 


<?php 

/**
* Andrea Ponzi, b 1.0, 23/07/2007
*
*/


class lh_splitCode {

var $original_code ;
var $hliteCode ;
var $parsedCode ;
var $endphptag='[ENDPHPTAG]';
var $re_open_tag_php = '/(?>^(.*?)<\?(?:(?i)php)?(.*)$)/sS' ;
var $re_parse_mixed_code='/(?:("|\')(?:(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*)(\1|$))|(?:(?:#|\/\/)(?m-s).*\r?\n)|(?:\/\*.*?(?:\*\/|$))|(?:\?>.*$)|<\?/sS';	

function __lh_initialize($code)
  	{
  		$this->original_code = $code ;
	$this->hliteCode = $this->original_code ;
	$this->parsedCode = array() ;
  	}	
  	
  	function lh_splitting( $code=false )
{
	$this->__lh_initialize($code);

	if ($this->original_code==false) 
		return false;

	$this->__lh_parsing_code();

	for($i=1,$c=count($this->parsedCode);$i<$c;$i+=2)
	$this->parsedCode[$i]=str_replace('[OPENPHP]','<?',$this->parsedCode[$i]);
}

function lh_get_code(){ return $this->parsedCode ; }	

function __lh_parsing_code(){

		while(preg_match($this->re_open_tag_php, $this->hliteCode, $mth)){
			$this->parsedCode[] = $mth[1] ;
			$this->hliteCode = preg_replace_callback
							  (
								$this->re_parse_mixed_code
							   ,array( &$this,'__lh_parsing_engine_cback' )
							   ,$mth[2]
							);

	if ( strpos($this->hliteCode,$this->endphptag)!==false )
			{
				$tmp = explode($this->endphptag, $this->hliteCode) ;
				$this->parsedCode[] = $tmp[0] ;
				$this->hliteCode = $tmp[1] ;
			}

		}

	if (trim($this->hliteCode)!='')
		$this->parsedCode[] = $this->hliteCode ;
}

function __lh_parsing_engine_cback($mths) 
{
	if( $mths[0]=='' ) return '';
	if( $mths[0]=='<?' ) return '[OPENPHP]';

	// Use [0] for the string offset; the {0} form was removed in PHP 8.
	$str=($mths[0][0]=='?')?$this->endphptag.substr($mths[0],2):$mths[0];
	return $str ;
}	

}


 

EDIT: I forgot to say that the PHP tags are stripped during splitting, so they do not appear in the results.
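If the tags are needed again later, a rough sketch like the following could re-wrap the odd elements. The function name rebuildSource is hypothetical, and it assumes the strict even = HTML, odd = PHP alternation described above. Since the original tag form (<? or <?php) is not kept in the array, every block comes back as <?php.

```php
<?php
// Hypothetical helper: rebuild source text from an alternating HTML/PHP array.
function rebuildSource(array $pieces) {
    $out = '';
    foreach ($pieces as $i => $piece) {
        // Even index: HTML, copied as is. Odd index: PHP, re-wrapped in tags.
        $out .= ($i % 2 === 0) ? $piece : '<?php ' . $piece . ' ?>';
    }
    return $out;
}

echo rebuildSource(array('<b>', 'echo 1;', '</b>'));
```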

 

 
