[SOLVED] Extremely complicated regex/coding problem.

lococobra · July 16, 2007

First an example:

<?php
$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
?>

Here's the problem: let's say this is the code for a web page and I'm trying to determine which parts are PHP and which parts are HTML. If I just split the string at every occurrence of <? or <?php or ?>, obviously there are going to be problems, because those sequences can also appear inside quoted strings.

I highly doubt this could be done in a single regular expression, but maybe with several. The first step seems to be to detect where the strings are in $code and ignore those areas, but then again, what if the HTML contains something like...

<form method="POST" action="<?php echo$_SERVER['PHP_SELF']?>">

As you can see, if all string areas are ignored, some valid PHP code may also be ignored. Any ideas, anyone?
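
To make the failure concrete, here is a minimal sketch of the naive split described above, run on the example string from the top of the post. The helper name naiveSplit is hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: split on every tag-like sequence, ignoring context.
function naiveSplit($code) {
    // Alternation order matters: the longer '<?php' is tried before '<?'.
    return preg_split('/<\?php|<\?|\?>/', $code);
}

$code = 'Some html here <? echo\'PHP code always starts with <? or <?php, and ends with ?>\'; ?> more html';
$parts = naiveSplit($code);

// The quoted string contains three tag-like sequences, so the naive split
// produces more pieces than the three parts that are really there.
echo count($parts), "\n";
```

This prints 6, although the input really consists of only three parts: HTML, one PHP block, and HTML again.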

Wildbug · July 16, 2007

http://www.cs.vu.nl/~dick/PTAPG.html

effigy · July 16, 2007

Perhaps something like this? It was borrowed from this topic.

<pre>
<?php

$mixture = <<<MIX
		<html>
			<?php \$title = 'ABC?>'; ?>
			<head>
				<title><?php echo \$title; ?></title>
			</head>
			<body>
				Today is <?php \$date = getdate(); echo \$date['weekday']; ?>.
			</body>
		</html>
		<?php echo '<?php "test!" ?>'; ?>
MIX;

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
$num_pieces = count($pieces);
### Loop through and fix the matches.
for ($i = 0; $i < $num_pieces; $i++) {
	$piece = $pieces[$i];
	### Count the number of non-backslashed quotes.
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	### Always add the current piece being processed.
	$revised_pieces[$i] = $piece;
	### If the quotes are uneven...
	if ($quotes % 2) {
		### Split apart the next piece.
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		### Add the missing end to this piece.
		$revised_pieces[$i] .= $before;
		### Add the rest to the next piece.
		$revised_pieces[$i+1] = $after;
		### Skip processing of the next piece.
		$i++;
	}
}
print_r($revised_pieces);	
?>
</pre>

lococobra · July 16, 2007

It does not work completely... most of the time it does, but not in a really brutal case like..

html"<?php?>"html<?php"?>"php?>html

Gets turned into...

Array
(
    [0] => html"<?php?>
    [1] => "html<?php"?>
    [2] => "php?>html
)

Output should be...

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>"php?>
    [4] => html
)

effigy · July 16, 2007

Do you have any realistic examples that cause problems?

lococobra · July 16, 2007

If it works for that one, it should work for anything... I do have an example but I'm not exactly sure what parts of it cause failure and it's about 300 lines long.

One thing that I know for sure needs to be modified is that

$pieces = preg_split('/(<\?.+?\?>)/', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

Should be changed to

$pieces = preg_split('/(<\?.+?\?>)/s', $mixture, -1, PREG_SPLIT_DELIM_CAPTURE);

However, changing that does not fix the problems.
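
For reference, a small sketch of what the /s modifier changes: without it, the dot does not match a newline, so a PHP block that spans several lines is never matched by the pattern. The sample string here is made up for the demonstration.

```php
<?php
// A made-up PHP block that spans two lines.
$mixture = "<html><?php\necho 'hi'; ?></html>";

// Without /s the dot stops at "\n", so the lazy match cannot cross the line break.
$without = preg_match('/<\?.+?\?>/', $mixture);

// With /s (PCRE_DOTALL) the dot matches newlines as well.
$with = preg_match('/<\?.+?\?>/s', $mixture, $m);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```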

effigy · July 16, 2007

After a quick look, adding the following before the if ($quotes % 2) { works; I'm not sure how solid this is yet...

### There's no need to analyze the quotes if we're not in PHP.
if (strpos($piece, '<?') === FALSE) {
	continue;
}

lococobra · July 17, 2007

Here's an example that shows how the above function is still not functioning correctly.

<?php
function findPHP($input){
$pieces = preg_split('/(<\?.+?\?>)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
$revised_pieces = array();
for($i=0;$i<count($pieces);$i++){
	$piece = $pieces[$i];
	$quotes = 0;
	preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
	$revised_pieces[$i] = $piece;
	if (strpos($piece, '<?') === FALSE)
		continue;
	if ($quotes % 2) {
		list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
		$revised_pieces[$i] .= $before;
		$revised_pieces[$i+1] = $after;
		$i++;
	}
}
foreach($revised_pieces as $piece)
	if(strlen($piece)!=0)
		$output[] = $piece;
return $output;
}

print_r(findPHP('html"<?php?>"html<?php"?><?"php?>html ?> end'));
?>

Output is:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?>
    [4] => <?"php?>html ?>
    [5] =>  end
)

Should be:

Array
(
    [0] => html"
    [1] => <?php?>
    [2] => "html
    [3] => <?php"?><?"php?>
    [4] => html ?> end
)

lococobra · July 17, 2007

One idea I had was that, if the data is parsed linearly, one could safely assume that the first <? encountered is valid. From that point on, even-numbered sections would be HTML and odd-numbered sections would be PHP (counting from 0). Also, you could safely assume that the contents of all strings inside PHP code blocks can be discarded. The only thing I can't seem to figure out is how to discard a string's contents only when it's known to be within a PHP block.

Here's some code I was working on, bit of a brute force, but it may be the only way to do it...

<?php
function findPHP($input){
$output = $input;
$splitLoc = array();
for($i=0;strpos($input, '<?')!==FALSE;$i++){
	if(intval($i/2)==($i/2)){
		//Even = Html
		$splitLoc[] = strpos($input, '<?');
		$input = substr_replace($input,'xx',strpos($input,'<?'),2);
	} else {
		//Odd = PHP
		preg_match_all('/(["\'])(.*?)(?<!\\\)\1/s', $input, $strings);
		$strings = $strings[0];
		foreach($strings as $string){
			// Mask the string's contents so tags inside it are not seen.
			$replacement = str_repeat('x', strlen($string));
			$input = substr_replace($input, $replacement, strpos($input, $string), strlen($string));
		}
		$splitLoc[] = strpos($input, '?>');
		$input = substr_replace($input,'xx',strpos($input,'?>'),2);
	}
//Magic happens here...
}
return $output;
}
?>

I just can't seem to fit all the pieces together.
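
One way the pieces might fit together is a single linear pass that alternates between an HTML mode and a PHP mode and only tracks quotes while in PHP mode. The following is a minimal sketch under that assumption, not a finished parser: the function name splitLinear is hypothetical, PHP pieces keep their tags, and comments, heredocs and backticks are not handled.

```php
<?php
// Hypothetical sketch: returns pieces where even indexes are HTML and odd
// indexes are PHP blocks (tags included). Comments and heredocs are ignored.
function splitLinear($input) {
    $pieces = array();
    $pos = 0;
    $len = strlen($input);
    while ($pos < $len) {
        // HTML mode: everything up to the next opening tag is HTML.
        $open = strpos($input, '<?', $pos);
        if ($open === false) {
            $pieces[] = substr($input, $pos);
            return $pieces;
        }
        $pieces[] = substr($input, $pos, $open - $pos);
        // PHP mode: step through the characters, tracking the open quote.
        $quote = '';
        for ($i = $open + 2; $i < $len; $i++) {
            $ch = $input[$i];
            if ($quote !== '') {
                if ($ch === '\\') {
                    $i++;        // skip the escaped character
                } elseif ($ch === $quote) {
                    $quote = ''; // string closed
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;    // string opened
            } elseif ($ch === '?' && $i + 1 < $len && $input[$i + 1] === '>') {
                $i += 2;         // keep the closing tag in the PHP piece
                break;
            }
        }
        $i = min($i, $len);
        $pieces[] = substr($input, $open, $i - $open);
        $pos = $i;
    }
    return $pieces;
}

print_r(splitLinear('html"<?php?>"html<?php"?>"php?>html'));
```

On the two brutal test strings from earlier in the thread this returns the arrays listed under "Output should be" and "Should be". An HTML piece can be an empty string when two PHP blocks are adjacent, which keeps the even/odd rule intact.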

effigy · July 17, 2007

Throw some more tests at this:

<pre>
<?php

function findPHP($input){
	$pieces = preg_split('/(<\?.+?\?>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
	//my_print_r($pieces);
	$revised_pieces = array();
	$num_pieces = count($pieces);
	### Loop through and fix the matches.
	for ($i = 0; $i < $num_pieces; $i++) {
		$piece = $pieces[$i];
		### Count the number of non-backslashed quotes.
		$quotes = 0;
		preg_replace('/(?<!\\\)["\']/', '', $piece, -1, $quotes);
		### Always add the current piece being processed.
		$revised_pieces[$i] = $piece;
		### If we're in PHP and the quotes are uneven...
		if (strpos($piece, '<?') !== FALSE && $quotes % 2) {
			### Split apart the next piece.
			list($before, $after) = preg_split('/(?<=\?>)/', $pieces[$i+1]);
			### Add the missing end to this piece.
			$revised_pieces[$i] .= $before;
			### Add the rest to the next piece if it's not empty.
			if (! empty($after)) {
				$revised_pieces[$i+1] = $after;
			}
			### Skip processing of the next piece.
			$i++;
		}
	}
	return $revised_pieces;
}

function my_print_r($array) {
	foreach ($array as $key => &$value) {
		$value = htmlspecialchars($value);
	}
	print_r($array);
}

$tests = array(
	'html"<?php?>"html<?php"?>"php?>html',
	'html"<?php?>"html<?php"?><?"php?>html ?> end',
);

foreach ($tests as $test) {
	my_print_r(findPHP($test));
}
?>
</pre>

lococobra · July 17, 2007

I was hopeful after seeing that the test lines had worked, but other tests are still showing failures. I can email you the test I'm running if you want.

lococobra · July 23, 2007

Just bumping this because the problem is still unsolved.

effigy · July 23, 2007

I still haven't had time to move forward on this. Basically, you cannot use a "big picture" regex to count the quotes due to nesting. You've got to step through each character to determine when you're in a set of quotes. I'll post something if I get around to it. Are you using any multibyte encodings?

rea|and · July 23, 2007

Hmm, I've adapted a class I wrote for parsing PHP code so that it works as requested (I hope). It works the way lococobra deduced: the result is an array where even elements contain HTML code and odd ones PHP code. I've only run a few tests, so I can't guarantee anything.

<?php
include_once 'cl.split.code.php';  
$code=file_get_contents('some_mixed_code.php');

$hcode = new lh_splitCode() ;
$hcode->lh_splitting( $code ) ;
print_r( $hcode->lh_get_code() );
?>

Here is the class:


<?php 

/**
* Andrea Ponzi, b 1.0, 23/07/2007
*
*/


class lh_splitCode {

var $original_code ;
var $hliteCode ;
var $parsedCode ;
var $endphptag='[ENDPHPTAG]';
var $re_open_tag_php = '/(?>^(.*?)<\?(?:(?i)php)?(.*)$)/sS' ;
var $re_parse_mixed_code='/(?:("|\')(?:(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*)(\1|$))|(?:(?:#|\/\/)(?m-s).*\r?\n)|(?:\/\*.*?(?:\*\/|$))|(?:\?>.*$)|<\?/sS';	

function __lh_initialize($code)
  	{
  		$this->original_code = $code ;
	$this->hliteCode = $this->original_code ;
	$this->parsedCode = array() ;
  	}	
  	
  	function lh_splitting( $code=false )
{
	$this->__lh_initialize($code);

	if ($this->original_code==false) 
		return false;

	$this->__lh_parsing_code();

	for($i=1,$c=count($this->parsedCode);$i<$c;$i+=2)
	$this->parsedCode[$i]=str_replace('[OPENPHP]','<?',$this->parsedCode[$i]);
}

function lh_get_code(){ return $this->parsedCode ; }	

function __lh_parsing_code(){

		while(preg_match($this->re_open_tag_php, $this->hliteCode, $mth)){
			$this->parsedCode[] = $mth[1] ;
			$this->hliteCode = preg_replace_callback
							  (
								$this->re_parse_mixed_code
							   ,array( &$this,'__lh_parsing_engine_cback' )
							   ,$mth[2]
							);

	if ( strpos($this->hliteCode,$this->endphptag)!==false )
			{
				$tmp = explode($this->endphptag, $this->hliteCode) ;
				$this->parsedCode[] = $tmp[0] ;
				$this->hliteCode = $tmp[1] ;
			}

		}

	if (trim($this->hliteCode)!='')
		$this->parsedCode[] = $this->hliteCode ;
}

function __lh_parsing_engine_cback($mths) 
{
	if( $mths[0]=='' ) return '';
	if( $mths[0]=='<?' ) return '[OPENPHP]';

	$str=($mths[0][0]=='?')?$this->endphptag.substr($mths[0],2):$mths[0];
return $str ;
}	

}

EDIT: I forgot to say that the PHP tags are split off, so they are not in the results.
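
For anyone wondering how this works: as far as can be read from the code, the core trick is a single regex with several alternatives, applied with preg_replace_callback() to the text after an opening tag. Quoted strings (and, in the class, comments) are matched first and returned unchanged, so a ?> inside them is never seen by the closing-tag branch. Below is a stripped-down sketch of just that idea. The function name markCloseTag and the simplified pattern are assumptions for illustration, not the class's exact regex, and comments are not handled.

```php
<?php
// Hypothetical sketch: mark the first closing tag that is not inside a
// quoted string. $body is the text that follows an opening tag.
function markCloseTag($body, $marker = '[ENDPHPTAG]') {
    // Alternatives are tried left to right at each position, so a quoted
    // string is consumed whole before the closing-tag branch can look inside it.
    // The last branch also swallows the rest of the input, so only the first
    // real closing tag is marked.
    $re = '/"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'|\?>.*$/s';
    return preg_replace_callback($re, function ($m) use ($marker) {
        if ($m[0][0] === '?') {
            // Closing tag: drop the tag itself and mark where PHP ends.
            return $marker . substr($m[0], 2);
        }
        return $m[0]; // quoted string: leave unchanged
    }, $body);
}

$marked = markCloseTag(' echo "a ?> b"; ?> tail');
list($php, $html) = explode('[ENDPHPTAG]', $marked, 2);
var_dump($php, $html);
```

Here $php ends up holding the code including the quoted "a ?> b", and $html holds " tail". The class then repeats this on the remaining text until no opening tag is left.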

lococobra · July 25, 2007

Awesome code, no idea how it works... but it does.
