
Match between


The Little Guy


I am writing a CSS highlighter. It works, except in a case like the one below, where there is a colon both outside and inside the braces. The part outside the braces gets all messed up and displays the HTML.

 

example (bottom of page):

http://phplive.org/phpLive/examples/misc/highlight.php

 

a.link:hover{
text-decoration: underline;
}

 

Here is what I have so far:

$find = array(
"/([a-zA-Z-].+?)(:)/",
"/'.+?'/",
'/".+?"/',
"/([.#:>a-zA-Z0-9].+?)(\{)/",
);
$replace = array(
'<span style="color:#0000ff;font-weight:bold;">$1</span>$2',
'<span style="color:#ce7b00;">$0</span>',
'<span style="color:#ce7b00;">$0</span>',
'<span style="color:#007c00;font-weight:bold;">$1</span>$2',
);
$this->quickString = preg_replace($find, $replace, htmlentities($content, ENT_QUOTES));

 

What I am thinking of doing (for the first pattern in the array) is telling it to match only between { and } and to ignore everything else, but I am not sure how to do that. How can I do that? If that isn't a good way to do it, do you have any better suggestions for me?

 

Thanks!
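
One way to limit the property pattern to the text between { and } is a two-pass replace: an outer preg_replace_callback() finds each brace-delimited body, and an inner preg_replace() highlights property names only inside it. This is a minimal sketch, not the code from the library: highlightProperties() is a hypothetical name, and it assumes no nested braces (so no @media blocks) and no braces or semicolons inside strings or comments.

```php
<?php
// Hypothetical helper (not from the thread): highlight property names
// only inside { ... } blocks, leaving selectors such as a.link:hover alone.
function highlightProperties($css)
{
    // Escape first, as the original code does with htmlentities().
    $escaped = htmlentities($css, ENT_QUOTES);

    // Outer pass: find each brace-delimited body (assumes no nesting).
    return preg_replace_callback('/\{([^{}]*)\}/', function ($m) {
        // Inner pass: a property name is a run of letters or dashes that
        // starts a declaration (start of the body, or right after a ;)
        // and is followed by a colon.
        $body = preg_replace(
            '/(^|;)(\s*)([a-zA-Z-]+)(\s*:)/',
            '$1$2<span style="color:#0000ff;font-weight:bold;">$3</span>$4',
            $m[1]
        );
        return '{' . $body . '}';
    }, $escaped);
}

echo highlightProperties("a.link:hover{\ntext-decoration: underline;\n}");
```

Because the inner pattern never sees the selector, the colon in a.link:hover is left alone.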


I have put together this regex using word boundaries to grab the desired text to replace.

 

$str = "a.link:hover{text-decoration: underline; color: #222; 
font-weight: bold;}";
$pattern = '~[^.]\b([a-zA-Z-]+?)\b(:)~';
$replacement = '<span style="color:#0000ff;font-weight:bold;">$1</span>$2';
echo preg_replace($pattern,$replacement,$str);

 

My only concern is that this will remove the opening brace of the CSS code, since the [^.] in front of the word boundary consumes it and the replacement does not put it back. I am working on a solution for that, but this can get you going for now.


1. Well, this regex took a little bit of trial and error. A word boundary (\b) matches only at a position with a word character (a letter, digit, or underscore) on one side and a non-word character, or the start or end of the string, on the other. Placed to the left of the name, it requires a non-word character immediately before it; placed to the right, it requires a non-word character immediately after it. Since the text you want to replace always sits between a space, curly bracket, colon, or semicolon, all of which are non-word characters, I knew word boundaries would match only those cases. I had to add [^.] at the beginning of the regex so it would not match a.link: the word boundary would otherwise see the period (.) followed by a word character (l) and match there, which we do not want.

2. I believe the [^.] is grabbing the tab before the CSS property. What you can do is capture this character and back-reference it in the replacement string.

 

$str = "a.link:hover{
	text-decoration: underline;
	color: #222;
}";
$pattern = '~([^.,])\b([a-zA-Z-]+?)\b(:)~';
$replacement = '$1<span style="color:#0000ff;font-weight:bold;">$2</span>$3';
echo preg_replace($pattern,$replacement,$str);

 

this should add the tab back into the string.

 

Edit: Thinking about this, I have made the regex a little more robust to also allow commas, for multiple element CSS..

 

$str = "a.link:hover, a.link:active{
	text-decoration: underline;
	color: #222;
}";
$pattern = '~([^.,])\b([a-zA-Z-]+?)\b(:)~';
$replacement = '$1<span style="color:#0000ff;font-weight:bold;">$2</span>$3';
echo preg_replace($pattern,$replacement,$str);
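
To see what the word boundaries and the leading character class actually let through, it can help to list what the second group captures. The sketch below assumes the pattern ends in (:), which the three-group replacement string and the later mention of "a word boundary and a colon" both suggest. capturedNames() is a hypothetical helper, not part of the thread's code.

```php
<?php
// Pattern from the post above, assuming the third group is (:).
$pattern = '~([^.,])\b([a-zA-Z-]+?)\b(:)~';

// Hypothetical helper: list what group 2 (the candidate property
// name) captures in a CSS string.
function capturedNames($pattern, $css)
{
    preg_match_all($pattern, $css, $matches);
    return $matches[2];
}

// Class selectors before the pseudo-class: only the two real
// property names are captured.
print_r(capturedNames(
    $pattern,
    "a.link:hover, a.link:active{\n\ttext-decoration: underline;\n\tcolor: #222;\n}"
));

// A bare element name before the pseudo-class: "li" is captured too,
// because a space followed by a word and a colon looks like a property.
print_r(capturedNames($pattern, "ul li:hover{color: red;}"));
```

The second call shows the kind of edge case that is hard to rule out with a single pattern: nothing in the regex knows whether it is inside or outside the braces.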


It seems to be working!

 

Take a look: http://phplive.org/phpLive/examples/misc/highlight.php

Let me know what you think.

 

Little off topic:

I can now style CSS within the HTML!

 

 

Thanks for the help! You're awesome!

 

It looks to be working nicely! If I think of any improvements to add to the regex I will post them on this thread.


Hey guys!

 

If I think of any improvements to add to the regex I will post them on this thread.

 

Without reading the details, a couple of thoughts about the expression itself in the spirit of exploration and fine-tuning. (Nothing wrong with AyKay's expression!)

 

$pattern = '~([^.,])\b([a-zA-Z-]+?)\b(:)~';

 

1. You can drop the "lazy quantifier" (?), as there is no risk that the character class will ever roll over what follows (a word boundary and a colon). You can be greedy here, the engine will work a little faster as lazy matching involves checking ahead and backtracking.

2. I've read that case insensitive is a little faster than [a-zA-Z], not that you would notice the difference if you ran the code a million times. ;)

3. You could make the quantifier possessive by adding a plus, it will fail a little faster.

With those in, you get:

$pattern = '~([^.,])\b([a-z-]++)\b(:)~i';

The word boundaries are forcing the string in [a-z-]+ to start and end with a letter (it cannot start or end with a dash). Assuming this is what you want.

 

I haven't read the thread in detail so I don't know how the regex performs for the task at hand. These are just optional tweaks for the regex itself (which is already a fine regex as it is).
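
A quick way to check that these tweaks change only how the engine gets to the match, not what it matches, is to run the lazy, greedy, and possessive variants over the same input and compare the output. This is a sketch on one sample string only, and it assumes the patterns end in (:); it says nothing about speed.

```php
<?php
$css = "a.link:hover, a.link:active{\n\ttext-decoration: underline;\n\tcolor: #222;\n}";
$replacement = '$1<span style="color:#0000ff;font-weight:bold;">$2</span>$3';

// The three quantifier styles discussed above.
$variants = array(
    'lazy'       => '~([^.,])\b([a-zA-Z-]+?)\b(:)~',
    'greedy'     => '~([^.,])\b([a-zA-Z-]+)\b(:)~',
    'possessive' => '~([^.,])\b([a-z-]++)\b(:)~i',
);

$results = array();
foreach ($variants as $name => $pattern) {
    $results[$name] = preg_replace($pattern, $replacement, $css);
}

// Compare every variant against the lazy original.
foreach ($results as $name => $html) {
    echo $name, ': ', ($html === $results['lazy'] ? 'same' : 'different'), "\n";
}
```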

 

Wishing you all a fun weekend!

 

 


Hi TLG, walking out the door to go hiking, but wanted to give you a quick answer: find a table of HTML characters and look up the ASCII code for >. Let's say it's 65 in hex (it's not); then in the character class you can use \x65. If that doesn't work, it's probably an encoding story: you'll need the u modifier (for Unicode) at the end of the pattern, and someone should be able to help you. For Unicode, what you put in the class looks like this: \x{201A} (wrong code though)  :)

 


Taking this CSS:

p > a{
color: red;
}

 

and this Regex:

/([.#:>\-_, a-zA-Z0-9 ]+?)(\{)/

 

the p > doesn't get highlighted, but the a does get highlighted. Any thoughts why?

 

this:

 

$str = "p > a{
color: red;
}";
$pattern = '~([.#:>\-_,a-z0-9 ]+)({)~i';
$replacement = '<span style="color:#333;font-weight:bold;">$1</span>$2';
echo preg_replace($pattern,$replacement,$str);

 

works for me (tweaked it a tad).

 

Edit: edited for &gt; (htmlentities() turns > into &gt;, so the class needs both & and ;). You can also use the hex value if you wish, which would be \x{003E}, as playful suggested, but make sure the u modifier is appended to the regex.

 

$pattern = '~([.#:>\-_,a-z0-9&; ]+)({)~i';
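
The reason the p > part goes missing shows up once the string has been through htmlentities(): the > is now the entity &gt;, and the match can only extend leftward from the { as far as every character is in the class. A small sketch comparing a class without & and ; against one with both (the variable names are only for illustration):

```php
<?php
$css = "p > a{\ncolor: red;\n}";

// The highlighter escapes the CSS before running the regexes,
// so the selector pattern sees "&gt;" rather than ">".
$escaped = htmlentities($css, ENT_QUOTES);

// Class without & and ; : the match cannot reach back past "&gt;".
$narrow = '~([.#:>\-_,a-z0-9 ]+)(\{)~i';
preg_match($narrow, $escaped, $m1);

// Class with & and ; : the whole escaped selector is matched.
$wide = '~([.#:>\-_,a-z0-9&; ]+)(\{)~i';
preg_match($wide, $escaped, $m2);

var_dump($escaped); // "p &gt; a{" followed by the declaration lines
var_dump($m1[1]);   // " a"
var_dump($m2[1]);   // "p &gt; a"
```

Allowing ; in the selector class is safe here only because a newline or a } always separates one rule from the next selector in these samples.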


The Little Guy, you should use a lexer here. There are too many edge cases where regexes will not work. highlight.js is a nice ready-to-use highlighter that does lexical analysis.

 

I am not making a website that needs this, I am building a php library that has a highlight function in it (library link in signature, still in alpha stages but still very powerful).


Regardless of whether you're making a site or a library, regex is not the right tool for the job here.  You should essentially parse the CSS codes into tokens, then apply the formatting.  How accurate your parser needs to be depends on how accurate you want your highlighting to be.  A simple parser that separates out selectors, properties and values shouldn't be too hard to write to start with.

 


Regardless of whether you're making a site or a library, regex is not the right tool for the job here.  You should essentially parse the CSS codes into tokens, then apply the formatting.

 

I'm not quite sure what you mean here, could you explain? What do you mean by "parse the CSS codes into tokens"?

 

Edit:

 

After reading Wikipedia, it sounds like you're saying to make a dictionary and highlight according to the dictionary.


What do you mean by "parse the CSS codes into tokens"?

 

You break it down into its fundamental parts using a parser script (a.k.a. a lexer).

 

For example (quote for color):

p > a {

color: red;

}

 

p > a is a selector token

{ is a begin ruleset token

color is a property name token

red is a property value token

; is an end statement token

} is an end ruleset token

 

You would create a lexer that will break the css string down into tokens like that, then you can re-assemble the string from the tokens while applying whatever coloring or formatting you need around each token value.
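
The re-assembly half of that idea can be sketched on its own, with the token list written out by hand. The array shape, the type names, the extra 'colon' token, and renderTokens() are all assumptions made for this illustration; a real lexer would produce the list from the CSS string.

```php
<?php
// Hand-written tokens for "p > a{color:red;}" (whitespace left out
// to keep the list short). Type names here are illustrative only.
$tokens = array(
    array('type' => 'selector',       'value' => 'p > a'),
    array('type' => 'ruleset-begin',  'value' => '{'),
    array('type' => 'property-name',  'value' => 'color'),
    array('type' => 'colon',          'value' => ':'),
    array('type' => 'property-value', 'value' => 'red'),
    array('type' => 'statement-end',  'value' => ';'),
    array('type' => 'ruleset-end',    'value' => '}'),
);

$colors = array(
    'selector'       => '#007c00',
    'property-name'  => '#0000ff',
    'property-value' => '#ce7b00',
);

// Rebuild the string from the tokens, wrapping each styled type in a span.
function renderTokens(array $tokens, array $colors)
{
    $html = '';
    foreach ($tokens as $tok) {
        // Escape every token so > and & show up as text in the page.
        $text = htmlspecialchars($tok['value'], ENT_QUOTES);
        if (isset($colors[$tok['type']])) {
            $html .= '<span style="color:' . $colors[$tok['type']] . ';">' . $text . '</span>';
        } else {
            $html .= $text; // token types without a color pass through as is
        }
    }
    return $html;
}

echo renderTokens($tokens, $colors);
```

Since every character of the input belongs to exactly one token, stripping the spans from the output gives back the original CSS.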

 


Super simplistic example:


<?php

function tokenizeCss($str){
$tokens=array();
$len=strlen($str);
$i=0;

$state='selector';
$newState=null;


$tokenValue='';
while ($i<$len){
	$ch = $str[$i];
	switch ($ch){
		case '{':
			$tokens[] = array('type' => $state, 'value' => $tokenValue);
			$tokens[] = array('type' => 'ruleset-begin', 'value' => '{');
			$state='ruleset';
			$tokenValue='';
			break;
		case '}':
			$tokens[] = array('type' => $state, 'value' => $tokenValue);
			$tokens[] = array('type' => 'ruleset-end', 'value' => '}');
			$state='selector';
			$tokenValue='';
			break;
		default:
			$tokenValue .= $ch;
	}

	$i++;
}

if (!empty($tokenValue)){
	$tokens[] = array('type' => $state, 'value' => $tokenValue);
}

return $tokens;
}


$css = '
p > a{
color: red;
}

a.link:hover{
text-decoration: underline;
}
';

$tokens = tokenizeCss($css);

$colors=array(
'selector' => 'red',
'ruleset' => 'blue',
'ruleset-begin' => 'orange',
'ruleset-end' => 'orange'
);
foreach ($tokens as $tok){
$color = $colors[$tok['type']];
// Escape the token text so characters such as > display correctly in HTML
echo '<span style="color: '.$color.';">'.htmlspecialchars($tok['value']).'</span>';
}
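
A possible next step from here is to split each 'ruleset' token into property/value pairs. The sketch below does that with explode(); splitDeclarations() is a hypothetical helper, and it assumes no semicolons inside strings or url() values and no comments inside the block.

```php
<?php
// Hypothetical helper: split the text between { and } into
// property/value pairs.
function splitDeclarations($body)
{
    $declarations = array();
    foreach (explode(';', $body) as $decl) {
        // Skip the empty piece after the last semicolon.
        if (trim($decl) === '') {
            continue;
        }
        // Split on the first colon only, so colons inside the value
        // (for example in a URL) stay part of the value.
        $parts = explode(':', $decl, 2);
        if (count($parts) < 2) {
            continue; // not a "property: value" pair
        }
        $declarations[] = array(
            'property' => trim($parts[0]),
            'value'    => trim($parts[1]),
        );
    }
    return $declarations;
}

print_r(splitDeclarations("\ncolor: red;\nbackground: url(http://example.com/a.png);\n"));
```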

 


Here is what I have: http://phplive.org/phpLive/test.php

 

<?php
function tokenizeCss($str){
$tokens=array();
$len=strlen($str);
$i=0;
$state='selector';
$prevState = "";
$newState=null;
$tokenValue='';
$commenting = false;
$value = false;
while ($i<$len){
	switch ($str[$i]){
		case '{':
			if(!$commenting){
				$tokens[] = array('type' => $state, 'value' => $tokenValue);
				$tokens[] = array('type' => 'ruleset-begin', 'value' => '{');
				$state='ruleset';
				$tokenValue='';
			}else{
				$tokenValue .= $str[$i];
			}
			break;
		case '}':
			if(!$commenting){
				$tokens[] = array('type' => $state, 'value' => $tokenValue);
				$tokens[] = array('type' => 'ruleset-end', 'value' => '}');
				$state='selector';
				$tokenValue='';
			}else{
				$tokenValue .= $str[$i];
			}
			break;
		default:
			// Look one character ahead without reading past the end of the string
			$next = ($i + 1 < $len) ? $str[$i+1] : '';
			if(!$commenting && $str[$i].$next == "/*"){
				// Comment start: flush pending text and remember the state to return to
				if($tokenValue !== ''){
					$tokens[] = array('type' => $state, 'value' => $tokenValue);
				}
				$commenting = true;
				$prevState = $state;
				$state = "comment";
				$tokenValue = "/*";
				$i++;
			}elseif($commenting && $str[$i].$next == "*/"){
				// Comment end: emit the comment and restore the previous state
				$commenting = false;
				$tokens[] = array('type' => 'comment', 'value' => $tokenValue."*/");
				$state = $prevState;
				$tokenValue = "";
				$i++;
			}elseif(!$commenting && $state == "ruleset" && $str[$i] == ":"){
				// The property name ends at the first colon of a declaration
				$tokens[] = array('type' => "ruleset", 'value' => $tokenValue.":");
				$tokenValue = "";
				$state = "value";
			}elseif(!$commenting && $state == "value" && $str[$i] == ";"){
				// The value ends at the semicolon; expect a property name next
				$tokens[] = array('type' => "value", 'value' => $tokenValue.";");
				$tokenValue = "";
				$state = "ruleset";
			}else{
				$tokenValue .= $str[$i];
			}
	}
	$i++;
}
if (!empty($tokenValue)){
	$tokens[] = array('type' => $state, 'value' => $tokenValue);
}
return $tokens;
}

$css = 'p > a{
color: red;
}

a.link:hover{
text-decoration: underline;
}
/*
a.link:hover{
text-decoration: underline;
}
*/
p > a{
color: red;
}
a.link{
/*text-decoration: underline;*/
text-decoration: none;
}
';

$tokens = tokenizeCss($css);
//print_r($tokens);
$styles = array(
'selector' => 'font-weight: bold;color: #007c00',
'ruleset' => 'color: #0000ff;',
'ruleset-begin' => 'color: orange;',
'ruleset-end' => 'color: orange;',
'comment' => 'color: #999999;font-style: italic;',
'value' => 'color: #00ff00;'
);
echo "<pre>";
foreach ($tokens as $tok){
$style = $styles[$tok['type']];
// Escape the token text so > and & display correctly inside <pre>
echo '<span style="'.$style.'">'.htmlspecialchars($tok['value']).'</span>';
}
echo "</pre>";
