[SOLVED] Matching a whole word

jasonbullard · October 15, 2007

Guys,

I am having a little trouble with a regex that I am building. It is probably just because I have been working on it all night and can't think straight right now, but here is what I want.

I am building a syntax highlighter in PHP that works with regex and/or words. So, in order to match a word I have a string of

(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|
finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|protected|
ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|return|using|
decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|sizeof|while|double|lock|
stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)

Now, I know that I can use the \b word boundary, but some things just don't seem right. If I put it at the beginning and end like:

/\b(abstract|...)\b/

It highlights words inside other words. If I try to match "as", it will highlight the "as" in "case" instead of matching only the whole word. Are there any other modifiers that I can use? I have also tried putting each word in its own parentheses with a word boundary around it.

That works; however, it will not pick up anything that does not have whitespace around it. So if I have,

public event e = new event();

It will match the first event but not the second. I am really at a loss here; I know it can be done, but I am having a really difficult time with it right now.

Thanks for any help.

jasonbullard · October 15, 2007

Guess I just needed to write it somewhere else. I was adding a backslash to my word boundary.

This works perfectly.

#\b(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|
finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|
protected|ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|
return|using|decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|
sizeof|while|double|lock|stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)\b#
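
For readers following along, here is a minimal sketch of the \b-wrapped alternation, run against the statement from the first post. The shortened keyword list and the variable names are assumptions made for illustration; the full list should behave the same way as long as the pattern is written on one line.

```php
<?php
// Hypothetical, shortened keyword list; the full list from the post works the same way.
$keywords = array('as', 'case', 'event', 'new', 'public');

// Single-quoted string, so \b reaches PCRE unchanged as a word boundary.
$pattern = '#\b(' . implode('|', $keywords) . ')\b#';

$code = 'public event e = new event(); case 1: x as T;';
preg_match_all($pattern, $code, $matches);

// "as" is found only as a whole word, never inside "case",
// and both occurrences of "event" are found.
echo implode(', ', $matches[1]); // public, event, new, event, case, as
```

The second "event" matches even though it is followed by "(", because \b only requires a word character on one side and a non-word character (or the string edge) on the other.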

jasonbullard · October 15, 2007

Okay. Maybe someone can help me modify this a little more. It does match only "whole" words, but it will not match items that have things like (), [], {}, etc. around them. I do not need to match when letters or numbers come directly before or after the keyword, but it does need to match something like (uint).

Does anyone have any suggestions???

TIA,

Jason

jasonbullard · October 15, 2007

Alright. I am getting much closer now. Here is what I have so far.

#(\b|((?<![\w])\B))('.$this->language[$i][3].')(\b|(\B(?![\w]>)))#

It will highlight keywords correctly except for a few instances. If I have int before internal in the keyword list, then it will match int and not internal (e.g. for "internal string" it will highlight only the "int" in "internal").

It also doesn't match words that have ( before them or are wrapped in (). A little more work and I know that it will work nicely. Feel free to point out any parts that are wrong or need to be rearranged.

TIA,

Jason

effigy · October 16, 2007

"If I have int before internal in the keyword list then it will match int and not internal (i.e. internal string - will match 'int'ernal)."

Sort your word list before looping.

You also need to preg_quote the variable you're dropping into the expression.

The \b's should be matching around the parentheses...
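
A rough sketch of the first two suggestions, assuming the keywords are available as a plain array rather than as the ready-made alternation string in $language[$i][3] (quoting that whole string would also escape the | separators). The helper name build_alternation is made up for this example.

```php
<?php
// Hypothetical helper (not from the thread): build a safe alternation from a word list.
function build_alternation(array $words, $delimiter = '#')
{
    $words = array_values(array_unique($words));

    // Longest first, so "internal" is tried before "int".
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Escape regex metacharacters and the delimiter in each word separately.
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, $delimiter);
    }

    return implode('|', $quoted);
}

$alternation = build_alternation(array('int', 'internal', 'string'));

// No trailing boundary here on purpose, to show the effect of ordering alone.
preg_match('#\b(' . $alternation . ')#', 'internal string', $sorted);
preg_match('#\b(int|internal|string)#', 'internal string', $unsorted);

echo $sorted[1], ' vs ', $unsorted[1]; // internal vs int
```

With \b on both sides of the group the regex engine would backtrack from int to internal anyway, so the ordering matters most when the trailing boundary is missing or looser.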

jasonbullard · October 18, 2007

Well, I figured one problem out but this regex is killing me.

I did switch it around to have matching \b's, but that still did not do anything. Here are my regexes now:

/* Match word boundaries first */
$this->code = preg_replace_callback('#(\b('.$language[$i][3].')\b)#', array($this, 'match_regex'), $this->code);

/* Match non-word boundaries second */
$this->code = preg_replace_callback('#(((?=[\W])|(?![\w])\B)('.$language[$i][3].')(\B(?![\w])|(?=[\W])))#', array($this, 'match_regex'), $this->code);

Now, the first one processes all of the whole words, and the second should match anything that is not at a word boundary, but it can't have word characters (i.e. a-zA-Z0-9) before or after the keyword. Maybe that will help someone help me figure this out?

Thanks for all the help, as I am learning regexes with this script.

effigy · October 18, 2007

Are you aware that PHP has highlight_file and highlight_string?

How about (?<!\w)word(?!\w)?

jasonbullard · October 18, 2007

Yes, I am aware, but this supports over 50 languages right now, so that is where the problem comes in. The matches work fine if I take the \b's out and leave it open, but then I get highlights inside words that shouldn't be highlighted.

No, that doesn't work either. I may have to split that one into two different regexes as well: one to match words that have non-word boundaries around them, and another one that I will have to figure out.

effigy · October 18, 2007

It's difficult to proceed without seeing the input data, the results, and the expected results.

Here's a small example illustrating the previous idea:

<pre>
<?php
$str = 'PHP is a programming language. I like to program.';
echo preg_replace('/(?<!\w)(program)(?!\w)/', '<b>\1</b>', $str);
?>
</pre>

jasonbullard · October 18, 2007

Okay. Here is the input data

<?

class paging
{
var $pQueryString;
var $pPageID;
var $pRowsPerPage;
var $pStartFile;
var $pRecordCount;
var $pCount;
function tep_start_paging()
{ $rs = mysql_query($this->pQueryString) or die(mysql_error()."<br>".$this->pQueryString);
$this->pRecordCount = mysql_num_rows($rs); 
$this->pCount= $this->pRecordCount;
if((int)$this->pRecordCount >= (int)$this->pRowsPerPage)
$this->pRecordCount = ceil((int)$this->pRecordCount / (int)$this->pRowsPerPage);
else
$this->pRecordCount= 1;
if(empty($this->pPageID) or (int)$this->pPageID==1)
{
$this->pPageID = 1;
$this->pStartFile= 0;
}

if($this->pPageID > 1)
$this->pStartFile = ($this->pPageID -1) * $this->pRowsPerPage;
return $this->startPaging();
}

This regex here will match certain things like "die" and "mysql_num_rows".

'#(((?<=[\W])|(?<![\w])\B)('.$language[$i][3].')(\B(?![\w]>)|(?=[\W]>)))#'

Where the problem comes in is for an instance of

... or die(mysql_error()...

For some reason, the only way I can match mysql_error is by having \B or \w placed in the regex, but then it will also match something like this:

Let's say the keyword is "link"

It will do this (quotes are highlighted part):

add_"link"s_to_url();

That is where I am at right now. I have been working on this for a couple of weeks now and it is starting to look really nice with fast processing times as well so I really hope that I can get this last part working.
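
As a sanity check on the lookaround idea suggested earlier in the thread, here is a small sketch run on a line modelled on the input above. The function name highlight_words and the [brackets] used in place of real highlighting markup are assumptions for illustration only.

```php
<?php
// Hypothetical helper: wrap whole-word keyword matches in [brackets].
function highlight_words($code, array $words)
{
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '#');
    }

    // (?<!\w) and (?!\w): no word character directly before or after the keyword.
    $pattern = '#(?<!\w)(' . implode('|', $quoted) . ')(?!\w)#';

    return preg_replace($pattern, '[$1]', $code);
}

$line = 'mysql_query($q) or die(mysql_error()); add_links_to_url();';
echo highlight_words($line, array('die', 'mysql_error', 'link'));
// mysql_query($q) or [die]([mysql_error]()); add_links_to_url();
```

On raw source text this matches die and mysql_error next to parentheses and leaves add_links_to_url alone, which suggests the pattern itself is sound and that any remaining misses come from what the text looks like by the time the regex runs.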

Thanks again for all the help.

Also, attached below is a screenshot of how it looks after it has finished highlighting.

[attachment deleted by admin]

jasonbullard · October 19, 2007

Alright. So I finally got it all figured out. Everything pretty much worked, except that I needed to convert *all* special characters into their HTML entity equivalents. I was using htmlspecialchars; instead, I wrote my own function that covers all the special characters, and it works like a charm.
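
The custom replacement function was not posted, so the following is not that solution. It is a sketch of a different way to keep entity escaping from interfering with keyword matching: split the raw code on the keyword pattern first, then escape each piece. The function name highlight_html and the <b> wrapper are assumptions for illustration.

```php
<?php
// Hypothetical alternative (not the poster's code): match on raw text, escape afterwards.
function highlight_html($code, array $words)
{
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '#');
    }
    $pattern = '#(?<!\w)(' . implode('|', $quoted) . ')(?!\w)#';

    // With one capture group, odd-numbered pieces are the matched keywords.
    $pieces = preg_split($pattern, $code, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($pieces as $i => $piece) {
        $escaped = htmlspecialchars($piece, ENT_QUOTES, 'UTF-8');
        $html .= ($i % 2 === 1) ? '<b>' . $escaped . '</b>' : $escaped;
    }

    return $html;
}

echo highlight_html('if((int)$a < 1) die("x");', array('if', 'int', 'die'));
// <b>if</b>((<b>int</b>)$a &lt; 1) <b>die</b>(&quot;x&quot;);
```

Because the regex only ever sees unescaped source, entity names such as lt or quot can never be mistaken for keywords, and the inserted markup is never re-scanned.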

Thanks again for all the help.
