
[SOLVED] Matching a whole word


jasonbullard


Guys,

 

I am having a little trouble with a regex that I am building.  It is probably just because I have been working on it all night and really can't think right now, but here is what I want.

 

I am building a syntax highlighter in PHP that works with regex and/or words.  So, in order to match a word I have a string of

 

(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|

finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|protected|

ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|return|using|

decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|sizeof|while|double|lock|

stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)

 

 

Now, I know that I can use the \b word boundary, but some things just don't seem right.  If I put it around the group like:

 

/\b(abstract|...)\b/

 

It highlights words inside other words.  If I try to match "as" it will highlight c'as'e instead of just matching a whole word.  Are there any other modifiers that I can use?  I have also tried putting each word in its own parentheses with a word boundary around it.

 

That works; however, it will not pick up anything that does not have whitespace around it.  So if I have,

 

public event e = new event();

 

It will match the first event but not the second.  I am really at a loss here, as I know it can be done; I am just having a really difficult time with it right now.  :)
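For reference, here is a minimal sketch of how `\b` behaves when it is placed outside a parenthesized alternation. The shortened `$keywords` list and the `<b>` wrapper are illustrative assumptions, not part of the highlighter described above.

```php
<?php
// Shortened, illustrative keyword list; the real list is much longer.
$keywords = ['as', 'case', 'event', 'new', 'public'];

// \b outside the group applies to every alternative inside it.
$pattern = '/\b(' . implode('|', $keywords) . ')\b/';

// Both occurrences of "event" are wrapped, including the one followed by "(".
$first = preg_replace($pattern, '<b>$1</b>', 'public event e = new event();');
echo $first, "\n";

// "case" is matched as a whole word; the "as" inside it is not matched on its own.
$second = preg_replace($pattern, '<b>$1</b>', 'case 1: x as y');
echo $second, "\n";
```

If a pattern built this way still highlights inside other words, it may be worth checking how the backslash survives PHP string quoting before changing the regex itself.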

 

Thanks for any help.

https://forums.phpfreaks.com/topic/73297-solved-matching-a-whole-word/

Guess I just needed to write it somewhere else.  I was adding a backslash to my word boundary.

 

This works perfectly.

 

#\b(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|
finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|
protected|ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|
return|using|decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|
sizeof|while|double|lock|stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)\b#

Okay.  Maybe someone can help me modify this a little more.  It does match only "whole" words, but it will not match items that have things like (), [], {}, etc. around them.  It should not match when letters or numbers come directly before or after the keyword, but it does need to match something like (uint).
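As a quick check, `\b` is a zero-width assertion between a word character and a non-word character, so a bracket next to a keyword should not by itself prevent a match. The sketch below uses an illustrative three-word pattern and made-up sample strings to show this in isolation.

```php
<?php
// Illustrative pattern; longer alternatives are listed first.
$pattern = '/\b(uint|int|string)\b/';

// Sample inputs are assumptions chosen to exercise brackets and embedded words.
$samples = ['(uint)$x', 'a[int]', '{string}', 'print(x)', 'uint8'];

$results = [];
foreach ($samples as $s) {
    // preg_match_all() returns the number of matches; $m[1] holds the keywords found.
    $count = preg_match_all($pattern, $s, $m);
    $results[$s] = $count ? implode(',', $m[1]) : '(no match)';
    printf("%-10s => %s\n", $s, $results[$s]);
}
```

Here the bracketed forms match, while `print(x)` and `uint8` do not, because the keyword is touching another word character. If the same pattern fails inside the real highlighter, the cause may lie somewhere other than the boundary itself.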

 

Does anyone have any suggestions???

 

TIA,

Jason

Alright.  I am getting much closer now.  Here is what I have so far.

 

#(\b|((?<![\w])\B))('.$this->language[$i][3].')(\b|(\B(?![\w]>)))#

 

It will highlight keywords correctly except for a few instances.  If I have int before internal in the keyword list, then it will match int and not internal (e.g. for "internal string" it will match 'int'ernal).

 

It also doesn't match words that have ( before them or are wrapped in ().  A little more work and I know that it will work nicely.  Feel free to point out any parts that are wrong or need to be rearranged.

 

TIA,

Jason

If I have int before internal in the keyword list then it will match int and not internal (i.e. internal string - will match 'int'ernal).

 

Sort your word list before looping.

 

You also need to preg_quote the variable you're dropping into the expression.

 

The \b's should be matching around the parentheses...
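One possible reading of the two suggestions above, as a sketch: sort the words longest-first so that "internal" is tried before "int", and run `preg_quote` on each word rather than on the joined string. The helper name `build_keyword_pattern` is hypothetical, and sorting by length is an assumption about what "sort" was meant to achieve.

```php
<?php
// Hypothetical helper: builds a single pattern from a raw keyword list.
function build_keyword_pattern(array $words): string
{
    // Drop duplicates such as a keyword that is listed twice.
    $words = array_unique($words);

    // Longest first, so "internal" is tried before "int".
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Quote each word on its own; quoting the joined string would escape the "|".
    $quoted = array_map(function ($w) {
        return preg_quote($w, '#');
    }, $words);

    return '#\b(' . implode('|', $quoted) . ')\b#';
}

$pattern = build_keyword_pattern(['int', 'internal', 'string', 'int']);
echo $pattern, "\n";

$highlighted = preg_replace($pattern, '<b>$1</b>', 'internal string s; int i;');
echo $highlighted, "\n";
```

With `\b` on both sides the order of alternatives matters less, since the engine backtracks from "int" to "internal" when the trailing boundary fails, but longest-first ordering stays safe if the boundary conditions are later loosened.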

Well, I figured one problem out but this regex is killing me.

 

I did switch it around to have matching \b's but that still did not do anything.  Here are my regexes now:

 

/* Match word boundaries first */
$this->code = preg_replace_callback('#(\b('.$language[$i][3].')\b)#', array($this, 'match_regex'), $this->code);

/* Match non-word boundaries second */
$this->code = preg_replace_callback('#(((?=[\W])|(?![\w])\B)('.$language[$i][3].')(\B(?![\w])|(?=[\W])))#', array($this, 'match_regex'), $this->code);

 

Now, the first one processes all of the whole words and the second should match anything that is not in a word boundary but it can't have characters (i.e. a-zA-Z0-9) before or after the keyword to match.  Maybe that will help someone help me figure this out???

 

Thanks for all the help, as I am learning regexes with this script.

Yes, I am aware, but this supports over 50 languages right now, so that is where the problem is coming in.  The matches work fine if I take the \b's out and leave it open, but what happens is that I get highlights in words that shouldn't be highlighted.  :)

 

No, that doesn't work either.  I may have to just split that one into two different regexes as well.  One to match words that have non-word boundaries around them and the other one I will have to figure out.

It's difficult to proceed without seeing the input data, the results, and the expected results.

 

Here's a small example illustrating the previous idea:

 

<pre>
<?php
$str = 'PHP is a programming language. I like to program.';
echo preg_replace('/(?<!\w)(program)(?!\w)/', '<b>\1</b>', $str);
?>
</pre>

Okay.  Here is the input data

 

<?

class paging
{
var $pQueryString;
var $pPageID;
var $pRowsPerPage;
var $pStartFile;
var $pRecordCount;
var $pCount;
function tep_start_paging()
{ $rs = mysql_query($this->pQueryString) or die(mysql_error()."<br>".$this->pQueryString);
$this->pRecordCount = mysql_num_rows($rs); 
$this->pCount= $this->pRecordCount;
if((int)$this->pRecordCount >= (int)$this->pRowsPerPage)
$this->pRecordCount = ceil((int)$this->pRecordCount / (int)$this->pRowsPerPage);
else
$this->pRecordCount= 1;
if(empty($this->pPageID) or (int)$this->pPageID==1)
{
$this->pPageID = 1;
$this->pStartFile= 0;
}

if($this->pPageID > 1)
$this->pStartFile = ($this->pPageID -1) * $this->pRowsPerPage;
return $this->startPaging();
}

 

 

This regex here will match certain things like "die" and "mysql_num_rows".

 

'#(((?<=[\W])|(?<![\w])\B)('.$language[$i][3].')(\B(?![\w]>)|(?=[\W]>)))#'

 

Where the problem comes in is for an instance of

... or die(mysql_error()...

 

For some reason the only way I can match mysql_error is having \B or \w placed in the regex but it will also match something like this:

 

Let's say the keyword is "link"

 

It will do this (quotes are highlighted part):

 

add_"link"s_to_url();
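For comparison, here is a sketch of the lookaround form from the earlier example applied to this case. The keyword list, the input line, and the quote-wrapping callback are illustrative assumptions. Because `_` counts as a word character, `(?<!\w)` rejects the "link" inside `add_links_to_url` while still allowing `die(` and `(mysql_error(`.

```php
<?php
// Illustrative keyword list, quoted word by word.
$keywords = ['die', 'mysql_error', 'link'];
$alternation = implode('|', array_map(function ($w) {
    return preg_quote($w, '#');
}, $keywords));

// No word character may sit directly before or after the keyword.
$pattern = '#(?<!\w)(' . $alternation . ')(?!\w)#';

$input = 'or die(mysql_error()); add_links_to_url(); link();';

// Wrap each hit in double quotes so the matched part is easy to see.
$output = preg_replace_callback($pattern, function ($m) {
    return '"' . $m[1] . '"';
}, $input);

echo $output, "\n";
```

For keywords that begin and end with a word character, this behaves like `\b` on both sides, so either form should leave `add_links_to_url` untouched.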

 

That is where I am at right now.  I have been working on this for a couple of weeks now and it is starting to look really nice with fast processing times as well so I really hope that I can get this last part working.  :)

 

Thanks again for all the help.

 

Also, attached below is a screenshot of how it looks after it has finished highlighting.

 

[attachment deleted by admin]

Alright.  So I finally got it all figured out.  Everything pretty much worked, except that I needed to convert *all* special characters into their HTML entity equivalents.  I was using htmlspecialchars; instead I created my own function that covers all the special characters, and it works like a charm.  ;D
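The actual replacement function is not shown in the thread, so the following is only a rough sketch of what a hand-rolled alternative to `htmlspecialchars` could look like. The function name `encode_specials` and the set of characters in the map are assumptions, not the list used in the real highlighter.

```php
<?php
// Hypothetical stand-in for htmlspecialchars(); the character set is an assumption.
function encode_specials(string $text): string
{
    $map = [
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
        '(' => '&#40;',
        ')' => '&#41;',
    ];

    // strtr() with an array makes one pass and never re-scans its own output,
    // so the "&" inside an inserted entity is not encoded a second time.
    return strtr($text, $map);
}

$encoded = encode_specials('a < b && f(x)');
echo $encoded, "\n";
```

Whether the encoding should run before or after the keyword pass depends on the highlighter's design, since entity names such as `lt` or `amp` are themselves made of word characters.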

 

Thanks again for all the help. 

