Guys,

 

I am having a little trouble with a regex that I am building.  It is probably just because I have been working on it all night and can't think straight right now, but here is what I want.

 

I am building a syntax highlighter in PHP that works with regex and/or words.  So, in order to match a word I have a string of

 

(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|
finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|protected|
ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|return|using|
decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|sizeof|while|double|lock|
stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)

 

 

Now, I know that I can use the \b word boundary, but some things just don't seem right.  If I put it around the list like:

 

/\b(abstract|...)\b/

 

It highlights words inside other words.  If I try to match "as", it highlights c'as'e instead of matching only the whole word.  Are there any other modifiers that I can use?  I have also tried putting each word in its own parentheses with a word boundary around it.

 

That works; however, it will not pick up anything that does not have whitespace around it.  So if I have,

 

public event e = new event();

 

It will match the first event but not the second.  I am really at a loss here, as I know it can be done; I am just having a really difficult time with it right now.  :)

 

Thanks for any help.
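
For reference, here is a minimal sketch of whole-word matching with \b on both sides of the group, using a few keywords from the list above. The helper name find_keywords is made up for illustration and is not part of the highlighter described in this thread.

```php
<?php
// Hypothetical helper: returns every keyword found as a whole word in $code.
function find_keywords(array $keywords, string $code): array
{
    // Escape each keyword so regex metacharacters are taken literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $keywords);

    // \b on both sides: the match must start and end at a word boundary.
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';

    preg_match_all($pattern, $code, $matches);
    return $matches[0];
}

// "as" is a keyword here, but the "as" inside "case" is not reported.
$found = find_keywords(
    ['as', 'event', 'new', 'public'],
    'public event e = new event(); case 1:'
);
print_r($found);
```

With this input, both occurrences of "event" should be found, including the one directly followed by "(", because a word boundary exists between a letter and any punctuation character.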

https://forums.phpfreaks.com/topic/73297-solved-matching-a-whole-word/

Guess I just needed to write it somewhere else.  I was adding a backslash to my word boundary.

 

This works perfectly.

 

#\b(abstract|event|new|struct|as|explicit|null|switch|base|extern|object|this|bool|false|operator|throw|break|
								finally|out|true|byte|fixed|override|try|case|float|params|typeof|catch|for|private|uint|char|foreach|
								protected|ulong|checked|goto|public|unchecked|class|if|readonly|unsafe|const|implicit|ref|ushort|continue|int|in|
								return|using|decimal|int|sbyte|virtual|default|interface|sealed|volatile|delegate|internal|short|void|do|is|
								sizeof|while|double|lock|stackalloc|else|long|static|enum|namespace|string|return|get|set|protected|value)\b#

Okay.  Maybe someone can help me modify this a little more.  It does match only "whole" words, but it will not match keywords that have things like (), [], {}, etc. around them.  It should not match when letters or numbers come directly before or after the keyword, but it does need to match something like (uint).

 

Does anyone have any suggestions???

 

TIA,

Jason

Alright.  I am getting much closer now.  Here is what I have so far.

 

#(\b|((?<![\w])\B))('.$this->language[$i][3].')(\b|(\B(?![\w]>)))#

 

It will highlight keywords correctly except for a few instances.  If I have int before internal in the keyword list then it will match int and not internal (i.e. internal string - will match 'int'ernal).

 

It also doesn't match words that have ( before them or are wrapped in ().  A little more work and I know that it will work nicely.  Feel free to point out any parts that are wrong or need to be rearranged.

 

TIA,

Jason

If I have int before internal in the keyword list then it will match int and not internal (i.e. internal string - will match 'int'ernal).

 

Sort your word list before looping.

 

You also need to preg_quote the variable you're dropping into the expression.

 

The \b's should be matching around the parentheses...
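
A rough sketch of these three suggestions together, assuming "sort" means longest keyword first (that reading is an assumption, as is the helper name build_keyword_pattern):

```php
<?php
// Hypothetical helper: builds one whole-word pattern from a keyword list.
function build_keyword_pattern(array $keywords): string
{
    $keywords = array_unique($keywords);

    // Longest first, so "internal" is tried before "int".
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Escape regex metacharacters and the "#" delimiter in every keyword.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '#');
    }, $keywords);

    return '#\b(?:' . implode('|', $quoted) . ')\b#';
}

$pattern = build_keyword_pattern(['int', 'internal', 'uint']);
preg_match_all($pattern, 'internal string s = (uint)x; int[] a;', $m);
print_r($m[0]);
```

With a strict \b on both sides, the regex engine backtracks from "int" to "internal" on its own, so the ordering mainly matters when the trailing boundary is optional, as in the pattern posted above. The (uint) case matches because \b also sits between "(" and "u".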

Well, I figured one problem out but this regex is killing me.

 

I did switch it around to have matching \b's but that still did not do anything.  Here are my regexes now:

 

/* Match word boundaries first */
$this->code = preg_replace_callback('#(\b('.$language[$i][3].')\b)#', array($this, 'match_regex'), $this->code);

/* Match non-word boundaries second */
$this->code = preg_replace_callback('#(((?=[\W])|(?![\w])\B)('.$language[$i][3].')(\B(?![\w])|(?=[\W])))#', array($this, 'match_regex'), $this->code);

 

Now, the first one processes all of the whole words and the second should match anything that is not in a word boundary but it can't have characters (i.e. a-zA-Z0-9) before or after the keyword to match.  Maybe that will help someone help me figure this out???

 

Thanks for all the help, as I am learning regexes with this script.

Yes, I am aware, but this supports over 50 languages right now, so that is where the problem is coming in.  The matches work fine if I take the \b's out and leave it open, but then I get highlights inside words that shouldn't be highlighted.  :)

 

No, that doesn't work either.  I may have to just split that one into two different regexes as well.  One to match words that have non-word boundaries around them and the other one I will have to figure out.

It's difficult to proceed without seeing the input data, the results, and the expected results.

 

Here's a small example illustrating the previous idea:

 

<pre>
<?php
$str = 'PHP is a programming language. I like to program.';
echo preg_replace('/(?<!\w)(program)(?!\w)/', '<b>\1</b>', $str);
?>
</pre>

Okay.  Here is the input data

 

<?

class paging
{
var $pQueryString;
var $pPageID;
var $pRowsPerPage;
var $pStartFile;
var $pRecordCount;
var $pCount;
function tep_start_paging()
{ $rs = mysql_query($this->pQueryString) or die(mysql_error()."<br>".$this->pQueryString);
$this->pRecordCount = mysql_num_rows($rs); 
$this->pCount= $this->pRecordCount;
if((int)$this->pRecordCount >= (int)$this->pRowsPerPage)
$this->pRecordCount = ceil((int)$this->pRecordCount / (int)$this->pRowsPerPage);
else
$this->pRecordCount= 1;
if(empty($this->pPageID) or (int)$this->pPageID==1)
{
$this->pPageID = 1;
$this->pStartFile= 0;
}

if($this->pPageID > 1)
$this->pStartFile = ($this->pPageID -1) * $this->pRowsPerPage;
return $this->startPaging();
}

 

 

This regex here will match certain things like "die" and "mysql_num_rows".

 

'#(((?<=[\W])|(?<![\w])\B)('.$language[$i][3].')(\B(?![\w]>)|(?=[\W]>)))#'

 

Where the problem comes in is for an instance of

... or die(mysql_error()...

 

For some reason the only way I can match mysql_error is having \B or \w placed in the regex but it will also match something like this:

 

Let's say the keyword is "link"

 

It will do this (quotes are highlighted part):

 

add_"link"s_to_url();

 

That is where I am at right now.  I have been working on this for a couple of weeks now and it is starting to look really nice with fast processing times as well so I really hope that I can get this last part working.  :)

 

Thanks again for all the help.

 

Also, attached below is a screenshot of how it looks after it has finished highlighting.

 

[attachment deleted by admin]
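
As a sketch only, here is the (?<!\w) ... (?!\w) idea from the small example earlier in the thread, applied to the die(mysql_error() and add_links_to_url() cases. The keyword list and the quote-wrapping callback are chosen for illustration and stand in for the real highlighter's list and markup.

```php
<?php
// Keywords taken from the cases discussed above; the list is illustrative only.
$keywords = ['die', 'mysql_error', 'link'];

$quoted = array_map(function ($word) {
    return preg_quote($word, '#');
}, $keywords);

// (?<!\w) : no word character (letter, digit, underscore) directly before.
// (?!\w)  : no word character directly after.
$pattern = '#(?<!\w)(?:' . implode('|', $quoted) . ')(?!\w)#';

$input = 'or die(mysql_error()); add_links_to_url();';

// Wrap each match in double quotes to stand in for the highlight markup.
$output = preg_replace_callback($pattern, function ($m) {
    return '"' . $m[0] . '"';
}, $input);

echo $output, "\n";
// prints: or "die"("mysql_error"()); add_links_to_url();
```

Because the underscore counts as a word character, "link" inside add_links_to_url is left alone, while die and mysql_error are matched even though "(" sits right next to them.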

Alright.  So I finally got it all figured out.  Everything pretty much worked, except that I needed to convert *all* special characters into their HTML entities.  I was using htmlspecialchars; instead I wrote my own function that covers all the special characters, and it works like a charm.  ;D

 

Thanks again for all the help. 
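
The custom replacement for htmlspecialchars is not shown in the thread. As a rough guess at what such a function could look like, the sketch below maps a handful of special characters to HTML entities with strtr. The function name escape_code and the choice of characters are assumptions.

```php
<?php
// Hypothetical stand-in for the custom escaping function mentioned above.
function escape_code(string $code): string
{
    // strtr with an array replaces each key once and never rescans its own
    // output, so the "&" inside "&lt;" is not escaped a second time.
    return strtr($code, [
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
        '(' => '&#40;',
        ')' => '&#41;',
        '[' => '&#91;',
        ']' => '&#93;',
        '{' => '&#123;',
        '}' => '&#125;',
    ]);
}

echo escape_code('if ((int)$a < 1) { $b["k"] = 0; }'), "\n";
```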
