Simple regex match help

Drongo_III · August 16, 2012

Hi Guys

I am not quite sure what I'm doing wrong here as when I run this match in php using preg_match I get the expected result returned - but in JS it's just giving me 'false'.

I want to match the end of the string below for the suffix .html - but every time I test it I get 'false' returned.


var str = 'http://localhost/TESTJS.html';
var pattern = new RegExp('/[\.html]+$/');

var result = pattern.test(str);
alert(result);

Any enlightenment on where i am going wrong would be hugely appreciated

Christian F. · August 16, 2012

Square brackets signify a character class, not a capturing subgroup. So what you're asking for is to verify that the string ends with one or more of the following characters: h, t, m, l, and period. Meaning, these are all fully legit according to that RegExp:

var str1 = "some random string.";
var str2 = "h";
var str3 = "something/that/looks/like/alink.hmmm";

Drop the square brackets (you could swap them for regular parentheses, but for a plain suffix you don't need any grouping at all) and remove the quantifier (+). Also, you don't need to use new RegExp(), this works just fine:

var result = str.match (/\.html$/);
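
To make the difference concrete, here is a small sketch that runs the character-class pattern and the literal-suffix pattern against the three strings above plus the original URL. The variable names (loose, strict, samples) are just for illustration.

```javascript
// The character-class pattern from the question vs. the literal-suffix pattern.
var loose = /[\.html]+$/;
var strict = /\.html$/;

var samples = [
  "some random string.",
  "h",
  "something/that/looks/like/alink.hmmm",
  "http://localhost/TESTJS.html"
];

var looseResults = samples.map(function (s) { return loose.test(s); });
var strictResults = samples.map(function (s) { return strict.test(s); });

console.log(looseResults);  // [true, true, true, true]
console.log(strictResults); // [false, false, false, true]
```

Only the strict pattern rejects the first three strings, which is the behaviour the original question was after.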

.josh · August 16, 2012

to clarify and expand on ChristianF's post...

edit: good lord i went off on a tangent, made a tl;dr..but in case you are interested in lots of details...

Square brackets signify a character class. It will match any one thing listed there, and the + after that quantifies the character class, meaning to match one or more of any one thing listed in the character class. So IOW, it will match any combination of characters listed in the bracket, any length, minimum 1 char.

The most immediate reason why it "worked" with php but not js is because this:

'/[\.html]+$/'

php expects a pattern delimiter as part of the pattern (you use the forward slash / as the delimiter). So in php, that doesn't actually count as part of the pattern. So with php, it was matching because it did find your string end with an "l" (but not necessarily the full ".html" because a character class matches any one character in it, and the + asks for one or more of that, so it will also match for instance your string ending in "htttmmlll"). So it coincidentally matched your string ending in .html because the "l" happened to be the last character, not because it explicitly ended in a full ".html".

sidenote: you don't need to escape the dot when it is inside a character class; it will be treated as a literal dot (but you do need to escape it outside of a character class).

So what you really should be doing in the php version is this:

preg_match('/\.html$/',$string)

But on the other hand in javascript, the argument passed to RegExp() does not use a pattern delimiter, so when you do this:

var pattern = new RegExp('/[\.html]+$/')

The regex is going to expect those forward slashes as part of the actual pattern. And the character class thing still applies. Worse, the $ now sits in front of the closing slash: it asserts the end of the string, and then the pattern asks for one more literal forward slash after it. Nothing can come after the end of the string, so this pattern can never match anything. So for instance,

"sompage.html" would match false, because there are no literal forward slashes around the suffix
"randomstring/l/" would match false, because the $ has to come before the last forward slash, and that position is not the end of the string
"sompage/.html/" would match false for the same reason, even though the character class and quantifier happily match ".html" between the slashes
"somepage/.hhlllmmm...ttt/" would match false too, again because of the $ in front of the closing slash

Drop the $ (new RegExp('/[\.html]+/')) and the last three would match true, which shows the forward slashes really are being treated as literal characters.

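A quick sketch of what the delimiters do when they end up inside the string passed to RegExp(). The names withSlashes and withoutSlashes are mine, and the second pattern still uses the loose character class, so it only shows the effect of the slashes, not the final fix.

```javascript
var str = 'http://localhost/TESTJS.html';

// The slashes inside the string become part of the pattern itself.
var withSlashes = new RegExp('/[.html]+$/');
// Same character class, no delimiters.
var withoutSlashes = new RegExp('[.html]+$');

var withSlashesResult = withSlashes.test(str);
var withoutSlashesResult = withoutSlashes.test(str);

console.log(withSlashesResult);    // false
console.log(withoutSlashesResult); // true
```
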
So in javascript, in order to match for a full, literal ".html" ending, you would use this:

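var pattern = new RegExp('\\.html$');
var result = pattern.test(str);

The dot is escaped because it has special meaning in regex and you are wanting to match for a literal dot. The backslash is doubled because the pattern is written as a string here: a single '\.' would be reduced to a plain '.' by the string parser before RegExp() ever saw it.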
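
A quick sketch of why the number of backslashes matters when the pattern is written as a string. The variable names and the test string 'page-xhtml' are made up for the demo.

```javascript
// Inside a string literal, "\." collapses to a plain "." before RegExp() sees it.
var sameString = ('\.html$' === '.html$');

var unescaped = new RegExp('\.html$');  // really /.html$/  (any char, then "html")
var escaped = new RegExp('\\.html$');   // really /\.html$/ (literal dot, then "html")

var unescapedOnX = unescaped.test('page-xhtml');
var escapedOnX = escaped.test('page-xhtml');
var escapedOnDot = escaped.test('page.html');

console.log(sameString);   // true
console.log(unescapedOnX); // true
console.log(escapedOnX);   // false
console.log(escapedOnDot); // true
```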

As ChristianF pointed out though, you don't actually need to create a regex object for this, you can do it like this:

var result = str.match (/\.html$/);

In this example, the forward slashes are used. In javascript, a string wrapped in forward slashes instead of quotes signifies a regex object literal. This is a "shortcut" if the pattern is static (will be a hardcoded string). If you need to include a dynamic value in the pattern then you will have to create a regex object with the RegExp() instead.

For example:

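function stringEndsWith(haystack, needle) {
  // Build the pattern at runtime: needle followed by the end-of-string anchor.
  var pattern = new RegExp(needle+"$");
  // test() is a method of the regex, so the regex goes on the left.
  return pattern.test(haystack);
}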

This function will allow you to do for instance stringEndsWith("somefile.html","html"), because you can use variables in the pattern passed to RegExp(). Sidenote: this function is simplified for demo purposes. In reality, this function would be more complex, because you will want to escape characters in needle that have special meaning in regex, and it's kind of a headache because you also have to escape the escape character so it isn't interpreted literally.
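
For what that "more complex" version might look like, here is a minimal sketch. escapeRegExp is a hypothetical helper name (JavaScript in this form has no built-in for it); it puts a backslash in front of every character that has special meaning in a regex, so needle is matched literally.

```javascript
// Hypothetical helper: backslash-escape every character that is special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function stringEndsWith(haystack, needle) {
  var pattern = new RegExp(escapeRegExp(needle) + '$');
  return pattern.test(haystack);
}

console.log(stringEndsWith('somefile.html', '.html')); // true
console.log(stringEndsWith('somefilexhtml', '.html')); // false
console.log(stringEndsWith('price: $5', '$5'));        // true
```

Without the escaping step, the second call would return true, because the unescaped dot in ".html" would match the "x".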

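But there's no way to use a variable in a regex literal. You can't do .match(/needle+"$"/) because everything between the slashes is pattern text, not javascript. The variable needle is never looked at: the pattern asks for the literal letters "needl", then one or more "e", then a double quote, then the end of the string, then another double quote. Since nothing can come after the end of the string, that pattern can't match anything at all.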

Passing a plain string, as in .match(needle+"$"), does work, but only because .match() quietly converts a string argument into a regex, the same as new RegExp() would. The $ is still treated as the end-of-string anchor, not as a literal dollar sign. So for instance,

var haystack = "this is a foobar$ more stuff foobar";
var needle = "foobar";
var result = haystack.match(needle + "$");

This will match the "foobar" at the very end of haystack. It will not match the earlier "foobar$", because the $ in the pattern is an anchor, and that first "foobar" is not at the end of the string.

The catch is that you can't add modifiers this way. A modifier such as global (g) or case-insensitive (i) can only be added if you use the regex literal version or pass a RegExp object, and this version is just a string being passed to .match().

IOW you can't do .match(/needle+"$"/g) to get the modifier, because then you're back to the first "can't do" example, where needle is treated as pattern text instead of as a variable.

The overall point is that there are a lot of limitations with using a regex object literal, so if you're looking for a simple, static string match, then it's a nice shortcut. But if you're looking to be able to expand or make it dynamic (now or in the future), stick with making a RegExp object.
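
As a rough sketch of the "dynamic" case: when the pattern is built from a variable, modifiers go in the second argument to RegExp(). The haystack here is a made-up variation with an uppercase FOOBAR at the end, so the i modifier has something to do.

```javascript
var haystack = "this is a foobar$ more stuff FOOBAR";
var needle = "foobar";

// Flags go in the second argument when the pattern is built from a string.
var endsWithNeedle = new RegExp(needle + "$", "i").test(haystack);
var allMatches = haystack.match(new RegExp(needle, "gi"));

console.log(endsWithNeedle); // true
console.log(allMatches);     // ["foobar", "FOOBAR"]
```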

Christian F. · August 16, 2012

I vote to stickify that post, or make it an article, .josh. Damned good info there!

Drongo_III · August 16, 2012

Woah! Both very helpful but especially big thanks to josh for taking the time to go into such detail.

I did realise my ignorance in using square brackets shortly after posting - mostly because i was a bit brain fried from working on something all day and i only really use regex once in a blue moon.

But I had no idea javascript regex object didn't use a pattern delimiter and i definitely didn't realise the distinction between explicitly creating a regex object and an object literal - so i am very, very glad i asked guys.

I think i'll spend some time on regex this evening!

PS don't make this a sticky! not sure i want my noob question immortalised lol

Drongo_III · August 16, 2012

Ok, been doing some practice on JS regular expressions and I'm getting stuck :/

Why do the patterns below return false?


var str = 'first sentence';
//var pattern = new RegExp('^[a-zA-Z]{5,6}\s'); //FALSE
//var pattern = new RegExp('^[a-zA-Z]{5,6}\b'); //FALSE
//var pattern = new RegExp('^[a-zA-Z]{5,6} '); //TRUE
var pattern = new RegExp('^[a-zA-Z]{5,6} s'); //TRUE

var result = pattern.test(str);
//alert(1);
document.getElementById('para').innerHTML = result;

I don't understand why the first two return false. To my (increasingly shaky) understanding of regex the patterns that return false should be true. For instance i interpret the following pattern as follows:

var pattern = new RegExp('^[a-zA-Z]{5,6}\b');

Start at the beginning of the string (^), match any a-z characters of any case - specifically matching 5-6, then match a word boundary. Which to me should match 'first ' in var str above.

So is it simply a case that you should always match spaces as literals or am i reading the regex incorrectly?

.josh · August 16, 2012

okay so the problem goes back to what i kinda mentioned as a sidenote in my tl;dr:

This function will allow you to do for instance stringEndsWith("somefile.html","html"), because you can use variables in the pattern passed to RegExp(). Sidenote: this function is simplified for demo purposes. In reality, this function would be more complex, because you will want to escape characters in needle that have special meaning in regex, and it's kind of a headache because you also have to escape the escape character so it isn't interpreted literally.

Since you are passing a string to the RegExp() method, you have to consider characters that are escaped, because certain characters signify special things when you escape them in a string.

For this pattern: var pattern = new RegExp('^[a-zA-Z]{5,6}\s');

^ match for start of string

[a-zA-Z] any letter (upper or lowercase)

{5,6} 5 or 6 of that previous char class (so together, any combination of letters 5 or 6 characters long)

\s this has no special meaning in strings, so the string is parsed as a literal "s", not the shorthand "whitespace" character class you expected. So IOW you are really passing '^[a-zA-Z]{5,6}s' to RegExp(), so it expects 5 or 6 letters followed by a literal "s".

For this pattern: RegExp('^[a-zA-Z]{5,6}\b')

^ match for start of string

[a-zA-Z] any letter (upper or lowercase)

{5,6} 5 or 6 of that previous char class (so together, any combination of letters 5 or 6 characters long)

\b This has special meaning to strings, it signifies a backspace character, so you are telling your pattern to look for a backspace character, not the "word boundary" assertion you expected.

To fix both of these, you must escape the escape: \\s and \\b. This will tell the string parser to use a literal backslash instead of trying to look for its special chars, so that when the string gets passed to RegExp() it will have the \s and \b you expect to pass it.

Sidenote: With a regex object literal, you don't need to do this, because you are working with an object literal, not a string. In fact, doubling the backslash will not work as expected: .match(/^[a-zA-Z]{5,6}\\s/) looks for a literal backslash followed by an "s". Instead, you do it like normal: .match(/^[a-zA-Z]{5,6}\s/).
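
Putting the string-vs-literal escaping rules side by side, as a small sketch using the test string from the question (the result variable names are just for the demo):

```javascript
var str = 'first sentence';

// What the string parser does before RegExp() ever sees the pattern:
var sIsPlainS = ('\s' === 's');        // unknown escape, backslash is dropped
var bIsBackspace = ('\b' === '\u0008'); // \b in a string is a backspace character

// Doubling the backslash keeps it alive for RegExp().
var stringWhitespace = new RegExp('^[a-zA-Z]{5,6}\\s').test(str);
var stringBoundary = new RegExp('^[a-zA-Z]{5,6}\\b').test(str);

// Regex literals skip the string parser, so a single backslash is right...
var literalSingle = /^[a-zA-Z]{5,6}\s/.test(str);
// ...and a doubled one means "a literal backslash followed by s".
var literalDouble = /^[a-zA-Z]{5,6}\\s/.test(str);

console.log(sIsPlainS, bIsBackspace);          // true true
console.log(stringWhitespace, stringBoundary); // true true
console.log(literalSingle, literalDouble);     // true false
```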

Drongo_III · August 18, 2012

Thanks Josh!

That helps a lot. I should have read your original post more closely.

Thank you for your help.
