Easy regex

mikesta707 · January 8, 2010

I know this is easy, but for some reason, regex just kicks my ass.

The pattern I currently have goes like so

$pattern = "/[a-zA-Z]{2,3}\/web\/[0-9]{9,11}\/\.html/";

I'm trying to match URLs that look like this:

mnh/web/(some numbers).html
thc/web/(numbs).html
dcf/web/(numbs).html

Note that the first 3 letters stand for certain areas (e.g. Manhattan is mnh, Queens is que or something, etc.). I used a character class that matches any 2-3 letters to make it easier on myself.

I'm sure it's a simple fix, but I just can't seem to figure it out.

cags · January 8, 2010

Do you have an example of something that doesn't match that you think should? My first suggestion: since you are working with a path, don't use slashes as your delimiters; it just complicates things when you have to escape them in the pattern. I also don't think you need the last escaped slash (the \/ just before \.html) at all, since the number looks like a filename, and that shouldn't contain a forward slash.

$pattern = "#[a-z]{2,3}/web/[0-9]{9,11}\.html#i";
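For anyone wanting to check this, here is a minimal sketch that runs the pattern above against a few sample paths. The numeric parts are made up for illustration; only the pattern itself comes from this thread.

```php
<?php
// The pattern suggested above: # as delimiter, no trailing slash, i = case-insensitive.
$pattern = "#[a-z]{2,3}/web/[0-9]{9,11}\.html#i";

// Hypothetical sample paths (the numbers are invented for this example).
$paths = array(
    'mnh/web/123456789.html',    // 9 digits: should match
    'thc/web/12345678901.html',  // 11 digits: should match
    'dcf/web/1234.html',         // only 4 digits: should not match
    'mnh/web/123456789/.html',   // stray slash before .html: should not match
);

$results = array();
foreach ($paths as $path) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$path] = preg_match($pattern, $path);
    echo $path, ' => ', $results[$path] ? 'match' : 'no match', "\n";
}
```

The last sample path is the shape the original pattern was asking for, which is why it never matched the real URLs.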

mikesta707 · January 8, 2010

Wow, I totally thought that PHP required forward slashes as delimiters.

Yeah, you were right, it's that last forward slash that was screwing things up. Thanks for the tip; those forward slashes really messed everything up. All is well now.

nrg_alpha · January 8, 2010

mikesta707 wrote: "wow, totally thought that PHP required forward slashes as delimiters."

There is now a dedicated delimiters reference page in the PCRE section of the PHP manual. It describes what can be used as delimiters.
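As a rough illustration (not taken from the manual page itself), the sketch below writes the same simplified pattern with a few different delimiters. The subject string and the shortened pattern are made up for the example.

```php
<?php
// Hypothetical subject string for illustration.
$subject = 'mnh/web/123456789.html';

// The same pattern with four different delimiters.
$patterns = array(
    '/web\/[0-9]+\.html/',  // slash delimiter: every slash in the pattern must be escaped
    '#web/[0-9]+\.html#',   // hash delimiter: slashes need no escaping
    '~web/[0-9]+\.html~',   // tilde delimiter
    '{web/[0-9]+\.html}',   // bracket-style delimiters (opening and closing pair)
);

$matched = array();
foreach ($patterns as $p) {
    $matched[$p] = preg_match($p, $subject);
    echo $p, ' => ', $matched[$p], "\n";
}
```

Whatever character is chosen as the delimiter has to be escaped wherever it appears inside the pattern, so picking one that does not occur in the text being matched keeps the pattern readable.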

mikesta707 · January 9, 2010

OK, so I'm finding out I'm really bad at this. I'm trying to capture the stuff inside of a div tag that looks like this:

<div id="userbody">
stuff stuff
</div>

my pattern looks like this

$pattern = '#<div id="userbody">(.)+</div>#i';

I do the following, with the above $pattern

// $stuff is the HTML
if (preg_match($pattern, $stuff, $matches)) {
    print_r($matches);
} else {
    echo "Failure";
}

and I always get "Failure". When I change the pattern to just

$pattern = '#<div id="userbody">#i';

I seem to get a match, but when I print_r($matches), it's empty (I'm not really sure if it should be empty, but since I have no capturing group, I'm assuming that's right).

Any idea what's wrong with my pattern?

salathe · January 10, 2010

By default the dot won't match newlines; to make it do so, you need to add the s modifier to the end of your pattern. You'll also probably want to use (.+) rather than (.)+, as the latter will only capture the very last character it matches.
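A minimal sketch of both points, assuming $stuff holds HTML shaped like the div posted above (the string here is a made-up stand-in):

```php
<?php
// Hypothetical stand-in for the HTML being searched.
$stuff = "<div id=\"userbody\">\nstuff stuff\n</div>";

// No s modifier: the dot cannot match the newline right after the opening tag,
// so the whole pattern fails.
$noDotAll = preg_match('#<div id="userbody">(.+)</div>#i', $stuff, $m1);

// With s: the dot also matches newlines, and (.+) captures the whole body.
$withDotAll = preg_match('#<div id="userbody">(.+)</div>#is', $stuff, $m2);

// (.)+ repeats a one-character group, so only its last repetition is kept.
preg_match('#<div id="userbody">(.)+</div>#is', $stuff, $m3);

var_dump($noDotAll);   // 0: no match
var_dump($withDotAll); // 1: match
var_dump($m2[1]);      // the full body, including both newlines
var_dump($m3[1]);      // a single character: the newline just before </div>
```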

nrg_alpha · January 10, 2010

To further comment, be careful when using greedy quantifiers like .* or .+, as they will greedily match (or in your case, capture) as much as they can, then backtrack until what comes after them in the pattern matches. So if you have multiple or nested divs, you may end up matching more than you bargained for. In cases like this, I would recommend using lazy quantifiers instead: (.+?)

This thread has discussions / explanations on this matter:

http://www.phpfreaks.com/forums/index.php/topic,236933.msg1103233.html (my post is #11, which upon re-reading I probably would have worded differently, but it still gets the point across; cv's #14 is more colourful / illustrative).

This is not to suggest that greedy quantifiers are inherently bad, but rather that they are bad when improperly employed, which can lead to undesirable results.
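To make the difference concrete, here is a small sketch with two sibling divs. The HTML string is invented for the example; the two patterns differ only in the quantifier.

```php
<?php
// Hypothetical HTML with two sibling divs, made up for illustration.
$html = '<div id="userbody">first</div><div id="other">second</div>';

// Greedy: .+ runs to the end of the string, then backtracks to the LAST </div>.
preg_match('#<div id="userbody">(.+)</div>#is', $html, $greedy);

// Lazy: .+? grows one character at a time and stops at the FIRST </div>.
preg_match('#<div id="userbody">(.+?)</div>#is', $html, $lazy);

echo $greedy[1], "\n"; // first</div><div id="other">second
echo $lazy[1], "\n";   // first
```

Keep in mind that the lazy version stops at the first </div> it meets, so a div nested inside userbody would cut the capture short.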

mikesta707 · January 10, 2010

Thanks guys! It was the newlines. I did read about the . not matching newlines, but I tried to test whether there were newlines in the text (it seems I tested wrong). Again, thanks a lot.

And yeah, I also read about "greediness" vs. "laziness", but at this point I was just trying to get a working regex, and probably would have optimized it afterwards. Thanks for the tips though! Greatly appreciated.

cags · January 10, 2010

Just as a side note, since you appear to be parsing HTML, would you not be better off using XPath/DOMDocument?
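Something along these lines might work, as a rough sketch only. The HTML string is a made-up stand-in for $stuff, and it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical HTML; in this thread it would be the contents of $stuff.
$stuff = '<html><body><div id="userbody">
stuff stuff <div>nested</div>
</div></body></html>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($stuff);
libxml_clear_errors();

// Select the div by its id attribute rather than by text matching.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="userbody"]');

$text = null;
if ($nodes !== false && $nodes->length > 0) {
    // textContent returns the text of the div and all of its descendants.
    $text = trim($nodes->item(0)->textContent);
    echo $text, "\n";
} else {
    echo "Failure\n";
}
```

Because the parser tracks which closing tag belongs to which div, the nested div here does not cut the result short the way it could with a regex.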
