mikesta707 Posted January 8, 2010
I know this is easy, but for some reason regex just kicks my ass. The pattern I currently have goes like so:
$pattern = "/[a-zA-Z]{2,3}\/web\/[0-9]{9,11}\/\.html/";
I'm trying to match URLs that look like this:
mnh/web/(some numbers).html
thc/web/(numbers).html
dcf/web/(numbers).html
Note that the first 3 letters are basically codes for certain areas (e.g. Manhattan is mnh, Queens is que or something, etc.). I used a character class that matches any 2-3 letters to make it easier on myself. I'm sure it's a simple fix, but I just can't seem to figure it out.
cags Posted January 8, 2010
Do you have an example of something that doesn't match that you think should? My first suggestion is that since you are working with a path, don't use slashes as your delimiters; it just complicates things when you have to escape them in the string. I don't think you need the last slash (the one right before \.html) at all, as the number seems like a filename and that shouldn't contain a forward slash.
$pattern = "#[a-z]{2,3}/web/[0-9]{9,11}\.html#i";
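A quick way to see the difference between the two patterns is to run both against a few paths. This is only a sketch: the sample paths and their numeric IDs are made up for illustration, and the variable names $urls, $original and $fixed are not from the thread.

```php
<?php
// Hypothetical sample paths; the numeric IDs are invented for this example.
$urls = [
    'mnh/web/1234567890.html', // 10 digits
    'thc/web/123456789.html',  // 9 digits
    'dcf/web/12345.html',      // only 5 digits, should not match
];

// Original pattern: slash delimiters, with a stray \/ before \.html,
// so it only matches paths that contain "/.html" after the digits.
$original = "/[a-zA-Z]{2,3}\/web\/[0-9]{9,11}\/\.html/";

// cags' pattern: # delimiters, so the slashes in the path need no escaping.
$fixed = "#[a-z]{2,3}/web/[0-9]{9,11}\.html#i";

foreach ($urls as $url) {
    printf(
        "%-26s original: %d  fixed: %d\n",
        $url,
        preg_match($original, $url),
        preg_match($fixed, $url)
    );
}
```

Neither pattern is anchored, so both match anywhere inside a longer string. If the whole string has to be such a path, adding ^ and $ would be one option.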
mikesta707 Posted January 8, 2010
Wow, I totally thought that PHP required forward slashes as delimiters. Yeah, you were right, it's that last forward slash that was screwing things up. Thanks for the tip, those forward slashes really messed everything up. All is well now.
nrg_alpha Posted January 8, 2010
Quote: "wow, totally thought that PHP required forward slashes as delimiters."
There is now a dedicated delimiters reference page within the PCRE section of the PHP manual. It describes what can be used as delimiters.
mikesta707 Posted January 9, 2010
OK, so I'm finding out I'm really bad at this. I'm trying to capture the stuff inside of a div tag that looks like:
<div id="userbody"> stuff stuff </div>
My pattern looks like this:
$pattern = '#<div id="userbody">(.)+</div>#i';
I do the following with the above $pattern:
// $stuff is the HTML
if (preg_match($pattern, $stuff, $matches)) {
    print_r($matches);
} else {
    echo "Failure";
}
and always get "Failure". When I change the pattern to just
$pattern = '#<div id="userbody">#i';
I seem to get a match, but when I print_r $matches, it's empty (and I'm not really sure if it should be empty, but since I have no capturing group, I'm assuming that's right). Any idea what's wrong with my pattern?
salathe Posted January 10, 2010
By default the dot won't match newlines; to make it do so, you need to add the s modifier to the end of your pattern. You'll also probably want to use (.+) rather than (.)+, as the latter will only capture the very last character that it can find.
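Both points can be checked with a small test string. The sketch below assumes the div's content spans several lines; the sample HTML and the variable names are invented for illustration.

```php
<?php
// Hypothetical sample HTML with newlines inside the div.
$stuff = "<div id=\"userbody\">\nstuff\nstuff\n</div>";

// No s modifier: the dot stops at newlines, so nothing matches.
$noDotAll = '#<div id="userbody">(.+)</div>#i';

// With the s modifier (PCRE_DOTALL) the dot also matches newlines.
$dotAll = '#<div id="userbody">(.+)</div>#is';

// Quantifier outside the group: group 1 is overwritten on every
// repetition, so it ends up holding only the last character matched.
$lastCharOnly = '#<div id="userbody">(.)+</div>#is';

var_dump(preg_match($noDotAll, $stuff)); // 0, no match

preg_match($dotAll, $stuff, $m);
var_dump($m[1]); // the whole body: "\nstuff\nstuff\n"

preg_match($lastCharOnly, $stuff, $m2);
var_dump($m2[1]); // just the final newline before </div>
```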
nrg_alpha Posted January 10, 2010
To further comment, be careful when using greedy quantifiers like .* or .+, as these will greedily match (or in your case, capture) as much as they can, then backtrack until what comes after them in the pattern matches. So if you have multiple or nested divs, you may end up matching more than you bargained for. In cases like this, I would recommend using lazy quantifiers instead: (.+?)
This thread has discussions and explanations on the matter: http://www.phpfreaks.com/forums/index.php/topic,236933.msg1103233.html (my post is #11, which upon re-reading I probably would have worded differently, but it still gets the point across, and cv's is #14, which is more colourful and illustrative).
This is not to suggest that greedy quantifiers are inherently bad, but rather that they are bad when improperly employed, which can lead to undesirable results.
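The greedy versus lazy difference shows up as soon as a second closing tag appears later in the input. A minimal sketch, using made-up markup with two sibling divs (the second div's id "other" is hypothetical):

```php
<?php
// Hypothetical markup: two sibling divs on one line.
$html = '<div id="userbody">first</div><div id="other">second</div>';

// Greedy: .+ grabs everything, then backtracks to the LAST </div>.
preg_match('#<div id="userbody">(.+)</div>#is', $html, $greedy);

// Lazy: .+? expands one character at a time and stops at the FIRST </div>.
preg_match('#<div id="userbody">(.+?)</div>#is', $html, $lazy);

echo $greedy[1], "\n"; // first</div><div id="other">second
echo $lazy[1], "\n";   // first
```

Note that the lazy version stops at the first closing tag it meets, so if another div is nested inside userbody it will cut the capture short. Neither quantifier can count nesting levels.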
mikesta707 Posted January 10, 2010
Thanks guys! It was the newlines. I did read about the . not matching newlines, and I tried to test whether there were newlines in the text (it seems I tested wrong). Again, thanks a lot. And yeah, I also read about "greediness" vs. "laziness", but at this point I was just trying to get a working regex, and probably would have optimized it afterwards. Thanks for the tips though! Greatly appreciated.
cags Posted January 10, 2010
Just as a side note, since you appear to be parsing HTML, would you not be better off using XPath/DOMDocument?
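For comparison, here is a rough sketch of what the DOMDocument/XPath route could look like for the same userbody div. The sample page is invented; in the thread's case $stuff would hold the real HTML.

```php
<?php
// Hypothetical sample page standing in for the real HTML.
$stuff = '<html><body><div id="userbody"><p>stuff</p><div>nested</div></div></body></html>';

// Collect libxml warnings internally instead of printing them, since
// real-world markup is often not quite valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($stuff);
libxml_clear_errors();

// Select the div by its id attribute.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="userbody"]');

if ($nodes->length > 0) {
    $div = $nodes->item(0);
    echo $div->textContent, "\n";    // text only: stuffnested
    echo $doc->saveHTML($div), "\n"; // the div's markup, nested div included
} else {
    echo "Failure\n";
}
```

Because the parser builds a tree, the nested div is handled without any of the greedy/lazy concerns raised earlier in the thread.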