ragax

Members
  • Posts

    186
Everything posted by ragax

  1. ...part 2 of this post: B. (?| ) LETS YOU USE ONE GROUP NUMBER FOR MULTIPLE CAPTURES Sometimes, you have data where what you want to capture almost fits in one set of parentheses. Almost: you can fit it in an alternation, but you end up using multiple sets of parentheses. As a result, your data can end up in Group 1, Group 2, Group 3... You don't know which. To sort it out, you have to write some code to look at the array of results. Here's an example: (?:shipping (\w+)|mailing (\w+) to): \w+ This could match either of these strings: shipping books: today mailing books to: john In both cases, "books" would be captured... But in Group 1 for the first case, and Group 2 for the second. That's because group numbers are assigned from left to right as you read the regex, whether or not a given group takes part in the match. On the second string, Group 1 is not set, but Group 2 captures "books". You'll have to sort that out in PHP by examining your matches. Well, PCRE has a magical feature that lets you capture "books" in Group 1 in both cases, even though they are captured by different sets of parentheses!!! That syntax is (?|, and it "resets" the group numbering each time you pass "|", the alternation marker, so every branch starts counting from the same number. Here's the piece of magic syntax that always returns "books" in Group 1: (?|shipping (\w+)|mailing (\w+) to): \w+ Here's code to test it: <?php $regex=',(?|shipping (\w+)|mailing (\w+) to): \w+,'; preg_match($regex, 'shipping books: today', $match); echo $match[1].'<br />'; preg_match($regex, 'mailing books to: john', $match); echo $match[1].'<br />'; ?> Output: books books In both cases, "books" is found in $match[1], which is the content of Group 1! This feature is briefly mentioned on the PHP manual's subpattern page. See my page on regex (? syntax disambiguation for more on PCRE (?| group reset syntax. Wishing you all a fun weekend!
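To make the numbering problem concrete, here is a minimal sketch that runs the plain (?: version and the (?| version side by side on the second sample string. The variable names are mine, not from the original post, and the sketch assumes PHP's default behavior of reporting an unset group that sits before a set one as an empty string.

```php
<?php
// Plain non-capturing alternation: the two (\w+) groups are numbered 1 and 2.
$plain = ',(?:shipping (\w+)|mailing (\w+) to): \w+,';
// Group reset: both (\w+) groups share the number 1.
$reset = ',(?|shipping (\w+)|mailing (\w+) to): \w+,';

$subject = 'mailing books to: john';

preg_match($plain, $subject, $plainMatch);
preg_match($reset, $subject, $resetMatch);

// Plain version: Group 1 did not take part in the match, "books" is in Group 2.
var_dump($plainMatch[1], $plainMatch[2]);
// Reset version: "books" is in Group 1 and there is no Group 2 at all.
var_dump($resetMatch[1], count($resetMatch));
```

With the plain version you would need an extra check to find out which group holds the word; with the reset version you always read index 1.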
  2. Greetings, PHP heads! A couple days ago, during a full revamp of my regex tutorial (see sig), I added two very sweet regex features that I haven't seen discussed in the online PHP world. These are features that I found buried in the PCRE documentation. One of them is briefly mentioned in the PHP manual. The other might be too, but I couldn't find it. I thought I'd make a quick tut on the forum to share these two "secret features" of PHP regex with my fellow regex lovers. A. (?(DEFINE)) LETS YOU REUSE A PATTERN You already know how a back-reference (named or numbered) lets you match a literal string previously captured by a set of parentheses: Both (\d\d) abc \1 and (?P<Nb>\d\d) abc (?P=Nb) will match "12 abc 12", where "12" is captured in Group 1 or in a named capture group: "Nb" What if instead of referring to a string already captured, you could refer to a regex pattern? This is what (?(DEFINE)) does. In the following example, the DEFINE statement defines "phone" as this regex pattern: (?:Tel|Fax):[ ]415-\d{3}-\d{4} Then the regex uses this defined phone pattern multiple times in the expression. Run this: <?php $pattern=',(?x)(?(DEFINE)(?<phone>(?:Tel|Fax):[ ]415-\d{3}-\d{4})) ^start[ ](?&phone)[ ]////[ ](?&phone) [ ]----[ ]((?&phone)),'; $string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999'; if(preg_match($pattern, $string, $match)) echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />'; ?> Three phone numbers are matched, but the pattern to match a phone number is only given once! Note that the third number is captured by an additional set of parentheses. It is actually Group 2, because the named group "phone" inside the DEFINE statement counts as Group 1. Now, you could also accomplish this with a subroutine call to a numbered group: the (?1) syntax, which re-runs the pattern of Group 1 (not the text it captured). For instance: <?php $string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999'; $pattern=',(?x) ^start[ ]((?:Tel|Fax):[ ]415-\d{3}-\d{4}) [ ]////[ ](?1) [ ]----[ ]((?1)),'; if(preg_match($pattern, $string, $match)) echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />'; ?> Or, using an Oniguruma-style call to a named group: <?php $string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999'; $pattern=',(?x) ^start[ ](?<Phone>(?:Tel|Fax):[ ]415-\d{3}-\d{4}) [ ]////[ ]\g<Phone> [ ]----[ ](\g<Phone>),'; if(preg_match($pattern, $string, $match)) echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />'; ?> So what is the benefit of the DEFINE syntax over these other techniques? Well, if you wanted, you could set up all your definitions at the beginning of the expression, which could be handy for a long regex! (?(DEFINE)(?<Gender>M|F)) (?(DEFINE)(?<Age>\b\d\d\b)) (?(DEFINE)(?<Name>\b[[:alpha:]]+\b)) Then you can pepper your names in the expression: (?&Age) to match an Age, (?&Gender) to match a Gender, and so on. Then, if you change your mind about a sub-pattern, all you have to do is change it at the top! See my page on regex (? syntax disambiguation for more on PCRE regex DEFINE syntax. As of Jan 28 2012, I couldn't find this feature on the PHP manual, but if you find it please let me know. In the next post, we will look at an even more interesting and useful feature of PHP regex: Capture Groups with Duplicate Numbers.
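The Gender / Age / Name definitions above are shown without a full expression around them, so here is a small sketch of how they might be used together. The record layout ("Name, Age, Gender"), the sample string, and the capture names who, years and sex are assumptions made up for this illustration.

```php
<?php
// All definitions sit at the top; the actual expression follows.
// (?x) lets us spread the pattern over several lines.
$pattern = '~(?x)
    (?(DEFINE)(?<Gender>M|F))
    (?(DEFINE)(?<Age>\b\d\d\b))
    (?(DEFINE)(?<Name>\b[[:alpha:]]+\b))
    ^(?<who>(?&Name)),[ ](?<years>(?&Age)),[ ](?<sex>(?&Gender))$
~';

$record = 'Alice, 34, F'; // hypothetical sample data

if (preg_match($pattern, $record, $match)) {
    echo $match['who'], ' / ', $match['years'], ' / ', $match['sex'], PHP_EOL;
} else {
    echo 'No match', PHP_EOL;
}
```

Named captures are used for the results because the three defined groups already take numbers 1 to 3, which makes numeric indexes easy to get wrong.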
  3. That's true: as I mentioned in the post, I just gave MCod part 1, part 2, and part 3 so he could do independent tests on these variables. (And potentially report to the person who submitted the data that one particular part is broken.) The post by AyKay discussed doing the same faster by using explode(). Adam is quite right that you can validate the entire string in one go: valid AND valid AND valid. If it fails, you don't know where, so it's up to you to choose the approach that works best for your needs. Nothing wrong with Adam's approach! MCod, small suggestions if you're going with Adam's expression: 1. You don't need the \ in the first bracket in front of the dot (\.) 2. You still need parentheses to capture the three parts since you said you wanted to split the string. 3. You said you want the last part to be 1, 2 or 3, but in the example you gave, $part3 was 0, so you may want to refine that in the last part of the regex, currently [1-3]. Wishing you all a fun weekend
  4. Aha... so it's more like a magic trick than pure psychic ability??? Thank you for explaining your art to your public! ;-)
  5. Ah, just saw AyKay47's post... He's so right, explode() or preg_split() is a great way to do it, and right again, I posted a regex 2 minutes after his message! I envy your psychic powers, AyKay. (And the clarity of mind to go to the easiest solution first.)
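For reference, the explode() route mentioned here could look roughly like the sketch below. It is not AyKay47's actual code (which is not shown in this listing); it reuses the sample input from the thread, and the variable names are assumptions.

```php
<?php
$string = 'jim.h|1234567890123456|0';

// A limit of 3 keeps any extra "|" characters inside the last part.
$parts = explode('|', $string, 3);

if (count($parts) === 3) {
    list($part1, $part2, $part3) = $parts;
    echo "Part 1: $part1", PHP_EOL;
    echo "Part 2: $part2", PHP_EOL;
    echo "Part 3: $part3", PHP_EOL;
} else {
    echo 'Expected three pipe-separated parts', PHP_EOL;
}
```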
  6. Hi MCod! Run this: Input: jim.h|1234567890123456|0 Code: <?php $regex=',([^|]*)\|([^|]*)\|(.*),'; $string='jim.h|1234567890123456|0'; $hit=preg_match($regex,$string,$part); if($hit) { echo "Part 1: "; if(isset($part[1])) echo $part[1]; else echo 'n/a'; echo '<br />'; echo "Part 2: "; if(isset($part[2])) echo $part[2]; else echo 'n/a'; echo '<br />'; echo "Part 3: "; if(isset($part[3])) echo $part[3]; else echo 'n/a'; echo '<br />'; } ?> Output: Part 1: jim.h Part 2: 1234567890123456 Part 3: 0 Then you can do all the tests you want on $part[1], $part[2] and $part[3]. Let me know if this works for you!
  7. Good news, drisate, glad to hear it. :-)
  8. First thing that comes to mind: Insert a negative lookahead in your working regex. #href=(?!['\"]mailto)['\"](.+?)['\"]#
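Here is a quick sketch of the negative-lookahead idea in action. The pattern below is a variant that accepts either quote style through the character class ['"], and the HTML snippet is made up for the test, so treat both as assumptions rather than the original poster's code.

```php
<?php
// (?![\'"]mailto) refuses to go on if the quote is followed by "mailto".
$regex = '#href=(?![\'"]mailto)[\'"](.+?)[\'"]#';

// Hypothetical test markup: one mailto link, one ordinary link.
$html = '<a href="mailto:me@example.com">Mail</a> <a href="page.html">Page</a>';

preg_match_all($regex, $html, $matches);

// Only the ordinary link should be listed.
print_r($matches[1]);
```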
  9. Hi Mcod! The way your particular copyright symbols are encoded (not just byte 169, but byte 194 in front of it), I would go for something like this: Input: ©1 © leave it ©a ©abc ©2012 Code: <?php $regex=',[\xC2][\xA9]([[:alnum:]]),'; $string='©1 © leave it ©a ©abc ©2012 '; echo '<pre>'.htmlentities(preg_replace($regex, '©$1', $string)).'</pre>'; ?> Output: ©1 © leave it ©a ©abc ©2012 The weird Â character seems to be part of how your © is encoded (byte 194 / xC2 in front of byte 169 / xA9). But I'm an old Ascii man, so don't ask me about character encoding! I'm sure many people here can explain. (Maybe you can!) If you like, you can take out the Â by replacing [\xC2][\xA9] with \xA9
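On the encoding question: the byte pair xC2 xA9 is simply how UTF-8 stores the © character, and xC2 shown on its own in a Latin-1 view is the Â character. The sketch below checks that and shows one possible alternative to matching raw bytes, the /u modifier; the sample string is an assumption, and the "\u{...}" escape needs PHP 7 or later.

```php
<?php
// Build the copyright sign from its code point so the file encoding does not matter.
$copyright = "\u{00A9}";

// UTF-8 stores it as two bytes: C2 A9.
$hex = bin2hex($copyright);
$byteLength = strlen($copyright);
echo $hex, ' (', $byteLength, ' bytes)', PHP_EOL;

// With the /u modifier the regex works on characters, not bytes,
// so the sign can be matched as one unit followed by a letter or digit.
$sample = "\u{00A9}1 \u{00A9} leave it \u{00A9}abc"; // hypothetical input
$count = preg_match_all('/\x{00A9}(?=[[:alnum:]])/u', $sample, $found);
echo $count, ' copyright signs are followed by a letter or digit', PHP_EOL;
```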
  10. Hi Terry, From what you sent, I'd say very basic. But maybe there's more. The expression I sent is meant to work with a full-blown regex flavor. The commas are delimiters. They're part of the php code I sent you. If you're not using php (although this is the phpfreaks forum), then omit the commas when you paste the expression in your tool. For instance it works in regexbuddy. (?sm) turns on "dot matches new line" and "multiline" modes [^[] Means anything that is not an opening square bracket. (The caret here stands for NOT) \r is a carriage return, whether you need \r\n or \n depends on your OS. \r\n for Windows. * means zero or more. That's what it means in .* and in [^[]* Hope this helps, don't hesitate to ask more.
  11. You're very welcome, glad it helped.
  12. Here you go, fapapfap. Run this code, let me know if it works for you. (There can be more or less space between the lines, it doesn't matter.) Code: <?php $regex=',(?s)(?><tr>(?:[ \r\n]*)(?:<td.*?</td>(?:[ \r\n]*)){7})<td>[^>]+>([^<]+)(?:[^>]+>){4}([^<]+),'; $string='<tr> <td><font size="-1">12.34.56.78</font></td> <td><font size="-1">GB</font></td> <td><font size="-1">random things</font></td> <td><font size="-1">randomthings</font></td> <td><font size="-1">random things</font></td> <td><font size="-1">random things</font></td> <td><font size="-1"></font></td> <td><font size="-1">30.9500</font></td> <td><font size="-1">-2.2000</font></td> <td><font size="-1">random things</font></td> <td><font size="-1">random things</font></td> <td><font size="-1"></font></td> <td><font size="-1"></font></td> </tr>'; preg_match($regex,$string,$match); echo $match[1].'<br />'; echo $match[2].'<br />'; ?> Output: 30.9500 -2.2000
  13. Hi again Terry, If you don't have PHP, for the simple REPLACE approach I gave you above, I'd use a program that has regex search-and-replace capabilities. Two that I like: EditPadPro, Aba Search and Replace. There's also some regex replace functionality in some Adobe programs (Dreamweaver, Indesign). The regex flavor there is probably strong enough for the expression I gave you, which is fairly simple. Some of the IDEs have regex functionality: Code::Blocks, NetBeans. I haven't fully tested them. Let me know if you need any help with the two linked tools or the Adobe tools.
  14. Hi Terrypin, Didn't have time to look at Joe's solution, rushing out, just wanted to give you a preg_replace option. You can run this php code. The Regex: ,(?sm)\[([^]]+\.jpg)\].*?- COMMENT -(\r\n[^[]*), Code: <?php $regex=',(?sm)\[([^]]+\.jpg)\].*?- COMMENT -(\r\n[^[]*),'; $string='[blackfordLane.jpg] File name = BlackfordLane.jpg Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\ Compression = JPEG, quality: 87, subsampling OFF Resolution = 96 x 96 DPI File date/time = 19/01/2012 / 15:01:23 - IPTC - Object Name - s bridge over the River Thames is not a footbridge but carries pipes. - COMMENT - Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton. [Castle Eaton Church.jpg] File name = Castle Eaton Church.jpg Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\ Compression = JPEG, quality: 87, subsampling OFF Resolution = 72 x 72 DPI File date/time = 19/01/2012 / 14:03:55 - EXIF - Make - FUJIFILM Model - FinePix2600Zoom Orientation - Top left XResolution - 72 YResolution - 72 ResolutionUnit - Inch - COMMENT - Castle Eaton Church [CastleEaton-2.jpg] File name = CastleEaton-2.jpg Directory = C:\Docs\My Videos\PROJECTS\Thames Path Walk Projects\TP03 Project\Geograph Photos\GeoDay2\ Compression = JPEG, quality: 75 Resolution = 0 x 0 DPI File date/time = 18/01/2012 / 15:40:05 - COMMENT - The Red Lion, Castle Eaton A warm welcoming pub on a cold winter\'s day, with the River Thames running at the bottom of the garden. '; $s=preg_replace($regex,'\1\2',$string); echo '<pre>'.$s.'</pre>'; ?> Output: BlackfordLane.jpg Thames Path on Blackford Lane heading towards Blackford Farm, east of Castle Eaton. Castle Eaton Church.jpg Castle Eaton Church CastleEaton-2.jpg The Red Lion, Castle Eaton A warm welcoming pub on a cold winter's day, with the River Thames running at the bottom of the garden. 
Didn't have time to look at the fine details, let me know if that works for you.
  15. I don't seem to have trouble hanging on to Smarties, it's more M&Ms that give me trouble. It's that crunchy peanut inside the chocolate...
  16. You are the real programmer, Debbie... regex is just my Sunday crossword.
  17. Darnit, McK, that's a disappointment. I thought I was helping you build a spam robot.
  18. Mmm... 1. You could limit the size of each component (e.g., the name) with a quantifier such as {2,10}. Not a solution that would impress Bill Gates. 2. You could write a horrible OR tree to specify each of the characters (if you had 200 years to live). 3. You could use strlen() to check the input programmatically. 4. And... your favorite, I am sure: just before the $, you could insert a (?<=^.{1,60}), which is a lookbehind. But not in PHP, as it doesn't allow variable-width lookbehinds (.NET does). I'll post more if they come to mind. Warmest wishes, A
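Option 3 is the least glamorous but the easiest to get right. A minimal sketch, assuming a 60-character limit as in option 4; the function name withinLength is made up for this example, and note that strlen() counts bytes, not characters.

```php
<?php
// Hypothetical helper: true when the input has between 1 and $max bytes.
function withinLength(string $input, int $max = 60): bool
{
    $length = strlen($input);
    return $length >= 1 && $length <= $max;
}

$short = str_repeat('a', 60); // exactly at the limit
$long  = str_repeat('a', 61); // one byte too many

var_dump(withinLength($short)); // at the limit
var_dump(withinLength($long));  // over the limit
var_dump(withinLength(''));     // empty input
```

You would run this check before (or alongside) the regex, so the pattern itself only has to worry about the format.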
  19. No, your {,60} quantifier applies to the whole expression, so it would allow up to 60 email addresses. Correct. There's no need to bother about the lower boundary of the quantifier. (As you already knew, seeing your quantifier.) Fabulous. Good to hear your voice, Debbie, talk to you soon. -A
  20. P.S.: It's the same principle as for your strong password thread. Hope it works for you, let me know if you run into any probs. Wishing you a fun weekend, Andy
  21. Hi Debbie, Try this. Without looking at the details of your expression, I inserted a lookahead at the very beginning. It checks that the string has between 1 and 60 characters. if (preg_match('#^(?=.{1,60}$)[A-Z0-9_\+-]+(\.[A-Z0-9_\+-]+)*@[A-Z0-9-]+(\.[A-Z0-9-]+)*\.([A-Z]{2,7})$#i', $trimmed['email'])){ It will match 123@5678901234567890123456789012345678901234567890123456.com but not 123@56789012345678901234567890123456789012345678901234567.com (One more digit before the .com)
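If you want to check the length claim yourself, here is a small sketch that runs the pattern above against a 60-character and a 61-character address. The two test addresses are built with str_repeat() so the lengths are easy to verify; they are stand-ins with the same shape as the examples in the post, not the exact strings.

```php
<?php
// Same pattern as in the post: the (?=.{1,60}$) lookahead checks the total length
// before the rest of the expression validates the format.
$regex = '#^(?=.{1,60}$)[A-Z0-9_\+-]+(\.[A-Z0-9_\+-]+)*@[A-Z0-9-]+(\.[A-Z0-9-]+)*\.([A-Z]{2,7})$#i';

// "123@" (4) + 52 digits + ".com" (4) = 60 characters.
$ok = '123@' . str_repeat('5', 52) . '.com';
// One more digit before the .com = 61 characters.
$tooLong = '123@' . str_repeat('5', 53) . '.com';

printf("%d chars: %s\n", strlen($ok), preg_match($regex, $ok) ? 'match' : 'no match');
printf("%d chars: %s\n", strlen($tooLong), preg_match($regex, $tooLong) ? 'match' : 'no match');
```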
  22. Okay, focus on this part of your expression: \d+((?<="<Send Email to ).+) After the digits (\d+), you want to match STUFF (.+) that is preceded by "<Send Email to But there is no such stuff. After the digits, you go straight to "<Send Email Let me explain in detail, as this is a key point of lookarounds. See, the lookbehind does not JUMP over characters. After the digits, the regex engine is standing between the 9 and the " At this stage, if you use a lookaround, you stay PLANTED in that position between the 9 and the " With a lookbehind, you look to the left for "<Send, and of course you're not going to find that, there are only digits. If you used a lookahead, you'd be looking to the right of that spot between 9 and ", so you'd be seeing a double quote and some stuff. And after each lookbehind or lookahead, you're still standing in the same spot! This might make your head spin for a moment because your current understanding of lookarounds is a different paradigm. It's like these images you can see with two geometries, with the stairs either going up or going down... Once it clicks, it will be clear as day. Ctrl + F conditionals on my Tut for more on this topic. (I'm doing a major revamp but it's not ready.) Talk soon bro!
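The "you stay planted" point is easier to see by running it. Below is a minimal sketch; the subject string is invented (the real test text is not shown in this thread), and the variable names are assumptions.

```php
<?php
$subject = 'Ticket 789"<Send Email to bob'; // hypothetical test string

// 1. Lookbehind right after the digits: it looks LEFT from the spot after the 9,
//    sees only digits, and the whole match fails.
$stuck = preg_match('/\d+((?<="<Send Email to ).+)/', $subject);

// 2. Lookahead in the same spot: it looks RIGHT and succeeds, but the engine
//    has not moved, so the very next character matched is still the double quote.
preg_match('/\d+(?="<Send Email to )(.)/', $subject, $ahead);

// 3. To get past the literal text, match (consume) it instead of looking at it.
preg_match('/\d+"<Send Email to (.+)/', $subject, $consumed);

var_dump($stuck);       // number of matches for the lookbehind version
var_dump($ahead[1]);    // the character right after the lookahead position
var_dump($consumed[1]); // the text after the consumed literal
```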
  23. Ah, yes, I should go splash some cold water on my face to wake myself up. Can you paste some of the actual text that the pattern is supposed to match? Without that, I have a hard time troubleshooting an expression.
  24. Hey McK, If that's the actual code you're running, are you sure you have the right test string? For instance, I don't see SendEmail in the string.
  25. Hi McK, It looks to me like the quote in (?>=" closes the pattern string. On your earlier tests, you escaped the double quote, so it worked.