another regex question...

mck.workman · January 19, 2012

Hey! I was wondering whether you can specify something like the below in a single pattern, or whether you would have to use two different regexes.

I am trying to match only the .gh files below by saying:

/http:\/\/www.grasshopper3d.com\/forum\/attachment\/download\?id=2985220%3AUploadedFile%3A[0-9]{6}[^.+\.gh]/

The intent is to match only the links to .gh files, without including the .gh itself in the match (i.e. exclude the .jpg and other files).

Data:

"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A501843">01.jpg</a>

"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST.gh</a>

Thank you!

McK

abareplace · January 19, 2012

Hi, mck.workman,

You can use a positive lookahead to check for the presence of ".gh" without including it in the match.

<?php

$regex = '/http:\/\/www.grasshopper3d.com\/forum\/attachment\/download\?id=2985220%3AUploadedFile%3A[0-9]{6}">[^.]+(?=\.gh)/';

$data = '"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A501843">01.jpg</a> 
"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST.gh</a>';

if (preg_match($regex, $data, $matches))
print_r( $matches );

Read more about lookahead.
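To collect every .gh link rather than only the first one, the same lookahead idea can be combined with preg_match_all. This is only a sketch under some assumptions: the "#" delimiter, the escaped dots, the extra capture group for the file name, and the variable names are illustrative choices, not part of the pattern above.

```php
<?php
// Sample data from the question: one .jpg link and one .gh link.
$data = '"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A501843">01.jpg</a>
"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST.gh</a>';

// "#" is used as the delimiter so the slashes in the URL need no escaping.
// ([^.<]+) captures the file name; (?=\.gh) requires ".gh" right after it
// but leaves it out of the match.
$regex = '#http://www\.grasshopper3d\.com/forum/attachment/download\?id=2985220%3AUploadedFile%3A[0-9]{6}">([^.<]+)(?=\.gh)#';

// PREG_SET_ORDER gives one array per match: [0] whole match, [1] file name.
$count = preg_match_all($regex, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "Match: " . $m[0] . "\n"; // whole match, without ".gh"
    echo "File: " . $m[1] . "\n";  // file name only
}
```

Because [^.<]+ stops at the first dot, this sketch shares the limitation of the pattern above: a name such as try.this.gh would not be matched.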

Hope this helps.

ragax · January 19, 2012

Hi McK!

aba is right that lookaheads are a nice way to do it!

Here's code for another solution without lookaheads, which has several benefits.

1. It's a bit more general, in case you'd like to capture files with various numbers,

2. It also works for files that have a dot in them, like try.this.gh

It also matches a bit faster (61 steps vs 112 for the gh string you supplied), but that's immaterial.

Input:

"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A501843">01.jpg</a>

"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST.gh</a>

"http://www.grasshopper3d.com/forum/attachment/download?id=88UploadedFile%3A981">Another.One.gh</a>

Code:

<?php 
$string = '"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A501843">01.jpg</a>
"http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST.gh</a> 
"http://www.grasshopper3d.com/forum/attachment/download?id=88UploadedFile%3A981">Another.One.gh</a>';

$pattern = ',(http://www\.grassh[^?]+\?id[^U]+Up[^>]+>(([^.<]*?\.?)*))\.gh,';
$hit = preg_match_all($pattern,$string,$matches,PREG_PATTERN_ORDER);
$sz=count($matches[0]);
for ($i=0;$i<$sz;$i++) {
echo "Match: ".$matches[1][$i]."<br />";
echo "File: ".$matches[2][$i]."<br /><br />";
}
?>

Output:

Match: http://www.grasshopper3d.com/forum/attachment/download?id=2985220%3AUploadedFile%3A506981">SURFACE-DIAGRID-TEST

File: SURFACE-DIAGRID-TEST

Match: http://www.grasshopper3d.com/forum/attachment/download?id=88UploadedFile%3A981">Another.One

File: Another.One

Nothing wrong with aba's solution, just wanting to give you another option.

Let us know if these work for you.

ragax · January 19, 2012

It also matches a bit faster (61 steps vs 112 for the gh string you supplied), but that's immaterial.

Edit: I have it in reverse. Aba's is the faster one.

mck.workman · January 20, 2012

Hey guys!

Thanks for the input, things are working beautifully. Can you use pre and post at the same time?

Playful, I do have a question about understanding the '(([^.<]*?\.?)*))\.gh' part of your regex to match '01.jpg</a>'

Translation: "(([not including any character <] zero or more times)maybe)any character maybe)zero or more times).gh"

Question: Where is my translation wrong because you need to say something like ([not including any character]one or more times)<\/a>\.gh)" right?

Thank you again for your help!

ragax · January 20, 2012

Hey McK,

Great to hear from you, and to hear that the expressions from Aba and myself are helping with your project.

I do have a question about understanding the '(([^.<]*?\.?)*))\.gh' part of your regex

Sure! Here is a commented / unrolled version, using comment mode (aka whitespace mode).

(This expression will actually work in preg_match if you put it inside a pattern string with some delimiters.)

(?x)           # comment mode
(              # Start group 1 capture: the whole url without .gh
STUB>          # This is the part of the url up to >
(              # Start Group 2 capture: this is the file name without  .gh
               # On the line below, you could use (?: instead as it is not intended to be capturing
(              # Expression "A": Zero or More times... (set by the * at the end)
[^.<]*?        # Lazily Match characters that are neither dots nor <, expanding as needed
\.?            # Then match one dot if available, but give it back if necessary to complete the overall match
)*             # End Expression A, which has repeated zero or more times
               # Expression A has matched a series of zero or many stuffDOT, more_stuffDOT, but gives up the last DOT to allow .gh to match.
)              # End Group 2 capture
)              # End group 1 capture
\.gh           # Match .gh (but dont capture)

Note that this exact regex will work on STUB>AnotherOne.gh</a>

It is the original expression minus everything up to the >.

I hope this answers your question, please don't hesitate to ask if any of it is unclear!

ragax · January 20, 2012

Couldn't resist posting working php code for this:

<?php
$string = 'STUB>AnotherOne.gh</a>';
if (preg_match('~(?x)           # comment mode
(              # Start group 1 capture: the whole url without .gh
STUB>          # This is the part of the url up to >
(              # Start Group 2 capture: this is the file name without  .gh
               # On the line below, you could use (?: instead as it is not intended to be capturing
(              # Expression "A": Zero or More times... (set by the * at the end)
[^.<]*?        # Lazily Match characters that are neither dots nor <, expanding as needed
\.?            # Then match one dot if available, but give it back if necessary to complete the overall match
)*             # End Expression A, which has repeated zero or more times
               # Expression A has matched a series of zero or many stuffDOT, more_stuffDOT, but gives up the last DOT to allow .gh to match.
)              # End Group 2 capture
)              # End group 1 capture
\.gh           # Match .gh (but dont capture)~', $string,$match))
{
echo "Match: ".$match[1]."<br />";
echo "File: ".$match[2]."<br /><br />";
}
?>

Output:

Match: STUB>AnotherOne

File: AnotherOne
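As the comment in the expression notes, the inner repeated group does not need to capture. A minimal sketch of that variant, assuming a test string with a dot in the file name (the string and variable names here are illustrative):

```php
<?php
$string = 'STUB>Another.One.gh</a>';

// Same expression as above, but the inner repeated group is written as
// (?: ... ) so it no longer creates a third capture group.
$pattern = '~(STUB>((?:[^.<]*?\.?)*))\.gh~';

if (preg_match($pattern, $string, $match)) {
    echo "Match: " . $match[1] . "\n";            // url part without .gh
    echo "File: " . $match[2] . "\n";             // file name without .gh
    echo "Groups: " . (count($match) - 1) . "\n"; // capture groups returned
}
```

With the non-capturing group, $match holds only the whole match plus groups 1 and 2, which keeps the result array a little easier to read.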

mck.workman · January 20, 2012

Got it. That makes perfect sense. If you don't mind, I have just one more for you. You introduced me to using groups with regexes, which I read a bit about and have been playing with. However, when I try to use a positive lookahead and a positive lookbehind together, they don't work... but individually they do. I haven't found anything that sheds light on why.

//This works:
$url = file_get_contents("http://protege-ontology-editor-knowledge-acquisition-system.136.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=68583");
$pattern1 = "/user\/SendEmail\.jtp\?type=user.+;user=\d+/";
$pattern2 = "/(?<=\">Send Email to ).+(?=<)/";
preg_match_all($pattern1, $url, $useremail);
preg_match_all($pattern2, $url, $username);
print_r($useremail);
print_r($username);

//This doesn't:
$url = file_get_contents("http://protege-ontology-editor-knowledge-acquisition-system.136.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=68583");
$pattern = "/(user\/SendEmail\.jtp\?type=user&user=\d+)((?<="<Send Email to ).+(?=<))/";
preg_match_all($pattern, $url, $userInfo);
echo 'UserEmail: '.$userInfo[1][0];
echo 'UserName: '.$userInfo[2][0];

ragax · January 20, 2012

Hi McK,

((?<="<Send Email to ).+(?=<))

It looks to me like the quote in (?<=" closes the pattern string.

On your earlier tests, you escaped the double quote, so it worked.
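One way to sidestep the escaping question is to put the pattern in a single-quoted PHP string, where a double quote is an ordinary character. A small sketch, assuming a sample anchor tag of the kind the page seems to contain (the $html value is an assumption, not the real page):

```php
<?php
// Assumed sample of the markup being searched.
$html = '<a href="/user/SendEmail.jtp?type=user&user=195799">Send Email to shreyes</a>';

// Inside a single-quoted PHP string a double quote is just a character,
// so the lookbehind can contain " without a backslash.
$pattern = '/(?<=">Send Email to ).+(?=<)/';

if (preg_match($pattern, $html, $m)) {
    echo "Name: " . $m[0] . "\n";
}
```

The trade-off is that a literal single quote in the pattern would then need escaping instead.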

mck.workman · January 20, 2012

Sorry, I copied it here from a regex tester where I didn't need to escape the quote, but in my PHP code I actually did, and it's still feeding me empty arrays.

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )

$url = file_get_contents("http://protege-ontology-editor-knowledge-acquisition-system.136.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=68583");
$pattern = "/(user\/SendEmail\.jtp\?type=user&user=\d+)((?<=\"<Send Email to ).+(?=<))/";
preg_match_all($pattern, $url, $userInfo);
print_r($userInfo);
// echo('Email: '.$userInfo[1][0]);
// echo('Name: '.$userInfo[2][0]);

ragax · January 20, 2012

its still feeding me empty arrays

Hey McK,

If that's the actual code you're running, are you sure you have the right test string?

For instance, I don't see SendEmail in the string.

mck.workman · January 20, 2012

The url isn't the test string. When I say file_get_contents it returns a string of the html contents of the page so that is the string it is searching.

ragax · January 21, 2012

Ah, yes, I should go splash some cold water on my face to wake myself up.

Can you paste some of the actual text that the pattern is supposed to match?

Without that, I have a hard time troubleshooting an expression.

mck.workman · January 21, 2012

Sure!

<a href="/user/SendEmail.jtp?type=user&user=195799">Send Email to shreyes</a>

ragax · January 21, 2012

Okay, focus on this part of your expression:

\d+((?<="<Send Email to ).+)

After the digits (\d+), you want to match STUFF (.+) that is preceded by "<Send Email to

But there is no such stuff.

After the digits, you go straight to ">Send Email

Let me explain in detail, as this is a key point of lookarounds.

See, the lookbehind does not JUMP over characters.

After the digits, the regex engine is standing between the 9 and the "

At this stage, if you use a lookaround, you stay PLANTED in that position between the 9 and the "

With a lookbehind, you look to the left for "<Send, and of course you're not going to find that, there are only digits.

If you used a lookahead, you'd be looking to the right of that spot between 9 and ", so you'd be seeing a double quote and some stuff.

And after each lookbehind or lookahead, you're still standing in the same spot!

This might make your head spin for a moment because your current understanding of lookarounds is a different paradigm. It's like these images you can see with two geometries, with the stairs either going up or going down...

Once it clicks, it will be clear as day.
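If it helps to see it run, here is a small sketch of the "planted in one spot" idea. The shortened test string and the variable names are assumptions for illustration:

```php
<?php
$text = 'user=195799">Send Email to shreyes</a>';

// Lookbehind right after the digits: the engine stands between 9 and "
// and looks LEFT, where it finds only digits, so this cannot match.
$afterDigitsBehind = preg_match('/\d+(?<=">Send Email to ).+/', $text);

// Lookahead at the same spot looks RIGHT and sees ">Send Email to,
// so it succeeds, and the match itself is still just the digits.
preg_match('/\d+(?=">Send Email to )/', $text, $ahead);

// A lookbehind on its own works, because the engine tries it at every
// position, including the one just before the user name.
preg_match('/(?<=">Send Email to )[^<]+/', $text, $behind);

echo "digits + lookbehind matched: " . $afterDigitsBehind . "\n";
echo "digits + lookahead: " . $ahead[0] . "\n";
echo "lookbehind alone: " . $behind[0] . "\n";
```

The first call reports no match, while the other two succeed, which is the same behavior McK saw: each lookaround works alone, but the lookbehind fails when it is placed right after the digits.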

Ctrl + F conditionals on my Tut for more on this topic. (I'm doing a major revamp but it's not ready.)

Talk soon bro!

mck.workman · January 21, 2012

Okay I see. That makes sense that you can't skip over a part. Thank you for the explanation.

abareplace · January 21, 2012

McK, may I ask, for what are you using the regular expression? Are you trying to collect the email addresses for marketing purposes (i.e. spam)? I'm sorry if the question is rude.

mck.workman · January 21, 2012

No. No. No. I am learning to use a software called Protege for building ontologies and would like to get more involved with the Protege user community, but there is no way to tell if there are any users in my area. I was learning to use KML with Google Maps and thought that if members of the forum could see other members tagged on Google Maps with a link to their email, they could contact local users in their area by clicking their email link. AND it's perfect because I don't know a lot about security, so the forum takes care of that by not letting them log in to send an email if they are not registered! I am not a spammer. I have morals.

Check out the pic attached of the website I am trying to build for this to happen.

Ultimately... I would like to send what I have done to them and ask if they would be willing to put a link on their site to my site that allows users to connect with others in their area. If they say no... well, I will have learned a lot from the exercise.

No problem! You have every right to ask.

McKinnley

ragax · January 21, 2012

Darnit, McK, that's a disappointment. I thought I was helping you build a spam robot.

abareplace · January 22, 2012

McK, I'm sorry. As a geek, I'm paranoidally suspicious.

Your regex will work if you include the page address in the lookbehind:

(?<=(user/SendEmail\.jtp\?type=user&user=\d+)">Send Email to ).+(?=<)

However, most regex engines don't support variable-length lookbehind (\d+ can have any length, from one character to infinity), so it will work only in .NET, RegexBuddy, or my tool.

In PHP, you can use the usual capturing groups:

<?php
$url = '<a href="/user/SendEmail.jtp?type=user&user=195799">Send Email to shreyes</a>';
$pattern = "/(user\/SendEmail\.jtp\?type=user&user=\d+)\">Send Email to (.+)(?=<)/";
preg_match_all($pattern, $url, $userInfo);
echo 'UserAddress: '.$userInfo[1][0] . "<br>\n";
echo 'UserName: '.$userInfo[2][0];
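
On a real page several links may share a line, and there a greedy (.+) would run on to the last "<" of that line. A sketch of a variant that uses [^<]+ instead, so each name ends at its own closing tag; the second link and its user name are made up for the example:

```php
<?php
// Two links on one line; the second user is hypothetical.
$html = '<a href="/user/SendEmail.jtp?type=user&user=195799">Send Email to shreyes</a> '
      . '<a href="/user/SendEmail.jtp?type=user&user=123456">Send Email to example_user</a>';

// [^<]+ stops at the first "<", so no lookahead is needed here.
$pattern = '/(user\/SendEmail\.jtp\?type=user&user=\d+)">Send Email to ([^<]+)/';

// PREG_SET_ORDER returns one array per link: [1] address, [2] name.
preg_match_all($pattern, $html, $userInfo, PREG_SET_ORDER);

foreach ($userInfo as $info) {
    echo 'UserAddress: ' . $info[1] . "\n";
    echo 'UserName: ' . $info[2] . "\n";
}
```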

Good luck with your project! It should be very useful for the Protege community.

mck.workman · January 22, 2012

Thanks! No prob. It's funny. Yesterday and today I have been running into security issues: 500 errors. Apparently their servers block the file_get_contents function for personal pages... Oh well, if I don't find another way, live and learn.

Thanks for your help.

McK
