The Little Guy Posted September 21, 2009
Here is my regex:
~\<a.+?href=[\"|'][^javascript|mailto|aim\:](.+?)[\"|']~
I am using cURL to grab the HTML from a page, and with the above pattern I want to get the href values from all the "a" tags. With the current pattern, it gets the info I want, but it takes off the first character...
print_r($matches[1]);
Other than that, I think it works, and any suggestions are appreciated! Thanks!
thebadbad Posted September 21, 2009
You've misunderstood how character classes work. The first char is removed if it is matched by your negated character class (and thus not grabbed) - a character class doesn't treat its content as a string but as separate characters. What you're trying to do could be done with a negative lookbehind
(?<!javascript:|mailto:|aim:)
instead of
[^javascript|mailto|aim\:]
And I would go with
'~<a\b[^>]+href\s?=\s?[\'"](?<!javascript:|mailto:|aim:)(.*?)[\'"]~is'
for the full pattern (note the word boundary and pattern modifiers).
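To make the character-class point concrete, here is a minimal sketch that runs the original pattern against two made-up links. The sample HTML is an assumption for illustration only; the pattern is the one from the first post.

```php
<?php
// Hypothetical sample markup, not taken from the thread.
$html = '<a href="http://example.com/">A</a> <a href="index.php">B</a>';

// The original pattern. [^javascript|mailto|aim\:] is a SET of single
// characters (j, a, v, s, c, r, i, p, t, |, m, l, o, :), not a list of words.
$original = '~\<a.+?href=[\"|\'][^javascript|mailto|aim\:](.+?)[\"|\']~';
preg_match_all($original, $html, $matches);
print_r($matches[1]);
// Only "ttp://example.com/" is returned: the class consumed the "h",
// and "index.php" was skipped because "i" is one of the excluded characters.

// The same effect, isolated to the class itself:
$startsWithExcludedChar = preg_match('~^[^javascript|mailto|aim\:]~', 'index.php');
var_dump($startsWithExcludedChar); // int(0)
```

So besides eating the first character, the negated class also silently drops any link whose first character happens to be one of those letters.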
.josh Posted September 21, 2009
Actually you need to use a negative lookahead, not lookbehind, because with a lookbehind, you have to have the same amount of chars for each alternation.
MadTechie Posted September 21, 2009
My 2 Pence
if (preg_match('/<a[^>]+?href\s*=\s*(["|\'])(?!(?:javascript|mailto|aim):)(.+?)\1/sim', $html, $match)) {
    $match = $match[2]; // I used 2, as 1 was used to capture the ' or "
}
.josh Posted September 21, 2009
MadTechie: You have a | in your quote char class. This won't break it, though it will match a literal | in the off-chance it exists in the tag... IOW, you don't need to do alternation inside a char class, because it is already built in to character classes to match a OR b OR c OR etc...
Also, this:
(?!(?:javascript|mailto|aim):)
can be this:
(?!javascript|mailto|aim)
because the lookahead's own parentheses already group the alternation (and lookaheads and lookbehinds are zero-width assertions, so nothing extra gets captured).
edit: I think I know why you did that for the neg lookahead: you wanted to move the : outside of the alternation, since it is present on all 3. My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :.
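Both remarks are easy to check. The following is a small sketch with made-up test strings (the strings are assumptions, not from the thread): the first part shows the | inside a bracket expression matching a literal pipe, the second shows the lookahead behaving the same with and without the inner non-capturing group.

```php
<?php
// Inside [...] the | is just another literal member of the set.
$withPipe    = preg_match('~^["|\']$~', '|'); // class contains ", | and '
$withoutPipe = preg_match('~^["\']$~', '|');  // class contains " and '
var_dump($withPipe, $withoutPipe); // int(1) int(0)

// The lookahead's parentheses already scope the alternation,
// so the extra (?: ... ) group changes nothing.
$grouped   = '~^(?!(?:javascript|mailto|aim))~';
$ungrouped = '~^(?!javascript|mailto|aim)~';
$same = true;
foreach (['javascript:void(0)', 'mailto:me@example.com', 'index.php'] as $href) {
    if (preg_match($grouped, $href) !== preg_match($ungrouped, $href)) {
        $same = false;
    }
}
var_dump($same); // bool(true)
```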
MadTechie Posted September 21, 2009
LMAO.. yeah, missed that. I just grabbed The Little Guy's pattern and tweaked it, I always miss things like that! Finished work now.. off home! ttfn
thebadbad Posted September 21, 2009
.josh wrote: "because with a lookbehind, you have to have the same amount of chars for each alternation."
That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:
'~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is'
.josh wrote: "My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :"
You could have relative paths beginning with any of those values, so I kept the colons.
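A short sketch of why the lookbehind version fell short and the lookahead version does the job, using made-up sample HTML (an assumption for illustration). Placed right after the opening quote, the lookbehind inspects the text before that point, which always ends in the quote character itself, so it can never see the scheme that follows.

```php
<?php
// Hypothetical sample markup, not taken from the thread.
$html = '<a href="http://example.com/">A</a> '
      . '<a href="mailto:me@example.com">B</a> '
      . '<a href="javascript:void(0)">C</a> '
      . '<a href="mailto/contact.html">D</a>';

// Lookbehind version from the earlier post: it filters nothing here.
$lookbehind = '~<a\b[^>]+href\s?=\s?[\'"](?<!javascript:|mailto:|aim:)(.*?)[\'"]~is';
preg_match_all($lookbehind, $html, $before);

// Lookahead version: checks what comes AFTER the opening quote.
$lookahead = '~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is';
preg_match_all($lookahead, $html, $after);

print_r($before[1]); // all four href values
print_r($after[1]);  // http://example.com/ and mailto/contact.html
```

The relative path mailto/contact.html survives because the colon is part of each alternative, which is the reason given for keeping the colons.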
.josh Posted September 21, 2009
thebadbad wrote: "That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:"
No sir, you are incorrect. Perl (and therefore PCRE) regex does not support non-fixed-length lookbehinds. Here is a very simple example:
<?php
$string = "blahblah";
preg_match('~(?<!.*).*~', $string, $matches);
print_r($matches);
?>
Produces:
Warning: preg_match() [function.preg-match]: Compilation failed: lookbehind assertion is not fixed length at offset 6 in ../test.php on line 3
thebadbad wrote: "You could have relative paths beginning with either of those values, so I kept the colons."
True, I didn't think of them being dir names. Okay, you got me there.
.josh Posted September 21, 2009
edit: eh.. I guess that's not the same thing. You're right. I can do
(?<!b|bl)
I could have sworn that wasn't the case.
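The distinction the two posters settled on can be sketched as follows: PCRE accepts a lookbehind whose top-level alternatives have different lengths, provided each alternative is itself fixed length, while an unbounded quantifier such as .* inside a lookbehind fails to compile. The test strings are made up for illustration, and the @ operator is used only to silence the compilation warning quoted in the thread.

```php
<?php
// Alternatives of length 1 and 2: compiles and runs.
$blocked = preg_match('~(?<!b|bl)ah~', 'blah'); // "ah" is preceded by "bl"
$allowed = preg_match('~(?<!b|bl)ah~', 'yeah'); // "ah" is preceded by "ye"
var_dump($blocked, $allowed); // int(0) int(1)

// Unbounded length inside the lookbehind: compilation fails,
// preg_match() emits a warning and returns false.
$invalid = @preg_match('~(?<!.*).*~', 'blahblah');
var_dump($invalid); // bool(false)
```

Note that a return value of 0 means "compiled, no match", whereas false signals an error, so the two need a strict comparison to tell apart.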
thebadbad Posted September 21, 2009
Well, I'm glad you're learning something too.