Regex removes first character


The Little Guy

Here is my regex:

~\<a.+?href=[\"|'][^javascript|mailto|aim\:](.+?)[\"|']~

 

I am using cURL to grab the HTML from a page, and with the above code I want to get all the href links from all the "a" tags.

 

With the current code, it gets the info I want, but it takes off the first character...

 

print_r($matches[1]);

 

Other than that, I think it works, and any suggestions are appreciated!

 

Thanks!


You've misunderstood how character classes work. The first char is removed if it is matched by your negated character class (and thus not grabbed) - a character class doesn't treat its content as a string but as separate characters. What you're trying to do could be done with a negative lookbehind

 

(?<!javascript:|mailto:|aim:)

 

instead of

 

[^javascript|mailto|aim\:]

 

And I would go with

 

'~<a\b[^>]+href\s?=\s?[\'"](?<!javascript:|mailto:|aim:)(.*?)[\'"]~is'

 

for the full pattern (note the word boundary and pattern modifiers).
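
For anyone finding this later, here is a small sketch of why the first character goes missing. The pattern is the one from the first post; the sample HTML strings and the helper name first_href are made up for illustration.

```php
<?php
// Hypothetical helper: return capture group 1 of the first match, or null.
function first_href(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Pattern from the first post, unchanged.
$original = '~\<a.+?href=[\"|\'][^javascript|mailto|aim\:](.+?)[\"|\']~';

// 'h' is not one of the characters listed in the negated class, so the
// class consumes it and the capture group starts one character too late.
$truncated = first_href($original, '<a href="home.html">Home</a>');   // "ome.html"

// 'i' is listed in the class (it occurs in "javascript"), so this
// ordinary link is not matched at all.
$skipped = first_href($original, '<a href="index.html">Index</a>');   // null
```

So the negated class always eats exactly one character of the URL, and any URL whose first character happens to be one of the listed letters is dropped entirely.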


MadTechie:

 

You have a | in your quote char class.  This won't break it, though it will match a literal | on the off chance one exists in the tag. In other words, you don't need alternation inside a char class, because a character class already matches a OR b OR c, and so on.
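
A quick way to check this yourself (a minimal sketch; the variable names are only for illustration):

```php
<?php
// The quote class from the first post lists ", | and ' as three separate characters.
$quoteClass = '~^[\"|\']$~';

$matchesPipe  = preg_match($quoteClass, '|');   // 1: the pipe is matched literally
$matchesQuote = preg_match($quoteClass, '"');   // 1: a double quote still matches
$withoutPipe  = preg_match('~^[\'"]$~', '|');   // 0: no pipe listed, so no match
```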

 

 

Also, this:

 

(?!(?:javascript|mailto|aim)

 

can be this:

 

(?!javascript|mailto|aim)

 

because lookaheads (and lookbehinds) are zero-width assertions, and their own parentheses already group the alternation.

 

edit: I think I know why you did that for the neg lookahead: wanted to move the : outside of it since it is present on all 3.  My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :.


"because with a lookbehind, you have to have the same amount of chars for each alternation."

 

That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:

 

'~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is'

 

"My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :"

 

You could have relative paths beginning with either of those values, so I kept the colons.
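
A minimal sketch of the difference between the two patterns in this thread, using a made-up HTML snippet. The lookbehind sits right after the opening quote, so the text behind it always ends in href=" and never in javascript:, which means it can never reject anything. The lookahead inspects the text that follows the quote, which is what is wanted here.

```php
<?php
// Pattern with the negative lookbehind placed right after the opening quote.
$lookbehind = '~<a\b[^>]+href\s?=\s?[\'"](?<!javascript:|mailto:|aim:)(.*?)[\'"]~is';
// Same pattern with a negative lookahead instead.
$lookahead  = '~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is';

// Sample input invented for this example.
$html = '<a href="javascript:void(0)">x</a> <a href="home.html">Home</a>';

preg_match_all($lookbehind, $html, $behind);
preg_match_all($lookahead, $html, $ahead);

print_r($behind[1]); // javascript:void(0) and home.html: nothing was filtered
print_r($ahead[1]);  // home.html only
```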


"because with a lookbehind, you have to have the same amount of chars for each alternation."

 

"That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:"

 

No sir, you are incorrect.  Perl (and therefore PCRE) regex does not support non-fixed length lookbehinds.  Here is a very simple example:

 

<?php
$string = "blahblah";
preg_match('~(?<!.*).*~', $string, $matches); // .* inside a lookbehind has no fixed length
print_r($matches);
?>

 

Produces:

 

Warning: preg_match() [function.preg-match]: Compilation failed: lookbehind assertion is not fixed length at offset 6 in ../test.php on line 3
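
 

Worth separating two cases here. As far as I know from the PCRE documentation, the rule is that each top-level alternative in a lookbehind must have a fixed length, but the alternatives do not all have to share the same length. An unbounded quantifier such as .* is rejected, while javascript:|mailto:|aim: is accepted. A rough sketch to check it, where the helper name compiles is made up for this example:

```php
<?php
// Hypothetical helper: true if PCRE can compile the pattern.
// preg_match() returns false on a compile error; @ hides the warning.
function compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// Unbounded quantifier inside the lookbehind: compilation fails.
$unbounded = compiles('~(?<!.*).*~');

// Top-level alternatives of 11, 7 and 4 characters: compilation succeeds.
$alternation = compiles('~(?<!javascript:|mailto:|aim:)x~');

var_dump($unbounded, $alternation);
```

So both statements above hold in part: the .* example really does fail, but differing lengths across top-level alternatives are fine in PCRE.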

 

"My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :"

 

"You could have relative paths beginning with either of those values, so I kept the colons."

 

True, didn't think of them being dir names.  Okay you got me there.
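
 

To make the directory-name case concrete, a small sketch with an invented relative path. Only the lookahead part is tested here, anchored at the start of the href value:

```php
<?php
$withColon    = '~^(?!javascript:|mailto:|aim:)~i';
$withoutColon = '~^(?!javascript|mailto|aim)~i';

// Hypothetical relative link into a directory named "aim".
$relative = 'aim/index.html';

$keptWithColon    = preg_match($withColon, $relative);              // 1: link is kept
$keptWithoutColon = preg_match($withoutColon, $relative);           // 0: link is wrongly rejected
$scriptRejected   = preg_match($withColon, 'javascript:void(0)');   // 0: still filtered out
```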
