Regex removes first character


The Little Guy

Here is my regex:

~\<a.+?href=[\"|'][^javascript|mailto|aim\:](.+?)[\"|']~

 

I am using cURL to grab the HTML from a page, and with the above code I want to get all the href links from all the "a" tags.

 

With the current code, it gets the info I want, but it takes off the first character...

 

print_r($matches[1]);

 

Other than that, I think it works, and any suggestions are appreciated!

 

Thanks!


You've misunderstood how character classes work. A character class doesn't treat its content as a string but as a set of separate characters, so your negated class matches any single character that isn't one of the letters (or | or :) listed in it. That one character is consumed before the capturing group starts, which is why the first char goes missing from your result. What you're trying to do could be done with a negative lookbehind

 

(?<!javascript:|mailto:|aim:)

 

instead of

 

[^javascript|mailto|aim\:]

 

And I would go with

 

'~<a\b[^>]+href\s?=\s?[\'"](?<!javascript:|mailto:|aim:)(.*?)[\'"]~is'

 

for the full pattern (note the word boundary and pattern modifiers).
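
To see the character class behaviour described above in isolation, here is a minimal sketch. The sample markup is made up for illustration; the pattern is the one from the first post. The negated class eats the "h" of the first link, and the second link is skipped entirely because "i" is one of the characters listed in the class.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<a href="http://example.com/page">one</a> <a href="index.php">two</a>';

// Pattern from the original post. [^javascript|mailto|aim\:] matches ONE
// character that is not j, a, v, s, c, r, i, p, t, |, m, l, o or :,
// and that character sits outside the capturing group.
$pattern = '~\<a.+?href=[\"|\'][^javascript|mailto|aim\:](.+?)[\"|\']~';

preg_match_all($pattern, $html, $matches);

// Only the first link is found, minus its leading "h".
print_r($matches[1]);
```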

MadTechie:

 

You have a | in your quote char class.  This won't break it, though it will match a literal | in the off-chance it exists in the tag... IOW, don't need to do alternation inside a char class, because it is already built-in to character classes to match a OR b OR c OR etc...
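
A quick way to check the point about | inside a character class (a small sketch, the variable names are just for illustration):

```php
<?php
// Inside a character class, | is an ordinary character, not alternation.
$withPipe    = preg_match('~^["|\']$~', '|'); // class contains ", | and '
$withoutPipe = preg_match('~^["\']$~', '|');  // class contains only " and '

var_dump($withPipe);    // int(1): the literal pipe matches
var_dump($withoutPipe); // int(0): no pipe in the class, so no match
```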

 

 

Also, this:

 

(?!(?:javascript|mailto|aim):)

 

can be this:

 

(?!javascript|mailto|aim)

 

because lookaheads (and lookbehinds) are zero-width assertions that already group their own alternation.

 

edit: I think I know why you did that for the neg lookahead: wanted to move the : outside of it since it is present on all 3.  My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :.

because with a lookbehind, you have to have the same amount of chars for each alternation.

 

That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:

 

'~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is'

 

My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :

 

You could have relative paths beginning with either of those values, so I kept the colons.
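
As a rough illustration of the lookahead version and of why the colons matter, here is a sketch run against some made-up markup (the URLs and the $links variable are assumptions, not from the original page). The mailto: and javascript: links are skipped, while a relative path that merely starts with "aim" is kept.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = <<<'HTML'
<a href="http://example.com/">home</a>
<a href="mailto:me@example.com">mail</a>
<a class="x" href='javascript:void(0)'>js</a>
<a href="aim/buddies.html">relative path</a>
HTML;

// Pattern from the post above: the negative lookahead rejects the three
// schemes without consuming any characters, so the capture stays intact.
$pattern = '~<a\b[^>]+href\s?=\s?[\'"](?!javascript:|mailto:|aim:)(.*?)[\'"]~is';

preg_match_all($pattern, $html, $matches);
$links = $matches[1];

print_r($links); // http://example.com/ and aim/buddies.html
```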

because with a lookbehind, you have to have the same amount of chars for each alternation.

 

That's actually not true when dealing with PCRE in PHP. But I found out that my pattern is not working as intended, and solved it by using a negative lookahead as you suggested:

 

No sir, you are incorrect.  Perl (and therefore PCRE) regex does not support non-fixed length lookbehinds.  Here is a very simple example:

 

<?php
$string = "blahblah";
preg_match('~(?<!.*).*~', $string, $matches); // .* inside the lookbehind has no fixed length
print_r($matches);
?>

 

Produces:

 

Warning: preg_match() [function.preg-match]: Compilation failed: lookbehind assertion is not fixed length at offset 6 in ../test.php on line 3
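
The two posts seem to be talking about slightly different things. As far as I know from the PCRE documentation, a lookbehind may have several top-level alternatives of different lengths as long as each one is itself fixed length, whereas a quantifier like .* inside a lookbehind is rejected at compile time. A minimal sketch to check this on your own install (the subject strings are made up, and @ is only there to silence the compile warning):

```php
<?php
// Top-level alternatives with different, but individually fixed, lengths.
$ok = @preg_match('~(?<!javascript:|mailto:|aim:)x~', 'x');

// An unbounded quantifier inside the lookbehind: the pattern fails to compile.
$bad = @preg_match('~(?<!.*).*~', 'blahblah');

var_dump($ok);  // int(1): the pattern compiled and matched
var_dump($bad); // bool(false): compilation failed
```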

 

My thought though is that since all links starting with javascript|mailto|aim will have that : anyways, no need to further look for the :

 

You could have relative paths beginning with either of those values, so I kept the colons.

 

True, didn't think of them being dir names.  Okay you got me there.

