
A little help please.


d_barszczak


That red thing is a regular expression, which is an expression that can be used to match complex strings and gather information about them. Here is a description of how that particular regex works:

 

  • First, the / at the beginning and end are delimiters. They don't do anything but mark the beginning and end of the regex.
  • The first thing the regex matches is a character from the [\'"] character class, which means it will find either a single or a double quote.
  • Before the [\'"] sits the subpattern (?<=href=). This is a positive lookbehind subpattern, whose purpose is to check that the text immediately before the matched string matches the lookbehind pattern. In this case, the regex makes sure that the single or double quote it found is preceded by the string "href=".
  • Next, it will evaluate the subpattern ([^\s"\'>]+), which consists of a single negated character class [^\s"\'>]. The character class is negated because it starts with ^, which means that it will match anything except what's inside the character class. So, in this case it will match anything except whitespace (\s is a sequence which means any whitespace), a single quote, a double quote or the greater than sign (the right end of the tag). The quantifier + after it means that it will match 1 or more such characters. In the matches, this subpattern will be in match [1], because it is a capturing subpattern (unlike the lookbehind subpattern, which does not capture).
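
To see the pieces above working together, here is a minimal sketch. Two assumptions to keep in mind: the pattern is reconstructed from the description (the original is not quoted in plain text here), and it runs on Python's re module rather than PHP's PCRE functions, so the \' escapes needed inside a PHP string are dropped. The sample markup is made up.

```python
import re

# Reconstructed from the description above (an assumption, not a quote):
# a lookbehind for href=, a quote (made optional by the ? quantifier that
# is discussed right after this list), then the captured value.
HREF_RE = re.compile(r"""(?<=href=)['"]?([^\s"'>]+)""")

# Made-up sample markup for illustration.
html = '<a href="http://example.com/a">one</a> <a href=page.html>two</a>'

# With a single capturing group, findall returns the contents of group 1,
# which corresponds to match [1] in the explanation above.
values = HREF_RE.findall(html)
print(values)
```

The lookbehind text and the quote are not part of what gets captured, which is why only the bare values come back.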

 

If you noticed, one thing I did not mention is the quantifier ? after the [\'"] character class. This was for the sake of simplifying the explanation. This quantifier means 1 or 0 occurrences, so the quote is optional. In this particular regex the effect is that it will also accept anything that matches the subpattern ([^\s"\'>]+) and is directly preceded by 'href='. (Not that I can guarantee it works exactly like that internally. It might be that the engine also tries every single position between characters, where [\'"]? matches 0 characters, and checks whether that position is preceded by 'href='.)
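
A quick way to check what the ? changes is to run the pattern with and without it. This is only a sketch on Python's re module (not PCRE), with a pattern reconstructed from the description and made-up sample markup, so treat the names and inputs as assumptions.

```python
import re

# Same reconstructed pattern, once with the optional quote and once without.
with_optional_quote = re.compile(r"""(?<=href=)['"]?([^\s"'>]+)""")
with_required_quote = re.compile(r"""(?<=href=)['"]([^\s"'>]+)""")

# Made-up sample markup: one unquoted and one quoted attribute.
unquoted = '<a href=page.html>two</a>'
quoted = "<a href='page.html'>two</a>"

optional_hits = with_optional_quote.findall(unquoted)  # unquoted value is found
required_hits = with_required_quote.findall(unquoted)  # nothing is found
quoted_hits = with_optional_quote.findall(quoted)      # quoted value still works

# Every successful match starts directly after the literal text 'href='.
starts_ok = all(
    unquoted[m.start() - 5:m.start()] == 'href='
    for m in with_optional_quote.finditer(unquoted)
)
print(optional_hits, required_hits, quoted_hits, starts_ok)
```

Whatever the engine tries internally, the only positions that can succeed are the ones right after 'href=', because the lookbehind is anchored to the start of the match.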

 

In practice, this regex will pretty much try to find the values of all href attributes in the document.

 

One thing to point out is that this regex is horribly inefficient in practice. This is mostly due to its lookbehind subpattern. The problem is that PCRE can't start by searching for the lookbehind pattern. It starts from whatever the lookbehind should precede. This means that it will consider any ' or " (the first character class) and also any character that is not whitespace, a quote or > (the next character class), and check whether it is preceded by 'href='. So basically the regex engine will go through almost every single character in the subject to see whether there is 'href=' before it. Not very efficient, mind you.
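
If you want to measure this yourself, something along these lines is a starting point. It is a rough sketch only: it uses Python's re module, whose engine and optimizations differ from PCRE, so the numbers say nothing definite about PHP. The pattern is reconstructed from the description, and the document and helper name are made up.

```python
import re
import timeit

# Lookbehind version (reconstructed) versus a version that starts with the
# literal text 'href=' and captures the same value.
lookbehind_re = re.compile(r"""(?<=href=)['"]?([^\s"'>]+)""")
literal_re = re.compile(r"""href=['"]?([^\s"'>]+)""")

# Made-up document: lots of filler text with an occasional link.
doc = ('<p>' + 'lorem ipsum ' * 200 + '</p><a href="page.html">x</a>') * 50

# Both patterns should capture exactly the same values.
same_result = lookbehind_re.findall(doc) == literal_re.findall(doc)

def time_pattern(pattern, repeat=20):
    """Hypothetical helper: total seconds for `repeat` full scans of doc."""
    return timeit.timeit(lambda: pattern.findall(doc), number=repeat)

print('same captures:', same_result)
print('lookbehind:', time_pattern(lookbehind_re))
print('literal   :', time_pattern(literal_re))
```

Run it against your own documents and engine before drawing conclusions, since the gap depends heavily on both.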

 

In addition, the regex has a few flaws, which exist because it was easier to write the regex that way. Here is my suggestion for a better regex (which is about 10 times faster or so):

/href=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/

 

This will do the same job and fixes a few flaws in the pattern you have. By flaw I mean that an attribute's value should be everything inside the quotes if the attribute starts with a quote, whereas your regex stops at whitespace, >, ' or " regardless of whether the value started with a quote or not.

 

Note that in this regex, the url will be in subpattern 2 rather than 1.
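
Here is a small sketch of the suggested pattern in action. It runs on Python's re module, which accepts the same (?(1)...) conditional syntax, but it is not PCRE, and the sample markup is made up.

```python
import re

# The suggested pattern, written without the PHP string escapes.
# Group 1: the opening quote, if there is one.
# Group 2: the url (lazy .+? when quoted, [^\s>]+ when unquoted).
# (?(1)\1): if a quote was opened, require the same quote to close it.
BETTER_RE = re.compile(r"""href=(['"])?((?(1).+?|[^\s>]+))(?(1)\1)""")

# Made-up sample: a quoted value containing a space, an unquoted value,
# and a single-quoted value containing a double quote.
html = (
    '<a href="my page.html">one</a> '
    '<a href=page.html>two</a> '
    '<a href=\'it"s.html\'>three</a>'
)

# The url is always in group 2, as noted above.
urls = [m.group(2) for m in BETTER_RE.finditer(html)]
print(urls)
```

The first and third links are the cases the lookbehind pattern would cut short, since it stops at any whitespace or quote.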


I would move the last portion into the conditional: /href=([\'"])?((?(1).+?\1|[^\s>]+))/

I actually had it that way initially, but decided to put the last portion outside for the simple reason that, with my regexp, subpattern 2 will always contain just the url. If you move the \1 inside, subpattern 2 will contain the url plus the closing quote of the attribute. Of course, you could then just use new subpatterns inside the conditional branches, but then you wouldn't have the url always in the same subpattern, so you'd have to add extra code to check which subpattern contains the url.
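
The difference between the two variants is easy to check side by side. Again this is a sketch on Python's re module rather than PCRE, and the variable names and sample markup are made up.

```python
import re

# Variant with the closing quote matched outside the capturing group.
outside = re.compile(r"""href=(['"])?((?(1).+?|[^\s>]+))(?(1)\1)""")
# Variant with \1 moved into the conditional, inside group 2.
inside = re.compile(r"""href=(['"])?((?(1).+?\1|[^\s>]+))""")

# Made-up sample markup.
quoted = '<a href="page.html">x</a>'
unquoted = '<a href=page.html>x</a>'

outside_quoted = outside.search(quoted).group(2)    # url only
inside_quoted = inside.search(quoted).group(2)      # url plus closing quote
outside_unquoted = outside.search(unquoted).group(2)
inside_unquoted = inside.search(unquoted).group(2)

print(repr(outside_quoted), repr(inside_quoted))
print(repr(outside_unquoted), repr(inside_unquoted))
```

For unquoted values the two behave the same. Only the quoted case drags the closing quote into subpattern 2.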

