d_barszczak Posted October 2, 2007

I know what this function does, but I have no idea how. I just wondered if anyone can explain the code in red to me.

preg_match_all('/(?<=href=)[\'"]?([^\s"\'>]+)/', $html, $matches);

Rithiur Posted October 5, 2007

That red thing is a regular expression, which is an expression that can be used to match complex strings and gather information about them. Here is a description of how that particular regex works:

First, the / at the beginning and end are delimiters. They don't do anything but mark the beginning and end of the regex.

The first thing the regex does is look for a character that matches the [\'"] character class, which means it will find either a single or a double quote.

Next, the pattern evaluates the subpattern (?<=href=), which comes before the [\'"]. This is a positive lookbehind subpattern, whose purpose is to check that the text immediately before the matched string matches the lookbehind pattern. In this case, the regex makes sure that the single or double quote it found is preceded by the string "href=".

Next, it will evaluate the subpattern ([^\s"\'>]+), which pretty much consists of a single negated character class, [^\s"\'>]. The character class is negated because it starts with ^. This means that it will match anything except what's inside the character class. So, in this case it will match anything except whitespace (\s is a sequence which means any whitespace), a single quote, a double quote or the greater-than sign (the right end of the tag). The quantifier + after it means that it will match 1 or more characters. In the matches, this subpattern will be in match [1], because it is a capturing subpattern (unlike the lookbehind subpattern, which does not capture).

If you noticed, one thing I did not mention is the quantifier ? after the [\'"] character class. I left it out to simplify the explanation. This quantifier means 1 or 0 occurrences. In this particular regex the effect is that it will also look for anything that matches the subpattern ([^\s"\'>]+) and see if it is preceded by 'href='. (Not that I can guarantee it works like that. It might be that it also sees every single position between characters, which matches [\'"]? occurring 0 times, and checks whether it is preceded by 'href='.)

In practice, this regex will pretty much try to find all values of the href attribute in the document.

One thing to point out is that this regex is horribly inefficient in practice. This is mostly due to its lookbehind subpattern. The problem is that PCRE can't start by looking for the lookbehind pattern. It will start by looking for anything that the lookbehind should precede. This means that it will look for any ' or " as in the first character class, and also for any non-whitespace character other than > (because of the next character class), and see if it is preceded by 'href='. So basically the regex engine will try to go through almost every single character in the subject to see whether there is 'href=' before it. Not very efficient, mind you.

In addition, the regex has a few flaws, which exist because it was easier to write the regex that way. Here is my suggestion for a better regex (which is about 10 faster or so):

/href=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/

This will do the same job and fixes a few flaws in the pattern you have. By flaw I mean that an attribute's value should be anything inside the quotes if the value starts with a quote, whereas your regex ends at whitespace, >, ' or " regardless of whether the value started with a quote or not. Note that in this regex, the url will be in subpattern 2 rather than 1.

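To make the difference between the two patterns concrete, here is a minimal sketch. The $html sample string and the $better variable name are made up for this illustration and do not come from the thread; it only shows which values each pattern captures, not how fast either one runs.

```php
<?php
// Hypothetical sample markup, invented for this illustration:
// one quoted value containing a space, one unquoted value.
$html = '<a href="my page.html">A</a> <a href=plain.html>B</a>';

// Original pattern from the question. Group 1 stops at whitespace,
// a quote or >, even when the value started with a quote.
preg_match_all('/(?<=href=)[\'"]?([^\s"\'>]+)/', $html, $matches);
print_r($matches[1]); // values: 'my', 'plain.html'

// Suggested pattern. If group 1 captured an opening quote, the value runs
// up to the same closing quote; otherwise it runs up to whitespace or >.
// The value is in group 2.
preg_match_all('/href=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/', $html, $better);
print_r($better[2]); // values: 'my page.html', 'plain.html'
```

The quoted value with a space in it is the case where the two patterns disagree: the original cuts it off at the space, while the suggested one keeps everything up to the closing quote.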
d_barszczak Posted October 8, 2007

Wow, now I'm really confused. Thanks for a great explanation though. This regex stuff is really twisting my melon. Do you know of any good tutorials on the subject?

effigy Posted October 8, 2007

Check the Resources.

Nice clean-up, Rithiur. I would move the last portion into the conditional:

/href=([\'"])?((?(1).+?\1|[^\s>]+))/

Rithiur Posted October 11, 2007

"I would move the last portion into the conditional: /href=([\'"])?((?(1).+?\1|[^\s>]+))/"

I actually had it that way initially, but decided to put the last portion outside for the simple reason that, when you use my regexp, subpattern 2 will always contain the url. If you move the \1 inside, subpattern 2 will contain the url plus the closing delimiter of the attribute. Of course, you could then just use new subpatterns inside the conditional branches, but then the url wouldn't always be in the same subpattern, so you'd have to add extra code to check which subpattern contains the url.

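The trade-off described above can be checked with a short sketch. The sample strings and the variable names ($outside, $inside, $m, $url) are assumptions made up for this illustration, and the third pattern is only one possible way of writing "new subpatterns inside the conditional branches"; it is not taken from the thread.

```php
<?php
// Hypothetical one-link sample, invented for this illustration.
$html = '<a href="one.html">x</a>';

// Closing quote matched outside group 2 (Rithiur's version).
preg_match('/href=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/', $html, $outside);
echo $outside[2], "\n"; // one.html

// Closing quote matched inside the conditional (effigy's version),
// so it ends up inside group 2 as well.
preg_match('/href=([\'"])?((?(1).+?\1|[^\s>]+))/', $html, $inside);
echo $inside[2], "\n"; // one.html"

// Assumed variant with a separate capturing group in each branch:
// quoted values land in group 2, unquoted values in group 3,
// so the caller has to check which group holds the url.
$pattern = '/href=([\'"])?(?(1)(.+?)\1|([^\s>]+))/';
foreach (['<a href="one.html">x</a>', '<a href=two.html>y</a>'] as $tag) {
    preg_match($pattern, $tag, $m);
    $url = $m[3] ?? $m[2]; // group 3 is only present for unquoted values
    echo $url, "\n";
}
```

The last loop is the "extra code" the post refers to: with per-branch groups, the url has to be picked out of whichever group was filled.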