glenelkins Posted March 15, 2010

Hi, I still can't get my head totally around regex. Let's say I have a string like this:

    <a href='somesite.com'>Some Site 1</a> <a href="somesite.com">Some Site 2</a> <a href="somesite.com">Some Site 3</a>

Using preg_match, how can I pull out all the URLs between the ' ' in the link code, so I end up with an array like this?

    0 => somesite.com
    1 => somesite.com
    2 => somesite.com

Thanks
Psycho Posted March 15, 2010

There's no way someone can teach you RegEx in a forum post, but there are a lot of tutorials out there. I consider RegEx to be one of the more abstract areas of programming, and I struggle with it on occasion, so I understand your difficulty.

Here is a pattern to get the results you asked about above. I noticed that you had some URLs in single quotes and some in double quotes. This will work on both:

    $text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";
    preg_match_all("/<a href=[\'\"]([^\'\"]*)/", $text, $matches);
    print_r($matches[1]);

Output:

    Array
    (
        [0] => somesite1.com
        [1] => somesite2.com
        [2] => somesite3.com
    )

This might prove more useful, though, as it gets both the URLs and the text descriptions for the links:

    <?php
    $text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";
    // The literal > after [^>]* consumes the end of the opening tag,
    // so the second group captures only the link text.
    preg_match_all("/<a href=[\'\"]([^\'\"]*)[^>]*>([^<]*)/", $text, $matches);
    print_r($matches[1]);
    print_r($matches[2]);
    ?>

Output:

    Array
    (
        [0] => somesite1.com
        [1] => somesite2.com
        [2] => somesite3.com
    )
    Array
    (
        [0] => Some Site 1
        [1] => Some Site 2
        [2] => Some Site 3
    )

Psycho Posted March 15, 2010

Here is an explanation of the RegEx pattern:

    "/<a href=[\'\"]([^\'\"]*)[^>]*>([^<]*)/"

The pattern first looks for any string starting with "<a href=".

Then it looks for a single or double quote mark. Note that the quote marks are escaped with a backslash so that they are not taken as the end of the pattern string:

    [\'\"]

Then it uses parentheses to mark the matching text that should be returned, and it matches any character that is NOT a single or double quote (the ^ at the start of the square brackets means "not"). The asterisk tells it to match as many characters as possible, so it will get all of the text up until the next quote mark:

    ([^\'\"]*)

Next it matches every character up until the closing bracket of the opening A tag, and then the closing bracket itself:

    [^>]*>

Lastly, it captures every character that is NOT a '<', which would presumably be the start of the closing A tag:

    ([^<]*)
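For anyone who wants each URL paired with its link text, here is a minimal sketch built on the pattern explained above. It assumes the same sample input as the thread; the `~` delimiter, the `$links` array, and its `url`/`text` keys are choices made for this illustration, not something from the posts. Note the literal `>` before the second group, which keeps the captured text from starting with `>`.

```php
<?php
// Sample input in the same shape as the thread's example (mixed quote styles).
$text = "<a href='somesite1.com'>Some Site 1</a> "
      . "<a href=\"somesite2.com\">Some Site 2</a> "
      . "<a href=\"somesite3.com\">Some Site 3</a>";

// Group 1: everything between the quotes after href=.
// [^>]*> skips the rest of the opening tag, including the closing bracket.
// Group 2: the link text up to the next <.
$pattern = "~<a href=['\"]([^'\"]*)[^>]*>([^<]*)~";

// PREG_SET_ORDER returns one array per match instead of one array per group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[0] is the whole match, $m[1] the URL, $m[2] the link text.
    $links[] = ['url' => $m[1], 'text' => $m[2]];
}

print_r($links);
```

With PREG_SET_ORDER there is no need to walk two parallel arrays by index, which tends to be easier to read once more than one group is captured.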
glenelkins (Author) Posted March 15, 2010

Apologies, I don't need an expression for both single and double quotes; that was a typo. I just need it for single quotes. No matter how complex a program I can write, regex is still one thing that I just cannot totally get my head around. It's a good technology, but it could have been put together better than it is, in my opinion.
glenelkins (Author) Posted March 15, 2010

Just reading your explanation. That's very strange, because before I posted on here I actually wrote an expression very similar to this and it didn't return the correct results!
glenelkins (Author) Posted March 15, 2010

Could you just explain why you escape the single quotes? Obviously the double ones need to be escaped here, but the single ones?
glenelkins (Author) Posted March 15, 2010

Also, sorry to be a pest, but why does the following not work? To me it looks like it's doing a similar thing: looking for a single quote, then returning what's between them.

    preg_match("%'([.*]*)'%", $_line, $_matches);
Psycho Posted March 15, 2010

> could you just explain why you escape the single quotes? obviously the double ones need to be escaped here, but the single ones?

Some characters should always be escaped and some only need to be escaped sometimes. But it doesn't cause a problem to escape a punctuation character like a quote when it doesn't need to be. So instead of having to put any thought into it, I just escape some characters by default. It saves time, plus it makes the code more regression-proof. What if some yahoo was modifying the code later and decided to define the pattern with single quotes rather than double quotes? My pattern would keep on working, whereas if the single quotes were not escaped it would break.

> also sorry to be a pest, but why does the following not work, to me it looks like its doing a similar thing, looking for a single quote, then returning whats between them
> preg_match("%'([.*]*)'%", $_line, $_matches);

How is that similar? It *looks* like you are trying to match all characters between two single quotes, in which case this would be the correct pattern:

    "%'(.*)'%"

Inside square brackets, the . and * are literal characters, so [.*]* only matches a run of dots and asterisks.

However, the corrected pattern will NOT work for what you want either. First, it doesn't care where those quotes exist. Second, it would not find the text in double quotes, which I assumed you wanted also. Third, and most importantly, it finds the text from the first single quote to the LAST single quote. Also, preg_match() stops after the first match instead of finding all matches; use preg_match_all() for that. If this was your text:

    <a href='somesite1.com'>Some Site 1</a> <a href='somesite2.com'>Some Site 2</a>

the regex above would return this:

    somesite1.com'>Some Site 1</a> <a href='somesite2.com

That is because the * quantifier is "greedy": it matches as many characters as it can, all the way to the last possible closing quote. You could make it non-greedy by adding a ? after the *:

    "%'(.*?)'%"

But my understanding is that this method is not efficient, and you should use the method I described above of matching all characters that do not match the ending character:

    "%'([^']*)'%"
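The differences between these patterns are easy to check side by side. The following is a minimal sketch using the two-link sample text from the post above; the variable names are made up for the illustration. The negated-class pattern is written with its closing quote so that preg_match_all() does not start a new match at that quote.

```php
<?php
$text = "<a href='somesite1.com'>Some Site 1</a> "
      . "<a href='somesite2.com'>Some Site 2</a>";

// Inside square brackets . and * are literal, so [.*]* only matches runs
// of dots and asterisks. Nothing between the quotes here qualifies,
// so preg_match() finds no match and returns 0.
$literal = preg_match("%'([.*]*)'%", $text, $unused);

// Greedy: .* runs on to the LAST single quote in the string.
preg_match("%'(.*)'%", $text, $greedy);

// Lazy: .*? stops at the first closing quote that lets the pattern match.
preg_match_all("%'(.*?)'%", $text, $lazy);

// Negated class: match only characters that are not a single quote.
preg_match_all("%'([^']*)'%", $text, $negated);

var_dump($literal);   // int(0)
echo $greedy[1], "\n"; // somesite1.com'>Some Site 1</a> <a href='somesite2.com
print_r($lazy[1]);     // somesite1.com, somesite2.com
print_r($negated[1]);  // somesite1.com, somesite2.com
```

On input this small the lazy and negated-class versions return the same thing; the negated class simply states the stopping condition directly instead of relying on backtracking.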