Regex Question

glenelkins · March 15, 2010

Hi

I still can't get my head totally around REGEX.

Let's say I have a string like this: <a href='somesite.com'>Some Site 1</a> <a href="somesite.com">Some Site 2</a> <a href="somesite.com">Some Site 3</a>

Using preg_match, how can I pull out all the URLs between the quotes in the link code, so that I end up with an array like 0 => somesite.com 1 => somesite.com 2 => somesite.com?

thanks

Psycho · March 15, 2010

There's no way someone can teach you RegEx in a forum post. There are a lot of tutorials out there. Although, I consider RegEx to be one of the more abstract areas of programming and I struggle with it on occasion. So I understand your difficulty.

Here is a pattern to get the results you asked about above. I noticed that you had some URLs in single quotes and some in double quotes. This will work on both:

$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";

preg_match_all("/<a href=[\'\"]([^\'\"]*)/", $text, $matches);

print_r($matches[1]);

Output:

Array
(
    [0] => somesite1.com
    [1] => somesite2.com
    [2] => somesite3.com
)

Although, this might prove more useful as it will get the URLs and the text descriptions for the links:

<?php

$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";

preg_match_all("/<a href=[\'\"]([^\'\"]*)[^>]*([^<]*)/", $text, $matches);

print_r($matches[1]);
print_r($matches[2]);

?>

Output:

Array
(
    [0] => somesite1.com
    [1] => somesite2.com
    [2] => somesite3.com
)
Array
(
    [0] => >Some Site 1
    [1] => >Some Site 2
    [2] => >Some Site 3
)

Psycho · March 15, 2010

Here is an explanation of the RegEx pattern: "/<a href=[\'\"]([^\'\"]*)[^>]*([^<]*)/"

The pattern first looks for any string starting with "<a href="

Then it looks for a single or double quote mark. Note, the quote marks need to be escaped using a backslash to indicate it is not the end of the pattern: [\'\"]

Then it uses parentheses to mark matching text that should be returned. And it matches any character that is NOT a single or double quote (the ^ indicates "not" within the square brackets). The asterisk tells it to match as many characters as possible, so it will get all of the text up until the next quote mark: ([^\'\"]*)

Next it matches every character up until the closing bracket for the opening A tag: "[^>]*"

Lastly, it captures every character that is NOT a '<' which would presumably be the opening for the closing A tag: "([^<]*)"
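As the second output above shows, the link text comes back with a leading ">". That happens because [^>]* stops just before the closing bracket, and the second capture group then picks the bracket up. Here is a minimal sketch of one way around it, assuming it is acceptable to require a literal > between the two parts. The variable names $urls and $labels are only for illustration.

```php
<?php
// Same sample text as in the posts above.
$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";

// The literal > after [^>]* consumes the closing bracket of the opening
// tag before the second capture group starts, so the link text is clean.
preg_match_all("/<a href=[\'\"]([^\'\"]*)[^>]*>([^<]*)/", $text, $matches);

$urls   = $matches[1]; // captured href values
$labels = $matches[2]; // captured link text, without the leading >

print_r($urls);
print_r($labels);
```

Like the original pattern, this still assumes href is the first attribute and that the link text contains no nested tags.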

glenelkins · March 15, 2010

Apologies. I don't need an expression for both single and double quotes, that was a typo. I just need it for single quotes.

No matter how complex a program I can write, Regex is still one thing that I just cannot totally get my head around. It's a good technology, but it could have been put together better than it is, in my opinion.

glenelkins · March 15, 2010

Just reading your explanation. That's very strange, because before I posted on here I actually wrote an expression very similar to this and it didn't return the correct results!

glenelkins · March 15, 2010

Could you just explain why you escape the single quotes? Obviously the double ones need to be escaped here, but the single ones?

glenelkins · March 15, 2010

Also, sorry to be a pest, but why does the following not work? To me it looks like it's doing a similar thing: looking for a single quote, then returning what's between them.

preg_match("%'([.*]*)'%", $_line, $_matches);

Psycho · March 15, 2010

could you just explain why you escape the single quotes? obviously the double ones need to be escaped here, but the single ones?

Some characters should always be escaped and some only need to be escaped sometimes. But escaping a punctuation character that doesn't need it doesn't cause a problem. So instead of having to put any thought into it, I just escape some characters by default. It saves time, and it makes the code more regression-proof. What if some yahoo modified the code later and decided to define the pattern with single quotes rather than double quotes? My pattern would keep on working, whereas if the single quotes were not escaped it would break.
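To illustrate the point about the pattern surviving a change of string delimiters, here is a small sketch. It assumes a one-link sample string, and the variable names ($double, $single, $fromDouble, $fromSingle) are made up for the example.

```php
<?php
$text = "<a href='somesite1.com'>Some Site 1</a>";

// Double-quoted PHP string: \' is not a PHP escape sequence, so the
// backslash reaches the regex engine, which reads \' as a literal quote.
$double = "/<a href=[\'\"]([^\'\"]*)/";

// Single-quoted PHP string: here \' is what keeps PHP from ending the
// string early, while \" is passed through untouched and the regex
// engine reads it as a literal double quote.
$single = '/<a href=[\'\"]([^\'\"]*)/';

preg_match($double, $text, $fromDouble);
preg_match($single, $text, $fromSingle);

// Both patterns capture the same URL.
var_dump($fromDouble[1] === $fromSingle[1]);
```

If the single quotes were left unescaped, the single-quoted version would end the PHP string at the first quote inside the character class and fail to parse.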

also sorry to be a pest, but why does the following not work, to me it looks like its doing a similar thing, looking for a single quote, then returning whats between them
preg_match("%'([.*]*)'%", $_line, $_matches);

How is that similar? Inside square brackets the . and * are literal characters, so [.*]* only matches a run of dots and asterisks. It *looks* like you are trying to match all characters between two single quotes, in which case this would be the correct pattern: "%'(.*)'%"

However, that will NOT work for what you want. First, it doesn't care where those quotes appear. Second, it would not find the text in double quotes, which I assume you also want. Third, and most importantly, it matches from the first single quote to the LAST single quote. Also, preg_match() stops after the first match instead of finding all matches.

If this was your text:

<a href='somesite1.com'>Some Site 1</a> <a href='somesite2.com'>Some Site 2</a>

The pattern "%'(.*)'%" would capture this:

somesite1.com'>Some Site 1</a> <a href='somesite2.com

Because the * quantifier is "greedy": it matches as many characters as it can, so it runs all the way to the last single quote. You could make it non-greedy by adding a ? after the *: "%'(.*?)'%"

But my understanding is that the non-greedy method is less efficient, and that you should use the method I described above of matching all characters that are not the ending character: "%'([^']*)%"
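The three variants can be compared side by side with a quick sketch like the one below. It uses preg_match_all() rather than preg_match() so that every match is collected. One change from the negated-class pattern quoted above is an assumption of this sketch: it adds the closing quote, because without it preg_match_all() would treat each closing quote as the start of a new match. The variable names are only for illustration.

```php
<?php
$text = "<a href='somesite1.com'>Some Site 1</a> <a href='somesite2.com'>Some Site 2</a>";

// Greedy: .* runs to the LAST single quote, giving one long match.
preg_match_all("%'(.*)'%", $text, $greedy);

// Non-greedy: .*? stops at the first closing quote it can reach.
preg_match_all("%'(.*?)'%", $text, $lazy);

// Negated class: [^']* can never run past a single quote.
preg_match_all("%'([^']*)'%", $text, $negated);

print_r($greedy[1]);  // one element spanning both links
print_r($lazy[1]);    // one element per link
print_r($negated[1]); // one element per link
```

On this input the non-greedy and negated-class patterns return the same two URLs. The sketch does not measure which of the two is faster.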
