Regex Question


glenelkins

Hi

 

I still can't get my head totally around REGEX.

 

Let's say I have a string like this: <a href='somesite.com'>Some Site 1</a> <a href="somesite.com">Some Site 2</a> <a href="somesite.com">Some Site 3</a>

 

Using preg_match, how can I pull out all the URLs between the ' ' in the link code, so that I end up with an array like 0 => somesite.com 1 => somesite.com 2 => somesite.com?

 

thanks


There's no way someone can teach you RegEx in a forum post, but there are a lot of tutorials out there. That said, I consider RegEx one of the more abstract areas of programming, and I struggle with it on occasion, so I understand your difficulty.

 

Here is a pattern to get the results you asked about above. I noticed that you had some URLs in single quotes and some in double quotes. This will work on both:

 

$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";

preg_match_all("/<a href=[\'\"]([^\'\"]*)/", $text, $matches);

print_r($matches[1]);

 

Output:

Array
(
    [0] => somesite1.com
    [1] => somesite2.com
    [2] => somesite3.com
)

 

This might prove more useful, though, since it gets both the URLs and the text descriptions for the links:

<?php

$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a> <a href=\"somesite3.com\">Some Site 3</a>";

preg_match_all("/<a href=[\'\"]([^\'\"]*)[^>]*([^<]*)/", $text, $matches);

print_r($matches[1]);
print_r($matches[2]);

?>

 

Output:

Array
(
    [0] => somesite1.com
    [1] => somesite2.com
    [2] => somesite3.com
)
Array
(
    [0] => >Some Site 1
    [1] => >Some Site 2
    [2] => >Some Site 3
)


Here is an explanation of the RegEx pattern: "/<a href=[\'\"]([^\'\"]*)[^>]*([^<]*)/"

 

The pattern first looks for the literal text "<a href="

 

Then it looks for a single or double quote mark. Note that the double quote must be escaped with a backslash so that it does not end the PHP string holding the pattern; escaping the single quote as well is harmless: [\'\"]

 

Then it uses parentheses to mark the matching text that should be returned. Inside them it matches any character that is NOT a single or double quote (the ^ at the start of the square brackets means "not"). The asterisk tells it to match as many such characters as possible, so it will get all of the text up until the next quote mark: ([^\'\"]*)

 

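Next it matches every character up to, but not including, the closing bracket of the opening A tag: "[^>]*". Because the ">" itself is not consumed here, it ends up at the start of the second capture, which is why the link texts in the output above begin with ">".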

 

Lastly, it captures every character that is NOT a '<' which would presumably be the opening for the closing A tag: "([^<]*)"
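
One possible refinement, offered as a sketch rather than as the poster's original code: putting a literal ">" after "[^>]*" consumes the closing bracket of the opening tag, so the second group should hold only the link text. The shortened sample text and variable names here are just for illustration.

```php
<?php
// Sample text modelled on the one used in the posts above.
$text = "<a href='somesite1.com'>Some Site 1</a> <a href=\"somesite2.com\">Some Site 2</a>";

// The literal ">" after [^>]* consumes the end of the opening tag,
// so group 2 starts at the link text instead of at ">".
preg_match_all("/<a href=['\"]([^'\"]*)[^>]*>([^<]*)/", $text, $matches);

print_r($matches[1]); // URLs
print_r($matches[2]); // link text without the leading ">"
```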


Apologies. I don't need an expression for both single and double quotes; that was a typo. I just need it for single quotes.

 

No matter how complex a program I can write, Regex is still one thing that I just cannot totally get my head around. It's a good technology, but it could have been put together better than it is, in my opinion.


Also, sorry to be a pest, but why does the following not work? To me it looks like it's doing a similar thing: looking for a single quote, then returning what's between them.

 

preg_match("%'([.*]*)'%", $_line, $_matches);


could you just explain why you escape the single quotes? obviously the double ones need to be escaped here, but the single ones?

 

Some characters should always be escaped and some only need to be escaped sometimes. But it doesn't cause a problem to escape a character when it doesn't need to be. So instead of having to put any thought into it, I just escape some characters by default. It saves time, and it makes the code more regression-proof. What if some yahoo was modifying the code later and decided to define the pattern with single quotes rather than double quotes? My pattern would keep on working, whereas if the single quotes were not escaped it would break.
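
As a rough illustration of that point (a sketch, with variable names invented for the example): the same pattern text can be placed in a double-quoted or a single-quoted PHP string, and both should capture the same URL because every quote inside it is escaped.

```php
<?php
$text = "<a href='somesite1.com'>Some Site 1</a>";

// Double-quoted PHP string: \" is required here, \' is optional.
$double = "/<a href=[\'\"]([^\'\"]*)/";

// Single-quoted PHP string: \' is required here, \" is optional.
$single = '/<a href=[\'\"]([^\'\"]*)/';

preg_match($double, $text, $a);
preg_match($single, $text, $b);

var_dump($a[1] === $b[1]); // both variants capture the same URL
```

PHP passes a backslash it does not need through to the regex engine, which then treats the escaped quote as a plain quote, so the extra escaping does no harm in either form.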

 

Also, sorry to be a pest, but why does the following not work? To me it looks like it's doing a similar thing: looking for a single quote, then returning what's between them.

 

preg_match("%'([.*]*)'%", $_line, $_matches);

 

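How is that similar? Inside square brackets the . and * are literal characters, so [.*]* only matches runs of dots and asterisks, not "any characters". It *looks* like you are trying to match all characters between two single quotes. For that, this would be the correct pattern: "%'(.*)'%"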
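
A small sketch of why the bracketed version fails, using made-up input strings: "[.*]*" only accepts dots and asterisks between the quotes.

```php
<?php
// Inside [...], "." and "*" are literal characters, not wildcards.
$pattern = "%'([.*]*)'%";

// No match: the text between the quotes contains letters.
$found = preg_match($pattern, "href='somesite.com'", $m);
var_dump($found); // int(0)

// Match: the text between the quotes is only dots and asterisks.
$foundDots = preg_match($pattern, "href='..*'", $m2);
var_dump($foundDots); // int(1)
```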

 

However, "%'(.*)'%" will NOT work for what you want. First, it doesn't care where those quotes appear. Second, it would not find the text in double quotes, which I assume you want as well. Third, and most importantly, it matches from the first single quote to the LAST single quote. Also, preg_match() stops after the first match instead of finding all matches; preg_match_all() is needed for that.

 

If this was your text:

<a href='somesite1.com'>Some Site 1</a> <a href='somesite2.com'>Some Site 2</a>

 

The regex above would return this:

somesite1.com'>Some Site 1</a> <a href='somesite2.com

 

That happens because the * quantifier is "greedy": it matches as many characters as possible, so the match runs on to the last single quote. You could make it non-greedy by adding a ? after the *: "%'(.*?)'%"

 

But my understanding is that this method is not efficient, and that you should use the method I described above of matching all characters that do not match the ending character: "%'([^']*)'%"
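
To compare the three approaches side by side, here is a minimal sketch using the two-link sample text from this post and preg_match_all(). The array names are arbitrary, and the negated-class pattern includes the closing quote, which matters once every match is collected rather than just the first.

```php
<?php
$text = "<a href='somesite1.com'>Some Site 1</a> <a href='somesite2.com'>Some Site 2</a>";

// Greedy: .* runs from the first single quote to the last one.
preg_match_all("%'(.*)'%", $text, $greedy);

// Lazy: .*? stops at the nearest closing quote.
preg_match_all("%'(.*?)'%", $text, $lazy);

// Negated class: [^']* cannot cross a quote in the first place.
preg_match_all("%'([^']*)'%", $text, $negated);

print_r($greedy[1]);  // one long match spanning both links
print_r($lazy[1]);    // the two URLs
print_r($negated[1]); // the two URLs
```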
