[SOLVED] Regular Expression Trouble

carrotcake1029 · December 16, 2008

Hello all!

I am having an issue with a regular expression I am using for preg_match_all(). What it does is look at whatever data I throw at it and return any links it finds in an array. Well, for the most part, it is doing its job, but it's grabbing a little too much. All the links it returns look like this:

http://www.google.com<br

So obviously it is grabbing a little too much and I can't see how to fix it. Can you guys let me know what you think?

$regex = '/https?\:\/\/[^\" ]+/i';

Edit: Sorry, I didn't see until now you had a whole regex subforum. You can move this if you would like. Sorry for any hassle.

effigy · December 16, 2008

%https?://[^\"\s>]+%i

Will the URLs always be double quoted?

carrotcake1029 · December 16, 2008

I am unsure what you mean by that, sorry.

What I am doing is looping through a mysql database and finding links from all the entries.

I also discovered that if any tag is behind it, it always seems to get merged with it, such as </a

Edit: I went to regexlib.com and found that this one is supposed to extract URLs, but I can't modify it to be used in PHP. (I am not very good at regex.)

(?<http>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)

effigy · December 16, 2008

What is the format of these entries? HTML? Prose? Anything?

carrotcake1029 · December 16, 2008

HTML

effigy · December 16, 2008

Do you want to pull URLs from tags, content, or both?

carrotcake1029 · December 16, 2008

Just the content.

effigy · December 16, 2008

How about something like this?

<pre>
<?php
$html = <<<HTML
<a href="http://www.phpfreaks.com">PHP Freaks</a>
<a href="http://www.google.com/index.html">Visit http://www.google.com!</a>
HTML;
preg_match('%https?://\S+(?<!\p{P})%i', strip_tags($html), $matches);
print_r($matches);
?>
</pre>

carrotcake1029 · December 16, 2008

Well, that got rid of the tags, but I am still getting extra data. Now after the link if there was some text it gets appended. Like if the post looked like this:

http://www.google.com
Go there for a cool search engine!

it returns

http://www.google.comGo

effigy · December 16, 2008

That data works in the example code:

<pre>
<?php
$html = <<<HTML
http://www.google.com
Go there for a cool search engine!
HTML;
preg_match('%https?://\S+(?<!\p{P})%i', strip_tags($html), $matches);
print_r($matches);
?>
</pre>

What else is happening in your code?

carrotcake1029 · December 17, 2008

Well, it's coming more in the form of this:

$html = "http://www.google.com<br>Go there for a cool search engine!";

.josh · December 17, 2008

Will there always be a <br> after the link? Will the link always be at the beginning of the string? In order to accurately extract it from the string, a pattern has to be established. A pattern, of course, being something that happens on a regular, predictable basis. It's not really going to be possible to accurately pull a URL out of a string if it's just randomly placed amongst other stuff...

carrotcake1029 · December 17, 2008

I think that for my purposes, either a or will be following most of the time.

nrg_alpha · December 17, 2008

Not sure if I understand this correctly, but would this work?

$str = <<<DATA
http://www.google.com
Go there for a cool search engine!
DATA;
preg_match_all('#(https?://[.\w/-]+)#s', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

Output:

Array
(
    [0] => http://www.google.com
)

EDIT - by my calculations, it shouldn't matter whether there is a trailing <br> afterwards or not with the above pattern. I am using preg_match_all in case what you are plugging into the pattern contains multiple URLs.

.josh · December 17, 2008

Okay, well, if it's gonna be at the beginning of the string and a <br /> is there "most" of the time, then you can do this:

$html = "http://www.google.com<br />Go there for a cool search engine!";
preg_match("/(.*?)<br.*?>/",$html,$matches);
print_r($matches);

carrotcake1029 · December 17, 2008

Well, my string doesn't always begin with the link. Is there a way you can modify it to find the link as well? My first post contained a regex that found all the links.

nrg_alpha · December 17, 2008

My pattern does not work for what you are looking for?

carrotcake1029 · December 17, 2008

Nope, I checked.

nrg_alpha · December 17, 2008

Nope, I checked.

Really? Because when I test this:

$str = "http://www.google.com<br />Go there for a cool search engine!";
preg_match_all('#(https?://[.\w/-]+)#s', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

It reports back what you seek (in the form of an array element of course).

carrotcake1029 · December 17, 2008

Yes, you are right, but for some reason, it is still not working for me. Here is some info from the MySQL table I am reading from:

Field  Type        Collation          Null  Default
post   mediumtext  latin1_swedish_ci  Yes   NULL

I don't know what else to tell you.

carrotcake1029 · December 17, 2008

Sorry for double post, but I could not edit.

I think I know what I need. I just need a regex that matches from http:// at the beginning to <br.*?> at the end.

nrg_alpha · December 17, 2008

No, you don't need 'yet another regex solution' as you already have an adequate solution offered to you.

The problem here (it seems) is not knowing how to load your MySQL table into an array, which in turn passes through one of the solutions offered here (if you have managed that far, it wouldn't be hard to implement a solution offered in this thread to quickly hammer out the urls).

This is why, when people respond with something like 'nope.. I checked', it tells us absolutely nothing! Perhaps you should reveal your entire block of MySQL code (hide your SQL password and username though) as well as how you integrated one of the solutions offered here, so that others can see the bigger picture and pinpoint where you are going wrong (a small sample of what is stored in your MySQL database might also help in troubleshooting this matter). Without knowing more of what's happening, it is basically 'shooting in the dark'. I for one am not knowledgeable in databases, so unfortunately I cannot help you there. But rest assured, you have enough viable regex solutions here that actually do what you are seeking.. now it is a matter of properly connecting to the database, pulling everything into an array, and then passing that array through one of the regex patterns in this thread.
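
As a rough sketch of that last step only, assuming the posts have already been fetched into a plain PHP array: the $posts sample data and the extract_urls() helper below are hypothetical names made up for illustration, not anything from the original code, and the database part is left out entirely.

```php
<?php
// Hypothetical sample data standing in for rows fetched from the `post` column.
$posts = [
    'http://www.google.com<br />Go there for a cool search engine!',
    'No links in this one.',
    'Try https://www.phpfreaks.com/forums<br>or http://example.com/a-b/c.html</a>',
];

/**
 * Collect every URL found in a list of post bodies.
 * extract_urls() is an illustrative helper, not a built-in PHP function.
 */
function extract_urls(array $posts): array
{
    $urls = [];
    foreach ($posts as $post) {
        // The column allows NULL, so skip anything that is not a string.
        if (!is_string($post)) {
            continue;
        }
        // The character class has no '<', so matching stops before a trailing tag.
        if (preg_match_all('#https?://[.\w/-]+#', $post, $matches)) {
            $urls = array_merge($urls, $matches[0]);
        }
    }
    return $urls;
}

print_r(extract_urls($posts));
```

Note that this character class is deliberately narrow: it will cut off query strings (anything from a ? onwards) and will keep a trailing period if a URL ends a sentence, so it may need widening or trimming for real data.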

effigy · December 17, 2008

<pre>
<?php
   $html = 'http://www.google.com<br>Go there for a cool search engine!';
   ### Similar to strip_tags, but replace with a space.
   $html = preg_replace('/<[^>]*>/', ' ', $html);
   preg_match('%https?://\S+(?<!\p{P})%i', $html, $matches);
   print_r($matches);
?>
</pre>
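
The likely reason the earlier strip_tags() version returned http://www.google.comGo is that strip_tags() deletes the <br> without leaving any whitespace behind, so \S+ runs straight on into the next word. A minimal sketch comparing the two approaches follows; it uses preg_match_all instead of preg_match so that more than one URL comes back, and the sample string and the $glued / $clean variable names are made up for illustration.

```php
<?php
$html = 'http://www.google.com<br>Go there! Also see https://www.phpfreaks.com.';
$pattern = '%https?://\S+(?<!\p{P})%i';

// strip_tags() removes the tag outright, gluing "com" and "Go" together.
preg_match_all($pattern, strip_tags($html), $glued);

// Replacing each tag with a space keeps the URL separated from the text after it.
$spaced = preg_replace('/<[^>]*>/', ' ', $html);
preg_match_all($pattern, $spaced, $clean);

print_r($glued[0]); // first entry is http://www.google.comGo
print_r($clean[0]); // first entry is http://www.google.com
```

In both cases the lookbehind (?<!\p{P}) drops the sentence-ending period from the second URL, so only the handling of the tag differs between the two results.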
