papaface Posted March 28, 2009

Hello, I am trying to extract a text link from a given string, however I am finding it rather difficult and I am getting no matches for some reason. My code is:

<?php
$string = "some random text http://tinyurl.com/dmugyw";
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result = $result[0];
}
}
do_reg($string, '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]');
?>

Can anyone explain why I am not getting a match? Any help would be appreciated.
killah Posted March 28, 2009

First of all, I got an unwanted } from your code. Second, I got "Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /home/_/public_html/_.php on line 5". Thirdly, I decided to go with a recode.

<?php
$string = 'Some random Text with http://url.com';
function find($where, $regex)
{
    preg_match_all($regex, $where, $result, PREG_PATTERN_ORDER);
    return ($result) ? 'Found' : 'Not Found';
}
echo find($string, '~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i');
?>

Try it out.
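The warning killah quotes comes from PCRE requiring every pattern to be wrapped in a pair of delimiter characters (such as ~ or /). A minimal illustration of the difference (the `\d+` pattern here is just an example for demonstration, not a pattern from this thread):

```php
<?php
// Without delimiters, PCRE rejects the pattern: preg_match() emits
// "Delimiter must not be alphanumeric or backslash" and returns false.
$bad = @preg_match('\d+', 'abc123', $m);
var_dump($bad); // bool(false)

// Wrapping the same pattern in delimiters (here ~ ~) makes it valid.
$good = preg_match('~\d+~', 'abc123', $m);
echo $m[0]; // 123
```

This is why papaface's original call found nothing: the pattern string started with `\b` rather than a delimiter, so the match never ran at all.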
papaface Posted March 28, 2009 Author

Thanks for that, it works, but how do I actually extract it? I need to get that URL and work with it, but I can't in your example.
killah Posted March 28, 2009

<?php
$string = 'Some random Text with <! http://url.com !>';
preg_match_all('~<! (.*?) !>~i', $string, $matches, PREG_PATTERN_ORDER);
echo $matches[1][0];
?>

That returns http://url.com. However, each url is going to need a <! & !>; I'll try to come up with something else later. Currently looking for a job online.
.josh Posted March 28, 2009

Quote from killah:
> That returns http://url.com. However, each url is going to need a <! & !>; I'll try to come up with something else later.

One would think if he had control over putting custom tags around the urls to be extracted, he wouldn't need to be regexing in the first place.
killah Posted March 28, 2009

All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it. Of course, for the TLDs you would need some kind of long array.
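A sketch of that wrapping idea (my own illustration, not code posted in the thread): rather than maintaining a long TLD array, a single preg_replace can wrap anything that looks like a URL, using a crude `\S+` heuristic for the rest of the address.

```php
<?php
// Wrap every URL-looking token in <! !> markers so a later pass can
// extract them with a simple ~<! (.*?) !>~ pattern.
// The scheme list and the \S+ "rest of URL" heuristic are assumptions
// for illustration, not a robust URL grammar.
$string = 'text http://url.com and ftp://files.example.net end';
$marked = preg_replace('~(?:https?|ftp|irc|file)://\S+~i', '<! $0 !>', $string);
echo $marked;
// text <! http://url.com !> and <! ftp://files.example.net !> end
```

As .josh points out below, though, once you can match the URL well enough to wrap it, you no longer need the markers: the matching pattern already extracts it.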
MadTechie Posted March 28, 2009

Quote from killah:
> All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it.

LOL... okay try this.. (read comments)

<?php
$string = "some random text http://tinyurl.com/123123 some random text http://tinyurl.com/787988";
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}
$regex = '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]';
// Your RegEx is missing some parts:
// the start and end delimiter characters.
// Also your RegEx is case-sensitive; add the i to make it insensitive.
$regex = '$' . $regex . '$i';
$A = do_reg($string, $regex);
foreach ($A as $B)
{
    echo "$B<BR>";
}
?>
nrg_alpha Posted March 28, 2009

killah, looking at your first sample, I think you misunderstand the usage of character classes, among other things..

'~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i'

When you encase something like http within a character class, [http], what this is in effect saying is that at the current position in the target string, the current character must be an h, a t, or a p. In other words, it must be one of those characters. Understand that a character class matches a single character from within those square brackets. So all those character classes you have will not look for the characters listed within them as a sequence within the target string. Instead, you should use capturing (or non-capturing) grouping brackets: ( )...

And on that note, you are using quite a few sets of parentheses (which become capturing brackets) when you only need one major set to capture the whole thing (but we can even avoid that!). You can use non-capturing sets to group things together, yet not do any sub-capturing, by using the (?: ) notation. In fact, when you use preg_match / preg_match_all, a third available argument is a variable that stores what is found within the target string using the pattern. Since the whole pattern must match in order for it all to pass, we don't even really need capturing parentheses (in this case anyway), as the entire pattern is stored as array element 0... (this will become clearer in my sample below).

You also use a simple dot between (.*) and ([com]...): (.*).([com]|[net]|[info]|[org]). Note that in this case, the dot should be escaped to be a literal dot; otherwise, it is treated as a wildcard, which will in turn accept any character (except a newline). I'll get to (.*) issues in a bit.

So to take your above example, keeping the functionality you have, it could be rewritten as such:

preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match);

Ok, so if this whole pattern matches something, the matched result is stored under $match[0]. You'll notice stuff like https? - the ? means optional (zero or one time) and generally applies directly to the single character preceding it. I say generally, as ? can also apply to a whole group of characters within parentheses, as in (abc)?, or to a character class, as in [abc]?. So in this case, the s in https is optional. This effectively covers your [http]|[https] part (minus the character class).

You'll notice that the entire first part is encased within (?: ):

(?:https?|irc|ftp|file)

This makes this section a non-capture; since we are looking for the entire thing collectively, if it's there, it will be stored under $match[0] anyway, so we don't need capturing parentheses. Afterwards, I encase www inside another non-capturing set and make it optional, which covers your ([www]|[]) part.

At this point in the pattern, yours used (.*). Note that it is typically not a good idea to use this (it is truly circumstantial, however). To read up on why, you can view this thread (make note of posts #11 and #14; the thread deals with .+, but the concept is pretty much the same for .*). Since I assume this URL is nested within a potentially large chunk of text, I go with .*? in this case, thus making it lazy (which makes it more accurate and saves time). Then finally, to match the extensions you have, I used yet another non-capturing group: (?:com|net|info|org).

Note that there are problems with these patterns in general, especially the extensions, as they will not find stuff like .asia or .co.uk, by example. To make matters worse, the URL extensions will be revised and expanded to include even more later on, some more than 2 to 4 characters long. So it is a moving target.

Sorry for the long post. The point was to point out the problems in your understanding of things (but you're trying, and that says a lot more than for many others, so kudos. Keep at it!)
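nrg_alpha's rewritten pattern can be tried directly. A small self-contained check (the sample string is my own, chosen to mirror killah's earlier examples):

```php
<?php
// The whole match lands in $match[0]; the (?: ) groups are
// non-capturing, so no sub-captures are produced.
$targetString = 'Some random Text with http://url.com in the middle';
if (preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match)) {
    echo $match[0]; // http://url.com
}
```

The lazy .*? stops at the first extension that completes the match, which is what keeps the result from running past the URL into the surrounding text.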
MadTechie Posted March 28, 2009

With a post like that, all I can say is nicely done nrg_alpha, and with that I'll jump on your bandwagon while shouting "Regex forever!"
nrg_alpha Posted March 28, 2009

lol haha, thanks.. and you deciphered my uber secret binary code I see!
.josh Posted March 28, 2009

Quote from killah:
> All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it.

My point is that if he had the ability to insert delimiters around target data, he wouldn't need to be regexing for it in the first place.
papaface Posted March 28, 2009 Author

Wow, I come back and see a massive wall of text lmao. I am using MadTechie's code as it seems to work pretty well for what I require, so thank you.

This is not related to regex, however; I wonder if someone could help me. As part of getting these URLs (some are tinyurls), I need to find out what they actually link to, i.e. I need the link tinyurl redirects the user to. Does anyone know of a way to do this?
MadTechie Posted March 28, 2009

cURL should do the trick..

<?php
<?php
$string = "some random text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988";
$regex = '$\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]$i';
preg_match_all($regex, $string, $result, PREG_PATTERN_ORDER);
$A = $result[0];
foreach ($A as $B)
{
    $URL = GetRealURL($B);
    echo "$URL<BR>";
}

function GetRealURL($url)
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => "",
        CURLOPT_USERAGENT      => "spider",
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_MAXREDIRS      => 10,
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err     = curl_errno($ch);
    $errmsg  = curl_error($ch);
    $header  = curl_getinfo($ch);
    curl_close($ch);
    return $header['url'];
}
?>

Please note that the returned URL could also be a redirected site, so you could create a recursive function - but it depends on how far you want to go! Also, my results are as follows:

from:
http://tinyurl.com/9uxdwc
http://google.com
http://tinyurl.com/787988

to:
http://wikileaks.org/wiki/Denmark:_3863_sites_on_censorship_list%2C_Feb_2008 => correct
http://www.google.co.uk/ => yet i'm in the UK
http://tinyurl.com/787988 => is error page but still a valid URL!

EDIT: reposted due to some bad parse
killah Posted March 29, 2009

I tried your code. You doubled the <?php at the top; with that duplicate line removed, the same code runs fine.

Uhm, however, I am in South Africa, and it still shows google.co.uk when not supposed to. I am sure that's not a big problem at all. I am fairly new to regexing, so excuse my bad regexing. Good job MadTechie.
MadTechie Posted March 29, 2009

Oops about the double <?php. One thing to remember is that Google redirects by reading the location of the client, so where is your server based, or are you using WAMP/XAMPP etc.?

Also, if this is complete, can you click solved please.
papaface Posted March 29, 2009 Author

Thanks for all your input into this. You've all been very helpful. Marking as solved.