papaface Posted March 28, 2009

Hello, I am trying to extract a text link from a given string, however I am finding it rather difficult and I am getting no matches for some reason. My code is:

<?php
$string = "some random text http://tinyurl.com/dmugyw";
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result = $result[0];
}
}
do_reg($string, '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]');
?>

Can anyone explain why I am not getting a match? Any help would be appreciated.
killah Posted March 28, 2009

First of all, I got an unwanted } from your code. Second, I got "Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /home/_/public_html/_.php on line 5". Thirdly, I decided to go with a recode.

<?php
$string = 'Some random Text with http://url.com';
function find($where, $regex)
{
    preg_match_all($regex, $where, $result, PREG_PATTERN_ORDER);
    return ($result) ? 'Found' : 'Not Found';
}
echo find($string, '~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i');
?>

Try it out.
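The warning killah quotes comes from PCRE requiring every pattern to be wrapped in a pair of delimiter characters (such as ~ or /). A minimal illustration of the difference (the `\d+` pattern here is just an example for demonstration, not a pattern from this thread):

```php
<?php
// Without delimiters, PCRE rejects the pattern: preg_match() emits
// "Delimiter must not be alphanumeric or backslash" and returns false.
$bad = @preg_match('\d+', 'abc123', $m);
var_dump($bad); // bool(false)

// Wrapping the same pattern in delimiters (here ~ ~) makes it valid.
$good = preg_match('~\d+~', 'abc123', $m);
echo $m[0]; // 123
```

This is why papaface's original call found nothing: the pattern string started with `\b` rather than a delimiter, so the match never ran at all.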
papaface Posted March 28, 2009 Author

Thanks for that, it works, but how do I actually extract it? I need to get that URL and work with it, but I can't in your example.
killah Posted March 28, 2009

<?php
$string = 'Some random Text with <! http://url.com !>';
preg_match_all('~<! (.*?) !>~i', $string, $matches, PREG_PATTERN_ORDER);
echo $matches[1][0];
?>

That returns http://url.com. However, each url is going to need a <! & !>; I'll try to come up with something else later. Currently looking for a job online.
.josh Posted March 28, 2009

Quote from killah:
> That returns http://url.com. However, each url is going to need a <! & !>; I'll try to come up with something else later.

One would think if he had control over putting custom tags around the urls to be extracted, he wouldn't need to be regexing in the first place.
killah Posted March 28, 2009

All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it. Of course, for the TLDs you would need some kind of long array.
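A sketch of that wrapping idea (my own illustration, not code posted in the thread): rather than maintaining a long TLD array, a single preg_replace can wrap anything that looks like a URL, using a crude `\S+` heuristic for the rest of the address.

```php
<?php
// Wrap every URL-looking token in <! !> markers so a later pass can
// extract them with a simple ~<! (.*?) !>~ pattern.
// The scheme list and the \S+ "rest of URL" heuristic are assumptions
// for illustration, not a robust URL grammar.
$string = 'text http://url.com and ftp://files.example.net end';
$marked = preg_replace('~(?:https?|ftp|irc|file)://\S+~i', '<! $0 !>', $string);
echo $marked;
// text <! http://url.com !> and <! ftp://files.example.net !> end
```

As .josh points out below, though, once you can match the URL well enough to wrap it, you no longer need the markers: the matching pattern already extracts it.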
MadTechie Posted March 28, 2009

Quote from killah:
> All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it.

LOL... okay try this.. (read comments)

<?php
$string = "some random text http://tinyurl.com/123123 some random text http://tinyurl.com/787988";
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}
$regex = '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]';
// Your RegEx is missing some parts:
// the start and end delimiter characters.
// Also your RegEx is case-sensitive; add the i to make it insensitive.
$regex = '$' . $regex . '$i';
$A = do_reg($string, $regex);
foreach ($A as $B)
{
    echo "$B<BR>";
}
?>
nrg_alpha Posted March 28, 2009

killah, looking at your first sample, I think you misunderstand the usage of character classes, among other things..

'~([http]|[https][file]|[ftp]|[irc])://([www]|[])(.*).([com]|[net]|[info]|[org])~i'

When you encase something like http within a character class, [http], what this is in effect saying is that at the current position in the target string, the current character must be an h, a t, or a p. In other words, it must be one of those characters. Understand that a character class matches a single character from within those square brackets. So all those character classes you have will not look for the characters listed within them as a sequence within the target string. Instead, you should use capturing (or non-capturing) grouping brackets: ( )...

And on that note, you are using quite a few sets of parentheses (which become capturing brackets) when you only need one major set to capture the whole thing (but we can even avoid that!). You can use non-capturing sets to group things together, yet not do any sub-capturing, by using the (?: ) notation. In fact, when you use preg_match / preg_match_all, a third available argument is a variable that stores what is found within the target string using the pattern. Since the whole pattern must match in order for it all to pass, we don't even really need capturing parentheses (in this case anyway), as the entire pattern is stored as array element 0... (this will become clearer in my sample below).

You also use a simple dot between (.*) and ([com]...): (.*).([com]|[net]|[info]|[org]). Note that in this case, the dot should be escaped to be a literal dot; otherwise, it is treated as a wildcard, which will in turn accept any character (except a newline). I'll get to (.*) issues in a bit.

So to take your above example, keeping the functionality you have, it could be rewritten as such:

preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match);

Ok, so if this whole pattern matches something, the matched result is stored under $match[0]. You'll notice stuff like https? - the ? means optional (zero or one time) and generally applies directly to the single character preceding it. I say generally, as ? can also apply to a whole group of characters within parentheses, as in (abc)?, or to a character class, as in [abc]?. So in this case, the s in https is optional. This effectively covers your [http]|[https] part (minus the character class).

You'll notice that the entire first part is encased within (?: ):

(?:https?|irc|ftp|file)

This makes this section a non-capture; since we are looking for the entire thing collectively, if it's there, it will be stored under $match[0] anyway, so we don't need capturing parentheses. Afterwards, I encase www inside another non-capturing set and make it optional, which covers your ([www]|[]) part.

At this point in the pattern, yours used (.*). Note that it is typically not a good idea to use this (it is truly circumstantial, however). To read up on why, you can view this thread (make note of posts #11 and #14; the thread deals with .+, but the concept is pretty much the same for .*). Since I assume this URL is nested within a potentially large chunk of text, I go with .*? in this case, thus making it lazy (which makes it more accurate and saves time). Then finally, to match the extensions you have, I used yet another non-capturing group: (?:com|net|info|org).

Note that there are problems with these patterns in general, especially the extensions, as they will not find stuff like .asia or .co.uk, by example. To make matters worse, the URL extensions will be revised and expanded to include even more later on, some more than 2 to 4 characters long. So it is a moving target.

Sorry for the long post. The point was to point out the problems in your understanding of things (but you're trying, and that says a lot more than for many others, so kudos. Keep at it!)
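nrg_alpha's rewritten pattern can be tried directly. A small self-contained check (the sample string is my own, chosen to mirror killah's earlier examples):

```php
<?php
// The whole match lands in $match[0]; the (?: ) groups are
// non-capturing, so no sub-captures are produced.
$targetString = 'Some random Text with http://url.com in the middle';
if (preg_match('~(?:https?|irc|ftp|file)://(?:www)?.*?\.(?:com|net|info|org)~i', $targetString, $match)) {
    echo $match[0]; // http://url.com
}
```

The lazy .*? stops at the first extension that completes the match, which is what keeps the result from running past the URL into the surrounding text.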
MadTechie Posted March 28, 2009

With a post like that, all I can say is nicely done nrg_alpha, and with that I'll jump on your bandwagon while shouting "Regex forever!"
nrg_alpha Posted March 28, 2009

lol haha, thanks.. and you deciphered my uber secret binary code I see!
.josh Posted March 28, 2009

Quote from killah:
> All that is needed to add the <! and !> is to check for (http(s), irc, file, ftp) and (.com, .net, .org, .blah) and wrap the <! & !> around it.

My point is that if he had the ability to insert delimiters around target data, he wouldn't need to be regexing for it in the first place.
papaface Posted March 28, 2009 Author

Wow, I come back and see a massive wall of text lmao. I am using MadTechie's code as it seems to work pretty well for what I require, so thank you.

This is not related to regex, however; I wonder if someone could help me. As part of getting these URLs (some are tinyurls), I need to find out what they actually link to, i.e. I need the link tinyurl redirects the user to. Does anyone know of a way to do this?
MadTechie Posted March 28, 2009

cURL should do the trick..

<?php
<?php
$string = "some random text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988";
$regex = '$\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]$i';
preg_match_all($regex, $string, $result, PREG_PATTERN_ORDER);
$A = $result[0];
foreach ($A as $B)
{
    $URL = GetRealURL($B);
    echo "$URL<BR>";
}

function GetRealURL($url)
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => "",
        CURLOPT_USERAGENT      => "spider",
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_MAXREDIRS      => 10,
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err     = curl_errno($ch);
    $errmsg  = curl_error($ch);
    $header  = curl_getinfo($ch);
    curl_close($ch);
    return $header['url'];
}
?>

Please note that the returned URL could also be a redirected site, so you could create a recursive function - but it depends on how far you want to go! Also, my results are as follows:

from:
http://tinyurl.com/9uxdwc
http://google.com
http://tinyurl.com/787988

to:
http://wikileaks.org/wiki/Denmark:_3863_sites_on_censorship_list%2C_Feb_2008 => correct
http://www.google.co.uk/ => yet i'm in the UK
http://tinyurl.com/787988 => is error page but still a valid URL!

EDIT: reposted due to some bad parse
killah Posted March 29, 2009

I tried your code. You doubled the <?php at the top; with that duplicate line removed, the same code runs fine.

Uhm, however, I am in South Africa, and it still shows google.co.uk when not supposed to. I am sure that's not a big problem at all. I am fairly new to regexing, so excuse my bad regexing. Good job MadTechie.
MadTechie Posted March 29, 2009

Oops about the double <?php. One thing to remember is that Google redirects by reading the location of the client, so where is your server based, or are you using WAMP/XAMPP etc.?

Also, if this is complete, can you click solved please.
papaface Posted March 29, 2009 Author

Thanks for all your input into this. You've all been very helpful. Marking as solved.