The Little Guy Posted January 17, 2008
What is the best way to remove all the different formats of URLs? Currently I remove all tags, so I am left with just one or more URLs. How do I remove them as well?

effigy Posted January 17, 2008
"all different formats of URL's?"
What do you mean by formats: protocols (http, https, ftp)?

The Little Guy (Author) Posted January 17, 2008
http, https, ftp, and www. So if I have one or more of the following:

    visit: https://google.com and WIN!
    visit: https://www.google.com and WIN!
    visit: http://google.com and WIN!
    visit: http://www.google.com and WIN!
    visit: www.google.com and WIN!
    visit: ftp://www.google.com and WIN!
    visit: ftp://google.com and WIN!

or any other common URLs, it would then format them like this:

    visit: [ URL REMOVED ] and WIN!

dsaba Posted January 18, 2008

    $pat = '~(?<=visit: )(?:http://|https://|ftp://)*(?:www)?[^\s]+(?= and Win!)~i';
    $source = preg_replace($pat, '[url REMOVED]', $source);

dsaba Posted January 18, 2008
This is better, because it matches just the URL regardless of the surrounding text:

    ~(?:(?(?<=http://|ftp://|https://)(?:http://|ftp://|https://)|)www\.|http://|https://|ftp://)[^\s]+~

Tested: http://nancywalshee03.freehostia.com/regextester/regex_tester.php?seeSaved=ev0obz7e

effigy Posted January 18, 2008

    %
      ### Protocol or start.
      (?:
        (?:(?:https?|ftp)://)
        |
        www\.
      )
      ### Body.
      \S+
      ### Avoid punctuation.
      (?<!\p{P})
    %x

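A minimal sketch of how a pattern like the one above might be applied to the sample lines from the original question. It assumes the protocol group reads `(?:https?|ftp)://`; the function name `remove_urls()` and the third sample string are illustrative and do not come from the thread.

```php
<?php
// Sketch only: the pattern follows effigy's post; remove_urls() is a
// hypothetical helper name chosen for this example.
function remove_urls(string $text): string
{
    // x modifier: whitespace is ignored and # starts a comment.
    $pattern = '%
        (?: (?:https?|ftp):// | www\. )   # protocol or leading www.
        \S+                               # body: a run of non-whitespace
        (?<!\p{P})                        # back off trailing punctuation
    %x';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '[ URL REMOVED ]', $text) ?? $text;
}

$samples = [
    'visit: https://google.com and WIN!',
    'visit: www.google.com and WIN!',
    'visit: ftp://www.google.com, then WIN!',
];

foreach ($samples as $line) {
    echo remove_urls($line), "\n";
}
```

The negative lookbehind is what keeps the comma in the third sample: `\S+` first swallows `www.google.com,` and then backtracks one character because the match may not end on punctuation.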
dsaba Posted January 18, 2008
Effigy, your regex will match:

    www.lalallookiamnotareallink
    http://again

(look, even SMF thinks these are links). All URLs have at least one '.' in them, so I'd say:

    ~(?:(?:(?:https?|ftp)://)|www\.)(?:\S+\.\S+)(?<!\p{P})~

I also don't see why you are using the x modifier, because \S will not match whitespace anyway.
I couldn't figure out the meaning of the P Unicode grapheme. Could you provide a link to a listing of these, or just tell me what it is?

dsaba Posted January 18, 2008
Correction:

    ~(?:(?:(?:https?|ftp)://)|www\.)\S+(?<!\p{P})\.\S+(?<!\p{P})~

Daniel0 Posted January 19, 2008
@dsaba: Not all URLs need at least one dot. For example, from within my LAN my computer is accessible by the name daniel-laptop. That means http://daniel-laptop is a valid URL from within my LAN, seeing as my computer is running an httpd on port 80.

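The trade-off raised here can be sketched by running both variants from the thread against a dotless host name. The patterns are condensed from the posts above (one redundant non-capturing group removed, assumed equivalent); the variable names and the test strings other than `http://daniel-laptop` and `www.lalallookiamnotareallink` are illustrative.

```php
<?php
// Sketch only: $schemeOnly mirrors effigy's pattern, $needsDot mirrors
// dsaba's corrected pattern that requires a dot inside the body.
$schemeOnly = '~(?:(?:https?|ftp)://|www\.)\S+(?<!\p{P})~';
$needsDot   = '~(?:(?:https?|ftp)://|www\.)\S+(?<!\p{P})\.\S+(?<!\p{P})~';

$tests = [
    'http://daniel-laptop',          // valid inside a LAN, no dot
    'www.lalallookiamnotareallink',  // looks like a link, is not one
    'http://www.google.com',         // ordinary URL
];

foreach ($tests as $url) {
    printf(
        "%-30s scheme-only: %-8s needs-dot: %s\n",
        $url,
        preg_match($schemeOnly, $url) ? 'match' : 'no match',
        preg_match($needsDot, $url) ? 'match' : 'no match'
    );
}
```

Requiring a dot rejects the fake `www.` string but also lets the LAN URL through untouched, so which variant fits depends on whether false positives or missed URLs are the bigger problem for the text being cleaned.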
effigy Posted January 21, 2008
"I also don't see why you are using the x modifier because \S will not match white space anyways."
To separate and comment the different parts.
"I couldn't figure out the meaning of the P unicode grapheme, could you a provide a link to a listing of these, or just tell what it is?"
Unicode Character Properties.

dsaba Posted January 24, 2008
I've been to this page, but it does not specify exactly which punctuation marks it matches. Unicode supports many languages, punctuation means different things in different languages, and there are punctuation marks outside the realm of English punctuation.

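A quick way to probe the concern above is to test individual characters against `\p{P}`. The sketch below assumes PHP 7 or later (for the `\u{...}` string escapes) and a PCRE build with Unicode support; the helper name `is_punct()` and the choice of sample characters are illustrative.

```php
<?php
// Sketch only: is_punct() is a hypothetical helper, not from the thread.
function is_punct(string $char): bool
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^\p{P}$/u', $char) === 1;
}

$samples = [
    '!',          // ASCII punctuation
    '$',          // currency symbol, not punctuation
    '+',          // math symbol, not punctuation
    "\u{00BF}",   // inverted question mark
    "\u{3002}",   // ideographic full stop
    "\u{00AB}",   // left-pointing double angle quotation mark
    'a',          // letter
];

foreach ($samples as $char) {
    echo $char, ' => ', is_punct($char) ? 'punctuation' : 'not punctuation', "\n";
}
```

Under these assumptions `\p{P}` does cover punctuation from other scripts, while symbols such as `$` and `+` fall under the separate symbol categories.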
effigy Posted January 24, 2008
There's a basic showing here. And this will show you as much as you want:

    <meta charset="utf-8"/>
    <pre>
    <?php
    ### Up the range as much as you want.
    ### The Unicode code space holds 1,114,112 code points (U+0000 to U+10FFFF).
    foreach (range(0, 127) as $code_point) {
        $utf = code2utf($code_point);
        ### The u modifier treats the subject as UTF-8.
        if (preg_match('/\p{P}/u', $utf)) {
            printf('[%09s]: ', number_format($code_point));
            echo $utf, '<br/>';
        }
    }

    ### Borrowed from http://us3.php.net/manual/en/function.utf8-encode.php#58461
    ### Encodes a code point as a UTF-8 byte sequence (1 to 4 bytes).
    function code2utf($num) {
        if ($num < 128) return chr($num);
        if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
        if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
        if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
        return '';
    }
    ?>
    </pre>

For example:

    [000000033]: !
    [000000034]: "
    [000000035]: #
    [000000037]: %
    [000000038]: &
    [000000039]: '
    [000000040]: (
    [000000041]: )
    [000000042]: *
    [000000044]: ,
    [000000045]: -
    [000000046]: .
    [000000047]: /
    [000000058]: :
    [000000059]: ;
    [000000063]: ?
    [000000064]: @
    [000000091]: [
    [000000092]: \
    [000000093]: ]
    [000000095]: _
    [000000123]: {
    [000000125]: }