Manixat Posted January 12, 2013
Hello, I'm currently trying to turn links that users post into anchor tags that actually redirect, and I'd like to ask what the best way of doing that is. My idea is to explode the article on spaces, look for "http://" or "https://" in each word (each element of the array), and then convert that word to an anchor tag. For a really big article, though, I suspect this is not the proper way of achieving what I'm after. Edited January 12, 2013 by Manixat
scootstah Posted January 12, 2013
That is going to be horribly inefficient. Use Regular Expressions.
Manixat Posted January 12, 2013 (Author)
Alrighty, I've come up with some regex, but I need confirmation that it is written correctly before I code it into the website. "/https?:\/\/.*/i" seems to work fine on my test server, but what is bothering me is that .* is supposed to match everything except new lines, right? Why isn't it matching the space and everything after it where the link ends?
Christian F. Posted January 12, 2013
It does:

    php > $string = "This is a string with\nsome text and a http://link.com/url.php link to an article\nwhich contains information you want.";
    php > preg_match ("/https?:\/\/.*/i", $string, $matches);
    php > var_dump ($matches);
    array(1) {
      [0]=>
      string(42) "http://link.com/url.php link to an article"
    }

Though why you'd want to match everything after the link as well, I don't know. Show us your code, not just what you think is the problem, then we can tell you what's wrong.
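To make the difference concrete, here is a minimal sketch that compares `.*` with `\S+` on the same input. The sample string and the variable names are made up for illustration, and `\S+` is only one possible way to bound the match.

```php
<?php
// Hypothetical sample text: a link followed by more words on the same line.
$text = "Read http://link.com/url.php before you reply\nthanks";

// `.*` stops only at the newline, so it swallows the words after the link.
preg_match('~https?://.*~i', $text, $greedy);

// `\S+` stops at the first whitespace character, which is where this link ends.
preg_match('~https?://\S+~i', $text, $bounded);

echo $greedy[0], "\n";   // http://link.com/url.php before you reply
echo $bounded[0], "\n";  // http://link.com/url.php
```

Using `~` as the delimiter also avoids having to escape the slashes in `://`.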
Manixat Posted January 12, 2013 (Author)
I don't want to match everything after the link; that is exactly what I'm concerned about. How do I find the end of the link? EDIT: apparently all links in my articles are followed by a new line, and that's why I'm not experiencing any issues for now. Edited January 12, 2013 by Manixat
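One possible way to handle the end of the link, sketched under some assumptions: treat whitespace, angle brackets and quotes as the end of a URL, and hand trailing punctuation back to the sentence. The function name `linkify` and the pattern are hypothetical, and the pattern is deliberately loose; it locates likely links and does not validate them.

```php
<?php
// Hypothetical helper: escape the text and wrap each http(s) URL in an anchor tag.
function linkify(string $text): string
{
    // Split the text on URLs; with one capture group the result alternates
    // between plain text (even indexes) and captured URLs (odd indexes).
    $parts = preg_split('~(https?://[^\s<>"\']+)~i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Plain text between links: escape it for safe output.
            $html .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
            continue;
        }
        // Trailing punctuation usually belongs to the sentence, not the link.
        $url  = rtrim($part, '.,;:!?)');
        $tail = (string) substr($part, strlen($url));
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $html .= '<a href="' . $safe . '">' . $safe . '</a>'
               . htmlspecialchars($tail, ENT_QUOTES, 'UTF-8');
    }
    return $html;
}

echo linkify('Go to http://example.com/a.php, ok'), "\n";
```

Escaping each piece separately keeps user-supplied markup out of the page while still letting the anchor tags through.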
Manixat Posted January 12, 2013 (Author)
Update: I found a regex that works quite well, and the whole URL can be retrieved easily using $1: "|(([A-Za-z]{3,9})://([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?((/[-\+~%/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?)?)|"
Christian F. Posted January 12, 2013
I did some looking around the net to see if I could find a better RegExp than the one I was already using, and came across this article: http://www.devshed.com/c/a/PHP/PHP-URL-Validation-Functions/ It's a bit old (nearly 2 years by now), but it has a nice list of valid and invalid domains, plus some of the most cited RegExps for URL validation. However, when I looked at the results they were rather depressing: none of them passed perfectly, and none were better than my own. The first option came close to mine, with an 18.66% failure rate overall: 4 slippages (14.8%) and 11 overjudgements (27.5%). Anyway, close is not good enough, so I decided to fix it:

    $RegExp = '#^(?:(?:(?:f|ht)tps?|dchub|sftp|steam)://)?'.
        // Username-password combos.
        '(?:\\w+(?::\\w+)?@)?'.
        // Domain or IP address
        '(?:((?:[\\w\\pL][\\w\\pL-]*(?<!\\-)\\.)+[a-z\\pL]{2,5})(?::\\d{1,5})?'.
        '|(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))'.
        // URL Path
        '((?:(?<!/)/(?:\w(?:%[a-f\\d]{2}|[\\w\\., -])*)*)+(?:(?:\\.\\w{1,6})?'.
        // URL param
        '(\\?(?:(?:%[a-f\\d]{2}|[\\w\\.-])+=(?:%[a-f\\d]{2}|[\\w\\.-])+)(?:&(?:%[a-f\\d]{2}|[\\w\\.-])+(?:=(?:%[a-f\\d]{2}|[\\w\\.-])+)?)*&?)?'.
        ')?)?\\z#ui';

This passes with a 0% failure rate: 0 slippages and 0 overjudgements. To make it extract URLs from a larger text, and not just validate complete URLs, just remove the anchors (^ and \\z).

Added: The one you found was rather abysmal, with 16 overjudgements (40%) and 14 slippages (51.9%), making for an overall failure rate of 45.9%. Edited March 1, 2013 by Philip
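The point about anchors can be illustrated with a much simpler stand-in pattern. The sketch below does not use the RegExp from the post; `$core` and the sample strings are hypothetical and only show how the same pattern validates when anchored and extracts when not.

```php
<?php
// Deliberately simple stand-in pattern, not the RegExp from the post above.
$core = 'https?://[\w.-]+(?:/[\w./%-]*)?';

// Anchored with ^ and \z: the whole string has to be a URL (validation).
$isUrl = preg_match('~^' . $core . '\z~i', 'http://example.com/a.php') === 1;
$isNot = preg_match('~^' . $core . '\z~i', 'see http://example.com/a.php') === 1;

// Unanchored: every URL inside a longer text is collected (extraction).
preg_match_all(
    '~' . $core . '~i',
    'see http://example.com/a.php and https://example.org/',
    $found
);

var_dump($isUrl);    // bool(true)
var_dump($isNot);    // bool(false)
print_r($found[0]);  // the two URLs found in the text
```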
sid0972 Posted February 9, 2013
@Christian F., can you tell me the difference between the URL path and the URL params? Do URL parameters include GET values?
Christian F. Posted February 10, 2013
Everything before the ? is the path, as per the definition of "path":

    5. computing the directions for reaching a particular file or directory, as traced hierarchically through each of the parent directories usually from the root; the file or directory and all parent directories are separated from one another in the path by slashes

Everything after the ? is the parameters, as per the definition of "parameter":

    3. Computers. a variable that must be given a specific value during the execution of a program or of a procedure within a program.

The act of requesting a web page via HTTP is the execution of the procedure, in this case.

http://dictionary.reference.com/browse/path
http://dictionary.reference.com/browse/parameter
sid0972 Posted February 10, 2013
So this is a path: http://www.google.com and everything after google.com is a param: https://www.google.c...iw=1855&bih=968 right? Edited February 10, 2013 by sid0972
Christian F. Posted February 10, 2013
Hmm... You just pointed out something that I've missed in mine: bookmarks. That said, there isn't actually a path in that first URL, and in the second only the slash after "google.com" is part of the path. In the first link you posted, you have the protocol (http://) and the domain (www.google.com). In the second you have the above, plus the path (/), and then the parameters. (Though Google is using the bookmark identifier, so I assume that the parameters are handled by JS and not the server.)
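For checking which part of a URL is which, PHP's built-in parse_url() can split one into the pieces named above. This is a small sketch with a made-up example.com URL; the comments map PHP's key names onto the terms used in this thread.

```php
<?php
// Hypothetical URL chosen for illustration; example.com is a placeholder host.
$parts = parse_url('https://www.example.com/search/index.php?q=php&page=2#results');

echo $parts['scheme'], "\n";    // protocol:   https
echo $parts['host'], "\n";      // domain:     www.example.com
echo $parts['path'], "\n";      // path:       /search/index.php
echo $parts['query'], "\n";     // parameters: q=php&page=2
echo $parts['fragment'], "\n";  // bookmark:   results

// parse_str() splits the parameters into an array, like the GET values
// the server would receive.
parse_str($parts['query'], $params);

// A bare domain has no path component at all.
$bare = parse_url('http://www.example.com');
var_dump(isset($bare['path']));  // bool(false)
```

The part after the # is not sent to the server by browsers, which fits the guess that such parameters are handled by JS.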
Zane Posted February 10, 2013
Here ya go. http://www.sitepoint.com/forums/showthread.php?530093-Regex-Help&p=3713338#post3713338