johnsmith153 Posted April 18, 2011

I need a regular expression that detects web addresses in a string of text. It should find any address starting with http://, http://www. or www., on any domain (.co.uk, .com, anything). All of these should be picked up:

http://domain.com
http://www.domain.com
www.domain.com
http://domain.co.uk
http://www.domain.co.uk
www.domain.co.uk

It must also pick up any folders and URL variables (www.site.com/page1?a=123 etc.).

Most importantly, it must NOT pick up web addresses that are already inside an <a href="">xxx</a> link, only ones that are plain text and not embedded in HTML.

I have tried, but my attempts only do bits of the above. I can write the PHP code; I just need the regular expression to drop into my preg_match_all call.

Thanks in advance.

cyberRobot Posted April 19, 2011

Not exactly what you asked for, but you could explode() the entire string on the space character, then test each piece with substr().

<?php
...
if (substr($currPiece, 0, 11) == 'http://www.') {
    $urlFound = true;
} elseif (substr($currPiece, 0, 4) == 'www.') {
    $urlFound = true;
} else {
    $urlFound = false;
}
...
?>
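
For anyone who wants to try the explode-and-substr idea end to end, here is a minimal sketch of it. The function name find_url_pieces() and the sample sentence are made up for illustration, and the https:// prefix is an addition that the snippet above does not check. It makes no attempt to skip addresses inside <a href=""> tags.

```php
<?php
// Hypothetical helper (not from the thread): returns the space-separated
// pieces of $text that start with "http://", "https://" or "www.".
function find_url_pieces(string $text): array
{
    $prefixes = ['http://', 'https://', 'www.'];
    $found = [];

    // Split on the space character, as suggested above.
    foreach (explode(' ', $text) as $piece) {
        foreach ($prefixes as $prefix) {
            // Case-insensitive comparison of the first strlen($prefix) characters.
            if (strncasecmp($piece, $prefix, strlen($prefix)) === 0) {
                $found[] = $piece;
                break;
            }
        }
    }

    return $found;
}

print_r(find_url_pieces('Visit www.domain.co.uk or http://domain.com/page1?a=123 today'));
```

Because the split is on spaces only, a piece keeps any punctuation stuck to it (a trailing comma or full stop, for example), so some trimming would still be needed on real text.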

QuickOldCar Posted April 19, 2011

Here's what I made up for you.

It would be extremely difficult to pick up a link that is not in an href attribute and also has no http or www, something like truveo.com/category/news. Do you explode on the dots, then check for a trailing slash? Explode on slashes? It might not have a slash, and the end of "news" might be followed by a ?. Anyway, it's not easy. Not every link is the same, and any rule you make would exclude others. I suppose you could run the code several times, once per method, and combine the results. For those links you would have to strip_tags() the page, then explode every word on "." and see whether it contains a pattern such as .com, .co.uk, .org and so on, most likely with lots of trimming. Even then I can see some flaws in the method.

So here's a simple script to find any http, https or www link that is not inside an href. It's hard to find pages with links and no href, so I tested with my own domain generator with hyperlinks turned off.

<?php
$url = "http://get.blogdns.com/dynaindex/generator.php?character=alphabet&length=9&amount=10&sort=random&protocol=www.&tldext=.co.uk&hyperlink=no";
$file_data = @file_get_contents($url);
if ($file_data === false) {
    echo "<div align='center'><h2><font color='red'>Unable to retrieve any data</font></h2></div>";
    exit;
} else {
    // Look for the MIME type and charset declared in a Content-Type meta tag
    preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $file_data, $matches);
    if (isset($matches[1])) {
        $mime = $matches[1];
    }
    // Fall back to utf-8 when the page declares no charset, so iconv() always gets a source encoding
    $charset = isset($matches[3]) ? $matches[3] : "utf-8";
    $utf8_text = iconv($charset, "utf-8", $file_data);
    // Drop inline scripts
    $utf8_text = preg_replace('#<script[^>]*>.*?</script>#is', '', $utf8_text);
    // Turn punctuation and common tags into spaces.
    // ":" is not in the list: replacing it would split "http://" and the checks below could never match.
    $utf8_text = str_replace(array("`","!","@","#","$","^","* ","(",")","{","}",";","'","<p>","</p>","<br>","<br/>","<br />","</a>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), ' ', $utf8_text);
    $keywords = explode(" ", $utf8_text);
    foreach ($keywords as $keyword) {
        // Keep only the pieces that start like a web address
        if (substr($keyword, 0, 7) == "http://" || substr($keyword, 0, 8) == "https://" || substr($keyword, 0, 4) == "www.") {
            echo trim($keyword) . "<br />";
        }
    }
}
?>
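
Since the original question asked for a single pattern to drop into preg_match_all, here is one possible sketch of that. It is not the method used in the script above: it relies on the PCRE backtracking verbs (*SKIP)(*FAIL) to step over whole <a ...>...</a> elements and then matches anything starting with http://, https:// or www. in what is left. The function name find_plain_urls() and the sample text are hypothetical, and the rule for dropping trailing punctuation is an assumption about what counts as part of an address.

```php
<?php
// Hypothetical helper (not from the thread): returns URLs that appear as
// plain text in $html, ignoring anything inside an <a ...>...</a> element.
function find_plain_urls(string $html): array
{
    $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)'   // consume a whole anchor element, then discard it
             . '|\b(?:https?://|www\.)'             // otherwise: a URL must start with one of these
             . '[^\s<>"\']+'                        // followed by anything up to whitespace, a tag or a quote
             . '(?<![.,;:!?)])~is';                 // but not ending in sentence punctuation

    preg_match_all($pattern, $html, $matches);

    return $matches[0];
}

$text = 'Plain www.domain.co.uk, linked <a href="http://www.linked.com/">www.linked.com</a> '
      . 'and http://domain.com/page1?a=123.';

print_r(find_plain_urls($text));
```

This only protects addresses inside anchor elements. A URL in some other attribute, such as an img src, would still be matched, and bare names like truveo.com/category/news are not detected at all, for the reasons given above.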

QuickOldCar Posted April 19, 2011

I forgot that file_get_contents() needs the http:// prefix to work, so you can add this at the top, reading the URL from GET, to handle requests like http://mysite.com/thisscript.php?url=somesite.com. It's also better to use curl so that paths and redirects are followed fully.

// No database is involved here, so trim() is the only cleanup applied to the input
$url = isset($_GET['url']) ? trim($_GET['url']) : '';
// Prepend http:// unless the address already starts with http:// or https://
if (substr($url, 0, 7) != "http://" && substr($url, 0, 8) != "https://") {
    $url = "http://$url";
}
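
As a rough idea of what the curl version could look like, here is a minimal sketch that follows redirects. The function name fetch_page(), the redirect limit and the timeout values are assumptions chosen for illustration, and it needs PHP's curl extension to be installed.

```php
<?php
// Hypothetical helper (not from the thread): fetch a page with cURL,
// following redirects. Returns the body as a string, or null on failure.
function fetch_page(string $url, int $maxRedirects = 5): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: redirects
        CURLOPT_MAXREDIRS      => $maxRedirects, // but give up after this many hops
        CURLOPT_CONNECTTIMEOUT => 10,            // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 20,            // seconds allowed for the whole transfer
    ]);

    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

In the script above, the result of fetch_page($url) would take the place of the @file_get_contents($url) call, with null treated the same way as false.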