Detecting www.

johnsmith153 · April 18, 2011

I need a regular expression that detects a web address in a string of text.

I need it to find any http://www or www. web address.

It should match any domain (.co.uk, .com, anything).

All these would be picked up:

http://domain.com

http://www.domain.com

www.domain.com

http://domain.co.uk

http://www.domain.co.uk

www.domain.co.uk

Also, it must pick up all folders and other URL variables (www.site.com/page1?a=123 etc.)

**Also, most importantly:**

It must NOT pick up web addresses that are already inside an <a href="">xxx</a> link, only ones that are plain text and not embedded in this HTML.

I have tried but it only does bits of the above.

I can do the PHP code, just need to know the regular expression to drop into my preg_match_all code.

Thanks in advance.

cyberRobot · April 19, 2011

Not exactly what you asked for, but you could explode the entire string based on the space character. Then test each piece with substr().

<?php
...

if(substr($currPiece, 0, 11) == 'http://www.') {
     $urlFound = true;

} elseif(substr($currPiece, 0, 4) == 'www.') {
     $urlFound = true;

} else {
     $urlFound = false;
}

...
?>
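
For anyone filling in the elided parts, here is a minimal sketch of the explode-and-test idea above. The function name findUrlPieces is hypothetical. As assumptions beyond the original snippet, it tests for http:// rather than http://www. so that http://domain.com is caught too, it splits on any whitespace rather than a single space, and it trims trailing sentence punctuation.

```php
<?php
// Hypothetical helper: returns the whitespace-separated pieces that look like URLs.
function findUrlPieces(string $text): array
{
    $found = [];
    // Split on any run of whitespace and drop empty pieces.
    $pieces = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($pieces as $piece) {
        if (substr($piece, 0, 7) === 'http://' || substr($piece, 0, 4) === 'www.') {
            // Assumption: punctuation at the very end belongs to the sentence, not the URL.
            $found[] = rtrim($piece, '.,;:!?)');
        }
    }
    return $found;
}

print_r(findUrlPieces('See www.domain.co.uk/page1?a=123 or http://domain.com. Not domain.org'));
```

Note that this does nothing about URLs that sit inside an <a href=""> tag, so it only covers part of the original request.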

QuickOldCar · April 19, 2011

Here's what I made up for you.

It would be extremely difficult to detect a link that is not in an href attribute and also does not contain http or www.

Something like truveo.com/category/news

Do you explode on the dot, then check for a trailing slash? Explode on slashes? It might not have a slash, and the end of "news" might be followed by a ?. Anyway, it's not easy.

Not every link is the same, and any rule you make for one form would exclude others. I suppose you could run the code multiple times, once per method, and combine the results.

For those you would have to run strip_tags on the page, then explode every word by . and see if it contains a pattern such as .com, .co.uk, .org and so on. That would most likely need lots of trimming, and even then I could see some flaws with the method.

So here's a simple script to find any http, https or www link that is not inside an href attribute.

It's hard to find pages with links and no href, so I tested with my own domain generator with href set to off.

<?php
$url = "http://get.blogdns.com/dynaindex/generator.php?character=alphabet&length=9&amount=10&sort=random&protocol=www.&tldext=.co.uk&hyperlink=no";
$file_data = @file_get_contents($url);
if ($file_data === false) {
echo "<div align='center'><h2><font color='red'>Unable to retrieve any data</font></h2></div>";
exit;
} else {

preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
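if (isset($matches[1])) {
    $mime = $matches[1];
    }
// Fall back to UTF-8 when the page declares no charset, so iconv() always gets a source encoding.
$charset = isset($matches[3]) ? $matches[3] : "utf-8";

$utf8_text = iconv( $charset, "utf-8", $file_data );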

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
$utf8_text = str_replace(array("`","!","@","#","$","^","* ","(",")","{","}",":",";","'","<p>","</p>","<br>","<br/>","<br />","<br/>","</a>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), ' ', $utf8_text);

$keywords = explode(" ", $utf8_text);

foreach ($keywords as $keyword) {
if(substr($keyword, 0, 7) == "http://" || substr($keyword, 0, 8) == "https://" || substr($keyword, 0, 4) == "www.") {
echo trim($keyword)."<br />";
}
}
}

?>
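
If a regular expression is preferred over the str_replace and explode steps, the same goal (http, https or www links that are not inside an anchor) could be sketched roughly as below. This is not the script above, just one possible approach under some assumptions: the helper name findPlainTextUrls is made up, anchors are assumed to be well-formed <a ...>...</a> pairs, and the pattern is deliberately loose rather than a full URL validator.

```php
<?php
// Hypothetical helper: finds plain-text URLs, ignoring anything inside <a>...</a>.
function findPlainTextUrls(string $html): array
{
    // Step 1: drop whole anchor elements so already-linked URLs cannot match.
    $withoutAnchors = preg_replace('#<a\b[^>]*>.*?</a>#is', ' ', $html);

    // Step 2: match http://, https:// or www. followed by anything that is
    // not whitespace, an angle bracket or a quote.
    preg_match_all('#\b(?:https?://|www\.)[^\s<>"\']+#i', $withoutAnchors, $matches);

    // Step 3: trim punctuation that usually belongs to the sentence, not the URL.
    return array_map(function ($url) {
        return rtrim($url, '.,;:!?)');
    }, $matches[0]);
}

$sample = 'Visit www.site.com/page1?a=123, or '
        . '<a href="http://www.linked.com">http://www.linked.com</a> '
        . 'and http://domain.co.uk.';
print_r(findPlainTextUrls($sample));
```

Removing the anchors first keeps the URL pattern itself simple. Doing both jobs in a single expression would need lookarounds and tends to be harder to maintain.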

QuickOldCar · April 19, 2011

I forgot that file_get_contents needs the http:// prefix to work, so you can add this to the top of the script, taking the URL from GET.

That way you can handle links like http://mysite.com/thisscript.php?url=somesite.com

It's also better to use curl, so that paths and redirects are fully followed.

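// mysql_real_escape_string() is meant for SQL queries, not URLs, so just trim the input.
$url = trim($_GET['url']);

// Prepend http:// only when no http:// or https:// scheme is present.
if(substr($url, 0, 7) != "http://" && substr($url, 0, 8) != "https://") {
$url = "http://$url";
}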
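
For the curl suggestion, a minimal sketch might look like the following. The helper names normalizeUrl and fetchPage are hypothetical, and the redirect cap and timeout values are assumptions picked for illustration; it relies on PHP's cURL extension being installed.

```php
<?php
// Hypothetical helper: prepend http:// unless a scheme is already present.
function normalizeUrl(string $url): string
{
    $url = trim($url);
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

// Hypothetical helper: fetch a page with cURL, following redirects.
// Returns the response body as a string, or false on failure.
function fetchPage(string $url)
{
    $ch = curl_init(normalizeUrl($url));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // assumed cap, to avoid redirect loops
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
    return curl_exec($ch);
}

echo normalizeUrl('somesite.com'), "\n";
```

The result of fetchPage($url) could then stand in for the file_get_contents call, with the same === false check for failure.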
