
Detecting www.


johnsmith153


I need a regular expression that detects a web address in a string of text.

 

I need it to find any web address that starts with http:// or www.

Any domain (.co.uk, .com, anything).

 

All these would be picked up:

 

http://domain.com

http://www.domain.com

www.domain.com

http://domain.co.uk

http://www.domain.co.uk

www.domain.co.uk

 

Also, it must pick up all folders and URL variables (www.site.com/page1?a=123 etc.).

Also, most importantly:

It must NOT pick up web addresses that are already inside an <a href="">xxx</a> link, only ones that are plain text and not embedded in this HTML.

 

I have tried, but my attempt only handles parts of the above.

I can write the PHP code; I just need the regular expression to drop into my preg_match_all call.

 

Thanks in advance.

https://forums.phpfreaks.com/topic/234080-detecting-www/

Not exactly what you asked for, but you could explode() the entire string on the space character, then test each piece with substr().

 

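<?php
...

// Only the "http://www." and "www." prefixes are tested here, so an address
// such as http://domain.com (no www) is not detected.
if (substr($currPiece, 0, 11) == 'http://www.') {
    $urlFound = true;
} elseif (substr($currPiece, 0, 4) == 'www.') {
    $urlFound = true;
} else {
    $urlFound = false;
}

...
?>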
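For anyone who wants to try the explode-and-substr idea end to end, here is a minimal sketch of the whole loop. The function name find_url_pieces is made up for illustration, it splits on any whitespace rather than only the space character, and it tests for "http://" instead of "http://www." so that http://domain.com is caught as well. It does nothing about links already inside an <a> tag.

```php
<?php
// Hypothetical helper (not from the thread): returns the whitespace-separated
// pieces of $text that start with "http://" or "www.".
function find_url_pieces($text)
{
    $found = array();
    // Split on any run of whitespace so newlines and tabs also separate pieces.
    foreach (preg_split('/\s+/', trim($text)) as $currPiece) {
        if (substr($currPiece, 0, 7) == 'http://' || substr($currPiece, 0, 4) == 'www.') {
            $found[] = $currPiece;
        }
    }
    return $found;
}

print_r(find_url_pieces("Visit www.domain.co.uk or http://domain.com/page1?a=123 today"));
```

Punctuation stuck to the end of a piece (a trailing comma or full stop) is kept as part of the result, so some trimming would probably still be needed.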

https://forums.phpfreaks.com/topic/234080-detecting-www/#findComment-1203286

Here's what I made up for you.

 

It would be extremely difficult to detect a link that is not in an href attribute and also does not contain http or www.

Something like truveo.com/category/news

Do you explode on the dot, then check for a trailing slash? Explode on the slashes? There might not be a slash, and the end of "news" might be a ?. Either way, it's not easy.

Not every link looks the same, and any rule you make will exclude others. I suppose you could run the code several times, once per method, and combine the results.

For those you would have to run strip_tags on the page, then explode every word by . and see if it contains a pattern such as .com, .co.uk, .org and so on. And most likely lots of trimming. Even then I could see some flaws with the method.
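As a rough sketch of the strip_tags idea just described, something like the following could work for bare domains such as truveo.com/category/news. The function name find_bare_domains and the short TLD list are assumptions for illustration only. Instead of exploding each word on the dot, it checks whether the host part ends in one of the listed suffixes, which is a slightly different take on the same idea.

```php
<?php
// Hypothetical helper (not from the thread): finds bare domains such as
// "truveo.com/category/news". The default TLD list is only an example.
function find_bare_domains($html, $tlds = array('com', 'co.uk', 'org', 'net'))
{
    $found = array();
    $text = strip_tags($html);
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        // Trim punctuation that commonly surrounds a word in running text.
        $word = trim($word, ".,;:!?()[]\"'");
        // Look only at the host part: everything before the first "/" or "?".
        $host = strtolower(preg_replace('~[/?].*$~', '', $word));
        foreach ($tlds as $tld) {
            $suffix = '.' . $tld;
            // The host must end with the suffix and have a name in front of it.
            if (strlen($host) > strlen($suffix) && substr($host, -strlen($suffix)) === $suffix) {
                $found[] = $word;
                break;
            }
        }
    }
    return $found;
}

print_r(find_bare_domains("See truveo.com/category/news, or <b>example.org</b>."));
```

The flaws mentioned above still apply: strip_tags keeps the text of existing links, e-mail addresses ending in a listed suffix would match too, and any TLD missing from the list is ignored.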

 

So here's a simple script to find any http, https or www link that is not inside an href attribute.

It's hard to find pages with links and no href, so I tested with my own domain generator with hyperlinks turned off.

<?php
$url = "http://get.blogdns.com/dynaindex/generator.php?character=alphabet&length=9&amount=10&sort=random&protocol=www.&tldext=.co.uk&hyperlink=no";
$file_data = @file_get_contents($url);
if ($file_data === false) {
    echo "<div align='center'><h2><font color='red'>Unable to retrieve any data</font></h2></div>";
    exit;
} else {
    // Read the MIME type and charset from the Content-Type meta tag, if present.
    $charset = "utf-8"; // fallback when the page declares no charset
    preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
        $file_data, $matches);
    if (isset($matches[1])) {
        $mime = $matches[1];
    }
    if (isset($matches[3])) {
        $charset = $matches[3];
    }

    $utf8_text = iconv($charset, "utf-8", $file_data);

    // Drop scripts, then turn punctuation and common tags into spaces.
    // ":" is left out of the list, otherwise "http://" would be split apart.
    $utf8_text = preg_replace('#<script[^>]*>.*?</script>#is', '', $utf8_text);
    $utf8_text = str_replace(array("`","!","@","#","$","^","* ","(",")","{","}",";","'","<p>","</p>","<br>","<br/>","<br />","</a>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), ' ', $utf8_text);

    $keywords = explode(" ", $utf8_text);

    foreach ($keywords as $keyword) {
        if (substr($keyword, 0, 7) == "http://" || substr($keyword, 0, 8) == "https://" || substr($keyword, 0, 4) == "www.") {
            echo trim($keyword) . "<br />";
        }
    }
}

?>
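Since the original question asked for something to drop into preg_match_all, here is one possible sketch of a regex-based version. It is not a complete URL grammar: it assumes that removing whole <a>...</a> elements first is an acceptable way to skip addresses that are already linked, and the function name find_plain_text_urls is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): returns plain-text URLs,
// skipping anything that sits inside an existing <a> element.
function find_plain_text_urls($html)
{
    // Remove whole <a ...>...</a> elements so linked URLs are never seen.
    $text = preg_replace('~<a\b[^>]*>.*?</a>~is', ' ', $html);
    // Match http://, https:// or www. followed by anything that is not
    // whitespace, a quote, or an angle bracket.
    preg_match_all('~\b(?:https?://|www\.)[^\s<>"\']+~i', $text, $matches);
    // Drop sentence punctuation stuck to the end of a match.
    return array_map(function ($url) {
        return rtrim($url, '.,;:!?)');
    }, $matches[0]);
}

$sample = 'Go to www.site.com/page1?a=123. Already linked: '
        . '<a href="http://domain.com">http://domain.com</a> and http://domain.co.uk';
print_r(find_plain_text_urls($sample));
```

Doing the anchor removal as a separate step keeps the pattern itself short. Trying to express "not inside a link" in a single expression with lookarounds tends to get fragile quickly.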

https://forums.phpfreaks.com/topic/234080-detecting-www/#findComment-1203322

I forgot that file_get_contents needs the http:// prefix to work, so you can add this to the top of the script, reading the address from the GET parameter.

That way you can use links like http://mysite.com/thisscript.php?url=somesite.com

It's also better to use curl so that paths and redirects are fully followed.

 

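// Read the target address from the query string.
// (mysql_real_escape_string() is meant for SQL queries, not URLs, so it is not used here.)
$url = isset($_GET['url']) ? trim($_GET['url']) : '';

// Prepend a scheme only when none is given, so https:// addresses are left alone.
if (substr($url, 0, 7) != "http://" && substr($url, 0, 8) != "https://") {
    $url = "http://$url";
}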
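The curl suggestion could be sketched roughly like this, as a stand-in for the file_get_contents call. The function name fetch_following_redirects, the redirect limit and the timeout are assumptions for illustration, and the PHP cURL extension has to be installed for it to run.

```php
<?php
// Hypothetical helper (not from the thread): fetches a page with cURL and
// follows redirects. Returns the page body, or false on failure.
function fetch_following_redirects($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // assumed limit on redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout in seconds
    return curl_exec($ch);
}
```

The return value can be checked with === false in the same way as the result of file_get_contents.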

https://forums.phpfreaks.com/topic/234080-detecting-www/#findComment-1203332
