Extract domain from strings?

EchoFool · September 17, 2011

Hey

I'm trying to extract URLs from user input so I can separate them from the rest of the text, but I can't work out what regex I need to use.

The main issue is that I'm trying to extract a specific domain, for example google.com.

But it could be written four ways (google.com, http://google.com, http://www.google.com, www.google.com).

Does anyone know how to do it?

QuickOldCar · September 17, 2011

http://php.net/manual/en/function.parse-url.php

<?php
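// Returns the lowercased host of a URL, without a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if(!is_array($parsedUrl)){
        return '';
    }
    if(!empty($parsedUrl['host'])){
        $host = $parsedUrl['host'];
    }else{
        // no scheme: parse_url() leaves the domain at the start of 'path'
        $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
        $host = $parts[0];
    }
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}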

//example
echo  parseHOST('http://google.com')."<br />";
echo  parseHOST('http://www.google.com')."<br />";
echo  parseHOST('google.com')."<br />";

?>
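
For reference, the fallback to 'path' in the function above is needed because parse_url() only reports a 'host' when the string has a scheme. A minimal sketch of what it returns for the four forms from the question (the variable names $forms and $hosts are just for illustration):

```php
<?php
// The four ways the domain might be written, per the question above.
$forms = array('google.com', 'http://google.com', 'http://www.google.com', 'www.google.com');

$hosts = array();
foreach ($forms as $form) {
    $parts = parse_url($form);
    // With a scheme the domain is in 'host'; without one it ends up in 'path'.
    $hosts[$form] = isset($parts['host']) ? $parts['host'] : $parts['path'];
}

print_r($hosts);
```

Note that parse_url() leaves the "www." in place, which is why the function strips it separately.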

EchoFool · September 17, 2011

Okay, but if that's in a paragraph, say a forum post, how would I extract the URLs from the string, including GET parameters on the domain:

google.com?get=5

Also, what if the domain in the post isn't spaced out from the words around it? Some spammers do that, like this: "hello therewww.google.comhow are you".

Then the check won't pick it up?

QuickOldCar · September 17, 2011

Ahh, I see what you mean. Yeah, that's totally different.

Are you going to be checking from a list of possible spammy domains?

I would think it would be nearly impossible to detect any type of domain name within a paragraph.

Try something like this.

<?php
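// Returns the lowercased host of a URL, without a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if(!is_array($parsedUrl)){
        return '';
    }
    if(!empty($parsedUrl['host'])){
        $host = $parsedUrl['host'];
    }else{
        // no scheme: parse_url() leaves the domain at the start of 'path'
        $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
        $host = $parts[0];
    }
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}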

//your text input from a post
$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> http://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com?  <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index  and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. <br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";
$spam_array = array("spam-site.com","spam.com");//add to the list
//space anything that would get included in a link, add the space
$text = str_ireplace(array("<br />","\n","\r"),array(" <br /> "," \n "," \r "),$text);
$text = str_replace("  ", " ", $text);
//explode the text by spaces
$text_explode = explode(" ",$text);
//loop and return words not in spam array
foreach($text_explode as $words){
if(!in_array(parseHOST($words),$spam_array)){
echo " $words ";
}
}
?>
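
One possible variation on the loop above, offered as a sketch rather than a drop-in replacement: split on any run of whitespace with preg_split(), so that newlines and tabs separate words the same way spaces do. The names filterWords() and hostOf() are made up for this example; hostOf() is only a compact stand-in for parseHOST() so the snippet runs on its own.

```php
<?php
// Hypothetical stand-in for parseHOST(): lowercased host without a leading "www."
function hostOf($word) {
    $parts = @parse_url(trim($word));
    if (!is_array($parts)) {
        return '';
    }
    if (!empty($parts['host'])) {
        $host = $parts['host'];
    } else {
        // no scheme: the domain sits at the start of 'path'
        $pieces = explode('/', isset($parts['path']) ? $parts['path'] : '', 2);
        $host = $pieces[0];
    }
    return preg_replace('/^www\./', '', strtolower($host));
}

// Hypothetical helper: drop every whitespace-separated word whose host is listed.
function filterWords($text, array $spam_array) {
    $kept = array();
    // PREG_SPLIT_NO_EMPTY skips the empty strings that repeated whitespace would produce
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (!in_array(hostOf($word), $spam_array)) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}

echo filterWords("hello\nhttp://www.spam.com/x\nspam.com?get=5 and www.google.com", array('spam.com'));
```

The trade-off is that the original line breaks are collapsed into single spaces, so this only suits cases where the layout of the post does not need to survive the check.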

EchoFool · September 17, 2011

That's very good! Though for this kind of string:

<?php $text = "this is some stuffhttp://www.domain.com?d=13213124"; ?>

Because there is no space between "stuff" and "http", it doesn't notice that "stuff" is not part of the URL. But as you say, it's probably impossible to improve on that =/ Unless I can somehow insert a space before "http" wherever one is missing.

Oh, and this doesn't work if there are any newlines in the text, like:

google.com
google.com

Fails. I'm thinking that removing all the newlines might fix it =/

QuickOldCar · September 17, 2011

Maybe try running this on the text first:

$text = str_ireplace("http://", " http://", $text);

I also wrote something a while back that found URLs that weren't hyperlinked and turned them into hyperlinks. I mention it because it goes about the problem in a different way.

<?php
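// Returns the lowercased host of a URL, without a leading "www."
function parseHOST($url){
    $parsedUrl = @parse_url(trim($url));
    if(!is_array($parsedUrl)){
        return '';
    }
    if(!empty($parsedUrl['host'])){
        $host = $parsedUrl['host'];
    }else{
        // no scheme: parse_url() leaves the domain at the start of 'path'
        $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
        $host = $parts[0];
    }
    return preg_replace('/^www\./', '', strtolower(trim($host)));
}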

function removeSPAM($text){
    $spam_array = array("spam-site.com","spam.com");
    // match URLs that start with a scheme or a bare "www.", including any path or query string
    $reg_exUrl = "/(?:(?:https?|ftps?):\/\/|www\.)[a-z0-9\-\.]+\.[a-z]{2,}(?:[\/?#]\S*)?/i";
    preg_match_all($reg_exUrl, $text, $matches);

    // array_unique() so each distinct URL is only checked once
    foreach(array_unique($matches[0]) as $pattern){
        if(in_array(parseHOST($pattern),$spam_array)){
            $text = str_ireplace($pattern, " ", $text);
        }
    }
    return $text;
}

$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> testinghttp://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com?  <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index  and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. <br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";

echo removeSPAM($text);

?>
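
A possible alternative to collecting the matches and then calling str_ireplace() for each one, sketched here under assumptions: let preg_replace_callback() decide per match. That leaves the rest of the text untouched and avoids a short URL also wiping out the start of a longer one. The function name stripSpamLinks() and the pattern are illustrative, not something from the posts above.

```php
<?php
// Hypothetical helper, not from the thread: blank out URLs whose host is in $spam_array.
function stripSpamLinks($text, array $spam_array) {
    // A scheme or a bare "www.", then a dotted host name, then an optional path/query/fragment.
    $pattern = '~(?:(?:https?|ftps?)://|www\.)([a-z0-9.-]+\.[a-z]{2,})(?:[/?#]\S*)?~i';
    return preg_replace_callback($pattern, function ($m) use ($spam_array) {
        // $m[1] is the host part; compare it lowercased and without a leading "www."
        $host = preg_replace('/^www\./', '', strtolower($m[1]));
        return in_array($host, $spam_array) ? ' ' : $m[0];
    }, $text);
}

echo stripSpamLinks('this is some stuffhttp://www.spam.com?d=13213124 and http://www.youtube.com/watch?v=1', array('spam.com'));
```

Because the pattern does not need whitespace in front of the scheme, it also catches the "stuffhttp://..." case without first inserting a space.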
