EchoFool Posted September 17, 2011

Hey, I'm trying to extract URLs from inputted data so I can separate them from the rest of the text, but I can't work out what regex I need. The main issue is that I'm trying to extract a specific domain, for example google.com, but it could be written four ways: google.com, http://google.com, http://www.google.com, or www.google.com. Does anyone know how to do it?
QuickOldCar Posted September 17, 2011

http://php.net/manual/en/function.parse-url.php

<?php
function parseHOST($url){
    //strip a leading www. (after an optional scheme), not every www. in the string
    $url = preg_replace('~^((?:[a-z][a-z0-9+.-]*://)?)www\.~i', '$1', trim($url));
    $parsedUrl = @parse_url($url);
    if(!empty($parsedUrl['host'])){
        return strtolower(trim($parsedUrl['host']));
    }
    //without a scheme, parse_url() leaves the host at the start of 'path'
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return strtolower(trim($parts[0]));
}

//example
echo parseHOST('http://google.com')."<br />";
echo parseHOST('http://www.google.com')."<br />";
echo parseHOST('google.com')."<br />";
?>
EchoFool Posted September 17, 2011

Okay, but if that's in a paragraph, say a forum post, how would I extract the URLs from the string, including GET parameters on the domain, e.g. google.com?get=5? Also, what if the domain in the post isn't spaced out from the words around it (some spammers do that, like "hello therewww.google.comhow are you")? Then the check won't pick it up, will it?
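For the case where the domain is known in advance, one option is to search the text for that domain directly instead of splitting it into words first. The following is only a minimal sketch under that assumption; `findDomainLinks()` is a hypothetical name, not something from the thread. It accepts an optional `http://` or `https://`, an optional `www.`, and an optional path or query string after the domain.

```php
<?php
// Hypothetical helper: return every occurrence of one known domain in free text.
function findDomainLinks(string $text, string $domain): array
{
    // preg_quote() escapes the dots, so "google.com" does not match "googleXcom".
    $pattern = '~(?:https?://)?(?:www\.)?' . preg_quote($domain, '~')
             . '(?:[/?#][^\s<>"\']*)?~i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$post = 'hello therewww.google.comhow are you, see http://google.com?get=5 too';
print_r(findDomainLinks($post, 'google.com'));
// finds "www.google.com" and "http://google.com?get=5"
```

Because the pattern has no word boundaries, it also matches inside longer names such as notgoogle.com or google.com.example.org. That is probably acceptable for a spam check, but it would need tightening for anything stricter.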
QuickOldCar Posted September 17, 2011

Ahh, I see what you mean. Yeah, that's totally different. Are you going to be checking against a list of possible spammy domains? I would think it's nearly impossible to detect every kind of domain name within a paragraph. Try something like this.

<?php
function parseHOST($url){
    //strip a leading www. (after an optional scheme), not every www. in the string
    $url = preg_replace('~^((?:[a-z][a-z0-9+.-]*://)?)www\.~i', '$1', trim($url));
    $parsedUrl = @parse_url($url);
    if(!empty($parsedUrl['host'])){
        return strtolower(trim($parsedUrl['host']));
    }
    //without a scheme, parse_url() leaves the host at the start of 'path'
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return strtolower(trim($parts[0]));
}

//your text input from a post
$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> http://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com? <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. 
<br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";

$spam_array = array("spam-site.com","spam.com");//add to the list

//space anything that would get included in a link, add the space
$text = str_ireplace(array("<br />","\n","\r"),array(" <br /> "," \n "," \r "),$text);
//collapse double spaces into one
$text = str_replace("  ", " ", $text);

//explode the text by spaces
$text_explode = explode(" ",$text);

//loop and return words not in spam array
foreach($text_explode as $words){
    if(!in_array(parseHOST($words),$spam_array)){
        echo " $words ";
    }
}
?>
EchoFool Posted September 17, 2011

That's very good! Though for this kind of string:

<?php
$text = "this is some stuffhttp://www.domain.com?d=13213124";
?>

because there is no space between "stuff" and "http", it doesn't notice that "stuff" is not part of the URL. But as you say, it's probably impossible to improve on that =/ Unless I can somehow insert a space before "http" wherever one is missing.

Oh, and this doesn't work if there are any newlines in the text, like:

google.com
google.com

That fails. I'm thinking that removing all newlines might fix it =/
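Both problems described above, the missing space before "http" and the newlines, can be handled while splitting the text into words. This is a rough sketch rather than a tested fix for the script in this thread, and `tokenizePost()` is a hypothetical name. It inserts a space before any `http://` or `https://` that directly follows a non-space character, then splits on any run of whitespace, so newlines never need to be removed.

```php
<?php
// Hypothetical helper: split a post into words so glued-on URLs become separate tokens.
function tokenizePost(string $text): array
{
    // Put a space before any scheme that directly follows a non-space character.
    $text = preg_replace('~(?<=\S)(?=https?://)~i', ' ', $text);
    // Split on any run of whitespace, so newlines and tabs count as separators too.
    return preg_split('~\s+~', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenizePost("this is some stuffhttp://www.domain.com?d=13213124"));
print_r(tokenizePost("google.com\ngoogle.com"));
```

One caveat: a URL that carries another URL in its query string (for example a redirect link) would be split in two by the first step, so each half would be checked separately.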
QuickOldCar Posted September 17, 2011

Maybe try this on the text first:

$text = str_ireplace("http://", " http://", $text);

I also wrote something a while back that found non-hyperlinked URLs and turned them into hyperlinks. I mention it because it goes about this in a different way.

<?php
function parseHOST($url){
    //strip a leading www. (after an optional scheme), not every www. in the string
    $url = preg_replace('~^((?:[a-z][a-z0-9+.-]*://)?)www\.~i', '$1', trim($url));
    $parsedUrl = @parse_url($url);
    if(!empty($parsedUrl['host'])){
        return strtolower(trim($parsedUrl['host']));
    }
    //without a scheme, parse_url() leaves the host at the start of 'path'
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return strtolower(trim($parts[0]));
}

function removeSPAM($text){
    $spam_array = array("spam-site.com","spam.com");
    //turn every www. into http:// so bare www links are matched too
    $text = preg_replace("/(www\.)/is", "http://", $text);
    //undo the doubled scheme on links that already had one
    $text = str_replace(array("http://http://","https://http://"), array("http://","https://"), $text);
    //allow longer TLDs and a query string directly after the host
    $reg_exUrl = "/(https?|ftps?):\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}([\/?#]\S*)?/";
    preg_match_all($reg_exUrl, $text, $matches);
    //check each distinct URL once
    foreach(array_unique($matches[0]) as $pattern){
        if(in_array(parseHOST($pattern),$spam_array)){
            $text = str_ireplace($pattern, " ", $text);
        }
    }
    return $text;
}

$text = "Visit my youtube link</a> Some sample text with WWW.AOL.com. <br /> testinghttp://spam.com/more-spam <br />http://www.youtube.com/watch?v=csgZ2b1bW2o and a spam link is http://www.spam-site.com/click here for spam<br />Anyone use www.myspace.com? <br />Some people are nuts, look at this stargate link at http://www.youtube.com/watch?v=ZKoUm6z5SzU&feature=grec_index , like aliens exist or something. http://www.youtube.com/watch?v=sfN-7HczmOU&feature=grec_index and here's a secure site https://familyhistory.hhs.gov, unless you use curl or allow secure connections it will never get a title. 
<br /> This is a not valid site http://zzzzzzz and this is a dead site http://zwzwzwxzw.com.<br /> Lastly lets try an already made hyperlink and see what it does <a href='http://dynaindex.com'>dynaindex.com</a>";

echo removeSPAM($text);
?>
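A variation on the same idea is to let `preg_replace_callback()` decide URL by URL, so links that are not on the list stay exactly as they were typed. This is only a sketch under assumptions, not a drop-in replacement: `hostOf()` and `stripBlockedUrls()` are hypothetical names, and the URL pattern is deliberately loose.

```php
<?php
// Hypothetical helper: lower-cased host of a URL-like string, without a leading "www.".
function hostOf(string $url): string
{
    // parse_url() only fills in the host when a scheme is present, so add one if missing.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return '';
    }
    return preg_replace('~^www\.~', '', strtolower($host));
}

// Hypothetical helper: replace URLs whose host is on the block list with a space.
function stripBlockedUrls(string $text, array $blockedHosts): string
{
    // Matches scheme://host... or www.host..., with an optional path or query.
    $pattern = '~(?:(?:https?|ftps?)://|www\.)[a-z0-9.-]+\.[a-z]{2,}(?:[/?#][^\s<>"\']*)?~i';
    $result = preg_replace_callback($pattern, function (array $m) use ($blockedHosts) {
        return in_array(hostOf($m[0]), $blockedHosts, true) ? ' ' : $m[0];
    }, $text);
    return $result ?? $text; // fall back to the input if the regex engine reports an error
}

$text = "testinghttp://spam.com/more-spam and www.myspace.com plus https://familyhistory.hhs.gov";
echo stripBlockedUrls($text, ['spam.com', 'spam-site.com']);
```

Each match is replaced at its own position, so a short blocked URL can never clip the front of a longer one, and nothing outside the matched URLs is rewritten.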