I've written a script that grabs all URLs from a page, removes duplicates, and outputs them. From a link such as <a href="www.domain.com">akhajha</a> I need just the host, e.g. www.domain.com or domain.net. If the link is www.domain.org/helllloooooo.html, it needs to be trimmed to www.domain.org. The program itself is written; it's just this regex that is driving me nuts.
$url = $_POST['url'];
$banned = array($url); // never list the page's own URL
$html = file_get_contents($url);

// Collect every absolute http(s) URL found in the page source.
$matches = array();
$status = preg_match_all('@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)@', $html, $matches);

// Keep each URL once, skipping banned ones.
$unique = array();
foreach ($matches[1] as $match) {
    $found = false;
    if (in_array($match, $banned)) {
        $found = true;
    }
    foreach ($unique as $u) {
        if ($u == $match) {
            $found = true;
            break;
        }
    }
    if (!$found) {
        array_push($unique, $match);
    }
}

foreach ($unique as $link) {
    echo $link . "<br />";
}
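To make the goal concrete, here is a minimal sketch of the trimming I'm after. It is not a fix for the regex above: it swaps the single large pattern for a small href pattern plus PHP's built-in parse_url(). The function name extract_hosts and its $banned parameter (a list of hosts to skip, rather than full URLs) are hypothetical, and it assumes that a scheme-less href such as www.domain.com is meant as a host, so a relative link like about.html would be misread as one.

```php
<?php
// Hypothetical helper, not part of the original script: reduce every href
// in $html to its host name and return the unique hosts in first-seen order.
function extract_hosts(string $html, array $banned = []): array
{
    $hosts = [];
    // Capture the quoted value of each href attribute.
    if (!preg_match_all('@href\s*=\s*["\']([^"\']+)["\']@i', $html, $m)) {
        return $hosts;
    }
    foreach ($m[1] as $href) {
        $href = trim($href);
        if (!preg_match('@^https?://@i', $href)) {
            // Assumption: a scheme-less value that looks like a domain name
            // (www.domain.com, domain.net/page) is treated as a host.
            if (!preg_match('@^[a-z0-9-]+(?:\.[a-z0-9-]+)+(?=[/:?#]|$)@i', $href)) {
                continue; // skips /relative paths, #anchors, mailto: links
            }
            // parse_url() only reports a host when a scheme is present.
            $href = 'http://' . $href;
        }
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            continue; // malformed URL
        }
        $host = strtolower($host); // host names are case-insensitive
        if (!in_array($host, $banned, true) && !in_array($host, $hosts, true)) {
            $hosts[] = $host;
        }
    }
    return $hosts;
}
```

With something like this, the output loop would iterate over extract_hosts($html) instead of $unique.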
Please help me fix this problem. Thanks.