
preg_match_all: URL Regex Help Needed


seb hughes


I've written a script which grabs all URLs from a page and outputs them, and it also takes out all duplicates. I need it to grab just the www.domain.com or domain.net from a link like <a href="www.domain.com">akhajha</a>. If the link is www.domain.org/helllloooooo.html, it needs to trim it to www.domain.org. I've written the program; it's just this regex that is driving me nuts.

 

$url = $_POST['url'];
$banned = array($url); // never list the page's own URL
$html = file_get_contents($url); // fetch the page (the unused fopen() handle was dropped)
$matches = array();
// Matches full http(s) URLs anywhere in the page, path and query string included
$status = preg_match_all('@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)@', $html, $matches);
$unique = array();

foreach ($matches[1] as $match) {
	$found = false;
	if (in_array($match, $banned)) {
		$found = true;
	}

	// Skip anything already collected
	foreach ($unique as $u) {
		if ($u == $match) {
			$found = true;
			break;
		}
	}
	if (!$found) {
		array_push($unique, $match);
	}
}
foreach ($unique as $link) {
	echo $link . "<br />";
}

 

Please help me fix this problem. Thanks.
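
One possible way to approach this, offered as a rough sketch rather than a definitive answer: capture only the quoted href value of each <a> tag, then let PHP's parse_url() pull out the host instead of trying to do everything in a single regex. The function name extract_link_hosts is made up for illustration. The sketch assumes hrefs are quoted, that a scheme-less href such as www.domain.org/page.html is meant as an absolute link, and it skips hrefs that are obviously relative (starting with #, ? or a single /). A bare relative link like page.html would still be misread as a host.

```php
<?php
// Hypothetical helper (not from the original post): returns the unique
// host names found in the href attributes of <a> tags in $html.
function extract_link_hosts(string $html): array
{
    $hosts = array();
    // Capture the quoted href value of each <a> tag; tolerates spaces around "=".
    if (!preg_match_all('@<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']@i', $html, $m)) {
        return $hosts;
    }
    foreach ($m[1] as $href) {
        // Skip fragments, query-only links and root-relative paths.
        if (preg_match('@^(#|\?|/(?!/))@', $href)) {
            continue;
        }
        // parse_url() only reports a host when a scheme is present,
        // so assume http:// for scheme-less links (an assumption of this sketch).
        if (!preg_match('@^https?://@i', $href)) {
            $href = 'http://' . ltrim($href, '/');
        }
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && $host !== '') {
            $hosts[strtolower($host)] = true; // array keys de-duplicate for us
        }
    }
    return array_keys($hosts);
}
```

Called on the fetched page, for example extract_link_hosts($html), this would return hosts such as www.domain.org with the path already trimmed off, so the manual duplicate loop is no longer needed.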

https://forums.phpfreaks.com/topic/77160-preg_mtach_all-url-regex-help-need/

  Quote

Is "www." always a requirement? There are "ww2."s I think, along with a variety of domain endings.

 

All I need it to do is get whatever is inside <a href="whatisinhere">, but trimmed to the domain name, e.g. www.domain.com, hello.php.net or hello.hello.php.net. It can't return www.domain.com/what_ever_else_is_here.html. If it doesn't have the www., that's even better.
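
As a hedged illustration of the "no www." preference: once the host names have been extracted, a leading www. (or a ww2.-style variant) could be dropped before de-duplicating. The helper name strip_www and the sample host list are hypothetical, and the pattern only strips the prefix when at least two labels remain, so a host like www.com is left alone.

```php
<?php
// Hypothetical helper: removes one leading "www." / "ww2."-style label,
// but only if a dotted name (label plus TLD) is left afterwards.
function strip_www(string $host): string
{
    return preg_replace('/^ww[w\d]\.(?=[^.]+\.)/i', '', $host);
}

// Sample data for illustration only.
$hosts = array('www.domain.com', 'domain.com', 'hello.php.net', 'WW2.example.org');

// Strip the prefix, normalise case, then drop duplicates and reindex.
$clean = array_values(array_unique(array_map('strtolower', array_map('strip_www', $hosts))));

foreach ($clean as $host) {
    echo $host . "<br />";
}
```

With the sample list above, www.domain.com and domain.com collapse into a single domain.com entry, while subdomains such as hello.php.net are kept as they are.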

Archived

This topic is now archived and is closed to further replies.
