seb hughes Posted November 13, 2007

I've written a script which grabs all URLs from a page, outputs them, and removes duplicates. From a link like <a href="www.domain.com">akhajha</a> I need it to grab just the www.domain.com (or domain.net). If the link is www.domain.org/helllloooooo.html it needs to trim it to www.domain.org. I've written the program; it's just this regex that's driving me nuts.

    $url = $_POST['url'];
    $banned = array($url);
    $html = file_get_contents($url);
    $matches = array();
    $status = preg_match_all('@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)@', $html, $matches);
    $unique = array();
    foreach ($matches[1] as $match) {
        $found = false;
        if (in_array($match, $banned)) {
            $found = true;
        }
        foreach ($unique as $u) {
            if ($u == $match) {
                $found = true;
                break;
            }
        }
        if (!$found) {
            array_push($unique, $match);
        }
    }
    foreach ($unique as $link) {
        echo $link . "<br />";
    }

Please help me fix this problem. Thanks.
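A side note on the duplicate check above: the inner foreach repeats what in_array already does, so the whole loop can shrink to the following minimal sketch, assuming the same $matches and $banned variables as in the script:

    $unique = array();
    foreach ($matches[1] as $match) {
        // in_array covers both the banned-URL check and the already-seen check.
        if (!in_array($match, $banned) && !in_array($match, $unique)) {
            $unique[] = $match;
        }
    }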
effigy Posted November 13, 2007

    %href=[\'"]?(?:https?://)?([^\s/\'"]+)%
seb hughes Posted November 13, 2007

Quote: %href=[\'"]?(?:https?://)?([^\s/\'"]+)%

    $status = preg_match_all(%href=[\'"]?(?:https?://)?([^\s/\'"]+)%, $html, $matches);

I get a parse error.
effigy Posted November 13, 2007

Patterns are strings; single-quote it.
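In other words, the call from the previous post becomes the line below. The \' escapes survive unchanged inside a single-quoted PHP string, so the pattern itself needs no edits:

    $status = preg_match_all('%href=[\'"]?(?:https?://)?([^\s/\'"]+)%', $html, $matches);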
seb hughes Posted November 13, 2007

I got it to work, but it shows page names like image.php?dfdf=dfsdkfdkfkd, which it shouldn't do.
effigy Posted November 13, 2007

Is "www." always a requirement? There are "ww2."s I think, along with a variety of domain endings.
seb hughes Posted November 13, 2007

Quote: Is "www." always a requirement? There are "ww2."s I think, along with a variety of domain endings.

All I need it to do is get what's in between <a href="whatisinhere">, but it has to trim it to the domain name, so www.domain.com, hello.php.net, or hwello.hello.php.net. It can't have www.domain.com/what_ever_else_is_here.html. If it doesn't have the www., even better.
seb hughes Posted November 13, 2007

How can I get just domain.tld from an "a href" tag? This has been driving me crazy for DAYS.
effigy Posted November 13, 2007

The challenge is that domain extensions can look like file extensions. Is "image.php" a domain? No; we can see that, but the computer cannot unless we give it specific direction. Here are just a few examples.
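Building on that point, here is one possible sketch, not a drop-in answer: extract the href values first, then let PHP's parse_url() pull out the host, and compare the host's last label against an allow-list of endings. The allow-list below is illustrative only; a real one would need far more entries, per the examples effigy mentions.

    <?php
    // Sketch: pull href values out of anchor tags, then keep only the host part.
    // Assumes $html already holds the fetched page, as in the original script.
    preg_match_all('%<a[^>]+href\s*=\s*[\'"]([^\'"]+)[\'"]%i', $html, $m);

    // Illustrative allow-list of domain endings; a real one would be much longer.
    $endings = array('com', 'net', 'org', 'info', 'uk');

    $hosts = array();
    foreach ($m[1] as $href) {
        // parse_url() only reports a host when a scheme is present, so
        // prepend one for scheme-less links like "www.domain.com/page.html".
        if (!preg_match('%^https?://%i', $href)) {
            $href = 'http://' . $href;
        }
        $host = parse_url($href, PHP_URL_HOST);
        if (!$host) {
            continue;
        }
        // Compare the last label against the allow-list, so relative links
        // such as "image.php?dfdf=dfsdkfdkfkd" are skipped.
        $last = strtolower(substr(strrchr($host, '.'), 1));
        if (in_array($last, $endings)) {
            $hosts[] = $host;
        }
    }

    foreach (array_unique($hosts) as $host) {
        echo $host . "<br />";
    }
    ?>

This keeps the full host (so hello.php.net stays hello.php.net, path and query trimmed) and drops anything whose ending is not on the list.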