KingNeil Posted April 17, 2014

I've tried using simple_html_dom to extract URLs from web pages. It works a lot of the time, but not all of the time. For example, it doesn't work on ArsTechnica.com, because that site writes its HTML links differently. One thing I do know is that Firefox gets a perfect list of all the links on a page; that's how you can load a page in Firefox and have every link be clickable. So I was wondering: is it possible to download the open source Firefox browser engine, or Chrome, or whatever, pass it some parameters somehow, and have it give me a list of all the URLs on the page? I could then feed that into PHP by whatever means, whether it's shell_exec() or something else. Is this possible? How do I do it?
Ch0cu3r Posted April 17, 2014

I don't see the problem. The basic example provided with simple_html_dom v1.5 lists all the links for me, for any site:

```php
include 'simplehtmldom_1_5/simple_html_dom.php';

// get all links from the home page of a website
$html = file_get_html('http://www.sitename.com');
foreach ($html->find('a') as $element)
    echo $element->href . ' <br />'; // echo the link
```

This is basic DOM traversal. I don't understand how using an actual web browser's engine would help you.
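One thing worth noting about that snippet: `find('a')` returns each `href` exactly as it appears in the markup, so on sites that use relative links (ArsTechnica among them) you get values like `../page.html` back unresolved. Here is a minimal sketch of resolving such links against the base URL; `resolveRelative` is a hypothetical helper name, and it deliberately ignores ports, userinfo, protocol-relative `//` links, and other edge cases:

```php
<?php
// Resolve a relative href against a base URL (simplified illustration only).
function resolveRelative($base, $href)
{
    if (preg_match('~^https?://~i', $href)) {
        return $href;                                // already absolute
    }
    $parts  = parse_url($base);
    $domain = $parts['scheme'] . '://' . $parts['host'];
    if ($href === '' || $href[0] === '#') {
        return $base . $href;                        // same page / fragment
    }
    if ($href[0] === '/') {
        return $domain . $href;                      // root-relative
    }
    // strip the file name from the base path to get the current directory
    $dir = preg_replace('~/[^/]*$~', '/', isset($parts['path']) ? $parts['path'] : '/');
    // "./" means the current directory
    while (substr($href, 0, 2) === './') {
        $href = substr($href, 2);
    }
    // each "../" climbs one directory
    while (substr($href, 0, 3) === '../') {
        $href = substr($href, 3);
        $dir  = preg_replace('~[^/]+/$~', '', $dir);
    }
    return $domain . $dir . $href;
}

echo resolveRelative('http://example.com/a/b/page.html', '../c/d.html');
// http://example.com/a/c/d.html
```

A production resolver would follow RFC 3986's reference-resolution algorithm in full; this only covers the common cases discussed in this thread.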
QuickOldCar Posted April 19, 2014 Share Posted April 19, 2014 (edited) As in the case for ArsTechnica.com, they use relative paths versus absolute paths for their href links. To sum it up you have to determine what type path they use and fix accordingly by appending the protocol, host and any paths to the beginning I do this with a relative links function I made along with knowing the targeted scrape location to go by. I'll try to explain as simply as can A path is a slash separated list of directory names followed by either a directory name or a file name. A directory is the same as a system folder. relative paths: hash tag # (by itself directs to same page, if a fragment is added usually an anchor tag on the same page......but sites have been using them to navigate entire sites or scripts with them over the years) no identifier (just a directory or file) will append own host same directory / root directory ./ up one directory ../ up two directories ../../ up three directories ../../../ on and on for up directories absolute paths: any url (more correctly called uri) that includes the protocol and host with optional others following it just a few examples, is way too many to list every possible type. http://user:password@subdomain.domain.tld.sld:port/directory/file.ext?query=value#fragment http://subdomain.domain.tld http://subdomain.domain.tld.sld/folder/script.php http://domain.tld/script.php local paths: sometimes linked when are outside of the www directory C:\folder\ C:\folder\file C:/folder/file.ext \\folder\file I wrote this using simple_html_dom with functions to fix some issues Can test it at http://dynaindex.com/link-scrape.php Is more to it than meets the eye, I actually have a much longer and complicated relative links function that does much more, this should get you by. 
```php
<form action="" method="GET">
    <input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
    <input type="submit" value="Get the links" />
    <br />
</form>
<?php
//check if target is set and not a blank value
if (isset($_GET['target']) && trim($_GET['target']) != '') {

    //requires simple_html_dom
    include 'simple_html_dom.php';

    //clean input url
    $target_url = htmlspecialchars(trim($_GET['target']), ENT_QUOTES, 'UTF-8');
    $target_url = filter_var($target_url, FILTER_SANITIZE_URL);

    //check input url for http or https or file_get_contents fails
    if (!preg_match("~^(http|https)://~i", $target_url)) {
        $target_url = "http://" . $target_url;
    }

    echo "<h2>Links from " . rawurldecode($target_url) . "</h2>";

    //parse the host, no protocol returned
    function parseHOST($url)
    {
        $new_parse_url = str_ireplace(array("http://", "https://", "ftp://", "feed://"), "", trim($url));
        $parsedUrl = @parse_url("http://$new_parse_url");
        if (!empty($parsedUrl['host'])) {
            return strtolower(trim($parsedUrl['host']));
        }
        //no host found, fall back to the first path segment
        $path_parts = explode('/', $parsedUrl['path'], 2);
        return strtolower(trim(array_shift($path_parts)));
    }

    //remove relative paths and position correctly
    function removePaths($url, $number_positions = NULL)
    {
        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $positions = explode('/', $trim_path);
        //drop a trailing file name
        if (preg_match("/\./", end($positions))) {
            array_pop($positions);
        }
        //climb the requested number of directories
        if (!is_null($number_positions)) {
            for ($i = 1; $i <= $number_positions; $i++) {
                array_pop($positions);
            }
        }
        $folder_path = "";
        foreach ($positions as $folders) {
            if (!empty($folders)) {
                $folder_path .= "$folders/";
            }
        }
        return $folder_path;
    }

    //fix relative links to absolute links
    function fixRELATIVE($target_url, $url)
    {
        $domain = "http://" . parseHOST($target_url);
        if ($url == "#" || $url == "./") {
            $url = $domain;
        }
        if ($url == "/") {
            $url = $target_url;
        }
        $url = rtrim($url, "/");
        $up_one   = removePaths($target_url, 1);
        $up_two   = removePaths($target_url, 2);
        $up_three = removePaths($target_url, 3);
        $up_four  = removePaths($target_url, 4);
        $up_five  = removePaths($target_url, 5);
        $path = parse_url($target_url, PHP_URL_PATH);
        $full_path = trim($path, '/');
        $explode_path = explode("/", $full_path);
        $last = end($explode_path);
        $fixed_paths = "";
        if (is_array($explode_path)) {
            foreach ($explode_path as $paths) {
                if (!empty($paths) && !preg_match("/\./", $paths)) {
                    $fixed_paths .= "$paths/";
                }
            }
        }
        $fixed_domain = "$domain/$fixed_paths";
        if (substr($url, 0, 1) == "/") {
            $url = ltrim($url, "/");
            $url = "$domain/$url";
        }
        if (substr($url, 0, 1) == "#") {
            $url = "$domain/$full_path$url";
        }
        if (substr($url, 0, 1) == "?") {
            $url = "$domain/$full_path$url";
        }
        if (substr($url, 0, 15) == "../../../../../") {
            $url = str_replace("../../../../../", "", $url);
            $url = "$domain/$up_five$url";
        }
        if (substr($url, 0, 12) == "../../../../") {
            $url = str_replace("../../../../", "", $url);
            $url = "$domain/$up_four$url";
        }
        if (substr($url, 0, 9) == "../../../") {
            $url = str_replace("../../../", "", $url);
            $url = "$domain/$up_three$url";
        }
        if (substr($url, 0, 6) == "../../") {
            $url = str_replace("../../", "", $url);
            $url = "$domain/$up_two$url";
        }
        if (substr($url, 0, 3) == "../") {
            $url = str_replace("../", "", $url);
            $url = "$domain/$up_one$url";
        }
        return $url;
    }

    //using curl and following redirects and responses would be better
    $html = @file_get_html($target_url);
    if (!$html) {
        die("failed to connect");
    }

    $url_array = array();
    foreach ($html->find('a') as $element) {
        $href  = fixRELATIVE($target_url, trim($element->href));
        $title = trim($element->title);
        $text  = trim($element->plaintext);
        //not all hyperlinks contain a title
        if ($title == '') {
            $title = $text;
        }
        //if title is still empty, use the href link
        if ($title == '') {
            $title = $href;
        }
        //create an array associating them
        $url_array[] = array(
            "href"  => $href,
            "title" => $title
        );
    }

    //remove duplicates from array
    $urls = array_map("unserialize", array_unique(array_map("serialize", $url_array)));
    //clear url_array
    $url_array = array();

    //print_r($urls);

    //display the links with titles as hyperlinks
    foreach ($urls as $link) {
        echo "<a href='" . $link['href'] . "' target='_blank'>" . $link['title'] . "</a><br />";
    }
}
?>
```

Don't forget to make everything safe if you're saving to a database.

Edited April 19, 2014 by QuickOldCar
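On that last point about making things safe for the database: a minimal sketch using PDO prepared statements, so scraped values can't break (or inject into) the SQL. The `links` table schema, the sample `$urls` data, and the SQLite in-memory DSN are illustrative assumptions; swap in your real connection and schema:

```php
<?php
// Example data in the shape produced by the scraper loop above
$urls = array(
    array('href' => 'http://example.com/', 'title' => 'Example'),
);

// SQLite in-memory DB for illustration; use your real DSN in practice
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (href TEXT, title TEXT)');

// Bound parameters are sent separately from the SQL text
$stmt = $pdo->prepare('INSERT INTO links (href, title) VALUES (:href, :title)');
foreach ($urls as $link) {
    $stmt->execute(array(':href' => $link['href'], ':title' => $link['title']));
}

echo $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn(); // prints 1
```

Prepared statements replace manual escaping entirely for the SQL layer; you still want `htmlspecialchars()` when echoing stored values back into a page.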