KingNeil Posted April 17, 2014

I've tried using simple_html_dom to extract URLs from web pages. It works a lot of the time, but not all of the time. For example, it doesn't work on ArsTechnica.com, because that site writes its HTML links differently. One thing I do know is that Firefox gets a perfect list of all the links on a page; that's how you can load a page in Firefox and have every link be clickable. So I was wondering: is it possible to download the open source Firefox browser engine, or Chrome, or whatever, pass it some parameters somehow, and have it give me a list of all the URLs on the page? I could then feed that into PHP by whatever means, whether it's shell_exec() or something else. Is this possible? How do I do it?
Ch0cu3r Posted April 17, 2014

I don't see the problem. The basic example provided with simple_html_dom v1.5 lists all the links for me, for any site:

```php
include 'simplehtmldom_1_5/simple_html_dom.php';

// get all links from the home page of a website
$html = file_get_html('http://www.sitename.com');
foreach ($html->find('a') as $element)
    echo $element->href . ' <br />'; // echo the link
```

This is basic DOM traversal. I don't understand how using an actual web browser's engine would help you.
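One thing worth noting about that snippet: `find('a')` returns each `href` exactly as it appears in the markup, so on sites that use relative links (ArsTechnica among them) you get values like `../page.html` back unresolved. Here is a minimal sketch of resolving such links against the base URL; `resolveRelative` is a hypothetical helper name, and it deliberately ignores ports, userinfo, protocol-relative `//` links, and other edge cases:

```php
<?php
// Resolve a relative href against a base URL (simplified illustration only).
function resolveRelative($base, $href)
{
    if (preg_match('~^https?://~i', $href)) {
        return $href;                                // already absolute
    }
    $parts  = parse_url($base);
    $domain = $parts['scheme'] . '://' . $parts['host'];
    if ($href === '' || $href[0] === '#') {
        return $base . $href;                        // same page / fragment
    }
    if ($href[0] === '/') {
        return $domain . $href;                      // root-relative
    }
    // strip the file name from the base path to get the current directory
    $dir = preg_replace('~/[^/]*$~', '/', isset($parts['path']) ? $parts['path'] : '/');
    // "./" means the current directory
    while (substr($href, 0, 2) === './') {
        $href = substr($href, 2);
    }
    // each "../" climbs one directory
    while (substr($href, 0, 3) === '../') {
        $href = substr($href, 3);
        $dir  = preg_replace('~[^/]+/$~', '', $dir);
    }
    return $domain . $dir . $href;
}

echo resolveRelative('http://example.com/a/b/page.html', '../c/d.html');
// http://example.com/a/c/d.html
```

A production resolver would follow RFC 3986's reference-resolution algorithm in full; this only covers the common cases discussed in this thread.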
QuickOldCar Posted April 19, 2014 Share Posted April 19, 2014 (edited) As in the case for ArsTechnica.com, they use relative paths versus absolute paths for their href links. To sum it up you have to determine what type path they use and fix accordingly by appending the protocol, host and any paths to the beginning I do this with a relative links function I made along with knowing the targeted scrape location to go by. I'll try to explain as simply as can A path is a slash separated list of directory names followed by either a directory name or a file name. A directory is the same as a system folder. relative paths: hash tag # (by itself directs to same page, if a fragment is added usually an anchor tag on the same page......but sites have been using them to navigate entire sites or scripts with them over the years) no identifier (just a directory or file) will append own host same directory / root directory ./ up one directory ../ up two directories ../../ up three directories ../../../ on and on for up directories absolute paths: any url (more correctly called uri) that includes the protocol and host with optional others following it just a few examples, is way too many to list every possible type. http://user:password@subdomain.domain.tld.sld:port/directory/file.ext?query=value#fragment http://subdomain.domain.tld http://subdomain.domain.tld.sld/folder/script.php http://domain.tld/script.php local paths: sometimes linked when are outside of the www directory C:\folder\ C:\folder\file C:/folder/file.ext \\folder\file I wrote this using simple_html_dom with functions to fix some issues Can test it at http://dynaindex.com/link-scrape.php Is more to it than meets the eye, I actually have a much longer and complicated relative links function that does much more, this should get you by. 
```php
<form action="" method="GET">
    <input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
    <input type="submit" value="Get the links" />
    <br />
</form>
<?php
//check if target is set and not a blank value
if (isset($_GET['target']) && trim($_GET['target']) != '') {

    //requires simple_html_dom
    include 'simple_html_dom.php';

    //clean input url
    $target_url = htmlspecialchars(trim($_GET['target']), ENT_QUOTES, 'UTF-8');
    $target_url = filter_var($target_url, FILTER_SANITIZE_URL);

    //check input url for http or https or file_get_contents fails
    if (!preg_match("~^(http|https)://~i", $target_url)) {
        $target_url = "http://" . $target_url;
    }

    echo "<h2>Links from " . rawurldecode($target_url) . "</h2>";

    //parse the host, no protocol returned
    function parseHOST($url)
    {
        $new_parse_url = str_ireplace(array("http://", "https://", "ftp://", "feed://"), "", trim($url));
        $parsedUrl = @parse_url("http://$new_parse_url");
        if (!empty($parsedUrl['host'])) {
            return strtolower(trim($parsedUrl['host']));
        }
        //no host found, fall back to the first path segment
        $path_parts = explode('/', $parsedUrl['path'], 2);
        return strtolower(trim(array_shift($path_parts)));
    }

    //remove relative paths and position correctly
    function removePaths($url, $number_positions = NULL)
    {
        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $positions = explode('/', $trim_path);
        //drop a trailing file name
        if (preg_match("/\./", end($positions))) {
            array_pop($positions);
        }
        //climb the requested number of directories
        if (!is_null($number_positions)) {
            for ($i = 1; $i <= $number_positions; $i++) {
                array_pop($positions);
            }
        }
        $folder_path = "";
        foreach ($positions as $folders) {
            if (!empty($folders)) {
                $folder_path .= "$folders/";
            }
        }
        return $folder_path;
    }

    //fix relative links to absolute links
    function fixRELATIVE($target_url, $url)
    {
        $domain = "http://" . parseHOST($target_url);
        if ($url == "#" || $url == "./") {
            $url = $domain;
        }
        if ($url == "/") {
            $url = $target_url;
        }
        $url = rtrim($url, "/");
        $up_one   = removePaths($target_url, 1);
        $up_two   = removePaths($target_url, 2);
        $up_three = removePaths($target_url, 3);
        $up_four  = removePaths($target_url, 4);
        $up_five  = removePaths($target_url, 5);
        $path = parse_url($target_url, PHP_URL_PATH);
        $full_path = trim($path, '/');
        $explode_path = explode("/", $full_path);
        $last = end($explode_path);
        $fixed_paths = "";
        if (is_array($explode_path)) {
            foreach ($explode_path as $paths) {
                if (!empty($paths) && !preg_match("/\./", $paths)) {
                    $fixed_paths .= "$paths/";
                }
            }
        }
        $fixed_domain = "$domain/$fixed_paths";
        if (substr($url, 0, 1) == "/") {
            $url = ltrim($url, "/");
            $url = "$domain/$url";
        }
        if (substr($url, 0, 1) == "#") {
            $url = "$domain/$full_path$url";
        }
        if (substr($url, 0, 1) == "?") {
            $url = "$domain/$full_path$url";
        }
        if (substr($url, 0, 15) == "../../../../../") {
            $url = str_replace("../../../../../", "", $url);
            $url = "$domain/$up_five$url";
        }
        if (substr($url, 0, 12) == "../../../../") {
            $url = str_replace("../../../../", "", $url);
            $url = "$domain/$up_four$url";
        }
        if (substr($url, 0, 9) == "../../../") {
            $url = str_replace("../../../", "", $url);
            $url = "$domain/$up_three$url";
        }
        if (substr($url, 0, 6) == "../../") {
            $url = str_replace("../../", "", $url);
            $url = "$domain/$up_two$url";
        }
        if (substr($url, 0, 3) == "../") {
            $url = str_replace("../", "", $url);
            $url = "$domain/$up_one$url";
        }
        return $url;
    }

    //using curl and following redirects and responses would be better
    $html = @file_get_html($target_url);
    if (!$html) {
        die("failed to connect");
    }

    $url_array = array();
    foreach ($html->find('a') as $element) {
        $href  = fixRELATIVE($target_url, trim($element->href));
        $title = trim($element->title);
        $text  = trim($element->plaintext);
        //not all hyperlinks contain a title
        if ($title == '') {
            $title = $text;
        }
        //if title is still empty, use the href link
        if ($title == '') {
            $title = $href;
        }
        //create an array associating them
        $url_array[] = array(
            "href"  => $href,
            "title" => $title
        );
    }

    //remove duplicates from array
    $urls = array_map("unserialize", array_unique(array_map("serialize", $url_array)));
    //clear url_array
    $url_array = array();

    //print_r($urls);

    //display the links with titles as hyperlinks
    foreach ($urls as $link) {
        echo "<a href='" . $link['href'] . "' target='_blank'>" . $link['title'] . "</a><br />";
    }
}
?>
```

Don't forget to make everything safe if you're saving to a database.

Edited April 19, 2014 by QuickOldCar
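On that last point about making things safe for the database: a minimal sketch using PDO prepared statements, so scraped values can't break (or inject into) the SQL. The `links` table schema, the sample `$urls` data, and the SQLite in-memory DSN are illustrative assumptions; swap in your real connection and schema:

```php
<?php
// Example data in the shape produced by the scraper loop above
$urls = array(
    array('href' => 'http://example.com/', 'title' => 'Example'),
);

// SQLite in-memory DB for illustration; use your real DSN in practice
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (href TEXT, title TEXT)');

// Bound parameters are sent separately from the SQL text
$stmt = $pdo->prepare('INSERT INTO links (href, title) VALUES (:href, :title)');
foreach ($urls as $link) {
    $stmt->execute(array(':href' => $link['href'], ':title' => $link['title']));
}

echo $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn(); // prints 1
```

Prepared statements replace manual escaping entirely for the SQL layer; you still want `htmlspecialchars()` when echoing stored values back into a page.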