
Use web browser engine to extract URLs from a page?


KingNeil


So... I've tried using simple_html_dom in order to extract URLs from web pages.

 

And it works a lot of the time, but not all of the time.

 

For example, it doesn't work on the website ArsTechnica.com, because it uses URLs differently in its HTML.

 

So... one thing I do know is that Firefox gets a complete list of all links on a page; that's how you can load up a web page in Firefox and have every link be clickable.

 

And so I was wondering: is it possible to download the open source Firefox browser engine, or Chrome, or whatever, pass some parameters to it somehow, and have it give me a list of all URLs on the page?

 

I can then feed that into PHP by whatever means, whether it's shell_exec() or something else.

 

Is this possible? How do I do it?


I don't get your problem. The basic example provided by simple_html_dom v1.5 lists all links for me on any site:

include 'simplehtmldom_1_5/simple_html_dom.php';

// get all links from the home page of a website
$html = file_get_html('http://www.sitename.com');

foreach ($html->find('a') as $element) {
    echo $element->href . ' <br />'; // echo the link
}

This is basic DOM traversal.

 

I don't understand how using an actual web browser's engine would help you.


In the case of ArsTechnica.com, they use relative paths rather than absolute paths for their href links.

To sum it up, you have to determine what type of path they use and fix it accordingly by prepending the protocol, host, and any intermediate paths.

 

I do this with a relative-links function I made, along with knowing the targeted scrape location to resolve against.

 

 

I'll try to explain as simply as I can (there's a short resolution sketch after the path examples below).

A path is a slash-separated list of directory names, ending in either a directory name or a file name.

A directory is the same as a system folder.

 

 

relative paths:

hash tag # (by itself it points to the same page; with a fragment added it usually jumps to an anchor on the same page, though sites have been using fragments to navigate entire sites or drive scripts over the years)

no identifier (just a directory or file name) resolves against the current host and directory

site root /

same (current) directory ./

up one directory ../

up two directories ../../

up three directories ../../../

and so on for going further up

 

absolute paths:

any URL (more correctly, a URI) that includes the protocol and host, with optional parts following them

 

Just a few examples; there are far too many to list every possible type.

http://user:password@subdomain.domain.tld.sld:port/directory/file.ext?query=value#fragment

http://subdomain.domain.tld

http://subdomain.domain.tld.sld/folder/script.php

http://domain.tld/script.php

 

local paths:

sometimes linked when files are outside of the www directory

C:\folder\

C:\folder\file

C:/folder/file.ext

\\folder\file
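To make the relative forms above concrete, here's a minimal sketch of how a few of them resolve against one example base URL. The base URL and the resolveAgainst() helper are made up for illustration only; the full script further down handles more of the edge cases.

<?php
// Minimal illustration only: resolve a handful of common relative forms
// against a hypothetical base URL using parse_url() and dirname().
$base = 'http://example.com/articles/2013/index.php';

function resolveAgainst($base, $relative)
{
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    $dir    = rtrim(dirname($parts['path']), '/'); // /articles/2013

    if ($relative === '' || $relative[0] === '#') { // fragment: same page
        return $base . $relative;
    }
    if ($relative[0] === '/') { // leading slash: site root
        return $origin . $relative;
    }
    while (substr($relative, 0, 3) === '../') { // climb one directory per ../
        $relative = substr($relative, 3);
        $dir      = rtrim(dirname($dir), '/');
    }
    if (substr($relative, 0, 2) === './') { // same directory
        $relative = substr($relative, 2);
    }
    return $origin . $dir . '/' . $relative;
}

echo resolveAgainst($base, '#comments') . "\n";    // http://example.com/articles/2013/index.php#comments
echo resolveAgainst($base, 'page2.php') . "\n";    // http://example.com/articles/2013/page2.php
echo resolveAgainst($base, '/about') . "\n";       // http://example.com/about
echo resolveAgainst($base, '../index.php') . "\n"; // http://example.com/articles/index.php

The scraper below does the same kind of fixing, just with more cases covered and tied into simple_html_dom.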

 

I wrote this using simple_html_dom, with functions to fix some of these issues.

You can test it at http://dynaindex.com/link-scrape.php

 

There is more to it than meets the eye; I actually have a much longer and more complicated relative-links function that does much more, but this should get you by.

<form action="" method="GET">
<input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
<input type="submit" value="Get the links" />
<br />
</form>
<?php
//check if target is set and not a blank value
if (isset($_GET['target']) && trim($_GET['target']) != '') {
   
    //requires simple_html_dom
    include 'simple_html_dom.php';
   
    //clean input url
    $target_url = htmlspecialchars(trim($_GET['target']), ENT_QUOTES, 'UTF-8');
    $target_url = filter_var($target_url, FILTER_SANITIZE_URL);
   
    //check input url for http or https or file_get_contents fails
    if (!preg_match("~^(http|https)://~i", $target_url)) {
        $target_url = "http://" . $target_url;
    }
   
    echo "<h2>Links from " . rawurldecode($target_url) . "</h2>";
   
    //parse the host, no protocol returned
    function parseHOST($url)
    {
        //strip any protocol prefix before parsing
        $new_parse_url = str_ireplace(array(
            "http://",
            "https://",
            "ftp://",
            "feed://"
        ), "", trim($url));
        $parsedUrl = @parse_url("http://$new_parse_url");
        if (!empty($parsedUrl['host'])) {
            return strtolower(trim($parsedUrl['host']));
        }
        //fall back to the first path segment when no host was parsed
        $path_parts = explode('/', $parsedUrl['path'], 2);
        return strtolower(trim(array_shift($path_parts)));
    }
   
    //remove relative paths and position correctly
    function removePaths($url, $number_positions = NULL)
    {
       
        $path      = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $folder_path = "";
        $positions = explode('/', $trim_path);
        if (preg_match("/\./", end($positions))) {
            array_pop($positions);
        }
        if (!is_null($number_positions)) {
            for ($i = 1; $i <= $number_positions; $i++) {
                array_pop($positions);
            }
        }
        foreach ($positions as $folders) {
            if (!empty($folders)) {
                $folder_path .= "$folders/";
            }
           
        }
       
        return $folder_path;
    }
   
    //fix relative links to absolute links
    function fixRELATIVE($target_url, $url)
    {
        $domain = "http://" . parseHOST($target_url);
       
        if ($url == "#" || $url == "./") {
            $url = $domain;
        }
       
        if ($url == "/") {
            $url = $target_url;
        }
       
        $url = rtrim($url, "/");
       
       
       
        $up_one       = removePaths($target_url, 1);
        $up_two       = removePaths($target_url, 2);
        $up_three     = removePaths($target_url, 3);
        $up_four      = removePaths($target_url, 4);
        $up_five      = removePaths($target_url, 5);
        $path         = parse_url($target_url, PHP_URL_PATH);
        $full_path    = trim($path, '/');
        $explode_path = explode("/", $full_path);
        $last         = end($explode_path);
        $fixed_paths  = "";
        if (is_array($explode_path)) {
            foreach ($explode_path as $paths) {
                if (!empty($paths) && !preg_match("/\./", $paths)) {
                    $fixed_paths .= "$paths/";
                }
            }
        }
        $fixed_domain = "$domain/$fixed_paths";
       
       
        if (substr($url, 0, 1) == "/") {
            $url = ltrim($url, "/");
            $url = "$domain/$url";
        }
       
        if (substr($url, 0, 1) == "#") {
            $url = "$domain/$full_path$url";
        }
       
        if (substr($url, 0, 1) == "?") {
            $url = "$domain/$full_path$url";
        }
       
        if (substr($url, 0, 15) == "../../../../../") {
            $url = str_replace("../../../../../", "", $url);
            $url = "$domain/$up_five$url";
        }
       
        if (substr($url, 0, 12) == "../../../../") {
            $url = str_replace("../../../../", "", $url);
            $url = "$domain/$up_four$url";
        }
       
        if (substr($url, 0, 9) == "../../../") {
            $url = str_replace("../../../", "", $url);
            $url = "$domain/$up_three$url";
        }
       
        if (substr($url, 0, 6) == "../../") {
            $url = str_replace("../../", "", $url);
            $url = "$domain/$up_two$url";
        }
       
        if (substr($url, 0, 3) == "../") {
            $url = str_replace("../", "", $url);
            $url = "$domain/$up_one$url";
        }
       
       
        return $url;
    }
   
    //using curl and following redirects and responses would be better (see the cURL sketch after this script)
    $html = @file_get_html($target_url);
    if (!$html) {
        die("failed to connect");
    }
   
    $url_array = array();
   
    foreach ($html->find('a') as $element) {
        $href = fixRELATIVE($target_url, trim($element->href));
       
        $title = trim($element->title);
       
        $text = trim($element->plaintext);
       
        //not all hyperlinks contain a title
       
        if ($title == '') {
           
            $title = $text;
           
        }
       
        //if title is still empty, use the href link
       
        if ($title == '') {
           
            $title = $href;
           
        }
       
        //create an array associating them
        $url_array[] = array(
            "href" => $href,
            "title" => $title
        );
       
    }
   
//define urls array
    $urls = array();
   
    //remove duplicates from array
    $urls = array_map("unserialize", array_unique(array_map("serialize", $url_array)));

//clear url_array
    $url_array = array();

    //print_r($urls);
   
    //display the links with titles as hyperlinks
    foreach ($urls as $link) {
        echo "<a href='" . $link['href'] . " 'target='_blank'>" . $link['title'] . "</a><br />";
    }
   
}
?>
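
The comment near the end of that script mentions that cURL, with redirect handling, would be better than a bare file_get_html() call. Here's a minimal sketch of that swap, assuming the cURL extension is available: fetch the page body with cURL first, then hand the string to simple_html_dom's str_get_html(). The fetchHTML() helper name is just something I made up for this example.

<?php
// Sketch only: fetch with cURL (follows redirects, sets timeouts and a user agent),
// then parse the returned HTML string with str_get_html().
include 'simple_html_dom.php';

function fetchHTML($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow 301/302 redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 20,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; link scraper)'
    ));
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($body !== false && $code == 200) ? $body : false;
}

$body = fetchHTML('http://www.sitename.com');
if ($body === false) {
    die("failed to connect");
}
$html = str_get_html($body); // the same $html->find('a') loop as above works from here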

Don't forget to make everything safe if you're saving to a database.
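
If you do save the scraped links, a prepared statement is the simple way to keep the href and title values safe. A minimal sketch with PDO; the DSN, credentials, and the links table here are hypothetical, and $urls is the de-duplicated array built in the script above.

<?php
// Sketch only: connection details and the `links` table are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO links (href, title) VALUES (:href, :title)');

// $urls is the de-duplicated array of href/title pairs from the scraper above
foreach ($urls as $link) {
    $stmt->execute(array(
        ':href'  => $link['href'],
        ':title' => $link['title']
    ));
}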

