How to read a sitemap using PHP


Help!php

First of all, I have no idea how to do this, but I know it can be done. A little help would mean a lot to me.

 

Thank you in advance.

 

So I want to go through a sitemap, visit each of the links on the sitemap, and save the URLs to a database.

 

Example:

 

HP 5550N A3 Colour Laser Printer
Konica Minolta PagePro 1350W A4 Mono Laser Printer
HP 9050dn A3 Mono Laser Printer
HP 5550DTN A3 Colour Laser Printer
HP 5550HDN A3 Colour Laser Printer
HP 5550DN A3 Colour Laser Printer

Let's say this was the sitemap and it just continues with other products like this. I want to write code that goes to the first link, saves the product URL to the database, and keeps doing the same until the last link, which in my example would be the HP 5550DN A3 Colour Laser Printer.

 

Any help? Any ideas?

 

I am not asking anyone to write the code for me; I just need some help and a good direction. :)

 


For something simple, use SimpleXML's simplexml_load_file():

 

<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if(isset($_GET['url']) && $_GET['url'] != ''){
$url = trim($_GET['url']);
$xml = simplexml_load_file($url);

$url_array = array();
foreach ($xml->url as $url_list) {
    $url_array[] = (string) $url_list->loc;//cast the SimpleXMLElement to a plain string
}

//display the array
foreach($url_array as $urls){
echo "<a href='$urls'>$urls</a><br />";
}

} else {
echo "No xml url inserted";
}
?>

Make the PHP file, then call it from the address bar with something like http://mysite.com/script.php?url=http://www.phpfreaks.com/sitemap.xml

Now that you have the array full of URLs, you can insert them into a database however you want: one row per URL with an AUTO_INCREMENT id, or serialize/implode the array into a single field per site's XML file.
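To make the database step concrete, here is a minimal sketch of inserting a scraped URL array with PDO prepared statements. The table name (sitemap_urls) is made up for this example, and an in-memory SQLite database is used so it runs without a server; for MySQL you would swap the DSN for something like "mysql:host=localhost;dbname=mydb" plus credentials.

```php
<?php
// Sketch only: insert a url array into a database with PDO.
// Assumptions: pdo_sqlite is available; table/column names are hypothetical.
$url_array = array(
    "http://www.phpfreaks.com",
    "http://www.phpfreaks.com/forums",
);

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// id INTEGER PRIMARY KEY autoincrements in SQLite (AUTO_INCREMENT in MySQL)
$pdo->exec("CREATE TABLE sitemap_urls (id INTEGER PRIMARY KEY, url TEXT)");

$stmt = $pdo->prepare("INSERT INTO sitemap_urls (url) VALUES (?)");
foreach ($url_array as $url) {
    $stmt->execute(array($url)); // the prepared statement escapes the value for us
}

echo $pdo->query("SELECT COUNT(*) FROM sitemap_urls")->fetchColumn(); // prints 2
```

Using a prepared statement also takes care of the escaping mentioned later in the thread, so no manual filtering of the URL strings is needed before the insert.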

 

To grab the titles you can use cURL or file_get_contents().

 

I have a script that obtains the titles of websites; it's actually not that easy to get them from every site. Been there.
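One reason title-grabbing is fiddly is that a regex like /&lt;title&gt;(.*)&lt;\/title&gt;/ breaks on attributes or odd markup. A sketch of a more tolerant approach using DOMDocument (this is not the poster's script, just one alternative; the helper name grabTitle is made up):

```php
<?php
// Sketch: extract <title> with DOMDocument instead of a regex.
function grabTitle($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from real-world broken markup
    $nodes = $dom->getElementsByTagName('title');
    if ($nodes->length > 0) {
        return trim($nodes->item(0)->textContent);
    }
    return ''; // no <title> found; caller can fall back to the URL
}

echo grabTitle('<html><head><title> PHP Freaks </title></head><body></body></html>');
// prints "PHP Freaks"
```

This tolerates attributes on the tag and malformed markup far better than pattern matching, at the cost of parsing the whole document.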


I had a little time while waiting for a friend, so I decided to expand on this for you.

 

I used a stream context and file_get_contents(); as I said, you could also use cURL. It checks for valid XML, and it also grabs the title, description and keywords, returning everything as arrays.

 

You can work out any filtering or escaping before the MySQL inserts.

 

<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if(isset($_GET['url']) && $_GET['url'] != ''){
$url = trim($_GET['url']);
$xml = @simplexml_load_file($url);
if($xml === FALSE) {
die('not a valid xml string');
} else {

foreach ($xml->url as $url_list) {
    $urls = (string) $url_list->loc;//cast the SimpleXMLElement to a plain string
    if (substr($urls, 0, 4) != "http") {
$urls = "http://$urls";
}

$context = stream_context_create(array('http' => array('timeout' => 1)));
$str = file_get_contents($urls, 0, $context);

if(!$str){
die("Unable to connect");
}

$tags = get_meta_tags($urls);
preg_match("/<title>(.*)<\/title>/Umis", $str, $title); 
preg_match("/<head>(.*)<\/head>/is", $str, $head);

$title = isset($title[1]) ? trim($title[1]) : '';
if($title == ''){
$title = $urls;
}

$description = isset($tags['description']) ? $tags['description'] : '';
$keywords = isset($tags['keywords']) ? $tags['keywords'] : '';

//make each value an array entry; the url is cast so it's a plain string, not a SimpleXMLElement
$urls_array[] = array("url"=>(string)$urls,"title"=>$title,"description"=>$description,"keywords"=>$keywords);

}//end loop

//see the array
//print_r($urls_array);

//display the array
echo "<hr>";
foreach($urls_array as $url_value){
echo "<a href=".$url_value['url'].">".$url_value['title']."</a><br />";
echo $url_value['description']."<br />";
echo $url_value['keywords']."<br />";
echo "<hr>";
}
}
} else {
echo "No xml url inserted";
}
?>

 

It will return an array like this for the phpfreaks sitemap:

Array ( [0] => Array ( [url] => http://www.phpfreaks.com [title] => PHP Freaks - PHP Help Index [description] => PHP Freaks is a website dedicated to learning and teaching PHP. Here you will find a forum consisting of 128,486 members who have posted a total of 1,330,567 posts on the forums. Additionally, we have tutorials covering various aspects of PHP and you will find news syndicated from other websites so you can stay up-to-date. Along with the tutorials, the developers on the forum will be able to help you with your scripts, or you may perhaps share your knowledge so others can learn from you. [keywords] => php help, php forums, php tutorials, php tutorial, php news, php snippets, php, help, news, resources, news, snippets, tutorials, web development, programming ) [1] => Array ( [url] => http://www.phpfreaks.com/forums [title] => PHP Freaks Forums - Index [description] => PHP Freaks Forums - Index [keywords] => php, tutorials, help, tutorial, forum, free, resources, advice, oop, design ) [2] => Array ( [url] => http://www.phpfreaks.com/tutorials [title] => PHP Freaks - Tutorials [description] => Free tutorials on various PHP subjects covering basic to advanced principles. [keywords] => ) [3] => Array ( [url] => http://www.phpfreaks.com/blogs [title] => PHP Freaks - Blog posts [description] => [keywords] => ) [4] => Array ( [url] => http://www.phpfreaks.com/news [title] => PHP Freaks - PHP News [description] => [keywords] => ) ) 

 


Thank you for helping.

 

I have tried your code, and it works for the phpfreaks.com sitemap but not for the website I want to try.

 

I don't need the titles; I just want the URLs for these titles.

 

Let's say, for example, http://www.php.net/sitemap.php

 

You know how they have different links, e.g. Homepage, News Archives, etc. I want each of the URLs to be printed so I can import them into the database.

So for the homepage it would be http://www.php.net/index.php

 


So you want to get the href links from normal pages as well as XML pages.

 

I could just respond by saying use DOM, simple_html_dom, or connect to a page and pattern-match href links with regex, but instead I'll let you have this link scraper script that I wrote.

 

The code below uses SimpleXML if the page is XML, and cURL plus DOM if it's not, to find the URLs; most of the code is functions for validating URLs and fixing relative links.

 

demo of this code:

http://dynainternet.com/test/grab-links.php

 

The site you provided as an example (non-XML):

http://dynainternet.com/test/grab-links.php?target=http://www.php.net/sitemap.php

 

And an XML example:

http://dynainternet.com/test/grab-links.php?target=http://www.phpfreaks.com/sitemap.xml

 

 

<html>
<head>
<title>Link scraper</title>
<meta name="description" content="Scrape the href links from the body section of a page" />
<meta name="keywords" content="scrape link, scrape urls,links,url,urls,fetch url,grab link" />
<meta name="author" content="Jay Przekop - dynainternet.com" />
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<style type="text/css">
#content {
  width: 800px ;
  margin-left: auto ;
  margin-right: auto ;
}
</style>
</head>
<body>
<div id="content">
<form action="" method="GET">
<input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
<input type="submit" value="Get the links" />
<br />
</form>

<?php
if(isset($_GET['target']) && $_GET['target'] != ''){

$target_url = trim($_GET['target']);
echo "<h2>Links from ".htmlentities(urldecode($target_url))."</h2>";

$userAgent = 'Linkhunter/1.0 (http://dynainternet.com/test/grab-links.php)';


//replace hxxp function
function replaceHxxp($url){
$url = str_ireplace(array("hxxps://xxx.","hxxps://","hxxp://xxx.","hxxp://"), array("https://www.","https://","http://www.","http://"), trim($url)); 
return $url;
}

//parse the host, no http:// returned
function parseHOST($url){
$new_parse_url = str_ireplace(array("http://","https://", "http://", "ftp://", "feed://"), "", trim($url));
$parsedUrl = @parse_url("http://$new_parse_url");
return strtolower(trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2))));
}

function removePaths($url,$number_positions=NULL) {

        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $folder_path = "";
        $positions = explode('/', $trim_path);
        if(preg_match("/\./",end($positions))) {
        array_pop($positions);
        }
        if(!is_null($number_positions)){
        for ($i = 1; $i <= $number_positions; $i++) {
        array_pop($positions);
        }
        }
        foreach($positions as $folders){
        if(!empty($folders)){
        $folder_path .= "$folders/";
        }
        
        }
        
        return $folder_path;
}

//check relative and fix
function fixRELATIVE($target_url,$url) {
$url = replaceHxxp($url);
$domain = parseHOST($target_url);
$ip_check = parse_url($url, PHP_URL_HOST);
$up_one = removePaths($target_url,1);
$up_two = removePaths($target_url,2);
$up_three = removePaths($target_url,3);
$up_four = removePaths($target_url,4);
$up_five = removePaths($target_url,5);
$path = parse_url($target_url, PHP_URL_PATH);
$full_path = trim($path, '/');
$explode_path = explode("/", $full_path);
$last = end($explode_path);
//echo "last path: $last<br />";
$fixed_paths = "";
if(is_array($explode_path)){
foreach($explode_path as $paths){
if(!empty($paths) && !preg_match("/\./",$paths)){
$fixed_paths .= "$paths/";
}
}
}
$fixed_domain = "$domain/$fixed_paths";

//echo "Target: $target_url<br />";
//echo "Original: $url<br />";

$domain_array = array(".ac",".ac.cn",".ac.ae",".ad",".ae",".aero",".af",".ag",".agent",".ah.cn",".ai",".ak.us",".al",".al.us",".am",".an",".ao",".aq",".ar",".ar.us",".arpa",".arts",".as",".asia",".at",".au",".au.com",".auction",".aw",".ax",".az",".az.us",
".b2b",".b2c",".b2m",".ba",".bb",".bd",".be",".bf",".bg",".bh",".bi",".biz",".bj",".bj.cn",".bl",".bm",".bn",".bo",".boutique",".br",".br.com",".bs",".bt",".bv",".bw",".by",".bz",
".ca",".ca.us",".cat",".cc",".cd",".cf",".cg",".ch",".chat",".church",".ci",".ck",".cl",".club",".cm",".cn",".cn.com",".co",".co.uk",".co.us",".com",".com.au",".com.ac",".com.cn",".com.au",".com.tw",".coop",".cq.cn",".cr",".ct.us",".cu",".cv",".cx",".cy",".cz",
".dc.us",".de",".de.com",".de.net",".de.us",".dir",".dj",".dk",".dk.org",".dm",".do",".dz",".ec",".edu",".edu.ac",".edu.af",".edu.cn",".ee",".eg",".eh",".er",".es",".et",".eu",".eu.com",".eu.org",
".family",".fi",".firm",".fj",".fj.cn",".fk",".fl.us",".fm",".fo",".fr",".free",".ga",".ga.us",".game",".gb",".gb.com",".gb.net",".gd",".gd.cn",".ge",".gf",".gg",".gh",".gi",".gl",".gm",".gmbh",".gn",".golf",".gov",".gov.ac",".gov.ae",".gov.cn",".gp",".gq",".gr",".gs",".gs.cn",".gt",".gu",".gw",".gy",".gx.cn",".gz.cn",
".ha.cn",".hb.cn",".he.cn",".health",".hi.cn",".hi.us",".hk",".hl.cn",".hm",".hn",".hn.cn",".hr",".ht",".hu",".hu.com",".ia.us",".id",".id.us",".ie",".il",".il.us",".im",".in",".in.us",".inc",".info",".int",".io",".iq",".ir",".is",".it",
".je",".jl.cn",".jm",".jo",".jobs",".jp",".js.cn",".jx.cn",".ke",".kg",".kh",".ki",".kids",".ku",".km",".kn",".kp",".kr",".ks.us",".kw",".ky",".ky.us",".kz",
".la",".la.us",".law",".lb",".lc",".li",".lk",".llc",".llp",".ln.cn",".love",".lr",".ls",".lt",".ltd",".ltd.uk",".lu",".lv",".ly",".m2c",".m2m",".ma",".ma.us",".mc",".md",".md.us",".me",".me.us",".med",".me.uk",".mf",".mg",".mh",".mi.us",".mil",".mil.ac",".mil.ae",".mil.cn",".mk",".ml",".mm",".mn",".mn",".mo",".mo.us",".mobi",".movie",".mp",".mq",".mr",".ms",".ms.us",".mt",".mt.us",".mu",".museum",".music",".mv",".mw",".mx",".my",".mz",
".na",".ne.us",".name",".nc",".nc.us",".nd.us",".ne",".net",".net.ac",".net.ae",".net.cn",".net.tw",".net.uk",".news",".nf",".ng",".nh.us",".ni",".nj.us",".nl",".nm.cn",".nm.us",".no",".no.com",".nom.ad",".np",".nr",".nu",".nv.us",".ny.us",".nx.cn",".nz",
".oh.us",".ok.us",".om",".or.us",".org",".org.ac",".org.ae",".org.cn",".org.tw",".org.uk",".pa",".pa.us",".pe",".pf",".pg",".ph",".pk",".pl",".plc",".plc.uk",".pm",".pn",".pr",".pro",".pro.ae",".ps",".pt",".pw",".py",".qa",".qc.com",".qh.cn",
".re",".rec",".ri.us",".ro",".rs",".ru",".ru.com",".rw",".sa",".sa.com",".sb",".sc",".sc.cn",".sc.us",".sch.uk",".sch.ae",".school",".sd",".sd.cn",".sd.us",".se",".se.com",".search",".sg",".sh",".sh.cn",".shop",".si",".sj",".sk",".sl",".sm",".sn",".sn.cn",".so",".soc",".sport",".sr",".st",".su",".sv",".sy",".sx.cn",".sz",
".tc",".td",".tech",".tel",".tf",".tg",".th",".tj",".tj.cn",".tk",".tl",".tm",".tn",".tn.us",".to",".tp",".tr",".trade",".travel",".tt",".tv",".tw",".tw.cn",".tx.us",".tz",
".ua",".ug",".uk",".uk.com",".uk.net",".um",".us",".us.com",".ut.us",".uy",".uy.com",".uz",".va",".va.us",".vc",".ve",".vg",".vi",".video",".vn",".voyage",".vt.us",".vu",".wa.us",".wf",".wi.us",".ws",".wv.us",".wy.us",
".xj.cn",".xxx",".xz.cn",".ye",".yn.cn",".yt",".yu",".za",".za.com",".zj.cn",".zm",".zr",".zw");
$url = preg_replace('/\\\\/', "/", $url);
$url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
if(substr(strtolower($url),0,4) == "www."){
$fixed_url[] = "http://$url";
//echo "rule 1<br />";
}

$check_array = array('"',"*","'","//","///","////","'","./",".//","../",".../","..../","...../","./../","../../",'"\"',".//.//");
$excludes_array = array("ac","ad","ae","aero","af","ag","agent","ai","al","am","an","ao","aq","ar","arpa","arts","as","asia","at","au","auction","aw","ax","az","b2b","b2c","b2m","ba","bb","bd","be","bf","bg","bh","bi","biz","bj","bl","bm","bn","bo","boutique","br","bs","bt","bv","bw","by","bz","ca","cat","cc","cd","cf","cg","ch","chat","church","ci","ck","cl","club","cm","cn","co","com","coop","cr","cu","cv","cx","cy","cz","de","dir","dj","dk","dm","do","dz","ec","edu","ee","eg","eh","er","es","et","eu","family","fi","firm","fj","fk","fm","fo","fr","free","ga","game","gb","gd","ge","gf","gg","gh","gi","gl","gm","gmbh","gn","golf","gov","gp","gq","gr","gs","gt","gu","gw","gy","hk","hm","hn","hr","ht","hu","id","ie","il","im","in","inc","info","int","io","iq","ir","is","it","je","jm","jo","jobs","jp","ke","kg","kh","ki","kids","ku","km","kn","kp","kr","kw","ky","kz","la","law","lb","lc","li","lk","llc","llp","love","lr","ls","lt","ltd","lu","lv","ly","m2c","m2m","ma","mc","md","me","med","mf","mg","mh","mil","mk","ml","mm","mn","mn","mo","mobi","movie","mp","mq","mr","ms","mt","mu","museum","music","mv","mw","mx","my","mz","na","name","nc","ne","net","news","nf","ng","ni","nl","no","np","nr","nu","nz","om","org","pa","pe","pf","pg","ph","pk","pl","plc","pm","pn","pr","pro","ps","pt","pw","py","qa","re","rec","ro","rs","ru","rw","sa","sb","sc","school","sd","se","search","sg","sh","shop","si","sj","sk","sl","sm","sn","so","soc","sport","sr","st","su","sv","sy","sz","tc","td","tech","tel","tf","tg","th","tj","tk","tl","tm","tn","to","tp","tr","trade","travel","tt","tv","tw","tz","ua","ug","uk","um","us","uy","uz","va","vc","ve","vg","vi","video","vn","voyage","vu","wf","ws","xxx","ye","yt","yu","za","zm","zr","zw");

if(substr($url,0,1) == "/"){
$url = ltrim($url,"/");
$fixed_url[] = "$domain/$url";
//echo "main site and url<br />";
}

if(substr($url,0,1) == "#"){
$fixed_url[] = "$domain/$full_path$url";
//echo "target and url<br />";
}

if(substr($url,0,15) == "../../../../../"){
$url = str_replace("../../../../../","",$url);
$fixed_url[] = "$domain/$up_five$url";
//echo "five directory up<br />";
}

if(substr($url,0,12) == "../../../../"){
$url = str_replace("../../../../","",$url);
$fixed_url[] = "$domain/$up_four$url";
//echo "four directory up<br />";
}

if(substr($url,0,9) == "../../../"){
$url = str_replace("../../../","",$url);
$fixed_url[] = "$domain/$up_three$url";
//echo "three directory up<br />";
}

if(substr($url,0,6) == "../../"){
$url = str_replace("../../","",$url);
$fixed_url[] = "$domain/$up_two$url";
//echo "two directory up<br />";
}

if(substr($url,0,3) == "../"){
$url = str_replace("../","",$url);
$fixed_url[] = "$domain/$up_one$url";
//echo "one directory up<br />";
}

foreach($check_array as $checks){
$check_length = strlen($checks);
$temporary_url = $url;
$url = @ltrim($url,$checks);
$url = @rtrim($url,$checks);
if(substr($temporary_url,0,$check_length) == $checks){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 2<br />";
}
}

$parse_url = parseHOST($url);
$parse_ext_parts = explode(".",$parse_url);//end() needs a variable, not a function result
$parse_ext_check = "." . end($parse_ext_parts);
//echo "$parse_ext_check<br />";




//the following if statements will do checks on what to be added, only the first $fixed_url will be returned
if(in_array($parse_url, $excludes_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 3<br />";
}

if(in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "http://$url";
//echo "rule 4<br />";
}

if(preg_match("/([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}/",$ip_check)){
$fixed_url[] = "http://$url";
//echo "is an ip<br />";
}

if(!in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 5<br />";
}

if(!preg_match("/^(\w+.)$/siU",$parse_url)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 6<br />";
}

if($parse_url == $fixed_domain) {
$fixed_url[] = "http://$url";
//echo "rule 7<br />";
}

if (0 !== strpos($fixed_url[0], 'http')) {
$fixed_url[0] = "http://$fixed_url[0]";
//echo "rule 8<br />";
}

$lower_domain = parseHOST($fixed_url[0]);
$lower_url = str_ireplace($lower_domain,strtolower($lower_domain),$fixed_url[0]);
$lower_url = trim($lower_url,"'");
$lower_url = trim($lower_url,"#");
return $lower_url;

}//end fix relative function


//try to load as xml file first
@$xml = simplexml_load_file($target_url);
if($xml===TRUE) {
echo "<h3>Page is xml</h3>";

    foreach ($xml->url as $url_list) {
    $raw_url_array[] = (string) $url_list->loc;//cast to string and add to array
    }
    
    
    } else {//connect with curl request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    //echo "<br />cURL error number:" .curl_errno($ch);
    //echo "<br />cURL error:" . curl_error($ch);
    die("Unable to connect to that url");
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//only looking for a href links in the body section
$hrefs = $xpath->query('/html/body//a');

//loop all the found href links
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $raw_url_array[] = $href->getAttribute('href');//create array
    }//end loop

}//end if/else xml type or not

//check if is array, loop through urls and clean/fix
if(is_array($raw_url_array)){
foreach($raw_url_array as $url){
$url = fixRELATIVE($target_url,$url);//fix the self relative links 
    //only keep non-empty http:// or https:// links
    if($url != '' && substr($url,0,4) == "http"){
    $url_array[] = $url;//create a url array
    }

}//end foreach

//displaying it
$url_array = array_unique($url_array);
    foreach($url_array as $clean_url){
    $clean_url = htmlentities(urldecode($clean_url));
    echo "<a href='$clean_url'  target='_blank'>$clean_url</a><br />";
    }//end display

}//end if is array

}//end if $_GET['target'] set
?>
</div>
</body>
</html>


I made a correction to the code that determines whether the page is XML, and it now checks the xml file extension as well.

 

The only change is in the section that decides whether the page is XML; the rest of the script is identical to the version above. Note that simplexml_load_file() returns a SimpleXMLElement or FALSE, never TRUE, so the earlier if($xml===TRUE) check could never succeed:

$file_type_parts = explode(".",strtolower(trim($target_url)));
$file_type = end($file_type_parts);
//try to load as xml file first
$xml = @simplexml_load_file($target_url);
if($xml !== FALSE && $file_type == "xml") {
echo "<h3>Page is xml</h3>";
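Checking the .xml extension works for typical sitemaps, but many sitemaps are served from extensionless URLs, where that check fails. An alternative sketch is to sniff the payload itself (the helper name looksLikeXml is made up for this example):

```php
<?php
// Sketch: detect XML by inspecting the fetched content rather than
// trusting the URL's file extension.
function looksLikeXml($body) {
    $body = ltrim($body);
    return strncmp($body, '<?xml', 5) === 0      // XML declaration
        || strncmp($body, '<urlset', 7) === 0;   // bare sitemap root element
}

var_dump(looksLikeXml('<?xml version="1.0"?><urlset></urlset>')); // bool(true)
var_dump(looksLikeXml('<html><head></head></html>'));             // bool(false)
```

You would fetch the page once (with cURL or file_get_contents), run this check on the body, and only then hand it to simplexml_load_string() or DOMDocument, which also avoids downloading the target twice.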

