How to read a sitemap using PHP

Help!php · March 23, 2012

First of all, I have no idea how to do this, but I know it can be done. A little help would mean a lot to me.

Thank you in advance.

So I want to go through a sitemap, visit each of the links on it, and save each URL to a database.

Example:

HP 5550N A3 Colour Laser Printer

Konica Minolta PagePro 1350W A4 Mono Laser Printer

HP 9050dn A3 Mono Laser Printer

HP 5550DTN A3 Colour Laser Printer

HP 5550HDN A3 Colour Laser Printer

HP 5550DN A3 Colour Laser Printer

Let's say this was on the sitemap and it just continues with other products like this. I want to write code that goes to the first link, saves the product URL to the database, and keeps doing the same until the last link. In my example that would be HP 5550DN A3 Colour Laser Printer.

Any help??? Any ideas??

I am not asking someone to write the code for me. I just need help and a good direction.

QuickOldCar · March 23, 2012

For something simple, use SimpleXML:

simplexml_load_file()

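<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if(isset($_GET['url']) && $_GET['url'] != ''){
$url = trim($_GET['url']);
$xml = simplexml_load_file($url);
if($xml === false){
die("Could not load that url as xml");
}

//collect each loc value, cast to string so the array holds plain strings and not xml objects
$url_array = array();
foreach ($xml->url as $url_list) {
    $url_array[] = (string)$url_list->loc;
}

//display the array, escaping the urls for html output
foreach($url_array as $urls){
$urls = htmlspecialchars($urls, ENT_QUOTES);
echo "<a href='$urls'>$urls</a><br />";
}

} else {
echo "No xml url inserted";
}
?>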

Make the php file, then use something like http://mysite.com/script.php?url=http://www.phpfreaks.com/sitemap.xml in the address bar.

Now that you have the array full of urls, you can insert them into a database however you want: one row per url with an AUTO_INCREMENT id, or serialize or implode the array into a single field for each site's xml file.
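For the one-row-per-url option, here is a minimal sketch using PDO with a prepared statement. The function name saveUrls and the table sitemap_urls (an auto-increment id plus a url column) are assumptions made up for illustration, not something from the script above.

```php
<?php
// Sketch only: saveUrls() and the sitemap_urls table are hypothetical names.
// Assumes $pdo is an open PDO connection and the table has a "url" column.
function saveUrls(PDO $pdo, array $urls): int
{
    // Cast SimpleXMLElement values to plain strings and drop exact duplicates.
    $urls = array_unique(array_map('strval', $urls));

    // One prepared statement, executed once per url, so values are never
    // concatenated into the SQL text.
    $stmt = $pdo->prepare('INSERT INTO sitemap_urls (url) VALUES (:url)');

    $saved = 0;
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip empty loc values
        }
        $stmt->execute([':url' => $url]);
        $saved++;
    }
    return $saved; // number of rows inserted
}
```

With MySQL the connection would be created with a mysql: DSN and your own credentials; the function itself does not depend on the driver.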

To grab the titles you can use cURL or file_get_contents().
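Once the page HTML has been fetched, the title can be pulled out with DOMDocument. This is a rough sketch; extractTitle is a hypothetical helper name, and it assumes the HTML is already in a string.

```php
<?php
// Sketch only: extractTitle() is a hypothetical helper.
// Returns the text of the first <title> element, or null if there is none.
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    // Collect libxml errors internally so sloppy markup does not emit warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $dom->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return null;
    }

    // Collapse runs of whitespace so a multi-line title becomes one line.
    $title = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    return $title === '' ? null : $title;
}
```

The HTML string would come from file_get_contents($url) or curl_exec(), whichever you prefer.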

I have a script that obtains the titles for websites. It's actually not that easy to get the title from every site; been there.

QuickOldCar · March 23, 2012

I had a little time waiting for a friend, decided to expand upon this for you.

I used a stream context and file_get_contents(); like I said, you could also use cURL.

It checks for valid xml.

It also grabs the title, description and keywords, and puts it all into arrays.

You can work out any filtering or escaping before the mysql inserts.
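For the valid-xml check, an alternative to suppressing warnings with @ is libxml_use_internal_errors(). A small sketch, assuming the sitemap XML is already in a string; sitemapUrlsFromString is a made-up name for illustration.

```php
<?php
// Sketch only: sitemapUrlsFromString() is a hypothetical helper.
// Returns the loc values of a sitemap, or null when the XML is not well formed.
function sitemapUrlsFromString(string $xmlString): ?array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // not valid xml
    }

    $urls = [];
    foreach ($xml->url as $entry) {
        // Cast to string: loc is a SimpleXMLElement, not a plain string.
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $urls[] = $loc;
        }
    }
    return $urls;
}
```

A null result means "not xml", while an empty array means the XML parsed but held no url entries, so the two cases can be reported separately.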

<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if(isset($_GET['url']) && $_GET['url'] != ''){
$url = trim($_GET['url']);
$xml = @simplexml_load_file($url);
if($xml===FALSE) {
die('not a valid xml string');
} else {

$urls_array = array();
//short timeout so a slow page does not hang the script
$context = stream_context_create(array('http' => array('timeout' => 1)));

foreach ($xml->url as $url_list) {
    //cast to string so the url is a plain string and not an xml object
    $urls = trim((string)$url_list->loc);
    if (substr($urls, 0, 4) != "http") {
$urls = "http://$urls";
}

$str = @file_get_contents($urls, false, $context);

if(!$str){
die("Unable to connect");
}

//get_meta_tags() requests the page again and returns the meta names as lowercase keys
$tags = @get_meta_tags($urls);
if(!is_array($tags)){
$tags = array();
}
preg_match("/<title>(.*)<\/title>/Umis", $str, $title);

//fall back to the url when the page has no title
$title = isset($title[1]) ? trim($title[1]) : '';
if($title == ''){
$title = $urls;
}

$description = isset($tags['description']) ? $tags['description'] : '';
$keywords = isset($tags['keywords']) ? $tags['keywords'] : '';

//make it an array each value
$urls_array[] = array("url"=>$urls,"title"=>$title,"description"=>$description,"keywords"=>$keywords);

}//end loop

//see the array
//print_r($urls_array);

//display the array, escaping every value for html output
echo "<hr>";
foreach($urls_array as $url_value){
echo "<a href='".htmlspecialchars($url_value['url'], ENT_QUOTES)."'>".htmlspecialchars($url_value['title'], ENT_QUOTES)."</a><br />";
echo htmlspecialchars($url_value['description'], ENT_QUOTES)."<br />";
echo htmlspecialchars($url_value['keywords'], ENT_QUOTES)."<br />";
echo "<hr>";
}
}
} else {
echo "No xml url inserted";
}
?>

It will build an array like this for the phpfreaks sitemap:

Array
(
    [0] => Array
        (
            [url] => http://www.phpfreaks.com
            [title] => PHP Freaks - PHP Help Index
            [description] => PHP Freaks is a website dedicated to learning and teaching PHP. Here you will find a forum consisting of 128,486 members who have posted a total of 1,330,567 posts on the forums. Additionally, we have tutorials covering various aspects of PHP and you will find news syndicated from other websites so you can stay up-to-date. Along with the tutorials, the developers on the forum will be able to help you with your scripts, or you may perhaps share your knowledge so others can learn from you.
            [keywords] => php help, php forums, php tutorials, php tutorial, php news, php snippets, php, help, news, resources, news, snippets, tutorials, web development, programming
        )

    [1] => Array
        (
            [url] => http://www.phpfreaks.com/forums
            [title] => PHP Freaks Forums - Index
            [description] => PHP Freaks Forums - Index
            [keywords] => php, tutorials, help, tutorial, forum, free, resources, advice, oop, design
        )

    [2] => Array
        (
            [url] => http://www.phpfreaks.com/tutorials
            [title] => PHP Freaks - Tutorials
            [description] => Free tutorials on various PHP subjects covering basic to advanced principles.
            [keywords] =>
        )

    [3] => Array
        (
            [url] => http://www.phpfreaks.com/blogs
            [title] => PHP Freaks - Blog posts
            [description] =>
            [keywords] =>
        )

    [4] => Array
        (
            [url] => http://www.phpfreaks.com/news
            [title] => PHP Freaks - PHP News
            [description] =>
            [keywords] =>
        )

)

Help!php · March 26, 2012

Thank you for helping.

I have tried your code and it works for phpfreaks.com/sitemap.xml but not for the website I want to try.

I don't need the title; I just want the URL behind each of these titles.

Let's say, for example, http://www.php.net/sitemap.php

You know how they have different links, e.g. Homepage, News Archives, etc. I want each of the URLs to be printed so I can then import them into a database.

So for Homepage it would be http://www.php.net/index.php

QuickOldCar · March 27, 2012

So you want to get href links from normal pages and also xml pages.

I could just respond back saying use DOM, simple_html_dom, or connect to a page and pattern match href links with regex, but I'll let you have this link scraper script that I wrote.
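To illustrate the plain DOM option on its own, here is a minimal sketch that collects href values from an HTML string. extractHrefs is a hypothetical helper name, it assumes the page has already been fetched, and it does no fixing of relative links.

```php
<?php
// Sketch only: extractHrefs() is a hypothetical helper.
// Returns the unique, non-empty href values of all <a> elements.
function extractHrefs(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    // Collect libxml errors internally so broken markup does not emit warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when href is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }

    // Drop duplicates and reindex from zero.
    return array_values(array_unique($hrefs));
}
```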

The code below uses simplexml if the target is xml, and curl and DOM if it's not, to find the urls. Most of the code is functions for finding valid urls and fixing relative links.

demo of this code:

http://dynainternet.com/test/grab-links.php

the site you provided as an example (non xml):

http://dynainternet.com/test/grab-links.php?target=http://www.php.net/sitemap.php

and an xml example:

http://dynainternet.com/test/grab-links.php?target=http://www.phpfreaks.com/sitemap.xml

<html>
<head>
<title>Link scraper</title>
<meta name="description" content="Scrape the href links from the body section of a page" />
<meta name="keywords" content="scrape link, scrape urls,links,url,urls,fetch url,grab link" />
<meta name="author" content="Jay Przekop - dynainternet.com" />
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<style type="text/css">
#content {
  width: 800px ;
  margin-left: auto ;
  margin-right: auto ;
}
</style>
</head>
<body>
<div id="content">
<form action="" method="GET">
<input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
<input type="submit" value="Get the links" />
<br />
</form>

<?php
if(isset($_GET['target']) && $_GET['target'] != ''){

$target_url = trim($_GET['target']);
echo "<h2>Links from ".htmlentities(urldecode($target_url))."</h2>";

$userAgent = 'Linkhunter/1.0 (http://dynainternet.com/test/grab-links.php)';


//replace hxxp function
function replaceHxxp($url){
$url = str_ireplace(array("hxxps://xxx.","hxxps://","hxxp://xxx.","hxxp://"), array("https://www.","https://","http://www.","http://"), trim($url)); 
return $url;
}

//parse the host, no http:// returned
function parseHOST($url){
$new_parse_url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
$parsedUrl = @parse_url("http://$new_parse_url");
if(is_array($parsedUrl) && !empty($parsedUrl['host'])){
return strtolower(trim($parsedUrl['host']));
}
//no host found, fall back to the first segment of the path
$path_parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
return strtolower(trim($path_parts[0]));
}

function removePaths($url,$number_positions=NULL) {

        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $folder_path = "";//start empty so the .= below never hits an undefined variable
        $positions = explode('/', $trim_path);
        if(preg_match("/\./",end($positions))) {
        array_pop($positions);
        }
        if(!is_null($number_positions)){
        for ($i = 1; $i <= $number_positions; $i++) {
        array_pop($positions);
        }
        }
        foreach($positions as $folders){
        if(!empty($folders)){
        $folder_path .= "$folders/";
        }
        
        }
        
        return $folder_path;
}

//check relative and fix
function fixRELATIVE($target_url,$url) {
$url = replaceHxxp($url);
$domain = parseHOST($target_url);
$ip_check = parse_url($url, PHP_URL_HOST);
$up_one = removePaths($target_url,1);
$up_two = removePaths($target_url,2);
$up_three = removePaths($target_url,3);
$up_four = removePaths($target_url,4);
$up_five = removePaths($target_url,5);
$path = parse_url($target_url, PHP_URL_PATH);
$full_path = trim($path, '/');
$explode_path = explode("/", $full_path);
$last = end($explode_path);
//echo "last path: $last<br />";
$fixed_paths = "";
if(is_array($explode_path)){
foreach($explode_path as $paths){
if(!empty($paths) && !preg_match("/\./",$paths)){
$fixed_paths .= "$paths/";
}
}
}
$fixed_domain = "$domain/$fixed_paths";

//echo "Target: $target_url<br />";
//echo "Original: $url<br />";

$domain_array = array(
".ac",".ac.cn",".ac.ae",".ad",".ae",".aero",".af",".ag",".agent","ah.cn",".ai",".ak.us",".al",".al.us",".am",".an",".ao",".aq",".ar",".ar.us",".arpa",".arts",".as",".asia",".at",".au",".au.com",".auction",".aw",".ax",".az",".az.us",
".b2b",".b2c",".b2m",".ba",".bb",".bd",".be",".bf",".bg",".bh",".bi",".biz",".bj",".bj.cn",".bl",".bm",".bn",".bo",".boutique",".br",".br.com",".bs",".bt",".bv",".bw",".by",".bz",
".ca",".ca.us",".cat",".cc",".cd",".cf",".cg",".ch",".chat",".church",".ci",".ck",".cl",".club",".cm",".cn",".cn.com",".co",".co.uk",".co.us",".com",".com.au",".com.ac",".com.cn",".com.au",".com.tw",".coop",".cq.cn",".cr",".ct.us",".cu",".cv",".cx",".cy",".cz",
".dc.us",".de",".de.com",".de.net",".de.us",".dir",".dj",".dk",".dk.org",".dm",".do",".dz",
".ec",".edu",".edu.ac",".edu.af",".edu.cn",".ee",".eg",".eh",".er",".es",".et",".eu",".eu.com",".eu.org",
".family",".fi",".firm",".fj",".fj.cn",".fk",".fl.us",".fm",".fo",".fr",".free",
".ga",".ga.us",".game",".gb",".gb.com",".gb.net",".gd",".gd.cn",".ge",".gf",".gg",".gh",".gi",".gl",".gm",".gmbh",".gn",".golf",".gov",".gov.ac",".gov.ae",".gov.cn",".gp",".gq",".gr",".gs",".gs.cn",".gt",".gu",".gw",".gy",".gx.cn",".gz.cn",
".ha.cn",".hb.cn",".he.cn",".health",".hi.cn",".hi.us",".hk",".hl.cn",".hm",".hn",".hn.cn",".hr",".ht",".hu",".hu.com",
".ia.us",".id",".id.us",".ie",".il",".il.us",".im",".in",".in.us",".inc",".info",".int",".io",".iq",".ir",".is",".it",
".je",".jl.cn",".jm",".jo",".jobs",".jp",".js.cn",".jx.cn",
".ke",".kg",".kh",".ki",".kids",".ku",".km",".kn",".kp",".kr",".ks.us",".kw",".ky",".ky.us",".kz",
".la",".la.us",".law",".lb",".lc",".li",".lk",".llc",".llp",".ln.cn",".love",".lr",".ls",".lt",".ltd",".ltd.uk",".lu",".lv",".ly",
".m2c",".m2m",".ma",".ma.us",".mc",".md",".md.us",".me",".me.us",".med",".me.uk",".mf",".mg",".mh",".mi.us",".mil",".mil.ac",".mil.ae",".mil.cn",".mk",".ml",".mm",".mn",".mn",".mo",".mo.us",".mobi",".movie",".mp",".mq",".mr",".ms",".ms.us",".mt",".mt.us",".mu",".museum",".music",".mv",".mw",".mx",".my",".mz",
".na",".ne.us",".name",".nc",".nc.us",".nd.us",".ne",".net",".net.ac",".net.ae","net.cn",".net.tw",".net.uk",".news",".nf",".ng",".nh.us",".ni",".nj.us",".nl",".nm.cn",".nm.us",".no",".no.com",".nom.ad",".np",".nr",".nu",".nv.us",".ny.us",".nx.cn",".nz",
".oh.us",".ok.us",".om",".or.us",".org",".org.ac",".org.ae",".org.cn",".org.tw",".org.uk",
".pa",".pa.us",".pe",".pf",".pg",".ph",".pk",".pl",".plc",".plc.uk",".pm",".pn",".pr",".pro",".pro.ae",".ps",".pt",".pw",".py",
".qa",".qc.com",".qh.cn",
".re",".rec",".ri.us",".ro",".rs",".ru",".ru.com",".rw",
".sa",".sa.com",".sb",".sc",".sc.cn",".sc.us",".sch.uk",".sch.ae",".school",".sd",".sd.cn",".sd.us",".se",".se.com",".search",".sg",".sh",".sh.cn",".shop",".si",".sj",".sk",".sl",".sm",".sn",".sn.cn",".so",".soc",".sport",".sr",".st",".su",".sv",".sy",".sx.cn",".sz",
".tc",".td",".tech",".tel",".tf",".tg",".th",".tj",".tj.cn",".tk",".tl",".tm",".tn",".tn.us",".to",".tp",".tr",".trade",".travel",".tt",".tv",".tw",".tw.cn",".tx.us",".tz",
".ua",".ug",".uk",".uk.com",".uk.net",".um",".us",".us.com",".ut.us",".uy",".uy.com",".uz",
".va",".va.us",".vc",".ve",".vg",".vi",".video",".vn",".voyage",".vt.us",".vu",
".wa.us",".wf",".wi.us",".ws",".wv.us",".wy.us",
".xj.cn",".xxx",".xz.cn",".ye",".yn.cn",".yt",".yu",".za",".za.com",".zj.cn",".zm",".zr",".zw");
$url = preg_replace('/\\\\/', "/", $url);
$url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
if(substr(strtolower($url),0,4) == "www."){
$fixed_url[] = "http://$url";
//echo "rule 1<br />";
}

$check_array = array('"',"*","'","//","///","////","'","./",".//","../",".../","..../","...../","./../","../../",'"\"',".//.//");
$excludes_array = array("ac","ad","ae","aero","af","ag","agent","ai","al","am","an","ao","aq","ar","arpa","arts","as","asia","at","au","auction","aw","ax","az","b2b","b2c","b2m","ba","bb","bd","be","bf","bg","bh","bi","biz","bj","bl","bm","bn","bo","boutique","br","bs","bt","bv","bw","by","bz","ca","cat","cc","cd","cf","cg","ch","chat","church","ci","ck","cl","club","cm","cn","co","com","coop","cr","cu","cv","cx","cy","cz","de","dir","dj","dk","dm","do","dz","ec","edu","ee","eg","eh","er","es","et","eu","family","fi","firm","fj","fk","fm","fo","fr","free","ga","game","gb","gd","ge","gf","gg","gh","gi","gl","gm","gmbh","gn","golf","gov","gp","gq","gr","gs","gt","gu","gw","gy","hk","hm","hn","hr","ht","hu","id","ie","il","im","in","inc","info","int","io","iq","ir","is","it","je","jm","jo","jobs","jp","ke","kg","kh","ki","kids","ku","km","kn","kp","kr","kw","ky","kz","la","law","lb","lc","li","lk","llc","llp","love","lr","ls","lt","ltd","lu","lv","ly","m2c","m2m","ma","mc","md","me","med","mf","mg","mh","mil","mk","ml","mm","mn","mn","mo","mobi","movie","mp","mq","mr","ms","mt","mu","museum","music","mv","mw","mx","my","mz","na","name","nc","ne","net","news","nf","ng","ni","nl","no","np","nr","nu","nz","om","org","pa","pe","pf","pg","ph","pk","pl","plc","pm","pn","pr","pro","ps","pt","pw","py","qa","re","rec","ro","rs","ru","rw","sa","sb","sc","school","sd","se","search","sg","sh","shop","si","sj","sk","sl","sm","sn","so","soc","sport","sr","st","su","sv","sy","sz","tc","td","tech","tel","tf","tg","th","tj","tk","tl","tm","tn","to","tp","tr","trade","travel","tt","tv","tw","tz","ua","ug","uk","um","us","uy","uz","va","vc","ve","vg","vi","video","vn","voyage","vu","wf","ws","xxx","ye","yt","yu","za","zm","zr","zw");

if(substr($url,0,1) == "/"){
$url = ltrim($url,"/");
$fixed_url[] = "$domain/$url";
//echo "main site and url<br />";
}

if(substr($url,0,1) == "#"){
$fixed_url[] = "$domain/$full_path$url";
//echo "target and url<br />";
}

if(substr($url,0,15) == "../../../../../"){
$url = str_replace("../../../../../","",$url);
$fixed_url[] = "$domain/$up_five$url";
//echo "five directory up<br />";
}

if(substr($url,0,12) == "../../../../"){
$url = str_replace("../../../../","",$url);
$fixed_url[] = "$domain/$up_four$url";
//echo "four directory up<br />";
}

if(substr($url,0,9) == "../../../"){
$url = str_replace("../../../","",$url);
$fixed_url[] = "$domain/$up_three$url";
//echo "three directory up<br />";
}

if(substr($url,0,6) == "../../"){
$url = str_replace("../../","",$url);
$fixed_url[] = "$domain/$up_two$url";
//echo "two directory up<br />";
}

if(substr($url,0,3) == "../"){
$url = str_replace("../","",$url);
$fixed_url[] = "$domain/$up_one$url";
//echo "one directory up<br />";
}

foreach($check_array as $checks){
$check_length = strlen($checks);
$temporary_url = $url;
$url = @ltrim($url,$checks);
$url = @rtrim($url,$checks);
if(substr($temporary_url,0,$check_length) == $checks){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 2<br />";
}
}

$parse_url = parseHOST($url);
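//end() takes its argument by reference, so the exploded array needs its own variable
$parse_ext_parts = explode(".",$parse_url);
$parse_ext_explode = end($parse_ext_parts);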
$parse_ext_check = ".$parse_ext_explode";
//echo "$parse_ext_check<br />";




//the following if statements will do checks on what to be added, only the first $fixed_url will be returned
if(in_array($parse_url, $excludes_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 3<br />";
}

if(in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "http://$url";
//echo "rule 4<br />";
}

if(preg_match("/([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}/",$ip_check)){
$fixed_url[] = "http://$url";
//echo "is an ip<br />";
}

if(!in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 5<br />";
}

if(!preg_match("/^(\w+.)$/siU",$parse_url)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 6<br />";
}

if($parse_url == $fixed_domain) {
$fixed_url[] = "http://$url";
//echo "rule 7<br />";
}

if (0 !== strpos($fixed_url[0], 'http')) {
$fixed_url[0] = "http://$fixed_url[0]";
//echo "rule 8<br />";
}

$lower_domain = parseHOST($fixed_url[0]);
$lower_url = str_ireplace($lower_domain,strtolower($lower_domain),$fixed_url[0]);
$lower_url = trim($lower_url,"'");
$lower_url = trim($lower_url,"#");
return $lower_url;

}//end fix relative function


//try to load as xml file first
@$xml = simplexml_load_file($target_url);
if($xml===TRUE) {
echo "<h3>Page is xml</h3>";

    foreach ($xml->url as $url_list) {
    $raw_url_array[] = (string)$url_list->loc;//create array, cast so each url is a plain string
    }
    
    
    } else {//connect with curl request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    //echo "<br />cURL error number:" .curl_errno($ch);
    //echo "<br />cURL error:" . curl_error($ch);
    die("Unable to connect to that url");
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//only looking for a href links in the body section
$hrefs = $xpath->query('/html/body//a');

//loop all the found href links
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $raw_url_array[] = $href->getAttribute('href');//create array
    }//end loop

}//end if/else xml type or not

//check if is array, loop through urls and clean/fix
if(is_array($raw_url_array)){
foreach($raw_url_array as $url){
$url = fixRELATIVE($target_url,$url);//fix the self relative links 
    //only keep http and https links in array
    if(substr($url,0,7) == "http://" || substr($url,0,8) == "https://"){
    $url_array[] = $url;//create a url array
    }

}//end foreach

//displaying it
$url_array = array_unique($url_array);
    foreach($url_array as $clean_url){
    $clean_url = htmlentities(urldecode($clean_url));
    echo "<a href='$clean_url'  target='_blank'>$clean_url</a><br />";
    }//end display

}//end if is array

}//end if $_GET['target'] set
?>
</div>
</body>
</html>

QuickOldCar · March 27, 2012

Made a correction to the code that determines whether the target is xml; it now also checks for the xml file extension.

<html>
<head>
<title>Link scraper</title>
<meta name="description" content="Scrape the href links from the body section of a page" />
<meta name="keywords" content="scrape link, scrape urls,links,url,urls,fetch url,grab link" />
<meta name="author" content="Jay Przekop - dynainternet.com" />
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<style type="text/css">
#content {
  width: 800px ;
  margin-left: auto ;
  margin-right: auto ;
}
</style>
</head>
<body>
<div id="content">
<form action="" method="GET">
<input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
<input type="submit" value="Get the links" />
<br />
</form>

<?php
if(isset($_GET['target']) && $_GET['target'] != ''){

$target_url = trim($_GET['target']);
echo "<h2>Links from ".htmlentities(urldecode($target_url))."</h2>";

$userAgent = 'Linkhunter/1.0 (http://dynainternet.com/test/grab-links.php)';


//replace hxxp function
function replaceHxxp($url){
$url = str_ireplace(array("hxxps://xxx.","hxxps://","hxxp://xxx.","hxxp://"), array("https://www.","https://","http://www.","http://"), trim($url)); 
return $url;
}

//parse the host, no http:// returned
function parseHOST($url){
$new_parse_url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
$parsedUrl = @parse_url("http://$new_parse_url");
if(is_array($parsedUrl) && !empty($parsedUrl['host'])){
return strtolower(trim($parsedUrl['host']));
}
//no host found, fall back to the first segment of the path
$path_parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
return strtolower(trim($path_parts[0]));
}

function removePaths($url,$number_positions=NULL) {

        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $folder_path = "";//start empty so the .= below never hits an undefined variable
        $positions = explode('/', $trim_path);
        if(preg_match("/\./",end($positions))) {
        array_pop($positions);
        }
        if(!is_null($number_positions)){
        for ($i = 1; $i <= $number_positions; $i++) {
        array_pop($positions);
        }
        }
        foreach($positions as $folders){
        if(!empty($folders)){
        $folder_path .= "$folders/";
        }
        
        }
        
        return $folder_path;
}

//check relative and fix
function fixRELATIVE($target_url,$url) {
$url = replaceHxxp($url);
$domain = parseHOST($target_url);
$ip_check = parse_url($url, PHP_URL_HOST);
$up_one = removePaths($target_url,1);
$up_two = removePaths($target_url,2);
$up_three = removePaths($target_url,3);
$up_four = removePaths($target_url,4);
$up_five = removePaths($target_url,5);
$path = parse_url($target_url, PHP_URL_PATH);
$full_path = trim($path, '/');
$explode_path = explode("/", $full_path);
$last = end($explode_path);
//echo "last path: $last<br />";
$fixed_paths = "";
if(is_array($explode_path)){
foreach($explode_path as $paths){
if(!empty($paths) && !preg_match("/\./",$paths)){
$fixed_paths .= "$paths/";
}
}
}
$fixed_domain = "$domain/$fixed_paths";

//echo "Target: $target_url<br />";
//echo "Original: $url<br />";

$domain_array = array(
".ac",".ac.cn",".ac.ae",".ad",".ae",".aero",".af",".ag",".agent","ah.cn",".ai",".ak.us",".al",".al.us",".am",".an",".ao",".aq",".ar",".ar.us",".arpa",".arts",".as",".asia",".at",".au",".au.com",".auction",".aw",".ax",".az",".az.us",
".b2b",".b2c",".b2m",".ba",".bb",".bd",".be",".bf",".bg",".bh",".bi",".biz",".bj",".bj.cn",".bl",".bm",".bn",".bo",".boutique",".br",".br.com",".bs",".bt",".bv",".bw",".by",".bz",
".ca",".ca.us",".cat",".cc",".cd",".cf",".cg",".ch",".chat",".church",".ci",".ck",".cl",".club",".cm",".cn",".cn.com",".co",".co.uk",".co.us",".com",".com.au",".com.ac",".com.cn",".com.au",".com.tw",".coop",".cq.cn",".cr",".ct.us",".cu",".cv",".cx",".cy",".cz",
".dc.us",".de",".de.com",".de.net",".de.us",".dir",".dj",".dk",".dk.org",".dm",".do",".dz",
".ec",".edu",".edu.ac",".edu.af",".edu.cn",".ee",".eg",".eh",".er",".es",".et",".eu",".eu.com",".eu.org",
".family",".fi",".firm",".fj",".fj.cn",".fk",".fl.us",".fm",".fo",".fr",".free",
".ga",".ga.us",".game",".gb",".gb.com",".gb.net",".gd",".gd.cn",".ge",".gf",".gg",".gh",".gi",".gl",".gm",".gmbh",".gn",".golf",".gov",".gov.ac",".gov.ae",".gov.cn",".gp",".gq",".gr",".gs",".gs.cn",".gt",".gu",".gw",".gy",".gx.cn",".gz.cn",
".ha.cn",".hb.cn",".he.cn",".health",".hi.cn",".hi.us",".hk",".hl.cn",".hm",".hn",".hn.cn",".hr",".ht",".hu",".hu.com",
".ia.us",".id",".id.us",".ie",".il",".il.us",".im",".in",".in.us",".inc",".info",".int",".io",".iq",".ir",".is",".it",
".je",".jl.cn",".jm",".jo",".jobs",".jp",".js.cn",".jx.cn",
".ke",".kg",".kh",".ki",".kids",".ku",".km",".kn",".kp",".kr",".ks.us",".kw",".ky",".ky.us",".kz",
".la",".la.us",".law",".lb",".lc",".li",".lk",".llc",".llp",".ln.cn",".love",".lr",".ls",".lt",".ltd",".ltd.uk",".lu",".lv",".ly",
".m2c",".m2m",".ma",".ma.us",".mc",".md",".md.us",".me",".me.us",".med",".me.uk",".mf",".mg",".mh",".mi.us",".mil",".mil.ac",".mil.ae",".mil.cn",".mk",".ml",".mm",".mn",".mn",".mo",".mo.us",".mobi",".movie",".mp",".mq",".mr",".ms",".ms.us",".mt",".mt.us",".mu",".museum",".music",".mv",".mw",".mx",".my",".mz",
".na",".ne.us",".name",".nc",".nc.us",".nd.us",".ne",".net",".net.ac",".net.ae","net.cn",".net.tw",".net.uk",".news",".nf",".ng",".nh.us",".ni",".nj.us",".nl",".nm.cn",".nm.us",".no",".no.com",".nom.ad",".np",".nr",".nu",".nv.us",".ny.us",".nx.cn",".nz",
".oh.us",".ok.us",".om",".or.us",".org",".org.ac",".org.ae",".org.cn",".org.tw",".org.uk",
".pa",".pa.us",".pe",".pf",".pg",".ph",".pk",".pl",".plc",".plc.uk",".pm",".pn",".pr",".pro",".pro.ae",".ps",".pt",".pw",".py",
".qa",".qc.com",".qh.cn",
".re",".rec",".ri.us",".ro",".rs",".ru",".ru.com",".rw",
".sa",".sa.com",".sb",".sc",".sc.cn",".sc.us",".sch.uk",".sch.ae",".school",".sd",".sd.cn",".sd.us",".se",".se.com",".search",".sg",".sh",".sh.cn",".shop",".si",".sj",".sk",".sl",".sm",".sn",".sn.cn",".so",".soc",".sport",".sr",".st",".su",".sv",".sy",".sx.cn",".sz",
".tc",".td",".tech",".tel",".tf",".tg",".th",".tj",".tj.cn",".tk",".tl",".tm",".tn",".tn.us",".to",".tp",".tr",".trade",".travel",".tt",".tv",".tw",".tw.cn",".tx.us",".tz",
".ua",".ug",".uk",".uk.com",".uk.net",".um",".us",".us.com",".ut.us",".uy",".uy.com",".uz",
".va",".va.us",".vc",".ve",".vg",".vi",".video",".vn",".voyage",".vt.us",".vu",
".wa.us",".wf",".wi.us",".ws",".wv.us",".wy.us",
".xj.cn",".xxx",".xz.cn",".ye",".yn.cn",".yt",".yu",".za",".za.com",".zj.cn",".zm",".zr",".zw");
$url = preg_replace('/\\\\/', "/", $url);
$url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
if(substr(strtolower($url),0,4) == "www."){
$fixed_url[] = "http://$url";
//echo "rule 1<br />";
}

$check_array = array('"',"*","'","//","///","////","'","./",".//","../",".../","..../","...../","./../","../../",'"\"',".//.//");
$excludes_array = array("ac","ad","ae","aero","af","ag","agent","ai","al","am","an","ao","aq","ar","arpa","arts","as","asia","at","au","auction","aw","ax","az","b2b","b2c","b2m","ba","bb","bd","be","bf","bg","bh","bi","biz","bj","bl","bm","bn","bo","boutique","br","bs","bt","bv","bw","by","bz","ca","cat","cc","cd","cf","cg","ch","chat","church","ci","ck","cl","club","cm","cn","co","com","coop","cr","cu","cv","cx","cy","cz","de","dir","dj","dk","dm","do","dz","ec","edu","ee","eg","eh","er","es","et","eu","family","fi","firm","fj","fk","fm","fo","fr","free","ga","game","gb","gd","ge","gf","gg","gh","gi","gl","gm","gmbh","gn","golf","gov","gp","gq","gr","gs","gt","gu","gw","gy","hk","hm","hn","hr","ht","hu","id","ie","il","im","in","inc","info","int","io","iq","ir","is","it","je","jm","jo","jobs","jp","ke","kg","kh","ki","kids","ku","km","kn","kp","kr","kw","ky","kz","la","law","lb","lc","li","lk","llc","llp","love","lr","ls","lt","ltd","lu","lv","ly","m2c","m2m","ma","mc","md","me","med","mf","mg","mh","mil","mk","ml","mm","mn","mn","mo","mobi","movie","mp","mq","mr","ms","mt","mu","museum","music","mv","mw","mx","my","mz","na","name","nc","ne","net","news","nf","ng","ni","nl","no","np","nr","nu","nz","om","org","pa","pe","pf","pg","ph","pk","pl","plc","pm","pn","pr","pro","ps","pt","pw","py","qa","re","rec","ro","rs","ru","rw","sa","sb","sc","school","sd","se","search","sg","sh","shop","si","sj","sk","sl","sm","sn","so","soc","sport","sr","st","su","sv","sy","sz","tc","td","tech","tel","tf","tg","th","tj","tk","tl","tm","tn","to","tp","tr","trade","travel","tt","tv","tw","tz","ua","ug","uk","um","us","uy","uz","va","vc","ve","vg","vi","video","vn","voyage","vu","wf","ws","xxx","ye","yt","yu","za","zm","zr","zw");

if(substr($url,0,1) == "/"){
$url = ltrim($url,"/");
$fixed_url[] = "$domain/$url";
//echo "main site and url<br />";
}

if(substr($url,0,1) == "#"){
$fixed_url[] = "$domain/$full_path$url";
//echo "target and url<br />";
}

if(substr($url,0,15) == "../../../../../"){
$url = str_replace("../../../../../","",$url);
$fixed_url[] = "$domain/$up_five$url";
//echo "five directory up<br />";
}

if(substr($url,0,12) == "../../../../"){
$url = str_replace("../../../../","",$url);
$fixed_url[] = "$domain/$up_four$url";
//echo "four directory up<br />";
}

if(substr($url,0,9) == "../../../"){
$url = str_replace("../../../","",$url);
$fixed_url[] = "$domain/$up_three$url";
//echo "three directory up<br />";
}

if(substr($url,0,6) == "../../"){
$url = str_replace("../../","",$url);
$fixed_url[] = "$domain/$up_two$url";
//echo "two directory up<br />";
}

if(substr($url,0,3) == "../"){
$url = str_replace("../","",$url);
$fixed_url[] = "$domain/$up_one$url";
//echo "one directory up<br />";
}

foreach($check_array as $checks){
$check_length = strlen($checks);
$temporary_url = $url;
$url = @ltrim($url,$checks);
$url = @rtrim($url,$checks);
if(substr($temporary_url,0,$check_length) == $checks){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 2<br />";
}
}

$parse_url = parseHOST($url);
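//end() takes its argument by reference, so the exploded array needs its own variable
$parse_ext_parts = explode(".",$parse_url);
$parse_ext_explode = end($parse_ext_parts);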
$parse_ext_check = ".$parse_ext_explode";
//echo "$parse_ext_check<br />";




//the following if statements will do checks on what to be added, only the first $fixed_url will be returned
if(in_array($parse_url, $excludes_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 3<br />";
}

if(in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "http://$url";
//echo "rule 4<br />";
}

if(preg_match("/([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}/",$ip_check)){
$fixed_url[] = "http://$url";
//echo "is an ip<br />";
}

if(!in_array($parse_ext_check, $domain_array)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 5<br />";
}

if(!preg_match("/^(\w+.)$/siU",$parse_url)){
$fixed_url[] = "$fixed_domain$url";
//echo "rule 6<br />";
}

if($parse_url == $fixed_domain) {
$fixed_url[] = "http://$url";
//echo "rule 7<br />";
}

if (0 !== strpos($fixed_url[0], 'http')) {
$fixed_url[0] = "http://$fixed_url[0]";
//echo "rule 8<br />";
}

$lower_domain = parseHOST($fixed_url[0]);
$lower_url = str_ireplace($lower_domain,strtolower($lower_domain),$fixed_url[0]);
$lower_url = trim($lower_url,"'");
$lower_url = trim($lower_url,"#");
return $lower_url;

}//end fix relative function

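//end() takes its argument by reference, so the exploded array needs its own variable
$file_type_parts = explode(".",strtolower(trim($target_url)));
$file_type = end($file_type_parts);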
//try to load as xml file first
@$xml = simplexml_load_file($target_url);
if($xml!==False && $file_type == "xml") {
echo "<h3>Page is xml</h3>";

    foreach ($xml->url as $url_list) {
    $raw_url_array[] = (string)$url_list->loc;//create array, cast so each url is a plain string
    }
    
    
    } else {//connect with curl request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    //echo "<br />cURL error number:" .curl_errno($ch);
    //echo "<br />cURL error:" . curl_error($ch);
    die("Unable to connect to that url");
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//only looking for a href links in the body section
$hrefs = $xpath->query('/html/body//a');

//loop all the found href links
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $raw_url_array[] = $href->getAttribute('href');//create array
    }//end loop

}//end if/else xml type or not

//check if is array, loop through urls and clean/fix
if(is_array($raw_url_array)){
foreach($raw_url_array as $url){
$url = fixRELATIVE($target_url,$url);//fix the self relative links 
    //only keep http and https links in array
    if(substr($url,0,7) == "http://" || substr($url,0,8) == "https://"){
    $url_array[] = $url;//create a url array
    }

}//end foreach

//displaying it
$url_array = array_unique($url_array);
    foreach($url_array as $clean_url){
    $clean_url = htmlentities(urldecode($clean_url));
    echo "<a href='$clean_url'  target='_blank'>$clean_url</a><br />";
    }//end display

}//end if is array

}//end if $_GET['target'] set
?>
</div>
</body>
</html>
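The relative-link fixing in the script above is rule based. As a different and much shorter approach (not what the script does), a relative href can be resolved against the page url roughly the way browsers do it. This is a simplified sketch: resolveUrl is a hypothetical helper, and it ignores edge cases such as a username or password in the base url.

```php
<?php
// Sketch only: resolveUrl() is a hypothetical helper, simplified from the
// usual browser rules. $base is the page url, $rel is the href found on it.
function resolveUrl(string $base, string $rel): string
{
    $rel = trim($rel);
    if ($rel === '') {
        return $base;
    }
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    if (!is_array($b) || !isset($b['scheme'], $b['host'])) {
        return $rel; // base is not an absolute url, nothing to resolve against
    }
    // Protocol-relative link: //host/path
    if (substr($rel, 0, 2) === '//') {
        return $b['scheme'] . ':' . $rel;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Fragment-only and query-only links keep the base path.
    if ($rel[0] === '#') {
        return $root . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $rel;
    }
    if ($rel[0] === '?') {
        return $root . $basePath . $rel;
    }

    // Split off ?query and #fragment so only the path part is normalised.
    $pos = strcspn($rel, '?#');
    $path = substr($rel, 0, $pos);
    $suffix = (string) substr($rel, $pos);

    if ($path[0] !== '/') {
        // Relative path: replace everything after the last slash of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Remove "." segments and let ".." drop the previous segment,
    // never climbing above the site root.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    // A trailing "." or ".." refers to a directory, so keep the final slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return $root . implode('/', $out) . $suffix;
}
```

For the php.net example in this thread, resolving index.php against http://www.php.net/sitemap.php gives http://www.php.net/index.php.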

Help!php · March 27, 2012

Just an aspx page. The code you provided works. I did try to use simple_html_dom, but it didn't really work.

So thank you for the help. If you were here you would have got a big fat hug (I am a girl). You saved me, so thank you.
