Help!php Posted March 23, 2012

First of all, I have no idea how to do this, but I know it can be done, and a little help would mean a lot to me. Thank you in advance.

I want to go through a sitemap, visit each link on it, and save the URL to a database. Example:

HP 5550N A3 Colour Laser Printer
Konica Minolta PagePro 1350W A4 Mono Laser Printer
HP 9050dn A3 Mono Laser Printer
HP 5550DTN A3 Colour Laser Printer
HP 5550HDN A3 Colour Laser Printer
HP 5550DN A3 Colour Laser Printer

Let's say this was on the sitemap, and it just continues with other products like this. I want to write code that goes to the first link, saves the product URL to the database, and keeps doing the same until the last link (in my example, HP 5550DN A3 Colour Laser Printer). Any help or ideas? I am not asking someone to write the code for me, just some help and a good direction.
QuickOldCar Posted March 23, 2012

For something simple, use simplexml_load_file():

<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if (isset($_GET['url']) && $_GET['url'] != '') {
    $url = trim($_GET['url']);
    $xml = simplexml_load_file($url);

    foreach ($xml->url as $url_list) {
        $url_array[] = $url_list->loc;
    }

    //display the array
    foreach ($url_array as $urls) {
        echo "<a href='$urls'>$urls</a><br />";
    }
} else {
    echo "No xml url inserted";
}
?>

Make the PHP file, then call it from the address bar with something like http://mysite.com/script.php?url=http://www.phpfreaks.com/sitemap.xml

Now that you have the array full of URLs, you can insert them into a database however you want: one row per URL with an AUTO_INCREMENT id, or serialize or implode the array into a single field per site's XML file.

To grab the titles you can use cURL or file_get_contents(). I have a script to obtain the titles for websites; it's actually not that easy to get them from every site, been there.
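A minimal sketch of that one-row-per-URL insert, assuming a hypothetical sitemap_urls table (id AUTO_INCREMENT, url VARCHAR) and local MySQL credentials; note that the entries in $url_array above are SimpleXMLElement objects, so they are cast to plain strings before binding:

<?php
//minimal sketch: store each sitemap URL as its own row
//the table name `sitemap_urls` and the DSN/credentials are assumptions, adjust for your setup
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO sitemap_urls (url) VALUES (:url)');

foreach ($url_array as $urls) {
    //cast the SimpleXMLElement to a string before binding it
    $stmt->execute(array(':url' => (string) $urls));
}
?>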
QuickOldCar Posted March 23, 2012

I had a little time waiting for a friend, so I decided to expand on this for you. I used a stream context and file_get_contents() (as I said, you could also use cURL), added a check for valid XML, and also grabbed the title, description, and keywords, putting it all into arrays. You can work out any filtering or escaping before the MySQL inserts.

<?php
//test url http://www.domain.com/sitemap.xml
//test url http://www.phpfreaks.com/sitemap.xml
if (isset($_GET['url']) && $_GET['url'] != '') {
    $url = trim($_GET['url']);
    @$xml = simplexml_load_file($url);
    if ($xml === FALSE) {
        die('not a valid xml string');
    } else {
        foreach ($xml->url as $url_list) {
            $urls = $url_list->loc;
            if (substr($urls, 0, 4) != "http") {
                $urls = "http://$urls";
            }
            $context = stream_context_create(array('http' => array('timeout' => 1)));
            $str = file_get_contents($urls, 0, $context);
            if (!$str) {
                die("Unable to connect");
            }
            $tags = get_meta_tags($urls);
            preg_match("/<title>(.*)<\/title>/Umis", $str, $title);
            preg_match("/<head>(.*)<\/head>/is", $str, $head);
            $title = $title[1];
            if ($title == '') {
                $title = $urls;
            }
            $description = $tags['description'];
            $keywords = $tags['keywords'];
            //make each value an array entry, json decode used to remove url from being xml object
            $urls_array[] = array("url" => json_decode($urls, TRUE), "title" => $title, "description" => $description, "keywords" => $keywords);
        }//end loop

        //see the array
        //print_r($urls_array);

        //display the array
        echo "<hr>";
        foreach ($urls_array as $url_value) {
            echo "<a href=" . $url_value['url'] . ">" . $url_value['title'] . "</a><br />";
            echo $url_value['description'] . "<br />";
            echo $url_value['keywords'] . "<br />";
            echo "<hr>";
        }
    }
} else {
    echo "No xml url inserted";
}
?>

It will return an array like this for the phpfreaks sitemap:

Array
(
    [0] => Array
        (
            [url] => http://www.phpfreaks.com
            [title] => PHP Freaks - PHP Help Index
            [description] => PHP Freaks is a website dedicated to learning and teaching PHP. Here you will find a forum consisting of 128,486 members who have posted a total of 1,330,567 posts on the forums. Additionally, we have tutorials covering various aspects of PHP and you will find news syndicated from other websites so you can stay up-to-date. Along with the tutorials, the developers on the forum will be able to help you with your scripts, or you may perhaps share your knowledge so others can learn from you.
            [keywords] => php help, php forums, php tutorials, php tutorial, php news, php snippets, php, help, news, resources, news, snippets, tutorials, web development, programming
        )

    [1] => Array
        (
            [url] => http://www.phpfreaks.com/forums
            [title] => PHP Freaks Forums - Index
            [description] => PHP Freaks Forums - Index
            [keywords] => php, tutorials, help, tutorial, forum, free, resources, advice, oop, design
        )

    [2] => Array
        (
            [url] => http://www.phpfreaks.com/tutorials
            [title] => PHP Freaks - Tutorials
            [description] => Free tutorials on various PHP subjects covering basic to advanced principles.
            [keywords] =>
        )

    [3] => Array
        (
            [url] => http://www.phpfreaks.com/blogs
            [title] => PHP Freaks - Blog posts
            [description] =>
            [keywords] =>
        )

    [4] => Array
        (
            [url] => http://www.phpfreaks.com/news
            [title] => PHP Freaks - PHP News
            [description] =>
            [keywords] =>
        )

)
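A sketch of the cURL alternative mentioned above, as a stand-in for the file_get_contents() call; the helper name fetch_page() is only for illustration:

<?php
//fetch a page body with cURL instead of file_get_contents()
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          //give up after 10 seconds
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; //FALSE on failure
}

//usage in place of the file_get_contents() call above:
$str = fetch_page($urls);
if (!$str) {
    die("Unable to connect");
}
?>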
Help!php (Author) Posted March 26, 2012

Thank you for helping. I have tried your code and it works for the phpfreaks.com sitemap, but not for the website I want to try. I don't need the title, I just want the URL for each of these titles. Let's say, for example, http://www.php.net/sitemap.php. You know how it has different links, e.g. Homepage, News Archives, etc.? I want each of the URLs to be printed so I can then import them into the database. So for Homepage it would be http://www.php.net/index.php
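For an HTML sitemap page like that (rather than an XML file), the bare idea is to load the page into DOMDocument and pull out just the href attributes with XPath. A minimal sketch using the php.net example; the full scraper in the next reply also resolves relative links:

<?php
//minimal sketch: list the href values from an HTML sitemap page
$html = file_get_contents('http://www.php.net/sitemap.php');
$dom = new DOMDocument();
@$dom->loadHTML($html); //suppress warnings from imperfect markup
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a[@href]') as $a) {
    echo $a->getAttribute('href') . "<br />"; //e.g. /index.php for the Homepage link
}
?>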
QuickOldCar Posted March 27, 2012

So you want to get href links from normal pages as well as XML pages. I could just respond saying use DOM, simple_html_dom, or connect to a page and pattern-match href links with regex, but I'll let you have this link scraper script that I wrote. The code below uses SimpleXML if the target is an XML file, and cURL plus DOM if it isn't, to find the URLs; most of the code is functions for finding valid URLs and fixing relative links.

Demo of this code: http://dynainternet.com/test/grab-links.php
The site you provided as an example (non-XML): http://dynainternet.com/test/grab-links.php?target=http://www.php.net/sitemap.php
An XML example: http://dynainternet.com/test/grab-links.php?target=http://www.phpfreaks.com/sitemap.xml

<html>
<title>Link scraper</title>
<meta name="description" content="Scrape the href links from the body section of a page" />
<meta name="keywords" content="scrape link, scrape urls,links,url,urls,fetch url,grab link" />
<meta name="author" content="Jay Przekop - dynainternet.com" />
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<head>
<style type="text/css">
#content { width: 800px ; margin-left: auto ; margin-right: auto ; }
</style>
</head>
<body>
<div id="content">
<form action="" method="GET">
<input type="text" name="target" size="100" id="target" value="" placeholder="Insert url to get links" />
<input type="submit" value="Get the links" />
<br />
</form>
<?php
if (isset($_GET['target']) && $_GET['target'] != '') {
    $target_url = trim($_GET['target']);
    echo "<h2>Links from ".htmlentities(urldecode($target_url))."</h2>";
    $userAgent = 'Linkhunter/1.0 (http://dynainternet.com/test/grab-links.php)';

    //replace hxxp function
    function replaceHxxp($url){
        $url = str_ireplace(array("hxxps://xxx.","hxxps://","hxxp://xxx.","hxxp://"), array("https://www.","https://","http://www.","http://"), trim($url));
        return $url;
    }

    //parse the host, no http:// returned
    function parseHOST($url){
        $new_parse_url = str_ireplace(array("http://","https://", "http://", "ftp://", "feed://"), "", trim($url));
        $parsedUrl = @parse_url("http://$new_parse_url");
        return strtolower(trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2))));
    }

    function removePaths($url,$number_positions=NULL) {
        $path = @parse_url($url, PHP_URL_PATH);
        $trim_path = trim($path, '/');
        $positions = "";
        $positions = explode('/', $trim_path);
        if(preg_match("/\./",end($positions))) {
            array_pop($positions);
        }
        if(!is_null($number_positions)){
            for ($i = 1; $i <= $number_positions; $i++) {
                array_pop($positions);
            }
        }
        foreach($positions as $folders){
            if(!empty($folders)){
                $folder_path .= "$folders/";
            }
        }
        return $folder_path;
    }

    //check relative and fix
    function fixRELATIVE($target_url,$url) {
        $url = replaceHxxp($url);
        $domain = parseHOST($target_url);
        $ip_check = parse_url($url, PHP_URL_HOST);
        $up_one = removePaths($target_url,1);
        $up_two = removePaths($target_url,2);
        $up_three = removePaths($target_url,3);
        $up_four = removePaths($target_url,4);
        $up_five = removePaths($target_url,5);
        $path = parse_url($target_url, PHP_URL_PATH);
        $full_path = trim($path, '/');
        $explode_path = explode("/", $full_path);
        $last = end($explode_path);
        //echo "last path: $last<br />";
        $fixed_paths = "";
        if(is_array($explode_path)){
            foreach($explode_path as $paths){
                if(!empty($paths) && !preg_match("/\./",$paths)){
                    $fixed_paths .= "$paths/";
                }
            }
        }
        $fixed_domain = "$domain/$fixed_paths";
        //echo "Target: $target_url<br />";
        //echo "Original: $url<br />";
        $domain_array = array(".ac",".ac.cn",".ac.ae",".ad",".ae",".aero",".af",".ag",".agent","ah.cn",".ai",".ak.us",".al",".al.us",".am",".an",".ao",".aq",".ar",".ar.us",".arpa",".arts",".as",".asia",".at",".au",".au.com",".auction",".aw",".ax",".az",".az.us",".b2b",".b2c",".b2m",".ba",".bb",".bd",".be",".bf",".bg",".bh",".bi",".biz",".bj",".bj.cn",".bl",".bm",".bn",".bo",".boutique",".br",".br.com",".bs",".bt",".bv",".bw",".by",".bz",".ca",".ca.us",".cat",".cc",".cd",".cf",".cg",".ch",".chat",".church",".ci",".ck",".cl",".club",".cm",".cn",".cn.com",".co",".co.uk",".co.us",".com",".com.au",".com.ac",".com.cn",".com.au",".com.tw",".coop",".cq.cn",".cr",".ct.us",".cu",".cv",".cx",".cy",".cz",".dc.us",".de",".de.com",".de.net",".de.us",".dir",".dj",".dk",".dk.org",".dm",".do",".dz",".ec",".edu",".edu.ac",".edu.af",".edu.cn",".ee",".eg",".eh",".er",".es",".et",".eu",".eu.com",".eu.org",".family",".fi",".firm",".fj",".fj.cn",".fk",".fl.us",".fm",".fo",".fr",".free",".ga",".ga.us",".game",".gb",".gb.com",".gb.net",".gd",".gd.cn",".ge",".gf",".gg",".gh",".gi",".gl",".gm",".gmbh",".gn",".golf",".gov",".gov.ac",".gov.ae",".gov.cn",".gp",".gq",".gr",".gs",".gs.cn",".gt",".gu",".gw",".gy",".gx.cn",".gz.cn",".ha.cn",".hb.cn",".he.cn",".health",".hi.cn",".hi.us",".hk",".hl.cn",".hm",".hn",".hn.cn",".hr",".ht",".hu",".hu.com",".ia.us",".id",".id.us",".ie",".il",".il.us",".im",".in",".in.us",".inc",".info",".int",".io",".iq",".ir",".is",".it",".je",".jl.cn",".jm",".jo",".jobs",".jp",".js.cn",".jx.cn",".ke",".kg",".kh",".ki",".kids",".ku",".km",".kn",".kp",".kr",".ks.us",".kw",".ky",".ky.us",".kz",".la",".la.us",".law",".lb",".lc",".li",".lk",".llc",".llp",".ln.cn",".love",".lr",".ls",".lt",".ltd",".ltd.uk",".lu",".lv",".ly",".m2c",".m2m",".ma",".ma.us",".mc",".md",".md.us",".me",".me.us",".med",".me.uk",".mf",".mg",".mh",".mi.us",".mil",".mil.ac",".mil.ae",".mil.cn",".mk",".ml",".mm",".mn",".mn",".mo",".mo.us",".mobi",".movie",".mp",".mq",".mr",".ms",".ms.us",".mt",".mt.us",".mu",".museum",".music",".mv",".mw",".mx",".my",".mz",".na",".ne.us",".name",".nc",".nc.us",".nd.us",".ne",".net",".net.ac",".net.ae","net.cn",".net.tw",".net.uk",".news",".nf",".ng",".nh.us",".ni",".nj.us",".nl",".nm.cn",".nm.us",".no",".no.com",".nom.ad",".np",".nr",".nu",".nv.us",".ny.us",".nx.cn",".nz",".oh.us",".ok.us",".om",".or.us",".org",".org.ac",".org.ae",".org.cn",".org.tw",".org.uk",".pa",".pa.us",".pe",".pf",".pg",".ph",".pk",".pl",".plc",".plc.uk",".pm",".pn",".pr",".pro",".pro.ae",".ps",".pt",".pw",".py",".qa",".qc.com",".qh.cn",".re",".rec",".ri.us",".ro",".rs",".ru",".ru.com",".rw",".sa",".sa.com",".sb",".sc",".sc.cn",".sc.us",".sch.uk",".sch.ae",".school",".sd",".sd.cn",".sd.us",".se",".se.com",".search",".sg",".sh",".sh.cn",".shop",".si",".sj",".sk",".sl",".sm",".sn",".sn.cn",".so",".soc",".sport",".sr",".st",".su",".sv",".sy",".sx.cn",".sz",".tc",".td",".tech",".tel",".tf",".tg",".th",".tj",".tj.cn",".tk",".tl",".tm",".tn",".tn.us",".to",".tp",".tr",".trade",".travel",".tt",".tv",".tw",".tw.cn",".tx.us",".tz",".ua",".ug",".uk",".uk.com",".uk.net",".um",".us",".us.com",".ut.us",".uy",".uy.com",".uz",".va",".va.us",".vc",".ve",".vg",".vi",".video",".vn",".voyage",".vt.us",".vu",".wa.us",".wf",".wi.us",".ws",".wv.us",".wy.us",".xj.cn",".xxx",".xz.cn",".ye",".yn.cn",".yt",".yu",".za",".za.com",".zj.cn",".zm",".zr",".zw");
        $url = preg_replace('/\\\\/', "/", $url);
        $url = str_ireplace(array("http://","https://", "ftp://", "feed://"), "", trim($url));
        if(substr(strtolower($url),0,4) == "www."){
            $fixed_url[] = "http://$url";
            //echo "rule 1<br />";
        }
        $check_array = array('"',"*","'","//","///","////","'","./",".//","../",".../","..../","...../","./../","../../",'"\"',".//.//");
        $excludes_array = array("ac","ad","ae","aero","af","ag","agent","ai","al","am","an","ao","aq","ar","arpa","arts","as","asia","at","au","auction","aw","ax","az","b2b","b2c","b2m","ba","bb","bd","be","bf","bg","bh","bi","biz","bj","bl","bm","bn","bo","boutique","br","bs","bt","bv","bw","by","bz","ca","cat","cc","cd","cf","cg","ch","chat","church","ci","ck","cl","club","cm","cn","co","com","coop","cr","cu","cv","cx","cy","cz","de","dir","dj","dk","dm","do","dz","ec","edu","ee","eg","eh","er","es","et","eu","family","fi","firm","fj","fk","fm","fo","fr","free","ga","game","gb","gd","ge","gf","gg","gh","gi","gl","gm","gmbh","gn","golf","gov","gp","gq","gr","gs","gt","gu","gw","gy","hk","hm","hn","hr","ht","hu","id","ie","il","im","in","inc","info","int","io","iq","ir","is","it","je","jm","jo","jobs","jp","ke","kg","kh","ki","kids","ku","km","kn","kp","kr","kw","ky","kz","la","law","lb","lc","li","lk","llc","llp","love","lr","ls","lt","ltd","lu","lv","ly","m2c","m2m","ma","mc","md","me","med","mf","mg","mh","mil","mk","ml","mm","mn","mn","mo","mobi","movie","mp","mq","mr","ms","mt","mu","museum","music","mv","mw","mx","my","mz","na","name","nc","ne","net","news","nf","ng","ni","nl","no","np","nr","nu","nz","om","org","pa","pe","pf","pg","ph","pk","pl","plc","pm","pn","pr","pro","ps","pt","pw","py","qa","re","rec","ro","rs","ru","rw","sa","sb","sc","school","sd","se","search","sg","sh","shop","si","sj","sk","sl","sm","sn","so","soc","sport","sr","st","su","sv","sy","sz","tc","td","tech","tel","tf","tg","th","tj","tk","tl","tm","tn","to","tp","tr","trade","travel","tt","tv","tw","tz","ua","ug","uk","um","us","uy","uz","va","vc","ve","vg","vi","video","vn","voyage","vu","wf","ws","xxx","ye","yt","yu","za","zm","zr","zw");
        if(substr($url,0,1) == "/"){
            $url = ltrim($url,"/");
            $fixed_url[] = "$domain/$url";
            //echo "main site and url<br />";
        }
        if(substr($url,0,1) == "#"){
            $fixed_url[] = "$domain/$full_path$url";
            //echo "target and url<br />";
        }
        if(substr($url,0,15) == "../../../../../"){
            $url = str_replace("../../../../../","",$url);
            $fixed_url[] = "$domain/$up_five$url";
            //echo "five directory up<br />";
        }
        if(substr($url,0,12) == "../../../../"){
            $url = str_replace("../../../../","",$url);
            $fixed_url[] = "$domain/$up_four$url";
            //echo "four directory up<br />";
        }
        if(substr($url,0,9) == "../../../"){
            $url = str_replace("../../../","",$url);
            $fixed_url[] = "$domain/$up_three$url";
            //echo "three directory up<br />";
        }
        if(substr($url,0,6) == "../../"){
            $url = str_replace("../../","",$url);
            $fixed_url[] = "$domain/$up_two$url";
            //echo "two directory up<br />";
        }
        if(substr($url,0,3) == "../"){
            $url = str_replace("../","",$url);
            $fixed_url[] = "$domain/$up_one$url";
            //echo "one directory up<br />";
        }
        foreach($check_array as $checks){
            $check_length = strlen($checks);
            $temporary_url = $url;
            $url = @ltrim($url,$checks);
            $url = @rtrim($url,$checks);
            if(substr($temporary_url,0,$check_length) == $checks){
                $fixed_url[] = "$fixed_domain$url";
                //echo "rule 2<br />";
            }
        }
        $parse_url = parseHOST($url);
        $parse_ext_explode = end(explode(".",$parse_url));
        $parse_ext_check = ".$parse_ext_explode";
        //echo "$parse_ext_check<br />";
        //the following if statements will do checks on what to be added, only the first $fixed_url will be returned
        if(in_array($parse_url, $excludes_array)){
            $fixed_url[] = "$fixed_domain$url";
            //echo "rule 3<br />";
        }
        if(in_array($parse_ext_check, $domain_array)){
            $fixed_url[] = "http://$url";
            //echo "rule 4<br />";
        }
        if(preg_match("/([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}/",$ip_check)){
            $fixed_url[] = "http://$url";
            //echo "is an ip<br />";
        }
        if(!in_array($parse_ext_check, $domain_array)){
            $fixed_url[] = "$fixed_domain$url";
            //echo "rule 5<br />";
        }
        if(!preg_match("/^(\w+.)$/siU",$parse_url)){
            $fixed_url[] = "$fixed_domain$url";
            //echo "rule 6<br />";
        }
        if($parse_url == $fixed_domain) {
            $fixed_url[] = "http://$url";
            //echo "rule 7<br />";
        }
        if (0 !== strpos($fixed_url[0], 'http')) {
            $fixed_url[0] = "http://$fixed_url[0]";
            //echo "rule 8<br />";
        }
        $lower_domain = parseHOST($fixed_url[0]);
        $lower_url = str_ireplace($lower_domain,strtolower($lower_domain),$fixed_url[0]);
        $lower_url = trim($lower_url,"'");
        $lower_url = trim($lower_url,"#");
        return $lower_url;
    }//end fix relative function

    //try to load as xml file first
    @$xml = simplexml_load_file($target_url);
    if($xml===TRUE) {
        echo "<h3>Page is xml</h3>";
        foreach ($xml->url as $url_list) {
            $raw_url_array[] = json_decode($url_list->loc,TRUE);//create array
        }
    } else {
        //connect with curl request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL,$target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html= curl_exec($ch);
        if (!$html) {
            //echo "<br />cURL error number:" .curl_errno($ch);
            //echo "<br />cURL error:" . curl_error($ch);
            die("Unable to connect to that url");
        }
        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        // grab all the links on the page
        $xpath = new DOMXPath($dom);
        //only looking for a href links in the body section
        $hrefs = $xpath->query('/html/body//a');
        //loop all the found href links
        for ($i = 0; $i < $hrefs->length; $i++) {
            $href = $hrefs->item($i);
            $raw_url_array[] = $href->getAttribute('href');//create array
        }//end loop
    }//end if/else xml type or not

    //check if is array, loop through urls and clean/fix
    if(is_array($raw_url_array)){
        foreach($raw_url_array as $url){
            $url = fixRELATIVE($target_url,$url);//fix the self relative links
            //only show http links in array
            if($url != '' || substr($url,0,4) != "http:" || substr($url,0,5) != "https:"){
                $url_array[] = $url;//create a url array
            }
        }//end foreach

        //displaying it
        $url_array = array_unique($url_array);
        foreach($url_array as $clean_url){
            $clean_url = htmlentities(urldecode($clean_url));
            echo "<a href='$clean_url' target='_blank'>$clean_url</a><br />";
        }//end display
    }//end if is array
}//end if $_GET['target'] set
?>
</div>
</body>
</html>
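If the scraper gets re-run against the same site, one simple way to avoid storing the same link twice is a UNIQUE index on the url column plus INSERT IGNORE. A sketch, assuming the same hypothetical sitemap_urls table and credentials as earlier:

<?php
//sketch: insert the cleaned urls, silently skipping ones already stored
//assumed table: CREATE TABLE sitemap_urls (
//    id INT AUTO_INCREMENT PRIMARY KEY,
//    url VARCHAR(255) NOT NULL,
//    UNIQUE KEY (url)
//);
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');
$stmt = $pdo->prepare('INSERT IGNORE INTO sitemap_urls (url) VALUES (:url)');

foreach ($url_array as $clean_url) {
    $stmt->execute(array(':url' => $clean_url)); //duplicates hit the unique key and are skipped
}
?>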
QuickOldCar Posted March 27, 2012

Made a correction in the code that determines whether the target is XML, also checking for the xml extension. The rest of the script is the same as in the previous post; only the "try to load as xml file first" section changed:

    $file_type = end(explode(".", strtolower(trim($target_url))));

    //try to load as xml file first
    @$xml = simplexml_load_file($target_url);
    if ($xml !== FALSE && $file_type == "xml") {
        echo "<h3>Page is xml</h3>";
        foreach ($xml->url as $url_list) {
            $raw_url_array[] = json_decode($url_list->loc, TRUE); //create array
        }
    } else {
        //connect with curl request to $target_url, unchanged from the previous post
        ...
    }
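One more detail about reading sitemaps: the sitemaps.org format also allows a sitemap index file, a <sitemapindex> element whose <sitemap><loc> children point at further sitemap files instead of a single <urlset> of page URLs. A sketch of handling both cases with SimpleXML; the helper name collect_sitemap_urls() is only for illustration:

<?php
//sketch: collect page urls from either a plain urlset or a sitemap index
function collect_sitemap_urls($sitemap_url)
{
    $urls = array();
    $xml = @simplexml_load_file($sitemap_url);
    if ($xml === false) {
        return $urls; //not valid xml
    }
    if ($xml->getName() == 'sitemapindex') {
        //a sitemap index: recurse into each child sitemap it lists
        foreach ($xml->sitemap as $sitemap) {
            $urls = array_merge($urls, collect_sitemap_urls((string) $sitemap->loc));
        }
    } else {
        //a plain urlset: collect the page urls directly
        foreach ($xml->url as $url) {
            $urls[] = (string) $url->loc;
        }
    }
    return $urls;
}

//usage:
//$all_urls = collect_sitemap_urls('http://www.example.com/sitemap.xml');
?>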
Help!php (Author) Posted March 27, 2012

Just an aspx page. The code you provided works. I did try to use simple_html_dom, but it didn't really work. So thank you for the help, and if you were here you would have got a big fat hug (I am a girl). You saved me, so thank you.