Hiya,

This has been doing my head in for days now!
There are crawler scripts on the internet that you point at a webpage to extract all of its HTML links (hrefs).
Here is one such script that I found:

A.

hrefs Extractor - Extracts from html files

	<?php
	// 1. General page crawler. Not an XML sitemap crawler.
	include_once('simplehtmldom_1_9_1/simple_html_dom.php');
	// FAILS:
	//$url = "https://www.rocktherankings.com/post-sitemap.xml";
	//$url = "https://bytenota.com/sitemap.xml";
	//$url = "https://www.rocktherankings.com/sitemap_index.xml";
	// WORKS:
	$url = "https://www.rocktherankings.com/footer-links-seo/";
	// Load the page and print the href of every <a> tag found on it.
	$html = new simple_html_dom();
	$html->load_file($url);
	foreach ($html->find("a") as $link) {
	    echo $link->href . "<br>";
	}
	?>
	

 

And there are those that extract links from xml files.
Like these two:

 

 

1.

Extracts from Xml files

	// Sitemap Protocol: https://www.sitemaps.org/protocol.html
	include_once('simplehtmldom_1_9_1/simple_html_dom.php');
	// WORKS:
	//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
	//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; // Has more xml files.
	// FAILS. Shows blank page.
	$sitemap = "https://bytenota.com/sitemap.xml";
	$html = new simple_html_dom();
	$html->load_file($sitemap);
	foreach ($html->find("loc") as $link) {
	    echo $link->innertext . "<br>";
	}
	

 

2.

Extracts from Xml files

	// Sitemap crawler. If the starting url is an xml file listing further xml files, it shows a blank page and does not visit the found xml files to extract links from them.
	// Sitemap Protocol: https://www.sitemaps.org/protocol.html
	// sitemap url or sitemap file
	// FAILS:
	//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; // Has more xml files.
	// WORKS:
	//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
	$sitemap = "https://bytenota.com/sitemap.xml";
	// get sitemap content
	$content = file_get_contents($sitemap);
	// parse the sitemap content to object
	$xml = simplexml_load_string($content);
	// retrieve properties from the sitemap object
	foreach ($xml->url as $urlElement) {
	    // get properties
	    $url = $urlElement->loc;
	    $lastmod = $urlElement->lastmod;
	    $changefreq = $urlElement->changefreq;
	    $priority = $urlElement->priority;
	    // print out the properties
	    echo 'url: ' . $url . '<br>';
	    echo 'lastmod: ' . $lastmod . '<br>';
	    echo 'changefreq: ' . $changefreq . '<br>';
	    echo 'priority: ' . $priority . '<br>';
	    echo '<br>---<br>';
	}
	

 

But can you figure out the issues I am having with the last two crawlers above?

If you point them at an XML file (sitemap) that lists further XML links (sitemaps), one of them chokes. Do try it yourself rather than taking my word for it.

So I have no choice but to build my own crawler. When I set it to navigate to an XML sitemap, it should first check whether the links listed on that page are href links or XML links to more XML files (more sitemaps). Good idea?

What I did was first get my crawler to navigate to an XML file, the starting-point page.
Now I want it to extract all the links it finds and check whether they are hrefs or further XML links.
If the links are hrefs, add them to the $extracted_urls array.
Otherwise, add them to the $crawl_xml_files array.


Later on, the crawler can crawl the href and XML links dumped into both arrays.
Right now I am stuck on the part where the code fails to echo the extensions of the links found on the initially navigated page.
It also fails to extract any links into the respective arrays.
Here is the code. Test it and see for yourself where I am going wrong. I am scratching my head.

	//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html
	    //$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
    //$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
    $sitemap = 'https://bytenota.com/sitemap.xml';
   
	    // get sitemap content
    //$sitemap = 'sitemap.xml';
    // get sitemap content
    $content = file_get_contents($sitemap);
	    // parse the sitemap content to object
    $xml = simplexml_load_string($content);
    //var_dump($xml);
    // Init arrays
    $crawl_xml_files = [];
    $extracted_urls = [];
    $extracted_last_mods = [];
    $extracted_changefreqs = [];
    $extracted_priorities = [];
    // retrieve properties from the sitemap object
    foreach ($xml->url as $urlElement) {
        // provide path of current xml/html file
        $path = (string)$urlElement->loc;
        // get pathinfo
        $ext = pathinfo($path, PATHINFO_EXTENSION);
        echo 'The extension is: ' . $ext;
        echo '<br>'; //DELETE IN DEV MODE
	        echo $urlElement; //DELETE IN DEV MODE
	        if ($ext == 'xml') //The links found on the current page are not links to the site's webpages but links to further xml sitemaps, so the crawler needs to go another level deep to hunt for the site's html pages.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE
	            //Add Xml Links to array.
            $crawl_xml_files[] = $path;
        } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //The links found on the current page are the site's html pages and are not links to further xml sitemaps.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE
	            //Add hrefs to array.
            //$extracted_urls[] = $path;
	            // get properties
	            $extracted_urls[] = $extracted_url = $urlElement->loc; //Add hrefs to array.
            $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod; //Add lastmod to array.
            $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq; //Add changefreq to array.
            $extracted_priorities[] = $extracted_priority = $urlElement->priority; //Add priority to array.
        }
    }
	    var_dump($crawl_xml_files); //Print all extracted Xml Links.
    var_dump($extracted_urls); //Print all extracted hrefs.
    var_dump($extracted_last_mods); //Print all extracted last mods.
    var_dump($extracted_changefreqs); //Print all extracted changefreqs.
    var_dump($extracted_priorities); //Print all extracted priorities.
	    foreach($crawl_xml_files as $crawl_xml_file)
    {
        echo 'Xml File to crawl: ' .$crawl_xml_file; //Print all extracted Xml Links.
    }
	    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
	    foreach($extracted_urls as $extracted_url)
    {
        echo 'Extracted Url: ' .$extracted_url; //Print all extracted hrefs.
    }
	    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
	    foreach($extracted_last_mods as $extracted_last_mod)
    {
        echo 'Extracted last Mod: ' .$extracted_last_mod; //Print all extracted last mods.
    }
	    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
	    foreach($extracted_changefreqs as $extracted_changefreq)
    {
        echo 'Extracted Change Frequency: ' .$extracted_changefreq; //Print all extracted changefreqs.
    }
	    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
	    foreach($extracted_priorities as $extracted_priority)
    {
        echo 'Extracted Priority: ' .$extracted_priority; //Print all extracted priorities.
    }
	    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
	


Can someone be kind enough to fix this, shortening it as much as possible in procedural style, and show me how you fixed it?

Thanks!

Edited by TheStudent2023

Checking the extension is pointless.  Extensions don't necessarily mean anything, and many URLs these days do not even have extensions to check.  Trying to check the extension of a URL to determine if it's an XML file and then treat it as a site map is the entirely wrong approach here.  Just because a URL ends in .xml doesn't make it a sitemap file, it could be any sort of XML file.  What determines if something is a sitemap file is that the URL is part of a <sitemap> element, so that's what you need to check.

If you look at the protocol page, you'll see there are essentially two types of sitemap files.  A urlset listing all the URLs of the site, or a sitemapindex listing  other site maps.  So when you download an XML file, you should be checking whether the file is one of those two types and parse them accordingly.

I'm not familiar enough with simplexml to know the code for that, but with DOMDocument you'd load the XML then check the nodeName to determine if it's a urlset or a sitemapindex.
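To make the point about extensions concrete, here is a minimal sketch of what `pathinfo()` reports for a few URLs. The sample URLs are assumptions chosen for illustration (the `.xml.gz` one is taken from the sitemap protocol page); they are not from the sites being crawled.

```php
<?php
// Hypothetical sample URLs, chosen only to illustrate the problem.
$samples = [
    'https://example.com/blog/my-post/',       // an ordinary page with no file extension
    'http://www.example.com/sitemap1.xml.gz',  // a compressed sitemap, as on the protocol page
    'http://www.example.com/',                 // a site root
];

// Collect the "extension" pathinfo() finds for each URL.
$extensions = [];
foreach ($samples as $sample) {
    $extensions[$sample] = pathinfo($sample, PATHINFO_EXTENSION);
}

var_dump($extensions);
```

The three results are an empty string, `gz`, and `com`. None of them match the `xml`, `html`, or `php` checks in the code above, so an extension test would leave every one of these links out of both arrays.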

 

@kicken

Phew! It seems you understood my code intention.

>>but with DOMDocument you'd load the XML then check the nodeName to determine if it's a urlset or a sitemapindex.<<

Can you be kind enough to show me how to code it to do this ? Talking about this particular part ...

>>check the nodeName to determine if it's a urlset or a sitemapindex<<

Then, I should be able to move forward from there on.

 

Thanks!

Edited by TheStudent2023

@kicken,

I made the first move. Here you go:

	<?php
	$url = "https://techalltype.com/";
	$html = file_get_contents($url);
	$doc = new \DOMDocument('1.0', 'UTF-8'); /* instance of DOMDocument */
	@$doc->loadHTML($html); /* parses the HTML contained in the string; @ hides warnings about malformed markup */
	$xpath = new \DOMXPath($doc); /* used to query the parsed document */
	$nodes = $xpath->query('//a');
	foreach ($nodes as $key => $node) {
	    echo $key . ".) " . $node->getAttribute('href') . "<br/>";
	}
	

Now, can you do the part I am stuck on?

Edited by TheStudent2023
1 minute ago, TheStudent2023 said:

Can you be kind enough to show me how to code it to do this

Something like this:

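$xml = file_get_contents($sitemapUrl);
$dom = new DOMDocument();
$dom->loadXML($xml);
// nodeName on the document object itself is '#document', so check the root element
if ($dom->documentElement->nodeName === 'sitemapindex'){
    //parse the index
} else if ($dom->documentElement->nodeName === 'urlset'){
    //parse url set
} else {
    //some other xml file
}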

 

11 minutes ago, kicken said:

I'm not familiar enough with simplexml to know the code for that,

But ChatGPT is, apparently.

 


@kicken

I do not like copying other people's code, as I will never learn to walk by myself.

I might as well nudge you a little longer and learn how to use DOMDocument.

 

I see you memorised this part:

	$dom = new DOMDocument(); $dom->loadXML($xml);

found here:

https://www.php.net/manual/en/domdocument.loadxml.php

But where in the manual did you learn this part:

	if ($dom->nodeName

This code looks at the element of an XML link, I am guessing.

Which page of the DOMDocument manual shows this particular line of code? I cannot find it. Look, nothing here links to a page that teaches how to write DOMDocument code that checks a link's element or nodeName:

https://www.php.net/domdocument

Can you kindly point me to the right link?

EDIT: I tried many times, but the ChatGPT link you gave fails to load:

https://imgur.com/fPm8mjD

Is it the same on your end or not?

Thanks

 

Edited by TheStudent2023

@ginerjm

 

I am failing to find a list of nodeNames for DOMDocument:

https://www.php.net/manual/en/class.domnode.php#domnode.props.nodename

Do you know the correct link? I am asking as you know your DOMDocument stuff.

Edited by TheStudent2023
4 minutes ago, TheStudent2023 said:

I am failing to find a list of nodeNames for DOMDocument:

There isn't a list of values for nodeName because the list would be infinitely long.  The nodeName is the tag name in the source code.  Since the code is interested in Sitemap files, and the root element of those files is either <sitemapindex> or <urlset> then you'll be looking for a nodeName of 'sitemapindex' or 'urlset'.
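A minimal sketch of that check, assuming the XML is already in a string. The helper name `sitemap_root_name()` is made up for this illustration and the inline XML is trimmed from the protocol page. One detail worth knowing: `nodeName` on the `DOMDocument` object itself is always `#document`, so the tag name has to be read from `documentElement`, the root element.

```php
<?php
// Hypothetical helper: returns the root tag name of an XML string,
// or null when the string is empty or not well-formed XML.
function sitemap_root_name(string $xmlString): ?string
{
    if ($xmlString === '') {
        return null; // loadXML() rejects an empty string
    }

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $dom->documentElement === null) {
        return null;
    }

    // The root element's nodeName is the tag name as written in the source.
    return $dom->documentElement->nodeName;
}

$urlsetXml = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>http://www.example.com/</loc></url></urlset>';
$indexXml = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap></sitemapindex>';

echo sitemap_root_name($urlsetXml), "\n";
echo sitemap_root_name($indexXml), "\n";
```

With these two inputs the helper returns `urlset` and `sitemapindex` respectively, which is exactly the pair of values the branching code needs to compare against.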

Thanks. Bear with me. All these extractors (simple_html_dom, DOMDocument) never got through to my head.

Let me see if I am understanding you or not.

In the HTML language, we call these tag names:

<a> = link tag (carries the href)

<title> = title tag.

And so on.

Q1.

In PHP or parser language, you do not call these 'tags' but 'nodes'. Right?

In HTML, I understand about parent tags and child tags. Do not worry.

 

Currently, I am over here:

https://www.sitemaps.org/protocol.html

I can see 2 Xml link listing formats:

1.

	<?xml version="1.0" encoding="UTF-8"?>
	<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	   <url>
	      <loc>http://www.example.com/</loc>
	      <lastmod>2005-01-01</lastmod>
	      <changefreq>monthly</changefreq>
	      <priority>0.8</priority>
	   </url>
	   <url>
	      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
	      <changefreq>weekly</changefreq>
	   </url>
	   <url>
	      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
	      <lastmod>2004-12-23</lastmod>
	      <changefreq>weekly</changefreq>
	   </url>
	   <url>
	      <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
	      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
	      <priority>0.3</priority>
	   </url>
	   <url>
	      <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
	      <lastmod>2004-11-23</lastmod>
	   </url>
	</urlset>
	

 

2.

	<?xml version="1.0" encoding="UTF-8"?>
	<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	   <sitemap>
	      <loc>http://www.example.com/sitemap1.xml.gz</loc>
	      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
	   </sitemap>
	   <sitemap>
	      <loc>http://www.example.com/sitemap2.xml.gz</loc>
	      <lastmod>2005-01-01</lastmod>
	   </sitemap>
	</sitemapindex>
	

The first format lists links to the site's pages.

The second format lists XML links to further XML sitemaps.

I get that part.

Now, to get PHP to determine which format is on the page, you said I must write code that checks the parent (root) node. Right?

So, if it finds the tag/node name is "urlset":

	<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	

Then I need to write code for PHP to jump to each <loc> tag and dump the extracted URL into the hrefs array (for example), identifying the URL as an html, php, etc. page (but not another XML file).

 

And, if it finds the tag/node name is "sitemapindex":

	<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	

Then I need to write code for PHP to jump to each <loc> tag and dump the extracted URL into the XML files array (for example), identifying the extracted URL as another XML file and not an html, php, etc. webpage file.

 

Q2. Did I understand you so far?

(Remember, I am a beginner-level programmer with no other programming background, so my questions will sound stupid to you.)

 

Now, I got this particular code from a programmer a few weeks back:

	$sitemap = 'https://www.***/home-sitemap.xml';
	// get sitemap content
$content = file_get_contents($sitemap);
	// parse the sitemap content to object
$xml = simplexml_load_string($content);
	// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) //Extracts html file urls.
//foreach ($xml->sitemap as $urlElement) //Extracts Sitemap Urls.
{
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;
	    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';
	    echo '<br>---<br>';
}
	

I am guessing, the above extracts links from the 1st format.

 

Q3

And to get it to extract links (xml files) from the 2nd format, I must change this:

	foreach ($xml->url as $urlElement) //Extracts html file urls.
	

to either this:

	foreach ($xml->sitemap as $urlElement) //Extracts html file urls.
	

Or this:

	foreach ($xml->sitemapindex as $urlElement) //Extracts html file urls.
	
Edited by TheStudent2023

@kicken

 

Correction:

I meant these:

1. Extract Urls

	foreach ($xml->urlset as $urlElement) //Extracts html file urls.
{
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;
	    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';
	    echo '<br>---<br>';
}
	

 

2. Extract SiteMaps

	foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
{
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;
	    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';
	    echo '<br>---<br>';
}
	

 

So far, so good ?
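On the question of which property to loop over: with SimpleXML, the object returned by `simplexml_load_string()` is the root element itself, so the loop goes over its children (`url` inside a `urlset`, `sitemap` inside a `sitemapindex`), not over the root's own name. A small sketch, using inline XML trimmed from the protocol page; the variable names are made up for the illustration.

```php
<?php
// Sitemap index sample, trimmed from https://www.sitemaps.org/protocol.html
$indexXml = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>';

$xml = simplexml_load_string($indexXml);

// getName() gives the tag name of the element the object represents (here, the root).
$rootName = $xml->getName();

// The <sitemap> children are reached through $xml->sitemap.
$sitemapUrls = [];
foreach ($xml->sitemap as $sitemapElement) {
    $sitemapUrls[] = (string) $sitemapElement->loc; // cast to get a plain string
}

// Looping over $xml->sitemapindex finds nothing: the root has no child with that name.
$wrongLoopCount = count($xml->sitemapindex);

echo $rootName, "\n";
print_r($sitemapUrls);
echo $wrongLoopCount, "\n";
```

Here `$rootName` is `sitemapindex`, `$sitemapUrls` holds the two `.xml.gz` links, and `$wrongLoopCount` is 0. The same reasoning applies to the first format: loop over `$xml->url`, since `$xml->urlset` would be empty.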

@kicken

 

Check it out:

	<?php
	$xml = file_get_contents($sitemapUrl); //Should I stick to this line or below line ?
// parse the sitemap content to object
$xml = simplexml_load_string($sitemapUrl); //Should I stick to this line or above line ?
	$dom = new DOMDocument();
$dom->loadXML($xml);
if ($dom->nodeName === 'sitemapindex')
{
    //parse the index
    // retrieve properties from the sitemap object
    foreach ($xml->urlset as $urlElement) //Extracts html file urls.
    {
        // get properties
        $url = $urlElement->loc;
        $lastmod = $urlElement->lastmod;
        $changefreq = $urlElement->changefreq;
        $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo 'changefreq: '. $changefreq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else if ($dom->nodeName === 'urlset')
{
    //parse url set
    // retrieve properties from the sitemap object
    foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
    {
        // get properties
        $url = $urlElement->loc;
        $lastmod = $urlElement->lastmod;
        $changefreq = $urlElement->changefreq;
        $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo 'changefreq: '. $changefreq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else
{
    //some other
	

I am stuck on the last ELSE, as I do not know of any other format for PHP to check. No other formats are listed here:

https://www.sitemaps.org/protocol.html

Obviously, you had something in mind. What did you have in mind for that ELSE?

Can you finish that ELSE of yours? Then I can move on to writing code to extract the meta tags.

Edited by TheStudent2023
2 hours ago, TheStudent2023 said:

I do not like copycating others codes as I will never learn to walk by myself.

If this is in reference to the ChatGPT link and not wanting to use it, it's not "copycatting" to look at examples to try and understand how something works, whether that example is from someone here, code on github, generated by AI, etc.  There's nothing wrong with asking it for an example of how to do something and learning from that.  I would advise against just taking code from ChatGPT or similar tools and using it directly.  While the AIs are pretty good, they are not perfect and their code isn't always the best quality, but it's fine as an example to study.

1 hour ago, TheStudent2023 said:

What had you in mind for that ELSE

Just that the document is something other than a site map file.  There are lots of different XML documents on the web, not just site map files.  You need to decide what you want to do if you find a non-site-map xml file and put that in the else branch.
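Putting the three branches together, a procedural sketch might look like the following. It is only one way to do it, under stated assumptions: `collect_page_urls()` and `$fetch` are names invented here, `$fetch` is any callable that takes a URL and returns the body as a string (or false), and the `else` branch simply skips non-sitemap XML. Compressed `.xml.gz` sitemaps are not handled.

```php
<?php
// Hypothetical helper: returns every page URL reachable from a sitemap or sitemap index.
function collect_page_urls(string $sitemapUrl, callable $fetch, array &$visited = []): array
{
    // Guard against visiting the same sitemap twice.
    if (isset($visited[$sitemapUrl])) {
        return [];
    }
    $visited[$sitemapUrl] = true;

    $content = $fetch($sitemapUrl);
    if ($content === false || $content === '') {
        return [];
    }

    // Parse quietly; anything that is not well-formed XML is skipped.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return [];
    }

    $pageUrls = [];
    if ($xml->getName() === 'sitemapindex') {
        // An index lists further sitemaps: go one level deeper for each.
        foreach ($xml->sitemap as $sitemapElement) {
            $childUrls = collect_page_urls((string) $sitemapElement->loc, $fetch, $visited);
            $pageUrls = array_merge($pageUrls, $childUrls);
        }
    } elseif ($xml->getName() === 'urlset') {
        // A urlset lists the site's pages directly.
        foreach ($xml->url as $urlElement) {
            $pageUrls[] = (string) $urlElement->loc;
        }
    } else {
        // Some other XML document (a feed, for instance): this sketch ignores it.
    }

    return $pageUrls;
}

// Stand-in documents so the sketch runs without network access.
$ns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"';
$documents = [
    'http://www.example.com/sitemap_index.xml' =>
        "<sitemapindex $ns><sitemap><loc>http://www.example.com/sitemap1.xml</loc></sitemap>"
        . "<sitemap><loc>http://www.example.com/feed.xml</loc></sitemap></sitemapindex>",
    'http://www.example.com/sitemap1.xml' =>
        "<urlset $ns><url><loc>http://www.example.com/</loc></url>"
        . "<url><loc>http://www.example.com/about</loc></url></urlset>",
    'http://www.example.com/feed.xml' => '<rss version="2.0"><channel><title>News</title></channel></rss>',
];
$fetch = function (string $url) use ($documents) {
    return $documents[$url] ?? false;
};

$pages = collect_page_urls('http://www.example.com/sitemap_index.xml', $fetch);
print_r($pages);
```

Against a live site, passing `'file_get_contents'` as `$fetch` should work the same way, since it also takes a URL and returns a string or false.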

@kicken

 

No. I meant that I do not like to copy code exactly as is from tutorial sites, manuals, forums, or even from you. That would be like me giving you orders, you writing code for me, and me just using your code to build my websites. I ask you people for code samples to learn your ways of coding, both basic and orthodox. Then I try changing the code and experiment, fiddle, test, and gain experience. Whatever version (derived from your code or from tutorials) works best for me, I stick to: I memorise the lines and write them from memory into a template. So when I ask for code samples or snippets, do not assume I am trying to get you to work for free to build my websites. I am not using anyone indirectly to build my websites, just using everyone to learn coding. That is all. In the past, some members assumed I was just trying to get them to do my dirty work for free. Hence I am making things clear again in this forum, in case someone gets that paranoid notion again that I am asking for code snippets bit by bit and combining them to build my websites, and am not really motivated to learn programming.

So, when I pester pros for snippets, you now know what will happen to those snippets. They will first become part of my experiments.

Right now, I am trying to get snippets from pros for code that extracts meta tags. Not by using PHP's default code (a built-in PHP function), but by using parsers: DOMDocument and simple_html_dom.

@kicken

 

My crawler will only start its crawls from XML sitemaps. It will crawl all XML sitemaps found on any given site, if more than one sitemap exists.

From the sitemaps, it will extract the regular page links from the <loc> tags. Then it will visit those HTML pages and extract the meta tags and title tags. That is all for now.

And so, I guess I will ignore that final ELSE from your code, as I do not want the crawler to deal with any XML files that are not sitemaps.
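For the last step, extracting the title and meta tags with a parser, a minimal DOMDocument sketch could look like this. The function name `extract_title_and_meta()` and the sample HTML are assumptions for illustration; only `<meta>` tags that carry a `name` attribute are collected.

```php
<?php
// Hypothetical helper: returns the page title and the name => content pairs of its meta tags.
function extract_title_and_meta(string $html): array
{
    $result = ['title' => '', 'meta' => []];
    if ($html === '') {
        return $result; // loadHTML() rejects an empty string
    }

    // Real-world HTML is often malformed; collect the warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first <title> element, if there is one.
    $titleNodes = $doc->getElementsByTagName('title');
    if ($titleNodes->length > 0) {
        $result['title'] = trim($titleNodes->item(0)->textContent);
    }

    // Every <meta name="..." content="..."> tag.
    foreach ($doc->getElementsByTagName('meta') as $metaNode) {
        $name = $metaNode->getAttribute('name');
        if ($name !== '') {
            $result['meta'][strtolower($name)] = $metaNode->getAttribute('content');
        }
    }

    return $result;
}

$sampleHtml = '<html><head><title>Example page</title>'
    . '<meta name="description" content="A sample description">'
    . '<meta name="keywords" content="php, sitemap">'
    . '</head><body><p>Hello</p></body></html>';

$pageInfo = extract_title_and_meta($sampleHtml);
print_r($pageInfo);
```

In the crawler, the HTML string would come from fetching each page URL collected from the sitemaps.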
