
Extracting Title With DomDocument - How Was this Possible ?



  

 

Fellow Programmers,

I was given this to extract meta tags:

	    <?php
/*
$url = "
	// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);
	//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();
	// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);
	// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
	// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();
	// https://www.php.net/manual/en/domdocument.getelementsbytagname.php
$title_tags = $doc->getElementsByTagName('title');
	if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$title = $tag->getAttribute('textContent'); echo '<br>';
    }
}
	

I could have asked for the code to extract the title, but I did not, as I thought that if I learned the above code I would be able to extract the title too. I overlooked the fact that the meta tag and the title tag are structured differently. The above code works for extracting meta tags.

I knew that if I copied the code structure outright, it would fail to extract the title from the title tag. Nevertheless, I like fiddling around, so I experimented. These attempts failed as expected; I just see a blank white page:

 

by: text

    $title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    foreach ($title_tags as $tag)
    {
        echo 'Title: ' .$title = $tag->getAttribute('text'); echo '<br>';
    }
}

  

 

by: textContent

	$title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    foreach ($title_tags as $tag)
    {
        echo 'Title: ' .$title = $tag->getAttribute('textContent'); echo '<br>';
    }
}
	

 

by: innertext

	    $title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$title = $tag->getAttribute('innertext'); echo '<br>';
    }
}
	


    

 

by: outertext

    $title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$title = $tag->getAttribute('outertext'); echo '<br>';
    }
}

 

 

by: innerhtml

	    $title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$title = $tag->getAttribute('innerhtml'); echo '<br>';
    }
}
	


    

 

by: outerhtml

    $title_tags = $doc->getElementsByTagName('title');
if ($title_tags->length > 0)
{
    // https://www.php.net/manual/en/class.domnodelist.php
    foreach ($title_tags as $tag)
    {
        // https://www.php.net/manual/en/domnodelist.item.php
        echo 'Title: ' .$title = $tag->getAttribute('outerhtml'); echo '<br>';
    }
}

    

 

All of the above failed.

This line is not correct:

echo 'Title: ' .$title = $tag->getAttribute('textContent'); echo '<br>';

So, how do I fix this line?
 

 


4 hours ago, TheStudent2023 said:

Where does the DomDocument doc show how to extract the page title?

You're generally not going to find such specific examples in a manual.  You need to use the documentation and examples given to learn the components so you can then put them together as needed to accomplish your task.

4 hours ago, TheStudent2023 said:

All above failed.

As this line is not correct

Perhaps you should look up what getAttribute does.  You might also need to read up on how the DOM works in general (the link talks about JavaScript, but the interfaces work the same in any language) if you want to continue using DOMDocument.

Spend some time reading the documentation for DOMDocument and some about the DOM and you should be able to figure out where you went wrong in your above attempts.

 


@kicken,

 

I got a little help, so I switched to this:

	echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
	

From:

	echo 'Title: ' .$title = $tag->getAttribute('textContent'); echo '<br>';
	

Working now.
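
For anyone comparing the two lines, here is a minimal sketch of the difference. It uses a made-up inline HTML string instead of a fetched page, and the markup and variable names are only for illustration. getAttribute() reads attributes written inside the opening tag, which is why it works for meta tags; the text between <title> and </title> is not an attribute, so it is read through the textContent property instead.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$html = '<html><head><title>Example Page</title>'
      . '<meta name="description" content="A short summary"></head><body></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$meta = $doc->getElementsByTagName('meta')->item(0);
$title_node = $doc->getElementsByTagName('title')->item(0);

// An attribute lives inside the tag: <meta content="...">
$meta_content = $meta->getAttribute('content');

// <title> has no attribute called "textContent", so this is an empty string.
$wrong_title = $title_node->getAttribute('textContent');

// textContent is a property of the node: the text between the tags.
$title = $title_node->textContent;

echo 'Meta content: ' . $meta_content . "\n";
echo 'getAttribute("textContent"): "' . $wrong_title . '"' . "\n";
echo 'Title: ' . $title . "\n";
```

The empty string from getAttribute() would explain why the earlier attempts printed "Title: " with nothing after it.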

Here is the code. I guess it cannot get any shorter or more concise.

	<?php    
$url = "https://www.daniweb.com/programming/web-development/threads/540013/how-to-find-does-not-contain-or-does-contain";
	// https://www.php.net/manual/en/function.file-get-contents
$html = file_get_contents($url);
	//https://www.php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument();
	// https://www.php.net/manual/en/function.libxml-use-internal-errors.php
libxml_use_internal_errors(true);
	// https://www.php.net/manual/en/domdocument.loadhtml.php
$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
	// https://www.php.net/manual/en/function.libxml-clear-errors.php
libxml_clear_errors();
	$title_tag = $doc->getElementsByTagName('title');
if ($title_tag->length>0)
{
    echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
}
die;
	?>
	

Do you see any lines there that should not be there?
Since this part of the code will not be dealing with XML files, is the libxml error-handling code above necessary?
Maybe I should add some other error handling that deals with failure to extract from regular HTML pages? Which code would you add yourself, and which page of the DomDocument documentation would you take it from?
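
On the error-handling question, a rough sketch may help. DOMDocument::loadHTML() is parsed by libxml as well, so the libxml_* calls apply to ordinary HTML pages, not only to XML files. Separately, file_get_contents() returns false when the fetch fails, and loadHTML() returns false when loading fails. The helper name extract_title() and the sample markup below are hypothetical, not something from the manual.

```php
<?php
// Hypothetical helper: returns the page title, or null if it cannot be found.
function extract_title($html)
{
    // file_get_contents() gives false on failure; loadHTML() rejects empty input.
    if (!is_string($html) || trim($html) === '') {
        return null;
    }

    // Collect parser complaints internally instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    $errors = libxml_get_errors(); // available here if you want to log them
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Sloppy markup on purpose: the <p> is never closed.
$title = extract_title('<html><head><title> Demo </title></head><body><p>Unclosed paragraph</body></html>');
$missing = extract_title('<html><body>No title here</body></html>');
$failed_fetch = extract_title(false);

var_dump($title, $missing, $failed_fetch);
```

In a real script the argument would be the result of file_get_contents($url), so a failed download and a page without a title both end up as null rather than a warning.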

The DomDocument documentation is too big. If I start reading from top to bottom, I will have forgotten the first half of what I read by the time I finish the second half. So it is better to ask some pro to point me to the right section or direction instead. Then I can go and do some digging.


2 hours ago, TheStudent2023 said:

The DomDocument doc is too big. If I start reading from top to bottom, I will forget first half of what i read by the time I end the last half.

It's not going anywhere.  You don't have to absorb every detail in the first pass.   By looking through it at least once though, you'll familiarize yourself with what's available and then when you encounter some problem hopefully you'll be able to say to yourself "I remember seeing this one thing that might be relevant here, let me go look it up again" instead of just being completely lost.  I skim through the manual usually at least once per year or so, just to familiarize myself with what's available, mostly just looking at the index pages that list the extensions/functions/features.  If something sounds interesting then I dig a bit deeper.

A large part of being a successful programmer is knowing how to find the information you need, not trying to just memorize every possible thing.  That means being familiar with good resources like the PHP Manual, MDN, W3C Standards, Can I Use, and so on such that you can quickly look stuff up when you need to. If you don't have or develop that basic skill and instead always "ask a pro" then you will find that you quickly run out of pros to ask (ex: your ban history).

 



@kicken

 

Thanks for reminding me of my ban history. But that was due to so-called cross-posting (no thanks to Benanamen), not for breaking any forum TOS. I know I push my luck every time I tag and bother you guys, but for some reason, unless I pester, my threads get no answers for 3+ days until I bump them again. Anyway, enough talk. I am now going to do something that I have been planning to do: re-invent the drawer in a design that I will be familiar with, so I can easily find things. Re-invent the wheel. Yes, I am going to build my own PARSER instead. The simple_html_dom() parser syntax looks much easier than the DomDocument() parser. I guess the former was built by some programmer who hated the official one.

Yes, I know, your eyebrows have flown to the top of your forehead and you are going to reply "NO! No! Do not re-invent the wheel. Stick to the standard parsers." But, like I said, I like procrastinating and experimenting, fiddling, playing around. It generates work experience for me. Just watch where my upcoming parser ends up. You might even like it. While I try building one, I will encounter a lot of obstacles, ask a lot of tech questions, get a lot of flavoured answers, and get pointed to a lot of PHP functions that I never would have come across had I not tried building my own parser. So yes, some positives will come out of it, not only negatives. Stay tuned and look forward to my upcoming parser thread.

About caniuse.com: what do you use it for? Checking the syntax of different languages?


Kicken

 

I saw this YouTube video on AI. One AI watches what you do on your computer, and when you forget where you saw something, you type a few words and it reminds you where you saw it, such as in a certain file or on the web. That one might come in handy. I am talking about this:

"By looking through it at least once though, you'll familiarize yourself with what's available and then when you encounter some problem hopefully you'll be able to say to yourself 'I remember seeing this one thing that might be relevant here, let me go look it up again'"


Oh bother!

I thought I had finished the crawler, but I get an error that $html_page_urls is not defined!

If you look at the first few lines of the whole script, which I give at the bottom of this post, you will notice this:

	//Preparing Crawler & Session: Initialising Variables.
	//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.
//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).


Check the final line above! You can see that this has been defined. Look again:

$html_page_urls  = []; //Same as: $html_page_urls  = array();


And so, I do not understand why I get an error that this is not defined.
I get the error on this line:

function scrape_page_data()
{
    if(array_count_values($html_page_urls)>0)

CONTEXT

<?php
	//Preparing Crawler & Session: Initialising Variables.
	//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.
	//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.
	//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
	// -----
	//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap.
$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
	$xml = file_get_contents($sitemap); //Should I stick to this line or below line ?
// parse the sitemap content to object
//$xml = simplexml_load_string($sitemap); //Should I stick to this line or above line ?
	$dom = new DOMDocument();
$dom->loadXML($xml);
	//Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
{
    //parse the index
    // retrieve properties from the sitemap object
    foreach ($xml->urlset as $urlElement) //Extracts html file urls.
    {
        // get properties
        $sitemaps[] = $sitemap_url = $urlElement->loc;
        $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
        $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
        $sitemaps_priorities[] = $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $sitemap_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
{
    //parse url set
    // retrieve properties from the sitemap object
    foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
    {
        // get properties
        $html_page_urls[] = $html_page_url = $urlElement->loc;
        $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
        $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
        $html_page_priorities[] = $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $html_page_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else
{
    //Scrape Webpage Data as current page is an html page for visitors and not an Xml SiteMap page for Crawlers.
    //scrape_page_data(); //Scrape Page Title & Meta Tags.
}
	echo 'SiteMaps Crawled: ---';echo '<br><br>';
if(array_count_values($html_page_urls)>0)
{    
    print_r($sitemaps);
    echo '<br>';
}
elseif(array_count_values($sitemaps_last_mods)>0)
{    
    print_r($sitemaps_last_mods);
    echo '<br>';
}
elseif(array_count_values($sitemaps_change_freqs)>0)
{    
    print_r($sitemaps_change_freqs);
    echo '<br>';
}
elseif(array_count_values($sitemaps_priorities)>0)
{    
    print_r($sitemaps_priorities);
    echo '<br><br>';
}
	echo 'Html Pages Crawled: ---'; echo '<br><br>';
	if(array_count_values($html_page_urls)>0)
{    
    print_r($html_page_urls);
    echo '<br>';
}
if(array_count_values($html_page_last_mods)>0)
{    
    print_r($html_page_last_mods);
    echo '<br>';
}
if(array_count_values($html_page_change_freqs)>0)
{    
    print_r($html_page_change_freqs);
    echo '<br>';
}
if(array_count_values($html_page_priorities)>0)
{    
    print_r($html_page_priorities);
    echo '<br>';
}
	scrape_page_data(); //Scrape Page Title & Meta Tags.
	function scrape_page_data()
{
    if(array_count_values($html_page_urls)>0)
    {        
        foreach($html_page_urls AS $url)
        {
            //Extract Page's Meta Data & Title.
            
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);
	            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();
	            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);
	            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
	            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();
	            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');
	            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Name: ' .$name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Content: ' .$content = $tag->getAttribute('content');  echo '<br>';
                }
            }
	            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
            }
	            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            for ($i = 0; $i < $title_tag->length; $i++) {
                echo $title_tag->item($i)->nodeValue . "\n";
            }
        }
    }
}

Folks, do test the code on your localhost and see what you get!
Puzzling!
It's 3:06am here, and I am not sleepy enough to have made a typo in the $var name!


On 5/7/2023 at 5:29 AM, kicken said:

Thanks. I know about scopes. I just overlooked the fact that I was using the $var inside a function.

The error is gone now.
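
For later readers, here is a small sketch of the scope issue. A variable defined at the top of the script is not visible inside a function unless it is declared with global or handed in as an argument. The function names and sample URLs below are made up for the illustration. The sketch also touches on the check itself: array_count_values() returns an array, and PHP treats an array as greater than any integer, so that comparison is true even for an empty list; count() is the usual way to test whether there is anything to loop over.

```php
<?php
// Hypothetical sample data, defined in the global scope.
$html_page_urls = ['https://example.com/a.html', 'https://example.com/b.html'];

// Alternative to "global": hand the array to the function as an argument.
function count_pages(array $urls): int
{
    return count($urls);
}

// Inside a function, the global $html_page_urls simply does not exist.
function count_pages_without_argument(): int
{
    return isset($html_page_urls) ? count($html_page_urls) : 0;
}

$with_argument = count_pages($html_page_urls);
$without_argument = count_pages_without_argument();

// array_count_values() returns an array; an array compares as greater than
// any integer, so this is true even though the list is empty.
$always_true = array_count_values([]) > 0;

// count() gives the number of elements, which is what the check needs.
$empty_check = count([]) > 0;

var_dump($with_argument, $without_argument, $always_true, $empty_check);
```

Passing the array in keeps the function usable with any list of URLs, which may be handy once the crawler has more than one source of links.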

Nevertheless, why do I get no URLs crawled or extracted? I get no errors either. Strange! I just get this echoed:

SiteMaps Crawled: ---

Array ( )
Html Pages Crawled: ---

Array ( )
Array ( )
Array ( )
Array ( )

 

As you can see, the starting-point link does have URLs on its pages:

	<?php
	ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);
	 
	//Preparing Crawler & Session: Initialising Variables.
	//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.
	//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = array(); //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.
	//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
	// -----
	//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap.
$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
	$xml = file_get_contents($sitemap); //Should I stick to this line or below line ?
// parse the sitemap content to object
//$xml = simplexml_load_string($sitemap); //Should I stick to this line or above line ?
	$dom = new DOMDocument();
$dom->loadXML($xml);
	//Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
{
    //parse the index
    // retrieve properties from the sitemap object
    foreach ($xml->urlset as $urlElement) //Extracts html file urls.
    {
        // get properties
        $sitemaps[] = $sitemap_url = $urlElement->loc;
        $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
        $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
        $sitemaps_priorities[] = $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $sitemap_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
{
    //parse url set
    // retrieve properties from the sitemap object
    foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
    {
        // get properties
        $html_page_urls[] = $html_page_url = $urlElement->loc;
        $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
        $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
        $html_page_priorities[] = $priority = $urlElement->priority;
	        // print out the properties
        echo 'url: '. $html_page_url . '<br>';
        echo 'lastmod: '. $last_mod . '<br>';
        echo 'changefreq: '. $change_freq . '<br>';
        echo 'priority: '. $priority . '<br>';
	        echo '<br>---<br>';
    }
}
else
{
    //Scrape Webpage Data as current page is an html page for visitors and not an Xml SiteMap page for Crawlers.
    //scrape_page_data(); //Scrape Page Title & Meta Tags.
}
	echo 'SiteMaps Crawled: ---';echo '<br><br>';
if(array_count_values($html_page_urls)>0)
{    
    print_r($sitemaps);
    echo '<br>';
}
elseif(array_count_values($sitemaps_last_mods)>0)
{    
    print_r($sitemaps_last_mods);
    echo '<br>';
}
elseif(array_count_values($sitemaps_change_freqs)>0)
{    
    print_r($sitemaps_change_freqs);
    echo '<br>';
}
elseif(array_count_values($sitemaps_priorities)>0)
{    
    print_r($sitemaps_priorities);
    echo '<br><br>';
}
	echo 'Html Pages Crawled: ---'; echo '<br><br>';
	if(array_count_values($html_page_urls)>0)
{    
    print_r($html_page_urls);
    echo '<br>';
}
if(array_count_values($html_page_last_mods)>0)
{    
    print_r($html_page_last_mods);
    echo '<br>';
}
if(array_count_values($html_page_change_freqs)>0)
{    
    print_r($html_page_change_freqs);
    echo '<br>';
}
if(array_count_values($html_page_priorities)>0)
{    
    print_r($html_page_priorities);
    echo '<br>';
}
	scrape_page_data(); //Scrape Page Title & Meta Tags.
	function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {        
        foreach($html_page_urls AS $url)
        {
            //Extract Page's Meta Data & Title.
            
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);
	            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();
	            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);
	            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
	            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();
	            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');
	            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Name: ' .$name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Content: ' .$content = $tag->getAttribute('content');  echo '<br>';
                }
            }
	            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
            }
	            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
	            for ($i = 0; $i < $title_tag->length; $i++) {
                echo $title_tag->item($i)->nodeValue . "\n";
            }
        }
    }
}
	?>
	

That is my latest update. What do you think about it, and why do you think I am getting no links echoed?
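
One thing that may explain the empty arrays, sketched here with a made-up inline sitemap instead of the real one: on a DOMDocument object, nodeName is the name of the document node itself ("#document"), not of the root element, so neither the 'sitemapindex' nor the 'urlset' comparison can match. The root element is reached through documentElement. Also, $xml in the script is the plain string returned by file_get_contents(), so it has no urlset or sitemapindex properties to loop over. The example.com URLs are placeholders.

```php
<?php
// Hypothetical inline sitemap index, standing in for the downloaded file.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Name of the document node itself.
$document_node_name = $dom->nodeName;

// Name of the root element, which is what tells the two file types apart.
$root_name = $dom->documentElement->nodeName;

// Read the <loc> values from the parsed DOM rather than from the string.
$locs = [];
foreach ($dom->getElementsByTagName('loc') as $loc) {
    $locs[] = trim($loc->textContent);
}

echo 'Document node: ' . $document_node_name . "\n";
echo 'Root element: ' . $root_name . "\n";
print_r($locs);
```

With the root element name in hand, a 'sitemapindex' file would hold links to more sitemaps and a 'urlset' file would hold the page links.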

 


@kicken

I am scratching my head over why no links, meta tags, or titles are getting extracted by this crawler. I have done all the basic logic. See for yourself. Have I missed any logic?

Can the code below get any shorter, so that I can easily spot where the issue is? I get no error and no proper result either.
I just get this echoed:

SiteMaps Crawled: ---

Array ( )
Html Pages Crawled: ---

Array ( )
Array ( )
Array ( )
Array ( )

 

I am really really puzzled.
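
On getting the crawler shorter, one possible shape is a single function that loads a sitemap, looks at the root element, and calls itself for nested sitemaps. This is only a sketch under assumptions: the function name collect_page_urls(), the $fetch callable, and the in-memory $fake_site array are all invented here so the example runs without touching the network.

```php
<?php
// Hypothetical helper: returns the page URLs found under a sitemap URL.
// $fetch is any callable that returns the XML string for a URL, or false.
function collect_page_urls(string $sitemap_url, callable $fetch, array &$seen = []): array
{
    // Skip sitemaps that were already visited, to avoid endless loops.
    if (isset($seen[$sitemap_url])) {
        return [];
    }
    $seen[$sitemap_url] = true;

    $xml = $fetch($sitemap_url);
    if (!is_string($xml) || trim($xml) === '') {
        return [];
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded || $dom->documentElement === null) {
        return [];
    }

    $locs = [];
    foreach ($dom->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->textContent);
    }

    $root = $dom->documentElement->nodeName;
    if ($root === 'urlset') {
        return $locs; // these are the page links
    }

    $pages = [];
    if ($root === 'sitemapindex') {
        foreach ($locs as $child_sitemap) { // these are links to more sitemaps
            $pages = array_merge($pages, collect_page_urls($child_sitemap, $fetch, $seen));
        }
    }
    return $pages;
}

// Made-up two-level site kept in memory.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$fake_site = [
    'https://example.com/sitemap_index.xml' =>
        '<sitemapindex xmlns="' . $ns . '"><sitemap><loc>https://example.com/pages.xml</loc></sitemap></sitemapindex>',
    'https://example.com/pages.xml' =>
        '<urlset xmlns="' . $ns . '"><url><loc>https://example.com/</loc></url>'
        . '<url><loc>https://example.com/about.html</loc></url></urlset>',
];
$fetch = function (string $url) use ($fake_site) {
    return $fake_site[$url] ?? false;
};

$pages = collect_page_urls('https://example.com/sitemap_index.xml', $fetch);
print_r($pages);
```

For a live crawl, 'file_get_contents' could be passed as $fetch. Keeping the fetch step as a parameter is a design choice made here so the logic can be checked against small hand-written sitemaps first.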

FULL CRAWLER

<?php
	ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);
	//Preparing Crawler & Session: Initialising Variables.
	//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemap.
$sitemaps_change_freqs  = []; //This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.
	//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = array(); //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.
	//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
	// -----
	//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap.
$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
	$xml = file_get_contents($sitemap); //Should I stick to this line or below line ?
// parse the sitemap content to object
//$xml = simplexml_load_string($sitemap); //Should I stick to this line or above line ?
	$dom = new DOMDocument();
$dom->loadXML($xml);
	extract_links();
	function extract_links()
{
    GLOBAL $dom;
    //Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
    if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
    {
        //parse the index
        // retrieve properties from the sitemap object
        foreach ($xml->urlset as $urlElement) //Extracts html file urls.
        {
            // get properties
            $sitemaps[] = $sitemap_url = $urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
            $sitemaps_priorities[] = $priority = $urlElement->priority;
	            // print out the properties
            echo 'url: '. $sitemap_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';
	            echo '<br>---<br>';
        }
    }
    else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
    {
        //parse url set
        // retrieve properties from the sitemap object
        foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
        {
            // get properties
            $html_page_urls[] = $html_page_url = $urlElement->loc;
            $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
            $html_page_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $html_page_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    }
    else
    {
        //Scrape Webpage Data as current page is an html page for visitors, not an Xml SiteMap page for Crawlers.
        //scrape_page_data(); //Scrape Page Title & Meta Tags.
    }
    
    GLOBAL $sitemaps;
    GLOBAL $sitemaps_last_mods;
    GLOBAL $sitemaps_change_freqs;
    GLOBAL $sitemaps_priorities;
    
    GLOBAL $html_page_urls;
    GLOBAL $html_page_last_mods;
    GLOBAL $html_page_change_freqs;
    GLOBAL $html_page_priorities;
    
    echo 'SiteMaps Crawled: ---'; echo '<br><br>';
    if(array_count_values($sitemaps)>0)
    {    
        print_r($sitemaps);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_last_mods)>0)
    {    
        print_r($sitemaps_last_mods);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_change_freqs)>0)
    {    
        print_r($sitemaps_change_freqs);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_priorities)>0)
    {    
        print_r($sitemaps_priorities);
        echo '<br><br>';
    }

    echo 'Html Pages Crawled: ---'; echo '<br><br>';

    if(array_count_values($html_page_urls)>0)
    {    
        print_r($html_page_urls);
        echo '<br>';
    }
    if(array_count_values($html_page_last_mods)>0)
    {    
        print_r($html_page_last_mods);
        echo '<br>';
    }
    if(array_count_values($html_page_change_freqs)>0)
    {    
        print_r($html_page_change_freqs);
        echo '<br>';
    }
    if(array_count_values($html_page_priorities)>0)
    {    
        print_r($html_page_priorities);
        echo '<br>';
    }
}

foreach($sitemaps AS $sitemap)
{
    extract_links();
}

foreach($html_page_urls AS $html_page_url)
{
    extract_links();
}

scrape_page_data(); //Scrape Page Title & Meta Tags.

function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {        
        foreach($html_page_urls AS $url)
        {
            //Extract Page's Meta Data & Title.
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);

            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();

            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);

            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();

            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');

            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Name: ' .$name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Content: ' .$content = $tag->getAttribute('content');  echo '<br>';
                }
            }

            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
            }

            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');

            for ($i = 0; $i < $title_tag->length; $i++) {
                echo $title_tag->item($i)->nodeValue . "\n";
            }
        }
    }
}
?>

I want to see how much you are able to shorten it.
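For comparison, one way the title and meta extraction might be condensed is sketched here. The function name `extract_title_and_meta()` and the inline sample HTML are assumptions made for illustration, not part of the code above. The sketch takes an HTML string instead of fetching a URL so that it runs without a network connection.

```php
<?php
// Hypothetical helper (not from the original post): takes an HTML string and
// returns the page title plus name => content pairs for its meta tags.
function extract_title_and_meta(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    libxml_clear_errors();

    // The <title> text is the node's textContent, not an attribute.
    $title_node = $doc->getElementsByTagName('title')->item(0);
    $title = $title_node ? trim($title_node->textContent) : '';

    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $name = $tag->getAttribute('name');
        if ($name !== '') {              // skip tags such as <meta charset="...">
            $meta[$name] = $tag->getAttribute('content');
        }
    }

    return ['title' => $title, 'meta' => $meta];
}

// Inline sample page (an assumption for the demo).
$sample_html = '<html><head><title> Example Page </title>'
    . '<meta name="description" content="A sample description">'
    . '<meta charset="utf-8"></head><body><p>Hi</p></body></html>';

$page = extract_title_and_meta($sample_html);
echo 'Title: ' . htmlspecialchars($page['title']) . "<br>\n";
foreach ($page['meta'] as $name => $content) {
    echo htmlspecialchars($name) . ': ' . htmlspecialchars($content) . "<br>\n";
}
```

Returning the values instead of echoing them inside the loop keeps the function reusable; for a real page the string would come from something like `file_get_contents($url)`.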
 

Thanks


@ignace

 

Since you are an expert in preventing malicious injections, is my crawler code safe? Can hackers trap crawlers on their sites? Say a crook lures my crawler to one of his malicious or phishing sites: could he inject a virus so that my crawler dumps malicious code into my search engine index? Or, worse, could my crawler carry the virus to the other sites it crawls and infect them? Good question, yes?

What do you think of my code above? Is it orthodox or weird? I cannot think of any better basic logic than what I used. What do you say?
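On the safety question, a rough sketch of two common precautions follows. `DOMDocument` only builds a tree from the markup and does not run scripts found in the page, so the more likely risk is in what the script later does with scraped strings, for example echoing them unescaped into a results page. The helper names `is_crawlable_url()` and `safe_text()` are hypothetical, and this is only an illustration under those assumptions, not a complete defence.

```php
<?php
// Hypothetical helper: accept only absolute http(s) URLs before fetching,
// so a crafted link such as file:///... is never passed to file_get_contents().
function is_crawlable_url(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

// Hypothetical helper: escape a scraped string before echoing it into HTML.
function safe_text(string $scraped): string
{
    return htmlspecialchars($scraped, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// A title as a hostile page might supply it (sample data for the demo).
$scraped_title = '<script>alert("x")</script> My Page';
echo 'Title: ' . safe_text($scraped_title) . "<br>\n";

var_dump(is_crawlable_url('https://example.com/sitemap.xml')); // accepted
var_dump(is_crawlable_url('file:///etc/passwd'));              // rejected: not http(s)
```

The same idea applies when the scraped values are stored: treating them as untrusted input (escaped on output, bound as parameters in SQL) matters more than anything the parser does.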

Edited by TheStudent2023

Damn! I give up for tonight! It is nearly 2am and I still cannot figure out why my crawler fails to extract links, meta data and page titles!

Here is the latest code. Do you see any flaws? I get no errors. What the heck is wrong?

I put my code inside functions this time to make it look neater.

For some reason, this forum messes up my code indentation, so it is best to copy and paste the following and test it on your localhost.

FULL CODE

````

<?php

ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(E_ALL);
 

//START OF SCRIPT FLOW.

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//Data Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs  = []; //This will list SiteMap pages update frequencies - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Data Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //This will list html pages update frequencies - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.

//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
// -----

//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawling Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap.
$initial_url = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

$xml = file_get_contents($initial_url); //Should I stick to this line or below line ?
//Parse the sitemap content to object
//$xml = simplexml_load_string($initial_url); //Should I stick to this line or above line ?

$dom = new DOMDocument();
$dom->loadXML($xml);

echo __LINE__; echo '<br>'; //LINE: 334

extract_links($xml);

echo __LINE__; echo '<br>';  //LINE: 338

foreach($sitemaps AS $sitemap)
{
    echo __LINE__; echo '<br>';
    extract_links($sitemap); //Extract Links on page.
}

foreach($html_page_urls AS $html_page_url)
{
    echo __LINE__; echo '<br>';
    extract_links($html_page_url); //Extract Links on page.
}

scrape_page_data(); //Scrape Page Title & Meta Tags.

//END OF SCRIPT FLOW.

//FUNCTIONS BEYOND THIS POINT.

//Links Extractor.
function extract_links()
{
    echo __LINE__; echo '<br>';  //LINE: 361
    
    GLOBAL $dom;
    //Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
    if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
    {
        echo __LINE__; echo '<br>';
        
        //parse the index
        // retrieve properties from the sitemap object
        foreach ($xml->urlset as $urlElement) //Extracts html file urls.
        {
            // get properties
            $sitemaps[] = $sitemap_url = $urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
            $sitemaps_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $sitemap_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    }
    else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
    {
        echo __LINE__; echo '<br>';
        
        //parse url set
        // retrieve properties from the sitemap object
        foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls.
        {
            // get properties
            $html_page_urls[] = $html_page_url = $urlElement->loc;
            $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
            $html_page_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $html_page_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    }
    
    GLOBAL $sitemaps;
    GLOBAL $sitemaps_last_mods;
    GLOBAL $sitemaps_change_freqs;
    GLOBAL $sitemaps_priorities;
    
    GLOBAL $html_page_urls;
    GLOBAL $html_page_last_mods;
    GLOBAL $html_page_change_freqs;
    GLOBAL $html_page_priorities;
    
    echo 'SiteMaps Crawled: ---'; echo '<br><br>';
    if(array_count_values($sitemaps)>0)
    {    
        print_r($sitemaps);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_last_mods)>0)
    {    
        print_r($sitemaps_last_mods);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_change_freqs)>0)
    {    
        print_r($sitemaps_change_freqs);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_priorities)>0)
    {    
        print_r($sitemaps_priorities);
        echo '<br><br>';
    }

    echo 'Html Pages Crawled: ---'; echo '<br><br>';

    if(array_count_values($html_page_urls)>0)
    {    
        print_r($html_page_urls);
        echo '<br>';
    }
    if(array_count_values($html_page_last_mods)>0)
    {    
        print_r($html_page_last_mods);
        echo '<br>';
    }
    if(array_count_values($html_page_change_freqs)>0)
    {    
        print_r($html_page_change_freqs);
        echo '<br>';
    }
    if(array_count_values($html_page_priorities)>0)
    {    
        print_r($html_page_priorities);
        echo '<br>';
    }
}

//Meta Data & Title Extractor.
function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {        
        foreach($html_page_urls AS $url)
        {
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);

            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();

            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);

            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();

            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');

            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Meta Name: ' .$meta_name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Meta Content: ' .$meta_content = $tag->getAttribute('content');  echo '<br>';
                    $html_page_meta_names[] = $meta_name;
                    $html_page_meta_descriptions[] = $meta_content;
                }
            }

            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
                $html_page_titles[] = $title;
            }

            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');

            for ($i = 0; $i < $title_tag->length; $i++) {
                echo 'Title: ' .$title = $title_tag->item($i)->nodeValue . "\n";
                $html_page_titles[] = $title;
            }
        }
    }
}

if(array_count_values($html_page_meta_names)>0)
{    
    print_r($html_page_meta_names);
    echo '<br>';
}

if(array_count_values($html_page_meta_descriptions)>0)
{    
    print_r($html_page_meta_descriptions);
    echo '<br>';
}

if(array_count_values($html_page_titles)>0)
{    
    print_r($html_page_titles);
    echo '<br>';
}

//END OF FUNCTIONS.

````

This is all that gets echoed. Notice that the arrays are empty, which means no data is being extracted from the pages.

334
361
SiteMaps Crawled: ---

Array ( )
Html Pages Crawled: ---

Array ( )
Array ( )
Array ( )
Array ( )
338
Array ( )
Array ( )
Array ( )
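A few things in the code above might explain the empty arrays. On a `DOMDocument`, `nodeName` is `#document`, so neither the `sitemapindex` nor the `urlset` branch is ever entered; the root element is on `$dom->documentElement`. Inside `extract_links()`, `$xml` is not the global variable (and the global is a plain string anyway), the `GLOBAL` lines come after the arrays are appended to, and the function is declared without parameters, so the arguments passed to it are ignored. Finally, `array_count_values()` returns an array, and an array compared with `0` is always the greater one, which would be why `Array ( )` prints even when nothing was collected; `count()` is the usual emptiness test. The sketch below shows one possible shape for the sitemap step under those assumptions. `parse_sitemap()` and the sample XML are hypothetical and use an inline string instead of a live fetch.

```php
<?php
// Hypothetical helper: parse one sitemap document passed in as a string.
// Returns the root element name and the <loc> values found in it.
function parse_sitemap(string $xml): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml) || $dom->documentElement === null) {
        return ['type' => null, 'locs' => []];
    }

    // The root element hangs off documentElement; the DOMDocument's own
    // nodeName is "#document".
    $type = $dom->documentElement->localName; // 'sitemapindex' or 'urlset'

    $locs = [];
    foreach ($dom->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->textContent);
    }

    return ['type' => $type, 'locs' => $locs];
}

// Sample sitemap index (made-up URLs for the demo).
$index_xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>
XML;

$result = parse_sitemap($index_xml);

$sitemaps = [];
$html_page_urls = [];
if ($result['type'] === 'sitemapindex') {
    $sitemaps = $result['locs'];        // links to further .xml sitemaps
} elseif ($result['type'] === 'urlset') {
    $html_page_urls = $result['locs'];  // links to html pages
}

// count() tests for emptiness; array_count_values() does not.
if (count($sitemaps) > 0) {
    print_r($sitemaps);
}
```

Passing the XML in and returning the results avoids the `GLOBAL` ordering problem altogether; each URL in `$sitemaps` could then be fetched and handed to the same function.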

 

 

Edited by TheStudent2023