
Count the number of links or some other tag from a website.


SecureMind


My PHP skills are modest. I would like to visit a given URL and count how many links (or other elements, such as img tags) exist on the page, i.e.:

 

1.) take a url like "http://www.google.com"

2.) visit that url from within the script

3.) walk through the code of the page finding all the href instances

4.) return a count of the number of links found

 

Sounds simple enough but I fear it won't be. I assume that I should start with curl but I'm not sure. Any advice is appreciated.
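
For reference, the four steps above could be sketched roughly as follows. This is only one possible approach, assuming the cURL and DOM extensions are available; fetchPage() and countLinks() are made-up helper names, not built-in functions, and the sample markup is invented so the counting step runs without network access.

```php
<?php

// Hypothetical helper: fetch a page body with cURL, or return null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: count the <a> elements that carry an href attribute.
function countLinks(string $html): int
{
    if (trim($html) === '') {
        return 0;
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    return $xpath->query('//a[@href]')->length;
}

// Invented sample: two real links and one named anchor without an href.
$sample = '<html><body><a href="/one">One</a> <a name="top">Anchor</a> <a href="/two">Two</a></body></html>';
echo countLinks($sample), "\n"; // prints 2
```

For a live page, something like countLinks(fetchPage("http://www.google.com") ?? '') should work, with a failed fetch simply counting as zero links.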

You're going to want to use preg_match_all with a pattern such as "<a [^>]+>"

 

http://www.php.net/manual/en/function.preg-match-all.php

 

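<?php

// Fetch the page; file_get_contents() returns false on failure
// (remote URLs require allow_url_fopen to be enabled).
$contents = file_get_contents("http://www.google.com");
if ($contents === false) {
    die("Could not fetch the page.");
}

// Count opening <a ...> tags; the "i" flag also matches uppercase <A ...>.
$count = preg_match_all('@<a\s[^>]*>@i', $contents, $matches);

echo $count;
print_r($matches);

?>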

 

Edit: Tested code / updated above.
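
A note on the pattern, since it is easy to get subtly wrong: without whitespace after the tag name, something like `<a[^>]+>` also matches tags such as `<abbr>` or `<address>`, and without the `i` flag it misses uppercase `<A HREF=...>`. The sketch below compares the two on an invented snippet of markup; the variable names are just for illustration.

```php
<?php

// Invented sample markup: an <abbr>, an <address>, and two links (one uppercase).
$html = '<abbr title="x">HT</abbr> <address>Somewhere</address> '
      . '<a href="/a">A</a> <A HREF="/b">B</A>';

// Loose pattern: any tag whose name merely starts with "a", lowercase only.
$loose = preg_match_all('@<a[^>]+>@', $html);

// Stricter pattern: whitespace must follow the tag name; "i" ignores case.
$strict = preg_match_all('@<a\s[^>]*>@i', $html);

echo "loose: $loose, strict: $strict\n"; // prints "loose: 3, strict: 2"
```

Even the stricter pattern counts `<a name="...">` anchors that have no href, so a regex count is best treated as an approximation.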

I did some more digging based on the links you suggested and found: http://www.phpfreaks.com/forums/index.php?topic=317646.msg1497723#msg1497723

 

Based on the initial post in that thread I came up with:

 

<?php

if (isset($_POST['Submit'])) {


    // First function: collect the links found on the page
    function urlstatspoller($link)
    {
        $ret = array(); // one entry per link found
        $dom = new DOMDocument; // sets up a new DOM object
        $dom->preserveWhiteSpace = false; // only takes effect if set before loading
        $html = @file_get_contents($link); // fetches the HTML of the page; false on failure
        if ($html === false || $html === '')
        {
            return $ret; // nothing fetched, so nothing to count
        }
        @$dom->loadHTML($html); // parses the HTML while suppressing warnings about malformed markup
        $links = $dom->getElementsByTagName('a'); // collects every "a" tag on the page
        // Walk through each "a" tag and keep only those with an href, i.e. real links
        foreach ($links as $tag)
        {
            if ($tag->hasAttribute('href'))
            {
                // Append rather than key by href, so repeated URLs are all counted
                $ret[] = array($tag->getAttribute('href'), $tag->textContent);
            }
        }

        return $ret;
    }

    // Second function: collect the images found on the page
    function imgstatspoller($link)
    {
        $ret = array(); // one entry per image found
        $dom = new DOMDocument; // sets up a new DOM object
        $dom->preserveWhiteSpace = false; // only takes effect if set before loading
        $html = @file_get_contents($link); // fetches the HTML of the page; false on failure
        if ($html === false || $html === '')
        {
            return $ret; // nothing fetched, so nothing to count
        }
        @$dom->loadHTML($html); // parses the HTML while suppressing warnings about malformed markup
        $images = $dom->getElementsByTagName('img'); // collects every "img" tag on the page
        // Walk through each "img" tag and keep only those with a src
        foreach ($images as $tag)
        {
            if ($tag->hasAttribute('src'))
            {
                // "img" tags have no child nodes, so store the alt text instead
                $ret[] = array($tag->getAttribute('src'), $tag->getAttribute('alt'));
            }
        }

        return $ret;
    }

    // Get the link to search from the web form
    $link = $_POST['address'];

    // Call the URL stats polling function
    $urls = urlstatspoller($link);

    // Call the image stats polling function
    $imgs = imgstatspoller($link);

    // Output findings, echoed in CSV form as: # of links, # of images,
    // count() already returns 0 for an empty array, so no special case is needed
    echo count($urls), ",", count($imgs), ",";
}
?>
<br /><br />
<form action="" method="post" enctype="multipart/form-data" name="link">
<input name="address" type="text" value="" />
<input name="Submit" type="Submit" />
</form>

 

I'm sure there is a more elegant solution, and even this needs a little more work to fit my input/output format, but it will do for now. I'm just trying to get some really basic stats from a list of links.
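
One way the two near-identical functions might be folded into a single helper, sketched under a few assumptions: countTags() and statsRow() are hypothetical names, the $pages array stands in for HTML that would really come from file_get_contents() or cURL, and the URLs in it are placeholders.

```php
<?php

// Hypothetical helper: count how many elements with the given tag name appear in $html.
function countTags(string $html, string $tagName): int
{
    if (trim($html) === '') {
        return 0;
    }
    $dom = new DOMDocument();
    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->getElementsByTagName($tagName)->length;
}

// Hypothetical helper: build one row (url, number of "a" tags, number of "img" tags).
function statsRow(string $url, string $html): array
{
    return array($url, countTags($html, 'a'), countTags($html, 'img'));
}

// Placeholder input: URL => already-fetched HTML.
$pages = array(
    'http://example.com/one' => '<a href="/x">x</a><a href="/x">x again</a><img src="a.png">',
    'http://example.com/two' => '<p>No links here</p>',
);

foreach ($pages as $url => $html) {
    // Plain comma-joined output; URLs containing commas would need proper CSV quoting.
    echo implode(',', statsRow($url, $html)), "\n";
}
```

Note that this counts every "a" tag, including repeated URLs and anchors without an href, so the numbers can differ from a count of unique links.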

 

Thanks for both of your help! I really appreciate it.  :D

