My PHP skills are modest. I would like to be able to visit a given URL and count how many links (or other items, such as img tags) exist on the page, i.e.:

 

1.) take a url like "http://www.google.com"

2.) visit that url from within the script

3.) walk through the code of the page finding all the href instances

4.) return a count of the number of links found

 

Sounds simple enough but I fear it won't be. I assume that I should start with curl but I'm not sure. Any advice is appreciated.

You're going to want to use preg_match_all with a pattern such as "<a [^>]+>" (note the space after the "a", so that tags like <abbr> are not matched).

 

http://www.php.net/manual/en/function.preg-match-all.php

 

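<?php

// Fetch the page; file_get_contents() returns false on failure.
$contents = file_get_contents("http://www.google.com");
if ($contents === false) {
    die("Could not fetch the page.");
}

// Count the opening <a ...> tags; the "i" flag also matches <A ...>.
$count = preg_match_all("@<a [^>]+>@i", $contents, $matches);

echo $count;
print_r($matches);

?>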

 

Edit: Tested code / updated above.
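
Since the original question mentioned curl: file_get_contents can only fetch URLs when allow_url_fopen is enabled, so cURL is a common alternative. The following is a minimal sketch, not tested against any particular site. The function names fetch_html and count_anchor_tags are hypothetical, and the timeout and redirect settings are assumptions.

```php
<?php
// Hypothetical helper: fetch a page with cURL; returns null on failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (assumed to be wanted)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds (arbitrary choice)
    $body = curl_exec($ch);                         // false on failure, string on success
    return is_string($body) ? $body : null;
}

// Hypothetical helper: count opening <a ...> tags, case-insensitively.
function count_anchor_tags(string $html): int
{
    return (int) preg_match_all('@<a\s[^>]*>@i', $html);
}
```

Calling count_anchor_tags(fetch_html("http://www.google.com") ?? '') would then give the link count, or 0 if the fetch failed.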

I did some more digging based on the links you suggested and found: http://www.phpfreaks.com/forums/index.php?topic=317646.msg1497723#msg1497723

 

Based on the initial post in that thread I came up with:

 

<?php

if (isset($_POST['Submit'])) {


    // First function: get the links from the page
    function urlstatspoller($link)
    {
        $ret = array(); // returned array: href => link text
        $html = @file_get_contents($link); // fetches the page's HTML while suppressing any errors
        if ($html === false || $html === '')
        {
            return $ret; // nothing fetched, so there are no links to report
        }
        $dom = new DOMDocument; // sets up a new DOM object
        $dom->preserveWhiteSpace = false; // does not preserve whitespace; set before loading
        @$dom->loadHTML($html); // parses the HTML while suppressing warnings about invalid markup
        $links = $dom->getElementsByTagName('a'); // collects every "a" tag in the page
        // Loop through each "a" tag and check for an href to make sure it's a link
        foreach ($links as $tag)
        {
            if ($tag->hasAttribute('href'))
            {
                // Keyed by href, so a URL that appears more than once is counted once
                $ret[$tag->getAttribute('href')] = $tag->textContent;
            }
        }

        return $ret;
    }

    // Second function: get the images from the page
    function imgstatspoller($link)
    {
        $ret = array(); // returned array: src => alt text
        $html = @file_get_contents($link); // fetches the page's HTML while suppressing any errors
        if ($html === false || $html === '')
        {
            return $ret; // nothing fetched, so there are no images to report
        }
        $dom = new DOMDocument; // sets up a new DOM object
        $dom->preserveWhiteSpace = false; // does not preserve whitespace; set before loading
        @$dom->loadHTML($html); // parses the HTML while suppressing warnings about invalid markup
        $images = $dom->getElementsByTagName('img'); // collects every "img" tag in the page
        // Loop through each "img" tag and check for a src to make sure it's an image
        foreach ($images as $tag)
        {
            if ($tag->hasAttribute('src'))
            {
                // "img" tags have no child nodes, so store the alt text instead
                $ret[$tag->getAttribute('src')] = $tag->getAttribute('alt');
            }
        }

        return $ret;
    }

    // Get the link to search from the web form
    $link = trim($_POST['address']);

    // Only accept http(s) URLs, so the form cannot be used to read local files
    if (!preg_match('@^https?://@i', $link))
    {
        die("Please enter an http:// or https:// URL.");
    }

    // Call the URL stats polling function
    $urls = urlstatspoller($link);

    // Call the image stats polling function
    $imgs = imgstatspoller($link);

    // Output findings in CSV style: # of links, # of images
    // count() returns 0 for an empty array, so no separate zero case is needed
    echo count($urls), ",";
    echo count($imgs), ",";
}
?>
<br /><br />
<form action="" method="post" enctype="multipart/form-data" name="link">
<input name="address" type="text" value="" />
<input name="Submit" type="Submit" />
</form>

 

I'm sure that there is a more elegant solution, and even this needs a little more work to fit my input/output format needs, but it will work for now. I'm just trying to get some really basic stats from a list of links.
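
One possible tidier variant, as a rough sketch only: parse the HTML once and count any list of tag names, instead of fetching and parsing the page separately for links and images. The function name count_tags and the sample HTML string are hypothetical. It also counts every tag, so a URL repeated on the page is counted each time, unlike the href-keyed arrays above.

```php
<?php
// Hypothetical helper: parse the HTML once and count each requested tag.
function count_tags(string $html, array $tagNames): array
{
    $counts = array_fill_keys($tagNames, 0);
    if (trim($html) === '') {
        return $counts; // nothing to parse
    }

    // Collect parser warnings internally instead of silencing them with "@".
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tagNames as $name) {
        $counts[$name] = $dom->getElementsByTagName($name)->length;
    }
    return $counts;
}

// Example with an inline HTML string (an assumption, not a real page).
$sample = '<p><a href="/a">A</a> <a href="/a">A again</a> <img src="x.png" alt=""></p>';
$result = count_tags($sample, array('a', 'img'));
echo $result['a'], ",", $result['img'], "\n";
```

For a real page, the string passed in would be whatever file_get_contents (or cURL) returned for the submitted address.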

 

Thanks for both of your help! I really appreciate it.  :D
