
Help with a function to crawl for links across a whole website


dsp77


I'm trying to crawl a specific website for links and show them at the end. The problem I'm facing is that it only shows the links from the one page I submit, not from all the pages on the website. I tried several loops with no success. Please give me some advice.

Here is the code:

<?php
if (isset($_POST['Submit'])) {


    function getLinks($link)
    {
        /*** return array ***/
        $ret = array();

        /*** a new dom object ***/
        $dom = new DOMDocument();

        /*** remove silly white space (set before loading to take effect) ***/
        $dom->preserveWhiteSpace = false;

        /*** get the HTML (suppress errors) ***/
        @$dom->loadHTML(file_get_contents($link));

        /*** get the links from the HTML ***/
        $links = $dom->getElementsByTagName('a');
    
        /*** loop over the links ***/
        foreach ($links as $tag)
        {
            $ret[$tag->getAttribute('href')] = trim($tag->textContent);
        }

        return $ret;
    }

    /*** a link to search ***/
    $link = $_POST['address'];

    /*** get the links ***/
    $urls = getLinks($link);

    /*** check for results ***/
    if(sizeof($urls) > 0)
    {
        foreach ($urls as $key => $value)
        {
            /*** absolute http(s) URLs are treated as external ***/
            if (preg_match('/^https?:\/\//i', $key)) {
                echo '<span style="color:RED;">' . $key . ' - external</span><br />';
            } else {
                echo '<span style="color:BLUE;">' . $link . $key . ' - internal</span><br />';
            }
        }
    }
    else
    {
        echo "No links found at $link";
    }
}
?>
<br /><br />
<form action="" method="post" enctype="multipart/form-data" name="link">
<input name="address" type="text" value="" />
<input name="Submit" type="Submit" />
</form>

Depending on how you do this, it could go on forever, since pages on the same site usually link back to each other.

Inside your code you will, I think, have to look at the first page, get all the "internal" links and systematically follow them. Here is what I mean:

$links = array();

getLinks($_POST['address'], $links);

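function getLinks($url, &$links) {
  // Pseudocode: FOUND_LINK and INTERNAL are placeholders, not real constants.
  // Do the lookup and extract each link from $url.
  if (FOUND_LINK == INTERNAL) {
    getLinks(FOUND_LINK, $links); // follow internal links recursively
  } else {
    $links[] = FOUND_LINK;        // just record external links
  }
}

Or something like that.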
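To make the idea above a bit more concrete, here is a rough sketch of how it could look, not a finished crawler. It uses a queue plus a "visited" list instead of plain recursion so it cannot loop forever, and it stops after a page limit. The names crawlSite, extractHrefs, resolveUrl, $fetch and $maxPages are my own inventions for illustration. The URL resolution is deliberately simple (it does not handle "../" or query-only links), so treat it as an assumption-laden starting point.

```php
<?php
// Hypothetical sketch: all function names here are illustrative, not from the thread.

// Return every non-empty href found in <a> tags of an HTML string.
function extractHrefs(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Simplified resolver: absolute, protocol-relative, root-relative and
// same-directory links only. Returns null for links that should be skipped.
function resolveUrl(string $base, string $href): ?string
{
    $href = (string) preg_replace('/#.*$/', '', $href); // drop fragments
    if ($href === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null; // mailto:, javascript:, tel: and so on
    }
    $p = parse_url($base);
    if (!isset($p['scheme'], $p['host'])) {
        return null;
    }
    if (strpos($href, '//') === 0) {
        return $p['scheme'] . ':' . $href;
    }
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($href[0] === '/') {
        return $root . $href;
    }
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the current page
    return $root . $dir . $href;
}

// Visit pages on the start URL's host, one at a time, never visiting a URL twice.
// $fetch is any callable that returns the HTML for a URL, or false on failure.
function crawlSite(string $start, callable $fetch, int $maxPages = 50): array
{
    $host = (string) parse_url($start, PHP_URL_HOST);
    $queue = [$start];
    $visited = [];
    $internal = [];
    $external = [];

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled, this is what prevents endless loops
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // page could not be fetched
        }
        foreach (extractHrefs($html) as $href) {
            $abs = resolveUrl($url, $href);
            if ($abs === null) {
                continue;
            }
            if (strcasecmp((string) parse_url($abs, PHP_URL_HOST), $host) === 0) {
                $internal[$abs] = true;
                $queue[] = $abs; // follow internal links later
            } else {
                $external[$abs] = true; // record external links, do not follow
            }
        }
    }
    return ['internal' => array_keys($internal), 'external' => array_keys($external)];
}
```

To try it against a real site you could pass something like function ($u) { return @file_get_contents($u); } as $fetch, assuming allow_url_fopen is enabled on your server. Passing the fetcher in as a parameter also means you can test the crawl logic with a plain array of fake pages first.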

The error is generated inside the code you provided:

function getLinks($url, &$links) {
  // do the lookup and extract the link //
  if (FOUND_LINK == INTERNAL) {
    // the error is at the line below
    getLinks(FOUND_LINK, $links);
  } else {
    $links[] = FOUND_LINK;
  }
}
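The thread does not show the actual error message, so this is a guess: FOUND_LINK and INTERNAL in the snippet are placeholders rather than defined constants, so PHP cannot evaluate that test as written, and the recursive call has nothing real to follow. One possible replacement for the "is this link internal?" check is to compare host names with parse_url. The helper name isInternalLink is hypothetical, and treating non-http schemes as "not internal" is my own assumption.

```php
<?php
// Hypothetical helper: decides whether $href, found on $pageUrl, stays on the same host.
function isInternalLink(string $href, string $pageUrl): bool
{
    // Links such as mailto: or javascript: are not crawlable pages.
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (is_string($scheme) && !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return false;
    }

    // No host part means a relative link like "about.html" or "/contact".
    $linkHost = parse_url($href, PHP_URL_HOST);
    if (!is_string($linkHost)) {
        return true;
    }

    // Otherwise compare host names, ignoring case.
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);
    return is_string($pageHost) && strcasecmp($linkHost, $pageHost) === 0;
}
```

With something like this, the condition in the snippet could become if (isInternalLink($found, $url)), where $found is whatever href you pulled out of the page. You would still need to remember which URLs you have already visited before recursing, otherwise two pages that link to each other will recurse until PHP runs out of memory.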

