find full url

deadlyp99 · August 7, 2008

This is the error I am encountering in my crawler:

Warning: file_get_contents(/home.aspx) [function.file-get-contents]: failed to open stream: No such file or directory in C:\wamp\www\a\crawl.php on line 12

What is happening is that when it goes to the new page, it tries to find that file on my local server.

I need a way to get the full URL, but only when the URL found on the page isn't already a full URL.

Make enough sense?

My code:

<?php
function Main($StartUrl){
	$x = 1;
	While ($x <= 5) {	
		//support for links without full urls
		if (file_get_contents($StartUrl)==FALSE){
			echo "false";
			}
		else {
			//Assign page a variable
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$StartUrl = $matches[1];
			$x++;
			}
		}
	}
?>

MasterACE14 · August 7, 2008

Here is a free script I found that gets the whole URL of the current page.

<?php
/* Grab the current page's complete URL, including everything after the question mark */
$QueryString="";
foreach ($_GET as $key => $value)
{ 
$value = urlencode(stripslashes($value));
if($QueryString!="")
$QueryString .="&";

$QueryString .= "$key=$value";
}

$pageName=basename($_SERVER['PHP_SELF'] );

$page =$pageName."?".$QueryString;
// echo $page; // will show the full page 
// echo $QueryString; // will show after ? 
// end of URL grab	

?>

deadlyp99 · August 7, 2008

Normally that would do just fine.

But if you run the script I first posted and pass it jinx.com as the URL (which is what I am testing on), you can see the problem.

That script relies on PHP's detection of the domain it is running on, so when run it would do nothing more than show 127.0.0.1/a/home.aspx.

I need it to find jinx.com/home.aspx, because that is the domain the spider is currently working on.
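
To illustrate the difference: a link such as /home.aspx carries no scheme, so file_get_contents() treats it as a path on the local filesystem, which is where the warning comes from. Below is a rough sketch of how the crawler could tell the two kinds of link apart. The helper name hasScheme() is hypothetical and not part of the code above.

```php
<?php
// Hypothetical helper: true when $link carries its own scheme
// (http://, https://, ...), false for relative links such as /home.aspx.
function hasScheme($link) {
    // parse_url() returns null when the component is absent
    // and false when the URL is seriously malformed.
    $scheme = parse_url($link, PHP_URL_SCHEME);
    return is_string($scheme) && $scheme !== '';
}

var_dump(hasScheme('http://jinx.com/home.aspx')); // bool(true)
var_dump(hasScheme('/home.aspx'));                // bool(false)
```

Only links for which this returns false need to be combined with the address of the page they were found on.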

deadlyp99 · August 7, 2008

Anyone?

thebadbad · August 7, 2008

I've got a function that converts relative URLs to absolute URLs:

<?php
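function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
            //$relative is a seriously malformed URL
            return false;
        }
        //already absolute: return it untouched
        if(isset($p["scheme"])) return $relative;

        $parts = @parse_url($absolute);
        if(!$parts) {
            //the base URL could not be parsed either
            return false;
        }
        //protocol-relative link (//host/path): only the scheme is inherited
        if(substr($relative,0,2)=='//' && isset($parts['scheme'])) {
            return $parts['scheme'].':'.$relative;
        }

        if(substr($relative,0,1)=='/') {
            //root-relative link: the base path is ignored
            $segments = explode("/", $relative);
        } else {
            //relative to the base URL's directory: drop the file name first
            $aparts = isset($parts['path']) ? explode('/', $parts['path']) : array();
            array_pop($aparts);
            $segments = array_merge($aparts, explode("/", $relative));
        }

        //resolve "." and ".." with a stack so that consecutive ".." segments
        //each remove one directory
        $cparts = array();
        foreach($segments as $part) {
            if($part === '' || $part === '.') continue;
            if($part === '..') {
                array_pop($cparts);
            } else {
                $cparts[] = $part;
            }
        }
        $path = implode("/", $cparts);
        if($path != '' && substr($relative,-1)=='/') {
            $path .= '/'; //keep the link's trailing slash
        }

        $url = '';
        if(isset($parts['scheme'])) {
            $url = $parts['scheme']."://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host'];
            if(isset($parts['port'])) {
                $url .= ":".$parts['port'];
            }
            $url .= "/";
        }
        $url .= $path;

        return $url;
}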
?>

Then you just put the page URL you're crawling as the first parameter, and the link found on the page as the second, and it spits out the absolute URL:

<?php
echo relative2absolute('http://jinx.com/', '/home.aspx');
//should output http://jinx.com/home.aspx
?>

If the page you're crawling contains a <base> tag whose href differs from the actual page URL, you'll need to fetch that href and pass it as the first parameter instead, but that will only happen on rare occasions.
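
A minimal sketch of that <base> lookup, in the same regex style as the crawler above. The helper name effectiveBaseUrl() is made up for illustration, and the pattern assumes a quoted href attribute; a real HTML parser would be more robust.

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag found in
// $html, or $pageUrl when the page does not declare one.
function effectiveBaseUrl($html, $pageUrl) {
    // \b after "base" keeps tags such as <basefont> from matching.
    if (preg_match('~<base\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']~i', $html, $m)) {
        return trim($m[1]);
    }
    return $pageUrl;
}
```

The return value would then be used as the first argument of relative2absolute() in place of the page URL.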

The function originates from http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/

Edit: Another thing; you can safely run every link through the above function - if the URL is already absolute, the function will quickly return it.

deadlyp99 · August 7, 2008

I think I will be able to adapt that for my use, thanks.

I'll keep this thread open just in case someone finds any simpler ways.

deadlyp99 · August 8, 2008

So here is how I have modified the file, but I still cannot get it to work.

I'll post all of my code so everything that is going on can be seen.

I tested the function, and it works when a proper URL is passed in, but I need a way to do several things.

First, I've got little clue whether my use of the function in my code is correct, so some help there would be just dandy.

Second, I am going to need a method to make all base URLs universally formatted.

For example, they all need to be changed to:

http://www.thesite.com

That is, with the file name at the end removed, as well as the trailing slash. I was thinking of checking the URL against an array of every possible domain extension and trimming away everything after it. Is that overdoing it?
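
An array of domain extensions is probably not needed for this, since parse_url() already splits a URL into scheme, host and path. Below is a minimal sketch under that assumption. The helper name siteRoot() is hypothetical, and it deliberately does not add "www." because that would be a different host name.

```php
<?php
// Hypothetical helper: reduces an absolute URL to "scheme://host" (plus an
// explicit port if one is present), dropping path, query and fragment.
function siteRoot($url) {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // not an absolute URL, nothing to normalize
    }
    // Scheme and host are case-insensitive, so lower-case them for comparison.
    $root = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root;
}

echo siteRoot('http://www.thesite.com/shop/item.aspx?id=3'), "\n"; // http://www.thesite.com
```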

index.php

<html>
<head>
	<title>
		Web Crawler - learn
	</title>
	<style>
		#main
		{
			text-align: center;
			/* display: none; */
		}

		#title
		{
			text-align: center;
		}
		#url
		{
			/* display: none; */
		}
	</style>	
</head>

<body id="main">

	<div id="title">Web Crawler - Test</div>
	<a id="url" href="http://www.google.com">Google</a>
	<?php 
		require("library.tpl");
		include("crawl.php");
		Main("http://www.jinx.com");
	?>
</body>

</html>

Crawl.php

<?php
function Main($StartUrl){
	$x = 1;
	While ($x <= 5) {	
		//support for links without full urls
		if (file_get_contents($StartUrl)==FALSE){
			$StartUrl = relative2absolute($LastUrl, $StartUrl);
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$StartUrl = $matches[1];
			$x++;

			}
		else {
			//Assign page a variable
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$LastUrl = $StartUrl;
			$StartUrl = $matches[1];
			$x++;
			}
		}
	}
?>

library.tpl

<?php

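function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
            //$relative is a seriously malformed URL
            return false;
        }
        //already absolute: return it untouched
        if(isset($p["scheme"])) return $relative;

        $parts = @parse_url($absolute);
        if(!$parts) {
            //the base URL could not be parsed either
            return false;
        }
        //protocol-relative link (//host/path): only the scheme is inherited
        if(substr($relative,0,2)=='//' && isset($parts['scheme'])) {
            return $parts['scheme'].':'.$relative;
        }

        if(substr($relative,0,1)=='/') {
            //root-relative link: the base path is ignored
            $segments = explode("/", $relative);
        } else {
            //relative to the base URL's directory: drop the file name first
            $aparts = isset($parts['path']) ? explode('/', $parts['path']) : array();
            array_pop($aparts);
            $segments = array_merge($aparts, explode("/", $relative));
        }

        //resolve "." and ".." with a stack so that consecutive ".." segments
        //each remove one directory
        $cparts = array();
        foreach($segments as $part) {
            if($part === '' || $part === '.') continue;
            if($part === '..') {
                array_pop($cparts);
            } else {
                $cparts[] = $part;
            }
        }
        $path = implode("/", $cparts);
        if($path != '' && substr($relative,-1)=='/') {
            $path .= '/'; //keep the link's trailing slash
        }

        $url = '';
        if(isset($parts['scheme'])) {
            $url = $parts['scheme']."://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host'];
            if(isset($parts['port'])) {
                $url .= ":".$parts['port'];
            }
            $url .= "/";
        }
        $url .= $path;

        return $url;
}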

?>
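
On the first question, one possible shape for the loop is to fetch each page once and resolve every link against the page it was found on before fetching it, instead of waiting for file_get_contents() to fail. The sketch below is only an illustration under some assumptions: resolveLink() is a simplified stand-in for relative2absolute() so the example runs on its own, crawl() and its $fetch parameter are hypothetical names, and the stub pages are made-up test data, not real jinx.com content. The closure needs PHP 5.3 or later.

```php
<?php
// Simplified stand-in for relative2absolute(): handles absolute links,
// protocol-relative links, root-relative links ("/home.aspx") and links
// relative to the current directory; it does not collapse "." or "..".
function resolveLink($pageUrl, $link) {
    if (parse_url($link, PHP_URL_SCHEME)) {
        return $link; // already absolute
    }
    $p = parse_url($pageUrl);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return false; // the page URL itself must be absolute
    }
    if (substr($link, 0, 2) === '//') {
        return $p['scheme'] . ':' . $link;
    }
    $root = $p['scheme'] . '://' . $p['host'];
    if (isset($p['port'])) {
        $root .= ':' . $p['port'];
    }
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // keep up to the last slash
    return $root . $dir . $link;
}

// Follows the first link on each page, visiting at most $maxPages pages.
// $fetch is any callable returning the page body or false; a real crawler
// could pass 'file_get_contents' here.
function crawl($startUrl, $maxPages, $fetch) {
    $visited = array();
    $url = $startUrl;
    for ($i = 0; $i < $maxPages; $i++) {
        $page = call_user_func($fetch, $url); // fetch once, reuse the result
        if ($page === false) {
            break;
        }
        $visited[] = $url;
        if (!preg_match('|<a\b[^>]*?href="(.*?)"|is', $page, $m)) {
            break; // no link to follow
        }
        // Resolve against the page the link was found on, before fetching it.
        $url = resolveLink($url, $m[1]);
        if ($url === false) {
            break;
        }
    }
    return $visited;
}

// Made-up stub pages so the sketch runs without network access.
$pages = array(
    'http://jinx.com'             => '<a href="/home.aspx">Home</a>',
    'http://jinx.com/home.aspx'   => '<a href="shirts.aspx">Shirts</a>',
    'http://jinx.com/shirts.aspx' => 'no links here',
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};
print_r(crawl('http://jinx.com', 5, $fetch)); // lists the three stub URLs as absolute addresses
```

Because the current page URL is always absolute in this arrangement, there is no need for a separate $LastUrl variable or for a second file_get_contents() call per page.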

thebadbad · August 8, 2008

You haven't explained what you're trying to do with this crawler. If you do, maybe I can help.
