
find full url


deadlyp99


This is the error I am encountering in my crawler:

 

Warning: file_get_contents(/home.aspx) [function.file-get-contents]: failed to open stream: No such file or directory in C:\wamp\www\a\crawl.php on line 12

 

What is happening is that when the crawler moves on to the next page, the link it found is a relative path, so it tries to open that file on my local server.

I need a way to build the full URL whenever the link found is not already a full URL on that site.

Does that make sense?

My code:

<?php
function Main($StartUrl){
	$x = 1;
	While ($x <= 5) {	
		//support for links without full urls
		if (file_get_contents($StartUrl)==FALSE){
			echo "false";
			}
		else {
			//Assign page a variable
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$StartUrl = $matches[1];
			$x++;
			}
		}
	}
?>		

https://forums.phpfreaks.com/topic/118599-find-full-url/

Here is a free script I found that gets the whole URL of the current page.

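<?php
/* Grab the current page's complete URL, including everything after the question mark */
$QueryString = "";
foreach ($_GET as $key => $value)
{
	// Re-encode each value so it is safe to put back into a URL
	$value = urlencode(stripslashes($value));
	if ($QueryString != "")
	{
		$QueryString .= "&";
	}
	$QueryString .= "$key=$value";
}

$pageName = basename($_SERVER['PHP_SELF']);

// Only append the question mark when there is a query string to follow it
$page = $pageName;
if ($QueryString != "")
{
	$page .= "?" . $QueryString;
}
// echo $page; // will show the page name plus the query string
// echo $QueryString; // will show everything after the ?
// end of URL grab

?>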


Normally that would do just fine.

But if you run the script I first posted and pass it jinx.com as the URL (the site I am testing on), you can see the problem.

 

That script relies on PHP's detection of the domain it is running on, so when run it would show nothing more than 127.0.0.1/a/home.aspx

I need it to find jinx.com/home.aspx because that is the domain the spider is currently working on.


I've got a function that converts relative URLs to absolute URLs:

 

<?php
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
            $rparts = (explode("/", $relative));
            $cparts = array_merge($aparts, $rparts);
        }
        //resolve "." and ".." segments with a stack so that consecutive
        //".." segments each remove one directory
        $resolved = array();
        foreach($cparts as $part) {
            if($part === '.') {
                continue;
            } else if($part === '..') {
                array_pop($resolved);
            } else {
                $resolved[] = $part;
            }
        }
        $path = implode("/", $resolved);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}
?>

 

Then you just put the page URL you're crawling as the first parameter, and the link found on the page as the second, and it spits out the absolute URL:

 

<?php
echo relative2absolute('http://jinx.com/', '/home.aspx');
//should output http://jinx.com/home.aspx
?>

 

If the page you're crawling contains a <base> tag whose href differs from the actual page URL, you'll need to fetch that href and use it as the first parameter instead, but that will only happen on rare occasions.
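
For the <base> case, here is a minimal sketch of how the href could be picked out of the fetched page. The helper name find_base_url is made up for this example and is not part of the function above. It assumes the href is quoted, and it does not handle a <base> href that is itself relative.

```php
<?php
// Hypothetical helper, not part of relative2absolute(): returns the URL
// that relative links on a page should be resolved against.
function find_base_url($html, $pageUrl) {
    // Look for <base href="..."> with either quote style, in any case
    if (preg_match('/<base\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        $href = trim($m[2]);
        if ($href !== '') {
            return $href;
        }
    }
    // No usable <base> tag, so the page URL itself is the base
    return $pageUrl;
}

$html = '<html><head><base href="http://jinx.com/shop/"></head><body></body></html>';
echo find_base_url($html, 'http://jinx.com/home.aspx'), "\n";
echo find_base_url('<html></html>', 'http://jinx.com/home.aspx'), "\n";
```

Whatever this returns would then go in as the first parameter of relative2absolute in place of the page URL.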

 

The function originates from http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/

 

Edit: Another thing; you can safely run every link through the above function - if the URL is already absolute, the function will quickly return it.
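
To illustrate running every link through the function, here is a rough sketch that collects all the hrefs on a page rather than only the first one. collect_links and demo_resolver are hypothetical names invented for this example. demo_resolver is a deliberately simplified stand-in so the snippet runs on its own; in the crawler you would pass 'relative2absolute' instead.

```php
<?php
// Hypothetical helper: collects every href on a page and resolves each one
// against the page URL. $resolver is any function that takes (base, link),
// the same argument order as relative2absolute().
function collect_links($html, $pageUrl, $resolver) {
    $links = array();
    // preg_match_all finds every <a ... href="...">, not just the first one
    preg_match_all('|<a\s[^>]*?href="(.*?)"|is', $html, $matches);
    foreach ($matches[1] as $href) {
        $absolute = call_user_func($resolver, $pageUrl, $href);
        // Skip links the resolver rejects as malformed
        if ($absolute !== false) {
            $links[] = $absolute;
        }
    }
    // Drop duplicates so the same page is not queued twice
    return array_values(array_unique($links));
}

// Simplified stand-in used only to keep this example self-contained;
// it handles absolute and root-relative links and nothing else.
function demo_resolver($base, $link) {
    if (is_string(parse_url($link, PHP_URL_SCHEME))) {
        return $link; // already absolute
    }
    $p = parse_url($base);
    return $p['scheme'] . '://' . $p['host'] . '/' . ltrim($link, '/');
}

$html = '<a href="/home.aspx">Home</a> '
      . '<a href="http://www.google.com">Google</a> '
      . '<a href="/home.aspx">Home again</a>';
print_r(collect_links($html, 'http://jinx.com/', 'demo_resolver'));
```

Because already-absolute links come back unchanged, the loop does not need a separate check for whether a link is relative.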


So here is how I have modified the file, but I still cannot get it to work.

I'll post all of my code so that everything going on can be seen.

 

I tested the function, and it works when a proper URL is inserted, but I need a way to do several things.

First, I have little idea whether my use of the function in my code is correct, so some help there would be just dandy.

Second, I am going to need a method to make every base URL universally formatted.

For example, they all need to be changed to:

http://www.thesite.com

With the end file removed, as well as the slash. I was thinking of checking the URL against an array of every possible domain extension and trimming away everything after it. Is that overdoing it?
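
One possible alternative to the list of domain extensions is to let parse_url split the URL and keep only the scheme and host. This is only a sketch, and site_root is a hypothetical name. It does not add "www." to hosts that lack it, since www.thesite.com and thesite.com are different hosts as far as the URL is concerned.

```php
<?php
// Hypothetical helper: reduces any URL on a site to "scheme://host",
// with no path, file name, or trailing slash.
function site_root($url) {
    $parts = parse_url($url);
    // parse_url returns false for badly malformed input, and relative
    // links have no scheme or host, so give up in those cases
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    // Keep an explicit port, since it identifies a different server
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root;
}

echo site_root('http://www.thesite.com/shop/item.aspx?id=3'), "\n";
echo site_root('http://www.thesite.com/'), "\n";
var_dump(site_root('/home.aspx'));
```

The first two calls both print http://www.thesite.com, and the last one returns false because a relative link carries no host to normalize.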

 

index.php

<html>
<head>
	<title>
		Web Crawler - learn
	</title>
	<style>
		#main
		{
			text-align: center;
			/* display: none; */
		}

		#title
		{
			text-align: center;
		}
		#url
		{
			/* display: none; */
		}
	</style>	
</head>

<body id="main">

	<div id="title">Web Crawler - Test</div>
	<a id="url" href="http://www.google.com">Google</a>
	<?php 
		require("library.tpl");
		include("crawl.php");
		Main("http://www.jinx.com");
	?>
</body>

</html>

 

Crawl.php

<?php
function Main($StartUrl){
	$x = 1;
	While ($x <= 5) {	
		//support for links without full urls
		if (file_get_contents($StartUrl)==FALSE){
			$StartUrl = relative2absolute($LastUrl, $StartUrl);
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$StartUrl = $matches[1];
			$x++;

			}
		else {
			//Assign page a variable
			$Page = file_get_contents($StartUrl);
			//search string for a pattern
			// and store the content captured by the parentheses in the array $matches
			preg_match('|<a.*?href="(.*?)"|is', $Page,$matches);
			//see what's inside $matches[1]
			echo '<pre>'. print_r($matches[1], true) . '</pre>';
			//Go to next
			$LastUrl = $StartUrl;
			$StartUrl = $matches[1];
			$x++;
			}
		}
	}
?>		

 

library.tpl

<?php

function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
            $rparts = (explode("/", $relative));
            $cparts = array_merge($aparts, $rparts);
        }
        //resolve "." and ".." segments with a stack so that consecutive
        //".." segments each remove one directory
        $resolved = array();
        foreach($cparts as $part) {
            if($part === '.') {
                continue;
            } else if($part === '..') {
                array_pop($resolved);
            } else {
                $resolved[] = $part;
            }
        }
        $path = implode("/", $resolved);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}

?>


Archived

This topic is now archived and is closed to further replies.
