deadlyp99 Posted August 7, 2008

This is the error I am encountering in my crawler:

    Warning: file_get_contents(/home.aspx) [function.file-get-contents]: failed to open stream: No such file or directory in C:\wamp\www\a\crawl.php on line 12

What is happening is that when the crawler moves on to the next page, it tries to find that file on my local server. I need a way to build the full URL, but only when the URL found on the page isn't already a full URL for that site. Does that make sense?

My code:

    <?php
    function Main($StartUrl){
        $x = 1;
        while ($x <= 5) {
            // support for links without full urls
            if (file_get_contents($StartUrl) == FALSE) {
                echo "false";
            } else {
                // Assign page a variable
                $Page = file_get_contents($StartUrl);
                // search string for a pattern and store the content found
                // inside the set of parentheses in the array $matches
                preg_match('|<a.*?href="(.*?)"|is', $Page, $matches);
                // see what's inside $matches[1]
                echo '<pre>'. print_r($matches[1], true) . '</pre>';
                // Go to next
                $StartUrl = $matches[1];
                $x++;
            }
        }
    }
    ?>

Link to comment https://forums.phpfreaks.com/topic/118599-find-full-url/
MasterACE14 Posted August 7, 2008

Here is a free script I found that gets the whole URL of the current page.

    <?php
    /* Grab the current page's complete URL, including everything after the question mark */
    $QueryString = "";
    foreach ($_GET as $key => $value) {
        $value = urlencode(stripslashes($value));
        if ($QueryString != "") $QueryString .= "&";
        $QueryString .= "$key=$value";
    }
    $pageName = basename($_SERVER['PHP_SELF']);
    $page = $pageName . "?" . $QueryString;
    // echo $page;        // will show the full page
    // echo $QueryString; // will show after ?
    // end of URL grab
    ?>
deadlyp99 Posted August 7, 2008 Author

Normally that would do just fine. But if you run the script I first posted, say with jinx.com as the URL (which is what I am testing on), you can see the problem. That script relies on PHP's detection of the domain it is running on, so when run it would do nothing more than show 127.0.0.1/a/home.aspx. I need it to find jinx.com/home.aspx, because that is the domain the spider is currently working on.
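The distinction raised here is that `$_SERVER` describes the machine running the crawler, while the domain has to come from the URL of the page being crawled. A minimal sketch of that idea, assuming the crawled URL includes a scheme such as `http://` (a bare `jinx.com` has no host as far as `parse_url` is concerned); `origin_of` is a hypothetical helper name, not something from this thread:

```php
<?php
// Hypothetical helper (not from the thread): returns "scheme://host[:port]"
// for the page being crawled, or null when the URL has no recognizable host.
function origin_of(string $pageUrl): ?string
{
    $parts = parse_url($pageUrl);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    // Assumption: fall back to http when the URL has a host but no scheme.
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    return $origin;
}

// A root-relative link found on the page is appended to that origin.
echo origin_of('http://www.jinx.com/shop/item.aspx?id=3') . '/home.aspx';
// prints http://www.jinx.com/home.aspx
```

This only covers links that start with a slash; links relative to the current directory need the path of the crawled page as well.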
deadlyp99 Posted August 7, 2008 Author

Anyone?
thebadbad Posted August 7, 2008

I've got a function that converts relative URLs to absolute URLs:

    <?php
    function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if (!$p) {
            // $relative is a seriously malformed URL
            return false;
        }
        // already absolute: return it unchanged
        if (isset($p["scheme"])) return $relative;

        $parts = parse_url($absolute);

        if (substr($relative, 0, 1) == '/') {
            // root-relative link: its path replaces the base path
            $cparts = explode("/", $relative);
            array_shift($cparts);
        } else {
            // path-relative link: start from the directory of the base URL
            if (isset($parts['path'])) {
                $aparts = explode('/', $parts['path']);
                array_pop($aparts);
                $aparts = array_filter($aparts);
            } else {
                $aparts = array();
            }
            $rparts = explode("/", $relative);
            $cparts = array_merge($aparts, $rparts);
            // drop "." segments and let ".." remove the segment before it
            foreach ($cparts as $i => $part) {
                if ($part == '.') {
                    unset($cparts[$i]);
                } else if ($part == '..') {
                    unset($cparts[$i]);
                    unset($cparts[$i-1]);
                }
            }
        }
        $path = implode("/", $cparts);

        // rebuild the URL from the base URL's components
        $url = '';
        if (isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if (isset($parts['user'])) {
            $url .= $parts['user'];
            if (isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if (isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
    }
    ?>

Then you just pass the URL of the page you're crawling as the first parameter and the link found on the page as the second, and it spits out the absolute URL:

    <?php
    echo relative2absolute('http://jinx.com/', '/home.aspx');
    // should output http://jinx.com/home.aspx
    ?>

If the page you're crawling contains a <base> tag whose href differs from the actual page URL, you'll need to fetch that and use it as the first parameter instead, but that will only happen on rare occasions.

The function originates from http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/

Edit: Another thing: you can safely run every link through the above function. If the URL is already absolute, the function returns it right away.
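For the <base> tag case mentioned above, a rough sketch of how the base href could be picked up before resolving links. `find_base_href` and `effective_base` are hypothetical names, not part of the function in this post, and the regex is an assumption that only handles quoted href attributes. If the base href is itself relative, it would still need resolving against the page URL first.

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag in $html,
// or null when the page has none. Only quoted attribute values are handled.
function find_base_href(string $html): ?string
{
    if (preg_match('/<base\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        // Decode entities such as &amp; that may appear in attribute values.
        return html_entity_decode(trim($m[2]), ENT_QUOTES);
    }
    return null;
}

// Hypothetical helper: the URL to resolve links against, which is the
// <base> href when the page declares one and the page URL otherwise.
function effective_base(string $pageUrl, string $html): string
{
    $base = find_base_href($html);
    return ($base !== null && $base !== '') ? $base : $pageUrl;
}
```

The result of `effective_base($pageUrl, $html)` would then take the place of the first parameter to relative2absolute.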
deadlyp99 Posted August 7, 2008 Author

I think I will be able to adapt that for my use, thanks. I'll keep this thread open just in case someone finds any simpler ways.
deadlyp99 Posted August 8, 2008 Author

Here is how I have modified the files, but I still cannot get it to work. I'll post all of my code so everything that is going on can be seen. I tested the function, and it works when a proper URL is passed in, but I need a way to do several things.

First, I have little idea whether I am using the function correctly in my code, so some help there would be just dandy.

Second, I am going to need a method to make every base URL universally formatted. For example, they all need to be changed to http://www.thesite.com, with the trailing file name removed, as well as the slash. I was thinking of checking the URL against an array of every possible domain extension and trimming away everything after it. Is that overdoing it?

index.php

    <html>
    <head>
    <title> Web Crawler - learn </title>
    <style>
    #main {
        text-align: center;
        /* display: none; */
    }
    #title {
        text-align: center;
    }
    #url {
        /* display: none; */
    }
    </style>
    </head>
    <body id="main">
    <div id="title">Web Crawler - Test</div>
    <a id="url" href="http://www.google.com">Google</a>
    <?php
    require("library.tpl");
    include("crawl.php");
    Main("http://www.jinx.com");
    ?>
    </body>
    </html>

crawl.php

    <?php
    function Main($StartUrl){
        $x = 1;
        while ($x <= 5) {
            // support for links without full urls
            if (file_get_contents($StartUrl) == FALSE) {
                $StartUrl = relative2absolute($LastUrl, $StartUrl);
                $Page = file_get_contents($StartUrl);
                // search string for a pattern and store the content found
                // inside the set of parentheses in the array $matches
                preg_match('|<a.*?href="(.*?)"|is', $Page, $matches);
                // see what's inside $matches[1]
                echo '<pre>'. print_r($matches[1], true) . '</pre>';
                // Go to next
                $StartUrl = $matches[1];
                $x++;
            } else {
                // Assign page a variable
                $Page = file_get_contents($StartUrl);
                // search string for a pattern and store the content found
                // inside the set of parentheses in the array $matches
                preg_match('|<a.*?href="(.*?)"|is', $Page, $matches);
                // see what's inside $matches[1]
                echo '<pre>'. print_r($matches[1], true) . '</pre>';
                // Go to next
                $LastUrl = $StartUrl;
                $StartUrl = $matches[1];
                $x++;
            }
        }
    }
    ?>

library.tpl

    <?php
    function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if (!$p) {
            // $relative is a seriously malformed URL
            return false;
        }
        if (isset($p["scheme"])) return $relative;

        $parts = parse_url($absolute);

        if (substr($relative, 0, 1) == '/') {
            $cparts = explode("/", $relative);
            array_shift($cparts);
        } else {
            if (isset($parts['path'])) {
                $aparts = explode('/', $parts['path']);
                array_pop($aparts);
                $aparts = array_filter($aparts);
            } else {
                $aparts = array();
            }
            $rparts = explode("/", $relative);
            $cparts = array_merge($aparts, $rparts);
            foreach ($cparts as $i => $part) {
                if ($part == '.') {
                    unset($cparts[$i]);
                } else if ($part == '..') {
                    unset($cparts[$i]);
                    unset($cparts[$i-1]);
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if (isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if (isset($parts['user'])) {
            $url .= $parts['user'];
            if (isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if (isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
    }
    ?>
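On the first question, one possible shape for the loop is to resolve every link against the page it was found on before fetching it, rather than fetching first and resolving only after a failure. The sketch below is an illustration under assumptions, not a drop-in fix: `crawl_chain` is a hypothetical name, and the fetcher and resolver are passed in as parameters so the loop can be tried without network access. In this thread's setup they would presumably be `file_get_contents` and the `relative2absolute` function posted earlier.

```php
<?php
// A sketch only: follows the first link on each page for up to $maxPages
// pages and returns the list of URLs that were fetched successfully.
// $fetch($url) should return the page body or false on failure;
// $resolve($baseUrl, $link) should return an absolute URL or false.
function crawl_chain(string $startUrl, callable $fetch, callable $resolve, int $maxPages = 5): array
{
    $visited = [];
    $url = $startUrl;

    for ($i = 0; $i < $maxPages; $i++) {
        $page = $fetch($url);      // each page is fetched once
        if ($page === false) {
            break;                 // stop on failure instead of retrying forever
        }
        $visited[] = $url;

        // Same pattern as in the thread: href of the first <a> tag.
        if (!preg_match('|<a.*?href="(.*?)"|is', $page, $matches)) {
            break;                 // no link found on this page
        }

        // Resolve against the page the link came from, before fetching it.
        $next = $resolve($url, $matches[1]);
        if ($next === false) {
            break;                 // link could not be resolved
        }
        $url = $next;
    }

    return $visited;
}

// Possible use with the thread's own functions (assumed to be loaded):
// $visited = crawl_chain('http://www.jinx.com/', 'file_get_contents', 'relative2absolute');
```

Because the current page URL is always passed as the base, no separate `$LastUrl` variable is needed, and a failed fetch ends the loop rather than leaving the counter stuck.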
thebadbad Posted August 8, 2008

You haven't explained what you're trying to do with this crawler. Maybe then I can help.
This topic is now archived and is closed to further replies.