hellonoko Posted March 18, 2009

I am using the code below to compare a base URL (http://www.site.com/) against a list of scraped URLs. If the scraped URL ($url) does not contain the base URL ($target_url), it combines the two to make a full URL, so files/login.php turns into www.site.com/files/login.php.

However, it doesn't seem to be comparing correctly. URLs that already contain the base URL get turned into www.site.com/index.htmlwww.site.com/index.html. I have also tried strcmp(). Any ideas?

    if ( strstr($url , $target_url) === FALSE )
    {
        echo 'INCOMPLETE: '.$url;
        //echo $target_url.$url;
        echo '<br>';
    }
    else
    {
        echo 'COMPLETE: '.$url;
        echo '<br>';
    }

Link to comment https://forums.phpfreaks.com/topic/149945-comparing-urls/
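The check described in the post above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name make_absolute() is hypothetical (it does not appear in the thread), and it covers only the simple case described, a plain relative path appended to a base URL that ends in a slash. It tests whether the scraped URL starts with the base URL rather than merely containing it somewhere.

```php
<?php
// Hypothetical helper, not from the thread: prepend the base URL only
// when the scraped URL does not already start with it.
function make_absolute(string $base_url, string $url): string
{
    // Scraped href values may carry stray whitespace.
    $url = trim($url);

    // strpos() returns 0 when $url begins with $base_url. Because 0 is
    // loosely equal to false, the comparison must be strict (=== 0).
    if (strpos($url, $base_url) === 0) {
        return $url; // already complete
    }

    // Assumes $base_url ends with "/" and $url is a plain relative path.
    return $base_url . ltrim($url, '/');
}

echo make_absolute('http://www.site.com/', 'files/login.php'), "\n";
echo make_absolute('http://www.site.com/', 'http://www.site.com/index.html'), "\n";
```

When a comparison like this misbehaves, printing both strings with var_dump() just before the check is a quick way to spot differences in whitespace, scheme, or trailing slashes.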
thebadbad Posted March 18, 2009

I don't know if it's overkill, but I've got a function that turns relative URLs into absolute ones: relative2absolute(). It works great if you scrape both absolute and relative paths from a page and want to turn them all into absolute paths.

Examples:

    <?php
    echo relative2absolute('http://www.site.com/', 'files/login.php');
    //http://www.site.com/files/login.php

    echo relative2absolute('http://www.site.com/some-directory/', '../files/login.php');
    //http://www.site.com/files/login.php

    echo relative2absolute('http://www.site.com/some-directory/', '/files/login.php');
    //http://www.site.com/files/login.php
    ?>

Link to comment https://forums.phpfreaks.com/topic/149945-comparing-urls/#findComment-787525
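The body of relative2absolute() is not shown in the thread, so the following is not that function. It is a minimal sketch of how this kind of resolution could work, using a hypothetical name, resolve_url(). It assumes a well-formed http(s) base URL, and it does not handle protocol-relative links (//host/path) or every edge case of the URL specification.

```php
<?php
// Hypothetical sketch, not thebadbad's relative2absolute(): resolve $rel
// against $base. Assumes $base is a well-formed http(s) URL; any query
// string or fragment on $base is dropped.
function resolve_url(string $base, string $rel): string
{
    // A leading scheme (http:, mailto:, ...) means $rel is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    if (substr($rel, 0, 1) === '/') {
        // Root-relative: ignore the base path entirely.
        $path = $rel;
    } else {
        // Keep the directory part of the base path (up to the last "/").
        $base_path = $parts['path'] ?? '/';
        $dir  = substr($base_path, 0, strrpos($base_path, '/') + 1);
        $path = $dir . $rel;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($out); // no effect when already at the root
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $root . '/' . implode('/', $out);
}

echo resolve_url('http://www.site.com/', 'files/login.php'), "\n";
echo resolve_url('http://www.site.com/some-directory/', '../files/login.php'), "\n";
echo resolve_url('http://www.site.com/some-directory/', '/files/login.php'), "\n";
```

All three calls print http://www.site.com/files/login.php, matching the examples in the post above.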
hellonoko Posted March 18, 2009

I will take a look, but I'm still not sure what is up with my code. When I have it in a page on its own it works, but in the full code below it fails.

    <?php

    //echo $site_url = 'http://www.empreintes-digitales.fr/';

    $target_url = "http://www.empreintes-digitales.fr/";
    //$target_url = 'http://redthreat.wordpress.com/';
    //$target_url= 'http://www.kissatlanta.com/blog/';
    //$target_url= 'http://www.empreintes-digitales.fr/';

    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

    crawl_page( $target_url, $userAgent, $site_url, $url);

    function crawl_page( $target_url, $userAgent , $site_url, $url )
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL,$target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);

        $html = curl_exec($ch);

        if (!$html)
        {
            echo "<br />cURL error number:" .curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }

        //
        // load scraped data into the DOM
        //
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        //
        // get only LINKS from the DOM with XPath
        //
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        //
        // go through all the links and store to db or whatever
        //
        for ($i = 0; $i < $hrefs->length; $i++)
        {
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');

            $links_1[$link] = $url;

            //if the $url does not contain the web site base address: http://www.thesite.com/ then add it onto the front
            if ( strpos($url , $target_url) === FALSE )
            {
                echo 'INCOMPLETE: '.$url;
                echo '<br>';
            }
            else
            {
                echo 'COMPLETE: '.$url;
                echo '<br>';
            }
        }
    }

    ?>

Link to comment https://forums.phpfreaks.com/topic/149945-comparing-urls/#findComment-787528
This topic is now archived and is closed to further replies.