
comparing urls


hellonoko


I am using the code below to compare a base URL (http://www.site.com/) against a list of scraped URLs.

If the scraped URL ($url) does not contain the base URL ($target_url), the two are combined to make a full URL, so files/login.php turns into www.site.com/files/login.php.

However, it doesn't seem to be comparing correctly. URLs that already contain the base URL come out doubled, e.g. www.site.com/index.htmlwww.site.com/index.html.

I have also tried strcmp().

Any ideas?

 

if ( strstr($url , $target_url) === FALSE )
{
	echo 'INCOMPLETE: '.$url;
	//echo $target_url.$url;
	echo '<br>';
}
else
{
	echo 'COMPLETE: '.$url;
	echo '<br>';
}
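
For reference, the check-and-combine step described above could be sketched roughly as follows. This is only an illustration, not the code from the post: complete_url() is a hypothetical helper name, and it assumes every scraped URL is either relative to the base URL or already begins with it (links to other sites would be wrongly prefixed). It tests for the base URL at position 0 rather than anywhere in the string.

```php
<?php
// Hypothetical helper (not from the original post): prefix the base URL
// only when the scraped URL does not already start with it.
function complete_url($target_url, $url)
{
    // strpos() returns 0 when $url starts with $target_url and false when
    // it is absent, so the comparison has to be strict (===).
    if (strpos($url, $target_url) === 0) {
        return $url;
    }

    // Join with exactly one slash between base and path.
    return rtrim($target_url, '/') . '/' . ltrim($url, '/');
}

echo complete_url('http://www.site.com/', 'files/login.php'), "\n";
echo complete_url('http://www.site.com/', 'http://www.site.com/index.html'), "\n";
```

Under those assumptions the first call yields http://www.site.com/files/login.php and the second returns its input unchanged.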

 

 

https://forums.phpfreaks.com/topic/149945-comparing-urls/

I don't know if it's overkill, but I have a function, relative2absolute(), that turns relative URLs into absolute ones. It works well when you scrape both absolute and relative paths from a page and want them all as absolute paths. Examples:

 

<?php
echo relative2absolute('http://www.site.com/', 'files/login.php');
//http://www.site.com/files/login.php

echo relative2absolute('http://www.site.com/some-directory/', '../files/login.php');
//http://www.site.com/files/login.php

echo relative2absolute('http://www.site.com/some-directory/', '/files/login.php');
//http://www.site.com/files/login.php
?>
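
The body of relative2absolute() is not shown in this thread, so the following is only a guess at how such a helper might work, under a different name (relative2absolute_sketch() is hypothetical). It assumes the base URL has a scheme and host, and it ignores protocol-relative links ("//host/..."), query strings and fragments; it covers just the three cases in the examples above.

```php
<?php
// Hypothetical sketch; NOT the relative2absolute() function from the post.
function relative2absolute_sketch($base, $rel)
{
    // A URL that already has a scheme is treated as absolute.
    if (is_string(parse_url($rel, PHP_URL_SCHEME))) {
        return $rel;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (strpos($rel, '/') === 0) {
        // Root-relative: resolve against the site root.
        $path = $rel;
    } else {
        // Path-relative: resolve against the directory of the base URL.
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path     = $dir . $rel;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    // Keep a trailing slash if the resolved path had one.
    $trailing = (count($out) > 0 && substr($path, -1) === '/') ? '/' : '';

    return $origin . '/' . implode('/', $out) . $trailing;
}

echo relative2absolute_sketch('http://www.site.com/', 'files/login.php'), "\n";
echo relative2absolute_sketch('http://www.site.com/some-directory/', '../files/login.php'), "\n";
echo relative2absolute_sketch('http://www.site.com/some-directory/', '/files/login.php'), "\n";
```

A complete resolver would follow the reference-resolution rules in RFC 3986; this sketch deliberately handles less.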

https://forums.phpfreaks.com/topic/149945-comparing-urls/#findComment-787525

I will take a look, but I'm still not sure what is wrong with my code. When I run the comparison in a page on its own it works, but in the full code below it fails.

<?php

//echo $site_url = 'http://www.empreintes-digitales.fr/';
$target_url = "http://www.empreintes-digitales.fr/";

//$target_url = 'http://redthreat.wordpress.com/';
//$target_url= 'http://www.kissatlanta.com/blog/';
//$target_url= 'http://www.empreintes-digitales.fr/';

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// $site_url and $url were never defined at this point, so they are not passed in
crawl_page( $target_url, $userAgent );

function crawl_page( $target_url, $userAgent )
{
	$ch = curl_init();

	curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
	curl_setopt($ch, CURLOPT_URL,$target_url);
	curl_setopt($ch, CURLOPT_FAILONERROR, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
	curl_setopt($ch, CURLOPT_AUTOREFERER, true);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
	curl_setopt($ch, CURLOPT_TIMEOUT, 10);

	$html = curl_exec($ch);

	if (!$html) 
	{
		echo "<br />cURL error number:" .curl_errno($ch);
		echo "<br />cURL error:" . curl_error($ch);
		exit;
	}

	//
	// load scraped data into the DOM
	//

	$dom = new DOMDocument();
	@$dom->loadHTML($html);

	//
	// get only LINKS from the DOM with XPath
	//

	$xpath = new DOMXPath($dom);
	$hrefs = $xpath->evaluate("/html/body//a");

	//
	// go through all the links and store to db or whatever
	//
	for ($i = 0; $i < $hrefs->length; $i++) 
	{
		$href = $hrefs->item($i);
		$url = $href->getAttribute('href');

		// $link was undefined here, so append to the array instead
		$links_1[] = $url;

		//if the $url does not contain the web site base address: http://www.thesite.com/ then add it onto the front
		if ( strpos($url , $target_url) === FALSE )
		{
			echo 'INCOMPLETE: '.$url;
			echo '<br>';
		}
		else
		{
			echo 'COMPLETE: '.$url;
			echo '<br>';
		}
	}
}

?>
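
To check the comparison in isolation from cURL, the link-extraction loop above can be run against a literal HTML string. This is a minimal sketch, not a fix for the script: extract_hrefs() is a hypothetical helper, the sample HTML is made up, and the base URL is tested at position 0 of each trimmed href.

```php
<?php
// Hypothetical helper, not part of the original script: return the href
// value of every <a> element that has one.
function extract_hrefs($html)
{
    // Collect libxml parse warnings internally instead of silencing with @.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $hrefs = array();
    foreach ($xpath->query('//a[@href]') as $anchor) {
        // trim() guards against stray whitespace inside the attribute.
        $hrefs[] = trim($anchor->getAttribute('href'));
    }
    return $hrefs;
}

// Made-up sample page: one relative link, one absolute link, one anchor without href.
$html = '<html><body>'
      . '<a href="files/login.php">Login</a>'
      . '<a href=" http://www.site.com/index.html ">Home</a>'
      . '<a name="top">no href</a>'
      . '</body></html>';

$target_url = 'http://www.site.com/';

foreach (extract_hrefs($html) as $url) {
    $label = (strpos($url, $target_url) === 0) ? 'COMPLETE' : 'INCOMPLETE';
    echo $label . ': ' . $url . "\n";
}
```

Running the same labels against real scraped output, with each $url passed through var_dump(), shows the exact strings being compared.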

 

 

https://forums.phpfreaks.com/topic/149945-comparing-urls/#findComment-787528

Archived

This topic is now archived and is closed to further replies.
