[SOLVED] PHP Spider Problem With Parsing Relative Links


jjacquay712

I know most of you won't have the time or energy to look over my code and debug it (I know I wouldn't if I were you), but I'll post it anyway. OK, here is my problem: I made a PHP spider that gets links from HTML and puts them into a database. The problem is with relative links. If a link doesn't have http:// at the beginning, the spider assumes it's a relative link and adds the domain name to the front of it. But if it's searching a page with a URL like http://google.com/test.html and finds a relative link, it produces http://google.com/test.html/link.html. I can't think of a way to solve this problem. Here is my code:

 

<?php
set_time_limit(2);
if ($_POST['url']) {
//connect************************************************************
mysql_connect("localhost", "johnj_admin", "lolsn1perjesus");
mysql_select_db("johnj_search");
//end connect *******************************************************
function isrelurl($val, $rofl) {
	$urlhttp = substr($val, 0, 7); // If the link is relative, add http and the domain name in front of it
	if ( $urlhttp == "http://" || $urlhttp == "https:/" ) {
		$lolvar = $val;
	} else {
		$snap = substr($val, 0, 1);
		if ($snap == "/") {
			$lolvar = $rofl . $val;
		} else {
			$lolvar = $rofl . '/' . $val;
		}
	}
	return $lolvar;
}

//make url have http in front of it******************
$urltype = substr($_POST['url'], 0, 7);
if ( $urltype == "http://" || $urltype == "https:/" ) { // also accept https:// (its first 7 characters are "https:/")
	$url = $_POST['url'];
} else {
	$url = "http://" . $_POST['url'];
}
mysql_query("INSERT INTO mem(links) VALUES('{$url}')"); //Enter the first link into database
//end make url have http in front of it***************
for($x = $_POST['level']; $x > 0; $x--) {
	$query = mysql_query("SELECT links FROM mem");
	while ($gigty = mysql_fetch_array($query)) {
		$content = file_get_contents($gigty['links']);//get the pages html content
		preg_match_all('/<a[^>]+href="([^"]+)"[^"]*>/is', $content, $array); //Search the html for links
		foreach ($array[1] as $lol => $val) { //Put the found links into the history table and the mem table
			$lolvar = isrelurl($val, $gigty['links']);
			if ( mysql_num_rows(mysql_query("SELECT links FROM history WHERE links = '{$lolvar}'")) < 1 ) {
				mysql_query("INSERT INTO mem(links) VALUES ('{$lolvar}');");
				mysql_query("INSERT INTO history(links) VALUES ('{$lolvar}');");
				echo "\n<br />" . $lolvar . " Indexed";
			} else {
				echo "\n<br />" . $lolvar . " is Already In History";
			}
		}
		mysql_query("DELETE FROM mem LIMIT 1"); //delete the link from mem
	}
}
echo "\n\n<br /><br />No More Links To Follow"; // double quotes so \n is output as a newline
mysql_query("TRUNCATE TABLE mem");
}
/*
To Do:
Make relative links work for the other levels
Get the rest of the content from the page and insert it into a searchable table
*/
?>

 

Any help is greatly appreciated. Thanks, John
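
The http://google.com/test.html/link.html result comes from gluing the link onto the full page URL, file name included. One possible way around it, sketched here rather than offered as the definitive fix, is to split the page URL with parse_url() and attach the link either to the site root (for links starting with "/") or to the page's directory (for everything else). The function name resolve_link is hypothetical and not part of the code above. The sketch assumes the page URL is absolute, and it does not handle "../" segments, protocol-relative links ("//host/..."), or non-HTTP schemes such as mailto:.

```php
<?php
// Hypothetical helper (not from the original post): resolve a link found on a
// page against that page's URL instead of appending it to the full URL.
function resolve_link($link, $pageUrl) {
    // Absolute http/https links are returned unchanged.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }

    // Assumption: $pageUrl is absolute, so scheme and host are present.
    $parts = parse_url($pageUrl);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Root-relative link ("/about.html"): attach it to the site root.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Document-relative link ("link.html"): keep the page's directory and
    // drop its file name (everything after the last slash in the path).
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

// The case from the post:
echo resolve_link('link.html', 'http://google.com/test.html'), "\n";
// prints http://google.com/link.html
```

In the spider, a function like this could stand in for isrelurl() with the same two arguments, since the second argument is already the URL of the page being scanned.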
