I know most of you won't have the time or energy to look over my code and debug it (I know I wouldn't if I were you), but I'll post it anyway. Here is my problem: I made a PHP spider that extracts links from HTML and puts them into a database. The problem is with relative links. If a link doesn't have http:// at the beginning, the spider assumes it's a relative link and adds the URL of the page it was found on to the front of it. That works when the page URL is just a domain name, but if it's searching a page with a URL like http://google.com/test.html and finds a relative link such as link.html, it produces http://google.com/test.html/link.html instead of http://google.com/link.html. I can't think of a way to solve this problem. Here is my code:

 

<?php
set_time_limit(2);
if (!empty($_POST['url'])) {
//connect************************************************************
mysql_connect("localhost", "johnj_admin", "lolsn1perjesus");
mysql_select_db("johnj_search");
//end connect *******************************************************
function isrelurl($val, $rofl) {
	$urlhttp = substr($val, 0, 7);//If the link is relative, add http and the domain name in front of it
	if ( $urlhttp == "http://" || $urlhttp == "https:/" ) {
		$lolvar = $val;
	} else {
		$snap = substr($val, 0, 1);
		if ($snap == "/") {
			$lolvar = $rofl . $val;
		} else {
			$lolvar = $rofl . '/' . $val;
		}
	}
	return $lolvar;
}

//make url have http in front of it******************
if ( preg_match('#^https?://#i', $_POST['url']) ) { //accept https:// as well, so it doesn't become http://https://...
	$url = $_POST['url'];
} else {
	$url = "http://" . $_POST['url'];
}
mysql_query("INSERT INTO mem(links) VALUES('" . mysql_real_escape_string($url) . "')"); //Enter the first link into database (escaped so quotes in the url can't break the query)
//end make url have http in front of it***************
for($x = $_POST['level']; $x > 0; $x--) {
	$query = mysql_query("SELECT links FROM mem");
	while ($gigty = mysql_fetch_array($query)) {
		$content = file_get_contents($gigty['links']);//get the pages html content
		preg_match_all('/<a[^>]+href="([^"]+)"[^"]*>/is', $content, $array); //Search the html for links
		foreach ($array[1] as $lol => $val) { //Put the found links into the history table and the mem table
			$lolvar = isrelurl($val, $gigty['links']);
			$safe = mysql_real_escape_string($lolvar); //escape the link so quotes in it can't break the queries
			if ( mysql_num_rows(mysql_query("SELECT links FROM history WHERE links = '{$safe}'")) < 1 ) {
				mysql_query("INSERT INTO mem(links) VALUES ('{$safe}');");
				mysql_query("INSERT INTO history(links) VALUES ('{$safe}');");
				echo "\n<br />" . $lolvar . " Indexed";
			} else {
				echo "\n<br />" . $lolvar . " is Already In History";
			}
		}
		mysql_query("DELETE FROM mem LIMIT 1"); //delete the link from mem
	}
}
echo "\n\n<br /><br />No More Links To Follow"; //double quotes so \n is output as a newline
mysql_query("TRUNCATE TABLE mem");
}
/*
To Do:
Make relative links work for the other levels
Get the rest of the content from the page and insert it into a searchable table
*/
?>

 

Any help is greatly appreciated. Thanks, John
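
One possible direction for the relative-link problem described above, as a minimal sketch rather than a tested fix: split the page URL with PHP's parse_url(), then attach the link to the page's directory instead of to the full page URL. The function name resolve_url is hypothetical and is not part of the code in the post. The sketch assumes the page URL is a well-formed http or https URL, and it ignores query-only links, fragments, and non-web schemes such as mailto:.

```php
<?php
// Hypothetical helper (not in the original spider): resolve $link against
// $base, the URL of the page the link was found on.
function resolve_url($base, $link) {
    // Already absolute: return it unchanged.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Protocol-relative link such as //example.com/a.html
    if (substr($link, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $link;
    }

    // Root-relative link such as /a.html
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Document-relative link: keep the directory of the page, drop its file name.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    // Collapse "." and ".." segments.
    $segments = array();
    foreach (explode('/', $dir . $link) as $seg) {
        if ($seg === '..') {
            array_pop($segments);
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }

    $resolved = $root . '/' . implode('/', $segments);
    // Keep a trailing slash if the link had one (e.g. "docs/").
    if (substr($link, -1) === '/' && count($segments) > 0) {
        $resolved .= '/';
    }
    return $resolved;
}
```

With the example from the post, resolve_url('http://google.com/test.html', 'link.html') gives http://google.com/link.html. In the spider, a call like this could take the place of isrelurl($val, $gigty['links']).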
