jjacquay712 Posted January 22, 2009

I know most of you won't have the time or inclination to look over my code and debug it (I know I wouldn't if I were you), but I will post it anyway.

Here is my problem. I made a PHP spider that gets links from HTML and puts them into a database. The problem is with relative links. If a link doesn't have http:// at the beginning, the spider assumes it is a relative link and adds the domain name to the front of it. But if it is searching a page with a URL like http://google.com/test.html and finds a relative link, it produces http://google.com/test.html/link.html. I can't think of a way to solve this problem. Here is my code:

<?php
set_time_limit(2);

if ($_POST['url']) {
    //connect************************************************************
    mysql_connect("localhost", "johnj_admin", "lolsn1perjesus");
    mysql_select_db("johnj_search");
    //end connect *******************************************************

    function isrelurl($val, $rofl) {
        $urlhttp = substr($val, 0, 7); //If the link is relative add http and the domain name in front of it
        if ( $urlhttp == "http://" || $urlhttp == "https:/" ) {
            $lolvar = $val;
        } else {
            $snap = substr($val, 0, 1);
            if ($snap == "/") {
                $lolvar = $rofl . $val;
            } else {
                $lolvar = $rofl . '/' . $val;
            }
        }
        return $lolvar;
    }

    //make url have http in front of it******************
    $urltype = substr($_POST['url'], 0, 7);
    if ( $urltype == "http://" ) {
        $url = $_POST['url'];
    } else {
        $url = "http://" . $_POST['url'];
    }
    mysql_query("INSERT INTO mem(links) VALUES('{$url}')"); //Enter the first link into database
    //end make url have http in front of it***************

    for($x = $_POST['level']; $x > 0; $x--) {
        $query = mysql_query("SELECT links FROM mem");
        while ($gigty = mysql_fetch_array($query)) {
            $content = file_get_contents($gigty['links']); //get the page's html content
            preg_match_all('/<a[^>]+href="([^"]+)"[^"]*>/is', $content, $array); //Search the html for links
            foreach ($array[1] as $lol => $val) { //Put the found links into the history table and the mem table
                $lolvar = isrelurl($val, $gigty['links']);
                if ( mysql_num_rows(mysql_query("SELECT links FROM history WHERE links = '{$lolvar}'")) < 1 ) {
                    mysql_query("INSERT INTO mem(links) VALUES ('{$lolvar}');");
                    mysql_query("INSERT INTO history(links) VALUES ('{$lolvar}');");
                    echo "\n<br />" . $lolvar . " Indexed";
                } else {
                    echo "\n<br />" . $lolvar . " is Already In History";
                }
            }
            mysql_query("DELETE FROM mem LIMIT 1"); //delete the link from mem
        }
    }
    echo '\n\n<br /><br />No More Links To Follow';
    mysql_query("TRUNCATE TABLE mem");
}

/*
To Do:
Make relative links work for the other levels
Get the rest of the content from the page and insert it into a searchable table
*/
?>

Any help is greatly appreciated.

Thanks,
John
.josh Posted January 22, 2009

I have no intention of rewriting your code for you, but I suggest you look into using any one or more of the following functions:

parse_url
dirname
basename
pathinfo
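To make the parse_url suggestion concrete, here is a minimal sketch of one way a relative link could be resolved against the page it was found on. The function name resolve_url and its parameters are hypothetical and not part of the original spider. It is a simplification rather than a full RFC 3986 resolver: it ignores the page's own query string, and links with other schemes (mailto:, javascript:) are returned untouched and still need filtering by the caller. It cuts the path at the last slash instead of calling dirname, because dirname also drops the last segment when the path ends in a slash.

```php
<?php
// Hypothetical helper, not from the original post: resolve $link against
// the URL of the page ($base) on which it was found.
function resolve_url(string $base, string $link): string
{
    // A link that already carries a scheme (http:, https:, mailto:, ...) is left alone.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($base) ?: [];
    $scheme = $parts['scheme'] ?? 'http';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $root   = $scheme . '://' . ($parts['host'] ?? '') . $port;
    $path   = $parts['path'] ?? '/';

    // Protocol-relative link (//host/path): reuse the page's scheme.
    if (substr($link, 0, 2) === '//') {
        return $scheme . ':' . $link;
    }

    // Root-relative link (/path): attach it to the site root, not to the page URL.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Split the link into its path and any ?query or #fragment suffix.
    $cut      = strcspn($link, '?#');
    $linkPath = substr($link, 0, $cut);
    $suffix   = (string) substr($link, $cut);
    if ($linkPath === '') {
        return $root . $path . $suffix; // query-only or fragment-only link
    }

    // Document-relative link: drop the page's file name, keep its directory.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $dir . $linkPath) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($segments);
            continue;
        }
        $segments[] = $segment;
    }

    $resolved = $root . '/' . implode('/', $segments);
    if (substr($linkPath, -1) === '/') {
        $resolved .= '/'; // keep a trailing slash on directory links
    }
    return $resolved . $suffix;
}
```

With this sketch, resolve_url('http://google.com/test.html', 'link.html') gives http://google.com/link.html instead of http://google.com/test.html/link.html. In the spider it would take the place of the isrelurl($val, $gigty['links']) call.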