jjacquay712 Posted January 22, 2009

I know most of you won't have the time or inclination to look over my code and debug it (I know I wouldn't if I were you), but I will post it anyway.

Here is my problem: I made a PHP spider that gets links from HTML and puts them into a database. The problem is with relative links. If a link doesn't have http:// at the beginning, the spider assumes it's a relative link and adds the domain name to the front of it. But if it's searching a page with a URL like http://google.com/test.html and finds a relative link, it will do this: http://google.com/test.html/link.html. I can't think of a way to solve this problem. Here is my code:

<?php
set_time_limit(2);

if ($_POST['url']) {

    //connect************************************************************
    mysql_connect("localhost", "johnj_admin", "lolsn1perjesus");
    mysql_select_db("johnj_search");
    //end connect *******************************************************

    // If the link is relative, add http and the domain name in front of it
    function isrelurl($val, $rofl) {
        $urlhttp = substr($val, 0, 7);
        if ( $urlhttp == "http://" || $urlhttp == "https:/" ) {
            $lolvar = $val;
        } else {
            $snap = substr($val, 0, 1);
            if ($snap == "/") {
                $lolvar = $rofl . $val;
            } else {
                $lolvar = $rofl . '/' . $val;
            }
        }
        return $lolvar;
    }

    //make url have http in front of it******************
    $urltype = substr($_POST['url'], 0, 7);
    if ( $urltype == "http://" ) {
        $url = $_POST['url'];
    } else {
        $url = "http://" . $_POST['url'];
    }
    mysql_query("INSERT INTO mem(links) VALUES('{$url}')"); //Enter the first link into database
    //end make url have http in front of it***************

    for ($x = $_POST['level']; $x > 0; $x--) {
        $query = mysql_query("SELECT links FROM mem");
        while ($gigty = mysql_fetch_array($query)) {
            $content = file_get_contents($gigty['links']); //get the page's html content
            preg_match_all('/<a[^>]+href="([^"]+)"[^"]*>/is', $content, $array); //Search the html for links

            foreach ($array[1] as $lol => $val) { //Put the found links into the history table and the mem table
                $lolvar = isrelurl($val, $gigty['links']);
                if ( mysql_num_rows(mysql_query("SELECT links FROM history WHERE links = '{$lolvar}'")) < 1 ) {
                    mysql_query("INSERT INTO mem(links) VALUES ('{$lolvar}');");
                    mysql_query("INSERT INTO history(links) VALUES ('{$lolvar}');");
                    echo "\n<br />" . $lolvar . " Indexed";
                } else {
                    echo "\n<br />" . $lolvar . " is Already In History";
                }
            }
            mysql_query("DELETE FROM mem LIMIT 1"); //delete the link from mem
        }
    }

    // Double quotes so that \n is emitted as a newline rather than literally
    echo "\n\n<br /><br />No More Links To Follow";
    mysql_query("TRUNCATE TABLE mem");
}

/*
To Do:
    Make relative links work for the other levels
    Get the rest of the content from the page and insert it into a searchable table
*/
?>

Any help is greatly appreciated.

Thanks,
John
.josh Posted January 22, 2009

I have no intention of rewriting your code for you, but I suggest you look into using one or more of the following functions:

parse_url
dirname
basename
pathinfo
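
For readers following along, here is a minimal sketch of how parse_url and dirname might be combined to resolve a link against the page it was found on. It is not code from the thread: the function name resolve_url is hypothetical, PHP 7 or later is assumed (for the ?? operator and scalar type hints), and the sketch deliberately ignores "../" and "./" segments, query-only links such as "?page=2", fragments, and non-HTTP schemes such as mailto:.

```php
<?php
// Hypothetical helper (an assumption, not part of the original spider):
// resolve $link against $base, the URL of the page the link was found on.
function resolve_url(string $link, string $base): string
{
    // Already absolute: return it unchanged.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }

    // Split the base URL into its components.
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $root   = $scheme . '://' . $host;
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Protocol-relative link ("//example.com/x"): reuse the base scheme.
    if (substr($link, 0, 2) === '//') {
        return $scheme . ':' . $link;
    }

    // Root-relative link ("/x"): replace the whole path of the base.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Document-relative link ("x"): keep only the directory of the base path.
    $path = $parts['path'] ?? '/';
    // A path ending in "/" is already a directory; otherwise drop the file name.
    $dir = (substr($path, -1) === '/') ? $path : dirname($path);
    // Normalise to exactly one trailing slash (dirname may return "\" on Windows).
    $dir = rtrim(str_replace('\\', '/', $dir), '/') . '/';

    return $root . $dir . $link;
}
```

With this helper, a link "link.html" found on http://google.com/test.html resolves to http://google.com/link.html rather than http://google.com/test.html/link.html. In the spider above, it could stand in for the call to isrelurl($val, $gigty['links']), since it takes the same two arguments in the same order.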