JSHINER Posted October 2, 2007

<?php
for ($n=0;$n<90;$n++) {
    $seed = "http://www.site.com/page.php?p=$n";
    $data = file_get_contents($seed);
    if (preg_match_all("/\http:[^\"\s']+/", $data, $links)) {
        header("Content-type: text/plain");
        for ($i=0;$i<count($links[0]);$i++) {
            echo $links[0][$i]. "\n";
        }
    }
}
?>

This collects all the links on a page, and on other pages based on the page number (1-89). However, I am having a few problems:

1) It does not seem to go through all 89 pages.
2) It duplicates some email links. How can I limit the output to one result per link so there are no duplicates?

Link to comment: https://forums.phpfreaks.com/topic/71599-solved-having-a-problem-with-my-spider/
JSHINER (Author) Posted October 2, 2007

Never mind, it IS going through all pages. But duplicates are still a problem.
MadTechie Posted October 3, 2007

Try this:

<?php
header("Content-type: text/plain"); //only 1 header please!
$email = array();
for ($n=0;$n<90;$n++) {
    $seed = "http://www.site.com/page.php?p=$n";
    $data = file_get_contents($seed);
    if (preg_match_all("/\http:[^\"\s']+/", $data, $links)) {
        for ($i=0;$i<count($links[0]);$i++) {
            $email[] = $links[0][$i];
        }
    }else{
        echo "Skipped: $n<br>";
    }
}
$newemail = array_unique($email);
echo "<pre>";
print_r($newemail);
?>

Untested (written on the fly).
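One thing worth checking in the pattern used in this thread: in the PCRE flavour that current PHP uses, `\h` means "horizontal whitespace", so `/\http:.../` may not match what was intended. Below is a minimal sketch of the extraction step with that backslash dropped. The function name `extract_links` is hypothetical and not from the thread; the character class is the one from the posted code.

```php
<?php
// Hypothetical helper (not from the thread): pull every http: link out of
// a chunk of HTML. Same character class as the posted pattern, but without
// the leading backslash, which PCRE would read as \h (horizontal whitespace).
function extract_links(string $html): array
{
    // A link ends at the first double quote, single quote, or whitespace.
    if (preg_match_all('/http:[^"\s\']+/', $html, $matches)) {
        return $matches[0];
    }
    return array(); // no links found on this page
}
```

Inside the spider loop this could be used as `$email = array_merge($email, extract_links($data));`, assuming `$data` holds the fetched page.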
JSHINER (Author) Posted October 3, 2007

Every so often I get an error:

<b>Warning</b>: Cannot modify header information - headers already sent by (output started at ) . . .

(This is unrelated . . .) How can I hide errors in the output?
MadTechie Posted October 3, 2007

@header("Content-type: text/plain"); //only 1 header please!

The @ suppresses the error.
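The `@` operator only silences the one expression it is attached to. The warning itself appears because `header()` has to be called before any output is sent. If the goal is just to keep warnings out of the page, a possible alternative is to switch off on-screen display and log the errors instead. This is a rough sketch, assuming the host allows `ini_set()`; the function name `hide_errors_from_output` is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): keep errors out of the page
// output but still record them, so problems are not silently lost.
function hide_errors_from_output(): void
{
    ini_set('display_errors', '0'); // do not print warnings into the output
    ini_set('log_errors', '1');     // write them to the error log instead
    error_reporting(E_ALL);         // still report everything (to the log)
}
```

Calling `hide_errors_from_output();` at the very top of the script, before the `header()` call, would apply it to the whole run.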
JSHINER (Author) Posted October 3, 2007

Thanks. Now how can I get it to display each result only once? The code posted before didn't output plain text, and I'm not sure it's what I needed. I think the problem has something to do with the $n++, so it must be hitting pages more than once. Either a fix for emails being displayed more than once, or a fix for the page problem, would be much appreciated.
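For what it's worth, the `for` loop with `$n++` requests each page number exactly once, so repeated results more likely come from the same link appearing on several pages (or several times on one page). A small sketch of removing duplicates and printing one link per line as plain text follows. The function name `unique_links_as_text` is hypothetical; `array_unique()` is the same call used earlier in the thread.

```php
<?php
// Hypothetical helper (not from the thread): drop duplicate links and
// return them as plain text, one link per line.
function unique_links_as_text(array $links): string
{
    // array_unique() keeps the first occurrence of each value;
    // array_values() re-indexes the result from 0.
    $unique = array_values(array_unique($links));
    return implode("\n", $unique);
}
```

With the links collected in `$email` as in the earlier reply, `echo unique_links_as_text($email);` would replace the `<pre>` and `print_r()` lines and keep the output consistent with the `text/plain` header.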