JSHINER Posted October 2, 2007

<?php
for ($n=0;$n<90;$n++) {
    $seed = "http://www.site.com/page.php?p=$n";
    $data = file_get_contents($seed);
    if (preg_match_all("/\http:[^\"\s']+/", $data, $links)) {
        header("Content-type: text/plain");
        for ($i=0;$i<count($links[0]);$i++) {
            echo $links[0][$i]. "\n";
        }
    }
}
?>

This collects all the links on a page, and other pages based on # (1-89) - however I am having a few problems:

1) It does not seem to go through all 89 pages

2) It duplicates some email links - how can I limit to only one result per link so there are not duplicates?

Link to comment https://forums.phpfreaks.com/topic/71599-solved-having-a-problem-with-my-spider/
JSHINER Posted October 2, 2007

Never mind, it IS going through all pages. But duplicates are still a problem.
MadTechie Posted October 3, 2007

try this

<?php
header("Content-type: text/plain"); //only 1 header please!
$email = array();
for ($n=0;$n<90;$n++) {
    $seed = "http://www.site.com/page.php?p=$n";
    $data = file_get_contents($seed);
    // match a literal "http:" (with a leading backslash, current PCRE reads \h as whitespace)
    if (preg_match_all("/http:[^\"\s']+/", $data, $links)) {
        for ($i=0;$i<count($links[0]);$i++) {
            $email[] = $links[0][$i];
        }
    } else {
        echo "Skipped: $n<br>";
    }
}
$newemail = array_unique($email);
echo "<pre>";
print_r($newemail);
?>

untested (written on the fly)
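For readers following along, here is a minimal sketch of what array_unique() does to a list with repeats. The sample URLs are made up for illustration and are not from the thread. array_unique() keeps the first occurrence of each value together with its original key, so the keys can end up with gaps; array_values() renumbers them if that matters.

```php
<?php
// Hypothetical sample data standing in for links collected by the spider.
$links = array(
    "http://www.site.com/a",
    "http://www.site.com/a", // repeat of the first entry
    "http://www.site.com/b",
);

// Keeps the first copy of each value and preserves its key,
// so this result has keys 0 and 2 (key 1 was the duplicate).
$unique = array_unique($links);

// Renumber the keys from 0 so the list has no gaps.
$unique = array_values($unique);

print_r($unique);
```

Whether the renumbering step is needed depends on how the list is used afterwards; for simply printing the links it makes no difference.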
JSHINER Posted October 3, 2007

Every so often I get an error:

Warning: Cannot modify header information - headers already sent by (output started at ) . . .

(This is unrelated . . .) How can I hide errors in the output?
MadTechie Posted October 3, 2007

@header("Content-type: text/plain"); //only 1 header please!

@ suppresses the error
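Note that @ only hides the warning; once output has started, the header is still not sent. A possible alternative, sketched below, is to check headers_sent() before calling header() and to turn off on-screen error display with ini_set(). The function name send_plain_text_header() is a hypothetical helper invented for this example, not something from the thread.

```php
<?php
// Keep PHP warnings out of the page output (they can still go to the error log).
ini_set('display_errors', '0');

// Hypothetical helper: send the plain-text header only if that is still possible.
function send_plain_text_header() {
    // headers_sent() fills $file and $line with the place where output began.
    if (headers_sent($file, $line)) {
        error_log("Headers already sent (output started at $file:$line)");
        return false;
    }
    header("Content-Type: text/plain");
    return true;
}

// Call this once, before the loop echoes anything.
$sent = send_plain_text_header();
```

Calling it once at the top of the script, before any echo, is what avoids the warning in the first place; the check is only a safety net.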
JSHINER Posted October 3, 2007

Thanks. Now how can I get it to display each result only once? The one posted before wasn't in plain text, and I'm not sure it's what I needed.

I know a problem in there has something to do with the $n++, so it must hit pages more than once. Either a fix for displaying emails more than once or a fix for the page problem would be much appreciated.
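One way to get each link once, as plain text with one link per line, is to use the link itself as an array key while collecting, since keys cannot repeat. The sketch below is not the thread's accepted fix; extract_links() and unique_links() are hypothetical helper names, and the $pages strings are made-up HTML standing in for the file_get_contents() results.

```php
<?php
// Hypothetical helper: pull http links out of one page of HTML.
function extract_links($html) {
    // Literal "http:" followed by everything up to a quote or whitespace.
    if (preg_match_all('/http:[^"\'\s]+/', $html, $matches)) {
        return $matches[0];
    }
    return array();
}

// Hypothetical helper: collect each link once across many pages.
function unique_links(array $pages) {
    $seen = array();
    foreach ($pages as $html) {
        foreach (extract_links($html) as $link) {
            $seen[$link] = true; // array keys are unique, so repeats collapse
        }
    }
    return array_keys($seen);
}

// Made-up HTML standing in for the pages fetched in the loop.
$pages = array(
    '<a href="http://www.site.com/a">A</a> <a href="http://www.site.com/b">B</a>',
    '<a href="http://www.site.com/a">A again</a>',
);

// One link per line, no print_r() array formatting.
echo implode("\n", unique_links($pages)), "\n";
```

In the real script the $pages array would be filled inside the for loop, and header("Content-type: text/plain") would be sent once before the final echo.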