Batfan Posted April 28, 2011

First and foremost, I am a PHP novice, so if there's a better or more efficient way of doing what I'm trying to do, please feel free to point it out. I came across an old PHP script that was used to crawl a site and check the response code on the pages found. I have modified it to do a duplicate-content check. It uses the similar_text function to compare one page's content (specified by the user) against the content of each page it finds. It's a little slow, but it's working. The only problem I'm having is that it stops after about the first 10 links, and I can't figure out why. Any help is greatly appreciated.

<form action="<?php echo $_SERVER['PHP_SELF']; ?>" method="post">
  <div class="row"><label for="page1" class="small label"><strong>Page? </strong>: </label><input type="text" name="page1" id="page1" value="" size="40" /></div>
  <div class="row"><label for="url" class="small label"><strong>Please Enter URL </strong>: </label><input type="text" name="url" id="url" value="" size="40" /></div>
  <div class="row"><label for="maxlinks" class="small label"><strong>Number of links to get </strong>: </label><input type="text" name="maxlinks" id="maxlinks" value="25" size="3" maxlength="3" /></div>
  <div class="row"><label for="linkdepth" class="small label"><strong>Links Maximum depth</strong> : </label>
    <select name="linkdepth" id="linkdepth">
      <option value="1">1</option>
      <option value="2" selected="selected">2</option>
      <option value="3">3</option>
      <option value="4">4</option>
      <option value="5">5</option>
      <option value="6">6</option>
    </select>
  </div>
  <input type="submit" name="submit" style="font-weight: bold" value="Check links" id="submit" />
</form>

<?php
if (isset($_POST['submit'])) {
    $page1    = ($_POST['page1']);
    $baseurl  = ($_POST['url']);
    $pages    = array();
    $i        = ($_POST['linkdepth']);
    $maxlinks = (integer)$_POST['maxlinks'];
    $domain   = extract_domain_name($baseurl);
    echo '<p class="small">Extracted domain name: <strong>'.$domain.'</strong>. ';
    echo 'Maximum depth: <strong>'.$i.'</strong></p>';

    function get_urls($page){
        global $domain, $i;
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $page);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_HEADER, true);
        /* Spoof the User-Agent header value, just to be safe */
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
        /* Timeout values for the connection and download, so the script
           doesn't get stuck downloading huge files or trying to connect
           to a nonresponsive server. These are optional. */
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 100);
        curl_setopt($ch, CURLOPT_TIMEOUT, 100);
        /* Set this to 1 if 404 Not Found (and similar) should be treated as errors */
        curl_setopt($ch, CURLOPT_FAILONERROR, 0);

        /* Download the page */
        $html = curl_exec($ch);
        /* In case of an error */
        if ($html === false) {
            echo '<p class="small">Error. Please check URL: <strong style="color:#ae3100">' . curl_error($ch) . '</strong></p>';
        }
        curl_close($ch);
        if (!$html) return false;

        /* Extract the BASE tag (if present) for relative-to-absolute URL conversions later */
        if (preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $matches)) {
            $base_url = $matches[1];
            echo $base_url;
        } else {
            $base_url = $page; // base url = the page where the new check starts
        }

        $links = array();
        $html  = str_replace("\n", ' ', $html);
        /* This regexp is a combination of numerous versions I saw online */
        preg_match_all('/<a[\s]+[^>]*href\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $m);
        foreach ($m[1] as $url) {
            $url = trim($url);
            /* Get rid of PHPSESSID, #fragments, &amp; entities and javascript: links */
            $url = preg_replace(
                array('/([\?&]PHPSESSID=\w+)$/i', '/(#[^\/]*)$/i', '/&amp;/', '/^(javascript:.*)/i'),
                array('', '', '&', ''),
                $url);
            /* Turn relative URLs into absolute URLs.
               relative2absolute() is defined further down on this page. */
            $url = relative2absolute($base_url, $url);
            // Check if the URL is in the same (sub-)$domain
            if (preg_match("/^http[s]?:\/\/[^\/]*".str_replace('.', '\.', $domain)."/i", $url)) {
                /* Count slashes in the URL; responsible for link depth */
                $depth = substr_count($url, "/") - 2;
                if ($depth <= $i) {
                    if (!in_array($url, $links, check)) $links[] = $url;
                }
            }
        }
        return $links;
    }

    // Returns the next page to crawl
    function next_page(){
        global $pages;
        $k = 0;
        foreach (array_keys($pages) as $k => $page) {
            if ($pages[$page] == NULL) {
                $k++;
                echo "[$k] - ";
                return $page;
            }
        }
        return NULL;
    }

    // Adds new unique URLs to the array and checks each URL's server header status
    function add_urls($page){
        global $pages, $maxlinks;
        $start = microtime(true);
        $urls = get_urls($page);
        /* With the timing it's possible to find out on which page the crawler stops responding */
        $resptime = microtime(true) - $start;

        // Start checking the server header
        $ch = curl_init($page);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_exec($ch);
        $info = curl_getinfo($ch);
        print "$page";

        // If the status code is 200, print OK, else NO
        // if ($info['http_code'] == 200) {
        $page1     = ($_POST['page1']);
        $page1data = file_get_contents($page1);
        $page2     = file_get_contents($page);
        $chars     = similar_text($page1data, $page2, $p);
        $p = round($p, 2);
        echo ' - Match Percentage: ' . $p . '%';
        // } else {
        //     echo '<strong style="color:#ba3d00"> NO </strong>';
        // }
        /* echo substr($resptime, 0, 5) . " seconds"; */ // Activate this to see how much time it takes to crawl each page
        echo '<br/>';
        curl_close($ch); // Close handle

        $pages[$page] = array('resptime' => floor($resptime * 9000), 'url' => $page);
        foreach ($urls as $url) {
            if (!array_key_exists($url, $pages) && !in_array($url, $pages) && count($pages) < $maxlinks) {
                $pages[$url] = NULL;
            }
        }
    }

    echo '[1] - '; // This is for the first input URL, as it will be extracted from the input
    add_urls($baseurl);
    while (($page = next_page()) != NULL) { // While there are URLs available
        add_urls($page);
    }

    // Count all extracted URLs
    echo '<p class="small">Amount of crawled links: <strong>'.count($pages).'</strong></p>';
    if (count($pages) < $maxlinks) echo '<p class="small">Sorry, no more links to crawl!!</p>';
}
?><?php
function extract_domain_name($url){
    /* Old domain extractor:
    if (preg_match('@^(?:http:\/\/)?([^\/]+)@i', $url, $matches)) {
        return trim(strtolower($matches[1]));
    } else {
        return '';
    } */
    preg_match("/^(http:\/\/)?([^\/]+)/i", $url, $matches);
    $host = $matches[2];
    // Get the last two segments of the host name
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    return $matches[0];
}

function relative2absolute($absolute, $relative) {
    $p = parse_url($relative);
    if ($p["scheme"]) return $relative;
    extract(parse_url($absolute));
    $path = dirname($path);
    if ($relative[0] == '/') {
        $newPath = array_filter(explode("/", $relative));
    } else {
        $aparts = array_filter(explode("/", $path));
        $rparts = array_filter(explode("/", $relative));
        $cparts = array_merge($aparts, $rparts);
        $k = 0;
        $newPath = array();
        foreach ($cparts as $i => $part) {
            if ($part == '..') {
                $k = $k - 1;
                $newPath[$k] = null;
            } else {
                $newPath[$k] = $cparts[$i];
                $k = $k + 1;
            }
        }
        $newPath = array_filter($newPath);
    }
    $path = implode("/", $newPath);
    $url = "";
    if ($scheme) { $url = "$scheme://"; }
    if ($user) {
        $url .= "$user";
        if ($pass) { $url .= ":$pass"; }
        $url .= "@";
    }
    if ($host) { $url .= "$host/"; }
    $url .= $path;
    return $url;
}
?>
Link to comment: https://forums.phpfreaks.com/topic/235043-comparing-crawled-page-text/
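For readers skimming the thread: similar_text() returns the number of matching characters, and its optional third argument is passed by reference and receives the similarity as a percentage, which is what the script above rounds and prints. A minimal standalone sketch (the sample strings are made up):

```php
<?php
// similar_text() returns the count of common characters between two
// strings; the third (by-reference) argument receives the similarity
// as a percentage, which the crawler above rounds to two decimals.
$a = "Lorem ipsum dolor sit amet";
$b = "Lorem ipsum dolor sitamet"; // one space removed

$common = similar_text($a, $b, $percent);
echo $common . " common chars, " . round($percent, 2) . "% similar\n";
```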
btherl Posted April 29, 2011

It's probably this:

$i = ($_POST['linkdepth']);
$maxlinks = (integer)$_POST['maxlinks'];

and this:

if ($depth <= $i){
    if(!in_array($url, $links, check)) $links[] = $url;
}
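For reference, here is how the quoted depth check behaves in isolation. This is only a sketch: the sample URLs and the (integer) cast (which the thread's code applies to $maxlinks but not to $i) are illustrative assumptions, not part of the original script.

```php
<?php
// The crawler's depth rule: slashes in the URL, minus the two in "http://".
// $_POST values arrive as strings; casting makes the comparison explicit,
// as is already done for $maxlinks in the script.
$i = (integer)"2"; // what $_POST['linkdepth'] holds for the default selection

$urls = array(
    "http://example.com/",          // 3 slashes => depth 1
    "http://example.com/a/",        // 4 slashes => depth 2
    "http://example.com/a/b/page",  // 5 slashes => depth 3, over the limit
);
foreach ($urls as $url) {
    $depth = substr_count($url, "/") - 2;
    echo $url . " => depth " . $depth . ($depth <= $i ? " (kept)" : " (skipped)") . "\n";
}
```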
Batfan (Author) Posted April 29, 2011

btherl wrote:
It's probably this:
$i = ($_POST['linkdepth']);
$maxlinks = (integer)$_POST['maxlinks'];
and this:
if ($depth <= $i){
    if(!in_array($url, $links, check)) $links[] = $url;
}

But wouldn't changing the link depth in the form fix that? I've tried selecting a higher link depth and it still stops after 10.
btherl Posted April 29, 2011

There's also this:

// Count all extracted URLs
if (count($pages) < $maxlinks) echo '<p class="small">Sorry, no more links to crawl!!</p>';

Are you increasing both the link depth and maxlinks in the form?
Batfan (Author) Posted April 29, 2011

btherl wrote:
There's also this:
if (count($pages) < $maxlinks) echo '<p class="small">Sorry, no more links to crawl!!</p>';
Are you increasing both the link depth and maxlinks in the form?

I did not increase both, but the default "links to crawl" value is 25.
btherl Posted April 29, 2011

Hmm, in the original post you said it stops after "about the first 10 links". How many links does it stop after? Does it vary, even when processing the same page multiple times?
Batfan (Author) Posted April 29, 2011

btherl wrote:
Hmm, in the original post you said it stops after "about the first 10 links". How many links does it stop after? Does it vary even when processing the same page multiple times?

Right. The default is set at 25, but it only lists 10. It should be doing 25 (or whatever the max is). It's consistent.
btherl Posted April 29, 2011

Can you post the output? There's just not enough information to help at the moment. Another thing you can do is use var_dump() to display the contents of various important variables, like $pages and the output of get_urls(). That can help to find what's going wrong.
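For instance, dumping $pages shows which URLs have been processed and which are still queued as NULL. The array below is illustrative only, hand-built in the same shape the crawler stores its entries:

```php
<?php
// Illustrative only: a $pages array in the same shape the crawler builds.
// Processed pages hold an array; queued-but-uncrawled pages are NULL.
$pages = array(
    "http://lipsum.com"     => array('resptime' => 123, 'url' => "http://lipsum.com"),
    "http://nl.lipsum.com/" => NULL, // queued, not yet crawled
);
var_dump($pages);
```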
Batfan (Author) Posted April 29, 2011

btherl wrote:
Can you post the output? There's just not enough information to help at the moment.

Sure, so using lipsum.com as a test, here is the result:

[1] - http://lipsum.com - Match Percentage: 67.4%
[2] - http://hy.lipsum.com/ - Match Percentage: 60.14%
[3] - http://sq.lipsum.com/ - Match Percentage: 63.54%
[4] - http://ar.lipsum.com/ - Match Percentage: 57.79%
[5] - http://bg.lipsum.com/ - Match Percentage: 61.2%
[6] - http://ca.lipsum.com/ - Match Percentage: 65.17%
[7] - http://hr.lipsum.com/ - Match Percentage: 65.65%
[8] - http://cs.lipsum.com/ - Match Percentage: 64.63%
[9] - http://da.lipsum.com/ - Match Percentage: 65.17%
[10] - http://nl.lipsum.com/
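A side note on this output: the subdomains show up because extract_domain_name() (from the first post) keeps only the last two host segments, and the same-domain regexp then matches any subdomain of that result. Reproducing just that function from the thread:

```php
<?php
// Reproduces the thread's extract_domain_name(): strips the optional
// "http://" scheme, then keeps only the last two dot-separated segments
// of the host, so "nl.lipsum.com" reduces to "lipsum.com".
function extract_domain_name($url){
    preg_match("/^(http:\/\/)?([^\/]+)/i", $url, $matches);
    $host = $matches[2];
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    return $matches[0];
}

echo extract_domain_name("http://lipsum.com") . "\n";      // lipsum.com
echo extract_domain_name("http://nl.lipsum.com/") . "\n";  // lipsum.com
```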
Batfan (Author) Posted May 2, 2011

Any thoughts?
btherl Posted May 3, 2011

It appears to be stopping between where it prints out [10] - http://nl.lipsum.com/ and where it prints out the match percentage. Can you add a few print statements between those two points so we can see exactly where it stops?
Batfan (Author) Posted May 3, 2011

btherl wrote:
It appears to be stopping between where it prints out [10] - http://nl.lipsum.com/ and where it prints out the match percentage. Can you add a few print statements between those two points so we can see exactly where it stops?

Sure. What variables should I be printing?
btherl Posted May 3, 2011

You don't need to print any variables. Just print "Reached line 200", "About to call get_pages()", and things like that. The last message displayed will tell you where it stopped.
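A sketch of the kind of tracing btherl describes. The checkpoint() helper and its messages are made up for illustration; flush() is added so each message is pushed out as soon as its line runs, meaning the last message shown is the last line that executed.

```php
<?php
// Hypothetical trace helper: prints a checkpoint and flushes output
// immediately, so partial output survives even if the script dies later.
function checkpoint($msg) {
    echo "Reached: " . $msg . "<br/>\n";
    flush();
}

checkpoint("about to call get_urls()");
// ... get_urls($page) would run here ...
checkpoint("get_urls() returned, about to fetch pages for similar_text()");
// ... file_get_contents() and similar_text() would run here ...
checkpoint("finished this page");
```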