Search the Community
Showing results for tags 'crawler'.
-
Hello, I made a search engine and now I am trying to find an open-source spider for it. I have a database (managed through phpMyAdmin) that holds around 200 URLs with their descriptions, titles, and keywords, and I want to connect it to a spider so it can add more results.
-
I was trying to write a class that generates a sitemap whenever a post is created or edited, but I can't figure out what my mistake is.

    class sitemap {
        var $file_net;
        var $url_net;
        var $extention_net;
        var $freq_net;
        var $priority_net;

        function set() {
            $file = $this->file_net;
            $url = $this->url_net;
            $extention = $this->extention_net;
            $freq = $this->freq_net;
            $priority = $this->priority_net;
        }

        function Path ($p) {
            $a = explode ("/", $p);
            $len = strlen ($a[count ($a) - 1]);
            return (substr ($p, 0, strlen ($p) - $len));
        }

        function GetUrl($url) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            $data = curl_exec($ch);
            curl_close($ch);
            return $data;
        }

        function Scan($url) {
            global $scanned, $pf, $extension, $skip, $freq, $priority;
            echo "scan url $url\n";
            array_push ($scanned, $url);
            $html = GetUrl ($url);
            $a1 = explode ("<a", $html);
            foreach ($a1 as $key => $val) {
                $parts = explode (">", $val);
                $a = $parts[0];
                $aparts = explode ("href=", $a);
                $hrefparts = explode (" ", $aparts[1]);
                $hrefparts2 = explode ("#", $hrefparts[0]);
                $href = str_replace ("\"", "", $hrefparts2[0]);
                if ((substr ($href, 0, 7) != "http://") &&
                    (substr ($href, 0, 8) != "https://") &&
                    (substr ($href, 0, 6) != "ftp://")) {
                    if ($href[0] == '/')
                        $href = "$scanned[0]$href";
                    else
                        $href = Path ($url) . $href;
                }
                if (substr ($href, 0, strlen ($scanned[0])) == $scanned[0]) {
                    $ignore = false;
                    if (isset ($skip))
                        foreach ($skip as $k => $v)
                            if (substr ($href, 0, strlen ($v)) == $v)
                                $ignore = true;
                    if ((!$ignore) && (!in_array ($href, $scanned)) && (strpos ($href, $extension) > 0)) {
                        fwrite ($pf, "<url>\n <loc>$href</loc>\n" .
                                     " <changefreq>$freq</changefreq>\n" .
                                     " <priority>$priority</priority>\n</url>\n");
                        echo $href. "\n";
                        Scan ($href);
                    }
                }
            }
        }

        $pf = fopen ($file, "w");
        if (!$pf) {
            echo "cannot create $file\n";
            return;
        }
        fwrite ($pf,"<?xml version=\"1.0\" encoding=\"UTF-8\"?>
    <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"
            xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
            xsi:schemaLocation=\"http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd\">
    <!-- created with SiteMap Generator -->
    <url>
     <loc>$url/</loc>
     <changefreq>daily</changefreq>
    </url>
    ");
        $scanned = array();
        Scan ($url);
        fwrite ($pf, "</urlset>\n");
        fclose ($pf);
    }
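One possible restructuring of the class in this post, offered as a minimal sketch rather than a definitive fix. It assumes the trouble comes from three things visible in the code: methods are called without `$this->`, statements sit directly in the class body (PHP only allows them inside methods), and `set()` assigns to local variables that vanish when it returns. The class name `SitemapSketch`, the injectable `$fetch` callable, and the `build()` method are hypothetical names introduced for illustration. Links are extracted with a regular expression instead of `explode()`, and the `$extension` and `$skip` filters are left out for brevity.

```php
<?php
// Hypothetical sketch; class, method, and parameter names are illustrative.
class SitemapSketch
{
    private $base;               // site root without trailing slash
    private $fetch;              // callable: takes a URL, returns the page HTML
    private $scanned = array();  // URLs already visited

    public function __construct($base, $fetch = null)
    {
        $this->base  = rtrim($base, '/');
        // Default fetcher uses cURL, like GetUrl() in the post.
        $this->fetch = $fetch ? $fetch : array($this, 'getUrl');
    }

    public function getUrl($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $data = curl_exec($ch);
        return $data === false ? '' : $data;
    }

    // Turn an href into an absolute on-site URL, or null if it is off-site.
    // Protocol-relative links ("//host/path") are not handled in this sketch.
    private function resolve($href, $pageUrl)
    {
        $href = preg_replace('/#.*$/', '', $href);   // drop the fragment
        if ($href === '') {
            return null;
        }
        if (!preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {   // no scheme
            if ($href[0] === '/') {
                $href = $this->base . $href;
            } else {
                // Prefix with the directory part of the current page URL.
                $href = substr($pageUrl, 0, strrpos($pageUrl, '/') + 1) . $href;
            }
        }
        return strpos($href, $this->base . '/') === 0 ? $href : null;
    }

    private function scan($url)
    {
        $this->scanned[] = $url;
        $html = call_user_func($this->fetch, $url);
        preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']*)["\']/i', $html, $m);
        foreach ($m[1] as $href) {
            $link = $this->resolve($href, $url);
            if ($link !== null && !in_array($link, $this->scanned)) {
                $this->scan($link);   // note the $this-> on method calls
            }
        }
    }

    // Crawl from the site root and return the sitemap XML as a string.
    public function build($freq = 'daily', $priority = '0.5')
    {
        $this->scanned = array();
        $this->scan($this->base . '/');
        $xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             . "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
        foreach ($this->scanned as $loc) {
            $xml .= "<url>\n <loc>" . htmlspecialchars($loc) . "</loc>\n"
                  . " <changefreq>$freq</changefreq>\n"
                  . " <priority>$priority</priority>\n</url>\n";
        }
        return $xml . "</urlset>\n";
    }
}
```

Usage would look something like `file_put_contents('sitemap.xml', (new SitemapSketch('http://example.com'))->build());`, where the URL and file name are placeholders. Passing a custom `$fetch` callable makes it possible to try the crawler against canned HTML without touching the network.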
-
I looked on the web but couldn't find it, or perhaps I wasn't searching for the right terms. I'm looking for information on how to write identifying code for a crawler bot, so that website owners can see that my bot crawled them in the robots/spiders visitors section of their AWStats reports. Currently I'm showing up as "Unknown robot (identified by empty user agent string)". Does anyone know how to do this, or where I could find some info on the subject?
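The "empty user agent string" message suggests the crawler sends no `User-Agent` header, which is typically the case for PHP's cURL unless the option is set explicitly. A minimal sketch of setting one with `CURLOPT_USERAGENT` follows; the bot name, version, and info URL are made-up placeholders, and `botUserAgent()` and `fetchAsBot()` are hypothetical helper names. How AWStats then labels the bot depends on its own robots list, so a custom name may still show up in a generic category.

```php
<?php
// Hypothetical helpers; the bot name, version, and URL below are placeholders.
function botUserAgent($name, $version, $infoUrl)
{
    // Common convention: "Name/version (+URL of a page describing the bot)".
    return sprintf('%s/%s (+%s)', $name, $version, $infoUrl);
}

function fetchAsBot($url, $userAgent)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Identify the crawler to the server on every request.
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    return curl_exec($ch);
}

$ua = botUserAgent('ExampleBot', '1.0', 'http://example.com/bot.html');
// $html = fetchAsBot('http://example.com/', $ua);
```

Pointing the URL in the string at a page that explains what the bot does and how to contact its owner gives site owners something to look up when they see it in their logs.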
-
I've used this code in my script, but for some reason it's giving me a bad result.

    else if ($site==3) {
        //$content=file_get_contents('data/tp2');
        $content=httpGet('http://thepiratebay.se/search/'.urlencode($string).'/0/99/0');
        preg_match_all('#<tr.*>(.+)</tr>#Us',$content,$m);
        $array=array();
        foreach ($m[1] as $k =>$v) {
            if (strpos($v,'detLink')===false) {
                continue;
            }
            preg_match_all('#<td.*>(.+)</td>#Us',$v,$m2);
            $array[$k]=array();
            $array[$k]['size']=0;
            foreach ($m2[1] as $k2 =>$v2) {
                switch ($k2) {
                    case 0:
                        $c=strip_tags($v2);
                        $array[$k]['category']=substr($c,0,strpos($c,'>')-1);
                        break;
                    case 2:
                        $d=trim(strip_tags($v2));
                        $array[$k]['date']=convertDate2($d);
                        break;
                    case 1:
                        preg_match('#<a href="([^"]+)".*>(.+)</a>#Us',$v2,$m3);
                        if (isset($m3[1])) {
                            $array[$k]['id']=$m3[1];
                        } else {
                            $array[$k]['id']='';
                        }
                        $name=trim(str_replace(' ',"\n",strip_tags($m3[2])));
                        $array[$k]['name']=$name;
                        break;
                    case 4:
                        @list($num,$base)=explode(' ',trim(strip_tags($v2)));
                        $array[$k]['size']=toB($num,$base);
                        break;
                    case 5:
                        $array[$k]['seeds']=trim(strip_tags($v2));
                        break;
                    case 6:
                        $array[$k]['peers']=trim(strip_tags($v2));
                        break;
                }
            }
            if (empty($array[$k]['id']) || empty($array[$k]['category'])) {
                unset($array[$k]);
            }
        }

This is how it's coming out on my project site. Date, Size, Seeders, and Leechers are not showing anything. Any ideas why this happens? The word I used for the search was "adobe". Thank you.

get.php
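When index-based column parsing like the `switch ($k2)` in this post returns empty fields, one way to investigate is to dump every cell of a row together with its index and compare that against the indexes the code expects. A minimal sketch using `DOMDocument` and `DOMXPath` instead of regular expressions follows. The function name `parseRows()` is hypothetical, and the sample markup and its column order are invented for illustration, since the actual page layout is not shown in the post.

```php
<?php
// Hypothetical helper: returns one array of trimmed cell texts for each <tr>
// that contains a link with class "detLink" (the marker the post filters on).
function parseRows($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = array();
    foreach ($xpath->query('//tr[.//a[contains(@class, "detLink")]]') as $tr) {
        $cells = array();
        foreach ($xpath->query('./td', $tr) as $td) {
            // Collapse runs of whitespace so each cell is a single clean string.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Made-up sample; print the cells to see which index holds which field.
$sample = '<table><tr><td>Apps</td>'
        . '<td><a class="detLink" href="/t/1">Adobe X</a></td>'
        . '<td>Today</td><td>1.5 GiB</td><td>10</td><td>2</td></tr>'
        . '<tr><td>header only</td></tr></table>';
$rows = parseRows($sample);
print_r($rows);
```

If the printed indexes differ from the `case` labels, or several fields turn out to share one cell, that would explain why only some values come through.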