Showing results for tags 'crawler'.

Found 5 results

  1. I've made a PHP web crawler and a MySQL table called "dex" (as in index). I connect to the database through PDO and tweaked the code to INSERT websites that haven't been crawled yet and UPDATE websites that have, using URL hashes as the "id" for links. The terminal shows all the links and the links related to them, the if statement works, and there are no major errors, so why is nothing inserted into the "dex" table? Every time I check the table after a run, I only find the row that I inserted manually to test the UPDATE/INSERT if statement. What can I do to fix this and insert the data the crawler retrieves?

     Test.html:

         <a href="https://google.com"></a>
         <a href="https://www.yahoo.com/"></a>
         <a href="https://www.bing.com/"></a>
         <a href="https://duckduckgo.com/"></a>

     Crawler:

         <?php
         error_reporting(E_ALL);
         ini_set('display_errors', 1);

         $start = "http://localhost/deepsearch/test.html";
         $pdo = new PDO('mysql:host=127.0.0.1;dbname=deepsearch', 'root', '');
         $already_crawled = array();
         $crawling = array();

         function get_details($url)
         {
             $options = array('http'=>array('method'=>"GET", 'headers'=>"User-Agent: howBot/0.1\n"));
             $context = stream_context_create($options);
             // Suppress warnings for HTML parsing errors
             libxml_use_internal_errors(true);
             $doc = new DOMDocument();
             @$html = @file_get_contents($url, false, $context);
             // Load HTML content and check for parsing errors
             if ($doc->loadHTML($html)) {
                 if (!empty($titleElements)) {
                     $title = $titleElements->item(0);
                     $title = $title->nodeValue;
                 } else {
                     $title = "";
                 }
                 $description = "";
                 $keywords = "";
                 $metas = $doc->getElementsByTagName("meta");
                 for ($i = 0; $i < $metas->length; $i++) {
                     $meta = $metas->item($i);
                     if ($meta->getAttribute("name") == strtolower("description")) {
                         $description = $meta->getAttribute("content");
                     }
                     if ($meta->getAttribute("name") == strtolower("keywords")) {
                         $keywords = $meta->getAttribute("content");
                     }
                 }
                 return '{"Title": "'.str_replace("\n", "", $title).'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.str_replace("\n", "", $keywords).'", "URL": "'.$url.'"}';
             } else {
                 // Handle the parsing error
                 echo "HTML parsing error: " . libxml_get_last_error()->message . "\n";
                 return ''; // Return an empty string or handle the error as needed
             }
         }

         function follow_links($url)
         {
             global $pdo;
             global $already_crawled;
             global $crawling;
             $options = array('http' => array('method' => "GET", 'headers' => "User-Agent: howBot/0.1\n"));
             $context = stream_context_create($options);
             $doc = new DOMDocument();
             @$doc->loadHTML(@file_get_contents($url, false, $context));
             $linklist = $doc->getElementsByTagName("a");
             foreach ($linklist as $link) {
                 $l = $link->getAttribute("href");
                 if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//") {
                     $l = parse_url($url)["scheme"] . "://" . parse_url($url)["host"] . $l;
                 } else if (substr($l, 0, 2) == "//") {
                     $l = parse_url($url)["scheme"] . ":" . $l;
                 } else if (substr($l, 0, 2) == "./") {
                     $l = parse_url($url)["scheme"] . "://" . parse_url($url)["host"] . dirname(parse_url($url)["path"]) . substr($l, 1);
                 } else if (substr($l, 0, 1) == "#") {
                     $l = parse_url($url)["scheme"] . "://" . parse_url($url)["host"] . parse_url($url)["path"] . $l;
                 } else if (substr($l, 0, 3) == "../") {
                     $l = parse_url($url)["scheme"] . "://" . parse_url($url)["host"] . "/" . $l;
                 } else if (substr($l, 0, 11) == "javascript:") {
                     continue;
                 } else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http") {
                     $l = parse_url($url)["scheme"] . "://" . parse_url($url)["host"] . "/" . $l;
                 }
                 if (!in_array($l, $already_crawled)) {
                     $already_crawled[] = $l;
                     $crawling[] = $l;
                     $details = json_decode(get_details($l));
                     echo $details->URL . " ";
                     $rows = $pdo->query("SELECT * FROM dex WHERE url_hash='" . md5($details->URL) . "'");
                     $rows = $rows->fetchColumn();
                     $params = array(':title' => $details->Title, ':description' => $details->Description, ':keywords' => $details->Keywords, ':url' => $details->URL, ':url_hash' => md5($details->URL));
                     if ($rows > 0) {
                         echo "UPDATE" . "\n";
                     } else {
                         if (!is_null($params[':title']) && !is_null($params[':description']) && $params[':title'] != '') {
                             $result = $pdo->prepare("INSERT INTO dex (title, description, keywords, url, url_hash) VALUES (:title, :description, :keywords, :url, :url_hash)");
                             $result= $result->execute($params);
                             //if ($result) {
                             //    echo "Inserted successfully.\n";
                             //} else {
                             //    echo "Insertion failed.\n";
                             //    print_r($stmt->errorInfo());
                             //}
                         }
                     }
                     //print_r($details)."\n";
                     //echo get_details($l)."\n";
                     //echo $l."\n";
                 }
             }
             array_shift($crawling);
             foreach ($crawling as $site) {
                 follow_links($site);
             }
         }

         follow_links($start);
         //print_r($already_crawled);
         ?>

     At first I tried different links that gave me an empty value, which resulted in errors and warnings. Then I changed the links and started writing the UPDATE/INSERT if statement, beginning with the PDO insert so I could test it. When I ran the file with the php command, the terminal output looked the way it was supposed to, but when I checked the table I found that nothing had been inserted. I want to insert these rows so I can use them in my search engine and make them searchable by query.
  2. Hello, I made a search engine and now I am trying to find an open source spider for it. I have a database (managed through phpMyAdmin) with around 200 URLs, descriptions, titles and keywords, and now I want to connect it to a spider to add more results to it.
  3. I was trying to write a class which would generate a sitemap for every post that is made or edited, but I don't seem to understand what my mistake is here.

         class sitemap
         {
             var $file_net;
             var $url_net;
             var $extention_net;
             var $freq_net;
             var $priority_net;

             function set()
             {
                 $file = $this->file_net;
                 $url = $this->url_net;
                 $extention = $this->extention_net;
                 $freq = $this->freq_net;
                 $priority = $this->priority_net;
             }

             function Path ($p)
             {
                 $a = explode ("/", $p);
                 $len = strlen ($a[count ($a) - 1]);
                 return (substr ($p, 0, strlen ($p) - $len));
             }

             function GetUrl($url)
             {
                 $ch = curl_init();
                 curl_setopt($ch, CURLOPT_URL, $url);
                 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                 $data = curl_exec($ch);
                 curl_close($ch);
                 return $data;
             }

             function Scan($url)
             {
                 global $scanned, $pf, $extension, $skip, $freq, $priority;
                 echo "scan url $url\n";
                 array_push ($scanned, $url);
                 $html = GetUrl ($url);
                 $a1 = explode ("<a", $html);
                 foreach ($a1 as $key => $val) {
                     $parts = explode (">", $val);
                     $a = $parts[0];
                     $aparts = explode ("href=", $a);
                     $hrefparts = explode (" ", $aparts[1]);
                     $hrefparts2 = explode ("#", $hrefparts[0]);
                     $href = str_replace ("\"", "", $hrefparts2[0]);
                     if ((substr ($href, 0, 7) != "http://") && (substr ($href, 0, 8) != "https://") && (substr ($href, 0, 6) != "ftp://")) {
                         if ($href[0] == '/')
                             $href = "$scanned[0]$href";
                         else
                             $href = Path ($url) . $href;
                     }
                     if (substr ($href, 0, strlen ($scanned[0])) == $scanned[0]) {
                         $ignore = false;
                         if (isset ($skip))
                             foreach ($skip as $k => $v)
                                 if (substr ($href, 0, strlen ($v)) == $v)
                                     $ignore = true;
                         if ((!$ignore) && (!in_array ($href, $scanned)) && (strpos ($href, $extension) > 0) ) {
                             fwrite ($pf, "<url>\n <loc>$href</loc>\n" . " <changefreq>$freq</changefreq>\n" . " <priority>$priority</priority>\n</url>\n");
                             echo $href. "\n";
                             Scan ($href);
                         }
                     }
                 }
             }

             $pf = fopen ($file, "w");
             if (!$pf) {
                 echo "cannot create $file\n";
                 return;
             }
             fwrite ($pf,"<?xml version=\"1.0\" encoding=\"UTF-8\"?>
         <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"
         xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
         xsi:schemaLocation=\"http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd\">
         <!-- created with SiteMap Generator -->
         <url>
         <loc>$url/</loc>
         <changefreq>daily</changefreq>
         </url>
         ");
             $scanned = array();
             Scan ($url);
             fwrite ($pf, "</urlset>\n");
             fclose ($pf);
         }
  4. I looked on the web but couldn't find it, or I wasn't typing the right search terms. I'm looking for information on how to write identifier code for a crawler bot, so that website owners can see my bot crawled them in their AWStats robots/spiders visitors list. Currently I'm coming up as "Unknown robot (identified by empty user agent string)". Does anyone know how to do this, or where I could find some info on the subject?
  5. I've used this code in my script, but for some reason it's giving me a bad result.

         else if ($site==3) {
             //$content=file_get_contents('data/tp2');
             $content=httpGet('http://thepiratebay.se/search/'.urlencode($string).'/0/99/0');
             preg_match_all('#<tr.*>(.+)</tr>#Us',$content,$m);
             $array=array();
             foreach ($m[1] as $k =>$v) {
                 if (strpos($v,'detLink')===false) {
                     continue;
                 }
                 preg_match_all('#<td.*>(.+)</td>#Us',$v,$m2);
                 $array[$k]=array();
                 $array[$k]['size']=0;
                 foreach ($m2[1] as $k2 =>$v2) {
                     switch ($k2) {
                         case 0:
                             $c=strip_tags($v2);
                             $array[$k]['category']=substr($c,0,strpos($c,'>')-1);
                             break;
                         case 2:
                             $d=trim(strip_tags($v2));
                             $array[$k]['date']=convertDate2($d);
                             break;
                         case 1:
                             preg_match('#<a href="([^"]+)".*>(.+)</a>#Us',$v2,$m3);
                             if (isset($m3[1])) {
                                 $array[$k]['id']=$m3[1];
                             } else {
                                 $array[$k]['id']='';
                             }
                             $name=trim(str_replace(' ',"\n",strip_tags($m3[2])));
                             $array[$k]['name']=$name;
                             break;
                         case 4:
                             @list($num,$base)=explode(' ',trim(strip_tags($v2)));
                             $array[$k]['size']=toB($num,$base);
                             break;
                         case 5:
                             $array[$k]['seeds']=trim(strip_tags($v2));
                             break;
                         case 6:
                             $array[$k]['peers']=trim(strip_tags($v2));
                             break;
                     }
                 }
                 if (empty($array[$k]['id']) || empty($array[$k]['category'])) {
                     unset($array[$k]);
                 }
             }

     This is how it's coming out on my project site: Date, Size, Seeders and Leechers are not showing anything. Any ideas why this happens? Also, the word I used in the search was "adobe". Thank you.

     get.php
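
On the question in result 4 (a bot showing up with an empty user agent string): PHP's HTTP stream wrapper sends no User-Agent header unless one is configured, which would explain the "Unknown robot" entry. Below is a minimal sketch of one way to set it, assuming the crawler fetches pages with file_get_contents(). The bot name, the info URL, and the helper name crawler_context() are made up for illustration. Note that the documented context option keys are 'user_agent' and 'header' (singular); a 'headers' key, as used in result 1, is not a recognized option and appears to be silently ignored.

```php
<?php
// Hypothetical bot identifier; replace the name and URL with your own.
const BOT_USER_AGENT = 'ExampleBot/0.1 (+https://example.com/bot.html)';

/**
 * Build a stream context whose HTTP requests carry a User-Agent header.
 * crawler_context() is an illustrative helper, not a PHP built-in.
 */
function crawler_context(string $userAgent = BOT_USER_AGENT)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as "User-Agent: ..."
            'timeout'    => 10,         // seconds; arbitrary choice
        ],
    ]);
}

// Inspect the options to confirm what will be sent.
$context = crawler_context();
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";

// Usage (performs a real request, so it is left commented out):
// $html = file_get_contents('https://example.com/', false, $context);
```

If the crawler uses the cURL extension instead, as the GetUrl() function in result 3 does, the equivalent would be curl_setopt($ch, CURLOPT_USERAGENT, $userAgent). Including a URL in the string is a common courtesy so site owners can find out what the bot is; it is a convention, not a requirement.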