
Web crawler for Search engine

crawler search engine codes spider


#1 kotty5

  • New Members
  • Pip
  • Newbie
  • 1 posts

Posted 12 November 2013 - 06:21 PM


I made a search engine and now I am trying to find an open source spider for it. I have a MySQL database (managed through phpMyAdmin) with around 200 URLs, descriptions, titles and keywords, and now I want to connect it to a spider to add more results.
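Not speaking for any particular spider, but the shape of the hookup is simple: whatever crawler you pick just needs a function that writes one fetched page into your existing table. A minimal sketch, using SQLite as a stand-in for the MySQL database described above; the table and column names here are assumptions, so swap in your own schema:

```python
import sqlite3

# Hypothetical table mirroring the one described: one row per page with
# URL, title, description, and keywords (SQLite stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url         TEXT PRIMARY KEY,
        title       TEXT,
        description TEXT,
        keywords    TEXT
    )
""")

def add_page(url, title, description, keywords):
    """Insert or update one crawled page; a spider would call this
    once for every page it fetches."""
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
        (url, title, description, keywords),
    )
    conn.commit()

add_page("http://example.com/", "Example Domain",
         "An illustrative page.", "example, demo")
```

With that in place, "connecting" a spider is mostly pointing its per-page callback at `add_page`.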



#2 dalecosp

  • Members
  • PipPipPip
  • Advanced Member
  • 396 posts
  • LocationMissouri

Posted 12 November 2013 - 11:40 PM



But, really, you made a search engine but no spider?  I should think the spider is the main thing!  (Maybe I should go read the "History of Google" again...)

Edited by dalecosp, 12 November 2013 - 11:41 PM.

"God doesn't play dice" --- Albert Einstein
"Perl is hardly a paragon of beautiful syntax." --- Weedpacket

#3 QuickOldCar

  • Moderators
  • Advanced Member
  • 2,995 posts
  • LocationNorthEast Pennsylvania

Posted 18 November 2013 - 04:43 AM

I can probably help you out with this.


There is a lot more to this than meets the eye; it seems like a simple thing, but there are many unforeseen obstacles.


There are a few open source ones out there you can use





and tons more if you look for them...

I used sphider quite a few years ago and it seemed pretty good out of the many I tried, but I wanted a lot more control and to do it all differently from what they do.

Their system is that you add sites to a crawl list, and it will keep hitting those sites looking for any new data. Which is fine if you want a search for just the sites you select.
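That crawl-list model can be sketched in a few lines: keep a fixed list of sites, revisit them, and only re-index a page when its content has actually changed. This is just an illustration, not sphider's actual code; `fetch()` is a placeholder for a real HTTP download:

```python
import hashlib

# Crawl-list sketch: url -> hash of the content last seen there.
crawl_list = {"http://example.com/": None}

def fetch(url):
    # Placeholder; a real spider would do an HTTP GET here.
    return "<html><title>Example</title></html>"

def recrawl():
    """Revisit every site on the list; return the URLs whose
    content is new or has changed since the last pass."""
    changed = []
    for url, last_hash in crawl_list.items():
        body = fetch(url)
        digest = hashlib.sha256(body.encode()).hexdigest()
        if digest != last_hash:          # new or updated content
            crawl_list[url] = digest
            changed.append(url)
    return changed

first = recrawl()    # first pass: everything counts as new
second = recrawl()   # second pass: nothing changed, nothing re-indexed
```

The hash comparison is what keeps repeated visits cheap: unchanged pages are skipped instead of re-indexed.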


I wrote a few of my own crawlers/scrapers/spiders, or whatever someone wants to call them.

I have a few ways to add new sites and links: manual submission, pulling URLs from lists or a database, through my web crawler, or with my site or page scraper.


I started out about 5 years ago doing it more like what Google does, and it even looked similar to theirs, but after doing it a while and seeing how long it takes to scrape entire sites... I changed it all around to pull in more data faster and more simply. But I can still scrape entire sites if I want to.


Basically I hit a URL, grab any information I want from it, and scrape all the links from its pages, and those get stored into my links search. The site itself does not get indexed into specific categories or tags, but the information from that site does get stored. I use a full-text search to sort results in the website index, and use sphinxsearch to handle my links results.
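The hit-a-page, grab-what-you-want, scrape-the-links step above can be sketched with nothing but the standard library. This is a generic illustration of that pipeline (not my actual code); the HTML is inlined so the example needs no network, and it pulls just the `<title>` plus every `href`:

```python
from html.parser import HTMLParser

# One-page scrape: collect the <title> text and all link targets,
# which a crawler would then queue and store in the links index.
class PageScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html = """<html><head><title>Demo Page</title></head>
<body><a href="http://a.example/">A</a>
<a href="http://b.example/">B</a></body></html>"""

scraper = PageScraper()
scraper.feed(html)
# scraper.title now holds the page title; scraper.links holds the
# scraped hrefs, ready to be queued for the next crawl pass.
```

In a real spider the `html` string would come from an HTTP fetch, and each discovered link would be fed back into the crawl queue.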


The toughest part of all of this is not getting the information but actually displaying it in a timely manner; once you get to a million+ results you will quickly see what I mean.

That's why you have to make sure you add indexes to the database, so it can return results faster. It is also better to fetch and return exactly what you need.
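Both points above can be shown in one small demo, again using SQLite in place of MySQL and made-up table and index names: index the column you search on, then select only the columns you need with a hard result cap. The query plan confirms the index is used instead of a full table scan:

```python
import sqlite3

# Index the searched column; fetch only needed columns, capped count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_pages_title ON pages (title)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(f"http://example.com/{i}", f"page {i}", "...") for i in range(1000)],
)

# Fetch exactly what is needed: two columns, at most 10 rows.
rows = conn.execute(
    "SELECT url, title FROM pages WHERE title = ? LIMIT 10",
    ("page 42",),
).fetchall()

# Ask the planner how it resolves the WHERE clause.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url FROM pages WHERE title = ?",
    ("page 42",),
).fetchall()
# The plan reports a search using idx_pages_title, not a full scan.
```

The same idea carries straight over to MySQL: `CREATE INDEX` on the searched columns, and keep `SELECT *` out of your result pages.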


You can check my search engine and website index with the link in my signature; if you have any questions, just ask.

I could probably write a novel about search engines and indexing.
