Web crawler for Search engine

kotty5 · November 12, 2013

Hello,

I made a search engine and now I am trying to find an open source spider for it. I have a database (managed with phpMyAdmin) holding around 200 URLs, descriptions, titles and keywords, and now I want to connect it to a spider to add more results to it.

dalecosp · November 12, 2013

http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers

But, really, you made a search engine but no spider? I should think the spider is the main thing! (Maybe I should go read the "History of Google" again...)

Edited November 12, 2013 by dalecosp

QuickOldCar · November 18, 2013

I can probably help you out with this.

There is a lot more to all this than meets the eye. It seems like a simple thing, but there are many unforeseen obstacles.

There are a few open source ones out there you can use:

http://www.sphider.eu/

http://cuab.de/

https://code.google.com/p/phpspider/

and tons more if you look for them...

I used Sphider quite a few years ago and it seemed pretty good compared to a lot of the others I tried, but I wanted a lot more control and to do it all differently from how they do it.

Their system is that you add sites to a crawl list, and it keeps hitting those sites looking for any new data. Which is fine if you want a search for just the sites you select.

I wrote a few of my own crawlers/scrapers/spiders, or whatever someone wants to call them.

I have a few ways to add new sites and links: manual submission, pulling URLs from lists or a database, through my web crawler, or with my site or page scraper.

I started out about 5 years ago and did it more like what Google does, it even looked similar to theirs, but after doing it a while and seeing how long it takes to scrape entire sites... I changed it all around to pull in more data faster and more simply. But I can still scrape entire sites if I want to.

Basically I hit a URL, grab any information I want from it, and scrape all the links from its pages, which get stored in my links search. The site itself does not get indexed into specific categories or tags, but the information about that site does get stored. I use a full-text search to sort my results in the website index, and use sphinxsearch to handle my links results.
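To make the "grab information, scrape the links" step concrete, here is a minimal sketch. It assumes Python and its standard library rather than the PHP tools mentioned in this thread, and the names `LinkExtractor`, `extract` and the sample HTML are hypothetical, made up for illustration. It parses an HTML string that has already been fetched, so it does no network access itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the page title and absolute http(s) links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                # Keep only web links (skips mailto:, javascript:, ...) and skip duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract(base_url, html):
    """Return (title, links) for one already-downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.title, parser.links


# Hypothetical sample page, used only to show the behaviour.
SAMPLE = (
    '<html><head><title>Example</title></head><body>'
    '<a href="/about#team">About</a> '
    '<a href="http://other.example/x">X</a> '
    '<a href="mailto:a@b.c">mail</a> '
    '<a href="/about">duplicate</a>'
    '</body></html>'
)

title, links = extract("http://example.com/index.html", SAMPLE)
print(title)
print(links)
```

The extracted links would then go into whatever queue or table feeds the next round of fetching; how that storage looks is left open here.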

The toughest part of all of this is not getting the information, but actually displaying it in a timely manner. Once you get to a million+ results you will quickly see what I mean.

That's why you have to make sure you index the database so it can return results faster. It is also better to fetch and return exactly what you need.

You can check out my search engine and website index through the link in my signature. If you have any questions, just ask.
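As a small illustration of both points, indexing the column you search on and selecting only the columns you display, here is a sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in for whatever database sits behind the search engine, and the `pages` table, its columns (modelled on the URLs, titles, descriptions and keywords mentioned in the first post), the index name and the sample rows are all assumptions.

```python
import sqlite3

# In-memory database as a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    "id INTEGER PRIMARY KEY, url TEXT, title TEXT, description TEXT, keyword TEXT)"
)
conn.executemany(
    "INSERT INTO pages (url, title, description, keyword) VALUES (?, ?, ?, ?)",
    [
        ("http://example.com/a", "PHP intro", "Long description text", "php"),
        ("http://example.com/b", "PHP arrays", "Long description text", "php"),
        ("http://example.com/c", "Python intro", "Long description text", "python"),
    ],
)

# Index the column used in the WHERE clause so lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_pages_keyword ON pages (keyword)")


def search(conn, keyword, limit=10):
    """Fetch only the columns a results page shows, and only `limit` rows."""
    return conn.execute(
        "SELECT url, title FROM pages WHERE keyword = ? ORDER BY id LIMIT ?",
        (keyword, limit),
    ).fetchall()


def uses_index(conn, keyword):
    """Ask SQLite for its query plan and check that it mentions the index."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT url, title FROM pages WHERE keyword = ?",
        (keyword,),
    ).fetchall()
    # The human-readable plan detail is the last column of each row.
    return any("idx_pages_keyword" in row[-1] for row in plan)


results = search(conn, "php")
print(results)
print(uses_index(conn, "php"))
```

With three rows the difference is invisible, but the same two habits (an index on the searched column, plus `LIMIT` and a narrow column list) are what keep queries responsive as the table grows.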

I could probably write a novel about search engines and indexing.
