Difficulties of Building a Search Engine

Glese · December 21, 2011

Since there are not many high-quality alternatives around, what are the critical difficulties of building a search engine?

Imagine it as low-traffic and lightweight rather than Google-sized, because that is how you start out.

I do know by now that a search engine is also possible with PHP and nginx. Speed is one factor, and I do not know how much the amount of traffic would influence it. Would it have a drastic effect?

Other than that, what other factors matter most?

Once one has a text-processing system, a sorting script, and a bot to gather and analyze pages, one can start with a dedicated server and aim for a cluster later. And what about the database? Is it considered tedious, cluttered, or problematic, or can it be learned?

Overall, I do think there are alternatives around; they are simply not on Google's level, in my opinion. In this sense, it is probably not that difficult to build and run a search engine, but getting it to a high level of quality may be. In a way, one would have to be an expert with words to sort results by quality, and perhaps this is the main critical part, next to others.

scootstah · December 21, 2011

You could build a simple search engine pretty easily. But doing something on the scale of Google is a massive undertaking.

Google has very complex algorithms for its search engine, which it doesn't share with anyone.

Philip · December 21, 2011

"But doing something on the scale of Google is a massive undertaking."

Yup. Plus, it's going to take you quite a while to index a good chunk of the interwebz.

QuickOldCar · December 21, 2011

I agree with the 2 posters above.

Most of the so-called search engines piggyback off others and are actually metasearch engines, meaning they are just a tool for searching actual search engines' data.

It's certainly not an easy task to create your own search engine and crawler.

Many search engines existed, the good ones were bought, the others vanished.

Here are a few issues you will have, and there are lots of unforeseen hurdles.

charsets and languages

relative URLs

fixing URL case and making sure URLs are properly encoded

parsing URLs in a few different ways will be required

duplicate data (checking whether the data already exists every time is costly, so it may be best to remove duplicates periodically through a script)

properly sanitizing the data

crashing databases or corrupt data (one hiccup can ruin your day or week)

special pattern matching for every type of image if you want an image search; to get an image's data you must download it first

if you want to handle any JavaScript, you would have to write many kinds of pattern matching for it

you need adequate bandwidth, hard drive space, CPU and memory
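Two of the items in the list above, URL cleanup and batch duplicate removal, can be sketched in a few lines of Python. This is only a rough illustration using the standard library; the function names (normalize_url, content_fingerprint, remove_duplicates) are made up here, and the rules are a simplification. For example, an encoded slash (%2F) in a path gets decoded, which a real crawler may not want.

```python
import hashlib
from urllib.parse import quote, unquote, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(base_url, href):
    """Resolve href against base_url and return one canonical spelling."""
    absolute = urljoin(base_url, href)        # turns relative URLs into absolute ones
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()             # scheme and host are case-insensitive
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Decode, then re-encode, so "%7Euser" and "~user" end up identical.
    path = quote(unquote(parts.path), safe="/~") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))

def content_fingerprint(text):
    """Hash of lowercased, whitespace-collapsed text."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()

def remove_duplicates(pages):
    """Batch pass over (url, text) pairs: keep the first page per fingerprint."""
    seen = {}
    for url, text in pages:
        seen.setdefault(content_fingerprint(text), (url, text))
    return list(seen.values())
```

With this sketch, normalize_url("http://Example.COM/a/b.html", "../c d.html#top") gives "http://example.com/c%20d.html". The duplicate pass only catches pages whose text is identical after collapsing case and whitespace; near-duplicates would need something smarter.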

Crawling many of the links to get the information takes time and resources. You could spread the load across a few machines.
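One simple way to spread crawling across machines is to assign each host to a fixed worker by hashing the host name, so the same site always lands on the same machine. The sketch below assumes a fixed number of workers and uses made-up names (worker_for, partition); it is not taken from any particular crawler.

```python
import hashlib
from urllib.parse import urlsplit

def worker_for(url, num_workers):
    """Map a URL's host to a worker index in range(num_workers)."""
    host = (urlsplit(url).hostname or "").lower()
    # Use a stable hash: Python's built-in hash() of a str changes between runs.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

def partition(urls, num_workers):
    """Split a list of URLs into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return buckets
```

Keeping a whole host on one worker also makes it easier to limit how often that host gets hit. The downside is that changing the number of workers reshuffles most hosts.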

Displaying the content quickly is even harder when you get many millions of results.
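One thing that helps with huge result sets is that a user only ever sees one page at a time, so there is no need to sort every match. A minimal sketch in Python, assuming each match already has a numeric score (the results_page name and the (score, doc_id) layout are just for illustration):

```python
import heapq

def results_page(scored_docs, page, per_page=10):
    """Return the doc ids for one page of results, best score first.

    scored_docs: iterable of (score, doc_id) pairs; page is 1-based.
    """
    needed = page * per_page
    # nlargest keeps only the best `needed` items instead of sorting everything.
    best = heapq.nlargest(needed, scored_docs)
    return [doc_id for _score, doc_id in best[(page - 1) * per_page:]]
```

This still scans every match once, so it only helps with the sorting cost. Deep pages get more expensive, which may be one reason to cap how many pages are offered.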

I've been working on my search engine/index for about 5 years now. Two years of it was just in thought and testing for better methods.

I always hear about the "Google algorithm". What could they possibly do that is special?

Their results go by paying customers and by who is ranked higher at Alexa.

I'm pretty sure they scrape anything and everything they find.

My guess would be that they try to display data with the most popular keywords first (most likely by saving users' search inputs).

You can easily make filters to weed out unwanted or bogus sites and content from the crawl.
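A crawl filter of that kind can be very small. Here is a rough Python sketch; the blocked domains and file extensions are placeholder values, and should_crawl is a made-up name.

```python
from urllib.parse import urlsplit

BLOCKED_DOMAINS = {"spam.example", "ads.example"}   # placeholder values
BLOCKED_EXTENSIONS = (".zip", ".exe", ".iso")        # placeholder values

def should_crawl(url):
    """Return True if the URL passes the scheme, domain and extension filters."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Block the listed domain itself and any subdomain of it.
    for domain in BLOCKED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return False
    return not parts.path.lower().endswith(BLOCKED_EXTENSIONS)
```

Matching on "." + domain rather than a plain suffix avoids blocking unrelated hosts such as notspam.example.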

If you look at their "About 697,000,000 results (0.28 seconds)", that's surely some fictitious round number.

I assume they display only up to some fixed amount of content, usually something like 87 pages at most?

They do go by date, and if nothing exists for that date, results are shown from whenever they do have data.

Obviously they store data by date on each server, or across multiple servers, and that is how they can pull results faster, plus caching.
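Caching is something a small engine can do as well. The sketch below is a guess at the simplest useful form, not a description of what Google does: results for a normalized query are kept in memory for a while and reused. QueryCache and its parameters are made up for illustration.

```python
import time

class QueryCache:
    """Keep search results in memory for ttl_seconds per normalized query."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self.entries = {}           # normalized query -> (stored_at, results)

    def get_or_compute(self, query, compute):
        # Normalize so "PHP  Search" and "php search" share one entry.
        key = " ".join(query.lower().split())
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        results = compute(key)      # the slow path, e.g. a database query
        self.entries[key] = (now, results)
        return results
```

This version never evicts old entries, so a real one would need a size limit.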

I wrote a post a while ago on the same subject.

You may want to look into doing the crawler with Python, using Cassandra for the database and Sphinx to display the results with an advanced search.

I use PHP/MySQL with fulltext for mine, for now anyway.
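For anyone wondering what a full-text index does in general, the core idea is an inverted index: a map from each word to the documents that contain it. The toy Python version below shows the idea only; it is not how MySQL implements fulltext, and the function names are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the doc_ids that contain every word of the query (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Lookups touch only the words in the query rather than every document, which is why this scales much better than scanning text with LIKE.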

I guess my advice would be to start by scraping the front pages of sites to get a wider variety of data and their best content, and later on start scraping their individual pages.

Try not to keep hitting sites continuously, or they may ban you.
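A per-host delay is the usual way to avoid hammering a site. Here is a minimal Python sketch; the 5-second default is an arbitrary assumption, and PoliteScheduler is a made-up name. The clock and sleep functions are parameters only so the class can be tested without really waiting.

```python
import time
from urllib.parse import urlsplit

class PoliteScheduler:
    """Enforce a minimum delay between two requests to the same host."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_hit = {}          # host -> time of the last request

    def wait_for(self, url):
        """Block until the URL's host may be hit again; return seconds waited."""
        host = (urlsplit(url).hostname or "").lower()
        waited = 0.0
        last = self.last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_hit[host] = self.clock()
        return waited
```

Call wait_for(url) right before each fetch. Different hosts do not delay each other, so the crawler stays busy as long as its queue mixes sites.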

scootstah · December 22, 2011

"I always hear about the "Google algorithm". What could they possibly do that is special?"

It mostly refers to how they decide page rank. There are a lot of factors involved. They don't just sort by "paying customers" or keyword usage; there really is a lot going on.
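To give a feel for the link-based part, here is a toy Python version of the PageRank idea as it was publicly described: a page scores higher when pages that themselves score well link to it. This is a textbook simplification covering one factor only, not Google's actual ranking. The damping value of 0.85 is the commonly cited one, and the function name and input layout are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For links = {"a": ["c"], "b": ["c"], "c": ["a"]}, page c ends up highest because two pages link to it, a comes second because the top page links to it, and b comes last because nothing links to it.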
