Glese Posted December 21, 2011

Since there are not many quality alternatives around: what are the critical difficulties of building a search engine? Imagine it low-traffic and lightweight rather than Google-sized, because that is how you start out. I know by now that a search engine is also possible with PHP and nginx. Since speed is one factor, I do not know how much the amount of traffic would influence it. Would it have a drastic effect? Beyond that, what other factors are drastic? Once one has a text-analysis system, a sorting script, and a bot to gather and analyze pages, one can start with a dedicated server and aim for a cluster later. And what about the database? Is it considered tedious, cluttered, and problematic, or can it be learned? Overall, I actually do think there are alternatives around; they are simply not on Google's level in my opinion. In that sense it is probably not that difficult to build and run a search engine, but getting it to a high level of quality may be. One would have to be an expert at analyzing text to sort by quality, and perhaps that is the main critical part, next to the others.
scootstah Posted December 21, 2011

You could build a simple search engine pretty easily. But doing something on the scale of Google is a massive undertaking. Google has very complex algorithms for its search engine that it doesn't share with anyone.
Philip Posted December 21, 2011

"But doing something on the scale of Google is a massive undertaking."

Yup. Plus, it's going to take you quite a while to index a good chunk of the interwebz.
QuickOldCar Posted December 21, 2011

I agree with the two posters above.

Most of the so-called search engines piggyback off others and are actually metasearch engines, meaning they are just a tool for searching actual search engines' data. It's certainly not an easy task to create your own search engine and crawler. Many search engines have existed; the good ones were bought, the others vanished.

Here are a few issues you will have, and there are lots of unforeseen hurdles:

- charsets and languages
- relative URLs, and fixing URL case and encoding; parsing URLs in a few different ways will be required (see the normalization sketch after this post)
- duplicate data (checking whether the data already exists on every insert is costly; it may be better to remove duplicates periodically with a script — see the hashing sketch below)
- properly sanitizing the data
- crashing databases or corrupt data (one hiccup can ruin your day or week)
- special pattern matching, if wanted
- images: if you are doing an image search, you must download each image first to get its data
- JavaScript: if you want to handle it, you would have to write many types of pattern matching for it
- adequate bandwidth, hard drive space, CPU, and memory

Crawling all the links to get the information takes time and resources. You could spread the load across a few machines. Displaying the content quickly is even harder once you have many millions of results.

I've been working on my search engine/index for about 5 years now. Two years of that was just thought and testing for better methods.

I always hear about the "Google algorithm"; what could they possibly do that's special? Their results go by paying customers and who's ranked higher on Alexa. I'm pretty sure they scrape anything and everything they find. My guess would be that they try to display data matching the most popular keywords first (most likely saving users' search inputs). You can easily make filters for weeding out unwanted sites, unwanted content, or bogus sites.

If you look at their "About 697,000,000 results (0.28 seconds)", that's surely some fictitious round number. I assume they display up to X amount of content, usually something like 87 pages max. They do go by date, and if nothing exists for a given date, data is displayed from the dates they do have. Obviously they store data by date across one or more servers, and that is how they can pull faster results, plus caching.

I wrote a post a while ago on the same subject. You may want to look into doing the crawler with Python, using Cassandra for the database and Sphinx to display the results with an advanced search. I use PHP/MySQL and FULLTEXT for mine, for now anyway (a minimal example follows below).

I guess my advice would be to start by scraping the front pages of sites, to get more variety of data and their best content; later on, start scraping their individual pages. Try not to keep hitting sites continuously, or they may ban you.
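Regarding the relative-URL and URL-fixing items above, here is a minimal sketch of the parsing involved, using only PHP's built-in parse_url(). normalizeUrl() and resolveRelative() are made-up helper names, not library functions, and a real crawler needs full RFC 3986 resolution — this version handles only the common cases:

```php
<?php
// Sketch: canonicalize crawled URLs so trivially different spellings
// (uppercase host, missing path) collapse to one form before storage.
function normalizeUrl($url)
{
    $p = parse_url(trim($url));
    if ($p === false || empty($p['host'])) {
        return null;                             // not an absolute URL we can index
    }
    $scheme = strtolower(isset($p['scheme']) ? $p['scheme'] : 'http');
    $host   = strtolower($p['host']);            // host names are case-insensitive
    $path   = isset($p['path']) ? $p['path'] : '/';
    $query  = isset($p['query']) ? '?' . $p['query'] : '';
    return $scheme . '://' . $host . $path . $query;
}

// Resolve an href found in a page against the page's own URL.
// Limitations: does not collapse ../ segments or handle protocol-relative //.
function resolveRelative($base, $href)
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return normalizeUrl($href);              // already absolute
    }
    $b = parse_url($base);
    if ($href !== '' && $href[0] === '/') {      // root-relative link
        return normalizeUrl($b['scheme'] . '://' . $b['host'] . $href);
    }
    $dir = rtrim(dirname(isset($b['path']) ? $b['path'] : '/'), '/');
    return normalizeUrl($b['scheme'] . '://' . $b['host'] . $dir . '/' . $href);
}

echo resolveRelative('http://Example.COM/dir/page.html', 'other.html');
// http://example.com/dir/other.html
```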
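For the duplicate-data item, one cheap approach (my suggestion, not something the poster describes) is to store a hash of the normalized page text under a UNIQUE index and let the database reject repeats, instead of running a "does this exist?" query on every insert. The `pages` table here is an assumed schema:

```php
<?php
// Sketch: duplicate detection via a content hash, assuming a table like
//   CREATE TABLE pages (
//     id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//     url          VARCHAR(2048) NOT NULL,
//     content_hash CHAR(40)      NOT NULL UNIQUE,
//     body         MEDIUMTEXT    NOT NULL
//   );
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'user', 'pass');

function storePage(PDO $pdo, $url, $body)
{
    // Hash a whitespace-normalized, tag-stripped copy so trivial markup
    // or formatting changes don't defeat the duplicate check.
    $hash = sha1(preg_replace('/\s+/', ' ', strip_tags($body)));

    // INSERT IGNORE silently skips rows that collide on the UNIQUE hash.
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO pages (url, content_hash, body) VALUES (?, ?, ?)'
    );
    $stmt->execute(array($url, $hash, $body));
    return $stmt->rowCount() > 0;                // false = content already stored
}
```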
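And since the post mentions PHP/MySQL with FULLTEXT, this is roughly what the query side looks like against that same hypothetical `pages` table. MATCH ... AGAINST is standard MySQL syntax, though in 2011-era MySQL a FULLTEXT index required MyISAM (InnoDB support arrived in 5.6):

```php
<?php
// Sketch: querying a MySQL FULLTEXT index from PHP with PDO.
// Assumes the `pages` table above plus:
//   ALTER TABLE pages ADD FULLTEXT ft_body (body);
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'user', 'pass');

$q = isset($_GET['q']) ? $_GET['q'] : '';
$stmt = $pdo->prepare(
    'SELECT url, MATCH(body) AGAINST(?) AS score
       FROM pages
      WHERE MATCH(body) AGAINST(?)
      ORDER BY score DESC
      LIMIT 20'
);
$stmt->execute(array($q, $q));    // same term twice: once to filter, once to rank

foreach ($stmt as $row) {
    echo $row['url'], ' (score ', round($row['score'], 2), ")\n";
}
```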
scootstah Posted December 22, 2011

"I always hear about the 'Google algorithm'; what could they possibly do that's special?"

It mostly refers to how they decide page rank. There are a lot of factors involved. They don't just sort by "paying customers" or keyword usage... there really is a lot going on.
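The one ranking factor Google has actually published is the original PageRank idea: a link from one page to another counts as a vote, weighted by the rank of the page casting it. Here is a toy power-iteration sketch in PHP, using an invented four-page link graph, just to show the mechanics; the real system layers many more signals on top:

```php
<?php
// Toy PageRank by power iteration. One common formulation:
//   PR(p) = (1 - d)/N + d * sum over q linking to p of PR(q)/outdegree(q)
// The link graph below is made up purely for illustration.
$links = array(
    'a' => array('b', 'c'),
    'b' => array('c'),
    'c' => array('a'),
    'd' => array('c'),
);
$d     = 0.85;                                  // damping factor
$pages = array_keys($links);
$n     = count($pages);
$pr    = array_fill_keys($pages, 1 / $n);       // start everyone equal

for ($i = 0; $i < 50; $i++) {
    $next = array_fill_keys($pages, (1 - $d) / $n);
    foreach ($links as $page => $out) {
        $share = $pr[$page] / count($out);      // rank is split across outlinks
        foreach ($out as $target) {
            $next[$target] += $d * $share;      // and passed on, damped
        }
    }
    $pr = $next;
}

arsort($pr);
print_r($pr);   // 'c' ranks highest: it collects the most (and best) votes
```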