http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers

 

But, really, you made a search engine but no spider?  I should think the spider is the main thing!  (Maybe I should go read the "History of Google" again...)

Edited by dalecosp

I can probably help you out with this.

 

There is a lot more to this than meets the eye. It seems like a simple thing, but there are many unforeseen obstacles.

 

There are a few open-source ones out there you can use:

http://www.sphider.eu/

http://cuab.de/

https://code.google.com/p/phpspider/

 

and tons more if you look for them...

I used Sphider quite a few years ago, and it seemed pretty good compared with a lot of the others I tried, but I wanted a lot more control and wanted to do it all differently from how they do it.

Their system is that you add sites to a crawl list, and it keeps hitting those sites to check for new data. That is fine if you only want a search for the sites you select.

 

I wrote a few of my own crawlers/scrapers/spiders, or whatever someone wants to call them.

I have a few ways to add new sites and links: manual submission, pulling URLs from lists or a database, through my web crawler, or with my site or page scraper.
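With several sources feeding in URLs, the same address tends to arrive more than once. Here is a minimal Python sketch of one way to handle that, assuming a simple in-memory queue. The `Frontier` class, the `normalize` helper, and the source labels are all made up for illustration and are not how my own system is written.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, default an empty path to '/', drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """Hypothetical queue of URLs waiting to be crawled, with duplicate filtering."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add(self, url, source):
        """Queue the URL unless an equivalent one was already added; return True if queued."""
        key = normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append((key, source))
        return True

    def next(self):
        """Return the oldest queued (url, source) pair, or None when empty."""
        return self.queue.popleft() if self.queue else None


frontier = Frontier()
frontier.add("http://Example.com", "manual")                # manual submission
frontier.add("http://example.com/", "list")                 # same site from a list, skipped
frontier.add("http://example.com/page#section", "crawler")  # found by the crawler
```

The normalization here is deliberately basic. A real crawler would probably also want to deal with things like default ports and tracking parameters.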

 

I started out about 5 years ago and did something more like what Google does; it even looked similar to theirs. But after doing it for a while and seeing how long it takes to scrape entire sites, I changed it all around to pull in more data faster and more simply. I can still scrape entire sites if I want to.

 

Basically, I hit a URL, grab any information I want from it, and scrape all the links from its pages, which get stored in my links search. The site itself does not get indexed into specific categories or tags, but the information about that site does get stored. I use a full-text search to sort the results in my website index, and sphinxsearch to handle my links results.
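The link-scraping step can be sketched roughly like this in Python, using only the standard library. This is an illustration and not my actual code: `LinkExtractor`, `extract_links`, and the sample HTML are hypothetical, and the page is passed in as a string so the example runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Keep only web links (skips mailto:, javascript:, etc.) and skip repeats.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(page_url, html):
    """Return the unique http(s) links found in the given HTML, in page order."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links


sample_html = """
<html><body>
<a href="/about">About</a>
<a href="http://other.example/page#top">Other</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About again</a>
</body></html>
"""
found_links = extract_links("http://example.com/index.html", sample_html)
```

In a real crawler the HTML would come from an HTTP fetch, and the returned links would be written to whatever store backs the links search.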

 

The toughest part of all this is not getting the information but actually displaying it in a timely manner. Once you get to a million+ results you will quickly see what I mean.

That's why you have to make sure you index the database so it can return results faster. It is also better to fetch and return exactly what you need.
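As a rough illustration of both points, here is a small Python sketch that uses SQLite as a stand-in database. The `links` table, its columns, the index name, and the `links_for_domain` helper are all assumptions made up for the example, so treat it as a sketch of the idea and not as a schema to copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links ("
    "id INTEGER PRIMARY KEY, domain TEXT, url TEXT, title TEXT, body TEXT)"
)
# Index the column the lookups filter on, so the database does not scan every row.
conn.execute("CREATE INDEX idx_links_domain ON links (domain)")

rows = [
    ("example.com", "http://example.com/", "Home", "long page text ..."),
    ("example.com", "http://example.com/about", "About", "long page text ..."),
    ("other.example", "http://other.example/", "Other", "long page text ..."),
]
conn.executemany(
    "INSERT INTO links (domain, url, title, body) VALUES (?, ?, ?, ?)", rows
)


def links_for_domain(conn, domain, limit=10):
    """Fetch only the columns a results page shows, and only one page of rows."""
    cur = conn.execute(
        "SELECT url, title FROM links WHERE domain = ? ORDER BY id LIMIT ?",
        (domain, limit),
    )
    return cur.fetchall()


results = links_for_domain(conn, "example.com")

# Ask SQLite how it would run the lookup; the plan text should mention the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT url, title FROM links WHERE domain = ?",
    ("example.com",),
).fetchall()
```

The query selects `url` and `title` and leaves the large `body` column alone, and the `LIMIT` keeps each request to one page of results. Other databases have their own way of showing the query plan, but the same check applies.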

 

You can check out my search engine and website index with the link in my signature. If you have any questions, just ask.

I could probably write a novel about search engines and indexing.
