
http://lucene.apache.org/solr/

 

Looks very interesting, but it seems like they just take your SQL database data and put it into XML files for this server to index and search. A lot of people are talking about using it, though.
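From what I can tell, you end up posting something like this at its update handler. This is just a rough, untested sketch on my part; the URL, port, and field names are guesses based on a default install, not anything from the Solr docs:

<?php
// Hypothetical example: one row from your SQL database, rewritten as a Solr <add> message.
$doc = '<add>
  <doc>
    <field name="id">42</field>
    <field name="title">Example article title</field>
    <field name="body">Body text pulled from the SQL row.</field>
  </doc>
</add>';

// POST it to the update handler (default host/port assumed). A separate <commit/>
// message still has to be sent before the document shows up in searches.
$ch = curl_init('http://localhost:8983/solr/update');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);
?>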

 

What do you guys think?

 

Anyone using it?

...but it seems like they just take your SQL database data and put it into XML files... What do you guys think?

 

I haven't read much about it, or used it, but I wanted to mention that "just" putting your data into XML can be a huge advantage when you bring XSLT into the picture. Its speed is impressive.

It depends on what you're looking for. Databases are two-dimensional and may require many joins just to get a fragment of information. In an XML document, once it's loaded into memory, everything is laid out in paths, which makes retrieval quick and powerful. A few months back I was working with a (I think) 15MB XML file. It took around 0.2 seconds to pass it through an XSLT stylesheet and get an entirely new structure back. I can verify these numbers on Monday.
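For what it's worth, wiring up that kind of transform takes only a few lines with PHP 5's XSL extension. The file names here are placeholders, and this isn't the exact setup I used for those numbers:

<?php
// Load the source document and the stylesheet (placeholder file names).
$xml = new DOMDocument();
$xml->load('data.xml');

$xsl = new DOMDocument();
$xsl->load('transform.xsl');

// Compile the stylesheet once, then run the transformation and print the result.
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($xml);
?>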

The thing I don't understand with this approach is the duplication of your data. You can choose what to index, but what if you want the system to search your whole DB? That's essentially two DBs you're running. This software must be something, though: Digg is going to use it for their search, and the Wayback Machine uses it too.

I can verify these numbers on Monday.

 

The test numbers I remembered were older and more impressive than what I got this morning. Here are the numbers from the project's current state:

 

XSLT Processor: Xalan C++ version 1.1.0

Input file: 8.5MB

Stylesheet: 7KB

Result file: 4MB

 

Stylesheet parse time: 20 milliseconds

XML parse time: 6,230 milliseconds

Transformation time: 9,310 milliseconds

 

If the XML file were already loaded into memory, you'd be down to about a 9-second transformation time. That doesn't seem fast, but it's gutting an 8.5MB file, running calculations, walking the relationship tree, calling recursive templates, and then outputting a very different 4MB file.

 

Yes, that still may not sound like a selling point, but the PHP and/or Perl I'd have to write to parse this file would be more complex to write and maintain, and would take longer to run. Also, this isn't the best example since the application you posted seems to be indexing and searching, not creating the entire book like I am.

 

The thing I don't understand with this approach is the duplication of your data. You can choose what to index, but what if you want the system to search your whole DB? That's essentially two DBs you're running.

 

I think this comes down to balance. Databases are very important and powerful, but why put more stress on them if you don't need to? It depends on how static something is: if you have a page that doesn't need to be 100% real-time, why not set up a cron job to build the "dynamic" page and hit the database only that one time (see the sketch below)? If your data needs to be 100% real-time, then, yes, I don't see how doubling your data can be beneficial.
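Something like this is all I mean; the schedule, script name, table, and connection details are made up for the example:

<?php
// build_page.php (hypothetical): run from cron, e.g.
//   */15 * * * * php /path/to/build_page.php > /var/www/html/latest.html
// so the database is hit once per run instead of once per visitor.
$db = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

echo "<ul>\n";
foreach ($db->query('SELECT title FROM articles ORDER BY id DESC LIMIT 10') as $row) {
    echo '<li>' . htmlspecialchars($row['title']) . "</li>\n";
}
echo "</ul>\n";
?>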
