
Multiple Attribute Comparison with large data sets


abatardi


Hey guys,

 

I am trying to solve a particular problem and I'm not sure of the best way to go about it. I'm not looking for specific code, mostly just theory and some guidance on the approach.

 

I'm basically trying to compare several products against one another and determine similarity based on their attributes. This is simple enough to do by comparing the object attributes (putting them into an array or whatever and doing a comparison). For example:

 

item1: {color: red, size: large, material: plastic}

item2: {color: black, size: large, material: plastic}

 

Comparing these two items would produce a similarity percentage that the app can use to find similar items. In this case, two of the three attributes match, so they would be 66.7% similar. Easy enough.
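To pin down what that comparison means, here is a minimal sketch in Python. It assumes each item is a flat dict of attribute to value and that similarity is simply the fraction of attributes whose values are equal; the function name `similarity` is made up for illustration.

```python
def similarity(a: dict, b: dict) -> float:
    """Fraction of attributes (over the union of both items' keys) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    # An attribute counts as a match only if both items have it with the same value.
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


item1 = {"color": "red", "size": "large", "material": "plastic"}
item2 = {"color": "black", "size": "large", "material": "plastic"}

print(f"{similarity(item1, item2):.1%}")  # 66.7%
```

Using the union of keys means an attribute present on only one item counts as a mismatch, which is one possible choice and not the only reasonable one.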

 

The more complicated part comes when I have, say, 100,000 or more products to compare against each other, with more being added all the time. What is the best way to handle this? Is it better to compare one item against the other 99,999 in real time whenever I need similar items, or to batch this out and build a massive table holding similarity data in a 100k x 100k matrix (and growing) and store it somewhere? Neither seems feasible: the former because of processing cost and speed (I need the similarity data pretty much in real time), and the latter because of memory constraints as the grid keeps growing.
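To put rough numbers on the precomputed-matrix option, here is a back-of-the-envelope sketch. The bytes-per-entry figures are assumptions (1 byte for a rounded percentage, 4 bytes for a float), and it counts only the unique pairs, since similarity of A to B equals similarity of B to A.

```python
def unique_pairs(n: int) -> int:
    """Number of distinct unordered product pairs among n products."""
    return n * (n - 1) // 2


def matrix_storage_gb(n: int, bytes_per_entry: int) -> float:
    """Approximate storage, in GB, for one similarity value per unique pair."""
    return unique_pairs(n) * bytes_per_entry / 1e9


n = 100_000
print(f"unique pairs: {unique_pairs(n):,}")
for label, size in [("1-byte percentage", 1), ("4-byte float", 4)]:
    print(f"{label}: {matrix_storage_gb(n, size):.1f} GB")
```

Because the pair count grows with the square of the catalog size, doubling the number of products roughly quadruples the table.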

 

So I basically need some way to determine that Product A is most similar to these 20 products out of the 100,000... which means of course that I can't just stop when I hit 20 products above a certain threshold and call it good enough.  I need to actually compare against *all* of them. 
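As a baseline, the "compare against all of them" version might look like the sketch below. It assumes the catalog fits in memory as a dict of product id to attribute dict; `most_similar` and `similarity` are illustrative names. It scans every product but keeps only the best `k` scores, using `heapq.nlargest` from the standard library.

```python
import heapq


def similarity(a: dict, b: dict) -> float:
    """Fraction of attributes (over the union of keys) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if k in a and k in b and a[k] == b[k]) / len(keys)


def most_similar(target_id, catalog: dict, k: int = 20) -> list:
    """Score the target against every other product and return the top k as (score, id)."""
    target = catalog[target_id]
    scored = (
        (similarity(target, attrs), pid)
        for pid, attrs in catalog.items()
        if pid != target_id
    )
    # nlargest visits every candidate once and holds at most k results in memory.
    return heapq.nlargest(k, scored)


catalog = {
    "A": {"color": "red", "size": "large", "material": "plastic"},
    "B": {"color": "black", "size": "large", "material": "plastic"},
    "C": {"color": "black", "size": "small", "material": "metal"},
    "D": {"color": "red", "size": "large", "material": "plastic"},
}
print(most_similar("A", catalog, k=2))
```

This is one full pass over the catalog per lookup, so it shows the cost I'm worried about rather than a way around it.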

 

Hopefully this makes sense. I'm having trouble with this one and would appreciate another viewpoint. I'd be interested to hear how you would do it: what methods you would use, whether you would use a MySQL database in any way, or maybe a Lucene search box with a large in-memory index dedicated to this, etc. Any (free or cheap) option is on the table at this point. I'm just not sure what the best way to go about it is, and I don't want to go too far down a road and hit a dead end if I can avoid it.
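For the "dedicated index" idea, one possible shape is an inverted index from each (attribute, value) pair to the set of product ids that have it, so that only products sharing at least one value with the target ever get scored. The sketch below is a plain-Python toy, not Lucene or MySQL code, and it assumes every product has the same set of attributes with hashable values; `build_index` and `top_similar` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_index(catalog: dict) -> dict:
    """Map each (attribute, value) pair to the set of product ids that have it."""
    index = defaultdict(set)
    for pid, attrs in catalog.items():
        for pair in attrs.items():
            index[pair].add(pid)
    return index


def top_similar(target_id, catalog: dict, index: dict, k: int = 20) -> list:
    """Count shared attribute values per candidate and return the top k as (id, score)."""
    target = catalog[target_id]
    shared = Counter()
    for pair in target.items():
        for pid in index.get(pair, ()):
            if pid != target_id:
                shared[pid] += 1
    # Products sharing no attribute value never appear, so they are never scored.
    return [(pid, count / len(target)) for pid, count in shared.most_common(k)]


catalog = {
    "A": {"color": "red", "size": "large", "material": "plastic"},
    "B": {"color": "black", "size": "large", "material": "plastic"},
    "C": {"color": "black", "size": "small", "material": "metal"},
    "D": {"color": "red", "size": "large", "material": "plastic"},
}
index = build_index(catalog)
print(top_similar("A", catalog, index, k=2))
```

Adding a new product only means adding its id to a handful of sets, but a very common value (say, most products being plastic) still pulls in a large candidate list, so this narrows the work rather than eliminating it.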

 

Thanks!
