
Libpuzzle vectors in database


lefthand


There is some information on indexing libpuzzle vectors in their readme:

http://download.pureftpd.org/pub/pure-ftpd/misc/libpuzzle/doc/README

http://libpuzzle.pureftpd.org/project/libpuzzle/php

 

However - I don't understand it. Has anyone done this?

This problem is part PHP and part MySQL...

 

What I need help with is how to split the vectors into words and what parts to put where in the database.

 

Help would be appreciated :)


After the compilation instructions, the README gives this call as the way to get a signature from a picture:

$signature = puzzle_fill_cvec_from_file($filename);

Are you able to run this and echo the result?

 

Next, get two signatures and compute the distance between them:

$signature1 = puzzle_fill_cvec_from_file($filename);

$signature2 = puzzle_fill_cvec_from_file($filename2);

$d = puzzle_vector_normalized_distance($signature1, $signature2);

Echo that result. A smaller distance means the pictures are more alike.

 

Now that you have signatures, set up a database as the directions call for:

CREATE TABLE signatures (sig_id int auto_increment primary key, signature char(544), pic_id bigint);

CREATE TABLE pic (pic_id bigint auto_increment primary key, filename varchar(255));

CREATE TABLE words (words char(10), sig_id int);

 

Your signatures go into the signatures table, with pic_id being the key returned from inserting the filename into pic.

 

Words are generated by sliding a 10-character window over the signature, starting at offsets 0 through 544-10:

for ($i = 0; $i <= strlen($signature) - 10; $i++) {
    $words = substr($signature, $i, 10);
    // INSERT $words INTO words, with sig_id being a reference to signatures
}

 

Let me know how it goes. This is an interesting module; let me know of a practical purpose if you can.

> as the way to use the library to get pictures.  Are you able to do this and echo the results?

Yes :)

 

Thanks for the loop, it helped a bunch! Though to be correct I had to change it to:

$words[]=substr($signature,$i,10);

 

> CREATE TABLE words (words char(10), sig_id int);

My trouble now is that this should contain "pos_and_word". Where do I put the position?

 

And when that is done ... how does one sort the table to put similar pictures next to each other?

 

> let me know how it goes, this is an interesting module, let me know of a practical purpose if you could.

A big database of images downloaded by different people. I wish to remove "duplicates" that do not share the same sha1 hash, to save space :) And it's fun working with databases :)

OK, I didn't think the position was important, but you should extend the table with a pos column and store the value of $i in it.  Also, your modification will not put a single value into the table.  Be advised that $words was supposed to be a single 10-character value that is inserted into the table and then overwritten on the next iteration.

 

The MySQL command should be (escape $words or use a prepared statement):

INSERT INTO words (words, pos, sig_id) VALUES ('$words', $i, $sig_id)
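To make the word/position idea concrete, here is a minimal sketch of how the rows for the words table could be built before inserting them. The function name signature_to_words and its parameters are hypothetical (not part of libpuzzle), and the defaults of 10 characters per word and at most 100 words are taken from the README's K = 10, N = 100 suggestion.

```php
<?php
// Hypothetical helper, not a libpuzzle function: cut a signature string into
// fixed-length words and remember the offset where each word starts.
function signature_to_words(string $signature, int $wordLength = 10, int $maxWords = 100): array
{
    $rows = [];
    $last = strlen($signature) - $wordLength; // last valid start offset

    // Slide a window over the signature, stopping once $maxWords rows exist.
    for ($pos = 0; $pos <= $last && count($rows) < $maxWords; $pos++) {
        $rows[] = ['pos' => $pos, 'word' => substr($signature, $pos, $wordLength)];
    }

    return $rows;
}
```

Each returned row would then go into one INSERT together with the sig_id of the signature it came from.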

 

 

BUT forget about the words.  They are for searching for similarity in parts of a picture.  You can simply compare entire signatures and use a stricter threshold (around .8 similarity), because you are looking for identical pictures, not just pictures that are 60% similar in the upper right corner, which is what the words implement.

Hmmm, I'm heavily loaded on vin rouge right now but I'll reply anyways :)

 

> Then, index your vector with a compound index of (word + position).
>
> Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.
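As a rough illustration of what that compound index buys you, the sketch below simulates the lookup in plain PHP instead of MySQL. The function find_candidates and the row layout (sig_id, pos, word) are assumptions made for this example; in the database the same grouping would come from the (word + position) index.

```php
<?php
// Hypothetical in-memory stand-in for the words table. Each stored row is
// ['sig_id' => int, 'pos' => int, 'word' => string]; each query row is
// ['pos' => int, 'word' => string].
function find_candidates(array $wordRows, array $queryWords): array
{
    // Build the "compound index": key = position + word, value = list of sig_ids.
    $index = [];
    foreach ($wordRows as $row) {
        $index[$row['pos'] . ':' . $row['word']][] = $row['sig_id'];
    }

    // Count how many (pos, word) pairs each stored signature shares with the query.
    $hits = [];
    foreach ($queryWords as $q) {
        $key = $q['pos'] . ':' . $q['word'];
        foreach ($index[$key] ?? [] as $sigId) {
            $hits[$sigId] = ($hits[$sigId] ?? 0) + 1;
        }
    }

    arsort($hits); // signatures sharing the most words come first
    return $hits;  // sig_id => number of shared words
}
```

Only the signatures that show up here would need a full distance check afterwards, which is the point of indexing words in the first place.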

 

I was sort of hoping to have MySQL sort the output so that it would produce a list of similar images. What I wish to do is something like tineye.com ...

Your input has nonetheless been very useful Andrew :) Thanks! I'll reread it again when sober  :P

 

> BUT forget about the words.  That is for searching internal elements of the picture for similarity.  You can simply use the entire signature and verify closer to .8 similarity because you are looking for identical pictures, not just pictures with 60% similarity in the upper right corner--which is the implementation of the words.

I sort of did it. Indexed pos and word ...

SELECT DISTINCT sha1 FROM puzzle_words USE INDEX(pos_and_word)

 

So far I'm unable to tell if it works. With 10,000 pics you'll get a table with 1,000,000 rows (using N=100).

I'll try it on a smaller sample size with a lot of dupes (not sharing the same sha1 hash).

 

Oh, you would actually just compile a database of signatures.

 

Then do a nested foreach

 

so you'd

//TODO assign $signatures = SQL select of all signatures
$signatures2 = $signatures;
foreach ($signatures as $sig) {
foreach ($signatures2 as $sig2) {
if ($d = puzzle_vector_normalized_distance($sig, $sig2)>.99){
echo "match: $sig and $sig2";
}
}
}

 

savvy?

oops, I made an error in that code:

//TODO assign $signatures = SQL select of all signatures
$signatures2 = $signatures;
foreach ($signatures as $sig) {
    foreach ($signatures2 as $sig2) {
        // a smaller distance means the pictures are more alike
        if (puzzle_vector_normalized_distance($sig, $sig2) < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
            echo "match: $sig and $sig2";
        }
    }
}

 

Also, this example would output the actual signatures.  You'd be better off creating a class that holds each signature, with ->id and ->sig, so you can print the ids and compare the ->sig values.
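A minimal sketch of that class idea, under some assumptions: PictureSignature and find_near_duplicates are hypothetical names, and the distance function is passed in as a callable so the loop can be tried without libpuzzle installed. It also starts the inner loop at $i + 1, so a picture is never compared with itself and each pair is checked only once.

```php
<?php
// Hypothetical container: keeps the picture id next to its signature.
class PictureSignature
{
    public $id;
    public $sig;

    public function __construct(int $id, string $sig)
    {
        $this->id = $id;
        $this->sig = $sig;
    }
}

// Returns [idA, idB] pairs whose distance is below $threshold.
// $distance is any callable taking two signatures and returning a float.
function find_near_duplicates(array $pictures, callable $distance, float $threshold): array
{
    $pictures = array_values($pictures); // make sure indexes run 0..n-1
    $matches = [];
    $n = count($pictures);

    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) { // skip self-comparison and mirrored pairs
            if ($distance($pictures[$i]->sig, $pictures[$j]->sig) < $threshold) {
                $matches[] = [$pictures[$i]->id, $pictures[$j]->id];
            }
        }
    }

    return $matches;
}
```

With libpuzzle loaded you would presumably pass 'puzzle_vector_normalized_distance' as the callable and one of the PUZZLE_CVEC_SIMILARITY_* constants as the threshold, then print the matching ids instead of the raw signatures.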
