|
Hi,
we are using an MD5 hash over parts of a file to uniquely identify images and display thumbnails. This has worked quite well, but recently I have seen two or three cases where the hash fails (same hash for completely different images).

There is another problem with the current hash: it relies on a binary blob of the metadata produced by Exiv2, but this format is not guaranteed to be stable (the hash may change with a new Exiv2 version).

The recommendation by Andreas Huggel was to simply use the first 100kB of a file, which will typically include the file header, the metadata, and reach actual image data. A variant would be to include the last 100kB as well.

Attached is a small application which scans a given collection directory, creates the hash, and reports whether the hash succeeds in differentiating all files.

I have run this on my collection, but I would ask you to repeat the test with your collections to find out if it works for you as well:

    qmake testhash.pro
    make
    ./testhash /toplevel/directory/to/your/collection

Here it takes 15s per 1000 files. At the end, it will tell you if any files failed, or if it succeeded. If it fails, it would be interesting to find out if the files are actually very similar, and if they have the same file size. (A hard failure would be two dissimilar files with the same file size.)

Thanks
Marcel
_______________________________________________
Digikam-devel mailing list
[hidden email]
https://mail.kde.org/mailman/listinfo/digikam-devel |
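For readers following along, the scheme proposed in the post above (MD5 over the first 100kB, optionally plus the last 100kB) could be sketched roughly as follows. This is not the attached testhash source, which is not reproduced in this thread. The names `partial_md5`, `CHUNK` and `include_tail` are made up for illustration, and skipping tail bytes that overlap the head on small files is an assumption of this sketch.

```python
import hashlib
import os

CHUNK = 100 * 1024  # 100 kB, the size suggested in the post (hypothetical constant name)


def partial_md5(path, chunk=CHUNK, include_tail=False):
    """Return the MD5 hex digest of the first `chunk` bytes of a file.

    With include_tail=True, the last `chunk` bytes are hashed as well.
    Assumption of this sketch: bytes already covered by the head are
    not hashed a second time when the file is shorter than 2 * chunk.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(chunk))
        if include_tail:
            size = os.fstat(f.fileno()).st_size
            tail_start = max(chunk, size - chunk)
            if tail_start < size:
                f.seek(tail_start)
                md5.update(f.read(chunk))
    return md5.hexdigest()
```

Two files that share their first 100kB and differ only later get the same head-only digest, which is the case the "last 100kB" variant is meant to catch.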
|
Hi Marcel,
here are my results:

Directory scanning and hash generation took 35.9236 ms/file
Success: All 4557 files have a different hash.

Also, I may be misunderstanding this, but wouldn't reading the beginning be better than reading the end of the file with regard to I/O operations? (To read the end of the file you must first move the file "cursor" to somewhere near the end, whereas with the beginning you just open and read.)
Marty |
|
In reply to this post by Marcel Wiesweg
See the trace from my office computer:
[gilles@localhost Download]$ ./testhash /mnt/data/Rep1
Scanned "/mnt/data/Rep1/0.9.2-splashcreens/Juergen Flosbach" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/0.9.2-splashcreens" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep1/221460" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Alpha 450 samples" , 21 files and 0 subdirectories
Scanned "/mnt/data/Rep1/CanonVsdigiKam" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Corrupted" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/dimgscale" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR/aligned" , 13 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR/Set1" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR" , 27 files and 2 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie01" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie02" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie03" , 13 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie04" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie05" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie06" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie07" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie08" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd" , 0 files and 8 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Daniel" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Julien" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Seb" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table" , 1 files and 4 subdirectories
Scanned "/mnt/data/Rep1/NEW" , 14 files and 0 subdirectories
Scanned "/mnt/data/Rep1/NEW2" , 60 files and 0 subdirectories
Scanned "/mnt/data/Rep1/PhotoShop 7.0" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/pipo" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/processed" , 0 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SAMPLES" , 12 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SAMPLES2" , 55 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SONY" , 15 files and 0 subdirectories
Scanned "/mnt/data/Rep1/splash" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/test/pipo" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep1/test" , 23 files and 1 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/DNG(RAWconverter)" , 41 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/GPS" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/HOTPIXELSTOOL" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JP2" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG/Horizontal" , 49 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG/Vertical" , 30 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG" , 8 files and 2 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/LENSFUN" , 22 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Adobe" , 11 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/B&W" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/digiKam" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/LR" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Picasa" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Vista" , 20 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata" , 15 files and 6 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/newpictures" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/NOISE" , 13 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/OGG" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PCD" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Photoshop" , 28 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PNG" , 14 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PPM" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Horizontal" , 61 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Mix" , 39 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Vertical" , 16 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW" , 0 files and 3 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Red Eyes" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/TIFF" , 27 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Video" , 61 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/WDP" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/XCF" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs" , 3 files and 20 subdirectories
Scanned "/mnt/data/Rep1/test.cameragui" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1" , 0 files and 20 subdirectories
Directory scanning and hash generation took 3.66133 ms/file
Success: All 874 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep2
Scanned "/mnt/data/Rep2/test/processed" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep2/test" , 13 files and 1 subdirectories
Scanned "/mnt/data/Rep2" , 0 files and 1 subdirectories
Directory scanning and hash generation took 16.4444 ms/file
Success: All 18 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep3
Scanned "/mnt/data/Rep3/Alina Dinu <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Arturo Mann <[hidden email]>" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Aykut Turhan <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Benoit Courty <[hidden email]>" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep3/cabaflo <[hidden email]>" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Christophe Keckeis <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/D Vanraes" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Eric Bayard <[hidden email]>" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Frédéric Martinot <[hidden email]>" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Gerhard Kulzer" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Gustavo Pichorim Boiko <[hidden email]>" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Josh & Erica Nijenhuis <[hidden email]>" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Jürgen Flosbach dk <[hidden email]>" , 7 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Markus Volkmer <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Mathias Ball <[hidden email]>" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Michel Pottier <[hidden email]>" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Paul Radford <[hidden email]>" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Pol <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Roger Larsson <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Sébastien Benoit <[hidden email]>" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3" , 0 files and 20 subdirectories
Directory scanning and hash generation took 8.98305 ms/file
Success: All 59 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep4
Scanned "/mnt/data/Rep4/new" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep4/test/processed" , 607 files and 0 subdirectories
Scanned "/mnt/data/Rep4/test" , 42 files and 1 subdirectories
Scanned "/mnt/data/Rep4" , 0 files and 2 subdirectories
Directory scanning and hash generation took 5.9393 ms/file
Success: All 659 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep5
Scanned "/mnt/data/Rep5/2010-07-16/jpg" , 64 files and 0 subdirectories
Scanned "/mnt/data/Rep5/2010-07-16" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-27/jpg" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-27" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29/arw" , 11 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29/png" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29" , 0 files and 2 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix" , 0 files and 2 subdirectories
Scanned "/mnt/data/Rep5/test" , 0 files and 0 subdirectories
Scanned "/mnt/data/Rep5" , 2 files and 3 subdirectories
Directory scanning and hash generation took 15.6023 ms/file
Success: All 88 files have a different hash.
[gilles@localhost Download]$
[gilles@localhost Download]$ ./testhash /mnt/data/Camera
Scanned "/mnt/data/Camera" , 12 files and 0 subdirectories
Directory scanning and hash generation took 1.58333 ms/file
Success: All 12 files have a different hash.
[gilles@localhost Download]$

There are not a lot of files here. I can run it on my home computer this weekend if you want.

Gilles |
|
In reply to this post by Marcel Wiesweg
Hi,
Here are my results:

Directory scanning and hash generation took 29.5804 ms/file
Success: All 39557 files have a different hash.

Julien |
|
In reply to this post by Martin Klapetek
> Also, I might understand it wrong, but wouldn't reading the beginning be
> better than reading the end of file in regards to IO operations? (as with
> reading the end of the file you must move the file "cursor" to somewhere
> near the end, with the beginning you just open and read)

I did a small benchmark reading 8 or 100 kB from the beginning, middle and/or end of a file:

  beginning  middle  end      time
  8 kB       -       -         8.5 ms
  100 kB     -       -        15.2 ms
  100 kB     8 kB    100 kB   32.2 ms
  100 kB     -       8 kB     22.0 ms
  100 kB     -       100 kB   24.5 ms

So there is a cost for linear reading (reading 8 kB is faster than reading 100 kB), and there is a cost for each seek operation. Reading 100 kB is nowhere near 10 times slower than reading 8 kB, so there is also a large constant amount.

The numbers are not overly accurate. With larger-scale scanning (my whole collection) I needed only 15 ms/file for the "100 kB at the beginning plus 100 kB at the end" approach, while in the benchmark it was 24 ms. I don't know why.

Marcel |
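A benchmark of this kind could be reproduced along the following lines. This is only a rough sketch, not the code Marcel used (which is not shown in the thread): `read_regions` and `time_pattern` are hypothetical names, the middle block is assumed to start at half the file size, and results depend heavily on whether the files are already in the operating system's page cache.

```python
import os
import time

KB = 1024


def read_regions(path, head_kb=0, middle_kb=0, tail_kb=0):
    """Read the requested number of kB from the start, middle and end of a file.

    Returns (bytes_read, seeks) so the caller can relate timings to the
    amount of data read and the number of seek operations.
    """
    total = 0
    seeks = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if head_kb:
            total += len(f.read(head_kb * KB))
        if middle_kb:
            f.seek(size // 2)  # assumption: "middle" starts at half the file size
            seeks += 1
            total += len(f.read(middle_kb * KB))
        if tail_kb:
            f.seek(max(0, size - tail_kb * KB))
            seeks += 1
            total += len(f.read(tail_kb * KB))
    return total, seeks


def time_pattern(paths, **pattern):
    """Return the average wall-clock time in ms per file for one read pattern."""
    start = time.perf_counter()
    for p in paths:
        read_regions(p, **pattern)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(paths)
```

For example, `time_pattern(files, head_kb=100, tail_kb=100)` would correspond to the last row of the table. Running each pattern on a different set of files, or dropping caches between runs, avoids measuring only cached reads.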
|
In reply to this post by Marcel Wiesweg
Marcel,
I always get "Success", even when, as I understand it, the hashes should be the same:

andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ls -la
total 4052
drwxr-xr-x 2 andreas andreas    4096 10-Dec-10 ./
drwxr-xr-x 6 andreas andreas    8192 10-Dec-10 ../
-rw-r--r-- 1 andreas andreas 2052392 10-Dec-10 d1.jpg
-rw-r--r-- 1 andreas andreas 2052392 10-Dec-10 d2.jpg
-rw-r--r-- 1 andreas andreas    4745 10-Dec-10 e1.jpg
-rw-r--r-- 1 andreas andreas    4745 10-Dec-10 e2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff d1.jpg d2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff e1.jpg e2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ~/src/testhash/testhash .
Scanned "." , 4 files and 0 subdirectories
Directory scanning and hash generation took 1.5 ms/file
Success: All 4 files have a different hash.

Andreas |
|
> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff d1.jpg d2.jpg
> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff e1.jpg e2.jpg
> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ~/src/testhash/testhash .
> Scanned "." , 4 files and 0 subdirectories
> Directory scanning and hash generation took 1.5 ms/file
> Success: All 4 files have a different hash.

If the files are fully identical (the MD5 over the whole file is the same), then it's all right. You will only get a failure if the full files are not identical, but the hash is. (I do have one example here: two PNG files where the full MD5 is not identical, but the files in fact look pixel-by-pixel identical and are derived from the same picture.) |
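The rule described in the reply above (identical files may share a hash, and only files whose full contents differ count as a failure) could be sketched as follows. This is an illustration under assumptions, not the testhash implementation: `md5_of`, `find_collisions` and the `limit` parameter are hypothetical names, and the partial hash here covers only the first `limit` bytes.

```python
import hashlib
from collections import defaultdict


def md5_of(path, limit=None):
    """MD5 hex digest of the first `limit` bytes of a file (whole file if None)."""
    md5 = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = 65536 if remaining is None else min(65536, remaining)
            block = f.read(want)
            if not block:
                break
            md5.update(block)
            if remaining is not None:
                remaining -= len(block)
    return md5.hexdigest()


def find_collisions(paths, limit=100 * 1024):
    """Return groups of files that share a partial hash but differ in full content.

    Byte-identical duplicates (same full MD5) are not reported, matching
    the behaviour described in the reply above.
    """
    by_partial = defaultdict(list)
    for p in paths:
        by_partial[md5_of(p, limit)].append(p)
    collisions = []
    for group in by_partial.values():
        if len({md5_of(p) for p in group}) > 1:
            collisions.append(sorted(group))
    return collisions
```

With Andreas's four files, `d1.jpg`/`d2.jpg` and `e1.jpg`/`e2.jpg` would each land in the same partial-hash group, but since their full MD5 digests match, no collision would be reported.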
