file hash creation: asking for short test

Marcel Wiesweg
Hi,

We are using an MD5 hash over parts of a file to uniquely identify images and
display thumbnails. This has worked quite well, but recently I have seen two
or three cases where the hash fails (same hash for completely different
images).
There is another problem with the current hash: it relies on a binary blob of
the metadata produced by Exiv2, but this format is not guaranteed to be stable
(the hash may change with a new Exiv2 version).

Andreas Huggel recommended simply using the first 100 kB of a
file, which will typically include the file header and the metadata, and reach
into the actual image data.
A variant would be to include the last 100 kB as well.
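
For illustration only, here is a minimal Python sketch of the two variants. It
is not the attached main.cpp and not digiKam's code; the names partial_md5 and
include_tail are hypothetical, and "100 kB" is assumed to mean 100 * 1024 bytes.

```python
import hashlib
import os

CHUNK = 100 * 1024  # assumption: "100 kB" taken as 100 * 1024 bytes


def partial_md5(path, include_tail=False, chunk=CHUNK):
    """MD5 over the first `chunk` bytes, optionally also the last `chunk` bytes."""
    md5 = hashlib.md5()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        md5.update(f.read(chunk))
        if include_tail and size > chunk:
            # Start the tail no earlier than the end of the head,
            # so no byte is hashed twice in files shorter than 2 * chunk.
            f.seek(max(size - chunk, chunk))
            md5.update(f.read())
    return md5.hexdigest()
```

Calling partial_md5(path) gives the "first 100 kB" hash and
partial_md5(path, include_tail=True) the variant that also covers the end of
the file. Files smaller than the chunk size are simply hashed in full.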

Attached is a small application that scans a given collection directory,
creates the hash for each file, and reports whether the hash successfully
differentiates all files.

I have run this on my collection, but I would ask you to repeat testing with
your collections to find out if it works for you as well:

qmake testhash.pro
make
./testhash /toplevel/directory/to/your/collection

Here it takes 15 s per 1000 files.
At the end, it will tell you whether any files failed or whether it succeeded.
If it fails, it would be interesting to find out whether the files are actually
very similar and whether they have the same file size. (A hard failure would be
two dissimilar files with the same file size.)

Thanks
Marcel


_______________________________________________
Digikam-devel mailing list
[hidden email]
https://mail.kde.org/mailman/listinfo/digikam-devel

Attachments: main.cpp (5K), testhash.pro (41 bytes)
Re: file hash creation: asking for short test

Martin Klapetek
Hi Marcel,

here are my results:

Directory scanning and hash generation took 35.9236 ms/file 
Success: All 4557 files have a different hash. 

Also, I might be misunderstanding, but wouldn't reading the beginning be better than reading the end of the file with regard to I/O operations? (To read the end of the file you must move the file "cursor" to somewhere near the end; for the beginning you just open and read.)

Marty

Re: file hash creation: asking for short test

Gilles Caulier-4
In reply to this post by Marcel Wiesweg
Here is the trace from my office computer:

[gilles@localhost Download]$ ./testhash /mnt/data/Rep1
Scanned "/mnt/data/Rep1/0.9.2-splashcreens/Juergen Flosbach" , 9 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/0.9.2-splashcreens" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep1/221460" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Alpha 450 samples" , 21 files and 0 subdirectories
Scanned "/mnt/data/Rep1/CanonVsdigiKam" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Corrupted" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/dimgscale" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR/aligned" , 13 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR/Set1" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/HDR" , 27 files and 2 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie01" , 4 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie02" , 5 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie03" , 13 files and
0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie04" , 5 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie05" , 3 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie06" , 2 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie07" , 3 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd/Serie08" , 5 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Arnd" , 0 files and 8 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Daniel" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Julien" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table/From Seb" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Light Table" , 1 files and 4 subdirectories
Scanned "/mnt/data/Rep1/NEW" , 14 files and 0 subdirectories
Scanned "/mnt/data/Rep1/NEW2" , 60 files and 0 subdirectories
Scanned "/mnt/data/Rep1/PhotoShop 7.0" , 6 files and 0 subdirectories
Scanned "/mnt/data/Rep1/pipo" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/processed" , 0 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SAMPLES" , 12 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SAMPLES2" , 55 files and 0 subdirectories
Scanned "/mnt/data/Rep1/SONY" , 15 files and 0 subdirectories
Scanned "/mnt/data/Rep1/splash" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/test/pipo" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep1/test" , 23 files and 1 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/DNG(RAWconverter)" , 41 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/GPS" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/HOTPIXELSTOOL" , 6 files and
0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JP2" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG/Horizontal" , 49 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG/Vertical" , 30 files and
0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/JPEG" , 8 files and 2 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/LENSFUN" , 22 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Adobe" , 11 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/B&W" , 10 files and
0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/digiKam" , 3 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/LR" , 1 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Picasa" , 3 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata/Vista" , 20 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Metadata" , 15 files and 6
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/newpictures" , 4 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/NOISE" , 13 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/OGG" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PCD" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Photoshop" , 28 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PNG" , 14 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/PPM" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Horizontal" , 61 files
and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Mix" , 39 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW/Vertical" , 16 files and
0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/RAW" , 0 files and 3 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Red Eyes" , 10 files and 0
subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/TIFF" , 27 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/Video" , 61 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/WDP" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs/XCF" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep1/Test Photographs" , 3 files and 20 subdirectories
Scanned "/mnt/data/Rep1/test.cameragui" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep1" , 0 files and 20 subdirectories
Directory scanning and hash generation took 3.66133 ms/file
Success: All 874 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep2
Scanned "/mnt/data/Rep2/test/processed" , 5 files and 0 subdirectories
Scanned "/mnt/data/Rep2/test" , 13 files and 1 subdirectories
Scanned "/mnt/data/Rep2" , 0 files and 1 subdirectories
Directory scanning and hash generation took 16.4444 ms/file
Success: All 18 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep3
Scanned "/mnt/data/Rep3/Alina Dinu <[hidden email]>" , 1 files
and 0 subdirectories
Scanned "/mnt/data/Rep3/Arturo Mann <[hidden email]>" , 4 files
and 0 subdirectories
Scanned "/mnt/data/Rep3/Aykut Turhan <[hidden email]>" , 1 files
and 0 subdirectories
Scanned "/mnt/data/Rep3/Benoit Courty <[hidden email]>" , 2
files and 0 subdirectories
Scanned "/mnt/data/Rep3/cabaflo <[hidden email]>" , 2 files and 0
subdirectories
Scanned "/mnt/data/Rep3/Christophe Keckeis <[hidden email]>" , 1 files
and 0 subdirectories
Scanned "/mnt/data/Rep3/D Vanraes" , 1 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Eric Bayard <[hidden email]>" , 3 files and
0 subdirectories
Scanned "/mnt/data/Rep3/Frédéric Martinot <[hidden email]>" , 4
files and 0 subdirectories
Scanned "/mnt/data/Rep3/Gerhard Kulzer" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Gustavo Pichorim Boiko
<[hidden email]>" , 4 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Josh & Erica Nijenhuis
<[hidden email]>" , 3 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Jürgen Flosbach dk
<[hidden email]>" , 7 files and 0 subdirectories
Scanned "/mnt/data/Rep3/Markus Volkmer <[hidden email]>" , 1
files and 0 subdirectories
Scanned "/mnt/data/Rep3/Mathias Ball <[hidden email]>" , 3 files and
0 subdirectories
Scanned "/mnt/data/Rep3/Michel Pottier <[hidden email]>" , 6
files and 0 subdirectories
Scanned "/mnt/data/Rep3/Paul Radford <[hidden email]>" , 3 files
and 0 subdirectories
Scanned "/mnt/data/Rep3/Pol <[hidden email]>" , 1 files and 0
subdirectories
Scanned "/mnt/data/Rep3/Roger Larsson <[hidden email]>" , 1
files and 0 subdirectories
Scanned "/mnt/data/Rep3/Sébastien Benoit <[hidden email]>" , 1
files and 0 subdirectories
Scanned "/mnt/data/Rep3" , 0 files and 20 subdirectories
Directory scanning and hash generation took 8.98305 ms/file
Success: All 59 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep4
Scanned "/mnt/data/Rep4/new" , 10 files and 0 subdirectories
Scanned "/mnt/data/Rep4/test/processed" , 607 files and 0 subdirectories
Scanned "/mnt/data/Rep4/test" , 42 files and 1 subdirectories
Scanned "/mnt/data/Rep4" , 0 files and 2 subdirectories
Directory scanning and hash generation took 5.9393 ms/file
Success: All 659 files have a different hash.
[gilles@localhost Download]$ ./testhash /mnt/data/Rep5
Scanned "/mnt/data/Rep5/2010-07-16/jpg" , 64 files and 0 subdirectories
Scanned "/mnt/data/Rep5/2010-07-16" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-27/jpg" , 9 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-27" , 0 files and 1 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29/arw" , 11 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29/png" , 2 files and 0 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix/2010-07-29" , 0 files and 2 subdirectories
Scanned "/mnt/data/Rep5/Sur Aix" , 0 files and 2 subdirectories
Scanned "/mnt/data/Rep5/test" , 0 files and 0 subdirectories
Scanned "/mnt/data/Rep5" , 2 files and 3 subdirectories
Directory scanning and hash generation took 15.6023 ms/file
Success: All 88 files have a different hash.
[gilles@localhost Download]$
[gilles@localhost Download]$ ./testhash /mnt/data/Camera
Scanned "/mnt/data/Camera" , 12 files and 0 subdirectories
Directory scanning and hash generation took 1.58333 ms/file
Success: All 12 files have a different hash.
[gilles@localhost Download]$

There are not many files here. I can also run it on my home computer if you
want, but only this weekend.

Gilles



Re: file hash creation: asking for short test

Bugzilla from Julien@narboux.fr
In reply to this post by Marcel Wiesweg
Hi,

Here are my results:

Directory scanning and hash generation took 29.5804 ms/file

Success: All 39557 files have a different hash.

Julien


Re: file hash creation: asking for short test

Marcel Wiesweg
In reply to this post by Martin Klapetek


> Also, I might be misunderstanding, but wouldn't reading the beginning be
> better than reading the end of the file with regard to I/O operations? (To
> read the end of the file you must move the file "cursor" to somewhere
> near the end; for the beginning you just open and read.)

I did a small benchmark reading 8 or 100 kB from the beginning, middle and/or
end of a file (columns: beginning - middle - end, in kB):
    8 - -: 8.5 ms
    100 - -: 15.2 ms
    100 - 8 - 100: 32.2 ms
    100 - - 8: 22.0 ms
    100 - - 100: 24.5 ms
So there is a cost for linear reading (reading 8 kB is faster than reading 100
kB), and there is a cost for each seek operation (reading 100 kB is nowhere
near 10 times slower than reading 8 kB, so there is a large constant overhead).

The numbers are not very accurate. In a larger-scale scan of my whole
collection, I needed only 15 ms/file for the "100 - - 100" approach, whereas in
the benchmark it was 24 ms. I don't know why.
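
To put a rough number on that constant overhead, the two single-read rows of
the benchmark (8 kB in 8.5 ms, 100 kB in 15.2 ms) can be fitted to a straight
line. The linear model is an assumption made here for illustration; it was not
part of the measurement, and the numbers are only as accurate as the benchmark.

```python
# Benchmark rows quoted above: kB read from the start of the file -> time in ms.
small_kb, small_ms = 8, 8.5
large_kb, large_ms = 100, 15.2

# Assumed model (not from the thread): time = fixed_ms + per_kb_ms * size_kb
per_kb_ms = (large_ms - small_ms) / (large_kb - small_kb)
fixed_ms = small_ms - per_kb_ms * small_kb

print(f"per-kB cost: {per_kb_ms:.3f} ms")
print(f"fixed cost:  {fixed_ms:.2f} ms")
```

Under that assumption the fit gives roughly 0.07 ms per kB and a fixed cost of
about 7.9 ms per file, so most of the time for a single read would not depend
on how much is read.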

Marcel

Re: file hash creation: asking for short test

Bugzilla from ahuggel@gmx.net
In reply to this post by Marcel Wiesweg
Marcel,

I always get "Success", even when, as I understand it, the hashes should be
the same:

andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ls -la
total 4052
drwxr-xr-x 2 andreas andreas    4096 10-Dec-10 ./
drwxr-xr-x 6 andreas andreas    8192 10-Dec-10 ../
-rw-r--r-- 1 andreas andreas 2052392 10-Dec-10 d1.jpg
-rw-r--r-- 1 andreas andreas 2052392 10-Dec-10 d2.jpg
-rw-r--r-- 1 andreas andreas    4745 10-Dec-10 e1.jpg
-rw-r--r-- 1 andreas andreas    4745 10-Dec-10 e2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff d1.jpg d2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff e1.jpg e2.jpg
andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ~/src/testhash/testhash .
Scanned "." , 4 files and 0 subdirectories
Directory scanning and hash generation took 1.5 ms/file
Success: All 4 files have a different hash.

Andreas

Re: file hash creation: asking for short test

Marcel Wiesweg


> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff d1.jpg d2.jpg
> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ diff e1.jpg e2.jpg
> andreas@mowgli:~/src/exiv2/trunk/src/tmp$ ~/src/testhash/testhash .
> Scanned "." , 4 files and 0 subdirectories
> Directory scanning and hash generation took 1.5 ms/file
> Success: All 4 files have a different hash.

If the files are fully identical (the MD5 over the whole file is the same),
that is fine. You only get a failure if the full files are not identical but
the partial hash is.
(I do have one such example here: two PNG files whose full MD5 differs, but
which in fact look pixel-by-pixel identical and are derived from the same
picture.)
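
The rule can be sketched in Python as follows. This is an illustration of the
check described here, not the code of the attached testhash; md5_of and
find_failures are hypothetical names, and the sketch reads whole files into
memory for the full MD5, which is fine for a test but not tuned for speed.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read() if limit is None else f.read(limit))
    return md5.hexdigest()


def find_failures(top, limit=100 * 1024):
    """Return groups of files that share a partial hash but differ in content."""
    by_partial = defaultdict(list)
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            by_partial[md5_of(path, limit)].append(path)

    failures = []
    for paths in by_partial.values():
        # Identical files legitimately share a hash; only differing content counts.
        if len({md5_of(p) for p in paths}) > 1:
            failures.append(sorted(paths))
    return failures
```

With this rule, byte-identical copies such as d1.jpg/d2.jpg and e1.jpg/e2.jpg
never count as failures, so an empty result from find_failures corresponds to
the "Success" message.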