How does digikam calculate the uniqueHash ?

How does digikam calculate the uniqueHash ?

Bugzilla from hadmut@danisch.de
Hi,

I am trying to write some code for a scripting language to extract the
data for a given picture from the digikam database file.

(I want to keep my raw format picture files unmodified by digikam, but
sometimes I need to automatically convert some of them to jpeg files
for export outside the digikam directory and need to extract the
information needed for IPTC from the database. So I need to identify
the id of a picture in the database.)

Unfortunately the algorithm to calculate the uniqueHash appears to be
sort of weird.

What I have found so far (from the undocumented source) is that the
uniqueHash is an MD5 sum of the concatenation of

- the exif section of the picture
- the first 8192 bytes of the picture file
- the length of the picture file written as a decimal number


I then could correctly calculate the uniqueHash for jpeg images, but
not for raw images.
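
The three parts listed above can be sketched in Python as follows. This is
only an illustration of the recipe as described in this post, not digikam's
own code: `unique_hash_sketch` and `exif_blob` are hypothetical names, the
caller is assumed to supply the Exif bytes in whatever form digikam obtains
them, and the hexadecimal output format is an assumption.

```python
import hashlib
import os


def unique_hash_sketch(path, exif_blob):
    """MD5 over: Exif bytes + first 8192 bytes of the file + decimal file size.

    `exif_blob` (hypothetical parameter) must be supplied by the caller as bytes.
    """
    md5 = hashlib.md5()
    # 1. The Exif section of the picture.
    md5.update(exif_blob)
    # 2. The first 8192 bytes of the file (fewer if the file is shorter).
    with open(path, "rb") as f:
        md5.update(f.read(8192))
    # 3. The file length, written as a decimal number.
    md5.update(str(os.path.getsize(path)).encode("ascii"))
    # Assumption: the digest is compared in hexadecimal form.
    return md5.hexdigest()
```

For a file without any Exif data, passing `b""` as `exif_blob` reduces the
hash to the first 8192 bytes plus the decimal length.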

Raw images are usually based on the TIFF file format. Exif data are,
afaik, TIFF entries. Therefore, TIFF files (unlike JPEG) do not have a
separate Exif section, but have Exif tags (which are in fact TIFF
tags) interwoven with the whole file.

How exactly is the uniqueHash calculated for these files?


regards
Hadmut

_______________________________________________
Digikam-devel mailing list
[hidden email]
https://mail.kde.org/mailman/listinfo/digikam-devel

Re: How does digikam calculate the uniqueHash ?

Bugzilla from hadmut@danisch.de
Hi again,

could anyone please point out how exactly the uniqueHash is calculated
for the different sorts of pictures (the middle part with the exif
data), and what design criteria led to the decision to use
hash(first 8kb, exif, file length) ?

regards
Hadmut

Re: How does digikam calculate the uniqueHash ?

Gilles Caulier-4
http://lxr.kde.org/source/extragear/graphics/digikam/libs/dimg/dimg.cpp#2107

which points to:

http://lxr.kde.org/source/extragear/graphics/digikam/libs/dimg/loaders/dimgloader.cpp#204

... to compute it.

Gilles Caulier

2010/4/21 Hadmut Danisch <[hidden email]>:

> Hi again,
>
> could anyone please point out how exactly the uniqueHash is calculated
> for the different sorts of pictures (the middle part with the exif
> data), and what design criteria led to the decision to use
> hash(first 8kb, exif, file length) ?
>
> regards
> Hadmut

Re: How does digikam calculate the uniqueHash ?

Marcel Wiesweg
In reply to this post by Bugzilla from hadmut@danisch.de

> Hi again,
>
> could anyone please point out how exactly the uniqueHash is calculated
> for the different sorts of pictures (the middle part with the exif
> data),

libexiv2 can give us a data packet that contains the Exif information
packed as it would be for inclusion in a JPEG file, regardless of the
source file format.
It's technically the easiest way to get a hash on this information.
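
To illustrate what "packed as for inclusion in a JPEG file" refers to: in a
JPEG, the Exif block sits in an APP1 segment whose payload starts with the
six bytes `Exif\0\0`, followed by TIFF-structured data. The Python sketch
below locates that payload in a JPEG byte string. `find_exif_payload` is a
hypothetical helper, not part of digikam or libexiv2, and whether the bytes
digikam hashes include the `Exif\0\0` prefix is not stated in this thread,
so that should be checked against the source linked above.

```python
import struct

EXIF_HEADER = b"Exif\x00\x00"


def find_exif_payload(data):
    """Return the TIFF-structured bytes of the first APP1/Exif segment, or None."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # not positioned on a marker: malformed input
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            return None
        # The 2-byte big-endian length counts itself but not the marker.
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + seg_len]
        if marker == 0xE1 and payload.startswith(EXIF_HEADER):
            # Assumption: the prefix is stripped here; prepend EXIF_HEADER
            # again if the hashed packet turns out to include it.
            return payload[len(EXIF_HEADER):]
        pos += 2 + seg_len
    return None
```

For raw (TIFF-based) files there is no such segment to cut out, which is why
the packet has to be built by a metadata library rather than read directly
from the file.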

> and what design criteria led to the decision to use
> hash(first 8kb, exif, file length) ?

1. We want a hash.
2. A hash over the complete file is too slow.
3. We need parts of the file that are as unique as possible.
4. The Exif info typically contains the creation date, which is pretty unique,
  and photographic parameters like aperture and shutter speed.
5. The first 8 KB: it's not 0, it's not the full file, it's in between. It's
small enough to be fast. In the end, an arbitrary decision.
6. The file length is pretty unique for compressed formats, because it depends
on the compression entropy of the image data. It also carries at least a
minimal amount of information about the end of the file, while the hash
itself is calculated over the beginning.