Images.uniquehash calculation

Images.uniquehash calculation

Bugzilla from mdione@grulic.org.ar

    Disclaimer: this is probably not the right list for this question.
If so, just let me know. Also, I'm not subscribed, so please CC me in the
answers.

    I'm trying to write a script that takes an image already in
digikam's database, resizes it, applies the same tags as the original,
and possibly removes the original. So far the idea is that this script
will be independent of digikam, touching its database when needed. I
checked the database structure and it looks OK, except for the md5sum.
I tried to reimplement DImgLoader::uniqueHashV2() in
libs/dimg/loaders/dimgloader.cpp:329, but even when reimplementing it in
Python with the same libraries (Qt4's md5) and copying the algorithm line
by line, the value in the database differs from the one my script
computes. Am I missing something? For comparison, I attach the script I
wrote.


--
(Not so) Random fortune:
19:39 < m4rgin4l> for being a suck-up
19:40 < m4rgin4l> or, as I like to call it: academic social
    engineering

_______________________________________________
Digikam-users mailing list
[hidden email]
https://mail.kde.org/mailman/listinfo/digikam-users

Attachment: resize.py (1K)

Re: Images.uniquehash calculation

tosca
It'll be better to send this message to the developers' list, I think.

Marie-Noëlle

2013/6/17 Marcos Dione <[hidden email]>

    Disclaimer: this is probably not the right list for this question.
If so, just let me know. Also, I'm not subscribed, so please CC me in the
answers.

    I'm trying to write a script that takes an image already in
digikam's database, resizes it, applies the same tags as the original,
and possibly removes the original. So far the idea is that this script
will be independent of digikam, touching its database when needed. I
checked the database structure and it looks OK, except for the md5sum.
I tried to reimplement DImgLoader::uniqueHashV2() in
libs/dimg/loaders/dimgloader.cpp:329, but even when reimplementing it in
Python with the same libraries (Qt4's md5) and copying the algorithm line
by line, the value in the database differs from the one my script
computes. Am I missing something? For comparison, I attach the script I
wrote.

--


My latest photos are in my gallery.
Do you know Image Fixe, the photo club of Saint Jean du Gard?
And explore the Cévennes my way with Cévennes Plurielles.


Re: Images.uniquehash calculation

Marcel Wiesweg
In reply to this post by Bugzilla from mdione@grulic.org.ar

That's the fun of a hash... Well, I don't know.
For debugging, I would write the binary data you feed into the hash to a
file, once from Python and once from C++, and compare the two files. If they
differ, you'll be able to locate the problem. If not, there's a difference in
the hash implementation, but I doubt that.

Marcel
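
The comparison suggested above could be sketched in Python roughly as follows. This is only a minimal sketch, not digiKam code: `RecordingMd5` and `first_difference` are hypothetical helper names, and the C++ side would need its own dump of the bytes it passes to the hash.

```python
import hashlib


class RecordingMd5:
    """Wraps hashlib.md5 and keeps a copy of every byte fed to it.

    Hypothetical helper for debugging; not part of digiKam or Qt.
    """

    def __init__(self):
        self._md5 = hashlib.md5()
        self.fed = bytearray()  # everything that went into the hash, in order

    def update(self, data):
        self.fed += data
        self._md5.update(data)

    def hexdigest(self):
        return self._md5.hexdigest()


def first_difference(a, b):
    """Return the offset of the first differing byte, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One input is a prefix of the other: they differ where the shorter ends.
        return min(len(a), len(b))
    return None
```

Using `RecordingMd5` in place of the plain hash object in the script, writing `fed` to a file, and running `first_difference` against the dump produced on the C++ side would show where the two inputs start to diverge, if they do.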

Re: Images.uniquehash calculation

Remco Viëtor
On Monday 17 June 2013 19:24:45 Marcel Wiesweg wrote:
> That's the fun of a hash...Well, I dont know.
> For debugging, I would record the binary data you feed into the hash in
> Python and C++ to a file, compare that one. If it differs, you'll be able to
> locate the problem. If not, there's a difference in the hash implementation,
> but I doubt that.
>
> Marcel

According to the code, both routines use the same hashing routine (not only
the same algorithm, but actually the same implementation).

There is one difference between the two routines, though:
- in the digiKam C++ routine, a data block is only fed to the hash if data
was actually read
- in the Python routine, this check is omitted, and the data block is added
to the data to be hashed /unconditionally/.

For the second data block (the last 100 kB), there is a seek just before the
read, and that could make a difference if the file is <100 kB:
- in C++, the file is probably in an error state, so no data will be read and
the second data block will not be fed to the hash routine.
- in Python, the data block /is/ fed, but will probably contain rubbish if
the file is <100 kB...

Also, if the Python script changes anything in the metadata (e.g. by
recording the correct image size...), the first 100 kB will differ.
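
The "feed a block only if data was actually read" check can be illustrated with a small Python sketch. It assumes only what this thread describes (MD5 over the first and the last 100 kB of the file). The function name `head_tail_md5` is hypothetical, and clamping the seek offset to 0 for small files is an assumption of the sketch, not confirmed digiKam behaviour, so the result may not match the database for files under 100 kB.

```python
import hashlib
import os

BLOCK = 100 * 1024  # 100 kB, the block size mentioned in this thread


def head_tail_md5(path, block=BLOCK):
    """MD5 over the first and last `block` bytes of a file.

    Each block is fed to the hash only if the read returned data.
    """
    md5 = hashlib.md5()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block)
        if head:  # feed only what was actually read
            md5.update(head)

        # Assumption: clamp the offset so the seek never goes before byte 0.
        f.seek(max(size - block, 0))
        tail = f.read(block)
        if tail:  # same check for the second block
            md5.update(tail)
    return md5.hexdigest()
```

Because `read()` returns exactly the bytes it got, there is no partially filled buffer here, and nothing beyond the file's contents can reach the hash.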

Remco

Re: Images.uniquehash calculation

Marcel Wiesweg


> For the second data block (the last 100 kB), as there is a seek just
> before, that could make a difference if the file is <100kB:
> - in C++, the file's probably in an error state, so no data will be read, so
> the second data block will not be fed to the hash routine.
> - in Python, the data block /is/ fed, but will probably contain rubbish if
> the file is <100kB...

Interesting observation. Anyway, even if this was a bug, we won't change it,
in order to keep the hash stable.

>
> Also, if the python script changes anything in the metadata (e.g. by
> recording the correct image size...), the first 100kB will differ.

That is intentional.

Re: Images.uniquehash calculation

Remco Viëtor
On Tuesday 18 June 2013 21:00:44 Marcel Wiesweg wrote:
>
> > For the second data block (the last 100 kB), as there is a seek just
> > before, that could make a difference if the file is <100kB:
> > - in C++, the file's probably in an error state, so no data will be
> > read, so the second data block will not be fed to the hash routine.
> > - in Python, the data block /is/ fed, but will probably contain rubbish
> > if the file is <100kB...
>
> Interesting observation. Anyway, if this was a bug, we wont change it to
> keep the hash stable.

The C++ version seems to me to do the correct thing, in that it doesn't
feed data to the hash generator if the file doesn't provide the data...

What I meant to show was that the two routines are /not/ identical: they
can feed different data to the hash generator, and in that case they
/should/ end up with different hash values.

Remco

P.S. There might be a situation where the hash isn't stable: if the data
buffer isn't initialised, and not completely filled by the file reads, the end
of the buffer could differ between two calls on the same file, and thus the
hash value could differ (as the full buffer is sent to the generator).
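
The situation in the P.S. can be reproduced with a short Python sketch. The two function names are hypothetical, and the `fill` byte only stands in for whatever an uninitialised buffer would happen to contain. Python itself always initialises a `bytearray`, so this is a simulation of the effect, not of the C++ code.

```python
import hashlib
import io


def hash_full_buffer(stream, bufsize, fill):
    """Hash the WHOLE buffer, even if the read filled only part of it."""
    buf = bytearray([fill]) * bufsize  # stand-in for uninitialised memory
    stream.readinto(buf)               # may fill fewer than bufsize bytes
    return hashlib.md5(buf).hexdigest()


def hash_bytes_read(stream, bufsize, fill):
    """Hash only the bytes that the read actually delivered."""
    buf = bytearray([fill]) * bufsize
    n = stream.readinto(buf)           # number of bytes really read
    return hashlib.md5(buf[:n]).hexdigest()


data = b"short file"
# Same file contents, different leftover bytes in the buffer:
print(hash_full_buffer(io.BytesIO(data), 64, 0x00)
      == hash_full_buffer(io.BytesIO(data), 64, 0xFF))   # False
print(hash_bytes_read(io.BytesIO(data), 64, 0x00)
      == hash_bytes_read(io.BytesIO(data), 64, 0xFF))    # True
```

Limiting the hash input to the number of bytes returned by the read keeps the result independent of the buffer's previous contents.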