[Bug 205824] New: EXIF UserComments with special characters get tagged as ASCII

Bugzilla from kdebugs@CustomCDROM.de
https://bugs.kde.org/show_bug.cgi?id=205824

           Summary: EXIF UserComments with special characters get tagged
                    as ASCII
           Product: digikam
           Version: 1.0.0-beta3
          Platform: unspecified
        OS/Version: Linux
            Status: UNCONFIRMED
          Severity: normal
          Priority: NOR
         Component: general
        AssignedTo: [hidden email]
        ReportedBy: [hidden email]


Version:           1.0.0-beta3 (using 4.3.00 (KDE 4.3.0), Debian packages)
Compiler:          cc
OS:                Linux (i686) release 2.6.30-1-686

If storing to image metadata is activated, digiKam generates invalid EXIF
UserComment charset identifiers.

The UserComment field is tagged as "ASCII" although the actual content seems to
be encoded in my system's locale, ISO 8859-15.
I'd like to see the string encoded as UTF-8 and tagged as UNICODE if it cannot
be represented as an ASCII string.

The current behaviour creates invalid UserComment fields, and further metadata
processing in other applications gets messed up.

--
Configure bugmail: https://bugs.kde.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
_______________________________________________
Digikam-devel mailing list
[hidden email]
https://mail.kde.org/mailman/listinfo/digikam-devel

Gilles Caulier <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]
          Component|general                     |Metadata




Marcel Wiesweg <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]




--- Comment #1 from Marcel Wiesweg <marcel wiesweg gmx de>  2009-09-21 19:12:29 ---
Looking at the code, we are writing either Latin1 or Unicode UCS-2.
Can you give us a sample image with invalid user comment field written by
digikam?

--- Comment #2 from Gunter Ohrner <kdebugs CustomCDROM de>  2009-09-21 22:31:42 ---
Yes, digikam writes Latin 1 or maybe Latin 9 (ISO 8859-1 or -15) and tags the
comment as "ASCII".

AFAIK Latin charsets are not supported for this EXIF header field. This causes
problems with other applications which process the comment. As far as I
understand, the solution would be to encode the comment in UTF-8 if it contains
non-ASCII characters (i.e. code > 127) and tag the comment field appropriately.

However, maybe I'm wrong and UserComment fields do support Latin charsets,
somehow?
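
The test being proposed here (any character code above 127 means the text is not ASCII) could be sketched in C++ roughly as follows. The function name `isSevenBitAscii` is made up for illustration and is not taken from digiKam or Exiv2.

```cpp
#include <string>

// Hypothetical helper: true if every byte lies in the 7-bit ASCII range (0..127).
// The cast matters because plain char may be signed, in which case a byte such
// as 0xFC would otherwise compare as a negative number.
bool isSevenBitAscii(const std::string& text)
{
    for (char c : text) {
        if (static_cast<unsigned char>(c) > 127) {
            return false;
        }
    }
    return true;
}
```

With a check like this, "Testbild" passes, while a Latin-1 encoded "Testbüld" (byte 0xFC for "ü") does not and would have to be written with a different character code.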

--- Comment #3 from Marcel Wiesweg <marcel wiesweg gmx de>  2009-09-22 20:11:29 ---
The Exif standard defines the use of ASCII, JIS or Unicode. The standard does
not say which Unicode variant, but apparently it is UCS-2 in practice (UTF-16,
two bytes per character).
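
For reference, a UserComment value starts with an 8-byte character code that names the encoding of the text following it. Below is a minimal sketch of recognising the four codes, with byte values taken from the character code table of the Exif 2.2 specification. The enum and function names are hypothetical and do not come from any library.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical names, for illustration only.
enum class CharCode { Ascii, Jis, Unicode, Undefined, Unknown };

// Inspect the first 8 bytes of a raw UserComment value.
CharCode identifyCharCode(const std::vector<std::uint8_t>& value)
{
    // "ASCII\0\0\0", "JIS\0\0\0\0\0", "UNICODE\0" and eight zero bytes.
    static const std::uint8_t kAscii[8]     = {0x41, 0x53, 0x43, 0x49, 0x49, 0x00, 0x00, 0x00};
    static const std::uint8_t kJis[8]       = {0x4A, 0x49, 0x53, 0x00, 0x00, 0x00, 0x00, 0x00};
    static const std::uint8_t kUnicode[8]   = {0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00};
    static const std::uint8_t kUndefined[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};

    if (value.size() < 8) {
        return CharCode::Unknown;  // too short to carry a character code
    }
    if (std::memcmp(value.data(), kAscii, 8) == 0)     return CharCode::Ascii;
    if (std::memcmp(value.data(), kJis, 8) == 0)       return CharCode::Jis;
    if (std::memcmp(value.data(), kUnicode, 8) == 0)   return CharCode::Unicode;
    if (std::memcmp(value.data(), kUndefined, 8) == 0) return CharCode::Undefined;
    return CharCode::Unknown;
}
```

Note that the code only names the character set. It does not say which Unicode encoding form is used, which is the open question in this thread.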

--- Comment #4 from Gunter Ohrner <kdebugs CustomCDROM de>  2009-09-22 21:36:23 ---
Ok, good to know. But not Latin1/9?

--- Comment #5 from Marcel Wiesweg <marcel wiesweg gmx de>  2009-09-23 18:33:07 ---
From my understanding, ASCII is Latin1.

--- Comment #6 from Gunter Ohrner <kdebugs CustomCDROM de>  2009-09-23 23:50:22 ---
From what I understand, ASCII is only a 7-bit-encoding (So a byte provides 7
data bits plus space for one optional parity bit.), in contrast to LatinX which
are 8-bit-encodings. The first 128 characters of Latin1 match the ASCII
character set, AFAIK, but all codes >= 128 are not defined for ASCII.

To be sure, I looked up the standardisation papers which seem to back my
opinion:

The EXIF 2.2 standard (http://www.exif.org/Exif2-2.PDF) states on page 28 that
the reference documentation for character code ASCII is ITU-T T.50 IA5 (ITU-T
International Alphabet No. 5, now ITU-T IRA = International Reference
Alphabet).

The International Reference Alphabet is a 7-bit-encoding, the ITU-T
recommendation document can be found at:
http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.50-199209-I!!PDF-E&type=items

So in my eyes Latin1 strings containing characters with character codes larger
than 127 are not allowed in UserComment fields with an encoding type of "ASCII"
(or any EXIF header field which mandates ASCII encoding) and the string must be
recoded to Unicode and be written as a UserComment field with type "Unicode".
(It'd probably be good for interoperability to use "ASCII" if no invalid
characters appear within the string.)

In case of header fields which only allow ASCII encoding, transliteration for
these invalid characters would need to be used. (iconv can do that, converting
e.g. "ö" to "oe" and the like.)

Latin1 would be acceptable with an "undefined" encoding type (8 null bytes, see
EXIF spec page 29), but that would not help interoperability at all...

The EXIF spec only refers to the Unicode spec in case of a "Unicode" encoding
type, so just like you, I'm not sure which flavour of Unicode could be used. I'm
not familiar with the Unicode spec and have not looked up any details so far,
but the exact encoding of Unicode files is determined by its first few bytes
which must carry a Byte Order Mark (BOM) in case of UTF-16 and UTF-32, while
this BOM is allowed but optional for UTF-8 files
(http://en.wikipedia.org/wiki/Byte-order_mark). Maybe the encoding used for
shorter Unicode sequences like the UserComment string is also distinguished
this way?

In this case it would probably be preferable to use UTF-8 if the input is
LatinX, as this should result in the shortest byte sequences after recoding.
The "deluxe solution" in this case would be to dynamically use the Unicode
encoding which produces the shortest byte sequence.
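
To make the size argument concrete, here is a small sketch that recodes Latin-1 to UTF-8 and compares the result with the UTF-16 size. It assumes the input really is ISO 8859-1, where each byte value equals its Unicode code point. ISO 8859-15 differs at a few positions (for example 0xA4, the euro sign) and would need a lookup table instead. The function names are invented for this example.

```cpp
#include <cstddef>
#include <string>

// Recode ISO 8859-1 bytes to UTF-8 (assumption: input is Latin-1, not Latin-9).
// Bytes below 0x80 are copied; bytes 0x80..0xFF become a two-byte sequence.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (char c : latin1) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            utf8.push_back(static_cast<char>(b));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
        }
    }
    return utf8;
}

// Every Latin-1 character is in the Basic Multilingual Plane,
// so UTF-16 without a BOM always needs exactly two bytes per character.
std::size_t utf16SizeOfLatin1(const std::string& latin1)
{
    return 2 * latin1.size();
}
```

For a 16-character Latin-1 comment with one umlaut, this gives 17 bytes in UTF-8 against 32 bytes in UTF-16. UTF-8 can never be longer than UTF-16 for Latin-1 input, since each character takes at most two bytes.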

Marcel Wiesweg <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]




--- Comment #7 from Marcel Wiesweg <marcel wiesweg gmx de>  2009-09-24 18:31:01 ---
CC'ing Andreas:

Do you know which Unicode formats are allowed for the Exif UserComment?
Specifically, is UTF-8 allowed or is UTF-16 required? Would it be a good
decision to always use UNICODE there?

--- Comment #8 from Andreas Huggel <ahuggel gmx net>  2009-09-25 06:31:55 ---
I found only a hidden hint that seems to point to UTF-16 for a "UNICODE"
UserComment. It's in the comments for tag ImageDescription, on page 22 of the
Exif specs: "When a 2-byte code is necessary, the Exif Private tag UserComment
is to be used".

Exiv2 doesn't do any conversion (yet...), it leaves it to the application to do
the right thing.

For comparison, Exiftool writes the UserComment tag with an Exif character code
"ASCII" if the text consists of only 7-bit characters, else it uses the Exif
character code "UNICODE" and encodes the text in UTF-16.
It encodes the UTF-16 string using the same byte order as the rest of the
Exif/TIFF structure and without a BOM.
On read it expects a UTF-16 encoded text, has some intelligence to guess the
byte order, and interprets a BOM if there is one. It doesn't seem to have any
provision for UTF-8 encoded UserComment text, though.

Exiv2 should probably follow a similar logic eventually, although I'd think
that there are images with UTF-8 encoded UserComment tags out there in the
wild.

Andreas
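
The write-side behaviour described for Exiftool above (ASCII code for 7-bit text, otherwise UNICODE code plus UTF-16 in the byte order of the Exif data, no BOM) might look roughly like this in C++. This is a sketch under stated assumptions, not Exiftool's or Exiv2's code: the input is assumed to be UTF-8, the decoder is minimal and does not reject overlong forms, and all names (`ByteOrder`, `decodeUtf8`, `appendUtf16`, `buildUserComment`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

enum class ByteOrder { LittleEndian, BigEndian };

// Minimal UTF-8 decoder; throws on truncated or malformed sequences.
std::vector<char32_t> decodeUtf8(const std::string& s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        char32_t cp = 0;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + extra >= s.size()) throw std::invalid_argument("truncated UTF-8");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Append one code point as UTF-16 in the requested byte order.
void appendUtf16(std::vector<std::uint8_t>& out, char32_t cp, ByteOrder order)
{
    auto putUnit = [&](std::uint16_t u) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::BigEndian) { out.push_back(hi); out.push_back(lo); }
        else                               { out.push_back(lo); out.push_back(hi); }
    };
    if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
        cp -= 0x10000;
        putUnit(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        putUnit(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        putUnit(static_cast<std::uint16_t>(cp));
    }
}

// Build the raw UserComment value: 8-byte character code followed by the text.
std::vector<std::uint8_t> buildUserComment(const std::string& utf8, ByteOrder order)
{
    bool ascii = true;
    for (char c : utf8) {
        if (static_cast<unsigned char>(c) > 127) { ascii = false; break; }
    }
    std::vector<std::uint8_t> out;
    if (ascii) {
        static const char code[8] = {'A', 'S', 'C', 'I', 'I', 0, 0, 0};
        out.assign(code, code + 8);
        out.insert(out.end(), utf8.begin(), utf8.end());
    } else {
        static const char code[8] = {'U', 'N', 'I', 'C', 'O', 'D', 'E', 0};
        out.assign(code, code + 8);
        for (char32_t cp : decodeUtf8(utf8)) appendUtf16(out, cp, order);
    }
    return out;
}
```

The byte order parameter would have to match the byte order of the surrounding Exif/TIFF structure, since no BOM is written to tell a reader which one was used.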

Gilles Caulier <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.0.0-beta3                 |1.0.0




--- Comment #9 from Andreas Huggel <ahuggel gmx net>  2010-01-20 05:19:48 ---
With the fix for http://dev.exiv2.org/issues/show/662 Exiv2 now stores Exif
UNICODE user-comments in UCS-2 (using the byte-order of the Exif data and
without a BOM). The API expects and returns Exif UNICODE user-comments in
UTF-8. The behaviour of Exif ASCII, JIS and UNDEFINED user-comments remains
unchanged.
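
As an illustration of the read direction (stored UCS-2 in the Exif byte order, UTF-8 handed to the caller), a conversion could be sketched as below. This is not Exiv2's implementation. The names are hypothetical, the input is assumed to be the text part only (after the 8-byte character code, without a BOM), and surrogate code units are rejected because plain UCS-2 does not have them.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

enum class ByteOrder { LittleEndian, BigEndian };

// Convert UCS-2 bytes (BMP only, no BOM) to UTF-8.
std::string ucs2ToUtf8(const std::vector<std::uint8_t>& bytes, ByteOrder order)
{
    if (bytes.size() % 2 != 0) {
        throw std::invalid_argument("odd number of UCS-2 bytes");
    }
    std::string utf8;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const std::uint16_t unit = (order == ByteOrder::BigEndian)
            ? static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1])
            : static_cast<std::uint16_t>((bytes[i + 1] << 8) | bytes[i]);
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            throw std::invalid_argument("surrogates are not handled in this UCS-2 sketch");
        }
        if (unit < 0x80) {            // one byte
            utf8.push_back(static_cast<char>(unit));
        } else if (unit < 0x800) {    // two bytes
            utf8.push_back(static_cast<char>(0xC0 | (unit >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
        } else {                      // three bytes
            utf8.push_back(static_cast<char>(0xE0 | (unit >> 12)));
            utf8.push_back(static_cast<char>(0x80 | ((unit >> 6) & 0x3F)));
            utf8.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
        }
    }
    return utf8;
}
```

Reading the same bytes with the wrong byte order yields different characters rather than an error, which is why the byte order of the Exif data has to be passed along.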

--- Comment #10 from Gilles Caulier <caulier gilles gmail com>  2010-01-20 10:04:40 ---
Andreas,

Sounds like we need to adapt the libkexiv2 code, right?

Gilles

--- Comment #11 from Andreas Huggel <ahuggel gmx net>  2010-01-20 11:03:08 ---
Gilles,

Probably yes. Assuming libkexiv2 has an interface like

void setUserComment(const std::string& comment);

to set an Exif user-comment, then the comment passed in should always be UTF-8
encoded now. The function can then simply set an Exif UNICODE user-comment, or,
as suggested above somewhere, analyse the comment and use an Exif ASCII
user-comment if the text is (7-bit) ASCII only and a UNICODE user-comment if
not.

Andreas
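
The "analyse the comment" variant could be sketched as follows. The Exiv2 documentation for comment values describes an optional `charset=` prefix (Ascii, Jis, Unicode or Undefined) on the string form of a comment, and the sketch only builds such a string. The function name `toExiv2CommentString` is hypothetical, and whether libkexiv2 does it this way is an assumption.

```cpp
#include <string>

// Hypothetical helper: pick the Exif character code for a UTF-8 comment and
// return it in the "charset=<name> <text>" string form.
std::string toExiv2CommentString(const std::string& utf8Comment)
{
    bool asciiOnly = true;
    for (char c : utf8Comment) {
        if (static_cast<unsigned char>(c) > 127) {  // outside 7-bit ASCII
            asciiOnly = false;
            break;
        }
    }
    return std::string(asciiOnly ? "charset=Ascii " : "charset=Unicode ") + utf8Comment;
}
```

In an application linked against Exiv2, the returned string would presumably be assigned to the `Exif.Photo.UserComment` key of the Exif data. That step is left out here so the sketch stays free of library dependencies.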

--- Comment #12 from Marcel Wiesweg <marcel wiesweg gmx de>  2010-01-21 18:38:15 ---
That's about Exif.Photo.UserComment, isn't it?

What about backwards compatibility: Up to now we always passed UCS-2. Should we
keep this behavior for older Exiv2 versions? If yes, from exactly which version
number on should we pass UTF-8?

--- Comment #13 from Andreas Huggel <ahuggel gmx net>  2010-01-22 03:04:34 ---
Marcel,

Yes, for old versions you need to continue to pass UCS-2. The UTF-8 interface
will be in version 0.20 (in case there are any 0.19.x versions these will be
backward compatible and not contain this change).

Andreas
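
The version rule stated here (UCS-2 for anything before 0.20, UTF-8 from 0.20 on) can be written down as a small compile-time check. In a real build the three numbers would come from the version macros in Exiv2's version header. Here they are plain parameters so the sketch stands alone, and all names are made up for illustration.

```cpp
// Which encoding the application should pass for a Unicode UserComment.
enum class CommentEncoding { Ucs2, Utf8 };

// True if version (maj.min.pat) is at least (rMaj.rMin.rPat).
constexpr bool versionAtLeast(int maj, int min, int pat, int rMaj, int rMin, int rPat)
{
    return (maj != rMaj) ? (maj > rMaj)
         : (min != rMin) ? (min > rMin)
         : (pat >= rPat);
}

// Per the rule above: UTF-8 starting with Exiv2 0.20, UCS-2 before that
// (including any 0.19.x release).
constexpr CommentEncoding unicodeCommentEncodingFor(int maj, int min, int pat)
{
    return versionAtLeast(maj, min, pat, 0, 20, 0) ? CommentEncoding::Utf8
                                                   : CommentEncoding::Ucs2;
}

static_assert(unicodeCommentEncodingFor(0, 19, 3) == CommentEncoding::Ucs2,
              "0.19.x keeps the old UCS-2 interface");
static_assert(unicodeCommentEncodingFor(0, 20, 0) == CommentEncoding::Utf8,
              "0.20 switches to UTF-8");
```

Doing the check at compile time fits the situation, because the behaviour depends on the Exiv2 version the application is built against.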

--- Comment #14 from Marcel Wiesweg <marcel wiesweg gmx de>  2010-01-23 15:02:12 ---
SVN commit 1079054 by mwiesweg:

For libexiv2 0.20, use UTF-8 for Unicode Exif UserComments.
Needs to be tested when an exiv2 library containing the fix is released.

CCBUG: 205824


 M  +6 -0      kexiv2_p.cpp  
 M  +8 -0      kexiv2exif.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1079054

--- Comment #15 from Gilles Caulier <caulier gilles gmail com>  2011-12-16 16:32:57 ---
Gunter,

Is this report still valid with the digiKam 2.x series?

Gilles Caulier

--- Comment #16 from Gunter Ohrner <kdebugs CustomCDROM de>  2012-01-05 00:01:36 ---
Sorry for the delay, will check soon.

--- Comment #17 from Gunter Ohrner <kdebugs CustomCDROM de>  2012-01-14 20:43:13 ---
Yes, looks as if this bug is still present.

The following is a hex dump excerpt from a file with a comment I just saved
using digiKam 2.1 (Kubuntu 11.10 package).


000009f0  00 00 41 53 43 49 49 00  00 00 54 65 73 74 62 fc  |..ASCII...Testbü|
00000a00  6c 64 2d 43 6f 6d 6d 65  6e 74 00 f8 2a 00 e1 00  |ld-Comment.ø*.á.|

It's still tagged as "ASCII" but stored with characters outside of the ASCII
range.
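
The same finding can be reproduced mechanically. The sketch below copies the 24 bytes of the UserComment value from the dump above (file offsets 0x9f2 to 0xa09, leaving out the trailing bytes whose meaning is not clear from the excerpt), checks the character code and lists the positions of bytes above 127. Function and variable names are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// True if the value starts with the Exif character code "ASCII\0\0\0".
bool isTaggedAscii(const std::vector<std::uint8_t>& value)
{
    static const std::uint8_t code[8] = {'A', 'S', 'C', 'I', 'I', 0, 0, 0};
    return value.size() >= 8 && std::memcmp(value.data(), code, 8) == 0;
}

// Offsets (relative to the start of the text, i.e. after the 8-byte code)
// of all bytes that are outside the 7-bit ASCII range.
std::vector<std::size_t> nonAsciiOffsets(const std::vector<std::uint8_t>& value)
{
    std::vector<std::size_t> offsets;
    for (std::size_t i = 8; i < value.size(); ++i) {
        if (value[i] > 127) {
            offsets.push_back(i - 8);
        }
    }
    return offsets;
}

// Bytes copied from the hex dump: character code, then "Testb<FC>ld-Comment".
const std::vector<std::uint8_t> kDumpedComment = {
    0x41, 0x53, 0x43, 0x49, 0x49, 0x00, 0x00, 0x00,   // "ASCII\0\0\0"
    0x54, 0x65, 0x73, 0x74, 0x62, 0xfc, 0x6c, 0x64,   // "Testb", 0xFC, "ld"
    0x2d, 0x43, 0x6f, 0x6d, 0x6d, 0x65, 0x6e, 0x74    // "-Comment"
};
```

For these bytes the value is tagged ASCII and there is exactly one offending byte, 0xFC at text offset 5, which is "ü" in both ISO 8859-1 and ISO 8859-15.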
