UTF-8 captions

AndriusWild

UTF-8 captions

Hello,

The digiKam handbook says that captions are UTF-8 compatible (see the text extract below); however, I found some pictures with "ÃŒÃ¥Ã¤Ã¢Ã¥Ã¤Ã¼ Ã¯Ã®Ã±Ã«" symbols that do not make sense to me.

I went through all the settings and did not notice any related setting.

Please advise if I am missing something.

Extract from the handbook:
>>2.2.6.2 Comment View
>>The caption view can be used to type or paste in a caption of unlimited size (see note below).
>>The text is UTF-8 compatible, meaning that all special characters are allowed.
...
>>With digiKam you can enter unlimited amounts of text using internationalized alphabet (UTF-8) as caption.

Best regards,
Andrey Goreev

Remco Viëtor

Re: UTF-8 captions

On mardi 17 janvier 2017 06:26:46 CET Andrey Goreev wrote:
> Hello,
>
> Digikam handbook says that captions are UTF-8 compatible (see the text
> extract below) however I found some pictures with "ÃŒÃ¥Ã¤Ã¢Ã¥Ã¤Ã¼ Ã¯Ã®Ã±Ã«"
> symbols that do not make sense to me.
> I went through all the settings and did not notice any related setting.
> Please advise if I am missing something.
>
That fragment looks like you had some UTF-8 text in the captions that is then
displayed by a program that does not understand UTF-8 (but uses an 8-bit
character set). A series like that could be a text fragment in a non-Latin
script, where each "Ã?" pair stands for a letter. Note that, while Digikam can
handle UTF-8, only XMP tags can store it; IPTC and EXIF are limited to
8 bits/char (and thus would give the kind of strings you quoted).
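
A minimal Python sketch of that effect, using only the standard codecs. The sample word and the choice of cp1252 as the 8-bit character set are assumptions for illustration; the thread does not say which program or codepage was involved.

```python
# Assumed sample text; any word with a non-ASCII letter behaves the same way.
original = "Vi\u00ebtor"  # "Viëtor"

# UTF-8 stores the letter ë as two bytes (0xC3 0xAB),
# so the byte string is one byte longer than the text.
utf8_bytes = original.encode("utf-8")

# A program that assumes an 8-bit character set (cp1252 is an assumption here)
# shows each of those two bytes as a separate character.
misread = utf8_bytes.decode("cp1252")

print(len(original), len(utf8_bytes))  # 6 7
print(misread)                         # ViÃ«tor
```

Nothing is lost at this stage: encoding the garbled text back with the same 8-bit character set and decoding it as UTF-8 returns the original word.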

I can't be more precise, as you don't specify where you saw that text (within
Digikam's caption editor, its metadata viewer, or using an external tag
viewer/editor, or ...).

Remco

AndriusWild

Re: UTF-8 captions

Hello Remco,

Digikam shows the caption under the thumbnail as well as in the right panel:
Properties -> digiKam properties/Caption
Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
Captions -> Description/Captions

Here is an extract from the output of ExifTool -a -G1 -s command:

[File] Comment : ├â┬Ø├â┬¼├â┬«

[IFD0] ImageDescription : ├Ø├¼├«

[ExifIFD] UserComment : ├â┬Ø├â┬¼├â┬«

[XMP-tiff] ImageDescription : ├â┬Ø├â┬¼├â┬«

[XMP-exif] UserComment : ├â┬Ø├â┬¼├â┬«

[XMP-acdsee] Notes : ├â┬Ø├â┬¼├â┬«

[XMP-dc] Description : ├â┬Ø├â┬¼├â┬«

[IPTC] Caption-Abstract : ├â┬Ø├â┬¼├â┬«

Best regards,
Andrey Goreev

Remco Viëtor

Re: UTF-8 captions

On mardi 17 janvier 2017 08:48:30 CET Andrey Goreev wrote:
> Hello Remco,
>
> Digikam shows the caption under the thumbnail as well as in the right panel:
> Properties -> digiKam properties/Caption
> Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows
> UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
> Captions -> Description/Captions
>
I know where to find the captions within Digikam. What wasn't clear to me is
where _you_ saw that mutilated utf-8.

> Here is an extract from the output of ExifTool -a -G1 -s command:
>
> [File] Comment : ├â┬Ø├â┬¼├â┬«
>
> [IFD0] ImageDescription : ├Ø├¼├«
>
> [ExifIFD] UserComment : ├â┬Ø├â┬¼├â┬«
>
> [XMP-tiff] ImageDescription : ├â┬Ø├â┬¼├â┬«
>
> [XMP-exif] UserComment : ├â┬Ø├â┬¼├â┬«
>
> [XMP-acdsee] Notes : ├â┬Ø├â┬¼├â┬«
>
> [XMP-dc] Description : ├â┬Ø├â┬¼├â┬«
>
> [IPTC] Caption-Abstract : ├â┬Ø├â┬¼├â┬«
>

Even stranger: this doesn't even look like the original string you posted,
almost as if your terminal uses something like the IBM850 codepage.

So what seems to have happened: somewhere in your chain, a UTF-8 string was
interpreted using an 8-bit char encoding. And it looks like your terminal does
the same thing...
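
The case where the same mistake happens twice can be sketched in Python as follows. This is an illustration under assumptions: latin-1 stands in for whichever 8-bit encoding was actually involved, the helper names are hypothetical, and the sample letter is made up. Each wrong round doubles the number of characters shown per non-ASCII letter, and the damage is undone only by reversing every round.

```python
def misread_once(text, wrong="latin-1"):
    """Encode as UTF-8, then decode the bytes with an 8-bit encoding."""
    return text.encode("utf-8").decode(wrong)


def undo_once(text, wrong="latin-1"):
    """Reverse one round of misreading."""
    return text.encode(wrong).decode("utf-8")


sample = "\u00e9"             # the letter é, an assumed example
once = misread_once(sample)   # shown as 2 characters
twice = misread_once(once)    # shown as 4 characters

print(len(sample), len(once), len(twice))     # 1 2 4
print(undo_once(undo_once(twice)) == sample)  # True
```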

To give you an idea what I'm talking about (hoping the strings pass...)
UTF-8 string: æâÂ¢az#&ˇÉÉŠ
same coded as cp-8859-15: ÃŠÃ¢ÃÂ¢az#&ËÃÃÅ
same coded as cp-1254: Ã¦Ã¢Ã‚Â¢az#&Ë‡Ã‰Ã‰Å
same coded as IBM850: ├ª├ó├é┬óaz#&╦ç├ë├ë┼á
(the last three are different codepages, i.e. different ways to assign character
glyphs to 8-bit values, which was the standard before UTF-8 became more or less
generally used). Note that the 4 ASCII chars in the middle (az#&) survive
intact: those are coded on 7 bits, and UTF-8 uses the same encoding as ASCII
for the first 128 characters (codes 0 to 127). After that, the codes differ
(UTF-8 can use up to 4 bytes per character, iirc).

Note that all of these examples use the exact same bytes, just interpreted
differently... (this would be even more striking with the utf-8 text in
cyrillic or greek alphabet, but I don't have such a keyboard handy)
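
The "same bytes, different interpretation" point can be reproduced with a short Python sketch. The two sample strings are assumptions (the Cyrillic word is not from the thread), and `views` is a hypothetical helper name. The codec names are the ones Python's standard library uses for ISO-8859-15, Windows-1254 and IBM850.

```python
samples = {
    "latin": "\u00e6\u00e2\u00a2az#&",                   # "æâ¢az#&"
    "cyrillic": "\u041f\u0440\u0438\u0432\u0435\u0442",  # "Привет" (assumed sample)
}
codepages = ("iso8859_15", "cp1254", "cp850")


def views(text):
    """Decode the UTF-8 bytes of `text` with each 8-bit codepage."""
    data = text.encode("utf-8")
    # errors="replace" guards against byte values a codepage leaves undefined.
    return {cp: data.decode(cp, errors="replace") for cp in codepages}


for name, text in samples.items():
    # The hex dump is identical no matter how the bytes are displayed later.
    print(name, text.encode("utf-8").hex(" "))
    for cp, shown in views(text).items():
        print(f"  {cp:<10} {shown}")
```

In the Latin sample the run az#& comes through unchanged under every codepage, while in the Cyrillic sample each letter is two bytes in UTF-8 and so turns into two unrelated glyphs, leaving nothing readable.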

AndriusWild

Re: UTF-8 captions

On Tue, Jan 17, 2017 at 10:22 AM, Remco Viëtor <[hidden email]> wrote:

æâÂ¢az#&ˇÉÉŠ

Remco,

Normally I would paste æâÂ¢az#&ˇÉÉŠ to a text file, view it (F3) in Total Commander or a similar program, change codepage and get the normal text.

It did not work this time. Well, since there are not many images in my library with this issue and I don't really care about the captions, I guess I will just delete the symbols and move on.
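
For reference, the "change codepage" trick can be sketched in Python. This is a rough helper under assumptions: the function name and both codec lists are hypothetical, chosen only for illustration, and nothing in the thread confirms which encodings were involved in these files. It re-encodes the garbled text with the encoding it was presumably displayed in, decodes the resulting bytes with a different guess, and leaves it to a human to pick the readable result.

```python
from itertools import product

# Assumed guesses, not taken from the thread.
SEEN_AS = ("latin-1", "cp1252", "cp850")  # how the bytes may have been misread
REALLY = ("utf-8", "cp1251", "koi8-r")    # what the bytes may really have been


def repair_candidates(garbled):
    """Return (seen_as, really, text) for every guess pair that decodes cleanly."""
    results = []
    for seen_as, really in product(SEEN_AS, REALLY):
        try:
            fixed = garbled.encode(seen_as).decode(really)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled text
        results.append((seen_as, really, fixed))
    return results


# Hypothetical test case: a cp1251 caption that was displayed as latin-1.
garbled = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251").decode("latin-1")
for seen_as, really, fixed in repair_candidates(garbled):
    print(f"{seen_as} -> {really}: {fixed}")
```

A single pass like this only reverses one wrong step; text that was mangled more than once would need the function applied again to each candidate.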

Thank you for your help anyways!

Best regards,
Andrey Goreev