UTF-8 captions

AndriusWild
Hello,

The digiKam handbook says that captions are UTF-8 compatible (see the text extract below); however, I found some pictures with "Ìåäâåäü ïîñë" symbols that do not make sense to me.
I went through all the settings and did not notice any related setting.
Please advise if I am missing something.

Extract from the handbook:
>>2.2.6.2 Comment View

>>The caption view can be used to type or paste in a caption of unlimited size (see note below).

>>The text is UTF-8 compatible, meaning that all special characters are allowed.
...

>>With digiKam you can enter unlimited amounts of text using internationalized alphabet (UTF-8)
as caption.

Best regards,
Andrey Goreev

Re: UTF-8 captions

Remco Viëtor
On Tuesday 17 January 2017 06:26:46 CET Andrey Goreev wrote:
> Hello,
>
> Digikam handbook says that captions are UTF-8 compatible (see the text
> extract below) however I found some pictures with "Ìåäâåäü ïîñë"
> symbols that do not make sense to me.
> I went through all the settings and did not notice any related setting.
> Please advise if I am missing something.
>
That fragment looks like UTF-8 text in the captions being displayed by a
program that does not understand UTF-8 (and uses an 8-bit character set
instead). A series like that could be a text fragment in a non-Latin script,
where each 'Ã?' pair stands for one letter. Note that, while digiKam can
handle UTF-8, only XMP tags can store it; IPTC and EXIF are limited to 8 bits/
char (and thus would give the kind of strings you quoted).
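
A quick way to test a guess like this is to reverse it: re-encode the visible
characters with the 8-bit codepage the viewer presumably used, then decode the
resulting bytes with each candidate for the real encoding. The Python sketch
below does that for the quoted fragment. The helper name try_repair, the choice
of cp1252 as the viewer's codepage and the candidate list are assumptions made
for illustration; none of this is something digiKam does itself. For this
particular fragment the bytes turn out not to be valid UTF-8, while reading
them as cp1251 (an 8-bit Cyrillic codepage) gives readable Russian, which still
fits the picture of text shown with the wrong 8-bit character set.

```python
# The fragment quoted in the question, exactly as it was displayed.
FRAGMENT = "Ìåäâåäü ïîñë"


def try_repair(garbled, shown_as, actual):
    """Undo one wrong decoding step.

    shown_as: codec the viewer is assumed to have used for display.
    actual:   codec the bytes are assumed to have been written in.
    Returns None when the bytes do not fit either codec.
    """
    try:
        return garbled.encode(shown_as).decode(actual)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Candidate "real" encodings; the list is illustrative, not exhaustive.
for actual in ("utf-8", "cp1251", "koi8_r"):
    print(actual, "->", try_repair(FRAGMENT, "cp1252", actual))
```

A result of None only rules out that one pairing; a candidate that decodes
without error still has to be judged by eye.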

I can't be more precise, as you don't specify where you saw that text (within
Digikam's caption editor, its metadata viewer, or using an external tag
viewer/editor, or ...).

Remco

Re: UTF-8 captions

AndriusWild
Hello Remco,

Digikam shows the caption under the thumbnail as well as in the right panel:
Properties -> digiKam properties/Caption
Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
Captions -> Description/Captions

Here is an extract from the output of ExifTool -a -G1 -s command:

 [File]          Comment                         : ├â┬Ø├â┬¼├â┬«
 [IFD0]          ImageDescription                : ├Ø├¼├«
 [ExifIFD]       UserComment                     : ├â┬Ø├â┬¼├â┬«
 [XMP-tiff]      ImageDescription                : ├â┬Ø├â┬¼├â┬«
 [XMP-exif]      UserComment                     : ├â┬Ø├â┬¼├â┬«
 [XMP-acdsee]    Notes                           : ├â┬Ø├â┬¼├â┬«
 [XMP-dc]        Description                     : ├â┬Ø├â┬¼├â┬«
 [IPTC]          Caption-Abstract                : ├â┬Ø├â┬¼├â┬«



Best regards,
Andrey Goreev


Re: UTF-8 captions

Remco Viëtor
On Tuesday 17 January 2017 08:48:30 CET Andrey Goreev wrote:
> Hello Remco,
>
> Digikam shows the caption under the thumbnail as well as in the right panel:
> Properties -> digiKam properties/Caption
> Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows
> UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
> Captions -> Description/Captions
>
I know where to find the captions within digiKam. What wasn't clear to me is
where _you_ saw that mutilated UTF-8.

> Here is an extract from the output of ExifTool -a -G1 -s command:
>
>  [File]          Comment                         : ├â┬Ø├â┬¼├â┬«
>
>  [IFD0]          ImageDescription                : ├Ø├¼├«
>
>  [ExifIFD]       UserComment                     : ├â┬Ø├â┬¼├â┬«
>
>  [XMP-tiff]      ImageDescription                : ├â┬Ø├â┬¼├â┬«
>
>  [XMP-exif]      UserComment                     : ├â┬Ø├â┬¼├â┬«
>
>  [XMP-acdsee]    Notes                           : ├â┬Ø├â┬¼├â┬«
>
>  [XMP-dc]        Description                     : ├â┬Ø├â┬¼├â┬«
>
>  [IPTC]          Caption-Abstract                : ├â┬Ø├â┬¼├â┬«
>

Even stranger: this doesn't even look like the original string you posted,
almost as if your terminal uses something like the IBM850 codepage.

So what seems to have happened: somewhere in your chain, a UTF-8 string was
interpreted using an 8-bit character encoding. And it looks like your terminal
does the same thing...
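
To check the IBM850 guess, the printed strings can be turned back into bytes
with cp850 and read as UTF-8. The Python sketch below does that for the two
distinct values in the ExifTool output. The helper name peel is made up for
illustration, and using latin-1 for the second layer is an assumption rather
than something the output proves. Under those assumptions the [IFD0] value
needs one such step, and the other tags need two before they give the same
three characters, which would mean they were mis-decoded twice.

```python
def peel(text, shown_as, actual="utf-8"):
    """Reverse one mis-decoding step.

    Assumes the bytes were really `actual` but were displayed as `shown_as`.
    Raises UnicodeError if the text does not fit that assumption.
    """
    return text.encode(shown_as).decode(actual)


ifd0 = "├Ø├¼├«"            # [IFD0] ImageDescription, as printed
others = "├â┬Ø├â┬¼├â┬«"   # every other tag, as printed

# One layer: undo the terminal's (assumed) cp850 display.
step1 = peel(ifd0, "cp850")

# Two layers: undo cp850, then an (assumed) latin-1 misreading underneath.
once = peel(others, "cp850")
twice = peel(once, "latin-1")

print(repr(step1), repr(once), repr(twice))
```

The result is still not readable text, so at least one more wrong conversion
happened before these values were written to the file.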

To give you an idea what I'm talking about (hoping the strings pass...)
UTF-8 string:               æâ¢az#&ˇÉÉŠ
same coded as cp-8859-15:   扢az#&ˇÉÉÅ
same coded as cp-1254:      Ã¦Ã¢Ã‚¢az#&ˇÉÉÅ
same coded as IBM850:       ├ª├ó├é┬óaz#&╦ç├ë├ë┼á
(the last three are different codepages, i.e. different ways to assign
character glyphs to 8-bit values, which was the standard before UTF-8 became
more or less generally used). Note that the 4 ASCII chars in the middle (az#&)
survive intact: those are coded on 7 bits, and UTF-8 uses the same encoding as
ASCII for the first 128 characters (codes 0 to 127). After that, the codes
differ (UTF-8 can use up to 4 bytes per character, iirc).

Note that all of these examples use the exact same bytes, just interpreted
differently... (this would be even more striking with the UTF-8 text in the
Cyrillic or Greek alphabet, but I don't have such a keyboard handy)
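
The same experiment can be repeated without depending on what the mail
archive does to the characters. The Python sketch below encodes a short sample
as UTF-8 once and then decodes those identical bytes with three 8-bit codecs.
The sample string and the helper names misread and restore are made up for
illustration; the codec names are the ones Python's standard library uses for
the codepages mentioned above.

```python
def misread(text, codec):
    """Encode text as UTF-8, then decode the same bytes with an 8-bit codec."""
    return text.encode("utf-8").decode(codec, errors="replace")


def restore(garbled, codec):
    """Reverse misread(); works as long as no byte was replaced on the way."""
    return garbled.encode(codec).decode("utf-8")


SAMPLE = "æ¢az#&É"  # a few non-ASCII letters around four ASCII characters

print("utf-8".ljust(12), repr(SAMPLE))
for codec in ("iso8859_15", "cp1254", "cp850"):
    # repr() keeps any control characters visible instead of printing them raw
    print(codec.ljust(12), repr(misread(SAMPLE, codec)))
```

In every row the az#& part stays as it is, while each non-ASCII character
turns into two characters that depend on the codepage. Because the bytes never
change, encoding a garbled row with the same codec and decoding it as UTF-8
gives the sample back.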


Re: UTF-8 captions

AndriusWild

On Tue, Jan 17, 2017 at 10:22 AM, Remco Viëtor <[hidden email]> wrote:
æâ¢az#&ˇÉÉŠ

Remco,

Normally I would paste æâ¢az#&ˇÉÉŠ to a text file, view it (F3) in Total Commander or a similar program, change codepage and get the normal text.
It did not work this time. Well, since there are not many images in my library with this issue and I don't really care about the captions, I guess I will just delete the symbols and move on.
Thank you for your help anyways!

Best regards,
Andrey Goreev