
Why is the length of some characters, e.g. the following 'ᨒ', 3 when it should be 2?

ᨒ U+1A12

0x1A12 is 6674 in decimal

2^16 is 65536, so 6674 should take only 2 bytes and not three.
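
A minimal sketch reproducing the observation, assuming the length was measured on the encoded bytes in Python 3 (the question doesn't say which language was used):

    # The character is a single code point, but its UTF-8 encoding is 3 bytes.
    s = '\u1a12'                     # 'ᨒ'
    print(len(s))                    # 1 code point
    print(len(s.encode('utf-8')))    # 3 bytes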

2 Answers


There are three common encodings of Unicode: UTF-8, UTF-16 and UTF-32.

In UTF-8 it takes 3 bytes: 0xE1 0xA8 0x92, because UTF-8 is a variable-length encoding.

In UTF-16 it takes 2 bytes: 0x1A12. UTF-16 encodes every character in the Basic Multilingual Plane (which includes U+1A12) as a single 2-byte unit; characters outside the BMP need a 4-byte surrogate pair.
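
A small Python 3 sketch (an illustration, not part of the original answer) showing the byte counts for U+1A12 in each encoding:

    ch = '\u1a12'
    print(ch.encode('utf-8').hex())      # 'e1a892'   -> 3 bytes
    print(ch.encode('utf-16-be').hex())  # '1a12'     -> 2 bytes
    print(ch.encode('utf-32-be').hex())  # '00001a12' -> 4 bytes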

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16

UTF-8 byte-sequence layout from the Wikipedia page (a worked sketch for U+1A12 follows the table):

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
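
As a rough Python sketch (not part of the original answer), the 3-byte row of the table can be applied to U+1A12 by hand:

    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits = 16 bits.
    cp = 0x1A12
    b1 = 0b11100000 | (cp >> 12)           # top 4 bits of the code point
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits
    print(hex(b1), hex(b2), hex(b3))       # 0xe1 0xa8 0x92
    print(bytes([b1, b2, b3]) == '\u1a12'.encode('utf-8'))  # True
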
Pubby

The code point 6674 (0x1A12) requires at least 13 binary bits to encode. A 2-byte UTF-8 sequence spends 5 prefix bits (110 on the lead byte, 10 on the continuation byte) to signal that the pair isn't just two regular old 7-bit ASCII characters, leaving only 11 payload bits. 13 + 5 = 18, which is more than fits in 16 bits, i.e. 2 bytes. So it takes 3 bytes to encode (the extra byte adds 3 more prefix bits and raises the payload capacity to 16 bits).
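
A small Python sketch of that arithmetic (an illustration, not part of the original answer): 6674 needs 13 payload bits, a 2-byte sequence only offers 11, so the 3-byte form with 16 payload bits is required.

    cp = 0x1A12
    print(cp.bit_length())                 # 13 bits needed for the code point
    payload = {1: 7, 2: 11, 3: 16, 4: 21}  # usable (non-prefix) bits per UTF-8 length
    n = min(k for k, bits in payload.items() if cp.bit_length() <= bits)
    print(n)                               # 3 bytes
    print(len(chr(cp).encode('utf-8')))    # 3, matching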

hotpaw2