
Why is the length of some characters, e.g. the following 'ᨒ', 3 when it should be 2?

ᨒ U+1A12

0x1A12 is 6674 in decimal

2^16 is 65536, so 6674 should take only 2 bytes and not three.
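
A minimal sketch reproducing the observation, assuming the length was measured on the encoded bytes in Python 3 (the question doesn't say which language was used):

    # The character is a single code point, but its UTF-8 encoding is 3 bytes.
    s = '\u1a12'                     # 'ᨒ'
    print(len(s))                    # 1 code point
    print(len(s.encode('utf-8')))    # 3 bytes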

2 Answers


There are three common encodings of Unicode: UTF-8, UTF-16 and UTF-32.

In UTF-8 it takes 3 bytes: 0xE1 0xA8 0x92, because UTF-8 is a variable-length encoding.

In UTF-16 it takes 2 bytes: 0x1A12. UTF-16 encodes every character in the Basic Multilingual Plane (which includes U+1A12) as a single 2-byte unit; characters outside the BMP need a 4-byte surrogate pair.
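
A small Python 3 sketch (an illustration, not part of the original answer) showing the byte counts for U+1A12 in each encoding:

    ch = '\u1a12'
    print(ch.encode('utf-8').hex())      # 'e1a892'   -> 3 bytes
    print(ch.encode('utf-16-be').hex())  # '1a12'     -> 2 bytes
    print(ch.encode('utf-32-be').hex())  # '00001a12' -> 4 bytes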

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16

UTF-8 byte-sequence layout from the Wikipedia page (a worked sketch for U+1A12 follows the table):

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
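
As a rough Python sketch (not part of the original answer), the 3-byte row of the table can be applied to U+1A12 by hand:

    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits = 16 bits.
    cp = 0x1A12
    b1 = 0b11100000 | (cp >> 12)           # top 4 bits of the code point
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits
    print(hex(b1), hex(b2), hex(b3))       # 0xe1 0xa8 0x92
    print(bytes([b1, b2, b3]) == '\u1a12'.encode('utf-8'))  # True
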
Pubby

The code point 6674 (0x1A12) requires at least 13 binary bits to encode. A 2-byte UTF-8 sequence spends 5 prefix bits (110 on the lead byte, 10 on the continuation byte) to signal that the pair isn't just two regular old 7-bit ASCII characters, leaving only 11 payload bits. 13 + 5 = 18, which is more than fits in 16 bits, i.e. 2 bytes. So it takes 3 bytes to encode (the extra byte adds 3 more prefix bits and raises the payload capacity to 16 bits).
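
A small Python sketch of that arithmetic (an illustration, not part of the original answer): 6674 needs 13 payload bits, a 2-byte sequence only offers 11, so the 3-byte form with 16 payload bits is required.

    cp = 0x1A12
    print(cp.bit_length())                 # 13 bits needed for the code point
    payload = {1: 7, 2: 11, 3: 16, 4: 21}  # usable (non-prefix) bits per UTF-8 length
    n = min(k for k, bits in payload.items() if cp.bit_length() <= bits)
    print(n)                               # 3 bytes
    print(len(chr(cp).encode('utf-8')))    # 3, matching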

hotpaw2