Questions tagged [unicode]

Unicode is intended to be a universal character set for describing all the characters required for written text incorporating all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
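As a minimal sketch of the mapping listed above, the snippet below converts characters to their `U+XXXX` labels and back using Python's built-in `ord` and `chr`. The helper name `code_point_label` is made up for this illustration and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"

# A, B, C, then GREEK CAPITAL LETTER LAMDA and GREEK CAPITAL LETTER MU.
labels = [code_point_label(ch) for ch in "ABC\u039B\u039C"]
print(labels)

# chr() goes the other way: from code point back to the character.
lamda = chr(0x039B)
print(lamda)
```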

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
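The table rows can be reproduced with Python's standard codecs. This is only a sketch: `hex_bytes` is a hypothetical helper, and U+1F600 is an extra code point (not in the table above) added here to show the four-byte case in both encodings.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as spaced, upper-case hex."""
    return ch.encode(encoding).hex(" ").upper()

# One row per character: (code point, UTF-8 bytes, UTF-16 big-endian bytes).
rows = [
    (f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"))
    for ch in "A\u039B\U0001F600"
]

for code_point, utf8, utf16 in rows:
    print(f"{code_point:<12}{utf8:<16}{utf16}")
```

Code points above U+FFFF need four bytes in UTF-8 and a surrogate pair (also four bytes) in UTF-16, which is why UTF-16 is a variable-width encoding too.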

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
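To illustrate two of these operations, here is a small sketch using Python's standard `unicodedata` module and `str.casefold`. The sample strings are chosen for this example only. Locale-aware sorting is not shown, since it usually needs a collation library outside the standard library.

```python
import unicodedata

composed = "\u00E9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but differ code point by code point.
same_raw = composed == decomposed

# Normalizing both to the same form (NFC here) makes them compare equal.
same_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Case folding is the Unicode-aware way to do caseless comparison;
# it maps the German sharp s to "ss".
folded = "Stra\u00DFe".casefold()

print(same_raw, same_nfc, folded)
```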

Identifying Characters

63 questions
432
votes
20 answers

Should UTF-16 be considered harmful?

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable…
Artyom
  • 2,079
22
votes
1 answer

Why are there so many spaces and line breaks in Unicode?

Unicode has maybe 50 spaces [\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000] and 6 line breaks not only…
maaartinus
  • 2,643
15
votes
8 answers

What's the point of adding Unicode identifier support to various language implementations?

I personally find reading code full of Unicode identifiers confusing. In my opinion, it also prevents the code from being easily maintained. Not to mention all the effort required for authors of various translators to implement such support. I also…
14
votes
3 answers

A Unicode sentinel value I can use?

I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like in the PNG file header). This allows tools that do not recognize the…
5
votes
3 answers

Why does Unicode have separate codepoints for characters with identical glyphs?

(Not entirely sure whether this should go in the information-security StackExchange instead; feel free to move it there if that's where it belongs.) Unicode has many, many instances of pairs or larger sets of characters with identical glyphs…
Vikki
  • 169
4
votes
4 answers

Technical reasons to prefer coding business logic to support Unicode (when not required)

I have a legacy application in which the UI and business logic are already reasonably well-separated. There is a proposal to separate them even further, turning the core application into a "service" (without UI) and writing a kind of "UI Server" as…
omatai
  • 195
4
votes
2 answers

Prerequisites for developing an application with Unicode support

What prerequisites are necessary when developing an application with Unicode support in the context of web applications, desktop applications, and embedded applications? Prerequisites to be taken care of relating to type casting and…
Ubermensch
  • 1,349
1
vote
1 answer

How can I learn about typography, fonts, glyphs, etc.?

I know so little about this that I'm having trouble formulating the question. Apparently due to technical limitations, nastaleeq style of writing Urdu is very difficult, perhaps impossible, given current standards used on the web. I'd like to do a…
Shahbaz
  • 181
1
vote
3 answers

Unicode Explanation Required

Can someone explain what this means? "Unicode defines a codespace of 1,114,112 code points in the range 0₁₆ to 10FFFF₁₆." http://en.wikipedia.org/wiki/Unicode
0
votes
0 answers

How can I resolve Unicode Hex Value Mismatches between WordML and XSL:FO?

We have an important legal document that our app generates in WordML, with foreign characters represented via Unicode. These foreign characters vary widely, and include languages with special characters like Korean and Cyrillic. We have all of…
Zibbobz
  • 1,552
0
votes
2 answers

Unicode clarification

Why is the length of some characters, e.g. 'ᨒ', 3 when it should be 2? ᨒ is U+1A12. 1A12 means 6674, and 2^16 is 65536, so 6674 should take only 2 bytes and not three.