C: iterate over a Unicode (UTF-8/UTF-16) string, conditionally modify individual characters, and store the result as a new string

Question

C's char array can hold an ASCII string, so a user can query individual characters by indexing the array, modify them, or do whatever they want with them; it's pretty straightforward. I have browsed through many questions and answers on Stack Exchange and Stack Overflow on the topic of Unicode strings in char arrays but haven't found a good interface for iterating over and manipulating such strings. So what I am looking for is:

A way to iterate over a Unicode string in C and print each character
A way to test whether each character belongs to a certain set of Unicode characters
A way to modify all Unicode characters in the string that belong to that set and store the result as a new string

score 1 · Answer 1 · answered Oct 29 '21 at 10:19

What users perceive as characters are, in Unicode terms, "extended grapheme clusters", each made of one or more Unicode code points. Unicode code points are integers with values from 0 to 0x10FFFF, often stored in an array of bytes in UTF-8 encoding as 1 to 4 bytes each.
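To make the 1-to-4-byte layout concrete, here is a minimal sketch of a decoder that walks a NUL-terminated UTF-8 string one code point at a time and prints each one. The names utf8_decode and print_code_points are made up for this illustration, not taken from any library, and the loop works on code points only, not on grapheme clusters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode one code point from the NUL-terminated
   UTF-8 string s. Stores it in *cp and returns the number of bytes
   consumed (1 to 4), or 0 at end of string or on a malformed sequence. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t min;  /* smallest code point allowed for this length */

    if (p[0] == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; *cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; *cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; *cp = p[0] & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    for (size_t i = 1; i < len; i++) {
        /* Every following byte must look like 10xxxxxx; a premature
           NUL fails this check, so we never read past the terminator. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Print each code point as U+XXXX, one per line. Stops at the end of
   the string or at the first malformed sequence; returns the number
   of code points printed. */
static size_t print_code_points(const char *s)
{
    size_t count = 0, n;
    uint32_t cp;

    while ((n = utf8_decode(s, &cp)) != 0) {
        printf("U+%04lX\n", (unsigned long)cp);
        s += n;
        count++;
    }
    return count;
}
```

Calling print_code_points on a string from main would list its code points; real code would probably want to report malformed input rather than silently stopping.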

You can look up "UTF-8" encoding to see how to recognise Unicode code points in a char array, and change your code accordingly. For example, you could reverse a string by starting at the end, recognising the last code point (1 to 4 bytes), and then copying the whole code point, unreversed.
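The reversal described above could be sketched roughly as follows. The function name utf8_reverse is hypothetical, the input is assumed to be valid UTF-8, and the caller is assumed to supply a destination buffer of at least strlen(src) + 1 bytes. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the code points of the valid UTF-8 string
   src into dst in reverse order, keeping the bytes inside each code
   point in their original order. dst needs strlen(src) + 1 bytes. */
static void utf8_reverse(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t end = strlen(src);  /* one past the code point being copied */
    size_t out = 0;

    while (end > 0) {
        size_t start = end - 1;
        /* Step back over continuation bytes (10xxxxxx) to the lead byte. */
        while (start > 0 && ((unsigned char)src[start] & 0xC0) == 0x80)
            start--;
        memcpy(dst + out, src + start, end - start);
        out += end - start;
        end = start;
    }
    dst[out] = '\0';
}
```

Note that this reverses code points, not grapheme clusters: a base letter followed by a combining mark would come out with the mark in front of the letter.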

The Unicode standard describes all kinds of character classes if you want to analyse text. Be warned that the same character (and "same" means "must be treated exactly the same by any code") can be represented by different sequences of code points.
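For a simple case where the "set" is a handful of individual code points, a decode, test, modify, re-encode loop might look like the sketch below. It replaces the curly quotes U+2018, U+2019, U+201C and U+201D with their ASCII counterparts and writes the result to a new string. All function names are invented for this example, the input is assumed to be valid UTF-8, and the test is per code point, so it does not cover the multi-code-point equivalences mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one code point; ASSUMES s points at valid UTF-8.
   Returns the number of bytes consumed. */
static size_t decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    return 4;
}

/* Encode cp as UTF-8 into out; returns the number of bytes written. */
static size_t encode(uint32_t cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Membership test for the example set: the four curly quotes. */
static int is_curly_quote(uint32_t cp)
{
    return cp == 0x2018 || cp == 0x2019 || cp == 0x201C || cp == 0x201D;
}

/* Copy src to dst, replacing curly quotes with ASCII ' or ".
   dst needs strlen(src) + 1 bytes, because this particular mapping
   never makes the string longer. */
static void straighten_quotes(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src) {
        uint32_t cp;
        src += decode(src, &cp);
        if (is_curly_quote(cp))
            cp = (cp == 0x2018 || cp == 0x2019) ? '\'' : '"';
        dst += encode(cp, dst);
    }
    *dst = '\0';
}
```

A mapping that can lengthen the string (a 1-byte code point replaced by a 3-byte one, say) would need a larger destination buffer, up to 4 bytes per input code point plus the terminator.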

Best is to find and use a suitable library. Unicode without a library is very, very complex. For example, the single character "flag of England" is made of the code points "U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F".
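As a rough illustration of why byte counts and code point counts say little about what the user sees, the sketch below stores that seven-code-point flag sequence as UTF-8 bytes and counts its code points by skipping continuation bytes. The names utf8_count_code_points and flag_sequence are made up for this example; grouping the code points into a single grapheme cluster is the part that really calls for a library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a valid UTF-8 string
   by counting every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F as UTF-8.
   strlen() reports 28 bytes and the counter above reports 7 code
   points, yet a renderer that supports the sequence shows one flag. */
static const char flag_sequence[] =
    "\xF0\x9F\x8F\xB4"   /* U+1F3F4 waving black flag */
    "\xF3\xA0\x81\xA7"   /* U+E0067 tag g */
    "\xF3\xA0\x81\xA2"   /* U+E0062 tag b */
    "\xF3\xA0\x81\xA5"   /* U+E0065 tag e */
    "\xF3\xA0\x81\xAE"   /* U+E006E tag n */
    "\xF3\xA0\x81\xA7"   /* U+E0067 tag g */
    "\xF3\xA0\x81\xBF";  /* U+E007F cancel tag */
```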

Another example: my first name can be written with either four or five characters, which will break any naïve equality check. (How can two strings that aren't even the same length be equal? Well, they can, if they encode the same grapheme clusters.) — Jörg W Mittag, Oct 29 '21 at 23:35
