Convert accented characters to non-accented counterparts (character folding)

Question

Is there a way in Emacs Lisp to convert a string with accented characters into its non-accented counterpart without mapping every character by hand, or is a manual map the only way to achieve this?

Examples:

El Niño -> El Nino
René -> Rene
tåg -> tag

Searching around, I found that this seems to be called character folding. However, the only things my searches turned up were the folding used by incremental search and the similar char-fold-to-regexp. Both do the reverse, expanding a plain character so that it matches more characters, which is fine for their respective use cases but not for mine.

EDIT:

I regularly receive names that can contain accents, but the database I need to add them to accepts only letters from the lower ASCII table in its input mask. The names themselves are extracted from a directory. The function would save me from typing them in manually: Emacs would fold the characters so that, for example, "René" becomes "Rene". I can then simply copy and paste the result, avoiding accidental typos.

To be clear, is it that you want a way to specify unaccenting only for certain chars? If so, please make that clear in the question. Thx. — Drew, Jan 24 '23 at 14:46
I regularly receive names that can contain accents, but the database I need to add them to accepts only letters from the lower ASCII table in its input mask. The names themselves are extracted from a directory. The function would save me from typing them in manually: Emacs would fold the characters so that, for example, "René" becomes "Rene". I can then simply copy and paste the result, avoiding accidental typos. — Phoenix, Jan 24 '23 at 18:06
Please put all such info into the question, if it's relevant. Comments can be deleted at any time, and they're not searchable. Thx. — Drew, Jan 24 '23 at 18:12

NickD · Accepted Answer · 2023-01-26T21:05:59.080

I don't know if it will handle all the non-ASCII characters in your input file, but get-char-code-property is able to deal with all the cases you show. Its doc string (C-h f get-char-code-property) says:

(get-char-code-property CHAR PROPNAME)

Return the value of CHAR’s PROPNAME property.

The property you want is decomposition. It comes from the Unicode character data that Emacs ships, which records that an accented character is made up of a base character and an accent. The call (get-char-code-property c 'decomposition) then returns a list: for the simple cases shown here it has one or two elements, the first being the base character and the second the accent.

Here are some examples of calling it (note that ?n is the character n - or equivalently the integer 110, since Emacs represents characters by integers):

(get-char-code-property ?a 'decomposition) --> (97)
(get-char-code-property ?n 'decomposition) --> (110)
(get-char-code-property ?e 'decomposition) --> (101)
(get-char-code-property ?á 'decomposition) --> (97 769)
(get-char-code-property ?ñ 'decomposition) --> (110 771)
(get-char-code-property ?ê 'decomposition) --> (101 770)

As you can see the first element of the list is the unaccented character (or integer). If you are wondering what characters the integers 769, 770 or 771 represent, you can use the same function with the name property:

(get-char-code-property 770 'name) -->  "COMBINING CIRCUMFLEX ACCENT"

COMBINING characters are combined with the previous character to produce the accented (or otherwise decorated) compound character.
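
To see the combining behaviour directly, here is a small sketch. It assumes the ucs-normalize library that ships with Emacs, which this answer does not otherwise rely on; NFD splits a precomposed character into base plus combining mark, and NFC joins them again.

```elisp
(require 'ucs-normalize)

;; Decompose: one precomposed character becomes base + combining mark.
(append (ucs-normalize-NFD-string "ñ") nil)          ; --> (110 771)

;; Compose: the two-character sequence becomes a single character again.
(length (ucs-normalize-NFC-string (string ?n 771)))  ; --> 1
```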

So all you have to do is loop over the characters of your string, run each one through get-char-code-property with the decomposition property, and throw away everything but the first element, which is the base character. Here is, for example, a simple function that takes a string and translates it:

(defun xlate-unaccented (s)
  (mapconcat
   (lambda (c)
     (char-to-string
      (car (get-char-code-property c 'decomposition))))
   s ""))

The last argument to mapconcat is the separator (here an empty string). It became optional at some point after 28.1 but, as the OP points out in a comment, it is required in 28.1 and earlier (and possibly in some later versions too). You can pass nil instead of the explicit "", but the argument has to be present.

Here are some tests:

(xlate-unaccented "El Niño") --> "El Nino"
(xlate-unaccented "René")  --> "Rene"
(xlate-unaccented "tåg") --> "tag"

Note that this fails for more complicated characters, e.g. characters with multiple accents. Consider the character ậ, whose name is LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW. Evaluating (get-char-code-property ?ậ 'decomposition) gives (7841 770), whose base character is NOT unaccented: it is the character named LATIN SMALL LETTER A WITH DOT BELOW. You need to apply the decomposition again, (get-char-code-property 7841 'decomposition) --> (97 803), to arrive at the unaccented 97 (aka a).

The xlate-unaccented function above strips only one accent per pass:

(xlate-unaccented "La Niñậ") --> "La Ninạ"

Running it through again (and again) will get rid of the remaining accent(s). You can also loop inside the lambda as @gigiair points out in a comment:

(defun xlate-unaccented (s)
  (mapconcat
   (lambda (c)
     (char-to-string
      (let ((dec (get-char-code-property c 'decomposition)))
        (while (cdr dec)
          (setq dec (get-char-code-property (car dec) 'decomposition)))
        (car dec))))
   s ""))
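
One possible refinement, as a sketch under assumptions rather than a tested drop-in: as far as I can tell, some decompositions (compatibility mappings such as the one for the ﬁ ligature) start with a tag symbol like compat instead of a character, and char-to-string would signal an error on that. The variant below follows the decomposition only while its first element is a character different from the current one, and otherwise leaves the character alone. The names my-base-char and my-xlate-unaccented are hypothetical.

```elisp
(defun my-base-char (c)
  "Return the base character of C by following its decompositions.
Hypothetical helper, not part of Emacs."
  (let ((dec (get-char-code-property c 'decomposition)))
    ;; Stop when the first element is not a character (e.g. a tag
    ;; symbol) or when C decomposes to itself.
    (while (and (integerp (car dec)) (/= (car dec) c))
      (setq c (car dec)
            dec (get-char-code-property c 'decomposition)))
    c))

(defun my-xlate-unaccented (s)
  "Return S with each character replaced by its base character."
  (mapconcat (lambda (c) (char-to-string (my-base-char c))) s ""))

(my-xlate-unaccented "La Niñậ") ; --> "La Nina"
```

Characters whose decomposition starts with a tag symbol are returned unchanged instead of raising an error; whether that is the right policy depends on the data.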

BTW, in case you are wondering: I didn't know about this function at all. I found it by noticing the decomposition field in the output of C-u C-x = (which ends up calling describe-char). So I invoked C-h f describe-char and clicked on the source link; scanning through the code of the function, I found get-char-code-property called near the bottom of it and did C-h f get-char-code-property, but I also had to look at C-h v describe-char-unidata-list: describe-char calls get-char-code-property on each element of that list; its default value is (name old-name general-category decomposition), but you can customize it to add more properties for describe-char to display. The customization buffer for it provides a convenient list of all the properties.

Neat! Thanks a lot! Since I grew up in the good(?) ol' Amiga and DOS times, I feel right at home with the ASCII-to-character conversions. However, the (get-char-code-property) function I gotta test out next time I can. So far I have only encountered simple accents in Swedish, German and Czech names, but it should be fairly easy to check whether the resulting character is <128 (or even between 65-90 and 97-122). If not, simply re-run it. Thanks again! That will get me several steps further. — Phoenix, Jan 24 '23 at 21:18
Just a side note: (mapconcat) needs a separator to work. An empty string ("") would be the right choice here. — Phoenix, Jan 24 '23 at 21:27
The separator is an optional argument to mapconcat (at least on my version of Emacs). By default it is nil, which stands for the empty string. The doc string (C-h f mapconcat) says: Optional argument SEPARATOR must be a string, a vector, or a list of characters; nil stands for the empty string. — NickD, Jan 24 '23 at 21:34
Ok. Then your version is different than mine. Here it blurted out that it requires the separator. C-h f mapconcat states: SEPARATOR must be a string, a vector, or a list of characters. There is no word about it being optional. I'm using GNU Emacs 28.1. — Phoenix, Jan 24 '23 at 21:38
Yes, I'm on fairly recent upstream. I fixed up the function to include the separator and added a comment. Thanks for pointing it out! — NickD, Jan 24 '23 at 21:59
I made a small change in the lambda to handle characters with multiple diacritics: (lambda (c) (char-to-string (let ((dec (get-char-code-property c 'decomposition))) (while (cdr dec) (setq dec (get-char-code-property (car dec) 'decomposition))) (car dec)))) which allows characters such as ậ to be processed in one go. — gigiair, Jan 26 '23 at 17:59

score 0 · Answer 2 · answered Jan 24 '23 at 20:46

I don't have a fully-working answer but maybe that's a start.

I don't think you can reliably use char-fold-table, because for example the list of characters corresponding to e includes è but not é.

(string-match "è" (aref char-fold-table ?e)) ; returns 25
(string-match "é" (aref char-fold-table ?e)) ; returns nil

But if you want to use it (I don't have a better idea without defining your own table), since char-tables are basically vectors, as far as I'm aware the only way to search them is by looping.

This (painfully slow) loop correctly returns "e". Some accented characters like è appear several times, but from the ones I tried, it seems the unaccented character is found first.

(cl-loop
 for i from 0
 for chars across char-fold-table
 if (cl-search "è" chars)
 return (char-to-string i)) ; returns "e"
 ;; collect (char-to-string i)) ; returns ("e" "è")

As a full function, it is only partly successful:

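(require 'cl-lib)    ; for cl-loop and cl-search
(require 'char-fold) ; defines char-fold-table

(defun normalize-name (name)
  (mapconcat (lambda (c)
               (cl-loop
                for i from 0
                for chars across char-fold-table
                if (cl-search (char-to-string c) chars)
                return (char-to-string i)))
             name
             "")) ; explicit separator, required in Emacs 28.1 and earlier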
(normalize-name "Éàîù") ; returns "Éaîu"
(normalize-name "El Niño") ; returns "El Nino"
(normalize-name "René") ; returns "René"
(normalize-name "tåg") ; returns "tag"

score 0 · Answer 3 · answered Jul 30 '23 at 12:58

I know this question is somewhat old and already has some good answers, but I was looking into this problem too and found an interesting thread on the help-gnu-emacs list that is also useful. The thread is https://lists.gnu.org/archive/html/help-gnu-emacs/2018-05/msg00222.html.

Eli Zaretskii suggested using the ucs-normalize.el library for this purpose (https://lists.gnu.org/archive/html/help-gnu-emacs/2018-05/msg00230.html), and Teemu Likonen turned the suggestion into code (https://lists.gnu.org/archive/html/help-gnu-emacs/2018-05/msg00232.html). It goes something like this:

(defun my-ascii-normalize-filter (string)
  (require 'cl-lib)
  (require 'ucs-normalize)
  (cl-remove-if (lambda (char)
                  (> char 127))
                (ucs-normalize-NFKD-string string)))

This gives:

(my-ascii-normalize-filter "El Niño"); -> "El Nino"
(my-ascii-normalize-filter "René"); -> "Rene"
(my-ascii-normalize-filter "tåg"); -> "tag"

It even handles @NickD's corner case correctly:

(my-ascii-normalize-filter "ậ"); -> "a"
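
Since the question is about copying and pasting the folded names, the same idea can be wrapped in a command that rewrites the selected region in place. This is only a sketch: the command name my-ascii-normalize-region is made up for illustration, and the filter repeats the logic shown above so that the block stands on its own.

```elisp
(require 'cl-lib)
(require 'ucs-normalize)

(defun my-ascii-normalize-region (beg end)
  "Replace the text between BEG and END with its ASCII-folded form.
Hypothetical command; it applies NFKD and drops every char above 127."
  (interactive "r")
  (let ((folded (cl-remove-if
                 (lambda (char) (> char 127))
                 (ucs-normalize-NFKD-string
                  (buffer-substring-no-properties beg end)))))
    (delete-region beg end)
    (goto-char beg)
    (insert folded)))
```

Select a name and run M-x my-ascii-normalize-region. Characters that have no decomposition at all, such as ø or ß, are dropped by the > 127 filter rather than replaced, so it is worth checking the result for missing letters.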
