I don't know whether it will handle every non-ASCII character in your input file, but get-char-code-property
can deal with all the cases you show. Its doc string (C-h f get-char-code-property)
says:
(get-char-code-property CHAR PROPNAME)
Return the value of CHAR’s PROPNAME property.
The property you want is decomposition
which records (the data comes from the Unicode character database) that an accented character is made up of a base character and an accent. The
call (get-char-code-property c 'decomposition)
then returns, for the characters shown here, a list with one or two elements: the first is the base character and the second, if present, is the accent.
Here are some examples of calling it (note that ?n
is the character n
- or equivalently the integer 110, since Emacs represents characters by integers):
(get-char-code-property ?a 'decomposition) --> (97)
(get-char-code-property ?n 'decomposition) --> (110)
(get-char-code-property ?e 'decomposition) --> (101)
(get-char-code-property ?á 'decomposition) --> (97 769)
(get-char-code-property ?ñ 'decomposition) --> (110 771)
(get-char-code-property ?ê 'decomposition) --> (101 770)
As you can see the first element of the list is the unaccented character (or integer). If you are wondering what characters the integers 769, 770 or 771 represent, you can use the same function with the name
property:
(get-char-code-property 770 'name) --> "COMBINING CIRCUMFLEX ACCENT"
COMBINING
characters are combined with the previous character to produce the accented (or otherwise decorated) compound character.
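To see what each piece of a decomposition is without looking up the integers one by one, you can map the name lookup over the list. This is just a quick sketch: my/decomposition-names is a made-up helper name, and it assumes every element of the list is a character, which holds for the examples above but not necessarily for every character.

```elisp
;; Hypothetical helper (not part of Emacs): list the Unicode names of the
;; elements that `get-char-code-property' reports for CHAR's decomposition.
;; Assumes every element of the decomposition is a character.
(defun my/decomposition-names (char)
  "Return the names of the elements of CHAR's decomposition."
  (mapcar (lambda (c) (get-char-code-property c 'name))
          (get-char-code-property char 'decomposition)))

(my/decomposition-names ?ê)
;; --> ("LATIN SMALL LETTER E" "COMBINING CIRCUMFLEX ACCENT")
```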
So all you have to do is loop over the characters of your string, run each one through get-char-code-property
with the decomposition
property, and throw away everything but the first element, which is the base character. Here, for example, is a simple function that takes a string and translates it:
(defun xlate-unaccented (s)
  "Return S with each character replaced by the first element of its decomposition."
  (mapconcat
   (lambda (c)
     (char-to-string
      (car (get-char-code-property c 'decomposition))))
   s ""))
The last argument to mapconcat
is the separator argument (here an empty string). It became optional at some point after Emacs 28.1 but, as the OP points out in a comment, it is required in 28.1 and earlier (and maybe in some later versions too). You can also specify nil
instead of the explicit ""
, but the argument has to be present.
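A quick illustration of the separator argument, using char-to-string on its own as a stand-in function (my choice, only for the demo):

```elisp
;; With nil (or "") as the separator, the pieces are simply concatenated.
(mapconcat #'char-to-string "René" nil)  ; --> "René"

;; A non-empty separator is inserted between the pieces.
(mapconcat #'char-to-string "René" "-")  ; --> "R-e-n-é"
```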
Here are some tests:
(xlate-unaccented "El Niño") --> "El Nino"
(xlate-unaccented "René") --> "Rene"
(xlate-unaccented "tåg") --> "tag"
Note that this is going to fail for more complicated characters, e.g. characters with multiple accents. For example, consider the character ậ
whose name is LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
. If you evaluate (get-char-code-property ?ậ 'decomposition)
, you will get (7841 770)
whose base character is NOT unaccented: instead it's the character with name "LATIN SMALL LETTER A WITH DOT BELOW"
. You need to apply the decomposition again: (get-char-code-property 7841 'decomposition) --> (97 803)
to come up with the unaccented 97
(aka a
).
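If you want to watch the repeated decomposition happen, something like the following sketch collects each intermediate base character. The name my/decomposition-chain is made up for illustration, and the code assumes the first element of every decomposition it meets is a character.

```elisp
;; Hypothetical helper (not part of Emacs): follow the first element of
;; each decomposition until a decomposition has only one element.
(defun my/decomposition-chain (char)
  "Return a list of CHAR followed by each successive base character."
  (let ((chain (list char))
        (dec (get-char-code-property char 'decomposition)))
    (while (cdr dec)
      (push (car dec) chain)
      (setq dec (get-char-code-property (car dec) 'decomposition)))
    (nreverse chain)))

;; #x1EAD is LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
(my/decomposition-chain #x1EAD)  ; --> (7853 7841 97)
```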
The xlate-unaccented
function above strips only one accent per call:
(xlate-unaccented "La Niñậ") --> "La Ninạ"
Running it through again (and again) will get rid of the remaining accent(s). You can also loop inside the lambda
as @gigiair points out in a comment:
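(defun xlate-unaccented (s)
  "Return S with each character reduced to its fully decomposed base character."
  (mapconcat
   (lambda (c)
     (char-to-string
      (let ((dec (get-char-code-property c 'decomposition)))
        ;; Keep decomposing the base character until nothing is left to strip.
        (while (cdr dec)
          (setq dec (get-char-code-property (car dec) 'decomposition)))
        (car dec))))
   s ""))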
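The "run it again and again" idea can also be written down directly: repeat the one-accent pass until the string stops changing. This is only a sketch under the same assumptions as before (the first element of each decomposition is a character). my/strip-one-accent and my/strip-all-accents are hypothetical names; the first is just the single-pass function from earlier, renamed so it does not clash with the looping definition.

```elisp
;; Hypothetical names, for illustration only.
(defun my/strip-one-accent (s)
  "Replace each character of S by the first element of its decomposition."
  (mapconcat
   (lambda (c)
     (char-to-string
      (car (get-char-code-property c 'decomposition))))
   s ""))

(defun my/strip-all-accents (s)
  "Apply `my/strip-one-accent' to S until the result stops changing."
  (let ((prev nil))
    (while (not (equal s prev))
      ;; Remember the current string, then strip one more layer of accents.
      (setq prev s
            s (my/strip-one-accent s)))
    s))

(my/strip-all-accents "La Niñậ")  ; --> "La Nina"
```

The loop inside the lambda does less work, since it only revisits characters that still have something to strip, but the fixed-point version reuses the simple function unchanged.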
BTW, in case you are wondering: I didn't know about this function at all. I found it by noticing the decomposition
field in the output of C-u C-x =
(which ends up calling describe-char
). So I invoked C-h f describe-char
and clicked on the source link; scanning through the code of the function, I found get-char-code-property
called near the bottom of it and did C-h f get-char-code-property
, but I also had to look at C-h v describe-char-unidata-list
: describe-char
calls get-char-code-property
on each element of that list; its default value is (name old-name general-category decomposition)
, but you can customize it to add more properties for describe-char
to display. The customization buffer for it provides a convenient list of all the properties.