Handling multi-language text within the same sentences

Question

This is specifically for a vocabulary-type database, where you have a "head word" or "key word" field in the source language, and then a second, usually larger, field which you might call "definition" or "explanation", as you find in a dictionary.

For example, with the French head word "oeil" (eye), you might have this as the definition/explanation:

<EN>eye</EN> <phonetics>...</phonetics>, irregular pl. <FR>yeux</FR> <phonetics>...</phonetics>, <EN>eyes</EN>. And some more miscellaneous free-form text perhaps with some other embedded <FR> or <EN> words...

How best to accomplish that sort of markup? What sort of XML schema should I use, if applicable? The point is not merely cosmetic: if you can stipulate which language applies to certain parts of the text (or indeed if a piece of text is actually phonetics), then you can perform inverted indexing, of the Lucene type, in a way which is completely impossible otherwise.

In the example above, a French-language Lucene index would record that entry not only as containing the French head word "oeil", but also as containing the French word "yeux".
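
To make the indexing goal concrete, here is a minimal sketch in plain Python (not Lucene) of one inverted index per language, built from the `<EN>`/`<FR>` tagging shown above. The `RECORDS` dictionary, the `build_indexes` function and the `<def>` wrapper element are hypothetical names introduced only for this illustration, and the tokenizer is a naive whitespace split.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data: head word -> definition using the question's tags.
RECORDS = {
    "oeil": (
        "<EN>eye</EN> <phonetics>...</phonetics>, irregular pl. "
        "<FR>yeux</FR> <phonetics>...</phonetics>, <EN>eyes</EN>."
    ),
}

def build_indexes(records, source_lang="FR"):
    """Return {language: {token: set of head words}}."""
    indexes = defaultdict(lambda: defaultdict(set))
    for head_word, definition in records.items():
        # The head word itself belongs to the source-language index.
        indexes[source_lang][head_word.lower()].add(head_word)
        # Wrap the fragment in a root element (assumed name) so it parses as XML.
        root = ET.fromstring(f"<def>{definition}</def>")
        for elem in root.iter():
            # Only language-tagged elements are indexed; phonetics and
            # untagged free-form text are skipped in this sketch.
            if elem.tag in ("EN", "FR") and elem.text:
                for token in elem.text.lower().split():
                    indexes[elem.tag][token].add(head_word)
    return indexes

indexes = build_indexes(RECORDS)
print(sorted(indexes["FR"]))  # French tokens that point back to a record
print(sorted(indexes["EN"]))  # English tokens that point back to a record
```

With this layout, a lookup of "yeux" in the French index leads back to the "oeil" record, which is the behaviour described above. A real setup would presumably map each language to its own Lucene field and analyzer instead of a Python dictionary.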

I looked around both here and generally but I couldn't find any sort of "best practice" recommendations for this sort of situation: usually when you google "multi-language" it's about replacing an entire string in one language with one in another.

Have you considered using `xml:lang`? See [**this answer**](https://stackoverflow.com/a/38590213/290085). — kjhughes, Sep 17 '17 at 14:40

The convention in the XML spec itself for marking up text as being in a particular language is to use the `xml:lang` attribute (cf. https://www.w3.org/International/questions/qa-when-xmllang) on a container element. In your example, you wouldn't use `EN` and `FR` elements, but rather a generic element (`span`, say, but it could be any element) like this: `<span xml:lang="en">eye</span>...<span xml:lang="fr">oeil</span>...`. However, this scheme doesn't deal with singular/plural forms; for that, you need to come up with your own markup vocabulary. — imhotap, Sep 17 '17 at 14:50
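
As a rough illustration of the `xml:lang` approach suggested in the comments, the sketch below walks a small document and yields each run of text together with the language in effect, taking into account that `xml:lang` is inherited by child elements. The `definition` element, the `SAMPLE` string and the `text_runs` helper are assumptions made for this example; only the `xml:lang` attribute and its namespace come from the XML specification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical markup: an English definition with one embedded French word.
SAMPLE = (
    '<definition xml:lang="en">eye, irregular pl. '
    '<span xml:lang="fr">yeux</span>, eyes.</definition>'
)

def text_runs(elem, inherited=None):
    """Yield (language, text) pairs, applying xml:lang inheritance."""
    # An element without its own xml:lang takes the language of its parent.
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        yield lang, elem.text.strip()
    for child in elem:
        yield from text_runs(child, lang)
        # Tail text follows the child but belongs to the current element.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

runs = list(text_runs(ET.fromstring(SAMPLE)))
for lang, text in runs:
    print(lang, repr(text))
```

Each `(language, text)` pair could then be routed to the index for that language. As the comment notes, distinctions such as singular versus plural would still need custom attributes or elements beyond `xml:lang`.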

0 Answers