This is specifically for a vocab-type database, where you have a "head word" or "key word" field in the source language, and then a second, usually bigger, field which you might call "definition" or "explanation", as you find in a dictionary.

For example, with the French head word "oeil" (eye), you might have this as the definition/explanation:

<EN>eye</EN> <phonetics>...</phonetics>, irregular pl. <FR>yeux</FR> <phonetics>...</phonetics>, <EN>eyes</EN>. And some more miscellaneous free-form text perhaps with some other embedded <FR> or <EN> words...

How best to accomplish that sort of markup? What sort of XML schema should I use, if applicable? The point is not merely cosmetic: if you can stipulate which language applies to certain parts of the text (or indeed if a piece of text is actually phonetics), then you can perform inverted indexing, of the Lucene type, in a way which is completely impossible otherwise.

In the example above, not only would a French-language Lucene index record that entry as containing the French head word "oeil", it would also record it as containing the French word "yeux".
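To make the indexing idea concrete, here is a minimal sketch in plain Python (not Lucene itself) of how language-tagged fragments could feed one inverted index per language. The record layout, the `build_indexes` function, and the assumption that the head word is always French are hypothetical, chosen only to mirror the "oeil" example; untagged free text and `<phonetics>` content are simply skipped.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical record: id -> (head word, definition using the <EN>/<FR> tags).
RECORDS = {
    1: ("oeil", "<EN>eye</EN> <phonetics>...</phonetics>, irregular pl. "
                "<FR>yeux</FR>, <EN>eyes</EN>."),
}

def build_indexes(records):
    """Return {language: {term: set of record ids}}."""
    indexes = defaultdict(lambda: defaultdict(set))
    for rec_id, (head_word, definition) in records.items():
        # Assumption: the head word is in the source language (French here).
        indexes["FR"][head_word.lower()].add(rec_id)
        # Wrap the fragment in a root element so it parses as well-formed XML.
        root = ET.fromstring(f"<def>{definition}</def>")
        for elem in root.iter():
            # Only language-tagged elements are indexed; phonetics are ignored.
            if elem.tag in ("EN", "FR") and elem.text:
                for term in re.findall(r"\w+", elem.text.lower()):
                    indexes[elem.tag][term].add(rec_id)
    return indexes

indexes = build_indexes(RECORDS)
print(sorted(indexes["FR"]))  # ['oeil', 'yeux']
print(sorted(indexes["EN"]))  # ['eye', 'eyes']
```

In a real setup each per-language term list would presumably go to a Lucene field with a language-appropriate analyzer; the dictionaries here only stand in for that.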

I looked around both here and generally but I couldn't find any sort of "best practice" recommendations for this sort of situation: usually when you google "multi-language" it's about replacing an entire string in one language with one in another.

mike rodent
  • Have you considered using `xml:lang`? See [**this answer**](https://stackoverflow.com/a/38590213/290085). – kjhughes Sep 17 '17 at 14:40
  • The XML spec's own convention for marking up text as being in a particular language is to use the `xml:lang` attribute (cf. https://www.w3.org/International/questions/qa-when-xmllang) on a container element. In your example, you wouldn't use `EN` and `FR` elements, but rather a generic element (`span`, say, but it could be any element), like this: `<span xml:lang="en">eye</span>...<span xml:lang="fr">oeil</span>...`. However, this scheme doesn't deal with singular/plural forms; for that, you need to come up with your own markup vocabulary. – imhotap Sep 17 '17 at 14:50
  • Thanks to both... – mike rodent Sep 17 '17 at 17:11
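As a rough illustration of the `xml:lang` suggestion in the comments, the sketch below reads the language of each fragment back out with Python's standard `xml.etree.ElementTree`, letting child elements inherit the language of their container. The `<entry>` and `<span>` element names and the `spans_with_lang` helper are assumptions made for this example, not part of any established vocabulary.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry: English by default, with one French fragment.
ENTRY = (
    '<entry xml:lang="en">'
    '<span>eye</span>, irregular pl. '
    '<span xml:lang="fr">yeux</span>, <span>eyes</span>.'
    '</entry>'
)

def spans_with_lang(elem, inherited=None):
    """Yield (language, text) per <span>, inheriting xml:lang from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    if elem.tag == "span" and elem.text:
        yield lang, elem.text
    for child in elem:
        yield from spans_with_lang(child, lang)

pairs = list(spans_with_lang(ET.fromstring(ENTRY)))
print(pairs)  # [('en', 'eye'), ('fr', 'yeux'), ('en', 'eyes')]
```

Each `(language, text)` pair could then be routed to the index for that language. Marking singular versus plural forms would still need a custom attribute or element on top of this.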

0 Answers