3

In NLP, languages are often referred as low resource or high resource.

What do these terms mean?

Ethan
  • 1,633
  • 9
  • 24
  • 39
Astariul
  • 1,004
  • 8
  • 18

1 Answers1

6

High resource languages are languages for which many data resources exist, making possible the development of machine-learning based systems for these languages. English is by far the most well resourced language. West-Europe languages are quite well covered, as well as Japanese and Chinese. Naturally low-resource languages are the opposite, that is languages with none or very few resources available. This is the case for some extinct or near-extinct languages and many local dialects. There are actually many languages which are mostly oral, for which very few written resources exist (let alone resources in electronic format); for some there are written documents but not even something as basic as a dictionary.

There are many different types of resources which are needed in order to train good language-based systems:

  • a high amount of raw text from various genres (type of documents), e.g. books, scientific papers, emails, social media content, etc.
  • lexical, syntactic and semantic resources such as dictionaries, dependency tree corpora, semantic databases (e.g. WordNet), etc.
  • task-specific resources such as parallel corpora for machine translation, various kinds of annotated text (e.g. with part-of-speech tags, named entities, etc.)

Many types of language resources are costly to produce, this is why the economic inequalities between countries/languages are reflected in the amount (or absence) of language resources. The Universal Dependencies project is an interesting effort to fill this gap.

Erwan
  • 25,321
  • 3
  • 14
  • 35