The OP is correct:
- One definition of "encoding" is perfectly synonymous with the semantic definition of "embedding," and the two words are used interchangeably, as in the work the OP cites.
- Yet "encode" and "embed" can also be used in a contrasting non-semantic vs. semantic sense, as the OP also suggests.
Synonymous by definition
According to Merriam-Webster, one definition of "encode" is "to convey symbolically", with the illustrative quotation "the capacity of poetry to encode ideology" (J. D. Niles). The symbols of poetry are deeply linked to subtle variations in the meanings of words -- thus poetic "encoding" does indeed "embed" one representation into another, using representations that are strongly tied to the meanings of terms.
Use of "encode" and "embed" in the first Transformer paper
This section is written in the context of LLMs (large language models) like ChatGPT, and specifically of the first Transformer paper, Attention is All You Need.
In this paper, the term "encoding" is indeed used for non-semantic encodings -- that is, for representations that do NOT attempt to push words with similar meanings closer together in the representational (encoding) space. The word "encoding" is used in this sense twice in the paper. The first is for positional encodings, which uniquely encode the position of a word within the input text with something like a soft embedding. The second is for "byte pair encodings" (BPE), which was the term used for word-piece encodings at the time the paper was written. Word-piece encodings map the input into a sequence of indices that correspond to known words or pieces of words. The original words can be recovered from the sequence of indices, and the indices do NOT move meanings closer together.
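As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding described in the paper. Note that the result depends only on the position index and the model dimension, never on the meaning of the token at that position -- which is exactly why it is a non-semantic "encoding."

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention is All You Need'.

    The value at (pos, i) depends only on the position pos and the
    dimension index i; the token occupying that position is never consulted.
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Each row encodes a position, not a word.
print(positional_encoding(seq_len=4, d_model=512).shape)   # (4, 512)
```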
The term "embedding" is also used in this paper in the sense the PO and others have discussed. Here, the BPE (word-piece) tokens are embedded into a 512-dimensional space, where, during learning, the embeddings will drive similar words towards similar embeddings, thus producing a semantic embedding.
But, in the same paper, the term "encoder" is consistently used for a large network whose sole purpose is to produce rich semantic embeddings: it embeds each word of the sequence into a 512-dimensional space where the position of the vector encodes not only the meaning of that word, but also the meanings of the other words to which it is connected grammatically within the sentence.
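A rough sketch of that shape-preserving, meaning-enriching step, using PyTorch's stock encoder modules rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

# One sentence of 10 tokens, already embedded (and position-encoded) into 512-d.
x = torch.randn(1, 10, d_model)
contextual = encoder(x)          # still (1, 10, 512)

# Each output vector is still an embedding of one token, but its position in
# the 512-d space now also reflects the other tokens it attended to.
print(contextual.shape)
```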
What are we to make of this? The words "encoding" and "embedding" do seem to clearly differ in being non-semantic or semantic. Yet the word "encoder" is clearly semantic.
Are the authors of Attention is All You Need being inconsistent in their use of language? I don't think so. I believe that the "encoder-decoder" architecture, already popular long before Transformers came on the scene, breaks from the encoding/embedding distinction seen elsewhere and indeed uses the term "encoding" synonymously with the term "embedding."
The Encoder-Decoder Architecture
Why are these called encoder-decoder architectures? Here, "encode" is used synonymously with "semantic embedding," and is used instead of "embedding" because entire sentences are considered rather than single words. The d2l.ai text clearly defines the encoder-decoder architecture as "consisting of two major components: an encoder that takes a variable-length sequence as input, and a decoder that acts as a conditional language model, taking in the encoded input and the leftwards context of the target sequence and predicting the subsequent token in the target sequence."
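Following that d2l.ai description, here is a minimal interface sketch (the class and argument names are my own, not the book's): the encoder maps a variable-length source sequence to a semantic representation of the whole input, and the decoder conditions on that representation plus the leftwards target context.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a variable-length input sequence to an encoded representation."""
    def forward(self, src):
        raise NotImplementedError

class Decoder(nn.Module):
    """Conditional language model: consumes the encoded input plus the
    leftwards target context and predicts the next target token."""
    def forward(self, encoded, tgt_prefix):
        raise NotImplementedError

class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, tgt_prefix):
        # "Encoding" here plays the role of a semantic embedding of the
        # entire input sequence, not of a single word.
        encoded = self.encoder(src)
        return self.decoder(encoded, tgt_prefix)
```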
Embeddings as manifolds within a larger space
As illustrated clearly in a figure from an answer to another question already linked to this one, the term "embedding" can refer to the process of putting one space within another according to a nonlinear mapping.
When we use this term to talk about embedding the meanings of words within a higher-dimensional space, we should NOT assume that the embeddings fall on a lower-dimensional manifold (surface) within that space. They may very well fill the entire dimensionality of that space. But the word still carries this picture of embedding points into a space, just as one embeds nails into wood to make nail art.
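As a toy illustration of that geometric sense of "embedding" (nothing to do with language models), the NumPy snippet below nonlinearly maps a 1-dimensional space into 3 dimensions; the image is a curve, i.e. a lower-dimensional manifold sitting inside the larger space, which is the picture word embeddings may or may not follow.

```python
import numpy as np

# A nonlinear embedding of the 1-D interval [0, 2*pi) into 3-D space:
# the image is a closed curve (a 1-D manifold) that does not fill the space.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
embedded = np.stack([np.cos(t), np.sin(t), np.sin(2.0 * t)], axis=1)

print(embedded.shape)   # (200, 3): 200 points, each living in 3 dimensions
```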
The word "encoding" describes exactly the same process as "embedding," but with an emphasis on preserving the meaning of the original text. Then again, an embedding can also preserve the meaning of the original text. For example, the embeddings of word pieces must be reversible, or some tokens in the vocabulary could never be copied by the network to the output.
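One hedged way to see that reversibility in code (the sizes and the token index are placeholders): as long as the rows of the embedding table are distinct, a nearest-neighbor search over the table recovers the original token index from its embedding.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512              # placeholder sizes
embed = nn.Embedding(vocab_size, d_model)

token_id = torch.tensor([42])                # placeholder token index
vec = embed(token_id)                        # (1, 512)

# Recover the token by finding the closest row of the embedding table.
distances = torch.cdist(vec, embed.weight)   # (1, vocab_size)
recovered = distances.argmin(dim=1)
print(recovered.item() == 42)                # True: row 42 is at distance zero
```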
Conclusion
It is clear from this brief review of a tiny fraction of the literature that the terms "encode" and "embed" can be used synonymously (in the sense of transforming one representation into another while taking into account the meaning of what is being transformed), but they can also be used in contrast to one another (with the word "encode" implying that the meanings of terms are NOT taken into account, while the word "embed" does take those meanings into account).
- The encoder-decoder architecture uses "encode" in a sense synonymous with "embed."
- Positional encodings and byte-pair encodings (word-piece encodings) use "encode" in a sense that contrasts with the semantic meaning of "embedding."