
First, I apologize if this question sounds naive or does not make sense at all, since I'm not a mathematician or a math major.

I'm working on a problem related to approximating the manifold of real-world data. As you may know, the Manifold Hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space. This hypothesis makes sense to me for continuous data like images. However, I'm not really sure about discrete data such as text.

Let $V \subset \mathbb{R}^{|V|}$ denote the English vocabulary. Each word $w_i \in V$ is represented as a $|V|$-dimensional one-hot vector whose $i$-th entry equals $1$ and whose other entries equal $0$. I define a sentence $s$ as an ordered sequence of words from $V$. Let $S$ be the set of all possible sentences.
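
For concreteness, here is a tiny sketch of this representation (my own illustration, in Python with numpy; the three-word vocabulary is made up):

```python
import numpy as np

vocab = ["apple", "orange", "internet"]        # stand-in for the vocabulary V
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector of `word`."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# A sentence s is an ordered sequence of such vectors.
sentence = [one_hot(w) for w in ["apple", "orange", "apple"]]
```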

To be precise, my questions are:

  • Is $V$ discrete? Is it a closed set?
  • Is $S$ discrete? Is it a closed set?
  • If $S$ is discrete, is it possible that a manifold can lie inside it?

I'm pretty sure my questions sound very dumb, so I'm very grateful for your help and patience.

engnad

2 Answers


Yes, $V$ is discrete and closed: it's just the standard basis of $\mathbb R^{|V|}$, and it's a finite set anyway, so it is both discrete and closed. As for $S$, it's not a subset of $\mathbb R^{|V|}$, so it doesn't make sense to ask whether it's discrete/closed or not. You can put a metric on finite sequences of vectors in $\mathbb R^{|V|}$, using e.g. the $\ell^\infty$-norm; then sentences are just tuples $(e_{i_1}, e_{i_2}, \cdots)$, and two distinct sentences must differ at some coordinate, which makes their distance exactly $1$. So again $S$ is trivially discrete and closed.
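
As a sanity check, here is a rough sketch of this $\ell^\infty$ metric (my own, not part of the answer; it assumes numpy and pads the shorter sentence with zero vectors):

```python
import numpy as np

def sentence_dist(s1, s2):
    """l-infinity distance between two sentences of one-hot vectors."""
    m = max(len(s1), len(s2))
    n = len(s1[0]) if s1 else len(s2[0])
    pad = lambda s: np.vstack(list(s) + [np.zeros(n)] * (m - len(s)))
    return np.abs(pad(s1) - pad(s2)).max()

e = np.eye(3)  # standard basis e_1, e_2, e_3 as rows
print(sentence_dist([e[0], e[1]], [e[0], e[1]]))  # 0.0: the same sentence
print(sentence_dist([e[0], e[1]], [e[0], e[2]]))  # 1.0: distinct sentences
```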

The problem is that the embedding $V\rightarrow \mathbb R^{|V|}$ doesn't tell us anything about the English vocabulary, except that different words are different. "Apple" and "Orange" have distance $\sqrt{2}$, and so do "Apple" and "Internet", while we probably want the former pair to be closer (well, I guess "Apple" can be closer to "Internet" in certain contexts).
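
A quick illustrative check (my own; the three basis vectors stand in for any three words) that every pair of distinct one-hot vectors is exactly $\sqrt 2$ apart, so the representation carries no semantic information:

```python
import itertools
import numpy as np

e = np.eye(3)  # one-hot vectors for, say, "apple", "orange", "internet"
for i, j in itertools.combinations(range(3), 2):
    print(i, j, np.linalg.norm(e[i] - e[j]))  # every pair: sqrt(2) ~ 1.4142
```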

I don't know much about data science, but as far as I understand it, this is not what the manifold hypothesis is about. In the NLP setting, what we really want is to embed the set of words into a much lower-dimensional space by mapping them to vectors other than the standard basis (so that the coordinates encode more information than the words simply being different), such that well-formed sentences/texts form an even lower-dimensional manifold, as sentences are supposed to connect "close" words (how to encode the order of words in a sentence is another question).

You can start with https://en.wikipedia.org/wiki/Word2vec
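
As a starting point in code, here is a minimal sketch (my own; it assumes the gensim library is installed, and the three-sentence corpus is made up, so the numbers it prints are noisy and purely illustrative):

```python
# Train word2vec on a toy corpus: words get dense 8-dimensional vectors,
# and similarity reflects co-occurrence rather than mere distinctness.
from gensim.models import Word2Vec

corpus = [
    ["apple", "orange", "fruit", "juice"],
    ["orange", "fruit", "tree", "juice"],
    ["apple", "internet", "iphone", "computer"],
]
model = Word2Vec(corpus, vector_size=8, window=2, min_count=1, seed=0)

# Cosine similarities in the learned space (meaningless on a corpus this
# small, but on real text "apple"/"orange" would typically come out closer).
print(model.wv.similarity("apple", "orange"))
print(model.wv.similarity("apple", "internet"))
```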

Just a user
  • I will just note that the usual topology on $(\mathbb R^n)^{\mathbb N}$, i.e. the product topology, is metrizable with a fairly complicated metric such as https://math.stackexchange.com/q/361778/631742 – Maximilian Janisch Oct 10 '21 at 07:27

This is from a mathematician's perspective, so I apologize if there are some technical details that are irrelevant to the question you are asking :).


If I understand correctly: $V$ is just the set $$\{(1,0,0,\dots, 0), (0,1,0,\dots, 0),\dots, (0,0,\dots, 0, 1)\}\subset\mathbb R^n,$$ where $n\in\mathbb N$ is the number of words in your vocabulary. Then $V$ is a basis of $\mathbb R^n$, it is discrete (since it is finite), and it is closed (in the canonical topology). $S$ is now defined as what you may write as $$\bigcup_{m\in\mathbb N_0} V^m,$$ i.e. it is the set of all finite sequences of words. Now it becomes a bit more tricky to say whether $S$ is discrete or closed, since these are topological properties, so they depend on what topological space you embed $S$ into.

I suggest that we embed $S$ into the product space $$(\mathbb R^n)^{\mathbb N},$$ associating to a $(w_1,w_2,\dots, w_m)\in V^m\subset S$ the sequence $$(w_1,w_2,\dots, w_m, 0, 0, \dots)\in (\mathbb R^n)^{\mathbb N}.$$ As is usual, the space $$(\mathbb R^n)^{\mathbb N}$$ shall be equipped with the product topology (see H. Schubert, Topologie (1969), pages 30ff. or https://encyclopediaofmath.org/wiki/Topological_product or https://en.wikipedia.org/wiki/Product_topology).
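
If one wants to compute with this embedding, here is a toy sketch (my own illustration, in Python with numpy; nothing here comes from the mathematics itself) that realizes an embedded sentence as an eventually-zero sequence, of which we can only ever materialize a finite prefix:

```python
import itertools
import numpy as np

def embed(sentence, n):
    """Yield the coordinates w_1, ..., w_m, 0, 0, ... of the embedded sentence."""
    yield from sentence          # the m word vectors
    while True:                  # followed by the zero vector of R^n, forever
        yield np.zeros(n)

e = np.eye(3)                              # standard basis of R^3
x = embed([e[0], e[1]], n=3)               # the sequence (w_1, w_2, 0, 0, ...)
prefix = list(itertools.islice(x, 5))      # first 5 coordinates of the point
```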

Claim. With the above conventions, $S$ is discrete but not closed.

Proof. Take any (in the sense of the embedding above) $$(w_1,w_2,\dots, w_m, 0, 0, \dots)\in S.$$ Since $V\cup\{0\}$ is finite and hence discrete, for each $w_i$ there exists a neighborhood $U_i\subset\mathbb R^n$ of $w_i$ such that $(V\cup\{0\})\cap U_i = \{w_i\}$, and there exists a neighborhood $U_0\subset\mathbb R^n$ of $0$ such that $V\cap U_0=\emptyset$. Then $$S\cap (U_1\times U_2\times\dots\times U_m \times U_0\times\mathbb R^n\times\mathbb R^n\times\dots)=\{(w_1,w_2,\dots, w_m, 0, 0, \dots)\}.$$ Indeed, every coordinate of a sentence lies in $V\cup\{0\}$, so the first $m$ coordinates of a sentence in this set must be $w_1,\dots, w_m$, its $(m+1)$-st coordinate must be $0$, and a sentence that is $0$ at one coordinate is $0$ at all later coordinates. Since only finitely many factors differ from $\mathbb R^n$, this set is open in the product topology. This shows that $S$ is discrete. We now come to non-closedness of $S$: Consider the sequence $(x_k)_{k\in\mathbb N}$ with each $x_k\in S$ given by $$x_k=(\underbrace{w, w,\dots, w}_{k\text{ times}}, 0, 0, 0,\dots),$$ where $w\in V$ is any word. Then the $x_k$ converge to $$(w,w,w,w,w\dots)\in (\mathbb R^n)^{\mathbb N},$$ since convergence in the product topology is coordinatewise and the $i$-th coordinate of $x_k$ equals $w$ for all $k\ge i$. However, this limit is not a sentence, since it has infinitely many words. Therefore $S$ is not closed. $\square$


Now about manifolds: The theory of infinite-dimensional manifolds is very complex and I know little about it, so I will restrict myself to looking at the following set: $$S_k=\{\text{All sentences with at most $k$ words}\}$$ for some $k\in\mathbb N$. A sentence is always an element of the form $$(w_1,w_2,\dots, w_l, 0, 0,\dots, 0)\in(\mathbb R^n)^k$$ for some words $w_1,\dots, w_l\in V$ and $l\in\{0,1,\dots, k\}$. Now $S_k$ is a subset of $(\mathbb R^n)^k$, which is much easier to handle.

Note. In fact, one can think of $S_k$ as a set of matrices; let me know if you are interested in me elaborating on this.
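
For what it's worth, here is one concrete way to read that note (a sketch of my own, in Python with numpy): a sentence of at most $k$ words becomes a $k\times n$ matrix whose first rows are the one-hot word vectors and whose remaining rows are zero.

```python
import numpy as np

def sentence_matrix(word_indices, n, k):
    """Embed a sentence of l <= k words (given as vocabulary indices) as a
    k x n matrix: the first l rows are one-hot, the remaining rows are zero."""
    M = np.zeros((k, n))
    for row, i in enumerate(word_indices):
        M[row, i] = 1.0
    return M

A = sentence_matrix([0, 1], n=3, k=4)   # the sentence "w_1 w_2", padded
B = sentence_matrix([0], n=3, k=4)      # its one-word prefix
# Distinct sentences differ in at least one row by a one-hot vector, so their
# Euclidean distance in R^(n*k) is at least 1, matching the discreteness of S_k.
print(np.linalg.norm(A - B))            # 1.0
```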

In this new setting, with the canonical topology on $(\mathbb R^n)^k$, one actually has that $S_k$ is discrete and closed. In particular, $S_k$ can itself be seen as a $0$-dimensional submanifold of $\mathbb R^{n\cdot k}$; however, it can have no dimension higher than $0$. The way I understand the manifold hypothesis, though, is that the sentences in $S_k$ which appear in the real world lie on some "nice" submanifold of $\mathbb R^{n\cdot k}$, although I would first need to read more about that hypothesis.