2

Say that we have a sequence $x_1,x_2,\dots$ of i.i.d. random vectors in $\mathbb{R}^n$ with mean $0$ and variance $\sigma^2$, meaning $$\mathbb{E}[\|x_i\|^2_2] = \sigma^2$$ for all $i$. Then it's a pretty standard exercise that the variance of the empirical averages tends to zero: $$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_2^2\right] = \frac{\sigma^2}{N}$$ What happens if I replace the Euclidean norm $\|\cdot\|_2$, in both the definition of $\sigma$ and in the norm of the average, with something else, for instance $\|\cdot\|_p$ with $p=1$ or $p=\infty$? Can I still obtain a bound, something like

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_p^2\right] \leq K\frac{\sigma^2}{N}$$ for an appropriate $K$? I know that this can be done by invoking the equivalence of any norm to the Euclidean norm, but that approach introduces factors that depend on the dimension. My question: is it possible to obtain a similar bound on the variance of the empirical averages that is independent of the dimension?
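For reference, here is a sketch of the standard $\ell_2$ computation (expand the square; the cross terms vanish by independence and zero mean):

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}x_i\right\|_2^2\right]
=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}\left[\langle x_i,x_j\rangle\right]
=\frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\left[\|x_i\|_2^2\right]
=\frac{\sigma^2}{N}.$$

Note that this argument is specific to the Euclidean norm, since it uses the fact that $\|\cdot\|_2$ comes from an inner product.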

ttb

2 Answers

1

This is false, at least for $\ell_1$. I'm going to use the following result: learning an arbitrary probability distribution $p$ over $\{1,\dots,n\}$ to total variation distance (i.e., half the $\ell_1$ distance between the probability mass functions), say to error $0.01$, requires $\Theta(n)$ i.i.d. samples from $p$.

Now, view $p$ as a vector in $\mathbb{R}^n$ (a vector of non-negative entries summing to $1$). If you let $\mu$ be the vector $\mu=(p(1),\dots,p(n))$, then learning $p$ in total variation distance is equivalent to your problem: you want to learn $\mu$ in $\ell_1$ distance, and you get i.i.d. vectors in $\mathbb{R}^n$ with mean $\mu$ and, after centering, $\sigma^2\le 1$ (each sample has exactly one non-zero coordinate, equal to one).
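To spell the construction out (writing $s_i$ for the samples, $e_k$ for the standard basis vectors of $\mathbb{R}^n$, and $\hat p_N$ for the empirical distribution): draw $s_1,\dots,s_N$ i.i.d. from $p$ and set $x_i=e_{s_i}$, the indicator vector of the $i$-th sample. Then

$$\mathbb{E}[x_i]=\mu,\qquad \frac{1}{N}\sum_{i=1}^{N}x_i=\hat p_N,\qquad \left\|\hat p_N-p\right\|_1=2\,d_{\mathrm{TV}}(\hat p_N,p),$$

and the centered vectors $x_i-\mu$ have mean zero with $\mathbb{E}\left[\|x_i-\mu\|_2^2\right]=1-\|p\|_2^2\le 1$ and $\mathbb{E}\left[\|x_i-\mu\|_1^2\right]\le 4$, uniformly in the dimension $n$; so learning $\mu$ in $\ell_1$ distance to accuracy $\varepsilon$ is the same as learning $p$ to total variation distance $\varepsilon/2$.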

This means that, at least for $\ell_1$, your $K$ has to depend on $n$, otherwise you could learn $\mu$ with a number of samples independent of $n$.

To summarize: the (minimax) rate of learning a probability distribution over $n$ elements in total variation (TV) distance has a linear dependence on $n$ (see, e.g., [1]). Given $N$ i.i.d. samples from such a distribution $p$, you can simulate $N$ i.i.d. vectors in $\mathbb{R}^n$ (just by mapping each sample to its indicator vector), and learning the mean of those vectors in $\ell_1$ is equivalent to learning the distribution $p$ in TV distance. Therefore, the rate for the latter question must have at least a linear dependence on $n$ as well (as otherwise one could perform the former task too efficiently).


[1] C. Canonne, "A short note on learning discrete distributions", https://arxiv.org/abs/2002.11457 (a reference for some of these learning/density estimation "folklore" facts).

Clement C.
  • This is very interesting; thank you. If I'm understanding correctly, then, there should be a theorem that the expected total variation distance is lower-bounded by some increasing function of $n$, the number of points in the sample space? Do you know where I could look for a proof of this fact? – ttb Mar 10 '20 at 20:40
  • Learning a probability distribution over $\{1,\dots,n\}$ in total variation distance (which is the result I used above) is more or less folklore; see, e.g., this recent note I put up to have a centralized reference: https://arxiv.org/abs/2002.11457 @ttb – Clement C. Mar 10 '20 at 21:03
  • @ClementC. I had a quick look at the paper. Nice indeed. Still, I'm left with a question: could you clarify in mathematical terms what you mean by learning? I had the impression it always refers to the empirical distribution. What is the underlying theoretical argument that ensures no measurable function of the data may do better? (I'm not a statistician; maybe this is also folklore.) – Olivier Mar 11 '20 at 13:16
  • The lower bound says that no algorithm can do that, not only the empirical estimator. The basic idea is to show that the statistical distance between (i) the distribution of a sequence of $N$ i.i.d. draws from $p$ and (ii) the distribution of a sequence of $N$ i.i.d. draws from $q$ is small (TV $\ll 1$) unless $N$ is large enough, for some choice of $p,q$ that are far enough apart (this is the very basic idea, formalized by Assouad's lemma; what I wrote is not exactly what happens, in particular $p,q$ are chosen "randomly" themselves); see the sketch after these comments. This implies you cannot learn $p$ unless $N$ is large enough. @Olivier – Clement C. Mar 11 '20 at 15:16
  • @Olivier If you're interested, see, e.g., this on Assouad's lemma. – Clement C. Mar 11 '20 at 16:36
  • Thanks for the references, I'll check them out soon! – Olivier Mar 11 '20 at 18:27
  • If this could be fleshed out a bit more, closer to the outline of a formal proof, I will accept it as the answer. – ttb Mar 24 '20 at 02:59
  • I added some details. I am not sure what other details you are asking for. @ttb – Clement C. Mar 24 '20 at 10:21
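A rough statement of the two-point reduction sketched in the comments above (a Le Cam-type bound, of which Assouad's lemma is a multi-pair refinement; the exact constants vary between references): for any estimator $\hat\mu$ built from $N$ i.i.d. samples, and any two distributions $p,q$ with means $\mu_p,\mu_q$,

$$\max_{r\in\{p,q\}}\ \mathbb{P}_r\!\left(\|\hat\mu-\mu_r\|_1 \ge \tfrac12\|\mu_p-\mu_q\|_1\right)\ \ge\ \frac{1-d_{\mathrm{TV}}\!\left(p^{\otimes N},\,q^{\otimes N}\right)}{2},$$

so unless $N$ is large enough for the two product distributions to be distinguishable, every estimator incurs an $\ell_1$ error of order $\|\mu_p-\mu_q\|_1$ with constant probability.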
1

To quote the above answer: "This is true, at least for $\ell_p$, $p \geq 2$" :-)

This is simply because on $\mathbb R^n$ (or more generally $\mathbb C^n$), if $0<p<q$,

$$\|x\|_{q}\le \|x\|_{p},$$

a surprisingly simple inequality a colleague taught me last week; see this question.
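A sketch of that argument: by homogeneity one may assume $\|x\|_p=1$, so every coordinate satisfies $|x_k|\le 1$ and hence $|x_k|^q\le|x_k|^p$; summing gives

$$\|x\|_q^q=\sum_{k=1}^{n}|x_k|^q\ \le\ \sum_{k=1}^{n}|x_k|^p=\|x\|_p^p=1,$$

so $\|x\|_q\le 1=\|x\|_p$ (for $q=\infty$ the bound $\|x\|_\infty\le\|x\|_p$ is immediate).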

On the other hand, the converse inequality (Jensen or Hölder; thanks, Clement C.)

$$\|x\|_{p}\le n^{1/p-1/q} \|x\|_q$$

gives you a bound for $p \le 2$:

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum\limits_{i=1}^{N}x_i\right\|_p^2\right] \leq \sigma^2 \frac{n^{2/p-1}}{N} $$

that indeed depends on the dimension.
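Explicitly, taking $q=2$ in the inequality above and then using the $\ell_2$ identity from the question:

$$\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}x_i\right\|_p^2\right]\ \le\ n^{2/p-1}\,\mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}x_i\right\|_2^2\right]\ =\ \frac{n^{2/p-1}\,\sigma^2}{N},$$

where $\sigma^2=\mathbb{E}[\|x_i\|_2^2]$ is still the Euclidean variance; this is the gap acknowledged in the EDIT below.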

EDIT: I noticed I did not change the definition of $\sigma$ as requested by the OP. Hence not a valid answer.

Olivier