
Let $X$ be a random variable taking values in $\mathbb{R}^N$, with probability density function $f:\mathbb{R}^N\rightarrow \mathbb{R}_{\ge 0}$.

Consider a new random variable $Y$ taking values in the countable set $(y_n)_{n\in\mathbb{N}}\subseteq \mathbb{R}^N$, with respective probabilities $(p_n)_{n\in\mathbb{N}}$, so $P(Y=y_n)=p_n$ and $\sum_{n=0}^\infty{p_n}=1$. Without loss of generality, we may assume that $p_n$ is weakly decreasing in $n$.

By choosing $y_n$ and $p_n$ appropriately, can we ensure that (non-trivial) error bounds of something vaguely like one of the following forms are satisfied?

$$\forall x\in\mathbb{R}^N\setminus\{y_n|n\in\mathbb{N}\},\,\forall \epsilon>0,\,|P(\|X-x\|_2<\epsilon)-P(\|Y-x\|_2<\epsilon)|\le C \epsilon^{\kappa},$$

for some $C>0$ and $\kappa>0$, possibly depending on $N$.

AND/OR

$$\forall x\in\mathbb{R}^N\setminus\{y_n|n\in\mathbb{N}\},\,\forall \epsilon>0,\,|P(\|X-x\|_2<\epsilon)-P(\|Y-x\|_2<\epsilon)|\le C P(\|X-x\|_2<\epsilon),$$

for some $C\in [0,1)$,

AND/OR

$$\forall x\in\mathbb{R}^N\setminus\{y_n|n\in\mathbb{N}\},\,\lim_{\epsilon\rightarrow 0}{\frac{P(\|X-x\|_2<\epsilon)}{P(\|Y-x\|_2<\epsilon)}}=1.$$

I am not particularly attached to the precise form of these results. I am just curious whether having countably many points means you can get, in some sense, "close" to the true distribution, in a way that you never can with finitely many (since with finitely many support points, sufficiently small balls will contain no $y_n$).

I.e. is there any sense in which you can do better with countably many support points than you can with finitely many?


For example, suppose we take $p_n={(n+1)}^{-1-\theta}{[\sum_{k=0}^\infty{{(k+1)}^{-1-\theta}}]}^{-1}$ for some small $\theta>0$. Then it seems plausible that there could be an algorithm for constructing $y_n$ that proceeded by drawing points from $X$ and discarding them if they are too close to previously drawn points.
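To make the idea concrete, here is a rough Python sketch of that kind of procedure (purely illustrative and entirely mine: the separation schedule `min_sep`/`shrink`, the sample size, and the choice of a bivariate standard normal for $X$ are arbitrary, and nothing here is claimed to achieve any of the bounds above):

```python
import numpy as np

def zipf_weights(n_points, theta=0.1):
    """Weights p_n proportional to (n+1)^(-1-theta), normalised to sum to one."""
    w = np.arange(1, n_points + 1, dtype=float) ** (-1.0 - theta)
    return w / w.sum()

def build_support(draws, n_points, min_sep=0.05, shrink=0.9):
    """Greedily keep draws from X that are not too close to the points kept so far.

    The separation radius `min_sep` and its decay `shrink` are ad hoc choices:
    the radius is tightened as points are accepted, so later (low-weight) atoms
    can fill in the gaps left between the early ones.
    """
    kept = []
    sep = min_sep
    for x in draws:
        if all(np.linalg.norm(x - y) >= sep for y in kept):
            kept.append(x)
            sep *= shrink
            if len(kept) == n_points:
                break
    return np.array(kept)

rng = np.random.default_rng(0)
draws = rng.standard_normal((20_000, 2))   # X: standard normal on R^2, say
y = build_support(draws, n_points=200)
p = zipf_weights(len(y))                   # p_n as described above, theta = 0.1
```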

My worry, though, is that the early points will end up with a kind of "rain shadow" around them. Sets close to $y_1$, but not containing $y_1$, will end up with less mass than they should have.

Can this error be bounded?

cfp
  • A related problem is to generate a continuous random variate, using only random bits, to a desired accuracy (e.g., $1/2^n$); see this question: https://cs.stackexchange.com/questions/103041/complexity-of-generating-non-uniform-random-variates – Peter O. Jun 19 '21 at 16:29
  • @PeterO. Interesting! But quite a different problem. Here I'm assuming that you have countable support available, which effectively means infinitely many bits. But you still cannot get a perfect approximation it seems. – cfp Jun 20 '21 at 08:09
  • You might explore density estimation in a nonparametric statistics book, for example kernel methods or for that matter empirical estimators. – A rural reader Jun 24 '21 at 23:22
  • @Aruralreader I'm aware of that literature. With this (admittedly poorly phrased) question I was wondering about whether having countable support really helps. The question text has now been updated to be slightly clearer about what I was after! – cfp Jun 25 '21 at 08:27

5 Answers


You are asking for an approximation in total variation distance: $$ d_{\rm TV}(p,q) = \sup_{S\text{ measurable}}(p(S)-q(S)) $$ This is hopeless, unfortunately, as a discrete distribution cannot approximate a continuous one in that distance (the distance between them is trivially the maximum, 1).
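(For concreteness, here is the standard witness, writing $p$ for the continuous law and $q$ for the discrete one with countable support $Y=\{y_n\}$: the measurable set $S=\mathbb{R}^N\setminus Y$ gives $$p(S)-q(S)=1-0=1,$$ so the supremum is attained at $1$.)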

But there are other good distance measures! E.g., if you are asking instead for the (weaker) Kolmogorov distance: $$ d_{\rm K}(p,q) = \sup_{a\in\mathbb{R}}|p((-\infty,a])-q((-\infty,a])| $$ (for univariate distributions), then the DKW inequality in particular implies that any distribution over $\mathbb{R}$ can be approximated to distance $\varepsilon$ by a discrete distribution with support of size $O(1/\varepsilon^2)$ (and, more importantly, that with high probability the empirical distribution over that many i.i.d. draws is such a good approximation).
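As a quick numerical illustration of the DKW point (a sketch I am adding, not part of the original answer; the standard normal target, the value $\varepsilon=0.01$ and the resulting sample size are arbitrary choices):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
eps = 0.01
n = int(1 / eps**2)                      # O(1/eps^2) samples, as DKW suggests
sample = np.sort(rng.standard_normal(n))

# True standard-normal CDF evaluated at the sample points.
F = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in sample])

# Kolmogorov distance between the empirical CDF (a finitely supported measure)
# and the true CDF; for a step function vs. a continuous CDF the supremum is
# attained just before or just after a sample point.
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n
d_K = max(np.max(np.abs(ecdf_hi - F)), np.max(np.abs(ecdf_lo - F)))
print(n, d_K)                            # d_K is typically of the same order as eps
```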


I hope I am not misunderstanding the question. In any case, this paper may be of interest:

Kamath, S., Orlitsky, A., Pichapati, D. & Suresh, A.T.. (2015). On Learning Distributions from their Samples. Proceedings of The 28th Conference on Learning Theory, in PMLR 40:1066-1100

as well as (maybe) this short note (focusing on discrete distributions, but covering the implications of the DKW inequality and some discussion).

Clement C.
  • Thanks. But what does having countably many support points get you, over having finitely many? For example, for any epsilon ball, you ought to be able to bound the error, no matter how small epsilon is. – cfp Jun 25 '21 at 07:30
  • @cfp for which notion of distance? Even with countably infinite support, the total variation distance between an atomic and a continuous distribution will be 1. If you want a specific example: how to approximate the uniform measure on $[0,1]$ by a measure with countably infinite support? – Clement C. Jun 25 '21 at 08:11
  • I have updated the question with vaguely the kind of thing I was interested in. This question was a matter of curiosity not need, so I did not put as much effort into its formulation as I should have! To summarise the changes, the error bound could be a function of the size of the epsilon ball under consideration, as long as it remains non-trivial. – cfp Jun 25 '21 at 08:25

I've written this up in terms of measures; comment if you need it translated into probabilistic language.

Let $\lambda$ denote Lebesgue measure on $[0,1]$ and choose any $C\in[0,1)$. No probability measure $\mu$, supported on countable $Y$, can approximate $\lambda$ in the second sense you give. Since your second condition is weaker than your third, the third is also impossible. Similarly, in your first condition, you must take $\kappa\leq N$ (I don't know if you consider that a trivial case; as I discuss in my other answer, I can't find an example that does approximate Lebesgue measure even in that weak sense).

Choose any $\delta\in(0,1)$ and positive $\{\epsilon_k\}_{k=0}^{\infty}$ summing to at most $1-\delta$.

Let $S_0=[0,1]\setminus Y$ and $r_0=1$. We will define subsequent $S_j$ and $r_j$ inductively, so that $\{S_j\}_j$ and $\{r_j\}_j$ are decreasing sequences. For now, let $j$ be arbitrary and note that, since $S_j\subseteq[0,1]\setminus Y$, your second condition gives, for all $x\in S_j$ and all $u>0$, $$\mu(B(x,u))\geq(1-C)\lambda(B(x,u))$$ Moreover, define $l_j=\lambda(S_j)$. We will require $l_j\geq\delta$; this certainly holds in the $j=0$ case.

Take any countable cover $\mathcal{C}$ of $S_j$ by balls (in $[0,1]$) of radius $r_j$ centered at points of $S_j$. By the Vitali covering lemma, we may also assume that $\mathcal{F}_j=\left\{\frac{1}{3}B:B\in\mathcal{C}\right\}$ are disjoint. Thus \begin{align*} \mu\left(\bigcup{\mathcal{F}_j}\right)&=\sum_{B\in\mathcal{F}_j}{\mu(B)} \\ &\geq(1-C)\sum_{B\in\mathcal{F}_j}{\lambda(B)} \\ &=\frac{1-C}{3}\sum_{B\in\mathcal{C}}{\lambda(B)} \\ &\geq\left(\frac{1-C}{3}\right)\lambda\left(\bigcup{\mathcal{C}}\right) \\ &\geq\frac{(1-C)l_j}{3} \end{align*}

Now, that value of $\mu$ arises from summing up infinitely many atoms. By taking the appropriate partial sum, we find finite $X_j\subseteq\bigcup{\mathcal{F}_j}$ such that $\mu(X_j)\geq\frac{(1-C)l_j}{6}$.

Finally, let \begin{gather*} r_{j+1}=\min{\left(\frac{\epsilon_j}{2|X_j|},r_j\right)} \\ S_{j+1}=S_j\setminus\bigcup_{x\in X_j}{B(x,r_{j+1})} \end{gather*} This ensures two things.

First, the $\{X_j\}_j$ are disjoint: suppose $x\in X_j\cap X_k$ with $j<k$. Then $x\in\bigcup{\mathcal{F}_k}$, so there exists $y\in S_k$ with $|x-y|<\frac{r_k}{3}$. But $S_k\subseteq S_{j+1}$, so $$\inf_{z\in S_k}{|x-z|}\geq\inf_{z\in S_{j+1}}{|x-z|}\geq r_{j+1}\geq r_k>\frac{r_k}{3},$$ a contradiction.

Second, \begin{align*} l_{j+1}&\geq l_j-2r_{j+1}|X_j| \\ &\geq l_j-\epsilon_j \end{align*} Since $\sum_k{\epsilon_k}\leq1-\delta$, we have $l_{j+1}\geq\delta$, as required above.

With the recursion closed for arbitrary $j$, we conclude: $$1\geq\mu\left(\bigcup_{j=0}^{\infty}{X_j}\right)=\sum_{j=0}^{\infty}{\mu(X_j)}\geq\sum_{j=0}^{\infty}{\frac{(1-C)\delta}{6}}=\infty$$ $\rightarrow\leftarrow$.

  • So more broadly, your position is that there is no sense in which you can do better with countably many support points than you can with finitely many? Doesn't that seem surprising? – cfp Jun 27 '21 at 07:23
  • @cfp: Only a little? Measures and $\sigma$-algebras are fundamentally uncountable objects. For example: there are no infinite countable $\sigma$-algebras. Thus completing the Borels adds uncountably many null sets. You can also construct measures using Hahn-Banach, but that needs the Boolean Prime Ideal Theorem, which is independent of countable choice. (I assume it follows from choice up to $\omega_1$.) – Jacob Manaker Jun 27 '21 at 08:49
  • Also, it's not like this was my first draft. I noticed this after trying to diagnose why my candidate to approximate Lebesgue measure failed. (I'll publish that tomorrow, after I tweak some details.) – Jacob Manaker Jun 27 '21 at 08:50
  • @cfp: Here's a better answer: We usually think about measures as being characterized by relatively "small" (finitary) objects, because we want to integrate relatively smooth functions, and that places a weak topology on measures. But your conditions place a strong topology on measures, maybe too strong. You'll notice that my proof actually shows finitely-supported measures can't approximate Lebesgue (in your senses) either. So maybe this just shows that your goals for "approximation" are unrealistic. – Jacob Manaker Jun 27 '21 at 08:53
  • (With that said, this is a fun measure theory problem. I wish I could give it more upvotes.) – Jacob Manaker Jun 27 '21 at 08:54
  • As I said in the question, I'm not too attached to those goals, so maybe this means we still haven't quite found the right ones! As Clement pointed out, there's certainly a sense in which finitely supported measures can approximate continuous ones, with the error going to zero as the support grows (but remaining finite). So with countable support allowed you can certainly get arbitrarily small error (by using a large finite support). – cfp Jun 27 '21 at 09:09

In my other answer, I showed that the only feasible definition of those you give is the first one, with $\kappa\leq N$. For example, one might hope to construct a countable approximation $\mu$ to 1-D Lebesgue measure $\lambda$ and take $\kappa=1$, so that $$|\mu(B(x,r))-\lambda(B(x,r))|<O(r)$$ for small $r$. (Note that this does not say much: since $\lambda(B(x,r))=2r$, this is equivalent to requiring $$0\leq\mu(B(x,r))\leq O(r)$$ Nevertheless…) I am unable to construct such a measure.

In fact, that example must contain essentially all the difficulty: let $X$ have c.d.f. $F$. Then $F(X)$ is a random variable uniformly distributed on $[0,1]$. If $\mu$ approximates Lebesgue measure, then the pullback of $\mu$ via $F$ must approximate the law of $X$.

Below is the closest I have come to constructing such a measure.

Let $$D_n=\left\{\frac{a}{2^n}:|a|<2^n,2\nmid a\right\}$$ denote the dyadic rationals of level $n$ in $(-1,1)$, with the convention $D_0=\{0\}$, and let $D=\bigcup_{n=0}^{\infty}{D_n}$. Then $|D_n|=2^n$ and $D$ is countable.

We let $\mu(\{x\})=2^{-2n-1}$ for $x\in D_n$; this defines a probability measure on the power set of $D$.

We are interested in counting the intersection of $D_n$ with balls, so fix $n$ and choose a point $x\notin D$ and a radius $2^{-k}$. Note that the points of $D_n$ are spaced $\frac{1}{2^{n-1}}$ apart and intercalate between $D_{n-1}$ exactly. Thus the elements of $\bigcup_{n<k}{D_n}$ are spaced distance $2^{1-k}$ apart. Since $x\notin D$, neither endpoint of $B(x,2^{-k})$ lies in $\bigcup_{n<k}{D_n}$; thus $$\left|B(x,2^{-k})\cap\bigcup_{n<k}{D_n}\right|=1\tag{1}$$ Likewise, if $k\leq n$, then $$|B(x,2^{-k})\cap D_n|=2^{n-k}\tag{2}$$

We can improve on (1) if we allow ourselves access to the binary representation of $x$. First, write $x=(-1)^s\sum_{j=0}^{\infty}{\frac{a_j}{2^j}}$ where each $a_j\in\{0,1\}$. Second, round $x$ to $(k-1)$-many digits, producing $y$; third, let $\delta_k(x)$ be one plus the index of the last "$1$" appearing in $\{a_j\}_{j<k}$. (If $y=0$, then $\delta_k=0$.)

By definition, $y\in D_{\delta_k(x)}$. In addition, we have constructed $y$ so that $|x-y|<\frac{1}{2^k}$ ($x\notin D$ rules out equality); thus $y\in B(x,2^{-k})$. So if $k<n$, we have $$\left|B(x,2^{-k})\cap D_n\right|=\begin{cases} 1 & n=\delta_k(x) \\ 0 & n\neq\delta_k(x) \end{cases} \tag{3}$$

Finally, we can compute: \begin{align*} \mu(B(x,2^{-k}))&=\frac{1}{2^{2\delta_k(x)+1}}+\sum_{n=k}^{\infty}{\frac{2^{n-k}}{2^{2n+1}}} \\ &=\frac{1}{2^{2\delta_k(x)+1}}+2^{-k-1}\sum_{n=k}^{\infty}{2^{-n}} \\ &=\frac{1}{2^{2\delta_k(x)+1}}+\frac{1}{2^{2k}} \end{align*}

Writing $r=2^{-k}$, we can summarize this as: $$\mu(B(x,r))=r^2+\frac{1}{2^{2\delta_k(x)+1}}$$ $r^2=O(r)$ for small $r$, so that's good; what about the other term? Well, since $x\notin D$, we know that $\lim_{k\to\infty}{\delta_k}=\infty$. But we do not have any bounds on $\delta_k$ uniform in $x$, and in fact cannot. There are transcendental numbers with arbitrarily large gaps between "$1$"s in their digit sequence; $x$ might be one such. (We can find a bound that holds for $\lambda$-a.e. $x$: \begin{align*} \lambda\left(\liminf_{k\to\infty}{\left\{x:\delta_k(x)>\frac{k}{\log{(k)}}\right\}}\right)&=\liminf_{k\to\infty}{\lambda\left(\left\{x:\delta_k(x)>\frac{k}{\log{(k)}}\right\}\right)} \tag{*} \\ &=1-\limsup_{k\to\infty}{\lambda\left(\left\{x:\delta_k(x)\leq\frac{k}{\log{(k)}}\right\}\right)} \\ &=1-\limsup_{k\to\infty}{\sum_{d\in\{0,1\}^{\frac{k}{\log{(k)}}}}{2^{-k}}} \tag{†} \\ &=1-\limsup_{k\to\infty}{2^{\frac{k}{\log{(k)}}-k}} \\ &=1 \end{align*} where (*) follows from the Dominated Convergence Theorem and (†) from the characterization of $\left\{x:\delta_k(x)\leq\frac{k}{\log{(k)}}\right\}$ as "numbers with all binary digits from index $\frac{k}{\log{(k)}}$ to $k$ equal to 0.")
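As a numerical sanity check of the closed form above (my own addition, not part of the argument): the sketch below truncates $\mu$ at a finite level, includes the level-$0$ atom at $0$ per the convention noted earlier, reads $\delta_k(x)$ off as the level of the unique atom of level $<k$ inside the ball, and compares $\mu(B(x,2^{-k}))$ against $2^{-2k}+2^{-2\delta_k(x)-1}$. The test point $x=1/3$, the radius $2^{-8}$ and the truncation level are arbitrary.

```python
from fractions import Fraction

def atoms_in_ball(x, r, max_level):
    """All atoms (point, weight) of mu, truncated at `max_level`, lying in B(x, r)."""
    hits = []
    if abs(x) < r:                                    # level-0 atom at 0, weight 1/2
        hits.append((Fraction(0), Fraction(1, 2)))
    for n in range(1, max_level + 1):
        w = Fraction(1, 2 ** (2 * n + 1))             # weight 2^(-2n-1) on D_n
        for a in range(-2 ** n + 1, 2 ** n, 2):       # odd numerators: the set D_n
            q = Fraction(a, 2 ** n)
            if abs(q - x) < r:
                hits.append((q, w))
    return hits

k = 8
x = Fraction(1, 3)                                    # not dyadic, so x carries no atom
r = Fraction(1, 2 ** k)

hits = atoms_in_ball(x, r, max_level=14)
mu_ball = sum(w for _, w in hits)

coarse = [q for q, _ in hits if q.denominator < 2 ** k]   # atoms of level n < k
assert len(coarse) == 1                                   # claim (1)
delta = coarse[0].denominator.bit_length() - 1            # its level, read off as delta_k(x)
formula = Fraction(1, 2 ** (2 * k)) + Fraction(1, 2 ** (2 * delta + 1))

print(float(mu_ball), float(formula))   # agree up to the ~2^-23 truncation error
```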

However, Hurwitz's theorem tells us that any irrational $x$ can be approximated by rationals very closely: there are infinitely many $p,q$ such that $$\left|x-\frac{p}{q}\right|<\frac{1}{\sqrt{5}q^2}$$ This suggests that, if we place atoms at not only the dyadic rationals, but all rationals, then we might be able to force $\delta$ to grow at a reasonable rate. But when I try to write down such a measure, I find calculations of the sort (1-3) are beyond my ability. Perhaps you, dear reader, can complete them, bound $\delta$, and solve the problem.
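(As a small aside, a sketch illustrating Hurwitz-quality approximation via continued-fraction convergents; the helper `convergents` is mine and plays no role in any construction above:)

```python
from fractions import Fraction
from math import sqrt

def convergents(x, n_terms):
    """First few continued-fraction convergents p/q of a (floating-point) x."""
    p_prev, q_prev = 1, 0
    a = int(x)
    p, q = a, 1
    frac = x - a
    out = [Fraction(p, q)]
    for _ in range(n_terms - 1):
        if frac == 0:
            break
        x = 1.0 / frac
        a = int(x)
        frac = x - a
        p, p_prev = a * p + p_prev, p
        q, q_prev = a * q + q_prev, q
        out.append(Fraction(p, q))
    return out

x = sqrt(2) - 1                        # an irrational in (0, 1)
for c in convergents(x, 8):
    p, q = c.numerator, c.denominator
    # Hurwitz: infinitely many p/q satisfy |x - p/q| < 1/(sqrt(5) q^2);
    # for this particular x every convergent printed does.
    print(p, q, abs(x - p / q) < 1 / (sqrt(5) * q * q))
```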


Consider $X$ to be a standard normal random variable (any continuous distribution will work). Also, let $Y$ be a random variable taking values in a countable set, say $S$. Then $P(Y \in S)=1$ by definition, while $P(X \in S)=0$. This shows that for every approximating discrete random variable there will be a Borel set $S$ on which the approximation fails.

Suman Chakraborty
  • As I said to Clement above, I have now updated the question with vaguely the kind of thing I was interested in. This question was a matter of curiosity not need, so I did not put as much effort into its formulation as I should have! To summarise the changes, the error bound could be a function of the size of the epsilon ball under consideration, as long as it remains non-trivial. – cfp Jun 25 '21 at 08:25

Posting answer-in-progress so I don't miss the bounty.

In my first other answer, I pointed out that most of the requirements you've placed are impossible to satisfy. I don't know a good requirement, but some offhand remarks I made in the comments on that post inspired a good way to think about such requirements.

Consider the Banach space $l^1([0,1])$, the space of functions on $[0,1]$ that are summable with respect to counting measure. This is an unusual Banach space (it's not separable!), but it pairs (in the sense of dual vector spaces) with $C([0,1])$, as follows: $$\langle s,f\rangle_{l^1\times C}=\sum_{x\in\operatorname{supp}{s}}{s(x)f(x)}$$ Suffice it to say, these are your (signed) measures of countable support. The pairing gives two different injections $C(X)\overset{P}{\hookrightarrow}l^1([0,1])^*\cong l^{\infty}([0,1])$ and $l^1([0,1])\hookrightarrow C([0,1])^*$. Each map causes the other to factor through the double-dual: $$l^1([0,1])\overset{J}{\hookrightarrow}l^{\infty}([0,1])^*\overset{P^*}{\twoheadrightarrow}C([0,1])^*$$ ($P^*$ is surjective by Hahn-Banach.)
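(A toy rendering of the pairing, added for concreteness and not part of the argument: a countably supported $s$ can be stored as a map from points to weights, and pairing with a continuous $f$ is just the weighted sum.)

```python
# A countably supported (signed) measure s on [0,1], stored as point -> weight;
# pairing it with a continuous f is the weighted sum <s, f> = sum_x s(x) f(x).
def pair(s, f):
    return sum(weight * f(x) for x, weight in s.items())

# Example: an atomic probability measure paired with f(x) = x^2.
s = {0.0: 0.25, 0.5: 0.5, 1.0: 0.25}
print(pair(s, lambda x: x ** 2))   # 0.25*0 + 0.5*0.25 + 0.25*1 = 0.375
```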

The dual spaces I've written as duals are also known: $(l^{\infty})^*\cong\text{ba}$ and $C(X)^*\cong\text{rca}$, where ba is the space of bounded, finitely-additive Borel measures and rca the space of regular Borel measures (both under the total variation norm). Identifying, we have $$l^1([0,1])\overset{J}{\hookrightarrow}\text{ba}([0,1])\overset{P^*}{\twoheadrightarrow}\text{rca}([0,1])$$

Now, Lebesgue measure is certainly regular Borel, so the preimage $$(P^*)^{\leftarrow}(\{\lambda\})$$ is an affine subspace of ba. Of course, $\lambda\in\text{ba}([0,1])$ too, so that we can abuse notation and write the confusing $$\lambda\in(P^*)^{\leftarrow}(\{\lambda\})$$ We also know that there are finitely supported measures $\{R_n\}_n$ ($R$ for "Riemann"…) such that $(P^*\circ J)(R_n)\rightharpoonup\lambda$. Thus $\text{im}{(J)}$ cannot be strictly separated from $(P^*)^{\leftarrow}(\{\lambda\})$. Unfortunately, I think my other answers rule out these two spaces intersecting.
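(The weak convergence $(P^*\circ J)(R_n)\rightharpoonup\lambda$ is just the statement that Riemann sums converge to the integral; a minimal numerical sketch, with uniform partitions and a smooth test function of my choosing:)

```python
import math

def pair_with_riemann_measure(f, n):
    """Pair f with R_n, the measure putting mass 1/n at each point k/n, k = 1..n."""
    return sum(f(k / n) for k in range(1, n + 1)) / n

f = math.sin
for n in (10, 100, 1000, 10000):
    # Tends to the integral of sin over [0,1], i.e. 1 - cos(1) ~ 0.4597
    print(n, pair_with_riemann_measure(f, n))
```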

What one might hope to do is find another space $B$ that sits "in between" $l^1([0,1])$ and $C(X)$ in a similar way, but such that the two spaces intersect. Then hopefully one can tweak the topology on $l^1$ to ensure that $R_n$ converges to a measure in this intersection. But I haven't found such a construction yet.

For example, if one were even more lucky, then the entire construction would arise as a dual. This is impossible, at least for the predual I know. The predual I'm thinking of for $l^1([0,1])$ is $$c_0([0,1])=\left\{f:[0,1]\to\mathbb{R}\;\middle|\;\forall\epsilon>0\,\exists\text{ finite }S\subseteq[0,1]:\sup_{x\notin S}{|f(x)|}<\epsilon\right\}$$

Suppose $$C(X)\overset{P}{\hookrightarrow}B\overset{Q}{\twoheadrightarrow}c_0([0,1])$$ This dualizes to $$l^1([0,1])\overset{Q^*}{\hookrightarrow}B^*\overset{P^*}{\to}C(X)^*$$

Let $\{R_{\alpha}\}_{\alpha}$ arise from any net of partitions of $[0,1]$ refining itself unto $[0,1]$ and let the set of all those partition points be $U(\{R_{\alpha}\}_{\alpha})$. Since $(P^*\circ Q^*)(R_\alpha)\rightharpoonup\lambda$, we should have $$R_\alpha(QPf)\to\int{f\,d\lambda}$$ Note that $QPf$ and $\int{f\,d\lambda}$ are fixed! Since the integral is fixed, $QPf$ must preserve the sum $\sum_{x\in U(\{R_{\alpha}\}_{\alpha})}{f(x)}$ (at least modulo some finite set of $x$). Since $QPf$ is fixed, $QPf$ cannot depend on $R_{\alpha}$. But then $QPf$ cannot change $f$ on more than a countable number of points in the interval, and $f\notin c_0$.