
In statistics and machine learning, we often see expressions like (e.g. it is used in [2], written by very important statisticians)

$$\mathbb{E}_{q(x)} \left[ \log p(x) \right] \tag{0} \label{0}$$

which is apparently supposed to mean

$$\mathbb{E}_{q(x)} \left[ \log p(X) \right] \tag{1} \label{1}$$

where $X$ is some random variable. Expectations take random variables as inputs, and the lowercase letter in $\log p(x)$ inside the expectation (\ref{0}) suggests that $\log p(x)$ is not a random variable, whereas $\log p(X)$ is more descriptive and suggestive: it indicates the random variable obtained by composing $\log$, $p$ and $X$.

Now, the expectation (\ref{1}) is with respect to the p.d.f. $q$, so we can write it as follows

$$\mathbb{E}_{q(x)} \left[ \log p(X) \right] = \int q(x) \left( \log p(x) \right) dx$$

Inside the integral, $x$ is a dummy variable, i.e. it's not a random variable or a realization of a random variable.
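
To make this concrete for myself, here is a minimal numerical sketch; the specific Gaussian choices for $q$ and $p$ are mine and purely illustrative. Drawing $X \sim q$ and averaging $\log p(X)$ over the draws should match the integral above.

```python
# Minimal sketch: q and p are concrete Gaussian densities chosen only for
# illustration (they are not from the text above).
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(0)
q = stats.norm(loc=0.0, scale=1.0)   # density q: the distribution X is drawn from
p = stats.norm(loc=1.0, scale=2.0)   # density p: the function applied to the draws

x_draws = q.rvs(size=200_000, random_state=rng)   # realizations of X ~ q
mc_estimate = np.mean(p.logpdf(x_draws))          # Monte Carlo average of log p(X)

# The same quantity written as the integral  \int q(x) log p(x) dx:
integral, _ = quad(lambda x: q.pdf(x) * p.logpdf(x), -np.inf, np.inf)

print(mc_estimate, integral)   # the two numbers agree up to sampling noise
```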

However, I don't understand what the relationship between

  1. $\log p(x)$ inside the integral $\int q(x) \left( \log p(x) \right) dx$, and

  2. the random variable $\log p(X)$ inside the expectation $\mathbb{E}_{q(x)} \left[ \log p(X) \right]$

is.

Does the random variable $\log p(X)$ have pdf $\log p(x)$? What about $X$? Does it have pdf $q$ or $\log p(x)$, or maybe $p$ (if it's a pdf)?

The answer to this question, "Can we really compose random variables and probability density functions?" (which I asked), says that we can compose random variables and pdfs, but when exactly can we do it?

  • Consider a (real valued) random variable $X$ and $Y=f(X)$ for some $f$. As long as $f$ is Borel measurable, $Y$ is a RV. For the case needed in entropy calculations, we have $f(x):=\log f_X(x)$ for densities or $f(x):=\log \mathbb{P}(X=x)$ for PMFs. For each outcome $\omega$ we get a value $X(\omega)$, which can be plugged into the function $f(\cdot)$. Then by LOTUS, $\mathbb{E}(f(X))=\int f(x) f_X(x) \, dx$ in the continuous case, for example. – Nap D. Lover Jul 26 '20 at 16:54
  • @NapD.Lover Ooookey... and? I mean, I don't see how this answers my question(s). Maybe you need to be more explicit. –  Jul 26 '20 at 16:55
  • @NapD.Lover Ok, but what is the pdf of $f(X)$ and the pdf of $X$? –  Jul 26 '20 at 17:00
  • Let $f_X(x)$ be the PDF of $X$ and $Y=f(X)$. If $f$ is invertible, then the standard CDF technique yields, from $\mathbb{P}(f(X)\leq y)$, that $f_Y(y)=f_X(f^{-1}(y)) \cdot (f^{-1}(y))'$. – Nap D. Lover Jul 26 '20 at 17:03
  • and LOTUS states that $\mathbb{E}(Y)=\int y f_Y(y) dy=\int f(x) f_X(x) dx$ for $Y=f(X)$, where $f$ is Borel, so we only need knowledge of the pdf of $X$ to compute expectations of functions of $X$. – Nap D. Lover Jul 26 '20 at 17:07
  • @NapD.Lover Feel free to provide a formal answer below. I would appreciate if you could also provide an example when $X$ has Gaussian density, show what the other random variables have as densities as a consequence of $X$ having Gaussian pdf. –  Jul 26 '20 at 17:15
  • certainly, I’ll try to get it written up shortly. – Nap D. Lover Jul 26 '20 at 17:23

1 Answer


In short, the fact that $$\mathbb{E}(\log f_X(X))=\int_\mathbb{R} \log (f_X(x))\, f_X(x) \, dx$$ is just an application of LOTUS together with a strict adherence to the convention of uppercase letters for RVs and lowercase letters for the values they take on (a convention not every author follows equally).
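
As a quick sanity check of that identity, here is a minimal numerical sketch; the standard normal choice of $f_X$ is mine, purely for illustration, and both sides come out to the negative differential entropy $-\tfrac12\log(2\pi e)$.

```python
# Sketch of E[log f_X(X)] = \int log(f_X(x)) f_X(x) dx for a standard normal X
# (a choice made only for illustration).
import numpy as np
from scipy import stats
from scipy.integrate import quad

X_dist = stats.norm(0.0, 1.0)
rng = np.random.default_rng(1)

# Left-hand side: apply the fixed function x -> log f_X(x) to draws of X.
lhs = np.mean(X_dist.logpdf(X_dist.rvs(size=500_000, random_state=rng)))

# Right-hand side: the LOTUS integral over the real line.
rhs, _ = quad(lambda x: X_dist.logpdf(x) * X_dist.pdf(x), -np.inf, np.inf)

# Both equal the negative differential entropy of N(0, 1).
print(lhs, rhs, -0.5 * np.log(2 * np.pi * np.e))
```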


Does the random variable $\log p(X)$ have pdf $\log p(x)$? What about $X$? Does it have pdf $q$ or $\log p(x)$, or maybe $p$ (if it's a pdf)?

Suppose $X$ is a continuous RV with PDF $f_X(x)$. In general, a standard but not always applicable way to find the PDF of a transformation $Y=h(X)$ of a random variable $X$, for some Borel function $h:\mathbb{R}\to\mathbb{R}$, is the CDF transformation method (sometimes called the inverse-CDF method). That is, provided $h$ is nice enough (invertible with a differentiable inverse), then $$f_Y(y)=f_X(h^{-1}(y))\,(h^{-1}(y))'.$$ This follows from $$F_Y(y):=\mathbb{P}(Y\leq y)=\mathbb{P}(h(X)\leq y)=\mathbb{P}(X\leq h^{-1}(y))=F_X(h^{-1}(y)),$$ and then using the chain rule. Depending on the specific choice of $h$, the computation of $f_Y(y)$ may be easy or difficult.

In the case of entropy computations we have $$h(x)=\log f_X(x),$$ so that if $f_X$ is invertible we have $$h^{-1}(y)=f^{-1}_X(e^y),$$ from which we get $$f_Y(y)=e^y \,(f^{-1}_X(e^y))',$$ where the rest of the computation depends on the nature of $f_X$.

A more general (and in my opinion better, more systematic) method for finding PDFs of transformations is outlined in this answer. Here we have also made the minor assumption that inverting $h$ does not change the direction of the inequality. For a more general discussion see this wikipedia page in addition to the LOTUS page; this is often called the Jacobian-transformation technique, or something similar. Fortunately, thanks to LOTUS it is not always necessary to know $f_Y(y)$ when $Y=h(X)$ in order to compute $\mathbb{E}(Y)=\mathbb{E}(h(X))$, as explained below.
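
For concreteness, here is a small numerical check of the transformation formula under an illustrative choice of my own: $X \sim \text{Exponential}(1)$ and the increasing map $h(x)=\log x$, so $h^{-1}(y)=e^y$ and the formula predicts $f_Y(y)=e^{-e^y}e^y$.

```python
# Numerical check of f_Y(y) = f_X(h^{-1}(y)) (h^{-1}(y))' for an increasing h.
# Concrete (illustrative) choice: X ~ Exponential(1), h(x) = log x, so
# h^{-1}(y) = e^y and the formula gives f_Y(y) = exp(-e^y) * e^y.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = stats.expon.rvs(size=500_000, random_state=rng)   # f_X(x) = e^{-x}, x > 0
y = np.log(x)                                         # Y = h(X)

hist, edges = np.histogram(y, bins=200, range=(-8, 3), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = np.exp(-np.exp(centers)) * np.exp(centers)   # f_Y at the bin centers

# Largest gap between the empirical histogram and the formula: small,
# and it shrinks as the sample size grows.
print(np.max(np.abs(hist - predicted)))
```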


For a general overview:

The following references Section 6.12 of D. Williams' Probability with Martingales. In measure-theoretic terms, given a probability triple $(\Omega, \mathscr{F}, \mathbb{P})$, a mapping $X:\Omega\to \mathbb{R}$ is a random variable if it is a measurable function on the sample space, and then the expectation (if it exists) is defined by $$\mathbb{E}(X):=\int_\Omega X(\omega)\, \mathbb{P}(d\omega)$$ (there are many variations of this notation). Of course, we almost never use this definition for computations.

Instead, if $h:\mathbb{R}\to\mathbb{R}$ is Borel, and we write $\Lambda_X(B):=\mathbb{P}(X\in B)$ for the law of $X$, where $B$ is a Borel subset of the reals, then $Y=h(X)$ is in $\mathcal{L}^1(\Omega, \mathscr{F}, \mathbb{P})$ if and only if $h\in \mathcal{L}^1(\mathbb{R}, \mathscr{B}, \Lambda_X)$, and then $$\mathbb{E}(h(X))=\int_{\mathbb{R}} h(x)\, \Lambda_X(dx),$$ which is essentially LOTUS. When $X$ possesses a density, the measure is $\Lambda_X(dx)=f_X(x)\,dx$ (here $dx$ is really an abuse of notation for $\text{Leb}(dx)$). The proof is in the referenced text and can be outlined as: verify the identity for indicator functions $h=\mathbb{1}_B$, use linearity to extend it to simple functions, use the MCT to extend it to non-negative Borel $h$, and use linearity once more for arbitrary Borel $h:\mathbb{R}\to\mathbb{R}$.


Toy Example

I only have time for a simple example: let $X$ have density $f_X(x)=2x\, \mathbb{1}_{0<x<1}$ and $Y=\log (f_X(X))$. Then the inverse of $f_X$ on $y \in (0,2)$ is $f_X^{-1}(y)=y/2$, and by the above formula $f_Y(y)=\frac 12 e^{2y}\, \mathbb{1}_{-\infty <y<\log 2}$. So we get $$\mathbb{E}(Y)=\int_{-\infty}^{\log 2} \frac y2 e^{2y}\, dy =\log 2 - \frac 12=\int_0^1 \log(2x)\, 2x \, dx=\mathbb{E}(\log f_X(X)).$$
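
And a quick numerical confirmation of the toy example (the inverse-CDF sampling step $X=\sqrt{U}$ is just one convenient way to simulate the density $2x$ on $(0,1)$):

```python
# Numerical confirmation of the toy example.  Since F_X(x) = x^2 on (0, 1),
# X = sqrt(U) with U ~ Uniform(0, 1) has density f_X(x) = 2x.
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(3)
x = np.sqrt(rng.uniform(size=1_000_000))   # draws with density 2x on (0, 1)
mc = np.mean(np.log(2 * x))                # Monte Carlo estimate of E(log f_X(X))

lotus, _ = quad(lambda t: np.log(2 * t) * 2 * t, 0, 1)                    # \int_0^1 log(2x) 2x dx
via_f_Y, _ = quad(lambda s: s * 0.5 * np.exp(2 * s), -np.inf, np.log(2))  # \int y f_Y(y) dy

print(mc, lotus, via_f_Y, np.log(2) - 0.5)   # all four agree (up to Monte Carlo error)
```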

Sorry for the length; hopefully this is not too rambling (I tried to provide a general answer as well as some specific responses; if you think I should edit it down, feel free to suggest so). Of course, please let me know if you have any questions, comments, or corrections.

Nap D. Lover
  • Thanks for the answer! It's really helpful! To make sure: what do you mean by the notation $f_Y(y)=f_X(h^{-1}(y))(h^{-1}(y))'$? Here $h^{-1}$ is the inverse of $h$, but what is the $'$, and why does $(h^{-1}(y))$ appear twice? –  Jul 26 '20 at 21:39
  • @nbro No problem. The $'$ refers to Newton's prime notation for differentiation, so $(h^{-1}(y))'$ is just the derivative of the inverse function of $h$. The inverse appears twice by the chain rule: once as the argument plugged into $f_X(\cdot)$, and this is multiplied by $(h^{-1}(y))'$. It may help to compare with a generic statement of the chain rule for two nice functions $w$ and $g$: $(w(g(x)))'=w'(g(x))g'(x)$. In the case above, $w=f_X$ and $g=h^{-1}$. – Nap D. Lover Jul 26 '20 at 23:48
  • Also, if I use the notation $p(X)$ to denote a r.v. inside the expectation (where $p(X)$ means the composition of $p$ and $X$), what happens if I am considering a conditional (or joint) density $p$, i.e. $p$ is a conditional density of two variables where one is given (i.e. constant), i.e. $p(y \mid x = x_i)$. How would you create a r.v. now? You cannot mix the notations $p(y \mid x = x_i)$ and $p(X)$, because $(\cdot)$ means different things in these two cases. What's your suggestion? –  Jul 26 '20 at 23:49
  • @nbro allow me to switch notations, indulgently. If $f(y|x)$ is the conditional density of some RV $Y$ conditional on $X=x$, then we can define a new RV by $Z_x=\log f(Y|x)$ for each fixed $x$, just as before but now with $h_x(\cdot)=\log f(\cdot|x)$. Then for each $x$ we have $\mathbb{E}(Z_x|X=x)=\int \log (f(y|x)) f(y|x)\, dy=:g(x)$, say, since it depends only on $x$. Then by the law of total expectation, $\mathbb{E}(Z_X)=\mathbb{E}(g(X))$ (where now the expectation is with respect to $X$). I'm not entirely sure if that's what you had in mind, though. – Nap D. Lover Jul 27 '20 at 00:07
  • But your notation would then become $\mathbb{E}(Z_X \mid X=x) = \mathbb{E}(\log f(Y \mid x) \mid X=x)$. This doesn't look right. –  Jul 27 '20 at 00:13
  • @nbro I admit it is a bit awkward or clunky (which is why I wrote out the integral explicitly and then labeled it as $g(x)$ since the $y$ variable is integrated out completely) but I can’t see any mistake or contradiction that would follow, at least not at this hour, as long as you are careful about which variables are fixed as conditions and which are being integrated out. Can you clarify what doesn’t look right or perhaps you have an example where things go wrong? (Or you might consider to post it as a new question and someone wiser may come along who can inform both of us). – Nap D. Lover Jul 27 '20 at 01:08
  • For example, look at equation 13 of this paper. In $p(z, x)$, $x$ is the data, so it is given, but then I don't understand why their manipulations (even just after that equation 13) can be done, also according to your explanations. You use conditional expectations, but they don't use this. I am just trying to understand why the notation in this paper makes sense mathematically. –  Jul 27 '20 at 01:41
  • @nbro there, $q$ is an approximation to $p(z|x)$, and going from equation $11$ to $13$ is just rearrangement and substitution, using the fact that $p(z|x)=p(z,x)/p(x)$ and that $\log (ab) =\log a + \log b$; the unnumbered equation following $13$ just rearranges equation $11$ and substitutes into $13$. I haven't read the whole paper so I do not have the whole context, of course, but these manipulations at least appear to be algebra on the formula for the KL divergence between $q$ and $p(\cdot|x)$ together with the definition of the ELBO. – Nap D. Lover Jul 27 '20 at 18:42
  • That's not what I am asking. I am asking about random variables inside the expectations, i.e. why do those terms inside the expectations make sense at all. I understand the manipulations (i.e. the algebra). –  Jul 27 '20 at 18:43
  • @nbro my apologies. I don’t think I can really help clear up the confusion at this point. – Nap D. Lover Jul 27 '20 at 19:20