
Linear algebra states the Schwarz inequality as $$\lvert\mathbf x^\mathrm T\mathbf y\rvert\le\lVert\mathbf x\rVert\lVert\mathbf y\rVert\tag 1$$ Probability theory, however, states it as $$(\mathbf E[XY])^2\le\mathbf E[X^2]\mathbf E[Y^2]\tag 2$$ By comparing $\lvert\sum_i x_iy_i\rvert\le\sqrt{\sum_i x_i^2\sum_i y_i^2}$ with $\lvert\sum_y\sum_x xyp_{X,Y}(x,y)\rvert\le\sqrt{\sum_x x^2p_X(x)\sum_y y^2p_Y(y)}$, we see that $(1)$ and $(2)$ are equivalent when $p_{X,Y}(x,y)=\begin{cases}\frac1n&\text{if $x=x_i$ and $y=y_i$ for $i\in\{1,2,\cdots,n\}$}\\0&\text{otherwise}\end{cases}$. Thus, $(2)$ can be thought of as a more general form of the inequality.
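
To make the comparison concrete, here is a small numerical sketch (the vectors are made up and numpy is only a convenient choice): the joint pmf that puts mass $\frac1n$ on each pair $(x_i,y_i)$ turns the two sides of $(2)$ into the two sides of $(1)$, up to a factor of $n^2$.

```python
import numpy as np

# Made-up paired samples (x_i, y_i); each pair gets probability 1/n.
x = np.array([1.0, -2.0, 3.0, 0.5])
y = np.array([2.0, 1.0, -1.0, 4.0])
n = len(x)

# Moments under the empirical joint pmf p_{X,Y}(x_i, y_i) = 1/n.
E_XY = np.sum(x * y) / n      # E[XY]  = (x^T y) / n
E_X2 = np.sum(x ** 2) / n     # E[X^2] = (x^T x) / n
E_Y2 = np.sum(y ** 2) / n     # E[Y^2] = (y^T y) / n

# (2): (E[XY])^2 <= E[X^2] E[Y^2]
assert E_XY ** 2 <= E_X2 * E_Y2
# Multiplying both sides by n^2 recovers the square of (1).
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)
```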

Another way to think about this is to compare $\lvert\cos\theta\rvert=\frac{\lvert\mathbf x^\mathrm T\mathbf y\rvert}{\lVert\mathbf x\rVert\lVert\mathbf y\rVert}\le1$ with $\lvert\rho\rvert=\frac{\lvert\mathbf{cov}(X,Y)\rvert}{\sqrt{\mathbf{var}(X)\mathbf{var}(Y)}}\le1$. The former is exactly $(1)$, while the latter becomes $(2)$ only when $\mathbf E[X]=\mathbf E[Y]=0$. In some sense, we can view $\mathbf x^\mathrm T\mathbf y$ as a special form of $\mathbf{cov}(X,Y)$. Then, it follows that $\mathbf x^\mathrm T\mathbf x$ is a form of $\mathbf{var}(X)$ and $\lVert\mathbf x\rVert$ is a form of $\sqrt{\mathbf{var}(X)}$.
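
A quick numerical illustration of this distinction (again with made-up vectors): $\lvert\rho\rvert\le1$ holds for any $\mathbf x$ and $\mathbf y$ viewed as uniform random variables, but $\rho$ coincides with $\cos\theta$ only after both vectors are centered.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # made-up, not zero-mean
y = np.array([2.0, 7.0, 1.0, 8.0, 2.0])

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Treat x and y as uniform random variables over the n outcomes.
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
rho = cov / (np.std(x) * np.std(y))        # np.std uses the population variance

assert abs(cos_theta) <= 1 and abs(rho) <= 1
# rho equals cos(theta) for the *centered* vectors, not for x and y themselves.
xc, yc = x - x.mean(), y - y.mean()
assert np.isclose(rho, (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))
```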

What is the special form of $\mathbf E[X]$ and how do we understand $\mathbf E[X]=\mathbf E[Y]=0$ in linear algebra? With $p_{X,Y}$ defined above, we have $\mathbf E[XY]=\frac{\mathbf x^\mathrm T\mathbf y}n$, but $\mathbf{cov}(X,Y)\ne\mathbf E[XY]$ unless $\mathbf E[X]=0$ or $\mathbf E[Y]=0$. How can we obtain a relation between $\mathbf{cov}(X,Y)$ and $\mathbf x^\mathrm T\mathbf y$?

W. Zhu
    Not "the same as", rather "a particular case of" (can you spot how?). – Did Jan 13 '19 at 13:01
  • @Did The two inequalities are equivalent when $p_{X,Y}(x,y)=\begin{cases}\frac1n&\text{if $x=x_i$ and $y=y_i$ for $i\in\{1,2,\cdots,n\}$}\\0&\text{otherwise}\end{cases}$! – W. Zhu Jan 14 '19 at 02:59
  • Thus, question solved? – Did Jan 14 '19 at 11:26
  • @Did I have one more question. If we write $\mathbf{cov}(X, Y)$ as $\mathbf x^\mathrm T\mathbf y$, then $\lvert\rho\rvert\le1$ becomes $\lvert\cos\theta\rvert\le1$. But we need to set $\mathbf E[X]=\mathbf E[Y]=0$, which means that the components of each of $\mathbf x$ and $\mathbf y$ average to zero. Shouldn't the inequality hold for all vectors $\mathbf x$ and $\mathbf y$? – W. Zhu Jan 14 '19 at 15:07
  • I don't understand the downvote, as is often the case when there's no comment accompanying it. Anyway, there's a recent question on the covariance which addresses exactly the doubts of this post. – Giuseppe Negro Jan 15 '19 at 13:32

2 Answers


After reading J.G.'s answer and giving it some thought, I have arrived at a satisfactory answer. I will post my thoughts below.

Let $\mathbf x\in\Bbb R^n$ denote a discrete uniform random variable, with each component corresponding to one outcome. Then $\mathbf E[\mathbf x]$ is the average of the components, and $\mathbf E[\mathbf x]=0$ means that the components sum to zero. Thus, for zero-mean random variables, we can choose $n-1$ components freely and set the last component to $-\sum_{i=1}^{n-1}x_i$. These vectors form an $(n-1)$-dimensional subspace. We can bring any vector into this centered subspace $C$ by subtracting from each component the average of all the components.
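
A minimal sketch of this centering step (made-up numbers):

```python
import numpy as np

v = np.array([4.0, 7.0, 1.0, 8.0])        # any vector in R^n (made-up values)
x = v - v.mean()                          # bring it into the centered subspace C

assert np.isclose(x.sum(), 0.0)           # the components now sum (and average) to zero
assert np.isclose(x[-1], -x[:-1].sum())   # the last component is forced by the others
```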

Now we consider two vectors $\mathbf x$ and $\mathbf y$ in $C$. We can use a matrix to represent the joint distribution. Put $x_i$'s in the rows and $y_i$'s in the columns, and consider this joint distribution matrix: $$D= \begin{bmatrix} \frac1n&0&0&\cdots&0\\ 0&\frac1n&0&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&\frac1n \end{bmatrix}$$

This distribution is special because it puts equal weights on the diagonal entries and zero weight on the off-diagonal entries. We may call this the discrete uniform diagonal joint distribution. It is easily seen that $\mathbf x$ and $\mathbf y$ are discrete uniform but not independent ($\mathbf x$ being $x_i$ forces $\mathbf y$ to be $y_i$).

Under these assumptions, $\mathbf{cov}(\mathbf x, \mathbf y)=\frac{\mathbf x^\mathrm T\mathbf y}n$, $\mathbf{var}(\mathbf x)=\frac{\mathbf x^\mathrm T\mathbf x}n$, $\sigma_{\mathbf x}=\frac{\lVert\mathbf x\rVert}{\sqrt n}$ and $\rho=\frac{\mathbf{cov}(\mathbf x,\mathbf y)}{\sigma_{\mathbf x}\sigma_{\mathbf y}}=\frac{\mathbf x^\mathrm T\mathbf y}{\lVert\mathbf x\rVert\lVert\mathbf y\rVert}=\cos\theta$. When $\mathbf x$ and $\mathbf y$ are orthogonal vectors, they are uncorrelated random variables. Although they are linearly independent vectors, they are not independent random variables.
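
These identities are easy to verify numerically; the sketch below (made-up centered vectors) computes the moments directly from the joint distribution matrix $D$:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -2.0])   # made-up vectors in C (zero mean)
y = np.array([0.5, 4.0, -1.0, -3.5])
n = len(x)

D = np.eye(n) / n                      # discrete uniform diagonal joint distribution

E_XY  = x @ D @ y                      # sum_{i,j} x_i y_j D_{ij}
var_x = x @ D @ x
var_y = y @ D @ y

assert np.isclose(E_XY, x @ y / n)     # cov(x, y) = x^T y / n
assert np.isclose(var_x, x @ x / n)    # var(x)    = x^T x / n
assert np.isclose(np.sqrt(var_x), np.linalg.norm(x) / np.sqrt(n))

rho = E_XY / np.sqrt(var_x * var_y)
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(rho, cos_theta)      # orthogonal vectors  <=>  rho = 0
```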

Now we have a correspondence between covariance and dot product, standard deviation and length, correlation coefficient and the cosine of the angle between two vectors, and uncorrelatedness and orthogonality. Thus, Schwarz inequality $\lvert\cos\theta\rvert\le1$ matches $\lvert\rho\rvert\le1$.

Let us look at three more examples that connect linear algebra to probability theory (the first two are checked numerically in the sketch after the list):

  1. The triangle inequality $\lVert\mathbf x+\mathbf y\rVert\le\lVert\mathbf x\rVert+\lVert\mathbf y\rVert$ matches $\sigma_{X+Y}\le\sigma_X+\sigma_Y$.
  2. $(\mathbf x+\mathbf y)^\mathrm T(\mathbf x+\mathbf y)=\mathbf x^\mathrm T\mathbf x+\mathbf y^\mathrm T\mathbf y+2\mathbf x^\mathrm T\mathbf y$ matches $\mathbf{var}(X+Y)=\mathbf{var}(X)+\mathbf{var}(Y)+2\mathbf{cov}(X,Y)$.
  3. Pythagoras' theorem $\lVert\mathbf b\rVert^2=\lVert\mathbf p\rVert^2+\lVert\mathbf e\rVert^2$, with orthogonal projection $\mathbf p$ and error $\mathbf e=\mathbf b-\mathbf p$, matches $\mathbf{var}(\Theta)=\mathbf{var}(\hat\Theta)+\mathbf{var}(\tilde\Theta)$, where the estimator $\hat\Theta$ and the estimation error $\tilde\Theta=\Theta-\hat\Theta$ are uncorrelated. In fact, this is just the law of total variance $\mathbf{var}(\Theta)=\mathbf{var}(\mathbf E[\Theta|X])+\mathbf E[\mathbf{var}(\Theta|X)]$ with $\hat\Theta=\mathbf E[\Theta|X]$.
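
Here is the promised check of the first two correspondences under the same uniform diagonal model (made-up centered vectors; the third example needs a conditional model, so it is not checked here):

```python
import numpy as np

# Under the uniform diagonal joint distribution, X + Y corresponds to x + y.
x = np.array([2.0, -1.0, 0.0, -1.0])   # made-up centered vectors
y = np.array([1.0, 3.0, -4.0, 0.0])
n = len(x)

sigma = lambda v: np.linalg.norm(v) / np.sqrt(n)   # standard deviation
var   = lambda v: v @ v / n                        # variance
cov   = x @ y / n                                  # covariance

# 1. Triangle inequality  <->  sigma_{X+Y} <= sigma_X + sigma_Y
assert sigma(x + y) <= sigma(x) + sigma(y)
# 2. Expansion of the squared length  <->  var(X+Y) = var(X) + var(Y) + 2 cov(X,Y)
assert np.isclose(var(x + y), var(x) + var(y) + 2 * cov)
```
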
W. Zhu

Having gained more knowledge, I am posting an updated answer.

Actually, the Schwarz inequality of linear algebra $(1)$ is not a special form of the inequality of probability theory $(2)$; rather, it is the other way around. Seeing this requires viewing vectors more abstractly, not just as arrays of numbers.

In probability theory, every experiment has an outcome set $\Omega$; for simplicity, we assume that it is finite. A random variable is a function $X:\Omega\to\mathbb R$. Consider the set $V_\Omega$ of all random variables on $\Omega$. Note that constant random variables are also included, because they map every $\omega\in\Omega$ to the same real number. $V_\Omega$ is a vector space over $\mathbb R$ because the axioms are satisfied, with the zero random variable $0$ as the additive identity and $-X$ as the additive inverse of $X$. This means that every random variable is a vector in $V_\Omega$.
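
For a finite $\Omega$ this identification is very concrete: a random variable is just the array of its values, indexed by the outcomes. A minimal sketch (made-up values):

```python
import numpy as np

# Omega = {0, 1, 2, 3}; a random variable X : Omega -> R is the array of its values.
X = np.array([1.0, -2.0, 0.0, 5.0])    # X(omega) for each omega (made-up values)
Y = np.array([3.0, 3.0, -1.0, 2.0])
c = np.full(4, 7.0)                    # a constant random variable lives in V_Omega too

# Linear combinations are again functions Omega -> R, so V_Omega is closed under them.
Z = 2.0 * X + Y - c
assert np.allclose(X + (-X), 0.0)      # -X is the additive inverse; 0 is the identity
```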

Now, we can define a real inner product for $V_\Omega$ as $\langle X|Y\rangle=\mathbf E[XY]$ because it satisfies the axioms:

  1. Positive definiteness: $\mathbf E[X^2]\ge0\;\forall X\in V_\Omega$, with equality if and only if $X=0$ (for the "only if" part we need every outcome in $\Omega$ to have positive probability, or we identify random variables that are equal almost surely)
  2. Symmetry: $\mathbf E[XY]=\mathbf E[YX]\;\forall X,Y\in V_\Omega$
  3. Bilinearity: $\mathbf E[(X+Y)Z]=\mathbf E[XZ]+\mathbf E[YZ]\;\forall X,Y,Z\in V_\Omega$ and $\mathbf E[(aX)Y]=a\mathbf E[XY]\;\forall X,Y\in V_\Omega,a\in\mathbb R$ (vice versa by symmetry)

With this definition, the probabilistic form $(2)$ is exactly the Schwarz inequality for this inner product.
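
The following sketch (made-up outcome probabilities and values) spot-checks the axioms and the resulting Schwarz inequality for $\langle X|Y\rangle=\mathbf E[XY]$:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])     # P(omega) for each outcome (made up, all > 0)
X = np.array([1.0, -2.0, 0.0, 5.0])
Y = np.array([3.0, 3.0, -1.0, 2.0])

def inner(U, V):                       # <U|V> = E[UV] = sum_omega U(omega) V(omega) P(omega)
    return np.sum(U * V * p)

assert np.isclose(inner(X, Y), inner(Y, X))                            # symmetry
assert np.isclose(inner(2 * X + Y, Y), 2 * inner(X, Y) + inner(Y, Y))  # linearity
assert inner(X, X) >= 0                                                # positivity

# Schwarz inequality in this inner product space is exactly (2).
assert inner(X, Y) ** 2 <= inner(X, X) * inner(Y, Y)
```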

In the subspace of zero-mean random variables, there is an equivalence (checked numerically in the sketch after this list) between

  1. the inner product and the covariance
  2. the length and the standard deviation
  3. the cosine of the angle and the correlation coefficient
  4. orthogonality and uncorrelatedness
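
Here is the sketch referred to above: with made-up outcome probabilities, center two random variables and check the four correspondences directly.

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])      # made-up outcome probabilities
X = np.array([1.0, -2.0, 0.0, 5.0])
Y = np.array([3.0, 3.0, -1.0, 2.0])

mean  = lambda U: np.sum(U * p)         # E[U]
inner = lambda U, V: np.sum(U * V * p)  # <U|V> = E[UV]
sd    = lambda U: np.sqrt(mean(U ** 2) - mean(U) ** 2)

Xc, Yc = X - mean(X), Y - mean(Y)       # zero-mean versions of X and Y

cov = mean(X * Y) - mean(X) * mean(Y)
assert np.isclose(inner(Xc, Yc), cov)             # 1. inner product <-> covariance
assert np.isclose(np.sqrt(inner(Xc, Xc)), sd(X))  # 2. length <-> standard deviation

rho = cov / (sd(X) * sd(Y))
cos_angle = inner(Xc, Yc) / np.sqrt(inner(Xc, Xc) * inner(Yc, Yc))
assert np.isclose(rho, cos_angle)                 # 3. cosine <-> correlation coefficient
# 4. orthogonality <-> uncorrelatedness: rho = 0 exactly when <Xc|Yc> = 0
```
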
W. Zhu