1

I was going through the definitions and meaning of variance and covariance. The resources I have give only the definition and formula, without any insight.

For variance, I wrote down the formula and asked myself what it tells me. I figured out that variance has something to do with the spread about the mean. The term $(x_i-\bar x)^2$ in the formula for variance is the square of the distance between the $i^{th}$ observation and the mean.
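For concreteness, here is a small Python sketch (with made-up numbers) of how I read the formula: variance is the mean of the squared distances from the mean.

```python
# Made-up observations, just to see the formula in action.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)                          # x-bar
var = sum((x - mean) ** 2 for x in data) / len(data)  # mean of (x_i - x-bar)^2
print(mean, var)  # 5.0 4.0
```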

Now, moving further: if we get data on two attributes, we can plot a scatter diagram. I compared this situation with the notion of centre of mass in 2D. The mean of the scatter plot will be $(\bar x,\bar y)$. Covariance extends the idea of variance to higher dimensions (this was stated in my book).

$$\operatorname{Cov}(X,Y)=\mathbb{E}[(x_i-\bar x)(y_i-\bar y)]$$

Above is the formula for covariance. I could not understand: why this particular formula?

Second, if we have a point $(x_i,y_i)$ in the scatter plot and the mean is $(\bar x,\bar y)$, then the square of the distance between them is $(x_i-\bar x)^2+(y_i-\bar y)^2$. So I thought that if we are generalising the concept of variance to two dimensions, the formula should be:

$$\operatorname{Cov}(X,Y)=\mathbb{E}[(x_i-\bar x)^2+(y_i-\bar y)^2]=\operatorname{Var}(X)+\operatorname{Var}(Y)$$
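To spell out what my proposed formula computes, here is a small Python sketch with made-up paired data:

```python
# Made-up paired data: ys increases with xs, ys_rev decreases with xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
ys_rev = list(reversed(ys))

def proposed(x, y):
    # mean of (x_i - x-bar)^2 + (y_i - y-bar)^2
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) ** 2 + (b - my) ** 2 for a, b in zip(x, y)) / len(x)

print(proposed(xs, ys))      # 10.0 = Var(X) + Var(Y)
print(proposed(xs, ys_rev))  # 10.0 again, whichever way the trend goes
```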

**Summary of my problem**

In the formula for variance we have the "square of the distance between the $i^{th}$ observation and the mean", and covariance is supposed to be the same as variance but for two or more dimensions. So why are we not using the "square of the distance between the $i^{th}$ observation and the mean"? Why are we using something else, and in particular, why that formula?

Please help; I am struggling to digest this.

Singh
  • 2,108
  • The lines $x=\bar{x}$ and $y=\bar{y}$ divide the plane into four quadrants. The product $(x_i-\bar{x})(y_i-\bar{y})$ is positive when the i-th observation $(x_i, y_i)$ is in the first or third such quadrant; it's negative when the observation is in the second or fourth quadrant. Averaging these gives a measure of whether the data are roughly in a positive linear trend (from third quadrant to first quadrant) or in a negative linear trend (from the second quadrant to the fourth quadrant). – symplectomorphic Mar 04 '21 at 18:54
  • If $x_i,y_i$ are observed data then I wouldn't use the expected value operator. $Cov(x,y)=\frac1n \cdot \sum\limits_{i=1}^n (x_i-\overline x)\cdot (y_i-\overline y)$. The variance is similar: $Var(x)=\frac1n \cdot \sum\limits_{i=1}^n (x_i-\overline x)\cdot (x_i-\overline x)=\frac1n \cdot \sum\limits_{i=1}^n (x_i-\overline x)^2$ – callculus42 Mar 04 '21 at 19:01
  • @callculus Sir, this is the problem: why this particular formula for Cov(x,y), why not something else? How is this formula a generalisation of the formula for variance? It will be appreciated if you elaborate further. – Singh Mar 04 '21 at 19:06
  • Maybe a completely different, ungeometric understanding might help you better to accept the definition: the covariance of a random variable with itself should be its variance. Your formula doesn't accomplish this, while the usual definition does. – Vercassivelaunos Mar 04 '21 at 19:07
  • @Vercassivelaunos Sir, I know what I am thinking is certainly wrong. I just want to figure out why I am wrong and why the formula for Cov(x,y) is the way it is. – Singh Mar 04 '21 at 19:10
  • @Singh Covariance: you sum up the product of two factors: (I) the difference between the observed value $x$ and its mean, and (II) the difference between the observed value $y$ and its mean. If $x_i=y_i$, then you obtain the variance. "Why not something else?" Why should we use a different concept for the covariance? Btw, the return of your formula is always positive. This is not as informative as the usual covariance; the information about the direction would disappear. – callculus42 Mar 04 '21 at 19:15
  • @callculus Sir, I cannot understand how the concepts are the same. In the case of variance we have the "sum of squares of distances from the mean"; why do we not have something like this in covariance? – Singh Mar 04 '21 at 19:23
  • @callculus Sir, thank you for pointing out that my formula always gives a non-negative covariance. That is in line with the earlier notion of variance, which is always non-negative. – Singh Mar 04 '21 at 19:28
  • @Singh I've no better idea than replacing $x_i$ by $y_i$ (only once) to obtain the covariance. There are (probably) characteristic properties of the usual definition of cov. Look for them. You can try to find another expression which fulfils them. At the moment you haven't, since your cov is always positive. Do you see the advantage that the usual cov can be negative as well? You need to know whether the data are positively or negatively related. – callculus42 Mar 04 '21 at 19:31
  • @callculus Putting aside the advantages and applications of the formula, I just want to know why I am wrong. I am just thinking of variance and trying to build something parallel for two dimensions. As you have said, we can have other expressions; then why and how is the usual covariance formula the best among them all? Sir, please understand my problem; I seek some insight. – Singh Mar 04 '21 at 19:45
  • Covariance isn't meant to be parallel to variance in two dimensions in the first place. If we tried to define a variance for vector valued random variables, your definition would be fine. But we don't. We are talking about the covariance of two scalar random variables, which is something entirely different (see my answer). – Vercassivelaunos Mar 04 '21 at 19:51
  • This may be of interest. – J.G. Mar 04 '21 at 20:01
  • @Singh If you ask why the usual definition of the covariance is the best among others, you should first ask yourself why the usual definition of the variance is the best among others. Then the covariance is deduced straightforwardly from the variance, since $var(x)=cov(x,x)$. So look at the properties of the variance first. Have you done that already? – callculus42 Mar 04 '21 at 20:18

1 Answer

1

Covariance of two random variables $X$ and $Y$ is supposed to measure how they covary.

For instance, take two independent dice rolls, and let the result of the first roll be $X$ and the sum of both rolls be $Y$. If $X$ is higher than expected, then $Y$ is probably also higher than expected. If $X$ is lower than expected, then $Y$ is probably also lower than expected. We want this to lead to a positive covariance.

For another example, we again consider two independent dice rolls; $X$ is still the first roll, but $Y$ is now the second roll minus the first roll. This time, if $X$ rolls higher than expected, $Y$ is probably lower than expected, and if $X$ is lower than expected, then $Y$ is probably higher than expected. $X$ and $Y$ vary in opposite directions, so to speak. We want this to lead to a negative covariance. Positive covariance means that the two random variables vary in the same direction, while negative covariance means that they vary in opposite directions. For this, covariance must allow for negative values, which your formula does not. Basically, yours just measures how much each of them varies on its own, not how they covary.
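A quick simulation sketch (in Python, with an arbitrary seed and sample size) illustrates both dice examples, using the population covariance of the simulated data:

```python
import random

random.seed(0)
n = 100_000
first = [random.randint(1, 6) for _ in range(n)]   # X: the first roll
second = [random.randint(1, 6) for _ in range(n)]  # an independent second roll

def cov(x, y):
    # population covariance of two equal-length samples
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

total = [a + b for a, b in zip(first, second)]  # Y = sum of both rolls
diff = [b - a for a, b in zip(first, second)]   # Y = second roll minus first

print(cov(first, total))  # positive (close to Var(first) = 35/12 ≈ 2.92)
print(cov(first, diff))   # negative (close to -35/12 ≈ -2.92)
```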

Now, for why this particular formula was chosen, there are multiple reasons. For one, it would be nice if the covariance of a random variable with itself were just its variance: it covaries with itself exactly the way it varies in general. The usual definition accomplishes this: $\operatorname{Cov}(X,X)=\operatorname{Var}(X)$. Another reason is that the usual definition has some pretty nice algebraic properties. It is linear in both arguments. It is also symmetric and positive semidefinite, and definite in the sense that only almost surely constant random variables have zero covariance with themselves. Taken together, this means that covariance is an inner product on suitable spaces of random variables, with all the useful results that brings, like the Cauchy-Schwarz inequality.
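These properties are easy to sanity-check numerically. A small Python sketch with made-up data, again using the population covariance:

```python
# Made-up samples, just to check the algebraic properties.
xs = [1.0, 3.0, 2.0, 5.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

def cov(x, y):
    # population covariance of two equal-length samples
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def var(x):
    return cov(x, x)  # Cov(X, X) = Var(X)

assert cov(xs, xs) == var(xs)      # covariance with itself is the variance
assert cov(xs, ys) == cov(ys, xs)  # symmetry
scaled = [2 * a for a in xs]
assert abs(cov(scaled, ys) - 2 * cov(xs, ys)) < 1e-12  # linearity in one slot
print("properties verified")
```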

Vercassivelaunos
  • 13,226
  • 2
  • 13
  • 41
  • I am trying to understand, but I am unable to grasp it completely, and I have a new problem now. Sir, you are saying that covariance is not parallel to variance, and yet you also want Cov(X,X)=Var(X). As an example, I am constructing a plane with both axes as the X-axis. Say the mean of X is $\bar x$ and $x_i$ is any observation. Then clearly the distance (spread) between $\bar x$ and $x_i$ is less than the distance (spread) between $(\bar x, \bar x)$ and $(x_i,x_i)$. If this is so, then why do we want Cov(X,X) to equal Var(X)? – Singh Mar 05 '21 at 04:31
  • Because the spread between $(\bar x,\bar x)$ and $(x_i,x_i)$ is not what covariance is supposed to measure. Yes, that spread would be larger than the spread between $\bar x$ and $x_i$, but that's beside the point, because it's not the covariance. Covariance is supposed to measure how two random variables depend on each other, not just their spread. – Vercassivelaunos Mar 05 '21 at 07:47