On Wikipedia, in reference to generalized linear models, I read:

Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable has a normal distribution (intuitively, when a response variable can vary essentially indefinitely in either direction with no fixed "zero value", or more generally for any quantity that only varies by a relatively small amount, e.g. human heights).

I think I understand intuitively that if the errors, after you do a fit with ordinary least squares, are normally distributed, then OLS was likely a good model. (It got the expectation correct, and the errors were normally distributed about the mean.)

But why does the dependent variable (response variable) itself need to be normally distributed? What does it matter if the variable only varies by a small amount? I think they mean the variance is low?

Frank
  • 880

1 Answer

It doesn't need to be normally distributed. The paragraph merely says that if it is normally distributed, then the assumptions in OLS are satisfied, with the (negative) log-likelihood as our loss function.
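To spell that out, here is a sketch under the usual textbook assumption (not stated in the quoted paragraph) that $y_i = x_{i,1}\beta_1+\dots+x_{i,p}\beta_p+\varepsilon_i$ with independent $\varepsilon_i\sim N(0,\sigma^2)$. The symbols $\varepsilon_i$, $\sigma$ and $n$ (the number of observations) are introduced here only for illustration. The log-likelihood is

$$\log p(y\mid\beta,\sigma)=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_{i,1}\beta_1-\dots-x_{i,p}\beta_p)^2,$$

so for fixed $\sigma$, maximising it over $\beta$ is the same as minimising the sum of squared residuals. Note that the normality used here is that of $y_i$ given the predictors, i.e. of the errors, not of the pooled $y$ values.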

Basically, in OLS we are minimising a quadratic loss function $L(\beta):=\sum_i(y_i-x_{i,1}\beta_1-\dots-x_{i,p}\beta_p)^2$. The $x_i$ don't need to be stochastic (for example, if you know $y$ is a complicated function of the $x$'s, and you know the exact values at known $x_i,y_i$, but due to computational resource constraints you can only fit linear models, then it still makes sense to come up with the best approximation that minimizes the sum of squared errors at these known values).
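As a rough illustration of the deterministic case described above, here is a minimal NumPy sketch. The choice of $y=e^x$, the grid of 20 points, and the names `X`, `loss` and `beta_hat` are all assumptions made up for this example. The first column of `X` plays the role of $x_{i,1}=1$ (an intercept).

```python
import numpy as np

# Hypothetical setup: y is a known, deterministic, nonlinear function of x.
x = np.linspace(0.0, 1.0, 20)
y = np.exp(x)  # no noise anywhere, so nothing here is normally distributed

# Design matrix: a column of ones (intercept) and the predictor itself.
X = np.column_stack([np.ones_like(x), x])

def loss(beta):
    """Sum of squared errors L(beta) for the linear model X @ beta."""
    r = y - X @ beta
    return float(r @ r)

# Least-squares solution: the beta that minimises L(beta).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Small random perturbations of beta_hat should never give a lower loss.
rng = np.random.default_rng(0)
perturbed = [loss(beta_hat + 0.01 * rng.standard_normal(2)) for _ in range(100)]

print("beta_hat:", beta_hat)
print("loss at beta_hat:", loss(beta_hat))
print("smallest perturbed loss:", min(perturbed))
```

The fit is still the best linear approximation in the sum-of-squares sense even though neither $x$ nor $y$ is random; normality only enters when you want to interpret that loss as a log-likelihood.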

user10354138
  • 33,239
  • The paragraph doesn't mention the assumptions of OLS, so maybe that's why I am confused. What are these assumptions that are satisfied when the dependent variable is normally distributed (not the error of the fitted variable, but the dependent variable itself)? – Frank Jun 05 '19 at 16:07
  • You would find the discussion on assumptions in wiki's article on OLS or linear least squares. – user10354138 Jun 05 '19 at 16:18
  • After reading, I did not see any explanation of why a normally distributed dependent variable is desired.

    If I had to guess: is it because, if the sample data are normally distributed, the points get weighted in such a way that the mean gives the minimum residual error? And if they are not normally distributed, there may be more points, say, to the right of the mean, and this would create bias?

    – Frank Jun 05 '19 at 18:10