
First of all, I had a look here and at a couple of other questions, but I couldn't find what I am looking for.

So my question is purely theoretical (although I have an example by my hands).

Suppose I have some data $(x_i,y_i)$ for $i=1,\dots,n$. Suppose I fit the following models with IID $\epsilon_i \sim N(0, \sigma^2)$ for $i=1,\dots,n$:

  • $M_1: \log(y_i)= \beta_0+\beta_1x_i+\epsilon_i$
  • $M_2: \log(y_i)= \beta_0+\beta_1x_i+\beta_2x_i^2+\epsilon_i$
  • $M_3: \log(y_i)= \beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i$

Now I want to see which of these models is better, so I use the following (maybe weird, but stay with me) method, to evaluate their "predictive powers":

  1. Use $(x_i, \log(y_i))$ for $i=1,\dots,\frac{n}{2}$ to fit $M_1, M_2, M_3$ respectively.
  2. Use each fitted model ($M_1, M_2, M_3$ respectively) to predict the $y_i$'s from the $x_i$'s of the remaining $\frac{n}{2}$ data points, i.e. $i = \frac{n}{2}+1, \dots, n$ (careful: predict $y_i$, not $\log(y_i)$).
  3. Compute the Mean Absolute Error, $MAE = \frac{1}{n/2}\sum_{i=\frac{n}{2}+1}^{n}|y_i-\hat{y}_i|$, being careful that $\hat{y}_i$ is on the original scale of the values!
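The three steps above can be sketched end-to-end. This is a minimal illustration on synthetic data; the generating parameters, the use of `np.polyfit` on the log scale, and the variable names are all assumptions made for the sake of the example, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data consistent with M_1: log(y) = b0 + b1*x + noise.
n = 1000
x = rng.uniform(0.0, 2.0, n)
y = np.exp(1.0 + 0.5 * x + rng.normal(0.0, 0.3, n))

# Step 1: fit each model on (x_i, log(y_i)) for the first n/2 points.
half = n // 2
x_fit, y_fit = x[:half], y[:half]
x_val, y_val = x[half:], y[half:]

for degree in (1, 2, 3):                      # M_1, M_2, M_3
    coefs = np.polyfit(x_fit, np.log(y_fit), degree)

    # Step 2: predict log(y) for the held-out x, then back-transform.
    log_pred = np.polyval(coefs, x_val)
    y_pred = np.exp(log_pred)                 # prediction on the original scale

    # Step 3: MAE on the original scale.
    mae = np.mean(np.abs(y_val - y_pred))
    print(f"M_{degree}: MAE = {mae:.4f}")
```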

So now my question:

If I do step $1.$ and fit the three models (hence obtaining estimates for the parameters, their standard errors, etc.) and then use these estimates (respectively, of course!) to predict the responses for the remaining $x_i$'s:

  1. Will I be predicting $\log(y_i)$'s, right? And if that is true, is it also true that in order to get the $\hat{y}_i$'s, instead of the $\widehat{\log{(y)}}_i$'s, I should just take the exponential of those terms? So in general, is it true that $\hat{y}_i = e^{\widehat{\log{(y)}}_i}$?
  2. Once I find the three MAE's, how do I judge the models? Should I be looking for the one with smaller MAE?

EDIT

For example, suppose I have $1000$ data points. I use the first $500$ to fit model $M_1$. Once I've fitted it, I can predict new values, so I predict the responses for the remaining $500$ $x_i$'s. Of course, the predictions will be on the logarithmic scale, but I want to calculate the MAE on the original scale.

This is the context of my question, of course I would do this procedure for all the three models and compare the MAEs.


2 Answers


IMO which model is better will depend on many factors.

These include:

  1. The amount of data available for each $M_k$.
  2. The skewness / spread of the data for each $M_k$, examined e.g. via box plots.
  3. Plots of the errors (observed vs. expected) for each $M_k$.
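A rough sketch of how such diagnostics might be computed, assuming synthetic data and a degree-1 fit on the log scale; the box-plot quantities are reported as numbers rather than drawn, and all names and values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data on the log scale and a degree-1 (M_1-style) fit.
x = rng.uniform(0.0, 2.0, 200)
log_y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, 200)

coefs = np.polyfit(x, log_y, 1)
residuals = log_y - np.polyval(coefs, x)   # observed minus expected

# Spread and skewness of the residuals (the box-plot view, in numbers).
q1, med, q3 = np.percentile(residuals, [25, 50, 75])
skew = np.mean(residuals**3) / np.std(residuals)**3
print(f"IQR = {q3 - q1:.3f}, median = {med:.3f}, skewness = {skew:.3f}")
```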

These checks should be done first, in my opinion, since their results indicate which assumptions can be used in each model.

Answering your questions:

Will I be predicting $\log(y_i)$'s, right?

Yes, with what you have written.

Is it also true that in order to get the $\hat{y}_i$'s, instead of the $\widehat{\log{(y)}}_i$'s, I should just take the exponential of those terms? So in general, is it true that $\hat{y}_i = e^{\widehat{\log{(y)}}_i}$?

Not quite: for example, your first model $M_1$ is defined as

$$\log(y_i)=\beta_0+\beta_1x_i+\epsilon_i$$

Hence the point prediction is

$$\hat{y}_i = e^{\hat{\beta}_0+\hat{\beta}_1x_i} = e^{\hat{\beta}_0}\,e^{\hat{\beta}_1x_i},$$

where the error term $\epsilon_i$ is dropped, since it has mean zero and is not estimated for a new observation.

Once I find the three MAE's, how do I judge the models? Should I be looking for the one with smaller MAE?

Taking the one with the smaller MAE would make sense; however, I would also look for the model with the highest $R^2$.

Most importantly, to be able to use any of these models, they need to be significant. This is typically measured via p-values: depending on the hypothesis being tested, a p-value less than e.g. $0.05$ can be taken to indicate significance.

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

http://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/

  • Thank you for your great answer! However I think none of you two actually understood half of the question, my bad. The data is the same for both models! I have some data, I want to test 3 models, on the same data! – Euler_Salter Apr 18 '17 at 18:20
  • Regarding the second yellow box you wrote: then if I use $R$ to do my calculations and I fit those three models, then I will get fitted values of the logarithm, i.e. I will get $\widehat{\log{(y_i)}}$. So my question is: once I get these outputs from $R$, how do I find the actual fitted values $\hat{y}_i$? What should I do in order to find the fitted values in the original scale? – Euler_Salter Apr 18 '17 at 18:23
  • I am not familiar with the software $R$, however if you have a significant model, and have $\beta_0$, $\beta_1$, $\epsilon$, then for each $x_i$ you can calculate corresponding $\hat{y_i}$ by using formula I gave in my answer. – unseen_rider Apr 19 '17 at 16:24

@unseen_rider provides a great answer. This post adds some mathematical points which may be relevant.

Part of the theme concerns the dangers of logarithmic transforms. An example of how this affects the $L_{2}$ minimization problem is in Linearize $y=c+ax^b$.

But certainly, for a given $a$, $$ \big| y_{k} - a_{0} e^{a_{1}x} \big| \ne \big| \ln y_{k} - \ln a_{0} - a_{1}x \big| $$
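A quick numerical check of this inequality, with a single data point and arbitrarily chosen parameter values (all numbers here are purely illustrative):

```python
import numpy as np

# One data point and fixed parameters, chosen arbitrarily for illustration.
y_k, a0, a1, x = 5.0, 1.2, 0.4, 2.0

# Absolute error on the original scale vs. on the log scale.
err_original = abs(y_k - a0 * np.exp(a1 * x))
err_log = abs(np.log(y_k) - np.log(a0) - a1 * x)

print(err_original, err_log)   # the two errors differ
```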

Input

A sequence of $m$ measurements $\left\{ x_{k}, y_{k} \right\}_{k=1}^{m}$.

Models

Switching to natural logarithms as a personal choice, the models are a sequence like $$ \begin{align} y_{1} (x) &= a_{0} e^{a_{1}x} \\ % y_{2} (x) &= a_{0} e^{a_{1}x+a_{2}x^{2}} \\ % y_{3} (x) &= a_{0} e^{a_{1}x + a_{2}x^{2} + a_{3} x^{3}} \\ % \end{align} $$ The problem is to find the best solution vector in the $L_{1}$ norm.

Solution

Transformation distorts problem

As noted in the earlier post, the logarithmic transformation doesn't deliver a linear version of the original problem; it simply distorts the problem. The logarithmic form seems to hide this flaw in plain view. Colloquially, the logarithmic transformation provides an easy path to a point which is not the solution.

If we can get a data set, we can quantify this effect. Until then, here is an $L_{2}$ example. The white dot is the true minimum, the true least-squares solution. The yellow dot is the solution to the logarithmically transformed data set.

[Figure: the white (true least-squares) and yellow (log-transformed fit) minima.]

Increasing order of fit

You pose a merit function, a definition of the error, which you want to minimize. In general, more fit parameters will give a better answer, up to a point. An example is in Polynomial best fit line for very large values.
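The "better, up to a point" behaviour can be seen in-sample: for nested polynomial models, the $L_{2}$ training error can only decrease as the order grows, which is exactly why an out-of-sample criterion such as the holdout MAE is needed. A sketch on synthetic data from a quadratic trend (data and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy data from a quadratic trend.
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.5, 40)

# In-sample L2 error is nonincreasing as the polynomial order grows.
sses = []
for degree in range(1, 8):
    coefs = np.polyfit(x, y, degree)
    sse = np.sum((y - np.polyval(coefs, x))**2)
    sses.append(sse)
    print(f"degree {degree}: SSE = {sse:.3f}")
```

Out-of-sample error, by contrast, typically starts rising again once the order exceeds what the data support.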

Typical results for fits with polynomials of increasing fit order in $L_{2}$.

