I'm new to data science and I'm trying to build a simple linear regression with a single feature X (to which I add a log(X) feature before adding polynomial features) on a messy dataset, using Python and the data science stack that comes with it (NumPy, pandas, scikit-learn, ...).
Here is a snippet of my regression code using scikit-learn:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Lasso

def add_log(x):
    # Append log(x) as a second feature column
    return np.concatenate((x, np.log(x)), axis=1)

# Fetch the training set
_X = np.array(X).reshape(-1, 1)  # X = [1, 26, 45, ..., 100, ..., 8000]
_Y = np.array(Y).reshape(-1, 1)  # Y = [1206.78, 412.4, 20.8, ..., 1.34, ..., 0.034]
Y_train = _Y
X_train = add_log(_X) if use_log else _X

# Create the pipeline: scale, expand to degree-6 polynomial terms, then L1-penalized fit
steps = [
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(6)),
    ('model', Lasso(alpha=alpha, fit_intercept=True))
]
pipeline = Pipeline(steps)
pipeline.fit(X_train, Y_train)
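After fitting, I inspect the predicted curve by evaluating the pipeline on a dense grid (a minimal sketch; x_grid is just an evaluation range I picked to cover my data):

x_grid = np.linspace(1, 80_000, 1000).reshape(-1, 1)
y_grid = pipeline.predict(add_log(x_grid) if use_log else x_grid)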
My feature X ranges from 1 to ~80,000, and Y ranges from 0 to ~2M.
One thing I know about the curve I should obtain: it should always decrease, so its derivative should always be negative.
I made a little sketch to explain what I expect vs. what I have:
Therefore I would like to reward predictions whose derivative is always negative, even if my data suggest otherwise.
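Concretely, the kind of penalty I have in mind would look something like this (a hypothetical sketch; monotonicity_violation is a name I made up, not an existing scikit-learn function):

def monotonicity_violation(model, x_grid):
    # Fraction of grid steps where the predicted curve goes up;
    # 0.0 would mean the prediction always decreases, which is what I want
    y_pred = model.predict(x_grid).ravel()
    return np.mean(np.diff(y_pred) > 0)

I could use something like this to compare candidate values of alpha, but I don't see how to push the constraint into the fit itself.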
Is there a way to do that with scikit-learn?
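For instance, I know scikit-learn has IsotonicRegression, which can enforce a decreasing fit, but it gives a piecewise-linear curve rather than the smooth one I'm after (a minimal sketch, assuming _X, _Y and x_grid from above):

from sklearn.isotonic import IsotonicRegression

# A decreasing isotonic fit guarantees a non-increasing curve, but not smoothness
iso = IsotonicRegression(increasing=False, out_of_bounds='clip')
iso.fit(_X.ravel(), _Y.ravel())
y_mono = iso.predict(x_grid.ravel())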
Or maybe I'm proposing a bad solution to my problem and there is another way to obtain what I want?
Thank you