Obtaining a confidence interval for the prediction of a linear regression

Question

The data I am working with is being used to predict the duration of a trip between two points. There are about 100 different trips in the data and ~90k observations.

I am using the standard pattern to perform the regression and obtain my score (~0.74):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Every column except the target is a feature
feature_cols = df_features.columns.drop(['log_duration'])
X = df_features[feature_cols]
y = df_features.log_duration

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
linreg.score(X_test, y_test)  # R^2 on the held-out set

However, suppose the model predicts that a trip between two points will take 40 minutes. Obviously, it will not take exactly 40 minutes. What I am looking for is a way to report that the trip will take 40 minutes +/- C minutes.

Using pandas and scikit-learn, how can I obtain C?

A good description is in this video https://www.youtube.com/watch?v=qVCQi0KPR0s — Angadishop, May 12 '20 at 17:56
A good read: https://github.com/scikit-learn/scikit-learn/issues/6773 — Wok, Aug 10 '21 at 21:28

score 10 · Answer 1 · answered Dec 02 '18 at 19:45

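You can estimate the standard deviation of the prediction error from the training residuals:

# residual degrees of freedom: n - p - 1 (p features plus the intercept)
stdev = np.sqrt(np.sum((linreg.predict(X_train) - y_train)**2) / (len(y_train) - X_train.shape[1] - 1))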

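Then, for whatever confidence level you want, look up the corresponding Gaussian critical value (for example, for a 95% confidence level it is 1.96).

Finally, the interval is (prediction - 1.96*stdev, prediction + 1.96*stdev) (and similarly for any other confidence level). Strictly speaking, this is an approximate prediction interval: it assumes roughly Gaussian residuals with constant variance and ignores the uncertainty in the fitted coefficients. Because the target here is log_duration, the bounds are on the log scale and need to be transformed back to get minutes.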
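To make the steps concrete, here is a minimal end-to-end sketch. It runs on synthetic data standing in for df_features (the data, the coefficients and the noise level are made up for illustration), and it assumes log_duration is the natural log of the duration in minutes, which the question does not state.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_features (hypothetical, not the asker's data).
rng = np.random.default_rng(0)
n, p = 2000, 3
X = rng.normal(size=(n, p))
log_duration = 3.5 + X @ np.array([0.4, -0.2, 0.1]) + rng.normal(scale=0.25, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, log_duration, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)

# Residual standard deviation on the training set, n - p - 1 degrees of freedom.
residuals = y_train - linreg.predict(X_train)
dof = len(y_train) - X_train.shape[1] - 1
stdev = np.sqrt(np.sum(residuals ** 2) / dof)

# Two-sided Gaussian critical value for the chosen confidence level.
confidence = 0.95
z = norm.ppf(0.5 + confidence / 2)  # roughly 1.96 for 95%

# Interval on the log scale.
pred_log = linreg.predict(X_test)
lower_log = pred_log - z * stdev
upper_log = pred_log + z * stdev

# Assumption: natural log target, so exponentiate to get bounds in minutes.
pred_min = np.exp(pred_log)
lower_min = np.exp(lower_log)
upper_min = np.exp(upper_log)

# Sanity check: share of held-out trips that fall inside their interval.
coverage = np.mean((y_test >= lower_log) & (y_test <= upper_log))
print(f"stdev={stdev:.3f}, z={z:.2f}, held-out coverage={coverage:.3f}")
```

Note that after back-transforming, the interval in minutes is not symmetric around the prediction, so there is no single C; you get a lower and an upper bound instead. Checking the held-out coverage is a cheap way to see whether the Gaussian assumption is reasonable for your data.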

Another approach is to use the statsmodels package.

Viacheslav Komisarenko


This is a really naive approach. – ldmtwo Mar 23 '19 at 22:32
I am wondering if we could use the standard error of the coefficients to get the CI of predictions? – Angadishop May 12 '20 at 17:22
it's kind of impressive how sklearn doesn't believe in the concept of coefficient CI – Dave Kielpinski Jul 22 '20 at 19:53

score 0 · Answer 2 · answered Apr 29 '20 at 11:21

Since you are using a linear regression, you could make use of the method described here. It is quite a bit more complex than the +/- standard deviation approach, but it would be more accurate.

DannyVanpoucke
