8

The data I am working with is being used to predict the duration of a trip between two points. There are about 100 different trips in the data and ~90k observations.

I am using the standard pattern:

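from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Every column except the target is a feature
feature_cols = df_features.columns.drop( [ 'log_duration' ] )
X            = df_features[ feature_cols ]
y            = df_features.log_duration

X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 42 )
linreg = LinearRegression()
linreg.fit( X_train, y_train )
linreg.score( X_test, y_test )  # R^2 on the held-out data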

to perform the regression and obtain my score (R^2 of about 0.74).

However, let's say that it predicts that it will take 40 minutes to make the trip between two points. Obviously, it will not take exactly 40 minutes. What I am looking for is a way to report that the trip will take 40 minutes +/- C minutes.

Using pandas and scikit-learn, how can I obtain C?

ericg

2 Answers

10

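You can estimate the standard deviation of the prediction error from the training residuals, dividing by the residual degrees of freedom (number of observations, minus number of features, minus one for the intercept):

stdev = np.sqrt(np.sum((linreg.predict(X_train) - y_train)**2) / (len(y_train) - X_train.shape[1] - 1))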

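Then, for any confidence level you want, look up the corresponding Gaussian critical value (for example, for a 95% confidence level it is 1.96).

Finally, the approximate prediction interval is (prediction - 1.96*stdev, prediction + 1.96*stdev) (and similarly for any other confidence level). Note that your target is log_duration, so this interval is in log units. Assuming a natural log, apply np.exp to both endpoints to get bounds in the original units; the resulting interval will not be symmetric around the predicted duration.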
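To make the steps above concrete, here is a minimal sketch on synthetic data. The data, the coefficients, and names such as lower, upper, and coverage are made up for illustration and are not from the question; it also assumes log_duration is a natural log and that the residuals are roughly Gaussian with constant variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trip data (hypothetical, not the asker's dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
log_duration = 3.5 + X @ np.array([0.4, -0.2, 0.1]) + rng.normal(scale=0.25, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, log_duration, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)

# Residual standard error: n - p - 1 degrees of freedom (p features + intercept).
residuals = y_train - linreg.predict(X_train)
dof = len(y_train) - X_train.shape[1] - 1
stdev = np.sqrt(np.sum(residuals ** 2) / dof)

# Approximate 95% interval in log units, then back-transformed with exp.
z = 1.96
pred_log = linreg.predict(X_test)
center = np.exp(pred_log)
lower = np.exp(pred_log - z * stdev)
upper = np.exp(pred_log + z * stdev)

# Fraction of held-out trips whose true duration falls inside the interval.
actual = np.exp(y_test)
coverage = np.mean((actual >= lower) & (actual <= upper))
print(f"stdev (log units): {stdev:.3f}, empirical coverage: {coverage:.2%}")
```

Checking the empirical coverage on the test set is a quick sanity check: if it is far from the nominal 95%, the Gaussian, constant-variance assumption probably does not hold for your data.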

Another approach is to use the statsmodels package, whose OLS results can report prediction intervals directly.

0

Since you are using a linear regression, you could make use of the method described here. It is quite a bit more complex than the +/- standard deviations approach, but it would be more accurate.