Let's say I have a dataset in the following dataframe format, with a non-standard timestamp column (not in datetime format):

+--------+-----+
|TS_24hrs|count|
+--------+-----+
|0       |157  |
|1       |334  |
|2       |176  |
|3       |86   |
|4       |89   |
 ...      ...
|270     |192  |
|271     |196  |
|270     |251  |
|273     |138  |
+--------+-----+
274 rows × 2 columns

The dataset has shape $274 \times 2$: the first column is the timestamp and the second is a numerical count used as the label. I want to train some ML/DL regression models and do out-of-sample prediction (predicting beyond the training dataset) over future data (an unseen test-set). I have already implemented RF regression within pipeline() after splitting the 274 records with the following strategy:

  • split the first 200 records into [training-set + validation-set], e.g. [160 + 40]
  • keep the unseen [test-set] held out for final forecasting, e.g. the last 74 records (after the 200th row/event)
#print(train.shape)          #(160, 2)
#print(validation.shape)     #(40, 2)
#print(test.shape)           #(74, 2)

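import numpy as np

# Dataset matrix is formed
def create_dataset(df, lookback=1):
  data  = np.array(df.iloc[:, :-1])  # every column except the last (here: TS_24hrs)
  label = np.array(df.iloc[:, -1])   # last column (count) is the label
  # build sliding windows X and their labels Y (no scaling is applied here)
  X = list()
  Y = list()
  for i in range(lookback, len(data)):
    X.append(data[i-lookback:i])     # the `lookback` rows preceding row i
    Y.append(label[i])               # label at row i
  X = np.array(X)
  Y = np.array(Y)
  Y = np.expand_dims(Y, -1)          # shape (n_samples, 1)
  return X, Y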

# Lookback period
lookback = 5
X_train, Y_train = create_dataset(train, lookback)
X_val, Y_val = create_dataset(validation, lookback)
X_test, Y_test = create_dataset(test, lookback)
print(X_train.shape, Y_train.shape) #(155, 5, 1) (155, 1)
print(X_val.shape, Y_val.shape) #(35, 5, 1) (35, 1)
print(X_test.shape, Y_test.shape) #(69, 5, 1) (69, 1)

# Finalize the model and make a prediction for monthly births with random forest
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.ensemble import RandomForestRegressor

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
train = series_to_supervised(values, n_in=6)
# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
# fit model
model = RandomForestRegressor(n_estimators=1000)
model.fit(trainX, trainy)
# construct an input for a new prediction
row = values[-6:].flatten()
# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))
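
To make the comparison between the two framings concrete, here is a minimal sketch on a toy series. It assumes both are applied to the same univariate count values; note that create_dataset as written windows over the columns before the label (here TS_24hrs), while series_to_supervised lags the series values themselves. The helper names lookback_windows and shift_windows and the toy values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def lookback_windows(values, lookback):
    """Sliding windows: X[k] holds the `lookback` values that precede y[k]."""
    values = np.asarray(values, dtype=float)
    X = np.array([values[i - lookback:i] for i in range(lookback, len(values))])
    y = values[lookback:]
    return X, y

def shift_windows(values, n_in):
    """The same framing built with DataFrame.shift, in the style of series_to_supervised."""
    df = pd.DataFrame({"v": np.asarray(values, dtype=float)})
    # lagged copies (t-n_in, ..., t-1) followed by the current value (t)
    cols = [df.shift(i) for i in range(n_in, 0, -1)] + [df]
    agg = pd.concat(cols, axis=1).dropna().values
    return agg[:, :-1], agg[:, -1]

counts = [157, 334, 176, 86, 89, 192, 196, 251, 138]  # toy values for illustration
X_a, y_a = lookback_windows(counts, lookback=5)
X_b, y_b = shift_windows(counts, n_in=5)

print(X_a.shape, y_a.shape)
print(np.array_equal(X_a, X_b), np.array_equal(y_a, y_b))
```

Under that assumption the two helpers produce the same inputs and labels, so any difference in results would come from which column is windowed (and from the extra trailing dimension that create_dataset keeps), not from the windowing idea itself.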

Qs:

Q1: What is the difference between these two approaches (a lookback period vs. transforming a time series dataset into a supervised learning dataset)? In the meantime I also tried the walk_forward_validation() approach, inspired by this post:

Walk forward validation is a method for estimating the skill of the model on out of sample data. We contrive out of sample and each time step one out of sample observation becomes in-sample. We can use the same model in ops, as long as the walk-forward is performed each time a new observation is received.
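
The procedure described in that quote can be sketched roughly as follows. This is not the walk_forward_validation() from the referenced post; it is a simplified stand-in that assumes a synthetic series, a small forest for speed, and hypothetical helper names (make_supervised, walk_forward).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def make_supervised(values, n_in):
    """Lagged inputs (t-n_in, ..., t-1) and the value at t as the target."""
    values = np.asarray(values, dtype=float)
    X = np.array([values[i - n_in:i] for i in range(n_in, len(values))])
    y = values[n_in:]
    return X, y

def walk_forward(X, y, n_test, n_estimators=20, seed=0):
    """Refit on every row seen so far, predict one step, then move on."""
    preds = []
    start = len(X) - n_test
    for t in range(start, len(X)):
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(X[:t], y[:t])                     # history only: rows before step t
        preds.append(model.predict(X[t:t + 1])[0])  # one-step-ahead prediction
    preds = np.array(preds)
    return preds, mean_absolute_error(y[start:], preds)

# Synthetic stand-in series (an assumption, not the question's data)
rng = np.random.default_rng(0)
series = 150 + 50 * np.sin(np.arange(80) / 4.0) + rng.normal(0, 5, 80)

X, y = make_supervised(series, n_in=6)
preds, mae = walk_forward(X, y, n_test=10)
print(preds.shape, round(mae, 3))
```

In both a single fit and walk-forward, each one-step prediction is made from true observed lags; walk-forward only adds a refit at every step. That may be one reason the MAE values end up so close when the test window is short relative to the training history.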

My observations show no difference in outputs, especially in MAE results, between using:

  • just series_to_supervised (STS)
  • walk_forward_validation (WFV) together with series_to_supervised (STS)

The only visible difference is a slight one in how well the ups and downs are forecast (which could be due to the weights learned during training).

Q2: How can I report/compare prediction results against my split sizes once the data has gone through one of the above methods, e.g. the lookback approach with lookback = 5? (Each split already loses its first 5 records, which in practice delays the forecast relative to the correct point.)

# The first 200 records are sliced into the training-set and validation-set
df200 = df[:200]
# The remaining 74 records (after the 200th event) are kept as the hold-out unseen set for forecasting
test = df[200:] #test
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split
train, validation = train_test_split(df200, test_size=0.2, shuffle=False) #train + validation
#print(train.shape)          #(160, 2)
#print(validation.shape)     #(40, 2)
#print(test.shape)           #(74, 2)
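
The record loss mentioned in Q2 is simple arithmetic: each split yields len(split) - lookback samples. One possible way to keep every validation and test row as a target is to prepend the last lookback rows of the preceding split as input context. The sketch below assumes a stand-in frame with the same shape as the data above; n_samples and with_context are hypothetical helpers, not part of any library.

```python
import numpy as np
import pandas as pd

lookback = 5

# Stand-in frame with the same shape as the question's data (274 rows x 2 columns)
df = pd.DataFrame({"TS_24hrs": np.arange(274), "count": np.arange(274) % 50})
train, validation, test = df[:160], df[160:200], df[200:]

def n_samples(frame, lookback):
    """Number of (window, label) pairs a split yields after windowing."""
    return max(len(frame) - lookback, 0)

# Each split loses its first `lookback` rows as targets
print([n_samples(s, lookback) for s in (train, validation, test)])  # [155, 35, 69]

def with_context(previous, current, lookback):
    """Prepend the last `lookback` rows of the earlier split as input-only context."""
    return pd.concat([previous.tail(lookback), current])

validation_ctx = with_context(train, validation, lookback)
test_ctx = with_context(validation, test, lookback)
print(n_samples(validation_ctx, lookback), n_samples(test_ctx, lookback))  # 40 74
```

The prepended rows only ever appear inside input windows, never as labels, so the targets of the validation and test sets stay exactly the original 40 and 74 records and predictions line up with their timestamps.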

Mario