What is the difference between lookback period and transform a time series dataset into a supervised learning dataset for time-series forecasting?

Question

Let's say I have dataset within the following pandas dataframe format with a non-standard timestamp column without datetime format as follows:

+--------+-----+
|TS_24hrs|count|
+--------+-----+
|0       |157  |
|1       |334  |
|2       |176  |
|3       |86   |
|4       |89   |
 ...      ...
|270     |192  |
|271     |196  |
|270     |251  |
|273     |138  |
+--------+-----+
274 rows × 2 columns

The dataset shape is $274*2$ and contains the first column of timestamp and 2nd column of numerical statistical count as the label, and I want to train some ML\DL regression models and do out-of-sample prediction (predicting beyond the training dataset) over future data (unseen test-set). I have already implemented RF regression within sklearn pipeline() after splitting data with the following strategy for 274 records:

split data into [training-set + validation-set] Ref. e.g. The first 200 records [160 +40]

keeping unseen [test-set] hold-on for final forecasting e.g. The last 74 records (after 200th rows\event)

#print(train.shape)          #(160, 2)
#print(validation.shape)     #(40, 2)
#print(test.shape)           #(74, 2)

lookback period, example1, example2

#ِDataset matrix is formed
def create_dataset(df , lookback=1):
  data  = np.array(df.iloc[:, :-1])
  label = np.array(df.iloc[:, -1])
  #create X_train and Y_train and MinMaxScaler on Y_train
  X = list()
  Y = list()
  for i in range(lookback , len(data)):
      X.append(data[i-lookback:i])
      Y.append(label[i])
  X = np.array(X)
  Y = np.array(Y)
  Y=np.expand_dims(Y,-1) 
  return X,Y
Lookback period
lookback = 5
X_train, Y_train = create_dataset(train, lookback)
X_val, Y_val = create_dataset(validation, lookback)
X_test, Y_test = create_dataset(test, lookback)
print(X_train.shape , Y_train.shape)       #(155, 5, 1) (155, 1)
print(X_val.shape, Y_val.shape)            #(35, 5, 1) (35, 1)
print(X_test.shape, Y_test.shape)          #(69, 5, 1) (69, 1)

transform a time series dataset into a supervised learning dataset posted by Jason Brownlee:

# Finalize the model and make a prediction for monthly births with random forest
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.ensemble import RandomForestRegressor
transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
 n_vars = 1 if type(data) is list else data.shape[1]
 df = DataFrame(data)
 cols = list()
input sequence (t-n, ... t-1)
for i in range(n_in, 0, -1):
 cols.append(df.shift(i))
forecast sequence (t, t+1, ... t+n)
for i in range(0, n_out):
 cols.append(df.shift(-i))
put it all together
agg = concat(cols, axis=1)
drop rows with NaN values
if dropnan:
 agg.dropna(inplace=True)
 return agg.values
load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
transform the time series data into supervised learning
train = series_to_supervised(values, n_in=6)
split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
fit model
model = RandomForestRegressor(n_estimators=1000)
model.fit(trainX, trainy)
construct an input for a new prediction
row = values[-6:].flatten()
make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))

Qs:

Q1: What is the difference between these two approaches? (lookback period Vs transform a time series dataset into a supervised learning dataset) I also tried in meantime walk_forward_validation() approach inspired from this post:

Walk forward validation is a method for estimating the skill of the model on out of sample data. We contrive out of sample and each time step one out of sample observation becomes in-sample. We can use the same model in ops, as long as the walk-forward is performed each time a new observation is received.

My observation shows no difference between using:

just series_to_supervised (STS)
walk_forward_validation (WFV) & series_to_supervised (STS) comparing outputs, especially MAE results, but a bit in quality of forecasting up & downs(could be due to weights during training):

Q2: How possibly can I reflect\compare prediction results when data split size based on my strategy has undergone of above methods, e.g. Look back period while lookback = 5? (I already missed 5 records, and in practice leads delay in forecasting the right point)

# The first 200 records slice for training-set and validation-set
df200=df[:200]
The rest records = 74 events (after 200th event) kept as hold-on unseen-set for forcasting
test = df[200:]                                   #test
Split the data into training and testing sets
from sklearn.model_selection import train_test_split
train, validation = train_test_split(df200 , test_size=0.2, shuffle=False) #train + validation
#print(train.shape)          #(160, 2)
#print(validation.shape)     #(40, 2)
#print(test.shape)           #(74, 2)
```

What is the difference between lookback period and transform a time series dataset into a supervised learning dataset for time-series forecasting?

Lookback period

print(X_train.shape , Y_train.shape) #(155, 5, 1) (155, 1)

print(X_val.shape, Y_val.shape) #(35, 5, 1) (35, 1)

print(X_test.shape, Y_test.shape) #(69, 5, 1) (69, 1)

transform a time series dataset into a supervised learning dataset

input sequence (t-n, ... t-1)

forecast sequence (t, t+1, ... t+n)

put it all together

drop rows with NaN values

load the dataset

transform the time series data into supervised learning

split into input and output columns

fit model

construct an input for a new prediction

make a one-step prediction

The rest records = 74 events (after 200th event) kept as hold-on unseen-set for forcasting

Split the data into training and testing sets

0 Answers0

Linked