Why is there a difference between predicting on Validation set and Test set?

Question

I have an XGBoost model that tries to predict whether a currency will go up or down in the next period (5 min). I have a dataset from 2004 to 2018. I split the data randomly into 95% train and 5% validation, and the accuracy on the validation set is up to 55%. When I then use the model on a new test set (data from 2019), the accuracy drops below 51%.

Can someone explain why that might be?

I mean, I assume the model has not "seen" (trained on) the validation data any more than it has the test data, so can it really be overfitting?

I have attached a simple model below to illustrate. It gives 54% on the validation set but only 50.9% on the test set.

Thankful for any assistance!

N.B. One theory I had was that, since some of the features rely on historical data (e.g. a moving average), there could be data leakage of some sort. I tried to correct for that by sampling only rows that were not used in computing the moving average. E.g. with a 3-period moving average, I don't sample/use the rows from the 2 preceding periods. That did not change anything, so it is not in the model below.

N.B.2 The model below is a simplified version of what I use. I need a validation set because I use a genetic algorithm for hyperparameter tuning, but all of that is removed here for clarity.

import pandas as pd
import talib as ta
from sklearn.utils import shuffle
pd.options.mode.chained_assignment = None
from sklearn.metrics import accuracy_score

# ## TRAINING AND VALIDATING
# ### Read in data
input_data_file = 'EURUSDM5_2004-2018_cleaned.csv'   # For train and validation
df = pd.read_csv(input_data_file)

# ### Generate features
#######################
# SET TARGET
#######################
df['target'] = df['Close'].shift(-1)>df['Close']       # target is binary, i.e. either up or down next period

#######################
# DEFINE FEATURES
#######################
df['rsi'] = ta.RSI(df['Close'], 14) 

# ### Treat the data
#######################
# FIND AND MAKE CATEGORICAL VARIABLES AND DO ONE-HOT ENCODING
#######################
for col in df.drop('target',axis=1).columns:     # Crude way of defining variables with few unique variants as categorical
    if df[col].nunique() < 25:
        df[col] = pd.Categorical(df[col])

cats = df.select_dtypes(include='category')     # Do one-hot encoding for the categorical variables
for cat_col in cats:
    df = pd.concat([df,pd.get_dummies(df[cat_col], prefix=cat_col,dummy_na=False)],axis=1).drop([cat_col],axis=1)

uints = df.select_dtypes(include='uint8')
for col in uints.columns:                   # Variables from the one-hot encoding are not created as categoricals so do it here
    df[col] = df[col].astype('category')

#######################
# REMOVE ROWS WITH NO TRADES
#######################
df = df[df['Volume']>0]

#######################
# BALANCE NUMBER OF UP/DOWN IN TARGET SO THE MODEL CANNOT SIMPLY CHOOSE ONE AND BE SUCCESSFUL THAT WAY
#######################
df_true = df[df['target']==True]
df_false = df[df['target']==False]

len_true = len(df_true)
len_false = len(df_false)
rows = min(len_true,len_false)

df_true = df_true.head(rows)
df_false = df_false.head(rows)
df = pd.concat([df_true,df_false],ignore_index=True)
df = shuffle(df)
df.dropna(axis=0, how='any', inplace=True)

# ### Split data
df = shuffle(df)
split = int(0.95*len(df))

train_set = df.iloc[0:split]
val_set = df.iloc[split:]                   # up to the end (split:-1 would silently drop the last row)

# ### Generate X,y
X_train = train_set[train_set.columns.difference(['target', 'Datetime'])]
y_train = train_set['target']

X_val = val_set[val_set.columns.difference(['target', 'Datetime'])]
y_val = val_set['target']

# ### Scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

cont = X_train.select_dtypes(exclude='category')                   # Find columns with continuous (not categorical) variables
X_train[cont.columns] = sc.fit_transform(X_train[cont.columns])    # Fit and transform

cont = X_val.select_dtypes(exclude='category')                     # Find columns with continuous (not categorical) variables
X_val[cont.columns] = sc.transform(X_val[cont.columns])            # Transform

cats = X_train.select_dtypes(include='category')
for col in cats.columns:
    X_train[col] = X_train[col].astype('uint8')

cats = X_val.select_dtypes(include='category')
for col in cats.columns:
    X_val[col] = X_val[col].astype('uint8')


# ## MODEL
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_val)
acc = 100*accuracy_score(y_val, predictions)
print('{0:0.1f}%'.format(acc))

# # TESTING
input_data_file = 'EURUSDM5_2019_cleaned.csv'   # For testing
df = pd.read_csv(input_data_file)

#######################
# SET TARGET
#######################
df['target'] = df['Close'].shift(-1)>df['Close']       # target is binary, i.e. either up or down next period
#######################
# DEFINE FEATURES
#######################
df['rsi'] = ta.RSI(df['Close'], 14)

#######################
# FIND AND MAKE CATEGORICAL VARIABLES AND DO ONE-HOT ENCODING
#######################
for col in df.drop('target',axis=1).columns:     # Crude way of defining variables with few unique variants as categorical
    if df[col].nunique() < 25:
        df[col] = pd.Categorical(df[col])

cats = df.select_dtypes(include='category')     # Do one-hot encoding for the categorical variables
for cat_col in cats:
    df = pd.concat([df,pd.get_dummies(df[cat_col], prefix=cat_col,dummy_na=False)],axis=1).drop([cat_col],axis=1)

uints = df.select_dtypes(include='uint8')
for col in uints.columns:                   # Variables from the one-hot encoding are not created as categoricals so do it here
    df[col] = df[col].astype('category')

#######################
# REMOVE ROWS WITH NO TRADES
#######################
df = df[df['Volume']>0]
df.dropna(axis=0, how='any', inplace=True)

X_test = df[df.columns.difference(['target', 'Datetime'])]
y_test = df['target']

cont = X_test.select_dtypes(exclude='category')                     # Find columns with continuous (not categorical) variables
X_test[cont.columns] = sc.transform(X_test[cont.columns])            # Transform

cats = X_test.select_dtypes(include='category')
for col in cats.columns:
    X_test[col] = X_test[col].astype('uint8')

predictions = model.predict(X_test)
acc = 100*accuracy_score(y_test, predictions)
print('{0:0.1f}%'.format(acc))

score 6 · Answer 1 · answered Aug 24 '19 at 23:24

The only difference appears to be the data. Maybe the test set (the newest data) differs slightly in distribution from the training/validation sets, which caused your model to underperform.

kitty


score 6 · Answer 2 · answered Aug 25 '19 at 01:09

The most likely thing is that there has been some concept drift. Since your model is trained on data up through 2018 and tested on 2019, things have changed, and some of these changes your model might not be able to foresee.
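One way to check whether the gap comes from drift rather than from the random split is to validate the same way you test: hold out the most recent part of the 2004 to 2018 data instead of a shuffled 5%. A minimal sketch, assuming the `Datetime` column from the question parses as timestamps; the helper name `time_ordered_split`, the cutoff date, and the toy values are made up for illustration:

```python
import pandas as pd

def time_ordered_split(df, cutoff, time_col="Datetime"):
    """Split rows into (before cutoff, on/after cutoff) without shuffling."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values(time_col)
    is_train = df[time_col] < pd.Timestamp(cutoff)
    return df[is_train], df[~is_train]

# Toy frame standing in for the real 5-minute bars (values are placeholders).
toy = pd.DataFrame({
    "Datetime": pd.date_range("2017-12-31 23:40", periods=8, freq="5min"),
    "Close": [1.20, 1.21, 1.19, 1.22, 1.23, 1.21, 1.20, 1.24],
})
train_part, val_part = time_ordered_split(toy, "2018-01-01")
print(len(train_part), len(val_part))
```

If the accuracy on a validation set built this way is already close to the 2019 test accuracy, that would suggest the shuffled split was flattering the model.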

A couple of other possibilities though:

You say you performed hyperparameter tuning, but omitted that from the code for simplicity. But if you are using the validation set in order to choose the hyperparameters, then the score you get will be biased optimistically. (But you say the model hasn't seen the validation set, so maybe this isn't how you're doing it.)

Finally, it's possible you've done everything right, and there isn't really concept drift going on, but that random effects just account for a few points of accuracy.
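To get a feel for how much of the gap could be random, you can put a rough interval around a measured accuracy using the binomial standard error. This is only a sketch: the function name `accuracy_interval` and the sample size `n=10_000` are placeholders, and it assumes independent samples, which consecutive 5-minute bars are not, so the real interval is likely wider.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n independent samples."""
    se = math.sqrt(acc * (1.0 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se

# n is a placeholder: substitute the real number of validation/test rows.
low, high = accuracy_interval(0.54, n=10_000)
print(f"{low:.3f} .. {high:.3f}")
```

Comparing the intervals for the validation and test scores gives a first indication of whether the two numbers are really different.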

score 2 · Answer 3 · answered Sep 01 '19 at 23:10

There are two primary reasons:

1. The trained model has close to random performance. For example, 50% is random performance in a binary classification task assuming equal class membership. In other words, the model does not learn meaningful predictive patterns from 2004 to 2018 data.
2. There could be new patterns in the 2019 data. The (barely learned) patterns from the 2004 to 2018 data do not transfer to the 2019 data.
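To judge how far above chance a score really is, it can help to compare it with the accuracy of always predicting the more frequent class on the same set, since the 2019 set in the question's code is not rebalanced the way the training data is. A small sketch; `majority_baseline` is an illustrative helper and the labels are invented:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Invented labels: 6 "up" (True) and 4 "down" (False) periods.
y_example = [True, False, True, True, False, True, False, True, True, False]
baseline = majority_baseline(y_example)
print(f"{baseline:.2f}")
```

A model is only adding value on a given set if its accuracy there is clearly above this baseline.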

Brian Spiering


Oh, yeah, I somehow missed that this was binary classification, that the scores reported were accuracies, and only 54% and 51%. +1 – Ben Reiniger Sep 02 '19 at 00:32

Robin Gertenbach · Answer 4 · 2019-08-25T19:02:01.760

As the old investment mantra goes, "past performance is not indicative of future performance".

My prime candidate is overfitting. The chance that any particular pattern appears to signal a certain direction without being causal (or predictive beyond the sample at hand) is astronomically small, but there is also an astronomical number of candidate patterns that can exhibit such behavior.
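The effect of searching through many candidate patterns can be illustrated with a small simulation. This is a toy setup, not the asker's data: targets and "signals" are pure coin flips, and the sizes `n_rows` and `n_patterns` are arbitrary. The best of many useless signals still looks predictive on the sample it was picked on, and falls back towards 50% on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_patterns = 500, 2000

# Coin-flip targets and coin-flip "signals": nothing here is predictive by design.
y_in = rng.integers(0, 2, n_rows)
y_out = rng.integers(0, 2, n_rows)
signals_in = rng.integers(0, 2, (n_patterns, n_rows))
signals_out = rng.integers(0, 2, (n_patterns, n_rows))

# Accuracy of each signal on the sample used to select it.
acc_in = (signals_in == y_in).mean(axis=1)
best = int(acc_in.argmax())

# The "winning" signal scored on its selection sample and on fresh data.
best_in = float(acc_in[best])
best_out = float((signals_out[best] == y_out).mean())
print(f"in-sample {best_in:.3f}, out-of-sample {best_out:.3f}")
```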

Let's assume the patterns you learned were real:
While you were training an algo to learn its triple bottoms and head-and-shoulders patterns, hundreds of banks were doing the same, faster than you, and acting on that information.
That information was then reflected in different price movements: because they knew more than they did in 2018, they acted differently, and your model doesn't yet know how to take those actions into account because they are new.
