I have an XGBoost model that tries to predict whether a currency will go up or down in the next period (5 min). My dataset runs from 2004 to 2018. I randomly split the data into 95% train and 5% validation, and the accuracy on the validation set is up to 55%. When I then use the model on a new test set (data from 2019), the accuracy drops below 51%.
Can someone explain why that might be?
I mean, I assume the model has not "seen" (been trained on) the validation data any more than it has the test data, so can it really be overfitting?
I have attached a simple model below to illustrate. It gives 54% on the validation set but only 50.9% on the test set.
Thankful for any assistance!
N.B. One theory I had was that, because some of the features rely on historical data (e.g. a moving average), there could be data leakage of some sort. I tried to correct for that by sampling only rows that were not part of creating the moving average. E.g. with a moving average of 3 periods, I don't sample/use the rows from the 2 preceding periods. That did not change anything, so it is not in the model below.
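The row thinning described in this note could be sketched roughly as follows. This is only one reading of the description, not the code that was actually run: `sample_non_overlapping` is a hypothetical helper, the prices are synthetic, and a pandas rolling mean stands in for a TA-Lib indicator.

```python
import numpy as np
import pandas as pd

def sample_non_overlapping(df, window, offset=0):
    """Keep every `window`-th row, so that no two kept rows share any of the
    input rows behind a rolling feature computed over `window` periods.
    Hypothetical helper, for illustration only."""
    return df.iloc[offset::window]

# Synthetic 5-minute closes (made-up numbers, not real EURUSD data)
rng = np.random.default_rng(0)
prices = pd.DataFrame({"Close": 1.10 + rng.normal(0, 1e-4, 20).cumsum()})

window = 3
prices["ma"] = prices["Close"].rolling(window).mean()  # uses rows i-2 .. i
prices = prices.dropna()                               # drop the warm-up rows

thinned = sample_non_overlapping(prices, window)
print(thinned.index.tolist())
```

Each kept row is at least `window` rows away from the next one, so their moving-average inputs do not overlap. Note that this only removes overlap in the feature windows; it does not change how the rows are later split into train and validation.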
N.B.2 The model below is a simplified version of what I use. I need a validation set because I use a genetic algorithm for the hyperparameter tuning, but all of that is removed here for clarity.
import pandas as pd
import talib as ta
from sklearn.utils import shuffle
pd.options.mode.chained_assignment = None
from sklearn.metrics import accuracy_score
# ## TRAINING AND VALIDATING
# ### Read in data
input_data_file = 'EURUSDM5_2004-2018_cleaned.csv' # For train and validation
df = pd.read_csv(input_data_file)
# ### Generate features
#######################
# SET TARGET
#######################
df['target'] = df['Close'].shift(-1)>df['Close'] # target is binary, i.e. either up or down next period
#######################
# DEFINE FEATURES
#######################
df['rsi'] = ta.RSI(df['Close'], 14)
# ### Treat the data
#######################
# FIND AND MAKE CATEGORICAL VARIABLES AND DO ONE-HOT ENCODING
#######################
for col in df.drop('target',axis=1).columns: # Crude way of defining variables with few unique variants as categorical
    if df[col].nunique() < 25:
        df[col] = pd.Categorical(df[col])
cats = df.select_dtypes(include='category') # Do one-hot encoding for the categorical variables
for cat_col in cats:
    # dtype='uint8' is set explicitly so the select_dtypes call below finds the dummy columns
    df = pd.concat([df,pd.get_dummies(df[cat_col], prefix=cat_col,dummy_na=False,dtype='uint8')],axis=1).drop([cat_col],axis=1)
uints = df.select_dtypes(include='uint8')
for col in uints.columns: # Variables from the one-hot encoding are not created as categoricals so do it here
    df[col] = df[col].astype('category')
#######################
# REMOVE ROWS WITH NO TRADES
#######################
df = df[df['Volume']>0]
#######################
# BALANCE NUMBER OF UP/DOWN IN TARGET SO THE MODEL CANNOT SIMPLY CHOOSE ONE AND BE SUCCESSFUL THAT WAY
#######################
df_true = df[df['target']==True]
df_false = df[df['target']==False]
len_true = len(df_true)
len_false = len(df_false)
rows = min(len_true,len_false)
df_true = df_true.head(rows)
df_false = df_false.head(rows)
df = pd.concat([df_true,df_false],ignore_index=True)
df = shuffle(df)
df.dropna(axis=0, how='any', inplace=True)
# ### Split data
df = shuffle(df)
split = int(0.95*len(df))
train_set = df.iloc[0:split]
val_set = df.iloc[split:] # Remaining 5% of the rows, up to and including the last one
# ### Generate X,y
X_train = train_set[train_set.columns.difference(['target', 'Datetime'])]
y_train = train_set['target']
X_val = val_set[val_set.columns.difference(['target', 'Datetime'])]
y_val = val_set['target']
# ### Scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
cont = X_train.select_dtypes(exclude='category') # Find columns with continuous (not categorical) variables
X_train[cont.columns] = sc.fit_transform(X_train[cont.columns]) # Fit and transform
cont = X_val.select_dtypes(exclude='category') # Find columns with continuous (not categorical) variables
X_val[cont.columns] = sc.transform(X_val[cont.columns]) # Transform only, using the scaler fitted on the training set
cats = X_train.select_dtypes(include='category') # Convert categoricals back to uint8 for the model
for col in cats.columns:
    X_train[col] = X_train[col].astype('uint8')
cats = X_val.select_dtypes(include='category')
for col in cats.columns:
    X_val[col] = X_val[col].astype('uint8')
# ## MODEL
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_val)
acc = 100*accuracy_score(y_val, predictions)
print('{0:0.1f}%'.format(acc))
# # TESTING
input_data_file = 'EURUSDM5_2019_cleaned.csv' # For testing
df = pd.read_csv(input_data_file)
#######################
# SET TARGET
#######################
df['target'] = df['Close'].shift(-1)>df['Close'] # target is binary, i.e. either up or down next period
#######################
# DEFINE FEATURES
#######################
df['rsi'] = ta.RSI(df['Close'], 14)
#######################
# FIND AND MAKE CATEGORICAL VARIABLES AND DO ONE-HOT ENCODING
#######################
for col in df.drop('target',axis=1).columns: # Crude way of defining variables with few unique variants as categorical
    if df[col].nunique() < 25:
        df[col] = pd.Categorical(df[col])
cats = df.select_dtypes(include='category') # Do one-hot encoding for the categorical variables
for cat_col in cats:
    # dtype='uint8' is set explicitly so the select_dtypes call below finds the dummy columns
    df = pd.concat([df,pd.get_dummies(df[cat_col], prefix=cat_col,dummy_na=False,dtype='uint8')],axis=1).drop([cat_col],axis=1)
uints = df.select_dtypes(include='uint8')
for col in uints.columns: # Variables from the one-hot encoding are not created as categoricals so do it here
    df[col] = df[col].astype('category')
#######################
# REMOVE ROWS WITH NO TRADES
#######################
df = df[df['Volume']>0]
df.dropna(axis=0, how='any', inplace=True)
X_test = df[df.columns.difference(['target', 'Datetime'])]
y_test = df['target']
cont = X_test.select_dtypes(exclude='category') # Find columns with continuous (not categorical) variables
X_test[cont.columns] = sc.transform(X_test[cont.columns]) # Transform only, using the scaler fitted on the training set
cats = X_test.select_dtypes(include='category') # Convert categoricals back to uint8 for the model
for col in cats.columns:
    X_test[col] = X_test[col].astype('uint8')
predictions = model.predict(X_test)
acc = 100*accuracy_score(y_test, predictions)
print('{0:0.1f}%'.format(acc))
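
For comparison with the shuffled 95/5 split used above, a time-ordered hold-out would take the last 5% of rows by timestamp as the validation set. The following is a minimal sketch, assuming a sortable `Datetime` column like the one excluded from the features above. `chronological_split` and the demo frame are hypothetical, and this was not used to produce any of the numbers quoted in this post.

```python
import pandas as pd

def chronological_split(df, time_col="Datetime", train_frac=0.95):
    """Split by time instead of at random: the earliest `train_frac` of the
    rows go to train, the most recent rows go to validation.
    Hypothetical helper, for illustration only."""
    ordered = df.sort_values(time_col)
    split = int(train_frac * len(ordered))
    return ordered.iloc[:split], ordered.iloc[split:]

# Small made-up frame standing in for the real 5-minute data
demo = pd.DataFrame({
    "Datetime": pd.date_range("2018-01-01", periods=100, freq="5min"),
    "Close": range(100),
})
demo = demo.sample(frac=1.0, random_state=0)  # shuffled, as in the code above

train_demo, val_demo = chronological_split(demo)
print(len(train_demo), len(val_demo))
print(train_demo["Datetime"].max() < val_demo["Datetime"].min())
```

With this kind of split every validation row is later in time than every training row, which mirrors how the 2019 test set relates to the 2004-2018 data.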