
I am participating in a Kaggle multiclass classification competition in which submissions are scored on logloss. I am using the Keras and scikit-learn libraries with a deep learning model, and I have taken the approach below.

I have corrected the class imbalance in the training data by oversampling the minority classes. I have split the training data into training (X_train, y_train) and validation (X_test, y_test) sets, scaled the features, and one-hot encoded the labels.

When I run the model, I get a very good validation loss (1.708) and validation accuracy; for comparison, the top logloss score on the Kaggle leaderboard is 1.744. But when I submit my predicted class probabilities for test_set, I get an awfully high loss score (4+). (Separately, I got a decent score of 2.02 with a different model approach, which is reflected in the leaderboard.)

Why is this? Any suggestions on what should be done or where I am going wrong?

Total classes:

Class_3    51811
Class_7    51811
Class_2    51811
Class_5    51811
Class_1    51811
Class_9    51811
Class_6    51811
Class_8    51811
Class_4    51811
Name: target, dtype: int64
Total: 466299

X_train, X_test, y_train, y_test = tts(X, y, test_size=.3, stratify=y, random_state=9)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(326409, 75)
(326409, 9)
(139890, 75)
(139890, 9)

display(X_train.head(3))
display(X_test.head(3))
display(y_train[:3])
display(y_test[:3])

X_train.head(3):

        feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8  feature_9  ...  feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
425643  0          0          0          0          0          0          0          0          0          0          ...  0           0           0           0           3           0           1           0           0           0
303754  2          3          2          2          5          0          0          1          1          1          ...  1           0           0           0           0           0           0           4           6           0
80710   2          8          2          0          18         2          0          2          1          3          ...  0           0           4           1           0           3           0           0           1           0

3 rows × 75 columns

X_test.head(3):

        feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8  feature_9  ...  feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
300226  0          0          1          4          0          0          0          4          1          1          ...  1           0           1           0           0           1           0           0           2           2
124793  0          0          0          6          0          0          0          3          7          2          ...  0           0           0           0           0           0           0           0           0           0
439437  0          3          0          0          5          0          0          2          1          1          ...  2           0           0           0           3           0           4           0           0           0

3 rows × 75 columns

y_train[:3]:

array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

y_test[:3]:

array([[0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

print(X_train.index.isin(X_test.index).sum())
print(X_test.index.isin(X_train.index).sum())

0
0
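Note that a zero index overlap only shows that no row index appears in both splits. Oversampling duplicates minority-class rows under new indices, so a stricter check compares the feature values themselves; a small sketch, run before scaling while X_train and X_test are still DataFrames:

import pandas as pd

# An inner join on all 75 feature columns keeps only rows whose values
# appear, feature for feature, in both splits, regardless of index.
shared_rows = pd.merge(X_train, X_test, how='inner')
print(len(shared_rows))  # > 0 would indicate duplicated rows across the splits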

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
test_set = scaler.fit_transform(test_set)
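As an aside, calling fit_transform on each set re-fits the scaler three times, so the validation and test features end up scaled with different means and variances than the training features. The usual pattern is to fit the scaler on the training split only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test = scaler.transform(X_test)        # reuse the training statistics
test_set = scaler.transform(test_set)    # same statistics for the Kaggle test set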

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(1024, input_shape=(75,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=.001),
              metrics=['accuracy'])

from tensorflow.keras.callbacks import EarlyStopping

monitor_val_acc = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs=50, validation_split=.3,
          callbacks=[monitor_val_acc], batch_size=1024)
accuracy = model.evaluate(X_test, y_test)[1]
print('Accuracy:', accuracy)

............
Epoch 28/30
45/45 [==============================] - 5s 117ms/step - loss: 1.6676 - accuracy: 0.3626 - val_loss: 1.7675 - val_accuracy: 0.3333
Epoch 29/30
45/45 [==============================] - 5s 114ms/step - loss: 1.6140 - accuracy: 0.3809 - val_loss: 1.7815 - val_accuracy: 0.3357
Epoch 30/30
45/45 [==============================] - 5s 117ms/step - loss: 1.5942 - accuracy: 0.3869 - val_loss: 1.7126 - val_accuracy: 0.3563
4372/4372 [==============================] - 11s 2ms/step - loss: 1.7085 - accuracy: 0.3582
Accuracy: 0.3581957221031189
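One detail worth flagging in the fit above: with patience=5 and no restore_best_weights, training stops five epochs after the best validation loss but keeps the last epoch's weights rather than the best ones. EarlyStopping has a standard argument for this (not used in the question):

monitor_val_acc = EarlyStopping(monitor='val_loss', patience=5,
                                restore_best_weights=True)  # roll back to the best-val_loss weights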

from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

preds_val = model.predict(X_test)

preds_val[:3]

array([[1.13723904e-01, 5.20741269e-02, 4.70720865e-02, 1.59640312e-02,
        1.92086305e-02, 2.25828230e-01, 1.81854114e-01, 1.99746847e-01,
        1.44528091e-01],
       [6.04994688e-03, 1.40825182e-01, 9.95656699e-02, 5.96038415e-04,
        5.59030111e-09, 4.57442701e-02, 3.05081338e-01, 1.77178025e-01,
        2.24959582e-01],
       [6.54266328e-02, 9.87399742e-02, 1.07230745e-01, 1.46904245e-01,
        6.80148089e-03, 1.52257413e-01, 1.22348621e-01, 1.58026025e-01,
        1.42264828e-01]], dtype=float32)

log_loss(y_test, preds_val)
1.708450169537806
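For reference, the submitted probabilities would be produced the same way from the scaled test set. A sketch of building the submission file, where the id column and the Class_1 ... Class_9 headers are assumptions based on typical Kaggle sample submissions, not taken from the question:

import pandas as pd

preds_test = model.predict(test_set)  # shape (n_rows, 9): one probability per class
submission = pd.DataFrame(preds_test,
                          columns=[f'Class_{i}' for i in range(1, 10)])  # assumed headers
submission.insert(0, 'id', test_ids)  # 'test_ids' is a hypothetical id column
submission.to_csv('submission.csv', index=False)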

  • Maybe your training/validation split is made in a way that leaks. You should try to split in a way analogous to how the competition organizers created the gold test data. For instance, in speech tasks you may not want to mix the same speakers across the splits. – noe Jun 14 '21 at 11:13
  • @noe, I don't think so; the commands shown in the question prove it, right? print(X_train.index.isin(X_test.index).sum()) and print(X_test.index.isin(X_train.index).sum()) are both zero. – Srinivas Jun 14 '21 at 11:17
  • Leaking is more than having the exact same data in both sets; that is what I was trying to illustrate with my example of the speaker split. You should question whether, for your specific data domain, the split should be based on some feature value. – noe Jun 14 '21 at 12:04
  • @noe, apologies, I think I am missing your suggestion. Can you please provide a link with details on the speaker split so that I can understand it better? Then I can probably see where the problem is. Thank you. – Srinivas Jun 14 '21 at 12:19
  • I suggest you take a look at the Wikipedia page of leakage, which contains sensible explanations and examples. – noe Jun 14 '21 at 12:31
  • It looks to me like you're resampling the whole dataset before splitting, am I right? If so, that can certainly explain the problem: resampling should be applied only to the training set (see for instance this question). – Erwan Jun 14 '21 at 17:56
  • Erwan, you are right. I realised that after @noe's suggestion. I think that explains why I get a good validation score but not on the test set. Thank you. – Srinivas Jun 15 '21 at 12:38
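As Erwan's comment indicates, oversampling before the split puts copies of the same minority-class rows on both sides of the split, which inflates the validation score relative to the untouched Kaggle test set. A minimal sketch of the corrected order, assuming imbalanced-learn's RandomOverSampler (the question does not say which oversampling method was used) and that y holds the raw class labels, with one-hot encoding applied after resampling:

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split as tts

# Split the original, still-imbalanced data first, so no oversampled
# duplicate can land on both sides of the split.
X_train, X_test, y_train, y_test = tts(
    X, y, test_size=.3, stratify=y, random_state=9)

# Oversample only the training portion; the validation set keeps the
# natural class distribution, like the hidden Kaggle test set does.
ros = RandomOverSampler(random_state=9)
X_train, y_train = ros.fit_resample(X_train, y_train)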
