There is no fixed rule for choosing the sizes of the training and testing sets. It comes down to experimentation, so try different ratios such as 80-20, 70-30, or 65-35 and pick the one that gives the best performance.
Several machine learning research articles generally suggest:
- Training dataset to be 70% (for setting model parameters)
- Validation dataset to be 15% (helps to tune hyperparameters)
- Testing dataset to be 15% (helps to assess model performance)
If you plan to split your data into only two sets, a common choice is
- Training dataset to be 75%
- Testing dataset to be 25%
For extremely large datasets, which can run to millions of records (as in your case), a train/validation/test split of 98/1/1 can suffice, since even 1% is still a large amount of data.
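If you want a 70/15/15 three-way split, one common approach (a minimal sketch using sklearn's train_test_split, with X and y as placeholder feature/label variables) is to call it twice: first carve out the test set, then split the remainder into train and validation.

from sklearn.model_selection import train_test_split

# First split: hold out 15% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, stratify=y)

# Second split: 15% of the original data as validation
# (0.15 / 0.85 of the remaining 85% is roughly 15% of the whole dataset)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15/0.85, stratify=y_rest)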
Note:
Do ensure you split your training and test datasets programmatically (with a random split) rather than manually.
One major concern when splitting data is class imbalance. For instance, suppose your ML problem has 5 classes to classify. You need to ensure the train and test datasets each contain enough examples of all 5 classes for the model to perform well. If you split the data manually, some classes may end up with only a small share of the data, or worse, only 3 of the 5 classes may appear in the train/test data at all. To avoid this, use the stratify parameter in sklearn's train_test_split:
from sklearn.model_selection import train_test_split

# Stratify on the class labels so train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y)
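As a quick sanity check (this sketch assumes y is a pandas Series), you can compare the class proportions before and after the split; with stratification they should be nearly identical.

print(y.value_counts(normalize=True))        # class proportions in the full dataset
print(y_train.value_counts(normalize=True))  # should closely match the full dataset
print(y_test.value_counts(normalize=True))   # same here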