There is no fixed rule for choosing the sizes of the training and testing sets. It comes down to experimentation, so try different ratios such as 80-20, 70-30, or 65-35 and pick the one that gives the best performance.
Several machine learning research articles generally suggest:
- Training dataset to be 70% (for setting model parameters)
- Validation dataset to be 15% (helps to tune hyperparameters)
- Testing dataset to be 15% (helps to assess model performance)
If you plan to split your data into only two sets, a common choice is
- Training dataset to be 75%
- Testing dataset to be 25%
For extremely large datasets, which can run to millions of records (as in your case), a train/validation/test split of 98/1/1 can suffice, since even 1% is still a large amount of data.
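If you want a 70/15/15 three-way split, one common approach (a minimal sketch using sklearn's train_test_split, with X and y as placeholder feature/label variables) is to call it twice: first carve out the test set, then split the remainder into train and validation.

from sklearn.model_selection import train_test_split

# First split: hold out 15% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, stratify=y)

# Second split: 15% of the original data as validation
# (0.15 / 0.85 of the remaining 85% is roughly 15% of the whole dataset)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15/0.85, stratify=y_rest)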
Note:
Do ensure you split your training and test datasets programmatically (with a random split) rather than manually.
One major concern when splitting data is class imbalance. For instance, suppose your ML problem has 5 classes to classify. You need to ensure the train and test datasets each contain enough examples of all 5 classes for the model to perform well. If you split the data manually, some classes may end up with only a small share of the data, or worse, only 3 of the 5 classes may appear in the train/test data at all. To avoid this, use the stratify parameter in sklearn's train_test_split:
from sklearn.model_selection import train_test_split

# Stratify on the class labels so train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y)
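As a quick sanity check (this sketch assumes y is a pandas Series), you can compare the class proportions before and after the split; with stratification they should be nearly identical.

print(y.value_counts(normalize=True))        # class proportions in the full dataset
print(y_train.value_counts(normalize=True))  # should closely match the full dataset
print(y_test.value_counts(normalize=True))   # same here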