Balanced vs total dataset rows, which one is better?

Question

I am working on a dataset of game-playing results: each child plays an arbitrary number of games, and each play has an output (y) with two possible values, "success" or "failure". There are about 800,000 samples, of which roughly 640,000 are positive (1) and roughly 160,000 are negative (0).

id     player     game    y
1      P1         G1      1
2      P1         G2      0
3      P1         G2      1
4      P1         G3      0
5      P2         G1      0
6      P2         G1      1
7      P2         G2      0

To make predictions, I sort my data by player first, then by game.

I want to run some tests to improve the prediction accuracy. As I am a beginner in machine learning and data science, I don't know which is the better choice:

working with this dataset as it is, even though it is unbalanced?
or selecting only the games that have an (approximately) balanced output, i.e. excluding any game whose number of positive results is much bigger than its number of negative results?

Not an exact duplicate, but very similar to https://datascience.stackexchange.com/q/810/1156 and probably several related questions on CrossValidated. — shadowtalker, Apr 22 '19 at 17:11
Possible duplicate of Should I go for a 'balanced' dataset or a 'representative' dataset? — Pedro Henrique Monforte, Apr 22 '19 at 18:00

score 1 · Answer 1 · answered Apr 22 '19 at 17:24

You could work with your dataset as it is, because a 20% minority class (here the negatives) is not really that unbalanced. Most common algorithms cope well with this kind of ratio.
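To put numbers on that, here is a small sketch using the approximate counts from the question. The variable names are just ones I picked for illustration. It shows the class proportions and the accuracy you would get by always predicting the majority class, which is the baseline any real model has to beat.

```python
# Approximate class counts taken from the question.
class_counts = {1: 640_000, 0: 160_000}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# A "model" that always predicts the most frequent class reaches this accuracy.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = proportions[majority_label]

print("class proportions:", proportions)
print("majority class:", majority_label)
print("majority-class baseline accuracy:", baseline_accuracy)
```

So an accuracy near 80% on its own says little here. Looking at per-class metrics (for example, recall on the negative class) is usually more informative.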

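But if you want to handle it differently, I suggest the Synthetic Minority Over-sampling Technique (SMOTE), which creates new synthetic minority-class samples by interpolating between existing minority samples and their nearest minority neighbours until the classes are balanced. (Simply repeating minority rows is plain random oversampling, not SMOTE.)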
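For reference, SMOTE builds each synthetic sample by picking a minority point, picking one of its k nearest minority neighbours, and taking a random point on the segment between them. Below is a toy sketch of that idea in plain Python. The function name `smote_sketch`, its parameters, and the feature values are all made up for illustration, and it assumes purely numeric features (player and game identifiers would need a suitable encoding first). In practice you would more likely use a library implementation, such as the `SMOTE` class in the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (toy version)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority points.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Take a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up numeric features for a handful of minority-class rows.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)

print(len(new_points), "synthetic minority samples")
for point in new_points:
    print(point)
```

Whichever implementation you use, oversample only the training split, so that the test set keeps the real class distribution.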
