I am working on a dataset of game-playing results: each child plays an indefinite number of games, and each play has an output (y) with two possible values, "success" (1) or "failure" (0). There are about 800,000 samples in total, of which roughly 640,000 are positive (1) and roughly 160,000 are negative (0). Here is a sample:
    id  player  game  y
    1   P1      G1    1
    2   P1      G2    0
    3   P1      G2    1
    4   P1      G3    0
    5   P2      G1    0
    6   P2      G1    1
    7   P2      G2    0
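To make the imbalance concrete, here is a minimal sketch that counts outcomes overall and per game. It uses only the seven sample rows shown above; the `rows` list and the variable names are just illustrative, not part of my real pipeline.

```python
from collections import Counter

# The seven sample rows from the table above: (id, player, game, y).
rows = [
    (1, "P1", "G1", 1),
    (2, "P1", "G2", 0),
    (3, "P1", "G2", 1),
    (4, "P1", "G3", 0),
    (5, "P2", "G1", 0),
    (6, "P2", "G1", 1),
    (7, "P2", "G2", 0),
]

# Count outcomes over all rows, then separately for each game.
overall = Counter(y for _, _, _, y in rows)
per_game = {}
for _, _, game, y in rows:
    per_game.setdefault(game, Counter())[y] += 1

print("overall:", dict(overall))
for game, counts in sorted(per_game.items()):
    print(game, "positives:", counts[1], "negatives:", counts[0])

# On the full dataset the positive share is about 640,000 / 800,000,
# so always predicting "success" would already be right about 80% of the time.
majority_baseline = 640_000 / 800_000
print("majority-class baseline accuracy:", majority_baseline)
```

That baseline is the figure any accuracy number on the unbalanced data would have to be compared against.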
To make predictions, I sort my data by player first, then by game.
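The sorting step looks roughly like this. It is a sketch on the sample rows only, starting from a shuffled order; it assumes the player and game labels compare correctly as plain strings, which would not hold for labels such as "G10" versus "G2".

```python
# Sample rows as (id, player, game, y), deliberately out of order.
rows = [
    (5, "P2", "G1", 0),
    (3, "P1", "G2", 1),
    (1, "P1", "G1", 1),
    (7, "P2", "G2", 0),
    (2, "P1", "G2", 0),
    (6, "P2", "G1", 1),
    (4, "P1", "G3", 0),
]

# Sort by player first, then by game. Python's sort is stable, so rows
# with the same (player, game) keep their relative input order.
rows_sorted = sorted(rows, key=lambda r: (r[1], r[2]))

for row in rows_sorted:
    print(row)
```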
I want to run some tests to improve the prediction accuracy. As I am a beginner in machine learning and data science, I don't know which is the better choice:
- working with this dataset as it is, even though it is unbalanced?
- or selecting only the games that have an (approximately) balanced output, i.e. excluding any game whose total number of positives is much larger than its total number of negatives?
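To show what I mean by the second option, here is a minimal sketch of the filter on the sample rows. The function name `balanced_games` and the 0.3 to 0.7 bounds on the positive share are arbitrary choices made for illustration; I have not settled on a threshold.

```python
from collections import defaultdict

def balanced_games(rows, low=0.3, high=0.7):
    """Return the games whose share of positive outcomes lies in [low, high].

    `low` and `high` are hypothetical bounds chosen only for this example.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for _id, _player, game, y in rows:
        totals[game] += 1
        positives[game] += y
    return {g for g in totals if low <= positives[g] / totals[g] <= high}

# The seven sample rows: (id, player, game, y).
rows = [
    (1, "P1", "G1", 1),
    (2, "P1", "G2", 0),
    (3, "P1", "G2", 1),
    (4, "P1", "G3", 0),
    (5, "P2", "G1", 0),
    (6, "P2", "G1", 1),
    (7, "P2", "G2", 0),
]

keep = balanced_games(rows)
filtered = [r for r in rows if r[2] in keep]

print("games kept:", sorted(keep))
print(len(filtered), "of", len(rows), "rows kept")
```

On this sample, G3 is dropped because all of its results are negative. On the real data the same filter would discard every play of the excluded games, so the model would never be trained on them.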