How to generate variation in datasets

Question

I am building a deep learning model to detect which drug a user has taken. I have many symptoms and an effect duration for each drug. I created the X and y data, but LSD, for example, has an effect duration of 180 - 720 minutes. Do I really need to create 540 arrays? I would really appreciate some help.

My LSD array:

[28, 180],
[28, 720],
[29, 180],
[29, 720],
[30, 180],
[30, 720],
[31, 180],
[31, 720],
[32, 180],
[32, 720],
[33, 180],
[33, 720],
[34, 180],
[34, 720],
[35, 180],
[35, 720],
[36, 180],
[36, 720],
[37, 180],
[37, 720],
[1, 180],
[1, 720],
[38, 180],
[38, 720],
[12, 180],
[12, 720],
[9, 180],
[9, 720],
[24, 180],
[24, 720],
[17, 180],
[17, 720],
[7, 180],
[7, 720],
[4, 180],
[4, 720],

The first position holds the symptom and the second position holds the duration. I just duplicated each symptom and set the minimum and maximum duration, but this gives me a perfect model. I know I need to add every minute for each symptom, but how do I do this in Python?

List of symptoms

0 - relaxation
1 - euphoria
2 - reduced short-term memory
3 - dry mouth
4 - impaired motor skills
5 - red eyes
6 - mood
7 - increased heart rate
8 - increased appetite
9 - impaired concentration
10 - feeling of power
11 - absence of fear
12 - anxiety
13 - aggressiveness
14 - excitement
15 - loss of appetite
16 - tremors
17 - pupil dilation
18 - numb teeth
19 - insomnia
20 - uncontrolled movements
21 - jaw spasms
22 - headache
23 - blurred vision
24 - nausea
25 - dehydration
26 - periods of depression
27 - total memory loss
28 - illusions
29 - hallucinations
30 - heightened sensory sensitivity
31 - mystical experiences
32 - flashbacks
33 - paranoia
34 - loss of sense of time and space
35 - confusion
36 - loss of emotional control
37 - feeling of well-being
38 - panic
39 - drowsiness
40 - slowed heart rate
41 - respiratory failure
42 - low spirits
43 - loss of interest in family/professional life
44 - feeling of being in paradise
45 - malaise
46 - inability to feel pleasure
47 - inability to feel pain

**Effect durations (in minutes)**

Cannabis: 120 - 240
Cocaine: 30 - 40
Ecstasy: 240 - 480
LSD: 180 - 720
Heroin: 45 - 60

My full code:

X = [
    #cannabis
    [0, 120],
    [0, 240],
    [1, 120],
    [1, 240],
    [2, 120],
    [2, 240],
    [3, 120],
    [3, 240],
    [4, 120],
    [4, 240],
    [5, 120],
    [5, 240],
    [6, 120],
    [6, 240],
    [7, 120],
    [7, 240],
    [8, 120],
    [8, 240],
    [9, 120],
    [9, 240],
    #cocaine
    [1, 30],
    [1, 40],
    [10, 30],
    [10, 40],
    [11, 30],
    [11, 40],
    [12, 30],
    [12, 40],
    [13, 30],
    [13, 40],
    [14, 30],
    [14, 40],
    [15, 30],
    [15, 40],
    [7, 30],
    [7, 40],
    [16, 30],
    [16, 40],
    [17, 30],
    [17, 40],
    [18, 30],
    [18, 40],
    #ecstasy
    [19, 240],
    [19, 480],
    [20, 240],
    [20, 480],
    [21, 240],
    [21, 480],
    [22, 240],
    [22, 480],
    [23, 240],
    [23, 480],
    [24, 240],
    [24, 480],
    [25, 240],
    [25, 480],
    [26, 240],
    [26, 480],
    [27, 240],
    [27, 480],
    [15, 240],
    [15, 480],
    #LSD
    [28, 180],
    [28, 720],
    [29, 180],
    [29, 720],
    [30, 180],
    [30, 720],
    [31, 180],
    [31, 720],
    [32, 180],
    [32, 720],
    [33, 180],
    [33, 720],
    [34, 180],
    [34, 720],
    [35, 180],
    [35, 720],
    [36, 180],
    [36, 720],
    [37, 180],
    [37, 720],
    [1, 180],
    [1, 720],
    [38, 180],
    [38, 720],
    [12, 180],
    [12, 720],
    [9, 180],
    [9, 720],
    [24, 180],
    [24, 720],
    [17, 180],
    [17, 720],
    [7, 180],
    [7, 720],
    [4, 180],
    [4, 720],
    # Heroin
    [39, 45],
    [39, 60],
    [29, 45],
    [29, 60],
    [40, 45],
    [40, 60],
    [41, 45],
    [41, 60],
    [42, 45],
    [42, 60],
    [43, 45],
    [43, 60],
    [44, 45],
    [44, 60],
    [12, 45],
    [12, 60],
    [45, 45],
    [45, 60],
    [46, 45],
    [46, 60],
    [1, 45],
    [1, 60],
    [13, 45],
    [13, 60],
    [24, 45],
    [24, 60],
]
"""
    # DROGAS

    0 - Cannabis
    1 - Cocain
    2 - Ecstasy
    3 - LSD
    4 - Heroin
"""
y = [ 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
]

from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out half of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)

# Fit a decision tree on the training half.
my_classifier = tree.DecisionTreeClassifier()
my_classifier.fit(X_train, y_train)

# Predict the drug label for each held-out row and report the accuracy.
predictions = my_classifier.predict(X_test)
print(predictions)
print(accuracy_score(y_test, predictions))

Sorry for my bad English :( Thanks

If you are actually using a deep learning model, it will very easily overfit with the amount of training data you have provided. Simpler algorithms, like the decision tree you are using in your Python code, would be better for this. — Aiden Grossman, Sep 26 '17 at 00:25

score 2 · Answer 1 · answered Oct 06 '17 at 06:46

I have many symptoms and an effect duration for each drug. I created the X and y data, but LSD, for example, has an effect duration of 180 - 720 minutes. Do I really need to create 540 arrays?

You can (in this particular case, it's fairly easy to generate ~800 rows in a CSV once), but you don't have to: you can apply data augmentation on the fly. This adds some randomness to your training, but it usually helps generalization.
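
Generating the rows once could look roughly like the sketch below. It is only an illustration: `expand_drug` and its `step` argument are hypothetical names, not part of any library, and the symptom indices and the 180 - 720 range are copied from the LSD block in the question.

```python
def expand_drug(symptoms, min_minutes, max_minutes, label, step=1):
    """Pair every symptom with every duration in the inclusive range.

    Hypothetical helper: returns (rows, labels) in the same [symptom, duration]
    layout the question uses for X and y.
    """
    rows = [
        [symptom, minutes]
        for symptom in symptoms
        for minutes in range(min_minutes, max_minutes + 1, step)
    ]
    return rows, [label] * len(rows)


# LSD symptom indices, in the order they appear in the question.
lsd_symptoms = [28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
                1, 38, 12, 9, 24, 17, 7, 4]

# Label 3 is LSD in the question's numbering.
X_lsd, y_lsd = expand_drug(lsd_symptoms, 180, 720, label=3)

print(X_lsd[:3])   # first few generated rows
print(len(X_lsd))  # number of rows generated for LSD
```

Calling the same helper for the other drugs and concatenating the results gives the full X and y. A larger `step` (for example 10) keeps the table small if one row per minute turns out to be more than needed.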

By the way, it seems you aren't actually using deep learning, but rather DecisionTreeClassifier, which is a bit different.
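
Since a decision tree is fit in a single call rather than over many epochs, one way to approximate on-the-fly augmentation is to draw a fresh random sample of (symptom, duration) rows before each fit. The sketch below assumes that approach; `DRUGS` and `sample_dataset` are hypothetical names, and the ranges and symptom indices are copied from the question.

```python
import random

# drug label -> ((min minutes, max minutes), symptom indices), from the question.
DRUGS = {
    0: ((120, 240), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),                 # Cannabis
    1: ((30, 40), [1, 10, 11, 12, 13, 14, 15, 7, 16, 17, 18]),       # Cocaine
    2: ((240, 480), [19, 20, 21, 22, 23, 24, 25, 26, 27, 15]),       # Ecstasy
    3: ((180, 720), [28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
                     1, 38, 12, 9, 24, 17, 7, 4]),                   # LSD
    4: ((45, 60), [39, 29, 40, 41, 42, 43, 44, 12, 45, 46,
                   1, 13, 24]),                                      # Heroin
}


def sample_dataset(drugs, n_per_drug, seed=None):
    """Draw n_per_drug random [symptom, duration] rows for each drug."""
    rng = random.Random(seed)
    X, y = [], []
    for label, ((lo, hi), symptoms) in drugs.items():
        for _ in range(n_per_drug):
            # randint is inclusive on both ends, matching the stated ranges.
            X.append([rng.choice(symptoms), rng.randint(lo, hi)])
            y.append(label)
    return X, y


X_aug, y_aug = sample_dataset(DRUGS, n_per_drug=200, seed=0)
```

The resulting `X_aug` and `y_aug` can be passed to `train_test_split` and `my_classifier.fit` exactly as in the question's code. Drawing the same number of rows per drug also keeps the classes balanced, which the min/max-only table does not.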
