
I have a data set that should be divided into train, validation, and test sets.

set.seed(98274)                          # Creating example data
y <- sample(c(0,1), replace=TRUE, size=500)
x1 <- rnorm(500) + 0.2 * y
x2 <- rnorm(500) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(500) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
x4 <- rnorm(500) + 0.19 * y
x5 <- rnorm(500) + 0.2 * x3 + 0.11 * x2 - 0.174 * x4
x6 <- rnorm(500) - 0.12 * x1 + 0.28 * x2 - 0.33 * y
mydata <- data.frame(y, x1, x2, x3, x4, x5, x6)

Divide the data:

set.seed(200)
n <- nrow(mydata)
id.train <- sample(1:n, 300, replace = FALSE)
id.valid <- sample(setdiff(1:n, id.train), 100, replace = FALSE)
id.test <- setdiff(setdiff(1:n, id.train), id.valid)

mydata.train <- mydata[id.train, ]
mydata.valid <- mydata[id.valid, ]
mydata.test <- mydata[id.test, ]

I want to do variable selection so that the AUC will be maximized. If I just had train and test sets, I would do something like this:

# Transformation of the variable of interest into a factor, with names for the levels
library(caret)
levels <- unique(mydata$y) 
mydata.train$new_y <- factor(mydata.train$y, levels = levels, labels = make.names(levels))

data_ctrl <- trainControl(method = "cv", number = 5,
                          summaryFunction = twoClassSummary, classProbs = TRUE)

Build the model and select variables:

model <- train(new_y ~ x1 + x2 + x3 + x4 + x5 + x6, data = mydata.train,
               method = "glmStepAIC", metric = "ROC",
               trControl = data_ctrl, trace = FALSE)
model$finalModel

Test the model on the test set:

prob.predict <- predict(model$finalModel, mydata.test, type = "response")

cutoff <- 0.5
test.pred <- rep(0, nrow(mydata.test))
test.pred[prob.predict >= cutoff] <- 1

Confusion matrix

M <- table(test.pred, mydata.test$y, dnn = c("Prediction", "Observation"))
M

a <- M[1, 1]   # true negatives  (predicted 0, observed 0)
b <- M[1, 2]   # false negatives (predicted 0, observed 1)
c <- M[2, 1]   # false positives (predicted 1, observed 0)
d <- M[2, 2]   # true positives  (predicted 1, observed 1)

Sensitivity

d / (b + d)   # TP / (TP + FN)

Specificity

a / (a + c)   # TN / (TN + FP)

AUC

library(ROCR)
pred <- prediction(prob.predict, mydata.test$y)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

auc.perf <- performance(pred, measure = "auc")
auc.perf@y.values

I could use the train set to do cross-validation. However, since I have a separate validation set, I don't know how to use it to validate my model. How should we do variable selection and build a model when we have three separate data sets (i.e. train, validation, and test)?

ebrahimi

1 Answer


The validation set would be used for the same job as the split in cross-validation, except that it's done only once:

  1. For each candidate subset of variables, train on the training set, then apply the model to the validation set and calculate the AUC.
  2. Pick the subset which obtained the maximum AUC.
  3. Optionally, retrain the final model on the combined training + validation set with this subset. Otherwise the final model is the one which was trained earlier with this subset. Either way, all the other models can be discarded; they should not be applied to the test set.
  4. Evaluate the final model only on the test set.

The separation into 3 sets is important because the selection of the best subset is a kind of training by itself. As a consequence, the performance obtained on the validation set by the best subset could be partly due to chance (especially if many subsets are tested; this is akin to overfitting). This is why the true performance is later obtained on the fresh, untouched test set.
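Putting these steps together, here is a minimal sketch in R, reusing mydata.train, mydata.valid and mydata.test from the question. It brute-forces every subset of x1 to x6 with plain glm fits rather than caret, and the helper val_auc is just an illustrative name, so adapt the candidate subsets to your own search strategy:

library(ROCR)

vars <- c("x1", "x2", "x3", "x4", "x5", "x6")

# AUC of a fitted glm on a given data set (illustrative helper)
val_auc <- function(fit, newdata) {
  p <- predict(fit, newdata, type = "response")
  performance(prediction(p, newdata$y), measure = "auc")@y.values[[1]]
}

# Step 1: for every candidate subset of variables, train on the training
# set and compute the AUC on the validation set
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
aucs <- sapply(subsets, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = mydata.train, family = binomial)
  val_auc(fit, mydata.valid)
})

# Step 2: pick the subset with the highest validation AUC
best <- subsets[[which.max(aucs)]]
best

# Step 3 (optional): retrain the chosen model on training + validation
final <- glm(reformulate(best, response = "y"),
             data = rbind(mydata.train[, c("y", vars)], mydata.valid[, c("y", vars)]),
             family = binomial)

# Step 4: report performance once, on the untouched test set
val_auc(final, mydata.test)

In practice you would usually restrict the candidate subsets (for example via a stepwise search, as in the question) instead of enumerating all 63 of them.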

Erwan