
I have a data set that should be divided into train, validation, and test sets.

set.seed(98274)                          # Creating example data
y <- sample(c(0,1), replace=TRUE, size=500)
x1 <- rnorm(500) + 0.2 * y
x2 <- rnorm(500) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(500) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
x4 <- rnorm(500) + 0.19 * y
x5 <- rnorm(500) + 0.2 * x3 + 0.11 * x2 - 0.174 * x4
x6 <- rnorm(500) - 0.12 * x1 + 0.28 * x2 - 0.33 * y
mydata <- data.frame(y, x1, x2, x3, x4, x5, x6)

Divide the data:

set.seed(200)
n <- nrow(mydata)
id.train <- sample(1:n, 300, replace = FALSE)
id.valid <- sample(setdiff(1:n, id.train), 100, replace = FALSE)
id.test <- setdiff(setdiff(1:n, id.train), id.valid)

mydata.train <- mydata[id.train, ]
mydata.valid <- mydata[id.valid, ]
mydata.test <- mydata[id.test, ]

I want to do variable selection so that the AUC will be maximized. If I just had train and test sets, I would do something like this:

# Transformation of the variable of interest into a factor, with names for the levels
library(caret)
levels <- unique(mydata$y) 
mydata.train$new_y <- factor(mydata.train$y, levels = levels, labels = make.names(levels))

data_ctrl <- trainControl(method = "cv", number = 5,
                          summaryFunction = twoClassSummary, classProbs = TRUE)

Build the model and select variables:

model <- train(new_y ~ x1 + x2 + x3 + x4 + x5 + x6, data = mydata.train,
               method = "glmStepAIC", metric = "ROC",
               trControl = data_ctrl, trace = FALSE)
model$finalModel

Test the model on the test set:

prob.predict <- predict(model$finalModel, mydata.test, type = "response")

cutoff <- 0.5
test.pred <- rep(0, nrow(mydata.test))
test.pred[prob.predict >= cutoff] <- 1

Confusion matrix

M <- table(test.pred, mydata.test$y, dnn = c("Prediction", "Observation"))
M

a <- M[1, 1]   # true negatives  (predicted 0, observed 0)
b <- M[1, 2]   # false negatives (predicted 0, observed 1)
c <- M[2, 1]   # false positives (predicted 1, observed 0)
d <- M[2, 2]   # true positives  (predicted 1, observed 1)

Sensitivity

d / (b + d)   # TP / (TP + FN)

Specificity

a / (a + c)   # TN / (TN + FP)

AUC

library(ROCR)
pred <- prediction(prob.predict, mydata.test$y)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

auc.perf <- performance(pred, measure = "auc")
auc.perf@y.values

I could use the train set to do cross-validation. However, since I have a separate validation set, I don't know how to use it to validate my model. How should we do variable selection and build a model when we have three separate data sets (i.e. train, validation, and test)?

ebrahimi

1 Answer


The validation set would be used for the same job as the split in cross-validation, except that it's done only once:

  1. For each candidate subset of variables, train on the training set, then apply the model to the validation set and calculate the AUC.
  2. Pick the subset which obtained the maximum AUC.
  3. Optionally, retrain the final model on the combined training + validation set with this subset. Otherwise the final model is the one which was trained earlier with this subset. Either way, all the other models can be discarded; they should not be applied to the test set.
  4. Evaluate the final model only on the test set.

The separation into 3 sets is important because the selection of the best subset is a kind of training by itself. As a consequence, the performance obtained on the validation set by the best subset could be partly due to chance (especially if many subsets are tested; this is akin to overfitting). This is why the true performance is later obtained on the fresh, untouched test set.
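Putting these steps together, here is a minimal sketch in R, reusing mydata.train, mydata.valid and mydata.test from the question. It brute-forces every subset of x1 to x6 with plain glm fits rather than caret, and the helper val_auc is just an illustrative name, so adapt the candidate subsets to your own search strategy:

library(ROCR)

vars <- c("x1", "x2", "x3", "x4", "x5", "x6")

# AUC of a fitted glm on a given data set (illustrative helper)
val_auc <- function(fit, newdata) {
  p <- predict(fit, newdata, type = "response")
  performance(prediction(p, newdata$y), measure = "auc")@y.values[[1]]
}

# Step 1: for every candidate subset of variables, train on the training
# set and compute the AUC on the validation set
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
aucs <- sapply(subsets, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = mydata.train, family = binomial)
  val_auc(fit, mydata.valid)
})

# Step 2: pick the subset with the highest validation AUC
best <- subsets[[which.max(aucs)]]
best

# Step 3 (optional): retrain the chosen model on training + validation
final <- glm(reformulate(best, response = "y"),
             data = rbind(mydata.train[, c("y", vars)], mydata.valid[, c("y", vars)]),
             family = binomial)

# Step 4: report performance once, on the untouched test set
val_auc(final, mydata.test)

In practice you would usually restrict the candidate subsets (for example via a stepwise search, as in the question) instead of enumerating all 63 of them.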

Erwan