
From this post, I know you can set scale_pos_weight for an imbalanced dataset. However, for a multi-class problem on an imbalanced dataset, I don't quite understand how to set the weight parameter in the xgb.DMatrix.

How can I use XGBoost on an imbalanced dataset for a multi-class classification problem?


1 Answer


As you say, scale_pos_weight only works for two classes (binary classification). For three or more classes, use weight instead. The parameter is passed to the xgb.DMatrix function and must contain one value for each observation (one per row, not one per class).

Example:

library(xgboost)
data(iris)

We'll predict Species

label = as.integer(iris$Species) - 1
iris$Species = NULL

Split the data for training and testing (75/25 split)

n = nrow(iris)
train.index = sample(n, floor(0.75*n))

For example, pick a weight of 1.5 for label "0", 1.0 for the other Species

weights = sapply(label[train.index], function(x) {ifelse(x == 0, 1.5, 1.0)})
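If you would rather derive the weights from the data than hand-pick them, a common heuristic (not part of the original answer) is to weight each observation by the inverse frequency of its class, so rarer classes count more. A minimal sketch that builds the same weights vector:

# Sketch: weight each training row by the inverse frequency of its class
# (illustrative heuristic, not from the original answer).
class.freq = table(label[train.index])                        # counts per class
inv.freq   = 1 / class.freq                                   # inverse frequency per class
weights    = as.numeric(inv.freq[as.character(label[train.index])])
# Optional: rescale so the average weight is 1
weights    = weights * length(weights) / sum(weights)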

Build the training DMatrix, passing the weights

xgb.train = xgb.DMatrix(data=as.matrix(iris[train.index,]), label=label[train.index], weight = weights)
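The answer stops at building the DMatrix. As a rough sketch of how the weighted matrix could then be used, here is a multi-class training call; the objective, num_class, and nrounds values are illustrative and not from the original answer:

# Sketch: fit a multi-class booster on the weighted DMatrix built above.
# "multi:softprob" returns per-class probabilities; num_class is required
# for multi-class objectives (iris has 3 classes). nrounds = 50 is illustrative.
# Note: the DMatrix above is named xgb.train, which shadows the function name
# but still resolves correctly in R.
params = list(
  objective   = "multi:softprob",
  num_class   = 3,
  eval_metric = "mlogloss"
)
bst = xgb.train(params = params, data = xgb.train, nrounds = 50)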

A similar question can be found here.
