
I have a balanced dataset for a multiclass classification problem with one high-priority label (which must be classified correctly at all costs). How do I go about creating a workflow for this problem? What specific feature engineering/selection methods and classifiers should I be considering?

To be more specific, the data I'm dealing with (including labels) is completely anonymized, so I don't have a clue as to what it actually stands for.

Some approaches I am considering:

  1. Creating synthetic data points for the priority label through oversampling.
  2. Creating a highly non-linear model for prediction, since accuracy is very important.

Any help is much appreciated!

mallochio

1 Answer


This can be tackled in two places:

  1. Data: as you mentioned, this is done by artificially increasing the number of samples from the critical class $cc$ (oversampling). This has the same effect as training on a dataset that is naturally imbalanced in favor of $cc$ (a minimal sketch is given after this list).

  2. Model: this is generally done by penalizing the misclassification of $cc$ more heavily than that of other classes. One place for this modification is the loss function. A frequently used loss function in classification is cross entropy; it can be modified for this purpose as follows. Let $y_{ik} = 1$ if $k$ is the true class of data point $i$, otherwise $y_{ik} = 0$, and let $y'_{ik} \in (0, 1]$ be the corresponding model estimate. The original cross entropy can be written as: $$H_y(y')=-\sum_{i}\sum_{k=1}^{K}y_{ik}\log(y'_{ik})$$
    which can be changed to $$H_y(y')=-\sum_{i}\sum_{k=1}^{K}\color{blue}{w_{k}}y_{ik}\log(y'_{ik})$$ For example, by setting $w_{cc} = 10$ and $w_{k \neq cc}=1$, you are essentially telling the model that misclassifying $1$ member of $cc$ is as punishable as misclassifying $10$ members of other classes. This is roughly equivalent to increasing the ratio of class $cc$ $10$ times in the training set using method (1) (see the second sketch below).
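For method (1), here is a minimal NumPy sketch of plain random oversampling. The names (`X`, `y`, `target_class`) and the factor of 10 are illustrative assumptions; libraries such as imbalanced-learn provide the same idea out of the box (`RandomOverSampler`) plus synthetic variants like SMOTE:

```python
import numpy as np

def oversample_class(X, y, target_class, factor=10, seed=0):
    """Randomly duplicate samples of `target_class` (with replacement)
    so that its count in the training set grows roughly `factor` times."""
    rng = np.random.default_rng(seed)
    idx_cc = np.flatnonzero(y == target_class)
    # draw (factor - 1) * len(idx_cc) extra indices from the critical class
    extra = rng.choice(idx_cc, size=(factor - 1) * len(idx_cc), replace=True)
    X_new = np.concatenate([X, X[extra]], axis=0)
    y_new = np.concatenate([y, y[extra]], axis=0)
    # shuffle so the duplicates are not all clustered at the end
    perm = rng.permutation(len(y_new))
    return X_new[perm], y_new[perm]
```

Note that you should oversample only the training split; otherwise duplicated points leak into the validation/test sets and inflate the metrics.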
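For method (2), here is a minimal NumPy sketch of the weighted cross entropy above; the weight vector `[10, 1, 1]` is only an example (class 0 playing the role of $cc$). In practice most libraries expose the same idea through a class-weight argument, e.g. `class_weight` in scikit-learn estimators or the `weight` argument of `torch.nn.CrossEntropyLoss`:

```python
import numpy as np

def weighted_cross_entropy(y_true, y_prob, class_weights):
    """Weighted cross entropy: -sum_i sum_k w_k * y_ik * log(y'_ik).

    y_true        : (n, K) one-hot matrix, y_ik = 1 iff k is the true class of i
    y_prob        : (n, K) predicted probabilities y'_ik (rows sum to 1)
    class_weights : (K,) vector w_k, e.g. w_cc = 10 and 1 for the other classes
    """
    eps = 1e-12                                   # avoid log(0)
    log_prob = np.log(np.clip(y_prob, eps, 1.0))
    return -np.sum(class_weights * y_true * log_prob)

# toy example: 3 classes, class 0 is the critical class cc
w = np.array([10.0, 1.0, 1.0])
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
print(weighted_cross_entropy(y_true, y_prob, w))
```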

Esmailian