
I have been given a task to train an SVM model on the CoNLL-2003 dataset for named entity "identification" (that is, I have to tag all tokens in "Statue of Liberty" as named entities, not as a place, which is the case in named entity recognition).

I am building features that involve multiple tokens in a sequence to determine whether the token at a particular position in that sequence is a named entity or not. That is, I am building features that use the surrounding tokens to decide whether a token is a named entity. So, as you may have guessed, there is a relation between these tokens.
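For concreteness, this kind of context feature can be sketched as follows (a minimal, hypothetical scheme — the feature names and window width here are illustrative, not the assignment's actual ones):

```python
# Hypothetical window-based features for the token at position i in a sentence.
def window_features(tokens, i, width=2):
    feats = {"word": tokens[i].lower(),
             "is_cap": tokens[i][0].isupper()}
    for off in range(-width, width + 1):
        if off == 0:
            continue
        j = i + off
        # Surrounding tokens; "<pad>" marks positions outside the sentence.
        feats[f"w[{off}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats

print(window_features(["The", "Statue", "of", "Liberty", "stands"], 1))
```

Each token then becomes one feature vector, so the entity / non-entity labels are counted per token — which is exactly where the imbalance shows up.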

Now the data is very imbalanced: there are far more non-named entities than named entities, and I wish to fix this. But I cannot simply oversample / undersample tokens randomly, as that may result in nonsensical sentences due to the loss of the relation between tokens.

I am unable to see how I can use other balancing techniques like Tomek links or SMOTE on such sentence data (that is, without making the sentences meaningless).
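One way to see the difficulty: SMOTE creates synthetic minority samples by interpolating between feature vectors, but with the discrete one-hot-style features typical for tokens, the interpolated vector corresponds to no real word. A tiny sketch (the vocabulary and vectors are made up purely for illustration):

```python
import numpy as np

# Hypothetical one-hot "current word" features for two minority-class tokens
# over a made-up 4-word vocabulary.
vocab = ["liberty", "statue", "paris", "london"]
x1 = np.array([1.0, 0.0, 0.0, 0.0])   # token "liberty"
x2 = np.array([0.0, 0.0, 1.0, 0.0])   # token "paris"

# SMOTE's core step: interpolate between a minority sample and a neighbour.
gap = np.random.default_rng(0).uniform()
synthetic = x1 + gap * (x2 - x1)

print(synthetic)  # fractional entries: no longer a one-hot word vector
```

The synthetic point has fractional weight on both "liberty" and "paris", so it cannot be mapped back to any actual token, let alone placed into a coherent sentence.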

So what are the best / preferred techniques for balancing such data?

Rnj
  • In general resampling doesn't work well with text data; it's usually better to deal with the imbalanced data as it is. Side question: why do you use an SVM? NER is usually trained with sequence labeling models like conditional random fields. – Erwan Nov 01 '21 at 09:56
  • 1
    Class imbalance almost certainly is not a problem, and there is no need to solve a non-problem. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Nov 01 '21 at 10:47
  • It's a class assignment to use an SVM, so I can't change the model. I am able to build different features and improve performance. But now I am on the next part of the assignment, which asks us to deal with the imbalanced data. – Rnj Nov 01 '21 at 20:47
  • @Erwan it seems to be the case. Earlier I tried oversampling, undersampling, SMOTE, and undersampling using NearMiss. But surprisingly, I got exactly the same F1 score. I felt I was doing something fishy and had made some stupid mistake, but now I feel I am missing some subtle understanding. Can you please share insight into exactly why such balancing techniques have no effect? Also, is it the text data or the SVM in whose context such techniques have no effect? Any details / links? – Rnj Nov 01 '21 at 20:49
  • @Dave same question as in my comment to Erwan above: I tried oversampling, undersampling, SMOTE, and undersampling using NearMiss, and surprisingly got exactly the same F1 score. Also, can you point me to which of the links you already shared discusses this? – Rnj Nov 01 '21 at 20:50
  • The six links I posted this morning are good ones. – Dave Nov 01 '21 at 20:52
  • @Rnj First in general I agree with Dave that resampling is rarely a good solution, see this short answer. But it gets even worse with text, because text has some special characteristics: it has very high variance in general (there are many different ways to express the same thing) and for this reason it's practically impossible for a dataset to be statistically representative. So when trying to resample one usually ends up with either adding zero information (just repeating the same text instances) or generating some unrealistic ... – Erwan Nov 01 '21 at 21:13
  • ... text instances, therefore adding bias in the dataset which can be even worse. So no, it's not specific to SVM, it wouldn't work well anyway. I don't know of any general reference about this, I suspect that it's just common knowledge among NLP people who have experience with this stuff. What I'd suggest is to inspect your error cases manually and try to see if there's a way to give the right indication to the model with some specific features. – Erwan Nov 01 '21 at 21:24
  • Sorry for the extended delay in the conversation, but it is taking me time to ponder. I came across this comment, which says "Over / under-sampling have no impact on the decision boundary if the support vector set remains unchanged during sampling, which typically is the case at least for the minority class. This has nothing to do with NLP." Then I thought this: "Say we had a neural network with the same features; even then, after sampling, the decision boundary might not have moved... – Rnj Nov 03 '21 at 21:11
  • ...significantly, resulting in the same performance, right? If this is correct, then is this behavior the result of how the features are defined (because that determines how well the feature vectors are separated by the decision hyperplane), rather than of the specific model? I mean, any good model will not move the decision hyperplane much, leading to the same performance, right? And because most NLP-related features are of such a nature, this behavior is common to NLP tasks, right?" @Erwan – Rnj Nov 03 '21 at 21:11
  • 1
    @Rnj I'm not sure I'm knowledgeable enough to answer your questions :) In particular I don't follow your point about neural networks: first representing text for neural networks is normally done with embeddings, and this means that many things differ from the traditional one-hot encoding. Also I don't see how it could be compared, since NNs don't rely on hyper-plane geometry like SVM, it's a completely different paradigm. – Erwan Nov 04 '21 at 11:49
  • The specificity of text data is that language is highly structured while at the same time extremely diverse, thus it's practically impossible to have a truly representative sample, so any supervised model is overfit at some level. – Erwan Nov 04 '21 at 11:53
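The support-vector argument quoted in the comments can be checked numerically. Below is a sketch using scikit-learn's SVC on made-up 2-D data (an assumption — the question doesn't say which SVM implementation is used): duplicating a training point that is not a support vector leaves the fitted hyperplane unchanged, because the original solution still solves the enlarged problem.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, well-separated 2-D data: four "majority" and two "minority" points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [2.0, 2.0], [2.2, 1.9]])
y = np.array([0, 0, 0, 0, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# "Oversample": duplicate a training point that is NOT a support vector 10 times.
non_sv = next(i for i in range(len(X)) if i not in clf.support_)
X2 = np.vstack([X] + [X[non_sv:non_sv + 1]] * 10)
y2 = np.concatenate([y, [y[non_sv]] * 10])
clf2 = SVC(kernel="linear", C=1.0).fit(X2, y2)

# The duplicated point had zero dual weight, so the hyperplane is unchanged.
print(np.allclose(clf.coef_, clf2.coef_, atol=1e-3),
      np.allclose(clf.intercept_, clf2.intercept_, atol=1e-3))
```

Duplicating a support vector, by contrast, effectively raises its penalty and can move the boundary — consistent with resampling sometimes mattering and sometimes not.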

0 Answers