I have been given the task of training an SVM model on the CoNLL-2003 dataset for named entity "identification". (That is, I have to tag every token in "Statue of Liberty" as a named entity, rather than labeling the phrase as a place, which is what named entity recognition would do.)
I am building features that span multiple tokens in a sequence: they use the surrounding tokens to determine whether the token at a particular position is a named entity. So, as you may have guessed, the tokens are related to one another.
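To make the setup concrete, here is a minimal sketch of the kind of context-window features I mean. The function name `window_features`, the window size, and the two per-token features (lowercased form and title-casing) are only illustrative assumptions, not my actual feature set, and the example sentence and its labels are made up for this post.

```python
from typing import Dict, List


def window_features(tokens: List[str], i: int, size: int = 2) -> Dict[str, object]:
    """Build features for tokens[i] from itself and `size` neighbors on each side."""
    feats: Dict[str, object] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            # Illustrative features only; keys are prefixed with the relative position.
            feats[f"{offset}:lower"] = tok.lower()
            feats[f"{offset}:is_title"] = tok.istitle()
        else:
            # Position falls outside the sentence boundary.
            feats[f"{offset}:pad"] = True
    return feats


# Hypothetical example sentence; 1 = named entity token, 0 = other.
sentence = ["The", "Statue", "of", "Liberty", "is", "in", "New", "York", "."]
labels = [0, 1, 1, 1, 0, 0, 1, 1, 0]

# One feature dict per token; each dict depends on the token's neighbors.
X = [window_features(sentence, i) for i in range(len(sentence))]
```

Because every feature dict reads from neighboring positions, changing the token sequence also changes the features of the tokens that remain.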
Now, the data is very imbalanced: there are far more non-entity tokens than named-entity tokens, and I wish to fix this. But I cannot simply oversample or undersample tokens at random, as that may produce nonsensical sentences because the relations between tokens are lost.
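A small sketch of the problem, using a made-up sentence and a hypothetical helper `undersample_tokens` that drops non-entity tokens until the classes are balanced (a deterministic stand-in for random undersampling, written just for illustration):

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical example sentence; 1 = named entity token, 0 = other.
tokens = ["Yesterday", "we", "took", "a", "ferry", "to", "see", "the",
          "Statue", "of", "Liberty", "."]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

counts = Counter(labels)  # 9 non-entity tokens versus 3 entity tokens


def undersample_tokens(tokens: List[str], labels: List[int],
                       keep_negative: int) -> List[Tuple[str, int]]:
    """Keep all entity tokens and only the first `keep_negative` non-entity tokens."""
    kept: List[Tuple[str, int]] = []
    seen_negative = 0
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            kept.append((tok, lab))
        elif seen_negative < keep_negative:
            kept.append((tok, lab))
            seen_negative += 1
    return kept


def left_neighbor(pairs: List[Tuple[str, int]], word: str) -> Optional[str]:
    """Return the token immediately before `word`, or None at the sentence start."""
    words = [t for t, _ in pairs]
    i = words.index(word)
    return words[i - 1] if i > 0 else None


original = list(zip(tokens, labels))
balanced = undersample_tokens(tokens, labels, keep_negative=counts[1])

print(left_neighbor(original, "Statue"))  # the
print(left_neighbor(balanced, "Statue"))  # took
```

The classes are balanced afterwards, but the "sentence" is now "Yesterday we took Statue of Liberty", so any feature that looks at the token before "Statue" sees a context that never occurred in the data.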
I cannot work out how to apply other balancing techniques, such as Tomek links or SMOTE, to this kind of sentence data (that is, without making the sentences meaningless).
So what are the best / preferred techniques for balancing such data?