When to clean data?

Question

I am very new to data science / ML and I have what I think is a very basic question - when to 'clean' the data?

Do I clean data before using it to train a classifier (a binary classifier in my experiments)?
Do I clean data that I try to classify using this classifier?
Both?

The data in my case is just a series of Tweets.

score 1 · Answer 1 · answered Apr 24 '18 at 19:17

You would want to clean your data before training a classifier. As a rough, abstract picture of ML, think of a classifier as a giant mathematical matrix.

You want to use the classifier to give you information about the data you have. To do that, you need to clean, parse, or even encode your data into a format that the ML algorithm can understand and from which you can gain the most knowledge.

I can't think of any case where you would want to clean data with a classifier. You could use dimensionality reduction but that is not exactly data cleaning.

Hope that answers your question

score 1 · Answer 2 · answered Apr 24 '18 at 19:47

Cleaning is usually done in the pre-processing or data preparation phase of data mining.

Therefore you might want to clean before you train or use your classifier. The cleaning algorithm should be the same for training and applying the classifier.
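As a minimal sketch of "the same cleaning for training and applying", the snippet below routes both the training tweets and a new tweet through one shared function. The name `clean_tweet` and the specific rules (lowercasing, stripping URLs and @mentions) are assumptions chosen for illustration, not something the question prescribes.

```python
import re

# Hypothetical cleaning rules for tweets; adjust to your own data.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase, drop URLs and @mentions, and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Training time: clean every tweet before fitting the classifier.
raw_train = ["Loving the NEW phone! https://t.co/xyz @Shop", "Worst service EVER @Support"]
train_texts = [clean_tweet(t) for t in raw_train]

# Prediction time: the tweet to classify goes through the very same function.
new_text = clean_tweet("Is the NEW phone any good? @Shop")
```

Keeping the cleaning in a single function makes it hard to accidentally train on one representation and predict on another.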

If you apply your cleaning algorithm to only one of the training and test datasets, the prediction accuracy need not be worse than if you apply it to both. This may be the case when your classifier does not depend on the cleaned features. In that case you might want to check whether these features are relevant for the classification at all.

score 1 · Accepted Answer · answered Apr 24 '18 at 20:01

Data cleaning, often also called data munging, is the process of transforming data from the raw form in which it exists after collection into another format that is more appropriate for later processing, e.g. training models.

This process takes place at the beginning of the whole procedure, before the training and validation of the models. In text mining problems, you also have to handle punctuation marks, remove stopwords (this depends on the data representation you choose: for unigrams it is fine, but for bigrams it is not recommended at all), and apply stemming or lemmatization.
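The text-mining steps above can be sketched roughly as follows. This is a toy illustration only: the tiny `STOPWORDS` set and the suffix-stripping `toy_stem` are hypothetical stand-ins for a real stopword list and a real stemmer or lemmatizer, and the `remove_stopwords` flag is an assumed way to switch stopword removal off when building bigrams.

```python
import string

# Toy stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stopwords=True):
    """Lowercase, strip punctuation, optionally drop stopwords, then stem."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


# Unigram features: stopwords removed.
unigrams = preprocess("The cats are running to the park!")

# Bigram features: stopwords kept so that word pairs stay intact.
pairs = bigrams(preprocess("The cats are running to the park!", remove_stopwords=False))
```

Note that `string.punctuation` also strips `@` and `#`, which may or may not be what you want for tweets.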
