
I have a dataframe with about 50 columns. The columns hold either categorical or continuous data. The continuous data can range between 0.000001 and 1.0, or between 500,000 and 5,000,000. The categorical data is usually a name, for example a store name.

How can I normalize this data so that I can feed it into a dense layer of a Sequential model?

The Y values are either 0 or 1, so it is a binary classification problem. I am currently normalizing all of the continuous data to be 0-1 and one-hot encoding all of the categorical data, so that if I have a column with 5 names in it, I get a matrix with 5 columns filled with 0's and 1's. Then I join all of the continuous and categorical data and feed it into a Dense layer with init='uniform' and activation='relu'.
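The preprocessing described above can be sketched with pandas alone (the column names and values here are hypothetical, just to mirror the two scales mentioned):

```python
import pandas as pd

# Hypothetical mixed-type frame: 'ratio' is on the tiny scale, 'revenue' on the
# large scale, and 'store' is a categorical name column (made-up data).
df = pd.DataFrame({
    "ratio":   [0.000001, 0.5, 1.0],
    "revenue": [500_000.0, 2_000_000.0, 5_000_000.0],
    "store":   ["A", "B", "A"],
})

# Min-max scale each continuous column to [0, 1].
for col in ["ratio", "revenue"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

# One-hot encode the categorical column: one 0/1 column per distinct name,
# then the result can be fed to a Dense layer as a single numeric matrix.
X = pd.get_dummies(df, columns=["store"])
print(X.columns.tolist())
```

With 5 distinct names the `store` column would expand to 5 indicator columns, which is the matrix of 0's and 1's described above.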

Is this the standard way of doing things?

Ethan
user1367204

1 Answer


Yes, it is — you're doing well!

In most cases, categorical features (columns) should be one-hot encoded. However, continuous features can be a little more complicated.

There are two common ways to preprocess continuous features:

  1. scaling features to the range [0, 1] (as you have done)
  2. removing the mean and scaling to unit variance (so the feature has zero mean and a standard deviation of 1)

In my practice, I choose between these two approaches depending on the dataset.
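Both ways are available off the shelf in scikit-learn; a minimal sketch on a made-up feature at the 500,000–5,000,000 scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy continuous feature (one column, made-up values).
x = np.array([[500_000.0], [1_000_000.0], [5_000_000.0]])

# Way 1: scale to the range [0, 1].
x_minmax = MinMaxScaler().fit_transform(x)

# Way 2: remove the mean and scale to unit variance (z-score).
x_std = StandardScaler().fit_transform(x)

print(x_minmax.ravel())           # all values lie in [0, 1]
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```

In a real pipeline you would `fit` the scaler on the training split only and `transform` the test split with it, so no test-set statistics leak into training.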

Icyblade
  • Isn't it better to scale features to [-1, 1] instead of [0, 1]? If biases are initialized randomly with mean 0, then we want the features to have a mean of 0 as well. – Hugh Feb 04 '17 at 10:31
  • In my practice, [0, 1] has always worked better than [-1, 1], but [-1, 1] might be better in some scenarios I haven't encountered. By the way, there are reports that you can increase the scaling range (say, to [-5, 5]) and increase the learning rate in step (e.g. from 1e-3 to 1e-2). – Icyblade Feb 04 '17 at 10:39