
I know a similar post was made here, but I wanted to ask some follow-up questions. I am conducting a cross-validation search to find values of a set of hyper-parameters and need to normalise the data.

If we split up the data as follows:

  1. Split the data into 'training' (call this set 'A' for now) and testing sets
  2. Split the 'training' set 'A' into a smaller training set (call this set 'B' for now) and a validation set

what parameters should be used when normalising the datasets?

Am I correct in thinking that:

  1. We compute the means and standard deviations on dataset 'B' and use them to normalise 'B'
  2. We then normalise the validation set using those parameters obtained from set 'B'
  3. Once we have used the validation set to find the hyper-parameters with cross-validation, we compute the means and standard deviations on the whole of set 'A' and normalise it
  4. We use the parameters from set 'A' to normalise the testing set

Is this correct, or have I misunderstood something? I know this is basic, but I can't seem to find a straightforward answer to this anywhere.
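For concreteness, here is roughly what I have in mind in code (a minimal sketch assuming scikit-learn's StandardScaler and train_test_split; the data and variable names are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# placeholder data; in practice X and y come from the real dataset
X, y = np.random.rand(100, 5), np.random.rand(100)

# split 1: 'A' (training) and testing data
X_A, X_test, y_A, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# split 2: 'A' into training 'B' and validation sets
X_B, X_val, y_B, y_val = train_test_split(X_A, y_A, test_size=0.25, random_state=0)

# steps 1-2: compute means/stds on 'B' only, then apply them to 'B' and the validation set
scaler_B = StandardScaler().fit(X_B)
X_B_scaled, X_val_scaled = scaler_B.transform(X_B), scaler_B.transform(X_val)

# ... cross-validation / hyper-parameter search happens here ...

# steps 3-4: recompute means/stds on the whole of 'A', then apply them to 'A' and the test set
scaler_A = StandardScaler().fit(X_A)
X_A_scaled, X_test_scaled = scaler_A.transform(X_A), scaler_A.transform(X_test)
```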

1 Answer


I am not exactly sure what you mean by "what parameters should be used when normalizing datasets."

However, it is important to note:

Normalization is a preprocessing step that you apply to some or all of the parameters (features) of your model before constructing the model.

But in answer to your question:

You always normalize the same parameters in both the train and the test set (otherwise how would you be able to compare the results?).
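For example, something like this (a quick sketch assuming scikit-learn's StandardScaler; the data here is just a placeholder):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 5)  # placeholder training data
X_test = np.random.rand(20, 5)   # placeholder test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # means/stds estimated on the training data only
X_test_scaled = scaler.transform(X_test)        # the same means/stds applied to the test data
```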

Ethan
  • Thanks. Yes, the original question ought to have been phrased better. That does make sense. After we have done cross-validation to tune the hyper-parameters, do we 're-normalise' the (whole) training and testing sets (steps 3 & 4)? I am just not sure if those steps are the correct thing to do after we have found the parameters to use and now want to test the model's metrics. Intuition suggests steps 3 & 4 are correct, but I just want to double-check. Thanks – Rocky the Owl Dec 04 '20 at 19:55
  • Once you get your $\hat{y}$ prediction out of the model with normalized parameters you would need to undo this if you would like to interpret it in the same context as the unnormalized values. – Ethan Dec 04 '20 at 20:11
  • It is still a little bit unclear what you are asking, but I think your reasoning is correct. You would need to normalize all of the parameters from all of the partitions of the original dataset that you are using to build your model. – Ethan Dec 04 '20 at 20:12
  • Apologies, I will try to explain it more clearly. Goal: find the optimal hyper-parameters for the model. The original dataset is split into training+validation and testing. For cross-validation, we just use the training+validation data and partition it into training and validation. We then normalise the training set and use those same parameters for the validation set. Then we do CV and find the hyper-parameters. Now (for steps 3 & 4), we use those optimal hyper-parameters (for the model) and normalise (training+validation) together, then use those extracted parameters to normalise the test set. Does that sound correct? – Rocky the Owl Dec 04 '20 at 20:19
  • Does that make more sense? – Rocky the Owl Dec 04 '20 at 20:20
  • So you are not using the parameters to normalize the test set. The procedure is as follows. Partition into train, validation, test. Normalize parameters from train. Normalize same parameters you normalized in train in validation. Do CV. Normalize these same parameters in test. Evaluate model performance. Does this make sense? – Ethan Dec 04 '20 at 20:26
  • What is important is that the parameters you select to normalize must be normalized in each of the partitions that you use throughout this process. – Ethan Dec 04 '20 at 20:26
  • I am with you up until after the 'Do CV' step. After that stage, should we not train our model using the training and validation sets combined (thus requiring us to combine the train and validation sets to form a new training set)? Then we can use this new training set to normalise the testing set? – Rocky the Owl Dec 04 '20 at 20:40
  • I know that the difference may be minimal depending on the size of the validation set, but I am wondering from a conceptual standpoint – Rocky the Owl Dec 04 '20 at 20:40
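To make the final step discussed in these comments concrete, it might look something like this (a sketch assuming scikit-learn, with a ridge model purely as a stand-in; the data, names, and hyper-parameter value are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# placeholder data and hyper-parameter; in practice these come from the earlier steps
X_train, X_val, X_test = np.random.rand(60, 5), np.random.rand(20, 5), np.random.rand(20, 5)
y_train, y_val, y_test = np.random.rand(60), np.random.rand(20), np.random.rand(20)
best_alpha = 1.0  # hyper-parameter found via CV on the normalised train/validation data

# combine train + validation, recompute the means/stds on the combined set,
# and use those same statistics to normalise the test set before final evaluation
X_comb = np.vstack([X_train, X_val])
y_comb = np.concatenate([y_train, y_val])

scaler = StandardScaler().fit(X_comb)
model = Ridge(alpha=best_alpha).fit(scaler.transform(X_comb), y_comb)
test_score = model.score(scaler.transform(X_test), y_test)
```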