It is probably fairly rare to need `fit` rather than `fit_transform` for a sklearn transformer. It nevertheless makes sense to keep the methods separate: fitting a transformer means learning the relevant information about the data, while transforming produces an altered dataset. Fitting on its own also makes sense for sklearn predictors, and only some of those (particularly clusterers and outlier detectors) provide a combined `fit_predict`.
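
The usual reason to call `fit` on its own is to learn parameters from training data and then apply the same already-fitted transformation to held-out data. A minimal sketch of that workflow (the choice of `StandardScaler` here is just illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                    # learn mean/std from the training data only
X_train_t = scaler.transform(X_train)  # equivalent to fit_transform(X_train) here
X_test_t = scaler.transform(X_test)    # reuse training statistics; no refitting

print(scaler.mean_, scaler.scale_)     # the information learned at fit time
```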
I can think of at least one instance where a transformer gets fitted but does not (immediately) transform data, but it is internal. In `KBinsDiscretizer`, if `encode='onehot'`, then an internal instance of `OneHotEncoder` is created, and at `fit` time for the discretizer the encoder is fitted (to dummy data) just to prepare it to transform future data. Transforming the data given to `KBinsDiscretizer.fit` would be wasteful at that point.
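
From the outside, this just means the one-hot encoding only happens when you call `transform`. A short sketch (the pre-fitted internal encoder lives in a private attribute, an implementation detail that may change between versions):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

disc = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='uniform')
disc.fit(X)                    # learns bin edges; also pre-fits the internal encoder

print(disc.bin_edges_)              # learned at fit time
print(disc.transform(X).toarray())  # one-hot output is only produced here
```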
Finally, one comment on your post:

> we have `fit_transform` which is much faster than using `fit` and `transform` separately

In most (but not all) cases, `fit_transform` is literally the same as `fit(X, y).transform(X)`, so this should not be faster.
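
You can check that equivalence directly. A quick sketch (among the exceptions: `PCA`, whose `fit_transform` reuses intermediate results of the decomposition and so can genuinely be cheaper than `fit` followed by `transform`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# The default fit_transform (from TransformerMixin) is just fit-then-transform:
a = StandardScaler().fit_transform(X)
b = StandardScaler().fit(X).transform(X)
print(np.allclose(a, b))  # True
```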
If `fit_transform` = `fit`-then-`transform`, and you usually `fit_transform` training data then `transform` test data, why does `fit` exist as a separate method (or when/why should you use it)? – Ben Reiniger Apr 12 '21 at 20:13