198

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, and y_val with scikit-learn?

As far as I know, sklearn.model_selection.train_test_split can only split into two sets, not three...

Arun
Hendrik

16 Answers

238

You could just use sklearn.model_selection.train_test_split twice: first split the data into train and test sets, then split the train set again into train and validation sets. Something like this:

 from sklearn.model_selection import train_test_split

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

 X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
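
To check the resulting proportions, here is a minimal sketch on synthetic data. The arrays X and y below are made-up placeholders with 100 rows; the two calls are the ones from this answer, and the sizes of the three subsets are printed at the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: 80% train+validation, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```

With 100 rows this should give 60, 20 and 20 samples, i.e. a 60/20/20 train/validation/test split.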
hh32
  • 2,732
  • 1
  • 10
  • 9
  • 10
    Yes, this works of course but I hoped for something more elegant ;) Never mind, I accept this answer. – Hendrik Nov 17 '16 at 08:10
  • 1
    I wanted to add that if you want to use the validation set to search for the best hyper-parameters you can do the following after the split: https://gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b – skd Jun 06 '18 at 16:34
  • 20
    So what is the final train, test, validation proportion in this example? Because on the second train_test_split, you are doing this over the previous 80/20 split. So your val is 20% of 80%. The split proportions aren't very straightforward in this way. – Monica Heddneck Jun 14 '18 at 19:22
  • 1
    I agree with @Monica Heddneck that the 64% train, 16% validation and 20% test split could be clearer. It's an annoying inference you have to make with this solution. – Perry Jun 25 '19 at 08:00
  • 1
    if test_size is an integer number this function will take test_size number of elements for test, so you can pre-compute the number of elements in each subsets given your proportion and use these values to do a double split – CodeRonin Nov 10 '19 at 10:39
  • I found this answer useful, so I thought I would add some explanatory text regarding the numbers. The first split creates 80% training+validation and 20% test. The second split starts with the 80% training+validation split and assigns 25% of this 80% to the validation split - this size comes from 0.25 X 0.80 = 0.20 (20%). So the validation split is 20%. So, now we have validation and testing at 20% each. The training split size is calculated as 75% of the 80% = 0.75 X 0.80 = 0.60 (60%). So, this gives a training split size of 60%. Overall, this gives 60%-20%-20% for train-validation-test. – edesz Oct 20 '20 at 03:53
  • I don't have any labels....how do I do the split? – Charlie Parker Feb 13 '21 at 20:28
  • Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) -- see my answer below for further details – Carlos Mougan Nov 18 '21 at 08:38
  • 1
    It is more straightforward to extract validation data from the test dataset in the second call, e.g. splitting into 80/20 train and test and then splitting the test dataset 50/50 (test_size=0.50); now you have 80% for training, 10% for validation and 10% for testing. – HumbleBee Jun 09 '22 at 08:23
81

There is a great answer to this question over on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for training, validation and test sets.
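
As a variation, here is a minimal sketch of the same 60/20/20 idea using positional slicing with iloc instead of np.split (calling np.split on a DataFrame may emit a deprecation warning, depending on the installed pandas and NumPy versions). The DataFrame df and its column names are made-up placeholders, and random_state=1 is only there to make the shuffle reproducible:

```python
import numpy as np
import pandas as pd

# Placeholder DataFrame with 10 rows (hypothetical columns)
df = pd.DataFrame({"feature": np.arange(10), "label": np.arange(10) % 2})

# Shuffle all rows once, reproducibly
shuffled = df.sample(frac=1, random_state=1)

# Cut points at 60% and 80% of the number of rows
n = len(shuffled)
train_end, val_end = int(0.6 * n), int(0.8 * n)

train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(validate), len(test))
```

Each row ends up in exactly one of the three subsets, because the slices are contiguous and do not overlap.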

0_0
  • 955
  • 6
  • 5
  • 4
    I can see the .6 meaning 60%... but what does the .8 mean? – Tom Hale May 11 '19 at 05:02
  • 3
    @TomHale np.split will split at 60% of the length of the shuffled array, then 80% of length (which is an additional 20% of data), thus leaving a remaining 20% of the data. This is due to the definition of the function. You can test/play with: x = np.arange(10.0), followed by np.split(x, [ int(len(x)*0.6), int(len(x)*0.8)]) – 0_0 May 14 '19 at 13:35
  • 1
    This is fantastic, such a simple, straightforward method. I always tried shuffling the indexes, then selecting a first X%, a.s.o. Just great! – devplayer Mar 11 '20 at 11:24
  • 12
    Major benefit of train_test_split is stratification – Kermit Oct 05 '20 at 01:16
  • 1
    Having a random state to this makes it better: train, validate, test = np.split(df.sample(frac=1, random_state=1), [int(.6*len(df)), int(.8*len(df))]) – Julien Nyambal Apr 17 '22 at 23:14
42

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

from sklearn.model_selection import train_test_split

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

print(x_train, x_val, x_test)

Andrei Florea
  • 521
  • 4
  • 3
  • 4
    I think this is the best answer and should be accepted. What do you mean by "# the _junk suffix means that we drop that variable completely" though? – PascalIv Jun 12 '20 at 07:52
  • 1
    And I think the shuffle argument should be set to False in the second call, simply because there is no reason to shuffle again. – PascalIv Jun 12 '20 at 08:01
  • Is this 1-fold-cross validation? – alper Aug 14 '23 at 19:07
11

You can use train_test_split twice. I think this is most straightforward.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

In this way, train, val, test set will be 60%, 20%, 20% of the dataset respectively.

Stephen Rauch
David Jung
7

Most often you will not split the data just once. In a first step, you split your data into a training set and a test set. Subsequently, you perform a parameter search on the training set using more complex splitting schemes, such as cross-validation with k-fold or leave-one-out (LOO).
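
A minimal sketch of that workflow, assuming the iris dataset, an SVC classifier and a small made-up grid for C (all three are illustrative choices, not part of the original answer): hold out a test set once, then let GridSearchCV run 5-fold cross-validation on the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set that is never used during the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Step 2: parameter search with 5-fold cross-validation on the training data;
# each fold acts as the validation set in turn
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 3: evaluate the refitted best model once on the held-out test set
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

In this setup no separate fixed validation set is needed, since cross-validation reuses the training data for validation.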

JLT
7

Extension of @hh32's answer with preserved ratios.

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

Since the remaining dataset is reduced after the first split, new ratios for the reduced dataset must be calculated:

$ R_{new} = \frac{R_{old}}{R_{remaining}}$
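
Because an adjusted ratio such as 0.1 / 0.9 is not exactly representable as a float, the resulting subset sizes can be off by one. One way around this, as a sketch, is to pass absolute sample counts instead, since train_test_split accepts an integer test_size. The helper name split_by_counts and the synthetic arrays are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_by_counts(x, y, ratio_val=0.1, ratio_test=0.1, random_state=0):
    """Three-way split using integer sample counts computed from the ratios."""
    n = len(x)
    n_test = int(round(n * ratio_test))
    n_val = int(round(n * ratio_val))

    # An integer test_size is interpreted as an absolute number of samples
    x_rem, x_test, y_rem, y_test = train_test_split(
        x, y, test_size=n_test, random_state=random_state)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rem, y_rem, test_size=n_val, random_state=random_state)
    return x_train, x_val, x_test, y_train, y_val, y_test


# Synthetic placeholder data: 1000 samples with 2 features each
x = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

x_train, x_val, x_test, y_train, y_val, y_test = split_by_counts(x, y)
print(len(x_train), len(x_val), len(x_test))
```

The train set simply receives whatever remains after the test and validation counts have been taken.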

Jorge Barrios
4

The best answer above does not mention that splitting twice with train_test_split, without adjusting the partition sizes, won't give the initially intended partition:

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The proportions of the validation and test sets within x_remain then change and can be computed as:

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way, all the initially intended partition sizes are preserved.

Stephen Rauch
A.Ametov
2

I would like to summarize all the good and elegant answers.

sklearn.model_selection.train_test_split is the de facto option for a train/validation split. However, if you want a train, validation and test split, the following code can be used.

(Extending answer from 0_0)

  1. Let's say you want a 75/15/10 percent split. If your data and labels are in a pandas DataFrame, use the following:
# shuffle and split
train_df, val_df, test_df = np.split(df.sample(frac=1), [int(.75*len(df)), int(.9*len(df))])
  2. Let's say you have data and labels in two different NumPy arrays.
data = np.arange(1000)
data = np.reshape(data, (100, 10))  # 100 examples with 10 features
labels = np.arange(100)  # assuming 100 different categories

print(data[3])
print(labels[3])

idx = np.random.permutation(len(data))  # get shuffled indices
x, y = data[idx], labels[idx]  # shuffle data and labels with the same permutation

x_train, x_val, x_test = np.split(x, [int(len(x)*0.75), int(len(x)*0.9)])  # split of 75:15:10
y_train, y_val, y_test = np.split(y, [int(len(y)*0.75), int(len(y)*0.9)])

print(len(x_train), len(x_val), len(x_test))
print(x_train[:3])
print(y_train[:3])

2

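The most pythonic way of doing this would be ShuffleSplit. To get three sets, run it twice in a nested fashion: split off the test set first, then split the remainder into train and validation.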

>>> import numpy as np
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 1, 2, 1, 2])
>>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
>>> rs.get_n_splits(X)
5
>>> print(rs)
ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]
>>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25,
...                   random_state=0)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
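
As a rough sketch of what running ShuffleSplit twice could look like for a three-way split (the arrays, the variable names and the 60/20/20 proportions are assumptions chosen for illustration): the second split works on positions within the remainder, so those positions are mapped back to indices of the original array.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Synthetic placeholder data: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Outer split: carve off 20% of all samples as the test set
outer = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
rest_idx, test_idx = next(outer.split(X))

# Inner split: 25% of the remainder (20% of the total) becomes validation
inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_pos, val_pos = next(inner.split(X[rest_idx]))

# Map positions within the remainder back to original indices
train_idx, val_idx = rest_idx[train_pos], rest_idx[val_pos]

X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]
print(len(train_idx), len(val_idx), len(test_idx))
```

Working with index arrays like this also covers the case of unlabeled data, since only X is needed to generate the indices.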

Scikit-learn now provides a much more detailed way of doing cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators

There is also the option of KFold (or RepeatedKFold, shown here), which might be what you are looking for:

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

They also now provide graphics that will allow you to visualize the type of train-test split that you are looking for (there are more types of train test split than random)

[Figure: visualization of cross-validation splits]

Carlos Mougan
2

Here's another approach (assumes equal three-way split):

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records is one-third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third from the entire dataframe)
train = df[:third]

# Testing set (top half of the remainder two third of the dataframe)
test = df[third:][:third]

# Validation set (bottom one third)
valid = df[-third:]

This can be made more concise but I kept it verbose for explanation purposes.

Vishal
2

Given train_frac=0.8, this function creates a 80% / 10% / 10% split:

import sklearn.model_selection  # a bare "import sklearn" is not guaranteed to load the submodule

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Feature data to be split
    param labels:     Labels corresponding to examples
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives a 80% / 10% / 10% split
    '''

    assert train_frac >= 0 and train_frac <= 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = sklearn.model_selection.train_test_split(
                                        examples, labels, train_size=train_frac, random_state=random_state)

    X_val, X_test, Y_val, Y_test   = sklearn.model_selection.train_test_split(
                                        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test,  Y_train, Y_val, Y_test
Tom Hale
2

How about using numpy random choice

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def ttv_split(X, y=None, train_size=.6, test_size=.2, validation_size=.2, random_state=42):
    """ Basic approach using np random choice """
    np.random.seed(random_state)
    X = pd.DataFrame(X, columns=["col_" + str(i) for i in range(X.shape[1])])
    size = sum((train_size, test_size, validation_size))
    n_samples = X.shape[0]
    if not np.isclose(size, 1):
        raise ValueError(f"Sizes of the splits must sum up to 100%, got {size} instead; correct and try again")

    # Each row is assigned to a split independently, so the resulting
    # sizes only approximate the requested fractions
    split_series = np.random.choice(a=["train", "test", "validation"],
                                    p=[train_size, test_size, validation_size],
                                    size=n_samples)
    split_series = pd.Series(split_series)

    X_train = X.iloc[split_series[split_series == "train"].index, :]
    X_test = X.iloc[split_series[split_series == "test"].index, :]
    X_validation = X.iloc[split_series[split_series == "validation"].index, :]

    if y is not None:
        y = pd.DataFrame(y, columns=["target"])

        y_train = y.iloc[split_series[split_series == "train"].index, :]
        y_test = y.iloc[split_series[split_series == "test"].index, :]
        y_validation = y.iloc[split_series[split_series == "validation"].index, :]

        return X_train, X_test, X_validation, y_train, y_test, y_validation
    else:
        return X_train, X_test, X_validation


X, y = load_iris(return_X_y=True)

X_train, X_test, X_validation, y_train, y_test, y_validation = ttv_split(X, y)

Multivac
0

All the answers I see work only if you split two arrays (X and y), which is usually the case, but I found myself needing to split more than two arrays. Therefore I wrote the following function, which can handle an arbitrary number of arrays:

from sklearn.model_selection import train_test_split

def train_test_valid_split(*arrays, test_size: float, valid_size: float, **kwargs):
    # first split: every second element (offset 1) is a test part
    first_split = train_test_split(*arrays, test_size=test_size, **kwargs)
    testing_data = first_split[1::2]
    if valid_size == 0:
        training_data = first_split[::2]
        validation_data = []
    else:
        # second split: rescale valid_size relative to what is left after the first split
        training_validation_data = train_test_split(*first_split[::2], test_size=(valid_size / (1 - test_size)),
                                                    **kwargs)
        training_data = training_validation_data[::2]
        validation_data = training_validation_data[1::2]
    return training_data + testing_data + validation_data

  • Also, if you want to return the data in the same format as the original sklearn function, return list(chain.from_iterable(zip(training_data, testing_data, validation_data))) instead – user132771 Feb 23 '22 at 12:31
0

The easiest way I could think of is to map split fractions to array indices as follows:

train_set = data[:int((len(data)+1)*train_fraction)]
test_set = data[int((len(data)+1)*train_fraction):int((len(data)+1)*(train_fraction+test_fraction))]
val_set = data[int((len(data)+1)*(train_fraction+test_fraction)):]

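where data has first been shuffled in place with random.shuffle(data). Note that random.shuffle returns None, so its result must not be assigned back to data.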
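
A minimal runnable sketch of this slicing idea, assuming data is a plain Python list and using made-up fractions of 0.6 and 0.2. For simplicity it computes the cut points from len(data) rather than len(data)+1, and it uses a seeded random.Random instance only to make the shuffle reproducible:

```python
import random

# Placeholder data and hypothetical split fractions
data = list(range(10))
train_fraction, test_fraction = 0.6, 0.2

# Shuffle in place; shuffle() returns None, so there is nothing to assign
rng = random.Random(0)
rng.shuffle(data)

# Map the fractions to cut points in the shuffled list
n = len(data)
train_end = int(n * train_fraction)
test_end = int(n * (train_fraction + test_fraction))

train_set = data[:train_end]
test_set = data[train_end:test_end]
val_set = data[test_end:]

print(len(train_set), len(test_set), len(val_set))
```

The validation set takes whatever remains after the train and test slices.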

Coddy
0

Run it twice. Here is the math for the 2nd test_size.

Let's say I want {train:0.67, validation:0.13, test:0.20}

The first test_size is 20% which leaves 80% of the original data to be split into validation and training data.

(1.0/(1.0-test_size))*validation_size = second_test_size

(1.0/(1.0-0.20))*0.13 = 0.1625

Also, look into the stratify parameter as that is the real reason to use train_test_split as opposed to selecting random row indices.
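
A minimal sketch combining the formula above with the stratify parameter. The synthetic data with an 80/20 class imbalance and the 60/20/20 target proportions are assumptions chosen so that the arithmetic is easy to follow:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 samples, 80 of class 0 and 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

test_size = 0.20
validation_size = 0.20

# Rescale the validation fraction relative to what is left after the first split
second_test_size = (1.0 / (1.0 - test_size)) * validation_size  # 0.25 here

# stratify keeps the class proportions roughly equal in every subset
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=test_size, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=second_test_size, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Note that the second call stratifies on y_rest, the labels of the remainder, not on the original y.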

Kermit
-1
import numpy as np
import pandas as pd

#length of data 
N = 10
scale=2


#generate sample data (sequential values, not random)
X, y = np.arange(N*scale).reshape((N, scale)), np.arange(N)

#Works for pandas dataframe too
#You can download titanic.csv from here 
#https://github.com/fuwiak/faster_ds/blob/master/sample_data/titanic.csv

#df = pd.read_csv("titanic.csv", sep="\t")
#X=df[df.columns.difference(["Survived"])]
#y=df["Survived"]



def train_test_val(X, y, train_ratio, test_ratio, val_ratio):
    assert np.isclose(train_ratio + test_ratio + val_ratio, 1.0), "wrong given ratio, all ratios have to sum to 1.0"
    assert X.shape[0]==len(y), "X and y shape mismatch"

    ind_train = int(round(X.shape[0]*train_ratio))
    ind_test = int(round(X.shape[0]*(train_ratio+test_ratio)))

    X_train = X[:ind_train]
    X_test = X[ind_train:ind_test]
    X_val = X[ind_test:]

    y_train = y[:ind_train]
    y_test = y[ind_train:ind_test]
    y_val = y[ind_test:]

    return X_train, X_test, X_val, y_train, y_test, y_val
# put ratio as you wish
X_train, X_test, X_val, y_train, y_test, y_val=train_test_val(X, y, 0.8, 0.1, 0.1) 
fuwiak
  • You do not randomize the choice of the training set / testing set. You just put a given share on the full dataset. The model will not learn from a representative dataset as soon as the dataset is not fully randomly distributed, which is likely in such datasets. – questionto42 May 22 '21 at 21:04