
I have data with the following structure:

created_at   customer_id   features   target
2019-01-01   2             xxxxxxxx   y
2019-01-02   3             xxxxxxxx   y
2019-01-03   3             xxxxxxxx   y
...

That is, a session timestamp, a customer id, some features, and a target. I want to build an ML model to predict this target, and I'm having trouble doing cross-validation properly.

The idea is that this model is deployed and used to model new customers. For this reason, I need the cross-validation setting to satisfy the following properties:

  • It has to be done in a time-series way: that is, for every train-validation split in cross-validation, all created_at values of the validation set must be later than all created_at values of the training set.
  • It has to split customers: that is, for every train-validation split in cross-validation, we cannot have any customer both in train and validation.

Can you think of a way of doing this? Is there an implementation in python or in the scikit-learn ecosystem?
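For concreteness, this is how I would check the two properties on any candidate split (a minimal sketch; train_df and valid_df stand for hypothetical train/validation DataFrames with the columns above):

def check_split(train_df, valid_df):
    # property 1: every validation timestamp is strictly later than every training timestamp
    assert valid_df['created_at'].min() > train_df['created_at'].max()
    # property 2: no customer appears in both sets
    assert not set(train_df['customer_id']) & set(valid_df['customer_id'])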

David Masip

5 Answers


As @NoahWeber mentioned, one solution is to:

  • split by customer ids (A)
  • do the time series split on the whole dataset (B)
  • keep in the training (resp. testing) dataset only the rows that belong to the training (resp. testing) customers from split (A) and to the training (resp. testing) part of the time series split (B).

Below is a code sample I was writing at the same time he answered.

import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import TimeSeriesSplit

Generating dates

def pp(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.DatetimeIndex((10**9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
fake_date = pp(start, end, 500)

Fake dataframe

df = pd.DataFrame(data=np.random.random((500, 5)), index=fake_date, columns=['feat'+str(i) for i in range(5)])
df['customer_id'] = np.random.randint(0, 5, 500)
df['label'] = np.random.randint(0, 3, 500)

First split by customer

rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=42)
for train_cust, test_cust in rkf.split(df['customer_id'].unique()):
    print("training/testing with customers : " + str(train_cust) + "/" + str(test_cust))

    # Then sort all the data (if not already sorted)
    sorted_df = df.sort_index()

    # Then do the time series split
    tscv = TimeSeriesSplit(max_train_size=None, n_splits=5)
    for train_index, test_index in tscv.split(sorted_df.values):
        df_train, df_test = sorted_df.iloc[train_index], sorted_df.iloc[test_index]

        # Keep the right customers for training/testing
        df_train_final = pd.concat([df_train.groupby('customer_id').get_group(i) for i in train_cust])
        df_test_final = pd.concat([df_test.groupby('customer_id').get_group(i) for i in test_cust])

Note: Generating random dates is based on this post

Second note: I tested that the generated training/testing dataframes are ready for cross-validation with the sample code below, which you can add right after the df_test_final line:

# Test condition 1: temporality
for i in range(len(df_test_final)):
    for j in range(len(df_train_final)):
        if df_test_final.index[i] < df_train_final.index[j]:
            print("Error with " + str(i) + "/" + str(j))

# Test condition 2: training customers are not in the final testing df
for i in train_cust:
    if i in df_test_final['customer_id'].values:
        print("Error in df_test_final with customer " + str(i))

# Test condition 2: testing customers are not in the final training df
for i in test_cust:
    if i in df_train_final['customer_id'].values:
        print("Error in df_train_final with customer " + str(i))


Here is a pseudo-code implementation:

function keep_customer_ids( data, ids ):
    goal: this function returns a subset of data with only the events that have
          an id tag that is in ids
    data: labeled events containing features, date and a customer id tag
    ids: the customer ids you want to keep
    for event in data:
        if event has a customer id tag that is in ids, keep it
        else, drop it
    return data

algorithm:
    for the number of cross-validation splits you want:
        customer_train_ids, customer_test_ids = split_by_customers( customer_ids )
        train_data, test_data = split_data_in_time_series_way( data )
        final_train_data = keep_customer_ids( train_data, customer_train_ids )
        final_test_data = keep_customer_ids( test_data, customer_test_ids )
        do_the_fit_predict_things( final_train_data, final_test_data )
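For reference, the keep_customer_ids step boils down to a boolean mask in pandas; a minimal sketch, equivalent to the groupby / get_group lines in the python code above:

def keep_customer_ids(data, ids):
    # keep only the events whose customer_id is in ids
    return data[data['customer_id'].isin(ids)]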

etiennedm
  • Is this line test_df = pd.concat( [ df.groupby('customer_id').get_group(i) for i in train_cust ]) right? or should it be test_cust? – David Masip Jul 30 '20 at 07:54
  • @DavidMasip: yes you are right, I have updated the code – etiennedm Jul 30 '20 at 08:01
  • can you provide a pseudo-code description of the algorithm? I'm not sure on how you choose customers when there are "conflicts" – David Masip Jul 31 '20 at 12:17
  • I have updated the answer including a pseudo-code. I have also updated the python code to be more explicit. I hope it is more relevant now. – etiennedm Jul 31 '20 at 14:14
  • yeah I understand the pseudocode, it's now clearer. however, what does the keep_customer_ids function do? – David Masip Aug 01 '20 at 08:16
  • I have added details in the pseudo-code, I hope it helps. For information, the two lines with keep_customer_ids in pseudo-code are exactly the two lines after the comment # Keep the right customers for training/testing in the python code – etiennedm Aug 01 '20 at 09:08

Here is a solution based on @NoahWeber's and @etiennedm's answers. It combines two kinds of splitting: 1) a repeated k-fold split to get the training customers and testing customers, and 2) a time series split within each k-fold.

The time series split uses a custom CV split iterator based on dates (whereas the usual CV split iterators are based on sample size / number of folds).

An implementation within the sklearn ecosystem is provided.

Let's restate the problem.

Say you have 10 periods and 3 customers indexed as follows:

example_data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    'customer': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'date': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

We do a repeated k-fold with 2 folds and 2 iterations (4 folds in total), and within each k-fold split we split again with a time series split, such that each time series split has 2 folds.
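The customer-level folds can be obtained with RepeatedKFold over the unique customer ids, for instance (a minimal sketch; the exact assignment of customers to train/test depends on random_state):

import numpy as np
from sklearn.model_selection import RepeatedKFold

customers = np.array([0, 1, 2])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_idx, test_idx in rkf.split(customers):
    # each of the 4 folds assigns every customer to either train or test
    print("training customers:", customers[train_idx], "testing customers:", customers[test_idx])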

kfold split 1 : training customers are [0, 1] and testing customers are [2]

kfold split 1 time series split 1 : train indices are [0, 1, 2, 3, 10, 11, 12, 13] and test indices are [24, 25, 26]

kfold split 1 time series split 2 : train indices are [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16] and test indices are [27, 28, 29]

kfold split 2 : training customers are [2] and testing customers are [0, 1]

kfold split 2 time series split 1 : train indices are [20, 21, 22, 23] and test indices are [4, 5, 6, 7, 15, 16, 17]

kfold split 2 time series split 2 : train indices are [20, 21, 22, 23, 24, 25, 26] and test indices are [7, 8, 9, 17, 18, 19]

kfold split 3 : training customers are [0, 2] and testing customers are [1]

kfold split 3 time series split 1 : train indices are [0, 1, 2, 3, 20, 21, 22, 23] and test indices are [14, 15, 16]

kfold split 3 time series split 2 : train indices are [0, 1, 2, 3, 4, 5, 6, 20, 21, 22, 23, 24, 25, 26] and test indices are [17, 18, 19]

kfold split 4 : training customers are [1] and testing customers are [0, 2]

kfold split 4 time series split 1 : train indices are [10, 11, 12, 13] and test indices are [4, 5, 6, 24, 25, 26]

kfold split 4 time series split 2 : train indices are [10, 11, 12, 13, 14, 15, 16] and test indices are [7, 8, 9, 27, 28, 29]

Usual cross-validation iterators, such as those in sklearn, are based on the number of folds, i.e. on the sample size in each fold. Unfortunately, they are not suited to our kfold / time series split with real data: nothing guarantees that the data is perfectly distributed over time and over groups, as we assumed in the previous example.

For instance, the 4th observation of the training customers (say customers 0 and 1 in kfold split 1 of the example) may come after the 4th observation of the testing customer (say customer 2). This violates condition 1.

Here is a CV split strategy based on dates per fold (not on sample size or number of folds). Say you have the previous data but with random dates. Define an initial_training_rolling_months and a rolling_window_months, for example 6 months and 1 month.

kfold split 1 : training customers are [0, 1] and testing customers are [2]

kfold split 1 time series split 1 : train sample is the first 6 months of customers [0, 1] and test sample is the month right after the train sample for customer [2]

kfold split 1 time series split 2 : train sample is the first 7 months of customers [0, 1] and test sample is the month right after the train sample for customer [2]
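To make the date arithmetic concrete, each new time series split only pushes the end of the training window and the one-month test window forward by rolling_window_months, which can be expressed with pd.DateOffset (a minimal sketch, not the implementation below):

import pandas as pd

start_training = pd.to_datetime('2019-01-01')
end_training = start_training + pd.DateOffset(months=6)   # initial_training_rolling_months
start_testing = end_training
end_testing = start_testing + pd.DateOffset(months=1)     # rolling_window_months

# next time series split: grow the training window by one month and slide the test month forward
end_training += pd.DateOffset(months=1)
start_testing += pd.DateOffset(months=1)
end_testing += pd.DateOffset(months=1)
print(end_training, start_testing, end_testing)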

Below is a suggested implementation for building such a time series split iterator.

The returned iterator is a list of tuples that you can use as another cross-validation iterator.

With simple generated data, as in our previous example, to debug the fold generation (note that customer 1's data begins at index 366, and customer 2's at index 732):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
df = generate_happy_case_dataframe()
grouped_ts_validation_iterator = build_grouped_ts_validation_iterator(df)
gridsearch = GridSearchCV(estimator=RandomForestClassifier(), cv=grouped_ts_validation_iterator, param_grid={})
gridsearch.fit(df[['feat0', 'feat1', 'feat2', 'feat3', 'feat4']].values, df['label'].values)
gridsearch.predict([[0.1, 0.2, 0.1, 0.4, 0.1]])

With randomly generated data, as in @etiennedm's example (to debug the split, I covered simple cases such as when the test sample begins before the training samples, or just after):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
df = generate_fake_random_dataframe()
grouped_ts_validation_iterator = build_grouped_ts_validation_iterator(df)
gridsearch = GridSearchCV(estimator=RandomForestClassifier(), cv=grouped_ts_validation_iterator, param_grid={})
gridsearch.fit(df[['feat0', 'feat1', 'feat2', 'feat3', 'feat4']].values, df['label'].values)
gridsearch.predict([[0.1, 0.2, 0.1, 0.4, 0.1]])

The implementation:

import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold

def generate_fake_random_dataframe(start=pd.to_datetime('2015-01-01'), end=pd.to_datetime('2018-01-01')):
    fake_date = generate_fake_dates(start, end, 500)
    df = pd.DataFrame(data=np.random.random((500, 5)), columns=['feat'+str(i) for i in range(5)])
    df['customer_id'] = np.random.randint(0, 5, 500)
    df['label'] = np.random.randint(0, 3, 500)
    df['dates'] = fake_date
    df = df.reset_index()  # important since df.index will be used as split index
    return df

def generate_fake_dates(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.DatetimeIndex((10**9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

def generate_happy_case_dataframe(start=pd.to_datetime('2019-01-01'), end=pd.to_datetime('2020-01-01')):
    dates = pd.date_range(start, end)
    length_year = len(dates)
    length_df = length_year * 3
    df = pd.DataFrame(data=np.random.random((length_df, 5)), columns=['feat'+str(i) for i in range(5)])
    df['label'] = np.random.randint(0, 3, length_df)
    df['dates'] = list(dates) * 3
    df['customer_id'] = [0] * length_year + [1] * length_year + [2] * length_year
    return df

def build_grouped_ts_validation_iterator(df, kfold_n_split=2, kfold_n_repeats=5, initial_training_rolling_months=6, rolling_window_months=1):
    rkf = RepeatedKFold(n_splits=kfold_n_split, n_repeats=kfold_n_repeats, random_state=42)
    CV_iterator = list()
    for train_customers_ids, test_customers_ids in rkf.split(df['customer_id'].unique()):
        print("rkf training/testing with customers : " + str(train_customers_ids) + "/" + str(test_customers_ids))
        this_k_fold_ts_split = split_with_dates_for_validation(df=df,
                                                               train_customers_ids=train_customers_ids,
                                                               test_customers_ids=test_customers_ids,
                                                               initial_training_rolling_months=initial_training_rolling_months,
                                                               rolling_window_months=rolling_window_months)
        print("In this k fold, there is", len(this_k_fold_ts_split), 'time series splits')
        for split_i, split in enumerate(this_k_fold_ts_split):
            print("for this ts split number", str(split_i))
            print("train ids is len", len(split[0]), 'and are:', split[0])
            print("test ids is len", len(split[1]), 'and are:', split[1])
        CV_iterator.extend(this_k_fold_ts_split)
        print('***')

    return tuple(CV_iterator)


def split_with_dates_for_validation(df, train_customers_ids, test_customers_ids, initial_training_rolling_months=6, rolling_window_months=1):
    start_train_df_date, end_train_df_date, start_test_df_date, end_test_df_date = \
        fetch_extremas_train_test_df_dates(df, train_customers_ids, test_customers_ids)

    start_training_date, end_training_date, start_testing_date, end_testing_date = \
        initialize_training_dates(start_train_df_date, start_test_df_date, initial_training_rolling_months, rolling_window_months)

    ts_splits = list()
    while not stop_time_series_split_decision(end_train_df_date, end_test_df_date, start_training_date, end_testing_date, rolling_window_months):
        # The while implies that if the testing sample is less than one month, then the process stops
        this_ts_split_training_indices = fetch_this_split_training_indices(df, train_customers_ids, start_training_date, end_training_date)
        this_ts_split_testing_indices = fetch_this_split_testing_indices(df, test_customers_ids, start_testing_date, end_testing_date)
        if this_ts_split_testing_indices:
            # If testing data is not empty, i.e. something to learn
            ts_splits.append((this_ts_split_training_indices, this_ts_split_testing_indices))
        start_training_date, end_training_date, start_testing_date, end_testing_date = \
            update_testing_training_dates(start_training_date, end_training_date, start_testing_date, end_testing_date, rolling_window_months)
    return ts_splits


def fetch_extremas_train_test_df_dates(df, train_customers_ids, test_customers_ids):
    train_df, test_df = df.loc[df['customer_id'].isin(train_customers_ids)], df.loc[df['customer_id'].isin(test_customers_ids)]
    start_train_df_date, end_train_df_date = min(train_df['dates']), max(train_df['dates'])
    start_test_df_date, end_test_df_date = min(test_df['dates']), max(test_df['dates'])
    return start_train_df_date, end_train_df_date, start_test_df_date, end_test_df_date

def initialize_training_dates(start_train_df_date, start_test_df_date, initial_training_rolling_months, rolling_window_months):
    start_training_date = start_train_df_date
    # cover the case where test consumers begin long after (initial_training_rolling_months after) train consumers
    if start_training_date + pd.DateOffset(months=initial_training_rolling_months) < start_test_df_date:
        start_training_date = start_test_df_date - pd.DateOffset(months=initial_training_rolling_months)
    end_training_date = start_train_df_date + pd.DateOffset(months=initial_training_rolling_months)
    start_testing_date = end_training_date
    end_testing_date = start_testing_date + pd.DateOffset(months=rolling_window_months)
    return start_training_date, end_training_date, start_testing_date, end_testing_date

def stop_time_series_split_decision(end_train_df_date, end_test_df_date, end_training_date, end_testing_date, rolling_window_months):
    no_more_training_data_stoping_condition = end_training_date + pd.DateOffset(months=rolling_window_months) > end_train_df_date
    no_more_testing_data_stoping_condition = end_testing_date + pd.DateOffset(months=rolling_window_months) > end_test_df_date
    stoping_condition = no_more_training_data_stoping_condition or no_more_testing_data_stoping_condition
    return stoping_condition

def update_testing_training_dates(start_training_date, end_training_date, start_testing_date, end_testing_date, rolling_window_months):
    start_training_date = start_training_date
    end_training_date += pd.DateOffset(months=rolling_window_months)
    start_testing_date += pd.DateOffset(months=rolling_window_months)
    end_testing_date += pd.DateOffset(months=rolling_window_months)
    return start_training_date, end_training_date, start_testing_date, end_testing_date

def fetch_this_split_training_indices(df, train_customers_ids, start_training_date, end_training_date):
    train_df = df.loc[df['customer_id'].isin(train_customers_ids)]
    in_training_period_df = train_df.loc[(train_df['dates'] >= start_training_date) & (train_df['dates'] < end_training_date)]
    this_ts_split_training_indices = in_training_period_df.index.to_list()
    return this_ts_split_training_indices

def fetch_this_split_testing_indices(df, test_customers_ids, start_testing_date, end_testing_date):
    test_df = df.loc[df['customer_id'].isin(test_customers_ids)]
    in_testing_period_df = test_df.loc[(test_df['dates'] >= start_testing_date) & (test_df['dates'] < end_testing_date)]
    this_ts_split_testing_indices = in_testing_period_df.index.to_list()
    return this_ts_split_testing_indices

SoufianeK
  • I don't understand the difference between the two approaches you are giving, can you elaborate on that? – David Masip Aug 03 '20 at 06:01
  • sklearn's time series CV iterator splits the dataset based on sample size: the base training sample and the rolling windows are expressed in sample size. 1) the first 100 obs are train and the 50 that follow are test. 2) the first 150 obs are train and the 50 after are test. etc. This approach is not suitable for many groups. This is why I suggest a time series iterator based on time periods. 1) train with the first 6 months of training-customer data and test with the month after on testing-customer data. 2) train with the first 7 months of training-customer data and test with the month after on testing-customer data. etc. – SoufianeK Aug 03 '20 at 09:55

As a first point, when you say "The idea is that this model is deployed and used to model new customers", I guess you mean used to infer on new customers, is that correct? I can think of two possible options:

  1. Following the properties you impose, you can first make use of the TimeSeriesSplit cross-validator from scikit-learn, with which you get the time-ordered indices of each train-validation split, so that you can then apply them to the client IDs you decide on to fulfill the second condition; something like the sketch shown after this list.

  2. As a second option, you could try to apply clustering to your clients, based on certain features, and build as many models as client types you get (each cluster having the history data of n clients). This would solve a possible problem I see in your approach, which is (due to the second restriction) not using a client's whole history data for both training and validating.
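Since the original screenshot is gone, here is a rough sketch of what option 1 could look like; the toy dataframe and the rule for assigning shared client ids (first half to train, the rest to validation) are assumptions, not part of the original answer:

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# toy stand-in for the question's data (an assumption)
df = pd.DataFrame({
    'created_at': pd.date_range('2019-01-01', periods=200, freq='D'),
    'customer_id': np.random.randint(0, 10, 200),
    'feat0': np.random.random(200),
    'target': np.random.randint(0, 2, 200),
}).sort_values('created_at').reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(df):
    train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
    # client ids that fall on both sides of the time split are assigned to one side only
    shared = np.intersect1d(train_df['customer_id'].unique(), val_df['customer_id'].unique())
    keep_in_train = set(shared[: len(shared) // 2])
    train_df = train_df[~train_df['customer_id'].isin([c for c in shared if c not in keep_in_train])]
    val_df = val_df[~val_df['customer_id'].isin(keep_in_train)]
    print(len(train_df), len(val_df))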

German C M
  • hmm I would prefer not to use clustering or anything fancy. I don't understand how to choose the client ids in an algorithmic way, have you thought of it? – David Masip Jul 30 '20 at 07:56
  • yup, infere on new customers – David Masip Jul 30 '20 at 07:57
  • The way I thought is clustering to find similar clients (if you do not want clustering, maybe you can apply an ad-hoc similarity rule among clients); this way, it would ensure in each train-test split validation that similar clients are used to train and validate (again, based on your second restriction) – German C M Jul 30 '20 at 08:11

Sort on the customer id. And then do the time series split. If there is any overlap, drop those rows if possible.

The two conditions can conflict: if a given customer id (say customer 2) appears both at the beginning of the time series and right at the end of it, you cannot expect to avoid dropping some of its rows at the beginning, because keeping them would violate one of the two posed conditions.
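A minimal sketch of one way to read this (dropping the overlapping customers from the training side after the time series split; the toy dataframe is an assumption):

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# toy stand-in for the question's data (an assumption)
df = pd.DataFrame({'created_at': pd.date_range('2019-01-01', periods=100, freq='D'),
                   'customer_id': np.random.randint(0, 8, 100)})

sorted_df = df.sort_values('created_at')
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(sorted_df):
    train_df, test_df = sorted_df.iloc[train_idx], sorted_df.iloc[test_idx]
    # customers appearing on both sides would break the customer-split condition: drop their rows from train
    overlap = set(train_df['customer_id']) & set(test_df['customer_id'])
    train_df = train_df[~train_df['customer_id'].isin(overlap)]
    print(len(train_df), len(test_df))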

Noah Weber
  • thanks! to me there's the question on how to decide to drop, should I drop from train or from test, how would you do it in an algorithmic way? – David Masip Jul 30 '20 at 07:57

This feature was requested on scikit-learn and I have added a PR for it. The code is awaiting review at this point. It was used with some good results in a recent Kaggle competition.

import numpy as np
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243

class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
    ...                    'b', 'b', 'b', 'b', 'b',
    ...                    'c', 'c', 'c', 'c',
    ...                    'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],
    ...           "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']
    TEST GROUP: ['d' 'd' 'd']
    """
    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]
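Assuming the class above has been defined (or the PR branch installed) and the rows of each group are contiguous and in time order, it can be plugged into the usual sklearn utilities by passing the customer ids as groups; a usage sketch:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.random((18, 3))
y = np.random.randint(0, 2, 18)
# customer labels, already ordered in time as in the docstring example
groups = np.array(['a'] * 6 + ['b'] * 5 + ['c'] * 4 + ['d'] * 3)

scores = cross_val_score(RandomForestClassifier(), X, y,
                         cv=GroupTimeSeriesSplit(n_splits=3),
                         groups=groups)
print(scores)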