Machine learning - features engineering from date/time data

Question

What are the common/best practices for handling time data in machine learning applications?

For example, if a data set has a column with the timestamp of an event, such as "2014-05-05", how can you extract useful features from this column, if any?

Thanks in advance!

Ben Haley · Accepted Answer · 2014-11-05T22:02:14.857

57

I would start by graphing the time variable vs other variables and looking for trends.

For example

[Image: example graph of a variable plotted against time]

In this case there is a periodic weekly trend and a long term upwards trend. So you would want to encode two time variables:

day_of_week
absolute_time

In general

There are several common time frames that trends occur over:

absolute_time
day_of_year
day_of_week
month_of_year
hour_of_day
minute_of_hour

Look for trends in all of these.
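A minimal sketch of how these time frames could be pulled out of a single timestamp with Python's standard library. The function name `calendar_features` and the dictionary layout are illustrative assumptions, not something from the answer itself, and the timestamp is assumed to be timezone-aware.

```python
from datetime import datetime, timezone

def calendar_features(ts: datetime) -> dict:
    """Split one timezone-aware timestamp into the time frames listed above."""
    return {
        "absolute_time": ts.timestamp(),        # seconds since the Unix epoch
        "day_of_year": ts.timetuple().tm_yday,  # 1..366
        "day_of_week": ts.weekday(),            # 0 = Monday .. 6 = Sunday
        "month_of_year": ts.month,              # 1..12
        "hour_of_day": ts.hour,                 # 0..23
        "minute_of_hour": ts.minute,            # 0..59
    }

# The date from the question, taken as midnight UTC (an assumption)
example = calendar_features(datetime(2014, 5, 5, tzinfo=timezone.utc))
```

Each key can then be plotted against the target separately to see which time frames carry a trend.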

Weird trends

Look for weird trends too. For example you may see rare but persistent time based trends:

is_easter
is_superbowl
is_national_emergency
etc.

These often require that you cross reference your data against some external source that maps events to time.
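As a rough illustration of that cross-referencing, here is a sketch that uses a small hand-built lookup table in place of a real external source. The table `EVENT_DATES` and the function `event_flags` are hypothetical names; in practice the dates would come from whatever calendar or event feed you trust.

```python
from datetime import date

# Hand-built lookup table standing in for an external event calendar.
# The table name and layout are assumptions made for this illustration.
EVENT_DATES = {
    "is_easter": {date(2014, 4, 20), date(2015, 4, 5)},
    "is_superbowl": {date(2014, 2, 2), date(2015, 2, 1)},
}

def event_flags(day: date) -> dict:
    """Return one 0/1 flag per event type for the given day."""
    return {name: int(day in days) for name, days in EVENT_DATES.items()}

flags = event_flags(date(2014, 4, 20))
```

Keeping the events in a plain table makes it easy to add new "weird" events once a graph reveals them.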

Why graph?

There are two reasons that I think graphing is so important.

Weird trends
While the general trends can be automated pretty easily (just add them every time), weird trends will often require a human eye and knowledge of the world to find. This is one reason that graphing is so important.
Data errors
All too often data has serious errors in it. For example, you may find that the dates were encoded in two formats and only one of them has been correctly loaded into your program. There are a myriad of such problems and they are surprisingly common. This is the other reason I think graphing is important, not just for time series, but for any data.
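One way to surface the two-formats problem early is to parse the column defensively and list whatever fails. This is only a sketch: the two formats in `KNOWN_FORMATS` and the sample values are assumptions, and the helper name `parse_date` is made up for the example.

```python
from datetime import datetime

# Formats assumed to occur in the raw column; adjust to what your data contains.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text: str):
    """Try each known format in turn; return None when none of them matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

raw = ["2014-05-05", "05/05/2014", "May 5th"]  # made-up sample values
parsed = [parse_date(value) for value in raw]
unparsed = [value for value, result in zip(raw, parsed) if result is None]
```

A non-empty `unparsed` list is a cue to go back and look at the raw data before building any features.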

edited Nov 05 '14 at 22:02

answered Oct 29 '14 at 13:54

Ben Haley

After identifying and selecting date-driven features, do you recommend dropping the original Date columns from the Train_X features? – sAguinaga Dec 09 '19 at 17:13
Thanks! One question: after we create the day_of_year, day_of_week columns and so on, is it necessary to one-hot encode them, since otherwise day of week implies 1 < 2 < 3 < ... < 7? – haneulkim Mar 05 '20 at 00:49
Do you get the variables is_easter and is_superbowl from a package? – Tajni May 26 '20 at 04:54
@Tajni You can use the holidays package for Easter and a lot of other things. I'm not aware of anything out of the box for is_superbowl, but the same package allows you to specify custom holidays. "For more complex logic like 4th Monday of January [or, in the case of the Super Bowl, first Sunday in February], you can inherit the HolidayBase class and define your own _populate(year) method. See [the] documentation for examples." Source: https://github.com/dr-prodigy/python-holidays/blame/abc1b31b112bf787d6cd906f76691db406d3fbee/README.rst#L73-L75 – deepyaman Nov 29 '20 at 04:34
If all other numerical columns are scaled, do we need to scale weekday (1 to 7) as well? – haneulkim Jan 27 '22 at 09:10

score 9 · Answer 2 · answered Sep 03 '15 at 17:43

One more thing to consider, beyond everything that Ben Haley said, is to convert to the user's local time. For example, if you are trying to predict something that occurs around 8pm local time for all users, it will be harder to predict from UTC timestamps.
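A small sketch of the idea, assuming each user's UTC offset is known and fixed. That is a simplification: real time zones change offset with daylight saving time, so a proper time zone database would be the safer choice. The function name `local_hour` and both sample events are made up.

```python
from datetime import datetime, timedelta, timezone

def local_hour(utc_ts: datetime, utc_offset_hours: float) -> int:
    """Hour of day on the user's clock, given a fixed UTC offset."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_ts.astimezone(user_tz).hour

# Two made-up events that both happen at 8pm on the user's own clock
event_user_a = datetime(2014, 5, 6, 0, 0, tzinfo=timezone.utc)   # user at UTC-4
event_user_b = datetime(2014, 5, 5, 18, 0, tzinfo=timezone.utc)  # user at UTC+2

utc_hours = [event_user_a.hour, event_user_b.hour]
local_hours = [local_hour(event_user_a, -4), local_hour(event_user_b, 2)]
```

In UTC the two events look six hours apart in `hour_of_day`, while in local time they share the same value.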

score 8 · Answer 3 · edited May 18 '15 at 07:34

Divide the data into windows and find features for those windows like autocorrelation coefficients, wavelets, etc. and use those features for learning.

For example, if you have temperature and pressure data, break it down to individual parameters and calculate features like number of local minima in that window and others, and use these features for your model.
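A rough sketch of window features for one parameter, using the lag-1 autocorrelation coefficient and the number of local minima as the two descriptors. Non-overlapping windows, the function names, and the temperature readings are all assumptions made for the example; wavelet features are left out.

```python
from statistics import mean

def lag1_autocorrelation(window):
    """Sample autocorrelation of the window at lag 1."""
    mu = mean(window)
    denom = sum((x - mu) ** 2 for x in window)
    if denom == 0:
        return 0.0  # constant window: treat the coefficient as 0
    num = sum((a - mu) * (b - mu) for a, b in zip(window, window[1:]))
    return num / denom

def count_local_minima(window):
    """Count interior points strictly lower than both neighbours."""
    return sum(
        1
        for prev, cur, nxt in zip(window, window[1:], window[2:])
        if cur < prev and cur < nxt
    )

def window_features(series, size):
    """Cut the series into non-overlapping windows and describe each one."""
    rows = []
    for start in range(0, len(series) - size + 1, size):
        window = series[start:start + size]
        rows.append({
            "start": start,
            "lag1_autocorr": lag1_autocorrelation(window),
            "n_local_minima": count_local_minima(window),
        })
    return rows

temperature = [20, 18, 21, 17, 22, 23, 24, 25, 26, 27]  # made-up readings
features = window_features(temperature, size=5)
```

Each row of `features` becomes one training example, and the same functions can be applied to the pressure series to add more columns.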

edited May 18 '15 at 07:34

Alexey Grigorev

answered May 14 '15 at 17:30

Gurpreet Mohaar

score 7 · Answer 4 · edited Oct 14 '22 at 09:40

In several cases, data and events inside a time series are seasonal. In such cases, the month and the year of the event matter a lot. Hence, in such scenarios you can use binary variables to represent whether the event falls in a given month/year or not.

Hope this answers your question. If not, kindly be a little more specific about what exactly you are trying to achieve.

edited Oct 14 '22 at 09:40

tripleee

answered Oct 29 '14 at 07:52

show_stopper

score 5 · Answer 5 · edited Apr 13 '17 at 12:50

As Ben and Nar nicely explained, breaking down the date-time object into buckets of date and time parts helps detect seasonal trends that the complete (and, even worse, usually unique) date-time value would miss.

You didn't mention any specific machine learning algorithm you're interested in, but in case you're also interested in distance-based clustering, like k-means, I'd generalize the date-time object into the unix-time format. This allows a simple numerical distance comparison for the algorithm, simply stating how far apart two date values are.

In your example I'd generalize the date-only value 2014-05-05 to 1399248000 (the unix time representing the start of May 5th, 2014, UTC).

[One could argue that you can achieve that by bucketing the date-time into every possible date-time part, but that would significantly increase your dataset's dimensionality. So I'd suggest combining the unix time, for distance measuring, with some of the date-time buckets.]
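The conversion above can be sketched with the standard library as follows. Treating the date as midnight UTC matches the example value; the function name `to_unix_time` and the second date are assumptions for illustration.

```python
from datetime import datetime, timezone

def to_unix_time(date_text: str) -> int:
    """Seconds since 1970-01-01 00:00 UTC for the start of an ISO date."""
    day = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp())

a = to_unix_time("2014-05-05")
b = to_unix_time("2014-05-12")       # a second, made-up date one week later
distance_in_days = (b - a) / 86400   # 86400 seconds per day
```

Because unix times are much larger than most other features, they would normally be scaled before being fed to a distance-based method such as k-means.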

score 3 · Answer 6 · answered Mar 03 '17 at 15:06

Depending on what you are interested in with the date/time info, you might just want to bin it. For example, if you are interested in distance from a starting point (e.g., Jan 1, 2015), and you want to measure it in months, I would just code it as month 1 (for Jan 1-31, 2015), 2 (Feb 1-28, 2015), 3, 4, 5, 6, etc. Since the distance between the start dates is approximately the same, this represents time distance in a straightforward continuous format. And I say continuous because you can say month 6.5 and know that it is half-way through June 2015. Then you don't have to worry about actual date coding and you can use all your typical classification methods.

If you want to measure in days, I know MySQL has a TO_DAYS function, if you happen to use that to pull data prior to classification. Python probably has something similar, or you can use the unix-time format suggested by mork.
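A possible Python version of both ideas, as a sketch. It assumes the starting point is the first day of a month, and the names `month_position` and `days_since` are made up. For day counts it relies on `date.toordinal()`, whose absolute numbers need not match MySQL's, though differences between two dates are plain day counts either way.

```python
import calendar
from datetime import date

START = date(2015, 1, 1)  # starting point used in the example above

def month_position(day: date, start: date = START) -> float:
    """Month number counted from the start month (= 1); the fractional part
    shows how far through that month the day lies."""
    whole_months = (day.year - start.year) * 12 + (day.month - start.month) + 1
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return whole_months + (day.day - 1) / days_in_month

def days_since(day: date, start: date = START) -> int:
    """Whole days elapsed between the starting point and the given day."""
    return day.toordinal() - start.toordinal()

mid_june = month_position(date(2015, 6, 16))
feb_first_in_days = days_since(date(2015, 2, 1))
```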

Hope this helps!

wolfe · Answer 7 · 2017-03-27T08:54:42.273

Ben is talking about static features that make use of the timestamp.

As an extension, I will introduce lag features. I am not talking about the raw time series, but about aggregates over it.

The tricky part is that future values are unseen, so how can we use such aggregate features in the training data?

A little example: there is yearly electricity consumption data from 1991 to 2015, and I want to predict the consumption for the next 5 years, 2016 to 2020. I would like to use the moving average of the last 5 years of consumption as the feature value for 2020, but 2016 to 2020 are unknown to us. So we lead (the opposite of lag) the time series by 5 years: we compute the moving average over 2011 to 2015 and use that value as the feature value for 2020. In this way we can construct the feature data for the next 5 years.

The next step is just to apply moving functions (count, mean, median, min, max, etc.) with different window sizes, and you will have constructed lots of features!
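A sketch of one such feature: a moving average over a window that ends a fixed number of years before the year being predicted, so it only ever uses known values. The consumption figures are invented, and the function name and the `window` and `horizon` parameters are assumptions. With a 5-year window and a 5-year horizon, the feature for 2020 averages 2011 to 2015.

```python
from statistics import mean

def lagged_moving_average(values_by_year, target_year, window, horizon):
    """Mean of the `window` years that end `horizon` years before target_year.

    Returns None when any of the required years is missing.
    """
    last = target_year - horizon
    years = range(last - window + 1, last + 1)
    if any(year not in values_by_year for year in years):
        return None
    return mean(values_by_year[year] for year in years)

# Made-up consumption figures for 1991..2015, for illustration only.
consumption = {year: 100 + 2 * (year - 1991) for year in range(1991, 2016)}

# Feature for a future year, built only from observed years
feature_2020 = lagged_moving_average(consumption, 2020, window=5, horizon=5)

# The same feature for past years, to be paired with known targets in training
training_rows = {
    year: lagged_moving_average(consumption, year, window=5, horizon=5)
    for year in range(2000, 2016)
}
```

Swapping `mean` for `min`, `max` or `median`, or changing `window`, gives the other variants mentioned above.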

score 1 · Answer 8 · answered Jul 22 '16 at 21:37

Plot graphs with different variations of time against the outcome variable to see its impact. You could use month, day, year as separate features and since month is a categorical variable, you could try a box/whisker plot and see if there are any patterns. For numerical variables, you could use a scatter plot.

score 1 · Answer 9 · answered Mar 02 '17 at 18:36

I don't know if this is a common/best practice, but it's another point of view of the matter.

If you have, let's say, a date, you can treat each field as a "category variable" instead of a "continuous variable". The day would have a value in the set {1, 2, ..., 31}, the month would have a value in {1, ..., 12} and, for the year, you choose a minimum and a maximum value and build a set.

Then, as the specific numeric values of days, months and years might not be useful for finding trends in the data, use a binary representation to encode the numeric values, with each bit being a feature. For example, month 5 would be 0 0 0 0 1 0 0 0 0 0 0 0 (eleven 0's and a 1 in the 5th position, each bit being a feature).

So, having, for example, 10 years in the "year's set", a date would be transformed into a vector of 53 features (= 31 + 12 + 10). Using "sparse vectors", the number of features shouldn't be a problem.

Something similar could be done for time data, day of the week, day of the month...
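A minimal sketch of this encoding for a single date, written with plain lists rather than sparse vectors. The function name `one_hot_date` and the year range 2010 to 2019 are assumptions; with ten years the vector has 31 + 12 + 10 = 53 entries.

```python
from datetime import date

def one_hot_date(day: date, min_year: int, max_year: int) -> list:
    """Concatenate one-hot vectors for day (31), month (12) and year."""
    if not min_year <= day.year <= max_year:
        raise ValueError("year outside the chosen range")
    day_bits = [int(d == day.day) for d in range(1, 32)]
    month_bits = [int(m == day.month) for m in range(1, 13)]
    year_bits = [int(y == day.year) for y in range(min_year, max_year + 1)]
    return day_bits + month_bits + year_bits

vector = one_hot_date(date(2014, 5, 5), min_year=2010, max_year=2019)
month_part = vector[31:43]  # the 12 month bits
```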

It all depends on the question you want your machine learning model to answer.

answered Mar 02 '17 at 18:36

Paco Barter

This fails to capture relationships that probably exist, like the 14th and 15th of the month being 'similar'. To the extent that you believe that every day is literally different, you also believe that prediction about tomorrow is not possible. It's also not always necessary to one-hot encode categoricals. – Sean Owen Mar 02 '17 at 19:07
I can't see why it fails to capture the "proximity" of near dates. If you, for example, feed the binary vector to a NN it'll figure it out itself after proper training. Using binary vectors is only one way of representing categories. – Paco Barter Mar 02 '17 at 23:33
In this instance, you effectively have columns like "is_12th" and "is_13th" which are, in the input space, unrelated, and unrelated to "is_1st", etc. As a continuous feature, it would correctly capture that the 12th and 13th are in some sense closer than 1st and 12th are. You are appealing to what a model might infer, but I am talking about what the input features encode. – Sean Owen Mar 03 '17 at 13:18
Ok, I see. You're right, a continuous feature better captures the "proximity" quality of dates. My point is that there might be trends in the data for which the numeric values of dates are irrelevant (for example, a certain pattern of customers purchasing only on Saturdays). Hence offering another point of view for dealing with dates. – Paco Barter Mar 03 '17 at 18:03
Actually, as @PacoBarter said, one-hot encoding ignores the different distances between categories. This is not that easy to tackle, as these features are intrinsically phase information, while most machine learning models have no phase-type input. Some DIY on distance metrics might do, though. – plpopk Mar 19 '19 at 10:39

score 1 · Answer 10 · answered Sep 18 '18 at 20:44

Context of my Response: There have been great responses so far. But I want to extend the conversation by assuming you are speaking about a machine learning application to predict future values of this particular time series. With that context in mind, my advice is below.

Advice: Look into traditional statistical forecasting strategies first (e.g. Exponential Smoothing, SARIMAX or Dynamic Regression) as a baseline for prediction performance. Although machine learning has shown great promise for a variety of applications, for time series there are tried and true statistical methods which may serve you better for your application. I would draw your attention to two recent articles:

Statistical and Machine Learning Forecasting Methods: Concerns and Ways Forward by Spyros Makridakis et al. The article points out that for many time series, traditional statistical time series analysis outperforms machine learning (ML) models. In essence, ML has a tendency to overfit, and any ML model assumption that entries are independent is violated.
Simple Versus Complex Forecasting: The Evidence by Kesten C Green et al. The article examines peer-reviewed journal articles that report time series analyses, with and without comparisons against a variety of models. In conclusion, researchers overcomplicate their analysis with models which are more difficult to interpret and have worse performance. Commonly, this occurs because of poor incentive structures.

If you are looking for good performance, choose a metric (e.g. MASE) on which to compare several models, and sweep through several statistical models (references below) and machine learning models (with the feature development strategies mentioned above).
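For reference, a sketch of MASE (mean absolute scaled error) in its non-seasonal form: the forecast's mean absolute error divided by the in-sample mean absolute error of the one-step naive forecast, which simply repeats the previous value. The function name and all numbers are made up for illustration.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error (non-seasonal form).

    Scales the forecast's mean absolute error by the in-sample mean absolute
    error of the one-step naive forecast on the training data.
    """
    if len(train) < 2:
        raise ValueError("need at least two training points")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty, equal length")
    naive_errors = [abs(b - a) for a, b in zip(train, train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant; MASE is undefined")
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / scale

train = [10, 12, 11, 13, 12]   # made-up history
actual = [14, 13]              # made-up held-out values
score = mase(train, actual, forecast=[13, 13])
```

A score below 1 means the forecast errors are, on average, smaller than the naive method's in-sample errors, which makes the metric comparable across statistical and machine learning models.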

Cheers,

Resources for Learning Statistical Forecasting: I would start by reviewing the free textbook by Rob J Hyndman here: https://otexts.org/fpp2/. The text is based upon an R package you can easily incorporate into your analysis: https://otexts.org/fpp2/appendix-using-r.html. Finally, please be aware of the difference between cross-sectional cross-validation and time series cross-validation, as explained here: https://robjhyndman.com/hyndsight/tscv/.
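To make that difference concrete, here is a sketch of a rolling-origin split, in which every test block lies strictly after its training block, unlike the shuffled folds of cross-sectional cross-validation. The expanding training window, the function name and its parameters are assumptions for illustration, not taken from the linked page.

```python
def rolling_origin_splits(n_samples, initial_train, horizon=1):
    """Yield (train_indices, test_indices) pairs in time order.

    The training set always starts at index 0 and grows by one observation
    per split; the test set is the `horizon` observations that follow it.
    """
    for end in range(initial_train, n_samples - horizon + 1):
        train = list(range(0, end))
        test = list(range(end, end + horizon))
        yield train, test

splits = list(rolling_origin_splits(n_samples=6, initial_train=3, horizon=1))
```

Each model is refit on `train` and scored on `test`, and the scores are averaged over the splits.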
