Cleaning time series data

Question

I have time series data on the daily usage of a computer program. Here is an example:

2017-11-10: 0
2017-11-09: 14
2017-11-08: 0
2017-11-07: 6
2017-11-06: 102
2017-11-05: 0
2017-11-04: 0

As you can see, 11-06 has a spike at 102. Because of the way we gather this data, we know the value is probably erroneous, and we are sure that 102 is not correct given the other values.

So we need to clean these dirty values.

Is there a mathematical way to do this? Is there a Python library that can help us?

I have used MeanShift to solve my problem. – melih, Nov 12 '17 at 21:13

score 3 · Accepted Answer · answered Nov 13 '17 at 04:15

I think you have a few options:

If you have a pre-set rule to exclude outliers, such as a hard threshold at 100 which you know the data shouldn't exceed, then something as simple as x = [e for e in x if e < 100] will do.
If you have a parametric belief, for example that any observation lying more than a given number of standard deviations from the mean, or outside quartile-based bounds, is an outlier, then you can implement the other answers that have been mentioned.
Otherwise, you can go for a clustering approach. Here I believe your first shot should be k-means clustering, which is easy to build and interpret. See my code below.

import numpy as np
from sklearn.cluster import KMeans

x = [0, 14, 0, 6, 102, 0, 0]
kmeans = KMeans(n_clusters=2).fit(np.array(x).reshape(-1, 1))

# First cluster (label numbering is arbitrary):
np.array(x)[np.where(kmeans.labels_ == 0)]

# Second cluster (the outliers, if 102 received label 1):
np.array(x)[np.where(kmeans.labels_ == 1)]

K-means is known to be sensitive to outliers, so a more robust method such as MeanShift, which you tried, is a good alternative to k-means. I would run both and stick with the result that makes better sense to me.

Hope this helps!

score 1 · Answer 2 · answered Nov 12 '17 at 20:39

One solution is to use the mean and standard deviation to detect outliers in your time series. For example:

>> import numpy as np
>> data = np.array([0,0,102,6,0,14,0])
>> c = 1
>> abs(data - np.mean(data)) < c * np.std(data)
Output: array([ True,  True, False,  True,  True,  True,  True], dtype=bool)
>> clean_data= data[abs(data - np.mean(data)) < c * np.std(data)]
Output: array([ 0,  0,  6,  0, 14,  0])

You can adjust c to suit your requirements.

Moreover, instead of using the mean and standard deviation of all the data, you can apply this method to each section of your time series separately (e.g. every 30 days), because behavior may differ between time intervals.
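The per-section idea can be sketched roughly as follows. This is only an illustration under assumptions: the helper name `segment_outlier_mask`, its `segment_len` parameter, and the choice to skip constant blocks are mine, not part of the answer above.

```python
import numpy as np

def segment_outlier_mask(values, segment_len=30, c=1.0):
    """Return a boolean mask that is True where a value is kept.

    Hypothetical helper: each consecutive block of `segment_len` points
    is compared against its own mean and standard deviation.
    """
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.shape, dtype=bool)
    for start in range(0, len(values), segment_len):
        block = values[start:start + segment_len]
        std = block.std()
        if std == 0:
            continue  # constant block: nothing to flag
        keep[start:start + segment_len] = np.abs(block - block.mean()) < c * std
    return keep

data = np.array([0, 0, 102, 6, 0, 14, 0])
# The sample has only 7 points, so a single 7-point section is used here.
mask = segment_outlier_mask(data, segment_len=7, c=1.0)
clean_data = data[mask]
print(clean_data)
```

With a single section covering the whole sample, this reduces to the global filter shown earlier; with real data you would set `segment_len` to something like 30.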

melih · Answer 3 · 2017-11-12T21:48:10.230

Here is what I am using:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [0,14,0,6,102,0,0]

X = list(zip(x,np.zeros(len(x))))
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

X = np.array(X)
for k in range(n_clusters_):
    my_members = labels == k
    print(k, X[my_members, 0])

Source: http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html

score 1 · Answer 4 · edited May 06 '19 at 12:49


I would use the interquartile range ($IQR = Q3 - Q1$), where the outliers are the values larger than $Q3 + 1.5 \times IQR$ and the values smaller than $Q1 - 1.5 \times IQR$, and $Q1$ and $Q3$ are the first and third quartiles, respectively. Here is a good example.
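Separately from the linked example, a minimal sketch of the IQR rule on the data from the question might look like this. It assumes NumPy's default (linear interpolation) quartile definition; other quartile conventions can give slightly different bounds.

```python
import numpy as np

x = np.array([0, 14, 0, 6, 102, 0, 0])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (x < lower) | (x > upper)

print(x[is_outlier])   # flagged values
print(x[~is_outlier])  # remaining values
```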

answered Nov 12 '17 at 21:23 by Shadi · edited May 06 '19 at 12:49 by Glorfindel

Muralidhar A · Answer 5 · 2019-05-06T14:24:43.990


Usually, people try to remove the outliers from the data. Instead, you can replace those outliers with the median or mean, which can give you better results and trend analysis.

Some references: Replacing outlier with median, Remove outlier from data frame

In my project, I replaced outliers with the median, and it gave better results.
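As a rough sketch of the replacement idea, assuming outliers are flagged with the mean and standard deviation rule from an earlier answer (any other detection rule could be swapped in), one way to do it is:

```python
import numpy as np

x = np.array([0, 14, 0, 6, 102, 0, 0], dtype=float)

# Flag points at least c standard deviations from the mean (assumed rule)
c = 1.0
is_outlier = np.abs(x - x.mean()) >= c * x.std()

# Replace flagged points with the median instead of dropping them,
# so the series keeps its length and stays aligned with its dates
median = np.median(x)
x_replaced = np.where(is_outlier, median, x)

print(x_replaced)
```

Because nothing is removed, every date in the series still has a value, which tends to matter for trend analysis.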

answered May 06 '19 at 13:55 · edited May 06 '19 at 14:24
