How to treat outliers in a time series dataset?

Question

I've read the following article about how to treat outliers in a dataset: http://napitupulu-jon.appspot.com/posts/outliers-ud120.html

Basically, he removes all the y values that differ hugely from the majority:

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """

    # Squared residuals; the inputs are assumed to be NumPy arrays of shape (n, 1)
    errors = (net_worths - predictions) ** 2
    cleaned_data = zip(ages, net_worths, errors)

    # Sort by error, largest first, then skip the worst 10% to keep the other 90%
    cleaned_data = sorted(cleaned_data, key=lambda x: x[2][0], reverse=True)
    limit = int(len(net_worths) * 0.1)

    return cleaned_data[limit:]

But how can I apply this technique to a time series dataset, where the rows are correlated with each other?

http://datascience.stackexchange.com/questions/16930/anamoly-detection-for-transaction-data — Hobbes, Mar 21 '17 at 17:19
@Hobbes ouh! that's exactly what I'm looking for. Look at this: https://aqibsaeed.github.io/2016-07-17-anomaly-detection/ — mllamazares, Mar 21 '17 at 19:42

CalZ · Accepted Answer · 2017-03-21T18:47:59.187

Decide how autocorrelated a typical reading in the time series is. For example, "I'm tracking temperature over time and it rarely changes more than 30 degrees F in an hour".
Throw out or smooth any value that changes by more than that. In other words, "If I ever see the temperature change more than 30 degrees in an hour, I'm going to ignore that value and substitute the average of the prior and the next value, because that must be a sensor malfunction".
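
The two steps above could be sketched roughly like this in plain Python. This is only an illustration, not a definitive implementation: the function name `smooth_jumps`, the parameter `max_step`, and the sample readings are made up for the example, and it assumes evenly spaced (hourly) readings with isolated bad values.

```python
def smooth_jumps(values, max_step):
    """Replace any reading that jumps more than max_step from the previous
    (already cleaned) reading with the average of its two neighbours.

    Assumptions: readings are evenly spaced, and bad readings are isolated.
    The first and last readings are left untouched because they lack a
    neighbour on one side.
    """
    cleaned = list(values)
    for i in range(1, len(cleaned) - 1):
        if abs(cleaned[i] - cleaned[i - 1]) > max_step:
            # Treat it as a sensor glitch: average the prior and next value
            cleaned[i] = (cleaned[i - 1] + cleaned[i + 1]) / 2
    return cleaned


# Hypothetical hourly temperatures in degrees F, with one glitch (120.0)
hourly_temps_f = [61.0, 62.5, 63.0, 120.0, 64.0, 63.5]
print(smooth_jumps(hourly_temps_f, max_step=30))
# [61.0, 62.5, 63.0, 63.5, 64.0, 63.5]
```

Comparing against the already cleaned previous value means the reading right after a glitch is not flagged as well. Two bad readings in a row would still leak into the average, so a real dataset may need something more careful.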

Once you are comfortable with that, use something like the standard deviation of the data over a rolling window as the threshold, instead of an absolute, arbitrary value like the one I used.
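
One possible reading of the rolling-window idea, again as a hedged sketch: flag a reading when it sits more than a few standard deviations away from the mean of the preceding window. The function name `flag_rolling_outliers`, the window size of 5, and the 3-sigma multiplier are assumptions chosen for illustration, not values from the answer.

```python
from statistics import mean, stdev


def flag_rolling_outliers(values, window=5, n_sigmas=3.0):
    """Return the indices of readings that lie more than n_sigmas sample
    standard deviations from the mean of the preceding `window` readings.

    The first `window` readings are never flagged, since they have no
    full history to compare against.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A perfectly flat history gives sigma == 0; skip it rather than
        # flag every tiny change
        if sigma > 0 and abs(values[i] - mu) > n_sigmas * sigma:
            flagged.append(i)
    return flagged


# Hypothetical hourly temperatures in degrees F, with one glitch at index 5
hourly_temps_f = [61.0, 62.5, 63.0, 62.0, 61.5, 120.0, 62.0, 63.5]
print(flag_rolling_outliers(hourly_temps_f))
# [5]
```

Note that a flagged value stays in the history for the following windows and inflates their standard deviation, so in practice you would probably replace it (for example with the neighbour average described above) before moving on.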

+1. What is an outlier and how to "fix" them very much depends on the case in point. But @CalZ approach should be pretty good for most problems. — Ricardo Cruz, Mar 24 '17 at 09:13
