10

I am trying to set up a big data infrastructure using Hadoop, Hive, and Elasticsearch (amongst others), and I would like to run some algorithms over certain datasets. I would like the algorithms themselves to be scalable, so this excludes using tools such as Weka, R, or even RHadoop. The Apache Mahout library seems to be a good option, and it features algorithms for regression and clustering tasks.

What I am struggling to find is a solution for anomaly or outlier detection.

Since Mahout features Hidden Markov Models and a variety of clustering techniques (including K-Means), I was wondering if it would be possible to build a model to detect outliers in time series using any of these. I would be grateful if somebody experienced with this could advise me on

  1. if it is possible, and in case it is
  2. how to do it, plus
  3. an estimation of the effort involved and
  4. accuracy/problems of this approach.
VividD
doublebyte
  • 1
    This is too vague to be answered. Time series are too different to just throw k-means on them and get out anything useful. It heavily depends on your data. – Has QUIT--Anony-Mousse Oct 17 '14 at 12:14
  • 1
    For outlier detection, have a look at the algorithms in ELKI. That seems to be the most complete collection of outlier detection. – Has QUIT--Anony-Mousse Dec 09 '14 at 22:37
  • The newer Elasticsearch versions have time series anomaly detection built in (I think you have to buy the X-Pack). I am not sure what algorithms they are using but it might be worth investigating an off-the-shelf solution. – tom Nov 15 '17 at 22:04

2 Answers

7

I would take a look at the t-digest algorithm. It has been merged into Mahout and is also part of some other libraries for big data streaming. You can learn more about this algorithm in particular, and about big data anomaly detection in general, from the following resources:

  1. Practical machine learning anomaly detection book.
  2. Webinar: Anomaly Detection When You Don't Know What You Need to Find
  3. Anomaly Detection in Elasticsearch.
  4. Beating Billion Dollar Fraud Using Anomaly Detection: A Signal Processing Approach using Argyle Data on the Hortonworks Data Platform with Accumulo
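
To make the idea concrete, here is a minimal sketch of quantile-threshold anomaly detection on a stream. It is not Mahout's API: the class `QuantileAnomalyDetector`, the helper `quantile`, and the parameters `q` and `warmup` are hypothetical names chosen for illustration, and an exact sorted list stands in for the t-digest (a real t-digest would give approximately the same quantile in bounded memory).

```python
import random
from bisect import insort


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    if not sorted_values:
        raise ValueError("no data")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


class QuantileAnomalyDetector:
    """Flags values above an extreme quantile of everything seen so far.

    Assumption: the exact sorted list below is a stand-in for a t-digest,
    which would answer the same quantile query approximately.
    """

    def __init__(self, q=0.999, warmup=100):
        self.q = q            # quantile used as the anomaly threshold
        self.warmup = warmup  # observations required before flagging
        self.seen = []        # kept sorted by insort

    def update(self, x):
        """Return True if x exceeds the current threshold, then record x."""
        is_anomaly = (
            len(self.seen) >= self.warmup
            and x > quantile(self.seen, self.q)
        )
        insort(self.seen, x)
        return is_anomaly


# Synthetic stream: standard normal noise followed by one obvious spike.
random.seed(0)
detector = QuantileAnomalyDetector(q=0.99, warmup=200)
stream = [random.gauss(0.0, 1.0) for _ in range(1000)] + [25.0]
flags = [detector.update(x) for x in stream]
```

Because the threshold is computed over the whole history, this sketch assumes a roughly stationary distribution; a drifting series would need some form of windowing or decay.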
prudenko
  • How does t-digest compare to the p-square algorithm? – David Marx Oct 17 '14 at 17:16
  • Thanks for the answer: this is a simple model to compute extreme quantiles, and I think it will fit my needs. However for more complex time-series that do not have a nearly stationary distribution this approach may fail, and that's when I think we would need something adaptive such as a Markov chain. – doublebyte Oct 20 '14 at 09:32
0

You can refer to my answer on Stack Exchange about the H2O anomaly detection method for R or Python, since that is scalable too.

0xF