Isolation forest: how to deal with identical values?

Question

I am trying to write my own implementation of the isolation forest algorithm. However, I don't know how to handle points that have the same value for a given feature. To illustrate the problem, suppose my dataset contains the following points: (1, 2), (3, 5), (3, 4).

We may suppose that after one iteration, we get the following split:

      root
     /     \
    /       \
   /         \
(1,2)    (3,4), (3,5)

Now, how am I supposed to deal with the right branch? If (due to randomness) we decide to split on the second feature, there is no problem because the values differ (4 and 5). But what if we decide to split on the first feature (which is 3 in both cases)? Should I repeat the random feature selection until I can split the remaining data?

No need. If the points have the same value on the feature this tree picked, some other tree may do that split for you. Suppose you have a large number of trees and, instead of (3, 4) and (3, 5), you had (3, 4) and (3, 100000). The second point would get a high score from some other tree, which would make it an outlier. — Kiritee Gak, May 29 '18 at 10:38
Thank you for your answer! Maybe you should post it as an answer, so I can tick it as solving my problem! — gagarine, May 29 '18 at 13:01

score 0 · Accepted Answer · answered May 30 '18 at 06:52

No need to use all the features. That is the nice part of having an ensemble: if one tree, built on a small subsample of the data with some subset of the features, is unable to split on a feature, some other tree will use it. If the point is a true outlier, the score computed later across all the trees will reflect that.
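
To make this concrete, here is a minimal sketch of that idea in plain Python. It is an illustration under stated assumptions, not a reference implementation: the names `build_tree`, `c_factor`, `path_length` and `mean_path_length` are made up for this example, every tree is built on the full dataset (no subsampling), and a node simply becomes a leaf when the randomly chosen feature has identical values, with no retry. The leaf adds the usual average-path-length adjustment c(n) for the n points it could not separate.

```python
import random


def c_factor(n):
    """Average path length adjustment c(n) for a leaf holding n unseparated points."""
    if n <= 1:
        return 0.0
    harmonic = sum(1.0 / i for i in range(1, n))  # H(n - 1), computed exactly
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, rng, depth=0, max_depth=8):
    """Build one isolation tree as nested dicts; leaves are {"size": n}."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:
        # Identical values on the chosen feature: no retry, this node
        # becomes a leaf and other trees are left to separate the points.
        return {"size": len(points)}
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, point):
    """Depth at which `point` lands, plus c(size) for an unresolved leaf."""
    node, depth = tree, 0
    while "size" not in node:
        branch = "left" if point[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
        depth += 1
    return depth + c_factor(node["size"])


def mean_path_length(points, point, n_trees=200, seed=0):
    """Average path length of `point` over an ensemble of random trees."""
    rng = random.Random(seed)
    trees = [build_tree(points, rng) for _ in range(n_trees)]
    return sum(path_length(t, point) for t in trees) / n_trees


if __name__ == "__main__":
    # The question's points plus the (3, 100000) point from the comment.
    data = [(1, 2), (3, 5), (3, 4), (3, 100000)]
    for p in data:
        print(p, round(mean_path_length(data, p), 3))
```

A shorter mean path length means the point is easier to isolate, so (3, 100000) should come out with a noticeably shorter path than (3, 4), even though some individual trees give up on the first feature. Retrying the feature choice among non-constant features would be a different design; the sketch above deliberately does not do that.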
