I am trying to develop my own implementation of the isolation forest algorithm. However, I don't know how to deal with points that have the same value for a given feature. To illustrate the problem, consider this example: my dataset contains the points (1, 2), (3, 5), (3, 4).
We may suppose that after one iteration, we get the following split:
            root
           /    \
          /      \
         /        \
     (1,2)      (3,4), (3,5)
Now, how am I supposed to deal with the right branch? If (due to randomness) we decide to split on the second feature, there is no problem, because the values differ (4 and 5). But what if we decide to split on the first feature, which is 3 for both points? Should I repeat the random feature selection until I can split the remaining data?
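To make the option asked about above concrete, here is a minimal sketch, not a reference implementation. Instead of retrying blindly, it draws the split feature only from those features whose values still differ within the node, and it turns the node into a leaf when no such feature exists (all remaining points are identical). The names `splittable_features`, `build_node`, `leaves`, the dict-based node layout, and the default `max_depth=8` are all assumptions made for illustration.

```python
import random


def splittable_features(points):
    """Return indices of features whose values are not all identical."""
    n_features = len(points[0])
    return [
        j for j in range(n_features)
        if min(p[j] for p in points) < max(p[j] for p in points)
    ]


def build_node(points, rng, depth=0, max_depth=8):
    """Recursively build one isolation tree as nested dicts (hypothetical layout)."""
    candidates = splittable_features(points) if len(points) > 1 else []
    # Stop when the point is isolated, when all remaining points are
    # identical (no feature can separate them), or at the depth limit.
    if not candidates or depth >= max_depth:
        return {"size": len(points)}

    feature = rng.choice(candidates)
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    threshold = rng.uniform(lo, hi)
    if threshold >= hi:  # guard against floating-point rounding up to hi
        threshold = lo

    # With lo <= threshold < hi, both sides are guaranteed to be non-empty.
    left = [p for p in points if p[feature] <= threshold]
    right = [p for p in points if p[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_node(left, rng, depth + 1, max_depth),
        "right": build_node(right, rng, depth + 1, max_depth),
    }


def leaves(node):
    """Collect all leaf nodes of a tree built by build_node."""
    if "size" in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])


data = [(1, 2), (3, 5), (3, 4)]
tree = build_node(data, random.Random(0))
print(sorted(leaf["size"] for leaf in leaves(tree)))  # [1, 1, 1]
```

For the right branch in the example, `splittable_features([(3, 4), (3, 5)])` returns `[1]`, so only the second feature can be chosen there. Exact duplicates such as `[(3, 4), (3, 4)]` end up together in one leaf of size 2, which the path-length computation would then have to account for.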
Suppose that instead of (3, 4) and (3, 5), you got (3, 4) and (3, 100000). That point would get a high score from some other tree, making it an outlier. – Kiritee Gak May 29 '18 at 10:38