🌲 Implementation of the Robust Random Cut Forest Algorithm for anomaly detection on streams
Traditional approaches to anomaly detection most often work by constructing a profile of "normal" data points, then flagging points that deviate from this profile using a distance or density metric (Liu et al. 2012). Representative methods include support vector data description (Tax and Duin 2004), the minimum covariance determinant estimator (Rousseeuw and Driessen 1999), the local outlier factor (Breunig et al. 2000), and replicator neural networks (Williams et al. 2002).
To address problems with the methods mentioned above, the Isolation Forest (IF) algorithm proposes a novel ensemble method that isolates anomalies directly, without relying on an explicit distance or density metric (Liu et al. 2012). The IF algorithm works by recursively partitioning a point set \(S\) with random axis-aligned cuts until all points are isolated in non-overlapping bounding boxes. Partitions are stored in a binary tree, and outliers are identified by their depth in that tree: because outliers are separated from the rest of the data after fewer cuts on average, points whose depth is much smaller than the average depth are more likely to be outliers.
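As a rough sketch of this idea (not the reference implementation), a single isolation "tree" can be reduced to tracking how many random axis-aligned cuts it takes to separate one target point from the rest; `isolation_depth` and `average_depth` are illustrative names, not part of any library:

```python
import random

def isolation_depth(points, target, depth=0):
    """Depth at which `target` is isolated from `points` by random
    axis-aligned cuts (a simplified single isolation tree)."""
    if len(points) <= 1:
        return depth
    dim = random.randrange(len(target))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                    # degenerate dimension: no cut possible
        return depth                # (simplification; a real tree retries)
    cut = random.uniform(lo, hi)
    # keep only the half of the partition that contains the target
    same_side = [p for p in points if (p[dim] < cut) == (target[dim] < cut)]
    if len(same_side) == len(points):   # cut landed on a boundary value
        return depth
    return isolation_depth(same_side, target, depth + 1)

def average_depth(points, target, n_trees=50):
    """Average isolation depth over an ensemble of random trees."""
    return sum(isolation_depth(points, target) for _ in range(n_trees)) / n_trees

random.seed(7)
cluster = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
data = cluster + [(10.0, 10.0)]
# the outlier is isolated at a much shallower average depth than a cluster point
shallow = average_depth(data, (10.0, 10.0))
deep = average_depth(data, cluster[0])
```

On this toy data the outlier near (10, 10) is typically cut away from the Gaussian cluster within the first few partitions, while interior points need roughly \(\log_2 n\) cuts, which is exactly the gap the depth-based score exploits.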
While the Isolation Forest algorithm has shown promising performance in detecting outliers, it suffers from a few key limitations. First, it is not designed for streaming data: points cannot be inserted into or deleted from an isolation tree once the tree has been constructed. Second, the IF algorithm is sensitive to "irrelevant dimensions," meaning that partitions are often wasted on dimensions that provide relatively little information. Finally, while tree depth has shown empirical success in detecting outliers, there is little theoretical justification for using it as an anomaly score (Guha et al. 2016). The Robust Random Cut Forest (RRCF) algorithm was devised to overcome these limitations using a novel sketching algorithm (Guha et al. 2016).
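The "robust" part of RRCF addresses the irrelevant-dimension problem directly: instead of choosing the cut dimension uniformly at random as IF does, RRCF chooses dimension \(i\) with probability proportional to the side length of the bounding box along \(i\), so near-constant dimensions are almost never cut (Guha et al. 2016). A minimal sketch of that selection rule, assuming points are stored as tuples (`choose_cut_dimension` is an illustrative name, not the library's API):

```python
import random

def choose_cut_dimension(points, rng=random):
    """Pick a cut dimension with probability proportional to the
    bounding-box side length in that dimension (the RRCF cut rule)."""
    spans = [max(p[d] for p in points) - min(p[d] for p in points)
             for d in range(len(points[0]))]
    r = rng.uniform(0, sum(spans))
    for d, span in enumerate(spans):
        r -= span
        if r <= 0:
            return d
    return len(spans) - 1   # guard against floating-point round-off
```

Because the cut distribution depends only on the bounding box, trees built this way can also be updated incrementally; the kLabUM/rrcf package builds on this to support streaming insertion and deletion of points in existing trees.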
Liu, F. T., Ting, K. M., Zhou, Z.-H., 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), 3.
Tax, D. M., Duin, R. P., 2004. Support vector data description. Machine Learning 54 (1), 45–66.
Rousseeuw, P. J., Driessen, K. V., 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 (3), 212–223.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J., 2000. LOF: Identifying density-based local outliers. SIGMOD Record 29 (2), 93–104.
Williams, G., Baxter, R., He, H., Hawkins, S., Gu, L., 2002. A comparative study of RNN for outlier detection in data mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002). IEEE, pp. 709–712.
Guha, S., Mishra, N., Roy, G., Schrijvers, O., 2016. Robust random cut forest based anomaly detection on streams. In: International Conference on Machine Learning. pp. 2712–2721.