rrcf

🌲 Implementation of the Robust Random Cut Forest Algorithm for anomaly detection on streams

View the Project on GitHub kLabUM/rrcf

Theory

    â€¢ Related work and motivation
    â€¢ Tree construction
    â€¢ Insertion and deletion of points
    â€¢ Anomaly scoring

Basics

    â€¢ RCTree data structure
    â€¢ Modifying the RCTree
    â€¢ Measuring anomalies
    â€¢ API documentation
    â€¢ Caveats and gotchas

Examples

    â€¢ Batch detection
    â€¢ Streaming detection
    â€¢ Analyzing taxi data
    â€¢ Classification
    â€¢ Comparison of methods

Caveats and gotchas

Scaling of dimensions

The RRCF algorithm considers the relative scale of each dimension when constructing robust random cut trees. This means that dimensions with less variability (on an absolute scale) will affect the outlier score of a point less than dimensions with higher variability.

This consideration is important to remember if each dimension represents a different categorical property or is measured with a different set of units. Consider, for example the following dataset.

Person Height (in) Weight (lb) Age (yr)
Alice 61 105 34
Bob 70 300 50
Timmy 48 70 10
Nosferatu 75 180 170

In this case, Weight will influence the outlier score most, because the range between the maximum and minimum values is largest (300 - 70 = 230). However, looking at the table, age seems like the most intuitive category for determining the outlier (in this case, Nosferatu is more than three times as old as the second-oldest person).

In cases where each column is measured in different units, or measures a different type of quantity, it may be necessary to scale each column before constructing the random cut tree. For example, min-max scaling each column between zero and one yields:

Person Height (-) Weight (-) Age (-)
Alice 0.48 0.15 0.15
Bob 0.81 1.0 0.25
Timmy 0.0 0.0 0.0
Nosferatu 1.0 0.48 1.0

Other scaling methods may suit other datasets better (for instance, scaling each dimension to a mean of zero and a standard deviation of one). The user should experiment with different scalings to determine the method that works best for the task at hand.