rrcf

🌲 Implementation of the Robust Random Cut Forest Algorithm for anomaly detection on streams

View the Project on GitHub kLabUM/rrcf

Theory

    â€¢ Related work and motivation
    â€¢ Tree construction
    â€¢ Insertion and deletion of points
    â€¢ Anomaly scoring

Basics

    â€¢ RCTree data structure
    â€¢ Modifying the RCTree
    â€¢ Measuring anomalies
    â€¢ API documentation
    â€¢ Caveats and gotchas

Examples

    â€¢ Batch detection
    â€¢ Streaming detection
    â€¢ Analyzing taxi data
    â€¢ Classification
    â€¢ Comparison of methods

Anomaly score

The likelihood that a point is an outlier is measured by its collusive displacement (CoDisp): if including a new point significantly changes the model complexity (i.e. bit depth), then that point is more likely to be an outlier.

Computing the anomaly score using the codisp method

The codisp method is used to compute the collusive displacement for a point. The codisp method takes the index label of the point as an argument and returns the collusive displacement.

First, let’s construct an RCTree with 100 normally-distributed 2-dimensional points.

import numpy as np
import rrcf

# Seed tree with zero-mean, normally distributed data
X = np.random.randn(100,2)
tree = rrcf.RCTree(X)

Next, generate an inlier and outlier point, and insert them into the tree with the labels inlier and outlier:

# Generate an inlier and outlier point
inlier = np.array([0, 0])
outlier = np.array([4, 4])

# Insert into tree
tree.insert_point(inlier, index='inlier')
tree.insert_point(outlier, index='outlier')

Finally, compute the anomaly score by calling the codisp method on each point’s index label.

tree.codisp('inlier')
>>> 1.75
tree.codisp('outlier')
>>> 39.0

Note that the codisp of the outlier point is significantly larger.

Computing codisp over a random forest

To make the codisp metric more robust, we can compute the codisp for each point over a forest (distribution) of randomly-constructed RCTrees.

First, let’s create a sample of 1000 points where the first 990 points are inliers distributed according to \( x_i \sim \mathcal{N}([0, 0]^T, I_{2 \times 2})\) and the last ten points are outliers with \( x_i \sim \mathcal{N}([4, 4]^T, I_{2 \times 2})\).

import numpy as np
import pandas as pd
import rrcf

# Specify sample parameters
n = 1000
d = 2
num_outliers = 10

# Seed tree with zero-mean, normally distributed data
X = np.random.randn(n,d)
# Set the last ten points as outliers
X[-num_outliers:, :] += 4

To create a random forest, we simply create a list of RCTrees, with each RCTree constructed from a random sample of the input dataset. Let’s create a random forest with 100 trees, each containing 128 points from the original sample.

# Construct forest
forest = []

# Specify forest parameters
num_trees = 100
tree_size = 128
sample_size_range = (n // tree_size, tree_size)

while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=sample_size_range,
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix)
             for ix in ixs]
    forest.extend(trees)

Finally, to determine outliers we compute the average codisp over all trees for each point in the original sample.

# Compute average CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(n))
index = np.zeros(n)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index

Now, let’s see the average codisp for each set of points.

For the inlier points:

avg_codisp[:-10].mean()
>>> 5.267

And for the outlier points:

avg_codisp[-10:].mean()
>>> 40.681

Note that the outlier points have larger codisp. To classify the original points into inlier and outlier classes, we can perform a simple threshold test on the codisp result. For example:

avg_codisp > avg_codisp.quantile(0.99)