异常检测孤立森林（isolation Forest）iForest算法理论讲解及python实战

一、原理

孤立森林（Isolation Forest，简称 iForest）是一种无监督学习算法，用于识别异常值。

其基本原理可以概括为一句话：异常数据由于数量较少且与正常数据差异较大，因此在被隔离时需要较少的步骤。

有两个假设：

异常的值是非常少的（如果异常值很多，可能被识别为正常的）
异常值与其他值的差异较大（这点也可以引出主要是全局上都为异常的异常，局部小异常可能发现不了，因为差异并不大）

二、具体流程

2.1 训练森林

子采样: 首先从整个数据集中随机抽取一定数量的样本来为构建树做准备。这些抽样的子集大小通常远小于原始数据集的大小，这样可以限制树的大小，并且减少计算复杂度。
构建孤立树 (iTrees): 对于每个子采样集，算法构建一棵孤立树。构建孤立树的过程是递归的。在每个节点，算法随机选择一个特征，并在该特征的最大值和最小值之间随机选择一个分割值。然后，数据根据这个分割值将样本分到左子树或右子树（这里其实就是简单的将样本中特征小于这个分割点的样本分到左边，其次分到右边）。这个过程的结束条件：树达到限定的高度，节点中的样本数量到一定的数目，或者所有样本的所选特征值都是同一个值。
森林构建: 重复1-2构建完特定数量的孤立树，集合为孤立森林。

2.2 检测阶段

2.2.1 首先要明确几个相关概念

路径长度（h(x)）: 指样本通过该孤立树构建阶段的特征选择方式，从树的根节点到达该样本被孤立的节点（被孤立就是意味着这个样本最终到达的树的叶子节点）所需要的边数。
平均路径长度E(h(x)): 该样本在森林中所有树的路径长度的平均值。
树的平均路径长度： H(i)是调和数，可以近似为ln(i) + 0.5772156649，其中n为样本个数，对于给定的数据集大小n，平均路径长度的期望是一个常数，该公式提供了一个标准化的基准，用于将路径长度标准化
异常分数：对于每个样本x，其异常分数s(x, n)的计算公式如下:

其中，E(h(x))是数据点x在所有树中路径长度的平均值，n 是训练数据的样本量，c(n)是树的平均路径长度，用于标准化。

2.2.2 异常分数

score的范围为【0，1】

E(h(x)) 约等于 c(n)，样本点的路径长度和平均路径长度没啥差别，看不出是否有异常
E(h(x)) 越靠近0， score越接近1，说明样本很容易被孤立，更可能是异常的
E(h(x)) 的极端情况就是为样本的n-1，这时候score很小，不容易被孤立，也看不出是否为异常

2.2.3 检测过程

路径长度计算: 对于新的测试样本计算它们在每棵孤立树中的路径长度, 并算平均路径长度。
计算异常分数: 利用上述计算，孤立森林会计算每个数据点的异常分数。
判定异常: 根据计算出的异常分数，可以设置一个阈值来判定哪些数据点是异常的。

三、优缺点

3.1 优点

高效性: 孤立森林算法特别适合处理大数据集。它具有线性的时间复杂度，并且由于使用了子采样，使得在计算上更加高效。
易于并行化: 和随机森林一样，构建孤立树是独立的过程，构建森林可以并行化

3.2 缺点

异常值比例敏感性: 如果数据集中异常值的比例相对较高，孤立森林的效果可能就会下降，因为它是基于异常值“少而不同”的假设。
对局部异常检测不敏感：因为少而不同的前提条件决定主要解决全局异常的特点，对在局部区域表现出轻微异常特征的点检测不是很敏感
不适用于多维特征情况：iforest不会因为特征的多少而降低算法的效率，但也正因为每次只随机用其中一个特征作为分割的特征，如果特征维度很高，就会有很多特征没有用到。

四、代码实战

具体参数配置参考：https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

4.1 准备数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

n_samples, n_outliers = 120, 10
rng = np.random.RandomState(0)
cluster_1 = 0.4 * rng.randn(n_samples, 2) + np.array([2, 2])
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones(n_outliers, dtype=int)]
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("data distribution")
plt.show()

4.2 模型

# 通过y的比例进行均匀拆分
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

# 得到score，Negative scores represent outliers, positive scores represent inliers
y_pre_score_test = clf.decision_function(cluster_1)
print(y_pre_score_test)

# -1为异常， 1为正常， 可以看下这个函数内部，就是score<0的设为-1
y_pre_label_test = clf.predict(cluster_1) 
print(y_pre_label_test)

4.3 可视化边缘

主要利用plt.contourf来展示，一个范围内的，iforest的score分布情况

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html

# 通过网格的方式得到location的x和y坐标
xx, yy = np.meshgrid(np.linspace(-6, 6, 60), np.linspace(-6, 6, 60))
# concat x和y 得到输入的坐标
input_location = np.c_[xx.ravel(), yy.ravel()]

Z = clf.decision_function(input_location)
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, camp=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green', s=20, edgecolor='k')
c = plt.scatter(outliers[:, 0], outliers[:, 1], c='red', s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend([b1, b2, c],
           ["train data",
            "test data",
            "outlier"],
           loc="best")
plt.show()

4.4 全部代码

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

n_samples, n_outliers = 120, 10
rng = np.random.RandomState(0)
cluster_1 = 0.4 * rng.randn(n_samples, 2) + np.array([2, 2])
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones(n_outliers, dtype=int)]
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("data distribution")
plt.show()

# 通过y的比例进行均匀拆分
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

# 得到score，Negative scores represent outliers, positive scores represent inliers
y_pre_score_test = clf.decision_function(cluster_1)
print(y_pre_score_test)

# -1为异常， 1为正常， 可以看下这个函数内部，就是score<0的设为-1
y_pre_label_test = clf.predict(cluster_1)
print(y_pre_label_test)

# 通过网格的方式得到location的x和y坐标
xx, yy = np.meshgrid(np.linspace(-6, 6, 60), np.linspace(-6, 6, 60))
# concat x和y 得到输入的坐标
input_location = np.c_[xx.ravel(), yy.ravel()]

Z = clf.decision_function(input_location)
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, camp=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green', s=20, edgecolor='k')
c = plt.scatter(outliers[:, 0], outliers[:, 1], c='red', s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend([b1, b2, c],
           ["train data",
            "test data",
            "outlier"],
           loc="best")
plt.show()

主要参考：https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py