Outlier detection using Local Outlier Factor (LOF)

What is an Outlier?

In everyday language, an outlier is a person or thing situated away or detached from the main body or system. In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

How to detect an outlier?

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors.

Local Outlier:

When a point is considered an outlier relative to its local neighboring points, it is called a local outlier.

LOF identifies an outlier by considering the density of its neighborhood. It is useful when the dataset does not have the same density throughout.

Prerequisites to understand LOF

  • K-distance and K-neighbors
  • Reachability distance (RD)
  • Local reachability density (LRD)
  • Local Outlier Factor (LOF)

K-distance and K-neighbors

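K-distance is the distance between a point and its Kᵗʰ nearest neighbor. The set of points that lie within a circle of radius K-distance centered on the point is called its K-neighbors (or K-neighborhood). For a point A, this set is denoted by Nk(A). It can contain more than K points when several points are tied at the K-distance.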
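As a minimal sketch of these two definitions, the snippet below computes the K-distance and the K-neighbors of one point in plain Python. The helper names (`manhattan`, `k_distance`, `k_neighbors`) and the sample points are made up for illustration, and Manhattan distance is just one possible choice of distance measure.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def k_distance(points, name, k, dist=manhattan):
    """Distance from points[name] to its k-th nearest neighbor."""
    others = sorted(dist(points[name], p) for n, p in points.items() if n != name)
    return others[k - 1]


def k_neighbors(points, name, k, dist=manhattan):
    """All other points within the K-distance of points[name] (ties included)."""
    radius = k_distance(points, name, k, dist)
    return {n for n, p in points.items()
            if n != name and dist(points[name], p) <= radius}


# Hypothetical sample points, chosen only to illustrate the definitions.
sample = {"P": (0, 0), "Q": (2, 0), "R": (0, 3), "S": (5, 5)}

print(k_distance(sample, "P", 2))           # 3 (distance from P to R)
print(sorted(k_neighbors(sample, "P", 2)))  # ['Q', 'R']
```

Because ties are kept, `k_neighbors` can return more than `k` names when several points sit exactly at the K-distance.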

Reachability distance (RD)

The reachability distance from Xi to Xj is defined as the maximum of the K-distance of Xj and the distance between Xi and Xj. The distance measure is problem-specific (Euclidean, Manhattan, etc.).

In layman's terms, if a point Xi lies within the K-neighbors of Xj, the reachability distance will be the K-distance of Xj (blue line); otherwise, the reachability distance will be the distance between Xi and Xj (orange line).
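Written as a formula, the definition reads as follows. The symbol d(Xi, Xj) for the chosen distance measure and the subscript K are notation picked here for illustration:

```latex
\mathrm{RD}_K(X_i, X_j) = \max\bigl(K\text{-distance}(X_j),\; d(X_i, X_j)\bigr)
```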

Local reachability density (LRD)

LRD is the inverse of the average reachability distance of A from its neighbors. Intuitively, the larger the average reachability distance (i.e., the farther the neighbors are from the point), the lower the density of points around that point. LRD therefore tells how far a point is from the nearest cluster of points: a low LRD value implies that the closest cluster is far from the point.
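As a formula, with RD denoting the reachability distance from A to a neighbor Xj and ||N_K(A)|| the number of points in the K-neighbors of A (the notation is chosen here to match the definitions in the text):

```latex
\mathrm{LRD}_K(A) = \frac{1}{\dfrac{\sum_{X_j \in N_K(A)} \mathrm{RD}_K(A, X_j)}{\lVert N_K(A) \rVert}}
```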

Local Outlier Factor (LOF)

The LRD of each point is compared with the average LRD of its K neighbors. LOF is the ratio of the average LRD of the K neighbors of A to the LRD of A.
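In formula form, with N_K(A) the K-neighbors of A and ||N_K(A)|| their count (again, notation chosen here to match the text):

```latex
\mathrm{LOF}_K(A) = \frac{\dfrac{\sum_{X_j \in N_K(A)} \mathrm{LRD}_K(X_j)}{\lVert N_K(A) \rVert}}{\mathrm{LRD}_K(A)}
```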

Intuitively, if the point is not an outlier (i.e., it is an inlier), the average LRD of its neighbors is approximately equal to the LRD of the point (because the density around the point and around its neighbors is roughly equal). In that case, LOF is close to 1. On the other hand, if the point is an outlier, the LRD of the point is less than the average LRD of its neighbors, so the LOF value will be high.

Generally, a point with LOF > 1 is considered an outlier, but that is not always true. Suppose we know that the data contains only one outlier: we then take the maximum of all the LOF values, and the point with that maximum LOF value is considered the outlier.
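The "take the maximum" rule extends naturally to ranking points by LOF and keeping the top few. The sketch below assumes the LOF values have already been computed; the scores and the helper name `top_outliers` are hypothetical, made up only to show the selection step.

```python
def top_outliers(lof_scores, n=1):
    """Return the names of the n points with the largest LOF values."""
    ranked = sorted(lof_scores, key=lof_scores.get, reverse=True)
    return ranked[:n]


# Hypothetical LOF values for five points (illustrative, not from real data).
scores = {"p1": 0.98, "p2": 1.05, "p3": 1.02, "p4": 3.40, "p5": 1.10}

print(top_outliers(scores, n=1))  # ['p4']

# A fixed cut-off of LOF > 1 would also flag p2, p3 and p5 here,
# even though their LOF values are very close to 1.
print([name for name, s in scores.items() if s > 1])  # ['p2', 'p3', 'p4', 'p5']
```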

EXAMPLE

Consider 4 points: A(0,0), B(1,0), C(1,1), and D(0,3), with K=2. We will use LOF to detect one outlier among these 4 points.

Following the procedure discussed above:

First, calculate the K-distance of each point, the distance between each pair of points, and the K-neighborhood of each point, with K=2. We will use Manhattan distance as the distance measure.
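For reference, the Manhattan distance between two points is the sum of the absolute differences of their coordinates. Worked out from the coordinates of A, B, C, and D, the six pairwise distances are:

```latex
\begin{aligned}
d(A,B) &= |0-1| + |0-0| = 1 \\
d(A,C) &= |0-1| + |0-1| = 2 \\
d(A,D) &= |0-0| + |0-3| = 3 \\
d(B,C) &= |1-1| + |0-1| = 1 \\
d(B,D) &= |1-0| + |0-3| = 4 \\
d(C,D) &= |1-0| + |1-3| = 3
\end{aligned}
```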

K-distance(A): C is the 2nd nearest neighbor of A, so K-distance(A) = distance(A,C) = 2
K-distance(B): A and C are tied as the nearest neighbors of B, so K-distance(B) = distance(B,A) = distance(B,C) = 1
K-distance(C): A is the 2nd nearest neighbor of C, so K-distance(C) = distance(C,A) = 2
K-distance(D): A and C are tied as the nearest neighbors of D, so K-distance(D) = distance(D,A) = distance(D,C) = 3

K-neighborhood(A) = {B, C}, ||N2(A)|| = 2
K-neighborhood(B) = {A, C}, ||N2(B)|| = 2
K-neighborhood(C) = {B, A}, ||N2(C)|| = 2
K-neighborhood(D) = {A, C}, ||N2(D)|| = 2

K-distance, the distance between each pair of points, and K-neighborhood will be used to calculate LRD.

LRD for each point A, B, C, and D
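Applying the definitions, with the reachability distance RD(Xi, Xj) taken as the maximum of K-distance(Xj) and the Manhattan distance between Xi and Xj, the LRD values work out as follows (a worked computation from the K-distances and K-neighborhoods listed above):

```latex
\begin{aligned}
\mathrm{LRD}_2(A) &= \frac{1}{\tfrac{1}{2}\bigl(\mathrm{RD}(A,B) + \mathrm{RD}(A,C)\bigr)} = \frac{1}{\tfrac{1}{2}(1 + 2)} = \tfrac{2}{3} \approx 0.667 \\
\mathrm{LRD}_2(B) &= \frac{1}{\tfrac{1}{2}\bigl(\mathrm{RD}(B,A) + \mathrm{RD}(B,C)\bigr)} = \frac{1}{\tfrac{1}{2}(2 + 2)} = \tfrac{1}{2} = 0.5 \\
\mathrm{LRD}_2(C) &= \frac{1}{\tfrac{1}{2}\bigl(\mathrm{RD}(C,B) + \mathrm{RD}(C,A)\bigr)} = \frac{1}{\tfrac{1}{2}(1 + 2)} = \tfrac{2}{3} \approx 0.667 \\
\mathrm{LRD}_2(D) &= \frac{1}{\tfrac{1}{2}\bigl(\mathrm{RD}(D,A) + \mathrm{RD}(D,C)\bigr)} = \frac{1}{\tfrac{1}{2}(3 + 3)} = \tfrac{1}{3} \approx 0.333
\end{aligned}
```

D has the lowest LRD, which already hints that it sits in the sparsest region.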

Local reachability density (LRD) will be used to calculate the Local Outlier Factor (LOF).

LOF for each point A, B, C, and D
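With LRD(A) = LRD(C) = 2/3, LRD(B) = 1/2, and LRD(D) = 1/3 (the values obtained from the reachability distances in this example), dividing the average LRD of each point's two neighbors by the point's own LRD gives:

```latex
\begin{aligned}
\mathrm{LOF}_2(A) &= \frac{\tfrac{1}{2}\bigl(\mathrm{LRD}(B) + \mathrm{LRD}(C)\bigr)}{\mathrm{LRD}(A)} = \frac{\tfrac{1}{2}\bigl(\tfrac{1}{2} + \tfrac{2}{3}\bigr)}{\tfrac{2}{3}} = \tfrac{7}{8} = 0.875 \\
\mathrm{LOF}_2(B) &= \frac{\tfrac{1}{2}\bigl(\mathrm{LRD}(A) + \mathrm{LRD}(C)\bigr)}{\mathrm{LRD}(B)} = \frac{\tfrac{1}{2}\bigl(\tfrac{2}{3} + \tfrac{2}{3}\bigr)}{\tfrac{1}{2}} = \tfrac{4}{3} \approx 1.333 \\
\mathrm{LOF}_2(C) &= \frac{\tfrac{1}{2}\bigl(\mathrm{LRD}(B) + \mathrm{LRD}(A)\bigr)}{\mathrm{LRD}(C)} = \frac{\tfrac{1}{2}\bigl(\tfrac{1}{2} + \tfrac{2}{3}\bigr)}{\tfrac{2}{3}} = \tfrac{7}{8} = 0.875 \\
\mathrm{LOF}_2(D) &= \frac{\tfrac{1}{2}\bigl(\mathrm{LRD}(A) + \mathrm{LRD}(C)\bigr)}{\mathrm{LRD}(D)} = \frac{\tfrac{1}{2}\bigl(\tfrac{2}{3} + \tfrac{2}{3}\bigr)}{\tfrac{1}{3}} = 2
\end{aligned}
```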

The highest LOF among the four points is LOF(D). Therefore, D is the outlier.
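The whole procedure for this example can be checked with a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the function names (`manhattan`, `lof_scores`) are made up here, ties at the K-distance are kept in the neighborhood, and duplicate points (which would give a zero reachability distance) are not handled.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def lof_scores(points, k, dist=manhattan):
    """Return a dict mapping each point name to its LOF value."""
    names = list(points)

    # Step 1: K-distance and K-neighborhood of every point.
    k_dist, neighbors = {}, {}
    for a in names:
        dists = sorted(dist(points[a], points[b]) for b in names if b != a)
        k_dist[a] = dists[k - 1]
        neighbors[a] = [b for b in names
                        if b != a and dist(points[a], points[b]) <= k_dist[a]]

    # Step 2: reachability distance of a from b.
    def reach_dist(a, b):
        return max(k_dist[b], dist(points[a], points[b]))

    # Step 3: LRD = inverse of the average reachability distance.
    lrd = {a: len(neighbors[a]) / sum(reach_dist(a, b) for b in neighbors[a])
           for a in names}

    # Step 4: LOF = average LRD of the neighbors divided by the point's own LRD.
    return {a: sum(lrd[b] for b in neighbors[a]) / len(neighbors[a]) / lrd[a]
            for a in names}


points = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 3)}
lof = lof_scores(points, k=2)

print({name: round(value, 3) for name, value in lof.items()})
# {'A': 0.875, 'B': 1.333, 'C': 0.875, 'D': 2.0}
print(max(lof, key=lof.get))  # D
```

Note that B also has LOF > 1 here, which illustrates why a fixed cut-off of 1 is not a reliable rule on its own.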

ADVANTAGES OF LOF

LOF will consider a point an outlier even if it lies only a small distance from an extremely dense cluster. A global approach may not consider that point an outlier, but LOF can effectively identify such local outliers.

DISADVANTAGES OF LOF

Since LOF is a ratio, it is tough to interpret. There is no specific threshold value above which a point is defined as an outlier. The identification of an outlier is dependent on the problem and the user.

CONCLUSION

Local Outlier Factor (LOF) values identify an outlier based on its local neighborhood. LOF can give better results than a global approach when the data has regions of different density. Since there is no fixed threshold value of LOF, the selection of a point as an outlier is user-dependent.

REFERENCES

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying density-based local outliers. ACM SIGMOD Record, 29(2), 93–104.
