R Statistics Blog

Data Science From an R Programmer’s Point of View

Analysing Outliers

An outlier is a value that does not follow the usual pattern of the data. Outliers pose a particular challenge for almost all statistical methods, so it is important to identify and treat them. Let’s see which packages and functions in R can be used to deal with outliers.

Overview

The presence of outliers in a dataset can be the result of an error, or they can be genuine values arising from the true distribution of the data. In either case, it is the analyst’s responsibility to understand such values and ensure that they are treated properly.

Things You Will Master

  1. Using the Tukey formula
  2. Using histograms and boxplots
  3. Using Cook’s distance
  4. Using DBSCAN - A machine learning approach
  5. Possible actions towards outliers

Using the Tukey formula

This formula uses quartiles to produce an upper and a lower threshold; any value beyond these thresholds is considered an outlier.

Upper threshold = 3rd Quartile + 1.5 * IQR
Lower threshold = 1st Quartile - 1.5 * IQR

This is the same rule the boxplot uses: all the points drawn beyond the whiskers lie above or below these thresholds. In the formula above, IQR stands for the interquartile range, which is the difference between the 3rd quartile and the 1st quartile.
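The thresholds can be computed directly in base R with quantile(). A minimal sketch, using iris$Sepal.Width as example data:

```r
# Tukey fences for iris$Sepal.Width
x <- iris$Sepal.Width
q <- quantile(x, probs = c(0.25, 0.75))   # 1st and 3rd quartiles
iqr <- unname(q[2] - q[1])                # same as IQR(x)
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
x[x < lower | x > upper]                  # values flagged as outliers
```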

Using histograms and boxplots

In a histogram, values beyond ±2 standard deviations from the mean are commonly treated as outliers. However, this is not a rigid rule; one can modify the cutoff as needed.

# Building a histogram
hist(iris$Sepal.Width, 
     col = "blue",
     main = "Histogram",
     xlab = "Sepal width of flowers")
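The ±2 standard deviation rule mentioned above can also be checked numerically rather than visually; a simple sketch on the same column:

```r
# Flagging values beyond mean +/- 2 standard deviations
x <- iris$Sepal.Width
m <- mean(x)
s <- sd(x)
x[abs(x - m) > 2 * s]   # candidate outliers under the 2-SD rule
```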

Histogram to identify outliers

Similarly, one can draw a boxplot to visually check for outliers.

# Building a box plot
library(ggplot2)
ggplot(iris, aes(x = "", y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 16,
               outlier.size = 2, notch = FALSE)

boxplot
In the above plot, the red dots represent the outlier values. We can also take a deeper plunge by looking at the boxplot by flower species. This should help us explore whether the outliers come from a particular group or not.

# Building a box plot
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 16,
               outlier.size = 2, notch = FALSE)

boxplot by group variable

A few flowers in setosa and virginica have unusual sepal widths, whereas the versicolor species has no outliers.

Using Cook’s distance

Cook’s distance is a multivariate method used to identify influential observations when running a regression analysis. For each observation it combines the leverage and the residual into a single distance measure, reflecting how much the fitted model changes when that observation is removed.

Higher values of Cook’s distance indicate potential outliers. The following rules of thumb are commonly used:

  • An observation is considered an outlier if its Cook’s distance is more than three times the mean Cook’s distance.
  • Another rule states that all observations with a Cook’s distance higher than 4/n (where n is the total number of observations) must be investigated.

# Detecting outliers in the mtcars dataset
mod <- glm(mpg ~ ., data = mtcars)
mtcars$cooksd <- cooks.distance(mod)

# Defining outliers based on 4/n criteria
mtcars$outlier <- ifelse(mtcars$cooksd < 4/nrow(mtcars), "keep","delete")

# Inspecting the dataset
mtcars[5:12, ]
#Output
                  mpg cyl  disp  hp drat   wt  qsec vs am gear carb        cooksd outlier
Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2 0.00408313839    keep
Valiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1 0.03697080496    keep
Duster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4 0.00006907413    keep
Merc 240D         24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2 0.03454201049    keep
Merc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2 0.37922064047  delete
Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4 0.00428238516    keep
Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4 0.02404481831    keep
Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3 0.04013728968    keep

Merc 230 is detected as an outlier by Cook’s distance: its value of 0.379 exceeds the 4/n threshold of 4/32 = 0.125.
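The first rule of thumb (three times the mean Cook’s distance) can be applied in the same way; a sketch on a fresh copy of mtcars, so the columns added above are left untouched:

```r
# Flagging observations whose Cook's distance exceeds 3x the mean
dat <- datasets::mtcars
mod3 <- glm(mpg ~ ., data = dat)
cd <- cooks.distance(mod3)
rownames(dat)[cd > 3 * mean(cd)]   # influential rows under this rule
```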

Using DBSCAN - A machine learning approach

DBSCAN is a density-based clustering algorithm that separates dense regions of the data from sparse ones; points that do not fall into any dense cluster are labelled as noise, i.e., outliers. DBSCAN identifies collective outliers, so as a rule of thumb one should ensure that no more than about 5% of the values end up flagged as outliers when tuning the algorithm.

library(fpc)
# Compute DBSCAN on the first two columns (Sepal.Length, Sepal.Width)
set.seed(86)
db <- fpc::dbscan(iris[, -c(5, 4, 3)], eps = 0.20, MinPts = 3)
# Plot DBSCAN results; noise points (cluster 0) are the outliers
plot(db, iris[, -c(5, 4, 3)], main = "DBSCAN", frame = FALSE)

Using DBSCAN to identify outliers
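In the fpc result, the cluster labels are stored in db$cluster, and points labelled 0 are the noise points DBSCAN treats as outliers. A minimal sketch of extracting them, reusing the same two columns:

```r
library(fpc)

set.seed(86)
db <- fpc::dbscan(iris[, 1:2], eps = 0.20, MinPts = 3)
# Cluster label 0 marks noise points, i.e. the outliers
outliers <- iris[db$cluster == 0, 1:2]
nrow(outliers)   # number of flagged observations
```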

Optimal value of eps

eps is the maximum distance between two points for them to be considered neighbours; based on this distance, the algorithm decides whether a point is an outlier or not. It is therefore essential to identify the optimal eps value in order to make the correct decision. This can be achieved using the code below.

library(dbscan)
dbscan::kNNdistplot(iris[, -c(5, 4, 3)], k = 4)
abline(h = 0.15, lty = 2)  # candidate eps value

The point where the curve shows an elbow-like bend corresponds to the optimal eps value; equivalently, the height at which the dotted line crosses the solid curve can be taken as optimal.

optimal eps value

Possible actions towards outliers

The possible actions we can take on outliers are mentioned below:

1. Capping And Flooring - We replace every value that lies beyond the Tukey thresholds with the threshold value itself. Sealing values on the higher side is called capping, and the same action on the lower side is called flooring.
2. Delete Outliers - Another solution is to delete all the values which are unusual and do not represent the major chunk of the data.
3. Model Outliers - In certain cases you may find that outliers form a significant percentage of the total data. In such cases, you can separate out all the outliers and build a separate model for these values.
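Capping and flooring with the Tukey thresholds can be sketched as follows, again using iris$Sepal.Width as example data:

```r
# Capping/flooring iris$Sepal.Width at the Tukey thresholds
x <- iris$Sepal.Width
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
lower <- unname(q[1]) - 1.5 * iqr   # flooring threshold
upper <- unname(q[2]) + 1.5 * iqr   # capping threshold
x_treated <- pmin(pmax(x, lower), upper)
range(x_treated)   # all values now lie within the thresholds
```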

Closing Note

In this chapter, we learned different statistical algorithms and methods which can be used to identify outliers. These included both univariate and multivariate techniques. We also learned what possible actions a data scientist can take when the data has outliers. In the next chapter, we will learn how to train linear regression models in R and validate them before using them for scoring.
Last updated on 4 Jan 2019 / Published on 17 Oct 2017