# Analysing Outliers

An outlier is a value that does not follow the usual pattern of the data. Outliers pose a particular challenge for almost all statistical methods, so it is important to identify and treat them. Let’s see which packages and functions can be used in R to deal with outliers.

## Overview

The presence of outliers in a dataset can be the result of an error, or the outliers can be real values that reflect the true distribution of the data. In either case, it is the analyst’s responsibility to understand such values and ensure they are treated properly.

### Things You Will Master

- Using Tukey’s formula
- Using histograms and boxplots
- Using Cook’s distance
- Using DBSCAN - A machine learning approach
- Possible actions towards outliers

## Using Tukey’s formula

This formula uses quartiles to produce an upper and a lower threshold; all values beyond these thresholds are considered outliers.

Upper threshold = 3rd Quartile + 1.5 * IQR

Lower threshold = 1st Quartile - 1.5 * IQR

The boxplot uses the same formula: the points drawn beyond the whiskers lie above or below these threshold values. In the formula above, IQR stands for interquartile range, which is the difference between the 3rd Quartile and the 1st Quartile.
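As a minimal sketch, the two thresholds can be computed directly in base R. The variable names (`lower`, `upper`, `tukey_outliers`) are illustrative, and `Sepal.Width` from the built-in `iris` data is used only as an example column.

```r
# Tukey thresholds for one numeric vector (example column: iris$Sepal.Width)
x <- iris$Sepal.Width

q1 <- quantile(x, 0.25, names = FALSE)  # 1st quartile
q3 <- quantile(x, 0.75, names = FALSE)  # 3rd quartile
iqr <- q3 - q1                          # interquartile range, same as IQR(x)

lower <- q1 - 1.5 * iqr                 # lower threshold
upper <- q3 + 1.5 * iqr                 # upper threshold

# Values falling outside the two thresholds
tukey_outliers <- x[x < lower | x > upper]
tukey_outliers
```

The multiplier 1.5 is the conventional choice; a larger value makes the rule flag fewer points.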

## Using histograms and boxplots

In a histogram, values lying more than ± 2 standard deviations from the mean are commonly flagged as outliers. This is not a rigid rule, however, and the cutoff can be adjusted as needed.

```
# Building a histogram
hist(iris$Sepal.Width,
     col = "blue",
     main = "Histogram",
     xlab = "Sepal width of flowers")
```
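The histogram only shows the shape of the data; the standard-deviation rule itself can be applied numerically. The following is a rough sketch, where the cutoff `k` and the variable names are assumptions chosen for illustration.

```r
# Flag values lying more than k standard deviations from the mean
x <- iris$Sepal.Width

centre <- mean(x)
spread <- sd(x)
k <- 2                                  # cutoff; adjust as needed

is_outlier <- abs(x - centre) > k * spread

# Values flagged by the rule
x[is_outlier]
```

Raising `k` to 3 gives a stricter rule that flags fewer values.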

Similarly, one can draw a boxplot to visually check for outliers.

```
# Building a box plot
library(ggplot2)
ggplot(iris, aes(x = "", y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 16,
               outlier.size = 2, notch = FALSE)
```

```
# Building a box plot for each species
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 16,
               outlier.size = 2, notch = FALSE)
```

A few flowers in the setosa and virginica species have an unusual sepal width, whereas the versicolor species has no outliers.
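The per-species reading of the plot can be cross-checked numerically. This sketch applies the quartile-based rule (1.5 * IQR beyond the quartiles) within each species; the helper `tukey_outliers()` is a hypothetical function written for this example, not part of any package.

```r
# Hypothetical helper: values outside quartile +/- 1.5 * IQR
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Apply the helper to Sepal.Width within each species
outliers_by_species <- lapply(split(iris$Sepal.Width, iris$Species),
                              tukey_outliers)
outliers_by_species
```

An empty result for a species means no value crossed either threshold in that group.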

## Using Cook’s distance

**Cook’s distance** is a multivariate method used to identify outliers while running a regression analysis. For each observation, it measures how much the fitted model changes when that observation is left out, using a distance measure that combines the observation’s leverage with its residual.
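To make the combination of leverage and residual concrete, the sketch below rebuilds Cook’s distance by hand for an ordinary linear model and compares it with R’s `cooks.distance()`. It assumes a model fitted with `lm()` on a fresh copy of `mtcars`; the names `cars_data` and `cooks_manual` are illustrative.

```r
# Fresh copy of the data so no extra columns leak into the model
cars_data <- datasets::mtcars
mod <- lm(mpg ~ ., data = cars_data)

e  <- residuals(mod)                   # residual of each observation
h  <- hatvalues(mod)                   # leverage of each observation
p  <- length(coef(mod))                # number of estimated coefficients
s2 <- sum(e^2) / df.residual(mod)      # residual variance estimate

# Cook's distance: scaled squared residual times a leverage term
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Compare with the built-in calculation
all.equal(unname(cooks_manual), unname(cooks.distance(mod)))
```

An observation gets a large distance when it has both a large residual and high leverage.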

A higher Cook’s distance indicates that an observation is more likely to be an outlier. The following rules are used to make a decision based on Cook’s distance.

- An observation is considered an outlier if its Cook’s distance is more than three times the mean Cook’s distance.
- Another rule states that all observations with a Cook’s distance higher than 4/n (where n is the total number of observations) must be investigated.

```
# Detecting outliers in the mtcars dataset
# Fit a regression of mpg on all other variables
mod <- glm(mpg ~ ., data = mtcars)
# Store the Cook's distance of each observation
mtcars$cooksd <- cooks.distance(mod)
# Defining outliers based on the 4/n criterion
mtcars$outlier <- ifelse(mtcars$cooksd < 4 / nrow(mtcars), "keep", "delete")
# Inspecting rows 5 to 12 of the dataset
mtcars[5:12, ]
```

```
# Output
                   mpg cyl  disp  hp drat   wt  qsec vs am gear carb        cooksd outlier
Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2 0.00408313839    keep
Valiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1 0.03697080496    keep
Duster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4 0.00006907413    keep
Merc 240D         24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2 0.03454201049    keep
Merc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2 0.37922064047  delete
Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4 0.00428238516    keep
Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4 0.02404481831    keep
Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3 0.04013728968    keep
```

*Merc 230 is flagged as an outlier by Cook’s distance.*
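The first rule (three times the mean) can be sketched in the same way. This example assumes a fresh copy of `mtcars` and a model fitted with `lm()`; the names `cars_data`, `threshold` and `influential` are illustrative.

```r
# Fresh copy of the data, without the columns added in the earlier example
cars_data <- datasets::mtcars
mod <- lm(mpg ~ ., data = cars_data)

cooksd <- cooks.distance(mod)

# Rule: flag observations whose distance exceeds three times the mean distance
threshold <- 3 * mean(cooksd)
influential <- names(cooksd)[cooksd > threshold]
influential
```

The two rules do not have to agree, so it can be useful to compare the observations each one flags before taking any action.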

## Using DBSCAN - A machine learning approach

**DBSCAN** is a clustering algorithm that separates dense areas of the data from sparse ones. DBSCAN identifies collective outliers, so while running the algorithm one should ensure that no more than 5% of the values end up identified as outliers.

```
library(fpc)
# Compute DBSCAN using the fpc package
# iris[, -c(5,4,3)] keeps only Sepal.Length and Sepal.Width
set.seed(86)
db <- fpc::dbscan(iris[, -c(5,4,3)], eps = 0.20, MinPts = 3)
# Plot DBSCAN results
plot(db, iris[, -c(5,4,3)], main = "DBSCAN", frame = FALSE)
```
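Besides plotting, the flagged points can be pulled out of the fitted object. The sketch below assumes the `fpc` package is installed and relies on `fpc::dbscan()` coding noise points as cluster `0`; the names `features` and `is_noise` are illustrative.

```r
library(fpc)

# Same two columns as above: Sepal.Length and Sepal.Width
features <- iris[, c("Sepal.Length", "Sepal.Width")]
db <- fpc::dbscan(features, eps = 0.20, MinPts = 3)

# Points assigned to cluster 0 are treated as noise, i.e. candidate outliers
is_noise <- db$cluster == 0

# Share of observations flagged, to compare against the 5% guideline
mean(is_noise)

# The flagged rows themselves
iris[is_noise, ]
```

If the share is well above 5%, a larger `eps` or a smaller `MinPts` would flag fewer points.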

### Optimal value of **eps**

**eps** is the maximum distance at which two points are still treated as neighbours, and the algorithm uses this distance to decide whether a point is an outlier or not. It is therefore essential to identify the optimal eps value in order to make the correct decision. This can be achieved using the code below.

```
library(dbscan)
dbscan::kNNdistplot(iris[, -c(5,4,3)], k = 4)
abline(h = 0.15, lty = 2)
```

*The optimal eps value corresponds to the point where the curve shows an elbow-like bend, which is where the dotted line crosses the solid line.*

## Possible actions towards outliers

The possible actions we can take on outliers are listed below:

**1. Capping And Flooring** - We replace every value that lies above the upper threshold or below the lower threshold of the Tukey formula with the threshold value itself. Sealing values on the higher side is called capping, and the same action on the lower side is called flooring.

**2. Delete Outliers** - Another solution is to delete all the values which are unusual and do not represent the major chunk of the data.

**3. Model Outliers** - In certain cases you may find that outliers make up a significant percentage of the total data. In such cases, you can separate all the outliers and build a separate model for these values.
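The first two actions can be sketched in a few lines of base R. This is only an illustration on `iris$Sepal.Width` using the Tukey thresholds; the variable names (`x_capped`, `x_trimmed`) are assumptions made for the example.

```r
x <- iris$Sepal.Width

# Tukey thresholds
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# 1. Capping and flooring: pull extreme values back to the thresholds
x_capped <- pmin(pmax(x, lower), upper)

# 2. Deleting: keep only the values inside the thresholds
x_trimmed <- x[x >= lower & x <= upper]

range(x_capped)
length(x) - length(x_trimmed)   # number of values removed
```

Capping keeps the number of observations unchanged, while deletion shortens the vector, which matters if other columns of the same rows are still needed.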