An outlier is a value that does not follow the usual pattern of the data. Outliers present a particular challenge for almost all statistical methods, so it is important to identify and treat them. Let’s see which packages and functions can be used in R to deal with outliers.
The presence of outliers in a dataset can be the result of an error, or the values can be genuine and reflect the real distribution of the data. In either case, it is the responsibility of the analyst to understand such values and ensure that they are treated properly.
Things You Will Master
- Using the Tukey formula
- Using histograms and boxplots
- Using Cook’s distance
- Using DBSCAN - A machine learning approach
- Possible actions towards outliers
Using the Tukey formula
This formula uses quartiles to produce an upper and a lower threshold; all values beyond these thresholds are considered outliers.
Upper threshold = 3rd Quartile + 1.5 * IQR
Lower threshold = 1st Quartile - 1.5 * IQR
This formula is also used in boxplots: the points drawn beyond the whiskers lie above or below these threshold values. In the formula above, IQR stands for interquartile range, which is the difference between the 3rd Quartile and 1st Quartile values.
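As a minimal sketch of the formula above, the thresholds can be computed directly in base R. The variable names (lower_fence, upper_fence, tukey_outliers) are illustrative choices, and the sepal width column of the built-in iris data is used only as an example.

```r
# Tukey thresholds for sepal width in the built-in iris data
x <- iris$Sepal.Width

q1 <- quantile(x, 0.25, names = FALSE)   # 1st quartile
q3 <- quantile(x, 0.75, names = FALSE)   # 3rd quartile
iqr <- q3 - q1                           # interquartile range, same as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the two thresholds are flagged as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
tukey_outliers
```

The multiplier 1.5 is the conventional choice; a larger multiplier flags fewer values.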
Using histograms and boxplots
In a histogram, values that lie more than ±2 standard deviations from the mean are often treated as outliers. However, this is not a rigid rule; the cut-off can be adjusted as needed.
# Building a histogram
hist(iris$Sepal.Width, col = "blue", main = "Histogram", xlab = "Sepal width of flowers")
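The ±2 standard deviation rule mentioned above can also be checked numerically. The following is a small sketch, assuming the cut-off is measured from the mean; the names k, centre, spread and sd_outliers are illustrative.

```r
# Flag sepal widths more than k standard deviations away from the mean
x <- iris$Sepal.Width

k <- 2                 # number of standard deviations; adjust as needed
centre <- mean(x)
spread <- sd(x)

sd_outliers <- x[abs(x - centre) > k * spread]
length(sd_outliers)    # how many values fall outside the band
```

Raising k to 3 gives a stricter rule that flags fewer values.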
Similarly, one can draw a boxplot to visually check for outliers.
# Building a box plot
library(ggplot2)
ggplot(iris, aes(x = "", y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2, notch = FALSE)
# Building a box plot for each species
ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2, notch = FALSE)
A few flowers of the setosa and virginica species have an unusual sepal width. The versicolor species, however, has no outliers.
Using Cook’s distance
Cook’s distance is a multivariate method used to identify outliers while running a regression analysis. It measures how much the fitted values of the model change when a single observation is left out, and it combines the leverage of each observation with the size of its residual.
A higher value of Cook’s distance indicates that an observation is more likely to be an outlier. The following rules of thumb are used to make a decision based on Cook’s distance.
- An observation is considered an outlier if its Cook’s distance is more than three times the mean Cook’s distance.
- Another rule states that all observations with a Cook’s distance higher than 4/n (n is the total number of observations) must be investigated.
# Detecting outliers in the mtcars dataset
mod <- glm(mpg ~ ., data = mtcars)
mtcars$cooksd <- cooks.distance(mod)
# Defining outliers based on the 4/n criterion
mtcars$outlier <- ifelse(mtcars$cooksd < 4/nrow(mtcars), "keep", "delete")
# Inspecting the dataset
mtcars[5:12, ]
#Output
                   mpg cyl  disp  hp drat   wt  qsec vs am gear carb        cooksd outlier
Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2 0.00408313839    keep
Valiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1 0.03697080496    keep
Duster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4 0.00006907413    keep
Merc 240D         24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2 0.03454201049    keep
Merc 230          22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2 0.37922064047  delete
Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4 0.00428238516    keep
Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4 0.02404481831    keep
Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3 0.04013728968    keep
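Merc 230 is detected as an outlier by Cook’s distance: its value of about 0.379 exceeds the 4/n threshold, which is 4/32 = 0.125 for the 32 cars in the dataset.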
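The other rule of thumb, three times the mean Cook’s distance, can be applied in the same way. This is a sketch under a few assumptions: it works on a fresh copy of the data (datasets::mtcars) so that any columns added earlier do not enter the model, it uses lm(), which fits the same linear model as the default glm() call, and the names cars, threshold and influential are illustrative.

```r
# Work on a clean copy so extra columns do not become predictors
cars <- datasets::mtcars

mod <- lm(mpg ~ ., data = cars)
cooksd <- cooks.distance(mod)

# Rule of thumb: flag observations above three times the mean Cook's distance
threshold <- 3 * mean(cooksd)
influential <- names(cooksd)[cooksd > threshold]
influential
```

The two rules do not have to agree, so it is worth inspecting any observation flagged by either of them before deciding what to do with it.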
Using DBSCAN - A machine learning approach
DBSCAN is a clustering algorithm that separates dense regions of the data from sparse ones; points in sparse regions that do not belong to any cluster are treated as outliers. DBSCAN identifies collective outliers, so when running the algorithm one should ensure that no more than about 5% of the values are identified as outliers.
library(fpc)
# Compute DBSCAN using fpc package
set.seed(86)
db <- fpc::dbscan(iris[, -c(5,4,3)], eps = 0.20, MinPts = 3)
# Plot DBSCAN results
plot(db, iris[, -c(5,4,3)], main = "DBSCAN", frame = FALSE)
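To make the idea of density-based outliers concrete, here is a base-R sketch of the rule DBSCAN uses to label a point as noise. It is not the fpc implementation, and its result may differ from the package at the boundary of eps. It assumes that the neighbourhood count includes the point itself; the names min_pts, is_core and is_noise are illustrative.

```r
# Base-R sketch of the DBSCAN noise rule (not the fpc implementation)
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
eps <- 0.20      # neighbourhood radius, as in the call above
min_pts <- 3     # minimum neighbourhood size, counting the point itself (assumption)

d <- as.matrix(dist(X))     # pairwise Euclidean distances
within_eps <- d <= eps      # TRUE where two points are neighbours

# Core points have at least min_pts points in their eps-neighbourhood
is_core <- rowSums(within_eps) >= min_pts

# Noise points are not core and have no core point within eps
near_core <- rowSums(within_eps[, is_core, drop = FALSE]) > 0
is_noise <- !is_core & !near_core

mean(is_noise)   # share of points flagged as outliers
```

The last line gives the share of flagged points, which can be compared with the 5% guideline. In the object returned by fpc::dbscan, noise points carry the cluster label 0, so mean(db$cluster == 0) should give the corresponding share for the fitted model.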
Optimal value of eps
eps is the maximum distance between two points for them to be considered neighbours, and the algorithm uses this distance to decide whether a point is an outlier or not. It is therefore essential to identify the optimal eps value in order to make the correct decision. This can be done using the code below.
library(dbscan)
dbscan::kNNdistplot(iris[, -c(5,4,3)], k = 4)
abline(h = 0.15, lty = 2)
The point where the curve shows an elbow-like bend corresponds to the optimal eps value; here, that is where the dotted line crosses the solid line.
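For readers who want to see what such a plot is built from, the following base-R sketch computes, for every point, the distance to its k-th nearest neighbour and sorts those distances. This is roughly the quantity a k-nearest-neighbour distance plot displays; the exact output of the dbscan package may differ, and the names knn_dist and sorted_knn are illustrative.

```r
# Distance from each point to its k-th nearest neighbour, in base R
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
k <- 4

d <- as.matrix(dist(X))   # pairwise Euclidean distances

# In each sorted row the first entry is the point itself (distance 0),
# so the k-th nearest neighbour sits at position k + 1
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted in increasing order, these distances form the curve
# in which the elbow is looked for
sorted_knn <- sort(knn_dist)
tail(sorted_knn)
```

Calling plot(sorted_knn, type = "l") draws the curve; the sharp rise at the right-hand end comes from the sparse points that are candidates for outliers.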
Possible actions towards outliers
The possible actions we can take on outliers are listed below:
1. Capping And Flooring - We replace every value that lies above the upper or below the lower Tukey threshold with the threshold itself. Limiting values on the higher side is called capping, and the same action for values on the lower side is called flooring.
2. Delete Outliers - Another solution is to delete all the values which are unusual and do not represent the major chunk of the data.
3. Model Outliers - In certain cases you may find that outliers make up a significant percentage of the total data. In such cases, you can separate all the outliers and build a separate model for these values.
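The first two actions can be sketched in a few lines of base R. This example assumes the Tukey thresholds are used as the limits and works on the sepal width column of iris; the names x_capped and x_trimmed are illustrative.

```r
# Tukey thresholds for sepal width
x <- iris$Sepal.Width
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Capping and flooring: pull extreme values back to the thresholds
x_capped <- pmin(pmax(x, lower_fence), upper_fence)

# Deleting: keep only the values that lie within the thresholds
x_trimmed <- x[x >= lower_fence & x <= upper_fence]

range(x_capped)
length(x) - length(x_trimmed)   # number of values removed
```

Capping and flooring keeps the number of observations unchanged, whereas deletion shrinks the dataset, which matters when the data are small.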