Analysing zero variance predictors
Some variables in a dataset contain very little information because they mostly consist of a single value (e.g. zero). Such variables are called zero-variance variables.
Things You Will Master
- Overview - What is zero variance
- Methods to identify such variables
- R programming codes and functions
- Bonus Function - Identify zero variance conditioned on group variable
It often happens that a dataset contains variables which have only a single unique value, or only a handful of unique values. A variable with a single unique value, when passed to fit a model, can cause problems such as unstable models or, in some cases, can even cause the model to crash. Such predictor variables are called zero-variance variables.
Variables which take only a few unique values, some with low frequency, can also lead to problems. For example, in the survey dataset, some levels of the exercise variable have a low frequency. Such variables may become zero-variance predictors after the train/test split or during cross-validation. They are therefore termed near-zero variance predictors.
Methods to identify such variables
All variables falling into these categories should be identified and excluded from the dataset. There are several ways to spot them: you can check unique values, generate frequency tables, and so on. In addition, the following metrics can be used to detect zero-variance or near-zero variance predictors –
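As a first pass, these checks are easy to do in base R; a minimal sketch using the built-in iris data:

```r
# Number of distinct values per column; a count of 1 signals zero variance
sapply(iris, function(col) length(unique(col)))

# Frequency table of a single column, to spot rare levels
table(iris$Species)
```

Any column whose distinct-value count is 1, or whose frequency table is dominated by one value, is a candidate for removal.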
Frequency Ratio – the ratio of the frequency of the most common value to that of the second most common value. If this ratio is close to one, the predictor is well balanced and good to use. A substantially larger value indicates that the variable is unbalanced and may need to be eliminated.
Percentage of Unique Values – the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases. A variable whose frequency ratio exceeds the predefined threshold and whose percentage of unique values falls below the limit should be treated as a near-zero variance predictor.
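Both metrics can be computed by hand in base R; a sketch using Sepal.Length from iris:

```r
# Frequency ratio: most common value over the second most common value
freq <- sort(table(iris$Sepal.Length), decreasing = TRUE)
freq_ratio <- as.numeric(freq[1] / freq[2])

# Percentage of unique values out of the total number of samples
pct_unique <- 100 * length(unique(iris$Sepal.Length)) / nrow(iris)

freq_ratio   # 1.111111
pct_unique   # 23.33333
```

These hand-computed numbers match the freqRatio and percentUnique columns that nearZeroVar() reports for Sepal.Length.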
The caret package in R provides the nearZeroVar() function for identifying and removing such problematic variables before modeling. Some of the arguments you should be aware of are –
- x = Dataset
- freqCut = the cutoff for the ratio of the most common value to the second most common value. The default value is 95/5 (i.e. 19).
- uniqueCut = the cutoff for the percentage of distinct values out of the total number of samples. The default value is 10.
- saveMetrics = takes a logical input. If FALSE, only the positions of the zero- and near-zero variance variables are returned. If TRUE, a data frame with predictor information is returned.
The default values for freqCut and uniqueCut are fairly conservative, so feel free to change these numbers to suit your requirements.
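For instance, stricter cutoffs will flag variables more aggressively; a sketch with illustrative threshold values (requires caret):

```r
library(caret)

# Tighter thresholds than the defaults of 95/5 and 10
nearZeroVar(iris[, -5], freqCut = 2, uniqueCut = 20, saveMetrics = TRUE)
```

A variable is flagged as nzv when its frequency ratio exceeds freqCut and its percentage of unique values is at or below uniqueCut, so with these cutoffs Petal.Width (frequency ratio ≈ 2.23, ≈ 14.7% unique) would be flagged.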
# Identifying near-zero variance variables
nearZeroVar(iris[, -5], saveMetrics = TRUE)
# Output
             freqRatio percentUnique zeroVar   nzv
Sepal.Length  1.111111      23.33333   FALSE FALSE
Sepal.Width   1.857143      15.33333   FALSE FALSE
Petal.Length  1.000000      28.66667   FALSE FALSE
Petal.Width   2.230769      14.66667   FALSE FALSE
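With saveMetrics left at its default of FALSE, the function instead returns the column positions of the flagged variables, which can be dropped directly; a sketch using a toy data frame (column names are illustrative; requires caret):

```r
library(caret)

set.seed(1)
df <- data.frame(a = rnorm(100),
                 b = rep(0, 100))    # constant, zero-variance column
nzv_cols <- nearZeroVar(df)          # integer positions of flagged columns
if (length(nzv_cols) > 0) df <- df[, -nzv_cols, drop = FALSE]
names(df)   # "a"
```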
Identify zero variance conditioned on group variable
The checkConditionalX() function can be used to look at the distribution of the columns of x conditioned on the levels of y. It identifies columns of x that are sparse within the groups defined by y.
set.seed(1)
classes <- factor(rep(letters[1:3], each = 30))
x <- data.frame(x1 = rep(c(0, 1), 45),
                x2 = c(rep(0, 10), rep(1, 80)))
lapply(x, table, y = classes)
checkConditionalX(x, classes)
# Output
lapply(x, table, y = classes)
$x1
   y
     a  b  c
  0 15 15 15
  1 15 15 15

$x2
   y
     a  b  c
  0 10  0  0
  1 20 30 30

checkConditionalX(x, classes)
# [1] 2
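The returned column index can then be used to drop the sparse variable, continuing the example above (a sketch; requires caret and the x and classes objects defined earlier):

```r
library(caret)

bad_cols <- checkConditionalX(x, classes)   # index of the sparse column, x2
x_clean <- x[, -bad_cols, drop = FALSE]     # keep only the well-behaved x1
names(x_clean)
```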