R Statistics Blog

Data Science From an R Programmer's Point of View

Binary Logistic Regression

Logistic regression is used to explain the relationship between a dependent variable and one or more independent variables. When the dependent variable is dichotomous, we use binary logistic regression. In practice, binary logistic regression is usually referred to simply as logistic regression.

Things You Will Master

  1. Overview - Logistic Regression
  2. Case Study - What is UCI Adult Income?
  3. Data Cleaning & Exploratory Data Analysis
  4. Building the Model
  5. Interpreting Logistic Regression Output
  6. Predicting Dependent Variable(Y) in Test Dataset
  7. Evaluating Logistic Regression Model
    7.1. Classification Table
          7.1.1. Accuracy
          7.1.2. Misclassification Rate
          7.1.3. True Positive Rate(TPR) - Recall
          7.1.4. True Negative Rate(TNR)
          7.1.5. Precision
    7.2 F-Score
    7.3. ROC Curve
    7.4. Concordance

Overview - Logistic Regression

A logistic regression model describes the relationship between a binary target variable and a set of independent variables, which can be either qualitative or quantitative. Rather than predicting the probability of the event directly, the model predicts the logit transformation of that probability. The following mathematical formula is used to generate the final output:

logit(p) = log(p / (1 - p)) = β0 + β1*X1 + β2*X2 + ... + βk*Xk

In the above equation, p is the probability of the event and β0, β1, ..., βk are the model coefficients. The quantity p / (1 - p) is the odds of the event:

odds = p / (1 - p)
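
As a quick numeric sketch (the probability here is an arbitrary example value, not taken from the data), the base R functions qlogis() and plogis() move between a probability, its odds, and the logit:

# Illustration: moving between probability, odds and log odds (logit)
p <- 0.3              # an arbitrary example probability
odds <- p / (1 - p)   # odds = 0.4286
log(odds)             # the logit, i.e. the log odds
qlogis(p)             # same logit via base R
plogis(qlogis(p))     # back-transform recovers p = 0.3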

Case Study - What is UCI Adult Income?

In this tutorial, we will be using the Adult Income data from the UCI Machine Learning Repository to predict the income class of an individual based upon the information provided in the data. You can download the Adult Income data from the UCI repository.

Attention

The beta coefficients in logistic regression are estimated using maximum likelihood estimation.

The idea here is to give you a fair idea of how a data scientist or a statistician builds a predictive model, so we will demonstrate all the important tasks that are part of a model building exercise. For demonstration purposes, however, we will use only three variables from the whole dataset.

Getting the data

The Adult dataset is fairly large, so to read it faster I will be using read_csv() from the readr package to load the data from my local machine.

library(readr)
adult <- read_csv("./static/data/adult.csv")
# Checking the structure of adult data
str(adult)
# Output
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	48842 obs. of  15 variables:
 $ Age           : int  25 38 28 44 18 34 29 63 24 55 ...
 $ Workclass     : chr  "Private" "Private" "Local-gov" "Private" ...
 $ Fnlwgt        : int  226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
 $ Education     : chr  "11th" "HS-grad" "Assoc-acdm" "Some-college" ...
 $ Education-num : int  7 9 12 10 10 6 9 15 10 4 ...
 $ Marital-status: chr  "Never-married" "Married-civ-spouse" "Married-civ-spouse" "Married-civ-spouse" ...
 $ Occupation    : chr  "Machine-op-inspct" "Farming-fishing" "Protective-serv" "Machine-op-inspct" ...
 $ Relationship  : chr  "Own-child" "Husband" "Husband" "Husband" ...
 $ Race          : chr  "Black" "White" "White" "Black" ...
 $ Sex           : chr  "Male" "Male" "Male" "Male" ...
 $ Capital-gain  : int  0 0 0 7688 0 0 0 3103 0 0 ...
 $ Capital-loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Hours-per-week: int  40 50 40 40 30 30 40 32 40 10 ...
 $ Native-country: chr  "United-States" "United-States" "United-States" "United-States" ...
 $ Class         : chr  "<=50K" "<=50K" ">50K" ">50K" ...

As mentioned earlier, we will be using three variables - Workclass, Marital-status and Age - to build the model. Workclass and Marital-status are categorical variables, whereas Age is a continuous variable.

# Subsetting the data and keeping the required variables
adult <- adult[ ,c("Workclass", "Marital-status", "Age", "Class")]
# Checking the dim
dim(adult)
# Output
[1] 48842     4

The new dataset has 48842 observations and only 4 variables.

We cannot use categorical variables directly in the model, so for these variables we need dummy variables. A dummy variable takes the value 0 or 1 to indicate the absence or presence of a particular level. In our example, the glm() function will automatically create the dummy variables once the categorical variables are converted to factors.
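
As a small illustration of what happens behind the scenes (using a toy factor rather than the actual data), model.matrix() shows the 0/1 indicator columns R builds for a factor; the first level acts as the reference and gets no column of its own.

# Illustration only: how R expands a factor into dummy (indicator) columns
x <- factor(c("Federal-gov", "Private", "SL-gov", "Private"))
model.matrix(~ x)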

Summarizing the categorical variables

The best way to summarize a categorical variable is to create a frequency table, which we will do using the table() function.

# Generating the frequency table
table(adult$Workclass)
# Output
        ?      Federal-gov        Local-gov     Never-worked 
            2799             1432             3136               10 
         Private     Self-emp-inc Self-emp-not-inc        State-gov 
           33906             1695             3862             1981 
     Without-pay 
              21 

The table shows that there are 2799 missing values in this variable, represented by the "?" symbol. Also, the data is not uniformly distributed: some of the levels have very few observations, so we have an opportunity to combine similar-looking levels.

# Combining levels
adult$Workclass[adult$Workclass == "Without-pay" | adult$Workclass == "Never-worked"] <- "Unemployed"
adult$Workclass[adult$Workclass == "State-gov" | adult$Workclass == "Local-gov"] <- "SL-gov"
adult$Workclass[adult$Workclass == "Self-emp-inc" | adult$Workclass == "Self-emp-not-inc"] <- "Self-employed"

# Checking the table again
table(adult$Workclass)
# Output

            ?   Federal-gov       Private Self-employed 
         2799          1432         33906          5557 
       SL-gov    Unemployed 
         5117            31 

Let us apply a similar treatment to our other categorical variable.

# Generating the frequency table
table(adult$`Marital-status`)
# Output
    Divorced     Married-AF-spouse 
                 6633                    37 
   Married-civ-spouse Married-spouse-absent 
                22379                   628 
        Never-married             Separated 
                16117                  1530 
              Widowed 
                 1518 

We can reduce the above levels to three: married, never married and not married.

# Combining levels
adult$`Marital-status`[adult$`Marital-status` == "Married-AF-spouse" |
                         adult$`Marital-status` == "Married-civ-spouse" |
                         adult$`Marital-status` == "Married-spouse-absent"] <- "Married"

adult$`Marital-status`[adult$`Marital-status` == "Divorced" |
                         adult$`Marital-status` == "Separated" |
                         adult$`Marital-status` == "Widowed"] <- "Not-Married"

# Checking the table again
table(adult$`Marital-status`)
# Output
      Married       Never-married   Not-Married 
        23044         16117          9681 

This variable looks much better distributed than Workclass. Now we must convert these variables to factors.

# Converting to factor variables
adult$Workclass <- as.factor(adult$Workclass)
adult$`Marital-status` <- as.factor(adult$`Marital-status`)
adult$Class <- as.factor(adult$Class)

Deleting the missing values

We will first convert all "?" values to NA and then use na.omit() to keep only the complete observations.

# Converting ? to NA
adult[adult == "?"] <- NA

# Keeping only the complete observations
adult <- na.omit(adult)

Finally, taking a look at the target variable

To save time, I will go directly to the bivariate analysis. Let us see how the distribution of age looks for the two income groups.

library(ggplot2)
ggplot(adult, aes(Age)) + 
  geom_histogram(aes(fill = Class), color = "black", binwidth = 2)

Distribution of age by class variable

The age distribution looks much more right-skewed for the lower income group than for the higher income group.

Building the Model

We will split the data into train and test sets using the createDataPartition() function from the caret package. We will train the model on the training dataset and predict values on the test dataset. To train the model we will use the glm() function.

# Loading caret library
require(caret)
# Splitting the data into train and test
index <- createDataPartition(adult$Class, p = .70, list = FALSE)
train <- adult[index, ]
test <- adult[-index, ]

# Training the model
logistic_model <- glm(Class ~ ., family = binomial(), data = train)

# Checking the model
summary(logistic_model)
# Output
Call:
glm(formula = Class ~ ., family = binomial(), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6509  -0.8889  -0.3380  -0.2629   2.5834  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -0.591532   0.094875  -6.235     0.00000000045227 ***
WorkclassPrivate              -0.717277   0.077598  -9.244 < 0.0000000000000002 ***
WorkclassSelf-employed        -0.575340   0.084055  -6.845     0.00000000000766 ***
WorkclassSL-gov               -0.445104   0.086089  -5.170     0.00000023374732 ***
WorkclassUnemployed           -2.494210   0.766488  -3.254              0.00114 ** 
`Marital-status`Never-married -2.435902   0.051187 -47.589 < 0.0000000000000002 ***
`Marital-status`Not-Married   -2.079032   0.045996 -45.200 < 0.0000000000000002 ***
Age                            0.023362   0.001263  18.503 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 36113  on 32230  degrees of freedom
Residual deviance: 28842  on 32223  degrees of freedom
AIC: 28858

Number of Fisher Scoring iterations: 5

Interpreting Logistic Regression Output

All the variables in the above output have turned out to be significant (p-values are less than 0.05 for all of them). If you look at the categorical variables, you will notice that n - 1 dummy variables were created for each of them, where n is the total number of levels. The level that is left out is treated as the reference level, and all other levels are interpreted relative to it.
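
One quick way to check which level R treated as the reference for each factor is to look at the levels stored on the fitted model; under R's default treatment contrasts, the first level listed is the reference.

# Factor levels as used in the model; the first level of each is the reference
logistic_model$xlevels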

1. Null and Residual deviance - The null deviance measures how well the response is predicted by a model with only an intercept; the lower the value, the better. The residual deviance measures how well the response is predicted when all the variables are included; again, the lower the value, the better the model.
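
If you want to pull these two deviances out of the fitted object and turn the drop in deviance into an overall significance check, a short sketch (a standard chi-squared / likelihood-ratio comparison, added here for convenience) looks like this:

# Null and residual deviance stored on the fitted model
logistic_model$null.deviance
logistic_model$deviance

# p-value for the overall drop in deviance (likelihood-ratio test)
with(logistic_model,
     pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))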

2. (Intercept) - The intercept (β0) is the log odds of being in the higher income class when every predictor is at its reference level (here, a married Federal-gov worker, with Age = 0). We can transform the log odds back into a probability using the sigmoid function.

Sigmoid function, p = exp(-0.591532)/(1 + exp(-0.591532)) ≈ 0.356

The other way is to convert the log odds into simple odds by taking exp(-0.591532) = 0.5534. Because these odds are below 1, an individual at the reference levels of the predictors is less likely to be in the high income group than not (odds of roughly 0.55 to 1).

3. WorkclassPrivate - The beta coefficient for this variable is -0.717277. Converting it into odds gives exp(-0.717277) = 0.4880795. This indicates that the odds of an individual in the Private workclass being in the high income group are about 51% lower than those of an individual in a Federal-gov job.

Out of the five levels of Workclass, Federal-gov became the reference level, so every other level of the variable is interpreted in comparison to it. That is how we interpret categorical variables.

4. Age - The beta coefficient of the Age variable is 0.023362, which is on the log odds scale. Converting it to odds by taking exp(0.023362) gives approximately 1.02. This indicates that for each additional year of age, the odds of an individual being in the high income group increase by about 2%.

Remember

An odds value is never negative, and a value of 1 indicates that the variable has no impact on the target variable. A value less than 1 is read as a (1 - value) decrease in the odds, and a value greater than 1 indicates an increase in the odds.
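
With that in mind, rather than exponentiating one coefficient at a time, you can put all of them on the odds scale in a single call; a small convenience sketch using the fitted model from above:

# Odds (exponentiated coefficients) for every term in the model
round(exp(coef(logistic_model)), 3)

# Optional: Wald 95% confidence intervals on the odds scale
round(exp(confint.default(logistic_model)), 3)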

Predicting Dependent Variable(Y) in Test Dataset

To predict the target variable on unseen data we use the predict() function. With type = "response", the output of predict() is the predicted probability of the event.

# Predicting in test dataset
pred_prob <- predict(logistic_model, test, type = "response")

Evaluating Logistic Regression Model

There are a number of ways in which we can validate a logistic regression model. We have picked the popular ones which you can use to evaluate the model. Let's discuss them and see how to run them in R.

1. Classification Table - I would say this is the most popular validation technique for a logistic model. It is basically a contingency table drawn between the actual values and the predicted values. The table is then used to derive many other estimates, such as Accuracy, Misclassification Rate, True Positive Rate (also known as Recall), True Negative Rate, and Precision.

Here is a representation of the contingency table marking the important terms.

classification table and important terms

Before we create a contingency table we need to convert the probabilities into the two class levels, i.e., <=50K and >50K. To get these values we will use a simple ifelse() call and create a new variable in the train data named pred_class.

We have to repeat the steps below for both the train and test datasets.

Converting probability to class values in training dataset

# Converting from probability to actual output
train$pred_class <- ifelse(logistic_model$fitted.values >= 0.5, ">50K", "<=50K")

# Generating the classification table
ctab_train <- table(train$Class, train$pred_class)
ctab_train
# Output
        <=50K  >50K
  <=50K 22391  1844
  >50K   6299  1697

Converting probability to class values in the test dataset

# Converting from probability to actual output
test$pred_class <- ifelse(pred_prob >= 0.5, ">50K", "<=50K")

# Generating the classification table
ctab_test <- table(test$Class, test$pred_class)
ctab_test
# Output
        <=50K  >50K
  <=50K  9602   784
  >50K   2676   750

Accuracy

Accuracy is calculated by adding the diagonal elements of the contingency table and dividing by the sum of all its elements. We will also compare the accuracy on the training dataset with that on the test dataset to see whether our results hold on unseen data.

Accuracy = (TP + TN)/(TN + FP + FN + TP)
# Accuracy in Training dataset 
accuracy_train <- sum(diag(ctab_train))/sum(ctab_train)*100
accuracy_train
#Output
[1] 74.7355

Our logistic model classifies 74.7% of all observations in the training dataset correctly.

# Accuracy in Test dataset 
accuracy_test <- sum(diag(ctab_test))/sum(ctab_test)*100
accuracy_test
#Output
[1] 74.94932

The overall classification accuracy on the test dataset is 74.9%, which is comparable to the training dataset. This suggests that our model performs consistently on unseen data.

As a rough rule of thumb, a model with accuracy above 70% is often considered reasonably good, although the right threshold depends on the problem and the class balance.

Misclassification Rate

The misclassification rate indicates how often the model's predictions are wrong.

Misclassification Rate = (FP+FN)/(TN + FP + FN + TP)
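
Since the misclassification rate is simply 100 minus the accuracy, one quick way to compute it from the classification tables already built above is:

# Misclassification rate in the training and test datasets
misclass_train <- (1 - sum(diag(ctab_train))/sum(ctab_train)) * 100
misclass_test  <- (1 - sum(diag(ctab_test))/sum(ctab_test)) * 100
misclass_train
misclass_test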

True Positive Rate - Recall or Sensitivity

Recall or TPR indicates how many of the actual positive (TRUE) events the model predicts correctly.

Recall Or TPR = TP/(FN + TP)
# Recall in Train dataset 
Recall <- (ctab_train[2, 2]/sum(ctab_train[2, ]))*100
Recall
# Output
[1] 21.22311
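
The same calculation can be repeated on the test classification table to check that recall holds up on unseen data:

# Recall in Test dataset
Recall_test <- (ctab_test[2, 2]/sum(ctab_test[2, ]))*100
Recall_test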

True Negative Rate

TNR indicates how many of the actual non-events (negatives) the model predicts correctly.

TNR = TN/(TN + FP)
# TNR in Train dataset 
TNR <- (ctab_train[1, 1]/sum(ctab_train[1, ]))*100
TNR
#Output
[1] 92.39117

Precision

Precision indicates how many of the predicted TRUE (positive) values are actually TRUE.

Precision = TP/(FP + TP)
# Precision in Train dataset 
Precision <- (ctab_train[2, 2]/sum(ctab_train[, 2]))*100
Precision
#Output
[1] 47.92432

Calculating F-Score

The F-score is the harmonic mean of recall and precision. Its value lies between 0 and 1, where 1 represents perfect precision and recall and 0 represents the worst case.

F_Score <- (2 * Precision * Recall / (Precision + Recall))/100
F_Score
#Output
[1] 0.2941839

ROC Curve

The area under the curve (AUC) summarizes the ROC (Receiver Operating Characteristic) curve in a single number. The ROC curve is a line plot of sensitivity (TPR) against 1 - specificity (the false positive rate) across all classification thresholds. The curve is then used to compute the AUC value; an AUC greater than 0.70 generally indicates a good model.

library(pROC)
roc <- roc(train$Class, logistic_model$fitted.values)
auc(roc)
# Output
Area under the curve: 0.7965
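
If you also want to draw the curve itself rather than just the AUC, pROC provides a plot method for the roc object created above; a minimal sketch:

# Plotting the ROC curve for the training data
plot(roc, legacy.axes = TRUE, print.auc = TRUE)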

Concordance

Concordance is the proportion of all possible (one, zero) pairs in which the predicted probability for the one is higher than the predicted probability for the zero. The higher the value, the better the model; concordance lies between 0 and 1.

Similarly, discordance is the proportion of pairs in which the predicted probability for the one is lower than that for the zero. If the two probabilities are equal, the pair is counted as tied.

library(InformationValue)
Concordance(logistic_model$y,logistic_model$fitted.values)
# Output
$`Concordance`
[1] 0.7943923

$Discordance
[1] 0.2056077

$Tied
[1] 0

$Pairs
[1] 193783060

Closing Note

In this chapter, we built a binary logistic regression model on the UCI Adult Income data, interpreted its coefficients in terms of odds, and evaluated it using a classification table, the F-score, the ROC curve and concordance.
Last updated on 6 Jan 2019 / Published on 17 Oct 2017