R Statistics Blog

Data Science From R Programmers Point Of View

Linear Regression

Linear regression is a supervised machine learning algorithm which is used to predict the continuous variable. The algorithm assumes that the relation between dependent variable(Y) and independent variables(X), is linear. This relationship is represented by a line of best fit. In this chapter, we will learn how to execute linear regression in R using some select functions and test its assumptions before we use it for final prediction on test data.

Things You Will Master

  1. Overview - Linear Regression
  2. Glimpse of Exploratory Data Analysis
  3. Training Regression Model
  4. Interpreting Regression Coefficients
  5. Validating Regression Coefficients and Models
  6. Checking Assumptions of Linear Regression
  7. Predicting Dependent Variable(Y) in Test Dataset
  8. Generating R-Squared Value for test dataset

Overview - Linear Regression

In statistics, linear regression is used to model a relationship between continuous dependent variable and one or more independent variables. The independent variable can be either categorical or numerical. The case when we have only one independent variable then it is called as simple linear regression. The case of more than one independent variable is called as multivariate regression.

A mathematical representation of linear regression model

Y = β0 +β1X1 + β2X2 + β3X3 + ….. + βnXn + error

In the above equation, β0 coefficient represents intercept and βi coefficient represents slope. In the below case study, we will be using USA housing data to predict the price. Let us look at the top six observations of USA housing data.

# Reading data
housing <- read.csv("./static/data/USA_Housing.csv",
                    header = TRUE, sep = ",")
# Print top 6 observations
# Output
 AreaIncome AreaHouse AreaNumberofRooms AreaNumberofBedrooms AreaPopulation     Price
1   79545.46  5.682861          7.009188                 4.09       23086.80 1059033.6
2   79248.64  6.002900          6.730821                 3.09       40173.07 1505890.9
3   61287.07  5.865890          8.512727                 5.13       36882.16 1058988.0
4   63345.24  7.188236          5.586729                 3.26       34310.24 1260616.8
5   59982.20  5.040555          7.839388                 4.23       26354.11  630943.5
6   80175.75  4.988408          6.104512                 4.04       26748.43 1068138.1

Glimpse of Exploratory Data Analysis

1. Checking distribution of target variable - It is my personal advice to always first understand the nature of your target variable. To achieve this, we will be drawing histogram with density plot.

# Building histogram
ggplot(data=housing, aes(housing$Price)) + 
  geom_histogram(aes(y =..density..), fill = "orange") +

Visualizing price variable

It is good that the price variable follows normal distribution.

2. Analysing Summary Statistics - Here, we will simply create a summary statistics for all the variables to understand behaviour of all the independent variables. This should also provide information about missing values, if any.

# loading psych package
# Output
                     vars    n       mean        sd     median    trimmed       mad      min        max      range
AreaIncome              1 5000   68583.11  10657.99   68804.29   68611.84  10598.27 17796.63  107701.75   89905.12
AreaHouse               2 5000       5.98      0.99       5.97       5.98      0.99     2.64       9.52       6.87
AreaNumberofRooms       3 5000       6.99      1.01       7.00       6.99      1.01     3.24      10.76       7.52
AreaNumberofBedrooms    4 5000       3.98      1.23       4.05       3.92      1.33     2.00       6.50       4.50
AreaPopulation          5 5000   36163.52   9925.65   36199.41   36112.49   9997.21   172.61   69621.71   69449.10
Price                   6 5000 1232072.65 353117.63 1232669.38 1232159.69 350330.42 15938.66 2469065.59 2453126.94
                      skew kurtosis      se
AreaIncome           -0.03     0.04  150.73
AreaHouse            -0.01    -0.09    0.01
AreaNumberofRooms    -0.04    -0.08    0.01
AreaNumberofBedrooms  0.38    -0.70    0.02
AreaPopulation        0.05    -0.01  140.37
Price                 0.00    -0.06 4993.84

3. Checking Outliers Using Boxplots - As we have mentioned in out chapter Analysing Outliers that any point which lies beyond whispers is called as outliers.

meltData <- melt(housing)
p <- ggplot(meltData, aes(factor(variable), value)) 
p + geom_boxplot() + facet_wrap(~variable, scale="free")

Building boxplot to check the outliers

Apart from Area of Number of Bedrooms all other variables seem to have outliers

4. Correlation Matrix Visualization - We will use corrgram plot package to visualize and analyse the correlation matrix.

corrgram(housing, order=TRUE)

Visualizing correlation matrix

Training Regression Model

To build the linear regression we will be using lm() function. The function takes two main arguments. First, formula stating the dependent and independent variables separated by ~(tilder). Second, the dataset name. There are other very useful arguments and thus would request you to use help(lm) to read more from the documentation.

Diving data into train and test subsets

The housing data is divided into 70 : 30 split of train and test.

# Split data into train and test
index <- createDataPartition(housing$Price, p = .70, list = FALSE)
train <- housing[index, ]
test <- housing[-index, ]
# Checking the dim of train
# Output
[1] 3500    6

You can see we have 70% of the random observations in training dataset.

Building Model

# Taining model
lmModel <- lm(Price ~ . , data = train)
# Printing the model object
# Output
lm(formula = Price ~ ., data = housing)

         (Intercept)            AreaIncome             AreaHouse  
         -2637299.03                 21.58             165637.03  
   AreaNumberofRooms  AreaNumberofBedrooms        AreaPopulation  
           120659.95               1651.14                 15.20

Interpreting Regression Coefficients

In the above output, Intercept represents that the minimum value of Price that will be received, if all the variables are constant or absent. >Please note - intercept may not always make sense in business terms.

Slope(represented by independent variables) tells us about the rate of change which Price variable will witness, with every one unit change in the independent variable. For example – if AreaHouse of house increases by one more unit, the Price of house will increase by 165,637.

Validating Regression Coefficients and Models

It is important that we ensure the values of each beta coefficient is significant and has not come by chance. In R, the lm function runs a one-sample t-test against each beta coefficient to ensure that they are significant and have not come by change. Similary, we need to validate the overall model. Just like one-sample t-test, lm function also looks into three statistics which help data scientist to validate model. These statistics include R-Square, Adjusted R-Sqaure, and F-test also know as global testing.

To view these statistics we need to pass the lmModel object to the summary() function.

# Checking model statistics
# Output
lm(formula = Price ~ ., data = housing)

    Min      1Q  Median      3Q     Max 
-337020  -70236     320   69175  361870 

                          Estimate    Std. Error  t value            Pr(>|t|)    
(Intercept)          -2637299.0333    17157.8092 -153.708 <0.0000000000000002 ***
AreaIncome                 21.5780        0.1343  160.656 <0.0000000000000002 ***
AreaHouse              165637.0269     1443.4130  114.754 <0.0000000000000002 ***
AreaNumberofRooms      120659.9488     1605.1604   75.170 <0.0000000000000002 ***
AreaNumberofBedrooms     1651.1391     1308.6712    1.262               0.207    
AreaPopulation             15.2007        0.1442  105.393 <0.0000000000000002 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 101200 on 4994 degrees of freedom
Multiple R-squared:  0.918,	Adjusted R-squared:  0.9179 
F-statistic: 1.119e+04 on 5 and 4994 DF,  p-value: < 0.00000000000000022

In the above output Pr(>|t|) represents the p-value, which can be compared against alpha value of 0.05 to ensure if the corresponsing beta coefficient is siginificant or not. The lm function here lends a helping hand. All values which have (.) period or (*) astric in from of it are all significant. Based upon this we now know that all variables are significant except AreaNumberofBedrooms.

For overall model accuracy, let’s discuss statistics generated by lm function one by one.

1. Multiple R-squared: 0.918 - The R-squared values is formally called as coefficient of determination. Here, 0.918 indicates that the intercept, AreaIncome, AreaHouse, AreaNumberofRooms,and AreaPopulation variables when put together are able to explain 91.8% of variance in Price variable. The value of R-squared lies between 0 to 1. In practical aplications, if the values i greater than 0.70 we consider it as a good model.

2. Adjusted R-squared: 0.9179 - The Adjusted R-squared value indicate if the addition of new information ( variable ) brings significant improvement to the model or not. So as of now this value does not provide much information. However, increase in the adjusted R-squared value with additiong of indicates the variable is useful and helps by bringing significant improvment to the model.

Important Note

A large difference between the R-Squared and Adjusted R-squared is not appreciated and generally indicates that multicollinearity exists within the data.

3.F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 0.00000000000000022 - This line talks about the global testing of the model. The lm function runs an anova test to check the significance of overall model. Here null hypothesis is that the model is not significant and alternative is that the model is significant. According to the p-values < 0.05, our model is significant.

Albiet, looking at these statistics is enough to take a call on the model siginificance. There are other validation methods for linear regression as mentioned below:

4. AIC and BIC values - The AIC(Akaike’s information criterion, 1974) and BIC(Bayesian information criterion, 1978) are penalized-likelihood criteria. Both these measures use “measure of fit + complexity penalty” to get the final values.

AIC = - 2 * ln(likelihood) + 2 * p
BIC = - 2 * ln(likelihood) + ln(N) * p

Here p = number of estimated parameters and N = sample size.

The AIC and BIC can be used for choosing best predictor subsets in regression and for comparing different models. When comparing different models the model with minimum AIC and BIC values are considered the best models. However, AIC is likely to overfit the data, whereas BIC is susceptible to underfit the data.

# Using AIC function
# Using BIC function
# Output AIC
[1] 129441.3

# Output BIC
[1] 129486.9

5. Root Mean Square Error(RMSE) - The RMSE statistics is also used for comparing different models and the model which has got the lowest value is considered to be the best model. There are other similar function like MAE, MAPE, MSE and so on which can be used. These functions can be found in Metrics R Package. These functions take majorly two arguments, one is actual value and second, predicted values. So let’s see how we can get the these values. The actuals can be 100% found from the actual dataset or the training data in our case. However to find the fitted values we need to explore the model object.

# Checking model object for actual and predicted values
# Output
  [1] "coefficients"  "residuals"    
  [3] "effects"       "rank"         
  [5] "fitted.values" "assign"       
  [7] "qr"            "df.residual"  
  [9] "xlevels"       "call"         
  [11] "terms"         "model"

The above vector presents the names of object which constitute the lmModel object. Here, fitted values are the predicted values. Now, we will use these value to generate the rmse values.

rmse(actual = train$Price, predicted = lmModel$fitted.values)
# Output
[1] 493370.4

Checking Assumptions of Linear Regression

It is important that your model meets all the four assumptions of linear regression as mentioned below. If the model fails to meet these assumptions then we simply cannot use this model.

1. Errors should follow normal distribution - This can be checked by drawing a histogram of residuals or by using plot() function. The plot function creates 4 different charts one of which is a NPP plot which can confirm if the errors are normally distributed or not.

Generating histogram

# Histogram to check the distribution of errors
hist(lmModel$residuals, color = "grey")

Histogram to check linear regression assumption

The above histogram of errors clearly states that errors are normally distributed.

Generating NPP plot

We except the points to be very close tot he dotted line in a NPPplot. Points being close to the line means errors are normally distributed and you can see this is what we get in the plot.


npp plot to check linear regression assumptions

2. There should be no heteroscedasticity - This means that the variance of error terms should be constant. We shall not see any patterns when we draw a plot between residuals and fitted values. And the mean line should be close to Zero.

Generating the scatterplot between residuals and fitted values

# Using plot function

no heteroscedasticity assumption check

The red line which represents the mean is staright and close to the zero value. This indicates that we do not have a heteroscedasticity in our data.

3. There should be no multicollinearity - The linear model assumes that the predictor variables have NO correlation between them. If they exhibit high correlation, it is a problem and it is called as multicollinearity. A variation inflation factor test can be run to check for the multicollinearity assumtion.

VIF = 1/(1-R2)

The R implimentation of the below function can be found here.

VIF is an iterative process. The function will remove one variables at a time which is cause for multicollinearity and repeats the process until all problem causing variables are removed. So, finally, we are left with the list of variables which have no or very weak correlation between them.

vif_func(housing[, 1:5]) 
var                  vif             
 AreaIncome           1.00115868968647
 AreaHouse            1.00057658981485
 AreaNumberofRooms    1.27353508823836
 AreaNumberofBedrooms 1.27441273719468
 AreaPopulation       1.00126579728799

All variables have VIF < 10, max VIF 1.27 

[1] "AreaIncome"          
[2] "AreaHouse"           
[3] "AreaNumberofRooms"   
[4] "AreaNumberofBedrooms"
[5] "AreaPopulation"     

There is no multicollelinearty problem in the dataset. Generally, vif values which are greater than 5 or 7 are cause of multicollinearity.

3. There should be no auto serial correlation - The auto correlation means that error terms should not be correlated with each other. To check of this we can run durbin-watson test(dw test). The test returns a value between 0 and 4. If the values is 2 we say there is no auto serial correlation does not exist. However, a value greater than 2 represents (+) ve correlation and value less than 2 represents (-) ve correlation.

# Output
	Durbin-Watson test

data:  lmModel
DW = 1.9867, p-value = 0.3477
alternative hypothesis: true autocorrelation is greater than 0

We got a value of 1.9867 which suggests that there is no auto serial correlation.

Our model met all the four assumptions of linear regression.

Predicting Dependent Variable(Y) in Test Dataset

We test this model performance on test data set to ensure that our model is stable and we get the same or closer enough results to use this trained model to predict and forecast future values of dependent variables. To predict we use predict function and then we generate R-Squared value to see if we get the same result as we got in training dataset or not.

# Predicting Price in test dataset
test$PreditedPrice <- predict(lmModel, test)
# Priting top 6 rows of actual and predited price
head(test[ , c("Price", "PreditedPrice")])
# Output
      Price   PreditedPrice
2  1505890.9     1494997.3
3  1058988.0     1253539.2
8  1573936.6     1572965.3
12  663732.4      629153.8
23  718887.2      727497.8
24  743999.8      991034.7

Generating R-Squared Value for test dataset

We are using a user defined formula to generate the R-Squared value here.

actual <- test$Price
preds <- test$PreditedPrice
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
# Output
[1] 0.9140436

Our model is performing fantastic.
In test dataset, we got an accuracy of 0.9140436 and training data set, we got an accuracy of 0.918.

Closing Note

In this chapter, We learned many things related to linear regression from a practical and theoretical point of view. We learned when to use linear regression, how to use it, how to check the assumptions of linear regression, how to predict the target variable in test dataset using trained model object, and we also learned how to validate the linear regression model using different statistical methods. In the next chapter, we will learn about an advanced linear regression model called as ridge regression.
Last updated on 5 Jan 2019 / Published on 17 Oct 2017