Linear regression is a supervised machine learning algorithm used to predict a continuous variable. The algorithm assumes that the relationship between the dependent variable (Y) and the independent variables (X) is linear, and represents this relationship with a line of best fit. In this chapter, we will learn how to run linear regression in R using a few select functions, and how to test its assumptions before using the model for final prediction on test data.
Things You Will Master
- Overview - Linear Regression
- Glimpse of Exploratory Data Analysis
- Training Regression Model
- Interpreting Regression Coefficients
- Validating Regression Coefficients and Models
- Checking Assumptions of Linear Regression
- Predicting Dependent Variable(Y) in Test Dataset
- Generating R-Squared Value for test dataset
Overview - Linear Regression
In statistics, linear regression is used to model the relationship between a continuous dependent variable and one or more independent variables. The independent variables can be either categorical or numerical. When we have only one independent variable, it is called simple linear regression; when we have more than one, it is called multiple linear regression.
A mathematical representation of the linear regression model:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + error
In the above equation, the coefficient β0 represents the intercept and each coefficient βi represents a slope. In the case study below, we will use USA housing data to predict Price. Let us look at the top six observations of the USA housing data.
# Reading data
housing <- read.csv("./static/data/USA_Housing.csv", header = TRUE, sep = ",")

# Print top 6 observations
head(housing)
# Output
  AreaIncome AreaHouse AreaNumberofRooms AreaNumberofBedrooms AreaPopulation     Price
1   79545.46  5.682861          7.009188                 4.09       23086.80 1059033.6
2   79248.64  6.002900          6.730821                 3.09       40173.07 1505890.9
3   61287.07  5.865890          8.512727                 5.13       36882.16 1058988.0
4   63345.24  7.188236          5.586729                 3.26       34310.24 1260616.8
5   59982.20  5.040555          7.839388                 4.23       26354.11  630943.5
6   80175.75  4.988408          6.104512                 4.04       26748.43 1068138.1
Glimpse of Exploratory Data Analysis
1. Checking distribution of target variable - My personal advice is to always understand the nature of your target variable first. To achieve this, we will draw a histogram with a density overlay.
library(ggplot2)

# Building histogram with density overlay
ggplot(data = housing, aes(x = Price)) +
  geom_histogram(aes(y = ..density..), fill = "orange") +
  geom_density()
It is good that the Price variable follows an approximately normal distribution.
2. Analysing Summary Statistics - Here, we will simply generate summary statistics for all the variables to understand the behaviour of the independent variables. This should also provide information about missing values, if any.
# Loading psych package
library(psych)
psych::describe(housing)
# Output
                     vars    n       mean        sd     median    trimmed       mad      min        max      range
AreaIncome              1 5000   68583.11  10657.99   68804.29   68611.84  10598.27 17796.63  107701.75   89905.12
AreaHouse               2 5000       5.98      0.99       5.97       5.98      0.99     2.64       9.52       6.87
AreaNumberofRooms       3 5000       6.99      1.01       7.00       6.99      1.01     3.24      10.76       7.52
AreaNumberofBedrooms    4 5000       3.98      1.23       4.05       3.92      1.33     2.00       6.50       4.50
AreaPopulation          5 5000   36163.52   9925.65   36199.41   36112.49   9997.21   172.61   69621.71   69449.10
Price                   6 5000 1232072.65 353117.63 1232669.38 1232159.69 350330.42 15938.66 2469065.59 2453126.94
                      skew kurtosis      se
AreaIncome           -0.03     0.04  150.73
AreaHouse            -0.01    -0.09    0.01
AreaNumberofRooms    -0.04    -0.08    0.01
AreaNumberofBedrooms  0.38    -0.70    0.02
AreaPopulation        0.05    -0.01  140.37
Price                 0.00    -0.06 4993.84
3. Checking Outliers Using Boxplots - As we mentioned in our chapter Analysing Outliers, any point which lies beyond the whiskers is called an outlier.
library(reshape)

# Reshape data to long format for faceted boxplots
meltData <- melt(housing)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale = "free")
Apart from AreaNumberofBedrooms, all other variables seem to have outliers.
4. Correlation Matrix Visualization - We will use the corrgram package to visualize and analyse the correlation matrix.
require(corrgram)
corrgram(housing, order = TRUE)
Training Regression Model
To build the linear regression model, we will use the lm() function. The function takes two main arguments: first, a formula stating the dependent and independent variables, separated by ~ (tilde); second, the name of the dataset. There are other very useful arguments as well, so I would request you to run help(lm) and read more in the documentation.
Dividing data into train and test subsets
The housing data is divided into a 70:30 split of train and test.
library(caret)

# Setting a seed so the split is reproducible (the seed value here is arbitrary)
set.seed(123)

# Split data into train and test
index <- createDataPartition(housing$Price, p = .70, list = FALSE)
train <- housing[index, ]
test  <- housing[-index, ]
# Checking the dim of train
dim(train)
# Output
3500 6
You can see that we have a random 70% of the observations in the training dataset.
# Training model
lmModel <- lm(Price ~ . , data = train)

# Printing the model object
print(lmModel)
# Output
Call:
lm(formula = Price ~ ., data = housing)

Coefficients:
         (Intercept)            AreaIncome             AreaHouse
         -2637299.03                 21.58             165637.03
   AreaNumberofRooms  AreaNumberofBedrooms        AreaPopulation
           120659.95               1651.14                 15.20
Interpreting Regression Coefficients
In the above output, the Intercept represents the expected value of Price when all the independent variables are zero. Please note: the intercept may not always make sense in business terms.
Each slope coefficient (one per independent variable) tells us the rate of change in Price for every one-unit change in that independent variable, holding the others constant. For example, if AreaHouse increases by one unit, the Price of the house is expected to increase by about 165,637.
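These values can also be pulled straight from the fitted model object instead of being read off the printout; a minimal sketch:

# Extracting all coefficients as a named vector
coef(lmModel)

# Extracting a single slope, e.g. the one for AreaHouse
coef(lmModel)["AreaHouse"]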
Validating Regression Coefficients and Models
It is important to ensure that the value of each beta coefficient is significant and has not come about by chance. For every beta coefficient, R's lm function runs a t-test against the null hypothesis that the coefficient equals zero, to confirm that it is significant. Similarly, we need to validate the overall model. Alongside the coefficient t-tests, the lm function provides three statistics which help the data scientist validate the model: R-squared, adjusted R-squared, and the F-test, also known as the global test.
To view these statistics, we need to pass the lmModel object to the summary() function.

# Checking model statistics
summary(lmModel)
# Output
Call:
lm(formula = Price ~ ., data = housing)

Residuals:
    Min      1Q  Median      3Q     Max
-337020  -70236     320   69175  361870

Coefficients:
                          Estimate   Std. Error  t value            Pr(>|t|)
(Intercept)          -2637299.0333   17157.8092 -153.708 <0.0000000000000002 ***
AreaIncome                 21.5780       0.1343  160.656 <0.0000000000000002 ***
AreaHouse              165637.0269    1443.4130  114.754 <0.0000000000000002 ***
AreaNumberofRooms      120659.9488    1605.1604   75.170 <0.0000000000000002 ***
AreaNumberofBedrooms     1651.1391    1308.6712    1.262               0.207
AreaPopulation             15.2007       0.1442  105.393 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 101200 on 4994 degrees of freedom
Multiple R-squared: 0.918,  Adjusted R-squared: 0.9179
F-statistic: 1.119e+04 on 5 and 4994 DF,  p-value: < 0.00000000000000022
In the above output, Pr(>|t|) represents the p-value, which can be compared against the alpha value of 0.05 to check whether the corresponding beta coefficient is significant. The significance codes printed by lm lend a helping hand here: any coefficient with a period (.) or one or more asterisks (*) in front of it is significant at the corresponding level. Based on this, we now know that all variables are significant except AreaNumberofBedrooms.
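If you prefer to verify this programmatically, the coefficient table, including the p-values, is stored inside the model summary; a small sketch:

# Coefficient table from the model summary
coefs <- summary(lmModel)$coefficients

# p-values for each term, and a check against alpha = 0.05
coefs[, "Pr(>|t|)"]
coefs[, "Pr(>|t|)"] < 0.05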
For overall model accuracy, let's discuss the statistics generated by the lm function one by one.
1. Multiple R-squared: 0.918 - The R-squared value is formally called the coefficient of determination. Here, 0.918 indicates that AreaIncome, AreaHouse, AreaNumberofRooms, AreaNumberofBedrooms, and AreaPopulation together explain 91.8% of the variance in the Price variable. The value of R-squared lies between 0 and 1. In practical applications, a value greater than 0.70 is considered a good model.
2. Adjusted R-squared: 0.9179 - The adjusted R-squared value indicates whether the addition of new information (a variable) brings significant improvement to the model or not. So, on its own, this value does not provide much information. However, an increase in the adjusted R-squared when a variable is added indicates that the variable is useful and brings significant improvement to the model.
A large difference between the R-squared and adjusted R-squared values is undesirable and generally indicates that multicollinearity exists within the data.
3. F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 0.00000000000000022 - This line reports the global test of the model. The lm function runs an F-test (ANOVA) to check the significance of the overall model. Here the null hypothesis is that the model is not significant (no predictor explains Price), and the alternative is that the model is significant. Since the p-value is < 0.05, our model is significant.
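These three statistics can also be extracted from the summary object for programmatic use; a minimal sketch:

s <- summary(lmModel)

# R-squared and adjusted R-squared
s$r.squared
s$adj.r.squared

# F-statistic along with its numerator and denominator degrees of freedom
s$fstatistic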
Generally, looking at these statistics is enough to take a call on the model's significance. However, there are other validation methods for linear regression as well, as mentioned below:
4. AIC and BIC values - The AIC (Akaike's information criterion, 1974) and BIC (Bayesian information criterion, 1978) are penalized-likelihood criteria. Both measures combine a measure of fit with a complexity penalty to arrive at their final values.
AIC = - 2 * ln(likelihood) + 2 * p
BIC = - 2 * ln(likelihood) + ln(N) * p
Here p = number of estimated parameters and N = sample size.
AIC and BIC can be used for choosing the best predictor subsets in regression and for comparing different models. When comparing models, the model with the minimum AIC and BIC values is considered the best. However, AIC is more likely to overfit the data, whereas BIC is more susceptible to underfitting it.
# Using AIC function
AIC(lmModel)

# Using BIC function
BIC(lmModel)
# Output AIC
129441.3

# Output BIC
129486.9
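As a sanity check on the formulas above, both criteria can be reproduced by hand from the model's log-likelihood; a minimal sketch:

# Log-likelihood of the fitted model; its "df" attribute is the number of
# estimated parameters p (the coefficients plus the error variance)
ll <- logLik(lmModel)
p  <- attr(ll, "df")
N  <- nobs(lmModel)

# AIC = -2 * ln(likelihood) + 2 * p
-2 * as.numeric(ll) + 2 * p

# BIC = -2 * ln(likelihood) + ln(N) * p
-2 * as.numeric(ll) + log(N) * p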
5. Root Mean Square Error (RMSE) - The RMSE statistic is also used for comparing models, and the model with the lowest value is considered the best. There are other similar functions, such as MAE, MAPE, and MSE, which can be used as well; these functions can be found in the Metrics R package. They mainly take two arguments: the actual values and the predicted values. So let's see how we can get these values. The actual values come directly from the dataset, the training data in our case. To find the fitted values, however, we need to explore the model object.
# Checking model object for actual and predicted values
names(lmModel)
# Output
"coefficients"  "residuals"     "effects"       "rank"
"fitted.values" "assign"        "qr"            "df.residual"
"xlevels"       "call"          "terms"         "model"
The above vector lists the components that constitute the lmModel object. Here, fitted.values holds the predicted values. Now we will use these values to generate the RMSE.
library(Metrics)

# RMSE between actual and fitted prices on the training data
rmse(actual = train$Price, predicted = lmModel$fitted.values)
# Output
493370.4
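RMSE is also simple enough to compute without a package, which makes its definition explicit; a minimal sketch:

# RMSE = square root of the mean squared residual
sqrt(mean((train$Price - lmModel$fitted.values)^2))

# Equivalently, using the residuals stored in the model object
sqrt(mean(lmModel$residuals^2))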
Checking Assumptions of Linear Regression
It is important that your model meets all four assumptions of linear regression, as listed below. If the model fails to meet these assumptions, then we simply cannot rely on it.
1. Errors should follow a normal distribution - This can be checked by drawing a histogram of the residuals or by using the plot() function. The plot() function creates four diagnostic charts, one of which is a normal probability plot (NPP, also known as a Q-Q plot) that can confirm whether or not the errors are normally distributed.
# Histogram to check the distribution of errors
hist(lmModel$residuals, col = "grey")
The above histogram of the errors shows that they are approximately normally distributed.
Generating NPP plot
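The base plot() method can draw the Q-Q plot directly; the which argument selects it from the four diagnostic charts:

# Normal Q-Q plot of the residuals (the second of lm's diagnostic plots)
plot(lmModel, which = 2)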
We expect the points to lie very close to the dotted line in an NPP plot. Points close to the line mean the errors are normally distributed, and you can see this is what we get in the plot.
2. There should be no heteroscedasticity - This means that the variance of the error terms should be constant. We should not see any pattern when we plot the residuals against the fitted values, and the mean line should be close to zero.
Generating the scatterplot between residuals and fitted values
# Using plot function to draw residuals vs fitted values (the first diagnostic plot)
plot(lmModel, which = 1)
The red line, which represents the mean of the residuals, is straight and close to zero. This indicates that we do not have heteroscedasticity in our data.
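To back up the visual check with a formal one, the Breusch-Pagan test from the lmtest package can be used; this is an addition to the chapter's visual approach, and its null hypothesis is that the error variance is constant:

# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(lmModel)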
3. There should be no multicollinearity - The linear model assumes that the predictor variables have NO correlation between them. If they exhibit high correlation, it is a problem, and this is called multicollinearity. A variance inflation factor (VIF) test can be run to check the multicollinearity assumption.
VIF = 1 / (1 - R²), where R² is obtained by regressing the predictor being tested on all the other predictors.
The R implementation of the function described below can be found here.
The VIF check described here is an iterative process. The function removes one variable at a time, whichever is causing multicollinearity, and repeats the process until all problem-causing variables are removed. So, finally, we are left with the list of variables that have no, or only very weak, correlation between them.
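The iterative helper itself is not reproduced here. As a quick one-shot alternative, the vif() function from the car package computes the VIF for each predictor of an already-fitted model; note that the output shown below comes from the iterative function, not from this call:

# VIF for each predictor in the fitted model
library(car)
vif(lmModel)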
# Output
                 var              vif
          AreaIncome 1.00115868968647
           AreaHouse 1.00057658981485
   AreaNumberofRooms 1.27353508823836
AreaNumberofBedrooms 1.27441273719468
      AreaPopulation 1.00126579728799

All variables have VIF < 10, max VIF 1.27

"AreaIncome"           "AreaHouse"            "AreaNumberofRooms"
"AreaNumberofBedrooms" "AreaPopulation"
There is no multicollinearity problem in the dataset. Generally, VIF values greater than 5 or 7 indicate multicollinearity.
4. There should be no autocorrelation - Autocorrelation means that the error terms are correlated with each other, and the assumption is that they should not be. To check this, we can run the Durbin-Watson test (DW test). The test returns a value between 0 and 4. If the value is 2, we say there is no autocorrelation. A value less than 2 represents positive (+ve) correlation, and a value greater than 2 represents negative (-ve) correlation.
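The output shown below matches the dwtest() function from the lmtest package; the chapter does not show the call itself, so this is a reconstruction:

# Durbin-Watson test for autocorrelation in the residuals
library(lmtest)
dwtest(lmModel)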
# Output
Durbin-Watson test

data:  lmModel
DW = 1.9867, p-value = 0.3477
alternative hypothesis: true autocorrelation is greater than 0
We got a value of 1.9867, which suggests that there is no autocorrelation.
Our model meets all four assumptions of linear regression.
Predicting Dependent Variable(Y) in Test Dataset
We test the model's performance on the test dataset to ensure that it is stable and produces the same, or close enough, results, so that the trained model can be used to predict and forecast future values of the dependent variable. To predict, we use the predict() function and then generate the R-squared value to see whether we get a result similar to the one from the training dataset.
# Predicting Price in test dataset
test$PredictedPrice <- predict(lmModel, test)

# Printing top 6 rows of actual and predicted price
head(test[ , c("Price", "PredictedPrice")])
# Output
       Price PredictedPrice
2  1505890.9      1494997.3
3  1058988.0      1253539.2
8  1573936.6      1572965.3
12  663732.4       629153.8
23  718887.2       727497.8
24  743999.8       991034.7
Generating R-Squared Value for test dataset
We are using a user-defined formula to generate the R-squared value here.
actual <- test$Price
preds  <- test$PredictedPrice

# R-squared = 1 - RSS/TSS
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss / tss
rsq
# Output
0.9140436
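Alternatively, caret's postResample() reports RMSE, R-squared, and MAE in one call; note that caret computes R-squared as the squared correlation between predictions and actuals, so it may differ slightly from the 1 - RSS/TSS value above:

# RMSE, Rsquared and MAE on the test set
library(caret)
postResample(pred = test$PredictedPrice, obs = test$Price)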
Our model is performing very well. On the test dataset we got an R-squared of 0.9140436, close to the 0.918 we got on the training dataset, which indicates that the model is stable.