# Linear Regression

Linear regression is a supervised machine learning algorithm which is used to predict the continuous variable. The algorithm assumes that the relation between dependent variable(Y) and independent variables(X), is linear. This relationship is represented by a line of best fit. In this chapter, we will learn how to execute linear regression in R using some select functions and test its assumptions before we use it for final prediction on test data.

### Things You Will Master

- Overview - Linear Regression
- Glimpse of Exploratory Data Analysis
- Training Regression Model
- Interpreting Regression Coefficients
- Validating Regression Coefficients and Models
- Checking Assumptions of Linear Regression
- Predicting Dependent Variable(Y) in Test Dataset
- Generating R-Squared Value for test dataset

## Overview - Linear Regression

In statistics, linear regression is used to model a relationship between continuous dependent variable and one or more independent variables. The independent variable can be either categorical or numerical. The case when we have only **one independent variable** then it is called as **simple linear regression**. The case of **more than one independent variable** is called as **multivariate regression**.

A mathematical representation of linear regression model

Y = β0 +β1X1 + β2X2 + β3X3 + ….. + βnXn + error

In the above equation, **β0** coefficient represents **intercept** and **βi** coefficient represents **slope**.
In the below case study, we will be using **USA housing data** to predict the price. Let us look at the top six observations of USA housing data.

```
# Reading data
housing <- read.csv("./static/data/USA_Housing.csv",
header = TRUE, sep = ",")
# Print top 6 observations
head(housing)
```

```
# Output
AreaIncome AreaHouse AreaNumberofRooms AreaNumberofBedrooms AreaPopulation Price
1 79545.46 5.682861 7.009188 4.09 23086.80 1059033.6
2 79248.64 6.002900 6.730821 3.09 40173.07 1505890.9
3 61287.07 5.865890 8.512727 5.13 36882.16 1058988.0
4 63345.24 7.188236 5.586729 3.26 34310.24 1260616.8
5 59982.20 5.040555 7.839388 4.23 26354.11 630943.5
6 80175.75 4.988408 6.104512 4.04 26748.43 1068138.1
```

## Glimpse of Exploratory Data Analysis

**1. Checking distribution of target variable** - It is my personal advice to always first understand the nature of your target variable. To achieve this, we will be drawing histogram with density plot.

```
library(ggplot2)
# Building histogram
ggplot(data=housing, aes(housing$Price)) +
geom_histogram(aes(y =..density..), fill = "orange") +
geom_density()
```

*It is good that the price variable follows normal distribution.*

**2. Analysing Summary Statistics** - Here, we will simply create a summary statistics for all the variables to understand behaviour of all the independent variables. This should also provide information about missing values, if any.

```
# loading psych package
library(psych)
psych::describe(housing)
```

```
# Output
vars n mean sd median trimmed mad min max range
AreaIncome 1 5000 68583.11 10657.99 68804.29 68611.84 10598.27 17796.63 107701.75 89905.12
AreaHouse 2 5000 5.98 0.99 5.97 5.98 0.99 2.64 9.52 6.87
AreaNumberofRooms 3 5000 6.99 1.01 7.00 6.99 1.01 3.24 10.76 7.52
AreaNumberofBedrooms 4 5000 3.98 1.23 4.05 3.92 1.33 2.00 6.50 4.50
AreaPopulation 5 5000 36163.52 9925.65 36199.41 36112.49 9997.21 172.61 69621.71 69449.10
Price 6 5000 1232072.65 353117.63 1232669.38 1232159.69 350330.42 15938.66 2469065.59 2453126.94
skew kurtosis se
AreaIncome -0.03 0.04 150.73
AreaHouse -0.01 -0.09 0.01
AreaNumberofRooms -0.04 -0.08 0.01
AreaNumberofBedrooms 0.38 -0.70 0.02
AreaPopulation 0.05 -0.01 140.37
Price 0.00 -0.06 4993.84
```

**3. Checking Outliers Using Boxplots** - As we have mentioned in out chapter Analysing Outliers that any point which lies beyond whispers is called as outliers.

```
library(reshape)
meltData <- melt(housing)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")
```

*Apart from Area of Number of Bedrooms all other variables seem to have outliers*

**4. Correlation Matrix Visualization** - We will use `corrgram`

plot package to visualize and analyse the correlation matrix.

```
require(corrgram)
corrgram(housing, order=TRUE)
```

## Training Regression Model

To build the linear regression we will be using `lm()`

function. The function takes two main arguments. First, formula stating the dependent and independent variables separated by **~**(tilder). Second, the dataset name. There are other very useful arguments and thus would request you to use `help(lm)`

to read more from the documentation.

### Diving data into train and test subsets

The housing data is divided into 70 : 30 split of train and test.

```
library(caret)
# Split data into train and test
index <- createDataPartition(housing$Price, p = .70, list = FALSE)
train <- housing[index, ]
test <- housing[-index, ]
```

```
# Checking the dim of train
dim(train)
```

```
# Output
[1] 3500 6
```

You can see we have 70% of the random observations in training dataset.

### Building Model

```
# Taining model
lmModel <- lm(Price ~ . , data = train)
# Printing the model object
print(lmModel)
```

```
# Output
Call:
lm(formula = Price ~ ., data = housing)
Coefficients:
(Intercept) AreaIncome AreaHouse
-2637299.03 21.58 165637.03
AreaNumberofRooms AreaNumberofBedrooms AreaPopulation
120659.95 1651.14 15.20
```

## Interpreting Regression Coefficients

In the above output, **Intercept** represents that the minimum value of Price that will be received, if all the variables are constant or absent.
>Please note - intercept may not always make sense in business terms.

**Slope(represented by independent variables)** tells us about the rate of change which Price variable will witness, with every one unit change in the independent variable. For example – if **AreaHouse** of house increases by one more unit, the **Price** of house will increase by 165,637.

## Validating Regression Coefficients and Models

It is important that we ensure the values of each beta coefficient is significant and has not come by chance. In R, the `lm`

function runs a **one-sample t-test** against each beta coefficient to ensure that they are significant and have not come by change. Similary, we need to validate the overall model. Just like one-sample t-test, `lm`

function also looks into three statistics which help data scientist to validate model. These statistics include R-Square, Adjusted R-Sqaure, and F-test also know as global testing.

To view these statistics we need to pass the `lmModel`

object to the `summary()`

function.

```
# Checking model statistics
summary(lmModel)
```

```
# Output
Call:
lm(formula = Price ~ ., data = housing)
Residuals:
Min 1Q Median 3Q Max
-337020 -70236 320 69175 361870
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2637299.0333 17157.8092 -153.708 <0.0000000000000002 ***
AreaIncome 21.5780 0.1343 160.656 <0.0000000000000002 ***
AreaHouse 165637.0269 1443.4130 114.754 <0.0000000000000002 ***
AreaNumberofRooms 120659.9488 1605.1604 75.170 <0.0000000000000002 ***
AreaNumberofBedrooms 1651.1391 1308.6712 1.262 0.207
AreaPopulation 15.2007 0.1442 105.393 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 101200 on 4994 degrees of freedom
Multiple R-squared: 0.918, Adjusted R-squared: 0.9179
F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 0.00000000000000022
```

In the above output **Pr(>|t|)** represents the p-value, which can be compared against alpha value of 0.05 to ensure if the corresponsing beta coefficient is siginificant or not. The `lm`

function here lends a helping hand. All values which have (.) period or (*) astric in from of it are all significant. Based upon this we now know that all variables are significant except **AreaNumberofBedrooms**.

For overall model accuracy, let’s discuss statistics generated by `lm`

function one by one.

**1. Multiple R-squared: 0.918** - The R-squared values is formally called as coefficient of determination. Here, 0.918 indicates that the intercept, AreaIncome, AreaHouse, AreaNumberofRooms,and AreaPopulation variables when put together are able to explain 91.8% of variance in Price variable. The value of R-squared lies between 0 to 1. In practical aplications, if the values i greater than 0.70 we consider it as a good model.

**2. Adjusted R-squared: 0.9179** - The Adjusted R-squared value indicate if the addition of new information ( variable ) brings significant improvement to the model or not. So as of now this value does not provide much information. However, increase in the adjusted R-squared value with additiong of indicates the variable is useful and helps by bringing significant improvment to the model.

### Important Note

A large difference between the R-Squared and Adjusted R-squared is not appreciated and generally indicates that multicollinearity exists within the data.

**3.F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 0.00000000000000022** - This line talks about the global testing of the model. The `lm`

function runs an anova test to check the significance of overall model. Here null hypothesis is that the model is not significant and alternative is that the model is significant. According to the p-values < 0.05, our model is significant.

Albiet, looking at these statistics is enough to take a call on the model siginificance. There are other validation methods for linear regression as mentioned below:

**4. AIC and BIC values** - The AIC(Akaike’s information criterion, 1974) and BIC(Bayesian information criterion, 1978) are penalized-likelihood criteria. Both these measures use “measure of fit + complexity penalty” to get the final values.

AIC = - 2 * ln(likelihood) + 2 * p

BIC = - 2 * ln(likelihood) + ln(N) * p

Here p = number of estimated parameters and N = sample size.

The AIC and BIC can be used for choosing best predictor subsets in regression and for comparing different models. When comparing different models the model with minimum AIC and BIC values are considered the best models. However, AIC is likely to overfit the data, whereas BIC is susceptible to underfit the data.

```
# Using AIC function
AIC(lmModel)
# Using BIC function
BIC(lmModel)
```

```
# Output AIC
[1] 129441.3
# Output BIC
[1] 129486.9
```

**5. Root Mean Square Error(RMSE)** - The RMSE statistics is also used for comparing different models and the model which has got the lowest value is considered to be the best model. There are other similar function like MAE, MAPE, MSE and so on which can be used. These functions can be found in `Metrics`

R Package. These functions take majorly two arguments, one is actual value and second, predicted values. So let’s see how we can get the these values. The actuals can be 100% found from the actual dataset or the training data in our case. However to find the fitted values we need to explore the model object.

```
# Checking model object for actual and predicted values
names(lmModel)
```

```
# Output
[1] "coefficients" "residuals"
[3] "effects" "rank"
[5] "fitted.values" "assign"
[7] "qr" "df.residual"
[9] "xlevels" "call"
[11] "terms" "model"
```

The above vector presents the names of object which constitute the lmModel object. Here, fitted values are the predicted values. Now, we will use these value to generate the **rmse** values.

```
library(Metrics)
rmse(actual = train$Price, predicted = lmModel$fitted.values)
```

```
# Output
[1] 493370.4
```

## Checking Assumptions of Linear Regression

It is important that your model meets all the four assumptions of linear regression as mentioned below. If the model fails to meet these assumptions then we simply cannot use this model.

**1. Errors should follow normal distribution** - This can be checked by drawing a histogram of residuals or by using `plot()`

function. The `plot`

function creates 4 different charts one of which is a NPP plot which can confirm if the errors are normally distributed or not.

#### Generating histogram

```
# Histogram to check the distribution of errors
hist(lmModel$residuals, color = "grey")
```

*The above histogram of errors clearly states that errors are normally distributed.*

#### Generating NPP plot

We except the points to be very close tot he dotted line in a NPPplot. Points being close to the line means errors are normally distributed and you can see this is what we get in the plot.

`plot(lmModel)`

**2. There should be no heteroscedasticity** - This means that the variance of error terms should be constant. We shall not see any patterns when we draw a plot between residuals and fitted values. And the mean line should be close to Zero.

# Generating the scatterplot between residuals and fitted values

```
# Using plot function
plot(lmModel)
```

*The red line which represents the mean is staright and close to the zero value. This indicates that we do not have a heteroscedasticity in our data.*

**3. There should be no multicollinearity** - The linear model assumes that the predictor variables have NO correlation between them. If they exhibit high correlation, it is a problem and it is called as multicollinearity. A variation inflation factor test can be run to check for the multicollinearity assumtion.

VIF = 1/(1-R2)

The R implimentation of the below function can be found here.

VIF is an iterative process. The function will remove one variables at a time which is cause for multicollinearity and repeats the process until all problem causing variables are removed. So, finally, we are left with the list of variables which have no or very weak correlation between them.

`vif_func(housing[, 1:5]) `

```
var vif
AreaIncome 1.00115868968647
AreaHouse 1.00057658981485
AreaNumberofRooms 1.27353508823836
AreaNumberofBedrooms 1.27441273719468
AreaPopulation 1.00126579728799
All variables have VIF < 10, max VIF 1.27
[1] "AreaIncome"
[2] "AreaHouse"
[3] "AreaNumberofRooms"
[4] "AreaNumberofBedrooms"
[5] "AreaPopulation"
```

*There is no multicollelinearty problem in the dataset. Generally, vif values which are greater than 5 or 7 are cause of multicollinearity.*

**3. There should be no auto serial correlation** - The auto correlation means that error terms should not be correlated with each other. To check of this we can run durbin-watson test(dw test). The test returns a value between 0 and 4. If the values is 2 we say there is no auto serial correlation does not exist. However, a value greater than 2 represents (+) ve correlation and value less than 2 represents (-) ve correlation.

```
library("lmtest")
dwtest(lmModel)
```

```
# Output
Durbin-Watson test
data: lmModel
DW = 1.9867, p-value = 0.3477
alternative hypothesis: true autocorrelation is greater than 0
```

*We got a value of 1.9867 which suggests that there is no auto serial correlation.*

#### Our model met all the four assumptions of linear regression.

## Predicting Dependent Variable(Y) in Test Dataset

We test this model performance on test data set to ensure that our model is stable and we get the same or closer enough results to use this trained model to predict and forecast future values of dependent variables. To predict we use `predict`

function and then we generate R-Squared value to see if we get the same result as we got in training dataset or not.

```
# Predicting Price in test dataset
test$PreditedPrice <- predict(lmModel, test)
# Priting top 6 rows of actual and predited price
head(test[ , c("Price", "PreditedPrice")])
```

```
# Output
Price PreditedPrice
2 1505890.9 1494997.3
3 1058988.0 1253539.2
8 1573936.6 1572965.3
12 663732.4 629153.8
23 718887.2 727497.8
24 743999.8 991034.7
```

## Generating R-Squared Value for test dataset

We are using a user defined formula to generate the R-Squared value here.

```
actual <- test$Price
preds <- test$PreditedPrice
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
```

```
# Output
[1] 0.9140436
```

*Our model is performing fantastic*.

*In test dataset, we got an accuracy of 0.9140436 and training data set, we got an accuracy of 0.918*.