R Statistics Blog

Data Science From R Programmers Point Of View

Correlation

Correlation coefficients are used to describe the degree of association between quantitative variables. The value of a correlation coefficient lies between -1 and +1. The sign only indicates the direction of the relationship, not its strength, so a value of +0.86 represents the same strength of association as -0.86. A negative sign indicates that as one variable increases the other decreases, and a positive sign indicates that as one variable increases the other also increases. A value in the range of -0.20 to +0.20 indicates a very weak correlation or none at all.
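
To see that the sign carries the direction while the absolute value carries the strength, here is a minimal sketch. The vectors x, y_up, and y_down are made up for illustration and do not come from any dataset.

```r
# Hypothetical data: y_up rises with x, y_down is its mirror image
x      <- c(1, 2, 3, 4, 5, 6)
y_up   <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3)
y_down <- -y_up

r_pos <- cor(x, y_up)    # positive: both variables increase together
r_neg <- cor(x, y_down)  # negative: one increases as the other decreases

# Same strength, opposite direction
all.equal(abs(r_pos), abs(r_neg))
```

Flipping the sign of one variable flips the sign of the coefficient and leaves its magnitude unchanged.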

R supports a variety of correlation measures. However, we will only discuss the Pearson, Spearman, and Kendall correlations, as these are the ones used most of the time.

The cor function computes all three of these correlation coefficients. Although cor can produce the correlations for a whole matrix, it does not provide any information about the significance of a correlation. If you are interested in that, you can use the cor.test function, which tests one pair of variables at a time.

Things You Will Master

  1. Quick look - Types of correlation
  2. Generate correlation matrix
  3. Testing correlation for significance
  4. Visualizing correlation matrix using corrgram and corrplot R packages

Quick Look - Types of correlation

Let us quickly learn when to use which correlation.

1. Pearson correlation - Pearson correlation is used when we want to assess the degree of linear association between two quantitative variables.

cor(x, method = "pearson")

2. Spearman correlation - Use Spearman correlation when you want to assess the degree of association between rank-ordered variables.

cor(x, method = "spearman")

3. Kendall’s correlation - Kendall’s correlation can also be used to assess the degree of association between rank-ordered variables. Like Spearman correlation, it is a non-parametric measure.

cor(x, method = "kendall")
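
The difference between the three methods shows up when a relationship is monotonic but not linear. The sketch below uses made-up data (x and y are illustrative only) and calls cor with two vectors instead of a matrix.

```r
# Hypothetical data: y grows monotonically with x, but not along a straight line
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")   # below 1: the relationship is not linear
spearman <- cor(x, y, method = "spearman")  # 1: the rank orders agree perfectly
kendall  <- cor(x, y, method = "kendall")   # 1: every pair of observations is concordant

round(c(pearson = pearson, spearman = spearman, kendall = kendall), 2)
```

Because Spearman and Kendall work on ranks, they report a perfect association here, while Pearson is pulled below 1 by the curvature.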

Generate correlation matrix

You can generate a correlation matrix for any correlation type using the cor function. First, make sure you have checked the data type of each variable, so that you choose the appropriate correlation and get correct results. Let us compute the correlations between the variables of the iris data.

# Computing correlation
corMat <- cor(x = iris[, -5], method = "pearson")
# Rounding the values to two decimal places
round(corMat, 2)
# Output
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

The matrix above suggests that Petal.Width and Sepal.Length are highly correlated (0.82). Similarly, you can find other pairs of variables that are highly correlated with each other, such as Petal.Length and Petal.Width (0.96).
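
Reading pairs off the matrix by eye gets tedious as the number of variables grows. As a rough sketch, the strongly correlated pairs can be pulled out programmatically. The 0.8 cutoff and the names numCols, strong, and strongPairs are assumptions made for this illustration, not a standard.

```r
# Keep only the numeric columns, so cor() is not handed the Species factor
numCols <- sapply(iris, is.numeric)
corMat  <- cor(iris[, numCols], method = "pearson")

# Look at each pair once by masking the diagonal and the lower triangle
cutoff <- 0.8   # illustrative threshold, not a standard
strong <- which(abs(corMat) > cutoff & upper.tri(corMat), arr.ind = TRUE)

# One row per strongly correlated pair
strongPairs <- data.frame(
  var1 = rownames(corMat)[strong[, "row"]],
  var2 = colnames(corMat)[strong[, "col"]],
  r    = round(corMat[strong], 2)
)
strongPairs
```

With this cutoff the result should contain the three pairs whose coefficients exceed 0.8 in the matrix printed above.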

The same function can be used to compute the correlation matrix for ranked variables. For example, we can calculate the correlation between the number of cylinders (cyl) and the number of gears (gear) in the mtcars data. Both of these are ranked variables.

# Looking into the spearman correlation
cor(mtcars[, c("cyl", "gear")], method = "spearman")
# Output
           cyl       gear
cyl   1.0000000 -0.5643105
gear -0.5643105  1.0000000
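
One way to see what the Spearman method does is to note that it is Pearson's correlation computed on the ranks of the values. A minimal sketch on the same two mtcars columns follows. The names rho and rhoFromRanks are illustrative.

```r
# Spearman's coefficient computed directly
rho <- cor(mtcars$cyl, mtcars$gear, method = "spearman")

# Pearson's coefficient on the ranks; rank() gives tied values their average rank
rhoFromRanks <- cor(rank(mtcars$cyl), rank(mtcars$gear), method = "pearson")

# The two routes agree
all.equal(rho, rhoFromRanks)
```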

Testing correlation for significance

To check the statistical significance of the correlation we can use the cor.test function. The function generates a p-value which, when compared to the alpha value, reveals whether the correlation is statistically significant.

Decision Rule

According to the decision rule, if the p-value is less than alpha (0.05) we reject the null hypothesis. Here the null hypothesis is that the correlation between the two variables is equal to zero.

# Checking significance of correlation
cor.test(mtcars$mpg, mtcars$disp)
# Output

	Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$disp
t = -8.7472, df = 30, p-value = 0.000000000938
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9233594 -0.7081376
sample estimates:
       cor 
-0.8475514 

Based on the above test results, the p-value is far below 0.05, so we reject the null hypothesis and conclude that the correlation between mileage (mpg) and displacement (disp) is statistically significant.
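
The decision rule can also be applied in code instead of being read off the printed output. The sketch below relies on the fact that cor.test returns a list whose elements include estimate, p.value, and conf.int. The variable names and the wording of the decision strings are illustrative.

```r
alpha <- 0.05
res   <- cor.test(mtcars$mpg, mtcars$disp)   # Pearson by default

r       <- unname(res$estimate)  # sample correlation coefficient
pValue  <- res$p.value           # p-value of the test
confInt <- res$conf.int          # 95 percent confidence interval

# Apply the decision rule
if (pValue < alpha) {
  decision <- "reject the null hypothesis: the correlation differs from zero"
} else {
  decision <- "fail to reject the null hypothesis"
}
cat("r =", round(r, 2), "|", decision, "\n")
```

Storing the result in an object makes it easy to reuse the same check for other pairs of variables.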

Visualizing correlation matrix

Visualization is a powerful tool. It speeds up the process of understanding and digesting the important points. As your dataset grows it gets more and more difficult to go through the numbers present in your correlation matrix. So the best way to represent your insights about the relationship between variables is through correlation charts.

We share a few examples below so that you can use whichever suits your need. To build these graphs we use two packages, corrgram and corrplot. If you don't have these R packages, use install.packages() to install them on your local system.

Example 1

# loading package
library(corrgram)
corrgram(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")], order=TRUE)

correlation Graph 1

In the above graph:

The red shade indicates a negative correlation between the variables; the darker the shade, the stronger the association.

The blue shade indicates a positive correlation between the variables; the darker the shade, the stronger the association.

Example 2 - Visualizing correlation matrix using corrplot

The corrplot function offers seven methods, or shapes, for representing the information: "pie", "circle", "square", "number", "ellipse", "shade", and "color".

The first argument to the corrplot function is a correlation matrix.

# loading library
library(corrplot)

# Generating correlation matrix
corMat <- cor(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")])
# Building the correlation plot
corrplot(corMat, method="pie")

correlation graph using corrplot

Example 3 - Changing the shape to square

# Generating correlation matrix
corMat <- cor(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")])
# Building the correlation plot
corrplot(corMat, method="square")

Using squares to represent corrplot

Example 4 - Representing the correlation information using numbers

# Generating correlation matrix
corMat <- cor(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")])
# Building the correlation plot
corrplot(corMat, method="number")

Using numbers to represent corrplot

Example 5 - Changing the layout of the correlation graph

So far we have been drawing the full correlation matrix. However, a correlation matrix is symmetric, so its upper and lower triangles carry the same values, and you can choose to show only one half of it.

# Generating correlation matrix
corMat <- cor(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")])
# Building the correlation plot
corrplot(corMat, method="circle", type = "upper")

Presenting upper half of the corrplot
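
The point that the two triangles mirror each other can be checked in base R, without any plotting package. This is a small sketch, and the name upperOnly is illustrative.

```r
corMat <- cor(mtcars[, c("mpg", "wt", "disp", "hp", "qsec")])

# A correlation matrix equals its own transpose
isSymmetric(corMat)

# Blank out the diagonal and the lower triangle so each pair appears once
upperOnly <- round(corMat, 2)
upperOnly[!upper.tri(upperOnly)] <- NA
print(upperOnly, na.print = "")
```

With five variables, the upper triangle holds the ten distinct pairwise correlations.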

Closing Note

In this chapter, we learned about the functions in R that we can use to generate correlation coefficients. We also looked at how to check whether a correlation is statistically significant, and finally learned about R packages with which we can create some nice visualizations to present the correlation matrix. In the next chapter, we will learn about the applications of statistical hypothesis testing.
Last updated on 4 Jan 2019 / Published on 17 Oct 2018