Data Frame Manipulations

As a data analyst you will be working mostly with data frames, so it is important that you learn, understand, and practice the data manipulations related to them. This chapter therefore focuses on the relevant function names and how to use them.

Things You Will Master

  1. Check the structure of data frame
  2. Get column and row names of a data frame
  3. Converting rownames to column
  4. Accessing columns and rows in a data frame
  5. Rename and rearrange columns of a data frame
  6. Identifying, filtering and imputing missing values
  7. Sorting data frame by columns
  8. Removing duplicated values
  9. Renaming levels of a factor variable in a data frame
  10. Generating frequency tables (one-way, two-way, etc.)
  11. Adding and deleting columns in a data frame
  12. Merging, combining and appending data frames
  13. Summarizing data
  14. Summarizing data by group variable - Aggregated summaries
  15. Transforming data between long and wide format
  16. Subsetting or filtering a data frame
  17. Inspecting top and bottom rows of a data frame

Before we dig into how to accomplish the above-mentioned tasks in R, let's see how to access the datasets which come along with R packages.

Accessing Built-in Datasets

You can get the list of all available datasets by using the data() function. To get the list of datasets belonging to a particular package, use the package argument, as in data(package = "MASS").
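
For example:

# Listing all available datasets
data()
# Listing only the datasets that ship with the MASS package
data(package = "MASS")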

Structure of a data frame

In this section we will learn how to check the dimensions of a data frame, the total number of observations present in it, and similar structural details. This information can be extracted using the dim() and str() functions. The str() function gives much more detailed information than dim().

# Getting number of rows and columns - dimension
dim(iris)

# Getting more elaborate information
str(iris)

Getting column and row names

To print the column and row names you can use the colnames() and rownames() functions. There is another function called names() which can also be used to get the column names. All of these functions return a character vector.

# Printing column names
colnames(iris)
# Printing column names using names()
names(iris)
# Printing row names
rownames(mtcars)

Converting rownames to a column

The add_rownames() function from dplyr can be used to add the row names as a column in the data frame. By default this function adds the new column at the beginning of the data frame.

library(dplyr)
# Adding rownames as column in mtcars
mtcars <- add_rownames(mtcars, "carNames")
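
In newer versions of dplyr, add_rownames() is deprecated in favour of rownames_to_column() from the tibble package; a minimal equivalent, assuming tibble is installed, is sketched below.

# Alternative: rownames_to_column() from the tibble package
library(tibble)
data(mtcars) # resetting the data first
mtcars <- rownames_to_column(mtcars, "carNames")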

We can also achieve the above task by using functions like rownames() and cbind() from the base package. Let us see how one can achieve the desired output using these functions.

# Resetting the data to original form
data(mtcars)
# Extracting rownames
names <- rownames(mtcars)
# Setting the row names to NULL - deleting them, basically
rownames(mtcars) <- NULL
# Combining the row names back with the mtcars dataset
mtcars <- cbind(mtcars, names)

Accessing columns and rows in a data frame

A row or a column of a data frame can be accessed using either the index or name. For example,

Extract mileage data from mtcars dataset

# Using index
mtcars[, 1] # mpg (mileage) is the first column

# Using column name
mtcars[, "mpg"]

# Using dollar notation(again calling by name)
mtcars$mpg

Extract mileage and cyl data from mtcars dataset

# Using index
mtcars[, c(1,2)] # mpg and cyl are the first two columns

# Using column name
mtcars[, c("mpg", "cyl")]

Extract data for Hornet 4 Drive from mtcars dataset

# Using index
mtcars[4, ] # Hornet 4 Drive is the fourth observation
# Using index for calling multiple rows
mtcars[c(3,4), ] # Hornet 4 Drive is the fourth observation

# Using row name
mtcars["Hornet 4 Drive",]

# Using row names for calling multiple observations
mtcars[c("Datsun 710","Hornet 4 Drive"),] 

Renaming and rearranging columns of a data frame

One can rename the columns using the colnames() function. The same task can also be achieved using the rename() function from the dplyr package.

# Renaming all the columns
colnames(cars) <- c("carSpeed", "distanceCovered")
# Renaming only 2nd column
colnames(cars)[2] <- "distCovered"

# Using rename() function from dplyr
require(dplyr)
iris = rename(iris, flowerSpecies = Species, SepalLength = Sepal.Length)

Columns of a data frame can be rearranged by using column names or column index, as shown in the code snippets below.

data(cars)
# Checking current column names
names(cars)
# Reordering by using column names
cars <- cars[c("dist", "speed")]

# Reordering by using column index
cars <- cars[c(2, 1)]

Identifying, filtering and imputing missing values

Identifying missing values

You can look for missing values column by column or row by row. To check by column we can use the describe() function from the Hmisc package or the summary() function from the stats package. The describe() function reports a count called missing, whereas the summary() function indicates the presence of missing values through the NA count.

library(Hmisc)
# using describe() function
describe(mtcars)
# Using summary() function
summary(mtcars)

Apart from these two functions, you can also use the is.na() function along with apply() and any() to identify whether a variable contains missing values. If a variable contains a missing value the check returns TRUE, else it returns FALSE.

# Column wise missing value check
apply(mtcars, 2, function(x){any(is.na(x))})
# Row wise missing value check
apply(mtcars, 1, function(x){any(is.na(x))})

Filtering out missing values

To filter out missing values we can again make use of the is.na() function, or we can use a function named complete.cases(). Let us see how to use these functions.

# Using is.na() to get complete observations (rows without any NA)
CO2[!apply(is.na(CO2), 1, any), ]
# Using complete.cases to get complete rows
CO2[complete.cases(CO2), ]

Imputing missing values

There are many ways in which one can fill in missing values. In this section we will learn how to replace missing values with the mean, and how to use machine learning approaches like kNN and random forest to impute them.

# Filling missing values with Mean
iris$Sepal.Length[is.na(iris$Sepal.Length)] = mean(iris$Sepal.Length, na.rm=TRUE)
# An alternative way is to use ifelse()
iris <- transform(iris, Sepal.Length = ifelse(is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE), Sepal.Length))

# Using kNN imputation to fill missing values
# knn.impute() from bnstruct expects a numeric matrix, so the factor column is coded as an integer here
library(bnstruct)
iris_mat <- data.matrix(iris)
knn.impute(iris_mat, k = 3, cat.var = 5,
           to.impute = 1:nrow(iris_mat),
           using = 1:nrow(iris_mat))
          
# Using random forest to fill missing values
# rfImpute() comes from the randomForest package; iris.na is assumed to be a copy of iris containing NAs
library(randomForest)
set.seed(50)
iris <- rfImpute(Species ~ ., iris.na)

Sorting variables in a data frame

You can sort a column in ascending or descending order, and you can use more than one column to sort the dataset. We will demonstrate how to sort a data frame using the order() and arrange() functions. The arrange() function comes from the dplyr package.

data(mtcars)
# Sorting mtcars by mpg (ascending) - order() function
mtcars[order(mtcars$mpg),]
# Sorting mtcars by mpg (descending) - order() function
mtcars[order(mtcars$mpg, decreasing = TRUE),]
# Alternatively, we can use the negative sign
mtcars[order(-mtcars$mpg),]

# Sorting by more than one variable
mtcars[order(mtcars$mpg, mtcars$cyl),]

# Using arrange function from dplyr package
require(dplyr)
arrange(mtcars, mpg) # ascending order one variable
arrange(mtcars, mpg, cyl) # ascending order two variables
arrange(mtcars, desc(mpg)) # descending order one variable

Removing duplicated values from a data frame

Duplicate values are mostly considered from a row perspective, but they should also be looked at from a column perspective. In this example we will see how to remove duplicated rows as well as duplicated columns.

data(CO2)
# Adding duplicated row
CO2 <- rbind(CO2, CO2[2,])
nrow(CO2)
# Removing duplicated rows
CO2[!duplicated(CO2), ]
nrow(CO2)

# Adding duplicate columns for demonstration
data(CO2) # Resetting the data
CO2 <- cbind(CO2, CO2[,2])
colnames(CO2)[6] <- "Type"
# Removing duplicated columns
CO2[, !duplicated(colnames(CO2))]


data(CO2) # Resetting the data
CO2 <- cbind(CO2, CO2[,2])
colnames(CO2)[6] <- "Type" # naming the column
# Removing duplicated columns, keeping the last occurrence
CO2[, !duplicated(colnames(CO2), fromLast = TRUE)]

Renaming levels of a factor variable in a data frame

Renaming the levels of a factor variable is pretty simple. First you call the levels of the factor variable using the levels() function, and then assign new names to the desired levels.

# Renaming the Mississippi level of Type to Miss
levels(CO2$Type)[2] <- "Miss"
# Renaming all the levels
levels(CO2$Type) <- c("Queb", "Missi") # follow the order of the existing levels

Generating frequency tables (one-way, two-way, etc.)

To generate a frequency table, use the table() function or the CrossTable() function from the gmodels package. The output of CrossTable() resembles that of PROC FREQ in SAS, and includes row totals, column totals, the table total and each cell's chi-square contribution.

# Building a frequency table - univariate
table(mtcars$cyl) # ignores NA
table(mtcars$cyl, useNA = "ifany") # gives the frequency of NA if any are present
table(mtcars$cyl, useNA = "always") # always prints the NA count, showing 0 if no NA is present

# Building frequency table - Bivariate
table(mtcars$cyl, mtcars$am)
# Building frequency table - multivariate
table(mtcars$cyl, mtcars$am, mtcars$gear)

# Using the CrossTable() function to generate a contingency table
library(gmodels)
CrossTable(mtcars$cyl, mtcars$am)

Adding and deleting columns in a data frame

Adding and deleting columns is part of the regular job of a data analyst. Let us see how we can add a time column to the cars dataset by using the distance over speed formula.

# Adding new column 
cars$time <- round(cars$dist/cars$speed, 1)

# Deleting a column 
cars$time <- NULL

Merging, combining and appending data frames

You can use the merge() function to combine two data frames in R. This function is available in the base package, and with it you can perform inner, outer, left, right and cross joins.

# Inner Join
merge(x = data1, y = data2, by = "common Variable")
 
# Outer join 
merge(x = data1, y = data2, by = "common Variable", all = TRUE)
 
# Left outer join
merge(x = data1, y = data2, by = "common Variable", all.x = TRUE)
 
# Right outer
merge(x = data1, y = data2, by = "common Variable", all.y = TRUE)
 
# Cross join 
merge(x = data1, y = data2, by = NULL)

Summarizing data

Many different packages in R provide their own summary functions, each with a different set of statistics. Here is a list of functions you can use directly (a quick sketch of each follows the list).

  • summary() function - available in base R
  • describe() function - there are two different functions with the same name, one in the psych package and the other in the Hmisc package
  • fivenum() function - available in base R - stats package
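
A quick sketch of how these calls look on the mtcars data (describe() shown here is the Hmisc version; the psych version is called the same way):

# Basic per-column summary from base R
summary(mtcars)
# Minimum, lower hinge, median, upper hinge and maximum of a single variable
fivenum(mtcars$mpg)
# Detailed per-variable summary, including a count of missing values
library(Hmisc)
describe(mtcars)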

Summarizing data by grouped variable

It is always a good idea to look at the data in different slices. The slices are mostly based on one or two grouping variables. For example, you may want to look at the average mileage of cars by the cylinder variable. Below is a list of functions which we will use to achieve this task.

  • Using aggregate() function
  • Using tapply() function
  • Using group_by() and summarize() functions from dplyr package

Note

You can use these functions to aggregate data by multiple variables as well.

# Using aggregate function
aggregate(x = mtcars$mpg, by = list(mtcars$cyl), FUN = "mean")

# Using aggregate() to group by more than one variable
aggregate(x = mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), FUN = "mean")

# Using tapply function
tapply(mtcars$mpg, mtcars$cyl, mean)

# Using functions from dplyr package
# Loading library
library(dplyr)
mtcars %>% 
  dplyr::group_by(cyl) %>%
    dplyr::summarize(mean_mileage = mean(mpg))
    
# Using multiple variables to group
mtcars %>% 
  dplyr::group_by(cyl, gear) %>%
    dplyr::summarize(mean_mileage = mean(mpg))

Transforming data between long and wide format

While working with data in R you will realise that a lot of functions expect you to pass data in long format. This task can be achieved using functions from the tidyr and reshape2 packages.

We will first create two data frames, one in wide format and one in long format; both will contain the same information.

Creating data frame with wide format

wide <- read.table(header=TRUE, text='
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
')
# Converting subject variable to a factor variable
wide$subject <- factor(wide$subject)

Creating data frame with long format

long <- read.table(header=TRUE, text='
 subject sex condition measurement
       1   M   control         7.9
       1   M     cond1        12.3
       1   M     cond2        10.7
       2   F   control         6.3
       2   F     cond1        10.6
       2   F     cond2        11.1
       3   F   control         9.5
       3   F     cond1        13.1
       3   F     cond2        13.8
       4   M   control        11.5
       4   M     cond1        13.4
       4   M     cond2        12.9
')
# Converting subject variable to a factor variable
long$subject  <- factor(long$subject)

The tidyr package provides a newer interface to the kind of reshaping that the reshape2 package offers.

Use the gather() function from the tidyr package to convert the data from wide to long format.

# loading package
library(tidyr)

wide_to_long <- gather(wide, condition, measurement, control:cond2, factor_key=TRUE)
wide_to_long

Use the spread() function from the tidyr package to convert the data from long to wide format.

long_to_wide <- spread(long, condition, measurement)
long_to_wide

The corresponding functions from the reshape2 package are listed below (a short sketch of both follows the list):

  • melt() function - used for converting wide format to long format

  • dcast() function - used for converting long format to wide format
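
A rough reshape2 equivalent of the tidyr calls above, using the same wide and long data frames created earlier:

library(reshape2)
# Wide to long with melt()
melt(wide, id.vars = c("subject", "sex"),
     variable.name = "condition", value.name = "measurement")
# Long to wide with dcast()
dcast(long, subject + sex ~ condition, value.var = "measurement")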

Subsetting or filtering a data frame

Subsetting a data frame based on a condition can be done in many different ways in R. In this chapter we will use three different approaches. For example, say you want to take a subset of the CO2 data on the basis of the Treatment variable: you wish to create a new data frame which contains all the observations where Treatment is nonchilled.

# Using square brackets
sub1 <- CO2[CO2$Treatment == "nonchilled", ]

# Using subset function 
sub2 <- subset(CO2, CO2$Treatment == "nonchilled")

# Using filter function from dplyr package
library(dplyr)
sub3 <- filter(CO2, Treatment == "nonchilled")

You can use logical operators to combine more than one condition spread across multiple variables. For example, get the subset of the data where Treatment is nonchilled and Type is Quebec.

# Using square brackets
sub1 <- CO2[CO2$Treatment == "nonchilled" & CO2$Type == "Quebec", ]

# Using subset function 
sub2 <- subset(CO2, CO2$Treatment == "nonchilled" & CO2$Type == "Quebec" )

# Using filter function from dplyr package
library(dplyr)
sub3 <- filter(CO2, Treatment == "nonchilled", Type == "Quebec")

Inspecting top and bottom rows of a data frame

You can use the head() and tail() functions to inspect the top and bottom n observations. By default both functions print the top 6 and bottom 6 observations respectively; however, you can fetch more or fewer observations.

head(mtcars) # default top 6 observations
head(mtcars, 10) # prints top 10 observations
tail(mtcars) # default bottom 6 observations
tail(mtcars, 10) # prints bottom 10 observations