R Statistics Blog

Data Science From R Programmers Point Of View

R Basics

Overview

R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. They started working on the tool in the early 1990s with the intention of helping their students. However, they were then encouraged to make it open source. The language is based on another single-letter programming language called S, whose commercial implementation is known as S-PLUS (S+).

One of the major reasons for the popularity of R is that R and its packages are Open Source and Free.

Fact

In its basic form, R is a command-line language. The alpha version of R was first released in 1997.

Getting Help in R

R has an extensive help system, and this is one of the best features of R programming. One can access the documentation of functions and packages by using help() or ?. These functions provide access to the documentation pages for R functions, data sets, and other objects. Almost all the documentation pages of R packages and functions contain a couple of examples showcasing how to use the function.

help(mean)
?mean
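Beyond help() and ?, a few other helpers are handy when you do not remember the exact function name or want to try the documented examples. This is a minimal sketch using functions that ship with base R; the search term "mean" and the variable name matches are just illustrations.

```r
# List the names of available objects that contain "mean"
matches <- apropos("mean")
head(matches)

# Show the arguments of a function without opening its help page
args(sd)

# Run the examples from the help page of mean()
example(mean)

# Tip: typing ??median at the console searches all installed help pages
```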

Topics covered

Things You Will Master

  1. Operators in R
  2. Working with numbers and strings
    2.1 Working With Numbers
    2.2 Working With Strings
  3. Data types and structure
    3.1 Data Types in R Programming
    3.2 Data Structures in R Programming
      3.2.1 Vector Manipulations and important functions
        3.2.1.1 Defining vectors
        3.2.1.2 Verifying and checking the class of the vectors
        3.2.1.3 Accessing the elements of a vector
        3.2.1.4 Replacing and adding values to a vector
        3.2.1.5 Getting the index of a particular element
        3.2.1.6 Sorting, subsetting, and removing vectors
      3.3 List manipulation functions
        3.3.1 Defining list - simple and named lists
        3.3.2 Referencing and replacing values of a list
        3.3.3 Flatten out a list using unlist() function
        3.3.4 Checking the class of each vector in a list
  4. Matrix Manipulation
    4.1 Defining Matrix
    4.2 List of important matrix manipulation functions

Operators in R

R supports almost all the popular binary and logical operators. I am sure you will be familiar with almost all of them.

Binary/Arithmetic Operators

Operator    Description
+           Addition
-           Subtraction
*           Multiplication
/           Division
** or ^     Exponentiation
X %/% Y     Integer division
X %% Y      Modulus (gives the remainder)

The operators mentioned above can be used with scalars, vectors and matrices.

Arithmetic operators in action

# adding two values
2 + 2
# Multiplying
23*34
# Integer division
1990%/%23
# Calculating Modulus
7%%2
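The remark that these operators also work on vectors and matrices can be sketched as follows. The variable names (prices, quantities and so on) are made up for illustration.

```r
# Two numeric vectors of equal length
prices <- c(10, 20, 30)
quantities <- c(2, 3, 4)

# Operators work element by element
totals <- prices * quantities       # 20 60 120
discounted <- prices - 5            # the scalar 5 is recycled: 5 15 25

# Integer division and modulus are vectorised too
weeks <- c(15, 30, 45) %/% 7        # 2 4 6
days_left <- c(15, 30, 45) %% 7     # 1 2 3

# On matrices, ^ and * are applied element by element as well
m <- matrix(1:4, nrow = 2)
squared <- m ^ 2                    # holds 1, 4, 9, 16
```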

Caution

Although R is a remarkable statistical tool, one thing about it can be exasperating: it is a case-sensitive language. This means that view and View are considered two different objects.

Logical Operators

Operator    Description
>           Greater than
>=          Greater than or equal to
<           Less than
<=          Less than or equal to
==          Equal to
!=          Not equal to
x | y       x or y
x & y       x and y
!x          Not x

Logical operators return TRUE if the condition is met and FALSE if the condition is not met. They can be used with both numbers and strings.

Logical Operators in action

# Using greater than
10 > 11
# Using equal to
"Hanna"=="hanna"
# Using not x (the parentheses make the order of evaluation explicit)
!(10 == 11)
# Using AND operator
(10 == 10) & (2 == 2)
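Like the arithmetic operators, the logical operators are applied element by element when used on vectors, which makes them useful for filtering. A small sketch with a made-up ages vector:

```r
ages <- c(15, 22, 37, 64)

# Combine two conditions with AND and with OR
working_age <- ages >= 18 & ages < 60    # FALSE TRUE TRUE FALSE
outside <- ages < 18 | ages >= 60        # TRUE FALSE FALSE TRUE

# Summarise a logical vector
any(working_age)     # TRUE: at least one element is TRUE
all(working_age)     # FALSE: not every element is TRUE
sum(working_age)     # 2: TRUE counts as 1

# Use the logical vector to pick elements
selected <- ages[working_age]            # 22 37
```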

Assignment operator

The assignment operator is used in programming languages to save/assign a value to a variable. This variable can then be used for further processing. In R we use the assignment operator (<-) to assign a value. We can also use the equal to (=) symbol. However, the assignment operator (<-) is far more popular than the equal to sign.

# Assigning number values
num <- 23
num

# Assigning string value
strng <- "Hanna"
strng

As in most other programming languages, strings are defined using double or single quotes.
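One practical difference between the two assignment forms shows up inside function calls: there, = only names an argument, while <- still creates a variable. The sketch below uses a hypothetical helper called square() to show this.

```r
# A small helper function, made up for this illustration
square <- function(value) value^2

# Inside a call, = only matches the argument called value
a <- square(value = 4)                                  # 16
created_by_equals <- exists("value", inherits = FALSE)  # FALSE

# Inside a call, <- also creates a variable called value
b <- square(value <- 5)                                 # 25
created_by_arrow <- exists("value", inherits = FALSE)   # TRUE
```

This is one reason many R users keep <- for assignment and reserve = for passing arguments.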

Numbers and Strings

Numbers and strings are what constitute most datasets. So it is important to understand some of the most common tasks and functions you will need while dealing with them.

Working With Numbers

Generating sequence of numbers

To generate a sequence of numbers one can either use the colon (:) operator or the seq() function.

# Using (:) to generate sequence of integer numbers
1:10
# Using seq() function to generate sequence of numbers
seq(10, 20, by = 0.7)
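A few more ways of building sequences are worth knowing. This sketch uses only base R functions; the variable names are arbitrary.

```r
# Ask for a number of points instead of a step size
quarters <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# Counting from 1 up to n
first_five <- seq_len(5)                 # 1 2 3 4 5

# Descending sequences
countdown <- 10:6                        # 10 9 8 7 6
by_fives <- seq(20, 10, by = -5)         # 20 15 10

# Repeating a pattern
pattern <- rep(c(1, 2), times = 3)       # 1 2 1 2 1 2

# With by, the sequence stops at the last value that does not pass the end
steps <- seq(10, 20, by = 0.7)
length(steps)                            # 15 values, the last one is 19.8
```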

Generating uniformly distributed random numbers

Among the many functions available, the ones I like the most are runif() and sample().

# Using runif() function to generate 10 random numbers
# By default generates number between 0 and 1
runif(10)

# Generating numbers between 200 to 500
runif(10, min = 200, max = 500)

# Generating four random numbers WITH REPLACEMENT
sample(10:15, 4, replace = TRUE)

# Generating four random numbers WITHOUT REPLACEMENT
sample(10:15, 4, replace = FALSE)

Usage

The sample() function is often used for creating random samples from a dataset. The sampled data is then used for training Machine Learning models.
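As a rough sketch of that usage, here is one way to split a dataset into training and test rows. It assumes the built-in mtcars data frame and a 70/30 split; both choices, the seed, and the variable names are illustrative.

```r
# Fix the seed so the same rows are picked on every run
set.seed(42)

n <- nrow(mtcars)                        # 32 rows

# Pick 70% of the row numbers at random, without replacement
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

# Rows that were picked go to training, the rest go to testing
train_set <- mtcars[train_idx, ]
test_set <- mtcars[-train_idx, ]

nrow(train_set)                          # 22
nrow(test_set)                           # 10
```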

Generate random numbers from normal distribution

A normal distribution is a distribution which follows a bell curve. Statistically speaking, its mean, median and mode are all the same.

# Using rnorm() function to generate 10 random numbers
rnorm(10)
# Setting the desired standard deviation and mean
rnorm(10, mean = 5, sd = 2)
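To see the properties of the normal distribution in generated data, one can draw a larger sample and compare its summary statistics with the values that were requested. This is only a sketch; the sample size of 10000 and the seed are arbitrary choices.

```r
# Fix the seed so the draw can be repeated
set.seed(1)

# Draw a large sample with mean 5 and standard deviation 2
draws <- rnorm(10000, mean = 5, sd = 2)

# The sample mean and median should both land close to 5
mean(draws)
median(draws)

# The sample standard deviation should land close to 2
sd(draws)
```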

Generating same sequence of random numbers

This can be achieved by using set.seed(), a very useful function which ensures that you are able to reproduce the same results. The function takes one argument, which can be any integer. Keeping that number the same gives you the same results.

# With set.seed()
# Output 1
set.seed(23)
rnorm(10, mean = 5, sd = 2)
# Output 2
set.seed(23)
rnorm(10, mean = 5, sd = 2)

# Without set.seed()
# Output 1
rnorm(10, mean = 5, sd = 2)
# Output 2
rnorm(10, mean = 5, sd = 2)
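Rather than comparing the printed numbers by eye, the effect of set.seed() can be checked with identical(). A minimal sketch with arbitrary variable names:

```r
# Same seed before each draw gives the same numbers
set.seed(23)
first_draw <- rnorm(5)
set.seed(23)
second_draw <- rnorm(5)
identical(first_draw, second_draw)     # TRUE

# Drawing again without resetting the seed continues the random stream
third_draw <- rnorm(5)
identical(first_draw, third_draw)      # FALSE
```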

Rounding numbers to nearest value

There are a couple of ways to achieve this. One can round values to the nearest integer, upwards, downwards, or towards zero. The following functions can be used for each of these tasks.

# Generating a sequence of numbers
numerSeq <- seq(0, 1, by = 0.05)

# Rounding to nearest integer - an exact .5 is rounded to the even number, so round(0.5) is 0
round(numerSeq)
# Rounding to one decimal point
round(numerSeq, 1)
# Rounding towards upper side value
ceiling(numerSeq)
# Rounding towards lower side value
floor(numerSeq)
# Rounding towards Zero
trunc(numerSeq)
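The differences between these functions are easiest to see on a handful of values that include negatives and exact halves. The values vector below is made up for this comparison.

```r
values <- c(-2.7, -0.5, 0.5, 1.5, 2.5, 2.7)

# Nearest integer, with exact halves going to the even number
rounded <- round(values)       # -3  0  0  2  2  3

# Always up
up <- ceiling(values)          # -2  0  1  2  3  3

# Always down
down <- floor(values)          # -3 -1  0  1  2  2

# Towards zero: drops the decimal part
cut_off <- trunc(values)       # -2  0  0  1  2  2
```

Note that floor() and trunc() agree for positive numbers and only differ for negative ones.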

Working With Strings

The two tasks which are very critical from the data analysis point of view are combining strings, and searching and replacing within strings.

Combining strings

Knowing how to combine strings, or a string with a number, can be of great help. I often use this to format or print my final output, and it is also useful during analysis. With these two uses in mind, the most widely used functions are paste() (a space is the default separator), paste0() (no separator) and sprintf().

# Combining two strings using paste() function
paste("Hanna", "Ask")
# Choosing different separator
paste("Hanna", "Ask", sep = "$")

# Using paste0() function
paste0("Hanna", "Ask")

You can also pass a collection of strings to the paste() function. Such a collection of similar elements is formally called a vector in R. More on this later.

# Creating a vector of string
strgVec <- c("Cat", "Dog", "Fish", "Cow")
# Combining the values with + as the separator
paste(strgVec, collapse = "+")

Fact

The sprintf() function is derived from C programming.

# Using sprintf() function to combine two strings
sprintf("My name is %s", "Hanna")

# Combining a string and an integer
sprintf("My name is %s and I am %d years old", "Hanna", 30)
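Both paste() and sprintf() also work on whole vectors, and sprintf() can control how numbers are formatted. A short sketch; the names and scores are made up.

```r
# paste0() and paste() recycle the string across the numbers
ids <- paste0("ID_", 1:3)                    # "ID_1" "ID_2" "ID_3"
labels <- paste("Item", 1:3, sep = "-")      # "Item-1" "Item-2" "Item-3"

# %.2f keeps two decimal places, %03d pads an integer with zeros
price <- sprintf("%.2f", 3.14159)            # "3.14"
padded <- sprintf("%03d", 7L)                # "007"

# sprintf() builds one string per pair of elements
report <- sprintf("%s scored %d", c("Hanna", "Ask"), c(90L, 85L))
report                                       # "Hanna scored 90" "Ask scored 85"
```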

Searching and Replacing strings

We will cover three very useful functions here: sub(), gsub() and grep().

# Defining a string
strng <- "You’re gonna need a bigger boat boat."

# Replacing boat with car
sub("boat", "car", strng)

# Replacing boat with car at all instances
gsub("boat", "car", strng)

# Returns the indices of the elements that match the pattern
# [car] is a character class: it matches any of the letters c, a or r
grep("[car]", letters)
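These functions become more useful once they are applied to a vector of strings. Here is a small sketch with a made-up animals vector; grepl() is a close relative of grep() that returns TRUE/FALSE instead of indices.

```r
animals <- c("cat", "dog", "catfish", "cow")

# Positions of the elements that contain "cat"
grep("cat", animals)                          # 1 3

# The matching elements themselves
grep("cat", animals, value = TRUE)            # "cat" "catfish"

# One TRUE/FALSE per element: does it start with "c"?
starts_with_c <- grepl("^c", animals)         # TRUE FALSE TRUE TRUE

# sub() changes the first match in each element, gsub() changes all of them
first_o <- sub("o", "0", animals)             # "cat" "d0g" "catfish" "c0w"
no_vowels <- gsub("[aeiou]", "", animals)     # "ct" "dg" "ctfsh" "cw"
```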

Data types and structure

We will work with six data types and four data structures in R.

Data Types

  1. Character - text values (strings). Example - “Hanna”, “Dog”, “Male”.
  2. Numeric - a number that can have a decimal part. Example - 10.4, 12.45.
  3. Integer - a whole number with no decimal part. Example - 109, 123, 34.
  4. Logical - the boolean values. Example - TRUE, FALSE.
  5. Factor - a qualitative variable which can be either nominal or ordinal. If it is ordinal then it is called an ordered factor. Example Nominal - “Male” and “Female”. Example Ordinal - “Good”, “Average” and “Best”.
  6. Complex - a number which has an imaginary part. Example - 3+2i.

A factor variable can also be represented by integer values (for example, 1 and 2 instead of “Male” and “Female”). So make a habit of going back to the metadata for variable information.
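The link between factors and integers can be seen directly in R. The sketch below assumes the order Average < Good < Best for the ordinal example; that order and the ratings vector are my own illustration.

```r
# An ordered factor with an explicit level order
ratings <- factor(c("Good", "Average", "Best", "Good"),
                  levels = c("Average", "Good", "Best"),
                  ordered = TRUE)

# The labels and the integer codes stored behind them
levels(ratings)              # "Average" "Good" "Best"
codes <- as.integer(ratings) # 2 1 3 2

# Ordered factors can be compared
below_best <- ratings < "Best"   # TRUE TRUE FALSE TRUE

# Counting how often each level occurs
table(ratings)
```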

Data Structure

Like any other programming language, the data structures in R are defined based on the dimensionality and homogeneity of the data types they can hold.

  1. Vector - They are also formally known as Atomic Vectors. A Vector can hold only one type of data and is one-dimensional.

  2. List - List is also one-dimensional structure however it can be used to save multiple data types.

  3. Matrix - Matrix is two-dimensional structure and can only save one data type.

  4. Data Frame - Data Frame is also two-dimensional structure but can save multiple types of data.
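The difference between the four structures can be sketched in a few lines. All object names here are hypothetical, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Vector: mixing types forces everything into one type
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)                        # "character"

# List: each element keeps its own type
mixed_list <- list(1, "a", TRUE)
element_classes <- sapply(mixed_list, class)   # "numeric" "character" "logical"

# Matrix: two dimensions, one type
number_matrix <- matrix(1:6, nrow = 2)
dim(number_matrix)                         # 2 3

# Data frame: two dimensions, one type per column
staff <- data.frame(name = c("Chris", "Robin"),
                    salary = c(2000, 4000),
                    stringsAsFactors = FALSE)
column_classes <- sapply(staff, class)     # name: "character", salary: "numeric"
```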

Vector manipulation

Now we will learn about some of the most basic data manipulation functions. Knowledge of these functions is an absolute must for anyone who wants to move forward and perform any kind of data analysis task.

Defining vectors

Here is a collection of all the functions which are used to define different data types and structures in R programming.

# Defining character vectors
characterVector <- c("Football", "Cricket", "Tennis", "Badminton")
# Defining numeric vectors
numericVector <- c(12.3, 23.4, 17.9, 89.7)
# Defining integer vectors
integerVector <- c(12L, 23L, 17L, 89L)
# Defining logical vectors
logicVector <- c(TRUE, FALSE, TRUE, TRUE)

# Defining factor - nominal
factorVector <- factor(characterVector)
# Defining factor - ordinal
orderedFactorVector <- factor(characterVector, ordered = TRUE)

Verifying and checking the class of the vectors

For a vector, checking the data structure type returns the type of data it holds. To check this we can use either the class() function or the typeof() function. The two can differ slightly: for numericVector, class() returns "numeric" while typeof() returns the internal storage type "double". There are other functions, but these are the common ones.

# Using class() function to check the object type
class(numericVector)

# Using typeof() function to check the internal storage type
typeof(numericVector)

If you just wish to confirm whether a vector is of a particular type, you can use the is.*() family of functions. These functions return TRUE if the vector belongs to the specified type and FALSE otherwise.

# Checking if the vector is character type 
is.character(numericVector)
# Checking if the vector is numeric type
is.numeric(numericVector)

Accessing the elements of a vector

The elements inside a vector can be accessed using their index. Unlike programming languages such as C and Python, indexing in R starts from 1.

# Extracting the third element
characterVector[3]
# Extracting multiple elements 
characterVector[c(1,3)]

# Dropping an element (returns a copy without the first element; the original is unchanged)
characterVector[-1]
# Dropping multiple elements
characterVector[-c(1,3)]

Positive and negative index values cannot be mixed in the same subscript.
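A few more indexing patterns, including what happens when positive and negative values are mixed, are sketched below. The sports vector simply repeats the example values so that the block runs on its own.

```r
sports <- c("Football", "Cricket", "Tennis", "Badminton")

# A range of positions
middle <- sports[2:3]                                 # "Cricket" "Tennis"

# A logical vector keeps the positions marked TRUE
picked <- sports[c(TRUE, FALSE, TRUE, FALSE)]         # "Football" "Tennis"

# A condition works the same way: names longer than 6 characters
long_names <- sports[nchar(sports) > 6]               # "Football" "Cricket" "Badminton"

# Asking for a position that does not exist gives NA
sports[10]                                            # NA

# Mixing positive and negative positions raises an error
mixed <- tryCatch(sports[c(-1, 2)],
                  error = function(e) "not allowed")
mixed                                                 # "not allowed"
```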

Replacing and adding values to a vector

To replace existing values in a vector, first select the value using square brackets [] and then simply assign a new value to it.

# Replacing football with basketball 
characterVector[1] <- "Basketball"
characterVector
# Replacing more than one value
numericVector[c(1,4)] <- c(55, 66)
numericVector

To add new values to a vector you can use either of the below approaches based upon your requirement.

Using Index

The numericVector contains 4 elements. We will add a new element to this vector by using its index. However, this method is only suited to adding a new element at the end of the vector.

# Adding element at the end.
numericVector[5] <- 77
numericVector

Using c() function

By using the c() function you can add a new element either at the beginning or at the end.

# Adding element at the end.
numericVector <- c(numericVector, 99)
numericVector

# Adding element at the beginning
numericVector <- c(99, numericVector)
numericVector

Using append() function

If you wish to add a new element at any given index in a vector then the append() function is the correct choice. The function takes three arguments: the vector, the value(s) to insert, and the index after which to insert them.

# Using append function to add value after 4th position
numericVector <- append(numericVector, # vector
                        99, # element to be inserted
                        4) # index after which to be inserted
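To make the three approaches easier to compare, here is a small self-contained sketch on a fresh vector. The name base_values and the numbers are arbitrary.

```r
base_values <- c(10, 20, 30)

# append() with after = 1 inserts behind the first element
in_middle <- append(base_values, 15, after = 1)        # 10 15 20 30

# after = 0 inserts at the very beginning
at_start <- append(base_values, 5, after = 0)          # 5 10 20 30

# Several values can be inserted at once
several <- append(base_values, c(21, 22), after = 2)   # 10 20 21 22 30

# Assigning by index past the end fills the gap with NA
with_gap <- base_values
with_gap[6] <- 60
with_gap                                               # 10 20 30 NA NA 60
```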

Getting the index of a particular element

# Printing the index of values which are equal to 99
which(numericVector == 99.0)
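A few related base R functions answer similar "where is it?" questions. The scores vector in this sketch is made up.

```r
scores <- c(55, 99, 23, 99, 77)

# All positions where the value is 99
which(scores == 99)        # 2 4

# Position of the largest and smallest value (first one in case of ties)
which.max(scores)          # 2
which.min(scores)          # 3

# Position of the first match only
match(77, scores)          # 5

# Just check whether a value is present
99 %in% scores             # TRUE
```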

Other important vector manipulation functions

Below is the list of functions which you will be using day in, day out for tasks related to data analysis or manipulation.

Sorting a vector

# Sorting in ascending order
numericVector[order(numericVector)]
# Sorting in descending order
numericVector[order(numericVector, decreasing = TRUE)]

Checking and removing missing values

# Adding NA value to a vector
numericVector[2] <- NA
# Checking if missing value is present
is.na(numericVector)

# Removing NA values using ! not
numericVector[!is.na(numericVector)]
# Removing NA values using na.omit() function 
na.omit(numericVector)

Subsetting the vector and getting length of a vector

# Getting elements greater than 30 (which() drops the NA added above)
numericVector <- numericVector[which(numericVector > 30)]

# Checking total number of elements in the new vector
length(numericVector)
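Base R also has a sort() function, and missing values affect sorting, summaries and subsetting in slightly different ways. The sketch below uses a made-up readings vector that contains one NA.

```r
readings <- c(55, NA, 17.9, 66, 77)

# sort() returns the sorted values and drops NA by default
ascending <- sort(readings)                        # 17.9 55 66 77
descending <- sort(readings, decreasing = TRUE)    # 77 66 55 17.9

# A single NA makes the mean NA unless it is removed first
mean(readings)                                     # NA
clean_mean <- mean(readings, na.rm = TRUE)         # 53.975

# A plain logical condition keeps the NA ...
with_na <- readings[readings > 30]                 # 55 NA 66 77

# ... while which() only returns the positions that are TRUE
without_na <- readings[which(readings > 30)]       # 55 66 77
```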

List manipulation

Defining list

To define a list we use the list() function. The function can be used to create a simple list or a named list.

# Defining list
example1 <- list(c(2,3,4), c("aa", "bb", "cc", "dd"), c(TRUE, TRUE))
example1

# Defining vectors
empName <- c("Chris", "Robin", "Matt")
empSalary <- c(2000, 4000, 6000)
bonusGiven <- c(TRUE, TRUE, FALSE)

# Defining list using vectors
listStruct <- list(empName, empSalary, bonusGiven)
listStruct
# Defining Named list
namedListStruct <- list("empName" = empName, 
                   "empSalary" = empSalary, 
                   "bonusGiven" = bonusGiven)
namedListStruct

Referencing values of a list

A value inside a list can be accessed using its index or by using its name (if it is a named list). The lists defined above are simply collections of vectors, which means we can apply all the data manipulations we have just learned in the Vector Manipulation section.

Extracting values from a list

# Extracting the vector of emp names from unnamed list
listStruct[[1]]

# Extracting the vector of emp names from named list
namedListStruct$empName

# Extracting Robin from the emp names
listStruct[[1]][2]

# Extracting Robin from the named list
namedListStruct$empName[2]

Replacing values in a list

# Replace salary for Robin by 8000
listStruct[[2]][2] <- 8000
listStruct

# or in named list
namedListStruct$empSalary[2] <- 8000
namedListStruct
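
Unlisting the list

The unlist() function can be used to flatten a list into a single vector. Because a vector holds only one data type, the values are converted to a common type (character, in this example).

unlist(listStruct)

Checking the class of each vector in a list
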
lapply(listStruct, class)

A list can consist of multiple levels, and one can also create a nested list. Also, lists can be used to bundle objects of different classes and lengths.
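A nested list and a few common operations on it might look like the sketch below. The employee record and its field names are invented for illustration.

```r
# A named list that holds different types and another list
employee <- list(
  name = "Chris",
  salary = 2000,
  skills = c("R", "SQL"),
  address = list(city = "Auckland", country = "New Zealand")
)

# Reaching into the nested list
employee$address$city                  # "Auckland"
employee[["skills"]][2]                # "SQL"

# Single brackets return a list, double brackets return the element itself
class(employee["name"])                # "list"
class(employee[["name"]])              # "character"

# Adding a new element, then removing one by assigning NULL
employee$bonus <- TRUE
employee$salary <- NULL
names(employee)                        # "name" "skills" "address" "bonus"
```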

Matrix Manipulation

Defining Matrix

As a matrix is a two-dimensional structure, we need to mention the number of rows and the number of columns while defining it.

# Defining a matrix, filled row by row
matStruct <- matrix(integerVector, 
                    nrow = 2, ncol = 2,
                    byrow = TRUE)

# Defining a matrix, filled column by column
matStruct1 <- matrix(integerVector, 
                    nrow = 2, ncol = 2,
                    byrow = FALSE)

In the below code snippet we are sharing some functions which are good to know and will help you with your data science work.

# naming columns 
colnames(matStruct) <- c("col1", "col2")
# naming rows 
rownames(matStruct) <- c("row1", "row2")
# Getting the dimension of the matrix
dim(matStruct)
# Getting the count of the rows
nrow(matStruct)
# Getting the count of the columns
ncol(matStruct)
# Accessing second column values
matStruct[, 2]
# Accessing first row values
matStruct[1, ]
# Combining two matrices by columns
cbind(matStruct, matStruct1)
# Combining two matrices by rows - appending
rbind(matStruct, matStruct1)
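A few more matrix operations that tend to come up early are sketched below on a small made-up matrix. Note the difference between * (element by element) and %*% (matrix multiplication).

```r
# A 2 x 2 matrix filled row by row: first row 1 2, second row 3 4
small_mat <- matrix(1:4, nrow = 2, byrow = TRUE)

# Transpose swaps rows and columns
transposed <- t(small_mat)             # rows: 1 3 / 2 4

# Element-wise product versus matrix product
elementwise <- small_mat * small_mat   # rows: 1 4 / 9 16
product <- small_mat %*% small_mat     # rows: 7 10 / 15 22

# apply() runs a function over rows (1) or columns (2)
row_totals <- apply(small_mat, 1, sum) # 3 7
col_totals <- apply(small_mat, 2, sum) # 4 6

# A single cell: second row, first column
small_mat[2, 1]                        # 3
```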

Closing Note

In this chapter, we looked at some of the very basic concepts of R Programming. We spent some time looking at things like different operators, data types, structures and some must-know functions which will enable you to manipulate these structures. We hope you got some sense of how powerful this software can be. In the next chapter, we’ll look at an extensive list of data manipulation tasks related to data frames.

Last updated on 4 Jan 2019 / Published on 17 Oct 2017