R Statistics Blog

Data Science From R Programmers Point Of View

Splitting Data

To build a model that generalizes well to new, unseen data, you must ensure that your test set serves as a proxy for the data the model will actually encounter, i.e., that it is representative of new data.

Things You Will Master

  1. Overview
  2. Six functions in R for splitting data into train and test

Overview

After preprocessing, the first step is to split the data into training and test sets. A common choice is roughly 70% for training and 30% for testing. The training set is what we use to learn the relationship between the independent variables and the target variable. Depending on the algorithm, this relationship is stored either as a mathematical function or as a set of rules.

We then apply what was learned from the training set to the test set. The results are considered satisfactory if the model performs comparably on both sets.
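As a rough illustration of comparing training and test results, the sketch below fits a linear model on a 70% sample of mtcars and computes the root mean squared error on both sets. The choice of model (mpg predicted from wt and hp), the metric, and the helper name rmse are assumptions made for this example only.

```r
set.seed(222)

# 70/30 split of the built-in mtcars data
n_train <- round(nrow(mtcars) * 0.70)
index <- sample(seq_len(nrow(mtcars)), size = n_train)
train <- mtcars[index, ]
test <- mtcars[-index, ]

# Illustrative model: predict mpg from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error (hypothetical helper, not from any package)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_test <- rmse(test$mpg, predict(fit, newdata = test))

# Print both errors side by side
c(train = rmse_train, test = rmse_test)
```

A test error far above the training error would suggest that the model has overfit the training data.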

When choosing the split, we need to make sure that the test set meets the following two conditions:

  1. Test data should be large enough to yield statistically meaningful results.
  2. We need to pick a test set such that it represents the data set as a whole. This means that the characteristics of testing and training datasets should be similar.
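One informal way to check the second condition is to compare simple summaries of the two sets. The sketch below assumes a plain random 70/30 split of iris and compares class proportions and one numeric feature; which summaries count as "similar enough" is a judgment call, not something this code decides.

```r
set.seed(222)

index <- sample(seq_len(nrow(iris)), size = round(nrow(iris) * 0.70))
train <- iris[index, ]
test <- iris[-index, ]

# Share of each species in the full data, the training set, and the test set
props <- rbind(
  full = prop.table(table(iris$Species)),
  train = prop.table(table(train$Species)),
  test = prop.table(table(test$Species))
)
round(props, 2)

# Compare the distribution of a numeric feature across the two sets
summary(train$Sepal.Length)
summary(test$Sepal.Length)
```

If the proportions differ noticeably, a stratified split is usually the safer choice.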

Six functions in R for splitting data into train and test

1. Using sample() function

set.seed(222)

sample_size <- round(nrow(mtcars) * 0.70) # number of rows in 70% of the data
index <- sample(seq_len(nrow(mtcars)), size = sample_size)

train <- mtcars[index, ]
test <- mtcars[-index, ]
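The same steps can be wrapped in a small reusable function. This is only a sketch: split_data and its arguments are hypothetical names invented for this example, not part of base R or any package.

```r
# Hypothetical helper: split a data frame into train and test by proportion
split_data <- function(df, p = 0.70, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- sample(seq_len(nrow(df)), size = round(nrow(df) * p))
  list(
    train = df[index, , drop = FALSE],
    test = df[-index, , drop = FALSE]
  )
}

parts <- split_data(mtcars, p = 0.70, seed = 222)

nrow(parts$train)
nrow(parts$test)

# No row should appear in both sets
length(intersect(rownames(parts$train), rownames(parts$test)))
```

Note that the negative-index idiom assumes at least one row is sampled: with an empty index, df[-index, ] returns zero rows rather than the whole data frame.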

2. Using sample.int() function

set.seed(222)

# Named index rather than sample to avoid masking the sample() function
index <- sample.int(n = nrow(CO2),
                    size = floor(0.70 * nrow(CO2)), # selecting 70% of the data
                    replace = FALSE)

train <- CO2[index, ]
test  <- CO2[-index, ]

3. Using sample_n() function from {dplyr} package

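library(dplyr)

sample_n(mtcars, 3)                  # 3 random rows
sample_n(mtcars, 10, replace = TRUE) # 10 random rows, with replacement

# Using the above function to create a 70-30 split into train and test
set.seed(222)
sample_size <- round(nrow(iris) * 0.70) # number of rows in 70% of the data

# Add an explicit row id: dplyr verbs may not keep the original row names,
# so recovering the sampled rows from rownames(train) is unreliable
iris_id <- mutate(iris, id = row_number())
train <- sample_n(iris_id, sample_size)
test <- anti_join(iris_id, train, by = "id") # rows not selected for training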

4. Using sample_frac() function from {dplyr} package

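library(dplyr)

sample_frac(mtcars, 0.3)                 # 30% of the rows
sample_frac(mtcars, 0.3, replace = TRUE) # 30% of the rows, with replacement

# Using the above function to create a 70-30 split into train and test
set.seed(222)
# Add an explicit row id: dplyr verbs may not keep the original row names
iris_id <- mutate(iris, id = row_number())
train <- sample_frac(iris_id, 0.7)
test <- anti_join(iris_id, train, by = "id") # rows not selected for training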

5. Using createDataPartition() function from {caret} package

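library(caret)

set.seed(222) # for a reproducible partition

# Sampling is done within each level of Species, so class proportions are kept
index <- createDataPartition(iris$Species, p = 0.70, list = FALSE)
train <- iris[index, ]
test <- iris[-index, ]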
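To make the idea of sampling within each class concrete, here is a base R sketch of a stratified 70/30 split. It is not how caret implements createDataPartition(); it is an assumed, simplified equivalent for a factor outcome.

```r
set.seed(222)

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 70% of the rows within each class
index <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(length(rows) * 0.70))]
}), use.names = FALSE)

train <- iris[index, ]
test <- iris[-index, ]

# Class counts in each set
table(train$Species)
table(test$Species)
```

Because iris has 50 rows per species, each class contributes 35 rows to the training set and 15 to the test set.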

6. Using sample.split() function from {caTools} package

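library(caTools)

set.seed(101)
# sample.split() returns a logical vector: TRUE marks rows for the training set.
# Note that this example uses a 75-25 split.
split <- sample.split(iris$Species, SplitRatio = 0.75)
train <- subset(iris, split == TRUE)
test  <- subset(iris, split == FALSE)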
Last updated on 12 Nov 2019
Published on 17 Oct 2017