Tipps & Tricks for data analysis in R

This script serves a teaching material for the workshop “Tipps & Tricks for data analysis in R” presented at the DQBM retreat 2022.

It slightly builds upon the Introduction to Data Analysis sessions that were held in March 2020.

Some of the concepts are taken from Advanced R, an excellent book to dive deeper into how R works.

The session will address very different aspects of data analysis and programming using the statistical programming language R. To follow the session, please install R and RStudio and install the following packages in R:

install.packages("BiocManager")
BiocManager::install("R6", "BiocParallel", "DT", "vroom", "data.table", "bench", "forcats", "stringr", "dplyr", "purrr", "readr", "tidyr", "tibble", "ggplot2", "tidyverse")

To follow the session live, please clone the repository locally

git clone https://github.com/BodenmillerGroup/IntroDataAnalysis.git
cd IntroDataAnalysis/R/DQBM_retreat_2022

and open the file tipps_tricks.Rmd.

Coding style and flow

In the first section, we will discuss very basic coding practices in R mainly focusing on the assignment operator, pipes and function calls. For a full overview on coding stlye recommendations please refer to the tidyverse style guide.

Assignment operator

The assignment operator <- and -> in R is the preferred way of assigning variables. It alows left sided and right sided assignment while = is always interpreted as <- outside function calls. The = is reserved for parameter setting within function calls.

Calling <- within a function call allows assignment of a previously undefined variable.

y = 1:10
y

##  [1]  1  2  3  4  5  6  7  8  9 10

1:10 = y

## Error in 1:10 = y: target of assignment expands to non-language object

z <- 1:10
z

##  [1]  1  2  3  4  5  6  7  8  9 10

1:10 -> z
z

##  [1]  1  2  3  4  5  6  7  8  9 10

mean(x = 1:10)

## [1] 5.5

## Error in eval(expr, envir, enclos): object 'x' not found

mean(x <- 1:10)

## [1] 5.5

##  [1]  1  2  3  4  5  6  7  8  9 10

# Incorrect assignment within function call
system.time(x = lapply(1:10, function(x) {Sys.sleep(1); return(x)}))

## Error in system.time(x = lapply(1:10, function(x) {: unused argument (x = lapply(1:10, function(x) {
##     Sys.sleep(1)
##     return(x)
## }))

# Assignment to x within function call
system.time(x <- lapply(1:10, function(x) {Sys.sleep(1); return(x)}))

##    user  system elapsed 
##   0.017   0.002  10.038

Pipes

Using pipes in R improves readability of you code. In the next code chunk we want to find the date at which the maximum of accumulated Covid-19 cases were detected in Switzerland.

For this, base R by now provides the pipe operator |>.

covid <- read.csv("../../Data/covid19.csv")

str(covid)

## 'data.frame':    231264 obs. of  8 variables:
##  $ Country.Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Province.State: chr  "" "" "" "" ...
##  $ Lat           : num  33.9 33.9 33.9 33.9 33.9 ...
##  $ Long          : num  67.7 67.7 67.7 67.7 67.7 ...
##  $ Date          : chr  "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" ...
##  $ Confirmed     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths        : int  0 0 0 0 0 0 0 0 0 0 ...

# Bad
covid[covid$Country.Region == "Switzerland" & covid$Confirmed == max(covid[covid$Country.Region == "Switzerland", "Confirmed"]),"Date"]

## [1] "2022-06-14" "2022-06-15"

# Good
covid |> 
    subset(Country.Region == "Switzerland") |> 
    subset(Confirmed == max(Confirmed)) |> 
    getElement("Date")

## [1] "2022-06-14" "2022-06-15"

The tidyverse package provides all base packages for tidy data handling in R. The pipe operator %>% is provided by the magrittr package.

library(tidyverse)

covid %>%
    filter(Country.Region == "Switzerland") %>%
    filter(Confirmed == max(Confirmed)) %>%
    select("Date")

##         Date
## 1 2022-06-14
## 2 2022-06-15

Even though this is not recommended, piping naturally fits to right sided assignments:

# Bad but fun
covid %>%
    filter(Country.Region == "Switzerland") %>%
    filter(Confirmed == max(Confirmed)) %>%
    select("Date") -> max_conf_date
max_conf_date

##         Date
## 1 2022-06-14
## 2 2022-06-15

# Good
max_conf_date <- covid %>%
    filter(Country.Region == "Switzerland") %>%
    filter(Confirmed == max(Confirmed)) %>%
    select("Date")
max_conf_date

##         Date
## 1 2022-06-14
## 2 2022-06-15

One of the key features of piping is the direct use with ggplot2 for plotting.

covid %>%
    filter(Country.Region == "Switzerland") %>%
    ggplot() + geom_line(aes(Date, Confirmed, color = Country.Region, group = 1)) +
    theme_classic(base_size = 15) +
    theme(axis.text.x = element_text(angle = 90))

Efficient coding

The next section focuses on how to perform computations efficiently in (base) R. One of the best packages for efficient computations in data.table but we won’t have enough time to go through it.

Reading in data

Reading in large datasets in R can be a pain and the base R fiunctions are not optimized for this.

To test a number of file reader functions, we will first generate a “large” dataset on the fly and write it to disk:

matrix(rnorm(n = 10000000), ncol = 10) %>%
    as.data.frame() %>%
    write_csv(file = "../../Data/large_test_data.csv")

The resulting .csv file contains 1 million rows and 10 columns and has a size of 200mb on disk.

In the next code chunk we will test the following file reader functions:

read.csv: base R function for reading in .csv files
readr::read_csv: standard tidy R function
data.table::fread: main reader function of the data.table package
vroom::vroom: newer library playing along with tidy R

We will now use the bench library to perform some benchmarks on which reader function is the best.

library(bench)
library(data.table)
library(vroom)
library(DT)

bench::mark(
    cur_data <- read.csv("../../Data/large_test_data.csv"),
    cur_data <- readr::read_csv("../../Data/large_test_data.csv"),
    cur_data <- data.table::fread("../../Data/large_test_data.csv"),
    cur_data <- vroom::vroom("../../Data/large_test_data.csv"),
    check = FALSE
) %>% select(c("min", "median", "mem_alloc", "total_time")) %>%
    DT::datatable()

We can see that the base R reader function has by far the worst performance. The readr::read_csv function has already a decent performance and should be used whenever possible - especially when you plan on using the tidyverse packages, which you should ;).

The vroom::vroom function in this case does not seem to be much faster than the other ones bu memory allocation is drastically reduced. You should be using this one when reading in gigabytes and gigabytes of data.

Loops and apply

The R programming language performs best when using vectorized operations. Only in certain cases it is needed to use a for loop to iterated over objects.

The following code chunk highlights the worst possible way of calculating the average value per row of a matrix:

cur_data <- as.matrix(cur_data[1:100000,])

out <- vector()

# Bad
for (i in 1:nrow(cur_data)) {
    cur_mean <- mean(cur_data[i,])
    out <- c(out, cur_mean)
}

head(out, 100)

##   [1] -0.074207119  0.233431346  0.104103532 -0.518205623 -0.301064250
##   [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
##  [11] -0.086633220 -0.032881760  0.138429943  0.067167798 -0.136472178
##  [16]  0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
##  [21]  0.014438649  0.018794623  0.856228533  0.151950344  0.427907691
##  [26]  0.074230011 -0.005357597  0.176879299 -0.012597546 -0.414030659
##  [31] -0.374366404  0.276757738 -0.302159611 -0.277743381  0.406273566
##  [36]  0.410958393 -0.152828638  0.066117703 -0.392008452  0.434231416
##  [41] -0.083261396 -0.200989223  0.067177543  0.530071588  0.387739119
##  [46] -0.718724171 -0.136835433  0.063941132  0.195991346  0.111075637
##  [51] -0.280771690  0.127466606 -0.162754149  0.632138537 -0.372494478
##  [56]  0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
##  [61]  0.138398738 -0.150945784  0.227089967  0.575571188 -0.141800980
##  [66]  0.106155740 -0.198169090 -0.376608391  0.097445763  0.058788789
##  [71] -0.461580999 -0.213832622  0.077024249  0.622917701  0.015589750
##  [76]  0.431167475 -0.068877270 -0.165183311  0.152648936  0.108396141
##  [81]  0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
##  [86] -0.162718309  0.201106567 -0.342937053 -0.161730928 -0.274512559
##  [91] -0.015296974  0.369567604 -0.584568395  0.253261623 -0.071557989
##  [96]  0.108757421  0.145894200 -0.290742230  0.043156695 -0.440033909

Here, we used a for loop to iterate through all rows. In addition, we overwrite the full vector out in each iteration.

If you want/need to use a for loop it is better to first create the output object and only replace individual entries:

out <- vector(mode = "numeric", length = nrow(cur_data))

# Slightly better if needed
for (i in 1:nrow(cur_data)) {
    cur_mean <- mean(cur_data[i,])
    out[i] <- cur_mean
}

In any case, it is recommended to use functions of the apply family to perform iterated operations. To oerate on arrays and matrices, base R provides the apply function:

# Much better
out <- apply(cur_data, MARGIN = 1, FUN = mean)
head(out, 100)

##   [1] -0.074207119  0.233431346  0.104103532 -0.518205623 -0.301064250
##   [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
##  [11] -0.086633220 -0.032881760  0.138429943  0.067167798 -0.136472178
##  [16]  0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
##  [21]  0.014438649  0.018794623  0.856228533  0.151950344  0.427907691
##  [26]  0.074230011 -0.005357597  0.176879299 -0.012597546 -0.414030659
##  [31] -0.374366404  0.276757738 -0.302159611 -0.277743381  0.406273566
##  [36]  0.410958393 -0.152828638  0.066117703 -0.392008452  0.434231416
##  [41] -0.083261396 -0.200989223  0.067177543  0.530071588  0.387739119
##  [46] -0.718724171 -0.136835433  0.063941132  0.195991346  0.111075637
##  [51] -0.280771690  0.127466606 -0.162754149  0.632138537 -0.372494478
##  [56]  0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
##  [61]  0.138398738 -0.150945784  0.227089967  0.575571188 -0.141800980
##  [66]  0.106155740 -0.198169090 -0.376608391  0.097445763  0.058788789
##  [71] -0.461580999 -0.213832622  0.077024249  0.622917701  0.015589750
##  [76]  0.431167475 -0.068877270 -0.165183311  0.152648936  0.108396141
##  [81]  0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
##  [86] -0.162718309  0.201106567 -0.342937053 -0.161730928 -0.274512559
##  [91] -0.015296974  0.369567604 -0.584568395  0.253261623 -0.071557989
##  [96]  0.108757421  0.145894200 -0.290742230  0.043156695 -0.440033909

Here, MARGIN can be 1 to iterate over rows, 2 to iterate over columns or even 3 or higher to iterate over higher dimensions when working with arrays.

For most easy operations, base R provides easy vecorized functions, such as rowMeans:

# Best
out <- rowMeans(cur_data)
head(out, 100)

##   [1] -0.074207119  0.233431346  0.104103532 -0.518205623 -0.301064250
##   [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
##  [11] -0.086633220 -0.032881760  0.138429943  0.067167798 -0.136472178
##  [16]  0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
##  [21]  0.014438649  0.018794623  0.856228533  0.151950344  0.427907691
##  [26]  0.074230011 -0.005357597  0.176879299 -0.012597546 -0.414030659
##  [31] -0.374366404  0.276757738 -0.302159611 -0.277743381  0.406273566
##  [36]  0.410958393 -0.152828638  0.066117703 -0.392008452  0.434231416
##  [41] -0.083261396 -0.200989223  0.067177543  0.530071588  0.387739119
##  [46] -0.718724171 -0.136835433  0.063941132  0.195991346  0.111075637
##  [51] -0.280771690  0.127466606 -0.162754149  0.632138537 -0.372494478
##  [56]  0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
##  [61]  0.138398738 -0.150945784  0.227089967  0.575571188 -0.141800980
##  [66]  0.106155740 -0.198169090 -0.376608391  0.097445763  0.058788789
##  [71] -0.461580999 -0.213832622  0.077024249  0.622917701  0.015589750
##  [76]  0.431167475 -0.068877270 -0.165183311  0.152648936  0.108396141
##  [81]  0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
##  [86] -0.162718309  0.201106567 -0.342937053 -0.161730928 -0.274512559
##  [91] -0.015296974  0.369567604 -0.584568395  0.253261623 -0.071557989
##  [96]  0.108757421  0.145894200 -0.290742230  0.043156695 -0.440033909

In the next sections we will go through the different functions of the apply family:

apply: as seen above used on matrix-like objects
lapply: used on lists/vectors, returns a list
sapply: used on lists/vectors, returns a vector
vapply: used on lists/vectors, returns a vector with checking
tapply: applies a function per grouping level
mapply: combines multiple lists

In the following example we will generate a list which contains 10 entries, each of which is a numeric vector of length 10 which can contain NAs.

We will then compute the mean per entry while ignoring NAs. Finally, we will convert the list to a vector in different ways.

cur_list <- lapply(1:10, function(x){
    c(rnorm(n = 10), rep(NA, 10))[sample(1:20, 10)]
})

out <- lapply(cur_list, mean, na.rm = TRUE)
out

## [[1]]
## [1] -0.263992
## 
## [[2]]
## [1] 0.6969153
## 
## [[3]]
## [1] -0.5254196
## 
## [[4]]
## [1] 0.6592913
## 
## [[5]]
## [1] 0.5487733
## 
## [[6]]
## [1] -0.06734899
## 
## [[7]]
## [1] -0.2675872
## 
## [[8]]
## [1] -0.2385669
## 
## [[9]]
## [1] -0.2871847
## 
## [[10]]
## [1] 0.1435612

do.call(c, out)

##  [1] -0.26399195  0.69691535 -0.52541963  0.65929127  0.54877325 -0.06734899
##  [7] -0.26758724 -0.23856691 -0.28718470  0.14356122

as.numeric(out)

##  [1] -0.26399195  0.69691535 -0.52541963  0.65929127  0.54877325 -0.06734899
##  [7] -0.26758724 -0.23856691 -0.28718470  0.14356122

as(out, "numeric")

##  [1] -0.26399195  0.69691535 -0.52541963  0.65929127  0.54877325 -0.06734899
##  [7] -0.26758724 -0.23856691 -0.28718470  0.14356122

In the next example, we can see that most operators in R are also defined as functions. Here, we can first access the 3rd element of each entry of a list and even perform assignments to individual entries. Here we see again that it’s important to understand what the assignment operator is. All replacement functions in R are defined using the assignment operator.

x <- list(entr1 = 1:10, entr2 = 20:30)

lapply(x, `[[`, 3)

## $entr1
## [1] 3
## 
## $entr2
## [1] 22

lapply(x, `[[<-`, 3, 120)

## $entr1
##  [1]   1   2 120   4   5   6   7   8   9  10
## 
## $entr2
##  [1]  20  21 120  23  24  25  26  27  28  29  30

lapply(x, `[[=`, 3, 120)

## Error in match.fun(FUN): object '[[=' not found

The next function we will look at is sapply which operates similarly as lapply but returnes a “simplified” version of the output (e.g. directly a vector).

sapply(cur_list, mean, na.rm = TRUE)

##  [1] -0.26399195  0.69691535 -0.52541963  0.65929127  0.54877325 -0.06734899
##  [7] -0.26758724 -0.23856691 -0.28718470  0.14356122

However, it is not always trivial to simplify the output since it’s not always clear what the output is. When developing computational methods it’s always safer to use vapply which checks the type and length out the output.

# A single numeric should be returned
vapply(cur_list, mean, FUN.VALUE = 0, na.rm = TRUE)

##  [1] -0.26399195  0.69691535 -0.52541963  0.65929127  0.54877325 -0.06734899
##  [7] -0.26758724 -0.23856691 -0.28718470  0.14356122

# Two objects are returned
vapply(cur_list, 
       function(x){return(c("one", "two"))}, 
       FUN.VALUE = c("test1", "test2"))

##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
## [1,] "one" "one" "one" "one" "one" "one" "one" "one" "one" "one"
## [2,] "two" "two" "two" "two" "two" "two" "two" "two" "two" "two"

# Doesn't work
vapply(cur_list, mean, FUN.VALUE = "test")

## Error in vapply(cur_list, mean, FUN.VALUE = "test"): values must be type 'character',
##  but FUN(X[[1]]) result is type 'double'

The tapply function performs operations on groups of an input. In the following case, we will calculate the maximum of confirmed covid cases per country.

Similar operations can be done with the base R aggregate function and with tidy R.

# tapply
head(tapply(covid$Confirmed, covid$Country.Region, max), 10)

##         Afghanistan             Albania             Algeria             Andorra 
##              181236              276821              265952               43449 
##              Angola          Antarctica Antigua and Barbuda           Argentina 
##               99761                  11                8537             9313453 
##             Armenia           Australia 
##              423044             2688494

# aggregate
head(aggregate(covid$Confirmed, by = list(country = covid$Country.Region), max), 10)

##                country       x
## 1          Afghanistan  181236
## 2              Albania  276821
## 3              Algeria  265952
## 4              Andorra   43449
## 5               Angola   99761
## 6           Antarctica      11
## 7  Antigua and Barbuda    8537
## 8            Argentina 9313453
## 9              Armenia  423044
## 10           Australia 2688494

# tidy R
covid %>%
    group_by(Country.Region) %>%
    summarize(max_confirmed = max(Confirmed)) %>%
    head(10)

## # A tibble: 10 × 2
##    Country.Region      max_confirmed
##    <chr>                       <int>
##  1 Afghanistan                181236
##  2 Albania                    276821
##  3 Algeria                    265952
##  4 Andorra                     43449
##  5 Angola                      99761
##  6 Antarctica                     11
##  7 Antigua and Barbuda          8537
##  8 Argentina                 9313453
##  9 Armenia                    423044
## 10 Australia                 2688494

Finally, the mapply combines a number of lists and performs operations on each paired entry:

cur_list_2 <- as.list(1:10)
mapply(function(x, y){
    return(mean(x, na.rm = TRUE) * y)
},
cur_list, cur_list_2)

##  [1] -0.263992  1.393831 -1.576259  2.637165  2.743866 -0.404094 -1.873111
##  [8] -1.908535 -2.584662  1.435612

mapply(`*`, cur_list_2, cur_list_2)

##  [1]   1   4   9  16  25  36  49  64  81 100

Parallelisation

One strength of R is to be able to perform operations in a parallelized fashion. For this a number of packages have been developed including parallel, foreach and future.

Personally, I prefer the BiocParallel package as it provides a user-friendly way of handling parallelised operations by simply extending the lapply function.

In the next code chunk, we will test how the bplapply function works. The only additional parameter is BPPARAM which in the following case is set to bpparam(). The bbparam() function registers a parallelisation backend which is most suited to your operating system.

library(BiocParallel)

bench::mark(
    cur_out <- lapply(1:10, function(x){Sys.sleep(5); return(x)}),
    cur_out <- bplapply(1:10, function(x){Sys.sleep(5); return(x)}, BPPARAM = bpparam()),
    memory = FALSE
) %>% select(c("min", "median", "mem_alloc", "total_time")) %>%
    DT::datatable()

Object-oriented programming

In the next section we will learn how obbject oriented programming (OOP) can be done in R. Even without knowing what that means you have used OOP whenever you code in R.

OOP in R consists of the following three parts:

object: they store data in a pre-defined form
generic function: functions that perform a certain task
methods: apply a generic function on a certain object

Understanding OOP in R is mainly relevant for methods development but also useful to understand where differences in the way you analyse data come from.

We will first start with observing the difference between a base type and a class attribute. Every object in R has a base type while only OOP onjects have a class attribute.

attr(1:10, "class")

## NULL

typeof(1:10)

## [1] "integer"

# mIsleading
class(1:10)

## [1] "integer"

is(1:10, "integer")

## [1] TRUE

attr(matrix(1:10), "class")

## NULL

typeof(matrix(1:10))

## [1] "integer"

class(matrix(1:10))

## [1] "matrix" "array"

is(matrix(1:10), "integer")

## [1] FALSE

# Class inheritance
is(matrix(1:10), "matrix")

## [1] TRUE

is(matrix(1:10), "array")

## [1] TRUE

We will now directly move to defining OOP objects, generic functions and methods in different ways. Please refer to Object Oriented Programming for full information.

Most objects that you work with when analyzing data in base R are of the “S3” class. This is a loosely defined construct for which generic functions and methods exist and allow the user and developer lots of flexibility.

The Bioconductor project mainly works with “S4” classes which are a lot more formally defined and allow for easier interoperability.

The “R6” class provides a framework which comes closest to real OOP as known from e.g. python.

In the following sections, we will go through different examples and how to construct objects, generic functions and methods.

S3 class

S3 class objects are most often used in base R functions. For example, the prcomp function to compute a PCA returns an object of class prcomp. Each slot of a S3 class object can be accessed via $.

cur_pca <- prcomp(iris[,-5])

class(cur_pca)

## [1] "prcomp"

# Access slots via "$"
head(cur_pca$sdev)

## [1] 2.0562689 0.4926162 0.2796596 0.1543862

Also base objects such as a factor or a data.frame are S3 objects which can be tested. In base R there are generic functions defined which can be applied to an S3 object. Here an example of such a generic function is print. Internally, print will find the matching method to be called on the provided object.

f <- factor(c("a", "b", "c"))
df <- data.frame(a = 1:10, b = letters[1:10])
is.object(f)

## [1] TRUE

is.object(df)

## [1] TRUE

# generic functions - always use these
isS3stdGeneric("print")

## print 
##  TRUE

print(f)

## [1] a b c
## Levels: a b c

print(df)

##     a b
## 1   1 a
## 2   2 b
## 3   3 c
## 4   4 d
## 5   5 e
## 6   6 f
## 7   7 g
## 8   8 h
## 9   9 i
## 10 10 j

# methods - never use these
isS3method("print.factor")

## [1] TRUE

print.factor(f)

## [1] a b c
## Levels: a b c

print.factor(df)

##                                                   a 
##                                                1:10 
##                                                   b 
## c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j") 
## attr(,"row.names")
##  [1]  1  2  3  4  5  6  7  8  9 10
## Levels:

## Warning in print.factor(df): factor levels must be "character"

print.data.frame(df)

##     a b
## 1   1 a
## 2   2 b
## 3   3 c
## 4   4 d
## 5   5 e
## 6   6 f
## 7   7 g
## 8   8 h
## 9   9 i
## 10 10 j

print.data.frame(f)

## NULL
## <0 rows> (or 0-length row.names)

There is never the need to directly call a method on an object but rather use the generic function.

In the next few chunks we will learn how to build an object of a certain class, how to construct a generic function and how to construct a method.

We will first construct an S3 class object.

# Create, then set class
x <- 1
class(x) <- "my_class"

class(x)

## [1] "my_class"

inherits(x, "my_class")

## [1] TRUE

inherits(x, "your_class")

## [1] FALSE

Next, we can define a generic function:

my_generic <- function(x) UseMethod("my_generic", x)

Finally, we will link a method to the object class and the generic function.

my_generic.my_class <- function(x) x + 10

my_generic(x)

## [1] 11
## attr(,"class")
## [1] "my_class"

my_generic(20)

## Error in UseMethod("my_generic", x): no applicable method for 'my_generic' applied to an object of class "c('double', 'numeric')"

S4 class

The S4 class system is more formally defined, One would first need to define the structure of the class. In the following example, we will first define a class “Person” which contains slots to store the name and the character of the person.

We can then use the new function to create a new object of class “Person”. Slots in the object can be accessed via @ or the slot function.

setClass("Person", 
  slots = c(
    name = "character", 
    age = "numeric"
  )
)

john <- new("Person", name = "John Smith", age = as.numeric(NA))

is(john)

## [1] "Person"

isS4(john)

## [1] TRUE

# Access slots via "@"
john@name

## [1] "John Smith"

slot(john, "age")

## [1] NA

Next, we can define a new generic function calle age for getting the age of a person and age<- for setting the age of a person.

setGeneric("age", function(x) standardGeneric("age"))

## [1] "age"

setGeneric("age<-", function(x, value) standardGeneric("age<-"))

## [1] "age<-"

Finally, we can construct the methods linked to the age and age<- generic functions and the Person object:

setMethod("age", "Person", function(x) x@age)
setMethod("age<-", "Person", function(x, value) {
  x@age <- value
  x
})

age(john) <- 50
age(john)

## [1] 50

R6 class

Just for demonstration purposes, we will also discuss the R6 class. Here, within the class definition one can directly specify class methods without the need to constructing generic functions.

library(R6)

# Class definition
Person <- R6Class("Person", list(
  name = NULL,
  age = NA,
  # Validity checks on the class construction
  initialize = function(name, age = NA) {
    stopifnot(is.character(name), length(name) == 1)
    stopifnot(is.numeric(age), length(age) == 1)
    
    self$name <- name
    self$age <- age
  },
  # Print method
  print = function(...) {
    cat("Person: \n")
    cat("  Name: ", self$name, "\n", sep = "")
    cat("  Age:  ", self$age, "\n", sep = "")
    invisible(self)
  },
  # Another method
  add_years = function(x) {
    self$age <- self$age + x 
    invisible(self)
  }
))

nils <- Person$new("Nils", age = "thirty-three")

## Error in initialize(...): is.numeric(age) is not TRUE

nils <- Person$new("Nils", age = 33)

nils$print()

## Person: 
##   Name: Nils
##   Age:  33

# In place operation
nils$add_years(10)
nils$print()

## Person: 
##   Name: Nils
##   Age:  43

nils$age

## [1] 43

nils$
    add_years(10)$
    age

## [1] 53

Common pitfalls

In the last chapter, I wanted to highlight a few common pitfalls when analysing data in R.

Vector recycling

For some reason, vector recycling is possible in R. Issues commonly arise when using a logical vector for subsetting. In the following example a vector of length 3 is recycled when subsetting a much larger data.frame without throwing an error. Please make sure that the logical vector always has the same length of the object that you want to subset.

dim(iris)

## [1] 150   5

cur_vector <- c(TRUE, FALSE, TRUE)
dim(iris[cur_vector,])

## [1] 100   5

cur_vector <- iris$Species == "virginica"
dim(iris[cur_vector,])

## [1] 50  5

It is important to understand that the .drop or drop argument exists in R. When using tidy R most often missing combinations of factor levels are dropped by default:

iris$subspecies <- factor(rep(letters[1:5], each = 30))

iris %>% count(Species, subspecies)

##      Species subspecies  n
## 1     setosa          a 30
## 2     setosa          b 20
## 3 versicolor          b 10
## 4 versicolor          c 30
## 5 versicolor          d 10
## 6  virginica          d 20
## 7  virginica          e 30

iris %>% count(Species, subspecies, .drop = FALSE)

##       Species subspecies  n
## 1      setosa          a 30
## 2      setosa          b 20
## 3      setosa          c  0
## 4      setosa          d  0
## 5      setosa          e  0
## 6  versicolor          a  0
## 7  versicolor          b 10
## 8  versicolor          c 30
## 9  versicolor          d 10
## 10 versicolor          e  0
## 11  virginica          a  0
## 12  virginica          b  0
## 13  virginica          c  0
## 14  virginica          d 20
## 15  virginica          e 30

When writing functions, make sure to always explicitely return an object with return. Otherwise the output of the last call within the function will be returned.

out <- lapply(1:10, function(x){
    cur_out <- data.frame(index = rep(x, 10))
    cur_out$table <- cur_out$index * 1:10
})
out

## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1]  2  4  6  8 10 12 14 16 18 20
## 
## [[3]]
##  [1]  3  6  9 12 15 18 21 24 27 30
## 
## [[4]]
##  [1]  4  8 12 16 20 24 28 32 36 40
## 
## [[5]]
##  [1]  5 10 15 20 25 30 35 40 45 50
## 
## [[6]]
##  [1]  6 12 18 24 30 36 42 48 54 60
## 
## [[7]]
##  [1]  7 14 21 28 35 42 49 56 63 70
## 
## [[8]]
##  [1]  8 16 24 32 40 48 56 64 72 80
## 
## [[9]]
##  [1]  9 18 27 36 45 54 63 72 81 90
## 
## [[10]]
##  [1]  10  20  30  40  50  60  70  80  90 100

out <- lapply(1:10, function(x){
    cur_out <- data.frame(index = rep(x, 10))
    cur_out$table <- cur_out$index * 1:10
    return(cur_out)
})
out

## [[1]]
##    index table
## 1      1     1
## 2      1     2
## 3      1     3
## 4      1     4
## 5      1     5
## 6      1     6
## 7      1     7
## 8      1     8
## 9      1     9
## 10     1    10
## 
## [[2]]
##    index table
## 1      2     2
## 2      2     4
## 3      2     6
## 4      2     8
## 5      2    10
## 6      2    12
## 7      2    14
## 8      2    16
## 9      2    18
## 10     2    20
## 
## [[3]]
##    index table
## 1      3     3
## 2      3     6
## 3      3     9
## 4      3    12
## 5      3    15
## 6      3    18
## 7      3    21
## 8      3    24
## 9      3    27
## 10     3    30
## 
## [[4]]
##    index table
## 1      4     4
## 2      4     8
## 3      4    12
## 4      4    16
## 5      4    20
## 6      4    24
## 7      4    28
## 8      4    32
## 9      4    36
## 10     4    40
## 
## [[5]]
##    index table
## 1      5     5
## 2      5    10
## 3      5    15
## 4      5    20
## 5      5    25
## 6      5    30
## 7      5    35
## 8      5    40
## 9      5    45
## 10     5    50
## 
## [[6]]
##    index table
## 1      6     6
## 2      6    12
## 3      6    18
## 4      6    24
## 5      6    30
## 6      6    36
## 7      6    42
## 8      6    48
## 9      6    54
## 10     6    60
## 
## [[7]]
##    index table
## 1      7     7
## 2      7    14
## 3      7    21
## 4      7    28
## 5      7    35
## 6      7    42
## 7      7    49
## 8      7    56
## 9      7    63
## 10     7    70
## 
## [[8]]
##    index table
## 1      8     8
## 2      8    16
## 3      8    24
## 4      8    32
## 5      8    40
## 6      8    48
## 7      8    56
## 8      8    64
## 9      8    72
## 10     8    80
## 
## [[9]]
##    index table
## 1      9     9
## 2      9    18
## 3      9    27
## 4      9    36
## 5      9    45
## 6      9    54
## 7      9    63
## 8      9    72
## 9      9    81
## 10     9    90
## 
## [[10]]
##    index table
## 1     10    10
## 2     10    20
## 3     10    30
## 4     10    40
## 5     10    50
## 6     10    60
## 7     10    70
## 8     10    80
## 9     10    90
## 10    10   100

Session info

Here are the packages used in this workshop.

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] R6_2.5.1            BiocParallel_1.31.8 DT_0.23            
##  [4] vroom_1.5.7         data.table_1.14.2   bench_1.1.2        
##  [7] forcats_0.5.1       stringr_1.4.0       dplyr_1.0.9        
## [10] purrr_0.3.4         readr_2.1.2         tidyr_1.2.0        
## [13] tibble_3.1.7        ggplot2_3.3.6       tidyverse_1.3.1    
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.8.0   assertthat_0.2.1  digest_0.6.29     utf8_1.2.2       
##  [5] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.15    
##  [9] httr_1.4.3        highr_0.9         pillar_1.7.0      rlang_1.0.2      
## [13] readxl_1.4.0      rstudioapi_0.13   jquerylib_0.1.4   rmarkdown_2.14   
## [17] labeling_0.4.2    htmlwidgets_1.5.4 bit_4.0.4         munsell_0.5.0    
## [21] broom_0.8.0       compiler_4.2.0    modelr_0.1.8      xfun_0.31        
## [25] pkgconfig_2.0.3   htmltools_0.5.2   tidyselect_1.1.2  codetools_0.2-18 
## [29] fansi_1.0.3       crayon_1.5.1      tzdb_0.3.0        dbplyr_2.2.0     
## [33] withr_2.5.0       grid_4.2.0        jsonlite_1.8.0    gtable_0.3.0     
## [37] lifecycle_1.0.1   DBI_1.1.2         magrittr_2.0.3    scales_1.2.0     
## [41] profmem_0.6.0     cli_3.3.0         stringi_1.7.6     farver_2.1.0     
## [45] fs_1.5.2          xml2_1.3.3        bslib_0.3.1       ellipsis_0.3.2   
## [49] generics_0.1.2    vctrs_0.4.1       tools_4.2.0       bit64_4.0.5      
## [53] glue_1.6.2        crosstalk_1.2.0   hms_1.1.1         parallel_4.2.0   
## [57] fastmap_1.1.0     yaml_2.3.5        colorspace_2.0-3  rvest_1.0.2      
## [61] knitr_1.39        haven_2.5.0       sass_0.4.1