This script serves as teaching material for the workshop “Tipps & Tricks for data analysis in R” presented at the DQBM retreat 2022.
It builds loosely upon the Introduction to Data Analysis sessions that were held in March 2020.
Some of the concepts are taken from Advanced R, an excellent book to dive deeper into how R works.
The session will address very different aspects of data analysis and programming using the statistical programming language R. To follow the session, please install R and RStudio and install the following packages in R:
install.packages("BiocManager")
BiocManager::install(c("R6", "BiocParallel", "DT", "vroom", "data.table", "bench", "forcats", "stringr", "dplyr", "purrr", "readr", "tidyr", "tibble", "ggplot2", "tidyverse"))
To follow the session live, please clone the repository locally
git clone https://github.com/BodenmillerGroup/IntroDataAnalysis.git
cd IntroDataAnalysis/R/DQBM_retreat_2022
and open the file tipps_tricks.Rmd.
In the first section, we will discuss very basic coding practices in R, mainly focusing on the assignment operator, pipes and function calls. For a full overview of coding style recommendations please refer to the tidyverse style guide.
The assignment operators <- and -> are the preferred way of assigning variables in R. They allow left-sided and right-sided assignment, while = is always interpreted as <- outside function calls. Within function calls, = is reserved for setting parameters. Calling <- within a function call assigns the variable in the calling environment, even if it was previously undefined.
y = 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
1:10 = y
## Error in 1:10 = y: target of assignment expands to non-language object
z <- 1:10
z
## [1] 1 2 3 4 5 6 7 8 9 10
1:10 -> z
z
## [1] 1 2 3 4 5 6 7 8 9 10
mean(x = 1:10)
## [1] 5.5
x
## Error in eval(expr, envir, enclos): object 'x' not found
mean(x <- 1:10)
## [1] 5.5
x
## [1] 1 2 3 4 5 6 7 8 9 10
# Incorrect assignment within function call
system.time(x = lapply(1:10, function(x) {Sys.sleep(1); return(x)}))
## Error in system.time(x = lapply(1:10, function(x) {: unused argument (x = lapply(1:10, function(x) {
## Sys.sleep(1)
## return(x)
## }))
# Assignment to x within function call
system.time(x <- lapply(1:10, function(x) {Sys.sleep(1); return(x)}))
## user system elapsed
## 0.017 0.002 10.038
Using pipes in R improves the readability of your code. In the next code chunk we want to find the date on which the maximum number of accumulated Covid-19 cases was detected in Switzerland.
For this, base R (version 4.1.0 and later) provides the pipe operator |>.
covid <- read.csv("../../Data/covid19.csv")
str(covid)
## 'data.frame': 231264 obs. of 8 variables:
## $ Country.Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Province.State: chr "" "" "" "" ...
## $ Lat : num 33.9 33.9 33.9 33.9 33.9 ...
## $ Long : num 67.7 67.7 67.7 67.7 67.7 ...
## $ Date : chr "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" ...
## $ Confirmed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : int 0 0 0 0 0 0 0 0 0 0 ...
# Bad
covid[covid$Country.Region == "Switzerland" & covid$Confirmed == max(covid[covid$Country.Region == "Switzerland", "Confirmed"]),"Date"]
## [1] "2022-06-14" "2022-06-15"
# Good
covid |>
subset(Country.Region == "Switzerland") |>
subset(Confirmed == max(Confirmed)) |>
getElement("Date")
## [1] "2022-06-14" "2022-06-15"
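The native pipe is pure syntax: when parsing the code, R rewrites x |> f(y) into f(x, y). The following minimal sketch checks this, assuming R 4.1.0 or newer. The data frame toy is made up for illustration and is not part of the workshop data.

```r
# The parser rewrites the piped expression into a nested call
piped  <- quote(x |> f(y))
nested <- quote(f(x, y))
identical(piped, nested)

# Hypothetical toy data, only for illustration
toy <- data.frame(group = c("a", "a", "b"), value = c(1, 5, 3))

# Piped and nested versions give the same result
res_pipe   <- toy |> subset(group == "a") |> getElement("value")
res_nested <- getElement(subset(toy, group == "a"), "value")
identical(res_pipe, res_nested)
```

Because the rewrite happens at parse time, the piped version should not add any run-time overhead compared to the nested call.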
The tidyverse package loads the core packages for tidy data handling in R. The pipe operator %>% is provided by the magrittr package.
library(tidyverse)
covid %>%
filter(Country.Region == "Switzerland") %>%
filter(Confirmed == max(Confirmed)) %>%
select("Date")
## Date
## 1 2022-06-14
## 2 2022-06-15
Even though this is not recommended, piping fits naturally with right-sided assignment:
# Bad but fun
covid %>%
filter(Country.Region == "Switzerland") %>%
filter(Confirmed == max(Confirmed)) %>%
select("Date") -> max_conf_date
max_conf_date
## Date
## 1 2022-06-14
## 2 2022-06-15
# Good
max_conf_date <- covid %>%
filter(Country.Region == "Switzerland") %>%
filter(Confirmed == max(Confirmed)) %>%
select("Date")
max_conf_date
## Date
## 1 2022-06-14
## 2 2022-06-15
One of the key features of piping is the direct use with ggplot2 for plotting.
covid %>%
filter(Country.Region == "Switzerland") %>%
ggplot() + geom_line(aes(Date, Confirmed, color = Country.Region, group = 1)) +
theme_classic(base_size = 15) +
theme(axis.text.x = element_text(angle = 90))
The next section focuses on how to perform computations efficiently in (base) R. One of the best packages for efficient computations is data.table, but we won’t have enough time to go through it.
Reading large datasets into R can be a pain, and the base R functions are not optimized for this.
To test a number of file reader functions, we will first generate a “large” dataset on the fly and write it to disk:
matrix(rnorm(n = 10000000), ncol = 10) %>%
as.data.frame() %>%
write_csv(file = "../../Data/large_test_data.csv")
The resulting .csv file contains 1 million rows and 10 columns and has a size of 200 MB on disk.
In the next code chunk we will test the following file reader functions:
read.csv: base R function for reading in .csv files
readr::read_csv: standard tidy R function
data.table::fread: main reader function of the data.table package
vroom::vroom: newer library playing along with tidy R
We will now use the bench library to benchmark which reader function performs best.
library(bench)
library(data.table)
library(vroom)
library(DT)
bench::mark(
cur_data <- read.csv("../../Data/large_test_data.csv"),
cur_data <- readr::read_csv("../../Data/large_test_data.csv"),
cur_data <- data.table::fread("../../Data/large_test_data.csv"),
cur_data <- vroom::vroom("../../Data/large_test_data.csv"),
check = FALSE
) %>% select(c("min", "median", "mem_alloc", "total_time")) %>%
DT::datatable()
We can see that the base R reader function has by far the worst performance. The readr::read_csv function already performs decently and should be used whenever possible, especially when you plan on using the tidyverse packages, which you should ;).
The vroom::vroom function does not seem to be much faster than the others in this case, but memory allocation is drastically reduced. You should use it when reading in gigabytes and gigabytes of data.
The R programming language performs best when using vectorized operations. Only in certain cases is a for loop needed to iterate over objects.
The following code chunk highlights the worst possible way of calculating the average value per row of a matrix:
cur_data <- as.matrix(cur_data[1:100000,])
out <- vector()
# Bad
for (i in 1:nrow(cur_data)) {
cur_mean <- mean(cur_data[i,])
out <- c(out, cur_mean)
}
head(out, 100)
## [1] -0.074207119 0.233431346 0.104103532 -0.518205623 -0.301064250
## [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
## [11] -0.086633220 -0.032881760 0.138429943 0.067167798 -0.136472178
## [16] 0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
## [21] 0.014438649 0.018794623 0.856228533 0.151950344 0.427907691
## [26] 0.074230011 -0.005357597 0.176879299 -0.012597546 -0.414030659
## [31] -0.374366404 0.276757738 -0.302159611 -0.277743381 0.406273566
## [36] 0.410958393 -0.152828638 0.066117703 -0.392008452 0.434231416
## [41] -0.083261396 -0.200989223 0.067177543 0.530071588 0.387739119
## [46] -0.718724171 -0.136835433 0.063941132 0.195991346 0.111075637
## [51] -0.280771690 0.127466606 -0.162754149 0.632138537 -0.372494478
## [56] 0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
## [61] 0.138398738 -0.150945784 0.227089967 0.575571188 -0.141800980
## [66] 0.106155740 -0.198169090 -0.376608391 0.097445763 0.058788789
## [71] -0.461580999 -0.213832622 0.077024249 0.622917701 0.015589750
## [76] 0.431167475 -0.068877270 -0.165183311 0.152648936 0.108396141
## [81] 0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
## [86] -0.162718309 0.201106567 -0.342937053 -0.161730928 -0.274512559
## [91] -0.015296974 0.369567604 -0.584568395 0.253261623 -0.071557989
## [96] 0.108757421 0.145894200 -0.290742230 0.043156695 -0.440033909
Here, we used a for loop to iterate through all rows. In addition, we grow the vector out in each iteration, which creates a new copy of the full vector every time.
If you want or need to use a for loop, it is better to first create the output object and only replace individual entries:
out <- vector(mode = "numeric", length = nrow(cur_data))
# Slightly better if needed
for (i in 1:nrow(cur_data)) {
cur_mean <- mean(cur_data[i,])
out[i] <- cur_mean
}
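To convince yourself that both loop variants compute the same thing, here is a small self-contained sketch. The matrix m is simulated and much smaller than the workshop data, so it only illustrates the pattern and not the speed difference.

```r
set.seed(1)
m <- matrix(rnorm(2000), ncol = 10)  # made-up matrix with 200 rows

# Growing the vector: a new, longer vector is created in every iteration
grown <- vector()
for (i in seq_len(nrow(m))) {
  grown <- c(grown, mean(m[i, ]))
}

# Pre-allocating: the vector has its final length from the start
prealloc <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  prealloc[i] <- mean(m[i, ])
}

# Same values either way; only the memory handling differs
all.equal(grown, prealloc)
```

The sketch uses seq_len(nrow(m)) instead of 1:nrow(m), which also behaves correctly when the matrix has zero rows.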
In any case, it is recommended to use functions of the apply family to perform iterated operations. To operate on arrays and matrices, base R provides the apply function:
# Much better
out <- apply(cur_data, MARGIN = 1, FUN = mean)
head(out, 100)
## [1] -0.074207119 0.233431346 0.104103532 -0.518205623 -0.301064250
## [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
## [11] -0.086633220 -0.032881760 0.138429943 0.067167798 -0.136472178
## [16] 0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
## [21] 0.014438649 0.018794623 0.856228533 0.151950344 0.427907691
## [26] 0.074230011 -0.005357597 0.176879299 -0.012597546 -0.414030659
## [31] -0.374366404 0.276757738 -0.302159611 -0.277743381 0.406273566
## [36] 0.410958393 -0.152828638 0.066117703 -0.392008452 0.434231416
## [41] -0.083261396 -0.200989223 0.067177543 0.530071588 0.387739119
## [46] -0.718724171 -0.136835433 0.063941132 0.195991346 0.111075637
## [51] -0.280771690 0.127466606 -0.162754149 0.632138537 -0.372494478
## [56] 0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
## [61] 0.138398738 -0.150945784 0.227089967 0.575571188 -0.141800980
## [66] 0.106155740 -0.198169090 -0.376608391 0.097445763 0.058788789
## [71] -0.461580999 -0.213832622 0.077024249 0.622917701 0.015589750
## [76] 0.431167475 -0.068877270 -0.165183311 0.152648936 0.108396141
## [81] 0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
## [86] -0.162718309 0.201106567 -0.342937053 -0.161730928 -0.274512559
## [91] -0.015296974 0.369567604 -0.584568395 0.253261623 -0.071557989
## [96] 0.108757421 0.145894200 -0.290742230 0.043156695 -0.440033909
Here, MARGIN can be 1 to iterate over rows, 2 to iterate over columns, or even 3 or higher to iterate over higher dimensions when working with arrays.
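As a small illustration of MARGIN values above 2, the following sketch applies sum to a made-up three-dimensional array. The array arr is an assumption for demonstration only.

```r
# Made-up 3-dimensional array: 2 rows, 3 columns, 4 slices
arr <- array(1:24, dim = c(2, 3, 4))

# One sum per slice (third dimension)
slice_sums <- apply(arr, MARGIN = 3, FUN = sum)
slice_sums

# MARGIN can also be a vector: one sum per row/column combination
cell_sums <- apply(arr, MARGIN = c(1, 2), FUN = sum)
dim(cell_sums)
```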
For most simple operations, base R provides vectorized functions such as rowMeans:
# Best
out <- rowMeans(cur_data)
head(out, 100)
## [1] -0.074207119 0.233431346 0.104103532 -0.518205623 -0.301064250
## [6] -1.127165639 -0.150435722 -0.339594341 -0.081580333 -0.244686629
## [11] -0.086633220 -0.032881760 0.138429943 0.067167798 -0.136472178
## [16] 0.012583186 -0.318490843 -0.408403583 -0.400436714 -0.321152811
## [21] 0.014438649 0.018794623 0.856228533 0.151950344 0.427907691
## [26] 0.074230011 -0.005357597 0.176879299 -0.012597546 -0.414030659
## [31] -0.374366404 0.276757738 -0.302159611 -0.277743381 0.406273566
## [36] 0.410958393 -0.152828638 0.066117703 -0.392008452 0.434231416
## [41] -0.083261396 -0.200989223 0.067177543 0.530071588 0.387739119
## [46] -0.718724171 -0.136835433 0.063941132 0.195991346 0.111075637
## [51] -0.280771690 0.127466606 -0.162754149 0.632138537 -0.372494478
## [56] 0.108856251 -0.142078802 -0.300231957 -0.495349114 -0.097214440
## [61] 0.138398738 -0.150945784 0.227089967 0.575571188 -0.141800980
## [66] 0.106155740 -0.198169090 -0.376608391 0.097445763 0.058788789
## [71] -0.461580999 -0.213832622 0.077024249 0.622917701 0.015589750
## [76] 0.431167475 -0.068877270 -0.165183311 0.152648936 0.108396141
## [81] 0.033924549 -0.223445850 -0.173617365 -0.098205042 -0.059709630
## [86] -0.162718309 0.201106567 -0.342937053 -0.161730928 -0.274512559
## [91] -0.015296974 0.369567604 -0.584568395 0.253261623 -0.071557989
## [96] 0.108757421 0.145894200 -0.290742230 0.043156695 -0.440033909
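rowMeans has siblings for columns and for sums. The sketch below checks, on a small simulated matrix (an assumption, not the workshop data), that each of them matches the corresponding apply call.

```r
set.seed(1)
m <- matrix(rnorm(50), ncol = 5)  # made-up 10 x 5 matrix

# Each vectorized function agrees with its apply() counterpart
chk_row_means <- all.equal(apply(m, 1, mean), rowMeans(m))
chk_col_means <- all.equal(apply(m, 2, mean), colMeans(m))
chk_row_sums  <- all.equal(apply(m, 1, sum),  rowSums(m))
chk_col_sums  <- all.equal(apply(m, 2, sum),  colSums(m))

c(chk_row_means, chk_col_means, chk_row_sums, chk_col_sums)
```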
In the next sections we will go through the different functions of the apply family:
apply: as seen above, used on matrix-like objects
lapply: used on lists/vectors, returns a list
sapply: used on lists/vectors, returns a vector
vapply: used on lists/vectors, returns a vector with type and length checking
tapply: applies a function per grouping level
mapply: combines multiple lists
In the following example we will generate a list containing 10 entries, each of which is a numeric vector of length 10 that can contain NAs.
We will then compute the mean per entry while ignoring NAs. Finally, we will convert the list to a vector in different ways.
cur_list <- lapply(1:10, function(x){
c(rnorm(n = 10), rep(NA, 10))[sample(1:20, 10)]
})
out <- lapply(cur_list, mean, na.rm = TRUE)
out
## [[1]]
## [1] -0.263992
##
## [[2]]
## [1] 0.6969153
##
## [[3]]
## [1] -0.5254196
##
## [[4]]
## [1] 0.6592913
##
## [[5]]
## [1] 0.5487733
##
## [[6]]
## [1] -0.06734899
##
## [[7]]
## [1] -0.2675872
##
## [[8]]
## [1] -0.2385669
##
## [[9]]
## [1] -0.2871847
##
## [[10]]
## [1] 0.1435612
do.call(c, out)
## [1] -0.26399195 0.69691535 -0.52541963 0.65929127 0.54877325 -0.06734899
## [7] -0.26758724 -0.23856691 -0.28718470 0.14356122
as.numeric(out)
## [1] -0.26399195 0.69691535 -0.52541963 0.65929127 0.54877325 -0.06734899
## [7] -0.26758724 -0.23856691 -0.28718470 0.14356122
as(out, "numeric")
## [1] -0.26399195 0.69691535 -0.52541963 0.65929127 0.54877325 -0.06734899
## [7] -0.26758724 -0.23856691 -0.28718470 0.14356122
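Another common way to flatten such a list is unlist. The following sketch compares it with the conversions shown above on a made-up list of single numbers.

```r
means <- list(0.5, -1.2, 3)  # made-up values, one number per entry

v1 <- do.call(c, means)
v2 <- as.numeric(means)
v3 <- unlist(means)

identical(v1, v2)
identical(v1, v3)
```

Note that as.numeric only works here because every entry has length one, whereas unlist and do.call(c, ...) also concatenate longer entries.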
In the next example, we can see that most operators in R are also defined as functions. We first access the 3rd element of each entry of a list and then assign to individual entries. This shows again why it is important to understand the assignment operator: all replacement functions in R are defined using <- in their name (for example [[<-), so a function called [[= does not exist.
x <- list(entr1 = 1:10, entr2 = 20:30)
lapply(x, `[[`, 3)
## $entr1
## [1] 3
##
## $entr2
## [1] 22
lapply(x, `[[<-`, 3, 120)
## $entr1
## [1] 1 2 120 4 5 6 7 8 9 10
##
## $entr2
## [1] 20 21 120 23 24 25 26 27 28 29 30
lapply(x, `[[=`, 3, 120)
## Error in match.fun(FUN): object '[[=' not found
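You can define your own replacement functions in the same way. The sketch below uses a hypothetical function called second<-, which is not part of base R. A replacement function takes the object and an argument named value, and returns the modified object.

```r
# Hypothetical replacement function: sets the second element of a vector
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- 1:5
second(v) <- 100L
v

# The call above is shorthand for calling the function directly
w <- 1:5
w <- `second<-`(w, value = 100L)
identical(v, w)
```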
The next function we will look at is sapply, which operates similarly to lapply but returns a “simplified” version of the output (e.g. directly a vector).
sapply(cur_list, mean, na.rm = TRUE)
## [1] -0.26399195 0.69691535 -0.52541963 0.65929127 0.54877325 -0.06734899
## [7] -0.26758724 -0.23856691 -0.28718470 0.14356122
However, it is not always trivial to simplify the output since it’s not always clear what the output is. When developing computational methods it’s always safer to use vapply, which checks the type and length of the output.
# A single numeric should be returned
vapply(cur_list, mean, FUN.VALUE = 0, na.rm = TRUE)
## [1] -0.26399195 0.69691535 -0.52541963 0.65929127 0.54877325 -0.06734899
## [7] -0.26758724 -0.23856691 -0.28718470 0.14356122
# Two objects are returned
vapply(cur_list,
function(x){return(c("one", "two"))},
FUN.VALUE = c("test1", "test2"))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "one" "one" "one" "one" "one" "one" "one" "one" "one" "one"
## [2,] "two" "two" "two" "two" "two" "two" "two" "two" "two" "two"
# Doesn't work
vapply(cur_list, mean, FUN.VALUE = "test")
## Error in vapply(cur_list, mean, FUN.VALUE = "test"): values must be type 'character',
## but FUN(X[[1]]) result is type 'double'
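The following sketch illustrates why the output of sapply can be hard to predict: its return type depends on what the function returns and on whether the input is empty. The inputs are made up for illustration.

```r
# Equal lengths: sapply simplifies to a matrix
out_equal <- sapply(1:3, function(i) c(i, i^2))
is.matrix(out_equal)

# Unequal lengths: sapply silently falls back to a list
out_unequal <- sapply(1:3, function(i) seq_len(i))
is.list(out_unequal)

# Empty input: sapply returns an empty list, vapply keeps the declared type
out_sapply_empty <- sapply(list(), length)
out_vapply_empty <- vapply(list(), length, FUN.VALUE = integer(1))
class(out_sapply_empty)
class(out_vapply_empty)
```

With vapply, downstream code can rely on always receiving an integer vector here, even for empty input.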
The tapply function performs operations on groups of an input. In the following case, we will calculate the maximum number of confirmed Covid cases per country.
Similar operations can be done with the base R aggregate function and with tidy R.
# tapply
head(tapply(covid$Confirmed, covid$Country.Region, max), 10)
## Afghanistan Albania Algeria Andorra
## 181236 276821 265952 43449
## Angola Antarctica Antigua and Barbuda Argentina
## 99761 11 8537 9313453
## Armenia Australia
## 423044 2688494
# aggregate
head(aggregate(covid$Confirmed, by = list(country = covid$Country.Region), max), 10)
## country x
## 1 Afghanistan 181236
## 2 Albania 276821
## 3 Algeria 265952
## 4 Andorra 43449
## 5 Angola 99761
## 6 Antarctica 11
## 7 Antigua and Barbuda 8537
## 8 Argentina 9313453
## 9 Armenia 423044
## 10 Australia 2688494
# tidy R
covid %>%
group_by(Country.Region) %>%
summarize(max_confirmed = max(Confirmed)) %>%
head(10)
## # A tibble: 10 × 2
## Country.Region max_confirmed
## <chr> <int>
## 1 Afghanistan 181236
## 2 Albania 276821
## 3 Algeria 265952
## 4 Andorra 43449
## 5 Angola 99761
## 6 Antarctica 11
## 7 Antigua and Barbuda 8537
## 8 Argentina 9313453
## 9 Armenia 423044
## 10 Australia 2688494
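If you do not have the Covid-19 file at hand, the same grouping pattern can be tried on the built-in iris dataset. This is a sketch of the base R variants only. The formula interface of aggregate and the split/sapply combination are alternatives that were not shown above.

```r
# Maximum sepal length per species, three ways of grouping in base R
by_tapply <- tapply(iris$Sepal.Length, iris$Species, max)
by_tapply

by_aggregate <- aggregate(Sepal.Length ~ Species, data = iris, FUN = max)
by_aggregate

by_split <- sapply(split(iris$Sepal.Length, iris$Species), max)

# All three give the same numbers
all.equal(as.numeric(by_tapply), by_aggregate$Sepal.Length)
all.equal(as.numeric(by_tapply), as.numeric(by_split))
```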
Finally, the mapply function combines a number of lists and performs operations on each paired entry:
cur_list_2 <- as.list(1:10)
mapply(function(x, y){
return(mean(x, na.rm = TRUE) * y)
},
cur_list, cur_list_2)
## [1] -0.263992 1.393831 -1.576259 2.637165 2.743866 -0.404094 -1.873111
## [8] -1.908535 -2.584662 1.435612
mapply(`*`, cur_list_2, cur_list_2)
## [1] 1 4 9 16 25 36 49 64 81 100
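A few more variations on mapply, as a short sketch with made-up inputs: results of unequal length are returned as a list, Map never simplifies, and MoreArgs passes arguments that should not be iterated over.

```r
# Results of different lengths cannot be simplified, so a list is returned
out_rep <- mapply(rep, 1:3, 3:1)
out_rep

# Map() is mapply() without simplification: it always returns a list
out_map <- Map(function(x, y) x + y, 1:3, 4:6)
out_map

# MoreArgs passes arguments that are not iterated over
out_pow <- mapply(function(x, y, power) (x + y)^power, 1:3, 4:6,
                  MoreArgs = list(power = 2))
out_pow
```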
One strength of R is the ability to perform operations in a parallelized fashion. For this, a number of packages have been developed, including parallel, foreach and future.
Personally, I prefer the BiocParallel package as it provides a user-friendly way of handling parallelised operations by simply extending the lapply function.
In the next code chunk, we will test how the bplapply function works. The only additional parameter is BPPARAM, which in the following case is set to bpparam(). The bpparam() function returns the default registered parallelisation backend, which is the one most suited to your operating system.
library(BiocParallel)
bench::mark(
cur_out <- lapply(1:10, function(x){Sys.sleep(5); return(x)}),
cur_out <- bplapply(1:10, function(x){Sys.sleep(5); return(x)}, BPPARAM = bpparam()),
memory = FALSE
) %>% select(c("min", "median", "mem_alloc", "total_time")) %>%
DT::datatable()
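For comparison, here is a minimal sketch of the same idea using the parallel package that ships with R, rather than BiocParallel. It assumes that two worker processes can be started on your machine. The squaring function is only a placeholder for real work.

```r
library(parallel)

# Start a small cluster of two worker processes (assumed to be available)
cl <- makeCluster(2)

# parLapply() is the parallel counterpart of lapply()
par_out <- parLapply(cl, 1:4, function(x) x^2)

# Always shut the workers down again
stopCluster(cl)

# The serial version returns the same list
ser_out <- lapply(1:4, function(x) x^2)
identical(par_out, ser_out)
```

In contrast to bplapply, the cluster has to be created and stopped by hand, which is one reason why the BPPARAM approach is more convenient.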
In the next section we will learn how object-oriented programming (OOP) can be done in R. Even without knowing what that means, you have used OOP whenever you have coded in R.
OOP in R consists of the following three parts: objects (of a given class), generic functions and methods.
Understanding OOP in R is mainly relevant for methods development but also useful to understand where differences in the way you analyse data come from.
We will first start with observing the difference between a base type and a class attribute. Every object in R has a base type, while only OOP objects have a class attribute.
attr(1:10, "class")
## NULL
typeof(1:10)
## [1] "integer"
# Misleading
class(1:10)
## [1] "integer"
is(1:10, "integer")
## [1] TRUE
attr(matrix(1:10), "class")
## NULL
typeof(matrix(1:10))
## [1] "integer"
class(matrix(1:10))
## [1] "matrix" "array"
is(matrix(1:10), "integer")
## [1] FALSE
# Class inheritance
is(matrix(1:10), "matrix")
## [1] TRUE
is(matrix(1:10), "array")
## [1] TRUE
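A factor makes the distinction between base type and class attribute easy to see. The following sketch uses a made-up factor f.

```r
f <- factor(c("a", "b", "a"))

f_type  <- typeof(f)         # base type of the underlying data
f_class <- attr(f, "class")  # class attribute, set because a factor is an OOP object
c(f_type, f_class)

is.object(f)     # an object with a class attribute
is.object(1:10)  # a plain integer vector without a class attribute

# Removing the class attribute exposes the underlying integer codes
codes <- unclass(f)
codes
```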
We will now directly move to defining OOP objects, generic functions and methods in different ways. Please refer to Object Oriented Programming for full information.
Most objects that you work with when analyzing data in base R are of the “S3” class. This is a loosely defined construct for which generic functions and methods exist and allow the user and developer lots of flexibility.
The Bioconductor project mainly works with “S4” classes which are a lot more formally defined and allow for easier interoperability.
The “R6” class provides a framework which comes closest to real OOP as known from e.g. Python.
In the following sections, we will go through different examples and how to construct objects, generic functions and methods.
S3 class objects are most often used in base R functions. For example, the prcomp function to compute a PCA returns an object of class prcomp. Each slot (list element) of an S3 class object such as this one can be accessed via $.
cur_pca <- prcomp(iris[,-5])
class(cur_pca)
## [1] "prcomp"
# Access slots via "$"
head(cur_pca$sdev)
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
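Under the hood, the prcomp result is a named list with a class attribute. A short sketch to inspect this, using the first four (numeric) columns of the built-in iris dataset:

```r
cur_pca <- prcomp(iris[, 1:4])

# Under the class attribute, the object is a named list
pca_type  <- typeof(cur_pca)
pca_names <- names(cur_pca)
pca_type
pca_names

# "$" and "[[" access the same list entry
identical(cur_pca$sdev, cur_pca[["sdev"]])
```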
Base objects such as a factor or a data.frame are also S3 objects, which can be tested with is.object. Base R defines generic functions that can be applied to an S3 object. An example of such a generic function is print. Internally, print will find the matching method to be called on the provided object.
f <- factor(c("a", "b", "c"))
df <- data.frame(a = 1:10, b = letters[1:10])
is.object(f)
## [1] TRUE
is.object(df)
## [1] TRUE
# generic functions - always use these
isS3stdGeneric("print")
## print
## TRUE
print(f)
## [1] a b c
## Levels: a b c
print(df)
## a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
# methods - never use these
isS3method("print.factor")
## [1] TRUE
print.factor(f)
## [1] a b c
## Levels: a b c
print.factor(df)
## a
## 1:10
## b
## c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
## attr(,"row.names")
## [1] 1 2 3 4 5 6 7 8 9 10
## Levels:
## Warning in print.factor(df): factor levels must be "character"
print.data.frame(df)
## a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
print.data.frame(f)
## NULL
## <0 rows> (or 0-length row.names)
There is never a need to call a method directly on an object; use the generic function instead.
In the next few chunks we will learn how to build an object of a certain class, how to construct a generic function and how to construct a method.
We will first construct an S3 class object.
# Create, then set class
x <- 1
class(x) <- "my_class"
class(x)
## [1] "my_class"
inherits(x, "my_class")
## [1] TRUE
inherits(x, "your_class")
## [1] FALSE
Next, we can define a generic function:
my_generic <- function(x) UseMethod("my_generic", x)
Finally, we will link a method to the object class and the generic function.
my_generic.my_class <- function(x) x + 10
my_generic(x)
## [1] 11
## attr(,"class")
## [1] "my_class"
my_generic(20)
## Error in UseMethod("my_generic", x): no applicable method for 'my_generic' applied to an object of class "c('double', 'numeric')"
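The error above can be avoided by providing a default method, which is called when no class-specific method matches. The sketch below uses a hypothetical generic called describe and a small constructor function new_my_class. Both names are made up for illustration.

```r
# Hypothetical generic with a class-specific and a default method
describe <- function(x) UseMethod("describe")
describe.my_class <- function(x) "an object of class my_class"
describe.default  <- function(x) "some other object"

# Small constructor so the class attribute is set in one place
new_my_class <- function(x) structure(x, class = "my_class")

res_class   <- describe(new_my_class(1))
res_default <- describe(20)
c(res_class, res_default)
```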
The S4 class system is more formally defined. One first needs to define the structure of the class. In the following example, we will first define a class “Person” which contains slots to store the name and the age of the person.
We can then use the new function to create a new object of class “Person”. Slots in the object can be accessed via @ or the slot function.
setClass("Person",
slots = c(
name = "character",
age = "numeric"
)
)
john <- new("Person", name = "John Smith", age = as.numeric(NA))
is(john)
## [1] "Person"
isS4(john)
## [1] TRUE
# Access slots via "@"
john@name
## [1] "John Smith"
slot(john, "age")
## [1] NA
Next, we can define a new generic function called age for getting the age of a person and age<- for setting the age of a person.
setGeneric("age", function(x) standardGeneric("age"))
## [1] "age"
setGeneric("age<-", function(x, value) standardGeneric("age<-"))
## [1] "age<-"
Finally, we can construct the methods linked to the age and age<- generic functions and the Person class:
setMethod("age", "Person", function(x) x@age)
setMethod("age<-", "Person", function(x, value) {
x@age <- value
x
})
age(john) <- 50
age(john)
## [1] 50
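Part of the formality of S4 is that a class can carry its own validity check. The following sketch defines a hypothetical class CheckedPerson, named differently so it does not overwrite the Person class above, which rejects negative ages.

```r
library(methods)

# Hypothetical class with a validity check on the age slot
setClass("CheckedPerson",
         slots = c(name = "character", age = "numeric"),
         validity = function(object) {
           if (length(object@age) != 1 ||
               (!is.na(object@age) && object@age < 0)) {
             return("age must be a single non-negative number")
           }
           TRUE
         })

ok <- new("CheckedPerson", name = "Jane Doe", age = 30)
validObject(ok)

# An invalid object is rejected at construction
res <- tryCatch(new("CheckedPerson", name = "Jane Doe", age = -1),
                error = function(e) conditionMessage(e))
res
```

Keep in mind that assigning to a slot with @ does not re-run the validity check, so setter methods such as age<- are a good place to call validObject before returning the object.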
Just for demonstration purposes, we will also discuss the R6 class. Here, one can specify class methods directly within the class definition without the need to construct generic functions.
library(R6)
# Class definition
Person <- R6Class("Person", list(
name = NULL,
age = NA,
# Validity checks on the class construction
initialize = function(name, age = NA_real_) {
stopifnot(is.character(name), length(name) == 1)
stopifnot(is.numeric(age), length(age) == 1)
self$name <- name
self$age <- age
},
# Print method
print = function(...) {
cat("Person: \n")
cat(" Name: ", self$name, "\n", sep = "")
cat(" Age: ", self$age, "\n", sep = "")
invisible(self)
},
# Another method
add_years = function(x) {
self$age <- self$age + x
invisible(self)
}
))
nils <- Person$new("Nils", age = "thirty-three")
## Error in initialize(...): is.numeric(age) is not TRUE
nils <- Person$new("Nils", age = 33)
nils$print()
## Person:
## Name: Nils
## Age: 33
# In place operation
nils$add_years(10)
nils$print()
## Person:
## Name: Nils
## Age: 43
nils$age
## [1] 43
nils$
add_years(10)$
age
## [1] 53
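The in-place behaviour seen above follows from R6 objects having reference semantics: assigning an object to a new name does not copy it. The sketch below shows this with a hypothetical minimal class Counter and assumes the R6 package is installed.

```r
library(R6)

# Hypothetical minimal class for illustration
Counter <- R6Class("Counter", list(
  count = 0,
  add = function(n = 1) {
    self$count <- self$count + n
    invisible(self)
  }
))

a <- Counter$new()
b <- a          # no copy: a and b point to the same object
b$add(5)
a$count

d <- a$clone()  # independent copy
d$add(1)
c(a$count, d$count)
```

This differs from S3 and S4 objects, which are copied on modification like any other R object.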
In the last chapter, I want to highlight a few common pitfalls when analysing data in R.
For some reason, vector recycling is possible in R. Issues commonly arise when using a logical vector for subsetting. In the following example, a vector of length 3 is recycled when subsetting a much larger data.frame without throwing an error. Please make sure that the logical vector always has the same length as the object (here, the number of rows) that you want to subset.
dim(iris)
## [1] 150 5
cur_vector <- c(TRUE, FALSE, TRUE)
dim(iris[cur_vector,])
## [1] 100 5
cur_vector <- iris$Species == "virginica"
dim(iris[cur_vector,])
## [1] 50 5
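One way to protect yourself against silent recycling is to check the length before subsetting. The helper subset_rows below is hypothetical and only sketches the idea.

```r
# Hypothetical helper: subset rows only if the logical vector has the right length
subset_rows <- function(df, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(df))
  df[keep, , drop = FALSE]
}

# Silent recycling: 2 out of every 3 rows are kept
n_recycled <- nrow(iris[c(TRUE, FALSE, TRUE), ])
n_recycled

# With the check, the short vector is rejected
res <- tryCatch(subset_rows(iris, c(TRUE, FALSE, TRUE)),
                error = function(e) "length mismatch caught")
res

# A vector of the correct length passes
n_checked <- nrow(subset_rows(iris, iris$Species == "virginica"))
n_checked
```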
It is important to understand that the .drop or drop argument exists in R. When using tidy R, missing combinations of factor levels are most often dropped by default:
iris$subspecies <- factor(rep(letters[1:5], each = 30))
iris %>% count(Species, subspecies)
## Species subspecies n
## 1 setosa a 30
## 2 setosa b 20
## 3 versicolor b 10
## 4 versicolor c 30
## 5 versicolor d 10
## 6 virginica d 20
## 7 virginica e 30
iris %>% count(Species, subspecies, .drop = FALSE)
## Species subspecies n
## 1 setosa a 30
## 2 setosa b 20
## 3 setosa c 0
## 4 setosa d 0
## 5 setosa e 0
## 6 versicolor a 0
## 7 versicolor b 10
## 8 versicolor c 30
## 9 versicolor d 10
## 10 versicolor e 0
## 11 virginica a 0
## 12 virginica b 0
## 13 virginica c 0
## 14 virginica d 20
## 15 virginica e 30
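The drop argument also shows up in base R subsetting, where the defaults differ between object types. A short sketch with a made-up matrix and factor:

```r
m <- matrix(1:6, nrow = 2)

# Selecting a single row drops the matrix structure by default
dim_dropped <- dim(m[1, ])
dim_kept    <- dim(m[1, , drop = FALSE])
dim_dropped
dim_kept

# Subsetting a factor keeps unused levels unless drop = TRUE is set
f <- factor(c("a", "b", "c"))
lev_default <- levels(f[1:2])
lev_dropped <- levels(f[1:2, drop = TRUE])
lev_default
lev_dropped
```

So for matrices dropping is the default, while for factors it has to be requested.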
When writing functions, make sure to always explicitly return an object with return. Otherwise the output of the last call within the function will be returned.
out <- lapply(1:10, function(x){
cur_out <- data.frame(index = rep(x, 10))
cur_out$table <- cur_out$index * 1:10
})
out
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 2 4 6 8 10 12 14 16 18 20
##
## [[3]]
## [1] 3 6 9 12 15 18 21 24 27 30
##
## [[4]]
## [1] 4 8 12 16 20 24 28 32 36 40
##
## [[5]]
## [1] 5 10 15 20 25 30 35 40 45 50
##
## [[6]]
## [1] 6 12 18 24 30 36 42 48 54 60
##
## [[7]]
## [1] 7 14 21 28 35 42 49 56 63 70
##
## [[8]]
## [1] 8 16 24 32 40 48 56 64 72 80
##
## [[9]]
## [1] 9 18 27 36 45 54 63 72 81 90
##
## [[10]]
## [1] 10 20 30 40 50 60 70 80 90 100
out <- lapply(1:10, function(x){
cur_out <- data.frame(index = rep(x, 10))
cur_out$table <- cur_out$index * 1:10
return(cur_out)
})
out
## [[1]]
## index table
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
##
## [[2]]
## index table
## 1 2 2
## 2 2 4
## 3 2 6
## 4 2 8
## 5 2 10
## 6 2 12
## 7 2 14
## 8 2 16
## 9 2 18
## 10 2 20
##
## [[3]]
## index table
## 1 3 3
## 2 3 6
## 3 3 9
## 4 3 12
## 5 3 15
## 6 3 18
## 7 3 21
## 8 3 24
## 9 3 27
## 10 3 30
##
## [[4]]
## index table
## 1 4 4
## 2 4 8
## 3 4 12
## 4 4 16
## 5 4 20
## 6 4 24
## 7 4 28
## 8 4 32
## 9 4 36
## 10 4 40
##
## [[5]]
## index table
## 1 5 5
## 2 5 10
## 3 5 15
## 4 5 20
## 5 5 25
## 6 5 30
## 7 5 35
## 8 5 40
## 9 5 45
## 10 5 50
##
## [[6]]
## index table
## 1 6 6
## 2 6 12
## 3 6 18
## 4 6 24
## 5 6 30
## 6 6 36
## 7 6 42
## 8 6 48
## 9 6 54
## 10 6 60
##
## [[7]]
## index table
## 1 7 7
## 2 7 14
## 3 7 21
## 4 7 28
## 5 7 35
## 6 7 42
## 7 7 49
## 8 7 56
## 9 7 63
## 10 7 70
##
## [[8]]
## index table
## 1 8 8
## 2 8 16
## 3 8 24
## 4 8 32
## 5 8 40
## 6 8 48
## 7 8 56
## 8 8 64
## 9 8 72
## 10 8 80
##
## [[9]]
## index table
## 1 9 9
## 2 9 18
## 3 9 27
## 4 9 36
## 5 9 45
## 6 9 54
## 7 9 63
## 8 9 72
## 9 9 81
## 10 9 90
##
## [[10]]
## index table
## 1 10 10
## 2 10 20
## 3 10 30
## 4 10 40
## 5 10 50
## 6 10 60
## 7 10 70
## 8 10 80
## 9 10 90
## 10 10 100
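A related subtlety: when the last call in a function is an assignment, its value is returned invisibly, so nothing is printed at the console even though a value comes back. The functions f_implicit and f_explicit below are made up to sketch this.

```r
# The last expression is an assignment: its value is returned invisibly
f_implicit <- function(x) {
  y <- x * 2
}

# Explicit return of the intended object
f_explicit <- function(x) {
  y <- x * 2
  return(y)
}

f_implicit(2)         # prints nothing at the console
res <- f_implicit(2)  # but the value can still be captured
res

vis_implicit <- withVisible(f_implicit(2))$visible
vis_explicit <- withVisible(f_explicit(2))$visible
c(vis_implicit, vis_explicit)
```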
Here are the packages used in this workshop.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] R6_2.5.1 BiocParallel_1.31.8 DT_0.23
## [4] vroom_1.5.7 data.table_1.14.2 bench_1.1.2
## [7] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
## [10] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [13] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] lubridate_1.8.0 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
## [5] cellranger_1.1.0 backports_1.4.1 reprex_2.0.1 evaluate_0.15
## [9] httr_1.4.3 highr_0.9 pillar_1.7.0 rlang_1.0.2
## [13] readxl_1.4.0 rstudioapi_0.13 jquerylib_0.1.4 rmarkdown_2.14
## [17] labeling_0.4.2 htmlwidgets_1.5.4 bit_4.0.4 munsell_0.5.0
## [21] broom_0.8.0 compiler_4.2.0 modelr_0.1.8 xfun_0.31
## [25] pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2 codetools_0.2-18
## [29] fansi_1.0.3 crayon_1.5.1 tzdb_0.3.0 dbplyr_2.2.0
## [33] withr_2.5.0 grid_4.2.0 jsonlite_1.8.0 gtable_0.3.0
## [37] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3 scales_1.2.0
## [41] profmem_0.6.0 cli_3.3.0 stringi_1.7.6 farver_2.1.0
## [45] fs_1.5.2 xml2_1.3.3 bslib_0.3.1 ellipsis_0.3.2
## [49] generics_0.1.2 vctrs_0.4.1 tools_4.2.0 bit64_4.0.5
## [53] glue_1.6.2 crosstalk_1.2.0 hms_1.1.1 parallel_4.2.0
## [57] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3 rvest_1.0.2
## [61] knitr_1.39 haven_2.5.0 sass_0.4.1