This workshop is not exhaustive; it is meant to be a first contact with the R programming language. We hope you leave the workshop able to say:
R isn’t scary!
We also hope to show that you can use R for the data tasks you’re already familiar with, as well as for clinical computations.
Data frames in R are like datasets in SAS®. A data frame is made up of columns, called vectors, which are treated like variables in SAS®. More data types exist, but we’ll focus on data frames.
Basic Variable Types
Numeric
Character
Boolean (called logical in R)
| a | b | c |
|---|---|---|
| 1 | a | TRUE |
| 2 | b | TRUE |
| 3 | c | FALSE |
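As a minimal sketch, the small table above could be built by hand with base R's data.frame(). The object name df is only an illustration and is not part of the workshop material.

```r
# Build the three-column example; each column is a vector of one type.
df <- data.frame(
  a = c(1, 2, 3),             # numeric
  b = c("a", "b", "c"),       # character
  c = c(TRUE, TRUE, FALSE),   # logical (Boolean)
  stringsAsFactors = FALSE    # keep b as character on older R versions
)

str(df)        # prints the structure, including each column's type
class(df$a)    # "numeric"
class(df$b)    # "character"
class(df$c)    # "logical"
```

Each column is pulled out with $, which returns the underlying vector.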
Both = and <- assign a value, and the two methods yield the same results, but the convention is to use <-.
| Operator | Meaning | Example |
|---|---|---|
| <- | assign | x <- y |
| == | equal to | x == y |
| != | not equal to | x != y |
| < | less than | x < y |
| <= | less than or equal to | x <= y |
| > | greater than | x > y |
| >= | greater than or equal to | x >= y |
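The operators in this table can be tried directly at the console. The sketch below uses made-up values for x and y; comparisons work element by element, so comparing a vector with a single number returns one TRUE/FALSE per element.

```r
x <- c(1, 5, 10)   # assign a numeric vector to x
y <- 5             # assign a single number to y

x == y   # FALSE  TRUE FALSE
x != y   #  TRUE FALSE  TRUE
x <  y   #  TRUE FALSE FALSE
x <= y   #  TRUE  TRUE FALSE
x >  y   # FALSE FALSE  TRUE
x >= y   # FALSE  TRUE  TRUE
```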
| Operator | Meaning | Example | Result |
|---|---|---|---|
| + | addition | 1 + 1 | 2 |
| - | subtraction | 1 - 1 | 0 |
| / | division | 6 / 3 | 2 |
| * | multiplication | 2 * 3 | 6 |
| ^ or ** | exponentiation | 3 ^ 2 or 3 ** 2 | 9 |
| %% | modulus | 6 %% 5 | 1 |
| %/% | integer division | 7 %/% 2 | 3 |
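A quick way to get comfortable with the arithmetic operators is to run them at the console. In this short sketch, note that %/% is the integer-division operator, while %% returns the remainder.

```r
1 + 1     # 2
1 - 1     # 0
6 / 3     # 2
2 * 3     # 6
3 ^ 2     # 9
3 ** 2    # 9 (same as ^)
6 %% 5    # 1, the remainder after dividing 6 by 5
7 %/% 2   # 3, integer division
7 %% 2    # 1, the remainder that integer division leaves behind
```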
| Operator | Meaning | Example |
|---|---|---|
| & | and | x & y |
| \| | or | x \| y |
| ! | not | !x |
| %in% | in | x %in% y |
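The logical operators also work element by element. The vectors below are invented purely for illustration.

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

x & y   #  TRUE FALSE FALSE  (both must be TRUE)
x | y   #  TRUE  TRUE FALSE  (at least one is TRUE)
!x      # FALSE FALSE  TRUE  (flips each value)

# %in% asks, for each element on the left, whether it appears on the right
pets <- c("dog", "fish")
pets %in% c("dog", "cat", "rabbit")   # TRUE FALSE
```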
%>% Operator
The pipe, %>%, is used to create a pipeline of functions and can be read as “and then”.
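To make “and then” concrete, here is a small sketch comparing a nested call with the same steps written as a pipeline. It assumes the dplyr package (which provides %>%) is installed; the vector values is made up for the example.

```r
library(dplyr)   # makes the %>% pipe available

values <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped: take values, and then sqrt, and then mean, and then round
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)   # TRUE
```

The pipe passes the result on its left as the first argument of the function on its right, which is why round(1) rounds to one decimal place.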

Packages are collections of functions and tools that expand the capabilities of R. Once a package is installed, you can load it with: library(package_name)
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Install the complete tidyverse with:
install.packages("tidyverse")
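To follow along with the examples, the data frame dat used throughout can be recreated by hand. This is a sketch that assumes dplyr is already installed; the workshop may create dat differently.

```r
library(dplyr)   # load once per session; part of the tidyverse

# Recreate the example data frame used in the sections that follow
dat <- tibble(
  a = c("X", "X", "Y", "Y"),
  b = c(5, 10, 2, 7),
  c = c(15, 20, 12, 17)
)

dat   # print it to check the contents
```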
We can keep only the columns a and b from the original dat:
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
With the code:
dat %>%
  select(a, b)
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
We can drop column c, choosing everything but column c:
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
By using the code:
dat %>%
  select(-c)
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
We can subset dat to the rows where b >= 5:
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
Using the following code:
dat %>%
  filter(b >= 5)
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 7 | 17 |
We can use dplyr’s rename function to rename columns a and b to groups and values. Given this starting data frame:
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
We can use the code:
dat %>%
  rename(
    groups = a,
    values = b
  )
| groups | values | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
Compared to SAS®, you don’t have to sort a lot of the time!
When do I sort?
- Presentation
- Order-dependent operations (e.g. a baseline flag)
- Not for grouping: you don’t need to sort first
For instance, if we want to sort our data frame by column b:
| a | b | c |
|---|---|---|
| X | 5 | 15 |
| X | 10 | 20 |
| Y | 2 | 12 |
| Y | 7 | 17 |
We can use the arrange function on column b:
dat %>%
  arrange(b)
| a | b | c |
|---|---|---|
| Y | 2 | 12 |
| X | 5 | 15 |
| Y | 7 | 17 |
| X | 10 | 20 |
We can use the mutate function to create new columns from the data in existing columns. For instance, we can create a new column c by adding 10 to column b. We can also use mutate to modify existing columns in place. For example, rather than creating a new column, we can overwrite column a, adding - before and after each entry.
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
Using the following code:
dat %>%
  mutate(
    c = b + 10,
    a = paste0("-", a, "-")
  )
| a | b | c |
|---|---|---|
| -X- | 5 | 15 |
| -X- | 10 | 20 |
| -Y- | 2 | 12 |
| -Y- | 7 | 17 |
if_else logic
We can use if_else within a mutate to create new columns based on another column. For instance, we can create a categorical column of High and Low values based on column b:
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
We can use this code:
dat %>%
  mutate(
    level = if_else(b > 5, "High", "Low")
  )
| a | b | level |
|---|---|---|
| X | 5 | Low |
| X | 10 | High |
| Y | 2 | Low |
| Y | 7 | High |
But what if we want another category, Mid, for values of at least 3 but less than 8?
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
If we were to use if_else, that would require nesting one call inside another:
dat %>%
  mutate(
    level = if_else(b < 3, "Low",
                    if_else(b < 8, "Mid", "High"))
  )
But this is really hard to read! Lucky for us we can use the case_when function.
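The structure of case_when can be read as:
- Left side of ~: TRUE/FALSE, or something that evaluates to TRUE/FALSE
- Right side of ~: the value to return
Conditions are checked in order and the first one that is TRUE wins, so the final TRUE acts as a catch-all for every remaining row.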
dat %>%
  mutate(
    level = case_when(
      b < 3 ~ "Low",
      b < 8 ~ "Mid",
      TRUE ~ "High"
    )
  )
| a | b | level |
|---|---|---|
| X | 5 | Mid |
| X | 10 | High |
| Y | 2 | Low |
| Y | 7 | Mid |
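One way to convince yourself that the nested if_else and the case_when versions are equivalent is to run both and compare the results. The sketch below assumes dplyr is installed and recreates dat by hand; the names nested and cased are only for illustration.

```r
library(dplyr)

dat <- tibble(
  a = c("X", "X", "Y", "Y"),
  b = c(5, 10, 2, 7)
)

# Nested if_else version
nested <- dat %>%
  mutate(level = if_else(b < 3, "Low",
                         if_else(b < 8, "Mid", "High")))

# case_when version: conditions are tried top to bottom
cased <- dat %>%
  mutate(level = case_when(
    b < 3 ~ "Low",
    b < 8 ~ "Mid",
    TRUE  ~ "High"
  ))

identical(nested$level, cased$level)   # TRUE
```

Because the first matching condition wins, b < 8 only ever applies to rows that did not already match b < 3.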
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
| mean | sd | min | max |
|---|---|---|---|
| 6 | 3.37 | 2 | 10 |
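The two tables above appear to show column b being collapsed into a single row of summary statistics. The code for this step is not shown here, so the following is only a sketch of how such a summary could be produced with dplyr's summarize; the output column names are chosen to match the table.

```r
library(dplyr)

dat <- tibble(
  a = c("X", "X", "Y", "Y"),
  b = c(5, 10, 2, 7)
)

# Collapse all rows into one row of summary statistics for column b
overall <- dat %>%
  summarize(
    mean = mean(b),
    sd   = sd(b),
    min  = min(b),
    max  = max(b)
  )

overall
```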
| a | b |
|---|---|
| X | 5 |
| X | 10 |
| Y | 2 |
| Y | 7 |
| a | mean | sd | min | max |
|---|---|---|---|---|
| X | 7.5 | 3.54 | 5 | 10 |
| Y | 4.5 | 3.54 | 2 | 7 |
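The grouped version of the summary, with one row per value of a, can be sketched by adding group_by before summarize. As with any grouped operation in dplyr, no sorting is needed first. The code is an assumption about how the table was produced, not the workshop's own code.

```r
library(dplyr)

dat <- tibble(
  a = c("X", "X", "Y", "Y"),
  b = c(5, 10, 2, 7)
)

# One row of summary statistics per group in column a
by_group <- dat %>%
  group_by(a) %>%
  summarize(
    mean = mean(b),
    sd   = sd(b),
    min  = min(b),
    max  = max(b)
  )

by_group
```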
| a | b |
|---|---|
| X | dog |
| X | cat |
| X | rabbit |
| Y | rabbit |
| Y | rabbit |
dat %>%
  group_by(b) %>%
  summarize(
    n = n()
  )
| b | n |
|---|---|
| cat | 1 |
| dog | 1 |
| rabbit | 3 |
| a | b |
|---|---|
| X | dog |
| X | cat |
| X | rabbit |
| Y | rabbit |
| Y | rabbit |
dat %>%
  count(b)
| b | n |
|---|---|
| cat | 1 |
| dog | 1 |
| rabbit | 3 |
| a | b |
|---|---|
| X | dog |
| X | cat |
| X | rabbit |
| Y | rabbit |
| Y | rabbit |
dat %>%
  group_by(a, b) %>%
  summarize(
    n = n()
  )
| a | b | n |
|---|---|---|
| X | cat | 1 |
| X | dog | 1 |
| X | rabbit | 1 |
| Y | rabbit | 2 |
In the course we also showcase two other ways to achieve the same goal. The first is using group_by and count:
dat %>%
  group_by(a) %>%
  count(b)
Or, with even shorter code, calling count on both columns we’d like to group by:
dat %>%
  count(a, b)
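To check that the three approaches really give the same counts, they can be run side by side and compared. This sketch assumes dplyr 1.0.0 or later (for the .groups argument) and recreates dat by hand; the result names are illustrative only.

```r
library(dplyr)

dat <- tibble(
  a = c("X", "X", "X", "Y", "Y"),
  b = c("dog", "cat", "rabbit", "rabbit", "rabbit")
)

# 1. group_by + summarize; .groups = "drop" returns an ungrouped result
by_summarize <- dat %>%
  group_by(a, b) %>%
  summarize(n = n(), .groups = "drop")

# 2. group_by + count; ungroup() removes the leftover grouping on a
by_group_count <- dat %>%
  group_by(a) %>%
  count(b) %>%
  ungroup()

# 3. count on both columns
by_count <- dat %>%
  count(a, b)

# Compare as plain data frames so only the values are checked
all.equal(as.data.frame(by_summarize), as.data.frame(by_count))     # TRUE
all.equal(as.data.frame(by_group_count), as.data.frame(by_count))   # TRUE
```

The grouped versions keep some grouping information on the result unless it is dropped, which is the main practical difference from calling count(a, b) directly.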