Slides

Workshop Goals

This workshop is not exhaustive but meant to be a first contact with the R programming language. We hope you leave the workshop able to say:

R isn’t scary!

We also hope to show you you can use R to do things with data you’re already familiar with, as well as clinical computations.

Object Types

Data Frames in R are like Datasets in SAS®. Data frames are made up of columns called vectors – treated like Variables in SAS® More data types exist, but we’ll focus on data frames

Basic Variable Types

Numeric

Character

Boolean

a b c
1 a TRUE
2 b TRUE
3 c FALSE

Assigning Variables

two r consoles, one assigning x to hello using an arrow, and another using an equal sign

These two methods yield the same results, but the convention is to use <-. Learn more here

Testing Equality

Operator Meaning Example
<- assign x <- y
== equal to x == y
!= not equal to x != y
< less than x < y
<= less than or equal to x <= y
> greater than x > y
>= greater than or equal to x >= y

Arithmetic Operators

operator Meaning Example Result
+ addition 1 + 1 == 2 2
- subtraction 1 -1 == 0 0
/ division 6/3 == 2 2
* multiplication 2 * 3 == 6 6
^ or ** exponentiation 3 ** 2 or 3 ^ 2 9
%% modulus 6%%5 1
%% integer division 7 %% 2 3

A Couple More

Operator Meaning Example
& and x & y
| or x | y
! not !x
%in% in x %in% y

The Pipe %>% Operator

The pipe, %>%, is used to create a pipeline of functions and can be read as “and then”

tweet showing how to read the pipe as waking up %>% getting dressed

What are packages?

Packagers are collections of functions and tools to expand the capabilities of R. You can import a package with: library(package_name)

What is the tidyverse?

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Install the complete tidyverse with: install.packages("tidyverse")

Keep:

We can keep only the columns a and b from the original dat:

a b c
X 5 15
X 10 20
Y 2 12
Y 7 17

With the code:

dat %>%
  select(a,b)
a b
X 5
X 10
Y 2
Y 7

Drop

We can drop column c, choosing everything but column c:

a b c
X 5 15
X 10 20
Y 2 12
Y 7 17

By using the code:

dat %>%
  select(-c)
a b
X 5
X 10
Y 2
Y 7

Sub-setting by rows (where)

We can subset dat where f >= 5

a b c
X 5 15
X 10 20
Y 2 12
Y 7 17

Using the following code:

dat %>%
  filter(b>=5)
a b
X 5
X 10
Y 7

Rename

We can use R’s rename function to rename columns a and b to groups and values. Given this starting data frame:

a b c
X 5 15
X 10 20
Y 2 12
Y 7 17

We can use the code:

dat %>%
  rename(
    groups = a,
    values = b
  )
groups values c
X 5 15
X 10 20
Y 2 12
Y 7 17

Sorting data

Compared to SAS®, you don’t have to sort a lot of the time!

When do I sort? - Presentation - Order dependent operations (i.e. baseline flag) - Don’t need it for grouping

For instance, if we want to sort our data frame by column b:

a b c
X 5 15
X 10 20
Y 2 12
Y 7 17

We can use the arrange function on column b

dat %>%
  arrange(b)
a b c
Y 2 12
X 5 15
Y 7 17
X 10 20

set AKA bind_rows

two dataframes merged based on column names

merge AKA *_join

venn diagrams of inner join, full join, left and right join

Adding/editing a variable

We can use the mutate function to create new columns using the data from existing columns. For instance we can create a new column c by adding 10 to column b. We can also use the mutate function to or modify existing columns in place. For example, rather than create a new column, we can overwrite column a adding - before and after each entry.

Original Data

a b
X 5
X 10
Y 2
Y 7

Using the following code:

dat %>%
  mutate(
    c = b + 10,
    a = paste0("-", a, "-")
)
a b c
-X- 5 15
-X- 10 20
-Y- 2 12
-Y- 7 17

if_else logic

We can use if_else within a mutate to create new columns based on another column. For instance, we can create a categorical column of High and Low values based on column b:

a b
X 5
X 10
Y 2
Y 7

We can use this code:

dat %>%
  mutate(
    level = if_else(b > 5,
                    "High",
                    "Low")
    )
a b level
X 5 Low
X 10 High
Y 2 Low
Y 7 High

But what if we want another, Medium category for values greater than 3 but less than 7?

a b
X 5
X 10
Y 2
Y 7

If we were to use an if_else statement that would require nesting

dat %>%
  mutate(
    level = if_else(b < 3, "Low",
                    if_else(b < 8, "Mid",  "High"))
    )

But this is really hard to read! Lucky for us we can use the case_when function.

The structure of case_when can be read as:

dat %>%
  mutate(
    level = case_when(
      b < 3 ~ "Low",
      b < 8 ~ "Mid",
      TRUE ~ "High"
    )
  )
a b level
X 5 Medium
X 10 High
Y 2 Low
Y 7 Medium

rowwise vs column operators

dat %>%
  mutate(
    c = mean(c(a,b))
  )
#
a b c
5 5 6.375
16 10 6.375
3 2 6.375
3 7 6.375
dat %>%
  rowwise() %>%
  mutate(
    c = mean(c(a,b))
  )
a b c
5 5 5.0
16 10 13.0
3 2 2.5
3 7 5.0

Descriptive Statistics

a b
X 5
X 10
Y 2
Y 7
dat %>%
  summarize(
    mean = mean(b),
    sd = sd(b),
    min = min(b),
    max = max(b)
  )
mean sd min max
6 3.36 2 10

Grouped Descriptive Statistics

a b
X 5
X 10
Y 2
Y 7
dat %>%
  group_by(a) %>%
  summarize(
    mean = mean(b),
    sd = sd(b),
    min = min(b),
    max = max(b)
  )
a mean sd min max
X 7.5 3.53 5 10
Y 4.5 3.53 2 7

Counting

Option 1

a b
X dog
X cat
X rabbit
Y rabbit
Y rabbit
dat %>% 
  group_by(b) %>% 
  summarize( 
    n = n() 
  )
a b
1 dog
1 cat
3 rabbit

Option 2

a b
X dog
X cat
X rabbit
Y rabbit
Y rabbit
dat %>% 
  count(b)
a b
1 dog
1 cat
3 rabbit

Grouped Option 1

a b
X dog
X cat
X rabbit
Y rabbit
Y rabbit
dat %>% 
  group_by(a, b) %>% 
  summarize( 
    n = n() 
  )
a b
1 dog
1 cat
3 rabbit

Other Options

In the course we also showcase 2 other ways to achieve the same goal. The first is using group_by and count

dat %>%
  group_by(a) %>% 
  count(b)

or with even shorter code, calling count on both columns we’d like to group by:

dat %>%
  count(a, b)