
Introduction

Both Tplyr v1 and tplyr2 build formatted clinical summary tables from ADaM-style data, but they take fundamentally different approaches.

Tplyr v1 uses an imperative, piped workflow: create a table object holding data and configuration, pipe in layers, set options with modifier functions, then call build().

tplyr2 uses a declarative, spec-based approach: build a tplyr_spec() that is pure configuration (no data, no side effects), then supply data at build time via tplyr_build(spec, data). This separation makes specs portable, serializable, and reusable across datasets and studies.

This vignette covers the key differences with side-by-side examples.

Quick Reference: Function Mapping

The table below maps v1 functions to their tplyr2 equivalents.

Tplyr v1 | tplyr2 | Notes
tplyr_table(data, treat_var) | tplyr_spec(cols = "treat_var") | Data at build time, not in the spec
add_layer() | tplyr_layers() inside tplyr_spec() | Declarative layer collection
group_count(target_var) | group_count("target_var") | Variable names are quoted strings
group_desc(target_var) | group_desc("target_var") | Variable names are quoted strings
group_shift(vars) | group_shift(c(row = "v1", column = "v2")) | Named character vector
set_format_strings() | format_strings in layer_settings() | Nested in settings object
set_distinct_by() | distinct_by in layer_settings() | Character string
set_denoms_by() | denoms_by in layer_settings() | Character vector
set_where() | where parameter in layer or spec | Bare expression (unquoted)
add_total_group() | total_group() in spec’s total_groups | Spec-level config
set_pop_data() | pop_data() in spec + tplyr_build() | Config in spec, data at build
build() | tplyr_build(spec, data) | Data supplied at build time
f_str() | f_str() | Variable names are now quoted strings

Key Differences in Detail

Data Is Separated from Configuration

In v1, data lives inside the table object from the moment you create it:

# Tplyr v1: data bound at table creation
t <- tplyr_table(adsl, TRT01P)

In tplyr2, the spec knows nothing about data. You supply data only when you are ready to build:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX")
  )
)

# Data provided at build time
result <- tplyr_build(spec, tplyr_adsl)

This means the same spec can be applied to different datasets without modifying the spec itself.
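
As a minimal sketch of that reuse, the spec below is defined once and then built against the full tplyr_adsl data and against a safety-population subset. The subset via base subset() and the SAFFL flag is an illustrative choice, not a prescribed workflow.

```r
library(tplyr2)

# One spec, defined once with no reference to any dataset
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX")
  )
)

# The same spec applied to two different data frames
res_all    <- tplyr_build(spec, tplyr_adsl)
res_safety <- tplyr_build(spec, subset(tplyr_adsl, SAFFL == "Y"))
```

The spec object itself is never modified between the two builds; only the data argument changes.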

Variable Names Are Quoted Strings

In v1, variable names are bare symbols. In tplyr2, they are character strings. Strings are what make JSON/YAML serialization possible, and they also simplify programmatic construction. The one exception is where, which accepts bare expressions:

# Tplyr v1: bare symbols
group_count(SEX)
group_desc(AGE)
# tplyr2: quoted strings
group_count("SEX")
group_desc("AGE")
# tplyr2 exception: where still takes a bare expression
group_count("SEX", where = SAFFL == "Y")
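
Because variable names are plain strings, layers can be generated from a character vector. The sketch below is one way this might look; the count_vars vector is hypothetical, RACE is assumed to be present in the data you later build against, and passing a list of layers through do.call() assumes tplyr_layers() takes its layers as ordinary arguments, as in the examples in this vignette.

```r
library(tplyr2)

# Hypothetical list of variables to summarize
count_vars <- c("SEX", "RACE")

# One count layer per variable name, built from plain strings
count_layers <- lapply(count_vars, function(v) group_count(v))

# do.call() spreads the list into tplyr_layers(layer1, layer2, ...)
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = do.call(tplyr_layers, count_layers)
)
```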

Settings Are Collected in an Object

In v1, you configure layers by piping modifier functions:

# Tplyr v1: piped modifiers
group_count(RACE) %>%
  set_format_strings(f_str("xx (xx.x%)", n, pct)) %>%
  set_distinct_by(USUBJID) %>%
  set_denoms_by(TRT01P)

In tplyr2, all configuration lives in a single layer_settings() object:

# tplyr2: declarative settings
group_count("RACE",
  settings = layer_settings(
    format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct")),
    distinct_by = "USUBJID",
    denoms_by = "TRT01P"
  )
)
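
Since layer_settings() returns an ordinary object, one plausible pattern is to define it once and pass it to several layers. This is a sketch under that assumption; the name count_settings is made up for illustration.

```r
library(tplyr2)

# Settings defined once as a regular R object (hypothetical name)
count_settings <- layer_settings(
  format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct")),
  distinct_by = "USUBJID",
  denoms_by = "TRT01P"
)

# The same settings object supplied to two count layers
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("RACE", settings = count_settings),
    group_count("SEX",  settings = count_settings)
  )
)
```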

Format Strings Use Quoted Variable Names

The f_str() function works the same way, but variable names are now strings. For desc layers, format strings are a named list (each name becomes a row label). For count layers, the list key is n_counts:

# v1: bare symbols                     # tplyr2: quoted strings
f_str("xx (xx.x%)", n, pct)            f_str("xx (xx.x%)", "n", "pct")
# Desc layer: named list of format strings
format_strings = list(
  "n"         = f_str("xxx", "n"),
  "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd")
)
# Count layer: key is n_counts
format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct"))

Side-by-Side Examples

Demographics Table

Tplyr v1

# Tplyr v1 approach (not evaluated)
tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>%
  add_layer(
    group_count(SEX, by = "Sex n (%)")
  ) %>%
  add_layer(
    group_desc(AGE, by = "Age (Years)") %>%
      set_format_strings(
        "n"         = f_str("xxx", n),
        "Mean (SD)" = f_str("xx.x (xx.xx)", mean, sd),
        "Median"    = f_str("xx.x", median),
        "Min, Max"  = f_str("xx, xx", min, max)
      )
  ) %>%
  build()

tplyr2

spec <- tplyr_spec(
  cols = "TRT01P",
  where = SAFFL == "Y",
  layers = tplyr_layers(
    group_count("SEX", by = "Sex n (%)"),
    group_desc("AGE",
      by = "Age (Years)",
      settings = layer_settings(
        format_strings = list(
          "n"         = f_str("xxx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"    = f_str("xx.x", "median"),
          "Min, Max"  = f_str("xx, xx", "min", "max")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
rowlabel1 | rowlabel2 | res1 | res2 | res3
Sex n (%) | F | 53 (61.6%) | 40 (47.6%) | 50 (59.5%)
Sex n (%) | M | 33 (38.4%) | 44 (52.4%) | 34 (40.5%)
Age (Years) | n | 86 | 84 | 84
Age (Years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29)
Age (Years) | Median | 76.0 | 76.0 | 77.5
Age (Years) | Min, Max | 52, 89 | 56, 88 | 51, 88

Adverse Events Table (Nested Counts)

Tplyr v1

# Tplyr v1 approach (not evaluated)
tplyr_table(adae, TRTA) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      set_format_strings(f_str("xxx (xx.x%)", distinct_n, distinct_pct)) %>%
      set_order_count_method("bycount") %>%
      set_ordering_cols("Xanomeline High Dose")
  ) %>%
  build()

tplyr2

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count(c("AEBODSYS", "AEDECOD"),
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xxx (xx.x%)", "distinct_n", "distinct_pct")
        ),
        order_count_method = "bycount",
        ordering_cols = "Xanomeline High Dose"
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae)
kable(head(result[, !grepl("^ord", names(result))], 15))
rowlabel1 | rowlabel2 | res1 | res2 | res3
CARDIAC DISORDERS |  | 4 (12.5%) | 6 (14.0%) | 5 (10.0%)
CARDIAC DISORDERS | ATRIAL FIBRILLATION | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( 2.0%)
CARDIAC DISORDERS | ATRIAL FLUTTER | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%)
CARDIAC DISORDERS | ATRIAL HYPERTROPHY | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%)
CARDIAC DISORDERS | BUNDLE BRANCH BLOCK RIGHT | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%)
CARDIAC DISORDERS | CARDIAC FAILURE CONGESTIVE | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%)
CARDIAC DISORDERS | MYOCARDIAL INFARCTION | 0 ( 0.0%) | 1 ( 2.3%) | 2 ( 4.0%)
CARDIAC DISORDERS | SINUS BRADYCARDIA | 0 ( 0.0%) | 3 ( 7.0%) | 1 ( 2.0%)
CARDIAC DISORDERS | SUPRAVENTRICULAR EXTRASYSTOLES | 1 ( 3.1%) | 0 ( 0.0%) | 1 ( 2.0%)
CARDIAC DISORDERS | SUPRAVENTRICULAR TACHYCARDIA | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( 2.0%)
CARDIAC DISORDERS | TACHYCARDIA | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%)
CARDIAC DISORDERS | VENTRICULAR EXTRASYSTOLES | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%)
CONGENITAL, FAMILIAL AND GENETIC DISORDERS |  | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%)
CONGENITAL, FAMILIAL AND GENETIC DISORDERS | VENTRICULAR SEPTAL DEFECT | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%)
GASTROINTESTINAL DISORDERS |  | 6 (18.8%) | 4 ( 9.3%) | 3 ( 6.0%)
GASTROINTESTINAL DISORDERS 6 (18.8%) 4 ( 9.3%) 3 ( 6.0%)

Note how vars(AEBODSYS, AEDECOD) becomes c("AEBODSYS", "AEDECOD"), piped modifiers become arguments in layer_settings(), and data is supplied at build time.

New Features in tplyr2

Beyond the API redesign, tplyr2 introduces several entirely new capabilities.

Spec Serialization

Specs can be saved to disk as JSON or YAML and loaded later, enabling centralized spec authoring with distributed execution:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX"),
    group_desc("AGE")
  )
)

tmp <- tempfile(fileext = ".json")
tplyr_write_spec(spec, tmp)
spec_loaded <- tplyr_read_spec(tmp)
spec_loaded
#> tplyr2 table specification
#> Column variables: TRT01P
#> Layers: 2
#> [1] count: SEX (Layer 1)
#> [2] desc: AGE (Layer 2)
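
One way to gain confidence in a round trip is to build from both the original and the reloaded spec and compare the results. The sketch below uses only the functions shown above plus base all.equal(); the names res_original and res_loaded are illustrative, and no particular comparison outcome is asserted here.

```r
library(tplyr2)

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX"),
    group_desc("AGE")
  )
)

# Write the spec to a temporary JSON file and read it back
tmp <- tempfile(fileext = ".json")
tplyr_write_spec(spec, tmp)
spec_loaded <- tplyr_read_spec(tmp)

# Build from both specs against the same data and compare
res_original <- tplyr_build(spec, tplyr_adsl)
res_loaded   <- tplyr_build(spec_loaded, tplyr_adsl)
all.equal(res_original, res_loaded)
```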

Custom Analysis Layers

group_analyze() accepts a user-defined function for arbitrary computations. The function receives each group’s data subset and returns a data.frame of numeric results:

custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd   = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_analyze("AGE", analyze_fn = custom_fn, settings = layer_settings(
      format_strings = list(
        "Geometric Mean (SD)" = f_str("xx.xx (xx.xx)", "geo_mean", "geo_sd")
      )
    ))
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
rowlabel1 | res1 | res2 | res3
Geometric Mean (SD) | 74.70 ( 1.13) | 73.94 ( 1.12) | 75.18 ( 1.12)
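
Because the analysis function is plain R, it can be checked on its own before it goes into a spec. The sketch below repeats custom_fn so the block runs standalone and calls it on a small made-up data frame whose logs are evenly spaced, which makes the expected values easy to verify by hand.

```r
# Same function as above, repeated so this block is self-contained
custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd   = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}

# Toy input (hypothetical values): log(1), log(10), log(100) are 0, a, 2a
toy <- data.frame(AGE = c(1, 10, 100))

out <- custom_fn(toy, "AGE")
out
# geo_mean and geo_sd are both 10, up to floating-point error
```

The result is a one-row data.frame of numeric columns, which is the shape described above for group_analyze() functions.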

Cell-Level Metadata

When metadata = TRUE is passed to tplyr_build(), every cell carries metadata tracing back to source data rows for auditability:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
row_ids <- generate_row_ids(result)

# Inspect the metadata for one cell, then retrieve its source rows
tplyr_meta_result(result, row_ids[1], "res1")
#> tplyr_meta [layer 1]
#>   Names: TRT01P, SEX
#>   Filters:
#>     TRT01P == "Placebo"
#>     SEX == "F"
source_rows <- tplyr_meta_subset(result, row_ids[1], "res1", tplyr_adsl)
kable(head(source_rows[, c("USUBJID", "SEX", "TRT01P")]))
USUBJID | SEX | TRT01P
01-701-1015 | F | Placebo
01-701-1047 | F | Placebo
01-701-1153 | F | Placebo
01-701-1203 | F | Placebo
01-701-1345 | F | Placebo
01-701-1363 | F | Placebo
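
A simple audit check along these lines is to compare the number of traced source rows with an independent base R count using the same filters. This is a sketch that assumes, as in the metadata printed above, that the first row id and res1 correspond to Placebo and SEX == "F".

```r
library(tplyr2)

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result  <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
row_ids <- generate_row_ids(result)

# Source rows behind the first row's res1 cell
source_rows <- tplyr_meta_subset(result, row_ids[1], "res1", tplyr_adsl)

# Count via the metadata trace
n_traced <- nrow(source_rows)

# Independent count using the filters shown in the metadata
n_manual <- sum(tplyr_adsl$TRT01P == "Placebo" & tplyr_adsl$SEX == "F", na.rm = TRUE)

c(traced = n_traced, manual = n_manual)
```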

ARD Conversion and Numeric Data

tplyr_to_ard() converts results into long-format Analysis Results Data. tplyr_numeric_data() provides raw unformatted numbers for validation:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl)

kable(head(tplyr_to_ard(result), 10))
analysis_id | TRT01P | SEX | stat_name | stat_value
1 | Placebo | F | n | 53.00000
1 | Placebo | M | n | 33.00000
1 | Xanomeline High Dose | F | n | 40.00000
1 | Xanomeline High Dose | M | n | 44.00000
1 | Xanomeline Low Dose | F | n | 50.00000
1 | Xanomeline Low Dose | M | n | 34.00000
1 | Placebo | F | pct | 61.62791
1 | Placebo | M | pct | 38.37209
1 | Xanomeline High Dose | F | pct | 47.61905
1 | Xanomeline High Dose | M | pct | 52.38095
kable(tplyr_numeric_data(result, layer = 1))
TRT01P | SEX | n | pct | total
Placebo | F | 53 | 61.62791 | 86
Placebo | M | 33 | 38.37209 | 86
Xanomeline High Dose | F | 40 | 47.61905 | 84
Xanomeline High Dose | M | 44 | 52.38095 | 84
Xanomeline Low Dose | F | 50 | 59.52381 | 84
Xanomeline Low Dose | M | 34 | 40.47619 | 84
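
To illustrate the kind of validation the unformatted numbers allow, the base R sketch below re-derives pct from n and total. The values are copied by hand from the numeric-data table above rather than pulled from a build, so the block runs without the package.

```r
# Values copied from the numeric-data table above
numeric_data <- data.frame(
  TRT01P = rep(c("Placebo", "Xanomeline High Dose", "Xanomeline Low Dose"), each = 2),
  SEX    = rep(c("F", "M"), times = 3),
  n      = c(53, 33, 40, 44, 50, 34),
  total  = c(86, 86, 84, 84, 84, 84)
)

# Percentage is the count divided by the treatment-group total
numeric_data$pct_check <- 100 * numeric_data$n / numeric_data$total

round(numeric_data$pct_check, 5)
# 61.62791 38.37209 47.61905 52.38095 59.52381 40.47619
```

The recomputed values agree with the pct column reported above.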

Performance

Beyond the API improvements, tplyr2 is also faster than Tplyr v1, thanks to a data.table backend replacing the tidyverse internals. Median speedups in the benchmarks below range from 3.0x for multi-layer tables to 187.6x for shift layers. The tables below summarize benchmarks run across five layer types at data scales from 1x to 500x the base dataset sizes, with and without metadata enabled.

Overall Speedup by Layer Type

Layer Type | Avg Speedup | Median Speedup | Min | Max
Shift | 176.0x | 187.6x | 6.4x | 388.5x
Desc | 8.0x | 7.1x | 4.1x | 18.3x
Count | 4.0x | 3.1x | 2.5x | 12.6x
Nested Count | 3.8x | 3.1x | 2.8x | 7.3x
Multi-Layer | 2.8x | 3.0x | 1.2x | 4.0x

The grand median speedup across all scenarios is 4.0x.

Detailed Results

Count Layers

Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta)
1x | 0.05s | 0.01s | 4.5x | 0.17s | 0.01s | 12.6x
10x | 0.12s | 0.03s | 3.8x | 0.20s | 0.06s | 3.2x
50x | 0.47s | 0.16s | 3.0x | 0.80s | 0.32s | 2.5x
100x | 0.99s | 0.32s | 3.1x | 1.86s | 0.63s | 2.9x
250x | 2.40s | 0.74s | 3.2x | 4.28s | 1.50s | 2.9x
500x | 5.13s | 1.43s | 3.6x | 8.92s | 3.29s | 2.7x
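
To show how the Count row of the overall table relates to these detailed results, the base R sketch below summarizes the twelve speedup values from the table above. It works from the rounded figures as printed, so small differences from the published summary (for example in the median) are to be expected.

```r
# Speedup values copied from the Count Layers table (without and with metadata)
speedup_plain <- c(4.5, 3.8, 3.0, 3.1, 3.2, 3.6)
speedup_meta  <- c(12.6, 3.2, 2.5, 2.9, 2.9, 2.7)
count_speedups <- c(speedup_plain, speedup_meta)

# Summary statistics across all twelve scenarios
summary_row <- c(
  avg    = mean(count_speedups),
  median = median(count_speedups),
  min    = min(count_speedups),
  max    = max(count_speedups)
)

round(summary_row, 2)
# avg 4.00, median 3.15, min 2.50, max 12.60
```

The average, minimum, and maximum match the Count row of the overall table; the median from rounded inputs is 3.15 against the reported 3.1x.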

Nested Count Layers

Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta)
1x | 0.21s | 0.05s | 4.0x | 0.21s | 0.03s | 7.3x
10x | 0.34s | 0.11s | 3.2x | 0.41s | 0.14s | 3.0x
50x | 0.89s | 0.31s | 2.9x | 1.40s | 0.50s | 2.8x
100x | 1.78s | 0.61s | 2.9x | 2.86s | 0.95s | 3.0x
250x | 5.24s | 1.23s | 4.3x | 8.00s | 2.82s | 2.8x
500x | 14.61s | 2.58s | 5.7x | 19.64s | 4.96s | 4.0x

Descriptive Statistics Layers

Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta)
1x | 0.07s | 0.01s | 5.9x | 0.23s | 0.01s | 18.3x
10x | 0.18s | 0.02s | 7.2x | 0.20s | 0.04s | 5.4x
50x | 0.70s | 0.08s | 8.5x | 0.83s | 0.11s | 7.6x
100x | 1.43s | 0.14s | 10.1x | 1.61s | 0.23s | 6.9x
250x | 3.23s | 0.35s | 9.3x | 3.65s | 0.56s | 6.5x
500x | 6.27s | 1.54s | 4.1x | 7.74s | 1.21s | 6.4x

Shift Layers

Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta)
1x | 0.06s | 0.01s | 6.4x | 0.08s | 0.01s | 11.0x
10x | 0.29s | 0.01s | 34.4x | 0.43s | 0.01s | 49.7x
50x | 1.20s | 0.01s | 110.4x | 2.11s | 0.01s | 158.8x
100x | 2.63s | 0.01s | 216.4x | 4.53s | 0.02s | 231.8x
250x | 6.69s | 0.03s | 265.9x | 10.70s | 0.04s | 305.2x
500x | 13.87s | 0.04s | 388.5x | 21.91s | 0.07s | 333.8x

The benchmark script (benchmark_comparison.R) is included in the package repository root for reproducibility.

Summary

The migration from Tplyr v1 to tplyr2 involves three main shifts:

  1. Declare, don’t pipe. Replace tplyr_table() %>% add_layer() %>% build() with tplyr_spec() + tplyr_build(spec, data).

  2. Quote your variable names. Bare symbols like AGE become "AGE". The where parameter is the only place bare expressions are still used.

  3. Collect settings in one place. Instead of piping set_*() calls, pass a layer_settings() object to the settings parameter of each layer.

The output structure – rowlabel columns, res columns with label attributes, and ord columns for sorting – remains the same. Existing downstream code that consumes tplyr output should work with minimal changes.