Migrating from Tplyr v1
Introduction
Both Tplyr v1 and tplyr2 build formatted clinical summary tables from ADaM-style data, but they take fundamentally different approaches.
Tplyr v1 uses an imperative, piped workflow: create
a table object holding data and configuration, pipe in layers, set
options with modifier functions, then call build().
tplyr2 uses a declarative, spec-based approach:
build a tplyr_spec() that is pure configuration (no data,
no side effects), then supply data at build time via
tplyr_build(spec, data). This separation makes specs
portable, serializable, and reusable across datasets and studies.
This vignette covers the key differences with side-by-side examples.
Quick Reference: Function Mapping
The table below maps v1 functions to their tplyr2 equivalents.
| Tplyr v1 | tplyr2 | Notes |
|---|---|---|
| tplyr_table(data, treat_var) | tplyr_spec(cols = "treat_var") | Data at build time, not in the spec |
| add_layer() | tplyr_layers() inside tplyr_spec() | Declarative layer collection |
| group_count(target_var) | group_count("target_var") | Variable names are quoted strings |
| group_desc(target_var) | group_desc("target_var") | Variable names are quoted strings |
| group_shift(vars) | group_shift(c(row = "v1", column = "v2")) | Named character vector |
| set_format_strings() | format_strings in layer_settings() | Nested in settings object |
| set_distinct_by() | distinct_by in layer_settings() | Character string |
| set_denoms_by() | denoms_by in layer_settings() | Character vector |
| set_where() | where parameter in layer or spec | Bare expression (unquoted) |
| add_total_group() | total_group() in spec’s total_groups | Spec-level config |
| set_pop_data() | pop_data() in spec + tplyr_build() | Config in spec, data at build |
| build() | tplyr_build(spec, data) | Data supplied at build time |
| f_str() | f_str() | Variable names are now quoted strings |
Key Differences in Detail
Data Is Separated from Configuration
In v1, data lives inside the table object from the moment you create it:
# Tplyr v1: data bound at table creation
t <- tplyr_table(adsl, TRT01P)
In tplyr2, the spec knows nothing about data. You supply data only when you are ready to build:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX")
  )
)
# Data provided at build time
result <- tplyr_build(spec, tplyr_adsl)
This means the same spec can be applied to different datasets without modifying the spec itself.
Variable Names Are Quoted Strings
In v1, variable names are bare symbols. In tplyr2, they are character
strings. This is required for JSON/YAML serialization and simpler
programmatic construction. The one exception is where,
which accepts bare expressions:
# Tplyr v1: bare symbols
group_count(SEX)
group_desc(AGE)
# tplyr2: quoted strings
group_count("SEX")
group_desc("AGE")
# where is the exception and still takes a bare expression
group_count("SEX", where = SAFFL == "Y")
Settings Are Collected in an Object
In v1, you configure layers by piping modifier functions:
# Tplyr v1: piped modifiers
group_count(RACE) %>%
  set_format_strings(f_str("xx (xx.x%)", n, pct)) %>%
  set_distinct_by(USUBJID) %>%
  set_denoms_by(TRT01P)
In tplyr2, all configuration lives in a single layer_settings() object:
# tplyr2: declarative settings
group_count("RACE",
  settings = layer_settings(
    format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct")),
    distinct_by = "USUBJID",
    denoms_by = "TRT01P"
  )
)
Format Strings Use Quoted Variable Names
The f_str() function works the same way as in v1, but variable
names are now strings. For desc layers, format strings are a named list
in which each name becomes a row label. For count layers, the list key is
n_counts. The side-by-side examples in the next section show both forms.
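As a rough mental model, and not a description of how tplyr2 implements formatting, each run of x characters in a format string reserves a field width and a decimal precision, much like a sprintf() template. The sketch below uses base R only and borrows values from the demographics table later in this vignette. The mapping from x placeholders to sprintf() widths is an assumption made for illustration.

```r
# Approximate f_str() templates with base R sprintf().
# Assumption: "xx" reserves 2 integer characters and "xx.x" reserves
# 2 integer characters plus 1 decimal place, padded with spaces.

# "xx (xx.x%)" applied to a count and a percentage
count_cell <- sprintf("%2d (%4.1f%%)", 53L, 61.62791)

# "xx.x (xx.xx)" applied to a mean and a standard deviation
desc_cell <- sprintf("%4.1f (%5.2f)", 75.2, 8.59)

# Small values are left-padded so that cells keep a constant width
zero_cell <- sprintf("%2d (%4.1f%%)", 0L, 0)

print(c(count_cell, desc_cell, zero_cell))
```

The first two strings match the formatted cells "53 (61.6%)" and "75.2 ( 8.59)" shown in the demographics output, which is why the padding inside the parentheses appears there.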
Side-by-Side Examples
Demographics Table
Tplyr v1
# Tplyr v1 approach (not evaluated)
tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>%
  add_layer(
    group_count(SEX, by = "Sex n (%)")
  ) %>%
  add_layer(
    group_desc(AGE, by = "Age (Years)") %>%
      set_format_strings(
        "n" = f_str("xxx", n),
        "Mean (SD)" = f_str("xx.x (xx.xx)", mean, sd),
        "Median" = f_str("xx.x", median),
        "Min, Max" = f_str("xx, xx", min, max)
      )
  ) %>%
  build()
tplyr2
spec <- tplyr_spec(
  cols = "TRT01P",
  where = SAFFL == "Y",
  layers = tplyr_layers(
    group_count("SEX", by = "Sex n (%)"),
    group_desc("AGE",
      by = "Age (Years)",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xxx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median" = f_str("xx.x", "median"),
          "Min, Max" = f_str("xx, xx", "min", "max")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Sex n (%) | F | 53 (61.6%) | 40 (47.6%) | 50 (59.5%) |
| Sex n (%) | M | 33 (38.4%) | 44 (52.4%) | 34 (40.5%) |
| Age (Years) | n | 86 | 84 | 84 |
| Age (Years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Age (Years) | Median | 76.0 | 76.0 | 77.5 |
| Age (Years) | Min, Max | 52, 89 | 56, 88 | 51, 88 |
Adverse Events Table (Nested Counts)
Tplyr v1
# Tplyr v1 approach (not evaluated)
tplyr_table(adae, TRTA) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      set_format_strings(f_str("xxx (xx.x%)", distinct_n, distinct_pct)) %>%
      set_order_count_method("bycount") %>%
      set_ordering_cols("Xanomeline High Dose")
  ) %>%
  build()
tplyr2
spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count(c("AEBODSYS", "AEDECOD"),
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xxx (xx.x%)", "distinct_n", "distinct_pct")
        ),
        order_count_method = "bycount",
        ordering_cols = "Xanomeline High Dose"
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adae)
kable(head(result[, !grepl("^ord", names(result))], 15))
| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| CARDIAC DISORDERS | | 4 (12.5%) | 6 (14.0%) | 5 (10.0%) |
| CARDIAC DISORDERS | ATRIAL FIBRILLATION | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( 2.0%) |
| CARDIAC DISORDERS | ATRIAL FLUTTER | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%) |
| CARDIAC DISORDERS | ATRIAL HYPERTROPHY | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%) |
| CARDIAC DISORDERS | BUNDLE BRANCH BLOCK RIGHT | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%) |
| CARDIAC DISORDERS | CARDIAC FAILURE CONGESTIVE | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%) |
| CARDIAC DISORDERS | MYOCARDIAL INFARCTION | 0 ( 0.0%) | 1 ( 2.3%) | 2 ( 4.0%) |
| CARDIAC DISORDERS | SINUS BRADYCARDIA | 0 ( 0.0%) | 3 ( 7.0%) | 1 ( 2.0%) |
| CARDIAC DISORDERS | SUPRAVENTRICULAR EXTRASYSTOLES | 1 ( 3.1%) | 0 ( 0.0%) | 1 ( 2.0%) |
| CARDIAC DISORDERS | SUPRAVENTRICULAR TACHYCARDIA | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( 2.0%) |
| CARDIAC DISORDERS | TACHYCARDIA | 1 ( 3.1%) | 0 ( 0.0%) | 0 ( 0.0%) |
| CARDIAC DISORDERS | VENTRICULAR EXTRASYSTOLES | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%) |
| CONGENITAL, FAMILIAL AND GENETIC DISORDERS | | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%) |
| CONGENITAL, FAMILIAL AND GENETIC DISORDERS | VENTRICULAR SEPTAL DEFECT | 0 ( 0.0%) | 1 ( 2.3%) | 0 ( 0.0%) |
| GASTROINTESTINAL DISORDERS | | 6 (18.8%) | 4 ( 9.3%) | 3 ( 6.0%) |
Note how vars(AEBODSYS, AEDECOD) becomes
c("AEBODSYS", "AEDECOD"), piped modifiers become arguments
in layer_settings(), and data is supplied at build.
New Features in tplyr2
Beyond the API redesign, tplyr2 introduces several entirely new capabilities.
Spec Serialization
Specs can be saved to disk as JSON or YAML and loaded later, enabling centralized spec authoring with distributed execution:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX"),
    group_desc("AGE")
  )
)
tmp <- tempfile(fileext = ".json")
tplyr_write_spec(spec, tmp)
spec_loaded <- tplyr_read_spec(tmp)
spec_loaded
#> tplyr2 table specification
#> Column variables: TRT01P
#> Layers: 2
#> [1] count: SEX (Layer 1)
#> [2] desc: AGE (Layer 2)
Custom Analysis Layers
group_analyze() accepts a user-defined function for
arbitrary computations. The function receives each group’s data subset
and returns a data.frame of numeric results:
custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_analyze("AGE", analyze_fn = custom_fn, settings = layer_settings(
      format_strings = list(
        "Geometric Mean (SD)" = f_str("xx.xx (xx.xx)", "geo_mean", "geo_sd")
      )
    ))
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| Geometric Mean (SD) | 74.70 ( 1.13) | 73.94 ( 1.12) | 75.18 ( 1.12) |
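Because the analysis function is plain R, it can be checked on its own before it is wired into a spec. The sketch below repeats custom_fn from above so that it runs standalone, and calls it on a small hypothetical data frame (toy_group and its AVAL column are made up for illustration) that includes a zero and a missing value.

```r
# Same function as in the spec above, repeated so this block runs on its own
custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}

# Hypothetical toy input standing in for one group's rows; the zero and
# the NA should not contribute to either statistic
toy_group <- data.frame(AVAL = c(1, 10, 100, 0, NA))

check <- custom_fn(toy_group, "AVAL")
print(check)
```

For the values 1, 10 and 100 both the geometric mean and the geometric SD come out as 10, and the result is a one-row data frame whose column names match the names referenced in the f_str() call.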
Cell-Level Metadata
When metadata = TRUE is passed to
tplyr_build(), every cell carries metadata tracing back to
source data rows for auditability:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
row_ids <- generate_row_ids(result)
# Inspect the metadata for one cell, then retrieve its source rows
tplyr_meta_result(result, row_ids[1], "res1")
#> tplyr_meta [layer 1]
#> Names: TRT01P, SEX
#> Filters:
#> TRT01P == "Placebo"
#> SEX == "F"
source_rows <- tplyr_meta_subset(result, row_ids[1], "res1", tplyr_adsl)
kable(head(source_rows[, c("USUBJID", "SEX", "TRT01P")]))
| USUBJID | SEX | TRT01P |
|---|---|---|
| 01-701-1015 | F | Placebo |
| 01-701-1047 | F | Placebo |
| 01-701-1153 | F | Placebo |
| 01-701-1203 | F | Placebo |
| 01-701-1345 | F | Placebo |
| 01-701-1363 | F | Placebo |
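Conceptually, the filters printed in the metadata are ordinary logical expressions, and the source rows are the rows for which all of them hold. The base R sketch below illustrates that idea on a hypothetical toy data frame (adsl_toy and its values are made up). It is only an illustration of the concept and makes no claim about how tplyr_meta_subset() is implemented.

```r
# Hypothetical stand-in for an ADSL-like dataset; values are made up
adsl_toy <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4"),
  SEX = c("F", "M", "F", "F"),
  TRT01P = c("Placebo", "Placebo", "Xanomeline High Dose", "Placebo"),
  stringsAsFactors = FALSE
)

# The two filters shown in the metadata output, stored as unevaluated calls
filters <- list(
  quote(TRT01P == "Placebo"),
  quote(SEX == "F")
)

# Evaluate each filter against the data, then keep rows passing all of them
passes <- lapply(filters, function(f) eval(f, adsl_toy))
keep <- Reduce(`&`, passes)
source_rows_toy <- adsl_toy[keep, ]

print(source_rows_toy)
```

Only subjects S1 and S4 satisfy both conditions, so they are the rows that would sit behind the Placebo / F cell in this toy example.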
ARD Conversion and Numeric Data
tplyr_to_ard() converts results into long-format
Analysis Results Data. tplyr_numeric_data() provides raw
unformatted numbers for validation:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl)
kable(head(tplyr_to_ard(result), 10))
| analysis_id | TRT01P | SEX | stat_name | stat_value |
|---|---|---|---|---|
| 1 | Placebo | F | n | 53.00000 |
| 1 | Placebo | M | n | 33.00000 |
| 1 | Xanomeline High Dose | F | n | 40.00000 |
| 1 | Xanomeline High Dose | M | n | 44.00000 |
| 1 | Xanomeline Low Dose | F | n | 50.00000 |
| 1 | Xanomeline Low Dose | M | n | 34.00000 |
| 1 | Placebo | F | pct | 61.62791 |
| 1 | Placebo | M | pct | 38.37209 |
| 1 | Xanomeline High Dose | F | pct | 47.61905 |
| 1 | Xanomeline High Dose | M | pct | 52.38095 |
kable(tplyr_numeric_data(result, layer = 1))
| TRT01P | SEX | n | pct | total |
|---|---|---|---|---|
| Placebo | F | 53 | 61.62791 | 86 |
| Placebo | M | 33 | 38.37209 | 86 |
| Xanomeline High Dose | F | 40 | 47.61905 | 84 |
| Xanomeline High Dose | M | 44 | 52.38095 | 84 |
| Xanomeline Low Dose | F | 50 | 59.52381 | 84 |
| Xanomeline Low Dose | M | 34 | 40.47619 | 84 |
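The two outputs above carry the same numbers in different shapes. As a small validation sketch in base R, the counts and denominators from the numeric table can be typed in by hand, the percentages recomputed as n / total, and the result stacked into a long layout resembling the ARD rows. This mimics the shape only; it covers just the n and pct statistics shown above and does not reproduce what tplyr_to_ard() does internally.

```r
# Counts and denominators copied from the numeric data table above
num <- data.frame(
  TRT01P = rep(c("Placebo", "Xanomeline High Dose", "Xanomeline Low Dose"),
               each = 2),
  SEX = rep(c("F", "M"), times = 3),
  n = c(53, 33, 40, 44, 50, 34),
  total = c(86, 86, 84, 84, 84, 84),
  stringsAsFactors = FALSE
)

# Percentages are the count divided by the denominator, times 100
num$pct <- 100 * num$n / num$total

# Stack one block of rows per statistic to get a long layout
long <- rbind(
  data.frame(num[c("TRT01P", "SEX")], stat_name = "n", stat_value = num$n),
  data.frame(num[c("TRT01P", "SEX")], stat_name = "pct", stat_value = num$pct)
)

print(long)
```

The recomputed percentages agree with the pct column above to the displayed precision, for example 53 / 86 gives 61.62791.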
Performance
Beyond the API improvements, tplyr2 delivers significant performance gains over Tplyr v1 thanks to a data.table backend replacing the tidyverse internals. The table below summarizes benchmarks run across five layer types at data scales from 1x to 500x the base dataset sizes, with and without metadata enabled.
Overall Speedup by Layer Type
| Layer Type | Avg Speedup | Median Speedup | Min | Max |
|---|---|---|---|---|
| Shift | 176.0x | 187.6x | 6.4x | 388.5x |
| Desc | 8.0x | 7.1x | 4.1x | 18.3x |
| Count | 4.0x | 3.1x | 2.5x | 12.6x |
| Nested Count | 3.8x | 3.1x | 2.8x | 7.3x |
| Multi-Layer | 2.8x | 3.0x | 1.2x | 4.0x |
The grand median speedup across all scenarios is 4.0x.
Detailed Results
Count Layers
| Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta) |
|---|---|---|---|---|---|---|
| 1x | 0.05s | 0.01s | 4.5x | 0.17s | 0.01s | 12.6x |
| 10x | 0.12s | 0.03s | 3.8x | 0.20s | 0.06s | 3.2x |
| 50x | 0.47s | 0.16s | 3.0x | 0.80s | 0.32s | 2.5x |
| 100x | 0.99s | 0.32s | 3.1x | 1.86s | 0.63s | 2.9x |
| 250x | 2.40s | 0.74s | 3.2x | 4.28s | 1.50s | 2.9x |
| 500x | 5.13s | 1.43s | 3.6x | 8.92s | 3.29s | 2.7x |
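The Count row of the summary table can be cross-checked against the twelve speedup values in the table above (six scales, with and without metadata). The following base R sketch does that arithmetic; it only uses the rounded figures printed in this vignette, not the raw benchmark timings.

```r
# Speedup values copied from the Count Layers table above
count_speedup <- c(
  4.5, 3.8, 3.0, 3.1, 3.2, 3.6,   # without metadata, 1x to 500x
  12.6, 3.2, 2.5, 2.9, 2.9, 2.7   # with metadata, 1x to 500x
)

# Recompute the summary statistics reported for the Count layer type
count_summary <- c(
  avg = round(mean(count_speedup), 1),
  min = min(count_speedup),
  max = max(count_speedup)
)

print(count_summary)
```

The average, minimum and maximum come out as 4.0, 2.5 and 12.6, matching the summary row. Dividing the displayed timings does not reproduce the speedup column exactly (0.05s / 0.01s is 5, not 4.5), presumably because the ratios were computed from unrounded timings.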
Nested Count Layers
| Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta) |
|---|---|---|---|---|---|---|
| 1x | 0.21s | 0.05s | 4.0x | 0.21s | 0.03s | 7.3x |
| 10x | 0.34s | 0.11s | 3.2x | 0.41s | 0.14s | 3.0x |
| 50x | 0.89s | 0.31s | 2.9x | 1.40s | 0.50s | 2.8x |
| 100x | 1.78s | 0.61s | 2.9x | 2.86s | 0.95s | 3.0x |
| 250x | 5.24s | 1.23s | 4.3x | 8.00s | 2.82s | 2.8x |
| 500x | 14.61s | 2.58s | 5.7x | 19.64s | 4.96s | 4.0x |
Descriptive Statistics Layers
| Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta) |
|---|---|---|---|---|---|---|
| 1x | 0.07s | 0.01s | 5.9x | 0.23s | 0.01s | 18.3x |
| 10x | 0.18s | 0.02s | 7.2x | 0.20s | 0.04s | 5.4x |
| 50x | 0.70s | 0.08s | 8.5x | 0.83s | 0.11s | 7.6x |
| 100x | 1.43s | 0.14s | 10.1x | 1.61s | 0.23s | 6.9x |
| 250x | 3.23s | 0.35s | 9.3x | 3.65s | 0.56s | 6.5x |
| 500x | 6.27s | 1.54s | 4.1x | 7.74s | 1.21s | 6.4x |
Shift Layers
| Scale | Tplyr v1 | tplyr2 | Speedup | Tplyr v1 (meta) | tplyr2 (meta) | Speedup (meta) |
|---|---|---|---|---|---|---|
| 1x | 0.06s | 0.01s | 6.4x | 0.08s | 0.01s | 11.0x |
| 10x | 0.29s | 0.01s | 34.4x | 0.43s | 0.01s | 49.7x |
| 50x | 1.20s | 0.01s | 110.4x | 2.11s | 0.01s | 158.8x |
| 100x | 2.63s | 0.01s | 216.4x | 4.53s | 0.02s | 231.8x |
| 250x | 6.69s | 0.03s | 265.9x | 10.70s | 0.04s | 305.2x |
| 500x | 13.87s | 0.04s | 388.5x | 21.91s | 0.07s | 333.8x |
The benchmark script (benchmark_comparison.R) is
included in the package repository root for reproducibility.
Summary
The migration from Tplyr v1 to tplyr2 involves three main shifts:
1. Declare, don’t pipe. Replace tplyr_table() %>% add_layer() %>% build() with tplyr_spec() + tplyr_build(spec, data).
2. Quote your variable names. Bare symbols like AGE become "AGE". The where parameter is the only place bare expressions are still used.
3. Collect settings in one place. Instead of piping set_*() calls, pass a layer_settings() object to the settings parameter of each layer.
The output structure – rowlabel columns,
res columns with label attributes, and ord
columns for sorting – remains the same. Existing downstream code that
consumes tplyr output should work with minimal changes.