Migrating from Tplyr v1

Introduction

Both Tplyr v1 and tplyr2 build formatted clinical summary tables from ADaM-style data, but they take fundamentally different approaches.

Tplyr v1 uses an imperative, piped workflow: create a table object holding data and configuration, pipe in layers, set options with modifier functions, then call build().

tplyr2 uses a declarative, spec-based approach: build a tplyr_spec() that is pure configuration (no data, no side effects), then supply data at build time via tplyr_build(spec, data). This separation makes specs portable, serializable, and reusable across datasets and studies.

This vignette covers the key differences with side-by-side examples.

Quick Reference: Function Mapping

The table below maps v1 functions to their tplyr2 equivalents.

Tplyr v1	tplyr2	Notes
`tplyr_table(data, treat_var)`	`tplyr_spec(cols = "treat_var")`	Data at build time, not in the spec
`add_layer()`	`tplyr_layers()` inside `tplyr_spec()`	Declarative layer collection
`group_count(target_var)`	`group_count("target_var")`	Variable names are quoted strings
`group_desc(target_var)`	`group_desc("target_var")`	Variable names are quoted strings
`group_shift(vars)`	`group_shift(c(row = "v1", column = "v2"))`	Named character vector
`set_format_strings()`	`format_strings` in `layer_settings()`	Nested in settings object
`set_distinct_by()`	`distinct_by` in `layer_settings()`	Character string
`set_denoms_by()`	`denoms_by` in `layer_settings()`	Character vector
`set_where()`	`where` parameter in layer or spec	Bare expression (unquoted)
`add_total_group()`	`total_group()` in spec’s `total_groups`	Spec-level config
`add_total_row()`	`total_row = TRUE` in `layer_settings()`	Plus `total_row_label`
`set_missing_count()`	`missing_count` in `layer_settings()`	List config
`keep_levels()`	`keep_levels` in `layer_settings()`	Character vector
`set_pop_data()`	`pop_data()` in spec + `tplyr_build()`	Config in spec, data at build
`set_pop_treat_var()`	`pop_data(cols = c(...))` mapping	Maps analysis to population column
`add_risk_diff()`	`risk_diff` in `layer_settings()`	See `vignette("riskdiff")`
`build()`	`tplyr_build(spec, data)`	Data supplied at build time
`f_str()`	`f_str()`	Variable names are now quoted strings

Key Differences in Detail

Data Is Separated from Configuration

In v1, data lives inside the table object from the moment you create it:

# Tplyr v1: data bound at table creation
t <- tplyr_table(adsl, TRT01P)

In tplyr2, the spec knows nothing about data. You supply data only when you are ready to build:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX")
  )
)

# Data provided at build time
result <- tplyr_build(spec, tplyr_adsl)

This means the same spec can be applied to different datasets without modifying the spec itself.
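
As a minimal sketch of that reuse, the snippet below builds one spec and applies it to the full example dataset and to a subset of it. It uses only functions shown in this vignette; attaching the package with library(tplyr2) is an assumption about the package name.

```r
# Assumption: the package is attached under the name tplyr2
library(tplyr2)

# One spec: pure configuration, no data attached
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX")
  )
)

# The same spec applied to two different inputs
all_subjects <- tplyr_build(spec, tplyr_adsl)
safety_only  <- tplyr_build(spec, subset(tplyr_adsl, SAFFL == "Y"))
```

Neither call modifies spec, so it can be stored once and reused wherever the input data has the same column names.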

Variable Names Are Quoted Strings

In v1, variable names are bare symbols. In tplyr2, they are character strings. This is required for JSON/YAML serialization and simpler programmatic construction. The one exception is where, which accepts bare expressions:

# Tplyr v1: bare symbols
group_count(SEX)
group_desc(AGE)

# tplyr2: quoted strings
group_count("SEX")
group_desc("AGE")

# tplyr2 exception: where takes a bare (unquoted) expression
group_count("SEX", where = SAFFL == "Y")

Settings Are Collected in an Object

In v1, you configure layers by piping modifier functions:

# Tplyr v1: piped modifiers
group_count(RACE) %>%
  set_format_strings(f_str("xx (xx.x%)", n, pct)) %>%
  set_distinct_by(USUBJID) %>%
  set_denoms_by(TRT01P)

In tplyr2, all configuration lives in a single layer_settings() object:

# tplyr2: declarative settings
group_count("RACE",
  settings = layer_settings(
    format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct")),
    distinct_by = "USUBJID",
    denoms_by = "TRT01P"
  )
)

Format Strings Use Quoted Variable Names

The f_str() function works the same way, but variable names are now strings. For desc layers, format strings are a named list (each name becomes a row label). For count layers, the list key is n_counts:

# Tplyr v1: bare symbols
f_str("xx (xx.x%)", n, pct)
# tplyr2: quoted strings
f_str("xx (xx.x%)", "n", "pct")

# Desc layer: named list of format strings
format_strings = list(
  "n"         = f_str("xxx", "n"),
  "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd")
)
# Count layer: key is n_counts
format_strings = list(n_counts = f_str("xx (xx.x%)", "n", "pct"))
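
To see what a format string does to the numbers, the base R sketch below reproduces two cells from the demographics output further down using sprintf(). This is only an analogy, under the assumption that each run of x characters reserves a fixed-width, space-padded field; it is not how tplyr2 implements f_str().

```r
# Base R analogy only; tplyr2's f_str() is not implemented this way.
# Assumption: "xx" reserves 2 integer characters, "xx.x" reserves 4 characters
# with 1 decimal, and "xx.xx" reserves 5 with 2 decimals, left-padded with spaces.

# Count cell, analogous to f_str("xx (xx.x%)", "n", "pct")
count_cell <- sprintf("%2d (%4.1f%%)", 53L, 61.62791)

# Desc cell, analogous to f_str("xx.x (xx.xx)", "mean", "sd")
desc_cell <- sprintf("%4.1f (%5.2f)", 75.2, 8.59)

print(count_cell)  # "53 (61.6%)"
print(desc_cell)   # "75.2 ( 8.59)"
```

The padding is why a one-digit standard deviation still lines up under two-digit values in the same column.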

Side-by-Side Examples

Demographics Table

Tplyr v1

# Tplyr v1 approach (not evaluated)
tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>%
  add_layer(
    group_count(SEX, by = "Sex n (%)")
  ) %>%
  add_layer(
    group_desc(AGE, by = "Age (Years)") %>%
      set_format_strings(
        "n"         = f_str("xxx", n),
        "Mean (SD)" = f_str("xx.x (xx.xx)", mean, sd),
        "Median"    = f_str("xx.x", median),
        "Min, Max"  = f_str("xx, xx", min, max)
      )
  ) %>%
  build()

tplyr2

spec <- tplyr_spec(
  cols = "TRT01P",
  where = SAFFL == "Y",
  layers = tplyr_layers(
    group_count("SEX", by = "Sex n (%)"),
    group_desc("AGE",
      by = "Age (Years)",
      settings = layer_settings(
        format_strings = list(
          "n"         = f_str("xxx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"    = f_str("xx.x", "median"),
          "Min, Max"  = f_str("xx, xx", "min", "max")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
# Drop the ord* sorting columns for display; kable() comes from knitr
kable(result[, !grepl("^ord", names(result))])

rowlabel1	rowlabel2	res1	res2	res3
Sex n (%)	F	53 (61.6%)	40 (47.6%)	50 (59.5%)
Sex n (%)	M	33 (38.4%)	44 (52.4%)	34 (40.5%)
Age (Years)	n	86	84	84
Age (Years)	Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
Age (Years)	Median	76.0	76.0	77.5
Age (Years)	Min, Max	52, 89	56, 88	51, 88

Adverse Events Table (Nested Counts)

Tplyr v1

# Tplyr v1 approach (not evaluated)
tplyr_table(adae, TRTA) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      set_format_strings(f_str("xxx (xx.x%)", distinct_n, distinct_pct)) %>%
      set_order_count_method("bycount") %>%
      set_ordering_cols("Xanomeline High Dose")
  ) %>%
  build()

tplyr2

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count(c("AEBODSYS", "AEDECOD"),
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xxx (xx.x%)", "distinct_n", "distinct_pct")
        ),
        order_count_method = "bycount",
        ordering_cols = "Xanomeline High Dose"
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae)
kable(head(result[, !grepl("^ord", names(result))], 15))

rowlabel1	rowlabel2	res1	res2	res3
CARDIAC DISORDERS		4 (12.5%)	6 (14.0%)	5 (10.0%)
CARDIAC DISORDERS	SINUS BRADYCARDIA	0 ( 0.0%)	3 ( 7.0%)	1 ( 2.0%)
CARDIAC DISORDERS	ATRIAL FLUTTER	0 ( 0.0%)	1 ( 2.3%)	0 ( 0.0%)
CARDIAC DISORDERS	MYOCARDIAL INFARCTION	0 ( 0.0%)	1 ( 2.3%)	2 ( 4.0%)
CARDIAC DISORDERS	VENTRICULAR EXTRASYSTOLES	0 ( 0.0%)	1 ( 2.3%)	0 ( 0.0%)
CARDIAC DISORDERS	ATRIAL FIBRILLATION	0 ( 0.0%)	0 ( 0.0%)	1 ( 2.0%)
CARDIAC DISORDERS	ATRIAL HYPERTROPHY	1 ( 3.1%)	0 ( 0.0%)	0 ( 0.0%)
CARDIAC DISORDERS	BUNDLE BRANCH BLOCK RIGHT	1 ( 3.1%)	0 ( 0.0%)	0 ( 0.0%)
CARDIAC DISORDERS	CARDIAC FAILURE CONGESTIVE	1 ( 3.1%)	0 ( 0.0%)	0 ( 0.0%)
CARDIAC DISORDERS	SUPRAVENTRICULAR EXTRASYSTOLES	1 ( 3.1%)	0 ( 0.0%)	1 ( 2.0%)
CARDIAC DISORDERS	SUPRAVENTRICULAR TACHYCARDIA	0 ( 0.0%)	0 ( 0.0%)	1 ( 2.0%)
CARDIAC DISORDERS	TACHYCARDIA	1 ( 3.1%)	0 ( 0.0%)	0 ( 0.0%)
CONGENITAL, FAMILIAL AND GENETIC DISORDERS		0 ( 0.0%)	1 ( 2.3%)	0 ( 0.0%)
CONGENITAL, FAMILIAL AND GENETIC DISORDERS	VENTRICULAR SEPTAL DEFECT	0 ( 0.0%)	1 ( 2.3%)	0 ( 0.0%)
GASTROINTESTINAL DISORDERS		6 (18.8%)	4 ( 9.3%)	3 ( 6.0%)

Note how vars(AEBODSYS, AEDECOD) becomes c("AEBODSYS", "AEDECOD"), piped modifiers become arguments in layer_settings(), and data is supplied at build.

New Features in tplyr2

Beyond the API redesign, tplyr2 introduces several entirely new capabilities.

Comparative and Inferential Statistics

tplyr2 can attach cross-arm comparisons directly to a layer – something Tplyr v1 handled only through add_risk_diff():

Risk difference via risk_diff in layer_settings() (the direct successor to add_risk_diff()) – one rdiff column per comparison with a confidence interval. See vignette("riskdiff").
Association tests via assoc_test() – an omnibus p-value column (Fisher, chi-square, CMH, ANOVA/Kruskal on group_desc) or a pairwise per-level mode that emits a pval column per comparison, including on nested SOC/PT layers. See vignette("binding-statistics").
Single-proportion confidence intervals via the ci_lower/ci_upper f_str keywords with ci_method/ci_level (Clopper-Pearson, Wilson, and more). See vignette("denom").

Binding External Results and Display Helpers

apply_formats() gained na, width, and pad arguments for formatting externally computed statistics (e.g. MMRM/ANCOVA results) into fixed-width cells you can row-bind onto a table, and as_display() returns a render-ready frame (rowlabel*/res*/rdiff*/pval* only). group_desc() also adds an n_records statistic (records assessed), and shift layers gain a denom_row (the “n” denominator line). See vignette("binding-statistics") and vignette("post_processing").

Multi-Column Count Layouts

Count layers can display each statistic in its own column per treatment group – a layout Tplyr v1 could not produce. The stat_columns setting takes a named list of format strings; each entry becomes a separate result column (e.g. a distinct-subject “n (%)” column beside an event-count “E” column under every arm):

group_count("AEDECOD",
  settings = layer_settings(
    distinct_by = "USUBJID",
    stat_columns = list(
      "n (%)" = f_str("xxx (xx.x%)", "distinct_n", "distinct_pct"),
      "E"     = f_str("xxx", "n")
    )
  )
)

See vignette("count") for details.
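
The difference between the two columns comes down to what is counted. The base R sketch below uses a small made-up adverse event dataset (hypothetical, not from the package) to contrast event counts, which correspond to n, with distinct-subject counts, which correspond to distinct_n when distinct_by = "USUBJID".

```r
# Hypothetical toy data: one row per adverse event record
ae <- data.frame(
  USUBJID = c("01", "01", "02", "03", "03", "03"),
  AEDECOD = c("HEADACHE", "HEADACHE", "HEADACHE", "NAUSEA", "NAUSEA", "HEADACHE")
)

# Event count: every record counts (the "E" column)
events <- tapply(ae$USUBJID, ae$AEDECOD, length)

# Distinct-subject count: each subject counts once per term (the n in "n (%)")
subjects <- tapply(ae$USUBJID, ae$AEDECOD, function(ids) length(unique(ids)))

print(events)    # HEADACHE 4, NAUSEA 2
print(subjects)  # HEADACHE 3, NAUSEA 1
```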

Spec Serialization

Specs can be saved to disk as JSON or YAML and loaded later, enabling centralized spec authoring with distributed execution:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX"),
    group_desc("AGE")
  )
)

tmp <- tempfile(fileext = ".json")
tplyr_write_spec(spec, tmp)
spec_loaded <- tplyr_read_spec(tmp)
spec_loaded
#> tplyr2 table specification
#> Column variables: TRT01P
#> Layers: 2
#> [1] count: SEX (Layer 1)
#> [2] desc: AGE (Layer 2)

Custom Analysis Layers

group_analyze() accepts a user-defined function for arbitrary computations. The function receives each group’s data subset and returns a data.frame of numeric results:

custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd   = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_analyze("AGE", analyze_fn = custom_fn, settings = layer_settings(
      format_strings = list(
        "Geometric Mean (SD)" = f_str("xx.xx (xx.xx)", "geo_mean", "geo_sd")
      )
    ))
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	res1	res2	res3
Geometric Mean (SD)	74.70 ( 1.13)	73.94 ( 1.12)	75.18 ( 1.12)
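
Because the analysis function is plain R, it can be checked on its own before it goes into a spec. The sketch below repeats custom_fn from above and runs it on a small hypothetical data frame (the toy data and its VAL column are made up for illustration) where the answer is known: the values 1, 10, and 100 have a geometric mean of 10.

```r
# Same function as above: receives a data subset and the target variable name
custom_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  data.frame(
    geo_mean = exp(mean(log(vals[vals > 0]), na.rm = TRUE)),
    geo_sd   = exp(sd(log(vals[vals > 0]), na.rm = TRUE))
  )
}

# Hypothetical input; the zero is dropped by the vals > 0 filter
toy <- data.frame(VAL = c(1, 10, 100, 0))
out <- custom_fn(toy, "VAL")

print(out)  # one row: geo_mean = 10, geo_sd = 10
```

The column names of the returned data frame (geo_mean, geo_sd) are the names that f_str() refers to in the layer's format strings.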

Cell-Level Metadata

When metadata = TRUE is passed to tplyr_build(), every cell carries metadata tracing back to source data rows for auditability:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
row_ids <- generate_row_ids(result)

# Inspect the metadata for one cell, then retrieve its source rows
tplyr_meta_result(result, row_ids[1], "res1")
#> tplyr_meta [layer 1]
#>   Names: TRT01P, SEX
#>   Filters:
#>     TRT01P == "Placebo"
#>     SEX == "F"
source_rows <- tplyr_meta_subset(result, row_ids[1], "res1", tplyr_adsl)
kable(head(source_rows[, c("USUBJID", "SEX", "TRT01P")]))

USUBJID	SEX	TRT01P
01-701-1015	F	Placebo
01-701-1047	F	Placebo
01-701-1153	F	Placebo
01-701-1203	F	Placebo
01-701-1345	F	Placebo
01-701-1363	F	Placebo

ARD Conversion and Numeric Data

tplyr_to_ard() converts results into long-format Analysis Results Data. tplyr_numeric_data() provides raw unformatted numbers for validation:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(group_count("SEX"))
)
result <- tplyr_build(spec, tplyr_adsl)

kable(head(tplyr_to_ard(result), 10))

analysis_id	TRT01P	SEX	stat_name	stat_value
1	Placebo	F	n	53.00000
1	Placebo	M	n	33.00000
1	Xanomeline High Dose	F	n	40.00000
1	Xanomeline High Dose	M	n	44.00000
1	Xanomeline Low Dose	F	n	50.00000
1	Xanomeline Low Dose	M	n	34.00000
1	Placebo	F	pct	61.62791
1	Placebo	M	pct	38.37209
1	Xanomeline High Dose	F	pct	47.61905
1	Xanomeline High Dose	M	pct	52.38095

kable(tplyr_numeric_data(result, layer = 1))

TRT01P	SEX	n	pct	total
Placebo	F	53	61.62791	86
Placebo	M	33	38.37209	86
Xanomeline High Dose	F	40	47.61905	84
Xanomeline High Dose	M	44	52.38095	84
Xanomeline Low Dose	F	50	59.52381	84
Xanomeline Low Dose	M	34	40.47619	84
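
One validation this enables is recomputing a statistic from its parts. The base R sketch below copies the n and total columns from the output above into a data frame by hand and confirms that pct equals 100 * n / total. It does not call tplyr2.

```r
# Values copied by hand from the tplyr_numeric_data() output above
numeric_data <- data.frame(
  TRT01P = rep(c("Placebo", "Xanomeline High Dose", "Xanomeline Low Dose"), each = 2),
  SEX    = rep(c("F", "M"), times = 3),
  n      = c(53, 33, 40, 44, 50, 34),
  total  = rep(c(86, 84, 84), each = 2)
)

# Recompute the percentage from the count and the denominator
numeric_data$pct_check <- 100 * numeric_data$n / numeric_data$total

print(round(numeric_data$pct_check, 5))
# 61.62791 38.37209 47.61905 52.38095 59.52381 40.47619
```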

Performance

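Beyond the API changes, tplyr2 is substantially faster than Tplyr v1, thanks to a data.table backend that replaces the tidyverse internals. The tables below summarize benchmarks run across five layer types at data scales from 1x to 500x the base dataset sizes, with and without metadata enabled. Speedup is the Tplyr v1 run time divided by the tplyr2 run time; times are shown rounded to two decimals, so the speedup columns do not always equal the ratio of the displayed times.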

Overall Speedup by Layer Type

Layer Type	Avg Speedup	Median Speedup	Min	Max
Shift	176.0x	187.6x	6.4x	388.5x
Desc	8.0x	7.1x	4.1x	18.3x
Count	4.0x	3.1x	2.5x	12.6x
Nested Count	3.8x	3.1x	2.8x	7.3x
Multi-Layer	2.8x	3.0x	1.2x	4.0x

The grand median speedup across all scenarios is 4.0x.

Detailed Results

Count Layers

Scale	Tplyr v1	tplyr2	Speedup	Tplyr v1 (meta)	tplyr2 (meta)	Speedup (meta)
1x	0.05s	0.01s	4.5x	0.17s	0.01s	12.6x
10x	0.12s	0.03s	3.8x	0.20s	0.06s	3.2x
50x	0.47s	0.16s	3.0x	0.80s	0.32s	2.5x
100x	0.99s	0.32s	3.1x	1.86s	0.63s	2.9x
250x	2.40s	0.74s	3.2x	4.28s	1.50s	2.9x
500x	5.13s	1.43s	3.6x	8.92s	3.29s	2.7x
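
The Count row of the overall table can be recovered from these detailed results. The sketch below assumes that the summary pools both speedup columns (with and without metadata); that assumption reproduces the reported average, minimum, and maximum. The median of the rounded values is 3.15, in line with the reported 3.1x, which was presumably computed from unrounded timings.

```r
# Speedup values copied from the Count Layers table
no_meta   <- c(4.5, 3.8, 3.0, 3.1, 3.2, 3.6)
with_meta <- c(12.6, 3.2, 2.5, 2.9, 2.9, 2.7)

# Assumption: the overall table pools both columns
speedups <- c(no_meta, with_meta)

summary_row <- c(
  avg    = mean(speedups),
  median = median(speedups),
  min    = min(speedups),
  max    = max(speedups)
)
print(round(summary_row, 2))  # avg 4.00, median 3.15, min 2.50, max 12.60
```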

Nested Count Layers

Scale	Tplyr v1	tplyr2	Speedup	Tplyr v1 (meta)	tplyr2 (meta)	Speedup (meta)
1x	0.21s	0.05s	4.0x	0.21s	0.03s	7.3x
10x	0.34s	0.11s	3.2x	0.41s	0.14s	3.0x
50x	0.89s	0.31s	2.9x	1.40s	0.50s	2.8x
100x	1.78s	0.61s	2.9x	2.86s	0.95s	3.0x
250x	5.24s	1.23s	4.3x	8.00s	2.82s	2.8x
500x	14.61s	2.58s	5.7x	19.64s	4.96s	4.0x

Descriptive Statistics Layers

Scale	Tplyr v1	tplyr2	Speedup	Tplyr v1 (meta)	tplyr2 (meta)	Speedup (meta)
1x	0.07s	0.01s	5.9x	0.23s	0.01s	18.3x
10x	0.18s	0.02s	7.2x	0.20s	0.04s	5.4x
50x	0.70s	0.08s	8.5x	0.83s	0.11s	7.6x
100x	1.43s	0.14s	10.1x	1.61s	0.23s	6.9x
250x	3.23s	0.35s	9.3x	3.65s	0.56s	6.5x
500x	6.27s	1.54s	4.1x	7.74s	1.21s	6.4x

Shift Layers

Scale	Tplyr v1	tplyr2	Speedup	Tplyr v1 (meta)	tplyr2 (meta)	Speedup (meta)
1x	0.06s	0.01s	6.4x	0.08s	0.01s	11.0x
10x	0.29s	0.01s	34.4x	0.43s	0.01s	49.7x
50x	1.20s	0.01s	110.4x	2.11s	0.01s	158.8x
100x	2.63s	0.01s	216.4x	4.53s	0.02s	231.8x
250x	6.69s	0.03s	265.9x	10.70s	0.04s	305.2x
500x	13.87s	0.04s	388.5x	21.91s	0.07s	333.8x

The benchmark script (benchmark_comparison.R) is included in the package repository root for reproducibility.

Summary

The migration from Tplyr v1 to tplyr2 involves three main shifts:

Declare, don’t pipe. Replace tplyr_table() %>% add_layer() %>% build() with tplyr_spec() + tplyr_build(spec, data).
Quote your variable names. Bare symbols like AGE become "AGE". The where parameter is the only place bare expressions are still used.
Collect settings in one place. Instead of piping set_*() calls, pass a layer_settings() object to the settings parameter of each layer.

The output structure – rowlabel columns, res columns with label attributes, and ord columns for sorting – remains the same. Existing downstream code that consumes tplyr output should work with minimal changes.