Custom Analysis Layers
Introduction
tplyr2 provides three built-in layer types –
group_count(), group_desc(), and
group_shift() – that cover most clinical table patterns.
But sometimes you need computations that do not fit neatly into any of
these: geometric means, custom ratios, composite endpoints, or any
analysis where you need full control over both the calculation and the
presentation.
group_analyze() fills this gap. It lets you supply your
own function to compute summary statistics for each group, while
preserving tplyr2’s column-based layout, formatting, and ordering
infrastructure. This vignette covers both of its operating modes: format
strings mode (where tplyr2 handles formatting) and pre-formatted mode
(where your function returns display-ready strings).
The analyze_fn Contract
Every group_analyze() layer requires an
analyze_fn – a function that tplyr2 calls once for each
combination of column and by variables. The function
signature is:
function(.data, .target_var) { ... }

where .data is a data.frame subset for the current group
and .target_var is a character string naming the target
variable.
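Because analyze_fn is an ordinary R function, one way to check it against this contract is to call it directly on a small data frame before passing it to group_analyze(). The sketch below uses only base R; the toy data and the name count_fn are illustrative assumptions and are not part of tplyr2.

```r
# Toy stand-in for the subset that would be passed for one group
# (hypothetical data, not taken from tplyr_adlb).
toy_group <- data.frame(
  TRTP = c("Placebo", "Placebo", "Placebo"),
  AVAL = c(10, NA, 30)
)

# Illustrative function following the (.data, .target_var) signature.
count_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]     # look the target column up by name
  data.frame(
    n      = sum(!is.na(vals)),    # non-missing observations
    n_miss = sum(is.na(vals))      # missing observations
  )
}

out <- count_fn(toy_group, "AVAL")
print(out)
```

Since .target_var arrives as a string, .data[[.target_var]] is the natural way to extract the column.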
What your function returns determines which of the two modes tplyr2 uses to process the results.
Format Strings Mode
In format strings mode, your analyze_fn returns a
single-row data.frame of named numeric columns. You then supply
format_strings in layer_settings() to control
how each statistic is formatted and labeled in the output. This mode is
useful when you want tplyr2’s formatting system (alignment, decimal
precision, parenthesis hugging) to handle the display.
Here is an example computing a geometric mean and geometric standard deviation for urate lab values by treatment group:
geo_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  # Keep non-missing, strictly positive values so log() is defined
  pos_vals <- vals[!is.na(vals) & vals > 0]
  data.frame(
    geo_mean = exp(mean(log(pos_vals))),
    geo_sd = exp(sd(log(pos_vals)))
  )
}

spec <- tplyr_spec(
  cols = "TRTP",
  layers = tplyr_layers(
    group_analyze("AVAL",
      by = "Urate (umol/L)",
      where = AVISIT == "Baseline",
      analyze_fn = geo_fn,
      settings = layer_settings(
        format_strings = list(
          "Geometric Mean" = f_str("xxx.xx", "geo_mean"),
          "Geometric SD" = f_str("xxx.xx", "geo_sd")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Urate (umol/L) | Geometric Mean | 314.12 | 300.59 | 291.59 |
| Urate (umol/L) | Geometric SD | 1.29 | 1.19 | 1.32 |
A few things to note:
- The names of the format_strings list (“Geometric Mean”, “Geometric SD”) become the row labels in the output, just like group_desc() format strings.
- Each f_str() references a column name from the data.frame your function returns. The "xxx.xx" template means three integer digits and two decimal places.
- The by = "Urate (umol/L)" string does not match any column in the data, so tplyr2 treats it as a text label that appears as an outer row label. This is useful for labeling blocks of custom statistics.
- The where argument filters the data before your function is called.
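The computation inside geo_fn can be sanity-checked outside tplyr2. The sketch below repeats the function and runs it on hypothetical values chosen so the answer is easy to verify by hand: the geometric mean of 1, 10 and 100 is the cube root of their product, which is 10.

```r
geo_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  pos_vals <- vals[!is.na(vals) & vals > 0]
  data.frame(
    geo_mean = exp(mean(log(pos_vals))),
    geo_sd = exp(sd(log(pos_vals)))
  )
}

# Hypothetical values; the NA and the negative value are dropped by the filter.
toy <- data.frame(AVAL = c(1, 10, 100, NA, -5))
out <- geo_fn(toy, "AVAL")

# exp(mean(log(x))) should agree with the n-th root of the product.
by_product <- prod(c(1, 10, 100))^(1 / 3)
print(all.equal(out$geo_mean, by_product))
print(out)
```

Note that the filter silently discards non-positive values, so the geometric statistics describe only the positive observations in each group.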
Multiple Statistics Per Row
Format strings can combine multiple statistics into a single row,
just as in group_desc(). Here we compute a mean and
standard deviation in one row, and a median in another:
summary_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  vals <- vals[!is.na(vals)]
  data.frame(
    n = length(vals),
    mean = mean(vals),
    sd = sd(vals),
    median = median(vals)
  )
}

spec <- tplyr_spec(
  cols = "TRTP",
  layers = tplyr_layers(
    group_analyze("AVAL",
      by = "Urate (umol/L)",
      where = AVISIT == "Baseline",
      analyze_fn = summary_fn,
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xxx.x (xxx.xx)", "mean", "sd"),
          "Median" = f_str("xxx.xx", "median")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Urate (umol/L) | Mean (SD) | 323.4 ( 85.66) | 305.1 ( 61.56) | 301.6 ( 85.35) |
| Urate (umol/L) | Median | 288.48 | 294.43 | 279.56 |
| Urate (umol/L) | n | 8 | 10 | 7 |
This produces the same style of output you would get from
group_desc(), but computed by your own function.
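As a rough base R analogy for how a template such as "xxx.x (xxx.xx)" lays out a cell, fixed-width sprintf() fields give the same result for the first Mean (SD) cell in the table above. This is only an illustration of the width and precision idea; it is not how tplyr2 implements f_str(), and the mapping from templates to sprintf() widths is an assumption made for this sketch.

```r
# Analogy only, not tplyr2 internals.
# "xxx.x"  -> total width 5, 1 decimal  -> "%5.1f"
# "xxx.xx" -> total width 6, 2 decimals -> "%6.2f"
cell <- sprintf("%5.1f (%6.2f)", 323.4, 85.66)
print(cell)

# The SD field is padded to width 6, which is where the space after
# the opening parenthesis comes from.
print(nchar(cell))
```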
Pre-Formatted Mode
Sometimes you want complete control over the output strings – for
example, when the formatting logic is complex, when you need conditional
formatting, or when the output does not map cleanly to the
f_str() system. In pre-formatted mode, your
analyze_fn returns a data.frame with two columns:
row_label (character) and formatted
(character). No format_strings are needed in this case.
range_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  vals <- vals[!is.na(vals)]
  data.frame(
    row_label = c("Range", "Ratio (Max/Min)"),
    formatted = c(
      sprintf("%.1f - %.1f", min(vals), max(vals)),
      sprintf("%.2f", max(vals) / min(vals))
    )
  )
}

spec <- tplyr_spec(
  cols = "TRTP",
  layers = tplyr_layers(
    group_analyze("AVAL",
      by = "Urate (umol/L)",
      where = AVISIT == "Baseline",
      analyze_fn = range_fn
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Urate (umol/L) | Range | 232.0 - 469.9 | 249.8 - 469.9 | 190.3 - 428.3 |
| Urate (umol/L) | Ratio (Max/Min) | 2.03 | 1.88 | 2.25 |
Each value in the row_label column becomes a row label
in the output. Each corresponding formatted value is placed
into the appropriate result column. This mode gives you full control,
but it also means you are responsible for alignment and spacing – tplyr2
will not pad or round the strings for you.
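Since alignment is your responsibility in this mode, one option is to give each number a fixed field width inside sprintf(). The sketch below is a hypothetical variant of the range function (the name aligned_range_fn and the toy values are assumptions) showing that groups with different magnitudes then produce strings of equal width.

```r
aligned_range_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  vals <- vals[!is.na(vals)]
  data.frame(
    row_label = "Range",
    # "%5.1f" pads each bound to width 5 so cells line up across columns
    formatted = sprintf("%5.1f - %5.1f", min(vals), max(vals)),
    stringsAsFactors = FALSE
  )
}

# Two hypothetical groups with values of different magnitude.
a <- aligned_range_fn(data.frame(AVAL = c(232.04, 469.9, NA)), "AVAL")
b <- aligned_range_fn(data.frame(AVAL = c(9.5, 88.3)), "AVAL")

print(a$formatted)
print(b$formatted)
print(nchar(a$formatted) == nchar(b$formatted))
```

The field width has to be chosen for the largest value you expect; a wider number would still overflow its field.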
Integration with by Variables
The by parameter in group_analyze() works
the same way as in other layer types. Strings that match column names in
the data are treated as grouping variables; strings that do not match
are treated as text labels. Your analyze_fn is called once
for each unique combination of column variables and by data
variables.
Here we compute statistics separately for each visit:
mean_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  vals <- vals[!is.na(vals)]
  data.frame(
    row_label = "Mean (SD)",
    formatted = sprintf("%.1f (%.2f)", mean(vals), sd(vals))
  )
}

spec <- tplyr_spec(
  cols = "TRTP",
  layers = tplyr_layers(
    group_analyze("AVAL",
      by = c("Urate (umol/L)", "AVISIT"),
      where = AVISIT %in% c("Baseline", "Week 4", "Week 8"),
      analyze_fn = mean_fn
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | rowlabel3 | res1 | res2 | res3 |
|---|---|---|---|---|---|
| Urate (umol/L) | Baseline | Mean (SD) | 323.4 (85.66) | 305.1 (61.56) | 301.6 (85.35) |
| Urate (umol/L) | Week 4 | Mean (SD) | 313.0 (64.76) | 319.7 (72.55) | 319.2 (95.14) |
| Urate (umol/L) | Week 8 | Mean (SD) | 315.2 (60.66) | 290.5 (42.80) | 255.8 (38.55) |
In this output, rowlabel1 contains the text label “Urate
(umol/L)”, rowlabel2 contains the visit values from the
data, and rowlabel3 contains the row labels from the
function output. The by parameter can mix labels and data
variables freely – tplyr2 sorts them out automatically.
Combining with Other Layers
A group_analyze() layer integrates naturally into a
multi-layer spec alongside group_count(),
group_desc(), or group_shift(). Each layer
gets its own ord_layer_index value in the output, so layers
stack in the order they are specified.
geo_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  pos_vals <- vals[!is.na(vals) & vals > 0]
  data.frame(
    geo_mean = exp(mean(log(pos_vals)))
  )
}

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX",
      by = "Gender",
      settings = layer_settings(
        format_strings = list(
          "n (%)" = f_str("xx (xx.x%)", "n", "pct")
        )
      )
    ),
    group_analyze("AGE",
      by = "Age (years)",
      analyze_fn = geo_fn,
      settings = layer_settings(
        format_strings = list(
          "Geometric Mean" = f_str("xx.xx", "geo_mean")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Gender | F | 53 (61.6%) | 40 (47.6%) | 50 (59.5%) |
| Gender | M | 33 (38.4%) | 44 (52.4%) | 34 (40.5%) |
| Age (years) | Geometric Mean | 74.70 | 73.94 | 75.18 |
The count layer and the analyze layer each occupy their own block of
rows. The ord_layer_index column (hidden from the kable
output above but present in the data) keeps them ordered correctly.
Error Handling
If your analyze_fn encounters an error for a particular
group – for example, if a group has no valid data for a log
transformation – tplyr2 will surface the error immediately. This is by
design: silent failures in statistical computation can lead to subtle,
hard-to-detect problems in clinical outputs.
To handle edge cases gracefully, build the error handling into your function:
safe_geo_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  pos_vals <- vals[!is.na(vals) & vals > 0]
  # sd() needs at least two values, so return NA for smaller groups
  if (length(pos_vals) < 2) {
    return(data.frame(geo_mean = NA_real_, geo_sd = NA_real_))
  }
  data.frame(
    geo_mean = exp(mean(log(pos_vals))),
    geo_sd = exp(sd(log(pos_vals)))
  )
}

spec <- tplyr_spec(
  cols = "TRTP",
  layers = tplyr_layers(
    group_analyze("AVAL",
      by = "Urate (umol/L)",
      where = AVISIT == "Baseline",
      analyze_fn = safe_geo_fn,
      settings = layer_settings(
        format_strings = list(
          "Geometric Mean" = f_str("xxx.xx", "geo_mean"),
          "Geometric SD" = f_str("xxx.xx", "geo_sd")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, !grepl("^ord", names(result))])

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Urate (umol/L) | Geometric Mean | 314.12 | 300.59 | 291.59 |
| Urate (umol/L) | Geometric SD | 1.29 | 1.19 | 1.32 |
When NA_real_ values pass through format strings, tplyr2
renders them as blank space of the appropriate width, maintaining
alignment in the output.
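The guard in safe_geo_fn can be exercised directly with base R before it is used in a spec. The sketch below repeats the function and feeds it a hypothetical group with no positive values, then a well-behaved group for comparison.

```r
safe_geo_fn <- function(.data, .target_var) {
  vals <- .data[[.target_var]]
  pos_vals <- vals[!is.na(vals) & vals > 0]
  if (length(pos_vals) < 2) {
    return(data.frame(geo_mean = NA_real_, geo_sd = NA_real_))
  }
  data.frame(
    geo_mean = exp(mean(log(pos_vals))),
    geo_sd = exp(sd(log(pos_vals)))
  )
}

# Hypothetical group with nothing usable after filtering.
empty <- safe_geo_fn(data.frame(AVAL = c(NA, -1, 0)), "AVAL")
print(empty)

# Hypothetical group with enough positive values.
ok <- safe_geo_fn(data.frame(AVAL = c(1, 10, 100)), "AVAL")
print(ok)

# For reference: without the guard, an empty input gives NaN rather than NA,
# because mean(numeric(0)) is NaN.
unguarded <- exp(mean(log(numeric(0))))
print(unguarded)
```

Returning the same column names in the guarded branch keeps the result compatible with the format_strings that reference geo_mean and geo_sd.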
Summary
group_analyze() is the escape hatch for when tplyr2’s
built-in layer types are not enough. It gives you full control over
computation while preserving the structural benefits of the tplyr2
framework: column-based layout, ordering, multi-layer stacking, and
integration with population data and column headers.
The key points to remember:
- Your analyze_fn receives .data (a data.frame subset) and .target_var (a character string) for each group.
- In format strings mode, return a single-row numeric data.frame and let f_str() handle formatting.
- In pre-formatted mode, return a data.frame with row_label and formatted columns.
- Use by for row grouping, mixing text labels and data variable names as needed.
- Build error handling into your function to handle edge cases gracefully.
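The two return shapes listed above can be turned into a small development-time check. The helper below is hypothetical and not part of tplyr2, and it does not claim to mirror how tplyr2 itself decides between modes; it simply encodes the two shapes described in this vignette so a function's output can be inspected before building a table.

```r
# Hypothetical helper, not part of tplyr2: reports which of the two
# return shapes described in this vignette a result matches.
describe_analyze_output <- function(out) {
  stopifnot(is.data.frame(out))
  if (all(c("row_label", "formatted") %in% names(out))) {
    return("pre-formatted")
  }
  if (nrow(out) == 1 && all(vapply(out, is.numeric, logical(1)))) {
    return("format strings")
  }
  stop("output matches neither documented shape")
}

numeric_out <- data.frame(geo_mean = 10, geo_sd = 2)
string_out <- data.frame(
  row_label = "Range",
  formatted = "1.0 - 2.0",
  stringsAsFactors = FALSE
)

mode_a <- describe_analyze_output(numeric_out)
mode_b <- describe_analyze_output(string_out)
print(mode_a)
print(mode_b)
```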