Descriptive Statistics Layers
Introduction
Descriptive statistics tables are among the most common outputs in clinical trial reporting. Whether you are summarizing demographics in a Table 14.1 or lab parameters across visits, the pattern is the same: compute summary statistics for a continuous variable, then present them in a formatted, publication-ready layout grouped by treatment arm.
In tplyr2, descriptive statistics layers are created with
group_desc(). The core of your control over the output
comes from the format_strings parameter within
layer_settings(). Format strings let you specify exactly
which statistics appear, what row label each statistic gets, and how
numbers are formatted – all in one place.
Let’s start with a typical example. Using the built-in
tplyr_adsl dataset, we will summarize age by treatment
group.
```r
library(tplyr2)
library(knitr) # provides kable()

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      by = "Age (years)",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median" = f_str("xx.x", "median"),
          "Q1, Q3" = f_str("xx.x, xx.x", "q1", "q3"),
          "Min, Max" = f_str("xx, xx", "min", "max"),
          "Missing" = f_str("xx", "missing")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
# Drop columns whose names start with "ord" before printing
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Age (years) | n | 86 | 84 | 84 |
| Age (years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Age (years) | Median | 76.0 | 76.0 | 77.5 |
| Age (years) | Q1, Q3 | 69.2, 81.8 | 70.8, 80.0 | 71.0, 82.0 |
| Age (years) | Min, Max | 52, 89 | 56, 88 | 51, 88 |
| Age (years) | Missing | 0 | 0 | 0 |
A few things to note about this example:
- The format_strings parameter is a named list. Each name becomes the row label in the output (e.g., “Mean (SD)”), and each value is an f_str() object that controls the numeric format.
- Inside f_str(), the first argument is the format template. The x characters define the display width: xx.x means two integer digits and one decimal place. The remaining arguments are strings naming the statistics to plug into each format group.
- The by argument "Age (years)" does not match any column in the data, so tplyr2 treats it as a text label. It appears as an additional rowlabel column, which is useful for distinguishing blocks of statistics when multiple layers are combined.
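To make the width rule concrete, the following base R sketch rebuilds one "Mean (SD)" cell by hand. It only illustrates what a template such as xx.x (xx.xx) implies and is not tplyr2's internal formatter; the helper fmt_num() and the unrounded input values are hypothetical.

```r
# Hypothetical helper (not part of tplyr2): format a number so that it
# occupies `int_width` integer positions plus `decimals` decimal places.
fmt_num <- function(x, int_width, decimals) {
  # Total width = integer digits + decimal point + decimal digits
  total_width <- if (decimals > 0) int_width + 1 + decimals else int_width
  formatC(x, format = "f", digits = decimals, width = total_width)
}

# "xx.x (xx.xx)": the mean gets 2 integer digits and 1 decimal,
# the SD gets 2 integer digits and 2 decimals.
mean_sd <- paste0(fmt_num(75.2093, 2, 1), " (", fmt_num(8.5902, 2, 2), ")")
print(mean_sd)
```

Because the SD is left-padded to the full xx.xx width, a single-digit integer part leaves a space after the opening parenthesis, which is the same spacing seen in the table above.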
Built-in Summaries
tplyr2 provides a set of built-in summary statistics that cover the
most common needs in clinical reporting. These are computed
automatically for every group_desc() layer – you simply
reference them by name in your format strings.
| Statistic | Description | Details |
|---|---|---|
| n | Non-missing count | length(x[!is.na(x)]) |
| mean | Arithmetic mean | mean(x, na.rm = TRUE) |
| sd | Standard deviation | sd(x, na.rm = TRUE) |
| median | Median | median(x, na.rm = TRUE) |
| var | Variance | var(x, na.rm = TRUE) |
| min | Minimum | min(x) of finite values |
| max | Maximum | max(x) of finite values |
| iqr | Interquartile range | IQR(x, type = ...) |
| q1 | First quartile (25th percentile) | quantile(x, 0.25, type = ...) |
| q3 | Third quartile (75th percentile) | quantile(x, 0.75, type = ...) |
| missing | Missing count | sum(is.na(x)) |
A few important notes about these built-in summaries:
- All statistics use na.rm = TRUE by default, so missing values are excluded from calculations (except for missing itself, which counts them).
- min and max operate on finite values only. If all values in a group are NA, the result is NA_real_ rather than Inf or -Inf. This avoids formatting issues where infinity symbols would appear in your output.
- The n statistic counts non-missing observations, while missing counts the NA values. Together they sum to the total number of rows in that group.
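The definitions above can be checked in plain base R. The sketch below is not tplyr2 code: the data vector is made up, and finite_min() and finite_max() are hypothetical helpers that mimic the "finite values only" behavior described in the notes.

```r
# Illustrative data with one missing value (made up for this sketch)
x <- c(52, 61, NA, 70, 75, 89)

# min/max restricted to finite values; return NA_real_ when nothing is
# left, instead of the Inf / -Inf that min() and max() give on empty input
finite_min <- function(v) {
  v <- v[is.finite(v)]
  if (length(v) == 0) NA_real_ else min(v)
}
finite_max <- function(v) {
  v <- v[is.finite(v)]
  if (length(v) == 0) NA_real_ else max(v)
}

stats <- list(
  n       = length(x[!is.na(x)]),
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  median  = median(x, na.rm = TRUE),
  min     = finite_min(x),
  max     = finite_max(x),
  missing = sum(is.na(x))
)
str(stats)

# A group where every value is missing
all_missing_min <- finite_min(c(NA_real_, NA_real_))
```

Here n and missing add up to length(x), and the all-missing group yields NA_real_ rather than Inf.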
Quantile Algorithm
By default, tplyr2 uses R’s default quantile algorithm (Type 7) for
computing q1, q3, and iqr. This
is fine for many applications, but clinical trial reporting often needs
to match SAS output, which uses a different algorithm (closest to R’s
Type 3).
You can change the quantile algorithm globally using
tplyr2_options():
```r
# Default Type 7 (R default)
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "Q1, Q3" = f_str("xx.x, xx.x", "q1", "q3")
        )
      )
    )
  )
)
result_type7 <- tplyr_build(spec, tplyr_adsl)
kable(result_type7[, !grepl("^ord", names(result_type7))],
  caption = "Type 7 (R default)")
```

| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| Q1, Q3 | 69.2, 81.8 | 70.8, 80.0 | 71.0, 82.0 |
```r
# Type 3 (matches SAS PROC UNIVARIATE default)
tplyr2_options(quantile_type = 3)
result_type3 <- tplyr_build(spec, tplyr_adsl)
kable(result_type3[, !grepl("^ord", names(result_type3))],
  caption = "Type 3 (SAS-like)")
```

| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| Q1, Q3 | 69.0, 81.0 | 70.0, 80.0 | 71.0, 82.0 |
```r
# Reset to default
tplyr2_options(quantile_type = 7)
```

Notice how the quartile values differ between the two algorithms. The difference is typically small but can matter when you need to produce outputs that match SAS exactly. Type 3 uses the nearest even order statistic and is the closest match to SAS’s default behavior.
Custom Summaries
The built-in summaries cover most standard needs, but clinical reporting sometimes calls for statistics that are not part of the default set – geometric means, coefficients of variation, trimmed means, and so on. tplyr2 handles this through custom summaries.
Layer-Level Custom Summaries
You can define custom summaries directly in
layer_settings() using the custom_summaries
parameter. Each custom summary is a named expression that uses
.var as a placeholder for the target variable’s values.
Here is an example computing a geometric mean alongside the standard mean:
```r
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Geometric Mean" = f_str("xx.xx", "geo_mean")
        ),
        custom_summaries = list(
          # Geometric mean over strictly positive values only
          geo_mean = quote(exp(mean(log(.var[.var > 0]), na.rm = TRUE)))
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| n | 86 | 84 | 84 |
| Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Geometric Mean | 74.70 | 73.94 | 75.18 |
The key points about custom summaries:
- Use quote() to wrap the expression. This delays evaluation until build time, when .var is replaced with the actual data vector.
- .var refers to the values of the target variable for the current group. It behaves like a numeric vector, so you can apply any R function to it.
- If a custom summary expression throws an error (e.g., calling log() on a non-numeric value), tplyr2 catches the error and returns NA_real_ for that group, so your table build will not fail. Note that taking the log of a negative number does not raise an error in R; it returns NaN with a warning.
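The mechanics described in these points can be sketched in a few lines of base R. The function eval_summary() below is hypothetical and only mirrors the described behavior (bind .var, evaluate, fall back to NA_real_ on error); it is an assumption about the general approach, not tplyr2's actual implementation.

```r
# Hypothetical evaluator (not tplyr2 code): bind `.var` to one group's
# values, evaluate the quoted expression, and return NA_real_ if the
# expression raises an error.
eval_summary <- function(expr, values) {
  tryCatch(
    eval(expr, list(.var = values)),
    error = function(e) NA_real_
  )
}

geo_mean <- quote(exp(mean(log(.var[.var > 0]), na.rm = TRUE)))
ok <- eval_summary(geo_mean, c(1, 10, 100))

# log() of a character vector is an error, so the fallback is used
bad <- eval_summary(quote(log(as.character(.var))), c(1, 10, 100))
print(c(ok = ok, bad = bad))
```

Because the expression is stored unevaluated by quote(), the same object can be reused for every treatment group, with .var bound to a different vector each time.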
Session-Level Custom Summaries
If you find yourself using the same custom summary across many tables
in a study, you can register it at the session level using
tplyr2_options(). Once registered, the custom statistic is
available by name in any format_strings specification, just
like the built-in summaries.
```r
# Register a coefficient of variation summary for the session
tplyr2_options(
  custom_summaries = list(
    cv = quote(sd(.var, na.rm = TRUE) / mean(.var, na.rm = TRUE) * 100)
  )
)
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "CV (%)" = f_str("xx.x", "cv")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| n | 86 | 84 | 84 |
| Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| CV (%) | 11.4 | 10.6 | 11.0 |
```r
# Clean up
tplyr2_options(custom_summaries = NULL)
```

Overriding Built-in Summaries
Custom summaries can even overwrite built-in statistics. If you name
a custom summary "mean", it replaces the built-in mean
calculation. This is useful when your study requires a non-standard
definition of a standard statistic, such as using a trimmed mean instead
of the arithmetic mean.
```r
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "Trimmed Mean" = f_str("xx.x", "mean")
        ),
        custom_summaries = list(
          # Replaces the built-in "mean" with a 10% trimmed mean
          mean = quote(mean(.var, trim = 0.1, na.rm = TRUE))
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | res1 | res2 | res3 |
|---|---|---|---|
| Trimmed Mean | 75.7 | 75.0 | 76.6 |
Layer-level custom summaries always take priority over session-level custom summaries, and both take priority over built-in statistics. This layered precedence gives you fine-grained control: set sensible defaults at the session level, then override on a per-layer basis when needed.
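One way to picture this precedence is as three named lists merged in order, with later lists winning. The sketch below is purely illustrative: the three registries and the use of modifyList() are assumptions made for this example, and the vignette does not show how tplyr2 stores or resolves summaries internally.

```r
# Hypothetical registries (not tplyr2 internals)
builtin_stats <- list(
  mean = quote(mean(.var, na.rm = TRUE)),
  sd   = quote(sd(.var, na.rm = TRUE))
)
session_stats <- list(
  cv = quote(sd(.var, na.rm = TRUE) / mean(.var, na.rm = TRUE) * 100)
)
layer_stats <- list(
  mean = quote(mean(.var, trim = 0.1, na.rm = TRUE))
)

# Later lists override earlier ones: built-in < session < layer
resolved <- modifyList(modifyList(builtin_stats, session_stats), layer_stats)

# Made-up data with one extreme value to show the effect of trimming
x <- c(1:9, 100)
resolved_mean <- eval(resolved$mean, list(.var = x))
builtin_mean <- eval(builtin_stats$mean, list(.var = x))
print(c(resolved = resolved_mean, builtin = builtin_mean))
```

With this data the layer-level definition wins, so the resolved "mean" is the trimmed mean (5.5) rather than the arithmetic mean (14.5), while sd and cv remain available under their original definitions.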
Multi-Variable Descriptive Statistics
It is common to summarize several continuous variables in a single
table – for example, a demographics table that includes age, height, and
weight. Rather than creating separate layers for each variable, you can
pass a character vector of variable names to
group_desc().
```r
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc(c("AGE", "HEIGHTBL", "WEIGHTBL"),
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median" = f_str("xx.x", "median"),
          "Q1, Q3" = f_str("xx.x, xx.x", "q1", "q3"),
          "Min, Max" = f_str("xx, xx", "min", "max"),
          "Missing" = f_str("xx", "missing")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| AGE | n | 86 | 84 | 84 |
| AGE | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| AGE | Median | 76.0 | 76.0 | 77.5 |
| AGE | Q1, Q3 | 69.2, 81.8 | 70.8, 80.0 | 71.0, 82.0 |
| AGE | Min, Max | 52, 89 | 56, 88 | 51, 88 |
| AGE | Missing | 0 | 0 | 0 |
| HEIGHTBL | n | 86 | 84 | 84 |
| HEIGHTBL | Mean (SD) | 162.6 (11.52) | 165.8 (10.13) | 163.4 (10.42) |
| HEIGHTBL | Median | 162.6 | 165.1 | 162.6 |
| HEIGHTBL | Q1, Q3 | 154.0, 171.2 | 157.5, 172.8 | 157.5, 170.2 |
| HEIGHTBL | Min, Max | 137, 185 | 146, 190 | 136, 196 |
| HEIGHTBL | Missing | 0 | 0 | 0 |
| WEIGHTBL | n | 86 | 84 | 83 |
| WEIGHTBL | Mean (SD) | 62.8 (12.77) | 70.0 (14.65) | 67.3 (14.12) |
| WEIGHTBL | Median | 60.5 | 69.2 | 64.9 |
| WEIGHTBL | Q1, Q3 | 53.6, 74.2 | 57.0, 80.3 | 56.0, 77.4 |
| WEIGHTBL | Min, Max | 34, 86 | 42, 108 | 45, 106 |
| WEIGHTBL | Missing | 0 | 0 | 1 |
When multiple target variables are specified:
- Each variable gets its own block of summary rows. The variable name appears as an additional rowlabel column (here, rowlabel1), with the statistic labels in the next column (rowlabel2).
- The same format_strings are applied to every variable. Each variable’s statistics are computed independently, so differences in scale (e.g., age in years vs. height in centimeters) are handled naturally.
- Ordering is preserved: the first variable’s rows appear first, followed by the second, and so on.
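Conceptually, this is the familiar pattern of applying one set of summaries to each variable in turn and stacking the results. The base R sketch below shows that pattern on a tiny made-up data frame; it is not how tplyr2 is implemented, and the column names simply echo those used in this vignette.

```r
# Made-up data: two arms, two variables, one missing weight
dat <- data.frame(
  TRT01P   = c("Placebo", "Placebo", "Active", "Active"),
  AGE      = c(70, 80, 60, 90),
  WEIGHTBL = c(60, NA, 70, 80)
)

vars <- c("AGE", "WEIGHTBL")

# One block per variable, computed independently, in the order given
blocks <- lapply(vars, function(v) {
  by_arm <- split(dat[[v]], dat$TRT01P)
  data.frame(
    variable = v,
    arm      = names(by_arm),
    n        = vapply(by_arm, function(x) sum(!is.na(x)), integer(1)),
    mean     = vapply(by_arm, function(x) mean(x, na.rm = TRUE), numeric(1)),
    row.names = NULL
  )
})

# Stack the blocks; rows for the first variable come first
result <- do.call(rbind, blocks)
print(result)
```

The missing weight only reduces n in the WEIGHTBL block for its own arm; the AGE block is unaffected, which mirrors the n of 83 seen for WEIGHTBL in the third arm of the table above.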
You can also combine multi-variable descriptive layers with the
by parameter. If you add a text label through
by, it becomes the outermost row label, followed by the
variable name, then the statistic label.
```r
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc(c("AGE", "WEIGHTBL"),
      by = "Demographics",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median" = f_str("xx.x", "median")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])
```

| rowlabel1 | rowlabel2 | rowlabel3 | res1 | res2 | res3 |
|---|---|---|---|---|---|
| Demographics | AGE | n | 86 | 84 | 84 |
| Demographics | AGE | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Demographics | AGE | Median | 76.0 | 76.0 | 77.5 |
| Demographics | WEIGHTBL | n | 86 | 84 | 83 |
| Demographics | WEIGHTBL | Mean (SD) | 62.8 (12.77) | 70.0 (14.65) | 67.3 (14.12) |
| Demographics | WEIGHTBL | Median | 60.5 | 69.2 | 64.9 |
Where to Go From Here
This vignette covered the fundamentals of descriptive statistics layers in tplyr2: built-in summaries, custom summaries, quantile algorithms, and multi-variable analysis. But there is more to explore when it comes to controlling how your numbers look on the page.
vignette("desc_layer_formatting") covers advanced formatting topics, including:
- Auto-precision: Dynamically adjusting decimal places based on the precision of collected data, using precision_by, precision_on, and precision_cap in layer_settings().
- Empty value handling: Controlling what appears when all values in a group are missing, using the empty parameter of f_str().
- Parenthesis hugging: Eliminating the gap between parentheses and numbers so that formats like ( 5.2) become (5.2 ) using the X character in format strings.