Descriptive Statistics Layers • tplyr2

Introduction

Descriptive statistics tables are among the most common outputs in clinical trial reporting. Whether you are summarizing demographics in a Table 14.1 or lab parameters across visits, the pattern is the same: compute summary statistics for a continuous variable, then present them in a formatted, publication-ready layout grouped by treatment arm.

In tplyr2, descriptive statistics layers are created with group_desc(). Most of your control over the output comes from the format_strings parameter of layer_settings(). Format strings let you specify exactly which statistics appear, what row label each statistic gets, and how numbers are formatted – all in one place.

Let’s start with a typical example. Using the built-in tplyr_adsl dataset, we will summarize age by treatment group.

library(tplyr2)
library(knitr)  # provides kable()

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      by = "Age (years)",
      settings = layer_settings(
        format_strings = list(
          "n"          = f_str("xx", "n"),
          "Mean (SD)"  = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"     = f_str("xx.x", "median"),
          "Q1, Q3"     = f_str("xx.x, xx.x", "q1", "q3"),
          "Min, Max"   = f_str("xx, xx", "min", "max"),
          "Missing"    = f_str("xx", "missing")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	rowlabel2	res1	res2	res3
Age (years)	n	86	84	84
Age (years)	Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
Age (years)	Median	76.0	76.0	77.5
Age (years)	Q1, Q3	69.2, 81.8	70.8, 80.0	71.0, 82.0
Age (years)	Min, Max	52, 89	56, 88	51, 88
Age (years)	Missing	0	0	0

A few things to note about this example:

The format_strings parameter is a named list. Each name becomes the row label in the output (e.g., “Mean (SD)”), and each value is an f_str() object that controls the numeric format.
Inside f_str(), the first argument is the format template. Each run of x characters, with an optional decimal point, is a format group, and the number of x characters sets the display width: xx.x means two integer digits and one decimal place. The remaining arguments are strings naming the statistics that fill the format groups, in order.
The by argument "Age (years)" does not match any column in the data, so tplyr2 treats it as a text label. It appears as an additional rowlabel column, which is useful for distinguishing blocks of statistics when multiple layers are combined.
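To make the width rules concrete, here is a minimal base R sketch of how a single format group could be rendered. The helper fmt_group() is hypothetical and written only for illustration; it is not part of tplyr2 and is not meant to reflect how f_str() is implemented.

```r
# Hypothetical helper (not a tplyr2 function): render one value against
# one format group such as "xx" or "xx.x".
fmt_group <- function(template, value) {
  parts  <- strsplit(template, ".", fixed = TRUE)[[1]]
  digits <- if (length(parts) == 2) nchar(parts[2]) else 0
  # Total width is the full template length, including the decimal point.
  formatC(value, width = nchar(template), format = "f", digits = digits)
}

n_cell       <- fmt_group("xx", 86)
mean_sd_cell <- paste0(fmt_group("xx.x", 75.2), " (", fmt_group("xx.xx", 8.59), ")")

print(n_cell)        # "86"
print(mean_sd_cell)  # "75.2 ( 8.59)"
```

The space after the opening parenthesis appears because xx.xx reserves two integer digits, which is why the table above shows ( 8.59).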

Built-in Summaries

tplyr2 provides a set of built-in summary statistics that cover the most common needs in clinical reporting. These are computed automatically for every group_desc() layer – you simply reference them by name in your format strings.

Statistic	Description	Details
`n`	Non-missing count	`length(x[!is.na(x)])`
`mean`	Arithmetic mean	`mean(x, na.rm = TRUE)`
`sd`	Standard deviation	`sd(x, na.rm = TRUE)`
`median`	Median	`median(x, na.rm = TRUE)`
`var`	Variance	`var(x, na.rm = TRUE)`
`min`	Minimum	`min(x)` of finite values
`max`	Maximum	`max(x)` of finite values
`iqr`	Interquartile range	`IQR(x, type = ...)`
`q1`	First quartile (25th percentile)	`quantile(x, 0.25, type = ...)`
`q3`	Third quartile (75th percentile)	`quantile(x, 0.75, type = ...)`
`missing`	Missing count	`sum(is.na(x))`

A few important notes about these built-in summaries:

All statistics use na.rm = TRUE by default, so missing values are excluded from calculations (except for missing itself, which counts them).
min and max operate on finite values only. If all values in a group are NA, the result is NA_real_ rather than the Inf or -Inf that base R's min() and max() return for an empty vector. This keeps infinite values out of your formatted output.
The n statistic counts non-missing observations, while missing counts the NA values. Together they sum to the total number of rows in that group.
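The notes above can be checked with a few lines of base R. This is only a sketch of the documented behavior: finite_min() and finite_max() are hypothetical helpers, not tplyr2 functions, and the vector x is made-up data.

```r
x <- c(52, 61, NA, 75, 89, NA)

n_obs     <- sum(!is.na(x))  # same count as length(x[!is.na(x)])
n_missing <- sum(is.na(x))

# Hypothetical guards (not tplyr2 functions): keep finite values only and
# return NA_real_ when nothing is left.
finite_min <- function(v) {
  v <- v[is.finite(v)]
  if (length(v) == 0) NA_real_ else min(v)
}
finite_max <- function(v) {
  v <- v[is.finite(v)]
  if (length(v) == 0) NA_real_ else max(v)
}

print(c(n = n_obs, missing = n_missing, total = length(x)))
print(c(min = finite_min(x), max = finite_max(x)))
print(finite_min(c(NA_real_, NA_real_)))  # NA, not Inf
```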

Quantile Algorithm

By default, tplyr2 uses R’s default quantile algorithm (Type 7) for computing q1, q3, and iqr. This is fine for many applications, but clinical trial reporting often needs to match SAS output, which uses a different algorithm (closest to R’s Type 3).

You can change the quantile algorithm globally using tplyr2_options():

# Default Type 7 (R default)
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "Q1, Q3" = f_str("xx.x, xx.x", "q1", "q3")
        )
      )
    )
  )
)

result_type7 <- tplyr_build(spec, tplyr_adsl)
kable(result_type7[, !grepl("^ord", names(result_type7))],
      caption = "Type 7 (R default)")

Type 7 (R default)
rowlabel1	res1	res2	res3
Q1, Q3	69.2, 81.8	70.8, 80.0	71.0, 82.0

# Type 3 (the SAS-like algorithm described above)
tplyr2_options(quantile_type = 3)

result_type3 <- tplyr_build(spec, tplyr_adsl)
kable(result_type3[, !grepl("^ord", names(result_type3))],
      caption = "Type 3 (SAS-like)")

Type 3 (SAS-like)
rowlabel1	res1	res2	res3
Q1, Q3	69.0, 81.0	70.0, 80.0	71.0, 82.0


# Reset to default
tplyr2_options(quantile_type = 7)

Notice that the quartile values differ for the first two treatment groups and agree for the third. The differences are typically small but matter when you need to produce outputs that match SAS exactly. Type 3 takes the nearest even order statistic, so it always returns an observed value rather than interpolating between observations, and it is the closest match to SAS’s default behavior.
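The same contrast can be reproduced outside tplyr2 with base R's quantile() and IQR(), which both accept a type argument. The sketch below uses a small made-up vector rather than tplyr_adsl so the arithmetic is easy to follow.

```r
x     <- 1:10
probs <- c(0.25, 0.75)

# Type 7 interpolates between neighbouring observations.
type7 <- quantile(x, probs, type = 7, names = FALSE)
# Type 3 picks an observed value (nearest even order statistic).
type3 <- quantile(x, probs, type = 3, names = FALSE)

print(type7)  # 3.25 7.75
print(type3)  # 2 8
print(c(iqr_type7 = IQR(x, type = 7), iqr_type3 = IQR(x, type = 3)))
```

Type 3 results are always members of x, which is why the Type 3 quartiles in the table above end in .0 while the Type 7 quartiles do not.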

Custom Summaries

The built-in summaries cover most standard needs, but clinical reporting sometimes calls for statistics that are not part of the default set – geometric means, coefficients of variation, trimmed means, and so on. tplyr2 handles this through custom summaries.

Layer-Level Custom Summaries

You can define custom summaries directly in layer_settings() using the custom_summaries parameter. Each custom summary is a named expression that uses .var as a placeholder for the target variable’s values.

Here is an example computing a geometric mean alongside the standard mean:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n"              = f_str("xx", "n"),
          "Mean (SD)"      = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Geometric Mean" = f_str("xx.xx", "geo_mean")
        ),
        custom_summaries = list(
          geo_mean = quote(exp(mean(log(.var[.var > 0]), na.rm = TRUE)))
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	res1	res2	res3
n	86	84	84
Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
Geometric Mean	74.70	73.94	75.18

The key points about custom summaries:

Use quote() to wrap the expression. This delays evaluation until build time, when .var is replaced with the actual data vector.
.var refers to the values of the target variable for the current group. It behaves like a numeric vector, so you can apply any R function to it.
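If a custom summary expression throws an error (e.g., it calls stop() or passes a character value to a numeric function), tplyr2 catches the error and returns NA_real_ for that group, so your table build will not fail. Note that taking the log of a negative value is not an error in R: log() returns NaN with a warning.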
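The points above can be illustrated with a minimal base R sketch of how a quoted expression might be evaluated once per group. The function eval_summary() and the group data are hypothetical; this is an assumption about the general technique, not tplyr2's internal code.

```r
# Hypothetical evaluator (not tplyr2 internals): bind .var to one group's
# values, evaluate the quoted expression, and fall back to NA_real_ on error.
eval_summary <- function(expr, values) {
  tryCatch(
    as.numeric(eval(expr, list(.var = values))),
    error = function(e) NA_real_
  )
}

geo_mean <- quote(exp(mean(log(.var[.var > 0]), na.rm = TRUE)))
failing  <- quote(stop("cannot compute"))

# Made-up data, one numeric vector per treatment group.
groups <- list(Placebo = c(1, 10, 100), Active = c(2, 8, NA))

geo_by_group  <- vapply(groups, function(v) eval_summary(geo_mean, v), numeric(1))
fail_by_group <- vapply(groups, function(v) eval_summary(failing, v), numeric(1))

print(geo_by_group)   # approximately 10 and 4
print(fail_by_group)  # NA for every group
```

Because quote() stores the expression unevaluated, the same geo_mean object can be reused for every group; only the binding of .var changes.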

Session-Level Custom Summaries

If you find yourself using the same custom summary across many tables in a study, you can register it at the session level using tplyr2_options(). Once registered, the custom statistic is available by name in any format_strings specification, just like the built-in summaries.

# Register a coefficient of variation summary for the session
tplyr2_options(
  custom_summaries = list(
    cv = quote(sd(.var, na.rm = TRUE) / mean(.var, na.rm = TRUE) * 100)
  )
)

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n"          = f_str("xx", "n"),
          "Mean (SD)"  = f_str("xx.x (xx.xx)", "mean", "sd"),
          "CV (%)"     = f_str("xx.x", "cv")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	res1	res2	res3
n	86	84	84
Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
CV (%)	11.4	10.6	11.0


# Clean up
tplyr2_options(custom_summaries = NULL)

Overriding Built-in Summaries

Custom summaries can also override built-in statistics. If you name a custom summary "mean", it replaces the built-in mean calculation. This is useful when your study requires a non-standard definition of a standard statistic, such as using a trimmed mean instead of the arithmetic mean.

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "Trimmed Mean" = f_str("xx.x", "mean")
        ),
        custom_summaries = list(
          mean = quote(mean(.var, trim = 0.1, na.rm = TRUE))
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	res1	res2	res3
Trimmed Mean	75.7	75.0	76.6

Layer-level custom summaries always take priority over session-level custom summaries, and both take priority over built-in statistics. This layered precedence gives you fine-grained control: set sensible defaults at the session level, then override on a per-layer basis when needed.
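One way to picture this precedence is as three named lists merged in order, with later lists winning on name clashes. The sketch below uses utils::modifyList() for the merge; the list contents are illustrative, and how tplyr2 actually resolves names internally is an assumption here.

```r
# Illustrative definitions only; not the package's real internal tables.
builtin <- list(
  mean = quote(mean(.var, na.rm = TRUE)),
  sd   = quote(sd(.var, na.rm = TRUE))
)
session <- list(
  mean = quote(median(.var, na.rm = TRUE)),
  cv   = quote(sd(.var, na.rm = TRUE) / mean(.var, na.rm = TRUE) * 100)
)
layer <- list(
  mean = quote(mean(.var, trim = 0.1, na.rm = TRUE))
)

# Later lists win: built-in < session < layer.
resolved <- modifyList(modifyList(builtin, session), layer)

print(names(resolved))  # "mean" "sd" "cv"
print(resolved$mean)    # the layer-level trimmed mean
```

Here sd falls through from the built-ins, cv comes from the session, and mean is taken from the layer even though all three levels define it.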

Multi-Variable Descriptive Statistics

It is common to summarize several continuous variables in a single table – for example, a demographics table that includes age, height, and weight. Rather than creating separate layers for each variable, you can pass a character vector of variable names to group_desc().

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc(c("AGE", "HEIGHTBL", "WEIGHTBL"),
      settings = layer_settings(
        format_strings = list(
          "n"          = f_str("xx", "n"),
          "Mean (SD)"  = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"     = f_str("xx.x", "median"),
          "Q1, Q3"     = f_str("xx.x, xx.x", "q1", "q3"),
          "Min, Max"   = f_str("xx, xx", "min", "max"),
          "Missing"    = f_str("xx", "missing")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	rowlabel2	res1	res2	res3
AGE	n	86	84	84
AGE	Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
AGE	Median	76.0	76.0	77.5
AGE	Q1, Q3	69.2, 81.8	70.8, 80.0	71.0, 82.0
AGE	Min, Max	52, 89	56, 88	51, 88
AGE	Missing	0	0	0
HEIGHTBL	n	86	84	84
HEIGHTBL	Mean (SD)	162.6 (11.52)	165.8 (10.13)	163.4 (10.42)
HEIGHTBL	Median	162.6	165.1	162.6
HEIGHTBL	Q1, Q3	154.0, 171.2	157.5, 172.8	157.5, 170.2
HEIGHTBL	Min, Max	137, 185	146, 190	136, 196
HEIGHTBL	Missing	0	0	0
WEIGHTBL	n	86	84	83
WEIGHTBL	Mean (SD)	62.8 (12.77)	70.0 (14.65)	67.3 (14.12)
WEIGHTBL	Median	60.5	69.2	64.9
WEIGHTBL	Q1, Q3	53.6, 74.2	57.0, 80.3	56.0, 77.4
WEIGHTBL	Min, Max	34, 86	42, 108	45, 106
WEIGHTBL	Missing	0	0	1

When multiple target variables are specified:

Each variable gets its own block of summary rows. The variable name appears as an additional rowlabel column (here, rowlabel1), with the statistic labels in the next column (rowlabel2).
The same format_strings are applied to every variable. Each variable’s statistics are computed independently, so differences in scale (e.g., age in years vs. height in centimeters) are handled naturally.
Ordering is preserved: the first variable’s rows appear first, followed by the second, and so on.
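The block layout described above can be mimicked in base R by summarizing each variable separately and stacking the results in the order the variables were given. This is a rough sketch on made-up data with unformatted numbers, not a description of how group_desc() is implemented.

```r
# Made-up data: two treatment groups, two continuous variables.
df <- data.frame(
  TRT      = c("A", "A", "B", "B"),
  AGE      = c(70, 80, 60, 90),
  WEIGHTBL = c(60.5, 70.5, 80, NA)
)
vars <- c("AGE", "WEIGHTBL")

# One block of rows per variable; the same statistics are applied to each.
blocks <- lapply(vars, function(v) {
  groups <- split(df[[v]], df$TRT)
  n      <- vapply(groups, function(x) sum(!is.na(x)), numeric(1))
  avg    <- vapply(groups, function(x) mean(x, na.rm = TRUE), numeric(1))
  data.frame(
    rowlabel1 = v,
    rowlabel2 = c("n", "Mean"),
    rbind(n, avg),
    row.names = NULL
  )
})

# Stacking preserves the order of vars.
result <- do.call(rbind, blocks)
print(result)
```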

You can also combine multi-variable descriptive layers with the by parameter. If you add a text label through by, it becomes the outermost row label, followed by the variable name, then the statistic label.

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc(c("AGE", "WEIGHTBL"),
      by = "Demographics",
      settings = layer_settings(
        format_strings = list(
          "n"         = f_str("xx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"    = f_str("xx.x", "median")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])

rowlabel1	rowlabel2	rowlabel3	res1	res2	res3
Demographics	AGE	n	86	84	84
Demographics	AGE	Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
Demographics	AGE	Median	76.0	76.0	77.5
Demographics	WEIGHTBL	n	86	84	83
Demographics	WEIGHTBL	Mean (SD)	62.8 (12.77)	70.0 (14.65)	67.3 (14.12)
Demographics	WEIGHTBL	Median	60.5	69.2	64.9

Where to Go From Here

This vignette covered the fundamentals of descriptive statistics layers in tplyr2: built-in summaries, custom summaries, quantile algorithms, and multi-variable analysis. But there is more to explore when it comes to controlling how your numbers look on the page.

vignette("desc_layer_formatting") covers advanced formatting topics, including:

Auto-precision: Dynamically adjusting decimal places based on the precision of collected data, using precision_by, precision_on, and precision_cap in layer_settings().
Empty value handling: Controlling what appears when all values in a group are missing, using the empty parameter of f_str().
Parenthesis hugging: Eliminating the gap between an opening parenthesis and the number that follows, so that ( 5.2) is rendered as (5.2), using the X character in format strings.