General String Formatting • tplyr2

Introduction

Clinical tables live and die by their alignment. When a reviewer scans a column of numbers, the decimal points must line up, the parentheses must sit in the same position from row to row, and the whitespace must be consistent. One misaligned digit and the table looks unprofessional – or worse, it raises questions about the numbers themselves.

tplyr2 solves this with format strings: a compact notation that describes exactly how wide each number field should be, where decimals fall, and how padding works. The f_str() function is the entry point. You give it a template string and the names of the statistics that fill each slot, and tplyr2 takes care of the rest.

How Format Strings Work

A format string is a character template containing one or more format groups separated by literal text. Each format group corresponds to one statistic.

library(tplyr2)
library(knitr)  # provides kable(), used to render the result tables

# Two format groups separated by the literal " (" and closed by the literal ")"
fmt <- f_str("xx.x (xx.xx)", "mean", "sd")
fmt
#> tplyr format string: "xx.x (xx.xx)"
#> Variables: mean, sd

In this example:

xx.x is the first format group (for mean): two integer digits, one decimal digit.
" (" (a space followed by an opening parenthesis) is the literal separator between the first and second groups.
xx.xx is the second format group (for sd): two integer digits, two decimal digits.
) is a trailing literal.

The number of x characters determines the field width. Each x reserves exactly one character position in the output, so numbers narrower than the field are left-padded with spaces to maintain alignment across rows.
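To make the counting rule concrete, here is a minimal base R sketch that splits a template into its format groups and reports how many integer and decimal positions each one reserves. The helper parse_groups() is hypothetical and only handles lowercase x groups; it illustrates the notation and is not how tplyr2 itself parses format strings.

```r
# Hypothetical helper (not part of tplyr2): find each run of x's, with an
# optional ".x..." decimal part, and count the positions on each side.
parse_groups <- function(template) {
  groups <- regmatches(template, gregexpr("x+(\\.x+)?", template))[[1]]
  parts  <- strsplit(groups, ".", fixed = TRUE)
  data.frame(
    group = groups,
    int   = vapply(parts, function(p) nchar(p[1]), integer(1)),
    dec   = vapply(parts, function(p) if (length(p) > 1) nchar(p[2]) else 0L,
                   integer(1)),
    stringsAsFactors = FALSE
  )
}

groups <- parse_groups("xx.x (xx.xx)")
groups
#>   group int dec
#> 1  xx.x   2   1
#> 2 xx.xx   2   2
```

Everything the pattern does not match, here " (" and ")", is literal text that is copied into the output unchanged.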

Available Variables by Layer Type

Each layer type computes a specific set of statistics that you can reference in format strings.

Layer Type	Variable	Description
Count	`n`	Number of observations
Count	`pct`	Percentage of observations
Count	`total`	Denominator for percentage
Count	`distinct_n`	Number of distinct subjects (requires `distinct_by`)
Count	`distinct_pct`	Percentage of distinct subjects
Count	`distinct_total`	Distinct denominator
Desc	`n`	Non-missing observation count
Desc	`mean`	Arithmetic mean
Desc	`sd`	Standard deviation
Desc	`median`	Median
Desc	`var`	Variance
Desc	`min`	Minimum
Desc	`max`	Maximum
Desc	`iqr`	Interquartile range
Desc	`q1`	First quartile
Desc	`q3`	Third quartile
Desc	`missing`	Count of missing values
Shift	`n`	Number of observations
Shift	`pct`	Percentage of observations
Shift	`total`	Denominator for percentage
Analyze	(user-defined)	Names returned by `analyze_fn`

For count layers, format strings are provided as a named list with the key n_counts. For descriptive statistics layers, each named entry in the format_strings list becomes a separate output row, with the name used as the row label.

Lowercase x: Fixed-Width Fields

The x character is the workhorse of format strings. Each x reserves one character position. If a number has fewer digits than the number of x characters, the output is left-padded with spaces. If a number has more digits than available positions, the number prints in full (it is never truncated).

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      by = "Age (years)",
      settings = layer_settings(
        format_strings = list(
          "n"          = f_str("xx", "n"),
          "Mean (SD)"  = f_str("xx.x (xx.xx)", "mean", "sd"),
          "Median"     = f_str("xx.x", "median"),
          "Q1, Q3"     = f_str("xx.x, xx.x", "q1", "q3"),
          "Min, Max"   = f_str("xx, xx", "min", "max")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, c("rowlabel1", "rowlabel2", "res1", "res2", "res3")])

rowlabel1	rowlabel2	res1	res2	res3
Age (years)	n	86	84	84
Age (years)	Mean (SD)	75.2 ( 8.59)	74.4 ( 7.89)	75.7 ( 8.29)
Age (years)	Median	76.0	76.0	77.5
Age (years)	Q1, Q3	69.2, 81.8	70.8, 80.0	71.0, 82.0
Age (years)	Min, Max	52, 89	56, 88	51, 88

Notice how the numbers align within each column. The xx.x format for mean reserves two characters before the decimal point and one after, so a mean of 75.2 prints as "75.2" while a mean of 8.3 would print as " 8.3" (with a leading space). This consistent field width is what makes clinical tables scannable.
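The padding and overflow rules can be reproduced in base R with formatC(), which is a convenient way to check what a given field should look like. This is a sketch of the behaviour described in this section, not tplyr2's internal code; fmt_fixed() is a hypothetical helper.

```r
# Hypothetical helper: format x into a field with `int` integer positions
# and `dec` decimal positions, the way an "xx.x"-style group describes it.
fmt_fixed <- function(x, int, dec) {
  width <- int + if (dec > 0) dec + 1 else 0   # +1 for the decimal point
  formatC(x, format = "f", digits = dec, width = width)
}

fmt_fixed(75.2,  int = 2, dec = 1)   # "75.2"   fills the field exactly
fmt_fixed(8.3,   int = 2, dec = 1)   # " 8.3"   left-padded with one space
fmt_fixed(123.4, int = 2, dec = 1)   # "123.4"  wider than the field, printed in full
```

The width argument of formatC() is a minimum, which matches the rule that a number too wide for its field is never truncated.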

Uppercase X: Parenthesis Hugging

Standard x formatting pads numbers inside their surrounding delimiters. This means a parenthesis or bracket can end up separated from the number it encloses by one or more spaces, which can look awkward:

 14 ( 16.3%)
  7 (  8.1%)

Uppercase X solves this with parenthesis hugging. When a format group uses X characters, any leading spaces that would normally appear inside the number field are shifted outside the preceding delimiter. The result is that the opening parenthesis (or bracket) always sits immediately next to the first significant digit.

# Standard formatting: spaces inside parentheses
spec_standard <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD",
      settings = layer_settings(
        format_strings = list(
          n_counts = f_str("xxx (xxx.x%)", "n", "pct")
        )
      )
    )
  )
)

result_standard <- tplyr_build(spec_standard, tplyr_adsl)

# Parenthesis hugging: spaces shift outside the delimiter
spec_hugged <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD",
      settings = layer_settings(
        format_strings = list(
          n_counts = f_str("xxx (XXX.x%)", "n", "pct")
        )
      )
    )
  )
)

result_hugged <- tplyr_build(spec_hugged, tplyr_adsl)

Here is the standard formatting, where padding sits between the parenthesis and the number:

kable(result_standard[1:6, c("rowlabel1", "res1", "res2", "res3")])

rowlabel1	res1	res2	res3
ADVERSE EVENT	8 ( 9.3%)	40 ( 47.6%)	44 ( 52.4%)
COMPLETED	58 ( 67.4%)	27 ( 32.1%)	25 ( 29.8%)
DEATH	2 ( 2.3%)	0 ( 0.0%)	1 ( 1.2%)
LACK OF EFFICACY	3 ( 3.5%)	1 ( 1.2%)	0 ( 0.0%)
LOST TO FOLLOW-UP	1 ( 1.2%)	0 ( 0.0%)	1 ( 1.2%)
PHYSICIAN DECISION	1 ( 1.2%)	2 ( 2.4%)	0 ( 0.0%)

And here is the same data with parenthesis hugging applied to the percentage group. The opening parenthesis now hugs the number, and the displaced spaces move to the left of the parenthesis:

kable(result_hugged[1:6, c("rowlabel1", "res1", "res2", "res3")])

rowlabel1	res1	res2	res3
ADVERSE EVENT	8 (9.3% )	40 (47.6% )	44 (52.4% )
COMPLETED	58 (67.4% )	27 (32.1% )	25 (29.8% )
DEATH	2 (2.3% )	0 (0.0% )	1 (1.2% )
LACK OF EFFICACY	3 (3.5% )	1 (1.2% )	0 (0.0% )
LOST TO FOLLOW-UP	1 (1.2% )	0 (0.0% )	1 (1.2% )
PHYSICIAN DECISION	1 (1.2% )	2 (2.4% )	0 (0.0% )

The total string width stays the same – the spaces do not disappear, they just relocate. This preserves column alignment while giving the output a cleaner look.

Parenthesis hugging works with any delimiter that appears as a literal in the format string, including square brackets [ and other characters.
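The relocation of spaces can be sketched in a few lines of base R. The hypothetical helper hug() below follows the description given in this section (leading spaces move in front of the opening delimiter, total width unchanged); it is an illustration under that assumption, not tplyr2's implementation.

```r
# Hypothetical helper: format `x`, then move its leading spaces in front of
# the opening delimiter so the delimiter touches the first digit.
hug <- function(x, int, dec, open = "(", close = ")") {
  padded  <- formatC(x, format = "f", digits = dec, width = int + dec + 1)
  trimmed <- trimws(padded, which = "left")
  n_pad   <- nchar(padded) - nchar(trimmed)
  paste0(strrep(" ", n_pad), open, trimmed, close)
}

pct <- c(16.3, 8.1)

# Standard padding: spaces stay inside the parenthesis
standard <- paste0("(", formatC(pct, format = "f", digits = 1, width = 5), "%)")
# Hugging: the same spaces, moved outside
hugged <- hug(pct, int = 3, dec = 1, close = "%)")

standard   # "( 16.3%)" "(  8.1%)"
hugged     # " (16.3%)" "  (8.1%)"
```

Both versions are eight characters wide, and passing open = "[" with close = "]" gives the same effect for square brackets.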

Auto-Precision with a and A

In descriptive statistics tables, the appropriate number of decimal places often depends on the data itself. A lab parameter measured to one decimal place should be summarized with one or two decimal digits, while one measured to three decimal places needs more. Hardcoding widths for every parameter is tedious and error-prone.

The a and A characters enable auto-precision: the field width is determined at build time from the actual data. Specifically, tplyr2 scans the target variable, finds the maximum number of decimal places present, and uses that as the base precision.

a – auto-precision digit (like x, but width comes from data)
A – auto-precision with parenthesis hugging (like X, but width comes from data)
+N suffix – adds N to the auto-determined width (e.g., after the decimal point, a+1 means data precision plus one extra decimal place)

Auto-precision is controlled by three settings in layer_settings():

precision_by: character vector of grouping variables (precision computed per group)
precision_on: character name of the variable to scan for precision (defaults to target variable)
precision_cap: named numeric vector c(int = , dec = ) to cap the maximum widths

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_desc("AVAL",
      by = "Urate (umol/L)",
      where = AVISIT %in% c("Baseline", "Week 12", "Week 24"),
      settings = layer_settings(
        precision_on = "AVAL",
        format_strings = list(
          "n"          = f_str("xx", "n"),
          "Mean (SD)"  = f_str("a+1.a+1 (a+2.a+2)", "mean", "sd"),
          "Median"     = f_str("a+1.a+1", "median"),
          "Q1, Q3"     = f_str("a+1.a+1, a+1.a+1", "q1", "q3"),
          "Min, Max"   = f_str("a.a, a.a", "min", "max")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adlb)
kable(result[, c("rowlabel1", "rowlabel2", "res1", "res2", "res3")])

rowlabel1	rowlabel2	res1	res2	res3
Urate (umol/L)	n	20	21	11
Urate (umol/L)	Mean (SD)	324.7608 ( 74.75026)	298.8162 ( 50.36185)	282.2596 ( 77.20328)
Urate (umol/L)	Median	306.3220	297.4000	273.6080
Urate (umol/L)	Q1, Q3	266.1730, 394.0550	273.6080, 315.2440	240.8940, 297.4000
Urate (umol/L)	Min, Max	226.024, 469.892	231.972, 469.892	178.440, 428.256

In this example, precision_on = "AVAL" tells tplyr2 to scan the AVAL column to determine the number of decimal places present in the data. The +1 and +2 suffixes add one and two extra decimal places beyond what the data contains. For Min, Max, a.a uses the raw data precision with no extra digits.

When precision_by is also set, precision is computed separately for each group – useful when a single spec covers multiple lab parameters, each measured at a different scale. This means you can write one spec that handles dozens of parameters, each rendered at the precision appropriate to its measurement.
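As a rough illustration of what scanning, capping, and per-group precision involve, the following base R sketch derives integer and decimal widths from toy data and applies them the way a.a and a+1.a+1 would. All helper names (count_decimals(), scan_precision(), fmt_auto()) and the toy values are hypothetical; the sketch mirrors the behaviour described in this section and is not tplyr2's own code.

```r
# Hypothetical helpers (not tplyr2 functions), base R only.

# Decimal places needed to print one value.
count_decimals <- function(v) {
  s <- format(v, digits = 15, scientific = FALSE)
  if (grepl(".", s, fixed = TRUE)) nchar(sub("^[^.]*\\.", "", s)) else 0L
}

# Widest integer part and most decimal places in a vector, optionally capped
# (the cap plays the role described for precision_cap).
scan_precision <- function(x, cap = c(int = Inf, dec = Inf)) {
  x   <- x[!is.na(x)]
  int <- max(nchar(format(trunc(abs(x)), scientific = FALSE, trim = TRUE)))
  dec <- max(vapply(x, count_decimals, integer(1)))
  c(int = min(int, cap[["int"]]), dec = min(dec, cap[["dec"]]))
}

# Format x as "a+N.a+N": data precision plus `plus` positions on each side.
fmt_auto <- function(x, prec, plus = 0) {
  int <- prec[["int"]] + plus
  dec <- prec[["dec"]] + plus
  formatC(x, format = "f", digits = dec,
          width = int + if (dec > 0) dec + 1 else 0)
}

# Toy data: parameter codes and values are made up for illustration.
labs <- data.frame(
  PARAMCD = c("URATE", "URATE", "URATE", "ALT", "ALT"),
  AVAL    = c(226.024, 469.892, 178.44, 27, 131)
)

# One precision per parameter, analogous to grouping with precision_by
prec_by_param <- lapply(split(labs$AVAL, labs$PARAMCD), scan_precision)
prec_by_param$URATE   # int = 3, dec = 3
prec_by_param$ALT     # int = 3, dec = 0

urate <- labs$AVAL[labs$PARAMCD == "URATE"]
fmt_auto(range(urate), prec_by_param$URATE)            # "178.440" "469.892"
fmt_auto(mean(urate), prec_by_param$URATE, plus = 1)   # " 291.4520"

# Capping the decimal width at 2
scan_precision(urate, cap = c(int = 3, dec = 2))       # int = 3, dec = 2
```

Computing the widths once per group, rather than per cell, is what keeps every row of a parameter aligned at the same precision.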

Note that auto-precision characters can also be used in count layers. For instance, a in the integer portion of a count format will auto-size the field width based on the data, so you do not have to guess how many digits the largest count will require.

The empty Argument

When a statistic is NA (for example, standard deviation when n = 1), format strings produce blank space by default to preserve alignment. You can override this with the empty argument to f_str(), which specifies a replacement string to use when every statistic in the format string is missing.

fmt_with_empty <- f_str(
  "xx.x (xx.xx)",
  "mean", "sd",
  empty = c(.overall = "   -")
)
fmt_with_empty
#> tplyr format string: "xx.x (xx.xx)"
#> Variables: mean, sd
#> Empty: c(.overall = "   -")

The .overall key means the replacement applies when all statistics in the string are NA. This is useful for displaying a dash or other placeholder in rows where a summary cannot be computed.
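The rule can be pictured with a small base R sketch. The helpers fmt_or_blank() and mean_sd() are hypothetical, and the sketch assumes that literal text such as the parentheses stays in place when only some statistics are missing; tplyr2's actual handling may differ in detail.

```r
# Hypothetical helpers illustrating the rule; not tplyr2's implementation.

# Format one statistic, or return blanks of the same width when it is NA.
fmt_or_blank <- function(x, digits, width) {
  if (is.na(x)) {
    strrep(" ", width)
  } else {
    formatC(x, format = "f", digits = digits, width = width)
  }
}

# "xx.x (xx.xx)" for a mean and an SD, with an optional placeholder that is
# used only when both statistics are missing.
mean_sd <- function(m, s, empty = NULL) {
  if (!is.null(empty) && is.na(m) && is.na(s)) return(empty)
  paste0(fmt_or_blank(m, 1, 4), " (", fmt_or_blank(s, 2, 5), ")")
}

mean_sd(75.2, 8.59)               # "75.2 ( 8.59)"
mean_sd(64, NA)                   # "64.0 (     )"  SD slot left blank
mean_sd(NA, NA, empty = "   -")   # "   -"          placeholder replaces the cell
```

The placeholder is padded by hand ("   -") so that the dash lands in a sensible position within the column.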

Putting It All Together

Let us build a more complete example: an adverse event summary that combines parenthesis hugging with fixed-width fields, similar to what you would see in a clinical study report.

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count(c("AEBODSYS", "AEDECOD"),
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xxx (XXX.x%)", "distinct_n", "distinct_pct")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae)
collapsed <- collapse_row_labels(result, "rowlabel1", "rowlabel2", indent = "   ")
kable(head(collapsed[, c("row_label", "res1", "res2", "res3")], 15))

row_label	res1	res2	res3
CARDIAC DISORDERS	4 (12.5% )	6 (14.0% )	5 (10.0% )
ATRIAL FIBRILLATION	0 (0.0% )	0 (0.0% )	1 (2.0% )
ATRIAL FLUTTER	0 (0.0% )	1 (2.3% )	0 (0.0% )
ATRIAL HYPERTROPHY	1 (3.1% )	0 (0.0% )	0 (0.0% )
BUNDLE BRANCH BLOCK RIGHT	1 (3.1% )	0 (0.0% )	0 (0.0% )
CARDIAC FAILURE CONGESTIVE	1 (3.1% )	0 (0.0% )	0 (0.0% )
MYOCARDIAL INFARCTION	0 (0.0% )	1 (2.3% )	2 (4.0% )
SINUS BRADYCARDIA	0 (0.0% )	3 (7.0% )	1 (2.0% )
SUPRAVENTRICULAR EXTRASYSTOLES	1 (3.1% )	0 (0.0% )	1 (2.0% )
SUPRAVENTRICULAR TACHYCARDIA	0 (0.0% )	0 (0.0% )	1 (2.0% )
TACHYCARDIA	1 (3.1% )	0 (0.0% )	0 (0.0% )
VENTRICULAR EXTRASYSTOLES	0 (0.0% )	1 (2.3% )	0 (0.0% )
CONGENITAL, FAMILIAL AND GENETIC DISORDERS	0 (0.0% )	1 (2.3% )	0 (0.0% )

In this table:

xxx gives a three-digit fixed field for distinct subject counts.
XXX.x uses parenthesis hugging so the ( sits right next to the percentage value.
The % sign is literal text that appears after the percentage in every cell.
distinct_n and distinct_pct compute subject-level (not event-level) summaries.
Nested counts display body system totals alongside preferred term detail.
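The difference between event-level and subject-level counts is easy to see on a toy dataset. The base R sketch below uses made-up records and assumes the denominator is the number of distinct subjects in the data; it shows the arithmetic behind distinct_n and distinct_pct rather than tplyr2's own computation.

```r
# Toy adverse event records: one row per event (values are made up).
ae <- data.frame(
  USUBJID = c("01", "01", "02", "03", "03", "03"),
  AEDECOD = c("HEADACHE", "HEADACHE", "HEADACHE", "NAUSEA", "NAUSEA", "HEADACHE"),
  stringsAsFactors = FALSE
)

# Event-level count: every row counts
event_n <- as.vector(table(ae$AEDECOD))          # HEADACHE 4, NAUSEA 2

# Subject-level count: each subject counts once per term
distinct_n <- tapply(ae$USUBJID, ae$AEDECOD,
                     function(id) length(unique(id)))   # HEADACHE 3, NAUSEA 1

# Assumed denominator for this sketch: all distinct subjects in the data
distinct_total <- length(unique(ae$USUBJID))     # 3
distinct_pct   <- 100 * distinct_n / distinct_total

# Fixed-width cells in the spirit of "xxx (xxx.x%)"
cells <- sprintf("%3d (%5.1f%%)", as.vector(distinct_n), as.vector(distinct_pct))
names(cells) <- names(distinct_n)
cells   # HEADACHE: "  3 (100.0%)", NAUSEA: "  1 ( 33.3%)"
```

Subject 01 reports HEADACHE twice, which raises the event count to 4 but leaves the subject count at 3.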

The format string system in tplyr2 is designed so that you declare the shape of each number field once, and the package handles padding, alignment, and precision across every cell in the table. Whether you need fixed widths, data-driven precision, or delimiter-hugging alignment, the same f_str() interface covers all three.