Denominators

Introduction

When a clinical table displays a percentage, the natural question is: a percentage of what? The denominator – the number in the bottom of that fraction – drives the interpretation of every count cell in your output. A disposition table typically uses the total number of subjects in each treatment arm. An adverse event table stratified by gender might need the number of subjects within each treatment-by-gender combination. A subgroup analysis might exclude certain categories from the denominator entirely.

tplyr2 provides several mechanisms for controlling denominators in count layers, all configured through layer_settings(). This vignette walks through each one, starting with the default behavior and building up to more specialized scenarios.

Default Behavior

By default, tplyr2 uses the column variable total as the denominator. If your spec uses cols = "TRT01P", then the denominator for each cell is the total number of rows in that treatment group.

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("RACE")
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, c("rowlabel1", "res1", "res2", "res3")])

rowlabel1	res1	res2	res3
WHITE	78 (90.7%)	74 (88.1%)	78 (92.9%)
BLACK OR AFRICAN AMERICAN	8 ( 9.3%)	9 (10.7%)	6 ( 7.1%)
AMERICAN INDIAN OR ALASKA NATIVE	0 ( 0.0%)	1 ( 1.2%)	0 ( 0.0%)

Each percentage is computed against the total count within its column. For the Placebo arm, there are 86 subjects, so the WHITE row shows 78 (90.7%) because 78 / 86 = 90.7%.

This default is appropriate for most simple tables, but many real-world scenarios call for something different.

Controlling Denominator Grouping with `denoms_by`

The denoms_by parameter in layer_settings() lets you specify exactly which variables define the denominator groups. When you add a by variable to a count layer, the default denominator is still the column total. But sometimes you want the percentage to reflect the subgroup population instead.

Consider a disposition table broken out by sex. With the default denominator, each cell percentage is based on the total treatment arm count:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD",
      by = "SEX"
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(head(result[, c("rowlabel1", "rowlabel2", "res1", "res2", "res3")], 10))

rowlabel1	rowlabel2	res1	res2	res3
F	ADVERSE EVENT	6 ( 7.0%)	20 (23.8%)	26 (31.0%)
F	COMPLETED	34 (39.5%)	13 (15.5%)	17 (20.2%)
F	DEATH	1 ( 1.2%)	0 ( 0.0%)	1 ( 1.2%)
F	LACK OF EFFICACY	2 ( 2.3%)	1 ( 1.2%)	0 ( 0.0%)
F	LOST TO FOLLOW-UP	1 ( 1.2%)	0 ( 0.0%)	1 ( 1.2%)
F	PHYSICIAN DECISION	1 ( 1.2%)	1 ( 1.2%)	0 ( 0.0%)
F	PROTOCOL VIOLATION	1 ( 1.2%)	1 ( 1.2%)	0 ( 0.0%)
F	STUDY TERMINATED BY SPONSOR	1 ( 1.2%)	0 ( 0.0%)	0 ( 0.0%)
F	WITHDRAWAL BY SUBJECT	6 ( 7.0%)	4 ( 4.8%)	5 ( 6.0%)
M	ADVERSE EVENT	2 ( 2.3%)	20 (23.8%)	18 (21.4%)

Here the percentages use the full treatment arm as the denominator. But if you want each sex subgroup to sum to 100% independently, set denoms_by to include both the column variable and the by variable:

denoms_by replaces, it does not augment. denoms_by replaces the default denominator grouping (the cols variable(s)) rather than refining within it. To get per-column denominators that also break down by a by variable you must list the cols variable(s) explicitly — e.g. denoms_by = c("TRT01P", "SEX"), not denoms_by = "SEX". Passing only the by variable computes each percentage out of the entire by-group population across all columns, so the percentages won’t sum sensibly per column.

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD",
      by = "SEX",
      settings = layer_settings(
        denoms_by = c("TRT01P", "SEX")
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(head(result[, c("rowlabel1", "rowlabel2", "res1", "res2", "res3")], 10))

rowlabel1	rowlabel2	res1	res2	res3
F	ADVERSE EVENT	6 (11.3%)	20 (50.0%)	26 (52.0%)
F	COMPLETED	34 (64.2%)	13 (32.5%)	17 (34.0%)
F	DEATH	1 ( 1.9%)	0 ( 0.0%)	1 ( 2.0%)
F	LACK OF EFFICACY	2 ( 3.8%)	1 ( 2.5%)	0 ( 0.0%)
F	LOST TO FOLLOW-UP	1 ( 1.9%)	0 ( 0.0%)	1 ( 2.0%)
F	PHYSICIAN DECISION	1 ( 1.9%)	1 ( 2.5%)	0 ( 0.0%)
F	PROTOCOL VIOLATION	1 ( 1.9%)	1 ( 2.5%)	0 ( 0.0%)
F	STUDY TERMINATED BY SPONSOR	1 ( 1.9%)	0 ( 0.0%)	0 ( 0.0%)
F	WITHDRAWAL BY SUBJECT	6 (11.3%)	4 (10.0%)	5 (10.0%)
M	ADVERSE EVENT	2 ( 6.1%)	20 (45.5%)	18 (52.9%)

Now the denominator for each cell is the count of subjects in that specific treatment-by-sex combination. The female Placebo subjects have their own denominator, the male Placebo subjects have theirs, and so on. Within each sex-by-treatment group, the percentages will sum to 100%.

The denoms_by parameter accepts any character vector of variable names present in the data. You are not restricted to column or by variables – you could use any grouping that makes sense for your analysis.

Separate Denominator Populations with `denom_where`

Sometimes the denominator should come from a different subset of the data than the numerator. The denom_where parameter accepts a quoted expression that filters the data used for denominator computation, while leaving the numerator counts unaffected.

For example, suppose you want to show all disposition categories but compute percentages relative to subjects who did not complete the study:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD",
      settings = layer_settings(
        denom_where = quote(DCDECOD != "COMPLETED")
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, c("rowlabel1", "res1", "res2", "res3")])

rowlabel1	res1	res2	res3
ADVERSE EVENT	8 (28.6%)	40 (70.2%)	44 (74.6%)
COMPLETED	58 (207.1%)	27 (47.4%)	25 (42.4%)
DEATH	2 ( 7.1%)	0 ( 0.0%)	1 ( 1.7%)
LACK OF EFFICACY	3 (10.7%)	1 ( 1.8%)	0 ( 0.0%)
LOST TO FOLLOW-UP	1 ( 3.6%)	0 ( 0.0%)	1 ( 1.7%)
PHYSICIAN DECISION	1 ( 3.6%)	2 ( 3.5%)	0 ( 0.0%)
PROTOCOL VIOLATION	2 ( 7.1%)	3 ( 5.3%)	1 ( 1.7%)
STUDY TERMINATED BY SPONSOR	2 ( 7.1%)	3 ( 5.3%)	2 ( 3.4%)
WITHDRAWAL BY SUBJECT	9 (32.1%)	8 (14.0%)	10 (16.9%)

Notice that the COMPLETED row now shows percentages exceeding 100%. That is because the numerator (subjects who completed) is larger than the denominator (subjects who did not complete). This is expected when using denom_where – it shifts the frame of reference for the percentage calculation without changing which rows appear in the count.

This feature is particularly useful in adverse event tables where you might want to restrict the denominator to the safety population or to subjects who received at least one dose of study medication.

Excluding Values from Denominators with `denom_ignore`

The denom_ignore parameter removes specific values of the target variable from the denominator calculation. Unlike denom_where, which filters rows based on an arbitrary condition, denom_ignore targets specific levels of the variable being counted.

This is useful when a categorical variable includes a level like “NOT APPLICABLE” or “NOT ASSESSED” that should not contribute to the total. Here we exclude “AMERICAN INDIAN OR ALASKA NATIVE” from the denominator when counting race categories:

spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("RACE",
      settings = layer_settings(
        denom_ignore = c("AMERICAN INDIAN OR ALASKA NATIVE")
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adsl)
kable(result[, c("rowlabel1", "res1", "res2", "res3")])

rowlabel1	res1	res2	res3
WHITE	78 (90.7%)	74 (89.2%)	78 (92.9%)
BLACK OR AFRICAN AMERICAN	8 ( 9.3%)	9 (10.8%)	6 ( 7.1%)
AMERICAN INDIAN OR ALASKA NATIVE	0 ( 0.0%)	1 ( 1.2%)	0 ( 0.0%)

The “AMERICAN INDIAN OR ALASKA NATIVE” row still appears in the output with its count, but subjects in that category are excluded from the denominator used to calculate all percentages. This means the remaining categories will have percentages that sum to slightly more than 100% (because the “AMERICAN INDIAN OR ALASKA NATIVE” count was removed from the bottom of the fraction but included in the numerator for its own row).

You can pass multiple values to denom_ignore as a character vector.

Denominators with Distinct Counts

When distinct_by is set in layer_settings(), tplyr2 computes both event-level and subject-level counts. The denominator system applies separately to each:

n and pct use the event-level (row-level) denominator
distinct_n and distinct_pct use the distinct-subject denominator

Both denominators respect denoms_by, denom_where, and denom_ignore.

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count("AEDECOD",
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xx (xx.x%) [xxx]", "distinct_n", "distinct_pct", "n")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae)
kable(head(result[, c("rowlabel1", "res1", "res2", "res3")], 8))

rowlabel1	res1	res2	res3
ABDOMINAL PAIN	0 ( 0.0%) [ 0]	0 ( 0.0%) [ 0]	1 ( 2.0%) [ 1]
AGITATION	0 ( 0.0%) [ 0]	0 ( 0.0%) [ 0]	1 ( 2.0%) [ 1]
ANXIETY	0 ( 0.0%) [ 0]	0 ( 0.0%) [ 0]	1 ( 2.0%) [ 1]
APPLICATION SITE DERMATITIS	1 ( 3.1%) [ 1]	3 ( 7.0%) [ 3]	2 ( 4.0%) [ 2]
APPLICATION SITE ERYTHEMA	0 ( 0.0%) [ 0]	3 ( 7.0%) [ 3]	4 ( 8.0%) [ 4]
APPLICATION SITE IRRITATION	1 ( 3.1%) [ 1]	3 ( 7.0%) [ 4]	2 ( 4.0%) [ 2]
APPLICATION SITE PAIN	0 ( 0.0%) [ 0]	1 ( 2.3%) [ 1]	0 ( 0.0%) [ 0]
APPLICATION SITE PRURITUS	4 (12.5%) [ 4]	6 (14.0%) [ 7]	5 (10.0%) [ 5]

The first number is the count of distinct subjects, the percentage is the proportion of distinct subjects, and the bracketed number is the total event count. The denominators for distinct_n and distinct_pct are the total number of distinct subjects per treatment arm (rather than the total number of event rows).

Confidence Intervals for the Single Proportion

AE-incidence and response tables often display a confidence interval for the single proportion alongside each n (%) cell. tplyr2 exposes this as four count-layer f_str keywords, computed per column-by-target-level cell on the same percentage scale as pct:

ci_lower / ci_upper – from the event-level n / total
distinct_ci_lower / distinct_ci_upper – from the subject-level distinct_n / distinct_total

Two layer_settings() controls choose how the interval is built: ci_method (the estimation method) and ci_level (the coverage, default 0.95). The default method, "clopper_pearson", is the exact interval that matches SAS PROC FREQ ... EXACT and stats::binom.test() – the usual clinical convention. The other methods are "wilson" (score interval, matching stats::prop.test(correct = FALSE)), "wald", "agresti_coull", and "jeffreys".

spec <- tplyr_spec(
  cols = "TRTA",
  layers = tplyr_layers(
    group_count("AEDECOD",
      settings = layer_settings(
        distinct_by = "USUBJID",
        ci_method   = "clopper_pearson",
        ci_level    = 0.95,
        format_strings = list(
          n_counts = f_str("xx (xx.x%) [xx.x, xx.x]",
                           "distinct_n", "distinct_pct",
                           "distinct_ci_lower", "distinct_ci_upper")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae)
kable(head(result[, c("rowlabel1", "res1", "res2", "res3")], 8))

rowlabel1	res1	res2	res3
ABDOMINAL PAIN	0 ( 0.0%) [ 0.0, 10.9]	0 ( 0.0%) [ 0.0, 8.2]	1 ( 2.0%) [ 0.1, 10.6]
AGITATION	0 ( 0.0%) [ 0.0, 10.9]	0 ( 0.0%) [ 0.0, 8.2]	1 ( 2.0%) [ 0.1, 10.6]
ANXIETY	0 ( 0.0%) [ 0.0, 10.9]	0 ( 0.0%) [ 0.0, 8.2]	1 ( 2.0%) [ 0.1, 10.6]
APPLICATION SITE DERMATITIS	1 ( 3.1%) [ 0.1, 16.2]	3 ( 7.0%) [ 1.5, 19.1]	2 ( 4.0%) [ 0.5, 13.7]
APPLICATION SITE ERYTHEMA	0 ( 0.0%) [ 0.0, 10.9]	3 ( 7.0%) [ 1.5, 19.1]	4 ( 8.0%) [ 2.2, 19.2]
APPLICATION SITE IRRITATION	1 ( 3.1%) [ 0.1, 16.2]	3 ( 7.0%) [ 1.5, 19.1]	2 ( 4.0%) [ 0.5, 13.7]
APPLICATION SITE PAIN	0 ( 0.0%) [ 0.0, 10.9]	1 ( 2.3%) [ 0.1, 12.3]	0 ( 0.0%) [ 0.0, 7.1]
APPLICATION SITE PRURITUS	4 (12.5%) [ 3.5, 29.0]	6 (14.0%) [ 5.3, 27.9]	5 (10.0%) [ 3.3, 21.8]

Each cell now shows n (%) [lower, upper], where the bracketed bounds are the confidence limits on the percentage scale. Because these are ordinary format keywords, they also compose with stat_columns (one n (%) column and one [CI] column) and appear on Total/Missing rows just like the percentage does. The bounds are only computed when a format string actually references one of the four keywords, so layers that do not display a CI incur no extra cost. The vectorized engine behind the keywords, proportion_ci(), is exported for standalone use.

This single-proportion interval is distinct from the between-arm confidence interval that risk_diff produces; the two are independent and can appear in the same table. vignette("binding-statistics") discusses the single-proportion CI alongside the other comparative statistics.

Denominators with Population Data

In many clinical studies, the analysis data contains only a subset of the randomized population. Adverse event data, for example, only includes subjects who experienced at least one event. If you compute denominators from the event data alone, you will undercount the population.

The pop_data() configuration at the spec level tells tplyr2 to draw denominators from a separate population dataset. When pop_data is specified, the denominator for each treatment arm comes from the population dataset rather than the analysis data.

spec <- tplyr_spec(
  cols = "TRTA",
  pop_data = pop_data(cols = c("TRTA" = "TRT01P")),
  layers = tplyr_layers(
    group_count("AEDECOD",
      settings = layer_settings(
        distinct_by = "USUBJID",
        format_strings = list(
          n_counts = f_str("xx (xx.x%)", "distinct_n", "distinct_pct")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae, pop_data = tplyr_adsl)
kable(head(result[, c("rowlabel1", "res1", "res2", "res3")], 8))

rowlabel1	res1	res2	res3
ABDOMINAL PAIN	0 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
AGITATION	0 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
ANXIETY	0 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
APPLICATION SITE DERMATITIS	1 ( 1.2%)	3 ( 3.6%)	2 ( 2.4%)
APPLICATION SITE ERYTHEMA	0 ( 0.0%)	3 ( 3.6%)	4 ( 4.8%)
APPLICATION SITE IRRITATION	1 ( 1.2%)	3 ( 3.6%)	2 ( 2.4%)
APPLICATION SITE PAIN	0 ( 0.0%)	1 ( 1.2%)	0 ( 0.0%)
APPLICATION SITE PRURITUS	4 ( 4.7%)	6 ( 7.1%)	5 ( 6.0%)

Without pop_data, the denominators come from the AE dataset itself. With it, they come from the full ADSL population. This typically produces smaller percentages because the denominator is larger – it includes all randomized subjects, not just those with events.

The named vector in pop_data(cols = c("TRTA" = "TRT01P")) maps the analysis data column name (TRTA) to the corresponding population data column name (TRT01P). This mapping is necessary when the column variables have different names across datasets.

You can also retrieve the population-level N values used in column headers:

kable(tplyr_header_n(result))

TRTA	.n
Placebo	86
Xanomeline High Dose	84
Xanomeline Low Dose	84

Missing Subjects

When using population data, some subjects in the population may not appear in the analysis data at all. The missing_subjects parameter adds a row that counts these absent subjects.

spec <- tplyr_spec(
  cols = "TRTA",
  pop_data = pop_data(cols = c("TRTA" = "TRT01P")),
  layers = tplyr_layers(
    group_count("AEDECOD",
      settings = layer_settings(
        distinct_by = "USUBJID",
        missing_subjects = TRUE,
        missing_subjects_label = "Not reported",
        format_strings = list(
          n_counts = f_str("xx (xx.x%)", "distinct_n", "distinct_pct")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae, pop_data = tplyr_adsl)

# Show the "Not reported" row
nr_idx <- which(result$rowlabel1 == "Not reported")
kable(result[nr_idx, c("rowlabel1", "res1", "res2", "res3")])

	rowlabel1	res1	res2	res3
89	Not reported	54 (62.8%)	41 (48.8%)	34 (40.5%)

The “Not reported” row counts subjects who are present in the population (ADSL) but have no records in the analysis data (ADAE). This is particularly important for adverse event tables, where a large proportion of the safety population may have no reported events.

The missing_subjects_label parameter controls the text displayed in the row label. It defaults to "Missing", but "Not reported" or "No events reported" are common alternatives for adverse event tables.

Note that missing_subjects requires pop_data to be set at the spec level. Without a separate population dataset, there is no reference population to identify missing subjects against.

Putting It All Together

In practice, you often combine several of these features in a single layer. Here is a more complete adverse event summary that uses population-based denominators, distinct subject counts, and a missing subjects row:

spec <- tplyr_spec(
  cols = "TRTA",
  pop_data = pop_data(cols = c("TRTA" = "TRT01P")),
  total_groups = list(total_group("TRTA", label = "Total")),
  layers = tplyr_layers(
    group_count(c("AEBODSYS", "AEDECOD"),
      settings = layer_settings(
        distinct_by = "USUBJID",
        missing_subjects = TRUE,
        missing_subjects_label = "No events reported",
        format_strings = list(
          n_counts = f_str("xx (xx.x%)", "distinct_n", "distinct_pct")
        )
      )
    )
  )
)

result <- tplyr_build(spec, tplyr_adae, pop_data = tplyr_adsl)
collapsed <- collapse_row_labels(result, "rowlabel1", "rowlabel2", indent = "   ")

# Show the first section plus the missing subjects row
nr_idx <- which(collapsed$row_label == "No events reported")
show_rows <- c(1:12, nr_idx)
kable(collapsed[show_rows, c("row_label", "res1", "res2", "res3", "res4")])

row_label	res1	res2	res3	res4
CARDIAC DISORDERS
	4 ( 4.7%)	15 ( 0.0%)	6 ( 7.1%)	5 ( 6.0%)
ATRIAL FIBRILLATION	0 ( 0.0%)	1 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
ATRIAL FLUTTER	0 ( 0.0%)	1 ( 0.0%)	1 ( 1.2%)	0 ( 0.0%)
ATRIAL HYPERTROPHY	1 ( 1.2%)	1 ( 0.0%)	0 ( 0.0%)	0 ( 0.0%)
BUNDLE BRANCH BLOCK RIGHT	1 ( 1.2%)	1 ( 0.0%)	0 ( 0.0%)	0 ( 0.0%)
CARDIAC FAILURE CONGESTIVE	1 ( 1.2%)	1 ( 0.0%)	0 ( 0.0%)	0 ( 0.0%)
MYOCARDIAL INFARCTION	0 ( 0.0%)	3 ( 0.0%)	1 ( 1.2%)	2 ( 2.4%)
SINUS BRADYCARDIA	0 ( 0.0%)	4 ( 0.0%)	3 ( 3.6%)	1 ( 1.2%)
SUPRAVENTRICULAR EXTRASYSTOLES	1 ( 1.2%)	2 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
SUPRAVENTRICULAR TACHYCARDIA	0 ( 0.0%)	1 ( 0.0%)	0 ( 0.0%)	1 ( 1.2%)
TACHYCARDIA	1 ( 1.2%)	1 ( 0.0%)	0 ( 0.0%)	0 ( 0.0%)

This spec includes population-based denominators from ADSL, a “Total” column, nested body system / preferred term counts, and a row for subjects with no reported events. The denominator for every cell comes from the full population dataset, so the percentages reflect the proportion of all randomized subjects in each arm.

Summary

The table below summarizes the denominator controls available in layer_settings():

Parameter	Purpose	Default
`denoms_by`	Variables defining denominator groups	Column variable(s)
`denom_where`	Expression to filter denominator data	No filter
`denom_ignore`	Target variable values to exclude from denominators	None
`distinct_by`	Variable for distinct subject counting	NULL (event-level only)
`missing_subjects`	Add row for subjects absent from analysis data	FALSE
`pop_data()`	Separate population dataset for denominators	None (use analysis data)

Each of these can be used independently or combined as needed. The key principle is that the numerator (what you are counting) and the denominator (what you are dividing by) can be controlled separately, giving you precise control over how percentages are computed in your clinical tables.