Metadata and Traceability
Why Traceability Matters
Clinical summary tables go through rigorous quality control. When a reviewer questions a number, the programmer needs to trace that cell back to the exact rows in the source data that produced it. Without a systematic approach, this means re-reading code, reconstructing filter logic, and manually subsetting the data – a tedious and error-prone process.
tplyr2 solves this with cell-level metadata. When you build a table
with metadata = TRUE, the package records the filter
expressions that define every cell. You can then query any cell to
inspect the filters or retrieve the source data rows directly. This is
valuable for regulatory review, quality control, and powering
interactive drill-down interfaces in Shiny applications.
Building with Metadata
Enabling metadata is a single argument to
tplyr_build():
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX"),
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xxx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
The returned data frame has two additions compared to a normal build:
- A row_id column containing a unique identifier for each output row.
- A tplyr_meta attribute – a named list of metadata objects keyed by "row_id||column".
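To picture the keying scheme, the base R sketch below builds a toy named list keyed by "row_id||column" and looks up one entry. The list contents and the lookup_cell() helper are hypothetical stand-ins for illustration, not tplyr2 internals.

```r
# Toy stand-in for the metadata attribute: a named list keyed by "row_id||column".
# The entries are plain character vectors, not real tplyr_meta objects.
toy_meta <- list(
  "1_F||res1" = c('TRT01P == "Placebo"', 'SEX == "F"'),
  "1_M||res1" = c('TRT01P == "Placebo"', 'SEX == "M"')
)

# Hypothetical helper: build the key from a row ID and a column name.
lookup_cell <- function(meta_list, row_id, column) {
  key <- paste(row_id, column, sep = "||")
  meta_list[[key]]  # NULL when no entry has that exact name
}

found <- lookup_cell(toy_meta, "1_F", "res1")
absent <- lookup_cell(toy_meta, "9_Z", "res1")
```

Here found holds the two filter strings for the "1_F" row in column res1, and absent is NULL because no entry carries that key.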
Row IDs
Row IDs are constructed from the layer index and row label values.
For a count layer with target_var = "SEX" in layer 1, the
IDs are 1_F and 1_M. For a descriptive
statistics layer in layer 2, they are 2_n and
2_Mean (SD).
result$row_id
#> [1] "1_F" "1_M" "2_n" "2_Mean (SD)"
IDs are deterministic – building the same spec against the same data
always produces the same values. The generate_row_ids()
function creates these identifiers and can also be called on any tplyr2
output, even one built without metadata.
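As a rough illustration of why such IDs are deterministic, the sketch below joins a layer index and a row label with an underscore. The make_row_id() helper is hypothetical, and the exact rules tplyr2 applies (for example when by variables add more labels) are not shown here.

```r
# Hypothetical helper: join a layer index and a row label with "_".
make_row_id <- function(layer_index, row_label) {
  paste(layer_index, row_label, sep = "_")
}

layer_index <- c(1, 1, 2, 2)
row_label <- c("F", "M", "n", "Mean (SD)")

ids <- make_row_id(layer_index, row_label)
ids
#> [1] "1_F"         "1_M"         "2_n"         "2_Mean (SD)"
```

Because the ID depends only on the layer index and the label text, the same inputs always give the same identifiers.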
Inspecting Cell Metadata
Once you have a row ID and a column name,
tplyr_meta_result() returns the metadata object for that
cell:
meta <- tplyr_meta_result(result, "1_F", "res1")
meta
#> tplyr_meta [layer 1]
#> Names: TRT01P, SEX
#> Filters:
#> TRT01P == "Placebo"
#> SEX == "F"
The tplyr_meta object contains:
- names: Variable names relevant to this cell.
- filters: A list of R call expressions that, combined with &, define the data subset.
- layer_index: Which layer this cell belongs to.
- anti_join: NULL for normal cells, or a tplyr_meta_anti_join object for missing subjects rows.
If the cell does not exist, the function returns NULL.
If you try to access metadata on a result built without it, you get a
clear error:
result_no_meta <- tplyr_build(spec, tplyr_adsl, metadata = FALSE)
tplyr_meta_result(result_no_meta, "1_F", "res1")
#> Error:
#> ! No metadata available. Rebuild with metadata = TRUE
Getting Source Data
The real power of metadata is in tplyr_meta_subset(). It
evaluates the stored filters against the original data, returning the
rows that produced a cell:
source_rows <- tplyr_meta_subset(result, "1_F", "res1", tplyr_adsl)
nrow(source_rows)
#> [1] 53
all(source_rows$SEX == "F")
#> [1] TRUE
unique(source_rows$TRT01P)
#> [1] "Placebo"
The number of rows returned matches the count displayed in the cell.
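The idea of evaluating stored filters against the original data can be sketched in base R. The example below is an assumption-laden illustration, not the package's implementation: adsl_toy is made-up data, and subset_by_filters() is a hypothetical helper that combines unevaluated calls with & and evaluates them inside the data frame.

```r
# Toy data standing in for an ADaM dataset.
adsl_toy <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4"),
  TRT01P = c("Placebo", "Placebo", "Active", "Placebo"),
  SEX = c("F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Filters held as unevaluated calls, as a cell's metadata might store them.
filters <- list(
  quote(TRT01P == "Placebo"),
  quote(SEX == "F")
)

# Hypothetical helper: AND the filters together and keep the matching rows.
# Assumes at least one filter is supplied.
subset_by_filters <- function(data, filters) {
  combined <- Reduce(function(a, b) call("&", a, b), filters)
  keep <- eval(combined, envir = data, enclos = parent.frame())
  data[keep & !is.na(keep), , drop = FALSE]
}

cell_rows <- subset_by_filters(adsl_toy, filters)
nrow(cell_rows)
#> [1] 2
```

Keeping the filters as calls rather than as a precomputed logical vector means the same expressions can be printed for review and re-applied to the data on demand.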
Metadata for Count Layers
Count cells are defined by the intersection of the column variable
level and the target variable level. When a by variable is
present, it adds an additional filter:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("DCDECOD", by = "EOSSTT")
  )
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
kable(head(result[, c("rowlabel1", "rowlabel2", "res1", "res2", "res3")]))

| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| COMPLETED | ADVERSE EVENT | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) |
| COMPLETED | COMPLETED | 58 (67.4%) | 27 (32.1%) | 25 (29.8%) |
| COMPLETED | DEATH | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) |
| COMPLETED | LACK OF EFFICACY | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) |
| COMPLETED | LOST TO FOLLOW-UP | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) |
| COMPLETED | PHYSICIAN DECISION | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) |
rid <- result$row_id[1]
meta <- tplyr_meta_result(result, rid, "res1")
meta
#> tplyr_meta [layer 1]
#> Names: TRT01P, EOSSTT, DCDECOD
#> Filters:
#> TRT01P == "Placebo"
#> EOSSTT == "COMPLETED"
#> DCDECOD == "ADVERSE EVENT"
Total Rows
When total_row = TRUE, the total row’s metadata omits
the target variable filter, leaving only the column variable and any
by-variables:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_count("SEX", settings = layer_settings(total_row = TRUE))
  )
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
total_row <- result[result$rowlabel1 == "Total", ]
meta <- tplyr_meta_result(result, total_row$row_id, "res1")
meta
#> tplyr_meta [layer 1]
#> Names: TRT01P, SEX
#> Filters:
#> TRT01P == "Placebo"
source_rows <- tplyr_meta_subset(result, total_row$row_id, "res1", tplyr_adsl)
nrow(source_rows)
#> [1] 86
Metadata for Descriptive Statistics Layers
Each row in a desc layer represents a statistic (n, mean, etc.), not a data category. All stat rows within the same column point to the same source data – the observations on which those statistics were computed:
spec <- tplyr_spec(
  cols = "TRT01P",
  layers = tplyr_layers(
    group_desc("AGE",
      settings = layer_settings(
        format_strings = list(
          "n" = f_str("xxx", "n"),
          "Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd")
        )
      )
    )
  )
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
n_meta <- tplyr_meta_result(result, "1_n", "res1")
n_meta
#> tplyr_meta [layer 1]
#> Names: TRT01P, AGE
#> Filters:
#> TRT01P == "Placebo"
The names field includes the target variable
(AGE). You can verify the statistics by subsetting and
computing them directly:
source_rows <- tplyr_meta_subset(result, "1_n", "res1", tplyr_adsl)
c(n = nrow(source_rows), mean = mean(source_rows$AGE), sd = sd(source_rows$AGE))
#>         n      mean        sd
#> 86.000000 75.209302  8.590167
Where Filters in Metadata
Both spec-level and layer-level where filters are
captured in the metadata, so you can always see the full filtering
chain:
spec <- tplyr_spec(
  cols = "TRT01P",
  where = SAFFL == "Y",
  layers = tplyr_layers(
    group_count("SEX")
  )
)
result <- tplyr_build(spec, tplyr_adsl, metadata = TRUE)
meta <- tplyr_meta_result(result, result$row_id[1], "res1")
meta
#> tplyr_meta [layer 1]
#> Names: TRT01P, SAFFL, SEX
#> Filters:
#> TRT01P == "Placebo"
#> SEX == "F"
#> SAFFL == "Y"
The filter list includes SAFFL == "Y" alongside the
column and target variable filters. The names field lists
every variable in the subsetting logic.
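One way to picture how a where clause could join the filter chain, and how the variable names might be collected, is the base R sketch below. It is an illustration under assumptions, not a description of tplyr2 internals; the object names are hypothetical, and the order of the collected names need not match the printed metadata.

```r
# Filters for one cell, held as unevaluated calls.
cell_filters <- list(
  quote(TRT01P == "Placebo"),
  quote(SEX == "F")
)

# A spec-level where clause, also held unevaluated.
where_filter <- quote(SAFFL == "Y")

# Append the where clause so the full filtering chain travels with the cell.
all_filters <- c(cell_filters, list(where_filter))

# all.vars() extracts the variable names referenced by each expression.
filter_vars <- unique(unlist(lapply(all_filters, all.vars)))
filter_vars
```

The resulting vector contains TRT01P, SEX, and SAFFL, which is the set of variables a reviewer would need in order to reproduce the subset.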
Anti-Join Metadata for Missing Subjects
Some tables include a row for subjects in the population data but not
in the analysis data. The metadata for these rows uses a special
anti_join field:
target <- data.frame(
  TRT = c("A", "A", "B"),
  USUBJID = c("S1", "S2", "S3"),
  VAL = c("X", "Y", "X")
)
pop <- data.frame(
  TRT = c("A", "A", "A", "B", "B"),
  USUBJID = c("S1", "S2", "S4", "S3", "S5")
)
spec <- tplyr_spec(
  cols = "TRT",
  pop_data = pop_data(cols = "TRT"),
  layers = tplyr_layers(
    group_count("VAL",
      settings = layer_settings(
        distinct_by = "USUBJID",
        missing_subjects = TRUE,
        missing_subjects_label = "Not in Target"
      )
    )
  )
)
result <- tplyr_build(spec, target, pop_data = pop, metadata = TRUE)
kable(result[, c("rowlabel1", "res1", "res2")])

| rowlabel1 | res1 | res2 |
|---|---|---|
| Not in Target | 1 (33.3%) | 1 (50.0%) |
| X | 1 (33.3%) | 1 (50.0%) |
| Y | 1 (33.3%) | 0 ( 0.0%) |
ms_row <- result[result$rowlabel1 == "Not in Target", ]
meta <- tplyr_meta_result(result, ms_row$row_id, "res1")
meta
#> tplyr_meta [layer 1]
#> Names: TRT, VAL
#> Filters:
#> TRT == "A"
#> Anti-join:
#> On: USUBJID
#> Pop filters:
#> TRT == "A"
The anti_join field contains join_meta (the filters for the population side) and on (the join key, typically "USUBJID"). When calling tplyr_meta_subset() on such a row, you must supply pop_data:
missing_a <- tplyr_meta_subset(result, ms_row$row_id, "res1",
  target, pop_data = pop)
missing_a
#>   TRT USUBJID
#> 1   A      S4
Subject S4 is in the population for TRT = "A" but absent
from the target, so the anti-join returns that one row. Omitting
pop_data produces a warning:
tplyr_meta_subset(result, ms_row$row_id, "res1", target)
#> Warning: pop_data is required for anti-join metadata but was not provided
#>   TRT USUBJID VAL
#> 1   A      S1   X
#> 2   A      S2   Y
Practical Applications
The most common use of metadata is cell verification during QC:
result <- tplyr_build(spec, data, metadata = TRUE)
source <- tplyr_meta_subset(result, row_id = "1_F", column = "res2", data = data)
nrow(source)
The metadata system also lends itself to Shiny applications where
clicking a cell triggers tplyr_meta_subset(), displaying
the source records in a detail panel:
observeEvent(input$table_cell_click, {
  click <- input$table_cell_click
  row_id <- result$row_id[click$row]
  col_name <- names(result)[click$col]
  source_data <- tplyr_meta_subset(result, row_id, col_name, original_data)
  output$detail_table <- renderTable(source_data)
})
Summary
| Function | Purpose |
|---|---|
| tplyr_build(..., metadata = TRUE) | Build with metadata enabled |
| generate_row_ids(result) | Create row identifiers |
| tplyr_meta_result(result, row_id, column) | Inspect filter expressions for a cell |
| tplyr_meta_subset(result, row_id, column, data) | Retrieve source data rows |
Every cell carries its own filter expressions – column variable, target variable, by-variable, where clauses, and anti-join logic for missing subjects. These expressions are stored at build time but only evaluated when you request a subset, keeping the metadata lightweight while providing exact reproducibility of every number in your table.
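The anti-join logic for missing subjects can also be pictured in base R. The sketch below reuses the small target and pop data frames from the missing-subjects example and reproduces the idea by hand; it is a simplified illustration under that assumption, not the package's own code.

```r
# Same toy data as the missing-subjects example.
target <- data.frame(
  TRT = c("A", "A", "B"),
  USUBJID = c("S1", "S2", "S3"),
  VAL = c("X", "Y", "X")
)
pop <- data.frame(
  TRT = c("A", "A", "A", "B", "B"),
  USUBJID = c("S1", "S2", "S4", "S3", "S5")
)

# Apply the cell filter (TRT == "A") to both sides, then keep the population
# rows whose join key never appears on the target side.
target_a <- target[target$TRT == "A", , drop = FALSE]
pop_a <- pop[pop$TRT == "A", , drop = FALSE]
missing_a <- pop_a[!(pop_a$USUBJID %in% target_a$USUBJID), , drop = FALSE]
missing_a$USUBJID
#> [1] "S4"
```

Filtering both sides before comparing keys is what keeps the result specific to one column of the table: S5 is also absent from the target, but it belongs to treatment B and so does not appear here.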