Spec Serialization
serialization.RmdIntroduction
One of the central design principles of tplyr2 is the separation of
specification from data. A tplyr_spec() object describes
what a table should contain – the column structure, the layers,
the formatting rules – without touching any actual dataset. Data is only
supplied at build time via tplyr_build(spec, data).
This separation creates a natural opportunity: if the spec is pure configuration, it can be saved to a file and loaded back later. tplyr2 supports serializing specs to both JSON and YAML formats. This opens up several practical workflows:
- Version control: Specs stored as JSON or YAML are plain text, so they integrate naturally with Git. You can track changes to your table definitions alongside your analysis code.
- Portability: A spec file can be shared between team members or across studies. The recipient loads it and builds against their own data.
- Reproducibility: Archiving a spec alongside its output creates a clear record of exactly what configuration produced a given table.
- Regulatory submissions: A machine-readable table definition can serve as supporting documentation in a submission package.
Writing a Spec to JSON
The tplyr_write_spec() function takes a spec object and
a file path. The format is determined by the file extension: use
.json for JSON output.
spec <- tplyr_spec(
cols = "TRT01P",
where = SAFFL == "Y",
layers = tplyr_layers(
group_count("SEX", by = "Sex n (%)"),
group_desc(
"AGE",
by = "Age (Years)",
settings = layer_settings(
format_strings = list(
"n" = f_str("xxx", "n"),
"Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
"Median" = f_str("xx.x", "median"),
"Min, Max" = f_str("xx, xx", "min", "max")
)
)
)
)
)
json_path <- tempfile(fileext = ".json")
tplyr_write_spec(spec, json_path)The file is now a plain-text JSON document. Let’s verify that we can read it back and build the same table.
Reading a Spec Back
The tplyr_read_spec() function reads a spec from a JSON
or YAML file and reconstructs the full tplyr_spec object,
including all expressions, format strings, and layer configurations.
loaded_spec <- tplyr_read_spec(json_path)
loaded_spec
#> tplyr2 table specification
#> Column variables: TRT01PWhere: SAFFL == "Y"Layers: 2[1] count: SEX (Layer 1)[2] desc: AGE (Layer 2)The loaded spec is functionally identical to the original. You can build it against any dataset that has the required columns.
result <- tplyr_build(loaded_spec, tplyr_adsl)
kable(result[, !grepl("^ord", names(result))])| rowlabel1 | rowlabel2 | res1 | res2 | res3 |
|---|---|---|---|---|
| Sex n (%) | F | 53 (61.6%) | 40 (47.6%) | 50 (59.5%) |
| Sex n (%) | M | 33 (38.4%) | 44 (52.4%) | 34 (40.5%) |
| Age (Years) | n | 86 | 84 | 84 |
| Age (Years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Age (Years) | Median | 76.0 | 76.0 | 77.5 |
| Age (Years) | Min, Max | 52, 89 | 56, 88 | 51, 88 |
This is the same output you would get from building the original spec directly. The round-trip – write to disk, read back, build – preserves everything.
YAML Format
If you prefer YAML over JSON, simply use a .yaml or
.yml file extension. The API is identical.
yaml_path <- tempfile(fileext = ".yaml")
tplyr_write_spec(spec, yaml_path)
yaml_spec <- tplyr_read_spec(yaml_path)
yaml_result <- tplyr_build(yaml_spec, tplyr_adsl)
# Confirm the results match
identical(
result[, !grepl("^ord", names(result))],
yaml_result[, !grepl("^ord", names(yaml_result))]
)
#> [1] TRUEYAML tends to be more readable for human review, while JSON is more widely supported by automated tools. The choice between them is a matter of preference; tplyr2 handles both transparently.
What Gets Serialized
A tplyr_spec can contain several types of R objects that do not have direct equivalents in JSON or YAML. The serialization system handles each one with a specific convention.
Expressions
Filter expressions like where = SAFFL == "Y" are R
language objects. They are serialized by deparsing the expression to a
string and wrapping it in a marker object.
In JSON, a where clause looks like this:
On deserialization, the string is parsed back into an R expression
using rlang::parse_expr(). This preserves the original
filter logic exactly.
Format Strings
f_str() objects are stored as their component parts: the
format template, the variable names, and the optional empty
parameter. A marker class field (_class: "tplyr_f_str")
identifies them for reconstruction.
When read back, the f_str() constructor is called with
these components, re-parsing the format string and rebuilding the
internal structure.
Labels in by
When you use label() in a by parameter to
create an explicit text label, the label is serialized with a type
marker so it can be distinguished from a data variable name.
Regular data variable names (character strings) pass through as-is.
Functions
Analyze layers created with group_analyze() include a
user-defined function. These are serialized by deparsing the function
body to a string.
On deserialization, the string is parsed and evaluated to recreate the function object. Note that functions which depend on objects in a specific environment (closures that capture external variables) may not survive the round-trip. For best results, write analyze functions that are self-contained.
Examining the Serialized Output
Let’s look at the actual JSON content produced by our earlier spec to see these conventions in practice.
json_content <- readLines(json_path)
cat(json_content, sep = "\n")
#> {
#> "cols": "TRT01P",
#> "where": {
#> "_expr": "SAFFL == \"Y\""
#> },
#> "layers": [
#> {
#> "target_var": "SEX",
#> "by": "Sex n (%)",
#> "where": null,
#> "layer_type": "count",
#> "settings": {
#> "total_row": false,
#> "total_row_label": "Total",
#> "total_row_count_missings": true,
#> "missing_subjects": false,
#> "missing_subjects_label": "Missing",
#> "stats_as_columns": false
#> }
#> },
#> {
#> "target_var": "AGE",
#> "by": "Age (Years)",
#> "where": null,
#> "layer_type": "desc",
#> "settings": {
#> "format_strings": {
#> "n": {
#> "format_string": "xxx",
#> "vars": "n",
#> "_class": "tplyr_f_str"
#> },
#> "Mean (SD)": {
#> "format_string": "xx.x (xx.xx)",
#> "vars": ["mean", "sd"],
#> "_class": "tplyr_f_str"
#> },
#> "Median": {
#> "format_string": "xx.x",
#> "vars": "median",
#> "_class": "tplyr_f_str"
#> },
#> "Min, Max": {
#> "format_string": "xx, xx",
#> "vars": ["min", "max"],
#> "_class": "tplyr_f_str"
#> }
#> },
#> "total_row": false,
#> "total_row_label": "Total",
#> "total_row_count_missings": true,
#> "missing_subjects": false,
#> "missing_subjects_label": "Missing",
#> "stats_as_columns": false
#> }
#> }
#> ]
#> }You can see the structure: the top-level cols and
where fields, followed by the layers array.
Each layer carries its target_var, by,
where, layer_type, and settings.
Within the settings, format_strings contains the serialized
f_str objects.
A More Complex Example
Let’s serialize a spec that exercises more of the system: nested
counts with distinct counting, a where clause, and a total
row.
complex_spec <- tplyr_spec(
cols = "TRT01P",
where = SAFFL == "Y",
total_groups = list(total_group("TRT01P", label = "Total")),
layers = tplyr_layers(
group_count(
"RACE",
by = "Race n (%)",
settings = layer_settings(
total_row = TRUE,
total_row_label = "Total"
)
),
group_desc(
c("AGE", "WEIGHTBL"),
by = "Baseline Measurements",
settings = layer_settings(
format_strings = list(
"n" = f_str("xxx", "n"),
"Mean (SD)" = f_str("xx.x (xx.xx)", "mean", "sd"),
"Min, Max" = f_str("xx.x, xx.x", "min", "max")
)
)
)
)
)
complex_path <- tempfile(fileext = ".json")
tplyr_write_spec(complex_spec, complex_path)Now read it back and build.
reloaded <- tplyr_read_spec(complex_path)
complex_result <- tplyr_build(reloaded, tplyr_adsl)
kable(complex_result[, !grepl("^ord", names(complex_result))])| rowlabel1 | rowlabel2 | rowlabel3 | res1 | res2 | res3 | res4 |
|---|---|---|---|---|---|---|
| Race n (%) | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 1 ( 0.4%) | 1 ( 1.2%) | 0 ( 0.0%) | |
| Race n (%) | BLACK OR AFRICAN AMERICAN | 8 ( 9.3%) | 23 ( 9.1%) | 9 (10.7%) | 6 ( 7.1%) | |
| Race n (%) | Total | 86 (100.0%) | 254 (100.0%) | 84 (100.0%) | 84 (100.0%) | |
| Race n (%) | WHITE | 78 (90.7%) | 230 (90.6%) | 74 (88.1%) | 78 (92.9%) | |
| Baseline Measurements | AGE | n | 86 | 254 | 84 | 84 |
| Baseline Measurements | AGE | Mean (SD) | 75.2 ( 8.59) | 75.1 ( 8.25) | 74.4 ( 7.89) | 75.7 ( 8.29) |
| Baseline Measurements | AGE | Min, Max | 52.0, 89.0 | 51.0, 89.0 | 56.0, 88.0 | 51.0, 88.0 |
| Baseline Measurements | WEIGHTBL | n | 86 | 253 | 84 | 83 |
| Baseline Measurements | WEIGHTBL | Mean (SD) | 62.8 (12.77) | 66.6 (14.13) | 70.0 (14.65) | 67.3 (14.12) |
| Baseline Measurements | WEIGHTBL | Min, Max | 34.0, 86.2 | 34.0, 108.0 | 41.7, 108.0 | 45.4, 106.1 |
The total group, total row, multi-target descriptive layer, and all formatting survive the round-trip.
Use Cases
Version-Controlled Table Definitions
Storing specs as JSON or YAML files in your project repository means
that every change to a table definition is tracked. If a reviewer asks
“what changed between the draft and final version of Table 14.1?”, you
can answer that question with a git diff of the spec
file.
# In your analysis script
spec <- tplyr_read_spec("specs/table_14_1.json")
result <- tplyr_build(spec, adsl)The spec file is a plain-text artifact that reviewers can inspect without running R.
Sharing Across Teams
A statistician can define the table structure and save the spec. A programmer on a different system can load it and build against the study data. The spec travels as a lightweight file rather than an R object that requires a specific environment.
Applying a Spec to Different Data
Because specs carry no data, the same spec can be applied to datasets from different studies, time points, or populations. This is useful for standardized tables that appear across multiple studies.
# Same spec, different data subsets
saffl_result <- tplyr_build(loaded_spec, tplyr_adsl[tplyr_adsl$SAFFL == "Y", ])
ittfl_result <- tplyr_build(loaded_spec, tplyr_adsl[tplyr_adsl$ITTFL == "Y", ])