General String Formatting
Source:vignettes/general_string_formatting.Rmd
general_string_formatting.Rmd
A key focus of producing a clinical table is ensuring that the formatting of the table is in line with the statistician and clinician’s expectations. Organizations often have strict standards around this which vary between organizations. Much of this falls outside the scope of Tplyr, but Tplyr gives great focus to how the numeric results on the page are formatted. R has vast capabilities when it comes to HTML and interactive tables, but Tplyr’s focus on string formatting is designed for those traditional, PDF document printable pages. The aim to make it as simple as possible to get what you need to work with a typical monospace fonts.
Note: We’ve still focused on R’s interactive capabilities, so be
sure to check out vignette("metadata")
Format Strings
Regardless of what layer type you use within Tplyr, control of formatting is handled by using format strings. Consider the following example.
tplyr_table(tplyr_adsl, TRT01P) %>%
add_layer(
group_count(RACE) %>%
set_format_strings(
f_str("xx (xx.x%)", n, pct)
)
) %>%
add_layer(
group_desc(AGE) %>%
set_format_strings(
"Mean (SD)" = f_str("xx.x (xx.xx)", mean, sd)
)
) %>%
build() %>%
select(1:3)
#> # A tibble: 4 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 AMERICAN INDIAN OR ALASKA NATIVE " 0 ( 0.0%)" " 1 ( 1.2%)"
#> 2 BLACK OR AFRICAN AMERICAN " 8 ( 9.3%)" " 9 (10.7%)"
#> 3 WHITE "78 (90.7%)" "74 (88.1%)"
#> 4 Mean (SD) "75.2 ( 8.59)" "74.4 ( 7.89)"
For each layer type, when you want to configure string formatting you
use the function set_format_strings()
. Inside
set_format_strings()
you provide f_str()
objects. Within count layers, for basic tables you can just provide a
singe f_str()
object to control general formatting. In
descriptive statistic layers, you provide named parameters, and the
names will become the values of row_label1
for the
statistics provided within your f_str()
object. Regardless
of the layer type, the f_str()
object is what controls the
numbers reported in your resulting table.
The table below outlines the variables available within each layer.
Layer Type | Variables | Description |
---|---|---|
Count Layers | n | Non-distinct counts |
pct | Ratio of non-distinct counts to non-distinct total | |
total | Non-distinct total | |
distinct_n | Distinct counts (must use set_distinct_by() ) |
|
distinct_pct | Ratio of distinct counts to distinct total. If population data are set, distinct_total pulled from population data. | |
distinct_total | Distinct total (must use set_distinct_by() ). If
population data are set, distinct_total pulled from population
data. |
|
Shift layers | n | Ratio of non-distinct counts to non-distinct total |
pct | Ratio of non-distinct counts to non-distinct total | |
total | Non-distinct total | |
Descriptive Statistics Layers | n | N |
mean | Mean | |
sd | Standard Deviation | |
median | Median | |
var | Variance | |
min | Minimum | |
max | Maximum | |
iqr | Interquartile Range | |
q1 | Q1 | |
q3 | Q3 | |
missing | Missing (specifically NA counts) |
Note: For the actual equations used in descriptive statistics
layers, see vignettes("desc")
General Formatting
Looking back at the Tplyr table above, let’s look at
the count layer’s f_str()
call.
f_str("xx (xx.x%)", n, pct)
#> *** Format String ***
#> xx (xx.x%)
#> *** vars, extracted formats, and settings ***
#> n formated as: xx
#> integer length: 2
#> decimal length: 0
#> pct formated as: xx.x
#> integer length: 2
#> decimal length: 1
#> Total Format Size: 10
You can see from the print method of the f_str()
object
that we capture the “format string”, here xx (xx.x%)
, and
metadata surrounding it. This string details exactly where numbers
should be placed within the output result and what that result should
look like. This is done by breaking the string down into “format
groups”, which are the separate numeric fields and their surrounding
characters within the format string.
This format string, xx (xx.x%)
, has two different format
groups. The first field is xx
, which attaches to the
variable n
. The second is (xx.x%)
, which
attaches to the variable pct
. When the result is formatted,
the first format group, xx
, will be output with a total
width of 2 characters, and space for 2 integers. The second format group
will be output with a total width of 7 characters, with space for 2
integers and 1 decimal place. For the final result, these two fields are
concatenated together, and the total width of the string will
consistently be 10 characters. Note though that formatting will not
truncate integers, and instead the total width of the string will be
expanded, skewing alignment. Decimal points are always rounded off to
the specified precision.
Controlling Formatting
Note in the format string, the result numbers to be formatted fill the spaces of the x’s. Other characters in the string are preserved as is. Tplyr’s format strings have a few different valid characters to specifically control the numeric fields.
Character | Description |
---|---|
x | One character, preserve width |
X | One character, hug character to the left |
a | Auto precision, preserve width |
A | Auto precision, hug character to the left |
a+n | Auto precision + n, preserve width |
A+n | Auto precision + n, preserve character to the left |
Lowercase ‘x’
As detailed in the first example, when using a lower case ‘x’, the
exact width of space allotted by the x’s will be preserved. Note the
var1_Placebo
row below.
tplyr_table(tplyr_adsl, TRT01P) %>%
add_layer(
group_count(RACE) %>%
set_format_strings(
f_str("xx (xx.x%)", n, pct)
)
) %>%
build() %>%
select(1:3)
#> # A tibble: 3 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 AMERICAN INDIAN OR ALASKA NATIVE " 0 ( 0.0%)" " 1 ( 1.2%)"
#> 2 BLACK OR AFRICAN AMERICAN " 8 ( 9.3%)" " 9 (10.7%)"
#> 3 WHITE "78 (90.7%)" "74 (88.1%)"
Both the integer width for the n
counts and the space to
the right of the opening parenthesis of the pct
field are
preserved. This guarentees that (when using a monospace font) the
non-numeric characters within the format strings will remain in the same
place. Given that integers don’t truncate, if these spaces are
undesired, integers will automatically increase width. In the example
below, if the n
or pct
result exceeds 10, the
width of the output string automatically expands. You can trigger this
behaivor by using a single ‘x’ in the integer side of a format
group.
tplyr_table(tplyr_adsl, TRT01P) %>%
add_layer(
group_count(RACE) %>%
set_format_strings(
f_str("x (x.x%)", n, pct)
)
) %>%
build() %>%
select(1:3)
#> # A tibble: 3 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 AMERICAN INDIAN OR ALASKA NATIVE 0 (0.0%) 1 (1.2%)
#> 2 BLACK OR AFRICAN AMERICAN 8 (9.3%) 9 (10.7%)
#> 3 WHITE 78 (90.7%) 74 (88.1%)
Uppercase ‘X’
The downside of the last example is that alignment between format
groups is completely lost. The parenthesis in the pct
field
is now bound to the integer of the percent value, but the entire string
is shifted to the right. Tplyr offers customization
over this by using a concept called “parenthesis hugging”. This is
triggered by using an uppercase ‘X’ in the integer side of a format
group. Consider the following example:
tplyr_table(tplyr_adsl, TRT01P) %>%
add_layer(
group_count(RACE) %>%
set_format_strings(
f_str("xx (XX.x%)", n, pct)
)
) %>%
build() %>%
select(1:3)
#> # A tibble: 3 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 AMERICAN INDIAN OR ALASKA NATIVE " 0 (0.0%)" " 1 (1.2%)"
#> 2 BLACK OR AFRICAN AMERICAN " 8 (9.3%)" " 9 (10.7%)"
#> 3 WHITE "78 (90.7%)" "74 (88.1%)"
Now the total string width has been preserved properly. The change
here is that instead of pulling the right side of the format group in,
essentially aligning left on the format group, the parenthesis has been
moved to the right towards the integer side of the pct
field, “hugging” the numeric results of the percent.
There are a two rules when using ‘parenthesis hugging’:
- Capital letters should only be used on the integer side of a number
- A character must precede the capital letter, otherwise there’s no character to ‘hug’
Auto-precision
Lastly, Tplyr also has the capability to automatically determine some widths necessary to format strings. This is done by using the character ‘a’ instead of ‘x’ in the format string.
Consider the following example.
tplyr_table(tplyr_adlb, TRTA, where=PARAMCD %in% c("CA", "URATE")) %>%
add_layer(
group_desc(AVAL, by=vars(PARAMCD, AVISIT)) %>%
set_format_strings(
'Mean (SD)' = f_str('a.a (a.a+1)', mean, sd)
) %>%
set_precision_by(PARAMCD)
) %>%
build() %>%
select(1:5)
#> # A tibble: 6 × 5
#> row_label1 row_label2 row_label3 var1_Placebo var1_Xanomeline High Dos…¹
#> <chr> <chr> <chr> <chr> <chr>
#> 1 CA Week 12 Mean (SD) 2.19144 (0.074711) 2.18134 (0.062553)
#> 2 CA Week 24 Mean (SD) 2.16838 (0.046617) 2.18063 (0.099174)
#> 3 CA Week 8 Mean (SD) 2.19144 (0.102771) 2.23926 (0.199489)
#> 4 URATE Week 12 Mean (SD) 298.887 (105.3868) 304.835 ( 85.1772)
#> 5 URATE Week 24 Mean (SD) 269.643 (111.8899) 281.539 (126.6612)
#> 6 URATE Week 8 Mean (SD) 234.946 ( 45.8806) 273.608 ( 36.6659)
#> # ℹ abbreviated name: ¹`var1_Xanomeline High Dose`
Note that the decimal precision varies between different lab test
results. This feature is beneficial when decimal precision rules must be
based on the precision of the data as collected. For more information of
auto-precision for descriptive statistics layers, see
vignette("desc_layer_formatting")
.
For count layers, auto-precision can also be used surrounding the
n
counts. For example, the default format string for counts
layers in Tplyr is set as a (xxx.x%)
. This
will auto-format the n
result based on the maximum
summarized value of n
within the data. For example:
tplyr_table(tplyr_adsl, TRT01P) %>%
add_layer(
group_count(RACE) %>%
set_format_strings(f_str("a (xxx.x%)", n, pct))
) %>%
build() %>%
select(1:3)
#> # A tibble: 3 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 AMERICAN INDIAN OR ALASKA NATIVE " 0 ( 0.0%)" " 1 ( 1.2%)"
#> 2 BLACK OR AFRICAN AMERICAN " 8 ( 9.3%)" " 9 ( 10.7%)"
#> 3 WHITE "78 ( 90.7%)" "74 ( 88.1%)"
Given that the maximum count was >=10 and <100, the integer
width for n
was assigned as 2. Note that auto-precision for
percents will not auto-format based on available percentages. The count
filled by a
will be based on the n
result.
For both layer types, a capital A
follows the same logic
as X
, but is triggered using auto-precision. Take this
example of an adverse event table:
tplyr_table(tplyr_adae, TRTA) %>%
set_pop_data(tplyr_adsl) %>%
set_pop_treat_var(TRT01A) %>%
add_layer(
group_count(AEDECOD) %>%
set_format_strings(f_str("a (XX.x%) [A]", distinct_n, distinct_pct, n)) %>%
set_distinct_by(USUBJID)
) %>%
build() %>%
select(1:3)
#> # A tibble: 21 × 3
#> row_label1 var1_Placebo `var1_Xanomeline High Dose`
#> <chr> <chr> <chr>
#> 1 ACTINIC KERATOSIS " 0 (0.0%) [0]" " 1 (1.2%) [1]"
#> 2 ALOPECIA " 1 (1.2%) [1]" " 0 (0.0%) [0]"
#> 3 BLISTER " 0 (0.0%) [0]" " 1 (1.2%) [2]"
#> 4 COLD SWEAT " 1 (1.2%) [3]" " 0 (0.0%) [0]"
#> 5 DERMATITIS ATOPIC " 1 (1.2%) [1]" " 0 (0.0%) [0]"
#> 6 DERMATITIS CONTACT " 0 (0.0%) [0]" " 0 (0.0%) [0]"
#> 7 DRUG ERUPTION " 1 (1.2%) [1]" " 0 (0.0%) [0]"
#> 8 ERYTHEMA " 9 (10.5%) [13]" "14 (16.7%) [22]"
#> 9 HYPERHIDROSIS " 2 (2.3%) [2]" " 8 (9.5%) [10]"
#> 10 PRURITUS " 8 (9.3%) [11]" "26 (31.0%) [38]"
#> # ℹ 11 more rows
To go over each format group:
- The distinct n count has been spaced based on the width of the
maximum
n
value - The percent field will have a maximum width of 2 characters, but the opening parenthesis will hug the integer of the percent value to the right
- The event counts within brackets (
[]
) have been auto formatted based on the maximumn
value, and the opening bracket will hug then
count to the right.
This vignette has focused on the specifics of formatting using
f_str()
objects. For more details on layer specifics, check
out the individual layer vignettes.