Using datasetjson
datasetjson works by allowing you to take a data frame and apply the attributes required by the CDISC Dataset JSON specification. The goal is to make this experience simple. Before you can write a Dataset JSON file to disk, you first need to build the Dataset JSON object. An example call looks like this:
ds_json <- dataset_json(iris[1:5, ], "IG.IRIS", "IRIS", "Iris", iris_items)
This is the minimum information you must provide to create a datasetjson object.
The parameters here can be described as follows:
- The input data frame, here the first five rows of iris
- item_id, which can be described as the “Object of Dataset”: a key whose value is a unique identifier for the dataset, corresponding to ItemGroupDef/@OID in Define-XML
- name, which is the dataset name
- label, which is the dataset label
- items, which is the variable-level metadata for your dataset
The items parameter is special here, in that you provide a data frame with the necessary variable metadata. Take a look at the iris_items data frame.
iris_items
#> OID name label type length displayFormat
#> 1 IT.IR.Sepal.Length Sepal.Length Sepal Length float NA <NA>
#> 2 IT.IR.Sepal.Width Sepal.Width Sepal Width float NA <NA>
#> 3 IT.IR.Petal.Length Petal.Length Petal Length float NA <NA>
#> 4 IT.IR.Petal.Width Petal.Width Petal Width float NA <NA>
#> 5 IT.IR.Species Species Flower Species string 10 <NA>
#> keySequence
#> 1 2
#> 2 NA
#> 3 3
#> 4 NA
#> 5 1
This data frame has 7 columns, 4 of which are strictly required. This is defined by the CDISC Dataset JSON Specification.
Attribute | Requirement | Description |
---|---|---|
OID | Required | OID of a variable (must correspond to the variable OID in the Define-XML file) |
name | Required | Variable name |
label | Required | Variable description |
type | Required | Type of the variable. Allowed values: “string”, “integer”, “decimal”, “float”, “double”, “boolean”. See ODM types for details. |
length | Optional | Variable length |
displayFormat | Optional | Display format supports data visualization of numeric float and date values. |
keySequence | Optional | Indicates that this item is a key variable in the dataset structure. It also provides an ordering for the keys. |
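
To make the role of keySequence concrete, here is a minimal base R sketch of how the key order could be recovered from the items metadata. The key_info data frame is a hand-built stand-in for the name and keySequence columns of iris_items, and key_vars is a hypothetical name; none of this is part of the datasetjson API.

```r
# Hand-built stand-in for the name and keySequence columns of iris_items
key_info <- data.frame(
  name = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
  keySequence = c(2L, NA, 3L, NA, 1L),
  stringsAsFactors = FALSE
)

# Keep only the key variables (non-missing keySequence), ordered by sequence.
# na.last = NA tells order() to drop the NA entries entirely.
key_vars <- key_info$name[order(key_info$keySequence, na.last = NA)]
key_vars
```

With the values above, key_vars holds Species, Sepal.Length, and Petal.Length, in that order.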
The data within the items data frame ultimately populates the items element of the Dataset JSON file. The OID, name, label, and type columns are all required and must be populated for each variable. Note that the type column has a list of allowable values:
- string
- integer
- float
- double
- decimal
- boolean
This information must be provided directly by the user. Note that the datasetjson package performs no type conversions of your data. The displayFormat column refers to display formats as used within SAS.
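
Because this metadata is supplied by hand, it can help to check it before building the object. The following is a sketch in base R only: example_items is an illustrative two-row metadata data frame, and check_items() is a hypothetical helper written for this example, not a function exported by datasetjson.

```r
# Allowable values for the type column and the required columns, as listed above
allowed_types <- c("string", "integer", "float", "double", "decimal", "boolean")
required_cols <- c("OID", "name", "label", "type")

# Hand-built variable metadata for two iris columns (illustrative values)
example_items <- data.frame(
  OID = c("IT.IR.Sepal.Length", "IT.IR.Species"),
  name = c("Sepal.Length", "Species"),
  label = c("Sepal Length", "Flower Species"),
  type = c("float", "string"),
  length = c(NA_integer_, 10L),
  displayFormat = c(NA_character_, NA_character_),
  keySequence = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a character vector of problems (empty if none)
check_items <- function(items) {
  problems <- character(0)

  # Every required column must be present
  missing_cols <- setdiff(required_cols, names(items))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }

  # Required columns that are present must be populated for each variable
  for (col in intersect(required_cols, names(items))) {
    if (anyNA(items[[col]])) {
      problems <- c(problems, paste("missing value in column:", col))
    }
  }

  # Non-missing type values must come from the allowable list
  if ("type" %in% names(items)) {
    types <- items$type[!is.na(items$type)]
    bad <- setdiff(types, allowed_types)
    if (length(bad) > 0) {
      problems <- c(problems, paste("type not allowed:", bad))
    }
  }

  problems
}

check_items(example_items)
```

An empty result means the required columns are present and populated and every type is one of the allowable values. The check says nothing about whether the declared types match the actual column classes of your data.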
Setting Other Data Attributes
The Dataset JSON specification defines a number of other attributes that go beyond those normally present on an R data frame. These can be applied directly to the Dataset JSON object using a set of setter functions.
ds_updated <- ds_json |>
  set_data_type("referenceData") |>
  set_file_oid("/some/path") |>
  set_metadata_ref("some/define.xml") |>
  set_metadata_version("MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7") |>
  set_originator("Some Org") |>
  set_source_system("source system", "1.0") |>
  set_study_oid("SOMESTUDY")
In a practical setting, applying these attributes each time a Dataset JSON file is created would be tedious, and would present a challenge if the fields change, because the text would have to be updated in each program individually. For this reason, the datasetjson package allows you to use pre-built objects to create a datasetjson object.
file_meta <- file_metadata(
  originator = "Some Org",
  sys = "source system",
  sys_version = "1.0"
)
data_meta <- data_metadata(
  study = "SOMESTUDY",
  metadata_version = "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  metadata_ref = "some/define.xml"
)
dataset_meta <- dataset_metadata(
  item_id = "IG.IRIS",
  name = "IRIS",
  label = "Iris",
  items = iris_items
)
ds_json_from_meta <- dataset_json(
  iris,
  dataset_meta = dataset_meta,
  file_meta = file_meta,
  data_meta = data_meta
)
Or, more practically, only file_meta and data_meta could be pre-built, with the dataset-level metadata passed directly to dataset_json().
file_meta <- file_metadata(
  originator = "Some Org",
  sys = "source system",
  sys_version = "1.0"
)
data_meta <- data_metadata(
  study = "SOMESTUDY",
  metadata_version = "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  metadata_ref = "some/define.xml"
)
ds_json_from_meta <- dataset_json(
  iris,
  item_id = "IG.IRIS",
  name = "IRIS",
  label = "Iris",
  items = iris_items,
  file_meta = file_meta,
  data_meta = data_meta
)
Writing and Reading
The datasetjson object allows you to collect the information needed to generate a Dataset JSON file, but to write the dataset out you need to use the write_dataset_json() function. Once the Dataset JSON object is available, all you need is that object and a file path.
write_dataset_json(ds_updated, file="iris.json")
If no file path is provided, write_dataset_json() instead returns the JSON output as a character string.
js <- write_dataset_json(ds_updated, pretty=TRUE)
cat(js)
#> {
#> "creationDateTime": "2024-01-09T20:04:00",
#> "datasetJSONVersion": "1.0.0",
#> "fileOID": "/some/path",
#> "originator": "Some Org",
#> "sourceSystem": "source system",
#> "sourceSystemVersion": "1.0",
#> "referenceData": {
#> "studyOID": "SOMESTUDY",
#> "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
#> "metaDataRef": "some/define.xml",
#> "itemGroupData": {
#> "IG.IRIS": {
#> "records": 5,
#> "name": "IRIS",
#> "label": "Iris",
#> "items": [
#> {
#> "OID": "ITEMGROUPDATASEQ",
#> "name": "ITEMGROUPDATASEQ",
#> "label": "Record Identifier",
#> "type": "integer"
#> },
#> {
#> "OID": "IT.IR.Sepal.Length",
#> "name": "Sepal.Length",
#> "label": "Sepal Length",
#> "type": "float",
#> "keySequence": 2
#> },
#> {
#> "OID": "IT.IR.Sepal.Width",
#> "name": "Sepal.Width",
#> "label": "Sepal Width",
#> "type": "float"
#> },
#> {
#> "OID": "IT.IR.Petal.Length",
#> "name": "Petal.Length",
#> "label": "Petal Length",
#> "type": "float",
#> "keySequence": 3
#> },
#> {
#> "OID": "IT.IR.Petal.Width",
#> "name": "Petal.Width",
#> "label": "Petal Width",
#> "type": "float"
#> },
#> {
#> "OID": "IT.IR.Species",
#> "name": "Species",
#> "label": "Flower Species",
#> "type": "string",
#> "length": 10,
#> "keySequence": 1
#> }
#> ],
#> "itemData": [
#> [1, 5.1, 3.5, 1.4, 0.2, "setosa"],
#> [2, 4.9, 3, 1.4, 0.2, "setosa"],
#> [3, 4.7, 3.2, 1.3, 0.2, "setosa"],
#> [4, 4.6, 3.1, 1.5, 0.2, "setosa"],
#> [5, 5, 3.6, 1.4, 0.2, "setosa"]
#> ]
#> }
#> }
#> }
#> }
Similarly, to read a Dataset JSON file, you can use the function read_dataset_json(). This function returns a data frame, ready to use. To read, provide a file path.
read_dataset_json("path/to/file")
You can also provide a single-element character vector containing JSON text that has already been read in.
dat <- read_dataset_json(js)
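
If the JSON text comes from somewhere other than write_dataset_json(), it may arrive as one element per line rather than as a single string. The base R sketch below shows one way to collapse it; the temporary file and its contents are placeholders for illustration, not a complete Dataset JSON file, and json_lines and json_text are hypothetical names.

```r
# Write a tiny JSON snippet to a temporary file (placeholder content only)
tmp <- tempfile(fileext = ".json")
writeLines(c("{", '  "originator": "Some Org"', "}"), tmp)

# readLines() returns one element per line of the file
json_lines <- readLines(tmp)
length(json_lines)

# Collapse the lines into a single-element character vector
json_text <- paste(json_lines, collapse = "\n")
length(json_text)
```

A string built this way from a real Dataset JSON file has the single-element shape described above.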
The data frame that’s read in carries a number of attributes. For example, opening the data frame within the RStudio IDE will present the variable labels. All data available within the Dataset JSON file is ultimately attached to the imported data frame.
attributes(dat)
#> $names
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3 4 5
#>
#> $creationDateTime
#> [1] "2024-01-09T20:04:00"
#>
#> $datasetJSONVersion
#> [1] "1.0.0"
#>
#> $fileOID
#> [1] "/some/path"
#>
#> $originator
#> [1] "Some Org"
#>
#> $sourceSystem
#> [1] "source system"
#>
#> $sourceSystemVersion
#> [1] "1.0"
#>
#> $name
#> [1] "IRIS"
#>
#> $records
#> [1] 5
#>
#> $label
#> [1] "Iris"
#>
#> $referenceData
#> $referenceData$studyOID
#> [1] "SOMESTUDY"
#>
#> $referenceData$metaDataVersionOID
#> [1] "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7"
#>
#> $referenceData$metaDataRef
#> [1] "some/define.xml"
#>
#> $referenceData$itemGroupData
#> [1] "IG.IRIS"
For variable level metadata, the attributes are applied directly to the columns.
attributes(dat$Species)
#> $label
#> [1] "Flower Species"
#>
#> $OID
#> [1] "IT.IR.Species"
#>
#> $length
#> [1] 10
#>
#> $type
#> [1] "string"
#>
#> $keySequence
#> [1] 1
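
Since the metadata lives in ordinary R attributes, it can be pulled back out with base R tools. The sketch below uses dat_example, a small data frame with attributes set by hand to mimic the structure shown above. It is an assumption-laden stand-in, not actual output of read_dataset_json().

```r
# Stand-in for an imported data frame; the attributes are set by hand here
dat_example <- data.frame(
  Sepal.Length = c(5.1, 4.9),
  Species = c("setosa", "setosa"),
  stringsAsFactors = FALSE
)
attr(dat_example$Sepal.Length, "label") <- "Sepal Length"
attr(dat_example$Sepal.Length, "type") <- "float"
attr(dat_example$Species, "label") <- "Flower Species"
attr(dat_example$Species, "type") <- "string"
attr(dat_example, "originator") <- "Some Org"

# Collect one variable-level attribute across all columns
var_labels <- vapply(dat_example, function(col) attr(col, "label"), character(1))
var_labels

# Retrieve a single dataset-level attribute by its exact name
attr(dat_example, "originator", exact = TRUE)
```

The same pattern applies to any of the other column attributes, such as OID or type, by changing the attribute name passed to attr().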