How-To in R

datasetjson - Read and write CDISC Dataset JSON formatted datasets in R and Python

R/Pharma 2025 Workshop

2025-11-07

The Whole Game

Let’s walk through the complete workflow with real ADaM data:

  1. Start with ADaM datasets (RDS/Parquet + metadata)
  2. Combine data with metadata
  3. Convert to Dataset-JSON
  4. Share the standardized file
  5. Read it back anywhere

Step 1: Load ADaM Data

library(datasetjson)
library(arrow)  # or readRDS() - format doesn't matter!

# Load data (RDS or Parquet - your choice)
adsl <- read_parquet("../../data/adam/adsl.parquet")
adae <- read_parquet("../../data/adam/adae.parquet")

# Load metadata
adsl_meta <- read_parquet("../../data/adam/metadata/adsl_meta.parquet")
adae_meta <- read_parquet("../../data/adam/metadata/adae_meta.parquet")

Step 2: Examine What We Have

# Real clinical trial data with labels!
head(adsl[1:5])  # 306 subjects, 54 variables
attr(adsl$USUBJID, "label")  # "Unique Subject Identifier"

# Metadata follows Dataset-JSON spec
head(adsl_meta)
#   dataType length itemOID name    label
#   string   12     STUDYID STUDYID "Study Identifier"
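
When a dataset arrives without a separate metadata file, a table of the same shape can be derived from the label attributes already on the columns. The following is a minimal sketch in base R, not part of datasetjson: `build_columns_meta()` is a hypothetical helper, the `IT.<dataset>.<variable>` itemOID pattern and the type mapping are simplifying assumptions, and the toy data frame only stands in for ADSL.

```r
# Hypothetical helper (not part of datasetjson): derive a column-metadata
# data frame from the "label" attributes of a data frame.
build_columns_meta <- function(df, dataset_name) {
  # Assumed, simplified mapping from R types to Dataset-JSON dataType values
  type_of <- function(x) {
    if (is.character(x) || is.factor(x)) {
      "string"
    } else if (is.integer(x)) {
      "integer"
    } else if (is.numeric(x)) {
      "float"
    } else if (is.logical(x)) {
      "boolean"
    } else {
      "string"
    }
  }
  # Fall back to the variable name when a column carries no label
  label_of <- function(x, nm) {
    lbl <- attr(x, "label", exact = TRUE)
    if (is.null(lbl)) nm else as.character(lbl)
  }
  nms <- names(df)
  data.frame(
    itemOID  = paste("IT", dataset_name, nms, sep = "."),  # assumed OID pattern
    name     = nms,
    label    = unname(mapply(label_of, df, nms)),
    dataType = unname(vapply(df, type_of, character(1))),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for ADSL with two labelled variables
toy <- data.frame(
  USUBJID = c("01-701-1015", "01-701-1023"),
  AGE     = c(63L, 64L),
  stringsAsFactors = FALSE
)
attr(toy$USUBJID, "label") <- "Unique Subject Identifier"
attr(toy$AGE, "label") <- "Age"

toy_meta <- build_columns_meta(toy, "ADSL")
print(toy_meta)
```

A table built this way would still need review against the study's define.xml before use, since lengths, display formats, and key sequences are not recoverable from labels alone.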

Step 3: Create Dataset-JSON

# Convert ADSL to Dataset-JSON
adsl_json <- dataset_json(
  adsl,
  name = "ADSL", 
  dataset_label = "Subject Level Analysis Dataset",
  columns = adsl_meta
)

# Same for ADAE
adae_json <- dataset_json(
  adae,
  name = "ADAE",
  dataset_label = "Adverse Events Analysis Dataset", 
  columns = adae_meta
)

Step 4: Share Standardized Files

# Write to Dataset-JSON files (create the output directory if it is missing)
dir.create("output", showWarnings = FALSE)
write_dataset_json(adsl_json, "output/ADSL.json", pretty = TRUE)
write_dataset_json(adae_json, "output/ADAE.json", pretty = TRUE)

# Now anyone can read these files!
# - Regulatory reviewers
# - External collaborators  
# - Different software (R, Python, SAS)

Step 5: Read Back Anywhere

# Read back - gets original data + metadata
adsl_restored <- read_dataset_json("output/ADSL.json")

# All metadata preserved!
attr(adsl_restored$USUBJID, "label")
# "Unique Subject Identifier"

# Compare the restored data against the original (either tool works)
diffdf::diffdf(adsl, adsl_restored)
waldo::compare(adsl, adsl_restored)
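
For a quick check that focuses on variable labels only, the comparison can also be sketched in base R. `get_labels()` and `label_mismatches()` are hypothetical helpers written for this illustration, and the two toy data frames stand in for `adsl` and `adsl_restored`; one label is removed on purpose to show what a mismatch looks like.

```r
# Hypothetical helper: collect each column's "label" attribute (NA when absent)
get_labels <- function(df) {
  vapply(df, function(x) {
    lbl <- attr(x, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else as.character(lbl)
  }, character(1))
}

# Hypothetical helper: names of shared columns whose labels differ
label_mismatches <- function(original, restored) {
  common <- intersect(names(original), names(restored))
  a <- get_labels(original[common])
  b <- get_labels(restored[common])
  # A mismatch is a label present on one side only, or two different labels
  differs <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
  common[differs]
}

# Toy stand-ins for adsl and adsl_restored
original <- data.frame(USUBJID = "01-701-1015", AGE = 63L,
                       stringsAsFactors = FALSE)
attr(original$USUBJID, "label") <- "Unique Subject Identifier"
attr(original$AGE, "label") <- "Age"

restored <- original
attr(restored$AGE, "label") <- NULL  # simulate a label lost on the way back

mismatched <- label_mismatches(original, restored)
print(mismatched)
```

An empty result means every shared column carries the same label on both sides; it says nothing about the data values, which is what diffdf and waldo cover.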

Why This Matters

  • Format agnostic: Start with RDS, Parquet, CSV - doesn’t matter
  • Metadata preserved: Labels, types, all CDISC information
  • Standardized: One format for sharing across tools/organizations
  • Validated: Built-in schema validation
  • Round-trip safe: Near-perfect data fidelity

Your Turn: Exercises

Time to practice! Open exercises/01-r.R and work through:

  1. Basic Operations - Read and write Dataset-JSON files
  2. Data Exploration - Examine Dataset-JSON structure
  3. Data Processing - Transform and export data
  4. Error Handling - Handle common issues
  5. Advanced Usage - Work with metadata and attributes

Exercise Preview

# Exercise 1: Basic read/write operations
# Exercise 2: Explore Dataset-JSON structure
# Exercise 3: Data transformation pipeline
# Exercise 4: Error handling scenarios
# Exercise 5: Advanced metadata features

# Let's get started!

Questions?

Ready to dive into the exercises?

Open exercises/01-r.R and let’s explore the datasetjson package together!