Why Change / Why Dataset-JSON / Why not Parquet?

datasetjson - Read and write CDISC Dataset JSON formatted datasets in R and Python

R/Pharma 2025 Workshop

2025-11-07

Why Change?

Why Change from SAS V5 XPORT (XPT)?

  • Game Boy. Sega Genesis. The Casio F-91W. And SAS XPT.
  • All were introduced in 1989.
  • In 1999, the FDA standardized the submission of clinical data using SAS XPT.
  • In 2012, FDA organized a Public Meeting on Study Data Exchange and later conducted the Dataset-XML pilot.
  • From a programming perspective, we now work in a multilingual world.
  • And now it may be time for SAS XPT to meet its end.

SAS XPT Limitations (Part 1 of 2)

  1. Data File Format
    • Limited variable types
    • Limited to US ASCII encoding
    • 8-character variable names
    • 40-character labels
    • 200-character field widths
  2. Storage
    • Inefficient use of storage space
    • The inability to compress datasets leads to logistical issues with files (e.g., having to split large datasets)

SAS XPT Limitations (Part 2 of 2)

  1. Content
    • Lacks a robust metadata layer
    • Only works for 2-dimensional data structures
  2. Extensibility
    • Not extensible

What is Dataset-JSON?

  • A CDISC data exchange format for tabular datasets (a small example is sketched after this list)
  • Designed to support a broad range of data exchange scenarios
  • Supports API and file-based data exchange
  • JSON/NDJSON is simple to implement, very stable, and widely supported
  • Extensible to support new metadata and new use cases
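
To make the bullets above concrete, the sketch below writes a tiny two-row demographics table as a Dataset-JSON file from R using jsonlite. The attribute names (datasetJSONVersion, itemGroupOID, columns, rows, and so on) follow my reading of the Dataset-JSON v1.1 flattened layout and are illustrative only; treat the CDISC specification and the datasetjson packages as the authoritative source for the exact structure.

    # Illustrative sketch only: attribute names approximate the
    # Dataset-JSON v1.1 layout and are not normative.
    library(jsonlite)

    dm <- data.frame(
      USUBJID = c("CDISC01-001", "CDISC01-002"),
      AGE     = c(34L, 41L),
      SEX     = c("M", "F"),
      stringsAsFactors = FALSE
    )

    ds_json <- list(
      datasetJSONVersion = "1.1.0",
      itemGroupOID       = "IG.DM",
      name               = "DM",
      label              = "Demographics",
      records            = nrow(dm),
      columns = list(
        list(itemOID = "IT.DM.USUBJID", name = "USUBJID",
             label = "Unique Subject Identifier", dataType = "string"),
        list(itemOID = "IT.DM.AGE", name = "AGE",
             label = "Age", dataType = "integer"),
        list(itemOID = "IT.DM.SEX", name = "SEX",
             label = "Sex", dataType = "string")
      ),
      # Each row is an array of values, in column order
      rows = lapply(seq_len(nrow(dm)), function(i) unname(as.list(dm[i, ])))
    )

    write_json(ds_json, "dm.json", auto_unbox = TRUE, pretty = TRUE)

Because the payload is plain JSON, it can be read back with any JSON parser in any language.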

Why Dataset-JSON?

Why JSON?

  • JSON is the most widely used data interchange standard globally
  • JSON is a lightweight text-based interchange format
  • JSON is very simple and widely supported
  • JSON is easy to read and write
  • JSON targets data exchange
  • JSON is a language independent standard

Dataset-JSON as an Alternative Transport Format for Regulatory Submissions Pilot

  • The pilot was a collaboration between CDISC, PHUSE, and the FDA
  • The pilot final readout occurred in June 2024
  • The pilot demonstrated that Dataset-JSON can transport data without disruption to business
  • Pilot findings resulted in: (1) standards updates, (2) User’s Guide content, and (3) tool updates and enhancements
  • FDA testing included both internal testing and external testing, including test submissions through the test Electronic Submissions Gateway (ESG)

Why not Parquet?

JSON Definition

From the JSON website: JSON is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers. These properties make JSON an ideal data-interchange language.

Parquet Definition

From the Parquet website: Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

What is the Purpose of Dataset-JSON?

  1. Addresses the data exchange use case
  2. API and File-based data exchange
  3. Optimized for breadth of support and ease of implementation for a broad range of data exchange scenarios
  4. Aligned with other data exchange standards such as HL7 FHIR and ODM v2.0

What is the Purpose of Parquet?

  1. Addresses the storage and analytical processing use case
  2. File-based data exchange
  3. Optimized for storage and analytical data processing performance (see the side-by-side sketch after this list)
  4. Aligned with sponsor statistical computing environment use
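
To make the contrast concrete, here is a small sketch (assuming the jsonlite and arrow packages are installed) that writes the same data frame twice: once as generic NDJSON, a line-delimited, human-readable text format suited to streaming and API exchange, and once as Parquet, a compressed columnar binary suited to storage and analytical scans. The lab values are simulated purely for illustration.

    library(jsonlite)
    library(arrow)

    # Simulated lab results, purely for illustration
    lb <- data.frame(
      USUBJID  = rep(sprintf("CDISC01-%03d", 1:100), each = 10),
      LBTESTCD = rep(c("ALB", "ALT", "AST", "BILI", "CREAT",
                       "GLUC", "HGB", "K", "SODIUM", "WBC"), times = 100),
      LBSTRESN = round(runif(1000, 1, 200), 1)
    )

    # NDJSON: one JSON object per line, easy to read, diff, and stream
    con <- file("lb.ndjson", open = "w")
    stream_out(lb, con)
    close(con)

    # Parquet: columnar and compressed, optimized for analytical reads
    write_parquet(lb, "lb.parquet")

    # Compare the resulting file sizes on disk
    file.size(c("lb.ndjson", "lb.parquet"))

Neither format is better in the abstract; each is optimized for the use cases listed above.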

Dataset-JSON Assumptions (Part 1 of 2)

  1. Data exchange scenarios include datasets generated by applications such as EDC, ePRO, labs, and other data sources
  2. APIs will provide the most common means to exchange data (a minimal sketch follows this list)
  3. Aligns with:
    • ODM v2.0 and define.json, and works with Define-XML
    • DDF USDM, ARS, CORE, CDISC Library, OAK, and other CDISC projects
    • Healthcare data exchange standards such as HL7 FHIR
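
As a minimal sketch of API-based exchange (assuming the httr and jsonlite packages, and a hypothetical endpoint URL), the code below fetches a Dataset-JSON payload over HTTP and rebuilds a data frame from its columns and rows, using the same flattened layout assumed earlier.

    library(httr)
    library(jsonlite)

    # Hypothetical endpoint, shown only to illustrate the exchange pattern
    resp <- GET("https://example.org/api/studies/CDISC01/datasets/DM")
    stop_for_status(resp)

    ds <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                   simplifyVector = FALSE)

    # Column names come from the metadata; rows are arrays of values.
    # Binding the rows this way coerces everything to character; real
    # tooling would use the dataType metadata to restore column types.
    col_names <- vapply(ds$columns, function(x) x$name, character(1))
    dm <- as.data.frame(
      do.call(rbind, lapply(ds$rows, function(r) unlist(r, use.names = FALSE))),
      stringsAsFactors = FALSE
    )
    names(dm) <- col_names

In practice the datasetjson packages introduced in this workshop read Dataset-JSON directly; a hand-rolled parser like this is shown only to make the structure visible.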

Dataset-JSON Assumptions (Part 2 of 2)

  1. Systems will not make frequent reads/writes to Dataset-JSON files
  2. Extensible. Most vendors extend ODM-based standards
  3. Many vendors already import/export JSON
  4. Addresses SAS XPT limitations
  5. Commitment to creating Dataset-JSON to Parquet conversion software (a conversion sketch follows this list)
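
As a hedged sketch of the kind of conversion tooling mentioned in item 5 (assuming the jsonlite and arrow packages, and the dm.json file sketched earlier), the code below reads a Dataset-JSON file and writes the same table out as Parquet. Real conversion software would also map the dataType metadata to Parquet types and carry labels and other attributes across; that is omitted here.

    library(jsonlite)
    library(arrow)

    ds <- read_json("dm.json", simplifyVector = TRUE)

    # With simplification, the row arrays come back as a character matrix;
    # restore the column names from the columns metadata and re-type crudely
    dm <- as.data.frame(ds$rows, stringsAsFactors = FALSE)
    names(dm) <- ds$columns$name
    dm <- type.convert(dm, as.is = TRUE)

    write_parquet(dm, "dm.parquet")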
