Why Change / Why Dataset-JSON / Why not Parquet?

datasetjson - Read and write CDISC Dataset JSON formatted datasets in R and Python

R/Pharma 2025 Workshop

2025-11-07

Why Change?

Why Change from SAS V5 XPORT (XPT)?

  • Game Boy. Sega Genesis. The Casio F-91W. And SAS XPT.
  • All were introduced in 1989.
  • In 1999, the FDA standardized the submission of clinical data using SAS XPT.
  • In 2012, FDA organized a Public Meeting on Study Data Exchange and later conducted the Dataset-XML pilot.
  • From a programming perspective, we now work in a multilingual world.
  • And now it may be time for SAS XPT to meet its end.

SAS XPT Limitations (Part 1 of 2)

  1. Data File Format
    • Limited variable types
    • Limited to US ASCII encoding
    • 8-character variable names
    • 40-character labels
    • 200-character field widths
  2. Storage
    • Inefficient use of storage space
    • The inability to compress datasets leads to logistical issues with files (e.g., having to split large datasets)

SAS XPT Limitations (Part 2 of 2)

  1. Content
    • Lacks a robust metadata layer
    • Only works for 2-dimensional data structures
  2. Extensibility
    • Not extensible

What is Dataset-JSON?

  • A CDISC data exchange format for tabular datasets (a small example is sketched after this list)
  • Designed to support a broad range of data exchange scenarios
  • Supports API and file-based data exchange
  • JSON/NDJSON is simple to implement, very stable, and widely supported
  • Extensible to support new metadata and new use cases
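
To make the bullets above concrete, the sketch below writes a tiny two-row demographics table as a Dataset-JSON file from R using jsonlite. The attribute names (datasetJSONVersion, itemGroupOID, columns, rows, and so on) follow my reading of the Dataset-JSON v1.1 flattened layout and are illustrative only; treat the CDISC specification and the datasetjson packages as the authoritative source for the exact structure.

    # Illustrative sketch only: attribute names approximate the
    # Dataset-JSON v1.1 layout and are not normative.
    library(jsonlite)

    dm <- data.frame(
      USUBJID = c("CDISC01-001", "CDISC01-002"),
      AGE     = c(34L, 41L),
      SEX     = c("M", "F"),
      stringsAsFactors = FALSE
    )

    ds_json <- list(
      datasetJSONVersion = "1.1.0",
      itemGroupOID       = "IG.DM",
      name               = "DM",
      label              = "Demographics",
      records            = nrow(dm),
      columns = list(
        list(itemOID = "IT.DM.USUBJID", name = "USUBJID",
             label = "Unique Subject Identifier", dataType = "string"),
        list(itemOID = "IT.DM.AGE", name = "AGE",
             label = "Age", dataType = "integer"),
        list(itemOID = "IT.DM.SEX", name = "SEX",
             label = "Sex", dataType = "string")
      ),
      # Each row is an array of values, in column order
      rows = lapply(seq_len(nrow(dm)), function(i) unname(as.list(dm[i, ])))
    )

    write_json(ds_json, "dm.json", auto_unbox = TRUE, pretty = TRUE)

Because the payload is plain JSON, it can be read back with any JSON parser in any language.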

Why Dataset-JSON?

Why JSON?

  • JSON is the most widely used data interchange standard globally
  • JSON is a lightweight text-based interchange format
  • JSON is very simple and widely supported
  • JSON is easy to read and write
  • JSON targets data exchange
  • JSON is a language independent standard

Dataset-JSON as an Alternative Transport Format for Regulatory Submissions Pilot

  • The pilot was a collaboration between CDISC, PHUSE, and the FDA
  • The pilot final readout occurred in June 2024
  • The pilot demonstrated that Dataset-JSON can transport data without disruption to business
  • Pilot findings resulted in: (1) standards updates, (2) User’s Guide content, and (3) tool updates and enhancements
  • FDA testing included both internal testing and external testing, including test submissions through the test Electronic Submissions Gateway (ESG)

Why not Parquet?

JSON Definition

From the JSON website: JSON is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers. These properties make JSON an ideal data-interchange language.

Parquet Definition

From the Parquet website: Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

What is the Purpose of Dataset-JSON?

  1. Addresses the data exchange use case
  2. API and File-based data exchange
  3. Optimized for breadth of support and ease of implementation for a broad range of data exchange scenarios
  4. Aligned with other data exchange standards such as HL7 FHIR and ODM v2.0

What is the Purpose of Parquet?

  1. Addresses the storage and analytical processing use case
  2. File-based data exchange
  3. Optimized for storage and analytical data processing performance (see the side-by-side sketch after this list)
  4. Aligned with sponsor statistical computing environment use
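
To make the contrast concrete, here is a small sketch (assuming the jsonlite and arrow packages are installed) that writes the same data frame twice: once as generic NDJSON, a line-delimited, human-readable text format suited to streaming and API exchange, and once as Parquet, a compressed columnar binary suited to storage and analytical scans. The lab values are simulated purely for illustration.

    library(jsonlite)
    library(arrow)

    # Simulated lab results, purely for illustration
    lb <- data.frame(
      USUBJID  = rep(sprintf("CDISC01-%03d", 1:100), each = 10),
      LBTESTCD = rep(c("ALB", "ALT", "AST", "BILI", "CREAT",
                       "GLUC", "HGB", "K", "SODIUM", "WBC"), times = 100),
      LBSTRESN = round(runif(1000, 1, 200), 1)
    )

    # NDJSON: one JSON object per line, easy to read, diff, and stream
    con <- file("lb.ndjson", open = "w")
    stream_out(lb, con)
    close(con)

    # Parquet: columnar and compressed, optimized for analytical reads
    write_parquet(lb, "lb.parquet")

    # Compare the resulting file sizes on disk
    file.size(c("lb.ndjson", "lb.parquet"))

Neither format is better in the abstract; each is optimized for the use cases listed above.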

Dataset-JSON Assumptions (Part 1 of 2)

  1. Data exchange scenarios include datasets generated by applications such as EDC, ePRO, labs, and other data sources
  2. APIs will provide the most common means to exchange data (a minimal sketch follows this list)
  3. Aligns with:
    • ODM v2.0 and define.json, and works with Define-XML
    • DDF USDM, ARS, CORE, CDISC Library, OAK, and other CDISC projects
    • Healthcare data exchange standards such as HL7 FHIR
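
As a minimal sketch of API-based exchange (assuming the httr and jsonlite packages, and a hypothetical endpoint URL), the code below fetches a Dataset-JSON payload over HTTP and rebuilds a data frame from its columns and rows, using the same flattened layout assumed earlier.

    library(httr)
    library(jsonlite)

    # Hypothetical endpoint, shown only to illustrate the exchange pattern
    resp <- GET("https://example.org/api/studies/CDISC01/datasets/DM")
    stop_for_status(resp)

    ds <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                   simplifyVector = FALSE)

    # Column names come from the metadata; rows are arrays of values.
    # Binding the rows this way coerces everything to character; real
    # tooling would use the dataType metadata to restore column types.
    col_names <- vapply(ds$columns, function(x) x$name, character(1))
    dm <- as.data.frame(
      do.call(rbind, lapply(ds$rows, function(r) unlist(r, use.names = FALSE))),
      stringsAsFactors = FALSE
    )
    names(dm) <- col_names

In practice the datasetjson packages introduced in this workshop read Dataset-JSON directly; a hand-rolled parser like this is shown only to make the structure visible.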

Dataset-JSON Assumptions (Part 2 of 2)

  1. Systems will not make frequent reads/writes to Dataset-JSON files
  2. Extensible. Most vendors extend ODM-based standards
  3. Many vendors already import/export JSON
  4. Addresses SAS XPT limitations
  5. Commitment to creating Dataset-JSON to Parquet conversion software (a conversion sketch follows this list)
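
As a hedged sketch of the kind of conversion tooling mentioned in item 5 (assuming the jsonlite and arrow packages, and the dm.json file sketched earlier), the code below reads a Dataset-JSON file and writes the same table out as Parquet. Real conversion software would also map the dataType metadata to Parquet types and carry labels and other attributes across; that is omitted here.

    library(jsonlite)
    library(arrow)

    ds <- read_json("dm.json", simplifyVector = TRUE)

    # With simplification, the row arrays come back as a character matrix;
    # restore the column names from the columns metadata and re-type crudely
    dm <- as.data.frame(ds$rows, stringsAsFactors = FALSE)
    names(dm) <- ds$columns$name
    dm <- type.convert(dm, as.is = TRUE)

    write_parquet(dm, "dm.parquet")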
