Data Management

CDISC's quiet pivot from files to a data-centric stack

Dataset-JSON now ships with an API standard. The interesting question is whether the metadata underneath is consistent enough for anything to actually query it.

Data standards & interoperability
Stat Programming
Leadership & Strategy

Dataset-JSON v1.1 landed this cycle paired with a Dataset-JSON API v1.0 Standard and a compressed variant — and that pairing, not the JSON format itself, is the architectural news. A file format swap from XPT to JSON is a plumbing change. A format plus a normative API is a different proposition: it says the unit of exchange between sponsor, CRO, and regulator is no longer a file you ship, but a dataset you query. Alongside the broader USDM/DDF and IDMP convergence we cover separately this issue, this is the piece that quietly rewires the submission stack.

The implication for biometrics infrastructure is concrete. Today’s define/validation/review loop assumes a packaged artifact: cut the XPT, regenerate Define-XML, run Pinnacle 21, ship. An API-mediated dataset implies persistent endpoints, versioning semantics, authentication, and — most awkwardly — agreement on what “the data” is at any given moment. Define-XML 2.1.11 is a point release that fits inside the old model; Dataset-JSON-plus-API points at a different one, where the line between the sponsor’s MDR and the submission package gets thin.
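The "what is the data at any given moment" problem can be made concrete with a content fingerprint. The sketch below is illustrative only: `dataset_fingerprint` is a hypothetical helper, the payload is a simplified stand-in rather than the normative Dataset-JSON schema, and nothing here is taken from the API standard itself. It shows one way a client and a server could agree they are looking at the same snapshot.

```python
import hashlib
import json


def dataset_fingerprint(dataset: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON serialisation.

    Keys are sorted so that two payloads with the same content but a
    different key order hash identically. Row order is still significant.
    """
    canonical = json.dumps(
        dataset, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical, simplified payloads (not the normative Dataset-JSON schema).
snapshot_a = {"name": "DM", "rows": [["S1-001", "M"]]}
snapshot_b = {"rows": [["S1-001", "M"]], "name": "DM"}  # same content, new key order
snapshot_c = {"name": "DM", "rows": [["S1-001", "F"]]}  # one value changed

same_content = dataset_fingerprint(snapshot_a) == dataset_fingerprint(snapshot_b)
changed_content = dataset_fingerprint(snapshot_a) != dataset_fingerprint(snapshot_c)
```

A fingerprint like this could plausibly back an ETag or a version label, but how versions are actually identified is for the standard and the implementation to define.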

Where the architecture meets reality

This is where OpenStudyBuilder becomes the most honest document in the pile. Novo Nordisk’s COSA-hosted metadata repository is, on paper, the reference implementation of exactly the world Dataset-JSON-plus-API is meant to enable: a Neo4j-backed MDR exposing CDISC CT, CDASH, SDTM, ADaM, ODM-XML, Define-XML, and a DDF/USDM adaptor through a FastAPI layer. It is also the project whose own documentation diagnoses the blocker. The platform’s stated rationale calls out “gaps in standards metadata, limiting automation opportunities” and notes that the flexibility CDISC standards permit “also allows for inconsistencies that makes automation scaling difficult,” citing CDISC 360 as the conceptual fix.

That is a quiet but load-bearing admission. The pivot to a queryable, linked-metadata stack assumes the metadata is consistent enough to link. Anyone who has tried to reconcile a protocol’s inclusion criterion, a CRF item, a CT codelist, and an SDTM variable across three studies in the same compound knows the assumption is generous. Dataset-JSON gives you a better envelope. It does not give you agreement on what goes inside, and the API standard makes the inconsistencies more visible, not less — a misaligned codelist that hid inside an XPT now fails a schema check at the endpoint.
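To see how a misaligned codelist becomes visible once data is served rather than shipped, consider a minimal sketch of a value-level check. Everything here is an assumption for illustration: `find_codelist_violations` is a hypothetical function, the payload is a simplified Dataset-JSON-like shape with `columns` and `rows`, and the `SEX` terms are an illustrative subset rather than the full controlled terminology.

```python
from typing import Any


def find_codelist_violations(
    dataset: dict[str, Any], codelists: dict[str, set[str]]
) -> list[tuple[int, str, Any]]:
    """Return (row_index, column_name, value) for each value outside its codelist.

    Columns without a codelist and null values are skipped.
    """
    names = [col["name"] for col in dataset["columns"]]
    violations = []
    for row_idx, row in enumerate(dataset["rows"]):
        for name, value in zip(names, row):
            allowed = codelists.get(name)
            if allowed is not None and value is not None and value not in allowed:
                violations.append((row_idx, name, value))
    return violations


# Hypothetical, simplified payload (not the normative Dataset-JSON schema).
dataset = {
    "name": "DM",
    "columns": [{"name": "USUBJID"}, {"name": "SEX"}],
    "rows": [["S1-001", "M"], ["S1-002", "Female"], ["S1-003", None]],
}
# Illustrative subset of terms, assumed for this example.
codelists = {"SEX": {"F", "M", "U"}}

violations = find_codelist_violations(dataset, codelists)
# violations == [(1, "SEX", "Female")]
```

The check itself is trivial. The hard part, as the paragraph above argues, is agreeing on which codelist a given column is supposed to be bound to in the first place.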

What’s actually testable now

The useful framing for biometrics leaders is therefore not “are we Dataset-JSON compliant” but “does our metadata survive being queried.” Most sponsor MDRs were built to populate Define-XML at the end of a study. Few were built to serve as the system of record an API talks to during one. The architectural question worth piloting this year is whether the current MDR — vendor or homegrown — can expose a versioned, audit-trailed view of a single study’s SDTM and Define metadata to an internal client without a human regenerating anything. If it cannot, the gap to a Dataset-JSON-API future is organisational, not technical.
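What "versioned and audit-trailed" means in practice can be sketched in a few lines. This is a toy in-memory model under stated assumptions: `MetadataStore`, its methods, and the metadata keys are all hypothetical, and a real MDR would sit on a database with authentication and access control. The point is only the shape of the contract: every change is a new immutable version with a who, why, and when attached.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Optional


class MetadataStore:
    """Toy append-only store: each commit creates a new numbered version."""

    def __init__(self) -> None:
        self._versions: list[dict[str, Any]] = []
        self._audit: list[dict[str, Any]] = []

    def commit(self, metadata: dict[str, Any], user: str, reason: str) -> int:
        """Store a deep copy of the metadata and record an audit entry."""
        self._versions.append(copy.deepcopy(metadata))
        version = len(self._versions)
        self._audit.append(
            {
                "version": version,
                "user": user,
                "reason": reason,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )
        return version

    def get(self, version: Optional[int] = None) -> dict[str, Any]:
        """Return a copy of the requested version (latest if omitted)."""
        number = len(self._versions) if version is None else version
        if not 1 <= number <= len(self._versions):
            raise LookupError(f"no such version: {version}")
        return copy.deepcopy(self._versions[number - 1])

    def audit_trail(self) -> list[dict[str, Any]]:
        """Return a copy of the audit entries, oldest first."""
        return copy.deepcopy(self._audit)


# Hypothetical study metadata; the keys and values are made up for the example.
store = MetadataStore()
v1 = store.commit({"AESEV": {"codelist": "CL.AESEV"}}, "alice", "initial load")
v2 = store.commit({"AESEV": {"codelist": "CL.AESEV.V2"}}, "bob", "codelist update")
```

An internal client that can call the equivalent of `get(1)` and `get(2)` and receive different, reproducible answers, with no human regenerating anything in between, is roughly the bar the pilot question is asking about.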

The vendor conversation should shift accordingly. “We support Dataset-JSON” is becoming a checkbox; the harder questions are whether the platform actually round-trips USDM, whether anyone other than Novo Nordisk is running it in production, and what the upgrade path looks like when ADaM Oncology Examples v1.0 or the next TAUG drops new biomedical concepts the MDR has never seen. None of this is a 2026 submission problem. It is a 2027–2028 operating-model problem, and the teams that wait for regulatory force will be retrofitting under timeline pressure rather than piloting under engineering conditions.

Protocol read: The format shipped; the architecture it implies has not. The work this year is proving your own metadata is consistent enough to be queried — before a reviewer’s tooling proves it isn’t.

What to do now:

Pilot Dataset-JSON v1.1 plus the API standard on one closed study, end-to-end through define generation and validation, and log every place a human had to intervene.
Ask MDR vendors for a named, non-Novo production deployment supporting USDM/DDF round-trip, and treat absence of one as the answer.
Run a terminology-drift diagnostic across protocol, CRF, and SDTM for a single compound; the gap surfaces what the standards still don’t catch.
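The terminology-drift diagnostic in the last item can start very small. The sketch below is a minimal illustration, not a validated tool: `terminology_drift` is a hypothetical function, the severity terms are invented sample data, and matching is by exact string, so case or whitespace differences would also register as drift.

```python
def terminology_drift(sources: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each term to the set of sources that lack it.

    Terms present in every source are omitted from the result.
    """
    all_terms = set().union(*sources.values())
    drift = {}
    for term in all_terms:
        missing = {name for name, terms in sources.items() if term not in terms}
        if missing:
            drift[term] = missing
    return drift


# Invented sample data: severity terms as they appear in three artefacts.
sources = {
    "protocol": {"MILD", "MODERATE", "SEVERE"},
    "crf": {"MILD", "MODERATE", "SEVERE", "LIFE THREATENING"},
    "sdtm": {"MILD", "MODERATE", "SEVERE"},
}

drift = terminology_drift(sources)
# drift == {"LIFE THREATENING": {"protocol", "sdtm"}}
```

Run per codelist across the studies in one compound, the output is a list of terms that exist in one artefact but not the others, which is a reasonable first inventory of what a linked-metadata stack would have to reconcile.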
