Medical AI and FDA: Why MLOps Is Not Optional
If you are building AI for medical devices and you intend to go through FDA clearance, MLOps is not a DevOps preference or an engineering best practice. It is the mechanism by which you generate the evidence FDA requires. Without it, submission is structurally impossible — not difficult, not expensive, not slow. Impossible.
And that sounds like an overstatement — until you see what FDA actually asks for.
Here is why.
What FDA Actually Reviews
There is a common misconception, particularly among researchers coming from a Korean MFDS background, that regulatory approval is fundamentally a documentation exercise. Submit the right forms with the right numbers, and you pass.
FDA works differently. The agency does not review documents describing your system. It reviews whether your system was built through a verifiable, auditable process. On-site inspections involve direct questioning: show me the data movement log, show me the labeling protocol, show me the change history from model v1.0 to today, reproduce this result for me right now.
The distinction matters because it changes what you need to produce. MFDS asks: is your documentation correct? FDA asks: did this actually happen the way you claim?
Audit trails cannot be backdated. Whatever process you are running today is the history you will have at submission time. If that process is ad hoc — training on individual laptops, configurations tracked in personal notes, experiments undocumented — no amount of retrospective cleanup will satisfy an inspection.
Why AI Devices Are Regulated Differently From Hardware
A hardware device is evaluated once. An AI system is evaluated continuously — whether you design for it or not.
A hardware or drug trial has a natural boundary. You collect data, run the study, close enrollment, analyze results, and submit. The dataset is fixed. The process terminates.
AI models do not work this way. Models need to be retrained as new data arrives, as performance degrades, as the device population shifts. This means the regulatory question is not just “how was this model built?” but “how will this model be maintained, updated, and monitored after clearance?”
FDA addressed this directly in its 2021 AI/ML Action Plan. The key concept is the Predetermined Change Control Plan and, within it, the Algorithm Change Protocol: a pre-approved framework defining which model changes can be made without returning to FDA and which require a new 510(k) submission. Getting an ACP approved requires demonstrating that your development process is controlled enough that pre-specified changes can be trusted without full re-review.
That level of process control requires MLOps. There is no workaround.
The IRB Problem Nobody Talks About
Before any of this, there is a more fundamental issue: data.
IRB approval is not a checkbox. It is a scoped authorization. A protocol approved for “research purposes” does not automatically authorize use of that data for AI model training, cloud storage, retraining of future model versions, or expansion into new features. Each of these may require explicit consent language that most research IRBs do not include by default.
This matters enormously for AI development because the continuous retraining cycle means you need data authorization that extends across years, across model versions, and across feature expansions. If your consent forms say “data will be used for this study only,” that data cannot legally be used for retraining — and an entire dataset collected under those terms is invalid for FDA submission.
IRB protocols intended to support AI development need to be designed with that scope from the beginning. Retroactive amendment is possible but limited. Retroactive consent is often not possible at all.
The compounding problem: cloud storage and data transfer are not covered by IRB approval either. Moving patient data from a hospital server to a cloud training environment requires separate HIPAA compliance — Business Associate Agreements with the cloud provider, access control policies, encryption at rest and in transit, and a full audit log of who accessed what data and when. These requirements are independent of the IRB and must be in place before a single training run happens in the cloud.
What MLOps Actually Provides
Given the above, the function of MLOps in a regulated AI context becomes clearer. It is not about deployment speed or team efficiency. It is about generating a continuous, system-produced record of how every model came to exist.
FDA will ask: which patients’ data was used to train this model? Which version of the preprocessing pipeline? Which commit of the training code? What were the hyperparameters? What was the performance on the held-out validation set? Can you reproduce this result today?
These questions can only be answered if the answers were recorded automatically at training time. The toolchain that makes this possible:
Git tracks every change to training code, preprocessing scripts, and model architecture — with author, timestamp, and rationale. Not just the final version. Every intermediate state.
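To make that concrete, here is a minimal sketch (the check and the script layout are illustrative, not a prescribed FDA control) of a training script that refuses to run from uncommitted code and records the exact commit it ran from:

```python
# Sketch: capture the exact commit of the training code before a run starts.
# Assumes the script is executed from inside a Git working copy.
import subprocess

def current_git_commit() -> str:
    """Full SHA of HEAD for the repository the training code lives in."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def working_tree_is_clean() -> bool:
    """True if there are no uncommitted changes; anything else breaks traceability."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return status.strip() == ""

if __name__ == "__main__":
    assert working_tree_is_clean(), "refusing to train from uncommitted code"
    print("training code commit:", current_git_commit())
```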
DVC (Data Version Control) links each training run to the exact dataset version used. Data files live in cloud storage; DVC metadata files live in Git. The result is a complete data provenance chain: model version → dataset version → source collection event.
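As a minimal sketch of what that chain looks like from the consumer side (the repository URL, file path, and tag below are hypothetical), DVC's Python API can resolve exactly which stored dataset version a given code revision points to:

```python
# Sketch: resolve the exact dataset version that a given code revision points to.
# The repository URL, file path, and tag are hypothetical placeholders.
import dvc.api

REPO = "https://github.com/example-org/stenosis-model"  # hypothetical repo
REV = "model-v1.3"                                      # Git tag of a released model

# Location of the dataset artifact in remote storage for that revision
url = dvc.api.get_url("data/train/labels.csv", repo=REPO, rev=REV)
print("dataset artifact behind", REV, "->", url)

# Read the file exactly as it existed at that revision
with dvc.api.open("data/train/labels.csv", repo=REPO, rev=REV) as f:
    print("first line:", f.readline().strip())
```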
MLflow logs every experiment (hyperparameters, metrics, and artifacts) and maintains the model registry. This is the equivalent of a lab notebook for AI development. Importantly, failed experiments must be logged. FDA inspectors are suspicious of records that show only successful runs — real systems fail, and the absence of failure in the record looks fabricated.
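A minimal sketch of that record in code (the experiment name, parameters, and metric values are illustrative), including the failure path, might look like this:

```python
# Sketch: one training run logged to MLflow. Experiment name, parameters,
# and metric values are illustrative placeholders.
from pathlib import Path

import mlflow

def train_and_validate() -> float:
    """Stand-in for the real training loop; returns a validation metric."""
    return 0.87

mlflow.set_experiment("stenosis-segmentation")  # hypothetical experiment name

with mlflow.start_run(run_name="unet-baseline"):
    mlflow.log_params({"lr": 1e-4, "batch_size": 8, "epochs": 50})
    mlflow.set_tag("dataset_rev", "data-v2.1")  # ties the run to a dataset version

    try:
        val_dice = train_and_validate()
        mlflow.log_metric("val_dice", val_dice)

        Path("outputs").mkdir(exist_ok=True)
        report = Path("outputs") / "validation_report.txt"
        report.write_text(f"val_dice={val_dice}\n")
        mlflow.log_artifact(str(report))        # artifacts stay attached to the run
    except Exception as exc:
        # If training fails, the exception propagates and MLflow records the run
        # as FAILED; tagging the reason keeps the failure in the history.
        mlflow.set_tag("failure_reason", repr(exc))
        raise
```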
Docker freezes the complete software environment — Python version, library versions, CUDA version — so that any training run can be exactly reproduced months or years later. “It worked on my machine” is not an acceptable answer during an FDA inspection.
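The image itself is defined by a Dockerfile and pinned by its digest. As a complementary sketch in Python (a supplementary logging step, not a substitute for the frozen image), the training script can also snapshot the environment it actually ran in, so the record can be cross-checked against the pinned image later:

```python
# Sketch: snapshot the environment a training run actually saw. The recorded
# versions should match what the pinned Docker image provides; any mismatch
# is a traceability problem worth investigating.
import json
import platform
from importlib import metadata

def environment_snapshot() -> dict:
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
    }

if __name__ == "__main__":
    with open("environment_snapshot.json", "w") as f:
        json.dump(environment_snapshot(), f, indent=2, sort_keys=True)
```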
Label Studio (or an equivalent web-based annotation tool) records who labeled which data point, when, under what protocol, and with what inter-rater reliability. Local annotation tools or spreadsheet-based tracking do not produce the kind of structured, queryable provenance records that FDA expects.
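To make the inter-rater part concrete, here is a minimal sketch (the label values are made up, and the export format will depend on the annotation tool) of computing agreement between two annotators:

```python
# Sketch: inter-rater agreement (Cohen's kappa) for two annotators labeling the
# same studies. Label values are illustrative; 1 = finding present, 0 = absent.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # recorded alongside the labeling protocol
```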
Together, these tools establish a complete model lineage: raw data → preprocessing pipeline → training code → configuration → model → validation. A break anywhere in that chain is a reproducibility failure. A reproducibility failure is a rejection.
None of these tools is optional: each one answers a specific question FDA will ask.
The Traceability Requirement
The specific document FDA expects to see is a traceability matrix — a structured mapping from every user requirement to a design specification, to an implementation, to a test case, to a documented result.
For example:
- Requirement → “Detect stenosis”
- Design → CNN-based segmentation model
- Implementation → specific training pipeline + dataset
- Test → subgroup evaluation on device-specific data
For AI systems, this is more complex than for traditional software because the “implementation” includes not just code but data and model weights, and the “test case” must cover subgroup performance across demographic categories that may not have been considered during initial design.
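A minimal sketch of what one row of such a matrix can look like when kept as a structured record rather than a spreadsheet (field names and values are illustrative, not an FDA template), with the implementation captured as code and data revisions:

```python
# Sketch: one traceability-matrix row kept as a structured, queryable record.
# Field names and values are illustrative, not an FDA template.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TraceabilityRow:
    requirement_id: str   # user requirement
    design_ref: str       # design specification it maps to
    code_commit: str      # Git commit of the training pipeline
    dataset_rev: str      # DVC/Git revision of the training data
    model_version: str    # entry in the model registry
    test_case_id: str     # verification/validation test case
    test_result_ref: str  # where the documented result lives

row = TraceabilityRow(
    requirement_id="REQ-012: detect stenosis",
    design_ref="DS-004: CNN-based segmentation model",
    code_commit="9f3c2ab",
    dataset_rev="data-v2.1",
    model_version="stenosis-seg v1.3",
    test_case_id="TC-031: subgroup evaluation on device-specific data",
    test_result_ref="reports/tc-031-results.pdf",
)
print(asdict(row))
```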
FDA requires that performance be disaggregated by age group, sex, device type (if the model was trained on data from multiple device manufacturers), and institution. A single aggregate accuracy metric does not satisfy the submission requirement. This means subgroup analysis must be planned into the study design from the beginning — you cannot retroactively analyze subgroups that were not tracked during collection.
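A minimal sketch of that disaggregation (column names, subgroup categories, and values are illustrative), assuming those attributes were recorded at collection time:

```python
# Sketch: disaggregate performance by pre-specified subgroups. Column names,
# categories, and values are illustrative; this only works if the subgroup
# attributes were captured during data collection.
import pandas as pd

results = pd.DataFrame({
    "age_group":   ["<40", "<40", "40-65", "40-65", ">65", ">65"],
    "sex":         ["F", "M", "F", "M", "F", "M"],
    "device":      ["VendorA", "VendorB", "VendorA", "VendorB", "VendorA", "VendorB"],
    "institution": ["Site1", "Site2", "Site1", "Site2", "Site1", "Site2"],
    "y_true":      [1, 0, 1, 1, 0, 1],
    "y_pred":      [1, 0, 0, 1, 0, 1],
})

results["correct"] = results["y_true"] == results["y_pred"]
for col in ["age_group", "sex", "device", "institution"]:
    print(f"\naccuracy by {col}:")
    print(results.groupby(col)["correct"].mean())
```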
What This Means for Research Labs
Most academic medical imaging labs are not thinking about any of this. They are thinking about model performance, publication timelines, and grant deliverables. That is appropriate for research. But if a lab’s work is intended to eventually support a commercial product with FDA clearance, the gap between where most labs operate and what FDA requires is substantial.
The gap is not primarily technical. The tools exist and are not difficult to use. The gap is structural: labs are not set up to run IRB protocols scoped for AI reuse, they do not have cloud-only data policies, and they do not run experiments through version-controlled pipelines with automatic logging.
Closing this gap takes time — specifically, it takes the time needed to accumulate a meaningful audit trail. You cannot compress this. Starting the process six months before submission produces six months of history. Starting three years before submission produces three years of history. FDA is aware of the difference.
The practical implication: if a research program intends to support a clinical AI product, the MLOps infrastructure needs to be in place at the start of data collection, not at the start of submission preparation.
A Note on the 3-Year FDA Timeline
It is common for medical AI startups and research programs to cite “FDA approval in three years” as a project milestone. In most cases, this framing is either optimistic or imprecise.
Three years is feasible for reaching a 510(k) submission — if the process starts immediately, if IRB protocols are correctly scoped from the beginning, if quality management system (QMS) infrastructure is built in year one, and if the device falls cleanly into Class II. Three years to FDA approval (not submission) is very aggressive for most AI SaMD applications, and essentially infeasible if the baseline at year zero is no IRB, no MLOps, and no data governance.
The more accurate framing for a program starting from scratch: three years to a credible, complete 510(k) submission; approval likely in year four or beyond, depending on FDA review timelines and the number of review cycles required.
This is not a pessimistic assessment. It is the realistic timeline when the full scope of what FDA reviews is taken seriously.
Conclusion
The reason MLOps is mandatory is not bureaucratic. It is epistemic. FDA does not review your claims — it reviews whether your process generated evidence that supports them.
Starting late means the evidence does not exist. No amount of work at submission time can create a history that was not recorded.
MLOps is not how you build the model. It is how you prove the model exists.