Data Integrity in Production ML Systems
Machine learning teams come from a research culture. The artifact of research is a notebook and a metric on a slide. The artifact of production is a running service that takes real input and emits real decisions to real users. The security engineering discipline around that second artifact — integrity, provenance, auditability — is usually missing. In most ML pipelines, a pickle file appears in an S3 bucket, some CI job grabs it, and it ends up deserialized in a process with network access. That is not how we'd handle a binary in any other part of the stack.
This post is about treating ML artifacts with the same discipline that any security-conscious team applies to code. The threats are different — you aren't usually fighting remote exploitation of the model itself, although that exists — but the controls are the same: hash everything, pin versions, sign releases, log provenance, and fail loudly when invariants break.
Threat Model
Not every deployment needs the full stack of controls. Start by naming what you're defending against:
| Threat | Who cares |
|---|---|
| Silent model corruption (bad training run, data leak, label drift) | Every production team |
| Supply-chain attack on a dependency (sklearn, xgboost, torch) | Every team |
| Malicious pickle in an artifact store | Teams that deserialize pickles from shared storage |
| Training data poisoning | Teams with external-contributed training data |
| Adversarial inputs at inference | Security-critical classifiers |
| Model theft / weight exfiltration | Teams whose model IS the product |
The first three apply to essentially everyone. Let's focus on controls for those.
1. Hash Every Artifact at Publish Time and Verify at Load Time
Pickle deserialization is remote code execution by design. The Python pickle module is explicitly documented as unsafe to load from untrusted sources. In practice "untrusted" means "any source whose bit-for-bit integrity you cannot verify."
The minimum viable control: publish a SHA-256 with every artifact and verify on load.
```python
import hashlib, json, pathlib, pickle, time

ARTIFACT_DIR = pathlib.Path("/opt/models")

def sha256(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def publish(name: str, model_obj) -> dict:
    p = ARTIFACT_DIR / f"{name}.pkl"
    with p.open("wb") as f:
        pickle.dump(model_obj, f)
    manifest = {
        "name": name,
        "sha256": sha256(p),
        "size_bytes": p.stat().st_size,
        "published_at": time.time(),
    }
    (ARTIFACT_DIR / f"{name}.manifest.json").write_text(json.dumps(manifest))
    return manifest

def load(name: str, expected_sha256: str):
    p = ARTIFACT_DIR / f"{name}.pkl"
    actual = sha256(p)
    if actual != expected_sha256:
        raise RuntimeError(
            f"SHA-256 mismatch on {name}: expected {expected_sha256}, got {actual}"
        )
    with p.open("rb") as f:
        return pickle.load(f)
```
The expected_sha256 comes from a separate, write-restricted source — your service config, a signed manifest, an environment variable injected by CI. Never from the same bucket the artifact lives in. An attacker who can rewrite one file can rewrite two.
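One concrete way to keep the pin out of the bucket is to read it from an environment variable injected at deploy time. A minimal sketch; the `MODEL_SHA256_<NAME>` naming scheme is an assumption for illustration, not a convention:

```python
import os

def expected_hash_from_env(name: str) -> str:
    """Read the pinned hash from an env var injected by CI/CD,
    e.g. MODEL_SHA256_FRAUD for a model named 'fraud'."""
    var = f"MODEL_SHA256_{name.upper()}"
    value = os.environ.get(var, "").strip().lower()
    # Fail loudly if the pin is missing or malformed: a service with no
    # pinned hash should never silently load an artifact.
    if len(value) != 64 or any(c not in "0123456789abcdef" for c in value):
        raise RuntimeError(f"{var} missing or not a SHA-256 hex digest")
    return value
```

The startup path then becomes `load(name, expected_hash_from_env(name))`, and a pod with a missing or mangled pin refuses to come up at all.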
This is the same check a package manager runs on every apt-get install. The only reason people skip it for ML is cultural inertia, not any technical barrier.
2. Pin Library Versions and Record Them in the Artifact
A pickle is a reference to class names, not to bytecode. If you pickle an xgboost.Booster under xgboost 2.0.3 and load it under xgboost 2.1.0, the deserializer calls the 2.1.0 constructor on 2.0.3 bytes. Silent behavior changes. Silently wrong predictions.
```python
import importlib.metadata, pickle, sys

class VersionedArtifact:
    def __init__(self, payload, deps: list[str]):
        self.payload = payload
        self.env = {
            "python": sys.version,
            "deps": {
                d: importlib.metadata.version(d) for d in deps
            },
        }

# Publish:
art = VersionedArtifact(
    payload=my_booster,
    deps=["xgboost", "numpy", "scikit-learn"],
)
with open("model.pkl", "wb") as f:
    pickle.dump(art, f)

# Load:
with open("model.pkl", "rb") as f:
    art = pickle.load(f)
for dep, ver in art.env["deps"].items():
    actual = importlib.metadata.version(dep)
    if actual != ver:
        raise RuntimeError(f"Version drift: {dep} trained@{ver} running@{actual}")
model = art.payload
```
Every inference pod in production should fail loudly the moment a dependency version differs from training. This one check catches a large share of production-only regressions.
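The same comparison can also run earlier, in CI, before an artifact ever reaches a pod. A small sketch, assuming the `art.env["deps"]` layout from the snippet above and a lockfile already parsed into a plain `{package: version}` dict (the helper name is illustrative):

```python
def version_drift(recorded: dict[str, str],
                  lockfile: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {package: (trained_version, pinned_version)} for every
    dependency whose pinned version differs from the one recorded in
    the artifact, or which the lockfile does not pin at all."""
    drift = {}
    for pkg, trained in recorded.items():
        pinned = lockfile.get(pkg)
        if pinned != trained:
            drift[pkg] = (trained, pinned or "<absent>")
    return drift
```

A CI gate fails the build whenever `version_drift(...)` is non-empty, instead of waiting for the inference pod to crash at startup.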
3. Build the Pickle in an Isolated Environment
If a training job runs on a shared Jupyter instance with arbitrary pip installs, the resulting pickle reflects that noise. Instead, serialize models from a reproducible environment — a Dockerfile with pinned versions, checked into git, tagged with a content hash.
```dockerfile
FROM python:3.11.8-slim-bookworm@sha256:<digest>
RUN pip install --no-deps \
    xgboost==2.0.3 \
    numpy==1.26.4 \
    scikit-learn==1.4.1.post1
COPY train.py /train.py
ENTRYPOINT ["python", "/train.py"]
```
The digest-pinned base image means even the OS packages are fixed. The training job tags its output with the git SHA of train.py plus the digest of the base image. You can later reconstruct the exact environment that produced any artifact.
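One way to derive that tag deterministically is to fold all three inputs into a single digest. A sketch with an illustrative format (the function name and tag layout are assumptions, not a standard):

```python
import hashlib

def provenance_tag(git_sha: str, base_image_digest: str,
                   train_script: bytes) -> str:
    """Fold the git commit, the pinned base-image digest, and the exact
    bytes of train.py into one short, deterministic tag. Any change to
    any input produces a different tag."""
    h = hashlib.sha256()
    for part in (git_sha.encode(), base_image_digest.encode(), train_script):
        h.update(len(part).to_bytes(8, "big"))  # length-prefix to avoid ambiguity
        h.update(part)
    return f"{git_sha[:12]}-{h.hexdigest()[:16]}"
```

Stamping this tag onto the artifact's manifest means any artifact can later be traced back to the exact commit and base image that produced it.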
4. Replace Pickle with a Safe Format Where You Can
Pickle is convenient but dangerous. Prefer safer alternatives when the library supports them:
- XGBoost: `booster.save_model("model.ubj")` — UBJSON, portable, no code execution.
- LightGBM: `model.save_model("model.txt")` — plain text.
- PyTorch: `safetensors` for weights; keep the model class in code, not in the pickle.
- scikit-learn: ONNX export is supported for some estimators; for others pickle is unavoidable, but you can wrap it with the hash check above.
The safer formats serialize data, not code. Loading them cannot execute arbitrary functions.
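That distinction can be enforced mechanically at load time. A sketch of a suffix-based policy gate; the suffix sets are an illustrative allow-list, not a complete taxonomy:

```python
import pathlib

# Formats that serialize data only: loading them cannot run code.
SAFE_SUFFIXES = {".ubj", ".txt", ".json", ".safetensors", ".onnx"}

def classify_artifact(path: str) -> str:
    """Return 'safe' for data-only formats, 'needs-verification' for
    pickle-based formats (load only after the hash check from section 1),
    and reject anything unrecognized."""
    suffix = pathlib.Path(path).suffix.lower()
    if suffix in SAFE_SUFFIXES:
        return "safe"
    if suffix in {".pkl", ".pickle", ".pt", ".pth"}:
        return "needs-verification"
    raise ValueError(f"unknown artifact format: {suffix!r}")
```

A loader then routes "safe" artifacts straight to the library and sends "needs-verification" ones through the hash check (and ideally the sandbox from section 6) first.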
5. Log Every Prediction, Sample for Ground Truth
This is the ML equivalent of web access logs. Every inference should land in a log with:
- Model version (the hash from §1)
- Input features (or a hash if privacy demands it)
- Predicted output (probability, score, class)
- Timestamp and request ID
Sampling a small percentage of these against eventual ground truth gives you the live calibration report you need to detect silent drift. If the model was 72% accurate during training and is running at 61% accurate in production across 10,000 logged predictions, something changed.
```python
import json, pathlib, time, uuid

LOG = pathlib.Path("/var/log/ml/predictions.jsonl")

def log_prediction(model_sha: str, features: dict, pred: float,
                   request_id: str | None = None):
    LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "req": request_id or str(uuid.uuid4()),
        "model_sha": model_sha[:16],
        "features": features,
        "pred": pred,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```
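The nightly comparison against ground truth can be as simple as a join on the request ID. A minimal sketch, assuming records shaped like those written by `log_prediction` above and a `{request_id: label}` map of resolved outcomes (both are assumptions for illustration):

```python
def live_accuracy(logged: list[dict], outcomes: dict[str, int],
                  threshold: float = 0.5) -> float:
    """Join logged predictions with resolved ground-truth labels and
    compute live accuracy. Records without an outcome yet are skipped."""
    hits = total = 0
    for rec in logged:
        label = outcomes.get(rec["req"])
        if label is None:
            continue  # ground truth not resolved yet
        predicted = 1 if rec["pred"] >= threshold else 0
        hits += predicted == label
        total += 1
    if total == 0:
        raise ValueError("no resolved predictions to score")
    return hits / total
```

If training accuracy was 72% and this function reports 61% over a large sample, something changed and a human should be paged.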
6. Isolate Deserialization
If you absolutely must load a pickle from a shared artifact store, do it in a sandbox, not in your main service process:
- A dedicated "model loader" process with no network, no filesystem write access, and minimal privileges.
- The loader exports the model over a Unix socket or gRPC to the serving process.
- If the pickle is hostile, the blast radius is the loader, not the full service.
On Linux, seccomp or firejail can enforce this at the kernel level. On macOS, sandbox-exec. In Kubernetes, a sidecar with restrictive securityContext.
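A minimal sketch of the process boundary using only the standard library: fork a child, deserialize there, and return just a summary over a pipe. It is POSIX-only because it relies on fork, and a real deployment would add the privilege dropping and seccomp/network restrictions described above:

```python
import multiprocessing as mp
import pickle

def _load_in_child(path: str, conn) -> None:
    # Runs in the child process: this is where a hostile pickle executes,
    # so in production the child would first drop privileges and apply
    # seccomp / no-network restrictions.
    try:
        with open(path, "rb") as f:
            obj = pickle.load(f)
        conn.send(("ok", repr(type(obj))))
    except Exception as exc:
        conn.send(("error", str(exc)))
    finally:
        conn.close()

def load_isolated(path: str, timeout: float = 10.0) -> str:
    # Deserialize in a separate process so a crash, hang, or exploit is
    # contained in the child rather than the serving process.
    ctx = mp.get_context("fork")  # fork: child inherits code, no re-import
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_load_in_child, args=(path, child_conn))
    proc.start()
    child_conn.close()  # parent keeps only the receiving end
    if parent_conn.poll(timeout):
        status, detail = parent_conn.recv()
    else:
        status, detail = "error", "loader timed out"
    proc.join(timeout=1.0)
    if proc.is_alive():
        proc.kill()
    if status != "ok":
        raise RuntimeError(f"isolated load failed: {detail}")
    return detail
```

A real loader would hand back something usable, for example the model re-exported to one of the safe formats from section 4, rather than a type name; the pipe-and-summary shape here just demonstrates the boundary.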
Putting It Together: A Deployment Checklist
- Artifact built from a digest-pinned Docker image, not a researcher's laptop.
- Every dependency version recorded inside the artifact.
- SHA-256 of the artifact published to a write-restricted manifest.
- Service config holds the expected SHA-256; service refuses to load on mismatch.
- Serving process runs with the minimum privileges required.
- Every prediction is logged with the model's SHA-256 prefix.
- A nightly job samples logged predictions against ground truth and emits a calibration report.
- Any deviation outside thresholds pages a human.
None of this is exotic. It's a direct port of practices that have been standard in server operations for 20 years.
Example: What This Looks Like in Production
At ZenHodl, the model pipeline follows this pattern. Every sport has its own XGBoost ensemble (NFL, NBA, NHL, MLB, tennis, LoL, and others). Each ensemble is retrained weekly in a reproducible Docker environment, serialized with the wrapping class plus its dependency versions, hashed, and pushed to an artifact directory. The serving API loads models by expected SHA-256 from a config the ops team controls; if the hash doesn't match, the service refuses to start.
At inference time, every predicted probability is logged with the 16-character model hash prefix. A nightly calibration job replays the logged predictions against resolved outcomes and emits an Expected Calibration Error per sport per reliability bucket. If ECE drifts above a configurable threshold, the prediction API is rolled back to the previous known-good model hash and an alert fires.
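For reference, Expected Calibration Error over equal-width reliability buckets takes only a few lines to compute. This is a generic sketch, not ZenHodl's actual implementation; the bucket count is a free parameter:

```python
def expected_calibration_error(probs: list[float], outcomes: list[int],
                               n_buckets: int = 10) -> float:
    """ECE: bucket predictions by predicted probability, compare each
    bucket's mean prediction to its observed positive rate, and weight
    the gaps by bucket size."""
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_p - rate)
    return ece
```

A model that predicts 95% confidence on events that resolve positive only 80% of the time contributes a 0.15 gap for that bucket, weighted by how often it lands there.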
The security engineering discipline buys something concrete: we can answer questions like "what version of what model produced this specific prediction 14 days ago" in constant time, and we can prove that the model that produced it wasn't quietly swapped out by anyone. That's the baseline integrity property you want from any decision system — ML or otherwise.
Closing
The common thread across the other articles on this site is that the gap between what software claims to do and what it actually does is wide, and most users never measure it. Cloud sync clients read files outside the sync folder. Messengers route plaintext through servers they ask you to trust. SSDs keep the data you told them to delete.
ML models participate in the same problem. The prediction claims to be well-calibrated. The artifact claims to be the one you trained. The serving pod claims to run the library version you validated. Without verification, those claims are marketing.
The controls above cost a few hours of engineering per deployment and a nightly job that nobody has to look at unless it fails. The alternative is running a decision system you can't audit.