ML Metadata Management: Tracking Lineage End-to-End
You've just deployed a machine learning model to production. Three months later, a regulator asks: "What data trained this model? Who approved it? Can you reproduce it exactly?" If your answer is "um, let me check my notebook," you're in trouble. This is where ML metadata management comes in - and honestly, it's one of the most overlooked yet critical pieces of modern MLOps infrastructure.
The problem is stark: most teams can tell you what their model does, but can't reliably answer how it got there. The training data has drifted. The code has changed. The random seed? Lost somewhere in git history. When something fails or regulators come knocking, you're stuck. That's expensive. That's risky. And it's entirely preventable.
In this article, we're going to build a complete lineage tracking system - the kind that lets you trace a prediction back to its training data and forward to the system that served it. We'll use ML Metadata (MLMD), the Google-backed open-source framework designed for this exact problem.
Table of Contents
- The Real-World Lineage Problem
- The Four Components of ML Lineage
- 1. Data Lineage: Source to Training Dataset
- 2. Experiment Lineage: Dataset + Code → Model
- 3. Deployment Lineage: Model → Endpoint → Predictions
- 4. Audit Trail: Decisions + Evidence
- Introducing ML Metadata (MLMD)
- Building Bit-Exact Reproducibility
- Step 1: Pin All Dependencies
- Step 2: Capture Random Seeds
- Step 3: Record Hardware Configuration
- Step 4: Container Image Pinning
- Step 5: Save Training Metadata
- Lineage Integrated with Serving
- Backward Lineage Queries: "What Data Trained This?"
- Forward Lineage Queries: "What Models Used This Data?"
- Complete End-to-End Lineage Graph
- Compliance and Audit Trails
- Handling the Hard Cases: Lineage in Complex Pipelines
- Branching and Merging Lineage
- Partial Retraining and Transfer Learning
- Incremental Data and Batch vs. Streaming
- Real-World Implementation Tips
- Operational Lineage Queries: Asking the Right Questions
- Query 1: "Is My Production Model Reproducible?"
- Query 2: "What Changed Since Last Deployment?"
- Query 3: "Which Models Are Affected by This Data Change?"
- The Complete Picture
- Summary
- The Organizational Maturity Arc: Where Lineage Tracking Fits
- The Lineage Data Model: Thinking Like a Graph
- Why Lineage Tracking Fails (And How to Prevent It)
- The Hidden Costs of Not Tracking Lineage
- Lessons From Teams That Got It Right
- The Business Case: ROI of Lineage Tracking
- Moving Forward: Your Lineage Implementation Roadmap
The Real-World Lineage Problem
Let's start with a concrete nightmare scenario. Your fraud detection model starts giving weird predictions. You need to know:
- What data trained this model? (Data lineage)
- What code and hyperparameters were used? (Experiment lineage)
- What version is running in production? (Deployment lineage)
- Can you rebuild it from scratch? (Reproducibility)
Without lineage tracking, each question is a detective story. With it, it's a database query.
Here's what lineage actually looks like in practice:
Raw Data (Jan 2024)
↓ (CSV ingestion)
Data Lake (100K transactions)
↓ (Preprocessing: null handling, feature engineering)
Training Dataset v2 (95K cleaned rows)
↓ (XGBoost training, random_seed=42)
Model v1.2.3 (accuracy: 94.2%)
↓ (Containerization, image digest: sha256:abc123)
Production Endpoint (serving 1000 req/sec)
↓ (Prediction on new transaction)
Decision (approved/denied) + Metadata
Each arrow is a relationship. Each node is an artifact. Together, they form your audit trail.
The Four Components of ML Lineage
Let me break down what "lineage" actually means in the ML context. It's not just one thing - it's four interconnected layers. Think of lineage as the genealogy of your ML system: every ancestor, every transformation, every relationship documented for all time.
1. Data Lineage: Source to Training Dataset
Data lineage answers: "Where did this dataset come from, and what was done to it?" This is the genealogy of your data - tracing it backward through every preprocessing step back to raw sources.
You typically have:
- Raw data sources: Production databases, APIs, data lakes, CSVs, streaming sources
- Ingestion transformations: Schema validation, deduplication, encoding changes, time-based sampling
- Feature engineering: Null imputation, scaling, normalization, feature creation, feature selection
- Data quality checks: Validation gates, drift detection, outlier flagging
- Training dataset: The final, versioned artifact that actually trained your model
Each step transforms the data. Each step is a potential failure point. Without tracking, you can't ask "did we apply the same preprocessing last time?" or "which raw data version was in production when the model started failing?" or "if I fix a bug in preprocessing, which models need retraining?"
The stakes are high. A single preprocessing change - say, changing how you handle missing values from mean-imputation to forward-fill - can ripple through dozens of models. With data lineage, you can answer "how many models are affected?" instantly. Without it, you're guessing.
Data lineage also supports compliance. GDPR right-to-deletion? Trace which datasets contain specific user data, then find all models trained on those datasets, then schedule them for retraining with sanitized data. Without lineage, that's a nightmare. With it, it's a query.
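The deletion-impact query described above can be sketched as a simple graph traversal. This is a minimal, self-contained illustration - the artifact names and the `DOWNSTREAM` edge map are hypothetical; in a real system these edges would be reconstructed from your metadata store's recorded events.

```python
# Hypothetical lineage edges: artifact -> artifacts derived from it.
# In practice, these come from your metadata store (e.g. MLMD events).
DOWNSTREAM = {
    "raw/users-2024-01": ["training-dataset-v2"],
    "training-dataset-v2": ["model-v1.2.3", "model-v1.2.4"],
}

def affected_by_deletion(artifact, graph=DOWNSTREAM):
    """Return every downstream artifact (datasets and models) that must be
    rebuilt or retrained if `artifact` has user records removed."""
    affected, stack = set(), [artifact]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

print(sorted(affected_by_deletion("raw/users-2024-01")))
# ['model-v1.2.3', 'model-v1.2.4', 'training-dataset-v2']
```

With the edges in place, GDPR right-to-deletion really is "a query": one traversal gives you the full retraining worklist.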
2. Experiment Lineage: Dataset + Code → Model
Experiment lineage captures: "Given this specific dataset, code version, and hyperparameters, what model did we produce?" This is the recipe that created your model - every ingredient documented.
Key elements:
- Training dataset reference: Exact version/hash (not just "latest")
- Code commit: The exact training script SHA, plus any uncommitted changes
- Hyperparameters: Learning rate, batch size, tree depth, regularization, dropout, optimizer settings
- Random seeds: Python, NumPy, PyTorch, CUDA (yes, all of them - they all affect randomness)
- Hardware config: GPU type, memory, CPU specs, CUDA compute capability
- Training metrics: Validation accuracy, F1 score, precision-recall curves, loss trajectory
- Model artifact: Weights file, training configuration, ONNX export, inference dependencies
- Training duration: Wall-clock time, epoch count, convergence point
- Data splits: Train/val/test ratios, random seed for splits, cross-validation strategy
The magic here is reproducibility. If you capture all these, you can rebuild this exact model from scratch. Not a similar model - the exact one. This matters for compliance (regulators want proof you can rebuild), for debugging (was the randomness the problem or the data?), and for scale (someone else on your team can take your recipe and retrain without guessing).
Many teams skip this thinking "we'll just retrain when needed." But retraining is expensive. If you can prove the old model is fine by showing its lineage, you save computational resources and time-to-decision.
3. Deployment Lineage: Model → Endpoint → Predictions
Deployment lineage answers: "Which model version is running where, and how is it performing?" This is the operational tracking - where your model lives and how it behaves in the wild.
You need to track:
- Model version: Specific build identifier, training run ID, git tag
- Container image: Exact digest (not just "latest" - digests are immutable)
- Endpoint configuration: Batch size, timeout, inference library version, quantization settings
- Traffic routing: Which version serves what percentage (canary: 5% v2, 95% v1)
- Prediction metadata: For each prediction, store which model/version made it, latency, confidence
- Model performance metrics: Real-time accuracy, latency percentiles, error rates by segment
- Serving environment: Kubernetes cluster, inference engine (TensorFlow Serving, Triton, KServe)
- Dependencies: CUDA version, cuDNN version, Python runtime version for serving
This is often forgotten, but it's critical. If a prediction goes wrong, you need to know which model made it - and be able to replay that request against earlier versions to see if v1.2.2 would have gotten it right. With deployment lineage, you can instrument your serving layer to log the model version with every prediction. Later, when someone reports an error, you instantly know which model to investigate.
Deployment lineage also enables canary deployments with automatic rollback. Deploy v1.2.4 to 5% of traffic. Monitor its prediction quality compared to v1.2.3. If quality drops, roll back automatically. All decision-making is backed by lineage data.
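The canary decision described above boils down to comparing metrics and applying thresholds. Here's a minimal sketch; the metric names (`accuracy`, `p99_latency_ms`) and the thresholds are illustrative assumptions, not fixed conventions.

```python
def should_rollback(baseline_metrics, canary_metrics,
                    max_accuracy_drop=0.01, max_latency_ratio=1.5):
    """Roll back if the canary's accuracy drops too far below the baseline,
    or its p99 latency blows past the allowed ratio."""
    accuracy_drop = baseline_metrics["accuracy"] - canary_metrics["accuracy"]
    latency_ratio = canary_metrics["p99_latency_ms"] / baseline_metrics["p99_latency_ms"]
    return accuracy_drop > max_accuracy_drop or latency_ratio > max_latency_ratio

baseline = {"accuracy": 0.942, "p99_latency_ms": 40.0}  # v1.2.3, 95% of traffic
canary   = {"accuracy": 0.925, "p99_latency_ms": 43.0}  # v1.2.4, 5% of traffic

print(should_rollback(baseline, canary))  # True: accuracy dropped ~1.7 points
```

The lineage part is what makes this actionable: because every prediction is tagged with its model version, the per-version metrics above can be computed directly from serving logs.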
4. Audit Trail: Decisions + Evidence
For regulated industries (finance, healthcare, AI systems under EU AI Act), you need:
- Immutable decision log: Who decided to deploy this model, when, why
- Input hashing: Hash of the input that led to a prediction (for privacy + audit)
- Retention policy: How long to keep training data, predictions, artifacts
- Compliance markers: What regulation does this satisfy (SR 11-7, GDPR, EU AI Act)
Introducing ML Metadata (MLMD)
Google's ML Metadata is the open-source framework that makes all this practical. It's a library + backend storage system designed specifically for tracking ML artifacts and their relationships.
Think of it as a specialized database for lineage. Instead of shoehorning your lineage into PostgreSQL, you use MLMD, which understands:
- Artifacts: Datasets, models, metrics, images (the "things")
- Executions: Training runs, preprocessing jobs, deployments (the "processes")
- Contexts: Experiments, pipelines, releases (grouping logic)
- Events: Which artifact was produced/consumed by which execution
Let's visualize the MLMD data model:
graph TB
subgraph "MLMD Data Model"
A["Artifact<br/>(Dataset, Model, Metrics)"]
E["Execution<br/>(Training, Preprocessing)"]
C["Context<br/>(Experiment, Pipeline)"]
EV["Event<br/>(INPUT/OUTPUT)"]
end
A -->|produced_by| E
E -->|consumes| A
A -->|part_of| C
E -->|part_of| C
E -->|links| EV
A -->|links| EV
style A fill:#e1f5ff
style E fill:#f3e5f5
style C fill:#e8f5e9
style EV fill:#fff3e0
Here's a practical example. Let's say we're building a fraud detection pipeline:
import time

import ml_metadata as mlmd
from ml_metadata.proto import metadata_store_pb2
# Connect to metadata store (SQLite for demo; MySQL or PostgreSQL in prod)
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "file:///tmp/mlmd_store.db"
store = mlmd.MetadataStore(config)
# 1. Register raw data artifact
raw_data = metadata_store_pb2.Artifact()
raw_data.type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="RawData")
)
raw_data.uri = "s3://data-lake/fraud/2024-01/raw/"
raw_data.properties["source"] = metadata_store_pb2.Value(string_value="production_db")
raw_data.properties["record_count"] = metadata_store_pb2.Value(int_value=1000000)
raw_data.properties["bytes"] = metadata_store_pb2.Value(int_value=5000000000)
raw_artifact_id = store.put_artifacts([raw_data])[0]
print(f"Registered raw data artifact: {raw_artifact_id}")
# 2. Register preprocessing execution
preprocessing = metadata_store_pb2.Execution()
preprocessing.type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="DataPreprocessing")
)
preprocessing.properties["image"] = metadata_store_pb2.Value(
string_value="gcr.io/myorg/preprocessing:v1.2.3@sha256:abc123"
)
preprocessing.properties["timeout_sec"] = metadata_store_pb2.Value(int_value=3600)
preprocess_exec_id = store.put_executions([preprocessing])[0]
# 3. Link raw data → preprocessing (input)
input_event = metadata_store_pb2.Event()
input_event.artifact_id = raw_artifact_id
input_event.execution_id = preprocess_exec_id
input_event.type = metadata_store_pb2.Event.INPUT
input_event.milliseconds_since_epoch = int(time.time() * 1000)
# 4. Register training dataset (output)
training_data = metadata_store_pb2.Artifact()
training_data.type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="TrainingDataset")
)
training_data.uri = "s3://data-lake/fraud/2024-01/training/"
training_data.properties["record_count"] = metadata_store_pb2.Value(int_value=950000)
training_data.properties["version"] = metadata_store_pb2.Value(string_value="v2.0")
training_data.custom_properties["data_hash"] = metadata_store_pb2.Value(
string_value="sha256:def456"
)
training_artifact_id = store.put_artifacts([training_data])[0]
# 5. Link preprocessing → training dataset (output)
output_event = metadata_store_pb2.Event()
output_event.artifact_id = training_artifact_id
output_event.execution_id = preprocess_exec_id
output_event.type = metadata_store_pb2.Event.OUTPUT
output_event.milliseconds_since_epoch = int(time.time() * 1000)
store.put_events([input_event, output_event])
print(f"Linked preprocessing execution: {preprocess_exec_id}")
# 6. Now query: what raw data was used for this training dataset?
executions = store.get_executions_by_id([preprocess_exec_id])
for execution in executions:
events = store.get_events_by_execution_ids([execution.id])
for event in events:
if event.type == metadata_store_pb2.Event.INPUT:
artifact = store.get_artifacts_by_id([event.artifact_id])[0]
            print(f"Training dataset sourced from: {artifact.uri}")
This is the foundation. You're building a queryable, immutable record of what happened. The power emerges when you realize you can ask any question about your pipeline: which models depend on this data? Can we rebuild that model from scratch? Who approved the production deployment, and when?
Understanding the MLMD data model is crucial. Artifacts are the "nouns" (data, models), executions are the "verbs" (training, preprocessing), contexts are the "groupings" (experiments), and events are the "relationships." Everything flows through events - they're the edges in your lineage graph.
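To make the nouns/verbs/edges framing concrete, here's a toy in-memory analogue of that data model. This is purely illustrative - MLMD stores these as protos in a database - but the traversal pattern is the same.

```python
from dataclasses import dataclass

@dataclass
class Artifact:          # the "nouns": datasets, models, metrics
    id: int
    type: str
    uri: str

@dataclass
class Execution:         # the "verbs": training, preprocessing
    id: int
    type: str

@dataclass
class Event:             # the edges: INPUT or OUTPUT relationships
    artifact_id: int
    execution_id: int
    kind: str            # "INPUT" or "OUTPUT"

# Hypothetical artifacts and a training run (URIs are illustrative)
dataset = Artifact(1, "TrainingDataset", "s3://data-lake/fraud/2024-01/training/")
model = Artifact(2, "Model", "s3://models/fraud/v1.2.3/")
training = Execution(10, "Training")
events = [Event(1, 10, "INPUT"), Event(2, 10, "OUTPUT")]

# "What did execution 10 consume?" - walk the edges
inputs = [e.artifact_id for e in events if e.execution_id == 10 and e.kind == "INPUT"]
print(inputs)  # [1]
```

Every lineage query in the rest of this article is some variant of that last line: filter events by execution or artifact, then follow the edge.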
Building Bit-Exact Reproducibility
Here's where lineage gets serious: reproducibility. Can you retrain your model and get the exact same weights?
The answer is usually "no" - unless you track everything. Let's build a reproducibility configuration that actually works.
Step 1: Pin All Dependencies
# Create a reproducible requirements.txt using pip-tools
pip install pip-tools
# requirements.in (what we actually use)
torch==2.0.1
numpy==1.24.3
scikit-learn==1.3.0
xgboost==2.0.2
pandas==2.0.3
# Generate exact transitive dependencies
pip-compile requirements.in --generate-hashes
# requirements.txt now has exact versions + hashes:
# torch==2.0.1 \
# --hash=sha256:abc123... \
#     --hash=sha256:def456...
Step 2: Capture Random Seeds
Python's randomness is... complicated. You need to seed everything:
import os
import random
import numpy as np
import torch
def set_seed(seed: int):
"""Set ALL random seeds for bit-exact reproducibility."""
# Python's random module
random.seed(seed)
# NumPy's random
np.random.seed(seed)
# PyTorch CPU
torch.manual_seed(seed)
# PyTorch GPU
torch.cuda.manual_seed_all(seed)
# CuDNN determinism (slower but reproducible)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Environment variable for external tools
os.environ['PYTHONHASHSEED'] = str(seed)
# In your training script
RANDOM_SEED = 42
set_seed(RANDOM_SEED)
# Log this to your metadata store
training_execution.properties["random_seed"] = \
    metadata_store_pb2.Value(int_value=RANDOM_SEED)
Step 3: Record Hardware Configuration
Different hardware = different results (due to floating point precision, threading, CUDA versions):
import platform
import torch
import psutil
hardware_config = {
"python_version": platform.python_version(),
"torch_version": torch.__version__,
"cuda_version": torch.version.cuda,
"cudnn_version": torch.backends.cudnn.version(),
"gpu_type": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
"gpu_count": torch.cuda.device_count(),
"cpu_count": psutil.cpu_count(),
"ram_gb": psutil.virtual_memory().total / (1024**3),
}
# Serialize and store
import json
config_json = json.dumps(hardware_config, indent=2)
training_execution.properties["hardware_config"] = \
    metadata_store_pb2.Value(string_value=config_json)
Step 4: Container Image Pinning
Never use "latest". Always pin the exact digest:
# Dockerfile
FROM python:3.10-slim@sha256:abc123def456...
COPY requirements.txt .
RUN pip install --require-hashes -r requirements.txt
COPY train.py .
ENTRYPOINT ["python", "train.py"]
# Build and push
docker build -t fraud-trainer:latest .
docker push myregistry.azurecr.io/fraud-trainer:latest
# Get the exact digest
docker inspect myregistry.azurecr.io/fraud-trainer:latest | jq -r '.[0].RepoDigests[0]'
# Output: myregistry.azurecr.io/fraud-trainer@sha256:xyz789...
# Log it
echo "sha256:xyz789..." > image_digest.txt
Step 5: Save Training Metadata
import hashlib
import json
import os
# After training, save everything needed to reproduce
reproducibility_bundle = {
"random_seed": RANDOM_SEED,
"hardware_config": hardware_config,
"requirements": open("requirements.txt").read(),
"training_script_hash": hashlib.sha256(
open("train.py", "rb").read()
).hexdigest(),
"config_yaml": open("config.yaml").read(),
"git_commit": os.popen("git rev-parse HEAD").read().strip(),
"git_dirty": os.popen("git status --porcelain").read().strip() != "",
}
# Store in metadata
model_execution.properties["reproducibility_bundle"] = \
metadata_store_pb2.Value(string_value=json.dumps(reproducibility_bundle))
store.put_executions([model_execution])
Now, if someone asks "can you rebuild model v1.2.3?", the answer is yes - because you have the exact recipe.
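A useful follow-up is a verification step: before promising a rebuild, check that the current checkout still matches what the bundle recorded. A minimal sketch, assuming the bundle layout above (the script path is a configurable assumption):

```python
import hashlib

def verify_bundle(bundle: dict, script_path: str = "train.py"):
    """Compare the live training script against the hash recorded at training
    time; any mismatch means a bit-exact rebuild is not guaranteed."""
    with open(script_path, "rb") as f:
        current_hash = hashlib.sha256(f.read()).hexdigest()
    problems = []
    if current_hash != bundle["training_script_hash"]:
        problems.append("train.py has changed since training")
    if bundle.get("git_dirty"):
        problems.append("training ran with uncommitted changes")
    return len(problems) == 0, problems
```

Run this as a gate before any "retrain to reproduce" job: if it reports problems, you know up front that what you rebuild won't be the model you're auditing.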
Lineage Integrated with Serving
Here's where it gets practical: connecting predictions back to their lineage.
When your model serves a prediction, include lineage metadata in the response:
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Dict
import uuid

app = FastAPI()

class PredictionResponse(BaseModel):
    """Response schema carrying lineage metadata alongside the prediction."""
    prediction: float
    model_version: str
    dataset_version: str
    lineage_trace_id: str
@app.post("/predict")
async def predict(request: Dict) -> PredictionResponse:
# Load model from registry with version
model_artifact = store.get_artifacts_by_id([MODEL_ARTIFACT_ID])[0]
model_version = model_artifact.properties["version"].string_value
# Get training dataset this model used
model_exec = store.get_executions_by_id([MODEL_EXECUTION_ID])[0]
training_events = store.get_events_by_execution_ids([model_exec.id])
training_dataset = None
for event in training_events:
if event.type == metadata_store_pb2.Event.INPUT:
training_dataset = store.get_artifacts_by_id([event.artifact_id])[0]
break
dataset_version = training_dataset.properties["version"].string_value if training_dataset else "unknown"
# Make prediction
prediction = model.predict(request)
# Create unique trace ID for audit
trace_id = str(uuid.uuid4())
return PredictionResponse(
prediction=prediction,
model_version=model_version,
dataset_version=dataset_version,
lineage_trace_id=trace_id
    )
Include these headers in the HTTP response:
headers = {
"X-Model-Version": model_version,
"X-Dataset-Version": dataset_version,
"X-Lineage-Trace-Id": trace_id,
"X-Reproducible": "true", # Only if all reproducibility checks passed
}
Backward Lineage Queries: "What Data Trained This?"
Now for the powerful part: querying backward. Given a model, what data trained it?
def get_training_data_for_model(model_artifact_id):
"""Trace backwards: model → execution → training data."""
# Find executions that produced this model
executions = store.get_executions()
model_training_execs = []
for execution in executions:
events = store.get_events_by_execution_ids([execution.id])
for event in events:
if (event.type == metadata_store_pb2.Event.OUTPUT and
event.artifact_id == model_artifact_id):
model_training_execs.append(execution)
# For each training execution, find its inputs (training dataset)
lineage_chain = []
for exec in model_training_execs:
events = store.get_events_by_execution_ids([exec.id])
for event in events:
if event.type == metadata_store_pb2.Event.INPUT:
artifact = store.get_artifacts_by_id([event.artifact_id])[0]
lineage_chain.append({
"artifact_id": artifact.id,
"artifact_type": artifact.type,
"uri": artifact.uri,
                    "version": artifact.properties["version"].string_value if "version" in artifact.properties else "unknown",
})
return lineage_chain
# Usage
model_id = 123
lineage = get_training_data_for_model(model_id)
for item in lineage:
print(f"Model trained on: {item['artifact_type']} v{item['version']}")
    print(f"  Location: {item['uri']}")
Output:
Model trained on: TrainingDataset v2.0
Location: s3://data-lake/fraud/2024-01/training/
Forward Lineage Queries: "What Models Used This Data?"
Equally important: given raw data, what models depend on it?
def get_models_trained_on_dataset(dataset_artifact_id):
"""Trace forward: dataset → preprocessing → training → model."""
# Find executions that consumed this dataset
executions = store.get_executions()
consuming_execs = []
for execution in executions:
events = store.get_events_by_execution_ids([execution.id])
for event in events:
if (event.type == metadata_store_pb2.Event.INPUT and
event.artifact_id == dataset_artifact_id):
consuming_execs.append(execution)
# For each consuming execution, find its outputs (models)
affected_models = []
for exec in consuming_execs:
events = store.get_events_by_execution_ids([exec.id])
for event in events:
if event.type == metadata_store_pb2.Event.OUTPUT:
artifact = store.get_artifacts_by_id([event.artifact_id])[0]
if artifact.type == "Model": # Only models
affected_models.append({
"model_id": artifact.id,
                    "version": artifact.properties["version"].string_value if "version" in artifact.properties else "unknown",
"uri": artifact.uri,
})
return affected_models
# Usage
dataset_id = 456
models = get_models_trained_on_dataset(dataset_id)
print(f"If we change this dataset, these {len(models)} models are affected:")
for model in models:
    print(f"  - Model {model['version']} (id={model['model_id']})")
Output:
If we change this dataset, these 3 models are affected:
- Model v1.2.3 (id=789)
- Model v1.2.4 (id=790)
- Model v2.0.0 (id=791)
This is gold for impact analysis. Before you modify training data, you know exactly what breaks.
Complete End-to-End Lineage Graph
Here's the full picture - from raw data ingestion to serving predictions:
graph LR
A["Raw Data<br/>S3 Bucket<br/>1M records"] -->|ingest| B["Data Lake<br/>Deduplication<br/>Schema Validation"]
B -->|preprocess| C["Training Dataset<br/>v2.0<br/>950K rows"]
C -->|train<br/>seed=42| D["Model v1.2.3<br/>XGBoost<br/>94.2% acc"]
D -->|containerize| E["Production Image<br/>sha256:abc123<br/>serving:v1"]
E -->|deploy| F["Live Endpoint<br/>100 req/min"]
F -->|predict| G["Decision + Metadata<br/>lineage_trace_id<br/>model_version<br/>dataset_version"]
H["Config + Seeds<br/>requirements.txt<br/>hardware.json"] -.->|reproducibility| C
I["Audit Log<br/>Approval<br/>Timestamp<br/>User"] -.->|compliance| E
J["Metrics Store<br/>Validation Accuracy<br/>Feature Importance<br/>Drift Score"] -.->|monitoring| D
style A fill:#e3f2fd
style C fill:#f3e5f5
style D fill:#fce4ec
style E fill:#fff3e0
style F fill:#e0f2f1
style G fill:#f1f8e9
style H fill:#ede7f6
style I fill:#fbe9e7
style J fill:#e0f2f1
Each node is a versioned artifact. Each arrow is a tracked relationship. The dashed lines show the supporting metadata.
Compliance and Audit Trails
For regulated industries, you need immutable audit evidence. Here's how MLMD supports compliance:
# When deploying to production, log approval
from datetime import datetime

approval_context = metadata_store_pb2.Context()
approval_context.type_id = store.put_context_type(
    metadata_store_pb2.ContextType(name="ModelApproval")
)
approval_context.name = "fraud-model-v1.2.3-approval"
approval_context.properties["approver"] = metadata_store_pb2.Value(
string_value="alice@company.com"
)
approval_context.properties["approval_timestamp"] = metadata_store_pb2.Value(
string_value=datetime.now().isoformat()
)
approval_context.properties["approval_reason"] = metadata_store_pb2.Value(
string_value="Validation F1=0.92, production A/B test passed"
)
approval_context.properties["regulatory_framework"] = metadata_store_pb2.Value(
string_value="SR 11-7, EU AI Act Article 6"
)
context_id = store.put_contexts([approval_context])[0]
# Link model to approval context
# (MLMD will track who, when, why the model was approved)
For EU AI Act compliance, you also need to justify decisions:
# Log the input hash (for audit without storing PII)
prediction_input_hash = hashlib.sha256(
json.dumps(request, sort_keys=True).encode()
).hexdigest()
# Store immutable audit record
audit_execution = metadata_store_pb2.Execution()
audit_execution.type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="PredictionLogging")
)
audit_execution.properties["input_hash"] = metadata_store_pb2.Value(
string_value=prediction_input_hash
)
audit_execution.properties["prediction"] = metadata_store_pb2.Value(
string_value=str(prediction)
)
audit_execution.properties["timestamp"] = metadata_store_pb2.Value(
string_value=datetime.now().isoformat()
)
audit_execution.properties["model_version"] = metadata_store_pb2.Value(
string_value=model_version
)
store.put_executions([audit_execution])
Handling the Hard Cases: Lineage in Complex Pipelines
The examples so far have been linear: data → preprocessing → training → model → serving. Real life is messier.
Branching and Merging Lineage
What if you have feature-specific preprocessing? Data flows like a git tree:
RawData (Jan 2024)
├─ Feature Branch A (location-based features)
│ └─ LocationFeatures v1 → Model-A v1.0
│
└─ Feature Branch B (behavioral features)
└─ BehavioralFeatures v1 → Model-B v1.0
Later: Model-A and Model-B merge via ensemble:
EnsembleModel v1.0 uses both lineages
In MLMD, this is represented as a Context (experiment) with multiple input artifacts. The ensemble knows its parents, and you can query "which raw data versions does EnsembleModel v1.0 depend on?" and get back both branches.
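That merged-lineage query can be sketched as a union of ancestor sets. The `PARENTS` map below is hypothetical; in MLMD you would reconstruct it from INPUT/OUTPUT events rather than hand-write it.

```python
# Hypothetical parent edges: artifact -> the artifacts it was derived from.
PARENTS = {
    "EnsembleModel-v1.0": ["Model-A-v1.0", "Model-B-v1.0"],
    "Model-A-v1.0": ["LocationFeatures-v1"],
    "Model-B-v1.0": ["BehavioralFeatures-v1"],
    "LocationFeatures-v1": ["RawData-2024-01"],
    "BehavioralFeatures-v1": ["RawData-2024-01"],
}

def raw_ancestors(node, parents=PARENTS):
    """All leaf ancestors (artifacts with no recorded parents) of `node` -
    for the ensemble, the union across both branches."""
    if node not in parents:
        return {node}
    leaves = set()
    for parent in parents[node]:
        leaves |= raw_ancestors(parent, parents)
    return leaves

print(raw_ancestors("EnsembleModel-v1.0"))  # {'RawData-2024-01'}
```

Both branches resolve to the same raw data here, but the same traversal works unchanged when each branch starts from a different source.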
Partial Retraining and Transfer Learning
What if you only retrain part of your pipeline? Say you update preprocessing, but reuse the trained model weights as initialization (transfer learning):
OldModel v1.0 (trained on OldDataset v1)
↓ (weights as initialization)
+
OldPreprocessing v1 (old feature engineering)
│
└─ NewDataset v1 (same raw data, new preprocessing)
↓ (fine-tuning with transfer learning)
NewModel v1.1 (better, but descended from v1.0)
Your lineage captures: "NewModel v1.1 depends on both NewDataset v1 AND the weights from OldModel v1.0." This is crucial for understanding model drift and reproducibility.
Incremental Data and Batch vs. Streaming
What if your training dataset gets new data daily? Classic batch retraining:
Day 1: Train on DataBatch-2024-01-01 → Model v1.0
Day 2: Accumulate DataBatch-2024-01-02, retrain on combined
Train on DataBatch-2024-01-01 + DataBatch-2024-01-02 → Model v1.1
Day 3: Add DataBatch-2024-01-03
Train on Batches 01-01 + 01-02 + 01-03 → Model v1.2
Each model version has a different lineage pointing to different data batches. This lets you answer: "Model v1.2 failed; was it the new data in batch 03, or something else?" Retrain without batch 03 and see if quality recovers.
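That batch-isolation question is just a set difference over each model version's recorded batches. A minimal sketch with illustrative batch names:

```python
# Hypothetical lineage: which data batches each model version trained on.
MODEL_BATCHES = {
    "v1.0": {"2024-01-01"},
    "v1.1": {"2024-01-01", "2024-01-02"},
    "v1.2": {"2024-01-01", "2024-01-02", "2024-01-03"},
}

def new_batches(old_version, new_version, lineage=MODEL_BATCHES):
    """Batches in the new model's training set but not the old one's -
    the first suspects when the new model misbehaves."""
    return lineage[new_version] - lineage[old_version]

print(sorted(new_batches("v1.1", "v1.2")))  # ['2024-01-03']
```

If v1.2 fails and v1.1 was fine, this diff hands you the retraining experiment directly: rebuild without batch 2024-01-03 and compare.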
For streaming pipelines, the picture is different: you're not retraining, you're continuously updating. But lineage still matters - you need to know which "snapshot" of the stream trained which model version.
Real-World Implementation Tips
Here are the lessons from teams that actually do this well:
1. Use Managed Backends in Production
# SQLite works for testing - use Cloud SQL (MySQL) or PostgreSQL in prod
config = metadata_store_pb2.ConnectionConfig()
config.mysql.database = "mlmd_prod"
config.mysql.host = "cloudsql-proxy"
config.mysql.port = 3306
2. Create a Lineage SDK
Don't call MLMD directly everywhere. Wrap it:
class LineageTracker:
def __init__(self, store):
self.store = store
def log_dataset(self, uri, version, record_count, data_hash):
# Encapsulate MLMD complexity
pass
def log_training(self, dataset_id, model_path, hyperparams, seed):
# One call instead of multiple put_* operations
pass
def get_model_lineage(self, model_id):
# Hide query complexity
        pass
3. Integrate with Your MLOps Pipeline
# Kubeflow or Airflow DAG
- step: preprocess
outputs:
- dataset_artifact: training_data_v2
- step: train
inputs:
- training_data_v2
outputs:
- model_artifact: fraud_model_v1.2.3
lineage:
seed: 42
hyperparams: config.yaml
- step: evaluate
inputs:
- fraud_model_v1.2.3
outputs:
    - metrics_artifact: metrics_v1.2.3
4. Query Lineage Regularly
# Health check: are all serving models tracked?
serving_models = get_serving_models()
tracked_models = store.get_artifacts_by_type("Model")
untracked = set(serving_models) - set(tracked_models)
if untracked:
    alert(f"Untracked models in production: {sorted(untracked)}")
Operational Lineage Queries: Asking the Right Questions
So you have lineage in your system. Now what? Here are the queries you'll actually run, day-to-day.
Query 1: "Is My Production Model Reproducible?"
def validate_reproducibility(model_artifact_id):
"""Check if we can rebuild this model from scratch."""
model = store.get_artifacts_by_id([model_artifact_id])[0]
# Find training execution
execs = store.get_executions()
training_exec = None
for exec in execs:
events = store.get_events_by_execution_ids([exec.id])
for event in events:
if event.type == metadata_store_pb2.Event.OUTPUT and event.artifact_id == model_artifact_id:
training_exec = exec
break
if not training_exec:
return False, "No training execution found"
# Check required reproducibility fields
required_fields = [
"random_seed",
"hardware_config",
"reproducibility_bundle",
"training_script_hash"
]
missing = []
for field in required_fields:
if field not in training_exec.properties:
missing.append(field)
if missing:
return False, f"Missing fields: {missing}"
return True, "Model is reproducible ✓"
# Usage
model_id = 123
reproducible, reason = validate_reproducibility(model_id)
if not reproducible:
    alert(f"Production model {model_id} is NOT reproducible! {reason}")
This is your health check. Run it weekly. If it fails, you're exposed to compliance risk.
Query 2: "What Changed Since Last Deployment?"
def get_lineage_diff(old_model_id, new_model_id):
"""Compare what's different between two model versions."""
old_model = store.get_artifacts_by_id([old_model_id])[0]
new_model = store.get_artifacts_by_id([new_model_id])[0]
# Get training datasets for each
def get_training_dataset(model_id):
execs = store.get_executions()
for exec in execs:
events = store.get_events_by_execution_ids([exec.id])
for event in events:
if event.type == metadata_store_pb2.Event.OUTPUT and event.artifact_id == model_id:
for e in events:
if e.type == metadata_store_pb2.Event.INPUT:
return store.get_artifacts_by_id([e.artifact_id])[0]
return None
old_dataset = get_training_dataset(old_model_id)
new_dataset = get_training_dataset(new_model_id)
changes = {
        "dataset_changed": old_dataset.properties["version"].string_value != new_dataset.properties["version"].string_value,
        "data_records_diff": new_dataset.properties["record_count"].int_value - old_dataset.properties["record_count"].int_value,
        "code_changed": old_model.custom_properties.get("training_script_hash") != new_model.custom_properties.get("training_script_hash"),
}
return changes
# Usage
changes = get_lineage_diff(old_model_id=123, new_model_id=124)
print("Differences between v1.2.3 and v1.2.4:")
print(f" Dataset updated: {changes['dataset_changed']}")
print(f" New records: {changes['data_records_diff']}")
print(f"  Code changed: {changes['code_changed']}")
This tells you why a new model version is different. Code change? Data change? Both? This drives your testing strategy. If only data changed, you might not need code review. If code changed, you need to understand the impact.
Query 3: "Which Models Are Affected by This Data Change?"
def impact_analysis(dataset_artifact_id):
    """Given a dataset change, show all downstream models."""
    # Resolve the "Model" artifact type once so we can match by type_id
    model_type_id = store.get_artifact_type("Model").id

    # Find all executions that consumed this dataset
    consuming_execs = []
    for execution in store.get_executions():
        events = store.get_events_by_execution_ids([execution.id])
        if any(
            e.type == metadata_store_pb2.Event.INPUT
            and e.artifact_id == dataset_artifact_id
            for e in events
        ):
            consuming_execs.append(execution)

    # For each consuming execution, collect the output models
    affected_models = set()
    for execution in consuming_execs:
        events = store.get_events_by_execution_ids([execution.id])
        for event in events:
            if event.type == metadata_store_pb2.Event.OUTPUT:
                artifact = store.get_artifacts_by_id([event.artifact_id])[0]
                if artifact.type_id == model_type_id and "version" in artifact.properties:
                    affected_models.add(artifact.properties["version"].string_value)
    return affected_models

# Usage: before pushing a data fix
affected = impact_analysis(dataset_artifact_id=456)
print(f"WARNING: Changing this dataset will affect {len(affected)} models:")
for model in sorted(affected):
    print(f"  - {model}")

# Decision tree:
if len(affected) < 3:
    print("Impact is small. Safe to push.")
else:
    print("Impact is large. Coordinate with teams owning these models.")

This is your safety net. Before you change data, you know exactly what breaks. No surprises in production.
The Complete Picture
Let me tie this together with one concrete example: a fraud detection model lifecycle.
1. Raw data ingestion (Jan 2024)
   - 1M transactions from production database
   - Logged as RawData artifact with source lineage
2. Data preprocessing (Jan 5)
   - Execution: null imputation, encoding, scaling
   - Input: RawData artifact
   - Output: TrainingDataset v2.0 (950K rows, hash=def456)
3. Model training (Jan 7)
   - Execution: XGBoost with seed=42
   - Input: TrainingDataset v2.0
   - Hardware: A100 GPU, CUDA 11.8
   - Output: Model v1.2.3 (94.2% validation F1)
4. Containerization (Jan 8)
   - Image: myregistry.azurecr.io/fraud-trainer@sha256:xyz789
   - Requirements: requirements.txt with hashes
   - Reproducibility bundle: stored in MLMD
5. Production deployment (Jan 10)
   - Approval context: alice@company.com approved
   - Endpoint: serving 1000 req/sec
   - Response headers include model_version, dataset_version
6. Live prediction (Jan 15, 2:47 PM)
   - Input: transaction for customer 12345
   - Lineage trace: trace_id=abc-123
   - Output: "Approved" + headers showing model v1.2.3, dataset v2.0
   - Audit log: input_hash + prediction stored immutably
7. Six months later: regulatory audit (July 2024)
   - Regulator: "What data trained this model?"
   - Query: get_training_data_for_model(model_v1.2.3) answers: TrainingDataset v2.0 from RawData (Jan 2024)
   - Reproducibility check: all seeds, versions, hardware documented ✓
You have a complete chain of evidence.
Summary
ML metadata management isn't optional anymore. If you're deploying models to production, you need lineage tracking. Full stop.
The payoff is massive:
- Reproducibility: Rebuild any model bit-for-bit
- Debugging: Know exactly what broke and why
- Compliance: Audit trail for regulators
- Impact analysis: Change data, instantly see which models are affected
- Collaboration: Everyone knows the ground truth
ML Metadata (MLMD) is battle-tested, open-source, and designed exactly for this. It integrates with Kubeflow, Airflow, and custom pipelines. Start small - track one pipeline end-to-end, build your lineage SDK, integrate serving - and you'll never be stuck wondering how a model got there again.
The best time to implement lineage was when you deployed your first model. The second-best time is today.
The Organizational Maturity Arc: Where Lineage Tracking Fits
Lineage tracking might sound like something only mature organizations need. Actually, the sooner you implement it, the better. Early adoption saves you from painful retrofits later.
Think about the arc of an ML organization. You start with a single data scientist training models locally. Lineage seems unnecessary - you know where your data comes from. You remember the random seed. Fast forward one year: you have three data scientists, ten models in production, and two instances where you deployed a bad model that you didn't notice for days. Suddenly, lineage tracking would have prevented those problems.
Fast forward three years: you have a team of thirty, fifty models in production, and regulatory pressure to be able to explain every decision. You desperately need lineage, but retrofitting it is painful. Your old models don't have metadata. Your old pipelines didn't log executions. You're stuck manually reconstructing lineage for anything built before you implemented tracking.
The smart move is to start small. With your second or third model, implement MLMD. Build the discipline early. It costs a bit more upfront, but it prevents massive pain later. By the time you're a mature organization under regulatory scrutiny, lineage is automatic - you've been doing it for years.
This is what we mean by "infrastructure first." Infrastructure that scales gracefully doesn't get built on the fly in response to crisis. It gets built early, intentionally, with the right abstractions. Lineage tracking is exactly that kind of infrastructure.
The Lineage Data Model: Thinking Like a Graph
The most important insight in MLMD is that lineage is fundamentally a graph problem. Artifacts are nodes. Executions are nodes. Events are edges. Everything connects to everything else. Understanding this model is the key to leveraging MLMD effectively.
This graph perspective lets you answer questions that are impossible with relational databases. You want to know "what's the transitive closure of datasets that affect this model?" In a relational database, that's a complex recursive query. In a graph database or a graph-structured system like MLMD, it's a graph traversal. You want to know "which models are indirectly affected by this data source?" Again, a graph question with a graph answer.
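The traversal described above can be sketched with a plain breadth-first search. The dict-of-lists lineage graph here is a simplified in-memory stand-in for the artifact/execution graph MLMD stores, and the node names are invented for illustration:

```python
from collections import deque

def downstream_closure(graph, start):
    """Return every node reachable from `start` by following
    producer -> consumer edges (the transitive closure)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Toy lineage graph: a raw table feeds two feature sets,
# which in turn feed three models.
lineage = {
    "raw_transactions": ["features_v1", "features_v2"],
    "features_v1": ["model_a", "model_b"],
    "features_v2": ["model_c"],
}

# Every feature set and model is (transitively) affected by the raw table.
print(downstream_closure(lineage, "raw_transactions"))
```

Flip the edge direction and the same traversal answers the backward question: which data sources does this model ultimately depend on?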
Over time, this graph becomes your organization's collective memory. It encodes all the decisions, transformations, and relationships that created your current ML infrastructure. A senior data scientist with deep domain knowledge can answer complex questions about lineage. An MLMD query system can answer the same questions in seconds. This is democratization of expertise.
The graph also surfaces insights that aren't obvious looking at individual pieces. You might notice that three models all depend on the same preprocessing step. If that preprocessing has a bug, you know instantly that three models are affected. You might notice that a feature engineering step is only used by one model, making it a candidate for consolidation. You might notice that two models' data dependencies don't overlap, so they can be trained in parallel instead of sequentially.
Building and maintaining this graph is an investment, but an investment that compounds over time. The bigger your system grows, the more valuable the graph becomes.
Why Lineage Tracking Fails (And How to Prevent It)
Lineage tracking sounds simple: log where data comes from, log what code processed it, log what model was trained. In practice, it's surprisingly easy to get wrong. Understanding common failure modes helps you avoid them.
The first failure mode is incomplete logging. You instrument your training pipeline to log the model artifact and the training dataset. But you forget to log the data preprocessing steps. Or you forget to log the random seed and library versions. Six months later, you try to reproduce the model and realize your logged metadata is incomplete. You need the preprocessing steps to match the original training run, but you never logged them. Prevention: create a checklist of what gets logged. Data sources, preprocessing steps, hyperparameters, environment variables, library versions, hardware configuration, random seeds, dataset splits. Make logging automatic through instrumentation that fires before and after each stage.
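One way to make that checklist enforceable is to validate every metadata record against it before it is written. A minimal sketch, with hypothetical field names:

```python
# Checklist of what every training run must log (hypothetical field names)
REQUIRED_METADATA_FIELDS = (
    "data_sources", "preprocessing_steps", "hyperparameters",
    "library_versions", "hardware", "random_seed", "dataset_split",
)

def validate_metadata(metadata: dict) -> list:
    """Return the checklist fields missing from a metadata record."""
    return [f for f in REQUIRED_METADATA_FIELDS if f not in metadata]

# An incomplete record: the gap is caught at logging time,
# not six months later when you try to reproduce the model.
record = {"data_sources": ["s3://raw/transactions"], "random_seed": 42}
missing = validate_metadata(record)
if missing:
    print("refusing to log, missing:", missing)
```

Wire this check into the instrumentation that fires before each pipeline stage, so an incomplete record fails loudly instead of silently producing a lineage gap.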
The second failure mode is incorrect timestamps and causality. Your training execution log says it started at 2024-01-15 10:00:00, but the input dataset creation log says it was created at 2024-01-15 10:30:00. This violates causality - the dataset was created after the training started, which is impossible. These inconsistencies happen because different systems have different clocks, different timezone handling, or different logging precision. Your lineage graph becomes unreliable. Prevention: use a centralized clock source. All logging goes through one system that assigns timestamps. Use UTC exclusively, never local time. Validate causality relationships before storing.
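The causality check can be a one-line guard in the logging layer. A sketch, assuming all timestamps are already normalized to UTC before comparison:

```python
from datetime import datetime, timezone

def check_causality(input_created_at: datetime, execution_started_at: datetime) -> bool:
    """An execution cannot start before its input artifact exists."""
    return input_created_at <= execution_started_at

# All timestamps in UTC, from one clock source, before they're compared.
dataset_created = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
training_started = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)

if not check_causality(dataset_created, training_started):
    print("Causality violation: refusing to store this lineage record")
```

Rejecting the record at write time is much cheaper than discovering an impossible timeline during an audit.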
The third failure mode is schema drift. You start logging metadata with fields A, B, and C. Six months later, you add field D because you need it. Your old metadata doesn't have field D. Your new metadata does. When you query across both, you get inconsistent results. Schema evolution breaks lineage queries. Prevention: version your metadata schema. When you add fields, bump the version number. Build your query system to handle multiple schema versions transparently. Use union types or optional fields to maintain backward compatibility.
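Handling multiple schema versions transparently can be as simple as a normalization step on read. A sketch using the A/B/C/D fields from the example above, with `schema_version` as a hypothetical version marker:

```python
SCHEMA_VERSION = 2  # current schema: field "d" was added in v2

def read_record(record: dict) -> dict:
    """Normalize metadata written under older schema versions so
    queries see one consistent shape."""
    version = record.get("schema_version", 1)  # unversioned records are v1
    normalized = dict(record)
    if version < 2:
        normalized.setdefault("d", None)  # optional field, absent pre-v2
    normalized["schema_version"] = SCHEMA_VERSION
    return normalized

old = {"a": 1, "b": 2, "c": 3}                                # logged under v1
new = {"a": 1, "b": 2, "c": 3, "d": 4, "schema_version": 2}   # logged under v2
print(read_record(old)["d"], read_record(new)["d"])
```

Queries that go through `read_record` never see the version skew; old and new records look the same downstream.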
The fourth failure mode is metadata corruption and loss. Your metadata store becomes a single point of failure. If it fails, you can't query lineage. If it corrupts, you lose confidence in what you can't verify. Prevention: back up your metadata store regularly. Validate metadata integrity on read. Store redundant copies in different geographic regions. Treat metadata as critical as your models.
The fifth failure mode is over-logging. You log everything. Every intermediate artifact, every parameter, every configuration value. Your metadata store balloons to terabytes. Queries time out. Lineage becomes unusable because there's too much noise. Prevention: log strategically. Log inputs and outputs. Log critical decisions. Log versions and hashes. Don't log intermediate scratch files or debug information. Be selective about what goes into the permanent record.
The Hidden Costs of Not Tracking Lineage
It's easy to dismiss lineage tracking as overhead. You're already managing your data. You know where things come from. Do you really need explicit metadata tracking? The answer becomes obvious only when things go wrong.
Consider the true cost of a lineage failure without tracking. Your model breaks in production. You have two options: debug blindly, or reconstruct what changed. Without lineage, debugging means asking your teammates, checking Slack history, digging through git commits, running through mental maps of dependencies. You spend six hours reconstructing what should have been logged automatically. Your model is down for those six hours. That's expensive.
Or consider: a data quality issue cascades through your systems. A source table's update frequency changed. Three downstream feature tables became stale. Five models trained on those stale features started making bad predictions. Without lineage, you discover this by accident - maybe users complain, maybe you notice metrics degrading. You spend a week making bad predictions before noticing. With lineage, impact analysis tells you instantly: these five models depend on this source table, and its freshness SLA is violated. You alert immediately, retrain proactively, incident prevented.
Or consider: a regulatory audit. You need to prove you didn't use certain data in a model. Without lineage, you manually investigate. With lineage, you query: "does my current production model depend on this dataset?" The answer comes back with full provenance. You can prove compliance.
The business value of lineage becomes obvious once you internalize that without it, you're flying blind. You don't actually know what you have. You're gambling that your models are working correctly because you don't have enough visibility to know if they're not.
Lessons From Teams That Got It Right
The organizations that successfully operate lineage tracking share a few patterns. First, they don't try to track everything perfectly from day one. They start with critical paths. Which models are in production? What data do they depend on? Build lineage for those first. Expand gradually to non-critical models.
Second, they build simple tooling on top of MLMD, not complex systems. They create a LineageSDK wrapper that hides MLMD complexity. Their data scientists call simple functions (log_dataset, log_training, get_lineage) instead of manipulating MLMD protobuf objects. This simplicity drives adoption.
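To make the wrapper idea concrete, here's a minimal sketch of what such a LineageSDK facade could look like. The method names (`log_dataset`, `log_training`, `get_lineage`) follow the ones mentioned above; the in-memory dict store is a stand-in for a real MLMD MetadataStore, used here only to keep the sketch self-contained:

```python
import itertools

class LineageSDK:
    """Thin facade over a metadata store. A real implementation would
    call ml-metadata's MetadataStore; here an in-memory dict stands in
    so the interface itself is the focus."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._artifacts = {}   # artifact id -> properties
        self._edges = []       # (input_artifact_id, output_artifact_id)

    def log_dataset(self, uri, version):
        ds_id = next(self._ids)
        self._artifacts[ds_id] = {"type": "Dataset", "uri": uri, "version": version}
        return ds_id

    def log_training(self, dataset_id, model_uri, seed):
        model_id = next(self._ids)
        self._artifacts[model_id] = {"type": "Model", "uri": model_uri, "seed": seed}
        self._edges.append((dataset_id, model_id))
        return model_id

    def get_lineage(self, model_id):
        """Return the datasets that fed this model."""
        return [self._artifacts[i] for i, o in self._edges if o == model_id]

# Data scientists call three functions; protobuf never leaks out.
sdk = LineageSDK()
ds = sdk.log_dataset("s3://data/train.parquet", "v2.0")
model = sdk.log_training(ds, "s3://models/fraud.bin", seed=42)
print(sdk.get_lineage(model))
```

The interface is the point: swapping the dict for a real MLMD backend changes the internals, not the three calls your team learned.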
Third, they integrate lineage into their existing workflows. Their training scripts automatically log metadata to MLMD. Their serving layer automatically logs predictions. Their monitoring system automatically logs performance. Logging becomes default behavior, not something people remember to do.
Fourth, they use lineage proactively, not reactively. They query it regularly. They surface lineage information in dashboards. They alert when reproducibility checks fail. They make lineage visible and front-and-center, not something you dig up when there's a problem.
Fifth, they accept imperfection. Early lineage graphs have gaps. Some old models are missing metadata. Some pipelines don't log executions. Instead of trying to achieve 100 percent coverage immediately, they maintain what they can and gradually fill in gaps as systems get updated. Pragmatism beats perfectionism.
The Business Case: ROI of Lineage Tracking
You're probably wondering: is this worth the investment? Lineage tracking requires engineering time. It requires running and maintaining MLMD infrastructure. It requires training teams to use it. Is the payoff real?
Consider the concrete benefits. Incident resolution time. Without lineage, debugging a failed model requires detective work. What data trained it? You dig through git history. What version of the code was used? You hunt through CI logs. What was the hardware configuration? You ask whoever ran the training. You spend hours reconstructing information. With lineage, you query the metadata store and have the answer in seconds. For a team handling five production incidents per month, that's hours of freed time monthly.
Consider model reproducibility. Without lineage, rebuilding a model from scratch is guesswork. With lineage, it's a recipe you follow. If you need to retrain a model due to data issues or regulatory reasons, you pull the lineage metadata and know exactly what to do. This is enormous for compliance - auditors ask "can you rebuild this model?" and you say "yes, here's the complete specification." That confidence is priceless.
Consider team velocity. A new data scientist joining your team needs to understand your models and data. Without lineage, they learn through conversation and documentation that's always out of date. With lineage, they can explore the graph, understand dependencies, trace models backward to data sources. Onboarding accelerates dramatically.
Consider prevented incidents. An engineer is about to change a data schema. Without lineage, they don't realize that five models depend on that schema. They change it, models break, and you get a page at 3 AM. With lineage, they query the impact, see that five models break, and coordinate with those teams. The incident is prevented entirely.
Quantifying these benefits is difficult, but the pattern is clear: organizations with mature lineage tracking spend less time firefighting, move faster, and have more confidence in their ML systems. That's worth the investment.
Moving Forward: Your Lineage Implementation Roadmap
If you're convinced that lineage tracking is important (and hopefully you are), here's how to get started:
Week 1: Evaluate MLMD vs. alternatives. Try out the quickstart. Get comfortable with the data model.
Week 2: Implement a simple end-to-end lineage flow for one non-critical model. Train a model, log the metadata, query it back. Get hands-on experience.
Week 3: Build a LineageSDK wrapper that abstracts MLMD complexity. Your team should never touch protobuf directly.
Week 4: Integrate lineage into your first production training pipeline. Make it automatic, not manual.
Month 2: Roll lineage out to five more models. Build organizational momentum.
Month 3: Start building lineage queries for common questions. "Is this model reproducible?" "What data trained this?" etc.
Month 6: Half your models are instrumented. Lineage has already prevented incidents. You've rebuilt a model from scratch using lineage metadata. You're seeing the value.
By month twelve, lineage is part of your ML infrastructure. New models get lineage automatically. Old models gradually get retrofitted. Your organization is more transparent, more reproducible, more compliant.
It's a journey, not a destination. But it's a journey worth taking.
Building the infrastructure that makes ML trustworthy, reproducible, and auditable.