
You've built a killer ML model. It predicts customer churn with 94% accuracy. Runs beautifully on your laptop. Then someone on the team tries to use it, and... "It doesn't work on my machine."
Sound familiar? This is where Docker comes in. Containerizing your ML model solves the classic "works on my machine" problem by bundling your model, dependencies, environment variables, and runtime into a single, reproducible package. Whether you're shipping to production, sharing with teammates, or deploying to cloud infrastructure, containers are your safety net.
But containerizing ML models isn't quite like containerizing a typical Python app. You're dealing with large model weights, GPU acceleration, environment-specific dependencies, and persistence challenges that a simple web app never has to think about. A Flask app with SQLAlchemy might weigh in at a few hundred megabytes; your fine-tuned BERT model alone can eclipse that before you've added a single line of application code. Throw in CUDA drivers, cuDNN libraries, and framework-specific binaries, and the complexity compounds fast. We'll cover the full toolchain, from writing an ML-specific Dockerfile to optimizing images to enabling GPU support with NVIDIA Docker. By the end, you'll have a production-grade containerization workflow that you can apply to any ML project, regardless of framework or model size.
Table of Contents
- Why Containerize ML Models?
- The ML Dockerfile: Starting Simple
- Base Image Choices
- Why Containerize ML Models
- Docker Layer Optimization for ML
- Multi-Stage Builds for Production
- Model Artifacts: Image vs. Volume
- Environment Variables for ML Config
- Health Checks and Graceful Shutdown
- Docker Compose with GPU Support
- Image Optimization: Size and Speed
- Docker Layer Optimization for ML
- Multi-Stage Builds for Production
- Common Docker ML Mistakes
- Pushing to Registries
- Putting It All Together
- Common Gotchas
- Wrapping Up
Why Containerize ML Models?
If you've been deploying ML models the old-fashioned way (SSH-ing into servers, running pip install manually, crossing your fingers that the CUDA version matches), you already know the pain. Let's talk about why containerization is the antidote, because understanding the why makes the how stick.
Reproducibility is non-negotiable in ML. Your model doesn't just depend on your Python code. It depends on a specific Python version, specific package versions (NumPy 1.24 vs. 1.26 can produce different numerical results), the CUDA version, cuDNN version, OS-level libraries, and sometimes even specific hardware drivers. A Docker image captures all of that in a single artifact. When you hand your container to a colleague, a CI system, or a cloud deployment pipeline, it runs identically. No drift, no surprises, no "but it worked on my machine" debugging sessions that eat three hours of your afternoon.
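One lightweight way to make that reproducibility concrete is to snapshot the exact versions installed in your working environment before writing the Dockerfile, so your requirements.txt pins what actually ran. A minimal sketch (the `pinned_requirements` helper and the package list are illustrative, not standard tooling):

```python
from importlib import metadata

def pinned_requirements(packages):
    """Return exact '==' pins for the given packages, as installed in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed here")
    return "\n".join(lines)

# Write the pins straight into a requirements.txt your Dockerfile can COPY
print(pinned_requirements(["pip", "numpy", "torch"]))
```

Redirect that output to a file and the image you build from it captures the environment you actually validated, not whatever pip resolves on build day.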
Isolation eliminates dependency hell. Your production ML service needs TensorFlow 2.14, but your data pipeline needs TensorFlow 2.10, and your experiment tracking tool needs Python 3.9. Virtual environments help, but they don't isolate system-level dependencies or CUDA. Containers do. Each container gets its own complete environment, and they coexist happily on the same host. You can run five different ML services with five different framework versions on a single GPU server, and they'll never interfere with each other.
Portability means you build once, run anywhere. The same container image that runs on your laptop runs in your company's on-premise Kubernetes cluster, on AWS ECS, on Google Cloud Run, and on Azure Container Instances. You write the Dockerfile once. You build the image once. From that point forward, deployment is just pulling the image and running it, no rewriting environment setup scripts for each target platform, no wrestling with cloud-specific configuration quirks.
Operational consistency matters at scale. When your model goes from serving 10 requests per day to 10,000, you need to scale horizontally. Containers make that trivial: spin up more replicas of the same image. Each one is identical. There's no snowflake server syndrome where replica three behaves differently because someone manually installed a hotfix six months ago.
The alternative (hand-written setup instructions, assumption-based deployments, runtime surprises) costs far more than learning Docker. A few hours learning container best practices for ML pays dividends every time you deploy, every time you onboard a new team member, and every time your model needs to scale.
The ML Dockerfile: Starting Simple
Let's say you've got a scikit-learn model that predicts house prices. The starting point looks straightforward enough, and it gives us a solid foundation to build on before we get into the ML-specific considerations that will really matter at scale.
Here's a minimal Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl .
COPY app.py .
CMD ["python", "app.py"]

This works, but it's generic. ML workloads often need specialized base images. If you're using PyTorch or TensorFlow, you're better off starting with their official images: they're maintained by the framework teams and come pre-configured with exactly the right library versions:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pth .
COPY app.py .
CMD ["python", "app.py"]

Why the switch? The pytorch/pytorch image comes pre-loaded with CUDA and cuDNN. You don't need to install them separately, and the image is optimized for GPU inference out of the box. If you're running CPU-only inference, you can stick with python:3.11; but if you need the GPU on a plain Python image, you'd have to install CUDA yourself (which defeats the purpose of the slim base).
For TensorFlow users, the pattern is similar. Note that TensorFlow's official images use the -gpu suffix to signal GPU readiness, so pick the tag that matches your deployment target and the CUDA version supported by your hardware:
FROM tensorflow/tensorflow:2.15.0-gpu
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model_dir/ ./model_dir/
COPY app.py .
CMD ["python", "app.py"]

The key insight: pick a base image that matches your framework. It saves time, avoids surprises, and ensures your dependencies align with the framework's expectations.
Base Image Choices
| Framework | CPU Base | GPU Base |
|---|---|---|
| PyTorch | python:3.11 | pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime |
| TensorFlow | python:3.11 | tensorflow/tensorflow:2.15.0-gpu |
| scikit-learn | python:3.11-slim | python:3.11-slim |
| General ML | python:3.11-slim | nvidia/cuda:12.1.1-runtime-ubuntu22.04 |
For scikit-learn and XGBoost, plain Python suffices. For deep learning, specialized images save headaches.
Why Containerize ML Models
We touched on the high-level case above, but it's worth going deeper on the mechanics, because the ML-specific reasons for containerization are distinct from what you'd hear in a generic Docker tutorial, and understanding them shapes how you write your Dockerfiles.
The most underappreciated reason is CUDA version pinning. CUDA is not backward compatible in the way you might hope. A PyTorch model compiled against CUDA 12.1 will not run correctly on a host with only CUDA 11.8. Without containers, managing this across a team of engineers with different workstations and a heterogeneous fleet of GPU servers is a genuine nightmare. A Docker image built with pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime carries its CUDA dependencies with it. The host only needs a CUDA driver that supports CUDA 12.1 or later; everything else lives inside the container.
Model weight management is another ML-specific concern. Model weights are binary artifacts that can range from a few megabytes (a small scikit-learn model) to tens of gigabytes (a large language model). Containers let you make an explicit architectural decision: do the weights live inside the image (immutable, versioned alongside the code), or do they get mounted at runtime from external storage (flexible, swappable without rebuilding)? Neither answer is always right, but Docker forces you to think about this tradeoff explicitly, which is the first step to managing it intentionally.
Inference environment consistency matters because ML frameworks are notoriously sensitive to their environment. NumPy, SciPy, and BLAS implementations interact in ways that can subtly affect numerical outputs. The difference between running MKL-optimized NumPy versus OpenBLAS-backed NumPy might not change your model's predictions meaningfully, but it can affect performance by 2-10x. Containers let you specify exactly which linear algebra backend you're using and ensure consistency across every inference call, everywhere your model runs.
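To make that backend dependence visible, you can log NumPy's build configuration at container startup. This sketch (the `numpy_build_report` helper is illustrative) captures what `np.show_config()` prints so you can ship it to your logs and confirm every replica uses the same BLAS backend:

```python
import io
import contextlib

import numpy as np

def numpy_build_report() -> str:
    """Capture NumPy's version plus the BLAS/LAPACK build info it was compiled against."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()  # prints which linear algebra backend (MKL, OpenBLAS, ...) is in use
    return f"numpy {np.__version__}\n{buf.getvalue()}"

print(numpy_build_report())
```

Emit this once at service startup; if two environments ever disagree on performance, the logs tell you immediately whether the linear algebra stack is the culprit.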
Finally, security and isolation deserve mention. ML models are increasingly being treated as sensitive intellectual property. Running your model in a container with restricted filesystem access, network policies, and resource limits provides a meaningful security boundary. You can prevent the model from accessing arbitrary parts of the host filesystem, limit its network egress, and cap its resource consumption, all without modifying a line of model code.
Docker Layer Optimization for ML
One of the most impactful things you can learn about Docker for ML is how the layer cache works, and how to exploit it to make your builds dramatically faster. Every instruction in a Dockerfile creates a layer. Docker caches these layers and only rebuilds from the point where something changed. This sounds simple, but ML workloads have characteristics that make layer ordering critically important.
The golden rule: put things that change infrequently at the top, things that change frequently at the bottom. For an ML app, that usually means: base image first, system dependencies second, Python dependencies third, model weights fourth, application code last. Application code changes with every commit. Model weights change whenever you retrain. Python dependencies change when you add a new package. System dependencies almost never change.
Here's the wrong order, which most people write first:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
# Everything at once: kills the cache on every change
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "app.py"]

Every time you change any file, even a single line in app.py, Docker invalidates the cache at the COPY . . layer and reinstalls all your Python dependencies from scratch. If your requirements take four minutes to install, you're burning four minutes on every code change.
Here's the right order:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
# Dependencies first, only reinstall when requirements.txt changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Model weights, only recopy when model is retrained
COPY models/ ./models/
# App code last, changes every commit, but pip install is already cached
COPY app.py .
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Now a code change only invalidates the last layer. Pip install is cached. Model weights are cached unless you explicitly update them. Build time drops from four minutes to fifteen seconds for typical code changes.
For large model weights specifically, consider using Docker's --cache-from flag in your CI pipeline to pull the previous image's layers as a cache source. Combined with layer-aware ordering, this means your CI builds can reuse cached layers across pipeline runs, turning a ten-minute Docker build into a ninety-second one for incremental changes.
One more technique: squash layers in the right places. System-level package installations often leave behind package manager caches that bloat your image. Combine related RUN instructions with && and clean up in the same layer:
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl libgomp1 && \
    rm -rf /var/lib/apt/lists/*

If you split the apt-get install and rm -rf into separate RUN instructions, Docker creates two layers and the apt cache lives in the first one permanently. Combining them into a single RUN instruction means the cleanup happens before the layer is committed, so the cache never lands in your final image.
Multi-Stage Builds for Production
Here's a problem: model weights can be huge. A BERT model? 500MB. A fine-tuned GPT-2? 1GB+. If you bake the model directly into your image using a naive approach, your image size explodes and your download/install tools bloat the final artifact unnecessarily. Multi-stage builds are Docker's answer to keeping production images lean while still having a full build environment available during construction.
The concept is elegant: you get multiple FROM instructions in a single Dockerfile, each starting a new stage. Stages can copy artifacts from each other, but only what you explicitly copy survives into the final image. Build tools, pip caches, intermediate files, they all get discarded automatically.
Multi-stage builds solve this by separating the build context (where we compile/download things) from the runtime context (what actually runs in production):
# Stage 1: Preparation (download model, install build deps)
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime as builder
WORKDIR /tmp
RUN pip install --no-cache-dir huggingface-hub
COPY download_model.py .
RUN python download_model.py
# Stage 2: Runtime (only the app and model, no build cruft)
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY --from=builder /tmp/models/ ./models/
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Here's what happens:
- The builder stage downloads a large model using Hugging Face Hub.
- The runtime stage copies only the final model artifact from the builder, skipping all the intermediate build tools.
- The final image includes the model but not the model-download tools, build artifacts, or pip cache.
This cuts image size significantly, sometimes by 50% or more. The tradeoff: slightly longer build time (two stages instead of one), but deployment is faster because the image is smaller.
The production use case extends beyond model downloads. Use multi-stage builds whenever you need compilation steps, building C extensions, compiling custom CUDA kernels, or generating protobuf code. The builder stage can have compilers, headers, and development libraries installed. The runtime stage gets only the compiled artifacts. Your production image stays clean and minimal, which matters enormously when you're pulling it across a network on every deployment.
For Python-heavy ML stacks, you can take this further with a dedicated dependency compilation stage that handles packages needing C extensions (like tokenizers from Hugging Face, which compiles Rust code during install), keeping the final stage's pip install fast and cache-friendly.
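Here's a sketch of what that dedicated compilation stage might look like, assuming a hypothetical requirements-build.txt listing the extension-heavy packages: wheels are compiled once in a full Python image, and the slim final stage installs them without ever needing a compiler:

```dockerfile
# Build stage: full image with compilers, used only to produce wheels
FROM python:3.11 as wheelbuilder
WORKDIR /build
COPY requirements-build.txt .
RUN pip wheel --no-cache-dir --wheel-dir /build/wheels -r requirements-build.txt

# Final stage: installs from the prebuilt wheels, no compiler required
FROM python:3.11-slim
WORKDIR /app
COPY --from=wheelbuilder /build/wheels /wheels
COPY requirements-build.txt .
RUN pip install --no-cache-dir --no-index --find-links=/wheels -r requirements-build.txt
```

The `--no-index --find-links` flags force pip to install only from the prebuilt wheels, so a missing wheel fails the build loudly instead of silently recompiling in the slim stage.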
Model Artifacts: Image vs. Volume
Should you bake the model into the image, or mount it as a volume at runtime?
Baked in (image): Great for immutable models. You deploy the image, and the model goes with it. Simpler orchestration in Kubernetes. No runtime dependencies on external storage.
As a volume: Great for large models, frequently updated models, or shared models across services. Mount a volume at /models/, and swap the model without rebuilding the image.
Here's a hybrid approach: the model lives outside the image, but the app knows where to find it. This pattern works especially well in environments where you're continuously retraining and want to update model weights without rebuilding and pushing a new Docker image every time:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
# Volume mount point for model artifacts
VOLUME ["/models"]
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Then, when you run the container, you mount the volume: Docker binds the host path to the container's /models directory, making your local model files available inside the container without copying them into the image:
docker run -v /path/to/models:/models -p 8000:8000 my-ml-app:latest

Or in Docker Compose (covered next):
services:
  ml-app:
    image: my-ml-app:latest
    volumes:
      - ./models:/models
    ports:
      - "8000:8000"
- "8000:8000"The choice depends on your deployment model. Microservices that scale horizontally? Volumes. Batch jobs? Baked-in. Frequently retrained models? Volumes let you hot-swap without redeployment.
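If you go the volume route for frequently retrained models, your app needs a way to notice that the file on disk changed. A minimal mtime-based sketch (the `ModelWatcher` class and the injected `load_fn` are illustrative; in practice `load_fn` would be something like `torch.load`):

```python
from pathlib import Path

class ModelWatcher:
    """Reload the model whenever the file mounted at the model path changes on disk.

    `load_fn` is a hypothetical loader injected so the sketch stays
    framework-agnostic; swap in your framework's load function.
    """

    def __init__(self, path, load_fn):
        self.path = Path(path)
        self.load_fn = load_fn
        self._mtime = None
        self.model = None

    def get(self):
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:  # file was replaced, e.g. by a retrain job
            self.model = self.load_fn(self.path)
            self._mtime = mtime
        return self.model
```

Call `watcher.get()` at the top of each request handler: it's a cheap `stat()` in the common case and a reload only when the retrain pipeline swaps the file.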
Environment Variables for ML Config
ML apps need runtime configuration: model paths, batch sizes, confidence thresholds, logging levels. Environment variables are your friend: they're the twelve-factor-app way of separating configuration from code, and they're especially powerful for ML because the same model might need different batch sizes or confidence thresholds depending on whether it's handling real-time requests or batch processing:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
ENV MODEL_PATH=/models/bert_model.pt \
BATCH_SIZE=32 \
CONFIDENCE_THRESHOLD=0.85 \
LOG_LEVEL=INFO
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

In your Python app, read these with sensible defaults so the application can still run even if an environment variable is accidentally omitted:
import os
from pathlib import Path
MODEL_PATH = Path(os.getenv("MODEL_PATH", "/models/model.pt"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))
CONFIDENCE_THRESHOLD = float(os.getenv("CONFIDENCE_THRESHOLD", "0.85"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

At runtime, override them; notice how you can swap to a different model entirely or double the batch size without touching the Dockerfile or application code:
docker run -e MODEL_PATH=/models/custom_model.pt -e BATCH_SIZE=64 my-ml-app:latest

This flexibility is powerful. Same image, different configs. Perfect for dev/staging/production.
Health Checks and Graceful Shutdown
Production containers need to know when they're sick. Add a health check: this is your container telling the orchestrator "I'm ready to serve traffic" or "something went wrong, restart me." Without one, Docker and Kubernetes assume the container is healthy just because the process is running, even if your model failed to load:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
# curl isn't guaranteed to be in the base image; the HEALTHCHECK below needs it
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
# Health check every 30 seconds, timeout 5s, 3 failures = unhealthy
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

And in your FastAPI app, the health endpoint should verify that the model is loaded and ready: not just that the web server is running, but that the actual inference machinery is operational:
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
import torch

model = None

@asynccontextmanager
async def lifespan(app):
    global model
    # NOTE: this assumes the checkpoint is a full pickled model object, which
    # requires weights_only=False. If you saved a state_dict (recommended),
    # instantiate your model class and call load_state_dict instead.
    model = torch.load("/models/model.pt", weights_only=False)
    model.eval()
    print("Model loaded and ready")
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def health():
    # Report unhealthy until the model is actually loaded, not just the server
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "healthy"}

@app.post("/predict")
async def predict(data: dict):
    input_tensor = torch.tensor([data["features"]])
    with torch.no_grad():
        output = model(input_tensor)
    return {"prediction": output.item()}

The health check tells orchestrators (Docker Compose, Kubernetes, ECS) whether the container is alive. If it fails three times, the container is marked unhealthy and can be restarted.
For graceful shutdown, handle SIGTERM. This is particularly important for ML services because a long inference call that's in progress when the container gets a stop signal should be allowed to complete rather than being abruptly terminated mid-computation:
import signal
import sys

def signal_handler(sig, frame):
    print("Shutdown signal received, cleaning up...")
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)

When you stop a container, it gets SIGTERM. Your app has 10 seconds (Docker's default grace period) to finish requests and clean up before SIGKILL arrives. This matters for ML models: you don't want to interrupt a long inference midway.
Docker Compose with GPU Support
For local development and testing, Docker Compose lets you define multi-container setups. This is where Docker really starts to feel like infrastructure-as-code rather than just a packaging tool: you define your entire local environment in a single YAML file, and anyone on your team can reproduce it with one command.
Here's a Compose file for an ML app with GPU. The GPU configuration section is the part most people get wrong, so pay attention to the deploy.resources.reservations.devices block:
version: "3.9"

services:
  ml-app:
    build: .
    image: my-ml-app:latest
    container_name: ml-app-gpu
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/models/model.pt
      - BATCH_SIZE=32
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./models:/models
      - ./logs:/app/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

  # Optional: MLflow for experiment tracking
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns
    command: server --host 0.0.0.0 --backend-store-uri file:///mlruns

The critical part: deploy.resources.reservations.devices with the nvidia driver tells Docker Compose to allocate a GPU to the container. This requires the NVIDIA Container Toolkit (historically packaged as nvidia-docker2) to be installed on the host.
To use it, you'll need a one-time setup of the NVIDIA Container Toolkit. (NVIDIA has since folded the older nvidia-docker2 package into the newer nvidia-container-toolkit; check NVIDIA's current install docs if these commands don't match your distro.) After this, every Docker container on your host can access the GPU with the configuration shown above:
# Install nvidia-docker2 (one time)
# On Ubuntu/Debian:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# Then start the service with GPU
docker-compose up

Verify GPU access in your app by adding a debug endpoint; this is something you'll want during development to confirm the GPU passthrough is working before you wonder why your inference is slow:
import torch

@app.get("/gpu-info")
async def gpu_info():
    cuda = torch.cuda.is_available()
    return {
        "cuda_available": cuda,
        "device_count": torch.cuda.device_count(),
        # current_device() raises on CPU-only hosts, so guard both calls
        "current_device": torch.cuda.current_device() if cuda else None,
        "device_name": torch.cuda.get_device_name(0) if cuda else None,
    }

Hit http://localhost:8000/gpu-info and you'll see your GPU listed. If it's not there, your Compose config isn't passing the GPU through correctly.
Image Optimization: Size and Speed
Large images are slow to push, pull, and start. Optimize with three techniques:
1. Use .dockerignore
The .dockerignore file tells Docker what to exclude from the build context: everything listed here never even gets sent to the Docker daemon, which speeds up builds and prevents accidentally including sensitive files or large development artifacts:
# .dockerignore
__pycache__
*.pyc
.git
.gitignore
.env
.venv
notebooks/
*.ipynb
test_*.py
pytest.ini
.pytest_cache/
*.egg-info/
dist/
build/
.DS_Store

This tells Docker to skip these files when building the context. Massive speed boost if you've got large notebooks or test data.
2. Multi-stage builds (as shown above)
Don't include build tools, pip caches, or temporary files in the final image.
3. Clear pip cache
RUN pip install --no-cache-dir -r requirements.txt

The --no-cache-dir flag saves 100-200MB per install.
Example optimized Dockerfile: notice how the site-packages are explicitly copied from the builder stage, which means you get the installed packages without any of the pip download cache or build intermediates:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime as builder
WORKDIR /tmp
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
# NOTE: the site-packages path must match the base image's Python layout.
# The conda-based pytorch/pytorch images keep it under /opt/conda.
COPY --from=builder /opt/conda/lib/python3.10/site-packages /opt/conda/lib/python3.10/site-packages
COPY app.py .
COPY models/ ./models/
CMD ["python", "app.py"]

Build and check size to validate that your optimizations are actually having the expected effect; it's easy to assume you've saved space and find you haven't because of a missed cleanup step:
docker build -t my-ml-app:latest .
docker images my-ml-app

With optimizations, you might see 500MB → 300MB. For large-scale deployments, that's the difference between 2-minute and 10-minute pulls.
Docker Layer Optimization for ML
Beyond the basic layer ordering rules we covered earlier, ML workloads benefit from some framework-specific layer optimization strategies that can shave significant time off both your build pipeline and your deployment cycle.
The first thing to understand is that ML dependency installation is fundamentally different from web app dependency installation. Installing PyTorch alone downloads roughly 700MB of binaries. Installing the full Hugging Face transformers stack with tokenizers can take ten minutes on a cold cache. This means your layer strategy should treat ML dependencies as near-immutable infrastructure: change them as rarely as possible, and make sure Docker can cache them aggressively.
Split your requirements into two files: requirements-base.txt for your ML framework and its core dependencies (PyTorch, NumPy, CUDA utilities), and requirements-app.txt for your application-level dependencies (FastAPI, uvicorn, Pydantic). Install the base requirements first in their own layer, then the app requirements. Your ML framework changes maybe once a quarter. Your app dependencies change weekly. This split lets Docker cache the expensive ML install while still giving you fast rebuilds when you add a new utility package.
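In Dockerfile terms, the split looks like this: two COPY/RUN pairs instead of one, so the heavy layer survives changes to the light one (a sketch using the requirements-base.txt / requirements-app.txt split described above):

```dockerfile
# Heavy, rarely-changing ML stack: this layer stays cached for months
COPY requirements-base.txt .
RUN pip install --no-cache-dir -r requirements-base.txt

# Lightweight app dependencies: cheap to rebuild when they change weekly
COPY requirements-app.txt .
RUN pip install --no-cache-dir -r requirements-app.txt
```

Adding a new utility package now touches only the second layer; the 700MB framework install never rebuilds.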
The second ML-specific optimization is handling Hugging Face model caches intelligently. By default, the transformers library caches models in ~/.cache/huggingface/. Inside a container, this is the root user's home directory. If you're downloading models at container startup rather than baking them into the image, set TRANSFORMERS_CACHE and HF_HOME environment variables explicitly to a path you control:
ENV TRANSFORMERS_CACHE=/app/model_cache
ENV HF_HOME=/app/model_cache

This ensures the cache lands in a predictable location that you can mount as a volume, share between container runs, and inspect when debugging. Without it, model downloads happen in an unpredictable location and get discarded when the container exits, forcing a fresh download every time.
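On the application side, it's worth failing fast at startup if that cache path isn't usable. A sketch (the `ensure_model_cache` helper is illustrative; it mirrors the environment variables set above):

```python
import os
from pathlib import Path

def ensure_model_cache() -> Path:
    """Create the Hugging Face cache directory and verify it's writable before serving."""
    cache_dir = Path(os.getenv("HF_HOME", "/app/model_cache"))
    cache_dir.mkdir(parents=True, exist_ok=True)
    probe = cache_dir / ".write_test"
    probe.touch()   # raises immediately if the mount is read-only
    probe.unlink()
    return cache_dir
```

Call this once during startup, before the first model download, so a misconfigured volume mount surfaces as a clear startup error instead of a mid-request download failure.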
Multi-Stage Builds for Production
The multi-stage build pattern deserves its own deeper treatment beyond what we covered in the introductory example, because production ML deployments have requirements that push the pattern to its limits.
Consider a production LLM inference server. You need to: download model weights from Hugging Face (requires huggingface_hub), convert them to an optimized format for your serving framework (requires optimum with ONNX or TensorRT), and then serve them (requires only your serving framework). That's three distinct phases with different dependency requirements, and only the final phase needs to be in your production image.
A three-stage Dockerfile for this scenario looks like this. The key is that each stage only carries forward what the next stage needs; no build tool leaks into production:
# Stage 1: Download weights
FROM python:3.11-slim as downloader
WORKDIR /tmp
RUN pip install --no-cache-dir huggingface-hub
COPY scripts/download_model.py .
RUN python download_model.py --model-id bert-base-uncased --output /tmp/raw_model
# Stage 2: Optimize for inference
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel as optimizer
WORKDIR /tmp
RUN pip install --no-cache-dir optimum[onnxruntime-gpu]
COPY --from=downloader /tmp/raw_model ./raw_model
COPY scripts/optimize_model.py .
RUN python optimize_model.py --input raw_model --output /tmp/optimized_model
# Stage 3: Production runtime
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
ENV MODEL_PATH=/app/models/optimized
COPY --from=optimizer /tmp/optimized_model ./models/optimized
COPY requirements-runtime.txt .
RUN pip install --no-cache-dir -r requirements-runtime.txt
COPY app.py .
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Notice that the production stage uses the -runtime CUDA image instead of the -devel image used for optimization. The devel image includes compilers and headers needed to compile CUDA extensions: hundreds of megabytes that your production inference server doesn't need. The runtime image is much smaller while still having everything needed to run GPU inference.
This pattern is particularly valuable in CI/CD pipelines. Each stage can be cached separately. If your model weights haven't changed, the downloader stage hits its cache. If your optimization script hasn't changed, the optimizer stage hits its cache. Only the production stage needs to rebuild, and that's the smallest stage. Your CI pipeline goes from forty-minute builds to five-minute builds once the caches are warm.
Common Docker ML Mistakes
Experience with ML containerization surfaces the same failure modes repeatedly. Knowing what they are before you hit them saves hours of confused debugging.
Mistake 1: Ignoring image size until it's a problem. A 15GB Docker image feels fine on your development machine with a fast SSD and a gigabit internet connection. It's a disaster in production when your auto-scaler needs to spin up a new instance in under thirty seconds. The time to optimize image size is during Dockerfile development, not after your on-call rotation gets paged because your service can't scale fast enough. Build the .dockerignore file first. Use multi-stage builds from the start. Check docker images after every significant Dockerfile change.
Mistake 2: Hardcoding model paths and assuming model presence. A container that fails silently because a model file isn't at /models/model.pt is worse than one that fails loudly. Always validate that required files exist at startup, with clear error messages that tell you exactly what's missing and where it was expected:
from pathlib import Path
import os
import sys

model_path = Path(os.getenv("MODEL_PATH", "/models/model.pt"))
if not model_path.exists():
    print(f"FATAL: Model not found at {model_path}", file=sys.stderr)
    if model_path.parent.exists():
        print(f"Contents of {model_path.parent}:", file=sys.stderr)
        for f in model_path.parent.iterdir():
            print(f"  {f}", file=sys.stderr)
    sys.exit(1)

Mistake 3: Not setting PYTHONUNBUFFERED=1. By default, Python buffers stdout. In a container, this means your logs don't appear in docker logs until the buffer flushes, which might be never if the container crashes. Always add ENV PYTHONUNBUFFERED=1 to your Dockerfile. Your future self will thank you during incident response.
Mistake 4: Running as root. The default Docker behavior runs your process as root inside the container. This is a security problem. If an attacker exploits your ML API, they get a root shell inside the container, which is a significant foothold. Add a non-root user:
RUN useradd --create-home --shell /bin/bash mluser
USER mluser

Mistake 5: Not pinning base image versions. FROM pytorch/pytorch:latest is a time bomb. The PyTorch team updates latest regularly, and a base image change can subtly break your model's behavior or performance in ways that are hard to diagnose. Always pin to a specific tag: FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime. When you want to upgrade, make it an explicit decision with a Dockerfile change, not an implicit one that happens during your next build.
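If you want to go one step further than tag pinning, you can pin by digest, which is immune even to a tag being re-pushed. The digest below is a placeholder, not a real value; resolve the actual one from the image you've tested:

```
# Resolve the digest for the tag you've verified
docker pull pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
docker images --digests pytorch/pytorch

# Then pin it in your Dockerfile (sha256 value below is a placeholder)
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime@sha256:<digest-from-above>
```

The tag stays in the FROM line for human readability, but the digest is what Docker actually enforces.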
Mistake 6: Forgetting about model warmup time. Kubernetes and load balancers will start sending traffic to a new container as soon as it reports healthy. But an ML model might take thirty seconds to load from disk before it can serve requests. Your health check needs to account for this. Use the start_period option in your HEALTHCHECK instruction to give the container time to load before the health check starts counting failures.
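In Dockerfile form, Mistake 6's fix might look like the following sketch: --start-period gives the model 60 seconds to load before failed probes count against the retry limit (tune the value to your model's actual load time, and note this assumes curl is available in the image):

```
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
```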
Pushing to Registries
Once your image is built and tested, push it to a registry so others can use it. Three popular options:
Docker Hub:
docker tag my-ml-app:latest username/my-ml-app:latest
docker login
docker push username/my-ml-app:latest

Others pull it with:

docker pull username/my-ml-app:latest

AWS ECR:
ECR is the registry of choice for AWS deployments because it integrates natively with ECS, EKS, and Lambda. Images stored there are pulled over AWS's internal network, which is faster and doesn't incur egress costs:
# Create repo
aws ecr create-repository --repository-name my-ml-app --region us-east-1
# Get login token
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
# Tag and push
docker tag my-ml-app:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/my-ml-app:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-ml-app:latest

Google Container Registry:
# Set project
gcloud config set project MY_PROJECT
# Authenticate
gcloud auth configure-docker
# Tag and push
docker tag my-ml-app:latest gcr.io/MY_PROJECT/my-ml-app:latest
docker push gcr.io/MY_PROJECT/my-ml-app:latest

Each registry has slightly different auth flows, but the pattern is the same: authenticate, tag with the registry URL, push.
Pro tip: Version your images with semantic versioning. latest is convenient but risky: you don't know what code is running. Use my-ml-app:1.0.0, my-ml-app:1.1.0, etc.
Putting It All Together
Here's a real-world example: a sentiment analysis API using a BERT model served with FastAPI. This example pulls together every technique we've covered (multi-stage builds, layer optimization, health checks, environment variables, and GPU support) into a coherent, production-ready setup.
Dockerfile:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime AS builder
WORKDIR /tmp
RUN pip install --no-cache-dir huggingface-hub transformers
COPY download_model.py .
RUN python download_model.py
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
WORKDIR /app
ENV MODEL_PATH=/models/bert-base-uncased \
    LOG_LEVEL=INFO
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && pip cache purge
COPY --from=builder /tmp/models /models
COPY app.py .
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

app.py:
from contextlib import asynccontextmanager
from fastapi import FastAPI
from transformers import pipeline
import torch
import logging
import os

logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logger = logging.getLogger(__name__)

sentiment_pipeline = None

@asynccontextmanager
async def lifespan(app):
    global sentiment_pipeline
    model_path = os.getenv("MODEL_PATH", "/models/bert-base-uncased")
    logger.info(f"Loading model from {model_path}")
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model=model_path,
        device=0 if torch.cuda.is_available() else -1,
    )
    logger.info("Model loaded successfully")
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def health():
    return {"status": "ok"}

@app.post("/predict")
async def predict(text: str):
    result = sentiment_pipeline(text)[0]
    return {"text": text, "label": result["label"], "score": float(result["score"])}

docker-compose.yml:
version: "3.9"
services:
  sentiment-api:
    build: .
    image: sentiment-api:1.0.0
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/models/bert-base-uncased
      - LOG_LEVEL=INFO
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Build and run. The -t flag tags the image with your version, and docker-compose up handles everything else, including GPU allocation:
docker build -t sentiment-api:1.0.0 .
docker-compose up

Test it to confirm everything is wired together correctly. If you get the expected JSON response with a POSITIVE label and high confidence score, your containerized ML service is working end-to-end:
curl -X POST "http://localhost:8000/predict?text=I%20love%20this%20product"
# {"text":"I love this product","label":"POSITIVE","score":0.999}

Common Gotchas
GPU not detected: Ensure nvidia-docker2 is installed and Docker daemon is restarted. Check with docker run --rm --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi.
Model path not found: Double-check volume mounts and environment variables. Add debug logging: print(f"Looking for model at {MODEL_PATH}") at startup.
Container OOM killed: ML models are memory-hungry. Set resource limits in Compose or Kubernetes, or increase available memory. docker run -m 8g limits to 8GB RAM.
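To see the limit from inside the container, you can read the cgroup interface files directly. This sketch checks the cgroup v2 path first and falls back to v1, returning None when no limit applies; the paths are the standard kernel locations, but how your runtime exposes them can vary:

```python
from pathlib import Path

def container_memory_limit_bytes():
    """Return the container's memory limit in bytes, or None if unlimited/unknown."""
    candidates = [
        Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        if raw == "max":  # cgroup v2 spelling for "no limit"
            return None
        limit = int(raw)
        # cgroup v1 reports a huge sentinel value when no limit is set
        if limit >= 1 << 60:
            return None
        return limit
    return None

limit = container_memory_limit_bytes()
print("memory limit:", "none" if limit is None else f"{limit / 2**30:.1f} GiB")
```

Logging this at startup, next to the model path check, makes "why did my container get OOM killed?" a one-glance question instead of a postmortem.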
Slow inference in container: Might be CPU-based when you expected GPU. Verify GPU is passed through and model is on correct device: print(next(model.parameters()).device).
Wrapping Up
Containerizing ML models is one of those skills that feels optional until the moment it becomes essential, and that moment usually comes at the worst possible time, when you're trying to deploy under pressure and your environment assumptions are falling apart. Getting comfortable with Docker for ML before you need it means you can deploy with confidence rather than desperation.
The core techniques we've covered work together as a system. Specialized base images give you the right CUDA and framework foundation. Layer ordering makes your builds fast by preserving the cache for expensive dependency installations. Multi-stage builds keep your production images lean by discarding build tooling after it's served its purpose. Volume mounts give you flexibility to swap model weights without rebuilding. Health checks tell your orchestration layer when your service is ready to serve traffic. And environment variables let the same image serve different environments with different configurations.
The jump from "it works on my laptop" to "it works in production, at scale, for everyone on the team" is real. Docker bridges that gap for ML the same way it does for web services, but with ML-specific considerations around model size, GPU access, and framework dependencies that require a slightly different playbook than you'd find in a generic Docker tutorial.
Start small: containerize your next model, push it to Docker Hub, pull it on a different machine, and watch it work identically. That first successful cross-machine reproduction is the moment Docker goes from "infrastructure concern" to "essential tool in your ML workflow." Once you've experienced it, going back to manual environment management will feel like deploying code with FTP.