Deploying ML Models at Scale: A Production Checklist
Moving a machine learning model from a Jupyter notebook to a production system that handles millions of requests is a different engineering problem entirely. Here's the checklist we use.
From Notebook to Production
The gap between a working ML model and a reliable production system is wider than most teams anticipate. A model with 94% accuracy in a notebook can become a source of silent failures, latency spikes, and budget surprises in prod.
This is the checklist we run through before every ML deployment.
1. Containerize the Inference Environment
Never rely on "it works on my machine." Package your model and its dependencies into a Docker image.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model/ ./model/
COPY app.py .
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Key rules:
- Pin all dependency versions in requirements.txt
- Use multi-stage builds to keep image size minimal
- Never bake secrets into the image — use environment variables or a secrets manager
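On the application side, the last rule can be backed by a startup check that reads each secret from the environment and fails fast when one is missing. The following is a minimal sketch: the helper `require_secret`, the exception class, and the variable name `MODEL_STORE_TOKEN` are hypothetical and only illustrate the pattern.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name, environ=os.environ):
    """Return the secret stored under `name`, or fail fast if it is unset or blank."""
    value = environ.get(name, "").strip()
    if not value:
        # Only the variable name goes into the message, never a secret value.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value
```

Calling `require_secret("MODEL_STORE_TOKEN")` once at startup means a misconfigured container crashes at boot rather than on the first request that needs the credential.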
2. Version Your Models, Not Just Your Code
Model artifacts are first-class deployable units. Treat them like software releases.
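import mlflow
import mlflow.sklearn

# `model` is assumed to be a fitted scikit-learn estimator from your training code.
with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    mlflow.log_metrics({"accuracy": 0.943, "f1": 0.938})
    # Log the artifact and register it in the model registry in one call.
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="churn-predictor",
    )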
Use MLflow, DVC, or a model registry in your cloud provider. The goal: any team member can reproduce any inference result from any point in time.
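Whichever registry you choose, reproducibility depends on being able to tie a prediction back to the exact artifact that produced it. One lightweight way to do that, sketched here with only the standard library, is to fingerprint the artifact file and store the digest alongside the version. The function names and manifest fields are assumptions made for illustration, not part of MLflow or DVC.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifact, model_name, version, params):
    """Describe one model release so a prediction can be traced back to it."""
    artifact = Path(artifact)
    return {
        "model_name": model_name,
        "version": version,
        "artifact": artifact.name,
        "sha256": fingerprint(artifact),
        "params": dict(params),
    }
```

Logging the `version` and `sha256` fields with every inference response is usually enough to answer "which model produced this?" months later.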
3. Design for Blue-Green Deployments
Never hard-cut traffic from v1 to v2. Use blue-green or canary releases to validate new model versions under real load.
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-api-green
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-api
      slot: green
  template:
    metadata:
      # Pod labels must match spec.selector, or the API server rejects the Deployment.
      labels:
        app: ml-api
        slot: green
    spec:
      containers:
        - name: api
          image: your-registry/ml-api:v2.1.0
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
Shift 10% of traffic to green, monitor error rates and latency for 30 minutes, then gradually increase to 100%.
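The "monitor, then increase" step is easier to apply consistently when the promotion criteria are written down as code. Below is a rough sketch of such a gate. The `SlotMetrics` type, the function name, and all three thresholds (minimum sample size, allowed error-rate increase, allowed p95 latency ratio) are assumptions chosen for illustration; tune them to your own service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotMetrics:
    """Aggregated metrics for one deployment slot over the observation window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(blue, green, min_requests=1000,
                    max_error_rate_increase=0.005, max_latency_ratio=1.2):
    """Return "hold", "rollback", or "promote" for the green slot."""
    if green.requests < min_requests:
        return "hold"  # not enough traffic yet to judge
    if green.error_rate > blue.error_rate + max_error_rate_increase:
        return "rollback"  # green fails noticeably more often than blue
    if green.p95_latency_ms > blue.p95_latency_ms * max_latency_ratio:
        return "rollback"  # green is noticeably slower than blue
    return "promote"
```

Running this check at the end of each observation window keeps the decision to widen traffic from depending on who happens to be watching the dashboard.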
4. Monitor Beyond Accuracy
Production monitoring is not just about HTTP 500s. ML systems degrade silently when input data distributions shift.
Set up three layers of monitoring:
Infrastructure layer: CPU, memory, request latency (p50/p95/p99), error rates — use Prometheus + Grafana.
Model layer: prediction distribution, confidence score histogram, feature drift — use Evidently AI or a custom pipeline.
Business layer: downstream KPI impact (e.g., did churn rate actually decrease after deploying the new model?).
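# Import paths match the Evidently 0.4.x Report API; they have changed across
# releases, so check the documentation for your installed version.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# training_df and production_df are pandas DataFrames with the same feature columns.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=training_df, current_data=production_df)
report.save_html("drift_report.html")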
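If you go the custom-pipeline route instead, a per-feature drift score can be computed with very little code. The sketch below uses the population stability index (PSI), which is one common drift statistic and not necessarily what Evidently computes by default. The function name, the bin count, and the floor applied to empty bins are assumptions.

```python
import math
from bisect import bisect_right
from statistics import quantiles


def psi(reference, current, bins=10):
    """Population stability index of `current` measured against `reference`."""
    # Bin edges are reference quantiles, so each bin holds about 1/bins of the reference.
    edges = quantiles(reference, n=bins)  # yields bins - 1 cut points

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A PSI near zero means the two samples fill the bins in the same proportions. A widely quoted rule of thumb treats values above roughly 0.2 as worth investigating, but the right alert threshold depends on the feature and should be validated on your own data.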
5. Build a Rollback Playbook Before You Need It
Write the rollback procedure before deploying. It should be a single command that completes in under two minutes.
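#!/usr/bin/env bash
# rollback.sh
set -euo pipefail  # stop at the first failed command
# Point the Deployment back at the last known-good image, then wait for the rollout.
kubectl set image deployment/ml-api api=your-registry/ml-api:v2.0.0
kubectl rollout status deployment/ml-api --timeout=120s
echo "Rollback complete. Monitor dashboards."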
Test this rollback in staging. The worst time to figure out how to roll back is at 2 a.m. during an incident.
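A staging drill is more useful when it also checks the time budget. As a rough sketch, the helper below runs any command and reports whether it exited successfully within a limit. The name `timed_run` and the 120-second default are assumptions for illustration.

```python
import subprocess
import time


def timed_run(cmd, budget_seconds=120.0):
    """Run `cmd` and return (elapsed_seconds, ok).

    `ok` is True only if the command exited with status 0 before the budget ran out.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, timeout=budget_seconds)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False  # subprocess.run kills the child process on timeout
    return time.monotonic() - start, ok
```

In a staging job this could be invoked as `timed_run(["bash", "rollback.sh"])`, failing the drill whenever `ok` is false.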
Final Checklist Summary
- Docker image with pinned deps and no hardcoded secrets
- Model versioned in a registry with a reproducible training run
- Blue-green or canary deployment strategy
- Infrastructure + model + business monitoring in place
- Rollback playbook documented and tested
- Load tested at 2× expected peak traffic
- Runbook written and linked from the service's README
Ship with confidence, not hope.