The objective of an MLOps team is to automate the deployment of ML models into the core software system without any manual intervention.
That's the ideal. The reality? Most ML projects die in Jupyter notebooks. They work perfectly on a data scientist's machine and fall apart the moment someone asks "can we put this in production?"
MLOps is the discipline that bridges that gap. It treats machine learning as what it actually is: software. Software that needs versioning, testing, deployment, and monitoring. Software that breaks when you're not watching.
Three Phases, One Pipeline
The MLOps lifecycle breaks into three broad phases:
- Design: Business understanding, data exploration, defining what success looks like. This is where you decide if ML is even the right tool.
- Experimentation: The notebook phase. Feature engineering, model selection, hyperparameter tuning. Iterative, messy, creative.
- Operations: Production. Where DevOps practices—testing, versioning, continuous delivery, monitoring—transform experiments into reliable systems.
The key insight: these phases aren't sequential. They're interconnected loops. Design decisions propagate to experimentation, which constrains deployment options, which feeds back into design.
The Automation Ladder
Not all MLOps is created equal. There's a maturity model, and it's honest about the progression:
Level 0: Manual — Everything runs by hand. Jupyter notebooks, manual data prep, manual model training, manual deployment. This is where most teams start. It's fine for prototyping. It's death at scale.
Level 1: ML Pipeline Automation — Model training is automated. New data triggers retraining. Data and model validation happen without human intervention. This is where reliability starts.
Level 2: CI/CD Pipeline Automation — The full monty. Automated building, testing, and deployment of data pipelines, model pipelines, and application code. Changes flow from commit to production automatically.
# What Level 2 looks like in practice
git push origin main
# Automatically triggers:
# 1. Build pipeline components
# 2. Run tests (data, model, infrastructure)
# 3. Deploy pipeline to staging
# 4. Train model on production-like data
# 5. Validate model meets thresholds
# 6. Deploy model to production
# 7. Monitor for drift
The Continuous Quartet
MLOps extends DevOps's familiar CI/CD pair with two practices unique to ML, giving four continuous processes in total:
- Continuous Integration (CI): Test code, data, and models together. Not just unit tests—data validation, feature tests, model quality gates.
- Continuous Delivery (CD): Ship the prediction service automatically. Model goes from trained to serving in minutes, not weeks.
- Continuous Training (CT): The ML-specific practice. Models don't just deploy once—they retrain automatically on new data, adapting as the world changes.
- Continuous Monitoring (CM): Watch production data and model performance. Detect drift before users complain. Tie technical metrics to business outcomes.
CT is the game-changer. Traditional software doesn't improve itself. ML systems can—if you build the pipeline right.
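What "retrain automatically" means in practice is usually a trigger: monitoring compares current quality to the score recorded at deployment, and a sufficient drop kicks off the training pipeline. Here is a minimal sketch of that trigger logic; the baseline/current scores would come from your own evaluation job, and the 5% tolerance is an illustrative threshold, not a standard.

```python
def should_retrain(baseline_f1: float, current_f1: float,
                   tolerance: float = 0.05) -> bool:
    """Trigger retraining when quality drops more than `tolerance`
    (relative) below the score recorded at deployment time."""
    if baseline_f1 <= 0:
        return True  # no valid baseline: retrain to establish one
    drop = (baseline_f1 - current_f1) / baseline_f1
    return drop > tolerance

# Example: model deployed at F1 = 0.90, now scoring 0.84 on fresh data
assert should_retrain(0.90, 0.84)       # 6.7% relative drop exceeds 5%
assert not should_retrain(0.90, 0.87)   # 3.3% drop is within tolerance
```

In a real CT setup this check runs on a schedule (or on each monitoring batch), and a `True` result publishes an event that the pipeline orchestrator consumes.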
Versioning: The Three Amigos
In traditional software, you version code. In ML, you version three things:
- Code: Training scripts, feature engineering, serving logic. Git handles this.
- Data: Training datasets, validation splits, feature stores. Tools like DVC or Delta Lake.
- Models: Trained artifacts, hyperparameters, metadata. MLflow, Weights & Biases, or cloud registries.
Why does this matter? Because models change for reasons that have nothing to do with code:
- New training data arrives
- Training approach improves
- Models degrade over time (concept drift)
- Compliance requires audit trails
- Rollback needs a previous version
Without versioning, you're flying blind. You can't reproduce results. You can't debug production issues. You can't prove to regulators what your model was doing six months ago.
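Tying the three versions together can be as simple as writing one record per training run that names the code commit, fingerprints the dataset, and captures the hyperparameters. This standard-library sketch shows the idea; real setups delegate to Git, DVC, and a model registry, and the field names here are illustrative.

```python
import hashlib
import json

def run_record(code_commit: str, data_bytes: bytes, params: dict) -> dict:
    """Fingerprint one training run: code version, data hash, hyperparams."""
    return {
        "code": code_commit,                                  # e.g. from `git rev-parse HEAD`
        "data": hashlib.sha256(data_bytes).hexdigest()[:12],  # dataset fingerprint
        "params": params,
    }

record = run_record("a1b2c3d", b"user_id,clicks\n1,42\n", {"lr": 0.01})
print(json.dumps(record))  # persist this alongside the trained artifact
```

Given a production incident six months later, that record is what lets you answer "which code, which data, which settings" in one lookup.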
Testing: Three Rings of Validation
ML testing splits into three scopes:
Features and Data Tests: Does incoming data match expected schemas? Do features have predictive power? Are we complying with GDPR? Unit tests for feature engineering code catch bugs before they become models.
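A schema check is the simplest of these tests: reject a batch before it ever reaches training. A hedged sketch, with invented column names and types standing in for your real feature schema:

```python
EXPECTED_SCHEMA = {"user_id": int, "session_length_s": float, "country": str}

def validate_batch(rows: list[dict]) -> list[str]:
    """Return a list of schema violations; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        missing = EXPECTED_SCHEMA.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in EXPECTED_SCHEMA.items():
            if col in row and not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} should be {typ.__name__}")
    return errors

good = [{"user_id": 1, "session_length_s": 12.5, "country": "DE"}]
bad = [{"user_id": "1", "session_length_s": 12.5}]
assert validate_batch(good) == []
assert len(validate_batch(bad)) == 2  # missing column + wrong type
```

Production-grade versions of this live in libraries like Great Expectations or TFX Data Validation, but the gate they implement is exactly this shape.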
Model Development Tests: Does the model actually solve the business problem? This is subtler than it sounds—loss metrics (MSE, log-loss) need to correlate with business impact (revenue, engagement). A model that optimizes the wrong metric is worse than no model.
# Model staleness test: how often should we retrain?
# load_model, get_recent_data, and evaluate are stand-ins for
# your own registry, data, and evaluation helpers.
age_vs_quality = []
current_data = get_recent_data()  # one fresh evaluation set for all ages
for age_days in [7, 14, 30, 60, 90]:
    old_model = load_model(f"model_{age_days}d_ago")
    quality = evaluate(old_model, current_data)
    age_vs_quality.append((age_days, quality))
# If quality drops >5% after 30 days, the retraining
# schedule should be weekly, not monthly
Infrastructure Tests: Does the pipeline work end-to-end? Can we reproduce training? Does serving handle load? These are classic software tests applied to ML infrastructure.
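Reproducibility is the easiest infrastructure test to automate: training twice with the same seed must produce identical results. This toy sketch stands in for a real pipeline, where you would compare checkpoint hashes rather than a single number; the "model" here is just a seeded sample and a mean.

```python
import random

def train(seed: int) -> float:
    rng = random.Random(seed)            # all randomness flows from one seed
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    rng.shuffle(data)
    return sum(data[:100]) / 100         # stand-in for a learned parameter

def test_training_is_reproducible():
    assert train(seed=42) == train(seed=42)   # same seed, bit-identical result
    assert train(seed=42) != train(seed=43)   # different seed, different run

test_training_is_reproducible()
```

If this test fails in a real pipeline, the usual culprits are unseeded data shuffles, nondeterministic GPU kernels, or dependency version drift between runs.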
The Stack
A mature MLOps setup needs several components working together:
- Source Control: Git for code, with extensions for data/model versioning
- CI/CD Services: GitHub Actions, GitLab CI, Jenkins—build, test, deploy
- Model Registry: Central storage for trained models with metadata
- Feature Store: Precomputed features for training and serving consistency
- Metadata Store: Track experiments—parameters, metrics, lineage
- Pipeline Orchestrator: Airflow, Kubeflow, or cloud equivalents—automate the steps
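At its core, a pipeline orchestrator runs named steps in dependency order and records what ran. This toy sketch shows that skeleton; it is not any real orchestrator's API, just the shape Airflow or Kubeflow DAGs reduce to.

```python
def run_pipeline(steps):
    """Execute (name, fn) steps in order, passing each output forward."""
    artifact, log = None, []
    for name, fn in steps:
        artifact = fn(artifact)
        log.append(name)  # lineage: which steps produced this artifact
    return artifact, log

steps = [
    ("extract", lambda _: [3, 1, 2]),
    ("transform", lambda xs: sorted(xs)),
    ("train", lambda xs: sum(xs) / len(xs)),
]
result, log = run_pipeline(steps)
assert result == 2.0
assert log == ["extract", "transform", "train"]
```

Real orchestrators add what this sketch omits: retries, scheduling, parallel branches, and persisted lineage metadata.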
The good news: in 2026, most of this is solved. AWS SageMaker, Azure ML, Vertex AI provide integrated stacks. Open-source options like MLflow + Kubernetes give you control. The tools exist. The challenge is cultural.
The Cultural Shift
MLOps isn't primarily a tooling problem. It's a collaboration problem.
Data scientists optimize for model accuracy. Operations teams optimize for system reliability. These goals can conflict. A more accurate model might be slower, harder to debug, or require more frequent retraining.
MLOps forces these groups to speak the same language. Model performance isn't just F1 score—it's F1 score plus latency, availability, and drift rate. Deployment isn't "it works on my machine"—it's reproducible builds, canary releases, automated rollbacks.
The teams that get this right don't just ship models faster. They ship models that keep working.
The Real Test
Here's a simple question to assess your MLOps maturity:
If your best data scientist quit today, could you still retrain and deploy your production models tomorrow?
If the answer is no, you have work to do. The knowledge locked in notebooks is technical debt. MLOps is how you pay it down.
Ship models like you ship code. Test like you mean it. Monitor like things will break—because they will. That's MLOps.