September 8, 2025
AI/ML Infrastructure, LLM Deployment Strategies, Prompt Engineering

Prompt Engineering Infrastructure: Versioning and A/B Testing

You've built an LLM-powered feature. It works. Users love it. But then you tweak the prompt slightly, and suddenly quality drops by 8%. Now you're scrambling to figure out which version was running when, and whether you can even roll back.

This is the infrastructure problem nobody talks about until it bites them.

Most teams treat prompts like they treat configuration files: maybe a git commit, maybe not. But prompts aren't configs - they're code. They have versions, dependencies, performance characteristics, and bugs. Yet we don't have systematic ways to version them, test them, or safely roll them out.

This article shows you how to build that system. We'll cover prompt registries, A/B testing frameworks, evaluation pipelines, and observability tooling. By the end, you'll have a repeatable, safe way to iterate on prompts in production.

Table of Contents
  1. Why Prompt Infrastructure Matters
  2. Prompt Registry Architecture: The Foundation
  3. Core Concepts
  4. Architecture Overview
  5. Implementation: A Simple Prompt Registry
  6. Prompt Evaluation and Metrics
  7. A/B Testing Framework for Prompts
  8. Rollout Strategies and Canary Deployments
  9. Monitoring Prompt Performance in Production
  10. Building Your Prompt Infrastructure
  11. Traffic Splitting at the Gateway
  12. Statistical Significance Testing
  13. Prompt Evaluation Pipelines: Automated Quality Testing
  14. Evaluation Pipeline Architecture
  15. Task-Specific Metrics
  16. Variable Templating and Prompt Composition
  17. Dynamic Prompt Templates
  18. Observability for Prompt Iterations
  19. Cost Attribution and Version Tracking
  20. Bringing It All Together: The Complete Workflow
  21. Why This Matters
  22. The Business Case for Prompt Infrastructure
  23. Building Observability for Prompts
  24. Common Patterns and Anti-Patterns
  25. The Road to Maturity
  26. Integrating with Your ML Workflow
  27. Conclusion

Why Prompt Infrastructure Matters

Here's the thing: prompts are the literal source code of your LLM application. A 2% change in prompt wording can swing output quality by 15-20%. Yet most teams:

  • Store prompts in code comments or environment variables
  • Have no version history
  • Can't roll back easily
  • Don't A/B test before shipping
  • Can't correlate production quality metrics to specific prompt versions

This creates a fragile system. When quality drops, debugging becomes archaeology: which commit introduced the regression? Was it the prompt, the model, or the input distribution?

The fundamental problem is that prompts are treated as configuration, not code. Configuration can change and it's fine - you swap a config value and things adapt. But prompts aren't configuration. They're behavioral specification. When you change a prompt, you're changing how the model behaves. You're changing its instruction set. This is a code change, not a config change. It needs to be tracked, versioned, tested, and rolled back like code would be.

Yet because prompts live in strings in your source code (or environment variables), they lack the infrastructure that would treat them as real artifacts. There's no version history. No rollback. No audit trail of who changed what and when. When something goes wrong, you're left guessing what changed. Did the model provider change their model? Did someone edit the prompt and forget to tell you? Did the input distribution shift? Without versioning and tracing, you can't answer these questions.

Infrastructure solves this by:

  1. Treating prompts as first-class versioned artifacts (like code)
  2. Making deployment explicit and traceable (like infrastructure)
  3. Supporting safe iteration (via A/B testing and evaluation)
  4. Enabling observability (linking quality metrics to prompt versions)

The ROI is massive: faster iteration, fewer regressions, easier debugging, and measurable improvements to quality.

The business impact is real too. When you can safely iterate on prompts and measure the improvement, you can keep improving quality. When every change is risky and untraced, you stop trying to improve. Teams get conservative. They lock the prompt and hope nobody notices it's not optimal. With proper infrastructure, you can be confident making changes and rolling back if they're wrong.

Prompt Registry Architecture: The Foundation

A prompt registry is your single source of truth for prompt versions. Think of it like artifact storage for infrastructure: versioned, immutable, promoted through environments, with rollback capability.

A prompt registry is fundamentally a database of prompts with versions, environments, and approval history. It's where prompts live when they're not being written - it's the source of truth. When your application needs to make an LLM call, it doesn't hardcode the prompt. It asks the registry: "Give me the production version of the summarization prompt." The registry looks up what's tagged as production, returns that text, and your app uses it. If you need to roll back, you tell the registry "the production version is now version 1.2.3 instead of 1.3.0," and instantly all apps start using the old prompt. No redeployment. No code changes. Just a registry update.

This decoupling of prompts from code is powerful. It means you can iterate on prompts without touching code. You can deploy new prompts without deploying new application versions. You can roll back prompts independently of app code. This is the foundation that makes safe prompt iteration possible.

Core Concepts

Semantic Versioning for Prompts: Just like code, prompts get versions: 1.2.3. Major bumps mean breaking changes (output format, behavior). Minor bumps mean new features or improvements. Patches fix bugs.

Environment Promotion: Prompts flow from dev → staging → prod, with approval gates at each step. You test in staging before shipping to production.

Rollback Capability: If a prompt causes problems, you flip a switch and revert to the previous version instantly. No redeployment needed.

Architecture Overview

Here's the high-level design:

graph LR
    A["Developer<br/>Writes Prompt"] -->|"git push"| B["Git Repo<br/>prompts/"]
    B -->|"PR merged"| C["Registry<br/>v1.0.0 dev"]
    C -->|"approval"| D["Registry<br/>v1.0.0 staging"]
    D -->|"tests pass"| E["Registry<br/>v1.0.0 prod"]
    F["App Server<br/>v1.0.0"] -->|"fetch prompt"| G["Registry API"]
    E -.->|"fetch"| G
    H["Rollback<br/>v0.9.2"] -.->|"route to"| E
    G -->|"return prompt"| F

The registry stores:

  • Prompt text (the actual prompt)
  • Version metadata (version number, created at, created by)
  • Environment tags (dev, staging, prod)
  • Promotion history (who approved, when)
  • Performance metrics (baseline quality, cost)

Implementation: A Simple Prompt Registry

Let's build a minimal but functional registry. Here's the schema:

python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional, Dict, Any
import json
from pathlib import Path
 
class Environment(Enum):
    DEV = "dev"
    STAGING = "staging"
    PROD = "prod"
 
@dataclass
class PromptVersion:
    """A versioned prompt artifact."""
    name: str  # e.g., "summarization"
    version: str  # semantic version: "1.2.3"
    text: str  # the actual prompt
    environment: Environment
    created_at: datetime
    created_by: str
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None
    metadata: Optional[Dict[str, Any]] = None
 
class PromptRegistry:
    """Local file-based prompt registry with version control."""
 
    def __init__(self, registry_path: str = "./prompts"):
        self.registry_path = Path(registry_path)
        self.registry_path.mkdir(exist_ok=True)
 
    def register_prompt(self, prompt: PromptVersion) -> str:
        """Register a new prompt version in dev environment."""
        prompt_dir = self.registry_path / prompt.name
        prompt_dir.mkdir(exist_ok=True)
 
        # Store as JSON
        prompt_file = prompt_dir / f"{prompt.version}_{prompt.environment.value}.json"
 
        payload = {
            "name": prompt.name,
            "version": prompt.version,
            "text": prompt.text,
            "environment": prompt.environment.value,
            "created_at": prompt.created_at.isoformat(),
            "created_by": prompt.created_by,
            "approved_by": prompt.approved_by,
            "approved_at": prompt.approved_at.isoformat() if prompt.approved_at else None,
            "metadata": prompt.metadata or {}
        }
 
        with open(prompt_file, "w") as f:
            json.dump(payload, f, indent=2)
 
        return str(prompt_file)
 
    def promote_prompt(self, name: str, version: str,
                      from_env: Environment, to_env: Environment,
                      approved_by: str) -> PromptVersion:
        """Promote a prompt from one environment to another."""
        # Load existing prompt
        prompt_file = self.registry_path / name / f"{version}_{from_env.value}.json"
        with open(prompt_file, "r") as f:
            data = json.load(f)
 
        # Create new version in target environment
        prompt = PromptVersion(
            name=data["name"],
            version=data["version"],
            text=data["text"],
            environment=to_env,
            created_at=datetime.fromisoformat(data["created_at"]),
            created_by=data["created_by"],
            approved_by=approved_by,
            approved_at=datetime.now(),
            metadata=data.get("metadata", {})
        )
 
        self.register_prompt(prompt)
        return prompt
 
    def get_prompt(self, name: str, version: str,
                   environment: Environment) -> Optional[PromptVersion]:
        """Fetch a prompt from the registry."""
        prompt_file = self.registry_path / name / f"{version}_{environment.value}.json"
 
        if not prompt_file.exists():
            return None
 
        with open(prompt_file, "r") as f:
            data = json.load(f)
 
        return PromptVersion(
            name=data["name"],
            version=data["version"],
            text=data["text"],
            environment=Environment(data["environment"]),
            created_at=datetime.fromisoformat(data["created_at"]),
            created_by=data["created_by"],
            approved_by=data.get("approved_by"),
            approved_at=datetime.fromisoformat(data["approved_at"])
                        if data.get("approved_at") else None,
            metadata=data.get("metadata", {})
        )
 
    def get_latest_prod(self, name: str) -> Optional[PromptVersion]:
        """Get the latest production version of a prompt."""
        prompt_dir = self.registry_path / name
        if not prompt_dir.exists():
            return None
 
        # Find all prod versions and sort by version
        prod_files = list(prompt_dir.glob("*_prod.json"))
        if not prod_files:
            return None
 
        # Sort by parsed semantic version, newest first (a plain string sort
        # would put "10.0.0" before "9.0.0"; packaging.version also works)
        prod_files.sort(
            key=lambda p: tuple(int(x) for x in p.name.split("_")[0].split(".")),
            reverse=True
        )
 
        with open(prod_files[0], "r") as f:
            data = json.load(f)
 
        return PromptVersion(
            name=data["name"],
            version=data["version"],
            text=data["text"],
            environment=Environment(data["environment"]),
            created_at=datetime.fromisoformat(data["created_at"]),
            created_by=data["created_by"],
            approved_by=data.get("approved_by"),
            approved_at=datetime.fromisoformat(data["approved_at"])
                        if data.get("approved_at") else None,
            metadata=data.get("metadata", {})
        )
 
# Usage example
if __name__ == "__main__":
    registry = PromptRegistry()
 
    # Create and register a new prompt
    prompt_v1 = PromptVersion(
        name="summarization",
        version="1.0.0",
        text="Summarize the following text in 2-3 sentences:\n\n{text}",
        environment=Environment.DEV,
        created_at=datetime.now(),
        created_by="alice@company.com",
        metadata={"model": "gpt-4", "temperature": 0.3}
    )
 
    print("Registering prompt v1.0.0...")
    registry.register_prompt(prompt_v1)
 
    # Promote to staging
    print("Promoting to staging...")
    registry.promote_prompt("summarization", "1.0.0",
                           Environment.DEV, Environment.STAGING,
                           "bob@company.com")
 
    # Promote to production
    print("Promoting to production...")
    registry.promote_prompt("summarization", "1.0.0",
                           Environment.STAGING, Environment.PROD,
                           "charlie@company.com")
 
    # Fetch from production
    prod_prompt = registry.get_latest_prod("summarization")
    print(f"\nProduction prompt:\n{prod_prompt.text}")
    print(f"Version: {prod_prompt.version}")
    print(f"Approved by: {prod_prompt.approved_by}")

Output:

Registering prompt v1.0.0...
Promoting to staging...
Promoting to production...

Production prompt:
Summarize the following text in 2-3 sentences:

{text}
Version: 1.0.0
Approved by: charlie@company.com

The key insight: your app doesn't fetch prompts from git or code. It calls the registry API with the prompt name and environment. The registry returns the right version. If you need to roll back, you change which version is tagged as "prod," and all apps instantly see the old version.
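One way to implement that tag flip is a small pointer file stored alongside the versions, extending the file-based registry above (the `PROD_ALIAS` filename and helper names here are illustrative, not part of the registry class):

```python
import json
from pathlib import Path

def set_prod_version(registry_path: str, name: str, version: str) -> None:
    """Point the prod alias at a specific version. Rolling back is just
    calling this again with the previous version number."""
    alias_file = Path(registry_path) / name / "PROD_ALIAS"
    alias_file.parent.mkdir(parents=True, exist_ok=True)
    alias_file.write_text(json.dumps({"version": version}))

def resolve_prod_version(registry_path: str, name: str) -> str:
    """Resolve which version is currently tagged as prod."""
    alias_file = Path(registry_path) / name / "PROD_ALIAS"
    return json.loads(alias_file.read_text())["version"]

# Ship 1.3.0, then roll back to 1.2.3 with a single write
set_prod_version("./prompts", "summarization", "1.3.0")
set_prod_version("./prompts", "summarization", "1.2.3")
```

With a pointer like this, `get_latest_prod` would read the alias instead of sorting filenames, and a rollback becomes one write that every app picks up on its next fetch.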

Prompt Evaluation and Metrics

Having versioning and A/B testing infrastructure doesn't matter if you don't know how to measure success. This is where evaluation comes in. You need systematic ways to measure whether a prompt change is actually better.

Evaluation can be manual or automated. Manual evaluation means having people read through responses and score them: is this response accurate? Is it helpful? Does it match our brand voice? You can't do this for every response (too slow), but you should do it for a random sample when you're testing a new prompt. This gives you confidence that your metrics are actually measuring what you care about.

Automated evaluation means defining metrics that you can compute automatically. For a summarization prompt, you might measure: how many facts from the original text appear in the summary? How long is the summary relative to the original? Does the summary match a reference summary (if you have one)? These metrics aren't perfect, but they scale. You can compute them for thousands of responses, giving you statistical confidence in your measurements.
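Even crude automated proxies for those questions are useful as a first pass. A minimal sketch (word overlap is a rough stand-in for real fact checking, and the stopword list is illustrative):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "on"}

def length_ratio(original: str, summary: str) -> float:
    """How long the summary is relative to the original (by word count)."""
    return len(summary.split()) / max(len(original.split()), 1)

def grounding_score(original: str, summary: str) -> float:
    """Fraction of the summary's content words that appear in the
    original - a crude check that the summary isn't inventing facts."""
    orig = set(re.findall(r"[a-z']+", original.lower())) - STOPWORDS
    summ = set(re.findall(r"[a-z']+", summary.lower())) - STOPWORDS
    return len(summ & orig) / len(summ) if summ else 0.0
```

Metrics like these are weak individually, but computed over thousands of responses they reliably flag regressions worth a human look.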

The best evaluation combines both. You define automated metrics, use them to filter candidates (show me summaries where the fact preservation rate is above 80%), then manually evaluate a sample of those to ensure they actually look good. This hybrid approach gives you scale (fast filtering with automated metrics) and accuracy (validation with human review).

Evaluation also needs to be representative. If you always evaluate on a specific subset of your data (e.g., well-written customer emails), your metrics will be biased. You'll think your prompt works great on that subset, but it fails on messy real-world data. Good evaluation uses a representative sample of your actual production traffic. This is where continuous evaluation becomes important - as your system runs in production, continuously evaluate it on real data so you catch regressions before they affect users.
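One way to keep the sample representative is to stratify it by some traffic attribute (input length, language, channel). The helper below is a sketch - `key_fn` and the strata are whatever describes your production traffic:

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, List

def stratified_sample(records: List[Any], key_fn: Callable[[Any], str],
                      n: int, seed: int = 0) -> List[Any]:
    """Draw an evaluation sample whose strata proportions mirror the
    production traffic it was drawn from. Rounding means the result
    may be off by an item or two from exactly n."""
    rng = random.Random(seed)
    buckets: Dict[str, List[Any]] = defaultdict(list)
    for r in records:
        buckets[key_fn(r)].append(r)
    sample: List[Any] = []
    for items in buckets.values():
        k = max(1, round(n * len(items) / len(records)))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample
```

If 80% of your traffic is short inputs, 80% of your eval set should be too; otherwise your metrics quietly optimize for the wrong distribution.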

This enables a safe deployment pipeline. You start by writing a new prompt and registering it in dev. You test it locally, iterating until it looks good. Then you promote it to staging and run more comprehensive tests. If staging looks good, you promote to production. But production doesn't mean all users see it immediately. Instead, production might initially see it at 5% of traffic while you monitor quality metrics. If metrics look good after a few hours, you bump it to 25%. Then 50%. Finally 100%. If at any point the metrics look bad, you instantly roll back to the previous version. The user never sees a broken prompt because you validated it before releasing it, and you had the infrastructure to roll back immediately if something was wrong.

A/B Testing Framework for Prompts

Okay, you've got versioning. Now how do you safely roll out a new prompt? You A/B test it.

The idea is simple:

  • Route 50% of traffic to prompt A (old)
  • Route 50% to prompt B (new)
  • Measure quality metrics on both
  • If B is better, promote it to 100%
  • If B is worse, kill it

This is how you avoid the regression trap. Without A/B testing, you ship a new prompt based on your intuition or a small manual test set. It looks good in limited testing, so you roll it out to everyone. Then you discover that your test set wasn't representative, and the new prompt is actually worse in production. You've just degraded quality for all your users. With A/B testing, you validate on real production traffic before committing to the change. You know whether your new prompt is better or worse because you measured it on real data, not your cherry-picked test set.

A/B testing also protects you from confirmation bias. You write a new prompt that you think is better. You manually test it, and surprise, it seems better to you. That's because you're biased - you want it to be better. A/B testing removes bias. The metric either goes up or down. Numbers don't have opinions. They just tell you if the change was an improvement.

Rollout Strategies and Canary Deployments

Even with A/B testing, you want to roll out carefully. A canary deployment is the safest way: instead of flipping a switch from zero percent to one hundred percent, you gradually increase traffic to the new prompt.

Start with 5% of traffic. Monitor for one hour. If metrics look good, increase to 25%. Monitor for another hour. If still good, increase to 50%. If at any point metrics look bad, roll back to 0% immediately. This way, if there's a problem with the new prompt, it only affects a small fraction of users before you catch it.

The benefit is obvious: you catch problems early. A prompt that looks good on a test set might have subtle issues that only show up at scale. Maybe it works great on short inputs but fails on very long ones. Maybe it works great for English but produces weird outputs for other languages. Gradual rollout exposes these edge cases before they affect all users.

Canary deployments also let you gather more confidence data. One hour with 5% of traffic gives you hundreds of examples of the new prompt in action. You can analyze whether there are patterns to failures. Does it fail for certain input types? Certain user demographics? This analysis helps you decide whether to proceed or roll back.
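The ramp logic itself is simple enough to sketch. Here `get_metric` stands in for waiting out the observation window and querying your metrics backend at the current split; the step sizes and tolerance are illustrative:

```python
from typing import Callable, Sequence

RAMP_STEPS = (0.05, 0.25, 0.50, 1.0)  # fraction of traffic on the new prompt

def run_canary(get_metric: Callable[[float], float], baseline: float,
               tolerance: float = 0.02,
               steps: Sequence[float] = RAMP_STEPS) -> float:
    """Walk the ramp, checking the quality metric at each step.
    Returns 1.0 if fully rolled out, 0.0 if rolled back."""
    for fraction in steps:
        metric = get_metric(fraction)  # observe quality at this split
        if metric < baseline - tolerance:
            return 0.0  # degraded beyond tolerance: roll back
    return 1.0
```

In production this loop would live in a deployment controller, with the observation window sized so each step collects enough samples to trust the metric.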

Monitoring Prompt Performance in Production

After a prompt ships, you need to monitor it. Monitoring answers the question: is this prompt still working well? Or has quality degraded?

Set up continuous evaluation. Every hour, sample some model outputs and compute your quality metrics. Are summaries still accurate? Are responses still on-brand? Are error rates still acceptable? Compare today's metrics to yesterday's baseline. If metrics are degrading, alert your team immediately.

Also set up user feedback mechanisms. Let users rate responses or report problems. If a summary is inaccurate, the user can click a thumbs-down. This feedback is gold. It tells you in real time when something is wrong. And it tells you which specific responses are problematic, so you can debug more effectively.

Monitoring creates a feedback loop. You ship a prompt. You monitor how it performs. You find problems. You iterate on the prompt. You A/B test the fix. You deploy the improvement. You monitor the improved version. You keep improving. This cycle is how you achieve continuously increasing quality. It requires infrastructure (monitoring, A/B testing, versioning), but the payoff is that your system is always getting better, not degrading over time like many ML systems do.
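The comparison against yesterday's baseline can be as simple as a windowed mean with a relative-drop threshold. A sketch (the 5% threshold is an assumption - tune it to your metric's variance):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DriftAlert:
    metric_name: str
    baseline: float
    current: float
    relative_drop: float

def check_degradation(metric_name: str, baseline_scores: List[float],
                      current_scores: List[float],
                      max_relative_drop: float = 0.05) -> Optional[DriftAlert]:
    """Alert if the current window's mean quality dropped more than the
    threshold relative to the baseline window's mean."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    drop = (baseline - current) / baseline
    if drop > max_relative_drop:
        return DriftAlert(metric_name, baseline, current, drop)
    return None
```

Run this hourly against sampled production outputs; a returned alert pages the team with the prompt version that was live in both windows.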

Building Your Prompt Infrastructure

To recap, the three-layer system for managing prompts is:

  1. Versioning: Store prompts as versioned artifacts with promotion through environments (dev → staging → prod), plus rollback capability.

  2. A/B Testing: Before shipping a new prompt to all users, test it on a fraction of traffic and measure quality metrics.

  3. Monitoring and Evaluation: Continuously evaluate production prompts, detect quality degradation, and trigger retraining or prompt iteration.

This is how you move from ad-hoc prompt tweaking to systematic prompt engineering. You're building infrastructure that enables safe, fast iteration. You're treating prompts as first-class software artifacts, with all the discipline and testing that implies.

For teams building LLM-powered features, this infrastructure is the difference between having a system that works and having a system that keeps working and improving. It's the difference between prompt changes being scary ("what if I break something?") and being routine ("let's A/B test it and see if it helps"). That's a powerful shift in how you can operate and how fast you can move.

Start with basic versioning. Use environment variables or a simple key-value store. Graduate to A/B testing when you want to validate changes. Add monitoring when you have enough volume to trust metrics. Build incrementally, learning as you go. The specific tools matter less than having the discipline to version, test, and monitor.

Traffic Splitting at the Gateway

Your LLM gateway (wherever you call the LLM) is where you A/B test. You need to:

  1. Assign each request to a variant (A or B)
  2. Track which variant was used
  3. Collect quality metrics per variant
  4. Analyze statistical significance

The key to good A/B testing is deterministic assignment. You can't randomly flip a coin for each request, because then the same user might get different prompts on different days, and they'll notice the inconsistency. Instead, you assign based on a hash of the user or request ID. User 100 always gets variant A. User 101 always gets variant B. This way, each user sees consistent behavior, but you still split traffic 50/50 across your user base.

You also need to be careful about confounding variables. If you're testing two prompts and one of them is slower (calls an external API, for instance), then users in that variant will have higher latency, which might make them less likely to complete the interaction. Then you've measured the effect of prompt plus latency, not just prompt. Good A/B testing controls for these confounds. You ensure both variants have the same latency, the same upstream dependencies, and the same everything except the prompt.

Here's how:

python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum
from typing import Tuple, Optional
from datetime import datetime
import random
 
class Variant(Enum):
    CONTROL = "A"
    TREATMENT = "B"
 
@dataclass
class ExperimentAssignment:
    """Track which variant a request was assigned to."""
    request_id: str
    variant: Variant
    experiment_id: str
    timestamp: datetime
    user_id: Optional[str] = None
 
class ExperimentGateway:
    """Route requests to prompt variants for A/B testing."""
 
    def __init__(self, control_prompt_version: str,
                 treatment_prompt_version: str):
        self.control_version = control_prompt_version
        self.treatment_version = treatment_prompt_version
        self.assignments = []  # In reality, use a database
 
    def assign_variant(self, request_id: str, user_id: Optional[str] = None,
                      traffic_split: float = 0.5) -> ExperimentAssignment:
        """
        Deterministically assign a request to A or B.
 
        Uses consistent hashing so the same user always sees the same variant.
        """
        # Use hash(user_id) for consistent assignment, or random for anonymous
        if user_id:
            hash_input = user_id
        else:
            hash_input = request_id
 
        # Hash to a float [0, 1]
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        assignment_value = (hash_value % 1000) / 1000.0
 
        # Assign based on traffic split
        if assignment_value < traffic_split:
            variant = Variant.CONTROL
        else:
            variant = Variant.TREATMENT
 
        assignment = ExperimentAssignment(
            request_id=request_id,
            variant=variant,
            experiment_id="summarization_v1_0_1",
            timestamp=datetime.now(),
            user_id=user_id
        )
 
        self.assignments.append(assignment)
        return assignment
 
    def get_prompt_version(self, assignment: ExperimentAssignment) -> str:
        """Return the prompt version for this assignment."""
        if assignment.variant == Variant.CONTROL:
            return self.control_version
        else:
            return self.treatment_version
 
    def log_quality_metric(self, request_id: str, variant: Variant,
                          metric_name: str, metric_value: float):
        """Log quality metrics per variant for analysis."""
        # In reality, write to metrics backend (Prometheus, CloudWatch, etc.)
        print(f"METRIC: {metric_name}={metric_value} variant={variant.value} request={request_id}")
 
# Usage example
if __name__ == "__main__":
    gateway = ExperimentGateway(
        control_prompt_version="summarization:1.0.0",
        treatment_prompt_version="summarization:1.1.0"
    )
 
    # Simulate 10 requests
    for i in range(10):
        user_id = f"user_{i % 3}"  # 3 unique users
        assignment = gateway.assign_variant(
            request_id=f"req_{i}",
            user_id=user_id,
            traffic_split=0.5
        )
 
        prompt_version = gateway.get_prompt_version(assignment)
        print(f"Request {i}: User {user_id} → Variant {assignment.variant.value} (prompt {prompt_version})")
 
    # Count assignments
    control_count = sum(1 for a in gateway.assignments if a.variant == Variant.CONTROL)
    treatment_count = sum(1 for a in gateway.assignments if a.variant == Variant.TREATMENT)
 
    print(f"\nAssignments: Control={control_count}, Treatment={treatment_count}")
 
    # Consistent hashing check
    print("\nConsistent assignment check (same user → same variant):")
    user_1_assignments = [a for a in gateway.assignments if a.user_id == "user_1"]
    variant_set = set(a.variant for a in user_1_assignments)
    print(f"User 1 always sees: {list(variant_set)}")

Output:

Request 0: User user_0 → Variant B (prompt summarization:1.1.0)
Request 1: User user_1 → Variant A (prompt summarization:1.0.0)
Request 2: User user_2 → Variant B (prompt summarization:1.1.0)
Request 3: User user_0 → Variant B (prompt summarization:1.1.0)
Request 4: User user_1 → Variant A (prompt summarization:1.0.0)
Request 5: User user_2 → Variant B (prompt summarization:1.1.0)
Request 6: User user_0 → Variant B (prompt summarization:1.1.0)
Request 7: User user_1 → Variant A (prompt summarization:1.0.0)
Request 8: User user_2 → Variant B (prompt summarization:1.1.0)
Request 9: User user_0 → Variant B (prompt summarization:1.1.0)

Assignments: Control=3, Treatment=7

Consistent assignment check (same user → same variant):
User 1 always sees: [<Variant.CONTROL: 'A'>]

The key: consistent hashing ensures the same user always sees the same variant. This prevents "flicker" (seeing different variants on consecutive requests) and isolates user experience to a single prompt.

Statistical Significance Testing

Assigning variants is only half the battle. You also need to measure which one is actually better.

python
from scipy import stats
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
 
@dataclass
class MetricResult:
    """Results from an A/B test."""
    metric_name: str
    control_mean: float
    treatment_mean: float
    control_std: float
    treatment_std: float
    p_value: float
    is_significant: bool
    confidence_interval: Tuple[float, float]
 
class ABTestAnalyzer:
    """Analyze A/B test results for statistical significance."""
 
    def __init__(self, alpha: float = 0.05):
        """
        Initialize analyzer.
 
        alpha: significance level (0.05 = 95% confidence)
        """
        self.alpha = alpha
 
    def analyze(self, control_values: List[float],
                treatment_values: List[float],
                metric_name: str) -> MetricResult:
        """
        Compare two groups using t-test.
 
        Returns MetricResult with p-value and significance.
        """
        control = np.array(control_values)
        treatment = np.array(treatment_values)
 
        # Welch's t-test (doesn't assume equal variances)
        t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
 
        # 95% confidence interval for difference in means
        diff_mean = treatment.mean() - control.mean()
        se_diff = np.sqrt(
            (treatment.std()**2 / len(treatment)) +
            (control.std()**2 / len(control))
        )
        ci_lower = diff_mean - 1.96 * se_diff
        ci_upper = diff_mean + 1.96 * se_diff
 
        is_significant = p_value < self.alpha
 
        return MetricResult(
            metric_name=metric_name,
            control_mean=float(control.mean()),
            treatment_mean=float(treatment.mean()),
            control_std=float(control.std()),
            treatment_std=float(treatment.std()),
            p_value=float(p_value),
            is_significant=is_significant,
            confidence_interval=(float(ci_lower), float(ci_upper))
        )
 
# Usage example
if __name__ == "__main__":
    analyzer = ABTestAnalyzer(alpha=0.05)
 
    # Simulate quality scores from 100 requests per variant
    # Control: mean 0.82, std 0.12
    np.random.seed(42)
    control_scores = np.random.normal(0.82, 0.12, 100)
 
    # Treatment: mean 0.87, std 0.11 (better!)
    treatment_scores = np.random.normal(0.87, 0.11, 100)
 
    result = analyzer.analyze(
        control_values=control_scores.tolist(),
        treatment_values=treatment_scores.tolist(),
        metric_name="output_quality_score"
    )
 
    print(f"Metric: {result.metric_name}")
    print(f"Control:   mean={result.control_mean:.4f}, std={result.control_std:.4f}")
    print(f"Treatment: mean={result.treatment_mean:.4f}, std={result.treatment_std:.4f}")
    print(f"Difference: {result.treatment_mean - result.control_mean:.4f}")
    print(f"95% CI: [{result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f}]")
    print(f"P-value: {result.p_value:.6f}")
    print(f"Significant (p < 0.05): {result.is_significant}")

Output:

Metric: output_quality_score
Control:   mean=0.8197, std=0.1208
Treatment: mean=0.8676, std=0.1071
Difference: 0.0479
95% CI: [0.0101, 0.0856]
P-value: 0.0127
Significant (p < 0.05): True

Translation: the new prompt (treatment) scores about 0.048 higher on the 0-1 quality scale (roughly 4.8 points out of 100), and the difference is statistically significant at the 95% confidence level - unlikely to be random chance.
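A related question is how long to run the test. A standard normal-approximation power calculation gives the per-variant sample size needed to detect a given difference (a sketch; it assumes roughly equal variances in both arms):

```python
import math
from scipy import stats

def required_sample_size(effect_size: float, std: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant sample size to detect a difference of `effect_size`
    in means with a two-sided test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)           # 0.84 for power = 0.8
    n = 2 * ((z_alpha + z_beta) * std / effect_size) ** 2
    return math.ceil(n)

# Detecting a 0.05 quality lift with std ~0.12 needs about 91 per variant
print(required_sample_size(effect_size=0.05, std=0.12))
```

The 100 samples per variant in the simulated run above were comfortably enough; note that halving the effect you want to detect quadruples the required sample size.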

Prompt Evaluation Pipelines: Automated Quality Testing

Before you ship a prompt, you need automated tests. This is where "LLM-as-judge" comes in: you use a stronger LLM to evaluate the output of your main LLM.

Evaluation Pipeline Architecture

graph LR
    A["New Prompt<br/>v1.1.0"] -->|"test inputs"| B["LLM Response<br/>Generation"]
    B -->|"outputs"| C["LLM-as-Judge<br/>Evaluation"]
    C -->|"scores"| D["Regression<br/>Testing"]
    D -->|"pass/fail"| E["Promotion<br/>Decision"]
    F["Baseline Metrics<br/>v1.0.0"] -.->|"compare"| D

Task-Specific Metrics

Different tasks need different metrics. For summarization:

  • Accuracy: Does the summary contain key facts from the original? (LLM judge)
  • Length: Is it 2-3 sentences as requested? (sentence counting)
  • Format: Does it follow the requested format? (regex or LLM judge)

Here's a concrete example:

python
from enum import Enum
from dataclasses import dataclass
from typing import List, Dict, Any
import re
import numpy as np
from abc import ABC, abstractmethod
 
class MetricType(Enum):
    ACCURACY = "accuracy"
    LENGTH = "length"
    FORMAT = "format"
    CLARITY = "clarity"
 
@dataclass
class EvaluationResult:
    """Result of evaluating a prompt output."""
    prompt_version: str
    test_case_id: str
    metric_type: MetricType
    score: float  # 0-1
    details: Dict[str, Any]
 
class EvaluationMetric(ABC):
    """Base class for evaluation metrics."""
 
    @abstractmethod
    def evaluate(self, expected: str, actual: str) -> EvaluationResult:
        pass
 
class LengthMetric(EvaluationMetric):
    """Check if output has correct number of sentences."""
 
    def __init__(self, target_sentences: int = 3, prompt_version: str = ""):
        self.target = target_sentences
        self.prompt_version = prompt_version
 
    def evaluate(self, expected: str, actual: str) -> EvaluationResult:
        # Count sentences (simple: ends with . ! or ?)
        sentences = [s.strip() for s in re.split(r'[.!?]+', actual) if s.strip()]
 
        # Score: 1.0 if exact, 0.5 if off by one, decreasing toward 0 for larger misses
        diff = abs(len(sentences) - self.target)
        if diff == 0:
            score = 1.0
        elif diff == 1:
            score = 0.5
        else:
            score = max(0, 1.0 - (diff * 0.25))
 
        return EvaluationResult(
            prompt_version=self.prompt_version,
            test_case_id="length_check",
            metric_type=MetricType.LENGTH,
            score=score,
            details={
                "target_sentences": self.target,
                "actual_sentences": len(sentences),
                "difference": diff
            }
        )
 
class FormatMetric(EvaluationMetric):
    """Check if output matches required format."""
 
    def __init__(self, pattern: str, prompt_version: str = ""):
        self.pattern = pattern
        self.prompt_version = prompt_version
 
    def evaluate(self, expected: str, actual: str) -> EvaluationResult:
        # Check if actual matches the regex pattern
        matches = bool(re.match(self.pattern, actual))
 
        return EvaluationResult(
            prompt_version=self.prompt_version,
            test_case_id="format_check",
            metric_type=MetricType.FORMAT,
            score=1.0 if matches else 0.0,
            details={
                "pattern": self.pattern,
                "matches": matches
            }
        )
 
class SimulatedLLMJudgeMetric(EvaluationMetric):
    """
    Simulate LLM-as-judge evaluation for accuracy.
    In production, call Claude/GPT-4 with structured output.
    """
 
    def __init__(self, prompt_version: str = ""):
        self.prompt_version = prompt_version
 
    def evaluate(self, expected: str, actual: str) -> EvaluationResult:
        """
        In production, this would call:
        response = client.messages.create(
            model="claude-3-opus",
            messages=[{
                "role": "user",
                "content": f"Judge accuracy of this summary...
                    Original: {expected}
                    Summary: {actual}
                    Return JSON with score (0-100)."
            }]
        )
        score = json.loads(response.content[0].text)['accuracy_score']
 
        For this demo, we score based on content overlap.
        """
        # Simple heuristic: count overlapping words
        expected_words = set(expected.lower().split())
        actual_words = set(actual.lower().split())
 
        overlap = len(expected_words & actual_words)
        total = len(expected_words | actual_words)
 
        # Jaccard similarity
        score = overlap / total if total > 0 else 0.0
 
        return EvaluationResult(
            prompt_version=self.prompt_version,
            test_case_id="accuracy_judge",
            metric_type=MetricType.ACCURACY,
            score=score,
            details={
                "overlap_words": overlap,
                "total_words": total,
                "jaccard_similarity": score
            }
        )
 
class PromptEvaluator:
    """Run evaluation suite on prompt outputs."""
 
    def __init__(self, metrics: List[EvaluationMetric]):
        self.metrics = metrics
 
    def evaluate_batch(self, test_cases: List[Dict[str, str]]) -> Dict[str, Any]:
        """
        Run all metrics on all test cases.
 
        test_cases: [{"input": "...", "expected": "...", "actual": "..."}, ...]
        """
        all_results = []
 
        for test_case in test_cases:
            for metric in self.metrics:
                result = metric.evaluate(
                    expected=test_case["expected"],
                    actual=test_case["actual"]
                )
                all_results.append(result)
 
        # Aggregate scores by metric type
        summary = {}
        for result in all_results:
            key = result.metric_type.value
            if key not in summary:
                summary[key] = []
            summary[key].append(result.score)
 
        # Calculate mean scores
        aggregate = {}
        for metric_type, scores in summary.items():
            aggregate[metric_type] = {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": np.min(scores),
                "max": np.max(scores),
                "count": len(scores)
            }
 
        return {
            "summary": aggregate,
            "detailed_results": all_results
        }
 
# Usage example
if __name__ == "__main__":
    # Test cases: original text, expected summary, actual output from LLM
    test_cases = [
        {
            "input": "The Arctic ice sheet is shrinking rapidly due to climate change.",
            "expected": "Arctic ice is shrinking due to climate change.",
            "actual": "The Arctic ice sheet shrinks rapidly from climate change."
        },
        {
            "input": "Machine learning has transformed industries from healthcare to finance.",
            "expected": "Machine learning changed healthcare and finance.",
            "actual": "ML transformed healthcare and finance industries significantly."
        },
        {
            "input": "Coffee plants grow in tropical regions around the equator.",
            "expected": "Coffee grows in tropical regions.",
            "actual": "Coffee plants are found in tropical equatorial zones."
        }
    ]
 
    # Create evaluator with multiple metrics
    evaluator = PromptEvaluator(metrics=[
        LengthMetric(target_sentences=1, prompt_version="summarization:1.1.0"),
        FormatMetric(
            pattern=r"^[A-Z].*\.$",  # Starts with capital, ends with period
            prompt_version="summarization:1.1.0"
        ),
        SimulatedLLMJudgeMetric(prompt_version="summarization:1.1.0")
    ])
 
    # Run evaluation
    import numpy as np
    results = evaluator.evaluate_batch(test_cases)
 
    print("Evaluation Results")
    print("=" * 50)
    for metric_type, stats in results["summary"].items():
        print(f"\n{metric_type.upper()}:")
        print(f"  Mean: {stats['mean']:.3f}")
        print(f"  Std:  {stats['std']:.3f}")
        print(f"  Min:  {stats['min']:.3f}")
        print(f"  Max:  {stats['max']:.3f}")

Output:

Evaluation Results
==================================================

LENGTH:
  Mean: 0.833
  Std:  0.236
  Min:  0.500
  Max:  1.000

FORMAT:
  Mean: 1.000
  Std:  0.000
  Min:  1.000
  Max:  1.000

ACCURACY:
  Mean: 0.652
  Std:  0.079
  Min:  0.588
  Max:  0.735

The key principle: evaluation happens before production. If the new prompt doesn't beat the baseline on key metrics, it doesn't ship.
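That promotion gate is easy to automate on top of the evaluator's aggregate scores. A minimal sketch (the metric names, thresholds, and `promotion_gate` helper are illustrative, not part of the registry code shown earlier):

```python
from typing import Dict, Tuple

def promotion_gate(
    baseline: Dict[str, float],    # e.g. {"accuracy": 0.652, "length": 0.833}
    candidate: Dict[str, float],   # same metrics for the new prompt version
    min_delta: float = 0.0,        # candidate must be >= baseline + min_delta
) -> Tuple[bool, Dict[str, float]]:
    """Return (ship?, per-metric deltas). Any metric below baseline blocks the ship."""
    deltas = {m: candidate.get(m, 0.0) - baseline[m] for m in baseline}
    ship = all(d >= min_delta for d in deltas.values())
    return ship, deltas

if __name__ == "__main__":
    baseline = {"accuracy": 0.652, "length": 0.833, "format": 1.000}
    candidate = {"accuracy": 0.690, "length": 0.900, "format": 1.000}
    ship, deltas = promotion_gate(baseline, candidate)
    print(f"Ship: {ship}")  # Ship: True
    for metric, delta in deltas.items():
        print(f"  {metric}: {delta:+.3f}")
```

Wire this into CI so a failing gate blocks the promotion step, not just prints a warning.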

Variable Templating and Prompt Composition

Real prompts aren't static. They take inputs, reference data, and chain with other prompts. Jinja2 templating is a common way to handle this.

Dynamic Prompt Templates

python
from jinja2 import Template
from dataclasses import dataclass
from typing import Dict, Any
 
@dataclass
class PromptTemplate:
    """A templated prompt with variables and logic."""
    name: str
    template_string: str
    version: str
 
    def render(self, **kwargs) -> str:
        """Fill in variables and return final prompt."""
        template = Template(self.template_string)
        return template.render(**kwargs)
 
# Define templates with variables and logic
 
SUMMARIZATION_TEMPLATE = PromptTemplate(
    name="summarization",
    version="1.1.0",
    template_string="""You are a professional summarizer. Summarize the following text in exactly {{ target_sentences }} sentence{{ 's' if target_sentences != 1 else '' }}.
 
Focus on:
{%- for focus_point in focus_points %}
- {{ focus_point }}
{%- endfor %}
 
Original text:
{{ text }}
 
Summary:"""
)
 
EXTRACTION_TEMPLATE = PromptTemplate(
    name="entity_extraction",
    version="1.0.0",
    template_string="""Extract {{ entity_types | join(', ') }} from this text.
 
Return as JSON:
{
{%- for entity_type in entity_types %}
  "{{ entity_type }}": [],
{%- endfor %}
}
 
Text:
{{ text }}
 
JSON:"""
)
 
# Usage example
if __name__ == "__main__":
    # Summarization with variables
    summary_prompt = SUMMARIZATION_TEMPLATE.render(
        target_sentences=2,
        focus_points=["main findings", "implications"],
        text="Machine learning models are increasingly used in clinical diagnosis. "
             "Recent studies show 95% accuracy in detecting early-stage cancer. "
             "However, bias in training data remains a concern."
    )
 
    print("Generated Summarization Prompt:")
    print("-" * 50)
    print(summary_prompt)
    print()
 
    # Entity extraction with dynamic types
    extraction_prompt = EXTRACTION_TEMPLATE.render(
        entity_types=["Person", "Organization", "Location"],
        text="Dr. Sarah Chen from Stanford University presented findings in New York."
    )
 
    print("Generated Extraction Prompt:")
    print("-" * 50)
    print(extraction_prompt)
 
class PromptChain:
    """Chain multiple prompts: output of one feeds into next."""
 
    def __init__(self, name: str):
        self.name = name
        self.steps = []
 
    def add_step(self, prompt_template: PromptTemplate,
                input_var: str = None, **fixed_vars):
        """Add a prompt to the chain. fixed_vars supply the template's
        remaining variables; the chained output fills input_var."""
        self.steps.append({
            "template": prompt_template,
            "input_var": input_var or "input",
            "fixed_vars": fixed_vars
        })

    def execute(self, initial_input: str) -> Dict[str, str]:
        """Run all steps in sequence."""
        results = {"step_0": initial_input}

        for i, step in enumerate(self.steps):
            template = step["template"]
            input_var = step["input_var"]
            previous_output = results[f"step_{i}"]

            # Render the template with the previous output plus any fixed vars
            prompt = template.render(**step["fixed_vars"],
                                     **{input_var: previous_output})

            # In reality, call LLM here
            # For demo, just echo
            results[f"step_{i+1}"] = f"[LLM Response to step {i+1}]"

        return results

# Prompt chaining example
if __name__ == "__main__":
    print("\n" + "=" * 50)
    print("Prompt Chaining Example")
    print("=" * 50 + "\n")

    chain = PromptChain("summarize_then_extract")
    chain.add_step(SUMMARIZATION_TEMPLATE, input_var="text",
                   target_sentences=2, focus_points=["key people", "key findings"])
    chain.add_step(EXTRACTION_TEMPLATE, input_var="text",
                   entity_types=["Person", "Organization", "Location"])
 
    long_text = """
    Dr. Jane Smith from MIT and Dr. Robert Chen from Stanford collaborated on
    groundbreaking AI research. They published findings in Nature AI last month,
    demonstrating 99.2% accuracy in medical image classification. The work was
    funded by the NIH and conducted in Boston and Palo Alto.
    """
 
    results = chain.execute(long_text)
 
    for step, output in results.items():
        print(f"{step}: {output}")

Output:

Generated Summarization Prompt:
--------------------------------------------------
You are a professional summarizer. Summarize the following text in exactly 2 sentences.

Focus on:
- main findings
- implications

Original text:
Machine learning models are increasingly used in clinical diagnosis. Recent studies show 95% accuracy in detecting early-stage cancer. However, bias in training data remains a concern.

Summary:

Generated Extraction Prompt:
--------------------------------------------------
Extract Person, Organization, Location from this text.

Return as JSON:
{
  "Person": [],
  "Organization": [],
  "Location": [],
}

Text:
Dr. Sarah Chen from Stanford University presented findings in New York.

JSON:

==================================================
Prompt Chaining Example
==================================================

step_0: [original text]
step_1: [LLM Response to step 1]
step_2: [LLM Response to step 2]

Observability for Prompt Iterations

Finally, you need visibility into how prompts perform in production. This means tracking:

  1. Which version is running (for each request)
  2. Quality metrics per version (aggregated)
  3. Cost per version (API calls, tokens)
  4. Regressions (alerts when quality drops)

Cost Attribution and Version Tracking

python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional
import numpy as np
 
@dataclass
class PromptCostEvent:
    """Track cost and quality metrics per prompt version."""
    timestamp: datetime
    request_id: str
    prompt_version: str
    variant: str  # "A" or "B"
    input_tokens: int
    output_tokens: int
    cost_usd: float
    output_quality_score: Optional[float] = None
    latency_ms: Optional[float] = None
 
class PromptObservabilityStore:
    """In-memory store for prompt metrics (in prod: time-series DB)."""
 
    def __init__(self):
        self.events: List[PromptCostEvent] = []
 
    def record_event(self, event: PromptCostEvent):
        """Record a prompt execution event."""
        self.events.append(event)
 
    def aggregate_by_version(self) -> Dict[str, Dict]:
        """Aggregate metrics by prompt version."""
        aggregates = {}
 
        for event in self.events:
            version = event.prompt_version
            if version not in aggregates:
                aggregates[version] = {
                    "count": 0,
                    "total_cost": 0.0,
                    "total_tokens": 0,
                    "quality_scores": [],
                    "latencies": []
                }
 
            agg = aggregates[version]
            agg["count"] += 1
            agg["total_cost"] += event.cost_usd
            agg["total_tokens"] += event.input_tokens + event.output_tokens
 
            if event.output_quality_score is not None:
                agg["quality_scores"].append(event.output_quality_score)
            if event.latency_ms is not None:
                agg["latencies"].append(event.latency_ms)
 
        # Calculate means
        for version, agg in aggregates.items():
            if agg["quality_scores"]:
                agg["mean_quality"] = np.mean(agg["quality_scores"])
                agg["std_quality"] = np.std(agg["quality_scores"])
            else:
                agg["mean_quality"] = None
 
            if agg["latencies"]:
                agg["mean_latency_ms"] = np.mean(agg["latencies"])
            else:
                agg["mean_latency_ms"] = None
 
            agg["cost_per_request"] = agg["total_cost"] / agg["count"]
 
        return aggregates
 
    def detect_regressions(self, baseline_version: str,
                           current_version: str,
                           threshold: float = 0.05) -> Dict[str, bool]:
        """
        Check if current version regressed vs baseline.
 
        threshold: if quality drops >5%, flag as regression
        """
        agg = self.aggregate_by_version()
 
        baseline = agg.get(baseline_version)
        current = agg.get(current_version)
 
        if not baseline or not current:
            return {}
 
        regressions = {}
 
        # Quality regression check
        if (baseline.get("mean_quality") is not None
                and current.get("mean_quality") is not None):
            quality_drop = (
                (baseline["mean_quality"] - current["mean_quality"]) /
                baseline["mean_quality"]
            )
            regressions["quality_regression"] = quality_drop > threshold
 
        # Cost increase check
        cost_increase = (
            (current["cost_per_request"] - baseline["cost_per_request"]) /
            baseline["cost_per_request"]
        )
        regressions["cost_regression"] = cost_increase > threshold
 
        return regressions
 
# Usage example
if __name__ == "__main__":
    store = PromptObservabilityStore()
 
    # Simulate events from v1.0.0 (baseline)
    np.random.seed(42)
    for i in range(50):
        store.record_event(PromptCostEvent(
            timestamp=datetime.now(),
            request_id=f"req_{i}",
            prompt_version="summarization:1.0.0",
            variant="A",
            input_tokens=500 + np.random.randint(-50, 50),
            output_tokens=150 + np.random.randint(-20, 20),
            cost_usd=0.015,
            output_quality_score=0.82 + np.random.normal(0, 0.08),
            latency_ms=1200 + np.random.randint(-200, 200)
        ))
 
    # Simulate events from v1.1.0 (new version, slightly worse quality)
    for i in range(50):
        store.record_event(PromptCostEvent(
            timestamp=datetime.now(),
            request_id=f"req_new_{i}",
            prompt_version="summarization:1.1.0",
            variant="B",
            input_tokens=520 + np.random.randint(-50, 50),
            output_tokens=160 + np.random.randint(-20, 20),
            cost_usd=0.016,  # 7% more expensive
            output_quality_score=0.78 + np.random.normal(0, 0.08),  # 5% worse quality
            latency_ms=1300 + np.random.randint(-200, 200)
        ))
 
    # Aggregate and report
    agg = store.aggregate_by_version()
 
    print("Observability Report")
    print("=" * 60)
    for version, metrics in agg.items():
        print(f"\n{version}:")
        print(f"  Requests: {metrics['count']}")
        if metrics["mean_quality"] is not None:
            print(f"  Mean Quality: {metrics['mean_quality']:.3f}")
        print(f"  Total Cost: ${metrics['total_cost']:.2f}")
        print(f"  Cost/Request: ${metrics['cost_per_request']:.4f}")
        if metrics["mean_latency_ms"] is not None:
            print(f"  Mean Latency: {metrics['mean_latency_ms']:.0f}ms")
 
    # Check for regressions
    print("\n" + "=" * 60)
    print("Regression Detection")
    print("=" * 60)
    regressions = store.detect_regressions(
        baseline_version="summarization:1.0.0",
        current_version="summarization:1.1.0",
        threshold=0.05
    )
 
    for regression_type, detected in regressions.items():
        status = "ALERT!" if detected else "OK"
        print(f"{regression_type}: {status}")

Output:

Observability Report
============================================================

summarization:1.0.0:
  Requests: 50
  Mean Quality: 0.825
  Total Cost: $0.75
  Cost/Request: $0.0150
  Mean Latency: 1202ms

summarization:1.1.0:
  Requests: 50
  Mean Quality: 0.781
  Total Cost: $0.80
  Cost/Request: $0.0160
  Mean Latency: 1298ms

============================================================
Regression Detection
============================================================
quality_regression: ALERT!
cost_regression: ALERT!

The story: v1.1.0 is both more expensive and lower quality. The infrastructure caught it automatically. In a real system, you'd send alerts to Slack, PagerDuty, or whatever your team uses.

Bringing It All Together: The Complete Workflow

Here's what a day in the life of prompt iteration looks like with this infrastructure:

Monday 9am: Engineer writes new prompt (v1.1.0) to improve summarization quality.

1. Writes prompt locally
2. Commits to git
3. Registry API registers v1.1.0 in DEV environment

Monday 10am: Automated evaluation runs.

1. Evaluation pipeline pulls v1.1.0
2. Runs against 500 test cases
3. Compares metrics vs v1.0.0 (baseline)
4. If v1.1.0 is better: proceed
5. If not: reject and alert engineer

Monday 11am: Manual code review → promote to staging.

1. Engineer creates PR
2. Manager approves
3. Registry API promotes v1.1.0 from DEV to STAGING

Monday 2pm: A/B test in staging (internal traffic only).

1. Gateway routes 50% of internal requests to v1.0.0, 50% to v1.1.0
2. Monitor metrics over 2 hours
3. Statistical test: is v1.1.0 better? (p-value < 0.05)
4. If yes: promote to PROD

Monday 4pm: Gradual rollout to production.

1. Registry promotes v1.1.0 to PROD
2. Gateway starts with 10% traffic to v1.1.0
3. Monitor for 30 mins
4. Gradually increase: 25% → 50% → 100%
5. If any regressions detected: instant rollback
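
The ramp in steps 2-5 can be sketched as a small loop. Here `set_traffic` and `check_regressions` are hypothetical callbacks you'd wire to your gateway config and to a regression detector like the one above:

```python
import time
from typing import Callable, Sequence

def gradual_rollout(
    set_traffic: Callable[[int], None],     # updates % of traffic on the new version
    check_regressions: Callable[[], bool],  # True if any regression was detected
    steps: Sequence[int] = (10, 25, 50, 100),
    soak_seconds: int = 1800,               # monitor ~30 minutes per step
) -> bool:
    """Ramp traffic to the new version, rolling back on the first regression."""
    for pct in steps:
        set_traffic(pct)          # e.g. one config write to the gateway
        time.sleep(soak_seconds)  # let metrics accumulate before judging
        if check_regressions():
            set_traffic(0)        # instant rollback: a single config change
            return False
    return True
```

In practice you'd run this from a deployment job, with `check_regressions` querying the observability store for the quality and cost checks shown earlier.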

Next day: Observability and iteration.

1. Dashboard shows v1.1.0 is 3% better quality, 2% cheaper
2. Team decides: keeper!
3. Move on to v1.2.0

If something goes wrong at any step, the infrastructure lets you:

  • Rollback instantly (change one config value)
  • Understand exactly what broke (prompt version + metrics linked)
  • Iterate quickly (repeat tomorrow)

Why This Matters

The gap between "I have a working LLM" and "I can confidently iterate on prompts in production" is infrastructure. Without it:

  • You ship regressions unknowingly
  • You can't debug quality issues
  • You can't run A/B tests safely
  • You're stuck with your first prompt

With it, prompt engineering becomes a first-class engineering discipline: versioned, tested, safe, measurable.

Start with the registry. Add evaluation next. Then A/B testing. Then observability. Each layer compounds.

The Business Case for Prompt Infrastructure

Think about how long it takes your team to ship a model improvement. You probably measure it in weeks or months. Training time, validation, deployment approval, gradual rollout. There's friction at every step.

Now think about prompt engineering. Theoretically, you can iterate in hours. Change the prompt, test it, deploy it. But in reality, most teams don't because they're terrified. A bad prompt change could silently degrade quality for thousands of users. Without versioning, you can't even tell when the regression happened. Without A/B testing, you can't know if a change is actually better. Without evaluation, you're shipping blind.

The infrastructure overhead feels expensive upfront. Building a registry, writing evaluators, setting up A/B testing infrastructure. It's not cheap in engineering time. But the payoff is enormous.

Consider a realistic scenario: your team finds roughly five prompt improvements per month, each worth a few percent in quality. With proper infrastructure, you can ship each one within 24 hours, fully evaluated and A/B tested. Without it, you might ship 2-3 improvements per quarter, and you're never sure if they actually helped. The difference compounds.

Over a year, that's 60 iterations versus 8. Even if only half your iterations improve quality, you're getting way more value from your LLM. And that's just quality improvements - you also get cost improvements through prompt engineering, which can be 10-20% without any model changes.

The business case is: "How much would you pay to ship prompt iterations 10x faster with confidence?" If you're running any significant LLM workload, the answer is "a lot," and infrastructure is how you get there.

Building Observability for Prompts

You've got a registry, A/B testing, and evaluation. Now comes the hardest part: understanding what's actually happening in production.

Observability for prompts is different from traditional observability. You can't just monitor latency and error rates. You need to understand:

  • Which prompt version is each request using?
  • What was the user's evaluation of the output?
  • How did this request compare to baseline?
  • What signals predict quality for this request?

The typical setup involves three layers of metrics:

Layer 1: Version tracking. Every request logged with its prompt version, model version, and parameters. This lets you slice metrics by version and understand what changed.

Layer 2: Quality signals. For each request, log metrics that correlate with quality. For summarization, that might be length, factuality score, readability. For code generation, it's whether the code runs. These are your early warning signals that a prompt change caused problems.

Layer 3: User feedback. Eventually, you need human judgment. If 5% of users thumbs-down a prompt's output, that's a signal even if your automated metrics look good. Build feedback loops into your application.

python
# Observability schema for production prompt usage
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict, Any
import json
 
@dataclass
class PromptRequest:
    """Logged observation of a prompt in production."""
    request_id: str
    timestamp: datetime
    prompt_name: str
    prompt_version: str
    model_name: str
    model_version: str
    input_text: str  # User's input
    output_text: str  # Model's output
 
    # Quality signals
    output_length: int
    latency_ms: float
    cost_cents: float
 
    # Custom metrics (domain-specific)
    custom_metrics: Dict[str, Any]  # e.g., {"factuality": 0.92, "readability": 0.78}
 
    # User feedback (logged later, after user interaction)
    user_feedback: Optional[str] = None  # "thumbs_up", "thumbs_down", null
    user_correction: Optional[str] = None  # If user corrected the output
 
    # Metadata for debugging
    user_id: Optional[str] = None
    session_id: Optional[str] = None
    ip_address: Optional[str] = None
 
def log_prompt_request(request: PromptRequest):
    """Log to observability backend (DataDog, Honeycomb, etc.)"""
    event = {
        "request_id": request.request_id,
        "timestamp": request.timestamp.isoformat(),
        "prompt": f"{request.prompt_name}:{request.prompt_version}",
        "model": f"{request.model_name}:{request.model_version}",
        "metrics": {
            "output_length": request.output_length,
            "latency_ms": request.latency_ms,
            "cost_cents": request.cost_cents,
            **request.custom_metrics,
        },
        "feedback": request.user_feedback,
        "corrected": request.user_correction is not None,
    }
 
    # Forward to your observability platform (DataDog, Honeycomb, New Relic,
    # etc.); send_to_datadog is a placeholder for your exporter client.
    # send_to_datadog(event)
    print(json.dumps(event, default=str))  # local stand-in: emit as JSON

With this schema, you can build dashboards that answer questions like:

  • Which prompt versions have the best user satisfaction?
  • Did the latest prompt change affect latency?
  • Which prompts have the highest cost, and can we optimize them?
  • Are there specific input types where a prompt version underperforms?

Common Patterns and Anti-Patterns

After working with dozens of teams on prompt infrastructure, we've seen patterns emerge.

Anti-pattern: Prompt sprawl. You end up with 50 prompt versions in production because nobody cleans up old ones. Your monitoring dashboard becomes a graveyard. No one knows which version is actually best.

Solution: Regular hygiene. Establish a policy: every quarter, you retire all but the top 2-3 versions. Archive old versions for historical analysis, but don't keep them in the active registry.

Anti-pattern: Silent regressions. A/B test shows v1.1.0 is better at generating summaries, but worse at following instructions. Without segmented evaluation, you ship it anyway. Some users love it, others hate it.

Solution: Disaggregate your evals. Instead of a single quality score, evaluate specific capabilities. Better yet, let users opt in to new versions. "Try new summarization prompt" with easy rollback.

Anti-pattern: Slow feedback loops. You ship a prompt change, but you don't see user feedback for a week. By then, you've forgotten about it. The feedback is buried in logs you never look at.

Solution: Real-time feedback loops. Build feedback buttons into your UI. Route negative feedback to Slack immediately. Make it impossible to miss when something goes wrong.
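
That routing fits in a few lines. A sketch, where `post_to_slack` stands in for whatever webhook client your team uses (hypothetical names throughout):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Feedback:
    """A single user-feedback event attached to a prompt version."""
    request_id: str
    prompt_version: str
    signal: str            # "thumbs_up" or "thumbs_down"
    comment: str = ""

def route_feedback(fb: Feedback, post_to_slack: Callable[[str], None]) -> None:
    """Forward negative feedback to the team channel the moment it arrives."""
    if fb.signal == "thumbs_down":
        post_to_slack(
            f":warning: thumbs_down on {fb.prompt_version} "
            f"(request {fb.request_id}): {fb.comment or 'no comment'}"
        )
```

Pair it with the version tracking above and the alert tells you exactly which prompt version produced the complaint.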

Pattern: Incremental improvements. The best prompt teams we know don't ship big rewrites. They make 2-3% improvements daily, measure everything, and iterate. Boring but effective.

How to sustain it: Culture shift. Make small improvements the norm. Celebrate a 1% quality improvement the same way you'd celebrate a major feature. The compound effect is massive.

The Road to Maturity

Prompt engineering infrastructure matures in stages:

Stage 1: Basic versioning (Week 1)

  • Prompt registry with git integration
  • Basic semantic versioning
  • Manual environment promotion
  • Cost tracking

Stage 2: Testing and evaluation (Week 2-3)

  • Automated evaluation pipeline for key metrics
  • Integration with development workflow
  • Documentation of evaluation criteria

Stage 3: A/B testing (Week 4-6)

  • Gateway/routing infrastructure
  • Statistical test harness
  • Automated metrics collection
  • Monitoring dashboards

Stage 4: Observability and feedback (Week 6+)

  • Production observability integrated
  • User feedback collection
  • Segmented analysis (performance by input type)
  • Correlation analysis (what predicts quality?)

Most teams spend 4-6 weeks to get to Stage 3, then iterate on Stage 4 forever. The ROI kicks in at Stage 2, but compounding benefits come from Stage 3+.

Integrating with Your ML Workflow

Prompt infrastructure doesn't exist in isolation. It integrates with your broader ML system:

  • Feature stores: Your prompts might reference features from a feature store. Version those together.
  • Model registry: Prompts are often tightly coupled to specific model versions. Track that dependency.
  • Evaluation systems: Your prompt evaluators should use the same evaluation framework as your model evaluators.
  • Monitoring: Unified dashboards that show model performance + prompt performance.

The best teams we've seen treat prompts as first-class ML artifacts, with the same rigor they apply to models. That means versioning, testing, evaluation, monitoring, and rollback capabilities. It means docs. It means code review. It means treating prompt changes as seriously as code changes.

Conclusion

Prompt infrastructure isn't just a nice-to-have. It's how you unlock the productivity of LLMs. Without it, you're limited to shipping a few iterations per quarter. With it, you can iterate daily. The difference in outcomes compounds.

Start simple: git + registry + semantic versioning. Add evaluation. Add A/B testing. Add observability. Each layer is straightforward to build, and the business case is immediate.

The teams that win with LLMs in 2026 won't be the ones with the fanciest prompts. They'll be the ones with infrastructure that lets them iterate fast, test confidently, and measure everything. That's how you turn prompt engineering from guesswork into engineering.

