Adapting the Karpathy Loop for Production System-Prompt Optimization

A field report on closed-loop prompt improvement: what I built, what broke, and what I learned.


Background: The Karpathy Loop

In early 2025, Andrej Karpathy described a simple but powerful pattern for autonomous software improvement he called "autoresearch." The core idea: give an LLM agent a frozen evaluation harness, a bounded editable artifact, a natural-language specification of what "better" means, and a ratchet that commits on improvement and reverts on regression. Run it overnight. Collect improvements.

The pattern has five components:

  1. Frozen harness — a benchmark suite the agent cannot modify.
  2. Editable artifact — a single, bounded surface the agent may change.
  3. Human spec — natural-language goals, constraints, and stopping criteria.
  4. Scalar metric — one number per run, monotone in quality.
  5. Ratchet — accept on strict improvement, revert on regression, always log.

The power of the pattern is its honesty: the only way to "win" is to genuinely improve the metric. There is no prompt-injection path into the benchmark, no way to game a score the agent did not write.
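The whole control flow fits in a dozen lines. A minimal sketch, assuming hypothetical run_benchmark, propose_candidate, and log_experiment helpers:

def ratchet_loop(baseline: str, iterations: int) -> str:
    """Accept on strict improvement, revert on regression, always log."""
    best_prompt = baseline
    best_score = run_benchmark(baseline)            # frozen harness (hypothetical)
    for i in range(iterations):
        candidate = propose_candidate(best_prompt)  # agent edits the artifact
        score = run_benchmark(candidate)
        accepted = score > best_score               # strict improvement only
        log_experiment(i, score, accepted)          # always log
        if accepted:
            best_prompt, best_score = candidate, score  # commit
        # else: nothing persists; implicit revert
    return best_prompt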

I adapted this pattern to automatically improve the system prompt driving a structured-application generation pipeline. This post describes the architectural choices I made, the five non-obvious problems I hit, and how I fixed each one.


The Use Case: Closed-Loop System-Prompt Optimization

My system takes a natural-language product description and generates a structured application graph — entities, fields, actions, workflows, lifecycle states — as a deterministic compiled artifact. Output quality depends heavily on a generation system prompt: a 12,000–32,000 character instruction document that tells the LLM what to produce and how.

The generation prompt is the highest-leverage editable surface in my stack. A 5% improvement in entity coverage propagates to every application generated from that point forward, for every user, with zero additional effort. This is exactly the kind of artifact the Karpathy loop was designed for.

My benchmark harness runs the active prompt against ten reference playbooks (CRM, issue tracker, recruiting pipeline, field service, and others) and computes a composite score:

composite = 0.45 × compile_pass_rate
          + 0.45 × (mean_eg_score / 100)
          + 0.10 × gate_pass_rate

Where:

  • compile_pass_rate — fraction of structural benchmark checks passing across all reference apps.
  • mean_eg_score — a 0–100 enterprise-quality score averaged over the ten reference applications.
  • gate_pass_rate — fraction of apps passing all quality gates without requiring LLM-driven repair.

A fourth sub-metric, semantic_f1, is measured and guarded with a hard floor (≥ 0.90) but excluded from the composite. On my static benchmark corpus it was prompt-insensitive — consistently 1.000 — contributing no optimization signal. Including a flat metric in the objective only dilutes the weights that carry actual gradient.
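In code, the objective plus the guard looks roughly like this. A sketch; HarnessResult is a hypothetical container for the sub-metrics:

from dataclasses import dataclass

@dataclass
class HarnessResult:          # hypothetical container for one harness run
    compile_pass_rate: float
    mean_eg_score: float      # 0-100 scale
    gate_pass_rate: float
    semantic_f1: float

def composite_score(r: HarnessResult) -> float | None:
    """Weighted composite; None signals a semantic_f1 floor violation."""
    if r.semantic_f1 < 0.90:  # hard guard, excluded from the weighted objective
        return None
    return (0.45 * r.compile_pass_rate
            + 0.45 * (r.mean_eg_score / 100)
            + 0.10 * r.gate_pass_rate)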


Architectural Difference: In-Process Candidate Evaluation via ContextVar

Karpathy's reference implementation is git-backed: write the candidate artifact to disk, run the eval, commit or revert. Clean and auditable for a developer machine.

In a containerized production system with immutable infrastructure and database-versioned state, git-backed ratcheting is impractical. It requires the container to have a writable working tree, it couples the evaluation lifecycle to git state, and it makes concurrent runs impossible without isolated worktrees.

My solution: Python contextvars.ContextVar for in-process prompt injection. The active prompt lives in PostgreSQL as a versioned record. During candidate evaluation, a context variable temporarily overrides the active prompt for the duration of the harness run — no file I/O, no git state, no mutations to shared structures. The override is scoped to the evaluation coroutine's context and disappears when it exits.

from contextlib import contextmanager
from contextvars import ContextVar

_prompt_override: ContextVar[str | None] = ContextVar("prompt_override", default=None)

@contextmanager
def prompt_override_context(candidate: str):
    """Scope a candidate prompt to the current execution context."""
    token = _prompt_override.set(candidate)
    try:
        yield
    finally:
        _prompt_override.reset(token)

def get_active_prompt() -> str:
    override = _prompt_override.get()
    if override is not None:
        return override
    return _load_from_db()

On acceptance, a new prompt revision is created, auto-activated in the database, and written to a durable asset file for survival across container restarts. On rejection, nothing persists — the ContextVar goes out of scope and leaves no trace.
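Candidate evaluation then reduces to one context-managed call. A sketch, with run_harness standing in for the real benchmark entry point:

async def evaluate_candidate(candidate: str) -> float:
    # Every get_active_prompt() call inside this block sees the candidate;
    # concurrent evaluations in other contexts are unaffected.
    with prompt_override_context(candidate):
        return await run_harness()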


The Calibration Suite

Before I could trust the ratchet to accept or reject experiments, I needed confidence in three things: that my noise floor estimate was accurate, that my harness could actually discriminate between good and bad prompts, and that my agent was genuinely exploring the prompt space rather than copy-pasting. I built a calibration suite that runs four experiments in sequence and produces a structured JSON report with verdicts.

{
  "noise_floor":            { "mean", "std", "cv_pct", "two_sigma", "threshold_ok", "verdict" },
  "signal_detection":       { "baseline_mean", "degraded_mean", "cohens_d", "p_value", "detectable", "verdict" },
  "threshold_calibration":  { "recommended_threshold", "stable_recommended", "converged", "history_len", "rolling_cv_pct", "verdict" },
  "agent_diversity":        { "pairwise_edit_distances", "mean_edit_distance", "diverse_enough", "verdict" },
  "summary":                { "passed", "total", "all_passed" }
}

Each verdict is a string starting with PASS, FAIL, or ADJUST. The loop will not run unless the overall summary is all_passed: true.

Experiment 1 — Noise Floor (15 runs of the baseline prompt)

I run the baseline prompt 15 times at the same evaluation temperature (T = 0.3) used during the main loop. I use T = 0.3 — not T = 0.0 — because greedy decoding collapses small prompt edits into near-identical outputs, which flattens the optimization landscape and makes all candidates appear equal. The cost is stochastic evaluation noise; this experiment measures it.

From these 15 runs I compute the mean, σ, and coefficient of variation. The threshold is validated as sufficient if current_threshold ≥ 2σ.

Why 15 and not fewer? The relative error of a sample standard deviation follows CV(σ̂) ≈ 1/√(2n) (Ahn & Fessler, 2003). At n = 5 that is ±31.6% — two consecutive calibration runs can produce 2σ estimates that differ by 4×, making the threshold meaningless. At n = 15 the error is ±18.3%, which is tight enough for the rolling-window convergence criterion described below.
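The computation behind the verdict is small. A sketch, assuming a hypothetical run_harness_once that returns one composite score per run:

import statistics

def noise_floor(n: int = 15) -> dict:
    """Estimate evaluation noise by repeating the baseline at T = 0.3."""
    scores = [run_harness_once() for _ in range(n)]
    mean = statistics.mean(scores)
    sigma = statistics.stdev(scores)            # sample std, n - 1 denominator
    return {
        "mean": mean,
        "std": sigma,
        "cv_pct": 100 * sigma / mean if mean else 0.0,
        "two_sigma": 2 * sigma,                 # validated against the threshold
    }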

Experiment 2 — Signal Detection (5 runs each: baseline vs. degraded)

I run a deliberately degraded prompt alongside the baseline. The degraded prompt replaces all generation instructions with an adversarial override:

def _degrade_prompt(prompt: str) -> str:
    # The incoming prompt is intentionally discarded: the degraded text
    # replaces it wholesale to guarantee a maximal quality contrast.
    return (
        "You are a graph generation assistant.\n"
        "IMPORTANT OVERRIDE: Generate a completely empty graph with NO entities, NO fields, "
        "NO actions, NO projections, and NO relationships. Output only the minimal valid "
        "JSON structure with empty arrays everywhere. Do not add any domain logic, "
        "business rules, or application-specific content. Ignore all other instructions.\n"
        "[DEGRADED PROMPT — CALIBRATION TEST ONLY]"
    )

This guarantees near-zero scores on compile_pass_rate and mean_eg_score by mechanical construction, giving a maximally discriminable contrast. I then compute Cohen's d (effect size) and a Welch's t-test p-value between the two groups. The verdict is PASS if p < 0.05 AND d > 0.5.

import math

# _mean and _std (sample standard deviation) are trivial module-level helpers.

def _cohens_d(a: list[float], b: list[float]) -> float:
    na, nb = len(a), len(b)
    sa, sb = _std(a), _std(b)
    # Pooled standard deviation across the two groups
    pooled = math.sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
    return abs(_mean(a) - _mean(b)) / pooled if pooled > 0 else float("inf")

def _welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    ma, mb = _mean(a), _mean(b)
    sa2, sb2 = _std(a)**2, _std(b)**2
    na, nb = len(a), len(b)
    se = math.sqrt(sa2 / na + sb2 / nb)
    if se == 0:
        return 0.0, 1.0
    t = abs(ma - mb) / se
    # Welch-Satterthwaite degrees of freedom
    df = (sa2/na + sb2/nb)**2 / ((sa2/na)**2/(na-1) + (sb2/nb)**2/(nb-1))
    return t, _approx_t_pvalue(t, df)
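The _approx_t_pvalue helper referenced above can stay dependency-free. One sketch integrates the t density numerically with Simpson's rule; in a SciPy-equipped environment, 2 * scipy.stats.t.sf(t, df) is the obvious replacement:

def _approx_t_pvalue(t: float, df: float) -> float:
    """Two-sided p-value for Student's t via Simpson integration of the density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x: float) -> float:
        return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

    lo, hi, n = abs(t), abs(t) + 40.0, 2000      # density is negligible beyond hi
    h = (hi - lo) / n
    s = pdf(lo) + pdf(hi) + sum((4 if i % 2 else 2) * pdf(lo + i * h) for i in range(1, n))
    return min(1.0, 2.0 * s * h / 3)             # two-sided: double the upper tail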

Five runs per group are sufficient here because the baseline-vs-degraded gap is large by design (observed Cohen's d > 1.0). Signal detection needs confidence in non-flatness, not a precise σ estimate — five runs is already overpowered for this effect size.

This is why the calibration suite has two independent repeat parameters:

  Parameter             Default   Statistical purpose
  noise_floor_repeats   15        σ estimation — needs n ≥ 15 for CV(σ̂) < 18%
  repeats               5         Signal detection + agent diversity — Cohen's d > 1.0 needs only n ≥ 3

Conflating them into a single count either wastes compute (σ estimation does not need the sensitivity of signal detection) or under-samples (signal detection does not need n = 15). A --quick mode sets both to 3–5 for fast smoke checks, but disables threshold auto-apply.

Experiment 3 — Threshold Calibration (rolling maximum estimator)

Given the two_sigma value from Experiment 1, this experiment computes a principled threshold recommendation and decides whether to auto-apply it.

  • The problem with mean-based rolling averages. My first implementation used rolling_mean(2σ) × 1.1 across sessions. In practice the rolling coefficient of variation stabilized at 28–37% even at n = 15 per session, indicating right-skewed non-stationary variance: some sessions produced high σ, others low. The mean systematically underestimated the threshold needed on high-variance sessions, causing false accepts.
  • The fix: rolling maximum from order statistics. By order statistics, for k i.i.d. draws from any distribution F:
E[F(X_(k))] = k / (k + 1)

The sample maximum is the unbiased estimator of the k/(k+1) quantile — no distributional assumption required. With a rolling window of k = 10 sessions, the sample maximum estimates the 90.9th percentile of the observed 2σ distribution.

def record_two_sigma(two_sigma: float) -> tuple[float, bool, int, float]:
    """
    Append two_sigma to DB history, compute rolling-max threshold,
    return (recommended, converged, history_len, rolling_cv_pct).
    """
    _persist_two_sigma(two_sigma)                   # durable append to DB history
    history = _load_two_sigma_history(window=10)    # last 10 sessions, current included
    recommended = max(history) * 1.1
    converged = (
        len(history) >= 5
        and abs(history[-1] - history[-2]) / history[-2] < 0.10
    )
    if converged:
        _write_threshold_to_config(recommended)  # hot-path JSON file
    cv_pct = _std(history) / _mean(history) * 100 if _mean(history) > 0 else 0.0
    return recommended, converged, len(history), cv_pct

Setting τ = max(2σ̂₁, …, 2σ̂₁₀) × 1.1 gives:

  • For the 90.9% of sessions where σⱼ ≤ Q₀.₉₀₉: the accept decision compares two noisy composite scores, so the noise on their difference has standard deviation σ√2; a threshold of 2.2σ therefore sits 2.2/√2 ≈ 1.56 standard errors out, giving P(false accept) ≤ Φ(−1.1√2) ≈ 6% — controlled Type I error.
  • For the 9.1% tail (unusually high-variance sessions): τ is updated upward at the next calibration run. Self-correcting by construction.
  • Convergence criterion: max stability, not CV. CV measures spread — the wrong property for a max-based estimator. The max converges when the distributional tail is sufficiently sampled, which is visible as the running maximum stabilizing:
converged  ⟺  n ≥ 5  AND  |M_k − M_{k-1}| / M_{k-1} < 10%

rolling_cv_pct is still logged as an observability signal but is no longer the convergence gate.

  • Two persistence layers. The threshold lives in two places with distinct responsibilities:
  • PostgreSQL: every calibration run (full report, timestamps, status). Serves the audit trail, rolling-window history queries, and UI display.
  • JSON config file: the min_improvement_delta scalar only. Read on every ratchet evaluate() call, so it must be fast with no DB round-trip.

The rule: historical records go to the database. Hot-path scalars go to the config file. The scalar is always derivable from the DB history; the config file is just a cache of the latest computed value.

Experiment 4 — Agent Diversity (3 independent proposals)

I make three independent calls to the agent with identical inputs and measure whether the proposals genuinely differ. If they do not, the agent is copy-pasting, and the loop will spin at the noise floor regardless of how well-calibrated the ratchet is.

I originally measured diversity with Wagner-Fischer Levenshtein. At O(nm) complexity, it was only feasible by truncating inputs to 4,000 characters. My generation prompt has a shared boilerplate header of ~6,000 characters. The truncation window landed entirely inside this header. All proposals appeared identical regardless of actual differences in the instruction sections further down.

I replaced it with word-bigram Jaccard distance:

def _text_divergence(a: str, b: str) -> float:
    """
    Word-bigram Jaccard distance. O(n) in word count; boilerplate-immune.
    Why not Levenshtein: O(n²), truncation to 4000 chars lands in shared
    boilerplate header, all proposals appear identical.
    """
    wa, wb = a.split(), b.split()
    # Fall back to unigram sets for zero- or one-word inputs (avoids IndexError).
    sa = set(zip(wa, wa[1:])) if len(wa) > 1 else {(w,) for w in wa}
    sb = set(zip(wb, wb[1:])) if len(wb) > 1 else {(w,) for w in wb}
    union = sa | sb
    return 0.0 if not union else 1.0 - len(sa & sb) / len(union)

Properties:

  • O(n) in word count. ~0.15 ms per pair at 12,000 characters. Scales to the full 32,000-character limit without truncation.
  • Boilerplate-immune. Shared bigrams appear in both the numerator and denominator of the ratio — they cancel out. Diversity is measured over the delta, not the shared base.
  • Order-sensitive. ("MUST", "include") and ("include", "MUST") are distinct bigrams. Reordering instructions registers as a change.
  • Whitespace-normalized. split() collapses indentation and newline differences that carry no semantic content.

Verdict: diverse_enough if the mean pairwise distance > 0.05 AND every proposal's distance from the baseline > 0.01. Both conditions are required: the first guards against globally similar proposals; the second guards against a single outlier inflating the mean.
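Wiring the metric into the verdict takes only a few lines. A sketch:

from itertools import combinations

def diversity_verdict(proposals: list[str], baseline: str) -> bool:
    """True when proposals differ from each other and from the baseline."""
    pairwise = [_text_divergence(a, b) for a, b in combinations(proposals, 2)]
    mean_pairwise = sum(pairwise) / len(pairwise)
    vs_baseline = [_text_divergence(p, baseline) for p in proposals]
    return mean_pairwise > 0.05 and all(d > 0.01 for d in vs_baseline)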


Challenge: Parsing Structured Output from Reasoning Models

The agent is instructed to respond in a fixed format:

DESCRIPTION: <one-sentence summary of the change>
PROMPT:
<full replacement prompt>

My original parser used str.startswith("DESCRIPTION:"). It silently failed whenever Claude prefaced its response with reasoning before the structured block — which happens reliably on complex tasks where the model works through the problem before committing to a proposal:

Let me analyze the benchmark results to identify the highest-leverage change...
[three paragraphs of reasoning]
DESCRIPTION: Strengthen entity-presence instruction for core cluster
PROMPT:
...

With startswith, the entire response — reasoning included — was passed as the candidate prompt, causing validation to fail or a garbled string to be evaluated.

  • The fix: re.search with multiline anchoring.
import re

_DESC_RE = re.compile(r'(?m)^DESCRIPTION:\s*(.+)')
_PROMPT_RE = re.compile(r'(?m)^PROMPT:\s*\n(.*)', re.DOTALL)

def parse_agent_response(raw: str) -> tuple[str, str] | tuple[None, str]:
    prompt_match = _PROMPT_RE.search(raw)
    if not prompt_match:
        return None, "PROMPT: marker not found in agent response"
    desc_match = _DESC_RE.search(raw)
    description = desc_match.group(1).strip() if desc_match else ""
    prompt_text = prompt_match.group(1).strip()
    return prompt_text, description

(?m) makes ^ match at the start of any line, so preamble text before the structured section is silently skipped. re.DOTALL captures the full multi-line prompt body. The general principle: structured output from reasoning models should always be extracted with search, never with prefix checks.
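A quick sanity check of the parser on a response with preamble (strings are illustrative):

raw = (
    "Let me analyze the benchmark results first...\n\n"
    "DESCRIPTION: Strengthen entity-presence instruction for core cluster\n"
    "PROMPT:\n"
    "You are a graph generation assistant. ...\n"
)
prompt_text, description = parse_agent_response(raw)
assert description == "Strengthen entity-presence instruction for core cluster"
assert prompt_text.startswith("You are a graph generation assistant.")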


Challenge: Silent Failures in LLM-Dependent Calibration Pipelines

The calibration suite makes many LLM calls — fifteen harness runs for noise floor estimation alone, each of which fires parallel generation calls for ten reference applications. When the LLM provider is unavailable (quota exhaustion, network partition), individual generation calls can fail silently and return empty results. Those zeros corrupt the noise floor estimate: a session that should report σ ≈ 0.012 instead reports σ ≈ 0.002 because most scores are zero. The ratchet threshold is set too tight, and every subsequent experiment is falsely rejected.

I encountered this pattern in three distinct forms:

  1. Silent zeros from quota errors. Generation calls caught quota exceptions internally and returned empty graphs. The harness computed valid (but meaningless) scores of 0.0.
  2. Missing fields in calibration reports. The detectable and diverse_enough verdict fields were computed inside the calibration functions but never written to the report dict. The operator UI displayed "–" for both fields.
  3. Opaque error messages in rejection reasons. The runner returned "see ERROR logs above" as a rejection reason when an agent call failed.
  • The fixes:
import threading

# 1. Fast-abort on first quota error
_quota_abort = threading.Event()

def generate_one_app(playbook: str, prompt: str) -> Graph | None:
    if _quota_abort.is_set():
        return None  # skip immediately; don't burn the full timeout
    try:
        return _call_llm(playbook, prompt)
    except InsufficientQuotaError:
        _quota_abort.set()  # signal sibling workers before propagating
        raise

# 2. Always serialize verdict fields
report["signal_detection"]["detectable"] = detectable
report["agent_diversity"]["diverse_enough"] = diverse_enough

# 3. Surface actual exception messages
except Exception as exc:
    return None, f"{type(exc).__name__}: {exc}"  # not "see ERROR logs above"

Additionally: I track apps_failed per harness run and propagate it through all progress callbacks. The operator UI turns the progress bar red and shows an inline warning when apps_failed > 0. Calibration results from partially-failed runs are stored in the database for audit but are flagged; threshold auto-apply is suppressed until a clean run completes.


The Hardest Lesson: Verify Your Harness Before Blaming Your Prompt

Early in production, the loop stalled. Five consecutive experiments proposed emphasis-only language — "MUST generate every entity", "all actions are required", "do not omit any entity or action" — and all were rejected with zero composite improvement. The core benchmark cluster (entity-presence and action-presence checks) remained at 77 failures regardless of how strongly the prompt instructed the model.

The instinct was to write a more specific prompt. Instead I read the benchmark helpers.

The benchmark was reading compiled format keys (graph["entities"]) to check entity presence. The LLM was generating in a nested format (domain_model.entities). The key path was wrong. No instruction in any system prompt could fix a key-path mismatch in the evaluation harness.
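The fix was to make the helper tolerant of both shapes rather than to write a stronger prompt. A sketch of the idea:

def get_entities(graph: dict) -> list:
    """Read entities from either the compiled or the nested output format."""
    if "entities" in graph:                                   # compiled format
        return graph["entities"]
    return graph.get("domain_model", {}).get("entities", [])  # nested LLM format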

Once the helpers were made format-agnostic, the same emphasis-only proposals started producing measurable improvements.

  • The lesson: before assuming a prompt is wrong, verify that the harness is measuring what you think it is. A benchmark that silently returns zero for a structural reason looks identical to a benchmark that correctly scores a bad prompt. The loop faithfully optimizes whatever you measure — if the measurement is broken, the loop will spin indefinitely at the noise floor.

What the Ratchet Actually Guards

For completeness, my ratchet evaluates five independent conditions on every experiment. All five must pass for acceptance:

  1. Composite improvement: composite_new > composite_baseline + min_improvement_delta
  2. Per-app regression floor: For all reference apps, eg_score_new ≥ prior_best_eg_score − 2.0. The 2.0-point tolerance absorbs evaluation noise at T = 0.3 without masking genuine regressions.
  3. Semantic F1 guard: If baseline ≥ 0.90, candidate must stay ≥ 0.90. Prevents prompt changes from degrading a downstream classification layer.
  4. Vocabulary leakage guard: Ensures the prompt does not overfit to domain-specific terminology from the reference playbooks in a way that would degrade performance on unseen domains.
  5. (Optional) Regression test suite: A full pytest run can be gated here; available for high-stakes promotion flows.

All rejection reasons are collected into a list and written to the experiment record. Empty list → accept. This structure makes it straightforward to triage stalled loops: if "composite improvement below threshold" appears in every experiment, the noise floor may be miscalibrated; if per-app regression reasons dominate, one reference app may be an outlier dragging the score down.
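The evaluate() shape, sketched over hypothetical result objects (guards 4 and 5 omitted for brevity):

def evaluate(baseline, candidate, delta: float) -> list[str]:
    """Return rejection reasons; an empty list means accept."""
    reasons: list[str] = []
    if candidate.composite <= baseline.composite + delta:
        reasons.append("composite improvement below threshold")
    for app, score in candidate.eg_scores.items():            # per-app floor
        if score < baseline.best_eg_scores[app] - 2.0:
            reasons.append(f"per-app regression: {app}")
    if baseline.semantic_f1 >= 0.90 and candidate.semantic_f1 < 0.90:
        reasons.append("semantic_f1 below 0.90 floor")
    return reasons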


Results and What Generalizes

The loop has produced measurable, committed improvements to my generation prompt — primarily by forcing it to name entities explicitly for domains where naming conventions are ambiguous, and by tightening the action-completeness instruction for workflow-heavy templates.

The patterns that I think generalize to other prompt optimization loops:

  • Mean-based noise threshold with non-stationary variance → use a rolling maximum × safety margin; order statistics require no distributional assumptions.
  • Two statistical purposes conflated into one repeat count → separate σ estimation from signal detection; their n requirements differ by ~3×.
  • Edit-distance diversity metric on prompts with shared boilerplate → use set-overlap metrics (Jaccard, n-gram) that are naturally boilerplate-immune.
  • Structured output parsed with a prefix check → always use re.search with the (?m) flag; reasoning models produce preamble.
  • LLM failures silently corrupting calibration data → abort on the first error, track failure counts, suppress threshold auto-apply on dirty runs.
  • Loop stalled on emphasis-only proposals → read the benchmark helpers before rewriting the prompt.

The Karpathy loop is genuinely useful — but only if the measurement is honest. Every hour spent hardening the harness, the calibration suite, and the error visibility is an hour that pays forward in reliable, unattended improvement cycles.


References

Ahn, S., & Fessler, J. A. (2003). Standard errors of mean, variance, and standard deviation estimators. Technical report, EECS Department, The University of Michigan.