Calibration Stack¶
Raw LLM probabilities are never traded against directly. Project Rule R4 (CLAUDE.md) is explicit:
R4 — Calibration is Mandatory. Raw LLM probabilities never used directly; all go through the calibration pipeline.
The stack chains four post-hoc methods into a single
CalibratorBundle,
each compensating for a different defect of the prior stage. The design follows
ADR-0008.
Why not trust the LLM probability¶
Two reasons:
- Systematic miscalibration. [Tian et al. 2023] measured Claude, GPT, and Llama on multiple-choice QA and found persistent overconfidence at the tails (p ≥ 0.8 and p ≤ 0.2) and a reversion-to-0.5 bias at the center.
- No distribution-free guarantee. The LLM can be wrong about its own uncertainty and there is no native way to express "I am \((1-\alpha)\) confident my probability is in \([p_\text{lo}, p_\text{hi}]\)."
Post-hoc calibration fixes both at decision time without retraining the model.
Pipeline¶
```mermaid
flowchart LR
    R["raw LLM prob p_llm"] --> I["isotonic<br/>per category"]
    I --> V["Venn-Abers<br/>p_lo, p_mid, p_hi"]
    V --> C{"Mondrian conformal<br/>singleton at alpha=0.10?"}
    C -->|"set = {0,1}"| S["SKIP"]
    C -->|"singleton"| E["shrinkage ensemble<br/>alpha * p_iso + (1 − alpha) * p_market"]
    E --> F["p_final"]
```
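In code, the stack reads as a straight pipeline. A minimal sketch, with illustrative attribute names on the bundle (the real interfaces are linked in the sections below):

```python
def calibrate_and_decide(bundle, category, p_llm, p_market, alpha=0.10):
    """Sketch of the flowchart above; attribute names are illustrative."""
    p_iso = bundle.isotonic.transform(category, p_llm)             # stage 1
    p_lo, p_mid, p_hi = bundle.venn_abers.interval(p_iso)          # stage 2
    pred_set = bundle.conformal.prediction_set(category, p_mid, alpha)
    if pred_set == {0, 1}:                                         # stage 3
        return None  # SKIP: both classes plausible at this alpha
    a = bundle.ensemble.weight(category)                           # stage 4
    return a * p_mid + (1 - a) * p_market                          # p_final
```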
Isotonic regression¶
Fits a monotone non-decreasing function \(g: [0, 1] \to [0, 1]\) by minimizing squared error under an order constraint, solved with the Pool Adjacent Violators (PAV) algorithm [Barlow et al. 1972]:

\[
\hat g = \operatorname*{arg\,min}_{g \text{ non-decreasing}} \sum_i \bigl(g(p_i) - y_i\bigr)^2
\]
Isotonic is low-bias but high-variance; [Niculescu-Mizil & Caruana 2005]
recommend \(\ge 500\) calibration points. We set the per-category guard lower
(\(\ge 150\)) and degrade gracefully to the identity map when below threshold.
See
IsotonicCalibrator.
Why per-category
Calibration drifts by category — "elections" miscalibrate differently from "crypto prices". Fitting globally averages away the effect we want to correct.
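A minimal per-category fit using scikit-learn's IsotonicRegression, which solves the PAV problem above. The 150-point guard and identity fallback mirror the behaviour described; the helper names are illustrative, the real code is in IsotonicCalibrator:

```python
import numpy as np
import pandas as pd
from sklearn.isotonic import IsotonicRegression

MIN_POINTS = 150  # per-category guard from the text

def fit_per_category(df: pd.DataFrame) -> dict:
    """df columns: category, p_llm (raw prob), y (resolved outcome, 0/1)."""
    models = {}
    for cat, grp in df.groupby("category"):
        if len(grp) < MIN_POINTS:
            continue  # below threshold: identity map at predict time
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(grp["p_llm"].to_numpy(), grp["y"].to_numpy())
        models[cat] = iso
    return models

def transform(models: dict, category: str, p: float) -> float:
    iso = models.get(category)
    return float(iso.predict([p])[0]) if iso is not None else p  # graceful degrade
```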
Venn-Abers predictors¶
Venn-Abers predictors [Vovk et al. 2015] produce an interval \([p_0, p_1]\) that contains the true posterior probability under the exchangeability assumption. Unlike a confidence interval on a point estimate, the VA guarantee is distribution-free and holds in finite samples: the average of \(p_0\) over the calibration sequence equals the empirical frequency conditional on the model's ordering.
We use Inductive Venn-Abers (IVAP) via the venn-abers package. Our wrapper
VennAbersCalibrator
returns \((p_\text{lo}, p_\text{mid}, p_\text{hi})\) with the midpoint used as a
point estimate and the width \(p_\text{hi} - p_\text{lo}\) as an uncertainty
signal for position sizing downstream.
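For intuition, here is the brute-force IVAP computation (the venn-abers package implements the fast variant from the paper): each test score is appended to the calibration set twice, once labelled 0 and once labelled 1, and the isotonic fit is read off at that score.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ivap_interval(s_cal, y_cal, s_test):
    """Brute-force Inductive Venn-Abers; refits isotonic twice per test point."""
    p0 = np.empty(len(s_test))
    p1 = np.empty(len(s_test))
    for i, s in enumerate(s_test):
        for label, out in ((0, p0), (1, p1)):
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
            iso.fit(np.append(s_cal, s), np.append(y_cal, label))
            out[i] = iso.predict([s])[0]
    return p0, p1  # (p_lo, p_hi); the wrapper's p_mid is their midpoint
```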
Mondrian conformal prediction¶
Plain conformal prediction gives marginal \((1-\alpha)\) coverage. Mondrian
conformal [Vovk et al. 2003] stratifies the calibration set by a taxonomy
and gives conditional coverage per stratum. Our stratification is
the pair (category, true_class).
This is strictly stronger than the more common category-only Mondrian. Without
the true_class axis, a category whose base rate is 80% could have 100%
coverage for the majority class and near-zero for the minority while still
appearing "calibrated" on average.
Nonconformity score for calibration point \((p, y)\) is the absolute error:

\[
s(p, y) = |y - p| =
\begin{cases}
1 - p, & y = 1 \\
p, & y = 0
\end{cases}
\]
At decision time:
| Prediction set | Meaning | Action |
|---|---|---|
| \(\{1\}\) | Only class 1 plausible at \(\alpha\) | trade YES |
| \(\{0\}\) | Only class 0 plausible | trade NO |
| \(\{0, 1\}\) | Both plausible, or too little data | skip |
Default \(\alpha = 0.10\) with \(\ge 30\) samples per (category, class). See
MondrianConformalFilter.
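A sketch of the filter, with calibration scores bucketed by (category, true_class) and the conformal p-value computed per candidate label; names are illustrative, the real logic is in MondrianConformalFilter:

```python
import numpy as np
from collections import defaultdict

MIN_STRATUM = 30  # samples required per (category, class)

def fit_strata(records):
    """records: iterable of (category, p, y); nonconformity s = |y - p|."""
    strata = defaultdict(list)
    for cat, p, y in records:
        strata[(cat, y)].append(abs(y - p))
    return {k: np.asarray(v) for k, v in strata.items()}

def prediction_set(strata, category, p, alpha=0.10):
    keep = set()
    for label in (0, 1):
        scores = strata.get((category, label))
        if scores is None or len(scores) < MIN_STRATUM:
            return {0, 1}  # too little data in this stratum -> skip
        s = abs(label - p)
        pval = (np.sum(scores >= s) + 1) / (len(scores) + 1)  # conformal p-value
        if pval > alpha:
            keep.add(label)
    return keep or {0, 1}  # empty set is treated as "skip" too
```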
Empirical-Bayes shrinkage ensemble¶
The calibrated probability \(p_\text{llm}\) is combined with the market price:

\[
p_\text{final} = \alpha_c \, p_\text{llm} + (1 - \alpha_c) \, p_\text{market}
\]
The per-category weight \(\alpha_c\) is fit by line-search to minimize log-loss on historical decisions, then shrunk toward a global \(\alpha\) by empirical Bayes [Efron 2010] in the standard pseudo-count form, with category sample size \(n_c\) and pseudo-count \(n_0\):

\[
\tilde{\alpha}_c = \frac{n_c \, \alpha_c + n_0 \, \alpha_\text{global}}{n_c + n_0}
\]
A category with few trades inherits the global weight; one with many decisions
earns its own estimate. Implementation:
ShrinkageEnsemble.
Low alpha is a signal, not a bug
If \(\alpha_c \to 0\) for a category (market dominates the combination), the LLM has no edge there. If \(\alpha_c \to 1\) on a liquid market, that niche is inefficient — exactly where we want to concentrate effort.
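A sketch of the fit; the 101-point grid for the line-search and the pseudo-count \(n_0\) are illustrative choices, the real code is in ShrinkageEnsemble:

```python
import numpy as np

def log_loss(p, y, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_weights(history, n0=50):
    """history: {category: (p_llm, p_market, y) arrays}; returns shrunk weights."""
    grid = np.linspace(0.0, 1.0, 101)
    raw, n = {}, {}
    for cat, (p_llm, p_mkt, y) in history.items():
        losses = [log_loss(a * p_llm + (1 - a) * p_mkt, y) for a in grid]
        raw[cat] = grid[int(np.argmin(losses))]   # line-search for alpha_c
        n[cat] = len(y)
    alpha_global = np.average(list(raw.values()), weights=list(n.values()))
    # empirical-Bayes shrinkage: few trades -> inherit the global weight
    return {c: (n[c] * raw[c] + n0 * alpha_global) / (n[c] + n0) for c in raw}
```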
CalibratorBundle: atomic hot-swap¶
All four fitted objects are pickled together in a
CalibratorBundle.
Save is atomic via tempfile.NamedTemporaryFile \(\to\) Path.replace:

```python
with tempfile.NamedTemporaryFile("wb", dir=target.parent, delete=False) as fh:
    pickle.dump(self, fh, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_path = Path(fh.name)
tmp_path.replace(target)  # atomic on POSIX
```
The trading loop reloads the bundle at startup; a mid-flight swap leaves no
half-written file. Library versions (scikit-learn, venn-abers, numpy,
scipy) are snapshotted at save time, so a silent pip install --upgrade
scikit-learn that subtly changes IsotonicRegression internals is logged as
"CalibratorBundle library drift detected" on next load.
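The snapshot itself needs nothing beyond the standard library; a sketch, with the field and logger names assumed:

```python
import logging
from importlib.metadata import version

TRACKED = ("scikit-learn", "venn-abers", "numpy", "scipy")
log = logging.getLogger(__name__)

def snapshot_versions() -> dict:
    """Record the installed version of each tracked distribution."""
    return {pkg: version(pkg) for pkg in TRACKED}

def check_drift(saved: dict) -> None:
    """Compare versions pickled with the bundle against the live environment."""
    current = snapshot_versions()
    for pkg, ver in saved.items():
        if current.get(pkg) != ver:
            log.warning("CalibratorBundle library drift detected: %s %s -> %s",
                        pkg, ver, current.get(pkg))
```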
When to refit¶
refit_calibrators
(refit.py)
pulls the last 2000 labelled decisions, retrains every component, and swaps
the bundle. Triggers:
| Trigger | Cadence | Why |
|---|---|---|
| Scheduled | Nightly, part of pipeline/nightly.py | Calibration drifts with regime |
| Per-category threshold | Every 50 new resolved trades in a category | Fresh data for a hot niche |
| Library drift | On load-time warning | sklearn / venn-abers version change |
| Manual | uv run kfish-calibrate after protocol change | Model prompt edits, new persona, etc. |
Too-frequent refits are a hazard
Refit with too little new data and isotonic overfits the last batch; the bundle starts tracking noise. The 2000-row window and the 50-trade per-category guard keep the signal-to-noise ratio sane.
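Tying it together, a sketch of the refit entry point under assumed helper names (fetch_labelled_decisions, CalibratorBundle.fit):

```python
def refit_calibrators(store, bundle_path, window=2000):
    """Pull the trailing window of labelled decisions, retrain, hot-swap."""
    df = store.fetch_labelled_decisions(limit=window)  # assumed data-access helper
    bundle = CalibratorBundle.fit(df)                  # refits all four components
    bundle.versions = snapshot_versions()              # for drift detection on load
    bundle.save(bundle_path)                           # atomic tempfile -> replace
```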
References¶
- Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD (1972). Statistical inference under order restrictions. Wiley.
- Efron B (2010). Large-scale inference. Cambridge University Press.
- Niculescu-Mizil A, Caruana R (2005). Predicting good probabilities with supervised learning. ICML.
- Tian K, Mitchell E, Zhou A, et al. (2023). Just ask for calibration. EMNLP.
- Vovk V, Lindsay D, Nouretdinov I, Gammerman A (2003). Mondrian confidence machine. On-line Compression Modelling Project.
- Vovk V, Petej I, Fedorova V (2015). Large-scale probabilistic predictors with and without guarantees of validity. NeurIPS.
- ADR-0008: Isotonic + Venn-Abers + Mondrian conformal.