Theory and Math
This document states the scoring rules and calibration metrics that the K-Fish
stack optimizes against, and explains why each one is used. Every formula has a
corresponding implementation in
apps/kfish-core/src/kfish_core/calibration/brier.py.
Brier score
Given a set of \(N\) probabilistic forecasts \(f_i \in [0, 1]\) for binary outcomes \(o_i \in \{0, 1\}\), the Brier score [Brier 1950] is the mean squared error

\[
B = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2
\]
Perfect foresight gives \(B = 0\). Always predicting \(0.5\) gives \(B = 0.25\). Guessing uniformly at random on \([0, 1]\) gives \(B \approx 0.33\) (the expected squared error is \(1/3\)). The K-Fish baseline is \(B = 0.206\) on 200 retrodicted Polymarket questions (CLAUDE.md).
The computation lives in `brier_score`.
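As a minimal sketch of the definition above (not the actual `brier_score` implementation, whose signature is not shown here), the mean squared error can be computed in a few lines of NumPy. The function name `brier_score_sketch` and the toy outcomes are assumptions for illustration.

```python
import numpy as np


def brier_score_sketch(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    if f.shape != o.shape:
        raise ValueError("forecasts and outcomes must have the same shape")
    return float(np.mean((f - o) ** 2))


# Toy data (hypothetical): four resolved binary questions.
outcomes = np.array([1, 0, 1, 0])

# Perfect foresight: forecasts equal the outcomes.
perfect = brier_score_sketch(outcomes, outcomes)

# Always predicting 0.5, regardless of the question.
coin = brier_score_sketch(np.full(4, 0.5), outcomes)

print(perfect, coin)
```

The two printed values correspond to the \(B = 0\) and \(B = 0.25\) reference points quoted above.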
Murphy decomposition
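[Murphy 1973] showed that the Brier score decomposes cleanly when the \(N\) forecasts are grouped into \(K\) bins with \(n_k\) forecasts per bin, mean forecast \(\bar f_k\), empirical frequency \(\bar o_k\), and overall base rate \(\bar o\):

\[
B = \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar f_k - \bar o_k)^2}_{\text{reliability}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar o_k - \bar o)^2}_{\text{resolution}} + \underbrace{\bar o (1 - \bar o)}_{\text{uncertainty}}
\]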
Three orthogonal quantities:
| Component | Interpretation | Direction |
|---|---|---|
| Reliability | Distance between stated confidence and observed frequency within each bin. Captures calibration error. | Minimize |
| Resolution | How far bin frequencies spread from the base rate. Captures discrimination — the ability to separate "will happen" from "won't happen". | Maximize |
| Uncertainty | Variance of the outcome, fixed by the dataset. | Not controllable |
Implemented twice: a NumPy version in `decompose_brier` and a SQL version in `BRIER_DECOMPOSITION_SQL` (same file), so the warehouse and the test suite compute identical numbers.
Why both?
Re-implementing the decomposition in DuckDB SQL means every
refit_calibrators run (thousands of rows) reports the same number as the
unit test on a 20-row fixture. Divergence between the two is a bug signal.
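The decomposition can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the `decompose_brier` source: the function name, the equal-width binning, and the toy data are all hypothetical. Note that reliability minus resolution plus uncertainty reproduces the Brier score exactly only when all forecasts inside a bin share the same value, which the toy data is chosen to satisfy.

```python
import numpy as np


def decompose_brier_sketch(forecasts, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binned forecasts."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    n = f.size
    base_rate = o.mean()

    # Equal-width bins on [0, 1]; a forecast of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((f * n_bins).astype(int), n_bins - 1)

    reliability = 0.0
    resolution = 0.0
    for k in np.unique(bin_ids):  # empty bins contribute nothing
        mask = bin_ids == k
        n_k = mask.sum()
        f_bar = f[mask].mean()  # mean forecast in the bin
        o_bar = o[mask].mean()  # empirical frequency in the bin
        reliability += n_k * (f_bar - o_bar) ** 2
        resolution += n_k * (o_bar - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return float(reliability / n), float(resolution / n), float(uncertainty)


# Toy data (hypothetical): two forecast levels, slightly miscalibrated.
forecasts = np.array([0.3] * 5 + [0.7] * 5)
outcomes = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

rel, res, unc = decompose_brier_sketch(forecasts, outcomes)
brier_direct = float(np.mean((forecasts - outcomes) ** 2))

print(rel, res, unc, rel - res + unc, brier_direct)
```

On this fixture the three terms recombine to the directly computed Brier score, which is the same kind of agreement check the NumPy and SQL versions are meant to pass against each other.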
Why Brier, not log-loss or CRPS
Three scoring rules are candidates for binary forecasts:
| Rule | Formula | Strictly proper? | Bounded? | Behavior near 0/1 |
|---|---|---|---|---|
| Brier | \((f - o)^2\) | yes | \([0, 1]\) | finite |
| Log-loss | \(-[o \log f + (1 - o) \log (1 - f)]\) | yes | \([0, \infty)\) | diverges |
| CRPS | integral form — reduces to Brier for binary | yes | \([0, 1]\) | finite |
Log-loss penalizes overconfidence at 0 or 1 with infinite loss. For LLM-sampled probabilities, a single \(p = 0.999\) on a resolved-NO market would dominate the mean even across hundreds of markets. Brier keeps every outlier bounded, which matters when the probability generator is stochastic and occasionally extreme.
CRPS reduces to Brier in the binary case, so there is nothing to gain. For continuous outcomes the choice would matter; binary prediction markets make it moot.
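To make the boundedness argument concrete, the sketch below compares the per-forecast penalty of the two rules for increasingly confident YES forecasts on a market that resolved NO. The helper names and the forecast values are assumptions chosen for illustration.

```python
import numpy as np


def brier_loss(f, o):
    """Per-forecast Brier loss, always within [0, 1]."""
    return (f - o) ** 2


def log_loss(f, o):
    """Per-forecast log-loss, unbounded as f approaches the wrong extreme."""
    return -(o * np.log(f) + (1 - o) * np.log(1 - f))


# Increasingly confident YES forecasts on a market that resolved NO (o = 0).
confident = np.array([0.9, 0.99, 0.999, 0.999999])
resolved_no = np.zeros_like(confident)

brier_penalties = brier_loss(confident, resolved_no)
log_penalties = log_loss(confident, resolved_no)

for f, b, l in zip(confident, brier_penalties, log_penalties):
    print(f"f={f:<9} brier={b:.4f} log-loss={l:.2f}")
```

The Brier penalty saturates just below 1, while the log-loss penalty keeps growing with every extra digit of confidence.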
Calibration vs discrimination
The Murphy decomposition makes the distinction formal:
- Calibration is reliability: when a forecaster says \(0.70\), the long-run frequency should be \(0.70\). Low reliability term \(\Rightarrow\) well calibrated.
- Discrimination is resolution: across bins, do frequencies actually differ from the base rate? High resolution \(\Rightarrow\) the forecaster distinguishes events.
A forecaster that always outputs the base rate is perfectly calibrated (reliability \(= 0\)) but has zero resolution, so its Brier equals uncertainty. A useful forecaster sacrifices a little calibration for a lot of resolution.
Brier skill score vs climatology
Skill is relative to a naïve climatology forecaster that always reports the base rate \(\bar o\). That forecaster has \(B_{\text{clim}} = \bar o (1 - \bar o)\), which is exactly the uncertainty term. The skill score is

\[
\mathrm{BSS} = 1 - \frac{B}{B_{\text{clim}}}
\]
\(\mathrm{BSS} > 0\) beats climatology; \(\mathrm{BSS} = 1\) is perfection; negative means worse than the base rate. K-Fish currently sits at \(\mathrm{BSS} \approx +0.176\) against the always-\(0.5\) reference (\(1 - 0.206 / 0.25\)), but has not yet beaten the crowd (\(B_{\text{crowd}} \approx 0.084\)).
Computed as `BrierDecomposition.skill_score`.
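A minimal sketch of the skill score, assuming a standalone function rather than the `BrierDecomposition.skill_score` attribute (whose implementation is not shown here). The names `skill_vs_reference` and `brier_skill_score_sketch` and the toy outcomes are hypothetical.

```python
import numpy as np


def skill_vs_reference(brier, reference_brier):
    """Skill of a Brier score relative to any reference Brier score."""
    if reference_brier <= 0:
        raise ValueError("reference Brier score must be positive")
    return 1.0 - brier / reference_brier


def brier_skill_score_sketch(forecasts, outcomes):
    """Skill relative to climatology, i.e. always forecasting the base rate."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    brier = float(np.mean((f - o) ** 2))
    base_rate = float(o.mean())
    return skill_vs_reference(brier, base_rate * (1.0 - base_rate))


# Toy data (hypothetical): base rate of 0.25.
outcomes = np.array([1, 0, 0, 0])

bss_clim = brier_skill_score_sketch(np.full(4, 0.25), outcomes)  # climatology
bss_perfect = brier_skill_score_sketch(outcomes, outcomes)       # perfect foresight

# Scores quoted in this document: 0.206 baseline vs the 0.25 always-0.5 score.
bss_vs_half = skill_vs_reference(0.206, 0.25)

print(bss_clim, bss_perfect, round(bss_vs_half, 3))
```

Climatology scores zero skill by construction and perfect foresight scores one, so any useful forecaster lands strictly between the two.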
Expected Calibration Error
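ECE [Naeini 2015] is a single-number summary of the reliability diagram. Bin the \(N\) forecasts into \(K\) equal-width bins \(B_1, \dots, B_K\); compute the per-bin mean confidence \(\mathrm{conf}(B_k)\) and accuracy \(\mathrm{acc}(B_k)\); take the weighted mean absolute deviation:

\[
\mathrm{ECE} = \sum_{k=1}^{K} \frac{|B_k|}{N} \left| \mathrm{acc}(B_k) - \mathrm{conf}(B_k) \right|
\]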
Default \(K = 15\) in `compute_ece`.
Bin count sensitivity
ECE with \(K = 10\) and ECE with \(K = 30\) on the same data can differ by a factor of two. [Kumar 2019] showed ECE is biased downward for small \(K\). We pin \(K = 15\) so successive refits are comparable; report raw bin contents in the reliability diagram rather than letting \(K\) hide them.
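The sketch below illustrates the binning procedure and lets the bin count be varied on the same data. It is not the `compute_ece` source: the function name and the synthetic data are assumptions, and \(\mathrm{acc}(B_k)\) is taken to be the empirical outcome frequency in the bin while \(\mathrm{conf}(B_k)\) is the mean forecast.

```python
import numpy as np


def compute_ece_sketch(forecasts, outcomes, n_bins=15):
    """Weighted mean |frequency - mean forecast| over equal-width bins."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)

    # Equal-width bins on [0, 1]; a forecast of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((f * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for k in np.unique(bin_ids):  # empty bins carry zero weight
        mask = bin_ids == k
        weight = mask.mean()  # |B_k| / N
        ece += weight * abs(o[mask].mean() - f[mask].mean())
    return float(ece)


# Overconfident toy forecaster (hypothetical): says 0.9, is right 70% of the time.
overconfident = compute_ece_sketch([0.9] * 10, [1] * 7 + [0] * 3)

# Synthetic, calibrated-by-construction forecasts to probe bin-count sensitivity.
rng = np.random.default_rng(0)
f = rng.uniform(size=2000)
o = (rng.uniform(size=2000) < f).astype(float)
ece_by_k = {k: compute_ece_sketch(f, o, n_bins=k) for k in (10, 15, 30)}

print(overconfident, ece_by_k)
```

Running the last line with different seeds or sample sizes is a quick way to see how much of a reported ECE is driven by the choice of \(K\) rather than by the forecaster.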
Why extremization
LLM probability distributions are biased toward \(0.5\). [Galton 1907] and [Tetlock 2005] showed that mechanically sharpening a crowd's consensus — pushing the mean away from the prior — often improves Brier, provided the forecasters are not catastrophically miscalibrated.
[Baron et al. 2014] derived the asymmetric extremization formula: push harder when forecasters agree, pull back toward \(0.5\) when they disagree. K-Fish uses the dispersion-sensitive variant
where \(s\) is the standard deviation of per-persona probabilities and \(a = 0.5\) is the aggressiveness hyperparameter. Implemented as `asymmetric_extremize`.
When extremization hurts
On unknowable markets (true \(P = 0.5\)), sharpening amplifies noise. The 3-Fish pre-screen (swarm.md) skips these before the swarm even runs, so extremization never sees them.
References
- Baron J, Mellers BA, Tetlock PE, et al. (2014). Two reasons to make aggregated probability forecasts more extreme. Decision Analysis 11(2).
- Brier GW (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1): 1–3.
- Galton F (1907). Vox populi. Nature 75: 450–451.
- Kumar A, Liang P, Ma T (2019). Verified uncertainty calibration. NeurIPS.
- Murphy AH (1973). A new vector partition of the probability score. Journal of Applied Meteorology 12(4): 595–600.
- Naeini MP, Cooper G, Hauskrecht M (2015). Obtaining well calibrated probabilities using Bayesian binning. AAAI.
- Tetlock PE (2005). Expert political judgment. Princeton University Press.