Tool Leaderboard — Methodology

Every tool on the PMP leaderboard is graded with the same shape: a forecast at signal time, a single binary outcome at resolution, a Brier component, and a P/L number tied to the limit price the tool quoted. This page is the canonical source for what each number means and how it's computed. If something on a tool's page surprises you, the answer is here.

License: CC BY 4.0. Reuse freely; just cite us when you do.

The ledger

When a predictive tool emits a signal, we write one row to tool_picks with the side, the model probability, the market price at pick time, and a confidence tier. When the underlying market resolves, a settle worker writes a matching row to tool_settles with the resolution and the realized P/L. Every public number on the leaderboard is computed from those two tables — no spreadsheet, no manual revision. The full ledger is queryable through /api/public/tool-leaderboard with JSON and CSV outputs.

Picks are written at signal time, not after the fact. We do not backdate. If a tool didn't emit a signal that day, no row exists for that day. This kills the easiest form of track-record embellishment.

Hit rate

Hit rate is the fraction of settled picks where the trader's side won. void and refunded resolutions are excluded from both numerator and denominator (see voids below).

hit_rate = wins / (wins + losses)

A 60% hit rate at 50¢ entry is profitable; the same hit rate at 80¢ entry is not. Hit rate without entry-price context is meaningless — see ROI.
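
A minimal sketch of that computation, assuming each settled pick carries a result field with the values shown below (the real column names may differ):

type Result = "win" | "loss" | "void" | "refunded";

// Hit rate over settled picks: voids and refunds drop out of both
// the numerator and the denominator.
function hitRate(results: Result[]): number | null {
  const wins = results.filter((r) => r === "win").length;
  const losses = results.filter((r) => r === "loss").length;
  const decided = wins + losses;
  return decided === 0 ? null : wins / decided;
}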

Wilson 95% confidence interval

We display a 95% Wilson interval on hit rate so a tool with a small sample doesn't look more (or less) skilled than the data supports. Wilson is preferred over the normal approximation because it stays sane near 0% and 100% and at small n.

(p̂ + z²/2n ± z·√((p̂·(1−p̂) + z²/4n) / n)) / (1 + z²/n)

where p̂ = observed hit rate, n = settled-pick count, and z = 1.96 for 95% confidence.

A 60% hit rate over 10 picks lands at roughly [31%, 83%] — wide enough that you should not pay much attention. The same 60% over 100 picks lands at roughly [50%, 69%] — meaningful.
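
For reference, the same formula as a small helper; wilson95 is an illustrative name, not a documented function:

// 95% Wilson score interval for a hit rate, z = 1.96.
function wilson95(wins: number, settled: number): [number, number] {
  if (settled === 0) return [0, 1];
  const z = 1.96;
  const p = wins / settled;
  const z2 = z * z;
  const denom = 1 + z2 / settled;
  const center = p + z2 / (2 * settled);
  const spread = z * Math.sqrt((p * (1 - p) + z2 / (4 * settled)) / settled);
  return [(center - spread) / denom, (center + spread) / denom];
}

// wilson95(6, 10)   -> roughly [0.31, 0.83]
// wilson95(60, 100) -> roughly [0.50, 0.69]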

Brier score

The Brier score grades probabilistic forecasts against binary outcomes. Lower is better. It's the single best test of whether a tool's probabilities are honest, because a tool that always says “70%” can score well on hit rate purely by picking high-prob markets — but its Brier will degrade if 70% picks resolve at any other rate.

Brier = (1/n) · Σ (predicted_prob − yes_outcome)²

predicted_prob is always the model's YES probability (regardless of which side the tool took). yes_outcome is 1 if the YES side won, 0 if NO won. So a tool that picks NO at 60% (i.e. predicted_prob = 40%) and the market resolves NO scores (0.40 − 0)² = 0.16.

We compare each tool's Brier against a market-mirror baseline: what the Brier would be if the tool always predicted exactly the market price. Beating the baseline means the tool added information beyond what the market already knew. Failing to beat baseline does not make a tool worthless (it can still be profitable on price selection), but it disqualifies the tool from the “calibration_ok” flag used for promotion to public.

Calibration check threshold: tool_brier ≤ baseline_brier + 0.05 AND settled-pick count ≥ 30. Below 30 picks, the calibration flag is always false — Brier on tiny samples is noise.
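
Putting the score and the gate together, a sketch that assumes each settled pick carries the model's YES probability, the market's YES price at pick time, and the resolved outcome (field names are illustrative):

interface SettledPick {
  predictedProb: number; // model's YES probability at signal time
  marketPrice: number;   // market's YES price at signal time
  yesOutcome: 0 | 1;     // 1 if YES won, 0 if NO won
}

const brier = (picks: SettledPick[], p: (x: SettledPick) => number) =>
  picks.reduce((sum, x) => sum + (p(x) - x.yesOutcome) ** 2, 0) / picks.length;

function calibrationOk(picks: SettledPick[]): boolean {
  if (picks.length < 30) return false; // tiny samples are noise
  const toolBrier = brier(picks, (x) => x.predictedProb);
  const baselineBrier = brier(picks, (x) => x.marketPrice); // market mirror
  return toolBrier <= baselineBrier + 0.05;
}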

Calibration curve

The calibration curve checks whether the tool's probabilities are honest. We bucket every settled pick by predicted probability into ten 10%-wide buckets (0–10%, 10–20%, …, 90–100%) and plot the actual win rate inside each bucket. A perfectly calibrated tool traces the diagonal: when it says 70%, the actual win rate over that bucket is 70%.

For NO-side picks we calibrate against the trader's side: a tool that picks NO at “70% confidence” lands in the 70% bucket and its outcome is “did NO win.” This way a NO pick that resolves NO counts as a win in the 70% bucket — the bucket measures the probability the trader actually staked.

Bucket count is locked at 10. Cells with fewer than 5 picks are suppressed (n displayed but no curve point) — a single pick in a bucket gives 0% or 100%, which is structural, not signal.
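
A sketch of the bucketing, using the probability and outcome on the trader's side as described above (names are illustrative):

interface CurvePoint { bucket: number; n: number; winRate: number | null }

// pairs of [probability staked on the trader's side, did that side win]
function calibrationCurve(picks: Array<[number, boolean]>): CurvePoint[] {
  const buckets = Array.from({ length: 10 }, () => ({ n: 0, wins: 0 }));
  for (const [prob, won] of picks) {
    const i = Math.min(9, Math.floor(prob * 10)); // 0–10%, 10–20%, …, 90–100%
    buckets[i].n += 1;
    if (won) buckets[i].wins += 1;
  }
  return buckets.map((b, i) => ({
    bucket: i,
    n: b.n,
    winRate: b.n >= 5 ? b.wins / b.n : null, // suppress cells with < 5 picks
  }));
}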

ROI at par

ROI at par is realized profit per $1 staked at the limit price the tool quoted at signal time. “At par” means we assume the trader actually got filled at the quoted price — no slippage modeled. For binary contracts:

win:    pnl = (1 − entry_price) / entry_price
loss:   pnl = −1
void:   pnl = 0  (excluded from denominator)

entry_price is the tool's reported entry — the YES price for YES picks, (1 − yes_price) for NO picks. The canonical implementation lives at lib/picks/pnl.ts and is used by every settle worker.
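
A minimal sketch of that shape; the real lib/picks/pnl.ts may differ in naming and edge-case handling:

type Resolution = "win" | "loss" | "void" | "refunded";

// Realized P/L per $1 staked at the quoted entry price, no slippage.
// entryPrice is the price on the trader's side (YES price for YES picks,
// 1 − yes_price for NO picks).
function pnlAtPar(resolution: Resolution, entryPrice: number): number {
  if (resolution === "win") return (1 - entryPrice) / entryPrice;
  if (resolution === "loss") return -1;
  return 0; // void / refunded: excluded from the ROI denominator upstream
}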

We post ROI per pick (averaged across a window) rather than total return because tools fire at different cadences. Comparing “Silver Edge ROI” to “POTD ROI” on a per-pick basis is apples-to-apples; comparing total dollars would just measure how chatty each tool is.

Real-world execution has slippage, queue position, and fee drag that this number ignores. ROI at par is best read as ceiling performance.

Drawdown

Drawdown is the distance below the running peak of cumulative ROI. It surfaces variance that hit rate and average ROI hide. A tool with 60% hit rate and a 30-point drawdown traveled through a long rough stretch to get there — useful information for a trader sizing a position.
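
A sketch of the computation over per-pick ROI in settlement order (the assumption being that drawdown is reported in cumulative-ROI points):

// Maximum distance below the running peak of cumulative per-pick ROI.
function maxDrawdown(roiPerPick: number[]): number {
  let cumulative = 0;
  let peak = 0;
  let maxDd = 0;
  for (const roi of roiPerPick) {
    cumulative += roi;
    peak = Math.max(peak, cumulative);
    maxDd = Math.max(maxDd, peak - cumulative);
  }
  return maxDd;
}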

Regime tags

Every pick is tagged with the regime it fired in:

  • vol_tier — quiet / normal / volatile (asset-class specific definition)
  • dow — day-of-week, 0 = Sunday UTC through 6 = Saturday UTC
  • hour_utc — 0–23, when in the trading day the signal fired
  • news_flag — was a high-impact news event live at signal time
  • season — for seasonal markets (NFL season, election cycle, etc.)

Per-tool pages surface the heatmap of hit rate by day-of-week × vol tier (cells with ≥ 5 picks). Useful if a tool is great on quiet days and terrible during news events — that's actionable; a single rolled-up number isn't.
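
As a data shape, the tags look roughly like this; an illustrative type, not the actual schema:

interface RegimeTags {
  vol_tier: "quiet" | "normal" | "volatile"; // asset-class specific definition
  dow: 0 | 1 | 2 | 3 | 4 | 5 | 6;            // 0 = Sunday UTC
  hour_utc: number;                           // 0–23
  news_flag: boolean;                         // high-impact news live at signal time
  season?: string;                            // seasonal markets only
}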

Visibility tiers

Every tool starts at visibility=admin and climbs only after meeting the gates:

  • admin — internal QA only. Picks logged, no external surface.
  • pro — visible on /pro/leaderboard-preview. Promoted once the picks pipeline is healthy and the tool has meaningful sample data.
  • public — visible on /leaderboard/tools. Promoted only when the tool has ≥ 30 settled picks AND calibration_ok = true (Brier ≤ baseline + 0.05).
  • retired — sunset. Data preserved; ledger queryable via the public API.

Tools are never deleted. If a tool stops working, it's retired with the full historical ledger intact so anyone can audit what happened and why we pulled it.

Sample-size badges

  • building — fewer than 30 settled picks
  • stable — 30 to 99 settled picks
  • deep — 100 or more settled picks

The 30-pick threshold mirrors min_picks_for_public on the registry. It's the floor where Brier and Wilson stop being dominated by noise.
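
The mapping as a sketch, with the thresholds from above:

type SampleBadge = "building" | "stable" | "deep";

function sampleBadge(settledPicks: number): SampleBadge {
  if (settledPicks < 30) return "building";
  if (settledPicks < 100) return "stable";
  return "deep";
}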

Voids and refunds

When a market voids (no resolution) or refunds (returned to traders), the pick is recorded with resolution='void' or 'refunded', P/L = 0, Brier excluded. These rows count against the tool's pick volume but never against its accuracy. We do not silently drop voided picks — they're listed in the picks table with the void flag visible.

Time windows

Every tool gets four scorecards per day, one per window:

  • 7d — last seven days, recency weight
  • 30d — last thirty days, the headline window
  • 90d — last ninety days, regime stability
  • all-time — every settled pick since the tool went live

Scorecards refresh nightly at 03:15 UTC. The headline numbers on every public page are the 30-day window; deeper windows surface on the per-tool page.
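
A sketch of the window cutoffs, assuming each rolling window is measured back from the refresh timestamp:

const WINDOW_DAYS = { "7d": 7, "30d": 30, "90d": 90 } as const;

// Start-of-window timestamp for a given refresh time; "all-time" has no cutoff.
function windowStart(refreshedAt: Date, window: keyof typeof WINDOW_DAYS): Date {
  const ms = WINDOW_DAYS[window] * 24 * 60 * 60 * 1000;
  return new Date(refreshedAt.getTime() - ms);
}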

Open data

Every number on the leaderboard is queryable through the public API:

  • /api/public/tool-leaderboard — list of public tools with latest scorecards (JSON or ?format=csv)
  • /api/public/tool-leaderboard/[slug] — per-tool ledger: scorecards across all four windows + recent settled picks

Rate limit is 60 requests per hour per IP. Every response carries an X-Attribution header with the citation string. License is CC BY 4.0; please attribute when you reuse.
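
A minimal usage example; the host below is a placeholder, and the response shape is only what's documented above:

async function fetchLeaderboard(): Promise<unknown> {
  // Append ?format=csv for CSV output.
  const res = await fetch("https://example.com/api/public/tool-leaderboard");
  console.log(res.headers.get("X-Attribution")); // citation string for CC BY 4.0
  return res.json();
}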

What this is not

These tools forecast prediction-market contracts. They are not trade signals, not investment advice, and not a substitute for your own due diligence. Past performance documented here is past performance — markets shift, regimes change, and a tool that ran hot for 30 days can cool for the next 30. Trade responsibly.