# Diagnosis and Improvement Recommendations
## Diagnosis
### 1. Totals Market Calibration is Severely Broken
The totals market is running at 50.6% (40-39), well below the 60% threshold. Looking at the wrong predictions, the problem is stark and consistent: the model assigned **high fair_prob and strong positive edge to "under" predictions** on games that went over dramatically. Game 599 (3-13, 16 total runs) had `fair_prob=0.756, edge=+0.264` on under. Game 634 (9-3, 12 runs) had `fair_prob=0.805, edge=+0.315` on under. Game 630 (16-7, 23 runs) had `fair_prob=0.723, edge=+0.236` on under. Game 622 (8-6, 14 runs) had `fair_prob=0.727, edge=+0.242` on under. These are not marginal misses: the games blew up, yet the model was extremely confident in the under. Predictions with fair_prob in the 0.70-0.80 range should win at close to those rates, and they clearly do not. This points to a systematic upstream bias in the run-scoring distribution, likely the pitcher/batter rolling features or park factor weighting **suppressing projected totals below true expectation**.
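One way to check the calibration claim directly is to bucket settled totals predictions by `fair_prob` and compare each bucket's mean predicted probability with its realized hit rate. The sketch below assumes the results are available as `(fair_prob, won)` pairs. The function name `calibration_table` and the sample data are hypothetical and do not come from the codebase.

```python
from collections import defaultdict


def calibration_table(results, n_buckets=10):
    """Group (fair_prob, won) pairs into equal-width probability buckets.

    Returns rows of (bucket_lower_bound, n, mean_fair_prob, hit_rate).
    """
    buckets = defaultdict(list)
    for fair_prob, won in results:
        # Clamp so that fair_prob == 1.0 lands in the top bucket.
        idx = min(int(fair_prob * n_buckets), n_buckets - 1)
        buckets[idx].append((fair_prob, won))

    rows = []
    for idx in sorted(buckets):
        items = buckets[idx]
        n = len(items)
        mean_prob = sum(p for p, _ in items) / n
        hit_rate = sum(1 for _, w in items if w) / n
        rows.append((idx / n_buckets, n, mean_prob, hit_rate))
    return rows


# Hypothetical sample, for illustration only (not real results).
sample = [(0.72, False), (0.76, False), (0.74, True), (0.55, True), (0.52, False)]
for lower, n, mean_prob, hit_rate in calibration_table(sample):
    print(f"bucket>={lower:.2f}: n={n} predicted={mean_prob:.3f} actual={hit_rate:.3f}")
```

A well-calibrated model shows `predicted` and `actual` close together in every bucket with a reasonable sample size; a large gap in the 0.70-0.80 bucket would confirm the pattern described above.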
### 2. The `_fair_prob_from_dist` Function Has a Boundary Error on the Under
In `_evaluate.py`, the `_fair_prob_from_dist` function computes the under probability as:
```python
return sum(v for k, v in distribution.items() if (_n := _numeric(k)) is not None and _n <= line)
```
With `<=`, the **exact line value** falls in the under bucket. On a half-point line such as 8.5 this is harmless, because no integer run total can equal the line. On a whole-number line such as 8 or 9, however, the mass at exactly the line (a push) is counted as an under win, which overstates the under probability. A second alignment concern is `median_line_value`: if it is computed from a mix of half-point and whole-number lines (e.g., 8.5 vs 9.0), the `statistics.median()` in `consensus_market` can return a value like 8.75 that no book actually offers. If PMF keys are integers (0, 1, 2, ..., 20), then `_n <= 8.75` captures runs 0-8 and `_n > 8.75` captures 9+, which matches the 8.5 split but not the 9.0 split, where a total of exactly 9 is a push. The larger effect is upstream: because the PMF is a **discrete distribution** centered on the model's projected total, a projection that is systematically too low (e.g., 7.5 when actual totals run 9-10) puts most of the mass below the consensus line and inflates `fair_prob` for unders regardless of how the boundary is handled.
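To make the boundary behavior concrete, the following sketch evaluates a toy PMF at a half-point line, a blended 8.75 line, and a whole-number line. The PMF values and the helper names `under_prob` and `over_prob` are invented for illustration; they are not the production functions.

```python
def under_prob(pmf, line, inclusive):
    """Sum PMF mass below the line; inclusive=True mirrors the `<=` comparison."""
    if inclusive:
        return sum(p for runs, p in pmf.items() if runs <= line)
    return sum(p for runs, p in pmf.items() if runs < line)


def over_prob(pmf, line):
    """Sum PMF mass strictly above the line."""
    return sum(p for runs, p in pmf.items() if runs > line)


# Toy PMF over total runs (illustrative values that sum to 1.0).
pmf = {7: 0.20, 8: 0.30, 9: 0.25, 10: 0.25}

for line in (8.5, 8.75, 9.0):
    u_incl = under_prob(pmf, line, inclusive=True)
    u_strict = under_prob(pmf, line, inclusive=False)
    o = over_prob(pmf, line)
    print(f"line={line}: under(<=)={u_incl:.2f} under(<)={u_strict:.2f} over(>)={o:.2f}")
```

On the 8.5 and 8.75 lines the two comparisons agree. On the 9.0 line the `<=` version adds the push mass at exactly 9 runs to the under, while the strict version leaves that mass out of both sides.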
### 3. Edge Threshold is Too Permissive for Low-Confidence Markets
The `_verdict` function marks anything ≥ 0.03 (3%) as green. Looking at wrong predictions, many losing bets had edge values of +0.01 to +0.06 on totals/under: Game 573 (edge=+0.051), Game 611 (edge=+0.046), Game 619 (edge=+0.035), Game 626 (edge=+0.053). A 3% edge threshold was possibly calibrated when the model was more accurate. The totals market's 50.6% win rate means the model has **negative expected value on many "green" totals plays**. The threshold needs to be higher for totals specifically, and should be scaled by `fair_prob` — a play with `fair_prob=0.51` and `edge=+0.04` is not the same confidence level as `fair_prob=0.65` and `edge=+0.04`.
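Whether a 50.6% win rate is actually negative expected value depends on the prices taken. As a rough check, the sketch below assumes every totals bet was placed at -110, which is an assumption on my part since the report does not state the prices. At that price the break-even win rate is 110/210, about 52.4%, so the 40-39 record falls short.

```python
def breakeven_prob(american_odds):
    """Win probability at which a bet at the given American odds has zero EV."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def ev_per_unit(win_prob, american_odds):
    """Expected profit per 1 unit staked at the given American odds."""
    if american_odds < 0:
        payout = 100 / -american_odds
    else:
        payout = american_odds / 100
    return win_prob * payout - (1 - win_prob)


wins, losses = 40, 39      # totals record quoted in the diagnosis
win_rate = wins / (wins + losses)
odds = -110                # assumed price, not stated in the source

print(f"win rate   = {win_rate:.3f}")
print(f"break-even = {breakeven_prob(odds):.3f}")
print(f"EV/unit    = {ev_per_unit(win_rate, odds):+.3f}")
```

The same helpers can be rerun with the real average price once it is known; a longer price lowers the break-even rate and a shorter one raises it.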
### 4. Runline Side-Mapping Has a Silent Probability Inversion Risk
The `_MODEL_TO_ODDS` dict maps both `("runline", "home_minus")` and `("runline", "home_plus")` to `("spreads", "home_runline")`, and both `("runline", "away_minus")` and `("runline", "away_plus")` to `("spreads", "away_runline")`. The line-value filtering in `_build_comparison` is the only guard against comparing a +1.5 prediction against a -1.5 market price. But `consensus_market` uses `devig_two_way` across whatever two sides exist. If book data has inconsistent side labeling (some books labeling home runline as -1.5, others as +1.5 due to a data ingestion quirk), the `median_line_value` guard could pass while the `consensus_implied` is actually for the **opposite side**. Several runline losses are consistent with this, though they do not prove it: Game 572 (away runline, `edge=+0.012, verdict=yellow`, final score 4-9) and Games 573 and 574 are away runline predictions where the predicted direction may have been right but the comparison may have been made against the wrong line.
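The size of the inversion risk can be illustrated with a simple multiplicative de-vig. This is a sketch only: the real `devig_two_way` may use a different method, and the function `devig_two_way_sketch`, the prices, and the model probability below are all hypothetical.

```python
def implied_prob(american_odds):
    """Raw implied probability (vig included) from American odds."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def devig_two_way_sketch(odds_a, odds_b):
    """Multiplicative de-vig: scale both implied probabilities to sum to 1."""
    pa, pb = implied_prob(odds_a), implied_prob(odds_b)
    total = pa + pb
    return pa / total, pb / total


# Hypothetical book: home -1.5 priced at +150, away +1.5 priced at -175.
home_minus, away_plus = devig_two_way_sketch(150, -175)

# If ingestion swaps the side labels, the "home" price is really the away +1.5 price.
mislabeled_home, _ = devig_two_way_sketch(-175, 150)

fair_prob = 0.45  # hypothetical model probability for home -1.5
print(f"correct edge    = {fair_prob - home_minus:+.3f}")
print(f"mislabeled edge = {fair_prob - mislabeled_home:+.3f}")
```

With the labels swapped, the consensus probability attached to the home side is the away side's probability, so the computed edge changes sign and magnitude even though the line-value check can still see a 1.5 line.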
### 5. No Contextual Filtering on High-Variance Games
The wrong predictions include multiple blowout games (1-15, 2-12, 3-13, 9-3, 16-7, 0-12) where both the total and the runline went against the model's under/close-game prediction. Games 570-575 all on 2026-05-14 show a cluster of failures — all predicted away wins and unders, and several ended in blowouts. This suggests a **weather or park event** on that date was not captured (park_rf and weather fields are `?` in the data). The model has no mechanism to reduce confidence or widen the edge threshold when contextual data is missing — it still outputs green verdicts with high edges even when `home_wp`, `proj_total`, `park_rf`, and weather are all unknown.
---
## Specific Improvement Suggestions
### 1. Fix the PMF boundary condition and add line-alignment validation in `_fair_prob_from_dist`
**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`
**Problem:** The `<=` boundary on under includes the exact line integer, and there's no validation that the consensus `median_line_value` is sensible relative to the PMF's key range. Also, totals lines are almost always half-points (8.5, 9.5) — if they're stored as whole numbers in any book, the median can land on a whole number and the under/over split will be wrong.
```python
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float | None:
    """Derive fair probability for a totals market from a stored PMF.

    Returns None if the line value is outside the PMF's support range,
    which indicates a data alignment problem rather than a real probability.
    """

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    # Keep only entries whose keys parse as numbers.
    numeric_items = [
        (num, v)
        for k, v in distribution.items()
        if (num := _numeric(k)) is not None
    ]
    if not numeric_items:
        return None

    keys = [n for n, _ in numeric_items]
    pmf_min, pmf_max = min(keys), max(keys)

    # Guard: if the consensus line is outside the PMF support, the PMF and
    # market are misaligned; return None so this comparison is skipped
    # rather than producing a garbage fair_prob.
    if line < pmf_min or line > pmf_max:
        # Assumes a module-level `logger` already exists in _evaluate.py.
        logger.warning(
            "_fair_prob_from_dist: line=%.2f outside PMF range [%.1f, %.1f]; skipping",
            line, pmf_min, pmf_max,
        )
        return None

    # Use strict comparisons on both sides: n > line for over, n < line for under.
    # On half-point lines this is identical to <=. On whole-number lines the
    # "push" bucket (exact line) is excluded from both sides, which makes the
    # split unambiguous (over + under can then sum to less than 1).
    if side == "over":
        return sum(v for n, v in numeric_items if n > line)
    # under: strictly less than line (exclude exact-line ties)
    return sum(v for n, v in numeric_items if n < line)
```
Then update the caller in `_build_comparison` to handle the `None` return:
```python
if pred.market in _DISTRIBUTION_MARKETS:
    if pred.distribution is None or consensus.median_line_value is None:
        return None
    fair_prob = _fair_prob_from_dist(
        pred.distribution, pred.side, consensus.median_line_value
    )
    # NEW: treat None as uncomputable and skip this comparison
    if fair_prob is None:
        return None
```
---
### 2. Implement per-market edge thresholds with a `fair_prob` minimum floor
**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Functions:** `_verdict`, `_edge_threshold`, `_build_comparison`
**Problem:** A flat 3% edge threshold ignores that totals predictions are less reliable and that low `fair_prob` green calls (0.51 with 4% edge) have little real value. The totals market is hitting 50.6% — its effective threshold should be raised until it's demonstrated to be calibrated.
```python
# Keep _DEFAULT_EDGE_THRESHOLD as the fallback and add a per-market dict
_DEFAULT_EDGE_THRESHOLD = 0.03

_MARKET_EDGE_THRESHOLDS: dict[str, float] = {
    "moneyline": 0.04,  # slight increase from 0.03
    "runline": 0.04,
    "total": 0.07,      # raised significantly: market is at 50.6%, needs higher bar
    "f5_total": 0.07,
    "nrfi": 0.05,
}

# Minimum fair_prob required for a green verdict. Below this, cap at yellow
# regardless of edge, because low-probability estimates are high-variance.
_MIN_FAIR_PROB_FOR_GREEN: dict[str, float] = {
    "moneyline": 0.53,
    "runline": 0.60,
    "total": 0.58,      # require meaningful confidence on totals
    "f5_total": 0.58,
    "nrfi": 0.55,
}


def _edge_threshold(market: str | None = None) -> float:
    """Return edge threshold for a specific market, with env override."""
    # The env override, when set, applies to every market.
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    if market is not None:
        return _MARKET_EDGE_THRESHOLDS.get(market, _DEFAULT_EDGE_THRESHOLD)
    return _DEFAULT_EDGE_THRESHOLD


def _verdict(
    edge: float,
    sharp: bool,
    rlm: bool,
    threshold: float,
    fair_prob: float,
    market: str,
) -> Verdict:
    # Note: sharp and rlm are accepted but not used by the rule shown here.
    min_prob = _MIN_FAIR_PROB_FOR_GREEN.get(market, 0.52)
    if edge >= threshold and fair_prob >= min_prob:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    return Verdict.yellow
```
Update the call site in `_build_comparison`:
```python
# Pass market-specific threshold
threshold = _edge_threshold(pred.market)
sharp = detect_sharp_divergence(market_splits)
rlm = detect_reverse_line_movement(market_odds, market_splits)
verdict = _verdict(edge, sharp, rlm, threshold, fair_prob, pred.market)
```
---
### 3. Add a missing-context confidence penalty
**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`
**Problem:** Games 570-575 all have `?` for `home_wp`, `proj_total`, `park_rf`, and weather. The model still outputs green verdicts with large edges. When key contextual inputs are absent, the fair_prob estimate is based on incomplete information and should be penalized or the verdict should be capped.
```python
def _context_completeness_score(pred: Prediction, session: Session, game_id: int) -> float:
    """Return a score in [0, 1] representing fraction of key context present.

    Used to down-grade verdict confidence when model inputs are missing.
    """
    from db.models import Game, ParkFactor, WeatherSnapshot

    score = 0.0
    checks = 0

    # Check 1: home_win_prob populated (moneyline prediction exists)
    if pred.fair_prob is not None:
        score += 1.0
    checks += 1

    # Check 2: park factor present (a missing game row counts as missing)
    game = session.get(Game, game_id)
    if game is not None:
        pf = session.scalar(
            select(ParkFactor).where(
                ParkFactor.park_id == game.park_id,
                ParkFactor.season == game.game_date.year,
            )
        )
        if pf is not None:
            score += 1.0
    checks += 1

    # Check 3: weather snapshot present
    weather = session.scalar(
        select(WeatherSnapshot)
        .where(WeatherSnapshot.game_id == game_id)
        .order_by(WeatherSnapshot.captured_at.desc())
        .limit(1)
    )
    if weather is not None:
        score += 1.0
    checks += 1

    return score / checks if checks > 0 else 0.0


# In _build_comparison, after computing verdict, apply context penalty:
verdict = _verdict(edge, sharp, rlm, threshold, fair_prob, pred.market)

# Downgrade verdict when context completeness is low.
# Requires session to be threaded through: add a session param to _build_comparison.
# With three checks, 2/3 (about 0.667) is below 0.67, so any single missing
# input is enough to trigger the downgrade.
completeness = _context_completeness_score(pred, session, game_id)
if completeness < 0.67 and verdict == Verdict.green:
    logger.info(
        "Downgrading game_id=%d market=%s side=%s from green to yellow: "
        "context completeness=%.2f",
        game_id, pred.market, pred.side, completeness,
    )
    verdict = Verdict.yellow
```
Update `_build_comparison` signature to accept `session`:
```python
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,
    session: Session,  # NEW
) -> MarketComparison | None:
```
And update the call in `_evaluate`:
```python
comp = _build_comparison(
    game_id=game_id,
    model_run_id=model_run.id,
    pred=pred,
    odds_snapshots=odds_snapshots,
    splits_snapshots=splits_snapshots,
    threshold=threshold,
    session=session,  # NEW
)
```
---
### 4. Validate runline side consistency before computing consensus
**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`
**Problem:** The current guard checks `median_lv` against expected `±1.5`, but `consensus_market` is called *before* this check, and `consensus_implied` is already computed from whatever the two sides happen to be in the snapshot data. If any book has home and away runline both at -1.5 (a data error), `devig_two_way` produces garbage. Add an explicit consistency check on the snapshot data before passing to `consensus_market`.
```python
if pred.market == "runline":
# Pre-filter: only keep snapshots where line values are internally
# consistent (home = -1.5 when away = +1.5, or vice versa).
# Remove any book where both sides have the same sign.
def _runline_snapshots_are_