Model Analysis

Auto-generated every 40 games · Claude diagnoses underperforming markets and suggests fixes


Last 40 games — 2026-05-31 to 2026-06-03

Generated Jun 4, 2026

Analysis run

Last 40 games

ML 18-22 (45.0%)

ATS 25-15 (62.5%)

O/U 22-15-3 (59.5%)

Full season

ML 141-127 (52.6%)

ATS 165-103 (61.6%)

O/U 138-112-18 (55.2%)
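The percentages above are consistent with pushes being excluded from the denominator: an O/U record of 22-15-3 gives 22 / (22 + 15) = 59.5%. A minimal sketch of that arithmetic follows; `win_rate` is a hypothetical helper written for illustration, not a function from the model code.

```python
def win_rate(wins: int, losses: int, pushes: int = 0) -> float:
    """Win percentage with pushes excluded from the denominator.

    `pushes` is accepted so a full W-L-P record can be passed in,
    but it is deliberately left out of the calculation.
    """
    decided = wins + losses
    if decided == 0:
        return 0.0
    return round(100.0 * wins / decided, 1)


# Records copied from the stat lines above.
last_40 = {
    "ML": win_rate(18, 22),
    "ATS": win_rate(25, 15),
    "O/U": win_rate(22, 15, 3),
}
full_season = {
    "ML": win_rate(141, 127),
    "ATS": win_rate(165, 103),
    "O/U": win_rate(138, 112, 18),
}

print(last_40)
print(full_season)
```

If pushes were counted as non-wins instead, the last-40 O/U figure would be 22 / 40 = 55.0%, so the convention matters when a market sits near a review threshold.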

Diagnosis

# Diagnosis

## 1. Moneyline Calibration Is the Core Problem

The moneyline market at 45% (18-22) is the primary drag on overall performance. Looking at the wrong predictions, the losses cluster in two distinct failure modes. First, there are high-confidence green-verdict losses where the model has strong conviction but is simply wrong: Game 816 (home_wp=0.675, edge=+0.086, green) lost 10-9, Game 825 (home_wp=0.700, edge=+0.135, green) lost 8-0 to the *away* team, Game 838 (home_wp=0.729, edge=+0.249, green) won only 6-5. Second, the model correctly avoided some red-verdict games (Games 829, 837, 846) but fired on comparable games and lost. The moneyline losses are not concentrated in low-edge yellow bets: several green-verdict moneyline picks lost outright, which suggests the `fair_prob` values from the upstream layers are systematically overconfident, particularly for home favorites in the 0.60–0.75 range where the model is most active.

## 2. Totals Is Severely Miscalibrated on Overs

The totals market at 59.5% overall is borderline, but the wrong predictions reveal a specific directional problem: **over bets are losing at a high rate**. Games 821 (actual=2-1, over bet green at 0.717), 825 (over bet green lost), 827 (over bet green, actual=4-2), 830 (over bet green at 0.821, actual=3-4), 840 (over bet green, actual=4-1), 842 (over bet green, actual=8-0, although this one won), and 849 (over bet green at 0.793, actual=1-0) show a striking pattern: the model assigns very high fair probabilities to overs (0.717–0.821) in games that end up as low-scoring affairs. Games 821 and 849 are particularly damning: fair_prob=0.717 and 0.793 respectively on the over, yet final scores were 2-1 and 1-0. This is a massive calibration failure. The `_fair_prob_from_dist` function only sums mass from a distribution PMF, so the likeliest cause is that the Monte Carlo layer is generating run distributions that are too right-skewed (fat upper tails), probably because the weather/park suppression factors are underweighted when temperatures are moderate and wind speeds are low-to-moderate.

## 3. The `_verdict` Function Ignores Sharp/RLM Signals Entirely

Looking at `_verdict`:

```python
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    if edge >= threshold:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    return Verdict.yellow
```

The `sharp` and `rlm` parameters are accepted but **completely ignored**. The function is purely edge-threshold driven. This means that when sharp money is moving against the model's position (RLM = True on the popular side) or handle diverges from bets (sharp divergence), the model still emits a green verdict if edge ≥ 0.03. This is a structural bug, not a tuning issue. Sharp money signals are historically among the strongest contrarian indicators, and discarding them entirely explains some of the high-confidence losses.

## 4. The 3% Edge Threshold Is Too Low and Undifferentiated

The default `_DEFAULT_EDGE_THRESHOLD = 0.03` (3 percentage points) treats a 3.1pp edge the same as a 40pp edge for verdict purposes: both are green. But the wrong predictions include many losses with moderate edges (Game 816: edge=+0.086, Game 825: edge=+0.135, Game 844: edge=+0.190, Game 820: edge=+0.402) alongside wins at similar edge levels. What's absent is any market-specific threshold calibration: moneyline markets have much tighter vig and more efficient pricing than runline or totals, so a 3% edge on a moneyline is less meaningful than a 3% edge on a runline. The uniform threshold inflates green verdicts on moneylines where the model's edge is most likely to be noise rather than signal.

## 5. Yellow-Verdict Bets Are Being Included in Win Rate Tracking

Games 815, 817, 818, 823, 848, and 853 all appear in the "wrong predictions" list with yellow verdicts (edge between 0 and threshold). If yellow bets are being counted in the win rate denominators, they are diluting the signal from green bets and making the moneyline market look worse than it is for high-conviction plays. The yellow-verdict moneyline losses (Game 817: fair_prob=0.574, edge=+0.012; Game 818: fair_prob=0.562, edge=+0.019) represent exactly the marginal bets that should be excluded from grading since the model explicitly flagged low confidence.

---

# Specific Improvement Suggestions

## 1. Fix `_verdict` to Actually Use Sharp and RLM Signals

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_verdict`

The sharp divergence and RLM flags should downgrade verdicts. A green bet facing sharp counter-action should become yellow; a yellow bet facing sharp counter-action should become red. This is the single highest-leverage fix because it gates high-confidence bets against the strongest market signal available.

```python
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    """Compute verdict incorporating edge threshold and sharp-money signals.

    Degradation rules:
    - sharp=True or rlm=True each downgrade one level (green→yellow, yellow→red)
    - Both firing simultaneously forces red regardless of edge
    """
    # Base verdict from edge
    if edge >= threshold:
        base = Verdict.green
    elif edge < 0.0:
        base = Verdict.red
    else:
        base = Verdict.yellow

    # Count active adverse signals
    adverse = (1 if sharp else 0) + (1 if rlm else 0)
    if adverse == 0:
        return base

    # Both signals firing: always red
    if adverse >= 2:
        return Verdict.red

    # Single signal: downgrade one level
    if base == Verdict.green:
        return Verdict.yellow
    if base == Verdict.yellow:
        return Verdict.red

    # base is already red
    return Verdict.red
```

---

## 2. Introduce Per-Market Edge Thresholds

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Functions:** `_edge_threshold` (replace), `_build_comparison` (update call site)

Moneyline markets at major books are priced to within 1-2% of true probability; a 3% model edge is within the noise band. Totals and runlines have wider vig and more model-exploitable structure. Use higher thresholds for moneyline and lower for runline/totals to reflect market efficiency differences.

```python
# Per-market edge thresholds (in probability units, not percentage points)
_MARKET_EDGE_THRESHOLDS: dict[str, float] = {
    "moneyline": 0.06,  # ML is efficient; require 6pp edge
    "runline": 0.04,    # Spread markets slightly less efficient
    "total": 0.04,      # Totals similar to runline
    "f5_total": 0.04,
    "nrfi": 0.05,
}
_DEFAULT_EDGE_THRESHOLD = 0.03  # fallback only


def _edge_threshold(market: str | None = None) -> float:
    """Return edge threshold for a given market, respecting env override.

    The env var EDGE_THRESHOLD_PCT still overrides everything when set,
    preserving backward compatibility for existing deployments.
    """
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    if market is not None:
        return _MARKET_EDGE_THRESHOLDS.get(market, _DEFAULT_EDGE_THRESHOLD)
    return _DEFAULT_EDGE_THRESHOLD
```

Then update `_build_comparison` to pass the market:

```python
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,  # now ignored in favor of per-market threshold
) -> MarketComparison | None:
    # ... existing lookup code unchanged ...

    # Use per-market threshold instead of global
    effective_threshold = _edge_threshold(pred.market)

    # ... existing consensus/fair_prob/edge computation unchanged ...
    verdict = _verdict(edge, sharp, rlm, effective_threshold)  # changed
```

And update the call in `_evaluate` to pass `threshold=0` (it's now ignored internally):

```python
for pred in predictions:
    comp = _build_comparison(
        game_id=game_id,
        model_run_id=model_run.id,
        pred=pred,
        odds_snapshots=odds_snapshots,
        splits_snapshots=splits_snapshots,
        threshold=0,  # per-market threshold used internally
    )
```

---

## 3. Add Distribution Sanity Check to Catch Over-Inflated Fair Probabilities

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`

Games 821 (fair_prob=0.717 over, actual 2-1) and 849 (fair_prob=0.793 over, actual 1-0) represent catastrophic calibration failures where the PMF is heavily weighted toward high run totals but the game ended in a pitcher's duel. Add a guard that clamps extreme totals probabilities and logs a warning so upstream distribution bugs surface visibly rather than silently producing bad bets.

```python
# Add this constant near the top of _evaluate.py
_TOTALS_PROB_CLAMP = 0.80  # fair probs above this for totals are almost certainly
                           # distribution artifacts, not genuine edges


def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,
) -> MarketComparison | None:
    # ... existing code up to fair_prob calculation ...
    if pred.market in _DISTRIBUTION_MARKETS:
        if pred.distribution is None or consensus.median_line_value is None:
            return None
        fair_prob = _fair_prob_from_dist(
            pred.distribution, pred.side, consensus.median_line_value
        )
        # Sanity check: probabilities above the clamp threshold suggest
        # the Monte Carlo distribution has a fat tail artifact.
        # Cap and emit a warning so the upstream layer can be debugged.
        if fair_prob > _TOTALS_PROB_CLAMP:
            logger.warning(
                "game_id=%d market=%s side=%s fair_prob=%.3f exceeds clamp=%.2f "
                "(line=%.1f); clamping. Check MC distribution.",
                game_id,
                pred.market,
                pred.side,
                fair_prob,
                _TOTALS_PROB_CLAMP,
                consensus.median_line_value,
            )
            fair_prob = _TOTALS_PROB_CLAMP
    else:
        if pred.fair_prob is None:
            return None
        fair_prob = pred.fair_prob
    # ... rest of function unchanged ...
```

---

## 4. Separate Yellow-Verdict Bets from Win Rate Grading

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_reasoning`

Yellow bets (edge between 0 and threshold) are explicitly flagged as low-confidence. Including them in the same win-rate pool as green bets masks the true performance of high-conviction plays and can trigger false reviews of otherwise healthy markets. Add a verdict breakdown to the reasoning output so the grading layer can split win rates by verdict tier.

```python
def _build_reasoning(
    game_id: int,
    model_run_id: int,
    predictions: list[Prediction],
    results: list[MarketComparison],
    session: Session,
) -> dict:
    from db.models import Game, ParkFactor, WeatherSnapshot

    reasoning: dict = {}
    for pred in predictions:
        if pred.market in ("moneyline", "ml") and pred.side == "home" and pred.fair_prob is not None:
            reasoning["home_win_prob"] = round(pred.fair_prob, 4)
        if pred.market == "total" and pred.side == "over" and pred.distribution:
            try:
                mean_total = sum(float(k) * v for k, v in pred.distribution.items())
                reasoning["projected_total"] = round(mean_total, 2)
            except (ValueError, TypeError):
                pass

    reasoning["edges"] = [
        {
            "market": c.market,
            "side": c.side,
            "edge_pct": round(c.edge_pct, 4),
            "verdict": c.verdict.value,
            "sharp": c.sharp_divergence,
            "rlm": c.reverse_line_movement,
        }
        for c in sorted(results, key=lambda c: abs(c.edge_pct), reverse=True)
    ]

    # NEW: verdict-stratified summary for downstream win-rate tracking
    # This allows the grading layer to compute green-only win rates separately
    # from yellow bets, preventing low-confidence plays from diluting metrics.
    verdict_summary: dict[str, list[dict]] = {"green": [], "yellow": [], "red": []}
    for c in results:
        entry = {"market": c.market, "side": c.side, "edge_pct": round(c.edge_pct, 4)}
        verdict_summary[c.verdict.value].append(entry)
    reasoning["verdict_summary"] = verdict_summary

    reasoning["signals"] = {
        "sharp_markets": [c.market for c in results if c.sharp_divergence],
        "rlm_markets": [c.market for c in results if c.reverse_line_movement],
    }

    context: dict = {}
    game = session.get(Game, game_id)
    if game is not None:
        pf = session.scalar(
            select(ParkFactor).where(
                ParkFactor.park_id == game.park_id,
                ParkFactor.season == game.game_date.year,
            )
        )
        if pf is not None:
            context["park_runs_factor"] = pf.runs_factor
            context["park_hr_factor"] = pf.hr_factor

    weather = session.scalar(
        select(WeatherSnapshot)
        .where(WeatherSnapshot.game_id == game_id)
        .order_by(WeatherSnapshot.captured_at.desc())
        .limit(1)
    )
    if weather is not None:
        context["weather"] = {
            "temp_f": weather.temp_f,
            "wind_mph": weather.wind_mph,
            "wind_dir_deg": weather.wind_dir_deg,
        }

    if context:
        reasoning["context"] = context
    return reasoning
```

---

## 5. Fix `_fair_prob_from_dist` to Use Strict Inequality Consistently

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`

The current under probability includes `_n <= line` (i.e., exactly hitting the total counts as under). In MLB totals markets the standard convention is **over wins on strictly greater than, under wins on strictly less than, and exactly hitting the total is a push**. Including the push outcome in the under probability inflates the under fair_prob by the probability mass on the exact line value, which for a PMF with integer keys and a

Draft file

/home/ubuntu/mlbbetting/analysis_drafts/2026-06-04_0703_model_review.md
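Suggestion 5 in this run turns on how a push is treated when the line is a whole number. The sketch below splits a PMF into over, under, and push mass so the effect of `<=` versus `<` is visible. It is an illustration only: `over_under_push` is a hypothetical helper, and the PMF values are made up for the example rather than taken from any game on this page.

```python
def over_under_push(
    distribution: dict[str, float], line: float
) -> tuple[float, float, float]:
    """Split PMF mass into (over, under, push) relative to a totals line."""
    over = under = push = 0.0
    for key, mass in distribution.items():
        try:
            total = float(key)
        except (ValueError, TypeError):
            continue  # skip non-numeric keys
        if total > line:
            over += mass
        elif total < line:
            under += mass
        else:
            push += mass  # total lands exactly on the line
    return over, under, push


# Made-up PMF over total runs, keyed by strings as in the stored distributions.
pmf = {"6": 0.2, "7": 0.1, "8": 0.3, "9": 0.4}

over, under, push = over_under_push(pmf, 8.0)
under_with_push = under + push  # what an `_n <= line` rule would report

print(f"over={over:.2f} under={under:.2f} push={push:.2f}")
print(f"under using <= : {under_with_push:.2f}")
```

On a half-run line such as 8.5 the push mass is zero and both rules agree, so the discrepancy only shows up on whole-number lines.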

Last 50 games — 2026-05-28 to 2026-05-31

Generated Jun 1, 2026

Analysis run

Last 50 games

ML 29-21 (58.0%)

ATS 34-16 (68.0%)

O/U 28-18-4 (60.9%)

Full season

ML 123-105 (53.9%)

ATS 140-88 (61.4%)

O/U 116-97-15 (54.5%)

Diagnosis

# Diagnosis and Improvement Recommendations

## Diagnosis

### 1. Systematic Totals Miscalibration (Most Impactful Issue)

The totals market is the clearest failure mode in the wrong predictions list. Looking at the losses: Game 775 (actual 3-4, model predicted over with edge=+0.207, fair_prob=0.699), Game 799 (actual 1-5, model predicted over with edge=+0.226, fair_prob=0.705), Game 802 (actual 2-4, model predicted over with edge=+0.137), Game 805 (actual 2-5, model predicted over with edge=+0.323, fair_prob=0.812), Game 808 (actual 1-2, model predicted over with edge=+0.062), Game 809 (actual 2-0, model predicted over with edge=+0.242, fair_prob=0.742), and Game 813 (actual 2-3, model predicted over). That is 7 wrong over predictions. Conversely, the wrong under predictions (Games 782, 783, 811) involve high-scoring actual games (6-8, 8-2, 19-6). The dominant error is over-projected run scoring. The `_fair_prob_from_dist` function computes over probability as `sum(v for k if k > line)`, which is mathematically correct, but the upstream PMF (from Monte Carlo) is clearly shifted right relative to realized outcomes. The totals win rate of 60.9% is close to the review threshold, and the wrong-prediction list is dominated by totals misses despite high stated edges, which is a hallmark of distribution bias rather than edge miscalibration.

### 2. High-Edge Green Verdicts Are Failing at an Alarming Rate

Several "green" picks with large edges are in the wrong predictions list. (Game 772 moneyline/home, edge=+0.201, is not counted here: home won 7-5, so that pick was a win.) Game 775 runline/home_minus (edge=+0.190, green) lost: home lost 3-4. Game 805 total/over (edge=+0.323, green) lost: actual 2-5. Game 811 total/under (edge=+0.145, green) lost: actual 19-6. Game 813 runline/home_minus (edge=+0.144, green) lost: actual 2-3 (away won). Game 809 total/over (edge=+0.242, green) lost: actual 2-0. The pattern is that the model assigns high confidence (green, large edge) to positions that then lose badly. This suggests the edge computation itself is sound mechanically but the `fair_prob` inputs are wrong. Specifically, `compute_edge` is just `model_prob - market_prob`, so inflated `fair_prob` values directly inflate edge with no dampening mechanism. There is no uncertainty band or confidence interval on the PMF-derived probabilities.

### 3. Runline Direction Filter Is Masking a Real Problem with Run-Differential Modeling

The wrong runline predictions show a split: home_minus losses (Games 775, 813) where the home team actually lost outright, and away_plus losses (Games 780, 793, 795) where the away team lost by more than 1.5. Game 780 (actual 1-9, model predicted away_plus, edge=-0.015) was flagged red and still appears as a wrong prediction, indicating the red threshold isn't preventing action. Game 793 (actual 1-6, runline/away_plus, edge=-0.080, red) also got through. The `_verdict` function returns `Verdict.red` for negative edge but the system still records and apparently acts on red verdicts. There's no hard block on negative-edge predictions reaching downstream consumers.

### 4. Park Factor and Weather Context Are Computed But Not Feeding Back Into Edge Thresholds

Looking at the contextual data: Game 780 has park_rf=0.9467 (pitcher-friendly) and 90.6°F with wind=6.7mph, a hot day that typically increases scoring, yet the model predicted away_plus (suggesting a blowout) and it actually was a blowout (1-9) but in the wrong direction. Game 782 has park_rf=1.1467 (extreme hitter's park) and predicted total/under (lost, actual 6-8). Game 811 has park_rf=1.1467 and predicted total/under (lost catastrophically, actual 19-6). The model captures park_rf in reasoning, but it has no effect on the verdict. Likewise, the `_verdict` function signature accepts `sharp` and `rlm` but **never uses them**: both signals are computed and stored but have zero effect on the verdict. This is dead code that represents wasted signal.

### 5. Moneyline Calibration for Mid-Range Probabilities (0.52–0.62) Is Poor

Multiple moneyline wrong predictions cluster in the 0.52–0.62 fair_prob band: Game 768 (0.530, red, lost, so the red flag pointed in the correct direction), Game 769 (0.563, red), Game 770 (0.601, yellow, lost), Game 792 (0.609, yellow, lost). The model's moneyline win rate of 58% is decent but the losses are concentrated in games where the model had low-to-medium conviction. The issue is that `fair_prob` values in the 0.52–0.62 range for moneyline likely reflect genuine uncertainty that the model is not adequately representing. These games are close to coin flips, but the model treats a 0.563 the same way structurally as a 0.716, just with a smaller edge.

---

## Specific Improvement Suggestions

### 1. Fix `_verdict` to Actually Use Sharp and RLM Signals

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_verdict`

The `sharp` and `rlm` parameters are accepted but completely ignored. Sharp money disagreeing with the model should downgrade a green to yellow, and RLM against the model's side should further penalize the verdict. This is a small, self-contained fix with the highest signal-to-noise ratio.

```python
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    """Compute verdict incorporating sharp-money and line-movement signals.

    Degradation rules (applied after edge baseline):
    - Sharp divergence against model side: downgrade one level
      (green→yellow, yellow→red)
    - Reverse line movement against model side: downgrade one level
      (green→yellow, yellow→red)
    - Both signals present: cap at red regardless of edge
    """
    if edge < 0.0:
        return Verdict.red

    # Base verdict from edge alone
    if edge >= threshold:
        base = Verdict.green
    else:
        base = Verdict.yellow

    # Each adverse signal downgrades one level
    # green → yellow → red
    _order = [Verdict.red, Verdict.yellow, Verdict.green]
    level = _order.index(base)
    if sharp:
        level = max(0, level - 1)  # downgrade
    if rlm:
        level = max(0, level - 1)  # downgrade
    return _order[level]
```

---

### 2. Add a Hard Block on Negative-Edge Predictions (Red Verdicts Should Not Persist as Actionable)

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_evaluate`

Games 780 and 793 both had red verdicts (negative edge) but were still wrong predictions that presumably were acted on. The `add_market_comparison` call should either be skipped for red verdicts or the verdict should be stored with an explicit `is_actionable=False` flag. Since modifying the DB schema is heavier, the minimal fix is to log and skip persistence for red verdicts, or to gate on a minimum edge floor:

```python
# In _evaluate(), replace the add_market_comparison call block:
for pred in predictions:
    comp = _build_comparison(
        game_id=game_id,
        model_run_id=model_run.id,
        pred=pred,
        odds_snapshots=odds_snapshots,
        splits_snapshots=splits_snapshots,
        threshold=threshold,
    )
    if comp is None:
        continue

    # Do not persist red-verdict comparisons as actionable picks.
    # They are still appended to results for reasoning/audit purposes,
    # so downstream consumers can filter them by verdict.
    if comp.verdict == Verdict.red:
        logger.info(
            "evaluate_game: skipping persistence for red verdict "
            "game_id=%d market=%s side=%s edge=%.4f",
            game_id,
            comp.market,
            comp.side,
            comp.edge_pct,
        )
        results.append(comp)  # keep for reasoning audit trail
        continue

    add_market_comparison(
        session=session,
        game_id=comp.game_id,
        model_run_id=comp.model_run_id,
        market=comp.market,
        side=comp.side,
        fair_prob=comp.fair_prob,
        fair_price_american=comp.fair_price_american,
        consensus_price_american=comp.consensus_price_american,
        consensus_implied_prob=comp.consensus_implied_prob,
        edge_pct=comp.edge_pct,
        sharp_divergence=comp.sharp_divergence,
        reverse_line_movement=comp.reverse_line_movement,
        verdict=comp.verdict.value,
        evaluated_at=now,
    )
    results.append(comp)
```

---

### 3. Apply Park Factor as an Edge Threshold Modifier for Totals

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`

Games 782 and 811 both have `park_rf=1.1467` (the same extreme hitter's park, almost certainly Coors Field) and both predicted under. The model lost badly (actual 6-8 and 19-6). When the park strongly inflates runs, the PMF's over/under boundary becomes less reliable because the distribution tails are fatter. The fix is to widen the threshold in proportion to how far `park_rf` deviates from 1.0, specifically for totals markets:

```python
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,
) -> MarketComparison | None:
    # ... (existing code unchanged through consensus / fair_prob computation) ...
    fair_price = prob_to_american(fair_prob)
    edge = compute_edge(fair_prob, consensus_implied)

    # --- NEW: park-adjusted threshold for totals markets ---
    effective_threshold = threshold
    if pred.market in _DISTRIBUTION_MARKETS:
        from db.models import Game, ParkFactor
        from sqlalchemy import select
        # We need the session here; thread it in or look it up from snapshots.
        # Simplest approach: encode park_rf in the snapshot query upstream,
        # but as a self-contained patch we retrieve it from the odds context.
        # Instead, accept park_rf as an optional parameter (see call-site change below).
        pass  # see parameterized version below
    # ...
```

Because `_build_comparison` doesn't currently have DB access, the cleaner approach is to pass `park_rf` in from `_evaluate` where the session is available:

```python
# In _evaluate(), resolve park_rf once per game before the loop:
from db.models import Game, ParkFactor
from sqlalchemy import select

park_rf: float = 1.0
game = session.get(Game, game_id)
if game is not None:
    pf = session.scalar(
        select(ParkFactor).where(
            ParkFactor.park_id == game.park_id,
            ParkFactor.season == game.game_date.year,
        )
    )
    if pf is not None:
        park_rf = float(pf.runs_factor)

for pred in predictions:
    comp = _build_comparison(
        game_id=game_id,
        model_run_id=model_run.id,
        pred=pred,
        odds_snapshots=odds_snapshots,
        splits_snapshots=splits_snapshots,
        threshold=threshold,
        park_rf=park_rf,  # NEW
    )
```

```python
# Updated _build_comparison signature and threshold logic:
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,
    park_rf: float = 1.0,  # NEW
) -> MarketComparison | None:
    # ... existing code unchanged until verdict computation ...
    edge = compute_edge(fair_prob, consensus_implied)

    # For totals, widen the required edge threshold when park_rf deviates
    # substantially from neutral (1.0). Each 0.05 deviation adds 1pp to
    # the threshold, capped at 2x the base threshold.
    # Rationale: extreme parks make the PMF tails less reliable; we need
    # more model conviction before betting totals at Coors-type venues.
    effective_threshold = threshold
    if pred.market in _DISTRIBUTION_MARKETS:
        park_deviation = abs(park_rf - 1.0)
        # e.g. park_rf=1.1467 → deviation=0.1467 → +2.93pp added to threshold
        extra = (park_deviation / 0.05) * 0.01
        effective_threshold = min(threshold + extra, threshold * 2.0)

    sharp = detect_sharp_divergence(market_splits)
    rlm = detect_reverse_line_movement(market_odds, market_splits)
    verdict = _verdict(edge, sharp, rlm, effective_threshold)  # pass effective_threshold

    return MarketComparison(
        # ... existing fields ...
        verdict=verdict,
    )
```

---

### 4. Recalibrate the Totals PMF with a Shrinkage Correction for Systematic Over-Bias

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`

The wrong-predictions list has 7 failed over bets and only 3 failed under bets, a 7:3 ratio suggesting the PMF is consistently shifted ~0.3–0.5 runs high. Rather than retraining the Monte Carlo (which is upstream), apply a calibration shrinkage to the derived fair probability for overs that pulls it toward 0.5 when the raw probability is high:

```python
# Calibration constant derived from observed over hit rate in this sample:
# Overs in wrong list: 7 losses. We need win rate data for overs overall,
# but given totals overall is 60.9% and overs appear over-represented in
# losses, we apply a conservative 4% proportional shrinkage toward 0.5 for overs.
_OVER_SHRINKAGE = 0.04   # tune with more data; start conservative
_UNDER_SHRINKAGE = 0.00  # unders appear better calibrated


def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float:
    """Derive fair probability for a totals market from a stored PMF.

    Applies a shrinkage correction toward 0.5 to account for observed
    systematic over-bias in the Monte Carlo run-scoring distribution.
    """
    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    if side == "over":
        raw = sum(
            v for k, v in distribution.items()
            if (_n := _numeric(k)) is not None and _n > line
        )
        # Shrink toward 0.5: reduces over-confidence from inflated PMF
        return raw - _OVER_SHRINKAGE * (raw - 0.5)

    # under
    raw = sum(
        v for k, v in distribution.items()
        if (_n :=
```

Draft file

/home/ubuntu/mlbbetting/analysis_drafts/2026-06-01_0702_model_review.md
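The park-adjusted threshold proposed in suggestion 3 of this run is easier to judge with concrete numbers. The sketch below isolates that formula in a hypothetical standalone function, `park_adjusted_threshold`, and assumes a base threshold of 0.03 (the default quoted in the other runs on this page; this run does not state it).

```python
def park_adjusted_threshold(threshold: float, park_rf: float) -> float:
    """Widen an edge threshold by 1pp per 0.05 of park_rf deviation from 1.0.

    The result is capped at twice the base threshold, mirroring the
    proposal's `min(threshold + extra, threshold * 2.0)`.
    """
    deviation = abs(park_rf - 1.0)
    extra = (deviation / 0.05) * 0.01
    return min(threshold + extra, threshold * 2.0)


BASE = 0.03  # assumed base threshold

hitter_park = park_adjusted_threshold(BASE, 1.1467)   # Games 782 and 811
pitcher_park = park_adjusted_threshold(BASE, 0.9467)  # Game 780
neutral_park = park_adjusted_threshold(BASE, 1.0)
capped = park_adjusted_threshold(BASE, 1.20)          # hypothetical extreme park

for name, value in [
    ("park_rf=1.1467", hitter_park),
    ("park_rf=0.9467", pitcher_park),
    ("park_rf=1.0000", neutral_park),
    ("park_rf=1.2000", capped),
]:
    print(f"{name}: threshold={value:.4f}")
```

Under these assumptions the hitter's park lands at roughly 0.059, just under the 0.06 cap, so any park deviating by more than 0.15 from neutral would simply hit the cap. The Game 811 under pick carried edge=+0.145, so this adjustment alone would not have changed its green verdict.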

Last 43 games — 2026-05-25 to 2026-05-27

Generated May 28, 2026

Analysis run

Last 43 games

ML 22-21 (51.2%)

ATS 21-22 (48.8%)

O/U 23-19-1 (54.8%)

Full season

ML 94-84 (52.8%)

ATS 106-72 (59.6%)

O/U 88-79-11 (52.7%)
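The diagnosis for this run describes the 48.8% runline rate as below breakeven. As a rough reference, the sketch below computes the breakeven win rate implied by an American price. It assumes standard -110 pricing, which is an assumption made for illustration: the actual prices of these picks are not shown on this page, and runline prices often differ from -110. `breakeven_prob` is a hypothetical helper, not part of the model code.

```python
def breakeven_prob(american_odds: int) -> float:
    """Win probability needed to break even at a given American price."""
    if american_odds < 0:
        risk = -american_odds          # stake needed to win 100
        return risk / (risk + 100)
    return 100 / (american_odds + 100)  # a 100 stake wins `american_odds`


ASSUMED_PRICE = -110  # assumption, not taken from the report

breakeven = breakeven_prob(ASSUMED_PRICE)
ats_rate = 21 / (21 + 22)  # ATS record for this run: 21-22

print(f"breakeven at {ASSUMED_PRICE}: {breakeven:.4f}")
print(f"ATS win rate: {ats_rate:.4f}")
print("below breakeven" if ats_rate < breakeven else "at or above breakeven")
```

At -110 the breakeven is 110 / 210, about 52.4%, so a 48.8% rate falls short by roughly 3.5 percentage points under this assumption.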

Diagnosis

# Diagnosis ## 1. Totals Market is the Strongest Signal, But Still Underperforming The totals market shows the best win rate (54.8%) but is still well below the 60% flag threshold. Looking at the wrong predictions, there's a clear systematic bias: the model repeatedly issues **over** picks that lose on low-scoring games. Games 750, 751, 752, 754, 760, 763, and 764 all had `total/over` green verdicts with relatively high fair_probs (0.594–0.742) yet produced final scores of 1-2, 2-3, 1-2, 3-2, 2-4, 4-3, and 1-4 respectively. These are all sub-7 total run games. The distribution-based fair_prob calculation in `_fair_prob_from_dist` sums PMF mass above the line, but if the underlying Monte Carlo run total distribution is systematically right-skewed (fat tails toward high scores), the over probability will be inflated even when the median projection is modest. The park_rf values available (0.947–1.025) show neutral-to-pitcher-friendly parks in several of these cases, which the model may be insufficiently weighting when building the run distribution. ## 2. Runline Model Has Serious Directional Confusion and Calibration Problems The runline win rate of 48.8% is below breakeven. The wrong predictions show two failure modes. First, **home_plus** bets losing badly: Game 725 (actual 1-5, home lost by 4), Game 730 (actual 3-0, home won outright but this was `home_plus` which won — wait, 3-0 home win covers +1.5), Game 741 (actual 7-2, home won big but this was picked as `home_plus`). Actually the most damning cases are Game 733 (away_plus green, actual 3-5 away won outright — this is a *win* for away_plus), so some "wrong" predictions in the list may be wins. The deeper issue is that several high-confidence runline picks (edge >0.15) are losing, particularly `home_minus` bets in Games 727 and 750 where the margin was narrow or wrong direction. 
Second, the `_build_comparison` runline filter logic gates on `median_lv` matching ±0.1 of expected ±1.5, but this doesn't account for alternate lines (e.g., -1.5 vs -2.5 markets) being mixed into the snapshot pool, potentially corrupting the consensus implied probability. ## 3. Edge Threshold is Too Permissive and Verdict Logic Ignores Sharp/RLM Signals The `_verdict` function computes sharp divergence and RLM signals but **completely ignores them** in the verdict output — they're logged to reasoning but don't affect whether a bet is green/yellow/red. This is a significant bug: the signals are computed but wired to nothing. Meanwhile, the 3% edge threshold is very low for a noisy domain like MLB. Looking at the wrong green predictions: Game 724 (edge=+0.031, +0.063), Game 727 (edge=+0.053, +0.112 — these actually won), Game 728 (edge=+0.060 — won). But Game 750 has edge=+0.091 and +0.094 for both runline_home_minus and total/over, both of which lost. Game 762 has edge=+0.043 and +0.051, both lost. The model is generating green verdicts at 3-6% edge on markets where the true calibration error could easily exceed that range, meaning many "green" picks have zero or negative true edge after accounting for model error. ## 4. High home_wp Predictions Are Not Translating to Wins Games 729 (home_wp=0.5006, home won 10-2 — model picked away, lost), 730 (home_wp=0.3201, home won 3-0 — model had home_plus red verdict), 731 (home_wp=0.6153, home won 9-0, model correctly picked home), 732 (home_wp=0.7123, home won 8-2, correct). But Game 733 (home_wp=0.6292, home lost 3-5) had a `runline/away_plus` green pick that won. Game 744 (home_wp=0.6015, home lost 0-6) had `runline/away_plus` green that won. The moneyline calibration appears reasonable at extreme probabilities (>0.71 wins tend to be correct) but breaks down in the 0.50–0.62 range where the model shows a slight home bias — it underpredicts away wins in that band. 
This is consistent with a well-known MLB modeling issue where home field advantage is over-parameterized.

## 5. Missing Contextual Data is a Silent Risk

The wrong predictions list shows `proj_total=?`, `park_rf=?`, and `wind=?mph` as missing for most games; only Games 750, 753–758, 760, 762–764 have park_rf populated. This means `_build_reasoning` is silently omitting park and weather context for the majority of evaluations because the DB lookups return `None`. If park factors and weather are missing at reasoning time, they may also be missing or stale at prediction time (Layer 5), meaning the Monte Carlo simulation is running without proper environmental adjustments for those games. This would explain the systematic over bias: without a pitcher-friendly park factor dampening the run distribution, totals get overestimated.

---

# Specific Improvement Suggestions

## 1. Wire Sharp/RLM Signals into Verdict Logic

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_verdict`

The signals are computed but ignored. Sharp money against your model's side is strong evidence of model error. RLM indicates the market has information your model lacks.

```python
# CURRENT CODE:
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    if edge >= threshold:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    return Verdict.yellow


# PROPOSED CODE:
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    """Incorporate sharp money and RLM as veto signals.

    Even if edge >= threshold, downgrade to yellow when sharp divergence or RLM
    is detected. These signals empirically indicate market information the
    model does not have. Require a higher edge (1.5x threshold) to remain green
    when either signal fires.
    """
    has_adverse_signal = sharp or rlm
    effective_threshold = threshold * 1.5 if has_adverse_signal else threshold
    if edge >= effective_threshold:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    return Verdict.yellow
```

**Why:** The `_build_reasoning` output already stores these signals. Right now they're pure dead weight in the verdict path. Based on the data, several losing picks (Games 750, 760, 762) sit right at the 3-9% edge range where a sharp divergence veto would have suppressed the green signal.

---

## 2. Raise the Default Edge Threshold and Make Market-Specific Thresholds Available

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Functions:** `_edge_threshold`, `_build_comparison`

A flat 3% threshold is too low given the noise level. The runline market at 48.8% win rate needs a higher bar. Totals at 54.8% are the most reliable but still below 60%.

```python
# CURRENT CODE:
_DEFAULT_EDGE_THRESHOLD = 0.03


def _edge_threshold() -> float:
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    return _DEFAULT_EDGE_THRESHOLD


# PROPOSED CODE:
_DEFAULT_EDGE_THRESHOLD = 0.05  # raised from 0.03

# Per-market minimums derived from observed calibration quality.
# Runline is worst-performing, needs highest bar.
# Totals are best-performing, slightly lower bar acceptable.
_MARKET_EDGE_THRESHOLDS: dict[str, float] = {
    "moneyline": 0.05,
    "runline": 0.08,  # penalize underperforming market
    "total": 0.05,
    "f5_total": 0.06,
    "nrfi": 0.06,
}


def _edge_threshold(market: str | None = None) -> float:
    """Return the edge threshold for a market.

    A valid EDGE_THRESHOLD_PCT env value takes precedence over the
    per-market table; otherwise fall back to the table, then the default.
    """
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    if market is not None:
        return _MARKET_EDGE_THRESHOLDS.get(market, _DEFAULT_EDGE_THRESHOLD)
    return _DEFAULT_EDGE_THRESHOLD


# In _build_comparison, pass market to threshold:
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,  # kept as parameter but overridden per market below
) -> MarketComparison | None:
    # ... existing code up to verdict call ...
    # Override threshold per market for finer-grained control
    market_threshold = _edge_threshold(pred.market)
    verdict = _verdict(edge, sharp, rlm, market_threshold)
    # rest of function unchanged
```

**And update the call site in `_evaluate`:**

```python
# In _evaluate, the threshold passed to _build_comparison becomes a fallback:
threshold = _edge_threshold()  # global fallback, still used as param
for pred in predictions:
    comp = _build_comparison(
        game_id=game_id,
        model_run_id=model_run.id,
        pred=pred,
        odds_snapshots=odds_snapshots,
        splits_snapshots=splits_snapshots,
        threshold=threshold,  # _build_comparison now overrides this per market
    )
```

**Why:** Looking at the runline losses: Game 725 edge=+0.155 (lost), Game 733 edge=+0.118 (won), Game 740 edge=+0.146 (won), Game 741 edge=+0.014 (lost). The low-edge runline picks (0.014, 0.043, 0.047) are noise. At 8% runline threshold, Games 741 and 734's runline picks would have been suppressed. The 21-22 runline record suggests the signal exists but is being diluted by marginal picks.

---

## 3. Fix Totals Over-Bias by Adding a PMF Sanity Check and Skew Penalty

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`

The current implementation sums PMF mass above/below the line but doesn't check whether the distribution is physically reasonable or apply any correction for right-skew inflation.

```python
# CURRENT CODE:
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float:
    """Derive fair probability for a totals market from a stored PMF."""

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    if side == "over":
        return sum(
            v for k, v in distribution.items()
            if (_n := _numeric(k)) is not None and _n > line
        )
    return sum(
        v for k, v in distribution.items()
        if (_n := _numeric(k)) is not None and _n <= line
    )


# PROPOSED CODE:
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float:
    """Derive fair probability for a totals market from a stored PMF.

    Includes:
    - Validation that PMF sums to ~1.0 (rejects degenerate distributions)
    - Skew-adjusted probability: if the distribution mean is below the line
      and we're computing over, cap the boost from tail mass.
    """

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    numeric_items = [(_numeric(k), v) for k, v in distribution.items()]
    valid_items = [(n, v) for n, v in numeric_items if n is not None]
    if not valid_items:
        return 0.5  # fallback

    total_mass = sum(v for _, v in valid_items)
    # Reject or renormalize distributions that don't sum to ~1.0
    if total_mass < 0.80:
        # Likely a truncated/sparse distribution, not reliable
        return 0.5
    if abs(total_mass - 1.0) > 0.01:
        # Renormalize
        valid_items = [(n, v / total_mass) for n, v in valid_items]

    if side == "over":
        raw_prob = sum(v for n, v in valid_items if n > line)
    else:
        raw_prob = sum(v for n, v in valid_items if n <= line)

    # Skew correction: compute distribution mean and apply dampening
    # when mean is on the opposite side of the line from our bet.
    # This penalizes cases where the model predicts over but the mean
    # total is well below the line (fat right tail inflating over prob).
    dist_mean = sum(n * v for n, v in valid_items)
    if side == "over" and dist_mean < line:
        # Mean is below line: tail-driven over probability, apply dampening.
        # The further the mean is below the line, the stronger the penalty.
        gap = line - dist_mean  # positive = mean below line
        dampening = max(0.0, 1.0 - (gap / line) * 0.5)
        raw_prob = raw_prob * dampening
    elif side == "under" and dist_mean > line:
        gap = dist_mean - line
        dampening = max(0.0, 1.0 - (gap / line) * 0.5)
        raw_prob = raw_prob * dampening

    # Clamp to valid probability range
    return max(0.01, min(0.99, raw_prob))
```

**Why:** Games 750, 751, 752, 754, 760, 763, 764 all had `total/over` green picks with final scores of 3 or fewer total runs. The fair_probs ranged from 0.594 to 0.742. If the underlying Monte Carlo distribution had a mean near 8 runs but a right tail putting, say, 65% mass above the line at 7.5, and the actual game was a 1-2 pitchers' duel, the mean signal was being ignored. The skew dampening is intended to target this failure mode, although as written it only fires when the distribution mean is on the opposite side of the line from the bet, so a mean near 8 against a 7.5 line would pass through undampened.

---

## 4. Add Minimum Book Count Guard to Consensus Market

**File:** `services/model/src/mlb_model/market/_consensus.py`
**Function:** `consensus_market`

A consensus built from 1-2 books is unreliable and can produce large apparent edges that are really just book-specific line differences.

```python
# CURRENT CODE: no book count guard before computing probs

# PROPOSED CODE: add to consensus_market, after computing book_count:
_MIN_BOOKS_FOR_CONSENSUS = 3  # require at least 3 books


def consensus_market(snapshots: list[OddsSnapshot]) -> ConsensusLine:
    """Compute median price and no-vig implied probability across books.

    Requires at least _MIN_BOOKS_FOR_CONSENSUS books for a valid consensus.
    Returns empty side_implied_probs if book count is insufficient, which
    causes _build_comparison to skip the prediction (returns None via the
    `odds_side not in consensus.side_implied_probs` check).
    """
    if not snapshots:
        return ConsensusLine(
```
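The book-count guard described above can be illustrated in isolation. The following is a minimal sketch, not the project's `consensus_market`: the `Quote` dataclass, `normalize_two_way`, `consensus_probs`, and the `MIN_BOOKS` constant are hypothetical names introduced here, and the sketch removes the vig per book before taking the median (a swapped-in simplification; the report describes a median price followed by vig removal). It assumes one quote per book.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median

MIN_BOOKS = 3  # hypothetical default, mirroring the proposal above


@dataclass(frozen=True)
class Quote:
    """One book's two-way price (hypothetical stand-in for an odds snapshot)."""

    book: str
    price_a: int  # American odds for side A
    price_b: int  # American odds for side B


def implied_prob(american: int) -> float:
    """Convert American odds to the raw (vig-inclusive) implied probability."""
    if american < 0:
        return -american / (-american + 100.0)
    return 100.0 / (american + 100.0)


def normalize_two_way(p_a: float, p_b: float) -> tuple[float, float]:
    """Remove the vig by scaling the two raw probabilities to sum to 1."""
    total = p_a + p_b
    return p_a / total, p_b / total


def consensus_probs(
    quotes: list[Quote], min_books: int = MIN_BOOKS
) -> tuple[float, float] | None:
    """Median no-vig probability per side, or None when coverage is too thin."""
    if len({q.book for q in quotes}) < min_books:
        return None  # caller should skip the comparison instead of trusting it
    fair_a = [
        normalize_two_way(implied_prob(q.price_a), implied_prob(q.price_b))[0]
        for q in quotes
    ]
    med_a = median(fair_a)
    return med_a, 1.0 - med_a


thin = [Quote("book_a", -110, -110), Quote("book_b", -120, 100)]
full = thin + [Quote("book_c", -110, -110)]
print(consensus_probs(thin))  # too few books, so no consensus is returned
print(consensus_probs(full))
```

Returning `None` rather than a low-confidence probability keeps the decision to skip in one place, so a thin market can never produce an edge in the first place.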

Draft file

/home/ubuntu/mlbbetting/analysis_drafts/2026-05-28_0702_model_review.md

Last 50 games — 2026-05-21 to 2026-05-24

Generated May 25, 2026

Analysis run

Last 50 games

ML 20-30 (40.0%)

ATS 32-18 (64.0%)

O/U 25-21-4 (54.4%)

Full season

ML 72-63 (53.3%)

ATS 85-50 (63.0%)

O/U 65-60-10 (52.0%)

Diagnosis

# Diagnosis and Improvement Recommendations

## Diagnosis

### 1. Moneyline Market Has a Severe Calibration Problem (40% Win Rate)

The moneyline market is the clearest failure: 20-30 (40%) over 50 graded games. Looking at the wrong predictions, a sharp pattern emerges. Games where the model assigns `fair_prob` in the **0.52–0.65 range** with **positive edge** consistently lose. For example, Games 685, 688, 693, 694, 699, 703, 704 all had green verdicts with `fair_prob` between 0.526 and 0.653 and lost. Meanwhile, the model's high-conviction calls (home_wp < 0.40 or > 0.68, edge > 0.15) mostly won (Games 677, 678, 679, 681, 683, 711, 713, 715, 717). This strongly suggests the model is **systematically overconfident in the 52–65% probability range**: it is calling edge where none exists. The market is pricing these correctly, and the model's 3% edge threshold is far too low for moneyline picks in this probability band.

### 2. Edge Threshold Is Not Market-Stratified

The current `_verdict` function uses a single flat threshold (default 3%) across moneyline, runline, and totals. This is a critical design flaw. The wrong predictions reveal that **green verdicts with edges of +0.01 to +0.12 on moneyline lose at a high rate**, while **runline green picks at similar edge values win**. The runline is 64% (well above threshold), totals at 54.4% (marginal), and moneyline at 40% (deeply negative). A single 3% threshold across all three markets ignores the fundamentally different variance and market efficiency profiles of each. The moneyline market is the most efficient (sharpest) and requires a much higher edge to show positive expected value. The data suggests moneyline needs at least ~10–12% edge to be viable, while runline can operate closer to 7–8%.

### 3. Probability Range Filtering is Absent: The "Middle Band" Trap

The losing moneyline picks cluster tightly in `fair_prob` 0.52–0.68.
The model has no mechanism to say "even with positive edge, this probability estimate is in the high-noise zone where our model's error bars are large." The `_verdict` function only gates on `edge >= threshold`; there is no concept of **prediction confidence or fair_prob range reliability**. Games 695 and 707 are revealing: both predicted home team with `fair_prob` ~0.541–0.542, edges of -0.001 and -0.068 (correctly flagged red), yet similar games with edge barely above 0 (e.g., Game 670: edge=+0.010) got yellow. The yellow/red distinction at the edge boundary is not causing the problem; the green verdicts with moderate edges are. There is no calibration guard preventing the model from issuing green signals on teams it only mildly favors.

### 4. Total Market Has a Directional Bias Problem

The totals market at 54.4% is marginal and shows a concerning pattern in the wrong predictions. Games 700 (actual=2-0, over predicted), 702 (actual=6-7, under predicted), 703 (actual=11-3, under predicted), 705 (actual=2-5, over predicted) all lost. Game 703 is especially egregious: actual score was 11-3 (total=14, a high-scoring game), but the model bet under with `fair_prob=0.542`, suggesting systematic underestimation of run environment in certain conditions. Game 705 shows the inverse: `fair_prob=0.715` on over, but only 7 runs scored. The `_fair_prob_from_dist` function's reliability depends entirely on the quality of the PMF distribution: if the mean of the distribution is miscalibrated by even 0.5 runs, over/under predictions near the line flip entirely. No line proximity penalty exists.

### 5. Red/Yellow Verdict Games Are Not the Primary Problem; Misclassified Greens Are

Reviewing all wrong predictions, the red-verdict losses (Games 672, 674, 682, 684, 695, 707, 709, 718) are **expected losses**: the model correctly flagged uncertainty. The real damage comes from **green verdict losses**, such as Games 685, 688, 670, and 676 (Games 677 and 678, also green, won).
Filtering to clear green losses: Games 685, 688, 670, 693, 694, 699, 701, 703, 704, 705 are all green moneyline or totals losses. The `detect_sharp_divergence` and `detect_reverse_line_movement` signals exist but their output is **completely unused in the `_verdict` function**: `sharp` and `rlm` are parameters to `_verdict` but the function ignores them entirely. This is a direct code bug: sharp money and reverse line movement are computed but never incorporated into the verdict.

---

## Specific Improvement Suggestions

### 1. Fix the `_verdict` Function to Actually Use Sharp/RLM Signals (Bug Fix)

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_verdict`

The current implementation accepts `sharp` and `rlm` but ignores them completely. This means the entire `_signals.py` module has zero effect on outcomes. When sharp divergence is detected against the model's pick, or when reverse line movement opposes the model, the verdict should be downgraded.

```python
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    """Compute verdict incorporating edge, sharp money, and RLM signals.

    Sharp divergence against the model's side or reverse line movement
    against the model's side are penalizing signals that increase the
    effective threshold required for a green verdict.
    """
    # Count how many adverse market signals are firing
    adverse_signal_count = int(sharp) + int(rlm)
    # Each adverse signal multiplies the effective threshold by 1.5
    # (e.g., 0.03 base -> 0.045 with one signal -> 0.0675 with two)
    effective_threshold = threshold * (1.5 ** adverse_signal_count)
    if edge >= effective_threshold:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    # Any remaining non-negative edge is yellow. This covers picks that clear
    # the base threshold but fall short of the signal-adjusted one.
    return Verdict.yellow
```

---

### 2. Implement Per-Market Edge Thresholds

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Functions:** `_edge_threshold` (replace), `_build_comparison` (modify call site), `_verdict` (modify signature)

The moneyline market at 40% win rate needs a substantially higher threshold. Based on the data, moneyline picks with edge < ~0.10 are losing money. Runline is performing well and can stay near the current threshold. Totals need a modest bump.

```python
# Replace the single _edge_threshold() with per-market thresholds
_MARKET_EDGE_THRESHOLDS: dict[str, float] = {
    "moneyline": 0.09,  # Moneyline is most efficient; 40% WR demands higher bar
    "runline": 0.05,    # Runline at 64%: working, modest tightening
    "total": 0.07,      # Totals at 54.4%: needs improvement
    "f5_total": 0.07,
    "nrfi": 0.07,
}
_DEFAULT_EDGE_THRESHOLD = 0.06  # fallback


def _edge_threshold(market: str | None = None) -> float:
    """Return edge threshold for a specific market, with env override support."""
    # Allow full env override (existing behavior)
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    # Per-market env override, e.g. EDGE_THRESHOLD_MONEYLINE_PCT=9
    if market:
        market_env_key = f"EDGE_THRESHOLD_{market.upper()}_PCT"
        market_raw = os.environ.get(market_env_key, "")
        if market_raw:
            try:
                return float(market_raw) / 100.0
            except ValueError:
                pass
        return _MARKET_EDGE_THRESHOLDS.get(market, _DEFAULT_EDGE_THRESHOLD)
    return _DEFAULT_EDGE_THRESHOLD


# In _build_comparison, change the threshold call:
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,  # kept for signature compat, but overridden per-market below
) -> MarketComparison | None:
    # ... (existing code up to verdict call unchanged) ...
    # Override threshold per market
    effective_threshold = _edge_threshold(pred.market)
    verdict = _verdict(edge, sharp, rlm, effective_threshold)
    return MarketComparison(
        # ... same as before ...
    )


# In _evaluate, remove the single threshold computation from the loop:
def _evaluate(game_id: int, session: Session) -> list[MarketComparison]:
    # ... existing setup ...
    threshold = _edge_threshold()  # kept as fallback only
    # ... rest unchanged, _build_comparison now resolves per-market internally ...
```

---

### 3. Add Fair Probability Confidence Banding to Suppress Low-Conviction Green Picks

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_verdict`

The wrong prediction data shows that `fair_prob` values of 0.52–0.62 on moneyline almost always lose even with positive edge. The model's error in this probability band likely exceeds the claimed edge. Add a confidence band that degrades verdicts when `fair_prob` is near 0.5 (high uncertainty zone).

```python
# Add this constant near the top of _evaluate.py
# Maps market -> minimum fair_prob for green; picks below this are capped at yellow
_MIN_FAIR_PROB_FOR_GREEN: dict[str, float] = {
    "moneyline": 0.62,  # Below 62% moneyline picks are too noisy; data shows consistent losses
    "runline": 0.58,    # Runline working well; modest floor
    "total": 0.58,      # Totals need reasonable conviction
}


def _verdict(
    edge: float,
    sharp: bool,
    rlm: bool,
    threshold: float,
    fair_prob: float = 0.5,
    market: str = "",
) -> Verdict:
    """Compute verdict with edge, signal, and probability confidence checks."""
    adverse_signal_count = int(sharp) + int(rlm)
    effective_threshold = threshold * (1.5 ** adverse_signal_count)
    if edge < 0.0:
        return Verdict.red
    if edge >= effective_threshold:
        # Check if fair_prob is above the minimum conviction floor for this market
        min_prob = _MIN_FAIR_PROB_FOR_GREEN.get(market, 0.55)
        if fair_prob < min_prob:
            return Verdict.yellow  # Downgrade: positive edge but low conviction
        if adverse_signal_count > 0:
            return Verdict.yellow  # Downgrade: signals oppose the pick
        return Verdict.green
    return Verdict.yellow


# Update the call site in _build_comparison:
verdict = _verdict(edge, sharp, rlm, effective_threshold, fair_prob=fair_prob, market=pred.market)
```

---

### 4. Add Line Proximity Penalty for Totals Distributions

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`

The PMF-derived probability is highly sensitive to the exact line value. When the consensus line is within ±0.5 of the projected mean, the fair prob will be near 0.50 and the claimed edge is mostly noise. Games 700, 702, 703, and 705 all show the totals model confidently calling sides that lost, suggesting the distribution mean is close to the market line and small errors dominate.

```python
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float:
    """Derive fair probability for a totals market from a stored PMF.

    Returns a probability shrunk toward 0.50 when the distribution mass
    is concentrated near the line (low-confidence region).
    """

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    if side == "over":
        raw_prob = sum(
            v for k, v in distribution.items()
            if (_n := _numeric(k)) is not None and _n > line
        )
    else:
        raw_prob = sum(
            v for k, v in distribution.items()
            if (_n := _numeric(k)) is not None and _n <= line
        )

    # Compute probability mass within ±1 run of the line (near-line density)
    # High near-line density => distribution straddles the line => low confidence
    near_line_mass = sum(
        v for k, v in distribution.items()
        if (_n := _numeric(k)) is not None and abs(_n - line) <= 1.0
    )

    # Shrink toward 0.50 proportionally to near-line mass.
    # If 40%+ of the distribution is within 1 run of the line,
    # the pick is unreliable, so shrink aggressively.
    # near_line_mass of 0.0 => no shrinkage; 0.5 => 50% shrinkage toward 0.5
    shrinkage = min(near_line_mass, 0.60)  # cap shrinkage factor
    fair_prob = raw_prob * (1.0 - shrinkage) + 0.50 * shrinkage
    return fair_prob
```

---

### 5. Add Minimum Book Count Guard in `consensus_market`

**File:** `services/model/src/mlb_model/market/_consensus.py`
**Function:** `consensus_market`

The consensus can currently be built from a single book, in which case the "consensus" implied probability is just one book's line and the vig removal is unreliable. Low book counts produce noisy consensus prices, making edge calculations meaningless. Several losses may stem from thin market data producing falsely attractive edges.

```python
_MIN_BOOKS_FOR_RELIABLE_CONSENSUS = 3


def consensus_market(snapshots: list[OddsSnapshot]) -> ConsensusLine:
    """Compute median price and no-vig implied probability across books.

    Snapshots should all belong to the same game and market. Within each
    (book, side) pair only the most recent snapshot is used.

    Returns a ConsensusLine with book_count populated; callers should check
    book_count >= _MIN_BOOKS_FOR_RELIABLE_CONSENSUS before trusting
    side_implied_probs for edge computation.
    """
    # ... (existing logic unchanged until return) ...
    # Flag low-confidence consensus in the returned object
    # so _build_comparison can gate on it
    return ConsensusLine(
        median_line_value=median_line,
        side_implied_probs={side_a: prob_a, side_b: prob_b},
        book_count=book_count,
    )


# In _evaluate.py _build_comparison, add after consensus is computed:
consensus = consensus_market(market_odds)
# Require minimum book coverage for reliable edge computation
MIN_BOOKS = int(os.environ.get("MIN_CONSENSUS_BOOKS",
```
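Suggestions 1 through 3 above interact (per-market thresholds, signal-scaled thresholds, and a conviction floor), so it can help to exercise them together outside the codebase. The block below is a minimal standalone sketch under stated assumptions: the `Verdict` enum, the `verdict` function, and both lookup tables are illustrative stand-ins defined here, the numbers are copied from the proposals above, and the sketch uses only the threshold-scaling rule for adverse signals (it does not add the extra hard veto from suggestion 3).

```python
from enum import Enum


class Verdict(Enum):
    """Illustrative stand-in for the project's Verdict type."""

    green = "green"
    yellow = "yellow"
    red = "red"


# Values copied from the proposals above; treat them as tunable assumptions.
EDGE_THRESHOLDS = {"moneyline": 0.09, "runline": 0.05, "total": 0.07}
MIN_FAIR_PROB = {"moneyline": 0.62, "runline": 0.58, "total": 0.58}
DEFAULT_EDGE = 0.06
DEFAULT_FLOOR = 0.55


def verdict(
    edge: float,
    fair_prob: float,
    market: str,
    sharp: bool = False,
    rlm: bool = False,
) -> Verdict:
    """Classify a pick using edge, market signals, and a conviction floor."""
    if edge < 0.0:
        return Verdict.red
    # Each adverse signal multiplies the required edge by 1.5.
    adverse = int(sharp) + int(rlm)
    required_edge = EDGE_THRESHOLDS.get(market, DEFAULT_EDGE) * (1.5 ** adverse)
    floor = MIN_FAIR_PROB.get(market, DEFAULT_FLOOR)
    if edge >= required_edge and fair_prob >= floor:
        return Verdict.green
    return Verdict.yellow


# A 10% moneyline edge is green on its own, yellow once sharp money opposes it,
# and yellow when the fair probability sits in the noisy middle band.
print(verdict(0.10, 0.65, "moneyline"))
print(verdict(0.10, 0.65, "moneyline", sharp=True))
print(verdict(0.10, 0.58, "moneyline"))
```

Keeping the rule as one pure function of its inputs makes it straightforward to replay historical picks through alternative threshold tables before changing production values.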

Draft file

/home/ubuntu/mlbbetting/analysis_drafts/2026-05-25_0902_model_review.md

Last 85 games — 2026-05-14 to 2026-05-20

Generated May 21, 2026

Analysis run

Last 85 games

ML 52-33 (61.2%)

ATS 53-32 (62.4%)

O/U 40-39-6 (50.6%)

Full season

ML 52-33 (61.2%)

ATS 53-32 (62.4%)

O/U 40-39-6 (50.6%)

Diagnosis

# Diagnosis and Improvement Recommendations

## Diagnosis

### 1. Totals Market Calibration is Severely Broken

The totals market is running at 50.6% (40-39), well below the 60% threshold. Looking at the wrong predictions, the problem is stark and consistent: the model assigned **high fair_prob and strong positive edge to "under" predictions** on games that went over dramatically. Game 599 (3-13, 16 total runs) had `fair_prob=0.756, edge=+0.264` on under. Game 634 (9-3, 12 runs) had `fair_prob=0.805, edge=+0.315` on under. Game 630 (16-7, 23 runs) had `fair_prob=0.723, edge=+0.236` on under. Game 622 (8-6, 14 runs) had `fair_prob=0.727, edge=+0.242` on under. These are not marginal misses: these are games that blew up dramatically, yet the model was extremely confident in the under. The fair_prob values in the 0.70-0.80 range should be hitting at near those rates; they are clearly not. This points to a systematic upstream bias in the run-scoring distribution, likely the pitcher/batter rolling features or park factor weighting **suppressing projected totals below true expectation**.

### 2. The `_fair_prob_from_dist` Function Has a Boundary Error on the Under

In `_evaluate.py`, the `_fair_prob_from_dist` function computes the under probability as:

```python
return sum(v for k, v in distribution.items() if (_n := _numeric(k)) is not None and _n <= line)
```

This includes the **exact line value** in the under bucket. For a half-point line such as 8.5 this makes no difference, but if a line is a whole number like 8 or 9, a total landing exactly on the line (a push) is counted as an under win. More critically, if `median_line_value` is being computed from a mix of half-point and whole-number lines (e.g., 8.5 vs 9.0), the `statistics.median()` in `consensus_market` could return a value like 8.75 that doesn't align with how PMF keys are stored. If PMF keys are integers (0, 1, 2, ..., 20) and the consensus line comes back as 8.75, then `_n <= 8.75` captures runs 0-8 while `_n > 8.75` captures 9+, which is correct for a half-point line. But if the PMF is a **discrete distribution** and the model's projected total is systematically too low (e.g., projecting 7.5 total when actual is 9-10), the under bucket will always appear inflated. The combination of a low projected total mean with a PMF that has most mass below the consensus line artificially inflates `fair_prob` for unders.

### 3. Edge Threshold is Too Permissive for Low-Confidence Markets

The `_verdict` function marks anything ≥ 0.03 (3%) as green. Looking at wrong predictions, many losing bets had edge values of +0.01 to +0.06 on totals/under: Game 573 (edge=+0.051), Game 611 (edge=+0.046), Game 619 (edge=+0.035), Game 626 (edge=+0.053). A 3% edge threshold was possibly calibrated when the model was more accurate. The totals market's 50.6% win rate means the model has **negative expected value on many "green" totals plays**. The threshold needs to be higher for totals specifically, and should be scaled by `fair_prob`: a play with `fair_prob=0.51` and `edge=+0.04` is not the same confidence level as `fair_prob=0.65` and `edge=+0.04`.

### 4. Runline Side-Mapping Has a Silent Probability Inversion Risk

The `_MODEL_TO_ODDS` dict maps both `("runline", "home_minus")` and `("runline", "home_plus")` to `("spreads", "home_runline")`, and both `("runline", "away_minus")` and `("runline", "away_plus")` to `("spreads", "away_runline")`. The line-value filtering in `_build_comparison` is the only guard against comparing a +1.5 prediction against a -1.5 market price. But `consensus_market` uses `devig_two_way` across whatever two sides exist. If book data has inconsistent side labeling (some books labeling home runline as -1.5, others as +1.5 due to a data ingestion quirk), the `median_line_value` guard could pass while the `consensus_implied` is actually for the **opposite side**.
Several runline losses (Game 572: away runline with `edge=+0.012, verdict=yellow` in a 4-9 game; Games 573 and 574) involve away runline predictions on games where the away team won big (correct direction), which hints that the prediction may have been compared against the wrong line.

### 5. No Contextual Filtering on High-Variance Games

The wrong predictions include multiple blowout games (1-15, 2-12, 3-13, 9-3, 16-7, 0-12) where both the total and the runline went against the model's under/close-game prediction. Games 570-575, all on 2026-05-14, show a cluster of failures: all predicted away wins and unders, and several ended in blowouts. This suggests a **weather or park event** on that date was not captured (park_rf and weather fields are `?` in the data). The model has no mechanism to reduce confidence or widen the edge threshold when contextual data is missing: it still outputs green verdicts with high edges even when `home_wp`, `proj_total`, `park_rf`, and weather are all unknown.

---

## Specific Improvement Suggestions

### 1. Fix the PMF boundary condition and add line-alignment validation in `_fair_prob_from_dist`

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_fair_prob_from_dist`
**Problem:** The `<=` boundary on under includes the exact line integer, and there's no validation that the consensus `median_line_value` is sensible relative to the PMF's key range. Also, totals lines are almost always half-points (8.5, 9.5); if they're stored as whole numbers in any book, the median can land on a whole number and the under/over split will be wrong.

```python
# Assumes a module-level `logger = logging.getLogger(__name__)`.
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float | None:
    """Derive fair probability for a totals market from a stored PMF.

    Returns None if the line value is outside the PMF's support range,
    which indicates a data alignment problem rather than a real probability.
    """

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    # Keep only entries whose key parses as a number
    numeric_items = [
        (n, v) for k, v in distribution.items()
        if (n := _numeric(k)) is not None
    ]
    if not numeric_items:
        return None

    keys = [n for n, _ in numeric_items]
    pmf_min, pmf_max = min(keys), max(keys)
    # Guard: if the consensus line is outside the PMF support, the PMF and
    # market are misaligned. Return None so this comparison is skipped
    # rather than producing a garbage fair_prob.
    if line < pmf_min or line > pmf_max:
        logger.warning(
            "_fair_prob_from_dist: line=%.2f outside PMF range [%.1f, %.1f]; skipping",
            line, pmf_min, pmf_max,
        )
        return None

    # Use strict comparisons on both sides. On half-point lines this matches
    # the old behavior. For whole-number lines, the "push" bucket (exact line)
    # is excluded from both sides (it won't exist in practice for MLB totals,
    # but this makes the split unambiguous).
    if side == "over":
        return sum(v for n, v in numeric_items if n > line)
    # under: strictly less than line (exclude exact-line ties)
    return sum(v for n, v in numeric_items if n < line)
```

Then update the caller in `_build_comparison` to handle the `None` return:

```python
if pred.market in _DISTRIBUTION_MARKETS:
    if pred.distribution is None or consensus.median_line_value is None:
        return None
    fair_prob = _fair_prob_from_dist(
        pred.distribution, pred.side, consensus.median_line_value
    )
    # NEW: treat None as uncomputable and skip this comparison
    if fair_prob is None:
        return None
```

---

### 2. Implement per-market edge thresholds with a `fair_prob` minimum floor

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Functions:** `_verdict`, `_edge_threshold`, `_build_comparison`
**Problem:** A flat 3% edge threshold ignores that totals predictions are less reliable and that low `fair_prob` green calls (0.51 with 4% edge) have little real value. The totals market is hitting 50.6%; its effective threshold should be raised until it's demonstrated to be calibrated.
```python
# Replace the single _DEFAULT_EDGE_THRESHOLD with a per-market dict
_DEFAULT_EDGE_THRESHOLD = 0.03
_MARKET_EDGE_THRESHOLDS: dict[str, float] = {
    "moneyline": 0.04,  # slight increase from 0.03
    "runline": 0.04,
    "total": 0.07,      # raised significantly: market is at 50.6%, needs higher bar
    "f5_total": 0.07,
    "nrfi": 0.05,
}

# Minimum fair_prob required for a green verdict. Below this, cap at yellow
# regardless of edge, because low-probability estimates are high-variance
_MIN_FAIR_PROB_FOR_GREEN: dict[str, float] = {
    "moneyline": 0.53,
    "runline": 0.60,
    "total": 0.58,      # require meaningful confidence on totals
    "f5_total": 0.58,
    "nrfi": 0.55,
}


def _edge_threshold(market: str | None = None) -> float:
    """Return edge threshold for a specific market, with env override."""
    raw = os.environ.get("EDGE_THRESHOLD_PCT", "")
    if raw:
        try:
            return float(raw) / 100.0
        except ValueError:
            pass
    if market is not None:
        return _MARKET_EDGE_THRESHOLDS.get(market, _DEFAULT_EDGE_THRESHOLD)
    return _DEFAULT_EDGE_THRESHOLD


def _verdict(
    edge: float,
    sharp: bool,
    rlm: bool,
    threshold: float,
    fair_prob: float,
    market: str,
) -> Verdict:
    # Note: sharp and rlm are accepted but still unused in this version.
    min_prob = _MIN_FAIR_PROB_FOR_GREEN.get(market, 0.52)
    if edge >= threshold and fair_prob >= min_prob:
        return Verdict.green
    if edge < 0.0:
        return Verdict.red
    return Verdict.yellow
```

Update the call site in `_build_comparison`:

```python
# Pass market-specific threshold
threshold = _edge_threshold(pred.market)
sharp = detect_sharp_divergence(market_splits)
rlm = detect_reverse_line_movement(market_odds, market_splits)
verdict = _verdict(edge, sharp, rlm, threshold, fair_prob, pred.market)
```

---

### 3. Add a missing-context confidence penalty

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`
**Problem:** Games 570-575 all have `?` for `home_wp`, `proj_total`, `park_rf`, and weather. The model still outputs green verdicts with large edges.
When key contextual inputs are absent, the fair_prob estimate is based on incomplete information and should be penalized or the verdict should be capped.

```python
from sqlalchemy import select  # assumed: SQLAlchemy 2.0-style session API


def _context_completeness_score(pred: Prediction, session: Session, game_id: int) -> float:
    """Return a score in [0, 1] representing fraction of key context present.

    Used to down-grade verdict confidence when model inputs are missing.
    """
    from db.models import Game, ParkFactor, WeatherSnapshot

    score = 0.0
    checks = 0

    # Check 1: home_win_prob populated (moneyline prediction exists)
    if pred.fair_prob is not None:
        score += 1.0
    checks += 1

    # Check 2: park factor present (a missing game row counts as a failed check)
    game = session.get(Game, game_id)
    if game is not None:
        pf = session.scalar(
            select(ParkFactor).where(
                ParkFactor.park_id == game.park_id,
                ParkFactor.season == game.game_date.year,
            )
        )
        if pf is not None:
            score += 1.0
    checks += 1

    # Check 3: weather snapshot present
    weather = session.scalar(
        select(WeatherSnapshot)
        .where(WeatherSnapshot.game_id == game_id)
        .order_by(WeatherSnapshot.captured_at.desc())
        .limit(1)
    )
    if weather is not None:
        score += 1.0
    checks += 1

    return score / checks if checks > 0 else 0.0


# In _build_comparison, after computing verdict, apply context penalty:
verdict = _verdict(edge, sharp, rlm, threshold, fair_prob, pred.market)
# Downgrade verdict when context completeness is low.
# With three checks, any single missing input (2/3 < 0.67) triggers the downgrade.
# Requires session to be threaded through: add session param to _build_comparison
completeness = _context_completeness_score(pred, session, game_id)
if completeness < 0.67 and verdict == Verdict.green:
    logger.info(
        "Downgrading game_id=%d market=%s side=%s from green to yellow: "
        "context completeness=%.2f",
        game_id, pred.market, pred.side, completeness,
    )
    verdict = Verdict.yellow
```

Update `_build_comparison` signature to accept `session`:

```python
def _build_comparison(
    game_id: int,
    model_run_id: int,
    pred: Prediction,
    odds_snapshots: list[OddsSnapshot],
    splits_snapshots: list[SplitsSnapshot],
    threshold: float,
    session: Session,  # NEW
) -> MarketComparison | None:
```

And update the call in `_evaluate`:

```python
comp = _build_comparison(
    game_id=game_id,
    model_run_id=model_run.id,
    pred=pred,
    odds_snapshots=odds_snapshots,
    splits_snapshots=splits_snapshots,
    threshold=threshold,
    session=session,  # NEW
)
```

---

### 4. Validate runline side consistency before computing consensus

**File:** `services/model/src/mlb_model/market/_evaluate.py`
**Function:** `_build_comparison`
**Problem:** The current guard checks `median_lv` against expected `±1.5`, but `consensus_market` is called *before* this check, and `consensus_implied` is already computed from whatever the two sides happen to be in the snapshot data. If any book has home and away runline both at -1.5 (a data error), `devig_two_way` produces garbage. Add an explicit consistency check on the snapshot data before passing to `consensus_market`.

```python
if pred.market == "runline":
    # Pre-filter: only keep snapshots where line values are internally
    # consistent (home = -1.5 when away = +1.5, or vice versa).
    # Remove any book where both sides have the same sign.
    def _runline_snapshots_are_
```

Draft file

/home/ubuntu/mlbbetting/analysis_drafts/2026-05-21_0901_model_review.md

Last 40 games — 2026-05-17 to 2026-05-19

Generated May 19, 2026

Analysis run

Last 40 games

ML22-18(55.0%)

ATS25-15(62.5%)

O/U16-21-3(43.2%)

Diagnosis

# Diagnosis and Improvement Plan

## Diagnosis

### 1. Totals Market Systematic Failure (Most Critical Issue)

The totals market is performing at **43.2% (16-21-3)**, well below the 60% threshold and actually worse than random chance. Looking at the wrong predictions, the pattern is striking and consistent: the model repeatedly predicted **under** with high confidence and large edges, yet the actual scores were frequently high. Examples include:

- Game 630: actual 16-7, under predicted with fair_prob=0.723, edge=+0.236
- Game 634: actual 9-3, under predicted with fair_prob=0.805, edge=+0.315
- Game 644: actual 6-9, under predicted with fair_prob=0.724, edge=+0.235
- Game 631: actual 6-7, under predicted with fair_prob=0.686, edge=+0.171

The model's highest-confidence under picks are systematically wrong; this is not noise. The Monte Carlo layer appears to be generating downward-biased run distributions, meaning `_fair_prob_from_dist` consistently overestimates under probability. The boundary condition in `_fair_prob_from_dist` is also suspect: under is computed as `_n <= line`, so a total landing exactly on the line (a push in most books) is counted as a win for the under, slightly inflating under fair_prob.

### 2. Moneyline Calibration Is Poor for Marginal Edges

The moneyline is at 55.0% (22-18), which is passable, but the wrong predictions reveal a calibration problem: the model is generating contradictory signals within the same game. Game 621 had `moneyline/home` as red (edge=-0.004) yet `runline/home_plus` as green (edge=+0.052), and home won 2-0. The runline was right and the moneyline direction was right, but the market layer assigned opposite verdicts. Game 623 had `moneyline/home` red (edge=-0.053) yet home won 10-1. Game 651 had `moneyline/home` red yet home won 6-4.

This suggests the moneyline fair_prob values are being slightly underestimated relative to the runline fair_prob values from the same underlying model, indicating an inconsistency between how `pred.fair_prob` is set for moneyline and how `_fair_prob_from_dist` or the runline probability is derived. The moneyline and runline probabilities should be mathematically consistent: when a team wins 10-1 there should not be a negative moneyline edge.

### 3. Edge Threshold Is Too Permissive for Low-Confidence Predictions

Several red-verdict predictions (negative edge) still appear in the wrong-predictions list, which means the model correctly flagged them as red but they were presumably still surfaced somewhere. More critically, many yellow-verdict predictions with edges of +0.004 to +0.027 are wrong. The current `_DEFAULT_EDGE_THRESHOLD = 0.03` (3pp) is too low: the green bar starts only just above edges that are borderline noise, such as Game 618 (runline, fair_prob=0.604, edge=+0.004), which stayed yellow only because 0.004 < 0.03.

The `_verdict` function also ignores the `sharp` and `rlm` signals entirely: they are computed but have zero effect on the verdict. This is a significant unused-signal bug.

### 4. The `_fair_prob_from_dist` Boundary Condition and Distribution Quality

The under boundary `_n <= line` is incorrect for integer lines, because run totals are discrete. In baseball, a total of exactly 8 when the line is 8 is a push, not a win for the under. The stored PMF likely uses integer run totals, and the line value from `consensus.median_line_value` is often a non-integer (e.g., 8.5). Whenever it is an integer (e.g., 8.0), however, the under calculation includes the push scenario, inflating the under probability by whatever mass is at exactly 8 runs.
Combined with a systematic downward bias in the Monte Carlo run distribution (possibly from stale pitcher rolling averages, under-weighted bullpen degradation, or park factor under-application), this compounds to make the totals model unreliable.

### 5. Contextual Data Is Missing from Reasoning but May Also Be Missing from the Model

All wrong predictions show `home_wp=?, proj_total=?, park_rf=?, ?F wind=?mph`: the reasoning context is not being populated. This means either `_build_reasoning` is failing silently (the `session.get(Game, game_id)` or `ParkFactor` queries are returning None), or the data simply isn't in the database. If park factors and weather are not reaching the Monte Carlo layer either, that would explain the systematic bias: a hitter-friendly park with hot weather would have its run environment underestimated, leading to exactly the pattern seen (high-scoring actual games, under predictions from the model).

---

## Specific Improvement Suggestions

### 1. Fix the `_fair_prob_from_dist` boundary condition and add a push-aware calculation

**File:** `services/model/src/mlb_model/market/_evaluate.py`

**Function:** `_fair_prob_from_dist`

**Why:** The current `<=` boundary for under means a line of 8.0 counts a final total of 8 as an under win. In practice this is a push (money returned). This inflates under fair_prob. Additionally, half-point lines (8.5) are common and should be handled explicitly. The fix should also add a push mass tracker for diagnostic purposes.

```python
def _fair_prob_from_dist(
    distribution: dict[str, float], side: str, line: float
) -> float:
    """Derive fair probability for a totals market from a stored PMF.

    For integer lines, the mass exactly on the line is a push and is excluded
    from both over and under probability, then each side is renormalized by
    (1 - push_mass) so they sum to 1.0.
    """

    def _numeric(k: str) -> float | None:
        try:
            return float(k)
        except (ValueError, TypeError):
            return None

    is_half_point = abs(line - round(line)) >= 0.4  # e.g. 8.5

    over_mass = sum(
        v for k, v in distribution.items()
        if (_n := _numeric(k)) is not None and _n > line
    )
    under_mass = sum(
        v for k, v in distribution.items()
        if (_n := _numeric(k)) is not None and _n < line
    )
    # push_mass is tracked for diagnostics only; it is not used in the result
    push_mass = sum(
        v for k, v in distribution.items()
        if (_n := _numeric(k)) is not None and _n == line
    ) if not is_half_point else 0.0

    # For integer lines, renormalize excluding push mass so over+under = 1.0
    live_mass = over_mass + under_mass
    if live_mass <= 0.0:
        return 0.5  # degenerate distribution
    if side == "over":
        return over_mass / live_mass
    return under_mass / live_mass
```

---

### 2. Make sharp divergence and RLM signals actually affect the verdict

**File:** `services/model/src/mlb_model/market/_evaluate.py`

**Function:** `_verdict`

**Why:** The `sharp` and `rlm` parameters are passed into `_verdict` but completely ignored. The model detects sharp money and reverse line movement but throws the information away. Sharp divergence should upgrade a yellow to green and can also upgrade a borderline red to yellow. RLM on the popular side being faded should similarly boost confidence. This is free signal that's currently wasted.

```python
def _verdict(edge: float, sharp: bool, rlm: bool, threshold: float) -> Verdict:
    """Compute verdict incorporating sharp money and RLM signals.

    Signal logic:
    - Sharp divergence: smart money aligns with our edge -> lower threshold
      by 30% (sharps provide confirmation)
    - RLM: line moving against public -> adds 0.01 to effective edge
      (structural value signal)
    - Negative edge with sharp confirmation -> yellow instead of red
      (sharps may know something the model doesn't)
    """
    effective_edge = edge
    effective_threshold = threshold

    if sharp:
        # Sharps confirm our position: reduce required threshold
        effective_threshold = threshold * 0.70
    if rlm:
        # Line moving against public money: treat as +1pp bonus
        effective_edge = edge + 0.010

    if effective_edge >= effective_threshold:
        return Verdict.green
    if edge < 0.0 and not sharp:
        # Negative edge without sharp confirmation: red
        return Verdict.red
    if edge < 0.0 and sharp:
        # Negative model edge but sharps are on this side: yellow (conflicting signals)
        return Verdict.yellow
    return Verdict.yellow
```

---

### 3. Raise the default edge threshold and add a minimum fair_prob gate

**File:** `services/model/src/mlb_model/market/_evaluate.py`

**Constants and function:** `_DEFAULT_EDGE_THRESHOLD`, `_build_comparison`

**Why:** The current 3pp threshold sets the green bar very low: Game 618 (edge=+0.004) stayed yellow only because 0.004 < 0.03, and green starts at 0.03, which is still very small. The wrong-prediction data shows multiple green-verdict losses with edges in the 0.03–0.06 range. Raise the threshold to 5pp. Additionally, no prediction with a fair_prob below 0.52 should ever be green regardless of edge: these are essentially coin flips where model error exceeds the signal.
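As a rough illustration of the two gates described above (a higher edge threshold and a minimum fair_prob), the sketch below classifies a few sample predictions under the current and proposed settings. It is a simplified stand-in, not the project's `_verdict`: the `Verdict` enum and the `classify` helper are hypothetical, sharp and RLM signals are ignored, and only the Game 618 numbers come from the diagnosis. The other two rows are made up.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical stand-in for the project's Verdict type
    green = "green"
    yellow = "yellow"
    red = "red"


def classify(
    edge: float, fair_prob: float, threshold: float, min_fair_prob: float = 0.52
) -> Verdict:
    """Simplified gate (assumption): ignores sharp and RLM signals."""
    if edge < 0.0:
        return Verdict.red
    if edge >= threshold and fair_prob >= min_fair_prob:
        return Verdict.green
    return Verdict.yellow


# (label, fair_prob, edge)
samples = [
    ("Game 618 runline", 0.604, 0.004),    # values quoted in the diagnosis
    ("made-up 4pp edge", 0.560, 0.040),    # illustrative only
    ("made-up coin flip", 0.500, 0.080),   # illustrative only
]

for label, fair_prob, edge in samples:
    # Current settings: 3pp threshold, no fair_prob gate
    old = classify(edge, fair_prob, threshold=0.03, min_fair_prob=0.0)
    # Proposed settings: 5pp threshold, fair_prob must be at least 0.52
    new = classify(edge, fair_prob, threshold=0.05)
    print(f"{label}: current={old.value} proposed={new.value}")
```

Under these assumptions Game 618 stays yellow either way, while the made-up 4pp edge and the made-up coin-flip prediction both drop from green to yellow.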
```python
_DEFAULT_EDGE_THRESHOLD = 0.05  # raised from 0.03: require 5pp edge for green
_MIN_FAIR_PROB_FOR_GREEN = 0.52  # below this, cap at yellow regardless of edge


# NOTE: the 0.5 default is below the gate, so callers that omit fair_prob
# can never receive a green verdict.
def _verdict(
    edge: float, sharp: bool, rlm: bool, threshold: float, fair_prob: float = 0.5
) -> Verdict:
    effective_edge = edge
    effective_threshold = threshold
    if sharp:
        effective_threshold = threshold * 0.70
    if rlm:
        effective_edge = edge + 0.010

    if effective_edge >= effective_threshold:
        # Additional gate: fair_prob must clear minimum confidence bar
        if fair_prob < _MIN_FAIR_PROB_FOR_GREEN:
            return Verdict.yellow
        return Verdict.green
    if edge < 0.0 and not sharp:
        return Verdict.red
    if edge < 0.0 and sharp:
        return Verdict.yellow
    return Verdict.yellow


# In _build_comparison, update the call:
verdict = _verdict(edge, sharp, rlm, threshold, fair_prob=fair_prob)
```

---

### 4. Add a totals-specific edge multiplier penalty to debias the systematic under lean

**File:** `services/model/src/mlb_model/market/_evaluate.py`

**Function:** `_build_comparison`

**Why:** The data shows a systematic over-prediction of under probability. Until the upstream Monte Carlo distribution is recalibrated, apply a temporary debiasing correction in the market layer: for totals markets, apply a shrinkage factor that pulls the fair_prob toward 0.5 in proportion to how extreme it is. This is a Bayesian-style regularization acknowledging the model's known bias. Also add a market-specific edge threshold for totals.

```python
_TOTALS_SHRINKAGE = 0.15  # pull 15% toward 0.5; recalibrate when MC is fixed
_TOTALS_EDGE_THRESHOLD_MULTIPLIER = 1.4  # require 40% more edge on totals


def _apply_totals_debiasing(fair_prob: float) -> float:
    """Shrink totals fair_prob toward 0.5 to correct systematic MC bias.

    Remove once Monte Carlo run distribution is recalibrated.
    """
    return fair_prob * (1.0 - _TOTALS_SHRINKAGE) + 0.5 * _TOTALS_SHRINKAGE


# In _build_comparison, after computing fair_prob:
if pred.market in _DISTRIBUTION_MARKETS:
    if pred.distribution is None or consensus.median_line_value is None:
        return None
    fair_prob = _fair_prob_from_dist(
        pred.distribution, pred.side, consensus.median_line_value
    )
    # Debiasing: correct for the known systematic lean toward the under
    fair_prob = _apply_totals_debiasing(fair_prob)
    # Use a higher edge threshold for totals until MC is recalibrated
    market_threshold = threshold * _TOTALS_EDGE_THRESHOLD_MULTIPLIER
else:
    if pred.fair_prob is None:
        return None
    fair_prob = pred.fair_prob
    market_threshold = threshold

# ... rest of function uses market_threshold instead of threshold:
verdict = _verdict(edge, sharp, rlm, market_threshold, fair_prob=fair_prob)
```

---

### 5. Fix the `consensus_market` side ordering to be deterministic

**File:** `services/model/src/mlb_model/market/_consensus.py`

**Function:** `consensus_market`

**Why:** `sides_with_prices[0], sides_with_prices[1]` indexes a list built by iterating over a dict (`prices_by_side`, a `defaultdict`). In Python 3.7+ a dict preserves insertion order, but that order depends on which snapshot was processed first. If the order of `home_ml`/`away_ml` flips between runs, `devig_two_way(median_a, median_b)` will assign `prob_a` to whichever side happens to be first, potentially swapping home and away implied probabilities. This would invert edge calculations for affected games, with a -5pp edge appearing as +5pp. The fix is to sort sides explicitly.
```python
def consensus_market(snapshots: list[OddsSnapshot]) -> ConsensusLine:
    """Compute median price and no-vig implied probability across books."""
    if not snapshots:
        return ConsensusLine(
            median_line_value=None,
            side_implied_probs={},
            book_count=0,
        )

    latest: dict[tuple[str, str], OddsSnapshot] = {}
    for snap in snapshots:
        key = (snap.book, snap.side)
        if key not in latest or snap.captured_at > latest[key].captured_at:
            latest[key] = snap

    prices_by_side: dict[str, list[int]] = defaultdict(list)
    lines_by_side: dict[str, list[float]] = defaultdict(list)
    for (_, side), snap in latest.items():
        if snap.price_american is not None:
            prices_by_side[side].append(snap.price_american)
        if snap.line_value is not None:
            lines_by_side[side].append(snap.line_value)

    book_count = len({book for book, _ in latest})
    all_lines = [v for vals in lines_by_side.values() for v in vals]
    median_line: float | None = statistics.median(all_lines) if all_lines else None

    sides_with_prices = sorted(  # FIXED: deterministic ordering
        [s for s in prices_by_side if prices_by_side[s]]
    )
    if len(sides_with_prices) < 2:
        probs = {s: 0.5 for s in sides_with_prices}
        return ConsensusLine(
            median_line_value=median_line,
            side_implied_probs=probs,
            book_count=book_count,
        )

    side_a, side_b = sides_with_prices[0], sides_with_prices[1]
    median_a = int(round(statistics.median(prices_by_side[side_a])))
    median_b = int(round(statistics.median(prices_by_side[side_b])))
    prob_a, prob_b = devig_two_way(median_a, median_b)

    return ConsensusLine(
        median_line
```
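To show why side ordering matters for the edge, here is a minimal sketch of a two-way devig applied in both orders. The project's real `devig_two_way` is not shown in this review, so `american_to_implied` and `devig_two_way_sketch` are hypothetical stand-ins that assume proportional (multiplicative) vig removal. The prices and the `fair_home` value are made up for illustration.

```python
def american_to_implied(price: int) -> float:
    """Convert American odds to the raw (vig-inclusive) implied probability."""
    if price < 0:
        return -price / (-price + 100.0)
    return 100.0 / (price + 100.0)


def devig_two_way_sketch(price_a: int, price_b: int) -> tuple[float, float]:
    """Assumed method: normalize the two raw probabilities so they sum to 1.

    Returns (prob_a, prob_b) in the same order as the arguments.
    """
    raw_a = american_to_implied(price_a)
    raw_b = american_to_implied(price_b)
    total = raw_a + raw_b
    return raw_a / total, raw_b / total


# Made-up consensus prices and a made-up model probability for the home side
prices = {"home": -150, "away": 130}
fair_home = 0.55

# Correct pairing: the first returned probability belongs to the home side
prob_home, prob_away = devig_two_way_sketch(prices["home"], prices["away"])
edge_ok = fair_home - prob_home

# Swapped pairing: the caller still treats the first value as "home",
# but it is actually the away side's implied probability
prob_first, _ = devig_two_way_sketch(prices["away"], prices["home"])
edge_swapped = fair_home - prob_first

print(f"edge with correct order: {edge_ok:+.3f}")
print(f"edge with swapped order: {edge_swapped:+.3f}")
```

With these made-up inputs the correctly paired edge is negative while the swapped one is positive, which is the kind of sign inversion that a deterministic side ordering is meant to rule out.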

Draft file

/home/ubuntu/mlbbetting/services/model/../../analysis_drafts/2026-05-20_0349_model_review.md