
march-madness-bot

Historical analytics scaffold for the Kaggle March Machine Learning Mania 2026 dataset.

Competition overview review

Based on the public competition metadata and the downloaded CSV bundle in march-machine-learning-mania-2026/, the competition objective is to forecast 2026 NCAA tournament matchups. The dataset includes historical men's and women's:

  • regular season detailed and compact game results
  • NCAA tournament detailed and compact game results
  • tournament seeds and bracket slot metadata
  • team metadata and conference mappings
  • men's Massey ordinal rankings
  • sample submission files for Kaggle scoring
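The repo's actual CSV ingestion lives in src/march_madness_bot/data.py; as a rough sketch of what reading that bundle looks like (the directory name comes from above; the helper name load_csv is hypothetical, and individual file names follow the standard Kaggle March Mania convention):

```python
from pathlib import Path

import pandas as pd

# Hypothetical minimal loader; DATA_DIR matches the bundle directory
# named above. File names like MNCAATourneyDetailedResults.csv follow
# the usual Kaggle March Mania naming convention.
DATA_DIR = Path("march-machine-learning-mania-2026")

def load_csv(name: str) -> pd.DataFrame:
    """Read one CSV from the downloaded Kaggle bundle."""
    return pd.read_csv(DATA_DIR / f"{name}.csv")

# Usage: men's tournament box scores
# tourney = load_csv("MNCAATourneyDetailedResults")
```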

This repo is currently focused on the historical analytics / insight-generation side of that dataset rather than Kaggle submission modeling. The first scaffold question is:

Which single raw, non-ranking box-score stat is most associated with winning a March Madness game?

What this scaffold adds

  • a reusable loader for the Kaggle CSVs
  • a normalized team-game table (one row per team per game)
  • enrichment with team names, conferences, and tournament seeds
  • a small CLI for common historical questions
  • a first-pass stat ranking workflow using stat differentials and ROC AUC

Project layout

src/march_madness_bot/data.py        # CSV ingestion + normalized team-game table
src/march_madness_bot/questions.py   # reusable historical question functions
src/march_madness_bot/cli.py         # terminal entry point
tests/test_questions.py              # smoke tests
data/processed/                      # optional exported normalized tables

Setup

Install the package and its dependencies into your current Python environment:

cd /path/to/march-madness-bot
python -m pip install -e .

Quick start

Show dataset coverage:

python -m march_madness_bot.cli dataset-summary --gender all --season-type tournament

Export normalized team-game tables:

python -m march_madness_bot.cli build-cache --gender all --season-type tournament

Rank raw stats by win signal in March Madness:

python -m march_madness_bot.cli top-stat --gender all --season-type tournament --limit 10

Evaluate all 2- and 3-stat raw combinations (men's tournament history) and show top 10 by accuracy:

python -m march_madness_bot.cli top-stat-combos --top-n 10

Generate a 2026 men's bracket using FGM and FTM first; if confidence is below 70%, fall back to team ranking:

python -m march_madness_bot.cli pick-bracket-fgm-ftm-fallback --season 2026 --confidence-threshold 70 --output-csv data/processed/men_2026_fgm_ftm_fallback_bracket.csv
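The fallback rule this command applies can be sketched as a small helper (names are illustrative, not the repo's actual API):

```python
def pick_winner(stat_pick: str, stat_confidence_pct: float,
                ranking_pick: str, threshold_pct: float = 70.0) -> str:
    """Hypothetical sketch of the fallback rule: keep the FGM/FTM-based
    pick only when its confidence clears the threshold; otherwise fall
    back to the ranking-based pick."""
    if stat_confidence_pct >= threshold_pct:
        return stat_pick
    return ranking_pick

# pick_winner("Duke", 82.5, "Houston")  -> "Duke"
# pick_winner("Duke", 61.0, "Houston")  -> "Houston"
```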

Stat abbreviation key

The stat column in top-stat output uses these box-score abbreviations:

  • FGM: Field Goals Made
  • FGA: Field Goals Attempted
  • FGM3: 3-Point Field Goals Made
  • FGA3: 3-Point Field Goals Attempted
  • FTM: Free Throws Made
  • FTA: Free Throws Attempted
  • OR: Offensive Rebounds
  • DR: Defensive Rebounds
  • Ast: Assists
  • TO: Turnovers
  • Stl: Steals
  • Blk: Blocks
  • PF: Personal Fouls

Related feature naming in this repo:

  • diff_<STAT> means team stat minus opponent stat (for example, diff_Ast)
  • Opp<STAT> means opponent value of that stat (for example, OppAst)
  • ScoreDiff means TeamScore - OppScore
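To make the naming convention concrete, here is one toy team-perspective row (numbers invented):

```python
import pandas as pd

# One team-perspective row with invented numbers.
row = pd.Series({"Ast": 15, "OppAst": 9, "TeamScore": 78, "OppScore": 70})

diff_Ast = row["Ast"] - row["OppAst"]           # team stat minus opponent stat -> 6
ScoreDiff = row["TeamScore"] - row["OppScore"]  # -> 8
```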

Metric definitions in top-stat

The ranking output includes these two quality metrics:

  • roc_auc: Area Under the ROC Curve for a stat differential score against win/loss labels.
    • Interpretation: the probability that a randomly chosen winning row has a higher stat differential than a randomly chosen losing row.
    • 0.50 means no separation (random), >0.50 means useful signal, <0.50 means inverse signal.
  • zero_threshold_accuracy: Accuracy from a simple rule using differential sign only.
    • Rule: predict win if diff_<STAT> > 0, predict loss if diff_<STAT> < 0, and treat exact ties (0) as half-correct.
    • 0.50 is near coin-flip, higher is better.

How to read them together:

  • Use roc_auc as the primary ranking metric for single-stat separation strength.
  • Use zero_threshold_accuracy as a practical sanity check for a very simple threshold rule.
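Both metrics follow directly from the definitions above and can be computed with plain NumPy. A sketch (labels y are 1 for winner rows, 0 for loser rows; the pairwise Mann-Whitney form of AUC is used here, which may differ from the repo's actual implementation):

```python
import numpy as np

def roc_auc(diff: np.ndarray, y: np.ndarray) -> float:
    """Probability a random winning row has a higher differential than a
    random losing row (pairwise Mann-Whitney form; exact ties count half)."""
    pos, neg = diff[y == 1], diff[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def zero_threshold_accuracy(diff: np.ndarray, y: np.ndarray) -> float:
    """Predict win iff diff > 0; exact ties (diff == 0) count as half-correct."""
    correct = ((diff > 0) & (y == 1)) | ((diff < 0) & (y == 0))
    ties = diff == 0
    return (correct.sum() + 0.5 * ties.sum()) / len(diff)

diff = np.array([2.0, 0.0, -1.0, 0.0])
y = np.array([1, 1, 0, 0])
print(roc_auc(diff, y))                  # 0.875
print(zero_threshold_accuracy(diff, y))  # 0.75
```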

Inspect one program's tournament history:

python -m march_madness_bot.cli team-history "Duke" --gender men

Fill a 2026 men's bracket using the two strongest single stats (FGM and DR):

python -m march_madness_bot.cli pick-bracket --season 2026 --fgm-weight 1.0 --dr-weight 1.0

Show a bracket-style visual view instead of a flat table:

python -m march_madness_bot.cli pick-bracket --season 2026 --view bracket

pick-bracket output is intentionally human-focused and includes only:

  • Round
  • Slot
  • StrongTeam
  • WeakTeam
  • WinnerTeam
  • WinnerConfidencePct

Confidence interpretation:

  • WinnerConfidencePct is derived from the winner's StrengthGap using a logistic transform, so larger gaps produce higher confidence.
  • WinnerConfidencePct is rounded to 3 decimal places for easier manual entry.

Optionally export picks to CSV:

python -m march_madness_bot.cli pick-bracket --season 2026 --output-csv data/processed/men_2026_fgm_dr_bracket.csv

--output-csv always exports the flat table columns for easy import: Round, Slot, StrongTeam, WeakTeam, WinnerTeam, WinnerConfidencePct.

Methodology for the example insight

The top-stat command:

  1. loads detailed tournament box scores
  2. converts each game into two rows: one winner perspective and one loser perspective
  3. computes raw stat differentials such as diff_Ast, diff_TO, and diff_OR
  4. scores each raw stat differential against win/loss labels with ROC AUC
  5. ranks stats by that separation strength and a simple zero-threshold accuracy check
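Steps 2 and 3 can be sketched with pandas on one toy game. The W/L column shape (WTeamID, LTeamID, WAst, LAst, ...) matches the Kaggle detailed-results files; the numbers are invented:

```python
import pandas as pd

# One toy game in Kaggle detailed-results shape (numbers invented).
games = pd.DataFrame({
    "WTeamID": [1101], "LTeamID": [1102],
    "WAst": [15], "LAst": [9],
    "WTO": [10], "LTO": [14],
})

# Step 2: two rows per game (winner and loser perspectives);
# step 3: raw stat differentials from each perspective.
winners = pd.DataFrame({
    "TeamID": games["WTeamID"], "win": 1,
    "diff_Ast": games["WAst"] - games["LAst"],
    "diff_TO": games["WTO"] - games["LTO"],
})
losers = pd.DataFrame({
    "TeamID": games["LTeamID"], "win": 0,
    "diff_Ast": games["LAst"] - games["WAst"],
    "diff_TO": games["LTO"] - games["WTO"],
})
team_games = pd.concat([winners, losers], ignore_index=True)
#    TeamID  win  diff_Ast  diff_TO
# 0    1101    1         6       -4
# 1    1102    0        -6        4
```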

The pick-bracket command:

  1. loads all 2026 men's regular-season games from MRegularSeasonDetailedResults.csv and computes each team's season average FGM and season average DR across all regular-season games
  2. standardizes both stats across all teams using z-scores (a team one standard deviation above average in FGM gets a score of +1.0, etc.), then adds the two z-scores together into a single strength number per team:
    strength = (fgm_weight × FGM_z-score) + (dr_weight × DR_z-score)
    
    With default weights of 1.0 / 1.0, both stats contribute equally.
  3. resolves each tournament slot from MNCAATourneySeeds.csv + MNCAATourneySlots.csv, including play-in games first
  4. in each matchup, the team with the higher strength wins; no seed, historical tournament record, or other stat is used. Seed and team ID are consulted only as tiebreakers when two teams have exactly the same strength score
  5. computes winner confidence from StrengthGap = WinnerStrength - LoserStrength:
    confidence = 1 / (1 + exp(-1.25 * StrengthGap))
    
    and reports the percentage as WinnerConfidencePct (rounded to 3 decimals)
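The strength and confidence computations above can be sketched directly. The season-average numbers below are invented; the 1.25 logistic scale and the equal default weights come from the formulas in steps 2 and 5:

```python
import numpy as np
import pandas as pd

# Invented season averages for four teams.
teams = pd.DataFrame({
    "TeamID": [1101, 1102, 1103, 1104],
    "FGM": [28.0, 26.0, 30.0, 24.0],
    "DR": [25.0, 27.0, 24.0, 26.0],
})

# Step 2: z-score each stat across all teams, then combine with the weights.
fgm_z = (teams["FGM"] - teams["FGM"].mean()) / teams["FGM"].std()
dr_z = (teams["DR"] - teams["DR"].mean()) / teams["DR"].std()
teams["strength"] = 1.0 * fgm_z + 1.0 * dr_z  # fgm_weight = dr_weight = 1.0

# Step 5: winner confidence from the strength gap, as a percentage.
def winner_confidence_pct(winner_strength: float, loser_strength: float) -> float:
    gap = winner_strength - loser_strength
    return round(100.0 / (1.0 + np.exp(-1.25 * gap)), 3)
```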

Important caveats:

  • this is descriptive, not causal
  • it uses postgame box scores, so it tells you what most separates winners from losers historically
  • it intentionally excludes rankings / seeds from the stat ranking itself
  • it also excludes obvious outcome leakage like final score from the candidate stat list

Extending the scaffold

The easiest next additions are:

  • seed-upset analysis by round
  • conference performance in the tournament over time
  • regular-season-to-tournament feature pipelines
  • team profile comparisons and matchup previews

If you add a new historical question, prefer implementing it in src/march_madness_bot/questions.py first and then exposing it through src/march_madness_bot/cli.py.
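As a sketch of that pattern for the first suggested addition (seed-upset analysis by round), with hypothetical column names Round, WSeedNum, and LSeedNum that are not the repo's actual schema:

```python
import pandas as pd

def upset_rate_by_round(games: pd.DataFrame) -> pd.Series:
    """Share of games per round won by the numerically worse seed.

    Would live in src/march_madness_bot/questions.py and be exposed
    through cli.py; column names here are illustrative only.
    """
    upset = games["WSeedNum"] > games["LSeedNum"]
    return upset.groupby(games["Round"]).mean()

games = pd.DataFrame({
    "Round": [1, 1, 2, 2],
    "WSeedNum": [1, 12, 4, 11],
    "LSeedNum": [16, 5, 5, 3],
})
print(upset_rate_by_round(games))  # round 1 -> 0.5, round 2 -> 0.5
```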