# march-madness-bot
Historical analytics scaffold for the Kaggle March Machine Learning Mania 2026 dataset.
## Competition overview review

Based on the public competition metadata and the downloaded CSV bundle in `march-machine-learning-mania-2026/`, the competition objective is to forecast 2026 NCAA tournament matchups. The dataset includes historical men's and women's:
- regular season detailed and compact game results
- NCAA tournament detailed and compact game results
- tournament seeds and bracket slot metadata
- team metadata and conference mappings
- men’s Massey ordinal rankings
- sample submission files for Kaggle scoring
This repo is currently focused on the historical analytics / insight-generation side of that dataset rather than Kaggle submission modeling. The first scaffold question is:

> Which single raw, non-ranking box-score stat is most associated with winning a March Madness game?
## What this scaffold adds
- a reusable loader for the Kaggle CSVs
- a normalized team-game table (one row per team per game)
- enrichment with team names, conferences, and tournament seeds
- a small CLI for common historical questions
- a first-pass stat ranking workflow using stat differentials and ROC AUC
## Project layout

```
src/march_madness_bot/data.py       # CSV ingestion + normalized team-game table
src/march_madness_bot/questions.py  # reusable historical question functions
src/march_madness_bot/cli.py        # terminal entry point
tests/test_questions.py             # smoke tests
data/processed/                     # optional exported normalized tables
```
## Setup

Install the package and its dependencies into your current Python environment:

```
cd march-madness-bot
python -m pip install -e .
```
## Quick start

Show dataset coverage:

```
python -m march_madness_bot.cli dataset-summary --gender all --season-type tournament
```

Export normalized team-game tables:

```
python -m march_madness_bot.cli build-cache --gender all --season-type tournament
```

Rank raw stats by win signal in March Madness:

```
python -m march_madness_bot.cli top-stat --gender all --season-type tournament --limit 10
```

Evaluate all 2- and 3-stat raw combinations (men's tournament history) and show the top 10 by accuracy:

```
python -m march_madness_bot.cli top-stat-combos --top-n 10
```

Generate a 2026 men's bracket using FGM and FTM first; if confidence is below 70%, fall back to team ranking:

```
python -m march_madness_bot.cli pick-bracket-fgm-ftm-fallback --season 2026 --confidence-threshold 70 --output-csv data/processed/men_2026_fgm_ftm_fallback_bracket.csv
```
## Stat abbreviation key

The `stat` column in `top-stat` output uses these box-score abbreviations:

- FGM: Field Goals Made
- FGA: Field Goals Attempted
- FGM3: 3-Point Field Goals Made
- FGA3: 3-Point Field Goals Attempted
- FTM: Free Throws Made
- FTA: Free Throws Attempted
- OR: Offensive Rebounds
- DR: Defensive Rebounds
- Ast: Assists
- TO: Turnovers
- Stl: Steals
- Blk: Blocks
- PF: Personal Fouls
Related feature naming in this repo:

- `diff_<STAT>` means team stat minus opponent stat (for example, `diff_Ast`)
- `Opp<STAT>` means the opponent's value of that stat (for example, `OppAst`)
- `ScoreDiff` means `TeamScore - OppScore`
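As a hedged illustration of this naming convention (the toy numbers below are made up; the real table is built in `src/march_madness_bot/data.py` and may differ in detail):

```python
import pandas as pd

# Toy team-game rows illustrating the naming convention above.
# "Ast" is the team's assists, "OppAst" the opponent's assists.
rows = pd.DataFrame(
    {
        "TeamScore": [78, 65],
        "OppScore": [65, 78],
        "Ast": [18, 12],
        "OppAst": [12, 18],
    }
)

# diff_<STAT> is team stat minus opponent stat; ScoreDiff follows the same idea.
rows["diff_Ast"] = rows["Ast"] - rows["OppAst"]
rows["ScoreDiff"] = rows["TeamScore"] - rows["OppScore"]
```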
## Metric definitions in top-stat

The ranking output includes these two quality metrics:

- `roc_auc`: Area Under the ROC Curve for a stat differential score against win/loss labels.
  - Interpretation: the probability that a random winning row has a higher stat differential than a random losing row.
  - `0.50` means no separation (random), `>0.50` means useful signal, `<0.50` means inverse signal.
- `zero_threshold_accuracy`: accuracy from a simple rule using the differential sign only.
  - Rule: predict win if `diff_<STAT> > 0`, predict loss if `diff_<STAT> < 0`, and treat exact ties (0) as half-correct.
  - `0.50` is near coin-flip; higher is better.

How to read them together:

- Use `roc_auc` as the primary ranking metric for single-stat separation strength.
- Use `zero_threshold_accuracy` as a practical sanity check for a very simple threshold rule.
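A minimal sketch of both metrics on invented data (the arrays are toy values; the repo's actual computation may differ):

```python
import numpy as np

diff = np.array([6.0, -6.0, 2.0, -1.0, 0.0, 0.0])  # diff_<STAT> for six rows
won = np.array([1, 0, 1, 0, 1, 0])                 # win/loss labels

# roc_auc as its probabilistic interpretation: the chance a random winning
# row outscores a random losing row, counting exact ties as half.
pos = diff[won == 1]
neg = diff[won == 0]
pairs = len(pos) * len(neg)
roc_auc = (
    (pos[:, None] > neg[None, :]).sum()
    + 0.5 * (pos[:, None] == neg[None, :]).sum()
) / pairs

# zero_threshold_accuracy: the sign of the differential predicts the outcome,
# with exact ties (diff == 0) counted as half-correct.
correct = np.where(diff > 0, won == 1, np.where(diff < 0, won == 0, 0.5))
zero_threshold_accuracy = correct.mean()
```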
Inspect one program's tournament history:

```
python -m march_madness_bot.cli team-history "Duke" --gender men
```
Fill a 2026 men’s bracket using the two strongest single stats (FGM and DR):

```
python -m march_madness_bot.cli pick-bracket --season 2026 --fgm-weight 1.0 --dr-weight 1.0
```
Show a bracket-style visual view instead of a flat table:

```
python -m march_madness_bot.cli pick-bracket --season 2026 --view bracket
```
`pick-bracket` output is intentionally human-focused and includes only these columns:

`Round`, `Slot`, `StrongTeam`, `WeakTeam`, `WinnerTeam`, `WinnerConfidencePct`
Confidence interpretation:

- `WinnerConfidencePct` is derived from the winner's `StrengthGap` using a logistic transform, so larger gaps produce higher confidence.
- `WinnerConfidencePct` is rounded to 3 decimal places for easier manual entry.
Optionally export picks to CSV:

```
python -m march_madness_bot.cli pick-bracket --season 2026 --output-csv data/processed/men_2026_fgm_dr_bracket.csv
```

`--output-csv` always exports the flat table columns for easy import: `Round`, `Slot`, `StrongTeam`, `WeakTeam`, `WinnerTeam`, `WinnerConfidencePct`.
## Methodology for the example insight

The `top-stat` command:

- loads detailed tournament box scores
- converts each game into two rows: one winner perspective and one loser perspective
- computes raw stat differentials such as `diff_Ast`, `diff_TO`, and `diff_OR`
- scores each raw stat differential against win/loss labels with ROC AUC
- ranks stats by that separation strength and a simple zero-threshold accuracy check
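The winner/loser expansion in the second step can be sketched like this (Kaggle-style `W*`/`L*` column names are assumed; the actual implementation lives in `src/march_madness_bot/data.py`):

```python
import pandas as pd

# One raw game row in the Kaggle convention: winner columns prefixed W, loser L.
games = pd.DataFrame(
    {"WTeamID": [1101], "LTeamID": [1202], "WAst": [18], "LAst": [12]}
)

# Winner perspective: Won = 1, own stat first, opponent stat second.
winner = pd.DataFrame(
    {"TeamID": games["WTeamID"], "Ast": games["WAst"], "OppAst": games["LAst"], "Won": 1}
)
# Loser perspective: the same game seen from the other bench.
loser = pd.DataFrame(
    {"TeamID": games["LTeamID"], "Ast": games["LAst"], "OppAst": games["WAst"], "Won": 0}
)

team_games = pd.concat([winner, loser], ignore_index=True)
team_games["diff_Ast"] = team_games["Ast"] - team_games["OppAst"]
```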
The `pick-bracket` command:

- loads all 2026 men's regular-season games from `MRegularSeasonDetailedResults.csv` and computes each team's season-average `FGM` and season-average `DR` across all regular-season games
- standardizes both stats across all teams using z-scores (a team one standard deviation above average in `FGM` gets a score of `+1.0`, and so on), then adds the two z-scores into a single `strength` number per team:

  `strength = (fgm_weight × FGM_z-score) + (dr_weight × DR_z-score)`

  With the default weights of `1.0` / `1.0`, both stats contribute equally.
- resolves each tournament slot from `MNCAATourneySeeds.csv` + `MNCAATourneySlots.csv`, including play-in games first
- in each matchup, the team with the higher `strength` wins — no seed, historical tournament record, or other stat is used; seed and team ID are consulted only as tiebreakers when two teams have the exact same strength score
- computes winner confidence from `StrengthGap = WinnerStrength - LoserStrength`:

  `confidence = 1 / (1 + exp(-1.25 * StrengthGap))`

  and reports the percentage as `WinnerConfidencePct` (rounded to 3 decimals)
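The strength and confidence arithmetic above can be sketched with made-up season averages (team names and numbers are invented; only the formulas mirror the description):

```python
import math
import statistics

# Made-up season-average FGM and DR for three hypothetical teams.
fgm_avg = {"A": 30.0, "B": 27.0, "C": 24.0}
dr_avg = {"A": 26.0, "B": 25.0, "C": 21.0}

def zscores(values):
    # Standardize across all teams: (value - mean) / population std dev.
    mean = statistics.mean(values.values())
    sd = statistics.pstdev(values.values())
    return {team: (v - mean) / sd for team, v in values.items()}

fgm_z, dr_z = zscores(fgm_avg), zscores(dr_avg)

fgm_weight = dr_weight = 1.0  # the documented defaults
strength = {t: fgm_weight * fgm_z[t] + dr_weight * dr_z[t] for t in fgm_avg}

def confidence(winner, loser):
    # Logistic transform of the strength gap, as described above.
    gap = strength[winner] - strength[loser]
    return 1.0 / (1.0 + math.exp(-1.25 * gap))
```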
Important caveats:
- this is descriptive, not causal
- it uses postgame box scores, so it tells you what most separates winners from losers historically
- it intentionally excludes rankings / seeds from the stat ranking itself
- it also excludes obvious outcome leakage like final score from the candidate stat list
## Extending the scaffold
The easiest next additions are:
- seed-upset analysis by round
- conference performance in the tournament over time
- regular-season-to-tournament feature pipelines
- team profile comparisons and matchup previews
If you add a new historical question, prefer implementing it in `src/march_madness_bot/questions.py` first and then exposing it through `src/march_madness_bot/cli.py`.