
march-madness-bot

Historical analytics scaffold for the Kaggle March Machine Learning Mania 2026 dataset.

Competition overview review

Based on the public competition metadata and the downloaded CSV bundle in march-machine-learning-mania-2026/, the competition objective is to forecast 2026 NCAA tournament matchups. The dataset includes historical men's and women's:

  • regular season detailed and compact game results
  • NCAA tournament detailed and compact game results
  • tournament seeds and bracket slot metadata
  • team metadata and conference mappings
  • men's Massey ordinal rankings
  • sample submission files for Kaggle scoring
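The repo's actual CSV ingestion lives in src/march_madness_bot/data.py; as a rough sketch of what reading that bundle looks like (the directory name comes from above; the helper name load_csv is hypothetical, and individual file names follow the standard Kaggle March Mania convention):

```python
from pathlib import Path

import pandas as pd

# Hypothetical minimal loader; DATA_DIR matches the bundle directory
# named above. File names like MNCAATourneyDetailedResults.csv follow
# the usual Kaggle March Mania naming convention.
DATA_DIR = Path("march-machine-learning-mania-2026")

def load_csv(name: str) -> pd.DataFrame:
    """Read one CSV from the downloaded Kaggle bundle."""
    return pd.read_csv(DATA_DIR / f"{name}.csv")

# Usage: men's tournament box scores
# tourney = load_csv("MNCAATourneyDetailedResults")
```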

This repo is currently focused on the historical analytics / insight-generation side of that dataset rather than Kaggle submission modeling. The first scaffold question is:

Which single raw, non-ranking box-score stat is most associated with winning a March Madness game?

What this scaffold adds

  • a reusable loader for the Kaggle CSVs
  • a normalized team-game table (one row per team per game)
  • enrichment with team names, conferences, and tournament seeds
  • a small CLI for common historical questions
  • a first-pass stat ranking workflow using stat differentials and ROC AUC

Project layout

src/march_madness_bot/data.py        # CSV ingestion + normalized team-game table
src/march_madness_bot/questions.py   # reusable historical question functions
src/march_madness_bot/cli.py         # terminal entry point
tests/test_questions.py              # smoke tests
data/processed/                      # optional exported normalized tables

Setup

Install the package and its dependencies into your current Python environment:

cd /path/to/march-madness-bot
python -m pip install -e .

Quick start

Show dataset coverage:

python -m march_madness_bot.cli dataset-summary --gender all --season-type tournament

Export normalized team-game tables:

python -m march_madness_bot.cli build-cache --gender all --season-type tournament

Rank raw stats by win signal in March Madness:

python -m march_madness_bot.cli top-stat --gender all --season-type tournament --limit 10

Evaluate all 2- and 3-stat raw combinations (men's tournament history) and show top 10 by accuracy:

python -m march_madness_bot.cli top-stat-combos --top-n 10

Generate a 2026 men's bracket using FGM and FTM first; if confidence is below 70%, fall back to team ranking:

python -m march_madness_bot.cli pick-bracket-fgm-ftm-fallback --season 2026 --confidence-threshold 70 --output-csv data/processed/men_2026_fgm_ftm_fallback_bracket.csv
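The fallback rule this command applies can be sketched as a small helper (names are illustrative, not the repo's actual API):

```python
def pick_winner(stat_pick: str, stat_confidence_pct: float,
                ranking_pick: str, threshold_pct: float = 70.0) -> str:
    """Hypothetical sketch of the fallback rule: keep the FGM/FTM-based
    pick only when its confidence clears the threshold; otherwise fall
    back to the ranking-based pick."""
    if stat_confidence_pct >= threshold_pct:
        return stat_pick
    return ranking_pick

# pick_winner("Duke", 82.5, "Houston")  -> "Duke"
# pick_winner("Duke", 61.0, "Houston")  -> "Houston"
```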

Stat abbreviation key

The stat column in top-stat output uses these box-score abbreviations:

  • FGM: Field Goals Made
  • FGA: Field Goals Attempted
  • FGM3: 3-Point Field Goals Made
  • FGA3: 3-Point Field Goals Attempted
  • FTM: Free Throws Made
  • FTA: Free Throws Attempted
  • OR: Offensive Rebounds
  • DR: Defensive Rebounds
  • Ast: Assists
  • TO: Turnovers
  • Stl: Steals
  • Blk: Blocks
  • PF: Personal Fouls

Related feature naming in this repo:

  • diff_<STAT> means team stat minus opponent stat (for example, diff_Ast)
  • Opp<STAT> means opponent value of that stat (for example, OppAst)
  • ScoreDiff means TeamScore - OppScore
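To make the naming convention concrete, here is one toy team-perspective row (numbers invented):

```python
import pandas as pd

# One team-perspective row with invented numbers.
row = pd.Series({"Ast": 15, "OppAst": 9, "TeamScore": 78, "OppScore": 70})

diff_Ast = row["Ast"] - row["OppAst"]           # team stat minus opponent stat -> 6
ScoreDiff = row["TeamScore"] - row["OppScore"]  # -> 8
```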

Metric definitions in top-stat

The ranking output includes these two quality metrics:

  • roc_auc: Area Under the ROC Curve for a stat differential score against win/loss labels.
    • Interpretation: the probability that a randomly chosen winning row has a higher stat differential than a randomly chosen losing row.
    • 0.50 means no separation (random), >0.50 means useful signal, <0.50 means inverse signal.
  • zero_threshold_accuracy: Accuracy from a simple rule using differential sign only.
    • Rule: predict win if diff_<STAT> > 0, predict loss if diff_<STAT> < 0, and treat exact ties (0) as half-correct.
    • 0.50 is near coin-flip, higher is better.

How to read them together:

  • Use roc_auc as the primary ranking metric for single-stat separation strength.
  • Use zero_threshold_accuracy as a practical sanity check for a very simple threshold rule.
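Both metrics follow directly from the definitions above and can be computed with plain NumPy. A sketch (labels y are 1 for winner rows, 0 for loser rows; the pairwise Mann-Whitney form of AUC is used here, which may differ from the repo's actual implementation):

```python
import numpy as np

def roc_auc(diff: np.ndarray, y: np.ndarray) -> float:
    """Probability a random winning row has a higher differential than a
    random losing row (pairwise Mann-Whitney form; exact ties count half)."""
    pos, neg = diff[y == 1], diff[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def zero_threshold_accuracy(diff: np.ndarray, y: np.ndarray) -> float:
    """Predict win iff diff > 0; exact ties (diff == 0) count as half-correct."""
    correct = ((diff > 0) & (y == 1)) | ((diff < 0) & (y == 0))
    ties = diff == 0
    return (correct.sum() + 0.5 * ties.sum()) / len(diff)

diff = np.array([2.0, 0.0, -1.0, 0.0])
y = np.array([1, 1, 0, 0])
print(roc_auc(diff, y))                  # 0.875
print(zero_threshold_accuracy(diff, y))  # 0.75
```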

Inspect one program's tournament history:

python -m march_madness_bot.cli team-history "Duke" --gender men

Fill a 2026 men's bracket using the two strongest single stats (FGM and DR):

python -m march_madness_bot.cli pick-bracket --season 2026 --fgm-weight 1.0 --dr-weight 1.0

Show a bracket-style visual view instead of a flat table:

python -m march_madness_bot.cli pick-bracket --season 2026 --view bracket

pick-bracket output is intentionally human-focused and includes only:

  • Round
  • Slot
  • StrongTeam
  • WeakTeam
  • WinnerTeam
  • WinnerConfidencePct

Confidence interpretation:

  • WinnerConfidencePct is derived from the winner's StrengthGap using a logistic transform, so larger gaps produce higher confidence.
  • WinnerConfidencePct is rounded to 3 decimal places for easier manual entry.

Optionally export picks to CSV:

python -m march_madness_bot.cli pick-bracket --season 2026 --output-csv data/processed/men_2026_fgm_dr_bracket.csv

--output-csv always exports the flat table columns for easy import: Round, Slot, StrongTeam, WeakTeam, WinnerTeam, WinnerConfidencePct.

Methodology for the example insight

The top-stat command:

  1. loads detailed tournament box scores
  2. converts each game into two rows: one winner perspective and one loser perspective
  3. computes raw stat differentials such as diff_Ast, diff_TO, and diff_OR
  4. scores each raw stat differential against win/loss labels with ROC AUC
  5. ranks stats by that separation strength and a simple zero-threshold accuracy check
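Steps 2 and 3 can be sketched with pandas on one toy game. The W/L column shape (WTeamID, LTeamID, WAst, LAst, ...) matches the Kaggle detailed-results files; the numbers are invented:

```python
import pandas as pd

# One toy game in Kaggle detailed-results shape (numbers invented).
games = pd.DataFrame({
    "WTeamID": [1101], "LTeamID": [1102],
    "WAst": [15], "LAst": [9],
    "WTO": [10], "LTO": [14],
})

# Step 2: two rows per game (winner and loser perspectives);
# step 3: raw stat differentials from each perspective.
winners = pd.DataFrame({
    "TeamID": games["WTeamID"], "win": 1,
    "diff_Ast": games["WAst"] - games["LAst"],
    "diff_TO": games["WTO"] - games["LTO"],
})
losers = pd.DataFrame({
    "TeamID": games["LTeamID"], "win": 0,
    "diff_Ast": games["LAst"] - games["WAst"],
    "diff_TO": games["LTO"] - games["WTO"],
})
team_games = pd.concat([winners, losers], ignore_index=True)
#    TeamID  win  diff_Ast  diff_TO
# 0    1101    1         6       -4
# 1    1102    0        -6        4
```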

The pick-bracket command:

  1. loads all 2026 men's regular-season games from MRegularSeasonDetailedResults.csv and computes each team's season average FGM and season average DR across all regular-season games
  2. standardizes both stats across all teams using z-scores (a team one standard deviation above average in FGM gets a score of +1.0, etc.), then adds the two z-scores together into a single strength number per team:
    strength = (fgm_weight × FGM_z-score) + (dr_weight × DR_z-score)
    
    With default weights of 1.0 / 1.0, both stats contribute equally.
  3. resolves each tournament slot from MNCAATourneySeeds.csv + MNCAATourneySlots.csv, including play-in games first
  4. in each matchup, the team with the higher strength wins; no seed, historical tournament record, or other stat is used. Seed and team ID are consulted only as tiebreakers when two teams have exactly the same strength score
  5. computes winner confidence from StrengthGap = WinnerStrength - LoserStrength:
    confidence = 1 / (1 + exp(-1.25 * StrengthGap))
    
    and reports the percentage as WinnerConfidencePct (rounded to 3 decimals)
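The strength and confidence computations above can be sketched directly. The season-average numbers below are invented; the 1.25 logistic scale and the equal default weights come from the formulas in steps 2 and 5:

```python
import numpy as np
import pandas as pd

# Invented season averages for four teams.
teams = pd.DataFrame({
    "TeamID": [1101, 1102, 1103, 1104],
    "FGM": [28.0, 26.0, 30.0, 24.0],
    "DR": [25.0, 27.0, 24.0, 26.0],
})

# Step 2: z-score each stat across all teams, then combine with the weights.
fgm_z = (teams["FGM"] - teams["FGM"].mean()) / teams["FGM"].std()
dr_z = (teams["DR"] - teams["DR"].mean()) / teams["DR"].std()
teams["strength"] = 1.0 * fgm_z + 1.0 * dr_z  # fgm_weight = dr_weight = 1.0

# Step 5: winner confidence from the strength gap, as a percentage.
def winner_confidence_pct(winner_strength: float, loser_strength: float) -> float:
    gap = winner_strength - loser_strength
    return round(100.0 / (1.0 + np.exp(-1.25 * gap)), 3)
```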

Important caveats:

  • this is descriptive, not causal
  • it uses postgame box scores, so it tells you what most separates winners from losers historically
  • it intentionally excludes rankings / seeds from the stat ranking itself
  • it also excludes obvious outcome leakage like final score from the candidate stat list

Extending the scaffold

The easiest next additions are:

  • seed-upset analysis by round
  • conference performance in the tournament over time
  • regular-season-to-tournament feature pipelines
  • team profile comparisons and matchup previews

If you add a new historical question, prefer implementing it in src/march_madness_bot/questions.py first and then exposing it through src/march_madness_bot/cli.py.
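As a sketch of that pattern for the first suggested addition (seed-upset analysis by round), with hypothetical column names Round, WSeedNum, and LSeedNum that are not the repo's actual schema:

```python
import pandas as pd

def upset_rate_by_round(games: pd.DataFrame) -> pd.Series:
    """Share of games per round won by the numerically worse seed.

    Would live in src/march_madness_bot/questions.py and be exposed
    through cli.py; column names here are illustrative only.
    """
    upset = games["WSeedNum"] > games["LSeedNum"]
    return upset.groupby(games["Round"]).mean()

games = pd.DataFrame({
    "Round": [1, 1, 2, 2],
    "WSeedNum": [1, 12, 4, 11],
    "LSeedNum": [16, 5, 5, 3],
})
print(upset_rate_by_round(games))  # round 1 -> 0.5, round 2 -> 0.5
```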