College Football Model
Project Info
Motivation
I have always loved sports (I played Division I Power 5 sports and was a team captain for three years), and I have always loved numbers, math, and computer models. I wanted to build my own college football model and see how well I could make it perform. It doesn't have an especially practical use case, but it was something I was passionate about, and I wanted to do it well.
Project Overview
- Engine: An advanced analytics and simulation platform for NCAA football. Ingests play-by-play and game data from external APIs, stores and processes this data in a local SQLite database, and generates team rankings, predictive analytics, and full-season simulations. Features include automated data ingestion, custom metrics, statistical modeling (logistic regression, graph-aware weighting, decaying priors), machine learning (Random Forests), and a Monte Carlo simulation engine for games and seasons.
- Backend (API): A FastAPI backend providing a RESTful API for college football games, polls, and calendar data. Supports API key authentication, CORS, and interactive API docs.
- Frontend (Webpage): A Next.js app deployed on Vercel, providing interactive pages for games, teams, and rankings, with a modern UI and Tailwind CSS.
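To make the Monte Carlo idea concrete, here is a minimal sketch of simulating one team's season from per-game win probabilities. It is an illustration only: the function name `simulate_season`, the probabilities in `schedule`, and the simulation count are all assumptions, and it is written in Python for brevity rather than mirroring the project's actual engine.

```python
import random


def simulate_season(win_probs, n_sims=10_000, seed=0):
    """Return the simulated distribution of season win totals.

    win_probs: per-game win probabilities for one team (hypothetical inputs).
    The result is a list where index k holds the share of simulations
    in which the team won exactly k games.
    """
    rng = random.Random(seed)
    counts = [0] * (len(win_probs) + 1)
    for _ in range(n_sims):
        # Each game is an independent Bernoulli draw in this simplified sketch.
        wins = sum(rng.random() < p for p in win_probs)
        counts[wins] += 1
    return [c / n_sims for c in counts]


# Made-up win probabilities for a six-game schedule.
schedule = [0.85, 0.60, 0.45, 0.70, 0.30, 0.55]
dist = simulate_season(schedule)
expected_wins = sum(k * share for k, share in enumerate(dist))

for wins, share in enumerate(dist):
    print(f"{wins} wins: {share:.1%}")
print(f"Expected wins: {expected_wins:.2f}")
```

Treating games as independent keeps the sketch short; a fuller simulation would likely draw scores or margins from the adjusted ratings instead of fixed probabilities.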
Design Choices
- TypeScript & React: Chosen for ease of use and type safety, making the frontend more robust and enjoyable to develop.
- FastAPI: Selected due to prior experience and comfort, allowing for rapid backend development.
- SQLite: Used for its simplicity, ease of modification during weekly engine updates, and straightforward debugging. The backend can easily query the database instance.
- Python for Analysis: Python offers excellent machine learning libraries and is ideal for rapid prototyping and experimentation. Its syntax is readable and enjoyable, making it well-suited for analytics-heavy work.
- Java for Simulation Engine: Java was used to optimize the runtime of the Monte Carlo simulation engine, enabling efficient large-scale simulations.
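As a rough illustration of why SQLite is convenient here, the sketch below builds an in-memory database and runs the kind of query a backend endpoint might issue. The `games` table, its columns, the sample rows, and the `games_for_week` helper are hypothetical and are not taken from the project's real schema.

```python
import sqlite3

# In-memory database so the example is self-contained; the real project
# would point at a database file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE games (
        id INTEGER PRIMARY KEY,
        season INTEGER NOT NULL,
        week INTEGER NOT NULL,
        home_team TEXT NOT NULL,
        away_team TEXT NOT NULL,
        home_points INTEGER,
        away_points INTEGER
    )
    """
)

# Placeholder rows, not real results.
rows = [
    (1, 2024, 1, "Team A", "Team B", 31, 24),
    (2, 2024, 1, "Team C", "Team D", 17, 20),
    (3, 2024, 2, "Team B", "Team C", 28, 14),
]
conn.executemany("INSERT INTO games VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()


def games_for_week(conn, season, week):
    """Return (home, away, home_points, away_points) tuples for one week."""
    cur = conn.execute(
        "SELECT home_team, away_team, home_points, away_points "
        "FROM games WHERE season = ? AND week = ? ORDER BY id",
        (season, week),
    )
    return cur.fetchall()


week1 = games_for_week(conn, 2024, 1)
print(week1)
```

Parameterized queries (the `?` placeholders) keep the lookup safe if the season and week ever come from request parameters.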
Technical Challenges & Adjusted Ratings
The most challenging and time-consuming part of this project was developing a self-contained, minimally hard-coded model to generate adjusted ratings for various football metrics, which serve as the backbone of the model and simulation. My initial approach was to iteratively adjust each play under each metric based on what the opponent typically allowed. This approach struggled to adjust properly because some conferences play mostly among themselves, leaving few games that connect one group of teams to another. Hardcoding adjustments was not feasible either, given independents and complex scheduling. This problem is largely specific to college football and calls for a flexible, adaptive modeling approach.
After experimenting with several variants and facing phases of burnout, I ultimately settled on a solution using regularized logistic regression models at the play level, with graph-aware game weighting (using networkx edge betweenness centrality) to properly weight games that provide valuable information between otherwise disconnected groups of teams. This approach, combined with decaying priors and identifiability constraints, allowed for robust, interpretable, and well-calibrated adjusted ratings.
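The graph-aware weighting idea can be sketched with a toy schedule. In the example below, teams are nodes and games are edges, and `networkx.edge_betweenness_centrality` scores each game by how many shortest team-to-team paths run through it. The team names, the schedule, and the `game_weight` formula (a baseline of 1.0 plus a boost scaled by relative centrality) are assumptions for illustration; the project's actual weighting scheme may differ.

```python
import networkx as nx

# Hypothetical schedule: two three-team conferences that each play a
# round-robin, plus a single cross-conference game (A1 vs B1).
games = [
    ("A1", "A2"), ("A1", "A3"), ("A2", "A3"),
    ("B1", "B2"), ("B1", "B3"), ("B2", "B3"),
    ("A1", "B1"),
]

G = nx.Graph()
G.add_edges_from(games)

# Share of all shortest team-to-team paths that pass through each game.
raw = nx.edge_betweenness_centrality(G, normalized=True)
# Key by frozenset so lookups do not depend on edge orientation.
centrality = {frozenset(edge): value for edge, value in raw.items()}


def game_weight(home, away, boost=1.0):
    """Illustrative weight: 1.0 baseline plus a boost for bridging games."""
    top = max(centrality.values())
    return 1.0 + boost * centrality[frozenset((home, away))] / top


for home, away in games:
    print(f"{home} vs {away}: weight {game_weight(home, away):.2f}")
```

In this toy graph the lone cross-conference game receives the largest weight, since every comparison between the two conferences has to pass through it. Such weights could then be supplied as per-observation sample weights when fitting the play-level regression.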
Model Performance
The model performs close to Vegas and top-tier models. The results below cover win probability calibration, cover probability calibration, and accuracy against the spread (ATS), which exceeded the 50% coin-flip threshold in both seasons.
Win Probability Calibration
Win Probability Bracket | Accuracy |
---|---|
50-60% | 53.7% (246/458) |
60-70% | 63.2% (252/399) |
70-80% | 75.5% (262/347) |
80-90% | 79.8% (174/218) |
90-100% | 93.3% (153/164) |
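One way to read this table is to compare each bracket's observed accuracy with the probability the model claimed. The sketch below does that using the counts from the table. The bracket midpoint is only a stand-in for the model's average predicted probability in each bracket, which is not reported here, and the standard error uses a simple binomial approximation; both are assumptions.

```python
from math import sqrt

# (bracket low, bracket high, correct picks, total picks) from the table.
brackets = [
    (0.50, 0.60, 246, 458),
    (0.60, 0.70, 252, 399),
    (0.70, 0.80, 262, 347),
    (0.80, 0.90, 174, 218),
    (0.90, 1.00, 153, 164),
]


def calibration_rows(brackets):
    """Return (low, high, observed, gap vs midpoint, standard error) per bracket."""
    rows = []
    for low, high, correct, total in brackets:
        observed = correct / total
        # Assumption: the midpoint approximates the mean predicted probability.
        midpoint = (low + high) / 2
        std_err = sqrt(observed * (1 - observed) / total)
        rows.append((low, high, observed, observed - midpoint, std_err))
    return rows


for low, high, observed, gap, std_err in calibration_rows(brackets):
    print(
        f"{low:.0%}-{high:.0%}: observed {observed:.1%}, "
        f"gap vs midpoint {gap:+.1%}, std. error {std_err:.1%}"
    )
```

A gap that is small relative to the standard error suggests the bracket is reasonably calibrated, while a larger gap points to over- or under-confidence in that range.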
Cover Probability Calibration
Cover Probability Bracket | Accuracy |
---|---|
50-60% | 52.8% (316/598) |
60-75% | 49.1% (323/658) |
75-100% | 53.6% (177/330) |
ATS & Winner Accuracy
- 2023: 51.5% ATS accuracy (All Games)
- 2024: 53.2% ATS accuracy (All Games)
- Average ATS accuracy (All Games): 52.4%
- Average ATS accuracy (Power 5): 54.2%
- Average winner accuracy (All Games): 68.5% (Vegas: 71.2%)
- Average winner accuracy (Power 5): 65.9% (Vegas: 67.5%)
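For context on the ATS figures, the sketch below computes the win rate needed to break even at a given price and compares it with the rates listed above. The -110 price is an assumption (a common price for spread bets, not something stated in this project), and `break_even_rate` is a hypothetical helper. The listed rates are rounded, so comparisons near the threshold are approximate.

```python
def break_even_rate(american_odds: int) -> float:
    """Win rate needed to break even at the given American odds."""
    if american_odds < 0:
        # Negative odds: risk |odds| to win 100.
        risk, profit = -american_odds, 100
    else:
        # Positive odds: risk 100 to win the quoted amount.
        risk, profit = 100, american_odds
    return risk / (risk + profit)


# Assumption: standard -110 pricing on both sides of the spread.
threshold = break_even_rate(-110)

# ATS accuracy figures as listed above (rounded to one decimal place).
reported = {
    "2023 (All Games)": 0.515,
    "2024 (All Games)": 0.532,
    "Average (All Games)": 0.524,
    "Average (Power 5)": 0.542,
}

print(f"Break-even at -110: {threshold:.2%}")
for label, rate in reported.items():
    print(f"{label}: {rate:.1%} ({rate - threshold:+.2%} vs break-even)")
```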