Last year, I published the first iteration of my March Madness regression model. I had predicted a good-but-not-great performance. Somehow, I ended up with a bracket in the 98.3rd percentile.
This was a brilliant result. Of the six “upsets” I had predicted, three came to pass, which I’m going to call a solid performance. But the crowning achievement of the bracket was picking Baylor over Gonzaga in the final.
I will say that the regression was dangerously close to picking Illinois over Michigan in the final, which would’ve made the bracket way less impressive, somewhere around the 75th percentile. That’s still good, sure, but it’s not going to win you any pools.
Updating the Model
Considering the model’s elite performance last year, I decided not to overhaul it. I figured the best thing I could do was incorporate last year’s tournament results into the training data and run it back.
I was aware that linear regression will exploit every correlation, no matter how spurious, so I expected the extra observations (and increased degrees of freedom) to bring the R-squared down a touch and make some past predictions look less impressive. That comes with the territory of reducing overfitting.
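For the curious, the refit itself is plain ordinary least squares. Here is a minimal NumPy sketch; the function name, array layout, and column handling are illustrative (my actual pipeline differs, and the p-values in the table below come from full regression output rather than this bare solver):

```python
import numpy as np

def fit_wins_model(X, wins):
    """Ordinary least squares: tournament wins ~ intercept + team stats.

    X    : (n_teams, n_features) array of regressors (SOS, FGP, ...),
           with the seed already log-transformed (the SeedLog column).
    wins : (n_teams,) array of actual tournament wins.
    """
    Xc = np.column_stack([np.ones(len(X)), X])   # intercept column first
    coef, *_ = np.linalg.lstsq(Xc, wins, rcond=None)
    return coef  # coef[0] corresponds to the Intercept row of the table
```

Adding last year’s tournament to the training set just means appending rows to `X` and `wins` before refitting.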
In this table, we can see how the coefficients in the model changed from last year to this year:
Variable | 2021 Coef. | 2022 Coef. | 2022 p-value |
---|---|---|---|
Intercept | -7.271 | -0.223 | 0.9777 |
SeedLog | -0.646 | -0.628 | 3.22E-11 |
SOS | 0.063 | 0.028 | 0.2645 |
RPI | -7.130 | -2.523 | 0.5179 |
W | 0.075 | 0.004 | 0.8727 |
L | 0.074 | 0.046 | 0.2163 |
Pts | -0.194 | -0.078 | 0.4541 |
OpPts | 0.109 | 0.074 | 0.3725 |
FGP | 36.190 | 24.440 | 0.0554 |
OpFGP | -11.913 | -15.140 | 0.1086 |
X3P | 0.316 | 0.212 | 0.0973 |
X3PP | -4.370 | -3.173 | 0.312 |
Op3P | -0.215 | -0.190 | 0.0955 |
Op3PP | 2.355 | 1.545 | 0.6579 |
FT | 0.069 | 0.025 | 0.6488 |
FTP | 2.845 | 1.163 | 0.583 |
OReb | 0.258 | 0.176 | 0.0943 |
DReb | 0.139 | 0.012 | 0.9016 |
OpReb | -0.019 | -0.029 | 0.6847 |
Ast | -0.072 | -0.074 | 0.061 |
TO | -0.262 | -0.181 | 0.115 |
OpTO | 0.307 | 0.246 | 0.0517 |
Blk | 0.016 | 0.001 | 0.9904 |
Stl | -0.047 | -0.062 | 0.479 |
PF | -0.083 | -0.114 | 0.0258 |
The majority of the coefficients shrink in magnitude, which means that changes in those variables now have a less pronounced effect on the predicted number of tournament wins. The few exceptions are Opponent Field Goal Percentage, Opponent Rebounds, Assists, Steals, and Personal Fouls.
Past Tournament Observations
Summed across the field, last year’s model predicted only 37 wins of the 63 available (one per game). This may have been partly due to a weak field, but more likely the previous year’s model rewarded teams for their number of games played, and in 2021 many teams had shortened schedules due to COVID protocols.
The model adjusted this year, giving much less weight to number of games played. Along with other rebalancing, this means the model now sees 2021 Baylor as the strongest team in the sample, predicting 3.66 wins where last year’s model predicted 2.98.
Next, let’s take a look at some outliers. This plot shows how teams fared in the tournament compared to how they were expected to fare:
Overachievers
These are the teams that exceeded their predicted wins the most.
Year | Team | Pred. Wins | Tourn. Wins |
---|---|---|---|
2014 | UConn | 0.99 | 6 |
2014 | Kentucky | 1.08 | 5 |
2016 | Villanova | 2.14 | 6 |
2021 | UCLA | 0.37 | 4 |
2018 | Loyola Chicago | 0.69 | 4 |
2016 | Syracuse | 0.74 | 4 |
2017 | North Carolina | 2.95 | 6 |
2015 | Michigan St. | 0.97 | 4 |
2021 | Oregon St. | 0.00 | 3 |
2017 | South Carolina | 1.07 | 4 |
2019 | Virginia | 3.13 | 6 |
2017 | Xavier | 0.16 | 3 |
2018 | Michigan | 2.25 | 5 |
2015 | Duke | 3.36 | 6 |
2019 | Texas Tech | 2.39 | 5 |
2018 | Villanova | 3.41 | 6 |
2014 | Dayton | 0.42 | 3 |
2021 | Baylor | 3.67 | 6 |
As you can see, every tournament champion makes this list. I’m going to make the claim that the average champion isn’t that much better than the average Elite Eight team, yet it ends up going twice as far in the tournament. This means the eventual winner is almost always a significant outlier.
We also see that the 2014 Final was even nuttier than previously believed. The model sees UConn and Kentucky as perfectly ordinary 7 and 8 seeds, but the runs they each went on were historic.
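The ranking behind these lists is just actual wins minus predicted wins, sorted. A quick sketch of that sort, using a hypothetical tuple layout mirroring the table columns:

```python
def biggest_outliers(teams, n=3):
    """Rank teams by actual minus predicted tournament wins.

    teams : list of (year, name, pred_wins, actual_wins) tuples,
            mirroring the columns of the tables here.
    Returns the top-n overachievers and top-n underachievers.
    """
    resid = sorted(((actual - pred, year, name)
                    for year, name, pred, actual in teams), reverse=True)
    return resid[:n], list(reversed(resid[-n:]))
```

Run over the full sample, the positive end of this sort produces the table above and the negative end produces the one below.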
Underachievers
These teams had the most disappointing performances in the Big Dance.
Year | Team | Pred. Wins | Tourn. Wins |
---|---|---|---|
2018 | Virginia | 3.05 | 0 |
2017 | Villanova | 3.28 | 1 |
2015 | Iowa St. | 2.01 | 0 |
2015 | Villanova | 2.99 | 1 |
2016 | West Virginia | 1.95 | 0 |
2014 | Duke | 1.93 | 0 |
2021 | Ohio St. | 1.84 | 0 |
2016 | Michigan St. | 1.81 | 0 |
2018 | Wichita St. | 1.78 | 0 |
2021 | Illinois | 2.68 | 1 |
2021 | Tennessee | 1.66 | 0 |
2018 | Cincinnati | 2.64 | 1 |
It’s no surprise that Virginia’s first-round loss to UMBC ranks as the most disastrous. 2018 and 2021 each had three heavyweights go down way earlier than expected.
One thing to note is that there aren’t many repeats on either list. It seems like a windfall or catastrophe of this magnitude is a very rare thing for a team. Each list only has one repeat member, but unbelievably, this member is the same for both…
Villanova
They were a 1 or 2 seed for four straight years. Twice they won the championship, twice they didn’t escape the first weekend. A stretch that volatile has to be unprecedented, and there’s nothing close to it in our sample.
Predicting This Year
Applying our model to the field this year, we can see the number of wins the model predicts. To pick our bracket, we simply choose the team with more predicted wins. This table is what the model gives us:
Seed | Team | Pred. Wins |
---|---|---|
1 | Baylor | 3.26 |
1 | Gonzaga | 3.20 |
1 | Kansas | 2.99 |
1 | Arizona | 2.90 |
2 | Kentucky | 2.64 |
2 | Duke | 2.62 |
3 | Texas Tech | 2.48 |
2 | Auburn | 2.40 |
2 | Villanova | 2.28 |
5 | Iowa | 2.05 |
5 | Houston | 2.02 |
3 | Tennessee | 1.79 |
4 | UCLA | 1.67 |
4 | Illinois | 1.66 |
3 | Purdue | 1.64 |
5 | UConn | 1.49 |
6 | Texas | 1.47 |
4 | Arkansas | 1.47 |
6 | LSU | 1.42 |
7 | Murray St. | 1.29 |
6 | Alabama | 1.18 |
12 | UAB | 1.18 |
7 | Ohio St. | 1.15 |
8 | Seton Hall | 1.15 |
5 | Saint Mary’s (CA) | 1.08 |
11 | Virginia Tech | 0.98 |
3 | Wisconsin | 0.97 |
8 | Boise St. | 0.84 |
10 | Miami (FL) | 0.79 |
6 | Colorado St. | 0.77 |
9 | Creighton | 0.75 |
9 | Marquette | 0.75 |
4 | Providence | 0.69 |
10 | San Francisco | 0.69 |
7 | Southern California | 0.69 |
9 | TCU | 0.62 |
11 | Iowa St. | 0.58 |
13 | Chattanooga | 0.53 |
13 | Akron | 0.52 |
15 | Delaware | 0.51 |
8 | San Diego St. | 0.51 |
12 | Indiana | 0.51 |
10 | Davidson | 0.50 |
10 | Loyola Chicago | 0.49 |
13 | Vermont | 0.43 |
9 | Memphis | 0.42 |
16 | Wright St. | 0.40 |
11 | Michigan | 0.39 |
11 | Notre Dame | 0.38 |
15 | Jacksonville St. | 0.35 |
16 | Georgia St. | 0.35 |
14 | Colgate | 0.34 |
8 | North Carolina | 0.34 |
12 | Richmond | 0.32 |
7 | Michigan St. | 0.31 |
12 | New Mexico St. | 0.20 |
13 | South Dakota St. | 0.14 |
14 | Montana St. | 0.09 |
14 | Longwood | 0.02 |
16 | Norfolk St. | 0.00 |
16 | Texas Southern | -0.12 |
15 | Cal St. Fullerton | -0.19 |
15 | Saint Peter’s | -0.22 |
14 | Yale | -0.57 |
This result is a little less chalky than last year’s. It’s still really chalky, just not as bad. Because we take the log of the seed, and seed is our most important variable, the top seeds get a huge boost, especially the 1 seeds.
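To see how much the seed term alone does, here is a quick back-of-the-envelope check using the 2022 SeedLog coefficient from the table earlier (the seed term in isolation, not the full prediction):

```python
import math

SEEDLOG_COEF = -0.628  # 2022 SeedLog coefficient from the table above

def seed_contribution(seed: int) -> float:
    """Predicted wins contributed by the seed term alone."""
    return SEEDLOG_COEF * math.log(seed)

# log(1) = 0, so 1 seeds give up nothing to this term, while a
# 16 seed loses about 1.74 predicted wins to it. The log also means
# the penalty grows quickly at the top (1 vs 4) and flattens out
# near the bottom (13 vs 16).
```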
We can again summarize the bracket with the upsets and the final.
9 Creighton over 8 San Diego State
9 Marquette over 8 North Carolina
10 Davidson over 7 Michigan St.
10 Miami (FL) over 7 Southern California
5 Houston over 4 Illinois
5 Iowa over 4 Providence
5 UConn over 4 Arkansas
6 LSU over 3 Wisconsin
1 Baylor over 1 Kansas in the championship game
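The picking rule behind all of this is as simple as it sounds: in every matchup, advance the team with more predicted wins. A minimal sketch, with a few values pulled from the table above:

```python
def pick_winner(pred_wins, team_a, team_b):
    """Advance whichever team the model expects to win more games."""
    return team_a if pred_wins[team_a] >= pred_wins[team_b] else team_b

# A few entries from the prediction table
pred_wins = {"Baylor": 3.26, "Kansas": 2.99,
             "Creighton": 0.75, "San Diego St.": 0.51}

pick_winner(pred_wins, "Creighton", "San Diego St.")  # the 9-over-8 pick
```

An “upset” in the list above is just any matchup where this rule advances the worse seed.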
This year, eight upsets are predicted instead of last year’s six, which I’ll call an improvement. Compared to last year, the model still really likes Baylor and Creighton and it really does not like North Carolina and Arkansas. It flipped on Wisconsin though.
Michigan St. and North Carolina are the most popular 7 and 8 seed respectively to make a deep run this year, but my model sees them as the weakest of their seeds. Maybe they’re slotted where they are due to name recognition and my model sees through it, or maybe there are intangibles at play my model is blind to.
Its selection of Baylor to repeat as champions is also interesting, as they are the least popular 1 seed. These value plays could help separate my picks from the pack in what is otherwise a very boring bracket.
It didn’t have to be this boring, however. Had the teams been placed in different regions, the model would have had 11 seed Virginia Tech make a run to the Sweet 16 over Colorado St. and Wisconsin, and 12 seed UAB do the same over Saint Mary’s (CA) and Providence. These are significant claims for a model so reliant on seeding, and I wish we would have seen them. Instead, each is matched up with a strong first-round opponent that the model prefers, so no big upset picks. Fooey.
Last year I was very tempered in my expectations for the bracket, but I was pleasantly surprised. This year, my confidence has grown, so it’s only fair that I be brought back down to Earth. But whatever happens, I will be sure to blame it on the math.