Solving March Madness with Regression Analysis

I am not a good guesser. And yet, every year, I end up throwing my hat in the ring for March Madness Bracket Challenges, the world’s premier guessing simulator.

All you have to do is pick the winner of 63 games.

If you did this completely at random, a 50/50 shot on each game, you’d expect to correctly guess 16 of the 32 first round games. In the second round, 8 of your 16 picks wouldn’t have even made it that far, and you’d get half of your surviving picks wrong, so 4 correct. Of the 4 teams you correctly put in the Sweet Sixteen, you only picked 2 to win their next game, and you’ll probably get 1 of those wrong, leaving 1 correct. After that point, you’re really not likely to get any right.

Picking fully randomly, you’d expect to get about 21.3 of the 63 games right.
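
To sanity check that figure: a round-r pick is only correct if the team you chose wins all r of its games, which happens with probability (1/2)^r when every game is a coin flip. Here’s my own back-of-the-envelope sketch in Python (the post’s actual tooling isn’t shown):

```python
# Expected number of correct picks in a 64-team bracket when every game
# is treated the same way: a round-r pick is right only if the chosen
# team wins all r of its games, i.e. with probability p**r.
def expected_correct(p):
    games_per_round = [32, 16, 8, 4, 2, 1]
    return sum(games * p ** (r + 1) for r, games in enumerate(games_per_round))

print(round(expected_correct(0.5), 1))  # ~21.3 correct picks out of 63
```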

Not great.

But the odds are not random. There are teams that are better than their opponents and have a better than 50% chance to advance. So easy, pick all the favorites, right?

Well, let’s look at an example. About 75% of the time, #4 seed teams beat #13 seeds. Picking the #4 seed every time should get you the right answer 75% of the time. But odds are, one of the four #13 seeds is gonna win one, and knowing which one it’s going to be gives you a huge leg up on the competition.

If you hit 75% of the time for a full bracket, you’d expect 38.3 correct picks. Not bad, but picking the favorites every game is boring (and that 75% figure is pretty unrealistic).
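
Plugging a 75% per-game hit rate into the same sketch reproduces that number:

```python
print(round(expected_correct(0.75), 1))  # ~38.3 correct picks
```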

So I want to be able to pick the upsets that are actually going to happen. But I can’t guess for my life.

I need backup.

Data Time

Statistics

To figure out who was going to win this year’s tournament, the best place to start seemed to be statistics from previous years’ teams. But the NCAA handles statistics in a really unhelpful way: it includes tournament games when calculating year-end stats. I know none of the major American professional sports leagues do things this way. You know why? Because it would be chaos, that’s why. But for some reason the NCAA thinks this is acceptable.

This made my task of tracking down statistics much more difficult. I would’ve loved to have taken a huge data export from Sports Reference, but if I want to find out why the 28-4 team won the tournament, being told they were a 34-4 team is insanely misleading. I had to go find archived statistics from the week brackets were released each year. Miraculously, the NCAA provides this. This was critical for figuring out how many points, rebounds, assists, etc. teams had tallied by the time brackets were announced. They had data on offensive and defensive rebounds split back to 2014, so that’s how far I went. I didn’t include any 2020 data since there was no tournament to test things out.

The exports were still in severe need of tidying, so it took a lot of work to get the data into a format that could actually be manipulated.

There was one significant flaw in this approach: no control for opponent strength. When I looked at a team that held opponents to 50 points per game, it was unclear whether they were a defensive powerhouse or whether their opponents just averaged 50 points a game.

I needed more.

Ratings

There are dozens of ratings and rankings that come out and get thrown around come bracketology season, each with their own initialism. NET, KPI, SOR, BPI, POM, SAG… the list goes on.

I was particularly interested in strength of schedule (SOS), but again, end-of-year ratings included tournament data. This makes even less sense to me.

I could have tracked down the team sheets that the actual bracketeers used to seed the teams, but I could only find PDFs, and I was not putting myself through that.

Finally, I stumbled upon TeamRankings, which had archival SOS data, including both rankings and ratings. Beautiful.

They also had RPI data, which combines a team’s record, their strength of schedule, and their opponents’ strength of schedule.

Unfortunately, their school names did not match up with the stats export I got from the NCAA (UConn vs. Connecticut, USC vs. Southern California, like half the schools were different), so I needed to match everything up manually. Also, I couldn’t export the data, so I had to manually key the info for each team each year.

But eventually, I had all the data together. I threw in which seed a team was assigned for good measure and started the actual analysis.

Regresh Sesh

I developed a fondness for linear regression in an econometrics class I took some years back, and despite its limitations, I find it an intuitive way to examine how multiple variables affect an outcome. If an explanatory variable rises by one, the response variable moves by approximately the coefficient, holding everything else constant. I can follow that.

For the outcome, I initially used the number of points accumulated in ESPN’s Tournament Challenge: 10 points for a first round win, 20 for a second round, onward to 40, 80, 160, and 320, so a championship team would’ve picked up 630 points. However, I discovered that the predictive power of the model improved dramatically when I simply used the number of tournament games won instead. Simple it is then.

Tweaking the model further, I discovered that taking the log (base 2) of the seed yielded much better results. This makes sense, since there’s a larger performance gap between a #1 and a #4 than there is between #11 and #14. Complicated it is then.
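
For anyone who wants to follow along, here is roughly what that setup looks like in Python. This is a minimal sketch under my own assumptions (the post doesn’t show its code); the file name and column names like seed and tourney_wins are placeholders for whatever the cleaned dataset actually uses.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical cleaned dataset: one row per tournament team (2014-2019),
# with per-game stats, ratings, seed, and tournament games won.
df = pd.read_csv("tournament_teams.csv")

# Log (base 2) of the seed reflects that the gap between a #1 and a #4
# is bigger than the gap between a #11 and a #14.
df["log2_seed"] = np.log2(df["seed"])

features = ["log2_seed", "fg_pct", "threes_made", "off_rebounds",
            "opp_turnovers", "turnovers", "sos"]  # ...plus the rest
X = sm.add_constant(df[features])
y = df["tourney_wins"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, p-values, R-squared
```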

I started playing around to see if there were certain variables that should be included or excluded, but what I discovered was that the seed a team was assigned was doing almost all of the heavy lifting. Part of this can be attributed to better teams getting better seeds, but also better seeded teams get to face lower seeded teams. To make the Sweet Sixteen, a #1 seed has to beat a #16 and a #8 while a #12 has to beat a #5 and a #4.

I cut variables that were redundant with included ones since I was hitting some serious multicollinearity and overfitting problems. Total rebounds was cut since offensive and defensive rebounds were kept, scoring margin was cut since points and points against were kept, and so on. I recognize that the combined variable may be more predictive than either of the underlying ones individually, but if both of the underlying ones are included, we don’t need the combined one: it’s just a linear combination of the two, so the regression can recover it on its own.
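
One quick way to confirm that a derived stat is redundant with its components is a variance inflation factor check. Here’s a sketch, again using the placeholder column names from the earlier snippet; an exact combination like total rebounds produces an effectively infinite VIF.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Total rebounds is (nearly) an exact sum of offensive and defensive
# rebounds, so its variance inflation factor blows up.
check_cols = ["off_rebounds", "def_rebounds", "total_rebounds"]
X_check = sm.add_constant(df[check_cols])
for i, col in enumerate(check_cols, start=1):  # index 0 is the constant
    print(col, variance_inflation_factor(X_check.values, i))
```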

Stuff to keep in mind

Some of the predictive variables have a coefficient whose sign is the opposite of the trend shown in their plot. Since seeding carries so much weight in the model, a flipped sign suggests that this is a factor the bracket seeders may be subconsciously overvaluing: given their seed, teams that are strong in that stat win fewer games than expected. Conversely, when a variable keeps its expected sign, it implies that the bracket makers may be undervaluing it. In a way, this entire analysis has become a look at which variables the bracket seeders are over- and undervaluing.

(Is every upset a failure on the bracketeers’ part? Should they be considering who might be the best team at that point in time or who has had the best season so far? When the 2017 College Football Playoff teams were announced, a lot of stink was made over the fact that Alabama was one of the four teams despite not even making it to their conference championship game. They went on to win the national title. Were the powers that be right to include them because they won, or wrong to include them because there were more deserving teams? I don’t know, but it’s an interesting thought.)

Of course our goal is to exploit those discrepancies, but it’s an interesting vantage point I hadn’t considered going into this.

I added a little jitter to the points on the plots so they wouldn’t all overlap each other. Please don’t read too much into a point’s exact vertical position, or its horizontal position on discrete predictive variables.
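
For reference, the jitter is just a touch of random noise added before plotting so that teams with identical values don’t stack on top of each other; something along these lines, sketched with the same placeholder data frame as before (not the post’s actual plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Small vertical jitter; the underlying win counts are still the integers 0-6.
jitter = rng.uniform(-0.15, 0.15, size=len(df))

plt.scatter(df["blocks"], df["tourney_wins"] + jitter, alpha=0.6)
plt.xlabel("Blocks per game")
plt.ylabel("Tournament wins (jittered)")
plt.show()
```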

Some of the variables are correlated with each other, which is a no-no, but I worked hard to get all these variables, so they’re all getting thrown in the pot. Here’s a correlation matrix heatmap for your perusal (ignore leading X’s):
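
If you want to reproduce something like that heatmap, a seaborn call over the candidate predictors gets you most of the way there (again a sketch with the placeholder column names, not the exact figure from the post):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between candidate predictors, drawn as a heatmap
# so overlapping variables are easy to spot.
corr = df[features].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```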

Without further ado, here are the variables included in the final model, from least to most important. All in-game stats are on a per game basis.

Oh, and our intercept is -7.2706 for those playing along at home.

Blocks

Coefficient: 0.0162

P-value: 0.8373

Looks like blocks are pretty much just for show. That’s about as random a graph as you could ask for.

Best: 2015 Texas (7.88)

Worst: 2014 Creighton (1.39)

Opponent Rebounds

Coefficient: -0.0191

P-value: 0.8005

This is one of the components of rebound margin, broken out individually. I’m not surprised it had very little bearing on the results.

Best: 2017 St. Mary’s (26.28)

Worst: 2018 Marshall (40.68)

Steals

Coefficient: -0.0473

P-value: 0.6260

This is the first variable with a flipped coefficient, so even though it’s a pretty insignificant variable, it does appear to be slightly overvalued. Blocks and steals have long been critiqued for not painting an accurate picture of defensive performance; this is just more evidence for the pile.

Best: 2014 VCU (11.18)

Worst: 2015 Texas (3.76)

Opponent Three-Point Percentage

Coefficient: 2.3553

P-value: 0.5244

Don’t let the high coefficient fool you; it’s only that large because the underlying statistic is already a decimal. Another flipped sign, but not much to see here.

Best: 2019 Virginia (27.2%)

Worst: 2015 Eastern Washington (38.5%)

Opponent Field Goal Percentage

Coefficient: -11.9131

P-value: 0.2380

This is the first one where if you squint you can kind of see what the trendline was going for. High outliers have few wins, and there are a few low outliers that made deep runs.

Best: 2015 Kentucky (35.5%)

Worst: 2014 Eastern Kentucky (48.2%)

Free Throws

Coefficient: 0.0694

P-value: 0.2360

So this plot is a lot less clear than the preceding one, but it’s a good time to remember that the coefficients and p-values come from the full model: free throws per game adds slightly more predictive power than opponent field goal percentage, even if its plot is kind of all over the place.

Best: 2014 BYU (20.79)

Worst: 2017 Virginia (9.78)

Losses

Coefficient: 0.0755

P-value: 0.2341

This might be the clearest flipped sign in the whole analysis. Considering that wins is another, stronger variable in the model, this could be saying that teams that play more games do better. Or perhaps the bracket makers give too much consideration to “bad losses.” Either way, I find this one fascinating.

Best: 2014 Wichita St. and 2015 Kentucky (0)

Worst: 2014 Cal Poly, 2016 Holy Cross, and 2018 Texas Southern (19)

Free Throw Percentage

Coefficient: 2.8449

P-value: 0.2327

Tournament winners seemed to shoot average or better from the line, which is expected, but teams that shot 65% or lower still made it out of the first round most of the time. There are no doubt confounding factors, like Hack-a-Shaq candidates, but it’s still an oddity.

Best: 2017 Notre Dame (79.9%)

Worst: 2019 Saint Louis (59.8%)

Opponent Points

Coefficient: 0.1092

P-value: 0.2325

Another clear flipped sign, similar to losses. Again, points is the stronger variable, so maybe it’s saying that a breakneck pace is beneficial (taking a peek at points, this is not the case). More likely, though, is that bracket makers value margin of victory too much when seeding the teams.

Best: 2015 Virginia (50.69)

Worst: 2018 Oklahoma (81.65)

Defensive Rebounds

Coefficient: 0.1385

P-value: 0.2001

This is another graph that seems all over the place. Having a lot of rebounds doesn’t necessarily mean a team is good at rebounding. Maybe they force a lot of bad shots and have a bigger pile of rebounds to grab from. Maybe they lock down the paint so teams have to shoot from outside, then clean up the boards by virtue of being nearby.

Best: 2017 Gonzaga (30.94)

Worst: 2014 Eastern Kentucky (19.09)

Three-Point Percentage

Coefficient: -4.3697

P-value: 0.1892

Sometimes teams that shoot lights out from deep are targeted as upset candidates since, if they get hot, they can take down anybody. Or so the thinking goes. It seems they do well, but not well enough to justify their seed.

Best: 2016 Michigan St. (43.4%)

Worst: 2014 NC State (29%)

RPI (Rating Percentage Index)

Coefficient: -7.1305

P-value: 0.1669

So I’ve made a lot of fuss about whether bracketeers are over- or undervaluing different variables, but honestly, they’re probably not looking at opponent three-point percentage. What they did have on their team sheets for years was RPI. And though it clearly has a positive trend among the best teams, the negative coefficient means that high-RPI teams were underperforming relative to their seed (especially since seed and RPI are so correlated). So much so that RPI was removed from the bracket makers’ team sheets. Brutal, but warranted.

Best: 2018 Virginia (0.679)

Worst: 2016 Holy Cross (0.451)

Wins

Coefficient: 0.0755

P-value: 0.1502

No surprise that this pretty much looks like the flip of the losses plot. I will note that it’s a tad odd that the wins and losses coefficients are so similar, which basically means it doesn’t matter whether a team wins or loses; it’s just in their interest to play a lot of games.

Best: 2014 Wichita St. and 2015 Kentucky (34)

Worst: 2014 Cal Poly (13)

Fouls

Coefficient: -0.0831

P-value: 0.1400

A lot of tight games are won and lost on whether a team can stay disciplined and stay out of foul trouble. This helps make that case. I would be interested to find out if long-tenured coaches have lower fouls and deeper tournament runs. Another time, perhaps.

Best: 2015 Wisconsin (12.03)

Worst: 2014 Manhattan (24.13)

Points

Coefficient: -0.1944

P-value: 0.0904

This is the first variable that could maybe be considered significant, and it has a seriously negative coefficient. It says that, all else equal, if you score about five fewer points per game, you’re expected to win an extra tournament game. Certainly striking, and combined with the opponent points variable, it leads one to believe that scoring margin may not be all it’s cracked up to be.

Best: 2017 UCLA (90.36)

Worst: 2015 Wyoming (61.68)

Assists

Coefficient: -0.0721

P-value: 0.0881

March Madness is often called a guards’ tournament since they’re the ones with the ball in their hands when their team needs a bucket in crunch time. You look at high assists and your gut may be that it’s evidence of strong guard play, but maybe what it means is that they need to get the ball to someone else to score. I don’t know, but this is our last flipped sign, so I can cool it down a bit on the wild speculation.

Best: 2017 UCLA (21.48)

Worst: 2014 Nebraska (9.58)

Opponent Three-Points Made

Coefficient: -0.2149

P-value: 0.0842

Forcing your opponent to make fewer threes appears to be a winning strategy. 3 > 2 and all that. If you recall, three-point percentage had a flipped sign, so this seems to indicate that it’s more about dictating shot selection.

Best: 2015 New Mexico St. (3.64)

Worst: 2017 South Dakota St. (10.62)

Strength of Schedule

Coefficient: 0.0633

P-value: 0.0606

Now we’re talking. That’s a plot right there. With the lone exception of 2018 Loyola, an outlier among outliers, you need a SOS rating of at least three to make the Sweet Sixteen and at least seven to make the Final Four. (Confession time: I don’t actually know what TeamRankings’ SOS rating means. I know I should, but I don’t. I believe it’s calculated based on opponents’ performance, but I don’t know what the actual rating represents.)

Best: 2019 Duke

Worst: 2016 Hampton

Turnovers

Coefficient: -0.2620

P-value: 0.0348

Protecting the ball is a good thing. Whoda thunkit? Some of the very best teams in the turnover department made very deep runs, and this plot may be underselling the importance since this is our first variable with a sub-0.05 p-value.

Best: 2015 Wisconsin

Worst: 2018 Stephen F. Austin

Opponent Turnovers

Coefficient: 0.3072

P-value: 0.0255

The other side of the coin: taking the ball away is just as important. The plot shows a flat trend here, but the coefficient is strongly positive, as one would expect. Steals is highly correlated with this and had a negative sign, so maybe it’s non-steal takeaways that are critical. No matter what, improving your turnover margin seems to be one of the best things you can do to make deeper tournament runs.

Best: 2017 West Virginia (20.44)

Worst: 2015 Texas (9.03)

Offensive Rebounds

Coefficient: 0.2577

P-value: 0.0245

Some of basketball’s greatest late game heroics have hinged on someone coming up big on the offensive glass. (Like this one. Or these. And of course, the granddaddy of them all.) This is very much in the same vein as turnover margin, since you’re essentially conjuring up a new possession that you normally wouldn’t have had.

Best: 2015 West Virginia (16.84)

Worst: 2016 Northern Iowa (5.38)

Three-Points Made

Coefficient: 0.3157

P-value: 0.0223

Earlier we talked about how it doesn’t matter if you make a high percentage of threes. Here we see that it’s very important to make a lot of threes. The direction is clear: take a lot of threes. If you can add about three made threes per game, you should make it through another game in the tournament, ceteris paribus. And of course, being able to make that buzzer beater from deep is the difference between ignominy and immortality.

Best: 2018 Villanova (11.41)

Worst: 2015 SMU (4.18)

Field Goal Percentage

Coefficient: 36.1899

P-value: 0.0083

That is a huge coefficient right there, and not without reason. Holding all else the same, if you can increase your field goal percentage by just six percentage points, you would be expected to win an additional two tournament games. Just by improving from 43% to 49%, you could last an entire extra weekend. There is some correlation with other variables and I expected the sign to be positive, but I did not expect it to be such a gargantuan factor in tournament performance. Goes to show how important finding the bottom of the net is.
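
That back-of-the-envelope math is just the coefficient at work: field goal percentage enters the model as a decimal, so a six-point bump is 0.06 on the model’s scale.

```python
fg_coef = 36.1899
print(round(fg_coef * 0.06, 2))  # ~2.17 extra predicted tournament wins
```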

Best: 2019 Gonzaga (53.2%)

Worst: 2016 Temple (40.5%)

Seed (Log Base 2)

Coefficient: -0.6462

P-value: 0.00000000045

Yup, this is the one. The variable dragging the rest along for the ride. The plot makes it clear why: better seeds frequently make deep tournament runs, while lower seeds are seldom so lucky. And the tippy top, the #1 seeds, are head and shoulders above the rest of the pack. This will certainly make our model very chalky, but when trying to pick an upset, you’re going to be wrong a lot more often than you’re right.

Let’s take a look at our final model.

All Together Now

Predicted Wins

R-Squared: 0.447

Adj. R-Squared: 0.41

You know what? I’m really happy with that. Considering how many games are won and lost on a lucky bounce here, an unlucky bounce there, I think we’ve done a nice job of extracting as much predictive power from the model as we can.

Most Predicted Wins: 2019 Gonzaga (3.59)

Fewest Predicted Wins: 2016 Hampton (-0.82)

Putting This Work to Work

Using all of these coefficients on the 2021 field, we can find how many games each team is expected to win. For our bracket, we simply pick whichever team in a matchup has more predicted wins.
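
Mechanically, each team’s predicted win total is just the intercept plus its stat line multiplied by the coefficients above. Continuing the earlier sketch (with a hypothetical file of 2021 team stats and the same placeholder column names, since the real data assembly isn’t shown):

```python
# Build the same feature matrix for the 2021 teams and apply the fitted model.
df_2021 = pd.read_csv("tournament_teams_2021.csv")  # hypothetical file
df_2021["log2_seed"] = np.log2(df_2021["seed"])
X_2021 = sm.add_constant(df_2021[features])

df_2021["pred_wins"] = model.predict(X_2021)
print(df_2021.sort_values("pred_wins", ascending=False)[["seed", "school", "pred_wins"]])
```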

Seed | School | Pred. Wins
1 | Baylor | 2.98
1 | Illinois | 2.72
1 | Gonzaga | 2.65
1 | Michigan | 2.58
2 | Iowa | 2.36
2 | Houston | 2.36
2 | Alabama | 2.23
2 | Ohio St. | 1.9
3 | Kansas | 1.78
6 | Texas Tech | 1.48
5 | Colorado | 1.48
5 | Creighton | 1.42
4 | Virginia | 1.41
5 | Tennessee | 1.41
3 | West Virginia | 1.37
3 | Texas | 1.37
3 | Arkansas | 1.36
4 | Oklahoma St. | 1.19
6 | Southern California | 1.14
4 | Florida St. | 0.97
4 | Purdue | 0.94
8 | Loyola Chicago | 0.85
7 | Florida | 0.83
7 | Oregon | 0.74
9 | Wisconsin | 0.73
7 | UConn | 0.73
8 | Oklahoma | 0.67
10 | Maryland | 0.66
5 | Villanova | 0.65
8 | LSU | 0.64
13 | Liberty | 0.63
9 | Georgia Tech | 0.58
10 | Rutgers | 0.54
11 | Drake | 0.53
6 | San Diego St. | 0.52
8 | North Carolina | 0.51
11 | Syracuse | 0.48
13 | North Texas | 0.44
6 | BYU | 0.36
7 | Clemson | 0.26
15 | Oral Roberts | 0.18
10 | VCU | 0.18
11 | Utah St. | 0.14
13 | UNC Greensboro | 0.11
11 | UCLA | 0.07
14 | Abilene Christian | -0.04
12 | UC Santa Barbara | -0.08
12 | Oregon St. | -0.08
9 | Missouri | -0.1
15 | Grand Canyon | -0.21
10 | Virginia Tech | -0.28
12 | Georgetown | -0.29
13 | Ohio | -0.32
9 | St. Bonaventure | -0.37
14 | Morehead St. | -0.48
12 | Winthrop | -0.51
14 | Eastern Wash. | -0.82
16 | Texas Southern | -0.87
16 | Hartford | -0.9
15 | Cleveland St. | -0.97
16 | Drexel | -1.14
16 | Norfolk St. | -1.15
14 | Colgate | -1.69
15 | Iona | -1.83

Welp, that’s about as chalky as you can get. In fact, I can summarize the entire bracket by listing the handful of “upsets” and the championship.

9 Wisconsin over 8 North Carolina

10 Rutgers over 7 Clemson

5 Creighton over 4 Virginia

5 Colorado over 4 Florida State

5 Tennessee over 4 Oklahoma State

6 Texas Tech over 3 Arkansas

1 Baylor over 1 Gonzaga in the championship game

Other than that, straight chalk. I will note that this year’s teams are predicted to win only about 37 games in total, whereas the other six years’ fields averaged 63 (the number of games actually played). It could be a sign of a relatively weak field.

 

Will it do well?

Yeah, probably.

Will it do very well?

No, probably not.

Is it a boring bracket?

Yes, definitely.

Do I regret this endeavor?

Not a chance.
