I have begun to run across more people interested in doing the kind of work I do here at JoBS. What follows is work done by Gil Graybill, another person from NC who independently performed the following analysis to determine the odds that a 16 seed would win the NCAA Tournament. With his permission, I also make some comments at the end.
When the NCAA Tournament pairings came out, somebody made the news by declaring that Fairfield was a "5 Gazillion to 1" shot to win the tournament. Sounds a little high, so I got some numbers and did some plugging and chugging.
My conclusion: The average 16 seed in the NCAA Tournament has a 384,000,000 to 1 chance of winning the whole tournament.
I guess I have to explain it now, huh?
It would be way too easy to say, "Since a 16 seed has never beaten a 1 seed, the chances are 0. End of story." That wouldn't be very much fun, now would it?
Before I tell you how I did it, here are the assumptions I had to make:
(1) The final scores are representative of how well the teams played. (Editor's Note 1)
(2) That the NCAA Tournament committee who does the seedings knew what they were doing, and that everyone got the seeds they deserved. (Editor's Note 2)
(3) All 1 seeds are created equal, and all 16 seeds are created equal. I know this isn't true, but as a group, we can predict their performances. (Editor's Note 3)
There will be more assumptions to follow. I'll try to number them to make it easier to point your flame-thrower at it.
I got all the scores to all the games since the NCAA Basketball Tournament expanded in 1985. There have been 48 matchups of 1 vs. 16 (4 games a year * 12 years), which is a reasonably valid sample size (I didn't include this year). I calculated the average of the point spread to be 23 points. I calculated the standard deviation to be 12.334. (Editor's Note 4)
(4) I assume that the difference in the winning and losing score will be a normal distribution. (Editor's Note 5)
Using the values of mean = 23, stdev = 12.334, and x = -1 (that is, the probability that the margin will be <= -1), I threw it into your standard normal curve problem and came up with .02584, or 2.584%. That is, on average, a number 16 seed has about a 2.5% chance of beating a number 1 seed, based on the performance of previous 1 vs. 16 pairings. (Editor's Note 6)
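The normal-curve step can be checked with a few lines of Python. This is only a sketch: the function name `normal_cdf` and the variable names are labels chosen for illustration, and the original calculation may well have been done with a table or a calculator rather than code.

```python
import math

def normal_cdf(x, mean, stdev):
    """P(X <= x) for a normal distribution, computed with the error function."""
    z = (x - mean) / stdev
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_margin = 23.0      # average margin of the 1 seed over the 16 seed (from the text)
stdev_margin = 12.334   # standard deviation of that margin (from the text)

# Probability that the margin comes out at -1 or lower, i.e. the 16 seed wins.
p_upset = normal_cdf(-1.0, mean_margin, stdev_margin)
print(f"P(16 seed beats 1 seed) = {p_upset:.5f}")   # the text reports .02584
```

Swapping in a different mean, standard deviation, or cutoff shows how sensitive the answer is to each input.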
So, according to this, a 16 should have upset a 1 by now. There have been a couple of close ones. By fudging my normal curve calculations and solving for "x=0" (a tie game) instead of "x=-1", I could include the overtime game between Michigan State and Murray State in 1990 and come out smelling like a rose. But it's not a huge statistical anomaly to have a 1 in 40 chance not occur in 48 chances. (Editor's Note 6)
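To put a number on "not a huge statistical anomaly", here is a small sketch. It assumes the 48 games are independent and that each carries the same 1 in 40 upset chance. Those assumptions are made for the illustration and are not spelled out in the analysis itself.

```python
p_upset = 1.0 / 40.0   # rounded per-game upset chance
games = 48             # 1 vs. 16 matchups since 1985, per the text

# Chance that the upset fails to happen in every one of the 48 games,
# assuming independent games with identical odds.
p_no_upset = (1.0 - p_upset) ** games
print(f"P(no 16-over-1 upset in {games} games) = {p_no_upset:.3f}")
```

The result comes out a little under 30%, so going 48 games without an upset is not surprising under these assumptions.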
If I round off the 2.584% to 2.5 percent, we can call it an even 40 to 1 shot that the 16 seed will win their first game. I'll use that from now on.
So, one game down, 5 to go.
(5) If we assume that every game is as difficult as the first, the real answer to "What are the chances of a 16 seed winning the tournament?" is (1/40)**6, or 1 in 4,096,000,000. Instead, I am going to assume that the chances of winning in the next rounds are as follows:
Round 2: vs. an 8 or 9 seed, 1 in 10
Round 3: vs. a 4 or 5 seed, 1 in 20
Round 4: vs. a 2 or 3 seed, 1 in 30
Rounds 5 and 6: both vs. 1 seeds, 1 in 40
(Editor's Note 7)
(5a) I am assuming there is only one Cinderella team in the tournament.
Multiplying all the chances out, we get (1/40)(1/10)(1/20)(1/30)(1/40)(1/40) = 1/384,000,000, or 384,000,000 to 1.
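The multiplication can be verified exactly with Python's `fractions` module. The list name `round_odds` is just an illustrative label for the per-round figures assumed above, and the sketch treats the six games as independent, as the analysis does.

```python
from fractions import Fraction
from math import prod

# "1 in N" odds for rounds 1 through 6, as assumed in the text.
round_odds = [40, 10, 20, 30, 40, 40]

# Chance of winning all six games, treating them as independent.
p_title = prod(Fraction(1, n) for n in round_odds)

# The alternative where every round is as hard as the first.
p_flat = Fraction(1, 40) ** 6

print(p_title)   # 1/384000000
print(p_flat)    # 1/4096000000
```

Using exact fractions rather than floating point keeps the large denominators free of rounding error.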
Just for giggles and grins, I used the same method to calculate the chances of a 15 seed beating a 2 seed. This has happened 2 times in 48 games. The average difference of 2 vs. 15 is 17.833. The standard deviation is 10.491. Throw it into a normal curve with x=-1, and you get .03631, or 3.6%. That's about a 1 in 30 shot. Pretty darn close. My calculations make it seem like the 15 seed has, in reality, won too many times. Of course, having it happen AGAIN this year makes it even more interesting. (Editor's Note 8)
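The 2 vs. 15 figures can be checked the same way. The sketch below also compares the predicted rate with the 2 upsets seen in 48 games, using a binomial calculation that assumes independent games with identical odds. That comparison is an added illustration, not part of the original write-up, and the names are hypothetical.

```python
import math

def normal_cdf(x, mean, stdev):
    """P(X <= x) for a normal distribution, computed with the error function."""
    z = (x - mean) / stdev
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Mean and standard deviation of the 2-seed margin, from the text.
p15 = normal_cdf(-1.0, 17.833, 10.491)
games = 48

# Expected number of 15-over-2 upsets in 48 games at that rate.
expected_upsets = games * p15

# Binomial chance of seeing two or more upsets in 48 games.
p_none = (1.0 - p15) ** games
p_one = games * p15 * (1.0 - p15) ** (games - 1)
p_two_or_more = 1.0 - p_none - p_one

print(f"P(15 seed beats 2 seed) = {p15:.5f}")   # the text reports .03631
print(f"Expected upsets in {games} games = {expected_upsets:.2f}")
print(f"P(two or more upsets) = {p_two_or_more:.2f}")
```

Under these assumptions the model expects somewhat fewer than two upsets in 48 games, so two is on the high side but well within ordinary variation.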
So, the next time someone starts throwing numbers like a gazillion around, just show them this.
Let the flames begin.
Mr. Graybill did a very good job at looking at an issue that we all have interest in. He did it thoroughly and without knowledge of my work, which makes it even better. He also wrote it up well, which counts for something.
His results verge on the impossible. 384,000,000 to 1 is essentially impossible, which means that it is prone to what I call "Mother Nature Syndrome". This is when what we think is impossible is actually subject to unlikely external effects, like food poisoning taking out the other teams in the bracket. That may be unlikely to happen, but it is probably more likely than 384,000,000 to 1. Uncertainties like that actually dominate when the basketball odds are this low. For instance, food poisoning might raise a #16 seed's chances to something closer to 1,000,000 to 1. We're probably not going to see it happen either way, but there are ways it can happen that seem more likely than 384,000,000 to 1.
2. I don't think that this analysis requires that the Tournament committee knew what they were doing. Just the fact that the scores are representative of the quality of play allows some analysis to be done. If, for some reason, the committee had Joan Rivers deciding the seeds, we would still be able to determine the odds of a 16 seed winning -- they would just be a lot higher.
3. "All #16 seeds are created equal" -- this is a pretty good way of stating what statistical analysis is about. When we look at the details of every #16 seed, we know they are different and we know that some are better than others. But we don't have the time to do a lot of that and the information to do it is often not very good. When we lump all #16 seeds together, we get a fuzzy picture, but one that people can generally agree upon and one that is provable in some sense.
4. These numbers were a little surprising to me. I did not realize that the #1 seeds had crushed the #16 seeds so badly. With regard to Gil's concern that 48 games is not enough to get a good estimate of the statistics, I say it is good enough. An NBA season is only 82 games long, and statistical theory says there is quite a bit of noise in the records. But we accept them as being fairly accurate. If his sample is only 48 games, think of it as being 48 games into an NBA season -- Do you think that the playoff teams are pretty well distinguished from the nonplayoff teams? Probably. The statistics are stable enough that we can be qualitatively comfortable.
5. The justification for a normal distribution is pretty well established. I have never looked at matchups so disproportionate, however. Intuitively, I would imagine that the assumption is even better in this case, but I don't know. Mr. Graybill should actually be able to answer this question himself with the numbers he used.
6. This is the most intriguing part of the analysis to me because it is different from what I do. (It may not be interesting to people who don't care about the math.) In my method, I look at the probability that a value from a normal distribution is greater or less than zero, not -1, as Mr. Graybill uses. I have known for some time that there is a slight conservative bias to my method, where a team that wins 70% of its games will be predicted to win only about 69%. Mr. Graybill's technique would approximately fix this, moving that prediction up to around 70%, but I don't believe that the theory to do this is right. Theoretically, I believe that he should have used x=0, in his notation, to determine the appropriate winning percentage. Practically, it doesn't make much difference: the predicted winning percentage for the #16 seed would be about 3% if he used x=0. With either of these odds, it is not unusual for the #16 seed not to have won a game.
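The gap between the two cutoffs is easy to see side by side. This is a sketch using the 1 vs. 16 mean and standard deviation quoted in the analysis. The helper `normal_cdf` is a hypothetical name, not part of either method.

```python
import math

def normal_cdf(x, mean, stdev):
    """P(X <= x) for a normal distribution, computed with the error function."""
    z = (x - mean) / stdev
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_margin = 23.0      # average 1-seed margin, from the analysis
stdev_margin = 12.334   # standard deviation of that margin, from the analysis

p_minus_one = normal_cdf(-1.0, mean_margin, stdev_margin)  # Mr. Graybill's cutoff
p_zero = normal_cdf(0.0, mean_margin, stdev_margin)        # cutoff at a tie

print(f"x = -1: {p_minus_one:.4f}")
print(f"x =  0: {p_zero:.4f}")
```

Both values round to roughly 3%, which is why the choice of cutoff matters little in practice for this matchup.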
7. Even though the numbers used here are essentially drawn from thin air, they are not without some rationale. I personally believe that a 16 seed has better than 1 in 10 odds against an 8 or 9 seed, but he chose round numbers that aren't obviously wrong. He did ignore the possibility that other lower seeds might be met along the way -- which is his assumption 5a. This would increase the odds of a #16 seed winning the Tournament, at least by this methodology. There is another way of performing this analysis that raises the odds, which says that, with every victory, the #16 seed is really better than their seeding and their odds of winning go up. Regardless, if we say a #16 seed has a 1 in 10,000,000 chance or a 1 in 384,000,000 chance, this isn't going to stop that #16 seed from believing it has a chance. Nor is it going to change the chance that we will see a #16 seed win the Tournament in our lifetime.
I recognize that there may be criticism of this sort of analysis as being only useful for gambling. That is not my purpose. Personally, I don't enjoy gambling, with the exception of donating $2 every year to an office NCAA pool and getting criticized for not winning it. There is nothing like friends who will beat you when you're down.
What I personally do enjoy in this analysis is the analysis itself. It is educational. It is the kind of analysis that people have to do in many jobs, where they have to estimate some number with a limited amount of information. In litigation cases I have been involved in, this is the type of analysis that has been necessary to determine who is responsible for environmental contamination. Financial analysts have to do this all the time. Business people trying to forecast future demand have to make similar estimates. Basketball, unlike a lot of real world situations, often presents ways to verify the numbers we calculate, which can make it fun when you're right.
So if there are parents out there concerned that their children are reading about gambling, I hope they think twice. It is certainly an excusable concern, but there is a lot to be gained from an educational standpoint by allowing children to read this type of analysis.
This ain't Mr. Rogers' Neighborhood, but have a beautiful day anyway.