On/Off and Sample Size

Eric Hofmann
5 min read · Oct 17, 2018


On/off has an error bar of ± 11 for a player who plays a full season. If one player’s on/off is +5 and another’s is -5 at the end of the season, this difference is not statistically significant.

.

Suppose we have a coin, and we want to know if it’s fair. If we flip it twice and get two heads, we all agree that’s probably not good evidence the coin is rigged. If we flip it two billion times and get two billion heads, we all agree it probably is. But where specifically along the way should we start agreeing the evidence is good? At what point is a sample no longer small?

[Image: a quarter. Caption: “Truly great is the power of George Washington!”]

For a coin flip there turns out to be an easy arithmetic way to determine this, but there isn’t one for on/off, so let’s stick that in our back pocket and sit on it, Fonzarelli. Instead let’s take a coin we know is fair (which is to say one that has a fifty percent chance of coming up heads, fifty percent chance of tails, and zero percent chance of telepaths) and flip it a hundred times a hundred times. We know it’s possible for even a fair coin to come up heads a hundred times out of a hundred flips, but looking at a hundred trials will give us an idea of what’s probable in any given trial of a hundred flips, which will let us say whether some future coin is probably rigged.
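That experiment is easy to run yourself. Here’s a minimal sketch in Python (the function and variable names are mine, not anything from the post):

```python
import random
from collections import Counter

random.seed(42)  # fix the seed so the run is repeatable

def heads_in_trial(n_flips=100):
    """Flip a fair coin n_flips times and count the heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips))

# A hundred trials of a hundred flips each.
trials = [heads_in_trial() for _ in range(100)]

# Tally how many trials produced each head-count, like the chart below.
histogram = Counter(trials)
for heads in sorted(histogram):
    print(f"{heads:3d} heads: {'#' * histogram[heads]}")
```

Run it a few times with different seeds and you’ll see the same shape every time: a lump in the middle, thin tails on either side.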

[Chart: histogram of heads per hundred-flip trial, with colored standard-deviation zones. Caption: “This is zonés?” “This is horrible.”]

This tells us how many times we flipped our fair coin a hundred times and got a given number of heads. For example, we got 45 heads six times and 58 heads twice.

As expected our results cluster around fifty, and results become less and less likely the further they deviate from the expected value. We can express this quantitatively with the standard deviation: in a normal distribution, 68% of values fall within one standard deviation of the average, 95% within two, and 99.7% within three. That’s where the colors come in, since our standard deviation happens to be five. (It won’t necessarily be such a nice round number, and in practice it almost never is.) If we flip our future coin a hundred times and get a result in the red zone, by convention we call that difference statistically significant, since the deviation probably can’t be explained by random chance alone. Put another way, if we measure seventy heads over a hundred flips, that shouldn’t be dismissed as a small sample the way we would dismiss seven heads over ten flips. (Other wordings include that we can reject the null hypothesis, or that the signal is greater than the noise, or that the result falls outside the confidence interval. You know, typical nerd stuff.)
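Incidentally, that “easy arithmetic way” for a coin mentioned earlier is the binomial standard deviation, √(n·p·(1−p)), which for a hundred fair flips works out to exactly five — so the round number isn’t luck. A quick check:

```python
import math

n, p = 100, 0.5                  # a hundred flips of a fair coin
expected = n * p                 # 50 heads on average
sd = math.sqrt(n * p * (1 - p))  # binomial standard deviation

def z_score(heads):
    """How many standard deviations a result sits from expectation."""
    return (heads - expected) / sd

print(sd)           # 5.0
print(z_score(70))  # 4.0: way past the two-sigma "red zone"
print(z_score(55))  # 1.0: easily explained by chance alone
```
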

It’s crucial to remember that we’re only 95% sure, which means one time out of twenty we’ll be wrong, which is why even the fair coin shows up in the red zone a few times. Science is hard.

.

Now we do the same thing, but with basketball players instead of quarters. Since basketball games are not decided by coin flips, we need a couple more levels to our model. Our player will only be on the court for 36 minutes each game, so there’s an “on” and “off” in the first place, and then we’ll randomly select a play based on how often turnovers, free throws, and field goals happened in the actual 2017 NBA season. Things like assists, rebounds, and fouls don’t matter for on/off because they do not in themselves put points on the board. All that matters is how a possession ends and how many points result: a turnover always results in zero, a field goal we’ll randomly attribute zero, two, or three points based on how often each of those results happened, and likewise for free throws.

Then we repeat that process 95 times and call it a game.

Then we repeat that process 81 times and call it a season.

Then we repeat that process 999 times and call it a day because boy oh boy that’s a lot of processing.
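The whole pipeline — possession, game, season, many seasons — can be sketched in plain Python. The post doesn’t give its exact 2017 possession frequencies, so the weights below are rough stand-ins of mine (about 13% turnovers, 9% free-throw trips, the rest field-goal attempts), and I run a hundred seasons rather than a thousand to keep it quick:

```python
import random
import statistics

random.seed(2017)  # repeatable runs

def possession_points():
    """Points from one possession. The outcome weights here are
    illustrative guesses, not the post's actual 2017 NBA frequencies."""
    r = random.random()
    if r < 0.13:                 # turnover: always zero points
        return 0
    if r < 0.22:                 # free-throw trip: two shots at ~77%
        return sum(random.random() < 0.77 for _ in range(2))
    r = random.random()          # field-goal attempt
    if r < 0.54:
        return 0                 # miss
    return 2 if r < 0.88 else 3  # make: a two or a three

def season_on_off(games=82, possessions=96, on_share=36 / 48):
    """One season's on/off: margin per 100 possessions with our player
    on the court minus margin per 100 possessions with them off."""
    on = [0, 0]   # [point margin, possessions] while on
    off = [0, 0]  # same while off
    for _ in range(games):
        for _ in range(possessions):
            margin = possession_points() - possession_points()
            bucket = on if random.random() < on_share else off
            bucket[0] += margin
            bucket[1] += 1
    return 100 * on[0] / on[1] - 100 * off[0] / off[1]

# A hundred identical-player seasons (the post ran a thousand).
seasons = [season_on_off() for _ in range(100)]
print(round(statistics.mean(seasons), 2))   # clusters near zero
print(round(statistics.stdev(seasons), 2))  # several points of pure noise
```

Even with every player defined to be exactly average, single seasons swing several points in either direction — which is the whole point.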

[Chart: distribution of simulated full-season on/offs. Caption: “This is fine. Everything is fine.”]

Whoa, Nelly! Let’s start with the good news. Our thousand seasons make another normal distribution, which is good, and our on/offs are clustered around zero, which is very good, since the true-talent on/off of every player in this league is zero by definition: they’re all literally identical.

But the range on this! Yeesh! Our player (who is exactly the same as every other player!) managed to post both a +20 and a −17 on/off over a full season! The standard deviation works out to about 5.5, so according to this model, even when a full season has been played we can’t (or at least shouldn’t) be confident that a player who has posted an on/off of +10 actually makes their team better!

How can this possibly be right?

If on/off were really so unreliable, wouldn’t we see players’ numbers change dramatically year over year?

Well… we do.

[Chart: LeBron James on/off, years four through eleven]

All on/off stats courtesy of the indispensable basketball-reference.com. Even if we say “aha, the Miami trade!” for year eight (which is the kind of situation we would hope on/off captured anyway), what happened in year six? Or year nine? Or year eleven? And the King is not alone in this: each of the other recent MVPs has seen similarly massive and inexplicable swings. Kevin Durant went from +17 to −1 (!!!) in year four, Steph Curry from +3 to +15 in year five, James Harden from −4 to +8 in year five, Russell Westbrook from −9 to +2 in year four. The size of these variations matters not in itself but relative to the size of our measurements: if players routinely posted +100 and −100 on/offs, error bars of 10 wouldn’t be a big concern.

But they don’t.

So they are.

.

.

A full season is an important sample size. It carries an MVP, a championship, a playoffs theme song. But not everything that’s important is statistically significant, and that happens to be the case with season-long on/off. Analytically we’ve seen it’s unreliable, and empirically we’ve seen cases that at least garnish credulity’s paycheck, if not beggar it outright.

It didn’t have to be this way, and as we’ll see in future entries, not every composite stat has such pronounced noise. It just happens to be this way for on/off at season’s end, so we should be extremely hesitant about using it during a season, especially for specific lineups, which necessarily haven’t played as many minutes as individual players. Making a sample size that’s already too small even smaller is a recipe for disaster.

.

That’s it, thanks for reading! :)

