Friday, December 18, 2009

Distributions, and their meaning

This all started innocently enough.

Heading into the game against the Washington Huskies, I fully expected Georgetown to struggle mightily with turnovers, since Georgetown was turning the ball over on more than 22% of their possessions.  Against a high-tempo team, I expected these turnovers to be turned into easy fast-break points for the Huskies.

Turned out, Washington had a harder time holding on to the ball than the Hoyas.  For the game, Georgetown outscored UW 28-23 on points after turnovers, as the Huskies ended fully 30% of their possessions with a giveaway.

Somehow, this got me thinking about the four factors, scoring efficiencies and steals.  So I turned to Ken Pomeroy's site, downloaded all of the end-of-season stats for 2004 through 2009, and started playing around with the data.

Specifically, I was curious what the distribution of turnover rates was for teams in college basketball, as well as the median value.  While conference-only stats might be a bit more useful when looking at the Hoyas, they're a bit more tedious to generate (Pomeroy has done all the work for me for stats for all games), and his complete database has more than 2000 team-seasons, which makes statistical analysis much less susceptible to the occasional outlier.

An important point to keep in mind during this discussion is that I'm only looking at team-to-team differences, not game-to-game differences for a team, or player-to-player differences within a team.

First, I looked at the distribution of turnover rates (turnovers per 100 possessions).  Since we have TO rates for both offense and defense (i.e. turnovers committed and generated, respectively), I decided to compare histograms of the two, to see whether the rate at which offenses turn the ball over varies as much as the rate at which defenses force turnovers.

The convention used here will be consistent throughout:  I created 25 equally sized bins for a histogram, counted the number of team-seasons in each bin, then ran a Gaussian fit through each histogram (assuming a normal distribution) to get the median and the width.  The width is really the FWHM (full width at half maximum).  Both values are reported next to the histogram.  The histograms are plotted with offense on top and defense on the bottom.
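
For anyone who wants to try this at home, here is a minimal sketch of the binning and width calculation, not the code behind the plots in this post.  It assumes the data is just a list of per-team rates, the sample values are randomly generated stand-ins, and `NormalDist.from_samples` matches the sample mean and standard deviation rather than running a least-squares fit through the bin counts, so treat it as an approximation of the procedure described above.  For a Gaussian, the FWHM is 2*sqrt(2*ln 2) times the standard deviation, or about 2.355 sigma.

```python
import math
import random
from statistics import NormalDist

def histogram(values, bins=25):
    """Count values in `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def gaussian_summary(values):
    """Return (center, FWHM) of a normal distribution matched to the data."""
    dist = NormalDist.from_samples(values)
    fwhm = 2 * math.sqrt(2 * math.log(2)) * dist.stdev
    return dist.mean, fwhm

# Hypothetical stand-in for ~2000 team-season turnover rates (not real data).
rng = random.Random(0)
rates = [rng.gauss(20.7, 1.36) for _ in range(2000)]

counts = histogram(rates)
center, fwhm = gaussian_summary(rates)
print(f"center={center:.1f}  FWHM={fwhm:.1f}")
```

For a symmetric bell curve the fitted center, the mean and the median coincide, which is why the center can be reported as the median.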

For turnover rate, the distributions look very similar, but not identical:  the distribution of turnover rates by the offense is slightly broader and its median is a fuzz higher, while the distribution of turnover rates forced by the defense is a bit narrower and its median lower, with a strange tail extending to the right.  I suspect that this tail is due to a few coaches going all-out in attempting to force turnovers (think 40 Minutes of Hell), while most teams stay within about the 17-24% range.  Since all teams try to prevent turnovers while on offense, that distribution looks a bit more like a classic bell curve.

This was all well and good, but frankly not very interesting, at least not yet.  Next, I ran steal rate (steals per 100 possessions):

At the top is the steals allowed while on offense, i.e. opponents' steal rate, while the bottom is steals generated while on defense.  Hopefully, these pairs of distributions look as different to you as they did to me.  The top plot (steals allowed) again has a much more symmetrical shape than the bottom (steals committed), and now it is much narrower.  I think that asymmetry in the bottom plot (the tailing on the right side) goes back to the idea that a few coaches will have their players constantly go for steals.  I'll discuss the difference in widths a bit more later on.

This got me thinking about two things - for now I'll just stick to the distributions, but if this post doesn't turn into a novella, I'll return to the other point (about turnovers and steals) at the end.

My next inclination was to wonder whether offensive and defensive efficiencies also show a difference in their distributions.  After all, all teams are trying to score and prevent the other team from scoring, so I'd expect that they'd be roughly the same. Here we go:

If you couldn't see much of a difference in the distributions in that last plot, I hope you see the difference here.  While both have a classical bell-curve shape, the offensive efficiencies are much more widely distributed than the defensive efficiencies.

So what does this mean?

When I looked at the difference between off. steals (allowed) and def. steals (committed), I noted that the defensive curve was broader.  The explanation is a bit complicated, but is as follows:
  • The stats we are looking at are "raw", i.e. not adjusted for competition.  Each histogram is generated from the end-of-season stats for each team.
  • Therefore, each statistic is based upon playing a large number of teams (typically ~30 games) of varying talent and strategies.  So, when a statistic is a measure of some consistent team strategy (e.g. defensive steals), the distribution will be broad, since the statistic will be inflated or deflated expressly by strategy.
  • To demonstrate by continuing the example of defensive steals, a team like VMI has a very high steal rate which seems to be a fundamental component of their defense, while Washington St. has a very low steal rate as Tony Bennett eschews a gambling defense.  In each case, their opponents will allow more or fewer steals than normal because of the defensive strategy, and this accumulates over the course of 30 games.  By season's end, this consistent strategy results in a statistic (here steal rate) far from the median or average value.
  • Conversely, all teams try to minimize steals allowed.  Perhaps a certain offensive style is likely to lead to more steals for the defense, but offensive players on the court are always trying to prevent steals allowed.  Over the course of 30 games, they'll play against aggressive and passive defenses, so sometimes they'll have a high steal rate allowed, sometimes low.  But by the end of the season, they'll regress towards some median value.
  • When you extend this logic to all 2000 team-seasons, active strategies will result in a wider distribution than reactive responses.
This sounds reasonable to me, but can I prove this hypothesis?  We'll need to take a look at another stat where we can compare strategy to reaction by the offense and defense.  Thanks to a recent diatribe by John Gasaway, we know of another statistic:  rebounding.

Simply put, defensive rebounding is reactive - every team wants to get every rebound while on defense.  But offensive rebounding is strategic - some coaches will send 3 or 4 players to crash the boards on a missed shot, while others will send everyone back to prevent a fast break.

So if my thesis holds water, we'd expect offensive rebounding % to have a wider distribution than defensive rebounding %.  Let's take a look (here, I'll be using OR% and OR% allowed [= 1 - DR%] to make the distributions directly comparable):

And there it is, just as predicted.  The distribution of teams' offensive rebounding rate is wider than the defensive rebounding rate.

Feeling rather smug, I went ahead and ran all of the four factors (along with steal rate), summarized in this table:

.                     Offense                    Defense
Stat           Median  Width   W/Med      Median  Width   W/Med     Difference
Raw Effic.     101.0    9.7     9.6%      100.6    7.5     7.4%        2.2%
Adj. Effic.    100.2   12.8    12.8%      100.8   11.9    11.8%        0.9%

eFG%            49.0    4.4     8.9%       49.1    4.0     8.1%        0.8%
TO Rate         20.7    3.2    15.2%       20.5    3.1    15.0%        0.3%
O. Reb %        33.0    4.5    13.7%       32.9    4.0    12.3%        1.5%
FT Rate         35.5    7.0    19.6%       35.4    8.4    23.8%       -4.2%

Steal Rate       9.8    2.1    21.5%        9.7    2.4    25.0%       -3.2%

That column "W/Med" is simply the width divided by the median, as a way of normalizing the statistics to make them comparable.

The last column is the one of real interest, and is simply (Off. W/Med - Def. W/Med).  For the case of steal rate, the difference is negative, meaning that the relative width of the defensive distribution is wider, which we now know means that the defensive behavior is controlling the stat.  For rebounding percentage the value is positive, and so the offensive team has a greater impact.  Also note that the difference between offense and defense is more than twice as strong for steal rate as rebounding %, which seems reasonable.
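
As a small worked example of the arithmetic behind those last columns, here is a sketch that recomputes W/Med and the difference for two rows.  The function names are my own, and because the medians and widths printed in the table are already rounded, the results can differ from the published columns by a few tenths of a point.

```python
def relative_width(median, width):
    """Width as a percentage of the median (the W/Med column)."""
    return 100.0 * width / median

def difference(offense, defense):
    """Off. W/Med minus Def. W/Med, in percentage points."""
    return relative_width(*offense) - relative_width(*defense)

# (median, width) pairs copied from the table above: offense, then defense.
rows = {
    "O. Reb %": ((33.0, 4.5), (32.9, 4.0)),
    "FT Rate": ((35.5, 7.0), (35.4, 8.4)),
}

for stat, (offense, defense) in rows.items():
    diff = difference(offense, defense)
    # Positive: offensive distribution is relatively wider; negative: defensive.
    wider = "offense" if diff > 0 else "defense"
    print(f"{stat}: {diff:+.1f} points, wider on {wider}")
```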

The others:
  • Free Throw Rate:  The most strongly dependent upon the defensive strategy, more so than even steals.  I suspect here that the strategy of end-of-game fouling is dominating the stats; if I could re-run with just 1st half statistics, I wonder if the result would be so strong, or even the same.
  • Turnover Rate:  This is slightly more dependent upon the offense than the defense, but the difference is very small.  Essentially, the offense and defense are equally responsible for turnover rate.  In light of the strong dependence of steal rate on the defense, this may be surprising.  I'll have more to say about that at the end.
  • Effective FG%:  Again only a weak difference, but shooting accuracy is more dependent upon offense than defense.  This stat also may be opening up a second way to understand the difference column - while certainly offensive strategy (e.g. Princeton offense:  shoot only open 3s or layups) can help, I wonder if player skill is also being measured here.  That is, the ability to shoot accurately may be more important than the ability to defend shooters.
  • Raw Efficiency:  This most decidedly indicates that offense is determining efficiency more than defense.  The significance of this goes back to the eFG% remark, where I don't know if the stats are saying offensive strategy or offensive skill is the driver but I suspect both are involved.  This also is likely a cumulative effect of the first three factors (eFG%, TO Rate and O. Reb%) all correlating more with offense than defense.  The factors are listed in order of importance, so the strong influence of defense on FT Rate just isn't as important.
  • Adj. Efficiency:  For curiosity's sake, I also ran KenPom's adjusted efficiencies, which is as close as I can get to conference-only stats using Ken's data.  Since the quality of competition is now accounted for, we see the offense-as-driver is much weaker.  The implication here is that when good teams play bad teams, offensive skill of the good teams dominates (assuming that all teams can implement strategy equally adeptly).  Since the rest of the stats are not adjusted for competition, this also may be telling us that all of these results would be weaker when teams of equal ability play.  This is to say, players' skill or relative athleticism may be more important than the coach's strategy, after all.

Near the start of this article, I mentioned that I had a second thread of thought about turnovers and steals:  I wondered how important steals were to turnovers.  The intuitive response is simply one-to-one since, after all, a steal is a turnover.  But I could rationalize other arguments:
  1. If a defense tries for a lot of steals, the offense would commit more turnovers of other types (5-second calls, throwing the ball out of bounds, etc.).
  2. If a defense tries for a lot of steals, there would be fewer turnovers of other types, e.g. errant passes would be more likely to be intercepted by a ball-hawking defense than allowed to sail out of bounds.
So, I simply plotted defensive turnover rate (TOs forced) versus defensive steal rate (steals committed):

The slope of the line is greater than 1, by about 20%.  This indicates that alternative hypothesis #1 is, in fact, the correct one.  Teams that force more steals also force additional turnovers beyond steals, so that they'll get 6 extra turnovers for 5 extra steals.
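
To make the slope concrete, here is a minimal sketch of an ordinary least-squares line through (steal rate, turnover rate) points.  I am assuming a plain least-squares fit, and the five teams below are invented purely for illustration; they are not taken from the data set.

```python
def ols_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical teams: defensive steal rate and turnover rate forced,
# both per 100 possessions.
steal_rates = [7.0, 8.5, 9.7, 11.0, 13.0]
to_forced = [17.3, 19.0, 20.6, 22.2, 24.4]

slope, intercept = ols_line(steal_rates, to_forced)
print(f"slope={slope:.2f}  intercept={intercept:.1f}")
```

A slope near 1.2 is the "6 extra turnovers for 5 extra steals" relationship; a slope of exactly 1 would mean each extra steal adds one turnover and nothing more.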

Interestingly, if I do this same plot using offensive turnover rates and steals (TOs and steals allowed), the slope of the line is exactly 1 (not shown).  Since we now know that steal rate is a defensive-dependent stat, I think the first plot is more valid, but I'm still thinking this through.

Finally, I can adjust both the offensive and defensive turnover rates for the steal rates, effectively creating a new stat [= TO Rate - Steal Rate * slope].  The temptation is to call this "Unforced TO Rate", but this would only make sense if the distribution analysis were to show a very strong dependence upon offense rather than defense.  So, I ran the numbers:
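
The adjustment itself is a one-liner.  Here is a sketch using a made-up team, with the slope taken as roughly 1.2 on the defensive side and 1.0 on the offensive side, as described above; the function name and the sample rates are hypothetical.

```python
def steal_adjusted_to_rate(to_rate, steal_rate, slope):
    """TO Rate - Steal Rate * slope, all per 100 possessions."""
    return to_rate - steal_rate * slope

# Hypothetical team: 22.0 turnovers and 10.0 steals per 100 possessions.
defense_adjusted = steal_adjusted_to_rate(22.0, 10.0, slope=1.2)
offense_adjusted = steal_adjusted_to_rate(22.0, 10.0, slope=1.0)
print(f"defense: {defense_adjusted:.1f}  offense: {offense_adjusted:.1f}")
```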

.                     Offense                    Defense
Stat           Median  Width   W/Med      Median  Width   W/Med     Difference
TOs-Steals      8.5     2.0    23.5%       10.5    1.9    17.7%        1.5%

This is what I'd call just a modestly pleasing result.  If I remove the steals component from turnovers, the offense is controlling "Unforced TO Rate" about as strongly as rebounding rate.

Frankly, I was hoping for more, but I suspect I am trying to push the statistics a bit harder than they will allow.


  1. I did something similar last week, but much simpler I think, using 2010-11 NCAA data.

    eFG%, for example:

    1) Find Adj. eFG%, which = average eFG% + home(0=no, 1=yes)*x
    I found a value of 1.45 for eFG%

    2) Find each team's out-of-sample Adjusted eFG%.
    For each game, find the average of ALL other games played by each team, and average their adjusted eFG%.

    3) Create a regression from Offensive-Out-Of-Sample-Adj-eFG%*a + Opponent's-Defensive-Out-Of-Sample-Adj-eFG%*b = Adj eFG% (game)

    So for eFG%, I found the following:
    0.545514519*offense + 0.45507438*defense (R^2 value of 0.97)

    Since these numbers roughly sum to 1, we can say that offenses control roughly 54.6% of their own eFG%.

    My values for the four factors are:
    eFG: (54.6%), TO: (47.7%), OR (64.4%), FTR (41.0%)

    Of course, true free-throw-rate (FTM/FGA) was controlled more by offense than defense.
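
A rough sketch of step 2 from the comment above (the out-of-sample average), under the assumption that it means a leave-one-out mean over a single team's games.  The function name and the game values are hypothetical, and this is not the commenter's actual code.

```python
def leave_one_out_means(values):
    """For each game, the mean of the team's values in all other games."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two games")
    total = sum(values)
    return [(total - v) / (n - 1) for v in values]

# Hypothetical adjusted eFG% for one team across five games.
games = [48.0, 52.0, 50.0, 55.0, 45.0]
out_of_sample = leave_one_out_means(games)
print(out_of_sample)
```

Each game's own value is excluded from its average, so the predictor for a game never contains the result it is later regressed against.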

    1. Hi Nathan,

      Your results are encouraging.

      To compare:

      Stat . . . Me . . . You
      eFG . . . 0.8 . . . 4.6
      TO . . . . 0.3 . . . -3.3
      OR . . . . 1.5 . . . 14.4
      FTR . . . -4.2 . . . -9.0

      For your stats, I've just standardized about 50%.

      If you plot those two, you'd find that the first three are a straight line. Free-throw rate is off the line, but we both find that defense is controlling.

      No idea what it all means, though.

  2. I used the same method (out-of-sample-prediction) to predict efficiency: