September is almost over, and I haven't posted since mid-August. Well, while it may appear that there is no activity going on here at Hoya Prospectus, some things are taking place behind the scenes. Primarily, I've been working on a data loader for play-by-play data. At this point, it's not ready for prime-time, but I thought I should post a progress report so y'all wouldn't think I've fallen off the face of the planet.
Right now, I'm able to produce a facsimile of the KenPom HD Box score. This is something that Pomeroy put out for the '06-'07 season only, and then just in a limited fashion. I won't go into too many details here about the utility of an HD Box, other than to say that it contains a lot more information than a conventional box score. Here's Ken's description:
The essence of the HD box score is that opportunities are tabulated along with the traditional counting stats. The only exceptions are the points and FGA categories where the “opportunities” are the team totals while the player in question in on the court. But for assists, you have the teammates’ field goals made associated with it, for offensive rebounds, you have total possible offensive rebounds, and so on. You can also see the defensive and offensive possessions a player was on the floor by looking at the steals and turnovers categories, respectively.
For today's post, I've run 3 games from the '06-'07 season, so you, the intrepid reader, can compare my work to Ken's (which we'll treat as the gold standard for now).
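To make the idea of "opportunities" concrete, here is a minimal Python sketch of how assists and assist chances might be tallied for one team. It is not my actual loader: the event format (dicts with hypothetical `type`, `player`, and `assist` keys) and the function name are assumptions for illustration only. An assist chance is counted as a teammate's made field goal while the player is on the floor, per Ken's description.

```python
from collections import defaultdict

def tally_assist_opportunities(events):
    """Return {player: (assists, chances)} for a single team.

    `events` is a hypothetical, simplified play-by-play list of dicts:
      {"type": "sub_in" | "sub_out" | "fgm", "player": str, "assist": str (optional)}
    """
    on_court = set()
    assists = defaultdict(int)
    chances = defaultdict(int)
    for ev in events:
        if ev["type"] == "sub_in":
            on_court.add(ev["player"])
        elif ev["type"] == "sub_out":
            on_court.discard(ev["player"])
        elif ev["type"] == "fgm":
            shooter = ev["player"]
            # Every teammate on the floor (not the shooter) had a chance to assist.
            for p in on_court:
                if p != shooter:
                    chances[p] += 1
            if ev.get("assist"):
                assists[ev["assist"]] += 1
    players = on_court | set(chances) | set(assists)
    return {p: (assists[p], chances[p]) for p in players}

sample = [
    {"type": "sub_in", "player": "A"},
    {"type": "sub_in", "player": "B"},
    {"type": "fgm", "player": "A", "assist": "B"},
    {"type": "fgm", "player": "B"},
    {"type": "sub_out", "player": "A"},
    {"type": "sub_in", "player": "C"},
    {"type": "fgm", "player": "B", "assist": "C"},
]
result = tally_assist_opportunities(sample)
```

The same pattern (track who is on the floor, then bump a denominator for everyone eligible) would extend to rebounds, steals, and the other columns.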
Example #1 (KenPom's version linked here)
Georgetown vs Villanova
02/17/07 Noon at Wachovia Center (Philadelphia, Pa.)
Final score: Georgetown 58, Villanova 55
Georgetown               Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
Summers, DaJuan        34:51    +8 11/53   2-5   0-4   7-8  9/43  2/17  0/52  2/53  4/31  3/28  3/34     2
Green, Jeff            39:34    +5 19/58  7-14   1-2   2-3 16/47  4/13  1/57  2/58  8/35  2/30  7/36     2
Hibbert, Roy           18:07    +6  4/28   2-4   0-0   0-0  4/23   0/8  1/24  1/26  2/18  1/15  2/20     4
Wallace, Jonathan      34:46    -4  2/49   1-4   0-1   0-0  5/42  5/17  0/49  2/49  0/32  0/27  1/31     1
Sapp, Jessie           40:00    +3 16/58   3-6   3-4   1-2 10/47  3/15  0/58  2/59  0/35  0/30  5/36     1
Macklin, Vernon        08:20    -8   0/7   0-0   0-0   0-0   0/8   0/3  0/12  1/12   0/6   0/5   1/6     1
Rivers, Jeremiah       05:14    +7   0/9   0-0   0-0   0-0   0/5   0/3   0/9  0/10   0/3   0/3   1/5     1
Ewing, Patrick         19:08    -2  6/28   1-1   1-2   1-2  3/20   1/8  0/29  1/28  0/15  1/12  1/12     1
TOTALS                 40:00          58 16-34  5-13 11-15    47 15/21  2/58 13/59 14/35  8/30 23/36    13
.                                        0.471 0.385 0.733       0.714 0.034 0.220 0.400 0.267 0.639
Villanova                Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
CLARK, Shane           35:38    +0  6/52   0-4   2-3   0-0  7/49  1/16  0/50  1/51  0/31  1/32  2/26     4
SUMPTER, Curtis        34:13    +3 15/50   3-6   1-5   6-6 11/46  1/13  0/50  4/50  2/31  2/29  6/29     4
SHERIDAN, Will         28:08    +0  5/42   2-8   0-0   1-2  8/34  2/12  1/39  0/37  2/24  3/22  3/17     1
REYNOLDS, Scottie      28:16    -2 18/42   3-8   4-6   0-0 14/45   4/8  2/45  1/46  0/25  1/31  3/21     4
NARDI, Mike            33:25    -6  2/44   1-4   0-2   0-0  6/44  2/14  0/48  2/47  1/29  0/31  1/25     0
REDDING, Reggie        16:23    -1  6/20   1-1   0-1   4-6  2/16   0/4  2/25  0/24  0/10  0/13  2/12     2
CUNNINGHAM, Dante      23:57    -9  3/25   1-4   0-0   1-2  4/26   1/5  1/38  1/35  0/20  3/22  2/20     3
TOTALS                 40:00          55 11-35  7-17 12-16    52 11/18  6/59 10/58  5/34 13/36 22/30    18
.                                        0.314 0.412 0.750       0.611 0.102 0.172 0.147 0.361 0.733
Efficiency: Georgetown 0.983, Villanova 0.948
eFG%: Georgetown 0.500, Villanova 0.413
Substitutions: Georgetown 34, Villanova 34
2-pt Shot Selection:
Dunks: Georgetown 5-5, Villanova 0-0
Layups/Tips: Georgetown 5-14, Villanova 5-13
Jumpers: Georgetown 6-15, Villanova 6-22
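As a quick sanity check on the summary lines above, here is a small Python sketch that recomputes efficiency and eFG% from the TOTALS rows. The function names are mine, and two things are assumptions: that the possession counts are the denominators in the TO column of the TOTALS rows (59 for Georgetown, 58 for Villanova), and that eFG% uses the standard formula (FGM + 0.5 x 3PM) / FGA. Both assumptions reproduce the figures posted for this game.

```python
def efficiency(points, possessions):
    """Points scored per offensive possession."""
    return points / possessions

def efg_pct(two_made, three_made, fga):
    """Effective FG%: a made three counts as 1.5 made field goals."""
    return (two_made + 1.5 * three_made) / fga

# Georgetown: 58 points on 59 possessions; 16 twos and 5 threes on 47 FGA.
gu_eff = round(efficiency(58, 59), 3)   # 0.983
gu_efg = round(efg_pct(16, 5, 47), 3)   # 0.5

# Villanova: 55 points on 58 possessions; 11 twos and 7 threes on 52 FGA.
vu_eff = round(efficiency(55, 58), 3)   # 0.948
vu_efg = round(efg_pct(11, 7, 52), 3)   # 0.413
```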
Edited on 10/04 with version 1.0 of HD box score maker
I've "improved" my possession counter now, and I believe it is working correctly (at least, it's giving more reasonable results). I was able to find out why both KenPom and I were getting lousy possession results for this game - the Villanova play-by-play didn't note turnovers on offensive fouls. I think that this is normally done, as it appears that KenPom's loader also assumes this will be the case. So, I had to correct the p-b-p by hand, which isn't a lot of fun, but it runs much better now.
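For illustration only, the kind of hand correction described above could be automated along these lines. This is a sketch, not what my loader does: the event dicts and their `team` and `type` values are hypothetical, and it simply assumes that a properly logged turnover sits immediately before or after the offensive foul in the play-by-play.

```python
def add_missing_offensive_foul_turnovers(events):
    """Return a copy of `events` with a turnover added after any
    offensive foul that has no adjacent turnover by the same team.

    `events` is a hypothetical list of dicts: {"team": str, "type": str}.
    """
    fixed = []
    for i, ev in enumerate(events):
        fixed.append(ev)
        if ev["type"] != "offensive_foul":
            continue
        # Look at the event just before and just after the foul.
        neighbors = events[max(i - 1, 0):i] + events[i + 1:i + 2]
        has_turnover = any(
            n["type"] == "turnover" and n["team"] == ev["team"]
            for n in neighbors
        )
        if not has_turnover:
            # Flag the inserted event so it can be audited later.
            fixed.append({"team": ev["team"], "type": "turnover", "inferred": True})
    return fixed

pbp = [
    {"team": "VU", "type": "offensive_foul"},   # turnover missing from the log
    {"team": "GU", "type": "fgm"},
    {"team": "GU", "type": "offensive_foul"},   # turnover logged normally
    {"team": "GU", "type": "turnover"},
]
fixed_pbp = add_missing_offensive_foul_turnovers(pbp)
```

Marking the inserted events as inferred keeps them separable from the official record, which matters because the official totals will not include them.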
What this means is that the official box score (from which the TOTALS lines are derived) no longer matches up with the actual sum of individual player turnovers:
Example #2 (KenPom's version linked here)
Marquette vs Georgetown
02/10/07 noon at Verizon Center, Washington, D.C.
Final score: Georgetown 76, Marquette 58
Marquette                Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
JAMES, Dominic         33:41   -18  6/47  2-11   0-6   2-2 17/53  3/15  0/55  2/55  0/35  0/36  4/28     1
MCNEAL, Jerel          25:47   -19 11/36   4-8   1-5   0-0 13/39   1/9  1/41  3/41  0/23  1/25  2/17     5
MATTHEWS, Wesley       19:15    -2  2/26   1-2   0-1   0-0  3/29  1/10  0/29  2/29  0/19  0/18  2/21     2
HAYWARD, Lazar         17:20   -13 14/24   3-6   1-1   5-5  7/33   0/3  0/32  0/31  0/17  3/26  1/13     1
BARRO, Ousmane         30:10   -11 14/49   5-8   0-0   4-4  8/50  2/13  2/50  1/50  1/30  7/32  5/24     5
FITZGERALD, Dan        27:01   -11  7/38   2-2   1-2   0-0  4/38  1/10  0/41  1/42  1/30  1/25  2/24     2
CUBILLAN, David        26:30   -10  3/40   0-2   1-6   0-0  8/40  1/13  0/40  0/41  0/26  0/26  2/22     1
BURKE, Dwight          07:58    -1  1/16   0-1   0-0   1-2  1/13   0/5  0/11  0/12   0/4   2/8   0/5     0
BLACKLEDGE, Lawrence   03:51    +0   0/7   0-0   0-0   0-0   0/6   0/3   0/5   0/4   0/3   0/3   0/2     1
KINSELLA, Mike         08:27    -5   0/7   0-1   0-0   0-0   1/9   0/3  0/11  0/10   1/8   0/6   2/9     3
TOTALS                 40:00          58 17-41  4-21 12-13    62  9/21  3/63 10/63  3/39 15/41 20/33    21
.                                        0.415 0.190 0.923       0.429 0.048 0.159 0.077 0.366 0.606
Georgetown               Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
Wallace, Jonathan      35:30   +22 10/71   2-5   1-4   3-4  9/46  6/22  1/58  1/57  0/36  1/25  5/38     2
Summers, DaJuan        32:49   +20 10/62   2-4   2-3   0-1  7/42  2/16  0/53  1/52  1/33  1/26  4/36     3
Sapp, Jessie           31:41   +15  9/63   2-3   1-3   2-2  6/41  2/18  1/49  2/51  0/34  0/22  3/35     3
Green, Jeff            37:57   +20 24/76  7-11   2-4   4-5 15/54  4/17  0/60  1/61  1/40  2/32  3/39     1
Hibbert, Roy           35:01   +25 23/71  7-12   0-0  9-11 12/45  0/17  1/55  4/55  3/33  4/25  7/37     3
Macklin, Vernon        04:59    -7   0/5   0-1   0-0   0-0  1/10   0/2   0/8   0/8   0/8   1/8   0/4     0
Rivers, Jeremiah       08:26    -4   0/7   0-2   0-0   0-0  2/16   0/3  0/13  0/13   0/9  0/15   1/7     0
Crawford, Tyler        04:23    +3  0/11   0-0   0-1   0-0   1/7   0/3   0/6   0/5   0/3   1/4   0/2     1
Ewing, Patrick         09:18    -4  0/14   0-1   0-1   0-0  2/14   1/6  0/13  1/13   1/9   2/8   1/7     0
TOTALS                 40:00          76 20-39  6-16 18-23    55 15/26  3/63 10/63  6/41 13/33 26/41    13
.                                        0.513 0.375 0.783       0.577 0.048 0.159 0.146 0.394 0.634
Efficiency: Georgetown 1.206, Marquette 0.921
eFG%: Georgetown 0.527, Marquette 0.371
Substitutions: Georgetown 17, Marquette 29
2-pt Shot Selection:
Dunks: Georgetown 3-3, Marquette 2-2
Layups/Tips: Georgetown 9-18, Marquette 9-23
Jumpers: Georgetown 8-18, Marquette 6-16
Edited on 10/4 with version 1.0 of HD box score maker
Okay, so it seems to work regardless of whether G'town is the home or road team. You may have also noticed that some of the ancillary information that KenPom includes with his HD box is not in mine. I can track everything that he does, and already do for most stats, but I just haven't gotten around to including those items in the box score output script yet.
Finally, I'll take a crack at a non-G'town game:
Example #3 (KenPom's version linked here)
Villanova vs Marquette
02/19/07 6 p.m. at Milwaukee, Wis. (Bradley Center)
Final score: Marquette 80, Villanova 67
Villanova                Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
Clark, Shane           31:56    -5  5/59   1-3   1-3   0-0  6/42  2/14  2/57  2/58  1/24  3/31  2/24     5
Sumpter, Curtis        27:29    -1 14/53   3-6   0-2  8-11  8/34  2/10  0/43  2/48  0/22  3/25  2/20     4
Sheridan, Will         19:35   -10  4/25   1-4   0-0   2-3  4/26   0/8  1/36  2/32  2/16  1/20  2/19     3
Reynolds, Scottie      35:48    -6 25/61   2-5  5-10   6-7 15/42  2/10  2/59  5/64  0/26  0/31  1/26     4
Nardi, Mike            21:56    -2 10/44   1-3   2-3   2-2  6/34   1/9  1/37  2/40  0/15  0/23  3/14     2
Benn, Bilal            13:19   -19  1/10   0-0   0-0   1-2   0/9   0/2  0/23  1/19  0/10  2/10  1/10     3
Redding, Reggie        29:09   -11  0/48   0-1   0-3   0-1  4/32  1/12  2/48  2/48  0/19  1/26  3/20     2
Cunningham, Dante      20:36   -14  8/32   2-5   0-0   4-6  5/20   1/6  1/32  1/30  0/18  1/14  1/12     4
Drummond, Casiem       00:12    +3   0/3   0-0   0-0   0-0   0/1   0/1   0/0   0/1   0/0   0/0   0/0     0
TOTALS                 40:00          67 10-27  8-21 23-32    48  9/18  9/67 18/68  3/30 15/34 16/29    27
.                                        0.370 0.381 0.719       0.500 0.134 0.265 0.100 0.441 0.552
Marquette                Min   +/-   Pts 2PM-A 3PM-A FTM-A   FGA     A   Stl    TO   Blk    OR    DR    PF
HAYWARD, Lazar         28:04   +11 18/53   3-4   2-3   6-8  7/39  0/11  0/46  1/47  0/22  2/24  2/24     5
BARRO, Ousmane         26:02    +8  2/53   1-1   0-0   0-0  1/32  0/14  1/43  1/42  3/21  2/17  2/26     4
JAMES, Dominic         35:42   +13 18/72   3-6   2-7   6-6 13/42  5/13  2/62  4/62  1/25  1/25  2/33     4
McNEAL, Jerel          31:28    +4 15/61  5-13   1-2   2-2 15/42  3/12  1/54  3/53  0/19  2/24  5/26     5
MATTHEWS, Wesley       27:10    +9 16/59   2-4   0-1 12-12  5/33  2/12  3/47  3/45  0/20  3/20  4/25     1
FITZGERALD, Dan        08:59    -2  3/17   0-1   0-0   3-3   1/9   0/4  0/16  1/13   0/5   0/5   0/8     2
CUBILLAN, David        28:37   +17  8/58   0-0   2-6   2-2  6/31  1/12  0/47  0/48  0/17  0/18  1/28     2
BURKE, Dwight          08:56    +4  0/19   0-1   0-0   0-2  1/11   0/4  0/15  0/14   0/1   2/8   1/5     4
KINSELLA, Mike         02:52    +4   0/6   0-0   0-0   0-0   0/3   0/1   0/7   1/7   0/4   0/2   1/4     0
LOTT, Jamil            02:10    -3   0/2   0-0   0-0   0-0   0/3   0/1   0/3   0/4   0/1   0/2   0/1     0
TOTALS                 40:00          80 14-30  7-19 31-35    49 11/21  7/68 14/67  4/27 13/29 19/34    27
.                                        0.467 0.368 0.886       0.524 0.103 0.209 0.148 0.448 0.559
Efficiency: Marquette 1.194, Villanova 0.985
eFG%: Marquette 0.500, Villanova 0.458
Substitutions: Marquette 26, Villanova 56
2-pt Shot Selection:
Dunks: Marquette 0-0, Villanova 0-0
Layups/Tips: Marquette 12-18, Villanova 5-8
Jumpers: Marquette 2-12, Villanova 5-19
Edited on 10/4 with version 1.0 of HD box score maker
I haven't even looked at KenPom's linked HD box for this game, so I have no idea yet how well I did. Nothing like going to press with a half-debugged software package! You may have noticed a certain theme in the other Big East teams published here, which is mostly because there are bloggers for these teams (MU, VU) who are outed stat-heads, and I thought they might find this interesting.
A few points to wrap up
- No matter how well I write my stats loader, there will be mistakes in the HD box scores I generate, because the play-by-play data itself has mistakes. So far, I find about 3 to 4 substitution errors per game I've looked at. Some can be intuitively corrected just by looking at the play-by-play, but some would require a tape of the game to puzzle out, and that's not something that I have available for every game. This problem will also be true for KenPom's HD boxes.
- I'll need a few more weeks to iron out the rest of the kinks in the loader, but hope to have it ready to go for this season. I will definitely provide an HD box for every G'town game that has a good play-by-play recap available (all but 3 regular season games last year had good pbp). In addition, I will try to publish HD boxes for all Big East conference games. I plan to run all conference games through my data loader to collect a database of Big East stats, so I should be able to publish the HD boxes as well. The only exceptions may be for St. John's home games (no play-by-play was available last year) and Rutgers home games (play-by-play lacked subbing info last season). I may decide to abandon this if the workload is too much, or if this blog becomes overly cluttered with non-Hoya stuff. However, if you are a fan of another Big East team and could make use of these HD boxes, please leave me a note here or by e-mail, so I'll know if this would be a useful service to the community.
- Finally (for now), I'm actually much more interested in the underlying data, rather than just pumping out lots of fancy box scores. I hope to go back through the past 2 Hoya seasons (I may post HD boxes for games not available on KenPom.com, backdated) to build up a bit of a database, to answer some basic questions. Here are a few, off the top of my head:
- Just how valuable are offensive/defensive rebounds? I should be able to look at offensive efficiency before/after an offensive rebound.
- Was J. Rivers really that great of a defender? I'll look at the team's offensive and defensive efficiencies with each player on or off the court, to see if I can learn a bit more about the defensive side of things.
- How fast is Chris Wright, and how deliberate was Roy Hibbert? I can also isolate time per possession with each player on or off the court, to see if certain players have a greater effect on controlling pace.
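The on/off questions above could be approached with something like the following Python sketch. It assumes the loader can emit one record per possession with hypothetical keys `side` ("off" when the team of interest has the ball, "def" otherwise), `points` scored on that possession, and `lineup` (the set of that team's players on the floor). None of these names come from my actual code.

```python
def on_off_efficiency(possessions, player):
    """Points per possession, split by side of the ball and by whether
    `player` was on the court. Keys of the result are (side, on_court)."""
    totals = {(side, on): [0, 0] for side in ("off", "def") for on in (True, False)}
    for poss in possessions:
        key = (poss["side"], player in poss["lineup"])
        totals[key][0] += poss["points"]   # points scored on this possession
        totals[key][1] += 1                # possession count
    # None marks a split with no possessions, to avoid dividing by zero.
    return {k: (pts / n if n else None) for k, (pts, n) in totals.items()}

sample_possessions = [
    {"side": "off", "points": 2, "lineup": {"Rivers", "Green"}},
    {"side": "off", "points": 0, "lineup": {"Rivers", "Green"}},
    {"side": "off", "points": 3, "lineup": {"Wallace", "Green"}},
    {"side": "def", "points": 2, "lineup": {"Rivers", "Green"}},
    {"side": "def", "points": 0, "lineup": {"Wallace", "Green"}},
]
rivers = on_off_efficiency(sample_possessions, "Rivers")
```

Adding a per-possession duration field and averaging it the same way would give the time-per-possession splits for the pace question.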
Well, that's all for tonight. I've set quite an ambitious agenda for myself this year, but I hope I can live up to what I'd like to get done. Of course, now that I've said all this, I'm sure either KenPom or Basketball Prospectus will do this as well, rendering moot a lot of what I plan.
Nice work, CO. Over at Hoya Hoops, we're slowly starting to work on improving our box scores too - not at all with the intensity that you're going for though, so keep up the good work.
Just wanted to let you know that we have our Google Calendar schedule up again, for whoever wants it:
The post
Woooo!!!
Well, this is terrific news. I was sort of dreading doing the +/- stats by hand again this year.
And yes, I would agree with your estimate of 3 or 4 errors per game in the play-by-play data. Usually you can figure out what actually happened from the following action, but sometimes it is too garbled to make any sense at all.
Great, great work.
LOL.
At least a part of my motivation for this project was the guilt I had, envisioning you staying up until the wee hours with a pad and a pencil . . .
Wow. If you had to make just one post in September, this was definitely the one.
I am very interested. If you want to split the workload or anything let me know. I have also had problems with box scores from different sources and decided to use those from the team websites as of the 2006-07 season. I noticed as I was putting together the stats-by-half last season (especially when I went back over the play-by-play), that the possessions were not consistent with whole-game calculations. I will look for the TO/foul error in the 2/17/07 game. Terrific work.
I am a little surprised that Henry Sugar has not been over to see. I'll shoot him an email.