Saturday, October 4, 2008

Further update on HD box scores

I've spent some more time debugging my data loader, and I think it is now working.
  • I've updated the HD boxes in the previous post, as run through my HD Box Score Maker™ Version 1.0.
    • I was wrong about there being 4 offensive fouls in the Villanova game; now it looks like just 2 (Reynolds and Sumpter).
    • I am now generating full headers and footers for each box score. If you'd like to see something else added, let me know.
    • Totals are now calculated from the play-by-play rather than the box score. Expect some differences from the official box score.

  • I have begun to post HD boxes retroactively. I've added all games from Nov. & Dec. 2006 except for the win at Vanderbilt (11/15), since the play-by-play for that game had no substitution data. I hope to have all available games from the 2006-7 and 2007-8 seasons added before the start of this season.
  • Having run about 15 or 20 games so far, I'm slowly resigning myself to the fact that both the play-by-play and the official box score contain mistakes on occasion, and these could only be corrected by video review (which I'm not likely to do). I suspect that KenPom abandoned his attempt to provide these HD boxes because of this. However, so long as the mistakes are few enough, I will stick to my pledge to run all Big East conf. games. I promise not to give up until at least December 30th.

5 comments:

  1. http://bleacherreport.com/articles/65472-why-you-wont-see-georgetown-in-my-top-25

    Your thoughts?

  2. Frankly, I haven't given much thought to where Georgetown fits in the pre-season ranking hierarchy. It requires paying enough attention to 30-40 other teams, and I barely know what is going on with other teams in the Big East.

    I will say that the last few paragraphs have some interesting logic. As I understand it:

    1. 2009 G'town reminds the author of 2008 Syracuse.
    2. 2008 Syracuse started the season with 9 scholarship players.
    3. 2008 Syracuse was almost(?) ranked early in the season.
    4. 2008 Syracuse lost two guards to injury, and fell out of any chance at being ranked.
    5. Therefore, 2009 G'town will lose players to injury and fall out of the rankings.

    As a general statement about roster depth, Georgetown should go 8 deep this season (Sapp, Summers, Freeman, Wright, Monroe, Sims, Clark, Vaughn). The 2007 Final 4 team went 8 deep (Green, Sapp, Wallace, Summers, Hibbert, Ewing, Rivers, Crawford/Macklin [both rarely used]).

    An important difference between the Hoyas and Orangemen is that the Cuse played at 72 poss/game last year, while G'town played at 62 poss/game. I'd suggest that the slower pace benefits a team with a shorter bench.


    Mostly though, I'll point out that the author is just another Syracuse grad with an axe to grind (op. cit. this).

  3. I generally agree with your points. It does worry me, though, that with the team being so young, GU will end up like SU if it suffers injuries or suspensions.

    The tempo point is a good point.

  4. Great stuff. I'm trying to find a good way to create HD boxes for my Gators, so I feel your pain with respect to play-by-play errors. FYI, Pomeroy still publishes his own HD boxes.

  5. Hi, and thanks for dropping in!

    I've got my HD box score maker most of the way there right now - I've added a number of error checks when I hit a problem. As I said somewhere, I think KenPom mostly dropped HD Boxes because of all the pbp errors.

    For minor errors (e.g. player listed in pbp is not supposed to be on the court) the code just flags the problem and continues. For major errors (e.g. number of team's players on court doesn't equal five), the code forces an abort - usually I can puzzle those out quickly.

    Unfortunately, G'town's home play-by-play is probably in the lower half of quality (graded by number of mistakes per game), so I deal with problems quite a bit, but I've gotten better at figuring out what actually happened from clues in the p-b-p.

    Debugging will likely be a perpetual process - I just discovered that I wasn't tracking missed dunks correctly this weekend. But I'm also finding that you can extract a lot more data from the p-b-p than KenPom does with his HD Boxes; this is a project that has been sucking up a lot of my time, but will hopefully start showing returns soon (next weekend?).

    Cheers!

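
For anyone curious about the two-tier error handling described in the last comment (flag minor problems and continue, abort when a team doesn't have five players on court), here is a rough sketch of the idea in Python. This is not the actual HD Box Score Maker code: the function name, the `LineupError` exception, and the event format (a list of `(kind, player)` pairs for one team) are all assumptions made for illustration.

```python
class LineupError(Exception):
    """Major error: a team does not have exactly five players on court."""


def check_play_by_play(starters, events):
    """Walk one team's events and return warnings for minor errors.

    starters: the five players on court at tip-off (hypothetical input).
    events:   list of (kind, player) tuples; kind is "sub_in", "sub_out",
              or any other string for an ordinary play (hypothetical format).
    """
    on_court = set(starters)
    warnings = []
    for index, (kind, player) in enumerate(events):
        if kind == "sub_in":
            on_court.add(player)
        elif kind == "sub_out":
            on_court.discard(player)
        else:
            # Only check the lineup on ordinary plays, so that a pair of
            # sub_out / sub_in lines does not trip the check halfway through.
            if len(on_court) != 5:
                raise LineupError(
                    f"event {index}: {len(on_court)} players on court"
                )
            if player not in on_court:
                # Minor error: note it and keep going.
                warnings.append(
                    f"event {index}: {player} credited with {kind} while off court"
                )
    return warnings


if __name__ == "__main__":
    starters = {"Sapp", "Summers", "Freeman", "Wright", "Monroe"}
    events = [
        ("shot", "Sapp"),
        ("sub_out", "Sapp"),
        ("sub_in", "Clark"),
        ("rebound", "Sapp"),  # Sapp is on the bench here: minor error
        ("shot", "Clark"),
    ]
    for warning in check_play_by_play(starters, events):
        print(warning)
```

Checking the player count only on ordinary plays is a design choice in this sketch: substitutions usually appear as separate "out" and "in" lines, so the lineup is briefly at four or six between them.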