TL;DR: and by “strange” I mean a combination of utter nonsense tests on top of the now-expected rigged test.
Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers. Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average (OAA) as a huge winner in the infield… and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield. This is all very curious. We’re going to answer three questions, in the following order:
- On their tests, why does OAA rule the infield while FRAA sucks?
- On their tests, why does FRAA rule the outfield while OAA sucks?
- On their test, why does FRAA come out ahead overall?
First, a summary of the two systems. OAA ratings try to completely strip out positioning: they’re only a measure of how well the player did, given where the ball was and where the player started. FRAA effectively treats all balls as having the same difficulty (after adjusting for park, handedness, etc.). It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.
A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs. In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged. It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable. An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.
So, in abstract terms: on a team level, **team OAA = range + conversion** and **team FRAA = team OAA + positioning-based difficulty relative to average**. On a player level, **player OAA = range + conversion** and **player FRAA = player OAA + positioning + teammate noise**.
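To make the teammate-noise effect concrete, here’s a toy simulation of the abstraction above (not either system’s actual math; the league out rate, catch probabilities, and function names are all invented for illustration). A fielder who converts every ball at exactly its true catch probability is 0 OAA by construction, but his FRAA swings wildly with how many uncatchable balls his pitchers allow:

```python
import random

random.seed(0)

LEAGUE_OUTS_PER_BIP = 0.70   # invented league-average out rate on balls in play

def simulate_season(n_bip, frac_uncatchable):
    """OAA-style and FRAA-style season totals for a perfectly average fielder."""
    oaa = fraa = 0.0
    for _ in range(n_bip):
        # Two kinds of balls: ropes nobody can reach, and routine makeable plays.
        catch_prob = 0.0 if random.random() < frac_uncatchable else 0.90
        made_out = random.random() < catch_prob
        oaa += made_out - catch_prob             # credit vs. this play's difficulty
        fraa += made_out - LEAGUE_OUTS_PER_BIP   # credit vs. league average
    return oaa, fraa

# Same fielder, two pitching staffs: one average-ish, one allowing lots of ropes.
for frac in (0.20, 0.40):
    oaa, fraa = simulate_season(5000, frac)
    print(f"uncatchable={frac:.0%}  OAA={oaa:+7.1f}  FRAA={fraa:+7.1f}")
```

The fielder’s own play never changes between the two runs; only his teammates’ and pitchers’ opportunities do, and only FRAA moves.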
Now, their methodology. It is very strange, and I tweeted at them to make sure they meant what they wrote. They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did. For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out. That may not sound crazy at first blush, but…
> …using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.
Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED. A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield). Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense. They’re tests of something else entirely, and that’s why they get the results that they do.
Using the bolded abstraction above, this is only a test of conversion. Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST. Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks. UZR, which strips out some of the positioning noise based on ball location, comes out in the middle. The infield turned out to be pretty easy to explain.
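Here’s a hypothetical sketch of that noise argument (all magnitudes invented; statistics.correlation needs Python 3.10+). Each season rating is modeled as true conversion skill plus the metric-specific noise terms above, and the rating with less extraneous noise correlates better with out rates on fielded balls:

```python
import random
import statistics

random.seed(1)

# On balls a fielder actually reached, out-or-not is driven mostly by
# conversion skill; each rating is that skill plus its extra noise terms.
oaa_list, fraa_list, out_rates = [], [], []
for _ in range(300):
    conversion  = random.gauss(0, 1)   # true conversion skill
    rng_noise   = random.gauss(0, 1)   # range (noise for this test)
    positioning = random.gauss(0, 1)   # positioning (more noise)
    teammates   = random.gauss(0, 1)   # teammate effects (still more noise)
    oaa_list.append(conversion + rng_noise)
    fraa_list.append(conversion + rng_noise + positioning + teammates)
    out_rates.append(conversion + random.gauss(0, 0.5))   # the test's target

print("OAA  vs out rate:", round(statistics.correlation(oaa_list, out_rates), 2))
print("FRAA vs out rate:", round(statistics.correlation(fraa_list, out_rates), 2))
```

The extra terms aren’t wrong as measures of defense; they’re just irrelevant to a conversion-only target, so all they can do is dilute the correlation.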
The outfield is a bit trickier. Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense. Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.
So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF). OAA has range and conversion. FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier. FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.
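Flipping the same toy model around reproduces the outfield result (again, every magnitude invented): positioning and leaked ground balls are now part of what the test is predicting, so FRAA’s extra terms switch from noise to signal:

```python
import random
import statistics

random.seed(2)

oaa_list, fraa_list, of_targets = [], [], []
for _ in range(300):
    conversion  = random.gauss(0, 1)
    rng         = random.gauss(0, 1)
    positioning = random.gauss(0, 1)
    gb_leakage  = random.gauss(0, 1)   # grounders rolling through to the OF
    oaa_list.append(conversion + rng)
    fraa_list.append(conversion + rng + positioning + gb_leakage)
    # The OF test's target includes range, conversion, positioning, AND leaked
    # grounders, per the decomposition above.
    of_targets.append(conversion + rng + positioning + gb_leakage
                      + random.gauss(0, 0.5))

print("OAA  vs OF target:", round(statistics.correlation(oaa_list, of_targets), 2))
print("FRAA vs OF target:", round(statistics.correlation(fraa_list, of_targets), 2))
```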
Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9). If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit? Clearly not, by the common usage of the word. Who would it be bad defense by? Nobody could have caught it. Nobody should have been positioned there.
BP implicitly takes a different approach, one that *does* consider this bad defense:

> So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters. Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats. Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP. In fact, they’re specifically designed to do the exact opposite of that. The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings and, more generally, giving an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t: it treats EVERY BIP as average difficulty).
The second “questionable” decision is to test against outs/BIP. Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity. The latter term can be further broken down: difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100 mph batted balls is going to make it harder for his defense to get outs, etc.) + luck. In aggregate:
outs/BIP =
player performance given the difficulty of the opportunity (OAA) +
smart/dumb fielder positioning (a front-office/manager skill in 2019) +
quality of contact allowed (a batter/pitcher skill) +
luck (not a skill).
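A hypothetical numeric version of that decomposition (league baseline and every delta invented): two teams with identical fielder skill land on very different outs/BIP.

```python
# Seasonal outs/BIP decomposition for two teams whose fielders have
# IDENTICAL skill; only the non-fielder terms differ.
LEAGUE_OUTS_PER_BIP = 0.690

teams = {
    "Team A": {"fielder skill": 0.000, "positioning": +0.010,
               "contact quality": +0.008, "luck": +0.004},
    "Team B": {"fielder skill": 0.000, "positioning": -0.006,
               "contact quality": -0.012, "luck": -0.003},
}

for name, terms in teams.items():
    outs_per_bip = LEAGUE_OUTS_PER_BIP + sum(terms.values())
    print(f"{name}: outs/BIP = {outs_per_bip:.3f}")
```

Identical fielders, a 43-point gap in outs/BIP.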
That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*. It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.
The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP. Because observed OAA is skill + luck, it benefits from “knowing” the luck in the plays it’s trying to predict; in this case, luck is whether a fielder converted plays at, above, or below his true skill level. 2019 FRAA has all of the difficulty-of-opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.
All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better. That’s why FRAA “wins” handily. One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.
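To see how much work the in-sample setup does, one last hypothetical sketch (structure and magnitudes invented; Python 3.10+): each rating is built from the 2019 plays and then graded on “predicting” those same plays. The rating that bakes in more play-level luck wins the test while being worse at measuring actual skill:

```python
import random
import statistics

random.seed(3)

oaa_list, fraa_list, outcomes, skills = [], [], [], []
for _ in range(300):
    skill     = random.gauss(0, 1)
    conv_luck = random.gauss(0, 1)   # conversion luck; in both observed ratings
    opp_luck  = random.gauss(0, 1)   # difficulty-of-opportunity luck
    outcome = skill + conv_luck + opp_luck    # 2019 outs/BIP vs. average, roughly
    oaa_list.append(skill + conv_luck)        # strips opportunity luck back out
    fraa_list.append(outcome)                 # keeps ALL the luck baked in
    outcomes.append(outcome)
    skills.append(skill)

c = statistics.correlation
print("in-sample 'prediction' of 2019 outs/BIP:")
print("  OAA-like :", round(c(oaa_list, outcomes), 2))
print("  FRAA-like:", round(c(fraa_list, outcomes), 2))   # wins by memorizing luck
print("correlation with true skill:")
print("  OAA-like :", round(c(oaa_list, skills), 2))
print("  FRAA-like:", round(c(fraa_list, skills), 2))     # yet worse at skill
```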