Don’t use FRAA for outfielders

TL;DR OAA is far better, as expected.  Read after the break for next-season OAA prediction/commentary.

As a followup to my previous piece on defensive metrics, I decided to retest the metrics using a sane definition of opportunity. BP's study defined a defensive opportunity as any ball fielded by an outfielder, which includes completely uncatchable balls as well as ground balls that made it through the infield. The latter are absolute nonsense, and the former are pretty worthless. Thanks to Statcast, a better definition of defensive opportunity is available: any ball that Statcast gives a nonzero catch probability and assigns to an outfielder. Because Statcast doesn't provide catch probability/OAA on individual plays, we'll be testing each outfielder in aggregate.

Similarly to what BP tried to do, we’re going to try to describe or predict each OF’s outs/opportunity, and we’re testing the 354 qualified OF player-seasons from 2016-2019.  Our contestants are Statcast’s OAA/opportunity, UZR/opportunity, FRAA/BIP (what BP used in their article), simple average catch probability (with no idea if the play was made or not), and positional adjustment (effectively the share of innings in CF, corner OF, or 1B/DH).  Because we’re comparing all outfielders to each other, and UZR and FRAA compare each position separately, those two received the positional adjustment (they grade quite a bit worse without it, as expected).

Using data from THE SAME SEASON (see previous post if it isn’t obvious why this is a bad idea) to describe that SAME SEASON’s outs/opportunity, which is what BP was testing, we get the following correlations:

Metric | r^2 to same-season outs/opportunity
OAA/opp | 0.74
UZR/opp | 0.49
Catch Probability + Position | 0.43
FRAA/BIP | 0.34
Catch Probability | 0.32
Positional adjustment/opp | 0.25


OAA wins running away, UZR is a clear second, background information is 3rd, and FRAA is a distant 4th, barely ahead of raw catch probability.  And catch probability shouldn’t be that important.  It’s almost independent of OAA (r=0.06) and explains much less of the outs/opp variance.  Performance on opportunities is a much bigger driver than difficulty of opportunities over the course of a season.  I ran the same test on the 3 OF positions individually (using Statcast’s definition of primary position for that season), and the numbers bounced a little, but it’s the same rank order and similar magnitude of differences.
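For anyone who wants to replicate the shape of this descriptive test, here's a minimal sketch. The table and column names below are placeholders I made up for illustration, not anything Statcast or BP publishes:

```python
import pandas as pd

# Hypothetical table: one row per qualified OF player-season, 2016-2019.
# Column names are assumptions for illustration:
#   outs_per_opp    - outs / Statcast opportunities (nonzero catch probability, OF-assigned)
#   oaa_per_opp     - Statcast OAA / opportunity
#   uzr_per_opp     - UZR / opportunity (positional adjustment added)
#   fraa_per_bip    - FRAA / BIP (positional adjustment added)
#   avg_catch_prob  - average catch probability of the player's opportunities
#   pos_adj_per_opp - positional adjustment / opportunity
df = pd.read_csv("of_player_seasons.csv")  # assumed file

metrics = ["oaa_per_opp", "uzr_per_opp", "fraa_per_bip",
           "avg_catch_prob", "pos_adj_per_opp"]

# r^2 of each same-season metric to same-season outs/opportunity
for m in metrics:
    r = df[m].corr(df["outs_per_opp"])
    print(f"{m}: r^2 = {r ** 2:.2f}")
```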

Attempting to describe same-season OAA/opp gives the following:

Metric | r^2 to same-season OAA/opportunity
OAA/opp | 1
UZR/opp | 0.5
FRAA/BIP | 0.32
Positional adjustment/opp | 0.17
Catch Probability | 0.004

As expected, catch probability drops way off.  CF opportunities are on average about 1% harder than corner OF opportunities.  Positional adjustment is obviously a skill correlate (Full-time CF > CF/corner tweeners > Full-time corner > corner/1B-DH tweeners), but it’s a little interesting that it drops off compared to same-season outs/opportunity.  It’s reasonably correlated to catch probability, which is good for describing outs/opp and useless for describing OAA/opp, so I’m guessing that’s most of the decline.


Now, on to the more interesting things. Using one season's metric to predict the NEXT season's OAA/opportunity (both seasons must be qualified), which leaves 174 paired seasons, gives us the following (players who dropped out were, in aggregate, almost average defensively):

Metric | r^2 to next-season OAA/opportunity
OAA/opp | 0.45
FRAA/BIP | 0.27
UZR/opp | 0.25
Positional adjustment | 0.1
Catch Probability | 0.02

FRAA notably doesn’t suck here- although unless you’re a modern-day Wintermute who is forbidden to know OAA, just use OAA of course.  Looking at the residuals from previous-season OAA, UZR is useless, but FRAA and positional adjustment contain a little information, and by a little I mean enough together to get the r^2 up to 0.47.  We’ve discussed positional adjustment already and that makes sense, but FRAA appears to know a little something that OAA doesn’t, and it’s the same story for predicting next-season outs/opp as well.
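One way to run that residual check is sketched below, again with placeholder column names: fit next-season OAA/opp on previous-season OAA/opp alone, then add FRAA/BIP and positional adjustment and see how much the r^2 moves.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical table: one row per back-to-back pair of qualified seasons.
# Columns (illustrative names): oaa_per_opp, fraa_per_bip, pos_adj, next_oaa_per_opp
pairs = pd.read_csv("of_paired_seasons.csv")  # assumed file

# Previous-season OAA/opp alone
base = sm.OLS(pairs["next_oaa_per_opp"],
              sm.add_constant(pairs[["oaa_per_opp"]])).fit()
print("OAA alone:", round(base.rsquared, 2))              # ~0.45 above

# Add FRAA/BIP and positional adjustment on top of OAA
full = sm.OLS(pairs["next_oaa_per_opp"],
              sm.add_constant(pairs[["oaa_per_opp", "fraa_per_bip", "pos_adj"]])).fit()
print("OAA + FRAA + position:", round(full.rsquared, 2))  # ~0.47 above
```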

That’s actually interesting.  If the crew at BP had discovered that and spent time investigating the causes, instead of spending time coming up with ways to bullshit everybody that a metric that treats a ground ball to first as a missed play for the left fielder really does outperform Statcast, we might have all learned something useful.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it's by far the most primitive defensive stat of the bunch for non-catchers. Furthermore, they graded FRAA as a huge winner in the outfield and Statcast's Outs Above Average as a huge winner in the infield… and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield. This is all very curious. We're going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: on a team level, team OAA = range + conversion, and team FRAA = team OAA + positioning-based difficulty relative to average. On a player level, player OAA = range + conversion, and player FRAA = player OAA + positioning + teammate noise.

Now, their methodology. It is very strange, and I tweeted at them to make sure they meant what they wrote. They didn't reply, but it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we're just going to assume this is what they actually did. For the infield and outfield tests, they're using the season-long rating each system gave a player to predict whether or not a play resulted in an out. That may not sound crazy at first blush, but…

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the abstraction above, this is only a test of conversion. Every play that the player didn't/couldn't field IS NOT INCLUDED IN THE TEST. Since OAA adds the "noise" of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks. UZR, which strips out some of the positioning noise based on ball location, comes out in the middle. The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.


Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense. It's kind of amazing (and by amazing I mean not the least bit surprising at this point) that every "questionable" definition and test is always for the benefit of one of BP's stats. Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP. In fact, they're specifically designed to do the exact opposite of that. The analytical community has spent decades making sure that uncatchable balls don't negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system's estimate of the difficulty of the play (remember from earlier that FRAA doesn't- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:

outs/BIP =
player performance given the difficulty of the opportunity (OAA) +
smart/dumb fielder positioning (a front-office/manager skill in 2019) +
quality of contact allowed (a batter/pitcher skill) +
luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.
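As a rough sketch of what that query would look like, assuming a hypothetical play-level table that already carries a per-ball difficulty estimate (catch probability is used here as the stand-in, and every name below is illustrative):

```python
import pandas as pd

# Hypothetical play-level table: one row per BIP while a given fielder was on the field.
# Columns (illustrative names): fielder_id, season, catch_prob
plays = pd.read_csv("bip_difficulty.csv")  # assumed file

# Average difficulty of opportunity per player-season: the context term these
# systems already compute for every BIP but deliberately keep out of the rating.
difficulty = (plays.groupby(["fielder_id", "season"])["catch_prob"]
                    .mean()
                    .rename("avg_opportunity_difficulty"))
print(difficulty.sort_values().head())
```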

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill+luck, it benefits from “knowing” the luck in the plays it’s trying to predict.  In this case, luck being whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.

Trust the barrels

Inspired by the curious case of Harrison Bader

[figure: baderbarrels]

whose average exit velocity is horrific, hard-hit% is average, and barrel/contact% is great (not shown, but a little better than the xwOBA marker), I decided to look at which of these metrics was most predictive. Barrels are significantly more descriptive of current-season wOBAcon (wOBA on batted balls/contact), and average exit velocity is sketchy because the returns on harder-hit balls are strongly nonlinear. The game rewards hitting the crap out of the ball, and one rocket and one trash ball come out a lot better than two average balls.

Using consecutive seasons with at least 150 batted balls (there's some survivor bias based on quality of contact, but it's pretty much even across all three measures), which gave 763 season pairs, barrel/contact% led the way with r=0.58 to next season's wOBAcon, followed by hard-hit% at r=0.53 and average exit velocity at r=0.49. That's not a huge win, but it is a win. Since these are three ways of measuring a similar thing (quality of contact), they're likely to be highly correlated, and we can do a little more work to figure out where the information lies.
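For reference, those paired-season correlations come from something like the sketch below (table and column names are again illustrative, not Statcast's):

```python
import pandas as pd

# Hypothetical table: one row per hitter with >=150 batted balls in both
# season N and season N+1 (763 pairs here).
# Columns (illustrative names): barrel_pct, hard_hit_pct, avg_ev, next_wobacon
pairs = pd.read_csv("hitter_paired_seasons.csv")  # assumed file

for m in ["barrel_pct", "hard_hit_pct", "avg_ev"]:
    r = pairs[m].corr(pairs["next_wobacon"])
    print(f"{m}: r = {r:.2f}")   # ~0.58 / 0.53 / 0.49 above
```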

[figure: evvehardhit]

I split the sample into tenths based on average exit velocity rank, and hard-hit% and average exit velocity track an almost perfectly straight line at the group (76-77 player) level. Barrels deviate from linearity pretty measurably at the outliers on either end, so I interpolated and extrapolated on the edges to get an "expected" barrel% based on the average exit velocity, and then I looked at how players who overperformed and underperformed their expected barrel% by more than 1 SD (of the barrel% residual) did with next season's wOBAcon.

Avg EV decile | >2.65% more barrels than expected | average-ish barrels | >2.65% fewer barrels than expected | whole group
0 | 0.362 | 0.334 | none | 0.338
1 | 0.416 | 0.356 | 0.334 | 0.360
2 | 0.390 | 0.377 | 0.357 | 0.376
3 | 0.405 | 0.386 | 0.375 | 0.388
4 | 0.389 | 0.383 | 0.380 | 0.384
5 | 0.403 | 0.389 | 0.374 | 0.389
6 | 0.443 | 0.396 | 0.367 | 0.402
7 | 0.434 | 0.396 | 0.373 | 0.401
8 | 0.430 | 0.410 | 0.373 | 0.405
9 | 0.494 | 0.428 | 0.419 | 0.441

That's… a gigantic effect. Knowing barrel/contact% provides a HUGE amount of information on top of average exit velocity going forward to the next season. I also looked at year-to-year changes in non-contact wOBA (K/BB/HBP) for these groups just to make sure, and it's pretty close to noise: no real trend and nothing close to this size.
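The expected-barrel% construction is simple enough to sketch. This assumes the same hypothetical paired-season table as above and approximates the edge handling with plain interpolation (the write-up above extrapolates at the extremes instead):

```python
import numpy as np
import pandas as pd

# Hypothetical table: avg_ev, barrel_pct, next_wobacon (names assumed)
df = pd.read_csv("hitter_paired_seasons.csv")

# Decile the sample by average exit velocity and take group means
df["ev_decile"] = pd.qcut(df["avg_ev"].rank(method="first"), 10, labels=False)
grp = df.groupby("ev_decile")[["avg_ev", "barrel_pct"]].mean()

# Expected barrel% from avg EV via interpolation between decile means
# (np.interp clamps at the edges rather than extrapolating)
df["exp_barrel"] = np.interp(df["avg_ev"], grp["avg_ev"], grp["barrel_pct"])
df["resid"] = df["barrel_pct"] - df["exp_barrel"]

# Split into over/under/average performers by 1 SD of the residual (~2.65% here)
sd = df["resid"].std()
df["group"] = np.where(df["resid"] > sd, "more barrels than expected",
              np.where(df["resid"] < -sd, "fewer barrels than expected",
                       "average-ish"))

print(df.groupby(["ev_decile", "group"])["next_wobacon"].mean().unstack())
```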

It’s also possible to look at this in the opposite direction- find the expected average exit velocity based on the barrel%, then look at players who hit the ball more than 1 SD (of the average EV residual) harder or softer than they “should” have and see how much that tells us.

Barrel% decile | >1.65 mph faster than expected | average-ish EV | >1.65 mph slower than expected | whole group
0 | 0.358 | 0.339 | 0.342 | 0.344
1 | 0.362 | 0.359 | 0.316 | 0.354
2 | 0.366 | 0.364 | 0.361 | 0.364
3 | 0.389 | 0.377 | 0.378 | 0.379
4 | 0.397 | 0.381 | 0.376 | 0.384
5 | 0.388 | 0.395 | 0.418 | 0.397
6 | 0.429 | 0.400 | 0.382 | 0.403
7 | 0.394 | 0.398 | 0.401 | 0.398
8 | 0.432 | 0.414 | 0.409 | 0.417
9 | 0.449 | 0.451 | 0.446 | 0.450


There’s still some information there, but while the average difference between the good and bad EV groups here is 12 points of next season’s wOBAcon, the average difference for good and bad barrel groups was 50 points.  Knowing barrels on top of average EV tells you a lot.  Knowing average EV on top of barrels tells you a little.

Back to Bader himself, a month of elite barreling doesn’t mean he’s going to keep smashing balls like Stanton or anything silly, and trying to project him based on contact quality so far is way beyond the scope of this post, but if you have to be high on one and low on the other, lots of barrels and a bad average EV is definitely the way to go, both for YTD and expected future production.

 

Mashers underperform xwOBA on air balls

Using the same grouping methodology as in "The Statcast GB speed adjustment seems to capture about 40% of the speed effect" below, except grouping by barrel% (barrels/batted balls), I got the following for air balls (FB, LD, popups):

barrel group | FB BA-xBA | FB wOBA-xwOBA | n
high-barrel% | 0.006 | -0.005 | 22993
avg | 0.006 | 0.010 | 22775
low-barrel% | -0.002 | 0.005 | 18422

These numbers get closer to the noise range (+/- 0.003), but mashers simultaneously OUTPERFORMING on BA while UNDERPERFORMING on wOBA while weak hitters do the opposite is a tough parlay to hit by chance alone because any positive BA event is a positive wOBA event as well.  The obvious explanation to me, which Tango is going with too, is that mashers just get played deeper in the OF, and that that alignment difference is the major driver of what we’ve each measured.

 

The Statcast GB speed adjustment seems to capture about 40% of the speed effect

Statcast recently rolled out an adjustment to its ground ball xwOBA model to account for batter speed, and I set out to test how well that adjustment was doing. I used 2018 data for players with at least 100 batted balls (n=390). To get a proxy for sprint speed, I used the average difference between the speed-unadjusted xwOBA and the speed-adjusted xwOBA on ground balls. Billy Hamilton graded out fast. Welington Castillo didn't. That's good. Grouping the players into thirds by their speed proxy, I got the following:

 

speed | Actual GB wOBA | basic xwOBA | speed-adjusted xwOBA | Actual - basic | Actual - (speed-adjusted) | n
slow | 0.215 | 0.226 | 0.215 | -0.011 | 0.000 | 14642
avg | 0.233 | 0.217 | 0.219 | 0.016 | 0.014 | 16481
fast | 0.247 | 0.208 | 0.218 | 0.039 | 0.029 | 18930

The slower players seem to hit the ball better on the ground according to basic xwOBA, but they still have worse actual outcomes. The fast players outperform the slow ones by 50 points of wOBA-xwOBA before the speed adjustment and by 29 points after it, so the adjustment only closes roughly 40% of the speed-related gap.
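A sketch of the whole procedure, with made-up table and column names (the 100-batted-ball cutoff described above counts all batted balls and is omitted here for brevity):

```python
import pandas as pd

# Hypothetical play-level table: one row per 2018 ground ball.
# Columns (illustrative names): batter_id, woba, xwoba_basic, xwoba_speed_adj
gb = pd.read_csv("ground_balls_2018.csv")  # assumed file

# Speed proxy per batter: how much the speed adjustment moves his GB xwOBA on average
per_batter = gb.groupby("batter_id").agg(
    adj=("xwoba_speed_adj", "mean"),
    basic=("xwoba_basic", "mean"),
)
per_batter["speed_proxy"] = per_batter["adj"] - per_batter["basic"]
per_batter["speed_group"] = pd.qcut(per_batter["speed_proxy"], 3,
                                    labels=["slow", "avg", "fast"])

# Tag every ground ball with its batter's speed group and summarize per group
gb = gb.merge(per_batter["speed_group"].reset_index(), on="batter_id")
summary = gb.groupby("speed_group").agg(
    actual=("woba", "mean"),
    basic=("xwoba_basic", "mean"),
    speed_adjusted=("xwoba_speed_adj", "mean"),
    n=("woba", "size"),
)
summary["actual_minus_basic"] = summary["actual"] - summary["basic"]
summary["actual_minus_adjusted"] = summary["actual"] - summary["speed_adjusted"]
print(summary)
```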