Fielding-Independent Defense

TL;DR Using Sprint Speed, Reaction, and Burst from the Statcast leaderboard pages, with no catch information (or other information) of any kind, is enough to give a good description of same-season outfield OAA, and that descriptive stat predicts next-season outfield OAA better than current-season OAA does.

A recently-released not-so-great defensive metric inspired me to repurpose an old idea of mine and see how well I could model outfield defense without knowing any actual play results, and the answer is actually pretty well.  Making catches in the OF, taking positioning as a given as OAA does, is roughly based on five factors- reacting fast, accelerating fast, running fast, running to the right place, and catching the balls you’re close enough to reach.

Reacting fast has its own leaderboard metric (Reaction Distance, the distance traveled in the first 1.5s), as does running fast (Sprint Speed, although it’s calculated on offense).  Acceleration has a metric somewhat covering it, although not as cleanly, in Burst (distance traveled between 1.5s and 3s).  Running to the right place only has the Route metric, which covers just the first 3s, is very confounded, and doesn’t help nontrivially, so I don’t use it.  Actually catching the balls is deliberately left out of the FID metric (I do provide a way to incorporate catch information into next season’s estimation at the end of each section).

2+ Star Plays

The first decision was which metric to model, and I went with OAA/n on 2+ star plays over OAA/n for all plays for multiple reasons: 2+ star plays are responsible for the vast majority of seasonal OAA, OAA/n on 2+ star plays correlates at over r=0.94 to OAA/n on all plays (same season), so it contains the vast majority of the information in OAA/n anyway, and the three skills I’m trying to model aren’t put on full display on easier balls.  Then I had to decide how to incorporate the three factors (Sprint Speed, Reaction, Burst).  Reaction and Burst are already normalized to league average, and I normalized Sprint Speed to the average OF sprint speed weighted by the number of 2+ star opportunities each player got (since that’s the average sprint speed of the 2+* sample); a minimal sketch of that weighting follows the table below.  That can get janky early in a season before a representative sample has qualified for the leaderboard, so in that case it’s probably better to just use the previous season’s average sprint speed as a baseline for a while, as there’s not that much variation.

| year | weighted avg OF Sprint Speed (ft/s) |
|------|-------------------------------------|
| 2016 | 27.82 |
| 2017 | 27.93 |
| 2018 | 28.01 |
| 2019 | 27.93 |
| 2020 | 27.81 |
| 2021 | 27.94 |
| 2022 | 28.00 |
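
For anyone who wants to replicate the baseline, here’s a minimal sketch of that opportunity-weighted normalization, assuming a hypothetical per-player table pulled from the leaderboards (the column names are placeholders, not Statcast’s):

```python
import pandas as pd

# Hypothetical leaderboard pull; one row per outfielder, placeholder column names.
of = pd.DataFrame({
    "player": ["A", "B", "C"],
    "sprint_speed": [28.9, 27.1, 27.8],    # ft/s, from the Sprint Speed leaderboard
    "n_2plus_star_opps": [55, 20, 70],     # 2+ star opportunities, from the Jump leaderboard
})

# Baseline = OF Sprint Speed weighted by 2+ star opportunities
# (i.e., the average sprint speed of the 2+* sample).
baseline = (of["sprint_speed"] * of["n_2plus_star_opps"]).sum() / of["n_2plus_star_opps"].sum()

# The FID input is each player's speed above that weighted baseline.
of["sprint_speed_above_avg"] = of["sprint_speed"] - baseline
```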

 

Conveniently each stat individually was pretty linear in its relationship to OAA/n 2+* (50+ 2* opportunities shown).

Reaction isn’t convincingly near-linear as opposed to some other positively-correlated shape, but it’s also NOT convincingly nonlinear at all, so we’ll just go with it.

The outlier at 4.7 reaction is Enrique Hernandez who seems to do this regularly but didn’t have enough opportunities in the other seasons to get on the graph again.  I’m guessing he’s deliberately positioning “slightly wrong” and choosing to be in motion with the pitch instead of stationary-ish and reacting to it.  If more players start doing that, then this basic model formulation will have a problem.

Reaction Distance and Sprint Speed are conveniently barely correlated, r=-.09, and changes from year to year are correlated at r=0.1.  I’d expect the latter to be closer to the truth, given that for physical reasons there should be a bit of correlation, and there should be a bit of “missing lower left quadrant” bias in the first number where you can’t be bad at both and still get run out there, but it seems like the third factor to being put in the lineup (offense) is likely enough to keep that from being too extreme.

Burst, on the other hand, there’s no sugarcoating it.  Correlation of r=0.4 to Reaction Distance (moving further = likely going faster at the 1.5s point, leading to more distance covered in the next 1.5s) and r=0.56 to Sprint Speed for obvious reasons.  I took the easy way out with only one messy variable and made an Expected Burst from Reaction and Sprint Speed (r=0.7 to actual Burst), and then took the residual Burst – Expected Burst to create Excess Burst and used that as the third input.  I also tried a version with Sprint Speed and Excess Bootup Distance (distance traveled in the correct direction in the first 3s) as a 2-variable model, and it still “works”, but it’s significantly worse in both description and prediction.  Excess Burst also looks fine as far as being linear with respect to OAA/n.
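
Here’s a minimal sketch of the Excess Burst construction, using a plain least-squares fit of Burst on the other two inputs (an unweighted version of the idea; column names are placeholders):

```python
import numpy as np
import pandas as pd

def excess_burst(df: pd.DataFrame) -> pd.Series:
    """Burst minus Expected Burst, where Expected Burst is a linear fit of Burst
    on Reaction and Sprint Speed above average (plain least squares here)."""
    X = np.column_stack([np.ones(len(df)), df["reaction"], df["sprint_speed_above_avg"]])
    coefs, *_ = np.linalg.lstsq(X, df["burst"].to_numpy(), rcond=None)
    expected = X @ coefs
    return df["burst"] - expected   # the third FID input

# df["excess_burst"] = excess_burst(df)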

Looking at how the inputs behave year to year (unless indicated otherwise, all correlations use players who qualified for Jump stats (10+ 2-star opportunities) and the Sprint Speed leaderboard (10+ competitive runs) in both years, and correlations are weighted by year T 2+ star opportunities):

| weighted correlation r-values (N=712) | itself (next season) | same-season OAA/n 2+* | next-season OAA/n 2+* |
|---|---|---|---|
| Sprint Speed above avg | 0.91 | 0.49 | 0.45 |
| Reaction | 0.85 | 0.40 | 0.35 |
| Burst | 0.69 | 0.79 | 0.60 |
| Excess Burst | 0.53 | 0.41 | 0.21 |

Sprint Speed was already widely known to be highly reliable year-over-year, so no surprise there, and Reaction grades out almost as well, but Burst clearly doesn’t, particularly in the year-to-year drop in correlation to OAA/n.  Given that the start of a run (Reaction) holds up year-to-year, and that the end of the run (Sprint Speed) holds up year-to-year, it’s very strange to me that the middle of the run wouldn’t.  I could certainly believe a world where there’s less skill variation in the middle of the run, so the left column would be smaller, but that doesn’t explain the dropoff in correlation to OAA/n.  Play-level data isn’t available, but what I think is happening here is that because Burst is calculated on ALL 2+* plays, it’s a mixture of maximum burst AND how often a player chose to use it, because players most definitely don’t bust ass for 3 seconds on plenty of the 2+* plays they don’t get close to making (and this isn’t even necessarily wrong on any given play; getting in position to field the bounce and throw can be the right decision).

I would expect maximum burst to hold up roughly as well as Reaction or Sprint Speed in the OAA/n correlations, but the choice component is the kind of noisy yes-no variable that takes longer to stabilize than a stat that’s closer to a physical attribute.  While there’s almost no difference in Sprint Speed year-to-year correlation between players with 60+ attempts (n=203) and players with 30 or fewer attempts (n=265), r=0.92 vs. r=0.90, there’s a huge dropoff in Burst, r=0.77 vs. r=0.57.

This choice variable is also a proxy for making catches- if you burst after a ball that some players don’t, you have a chance to make a catch that they won’t, and if you don’t burst after a ball that other players do, you have no chance to make a catch that they might.  That probably also explains why the Burst metric and Excess Burst are relatively overcorrelated to same-season OAA/n.
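
Since every r in these tables is opportunity-weighted, here’s a minimal sketch of a weighted Pearson correlation (weights = year T 2+ star opportunities); the helper name is mine:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y with observation weights w."""
    x, y, w = map(np.asarray, (x, y, w))
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)

# e.g. weighted_corr(df["burst"], df["burst_next"], df["n_2plus_star_opps"])
```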

Now that we’ve discussed the components, let’s introduce the Fielding-Independent Defense concept.  It is a DESCRIPTIVE stat: the values of A/B/C/D in FID = A + B*Reaction + C*Sprint Speed above average + D*Excess Burst are chosen to minimize SAME-SEASON opportunity-weighted RMSE between FID and OAA/n on whatever plays we’re looking at (here, 2+ star plays). Putting FID on the success probability added scale (e.g. if Statcast had an average catch probability of 50% on a player’s opportunities and he caught 51%, he’s +1% success probability, and if FID expects him to catch 52%, he’d be +2% FID), I get

FID = 0.1% + 4.4% * Reaction + 5.4% * Sprint Speed above average + 6.4% * Excess Burst.

In an ideal world, the Y intercept (A) would be 0, because in a linear model somebody who’s average at every component should be average, but our Sprint Speed above average here is above the weighted average of players who had enough opportunities to have stats calculated, which isn’t exactly the average including players who didn’t, so I let the intercept be free, and 0.1% on 2+ star plays is less than 0.1 OAA over a full season, so I’m not terribly worried at this point.  And yeah, I know I’m nominally adding things that don’t even use the same units (ft vs. ft/s) and coming up with a number in outs/opportunity, so be careful porting this idea to different measurement systems, because the coefficients would need to be converted from their per-foot and per-(ft/s) units.
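
For concreteness, a sketch of how the A/B/C/D fit can be reproduced: weighted least squares on the same-season data, which is equivalent to minimizing opportunity-weighted RMSE (column names are again hypothetical):

```python
import numpy as np

def fit_fid(df):
    """Weighted least squares of same-season OAA/n (2+*) on the three components;
    minimizing weighted squared error is the same as minimizing weighted RMSE."""
    X = np.column_stack([
        np.ones(len(df)),               # A (intercept)
        df["reaction"],                 # B
        df["sprint_speed_above_avg"],   # C
        df["excess_burst"],             # D
    ])
    y = df["oaa_per_opp_2plus"].to_numpy()
    sw = np.sqrt(df["n_2plus_star_opps"].to_numpy(dtype=float))
    coefs, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coefs   # [A, B, C, D] on the success probability added scale

# A, B, C, D = fit_fid(df)
# df["fid"] = A + B*df["reaction"] + C*df["sprint_speed_above_avg"] + D*df["excess_burst"]
```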

So, how well does it do?

| weighted correlation r-values (N=712) | itself (next season) | same-season OAA/n 2+* | next-season OAA/n 2+* |
|---|---|---|---|
| FID 2+* | 0.748 | 0.806 | 0.633 |
| OAA/n 2+* | 0.591 | 1.000 | 0.591 |
| OAA-FID | 0.156 | 0.587 | 0.132 |
| Regressed FID and (OAA-FID) | 0.761 | 0.890 | 0.647 |

That’s pretty darn good for a purely descriptive stat that doesn’t know the outcome of any play, although 2+* plays do need physical skills more than 0-1* plays.  The OAA-FID residual- actual catch information- does contain some predictive value, but it doesn’t help a ton.  The regression amounts I came up with (the best fit regression for FID and OAA-FID together to predict next season’s OAA/n on 2+*) were 12 opportunities for FID and 218 opportunities for OAA-FID.  Given that a full-time season is 70-80 2+ star opportunities on average (more for CF, fewer for corner), FID is half-real in a month and OAA-FID would be half-real if a season lasted 18 straight months.  Those aren’t the usual split-sample correlations, since there isn’t any split-sample data available, but regressions based on players with different numbers of opportunities.  That has its own potential issues, but 12 and 218 should be in the ballpark.  FID stabilizes super-fast.
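
One way to apply those regression amounts is the usual n/(n+k) shrinkage toward league average (zero on both scales); a simplified sketch of the “Regressed FID and (OAA-FID)” combination, not necessarily the exact blend the best-fit regression produced:

```python
def predict_next_oaa_per_opp(fid, oaa_minus_fid, n_opps, k_fid=12, k_resid=218):
    """Shrink each piece toward zero (league average) by n/(n+k), then add them."""
    return (fid * n_opps / (n_opps + k_fid)
            + oaa_minus_fid * n_opps / (n_opps + k_resid))

# A +3% FID, +1% residual fielder with 75 2+ star opportunities:
# predict_next_oaa_per_opp(0.03, 0.01, 75)  -> about +2.8%
```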

0 and 1 Star Plays

Since 0 and 1 star plays have a catch probability of at least 90%, there’s an upper bound on success probability added, and while players reaching +10% on 2+* plays isn’t rare, obviously that’s not going to translate over directly to a 105% catch rate on 0/1* plays.   I did the same analysis for 0-1* plays as for 2+* plays using the 2+* metrics as well as a FID fit specifically for 0-1* plays.  Everything here is weighted by the player’s number of 0-1* opportunities in year T.

| weighted correlation r-values (N=712) | itself (next season) | same-season OAA/n 0/1* | next-season OAA/n 0/1* |
|---|---|---|---|
| FID 2+* | 0.75 | 0.20 | 0.25 |
| OAA/n 2+* | 0.59 | 0.27 | 0.27 |
| OAA-FID 2+* | 0.15 | 0.18 | 0.11 |
| FID reweighted for only 0/1* | 0.75 | 0.22 | 0.26 |
| OAA/n 0/1* | 0.13 | 1.00 | 0.13 |
| OAA-FID 0/1* | 0.07 | 0.97 | 0.07 |

The obvious takeaway is that these correlations suck compared to the ones for 2+* plays, and that’s the result of there being a much larger spread in talent in the ability to catch 2+* balls compared to easier ones.  2+* plays are ~25% of total plays but comprise ~80% of a player’s OAA; 0-1* plays are 3 times as numerous and comprise ~20% of OAA.  OAA on 0-1* plays is just harder to predict because there’s less signal in it, as seen in its horrendous self-correlation.

The other oddities are that OAA/n on 2+* outcorrelates FID from 2+* and that both FIDs do better on the next season than the current season.  For the former, the residual OAA-FID on 2+* plays has a comparable signal, and 0-1* plays are a lot more weighted to “catch the ball” skill relative to physical skills than 2+* plays are, and OAA and especially the residual weight “catch the ball” heavily, so that’s my best guess for that.  As to why FID correlates better to next season, I don’t have a great explanation there. It was odd enough that I recreated the entire analysis from scratch, but I just got the same thing again. I broke it down by year and didn’t see anything strange there either.  2020 correlated worse, of course, but everything is attempt-weighted so that shouldn’t much matter for same-season correlations.

Reaction time grades out relatively much less important in comparison to Sprint Speed and Burst for these plays than it did on 2+* plays, not shockingly.

All plays together

Doing the same analysis on all plays, and comparing the 2+* metrics to the all-play metrics, we get

| weighted correlation r-values (N=712) | itself (next season) | same-season OAA/n all plays | next-season OAA/n all plays |
|---|---|---|---|
| FID 2+* | 0.749 | 0.758 | 0.627 |
| OAA/n 2+* | 0.589 | 0.943 | 0.597 |
| OAA-FID 2+* | 0.154 | 0.559 | 0.153 |
| FID Reweighted for all plays | 0.751 | 0.759 | 0.628 |
| OAA/n all plays | 0.601 | 1.000 | 0.601 |
| OAA-FID all plays | 0.210 | 0.629 | 0.173 |
| Regressed FID and (OAA-FID) 2+* only | 0.760 | 0.842 | 0.642 |
| Regressed FID and (OAA-FID) all plays | 0.763 | 0.894 | 0.652 |

with regression numbers of 35 opportunities for all-play FID and 520 (~2 full seasons) opportunities for all-play OAA-FID.  As a descriptive stat, you get basically everywhere you’re going to go with FID just by looking at 2+* plays, and that’s also enough to outpredict last season’s OAA/n for all plays.  The relative weights are fairly similar here- 1:1.34:1.57 Reaction:Sprint Speed:Excess Burst for all plays compared to 1:1.23:1.48 for only 2+* plays, with Reaction relatively a bit less important when considering all plays, as expected given the 0-1* section.  The equation for all plays is:

FID = 0.08% + 1.12% * Reaction + 1.50% * Sprint Speed above average + 1.76% * Excess Burst.

The coefficients are smaller than the 2+* version because there’s less variation in OAA/n than in 2+* plays, but to convert to a seasonal number, there are roughly 4 times as many total plays as 2+* plays, so it all works out.  A hypothetical player who’s +1.0 in all components would be +16.3% on 2+star plays and +4.46% on all plays.  Given a breakdown of 70 2+* attempts and 210 0/1* attempts for 280 total attempts in a season, this person would be expected to be +12.5 OAA overall, +11.4 OAA on 2+* plays and +1.1 OAA on 0/1* plays.
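
The seasonal arithmetic from that paragraph, spelled out:

```python
# Hypothetical +1.0-in-every-component outfielder, using the two fitted equations.
fid_2plus = 0.001 + 0.044 + 0.054 + 0.064        # +16.3% on 2+ star plays
fid_all = 0.0008 + 0.0112 + 0.0150 + 0.0176      # +4.46% on all plays

opps_2plus, opps_01 = 70, 210                    # rough full-season breakdown
oaa_total = fid_all * (opps_2plus + opps_01)     # ~ +12.5 OAA overall
oaa_2plus = fid_2plus * opps_2plus               # ~ +11.4 OAA on 2+* plays
oaa_01 = oaa_total - oaa_2plus                   # ~ +1.1 OAA on 0/1* plays
```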

Defensive Aging

FID also allows a quick look at where aging comes from across the paired seasons.  Sprint Speed is roughly a straight line down with age, and it’s the biggest driver of decline by far.

|  | Reaction | Sprint Speed above avg | Excess Burst | Total | FID (all) | OAA/n (all) |
|---|---|---|---|---|---|---|
| Change | -0.0410 | -0.1860 | 0.0133 |  | -0.0030 | -0.0029 |
| Expected Impact (in catch %) | -0.046% | -0.279% | 0.023% | -0.302% | -0.302% | -0.285% |

That’s around a 0.9 drop in seasonal OAA on average from getting a year older, which is a very good match for the decline predicted by worse FID components.  As a rough rule of thumb, 1 marginal foot in reaction distance is worth 3 OAA over a full year, 1 ft/s in Sprint Speed is worth 4 OAA, and 1 foot in Excess Burst is worth 5 OAA.
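
The “Expected Impact” row above is just each component’s average change times its all-play FID coefficient; spelled out:

```python
coef = {"Reaction": 0.0112, "Sprint Speed": 0.0150, "Excess Burst": 0.0176}       # all-play FID
change = {"Reaction": -0.0410, "Sprint Speed": -0.1860, "Excess Burst": 0.0133}   # avg yearly change

impact = {k: coef[k] * change[k] for k in coef}   # -0.046%, -0.279%, +0.023% per opportunity
total = sum(impact.values())                      # ~ -0.30% catch probability per opportunity
print(total * 280)                                # ~ -0.85 OAA over a ~280-opportunity season
```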

Conclusion

Sprint Speed, Reaction Distance, and Excess Burst really do cover most of the skill variation in defense, and by the time players have reached the major league level, their routes and ability to catch the ball are sufficiently good that those differences are overwhelmed by physical differences.  This measure depends on that being true- at lower levels, it might not hold as well and the regression number for OAA-FID could drop much lower to where you’d be throwing away a lot of information by not considering OAA-FID.  It’s also definitely vulnerable to Goodhart’s law- “When a measure becomes a target, it ceases to be a good measure”.  It’s hard to uselessly game your FIP to be much lower than it would be if you were pitching normally, but without play-level data (or observation) to detect players in motion at the pitch, anybody who wanted to game FID could probably do it pretty easily.

OAA and the New Baseball Prospectus Defensive Metric Range Defense Added: Infield Edition

TL;DR Use OAA. Ignore RDA/ROS.

Baseball Prospectus came out with a new defensive metric in the vein of their DRC+ and DRA- stats.   If you’re familiar with my commentary on DRC+, this is going to hit some of the same notes, but it’s still worth a quick read if you made it this far.  The infield and outfield models for RDA behave extremely differently, so I’m going to discuss each one in a separate post. The outfield post is here.

The infield model is just simply bad compared to OAA/DRS.  If somebody is giving you hype-job statistics and only tells you how well a team does against a non-division same-league opponent who’s at least 2 games below .500 while wearing uniforms with a secondary color hex code between #C39797 and #FFFFFF in Tuesday day games during a waxing gibbous moon.. well, that ought to make you immediately suspicious of how bad everything else is.  And the same for the statistics cited in the RDA article.

That is.. the opposite of a resounding win for ROS/RDA.  And it’s worse than it looks because OAA is (theoretically, and likely practically) the best at stripping out fielder positioning, while DRS and ROS will have some residual positioning information that will self-correlate to some extent.  DRS also contains additional information (extra penalty for botched balls down the line, throwing errors, double plays) that likely help it self-correlate better, and ROS/RDA appear to contain outside information as described above which will also help it self-correlate better.

|  | OAA/inn | DRS/inn | ROS | RDA/inn | N |
|---|---|---|---|---|---|
| to OAA | 0.44 | 0.32 | 0.22 | 0.21 | 177 |
| to DRS | 0.26 | 0.45 | 0.30 | 0.30 | 177 |

ROS/RDA correlating significantly better to DRS than to OAA is suggestive of a fair bit of its year-to-year self-correlation being to non-demonstrated-fielding-skill information.

Even in their supposed area of supremacy, team-switchers, infield ROS/RDA is still bad.  I classified players as either non-switchers (played both seasons for the same team only), offseason switchers (played all of year T for one team and all of year T+1 for a different team), or midseason switchers (switched teams in the middle of at least one season):

| All IF | OAA/inn | DRS/inn | ROS | RDA/inn | n |
|---|---|---|---|---|---|
| Offseason | 0.40 | 0.45 | 0.43 | 0.46 | 79 |
| Midseason | 0.39 | 0.31 | 0.13 | 0.11 | 91 |
| Off or Mid | 0.39 | 0.38 | 0.28 | 0.28 | 170 |
| No Switch | 0.45 | 0.45 | 0.37 | 0.36 | 541 |
| All | 0.44 | 0.45 | 0.36 | 0.35 | 711 |

They match OAA/DRS on offseason-switching players- likely due to overfitting their model to a small number of players- but they’re absolutely atrocious on midseason switchers, and they actually have the *biggest* overall drop in reliability between non-switchers and switchers.  I don’t think there’s much more to say.  Infield RDA/ROS isn’t better than OAA/DRS.  It isn’t even close to equal to OAA/DRS. 


Technical notes: I sourced OAA from Fangraphs because I didn’t see a convenient way to grab OAA by position from Savant without scraping individual player pages (the OAA Leaderboard .csv with a position filter doesn’t include everybody who played a position).  This meant that the slightly inconvenient way of grabbing attempts from Savant wasn’t useful here because it also couldn’t split attempts by position, so I was left with innings as a denominator.  Fangraphs doesn’t have a (convenient?) way to split defensive seasons between teams, while BP does split between teams on their leaderboard, so I had to combine split-team seasons and used a weighted average by innings.  Innings by position match between BP and FG in 98.9% of cases and the differences are only a couple of innings here and there, nothing that should make much difference to anything.

The hidden benefit of pulling the ball

Everything else about the opportunity being equal, corner OFs have a significantly harder time catching pulled balls  than they do catching opposite-field balls.  In this piece, I’ll demonstrate that the effect actually exists, try to quantify it in a useful way, and give a testable take on what I think is causing it.

Looking at all balls with a catch probability >0 and <0.99 (the Statcast cutoff for absolutely routine fly balls), corner OF out rates underperform catch probability by 0.028 on pulled balls relative to oppo balls.

(For the non-baseball readers, position 7 is left field, 8 is center field, 9 is right field, and a pulled ball is a right-handed batter hitting a ball to left field or a LHB hitting a ball to right field.  Oppo is “opposite field”, RHB hitting the ball to right field, etc.)

| Stands | Pos | Catch Prob | Out Rate | Difference | N |
|---|---|---|---|---|---|
| L | 7 | 0.859 | 0.844 | -0.015 | 14318 |
| R | 7 | 0.807 | 0.765 | -0.042 | 11380 |
| L | 8 | 0.843 | 0.852 | 0.009 | 14099 |
| R | 8 | 0.846 | 0.859 | 0.013 | 19579 |
| R | 9 | 0.857 | 0.853 | -0.004 | 19271 |
| L | 9 | 0.797 | 0.763 | -0.033 | 8098 |

The joint standard deviation for each L-R difference, given those Ns, is about 0.005, so 0.028 +/- 0.005, symmetric in both fields, is certainly interesting.  Rerunning the numbers on more competitive plays, 0.20 < catch probability < 0.80:

| Stands | Pos | Catch Prob | Out Rate | Difference | N |
|---|---|---|---|---|---|
| L | 7 | 0.559 | 0.525 | -0.034 | 2584 |
| R | 7 | 0.536 | 0.407 | -0.129 | 2383 |
| L | 9 | 0.533 | 0.418 | -0.116 | 1743 |
| R | 9 | 0.553 | 0.549 | -0.005 | 3525 |

Now we see a much more pronounced difference, .095 in LF and .111 in RF (+/- ~.014).  The difference is only about .01 on plays between .8 and .99, so whatever’s going on appears to be manifesting itself clearly on competitive plays while being much less relevant to easier plays.

Using competitive plays also allows a verification that is (mostly) independent of Statcast’s catch probability.  According to this Tango blog post, catch probability changes are roughly linear to time or distance changes in the sweet spot, at a rate of 0.1s = 10% out rate and 1 foot = 4% out rate.  By grouping roughly similar balls and using those conversions, we can see how robust this finding is.  Using 0.2<=CP<=0.8, back=0, and binning by hang time in 0.5s increments, we can create buckets of almost identical opportunities.  For RF, it looks like

| Stands | Hang Time Bin (s) | Avg Hang Time (s) | Avg Distance (ft) | N |
|---|---|---|---|---|
| L | 2.5-3.0 | 2.881 | 30.788 | 126 |
| R | 2.5-3.0 | 2.857 | 29.925 | 242 |
| L | 3.0-3.5 | 3.268 | 41.167 | 417 |
| R | 3.0-3.5 | 3.256 | 40.765 | 519 |
| L | 3.5-4.0 | 3.741 | 55.234 | 441 |
| R | 3.5-4.0 | 3.741 | 55.246 | 500 |
| L | 4.0-4.5 | 4.248 | 69.408 | 491 |
| R | 4.0-4.5 | 4.237 | 68.819 | 380 |
| L | 4.5-5.0 | 4.727 | 81.487 | 377 |
| R | 4.5-5.0 | 4.714 | 81.741 | 204 |
| L | 5.0-5.5 | 5.216 | 93.649 | 206 |
| R | 5.0-5.5 | 5.209 | 93.830 | 108 |

If there’s truly a 10% gap, it should easily show up in these bins.

| Hang Time to LF | Raw Difference | Corrected Difference | Catch Prob Difference | SD |
|---|---|---|---|---|
| 2.5-3.0 | 0.099 | 0.104 | -0.010 | 0.055 |
| 3.0-3.5 | 0.062 | 0.059 | -0.003 | 0.033 |
| 3.5-4.0 | 0.107 | 0.100 | 0.013 | 0.032 |
| 4.0-4.5 | 0.121 | 0.128 | 0.026 | 0.033 |
| 4.5-5.0 | 0.131 | 0.100 | 0.033 | 0.042 |
| 5.0-5.5 | 0.080 | 0.057 | 0.023 | 0.059 |

| Hang Time to RF | Raw Difference | Corrected Difference | Catch Prob Difference | SD |
|---|---|---|---|---|
| 2.5-3.0 | 0.065 | 0.096 | -0.063 | 0.057 |
| 3.0-3.5 | 0.123 | 0.130 | -0.023 | 0.032 |
| 3.5-4.0 | 0.169 | 0.149 | 0.033 | 0.032 |
| 4.0-4.5 | 0.096 | 0.093 | 0.020 | 0.035 |
| 4.5-5.0 | 0.256 | 0.261 | 0.021 | 0.044 |
| 5.0-5.5 | 0.168 | 0.163 | 0.044 | 0.063 |

and it does.  Whatever is going on is clearly not just an artifact of the catch probability algorithm.  It’s a real difference in catching balls.  This also means that I’m safe using catch probability to compare performance and that I don’t have to do the whole bin-and-correct thing any more in this post.
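
For reference, a simplified sketch of how those conversions turn bin-average gaps in hang time and distance into an expected out-rate gap (applied to bin averages only, not necessarily the exact correction behind the table):

```python
def expected_gap(hang_a, dist_a, hang_b, dist_b):
    """Expected out-rate edge of group A over group B from opportunity quality alone,
    using 0.1 s of hang time ~ 10% out rate and 1 ft of needed distance ~ 4% out rate."""
    return 1.00 * (hang_a - hang_b) - 0.04 * (dist_a - dist_b)

# RF, 2.5-3.0s bin: the pulled (LHB) and oppo (RHB) opportunities are nearly identical,
# so opportunity quality explains almost none of the raw out-rate gap.
print(expected_gap(2.881, 30.788, 2.857, 29.925))   # ~ -0.011
```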

Now we’re on to the hypothesis-testing portion of the post.  I’d used the back=0 filter to avoid potentially Simpson’s Paradoxing myself, so how does the finding hold up with back=1 & wall=0?

| Stands | Pos | Catch Prob | Out Rate | Difference | N |
|---|---|---|---|---|---|
| R | 7 | 0.541 | 0.491 | -0.051 | 265 |
| L | 7 | 0.570 | 0.631 | 0.061 | 333 |
| R | 9 | 0.564 | 0.634 | 0.071 | 481 |
| L | 9 | 0.546 | 0.505 | -0.042 | 224 |

An L-R difference of about 0.11 in both fields.  Nothing new there.

In theory, corner OFs could be particularly bad at playing hooks or particularly good at playing slices.  If that’s true, then the balls with more sideways movement should be quite different than the balls with less sideways movement.  I made an estimation of the sideways acceleration in flight based on hang time, launch spray angle, and landing position and split balls into high and low acceleration (slices have more sideways acceleration than hooks on average, so this is comparing high slice to low slice, high hook to low hook).
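
Here’s a simplified sketch of that sideways-acceleration estimate- treat the landing point’s deviation from the initial spray-angle line as constant sideways acceleration over the hang time (coordinate conventions and names are mine):

```python
import numpy as np

def lateral_acceleration(spray_angle_deg, land_x, land_y, hang_time):
    """Sideways acceleration (ft/s^2) implied by how far the landing point sits off
    the initial spray-angle line, treated as constant over the hang time.
    Coordinates: home plate at (0, 0), y toward CF, x toward the RF line;
    spray angle 0 = straight-away CF, positive toward RF."""
    theta = np.radians(spray_angle_deg)
    heading = np.array([np.sin(theta), np.cos(theta)])      # initial horizontal direction
    # signed perpendicular distance of the landing spot from the heading line
    deviation = land_x * heading[1] - land_y * heading[0]
    return 2.0 * deviation / hang_time ** 2                 # x = a*t^2/2  ->  a = 2x/t^2
```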

| Batted Spin | Stands | Pos | Catch Prob | Out Rate | Difference | N |
|---|---|---|---|---|---|---|
| Lots of Slice | L | 7 | 0.552 | 0.507 | -0.045 | 1387 |
| Low Slice | L | 7 | 0.577 | 0.545 | -0.032 | 617 |
| Lots of Hook | R | 7 | 0.528 | 0.409 | -0.119 | 1166 |
| Low Hook | R | 7 | 0.553 | 0.402 | -0.151 | 828 |
| Lots of Slice | R | 9 | 0.540 | 0.548 | 0.007 | 1894 |
| Low Slice | R | 9 | 0.580 | 0.539 | -0.041 | 972 |
| Lots of Hook | L | 9 | 0.526 | 0.425 | -0.101 | 850 |
| Low Hook | L | 9 | 0.546 | 0.389 | -0.157 | 579 |

And there’s not much to see there.  Corner OF play low-acceleration balls worse, but on average those are balls towards the gap and somewhat longer runs, and the out rate difference is somewhat-to-mostly explained by corner OF’s lower speed getting exposed over a longer run.  Regardless, nothing even close to explaining away our handedness effect.

Perhaps pull and oppo balls come from different pitch mixes and there’s something about the balls hit off different pitches.

| Pitch Type | Stands | Pos | Catch Prob | Out Rate | Difference | N |
|---|---|---|---|---|---|---|
| FF | L | 7 | 0.552 | 0.531 | -0.021 | 904 |
| FF | R | 7 | 0.536 | 0.428 | -0.109 | 568 |
| FF | L | 9 | 0.527 | 0.434 | -0.092 | 472 |
| FF | R | 9 | 0.556 | 0.552 | -0.004 | 1273 |
| FT/SI | L | 7 | 0.559 | 0.533 | -0.026 | 548 |
| FT/SI | R | 7 | 0.533 | 0.461 | -0.072 | 319 |
| FT/SI | L | 9 | 0.548 | 0.439 | -0.108 | 230 |
| FT/SI | R | 9 | 0.553 | 0.592 | 0.038 | 708 |
| Other | L | 7 | 0.569 | 0.479 | -0.090 | 697 |
| Other | R | 7 | 0.541 | 0.379 | -0.161 | 1107 |
| Other | L | 9 | 0.534 | 0.385 | -0.149 | 727 |
| Other | R | 9 | 0.550 | 0.497 | -0.054 | 896 |

The effect clearly persists, although there is a bit of Simpsoning showing up here.  Slices are relatively fastball-heavy and hooks are relatively Other-heavy, and corner OF catch FBs at a relatively higher rate.  That will be the subject of another post.  The average L-R difference among paired pitch types is still 0.089 though.

Vertical pitch location is completely boring, and horizontal pitch location is the subject for another post (corner OFs do best on outside pitches hit oppo and worst on inside pitches pulled), but the handedness effect clearly persists across all pitch location-fielder pairs.

So what is going on?  My theory is that this is a visibility issue.  The LF has a much better view of a LHB’s body and swing than he does of a RHB’s, and it’s consistent with all the data that looking into the open side gives about a 0.1 second advantage in reaction time compared to looking at the closed side.  A baseball swing takes around 0.15 seconds, so that seems roughly reasonable to me.  I don’t have the play-level data to test that myself, but it should show up as a batter handedness difference in corner OF reaction distance and around a 2.5 foot batter handedness difference in corner OF jump on competitive plays.

Don’t use FRAA for outfielders

TL;DR OAA is far better, as expected.  Read after the break for next-season OAA prediction/commentary.

As a followup to my previous piece on defensive metrics, I decided to retest the metrics using a sane definition of opportunity.  BP’s study defined a defensive opportunity as any ball fielded by an outfielder, which includes completely uncatchable balls as well as ground balls that made it through the infield.  The latter are absolute nonsense, and the former are pretty worthless.  Thanks to Statcast, a better definition of defensive opportunity is available- any ball it gives a nonzero catch probability and assigns to an OF.  Because Statcast doesn’t provide catch probability/OAA on individual plays, we’ll be testing each outfielder in aggregate.

Similarly to what BP tried to do, we’re going to try to describe or predict each OF’s outs/opportunity, and we’re testing the 354 qualified OF player-seasons from 2016-2019.  Our contestants are Statcast’s OAA/opportunity, UZR/opportunity, FRAA/BIP (what BP used in their article), simple average catch probability (with no idea if the play was made or not), and positional adjustment (effectively the share of innings in CF, corner OF, or 1B/DH).  Because we’re comparing all outfielders to each other, and UZR and FRAA compare each position separately, those two received the positional adjustment (they grade quite a bit worse without it, as expected).
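
For clarity, a minimal sketch of the same-season test- one row per qualified OF player-season, r^2 of each metric against outs/opportunity (column names are hypothetical; the UZR and FRAA columns are assumed to already include the positional adjustment):

```python
import pandas as pd

# Hypothetical table: one row per qualified OF player-season.
# outs_per_opp = outs / opportunities (nonzero catch probability, assigned to an OF).
metrics = ["oaa_per_opp", "uzr_per_opp_plus_pos", "fraa_per_bip_plus_pos",
           "avg_catch_prob", "pos_adjustment_per_opp"]

def same_season_r2(df: pd.DataFrame) -> pd.Series:
    return df[metrics].corrwith(df["outs_per_opp"]).pow(2).sort_values(ascending=False)
```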

Using data from THE SAME SEASON (see previous post if it isn’t obvious why this is a bad idea) to describe that SAME SEASON’s outs/opportunity, which is what BP was testing, we get the following correlations:

| Metric | r^2 to same-season outs/opportunity |
|---|---|
| OAA/opp | 0.74 |
| UZR/opp | 0.49 |
| Catch Probability + Position | 0.43 |
| FRAA/BIP | 0.34 |
| Catch Probability | 0.32 |
| Positional adjustment/opp | 0.25 |


OAA wins running away, UZR is a clear second, background information is 3rd, and FRAA is a distant 4th, barely ahead of raw catch probability.  And catch probability shouldn’t be that important.  It’s almost independent of OAA (r=0.06) and explains much less of the outs/opp variance.  Performance on opportunities is a much bigger driver than difficulty of opportunities over the course of a season.  I ran the same test on the 3 OF positions individually (using Statcast’s definition of primary position for that season), and the numbers bounced a little, but it’s the same rank order and similar magnitude of differences.

Attempting to describe same-season OAA/opp gives the following:

| Metric | r^2 to same-season OAA/opportunity |
|---|---|
| OAA/opp | 1 |
| UZR/opp | 0.5 |
| FRAA/BIP | 0.32 |
| Positional adjustment/opp | 0.17 |
| Catch Probability | 0.004 |

As expected, catch probability drops way off.  CF opportunities are on average about 1% harder than corner OF opportunities.  Positional adjustment is obviously a skill correlate (Full-time CF > CF/corner tweeners > Full-time corner > corner/1B-DH tweeners), but it’s a little interesting that it drops off compared to same-season outs/opportunity.  It’s reasonably correlated to catch probability, which is good for describing outs/opp and useless for describing OAA/opp, so I’m guessing that’s most of the decline.


Now, on to the more interesting things.. Using one season’s metric to predict the NEXT season’s OAA/opportunity (both seasons must be qualified), which leaves 174 paired seasons, gives us the following (players who dropped out were almost average in aggregate defensively):

| Metric | r^2 to next season OAA/opportunity |
|---|---|
| OAA/opp | 0.45 |
| FRAA/BIP | 0.27 |
| UZR/opp | 0.25 |
| Positional adjustment | 0.1 |
| Catch Probability | 0.02 |

FRAA notably doesn’t suck here- although unless you’re a modern-day Wintermute who is forbidden to know OAA, just use OAA of course.  Looking at the residuals from previous-season OAA, UZR is useless, but FRAA and positional adjustment contain a little information, and by a little I mean enough together to get the r^2 up to 0.47.  We’ve discussed positional adjustment already and that makes sense, but FRAA appears to know a little something that OAA doesn’t, and it’s the same story for predicting next-season outs/opp as well.
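
A sketch of that combined check- one multiple regression of next-season OAA/opp on this-season OAA/opp, FRAA/BIP, and positional adjustment together (hypothetical column names; a one-regression version of the residual idea):

```python
import numpy as np
import pandas as pd

def combined_r2(pairs: pd.DataFrame) -> float:
    """r^2 of next-season OAA/opp on this-season OAA/opp, FRAA/BIP, and positional
    adjustment together (the ~0.47 figure)."""
    X = np.column_stack([np.ones(len(pairs)),
                         pairs["oaa_per_opp"],
                         pairs["fraa_per_bip"],
                         pairs["pos_adjustment"]])
    y = pairs["next_oaa_per_opp"].to_numpy()
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return 1 - resid.var() / y.var()
```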

That’s actually interesting.  If the crew at BP had discovered that and spent time investigating the causes, instead of spending time coming up with ways to bullshit everybody that a metric that treats a ground ball to first as a missed play for the left fielder really does outperform Statcast, we might have all learned something useful.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield.. and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: On a team level, team OAA=range + conversion and team FRAA = team OAA + positioning-based difficulty relative to average.  On a player level, player OAA= range + conversion and player FRAA = player OAA + positioning + teammate noise.

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but..

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the bolded abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.


Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:

outs/BIP=

player performance given the difficulty of the opportunity (OAA) +

smart/dumb fielder positioning (a front-office/manager skill in 2019) +

quality of contact allowed (a batter/pitcher skill) +

luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill+luck, it benefits from “knowing” the luck in the plays it’s trying to predict.  In this case, luck being whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.

Trust the barrels

Inspired by the curious case of Harrison Bader


whose average exit velocity is horrific, hard hit% is average, and barrel/contact% is great (not shown, but a little better than the xwOBA marker), I decided to look at which one of these metrics was more predictive.  Barrels are significantly more descriptive of current-season wOBAcon (wOBA on batted balls/contact), and average exit velocity is sketchy because the returns on harder-hit balls are strongly nonlinear. The game rewards hitting the crap out of the ball, and one rocket and one trash ball come out a lot better than two average balls.

Using consecutive seasons with at least 150 batted balls (there’s some survivor bias based on quality of contact, but it’s pretty much even across all three measures), which gave 763 season pairs, barrel/contact% led the way with r=0.58 to next season’s wOBAcon, followed by hard-hit% at r=0.53 and average exit velocity at r=0.49.  That’s not a huge win, but it is a win, and since these are three ways of measuring a similar thing (quality of contact), they’re likely to be highly correlated, so we can do a little more work to figure out where the information lies.


I split the sample into tenths based on average exit velocity rank, and Hard-hit% and average exit velocity track an almost perfect line at the group (76-77 player) level.  Barrels deviate from linearity pretty measurably with the outliers on either end, so I interpolated and extrapolated on the edges to get an “expected” barrel% based on the average exit velocity, and then I looked at how players who overperformed and underperformed their expected barrel% by more than 1 SD (of the barrel% residual) did with next season’s wOBAcon.

| Avg EV decile | >2.65% more barrels than expected | average-ish barrels | >2.65% fewer barrels than expected | whole group |
|---|---|---|---|---|
| 0 | 0.362 | 0.334 | none | 0.338 |
| 1 | 0.416 | 0.356 | 0.334 | 0.360 |
| 2 | 0.390 | 0.377 | 0.357 | 0.376 |
| 3 | 0.405 | 0.386 | 0.375 | 0.388 |
| 4 | 0.389 | 0.383 | 0.380 | 0.384 |
| 5 | 0.403 | 0.389 | 0.374 | 0.389 |
| 6 | 0.443 | 0.396 | 0.367 | 0.402 |
| 7 | 0.434 | 0.396 | 0.373 | 0.401 |
| 8 | 0.430 | 0.410 | 0.373 | 0.405 |
| 9 | 0.494 | 0.428 | 0.419 | 0.441 |

That’s.. a gigantic effect.  Knowing barrel/contact% provides a HUGE amount of information on top of average exit velocity going forward to the next season.  I also looked at year-to-year changes in non-contact wOBA (K/BB/HBP) for these groups just to make sure and it’s pretty close to noise, no real trend and nothing close to this size.
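
A sketch of the bucketing, with a simple linear fit standing in for the decile interpolation/extrapolation and hypothetical column names:

```python
import numpy as np
import pandas as pd

def barrel_residual_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Expected barrel% from average EV, then bucket hitters by whether they beat or
    miss that expectation by more than 1 SD of the residual."""
    slope, intercept = np.polyfit(df["avg_ev"], df["barrel_pct"], 1)   # linear stand-in
    resid = df["barrel_pct"] - (intercept + slope * df["avg_ev"])
    sd = resid.std()
    group = pd.cut(resid, [-np.inf, -sd, sd, np.inf],
                   labels=["fewer barrels than expected", "average-ish", "more barrels than expected"])
    return (df.assign(barrel_group=group)
              .groupby("barrel_group", observed=True)["next_wobacon"]
              .agg(["mean", "count"]))
```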

It’s also possible to look at this in the opposite direction- find the expected average exit velocity based on the barrel%, then look at players who hit the ball more than 1 SD (of the average EV residual) harder or softer than they “should” have and see how much that tells us.

| Barrel% decile | >1.65 mph faster than expected | average-ish EV | >1.65 mph slower than expected | whole group |
|---|---|---|---|---|
| 0 | 0.358 | 0.339 | 0.342 | 0.344 |
| 1 | 0.362 | 0.359 | 0.316 | 0.354 |
| 2 | 0.366 | 0.364 | 0.361 | 0.364 |
| 3 | 0.389 | 0.377 | 0.378 | 0.379 |
| 4 | 0.397 | 0.381 | 0.376 | 0.384 |
| 5 | 0.388 | 0.395 | 0.418 | 0.397 |
| 6 | 0.429 | 0.400 | 0.382 | 0.403 |
| 7 | 0.394 | 0.398 | 0.401 | 0.398 |
| 8 | 0.432 | 0.414 | 0.409 | 0.417 |
| 9 | 0.449 | 0.451 | 0.446 | 0.450 |


There’s still some information there, but while the average difference between the good and bad EV groups here is 12 points of next season’s wOBAcon, the average difference for good and bad barrel groups was 50 points.  Knowing barrels on top of average EV tells you a lot.  Knowing average EV on top of barrels tells you a little.

Back to Bader himself, a month of elite barreling doesn’t mean he’s going to keep smashing balls like Stanton or anything silly, and trying to project him based on contact quality so far is way beyond the scope of this post, but if you have to be high on one and low on the other, lots of barrels and a bad average EV is definitely the way to go, both for YTD and expected future production.

 

Mashers underperform xwOBA on air balls

Using the same grouping methodology as “The Statcast GB speed adjustment seems to capture about 40% of the speed effect”, except using barrel% (barrels/batted balls), I got the following for air balls (FB, LD, Popup):

| barrel group | FB BA-xBA | FB wOBA-xwOBA | n |
|---|---|---|---|
| high-barrel% | 0.006 | -0.005 | 22993 |
| avg | 0.006 | 0.010 | 22775 |
| low-barrel% | -0.002 | 0.005 | 18422 |

These numbers get closer to the noise range (+/- 0.003), but mashers simultaneously OUTPERFORMING on BA while UNDERPERFORMING on wOBA, with weak hitters doing the opposite, is a tough parlay to hit by chance alone, because any positive BA event is a positive wOBA event as well.  The obvious explanation to me, which Tango is going with too, is that mashers just get played deeper in the OF, and that that alignment difference is the major driver of what we’ve each measured.

 

The Statcast GB speed adjustment seems to capture about 40% of the speed effect

Statcast recently rolled out an adjustment to its ground ball xwOBA model to account for batter speed, and I set out to test how well that adjustment was doing.  I used 2018 data for players with at least 100 batted balls (n=390).  To get a proxy for sprint speed, I used the average difference between the speed-unadjusted xwOBA and the speed-adjusted xwOBA for ground balls.  Billy Hamilton graded out fast.  Welington Castillo didn’t.  That’s good.  Grouping the players into thirds by their speed proxy, I got the following:

 

| speed | Actual GB wOBA | basic xwOBA | speed-adjusted xwOBA | Actual - basic | Actual - (speed-adjusted) | n |
|---|---|---|---|---|---|---|
| slow | 0.215 | 0.226 | 0.215 | -0.011 | 0.000 | 14642 |
| avg | 0.233 | 0.217 | 0.219 | 0.016 | 0.014 | 16481 |
| fast | 0.247 | 0.208 | 0.218 | 0.039 | 0.029 | 18930 |

The slower players seem to hit the ball better on the ground according to basic xwOBA, but they still have worse actual outcomes.  We can see that the fast players outperform the slow ones by 50 points in unadjusted wOBA-xwOBA and only 29 points after the speed adjustment.
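
For reference, a sketch of the grouping behind that table, assuming per-player aggregates with hypothetical column names:

```python
import pandas as pd

def speed_groups(gb: pd.DataFrame) -> pd.DataFrame:
    """Speed proxy = (speed-adjusted xwOBA - unadjusted xwOBA) on grounders, per player;
    split into thirds and compare actual GB wOBA to both xwOBA flavors,
    weighting each group by its number of ground balls."""
    gb = gb.assign(speed_proxy=gb["xwoba_speed_adj"] - gb["xwoba_basic"])
    gb["speed_group"] = pd.qcut(gb["speed_proxy"], 3, labels=["slow", "avg", "fast"])

    def summarize(g):
        w = g["n_gb"]
        avg = lambda col: (g[col] * w).sum() / w.sum()
        return pd.Series({
            "actual_gb_woba": avg("gb_woba"),
            "basic_xwoba": avg("xwoba_basic"),
            "speed_adj_xwoba": avg("xwoba_speed_adj"),
            "actual_minus_basic": avg("gb_woba") - avg("xwoba_basic"),
            "actual_minus_adj": avg("gb_woba") - avg("xwoba_speed_adj"),
            "n": w.sum(),
        })

    return gb.groupby("speed_group", observed=True).apply(summarize)
```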