Stuff+ Doesn’t Have a Team-Switching Problem

Not even going to bother with a Betteridge’s Law headline here. On top of what at this point is presumptively bad-faith discussion of their own stats (rank-order correlations on team-switchers only? Really?) BP claimed that the Stuff+ metric has a team-switching problem and spent like 15 paragraphs discussing it. I’m only going to spend two paragraphs, because it just doesn’t.

Edit 5/5/2023: I went ahead and reran everything with all the same players and the exact same weighting throughout, to make sure DRA- was being compared to Stuff+ and the other stats completely fairly, and replaced this section with one composite chart.

Using data from Fangraphs and BP, I took each player-season from 2020-2022 with at least 100 pitches thrown (this got rid of position players, etc.) and took DRA-, Pitches, Stuff+, FIP, xFIP-, SIERA, and ERA. Because each season's league ERA was quite different, I converted ERA/SIERA/FIP to Stat / MLB_Average_ERA for that season and multiplied by 100 to make a (non-park-adjusted) "Stat-". DRA- and xFIP- are already on that scale. I then did an IP-weighted fit of same-season Stuff+ and "ERA-" and got predicted same-season "ERA-" = 98.93 - 1.15 * (Stuff+ - 100). I then took paired consecutive player-seasons and compared weighted RMSEs for year T's stats at predicting year T+1's "ERA-", broken down by team-switching status (No = both seasons for the same team, Yes = played for more than one team).
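A sketch of the scale conversion and the IP-weighted fit/RMSE machinery described above (function names are mine; note that np.polyfit's weights multiply the residuals, so passing sqrt(IP) minimizes the IP-weighted squared error):

```python
import numpy as np

def to_minus_scale(stat, league_avg):
    """Convert an ERA-scale stat to a (non-park-adjusted) 'Stat-':
    100 * stat / that season's MLB average ERA."""
    return 100.0 * np.asarray(stat) / league_avg

def ip_weighted_fit(x, y, ip):
    """IP-weighted least-squares fit of y = a + b*x.
    polyfit applies w to the residuals, so w=sqrt(IP) minimizes sum(IP * resid^2)."""
    b, a = np.polyfit(x, y, 1, w=np.sqrt(ip))
    return a, b

def weighted_rmse(pred, actual, ip):
    """IP-weighted RMSE, used to compare each metric's year T+1 predictions."""
    err = np.asarray(pred) - np.asarray(actual)
    return float(np.sqrt(np.average(err ** 2, weights=ip)))
```

On synthetic data lying exactly on the post's fitted line, `ip_weighted_fit` recovers a slope of -1.15 and a value of 98.93 at Stuff+ = 100.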

| RMSE, year T+1 "ERA-" | Non-switch | Switch | All |
|---|---|---|---|
| Stuff+ | 38.0 | 37.4 | 37.7 |
| "SIERA-" | 39.5 | 38.9 | 39.26 |
| DRA- | 40.0 | 38.1 | 39.29 |
| xFIP- | 40.6 | 40.0 | 40.4 |
| "FIP-" | 43.4 | 49.2 | 45.6 |
| "ERA-" | 50.8 | 60.6 | 54.6 |
| N | 588 | 409 | 997 |


Literally no problems here. Stuff+ does fine with team-switchers, does better than park-unadjusted "FIP-" across the board, and does much better on team-switchers than park-unadjusted "FIP-", as expected, since park-unadjusted FIP should be the metric taking a measurable accuracy hit from a park change. And yet somehow BP is reporting the complete opposite conclusions: 1) that Stuff+ is fine for non-switchers but becomes near-useless for team-switchers, and 2) that its performance degrades significantly compared to park-unadjusted FIP for team-switchers. Common sense and the data clearly say otherwise. DRA- grades out roughly between SIERA and xFIP- for non-switchers at predicting next season's ERA, on par with SIERA overall, and solidly behind Stuff+. (Apologies for temporarily stating it was much worse than that.)

Looking at it another way, creating an IP-weighted-RMSE-minimizing linear fit for each metric to predict next season's "ERA-" (e.g. year T+1 "ERA-" = 99 + 0.1 * (year T DRA- - 100)) gives the following chart:

| y = mx + b | intercept | slope | RMSE | r |
|---|---|---|---|---|
| Stuff+ "ERA-" | 102.42 | 0.79 | 34.16 | 0.29 |
| "SIERA-" | 103.07 | 0.53 | 34.56 | 0.25 |
| DRA- | 101.42 | 0.49 | 34.62 | 0.24 |
| xFIP- | 101.57 | 0.40 | 34.88 | 0.21 |
| "FIP-" | 101.13 | 0.21 | 35.14 | 0.17 |
| "ERA-" | 100.87 | 0.11 | 35.40 | 0.12 |
| everybody the same | 100.55 | 0.00 | 35.65 | 0.00 |

The intercepts differ slightly out of noise and slightly because the metrics aren't all centered exactly identically (SIERA has the lowest average value for whatever reason). ERA predicted from Stuff+ is the clear winner again, with DRA- again between SIERA and xFIP-. Since all the metrics being fit are on the same scale (Stuff+ was transformed into "ERA-" as in the section above), the slopes can be compared directly, and the bigger the slope, the more one point of the year-T stat predicts year-T+1 "ERA-". Well, almost, since the slopes to year-T ERA aren't exactly 1, but nothing is compressed enough to change the rank order (DRA- almost catches SIERA, but falls further behind Stuff+). One point of year-T Stuff+ "ERA-" is worth 1.00 points of year-T "ERA-" and 0.8 points of year-T+1 "ERA-". One point of year-T DRA- is worth 1.04 points of year-T "ERA-" but only 0.49 points of year-T+1 "ERA-". Stuff+ is much stickier. Fitting to switchers only, the Stuff+ slope is 0.66 and the DRA- slope is 0.46. Stuff+ is still much stickier. There's just nothing here. Stuff+ doesn't have a big team-switching problem, and points of Stuff+ "ERA-" are clearly worth more than points of DRA- going forward for switchers and non-switchers alike.

Fielding-Independent Defense

TL;DR Using Sprint Speed, Reaction, and Burst from the Statcast leaderboard pages, with no catch information (or other information) of any kind, is enough to make a good description of same-season outfield OAA and that descriptive stat makes a better prediction of next-season outfield OAA than current-season OAA does.

A recently-released not-so-great defensive metric inspired me to repurpose an old idea of mine and see how well I could model outfield defense without knowing any actual play results, and the answer is actually pretty well.  Making catches in the OF, taking positioning as a given as OAA does, is roughly based on five factors- reacting fast, accelerating fast, running fast, running to the right place, and catching the balls you’re close enough to reach.

Reacting fast has its own leaderboard metric (Reaction distance, the distance traveled in the first 1.5s), as does running fast (Sprint Speed, although it's calculated on offense). Acceleration has a metric somewhat covering it, although not as cleanly, in Burst (distance traveled between 1.5s and 3s). Running to the right place only has the Route metric, which covers the first 3s only and is confounded enough that it doesn't add anything nontrivial, so I don't use it. Actually catching the balls is deliberately left out of the FID metric (I do provide a way to incorporate catch information into next season's estimation at the end of each section).

2+ Star Plays

The first decision was which metric to model, and I went with OAA/n on 2+ star plays over OAA/n on all plays for multiple reasons: 2+ star plays are responsible for the vast majority of seasonal OAA, OAA/n on 2+ star plays correlates at over r=0.94 to OAA/n on all plays (same season), so it contains the vast majority of the information in OAA/n anyway, and the three skills I'm modeling from aren't put on full display on easier balls. Then I had to decide how to incorporate the three factors (Sprint Speed, Reaction, Burst). Reaction and Burst are already normalized to league average, and I normalized Sprint Speed to the average OF speed weighted by the number of 2+ star opportunities each player got (since that's the average sprint speed of the 2+* sample). That can get janky early in a season, before a representative sample has qualified for the leaderboard, so in that case it's probably better to just use the previous season's average sprint speed as a baseline for a while, as there's not that much variation.

| year | weighted avg OF Sprint Speed |
|---|---|
| 2016 | 27.82 |
| 2017 | 27.93 |
| 2018 | 28.01 |
| 2019 | 27.93 |
| 2020 | 27.81 |
| 2021 | 27.94 |
| 2022 | 28.00 |
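A minimal sketch of that normalization, with hypothetical inputs (the helper name is mine):

```python
import numpy as np

def sprint_speed_above_avg(speeds, opps_2plus_star):
    """Normalize Sprint Speed to the average OF speed weighted by each player's
    2+ star opportunities (i.e. the average speed of the 2+* sample)."""
    baseline = np.average(speeds, weights=opps_2plus_star)
    return np.asarray(speeds) - baseline
```

Early in a season, before a representative sample has qualified, you could swap `baseline` for the previous season's value (roughly 27.8-28.0 ft/s per the table above).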


Conveniently each stat individually was pretty linear in its relationship to OAA/n 2+* (50+ 2* opportunities shown).

Reaction isn’t convincingly near-linear as opposed to some other positively-correlated shape, but it’s also NOT convincingly nonlinear at all, so we’ll just go with it.

The outlier at 4.7 reaction is Enrique Hernandez who seems to do this regularly but didn’t have enough opportunities in the other seasons to get on the graph again.  I’m guessing he’s deliberately positioning “slightly wrong” and choosing to be in motion with the pitch instead of stationary-ish and reacting to it.  If more players start doing that, then this basic model formulation will have a problem.

Reaction Distance and Sprint Speed are conveniently barely correlated (r=-0.09), and year-to-year changes in them are correlated at r=0.1. I'd expect the latter to be closer to the truth: for physical reasons there should be a bit of positive correlation, and the first number should have a bit of "missing lower-left quadrant" bias, since you can't be bad at both and still get run out there, though the third factor in being put in the lineup (offense) is likely enough to keep that bias from being too extreme.

Burst, on the other hand... there's no sugarcoating it. It correlates at r=0.4 to Reaction Distance (moving further in the first 1.5s = likely going faster at the 1.5s mark, leading to more distance covered in the next 1.5s) and at r=0.56 to Sprint Speed, for obvious reasons. I took the easy way out with only one messy variable: I made an Expected Burst from Reaction and Sprint Speed (r=0.7 to actual Burst), then took the residual, Burst - Expected Burst, to create Excess Burst and used that as the third input. I also tried a version with Sprint Speed and Excess Bootup Distance (distance traveled in the correct direction in the first 3s) as a two-variable model, and it still "works", but it's significantly worse at both description and prediction. Excess Burst also looks fine as far as being linear with respect to OAA/n.
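A sketch of the Expected Burst / Excess Burst construction, assuming a plain least-squares fit (the post doesn't specify the exact fitting procedure, so treat this as illustrative):

```python
import numpy as np

def excess_burst(reaction, sprint_above_avg, burst):
    """Fit Expected Burst from Reaction and Sprint Speed (linear least squares),
    then return the residual Burst - Expected Burst ('Excess Burst')."""
    X = np.column_stack([np.ones(len(reaction)), reaction, sprint_above_avg])
    coef, *_ = np.linalg.lstsq(X, np.asarray(burst), rcond=None)
    return np.asarray(burst) - X @ coef
```

By construction the residual is uncorrelated with the two inputs in-sample, which is what lets it serve as an independent third input to FID.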

Looking at how the inputs behave year to year (unless indicated otherwise, all correlations use players who qualified for Jump stats (10+ 2-star opportunities) and the Sprint Speed leaderboard (10+ competitive runs) in both years, weighted by year-T 2+ star opportunities):

| weighted correlation r-values (N=712) | itself, next season | same-season OAA/n 2+* | next-season OAA/n 2+* |
|---|---|---|---|
| Sprint Speed above avg | 0.91 | 0.49 | 0.45 |
| Reaction | 0.85 | 0.40 | 0.35 |
| Burst | 0.69 | 0.79 | 0.60 |
| Excess Burst | 0.53 | 0.41 | 0.21 |

Sprint Speed was already widely known to be highly reliable year-over-year, so no surprise there, and Reaction grades out almost as well, but Burst clearly doesn't, particularly in the year-to-year drop in correlation to OAA/n. Given that the start of the run (Reaction) holds up year-to-year, and that the end of the run (Sprint Speed) holds up year-to-year, it's very strange to me that the middle of the run wouldn't. I could certainly believe a world where there's less skill variation in the middle of the run, which would shrink the first column, but that doesn't explain the dropoff in correlation to OAA/n. Play-level data isn't available, but what I think is happening is that because Burst is calculated on ALL 2+* plays, it's a mixture of maximum burst AND how often a player chose to use it: players most definitely don't bust ass for 3 seconds on plenty of the 2+* plays they don't get close to making (and that isn't even necessarily wrong on any given play; getting in position to field the bounce and throw can be the right decision).

I would expect maximum burst to hold up roughly as well as Reaction or Sprint Speed in the OAA/n correlations, but the choice component is the kind of noisy yes-no variable that takes longer to stabilize than a stat that’s closer to a physical attribute.  While there’s almost no difference in Sprint Speed year-to-year correlation for players with 60+ attempts (n=203) and players with 30 or fewer attempts (n=265), r=0.92 to 0.90, there’s a huge dropoff in Burst, r=0.77 to r=0.57.

This choice variable is also a proxy for making catches- if you burst after a ball that some players don’t, you have a chance to make a catch that they won’t, and if you don’t burst after a ball that other players do, you have no chance to make a catch that they might.  That probably also explains why the Burst metric and Excess Burst are relatively overcorrelated to same-season OAA/n.

Now that we've discussed the components, let's introduce the Fielding-Independent Defense concept. It's a DESCRIPTIVE stat: the values of A/B/C/D in FID = A + B*Reaction + C*Sprint Speed above average + D*Excess Burst that minimize SAME-SEASON opportunity-weighted RMSE between FID and OAA/n on whatever plays we're looking at (here, 2+ star plays). Putting FID on the success probability added scale (e.g. if Statcast had an average catch probability of 50% on a player's opportunities and he caught 51%, he's +1% success probability added; if FID expects him to catch 52%, he'd be +2% FID), I get

FID = 0.1% + 4.4% * Reaction + 5.4% * Sprint Speed above average + 6.4% * Excess Burst.

In an ideal world, the Y-intercept (A) would be 0, because in a linear model somebody who's average at every component should be average. But "Sprint Speed above average" here is relative to the weighted average of players who had enough opportunities to have stats calculated, which isn't exactly the average including players who didn't, so I let the intercept float, and 0.1% on 2+ star plays is less than 0.1 OAA over a full season, so I'm not terribly worried at this point. And yes, I know I'm nominally adding things that don't even use the same units (ft vs. ft/s) and coming up with a number in outs/opportunity, so be careful porting this idea to different measurement systems, because the coefficients would need to be converted from ft^-1, etc.
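To make the scale concrete, here's a minimal sketch of the fitted 2+* formula as a function (the function name is mine; inputs are in the leaderboard's units of ft and ft/s):

```python
def fid_2plus_star(reaction, sprint_speed_above_avg, excess_burst):
    """FID on 2+ star plays, in success-probability-added terms,
    using the coefficients from the fit above."""
    return 0.001 + 0.044 * reaction + 0.054 * sprint_speed_above_avg + 0.064 * excess_burst
```

For example, a player at +0.5 ft Reaction, +0.5 ft/s Sprint Speed, and average Excess Burst comes out at +5.0% success probability added on 2+ star plays, and a player at +1.0 in every component comes out at +16.3%.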

So, how well does it do?

| weighted correlation r-values (N=712) | itself, next season | same-season OAA/n 2+* | next-season OAA/n 2+* |
|---|---|---|---|
| FID 2+* | 0.748 | 0.806 | 0.633 |
| OAA/n 2+* | 0.591 | 1.000 | 0.591 |
| OAA-FID | 0.156 | 0.587 | 0.132 |
| Regressed FID and (OAA-FID) | 0.761 | 0.890 | 0.647 |

That's pretty darn good for a purely descriptive stat that doesn't know the outcome of any play, although 2+* plays do lean on physical skills more than 0-1* plays. The OAA-FID residual (actual catch information) does contain some predictive value, but it doesn't help a ton. The regression amounts I came up with (the best-fit regression for FID and OAA-FID together to predict next season's OAA/n on 2+*) were 12 opportunities for FID and 218 opportunities for OAA-FID. Given that a full-time season is 70-80 2+ star opportunities on average (more for CF, fewer for corner), FID is half-real in a month, and OAA-FID would be half-real if a season lasted 18 straight months. Those aren't the usual split-sample correlations, since there isn't any split-sample data available, but regressions based on players with different numbers of opportunities. That has its own potential issues, but 12 and 218 should be in the ballpark. FID stabilizes super-fast.
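Those regression constants plug into the standard shrinkage formula; a minimal sketch (function name mine, constants from the fits above):

```python
def regress_toward_mean(rate, n, k, prior=0.0):
    """Shrink an observed per-opportunity rate toward the prior using a
    regression constant k: the sample size at which the stat is 'half-real'."""
    return (n * rate + k * prior) / (n + k)
```

With k=12 for FID and k=218 for OAA-FID, after a 75-opportunity season FID keeps 75/(75+12) = 86% of its observed value, while OAA-FID keeps only 75/(75+218) = 26%.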

0 and 1 Star Plays

Since 0 and 1 star plays have a catch probability of at least 90%, there's an upper bound on success probability added, and while players reaching +10% on 2+* plays isn't rare, that obviously isn't going to translate directly to a 105% catch rate on 0-1* plays. I did the same analysis for 0-1* plays as for 2+* plays, using the 2+* metrics as well as a FID fit specifically to 0-1* plays. Everything here is weighted by the player's number of 0-1* opportunities in year T.

| weighted correlation r-values (N=712) | itself, next season | same-season OAA/n 0/1* | next-season OAA/n 0/1* |
|---|---|---|---|
| FID 2+* | 0.75 | 0.20 | 0.25 |
| OAA/n 2+* | 0.59 | 0.27 | 0.27 |
| OAA-FID 2+* | 0.15 | 0.18 | 0.11 |
| FID reweighted for only 0/1* | 0.75 | 0.22 | 0.26 |
| OAA/n 0/1* | 0.13 | 1.00 | 0.13 |
| OAA-FID 0/1* | 0.07 | 0.97 | 0.07 |

The obvious takeaway is that these correlations suck compared to the ones for 2+* plays, and that's the result of a much larger spread in talent at catching 2+* balls compared to easier ones. 2+* plays are ~25% of total plays but comprise ~80% of a player's OAA; 0-1* plays are three times as numerous and comprise only ~20% of OAA. OAA on 0-1* plays is just harder to predict because there's less signal in it, as seen by how horrendously it self-correlates.

The other oddities are that OAA/n on 2+* out-correlates FID from 2+*, and that both FIDs do better on the next season than the current one. For the former: the residual OAA-FID on 2+* plays has comparable signal, 0-1* plays weight "catch the ball" skill much more heavily relative to physical skills than 2+* plays do, and OAA (and especially the residual) weight "catch the ball" heavily, so that's my best guess. As for why FID correlates better to next season, I don't have a great explanation. It was odd enough that I recreated the entire analysis from scratch, but I got the same thing again. I broke it down by year and didn't see anything strange there either. 2020 correlated worse, of course, but everything is attempt-weighted, so that shouldn't much matter for same-season correlations.

Unsurprisingly, Reaction grades out as relatively much less important compared to Sprint Speed and Burst on these plays than it did on 2+* plays.

All plays together

Doing the same analysis on all plays, and comparing the 2+* metrics to the all-play metrics, we get

| weighted correlation r-values (N=712) | itself, next season | same-season OAA/n all plays | next-season OAA/n all plays |
|---|---|---|---|
| FID 2+* | 0.749 | 0.758 | 0.627 |
| OAA/n 2+* | 0.589 | 0.943 | 0.597 |
| OAA-FID 2+* | 0.154 | 0.559 | 0.153 |
| FID reweighted for all plays | 0.751 | 0.759 | 0.628 |
| OAA/n all plays | 0.601 | 1.000 | 0.601 |
| OAA-FID all plays | 0.210 | 0.629 | 0.173 |
| Regressed FID and (OAA-FID) 2+* only | 0.760 | 0.842 | 0.642 |
| Regressed FID and (OAA-FID) all plays | 0.763 | 0.894 | 0.652 |

with regression numbers of 35 opportunities for all-play FID and 520 opportunities (~2 full seasons) for all-play OAA-FID. As a descriptive stat, you get basically as far as you're going to go with FID just by looking at 2+* plays, and that's also enough to outpredict last season's OAA/n for all plays. The relative weights are fairly similar here: 1 : 1.34 : 1.57 Reaction : Sprint Speed : Excess Burst for all plays, compared to 1 : 1.23 : 1.48 for only 2+* plays, with Reaction relatively a bit less important when considering all plays, as expected given the 0-1* section. The equation for all plays is:

FID = 0.08% + 1.12% * Reaction + 1.50% * Sprint Speed above average + 1.76% * Excess Burst.

The coefficients are smaller than in the 2+* version because there's less variation in OAA/n on all plays than on 2+* plays, but to convert to a seasonal number there are roughly 4 times as many total plays as 2+* plays, so it all works out. A hypothetical player who's +1.0 in every component would be +16.3% on 2+ star plays and +4.46% on all plays. Given a breakdown of 70 2+* attempts and 210 0/1* attempts, 280 total attempts in a season, this player would be expected to be +12.5 OAA overall: +11.4 OAA on 2+* plays and +1.1 OAA on 0/1* plays.
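The seasonal arithmetic for that hypothetical player can be checked directly (the 70/210 attempt mix is the one assumed in the paragraph above):

```python
# Hypothetical player who is +1.0 in every component, using both FID fits from this post
fid_2star = 0.001 + 0.044 + 0.054 + 0.064    # 0.163 -> +16.3% on 2+ star plays
fid_all = 0.0008 + 0.0112 + 0.0150 + 0.0176  # 0.0446 -> +4.46% on all plays

# Assumed seasonal attempt mix: 70 2+ star and 210 0/1 star (280 total)
oaa_total = 280 * fid_all            # ~ +12.5 OAA overall
oaa_2star = 70 * fid_2star           # ~ +11.4 OAA on 2+ star plays
oaa_01star = oaa_total - oaa_2star   # ~ +1.1 OAA on 0/1 star plays
```

The same arithmetic gives the rules of thumb quoted later: one unit of each component times its all-play coefficient times ~280 attempts is roughly 3, 4, and 5 OAA for Reaction, Sprint Speed, and Excess Burst respectively.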

Defensive Aging

FID also allows a quick look at where aging comes from across the paired seasons.  Sprint Speed is roughly a straight line down with age, and it’s the biggest driver of decline by far.

| | Reaction | Sprint Speed above avg | Excess Burst | Total | FID (all) | OAA/n (all) |
|---|---|---|---|---|---|---|
| Change | -0.0410 | -0.1860 | 0.0133 | | -0.0030 | -0.0029 |
| Expected impact (in catch %) | -0.046% | -0.279% | 0.023% | -0.302% | -0.302% | -0.285% |

That's around a 0.9-OAA average seasonal drop from getting a year older, and it's a very good match for the decline predicted by the worse FID components. As a rough rule of thumb, 1 marginal foot of Reaction distance is worth 3 OAA over a full year, 1 ft/s of Sprint Speed is worth 4 OAA, and 1 foot of Excess Burst is worth 5 OAA.


Sprint Speed, Reaction Distance, and Excess Burst really do cover most of the skill variation in defense, and by the time players have reached the major-league level, their routes and ability to catch the ball are sufficiently good that those differences are overwhelmed by physical differences. This measure depends on that being true: at lower levels it might not hold as well, and the regression number for OAA-FID could drop much lower, to where you'd be throwing away a lot of information by not considering OAA-FID. It's also definitely vulnerable to Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." It's hard to uselessly game your FIP to be much lower than it would be if you were pitching normally, but without play-level data (or observation) to detect players in motion at the pitch, anybody who wanted to game FID could probably do it pretty easily.

Range Defense Added and OAA- Outfield Edition

TL;DR It massively cheats and it’s bad, just ignore it.

First, OAA finally lets us compare all outfielders to each other regardless of position and without need for a positional adjustment.  Range Defense Added unsolves that problem and goes back to comparing position-by-position.  It also produces some absolutely batshit numbers.

From 2022:

| Name | Position | Innings | Range Out Score | Fielded Plays |
|---|---|---|---|---|
| Giancarlo Stanton | LF | 32 | -21.5 | 6 |
| Giancarlo Stanton | RF | 280.7 | -6.8 | 90 |

Stanton was -2 OAA on the year in ~300 innings (about a quarter of a season). An ROS of -21.5 sustained over a full season is equivalent to pushing -50 OAA. The worst qualified season of the Statcast era is 2016 Matt Kemp (-26 OAA in 240 opportunities), and that isn't even -*10*% success probability added (the analogue of ROS), much less -21.5%. The worst seasons at 50+ attempts (~300 innings) are 2017 Trumbo and 2019 Jackson Frazier at -12%. Maybe 2022 Yadier Molina converted to a full-time CF could have pulled off -21.5%, but nobody who's actually been put in the outfield voluntarily for 300 innings in the Statcast era is anywhere near that terrible. That's just not a number a sane model can put out without a hell of a reason, and 2022 Stanton was just bad in the field, not "craploads worse than end-stage Kemp and Trumbo" material.

| Name | Position | Innings | Range Out Score | Fielded Plays |
|---|---|---|---|---|
| Luis Barrera | CF | 1 | 6.1 | 2 |
| Luis Barrera | LF | 98.7 | 2 | 38 |
| Luis Barrera | RF | 101 | 4.6 | 37 |

I thought CF was supposed to be the harder position.  No idea where that number comes from.  Barrera has played OF quite well in his limited time, but not +6.1% over the average CF well.

As I did with the infield edition, I'll be using rate stats (Range Out Score and OAA/inning) for correlations, and each player-position-year combo is treated separately. It's also important to repeat the reminder that BP will blatantly cheat to improve correlations without mentioning anything about what they're doing in the announcements, and they're almost certainly doing that again here.

Here's a chart with year-to-year correlations broken down by innings tranches (weighted by the minimum innings of the two paired years):

| LF | OAA to OAA | ROS to ROS | ROS to OAA | Lower innings | Higher innings | Inn at other positions, year T | Inn at other positions, year T+1 | n |
|---|---|---|---|---|---|---|---|---|
| 0 to 10 | -0.06 | 0.21 | -0.11 | 6 | 102 | 246 | 267 | 129 |
| 10 to 25 | -0.04 | 0.43 | 0.08 | 17 | 125 | 287 | 332 | 128 |
| 25 to 50 | 0.10 | 0.73 | 0.30 | 35 | 175 | 355 | 318 | 135 |
| 50 to 100 | 0.36 | 0.67 | 0.23 | 73 | 240 | 338 | 342 | 120 |
| 100 to 200 | 0.27 | 0.78 | 0.33 | 142 | 384 | 310 | 303 | 121 |
| 200 to 400 | 0.49 | 0.71 | 0.37 | 284 | 581 | 253 | 259 | 85 |
| 400+ inn | 0.52 | 0.56 | 0.32 | 707 | 957 | 154 | 124 | 75 |

| RF | OAA to OAA | ROS to ROS | ROS to OAA | Lower innings | Higher innings | Inn at other positions, year T | Inn at other positions, year T+1 | n |
|---|---|---|---|---|---|---|---|---|
| 0 to 10 | 0.10 | 0.34 | 0.05 | 5 | 91 | 303 | 322 | 121 |
| 10 to 25 | 0.05 | 0.57 | 0.07 | 16 | 140 | 321 | 299 | 128 |
| 25 to 50 | 0.26 | 0.59 | 0.14 | 36 | 186 | 339 | 350 | 101 |
| 50 to 100 | 0.09 | 0.75 | 0.16 | 68 | 244 | 367 | 360 | 168 |
| 100 to 200 | 0.38 | 0.72 | 0.42 | 137 | 347 | 376 | 370 | 83 |
| 200 to 400 | 0.30 | 0.68 | 0.43 | 291 | 622 | 245 | 210 | 83 |
| 400+ inn | 0.60 | 0.58 | 0.32 | 725 | 1026 | 120 | 129 | 92 |

| CF | OAA to OAA | ROS to ROS | ROS to OAA | Lower innings | Higher innings | Inn at other positions, year T | Inn at other positions, year T+1 | n |
|---|---|---|---|---|---|---|---|---|
| 0 to 10 | 0.00 | 0.16 | 0.09 | 5 | 161 | 337 | 391 | 83 |
| 10 to 25 | 0.00 | 0.42 | -0.01 | 17 | 187 | 314 | 362 | 95 |
| 25 to 50 | 0.04 | 0.36 | 0.03 | 34 | 234 | 241 | 294 | 73 |
| 50 to 100 | 0.16 | 0.56 | 0.09 | 70 | 305 | 299 | 285 | 100 |
| 100 to 200 | 0.34 | 0.70 | 0.42 | 148 | 434 | 314 | 305 | 95 |
| 200 to 400 | 0.47 | 0.66 | 0.25 | 292 | 581 | 228 | 230 | 86 |
| 400+ inn | 0.48 | 0.45 | 0.22 | 754 | 995 | 134 | 77 | 58 |

Focus on the left side of the chart first.  OAA/inning behaves reasonably, being completely useless for very small numbers of innings and then doing fine for players who actually play a lot.  ROS is simply insane.  Outfielders in aggregate get an opportunity to make a catch every ~4 innings (where opportunity is a play that the best fielders would have a nonzero chance at, not something completely uncatchable that they happen to pick up after it’s hit the ground).

ROS is claiming meaningful correlations on 1-2 opportunities, and after ~10 opportunities it's posting year-to-year correlations on par with OAA's after a full season. That's simply impossible (or beyond astronomically unlikely) with ~10 yes/no outcome data points and average talent variation well under +/-10%. The only way to do it is by using some kind of outside information to cheat (time spent at DH/1B? who knows, who cares).
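A quick simulation makes the point: assume a generous +/-5 percentage point talent spread around a ~40% catch rate and 10 yes/no opportunities per season (all numbers assumed for illustration). The year-to-year correlation of observed rates tops out around r=0.1, nowhere near what ROS posts:

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_opps = 5000, 10

# True talent: catch probability with a generous sd of 5 percentage points
talent = np.clip(rng.normal(0.40, 0.05, n_players), 0.01, 0.99)

# Two independent "seasons" of ~10 binomial opportunities each
year1 = rng.binomial(n_opps, talent) / n_opps
year2 = rng.binomial(n_opps, talent) / n_opps

r = np.corrcoef(year1, year2)[0, 1]
# Analytically: r ~ var(talent) / (var(talent) + p*(1-p)/n) ~ 0.0025 / 0.0265 ~ 0.1
```

Any metric beating that at 10 opportunities has to be getting information from somewhere other than the play outcomes.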

I don't know why the 0-10 inning correlations are so low. Those players played a fair bit at other positions (see the right side of the table), so any proxy cheat measures should have reasonably stabilized, but maybe the model is just generically batshit nonsense at extremely low opportunities at a position for some unknown reason, as happened with the DRC+ rollout (look at the gigantic DRC+ spread on 1 PA, 1 uBB pitchers in the cheating link above).

Also, once ROS crosses the 200-inning threshold, it starts getting actively worse at correlating to itself.  Across all three positions, it correlates much better at lower innings totals and then shits the bed once it starts trying to correlate full-time seasons to full-time seasons.  This is obviously completely backwards of how a metric should behave and more evidence that the basic model behavior here is “good correlation based on cheating (outside information) that’s diluted by mediocre correlation on actual play-outcome data.”

They actually do "improve" on team switchers here relative to non-switchers (instead of being the worst, as they were in the infield), again likely due to overfitting to a fairly small number of players, but it's still nothing of note given how bad they are relative to OAA's year-to-year correlations for regular players, even with the cheating.

OAA and the New Baseball Prospectus Defensive Metric Range Defense Added: Infield Edition

TL;DR Use OAA. Ignore RDA/ROS.

Baseball Prospectus came out with a new defensive metric in the vein of their DRC+ and DRA- stats.   If you’re familiar with my commentary on DRC+, this is going to hit some of the same notes, but it’s still worth a quick read if you made it this far.  The infield and outfield models for RDA behave extremely differently, so I’m going to discuss each one in a separate post. The outfield post is here.

The infield model is simply bad compared to OAA/DRS. If somebody is giving you hype-job statistics and only tells you how well a team does against a non-division same-league opponent who's at least 2 games below .500, while wearing uniforms with a secondary color hex code between #C39797 and #FFFFFF, in Tuesday day games, during a waxing gibbous moon... well, that ought to make you immediately suspicious of how bad everything else is. And the same goes for the statistics cited in the RDA article.

That is... the opposite of a resounding win for ROS/RDA. And it's worse than it looks, because OAA is (theoretically, and likely practically) the best at stripping out fielder positioning, while DRS and ROS will retain some residual positioning information that will self-correlate to some extent. DRS also contains additional information (extra penalty for botched balls down the line, throwing errors, double plays) that likely helps it self-correlate better, and ROS/RDA appear to contain outside information, as described above, which will also help them self-correlate better.

| | OAA/inn | DRS/inn | ROS | RDA/inn | N |
|---|---|---|---|---|---|
| to OAA | 0.44 | 0.32 | 0.22 | 0.21 | 177 |
| to DRS | 0.26 | 0.45 | 0.30 | 0.30 | 177 |

ROS/RDA correlating significantly better to DRS than to OAA is suggestive of a fair bit of its year-to-year self-correlation being to non-demonstrated-fielding-skill information.

Even in their supposed area of supremacy, team-switchers, infield ROS/RDA is still bad. I classified players as non-switchers (played both seasons for the same team only), offseason switchers (played all of year T for one team and all of year T+1 for a different team), or midseason switchers (switched teams in the middle of at least one season).

| All IF | OAA/inn | DRS/inn | ROS | RDA/inn | n |
|---|---|---|---|---|---|
| Offseason | 0.40 | 0.45 | 0.43 | 0.46 | 79 |
| Midseason | 0.39 | 0.31 | 0.13 | 0.11 | 91 |
| Off or Mid | 0.39 | 0.38 | 0.28 | 0.28 | 170 |
| No Switch | 0.45 | 0.45 | 0.37 | 0.36 | 541 |
| All | 0.44 | 0.45 | 0.36 | 0.35 | 711 |

They match OAA/DRS on offseason switchers (likely due to overfitting their model to a small number of players) but are absolutely atrocious on midseason switchers, and they actually have the *biggest* overall drop in reliability between non-switchers and switchers. I don't think there's much more to say. Infield RDA/ROS isn't better than OAA/DRS. It isn't even close to equal to OAA/DRS.

Technical notes: I sourced OAA from Fangraphs because I didn’t see a convenient way to grab OAA by position from Savant without scraping individual player pages (the OAA Leaderboard .csv with a position filter doesn’t include everybody who played a position).  This meant that the slightly inconvenient way of grabbing attempts from Savant wasn’t useful here because it also couldn’t split attempts by position, so I was left with innings as a denominator.  Fangraphs doesn’t have a (convenient?) way to split defensive seasons between teams, while BP does split between teams on their leaderboard, so I had to combine split-team seasons and used a weighted average by innings.  Innings by position match between BP and FG in 98.9% of cases and the differences are only a couple of innings here and there, nothing that should make much difference to anything.
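For the split-team combination step, a minimal pandas sketch with hypothetical numbers (the column names are mine, not BP's or Fangraphs'):

```python
import pandas as pd

# One row per player-position-team stint, as on BP's leaderboard
df = pd.DataFrame({
    "player": ["A", "A", "B"],
    "position": ["2B", "2B", "SS"],
    "year": [2022, 2022, 2022],
    "innings": [300.0, 200.0, 900.0],
    "ros": [0.02, -0.01, 0.01],
})

# Innings-weighted average of the rate stat across each player's stints
df["w_ros"] = df["ros"] * df["innings"]
combined = df.groupby(["player", "position", "year"], as_index=False)[["innings", "w_ros"]].sum()
combined["ros"] = combined["w_ros"] / combined["innings"]
```

Player A's two stints collapse to one 500-inning season with ROS = (300*0.02 + 200*(-0.01)) / 500 = 0.008.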

About MLB’s New Mudding and Storage Protocol

My prior research on the slippery ball problem: Baseball’s Last Mile Problem

The TL;DR is that mudding adds moisture to the surface of the ball.  Under normal conditions (i.e. stored with free airflow where it was stored before mudding), that moisture evaporates off in a few hours and leaves a good ball.  If that evaporation is stopped, the ball goes to complete hell and becomes more slippery than a new ball.  This is not fixed by time in free airflow afterwards.

My hypothesis is that the balls were sometimes getting stored in environments with sufficiently restricted airflow (the nylon ball bag) too soon after mudding, and that stopped the evaporation.  This only became a problem this season with the change to mudding all balls on gameday and storing them in a zipped nylon bag before the game.

MLB released a new memo yesterday that attempts to standardize the mudding and storage procedure. Of the five bullet points, one (AFAIK) is not a change: balls were already supposed to sit in the humidor for at least 14 days. Attempting to standardize the application procedure and providing a poster with allowable darkness/lightness levels are obviously good things. It may be relevant here if the only problem balls were the muddiest (aka wettest), which shouldn't happen anymore, but from anecdotal reports there were problem balls where players didn't think the balls had been mudded at all, and unless they're blind, that seems hard to reconcile with also being too dark/too heavily mudded. So this may help some balls, but probably not all of them.

Gameday Mudding

The other points are more interesting.  Requiring all balls to be mudded within 3 hours of each other could be good or bad.  If it eliminates stragglers getting mudded late, this is good.  If it pushes all mudding closer to gametime, this is bad.  Either way, unless MLB knows something I don’t (which is certainly possible- they’re a business worth billions and I’m one guy doing science in my kitchen), the whole gameday mudding thing makes *absolutely no sense* to me at all in any way.

Pre-mudding, all balls everywhere** are all equilibrated in the humidor the same way.  Post-mudding, the surface is disrupted with transient excess moisture.  If you want the balls restandardized for the game, then YOU MAKE SURE YOU GIVE THE BALL SURFACE TIME AFTER MUDDING TO REEQUILIBRATE TO A STANDARD ENVIRONMENT BEFORE DOING ANYTHING ELSE WITH THE BALL. And that takes hours.

In a world without universal humidors, gameday mudding might make sense, since later storage could be widely divergent. Now, it's exactly the same everywhere**. Unless MLB has evidence that a mudded ball sitting overnight in the humidor goes to hell (I tested and found no evidence for that at all, though obviously my testing at home isn't world-class; also, if it were a problem, it should have shown up frequently in humidor parks before this season), I have no idea why you would mud on gameday instead of the day before, like it was done last season. The evaporation time between mudding and going in the nylon bag for the game might not be long enough if mudding is done on gameday, but mudding the day before means it definitely is.

Ball Bag Changes

Cleaning the ball bag seems like it can’t hurt anything, but I’m also not sure it helps anything. I’m guessing that ball bag hygiene over all levels of the sport and prior seasons of MLB was generally pretty bad, yet somehow it was never a problem.  They’ve seen the bottom of the bags though.  I haven’t. If there’s something going on there, I’d expect it to be a symptom of something else and not a primary problem.

Limiting to 96 balls per bag is also kind of strange.  If there is something real about the bottom of the bag effect, I’d expect it to be *the bottom of the bag effect*.  As long as the number of balls is sufficient to require a tall stack in the bag (and 96 still is), and since compression at these number ranges doesn’t seem relevant (prior research post), I don’t have a physical model of what could be going on that would make much difference for being ball 120 of 120 vs. ball 96 of 96.  Also, if the bottom of the bag effect really is a primary problem this year, why wasn’t it a problem in the past?  Unless they’re using entirely new types of bags this season, which I haven’t seen mentioned, we should have seen it before.  But I’m theorizing and they may have been testing, so treat that paragraph with an appropriate level of skepticism.

Also, since MLB uses more than 96 balls on average in a game, this means that balls will need to come from multiple batches.  This seemed like it had the potential to be significantly bad (late-inning balls being stored in a different bag for much longer), but according to an AP report on the memo:

“In an effort to reduce time in ball bags, balls are to be taken from the humidor 15-30 minutes before the scheduled start, and then no more than 96 balls at a time.  When needed, up to 96 more balls may be taken from the humidor, and they should not be mixed in bags with balls from the earlier bunch.”

This seems generally like a step in the smart direction, like they’d identified being zipped up in the bag as a potential problem (or gotten the idea from reading my previous post from 30 days ago :)).  I don’t know if it’s a sufficient mitigation because I don’t know exactly how long it takes for the balls to go to hell (60 minutes in near airtight made them complete garbage, so damage certainly appears in less time, but I don’t know how fast and can’t quickly test that).  And again, repeating the mantra from before, time spent in the ball bag *is only an issue if the balls haven’t evaporated off after mudding*.  And that problem is slam-dunk guaranteed solvable by mudding the day before, and then this whole section would be irrelevant.

Box Storage

The final point, “all balls should be placed back in the Rawlings boxes with dividers, and the boxes should then be placed in the humidor. In the past, balls were allowed to go directly into the humidor.” could be either extremely important or absolutely nothing.  This doesn’t say whether the boxes should be open or closed (have the box top on) in the humidor.  I tweeted to the ESPN writer and didn’t get an answer.

The boxes can be seen in the two images.  If they’re open (and not stacked or otherwise covered to restrict airflow), this is fine and at least as good as whatever was done before today.  If the boxes are closed, it could be a real problem.  Like the nylon ball bag, this is also a restricted-flow environment, and unlike the nylon ball bag, some balls will *definitely* get in the box before they’ve had time to evaporate off (since they go in shortly after mudding).

I have one Rawlings box without all the dividers.  The box isn’t airtight, but it’s hugely restricted airflow.  I put 3 moistened balls in the box along with a hygrometer and the RH increased 5% and the balls lost moisture about half as fast as they did in free air.  The box itself absorbed no relevant amount.  With 6 moistened balls in the box, the RH increased 7% (the maximum moistened balls in a confined space will do per prior research) and they lost moisture between 1/3 and 1/4 as fast as in free air.

Unlike the experiments in the previous post where the balls were literally sealed, there is still some moisture flux off the surface here.  I don’t know if it’s enough to stop the balls from going to hell.  It would take me weeks to get unmudded equilibrated balls to actually do mudding test runs in a closed box, and I only found out about this change yesterday with everybody else.  Even if the flux is still sufficient to avoid the balls going to hell directly, the evaporation time appears to be lengthened significantly, and that means that balls are more likely to make it into the closed nylon bag before they’ve evaporated off, which could also cause problems at that point (if there’s still enough time for problems there- see previous section).

The 3 and 6 ball experiments are one run each, in my ball box, which may have a better or worse seal than the average Rawlings box, and the dividers may matter (although they don’t seem to absorb very much moisture from the air, prior post), etc.  Error bars are fairly wide on the relative rates of evaporation, but hygrometers don’t lie.  There doesn’t seem to be any way a closed box isn’t measurably restricting airflow and increasing humidity inside unless the box design changed a lot in the last 3 years.  Maybe that humidity increase/restricted airflow isn’t enough to matter directly or indirectly, but it’s a complete negative freeroll.  Nothing good can come of it.  Bad things might.  If there are reports somewhere this week that tons of balls were garbage, closed-box storage after mudding is the likely culprit.  Or the instructions will actually be uncovered open box (and obeyed) and the last 5 paragraphs will be completely irrelevant.  That would be good.

Conclusion: A few of the changes are obviously common-sense good.  Gameday mudding continues to make no sense to me and looks like it’s just asking for trouble.  Box storage in the humidor after mudding, if the boxes are closed, may be introducing a new problem. It’s unclear to me if the new ball-bag procedures reduce time sufficiently to prevent restricted-airflow problems from arising there, although it’s at least clearly a considered attempt to mitigate a potential problem.

Baseball’s Last Mile Problem

2022 has brought a constant barrage of players criticizing the baseballs as hard to grip and wildly inconsistent from inning to inning, and probably not coincidentally, a spike in throwing error rates to boot.  “Can’t get a grip” and “throwing error” do seem like they might go together.  MLB has denied any change in the manufacturing process, however there have been changes this season in how balls are handled in the stadium, and I believe that is likely to be the culprit.

I have a plausible explanation for how the new ball-handling protocol can cause even identical balls from identical humidors to turn out wildly different on the field, and it’s backed up by experiments and measurements I’ve done on several balls I have, but until those experiments can be repeated at an actual MLB facility (hint, hint), this is still just a hypothesis, albeit a pretty good one IMO.

Throwing Errors

First, to quantify the throwing errors, I used Throwing Errors + Assists as a proxy for attempted throws (it doesn’t count throws that are accurate but late, etc), and broke down TE/(TE+A) by infield position.

TE/(TE+A)   2022     2021     2011-20 max   21-22 Increase   2022 By Chance
C           9.70%    7.10%    9.19%         36.5%            1.9%
3B          3.61%    2.72%    3.16%         32.7%            0.8%
SS          2.20%    2.17%    2.21%         1.5%             46.9%
2B          1.40%    1.20%    1.36%         15.9%            20.1%

By Chance is the binomial odds of getting the 2022 rate or worse using 2021 as the true odds.  Not only are throwing errors per “opportunity” up over 2021, but they’re higher than every single season in the 10 years before that as well, and way higher for C and 3B.   C and 3B have the least time on average to establish a grip before throwing.  This would be interesting even without players complaining left and right about the grip.
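The “By Chance” column is just a binomial tail probability, which is easy to reproduce. A minimal sketch in Python, with an *assumed* opportunity count since the raw TE+A totals aren’t listed here (the count below is invented purely for illustration):

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# The raw TE+A counts aren't given above, so this opportunity count is an
# assumption for illustration only, chosen to land near the table's 1.9%.
n_opps = 430                        # assumed 2022 catcher throws (TE + A)
p_2021 = 0.071                      # 2021 catcher TE rate, from the table
k_2022 = math.ceil(0.097 * n_opps)  # 2022 rate applied to the assumed n

print(f"P(2022 rate or worse by chance) = {binom_sf(k_2022, n_opps, p_2021):.3f}")
```

With real opportunity counts in place of the assumed `n_opps`, the same sum gives the table’s probabilities directly.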

The Last Mile

To explain what I suspect is causing this, I need to break down the baseball supply chain.  Baseballs are manufactured in a Rawlings factory, stored in conditions that, to the best of my knowledge, have never been made public, shipped to teams, sometimes stored again in unknown conditions outside a humidor, stored in a humidor for at least 2 weeks, and then prepared and used in a game.  Borrowing the term from telecommunications and delivery logistics, we’ll call everything after the 2+ weeks in the humidor the last mile.

Humidors were in use in 9 parks last year, and Meredith Wills has found that many of the balls this year are from the same batches as balls in 2021.  So we have literally some of the same balls in literally the same humidors, and there were no widespread grip complaints (or equivalent throwing error rates) in 2021.  This makes it rather likely that the difference, assuming there really is one, is occurring somewhere in the last mile.

The last mile starts with a baseball that has just spent 2+ weeks in the humidor.  That is long enough to equilibrate, per other prior published research and my own past experiments.  Getting atmospheric humidity changes to meaningfully affect the core of a baseball takes on the order of days to weeks.  That means that nothing humidity-wise in the last mile has any meaningful impact on the ball’s core because there’s not enough time for that to happen.

This article from the San Francisco Chronicle details how balls are prepared for a game after being removed from the humidor, and since that’s paywalled, a rough outline is:

  1. Removed from humidor at some point on gameday
  2. Rubbed with mud/water to remove gloss
  3. Reviewed by umpires
  4. Not kept out of the humidor for more than 2 hours
  5. Put in a security-sealed bag that’s only opened in the dugout when needed

While I don’t have 2022 game balls or official mud, I do have some 2019* balls, water, and dirt, so I decided to do some science at home.  Again, while I have confidence in my experiments done with my balls and my dirt, these aren’t exactly the same things being used in MLB, so it’s possible that what I found isn’t relevant to the 2022 questions.

Update: Dr. Wills informed me that 2019, and only 2019, had a production issue that resulted in squashed leather and could have affected the mudding results.  She checked my batch code, and it looks like my balls were made late enough in 2019 that they were actually used in 2020 with the non-problematic production method.  Yay.

Experiments With Water

When small amounts of water are rubbed on the surface of a ball, it absorbs pretty readily (the leather and laces love water), and once the external source of water is removed, that creates a situation where the outer edge of the ball is more moisture-rich than what’s slightly further inside and more moisture-rich than the atmosphere.  The water isn’t going to just stay there- it’s either going to evaporate off or start going slightly deeper into the leather as well.

As it turns out, if the baseball is rubbed with water and then stored with unrestricted air access (and no powered airflow) in the environment it was equilibrated with, the water entirely evaporates off fairly quickly with an excess-water half-life of a little over an hour (and this would likely be lower with powered air circulation) and goes right back to its pre-rub weight down to 0.01g precision.  So after a few hours, assuming you only added a reasonable amount of water to the surface (I was approaching 0.75 grams added at the most) and didn’t submerge the ball in a toilet or something ridiculous, you’d never know anything had happened.  These surface moisture changes are MUCH faster than the days-to-weeks timescales of core moisture changes.
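The evaporation behavior is easy to picture as simple exponential decay. A small sketch using the figures measured above (0.75 g of excess water, a half-life of roughly 70 minutes in free air):

```python
# Exponential-decay picture of the rub-down experiment: the 0.75 g of
# excess surface water and the ~70-minute half-life are the figures
# measured above (free airflow, equilibrated environment).

def excess_water(m0_g, half_life_min, t_min):
    """Grams of excess surface water remaining after t_min minutes."""
    return m0_g * 0.5 ** (t_min / half_life_min)

m0, half_life = 0.75, 70
for t in (0, 70, 140, 280, 420):
    print(f"t = {t:3d} min: {excess_water(m0, half_life, t):.3f} g left")
```

By this model, most of the excess is gone within a few hours, though the last hundredth of a gram takes several more half-lives to disappear at 0.01 g scale precision.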

Things get much more interesting if the ball is then kept in a higher-humidity environment.  I rubbed a ball down, wiped it with a paper towel, let it sit for a couple of minutes to deal with any surface droplets I missed, and then sealed the ball in a sandwich bag for 2 hours along with a battery-powered portable hygrometer.  I expected the ball to completely saturate the air while losing less mass than I could measure (<0.01g) in the process, but that’s not what actually happened.  The relative humidity in the bag only went up 7%, and as expected, the ball lost no measurable amount of mass.  After taking it out, it started losing mass with a slightly longer half-life than before and lost all the excess water in a few hours.

I repeated the experiment except this time I sealed the ball and the hygrometer in an otherwise empty 5-gallon pail.  Again, the relative humidity only went up 7%, and the ball lost 0.04 grams of mass.  I calculated that 0.02g of evaporation should have been sufficient to cause that humidity change, so I’m not exactly sure what happened- maybe 0.01 was measurement error (the scale I was using goes to 0.01g), maybe my seal wasn’t perfectly airtight, maybe the crud on the lid I couldn’t clean off or the pail itself absorbed a little moisture.  But the ball had 0.5g of excess water to lose (which it did completely lose after removal from the pail, as expected) and only lost 0.04g in the pail, so the basic idea is still the same.
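The 0.02 g figure is a straightforward vapor-density calculation. As a sanity check, assuming air at roughly 20 °C (which holds about 17.3 g of water per cubic meter when saturated; at other temperatures the number shifts):

```python
# Sanity check on "0.02 g of evaporation should saturate the pail".
# Assumes ~20 C air; the saturation density is approximate.

SAT_VAPOR_DENSITY_20C = 17.3    # g/m^3, approximate
PAIL_VOLUME_M3 = 5 * 3.785e-3   # 5 US gallons in cubic meters

def grams_for_rh_change(delta_rh, volume_m3,
                        sat_density=SAT_VAPOR_DENSITY_20C):
    """Water mass (g) that must evaporate to raise RH by delta_rh (0-1)."""
    return delta_rh * sat_density * volume_m3

print(f"{grams_for_rh_change(0.07, PAIL_VOLUME_M3):.3f} g")  # ~0.023 g
```

That lands right around the 0.02 g estimate, and it also shows why a trivial amount of evaporation can “saturate” any reasonably sized enclosed space.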

This means that if the wet ball has restricted airflow, it’s going to take for freaking ever to reequilibrate (because it only takes a trivial amount of moisture loss to “saturate” a good-sized storage space).  And if it’s in a sealed environment, or in a free-airflow environment more than 7% RH above what it’s equilibrated to, the excess moisture will travel inward to more of the leather instead of evaporating off (eventually the entire ball would equilibrate to the higher-RH environment, but we’re only concerned with the high-RH environment as a temporary last-mile storage condition here, so that won’t happen on our timescales).

I also ran the experiment sealing the rubbed ball and the hygrometer in a sandwich bag overnight for 8 hours.  The half-life for losing moisture after that was around 2.5 hours, up from the 70 minutes when it was never sealed.  This confirms that the excess moisture doesn’t just sit around at the surface waiting if it can’t immediately evaporate, but that evaporation dominates when possible.

I also ran the experiment with a ball sealed in a sandwich bag for 2 hours along with an equilibrated cardboard divider that came with the box of balls I have.  That didn’t make much difference. The cardboard only absorbed 0.04g of the ~0.5g excess moisture in that time period, and that’s with a higher cardboard:ball ratio than a box actually comes with.  Equilibrated cardboard can’t substitute for free airflow on the timescale of a couple of hours.

Experiments With Mud

I mixed dirt and water to make my own mud and rubbed it in doing my best imitation of videos I could find, rubbing until the surface of the ball felt dry again.  Since I don’t have any kind of instrument to measure slickness, these are my perceptions plus those of my significant other.  We were in almost full agreement on every ball, and the one disagreement converged on the next measurement 30 minutes later.

If stored with unrestricted airflow in the environment it was equilibrated to, this led to roughly the following timeline:

  1. t=0, mudded, ball surface feels dry
  2. t= 30 minutes, ball surface feels moist and is worse than when it was first mudded.
  3. t=60 minutes, ball surface is drier and is similar in grip to when first mudded.
  4. t=90 minutes, ball is significantly better than when first mudded
  5. t=120 minutes, no noticeable change from t=90 minutes.
  6. t=12 hours, no noticeable change from t=120 minutes

I tested a couple of other things as well:

  1. I took a 12-hour ball, put it in a 75% RH environment for an hour and then a 100% RH environment for 30 minutes, and it didn’t matter.  The ball surface was still fine.  The ball would certainly go to hell eventually under those conditions, but it doesn’t seem likely to be a concern with anything resembling current protocols.  I also stuck one in a bag for a while and it didn’t affect the surface or change the RH at all, as expected since all of the excess moisture was already gone.
  2. I mudded a ball, let it sit out for 15 minutes, and then sealed it in a sandwich bag.  This ball was slippery at every time interval, 1 hour, 2 hours, 12 hours. (repeated twice).  Interestingly, putting the ball back in its normal environment for over 24 hours didn’t help much and it was still quite slippery.  Even with all the excess moisture gone, whatever had happened to the surface while bagged had ruined the ball.
  3. I mudded a ball, let it sit out for 2 hours, at which point the surface was quite good per the timeline above, and then sealed it in a bag.  THE RH WENT UP AND THE BALL TURNED SLIPPERY, WORSE THAN WHEN IT WAS FIRST MUDDED. (repeated 3x).  Like #2, time in the normal environment afterwards didn’t help.  Keeping the ball in its proper environment for 2 hours, sealing it for an hour, and then letting it out again was enough to ruin the ball.

That’s really important IMO.  We know from the water experiments that it takes more than 2 hours to lose the excess moisture under my storage conditions, and it looks like the combination of fresh(ish) mud plus excess surface moisture that can’t evaporate off is a really bad combo and a recipe for slippery balls.  Ball surfaces can feel perfectly good and game-ready while they still have some excess moisture left and then go to complete shit, apparently permanently, in under an hour if the evaporation isn’t allowed to finish.

Could this be the cause of the throwing errors and reported grip problems? Well…

2022 Last-Mile Protocol Changes

The first change for 2022 is that balls must be rubbed with mud on gameday, meaning they’re always taking on that surface moisture on gameday.  In 2021, balls had to be mudded at least 24 hours in advance of the game; that season narrowed the mudding window to 1-2 days before the game, where it had previously been up to 5 days in advance.  I don’t know how far in advance they were regularly mudded before 2021, but even early afternoon for a night game would be fine assuming the afternoon storage had reasonable airflow.

The second change is that they’re put back in the humidor fairly quickly after being mudded and allowed a maximum of 2 hours out of the humidor.  While I don’t think there’s anything inherently wrong with putting the balls back in the humidor after mudding (unless it’s something specific to 2022 balls), humidors look something like this.  If the balls are kept in a closed box, or an open box with another box right on top of them, there’s little chance that they reequilibrate in time.  If they’re kept in an open box on a middle shelf without much room above, unless the air is really whipping around in there, the excess moisture half-life should increase.

There’s also a chance that something could go wrong if the balls are taken out of the humidor, kept in a wildly different environment for an hour, then mudded and put back in the humidor, but I haven’t investigated that, and there are many possible combinations of both humidity and temperature that would need to be checked for problems.

The third change (at least I think it’s a change) is that the balls are kept in a sealed bag- at least highly restricted flow, possibly almost airtight- until opened in the dugout.  Even if it’s not a change, it’s still extremely relevant- sealing balls that have evaporated their excess moisture off doesn’t affect anything, while sealing balls that haven’t finished evaporating off seems to be a disaster.


Mudding adds excess moisture to the surface of the ball, and if its evaporation is prevented for very long- either through restricted airflow or storage in too humid an environment- the surface of the ball becomes much more slippery and stays that way even if evaporation continues later.  It takes hours- dependent on various parameters- for that moisture to evaporate off, and 2022 protocol changes make it much more likely that the balls don’t get enough time to evaporate off, causing them to fall victim to that slipperiness.  In particular, balls can feel perfectly good and ready while they still have some excess surface moisture and then quickly go to hell if the remaining evaporation is prevented inside the security-sealed bag.

It looks to me like MLB had a potential problem- substantial latent excess surface moisture being unable to evaporate and causing slipperiness- that prior to 2022 it was avoiding completely by chance or by following old procedures derived from lost knowledge.   In an attempt to standardize procedures, MLB accidentally made the excess surface moisture problem a reality, and not only that, did it in a way where the amount of excess surface moisture was highly variable.

The excess surface moisture when a ball gets to a pitcher depends on the amount of moisture initially absorbed, the airflow and humidity of the last-mile storage environment, and the amount of time spent in those environments and in the sealed bag.  None of those are standardized parts of the protocol, and it’s easy to see how there would be wide variability ball-to-ball and game-to-game.

Assuming this is actually what’s happening, the fix is fairly easy.  Balls need to be mudded far enough in advance and stored afterwards in a way that they get sufficient airflow for long enough to reequilibrate (the exact minimum times depending on measurements done in real MLB facilities), but as an easy interim fix, going back to mudding the day before the game and leaving those balls in an open uncovered box in the humidor overnight should be more than sufficient.  (and again, checking that on-site is pretty easy)


I found (or didn’t find) some other things that I may as well list here as well along with some comments.

  1. These surface moisture changes don’t change the circumference of the baseball at all, down to 0.5mm precision, even after 8 hours.
  2. I took a ball that had stayed moisturized for 2 hours and put a 5-pound weight on it for an hour.  There was no visible distortion and the circumference was exactly the same as before along both seam axes (I oriented the pressure along one seam axis and perpendicular to the other).  To whatever extent flat-spotting is happening or happening more this season, I don’t see how it can be a last-mile cause, at least with my balls.  Dr. Wills has mentioned that the new balls seem uniquely bad at flat-spotting, so it’s not completely impossible that a moist new ball at the bottom of a bucket could deform under the weight, but I’d still be pretty surprised.
  3. The ball feels squishier to me after being/staying moisturized, and free pieces of leather from dissected balls are indisputably much squishier when equilibrated to higher humidity, but “feels squishier” isn’t a quantified measurement or an assessment of in-game impact.  The squishy-ball complaints may also be another symptom of unfinished evaporation.
  4. I have no idea if the surface squishiness in 3 affects the COR of the ball to a measurable degree.
  5. I have no idea if the excess moisture results in an increased drag coefficient.  We’re talking about changes to the surface, and my prior dissected-ball experiments showed that the laces love water and expand from it, so it’s at least in the realm of possibility.
  6. For the third time, this is a hypothesis.  I think it’s definitely one worth investigating since it’s supported by physical evidence, lines up with the protocol changes this year, and is easy enough to check with access to actual MLB facilities.  I’m confident in my findings as reported, but since I’m not using current balls or official mud, this mechanism could also turn out to have absolutely nothing to do with the 2022 game.

The 2022 MLB baseball

As of this writing on 4/25/2022, HRs are down, damage and distance on barrels are down, and both Alan Nathan and Rob Arthur have observed that the drag coefficient of baseballs this year is substantially increased.  This has led to speculation about what has changed with the 2022 balls and even what production batch of balls or mixture of batches may be in use this year.  Given the kerfuffle last year that resulted in MLB finally confirming that a mix of 2020 and 2021 balls were used during the season, that speculation is certainly reasonable.

It may well also turn out to be correct, and changes in the 2022 ball manufacture could certainly explain the current stats, but I think it’s worth noting that everything we’ve seen so far is ALSO consistent with “absolutely nothing changed with regard to ball manufacture/end product between 2021 and 2022” and “all or a vast majority of balls being used are from 2021 or 2022”.  

How is that possible?  Well, the 2021 baseball production was changed on purpose.  The new baseball was lighter, less dense, and less bouncy by design, or in more scientific terms, “dead”.  What if all we’re seeing now is the 2021 baseball specifications in their true glory, now untainted by the 2020 live balls that were mixed in last year?

Even without any change to the surface of the baseball, a lighter, less dense ball won’t carry as far.  The drag force is independent of the mass (for a given size, which changed less if at all), and F=MA, so a constant force and a lower mass means a higher drag deceleration and less carry.

The aforementioned measurements of the drag coefficient from Statcast data also *don’t measure the drag coefficient*.  They measure the drag *acceleration* and use an average baseball mass value to convert to the drag force (which is then used to get the drag coefficient).  If they’re using the same average mass for a now-lighter ball, they’re overestimating the drag force and the drag coefficient, and the drag coefficient may literally not have changed at all (while the drag acceleration did go up, per the previous paragraph).
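To make the mass dependence concrete, here’s a sketch of that conversion. The deceleration, speed, and the two masses below are illustrative numbers of mine, not Statcast measurements:

```python
import math

# The drag fit sees acceleration; getting a coefficient requires assuming
# a mass: Cd = 2*m*a / (rho * A * v^2). All inputs here are hypothetical.

RHO = 1.2                  # air density, kg/m^3
RADIUS = 0.0366            # ball radius, m
AREA = math.pi * RADIUS**2 # cross-sectional area, m^2

def cd_from_accel(a, v, mass_kg):
    """Drag coefficient inferred from drag deceleration a at speed v."""
    return 2 * mass_kg * a / (RHO * AREA * v**2)

a_obs, v = 9.0, 40.0                 # illustrative deceleration (m/s^2), speed (m/s)
m_true, m_assumed = 0.1417, 0.1447   # lighter real ball vs. stale average mass

inflation = cd_from_accel(a_obs, v, m_assumed) / cd_from_accel(a_obs, v, m_true) - 1
print(f"Cd overestimated by {100 * inflation:.1f}% from the mass assumption alone")
```

Since the inferred Cd is directly proportional to the assumed mass, any overstatement of the mass flows straight through as an overstatement of the drag coefficient, even when the drag acceleration is measured perfectly.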

Furthermore, I looked at pitchers who threw at least 50 four-seam fastballs last year after July 1, 2021 (after the sticky stuff crackdown) and have also thrown at least 50 FFs in 2022.  This group is, on average, -0.35 MPH and +0.175 RPM on their pitches.  These stats usually move in the same direction, and a 1 MPH increase “should” increase spin by about 20 RPM.  So the group should have lost around 7 RPM from decreased velocity and actually wound up slightly positive instead.  It’s possible that the current baseball is just easier to spin based on surface characteristics, but it’s also possible that it’s easier to spin because it’s lighter and has less rotational inertia.  None of this is proof, and until we have results from experiments on actual game balls in the wild, we won’t have a great idea of the what or the why behind the drag acceleration being up. 

(It’s not (just) the humidor- drag acceleration is up even in parks that already had a humidor, and in places where a new humidor would add some mass, making the ball heavier is the exact opposite of what’s needed to match drag observations, although being in the humidor could have other effects as well)

Missing the forest for.. the forest

The paper  A Random Forest approach to identify metrics that best predict match outcome and player ranking in the esport Rocket League got published yesterday (9/29/2021), and for a Cliff’s Notes version, it did two things:  1) Looked at 1-game statistics to predict that game’s winner and/or goal differential, and 2) Looked at 1-game statistics across several rank (MMR/ELO) stratifications to attempt to classify players into the correct rank based on those stats.  The overarching theme of the paper was to identify specific areas that players could focus their training on to improve results.

For part 1, that largely involves finding “winner things” and “loser things” and the implicit assumption that choosing to do more winner things and fewer loser things will increase performance.  That runs into the giant “correlation isn’t causation” issue.  While the specific Rocket League details aren’t important, this kind of analysis will identify second-half QB kneeldowns as a huge winner move and having an empty net with a minute left in an NHL game as a huge loser move.  Treating these as strategic directives- having your QB kneel more or refusing to pull your goalie ever- would be actively terrible and harm your chances of winning.

Those examples are so obviously ridiculous that nobody would ever take them seriously, but when the metrics don’t capture losing endgames as precisely, they can be even *more* dangerous, telling a story that’s incorrect for the same fundamental reason, but one that’s plausible enough to be believed.  A common example is outrushing your opponent in the NFL being correlated to winning.  We’ve seen Derrick Henry or Marshawn Lynch completely dump truck opposing defenses, and when somebody talks about outrushing leading to wins, it’s easy to think of instances like that and agree.  In reality, leading teams run more and trailing teams run less, and the “signal” is much, much more from capturing leading/trailing behavior than from Marshawn going full beast mode sometimes.

If you don’t apply subject-matter knowledge to your data exploration, you’ll effectively ask bad questions that get answered by “what a losing game looks like” and not “what (actionable) choices led to losing”.  That’s all well-known, if worth restating occasionally.

The more interesting part begins with the second objective.  While the particular skills don’t matter, trust me that the difference in car control between top players and Diamond-ranked players is on the order of watching Simone Biles do a floor routine and watching me trip over my cat.  Both involve tumbling, and that’s about where the similarity ends.

The paper identifies various mechanics and classifies rank pretty well based on them.  What’s interesting is that while they can use those mechanics to tell a Diamond from a Bronze, when they tried to use those mechanics to predict the outcome of a game, they all graded out as basically worthless.  While some may have suffered from adverse selection (something you do less when you’re winning), they had a pretty good selection of mechanics and they ALL sucked at predicting the winner.  And, yet, beyond absolutely any doubt, the higher rank stratifications are much better at them than the lower-rank ones.  WTF? How can that be?

The answer is in a sample constructed in a particularly pathological way, and it’s one that will be common among esports data sets for the foreseeable future.  All of the matches are contested between players of approximately equal overall skill.  The sample contains no games of Diamonds stomping Bronzes or getting crushed by Grand Champs.

The players in each match have different abilities at each of the mechanics, but the overall package always grades out similarly given that they have close enough MMR to get paired up.  So if Player A is significantly stronger than player B at mechanic A to the point you’d expect it to show up, ceteris paribus, as a large winrate effect, A almost tautologically has to be worse at the other aspects, otherwise A would be significantly higher-rated than B and the pairing algorithm would have excluded that match from the sample.  So the analysis comes to the conclusion that being better at mechanic A doesn’t predict winning a game.  If the sample contained comparable numbers of cross-rank matches, all of the important mechanics would obviously be huge predictors of game winner/loser.

The sample being pathologically constructed led to the profoundly incorrect conclusion

Taken together, higher rank players show better control over the movement of their car and are able to play a greater proportion of their matches at high speed.  However, within rank-matched matches, this does not predict match outcome. Therefore, our findings suggest that while focussing on game speed and car movement may not provide immediate benefit to the outcome within matches, these PIs are important to develop as they may facilitate one’s improvement in overall expertise over time.

even though adding or subtracting a particular ability from a player would matter *immediately*.  The idea that you can work on mechanics to improve overall expertise (AKA achieving a significantly higher MMR) WITHOUT IT MANIFESTING IN MATCH RESULTS, WHICH IS WHERE MMR COMES FROM, is… interesting.  It’s trying to take two obviously true statements (higher-ranked players play faster and with more control, quantified in the paper; playing faster and with more control makes you better, self-evident to anybody who knows RL at all) and shoehorn a finding between them that obviously doesn’t comport.

This kind of mistake will occur over and over and over when data sets comprised of narrow-band matchmaking are analysed that way.

(It’s basically the same mistake as thinking that velocity doesn’t matter for mediocre MLB pitchers: it doesn’t correlate with a lower ERA within that group, but any individual who gains velocity will improve his ERA on average.)
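The matchmaking artifact is easy to reproduce in a toy simulation.  Here’s a hedged sketch (all the numbers and names are made up for illustration): give each player two independent mechanics, make overall skill their sum, and draw the winner from a logistic on the overall-skill gap.  Random pairing versus narrow-band pairing flips the apparent predictive value of mechanic A:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical players: two independent mechanics; overall skill is their sum.
players = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20000)]

def winrate_when_better_at_a(pairs):
    """P(the player who is stronger at mechanic A wins the match)."""
    wins = games = 0
    for p1, p2 in pairs:
        if p1[0] < p2[0]:
            p1, p2 = p2, p1            # make p1 the better-at-A player
        overall_gap = sum(p1) - sum(p2)
        wins += random.random() < sigmoid(2 * overall_gap)
        games += 1
    return wins / games

# Random matchmaking: anyone can face anyone.
random_pairs = list(zip(players[::2], players[1::2]))

# Narrow-band matchmaking: only near-equal overall skill gets paired.
ranked = sorted(players, key=sum)
matched_pairs = list(zip(ranked[::2], ranked[1::2]))

print(winrate_when_better_at_a(random_pairs))   # well above 0.5
print(winrate_when_better_at_a(matched_pairs))  # barely distinguishable from 0.5
```

In the matched sample, a big edge in mechanic A almost tautologically implies a deficit in mechanic B, so A stops predicting anything, exactly the trap described above.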


The hidden benefit of pulling the ball

Everything else about the opportunity being equal, corner OFs have a significantly harder time catching pulled balls than they do catching opposite-field balls.  In this piece, I’ll demonstrate that the effect actually exists, try to quantify it in a useful way, and give a testable take on what I think is causing it.

Looking at all balls with a catch probability >0 and <0.99 (the Statcast cutoff for absolutely routine fly balls), corner OF out rates underperform catch probability by 0.028 on pulled balls relative to oppo balls.

(For the non-baseball readers, position 7 is left field, 8 is center field, 9 is right field, and a pulled ball is a right-handed batter hitting a ball to left field or a LHB hitting a ball to right field.  Oppo is “opposite field”, RHB hitting the ball to right field, etc.)

Stands Pos Catch Prob Out Rate Difference N
L 7 0.859 0.844 -0.015 14318
R 7 0.807 0.765 -0.042 11380
L 8 0.843 0.852 0.009 14099
R 8 0.846 0.859 0.013 19579
R 9 0.857 0.853 -0.004 19271
L 9 0.797 0.763 -0.033 8098

The joint standard deviation for each L-R difference, given those Ns, is about 0.005, so 0.028 +/- 0.005, symmetric in both fields, is certainly interesting.  Rerunning the numbers on more competitive plays (0.20 < catch probability < 0.80):

Stands Pos Catch Prob Out Rate Difference N
L 7 0.559 0.525 -0.034 2584
R 7 0.536 0.407 -0.129 2383
L 9 0.533 0.418 -0.116 1743
R 9 0.553 0.549 -0.005 3525

Now we see a much more pronounced difference, .095 in LF and .111 in RF (+/- ~.014).  The difference is only about .01 on plays between .8 and .99, so whatever’s going on appears to be manifesting itself clearly on competitive plays while being much less relevant to easier plays.
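The quoted uncertainties are consistent with the usual standard error for a difference of two independent proportions; a quick check using the LF numbers from the tables above:

```python
import math

def diff_se(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# LF out rates, all plays with 0 < catch prob < 0.99: LHB (oppo) vs RHB (pull)
print(round(diff_se(0.844, 14318, 0.765, 11380), 4))  # 0.005
# LF out rates, competitive plays only (0.20 < CP < 0.80)
print(round(diff_se(0.525, 2584, 0.407, 2383), 4))    # 0.0141
```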

Using competitive plays also allows a verification that is (mostly) independent of Statcast’s catch probability.  According to this Tango blog post, catch probability changes are roughly linear in time or distance changes in the sweet spot, at a rate of 0.1 s = 10% out rate and 1 foot = 4% out rate.  By grouping roughly similar balls and using those conversions, we can see how robust this finding is.  Using 0.2 <= CP <= 0.8, back=0, and binning by hang time in 0.5 s increments, we can create buckets of almost identical opportunities.  For RF, it looks like

Stands Hang Time Bin Avg Hang Time Avg Distance N
L 2.5-3.0 2.881 30.788 126
R 2.5-3.0 2.857 29.925 242
L 3.0-3.5 3.268 41.167 417
R 3.0-3.5 3.256 40.765 519
L 3.5-4.0 3.741 55.234 441
R 3.5-4.0 3.741 55.246 500
L 4.0-4.5 4.248 69.408 491
R 4.0-4.5 4.237 68.819 380
L 4.5-5.0 4.727 81.487 377
R 4.5-5.0 4.714 81.741 204
L 5.0-5.5 5.216 93.649 206
R 5.0-5.5 5.209 93.830 108

If there’s truly a 10% gap, it should easily show up in these bins.

Hang Time to LF Raw Difference Corrected Difference Catch Prob Difference SD
2.5-3.0 0.099 0.104 -0.010 0.055
3.0-3.5 0.062 0.059 -0.003 0.033
3.5-4.0 0.107 0.100 0.013 0.032
4.0-4.5 0.121 0.128 0.026 0.033
4.5-5.0 0.131 0.100 0.033 0.042
5.0-5.5 0.080 0.057 0.023 0.059
Hang Time to RF Raw Difference Corrected Difference Catch Prob Difference SD
2.5-3.0 0.065 0.096 -0.063 0.057
3.0-3.5 0.123 0.130 -0.023 0.032
3.5-4.0 0.169 0.149 0.033 0.032
4.0-4.5 0.096 0.093 0.020 0.035
4.5-5.0 0.256 0.261 0.021 0.044
5.0-5.5 0.168 0.163 0.044 0.063

and it does.  Whatever is going on is clearly not just an artifact of the catch probability algorithm.  It’s a real difference in catching balls.  This also means that I’m safe using catch probability to compare performance and that I don’t have to do the whole bin-and-correct thing any more in this post.
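For reference, this is roughly how a bin correction can be built from Tango’s conversions (0.1 s of hang time ≈ 10 percentage points of out rate, 1 foot of distance ≈ 4 points).  The sign conventions and bookkeeping here are my assumptions for illustration, not a reproduction of the table’s exact arithmetic:

```python
def corrected_gap(raw_gap, hang_a, hang_b, dist_a, dist_b):
    """Adjust a raw out-rate gap (group A minus group B) for small mismatches
    in average opportunity within a bin.  Per Tango's conversions: an extra
    0.1 s of hang time makes a ball ~10 points easier to catch; an extra
    foot of distance makes it ~4 points harder.  Sign conventions assumed."""
    opportunity_edge = 1.0 * (hang_a - hang_b) - 0.04 * (dist_a - dist_b)
    return raw_gap - opportunity_edge

# Identical opportunities: the raw gap stands as-is
print(corrected_gap(0.10, 3.25, 3.25, 41.0, 41.0))
```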

Now we’re on to the hypothesis-testing portion of the post.  I’d used the back=0 filter to avoid potentially Simpson’s Paradoxing myself, so how does the finding hold up with back=1 & wall=0?

Stands Pos Catch Prob Out Rate Difference N
R 7 0.541 0.491 -0.051 265
L 7 0.570 0.631 0.061 333
R 9 0.564 0.634 0.071 481
L 9 0.546 0.505 -0.042 224

.11x L-R difference in both fields.  Nothing new there.

In theory, corner OFs could be particularly bad at playing hooks or particularly good at playing slices.  If that’s true, then the balls with more sideways movement should behave quite differently than the balls with less sideways movement.  I estimated the sideways acceleration in flight from hang time, launch spray angle, and landing position, and split balls into high and low acceleration (slices have more sideways acceleration than hooks on average, so this compares high slice to low slice and high hook to low hook).
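A sketch of one way to make that estimate (the coordinate convention and the constant-acceleration assumption are mine; the actual construction may differ): treat the lateral drift as a constant sideways acceleration, so the deviation from the initial spray-angle line is ½at², and solve for a.

```python
import math

def sideways_accel(spray_deg, hang_time, land_x, land_y):
    """Constant-acceleration estimate of a batted ball's sideways movement.

    Assumed convention: y points from home plate toward CF, x sideways;
    spray_deg is the initial horizontal direction off the bat (0 = straight
    away).  If all lateral drift comes from a constant sideways
    acceleration a, then deviation = 0.5 * a * t**2.
    """
    theta = math.radians(spray_deg)
    ux, uy = math.sin(theta), math.cos(theta)    # initial horizontal direction
    deviation = land_x * uy - land_y * ux        # signed offset from that line
    return 2 * deviation / hang_time ** 2
```

So a ball launched dead straight (spray 0°) that lands 10 feet off-line after a 4-second hang gets ~1.25 ft/s² of sideways acceleration under this model.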

Batted Spin Stands Pos Catch Prob Out Rate Difference N
Lots of Slice L 7 0.552 0.507 -0.045 1387
Low Slice L 7 0.577 0.545 -0.032 617
Lots of Hook R 7 0.528 0.409 -0.119 1166
Low Hook R 7 0.553 0.402 -0.151 828
Lots of Slice R 9 0.540 0.548 0.007 1894
Low Slice R 9 0.580 0.539 -0.041 972
Lots of Hook L 9 0.526 0.425 -0.101 850
Low Hook L 9 0.546 0.389 -0.157 579

And there’s not much to see there.  Corner OFs play low-acceleration balls worse, but on average those are balls toward the gap with somewhat longer runs, and the out-rate difference is somewhat-to-mostly explained by corner OFs’ lower speed getting exposed over a longer run.  Regardless, it’s nothing even close to explaining away our handedness effect.

Perhaps pull and oppo balls come from different pitch mixes and there’s something about the balls hit off different pitches.

Pitch Type Stands Pos Catch Prob Out Rate Difference N
FF L 7 0.552 0.531 -0.021 904
FF R 7 0.536 0.428 -0.109 568
FF L 9 0.527 0.434 -0.092 472
FF R 9 0.556 0.552 -0.004 1273
FT/SI L 7 0.559 0.533 -0.026 548
FT/SI R 7 0.533 0.461 -0.072 319
FT/SI L 9 0.548 0.439 -0.108 230
FT/SI R 9 0.553 0.592 0.038 708
Other L 7 0.569 0.479 -0.090 697
Other R 7 0.541 0.379 -0.161 1107
Other L 9 0.534 0.385 -0.149 727
Other R 9 0.550 0.497 -0.054 896

The effect clearly persists, although there is a bit of Simpsoning showing up here.  Slices are relatively fastball-heavy and hooks are relatively Other-heavy, and corner OFs catch FBs at a relatively higher rate.  That will be the subject of another post.  The average L-R difference among paired pitch types is still 0.089, though.
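That 0.089 checks out as the simple (unweighted) average of the six paired oppo-minus-pull gaps in the table:

```python
# (out rate - catch prob) differences from the pitch-type table,
# paired as oppo minus pull within each pitch type and corner
gaps = [
    -0.021 - (-0.109),  # FF in LF: LHB (oppo) vs RHB (pull)
    -0.004 - (-0.092),  # FF in RF: RHB (oppo) vs LHB (pull)
    -0.026 - (-0.072),  # FT/SI in LF
     0.038 - (-0.108),  # FT/SI in RF
    -0.090 - (-0.161),  # Other in LF
    -0.054 - (-0.149),  # Other in RF
]
print(round(sum(gaps) / len(gaps), 3))  # 0.089
```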

Vertical pitch location is completely boring, and horizontal pitch location is the subject for another post (corner OFs do best on outside pitches hit oppo and worst on inside pitches pulled), but the handedness effect clearly persists across all pitch location-fielder pairs.

So what is going on?  My theory is that this is a visibility issue.  The LF has a much better view of a LHB’s body and swing than he does of a RHB’s, and it’s consistent with all the data that looking into the open side gives about a 0.1 second advantage in reaction time compared to looking at the closed side.  A baseball swing takes around 0.15 seconds, so that seems roughly reasonable to me.  I don’t have the play-level data to test that myself, but it should show up as a batter handedness difference in corner OF reaction distance and around a 2.5 foot batter handedness difference in corner OF jump on competitive plays.

Wins Above Average Closer

It’s HoF season again, and I’ve never been satisfied with the discussion around relievers.  I wanted something that attempted to quantify excellence at the position while still being a counting stat, and what better way to quantify excellence at RP than by comparing to the average closer?  (Please treat that as a rhetorical question)

I used the highly scientific method of defining the average closer as the aggregate performance of the players who were top-25 in saves in a given year, and I used several measures of wins.  I wanted something that used all events (so not fWAR) and already handled run environment for me, and comparing runs as a counting stat across different run environments is more than a bit janky, so that meant a wins-based metric.  I went with REW (RE24-based wins), WPA, and WPA/LI. 

I used IP as the denominator instead of PA/TBF because I wanted any (1 inning, X runs) outing and any inherited-runner situation (X outs gotten to end the inning, Y runs allowed minus entering RE) to grade out the same regardless of batters faced.  Not that using PA as the denominator would make much difference.

The first trick was deciding on a baseline Wins/IP to compare against, because the “average closer” is significantly better now than in 1974, to the tune of around 0.5 normalized RA/9 better.

I used the regression as the baseline Wins/IP for each season/metric because I was more interested in excellence compared to peers than compared to players who were pitching significantly different innings/appearance.  WPA/LI/IP basically overlaps REW/IP and makes it all harder to see, so I left it off.

For each season, a player’s WAAC is (Wins/IP – baseline wins/IP) * IP, computed separately for each win metric.
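The formula is trivial, but for concreteness, a sketch (the example numbers are hypothetical, not any real player’s line):

```python
def waac(wins, ip, baseline_wins_per_ip):
    """Wins Above Average Closer for one season and one win metric:
    the pitcher's wins-per-inning rate above the average-closer baseline,
    scaled back up by innings pitched."""
    return (wins / ip - baseline_wins_per_ip) * ip

# Hypothetical: 4.0 REW over 70 IP against a 0.040 wins/IP baseline
print(round(waac(4.0, 70, 0.040), 2))  # 1.2
```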

Without further ado, the top 20 in WAAC (REW-based) and the remaining HoFers.  Peak is defined as the optimal start and stop years for REW.  Fangraphs doesn’t have Win Probability stats before 1974, which cuts out all of Hoyt Wilhelm, but by a quick glance, he’s going to be top-5, solidly among the best non-Rivera RPs.  I also miss the beginning of Fingers’s career, but it doesn’t matter. 

Career WAAC based on REW WPA WPA/LI Peak REW Peak Years
Mariano Rivera 16.9 26.6 18.6 16.9 1996-2013
Billy Wagner 7.3 7.2 7.3 7.3 1996-2010
Joe Nathan 6.6 12.6 7.5 8.5 2003-2013
Zack Britton 5.4 7.0 4.1 5.4 2014-2020
Craig Kimbrel 5.2 6.9 4.3 6.3 2010-2018
Keith Foulke 4.9 4.1 5.3 7.2 1999-2004
Tom Henke 4.9 5.8 5.7 5.8 1985-1995
Aroldis Chapman 4.2 4.1 4.7 4.6 2012-2019
Rich Gossage 4.1 7.4 4.8 10.8 1975-1985
Andrew Miller 3.8 3.5 3.2 5.4 2012-2017
Wade Davis 3.8 3.5 3.7 5.6 2012-2017
Trevor Hoffman 3.8 8.0 6.1 5.8 1994-2009
Darren O’Day 3.7 -1.2 1.8 4.1 2009-2020
Rafael Soriano 3.5 2.8 2.8 4.3 2003-2012
Jonathan Papelbon 3.5 9.7 4.8 5 2006-2009
Eric Gagne 3.4 8.5 3.6 4.3 2002-2005
Dennis Eckersley (RP) 3.4 -0.3 5.2 7.1 1987-1992
John Wetteland 3.3 7.1 4.9 4.2 1992-1999
John Smoltz (RP) 3.2 9.0 3.5 3.2 2001-2004
Kenley Jansen (#20) 3.1 2.6 3.5 4.3 2010-2017
Lee Smith (#25) 2.5 0.9 1.8 4.5 1981-1991
Rollie Fingers (#64) 0.7 -6.0 0.8 1.9 1975-1984
Bruce Sutter (#344) 0.0 2.4 3.6 4.3 1976-1981

Mariano looks otherworldly here, but it’s hard to screw that up.  We get a few looks at really aberrant WPAs, good and bad, which is no shock because WPA is known to be noisy as hell.  Peak Gossage was completely insane.  His career rate stats got dragged down by pitching forever, but for those 10 years (20.8 peak WPA too), he was basically Mo, and nobody else has kept up that imitation for as long.

Wagner was truly excellent.  And he’s 3rd in RA9-WAR behind Mo and Goose, so it’s not like his lack of IP stopped him from accumulating regular value.  Please vote him in if you have a vote.

It’s also notable how hard it is to stand out or sustain that level.  Only one player outside the list above is over 3 career WAAC (Koji).  There are flashes of brilliance (often mixed with flashes of positive variance), but almost nobody sustained “average closer” performance for over 10 years.  The longest peaks (a year lost to injury, or to not throwing 10 IP in relief, doesn’t break a peak; it just doesn’t count toward its length) are:

Rivera 17 (16 with positive WAAC)

Wagner 15 (12 positive)

Hoffman 15 (9 positive)

Wilhelm 13 (by eyeball)

Smith 11 (10 positive)

Henke 11 (10 positive)

Fingers 11 (8 positive, giving him 1973)

O’Day 11 (7 positive)

and that’s it in the history of baseball.  It’s pretty tough to pitch that well for that many years.  

This isn’t going to revolutionize baseball analysis or anything, but I thought it was an interesting look that went beyond career WAR/career WPA to give a kind of counting stat for excellence.