Stuff+ Doesn’t Have a Team-Switching Problem

Not even going to bother with a Betteridge’s Law headline here. On top of what at this point is presumptively bad-faith discussion of their own stats (rank-order correlations on team-switchers only? Really?) BP claimed that the Stuff+ metric has a team-switching problem and spent like 15 paragraphs discussing it. I’m only going to spend two paragraphs, because it just doesn’t.

Edit 5/5/2023: went ahead and got all the same players together with the exact same weighting for everything to make sure to compare DRA- to Stuff+ and other stats completely fairly and replaced the section with one composite chart.

Using data from Fangraphs and BP, I took every pitcher-season from 2020-2022 with at least 100 pitches thrown (this got rid of position players, etc.) and took DRA-, Pitches, Stuff+, FIP, xFIP-, SIERA, and ERA.  Because each season’s ERA was quite different, I converted ERA/SIERA/FIP to Stat/MLB_Average_ERA for that season and multiplied by 100 to make a (non-park-adjusted) “Stat-“.  DRA- and xFIP- are already on that scale.  I then did an IP-weighted fit of same-season Stuff+ and “ERA-” and got predicted same-season “ERA-” = 98.93 – 1.15 * (Stuff+ – 100).  I then took paired consecutive player-seasons and compared weighted RMSEs for predicting year T+1 “ERA-” from each of year T’s stats, broken down by team-switching status (No = both seasons for the same team, Yes = played for more than one team).
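For concreteness, here’s a minimal sketch of that pipeline (the file and column names are placeholders, and the exact coefficients obviously depend on the data pull):

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per pitcher-season, 100+ pitches, 2020-2022.
df = pd.read_csv("pitcher_seasons.csv")  # placeholder file name

# League-average ERA by season (IP-weighted mean of ERA = total ER*9 / total IP).
lg_era = df.groupby("Season").apply(lambda g: np.average(g["ERA"], weights=g["IP"]))

# Convert ERA-scale stats to a non-park-adjusted "minus" scale: 100 * stat / league ERA.
for stat in ["ERA", "SIERA", "FIP"]:
    df[stat + "minus"] = 100 * df[stat] / df["Season"].map(lg_era)

# IP-weighted same-season fit of "ERA-" on (Stuff+ - 100).  np.polyfit applies its
# weights to the unsquared residuals, so pass sqrt(IP) to minimize sum(IP * resid^2).
slope, intercept = np.polyfit(df["StuffPlus"] - 100, df["ERAminus"], 1,
                              w=np.sqrt(df["IP"]))
df["StuffPlus_ERAminus"] = intercept + slope * (df["StuffPlus"] - 100)
# My fit came out to roughly 98.93 - 1.15 * (Stuff+ - 100).

def weighted_rmse(pred, actual, weights):
    return np.sqrt(np.average((np.asarray(pred) - np.asarray(actual)) ** 2,
                              weights=weights))
```

From there, each year-T stat gets run through weighted_rmse against year T+1 “ERA-”, split by switcher status, which is what the table below reports.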

RMSE T+1 “ERA-“ Non-switch Switch All
Stuff+ 38.0 37.4 37.7
“SIERA-“ 39.5 38.9 39.26
DRA- 40.0 38.1 39.29
xFIP- 40.6 40.0 40.4
“FIP-“ 43.4 49.2 45.6
“ERA-“ 50.8 60.6 54.6
N 588 409 997

 

Literally no problems here.  Stuff+ does fine with team-switchers, does better than park-unadjusted “FIP-” across the board, and does much better on team-switchers than park-unadjusted “FIP-“, as expected, since park-unadjusted FIP should be the metric taking a measurable accuracy hit from a park change.  And yet somehow BP is reporting the complete opposite conclusions instead:  1) that Stuff+ is fine for non-switchers but becomes near-useless for team-switchers, and 2) that its performance degrades significantly compared to park-unadjusted-FIP for team switchers.  Common sense and the data clearly say otherwise.  DRA- grades out roughly between SIERA and xFIP- for non-switchers predicting next season’s ERA, on par with SIERA overall, and solidly behind Stuff+. (Apologies for temporarily stating it was much worse than that).

Looking at it another way, creating an IP-weighted-RMSE-minimizing linear fit for each metric to predict next season’s “ERA-” (e.g. Year T+1 ERA- = 99 + 0.1 * (year T DRA- – 100)) gives the following chart:

y=mx+b intercept slope RMSE r
Stuff+ ERA- 102.42 0.79 34.16 0.29
SIERA- 103.07 0.53 34.56 0.25
DRA- 101.42 0.49 34.62 0.24
xFIP- 101.57 0.40 34.88 0.21
“FIP-“ 101.13 0.21 35.14 0.17
“ERA-“ 100.87 0.11 35.40 0.12
everybody the same 100.55 0.00 35.65 0.00

The intercepts differ slightly out of noise and slightly because the metrics aren’t all centered exactly identically- SIERA has the lowest average value for whatever reason.  ERA predicted from Stuff+ is the clear winner again, with DRA- again between SIERA and xFIP-.  Since all the metrics being fit are on the same scale (Stuff+ was transformed into ERA- as in the paragraph above), the slopes can be compared directly, and the bigger the slope, the more one point of the year-T stat predicts year T+1 ERA-.  Well, almost, since the slopes to year-T ERA aren’t exactly 1, but nothing is compressed enough to change the rank order (DRA- almost catches SIERA, but falls further behind Stuff+).  One point of year-T Stuff+ ERA- is worth 1.00 points of year-T ERA- and 0.8 points of year-T+1 ERA-.  One point of year-T DRA- is worth 1.04 points of year-T ERA- but only 0.49 points of year-T+1 ERA-.  Stuff+ is much stickier.  Fitting to switchers only, the Stuff+ slope is 0.66 and the DRA- slope is 0.46.  Stuff+ is still much stickier.  There’s just nothing here.  Stuff+ doesn’t have a big team-switching problem, and points of Stuff+ ERA- are clearly worth more than points of DRA- going forward for switchers and non-switchers alike.

Fielding-Independent Defense

TL;DR Using Sprint Speed, Reaction, and Burst from the Statcast leaderboard pages, with no catch information (or other information) of any kind, is enough to make a good description of same-season outfield OAA and that descriptive stat makes a better prediction of next-season outfield OAA than current-season OAA does.

A recently-released not-so-great defensive metric inspired me to repurpose an old idea of mine and see how well I could model outfield defense without knowing any actual play results, and the answer is actually pretty well.  Making catches in the OF, taking positioning as a given as OAA does, is roughly based on five factors- reacting fast, accelerating fast, running fast, running to the right place, and catching the balls you’re close enough to reach.

Reacting fast has its own leaderboard metric (reaction distance, the distance traveled in the first 1.5s), as does running fast (Sprint Speed, although it’s calculated on offense).  Acceleration has a metric somewhat covering it, although not as cleanly, in Burst (distance traveled between 1.5s and 3s).  Running to the right place only has the route metric, which covers just the first 3s, is heavily confounded, and doesn’t add anything nontrivial, so I don’t use it.  Actually catching the balls is deliberately left out of the FID metric (I do provide a way to incorporate catch information into next season’s estimation at the end of each section).

2+ Star Plays

The first decision was which metric to model first, and I went with OAA/n on 2+ star plays over OAA/n on all plays for multiple reasons: 2+ star plays are responsible for the vast majority of seasonal OAA; OAA/n on 2+ star plays correlates at over r=0.94 with OAA/n on all plays (same season), so it contains the vast majority of the information in OAA/n anyway; and the three skills I’m modeling from aren’t put on full display on easier balls.  Then I had to decide how to incorporate the three factors (Sprint Speed, Reaction, Burst).  Reaction and Burst are already normalized to league average, and I normalized Sprint Speed to the average OF sprint speed weighted by the number of 2+ star opportunities each player got (since that’s the average sprint speed of the 2+* sample).  That can get janky early in a season before a representative sample has qualified for the leaderboard, so in that case it’s probably better to just use the previous season’s average sprint speed as a baseline for a while, as there’s not that much variation.

year weighted avg OF Sprint Speed
2016 27.82
2017 27.93
2018 28.01
2019 27.93
2020 27.81
2021 27.94
2022 28.00

 

Conveniently each stat individually was pretty linear in its relationship to OAA/n 2+* (50+ 2* opportunities shown).

Reaction isn’t convincingly near-linear as opposed to some other positively-correlated shape, but it’s also NOT convincingly nonlinear at all, so we’ll just go with it.

The outlier at 4.7 reaction is Enrique Hernandez who seems to do this regularly but didn’t have enough opportunities in the other seasons to get on the graph again.  I’m guessing he’s deliberately positioning “slightly wrong” and choosing to be in motion with the pitch instead of stationary-ish and reacting to it.  If more players start doing that, then this basic model formulation will have a problem.

Reaction Distance and Sprint Speed are conveniently barely correlated, r=-.09, and changes from year to year are correlated at r=0.1.  I’d expect the latter to be closer to the truth, given that for physical reasons there should be a bit of correlation, and there should be a bit of “missing lower left quadrant” bias in the first number where you can’t be bad at both and still get run out there, but it seems like the third factor to being put in the lineup (offense) is likely enough to keep that from being too extreme.

Burst, on the other hand, there’s no sugarcoating it.  Correlation of r=0.4 to Reaction Distance (moving further = likely going faster at the 1.5s point, leading to more distance covered in the next 1.5s) and r=0.56 to Sprint Speed for obvious reasons.  I took the easy way out with only one messy variable and made an Expected Burst from Reaction and Sprint Speed (r=0.7 to actual Burst), and then took the residual Burst – Expected Burst to create Excess Burst and used that as the third input.  I also tried a version with Sprint Speed and Excess Bootup Distance (distance traveled in the correct direction in the first 3s) as a 2-variable model, and it still “works”, but it’s significantly worse in both description and prediction.  Excess Burst also looks fine as far as being linear with respect to OAA/n.
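In code, that step is just an opportunity-weighted two-variable regression plus a residual- a sketch with illustrative array names:

```python
import numpy as np

def excess_burst(reaction, sprint_above_avg, burst, opps):
    """Fit Expected Burst from Reaction and Sprint Speed above average
    (opportunity-weighted least squares) and return the residual, Excess Burst.
    Inputs are 1-D arrays with one entry per player-season."""
    X = np.column_stack([np.ones_like(reaction), reaction, sprint_above_avg])
    w = np.sqrt(opps)                       # sqrt weights -> opp-weighted squared error
    coef, *_ = np.linalg.lstsq(X * w[:, None], burst * w, rcond=None)
    expected = X @ coef                     # correlates ~0.7 with actual Burst
    return burst - expected
```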

Looking at how the inputs behave year to year (unless indicated otherwise, all correlations are with players who qualified for Jump stats to be calculated (10+ 2-star opportunities) and Sprint Speed leaderboard (10+ competitive runs) in both years, and correlations are weighted by year T 2+ star opportunities)

weighted correlation r-values (N=712) itself next season same-season OAA/n 2+* next-season OAA/n 2+*
Sprint Speed above avg 0.91 0.49 0.45
Reaction 0.85 0.40 0.35
Burst 0.69 0.79 0.60
Excess Burst 0.53 0.41 0.21

Sprint Speed was already widely known to be highly reliable year-over-year, so no surprise there, and Reaction grades out almost as well, but Burst clearly doesn’t, particularly in the year-to-year drop in correlation to OAA/n.  Given that the start of a run (Reaction) holds up year-to-year, and that the end of the run (Sprint Speed) holds up year-to-year, it’s very strange to me that the middle of the run wouldn’t.  I could certainly believe a world where there’s less skill variation in the middle of the run, so the left column would be smaller, but that doesn’t explain the dropoff in correlation to OAA/n.  Play-level data isn’t available, but what I think is happening here is that because Burst is calculated on ALL 2+* plays, it’s a mixture of maximum burst AND how often a player chose to use it, because players most definitely don’t bust ass for 3 seconds on plenty of the 2+* plays they don’t get close to making (and this isn’t even necessarily wrong on any given play- getting in position to field the bounce and throw can be the right decision).

I would expect maximum burst to hold up roughly as well as Reaction or Sprint Speed in the OAA/n correlations, but the choice component is the kind of noisy yes-no variable that takes longer to stabilize than a stat that’s closer to a physical attribute.  While there’s almost no difference in Sprint Speed year-to-year correlation for players with 60+ attempts (n=203) and players with 30 or fewer attempts (n=265), r=0.92 to 0.90, there’s a huge dropoff in Burst, r=0.77 to r=0.57.

This choice variable is also a proxy for making catches- if you burst after a ball that some players don’t, you have a chance to make a catch that they won’t, and if you don’t burst after a ball that other players do, you have no chance to make a catch that they might.  That probably also explains why the Burst metric and Excess Burst are relatively overcorrelated to same-season OAA/n.

Now that we’ve discussed the components, let’s introduce the Fielding-Independent Defense concept.  It is a DESCRIPTIVE stat: the values of A/B/C/D in FID = A + B*Reaction + C*Sprint Speed above average + D*Excess Burst are chosen to minimize the SAME-SEASON opportunity-weighted RMSE between FID and OAA/n on whatever plays we’re looking at (here, 2+ star plays).  Putting FID on the success probability added scale (e.g. if Statcast had an average catch probability of 50% on a player’s opportunities and he caught 51%, he’s +1% success probability, and if FID expects him to catch 52%, he’d be +2% FID), I get

FID = 0.1% + 4.4% * Reaction + 5.4% * Sprint Speed above average + 6.4% * Excess Burst.

In an ideal world, the Y intercept (A) would be 0, because in a linear model, somebody who’s average at every component should be average.  But our Sprint Speed above average here is relative to the weighted average of players who had enough opportunities to have stats calculated, which isn’t exactly the average including the players who didn’t, so I let the intercept be free, and 0.1% on 2+ star plays is less than 0.1 OAA over a full season, so I’m not terribly worried at this point.  And yeah, I know I’m nominally adding things that don’t even use the same units (ft vs. ft/s) and coming up with a number in outs/opportunity, so be careful porting this idea to different measurement systems because the coefficients would need to be converted from ft^-1, etc.
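For reference, the whole fit is just one more opportunity-weighted least squares- a sketch with illustrative array names:

```python
import numpy as np

def fit_fid(reaction, sprint_above_avg, excess_burst, oaa_per_n, opps):
    """Solve for A, B, C, D in FID = A + B*Reaction + C*SprintAboveAvg + D*ExcessBurst,
    minimizing same-season opportunity-weighted RMSE against OAA/n (both on the
    success-probability-added scale)."""
    X = np.column_stack([np.ones_like(reaction), reaction,
                         sprint_above_avg, excess_burst])
    w = np.sqrt(opps)
    coef, *_ = np.linalg.lstsq(X * w[:, None], oaa_per_n * w, rcond=None)
    fid = X @ coef
    rmse = np.sqrt(np.average((fid - oaa_per_n) ** 2, weights=opps))
    # The 2+ star fit above came out to roughly (0.001, 0.044, 0.054, 0.064).
    return coef, fid, rmse
```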

So, how well does it do?

weighted correlation r-values (N=712) itself next season same-season OAA/n 2+* next-season OAA/n 2+*
FID 2+* 0.748 0.806 0.633
OAA/n 2+* 0.591 1.000 0.591
OAA-FID 0.156 0.587 0.132
Regressed FID and (OAA-FID) 0.761 0.890 0.647

That’s pretty darn good for a purely descriptive stat that doesn’t know the outcome of any play, although 2+* plays do lean on physical skills more than 0-1* plays do.  The OAA-FID residual- actual catch information- does contain some predictive value, but it doesn’t help a ton.  The regression amounts I came up with (the best-fit regression for FID and OAA-FID together to predict next season’s OAA/n on 2+*) were 12 opportunities for FID and 218 opportunities for OAA-FID.  Given that a full-time season is 70-80 2+ star opportunities on average (more for CF, fewer for corner), FID is half-real in a month and OAA-FID would be half-real if a season lasted 18 straight months.  Those aren’t the usual split-sample correlations, since there isn’t any split-sample data available, but regressions based on players with different numbers of opportunities.  That has its own potential issues, but 12 and 218 should be in the ballpark.  FID stabilizes super-fast.
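In other words, the blended projection is just each piece regressed toward zero (league average) with its own stabilization count- a sketch:

```python
def project_next_oaa_per_n(fid, oaa_per_n, opps, k_fid=12, k_resid=218):
    """Blend descriptive FID with the catch-outcome residual (OAA/n - FID), each
    regressed toward 0 with its stabilization count (2+ star plays)."""
    resid = oaa_per_n - fid
    return (fid * opps / (opps + k_fid)
            + resid * opps / (opps + k_resid))

# e.g. a full-time season of ~75 2+ star opportunities keeps 75/87 ~ 86% of FID
# but only 75/293 ~ 26% of the catch-outcome residual.
```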

0 and 1 Star Plays

Since 0 and 1 star plays have a catch probability of at least 90%, there’s an upper bound on success probability added, and while players reaching +10% on 2+* plays isn’t rare, obviously that’s not going to translate over directly to a 105% catch rate on 0/1* plays.   I did the same analysis for 0-1* plays as for 2+* plays using the 2+* metrics as well as a FID fit specifically for 0-1* plays.  Everything here is weighted by the player’s number of 0-1* opportunities in year T.

weighted correlation r-values (N=712) itself next season same-season OAA/n 0/1* next-season OAA/n 0/1*
FID 2+* 0.75 0.20 0.25
OAA/n 2+* 0.59 0.27 0.27
OAA-FID 2+* 0.15 0.18 0.11
FID reweighted for only 0/1* 0.75 0.22 0.26
OAA/n 0/1* 0.13 1.00 0.13
OAA-FID 0/1* 0.07 0.97 0.07

The obvious takeaway is that these correlations suck compared to the ones for 2+* plays, and that’s the result of there being a much larger spread in talent in the ability to catch 2+* balls compared to easier ones.  2* plays are ~25% of the total plays, but comprise ~80% of a player’s OAA.  0-1* plays are 3 times as numerous and comprise ~20% of OAA.  OAA on 0-1* plays is just harder to predict because there’s less signal in it as seen by it self-correlating horrendously.

The other oddities are that OAA/n on 2+* outcorrelates FID from 2+* and that both FIDs do better on the next season than the current season.  For the former, the residual OAA-FID on 2+* plays has a comparable signal, and 0-1* plays are a lot more weighted to “catch the ball” skill relative to physical skills than 2+* plays are, and OAA and especially the residual weight “catch the ball” heavily, so that’s my best guess for that.  As to why FID correlates better to next season, I don’t have a great explanation there. It was odd enough that I recreated the entire analysis from scratch, but I just got the same thing again. I broke it down by year and didn’t see anything strange there either.  2020 correlated worse, of course, but everything is attempt-weighted so that shouldn’t much matter for same-season correlations.

Not shockingly, Reaction grades out as relatively less important compared to Sprint Speed and Burst on these plays than it did on 2+* plays.

All plays together

Doing the same analysis on all plays, and comparing the 2+* metrics to the all-play metrics, we get

weighted correlation r-values (N=712) itself next season same-season OAA/n all plays next-season OAA/n all plays
FID 2+* 0.749 0.758 0.627
OAA/n 2+* 0.589 0.943 0.597
OAA-FID 2+* 0.154 0.559 0.153
FID Reweighted for all plays 0.751 0.759 0.628
OAA/n all plays 0.601 1.000 0.601
OAA-FID all plays 0.210 0.629 0.173
Regressed FID and (OAA-FID) 2+* only 0.760 0.842 0.642
Regressed FID and (OAA-FID) all plays 0.763 0.894 0.652

with regression numbers of 35 opportunities for all-play FID and 520 (~2 full seasons) opportunities for all-play OAA-FID.  As a descriptive stat, you get basically everywhere you’re going to go with FID just by looking at 2* plays, and that’s also enough to outpredict last season’s OAA/n for all plays.  The relative weights are fairly similar here- 1:1.34:1.57 Reaction:Sprint Speed:Excess Burst for all plays compared to 1:1.23:1.48 for only 2+* plays, with Reaction relatively a bit less important when considering all plays as expected given the 0-1* section.  The equation for all plays is:

FID = 0.08% + 1.12% * Reaction + 1.50% * Sprint Speed above average + 1.76% * Excess Burst.

The coefficients are smaller than the 2+* version because there’s less variation in OAA/n than in 2+* plays, but to convert to a seasonal number, there are roughly 4 times as many total plays as 2+* plays, so it all works out.  A hypothetical player who’s +1.0 in all components would be +16.3% on 2+star plays and +4.46% on all plays.  Given a breakdown of 70 2+* attempts and 210 0/1* attempts for 280 total attempts in a season, this person would be expected to be +12.5 OAA overall, +11.4 OAA on 2+* plays and +1.1 OAA on 0/1* plays.
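Spelling out that arithmetic:

```python
# +1.0 in every component, using the two FID fits above.
fid_2plus = 0.001 + 0.044 + 0.054 + 0.064        # +16.3% on 2+ star plays
fid_all   = 0.0008 + 0.0112 + 0.0150 + 0.0176    # +4.46% on all plays

attempts_2plus, attempts_all = 70, 280           # rough full-time season
oaa_all   = fid_all * attempts_all               # ~ +12.5 OAA overall
oaa_2plus = fid_2plus * attempts_2plus           # ~ +11.4 OAA on 2+ star plays
oaa_01    = oaa_all - oaa_2plus                  # ~ +1.1 OAA on 0/1 star plays
print(round(oaa_all, 1), round(oaa_2plus, 1), round(oaa_01, 1))
```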

Defensive Aging

FID also allows a quick look at where aging comes from across the paired seasons.  Sprint Speed is roughly a straight line down with age, and it’s the biggest driver of decline by far.

Reaction Sprint Speed above avg Excess Burst Total FID (all) OAA/n (all)
Change -0.0410 -0.1860 0.0133 n/a -0.0030 -0.0029
Expected Impact (in catch %) -0.046% -0.279% 0.023% -0.302% -0.302% -0.285%

That’s around a 0.9 drop in seasonal OAA on average from getting a year older, which is a very good match for the decline predicted by the worse FID components.  As a rough rule of thumb, 1 marginal foot of reaction distance is worth 3 OAA over a full year, 1 ft/s of Sprint Speed is worth 4 OAA, and 1 foot of Excess Burst is worth 5 OAA.

Conclusion

Sprint Speed, Reaction Distance, and Excess Burst really do cover most of the skill variation in defense, and by the time players have reached the major league level, their routes and ability to catch the ball are sufficiently good that those differences are overwhelmed by physical differences.  This measure depends on that being true- at lower levels, it might not hold as well and the regression number for OAA-FID could drop much lower to where you’d be throwing away a lot of information by not considering OAA-FID.  It’s also definitely vulnerable to Goodhart’s law- “When a measure becomes a target, it ceases to be a good measure“.  It’s hard to uselessly game your FIP to be much lower than it would be if you were pitching normally, but without play-level data (or observation) to detect players in motion at the pitch, anybody who wanted to game FID could probably do it pretty easily.

Range Defense Added and OAA- Outfield Edition

TL;DR It massively cheats and it’s bad, just ignore it.

First, OAA finally lets us compare all outfielders to each other regardless of position and without need for a positional adjustment.  Range Defense Added unsolves that problem and goes back to comparing position-by-position.  It also produces some absolutely batshit numbers.

From 2022:

Name Position Innings Range Out Score Fielded Plays
Giancarlo Stanton LF 32 -21.5 6
Giancarlo Stanton RF 280.7 -6.8 90

Stanton was -2 OAA on the year in ~300 innings (like 1/4th of a season).  An ROS of -21.5 over a full season is equivalent to pushing -50 OAA.  The worst qualified season in the Statcast era is 2016 Matt Kemp (-26 OAA in 240 opportunities), and that isn’t even -*10*% success probability added (analogous to ROS), much less -21.5%.  The worst seasons at 50+ attempts (~300 innings) are 2017 Trumbo and 2019 Jackson Frazier at -12%.  Maybe 2022 Yadier Molina converted to a full-time CF could have pulled off -21.5%, but nobody who’s actually put in the outfield voluntarily for 300 innings in the Statcast era is anywhere near that terrible.  That’s just not a number a sane model can put out without a hell of a reason, and 2022 Stanton was just bad in the field, not “craploads worse than end-stage Kemp and Trumbo” material.

Name Position Innings Range Out Score Fielded Plays
Luis Barrera CF 1 6.1 2
Luis Barrera LF 98.7 2 38
Luis Barrera RF 101 4.6 37

I thought CF was supposed to be the harder position.  No idea where that number comes from.  Barrera has played OF quite well in his limited time, but not +6.1% over the average CF well.

As I did with the infield edition, I’ll be using rate stats (Range Out Score and OAA/inning) for correlations, treating each player-position-year combo separately, and it’s important to repeat the reminder that BP will blatantly cheat to improve correlations without mentioning anything about what they’re doing in the announcements, and they’re almost certainly doing that again here.

Here’s a chart with year-to-year correlations broken down by innings tranche (weighted by the lower innings total of the two paired years):

LF OAA to OAA ROS to ROS ROS to OAA Lower innings Higher innings Inn at other positions year T Inn at other positions year T+1 n
0 to 10 -0.06 0.21 -0.11 6 102 246 267 129
10 to 25 -0.04 0.43 0.08 17 125 287 332 128
25 to 50 0.10 0.73 0.30 35 175 355 318 135
50 to 100 0.36 0.67 0.23 73 240 338 342 120
100 to 200 0.27 0.78 0.33 142 384 310 303 121
200 to 400 0.49 0.71 0.37 284 581 253 259 85
400+ inn 0.52 0.56 0.32 707 957 154 124 75
RF OAA to OAA ROS to ROS ROS to OAA Lower innings Higher innings Inn at other positions year T Inn at other positions year T+1 n
0 to 10 0.10 0.34 0.05 5 91 303 322 121
10 to 25 0.05 0.57 0.07 16 140 321 299 128
25 to 50 0.26 0.59 0.14 36 186 339 350 101
50 to 100 0.09 0.75 0.16 68 244 367 360 168
100 to 200 0.38 0.72 0.42 137 347 376 370 83
200 to 400 0.30 0.68 0.43 291 622 245 210 83
400+ inn 0.60 0.58 0.32 725 1026 120 129 92
CF OAA to OAA ROS to ROS ROS to OAA Lower innings Higher innings Inn at other positions year T Inn at other positions year T+1 n
0 to 10 0.00 0.16 0.09 5 161 337 391 83
10 to 25 0.00 0.42 -0.01 17 187 314 362 95
25 to 50 0.04 0.36 0.03 34 234 241 294 73
50 to 100 0.16 0.56 0.09 70 305 299 285 100
100 to 200 0.34 0.70 0.42 148 434 314 305 95
200 to 400 0.47 0.66 0.25 292 581 228 230 86
400+ inn 0.48 0.45 0.22 754 995 134 77 58

Focus on the left side of the chart first.  OAA/inning behaves reasonably, being completely useless for very small numbers of innings and then doing fine for players who actually play a lot.  ROS is simply insane.  Outfielders in aggregate get an opportunity to make a catch every ~4 innings (where opportunity is a play that the best fielders would have a nonzero chance at, not something completely uncatchable that they happen to pick up after it’s hit the ground).

ROS is claiming meaningful correlations on 1-2 opportunities, and after ~10 opportunities it’s posting year-to-year correlations on par with OAA’s after a full season.  That’s simply impossible (or beyond astronomically unlikely) to do with ~10 yes/no outcome data points and average talent variation well under +/-10%.  The only way to do it is by using some kind of outside information to cheat (time spent at DH/1B? who knows, who cares).
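To put a rough number on “beyond astronomically unlikely”, here’s a quick simulation with deliberately generous made-up inputs (a ~90% average catch rate and a talent spread of 0.05, so +/-10% covers roughly two standard deviations):

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_opps = 5000, 10          # ~10 yes/no opportunities per "season"
talent = np.clip(rng.normal(0.90, 0.05, n_players), 0, 1)  # generous talent spread

# Two independent seasons of ~10 catch opportunities per player.
year1 = rng.binomial(n_opps, talent) / n_opps
year2 = rng.binomial(n_opps, talent) / n_opps
print(np.corrcoef(year1, year2)[0, 1])   # ~0.2, nowhere near ROS's claimed values
```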

I don’t know why the 0-10 inning correlations are so low- those players played a fair bit at other positions (see the right side of the table), so any proxy cheat measures should have reasonably stabilized- but maybe the model is just generically batshit nonsense at extremely low opportunities at a position for some unknown reason as happened with the DRC+ rollout (look at the gigantic DRC+ spread on 1 PA 1 uBB pitchers in the cheating link above).

Also, once ROS crosses the 200-inning threshold, it starts getting actively worse at correlating to itself.  Across all three positions, it correlates much better at lower innings totals and then shits the bed once it starts trying to correlate full-time seasons to full-time seasons.  This is obviously completely backwards of how a metric should behave and more evidence that the basic model behavior here is “good correlation based on cheating (outside information) that’s diluted by mediocre correlation on actual play-outcome data.”

They actually do “improve” on team switchers here relative to non-switchers- instead of being the worst, as they were in the infield, again likely due to overfitting to a fairly small number of players- but it’s still nothing of note given how bad they are relative to OAA’s year-to-year correlations for regular players, even with the cheating.

OAA and the New Baseball Prospectus Defensive Metric Range Defense Added: Infield Edition

TL;DR Use OAA. Ignore RDA/ROS.

Baseball Prospectus came out with a new defensive metric in the vein of their DRC+ and DRA- stats.   If you’re familiar with my commentary on DRC+, this is going to hit some of the same notes, but it’s still worth a quick read if you made it this far.  The infield and outfield models for RDA behave extremely differently, so I’m going to discuss each one in a separate post. The outfield post is here.

The infield model is simply bad compared to OAA/DRS.  If somebody is giving you hype-job statistics and only tells you how well a team does against a non-division same-league opponent who’s at least 2 games below .500 while wearing uniforms with a secondary color hex code between #C39797 and #FFFFFF in Tuesday day games during a waxing gibbous moon… well, that ought to make you immediately suspicious of how bad everything else is.  And the same goes for the statistics cited in the RDA article.

That is… the opposite of a resounding win for ROS/RDA.  And it’s worse than it looks because OAA is (theoretically, and likely practically) the best at stripping out fielder positioning, while DRS and ROS will have some residual positioning information that will self-correlate to some extent.  DRS also contains additional information (extra penalty for botched balls down the line, throwing errors, double plays) that likely helps it self-correlate better, and ROS/RDA appear to contain outside information as described above, which will also help them self-correlate better.

(year T stat) OAA/inn DRS/inn ROS RDA/inn N
to year T+1 OAA 0.44 0.32 0.22 0.21 177
to year T+1 DRS 0.26 0.45 0.30 0.30 177

ROS/RDA correlating significantly better to DRS than to OAA is suggestive of a fair bit of its year-to-year self-correlation being to non-demonstrated-fielding-skill information.

Even in their supposed area of supremacy, team-switchers, infield ROS/RDA is still bad.  Classifying players as either non-switchers (played both seasons for the same team only), offseason switchers (played all of year T for one team and all of year T+1 for a different team), or midseason switchers (switched teams in the middle of at least one season):

All IF OAA/inn DRS/inn ROS RDA/inn n
Offseason 0.40 0.45 0.43 0.46 79
Midseason 0.39 0.31 0.13 0.11 91
Off or Mid 0.39 0.38 0.28 0.28 170
No Switch 0.45 0.45 0.37 0.36 541
All 0.44 0.45 0.36 0.35 711

They match OAA/DRS on offseason-switching players- likely due to overfitting their model to a small number of players- but they’re absolutely atrocious on midseason switchers, and they actually have the *biggest* overall drop in reliability between non-switchers and switchers.  I don’t think there’s much more to say.  Infield RDA/ROS isn’t better than OAA/DRS.  It isn’t even close to equal to OAA/DRS. 


Technical notes: I sourced OAA from Fangraphs because I didn’t see a convenient way to grab OAA by position from Savant without scraping individual player pages (the OAA Leaderboard .csv with a position filter doesn’t include everybody who played a position).  This meant that the slightly inconvenient way of grabbing attempts from Savant wasn’t useful here because it also couldn’t split attempts by position, so I was left with innings as a denominator.  Fangraphs doesn’t have a (convenient?) way to split defensive seasons between teams, while BP does split between teams on their leaderboard, so I had to combine split-team seasons and used a weighted average by innings.  Innings by position match between BP and FG in 98.9% of cases and the differences are only a couple of innings here and there, nothing that should make much difference to anything.

The Five MTG: Arena Rankings

This ties up a few loose ends with the Mythic ranking system and explains how the other behind-the-scenes MMRs work.  As it turns out, there are five completely distinct rankings in Arena: Mythic Constructed, Mythic Limited, Serious Constructed, Serious Limited, and Play (these are my names for them).

Non-Mythic

All games played in ranked (including Mythic) affect the Serious ratings, *as do the corresponding competitive events- Traditional Draft, Historic Constructed Event, etc*.  Serious ratings are conserved from month to month.   Play rating includes the Play and Brawl queues, as well as Jump In (despite Jump In being a limited event), and is also conserved month to month.  Nonsense events (e.g. the Momir and Artisan MWMs) don’t appear to be rated in any way.

The Serious and Play ratings appear to be intended to be a standard implementation of the Glicko-2 rating system except with ratings updated after every game.  I say intended because they have a gigantic bug that screws everything up, but we’ll talk about that later.  New accounts start with a rating of 1500, RD of 350, and volatility of 0.06.  These update after every game and there doesn’t seem to be any kind of decay or RD increase over time (an account that hadn’t even logged in since before rotation still had a low RD).  Bo1 and Bo3 match wins are rated the same for these.

Mythic

Only games played while in Mythic count towards the Mythic rankings, and the gist of the system is exactly as laid out in One last rating update although I have a slight formula update.  These rankings appear to come into existence when Mythic is reached each month and disappear again at the end of the month (season).

I got the bright idea that they may be using the same code to calculate Mythic changes, and this appears to be true (I say appears because I have irreconcilable differences in the 5th decimal place for all of their Glicko-2 computations that I can’t resolve with any (tau, Glicko scale constant) pair.  There’s either a small bug or some kind of rounding issue on one of our ends, but it’s really tiny regardless).  The differences are that Mythic uses a fixed RD of 60 and a fixed volatility of 0.06 (or extremely, extremely close) that don’t change after matches, and that rating changes are multiplied by 2 for Bo3 matches.  Glicko with a fixed RD is very similar to Elo with a fixed K on matchups within 200 points of each other.

In addition, the initial Mythic rating is seeded *as a function of the Serious rating*.  [Edit 8/1/2022: The formula is: if Serious Rating >=3000, Mythic Rating = 1650.  Otherwise Mythic Rating = 1650 – ((3000 – Serious Rating) / 10).  Using Serious Rating from before you win your play-in game to Mythic, not after you win it, because reasons] That’s the one piece I never exactly had a handle on.  I knew that tanking my rating gave me easy pairings and a trash initial Mythic rating the next month, but I didn’t know how that happened.  The existence of a separate conserved Serious rating explains it all.  [Edit 8/1/22: rest of paragraph deleted since it’s no longer relevant with the formula given]
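In code form, the seeding is just:

```python
def initial_mythic_rating(serious_rating: float) -> float:
    """Monthly Mythic MMR seed, using the Serious rating from before the
    play-in win (per the formula above)."""
    if serious_rating >= 3000:
        return 1650.0
    return 1650.0 - (3000.0 - serious_rating) / 10.0

# e.g. a tanked Serious rating of 2000 seeds next month's Mythic MMR at 1550,
# while anything at 3000+ starts at the 1650 cap.
```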

The Mythic system also had two fixed constants that previously appeared to be randomly arbitrary- the minimum win of 5.007 points and the +7.4 win/-13.02 loss when playing a non-Mythic.  Using the Glicko formula with both players having RD 60 and Volatility 0.06, the first number appears when the rating difference is restricted to a maximum of exactly 200 points. Even if you’re 400 points above your opponent, the match is rated as though you’re only 200 points higher.  The second number appears when you treat the non-Mythic player as rated exactly 100 points lower than you are (regardless of what your rating actually is) with the same RD=60.  This conclusively settles the debate as to whether or not that system/those numbers were empirically derived or rectally derived.
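For anyone who wants to check the arithmetic, here’s a sketch of the single-game Glicko-2 update under those assumptions (both players treated as RD 60, volatility 0.06); it reproduces the constants to within rounding:

```python
import math

SCALE = 173.7178        # standard Glicko-2 scale constant
RD, VOL = 60.0, 0.06    # the fixed values Mythic appears to use

def mythic_points(rating_diff, won):
    """Glicko-2 rating change for one game, both players at RD=60, vol=0.06.
    rating_diff is (your rating - opponent's rating) on the displayed scale."""
    phi = RD / SCALE
    g = 1 / math.sqrt(1 + 3 * phi**2 / math.pi**2)
    e = 1 / (1 + math.exp(-g * rating_diff / SCALE))
    v = 1 / (g**2 * e * (1 - e))
    new_phi_sq = 1 / (1 / (phi**2 + VOL**2) + 1 / v)
    return SCALE * new_phi_sq * g * ((1.0 if won else 0.0) - e)

print(mythic_points(200, True))    # ~ +5.007: the minimum win (diff capped at 200)
print(mythic_points(100, True))    # ~ +7.4:  win vs. a non-Mythic treated as -100
print(mythic_points(100, False))   # ~ -13.02: loss to that same phantom opponent
```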

The Huge Bug

As I reported on Twitter last month, the Serious and Play ratings have a problem (but not Mythic… at least not this problem).  If you lose, you’re rated as though you lost to your opponent.  If you win, you’re rated as though you beat a copy of yourself (rating and RD).  And, of course, even though correcting the formula/function call is absolutely trivial, it still hasn’t been fixed after weeks.  This bug dates to at least January, almost certainly to the back-end update last year, and quite possibly back into Beta.

Glicko-2 isn’t zero-sum by design (if two players with the same rating play, the one with higher RD will gain/lose more points), but it doesn’t rapidly go batshit.  With this bug, games are now positive-sum in expectation.  When the higher-rated player wins, points are created (the higher-rated player wins more points than they should).  When the lower-rated player wins, points are destroyed (the lower-rated player wins fewer points than they should). Handwaving away some distributional things that the data show don’t matter, since the higher-rated player wins more often, points are created more than destroyed, and the entire system inflates over time.

My play rating is 5032 (and I’m not spending my life tryharding the play queue), and the median rating is supposed to be around 1500.  In other words, if the system were functioning correctly, my rating would mean that I’d be expected to only lose to the median player 1 time per tens of millions of games.  I’m going to go out on a limb and say that even if I were playing against total newbies with precons, I wouldn’t make it anywhere close to 10 million games before winding up on the business end of a Gigantosaurus.  And I encountered a player in a constructed event who had a Serious Constructed rating over 9000, so I’m nowhere near the extreme here.  It appears from several reports that the cap is exactly 10000.

In addition to inflating everybody who plays, it also lets players go infinite (up to 10000) on rating.  Because beating a copy of yourself is worth ~half as many points as a loss to a player you’re expected to beat ~100% of the time, anybody who can win more than 2/3 of their matches will increase their rating without bound.  Clearly some players (like the 9000 I played) have been doing this for awhile.

If the system were functioning properly, most people would be in a fairly narrow range.  In Mythic constructed, something like 99% of players are between 1200-2100 (this is a guess, but I’m confident those endpoints aren’t too narrow by much, if at all), and that’s with a system that artificially spreads people out by letting high-rated players win too many points and low-rated players lose too many points.  Serious Constructed would be spread out a bit more because it includes all the non-Mythics as well, but it’s not going to be that huge a gap down to the people who can at least make it out of Silver.  And while the Play queue has much wider deck-strength, the deck-strength matching, while very far from perfect, should at least make the MMR difference more about play skill than deck strength, so it also wouldn’t spread out too far.

Instead, because of the rampant inflation, the center of the distribution is at like 4000 MMR instead of ~1500.  Strong players are going infinite on the high side, and because new accounts still start at 1500, and there’s no way to make it to 4000 without winning lots of games (especially if you screwed around with precons or something for a few games at some point), there’s a constant trickle of relatively new (and some truly atrocious) accounts on the left tail spanning the thousands-of-points gap between new players and established players, and that gap only exists because of the rating bug.  It looks something like this.

The three important features are that the bulk of the distribution is off by thousands of points from where it should be, noobs start thousands of points below the bulk, and players are spread over an absurdly wide range.  The curves are obviously hand-drawn in Paint, so don’t read anything beyond that into the precise shapes.

This is why new players often make it to Mythic one time easily with functional-but-quite-underpowered decks- they make it before their Serious Constructed rating has crossed the giant gap from 1500 to the cluster where the established players reside.  Then once they’ve crossed the gap, they mostly get destroyed.  This is also why tanking the rating is so effective.  It’s possible to tank thousands of points and get all the way back below the noob entry point.  I’ve said repeatedly that my matchups after doing that were almost entirely obvious noobs and horrific decks, and now it’s clear why.

It shouldn’t be possible (without spending a metric shitton of time conceding matches, since there should be rapidly diminishing returns once you separate from the bulk) to tank far enough to do an entire Mythic climb with a high winrate without at least getting the rating back to the point where you’re bumping into the halfway competent players/decks on a regular basis, but because of the giant gap, it is.  The Serious Constructed rating system wouldn’t be entirely fixed without the bug- MMR-based pairing still means a good player is going to face tougher matchups than a bad player on the way to Mythic, and tanking would still result in some very easy matches at the start- but those effects are greatly magnified because of the artificial gap that the bug has created.

What I still don’t know

I don’t know whether Serious or Mythic is used for pairing in Mythic.  I don’t have the data to know if Serious is used to pair ranked drafts or any other events (Traditional Draft, Standard Constructed Event, etc).  It would take a 17lands-data-dump amount of match data to conclusively show that one way or another.  AFAIK, WotC has said that it isn’t, but they’ve made incorrect statements about ratings and pairings before.  And I certainly don’t know, in so, so many different respects, how everything made it to this point.

About MLB’s New Mudding and Storage Protocol

My prior research on the slippery ball problem: Baseball’s Last Mile Problem

The TL;DR is that mudding adds moisture to the surface of the ball.  Under normal conditions (i.e. stored with free airflow where it was stored before mudding), that moisture evaporates off in a few hours and leaves a good ball.  If that evaporation is stopped, the ball goes to complete hell and becomes more slippery than a new ball.  This is not fixed by time in free airflow afterwards.

My hypothesis is that the balls were sometimes getting stored in environments with sufficiently restricted airflow (the nylon ball bag) too soon after mudding, and that stopped the evaporation.  This only became a problem this season with the change to mudding all balls on gameday and storing them in a zipped nylon bag before the game.

MLB released a new memo yesterday that attempts to standardize the mudding and storage procedure.  Of the five bullet points, one (AFAIK) is not a change.  Balls were already supposed to sit in the humidor for at least 14 days.  Attempting to standardize the application procedure and providing a poster with allowable darkness/lightness levels are obviously good things.  It may be relevant here if the only problem balls were the muddiest (aka wettest) which shouldn’t happen anymore, but from anecdotal reports, there were problem balls where players didn’t think the balls were even mudded at all, and unless they’re blind, that seems hard to reconcile with also being too dark/too heavily mudded.  So this may help some balls, but probably not all of them.

Gameday Mudding

The other points are more interesting.  Requiring all balls to be mudded within 3 hours of each other could be good or bad.  If it eliminates stragglers getting mudded late, this is good.  If it pushes all mudding closer to gametime, this is bad.  Either way, unless MLB knows something I don’t (which is certainly possible- they’re a business worth billions and I’m one guy doing science in my kitchen), the whole gameday mudding thing makes *absolutely no sense* to me at all in any way.

Pre-mudding, all balls everywhere** are equilibrated in the humidor the same way.  Post-mudding, the surface is disrupted with transient excess moisture.  If you want the balls restandardized for the game, then YOU MAKE SURE YOU GIVE THE BALL SURFACE TIME AFTER MUDDING TO REEQUILIBRATE TO A STANDARD ENVIRONMENT BEFORE DOING ANYTHING ELSE WITH THE BALL. And that takes hours.

In a world without universal humidors, gameday mudding might make sense since later storage could be widely divergent.  Now, it’s exactly the same everywhere**.  Unless MLB has evidence that a mudded ball sitting overnight in the humidor goes to hell (and I tested and found no evidence for that at all, but obviously my testing at home isn’t world-class- also, if it’s a problem, it should have shown up frequently in humidor parks before this season), I have no idea why you would mud on gameday instead of the day before like it was done last season.  The evaporation time between mudding and going in the nylon bag for the game might not be long enough if mudding is done on gameday, but mudding the day before means it definitely is.

Ball Bag Changes

Cleaning the ball bag seems like it can’t hurt anything, but I’m also not sure it helps anything. I’m guessing that ball bag hygiene over all levels of the sport and prior seasons of MLB was generally pretty bad, yet somehow it was never a problem.  They’ve seen the bottom of the bags though.  I haven’t. If there’s something going on there, I’d expect it to be a symptom of something else and not a primary problem.

Limiting to 96 balls per bag is also kind of strange.  If there is something real about the bottom of the bag effect, I’d expect it to be *the bottom of the bag effect*.  As long as the number of balls is sufficient to require a tall stack in the bag (and 96 still is), and since compression at these number ranges doesn’t seem relevant (prior research post), I don’t have a physical model of what could be going on that would make much difference for being ball 120 of 120 vs. ball 96 of 96.  Also, if the bottom of the bag effect really is a primary problem this year, why wasn’t it a problem in the past?  Unless they’re using entirely new types of bags this season, which I haven’t seen mentioned, we should have seen it before.  But I’m theorizing and they may have been testing, so treat that paragraph with an appropriate level of skepticism.

Also, since MLB uses more than 96 balls on average in a game, this means that balls will need to come from multiple batches.  This seemed like it had the potential to be significantly bad (late-inning balls being stored in a different bag for much longer), but according to an AP report on the memo:

“In an effort to reduce time in ball bags, balls are to be taken from the humidor 15-30 minutes before the scheduled start, and then no more than 96 balls at a time.  When needed, up to 96 more balls may be taken from the humidor, and they should not be mixed in bags with balls from the earlier bunch.”

This seems generally like a step in the smart direction, like they’d identified being zipped up in the bag as a potential problem (or gotten the idea from reading my previous post from 30 days ago :)).  I don’t know if it’s a sufficient mitigation because I don’t know exactly how long it takes for the balls to go to hell (60 minutes in near-airtight conditions made them complete garbage, so damage certainly appears in less time, but I don’t know how fast and can’t quickly test that).  And again, repeating the mantra from before, time spent in the ball bag *is only an issue if the balls haven’t evaporated off after mudding*.  And that problem is slam-dunk guaranteed solvable by mudding the day before, and then this whole section would be irrelevant.

Box Storage

The final point, “all balls should be placed back in the Rawlings boxes with dividers, and the boxes should then be placed in the humidor. In the past, balls were allowed to go directly into the humidor.” could be either extremely important or absolutely nothing.  This doesn’t say whether the boxes should be open or closed (have the box top on) in the humidor.  I tweeted to the ESPN writer and didn’t get an answer.

The boxes can be seen in the two images in https://www.mlb.com/news/rockies-humidor-stories.  If they’re open (and not stacked or otherwise covered to restrict airflow), this is fine and at least as good as whatever was done before today.  If the boxes are closed, it could be a real problem.  Like the nylon ball bag, this is also a restricted-flow environment, and unlike the nylon ball bag, some balls will *definitely* get in the box before they’ve had time to evaporate off (since they go in shortly after mudding).

I have one Rawlings box without all the dividers.  The box isn’t airtight, but it’s hugely restricted airflow.  I put 3 moistened balls in the box along with a hygrometer and the RH increased 5% and the balls lost moisture about half as fast as they did in free air.  The box itself absorbed no relevant amount.  With 6 moistened balls in the box, the RH increased 7% (the maximum moistened balls in a confined space will do per prior research) and they lost moisture between 1/3 and 1/4 as fast as in free air.

Unlike the experiments in the previous post where the balls were literally sealed, there is still some moisture flux off the surface here.  I don’t know if it’s enough to stop the balls from going to hell.  It would take me weeks to get unmudded equilibrated balls to actually do mudding test runs in a closed box, and I only found out about this change yesterday with everybody else.  Even if the flux is still sufficient to avoid the balls going to hell directly, the evaporation time appears to be lengthened significantly, and that means that balls are more likely to make it into the closed nylon bag before they’ve evaporated off, which could also cause problems at that point (if there’s still enough time for problems there- see previous section).

The 3 and 6 ball experiments are one run each, in my ball box, which may have a better or worse seal than the average Rawlings box, and the dividers may matter (although they don’t seem to absorb very much moisture from the air, prior post), etc.  Error bars are fairly wide on the relative rates of evaporation, but hygrometers don’t lie.  There doesn’t seem to be any way a closed box isn’t measurably restricting airflow and increasing humidity inside unless the box design changed a lot in the last 3 years.  Maybe that humidity increase/restricted airflow isn’t enough to matter directly or indirectly, but it’s a complete negative freeroll.  Nothing good can come of it.  Bad things might.  If there are reports somewhere this week that tons of balls were garbage, closed-box storage after mudding is the likely culprit.  Or the instructions will actually be uncovered open box (and obeyed) and the last 5 paragraphs will be completely irrelevant.  That would be good.

Conclusion: A few of the changes are obviously common-sense good.  Gameday mudding continues to make no sense to me and looks like it’s just asking for trouble.  Box storage in the humidor after mudding, if the boxes are closed, may be introducing a new problem. It’s unclear to me if the new ball-bag procedures reduce time sufficiently to prevent restricted-airflow problems from arising there, although it’s at least clearly a considered attempt to mitigate a potential problem.

The Mythic Limited Rating Problem

TL;DR

Thanks to @twoduckcubed for reading my previous work and being familiar enough with high-end limited winrates to see that there was likely to be a real problem here, and there is.  If you haven’t read my previous work, it’s at One last rating update and Inside the MTG: Arena Rating System, but as long as you know anything at all about any MMR system, this post is intended to be readable by itself.

Mythic MMR starts from scratch each month and each player is assigned a Mythic MMR when they first make Mythic that month.  Most people start at the initial cap, 1650, and a few start a bit below that.  It takes losing an awful lot to be assigned an initial rating very far below that, and since losing a bunch of limited matches costs resources (while doing it in ranked constructed is basically free), it’s mostly 1650 or close.  When two people with a Mythic rating play in Premier or Quick, it’s approximately an Elo system with a K value of 20.4, and the matches are zero-sum.  When one player wins points, the other player loses exactly the same number of points.

Most games that Mythic limited players play aren’t against other Mythics though.  Diamonds are the most common opponents, with significant numbers of games against Platinums as well (and a handful against Gold/Silver/Bronze).  In this case, since the non-Mythic opponents literally don’t have a Mythic MMR to plug into the Elo formula, Arena, in a decision that’s utterly incomprehensible on multiple levels, rates all of these matches exactly the same regardless of the Mythic’s rating or the non-Mythic’s rank or match history.  +7.4 points for a win and -13 points for a loss, and this is *not* zero-sum because the non-Mythic doesn’t have a Mythic rating yet.  The points are simply created out of nothing or lost to the void.

+7.4 for a win and -13 for a loss means that the Mythics need to win 13/(13+7.4) = 63.7% of the time against non-Mythics to break even.   And, well, thanks to the 17lands data dumps, I found that they won 58.3% in SNC and 59.4% in NEO (VOW and MID didn’t seem to have opponent rank data available).  Nowhere close to breakeven.  ~57% vs. Diamonds and ~63% vs Plats.  Not even breakeven playing down two ranks.  And this is already a favorable sample for multiple reasons.  It’s 17lands users, who are above average Mythics (their Mythic-Mythic winrate is 52.4%).  It’s also a game-averaged sample instead of a player-averaged sample, and better players play more games on average in Mythic because they get there faster and have more resources to keep paying entry fees with.

Because of this, to a reasonable approximation, every time a Mythic Limited player hits the play button, 1 MMR vanishes into the void.  And since 1% of Mythic in limited is only ~16.5 MMR, 1% Mythic in expectation is lost every 2-3 drafts just for playing.  The more they play, the more MMR they lose into the void.  The very best players- those who can win 5% more than the average 17lands-using Mythic drafter- can outrun this and profit from playing lower ranks- but the vast majority can’t, hence the video at the top of the post.  Instead of Mythic MMR being a zero-sum game, it’s like gambling against a house edge, and playing at all is clearly disincentivized for most people.
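Putting numbers on that:

```python
def mmr_per_game_vs_nonmythic(win_rate, win_pts=7.4, loss_pts=13.0):
    """Expected Mythic MMR change per game against a non-Mythic opponent."""
    return win_rate * win_pts - (1 - win_rate) * loss_pts

for wr in (0.583, 0.594, 0.637):
    print(f"{wr:.1%}: {mmr_per_game_vs_nonmythic(wr):+.2f} MMR per game")
# 58.3% (SNC) -> about -1.1, 59.4% (NEO) -> about -0.9, 63.7% -> about breakeven
```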

Obviously this whole implementation is just profoundly flawed and needs to be fixed.  The 17lands data is anonymized, so I don’t know how many Mythic-Mythic games appeared from both sides, so I don’t know exactly what percentage of a Mythic’s games are against each level, but it’s something like 51% against Diamond, 29% against Mythic, 19% against Plat, and 1% against Gold and below.  Clearly games vs. Diamonds need to be handled responsibly, and games vs. Golds and below don’t matter much.

A simple fix that keeps most of the system intact (which may not be the best idea, but hey, at least it’s progress) is to assign the initial Mythic MMR upon making Platinum (instead of Mythic) and to not Mythic-rate any games involving a Gold or below.  You wouldn’t get leaderboard position or anything until actually making Mythic, but the rating would be there behind the scenes doing its thing and this silliness would be avoided since all the rated games would be zero-sum and all the Diamond opponents would be reasonably rated for quality after playing enough games to get out of Plat.


Constructed has the same implementation, but it’s mostly not as big a deal because outside of Alchemy, cross-rank pairing isn’t very common except at the beginning of the month, and even if top-1200 quality players are getting scammed out of points by lower ranks at the start of the month (and they may well not be), they have all the time in the world to reequilibrate their rating against a ~100% Mythic opponent lineup later.  Drafters play against bunches of non-Mythics throughout.  Cross-rank pairing in Alchemy ranked may be/become enough of a problem to warrant a similar fix (although likely for the opposite reason, farming lower ranks instead of losing points to them), and it’s not like assigning the initial Mythic rating upon reaching Diamond and ignoring games against lower ranks actually hurts anything there either.

Baseball’s Last Mile Problem

2022 has brought a constant barrage of players criticizing the baseballs as hard to grip and wildly inconsistent from inning to inning, and probably not coincidentally, a spike in throwing error rates to boot.  “Can’t get a grip” and “throwing error” do seem like they might go together.  MLB has denied any change in the manufacturing process, however there have been changes this season in how balls are handled in the stadium, and I believe that is likely to be the culprit.

I have a plausible explanation for how the new ball-handling protocol can cause even identical balls from identical humidors to turn out wildly different on the field, and it’s backed up by experiments and measurements I’ve done on several balls I have, but until those experiments can be repeated at an actual MLB facility (hint, hint), this is still just a hypothesis, albeit a pretty good one IMO.

Throwing Errors

First, to quantify the throwing errors, I used Throwing Errors + Assists as a proxy for attempted throws (it doesn’t count throws that are accurate but late, etc), and broke down TE/(TE+A) by infield position.

TE/(TE+A) 2022 2021 2011-20 max 21-22 Increase 2022 By Chance
C 9.70% 7.10% 9.19% 36.5% 1.9%
3B 3.61% 2.72% 3.16% 32.7% 0.8%
SS 2.20% 2.17% 2.21% 1.5% 46.9%
2B 1.40% 1.20% 1.36% 15.9% 20.1%

By Chance is the binomial odds of getting the 2022 rate or worse using 2021 as the true odds.  Not only are throwing errors per “opportunity” up over 2021, but they’re higher than every single season in the 10 years before that as well, and way higher for C and 3B.   C and 3B have the least time on average to establish a grip before throwing.  This would be interesting even without players complaining left and right about the grip.
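That column is just a binomial tail probability.  Here’s a sketch; the attempt count below is made up (the real TE+A totals aren’t shown in the table), chosen only to roughly match the catcher row’s rates:

```python
from scipy.stats import binom

def prob_by_chance(te_2022, attempts_2022, rate_2021):
    """P(at least this many throwing errors in 2022's attempts if the true rate
    were still the 2021 rate) -- the 'By Chance' column."""
    return binom.sf(te_2022 - 1, attempts_2022, rate_2021)

# Illustrative only: 41 TE in 420 attempts is ~9.8%, vs. a 7.1% baseline.
print(prob_by_chance(te_2022=41, attempts_2022=420, rate_2021=0.071))  # ~0.02
```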

The Last Mile

To explain what I suspect is causing this, I need to break down the baseball supply chain.  Baseballs are manufactured in a Rawlings factory, stored in conditions that, to the best of my knowledge, have never been made public, shipped to teams, sometimes stored again in unknown conditions outside a humidor, stored in a humidor for at least 2 weeks, and then prepared and used in a game.  Borrowing the term from telecommunications and delivery logistics, we’ll call everything after the 2+ weeks in the humidor the last mile.

Humidors were in use in 9 parks last year, and Meredith Wills has found that many of the balls this year are from the same batches as balls in 2021.  So we have literally some of the same balls in literally the same humidors, and there were no widespread grip complaints (or equivalent throwing error rates) in 2021.  This makes it rather likely that the difference, assuming there really is one, is occurring somewhere in the last mile.

The last mile starts with a baseball that has just spent 2+ weeks in the humidor.  That is long enough to equilibrate, per https://tht.fangraphs.com/the-physics-of-cheating-baseballs-humidors/, other prior published research, and my own past experiments.  Getting atmospheric humidity changes to meaningfully affect the core of a baseball takes on the order of days to weeks.  That means that nothing humidity-wise in the last mile has any meaningful impact on the ball’s core because there’s not enough time for that to happen.

This article from the San Francisco Chronicle details how balls are prepared for a game after being removed from the humidor, and since that’s paywalled, a rough outline is:

  1. Removed from humidor at some point on gameday
  2. Rubbed with mud/water to remove gloss
  3. Reviewed by umpires
  4. Not kept out of the humidor for more than 2 hours
  5. Put in a security-sealed bag that’s only opened in the dugout when needed

While I don’t have 2022 game balls or official mud, I do have some 2019* balls, water, and dirt, so I decided to do some science at home.  Again, while I have confidence in my experiments done with my balls and my dirt, these aren’t exactly the same things being used in MLB, so it’s possible that what I found isn’t relevant to the 2022 questions.

Update: Dr. Wills informed me that 2019, and only 2019, had a production issue that resulted in squashed leather and could have affected the mudding results.  She checked my batch code, and it looks like my balls were made late enough in 2019 that they were actually used in 2020 with the non-problematic production method.  Yay.

Experiments With Water

When small amounts of water are rubbed on the surface of a ball, the water absorbs pretty readily (the leather and laces love water), and once the external source of water is removed, the outer edge of the ball is more moisture-rich than both the material slightly further inside and the surrounding air.  The water isn't going to just stay there: it's either going to evaporate off or migrate slightly deeper into the leather.

As it turns out, if the baseball is rubbed with water and then stored with unrestricted air access (and no powered airflow) in the environment it was equilibrated with, the water evaporates off fairly quickly, with an excess-water half-life of a little over an hour (likely lower with powered air circulation), and the ball goes right back to its pre-rub weight, down to 0.01g precision.  So after a few hours, assuming you only added a reasonable amount of water to the surface (I was approaching 0.75 grams added at the most) and didn't submerge the ball in a toilet or something ridiculous, you'd never know anything had happened.  These surface moisture changes are MUCH faster than the days-to-weeks timescales of core moisture changes.
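A minimal sketch of that decay, assuming simple exponential loss with the ~70-minute half-life I measured under unrestricted airflow; the 0.5g starting excess is roughly what my rubs added. Both numbers come from my balls, not official ones.

```python
# Excess surface water remaining over time, assuming exponential decay with a ~70-minute
# half-life (my measurement, open air, no powered circulation).
def excess_water_left(initial_grams: float, minutes: float, half_life_min: float = 70) -> float:
    """Grams of excess surface water remaining after `minutes` of open-air storage."""
    return initial_grams * 0.5 ** (minutes / half_life_min)

for t in (30, 60, 120, 240):
    print(f"t={t:>3} min: {excess_water_left(0.5, t):.3f} g left")
# t=120 min still leaves ~0.15 g -- consistent with needing more than 2 hours to finish evaporating.
```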

Things get much more interesting if the ball is then kept in a higher-humidity environment.  I rubbed a ball down, wiped it with a paper towel, let it sit for a couple of minutes to deal with any surface droplets I missed, and then sealed the ball in a sandwich bag for 2 hours along with a battery-powered portable hygrometer.  I expected the ball to completely saturate the air while losing less mass than I could measure (<0.01g) in the process, but that's not quite what happened: the relative humidity in the bag only went up 7%, though the ball did, as expected, lose no measurable amount of mass.  After taking it out, it started losing mass with a slightly longer half-life than before and lost all the excess water in a few hours.

I repeated the experiment, except this time I sealed the ball and the hygrometer in an otherwise empty 5-gallon pail.  Again, the relative humidity only went up 7%, and the ball lost 0.04 grams of mass.  I calculated that 0.02g of evaporation should have been sufficient to cause that humidity change, so I'm not exactly sure what happened: maybe 0.01g was measurement error (the scale I was using reads to 0.01g), maybe my seal wasn't perfectly airtight, or maybe the crud on the lid that I couldn't clean off, or the pail itself, absorbed a little moisture.  But the ball had 0.5g of excess water to lose (which it did completely lose after removal from the pail, as expected) and only lost 0.04g in the pail, so the basic idea is still the same.
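Here's the rough arithmetic behind that "0.02g should have been sufficient" figure, assuming a temperature around 20°C (where the textbook saturation vapor density is roughly 17.3 g/m³) and ignoring the small volume the ball itself takes up.

```python
# Mass of water vapor needed to raise relative humidity by 7% in an otherwise empty
# 5-gallon pail at ~20 C. The saturation vapor density is a textbook approximation.
GALLON_M3 = 0.003785            # one US gallon in cubic meters
pail_volume_m3 = 5 * GALLON_M3  # ~0.019 m^3
sat_vapor_density = 17.3        # g of water per m^3 of saturated air at ~20 C
delta_rh = 0.07                 # the observed 7% RH rise

print(f"{delta_rh * sat_vapor_density * pail_volume_m3:.3f} g")  # ~0.02 g
```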

This means that if the wet ball has restricted airflow, it's going to take for freaking ever to reequilibrate, because it only takes a trivial amount of moisture loss to "saturate" a good-sized storage space.  And if the ball is in a sealed environment, or in a free-airflow environment more than 7% RH above what it's equilibrated to, the excess moisture will travel inward into more of the leather instead of evaporating off.  (Eventually the entire ball would equilibrate to the higher-RH environment, but we're only concerned with the high-RH environment as a temporary last-mile storage condition here, so that won't happen on our timescales.)

I also ran the experiment sealing the rubbed ball and the hygrometer in a sandwich bag overnight for 8 hours.  The half-life for losing moisture after that was around 2.5 hours, up from the 70 minutes when it was never sealed.  This confirms that the excess moisture doesn’t just sit around at the surface waiting if it can’t immediately evaporate, but that evaporation dominates when possible.

I also ran the experiment with a ball sealed in a sandwich bag for 2 hours along with an equilibrated cardboard divider that came with the box of balls I have.  That didn’t make much difference. The cardboard only absorbed 0.04g of the ~0.5g excess moisture in that time period, and that’s with a higher cardboard:ball ratio than a box actually comes with.  Equilibrated cardboard can’t substitute for free airflow on the timescale of a couple of hours.

Experiments With Mud

I mixed dirt and water to make my own mud and rubbed it in, doing my best imitation of videos I could find, rubbing until the surface of the ball felt dry again.  Since I don't have any kind of instrument to measure slickness, these are my perceptions plus those of my significant other.  We were in almost full agreement on every ball, and the one disagreement converged at the next measurement 30 minutes later.

If stored with unrestricted airflow in the environment it was equilibrated to, this led to roughly the following timeline:

  1. t=0: mudded, ball surface feels dry
  2. t=30 minutes: ball surface feels moist and is worse than when it was first mudded
  3. t=60 minutes: ball surface is drier and similar in grip to when first mudded
  4. t=90 minutes: ball is significantly better than when first mudded
  5. t=120 minutes: no noticeable change from t=90 minutes
  6. t=12 hours: no noticeable change from t=120 minutes

I tested a couple of other things as well:

  1. I took a 12-hour ball, put it in a 75% RH environment for an hour and then a 100% RH environment for 30 minutes, and it didn't matter.  The ball surface was still fine.  The ball would certainly go to hell eventually under those conditions, but it doesn't seem likely to be a concern with anything resembling current protocols.  I also stuck one in a bag for a while and it didn't affect the surface or change the RH at all, as expected since all of the excess moisture was already gone.
  2. I mudded a ball, let it sit out for 15 minutes, and then sealed it in a sandwich bag.  This ball was slippery at every check: 1 hour, 2 hours, and 12 hours (repeated twice).  Interestingly, putting the ball back in its normal environment for over 24 hours afterward didn't help much; it was still quite slippery.  Even with all the excess moisture gone, whatever had happened to the surface while bagged had ruined the ball.
  3. I mudded a ball, let it sit out for 2 hours, at which point the surface was quite good per the timeline above, and then sealed it in a bag.  THE RH WENT UP AND THE BALL TURNED SLIPPERY, WORSE THAN WHEN IT WAS FIRST MUDDED (repeated 3x).  Like #2, time in the normal environment afterward didn't help.  Keeping the ball in its proper environment for 2 hours, sealing it for an hour, and then letting it back out was enough to ruin the ball.

That’s really important IMO.  We know from the water experiments that it takes more than 2 hours to lose the excess moisture under my storage conditions, and it looks like the combination of fresh(ish) mud plus excess surface moisture that can’t evaporate off is a really bad combo and a recipe for slippery balls.  Ball surfaces can feel perfectly good and game-ready while they still have some excess moisture left and then go to complete shit, apparently permanently, in under an hour if the evaporation isn’t allowed to finish.

Could this be the cause of the throwing errors and reported grip problems? Well…

2022 Last-Mile Protocol Changes

The first change for 2022 is that balls must be rubbed with mud on gameday, meaning they're always taking on that surface moisture on gameday.  In 2021, balls had to be mudded 1-2 days in advance of the game; before 2021, the window was as wide as 5 days in advance.  I don't know how far in advance they were regularly mudded before 2021, but even early afternoon for a night game would be fine assuming the afternoon storage had reasonable airflow.

The second change is that they’re put back in the humidor fairly quickly after being mudded and allowed a maximum of 2 hours out of the humidor.  While I don’t think there’s anything inherently wrong with putting the balls back in the humidor after mudding (unless it’s something specific to 2022 balls), humidors look something like this.  If the balls are kept in a closed box, or an open box with another box right on top of them, there’s little chance that they reequilibrate in time.  If they’re kept in an open box on a middle shelf without much room above, unless the air is really whipping around in there, the excess moisture half-life should increase.

There’s also a chance that something could go wrong if the balls are taken out of the humidor, kept in a wildly different environment for an hour, then mudded and put back in the humidor, but I haven’t investigated that, and there are many possible combinations of both humidity and temperature that would need to be checked for problems.

The third change (at least I think it’s a change) is that the balls are kept in a sealed bag- at least highly restricted flow, possibly almost airtight- until opened in the dugout.  Even if it’s not a change, it’s still extremely relevant- sealing balls that have evaporated their excess moisture off doesn’t affect anything, while sealing balls that haven’t finished evaporating off seems to be a disaster.

Conclusion

Mudding adds excess moisture to the surface of the ball, and if its evaporation is prevented for very long- either through restricted airflow or storage in too humid an environment- the surface of the ball becomes much more slippery and stays that way even if evaporation continues later.  It takes hours- dependent on various parameters- for that moisture to evaporate off, and 2022 protocol changes make it much more likely that the balls don’t get enough time to evaporate off, causing them to fall victim to that slipperiness.  In particular, balls can feel perfectly good and ready while they still have some excess surface moisture and then quickly go to hell if the remaining evaporation is prevented inside the security-sealed bag.

It looks to me like MLB had a potential problem- substantial latent excess surface moisture being unable to evaporate and causing slipperiness- that prior to 2022 it was avoiding completely by chance or by following old procedures derived from lost knowledge.   In an attempt to standardize procedures, MLB accidentally made the excess surface moisture problem a reality, and not only that, did it in a way where the amount of excess surface moisture was highly variable.

The excess surface moisture when a ball gets to a pitcher depends on the amount of moisture initially absorbed, the airflow and humidity of the last-mile storage environment, and the amount of time spent in those environments and in the sealed bag.  None of those are standardized parts of the protocol, and it’s easy to see how there would be wide variability ball-to-ball and game-to-game.

Assuming this is actually what's happening, the fix is fairly easy.  Balls need to be mudded far enough in advance and stored afterwards in a way that they get sufficient airflow for long enough to reequilibrate (the exact minimum times depending on measurements done in real MLB facilities), but as an easy interim fix, going back to mudding the day before the game and leaving those balls in an open, uncovered box in the humidor overnight should be more than sufficient.  (And again, checking that on-site is pretty easy.)

Notes

I found (or didn’t find) some other things that I may as well list here as well along with some comments.

  1. These surface moisture changes don’t change the circumference of the baseball at all, down to 0.5mm precision, even after 8 hours.
  2. I took a ball that had stayed moisturized for 2 hours and put a 5-pound weight on it for an hour.  There was no visible distortion and the circumference was exactly the same as before along both seam axes (I oriented the pressure along one seam axis and perpendicular to the other).  To whatever extent flat-spotting is happening or happening more this season, I don’t see how it can be a last-mile cause, at least with my balls.  Dr. Wills has mentioned that the new balls seem uniquely bad at flat-spotting, so it’s not completely impossible that a moist new ball at the bottom of a bucket could deform under the weight, but I’d still be pretty surprised.
  3. The ball feels squishier to me after being/staying moisturized, and free pieces of leather from dissected balls are indisputably much squishier when equilibrated to higher humidity, but “feels squishier” isn’t a quantified measurement or an assessment of in-game impact.  The squishy-ball complaints may also be another symptom of unfinished evaporation.
  4. I have no idea if the surface squishiness in 3 affects the COR of the ball to a measurable degree.
  5. I have no idea if the excess moisture results in an increased drag coefficient.  We’re talking about changes to the surface, and my prior dissected-ball experiments showed that the laces love water and expand from it, so it’s at least in the realm of possibility.
  6. For the third time, this is a hypothesis.  I think it’s definitely one worth investigating since it’s supported by physical evidence, lines up with the protocol changes this year, and is easy enough to check with access to actual MLB facilities.  I’m confident in my findings as reported, but since I’m not using current balls or official mud, this mechanism could also turn out to have absolutely nothing to do with the 2022 game.

The 2022 MLB baseball

As of this writing on 4/25/2022, HRs are down, damage and distance on barrels are down, and both Alan Nathan and Rob Arthur have observed that the drag coefficient of baseballs this year is substantially increased.  This has led to speculation about what has changed with the 2022 balls and even about which production batch of balls, or mixture of batches, may be in use this year.  Given the kerfuffle last year that ended with MLB finally confirming that a mix of 2020 and 2021 balls was used during the season, that speculation is certainly reasonable.

It may well also turn out to be correct, and changes in the 2022 ball manufacture could certainly explain the current stats, but I think it’s worth noting that everything we’ve seen so far is ALSO consistent with “absolutely nothing changed with regard to ball manufacture/end product between 2021 and 2022” and “all or a vast majority of balls being used are from 2021 or 2022”.  

How is that possible?  Well, the 2021 baseball production was changed on purpose.  The new baseball was lighter, less dense, and less bouncy by design, or in more scientific terms, “dead”.  What if all we’re seeing now is the 2021 baseball specifications in their true glory, now untainted by the 2020 live balls that were mixed in last year?

Even without any change to the surface of the baseball, a lighter, less dense ball won't carry as far.  The drag force is independent of the mass (for a given size, which changed little if at all), and F = ma, so a constant force on a lower mass means a higher drag deceleration and less carry.
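A toy illustration of that point, using the standard drag formula: the same drag force on a lighter ball produces proportionally more deceleration. The Cd, diameter, air density, and the two masses below are round illustrative numbers, not measurements of the actual 2021/2022 balls.

```python
import math

def drag_decel(mass_kg: float, speed_ms: float, cd: float = 0.33,
               diameter_m: float = 0.0735, air_density: float = 1.2) -> float:
    """Drag deceleration a = F_drag / m = 0.5 * rho * Cd * A * v^2 / m (m/s^2)."""
    area = math.pi * (diameter_m / 2) ** 2
    return 0.5 * air_density * cd * area * speed_ms ** 2 / mass_kg

v = 44.7  # ~100 mph in m/s
print(drag_decel(0.145, v))  # nominal ~145 g ball
print(drag_decel(0.142, v))  # hypothetically lighter ball: same drag force, ~2% more deceleration
```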

The aforementioned measurements of the drag coefficient from Statcast data also *don’t measure the drag coefficient*.  They measure the drag *acceleration* and use an average baseball mass value to convert to the drag force (which is then used to get the drag coefficient).  If they’re using the same average mass for a now-lighter ball, they’re overestimating the drag force and the drag coefficient, and the drag coefficient may literally not have changed at all (while the drag acceleration did go up, per the previous paragraph).
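To make that concrete, here's a sketch of the back-out step: a drag acceleration is converted to a "drag coefficient" using an assumed ball mass, so a lighter-than-assumed ball inflates the inferred Cd by the mass ratio even if the true Cd is unchanged. The masses and Cd are illustrative values, not measurements.

```python
import math

def inferred_cd(drag_accel: float, assumed_mass_kg: float, speed_ms: float,
                diameter_m: float = 0.0735, air_density: float = 1.2) -> float:
    """Cd backed out from a measured drag *acceleration*: Cd = 2 * m * a / (rho * A * v^2)."""
    area = math.pi * (diameter_m / 2) ** 2
    return 2 * assumed_mass_kg * drag_accel / (air_density * area * speed_ms ** 2)

v, rho, d, true_cd = 44.7, 1.2, 0.0735, 0.33
area = math.pi * (d / 2) ** 2
a_true = 0.5 * rho * true_cd * area * v ** 2 / 0.142  # true deceleration of a 142 g ball
print(inferred_cd(a_true, 0.145, v))                   # ~0.337: inflated by 145/142 vs the true 0.33
```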

Furthermore, I looked at pitchers who threw at least 50 four-seam fastballs last year after July 1, 2021 (after the sticky stuff crackdown) and have also thrown at least 50 FFs in 2022.  This group is, on average, -0.35 MPH and +0.175 RPM on their pitches.  These stats usually move in the same direction, and a 1 MPH increase “should” increase spin by about 20 RPM.  So the group should have lost around 7 RPM from decreased velocity and actually wound up slightly positive instead.  It’s possible that the current baseball is just easier to spin based on surface characteristics, but it’s also possible that it’s easier to spin because it’s lighter and has less rotational inertia.  None of this is proof, and until we have results from experiments on actual game balls in the wild, we won’t have a great idea of the what or the why behind the drag acceleration being up. 
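The arithmetic behind that expectation, using the ~20 RPM per 1 MPH rule of thumb quoted above (group averages, nothing ball-specific):

```python
# Expected spin change from the velocity change alone vs what was actually observed.
rpm_per_mph = 20            # rule of thumb quoted above
delta_mph = -0.35           # average four-seam velocity change, post-crackdown 2021 -> 2022
observed_delta_rpm = 0.175  # average spin change for the same group

expected_delta_rpm = rpm_per_mph * delta_mph                # ~ -7 RPM expected from velocity alone
unexplained_spin = observed_delta_rpm - expected_delta_rpm  # ~ +7 RPM above expectation
print(expected_delta_rpm, round(unexplained_spin, 2))
```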

(It’s not (just) the humidor- drag acceleration is up even in parks that already had a humidor, and in places where a new humidor would add some mass, making the ball heavier is the exact opposite of what’s needed to match drag observations, although being in the humidor could have other effects as well)

New MTG:A Event Payouts

With the new Premier Play announcement, we also got two new constructed event payouts and a slightly reworked Traditional Draft.

This is the analysis of the EV of those events at various winrates (game winrate for Bo1, match winrate for Bo3).  I give the expected gem return, the expected number of packs, the expected number of play points, and two ROIs: one counting packs as 200 gems (store price), the other counting packs as 22.5 gems (if you already have all the cards).  These are ROIs for gem entry.  For gold entry, scale by whatever you'd otherwise use gold for: if you'd buy packs, multiply by 3/4; if you'd otherwise draft, the constructed event entries are at the same exchange rate.

Traditional Draft (1500 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 578 1.9 0.13 63.8% 41.4%
0.45 681 2.13 0.18 73.7% 48.6%
0.5 794 2.38 0.25 84.6% 56.5%
0.55 917 2.65 0.33 96.4% 65.1%
0.6 1050 2.94 0.43 109.3% 74.4%
0.65 1194 3.26 0.55 123.1% 84.5%
0.7 1348 3.6 0.69 137.9% 95.3%
0.75 1513 3.95 0.84 153.6% 106.8%

THIS DOES NOT INCLUDE ANY VALUE FROM THE CARDS TAKEN DURING THE DRAFT, which, if you value the cards, is a bit under 3 packs on average (no WC progress).
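For anyone who wants to poke at the numbers, here's a minimal sketch that reproduces the gem and pack columns above, assuming the reworked Traditional Draft is exactly three Bo3 matches with payouts of 100/250/1000/2500 gems and 1/1/3/6 packs for 0-3 wins. Those payouts are reverse-engineered to match the table, not copied from an official listing, and the play-points column isn't modeled.

```python
from math import comb

# Assumed (gems, packs) payouts by match wins in a fixed three-match Bo3 event.
PAYOUTS = {0: (100, 1), 1: (250, 1), 2: (1000, 3), 3: (2500, 6)}
ENTRY_GEMS = 1500

def traditional_draft_ev(p: float, pack_value: float):
    """Expected gems, packs, and ROI at match winrate p, valuing packs at pack_value gems."""
    gems = packs = 0.0
    for wins, (g, pk) in PAYOUTS.items():
        prob = comb(3, wins) * p ** wins * (1 - p) ** (3 - wins)
        gems += prob * g
        packs += prob * pk
    roi = (gems + packs * pack_value) / ENTRY_GEMS
    return gems, packs, roi

for p in (0.5, 0.7):
    g, pk, roi = traditional_draft_ev(p, 200)
    print(f"p={p}: {g:.0f} gems, {pk:.2f} packs, ROI {roi:.1%}")  # matches the 0.5 and 0.7 rows above
```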

Bo1 Constructed Event (375 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 129 0.65 0.03 68.7% 38.2%
0.45 160 0.81 0.05 85.9% 47.4%
0.5 195 1 0.09 105.6% 58.1%
0.55 235 1.22 0.15 128.0% 70.0%
0.6 278 1.47 0.23 152.7% 83.0%
0.65 323 1.74 0.34 179.0% 96.5%
0.7 367 2.03 0.46 205.9% 109.9%
0.75 408 2.31 0.6 231.8% 122.7%

Bo3 Constructed Event (750 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 292 1.67 0.04 83.5% 43.9%
0.45 348 1.76 0.07 93.3% 51.6%
0.5 408 1.84 0.125 103.5% 59.9%
0.55 471 1.92 0.2 113.9% 68.5%
0.6 535 1.99 0.31 124.4% 77.3%
0.65 600 2.06 0.464 135.0% 86.2%
0.7 664 2.14 0.67 145.6% 95.0%
0.75 727 2.22 0.95 156.1% 103.5%