About MLB’s New Mudding and Storage Protocol

My prior research on the slippery ball problem: Baseball’s Last Mile Problem

The TL;DR is that mudding adds moisture to the surface of the ball.  Under normal conditions (i.e. stored with free airflow where it was stored before mudding), that moisture evaporates off in a few hours and leaves a good ball.  If that evaporation is stopped, the ball goes to complete hell and becomes more slippery than a new ball.  This is not fixed by time in free airflow afterwards.

My hypothesis is that the balls were sometimes getting stored in environments with sufficiently restricted airflow (the nylon ball bag) too soon after mudding, and that stopped the evaporation.  This only became a problem this season with the change to mudding all balls on gameday and storing them in a zipped nylon bag before the game.

MLB released a new memo yesterday that attempts to standardize the mudding and storage procedure.  Of the five bullet points, one (AFAIK) is not a change.  Balls were already supposed to sit in the humidor for at least 14 days.  Attempting to standardize the application procedure and providing a poster with allowable darkness/lightness levels are obviously good things.  It may be relevant here if the only problem balls were the muddiest (aka wettest) which shouldn’t happen anymore, but from anecdotal reports, there were problem balls where players didn’t think the balls were even mudded at all, and unless they’re blind, that seems hard to reconcile with also being too dark/too heavily mudded.  So this may help some balls, but probably not all of them.

Gameday Mudding

The other points are more interesting.  Requiring all balls to be mudded within 3 hours of each other could be good or bad.  If it eliminates stragglers getting mudded late, this is good.  If it pushes all mudding closer to gametime, this is bad.  Either way, unless MLB knows something I don’t (which is certainly possible- they’re a business worth billions and I’m one guy doing science in my kitchen), the whole gameday mudding thing makes *absolutely no sense* to me at all in any way.

Pre-mudding, all balls everywhere** are all equilibrated in the humidor the same way.  Post-mudding, the surface is disrupted with transient excess moisture.  If you want the balls restandardized for the game, then YOU MAKE SURE YOU GIVE THE BALL SURFACE TIME AFTER MUDDING TO REEQUILIBRATE TO A STANDARD ENVIRONMENT BEFORE DOING ANYTHING ELSE WITH THE BALL. And that takes hours.

In a world without universal humidors, gameday mudding might make sense since later storage could be widely divergent.  Now, it’s exactly the same everywhere**.  Unless MLB has evidence that a mudded ball sitting overnight in the humidor goes to hell (and I tested and found no evidence for that at all, but obviously my testing at home isn’t world-class- also, if it’s a problem, it should have shown up frequently in humidor parks before this season), I have no idea why you would mud on gameday instead of the day before like it was done last season.  The evaporation time between mudding and going in  the nylon bag for the game might not be long enough if mudding is done on gameday, but mudding the day before means it definitely is.

Ball Bag Changes

Cleaning the ball bag seems like it can’t hurt anything, but I’m also not sure it helps anything. I’m guessing that ball bag hygiene over all levels of the sport and prior seasons of MLB was generally pretty bad, yet somehow it was never a problem.  They’ve seen the bottom of the bags though.  I haven’t. If there’s something going on there, I’d expect it to be a symptom of something else and not a primary problem.

Limiting to 96 balls per bag is also kind of strange.  If there is something real about the bottom of the bag effect, I’d expect it to be *the bottom of the bag effect*.  As long as the number of balls is sufficient to require a tall stack in the bag (and 96 still is), and since compression at these number ranges doesn’t seem relevant (prior research post), I don’t have a physical model of what could be going on that would make much difference for being ball 120 of 120 vs. ball 96 of 96.  Also, if the bottom of the bag effect really is a primary problem this year, why wasn’t it a problem in the past?  Unless they’re using entirely new types of bags this season, which I haven’t seen mentioned, we should have seen it before.  But I’m theorizing and they may have been testing, so treat that paragraph with an appropriate level of skepticism.

Also, since MLB uses more than 96 balls on average in a game, this means that balls will need to come from multiple batches.  This seemed like it had the potential to be significantly bad (late-inning balls being stored in a different bag for much longer), but according to an AP report on the memo

“In an effort to reduce time in ball bags, balls are to be taken from the humidor 15-30 minutes before the scheduled start, and then no more than 96 balls at a time.  When needed, up to 96 more balls may be taken from the humidor, and they should not be mixed in bags with balls from the earlier bunch.”

This seems generally like a step in the smart direction, like they’d identified being zipped up in the bag as a potential problem (or gotten the idea from reading my previous post from 30 days ago :)).  I don’t know if it’s a sufficient mitigation because I don’t know exactly how long it takes for the balls to go to hell (60 minutes in near airtight made them complete garbage, so damage certainly appears in less time, but I don’t know how fast and can’t quickly test that).  And again, repeating the mantra from before, time spent in the ball bag *is only an issue if the balls haven’t evaporated off after mudding*.  And that problem is slam-dunk guaranteed solvable by mudding the day before, and then this whole section would be irrelevant.

Box Storage

The final point, “all balls should be placed back in the Rawlings boxes with dividers, and the boxes should then be placed in the humidor. In the past, balls were allowed to go directly into the humidor.” could be either extremely important or absolutely nothing.  This doesn’t say whether the boxes should be open or closed (have the box top on) in the humidor.  I tweeted to the ESPN writer and didn’t get an answer.

The boxes can be seen in the two images in https://www.mlb.com/news/rockies-humidor-stories.  If they’re open (and not stacked or otherwise covered to restrict airflow), this is fine and at least as good as whatever was done before today.  If the boxes are closed, it could be a real problem.  Like the nylon ball bag, this is also a restricted-flow environment, and unlike the nylon ball bag, some balls will *definitely* get in the box before they’ve had time to evaporate off (since they go in shortly after mudding)

I have one Rawlings box without all the dividers.  The box isn’t airtight, but it’s hugely restricted airflow.  I put 3 moistened balls in the box along with a hygrometer and the RH increased 5% and the balls lost moisture about half as fast as they did in free air.  The box itself absorbed no relevant amount.  With 6 moistened balls in the box, the RH increased 7% (the maximum moistened balls in a confined space will do per prior research) and they lost moisture between 1/3 and 1/4 as fast as in free air.

Unlike the experiments in the previous post where the balls were literally sealed, there is still some moisture flux off the surface here.  I don’t know if it’s enough to stop the balls from going to hell.  It would take me weeks to get unmudded equilibrated balls to actually do mudding test runs in a closed box, and I only found out about this change yesterday with everybody else.  Even if the flux is still sufficient to avoid the balls going to hell directly, the evaporation time appears to be lengthened significantly, and that means that balls are more likely to make it into the closed nylon bag before they’ve evaporated off, which could also cause problems at that point (if there’s still enough time for problems there- see previous section).

The 3 and 6 ball experiments are one run each, in my ball box, which may have a better or worse seal than the average Rawlings box, and the dividers may matter (although they don’t seem to absorb very much moisture from the air, prior post), etc.  Error bars are fairly wide on the relative rates of evaporation, but hygrometer don’t lie.  There doesn’t seem to be any way a closed box isn’t measurably restricting airflow and increasing humidity inside unless the box design changed a lot in the last 3 years.  Maybe that humidity increase/restricted airflow isn’t enough to matter directly or indirectly, but it’s a complete negative freeroll.  Nothing good can come of it.  Bad things might.  If there are reports somewhere this week that tons of balls were garbage, closed-box storage after mudding is the likely culprit.  Or the instructions will actually be uncovered open box (and obeyed) and the last 5 paragraphs will be completely irrelevant.  That would be good.

Conclusion: A few of the changes are obviously common-sense good.  Gameday mudding continues to make no sense to me and looks like it’s just asking for trouble.  Box storage in the humidor after mudding, if the boxes are closed, may be introducing a new problem. It’s unclear to me if the new ball-bag procedures reduce time sufficiently to prevent restricted-airflow problems from arising there, although it’s at least clearly a considered attempt to mitigate a potential problem.

Baseball’s Last Mile Problem

2022 has brought a constant barrage of players criticizing the baseballs as hard to grip and wildly inconsistent from inning to inning, and probably not coincidentally, a spike in throwing error rates to boot.  “Can’t get a grip” and “throwing error” do seem like they might go together.  MLB has denied any change in the manufacturing process, however there have been changes this season in how balls are handled in the stadium, and I believe that is likely to be the culprit.

I have a plausible explanation for how the new ball-handling protocol can cause even identical balls from identical humidors to turn out wildly different on the field, and it’s backed up by experiments and measurements I’ve done on several balls I have, but until those experiments can be repeated at an actual MLB facility (hint, hint), this is still just a hypothesis, albeit a pretty good one IMO.

Throwing Errors

First, to quantify the throwing errors, I used Throwing Errors + Assists as a proxy for attempted throws (it doesn’t count throws that are accurate but late, etc), and broke down TE/(TE+A) by infield position.

TE/(TE+A) 2022 2021 2011-20 max 21-22 Increase 2022 By Chance
C 9.70% 7.10% 9.19% 36.5% 1.9%
3B 3.61% 2.72% 3.16% 32.7% 0.8%
SS 2.20% 2.17% 2.21% 1.5% 46.9%
2B 1.40% 1.20% 1.36% 15.9% 20.1%

By Chance is the binomial odds of getting the 2022 rate or worse using 2021 as the true odds.  Not only are throwing errors per “opportunity” up over 2021, but they’re higher than every single season in the 10 years before that as well, and way higher for C and 3B.   C and 3B have the least time on average to establish a grip before throwing.  This would be interesting even without players complaining left and right about the grip.

The Last Mile

To explain what I suspect is causing this, I need to break down the baseball supply chain.  Baseballs are manufactured in a Rawlings factory, stored in conditions that, to the best of my knowledge, have never been made public, shipped to teams, sometimes stored again in unknown conditions outside a humidor, stored in a humidor for at least 2 weeks, and then prepared and used in a game.  Borrowing the term from telecommunications and delivery logistics, we’ll call everything after the 2+ weeks in the humidor the last mile.

Humidors were in use in 9 parks last year, and Meredith Wills has found that many of the balls this year are from the same batches as balls in 2021.  So we have literally some of the same balls in literally the same humidors, and there were no widespread grip complaints (or equivalent throwing error rates) in 2021.  This makes it rather likely that the difference, assuming there really is one, is occurring somewhere in the last mile.

The last mile starts with a baseball that has just spent 2+ weeks in the humidor.  That is long enough to equilibrate, per https://tht.fangraphs.com/the-physics-of-cheating-baseballs-humidors/, other prior published research, and my own past experiments.  Getting atmospheric humidity changes to meaningfully affect the core of a baseball takes on the order of days to weeks.  That means that nothing humidity-wise in the last mile has any meaningful impact on the ball’s core because there’s not enough time for that to happen.

This article from the San Francisco Chronicle details how balls are prepared for a game after being removed from the humidor, and since that’s paywalled, a rough outline is:

  1. Removed from humidor at some point on gameday
  2. Rubbed with mud/water to remove gloss
  3. Reviewed by umpires
  4. Not kept out of the humidor for more than 2 hours
  5. Put in a security-sealed bag that’s only opened in the dugout when needed

While I don’t have 2022 game balls or official mud, I do have some 2019* balls, water, and dirt, so I decided to do some science at home.  Again, while I have confidence in my experiments done with my balls and my dirt, these aren’t exactly the same things being used in MLB, so it’s possible that what I found isn’t relevant to the 2022 questions.

Update: Dr. Wills informed that that 2019, and only 2019, had a production issue that resulted in squashed leather and could have affected the mudding results.  She checked my batch code, and it looks like my balls were made late enough in 2019 that they were actually used in 2020 with the non-problematic production method.  Yay.

Experiments With Water

When small amounts of water are rubbed on the surface of a ball, it absorbs pretty readily (the leather and laces love water), and once the external source of water is removed, that creates a situation where the outer edge of the ball is more moisture-rich than what’s slightly further inside and more moisture-rich than the atmosphere.  The water isn’t going to just stay there- it’s either going to evaporate off or start going slightly deeper into the leather as well.

As it turns out, if the baseball is rubbed with water and then stored with unrestricted air access (and no powered airflow) in the environment it was equilibrated with, the water entirely evaporates off fairly quickly with an excess-water half-life of a little over an hour (and this would likely be lower with powered air circulation) and goes right back to its pre-rub weight down to 0.01g precision.  So after a few hours, assuming you only added a reasonable amount of water to the surface (I was approaching 0.75 grams added at the most) and didn’t submerge the ball in a toilet or something ridiculous, you’d never know anything had happened.  These surface moisture changes are MUCH faster than the days-to-weeks timescales of core moisture changes.

Things get much more interesting if the ball is then kept in a higher-humidity environment.  I rubbed a ball down, wiped it with a paper towel, let it sit for a couple of minutes to deal with any surface droplets I missed, and then sealed the ball in a sandwich bag for 2 hours along with a battery-powered portable hygrometer.  I expected the ball to completely saturate the air while losing less mass than I could measure (<0.01g) in the process, but that’s not what actually happened.  The relative humidity in the bag only went up 7%, and as expected, the ball lost no measurable amount of mass.  After taking it out, it started losing mass with a slightly longer half-life than before and lost all the excess water in a few hours.

I repeated the experiment except this time I sealed the ball and the hygrometer in an otherwise empty 5-gallon pail.  Again, the relative humidity only went up 7%, and the ball lost 0.04 grams of mass.  I calculated that 0.02g of evaporation should have been sufficient to cause that humidity change, so I’m not exactly sure what happened- maybe 0.01 was measurement error (the scale I was using goes to 0.01g), maybe my seal wasn’t perfectly airtight, maybe the crud on the lid I couldn’t clean off or the pail itself absorbed a little moisture).  But the ball had 0.5g of excess water to lose (which it did completely lose after removal from the pail, as expected) and only lost 0.04g in the pail, so the basic idea is still the same.

This means that if the wet ball has restricted airflow, it’s going to take for freaking ever to reequilibrate (because it only takes a trivial amount of moisture loss to “saturate” a good-sized storage space), and that if it’s in a sealed environment or in a free-airflow environment more than 7% RH above what it’s equilibrated to, the excess moisture will travel inward to more of the leather instead of evaporating off (and eventually the entire ball would equilibrate to the higher-RH environment, but we’re only concerned with the high-RH environment as a temporary last-mile storage condition here, so that won’t happen on our timescales).

I also ran the experiment sealing the rubbed ball and the hygrometer in a sandwich bag overnight for 8 hours.  The half-life for losing moisture after that was around 2.5 hours, up from the 70 minutes when it was never sealed.  This confirms that the excess moisture doesn’t just sit around at the surface waiting if it can’t immediately evaporate, but that evaporation dominates when possible.

I also ran the experiment with a ball sealed in a sandwich bag for 2 hours along with an equilibrated cardboard divider that came with the box of balls I have.  That didn’t make much difference. The cardboard only absorbed 0.04g of the ~0.5g excess moisture in that time period, and that’s with a higher cardboard:ball ratio than a box actually comes with.  Equilibrated cardboard can’t substitute for free airflow on the timescale of a couple of hours.

Experiments With Mud

I mixed dirt and water to make my own mud and rubbed it in doing my best imitation of videos I could find, rubbing until the surface of the ball felt dry again.  Since I don’t have any kind of instrument to measure slickness, these are my perceptions plus those of my significant other.  We were in almost full agreement on every ball, and the one disagreement converged on the next measurement 30 minutes later.

If stored with unrestricted airflow in the environment it was equilibrated to, this led to roughly the following timeline:

  1. t=0, mudded, ball surface feels dry
  2. t= 30 minutes, ball surface feels moist and is worse than when it was first mudded.
  3. t=60 minutes, ball surface is drier and is similar in grip to when first mudded.
  4. t=90 minutes, ball is significantly better than when first mudded
  5. t=120 minutes, no noticeable change from t=90 minutes.
  6. T=12 hours, no noticeable change from t=120 minutes

I tested a couple of other things as well

  1. I took a 12-hour ball, put it in a 75% RH environment for an hour and then a 100% RH environment for 30 minutes, and it didn’t matter.  The ball surface was still fine.  The ball would certainly go to hell eventually under those conditions, but it doesn’t seem likely to be a concern with anything resembling current protocols.  I also stuck one in a bag for awhile and it didn’t affect the surface or change the RH at all, as expected since all of the excess moisture was already gone.
  2. I mudded a ball, let it sit out for 15 minutes, and then sealed it in a sandwich bag.  This ball was slippery at every time interval, 1 hour, 2 hours, 12 hours. (repeated twice).  Interestingly, putting the ball back in its normal environment for over 24 hours didn’t help much and it was still quite slippery.  Even with all the excess moisture gone, whatever had happened to the surface while bagged had ruined the ball.
  3. I mudded a ball, let it sit out for 2 hours, at which point the surface was quite good per the timeline above, and then sealed it in a bag.  THE RH WENT UP AND THE BALL TURNED SLIPPERY, WORSE THAN WHEN IT WAS FIRST MUDDED. (repeated 3x).  Like #2, time in the normal environment afterwards didn’t help.  Keeping the ball in its proper environment for 2 hours, sealing it for an hour, and then letting it out again was enough to ruin the ball.

That’s really important IMO.  We know from the water experiments that it takes more than 2 hours to lose the excess moisture under my storage conditions, and it looks like the combination of fresh(ish) mud plus excess surface moisture that can’t evaporate off is a really bad combo and a recipe for slippery balls.  Ball surfaces can feel perfectly good and game-ready while they still have some excess moisture left and then go to complete shit, apparently permanently, in under an hour if the evaporation isn’t allowed to finish.

Could this be the cause of the throwing errors and reported grip problems? Well…

2022 Last-Mile Protocol Changes

The first change for 2022 is that balls must be rubbed with mud on gameday, meaning they’re always taking on that surface moisture on gameday.  In 2021, balls had to be mudded at least 24 hours in advance of the game, and while 2021 changed the window to 1-2 days in advance, the window used to be up to 5 days in advance of the game.  I don’t know how far in advance they were regularly mudded before 2021, but even early afternoon for a night game would be fine assuming the afternoon storage had reasonable airflow.

The second change is that they’re put back in the humidor fairly quickly after being mudded and allowed a maximum of 2 hours out of the humidor.  While I don’t think there’s anything inherently wrong with putting the balls back in the humidor after mudding (unless it’s something specific to 2022 balls), humidors look something like this.  If the balls are kept in a closed box, or an open box with another box right on top of them, there’s little chance that they reequilibrate in time.  If they’re kept in an open box on a middle shelf without much room above, unless the air is really whipping around in there, the excess moisture half-life should increase.

There’s also a chance that something could go wrong if the balls are taken out of the humidor, kept in a wildly different environment for an hour, then mudded and put back in the humidor, but I haven’t investigated that, and there are many possible combinations of both humidity and temperature that would need to be checked for problems.

The third change (at least I think it’s a change) is that the balls are kept in a sealed bag- at least highly restricted flow, possibly almost airtight- until opened in the dugout.  Even if it’s not a change, it’s still extremely relevant- sealing balls that have evaporated their excess moisture off doesn’t affect anything, while sealing balls that haven’t finished evaporating off seems to be a disaster.


Mudding adds excess moisture to the surface of the ball, and if its evaporation is prevented for very long- either through restricted airflow or storage in too humid an environment- the surface of the ball becomes much more slippery and stays that way even if evaporation continues later.  It takes hours- dependent on various parameters- for that moisture to evaporate off, and 2022 protocol changes make it much more likely that the balls don’t get enough time to evaporate off, causing them to fall victim to that slipperiness.  In particular, balls can feel perfectly good and ready while they still have some excess surface moisture and then quickly go to hell if the remaining evaporation is prevented inside the security-sealed bag.

It looks to me like MLB had a potential problem- substantial latent excess surface moisture being unable to evaporate and causing slipperiness- that prior to 2022 it was avoiding completely by chance or by following old procedures derived from lost knowledge.   In an attempt to standardize procedures, MLB accidentally made the excess surface moisture problem a reality, and not only that, did it in a way where the amount of excess surface moisture was highly variable.

The excess surface moisture when a ball gets to a pitcher depends on the amount of moisture initially absorbed, the airflow and humidity of the last-mile storage environment, and the amount of time spent in those environments and in the sealed bag.  None of those are standardized parts of the protocol, and it’s easy to see how there would be wide variability ball-to-ball and game-to-game.

Assuming this is actually what’s happening, the fix is fairly easy.  Balls need to be mudded far enough in advance and stored afterwards in a way that they get sufficient airflow for long enough to reequilibrate (the exact minimum times depending on measurements done in real MLB facilities), but as an easy interim fix, going back to mudding the day before the game and leaving those balls in an open uncovered box in the humidor overnight should be more than sufficient.  (and again, checking that on-site is pretty easy)


I found (or didn’t find) some other things that I may as well list here as well along with some comments.

  1. These surface moisture changes don’t change the circumference of the baseball at all, down to 0.5mm precision, even after 8 hours.
  2. I took a ball that had stayed moisturized for 2 hours and put a 5-pound weight on it for an hour.  There was no visible distortion and the circumference was exactly the same as before along both seam axes (I oriented the pressure along one seam axis and perpendicular to the other).  To whatever extent flat-spotting is happening or happening more this season, I don’t see how it can be a last-mile cause, at least with my balls.  Dr. Wills has mentioned that the new balls seem uniquely bad at flat-spotting, so it’s not completely impossible that a moist new ball at the bottom of a bucket could deform under the weight, but I’d still be pretty surprised.
  3. The ball feels squishier to me after being/staying moisturized, and free pieces of leather from dissected balls are indisputably much squishier when equilibrated to higher humidity, but “feels squishier” isn’t a quantified measurement or an assessment of in-game impact.  The squishy-ball complaints may also be another symptom of unfinished evaporation.
  4. I have no idea if the surface squishiness in 3 affects the COR of the ball to a measurable degree.
  5. I have no idea if the excess moisture results in an increased drag coefficient.  We’re talking about changes to the surface, and my prior dissected-ball experiments showed that the laces love water and expand from it, so it’s at least in the realm of possibility.
  6. For the third time, this is a hypothesis.  I think it’s definitely one worth investigating since it’s supported by physical evidence, lines up with the protocol changes this year, and is easy enough to check with access to actual MLB facilities.  I’m confident in my findings as reported, but since I’m not using current balls or official mud, this mechanism could also turn out to have absolutely nothing to do with the 2022 game.

The 2022 MLB baseball

As of this writing on 4/25/2022, HRs are down, damage and distance on barrels are down, and both Alan Nathan and Rob Arthur have observed that the drag coefficient of baseballs this year is substantially increased.  This has led to speculation about what has changed with the 2022 balls and even what production batch of balls or mixture of batches may be in use this year.  Given the kerfluffle last year that resulted in MLB finally confirming that a mix of 2020 and 2021 balls were used during the season, that speculation is certainly reasonable.

It may well also turn out to be correct, and changes in the 2022 ball manufacture could certainly explain the current stats, but I think it’s worth noting that everything we’ve seen so far is ALSO consistent with “absolutely nothing changed with regard to ball manufacture/end product between 2021 and 2022” and “all or a vast majority of balls being used are from 2021 or 2022”.  

How is that possible?  Well, the 2021 baseball production was changed on purpose.  The new baseball was lighter, less dense, and less bouncy by design, or in more scientific terms, “dead”.  What if all we’re seeing now is the 2021 baseball specifications in their true glory, now untainted by the 2020 live balls that were mixed in last year?

Even without any change to the surface of the baseball, a lighter, less dense ball won’t carry as far.  The drag force is independent of the mass (for a given size, which changed less if at all), and F=MA, so a constant force and a lower mass means a higher drag deceleration and less carry.

The aforementioned measurements of the drag coefficient from Statcast data also *don’t measure the drag coefficient*.  They measure the drag *acceleration* and use an average baseball mass value to convert to the drag force (which is then used to get the drag coefficient).  If they’re using the same average mass for a now-lighter ball, they’re overestimating the drag force and the drag coefficient, and the drag coefficient may literally not have changed at all (while the drag acceleration did go up, per the previous paragraph).

Furthermore, I looked at pitchers who threw at least 50 four-seam fastballs last year after July 1, 2021 (after the sticky stuff crackdown) and have also thrown at least 50 FFs in 2022.  This group is, on average, -0.35 MPH and +0.175 RPM on their pitches.  These stats usually move in the same direction, and a 1 MPH increase “should” increase spin by about 20 RPM.  So the group should have lost around 7 RPM from decreased velocity and actually wound up slightly positive instead.  It’s possible that the current baseball is just easier to spin based on surface characteristics, but it’s also possible that it’s easier to spin because it’s lighter and has less rotational inertia.  None of this is proof, and until we have results from experiments on actual game balls in the wild, we won’t have a great idea of the what or the why behind the drag acceleration being up. 

(It’s not (just) the humidor- drag acceleration is up even in parks that already had a humidor, and in places where a new humidor would add some mass, making the ball heavier is the exact opposite of what’s needed to match drag observations, although being in the humidor could have other effects as well)

Missing the forest for.. the forest

The paper  A Random Forest approach to identify metrics that best predict match outcome and player ranking in the esport Rocket League got published yesterday (9/29/2021), and for a Cliff’s Notes version, it did two things:  1) Looked at 1-game statistics to predict that game’s winner and/or goal differential, and 2) Looked at 1-game statistics across several rank (MMR/ELO) stratifications to attempt to classify players into the correct rank based on those stats.  The overarching theme of the paper was to identify specific areas that players could focus their training on to improve results.

For part 1, that largely involves finding “winner things” and “loser things” and the implicit assumption that choosing to do more winner things and fewer loser things will increase performance.  That runs into the giant “correlation isn’t causation” issue.  While the specific Rocket League details aren’t important, this kind of analysis will identify second-half QB kneeldowns as a huge winner move and having an empty net with a minute left in an NHL game as a huge loser move.  Treating these as strategic directives- having your QB kneel more or refusing to pull your goalie ever- would be actively terrible and harm your chances of winning.

Those examples are so obviously ridiculous that nobody would ever take them seriously, but when the metrics don’t capture losing endgames as precisely, they can be even *more* dangerous, telling a story that’s incorrect for the same fundamental reason, but one that’s plausible enough to be believed.  A common example is outrushing your opponent in the NFL being correlated to winning.  We’ve seen Derrick Henry or Marshawn Lynch completely dump truck opposing defenses, and when somebody talks about outrushing leading to wins, it’s easy to think of instances like that and agree.  In reality, leading teams run more and trailing teams run less, and the “signal” is much, much more from capturing leading/trailing behavior than from Marshawn going full beast mode sometimes.

If you don’t apply subject-matter knowledge to your data exploration, you’ll effectively ask bad questions that get answered by “what a losing game looks like” and not “what (actionable) choices led to losing”.  That’s all well-known, if worth restating occasionally.

The more interesting part begins with the second objective.  While the particular skills don’t matter, trust me that the difference in car control between top players and Diamond-ranked players is on the order of watching Simone Biles do a floor routine and watching me trip over my cat.  Both involve tumbling, and that’s about where the similarity ends.

The paper identifies various mechanics and identifies rank pretty well based on those.  What’s interesting is that while they can use those mechanics to tell a Diamond from a Bronze, when they tried to use those mechanics to predict the outcome of a game, they all graded out as basically worthless.  While some may have suffered from adverse selection (something you do less when you’re winning), they had a pretty good selection of mechanics and they ALL sucked at predicting the winner.  And, yet, beyond absolutely any doubt, the higher rank stratifications are much better at them than the lower-rank ones.  WTF? How can that be?

The answer is in a sample constructed in a particularly pathological way, and it’s one that will be common among esports data sets for the foreseeable future.  All of the matches are contested between players of approximately equal overall skill.  The sample contains no games of Diamonds stomping Bronzes or getting crushed by Grand Champs.

The players in each match have different abilities at each of the mechanics, but the overall package always grades out similarly given that they have close enough MMR to get paired up.  So if Player A is significantly stronger than player B at mechanic A to the point you’d expect it to show up, ceteris paribus, as a large winrate effect, A almost tautologically has to be worse at the other aspects, otherwise A would be significantly higher-rated than B and the pairing algorithm would have excluded that match from the sample.  So the analysis comes to the conclusion that being better at mechanic A doesn’t predict winning a game.  If the sample contained comparable numbers of cross-rank matches, all of the important mechanics would obviously be huge predictors of game winner/loser.

The sample being pathologically constructed led to the profoundly incorrect conclusion

Taken together, higher rank players show better control over the movement of their car and are able to play a greater proportion of their matches at high speed.  However, within rank-matched matches, this does not predict match outcome.Therefore, our findings suggest that while focussing on game speed and car movement may not provide immediate benefit to the outcome within matches, these PIs are important to develop as they may facilitate one’s improvement in overall expertise over time.

even though adding or subtracting a particular ability from a player would matter *immediately*.  The idea that you can work on mechanics to improve overall expertise (AKA achieving a significantly higher MMR) WITHOUT IT MANIFESTING IN MATCH RESULTS, WHICH IS WHERE MMR COMES FROM, is.. interesting.  It’s trying to take two obviously true statements (Higher-ranked players play faster and with more control- quantified in the paper. Playing faster and with more control makes you better- self-evident to anybody who knows RL at all) and shoehorn a finding between them that obviously doesn’t comport.

This kind of mistake will occur over and over and over when data sets comprised of narrow-band matchmaking are analysed that way.

(It’s basically the same mistake as thinking that velocity doesn’t matter for mediocre MLB pitchers- it doesn’t correlate to a lower ERA among that group, but any individuals gaining velocity will improve ERA on average)


The hidden benefit of pulling the ball

Everything else about the opportunity being equal, corner OFs have a significantly harder time catching pulled balls  than they do catching opposite-field balls.  In this piece, I’ll demonstrate that the effect actually exists, try to quantify it in a useful way, and give a testable take on what I think is causing it.

Looking at all balls with a catch probability >0 and <0.99 (the Statcast cutoff for absolutely routine fly balls), corner OF out rates underperform catch probability by 0.028 on pulled balls relative to oppo balls.

(For the non-baseball readers, position 7 is left field, 8 is center field, 9 is right field, and a pulled ball is a right-handed batter hitting a ball to left field or a LHB hitting a ball to right field.  Oppo is “opposite field”, RHB hitting the ball to right field, etc.)

Stands Pos Catch Prob Out Rate Difference N
L 7 0.859 0.844 -0.015 14318
R 7 0.807 0.765 -0.042 11380
L 8 0.843 0.852 0.009 14099
R 8 0.846 0.859 0.013 19579
R 9 0.857 0.853 -0.004 19271
L 9 0.797 0.763 -0.033 8098

The joint standard deviation for each L-R difference, given those Ns, is about 0.005, so .028+/- 0.005, symmetric in both fields, is certainly interesting.  Rerunning the numbers on more competitive plays, 0.20<catch probability<0.80

Stands Pos Catch Prob Out Rate Difference N
L 7 0.559 0.525 -0.034 2584
R 7 0.536 0.407 -0.129 2383
L 9 0.533 0.418 -0.116 1743
R 9 0.553 0.549 -0.005 3525

Now we see a much more pronounced difference, .095 in LF and .111 in RF (+/- ~.014).  The difference is only about .01 on plays between .8 and .99, so whatever’s going on appears to be manifesting itself clearly on competitive plays while being much less relevant to easier plays.

Using competitive plays also allows a verification that is (mostly) independent of Statcast’s catch probability.  According to this Tango blog post, catch probability changes are roughly linear to time or distance changes in the sweet spot at a rate of 0.1s=10% out rate and 1 foot = 4% out rate.  By grouping roughly similar balls and using those conversions, we can see how robust this finding is.  Using 0.2<=CP=0.8, back=0, and binning by hang time in 0.5s increments, we can create buckets of almost identical opportunities.  For RF, it looks like

Stands Hang Time Bin Avg Hang Time Avg Distance N
L 2.5-3.0 2.881 30.788 126
R 2.5-3.0 2.857 29.925 242
L 3.0-3.5 3.268 41.167 417
R 3.0-3.5 3.256 40.765 519
L 3.5-4.0 3.741 55.234 441
R 3.5-4.0 3.741 55.246 500
L 4.0-4.5 4.248 69.408 491
R 4.0-4.5 4.237 68.819 380
L 4.5-5.0 4.727 81.487 377
R 4.5-5.0 4.714 81.741 204
L 5.0-5.5 5.216 93.649 206
R 5.0-5.5 5.209 93.830 108

If there’s truly a 10% gap, it should easily show up in these bins.

Hang Time to LF Raw Difference Corrected Difference Catch Prob Difference SD
2.5-3.0 0.099 0.104 -0.010 0.055
3.0-3.5 0.062 0.059 -0.003 0.033
3.5-4.0 0.107 0.100 0.013 0.032
4.0-4.5 0.121 0.128 0.026 0.033
4.5-5.0 0.131 0.100 0.033 0.042
5.0-5.5 0.080 0.057 0.023 0.059
Hang Time to RF Raw Difference Corrected Difference Catch Prob Difference SD
2.5-3.0 0.065 0.096 -0.063 0.057
3.0-3.5 0.123 0.130 -0.023 0.032
3.5-4.0 0.169 0.149 0.033 0.032
4.0-4.5 0.096 0.093 0.020 0.035
4.5-5.0 0.256 0.261 0.021 0.044
5.0-5.5 0.168 0.163 0.044 0.063

and it does.  Whatever is going on is clearly not just an artifact of the catch probability algorithm.  It’s a real difference in catching balls.  This also means that I’m safe using catch probability to compare performance and that I don’t have to do the whole bin-and-correct thing any more in this post.

Now we’re on to the hypothesis-testing portion of the post.  I’d used the back=0 filter to avoid potentially Simpson’s Paradoxing myself, so how does the finding hold up with back=1 & wall=0?

Stands Pos Catch Prob Out Rate Difference N
R 7 0.541 0.491 -0.051 265
L 7 0.570 0.631 0.061 333
R 9 0.564 0.634 0.071 481
L 9 0.546 0.505 -0.042 224

.11x L-R difference in both fields.  Nothing new there.

In theory, corner OFs could be particularly bad at playing hooks or particularly good at playing slices.  If that’s true, then the balls with more sideways movement should be quite different than the balls with less sideways movement.  I made an estimation of the sideways acceleration in flight based on hang time, launch spray angle, and landing position and split balls into high and low acceleration (slices have more sideways acceleration than hooks on average, so this is comparing high slice to low slice, high hook to low hook).

Batted Spin Stands Pos Catch Prob Out Rate Difference N
Lots of Slice L 7 0.552 0.507 -0.045 1387
Low Slice L 7 0.577 0.545 -0.032 617
Lots of Hook R 7 0.528 0.409 -0.119 1166
Low Hook R 7 0.553 0.402 -0.151 828
Lots of Slice R 9 0.540 0.548 0.007 1894
Low Slice R 9 0.580 0.539 -0.041 972
Lots of Hook L 9 0.526 0.425 -0.101 850
Low Hook L 9 0.546 0.389 -0.157 579

And there’s not much to see there.  Corner OF play low-acceleration balls worse, but on average those are balls towards the gap and somewhat longer runs, and the out rate difference is somewhat-to-mostly explained by corner OF’s lower speed getting exposed over a longer run.  Regardless, nothing even close to explaining away our handedness effect.

Perhaps pull and oppo balls come from different pitch mixes and there’s something about the balls hit off different pitches.

Pitch Type Stands Pos Catch Prob Out Rate Difference N
FF L 7 0.552 0.531 -0.021 904
FF R 7 0.536 0.428 -0.109 568
FF L 9 0.527 0.434 -0.092 472
FF R 9 0.556 0.552 -0.004 1273
FT/SI L 7 0.559 0.533 -0.026 548
FT/SI R 7 0.533 0.461 -0.072 319
FT/SI L 9 0.548 0.439 -0.108 230
FT/SI R 9 0.553 0.592 0.038 708
Other L 7 0.569 0.479 -0.090 697
Other R 7 0.541 0.379 -0.161 1107
Other L 9 0.534 0.385 -0.149 727
Other R 9 0.550 0.497 -0.054 896

The effect clearly persists, although there is a bit of Simpsoning showing up here.  Slices are relatively fastball-heavy and hooks are relatively Other-heavy, and corner OF catch FBs at a relatively higher rate.  That will be the subject of another post.  The average L-R difference among paired pitch types is still 0.089 though.

Vertical pitch location is completely boring, and horizontal pitch location is the subject for another post (corner OFs do best on outside pitches hit oppo and worst on inside pitches pulled), but the handedness effect clearly persists across all pitch location-fielder pairs.

So what is going on?  My theory is that this is a visibility issue.  The LF has a much better view of a LHB’s body and swing than he does of a RHB’s, and it’s consistent with all the data that looking into the open side gives about a 0.1 second advantage in reaction time compared to looking at the closed side.  A baseball swing takes around 0.15 seconds, so that seems roughly reasonable to me.  I don’t have the play-level data to test that myself, but it should show up as a batter handedness difference in corner OF reaction distance and around a 2.5 foot batter handedness difference in corner OF jump on competitive plays.

Wins Above Average Closer

It’s HoF season again, and I’ve never been satisfied with the dicsussion around relievers.  I wanted something that attempted to quantify excellence at the position while still being a counting stat, and what better way to quantify excellence at RP than by comparing to the average closer?  (Please treat that as a rhetorical question)

I used the highly scientific method of defining the average closer as the aggregate performance of the players who were top-25 in saves in a given year, and I used several measures of wins.  I wanted something that used all events (so not fWAR) and already handled run environment for me, and comparing runs as a counting stat across different run environments is more than a bit janky, so that meant a wins-based metric.  I went with REW (RE24-based wins), WPA, and WPA/LI. 

IP as the denominator instead of PA/TBF because I wanted any (1-inning, X runs) and any inherited runner situation (X outs gotten to end the inning, Y runs allowed – entering RE) to grade out the same regardless of batters faced.  Not that using PA as the denominator would make much difference.

The first trick was deciding on a baseline Wins/IP to compare against because the “average closer” is significantly better now than 1974, to the tune of around 0.5 normalized RA/9 better. 

I used the regression as the baseline Wins/IP for each season/metric because I was more interested in excellence compared to peers than compared to players who were pitching significantly different innings/appearance.  WPA/LI/IP basically overlaps REW/IP and makes it all harder to see, so I left it off.

For each season, a player’s WAAC is (Wins/IP – baseline wins/IP) * IP, computed separately for each win metric.

Without further ado, the top 20 in WAAC (REW-based) and the remaining HoFers.  Peak is defined as the optimal start and stop years for REW.  Fangraphs doesn’t have Win Probability stats before 1974, which cuts out all of Hoyt Wilhelm, but by a quick glance, he’s going to be top-5, solidly among the best non-Rivera RPs.  I also miss the beginning of Fingers’s career, but it doesn’t matter. 

Career WAAC based on REW WPA WPA/LI Peak REW Peak Years
Mariano Rivera 16.9 26.6 18.6 16.9 1996-2013
Billy Wagner 7.3 7.2 7.3 7.3 1996-2010
Joe Nathan 6.6 12.6 7.5 8.5 2003-2013
Zack Britton 5.4 7.0 4.1 5.4 2014-2020
Craig Kimbrel 5.2 6.9 4.3 6.3 2010-2018
Keith Foulke 4.9 4.1 5.3 7.2 1999-2004
Tom Henke 4.9 5.8 5.7 5.8 1985-1995
Aroldis Chapman 4.2 4.1 4.7 4.6 2012-2019
Rich Gossage 4.1 7.4 4.8 10.8 1975-1985
Andrew Miller 3.8 3.5 3.2 5.4 2012-2017
Wade Davis 3.8 3.5 3.7 5.6 2012-2017
Trevor Hoffman 3.8 8.0 6.1 5.8 1994-2009
Darren O’Day 3.7 -1.2 1.8 4.1 2009-2020
Rafael Soriano 3.5 2.8 2.8 4.3 2003-2012
Jonathan Papelbon 3.5 9.7 4.8 5 2006-2009
Eric Gagne 3.4 8.5 3.6 4.3 2002-2005
Dennis Eckersley (RP) 3.4 -0.3 5.2 7.1 1987-1992
John Wetteland 3.3 7.1 4.9 4.2 1992-1999
John Smoltz (RP) 3.2 9.0 3.5 3.2 2001-2004
Kenley Jansen (#20) 3.1 2.6 3.5 4.3 2010-2017
Lee Smith (#25) 2.5 0.9 1.8 4.5 1981-1991
Rollie Fingers (#64) 0.7 -6.0 0.8 1.9 1975-1984
Bruce Sutter (#344) 0.0 2.4 3.6 4.3 1976-1981

Mariano looks otherworldly here, but it’s hard to screw that up.  We get a few looks at really aberrant WPAs, good and bad, which is no shock because it’s known to be noisy as hell.  Peak Gossage was completely insane.  His career rate stats got dragged down by pitching forever, but for those 10 years (20.8 peak WPA too), he was basically Mo.  That’s the longest imitation so far.

Wagner was truly excellent.  And he’s 3rd in RA9-WAR behind Mo and Goose, so it’s not like his lack of IP stopped him from accumulating regular value.  Please vote him in if you have a vote.

It’s also notable how hard it is to stand out or sustain that level.  Only one other player is above 3 career WAAC (Koji).  There are flashes of brilliance (often mixed with flashes of positive variance), but almost nobody sustained “average closer” performance for over 10 years.  The longest peaks are (a year skipped to injury/not throwing 10 IP in relief doesn’t break it, it just doesn’t count towards the peak length):

Rivera 17 (16 with positive WAAC)

Wagner 15 (12 positive)

Hoffman 15 (9 positive)

Wilhelm 13 (by eyeball)

Smith 11 (10 positive)

Henke 11 (10 positive)

Fingers 11 (8 positive, giving him 1973)

O’Day 11 (7 positive)

and that’s it in the history of baseball.  It’s pretty tough to pitch that well for that many years.  

This isn’t going to revolutionize baseball analysis or anything, but I thought it was an interesting look that went beyond career WAR/career WPA to give a kind of counting stat for excellence.

Don’t use FRAA for outfielders

TL;DR OAA is far better, as expected.  Read after the break for next-season OAA prediction/commentary.

As a followup to my previous piece on defensive metrics, I decided to retest the metrics using a sane definition of opportunity.  BP’s study defined a defensive opportunity as any ball fielded by an outfielder, which includes completely uncatchable balls as well as ground balls that made it through the infield.  The latter are absolute nonsense, and the former are pretty worthless.  Thanks to Statcast, a better definition of defensive opportunity is available- any ball it gives a nonzero catch probability and assigns to an OF.  Because Statcast doesn’t provide catch probability/OAA on individual plays, we’ll be testing each outfielder in aggregate.

Similarly to what BP tried to do, we’re going to try to describe or predict each OF’s outs/opportunity, and we’re testing the 354 qualified OF player-seasons from 2016-2019.  Our contestants are Statcast’s OAA/opportunity, UZR/opportunity, FRAA/BIP (what BP used in their article), simple average catch probability (with no idea if the play was made or not), and positional adjustment (effectively the share of innings in CF, corner OF, or 1B/DH).  Because we’re comparing all outfielders to each other, and UZR and FRAA compare each position separately, those two received the positional adjustment (they grade quite a bit worse without it, as expected).

Using data from THE SAME SEASON (see previous post if it isn’t obvious why this is a bad idea) to describe that SAME SEASON’s outs/opportunity, which is what BP was testing, we get the following correlations:

Metric r^2 to same-season outs/opportunity
OAA/opp 0.74
UZR/opp 0.49
Catch Probability + Position 0.43
Catch Probability 0.32
Positional adjustment/opp 0.25

OAA wins running away, UZR is a clear second, background information is 3rd, and FRAA is a distant 4th, barely ahead of raw catch probability.  And catch probability shouldn’t be that important.  It’s almost independent of OAA (r=0.06) and explains much less of the outs/opp variance.  Performance on opportunities is a much bigger driver than difficulty of opportunities over the course of a season.  I ran the same test on the 3 OF positions individually (using Statcast’s definition of primary position for that season), and the numbers bounced a little, but it’s the same rank order and similar magnitude of differences.

Attempting to describe same-season OAA/opp gives the following:

Metric r^2 to same-season OAA/opportunity
OAA/opp 1
UZR/opp 0.5
Positional adjustment/opp 0.17
Catch Probability 0.004

As expected, catch probability drops way off.  CF opportunities are on average about 1% harder than corner OF opportunities.  Positional adjustment is obviously a skill correlate (Full-time CF > CF/corner tweeners > Full-time corner > corner/1B-DH tweeners), but it’s a little interesting that it drops off compared to same-season outs/opportunity.  It’s reasonably correlated to catch probability, which is good for describing outs/opp and useless for describing OAA/opp, so I’m guessing that’s most of the decline.

Now, on to the more interesting things.. Using one season’s metric to predict the NEXT season’s OAA/opportunity (both seasons must be qualified), which leaves 174 paired seasons, gives us the following (players who dropped out were almost average in aggregate defensively):

Metric r^2 to next season OAA/opportunity
OAA/opp 0.45
UZR/opp 0.25
Positional adjustment 0.1
Catch Probability 0.02

FRAA notably doesn’t suck here- although unless you’re a modern-day Wintermute who is forbidden to know OAA, just use OAA of course.  Looking at the residuals from previous-season OAA, UZR is useless, but FRAA and positional adjustment contain a little information, and by a little I mean enough together to get the r^2 up to 0.47.  We’ve discussed positional adjustment already and that makes sense, but FRAA appears to know a little something that OAA doesn’t, and it’s the same story for predicting next-season outs/opp as well.

That’s actually interesting.  If the crew at BP had discovered that and spent time investigating the causes, instead of spending time coming up with ways to bullshit everybody that a metric that treats a ground ball to first as a missed play for the left fielder really does outperform Statcast, we might have all learned something useful.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield.. and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: On a team level, team OAA=range + conversion and team FRAA = team OAA + positioning-based difficulty relative to average.  On a player level, player OAA= range + conversion and player FRAA = player OAA + positioning + teammate noise.

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but..

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the bolded abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.

Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:


player performance given the difficulty of the opportunity (OAA) +

smart/dumb fielder positioning (a front-office/manager skill in 2019) +

quality of contact allowed (a batter/pitcher skill) +

luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill+luck, it benefits from “knowing” the luck in the plays it’s trying to predict.  In this case, luck being whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.

Dave Stieb was good

Since there’s nothing of any interest going on in the country or the world today, I decided the time was right to defend the honour of a Toronto pitcher from the 80s.  Looking deeper into this article, https://www.baseballprospectus.com/news/article/57310/rubbing-mud-dra-and-dave-stieb/ which concluded that Stieb was actually average or worse rate-wise, many of the assertions are… strange.

First, there’s the repeated assertion that Stieb’s K and BB rates are bad.  They’re not.  He pitched to basically dead average defensive catchers, and weighted by the years Stieb pitched, he’s actually marginally above the AL average.  The one place where he’s subpar, hitting too many batters, isn’t even mentioned.  This adds up to a profile of

K/9 BB/9 HBP/9
AL Average 5.22 3.28 0.20
Stieb 5.19 3.21 0.40

Accounting for the extra HBPs, these components account for about 0.05 additional ERA over league average, or ~1%.  Without looking at batted balls at all, Stieb would only be 1% worse than average (AL and NL are pretty close pitcher-quality wise over this timeframe, with the AL having a tiny lead if anything).  BP’s version of FIP- (cFIP) has Stieb at 104.  That doesn’t really make any sense before looking at batted balls, and Stieb only allowed a HR/9 of 0.70 vs. a league average of 0.88.  He suppressed home runs by 20%- in a slight HR-friendly park- over 2900 innings, combined with an almost dead average K/BB profile, and BP rates his FIP as below average.  That is completely insane.

The second assertion is that Stieb relied too much on his defense.  We can see from above that an almost exactly average percentage of his PAs ended with balls in play, so that part falls flat, and while Toronto did have a slightly above-average defense, it was only SLIGHTLY above average.  Using BP’s own FRAA numbers, Jays fielders were only 236 runs above average from 79-92, and prorating for Stieb’s share of IP, they saved him 24 runs, or a 0.08 lower ERA (sure, it’s likely that they played a bit better behind him and a bit worse behind everybody else).  Stieb’s actual ERA was 3.44 and his DRA is 4.43- almost one full run worse- and the defense was only a small part of that difference.  Even starting from Stieb’s FIP of 3.82, there’s a hell of a long way to go to get up to 4.43, and a slightly good defense isn’t anywhere near enough to do it.

Stieb had a career BABIP against of .260 vs. AL average of .282, and the other pitchers on his teams had an aggregate BABIP of .278.  That’s more evidence of a slightly above-average defense, suppressing BABIP a little in a slight hitter’s home park, but Stieb’s BABIP suppression goes far beyond what the defense did for everybody else.  It’s thousands-to-1 against a league-average pitcher suppressing HR as much as Stieb did.  It’s also thousands-to-1 against a league-average pitcher in front of Toronto’s defense suppressing BABIP as much as Stieb did.  It’s exceptionally likely that Stieb actually was a true-talent soft contact machine.  Maybe not literally to his careen numbers, but the best estimate is a hell of a lot closer to career numbers than to average after 12,000 batters faced.

This is kind of DRA and DRC in a microcosm.  It can spit out values that make absolutely no sense at a quick glance, like a league-average K/BB guy with great HR suppression numbers grading out with a below-average cFIP, and it struggles to accept outlier performance on balls in play, even over gigantic samples, because the season-by-season construction is completely unfit for purpose when used to describe a career.  That’s literally the first thing I wrote when DRC+ was rolled out, and it’s still true here.

Reliever Sequencing, Real or Not?

I read this first article on reliever sequencing, and it seemed like a reasonable enough hypothesis, that batters would do better seeing pitches come from the same place and do worse seeing them come from somewhere else, but the article didn’t discuss the simplest variable that should have a big impact- does it screw batters up to face a lefty after a righty or does it really not matter much at all?  I don’t have their arm slot data, and I don’t know what their exact methodology was, so I just designed my own little study to measure the handedness switch impact.

Using PAs from 2015-18 where the batter is facing a different pitcher than the previous PA in this game (this excludes the first PA in the game for all batters, of course), I noted the handedness of the pitcher, the stance of the batter, and the standard wOBA result of the PA.  To determine the impact of the handedness switch, I compared pairs of data: (RHB vs RHP where the previous pitcher was a LHP) to (RHB vs RHP where the previous pitcher was a RHP), etc, which also controls for platoon effects without having to try to quantify them everywhere.  The raw data is

Table 1

Bats Throws Prev P wOBA N
L L L 0.302 16162
L L R 0.296 54160
L R R 0.329 137190
L R L 0.333 58959
R L L 0.339 19612
R L R 0.337 63733
R R R 0.315 191871
R R L 0.313 82190

which looks fairly minor, and the differences (following same hand – following opposite hand) come out to

Table 2

Bats Throws wOBA Diff SD Harmonic mean of N
L L 0.006 0.0045 24895
L R -0.0046 0.0025 82474
R L 0.002 0.0041 29994
R R 0.002 0.0021 115083
Total Total 0.000000752 252446

which is in the noise range in every bucket and overall no difference between same and opposite hand as the previous pitcher.  Just in case there was miraculously a player-quality effect exactly offsetting a real handedness effect, for each PA in the 8 groups in table 1, I calculated the overall (all 4 years) batter performance against the pitcher’s handedness and the pitcher’s overall performance against batters of that stance, then compared the quality of the group that followed same-handed pitching to the group that followed opposite-handed pitching.

As it turned out there was an effect… quality effects offset some of the observed differential in 3 of the buckets, and now the difference in every individual bucket is less than 1 SD away from 0.000 while the overall effect is still nonexistent.

Table 3

Bats Throws wOBA Diff Q diff Adj Diff SD Harmonic mean of N
L L 0.0057 0.0037 0.0020 0.0045 24895
L R -0.0046 -0.0038 -0.0008 0.0025 82474
R L 0.0018 -0.0022 0.0040 0.0041 29994
R R 0.0016 0.0033 -0.0017 0.0021 115083
Total Total 0 0.0004 -0.0004 252446

Q Diff means that LHP + LHB following a LHP were a combination of better batters/worse pitchers by 3.7 points of wOBA compared to LHP + LHB following a RHP, etc.  So of the observed 5.7 points of wOBA difference, 3.7 of it was expected from player quality and the 2 points left over is the adjusted difference.

I also looked at only the performance against the second pitcher the batter faced in the game using the first pitcher’s handedness, but in that case, following the same-handed pitcher actually LOWERED adjusted performance by 1.7 points of wOBA (third and subsequent pitcher faced was a 1 point benefit for samehandedness), but these are still nothing.  I just don’t see anything here.  If changing pitcher characteristics made a meaningful difference, it would almost have to show up in flipped handedness, and it just doesn’t.


There was one other obvious thing to check, velocity, and it does show the makings of a real (and potentially somewhat actionable) effect.  Bucketing pitchers into fast (average fastball velocity>94.5, slow <89.5, or medium and doing the same quality/handedness controls as above gave the following:

first reliever starter Quality-adjusted woba SD N
F F 0.319 0.0047 11545
F M 0.311 0.0019 65925
F S 0.306 0.0037 17898
M F 0.318 0.0033 23476
M M 0.321 0.0012 167328
M S 0.320 0.0022 50625
S F 0.321 0.0074 4558
S M 0.318 0.0025 39208
S S 0.330 0.0043 13262

Harder-throwing relievers do better, which isn’t a surprise, but it looks like there’s extra advantage when the starter was especially soft-tossing, and at the other end, slow-throwing relievers are max punished immediately following soft-tossing starters.  This deserves a more in-depth look with more granular tools than aggregate PA wOBA, but two independent groups both showing a >1SD effect in the hypothesized direction is.. something, at least, and an effect size on the order of .2-.3 RA/9 isn’t useless if it holds up.  I’m intrigued again.