Dave Stieb was good

Since there’s nothing of any interest going on in the country or the world today, I decided the time was right to defend the honour of a Toronto pitcher from the 80s.  Looking deeper into this article, https://www.baseballprospectus.com/news/article/57310/rubbing-mud-dra-and-dave-stieb/ which concluded that Stieb was actually average or worse rate-wise, many of the assertions are… strange.

First, there’s the repeated assertion that Stieb’s K and BB rates are bad.  They’re not.  He pitched to basically dead average defensive catchers, and weighted by the years Stieb pitched, he’s actually marginally above the AL average.  The one place where he’s subpar, hitting too many batters, isn’t even mentioned.  This adds up to a profile of:

K/9 BB/9 HBP/9
AL Average 5.22 3.28 0.20
Stieb 5.19 3.21 0.40

Accounting for the extra HBPs, these components account for about 0.05 additional ERA over league average, or ~1%.  Without looking at batted balls at all, Stieb would only be 1% worse than average (AL and NL are pretty close pitcher-quality wise over this timeframe, with the AL having a tiny lead if anything).  BP’s version of FIP- (cFIP) has Stieb at 104.  That doesn’t really make any sense before looking at batted balls, and Stieb only allowed a HR/9 of 0.70 vs. a league average of 0.88.  He suppressed home runs by 20%- in a slightly HR-friendly park- over 2900 innings, combined with an almost dead average K/BB profile, and BP rates his FIP as below average.  That is completely insane.

The second assertion is that Stieb relied too much on his defense.  We can see from above that an almost exactly average percentage of his PAs ended with balls in play, so that part falls flat, and while Toronto did have a slightly above-average defense, it was only SLIGHTLY above average.  Using BP’s own FRAA numbers, Jays fielders were only 236 runs above average from 79-92, and prorating for Stieb’s share of IP, they saved him 24 runs, or a 0.08 lower ERA (sure, it’s likely that they played a bit better behind him and a bit worse behind everybody else).  Stieb’s actual ERA was 3.44 and his DRA is 4.43- almost one full run worse- and the defense was only a small part of that difference.  Even starting from Stieb’s FIP of 3.82, there’s a hell of a long way to go to get up to 4.43, and a slightly good defense isn’t anywhere near enough to do it.

Stieb had a career BABIP against of .260 vs. AL average of .282, and the other pitchers on his teams had an aggregate BABIP of .278.  That’s more evidence of a slightly above-average defense, suppressing BABIP a little in a slight hitter’s home park, but Stieb’s BABIP suppression goes far beyond what the defense did for everybody else.  It’s thousands-to-1 against a league-average pitcher suppressing HR as much as Stieb did.  It’s also thousands-to-1 against a league-average pitcher in front of Toronto’s defense suppressing BABIP as much as Stieb did.  It’s exceptionally likely that Stieb actually was a true-talent soft contact machine.  Maybe not literally to his career numbers, but the best estimate is a hell of a lot closer to career numbers than to average after 12,000 batters faced.
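As a rough check on the thousands-to-1 BABIP claim, here is a minimal sketch using a normal approximation to the binomial.  The number of balls in play (8,900) is my assumption, a round number backed out of roughly 12,000 batters faced, and is not a figure from the post.  The .278 baseline is the teammates’ aggregate BABIP quoted above.

```python
import math

# Hypothetical input: ~8,900 balls in play is an assumed round number,
# not a figure from the post.
balls_in_play = 8900
baseline_babip = 0.278   # other Toronto pitchers, from the text
stieb_babip = 0.260      # Stieb's career BABIP against, from the text

# SD of observed BABIP for a pitcher whose true rate is the baseline.
sd = math.sqrt(baseline_babip * (1 - baseline_babip) / balls_in_play)
z = (baseline_babip - stieb_babip) / sd

# One-sided tail: chance an average pitcher lands at .260 or lower by luck.
tail = 0.5 * math.erfc(z / math.sqrt(2))
odds_against = (1 - tail) / tail

print(f"z = {z:.2f}, tail = {tail:.1e}, odds against = {odds_against:,.0f} to 1")
```

Under these assumptions the gap is a bit under 4 standard deviations.  It treats every ball in play as an independent coin flip, so it’s a sanity check rather than a full model.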

This is kind of DRA and DRC in a microcosm.  It can spit out values that make absolutely no sense at a quick glance, like a league-average K/BB guy with great HR suppression numbers grading out with a below-average cFIP, and it struggles to accept outlier performance on balls in play, even over gigantic samples, because the season-by-season construction is completely unfit for purpose when used to describe a career.  That’s literally the first thing I wrote when DRC+ was rolled out, and it’s still true here.

Uncertainty in baseball stats (and why DRC+ SD is a category error)

What does it mean to talk about the uncertainty in, say, a pitcher’s ERA or a hitter’s OBP?  You know exactly how many ER were allowed, exactly how many innings were pitched, exactly how many times the batter reached base, and exactly how many PAs he had.  Outside of MLB deciding to retroactively flip a hit/error decision, there is no uncertainty in the value of the stat.  It’s an exact measurement.  Likewise, there’s no uncertainty in Trout’s 2013 wOBA or wRC+.  They reflect things that happened, calculated in deterministic fashion from exact inputs.  Reporting a measurement uncertainty for any of these wouldn’t make any sense.

The Statcast metrics are a little different- EV, LA, sprint speed, hit distance, etc. all have a small amount of random error in each measurement, but since those errors are small and opportunities are numerous, the impact of random error is small to start with and totally meaningless quickly when aggregating measurements.  There’s no point in reporting random measurement uncertainty in a public-facing way because it may as well be 0 (checking for systematic bias is another story, but that’s done with the intent of being fixed/corrected for, not of being reported as metric uncertainty).

Point 1:

So we can’t be talking about the uncertainties in measuring/calculating these kinds of metrics- they’re irrelevant-to-nonexistent.  When we’re talking about the uncertainty in somebody’s ERA or OBP or wRC+, we’re talking about the uncertainty of the player’s skill at the metric in question, not the uncertainty of the player’s observed value.  That alone makes it silly to report such metrics as “observed value +/- something”, like ERA 0.37 +/- 3.95, because it’s implicitly treating the observed value as some kind of meaningful central-ish point in the player’s talent distribution.  There’s no reason for that to be true *because these aren’t talent metrics*.  They’re simply a measure of something over a sample, and such metrics frequently give values where a true talent even better than the observed value is astronomically unlikely (a wRC+ over 300) or outright impossible (an ERA below 0), along with many less extreme but equally silly examples.

Point 2:

Expressing something non-stupidly in the A +/- B format (or listing percentiles if it’s significantly non-normal, whatever) requires a knowledge of the player’s talent distribution after the observed performance, and that can’t be derived solely from the player’s data.  If something happens 25% of the time, talent could cluster near 15% and the player is doing it more often, talent could cluster near 35% and the player is doing it less often, or talent could cluster near 25% and the player is average.  There’s no way to tell the difference from just the player’s stat line and therefore no way to know what number to report as the mean, much less the uncertainty.  Reporting a 25% mean might be correct (the latter case) or as dumb as reporting a mean wRC+ of 300 (if talent clusters quite tightly around 15%).

Once you build a prior talent distribution (based on what other players have done and any other material information), then it’s straightforward to use the observed performance at the metric in question and create a posterior distribution for the talent, and from that extract the mean and SD.  When only the mean is of interest, it’s common to regress by adding some number of average observations, more for a tighter talent distribution and fewer for a looser talent distribution, and this approximates the full Bayesian treatment.  If the quantity in the previous paragraph were HR/FB% (league average a little under 15%), then 25% for a pitcher would be regressed down a lot more than for a batter over the same number of PAs because pitcher HR/FB% allowed talent is much more tightly distributed than hitter HR/FB% talent, and the uncertainty reported would be a lot lower for the pitcher because of that tighter talent distribution.  None of that is accessible by just looking at a 25% stat line.
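The “add some number of average observations” shortcut can be sketched in a few lines.  The sample size and both ballast values below are hypothetical, picked only to show a tight talent distribution (pitchers) being regressed harder than a loose one (hitters).  They are not fitted constants.

```python
def regress_to_mean(successes, trials, league_rate, ballast):
    """Shrink an observed rate toward the league rate by adding
    `ballast` phantom trials of league-average performance."""
    return (successes + league_rate * ballast) / (trials + ballast)

league_hr_fb = 0.15      # "a little under 15%" in the text, rounded
fly_balls = 100          # hypothetical sample size
home_runs = 25           # 25% observed

# Hypothetical ballast values, for illustration only.
pitcher_ballast = 400    # tight talent distribution: regress a lot
hitter_ballast = 50      # loose talent distribution: regress less

pitcher_est = regress_to_mean(home_runs, fly_balls, league_hr_fb, pitcher_ballast)
hitter_est = regress_to_mean(home_runs, fly_balls, league_hr_fb, hitter_ballast)
print(f"pitcher: {pitcher_est:.3f}, hitter: {hitter_est:.3f}")
```

The same 25% stat line produces two different estimates, and the only thing that changed is the assumed spread of talent, which is exactly the information the stat line doesn’t contain.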

Actual talent metrics/projections, like Steamer and ZiPS, do exactly this (well, more complicated versions of this) using talent distributions and continually updating with new information, so when they spit out mean and SD, or mean and percentiles, they’re using a process where those numbers are meaningful, getting them as the result of using a reasonable prior for talent and therefore a reasonable posterior after observing some games.  Their means are always going to be “in the middle” of a reasonable talent posterior, not nonsense like wRC+ 300.

Which brings us to DRC+.  I’ve noted previously that the DRC+ SDs don’t make any sense, but I didn’t really have any idea how they were coming up with those numbers until this recent article, and a reference to this old article on bagging.  My last two posts pointed out that DRC+ weights way too aggressively in small samples to be a talent metric and that DRC+ has to be heavily regressed to make projections, so when we see things in that article like Yelich getting assigned a DRC+ over 300 for a 4PA 1HR 2BB game, that just confirms what we already knew- DRC+ is happy to assign means far, far outside any reasonable distribution of talent and therefore can’t be based on a Bayesian framework using reasonable talent priors.

So DRC+ is already violating point 1 above, using the A +/- B format when A takes ridiculous values because DRC+ isn’t a talent metric.  Given that it’s not even using reasonable priors to get *means*, it’s certainly not shocking that it’s not using them to get SDs either, but what it’s actually doing is bonkers in a way that turns out kind of interesting.  The bagging method they use to get SDs is (roughly) treating the seasonal PA results as the exact true talent distribution of events, drawing  from them over and over (with replacement) to get a fake seasonal line, doing that a bunch of times and taking the SD of the fake seasonal lines as the SD of the metric.
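Here is a minimal sketch of the bagging procedure as described above: resample the observed PAs with replacement, recompute the line, and take the SD across resamples.  This is not BP’s code.  The per-PA values are hypothetical wOBA-like weights and the 600 PA “season” is invented purely to show how the SD behaves with sample size.

```python
import random
import statistics

def bagged_sd(outcomes, n_resamples=2000, seed=0):
    """SD of the mean outcome across bootstrap resamples of one stat line."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(outcomes, k=n)   # draw with replacement
        means.append(sum(resample) / n)
    return statistics.pstdev(means)

# Hypothetical per-PA values on a wOBA-like scale (illustrative weights only).
game = [2.0, 0.7, 0.7, 0.0]   # 4 PA: 1 HR, 2 BB, 1 out
season = [2.0] * 25 + [0.9] * 110 + [0.7] * 55 + [0.0] * 410   # invented 600 PA line

print(f"4 PA:   {bagged_sd(game):.3f}")
print(f"600 PA: {bagged_sd(season):.3f}")
```

Nothing in the function knows anything about how talent is distributed across players.  The only inputs are the player’s own PAs, so the output is a statement about sampling noise in the stat line and nothing else.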

That’s obviously just a category error.  As I explained in point 2, the posterior talent uncertainty depends on the talent distribution and can’t be calculated solely from the stat line, but such obstacles don’t seem to worry Jonathan Judge.  When talking about Yelich’s 353 +/- 6  DRC+, he said “The early-season uncertainties for DRC+ are high. At first there aren’t enough events to be uncertain about, but once we get above 10 plate appearances or so the system starts to work as expected, shooting up to over 70 points of probable error. Within a week, though, the SD around the DRC+ estimate has worked its way down to the high 30s for a full-time player.”  That’s just backwards about everything.  I don’t know (or care) why their algorithm fails under 10 PAs, but writing “not having enough events to be uncertain about” shows an amazing misunderstanding of everything.

The accurate statement- assuming you’re going in DRC+ style using only YTD knowledge of a player- is “there aren’t enough events to be CERTAIN about much of anything”, and an accurate DRC+ value for Yelich- if DRC+ functioned properly as a talent metric- would be around 104 +/- 13 after that nice first game.  104 because a 4PA 1HR 2BB game preferentially selects- but not absurdly so- for above average hitters, and a SD of 13 because that’s about the SD of position player projections this year.  SDs of 70 don’t make any sense at all and are the artifact of an extremely high SD in observed wOBA (or wRC+) over 10-ish PAs, and remember that their bagging algorithm is using such small samples to create the values.  It’s clear WHY they’re getting values that high, but they just don’t make any sense because they’re approaching the SD from the stat line only and ignoring the talent distribution that should keep them tight.  When you’re reporting a SD 5 times higher than what you’d get just picking a player talent at random, you might have problems.

The Bayesian Central Limit Theorem

I promised there was something kind of interesting, and I didn’t mean bagging on DRC+ for the umpteenth time, although catching an outright category error is kind of cool.  For full-time players after a full season, the DRC+ SDs are actually in the ballpark of correct, even though the process they use to create them obviously has no logical justification (and fails beyond miserably for partial seasons, as shown above).  What’s going on is an example of the Bayesian Central Limit Theorem, which states that for any priors that aren’t intentionally super-obnoxious, repeatedly observing i.i.d. variables will cause the posterior to converge to a normal distribution.  At the same time, the regular Central Limit Theorem means that the distribution of outcomes that their bagging algorithm generates should also approach a normal distribution.

Without the DRC+ processing baggage, these would be converging to the same normal distribution, as I’ll show with binomials in a minute, but of course DRC+ gonna DRC+ and turn virtually identical stat lines into significantly different numbers:

NAME YEAR PA 1B 2B 3B HR TB BB IBB SO HBP AVG OBP SLG OPS ISO oppOPS DRC+ DRC+ SD
Pablo Sandoval 2014 638 119 26 3 16 244 39 6 85 4 0.279 0.324 0.415 0.739 0.136 0.691 113 7
Jacoby Ellsbury 2014 635 108 27 5 16 241 49 5 93 3 0.271 0.328 0.419 0.747 0.148 0.696 110 11

Ellsbury is a little more TTO-based and gets an 11 SD to Sandoval’s 7.  Seems legit.  Regardless of these blips, high single digits is about right for a DRC+ (wRC+) SD after observing a full season.

Getting rid of the DRC+ layer to show what’s going on, assume talent is uniform on [.250-.400] (SD of 0.043) and we’re dealing with 1000 Bernoulli observations.  Let’s say we observe 325 successes (.325), then when we plot the Bayesian posterior talent distribution and the binomial for 1000 p=.325 events (the distribution that bagging produces)

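[Figure: Bayesian posterior talent distribution vs. binomial distribution for 325 observed successes in 1000 trials]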

They overlap so closely you can’t even see the other line.  Going closer to the edge, we get, for 275 and 260 observed successes,

At 275, we get a posterior SD of .013 vs the binomial .014, and at 260, we start to break the thing, capping how far to the left the posterior can go, and *still* get a posterior SD of .011 vs .014.  What’s going on here is that the weight for a posterior value is the prior-weighted probability that that value (say, .320) produces an observation of .325 in N attempts, while the binomial bagging weight at that point is the probability that .325 produces an observation of .320 in N attempts.  These aren’t the same, but under a lot of circumstances, they’re pretty damn close, and as N grows, and the numbers that take the place of .320 and .325 in the meat of the distributions get closer and closer together, the posterior converges to the same normal that describes the binomial bagging.  Bayesian CLT meets normal CLT.

When the binomial bagging variance starts dropping well below the prior population variance, this convergence starts to happen enough to where the numbers can loosely be called “close” for most observed success rates, and that transition point happens to come out around a full season of somewhat regressed observation of baseball talent. In the example above, the prior population SD was 0.043 and the binomial SD was 0.014, so it converged excellently until we ran too close to the edge of the prior.  It’s not always going to work, because a low end talent can get unlucky, or a high end talent can get lucky, and observed performance can be more extreme than the talent distribution (super-easy in small samples, still happens in seasonal ones) but for everybody in the middle, it works out great.
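The uniform-prior comparison can be reproduced with a simple grid approximation.  This is a sketch under the setup stated above (uniform talent on [.250, .400], 1000 Bernoulli trials).  The grid resolution and function names are my own choices, not anything from the original analysis.

```python
import math

def posterior_sd_uniform(successes, trials, lo=0.250, hi=0.400, steps=3000):
    """Posterior mean and SD of talent on a grid, uniform prior on [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    # Work in logs so the binomial likelihood does not underflow.
    log_like = [successes * math.log(p) + (trials - successes) * math.log(1 - p)
                for p in grid]
    peak = max(log_like)
    weights = [math.exp(ll - peak) for ll in log_like]
    total = sum(weights)
    mean = sum(w * p for w, p in zip(weights, grid)) / total
    var = sum(w * (p - mean) ** 2 for w, p in zip(weights, grid)) / total
    return mean, math.sqrt(var)

def binomial_sd(successes, trials):
    """SD of the observed rate if the observed rate were the exact true talent."""
    rate = successes / trials
    return math.sqrt(rate * (1 - rate) / trials)

for k in (325, 275, 260):
    mean, sd = posterior_sd_uniform(k, 1000)
    print(k, round(mean, 4), round(sd, 4), round(binomial_sd(k, 1000), 4))
```

In the middle of the prior the two SDs are nearly indistinguishable.  Near the .250 edge the prior chops off the left tail of the posterior, so its SD falls below the binomial one.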

Let’s make the priors more obnoxious and see how well this works- this is with a triangle distribution, max weight at .250 straight line down to a 0 weight at .400.

 

The left-weighted prior shifts the means, but the standard deviations are obviously about the same again here.  Let’s up the difficulty even more, starting with a N(.325,.020) prior (0.020 standard deviation), which is pretty close to the actual mean/SD wOBA talent distribution among position players (that distribution is left-weighted like the triangle too, but we already know that doesn’t matter much for the SD)

Even now that the bagging distributions are *completely* wrong and we’re using observations almost 2 SD out, the standard deviations are still .014-.015 bagging and .012 for the posterior.  Observing 3 SD out isn’t significantly worse.  The prior population SD was 0.020, and the binomial bagging SD was 0.014, so it was low enough that we were close to converging when the observation was in the bulk of the distribution but still nowhere close when we were far outside, although the SDs of the two were still in the ballpark everywhere.

Using only 500 observations on the N(.325,.020) prior isn’t close to enough to pretend there’s convergence even when observing in the bulk.

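[Figure: posterior vs. binomial distribution for a .325 observation over 500 trials with the N(.325,.020) prior]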

The posterior has narrowed to a SD of .014 (around 9 points of wRC+ if we assume this is wOBA and treat wOBA like a Bernoulli, which is handwavy close enough here), which is why I said above that high-single-digits was “right”, but the binomial SD is still at .021, 50% too high.  The regression in DRC+ tightens up the tails compared to “binomial wOBA”, and it happens to come out to around a reasonable SD after a full season.
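For the normal prior, the posterior SD has a closed form if we also approximate the binomial likelihood as normal (that approximation is my simplification here): precisions add.  A quick sketch with the N(.325,.020) prior from above:

```python
import math

def posterior_sd_normal(prior_sd, rate, trials):
    """Posterior SD for a normal prior and a normal approximation to the binomial."""
    obs_sd = math.sqrt(rate * (1 - rate) / trials)             # the "bagging" SD
    post_sd = 1 / math.sqrt(1 / prior_sd**2 + 1 / obs_sd**2)   # precisions add
    return obs_sd, post_sd

for trials in (1000, 500):
    obs_sd, post_sd = posterior_sd_normal(0.020, 0.325, trials)
    print(trials, round(obs_sd, 3), round(post_sd, 3))
```

At 1000 trials the posterior SD comes out around .012, and at 500 trials it’s around .014 against a binomial SD of about .021, in line with the figures quoted above.  The posterior SD can never exceed the prior SD, which is exactly the constraint the bagging approach has no way of knowing about.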

Just to be clear, the bagging numbers are always wrong and logically unjustified here, but they’re a hackjob that happens to be “close” a lot of the time when working with the equivalent of full-season DRC+ numbers (or more).  Before that point, when the binomial bagging variance is higher than the completely naive population variance (the mechanism for DRC+ reporting SDs in the 70s, 30s, or whatever for partial seasons), the bagging procedure isn’t close at all.  This is just another example of DRC+ doing nonsense that looks like baseball analysis to produce a number that looks like a baseball stat, sometimes, if you don’t look too closely.

 

Revisiting the DRC+ team switcher claim

The algorithm has changed a fair bit since I investigated that claim- at the least, it’s gotten rid of most of its park factor and regresses (effectively) less than it used to.  It’s not impossible that it could grade out differently now than it did before, and I told somebody on twitter that I’d check it out again, so here we are.  First of all, let’s remind everybody what their claim is.  From https://www.baseballprospectus.com/news/article/45383/the-performance-case-for-drc/, Jonathan Judge says:


Table 2: Reliability of Team-Switchers, Year 1 to Year 2 (2010-2018); Normal Pearson Correlations[3]

Metric Reliability Error Variance Accounted For
DRC+ 0.73 0.001 53%
wOBA 0.35 0.001 12%
wRC+ 0.35 0.001 12%
OPS+ 0.34 0.001 12%
OPS 0.33 0.002 11%
True Average 0.30 0.002 9%
AVG 0.30 0.002 9%
OBP 0.30 0.002 9%

With this comparison, DRC+ pulls far ahead of all other batting metrics, park-adjusted and unadjusted. There are essentially three tiers of performance: (1) the group at the bottom, ranging from correlations of .3 to .33; (2) the middle group of wOBA and wRC+, which are a clear level up from the other metrics; and finally (3) DRC+, which has almost double the reliability of the other metrics.

You should pay attention to the “Variance Accounted For” column, more commonly known as r-squared. DRC+ accounts for over three times as much variance between batters than the next-best batting metric. In fact, one season of DRC+ explains over half of the expected differences in plate appearance quality between hitters who have switched teams; wRC+ checks in at a mere 16 percent.  The difference is not only clear: it is not even close.

Let’s look at Predictiveness.  It’s a very good sign that DRC+ correlates well with itself, but games are won by actual runs, not deserved runs. Using wOBA as a surrogate for run-scoring, how predictive is DRC+ for a hitter’s performance in the following season?

Table 3: Reliability of Team-Switchers, Year 1 to Year 2 wOBA (2010-2018); Normal Pearson Correlations

Metric Predictiveness Error
DRC+ 0.50 0.001
wOBA 0.37 0.001
wRC+ 0.37 0.002
OPS+ 0.37 0.001
OPS 0.35 0.002
True Average 0.34 0.002
OBP 0.30 0.002
AVG 0.25 0.002

If we may, let’s take a moment to reflect on the differences in performance we see in Table 3. It took baseball decades to reach consensus on the importance of OBP over AVG (worth five points of predictiveness), not to mention OPS (another five points), and finally to reach the existing standard metric, wOBA, in 2006. Over slightly more than a century, that represents an improvement of 12 points of predictiveness. Just over 10 years later, DRC+ now offers 13 points of improvement over wOBA alone.


 

Reading that, you’re pretty much expecting a DIPS-level revelation.  So let’s see how good DRC+ really is at predicting team switchers.  I put DRC+ on the wOBA scale, normalized each performance to the league-average wOBA that season (it ranged from .315 to .326), and measured the mean absolute error (MAE) of wOBA projections for the next season, weighted by the harmonic mean of the PAs in each season.  DRC+ had a MAE of 34.2 points of wOBA for team-switching position players.  Projecting every team-switching position player to be exactly league average had a MAE of 33.1 points of wOBA.  That’s not a mistake.  After all that build-up, DRC+ is literally worse at projecting team-switching position players than assuming that they’re all league average.

If you want to say something about pitchers at the plate…
[Image: “I don’t think so. Homey don’t play that.”]
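For anybody who wants to replicate the scoring, here’s a minimal sketch of the error measure described above: mean absolute error of the T+1 wOBA projection, weighted by the harmonic mean of the two seasons’ PAs.  The three rows are made up for illustration and are not real player data.  The function names and the .320 league wOBA are likewise placeholders.

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

def weighted_mae(rows, project):
    """rows: (year T PA, year T+1 PA, year T metric, year T+1 wOBA).
    project: function mapping a row to a projected T+1 wOBA."""
    num = 0.0
    den = 0.0
    for row in rows:
        pa_t, pa_t1, metric_t, woba_t1 = row
        w = harmonic_mean(pa_t, pa_t1)
        num += w * abs(project(row) - woba_t1)
        den += w
    return num / den

# Made-up rows for illustration only; not real player data.
rows = [
    (600, 550, 0.350, 0.330),
    (120, 400, 0.280, 0.315),
    (30, 200, 0.400, 0.300),
]
LEAGUE_WOBA = 0.320   # placeholder league average

mae_raw = weighted_mae(rows, lambda r: r[2])          # carry the metric forward as-is
mae_lg = weighted_mae(rows, lambda r: LEAGUE_WOBA)    # everyone is league average
print(f"raw: {1000 * mae_raw:.1f} points, league average: {1000 * mae_lg:.1f} points")
```

The harmonic mean keeps a 600 PA season paired with a 20 PA season from counting like a full-time pair, since the noisy short season dominates the error.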

 

Even though Jonathan Judge felt like calling me a total asshole incompetent troll last night, I’m going to show how his metric could be not totally awful at this task if it were designed and quality-tested better.  As I noted yesterday, DRC+’s weightings are *way* too aggressive on small numbers of PAs.  DRC+ shouldn’t *need* to be regressed after the fact- the whole idea of the metric is that players should only be getting credit for what they’ve shown they deserve (in the given season), and after a few PAs, they barely deserve anything, but DRC+ doesn’t grasp that at all and its creator doesn’t seem to realize or care that it’s a problem.

If we regress DRC+ after the fact to see what happens in an attempt to correct that flaw, it’s actually not a dumpster fire.  All weightings are harmonic means of the PAs.  Every position player pair of consecutive 2010-18 seasons with at least 1 PA in each is eligible.  All tables are MAEs in points of wOBA trying to project year T+1 wOBA.

First off, I determined the regression amounts for DRC+ and wOBA to minimize the weighted MAE for all position players, and that came out to adding 416 league average PAs for wOBA and 273 league average PAs for DRC+.  wOBA assigns 100% credit to the batter.  DRC+ *still* needs to be regressed 65% as much as wOBA.  DRC+ is ridiculously overaggressive assigning “deserved” credit.
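Finding the regression amount is just a one-dimensional search: try a range of ballast PAs and keep whichever minimizes the weighted MAE.  A sketch of that search follows.  The rows are invented, the .320 league average and the 25 PA step size are arbitrary choices of mine, and the result on toy data says nothing about the real 416/273 values.

```python
def regressed(metric, pa, ballast, league):
    """Add `ballast` league-average PAs to the observed line."""
    return (metric * pa + league * ballast) / (pa + ballast)

def weighted_mae(rows, ballast, league):
    num = den = 0.0
    for pa_t, pa_t1, metric_t, woba_t1 in rows:
        w = 2 * pa_t * pa_t1 / (pa_t + pa_t1)   # harmonic mean of PAs
        num += w * abs(regressed(metric_t, pa_t, ballast, league) - woba_t1)
        den += w
    return num / den

def best_ballast(rows, league, candidates):
    return min(candidates, key=lambda k: weighted_mae(rows, k, league))

# Made-up rows: (year T PA, year T+1 PA, year T metric, year T+1 wOBA).
rows = [
    (600, 550, 0.350, 0.336),
    (450, 500, 0.290, 0.304),
    (120, 400, 0.280, 0.312),
    (30, 200, 0.400, 0.322),
]
k = best_ballast(rows, league=0.320, candidates=range(0, 1001, 25))
print(k, round(1000 * weighted_mae(rows, k, 0.320), 2))
```

For reference, the ratio of the two fitted amounts in the text is 273 / 416, or about 0.656.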

Table 1.  MAEs for all players

lgavg raw DRC+ raw wOBA reg wOBA reg DRC+
33.21 31.00 33.71 29.04 28.89

Table 2. MAEs for all players broken down by year T PAs

Year T PA lgavg raw DRC+ raw wOBA reg wOBA reg DRC+ T+1 wOBA
1-99 PAs 51.76 48.84 71.82 49.32 48.91 0.284
100-399 PA 36.66 36.64 40.16 34.12 33.44 0.304
400+ PA 30.77 27.65 28.97 25.81 25.91 0.328

Didn’t I just say DRC+ had a problem with being too aggressive in small samples?  Well, this is one area where that mistake pays off- because the group of hitters who have 1-99 PA over a full season are terrible, being overaggressive crediting their suckiness pays off, but if you’re in a situation like now, where the real players instead of just the scrubs and callups have 1-99 PAs, being overaggressive is terribly inaccurate.  Once the population mean approaches league-average quality, the need for- and benefit of- regression is clear. If we cheat and regress each bucket to its population mean, it’s clear that DRC+ wasn’t actually doing anything special in the low-PA bucket, it’s just that regression to 36 points of wOBA higher than the mean wasn’t a great corrector.

Table 3. (CHEATING) MAEs for all players broken down by year T PAs, regressed to their group means (same regression amounts as above).

Year T PA lgavg raw DRC+ raw wOBA reg wOBA reg DRC+ T+1 wOBA
1-99 PAs 51.76 48.84 71.82 46.17 46.30 0.284
100-399 PA 36.66 36.64 40.16 33.07 33.03 0.304
400+ PA 30.77 27.65 28.97 26.00 25.98 0.328

There’s very little difference between regressed wOBA and regressed DRC+ here.  DRC+ “wins” over wOBA by 0.00015 wOBA MAE over all position players, clearly justifying the massive amount of hype Jonathan Judge pumped us up with.  If we completely ignore the trash position players and only optimize over players who had 100+PA in year T, then the regression amounts increase slightly- 437 PA for wOBA and 286 for DRC+, and we get this chart:

Table 4. MAEs for all players broken down by year T PAs, optimized on 100+ PA players

Year T PA lgavg raw DRC+ raw wOBA reg wOBA reg DRC+ T+1 wOBA
100+ PA 32.55 30.37 32.36 28.32 28.19 0.321
100-399 PA 36.66 36.64 40.16 34.12 33.45 0.304
400+ PA 30.77 27.65 28.97 25.81 25.91 0.328

Nothing to see here either, DRC+ with a 0.00013 MAE advantage again.  Using only 400+PA players to optimize over only changes the DRC+ entry to 25.90, so regressed wOBA wins a 0.00009 MAE victory here.

In conclusion, regressed wOBA and regressed DRC+ are so close that there’s no meaningful difference, and I’d grade DRC+ a microscopic winner.  Raw DRC+ is completely awful in comparison, even though DRC+ shouldn’t need anywhere near this amount of extra regression if it were working correctly to begin with.

I’ve slowrolled the rest of the team-switcher nonsense.  It’s not very exciting either.  I defined 3 classes of players, Stay = played both years entirely for the same team, Switch = played year T entirely for 1 team and year T+1 entirely for 1 other team, Midseason = switched midseason in at least one of the years.

Table 5. MAEs for all players broken down by stay/switch, any number of year T PAs

stay/switch lgavg raw DRC+ raw wOBA reg wOBA reg DRC+ T+1 wOBA
stay 33.21 29.86 32.19 27.91 27.86 0.325
switch 33.12 34.20 37.89 31.57 31.53 0.312
mid 33.29 33.01 36.47 31.67 31.00 0.305
sw+mid 33.21 33.60 37.17 31.62 31.26 0.309

It’s the same story as before.  Raw DRC+ sucks balls at projecting T+1 wOBA and is actually worse than “everybody’s league average” for switchers, regressed DRC+ wins a microscopic victory over regressed wOBA for stayers and switchers.  THERE’S (STILL) LITERALLY NOTHING TO THE CLAIM THAT DRC+, REGRESSED OR OTHERWISE, IS ANYTHING SPECIAL WITH RESPECT TO PROJECTING TEAM SWITCHERS.  These are the same conclusions I found the first time I looked, and they still hold for the current version of the DRC+ algorithm.

 

 

DRC+ weights TTO relatively *less* than BIP after 10 games than after a full season

This is a cut-out from a longer post I was running some numbers for, but it’s straightforward enough and absurd enough that it deserves a standalone post.  I’d previously looked at DRAA linear weights and the relevant chart for that is reproduced here.  This is using seasons with 400+PA.

relative to average PA 1b 2b 3b hr bb hbp k bip out
old DRAA 0.22 0.38 0.52 1.16 0.28 0.24 -0.24 -0.13
new DRAA 0.26 0.45 0.62 1.17 0.26 0.30 -0.24 -0.15
wRAA 0.44 0.74 1.01 1.27 0.27 0.33 -0.26 -0.27

 

I reran the same analysis on 2019 YTD stats, with all position players and with a 25 PA minimum, and these are the values I recovered.  Full year is the new DRAA row above, and the percentages are the percent relative to those values.

1b 2b 3b hr bb hbp k BIP out
YTD 0.13 0.21 0.29 0.59 0.11 0.08 -0.14 -0.10
min 25 PA 0.16 0.27 0.37 0.63 0.12 0.09 -0.15 -0.11
Full Year 0.26 0.45 0.62 1.17 0.26 0.30 -0.24 -0.15
YTD %s 48% 47% 46% 50% 41% 27% 57% 64%
min 25PA %s 61% 59% 59% 54% 46% 30% 61% 74%

So.. this is quite something.  First of all, events are “more-than-half-deserved” relative to the full season after only 25-50 PA.  There’s no logical or mathematical reason for that to be true, for any reasonable definition of “deserved”, that quickly.  Second, BIP hits are discounted *LESS* in a small sample than walks are, and BIP outs are discounted *LESS* in a small sample than strikeouts are.  The whole premise of DRC+ is that TTO outcomes belong to the player more than the outcomes of balls in play, and are much more important in small samples, but here we are, with small samples, and according to DRC+, the TTO OUTCOMES ARE RELATIVELY LESS IMPORTANT NOW THAN THEY ARE AFTER A FULL SEASON.  Just to be sure, I reran with wRAA and extracted almost the exact same values as chart 1, so there’s nothing super weird going on here.  This is complete insanity- it’s completely backwards from what’s actually true, and even to what BP has stated is true.  The algorithm has to be complete nonsense to “come to that conclusion”.
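The percentage rows can be roughly re-derived from the rounded weights in the table.  Because the inputs here are already rounded to two decimals, the results land a point or three off the percentages I listed (which came from unrounded values), but the ordering is the same.

```python
events = ["1b", "2b", "3b", "hr", "bb", "hbp", "k", "bip_out"]
ytd =       [0.13, 0.21, 0.29, 0.59, 0.11, 0.08, -0.14, -0.10]
full_year = [0.26, 0.45, 0.62, 1.17, 0.26, 0.30, -0.24, -0.15]

# Share of the full-season weight that each event already carries YTD.
share = {e: y / f for e, y, f in zip(events, ytd, full_year)}
for e in events:
    print(f"{e:8s}{share[e]:6.0%}")

# Balls in play keep MORE of their full-season weight than the matching TTO event.
print(share["1b"] > share["bb"], share["bip_out"] > share["k"])
```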

Reading the explanation article, I kept thinking the same thing over and over.  There’s no clear logical or mathematical justification for most steps involved, and it’s just a pile of junk thrown together and tinkered with enough to output something resembling a baseball stat most of the time if you don’t look too closely. It’s not the answer to any articulable, well-defined question.  It’s not a credible run-it-back projection (I’ll show that unmistakably in the next post, even though it’s already ruled out by the.. interesting.. weightings above).

Whenever a hodgepodge model is thrown together like DRC+ is, it becomes difficult-to-impossible to constrain it to obey things that you know are true.  At what point in the process did it “decide” that TTO outcomes were relatively less important now?  Probably about 20 different places where it was doing nonsense-that-resembles-baseball-analysis and optimizing functions that have no logical link to reality.  When it’s failing basic quality testing- and even worse, when obvious quality assurance failures are observed and not even commented on (next post)- it’s beyond irresponsible to keep running it out as something useful solely on the basis of a couple of apples-to-oranges comparisons on rigged tests.

 

A look at DRC+’s guts (part 1 of N)

In trying to better understand what DRC+ changed with this iteration, I extracted the “implied” run values for each event by finding the best linear fit to DRAA over the last 5 seasons.  To avoid regression hell (and the nonsense where walks can be worth negative runs when pitchers draw them), I only used players with 400+ PA.  To make sure this should actually produce reasonable values, I did the same for wRAA.

relative to average out 1b 2b 3b hr bb hbp k bip out
old DRAA 0.419 0.416 0.75 1.37 0.44 0.41 -0.08 0.03
new DRAA 0.48 0.57 0.56 1.36 0.44 0.49 -0.06 0.02
wRAA 0.70 1.00 1.27 1.53 0.54 0.60 0.01 0.00

Those are basically the accepted linear weights in the wRAA row, but DRAA seems to have some confusion around the doubles.  In the first iteration, doubles came out worth fewer runs than singles, and in the new iteration, triples come out worth fewer runs than doubles.  Pepsi might be ok, but that’s not.
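The “best linear fit” step is ordinary least squares of run totals on event counts.  Here’s a self-contained sketch of the idea, not the code I actually ran: it builds fake seasons from the wRAA row above with randomly generated event counts (the count ranges are arbitrary), then checks that the fit recovers the weights.  A real run would use actual player-season counts and DRAA totals.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, targets):
    """Fit targets ~ rows @ weights via the normal equations (no intercept)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic check: columns are 1b 2b 3b hr bb hbp k bip-out.
true_weights = [0.70, 1.00, 1.27, 1.53, 0.54, 0.60, 0.01, 0.00]
rng = random.Random(1)
seasons = [[rng.randint(0, hi) for hi in (120, 40, 8, 35, 80, 10, 150, 300)]
           for _ in range(200)]
runs = [sum(w * c for w, c in zip(true_weights, s)) for s in seasons]
fitted = least_squares(seasons, runs)
print([round(w, 2) for w in fitted])
```

With a metric that really is linear in the events, the weights come back cleanly.  When the recovered weights have doubles below singles, that’s telling you something about the metric, not about the regression.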

If we force the 1b/2b/3b ratio to conform to the wRAA ratios and regress again (on 6 free variables instead of 8), then we get something else interesting.

relative to average PA 1b 2b 3b hr bb hbp k bip out
old DRAA 0.22 0.38 0.52 1.16 0.28 0.24 -0.24 -0.13
new DRAA 0.26 0.45 0.62 1.17 0.26 0.30 -0.24 -0.15
wRAA 0.44 0.74 1.01 1.27 0.27 0.33 -0.26 -0.27

Old DRAA was made up of about 90% of TTO runs and 50% of BIP runs, and that changed to about 90% of TTO runs and 60% of BIP runs in the new iteration.  So it’s like the component wOBA breakdown Tango was doing recently, except regressing the TTO component 10% and the BIP part 40% (down from 50%).
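Those percentages are just the DRAA weights divided by the wRAA weights, event by event.  A quick sketch using the table values as given (two-decimal rounding, so the ratios are approximate):

```python
events   = ["1b", "2b", "3b", "hr", "bb", "hbp", "k", "bip_out"]
old_draa = [0.22, 0.38, 0.52, 1.16, 0.28, 0.24, -0.24, -0.13]
new_draa = [0.26, 0.45, 0.62, 1.17, 0.26, 0.30, -0.24, -0.15]
wraa     = [0.44, 0.74, 1.01, 1.27, 0.27, 0.33, -0.26, -0.27]

def share_of_wraa(draa):
    """Fraction of the wRAA weight that DRAA assigns to each event."""
    return {e: d / w for e, d, w in zip(events, draa, wraa)}

old_share = share_of_wraa(old_draa)
new_share = share_of_wraa(new_draa)
for e in events:
    print(f"{e:8s} old {old_share[e]:5.0%}   new {new_share[e]:5.0%}")
```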

I also noticed that there was something strange about the total DRAA itself.  In theory, the aggregate runs above average should be 0 each year, but the new version of DRAA managed to uncenter itself by a couple of percent (that’s about -2% of total runs scored each season)

year old DRAA new DRAA
2010 210.8 -559.1
2011 127.9 -550
2012 226.8 -735.9
2013 190.4 -447.5
2014 33.7 -659.9
2015 60.1 -89.1
2016 63.3 -401.2
2017 -37.8 -318.3
2018 -50.2 -240.4

Breaking that down into full-time players (400+ PA), part-time position players (<400 PA), and pitchers, we get

2010-18 runs old DRAA new DRAA WRAA
Full-time 13912 11223 15296
part-time -6033 -7850 -9202
pitchers -7054 -7369 -6730
total 825 -3996 -636

I don’t know why it decided players suddenly deserved 4800 fewer runs, but here we are, and it took 520 offensive BWARP (10% of their total) away from the batters in this iteration too, so it didn’t recalibrate at that step either.  This isn’t an intentional change in replacement level or anything like that. It’s just the machine going haywire again without sufficient internal or external quality control.
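
The totals in the breakdown are easy to re-add as a sanity check.  This just sums the table’s own rows and takes the old-to-new difference; the dictionary layout is only for illustration.

```python
# Rows from the 2010-18 breakdown table: (old DRAA, new DRAA, wRAA).
breakdown = {
    "full-time": (13912, 11223, 15296),
    "part-time": (-6033, -7850, -9202),
    "pitchers": (-7054, -7369, -6730),
}

# Transpose the rows into columns and sum each column.
old_total, new_total, wraa_total = (sum(col) for col in zip(*breakdown.values()))
shift = old_total - new_total  # runs removed going from old DRAA to new DRAA
print(old_total, new_total, wraa_total, shift)
```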

 

2/05/19 DRC+ update- some partial fixes, some new problems

BP released an update to DRC+ yesterday purporting to fix/improve several issues that have been raised on this blog.  One thing didn’t change at all though- DRC+ still isn’t a hitting metric.  It still assigns pitchers artificially low values no matter how well they hit, and the areas of superior projection (where actually true) are largely driven by this.  The update claimed two real areas of improvement.

Valuation

The first is in treating outlier players.  As discussed in “C’mon Man- Baseball Prospectus DRC+ Edition”, by treating player seasons individually and regressing them, instead of treating careers, DRC+ will continually fail to realize that outliers are really outliers. Their fix is, roughly, to make a prior distribution based on all player performances in surrounding years, and hopefully not regress the outliers as much because it realizes something like them might actually exist.  That mitigates the problem a little, sometimes, but it’s still an essentially random fix.  Some cases previously mentioned look better, and others, like Don Kessinger vs. Larry Bowa, still don’t make any sense at all.  They’re very similar offensive players, in the same league, overlapping in most of their careers, and yet Kessinger gets bumped from a 72 wRC+ to an 80 DRC+ while Bowa only goes from 70 to 72, even though Kessinger was *more* TTO-based.

To their credit- or at least to credit their self-awareness, they seem to know that their metric is not reliable at its core for valuation.  Jonathan Judge says

“As always, you should remember that, over the course of a career, a player’s raw stats—even for something like batting average—tend to be much more informative than they are for individual seasons. If a hitter consistently seems to exceed what DRC+ expects for them, at some point, you should feel free to prefer, or at least further account for, the different raw results.”

Roughly translated, “Regressed 1-year performance is a better estimation of talent than 1-year raw performance, but ignoring the rest of a player’s career and re-estimating talent 1 year at a time can cause discrepancies, and if it does, trust the career numbers more.” I have no argument with that.  The question remains how BP will actually use the stat- if we get more fluff pieces on DRC+ outliers who are obviously just the kind of career discrepancies Judge and I talked about, that’s bad.  If it is mainly used to de-luck balls in play for players who haven’t demonstrated that they deserve much outlier consideration, that’s basically fine and definitely not the dumbest thing I’ve seen lately.

 

This, on the other hand, well might be.

NAME YEAR PA BB DRC+ DRC+ SD DRAA
Mark Melancon 2011 1 1 -3 2 -0.1
Dan Runzler 2011 1 1 -17 2 -0.1
Matt Guerrier 2011 1 1 -13 2 -0.1
Santiago Casilla 2011 1 1 -12 2 -0.1
Josh Stinson 2011 1 1 -15 2 -0.1
Jose Veras 2011 1 1 -14 2 -0.1
Javy Guerra 2011 1 1 -15 2 -0.1
Joey Gathright 2011 1 1 81 1 0

Not just the blatant cheating (Gathright is the only position player on the list), but the DRC+ SDs make no sense.  Based on one identical PA, DRC+ claims that there’s a 1 in hundreds of thousands chance that Runzler is a better hitter than Melancon and also assigns negative runs to a walk because a pitcher drew it.  The DRC+ SDs were pure nonsense before, but now they’re a new kind of nonsense. These players ranged from 9-31 SD in the previous iteration of DRC+, and while the low end of that was still certainly too low, SDs of 1-2 are beyond absurd, and the fact that they’re that low *only for players with almost no PAs* is a huge red flag that something inside the black box is terribly wrong.  Tango recently explored the SD of wRC+/WAR and found that the SDs should be similar for most players with the same number of PA.  DRC+ SDs done correctly could legitimately show up as slightly lower, because they’re the SD of a regressed stat, but that’s with an emphasis on slightly.  Not SDs of 1 or 2 for anybody, and not lower SDs for pitchers and part-time players who aren’t close to a season full of PAs.

Park Adjustments

I’d observed before that DRC+ still contains a lot of park factor, and they’ve taken steps to address this.  They adjusted Colorado hitters more in this iteration while saying there wasn’t anything wrong with their previous park factors.  I’m not sure exactly how that makes sense, unless they just weren’t correcting for park factor before, but they claim to be park-isolated now and show a regression against their park factors to prove it.  Of course the key word in that claim is THEIR park factors.  I reran the numbers from the linked post with the new DRC+s, and while they have made an improvement, they’re still correlated to both FanGraphs park factor and my surrounding-years park factor estimate at the r=0.17-0.18 level, with all that entails (still overrating Rockies hitters, for one, just not by as much).

 

DRC+ and Team Wins

A reader saw a television piece on DRC+, googled and found this site, and asked me a simple question: how does a DRC+ value correlate to a win? I answered that privately, but it occurred to me that team W-L record was a simple way to test DRC+’s claim of superior descriptiveness without having to rely on its false claim of being park-adjusted.

I used seasons from 2010-2018, with all stats below adjusted for year and league- i.e. the 2018 Braves are compared to the 2018 NL average.  Calculations were done with runs/game and win% since not all seasons were 162 games.

Team metric r^2 to team winning %
Run Differential 0.88
wRC+ 0.47
Runs Scored 0.43
OBP 0.38
wOBA 0.37
OPS 0.36
DRC+ 0.35
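
The comparison behind this table boils down to two steps: scale each team’s metric by its league-year average, then take the squared Pearson correlation with winning percentage.  A minimal sketch with made-up teams follows; dividing by the league-year mean is an assumed form of the adjustment, and the field names are hypothetical.

```python
from collections import defaultdict


def league_year_adjust(rows, field):
    """Divide each team's value by the mean of its (year, league) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["league"])].append(row[field])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return [row[field] / means[(row["year"], row["league"])] for row in rows]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)


# Made-up team-seasons: runs per game and winning percentage.
teams = [
    {"year": 2018, "league": "NL", "rpg": 4.7, "win_pct": 0.556},
    {"year": 2018, "league": "NL", "rpg": 4.2, "win_pct": 0.475},
    {"year": 2018, "league": "NL", "rpg": 4.4, "win_pct": 0.506},
    {"year": 2018, "league": "AL", "rpg": 5.3, "win_pct": 0.617},
    {"year": 2018, "league": "AL", "rpg": 4.1, "win_pct": 0.401},
    {"year": 2018, "league": "AL", "rpg": 4.6, "win_pct": 0.500},
]
adj_rpg = league_year_adjust(teams, "rpg")
win_pct = [t["win_pct"] for t in teams]
print(round(r_squared(adj_rpg, win_pct), 3))
```

Swapping `"rpg"` for any other team-level field (OBP, wOBA, a plus-stat) gives the corresponding row of the comparison.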

Run differential is cheating of course, since it’s the only one on the list that knows about runs allowed, but it does show that at the seasonal level, scoring runs and not allowing them is the overwhelming driver of W-L record and that properly matching RS to RA- i.e. not losing 5 1-run games and winning a 5-run game to “balance out”- is a distant second.

Good offense is based on three major things- being good, sequencing well, and playing in a friendly park.  Only the first two help you to outscore your opponent who’s playing the game in the same park, and Runs Scored can’t tell the difference between a good offense and a friendly park.  As it turns out, properly removing park factor noise (wRC+) is more important than capturing sequencing (Runs Scored).

Both clearly beat wOBA, as expected, because wRC+ is basically wOBA without park factor noise, and Runs Scored is basically wOBA with sequencing added.  OBP beating wOBA is kind of an accident- wOBA *differential* would beat OBP *differential*- but because park factor is more prevalent in SLG than OBP, offensive wOBA is more polluted by park noise and comes out slightly worse.

And then there’s DRC+.  Not only does it not know sequencing, it doesn’t even know what component events (BB, 1B, HR, etc) actually happened, and the 25% or so of park factor that it does neutralize is not enough to make up for that.  It’s not a good showing for the fancy new most descriptive metric ever when it’s literally more valuable to know a team’s OBP than its DRC+ to predict its W-L record, especially when wRC+ crushes the competition at the same task.

 

DRC+ isn’t even a hitting metric

At least not as the term is used in baseball.  Hitting metrics can adjust for nothing (box score stats, AVG, OBP, etc), league and park (OPS+, wRC+, etc), or more detailed conditions (opposing pitcher and defense, umpire, color of the uniforms, proximity of Snoop Dogg, whatever).  They don’t adjust for the position played.  Hitting is hitting, regardless of who does it.  Unless it’s not.  While fooling around some more with the data for “DRC+ really isn’t any good at predicting next year’s wOBA for team switchers” and “The DRC+ team-switcher claim is utter statistical malpractice”, it looked for all the world like DRC+ had to be cheating, and it is.

To prove that, I looked at seasons with exactly 1 PA and 1 unintentional walk for the entire season, and the DRC+ for those seasons.

NAME YEAR TEAM DRC+ DRC+ SD
Audry Perez 2014 Cardinals 104 20
Spencer Kieboom 2016 Nationals 96 29
John Hester 2013 Angels 93 16
Joey Gathright 2011 Red Sox 89 24
J.c. Boscan 2010 Braves 78 25
Mark Melancon 2011 Astros 15 14
George Sherrill 2010 Dodgers 4 23
Antonio Bastardo 2014 Phillies 3 22
Dan Runzler 2011 Giants 2 19
Jose Veras 2011 Pirates 1 15
Matt Reynolds 2010 Rockies 1 12
Tony Cingrani 2016 Reds 0 25
Antonio Bastardo 2017 Pirates -1 17
Javy Guerra 2011 Dodgers -2 31
Josh Stinson 2011 Mets -10 11
Aaron Thompson 2011 Pirates -12 14
Brandon League 2013 Dodgers -13 17
J.j. Hoover 2014 Reds -14 32
Santiago Casilla 2011 Giants -15 12
Jason Garcia 2015 Orioles -16 12
Chris Capuano 2016 Brewers -17 17
Edubray Ramos 2016 Phillies -19 15
Matt Guerrier 2011 Dodgers -22 9
Liam Hendriks 2015 Blue Jays -24 15
Phillippe Aumont 2015 Phillies -28 20
Randy Choate 2015 Cardinals -28 52
Joe Blanton 2017 Nationals -30 12
Jacob Barnes 2017 Brewers -31 26
Sean Burnett 2012 Nationals -33 20
Robert Carson 2013 Mets -43 7

That’s a pretty good spread.  The top 5 are position players, the rest are pitchers.  DRC+ is blatantly cheating by assigning pitchers very low DRC+ values even when their offensive performance is good and not doing the same for 1-PA position players.  wOBA and wRC+ don’t do this, as evidenced by Kieboom (#5) right there with 3 pitchers with the same seasonal stat line.  It’s also not using data from prior seasons because that was Kieboom’s only career PA to date, and when Livan Hernandez debuted in 1996 for one game with 1 PA and 1 single, he got a DRC+ of -14 for his efforts.  It’s just cheating, period.  And it doesn’t learn either.  Even when Bumgarner was hitting in 2014-2017, his DRC+s were -15, 4, -17, and -19.

I also included the DRC+ SDs here just to show that they’re complete nonsense.  Pitcher Mark Melancon (15 +/- 14) has one career PA. Pitcher Robert Carson (-43 +/- 7) also has one career PA. Pitcher Randy Choate (-28 +/- 52) had one PA that year and 5 a decade earlier.  What in the actual fuck?

The entire DRC+ project is a complete farce at this point.  The outputs are a joke.***  The SD values are nonsense (table above). The pillars it stands on are complete bullshit.  It’s more descriptive of the current season than park adjusted stats because it’s not anywhere near a park-adjusted stat, even though it claims to be.  It’s more predictive than park-adjusted stats for next year’s team because it’s somewhat regressed, meaning it basically can’t lose, and it’s also cheating the same way descriptiveness does by keeping a bunch of park factor.  Its claimed “substantial improvement over predicting wOBA for team switchers” is statistical malpractice to begin with, and now we see that the one area where it did predict significantly better than regressed wOBA, very-low-PA players, is driven by (almost) ignoring actual results for pitchers and saying they sucked at the plate no matter how well they really hit (and treating low-PA position players with the exact same stat lines as average-ish).

***Check out DRA- land where Billy Wagner is 26 percent more valuable on a per-inning basis than Mariano Rivera and almost as valuable for his career.  I love Billy Wagner, but still, come on.

RIP 12/29/2018.  Comment F to pay respects.

 

DRC+ still contains a lot of park factor

Required knowledge: DRC+ and park factors

TL;DR read the title above, the rant 3 paragraphs down, and the very bottom

DRC+ is supposed to be a fully park-adjusted metric, but from the initial article, I couldn’t understand how that could be consistent with the reported results without either an exceptional amount of overfitting or extremely good luck.  Team DRC+ was reported to be more reliable than team wRC+ at describing the SAME SEASON’s team runs/PA.  Since wRC+ is based off of wOBA, team wOBA basically is team scoring offense (r=0.94), and DRC+ regresses certain components of wOBA back towards the mean quite significantly (which is why DRC+ is structurally unfit for use in WAR), it made no sense to me that a metric that took away actual hits that created actual runs from teams with good BABIPs and invented hits in place of actual outs for teams with bad BABIPs could possibly correlate better to actual runs scored than a metric that used what happened on the field.  It’s not quite logically impossible for that to be true, but it’s pretty damn close.

It turns out the simple explanation for how a park-adjusted significantly regressed metric beat a park-adjusted unregressed metric is the correct one.  It didn’t. DRC+ keeps in a bunch of park factor and calls itself a park-adjusted metric when it’s simply not one, and not even close to one.  The park factor table near the bottom of the DRC+ article should have given anybody who knows anything about baseball serious pause, and of course it fits right in with DRC+’s “great descriptiveness”.

RANT

How in the hell does a park factor of 104 for Coors get published without explanation by any person or institution trying to be serious?  The observed park factors (halved) the last few years, in reverse order: 114 (2018), 115, 116, 117, 120, 109, 123… You can’t throw out a number like Coors 104 like it’s nothing.  If Jonathan Judge could actually justify it somehow- maybe last year we got a fantastic confluence of garbage pitchers and great situational hitting at Coors and the reverse on the road while still somehow only putting up a 114, where you could at least handwave an attempt at a justification, then he should have made that case when he was asked about it, but instead he gave an answer indicative of never having taken a serious look at it.  Spitting out a 104 for Coors should have been like a tornado siren going off in his ear to do basic quality control checks on park effects for the entire model, but it evidently wasn’t, so here I am doing it instead.

/RANT

The basic questions are “how correlated is team DRC+ to home park factor?” and “how correlated should team DRC+ be to home park factor?”.  The naive answer to the second question is “not correlated at all since it’s park adjusted, duh”, but it’s possible that the talented hitters skew towards hitters’ parks, which would cause a legitimate positive correlation, or that they skew towards pitchers’ parks, which would cause a legitimate negative correlation.  As it turns out, over the 2003-2017 timeframe, hitting talent doesn’t skew at all, but that’s an assertion that has to be demonstrated instead of just assumed true, so let’s get to it.

We need a way to make (offensive talent, home park factor) team-season pairs that can measure both components separately without being causally correlated to each other.  Seasonal team road wOBA is a basically unbiased way to measure offensive quality independent of home park factor because the opposing parks played in have to average out pretty similarly for every team in the same league (AL/NL)**.  If we use that, then we need a way to make a park factor for those seasons that can’t include that year’s data, because everything else being equal, an increase in a team’s road wOBA would decrease its home park factor****, and we’re explicitly trying to avoid nonsense like that.  Using the observed park factors from *surrounding years*, not the current year, to estimate the current year’s park factor solves that problem, assuming those estimates don’t suck.

** there’s a tiny bias from not playing road games in a stadium with your park factor, but correcting that by adding a hypothetical 5 road games at estimated home park factor doesn’t change conclusions.

**** some increase will be skill that will, on average, increase home wOBA as well and mostly cancel out, and some increase will be luck that won’t cancel out and would screw the analysis up

Methodology

I used all eligible team-batting-seasons, pitchers included, from 2003-2017.  To estimate park factors, I used the surrounding 2 years (T-2, T-1, T+1, T+2) of observed park factors (for runs) if they were available, the surrounding 1 year (T-1, T+1) otherwise, and threw out the season if I didn’t have those.  That means I threw out all 2018s as well as the first and last years in each park.  I ignored other changes (moved fences, etc).
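
The estimation rule above can be sketched as a small function.  It assumes a simple unweighted mean of the surrounding seasons (the weighting is an assumption for illustration), and the example reuses the halved Coors figures quoted in the rant as stand-in data; the function name is hypothetical.

```python
def estimate_park_factor(observed, year):
    """Estimate a park's factor for `year` without using that year's data.

    observed maps season -> observed park factor for one park.
    Uses T-2, T-1, T+1, T+2 when all four exist, else T-1 and T+1,
    else returns None (the season is thrown out).
    """
    wide = [year - 2, year - 1, year + 1, year + 2]
    narrow = [year - 1, year + 1]
    for window in (wide, narrow):
        if all(y in observed for y in window):
            return sum(observed[y] for y in window) / len(window)
    return None


# Observed (halved) Coors park factors quoted in the rant, 2012-2018.
coors = {2012: 123, 2013: 109, 2014: 120, 2015: 117, 2016: 116, 2017: 115, 2018: 114}

print(estimate_park_factor(coors, 2015))  # uses 2013, 2014, 2016, 2017
print(estimate_park_factor(coors, 2017))  # falls back to 2016 and 2018
print(estimate_park_factor(coors, 2018))  # no T+1 season, so thrown out
```

The key property is that the season being estimated never appears in its own window, so a lucky or unlucky road wOBA that year can’t leak into the park factor it’s being compared against.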

Because I have no idea what DRC+ is doing with pitcher-batters, how good its AL-NL benchmarking is, and the assumption of nearly equivalent aggregate road parks is only guaranteed to hold between same-league teams, I did the DRC+ analysis separately for AL and NL teams.

To control for changing leaguewide wOBA in the 2003-2017 time period, I used the same wOBA/LgAvgwOBA “wOBA%” method I used in “DRC+ really isn’t any good at predicting next year’s wOBA for team switchers” for wOBA and DRC+, just for AL teams and NL teams separately for the reasons above.  After this step, I did analyses with and without Coors because it’s an extreme outlier.  We already know with near certainty that their treatment of Coors is batshit crazy and keeps way too much park effect in DRC+, so I wanted to see how they did everywhere else.

Results

The park factor estimation worked pretty well.  The 2-surrounding-year PF correlated to the observed PF for the year in question at r=0.54 (0.65 with Coors) and the 1-surrounding-year PF at r=0.52 (0.61 with Coors).  The 5-year FanGraphs PF, WHICH USES THE YEAR IN QUESTION, only correlates at r=0.7 (0.77 with Coors), and the 1 and 2 year park factors correlate to the FanGraphs PF at 0.87 and 0.96 respectively.  This is plenty to work with given the effect sizes later.

Team road wOBA% (squared or linear) correlates to the estimated home park factor at r = -0.03, literally nothing, and with the 5 extra hypothetical games as mentioned in the footnote above, r=0.02, also literally nothing.  It didn’t have to be this way, but it’s convenient that it is.  Just to show that road wOBA isn’t all noise, it correlates to that season’s home wOBA% at r=0.32 (0.35 with the adjustment) even though we’re dealing with half seasons and home wOBA% contains the entire park factor.  Road wOBA% correlates to home wOBA%/sqrt(estimated park factor) at r=0.56 (and wOBA%/park factor at r=0.54).  That’s estimated park factor from surrounding years, not using the home and road wOBA data in question.

Home wOBA% is obviously hugely correlated to estimated park factor (r=0.46 for home wOBA%^2 vs estimated PF), but park adjusting it by correlating

(home wOBA%)^2/estimated park factor TO estimated park factor

has r= -0.00017.  Completely uncorrelated to estimated PF (it’s pure luck that it’s THAT low).

So we’ve established that road wOBA really does contain a lot of information on a team’s offensive talent (that’s a legitimate naive “duh”), that it’s virtually uncorrelated to true home park factor, and that park-adjusted home wOBA% (using PF estimates from other seasons only) is also uncorrelated to true home park factor.  If DRC+ is a correctly park-adjusted metric that measures offensive talent, DRC+% should also have to be virtually uncorrelated to true home park factor.

And… the correlation of DRC+% to estimated park factor is r= 0.38 for AL teams, r=0.29 for NL teams excluding Colorado, r=0.31 including Colorado.  Well then.  That certainly explains how it can be more descriptive than an actually park-adjusted metric.

 

The DRC+ team-switcher claim is utter statistical malpractice

Required knowledge: MUST HAVE READ/SKIMMED “DRC+ really isn’t any good at predicting next year’s wOBA for team switchers”, and a non-technical knowledge of what a correlation coefficient means wouldn’t hurt.

In doing the research for the other post, it was baffling to me what BP could have been doing to come up with the claim that DRC+ was a revolutionary advance for team-switchers.  It became completely obvious that there was nothing particularly meaningful there with respect to switchers and that it would take a totally absurd way of looking at the data to come to a different conclusion.  With that in mind, I clicked some buttons and stumbled into figuring out what they had to be doing wrong.  One would assume that any sophisticated practitioner doing a correlation where some season pairs had 600+ PA each and other season pairs had 5 PA each would weight them differently… and one would be wrong.

I decided to check 4 simple ways of weighting the correlation- unweighted, by year T PA, by year T+1 PA, and by the harmonic mean of year T PA and year T+1 PA.
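
For reference, a weighted Pearson correlation and the four weighting schemes can be sketched as below.  The season pairs are invented to show the mechanics (a few full-time pairs plus a couple of tiny-PA pairs), and the function names are hypothetical; nothing here is BP’s code or the actual data.

```python
from math import sqrt


def weighted_pearson(xs, ys, ws):
    """Pearson correlation of xs and ys with non-negative weights ws."""
    total = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / total
    my = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    vx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    vy = sum(w * (y - my) ** 2 for w, y in zip(ws, ys))
    return cov / sqrt(vx * vy)


def harmonic_mean(a, b):
    """Harmonic mean of two positive PA totals; dominated by the smaller one."""
    return 2 * a * b / (a + b)


# Invented season pairs: (year T stat, year T+1 wOBA%, year T PA, year T+1 PA).
pairs = [
    (1.10, 1.06, 620, 600),
    (0.95, 0.98, 540, 580),
    (1.02, 1.00, 450, 610),
    (0.88, 0.93, 500, 410),
    (0.40, 0.35, 3, 5),  # tiny-sample, pitcher-type pairs
    (0.45, 0.20, 2, 4),
]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
pa_t = [p[2] for p in pairs]
pa_t1 = [p[3] for p in pairs]

weightings = {
    "harmonic": [harmonic_mean(a, b) for a, b in zip(pa_t, pa_t1)],
    "year T PA": pa_t,
    "year T+1 PA": pa_t1,
    "unweighted": [1] * len(pairs),
}
results = {name: weighted_pearson(xs, ys, ws) for name, ws in weightings.items()}
for name, r in results.items():
    print(f"{name:12s} r = {r:.3f}")
```

The unweighted case is just the ordinary Pearson correlation, where a 5-PA season pair counts exactly as much as a pair of 600-PA seasons.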

Table 1.  Correlation coefficients to year T+1 wOBA% by different weighting methods, minimum 400 PAs year T.

400+ PA Harmonic Year T PA Year T+1 PA unweighted N
switch wOBA 0.34 0.35 0.34 0.34 473
switch DRC+ 0.35 0.35 0.34 0.35 473
same wOBA 0.55 0.53 0.55 0.51 1124
same DRC+ 0.57 0.55 0.57 0.54 1124

The way to read this chart is to compare the wOBA and DRC+ correlations for each group of hitters- switch to switch (lines 1 and 2) and same to same (lines 3 and 4).  It’s obvious that wOBA should correlate much better for same than switch because it contains the entire park effect, which is maintained in “same” and lost in “switch”, but DRC+ behaves the same way because DRC+ also contains a lot of park factor even though it shouldn’t.

In the 400+ year T PA group, the choice of weighting method is almost completely irrelevant. DRC+ correlates marginally better across the board and it has nothing to do with switch or stay.  Let’s add group 2 to the mix and see what we get.

Table 2.  Correlation coefficients to year T+1 wOBA% by different weighting methods, minimum 100 PAs year T.

100+ PA Harmonic Year T PA Year T+1 PA unweighted N
switch wOBA 0.31 0.29 0.29 0.26 1100
switch DRC+ 0.33 0.31 0.32 0.29 1100
same wOBA 0.51 0.47 0.50 0.44 2071
same DRC+ 0.54 0.51 0.53 0.47 2071

The values change, but DRC+’s slight correlation lead doesn’t, and again, nothing is special about switchers except that they’re overall less reliable. Some of the gaps widen by a point or two, but there’s no real sign of the impending disaster when the low-PA stuff that favors DRC+ comes in.  But what a disaster there is….

Table 3.  Correlation coefficients to year T+1 wOBA% by different weighting methods, all season pairs.

1+ PA Harmonic Year T PA Year T+1 PA unweighted N
switch wOBA 0.45 0.41 0.38 0.37 1941
switch DRC+ 0.54 0.47 0.58 0.57 1941
same wOBA 0.62 0.58 0.53 0.52 3639
same DRC+ 0.67 0.62 0.66 0.66 3639

The two weightings (Harmonic and Year T) that minimize the weight of low-data garbage projections stay saner, and the two methods that don’t (year T+1 and unweighted) go bonkers and diverge by around what BP reports.  If I had to guess, I have more pitchers in my sample for a slightly bigger effect, and regressed DRC+% correlates a bit better.  And to repeat yet again, the effect has nothing to do with stay/switch.  It’s entirely a mirage based on flooding the sample with bunches of low-data garbage projections based on handfuls of PAs and weighting them equally to pairs of qualified seasons.

You might be thinking that that sounds crazy and wondering why I’m confident that’s what really happened.  Well, as it turns out- and I didn’t realize this until after the analysis- they actually freaking told us that’s what they did.  The caption for the chart is “Table 3: Reliability of Team-Switchers, Year 1 to Year 2 wOBA (2010-2018); Normal Pearson Correlations”.  Normal Pearson correlations are unweighted. Mystery confirmed solved.