Missing the forest for.. the forest

The paper *A Random Forest approach to identify metrics that best predict match outcome and player ranking in the esport Rocket League* got published yesterday (9/29/2021), and for a Cliff’s Notes version, it did two things:  1) Looked at 1-game statistics to predict that game’s winner and/or goal differential, and 2) Looked at 1-game statistics across several rank (MMR/ELO) stratifications to attempt to classify players into the correct rank based on those stats.  The overarching theme of the paper was to identify specific areas that players could focus their training on to improve results.

For part 1, that largely involves finding “winner things” and “loser things” and the implicit assumption that choosing to do more winner things and fewer loser things will increase performance.  That runs into the giant “correlation isn’t causation” issue.  While the specific Rocket League details aren’t important, this kind of analysis will identify second-half QB kneeldowns as a huge winner move and having an empty net with a minute left in an NHL game as a huge loser move.  Treating these as strategic directives- having your QB kneel more or refusing to pull your goalie ever- would be actively terrible and harm your chances of winning.

Those examples are so obviously ridiculous that nobody would ever take them seriously, but when the metrics don’t capture losing endgames as precisely, they can be even *more* dangerous, telling a story that’s incorrect for the same fundamental reason, but one that’s plausible enough to be believed.  A common example is outrushing your opponent in the NFL being correlated to winning.  We’ve seen Derrick Henry or Marshawn Lynch completely dump truck opposing defenses, and when somebody talks about outrushing leading to wins, it’s easy to think of instances like that and agree.  In reality, leading teams run more and trailing teams run less, and the “signal” is much, much more from capturing leading/trailing behavior than from Marshawn going full beast mode sometimes.

If you don’t apply subject-matter knowledge to your data exploration, you’ll effectively ask bad questions that get answered by “what a losing game looks like” and not “what (actionable) choices led to losing”.  That’s all well-known, if worth restating occasionally.

The more interesting part begins with the second objective.  While the particular skills don’t matter, trust me that the difference in car control between top players and Diamond-ranked players is on the order of watching Simone Biles do a floor routine and watching me trip over my cat.  Both involve tumbling, and that’s about where the similarity ends.

The paper identifies various mechanics and classifies rank pretty well based on them.  What’s interesting is that while they can use those mechanics to tell a Diamond from a Bronze, when they tried to use those mechanics to predict the outcome of a game, they all graded out as basically worthless.  While some may have suffered from adverse selection (something you do less when you’re winning), they had a pretty good selection of mechanics and they ALL sucked at predicting the winner.  And, yet, beyond absolutely any doubt, the higher rank stratifications are much better at them than the lower-rank ones.  WTF? How can that be?

The answer is in a sample constructed in a particularly pathological way, and it’s one that will be common among esports data sets for the foreseeable future.  All of the matches are contested between players of approximately equal overall skill.  The sample contains no games of Diamonds stomping Bronzes or getting crushed by Grand Champs.

The players in each match have different abilities at each of the mechanics, but the overall package always grades out similarly given that they have close enough MMR to get paired up.  So if Player A is significantly stronger than Player B at mechanic X, to the point you’d expect it to show up, ceteris paribus, as a large winrate effect, A almost tautologically has to be worse at the other aspects- otherwise A would be significantly higher-rated than B and the pairing algorithm would have excluded that match from the sample.  So the analysis comes to the conclusion that being better at mechanic X doesn’t predict winning a game.  If the sample contained comparable numbers of cross-rank matches, all of the important mechanics would obviously be huge predictors of game winner/loser.
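A quick simulation makes the selection effect concrete.  This is a toy sketch (made-up skill model, not the paper’s data): every player gets two independent skill components, MMR is just their sum, and we check how often the player with the better “mechanic” wins under rank-matched pairing versus random cross-rank pairing.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # players

# Each player has two independent skill components; "MMR" reflects their sum.
mechanic_x = rng.normal(size=N)    # e.g., aerial car control
other_skill = rng.normal(size=N)   # everything else
overall = mechanic_x + other_skill

def p_better_mechanic_wins(i, j, k=2.0):
    """Play one game per pair; win probability is logistic in the overall-skill gap."""
    p_i_wins = 1 / (1 + np.exp(-k * (overall[i] - overall[j])))
    i_wins = rng.random(len(i)) < p_i_wins
    return np.mean((mechanic_x[i] > mechanic_x[j]) == i_wins)

# Rank-matched sample: sort by overall skill and pair neighbors (a crude matchmaker).
order = np.argsort(overall)
print("rank-matched:", round(p_better_mechanic_wins(order[0::2], order[1::2]), 3))

# Cross-rank sample: pair players completely at random.
shuffled = rng.permutation(N)
print("cross-rank:  ", round(p_better_mechanic_wins(shuffled[0::2], shuffled[1::2]), 3))

# Typical output: roughly 0.50 rank-matched vs roughly 0.7 cross-rank, even though
# mechanic_x contributes exactly half of overall skill in both samples.
```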

The sample being pathologically constructed led to the profoundly incorrect conclusion

Taken together, higher rank players show better control over the movement of their car and are able to play a greater proportion of their matches at high speed.  However, within rank-matched matches, this does not predict match outcome.  Therefore, our findings suggest that while focussing on game speed and car movement may not provide immediate benefit to the outcome within matches, these PIs are important to develop as they may facilitate one’s improvement in overall expertise over time.

even though adding or subtracting a particular ability from a player would matter *immediately*.  The idea that you can work on mechanics to improve overall expertise (AKA achieving a significantly higher MMR) WITHOUT IT MANIFESTING IN MATCH RESULTS, WHICH IS WHERE MMR COMES FROM, is.. interesting.  It’s trying to take two obviously true statements (Higher-ranked players play faster and with more control- quantified in the paper. Playing faster and with more control makes you better- self-evident to anybody who knows RL at all) and shoehorn a finding between them that obviously doesn’t comport.

This kind of mistake will occur over and over and over when data sets composed of narrow-band matchmaking are analyzed that way.

(It’s basically the same mistake as thinking that velocity doesn’t matter for mediocre MLB pitchers- it doesn’t correlate to a lower ERA among that group, but any individuals gaining velocity will improve ERA on average)

 

A survey about future behavior is not future behavior

People ~~lie their asses off all the time~~ give incorrect answers to survey questions about future actions.  This is not news.  Any analysis that requires treating such survey results as reality should be careful to validate them in some way first, and when simple validation tests show that the responses are ~~complete bullshit~~ significantly inaccurate in systematically biased ways, well, the rest of the analysis is quite suspect to say the least.

Let’s start with a simple hypothetical.  You see a movie on opening weekend (in a world without COVID concerns).  You like it and recommend it to a friend at work on Monday.  He says he’ll definitely go see it.  6 weeks later (T+6 weeks), the movie has left the theaters and your friend never did see it.  Clearly his initial statement did not reflect his behavior.  Was he lying from the start? Did he change his mind?

Let’s add one more part to the hypothetical.  After 3 weeks (T+3 weeks), you ask him if he’s seen it, and he says no, but he’s definitely going to go see it. Without doing pages of Bayesian analysis to detail every extremely contrived behavior pattern, it’s a safe conclusion under normal conditions (the friend actually does see movies sometimes, etc) that his statement is less credible now than it was 3 weeks ago.   Most of the time he was telling the truth initially, he would have seen the movie by now.  Most of the time he was lying, he would not have seen the movie by now.  So compared to the initial statement three weeks ago, the new one is weighted much more toward lies.

There’s also another category of behavior- he was actually lying before but has changed his mind and is now telling you the truth, or he was telling you the truth before, but changed his mind and is lying now.  If you somehow knew with certainty (at T+3 weeks) that he had changed his mind one way or the other, you probably wouldn’t have great confidence right now in which statement was true and which one wasn’t.

But once another 3 weeks pass without him seeing the movie, by the same reasoning above, it’s now MUCH more likely that he was in the “don’t want to see the movie” state at T+3 weeks, and that he told the truth early and changed his mind against the movie.  So at T+6 weeks, we’re in a situation where the person said the same words at T and T+3, but we know he was 1) much more likely to have been lying about wanting to see the movie at T+3 than at T, and 2) at T+3, much more likely to have changed his mind against seeing the movie than to have changed his mind towards seeing the movie.

Now let’s change the schedule a little.  This example is trivial, but it’s here for completeness and I use the logic later.  Let’s say it’s a film you saw at a festival, in a foreign language, or whatever, and it’s going to get its first wide release in 7 weeks.  You tell your friend about it, he says he’s definitely going to see it.  Same at T+3, same at T+6 (1 week before release).  You haven’t learned much of anything here- he didn’t see the movie, but he *couldn’t have* seen the movie between T and T+6, so the honest responses are all still in the pool.  The effect from before- more lies, and more mind-changing against the action- arises from not doing something *that you could have done*, not just from not doing something.
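To put a little arithmetic behind that intuition, here’s a minimal Bayes sketch with made-up numbers: an honest “I’ll definitely see it” is followed by actually seeing the movie some fraction of the time when that’s possible, and never when it isn’t.

```python
# Back-of-the-envelope Bayes check of the movie story.  All numbers are purely
# illustrative priors/likelihoods, not estimates of anything real.
def p_honest_given_not_seen(p_honest, p_see_if_honest, could_have_seen):
    """P(the 'I'll definitely see it' was honest | he still hasn't seen it)."""
    p_not_seen_if_honest = 1 - (p_see_if_honest if could_have_seen else 0.0)
    p_not_seen_if_lying = 1.0  # someone who never intended to go doesn't go
    num = p_honest * p_not_seen_if_honest
    return num / (num + (1 - p_honest) * p_not_seen_if_lying)

# Movie in wide release: honest intenders usually see it within a few weeks.
print(p_honest_given_not_seen(0.7, 0.8, could_have_seen=True))    # ~0.32 at T+3
print(p_honest_given_not_seen(0.32, 0.8, could_have_seen=True))   # ~0.09 at T+6
# Movie not released yet: not seeing it tells you nothing; the prior is unchanged.
print(p_honest_given_not_seen(0.7, 0.8, could_have_seen=False))   # ~0.70
```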

The data points in both movie cases are exactly the same.  It requires an underlying model of the world to understand that they actually mean vastly different things.

This was all a buildup to the report here https://twitter.com/davidlazer/status/1390768934421516298 claiming that the J&J pause had no effect on vaccine attitudes.  This group has done several surveys of vaccine attitudes, and I want to start with Report #43 which is a survey done in March, and focus on the state-by-state data at the end.

We know what percentage of eligible recipients had been vaccinated (in this post, vaccinated always means having received at least one dose of anything) in each state as of when I pulled CDC data on 5/9, so we can compare the survey results to that number.  First off, everybody (effectively every marginal person) in the ASAP category could have gotten a vaccine by now, and effectively everybody in the “after some people I know” category has seen “some people they know” get the vaccine.  The sum of vaccinated + ASAP + After Some comes out, on average, 3.4% above the actual vaccinated rate.  That, by itself, isn’t a huge deal.  It’s slight evidence of overpromising, but not “this is total bullshit and can’t be trusted” level.  The residuals, on the other hand..

Excess vaccinations = % of adults vaccinated – (Survey Vaxxed% + ASAP% + After some%)

State Excess Vaccinations
MS -17.5
LA -15.7
SC -15.3
AL -12.9
TN -12.5
UT -10.5
IN -10.2
WV -10.1
MO -9.5
AK -8.7
TX -8.5
NC -7.9
DE -7.8
WA -7.6
NV -7.4
AZ -7.3
ID -7.2
ND -7.2
GA -7
MT -6.2
OH -5.1
KY -5
CO -4
AR -3.9
FL -3.7
IA -3.5
SD -3.4
MI -3.3
NY -2.7
KS -2.2
CA -2.1
NJ -1.8
VA -1.8
WI -1.7
OR -1.5
WY -1.2
HI -0.7
NE -0.4
OK 1.3
PA 2.3
IL 2.5
CT 3.4
RI 3.9
MN 4.1
MD 5.5
VT 7.8
ME 10.2
NH 10.7
NM 11.1
MA 14.2

This is practically a red-state blue-state list.  Red states systematically undershot their survey, and blue states systematically overshot their survey.  In fact, correlating excess vaccination% to Biden%-Trump% gives an R^2 of 0.25, while a linear regression of the survey results against the CDC vaccination %s on 5/9 only manages an R^2 of 0.39.  Partisan response bias is a big thing here, and the big takeaway is that answer groups are not close to homogeneous- they can’t properly be modeled as being composed of identical entities.  There are plenty of liars/mind-changers in the survey pool, many more than could be detected by just looking at one top-line national number.
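For concreteness, the residual in the table is just the actual first-dose rate minus the survey’s three favorable answers; a sketch of that calculation (with hypothetical file and column names standing in for the report and CDC data) looks like this:

```python
import pandas as pd

# Hypothetical inputs: survey shares by state, CDC first-dose % as of 5/9, 2020 margins.
survey = pd.read_csv("survey_report_43.csv")   # state, vaxxed, asap, after_some (percent)
cdc = pd.read_csv("cdc_first_dose_0509.csv")   # state, pct_adults_one_dose
lean = pd.read_csv("election_2020.csv")        # state, biden_pct, trump_pct

df = survey.merge(cdc, on="state").merge(lean, on="state")

# Excess vaccinations = actual vaccinated % minus everyone who said they were already
# vaccinated, would get it ASAP, or would get it after some people they know.
df["excess"] = df["pct_adults_one_dose"] - (df["vaxxed"] + df["asap"] + df["after_some"])
df["margin"] = df["biden_pct"] - df["trump_pct"]

print(df.sort_values("excess")[["state", "excess"]])
print("R^2 vs partisan lean:", round(df["excess"].corr(df["margin"]) ** 2, 2))
```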

Respondents in blue states who answered ASAP or After Some are MUCH more likely to have been vaccinated by now, in reality, than respondents in red states who answered the same way (ASAP:After Some ratio was similar in both red and blue states).  This makes the data a pain to work with, but this heterogeneity also means that the MEANING of the response changes, significantly, over time.

In March, the ASAP and After Some groups were composed of people who were telling the truth and people who were lying.  As vaccine opportunities rolled out everywhere, the people who were telling the truth and didn’t change their mind (effectively) all got vaccinated by now, and the liars and mind-changers mostly did not. By state average, 46% of people answered ASAP or After Some, and 43% got vaccinated between the survey and May 8 (including some from the After Most or No groups of course).  I can’t quantify exactly how many of the ASAP and After Some answer groups got vaccinated (in aggregate), but it’s hard to believe it’s under 80% and could well be 85%.

That means most people in the two favorable groups were telling the truth, but there were still plenty of liars as well, so that group has morphed from a mostly honest group in March to a strongly dishonest group now.  The response stayed the same- the meaning of the response is dramatically different now.  People who said it in March mostly got vaccinated quickly.  Now, not so much.

This is readily apparent in Figure 1 in Report 48.  That is a survey done throughout April, split into before, during, and after the J&J pause.  Their top-line numbers for people already vaccinated were reasonably in line with CDC figures, so their sampling probably isn’t total crap.   But if you add the numbers up, Vaxxed + ASAP + After Some is 70%, and the actual vaccination rate as of 5/8 was only 58%.  About 16% of the US got vaccinated between the median date of the pre-pause survey and 5/8, and from other data in the report, 1-2% of that was likely to be from the After Most or No groups, so 14-15% got vaccinated from the ASAP/After Some groups, and that group comprised 27% of the population.  That’s a 50-55% conversion rate (down from an 80+% conversion rate in March), and every state had been fully open for at least 3 weeks.  Effectively any person capable of filling out their online survey who made any effort at all could have gotten a shot by now, meaning that aggregate group was now up to almost half liars.

During the pause, about 8% got vaccinated between the midpoint and 5/8, ~1% of that from the After Most and No groups, so 7% vaccinated from 21% of the population, meaning that aggregate group was now 2/3 liars.  And after the pause, 4% vaccinated, maybe 0.5% from After Most and No, and 18% of the population, so the aggregate group is now ~80% liars.  The same responses went from 80% honest (really were going to get vaccinated soon) in March to 80% dishonest (not so much) in late April.
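Spelled out as arithmetic, using the group sizes and vaccination shares quoted above (lightly rounded):

```python
# Conversion rate = share of the population vaccinated out of the ASAP/After Some
# groups (after netting out After Most/No) divided by the size of those groups.
periods = {
    # period: (ASAP + After Some share of population, share of the population
    #          vaccinated from those groups between the survey midpoint and 5/8)
    "pre-pause (early April)": (0.27, 0.145),
    "during the pause":        (0.21, 0.07),
    "post-pause (late April)": (0.18, 0.035),
}

for name, (group_share, vaxxed_from_group) in periods.items():
    print(f"{name}: {vaxxed_from_group / group_share:.0%} of the favorable group got a shot")
# Roughly 54%, 33%, and 19%- the same answers became steadily less likely to be
# followed by action even as availability kept improving.
```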

The crosstabs in Figure 2 (still Report 48) also bear this out.  In the ASAP group, 38% really did get vaccinated ASAP, 16% admitted shifting to a more hostile posture, and 46% still answered ASAP, except we know from Figure 1 that means ~16% were ASAP-honest and ~30% were ASAP-lying (maybe a tad more honest here and a tad less honest in the After Some group below, but in aggregate, it doesn’t matter).

In the After Some group, 23% got vaccinated and 35% shifted to actively hostile.  9% upgraded to ASAP, and 33% still answered After Some, which is more dishonest than it was initially.  This is a clear downgrade in sentiment even if you didn’t acknowledge the increased dishonesty, and an even bigger one if you do.

If you just look at the raw numbers and sum the top 2 or 3 groups, you don’t see any change, and the hypothesis of J&J not causing an attitude change looks plausible.  Except we know, by the same logic as the movie example above, the same words *coupled with a lack of future action* – not getting vaccinated between the survey and 5/8- discount the meaning of the responses.

Furthermore, we know from Report #48, Figure 2 that plenty of people self-reported mind-changes, and we have a pool that contained a large number of people who lied at least once (because we’re way, way below the implied 70% vax rate, and also the red-state/blue-state thing), so it would be astronomically unlikely for the “changed mind and then lied” group to be a tiny fraction of the favorable responses after the pause.  These charts show the same baseline favorability (vaxxed + ASAP + After Some), but the correct conclusion- because of lack of future action- is that this most likely reflected a good chunk of mind-changing against the vaccine and lying about it, coupled with people who were lying all along, and that “effectively no change in sentiment” is a gigantic underdog to be the true story.

If you attempt to make a model of the world and see whether survey responses, actual vaccination data, actual vaccine availability over time, and the hypothesis of J&J not causing an attitude change fit into a coherent model, it simply doesn’t work at all.  The alternative model- selection bias turning the ASAP and After Some groups increasingly and extremely dishonest (as the honest got vaccinated and left the response group while the liars remained)- fits the world perfectly.

The ASAP and After Some groups were mostly honest when it was legitimately possible for a group of that size to not have vaccine access yet (not yet eligible in their state or very recently eligible and appointments full, like when the movie hadn’t been released yet), and they transitioned to dishonest as reality moved towards it being complete nonsense for a group of that size to not have vaccine access yet (everybody open for 3 weeks or more).


P.S. There’s another line of evidence that has nothing to do with data in the report that strongly suggests that attitudes really did change.  First of all, comparing the final pre-pause week (4/6-4/12) to the first post-resumption week (4/24-4/30, or 4/26-5/2 if you want lag time), vaccination rate was down 33.5% (35.5%) post-resumption and was down in all 50 individual states.  J&J came back and everybody everywhere was still in the toilet.  Disambiguating an exhaustion of willing recipients from a decrease in willingness is impossible using just US aggregate numbers, but grouping states by when they fully opened to 16+/18+ gives a clearer picture.

Group 1 is 17 states that opened in March.  Compared to the week before, these states were +8.6% in the first open week and +12.9% in the second.  This all finished before (or the exact day of) the pause.

Group 2 is 14 states that opened 4/5 or 4/6.  Their first open week was pre-pause and +11%, and their second week was during the pause and -12%.  That could have been supply disruption or attitude change, and there’s no way to tell from just the top-line number.

Group 3 is 8 states that opened 4/18 or 4/19.  Their prior week was mostly paused, their first week open was mostly paused, and their second week was fully post-resumption.  Their opening week was flat, and their second open week was *-16%* despite J&J returning.

We would have expected a week-1 bump and a week-2 bump.  It’s possible that the lack of a week-1 bump was the result of running at full mRNA throughput capacity both weeks (they may have even had enough demand left from prior eligibility groups that they wouldn’t have opened 4/19 without Biden’s decree, and there were no signs of flagging demand before the pause), but if that were true, a -16% change the week after, with J&J back, is utterly incomprehensible without a giant attitude change (or some kind of additional throughput capacity disruption that didn’t actually happen).

The “exhaustion of the willing” explanation was definitely true in plenty of red states where vaccination rates were clearly going down before the pause even happened, but it doesn’t fit the data from late-opening states at all.  They make absolutely no sense without a significant change in actual demand.

 

Nate Silver vs AnEpidemiolgst

This beef started with this tweet https://twitter.com/AnEpidemiolgst/status/1258433065933824008

which is just something else for multiple reasons.  Tone policing a neologism is just stupid, especially when it’s basically accurate.  Doing so without providing a preferred term is even worse.  But, you know, I’m probably not writing a post just because somebody acted like an asshole on twitter.  I’m doing it for far more important reasons, namely:

[image: xkcd “Duty Calls”]

And in this particular case, it’s not Nate.  She also doubles down with https://twitter.com/JDelage/status/1258452085428928515

which is obviously wrong, even for a fuzzy definition of meaningfully, if you stop and think about it.  R0 is a population average.  Some people act like hermits and have little chance of spreading the disease much if they somehow catch it.  Others have far, far more interactions than average and are at risk of being superspreaders if they get an asymptomatic infection (or are symptomatic assholes).  These average out to R0.

Now, when 20% of the population is immune (assuming they develop immunity after infection, blah blah), who is it going to be?  By definition, it’s people who already got infected.  Who got infected?  Obviously, for something like COVID, it’s weighted so that >>20% of potential superspreaders were already infected and <<20% of hermits were infected.  That means that far more than the naive 20% of the interactions infected people have now are going to be with somebody who’s already immune (the exact number depending on the shape and variance of the interaction distribution), and so Rt is going to be much less than (1 – 0.2) * R0 at 20% immune, or in ELI5 language, 20% immune implies a lot more than a 20% decrease in transmission rate for a disease like COVID.
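Here’s a minimal sketch of that argument- a toy contact distribution, not a calibrated epidemic model: let infection risk scale with contact rate, infect 20% of the population that way, and compare the naive 0.8 multiplier to the contact-weighted share of transmission that’s actually left.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300_000

# Heterogeneous contact rates: lots of near-hermits, a tail of potential superspreaders.
contacts = rng.gamma(shape=0.5, scale=2.0, size=N)  # mean ~1, heavy right tail

# Who is already immune once 20% of the population has been infected?  Infection risk
# scales (roughly) with contact rate, so take a contact-weighted sample without
# replacement for the immune 20% (exponential-key trick).
keys = rng.exponential(size=N) / contacts
immune = np.zeros(N, dtype=bool)
immune[np.argsort(keys)[: int(0.2 * N)]] = True

# The next transmission attempt lands on people in proportion to their contact rate,
# so the susceptibility that's left is contact-weighted, not headcount-weighted.
remaining = contacts[~immune].sum() / contacts.sum()
print("naive Rt multiplier (1 - 0.2): 0.80")
print(f"contact-weighted Rt multiplier: {remaining:.2f}")
# With a contact distribution this skewed, the multiplier comes out well below 0.8-
# 20% immune buys a lot more than a 20% cut in transmission.
```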

This is completely obvious, but somehow junk like this is being put out by Johns Hopkins of all places.  Right-wing deliberate disinformation is bad enough, but professionals responding with obvious nonsense really doesn’t help the cause of truth.  Please tell me the state of knowledge/education in this field isn’t truly that primitive.    Or ship me a Nobel Prize in medicine, I’m good either way.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield.. and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

**So, in abstract terms: On a team level, team OAA = range + conversion, and team FRAA = team OAA + positioning-based difficulty relative to average.  On a player level, player OAA = range + conversion, and player FRAA = player OAA + positioning + teammate noise.**

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but..

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the bolded abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.


Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:

outs/BIP=

player performance given the difficulty of the opportunity (OAA) +

smart/dumb fielder positioning (a front-office/manager skill in 2019) +

quality of contact allowed (a batter/pitcher skill) +

luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill + luck, it benefits from “knowing” the luck in the plays it’s trying to predict- in this case, whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.
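A tiny simulation of that test design (toy units of skill, conversion luck, and opportunity luck- nothing calibrated to real baseball) shows why grading a rating against the very plays that produced it rewards baked-in noise:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000  # player-seasons

skill = rng.normal(size=n)            # true fielding skill
conv_luck_19 = rng.normal(size=n)     # 2019 conversion luck
opp_luck_19 = rng.normal(size=n)      # 2019 difficulty-of-opportunity luck

outs_19 = skill + conv_luck_19 + opp_luck_19    # what actually happened in 2019
oaa_like = skill + conv_luck_19                 # strips out opportunity luck
fraa_like = skill + conv_luck_19 + opp_luck_19  # bakes opportunity luck in

# BP-style test: use 2019 ratings to "predict" 2019 outcomes.
print("same-season r, OAA-like: ", round(np.corrcoef(oaa_like, outs_19)[0, 1], 2))
print("same-season r, FRAA-like:", round(np.corrcoef(fraa_like, outs_19)[0, 1], 2))

# A test that can't feed a rating its own luck: predict a fresh season instead.
outs_20 = skill + rng.normal(size=n) + rng.normal(size=n)
print("next-season r, OAA-like: ", round(np.corrcoef(oaa_like, outs_20)[0, 1], 2))
print("next-season r, FRAA-like:", round(np.corrcoef(fraa_like, outs_20)[0, 1], 2))
# Same-season, the metric carrying the most baked-in noise "wins" (r = 1.0 here by
# construction); against a fresh season, the cleaner skill estimate wins.
```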

Richard Epstein’s “coronavirus evolving to weaker virulence” reduced death toll argument is remarkably awful

Isaac Chotiner buried Epstein in this interview, but he understandably didn’t delve into the conditions necessary for Epstein’s “evolving to weaker virulence will reduce the near-term death toll” argument to be true.  I did.  It’s bad.  Really bad.  Laughably bad.. if you can laugh at something that might be playing a part in getting people killed.

TL;DR He thinks worldwide COVID-19 cases will cap out well under 1 million for this wave, and one of the reasons is that the virus will evolve to be less virulent.  Unlike The Andromeda Strain, whose ending pissed me off when I read it in junior high, a virus doesn’t mutate the same way everywhere all at once.  There’s one mutation event at a time in one host (person) at a time, and the mutated virus starts reproducing and spreading through the population  like what’s seen here.  The hypothetical mutated weak virus can only have a big impact reducing total deaths if it can quickly propagate to a scale big enough to significantly impede the original virus (by granting immunity from the real thing, presumably).  If the original coronavirus only manages to infect under 1 million people worldwide in this wave, how in the hell is the hypothetical mutated weak coronavirus supposed to spread to a high enough percentage of the population -and even faster- to effectively vaccinate them from the real thing, even with a supposed transmission advantage?***  Even if it spread to 10 million worldwide over the same time frame (which would be impressive since there’s no evidence that it even exists right now….), that’s a ridiculously small percentage of the potentially infectable in hot zones.  It would barely matter at all until COVID/weak virus saturation got much higher, which only happens at MUCH higher case counts.

That line of argumentation is utterly and completely absurd alongside a well-under-1 million worldwide cases projection.


***The average onset of symptoms is ~5 days from infection and the serial interval (time from symptoms in one person to symptoms in a person they infected) is also only around 5 days, meaning there’s a lot of transmission going on before people even show any symptoms.  Furthermore, based on this timeline, which AFAIK is still roughly correct

[image: coronavirus median timeline infographic]

there’s another week on average to transmit the virus before the “strong virus” carriers are taken out of commission, meaning Epstein’s theory only gives a “transmission advantage” for the hypothetical weak virus more than 10 days after infection on average.  And, oh yeah, 75-85% of “strong” infections never wind up in the hospital at all, so by Epstein’s theory, they’ll transmit at the same rate.  There simply hasn’t been/isn’t now much room for a big transmission advantage for the weak virus under his assumptions.  And to reiterate, no evidence that it actually even exists right now.

DRC+ still contains a lot of park factor

Required knowledge: DRC+ and park factors

TL;DR read the title above, the rant 3 paragraphs down, and the very bottom

DRC+ is supposed to be a fully park-adjusted metric, but from the initial article, I couldn’t understand how that could be consistent with the reported results without either an exceptional amount of overfitting or extremely good luck.  Team DRC+ was reported to be more reliable than team wRC+ at describing the SAME SEASON’s team runs/PA.  Since wRC+ is based off of wOBA, team wOBA basically is team scoring offense (r=0.94), and DRC+ regresses certain components of wOBA back towards the mean quite significantly (which is why DRC+ is structurally unfit for use in WAR), it made no sense to me that a metric that took away actual hits that created actual runs from teams with good BABIPs and invented hits in place of actual outs for teams with bad BABIPs could possibly correlate better to actual runs scored than a metric that used what happened on the field.  It’s not quite logically impossible for that to be true, but it’s pretty damn close.

It turns out the simple explanation for how a park-adjusted, significantly regressed metric beat a park-adjusted, unregressed metric is the correct one.  It didn’t.  DRC+ keeps in a bunch of park factor and calls itself a park-adjusted metric when it’s simply not one, and not even close to one.  The park factor table near the bottom of the DRC+ article should have given anybody who knows anything about baseball serious pause, and of course it fits right in with DRC+’s “great descriptiveness”.

RANT

How in the hell does a park factor of 104 for Coors get published without explanation by any person or institution trying to be serious?  The observed park factors (halved) the last few years, in reverse order: 114 (2018), 115, 116, 117, 120, 109, 123… You can’t throw out a number like Coors 104 like it’s nothing.  If Jonathan Judge could actually justify it somehow- maybe last year we got a fantastic confluence of garbage pitchers and great situational hitting at Coors and the reverse on the road while still somehow only putting up a 114, where you could at least handwave an attempt at a justification, then he should have made that case when he was asked about it, but instead he gave an answer indicative of never having taken a serious look at it.  Spitting out a 104 for Coors should have been like a tornado siren going off in his ear to do basic quality control checks on park effects for the entire model, but it evidently wasn’t, so here I am doing it instead.

/RANT

The basic questions are “how correlated is team DRC+ to home park factor?” and “how correlated should team DRC+ be to home park factor?”.  The naive answer to the second question is “not correlated at all since it’s park adjusted, duh”, but it’s possible that the talented hitters skew towards hitters’ parks, which would cause a legitimate positive correlation, or that they skew towards pitchers’ parks, which would cause a legitimate negative correlation.  As it turns out, over the 2003-2017 timeframe, hitting talent doesn’t skew at all, but that’s an assertion that has to be demonstrated instead of just assumed true, so let’s get to it.

We need a way to make (offensive talent, home park factor) team-season pairs that can measure both components separately without being causally correlated to each other.  Seasonal team road wOBA is a basically unbiased way to measure offensive quality independent of home park factor because the opposing parks played in have to average out pretty similarly for every team in the same league (AL/NL)**.  If we use that, then we need a way to make a park factor for those seasons that can’t include that year’s data, because everything else being equal, an increase in a team’s road wOBA would decrease its home park factor****, and we’re explicitly trying to avoid nonsense like that.  Using the observed park factors from *surrounding years*, not the current year, to estimate the current year’s park factor solves that problem, assuming those estimates don’t suck.

** there’s a tiny bias from not playing road games in a stadium with your park factor, but correcting that by adding a hypothetical 5 road games at estimated home park factor doesn’t change conclusions

**** some increase will be skill that will, on average, increase home wOBA as well and mostly cancel out, and some increase will be luck that won’t cancel out and would screw the analysis up

Methodology

I used all eligible team-batting-seasons, pitchers included, from 2003-2017.  To estimate park factors, I used the surrounding 2 years (T-2, T-1, T+1, T+2) of observed park factors (for runs) if they were available, the surrounding 1 year (T-1, T+1) otherwise, and threw out the season if I didn’t have those.  That means I threw out all 2018s as well as the first and last years in each park.  I ignored other changes (moved fences, etc).

Because I have no idea what DRC+ is doing with pitcher-batters, how good its AL-NL benchmarking is, and the assumption of nearly equivalent aggregate road parks is only guaranteed to hold between same-league teams, I did the DRC+ analysis separately for AL and NL teams.

To control for changing leaguewide wOBA in the 2003-2017 time period, I used the same wOBA/LgAvgwOBA (wOBA%) method I used in *DRC+ really isn’t any good at predicting next year’s wOBA for team switchers*, for both wOBA and DRC+, just for AL teams and NL teams separately for the reasons above.  After this step, I did analyses with and without Coors because it’s an extreme outlier.  We already know with near certainty that their treatment of Coors is ~~kind of questionable~~ batshit crazy and keeps way too much park effect in DRC+, so I wanted to see how they did everywhere else.
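For anyone who wants the mechanics, here’s a sketch of the estimation and normalization steps (hypothetical file and column names standing in for the public park factor and team batting data):

```python
import pandas as pd

pf = pd.read_csv("observed_park_factors.csv")  # team, year, pf_runs (observed, halved)
bat = pd.read_csv("team_batting.csv")          # team, year, league, road_woba

# Surrounding-year park factor: mean of T-2/T-1/T+1/T+2 if all exist, else T-1/T+1,
# else drop the season (first/last years in a park, all 2018s, etc.).
pf = pf.sort_values(["team", "year"])
g = pf.groupby("team")["pf_runs"]
four = pd.concat([g.shift(2), g.shift(1), g.shift(-1), g.shift(-2)], axis=1)
two = pd.concat([g.shift(1), g.shift(-1)], axis=1)
pf["est_pf"] = four.mean(axis=1).where(four.notna().all(axis=1),
               two.mean(axis=1).where(two.notna().all(axis=1)))

df = bat.merge(pf[["team", "year", "est_pf"]], on=["team", "year"]).dropna(subset=["est_pf"])

# wOBA% = team road wOBA relative to that season's league average, done per league.
lg_avg = df.groupby(["league", "year"])["road_woba"].transform("mean")
df["road_woba_pct"] = df["road_woba"] / lg_avg

# If road wOBA% really is independent of home park, this should come out near zero.
print("corr(road wOBA%, estimated home PF):", round(df["road_woba_pct"].corr(df["est_pf"]), 3))
```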

Results

The park factor estimation worked pretty well.  2 surrounding year PF correlated to the observed PF for the year in question at r=0.54 (0.65 with Coors) and the 1 surrounding year at r=0.52 (0.61 with Coors).  The 5-year FanGraphs PF, WHICH USES THE YEAR IN QUESTION, only correlates at r=0.7 (0.77 with Coors), and the 1 and 2 year park factors correlate to the FanGraphs PF at 0.87 and 0.96 respectively.  This is plenty to work with given the effect sizes later.

Team road wOBA% (squared or linear) correlates to the estimated home park factor at r = -0.03, literally nothing, and with the 5 extra hypothetical games as mentioned in the footnote above, r=0.02, also literally nothing.  It didn’t have to be this way, but it’s convenient that it is.  Just to show that road wOBA isn’t all noise, it correlates to that season’s home wOBA% at r=0.32 (0.35 with the adjustment) even though we’re dealing with half seasons and home wOBA% contains the entire park factor.  Road wOBA% correlates to home wOBA%/sqrt(estimated park factor) at r=0.56 (and wOBA%/park factor at r=0.54).  That’s estimated park factor from surrounding years, not using the home and road wOBA data in question.

Home wOBA% is obviously hugely correlated to estimated park factor (r=0.46 for home wOBA%^2 vs estimated PF), but park adjusting it by correlating

(home wOBA%)^2/estimated park factor TO estimated park factor

has r= -0.00017.  Completely uncorrelated to estimated PF (it’s pure luck that it’s THAT low).

So we’ve established that road wOBA really does contain a lot of information on a team’s offensive talent (that’s a legitimate naive “duh”), that it’s virtually uncorrelated to true home park factor, and that park-adjusted home wOBA% (using PF estimates from other seasons only) is also uncorrelated to true home park factor.  If DRC+ is a correctly park-adjusted metric that measures offensive talent, DRC+% should also have to be virtually uncorrelated to true home park factor.

And… the correlation of DRC+% to estimated park factor is r= 0.38 for AL teams, r=0.29 for NL teams excluding Colorado, r=0.31 including Colorado.  Well then.  That certainly explains how it can be more descriptive than an actually park-adjusted metric.

 

C’mon Man- Baseball Prospectus DRC+ Edition

Required knowledge: A couple of “advanced” baseball stats.  If you know BABIP, wRC+, and WAR, you shouldn’t have any trouble here.  If you know box score stats, you should be able to get the gist.

Baseball Prospectus recently introduced its Deserved Runs Created offensive metric that purports to isolate player contribution to PA outcomes instead of just tallying up the PA outcomes, and they’re using that number as an offensive input into their version of WAR.  On top of that, they’re pushing out articles trying to retcon the 2012 Trout vs. Cabrera “debate” in favor of Cabrera and trying to give Graig Nettles 15 more wins out of thin air. They appear to be quite serious and all-in on this concept as a more accurate measure of value.  It’s not.

The exact workings of the model are opaque, but there’s enough description of the basic concept and the gigantic biases are so obvious that I feel comfortable describing it in broad strokes. Instead of measuring actual PA outcomes (like OPS/wOBA/wRC+/etc) or being a competitive forecasting system (Steamer/ZIPS/PECOTA), it’s effectively just a shitty forecast based on one hitter-season of data at a time****.

It weights the more reliable (K/BB/HR) components more and the less reliable (BABIP) components less like projections do, but because it’s wearing blinders and can’t see more than one season at a time, it NEVER FUCKING LEARNS**** that some players really do have outlier BABIP skill and keeps over-regressing them year after year.  This is methodologically fatal.  It’s impossible to salvage a one-year-of-stats-regressed framework.  It might work as a career thing, but then year X WAR would change based on year X+1 performance.

Addendum for clarity: If DRC+ regresses each season as though that’s all the information it knows, then adds those regressed seasons up to determine career value, that is *NOT* the same as correctly regressing the total career.  If, for example, BABIP skill got regressed 50% each year, then DRC+ would effectively regress the final career value 50% as well (as the result of adding up 50%-regressed seasons), even though the proper regression after 8000 PAs is much, much less.  This is why the entire DRC+ concept and the other similarly constructed regressed-season BP metrics are broken beyond all repair.  /addendum
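In numbers, with made-up regression amounts just to show the shape of the problem (these are illustrative assumptions, not DRC+’s actual parameters):

```python
# A hitter with a genuine outlier BABIP skill worth 10 runs/season over a 10-year career.
seasons = 10
true_skill_runs_per_season = 10.0

per_season_regression = 0.50   # what a one-season-at-a-time model might apply
career_regression = 0.10       # a defensible amount after ~8000 PA of evidence

summing_regressed_seasons = seasons * true_skill_runs_per_season * (1 - per_season_regression)
regressing_the_career = seasons * true_skill_runs_per_season * (1 - career_regression)

print(summing_regressed_seasons)  # 50.0 runs credited
print(regressing_the_career)      # 90.0 runs credited
# Adding up regressed seasons bakes the full per-season regression into the career
# total, no matter how much evidence of the outlier skill has accumulated.
```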

 

****The description is vague enough that it might actually use multiple years and slowly learn over a player’s career, but it definitely doesn’t understand that a career of outlier skill means that the outlier skill (likely) existed the whole time it was presenting, so the general problem of over-regressing year after year would still apply, just more to the earlier years. Trout has 7 full years and he’s still being underrated by 18, 18, and 11 points the last 3 years compared to wRC+ and 17 points over his whole career.

DRC+ loves good hitters with terrible BABIPs and particularly ones with bad BABIPs and lots of HRs.  Graig Nettles and his career .245 +/- .005 BABIP / 390 HRs looks great to DRC+ (120 vs 111 wRC+, +14.7 wins at the plate), as do Mark McGwire (164 vs 157, +8.5 wins), Harmon Killebrew (150 vs 142, +16.2 wins), Ernie Banks (129 vs 118, +20.8 wins), etc.  Guys who beat the hell out of the ball and run average-ish BABIPs are rated similarly to wRC+, Barry Bonds (175 vs 173), Hank Aaron (150 vs 153), Willie Mays (150 vs 154), Albert Pujols (147 vs 146), etc.

The flip side of that is that DRC+ really, really hates low-ISO/high-BABIP quality hitters.  It underrates Tony Gwynn (119 vs 132, -12.9 wins) because it can’t figure out that the 8-time batting champ can hit.  In addition, it hates Roberto Alomar (110 vs 118, -10.4 wins), Derek Jeter (105 vs 119, -17.9 wins), Rod Carew (112 vs 132, -18.7 wins), etc.  This is simply absurd.

C’mon man.