Nate Silver vs AnEpidemiolgst

This beef started with this tweet https://twitter.com/AnEpidemiolgst/status/1258433065933824008

which is just something else for multiple reasons.  Tone policing a neologism is just stupid, especially when it’s basically accurate.  Doing so without providing a preferred term is even worse.  But, you know, I’m probably not writing a post just because somebody acted like an asshole on twitter.  I’m doing it for far more important reasons, namely:

[Image: xkcd “Duty Calls”]

And in this particular case, it’s not Nate.  She also doubles down with https://twitter.com/JDelage/status/1258452085428928515

which is obviously wrong, even for a fuzzy definition of meaningfully, if you stop and think about it.  R0 is a population average.  Some people act like hermits and have little chance of spreading the disease much if they somehow catch it.  Others have far, far more interactions than average and are at risk of being superspreaders if they get an asymptomatic infection (or are symptomatic assholes).  These average out to R0.

Now, when 20% of the population is immune (assuming they develop immunity after infection, blah blah), who is it going to be?  By definition, it’s people who already got infected.  Who got infected?  Obviously, for something like COVID, it’s weighted so that >>20% of potential superspreaders were already infected and <<20% of hermits were infected.  That means that far more than the naive 20% of the interactions infected people have now are going to be with somebody who’s already immune (the exact number depending on the shape and variance of the interaction distribution), so Rt is going to be much less than (1 – 0.2) * R0 at 20% immune.  In ELI5 language: for a disease like COVID, 20% immune implies a lot more than a 20% decrease in transmission rate.
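Here’s a quick toy simulation of that argument- my own sketch, not anybody’s actual epi model.  Contact rates come from a skewed made-up distribution, infection risk and onward transmission are both assumed proportional to contact rate (proportionate mixing), and we check what Rt looks like once 20% of people- weighted toward the high-contact ones- are immune.  Every number in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
R0 = 2.5

# contact rates: mean ~1, heavy right tail (hermits vs. potential superspreaders)
contacts = rng.gamma(shape=0.5, scale=2.0, size=N)

# infect ~20% of the population, weighted by contact rate (high-contact people get hit first)
p = contacts / contacts.sum()
immune = np.zeros(N, dtype=bool)
immune[rng.choice(N, size=int(0.2 * N), replace=False, p=p)] = True

# under proportionate mixing, an infectious person's contacts land on other people in
# proportion to their contact rates, so Rt scales with the contact-weighted susceptible share
susceptible_contact_share = contacts[~immune].sum() / contacts.sum()
Rt = R0 * susceptible_contact_share

print(f"share of population immune:     {immune.mean():.0%}")
print(f"share of contact 'mass' immune: {1 - susceptible_contact_share:.0%}")
print(f"naive Rt = (1 - 0.2) * R0 =     {0.8 * R0:.2f}")
print(f"heterogeneous Rt ~=             {Rt:.2f}")
```

The exact gap depends entirely on the contact-rate distribution you pick, but “20% immune cuts transmission by way more than 20%” falls out of any distribution with real variance.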

This is completely obvious, but somehow junk like this is being put out by Johns Hopkins of all places.  Right-wing deliberate disinformation is bad enough, but professionals responding with obvious nonsense really doesn’t help the cause of truth.  Please tell me the state of knowledge/education in this field isn’t truly that primitive.  Or ship me a Nobel Prize in medicine; I’m good either way.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield… and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer three questions, in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: On a team level, *team OAA = range + conversion* and *team FRAA = team OAA + positioning-based difficulty relative to average*.  On a player level, *player OAA = range + conversion* and *player FRAA = player OAA + positioning + teammate noise*.
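To put made-up numbers on that abstraction (a toy illustration of the team-context effect, not either system’s actual math), here’s what happens to a perfectly average fielder whose pitchers allow a pile of uncatchable hits:

```python
# Toy example: a 0-OAA fielder gets dragged negative by an FRAA-style expected-outs baseline.
league_out_rate = 0.70          # assumed league-average outs per BIP on this position's chances

own_chances = 300
own_outs = int(own_chances * league_out_rate)    # converts exactly the average rate

# context: uncatchable hits charged against the position's expected outs
uncatchable_hits_charged = 60                    # 0% plays - no OAA penalty, but they raise the expectation

oaa_style = own_outs - own_chances * league_out_rate
fraa_style = own_outs - (own_chances + uncatchable_hits_charged) * league_out_rate

print(f"OAA-style credit:  {oaa_style:+.1f}")    # 0.0, because not making a 0% play doesn't hurt
print(f"FRAA-style credit: {fraa_style:+.1f}")   # -42.0, dragged down by teammate/pitcher context
```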

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but…

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the bolded abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.
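If you want to see the noise argument in numbers, here’s a toy simulation (my own stand-in distributions and scales, not BP’s test or any real data): build ratings out of a shared conversion-skill term plus different amounts of extra noise, then “test” them against conversion on fielded balls.

```python
import numpy as np

rng = np.random.default_rng(1)
n_players = 500

conversion_skill = rng.normal(0, 1, n_players)        # the only thing the fielded-ball test measures
range_noise = rng.normal(0, 0.5, n_players)
positioning_noise = rng.normal(0, 0.8, n_players)
teammate_noise = rng.normal(0, 0.8, n_players)

# season ratings under the abstraction above (all scales made up)
oaa_like = conversion_skill + range_noise
uzr_like = conversion_skill + range_noise + 0.5 * positioning_noise   # strips out *some* positioning
fraa_like = conversion_skill + range_noise + positioning_noise + teammate_noise

# the "test": observed conversion on balls the player actually fielded = skill + sampling luck
observed_conversion = conversion_skill + rng.normal(0, 0.5, n_players)

for name, rating in [("OAA-like", oaa_like), ("UZR-like", uzr_like), ("FRAA-like", fraa_like)]:
    r = np.corrcoef(rating, observed_conversion)[0, 1]
    print(f"{name:9s} r vs fielded-ball conversion: {r:.2f}")
# less extra noise stacked on top of conversion -> higher r, by construction
```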

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.


Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, is based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:

outs/BIP =
  player performance given the difficulty of the opportunity (OAA) +
  smart/dumb fielder positioning (a front-office/manager skill in 2019) +
  quality of contact allowed (a batter/pitcher skill) +
  luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.
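For what it’s worth, the query really is trivial.  A sketch of the idea, assuming you already have a per-BIP table with each system’s difficulty-of-opportunity estimate and the fielder who was on the field (column names here are placeholders, not anybody’s actual schema):

```python
import pandas as pd

def avg_difficulty_by_fielder(bip: pd.DataFrame) -> pd.Series:
    """Average estimated out probability of the BIPs hit while each fielder was at the position.

    Expects columns: 'fielder_id' (who was manning the position for that BIP) and
    'est_out_prob' (the system's difficulty-of-opportunity estimate for that BIP).
    """
    return bip.groupby("fielder_id")["est_out_prob"].mean()

# outs/BIP for a fielder is then roughly this difficulty term plus the skill number the
# systems already report - they just don't add it back in, because explaining outs/BIP
# isn't what they're for.
```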

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill + luck, it benefits from “knowing” the luck in the plays it’s trying to predict- in this case, whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty-of-opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.
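Here’s a toy version of that problem (my own made-up model, not BP’s pipeline): grade two ratings against the same season’s plays, where one rating carries an extra chunk of that season’s luck, then grade them against a fresh season.

```python
import numpy as np

rng = np.random.default_rng(2)
n_players = 500

skill = rng.normal(0, 1, n_players)
conversion_luck_2019 = rng.normal(0, 1, n_players)     # over/under-performing true skill
opportunity_luck_2019 = rng.normal(0, 1, n_players)    # difficulty-of-opportunity luck

# what actually happened in 2019 (the test target)
observed_2019 = skill + conversion_luck_2019 + opportunity_luck_2019

oaa_like = skill + conversion_luck_2019                           # conversion luck only
fraa_like = skill + conversion_luck_2019 + opportunity_luck_2019  # opportunity luck baked in too

def r(a, b):
    return round(float(np.corrcoef(a, b)[0, 1]), 2)

print("in-sample (same 2019 plays):  OAA-like", r(oaa_like, observed_2019),
      " FRAA-like", r(fraa_like, observed_2019))

# out-of-sample: same skills, fresh luck
observed_2020 = skill + rng.normal(0, 1, n_players) + rng.normal(0, 1, n_players)
print("out-of-sample (2020 plays):   OAA-like", r(oaa_like, observed_2020),
      " FRAA-like", r(fraa_like, observed_2020))
# the rating carrying more of 2019's luck "wins" the in-sample test and loses the real one
```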

Richard Epstein’s “coronavirus evolving to weaker virulence” reduced death toll argument is remarkably awful

Isaac Chotiner buried Epstein in this interview, but he understandably didn’t delve into the conditions necessary for Epstein’s “evolving to weaker virulence will reduce the near-term death toll” argument to be true.  I did.  It’s bad.  Really bad.  Laughably bad… if you can laugh at something that might be playing a part in getting people killed.

TL;DR He thinks worldwide COVID-19 cases will cap out well under 1 million for this wave, and one of the reasons is that the virus will evolve to be less virulent.  Unlike The Andromeda Strain, whose ending pissed me off when I read it in junior high, a virus doesn’t mutate the same way everywhere all at once.  There’s one mutation event at a time in one host (person) at a time, and the mutated virus starts reproducing and spreading through the population like what’s seen here.  The hypothetical mutated weak virus can only have a big impact in reducing total deaths if it can quickly propagate to a scale big enough to significantly impede the original virus (by granting immunity from the real thing, presumably).  If the original coronavirus only manages to infect under 1 million people worldwide in this wave, how in the hell is the hypothetical mutated weak coronavirus supposed to spread to a high enough percentage of the population -and even faster- to effectively vaccinate them from the real thing, even with a supposed transmission advantage?***  Even if it spread to 10 million worldwide over the same time frame (which would be impressive since there’s no evidence that it even exists right now…), that’s a ridiculously small percentage of the potentially infectable in hot zones.  It would barely matter at all until COVID/weak virus saturation got much higher, which only happens at MUCH higher case counts.

That line of argumentation is utterly and completely absurd alongside a well-under-1 million worldwide cases projection.


***The average onset of symptoms is ~5 days from infection and the serial interval (time from symptoms in one person to symptoms in a person they infected) is also only around 5 days, meaning there’s a lot of transmission going on before people even show any symptoms.  Furthermore, based on this timeline, which AFAIK is still roughly correct

[Image: coronavirus median timeline infographic]

there’s another week on average to transmit the virus before the “strong virus” carriers are taken out of commission, meaning Epstein’s theory only gives a “transmission advantage” for the hypothetical weak virus more than 10 days after infection on average.  And, oh yeah, 75-85% of “strong” infections never wind up in the hospital at all, so by Epstein’s theory, they’ll transmit at the same rate.  There simply hasn’t been/isn’t now much room for a big transmission advantage for the weak virus under his assumptions.  And to reiterate, no evidence that it actually even exists right now.

DRC+ still contains a lot of park factor

Required knowledge: DRC+ and park factors

TL;DR read the title above, the rant 3 paragraphs down, and the very bottom

DRC+ is supposed to be a fully park-adjusted metric, but from the initial article, I couldn’t understand how that could be consistent with the reported results without either an exceptional amount of overfitting or extremely good luck.  Team DRC+ was reported to be more reliable than team wRC+ at describing the SAME SEASON’s team runs/PA.  wRC+ is based off of wOBA, team wOBA basically is team scoring offense (r=0.94), and DRC+ regresses certain components of wOBA back towards the mean quite significantly (which is why DRC+ is structurally unfit for use in WAR).  So it made no sense to me that a metric that took away actual hits that created actual runs from teams with good BABIPs, and invented hits in place of actual outs for teams with bad BABIPs, could possibly correlate better to actual runs scored than a metric that used what happened on the field.  It’s not quite logically impossible for that to be true, but it’s pretty damn close.

It turns out the simple explanation for how a park-adjusted, significantly regressed metric beat a park-adjusted, unregressed metric is the correct one.  It didn’t.  DRC+ keeps in a bunch of park factor and calls itself a park-adjusted metric when it’s simply not one, and not even close to one.  The park factor table near the bottom of the DRC+ article should have given anybody who knows anything about baseball serious pause, and of course it fits right in with DRC+’s “great descriptiveness”.

RANT

How in the hell does a park factor of 104 for Coors get published without explanation by any person or institution trying to be serious?  The observed park factors (halved) the last few years, in reverse order: 114 (2018), 115, 116, 117, 120, 109, 123… You can’t throw out a number like Coors 104 like it’s nothing.  If Jonathan Judge could actually justify it somehow- say, last year we got a fantastic confluence of garbage pitchers and great situational hitting at Coors and the reverse on the road while still somehow only putting up a 114, where you could at least handwave an attempt at a justification- then he should have made that case when he was asked about it.  Instead he gave an answer indicative of never having taken a serious look at it.  Spitting out a 104 for Coors should have been like a tornado siren going off in his ear to do basic quality-control checks on park effects for the entire model, but it evidently wasn’t, so here I am doing it instead.

/RANT

The basic questions are “how correlated is team DRC+ to home park factor?” and “how correlated should team DRC+ be to home park factor?”.  The naive answer to the second question is “not correlated at all since it’s park adjusted, duh”, but it’s possible that the talented hitters skew towards hitters’ parks, which would cause a legitimate positive correlation, or that they skew towards pitchers’ parks, which would cause a legitimate negative correlation.  As it turns out, over the 2003-2017 timeframe, hitting talent doesn’t skew at all, but that’s an assertion that has to be demonstrated instead of just assumed true, so let’s get to it.

We need a way to make (offensive talent, home park factor) team-season pairs that can measure both components separately without being causally correlated to each other.  Seasonal team road wOBA is a basically unbiased way to measure offensive quality independent of home park factor because the opposing parks played in have to average out pretty similarly for every team in the same league (AL/NL)**.  If we use that, then we need a way to make a park factor for those seasons that can’t include that year’s data, because everything else being equal, an increase in a team’s road wOBA would decrease its home park factor****, and we’re explicitly trying to avoid nonsense like that.  Using the observed park factors from *surrounding years*, not the current year, to estimate the current year’s park factor solves that problem, assuming those estimates don’t suck.

** (there’s a tiny bias from not playing road games in a stadium with your park factor, but correcting that by adding a hypothetical 5 road games at estimated home park factor doesn’t change conclusions)

**** some increase will be skill that will, on average, increase home wOBA as well and mostly cancel out, and some increase will be luck that won’t cancel out and would screw the analysis up

Methodology

I used all eligible team-batting-seasons, pitchers included, from 2003-2017.  To estimate park factors, I used the surrounding 2 years (T-2, T-1, T+1, T+2) of observed park factors (for runs) if they were available, the surrounding 1 year (T-1, T+1) otherwise, and threw out the season if I didn’t have those.  That means I threw out all 2018s as well as the first and last years in each park.  I ignored other changes (moved fences, etc).
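A sketch of that park factor estimate, assuming you’ve got observed (halved) run park factors keyed by team and year sitting in a dict- the data source and exact PF definition are whatever you already have, this is just the surrounding-years logic:

```python
def estimated_pf(observed_pf: dict, team: str, year: int):
    """Estimate year Y's park factor from surrounding seasons only, never year Y itself."""
    two_year = [observed_pf.get((team, year + d)) for d in (-2, -1, 1, 2)]
    one_year = [observed_pf.get((team, year + d)) for d in (-1, 1)]

    if all(pf is not None for pf in two_year):   # prefer T-2, T-1, T+1, T+2
        return sum(two_year) / 4
    if all(pf is not None for pf in one_year):   # fall back to T-1, T+1
        return sum(one_year) / 2
    return None                                  # throw the season out (first/last year in a park, all 2018s)
```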

Because I have no idea what DRC+ is doing with pitcher-batters, how good its AL-NL benchmarking is, and the assumption of nearly equivalent aggregate road parks is only guaranteed to hold between same-league teams, I did the DRC+ analysis separately for AL and NL teams.

To control for changing leaguewide wOBA over the 2003-2017 time period, I used the same wOBA/LgAvgwOBA (wOBA%) method I used in “DRC+ really isn’t any good at predicting next year’s wOBA for team switchers”, applied to both wOBA and DRC+, for AL teams and NL teams separately for the reasons above.  After this step, I did analyses with and without Coors because it’s an extreme outlier.  We already know with near certainty that their treatment of Coors is kind of questionable batshit crazy and keeps way too much park effect in DRC+, so I wanted to see how they did everywhere else.

Results

The park factor estimation worked pretty well.  The 2-surrounding-year PF correlated to the observed PF for the year in question at r=0.54 (0.65 with Coors) and the 1-surrounding-year PF at r=0.52 (0.61 with Coors).  The 5-year FanGraphs PF, WHICH USES THE YEAR IN QUESTION, only correlates at r=0.7 (0.77 with Coors), and the 1- and 2-year park factors correlate to the FanGraphs PF at 0.87 and 0.96 respectively.  This is plenty to work with given the effect sizes later.

Team road wOBA% (squared or linear) correlates to the estimated home park factor at r = -0.03, literally nothing, and with the 5 extra hypothetical games as mentioned in the footnote above, r=0.02, also literally nothing.  It didn’t have to be this way, but it’s convenient that it is.  Just to show that road wOBA isn’t all noise, it correlates to that season’s home wOBA% at r=0.32 (0.35 with the adjustment) even though we’re dealing with half seasons and home wOBA% contains the entire park factor.  Road wOBA% correlates to home wOBA%/sqrt(estimated park factor) at r=0.56 (and wOBA%/park factor at r=0.54).  That’s estimated park factor from surrounding years, not using the home and road wOBA data in question.

Home wOBA% is obviously hugely correlated to estimated park factor (r=0.46 for home wOBA%^2 vs estimated PF), but park adjusting it by correlating

(home wOBA%)^2/estimated park factor TO estimated park factor

has r= -0.00017.  Completely uncorrelated to estimated PF (it’s pure luck that it’s THAT low).

So we’ve established that road wOBA really does contain a lot of information on a team’s offensive talent (that’s a legitimate naive “duh”), that it’s virtually uncorrelated to true home park factor, and that park-adjusted home wOBA% (using PF estimates from other seasons only) is also uncorrelated to true home park factor.  If DRC+ is a correctly park-adjusted metric that measures offensive talent, DRC+% should also be virtually uncorrelated to true home park factor.
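The check itself is one line once the data’s assembled.  A sketch, assuming a DataFrame with one row per eligible team-season and placeholder column names (“drc_plus_pct” for DRC+ over league average, “est_pf” for the surrounding-years park factor estimate):

```python
import pandas as pd

def pf_correlation(df: pd.DataFrame, metric_col: str = "drc_plus_pct") -> float:
    """If the metric is truly park-adjusted, this should come back near zero."""
    return float(df[metric_col].corr(df["est_pf"]))

# run separately by league, with and without Colorado, e.g.:
# for league, grp in df.groupby("league"):
#     print(league, round(pf_correlation(grp), 2),
#           "(no COL)", round(pf_correlation(grp[grp["team"] != "COL"]), 2))
```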

And… the correlation of DRC+% to estimated park factor is r= 0.38 for AL teams, r=0.29 for NL teams excluding Colorado, r=0.31 including Colorado.  Well then.  That certainly explains how it can be more descriptive than an actually park-adjusted metric.

 

C’mon Man- Baseball Prospectus DRC+ Edition

Required knowledge: A couple of “advanced” baseball stats.  If you know BABIP, wRC+, and WAR, you shouldn’t have any trouble here.  If you know box score stats, you should be able to get the gist.

Baseball Prospectus recently introduced its Deserved Runs Created offensive metric that purports to isolate player contribution to PA outcomes instead of just tallying up the PA outcomes, and they’re using that number as an offensive input into their version of WAR.  On top of that, they’re pushing out articles trying to retcon the 2012 Trout vs. Cabrera “debate” in favor of Cabrera and trying to give Graig Nettles 15 more wins out of thin air. They appear to be quite serious and all-in on this concept as a more accurate measure of value.  It’s not.

The exact workings of the model are opaque, but there’s enough description of the basic concept, and the gigantic biases are obvious enough, that I feel comfortable describing it in broad strokes.  Instead of measuring actual PA outcomes (like OPS/wOBA/wRC+/etc) or being a competitive forecasting system (Steamer/ZIPS/PECOTA), it’s effectively just a shitty forecast based on one hitter-season of data at a time****.

It weights the more reliable components (K/BB/HR) more and the less reliable components (BABIP) less, like projections do, but because it’s wearing blinders and can’t see more than one season at a time, it NEVER FUCKING LEARNS**** that some players really do have outlier BABIP skill, and it keeps over-regressing them year after year.  This is methodologically fatal.  It’s impossible to salvage a one-year-of-stats-regressed framework.  It might work as a career thing, but then year X WAR would change based on year X+1 performance.

Addendum for clarity: If DRC+ regresses each season as though that’s all the information it knows, then adds those regressed seasons up to determine career value, that is *NOT* the same as correctly regressing the total career.  If, for example, BABIP skill got regressed 50% each year, then DRC+ would effectively regress the final career value 50% as well (as the result of adding up 50%-regressed seasons), even though the proper regression after 8000 PAs is much, much less.  This is why the entire DRC+ concept and the other similarly constructed regressed-season BP metrics are broken beyond all repair.  /addendum
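To put toy numbers on the addendum (all made up for illustration, obviously not BP’s actual regression amounts):

```python
true_babip_skill = 0.040        # assumed: +40 points of BABIP-driven value per season above average
seasons = 10
per_season_regression = 0.50    # assumed: how much one season of BABIP gets regressed toward the mean

# season-by-season: each year only gets credited the unregressed half, and the halves add up
career_credit_summed = true_babip_skill * (1 - per_season_regression) * seasons

# career-level: after thousands of PAs the proper regression is much smaller (illustrative 10%)
career_regression = 0.10
career_credit_proper = true_babip_skill * (1 - career_regression) * seasons

print(f"summing regressed seasons: {career_credit_summed:.2f}  (still a {per_season_regression:.0%} career regression)")
print(f"regressing the career:     {career_credit_proper:.2f}  (a {career_regression:.0%} career regression)")
```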

 

****The description is vague enough that it might actually use multiple years and slowly learn over a player’s career, but it definitely doesn’t understand that a career of outlier skill means that the outlier skill (likely) existed the whole time it was presenting, so the general problem of over-regressing year after year would still apply, just more to the earlier years. Trout has 7 full years and he’s still being underrated by 18, 18, and 11 points the last 3 years compared to wRC+ and 17 points over his whole career.

DRC+ loves good hitters with terrible BABIPs and particularly ones with bad BABIPs and lots of HRs.  Graig Nettles and his career .245 +/- .005 BABIP / 390 HRs looks great to DRC+ (120 vs 111 wRC+, +14.7 wins at the plate), as do Mark McGwire (164 vs 157, +8.5 wins), Harmon Killebrew (150 vs 142, +16.2 wins), Ernie Banks (129 vs 118, +20.8 wins), etc.  Guys who beat the hell out of the ball and run average-ish BABIPs are rated similarly to wRC+, Barry Bonds (175 vs 173), Hank Aaron (150 vs 153), Willie Mays (150 vs 154), Albert Pujols (147 vs 146), etc.

The flip side of that is that DRC+ really, really hates low-ISO/high-BABIP quality hitters.  It underrates Tony Gwynn (119 vs 132, -12.9 wins) because it can’t figure out that the 8-time batting champ can hit.  It also hates Roberto Alomar (110 vs 118, -10.4 wins), Derek Jeter (105 vs 119, -17.9 wins), Rod Carew (112 vs 132, -18.7 wins), etc.  This is simply absurd.

C’mon man.