Nate Silver vs AnEpidemiolgst

This beef started with this tweet https://twitter.com/AnEpidemiolgst/status/1258433065933824008

which is just something else for multiple reasons.  Tone policing a neologism is just stupid, especially when it’s basically accurate.  Doing so without providing a preferred term is even worse.  But, you know, I’m probably not writing a post just because somebody acted like an asshole on twitter.  I’m doing it for far more important reasons, namely:

[Image: duty_calls]

And in this particular case, it’s not Nate.  She also doubles down with https://twitter.com/JDelage/status/1258452085428928515

which is obviously wrong, even for a fuzzy definition of meaningfully, if you stop and think about it.  R0 is a population average.  Some people act like hermits and have little chance of spreading the disease much if they somehow catch it.  Others have far, far more interactions than average and are at risk of being superspreaders if they get an asymptomatic infection (or are symptomatic assholes).  These average out to R0.

Now, when 20% of the population is immune (assuming they develop immunity after infection, blah blah), who is it going to be?  By definition, it’s people who already got infected.  Who got infected?  Obviously, for something like COVID, it’s weighted so that >>20% of potential superspreaders were already infected and <<20% of hermits were infected.  That means that far more than the naive 20% of the interactions infected people have now are going to be with somebody who’s already immune (the exact number depending on the shape and variance of the interaction distribution), and so Rt is going to be much less than (1 – 0.2) * R0 at 20% immune, or in ELI5 language, 20% immune implies a lot more than a 20% decrease in transmission rate for a disease like COVID.
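If the ELI5 version isn’t convincing, here’s a minimal sketch of the effect (the gamma contact-rate distribution and population size are illustrative assumptions, not fitted to anything real):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
# Overdispersed contact rates: lots of near-hermits, a small tail of potential superspreaders
c = rng.gamma(shape=0.5, scale=2.0, size=N)  # mean contact rate = 1.0

# Crude stand-in for "who gets infected first": sample 20% of the population without
# replacement with probability proportional to contact rate (exponential-clock trick:
# person i "fires" at a time ~ Exponential(rate=c_i), earliest 20% get infected)
first_wave = np.argsort(rng.exponential(1.0 / c))[: N // 5]
immune = np.zeros(N, dtype=bool)
immune[first_wave] = True

print(f"immune by headcount:      {immune.mean():.0%}")
print(f"immune by contact volume: {c[immune].sum() / c.sum():.0%}")
# ~20% of people, but roughly 45-50% of all interactions, are now "used up", so the
# transmission rate falls far more than the naive (1 - 0.2) * R0 would suggest.
```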

This is completely obvious, but somehow junk like this is being put out by Johns Hopkins of all places.  Right-wing deliberate disinformation is bad enough, but professionals responding with obvious nonsense really doesn’t help the cause of truth.  Please tell me the state of knowledge/education in this field isn’t truly that primitive.  Or ship me a Nobel Prize in medicine, I’m good either way.

Don’t use FRAA for outfielders

TL;DR OAA is far better, as expected.  Read after the break for next-season OAA prediction/commentary.

As a followup to my previous piece on defensive metrics, I decided to retest the metrics using a sane definition of opportunity.  BP’s study defined a defensive opportunity as any ball fielded by an outfielder, which includes completely uncatchable balls as well as ground balls that made it through the infield.  The latter are absolute nonsense, and the former are pretty worthless.  Thanks to Statcast, a better definition of defensive opportunity is available- any ball Statcast gives a nonzero catch probability and assigns to an outfielder.  Because Statcast doesn’t provide catch probability/OAA on individual plays, we’ll be testing each outfielder in aggregate.

Similarly to what BP tried to do, we’re going to try to describe or predict each OF’s outs/opportunity, and we’re testing the 354 qualified OF player-seasons from 2016-2019.  Our contestants are Statcast’s OAA/opportunity, UZR/opportunity, FRAA/BIP (what BP used in their article), simple average catch probability (with no idea if the play was made or not), and positional adjustment (effectively the share of innings in CF, corner OF, or 1B/DH).  Because we’re comparing all outfielders to each other, and UZR and FRAA compare each position separately, those two received the positional adjustment (they grade quite a bit worse without it, as expected).
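For concreteness, the whole exercise is just correlations over those 354 player-seasons; a minimal pandas sketch, with a hypothetical file and column names standing in for the real data:

```python
import pandas as pd

df = pd.read_csv("of_player_seasons_2016_2019.csv")  # hypothetical file of qualified OF seasons

metrics = [
    "oaa_per_opp",          # Statcast OAA / opportunity
    "uzr_per_opp_posadj",   # UZR / opportunity + positional adjustment
    "fraa_per_bip_posadj",  # FRAA / BIP + positional adjustment
    "avg_catch_prob",       # average catch probability of the player's opportunities
    "pos_adj_per_opp",      # positional adjustment alone
]

target = "outs_per_opp"
r2 = {m: df[m].corr(df[target]) ** 2 for m in metrics}
print(pd.Series(r2).sort_values(ascending=False))
```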

Using data from THE SAME SEASON (see previous post if it isn’t obvious why this is a bad idea) to describe that SAME SEASON’s outs/opportunity, which is what BP was testing, we get the following correlations:

Metric r^2 to same-season outs/opportunity
OAA/opp 0.74
UZR/opp 0.49
Catch Probability + Position 0.43
FRAA/BIP 0.34
Catch Probability 0.32
Positional adjustment/opp 0.25


OAA wins running away, UZR is a clear second, background information is 3rd, and FRAA is a distant 4th, barely ahead of raw catch probability.  And catch probability shouldn’t be that important.  It’s almost independent of OAA (r=0.06) and explains much less of the outs/opp variance.  Performance on opportunities is a much bigger driver than difficulty of opportunities over the course of a season.  I ran the same test on the 3 OF positions individually (using Statcast’s definition of primary position for that season), and the numbers bounced a little, but it’s the same rank order and similar magnitude of differences.

Attempting to describe same-season OAA/opp gives the following:

Metric r^2 to same-season OAA/opportunity
OAA/opp 1
UZR/opp 0.5
FRAA/BIP 0.32
Positional adjustment/opp 0.17
Catch Probability 0.004

As expected, catch probability drops way off.  CF opportunities are on average about 1% harder than corner OF opportunities.  Positional adjustment is obviously a skill correlate (Full-time CF > CF/corner tweeners > Full-time corner > corner/1B-DH tweeners), but it’s a little interesting that it drops off compared to same-season outs/opportunity.  It’s reasonably correlated to catch probability, which is good for describing outs/opp and useless for describing OAA/opp, so I’m guessing that’s most of the decline.


Now, on to the more interesting things… Using one season’s metric to predict the NEXT season’s OAA/opportunity (both seasons must be qualified), which leaves 174 paired seasons, gives us the following (players who dropped out were almost average in aggregate defensively):

Metric r^2 to next season OAA/opportunity
OAA/opp 0.45
FRAA/BIP 0.27
UZR/opp 0.25
Positional adjustment 0.1
Catch Probability 0.02

FRAA notably doesn’t suck here- although unless you’re a modern-day Wintermute who is forbidden to know OAA, just use OAA of course.  Looking at the residuals from previous-season OAA, UZR is useless, but FRAA and positional adjustment contain a little information, and by a little I mean enough together to get the r^2 up to 0.47.  We’ve discussed positional adjustment already and that makes sense, but FRAA appears to know a little something that OAA doesn’t, and it’s the same story for predicting next-season outs/opp as well.
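Here’s a rough sketch of that residual check, same in-sample r^2 flavor as everything above, with hypothetical column names again:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

pairs = pd.read_csv("of_paired_seasons.csv")  # hypothetical file of 174 paired seasons
y = pairs["next_oaa_per_opp"]

# Previous-season OAA alone (~0.45 in the table above)
base_cols = ["oaa_per_opp"]
base = LinearRegression().fit(pairs[base_cols], y)
print(r2_score(y, base.predict(pairs[base_cols])))

# Add previous-season FRAA/BIP and positional adjustment on top of OAA
full_cols = ["oaa_per_opp", "fraa_per_bip", "pos_adj"]
full = LinearRegression().fit(pairs[full_cols], y)
print(r2_score(y, full.predict(pairs[full_cols])))  # the ~0.47 mentioned above
```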

That’s actually interesting.  If the crew at BP had discovered that and spent time investigating the causes, instead of spending time coming up with ways to bullshit everybody that a metric that treats a ground ball to first as a missed play for the left fielder really does outperform Statcast, we might have all learned something useful.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield… and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: On a team level, team OAA = range + conversion and team FRAA = team OAA + positioning-based difficulty relative to average.  On a player level, player OAA = range + conversion and player FRAA = player OAA + positioning + teammate noise.
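To make that abstraction concrete, here’s a deliberately cartoonish toy (not BP’s actual formula, just the structure of the problem): an exactly-average fielder whose pitchers allow an extra pile of uncatchable balls near him.

```python
league_out_rate = 0.70  # made-up league-average out rate on the balls charged to him

def toy_oaa(made, catchable, out_rate):
    # OAA-style: credit measured only against real chances
    return made - out_rate * catchable

def toy_fraa(made, catchable, uncatchable, out_rate):
    # FRAA-style: every ball charged to him is assumed to be average difficulty
    return made - out_rate * (catchable + uncatchable)

made, catchable = 210, 300  # converts exactly the league rate: 210/300 = 0.70
for uncatchable in (0, 100):
    print(uncatchable,
          toy_oaa(made, catchable, league_out_rate),
          toy_fraa(made, catchable, uncatchable, league_out_rate))
# With 100 extra uncatchable balls, toy-OAA stays at 0.0 while toy-FRAA drops to -70.0.
```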

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but…

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.


Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:

outs/BIP =
player performance given the difficulty of the opportunity (OAA) +
smart/dumb fielder positioning (a front-office/manager skill in 2019) +
quality of contact allowed (a batter/pitcher skill) +
luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill+luck, it benefits from “knowing” the luck in the plays it’s trying to predict.  In this case, luck being whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.
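If it isn’t obvious why grading a rating against the same plays it was built from rewards noise, here’s a tiny simulation with made-up numbers: a rating that soaks up this season’s luck “wins” the same-season test and loses out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000                                  # "players", equal skill and luck variance (assumed)
skill = rng.normal(0, 1.0, n)
luck_this_year = rng.normal(0, 1.0, n)
luck_next_year = rng.normal(0, 1.0, n)

observed_this_year = skill + luck_this_year
observed_next_year = skill + luck_next_year

pure_skill_rating = skill                 # stand-in for a rating that excludes the luck
luck_soaked_rating = observed_this_year   # stand-in for a rating built from the same plays

def r2(a, b):
    return np.corrcoef(a, b)[0, 1] ** 2

print(r2(luck_soaked_rating, observed_this_year))  # ~1.0 -> "wins" the same-season test
print(r2(pure_skill_rating, observed_this_year))   # ~0.5
print(r2(luck_soaked_rating, observed_next_year))  # ~0.25 -> much worse out of sample
print(r2(pure_skill_rating, observed_next_year))   # ~0.5
```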

Richard Epstein’s “coronavirus evolving to weaker virulence” reduced death toll argument is remarkably awful

Isaac Chotiner buried Epstein in this interview, but he understandably didn’t delve into the conditions necessary for Epstein’s “evolving to weaker virulence will reduce the near-term death toll” argument to be true.  I did.  It’s bad.  Really bad.  Laughably bad… if you can laugh at something that might be playing a part in getting people killed.

TL;DR He thinks worldwide COVID-19 cases will cap out well under 1 million for this wave, and one of the reasons is that the virus will evolve to be less virulent.  Unlike The Andromeda Strain, whose ending pissed me off when I read it in junior high, a virus doesn’t mutate the same way everywhere all at once.  There’s one mutation event at a time in one host (person) at a time, and the mutated virus starts reproducing and spreading through the population like what’s seen here.  The hypothetical mutated weak virus can only have a big impact reducing total deaths if it can quickly propagate to a scale big enough to significantly impede the original virus (by granting immunity from the real thing, presumably).  If the original coronavirus only manages to infect under 1 million people worldwide in this wave, how in the hell is the hypothetical mutated weak coronavirus supposed to spread to a high enough percentage of the population -and even faster- to effectively vaccinate them from the real thing, even with a supposed transmission advantage?***  Even if it spread to 10 million worldwide over the same time frame (which would be impressive since there’s no evidence that it even exists right now…), that’s a ridiculously small percentage of the potentially infectable in hot zones.  It would barely matter at all until COVID/weak virus saturation got much higher, which only happens at MUCH higher case counts.
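To put rough numbers on that last point (every figure here is an illustrative assumption, and the naive (1 - p) * R0 adjustment is, if anything, generous to Epstein):

```python
R0 = 2.5                     # assumed reproduction number for the original virus
hot_zone_population = 500e6  # rough stand-in for the at-risk population in hot zones
weak_infections = 10e6       # the generous hypothetical from above

immune_fraction = weak_infections / hot_zone_population
Rt = R0 * (1 - immune_fraction)
print(f"{immune_fraction:.0%} cross-immune -> Rt drops from {R0} to {Rt:.2f}")
# ~2% immunity: essentially no brake on the original epidemic.
```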

That line of argumentation is utterly and completely absurd alongside a well-under-1 million worldwide cases projection.


***The average onset of symptoms is ~5 days from infection and the serial interval (time from symptoms in one person to symptoms in a person they infected) is also only around 5 days, meaning there’s a lot of transmission going on before people even show any symptoms.  Furthermore, based on this timeline, which AFAIK is still roughly correct

[Image: coronavirus median timeline infographic]

there’s another week on average to transmit the virus before the “strong virus” carriers are taken out of commission, meaning Epstein’s theory only gives a “transmission advantage” for the hypothetical weak virus more than 10 days after infection on average.  And, oh yeah, 75-85% of “strong” infections never wind up in the hospital at all, so by Epstein’s theory, they’ll transmit at the same rate.  There simply hasn’t been/isn’t now much room for a big transmission advantage for the weak virus under his assumptions.  And to reiterate, no evidence that it actually even exists right now.

Dave Stieb was good

Since there’s nothing of any interest going on in the country or the world today, I decided the time was right to defend the honour of a Toronto pitcher from the 80s.  Looking deeper into this article, https://www.baseballprospectus.com/news/article/57310/rubbing-mud-dra-and-dave-stieb/ which concluded that Stieb was actually average or worse rate-wise, I found that many of the assertions are… strange.

First, there’s the repeated assertion that Stieb’s K and BB rates are bad.  They’re not.  He pitched to basically dead average defensive catchers, and weighted by the years Stieb pitched, he’s actually marginally above the AL average.  The one place where he’s subpar, hitting too many batters, isn’t even mentioned.  This adds up to a profile of

K/9 BB/9 HBP/9
AL Average 5.22 3.28 0.20
Stieb 5.19 3.21 0.40

Accounting for the extra HBPs, these components account for about 0.05 additional ERA over league average, or ~1%.  Without looking at batted balls at all, Stieb would only be 1% worse than average (AL and NL are pretty close pitcher-quality wise over this timeframe, with the AL having a tiny lead if anything).  BP’s version of FIP- (cFIP) has Stieb at 104.  That doesn’t really make any sense before looking at batted balls, and Stieb only allowed a HR/9 of 0.70 vs. a league average of 0.88.  He suppressed home runs by 20%- in a slightly HR-friendly park- over 2900 innings, combined with an almost dead average K/BB profile, and BP rates his FIP as below average.  That is completely insane.
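A quick sanity check using the standard public FIP weights (13/3/2, with HBP lumped in with walks); the league constant cancels out of the comparison, so only the component term matters:

```python
def fip_components(hr9, bb9, hbp9, k9):
    # (13*HR + 3*(BB+HBP) - 2*K) per inning, expressed from per-9 rates
    return (13 * hr9 + 3 * (bb9 + hbp9) - 2 * k9) / 9

stieb = fip_components(hr9=0.70, bb9=3.21, hbp9=0.40, k9=5.19)
league = fip_components(hr9=0.88, bb9=3.28, hbp9=0.20, k9=5.22)

print(f"Stieb component term:  {stieb:.2f}")   # ~1.06
print(f"League component term: {league:.2f}")  # ~1.27
print(f"Stieb better by ~{league - stieb:.2f} runs/9 before any constant")
```

That’s roughly 0.2 runs per 9 better than league average on the FIP components themselves, which is hard to square with a cFIP of 104.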

The second assertion is that Stieb relied too much on his defense.  We can see from above that an almost exactly average percentage of his PAs ended with balls in play, so that part falls flat, and while Toronto did have a slightly above-average defense, it was only SLIGHTLY above average.  Using BP’s own FRAA numbers, Jays fielders were only 236 runs above average from 79-92, and prorating for Stieb’s share of IP, they saved him 24 runs, or a 0.08 lower ERA (sure, it’s likely that they played a bit better behind him and a bit worse behind everybody else).  Stieb’s actual ERA was 3.44 and his DRA is 4.43- almost one full run worse- and the defense was only a small part of that difference.  Even starting from Stieb’s FIP of 3.82, there’s a hell of a long way to go to get up to 4.43, and a slightly good defense isn’t anywhere near enough to do it.

Stieb had a career BABIP against of .260 vs. AL average of .282, and the other pitchers on his teams had an aggregate BABIP of .278.  That’s more evidence of a slightly above-average defense, suppressing BABIP a little in a slightly hitter-friendly home park, but Stieb’s BABIP suppression goes far beyond what the defense did for everybody else.  It’s thousands-to-1 against a league-average pitcher suppressing HR as much as Stieb did.  It’s also thousands-to-1 against a league-average pitcher in front of Toronto’s defense suppressing BABIP as much as Stieb did.  It’s exceptionally likely that Stieb actually was a true-talent soft contact machine.  Maybe not literally to his career numbers, but the best estimate is a hell of a lot closer to career numbers than to average after 12,000 batters faced.
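The thousands-to-1 figures are easy to ballpark.  Here’s the BABIP one, with the sample size assumed from ~12,000 batters faced and roughly 70% of them ending in a ball in play:

```python
from scipy.stats import binom

bip = int(12000 * 0.70)  # ~8,400 balls in play (assumed share of PAs)
true_babip = 0.278       # his teammates' BABIP in front of the same defense
observed_babip = 0.260

hits_needed = int(observed_babip * bip)
p = binom.cdf(hits_needed, bip, true_babip)
print(f"P(BABIP <= .260 over {bip} BIP if true talent is .278) = {p:.5f}")
# Comes out around 1 in 8,000-10,000 with these assumptions, i.e. thousands-to-1 against.
```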

This is kind of DRA and DRC in a microcosm.  It can spit out values that make absolutely no sense at a quick glance, like a league-average K/BB guy with great HR suppression numbers grading out with a below-average cFIP, and it struggles to accept outlier performance on balls in play, even over gigantic samples, because the season-by-season construction is completely unfit for purpose when used to describe a career.  That’s literally the first thing I wrote when DRC+ was rolled out, and it’s still true here.

A 16-person format that doesn’t suck

This is in response to the Magic: the Gathering World Championship that just finished, which featured some great Magic played in a highly questionable format.  It had three giant flaws:

  1. It buried players far too quickly.  Assuming every match was a coinflip, each of the 16 players started with a 6.25% chance to win.  Going 2-0 or 2-1 in draft meant you were just over 12% to win and going 0-2 or 1-2 in draft meant you were just under 0.5% (under 1 in 200) to win.  Ouch.  In turn, this meant we were watching a crapload of low-stakes games and the players involved were just zombies drawing to worse odds than a 1-outer even if they won.
  2. It treated 2-0 and 2-1 match record in pods identically.  That’s kind of silly.
  3. The upper bracket was Bo1 matches, with each match worth >$100,000 in equity.  The lower bracket was Bo3 matches, with encounters worth 37k (lower round 1), 49k, 73k, and 97k (lower finals).  Why were the more important matches more luck-based?


and the generic flaw that the structure just didn’t have a whole lot of play to it.  92% of the equity was accounted for on day 1 by players who already made the upper semis with an average of only 4.75 matches played, and the remaining 12 players were capped at 9 pre-bracket matches with an average of only 6.75 played.

Whatever the format is, it needs to try to accomplish several things at the same time:

  1. Fit in the broadcast window
  2. Pair people with equal stakes in the match (avoid somebody on a bubble playing somebody who’s already locked or can’t make it, etc)
  3. Try not to look like a total luckbox format- it should take work to win AND work to get eliminated
  4. Keep players alive and playing awhile and not just by having them play a bunch of zombie magic with microscopic odds of winning the tournament in the end
  5. Have matches with clear stakes and minimize the number with super-low stakes, AKA be exciting
  6. Reward better records pre-bracket (2-0 is better than 2-1, etc)
  7. Minimize win-order variance, at least before an elimination bracket (4-2 in the M:tG Worlds format could be upper semis (>23% to win) or lower round 1 (<1% to win) depending on result ordering).  Yikes.
  8. Avoid tiebreakers
  9. Matches with more at stake shouldn’t be shorter (e.g. Bo1 vs Bo3) than matches with less at stake.
  10. Be comprehensible

 

To be clear, there’s no “simple” format that doesn’t fail one of the first 4 rules horribly. Swiss has huge problems with point 2 late in the event, as well as tiebreakers.  Round robin is even worse.  16-player double elimination, or structures isomorphic to that (which the M:tG format was), bury early losers far too quickly, plus most of the games are between zombies.  Triple elimination (or more) Swiss runs into a hell match that can turn the pairings into nonsense with a bye if it goes the wrong way.  Given that nobody could understand this format, even though it was just a dressed-up 16-player double-elim bracket, and any format that doesn’t suck is going to be legitimately more complicated than that, we’re just going to punt on point 10 and settle for anything simpler than the tax code if we can make the rest of it work well.  And I think we can.

Hareeb Format for 16 players:

Day 1:

Draft like the opening draft in Worlds (win-2-before-lose-2).  The players will be split into 4 four-player pods based on record (2-0, 2-1, 1-2, 0-2).

Each pod plays a win-2-before-lose-2 of constructed.  The 4-0 player makes top-8 as the 1-seed.  The 0-4 player is eliminated in 16th place.

The two 4-1 players play a qualification match of constructed.  The winner makes top-8 as the #2 seed.  The two 1-4 players play an elimination match of constructed.  The loser is eliminated in 15th place.

This leaves 4 players with a winning record (Group A tomorrow), 4 players with an even record (2-2 or 3-3) (Group B tomorrow), and 4 players with a losing record (Group C tomorrow).
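For anyone unfamiliar with win-2-before-lose-2 pods, here’s a minimal sketch of the mechanics I’m assuming (round 1, then winners play winners and losers play losers, then the two 1-1 players play off), with coinflip matches:

```python
import random

def coinflip(p1, p2):
    # returns (winner, loser) of a 50/50 match
    return (p1, p2) if random.random() < 0.5 else (p2, p1)

def play_pod(players):
    """One win-2-before-lose-2 pod; everyone leaves at 2-0, 2-1, 1-2, or 0-2."""
    random.shuffle(players)
    w1, l1 = coinflip(players[0], players[1])
    w2, l2 = coinflip(players[2], players[3])
    two_oh, mid_a = coinflip(w1, w2)   # winner is 2-0 and done
    mid_b, oh_two = coinflip(l1, l2)   # loser is 0-2 and done
    two_one, one_two = coinflip(mid_a, mid_b)
    return {two_oh: "2-0", two_one: "2-1", one_two: "1-2", oh_two: "0-2"}

print(play_pod(["P1", "P2", "P3", "P4"]))
```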

Day 2:

Each group plays a win-2-before-lose-2 of constructed, and instead of wall-of-texting the results, it’s easier to show them graphically and to see that something is at stake with every match in every group.

[Image: hareebworldsformat]

 

with the losers of the first round of the lower play-in finishing 11th-12th and the losers of the second round finishing 9th-10th.  So now we have a top-8 bracket seeded.  The first round of the top-8 bracket should be played on day 2 as well, broadcast willing (2 of the matches are available after the upper play-in while the 7-8 seeds are still being decided, so it’s only extending by ~1 round for 7-8 “rounds” total).


Before continuing, I want to show the possible records of the various seeds.  The #1 seed is always 4-0 and the #2 seed is always 5-1.  The #3 seed will either be 6-2 or 5-2.  The #4 seed will either be 5-2, 6-3, or 7-3.  In the exact case of 7-3 vs 5-2, the #4 seed will have a marginally more impressive record, but since the only difference is being on the same side of the bracket as the 4-0 instead of the 5-1, it really doesn’t matter much.

The #5-6 seeds will have a record of 7-4, 6-4, 5-4, or 5-3, a clean break from the possible top 4 records.  The #7-8 seeds will have winning or even records and the 9th-10th place finishers will have losing or even records. This is the only meaningful “tiebreak” in the system.  Only the players in the last round of the lower play-in can finish pre-bracket play at .500.  Ideally, everybody at .500 will either all advance or all be eliminated, or there just won’t be anybody at .500.  Less ideally, but still fine, either 2 or 4 players will finish at .500, and the last round of the lower play-in can be paired so that somebody one match above .500 is paired against somebody one match below .500.  In that case, the player who advances at .500 will have just defeated the eliminated player in the last round.  This covers 98% of the possibilities.  2% of the time, exactly 3 players will finish at .500.  Two of them will have just played a win-and-in against each other, and the other .500 player will have advanced as a #7-8 seed with a last-round win or been eliminated 9th-10th with a last-round loss.

As for the top-8 bracket itself, it can go a few ways.  It can’t be Bo1 single elim, or somebody could get knocked out of Worlds losing 1 match, which is total BS (point 3), plus any system where going 4-1 can land you in 5th-8th place in a 16-player event is automatically a horseshit system.  Even 5-2 or 6-3 finishing 5th-8th (Bo3 or Bo5 single elim) is crap, but if we got to 4-3 or 5-4 finishing 7th-8th place, that’s totally fine.  It also takes at least 5 losses pre-bracket (or an 0-4 start) to get eliminated there, so it should take some work here too.  And we still need to deal with the top-4 having better records than 5-8 without creating a bunch of zombie Magic.  There’s a solution that solves all of this reasonably well at the same time IMO.

Hareeb format top-8 Bracket:

  1. Double-elimination bracket
  2. All upper bracket matchups are Bo3 matches
  3. In the upper quarters, the higher-seeded player starts up 1 match
  4. Grand finals are Bo5 matches with the upper-bracket representative starting up 1-0 (same as we just did)
  5. Lower bracket matches before lower finals are Bo1 (necessary for timing unless we truly have all day)
  6. Lower bracket finals can be Bo1 match or Bo3 matches depending on broadcast needs.  (Bo1 lower finals is max 11 sequential matches on Sunday, which is the same max we had at Worlds.  If there’s time for a potential 13, lower finals should definitely be Bo3 because they’re actually close to as important as upper-bracket matches, unlike the rest of the lower bracket)
  7. The more impressive match record gets the play-draw choice in the first game 1, then if Bo3/5, the loser of the previous match gets the choice in the next game 1. (if tied, head to head record decides the first play, if that’s tied, random)

 

This keeps the equity a lot more reasonably dispersed (I didn’t try to calculate play advantage throughout the bracket, but it’s fairly minor).  This format is a game of accumulating equity throughout the two days instead of 4 players hoarding >92% of it after day 1 and 8 zombies searching for scraps. Making the top 8 as a 5-8 seed is worth a bit more than the 6.25% pre-tournament win probability under this format, instead of the 1.95% it was worth in the Worlds format.

[Image: hareebtop8]

As far as win% at previous stages goes,

  1. Day 2 qualification match: 11.60%
  2. Upper play-in: 5.42%
  3. Lower play-in round 2: 3.61%
  4. Lower play-in round 1: 1.81%
  5. Day 2 elimination match: 0.90%

  1. Day 2 Group A: 9.15%
  2. Day 2 Group B: 4.93%
  3. Day 2 Group C: 2.03%

  1. Day 1 qualification match: 13.46%
  2. Day 1 2-0 Pod: 11.32%
  3. Day 1 2-1 Pod: 7.39%
  4. Day 1 1-2 Pod: 4.28%
  5. Day 1 0-2 Pod: 2.00%
  6. Day 1 Elimination match: 1.02%

2-0 in the draft is almost as good as before, but 2-1 and 1-2 are much more modest changes, and going 0-2 preserves far more equity (2% vs <0.5%).  Even starting 1-4 in this format has twice as much equity as starting 1-2 in the Worlds format.  It’s not an absolutely perfect format or anything- given enough tries, somebody will Javier Dominguez it and win going 14-10 in matches- but the equity changes throughout the stages feel a lot more reasonable here while maintaining perfect stake-parity in matches, and players get to play longer before being eliminated, literally or virtually.

Furthermore, while there’s some zombie-ish Magic in the 0-2 pod and Group C (although still nowhere near as bad as the Worlds format), it’s simultaneous with important matches so coverage isn’t stuck showing it.  Saturday was the upper semis (good) and a whole bunch of nonsense zombie matches (bad), because that’s all that was available, but there’s always something meaningful to be showing in this format. It looks like it fits well enough with the broadcast parameters this weekend as well, with 7 “rounds” of coverage the first day and 8 the second (or 6 and 9 if that sounds nicer), and a same/similar maximum number of matches on Sunday to what we had for Worlds.

It’s definitely a little more complicated, but it’s massive gains in everything else that matters.

*** The draft can always be paired without rematches.  For a pod, group, upper play-in, lower play-in, or loser’s bracket round 1, look at the 3 possible first-round pairings, minimize total times those matchups have been seen, then minimize total times those matchups have been seen in constructed, then choose randomly from whatever’s tied.  For assigning 5-6 seeds or 7-8 seeds in the top-8 bracket or pairings in lower play-in round 2 or loser’s round 2, do the same considering the two possible pairings, except for the potential double .500 scenario in lower play-in round 2 which must be paired.
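A sketch of that pairing rule for one 4-player group (the data structures here are my own invention, just to show the minimize-then-minimize-then-random logic):

```python
import random
from collections import Counter

def pair_group(players, met, met_constructed):
    """players: 4 ids; met / met_constructed: Counters keyed by frozenset({a, b})."""
    a, b, c, d = players
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(option):
        overall = sum(met[frozenset(m)] for m in option)
        constructed = sum(met_constructed[frozenset(m)] for m in option)
        return (overall, constructed)  # minimize total meetings, then constructed meetings

    best = min(cost(o) for o in options)
    return random.choice([o for o in options if cost(o) == best])

met = Counter({frozenset({"P1", "P2"}): 1})  # P1 and P2 already met in the draft
print(pair_group(["P1", "P2", "P3", "P4"], met, Counter()))  # never the P1 vs P2 rematch
```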

 

Postseason Baseballs and Bauer Units

With the new MLB report on baseball drag leaving far more questions than answers as far as the 2019 postseason goes, I decided to look at the mystery from a different angle.  My hypothesis was a temporary storage issue with the baseballs, and it’s not clear, without knowing more about the methodology of the MLB testing, whether or not that’s still plausible.  But what I did find in pursuit of that was absolutely fascinating.

I started off by looking at the Bauer units (RPM/MPH) on FFs thrown by starting pitchers (only games with >10 FF used) and comparing to the seasonal average of Bauer units in their starts.  Let’s just say the results stand out.  This is excess Bauer units by season week, with the postseason all grouped into week 28 each year (2015 week 27 didn’t exist and is a 0).

excessbauerunits

And the postseasons are clearly different, with the wildcard and divisional rounds in 2019 clocking in at 0.64 excess Bauer units before coming back down closer to normal.  For the 2019 postseason as a whole, +1.7% in Bauer units.  Since Bauer units are RPM/MPH, it was natural to see what drove the increase.  In the 2019 postseason, vFF was up slightly, 0.4mph over seasonal average, which isn’t unexpected given shorter hooks and important games, but the spin was up +2.2% or almost 50 RPM.  That’s insane.  Park effects on Bauer units are small, and TB was actually the biggest depressor in 2019.

In regular season games, variation in Bauer units from seasonal average was almost *entirely* spin-driven.  The correlation to spin was 0.93 and to velocity was -0.04.  Some days pitchers can spin the ball, and some days they can’t.  I couldn’t be much further from a professional baseball pitcher, but it doesn’t make any sense to me that that’s a physical talent that pitchers leave dormant until the postseason… well, 3 out of 5 postseasons.  Spin rate is normally only very slightly lower late in games, so it’s not an effect of cutting out the last inning of a start or anything like that.
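The per-start comparison itself is nothing fancy; a sketch with hypothetical Statcast-style column names, assuming one season of pitch-level data per file:

```python
import pandas as pd

pitches = pd.read_csv("statcast_pitches_2019.csv")  # hypothetical pitch-level file
ff = pitches[pitches["pitch_type"] == "FF"].copy()
ff["bauer_units"] = ff["release_spin_rate"] / ff["release_speed"]

starts = (ff.groupby(["pitcher", "game_date"], as_index=False)
            .agg(n=("bauer_units", "size"),
                 bu=("bauer_units", "mean"),
                 spin=("release_spin_rate", "mean"),
                 velo=("release_speed", "mean")))
starts = starts[starts["n"] > 10].copy()  # only games with >10 FF

for col in ("bu", "spin", "velo"):
    starts["excess_" + col] = starts[col] - starts.groupby("pitcher")[col].transform("mean")

# Within-season variation in Bauer units: driven by spin, not velocity
print(starts[["excess_bu", "excess_spin", "excess_velo"]].corr())
```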

Assuming pitchers are throwing fastballs similarly, which vFF seems to indicate, what could cause the ball to spin more?  One answer is sticky stuff on the ball, as Bauer himself knows, but it’s odd to only dial it up in the postseason- all of the numbers are comparisons to the pitcher’s own seasonal averages. Crap on the ball may increase the drag coefficient (AFAIK), so it’s not a crazy explanation.

Another one is that the balls are simply smaller.  It’s easier to spin a slightly smaller ball because the same contact path along the hand is more revolutions and imparts more angular velocity to the ball.  Likewise, it should be slightly easier to spin a lighter ball because it takes less torque to accelerate it to the same angular velocity.  Neither one of these would change the ACTUAL drag coefficient, but they would both APPEAR to have a higher drag coefficient in pitch trajectory (and ball flight) measurement and fly poorly off the bat.  Taking a same-size-but-less-dense ball, the drag force on a pitch would be (almost) the same, but since F=ma, and m is smaller, the measured drag acceleration would be higher, and since the calculations don’t know the ball is lighter, they think that the bigger acceleration actually means a higher drag force and therefore a higher Cd, and it comes out the same in the end for a smaller ball.
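The lighter-ball arithmetic in a couple of lines; the masses are illustrative assumptions (145 g nominal, and the ~2 g moisture loss mentioned below), and the point is the ratio, not the exact grams:

```python
m_assumed = 145.0  # grams, nominal mass the pitch-tracking math assumes
m_actual = 143.0   # grams, a ball that lost ~2 g of moisture

# Same actual drag force F on both -> measured deceleration a = F / m_actual is larger,
# and the imputed Cd scales with m_assumed * a, i.e. with m_assumed / m_actual
cd_inflation = m_assumed / m_actual
print(f"imputed Cd inflated by ~{(cd_inflation - 1) * 100:.1f}%")  # ~1.4%
```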

Both of those explanations seem plausible as well, not knowing exactly what the testing protocol was with postseason balls, and they could be the result of manufacturing differences (smaller balls) or possibly temporary storage in low humidity (lighter balls).  Personal experiments with baseballs rule both of these in.  Individual balls I’ve measured come with more than enough weight/cross-section variation for a bad batch to put up those results.  The leather is extremely responsive to humidity changes (it can gain more than 100% of its dry weight in high-humidity conditions), and losing maybe 2 grams of moisture from the exterior parts of the ball is enough to spike the imputed drag coefficient the observed amount without changing the CoR much, and that’s well within the range of temporary crappy storage.  It’s possible that they’re both ruled out by the MLB-sponsored analysis, but they didn’t report a detailed enough methodology for me to know.

The early-season “spike” is also strange.  Pitchers throw FFs a bit more slowly than seasonal average but also with more spin, which makes absolutely no sense without the baseballs being physically different then as well, just to a lesser degree (or another short-lived pine tar outbreak, I guess).

It could be something entirely different from my suggestions, maybe something with the surface of the baseball causing it to be easier to spin and also higher drag, but whatever it is, given the giant spin rate jump at the start of the postseason, and the similarly odd behavior in other postseasons, it seems highly unlikely to me that the imputed drag coefficient spike and the spin rate spike don’t share a common cause.  Given that postseason balls presumably don’t follow the same supply chain- they’re processed differently to get the postseason stamp, they’re not sent to all parks, they may follow a different shipping path, and they’re most likely used much sooner after ballpark arrival, on average, than regular-season balls, this strongly suggests manufacture/processing or storage issues to me.