RLCS X viewership is being manipulated

TL;DR Legit EU regionals weekend 1 viewership was still fine, but it never cracked 80k even though the reported peak was over 150k.

Psyonix was cooperative enough to leave the Sunday stream shenanigan-free, so we have a natural comparison between the two days.  Both the Saturday and Sunday streams had the same stakes- half of the EU teams played each day trying to qualify for next weekend- so one would expect viewership each day to be pretty similar, and in reality, this was true.  Twitch displays total viewers for everyone to see, the number of people in the chatroom is available through the API, and I track both.  (NOT the number of people actually chatting- you’re in the chatroom if you’re logged in, viewing normally, and didn’t go out of your way to close chat.  The fullscreen button doesn’t do it, and it appears that closing chat while viewing in squad mode doesn’t remove you from the chatroom either.)  Only looking at people in the chatroom on Saturday and Sunday gives the following:


That’s really similar, as expected.  Looking at people not in the chatroom- the difference between the two numbers- tells an entirely different story.


LOL.  Alrighty then.  Maybe there’s a slight difference?

Large streams average around 70% of total viewers in chat.  Rocket League, because of drops, averages a bit higher than that.  More people make Twitch accounts and watch under those accounts to get rewards.  Sunday’s stream is totally in line with previous RLCS events and with the big events in the past offseason.  Saturday’s stream is… not.  On top of the giant difference in magnitude, the Sunday number/percentage is pretty stable, while the Saturday number bounces all over the place.  Actual people come and go in approximately equal ratios whether they’re logged in or not.  At the very end of Saturday’s stream, it was on the Twitch frontpage, which boosts the not-in-chat count, but that doesn’t explain the rest of the time.
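For concreteness, here’s the arithmetic being run on each snapshot.  The numbers below are made up for illustration (the real counts come from polling Twitch), but they show the shape of the anomaly: the chatroom count stays normal while the total balloons.

```python
def chat_split(total_viewers: int, in_chat: int) -> tuple[int, float]:
    """Return (viewers not in the chatroom, share of viewers in chat)."""
    not_in_chat = total_viewers - in_chat
    return not_in_chat, in_chat / total_viewers

# A normal large stream: ~70% of viewers sit in the chatroom.
normal = chat_split(80_000, 56_000)      # -> (24000, 0.7)

# An embedded/botted stream: same chatroom count, inflated total,
# so the in-chat share craters.
suspect = chat_split(150_000, 56_000)    # -> (94000, ~0.37)
```

A stable ~70% share day after day is what organic viewership looks like; a share bouncing around 40% with a normal-sized chatroom is what Saturday looked like.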

The answer is that the Saturday stream was embedded somewhere outside of Twitch.  Twitch allows a muted autoplay stream video in a random webpage to count as a viewer even if the user never interacts with the stream (a prominent media source said otherwise last year.  He was almost certainly wrong then, and I’ve personally tested in the past 2 months that he’s wrong now), and, modern society being what it is, services and ad networks exist to place muted streams where ~nobody pays any attention to them to boost apparent viewcount, and publishers pay 5 figures to appear to be more popular than they are.  Psyonix/RLCS was listed as a client of one of these services before and has a history of pulling bullshit and buying fake viewers.  There’s a nice article on Kotaku detailing this nonsense across more esports.

If the stream were embedded somewhere it belonged, instead of as an advertisement to inflate viewcount, it’s hard to believe it also wouldn’t have been active on Sunday, so it’s very likely they’re just pulling bullshit and buying fake viewers again.  If somebody at Psyonix comments and explains otherwise, I’ll update the post, but don’t hold your breath.  Since a picture speaks a thousand words:



Nate Silver vs AnEpidemiolgst

This beef started with this tweet https://twitter.com/AnEpidemiolgst/status/1258433065933824008

which is just something else for multiple reasons.  Tone policing a neologism is just stupid, especially when it’s basically accurate.  Doing so without providing a preferred term is even worse.  But, you know, I’m probably not writing a post just because somebody acted like an asshole on twitter.  I’m doing it for far more important reasons, namely:


And in this particular case, it’s not Nate.  She also doubles down with https://twitter.com/JDelage/status/1258452085428928515

which is obviously wrong, even for a fuzzy definition of “meaningfully”, if you stop and think about it.  R0 is a population average.  Some people act like hermits and have little chance of spreading the disease much if they somehow catch it.  Others have far, far more interactions than average and are at risk of being superspreaders if they get an asymptomatic infection (or are symptomatic assholes).  These average out to R0.

Now, when 20% of the population is immune (assuming they develop immunity after infection, blah blah), who is it going to be?  By definition, it’s people who already got infected.  Who got infected?  Obviously, for something like COVID, it’s weighted so that >>20% of potential superspreaders were already infected and <<20% of hermits were infected.  That means that far more than the naive 20% of the interactions infected people have now are going to be with somebody who’s already immune (the exact number depending on the shape and variance of the interaction distribution), and so Rt is going to be much less than (1 – 0.2) * R0 at 20% immune, or in ELI5 language, 20% immune implies a lot more than a 20% decrease in transmission rate for a disease like COVID.
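As a toy numerical check of the paragraph above- the lognormal contact distribution here is my assumption, not data; only the fat right tail matters for the argument.  Calibrate individual infection risk so 20% of the population ends up immune, then measure what share of *contacts* now land on immune people:

```python
import math
import random

random.seed(0)

# ASSUMED contact-rate distribution: most people have few contacts,
# a minority are potential superspreaders.
N = 50_000
contacts = [random.lognormvariate(0, 1.0) for _ in range(N)]

def immune_fraction(E):
    """Fraction immune if individual infection risk is 1 - exp(-E * c)."""
    return sum(1 - math.exp(-E * c) for c in contacts) / N

# Bisect for the exposure level E that leaves 20% of people immune.
lo, hi = 0.0, 10.0
for _ in range(40):
    mid = (lo + hi) / 2
    if immune_fraction(mid) < 0.20:
        lo = mid
    else:
        hi = mid
E = (lo + hi) / 2

# The share of contact-weighted interactions that now hit an immune
# person -- this, not the raw 20%, is what scales transmission.
total_c = sum(contacts)
immune_c = sum(c * (1 - math.exp(-E * c)) for c in contacts)
contact_share_immune = immune_c / total_c
print(contact_share_immune)   # well above 0.20 for any fat-tailed distribution
```

Under the (hedged) assumption that both catching and spreading scale with contact rate, the transmission reduction at 20% immune is this contact-weighted share, which comes out far above 20% for anything with a heavy tail.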

This is completely obvious, but somehow junk like this is being put out by Johns Hopkins of all places.  Right-wing deliberate disinformation is bad enough, but professionals responding with obvious nonsense really doesn’t help the cause of truth.  Please tell me the state of knowledge/education in this field isn’t truly that primitive.  Or ship me a Nobel Prize in medicine, I’m good either way.

Don’t use FRAA for outfielders

TL;DR OAA is far better, as expected.  Read after the break for next-season OAA prediction/commentary.

As a followup to my previous piece on defensive metrics, I decided to retest the metrics using a sane definition of opportunity.  BP’s study defined a defensive opportunity as any ball fielded by an outfielder, which includes completely uncatchable balls as well as ground balls that made it through the infield.  The latter are absolute nonsense, and the former are pretty worthless.  Thanks to Statcast, a better definition of defensive opportunity is available- any ball Statcast gives a nonzero catch probability and assigns to an OF.  Because Statcast doesn’t provide catch probability/OAA on individual plays, we’ll be testing each outfielder in aggregate.

Similarly to what BP tried to do, we’re going to try to describe or predict each OF’s outs/opportunity, and we’re testing the 354 qualified OF player-seasons from 2016-2019.  Our contestants are Statcast’s OAA/opportunity, UZR/opportunity, FRAA/BIP (what BP used in their article), simple average catch probability (with no idea if the play was made or not), and positional adjustment (effectively the share of innings in CF, corner OF, or 1B/DH).  Because we’re comparing all outfielders to each other, and UZR and FRAA compare each position separately, those two received the positional adjustment (they grade quite a bit worse without it, as expected).

Using data from THE SAME SEASON (see previous post if it isn’t obvious why this is a bad idea) to describe that SAME SEASON’s outs/opportunity, which is what BP was testing, we get the following correlations:

Metric r^2 to same-season outs/opportunity
OAA/opp 0.74
UZR/opp 0.49
Catch Probability + Position 0.43
Catch Probability 0.32
Positional adjustment/opp 0.25

OAA wins running away, UZR is a clear second, background information is 3rd, and FRAA is a distant 4th, barely ahead of raw catch probability.  And catch probability shouldn’t be that important.  It’s almost independent of OAA (r=0.06) and explains much less of the outs/opp variance.  Performance on opportunities is a much bigger driver than difficulty of opportunities over the course of a season.  I ran the same test on the 3 OF positions individually (using Statcast’s definition of primary position for that season), and the numbers bounced a little, but it’s the same rank order and similar magnitude of differences.
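For anybody who wants to rerun this sort of test, the procedure is just a squared Pearson correlation across player-seasons.  The rows below are invented stand-ins (the real test used the 354 qualified player-seasons, each metric divided by opportunities):

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# toy rows: (OAA per opportunity, outs per opportunity)
rows = [(0.02, 0.88), (-0.01, 0.84), (0.05, 0.91), (0.00, 0.86), (-0.03, 0.82)]
oaa  = [r[0] for r in rows]
outs = [r[1] for r in rows]
print(r_squared(oaa, outs))
```

Run the same function once per metric column against the same outs/opportunity column and you get the table above.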

Attempting to describe same-season OAA/opp gives the following:

Metric r^2 to same-season OAA/opportunity
OAA/opp 1
UZR/opp 0.5
Positional adjustment/opp 0.17
Catch Probability 0.004

As expected, catch probability drops way off.  CF opportunities are on average about 1% harder than corner OF opportunities.  Positional adjustment is obviously a skill correlate (Full-time CF > CF/corner tweeners > Full-time corner > corner/1B-DH tweeners), but it’s a little interesting that it drops off compared to same-season outs/opportunity.  It’s reasonably correlated to catch probability, which is good for describing outs/opp and useless for describing OAA/opp, so I’m guessing that’s most of the decline.

Now, on to the more interesting things… Using one season’s metric to predict the NEXT season’s OAA/opportunity (both seasons must be qualified), which leaves 174 paired seasons, gives us the following (players who dropped out were almost average in aggregate defensively):

Metric r^2 to next season OAA/opportunity
OAA/opp 0.45
UZR/opp 0.25
Positional adjustment 0.1
Catch Probability 0.02

FRAA notably doesn’t suck here- although unless you’re a modern-day Wintermute forbidden to know OAA, you should of course just use OAA.  Looking at the residuals from previous-season OAA, UZR is useless, but FRAA and positional adjustment contain a little information- and by a little I mean enough, together, to get the r^2 up to 0.47.  We’ve discussed positional adjustment already and why that makes sense, but FRAA appears to know a little something that OAA doesn’t, and it’s the same story for predicting next-season outs/opp as well.
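The residual check described above is a two-step fit.  All numbers below are made-up stand-ins for the 174 paired seasons; the point is the mechanics:

```python
def simple_fit_residuals(x, y):
    """Residuals of y after an ordinary least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)
    return [b - (my + beta * (a - mx)) for a, b in zip(x, y)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

prev_oaa = [0.03, -0.02, 0.01, 0.04, -0.01, 0.00]   # toy values
fraa     = [2.0, -1.5, 3.0, 1.0, -0.5, 0.5]
next_oaa = [0.02, -0.01, 0.02, 0.03, -0.02, 0.01]

# Fit next-season OAA on previous-season OAA, then ask whether FRAA
# correlates with what's left over.
resid = simple_fit_residuals(prev_oaa, next_oaa)
print(pearson(fraa, resid))
```

A materially nonzero correlation here, on the real data, is exactly the “FRAA knows a little something OAA doesn’t” result.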

That’s actually interesting.  If the crew at BP had discovered that and spent time investigating the causes, instead of spending time coming up with ways to bullshit everybody that a metric that treats a ground ball to first as a missed play for the left fielder really does outperform Statcast, we might have all learned something useful.

The Baseball Prospectus article comparing defensive metrics is… strange

TL;DR and by strange I mean a combination of utter nonsense tests on top of the now-expected rigged test.

Baseball Prospectus released a new article grading defensive metrics against each other and declared their FRAA metric the overall winner, even though it’s by far the most primitive defensive stat of the bunch for non-catchers.  Furthermore, they graded FRAA as a huge winner in the outfield and Statcast’s Outs Above Average as a huge winner in the infield… and graded FRAA as a dumpster fire in the infield and OAA as a dumpster fire in the outfield.  This is all very curious.  We’re going to answer the three questions in the following order:

  1. On their tests, why does OAA rule the infield while FRAA sucks?
  2. On their tests, why does FRAA rule the outfield while OAA sucks?
  3. On their test, why does FRAA come out ahead overall?

First, a summary of the two systems.  OAA ratings try to completely strip out positioning- they’re only a measure of how well the player did, given where the ball was and where the player started.  FRAA effectively treats all balls as having the same difficulty (after dealing with park, handedness, etc).  It assumes that each player should record the league-average X outs per BIP for the given defensive position/situation and gives +/- relative to that number.

A team allowing a million uncatchable base hits won’t affect the OAA at all (not making a literal 0% play doesn’t hurt your rating), but it will tank everybody’s FRAA because it thinks the fielders “should” be making X outs per Y BIPs.  In a similar vein, hitting a million easy balls at a fielder who botches them all will destroy that fielder’s OAA but leave the rest of his teammates unchanged.  It will still tank *everybody’s* FRAA the same as if the balls weren’t catchable.  An average-performing (0 OAA), average-positioned fielder with garbage teammates will get dragged down to a negative FRAA. An average-performing (0 OAA), average-positioned fielder whose pitcher allows a bunch of difficult balls nowhere near him will also get dragged down to a negative FRAA.

So, in abstract terms: *On a team level, team OAA = range + conversion and team FRAA = team OAA + positioning-based difficulty relative to average.  On a player level, player OAA = range + conversion and player FRAA = player OAA + positioning + teammate noise.*
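To make the teammate-noise point concrete, here’s a toy model.  The 0.75 league out rate and all of the play counts are invented, and real FRAA also adjusts for park, handedness, etc., which this ignores:

```python
LEAGUE_OUT_RATE = 0.75   # made-up league-average out rate on charged balls

def fraa_like(outs_made, balls_charged):
    # expects a league-average out rate on EVERY charged ball
    return outs_made - LEAGUE_OUT_RATE * balls_charged

def oaa_like(outs_made, expected_outs_by_difficulty):
    # expects outs according to the difficulty of each opportunity
    return outs_made - expected_outs_by_difficulty

# An exactly average fielder: 280 balls, difficulty says 210 outs
# expected, and he makes exactly 210.
print(oaa_like(210, 210))     # 0 -- average, as it should be

# Same fielder, but his pitcher also allows 40 uncatchable hits that
# land in his zone.  OAA ignores them (0% catch probability adds zero
# expected outs); the FRAA-style stat charges him for all 320.
print(fraa_like(210, 280))    # 0.0  -- fine with no junk balls
print(fraa_like(210, 320))    # -30.0 -- dragged down by teammate noise
```

Same fielder, same performance, 30 runs of phantom blame: that’s the positioning/teammate noise term in the abstraction.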

Now, their methodology.  It is very strange, and I tweeted at them to make sure they meant what they wrote.  They didn’t reply, it fits the results, and any other method of assigning plays would be in-depth enough to warrant a description, so we’re just going to assume this is what they actually did.  For the infield and outfield tests, they’re using the season-long rating each system gave a player to predict whether or not a play resulted in an out.  That may not sound crazy at first blush, but…

…using only the fielder ratings for the position in question, run the same model type position by position to determine how each system predicts the out probability for balls fielded by each position. So, the position 3 test considers only the fielder quality rate of the first baseman on *balls fielded by first basemen*, and so on.

Their position-by-position comparisons ONLY INVOLVE BALLS THAT THE PLAYER ACTUALLY FIELDED.  A ground ball right through the legs untouched does not count as a play for that fielder in their test (they treat it as a play for whoever picks it up in the outfield).  Obviously, by any sane measure of defense, that’s a botched play by the defender, which means the position-by-position tests they’re running are not sane tests of defense.  They’re tests of something else entirely, and that’s why they get the results that they do.

Using the bolded abstraction above, this is only a test of conversion.  Every play that the player didn’t/couldn’t field IS NOT INCLUDED IN THE TEST.  Since OAA adds the “noise” of range to conversion, and FRAA adds the noise of range PLUS the noise of positioning PLUS the noise from other teammates to conversion, OAA is less noisy and wins and FRAA is more noisy and sucks.  UZR, which strips out some of the positioning noise based on ball location, comes out in the middle.  The infield turned out to be pretty easy to explain.

The outfield is a bit trickier.  Again, because ground balls that got through the infield are included in the OF test (because they were eventually fielded by an outfielder), the OF test is also not a sane test of defense.  Unlike the infield, when the outfield doesn’t catch a ball, it’s still (usually) eventually fielded by an outfielder, and roughly on average by the same outfielder who didn’t catch it.

So using the abstraction, their OF test measures range + conversion + positioning + missed ground balls (that roll through to the OF).  OAA has range and conversion.  FRAA has range, conversion, positioning, and some part of missed ground balls through the teammate noise effect described earlier.  FRAA wins and OAA gets dumpstered on this silly test, and again it’s not that hard to see why, not that it actually means much of anything.

Before talking about the teamwide defense test, it’s important to define what “defense” actually means (for positions 3-9).  If a batter hits a line drive 50 feet from anybody, say a rope safely over the 3B’s head down the line, is it bad defense by 3-9 that it went for a hit?  Clearly not, by the common usage of the word. Who would it be bad defense by?  Nobody could have caught it.  Nobody should have been positioned there.

BP implicitly takes a different approach

So, recognizing that defenses are, in the end, a system of players, we think an important measure of defensive metric quality is this: taking all balls in play that remained in the park for an entire season — over 100,000 of them in 2019 — which system on average most accurately measures whether an out is probable on a given play? This, ultimately, is what matters.  Either you get more hitters out on balls in play or you do not. The better that a system can anticipate that a batter will be out, the better the system is.

that does consider this bad defense.  It’s kind of amazing (and by amazing I mean not the least bit surprising at this point) that every “questionable” definition and test is always for the benefit of one of BP’s stats.  Neither OAA, nor any of the other non-FRAA stats mentioned, are based on outs/BIP or trying to explain outs/BIP.  In fact, they’re specifically designed to do the exact opposite of that.  The analytical community has spent decades making sure that uncatchable balls don’t negatively affect PLAYER defensive ratings, and more generally to give an appropriate amount of credit to the PLAYER based on the system’s estimate of the difficulty of the play (remember from earlier that FRAA doesn’t- it treats EVERY BIP as average difficulty).

The second “questionable” decision is to test against outs/BIP.  Using abstract language again to break this down, outs/BIP = player performance given the difficulty of the opportunity + difficulty of opportunity.  The last term can be further broken down into difficulty of opportunity = smart/dumb fielder positioning + quality of contact allowed (a pitcher who allows an excess of 100mph batted balls is going to make it harder for his defense to get outs, etc) + luck.  In aggregate:


outs/BIP = player performance given the difficulty of the opportunity (OAA)
         + smart/dumb fielder positioning (a front-office/manager skill in 2019)
         + quality of contact allowed (a batter/pitcher skill)
         + luck (not a skill).

That’s testing against a lot of nonsense beyond fielder skill, and it’s testing against nonsense *that the other systems were explicitly designed to exclude*.  It would take the creators of the other defensive systems less time than it took me to write the previous paragraph to run a query and report an average difficulty of opportunity metric when the player was on the field (their systems are all already designed around giving every BIP a difficulty of opportunity score), but again, they don’t do that because *they’re not trying to explain outs/BIP*.

The third “questionable” decision is to use 2019 ratings to predict 2019 outs/BIP.  Because observed OAA is skill+luck, it benefits from “knowing” the luck in the plays it’s trying to predict.  In this case, luck being whether a fielder converted plays at/above/below his true skill level.  2019 FRAA has all of the difficulty of opportunity information baked in for 2019 balls, INCLUDING all of the luck in difficulty of opportunity ON TOP OF the luck in conversion that OAA also has.

All of that luck is just noise in reality, but because BP is testing the rating against THE SAME PLAYS used to create the rating, that noise is actually signal in this test, and the more of it included, the better.  That’s why FRAA “wins” handily.  One could say that this test design is almost maximally disingenuous, and of course it’s for the benefit of BP’s in-house stat, because that’s how they roll.
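The luck-counts-as-signal effect is easy to demonstrate with a toy model: observed outcomes are true skill plus luck, and a rating that bakes in more of the same-sample luck “wins” in-sample while losing out-of-sample.  Everything here is simulated, standard-normal skill and luck:

```python
import random

random.seed(1)

def r2(x, y):
    """Squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

N = 20_000
skill = [random.gauss(0, 1) for _ in range(N)]
this_season = [s + random.gauss(0, 1) for s in skill]   # skill + this year's luck
next_season = [s + random.gauss(0, 1) for s in skill]   # skill + next year's luck

clean_rating = skill         # strips the luck out (what OAA et al. aim for)
noisy_rating = this_season   # same-season luck fully baked in (FRAA-ish)

print(r2(noisy_rating, this_season))   # 1.0 -- the luck counts as "signal"
print(r2(clean_rating, this_season))   # ~0.5
print(r2(clean_rating, next_season))   # ~0.5 -- the clean rating holds up
print(r2(noisy_rating, next_season))   # ~0.25 -- baked-in luck doesn't repeat
```

Grade ratings against the same plays that built them and the noisiest rating wins every time; grade them against new plays and the ranking flips.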

Richard Epstein’s “coronavirus evolving to weaker virulence” reduced death toll argument is remarkably awful

Isaac Chotiner buried Epstein in this interview, but he understandably didn’t delve into the conditions necessary for Epstein’s “evolving to weaker virulence will reduce the near-term death toll” argument to be true.  I did.  It’s bad.  Really bad.  Laughably bad.. if you can laugh at something that might be playing a part in getting people killed.

TL;DR He thinks worldwide COVID-19 cases will cap out well under 1 million for this wave, and one of the reasons is that the virus will evolve to be less virulent.  Unlike The Andromeda Strain, whose ending pissed me off when I read it in junior high, a virus doesn’t mutate the same way everywhere all at once.  There’s one mutation event at a time in one host (person) at a time, and the mutated virus starts reproducing and spreading through the population  like what’s seen here.  The hypothetical mutated weak virus can only have a big impact reducing total deaths if it can quickly propagate to a scale big enough to significantly impede the original virus (by granting immunity from the real thing, presumably).  If the original coronavirus only manages to infect under 1 million people worldwide in this wave, how in the hell is the hypothetical mutated weak coronavirus supposed to spread to a high enough percentage of the population -and even faster- to effectively vaccinate them from the real thing, even with a supposed transmission advantage?***  Even if it spread to 10 million worldwide over the same time frame (which would be impressive since there’s no evidence that it even exists right now….), that’s a ridiculously small percentage of the potentially infectable in hot zones.  It would barely matter at all until COVID/weak virus saturation got much higher, which only happens at MUCH higher case counts.

That line of argumentation is utterly and completely absurd alongside a well-under-1 million worldwide cases projection.

***The average onset of symptoms is ~5 days from infection and the serial interval (time from symptoms in one person to symptoms in a person they infected) is also only around 5 days, meaning there’s a lot of transmission going on before people even show any symptoms.  Furthermore, based on this timeline, which AFAIK is still roughly correct


there’s another week on average to transmit the virus before the “strong virus” carriers are taken out of commission, meaning Epstein’s theory only gives a “transmission advantage” for the hypothetical weak virus more than 10 days after infection on average.  And, oh yeah, 75-85% of “strong” infections never wind up in the hospital at all, so by Epstein’s theory, they’ll transmit at the same rate.  There simply hasn’t been/isn’t now much room for a big transmission advantage for the weak virus under his assumptions.  And to reiterate, no evidence that it actually even exists right now.

Dave Stieb was good

Since there’s nothing of any interest going on in the country or the world today, I decided the time was right to defend the honour of a Toronto pitcher from the 80s.  Looking deeper into this article, https://www.baseballprospectus.com/news/article/57310/rubbing-mud-dra-and-dave-stieb/ , which concluded that Stieb was actually average or worse rate-wise, I found many of the assertions to be… strange.

First, there’s the repeated assertion that Stieb’s K and BB rates are bad.  They’re not.  He pitched to basically dead average defensive catchers, and weighted by the years Stieb pitched, he’s actually marginally above the AL average.  The one place where he’s subpar, hitting too many batters, isn’t even mentioned.  This adds up to a profile of

K/9 BB/9 HBP/9
AL Average 5.22 3.28 0.20
Stieb 5.19 3.21 0.40

Including the extra HBPs, these components add up to about 0.05 additional ERA over league average, or ~1%.  Without looking at batted balls at all, Stieb would only be 1% worse than average (the AL and NL are pretty close pitcher-quality-wise over this timeframe, with the AL having a tiny lead if anything).  BP’s version of FIP- (cFIP) has Stieb at 104.  That doesn’t make any sense even before looking at batted balls, and Stieb only allowed a HR/9 of 0.70 vs. a league average of 0.88.  He suppressed home runs by 20%- in a slightly HR-friendly park- over 2900 innings, combined with an almost dead-average K/BB profile, and BP rates his FIP as below average.  That is completely insane.

The second assertion is that Stieb relied too much on his defense.  We can see from above that an almost exactly average percentage of his PAs ended with balls in play, so that part falls flat, and while Toronto did have a slightly above-average defense, it was only SLIGHTLY above average.  Using BP’s own FRAA numbers, Jays fielders were only 236 runs above average from 79-92, and prorating for Stieb’s share of IP, they saved him 24 runs, or a 0.08 lower ERA (sure, it’s likely that they played a bit better behind him and a bit worse behind everybody else).  Stieb’s actual ERA was 3.44 and his DRA is 4.43- almost one full run worse- and the defense was only a small part of that difference.  Even starting from Stieb’s FIP of 3.82, there’s a hell of a long way to go to get up to 4.43, and a slightly good defense isn’t anywhere near enough to do it.

Stieb had a career BABIP against of .260 vs. AL average of .282, and the other pitchers on his teams had an aggregate BABIP of .278.  That’s more evidence of a slightly above-average defense, suppressing BABIP a little in a slight hitter’s home park, but Stieb’s BABIP suppression goes far beyond what the defense did for everybody else.  It’s thousands-to-1 against a league-average pitcher suppressing HR as much as Stieb did.  It’s also thousands-to-1 against a league-average pitcher in front of Toronto’s defense suppressing BABIP as much as Stieb did.  It’s exceptionally likely that Stieb actually was a true-talent soft contact machine.  Maybe not literally to his career numbers, but the best estimate is a hell of a lot closer to career numbers than to average after 12,000 batters faced.
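Here’s a back-of-envelope version of the “thousands-to-1” claim for the home runs, assuming HR allowed by a league-average pitcher over a fixed workload are roughly Poisson (the innings and HR/9 rates are the ones quoted above):

```python
import math

IP = 2900
league_hr9, stieb_hr9 = 0.88, 0.70
expected = league_hr9 / 9 * IP           # ~284 HR for an average pitcher
observed = round(stieb_hr9 / 9 * IP)     # ~226 HR actually allowed

# P(X <= observed) for X ~ Poisson(expected), summed in log space to
# avoid underflow on the tiny early terms.
log_terms = [-expected + k * math.log(expected) - math.lgamma(k + 1)
             for k in range(observed + 1)]
p = sum(math.exp(t) for t in log_terms)
print(p)   # a few parts in ten thousand -- thousands-to-1 territory
```

The BABIP version of the calculation is the same idea with a binomial on balls in play and comes out in the same ballpark.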

This is kind of DRA and DRC in a microcosm.  It can spit out values that make absolutely no sense at a quick glance, like a league-average K/BB guy with great HR suppression numbers grading out with a below-average cFIP, and it struggles to accept outlier performance on balls in play, even over gigantic samples, because the season-by-season construction is completely unfit for purpose when used to describe a career.  That’s literally the first thing I wrote when DRC+ was rolled out, and it’s still true here.

A 16-person format that doesn’t suck

This is in response to the Magic: the Gathering World Championship that just finished, which featured some great Magic played in a highly questionable format.  It had three giant flaws:

  1. It buried players far too quickly.  Assuming every match was a coinflip, each of the 16 players started with a 6.25% chance to win.  Going 2-0 or 2-1 in draft meant you were just over 12% to win and going 0-2 or 1-2 in draft meant you were just under 0.5% (under 1 in 200) to win.  Ouch.  In turn, this meant we were watching a crapload of low-stakes games and the players involved were just zombies drawing to worse odds than a 1-outer even if they won.
  2. It treated 2-0 and 2-1 match record in pods identically.  That’s kind of silly.
  3. The upper bracket was Bo1 matches, with each match worth >$100,000 in equity.  The lower bracket was Bo3 matches, with encounters worth 37k (lower round 1), 49k, 73k, and 97k (lower finals).  Why were the more important matches more luck-based?


and the generic flaw that the structure just didn’t have a whole lot of play to it.  92% of the equity was accounted for on day 1 by players who already made the upper semis with an average of only 4.75 matches played, and the remaining 12 players were capped at 9 pre-bracket matches with an average of only 6.75 played.

Whatever the format is, it needs to try to accomplish several things at the same time:

  1. Fit in the broadcast window
  2. Pair people with equal stakes in the match (avoid somebody on a bubble playing somebody who’s already locked or can’t make it, etc)
  3. Try not to look like a total luckbox format- it should take work to win AND work to get eliminated
  4. Keep players alive and playing awhile and not just by having them play a bunch of zombie magic with microscopic odds of winning the tournament in the end
  5. Have matches with clear stakes and minimize the number with super-low stakes, AKA be exciting
  6. Reward better records pre-bracket (2-0 is better than 2-1, etc)
  7. Minimize win-order variance, at least before an elimination bracket (4-2 in the M:tG Worlds format could be upper semis (>23% to win) or lower round 1 (<1% to win) depending on result ordering).  Yikes.
  8. Avoid tiebreakers
  9. Matches with more at stake shouldn’t be shorter (e.g. Bo1 vs Bo3) than matches with less at stake.
  10. Be comprehensible


To be clear, there’s no “simple” format that doesn’t fail one of the first 4 rules horribly. Swiss has huge problems with point 2 late in the event, as well as tiebreakers.  Round robin is even worse.  16-player double elimination, or structures isomorphic to that (which the M:tG format was), bury early losers far too quickly, plus most of the games are between zombies.  Triple elimination (or more) Swiss runs into a hell match that can turn the pairings into nonsense with a bye if it goes the wrong way.  Given that nobody could understand this format, even though it was just a dressed-up 16-player double-elim bracket, and any format that doesn’t suck is going to be legitimately more complicated than that, we’re just going to punt on point 10 and settle for anything simpler than the tax code if we can make the rest of it work well.  And I think we can.

Hareeb Format for 16 players:

Day 1:

Draft like the opening draft in Worlds (win-2-before-lose-2).  The players will be split into 4 four-player pods based on record (2-0, 2-1, 1-2, 0-2).

Each pod plays a win-2-before-lose-2 of constructed.  The 4-0 player makes top-8 as the 1-seed.  The 0-4 player is eliminated in 16th place.

The two 4-1 players play a qualification match of constructed.  The winner makes top-8 as the #2 seed.  The two 1-4 players play an elimination match of constructed.  The loser is eliminated in 15th place.

This leaves 4 players with a winning record (Group A tomorrow), 4 players with an even record (2-2 or 3-3) (Group B tomorrow), and 4 players with a losing record (Group C tomorrow).
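The day-1 draft stage has a nice property worth making explicit: with record-based pairings, exactly four players land on each record no matter who wins, so the four constructed pods always fill.  A quick sketch (coinflip results; the pairing implementation is mine, not an official one):

```python
import random
from collections import Counter

random.seed(7)

def draft_stage(players):
    """Win-2-before-lose-2, paired by identical record each round."""
    records = {p: [0, 0] for p in players}          # [wins, losses]
    active = list(players)
    while active:
        random.shuffle(active)                      # random pairings...
        active.sort(key=lambda p: records[p])       # ...within each record group
        for a, b in zip(active[::2], active[1::2]):
            w, l = (a, b) if random.random() < 0.5 else (b, a)
            records[w][0] += 1
            records[l][1] += 1
        active = [p for p in active
                  if records[p][0] < 2 and records[p][1] < 2]
    return records

recs = draft_stage([f"P{i}" for i in range(16)])
pods = Counter(tuple(r) for r in recs.values())
print(pods)   # four players each at 2-0, 2-1, 1-2, 0-2
```

Because the field halves by record every round, the 2-0, 2-1, 1-2, and 0-2 pods are guaranteed to have exactly four players each.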

Day 2:

Each group plays a win-2-before-lose-2 of constructed, and instead of wall-of-texting the results, it’s easier to show graphically that something is at stake with every match in every group:



with the losers of the first round of the lower play-in finishing 11th-12th and the losers of the second round finishing 9th-10th.  So now we have a top-8 bracket seeded.  The first round of the top-8 bracket should be played on day 2 as well, broadcast willing (2 of the matches are available after the upper play-in while the 7-8 seeds are still being decided, so it’s only extending by ~1 round for 7-8 “rounds” total).

Before continuing, I want to show the possible records of the various seeds.  The #1 seed is always 4-0 and the #2 seed is always 5-1.  The #3 seed will either be 6-2 or 5-2.  The #4 seed will either be 5-2, 6-3, or 7-3.  In the exact case of 7-3 vs 5-2, the #4 seed will have a marginally more impressive record, but since the only difference is being on the same side of the bracket as the 4-0 instead of the 5-1, it really doesn’t matter much.

The #5-6 seeds will have a record of 7-4, 6-4, 5-4, or 5-3, a clean break from the possible top-4 records.  The #7-8 seeds will have winning or even records, and the 9th-10th place finishers will have losing or even records.  This is the only meaningful “tiebreak” in the system.  Only the players in the last round of the lower play-in can finish pre-bracket play at .500.  Ideally, everybody at .500 will either all advance or all be eliminated, or there just won’t be anybody at .500.  Less ideally, but still fine, either 2 or 4 players will finish at .500, and the last round of the lower play-in can be paired so that somebody 1 match above .500 is paired against somebody 1 match below .500.  In that case, the player who advances at .500 will have just defeated the eliminated player in the last round.  This covers 98% of the possibilities.  2% of the time, exactly 3 players will finish at .500.  Two of them will have just played a win-and-in against each other, and the other .500 player will have advanced as a #7-8 seed with a last-round win or been eliminated 9th-10th with a last-round loss.

As far as the top-8 bracket itself, it can go a few ways.  It can’t be Bo1 single elim, or somebody could get knocked out of Worlds losing 1 match, which is total BS (point 3), plus any format where going 4-1 can land you in 5th-8th place in a 16-player event is automatically horseshit.  Even 5-2 or 6-3 finishing 5th-8th (Bo3 or Bo5 single elim) is crap, but 4-3 or 5-4 finishing 7th-8th place is totally fine.  It also takes at least 5 losses pre-bracket (or an 0-4 start) to get eliminated, so it should take some work here too.  And we still need to deal with the top 4 having better records than 5-8 without creating a bunch of zombie Magic.  There’s a solution that solves all of this reasonably well at the same time, IMO.

Hareeb format top-8 Bracket:

  1. Double-elimination bracket
  2. All upper bracket matchups are Bo3 matches
  3. In the upper quarters, the higher-seeded player starts up 1 match
  4. Grand finals are Bo5 matches with the upper-bracket representative starting up 1-0 (same as we just did)
  5. Lower bracket matches before lower finals are Bo1 (necessary for timing unless we truly have all day)
  6. Lower bracket finals can be Bo1 match or Bo3 matches depending on broadcast needs.  (Bo1 lower finals is max 11 sequential matches on Sunday, which is the same max we had at Worlds.  If there’s time for a potential 13, lower finals should definitely be Bo3 because they’re actually close to as important as upper-bracket matches, unlike the rest of the lower bracket)
  7. The player with the more impressive match record gets the play-draw choice in the first game 1; then, in Bo3/Bo5 sets, the loser of the previous match gets the choice in the next game 1. (If records are tied, head-to-head record decides the first play; if that’s tied too, random.)
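To put numbers on the head-start rules (points 3 and 4 above), here’s a minimal sketch of my own, assuming every individual match is a 50/50 coin flip, of what starting up a match is worth in a Bo3 and in the Bo5 grand finals:

```python
def series_win_prob(p, a_needs, b_needs):
    """P(A wins the set) when A needs a_needs more match wins, B needs
    b_needs more, and A wins each individual match with probability p."""
    if a_needs == 0:
        return 1.0
    if b_needs == 0:
        return 0.0
    return (p * series_win_prob(p, a_needs - 1, b_needs)
            + (1 - p) * series_win_prob(p, a_needs, b_needs - 1))

p = 0.5
bo3_even   = series_win_prob(p, 2, 2)   # plain Bo3: 0.5
bo3_up_one = series_win_prob(p, 1, 2)   # upper quarters, higher seed up 1: 0.75
bo5_up_one = series_win_prob(p, 2, 3)   # grand finals, upper rep up 1-0: 0.6875
```

So between evenly matched players, the seeding edge in the upper quarters is worth 75/25, and the upper-bracket edge in grand finals is worth roughly 69/31 - meaningful, but not a coronation.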


This keeps the equity a lot more reasonably dispersed (I didn’t try to calculate play advantage throughout the bracket, but it’s fairly minor).  This format is a game of accumulating equity throughout the two days instead of 4 players hoarding >92% of it after day 1 and 8 zombies searching for scraps. Making the top 8 as a 5-8 seed leaves you a bit better off than your pre-tournament win probability (1 in 16) under this format, instead of the 1.95% it was worth in the Worlds format.


As far as win% at previous stages goes,

  1. Day 2 qualification match: 11.60%
  2. Upper play-in: 5.42%
  3. Lower play-in round 2: 3.61%
  4. Lower play-in round 1: 1.81%
  5. Day 2 elimination match: 0.90%

  1. Day 2 Group A: 9.15%
  2. Day 2 Group B: 4.93%
  3. Day 2 Group C: 2.03%

  1. Day 1 qualification match: 13.46%
  2. Day 1 2-0 Pod: 11.32%
  3. Day 1 2-1 Pod: 7.39%
  4. Day 1 1-2 Pod: 4.28%
  5. Day 1 0-2 Pod: 2.00%
  6. Day 1 Elimination match: 1.02%

2-0 in the draft is almost as good as before, but 2-1 and 1-2 are much more modest changes, and going 0-2 preserves far more equity (2% vs <0.5%).  Even starting 1-4 in this format has twice as much equity as starting 1-2 in the Worlds format.  It’s not an absolutely perfect format or anything- given enough tries, somebody will Javier Dominguez it and win going 14-10 in matches- but the equity changes throughout the stages feel a lot more reasonable here while maintaining perfect stake-parity in matches, and players get to play longer before being eliminated, literally or virtually.

Furthermore, while there’s some zombie-ish Magic in the 0-2 pod and Group C (although still nowhere near as bad as the Worlds format), it runs simultaneously with important matches, so coverage isn’t stuck showing it.  Saturday was the upper semis (good) and a whole bunch of nonsense zombie matches (bad), because that’s all that was available, but in this format there’s always something meaningful to show. It also looks like it fits the broadcast parameters from this weekend well enough, with 7 “rounds” of coverage the first day and 8 the second (or 6 and 9 if that sounds nicer), and the same or a similar maximum number of matches on Sunday to what we had for Worlds.

It’s definitely a little more complicated, but it’s massive gains in everything else that matters.

*** The draft can always be paired without rematches.  For a pod, group, upper play-in, lower play-in, or loser’s bracket round 1, look at the 3 possible first-round pairings, minimize total times those matchups have been seen, then minimize total times those matchups have been seen in constructed, then choose randomly from whatever’s tied.  For assigning 5-6 seeds or 7-8 seeds in the top-8 bracket or pairings in lower play-in round 2 or loser’s round 2, do the same considering the two possible pairings, except for the potential double .500 scenario in lower play-in round 2 which must be paired.
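The pairing procedure above can be sketched as follows.  This is a minimal illustration of the minimize-rematches-then-minimize-constructed-rematches tiebreak for one 4-player pod; the `seen_counts` and `seen_constructed` bookkeeping dicts are hypothetical names for whatever match-history tracking the event software keeps:

```python
import random

def pair_pod(players, seen_counts, seen_constructed):
    """Pick first-round pairings for a 4-player pod, minimizing rematches.
    players: list of 4 ids.  seen_counts[frozenset({a, b})] = times a and b
    have already played anywhere; seen_constructed = same, but constructed
    matches only.  Missing keys default to 0."""
    a, b, c, d = players
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def total(pairing, table):
        return sum(table.get(frozenset(m), 0) for m in pairing)

    # 1) minimize total prior meetings, 2) then prior constructed meetings,
    # 3) then choose randomly among whatever's still tied
    best = min(total(p, seen_counts) for p in options)
    tied = [p for p in options if total(p, seen_counts) == best]
    best_c = min(total(p, seen_constructed) for p in tied)
    tied = [p for p in tied if total(p, seen_constructed) == best_c]
    return random.choice(tied)
```

The 5-6 / 7-8 seed assignments work the same way, just choosing between two candidate pairings instead of three.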


Postseason Baseballs and Bauer Units

With the new MLB report on baseball drag leaving far more questions than answers as far as the 2019 postseason goes, I decided to look at the mystery from a different angle.  My hypothesis was a temporary storage issue with the baseballs, and it’s not clear, without knowing more about the methodology of the MLB testing, whether or not that’s still plausible.  But what I did find in pursuit of that was absolutely fascinating.

I started off by looking at the Bauer units (RPM/MPH) on FFs thrown by starting pitchers (only games with >10 FF used) and comparing to the seasonal average of Bauer units in their starts.  Let’s just say the results stand out.  This is excess Bauer units by season week, with the postseason all grouped into week 28 each year (2015 week 27 didn’t exist and is a 0).


And the postseasons are clearly different, with the wildcard and divisional rounds in 2019 clocking in at 0.64 excess Bauer units before coming back down closer to normal.  For the 2019 postseason as a whole, +1.7% in Bauer units.  Since Bauer units are RPM/MPH, it was natural to see what drove the increase.  In the 2019 postseason, vFF was up slightly, 0.4mph over seasonal average, which isn’t unexpected given shorter hooks and important games, but the spin was up +2.2% or almost 50 RPM.  That’s insane.  Park effects on Bauer units are small, and TB was actually the biggest depressor in 2019.
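As a sanity check on that decomposition, here’s a minimal sketch (the seasonal baseline numbers are my own round-number assumptions) showing that +2.2% spin and +0.4 mph on a typical fastball nets out to roughly the +1.7% in Bauer units quoted above:

```python
def bauer_units(spin_rpm, velo_mph):
    """Bauer units are simply spin rate divided by velocity."""
    return spin_rpm / velo_mph

# hypothetical seasonal FF averages for one pitcher (assumed, for illustration)
season_spin, season_velo = 2250.0, 93.0
post_spin = season_spin * 1.022      # +2.2% spin (~50 RPM) per the text
post_velo = season_velo + 0.4        # +0.4 mph per the text

excess = (bauer_units(post_spin, post_velo)
          / bauer_units(season_spin, season_velo) - 1)
# excess comes out a bit under +1.8%, i.e. roughly the reported +1.7%
```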

In regular season games, variation in Bauer units from seasonal average was almost *entirely* spin-driven.  The correlation to spin was 0.93 and to velocity was -0.04.  Some days pitchers can spin the ball, and some days they can’t.  I couldn’t be much further from a professional baseball pitcher, but it doesn’t make any sense to me that that’s a physical talent that pitchers leave dormant until the postseason.. well, 3 out of 5 postseasons.  Spin rate is normally only very slightly lower late in games, so it’s not an effect of cutting out the last inning of a start or anything like that.

Assuming pitchers are throwing fastballs similarly, which vFF seems to indicate, what could cause the ball to spin more?  One answer is sticky stuff on the ball, as Bauer himself knows, but it’s odd to only dial it up in the postseason- all of the numbers are comparisons to the pitcher’s own seasonal averages. Crap on the ball may increase the drag coefficient (AFAIK), so it’s not a crazy explanation.

Another one is that the balls are simply smaller.  It’s easier to spin a slightly smaller ball because the same contact path along the hand is more revolutions and imparts more angular velocity to the ball.  Likewise, it should be slightly easier to spin a lighter ball because it takes less torque to accelerate it to the same angular velocity.  Neither one of these would change the ACTUAL drag coefficient, but both would APPEAR to raise the drag coefficient in pitch trajectory (and ball flight) measurements, and the ball would fly poorly off the bat.  Take a same-size-but-less-dense ball: the drag force on a pitch would be (almost) the same, but since F=ma and m is smaller, the measured drag acceleration would be higher.  The calculations don’t know the ball is lighter, so they interpret the bigger acceleration as a bigger drag force and therefore a higher Cd.  For a smaller ball, the math comes out the same in the end.
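A minimal worked example of the lighter-ball argument (the mass and Cd values here are my own round-number assumptions, roughly a standard ~145 g ball): the trajectory fit back-solves Cd from the measured acceleration while assuming standard mass, so all the other drag terms cancel and the imputed Cd just scales by the mass ratio.

```python
def imputed_cd(cd_true, assumed_mass_g, true_mass_g):
    """The fit measures drag acceleration a = F / true_mass, then infers
    F = assumed_mass * a, so imputed Cd = true Cd * (assumed / true) mass."""
    return cd_true * assumed_mass_g / true_mass_g

cd_apparent = imputed_cd(0.330, 145.0, 143.0)   # ball ~2 g light from drying
inflation = cd_apparent / 0.330 - 1             # ~ +1.4%, true Cd unchanged
```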

Both of those explanations seem plausible as well, not knowing exactly what the testing protocol was with postseason balls, and they could be the result of manufacturing differences (smaller balls) or possibly temporary storage in low humidity (lighter balls).  Personal experiments with baseballs rule both of these in.  Individual balls I’ve measured come with more than enough weight/cross-section variation for a bad batch to put up those results.  The leather is extremely responsive to humidity changes (it can gain more than 100% of its dry weight in high-humidity conditions), and losing maybe 2 grams of moisture from the exterior parts of the ball is enough to spike the imputed drag coefficient the observed amount without changing the CoR much, and that’s well within the range of temporary crappy storage.  It’s possible that they’re both ruled out by the MLB-sponsored analysis, but they didn’t report a detailed enough methodology for me to know.

The early-season “spike” is also strange.  Pitchers throw FFs a bit more slowly than seasonal average but also with more spin, which makes absolutely no sense without the baseballs being physically different then as well, just to a lesser degree (or another short-lived pine tar outbreak, I guess).

It could be something entirely different from my suggestions, maybe something with the surface of the baseball causing it to be easier to spin and also higher drag, but whatever it is, given the giant spin rate jump at the start of the postseason, and the similarly odd behavior in other postseasons, it seems highly unlikely to me that the imputed drag coefficient spike and the spin rate spike don’t share a common cause.  Given that postseason balls presumably don’t follow the same supply chain- they’re processed differently to get the postseason stamp, they’re not sent to all parks, they may follow a different shipping path, and they’re most likely used much sooner after ballpark arrival, on average, than regular-season balls, this strongly suggests manufacture/processing or storage issues to me.

On the London Mulligan

Zvi says ban it, and the pros I’ve seen talking about it lean towards the ban camp, but there are dissenters like BenS.  People also almost universally like it in limited.  Are they right? Are they highly confused?  What’s really going on?

From the baseline of the Paris mulligan (draw 6, draw 5, etc), on a 6-card keep, the Vancouver mulligan adds scry 1 and the London mulligan adds Loot 1 (discard to bottom of library).  London is clearly better, but plenty of times you’ll scry an extra land away like you would have with a loot, or the top card will be the one you would loot away anyway and there’s no real difference.  Other times you’re stuck with a clearly worse card in hand.  It’s better on 6-card keeps, but it’s not OMFG better.

Except that’s not quite the actual procedure.. on the London, you (effectively) loot, THEN you decide whether or not to keep.  That lets you make much better decisions, seeing all 7 cards instead of just 6 before deciding, and the difference on a 5-card keep is that Vancouver still just adds scry 1, but London adds Loot 2.  That adds up to a HUGE difference in starting hand quality.  And you can still go to 4 if your top 7 cards are total ass again.  I’d argue that the London is fine at 6 but goes totally bonkers at 5 and lower.
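One way to see the “look at 7, keep 5” jump is a straight hypergeometric comparison (my own sketch, assuming a 60-card deck with a 4-of key card): the chance of finding at least one copy among the cards you get to look at.

```python
from math import comb

def p_at_least_one(copies, deck_size, cards_seen):
    """P(>=1 of `copies` target cards among `cards_seen` of `deck_size`)."""
    return 1 - comb(deck_size - copies, cards_seen) / comb(deck_size, cards_seen)

paris_5  = p_at_least_one(4, 60, 5)   # old mulligan to 5: look at 5 cards
london_5 = p_at_least_one(4, 60, 7)   # London to 5: look at 7, bottom 2
```

That’s roughly 30% vs 40% to have the key card in your opening 5 - a third more often - and the gap compounds with every additional mulligan and every additional piece the deck is digging for.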

If you have decks that rely on card quantity more than a couple of specific quality cards, going to 5 cards, even best-5-out-of-7, is still a big punishment.  That’s most limited decks, where a 90th percentile 5 is going to play out like a 40th percentile 7, or something like that depending on archetype.  Barring something absurd like Pack Rat, aggressive mulligans aren’t a strategy.  You mulligan dysfunctional hands, not to find great hands.  London just lets you be a bit more liberal with the “dysfunctional” label in limited, and it’s generally fine there.

For Eternal formats, where lots of decks are trying to do something powerful and plan B is to go to the next game, London rewarded all-in-on-plan-A strategies like Tron, Amulet, and now Whirza (which also just got a decent Plan B-roko).  Before rotation, and for most of 2019, it looks to me like Standard was a lot closer to Limited, at least game 1 in the dark.  Aggro decks really don’t want to go to 5 (although they’re better at it than the rest of these).  Esper really doesn’t want to go to 5.  Scapeshift really doesn’t want to go to 5.  Jeskai really doesn’t want to go to 5.  Not that they won’t if their hands are garbage, but their hand quality is distributed far more smoothly than a Tron deck’s highly polarized Tron-or-not, nuts-or-garbage, and that means keeping more OK hands, because the odds that a mulligan beats the hand you kept (or beats it by a lot) aren’t as high.  Aggro decks need a density of cheap beaters and usually their color’s particular flavor of support (pump in white, burn in red, Obsession/counters in blue, etc).  Midrange needs lands and 4-5 drops and something to do before that.  Control needs enough answers.

There just aren’t that many good 5-card combinations that cover the bases, even looking at 5-out-of-7, you’re quite reliant on the top of the deck to keep delivering whatever you’re light on.  There wasn’t any way for most of the decks to get powerful nut draws on 5 with any real consistency, even with London, so they couldn’t abuse the 5-card hand advantage because going to 5 really sucked.  Then came Eldraine.. Guess who doesn’t need a 7-card hand to do busted work?



Innkeeper doesn’t get to 5 that often, but any 5 or 6 with him is better than basically any 6 or 7 without, so the idea still applies.  Hands with these starts are MUCH stronger than hands without, and because of London and OUaT, they can be found with much more regularity.  If you take something like Torbran in mono-R on a 5-carder, WTF are you keeping that doesn’t have to draw near-perfectly to make T4 good?  Same with Embercleave in non-adventure Gruul, you can only keep pieces and hope to draw perfectly.

Oko not only has a self-contained nut draw on 5 cards, its backup of T3 Nissa is a hell of a lot easier to assemble on 5 than, say, a useful Torbran or Embercleave hand or a useful Fires or Reclamation hand.  Furthermore, thanks to OUaT (and Veil for indirectly keeping G1 interaction in check), it can actually assemble and play a great hand on 5 far too often.  Innkeeper can also start going off from a wide range of hands.  The ability to go bananas on a reasonable number of 5-card London hands certainly stretches things compared to where they were with Vancouver.

Maybe that will make for playable (albeit different) Eternal formats with a wide variety of decks trying to nut draw each other, kind of like Modern 1-2 years ago before Faithless Looting really broke out, with enough variance in the pairings lottery and sideboard cards that tier 2 and 3 decks can still put up regular results.  I have my doubts though- Modern was already collapsing away from that, and reducing the fail rates of the most powerful decks certainly doesn’t seem likely to foster diversity from where I sit- and if there is a direct gain, it’ll be something degenerate that’s now consistent enough to play.  Yippee.

It’s possible that some Standards will be okay, but even besides the obvious mistakes in Oko and Veil, this one has some issues.  You can’t ever have a cheap build-around unless it’s trivially dealt with by most of the meta (Innkeeper could be if Shock, Disfigure, Glass Casket, etc were big in the meta), in which case why even bother printing it?  You can’t have functionally more than 4x 1-cost acceleration without polarizing draws into 3-drop-on-turn-2 (or 5+ drop on turn 3) or garbage.  With only one card, and especially one card that might actually die, you can’t deckbuild all-in on it or mulligan to it.  With the 8x + OUaT available now, you can, and likely should if you’re in that acceleration market at all.

I don’t trust Wizards to not print broken cheap stuff, and they probably don’t even trust themselves at this point, assuming it’s not actually on purpose, which it likely kind of is.  I barely mentioned postboard games where draws are naturally more polarized (and that polarization is known during mulligans), which leads to more mulligan death spiral games.  Nobody’s freaking out when a draft deck keeps 7 because it keeps plenty of average-ish hands as well as the good ones- you just have to mulligan slightly more aggressively.  When Tron or Oko keeps 7, you know damn well you’re in for something busted because they would have shipped all their mediocre hands and you have to mulligan to a hand that can play.. until we get a deck that can actually bluff keep a reasonable-but-not-broken plan B/C sometimes to get free equity off scared mulligans/fearless non-mulligans.

I wish I had a clean answer, but I don’t.  If all I were worried about were ladder-type things, I’d say you just get one mulligan, and have it be a London plus a scry, or even look at 8 and bottom 2, and you’re stuck with it.  If your hand is nonfunctional, then you just lose super-fast and go to the next game or match, no big deal.  That’s a lot of feels bad on a tournament schedule though where you lost and didn’t even get the illusion of playing a game and you’re doing nothing but moping for the next 30-40 minutes, and a lot of people aren’t even playing Magic in the way pros and grinders do.

To use a slightly crude analogy, they approach Magic like two guys who are too fat to reach their own cocks and agree to lay side-by-side and jerk each other off.  Some like to show off, a few like to watch, but it’s mainly about experiencing their dick, er, deck, doing what it’s built to do, and they can’t just play with themselves.  For those people, the London mulligan is like free Viagra making sure their deck is always ready to perform, so they absolutely love it, and that player type is approximately infinity times more common than the hardcore spikes who can enjoy a good struggle with a semi…functional hand.

For those reasons, I think we’re stuck with it, for better or for worse, and the best we can hope for is that WotC is cognizant of not allowing anything in Standard to do broken things on 5 with any frequency and banning ASAP when something gets through.

Reliever Sequencing, Real or Not?

I read this first article on reliever sequencing, and it seemed like a reasonable enough hypothesis, that batters would do better seeing pitches come from the same place and do worse seeing them come from somewhere else, but the article didn’t discuss the simplest variable that should have a big impact- does it screw batters up to face a lefty after a righty or does it really not matter much at all?  I don’t have their arm slot data, and I don’t know what their exact methodology was, so I just designed my own little study to measure the handedness switch impact.

Using PAs from 2015-18 where the batter is facing a different pitcher than the previous PA in this game (this excludes the first PA in the game for all batters, of course), I noted the handedness of the pitcher, the stance of the batter, and the standard wOBA result of the PA.  To determine the impact of the handedness switch, I compared pairs of data: (RHB vs RHP where the previous pitcher was a LHP) to (RHB vs RHP where the previous pitcher was a RHP), etc, which also controls for platoon effects without having to try to quantify them everywhere.  The raw data is
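The comparison above can be sketched as follows.  This is a minimal illustration only - the PA rows are made up, and the real inputs would be four seasons of play-by-play logs:

```python
from collections import defaultdict

# Each PA: (batter stance, pitcher hand, previous pitcher's hand, wOBA).
# These rows are illustrative placeholders, not real data.
pas = [
    ("R", "R", "R", 0.320), ("R", "R", "L", 0.310),
    ("R", "R", "R", 0.300), ("L", "R", "L", 0.340),
]

sums = defaultdict(lambda: [0.0, 0])
for bats, throws, prev, woba in pas:
    key = (bats, throws, prev == throws)  # True = follows same-handed pitcher
    sums[key][0] += woba
    sums[key][1] += 1

def mean_woba(bats, throws, same):
    total, n = sums[(bats, throws, same)]
    return total / n

# (following same hand - following opposite hand) for RHB vs RHP, as in Table 2;
# bucketing by (bats, throws) controls for platoon effects automatically
diff_rr = mean_woba("R", "R", True) - mean_woba("R", "R", False)
```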

Table 1

Bats Throws Prev P wOBA N
L L L 0.302 16162
L L R 0.296 54160
L R R 0.329 137190
L R L 0.333 58959
R L L 0.339 19612
R L R 0.337 63733
R R R 0.315 191871
R R L 0.313 82190

which looks fairly minor, and the differences (following same hand – following opposite hand) come out to

Table 2

Bats Throws wOBA Diff SD Harmonic mean of N
L L 0.006 0.0045 24895
L R -0.0046 0.0025 82474
R L 0.002 0.0041 29994
R R 0.002 0.0021 115083
Total Total 0.000000752 252446

which is in the noise range in every bucket and overall no difference between same and opposite hand as the previous pitcher.  Just in case there was miraculously a player-quality effect exactly offsetting a real handedness effect, for each PA in the 8 groups in table 1, I calculated the overall (all 4 years) batter performance against the pitcher’s handedness and the pitcher’s overall performance against batters of that stance, then compared the quality of the group that followed same-handed pitching to the group that followed opposite-handed pitching.

As it turned out there was an effect… quality effects offset some of the observed differential in 3 of the buckets, and now the difference in every individual bucket is less than 1 SD away from 0.000 while the overall effect is still nonexistent.

Table 3

Bats Throws wOBA Diff Q diff Adj Diff SD Harmonic mean of N
L L 0.0057 0.0037 0.0020 0.0045 24895
L R -0.0046 -0.0038 -0.0008 0.0025 82474
R L 0.0018 -0.0022 0.0040 0.0041 29994
R R 0.0016 0.0033 -0.0017 0.0021 115083
Total Total 0 0.0004 -0.0004 252446

Q Diff means that LHP + LHB following a LHP were a combination of better batters/worse pitchers by 3.7 points of wOBA compared to LHP + LHB following a RHP, etc.  So of the observed 5.7 points of wOBA difference, 3.7 of it was expected from player quality and the 2 points left over is the adjusted difference.
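For reference, the SD column is consistent with a standard two-sample error on per-PA wOBA.  The per-PA standard deviation of ~0.5 is my own assumption (not stated in the article), but it reproduces the table’s SD values from the harmonic mean of the group sizes:

```python
from math import sqrt

def diff_se(per_pa_sd, harmonic_mean_n):
    """Standard error of a difference of two group means, using the identity
    1/n1 + 1/n2 == 2 / harmonic_mean(n1, n2)."""
    return per_pa_sd * sqrt(2.0 / harmonic_mean_n)

# e.g. the LHB-vs-LHP bucket, harmonic mean N = 24895, gives ~0.0045
se_ll = diff_se(0.5, 24895)
```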

I also looked at only the performance against the second pitcher the batter faced in the game using the first pitcher’s handedness, but in that case, following the same-handed pitcher actually LOWERED adjusted performance by 1.7 points of wOBA (third and subsequent pitcher faced was a 1 point benefit for samehandedness), but these are still nothing.  I just don’t see anything here.  If changing pitcher characteristics made a meaningful difference, it would almost have to show up in flipped handedness, and it just doesn’t.


There was one other obvious thing to check, velocity, and it does show the makings of a real (and potentially somewhat actionable) effect.  Bucketing pitchers by average fastball velocity into fast (>94.5 mph), slow (<89.5 mph), or medium, and doing the same quality/handedness controls as above gave the following:

First reliever Starter Quality-adjusted wOBA SD N
F F 0.319 0.0047 11545
F M 0.311 0.0019 65925
F S 0.306 0.0037 17898
M F 0.318 0.0033 23476
M M 0.321 0.0012 167328
M S 0.320 0.0022 50625
S F 0.321 0.0074 4558
S M 0.318 0.0025 39208
S S 0.330 0.0043 13262

Harder-throwing relievers do better, which isn’t a surprise, but it looks like there’s extra advantage when the starter was especially soft-tossing, and at the other end, slow-throwing relievers are max punished immediately following soft-tossing starters.  This deserves a more in-depth look with more granular tools than aggregate PA wOBA, but two independent groups both showing a >1SD effect in the hypothesized direction is.. something, at least, and an effect size on the order of .2-.3 RA/9 isn’t useless if it holds up.  I’m intrigued again.
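As a back-of-envelope check on the “>1 SD” claim, here are the two comparisons from the table in combined-SD units (my own framing, using the table’s numbers directly):

```python
def z(m1, sd1, m2, sd2):
    """Separation of two bucket means, in units of their combined SD."""
    return (m1 - m2) / (sd1 ** 2 + sd2 ** 2) ** 0.5

# fast relievers: extra edge after soft-tossing starters (F/S vs F/M buckets)
z_fast = z(0.311, 0.0019, 0.306, 0.0037)   # about 1.2 SDs
# slow relievers: max punished after soft-tossing starters (S/S vs S/M buckets)
z_slow = z(0.330, 0.0043, 0.318, 0.0025)   # about 2.4 SDs
```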