The Five MTG: Arena Rankings

This ties up a few loose ends with the Mythic ranking system and explains how the other behind-the-scenes MMRs work.  As it turns out, there are five completely distinct rankings in Arena: Mythic Constructed, Mythic Limited, Serious Constructed, Serious Limited, and Play (these are my names for them).


All games played in ranked (including Mythic) affect the Serious ratings, *as do the corresponding competitive events- Traditional Draft, Historic Constructed Event, etc*.  Serious ratings are conserved from month to month.  The Play rating includes the Play and Brawl queues, as well as Jump In (despite Jump In being a limited event), and is also conserved month to month.  Nonsense events (e.g. the Momir and Artisan MWMs) don’t appear to be rated in any way.

The Serious and Play ratings appear to be intended to be a standard implementation of the Glicko-2 rating system except with ratings updated after every game.  I say intended because they have a gigantic bug that screws everything up, but we’ll talk about that later.  New accounts start with a rating of 1500, RD of 350, and volatility of 0.06.  These update after every game and there doesn’t seem to be any kind of decay or RD increase over time (an account that hadn’t even logged in since before rotation still had a low RD).  Bo1 and Bo3 match wins are rated the same for these.
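As a concrete reference, here’s a minimal per-game Glicko-2 update in Python.  The starting values (1500 / 350 / 0.06) and the game-by-game cadence are from the post; the algorithm itself is the standard published Glicko-2 procedure, and the tau value is my assumption, since Arena’s isn’t known:

```python
import math

TAU = 0.5          # system constant; Arena's actual value is unknown (assumption)
SCALE = 173.7178   # standard Glicko-2 scale constant

def glicko2_update(r, rd, vol, r_op, rd_op, score):
    """One-game Glicko-2 update; score is 1.0 for a win, 0.0 for a loss."""
    mu, phi = (r - 1500) / SCALE, rd / SCALE
    mu_j, phi_j = (r_op - 1500) / SCALE, rd_op / SCALE
    g = 1 / math.sqrt(1 + 3 * phi_j**2 / math.pi**2)
    e = 1 / (1 + math.exp(-g * (mu - mu_j)))
    v = 1 / (g**2 * e * (1 - e))          # estimated variance from this game
    delta = v * g * (score - e)           # estimated improvement
    # volatility update: solve f(x) = 0 with the published Illinois-type iteration
    a = math.log(vol**2)
    def f(x):
        ex = math.exp(x)
        return (ex * (delta**2 - phi**2 - v - ex)) / (2 * (phi**2 + v + ex)**2) \
               - (x - a) / TAU**2
    A = a
    if delta**2 > phi**2 + v:
        B = math.log(delta**2 - phi**2 - v)
    else:
        k = 1
        while f(a - k * TAU) < 0:
            k += 1
        B = a - k * TAU
    fa, fb = f(A), f(B)
    while abs(B - A) > 1e-9:
        C = A + (A - B) * fa / (fb - fa)
        fc = f(C)
        if fc * fb < 0:
            A, fa = B, fb
        else:
            fa /= 2
        B, fb = C, fc
    vol_new = math.exp(A / 2)
    phi_star = math.sqrt(phi**2 + vol_new**2)
    phi_new = 1 / math.sqrt(1 / phi_star**2 + 1 / v)
    mu_new = mu + phi_new**2 * g * (score - e)
    return 1500 + SCALE * mu_new, SCALE * phi_new, vol_new
```

Running this for a brand-new account beating another brand-new account moves the winner to roughly 1662 with an RD around 290- the fast early movement Glicko-2 is designed for.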


Only games played while in Mythic count towards the Mythic rankings, and the gist of the system is exactly as laid out in One last rating update, although I have a slight formula update.  These rankings appear to come into existence when Mythic is reached each month and disappear again at the end of the month (season).

I got the bright idea that they may be using the same code to calculate Mythic changes, and this appears to be true (I say appears because I have irreconcilable differences in the 5th decimal place for all of their Glicko-2 computations that I can’t resolve with any (tau, Glicko scale constant) pair.  There’s either a small bug or some kind of rounding issue on one of our ends, but it’s really tiny regardless).  The differences are that Mythic uses a fixed RD of 60 and a fixed volatility of 0.06 (or extremely, extremely close), neither of which changes after matches, and that rating changes are multiplied by 2 for Bo3 matches.  Glicko with a fixed RD is very similar to Elo with a fixed K on matchups within 200 points of each other.

In addition, the initial Mythic rating is seeded *as a function of the Serious rating*.  [Edit 8/1/2022: The formula is: if Serious Rating >=3000, Mythic Rating = 1650.  Otherwise Mythic Rating = 1650 – ((3000 – Serious Rating) / 10).  Using Serious Rating from before you win your play-in game to Mythic, not after you win it, because reasons] That’s the one piece I never exactly had a handle on.  I knew that tanking my rating gave me easy pairings and a trash initial Mythic rating the next month, but I didn’t know how that happened.  The existence of a separate conserved Serious rating explains it all.  [Edit 8/1/22: rest of paragraph deleted since it’s no longer relevant with the formula given]
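That seeding formula, transcribed directly into Python (values straight from the edit above; `serious_rating` is the pre-play-in Serious rating):

```python
def initial_mythic_rating(serious_rating: float) -> float:
    """Seed the initial Mythic rating from the (pre-play-in) Serious rating,
    per the 8/1/2022 edit: capped at 1650, minus 1 point per 10 below 3000."""
    if serious_rating >= 3000:
        return 1650.0
    return 1650.0 - (3000.0 - serious_rating) / 10.0
```

So a 3000+ Serious rating seeds at the 1650 cap, while a rating tanked down to 1000 would seed at 1450.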

The Mythic system also had two fixed constants that previously appeared to be randomly arbitrary- the minimum win of 5.007 points and the +7.4 win/-13.02 loss when playing a non-Mythic.  Using the Glicko formula with both players having RD 60 and Volatility 0.06, the first number appears when the rating difference is restricted to a maximum of exactly 200 points. Even if you’re 400 points above your opponent, the match is rated as though you’re only 200 points higher.  The second number appears when you treat the non-Mythic player as rated exactly 100 points lower than you are (regardless of what your rating actually is) with the same RD=60.  This conclusively settles the debate as to whether or not that system/those numbers were empirically derived or rectally derived.
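All three constants can be reproduced numerically.  This sketch plugs the fixed RD=60 / volatility=0.06 values into the standard Glicko-2 update; the cap-at-200 and treat-non-Mythics-as-100-below interpretations are the ones described above:

```python
import math

SCALE = 173.7178        # standard Glicko-2 scale constant
PHI = 60 / SCALE        # fixed RD of 60 on the internal scale
VOL = 0.06              # fixed volatility

def rating_change(rating_diff, score):
    """Points gained (score=1) or lost (score=0, returned negative) by a player
    rated rating_diff above their opponent, both at RD 60 / volatility 0.06."""
    g = 1 / math.sqrt(1 + 3 * PHI**2 / math.pi**2)
    e = 1 / (1 + math.exp(-g * rating_diff / SCALE))
    v = 1 / (g**2 * e * (1 - e))
    phi_new_sq = 1 / (1 / (PHI**2 + VOL**2) + 1 / v)
    return SCALE * phi_new_sq * g * (score - e)

min_win = rating_change(200, 1)        # diff capped at 200 -> ~5.007
vs_lower_win = rating_change(100, 1)   # non-Mythic as "100 below" -> ~+7.4
vs_lower_loss = rating_change(100, 0)  # -> ~-13.02
elo_k = 2 * rating_change(0, 1)        # equal ratings -> effective Elo K ~20.4
```

The last line is the fixed-RD-Glicko ≈ fixed-K-Elo equivalence mentioned earlier: at equal ratings the swing is about ±10.2 points, i.e. K ≈ 20.4.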

The Huge Bug

As I reported on Twitter last month, the Serious and Play ratings have a problem (but not Mythic... at least not this problem).  If you lose, you’re rated as though you lost to your opponent.  If you win, you’re rated as though you beat a copy of yourself (rating and RD).  And, of course, even though correcting the formula/function call is absolutely trivial, it still hasn’t been fixed after weeks.  This bug dates to at least January, almost certainly to the back-end update last year, and quite possibly back into Beta.

Glicko-2 isn’t zero-sum by design (if two players with the same rating play, the one with higher RD will gain/lose more points), but it doesn’t rapidly go batshit.  With this bug, games are now positive-sum in expectation.  When the higher-rated player wins, points are created (the higher-rated player wins more points than they should).  When the lower-rated player wins, points are destroyed (the lower-rated player wins fewer points than they should). Handwaving away some distributional things that the data show don’t matter, since the higher-rated player wins more often, points are created more than destroyed, and the entire system inflates over time.

My play rating is 5032 (and I’m not spending my life tryharding the play queue), and the median rating is supposed to be around 1500.  In other words, if the system were functioning correctly, my rating would mean that I’d be expected to only lose to the median player 1 time per tens of millions of games.  I’m going to go out on a limb and say that even if I were playing against total newbies with precons, I wouldn’t make it anywhere close to 10 million games before winding up on the business end of a Gigantosaurus.  And I encountered a player in a constructed event who had a Serious Constructed rating over 9000, so I’m nowhere near the extreme here.  It appears from several reports that the cap is exactly 10000.

In addition to inflating everybody who plays, it also lets players go infinite (up to 10000) on rating.  Because beating a copy of yourself is worth ~half as many points as a loss to a player you’re expected to beat ~100% of the time, anybody who can win more than 2/3 of their matches will increase their rating without bound.  Clearly some players (like the 9000 I played) have been doing this for a while.
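That 2/3 threshold falls straight out of the bug.  In rough Elo terms (using K as the maximum possible swing- a simplification of the Glicko math, and the particular K value is illustrative since the threshold doesn’t depend on it), a win rated against a self-copy is worth K·0.5, while a loss to a player you beat ~100% of the time costs ~K·1:

```python
def buggy_expected_change(p_win, k=20.4, e_vs_opp=1.0):
    """Expected per-game rating change under the bug for a player far above the field.

    Wins are rated vs. a copy of yourself (expected score 0.5 -> gain k * 0.5);
    losses are rated normally (expected score e_vs_opp -> lose k * e_vs_opp).
    """
    return p_win * (k * 0.5) - (1 - p_win) * (k * e_vs_opp)

# break-even: p * 0.5 = (1 - p) * 1  ->  p = 2/3; win more often than that
# and the rating climbs without bound
```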

If the system were functioning properly, most people would be in a fairly narrow range.  In Mythic constructed, something like 99% of players are between 1200-2100 (this is a guess, but I’m confident those endpoints aren’t too narrow by much, if at all), and that’s with a system that artificially spreads people out by letting high-rated players win too many points and low-rated players lose too many points.  Serious Constructed would be spread out a bit more because it includes all the non-Mythics as well, but it’s not going to be that huge a gap down to the people who can at least make it out of Silver.  And while the Play queue has much wider deck-strength variance, the deck-strength matching, while very far from perfect, should at least make the MMR difference more about play skill than deck strength, so it also wouldn’t spread out too far.

Instead, because of the rampant inflation, the center of the distribution is at like 4000 MMR instead of ~1500.  Strong players are going infinite on the high side, and because new accounts still start at 1500, and there’s no way to make it to 4000 without winning lots of games (especially if you screwed around with precons or something for a few games at some point), there’s a constant trickle of relatively new (and some truly atrocious) accounts on the left tail spanning the thousands-of-points gap between new players and established players, and that gap only exists because of the rating bug.  It looks something like this.

The three important features are that the bulk of the distribution is off by thousands of points from where it should be, noobs start thousands of points below the bulk, and players are spread over an absurdly wide range.  The curves are obviously hand-drawn in Paint, so don’t read anything beyond that into the precise shapes.

This is why new players often make it to Mythic one time easily with functional-but-quite-underpowered decks- they make it before their Serious Constructed rating has crossed the giant gap from 1500 to the cluster where the established players reside.  Then once they’ve crossed the gap, they mostly get destroyed.  This is also why tanking the rating is so effective.  It’s possible to tank thousands of points and get all the way back below the noob entry point.  I’ve said repeatedly that my matchups after doing that were almost entirely obvious noobs and horrific decks, and now it’s clear why.

It shouldn’t be possible (without spending a metric shitton of time conceding matches, since there should be rapidly diminishing returns once you separate from the bulk) to tank far enough to do an entire Mythic climb with a high winrate without at least getting the rating back to the point where you’re bumping into the halfway competent players/decks on a regular basis, but because of the giant gap, it is.  The Serious Constructed rating system wouldn’t be entirely fixed without the bug- MMR-based pairing still means a good player is going to face tougher matchups than a bad player on the way to Mythic, and tanking would still result in some very easy matches at the start- but those effects are greatly magnified because of the artificial gap that the bug has created.

What I still don’t know

I don’t know whether the Serious or the Mythic rating is used for pairing within Mythic.  I don’t have the data to know if Serious is used to pair ranked drafts or any other events (Traditional Draft, Standard Constructed Event, etc).  It would take a 17lands-data-dump amount of match data to conclusively show that one way or another.  AFAIK, WotC has said that it isn’t, but they’ve made incorrect statements about ratings and pairings before.  And I certainly don’t know, in so, so many different respects, how everything made it to this point.

About MLB’s New Mudding and Storage Protocol

My prior research on the slippery ball problem: Baseball’s Last Mile Problem

The TL;DR is that mudding adds moisture to the surface of the ball.  Under normal conditions (i.e. stored with free airflow where it was stored before mudding), that moisture evaporates off in a few hours and leaves a good ball.  If that evaporation is stopped, the ball goes to complete hell and becomes more slippery than a new ball.  This is not fixed by time in free airflow afterwards.

My hypothesis is that the balls were sometimes getting stored in environments with sufficiently restricted airflow (the nylon ball bag) too soon after mudding, and that stopped the evaporation.  This only became a problem this season with the change to mudding all balls on gameday and storing them in a zipped nylon bag before the game.

MLB released a new memo yesterday that attempts to standardize the mudding and storage procedure.  Of the five bullet points, one (AFAIK) is not a change.  Balls were already supposed to sit in the humidor for at least 14 days.  Attempting to standardize the application procedure and providing a poster with allowable darkness/lightness levels are obviously good things.  It may be relevant here if the only problem balls were the muddiest (aka wettest), which shouldn’t happen anymore, but from anecdotal reports, there were problem balls where players didn’t think the balls were even mudded at all, and unless they’re blind, that seems hard to reconcile with also being too dark/too heavily mudded.  So this may help some balls, but probably not all of them.

Gameday Mudding

The other points are more interesting.  Requiring all balls to be mudded within 3 hours of each other could be good or bad.  If it eliminates stragglers getting mudded late, this is good.  If it pushes all mudding closer to gametime, this is bad.  Either way, unless MLB knows something I don’t (which is certainly possible- they’re a business worth billions and I’m one guy doing science in my kitchen), the whole gameday mudding thing makes *absolutely no sense* to me at all in any way.

Pre-mudding, all balls everywhere** are equilibrated in the humidor the same way.  Post-mudding, the surface is disrupted with transient excess moisture.  If you want the balls restandardized for the game, then YOU MAKE SURE YOU GIVE THE BALL SURFACE TIME AFTER MUDDING TO REEQUILIBRATE TO A STANDARD ENVIRONMENT BEFORE DOING ANYTHING ELSE WITH THE BALL.  And that takes hours.

In a world without universal humidors, gameday mudding might make sense since later storage could be widely divergent.  Now, it’s exactly the same everywhere**.  Unless MLB has evidence that a mudded ball sitting overnight in the humidor goes to hell (and I tested and found no evidence for that at all, but obviously my testing at home isn’t world-class- also, if it’s a problem, it should have shown up frequently in humidor parks before this season), I have no idea why you would mud on gameday instead of the day before like it was done last season.  The evaporation time between mudding and going in the nylon bag for the game might not be long enough if mudding is done on gameday, but mudding the day before means it definitely is.

Ball Bag Changes

Cleaning the ball bag seems like it can’t hurt anything, but I’m also not sure it helps anything. I’m guessing that ball bag hygiene over all levels of the sport and prior seasons of MLB was generally pretty bad, yet somehow it was never a problem.  They’ve seen the bottom of the bags though.  I haven’t. If there’s something going on there, I’d expect it to be a symptom of something else and not a primary problem.

Limiting to 96 balls per bag is also kind of strange.  If there is something real about the bottom of the bag effect, I’d expect it to be *the bottom of the bag effect*.  As long as the number of balls is sufficient to require a tall stack in the bag (and 96 still is), and since compression at these number ranges doesn’t seem relevant (prior research post), I don’t have a physical model of what could be going on that would make much difference for being ball 120 of 120 vs. ball 96 of 96.  Also, if the bottom of the bag effect really is a primary problem this year, why wasn’t it a problem in the past?  Unless they’re using entirely new types of bags this season, which I haven’t seen mentioned, we should have seen it before.  But I’m theorizing and they may have been testing, so treat that paragraph with an appropriate level of skepticism.

Also, since MLB uses more than 96 balls on average in a game, this means that balls will need to come from multiple batches.  This seemed like it had the potential to be significantly bad (late-inning balls being stored in a different bag for much longer), but according to an AP report on the memo:

“In an effort to reduce time in ball bags, balls are to be taken from the humidor 15-30 minutes before the scheduled start, and then no more than 96 balls at a time.  When needed, up to 96 more balls may be taken from the humidor, and they should not be mixed in bags with balls from the earlier bunch.”

This seems generally like a step in the smart direction, like they’d identified being zipped up in the bag as a potential problem (or gotten the idea from reading my previous post from 30 days ago :)).  I don’t know if it’s a sufficient mitigation because I don’t know exactly how long it takes for the balls to go to hell (60 minutes in near airtight made them complete garbage, so damage certainly appears in less time, but I don’t know how fast and can’t quickly test that).  And again, repeating the mantra from before, time spent in the ball bag *is only an issue if the balls haven’t evaporated off after mudding*.  And that problem is slam-dunk guaranteed solvable by mudding the day before, and then this whole section would be irrelevant.

Box Storage

The final point, “all balls should be placed back in the Rawlings boxes with dividers, and the boxes should then be placed in the humidor. In the past, balls were allowed to go directly into the humidor.” could be either extremely important or absolutely nothing.  This doesn’t say whether the boxes should be open or closed (have the box top on) in the humidor.  I tweeted to the ESPN writer and didn’t get an answer.

The boxes can be seen in the two images.  If they’re open (and not stacked or otherwise covered to restrict airflow), this is fine and at least as good as whatever was done before today.  If the boxes are closed, it could be a real problem.  Like the nylon ball bag, this is also a restricted-flow environment, and unlike the nylon ball bag, some balls will *definitely* get in the box before they’ve had time to evaporate off (since they go in shortly after mudding).

I have one Rawlings box without all the dividers.  The box isn’t airtight, but it’s hugely restricted airflow.  I put 3 moistened balls in the box along with a hygrometer and the RH increased 5% and the balls lost moisture about half as fast as they did in free air.  The box itself absorbed no relevant amount.  With 6 moistened balls in the box, the RH increased 7% (the maximum moistened balls in a confined space will do per prior research) and they lost moisture between 1/3 and 1/4 as fast as in free air.

Unlike the experiments in the previous post where the balls were literally sealed, there is still some moisture flux off the surface here.  I don’t know if it’s enough to stop the balls from going to hell.  It would take me weeks to get unmudded equilibrated balls to actually do mudding test runs in a closed box, and I only found out about this change yesterday with everybody else.  Even if the flux is still sufficient to avoid the balls going to hell directly, the evaporation time appears to be lengthened significantly, and that means that balls are more likely to make it into the closed nylon bag before they’ve evaporated off, which could also cause problems at that point (if there’s still enough time for problems there- see previous section).

The 3 and 6 ball experiments are one run each, in my ball box, which may have a better or worse seal than the average Rawlings box, and the dividers may matter (although they don’t seem to absorb very much moisture from the air, prior post), etc.  Error bars are fairly wide on the relative rates of evaporation, but hygrometers don’t lie.  There doesn’t seem to be any way a closed box isn’t measurably restricting airflow and increasing humidity inside unless the box design changed a lot in the last 3 years.  Maybe that humidity increase/restricted airflow isn’t enough to matter directly or indirectly, but it’s a complete negative freeroll.  Nothing good can come of it.  Bad things might.  If there are reports somewhere this week that tons of balls were garbage, closed-box storage after mudding is the likely culprit.  Or the instructions will actually be uncovered open box (and obeyed) and the last 5 paragraphs will be completely irrelevant.  That would be good.

Conclusion: A few of the changes are obviously common-sense good.  Gameday mudding continues to make no sense to me and looks like it’s just asking for trouble.  Box storage in the humidor after mudding, if the boxes are closed, may be introducing a new problem. It’s unclear to me if the new ball-bag procedures reduce time sufficiently to prevent restricted-airflow problems from arising there, although it’s at least clearly a considered attempt to mitigate a potential problem.

The Mythic Limited Rating Problem


Thanks to @twoduckcubed for reading my previous work and being familiar enough with high-end limited winrates to see that there was likely to be a real problem here, and there is.  If you haven’t read my previous work, it’s at One last rating update and Inside the MTG: Arena Rating System, but as long as you know anything at all about any MMR system, this post is intended to be readable by itself.

Mythic MMR starts from scratch each month and each player is assigned a Mythic MMR when they first make Mythic that month.  Most people start at the initial cap, 1650, and a few start a bit below that.  It takes losing an awful lot to be assigned an initial rating very far below that, and since losing a bunch of limited matches costs resources (while doing it in ranked constructed is basically free), it’s mostly 1650 or close.  When two people with a Mythic rating play in Premier or Quick, it’s approximately an Elo system with a K value of 20.4, and the matches are zero-sum.  When one player wins points, the other player loses exactly the same number of points.

Most games that Mythic limited players play aren’t against other Mythics though.  Diamonds are the most common opponents, with significant numbers of games against Platinums as well (and a handful against Gold/Silver/Bronze).  In this case, since the non-Mythic opponents literally don’t have a Mythic MMR to plug into the Elo formula, Arena, in a decision that’s utterly incomprehensible on multiple levels, rates all of these matches exactly the same regardless of the Mythic’s rating or the non-Mythic’s rank or match history.  +7.4 points for a win and -13 points for a loss, and this is *not* zero-sum because the non-Mythic doesn’t have a Mythic rating yet.  The points are simply created out of nothing or lost to the void.

+7.4 for a win and -13 for a loss means that the Mythics need to win 13/(13+7.4) = 63.7% of the time against non-Mythics to break even.   And, well, thanks to the 17lands data dumps, I found that they won 58.3% in SNC and 59.4% in NEO (VOW and MID didn’t seem to have opponent rank data available).  Nowhere close to breakeven.  ~57% vs. Diamonds and ~63% vs Plats.  Not even breakeven playing down two ranks.  And this is already a favorable sample for multiple reasons.  It’s 17lands users, who are above average Mythics (their Mythic-Mythic winrate is 52.4%).  It’s also a game-averaged sample instead of a player-averaged sample, and better players play more games on average in Mythic because they get there faster and have more resources to keep paying entry fees with.

Because of this, to a reasonable approximation, every time a Mythic Limited player hits the play button, 1 MMR vanishes into the void.  And since 1% of Mythic in limited is only ~16.5 MMR, 1% Mythic in expectation is lost every 2-3 drafts just for playing.  The more they play, the more MMR they lose into the void.  The very best players- those who can win 5% more than the average 17lands-using Mythic drafter- can outrun this and profit from playing lower ranks- but the vast majority can’t, hence the video at the top of the post.  Instead of Mythic MMR being a zero-sum game, it’s like gambling against a house edge, and playing at all is clearly disincentivized for most people.
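The arithmetic behind the break-even rate and the ~1 MMR per game drain, using the fixed ±7.4/-13 values and the observed 17lands winrates from above:

```python
WIN_PTS, LOSS_PTS = 7.4, 13.0   # fixed Mythic MMR change vs. any non-Mythic

# winrate needed vs. non-Mythics just to tread water
breakeven = LOSS_PTS / (LOSS_PTS + WIN_PTS)   # ~0.637

def expected_mmr_per_game(p_win):
    """Expected Mythic MMR change per game against non-Mythic opponents."""
    return p_win * WIN_PTS - (1 - p_win) * LOSS_PTS

snc_drain = expected_mmr_per_game(0.583)   # ~-1.1 MMR/game (SNC sample)
neo_drain = expected_mmr_per_game(0.594)   # ~-0.9 MMR/game (NEO sample)
```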

Obviously this whole implementation is just profoundly flawed and needs to be fixed.  The 17lands data is anonymized, so I don’t know how many Mythic-Mythic games appeared from both sides, so I don’t know exactly what percentage of a Mythic’s games are against each level, but it’s something like 51% against Diamond, 29% against Mythic, 19% against Plat, 1% Gold and below.  Clearly games vs Diamonds need to be handled responsibly, and games vs. Golds and below don’t matter much.

A simple fix that keeps most of the system intact (which may not be the best idea, but hey, at least it’s progress) is to assign the initial Mythic MMR upon making Platinum (instead of Mythic) and to not Mythic-rate any games involving a Gold or below.  You wouldn’t get leaderboard position or anything until actually making Mythic, but the rating would be there behind the scenes doing its thing and this silliness would be avoided since all the rated games would be zero-sum and all the Diamond opponents would be reasonably rated for quality after playing enough games to get out of Plat.

Constructed has the same implementation, but it’s mostly not as big a deal because outside of Alchemy, cross-rank pairing isn’t very common except at the beginning of the month, and even if top-1200 quality players are getting scammed out of points by lower ranks at the start of the month (and they may well not be), they have all the time in the world to reequilibrate their rating against a ~100% Mythic opponent lineup later.  Drafters play against bunches of non-Mythics throughout.  Cross-rank pairing in Alchemy ranked may be/become enough of a problem to warrant a similar fix (although likely for the opposite reason, farming lower ranks instead of losing points to them), and it’s not like assigning the initial Mythic rating upon reaching Diamond and ignoring games against lower ranks actually hurts anything there either.

Baseball’s Last Mile Problem

2022 has brought a constant barrage of players criticizing the baseballs as hard to grip and wildly inconsistent from inning to inning, and probably not coincidentally, a spike in throwing error rates to boot.  “Can’t get a grip” and “throwing error” do seem like they might go together.  MLB has denied any change in the manufacturing process; however, there have been changes this season in how balls are handled in the stadium, and I believe that is likely to be the culprit.

I have a plausible explanation for how the new ball-handling protocol can cause even identical balls from identical humidors to turn out wildly different on the field, and it’s backed up by experiments and measurements I’ve done on several balls I have, but until those experiments can be repeated at an actual MLB facility (hint, hint), this is still just a hypothesis, albeit a pretty good one IMO.

Throwing Errors

First, to quantify the throwing errors, I used Throwing Errors + Assists as a proxy for attempted throws (it doesn’t count throws that are accurate but late, etc), and broke down TE/(TE+A) by infield position.

TE/(TE+A)   2022     2021     2011-20 max   21-22 Increase   2022 By Chance
C           9.70%    7.10%    9.19%         36.5%             1.9%
3B          3.61%    2.72%    3.16%         32.7%             0.8%
SS          2.20%    2.17%    2.21%          1.5%            46.9%
2B          1.40%    1.20%    1.36%         15.9%            20.1%

By Chance is the binomial odds of getting the 2022 rate or worse using 2021 as the true odds.  Not only are throwing errors per “opportunity” up over 2021, but they’re higher than every single season in the 10 years before that as well, and way higher for C and 3B.   C and 3B have the least time on average to establish a grip before throwing.  This would be interesting even without players complaining left and right about the grip.
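For reference, the By Chance column is a one-sided binomial tail: the probability of seeing the 2022 rate or worse if the 2021 rate were the true odds.  A sketch of that computation (the attempt count below is a made-up illustration, since the table doesn’t include raw totals):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more errors
    in n throw attempts if the true error rate is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# hypothetical example: 420 catcher throw attempts at the 2021 rate (7.10%),
# observing the 2022 rate (9.70%, i.e. 41+ errors) or worse
p_chance = binom_tail(41, 420, 0.071)
```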

The Last Mile

To explain what I suspect is causing this, I need to break down the baseball supply chain.  Baseballs are manufactured in a Rawlings factory, stored in conditions that, to the best of my knowledge, have never been made public, shipped to teams, sometimes stored again in unknown conditions outside a humidor, stored in a humidor for at least 2 weeks, and then prepared and used in a game.  Borrowing the term from telecommunications and delivery logistics, we’ll call everything after the 2+ weeks in the humidor the last mile.

Humidors were in use in 9 parks last year, and Meredith Wills has found that many of the balls this year are from the same batches as balls in 2021.  So we have literally some of the same balls in literally the same humidors, and there were no widespread grip complaints (or equivalent throwing error rates) in 2021.  This makes it rather likely that the difference, assuming there really is one, is occurring somewhere in the last mile.

The last mile starts with a baseball that has just spent 2+ weeks in the humidor.  That is long enough to equilibrate, per other prior published research and my own past experiments.  Getting atmospheric humidity changes to meaningfully affect the core of a baseball takes on the order of days to weeks.  That means that nothing humidity-wise in the last mile has any meaningful impact on the ball’s core because there’s not enough time for that to happen.

This article from the San Francisco Chronicle details how balls are prepared for a game after being removed from the humidor, and since that’s paywalled, a rough outline is:

  1. Removed from humidor at some point on gameday
  2. Rubbed with mud/water to remove gloss
  3. Reviewed by umpires
  4. Not kept out of the humidor for more than 2 hours
  5. Put in a security-sealed bag that’s only opened in the dugout when needed

While I don’t have 2022 game balls or official mud, I do have some 2019* balls, water, and dirt, so I decided to do some science at home.  Again, while I have confidence in my experiments done with my balls and my dirt, these aren’t exactly the same things being used in MLB, so it’s possible that what I found isn’t relevant to the 2022 questions.

Update: Dr. Wills informed me that 2019, and only 2019, had a production issue that resulted in squashed leather and could have affected the mudding results.  She checked my batch code, and it looks like my balls were made late enough in 2019 that they were actually used in 2020 with the non-problematic production method.  Yay.

Experiments With Water

When small amounts of water are rubbed on the surface of a ball, it absorbs pretty readily (the leather and laces love water), and once the external source of water is removed, that creates a situation where the outer edge of the ball is more moisture-rich than what’s slightly further inside and more moisture-rich than the atmosphere.  The water isn’t going to just stay there- it’s either going to evaporate off or start going slightly deeper into the leather as well.

As it turns out, if the baseball is rubbed with water and then stored with unrestricted air access (and no powered airflow) in the environment it was equilibrated with, the water entirely evaporates off fairly quickly with an excess-water half-life of a little over an hour (and this would likely be lower with powered air circulation) and goes right back to its pre-rub weight down to 0.01g precision.  So after a few hours, assuming you only added a reasonable amount of water to the surface (I was approaching 0.75 grams added at the most) and didn’t submerge the ball in a toilet or something ridiculous, you’d never know anything had happened.  These surface moisture changes are MUCH faster than the days-to-weeks timescales of core moisture changes.
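As a simple model of that drying curve (exponential decay is my modeling assumption; the ~70-minute half-life and ~0.75 g maximum are the measurements above):

```python
def excess_water(t_minutes, w0=0.75, half_life=70.0):
    """Grams of excess surface water remaining t minutes after rubbing,
    for a ball drying in free airflow at its equilibrium RH."""
    return w0 * 0.5 ** (t_minutes / half_life)
```

So a maximally wetted ball is down to ~0.13 g of excess after three hours and fades below the 0.01 g scale resolution a few hours after that.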

Things get much more interesting if the ball is then kept in a higher-humidity environment.  I rubbed a ball down, wiped it with a paper towel, let it sit for a couple of minutes to deal with any surface droplets I missed, and then sealed the ball in a sandwich bag for 2 hours along with a battery-powered portable hygrometer.  I expected the ball to completely saturate the air while losing less mass than I could measure (<0.01g) in the process, but that’s not what actually happened.  The relative humidity in the bag only went up 7%, and as expected, the ball lost no measurable amount of mass.  After taking it out, it started losing mass with a slightly longer half-life than before and lost all the excess water in a few hours.

I repeated the experiment except this time I sealed the ball and the hygrometer in an otherwise empty 5-gallon pail.  Again, the relative humidity only went up 7%, and the ball lost 0.04 grams of mass.  I calculated that 0.02g of evaporation should have been sufficient to cause that humidity change, so I’m not exactly sure what happened- maybe 0.01 was measurement error (the scale I was using goes to 0.01g), maybe my seal wasn’t perfectly airtight, maybe the crud on the lid I couldn’t clean off or the pail itself absorbed a little moisture.  But the ball had 0.5g of excess water to lose (which it did completely lose after removal from the pail, as expected) and only lost 0.04g in the pail, so the basic idea is still the same.
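The 0.02g figure is just the vapor capacity of the pail.  A quick check (the ~17 g/m³ saturation density assumes roughly room temperature, ~20 °C, and I’m ignoring the small volume taken up by the ball):

```python
PAIL_VOLUME_M3 = 5 * 3.785 / 1000   # 5 US gallons ~= 0.019 m^3 of air
SAT_DENSITY_G_M3 = 17.3             # saturation water vapor density at ~20 C (g/m^3)

def water_for_rh_rise(rh_rise):
    """Grams of evaporated water needed to raise the pail's RH by rh_rise (fraction)."""
    return rh_rise * SAT_DENSITY_G_M3 * PAIL_VOLUME_M3

needed = water_for_rh_rise(0.07)    # ~0.02 g for the observed 7% RH rise
```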

This means that if the wet ball has restricted airflow, it’s going to take for freaking ever to reequilibrate (because it only takes a trivial amount of moisture loss to “saturate” a good-sized storage space), and that if it’s in a sealed environment or in a free-airflow environment more than 7% RH above what it’s equilibrated to, the excess moisture will travel inward to more of the leather instead of evaporating off (and eventually the entire ball would equilibrate to the higher-RH environment, but we’re only concerned with the high-RH environment as a temporary last-mile storage condition here, so that won’t happen on our timescales).

I also ran the experiment sealing the rubbed ball and the hygrometer in a sandwich bag overnight for 8 hours.  The half-life for losing moisture after that was around 2.5 hours, up from the 70 minutes when it was never sealed.  This confirms that when the excess moisture can’t immediately evaporate, it doesn’t just sit at the surface waiting- it migrates deeper into the leather- but that evaporation dominates when it’s possible.
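Treating the moisture loss as simple exponential (half-life) decay- the half-lives are my measurements from above, but the functional form is an assumption- the difference those two half-lives make is easy to see:

```python
import math

def excess_remaining(t_minutes, half_life_minutes):
    """Fraction of excess moisture left after t minutes of free airflow,
    assuming simple exponential (half-life) decay."""
    return 0.5 ** (t_minutes / half_life_minutes)

# Never-sealed ball: ~70 minute half-life
print(excess_remaining(120, 70))   # ~0.30 of the excess still left after 2 hours
# Ball sealed overnight first: ~150 minute (2.5 hour) half-life
print(excess_remaining(120, 150))  # ~0.57 still left after the same 2 hours
```

Either way, a couple of hours of free airflow is nowhere near enough to finish the job, which matters for the protocol discussion below.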

I also ran the experiment with a ball sealed in a sandwich bag for 2 hours along with an equilibrated cardboard divider that came with the box of balls I have.  That didn’t make much difference. The cardboard only absorbed 0.04g of the ~0.5g excess moisture in that time period, and that’s with a higher cardboard:ball ratio than a box actually comes with.  Equilibrated cardboard can’t substitute for free airflow on the timescale of a couple of hours.

Experiments With Mud

I mixed dirt and water to make my own mud and rubbed it in, doing my best imitation of videos I could find, until the surface of the ball felt dry again.  Since I don’t have any kind of instrument to measure slickness, these are my perceptions plus those of my significant other.  We were in almost full agreement on every ball, and the one disagreement converged on the next measurement 30 minutes later.

If stored with unrestricted airflow in the environment it was equilibrated to, this led to roughly the following timeline:

  1. t=0: mudded, ball surface feels dry.
  2. t=30 minutes: ball surface feels moist and is worse than when it was first mudded.
  3. t=60 minutes: ball surface is drier and is similar in grip to when first mudded.
  4. t=90 minutes: ball is significantly better than when first mudded.
  5. t=120 minutes: no noticeable change from t=90 minutes.
  6. t=12 hours: no noticeable change from t=120 minutes.

I tested a couple of other things as well:

  1. I took a 12-hour ball, put it in a 75% RH environment for an hour and then a 100% RH environment for 30 minutes, and it didn’t matter.  The ball surface was still fine.  The ball would certainly go to hell eventually under those conditions, but it doesn’t seem likely to be a concern with anything resembling current protocols.  I also stuck one in a bag for a while and it didn’t affect the surface or change the RH at all, as expected since all of the excess moisture was already gone.
  2. I mudded a ball, let it sit out for 15 minutes, and then sealed it in a sandwich bag.  This ball was slippery at every time interval, 1 hour, 2 hours, 12 hours. (repeated twice).  Interestingly, putting the ball back in its normal environment for over 24 hours didn’t help much and it was still quite slippery.  Even with all the excess moisture gone, whatever had happened to the surface while bagged had ruined the ball.
  3. I mudded a ball, let it sit out for 2 hours, at which point the surface was quite good per the timeline above, and then sealed it in a bag.  THE RH WENT UP AND THE BALL TURNED SLIPPERY, WORSE THAN WHEN IT WAS FIRST MUDDED. (repeated 3x).  Like #2, time in the normal environment afterwards didn’t help.  Keeping the ball in its proper environment for 2 hours, sealing it for an hour, and then letting it out again was enough to ruin the ball.

That’s really important IMO.  We know from the water experiments that it takes more than 2 hours to lose the excess moisture under my storage conditions, and it looks like the combination of fresh(ish) mud plus excess surface moisture that can’t evaporate off is a really bad combo and a recipe for slippery balls.  Ball surfaces can feel perfectly good and game-ready while they still have some excess moisture left and then go to complete shit, apparently permanently, in under an hour if the evaporation isn’t allowed to finish.

Could this be the cause of the throwing errors and reported grip problems? Well…

2022 Last-Mile Protocol Changes

The first change for 2022 is that balls must be rubbed with mud on gameday, meaning they’re always taking on that surface moisture on gameday.  In 2021, balls had to be mudded at least 24 hours in advance of the game; 2021 narrowed the window to 1-2 days in advance, where it used to be up to 5 days in advance of the game.  I don’t know how far in advance they were regularly mudded before 2021, but even early afternoon for a night game would be fine assuming the afternoon storage had reasonable airflow.

The second change is that they’re put back in the humidor fairly quickly after being mudded and allowed a maximum of 2 hours out of the humidor.  While I don’t think there’s anything inherently wrong with putting the balls back in the humidor after mudding (unless it’s something specific to 2022 balls), humidors look something like this.  If the balls are kept in a closed box, or an open box with another box right on top of them, there’s little chance that they reequilibrate in time.  If they’re kept in an open box on a middle shelf without much room above, unless the air is really whipping around in there, the excess moisture half-life should increase.

There’s also a chance that something could go wrong if the balls are taken out of the humidor, kept in a wildly different environment for an hour, then mudded and put back in the humidor, but I haven’t investigated that, and there are many possible combinations of both humidity and temperature that would need to be checked for problems.

The third change (at least I think it’s a change) is that the balls are kept in a sealed bag- at least highly restricted flow, possibly almost airtight- until opened in the dugout.  Even if it’s not a change, it’s still extremely relevant- sealing balls that have evaporated their excess moisture off doesn’t affect anything, while sealing balls that haven’t finished evaporating off seems to be a disaster.


Mudding adds excess moisture to the surface of the ball, and if its evaporation is prevented for very long- either through restricted airflow or storage in too humid an environment- the surface of the ball becomes much more slippery and stays that way even if evaporation continues later.  It takes hours- dependent on various parameters- for that moisture to evaporate off, and 2022 protocol changes make it much more likely that the balls don’t get enough time to evaporate off, causing them to fall victim to that slipperiness.  In particular, balls can feel perfectly good and ready while they still have some excess surface moisture and then quickly go to hell if the remaining evaporation is prevented inside the security-sealed bag.

It looks to me like MLB had a potential problem- substantial latent excess surface moisture being unable to evaporate and causing slipperiness- that prior to 2022 it was avoiding completely by chance or by following old procedures derived from lost knowledge.   In an attempt to standardize procedures, MLB accidentally made the excess surface moisture problem a reality, and not only that, did it in a way where the amount of excess surface moisture was highly variable.

The excess surface moisture when a ball gets to a pitcher depends on the amount of moisture initially absorbed, the airflow and humidity of the last-mile storage environment, and the amount of time spent in those environments and in the sealed bag.  None of those are standardized parts of the protocol, and it’s easy to see how there would be wide variability ball-to-ball and game-to-game.

Assuming this is actually what’s happening, the fix is fairly easy.  Balls need to be mudded far enough in advance and stored afterwards in a way that they get sufficient airflow for long enough to reequilibrate (the exact minimum times depending on measurements done in real MLB facilities), but as an easy interim fix, going back to mudding the day before the game and leaving those balls in an open uncovered box in the humidor overnight should be more than sufficient.  (and again, checking that on-site is pretty easy)


I found (or didn’t find) some other things that I may as well list here, along with some comments.

  1. These surface moisture changes don’t change the circumference of the baseball at all, down to 0.5mm precision, even after 8 hours.
  2. I took a ball that had stayed moisturized for 2 hours and put a 5-pound weight on it for an hour.  There was no visible distortion and the circumference was exactly the same as before along both seam axes (I oriented the pressure along one seam axis and perpendicular to the other).  To whatever extent flat-spotting is happening or happening more this season, I don’t see how it can be a last-mile cause, at least with my balls.  Dr. Wills has mentioned that the new balls seem uniquely bad at flat-spotting, so it’s not completely impossible that a moist new ball at the bottom of a bucket could deform under the weight, but I’d still be pretty surprised.
  3. The ball feels squishier to me after being/staying moisturized, and free pieces of leather from dissected balls are indisputably much squishier when equilibrated to higher humidity, but “feels squishier” isn’t a quantified measurement or an assessment of in-game impact.  The squishy-ball complaints may also be another symptom of unfinished evaporation.
  4. I have no idea if the surface squishiness in 3 affects the COR of the ball to a measurable degree.
  5. I have no idea if the excess moisture results in an increased drag coefficient.  We’re talking about changes to the surface, and my prior dissected-ball experiments showed that the laces love water and expand from it, so it’s at least in the realm of possibility.
  6. For the third time, this is a hypothesis.  I think it’s definitely one worth investigating since it’s supported by physical evidence, lines up with the protocol changes this year, and is easy enough to check with access to actual MLB facilities.  I’m confident in my findings as reported, but since I’m not using current balls or official mud, this mechanism could also turn out to have absolutely nothing to do with the 2022 game.

The 2022 MLB baseball

As of this writing on 4/25/2022, HRs are down, damage and distance on barrels are down, and both Alan Nathan and Rob Arthur have observed that the drag coefficient of baseballs this year is substantially increased.  This has led to speculation about what has changed with the 2022 balls and even what production batch of balls or mixture of batches may be in use this year.  Given the kerfuffle last year that resulted in MLB finally confirming that a mix of 2020 and 2021 balls was used during the season, that speculation is certainly reasonable.

It may well also turn out to be correct, and changes in the 2022 ball manufacture could certainly explain the current stats, but I think it’s worth noting that everything we’ve seen so far is ALSO consistent with “absolutely nothing changed with regard to ball manufacture/end product between 2021 and 2022” and “all or a vast majority of balls being used are from 2021 or 2022”.  

How is that possible?  Well, the 2021 baseball production was changed on purpose.  The new baseball was lighter, less dense, and less bouncy by design, or in more scientific terms, “dead”.  What if all we’re seeing now is the 2021 baseball specifications in their true glory, now untainted by the 2020 live balls that were mixed in last year?

Even without any change to the surface of the baseball, a lighter, less dense ball won’t carry as far.  The drag force is independent of the mass (for a given size, which changed little, if at all), and F=MA, so a constant force and a lower mass means a higher drag deceleration and less carry.
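To put rough numbers on that, here’s a sketch.  The drag force and the size of the mass change are illustrative values I made up- the actual 2020-vs-2021 mass difference isn’t public as far as I know:

```python
# Same drag force on a lighter ball -> more deceleration (a = F/m).
# Both masses and the force here are illustrative, not measured values.
drag_force_n = 1.3     # hypothetical drag force on a ~90 mph fastball, newtons
old_mass_kg = 0.1459   # ~5.145 oz, middle of the legal 5-5.25 oz range
new_mass_kg = 0.1445   # hypothetical ~1% lighter ball

old_decel = drag_force_n / old_mass_kg
new_decel = drag_force_n / new_mass_kg
print(f"drag deceleration up {100 * (new_decel / old_decel - 1):.2f}%")  # ~1%
```

A ~1% lighter ball decelerates ~1% harder under the same drag force, no change to the surface required.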

The aforementioned measurements of the drag coefficient from Statcast data also *don’t measure the drag coefficient*.  They measure the drag *acceleration* and use an average baseball mass value to convert to the drag force (which is then used to get the drag coefficient).  If they’re using the same average mass for a now-lighter ball, they’re overestimating the drag force and the drag coefficient, and the drag coefficient may literally not have changed at all (while the drag acceleration did go up, per the previous paragraph).
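That inference chain can be made concrete with the standard drag equation, Cd = 2ma / (ρAv²).  This is a sketch- the acceleration and masses are made-up illustrative numbers, not Statcast values:

```python
# Statcast-style inference: measure the drag *acceleration*, then convert to
# a drag coefficient using an assumed mass: Cd = 2*m*a / (rho * A * v^2).
# If the real ball is lighter than the assumed mass, Cd gets overestimated.
RHO = 1.2        # air density, kg/m^3
AREA = 0.00426   # cross-section of a ~7.4 cm diameter ball, m^2
v = 40.0         # m/s, roughly 90 mph

assumed_mass = 0.1459  # kg, a historical-average mass used in the conversion
true_mass = 0.1445     # kg, hypothetical lighter 2021-spec ball
accel = 9.0            # m/s^2, the measured drag deceleration (made-up value)

cd_inferred = 2 * assumed_mass * accel / (RHO * AREA * v**2)
cd_true = 2 * true_mass * accel / (RHO * AREA * v**2)
print(f"inferred Cd {cd_inferred:.4f} vs true Cd {cd_true:.4f}")
# same measured acceleration, but the inferred Cd runs ~1% high
```

The inferred Cd is high by exactly the ratio of assumed mass to true mass, so a lighter ball plus an unchanged assumed mass shows up as "increased drag coefficient" even if the true Cd never moved.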

Furthermore, I looked at pitchers who threw at least 50 four-seam fastballs last year after July 1, 2021 (after the sticky stuff crackdown) and have also thrown at least 50 FFs in 2022.  This group is, on average, -0.35 MPH and +0.175 RPM on their pitches.  These stats usually move in the same direction, and a 1 MPH increase “should” increase spin by about 20 RPM.  So the group should have lost around 7 RPM from decreased velocity and actually wound up slightly positive instead.  It’s possible that the current baseball is just easier to spin based on surface characteristics, but it’s also possible that it’s easier to spin because it’s lighter and has less rotational inertia.  None of this is proof, and until we have results from experiments on actual game balls in the wild, we won’t have a great idea of the what or the why behind the drag acceleration being up. 
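Spelling out the spin arithmetic from that paragraph (the ~20 RPM-per-MPH coupling is the rule of thumb quoted above, not something I derived):

```python
RPM_PER_MPH = 20           # rule-of-thumb velocity/spin coupling from the text
velo_change_mph = -0.35    # group average change, post-crackdown 2021 vs 2022
observed_rpm_change = 0.175

expected_rpm_change = velo_change_mph * RPM_PER_MPH   # about -7 RPM
surplus_rpm = observed_rpm_change - expected_rpm_change
print(f"expected {expected_rpm_change:+.0f} RPM, surplus {surplus_rpm:+.1f} RPM")
```

So the group is spinning the ball roughly 7 RPM better than the velocity drop alone would predict.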

(It’s not (just) the humidor- drag acceleration is up even in parks that already had a humidor, and in places where a new humidor would add some mass, making the ball heavier is the exact opposite of what’s needed to match drag observations, although being in the humidor could have other effects as well)

New MTG:A Event Payouts

With the new Premier Play announcement, we also got two new constructed event payouts and a slightly reworked Traditional Draft.

This is an analysis of the EV of those events for various winrates (game winrate for Bo1, match winrate for Bo3).  I give the expected gem return, the expected number of packs, the expected number of play points, and two ROIs: one counting packs as 200 gems (store price), the other counting packs as 22.5 gems (if you have all the cards).  These are ROIs for gem entry.  For gold entry, multiply by whatever you’d otherwise use gold for.  If you’d buy packs, then multiply by 3/4.  If you’d otherwise draft, then the constructed event entries are at the same exchange rate.

Traditional Draft (1500 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 578 1.9 0.13 63.8% 41.4%
0.45 681 2.13 0.18 73.7% 48.6%
0.5 794 2.38 0.25 84.6% 56.5%
0.55 917 2.65 0.33 96.4% 65.1%
0.6 1050 2.94 0.43 109.3% 74.4%
0.65 1194 3.26 0.55 123.1% 84.5%
0.7 1348 3.6 0.69 137.9% 95.3%
0.75 1513 3.95 0.84 153.6% 106.8%

THIS DOES NOT INCLUDE ANY VALUE FROM THE CARDS TAKEN DURING THE DRAFT, which, if you value the cards, is a bit under 3 packs on average (no WC progress).
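The Traditional Draft gem and pack columns above can be reproduced with a simple binomial EV calculation.  The payout table in this sketch (100/250/1000/2500 gems and 1/1/3/6 packs for 0-3 match wins) is inferred from the numbers, so treat it as my reconstruction rather than an official list:

```python
from math import comb

# Traditional Draft: play exactly 3 Bo3 matches regardless of record.
# Payouts by match wins (0-3) are inferred from the table, not official.
GEMS = [100, 250, 1000, 2500]
PACKS = [1, 1, 3, 6]

def trad_draft_ev(p):
    """Expected gems and packs for a per-match winrate p over 3 matches."""
    probs = [comb(3, k) * p**k * (1 - p)**(3 - k) for k in range(4)]
    gems = sum(pr * g for pr, g in zip(probs, GEMS))
    packs = sum(pr * pk for pr, pk in zip(probs, PACKS))
    return gems, packs

gems, packs = trad_draft_ev(0.5)
print(f"{gems:.0f} gems, {packs:.2f} packs")  # 794 gems, 2.38 packs, matching the table
```

The other rows fall out the same way (e.g. a 0.4 winrate gives ~578 gems and ~1.9 packs).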

Bo1 Constructed Event (375 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 129 0.65 0.03 68.7% 38.2%
0.45 160 0.81 0.05 85.9% 47.4%
0.5 195 1 0.09 105.6% 58.1%
0.55 235 1.22 0.15 128.0% 70.0%
0.6 278 1.47 0.23 152.7% 83.0%
0.65 323 1.74 0.34 179.0% 96.5%
0.7 367 2.03 0.46 205.9% 109.9%
0.75 408 2.31 0.6 231.8% 122.7%

Bo3 Constructed Event (750 gem entry)

Winrate Gems Packs Points ROI (200) ROI (22.5)
0.4 292 1.67 0.04 83.5% 43.9%
0.45 348 1.76 0.07 93.3% 51.6%
0.5 408 1.84 0.125 103.5% 59.9%
0.55 471 1.92 0.2 113.9% 68.5%
0.6 535 1.99 0.31 124.4% 77.3%
0.65 600 2.06 0.464 135.0% 86.2%
0.7 664 2.14 0.67 145.6% 95.0%
0.75 727 2.22 0.95 156.1% 103.5%

LED Lifetime Claims and the Questionable Standards Behind Them

This is a post somewhat outside my areas of expertise.  I have modeled kinetics in the past, and the math here isn’t particularly complicated, so I’m not too worried there.  I’ve looked at test results covering tens of millions of hours.  My biggest concern is that, because I’m not a materials scientist or otherwise industry-adjacent, I’ve simply never encountered some relevant data or information due to paywalling and/or Google obscurity.  Hopefully not, but I’m not betting my life on that.  Continue reading accordingly.

As a quick digression, in case I’m getting any “the government uses these standards, so they must be accurate” readers, let’s look at Artificial Sweetener labeling.  Despite a cup of Splenda having more calories than a cup of blueberries, despite being ~90%+ rapidly-and-easily-digestible carbohydrates, and despite having ~90% of the caloric content of regular table sugar gram-for-gram, it’s legally allowed, via multiple layers of complete bullshit, to advertise Splenda as a “zero calorie food” and to put “0 calories” on the nutrition label on a bag/box/packet.  This isn’t due to the FDA not knowing that maltodextrin has calories or any other such form of ignorance.  It’s the FDA literally creating a known-to-be-bullshit standard to create a known-to-be-bullshit Official Number for the sole purpose of allowing decades of deliberately deceptive marketing.  Official Numbers created from Official Procedures are simply Official Numbers and need bear no resemblance to the actual numbers, and this situation can persist for decades even when the actual numbers are known beyond all doubt to be very different.  That’s a good thing to remember in many contexts, but it also applies here.

Ok, so, back to LED testing.  LED light sources can (legitimately) last for tens of thousands of hours, sometimes over 50,000 hours.  Given that there aren’t even 9000 hours in a year, doing a full lifetime test would take over 5 years, and waiting to sell products until the tests are complete would mean only selling ancient products.  In a fast-moving field with improving design technology, this doesn’t do anybody any good, hence the desire to do partial testing to extrapolate a useful lifetime.

This is a fine idea in theory, and led to the LM-80 and TM-21 standards.  LM-80 is a measurement standard, and for our purposes here, it basically says “run the bulb constantly under some specified set of conditions and measure the brightness of the bulb for at least 6000 hours total and at intervals no greater than every 1000 hours”.  Sure, fine for what it is.

TM-21 is a calculation standard that uses the LM-80 data.  It says, effectively, “take the last 5000 hours of data, or the second half of the test run, whichever is longer, and fit to an exponential decay curve.  Extrapolate forward based on that”.  This is where the problems start.

Light sources can fail in multiple ways, either “catastrophic” complete failure, like when the filament on an incandescent bulb finally snaps and the bulb instantly transitions from great to totally useless, or by gradually dimming over time until it’s not bright enough anymore.  LEDs themselves generally fail by the latter mechanism, and the failure threshold is defined to be 70% of initial brightness (based on research that I haven’t examined that shows that people are generally oblivious to brightness drops below that level).  So, roughly, fit the LM-80 brightness data to an exponential decay curve, see how long it takes that curve to decay to 70% brightness, and call that the lifetime (with another caveat not to claim lifetimes tooooo far out, but that part isn’t relevant yet).  Numbers besides 70% are also reported sometimes, but the math/extrapolation standard is the same so I won’t worry about that.
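Here’s a minimal sketch of that math- my paraphrase of the TM-21-style calculation, not the official implementation, and it ignores the standard’s cap on how far out you’re allowed to project: fit Φ(t) = B·exp(-αt) to the lumen-maintenance data, then solve for the time to hit 70%.

```python
import math

def tm21_style_fit(times_h, lumen_fractions):
    """Least-squares fit of ln(phi) = ln(B) - alpha*t, then solve
    B*exp(-alpha*t) = 0.70 for t.  A paraphrase of TM-21's method,
    not the standard itself (no projection cap, no data-window rules)."""
    n = len(times_h)
    ys = [math.log(p) for p in lumen_fractions]
    xbar = sum(times_h) / n
    ybar = sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(times_h, ys)) / \
            sum((x - xbar) ** 2 for x in times_h)
    alpha = -slope
    if alpha <= 0:
        return float("inf")  # "exponential growth" fits project infinite life
    ln_b = ybar + alpha * xbar
    return (ln_b - math.log(0.70)) / alpha  # hours to reach 70% (L70)

# Made-up, perfectly-behaved data: ~0.5%/1000h decay over a 1000-6000h window
times = [1000, 2000, 3000, 4000, 5000, 6000]
data = [math.exp(-5e-6 * t) for t in times]
print(f"L70 = {tm21_style_fit(times, data):.0f} hours")  # ~71,335 hours
```

Note what happens with a tiny bit of upward noise: if the fitted slope comes out positive, the projected lifetime is infinite, which is exactly the “exponential GROWTH of luminosity” nonsense discussed below.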

Using LM-80 + TM-21 lifetime estimates relies on several important assumptions that can all fail:

  1. Long-term lumen maintenance (beyond the test period) follows an exponential decay curve down to 70%
  2. Lumen maintenance inside the test period, what goes into the TM-21 calculations, also follows an exponential decay curve with the same parameters as #1.
  3. LM-80 data collection is sufficient, in both precision and accuracy, to allow proper curve fitting (if #2 is true).
  4. Test conditions are representative of real-life conditions (or the difference is irrelevant)
  5. Other failure modes don’t much matter for the true lifetime

In my opinion, ALL of these assumptions fail pretty clearly and substantially.  Leaving #1 until last is burying the lede a bit, but it’s also the messiest of the objections due to an apparent shortage of quality data.

Going in reverse order, #5 can be a pretty big problem.  First, look at this diagram of an LED bulb.  The LM-80/TM-21 standard ONLY REFERS TO THE LED, NOT TO THE BULB AS A WHOLE.  The end consumer only cares about the bulb as a whole.  There’s an analogous whole-bulb/whole-product standard (LM-84/TM-28), but it’s mostly optional and seems to be performed and publicly reported much less often.  For example, ENERGY STAR(r) certification can be obtained with only LM-80/TM-21 results.  It is estimated (here, here, and other places) that only 10% of full-product failures are from the LED.  These products are only as good as their weakest link, so good LEDs can be packaged with dogshit secondary components to produce a dogshit overall product, and that overall product can easily get certifications and claim an absurdly high lifetime just based on the LED.  Trying to give an exact overall failure rate is pointless with so many different brands/designs/conditions, but simply googling LED failure rate should make it abundantly clear that there are problems here, and I have a link later where around 30% of products failed inside 12,000 hours.

For some applications, like normal screw-in bulbs, it really doesn’t matter that much.  Any known brand is still (likely) the best product and the replacement costs are pretty trivial- buy a new bulb and screw it in.  However, for other applications… there are scores of new products coming out of the Integrated LED variety.  These, uh, integrate the LED with the entire fixture and often require complete replacement upon failure, or if you’re luckier, sourcing special spare parts and possibly working with wiring or hiring an electrician.  In these cases, it’s *extremely* important to have a good idea of the true lifetime of the product, and good luck with that in general.  I’ve looked at a few residential and commercial products and none of the lifetime claims give any indication of how they’re derived.  One call to an electrician when an integrated LED fails completely obliterates any possible longer-lifetime-based savings, so if you can’t do the replacement yourself, it’s probably wise to stick to conventional fixtures and separate LED bulbs.

Even if LM-84/TM-28 replaced LM-80/TM-21, there would still be problems, and the remaining points apply (or would apply) to both.  Continuing with point #4- testing conditions not matching real-life conditions.  This is clearly true- test conditions have an isolated LED running 24/7 at a perfectly controlled constant temperature, humidity, and current, while real life is the opposite of that in every way- so the question is how much it matters.

Fortunately, I found this document after writing the part above for problem #5.  I came to the same conclusions as the author did in their section 2.4.2, and I may as well save time and just refer to their section 2.4.3 for point #4.  TL;DR operating conditions matter.  On-off cycling (when the system has at least an hour to cool down to ambient temperature before heating back up) causes important amounts of thermal stress.

On to point #3- the accuracy and precision of the gathered data.  I don’t know if this is because testing setups aren’t good/consistent enough, or if LEDs are just a bit weird in a way that makes the collection of nice data impossible or highly unlikely, but the data is a bit of a mess.  As a reminder, the idea is to take the final 5000 hours (or second half of the data set, if that’s longer) and fit it to an exponential decay curve.  It’s already a bit of a tough ask to parameterize such a curve with much confidence over a small luminosity change, but there’s another problem- the data can be noisy as hell and sometimes outright nonsense.  This dovetails with #2, where the data actually need to inspire confidence that they’re generated by an exponential decay process, and it’s easier to look at graphs one time with both of those considerations in mind, as well as one more.  Any test of >6000h is acceptable, so if this procedure is valid, it should give (roughly) consistent estimates for every possible testing window depending on when the test could have been stopped.  If this all works, 1000-6000h should give similar estimates to 5000-10000.  A 10000h test isn’t just a test of that LED- it allows a test of the entire LM-80/TM-21 framework.  And that framework doesn’t look so hot IMO.

For instance, look at the graphs on page 8/10/12 in this report.  Three of the curve fits (52, 54, 58) came out to exponential GROWTH of luminosity.  Others are noisy as hell and nowhere near the trend line.  The fit of the dark blue dots on each graph is wildly different from 2000-7000 vs 5000-10000.  This has one exponential growth fit and assorted noise.  Pages 33-34 would obviously have no business including the 1000h data point.  This aggregate I put together of 8 Cree tests (80 total bulbs) has the same general problem.  Here are the individual runs (10 bulbs for each line).  There are more like these.  I rest my case.

Even the obvious nonsense exponential growth projections are considered “successful” tests that allow a long lifetime to be claimed.  Standard errors of individual data points are, in aggregate, above 10% of the difference between adjacent data points (and much worse sometimes in individual runs of course).  It has been known forever that LEDs can behave differently early in their lifetime (for various materials science reasons that I haven’t looked into), which is why the first 1000h were always excluded from the TM-21 standard, but it’s pretty clear to me that 1000h is not a sufficient exclusion period.

Last, and certainly not least, is the question of whether the brightness decay to 70% is even exponential- or for that matter, whether it’s accelerating, decelerating, or roughly linear over time once it gets past the initial 1000h+ nonsense period.  Exponential would just be a specific class of decelerating decay.  There appears to be an absolutely incredible lack of published data on this subject.  The explanation for the choice of exponential decay in the TM-21 standard is given starting on page 17 here, and holy hell does that not inspire any confidence at all.

I’ve only found a few sources with data that goes past 15,000 hours, all listed here.  There’s probably something else out there that I haven’t been able to find, but the literature does not appear to be full of long tests.  This is a government L-Prize experiment, and figure 7.1 on page 42 is clearly concave down (accelerating decay), although the LED is likely going to last a lot longer regardless.  Furthermore, the clear winner of the “L-Prize” should be the person responsible for the Y-axis choice.  This paywalled article goes out to 20,000h.  I don’t want to post the exact figure from their paper, but it’s basically this and clearly concave down.  This, also paywalled, goes 60,000 hours(!).  They tested 10 groups of 14 bulbs, and all 10 groups were clearly concave down over the 15,000-60,000h period.

And if you parsed the abstract of that last paper quickly, you might have realized that in spite of the obvious acceleration, they fit the curve to... a decelerating decay model!  Like the Cree tests from a few paragraphs ago, this is what it looks like if you choose to include data from the initial nonsense period.  But for the purposes of projecting future life, we don’t care what the decay shape was at the beginning.  We care what the decay looks like from the test window down to 70%.  If the model is exponential (decelerating) decay and reality is linear, then it’s still clear that most of these LEDs should last a really long time (instead of a really, really long time).  If degradation accelerates, and accelerates fast enough, you can get things like this (pages 12-16) where an LED (or full bulb in this case) looks fine and then falls off a cliff far short of its projection.

Unlike the case with Splenda, where everybody relevant knows not only that the Official Number is complete bullshit, but also what the true number is, it’s not clear to me at all that people recognize how incorrect the TM-21/28 results/underlying model can be.  I feel confident that they know full-bulb results are not the same as LED results, but there are far, far too many documents like this DOE-commissioned report that uncritically treat the TM-21 result as the truth, and like this project whose stated goal was to see if LED product lifetime claims were being met, but just wound up doing a short test, sticking the numbers in TM-28, and declaring that the LED products that survived passed without ever actually testing any product to even half of a claimed lifetime, sometimes not even to 25% of claimed lifetime.  There appears to be no particular effort anymore to even try to ascertain the true numbers (or the approximate veracity of the Official Numbers).  This is unfortunate.

There is another family of tests that can reveal information about performance degradation without attempting to quantify lifetime, and that’s the accelerated degradation test.  Parts or products are tested in unfriendly conditions (temperature/temperature variability, humidity, high/variable current, etc) to see what degrades and how quickly.  Even the power cycling test from just above would be a weak member of this family because the cycling was designed to induce repeated heat-up-cool-down thermal stress that doesn’t exist in the normal continuous operation test.  Obviously it’s not good at all if a bulb craps itself relative to projections as a result of being under the horrifically unrealistic stress of being turned on and off a couple of times a day.

These tests can also shed light on whether or not decay is decelerating, constant, or accelerating.  This picture is a bit more mixed.  Although there are some results, like the one in the last paragraph and like the last figure here that are obviously hugely accelerating, there are others that aren’t.  I tried to get a large sample of accelerated degradation tests using the following criteria:

  1. Searching Google Scholar for “accelerated degradation” LED testing luminosity in 4-year chunks (2019-2022, 2015-2018, etc.  The earliest usable paper was 2007)
  2. Taking any English-language result from the first 5 pages that I could access that was an actual physical accelerated degradation test that applied a constant stress to a LED or LED product (there were some that ratcheted up stress over time, OLEDs not included, etc)
  3. Tested all or almost all samples to at least the 30% degradation cutoff and plotted actual luminosity vs. time data on a linear scale (so I could assess curvature without requesting raw data)
  4. Didn’t have some reason that I couldn’t use the data (e.g. two experiments nuked the LEDs so hard that the first data point after t=0 was already below 70% of initial luminosity.  Can’t get curvature there)

I assessed curvature in the 90%-70% range (I don’t care if things decelerate after they’re functionally dead, as the power-cycling bulbs did, and that seems to be common behavior) and came up with:

3 clearly accelerating (plus the power-cycling test above, for 4), 4 linear-ish, and 2 clearly decelerating.  One of the decelerating results was from 2007 and nuked two of the samples in under 3 hours, but it fit my inclusion criteria, so it’s counted even though I personally wouldn’t put any weight on it.  So, besides the amazing result that a couple hundred google results only produced 9 usable papers (and maybe 3 that would have been ok if they weren’t plotted on log scale), there’s no evidence here that a decelerating decay model is the right choice for what we’re trying to model.

It looks to me like the graph above is a common degradation pattern.  Instead of fitting the region to predict- which is hard, because who even knows what functional form it takes or when the initial nonsense transitions into it- people/TM-21/TM-28 either use too little data at the beginning and decide it’s exponential, or include too much data after the bulb is quite dead and also decide it’s exponential.  There’s not much evidence that I can find to support that the region to predict is actually exponential, and there’s plenty of evidence that I’ve found to contradict that.

There needs to be a concerted data-gathering effort, not just for the sake of this technology, but as a framework for future technologies.  It should be mandatory that whole-product tests are run under reasonable operating conditions (e.g. being turned off and on a couple of times a day) and under reasonable operating stresses (humidity changes, imperfect power supply, etc.) before any certification is given, and no lifetime claim should be allowed that isn’t based on such a whole-product test.  There should be a separate certification for outdoor use that requires testing under wide ranges of ambient temperatures and up to 100% humidity (or for other special environments, testing under near-worst-case versions of those conditions).  The exact details of how to run accelerated degradation tests and correlate them to reasonable lifetime estimates aren’t a trivial problem, but if we’d spent the last 15-20 years running more useful tests along the way, we would have solved it long ago.  If we’re handing out grant money, it can go to far worse uses than this.  If we’re handing out long-life certifications to big manufacturers, they can pay for some long-term validation studies as a condition.  This is true for almost any new technology that looks promising, and not already having such an automatic framework in place for them a long time ago is quite the oversight.


One last rating update

Summary of everything I know about the constructed rating system first. (Edit 6/16/22: Mythic Limited appears to be exactly the same) Details of newly-discovered things below that.

  1. Bo1 K in closely matched Mythic-Mythic matches is 20.37.   The “K” for Mythic vs. non-Mythic matches is 20.41, and 20.52 for capped Mythic vs. Mythic matches (see #2).  These are definitely three different numbers.  Go figure.
  2. The minimum MMR change for a Bo1 win/loss is +/-5 and the maximum change is +/-15.5
  3. All Bo3 K values/minimum changes are exactly double the Bo1 K values.
  4. Every number is in the uncanny valley of being very close to a nice round number but not actually being a nice round number (e.g. the 13 below is 13.018).
  5. Every match between a Mythic and a Diamond/Plat (and probably true for Gold and lower as well) is rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1, +7.4 points for a win and ~-13 points for a loss (double for Bo3)
  6. As of November 2021, all draws are +/- 0 MMR
  7. Glicko-ness isn’t detectable at Mythic.  The precalculated/capped rating changes don’t vary at all based on number of games played, and controlled “competitive” Mythic matches run at exactly the same K at different times/on different accounts.
  8. Mythic vs. Mythic matches are zero-sum
  9. MMR doesn’t drift over time
  10. MMR when entering Mythic is pretty rangebound regardless of how high or low it “should” be.  It’s capped on the low end at what I think is 1400 (~78%) and capped on the high end at 1650 (~92%).  Pre-December, this range was 1485-1650.  January #1500 MMR was 1782.
  11. Having an atrocious Mythic MMR in one month gets you easier matchmaking the next month and influences where you rank into Mythic. [Edit: via the Serious Rating, read more here]
  12. Conceding during sideboarding in a Bo3 rates the match at its current game score with the Bo1 K value (concede up 1-0 and it’s like you won a Bo1 EVEN THOUGH YOU CONCEDED THE MATCH.  Concede at 1-1 and it’s a 0-change draw.  Concede down 0-1 and lose half the rating points).  This is absolutely batshit insane behavior. (edit: finally fixed as of 3/17/2022)
  13. There are other implementation issues.  If WotC is going to ever take organized play seriously again with Arena as a part, or if Arena ever wants to take itself seriously at all, somebody should talk to me.
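Taken together, points 1-8 describe something very close to plain Elo with clamped deltas.  Here’s a minimal sketch of that reading- the function name, the 400-point logistic scale, and the clamp-then-sign ordering are my assumptions; the post only pins down K near even matchups and the +/-5 / +/-15.5 bounds:

```python
def mythic_delta(my_mmr, opp_mmr, won, bo3=False,
                 k=20.37, floor=5.0, cap=15.5, scale=400.0):
    """Hypothetical reconstruction of the Mythic vs. Mythic update:
    Elo-style expected score with the observed K, change clamped to
    [5, 15.5] for a Bo1 game, doubled for Bo3.  The 400-point scale
    is an assumption, not a measured value."""
    expected = 1.0 / (1.0 + 10.0 ** ((opp_mmr - my_mmr) / scale))
    raw = k * ((1.0 if won else 0.0) - expected)
    clamped = min(cap, max(floor, abs(raw)))  # observed min/max changes
    delta = clamped if won else -clamped
    return delta * (2.0 if bo3 else 1.0)      # Bo3 is exactly double
```

Note that because (1 - E_A) and E_B have the same magnitude, the clamp binds equally on both sides of a match, so the zero-sum property in point 8 survives the clamping.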

MMR for Mythic vs. Non-Mythic

Every match between a Mythic and a Diamond/Plat (and probably Gold and lower as well) was rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1, +7.4 points for a win, ~-13 points for a loss, and -5.6 points for a draw**(??!??!).  Bo3 values were exactly double that for win/loss.

That’s not something I was expecting to see.  Mythic MMR is not a separate thing that’s preserved in any meaningful way between months, so I have no idea what’s going on underneath the hood that led to that choice of implementation.  The draw value was obviously bugged- it’s *exactly* double the loss it should be, so somebody evidently used W+L instead of (W+L)/2.  **The 11/11/2021 update, instead of fixing the root cause of the instant-draw bug, changed all draws to +/-0 MMR and fixed the -5.6 point bug by accident.
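The arithmetic behind that diagnosis, using the observed Bo1 values:

```python
win, loss = 7.4, -13.0             # observed Bo1 Mythic vs. non-Mythic values

correct_draw = (win + loss) / 2    # -2.8: the average of the two outcomes
buggy_draw = win + loss            # -5.6: exactly the draw value observed
```

The observed -5.6 matches W+L to the decimal, which is what makes the missing /2 such a confident diagnosis.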

That led to me slightly overestimating Bo3 K before- I got paired against more Diamonds in Bo3 than Bo1 and lost more points- and to my general confusion about how early-season MMR worked.  It now looks like it’s the exact same system the whole month, just with more matches against non-Mythics thrown in early on and rated strangely.

Ranking into Mythic

Edit: Read this

Before it got changed in December, everybody ranked into Mythic between 1485 and 1650 MMR, and it took concerted effort to be anywhere near the bottom of that range.  That’s roughly, most months, an MMR that would finish at ~83-93% if no more games were played- and in reality, for anybody playing normally, 88-93%.  The end-of-month #1500 MMR that I’ve observed was in the 1800s before November, ~1797 in November, ~1770 in December, and ~1780 in January.  So no matter how well you do pre-Mythic, you have work to do (in effect, grinding up from 92-93%) to finish top-1500.

In December, the floor got reduced to what appears to be ~1400, although the exact value is unattainable because you have to end on a winning streak to get out of Diamond.  An account with atrocious MMR in the previous month that also conceded a metric shitton of games at Diamond 4 ranked in at ~1410 under the new system.  The ceiling is still 1650.  These lower initial Mythic MMRs are almost certainly a big part of the end-of-season #1500 MMR dropping by 20-30 points.

Season Reset

Edit: Read this, section is obsolete now.

MMR is preserved, at least to some degree, and there isn’t a clearly separate Mythic and non-Mythic MMR.  I’d done some work to quantify this, but then things changed in December, and I’m unlikely to repeat it.  What has changed is that previous-season Mythic MMR seems to have a bigger impact now.  I had an account with a trash MMR in January make Mythic in February without losing a single game (finally!), and it still only ranked in at 1481.  It would have been near or above 1600 under the old system, and now it’s ranking in below the previous system’s floor.

I hope everybody enjoyed this iteration, and if you’re chasing #1 Mythic this month or in the future, it’s not me.  Or if it is me, I won’t be #1 and above 1650.

A comprehensive list of non-algorithmic pairings in RLCS 2021-22 Fall Split

Don’t read this post by accident.  It’s just a data dump.

This is a list of every Swiss scoregroup in RLCS 2021-22 Fall Split that doesn’t follow the “highest seed vs lowest possible seed without rematches” pairing algorithm.

I’ll grade every round as correct, WTFBBQ (the basic “rank teams, highest-lowest” would have worked, but we got something else, or we got a literally impossible pairing like two teams who started 1-0 playing in R3 mid, or the pairings are otherwise utter nonsense), pairwise swap fail (e.g. if 1 has played 8, swapping to 1-7 and 2-8 would work, but we got something worse), and complicated fail for things that aren’t resolved with one or more independent pairwise swaps.

Everything is hand-checked through Liquipedia.

There’s one common pairwise error that keeps popping up: with 6 teams, where the #2 and #5 seeds have played (referred to as the 7-10 and 10-13 error a few times below).  Instead of leaving the 1-6 matchup alone and going to 2-4 and 3-5, they mess with the 1-6 and do 1-5, 2-6, 3-4.  The higher seed has priority for being left alone, and EU, which uses a sheet and made no mistakes all split, handled this correctly in both Regional 2 Closed R5 and Regional 3 Closed R4 Low.  EU is right.  The 7-10 and 10-13 errors below are actual mistakes.
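For reference, the rule I’m checking everything against fits in a few lines of backtracking: the top seed gets the lowest legal seed such that the rest of the group can still be paired without rematches, which automatically gives higher seeds priority for being left alone.  A sketch (the function name and data representation are mine):

```python
def pair_swiss(seeds, played):
    """Pair the highest seed against the lowest possible seed such that
    the remaining teams can still all be paired without rematches.

    seeds: teams in seed order, best first.
    played: set of frozensets, each a pair of teams that already met.
    Returns a list of (higher seed, lower seed) pairings, or None if
    no legal pairing exists.
    """
    if not seeds:
        return []
    top, rest = seeds[0], seeds[1:]
    for opp in reversed(rest):  # try the lowest remaining seed first
        if frozenset((top, opp)) in played:
            continue  # rematch, not allowed
        sub = pair_swiss([t for t in rest if t != opp], played)
        if sub is not None:  # the rest of the group can still be paired
            return [(top, opp)] + sub
    return None
```

With seeds 1-6 and the #2/#5 rematch, this returns 1-6, 2-4, 3-5 (the EU answer); with seeds 1-8 where 1 has played 8, it returns 1-7, 2-8, 3-6, 4-5.  The backtracking also handles the “a pairing with 6-11 is possible, so it must be done” cases automatically.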

SAM Fall Split Scoregroup Pairing Mistakes:

Invitational Qualifier

Round 2 High, FUR should be playing the worst 3-2 (TIME), but is playing Ins instead. WTFBBQ

Round 2 Low, NBL (highest 2-3) should be playing the worst 0-3 (FNG), but isn’t. WTFBBQ

Round 3 Low, Seeding is (W, L, GD, initial seed, followed by opponents each round if relevant)
13 lvi     0 2 -3 3
14 vprs 0 2 -4 12
15 neo  0 2 -4 15
16 fng   0 2 -5 16

but pairings were LVI-NEO and VPRS-FNG. WTFBBQ

Round 4 High:

Seedings were (1 and 2 are the 3-0s that have qualified, not a bug, the order is correct- LOC is 2 (4) because it started 2-0)
3 ts 2 1 4 5 vprs time fur
4 loc 2 1 1 14 lvi tts era
5 nbl 2 1 4 4 time vprs ins
6 tts 2 1 2 11 elv loc kru
7 elv 2 1 1 6 tts lvi nva
8 time 2 1 0 13 nbl ts dre

TS and time can’t play again, but 3-7 and 4-8 (and 5-6) is fine. Pairwise Swap fail.

Round 4 Low:

9 nva 1 2 -1 7 kru era elv
10 dre 1 2 -1 9 ins fng time
11 ins 1 2 -4 8 dre fur nbl
12 kru 1 2 -4 10 nva neo tts
13 lvi 1 2 -2 3 loc elv neo
14 vprs 1 2 -3 12 ts nbl fng

All base pairings worked, but we got ins-lvi somehow. WTFBBQ

Round 5:
Seedings were:
ts 2 2 1 5 vprs time fur elv
time 2 2 -2 13 nbl ts dre tts
loc 2 2 -2 14 lvi tts era nbl
lvi 2 2 -1 3 loc elv neo ins
vprs 2 2 -1 12 ts nbl fng nva
kru 2 2 -2 10 nva neo tts dre

Loc-LVI is the only problem, but the pairwise swap to time-lvi and loc-vprs works fine. Instead we got ts-lvi. Pairwise swap fail (being generous).

4 WTFBBQ, 2 pairwise swap fails, 6 total.


SAM regional 1 Closed Qualifier

Round 2 High has Kru-NEO and Round 2 Low has AC-QSHW which are both obviously wrong. WTFBBQ x2

Round 3 High

1 ts 2 0 6 1
2 ins 2 0 5 6
3 dre 2 0 4 4
4 neo 2 0 4 7

And we got ts-dre and neo-ins WTFBBQ

R3 Mid
5 nva 1 1 2 5 emt7 dre
6 loc 1 1 1 3 naiz ins
7 kru 1 1 -1 2 qshw neo
8 fng 1 1 -2 8 lgc ts
9 lgc 1 1 2 9 fng prds
10 ac 1 1 2 10 neo qshw
11 bnd 1 1 0 11 ins naiz
12 emt7 1 1 0 12 nva nice

5-12 and 8-9 are problems, but the two pairwise swaps (11 for 12 and 10 for 9) work. Instead we got nva-bnd and then loc-ac for no apparent reason. Pairwise swap fail.

Round 3 Low:
qshw 0 2 -5 15
nice 0 2 -6 13
naiz 0 2 -6 14
prds 0 2 -6 16

and we got qshw-naiz. WTFBBQ.

R4 High:
dre 2 1 1 4 nice nva ts
neo 2 1 1 7 ac kru ins
nva 2 1 5 5 emt7 dre bnd
ac 2 1 3 10 neo qshw loc
kru 2 1 0 2 qshw neo lgc
fng 2 1 -1 8 lgc ts emt7

neo is the problem, having played ac and kru. Dre-fng can be paired without a forced rematch, which is what the algorithm says to do, and we didn’t get it. Complicated fail

R4 Low
9 lgc 1 2 1 9 fng prds kru
10 loc 1 2 0 3 naiz ins ac
11 emt7 1 2 -1 12 nva nice fng
12 bnd 1 2 -3 11 ins naiz nva
13 naiz 1 2 -4 14 loc bnd qshw
14 prds 1 2 -5 16 ts lgc nice

9-14 and 10-13 are rematches, but the pairwise swap solves it. Instead we got EMT7-PRDS. Pairwise swap fail.

Round 5:

6 neo 2 2 0 7 ac kru ins fng
7 ac 2 2 0 10 neo qshw loc nva
8 kru 2 2 -2 2 qshw neo lgc dre
9 lgc 2 2 4 9 fng prds kru naiz
10 loc 2 2 3 3 naiz ins ac prds
11 emt7 2 2 2 12 nva nice fng bnd

7-10 and 8-9 are the problem, but instead of pairwise swapping, we got neo-lgc. Really should be a WTFBBQ, but a pairwise fail.

+4 WTFBBQ (8), +3 Pairwise (5), +1 complicated (1), 14 total.


SAM Regional 1

R4 Low

9 vprs 1 2 -1 7 drc emt7 ts
10 dre 1 2 -2 12 elv lvi nva
11 fng 1 2 -2 13 tts ftw drc
12 end 1 2 -3 8 ts loc tts
13 elv 1 2 -2 5 dre nva emt7
14 loc 1 2 -3 15 fur end ftw

10-13 is the only problem, but instead of 10-12 and 11-13, they fucked with vprs pairing. Pairwise swap fail.

+0 WTFBBQ(8), +1 pairwise swap (6), +0 complicated (1), 15 total.


SAM Regional 2 Closed Qualifier

Round 2 Low:

FDOM (0) vs END (-1) is obviously wrong. WTFBBQ. And LOLOLOL one more time at zero-game-differential forfeits.

R3 Mid:
5 fng 1 1 2 4 nice loc
6 ftw 1 1 2 7 flip vprs
7 drc 1 1 -1 3 kru elv
8 fdom 1 1 -1 9 emt7 nva
9 kru 1 1 1 14 drc ac
10 sac 1 1 0 16 vprs flip
11 end 1 1 -1 11 loc emt7
12 ball 1 1 -1 15 nva nice

The base pairings work, but we got FNG-end and ftw-ball for some reason. WTFBBQ. Even scoring the forfeit 0-3 doesn’t fix this.

R3 Low:
emt7 0 2 0 8
flip 0 2 -5 10
nice 0 2 -5 13
ac 0 2 -6 12

and we got emt-flip. WTFBBQ.

R4 High:

3 vprs 2 1 2 1 sac ftw nva
4 loc 2 1 -1 6 end fng elv
5 ftw 2 1 4 7 flip vprs ball
6 drc 2 1 2 3 kru elv sac
7 fdom 2 1 1 9 emt7 nva kru
8 end 2 1 1 11 loc emt7 fng

Base pairings worked. WTFBBQ

R4 Low:
9 fng 1 2 0 4 nice loc end
10 kru 1 2 -1 14 drc ac fdom
11 ball 1 2 -3 15 nva nice ftw
12 sac 1 2 -3 16 vprs flip drc
13 ac 1 2 -3 12 elv kru nice
14 flip 1 2 -5 10 ftw sac emt7

fng, the top seed, doesn’t play either of the teams that were 0-2.  Base pairings would have worked.  WTFBBQ.

+5 WTFBBQ (13), +0 pairwise fail (6), +0 complicated (1), 20 total.

SAM Regional 3 Closed

R4 High

3 endgame 2 1 5 1 bnd tn vprs
4 lev 2 1 4 5 sac endless elv
5 bnd 2 1 1 16 endgame benz ball
6 fng 2 1 0 6 ball resi endless
7 fdom 2 1 -1 10 ftw elv emt7
8 tn 2 1 -2 2 benz endgame ftw

end-tn doesn’t work, but the pairwise swap is fine, and instead we got end-fng!??! Pairwise fail.

R4 low:

9 emt7 1 2 0 9 endless sac fdom
10 ftw 1 2 -1 7 fdom ag tn
11 ball 1 2 -2 11 fng vprs bnd
12 endless 1 2 -3 8 emt7 lev fng
13 ag 1 2 -2 14 elv ftw benz
14 sac 1 2 -4 12 lev emt7 resi

9-14 and 10-13 are problems, but the pairwise swap is fine.  Instead we got emt7-ftw which is stunningly wrong.  Pairwise fail.

Round 5:

6 fng 2 2 -2 6 ball resi endless endgame
7 bnd 2 2 -2 16 endgame benz ball lev
8 tn 2 2 -3 2 benz endgame ftw fdom
9 ftw 2 2 1 7 fdom ag tn emt7
10 endless 2 2 0 8 emt7 lev fng ag
11 ball 2 2 -1 11 fng vprs bnd sac

This is a mess, going to the 8th most preferable pairing before finding no rematches, and.. they got it right.  After screwing up both simple R4s.

Final SAM mispaired scoregroup count: +0 WTFBBQ (13), +2 pairwise fail (8), +0 complicated (1), 22 total.

OCE Fall Scoregroup pairing mistakes


Regional 1 Invitational Qualifier

Round 2 Low:

The only 2-3, JOE, is paired against the *highest* seeded 0-3 instead of the lowest.  WTFBBQ.

R3 Mid:

5 kid 1 1 0 6 rrr riot
6 grog 1 1 0 9 joe gz
7 bndt 1 1 0 10 vc rng
8 dire 1 1 -1 3 eros wc
9 eros 1 1 0 14 dire pg
10 vc 1 1 -1 7 bndt gse
11 rrr 1 1 -2 11 kid joe
12 tri 1 1 -2 15 rng rru


8-9 and 7-10 are problems, but a pairwise swap fixes it, and instead every single pairing got fucked up.  Pairwise fail is too nice, but ok.

R4 Low:

9 eros 1 2 -2 14 dire pg kid
10 vc 1 2 -3 7 bndt gse dire
11 tri 1 2 -3 15 rng rru grog
12 rrr 1 2 -4 11 kid joe bndt
13 joe 1 2 1 8 grog rrr pg
14 gse 1 2 -3 13 wc vc rru

Base pairings work.  WTFBBQ

2 WTFBBQ, 1 pairwise, 3 total.

OCE Regional 1 Closed Qualifier

Round 2 Low:

The best 2-3 (1620) is paired against the best 0-3 (tri), not the worst (bgea).  WTFBBQ.

Round 4 High:

3 gse 2 1 1 1 tisi joe hms
4 grog 2 1 0 3 smsh bott wc
5 eros 2 1 4 5 hms rust bott
6 waz 2 1 4 11 tri hms pg
7 T1620 2 1 2 10 rru tri joe
8 rru 2 1 1 7 T1620 wc tisi


Base pairings work.  WTFBBQ.

Round 5:

6 T1620 2 2 1 10 rru tri joe gse
7 waz 2 2 1 11 tri hms pg eros
8 grog 2 2 -1 3 smsh bott wc rru
9 tisi 2 2 2 16 gse bgea rru bott
10 tri 2 2 -1 6 waz T1620 bgea joe
11 rust 2 2 -2 13 joe eros smsh pg

7-10 is the problem, but they swapped it with 6-11 instead of leaving 6-11 alone and swapping with 8-9.  Pairwise fail.

+2 WTFBBQ(4), +1 pairwise(2), 6 total.

OCE Regional 1

Round 5:

6 riot 2 2 3 2 grog rust eros bndt
7 grog 2 2 0 15 riot gz blss rrr
8 wc 2 2 -1 9 kid bndt rng dire
9 fbrd 2 2 2 13 bndt kid T1620 hms
10 blss 2 2 1 6 eros T1620 grog gse
11 kid 2 2 -2 8 wc fbrd dire eros

Exact same mistake as above. Pairwise fail.

+0 WTFBBQ(4), +1 pairwise(3), 7 total.

OCE Regional 2 Closed

Round 3 Mid:

5 blss 1 1 0 2 tri fbrd
6 pg 1 1 0 14 gse t1620
7 tisi 1 1 -1 11 rust eros
8 ctrl 1 1 -1 13 hms waz
9 gse 1 1 0 3 pg lits
10 grog 1 1 0 12 eros rust
11 bott 1 1 0 16 fbrd tri
12 hms 1 1 -1 4 ctrl bstn


Base pairings work.  WTFBBQ

Round 4 Low:

9 blss 1 2 -1 2 tri fbrd bott
10 gse 1 2 -1 3 pg lits tisi
11 pg 1 2 -1 14 gse t1620 hms
12 ctrl 1 2 -2 13 hms waz grog
13 tri 1 2 -2 15 blss bott bstn
14 lits 1 2 -4 10 t1620 gse rust


Base pairings work.  WTFBBQ

+2 WTFBBQ(6), +0 pairwise(3), 9 total.

OCE Regional 3 Closed

R3 Mid:

5 gse 1 1 2 7 lits t1620
6 wc 1 1 0 1 tisi grog
7 deft 1 1 0 6 zbb eros
8 gue 1 1 0 15 bott crwn
9 tri 1 1 1 12 grog tisi
10 bott 1 1 0 2 gue pg
11 lits 1 1 -1 10 gse ctrl
12 corn 1 1 -2 13 eros zbb

Base pairings work.  WTFBBQ

R3 Low

13 zbb 0 2 -3 11 deft corn
14 ctrl 0 2 -5 8 t1620 lits
15 pg 0 2 -5 14 crwn bott
16 tisi 0 2 -5 16 wc tri

Base Pairings work.  WTFBBQ

R4 High:

3 crwn 2 1 4 3 pg gue t1620
4 grog 2 1 1 5 tri wc eros
5 wc 2 1 3 1 tisi grog bott
6 tri 2 1 2 12 grog tisi deft
7 gue 2 1 2 15 bott crwn lits
8 corn 2 1 0 13 eros zbb gse

Base pairings work.  WTFBBQ

R4 Low:

9 gse 1 2 0 7 lits t1620 corn
10 deft 1 2 -1 6 zbb eros tri
11 bott 1 2 -3 2 gue pg wc
12 lits 1 2 -3 10 gse ctrl gue
13 zbb 1 2 0 11 deft corn ctrl
14 tisi 1 2 -3 16 wc tri pg

10-13 is the problem, pairwise swapped with 9-14 instead of 11-12.  Pairwise fail.

+3 WTFBBQ(9), +1 pairwise(4), 13 total.

OCE Regional 3:

R3 Low:

13 tri 0 2 -3 14 riot wc
14 gue 0 2 -5 15 gz t1620
15 crwn 0 2 -6 11 kid grog
16 tisi 0 2 -6 16 rng eros

Base pairings work. WTFBBQ

OCE Final count: +1 WTFBBQ(10), +0 pairwise(4), 14 total.

SSA Scoregroup Pairing Mistakes

Regional 1

Round 2 Low:

The 0 GD FFs are paired against the best teams instead of the worst.  WTFBBQ.


SSA Regional 2 Closed

Round 4 Low:

9 bw 1 2 -3 6 crim est slim
10 aces 1 2 -3 12 llg biju info
11 crim 1 2 -5 11 bw dd fsn
12 win 1 2 -5 16 mist gst auf
13 biju 1 2 -2 13 info aces est
14 slap 1 2 -6 14 auf fsn gst

10-13 pairwise exchanging with 9-14 instead of 11-12.  Pairwise fail.

SSA Regional 2

R4 High

3 atk 2 1 4 2 mils ag bvd
4 org 2 1 1 3 dwym atmc op
5 atmc 2 1 4 7 mist org oor
6 llg 2 1 4 12 oor bvd dd
7 auf 2 1 3 9 ag mils exo
8 ag 2 1 0 8 auf atk mist

3-8 is the problem, but instead of pairwise swapping with 4-7, the pairing is 3-4?!?!?!  Pairwise fail.

R4 Low

9 mist 1 2 0 10 atmc dwym ag
10 dd 1 2 -3 11 exo fsn llg
11 oor 1 2 -4 5 llg info atmc
12 exo 1 2 -5 6 dd op auf
13 fsn 1 2 -1 16 op dd dwym
14 info 1 2 -2 13 bvd oor mils

Another 10-13 pairwise mistake.



SSA Regional 3 Closed

Round 5

6 fsn 2 2 -1 5 slim lft gst blc
7 mils 2 2 -1 6 blc chk ddogs oor
8 roc 2 2 -1 10 gst mlg slim mist
9 ddogs 2 2 3 4 alfa mist mils lft
10 gst 2 2 0 7 roc oor fsn alfa
11 slim 2 2 0 12 fsn llg roc est


This is kind of a mess, but 6-9, 7-8, 10-11 is correct, and instead they went with 6-7, 8-9, 10-11!??!? which is not.  Complicated fail.


SSA total (1 event pending): 1 WTFBBQ, 3 pairwise, 1 complicated.

APAC N Swiss Scoregroup pairing mistakes

Regional 2

Round 5:

6 xfs 2 2 1 8 nor gds nov dtn
7 alh 2 2 0 10 n55 timi gra hill
8 cv 2 2 -1 4 chi gra gds n55
9 gra 2 2 -1 6 blt cv alh nor
10 chi 2 2 -2 13 cv blt dtn nov
11 timi 2 2 -2 16 ve alh blt wiz

This is a mess, but a pairing is possible with 6-11, so it must be done.  Instead they screwed up that match and left 7-10 alone.  Complicated fail

APAC S Swiss Scoregroup pairing mistakes

Regional 2 Closed

Round 4 Low:

9 exd 1 2 1 14 flu dd shg
10 vite 1 2 -2 12 shg cel flu
11 wide 1 2 -3 10 pow che ete
12 pow 1 2 -4 7 wide hira acrm
13 whe 1 2 -4 9 ete acrm che
14 sjk 1 2 -4 16 cel shg axi

11-12 is the problem, and instead of swapping to 10-12 11-13, they swapped to 10-11 and 12-13.  Pairwise fail.


Round 5

6 acrm 2 2 1 13 pc whe pow ete
7 flu 2 2 -2 3 exd axi vite dd
8 shg 2 2 -3 5 vite sjk exd hira
9 exd 2 2 4 14 flu dd shg sjk
10 pow 2 2 -1 7 wide hira acrm whe
11 wide 2 2 -3 10 pow che ete vite

another 8-9 issue screwing with 6-11 for no reason.  Complicated fail.

Regional 3 Closed:

Round 3 Mid

Two teams from the 1-0 bracket are playing each other (and two from the 0-1 playing each other as well obviously).  I checked this one on Smash too.  WTFBBQ

Round 4 High

3 dd 2 1 2 2 wc woa bro
4 flu 2 1 2 4 soy fre pum
5 znt 2 1 5 1 woa wc shg
6 pc 2 1 4 3 lads pum woa
7 ete 2 1 3 7 wash bro wide
8 fre 2 1 2 12 shg flu exd

The default pairings work.  WTFBBQ

Round 4 Low

9 exd 1 2 -1 6 pum lads fre
10 wide 1 2 -1 8 bro wash ete
11 woa 1 2 -3 16 znt dd pc
12 shg 1 2 -4 5 fre soy znt
13 soy 1 2 -2 13 flu shg wc
14 lads 1 2 -2 14 pc exd wash

9-14 is the problem, and the simple swap with 10-13 works.  Instead they did 9-12, 10-11, 13-14.  Pairwise fail.


Round 5:

6 ete 2 2 2 7 wash bro wide dd
7 fre 2 2 0 12 shg flu exd znt
8 flu 2 2 -1 4 soy fre pum pc
9 soy 2 2 0 13 flu shg wc lads
10 woa 2 2 0 16 znt dd pc wide
11 shg 2 2 -3 5 fre soy znt exd

8-9 is the problem, but the simple swap with 7-10 works.  Instead we got 6-8, 7-9, 10-11.  Pairwise fail.

APAC Total: 2 WTFBBQ, 3 Pairwise fails, and 1 complicated.

NA Swiss Scoring Pairing Mistakes

Regional 3 Closed

Round 4

3 pk 2 1 4 11 rbg yo vib
4 gg 2 1 1 1 exe sq oxg
5 eu 2 1 4 14 sq exe rge
6 sq 2 1 1 3 eu gg tor
7 xlr8 2 1 1 12 vib leh sr
8 yo 2 1 1 15 tor pk clt

This is a mess.  3-8 is illegal, but 3-7 can lead to a legal pairing, so it must be done (3-7, 4-5, 6-8). Instead we got 3-6, 4-8, 5-7.  Complicated fail.

EU made no mistakes.  Liquipedia looks sketchy enough on MENA that I don’t want to try to figure out who was what seed when.

Final tally across all regions

        WTFBBQ  Pairwise  Complicated  Total
SAM         13         8            1     22
OCE         10         4            0     14
SSA          1         3            1      5
APAC         2         3            1      6
NA           0         0            1      1
EU           0         0            0      0
Total       26        18            4     48

Missing the forest for.. the forest

The paper A Random Forest approach to identify metrics that best predict match outcome and player ranking in the esport Rocket League got published yesterday (9/29/2021), and for a Cliff’s Notes version, it did two things:  1) Looked at 1-game statistics to predict that game’s winner and/or goal differential, and 2) Looked at 1-game statistics across several rank (MMR/Elo) stratifications to attempt to classify players into the correct rank based on those stats.  The overarching theme of the paper was to identify specific areas that players could focus their training on to improve results.

For part 1, that largely involves finding “winner things” and “loser things” and the implicit assumption that choosing to do more winner things and fewer loser things will increase performance.  That runs into the giant “correlation isn’t causation” issue.  While the specific Rocket League details aren’t important, this kind of analysis will identify second-half QB kneeldowns as a huge winner move and having an empty net with a minute left in an NHL game as a huge loser move.  Treating these as strategic directives- having your QB kneel more or refusing to pull your goalie ever- would be actively terrible and harm your chances of winning.

Those examples are so obviously ridiculous that nobody would ever take them seriously, but when the metrics don’t capture losing endgames as precisely, they can be even *more* dangerous, telling a story that’s incorrect for the same fundamental reason, but one that’s plausible enough to be believed.  A common example is outrushing your opponent in the NFL being correlated to winning.  We’ve seen Derrick Henry or Marshawn Lynch completely dump truck opposing defenses, and when somebody talks about outrushing leading to wins, it’s easy to think of instances like that and agree.  In reality, leading teams run more and trailing teams run less, and the “signal” is much, much more from capturing leading/trailing behavior than from Marshawn going full beast mode sometimes.

If you don’t apply subject-matter knowledge to your data exploration, you’ll effectively ask bad questions that get answered by “what a losing game looks like” and not “what (actionable) choices led to losing”.  That’s all well-known, if worth restating occasionally.

The more interesting part begins with the second objective.  While the particular skills don’t matter, trust me that the difference in car control between top players and Diamond-ranked players is on the order of watching Simone Biles do a floor routine and watching me trip over my cat.  Both involve tumbling, and that’s about where the similarity ends.

The paper identifies various mechanics and classifies rank pretty well based on them.  What’s interesting is that while those mechanics can tell a Diamond from a Bronze, when they were used to predict the outcome of a game, they all graded out as basically worthless.  While some may have suffered from adverse selection (something you do less when you’re winning), it was a pretty good selection of mechanics and they ALL sucked at predicting the winner.  And yet, beyond absolutely any doubt, the higher rank stratifications are much better at them than the lower-rank ones.  WTF?  How can that be?

The answer is in a sample constructed in a particularly pathological way, and it’s one that will be common among esports data sets for the foreseeable future.  All of the matches are contested between players of approximately equal overall skill.  The sample contains no games of Diamonds stomping Bronzes or getting crushed by Grand Champs.

The players in each match have different abilities at each of the mechanics, but the overall package always grades out similarly given that they have close enough MMR to get paired up.  So if Player A is significantly stronger than player B at mechanic A to the point you’d expect it to show up, ceteris paribus, as a large winrate effect, A almost tautologically has to be worse at the other aspects, otherwise A would be significantly higher-rated than B and the pairing algorithm would have excluded that match from the sample.  So the analysis comes to the conclusion that being better at mechanic A doesn’t predict winning a game.  If the sample contained comparable numbers of cross-rank matches, all of the important mechanics would obviously be huge predictors of game winner/loser.
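The effect is easy to reproduce with a toy simulation: give every player two skills, make winning depend on total skill, and compare random pairings against matched pairings.  This is an illustration of the selection effect, not a model of the paper’s data- every distribution and constant here is made up:

```python
import math
import random

def p1_winrate_given_skill_a_edge(matched, n=40000, seed=1):
    """How often player 1 wins, given a clear edge in skill A alone.

    With matched pairing, opponents are forced to have (nearly) equal
    total skill, so an edge in skill A implies a deficit in skill B
    and the edge predicts almost nothing about the result.
    """
    rng = random.Random(seed)
    wins = games = 0
    for _ in range(n):
        a1, b1 = rng.gauss(0, 1), rng.gauss(0, 1)  # player 1's two skills
        a2 = rng.gauss(0, 1)
        if matched:
            # matchmaking: opponent's total skill pinned near player 1's
            b2 = (a1 + b1) - a2 + rng.gauss(0, 0.05)
        else:
            b2 = rng.gauss(0, 1)  # cross-rank: opponent drawn freely
        # win probability depends only on the total-skill edge
        p_win = 1.0 / (1.0 + math.exp(-2.0 * ((a1 + b1) - (a2 + b2))))
        if a1 - a2 > 0.5:  # player 1 clearly better at skill A
            games += 1
            wins += rng.random() < p_win
    return wins / games
```

Run both ways, the cross-rank sample shows the skill-A edge strongly predicting wins, while the matched sample shows a winrate pinned to ~50% for the exact same edge- the mechanic didn’t stop mattering, the sample just can’t see it.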

The sample being pathologically constructed led to the profoundly incorrect conclusion

Taken together, higher rank players show better control over the movement of their car and are able to play a greater proportion of their matches at high speed.  However, within rank-matched matches, this does not predict match outcome.  Therefore, our findings suggest that while focussing on game speed and car movement may not provide immediate benefit to the outcome within matches, these PIs are important to develop as they may facilitate one’s improvement in overall expertise over time.

even though adding or subtracting a particular ability from a player would matter *immediately*.  The idea that you can work on mechanics to improve overall expertise (AKA achieving a significantly higher MMR) WITHOUT IT MANIFESTING IN MATCH RESULTS, WHICH IS WHERE MMR COMES FROM, is.. interesting.  It’s trying to take two obviously true statements (Higher-ranked players play faster and with more control- quantified in the paper. Playing faster and with more control makes you better- self-evident to anybody who knows RL at all) and shoehorn a finding between them that obviously doesn’t comport.

This kind of mistake will occur over and over and over when data sets comprised of narrow-band matchmaking are analysed that way.

(It’s basically the same mistake as thinking that velocity doesn’t matter for mediocre MLB pitchers- it doesn’t correlate to a lower ERA among that group, but any individuals gaining velocity will improve ERA on average)