The Mythic Limited Rating Problem

TL;DR

Thanks to @twoduckcubed for reading my previous work and being familiar enough with high-end limited winrates to see that there was likely to be a real problem here, and there is. If you haven’t read my previous work, it’s at One last rating update and Inside the MTG: Arena Rating System, but as long as you know anything at all about any MMR system, this post is intended to be readable by itself.

Mythic MMR starts from scratch each month and each player is assigned a Mythic MMR when they first make Mythic that month. Most people start at the initial cap, 1650, and a few start a bit below that. It takes losing an awful lot to be assigned an initial rating very far below that, and since losing a bunch of limited matches costs resources (while doing it in ranked constructed is basically free), it’s mostly 1650 or close. When two people with a Mythic rating play in Premier or Quick, it’s approximately an Elo system with a K value of 20.4, and the matches are zero-sum. When one player wins points, the other player loses exactly the same number of points.
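As a rough illustration of what “approximately an Elo system with a K value of 20.4” means for a Mythic-vs-Mythic match, here is a minimal sketch. The K value is from this post; the 400-point logistic scale is the conventional Elo default and is an assumption here, as are the function names.

```python
K = 20.4  # K value reported in the post


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score for player A (scale=400 is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def rate_match(rating_a: float, rating_b: float, a_won: bool, k: float = K):
    """Return both new ratings. Whatever A gains, B loses (zero-sum)."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two players at the initial cap: the winner gains K/2 and the loser drops K/2.
new_a, new_b = rate_match(1650.0, 1650.0, a_won=True)
print(round(new_a, 1), round(new_b, 1))
```

The point of the sketch is only the zero-sum property: the two ratings always add up to the same total before and after the match.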

Most games that Mythic limited players play aren’t against other Mythics though. Diamonds are the most common opponents, with significant numbers of games against Platinums as well (and a handful against Gold/Silver/Bronze). In this case, since the non-Mythic opponents literally don’t have a Mythic MMR to plug into the Elo formula, Arena, in a decision that’s utterly incomprehensible on multiple levels, rates all of these matches exactly the same regardless of the Mythic’s rating or the non-Mythic’s rank or match history. +7.4 points for a win and -13 points for a loss, and this is *not* zero-sum because the non-Mythic doesn’t have a Mythic rating yet. The points are simply created out of nothing or lost to the void.

+7.4 for a win and -13 for a loss means that the Mythics need to win 13/(13+7.4) = 63.7% of the time against non-Mythics to break even. And, well, thanks to the 17lands data dumps, I found that they won 58.3% in SNC and 59.4% in NEO (VOW and MID didn’t seem to have opponent rank data available). Nowhere close to breakeven. ~57% vs. Diamonds and ~63% vs Plats. Not even breakeven playing down two ranks. And this is already a favorable sample for multiple reasons. It’s 17lands users, who are above average Mythics (their Mythic-Mythic winrate is 52.4%). It’s also a game-averaged sample instead of a player-averaged sample, and better players play more games on average in Mythic because they get there faster and have more resources to keep paying entry fees with.
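The breakeven arithmetic above, plus the expected MMR change per game against a non-Mythic, can be written out as a short sketch. The +7.4/-13 values and the two winrates are from this post; the function names are just for illustration.

```python
WIN_GAIN = 7.4    # points gained for beating a non-Mythic
LOSS_COST = 13.0  # points lost for losing to a non-Mythic


def break_even_winrate(gain: float = WIN_GAIN, cost: float = LOSS_COST) -> float:
    """Winrate at which the expected MMR change per game is zero."""
    return cost / (gain + cost)


def expected_change(winrate: float, gain: float = WIN_GAIN, cost: float = LOSS_COST) -> float:
    """Expected MMR change per game against a non-Mythic opponent."""
    return winrate * gain - (1.0 - winrate) * cost


print(f"break-even winrate: {break_even_winrate():.1%}")
for label, winrate in [("SNC", 0.583), ("NEO", 0.594)]:
    print(f"{label}: {expected_change(winrate):+.2f} MMR per game vs. non-Mythics")
```

At the observed winrates the expected change comes out at roughly -1.1 (SNC) and -0.9 (NEO) per game against non-Mythics.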

Because of this, to a reasonable approximation, every time a Mythic Limited player hits the play button, 1 MMR vanishes into the void. And since 1% of Mythic in limited is only ~16.5 MMR, 1% Mythic in expectation is lost every 2-3 drafts just for playing. The more they play, the more MMR they lose into the void. The very best players- those who can win 5% more than the average 17lands-using Mythic drafter- can outrun this and profit from playing lower ranks- but the vast majority can’t, hence the video at the top of the post. Instead of Mythic MMR being a zero-sum game, it’s like gambling against a house edge, and playing at all is clearly disincentivized for most people.

Obviously this whole implementation is just profoundly flawed and needs to be fixed. The 17lands data is anonymized, so I don’t know how many Mythic-Mythic games appeared from both sides, so I don’t know exactly what percentage of a Mythic’s games are against each level, but it’s something like 51% against Diamond, 29% against Mythic, 19% against Plat, and 1% against Gold and below. Clearly games vs. Diamonds need to be handled responsibly, and games vs. Golds and below don’t matter much.

A simple fix that keeps most of the system intact (which may not be the best idea, but hey, at least it’s progress) is to assign the initial Mythic MMR upon making Platinum (instead of Mythic) and to not Mythic-rate any games involving a Gold or below. You wouldn’t get leaderboard position or anything until actually making Mythic, but the rating would be there behind the scenes doing its thing and this silliness would be avoided since all the rated games would be zero-sum and all the Diamond opponents would be reasonably rated for quality after playing enough games to get out of Plat.

Constructed has the same implementation, but it’s mostly not as big a deal because outside of Alchemy, cross-rank pairing isn’t very common except at the beginning of the month, and even if top-1200 quality players are getting scammed out of points by lower ranks at the start of the month (and they may well not be), they have all the time in the world to reequilibrate their rating against a ~100% Mythic opponent lineup later. Drafters play against bunches of non-Mythics throughout. Cross-rank pairing in Alchemy ranked may be/become enough of a problem to warrant a similar fix (although likely for the opposite reason, farming lower ranks instead of losing points to them), and it’s not like assigning the initial Mythic rating upon reaching Diamond and ignoring games against lower ranks actually hurts anything there either.

Baseball’s Last Mile Problem

2022 has brought a constant barrage of players criticizing the baseballs as hard to grip and wildly inconsistent from inning to inning, and probably not coincidentally, a spike in throwing error rates to boot. “Can’t get a grip” and “throwing error” do seem like they might go together. MLB has denied any change in the manufacturing process, but there have been changes this season in how balls are handled in the stadium, and I believe that is likely to be the culprit.

I have a plausible explanation for how the new ball-handling protocol can cause even identical balls from identical humidors to turn out wildly different on the field, and it’s backed up by experiments and measurements I’ve done on several balls I have, but until those experiments can be repeated at an actual MLB facility (hint, hint), this is still just a hypothesis, albeit a pretty good one IMO.

Throwing Errors

First, to quantify the throwing errors, I used Throwing Errors + Assists as a proxy for attempted throws (it doesn’t count throws that are accurate but late, etc), and broke down TE/(TE+A) by infield position.

TE/(TE+A)	2022	2021	2011-20 max	21-22 Increase	2022 By Chance
C	9.70%	7.10%	9.19%	36.5%	1.9%
3B	3.61%	2.72%	3.16%	32.7%	0.8%
SS	2.20%	2.17%	2.21%	1.5%	46.9%
2B	1.40%	1.20%	1.36%	15.9%	20.1%

By Chance is the binomial probability of getting the 2022 rate or worse if the 2021 rate were the true rate. Not only are throwing errors per “opportunity” up over 2021, but they’re higher than every single season in the 10 years before that as well, and way higher for C and 3B. C and 3B have the least time on average to establish a grip before throwing. This would be interesting even without players complaining left and right about the grip.
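For anyone who wants to redo the By Chance column, the calculation is a one-sided binomial tail. A minimal sketch follows. The raw TE and A counts aren’t listed in this post, so the counts below are purely hypothetical placeholders; only the 2.72% baseline (the 2021 3B rate) comes from the table.

```python
from math import exp, lgamma, log


def binomial_pmf(i: int, n: int, p: float) -> float:
    """P(X = i) for X ~ Binomial(n, p), computed in log space (requires 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(p) + (n - i) * log(1.0 - p))
    return exp(log_pmf)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k): chance of seeing k or more errors in n attempts at true rate p."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# HYPOTHETICAL counts for illustration only (not the real 2022 totals).
attempts = 500         # TE + A
errors = 18            # TE
baseline_rate = 0.0272  # 2021 3B rate from the table
print(binomial_tail(errors, attempts, baseline_rate))
```

Plugging in the real 2022 TE and TE+A totals for a position, with the 2021 rate as `p`, should reproduce the corresponding By Chance entry.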

The Last Mile

To explain what I suspect is causing this, I need to break down the baseball supply chain. Baseballs are manufactured in a Rawlings factory, stored in conditions that, to the best of my knowledge, have never been made public, shipped to teams, sometimes stored again in unknown conditions outside a humidor, stored in a humidor for at least 2 weeks, and then prepared and used in a game. Borrowing the term from telecommunications and delivery logistics, we’ll call everything after the 2+ weeks in the humidor the last mile.

Humidors were in use in 9 parks last year, and Meredith Wills has found that many of the balls this year are from the same batches as balls in 2021. So we have literally some of the same balls in literally the same humidors, and there were no widespread grip complaints (or equivalent throwing error rates) in 2021. This makes it rather likely that the difference, assuming there really is one, is occurring somewhere in the last mile.

The last mile starts with a baseball that has just spent 2+ weeks in the humidor. That is long enough to equilibrate, per https://tht.fangraphs.com/the-physics-of-cheating-baseballs-humidors/, other prior published research, and my own past experiments. Getting atmospheric humidity changes to meaningfully affect the core of a baseball takes on the order of days to weeks. That means that nothing humidity-wise in the last mile has any meaningful impact on the ball’s core because there’s not enough time for that to happen.

This article from the San Francisco Chronicle details how balls are prepared for a game after being removed from the humidor, and since that’s paywalled, a rough outline is:

Removed from humidor at some point on gameday
Rubbed with mud/water to remove gloss
Reviewed by umpires
Not kept out of the humidor for more than 2 hours
Put in a security-sealed bag that’s only opened in the dugout when needed

While I don’t have 2022 game balls or official mud, I do have some 2019* balls, water, and dirt, so I decided to do some science at home. Again, while I have confidence in my experiments done with my balls and my dirt, these aren’t exactly the same things being used in MLB, so it’s possible that what I found isn’t relevant to the 2022 questions.

Update: Dr. Wills informed me that 2019, and only 2019, had a production issue that resulted in squashed leather and could have affected the mudding results. She checked my batch code, and it looks like my balls were made late enough in 2019 that they were actually used in 2020 with the non-problematic production method. Yay.

Experiments With Water

When small amounts of water are rubbed on the surface of a ball, it absorbs pretty readily (the leather and laces love water), and once the external source of water is removed, that creates a situation where the outer edge of the ball is more moisture-rich than what’s slightly further inside and more moisture-rich than the atmosphere. The water isn’t going to just stay there- it’s either going to evaporate off or start going slightly deeper into the leather as well.

As it turns out, if the baseball is rubbed with water and then stored with unrestricted air access (and no powered airflow) in the environment it was equilibrated with, the water entirely evaporates off fairly quickly with an excess-water half-life of a little over an hour (and this would likely be lower with powered air circulation) and goes right back to its pre-rub weight down to 0.01g precision. So after a few hours, assuming you only added a reasonable amount of water to the surface (I was approaching 0.75 grams added at the most) and didn’t submerge the ball in a toilet or something ridiculous, you’d never know anything had happened. These surface moisture changes are MUCH faster than the days-to-weeks timescales of core moisture changes.
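To put rough numbers on “a few hours”, here is a sketch that treats the drying as a pure exponential decay. That is an assumption (a half-life description implies it, but I only measured a handful of points), and the 70-minute half-life is the never-sealed figure quoted later in this post.

```python
import math

HALF_LIFE_MIN = 70.0  # unsealed half-life quoted in this post


def excess_water_g(initial_excess_g: float, minutes: float,
                   half_life_min: float = HALF_LIFE_MIN) -> float:
    """Excess surface water remaining after `minutes`, assuming exponential decay."""
    return initial_excess_g * 0.5 ** (minutes / half_life_min)


def minutes_until_below(initial_excess_g: float, threshold_g: float,
                        half_life_min: float = HALF_LIFE_MIN) -> float:
    """Time for the excess to decay down to `threshold_g` under the same assumption."""
    return half_life_min * math.log2(initial_excess_g / threshold_g)


for hours in (1, 2, 4):
    remaining = excess_water_g(0.75, hours * 60)
    print(f"after {hours} h: {remaining:.3f} g of the original 0.75 g excess")
```

`minutes_until_below(start, 0.01)` gives the time until the excess drops under the 0.01g resolution of my scale for a given starting amount.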

Things get much more interesting if the ball is then kept in a higher-humidity environment. I rubbed a ball down, wiped it with a paper towel, let it sit for a couple of minutes to deal with any surface droplets I missed, and then sealed the ball in a sandwich bag for 2 hours along with a battery-powered portable hygrometer. I expected the ball to completely saturate the air while losing less mass than I could measure (<0.01g) in the process, but that’s not what actually happened. The relative humidity in the bag only went up 7%, and as expected, the ball lost no measurable amount of mass. After taking it out, it started losing mass with a slightly longer half-life than before and lost all the excess water in a few hours.

I repeated the experiment except this time I sealed the ball and the hygrometer in an otherwise empty 5-gallon pail. Again, the relative humidity only went up 7%, and the ball lost 0.04 grams of mass. I calculated that 0.02g of evaporation should have been sufficient to cause that humidity change, so I’m not exactly sure what happened- maybe 0.01 was measurement error (the scale I was using goes to 0.01g), maybe my seal wasn’t perfectly airtight, maybe the crud on the lid I couldn’t clean off or the pail itself absorbed a little moisture. But the ball had 0.5g of excess water to lose (which it did completely lose after removal from the pail, as expected) and only lost 0.04g in the pail, so the basic idea is still the same.
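The 0.02g figure can be roughly reproduced from the ideal gas law. This is a sketch under stated assumptions: room temperature of 20 °C (I didn’t record it here), a US 5-gallon pail, the Magnus approximation for saturation vapor pressure, and ignoring the small volume taken up by the ball and hygrometer.

```python
import math

GAS_CONSTANT = 8.314       # J/(mol*K)
WATER_MOLAR_MASS = 18.015  # g/mol
US_GALLON_M3 = 0.003785411784


def saturation_vapor_pressure_pa(temp_c: float) -> float:
    """Magnus approximation for saturation vapor pressure over water, in Pa."""
    return 610.94 * math.exp(17.625 * temp_c / (temp_c + 243.04))


def water_mass_for_rh_change_g(rh_change: float, volume_m3: float, temp_c: float) -> float:
    """Grams of water vapor needed to raise RH by `rh_change` (0.07 means 7 points)."""
    temp_k = temp_c + 273.15
    saturation_density = (saturation_vapor_pressure_pa(temp_c) * WATER_MOLAR_MASS
                          / (GAS_CONSTANT * temp_k))  # g per m^3 at 100% RH
    return rh_change * saturation_density * volume_m3


pail_volume = 5 * US_GALLON_M3  # ASSUMED: nominal 5 US gallons, empty
needed = water_mass_for_rh_change_g(0.07, pail_volume, temp_c=20.0)  # ASSUMED 20 C
print(f"{needed:.3f} g")
```

Under those assumptions the answer lands a little above 0.02g, which is why a ball with 0.5g of excess water barely loses anything before the air in the pail stops taking more.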

This means that if the wet ball has restricted airflow, it’s going to take for freaking ever to reequilibrate (because it only takes a trivial amount of moisture loss to “saturate” a good-sized storage space), and that if it’s in a sealed environment or in a free-airflow environment more than 7% RH above what it’s equilibrated to, the excess moisture will travel inward to more of the leather instead of evaporating off (and eventually the entire ball would equilibrate to the higher-RH environment, but we’re only concerned with the high-RH environment as a temporary last-mile storage condition here, so that won’t happen on our timescales).

I also ran the experiment sealing the rubbed ball and the hygrometer in a sandwich bag overnight for 8 hours. The half-life for losing moisture after that was around 2.5 hours, up from the 70 minutes when it was never sealed. This confirms that the excess moisture doesn’t just sit around at the surface waiting if it can’t immediately evaporate, but that evaporation dominates when possible.

I also ran the experiment with a ball sealed in a sandwich bag for 2 hours along with an equilibrated cardboard divider that came with the box of balls I have. That didn’t make much difference. The cardboard only absorbed 0.04g of the ~0.5g excess moisture in that time period, and that’s with a higher cardboard:ball ratio than a box actually comes with. Equilibrated cardboard can’t substitute for free airflow on the timescale of a couple of hours.

Experiments With Mud

I mixed dirt and water to make my own mud and rubbed it in doing my best imitation of videos I could find, rubbing until the surface of the ball felt dry again. Since I don’t have any kind of instrument to measure slickness, these are my perceptions plus those of my significant other. We were in almost full agreement on every ball, and the one disagreement converged on the next measurement 30 minutes later.

If stored with unrestricted airflow in the environment it was equilibrated to, this led to roughly the following timeline:

t=0, mudded, ball surface feels dry.
t=30 minutes, ball surface feels moist and is worse than when it was first mudded.
t=60 minutes, ball surface is drier and is similar in grip to when first mudded.
t=90 minutes, ball is significantly better than when first mudded.
t=120 minutes, no noticeable change from t=90 minutes.
t=12 hours, no noticeable change from t=120 minutes.

I tested a couple of other things as well:

1. I took a 12-hour ball, put it in a 75% RH environment for an hour and then a 100% RH environment for 30 minutes, and it didn’t matter. The ball surface was still fine. The ball would certainly go to hell eventually under those conditions, but it doesn’t seem likely to be a concern with anything resembling current protocols. I also stuck one in a bag for a while and it didn’t affect the surface or change the RH at all, as expected since all of the excess moisture was already gone.
2. I mudded a ball, let it sit out for 15 minutes, and then sealed it in a sandwich bag. This ball was slippery at every time interval: 1 hour, 2 hours, and 12 hours (repeated twice). Interestingly, putting the ball back in its normal environment for over 24 hours didn’t help much and it was still quite slippery. Even with all the excess moisture gone, whatever had happened to the surface while bagged had ruined the ball.
3. I mudded a ball, let it sit out for 2 hours, at which point the surface was quite good per the timeline above, and then sealed it in a bag. THE RH WENT UP AND THE BALL TURNED SLIPPERY, WORSE THAN WHEN IT WAS FIRST MUDDED (repeated 3x). Like #2, time in the normal environment afterwards didn’t help. Keeping the ball in its proper environment for 2 hours, sealing it for an hour, and then letting it out again was enough to ruin the ball.

That’s really important IMO. We know from the water experiments that it takes more than 2 hours to lose the excess moisture under my storage conditions, and it looks like the combination of fresh(ish) mud plus excess surface moisture that can’t evaporate off is a really bad combo and a recipe for slippery balls. Ball surfaces can feel perfectly good and game-ready while they still have some excess moisture left and then go to complete shit, apparently permanently, in under an hour if the evaporation isn’t allowed to finish.

Could this be the cause of the throwing errors and reported grip problems? Well…

2022 Last-Mile Protocol Changes

The first change for 2022 is that balls must be rubbed with mud on gameday, meaning they’re always taking on that surface moisture on gameday. In 2021, balls had to be mudded at least 24 hours in advance of the game, and while 2021 changed the window to 1-2 days in advance, the window used to be up to 5 days in advance of the game. I don’t know how far in advance they were regularly mudded before 2021, but even early afternoon for a night game would be fine assuming the afternoon storage had reasonable airflow.

The second change is that they’re put back in the humidor fairly quickly after being mudded and allowed a maximum of 2 hours out of the humidor. While I don’t think there’s anything inherently wrong with putting the balls back in the humidor after mudding (unless it’s something specific to 2022 balls), humidors look something like this. If the balls are kept in a closed box, or an open box with another box right on top of them, there’s little chance that they reequilibrate in time. If they’re kept in an open box on a middle shelf without much room above, unless the air is really whipping around in there, the excess moisture half-life should increase.

There’s also a chance that something could go wrong if the balls are taken out of the humidor, kept in a wildly different environment for an hour, then mudded and put back in the humidor, but I haven’t investigated that, and there are many possible combinations of both humidity and temperature that would need to be checked for problems.

The third change (at least I think it’s a change) is that the balls are kept in a sealed bag- at least highly restricted flow, possibly almost airtight- until opened in the dugout. Even if it’s not a change, it’s still extremely relevant- sealing balls that have evaporated their excess moisture off doesn’t affect anything, while sealing balls that haven’t finished evaporating off seems to be a disaster.

Conclusion

Mudding adds excess moisture to the surface of the ball, and if its evaporation is prevented for very long- either through restricted airflow or storage in too humid an environment- the surface of the ball becomes much more slippery and stays that way even if evaporation continues later. It takes hours- dependent on various parameters- for that moisture to evaporate off, and 2022 protocol changes make it much more likely that the balls don’t get enough time to evaporate off, causing them to fall victim to that slipperiness. In particular, balls can feel perfectly good and ready while they still have some excess surface moisture and then quickly go to hell if the remaining evaporation is prevented inside the security-sealed bag.

It looks to me like MLB had a potential problem- substantial latent excess surface moisture being unable to evaporate and causing slipperiness- that prior to 2022 it was avoiding completely by chance or by following old procedures derived from lost knowledge. In an attempt to standardize procedures, MLB accidentally made the excess surface moisture problem a reality, and not only that, did it in a way where the amount of excess surface moisture was highly variable.

The excess surface moisture when a ball gets to a pitcher depends on the amount of moisture initially absorbed, the airflow and humidity of the last-mile storage environment, and the amount of time spent in those environments and in the sealed bag. None of those are standardized parts of the protocol, and it’s easy to see how there would be wide variability ball-to-ball and game-to-game.

Assuming this is actually what’s happening, the fix is fairly easy. Balls need to be mudded far enough in advance and stored afterwards in a way that they get sufficient airflow for long enough to reequilibrate (the exact minimum times depending on measurements done in real MLB facilities), but as an easy interim fix, going back to mudding the day before the game and leaving those balls in an open uncovered box in the humidor overnight should be more than sufficient. (and again, checking that on-site is pretty easy)

Notes

I found (or didn’t find) some other things that I may as well list here as well along with some comments.

1. These surface moisture changes don’t change the circumference of the baseball at all, down to 0.5mm precision, even after 8 hours.
2. I took a ball that had stayed moisturized for 2 hours and put a 5-pound weight on it for an hour. There was no visible distortion and the circumference was exactly the same as before along both seam axes (I oriented the pressure along one seam axis and perpendicular to the other). To whatever extent flat-spotting is happening or happening more this season, I don’t see how it can be a last-mile cause, at least with my balls. Dr. Wills has mentioned that the new balls seem uniquely bad at flat-spotting, so it’s not completely impossible that a moist new ball at the bottom of a bucket could deform under the weight, but I’d still be pretty surprised.
3. The ball feels squishier to me after being/staying moisturized, and free pieces of leather from dissected balls are indisputably much squishier when equilibrated to higher humidity, but “feels squishier” isn’t a quantified measurement or an assessment of in-game impact. The squishy-ball complaints may also be another symptom of unfinished evaporation.
4. I have no idea if the surface squishiness in #3 affects the COR of the ball to a measurable degree.
5. I have no idea if the excess moisture results in an increased drag coefficient. We’re talking about changes to the surface, and my prior dissected-ball experiments showed that the laces love water and expand from it, so it’s at least in the realm of possibility.
6. For the third time, this is a hypothesis. I think it’s definitely one worth investigating since it’s supported by physical evidence, lines up with the protocol changes this year, and is easy enough to check with access to actual MLB facilities. I’m confident in my findings as reported, but since I’m not using current balls or official mud, this mechanism could also turn out to have absolutely nothing to do with the 2022 game.

The 2022 MLB baseball

As of this writing on 4/25/2022, HRs are down, damage and distance on barrels are down, and both Alan Nathan and Rob Arthur have observed that the drag coefficient of baseballs this year is substantially increased. This has led to speculation about what has changed with the 2022 balls and even what production batch of balls or mixture of batches may be in use this year. Given the kerfuffle last year that resulted in MLB finally confirming that a mix of 2020 and 2021 balls was used during the season, that speculation is certainly reasonable.

It may well also turn out to be correct, and changes in the 2022 ball manufacture could certainly explain the current stats, but I think it’s worth noting that everything we’ve seen so far is ALSO consistent with “absolutely nothing changed with regard to ball manufacture/end product between 2021 and 2022” and “all or a vast majority of balls being used are from 2021 or 2022”.

How is that possible? Well, the 2021 baseball production was changed on purpose. The new baseball was lighter, less dense, and less bouncy by design, or in more scientific terms, “dead”. What if all we’re seeing now is the 2021 baseball specifications in their true glory, now untainted by the 2020 live balls that were mixed in last year?

Even without any change to the surface of the baseball, a lighter, less dense ball won’t carry as far. The drag force is independent of the mass (for a given size, which changed less if at all), and F=ma, so the same force acting on a lower mass means a higher drag deceleration and less carry.

The aforementioned measurements of the drag coefficient from Statcast data also *don’t measure the drag coefficient*. They measure the drag *acceleration* and use an average baseball mass value to convert to the drag force (which is then used to get the drag coefficient). If they’re using the same average mass for a now-lighter ball, they’re overestimating the drag force and the drag coefficient, and the drag coefficient may literally not have changed at all (while the drag acceleration did go up, per the previous paragraph).
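A sketch of how the mass assumption leaks into the reported number, using the standard drag relation Cd = 2ma / (ρ A v²). Every input below is a hypothetical placeholder (I don’t know what mass, air density, or ball radius the published analyses plug in), so only the ratio at the end means anything.

```python
import math


def inferred_drag_coefficient(assumed_mass_kg: float, drag_accel_ms2: float,
                              speed_ms: float, air_density: float = 1.2,
                              radius_m: float = 0.037) -> float:
    """Cd = 2 m a / (rho A v^2), using whatever ball mass the analyst assumes."""
    area = math.pi * radius_m ** 2
    return 2.0 * assumed_mass_kg * drag_accel_ms2 / (air_density * area * speed_ms ** 2)


# HYPOTHETICAL inputs, for illustration only.
TRUE_MASS_KG = 0.142     # a lighter ball actually in play
ASSUMED_MASS_KG = 0.145  # older average mass still used in the conversion
DRAG_ACCEL = 10.0        # measured drag deceleration, m/s^2
SPEED = 42.0             # ball speed, m/s

cd_true = inferred_drag_coefficient(TRUE_MASS_KG, DRAG_ACCEL, SPEED)
cd_reported = inferred_drag_coefficient(ASSUMED_MASS_KG, DRAG_ACCEL, SPEED)
print(f"Cd overestimated by {cd_reported / cd_true - 1:.1%}")
```

The overestimate is exactly the ratio of assumed mass to true mass, regardless of the other inputs, so a stale mass value inflates every reported drag coefficient by the same percentage.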

Furthermore, I looked at pitchers who threw at least 50 four-seam fastballs last year after July 1, 2021 (after the sticky stuff crackdown) and have also thrown at least 50 FFs in 2022. This group is, on average, -0.35 MPH and +0.175 RPM on their pitches. These stats usually move in the same direction, and a 1 MPH increase “should” increase spin by about 20 RPM. So the group should have lost around 7 RPM from decreased velocity and actually wound up slightly positive instead. It’s possible that the current baseball is just easier to spin based on surface characteristics, but it’s also possible that it’s easier to spin because it’s lighter and has less rotational inertia. None of this is proof, and until we have results from experiments on actual game balls in the wild, we won’t have a great idea of the what or the why behind the drag acceleration being up.

(It’s not (just) the humidor- drag acceleration is up even in parks that already had a humidor, and in places where a new humidor would add some mass, making the ball heavier is the exact opposite of what’s needed to match drag observations, although being in the humidor could have other effects as well)

New MTG:A Event Payouts

With the new Premier Play announcement, we also got two new constructed event payouts and a slightly reworked Traditional Draft.

This is the analysis of the EV of those events for various winrates (game winrate for Bo1 and match winrate for Bo3). I give the expected gem return, the expected number of packs, the expected number of play points, and two ROIs, one counting packs as 200 gems (store price), the other counting packs as 22.5 gems (if you have all the cards). These are ROIs for gem entry. For gold entry, multiply by whatever you’d otherwise use gold for. If you’d buy packs, then multiply by 3/4. If you’d otherwise draft, then the constructed event entries are the same exchange rate.
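For reference, the two ROI columns are computed like this. It’s a minimal sketch with the pack valuations described above, and the example numbers are the 50% Traditional Draft row from the table below.

```python
STORE_PACK_VALUE = 200.0     # gems per pack at store price
COLLECTED_PACK_VALUE = 22.5  # gems per pack if you already have all the cards


def roi(expected_gems: float, expected_packs: float,
        entry_gems: float, pack_value: float) -> float:
    """Expected return (gems plus packs valued in gems) divided by the gem entry fee."""
    return (expected_gems + expected_packs * pack_value) / entry_gems


# 50% Traditional Draft row: 794 gems and 2.38 packs on a 1500 gem entry.
print(f"ROI (200):  {roi(794, 2.38, 1500, STORE_PACK_VALUE):.1%}")
print(f"ROI (22.5): {roi(794, 2.38, 1500, COLLECTED_PACK_VALUE):.1%}")
```

Because the gem and pack columns in the tables are rounded, recomputing an ROI from a table row can differ from the listed value by a few tenths of a percent.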

Traditional Draft (1500 gem entry)

Winrate	Gems	Packs	Points	ROI (200)	ROI (22.5)
0.4	578	1.9	0.13	63.8%	41.4%
0.45	681	2.13	0.18	73.7%	48.6%
0.5	794	2.38	0.25	84.6%	56.5%
0.55	917	2.65	0.33	96.4%	65.1%
0.6	1050	2.94	0.43	109.3%	74.4%
0.65	1194	3.26	0.55	123.1%	84.5%
0.7	1348	3.6	0.69	137.9%	95.3%
0.75	1513	3.95	0.84	153.6%	106.8%

THIS DOES NOT INCLUDE ANY VALUE FROM THE CARDS TAKEN DURING THE DRAFT, which, if you value the cards, is a bit under 3 packs on average (no WC progress).

Bo1 Constructed Event (375 gem entry)

Winrate	Gems	Packs	Points	ROI (200)	ROI (22.5)
0.4	129	0.65	0.03	68.7%	38.2%
0.45	160	0.81	0.05	85.9%	47.4%
0.5	195	1	0.09	105.6%	58.1%
0.55	235	1.22	0.15	128.0%	70.0%
0.6	278	1.47	0.23	152.7%	83.0%
0.65	323	1.74	0.34	179.0%	96.5%
0.7	367	2.03	0.46	205.9%	109.9%
0.75	408	2.31	0.6	231.8%	122.7%

Bo3 Constructed Event (750 gem entry)

Winrate	Gems	Packs	Points	ROI (200)	ROI (22.5)
0.4	292	1.67	0.04	83.5%	43.9%
0.45	348	1.76	0.07	93.3%	51.6%
0.5	408	1.84	0.125	103.5%	59.9%
0.55	471	1.92	0.2	113.9%	68.5%
0.6	535	1.99	0.31	124.4%	77.3%
0.65	600	2.06	0.464	135.0%	86.2%
0.7	664	2.14	0.67	145.6%	95.0%
0.75	727	2.22	0.95	156.1%	103.5%

LED Lifetime Claims and the Questionable Standards Behind Them

This is a post somewhat outside my areas of expertise. I have modeled kinetics in the past, and the math here isn’t particularly complicated, so I’m not too worried there. I’ve looked at test results covering tens of millions of hours. My biggest concern is that, because I’m not a materials scientist or otherwise industry-adjacent, I simply haven’t encountered some relevant data or information due to paywalling and/or Google obscurity. Hopefully not, but also not betting my life on that. Continue reading accordingly.

As a quick digression, in case I’m getting any “the government uses these standards, so they must be accurate” readers, let’s look at Artificial Sweetener labeling. Despite a cup of Splenda having more calories than a cup of blueberries, despite being ~90%+ rapidly-and-easily-digestible carbohydrates, and despite having ~90% of the caloric content of regular table sugar gram-for-gram, it’s legally allowed, via multiple layers of complete bullshit, to advertise Splenda as a “zero calorie food” and to put “0 calories” on the nutrition label on a bag/box/packet. This isn’t due to the FDA not knowing that maltodextrin has calories or any other such form of ignorance. It’s the FDA literally creating a known-to-be-bullshit standard to create a known-to-be-bullshit Official Number for the sole purpose of allowing decades of deliberately deceptive marketing. Official Numbers created from Official Procedures are simply Official Numbers and need bear no resemblance to the actual numbers, and this situation can persist for decades even when the actual numbers are known beyond all doubt to be very different. That’s a good thing to remember in many contexts, but it also applies here.

Ok, so, back to LED testing. LED light sources can (legitimately) last for tens of thousands of hours, sometimes over 50,000 hours. Given that there aren’t even 9000 hours in a year, doing a full lifetime test would take over 5 years, and waiting to sell products until the tests are complete would mean only selling ancient products. In a fast-moving field with improving design technology, this doesn’t do anybody any good, hence the desire to do partial testing to extrapolate a useful lifetime.

This is a fine idea in theory, and led to the LM-80 and TM-21 standards. LM-80 is a measurement standard, and for our purposes here, it basically says “run the bulb constantly under some specified set of conditions and measure the brightness of the bulb for at least 6000 hours total and at intervals no greater than every 1000 hours”. Sure, fine for what it is.

TM-21 is a calculation standard that uses the LM-80 data. It says, effectively, “take the last 5000 hours of data, or the second half of the test run, whichever is longer, and fit to an exponential decay curve. Extrapolate forward based on that”. This is where the problems start.

Light sources can fail in multiple ways, either through “catastrophic” complete failure, like when the filament on an incandescent bulb finally snaps and the bulb instantly transitions from great to totally useless, or by gradually dimming over time until they’re not bright enough anymore. LEDs themselves generally fail by the latter mechanism, and the failure threshold is defined to be 70% of initial brightness (based on research that I haven’t examined that shows that people are generally oblivious to dimming until brightness drops below that level). So, roughly, fit the LM-80 brightness data to an exponential decay curve, see how long it takes that curve to decay to 70% brightness, and call that the lifetime (with another caveat not to claim lifetimes tooooo far out, but that part isn’t relevant yet). Numbers besides 70% are also reported sometimes, but the math/extrapolation standard is the same so I won’t worry about that.
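Here is a rough sketch of that fit-and-extrapolate procedure as I’ve described it: pick the data window, do a least-squares fit of ln(brightness) against time, and solve for the 70% crossing. It follows my paraphrase of the standard rather than the official text, the function names are mine, and the data is synthetic (generated from a made-up decay rate), so don’t read anything into the output beyond the mechanics.

```python
import math


def tm21_window(hours, values):
    """Keep the last 5000 hours or the second half of the run, whichever is longer."""
    total = max(hours)
    start = min(total - 5000.0, total / 2.0)
    return [(t, v) for t, v in zip(hours, values) if t >= start]


def fit_exponential(points):
    """Least-squares fit of ln(phi) = ln(B) - alpha * t. Returns (B, alpha)."""
    ts = [t for t, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_t = sum(ts) / len(ts)
    mean_y = sum(ys) / len(ys)
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_t), -slope


def hours_to_threshold(b: float, alpha: float, threshold: float = 0.70) -> float:
    """Time at which B * exp(-alpha * t) falls to `threshold` of initial brightness."""
    return math.log(b / threshold) / alpha


# SYNTHETIC 6000-hour run sampled every 1000 hours (made-up decay rate).
TRUE_ALPHA = 4e-6
hours = [1000.0 * i for i in range(7)]
brightness = [math.exp(-TRUE_ALPHA * t) for t in hours]

b, alpha = fit_exponential(tm21_window(hours, brightness))
print(f"projected hours to 70%: {hours_to_threshold(b, alpha):,.0f}")
```

Even with perfectly clean synthetic data, the projection runs many multiples of the test duration past the last measurement, which is the part of the procedure that the assumptions below are about.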

Using LM-80 + TM-21 lifetime estimates relies on several important assumptions that can all fail:

1. Long-term lumen maintenance (beyond the test period) follows an exponential decay curve down to 70%.
2. Lumen maintenance inside the test period, what goes into the TM-21 calculations, also follows an exponential decay curve with the same parameters as #1.
3. LM-80 data collection is sufficient, in both precision and accuracy, to allow proper curve fitting (if #2 is true).
4. Test conditions are representative of real-life conditions (or the difference is irrelevant).
5. Other failure modes don’t much matter for the true lifetime.

In my opinion, ALL of these assumptions fail pretty clearly and substantially. Leaving #1 until last is burying the lede a bit, but it’s also the messiest of the objections due to an apparent shortage of quality data.

Going in reverse order, #5 can be a pretty big problem. First, look at this diagram of an LED bulb. The LM-80/TM-21 standard ONLY REFERS TO THE LED, NOT TO THE BULB AS A WHOLE. The end consumer only cares about the bulb as a whole. There’s an analogous whole-bulb/whole-product standard (LM-84/TM-28), but it’s mostly optional and seems to be performed much less often/reported publicly much less. For example, ENERGY STAR(r) certification can be obtained with only LM-80/TM-21 results. It is estimated (here here and other places) that only 10% of full-product failures are from the LED. These products are only as good as their weakest link, so good LEDs can be packaged with dogshit secondary components to produce a dogshit overall product, and that overall product can easily get certifications and claim an absurdly high lifetime just based on the LED. Trying to give an exact overall failure rate is pointless with so many different brands/designs/conditions, but simply googling LED failure rate should make it abundantly clear that there are problems here, and I have a link later where around 30% of products failed inside 12,000 hours.

For some applications, like normal screw-in bulbs, it really doesn’t matter that much. Any known brand is still (likely) the best product and the replacement costs are pretty trivial- buy a new bulb and screw it in. However, for other applications… there are scores of new products coming out of the Integrated LED variety. These, uh, integrate the LED with the entire fixture and often require complete replacement upon failure, or if you’re luckier, sourcing special spare parts and possibly working with wiring or hiring an electrician. In these cases, it’s *extremely* important to have a good idea of the true lifetime of the product, and good luck with that in general. I’ve looked at a few residential and commercial products and none of the lifetime claims give any indication of how they’re derived. One call to an electrician when an integrated LED fails completely obliterates any possible longer-lifetime-based savings, so if you can’t do the replacement yourself, it’s probably wise to stick to conventional fixtures and separate LED bulbs.

Even if LM-84/TM-28 replaced LM-80/TM-21, there would still be problems, and the remaining points apply (or would apply) to both. Continuing with point #4- testing conditions not matching real-life conditions. While this is clearly true- test conditions have an isolated LED running 24/7 at a perfectly controlled constant temperature, humidity, and current, while real life is the opposite of that in every way, the question is how much it matters.

Fortunately, I found this document after writing the part above for problem #5, and I came to the same conclusions as the author did in their section 2.4.2, and I may as well save time and just refer to their 2.4.3 for point #4. TL;DR operating conditions matter. On-off cycling (when the system has at least an hour to cool down to ambient temperature before heating back up) causes important amounts of thermal stress.

On to point #3- the accuracy and precision of the gathered data. I don’t know if this is because testing setups aren’t good/consistent enough, or if LEDs are just a bit weird in a way that makes the collection of nice data impossible or highly unlikely, but the data is a bit of a mess. As a reminder, the idea is to take the final 5000 hours (or second half of the data set, if that’s longer) and fit it to an exponential decay curve. It’s already a bit of a tough ask to parameterize such a curve with much confidence over a small luminosity change, but there’s another problem- the data can be noisy as hell and sometimes outright nonsense. This dovetails with #2, where the data actually need to inspire confidence that they’re generated by an exponential decay process, and it’s easier to look at graphs one time with both of those considerations in mind, as well as one more. Any test of >6000h is acceptable, so if this procedure is valid, it should give (roughly) consistent estimates for every possible testing window depending on when the test could have been stopped. If this all works, 1000-6000h should give similar estimates to 5000-10000. A 10000h test isn’t just a test of that LED- it allows a test of the entire LM-80/TM-21 framework. And that framework doesn’t look so hot IMO.

For instance, look at the graphs on page 8/10/12 in this report. Three of the curve fits (52,54,58) came out to exponential GROWTH of luminosity. Others are noisy as hell and nowhere near the trend line. The fit of the dark blue dots on each graph is wildly different from 2000-7000 vs 5000-10000. This has one exponential growth fit and assorted noise. Pages 33-34 would obviously have no business including the 1000h data point. This aggregate I put together of 8 Cree tests (80 total bulbs) has the same general problem. Here are the individual runs (10 bulbs for each line). There are more like these. I rest my case.

Even the obvious nonsense exponential growth projections are considered “successful” tests that allow a long lifetime to be claimed. Standard errors of individual data points are, in aggregate, above 10% of the difference between adjacent data points (and much worse sometimes in individual runs of course). It has been known forever that LEDs can behave differently early in their lifetime (for various materials science reasons that I haven’t looked into), which is why the first 1000h were always excluded from the TM-21 standard, but it’s pretty clear to me that 1000h is not a sufficient exclusion period.

Last, and certainly not least, is the question of whether or not the brightness decay to 70% even is exponential, or for that matter, if it’s accelerating, decelerating, or roughly linear over time once it gets past the initial 1000h+ nonsense period. Exponential would just be a specific class of decelerating decay. There appears to be an absolutely incredible lack of published data on this subject. The explanation for the choice of exponential decay in the TM-21 standard is given starting on page 17 here, and holy hell does that not inspire any confidence at all.

I’ve only found a few sources with data that goes past 15,000 hours, all listed here. There’s probably something else out there that I haven’t been able to find, but the literature does not appear to be full of long tests. This is a government L-Prize experiment, and figure 7.1 on page 42 is clearly concave down (accelerating decay), although the LED is likely going to last a lot longer regardless. Furthermore, the clear winner of the “L-Prize” should be the person responsible for the Y-axis choice. This paywalled article goes out to 20,000h. I don’t want to post the exact figure from their paper, but it’s basically this and clearly concave down. This, also paywalled, goes 60,000 hours(!). They tested 10 groups of 14 bulbs, and all 10 groups were clearly concave down over the 15,000-60,000h period.

And if you parsed the abstract of that last paper quickly, you might have realized that in spite of the obvious acceleration, they fit the curve to.. a decelerating decay model! Like the Cree tests from a few paragraphs ago, this is what it looks like if you choose to include data from the initial nonsense period. But for the purposes of projecting future life, we don’t care what the decay shape was at the beginning. We care what the decay looks like from the test window down to 70%. If the model is exponential (decelerating) decay and reality is linear, then it’s still clear that most of these LEDs should last a really long time (instead of a really, really long time). If degradation accelerates, and accelerates fast enough, you can get things like this (pages 12-16) where an LED (or full bulb in this case) looks fine and then falls off a cliff far short of its projection.
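A quick way to see how badly an exponential projection can miss when the real decay accelerates is to fit one to synthetic concave-down data. Everything in this sketch is invented for illustration: the quadratic decay shape, its coefficient, and the 5000-10000h window are assumptions, not measurements from any of the papers above.

```python
import math

def fit_log_linear(points):
    """Least-squares line through (t, ln y); returns (intercept, slope)."""
    n = len(points)
    ts = [t for t, _ in points]
    ls = [math.log(y) for _, y in points]
    mt, ml = sum(ts) / n, sum(ls) / n
    sxx = sum((t - mt) ** 2 for t in ts)
    sxy = sum((t - mt) * (l - ml) for t, l in zip(ts, ls))
    slope = sxy / sxx
    return ml - slope * mt, slope

def accelerating(t):
    # Made-up concave-down decay: down 3% at 10,000 h and speeding up from there.
    return 1.0 - 3e-10 * t * t

# Fit an exponential to the 5000-10000 h window only, as a short test would.
window = [(t, accelerating(t)) for t in range(5000, 10001, 1000)]
intercept, slope = fit_log_linear(window)

# Where the fitted exponential crosses 70% vs. where the made-up curve really does.
projected_l70 = (intercept - math.log(0.70)) / -slope
true_l70 = math.sqrt(0.30 / 3e-10)
print(round(projected_l70), round(true_l70))
```

With these invented numbers the exponential projection comes out at more than double the time the curve actually takes to reach 70%, even though the fit inside the test window looks perfectly reasonable.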

Unlike the case with Splenda, where everybody relevant knows not only that the Official Number is complete bullshit, but also what the true number is, it’s not clear to me at all that people recognize how incorrect the TM-21/28 results/underlying model can be. I feel confident that they know full-bulb results are not the same as LED results, but there are far, far too many documents like this DOE-commissioned report that uncritically treat the TM-21 result as the truth, and like this project whose stated goal was to see if LED product lifetime claims were being met, but just wound up doing a short test, sticking the numbers in TM-28, and declaring that the LED products that survived passed without ever actually testing any product to even half of a claimed lifetime, sometimes not even to 25% of claimed lifetime. There appears to be no particular effort anymore to even try to ascertain the true numbers (or the approximate veracity of the Official Numbers). This is unfortunate.

There is another family of tests that can reveal information about performance degradation without attempting to quantify lifetime, and that’s the accelerated degradation test. Parts or products are tested in unfriendly conditions (temperature/temperature variability, humidity, high/variable current, etc) to see what degrades and how quickly. Even the power cycling test from just above would be a weak member of this family because the cycling was designed to induce repeated heat-up-cool-down thermal stress that doesn’t exist in the normal continuous operation test. Obviously it’s not good at all if a bulb craps itself relative to projections as a result of being under the horrifically unrealistic stress of being turned on and off a couple of times a day.

These tests can also shed light on whether or not decay is decelerating, constant, or accelerating. This picture is a bit more mixed. Although there are some results, like the one in the last paragraph and like the last figure here that are obviously hugely accelerating, there are others that aren’t. I tried to get a large sample of accelerated degradation tests using the following criteria:

Searching Google Scholar for “accelerated degradation” led testing luminosity in 4-year chunks (2019-2022, 2015-2018, etc. The earliest usable paper was 2007)
Taking any English-language result from the first 5 pages that I could access that was an actual physical accelerated degradation test that applied a constant stress to a LED or LED product (there were some that ratcheted up stress over time, OLEDs not included, etc)
Tested all or almost all samples to at least the 30% degradation cutoff and plotted actual luminosity vs. time data on a linear scale (so I could assess curvature without requesting raw data)
Didn’t have some reason that I couldn’t use the data (e.g. two experiments nuked the LEDs so hard that the first data point after t=0 was already below 70% of initial luminosity. Can’t get curvature there)

I assessed curvature in the 90%-70% range (I don’t care if things decelerate after they’re functionally dead, as the power-cycling bulbs did, and that seems to be common behavior) and came up with:

3 clearly accelerating (plus the power-cycling test above, for 4), 4 linear-ish, and 2 clearly decelerating. One of the decelerating results was from 2007 and nuked two of the samples in under 3 hours, but it fit my inclusion criteria, so it’s counted even though I personally wouldn’t put any weight on it. So, besides the amazing result that a couple hundred google results only produced 9 usable papers (and maybe 3 that would have been ok if they weren’t plotted on log scale), there’s no evidence here that a decelerating decay model is the right choice for what we’re trying to model.

It looks to me like the graph above is a common degradation pattern. Instead of fitting the region to predict, which is hard because who even knows what functional form it is and when to transition from initial nonsense to that form, people/TM-21/TM-28 either use too little data at the beginning and decide it’s exponential, or include too much data after the bulb is quite dead and also decide it’s exponential. There’s not much evidence that I can find to support that the region to predict is actually exponential, and there’s plenty of evidence that I’ve found to contradict that.

There needs to be a concerted data-gathering effort, not just for the sake of this technology, but as a framework for future technologies. It should be mandatory that whole-product tests are run, and run under reasonable operating conditions (e.g. being turned off and on a couple of times a day) and under reasonable operating stresses- humidity changes, imperfect power supply, etc. before any certification is given, and no lifetime claim should be allowed that isn’t based on such a whole-product test. There should be a separate certification for outdoor use that requires testing under wide ranges of ambient temperatures and up to 100% humidity (or for other special environments, testing under near-worst-case versions of those conditions). Working out the exact details of how to run accelerated degradation tests and correlate them to reasonable lifetime estimates isn’t a trivial problem, but if we’d spent the last 15-20 years running more useful tests along the way, we would have solved it long ago. If we’re handing out grant money, it can go to far worse uses than this. If we’re handing out long-life certifications to big manufacturers, they can pay for some long-term validation studies as a condition. This is true for almost any new technology that looks promising, and not already having such an automatic framework in place for them a long time ago is quite the oversight.

One last rating update

Summary of everything I know about the constructed rating system first. (Edit 6/16/22: Mythic Limited appears to be exactly the same) Details of newly-discovered things below that.

1. Bo1 K in closely matched Mythic-Mythic matches is 20.37. The “K” for Mythic vs. non-Mythic matches is 20.41, and 20.52 for capped Mythic vs. Mythic matches (see #2). These are definitely three different numbers. Go figure.
2. The minimum MMR change for a Bo1 win/loss is +/-5 and the maximum change is +/-15.5
3. All Bo3 K values/minimum changes are exactly double the Bo1 K values.
4. Every number is in the uncanny valley of being very close to a nice round number but not actually being a nice round number (e.g. the 13 below is 13.018).
5. Every match between a Mythic and a Diamond/Plat (and probably true for Gold and lower as well) is rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1 +7.4 points for a win and ~13 points for a loss (double for Bo3)
6. As of November 2021, all draws are +/- 0 MMR
7. Glicko-ness isn’t detectable at Mythic. The precalculated/capped rating changes don’t vary at all based on number of games played, and controlled “competitive” Mythic matches run at exactly the same K at different times/on different accounts.
8. Mythic vs. Mythic matches are zero-sum
9. MMR doesn’t drift over time
10. MMR when entering Mythic is pretty rangebound regardless of how high or low it “should” be. It’s capped on the low end at what I think is 1400 (~78%) and capped on the high end at 1650 (~92%). Pre-December, this range was 1485-1650. January #1500 MMR was 1782.
11. Having an atrocious Mythic MMR in one month gets you easier matchmaking the next month and influences where you rank into Mythic. [Edit: via the Serious Rating, read more here]
12. Conceding during sideboarding in a Bo3 rates the game at its current match score with the Bo1 K value. (concede up 1-0, it’s like you won a Bo1 EVEN THOUGH YOU CONCEDED THE MATCH. Concede 1-1 and it’s a 0-change draw. Concede down 0-1 and lose half the rating points). This is absolutely batshit insane behavior. (edit: finally fixed as of 3/17/2022)
13. There are other implementation issues. If WotC is going to ever take organized play seriously again with Arena as a part, or if Arena ever wants to take itself seriously at all, somebody should talk to me.
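The Mythic vs. Mythic part of that summary (the K value, the min/max clamps, Bo3 doubling, zero-sum) can be read as something like the sketch below. The 400-point logistic curve is the textbook Elo form and is an assumption here, not something measured on Arena, and the function names are made up. The slightly different K values for capped matches are ignored.

```python
K_BO1 = 20.37       # closely matched Mythic-Mythic K from the summary
MIN_CHANGE = 5.0    # smallest Bo1 win/loss change
MAX_CHANGE = 15.5   # largest Bo1 win/loss change

def expected_score(rating, opp_rating, scale=400.0):
    # Assumption: standard Elo logistic curve with a 400-point scale.
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / scale))

def mythic_vs_mythic_win(winner, loser, bo3=False):
    """Points the winner gains; the loser drops the same amount (zero-sum)."""
    raw = K_BO1 * (1.0 - expected_score(winner, loser))
    change = min(MAX_CHANGE, max(MIN_CHANGE, raw))
    # Bo3 values are described as exactly double the Bo1 values.
    return 2 * change if bo3 else change

print(mythic_vs_mythic_win(1650, 1650))   # evenly matched: half of K
print(mythic_vs_mythic_win(2000, 1200))   # heavy favorite wins: clamped to the minimum
print(mythic_vs_mythic_win(1200, 2000))   # big upset: clamped to the maximum
```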

MMR for Mythic vs. Non-Mythic

Every match between a Mythic and a Diamond/Plat (and probably Gold and lower as well) was rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1 +7.4 points for a win, ~13 points for a loss, and -5.6 points for a draw**(??!??!). Bo3 values were exactly double that for win/loss.

That’s not something I was expecting to see. Mythic MMR is not a separate thing that’s preserved in any meaningful way between months, so I have no idea what’s going on underneath the hood that led to that being the choice of implementations. The draw value was obviously bugged- it’s *exactly* double the loss it should be, so somebody evidently used W+L instead of (W+L)/2. **The 11/11/2021 update, instead of fixing the root cause of the instant-draw bug, changed all draws to +/-0 MMR and fixed the -5.6 point bug by accident.
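The arithmetic behind the draw bug is short enough to check directly. This sketch uses the rounded Bo1 values from the text (+7.4 and -13); the variable names are just for illustration, and the breakeven line is the win rate a Mythic would need against non-Mythics for these fixed changes to net out to zero.

```python
WIN, LOSS = 7.4, -13.0   # rounded Bo1 Mythic vs. non-Mythic changes from the text

buggy_draw = WIN + LOSS            # matches the observed -5.6
intended_draw = (WIN + LOSS) / 2   # midpoint of a win and a loss, half the observed value

# Win rate p where p * WIN + (1 - p) * LOSS == 0
breakeven = -LOSS / (WIN - LOSS)

print(round(buggy_draw, 1), round(intended_draw, 1), round(breakeven, 3))
```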

That led to me slightly overestimating Bo3 K before- I got paired against more Diamonds in Bo3 than Bo1 and lost more points- and to my general confusion about how early-season MMR worked. It now looks like it’s the exact same system the whole month, just with more matches against non-Mythics thrown in early on and rated strangely.

Ranking into Mythic

Edit: Read this

Before it got changed in December, everybody ranked into Mythic between 1650 and 1485 MMR, and it took concerted effort to be anywhere near the bottom of that range. That’s roughly, most months, an MMR that would finish at ~83-93% if no more games were played, and in reality for anybody playing normally 88-93%. The end-of-month #1500 MMR that I’ve observed was in the 1800s before November, ~1797 in November, ~1770 in December, and ~1780 in January. So no matter how well you do pre-Mythic, you have work to do (in effect, grinding up from 92-93%) to finish top-1500.

In December, the floor got reduced to what appears to be ~1400, although the exact value is unattainable because you have to end on a winning streak to get out of Diamond. An account with atrocious MMR in the previous month that also conceded a metric shitton of games at Diamond 4 ranked in at ~1410 under the new system. The ceiling is still 1650. These lower initial Mythic MMRs are almost certainly a big part of the end-of-season #1500 MMR dropping by 20-30 points.

Season Reset

Edit: Read this, section is obsolete now.

MMR is preserved, at least to some degree, and there isn’t a clearly separate Mythic and non-Mythic MMR. I’d done some work to quantify this, but then things changed in December, and I’m unlikely to repeat it. What has changed is that previous-season Mythic MMR seems to have a bigger impact now. I had an account with a trash MMR in January make Mythic in February without losing a single game (finally!), and it still only ranked in at 1481. It would have been near or above 1600 under the old system, and now it’s ranking in below the previous system’s floor.

I hope everybody enjoyed this iteration, and if you’re chasing #1 Mythic this month or in the future, it’s not me. Or if it is me, I won’t be #1 and above 1650.

A comprehensive list of non-algorithmic pairings in RLCS 2021-22 Fall Split

Don’t read this post by accident. It’s just a data dump.

This is a list of every Swiss scoregroup in RLCS 2021-22 Fall Split that doesn’t follow the “highest seed vs lowest possible seed without rematches” pairing algorithm.

I’ll grade every round as correct, WTFBBQ (the basic “rank teams, highest-lowest” would have worked, but we got something else, or we got literally impossible pairing like two teams who started 1-0 playing in R3 mid, or the pairings are otherwise utter nonsense), pairwise swap fail (e.g. if 1 has played 8, swapping to 1-7 and 2-8 would work, but we got something worse), and complicated fails for things that aren’t resolved with one or more independent pairwise swaps.

Everything is hand-checked through Liquipedia.

There’s one common pairwise error that keeps popping up: 6 teams where the #2 and #5 seeds (referred to as the 7-10 and 10-13 error a few times) have already played. Instead of leaving the 1-6 matchup alone and going to 2-4 and 3-5, they mess with the 1-6 and do 1-5, 2-6, 3-4. The higher seed has priority for being left alone, and EU, which uses a sheet and made no mistakes all split, handled this correctly in both Regional 2 Closed R5 and Regional 3 Closed R4 Low. EU is right. The 7-10 and 10-13 errors below are actual mistakes.
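For reference, the “highest seed vs lowest possible seed without rematches” rule, with the higher seed getting priority for being left alone, can be sketched as a small backtracking search. This is one reading of the rule as described in this post, not whatever tooling the organizers actually use, and the function name and data layout are made up.

```python
def pair_scoregroup(seeds, played):
    """Pair a scoregroup given seeds ordered best-first.

    The highest unpaired seed takes the lowest remaining seed it has not
    already played. If that leaves the rest of the group unpairable, it moves
    up to the next-lowest opponent. Returns a list of pairs, or None if no
    rematch-free pairing exists.
    """
    if not seeds:
        return []
    top, rest = seeds[0], seeds[1:]
    for opponent in reversed(rest):          # lowest seed first
        if frozenset((top, opponent)) in played:
            continue                         # rematch, not allowed
        remaining = [s for s in rest if s != opponent]
        tail = pair_scoregroup(remaining, played)
        if tail is not None:
            return [(top, opponent)] + tail
    return None

# The common six-team case: #2 and #5 have already played.
six_team = pair_scoregroup([1, 2, 3, 4, 5, 6], {frozenset((2, 5))})
print(six_team)

# The pairwise-swap example: #1 has already played #8.
eight_team = pair_scoregroup(list(range(1, 9)), {frozenset((1, 8))})
print(eight_team)
```

In the six-team case this leaves 1-6 alone and produces 2-4 and 3-5, which is the EU result. In the eight-team case it produces the 1-7 and 2-8 swap.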

SAM Fall Split Scoregroup Pairing Mistakes:

Invitational Qualifier

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/South_America/1/Invitational_Qualifier

Round 2 High, FUR should be playing the worst 3-2 (TIME), but is playing Ins instead. WTFBBQ

Round 2 Low, NBL (highest 2-3) should be playing the worst 0-3 (FNG), but isn’t. WTFBBQ

Round 3 Low, Seeding is (W, L, GD, initial seed, followed by opponents each round if relevant)
13 lvi 0 2 -3 3
14 vprs 0 2 -4 12
15 neo 0 2 -4 15
16 fng 0 2 -5 16

but pairings were LVI-NEO and VPRS-FNG. WTFBBQ

Round 4 high:

Seedings were (1 and 2 are the 3-0s that have qualified, not a bug, the order is correct- LOC is 2 (4) because it started 2-0)
3 ts 2 1 4 5 vprs time fur
4 loc 2 1 1 14 lvi tts era
5 nbl 2 1 4 4 time vprs ins
6 tts 2 1 2 11 elv loc kru
7 elv 2 1 1 6 tts lvi nva
8 time 2 1 0 13 nbl ts dre

TS and time can’t play again, but 3-7 and 4-8 (and 5-6) is fine. Pairwise Swap fail.

Round 4 Low:

9 nva 1 2 -1 7 kru era elv
10 dre 1 2 -1 9 ins fng time
11 ins 1 2 -4 8 dre fur nbl
12 kru 1 2 -4 10 nva neo tts
13 lvi 1 2 -2 3 loc elv neo
14 vprs 1 2 -3 12 ts nbl fng

All base pairings worked, but we got ins-lvi somehow. WTFBBQ

Round 5:
Seedings were:
ts 2 2 1 5 vprs time fur elv
time 2 2 -2 13 nbl ts dre tts
loc 2 2 -2 14 lvi tts era nbl
lvi 2 2 -1 3 loc elv neo ins
vprs 2 2 -1 12 ts nbl fng nva
kru 2 2 -2 10 nva neo tts dre

Loc-LVI is the only problem, but the pairwise swap to time-lvi and loc-vprs works fine. Instead we got ts-lvi. Pairwise swap fail (being generous).

4 WTFBBQ, 2 pairwise swap fails, 6 total.

———————————————–

SAM regional 1 Closed Qualifier

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/South_America/1/Closed_Qualifier

Round 2 High has Kru-NEO and Round 2 Low has AC-QSHW which are both obviously wrong. WTFBBQ x2

Round 3 High

1 ts 2 0 6 1
2 ins 2 0 5 6
3 dre 2 0 4 4
4 neo 2 0 4 7

And we got ts-dre and neo-ins WTFBBQ

R3 Mid
5 nva 1 1 2 5 emt7 dre
6 loc 1 1 1 3 naiz ins
7 kru 1 1 -1 2 qshw neo
8 fng 1 1 -2 8 lgc ts
9 lgc 1 1 2 9 fng prds
10 ac 1 1 2 10 neo qshw
11 bnd 1 1 0 11 ins naiz
12 emt7 1 1 0 12 nva nice

5-12 and 8-9 are problems, but the two pairwise swaps (11 for 12 and 10 for 9) work. Instead we got nva-bnd and then loc-ac for no apparent reason. Pairwise swap fail.

Round 3 Low:
qshw 0 2 -5 15
nice 0 2 -6 13
naiz 0 2 -6 14
prds 0 2 -6 16

and we got qshw-naiz. WTFBBQ.

R4 High:
dre 2 1 1 4 nice nva ts
neo 2 1 1 7 ac kru ins
nva 2 1 5 5 emt7 dre bnd
ac 2 1 3 10 neo qshw loc
kru 2 1 0 2 qshw neo lgc
fng 2 1 -1 8 lgc ts emt7

neo is the problem, having played ac and kru. Dre-fng can be paired without a forced rematch, which is what the algorithm says to do, and we didn’t get it. Complicated fail

R4 Low
9 lgc 1 2 1 9 fng prds kru
10 loc 1 2 0 3 naiz ins ac
11 emt7 1 2 -1 12 nva nice fng
12 bnd 1 2 -3 11 ins naiz nva
13 naiz 1 2 -4 14 loc bnd qshw
14 prds 1 2 -5 16 ts lgc nice

9-14 and 10-13 are rematches, but the pairwise swap solves it. Instead we got EMT7-PRDS. Pairwise swap fail.

R5:
6 neo 2 2 0 7 ac kru ins fng
7 ac 2 2 0 10 neo qshw loc nva
8 kru 2 2 -2 2 qshw neo lgc dre
9 lgc 2 2 4 9 fng prds kru naiz
10 loc 2 2 3 3 naiz ins ac prds
11 emt7 2 2 2 12 nva nice fng bnd

7-10 and 8-9 are the problem, but instead of pairwise swapping, we got neo-lgc. Really should be a WTFBBQ, but a pairwise fail.

+4 WTFBBQ (8), +3 Pairwise (5), +1 complicated (1), 14 total.

————————————————-

SAM Regional 1

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/South_America/1

R4 Low

9 vprs 1 2 -1 7 drc emt7 ts
10 dre 1 2 -2 12 elv lvi nva
11 fng 1 2 -2 13 tts ftw drc
12 end 1 2 -3 8 ts loc tts
13 elv 1 2 -2 5 dre nva emt7
14 loc 1 2 -3 15 fur end ftw

10-13 is the only problem, but instead of 10-12 and 11-13, they fucked with vprs pairing. Pairwise swap fail.

+0 WTFBBQ(8), +1 pairwise swap (6), +0 complicated (1), 15 total.

——————————————————————

SAM Regional 2 Closed Qualifier

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/South_America/2/Closed_Qualifier

Round 2 Low:

FDOM (0) vs END (-1) is obviously wrong. WTFBBQ. And LOLOLOL one more time at zero-game-differential forfeits.

R3 Mid:
5 fng 1 1 2 4 nice loc
6 ftw 1 1 2 7 flip vprs
7 drc 1 1 -1 3 kru elv
8 fdom 1 1 -1 9 emt7 nva
9 kru 1 1 1 14 drc ac
10 sac 1 1 0 16 vprs flip
11 end 1 1 -1 11 loc emt7
12 ball 1 1 -1 15 nva nice

The base pairings work, but we got FNG-end and ftw-ball for some reason. WTFBBQ. Even scoring the forfeit 0-3 doesn’t fix this.

R3 Low:
emt7 0 2 0 8
flip 0 2 -5 10
nice 0 2 -5 13
ac 0 2 -6 12

and we got emt7-flip. WTFBBQ.

R4 High:

3 vprs 2 1 2 1 sac ftw nva
4 loc 2 1 -1 6 end fng elv
5 ftw 2 1 4 7 flip vprs ball
6 drc 2 1 2 3 kru elv sac
7 fdom 2 1 1 9 emt7 nva kru
8 end 2 1 1 11 loc emt7 fng

Base pairings worked. WTFBBQ

R4 Low:
9 fng 1 2 0 4 nice loc end
10 kru 1 2 -1 14 drc ac fdom
11 ball 1 2 -3 15 nva nice ftw
12 sac 1 2 -3 16 vprs flip drc
13 ac 1 2 -3 12 elv kru nice
14 flip 1 2 -5 10 ftw sac emt7

fng, the top seed, doesn’t play either of the teams that were 0-2.. base pairings would have worked. WTFBBQ.

+5 WTFBBQ (13), +0 pairwise fail (6), +0 complicated (1), 20 total.

SAM Regional 3 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/South_America/3/Closed_Qualifier

R4 High

3	endgame	2	1	5	1	bnd	tn	vprs
4	lev	2	1	4	5	sac	endless	elv
5	bnd	2	1	1	16	endgame	benz	ball
6	fng	2	1	0	6	ball	resi	endless
7	fdom	2	1	-1	10	ftw	elv	emt7
8	tn	2	1	-2	2	benz	endgame	ftw

endgame-tn doesn’t work, but the pairwise swap is fine, and instead we got endgame-fng!??! Pairwise fail.

R4 low:

9	emt7	1	2	0	9	endless	sac	fdom
10	ftw	1	2	-1	7	fdom	ag	tn
11	ball	1	2	-2	11	fng	vprs	bnd
12	endless	1	2	-3	8	emt7	lev	fng
13	ag	1	2	-2	14	elv	ftw	benz
14	sac	1	2	-4	12	lev	emt7	resi

9-14 and 10-13 are problems, but the pairwise swap is fine. Instead we got emt7-ftw which is stunningly wrong. Pairwise fail.

Round 5:

6	fng	2	2	-2	6	ball	resi	endless	endgame
7	bnd	2	2	-2	16	endgame	benz	ball	lev
8	tn	2	2	-3	2	benz	endgame	ftw	fdom
9	ftw	2	2	1	7	fdom	ag	tn	emt7
10	endless	2	2	0	8	emt7	lev	fng	ag
11	ball	2	2	-1	11	fng	vprs	bnd	sac

This is a mess, going to the 8th most preferable pairing before finding no rematches, and.. they got it right. After screwing up both simple R4s.

Final SAM mispaired scoregroup count: +0 WTFBBQ (13), +2 pairwise fail (8), +0 complicated (1), 22 total.

OCE Fall Scoregroup pairing mistakes

Regional 1 Invitational Qualifier

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Oceania/1/Invitational_Qualifier

Round 2 Low:

The only 2-3, JOE, is paired against the *highest* seeded 0-3 instead of the lowest. WTFBBQ.

R3 Mid:

5	kid	1	1	0	6	rrr	riot
6	grog	1	1	0	9	joe	gz
7	bndt	1	1	0	10	vc	rng
8	dire	1	1	-1	3	eros	wc
9	eros	1	1	0	14	dire	pg
10	vc	1	1	-1	7	bndt	gse
11	rrr	1	1	-2	11	kid	joe
12	tri	1	1	-2	15	rng	rru

8-9 and 7-10 are problems, but a pairwise swap fixes it, and instead every single pairing got fucked up. Pairwise fail is too nice, but ok.

R4 Low:

9	eros	1	2	-2	14	dire	pg	kid
10	vc	1	2	-3	7	bndt	gse	dire
11	tri	1	2	-3	15	rng	rru	grog
12	rrr	1	2	-4	11	kid	joe	bndt
13	joe	1	2	1	8	grog	rrr	pg
14	gse	1	2	-3	13	wc	vc	rru

Base pairings work. WTFBBQ

2 WTFBBQ, 1 pairwise, 3 total.

OCE Regional 1 Closed Qualifier

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Oceania/1/Closed_Qualifier

Round 2 Low:

The best 2-3 (1620) is paired against the best 0-3 (tri), not the worst (bgea). WTFBBQ.

Round 4 High:

3	gse	2	1	1	1	tisi	joe	hms
4	grog	2	1	0	3	smsh	bott	wc
5	eros	2	1	4	5	hms	rust	bott
6	waz	2	1	4	11	tri	hms	pg
7	T1620	2	1	2	10	rru	tri	joe
8	rru	2	1	1	7	T1620	wc	tisi

Base pairings work. WTFBBQ.

Round 5:

6	T1620	2	2	1	10	rru	tri	joe	gse
7	waz	2	2	1	11	tri	hms	pg	eros
8	grog	2	2	-1	3	smsh	bott	wc	rru
9	tisi	2	2	2	16	gse	bgea	rru	bott
10	tri	2	2	-1	6	waz	T1620	bgea	joe
11	rust	2	2	-2	13	joe	eros	smsh	pg

7-10 is the problem, but they swapped with 6-11 instead of leaving 6-11 and swapping with 7-10. Pairwise fail.

+2 WTFBBQ(4), +1 pairwise(2), 6 total.

OCE Regional 1

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Oceania/1

Round 5:

6	riot	2	2	3	2	grog	rust	eros	bndt
7	grog	2	2	0	15	riot	gz	blss	rrr
8	wc	2	2	-1	9	kid	bndt	rng	dire
9	fbrd	2	2	2	13	bndt	kid	T1620	hms
10	blss	2	2	1	6	eros	T1620	grog	gse
11	kid	2	2	-2	8	wc	fbrd	dire	eros

Exact same mistake as above. Pairwise fail.

+0 WTFBBQ(4), +1 pairwise(3), 7 total.

OCE Regional 2 Closed

Round 3 Mid:

5	blss	1	1	0	2	tri	fbrd
6	pg	1	1	0	14	gse	t1620
7	tisi	1	1	-1	11	rust	eros
8	ctrl	1	1	-1	13	hms	waz
9	gse	1	1	0	3	pg	lits
10	grog	1	1	0	12	eros	rust
11	bott	1	1	0	16	fbrd	tri
12	hms	1	1	-1	4	ctrl	bstn

Base pairings work. WTFBBQ

Round 4 Low:

9	blss	1	2	-1	2	tri	fbrd	bott
10	gse	1	2	-1	3	pg	lits	tisi
11	pg	1	2	-1	14	gse	t1620	hms
12	ctrl	1	2	-2	13	hms	waz	grog
13	tri	1	2	-2	15	blss	bott	bstn
14	lits	1	2	-4	10	t1620	gse	rust

Base pairings work. WTFBBQ

+2 WTFBBQ(6), +0 pairwise(3), 9 total.

OCE Regional 3 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Oceania/3/Closed_Qualifier

R3 Mid:

5	gse	1	1	2	7	lits	t1620
6	wc	1	1	0	1	tisi	grog
7	deft	1	1	0	6	zbb	eros
8	gue	1	1	0	15	bott	crwn
9	tri	1	1	1	12	grog	tisi
10	bott	1	1	0	2	gue	pg
11	lits	1	1	-1	10	gse	ctrl
12	corn	1	1	-2	13	eros	zbb

Base pairings work. WTFBBQ

R3 Low

13	zbb	0	2	-3	11	deft	corn
14	ctrl	0	2	-5	8	t1620	lits
15	pg	0	2	-5	14	crwn	bott
16	tisi	0	2	-5	16	wc	tri

Base Pairings work. WTFBBQ

R4 High:

3	crwn	2	1	4	3	pg	gue	t1620
4	grog	2	1	1	5	tri	wc	eros
5	wc	2	1	3	1	tisi	grog	bott
6	tri	2	1	2	12	grog	tisi	deft
7	gue	2	1	2	15	bott	crwn	lits
8	corn	2	1	0	13	eros	zbb	gse

Base pairings work. WTFBBQ

R4 Low:

9	gse	1	2	0	7	lits	t1620	corn
10	deft	1	2	-1	6	zbb	eros	tri
11	bott	1	2	-3	2	gue	pg	wc
12	lits	1	2	-3	10	gse	ctrl	gue
13	zbb	1	2	0	11	deft	corn	ctrl
14	tisi	1	2	-3	16	wc	tri	pg

10-13 is the problem, pairwise swapped with 9-14 instead of 11-12. Pairwise fail.

+3 WTFBBQ(9), +1 pairwise(4), 13 total.

OCE Regional 3:

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Oceania/3

R3 Low:

13	tri	0	2	-3	14	riot	wc
14	gue	0	2	-5	15	gz	t1620
15	crwn	0	2	-6	11	kid	grog
16	tisi	0	2	-6	16	rng	eros

Base pairings work. WTFBBQ

OCE Final count: +1 WTFBBQ(10), +0 pairwise(4), 14 total.

SSA Scoregroup Pairing Mistakes

Regional 1

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Sub-Saharan_Africa/1

Round 2 Low:

The 0 GD FFs are paired against the best teams instead of the worst. WTFBBQ.

SSA Regional 2 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Sub-Saharan_Africa/2/Closed_Qualifier

Round 4 Low:

9	bw	1	2	-3	6	crim	est	slim
10	aces	1	2	-3	12	llg	biju	info
11	crim	1	2	-5	11	bw	dd	fsn
12	win	1	2	-5	16	mist	gst	auf
13	biju	1	2	-2	13	info	aces	est
14	slap	1	2	-6	14	auf	fsn	gst

10-13 pairwise exchanging with 9-14 instead of 11-12. Pairwise fail.

SSA Regional 2

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Sub-Saharan_Africa/2

R4 High

3	atk	2	1	4	2	mils	ag	bvd
4	org	2	1	1	3	dwym	atmc	op
5	atmc	2	1	4	7	mist	org	oor
6	llg	2	1	4	12	oor	bvd	dd
7	auf	2	1	3	9	ag	mils	exo
8	ag	2	1	0	8	auf	atk	mist

3-8 is the problem, but instead of pairwise swapping with 4-7, the pairing is 3-4?!?!?!

R4 Low

9	mist	1	2	0	10	atmc	dwym	ag
10	dd	1	2	-3	11	exo	fsn	llg
11	oor	1	2	-4	5	llg	info	atmc
12	exo	1	2	-5	6	dd	op	auf
13	fsn	1	2	-1	16	op	dd	dwym
14	info	1	2	-2	13	bvd	oor	mils

Another 10-13 pairwise mistake.

SSA Regional 3 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Sub-Saharan_Africa/3/Closed_Qualifier

Round 5

6	fsn	2	2	-1	5	slim	lft	gst	blc
7	mils	2	2	-1	6	blc	chk	ddogs	oor
8	roc	2	2	-1	10	gst	mlg	slim	mist
9	ddogs	2	2	3	4	alfa	mist	mils	lft
10	gst	2	2	0	7	roc	oor	fsn	alfa
11	slim	2	2	0	12	fsn	llg	roc	est

This is kind of a mess, but 6-9, 7-8, 10-11 is correct, and instead they went with 6-7, 8-9, 10-11!??!? which is not. Complicated fail.

SSA total (1 event pending): 1 WTFBBQ, 3 pairwise, 1 complicated.

APAC N Swiss Scoregroup pairing mistakes

Regional 2

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Asia-Pacific_North/2

Round 5:

6	xfs	2	2	1	8	nor	gds	nov	dtn
7	alh	2	2	0	10	n55	timi	gra	hill
8	cv	2	2	-1	4	chi	gra	gds	n55
9	gra	2	2	-1	6	blt	cv	alh	nor
10	chi	2	2	-2	13	cv	blt	dtn	nov
11	timi	2	2	-2	16	ve	alh	blt	wiz

This is a mess, but a pairing is possible with 6-11, so it must be done. Instead they screwed up that match and left 7-10 alone. Complicated fail

APAC S Swiss Scoregroup pairing mistakes

Regional 2 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Asia-Pacific_South/2/Closed_Qualifier

Round 4 Low:

9	exd	1	2	1	14	flu	dd	shg
10	vite	1	2	-2	12	shg	cel	flu
11	wide	1	2	-3	10	pow	che	ete
12	pow	1	2	-4	7	wide	hira	acrm
13	whe	1	2	-4	9	ete	acrm	che
14	sjk	1	2	-4	16	cel	shg	axi

11-12 is the problem, and instead of swapping to 10-12 11-13, they swapped to 10-11 and 12-13. Pairwise fail.

Round 5

6	acrm	2	2	1	13	pc	whe	pow	ete
7	flu	2	2	-2	3	exd	axi	vite	dd
8	shg	2	2	-3	5	vite	sjk	exd	hira
9	exd	2	2	4	14	flu	dd	shg	sjk
10	pow	2	2	-1	7	wide	hira	acrm	whe
11	wide	2	2	-3	10	pow	che	ete	vite

Another 8-9 issue screwing with 6-11 for no reason. Complicated fail.

Regional 3 Closed:

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/Asia-Pacific_South/3/Closed_Qualifier

Round 3 Mid

Two teams from the 1-0 bracket are playing each other (and two from the 0-1 playing each other as well obviously). I checked this one on Smash too. WTFBBQ

Round 4 High

3	dd	2	1	2	2	wc	woa	bro
4	flu	2	1	2	4	soy	fre	pum
5	znt	2	1	5	1	woa	wc	shg
6	pc	2	1	4	3	lads	pum	woa
7	ete	2	1	3	7	wash	bro	wide
8	fre	2	1	2	12	shg	flu	exd

The default pairings work. WTFBBQ

Round 4 Low

9	exd	1	2	-1	6	pum	lads	fre
10	wide	1	2	-1	8	bro	wash	ete
11	woa	1	2	-3	16	znt	dd	pc
12	shg	1	2	-4	5	fre	soy	znt
13	soy	1	2	-2	13	flu	shg	wc
14	lads	1	2	-2	14	pc	exd	wash

9-14 is the problem, and the simple swap with 10-13 works. Instead they did 9-12, 10-11, 13-14. Pairwise fail.

Round 5:

6	ete	2	2	2	7	wash	bro	wide	dd
7	fre	2	2	0	12	shg	flu	exd	znt
8	flu	2	2	-1	4	soy	fre	pum	pc
9	soy	2	2	0	13	flu	shg	wc	lads
10	woa	2	2	0	16	znt	dd	pc	wide
11	shg	2	2	-3	5	fre	soy	znt	exd

8-9 is the problem, but the simple swap with 7-10 works. Instead we got 6-8, 7-9, 10-11. Pairwise fail.

APAC Total: 2 WTFBBQ, 3 Pairwise fails, and 1 complicated.

NA Swiss Scoregroup Pairing Mistakes

Regional 3 Closed

https://liquipedia.net/rocketleague/Rocket_League_Championship_Series/2021-22/Fall/North_America/3/Closed_Qualifier

Round 4

3	pk	2	1	4	11	rbg	yo	vib
4	gg	2	1	1	1	exe	sq	oxg
5	eu	2	1	4	14	sq	exe	rge
6	sq	2	1	1	3	eu	gg	tor
7	xlr8	2	1	1	12	vib	leh	sr
8	yo	2	1	1	15	tor	pk	clt

This is a mess. 3-8 is illegal, but 3-7 can lead to a legal pairing, so it must be done (3-7, 4-5, 6-8). Instead we got 3-6, 4-8, 5-7. Complicated fail.

EU made no mistakes. Liquipedia looks sketchy enough on MENA that I don’t want to try to figure out who was what seed when.

Final tally across all regions

	WTFBBQ	Pairwise	Complicated	Total
SAM	13	8	1	22
OCE	10	4	0	14
SSA	1	3	1	5
APAC	2	3	1	6
NA	0	0	1	1
EU	0	0	0	0
Total	26	18	4	48

Missing the forest for.. the forest

The paper A Random Forest approach to identify metrics that best predict match outcome and player ranking in the esport Rocket League got published yesterday (9/29/2021), and for a Cliff’s Notes version, it did two things: 1) Looked at 1-game statistics to predict that game’s winner and/or goal differential, and 2) Looked at 1-game statistics across several rank (MMR/Elo) stratifications to attempt to classify players into the correct rank based on those stats. The overarching theme of the paper was to identify specific areas that players could focus their training on to improve results.

For part 1, that largely involves finding “winner things” and “loser things” and the implicit assumption that choosing to do more winner things and fewer loser things will increase performance. That runs into the giant “correlation isn’t causation” issue. While the specific Rocket League details aren’t important, this kind of analysis will identify second-half QB kneeldowns as a huge winner move and having an empty net with a minute left in an NHL game as a huge loser move. Treating these as strategic directives- having your QB kneel more or refusing to pull your goalie ever- would be actively terrible and harm your chances of winning.

Those examples are so obviously ridiculous that nobody would ever take them seriously, but when the metrics don’t capture losing endgames as precisely, they can be even *more* dangerous, telling a story that’s incorrect for the same fundamental reason, but one that’s plausible enough to be believed. A common example is outrushing your opponent in the NFL being correlated to winning. We’ve seen Derrick Henry or Marshawn Lynch completely dump truck opposing defenses, and when somebody talks about outrushing leading to wins, it’s easy to think of instances like that and agree. In reality, leading teams run more and trailing teams run less, and the “signal” is much, much more from capturing leading/trailing behavior than from Marshawn going full beast mode sometimes.

If you don’t apply subject-matter knowledge to your data exploration, you’ll effectively ask bad questions that get answered by “what a losing game looks like” and not “what (actionable) choices led to losing”. That’s all well-known, if worth restating occasionally.

The more interesting part begins with the second objective. While the particular skills don’t matter, trust me that the difference in car control between top players and Diamond-ranked players is on the order of watching Simone Biles do a floor routine and watching me trip over my cat. Both involve tumbling, and that’s about where the similarity ends.

The paper identifies various mechanics and identifies rank pretty well based on those. What’s interesting is that while they can use those mechanics to tell a Diamond from a Bronze, when they tried to use those mechanics to predict the outcome of a game, they all graded out as basically worthless. While some may have suffered from adverse selection (something you do less when you’re winning), they had a pretty good selection of mechanics and they ALL sucked at predicting the winner. And, yet, beyond absolutely any doubt, the higher rank stratifications are much better at them than the lower-rank ones. WTF? How can that be?

The answer is in a sample constructed in a particularly pathological way, and it’s one that will be common among esports data sets for the foreseeable future. All of the matches are contested between players of approximately equal overall skill. The sample contains no games of Diamonds stomping Bronzes or getting crushed by Grand Champs.

The players in each match have different abilities at each of the mechanics, but the overall package always grades out similarly given that they have close enough MMR to get paired up. So if Player A is significantly stronger than player B at mechanic A to the point you’d expect it to show up, ceteris paribus, as a large winrate effect, A almost tautologically has to be worse at the other aspects, otherwise A would be significantly higher-rated than B and the pairing algorithm would have excluded that match from the sample. So the analysis comes to the conclusion that being better at mechanic A doesn’t predict winning a game. If the sample contained comparable numbers of cross-rank matches, all of the important mechanics would obviously be huge predictors of game winner/loser.
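
A toy simulation makes the effect concrete. This is a sketch under assumptions of my own choosing, not anything from the paper: overall skill is mechanic A plus "everything else", both drawn from the same normal distribution, and games are decided by the standard Elo curve. The only thing that changes between the two samples is whether the matchmaker throws out games with a big overall skill gap.

```python
import random

def elo_expected(diff, scale=400.0):
    """Standard Elo expected score for a player rated `diff` points higher."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))

def better_at_a_winrate(n_games, max_gap, rng):
    """Share of games won by whichever player is stronger at mechanic A.

    max_gap=None pairs players at random; a number keeps only games where
    the overall skill gap is at most that many points (narrow matchmaking).
    """
    wins = 0
    games = 0
    while games < n_games:
        # Toy model: overall skill = mechanic A + everything else.
        a1, rest1 = rng.gauss(0, 150), rng.gauss(0, 150)
        a2, rest2 = rng.gauss(0, 150), rng.gauss(0, 150)
        diff = (a1 + rest1) - (a2 + rest2)
        if max_gap is not None and abs(diff) > max_gap:
            continue  # the matchmaker would never have created this game
        p1_wins = rng.random() < elo_expected(diff)
        if p1_wins == (a1 > a2):
            wins += 1
        games += 1
    return wins / n_games

rng = random.Random(0)
matched = better_at_a_winrate(5000, 25, rng)   # rank-matched sample
mixed = better_at_a_winrate(5000, None, rng)   # cross-rank sample
print(f"matched: {matched:.3f}  mixed: {mixed:.3f}")
```

In the matched sample the player who is better at mechanic A wins about half the time, because being paired at all means they are worse at everything else by roughly the same amount. In the mixed sample the same mechanic is a clear predictor.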

The sample being pathologically constructed led to the profoundly incorrect conclusion

Taken together, higher rank players show better control over the movement of their car and are able to play a greater proportion of their matches at high speed. However, within rank-matched matches, this does not predict match outcome. Therefore, our findings suggest that while focussing on game speed and car movement may not provide immediate benefit to the outcome within matches, these PIs are important to develop as they may facilitate one’s improvement in overall expertise over time.

even though adding or subtracting a particular ability from a player would matter *immediately*. The idea that you can work on mechanics to improve overall expertise (AKA achieving a significantly higher MMR) WITHOUT IT MANIFESTING IN MATCH RESULTS, WHICH IS WHERE MMR COMES FROM, is.. interesting. It’s trying to take two obviously true statements (Higher-ranked players play faster and with more control- quantified in the paper. Playing faster and with more control makes you better- self-evident to anybody who knows RL at all) and shoehorn a finding between them that obviously doesn’t comport.

This kind of mistake will occur over and over and over when data sets comprised of narrow-band matchmaking are analysed that way.

(It’s basically the same mistake as thinking that velocity doesn’t matter for mediocre MLB pitchers- it doesn’t correlate to a lower ERA among that group, but any individuals gaining velocity will improve ERA on average)

Monster Exploit(s) Available In M:tG Arena MMR

Edit: FINALLY fixed as of 3/17/2022 (both conceding during sideboarding and losing match due to roping out, and presumably to the chess clock running out as well)

Not just the “concede a lot for easy pairings” idea detailed in Inside the MTG: Arena Rating System, which still works quite well, as another Ivana 49-1 cakewalk from Plat 4 to Mythic the past couple of days would attest, but this time exploits that can be used for the top-1200 race itself.

In Bo3, conceding on the sideboarding screen ends the match *and only considers previous games when rating the match*. Conceding down 0-1 makes you lose half-K instead of full-K (Bo1 K-value is ~1/2 Bo3 K-value). If the matchup is bad, you can use this to cut your losses. Conceding at 1-1 treats the match as a (half-K) draw- literally adding a draw to your logfile stats as well as rating the match as a draw. If game 3 is bad for you, you can lock in a draw this way instead of playing it.

If you win game 1 and concede, it gets rated as a half-K match WIN (despite showing a Defeat screen). This means that you can always force a match to play exactly as Bo1 if you want to- half K, 1 game, win or lose- so you don’t have to worry about post-sideboard games, can safely play glass-cannon strategies that get crushed by a lot of decks post-board, etc.- and you still have the option to play on if you like the Game 2 matchup.

Draws from the draw bug (failure to connect, match instantly ending in a draw) are also rated as a draw. I believe that’s a new bug from the big update (edit: apparently not, unless it got patched and unpatched since February- see the comment below). It’s rated as a normal draw in Bo1 and a half-K draw in Bo3.
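
For anybody who wants the rating consequences spelled out, here is a small sketch of the sideboard-concede cases described above. It assumes a plain Elo-style update (change = K * (score - expected)); the function names and the K of 45 are placeholders for illustration, not values read out of the client.

```python
def conceded_bo3_result(games_won, games_lost):
    """(K multiplier, score) for a Bo3 conceded on the sideboarding screen.

    Only the games already played count, and the match is rated at half K.
    """
    if (games_won, games_lost) not in {(1, 0), (0, 1), (1, 1)}:
        raise ValueError("sideboarding only happens at 1-0, 0-1 or 1-1")
    if games_won > games_lost:
        score = 1.0   # rated as a win despite the Defeat screen
    elif games_won < games_lost:
        score = 0.0
    else:
        score = 0.5   # rated as a draw
    return 0.5, score

def rating_change(k_full, expected, games_won, games_lost):
    """Elo-style rating change for the conceding player."""
    multiplier, score = conceded_bo3_result(games_won, games_lost)
    return k_full * multiplier * (score - expected)

K_BO3 = 45.0  # placeholder full-match K
for state in [(1, 0), (0, 1), (1, 1)]:
    print(state, rating_change(K_BO3, 0.5, *state))
```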

Inside the MTG: Arena Rating System

Edit: A summary, including some updates, is now at https://hareeb.com/2022/02/02/one-last-rating-update/

Note: If you’re playing in numbered Mythic Constructed during the rest of May, and/or you’d like to help me crowdsource enough logfiles to get a full picture of the Rank # – Rating relationship during the last week, please visit https://twitter.com/Hareeb_alSaq/status/1397022404363395079 and DM/share. If I get enough data, I can make a rank-decay curve for every rank at once, among other things.

Brought to you by the all-time undisputed king of the percent gamers

Apologies for the writing- Some parts I’d written before, some I’m just writing now, but there’s a ton to get out, a couple of necessary experiments weren’t performed or finished yet, and I’m sure I’ll find things I could have explained more clearly. The details are also seriously nerdy, so reading all of this definitely isn’t for everybody. Or maybe anybody.

TL;DR

1. There is rating-based pairing in ranked constructed below Mythic (as well as in Mythic).
2. It’s just as exploitable as you should think it is.
3. There is no detectable Glicko-ness to Mythic constructed ratings in the second half of the month. It’s indistinguishable from base-Elo.
   1. Expected win% is constrained to a ~25%-~75% range, regardless of rating difference, for both Bo1 and Bo3. The cap is reached at a rating difference of around 11% Mythic later in the month.
   2. After convergence, the Bo1 K-value is ~20.5. Bo3 K is ~45.
   3. The minimum change in rating is ~5 points in a Bo1 match and ~10 points in a Bo3 match.
4. Early in the month, the system is more complicated.
5. Performance before Mythic seems to have only slight impact on where you’re initially placed in Mythic.
6. Giving everybody similar initial ratings when they make Mythic leads to issues at the end of the month.
7. The change making Gold +2 per win/-1 per loss likely turbocharged the issues from #6.

It’s well known that the rank decay near the end of the month in Mythic Constructed is incredibly severe. These days, a top-600 rating with 24 hours left is insufficient to finish top-1200, and it’s not just a last-day effect. There’s significant decay in the days leading up to the last day, just not at that level of crazy. The canonical explanations were that people were grinding to mythic at the end of the month and that people were playing more in the last couple of days. While both true, neither seemed sufficient to me to explain that level of decay. Were clones of the top-600 all sitting around waiting until the last day to make Mythic and kick everybody else out? If they were already Mythic and top-1200 talent level, why weren’t they mostly already rated as such? The decay is also much, much worse than it was in late 2019, and those explanations give no real hint as to why.

The only two pieces of information we have been given are that 1) Mythic Percentile is the percentage (Int(Your Rating/#1500 rating)) of the actual internal rating of the #1500 player. This is true. 2) Arena uses a modified Glicko system. Glicko is a modification of the old Elo system. This is, at best, highly misleading. The actual system does multiple things that are not Glicko and does not do at least one thing that is in Glicko.

I suspected that WotC might be rigging the rating algorithm as the month progressed, either deliberately increasing variance by raising the K-value of matches or by making each match positive-sum instead of zero-sum (i.e. calculating the correct rating changes, then giving one or both players a small boost to reward playing). Either of these would explain the massive collision of people outside the top-1200, who are playing, into the people inside the top-1200 who are trying to camp on their rating. As it turns out, neither of those appears to be directly true. The rating system seems to be effectively the same throughout the last couple of weeks of the month, at least in Mythic. The explanations for what’s actually going on are more technical, and the next couple of sections are going to be a bit dry. Scroll down- way down- to the Problems section if you want to skip how I wasted too much of my time.

I’ve decided to structure this as a journal-of-my-exploration style post, so it’s clear why it was necessary to do what I was doing if I wanted to get the information that WotC has continually failed to provide for years.

Experiments:

Background

I hoped that the minimum win/loss would be quantized at a useful level once the rating difference got big enough, and if true, it would allow me to probe the algorithm. Thankfully, this guess turned out to be correct. Deranking to absurdly low levels let me run several experiments.

Under the assumption that the #1500 rating does not change wildly over a few hours in the middle of the month when there are well over 1500 players, it’s possible to benchmark a rating without seeing it directly. For instance, a minimum value loss that knocks you from 33% to 32% at time T will leave you with a similar rating, within one minimum loss value, as a 33%-32% loss several hours later. Also, if nothing else is going on, like a baseline drift, the rating value of 0 is equivalent over any timescale within the same season. This sort of benchmarking was used throughout.

Relative win-loss values

Because at very low rating, every win would be a maximum value win and every loss would be a minimum value loss, the ratio I needed to maintain the same percentile would let me calculate the win% used to quantize the minimum loss. As it turned out, it was very close to 3 losses for every 1 win, or a 25%-75% cap, no matter how big the rating difference (at Mythic). This was true for both Bo1 and Bo3, although I didn’t measure Bo3 super-precisely because it’s a royal pain in the ass to win a lot of Bo3s compared to spamming Mono-R in Bo1 on my phone, but I’m not far off whatever it is. My return benchmark was reached at 13 wins and 39 losses, which is 3:1, and I assumed it would be a nice round number. Unfortunately, as I discovered later, it was not *exactly* 3:1, or everybody’s life would have been much easier.
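
The arithmetic behind reading a 3:1 ratio as a 25%-75% cap can be sketched like this. It assumes a plain Elo update with the expected score clamped to [0.25, 0.75]; the function name and the K of 20 are placeholders, and the clamp is the hypothesis being tested, not something the client reports.

```python
def capped_elo_change(k, expected, won, lo=0.25, hi=0.75):
    """Rating change for one match with the expected score clamped to [lo, hi]."""
    clamped = min(max(expected, lo), hi)
    return k * ((1.0 if won else 0.0) - clamped)

K = 20.0  # placeholder; the ratio below does not depend on it

# A hugely lower-rated player: every win is max value, every loss is min value.
max_win = capped_elo_change(K, 0.01, won=True)    # +0.75 * K
min_loss = capped_elo_change(K, 0.01, won=False)  # -0.25 * K

losses_per_win = max_win / -min_loss               # losses offset by one win
breakeven_winrate = -min_loss / (max_win - min_loss)
net_13_and_39 = 13 * max_win + 39 * min_loss       # the benchmark record

print(losses_per_win, breakeven_winrate, net_13_and_39)
```

Under that assumption K is exactly 4 minimum losses, which is what makes the minimum loss usable as a measuring stick.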

Relative Bo1-Bo3 K values

Bo3 has about 2.2 times the K value of Bo1. By measuring how many min-loss matches I had to concede in each mode to drop the same percentage, it was clear that the Bo3 K-value was a little over double the Bo1 K-value. In a separate experiment, losing 2-0 or 2-1 in Bo3 made no difference (as expected, but no reason not to test it). Furthermore, being lower rated and having lost the last match (or N matches) had no effect on the coin toss in Bo3. Again, it shouldn’t have, but that was an easy test.

Elo value of a percentage point

This is not a constant value throughout the month because the rating of the #1500 player increases through the month, but it’s possible to get an approximate snapshot value of it. Measuring this the first way I tried was much more difficult because it required playing matches inside the 25%-75% range, and that comes with a repeated source of error. If you go 1-1 against players with mirrored percentile differences, those matches appear to offset, but because the ratings are only reported as integers, it’s possible that you went 1-1 against players who were on average 0.7% below you (meaning that 1-1 is below expectation) or vice versa. The SD of the noise term from offsetting matches would keep growing and my benchmark would be less and less accurate the more that happened.

I avoided that by conceding every match that was plausibly in the 25-75% range and only playing to beat much higher rated players (or much lower rated, but I never got one, if one even existed). Max-value wins have no error term, so the unavoidable aggregate uncertainty was kept as small as possible. Using the standard Elo formula value of 400 (who knows what it is internally, but Elo is scale-invariant), the 25%-75% cap is reached at a 191-point difference, and by solving for how many points/% returned my variable-value losses to the benchmark where I started, I got a value of 17.3 pts/% on 2/16 for Bo1.
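
For reference, the 191-point figure falls straight out of the standard Elo expected-score formula. A quick sketch, assuming the usual 400-point scale (the function name is mine, and 17.3 is just the measured value quoted above):

```python
import math

def cap_rating_gap(cap=0.75, scale=400.0):
    """Rating gap at which the Elo expected score reaches `cap`."""
    return scale * math.log10(cap / (1.0 - cap))

gap = cap_rating_gap()            # rating points between the 25% and 75% sides
pts_per_pct = 17.3                # measured on 2/16 for Bo1, from the text above
gap_in_pct = gap / pts_per_pct    # the same gap expressed in Mythic percent

print(f"{gap:.1f} points, about {gap_in_pct:.1f}% Mythic")
```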

I did a similar experiment for Bo3 to see if the 25%-75% threshold kicked in at the same rating difference (basically if Bo3 used a number bigger than 400). Gathering data was much more time-consuming this way, and I couldn’t measure with nearly the same precision, but I got enough data to where I could exclude much higher values. It’s quite unlikely that the value could have been above 550, and it was exactly consistent with 400, and it’s unlikely that they would have bothered to make a change smaller than that, so the Bo3 value is presumably just 400 as well.

This came out to a difference of around 11% mythic being the 25-75% cap for Bo1 and Bo3, and combined with earlier deranking experiments, a K-value likely between 20-24 for Bo1 and 40-48 for Bo3. Similar experiments on 2/24 gave similar numbers. I thought I’d solved the puzzle in February. Despite having the cutoffs incorrect, I still came pretty close to the right answer here.

Initial Mythic Rating/Rating-based pairing

My main account made Mythic on 3/1 with a 65-70% winrate in Diamond. I made two burners for March, played them normally through Plat, and then diverged in Diamond. Burner #1 played through Diamond normally (42-22 in Diamond, 65-9 before that). Burner #2 conceded hundreds of matches at Diamond 4 before trying to win, then went something like 27-3 playing against almost none of the same decks- almost entirely against labors of jank love, upgraded precons, and total nonsense. The two burners made Mythic within minutes of each other. Burner #1 started at 90%. Burner #2 started at 86%. My main account was 89% at that point (I’d accidentally played and lost one match in ranked because the dogshit client reverted preferences during an update and stuck me in ranked instead of the play queue when I was trying to get my 4 daily wins). I have no idea what the Mythic seeding algorithm is, but there was minimal difference between solid performance and intentionally being as bad as possible.

It’s also difficult to overstate the difference in opponent difficulty that rating-based pairing presents. A trash rating carries over from month to month, so being a horrendous Mythic means you get easy matches after the reset, and conceding a lot of matches at any level gives you an easy path to Mythic (conceding in Gold 4 still gets you easy matches in Diamond, etc).

Lack of Glicko-ness

In Glicko, rating deviation (a higher rating deviation leads to a higher “K-value”) is supposed to decrease with number of games played and increase with inactivity. My main account and the two burners from above should have produced different behavior. The main account had craploads of games played lifetime, a near-minimum to reach Mythic in the current season, and had been idle in ranked for over 3 weeks with the exception of that 1 mistake game. Burner #1 had played a near-minimum number of games to reach Mythic (season and lifetime) and was currently active. Burner #2 had played hundreds more games (like 3x as many as Burner #1) and was also currently active.

My plan was to concede bunches of matches on each account and see how much curvature there was in the graph of Mythic % vs. expected points lost (using the 25-75 cap and the 11% approximation) and how different it was between accounts. Glicko-ness would manifest as a bigger drop in Mythic % earlier for the same number of expected points lost because the rating deviation would be higher early in the conceding session. As it turned out, all three accounts just produced straight lines with the same slope (~2.38%/K on 3/25). Games played before Mythic didn’t matter. Games played in Mythic didn’t matter. Inactivity didn’t matter. No Glicko-ness detected.
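
To show what "curvature" would have looked like, here is a sketch comparing textbook Glicko-1 (Glickman's published formulas, one game per rating period) with fixed-K Elo. Everything about the setup is an assumption for illustration: the starting rating and RD of 1500/350 are the Glicko defaults, the opponent RD of 50 is arbitrary, and each loss is against an equal-rated opponent so the expected score stays at 0.5 and only the rating deviation changes. None of this is Arena's actual code.

```python
import math

Q = math.log(10) / 400.0

def g(rd):
    """Glicko-1 attenuation factor for an opponent with rating deviation rd."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko1_update(r, rd, opp_r, opp_rd, score):
    """One-game Glicko-1 update; returns (new rating, new RD)."""
    expected = 1.0 / (1.0 + 10.0 ** (-g(opp_rd) * (r - opp_r) / 400.0))
    inv_d2 = Q * Q * g(opp_rd) ** 2 * expected * (1.0 - expected)
    new_var = 1.0 / (1.0 / rd ** 2 + inv_d2)
    new_r = r + Q * new_var * g(opp_rd) * (score - expected)
    return new_r, math.sqrt(new_var)

def glicko_drops(n, r=1500.0, rd=350.0, opp_rd=50.0):
    """Points lost on each of n straight losses to an equal-rated opponent."""
    drops = []
    for _ in range(n):
        new_r, rd = glicko1_update(r, rd, r, opp_rd, 0.0)
        drops.append(r - new_r)
        r = new_r
    return drops

def elo_drops(n, k=20.5, expected=0.5):
    """Same losing streak under fixed-K Elo: every loss costs the same."""
    return [k * expected for _ in range(n)]

print([round(d, 1) for d in glicko_drops(10)])
print(elo_drops(10))
```

Under Glicko-1 each loss costs less than the one before it, so a plot of rating against matches conceded bends. Under fixed-K Elo it is a straight line, which is what all three accounts produced.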

Lack of (explicit) inactivity penalty

I deranked two accounts to utterly absurd levels and benchmarked them at a 2:1 percentage ratio. They stayed in 2:1 lockstep throughout the month (changes reflecting the increase in the #1500 rating, as expected). I also sat an account just above 0 (within 5 points), and it stayed there for like 2 weeks, and then I lost a game and it dropped below 0, meaning it hadn’t moved any meaningful amount. Not playing appears to do absolutely nothing to rating during the month, and there doesn’t seem to be any kind of baseline drift.

At this point (late March), I believed that the system was probably just Elo (because the Glicko features I should have detected were clearly absent), and that the win:loss ratio was exactly 3:1, because why would it be so close to a round number without being a round number. Assuming that, I’d come up with a way to measure the actual K-value to high precision.

Measuring K-Value more precisely

Given that the system literally never tells you your rating, it may sound impossible to determine a K-value directly, but assuming that we’re on the familiar 400-point scale that Arpad Elo published that’s in common usage (and that competitive MTG used to use when they had such a thing), it actually is possible, albeit barely.

Assume you control the #1500-rated player and the #1501 player, and that #1501 is rated much lower than #1500. #1501 will be displayed as a percentile instead of a ranking. If you call the first percentile displayed you see 1501A, then lose a (minimum value) match with the #1500 player, you’ll get a new percentile displayed, 1501B. Call the #1500’s initial rating X, and the #1501’s rating Y. This gives a solvable system of equations.

Y/X = 1501A and Y/(X-1 min loss) = 1501B.

This gives X and Y in terms of min-losses (e.g. X went from +5.3 minlosses to +4.3 minlosses).
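
Solving that pair of equations by hand gives X = 1501B/(1501B - 1501A) and Y = 1501A * X. A minimal sketch, with a made-up function name and made-up example numbers, and ignoring the fact that the displayed percentiles are integers (which in practice turns each answer into a small interval rather than a point):

```python
def solve_benchmark(pct_a, pct_b):
    """Solve Y/X = A and Y/(X - 1) = B for X and Y, both in units of minloss.

    pct_a and pct_b are the displayed percentiles (in percent) before and
    after the #1500 account takes exactly one minimum-value loss.
    """
    a, b = pct_a / 100.0, pct_b / 100.0
    if a == b:
        raise ValueError("the percentile must change to solve for X")
    x = b / (b - a)   # #1500 rating before the loss
    y = a * x         # #1501 rating
    return x, y

# Hypothetical round trip: X = 5.3 minlosses, Y = -1000 minlosses.
true_x, true_y = 5.3, -1000.0
pct_before = 100.0 * true_y / true_x
pct_after = 100.0 * true_y / (true_x - 1.0)
x, y = solve_benchmark(pct_before, pct_after)
print(x, y)
```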

Because 1501A and 1501B are reported as integers, the only way to get that number reported to a useful precision is for Y to be very large in magnitude and X to be very small. And of course getting Y large in magnitude means losing a crapload of matches. Getting X to be very small was accomplished via the log files. The game doesn’t tell you your mythic percentile when you’re top-1500, but the logfile stores your percentage of the lowest-rated Mythic. So the lowest-rated Mythic is 100% in the logfile, but once the lowest-rated Mythic goes negative from losing a lot of matches, every normal Mythic will report a negative percentile. By conceding until the exact match where the percentile flips from -1.0 to 0, that puts the account with a rating within 1 minloss of 0. So you have a very big number divided by a very small number, and you get good precision.

Doing a similar thing controlling the #1499, #1500, and #1501 allows benchmarking all 3 accounts in terms of minloss, and then playing the 1499-1500 against each other creates a match where you know the initial rating and the final rating of each participant (as a multiple of minloss). Then, knowing that the win:loss ratio is 3:1 (which makes K=4*minloss), plugging into the Elo formula gives

RatingChange*minloss= 4*minloss/(1+ 10^(InitialRatingDifference*minloss/400))

and you can solve for minloss, and then for K. As long as nobody randomly makes Mythic right when you’re trying to measure, which would screw everything up and make you wait another month to try again… It also meant that I’d have multiple accounts whose rating in terms of minloss I knew exactly, and by playing them against each other and accounts nowhere close in rating (min losses and max wins), and logging exactly when each account went from positive to negative, I could make sure I had the right K-value.
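
The formula above has a closed-form solution once minloss is cancelled from both sides. A sketch under the same assumptions as the text (400-point scale, K = 4*minloss, expected score inside the 25%-75% range), plus one assumption of mine about sign convention: InitialRatingDifference is taken as winner minus loser. The function name and the example numbers are hypothetical.

```python
import math

def solve_minloss(change_units, diff_units, scale=400.0):
    """Solve change*m = 4*m / (1 + 10**(diff*m/scale)) for m.

    change_units: size of the rating change, in minlosses.
    diff_units:   winner's rating minus loser's rating before the match,
                  in minlosses (assumed sign convention).
    Returns (minloss in rating points, K = 4 * minloss).
    """
    if diff_units == 0:
        raise ValueError("equal ratings carry no information about minloss")
    if not 0.0 < change_units < 4.0:
        raise ValueError("change must be between 0 and 4 minlosses")
    minloss = scale / diff_units * math.log10(4.0 / change_units - 1.0)
    if minloss <= 0:
        raise ValueError("inputs are inconsistent with the assumed formula")
    return minloss, 4.0 * minloss

# Hypothetical round trip: K = 20.25, winner started 30 minlosses higher.
true_minloss = 20.25 / 4.0
diff = 30.0
change = 4.0 / (1.0 + 10.0 ** (diff * true_minloss / 400.0))
minloss, k = solve_minloss(change, diff)
print(minloss, k)
```

The closer the two accounts are in rating, the less the observed change depends on minloss, so a small logging or rounding error turns into a large error in K.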

That latter part didn’t work. I got a reasonable value out of the first measured match- K of about 20.25- but it was clear that subsequent matches were not behaving exactly as expected, and there was no value of K, and no combination of K and minloss, that would fix things. I couldn’t find a mistake in my match logging (although I knew better than to completely rule it out), and the only other obvious simple source of error was the 3:1 assumption.

I’d only measured 13 wins offsetting 39 losses, which looked good, but certainly wasn’t a definitive 3.0000:1. So, of course the only way to measure this more precisely was to lose a crapload of games and see exactly how many wins it took to offset them. And that came out to a breakeven win% of 24.32%. And I did it again on bigger samples, and came out with 24.37% and 24.40%, and in absolutely wonderful news, there was no single value that was consistent with all measurements. The breakeven win% in those samples really had slightly increased. FML.
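
To put those breakeven numbers back on the same footing as the 3:1 assumption, the conversion is just (1 - p)/p. A tiny sketch (the function name is mine):

```python
def losses_per_win(breakeven_win_pct):
    """Number of minimum-value losses that one maximum-value win offsets."""
    p = breakeven_win_pct / 100.0
    return (1.0 - p) / p

for pct in (25.0, 24.32, 24.37, 24.40):
    print(f"{pct:.2f}% breakeven -> {losses_per_win(pct):.3f} losses per win")
```

So the measured ratios sit a little above 3.1 rather than at 3, and they drift between samples.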

Now that the system clearly wasn’t just Elo, and the breakeven W:L ratio was somehow at least slightly dynamic, I went in for another round of measurements in May. The first thing I noticed was that I got from my initial Mythic seed to a 0 rating MUCH faster than I had when deranking later in the month. And by later in the month, I mean anything after the first day or 2 of the season, not just actually late in the month.

When deranking my reference account (the big negative number I need for precise measurements), the measured number of minlosses was about 1.6 times as high as expected from the number of matches conceded, and I had 4 other accounts hovering around a 0 rating who benchmarked and played each other in the short window of time when I controlled the #1500 player, and all of those measurements were consistent with each other. The calculated reference ratings were different by 1 in the 6th significant digit, so I have confidence in that measurement.

I got a similar K-value as the first time, but I noticed something curious when I was setting up the accounts for measurements. Before, with the breakeven win% at 24.4%, 3 losses and 1 win (against much-higher rated players, i.e. everybody but me) was a slight increase in rating. Early in May, it was a slight *decrease* in rating, so the breakeven win% against the opponents I played was slightly OVER 25%, the first time I’d seen that. And as of a few days ago, it was back to being an increase in rating. I still don’t have a clear explanation for that, although I do have an idea or two.

Once I’d done my measurements and calculations, I had a reference account with a rating equal to a known number of minlosses-at-that-time, and a few other accounts with nothing better to do than to lose games to see how or if the value of a minloss changed over a month. If I started at 0, and took X minlosses, and my reference account was at -Y minlosses, then if the value of a minloss is constant, the Mythic Percentile ratio and X/Y ratio should be the same, which is what I was currently in the process of measuring. And, obviously, measuring that precisely requires.. conceding craploads of games. What I got was consistent with no change, but not to the precision I was aiming for before this all blew up.

So this meant that the rating change from a minloss was not stable throughout the month- it was much higher at the very beginning, as seen from my reference account, but it had probably stabilized- at least for my accounts playing each other- by the time the 1500th Mythic arrived on May 7 or 8. That’s quite strange. Combined with the prior observation where approximately the bare minimum number of games to make mythic did NOT cause an increase in the minloss value, this wasn’t a function of my games played, which were already far above the games played on that account from deranking to 0.

In Glicko, the “K-value” of a match depends on the number of games you’ve played (more=lower, but we know that’s irrelevant after this many games), the inactivity period (more=higher, but also known to be irrelevant here), and the number of games your opponent has played (more=higher, which is EXACTLY BACKWARDS here). So the only Glicko-relevant factor left is behaving strongly in the wrong direction (obviously opponents on May 1 have fewer games played, on average, than opponents on May 22).

So something else is spiking the minloss value at the beginning of the month, and I suspect it’s simply a quickly decaying function of time left/elapsed in the month. Instead of an inactivity term, I suspect WotC just runs a super-high K value/change multiplier/whatever at the start of the month that calms down pretty fast over the first week or so. I had planned to test that by speedrunning a couple of accounts to Mythic at the start of June, deranking them to 0 rating, and then having each account concede some number of games sequentially (Account A scoops a bunch of matches on 6/2, Account B scoops a bunch of matches on 6/3, etc) and then seeing what percentile they ended up at after we got 1500 mythics. Even though they would have lost the same number of matches from 0, I expected to see A with a lower percentile than B, etc, because of that decaying function. Again, something that can only be measured by conceding a bunch of matches, and something in the system completely unrelated to the Glicko they told us they were running. If you’re wondering why it’s taking months to try to figure this stuff out, well, it’s annoying when every other test reveals some new “feature” that there was no reason to suspect existed.

Problems

Rating-based pairing below Mythic is absurdly exploitable and manifestly unfair

I’m not the first person to discover it. I’ve seen a couple of random reddit posts suggesting conceding a bunch of matches at the start of the season, then coasting to Mythic. This advice is clearly correct if you just want to make Mythic. It’s not super-helpful trying to make Mythic on day 1, because there’s not that much nonsense (or really weak players) in Diamond that early, but later in the month, the Play button may as well say Click to Win if you’re decent and your rating is horrible.

When you see somebody post about their total jankfest making Mythic after going 60% in Diamond or something, it’s some amount of luck, but they probably played even worse decks, tanked their rating hard at Diamond 4, and then found something marginally playable and crushed the bottom of the barrel after switching decks. Meanwhile, halfway decent players are preferentially paired against other decent players and don’t get anywhere.

Rating-based pairing might be appropriate at the bottom level of each rank (Diamond 4, Plat 4, etc), just so people can try nonsense in ranked and not get curbstomped all the time, but after that, it should be random same-rank pairing with no regard to rating (using ratings to pair in Draft, to some extent, has valid reasons that don’t exist in Constructed, and the Play Queue is an entirely different animal altogether).

Of course, my “should” is from the perspective of wanting a fairer and unexploitable ladder climb, and WotC’s “should” is from the perspective of making it artificially difficult for more invested players to rank up by giving them tougher pairings (in the same rank), presumably causing them to spend more time and money to make progress in the game.

Bo3 K is WAY too high

Several things should jump out at you if you’re familiar with either Magic or Elo. First, given the same initial participant ratings, winning two consecutive Bo1 games rewards fewer points (X + slightly fewer than X) than winning one Bo3 (~2.2X), even though going 2-0 is clearly a more convincing result. There’s no rating-accuracy justification whatsoever for Bo3 being double the K value of Bo1. 1.25x or 1.33x might be reasonable, although the right multiplier could be even lower than that. Second, while a K-value of 20.5 might be a bit on the aggressive side for Bo1 among well-established players (chess, sports), ~45 for a Bo3 is absolutely batshit.
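A quick sketch of the Bo1 vs Bo3 comparison under plain, uncapped Elo, using the approximate K values quoted above. The 1650 starting ratings are just an example. The second Bo1 win is assumed to come against the same opponent after a zero-sum update.

```python
def elo_expected(rating: float, opponent: float) -> float:
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def elo_gain_for_win(rating: float, opponent: float, k: float) -> float:
    """Points gained for a win under plain Elo."""
    return k * (1 - elo_expected(rating, opponent))


K_BO1, K_BO3 = 20.5, 45.0  # approximate values from the text

# Two consecutive Bo1 wins, starting from equal ratings
me, opp = 1650.0, 1650.0
first = elo_gain_for_win(me, opp, K_BO1)
me, opp = me + first, opp - first  # zero-sum update after the first win
second = elo_gain_for_win(me, opp, K_BO1)
two_bo1_wins = first + second

# One Bo3 win from the same starting ratings
one_bo3_win = elo_gain_for_win(1650.0, 1650.0, K_BO3)
ratio = one_bo3_win / first

print(first, second, two_bo1_wins, one_bo3_win, ratio)
```

With these numbers the first Bo1 win is worth 10.25, the second a bit less, and the single Bo3 win is worth 22.5, or about 2.2X.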

Back when WotC used Elo for organized play, random events had K values of 16, PTQs used 32, and Worlds/Pro Tours used 48. All for one Bo3. The current implementation on Arena is using ~20.5 for Bo1 and a near-pro-tour K-value for one random Bo3 ladder match. Yeah.

The ~75%-25% cap is far too narrow

While not many people have overall 75% winrates in Mythic, it seems utterly implausible, both from personal experience and from things like the MtG Elo Project, that when strong players play weaker players, the aggregate matchup isn’t more lopsided than that. After conceding bunches of games at Plat 4 to get a low rating, my last three accounts went 51-3, 49-1, 48-2 to reach Mythic from Plat 4. When doing my massive “measure the W:L ratio” experiment last month, I was just over 87% winrate (in almost 750 matches) in Mythic when trying to win, and that’s in Bo1, mostly on my phone while multitasking, and I’m hardly the second coming of Finkel, Kai, or PVDDR (and I didn’t “cheat” and concede garbage and play good starting hands- I was either playing to win every game or to snap-concede every game). Furthermore, having almost the same ~75%-25% cap for both Bo1 and Bo3 is self-evidently nonsense when the cap is possibly in play.

The Elo formula is supposed to ensure that any two swaths of players are going to be close to equilibrium at any given time, with minimal average point flow if they keep getting paired against each other, but with WotC’s truncated implementation, when one group actually beats another more than 75% of the time, and keeps getting rewarded as though they were only supposed to win 75%, the good players farm (expected) points off the weaker players every time they’re paired up. I reached out to the makers of several trackers to try to get a large sample of the actual results when two mythic %s played each other, but the only one who responded didn’t have the data. I can certainly believe that Magic needs something that declines in a less extreme fashion than the Elo curve for large rating differences, but a 75%-25% cap is nowhere close to the correct answer.
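Here is a rough sketch of the farming effect. It assumes the cap is exactly 75%-25% (the measured value is only approximate) and the matchup is hypothetical: a favorite who really wins 87% of the time, with a rating gap that uncapped Elo would price at exactly 87%.

```python
import math


def elo_expected(diff: float) -> float:
    """Standard Elo expected score for a player rated `diff` points higher."""
    return 1 / (1 + 10 ** (-diff / 400))


def capped_expected(diff: float, cap: float = 0.75) -> float:
    """Expected score truncated to [1 - cap, cap]; the exact cap is an assumption."""
    return min(max(elo_expected(diff), 1 - cap), cap)


def points_per_match(true_win_rate: float, expected: float, k: float) -> float:
    """Average rating points the favorite gains per match."""
    return k * (true_win_rate - expected)


TRUE_WIN_RATE = 0.87  # hypothetical true matchup win rate
K_BO1 = 20.5
# Rating gap at which uncapped Elo predicts exactly 87%
gap = 400 * math.log10(TRUE_WIN_RATE / (1 - TRUE_WIN_RATE))

uncapped_flow = points_per_match(TRUE_WIN_RATE, elo_expected(gap), K_BO1)
capped_flow = points_per_match(TRUE_WIN_RATE, capped_expected(gap), K_BO1)
print(round(gap), uncapped_flow, capped_flow)
```

Without the cap, the average flow is zero at that gap, which is the equilibrium Elo is built to reach. With the cap, the favorite collects about 2.5 points per Bo1 match indefinitely, because no rating gap can ever make the formula expect more than 75%.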

An Overlooked Change

With the Ikoria release in April 2020, Gold was changed to be 2 pips of progress per win instead of 1, making it like Silver. This had the obvious effect of letting weak/new players make Platinum while before they got stuck in Gold. I suspected that this may have allowed a bunch more weaker players to make it to Mythic late in the month, and this looks extremely likely to be correct.

I obviously don’t have population data for each rank, but since Mythic resets to Plat, I created a toy model of 30k Plats ~N(1600,85), 90k Golds ~N(1450,85), 150k Silvers ~N(1300,85) constant talent level, started each player at rating=talent, and simulated what happened as it got back to 30k people in Mythic. In each “iteration”, people played random Bo1 K=22 matches in the same rank, and Diamonds played 4 matches, Plats 3, Golds/Silvers 2 per iteration. None of these are going to be exact obviously, but the basic conclusions below are robust over huge ranges of possibly reasonable assumptions.

As anybody should expect, the players who make Mythic later in the month are much weaker on average than the ones who make it early. In the toy model, the average Mythic talent was 1622, and the first 20% to make Mythic are over 1700 talent on average (and almost nobody got stuck in Gold). The last 20% are about 1560. The cutoff for the top-10% talentwise (Rank 3000 out of 30000) is about 1790. You may be able to see where this is going.

I reran the simulation using two different parameters- first, I made Gold the way it used to be- 1 pip per win and per loss. About 40% of people got stuck in Gold in this simulation, and the average Mythic player was MUCH stronger- 1695 vs 1622. There were also under 1/3 as many, 8800 vs 30,000 (running for the same number of iterations). The late-month Mythics are obviously still weaker here, but 1650 here on average instead of 1560. That’s a huge difference.

I also ran a model where Silver/Gold populations were 1/4 of their starting size (representing lots of people making Plat since it’s easy and then quitting before they play against those in the higher ranks). That’s 30k starting in Plat and 60k starting below Plat who continue to play in Plat, which seems like a quite conservative ratio to me. This came out roughly in the middle of the previous two. The average Mythic was 1660 and the late-season Mythics were around 1607 on average. It doesn’t require an overwhelming infusion into Plat to create a big effect on who makes it to Mythic late in the month.

Influx of Players and Overrated Players

The first part is obvious from the previous paragraph. A lot more people in Mythic is going to push the #1500 rating higher by variance alone, even if the newbies mostly aren’t that good.

Because WotC doesn’t use anything like a provisional rating, where a Mythic rating is based on the first X number of games at Mythic, and instead seems to give everybody fairly similar ratings throughout the month when they first make Mythic, the players who make it late in the month are MASSIVELY overrated relative to the existing population, on the order of 100+ Elo. Treating early-season Mythics and late-season Mythics as separate populations, when two players from the same group play each other, the group keeps the same average rating. When cross-group play happens, the early-season Mythics farm the hell out of the late-season Mythics (because they’re weaker, but rated the same) until a new equilibrium is reached. And with lots more (weaker) players making Mythic because of the change to Gold, there’s a lot of farming to be done.

This effectively makes playing late in the month positive-sum for good players because there are tons of new fish to farm showing up every day. It also indirectly punishes people who camp at the end of the month because they can’t collect the free points if they aren’t playing. This was likely always a significant cause of rank decay, but the easier path to Mythic gives a clear explanation of why rank decay is so much more severe now than it was pre-Ikoria: more players and lots more fish. The influx of weak players also means more people in the queue for good players to 75-25 farm, even after equilibration, but I expect that effect is smaller than the direct point donation.

New-player ratings are a solved problem in chess and were implemented in a proper Glicko framework in the mid-90s. WotC used the dumb implementation, “everybody starts at 1600”, for competitive paper magic back in the day, and that had the exact same problem then as their Mythic seeding procedure does now- people late to the party are weaker than average, by a lot, and while their MTG:A implementation added a fancy wrapper, it still appears to be making the same fundamental mistake that they made 25 years ago.

This is a graph of the #1500 rating in April as the month progressed. I got it from my reference account’s percentile changing (with a constant actual rating) as the month progressed.

The part on the left is when there are barely more than 1500 people in Mythic at all, and on the right is the late-month rating inflation. Top-1200 inflation was likely even worse (it was in January at least). The middle section of approximately a straight line is more interesting than it seems. In a normal-ish distribution, once you get out of the stupid area equivalent to the left of this graph, adding more people to the distribution increases the #1500 rating in a highly sub-linear way. To keep a line going, and to actually go above linear near the end, requires some combination of beyond-exponential growth in the Mythic population through the whole month and/or lots of fish-farming by the top end. I have no way to try to measure how much of each without bulk tracker data, but I expect both to matter. And both would be tamped down if Gold were still +1/-1.
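The sub-linear claim is easy to check with a sketch. It assumes Mythic ratings are exactly normal with a made-up mean and SD (the real distribution is unknown), then looks at where the #1500 cutoff sits as the population grows in equal steps.

```python
from statistics import NormalDist


def rating_at_rank(rank: int, population: int, mean: float, sd: float) -> float:
    """Rating of the player at `rank` if ratings were exactly normal."""
    quantile = 1 - rank / population
    return NormalDist(mean, sd).inv_cdf(quantile)


MEAN, SD = 1650.0, 100.0  # hypothetical distribution parameters
populations = [5000, 10000, 15000, 20000, 25000, 30000]

cutoffs = [rating_at_rank(1500, n, MEAN, SD) for n in populations]
# Gain in the #1500 rating for each extra 5000 Mythics
steps = [later - earlier for earlier, later in zip(cutoffs, cutoffs[1:])]

for n, cutoff in zip(populations, cutoffs):
    print(n, round(cutoff, 1))
print([round(s, 1) for s in steps])
```

Each additional 5000 players moves the cutoff by less than the previous 5000 did. So steady population growth alone bends the curve downward, and a straight or upward-bending line needs something extra.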

Conclusions

Cutting way back on rating-based pairing in Constructed would create a much fairer ladder climb before Mythic and take away the easy-mode exploit. Bringing the Bo3 K way down would create a more talent-based distribution at the top of Mythic instead of a giant crapshoot. A better Mythic seeding algorithm would offset the increase in weak players making it late in the month. The ~75-25 cap.. I just don’t even. I’ll leave it to the reader’s imagination as to why their algorithm does what it does and why the details have been kept obfuscated for years now.

P.S. Apologies to anybody who was annoyed by queueing into me. I was hoping a quick free win wouldn’t be that bad. At Bo3 K-values, the rating result of any match is 95% gone inside 50 matches, so conceding to somebody early in the month is completely irrelevant to the final positioning, and due to rating-based pairing, I didn’t get matched very often against real T-1200 contenders later on. Going over 100 games without seeing a single 90% or higher was not strange.
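The "95% gone inside 50 matches" figure can be sanity-checked with a linearized sketch of plain Elo. It assumes opponents' ratings are held fixed, matches are roughly 50-50, and the rating error is small enough for the first-order approximation to hold. The function name is mine.

```python
import math


def perturbation_remaining(k: float, matches: int, expected: float = 0.5) -> float:
    """Fraction of a one-off rating bump still present after `matches` more games.

    Linearizes the Elo update around equilibrium: each game shrinks a small
    rating error by a factor of 1 - k * dE/dR.
    """
    slope = math.log(10) / 400 * expected * (1 - expected)  # dE/dR of the Elo curve
    return (1 - k * slope) ** matches


bo3_remaining = perturbation_remaining(k=45, matches=50)
bo1_remaining = perturbation_remaining(k=20.5, matches=50)
print(bo3_remaining, bo1_remaining)
```

At K=45, under 5% of a free win is left after 50 matches. At the Bo1 K of 20.5 the decay is a lot slower, with over 20% still left after 50.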