The Explosion of Uncollectible Mythics on Arena

The funny part is that I was looking at this just before WotC announced that OTJ was going full bukkake with 2 different bonus sheets plus even more Special Guests. At this point, with the release of OTJ, about *half* of the ~960 different Mythics on Arena are either literally or effectively uncollectible from store packs. That’s beyond insane.

(My counts are going to be a little off because Scryfall didn’t have perfectly up-to-date info at the time of writing- it had OTJ, but didn’t know that Stoneforge Mystic is on Arena, etc.- and I’m not checking everything by hand because the point I’m making isn’t at all dependent on near-perfect accuracy.)

Once WotC announced Mythic packs a little over 2 years ago, you’d think that you’d just be able to spend 1300 gold and get a Mythic from whatever set you’re after. Or at least you’d be able to the vast majority of the time. And if you did think that, you were horribly, horribly mistaken.

First off, WotC sometimes just… didn’t implement Mythic packs. The first 8 regular sets (up to M20), Kaladesh Remastered, Amonkhet Remastered, all of the non-standalone Alchemy sets, and MoM: Aftermath don’t have Mythic packs available for purchase. Why? Who knows. It makes no sense. This puts 60 recent Mythics (MAT+Alchemy) and ~170 older Mythics behind a pay gate where the only way to buy them with in-game currency is to buy that set’s regular packs (far) beyond the point where you’re rare complete and they’re just giving 20 gems most of the time, making them effectively uncollectible without spending wildcards. (you can also wait a month to collect a new regular set, overbuy alchemy packs until alchemy-mythic-complete without running into base-set rare completion, then finish the base set, but that still sucks and only addresses 1/4th of the problem)

Second, there are another ~67 promo and/or commander Mythics associated with particular sets that simply aren’t openable in that set’s store boosters. They aren’t even pay-gated behind almost-worthless packs- they’re truly wildcard-only. Why? Again, it makes absolutely no sense. Not having them in *draft* boosters makes sense, but Arena store boosters are a unique digital thing that has never mirrored paper- they have a different number of cards, sometimes different frequencies or different available cards, and hell, sometimes the slots aren’t even filled with cards! Why not have them drop with the lowest priority (after everything else and down with the banned cards, etc.)? Another easy fix.

And third, we have the absolute grade-A bullshit that is Arena bonus sheets. Bonus sheets (before BIG, which may be a one-off) can mostly be ignored in paper because all of the cards they contain have already existed in normal printings, so it doesn’t often matter if they’re ultra-rare in this printing. A new-to-arena bonus sheet Mythic is effectively uncollectible. For example, it would take opening **540** SoI:Remastered store packs on average to open a **single copy** of Geist of Saint Traft from the Shadows of the Past bonus sheet. Absolute bullshit. There are about 100 of these, and all of them except the 14 from Strixhaven’s Mystical Archive are from the last year and a half. It has gotten completely out of control. Goodbye to hundreds of Mythic wildcards if you try to keep up.

And on top of that, a BIG mythic drops in paper Play boosters in 18.4% of packs (1 in 5.4), a bit more frequently than a normal OTJ Mythic (edit: because there are 30 of them compared to 20 OTJ Mythics, each individual card drops slightly less often- 1 in 140 packs for each OTJ Mythic, ~1 in 160 paper packs for each BIG Mythic), and that slot is in addition to the base-set rare slot. WotC actually tried to make the paper BIG cards reasonably available because they knew they needed to do that with a first printing. In Arena store packs, a BIG Mythic drops in 1 of every **35** non-wildcard packs instead of 1 in 5.4 (1 in 1050 packs for an individual card instead of 1 in 160), and it replaces the regular Mythic. Why? Because fuck you twice, that’s why.
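The per-card numbers above are just the slot rate multiplied by the number of cards on the sheet. A minimal sketch of that arithmetic, assuming every BIG Mythic is equally likely when the slot hits (the function name is made up for illustration):

```python
def packs_per_specific_card(packs_per_any_hit: float, cards_on_sheet: int) -> float:
    """Average packs per copy of one specific card, assuming each card on
    the sheet is equally likely when the slot hits."""
    return packs_per_any_hit * cards_on_sheet


BIG_MYTHICS = 30  # count given in the text

paper = packs_per_specific_card(5.4, BIG_MYTHICS)  # paper Play boosters
arena = packs_per_specific_card(35, BIG_MYTHICS)   # Arena store packs (non-wildcard)

print(f"paper: about 1 in {paper:.0f} packs per specific BIG Mythic")
print(f"arena: about 1 in {arena:.0f} packs per specific BIG Mythic")
print(f"arena is about {arena / paper:.1f}x rarer")
```

The ratio of the two rates is where the "factor of 7"-ish nerf comes from.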

Fourth, you have the other ~70 Mythics that were never available in store packs and are wildcard-only now- Special Guests, Anthology cards, Jumpstart exclusives. Some were reasonably collectible for a short time, others not so much, but they’re all impossible now.

This is all trivially fixable- if the plan were to ever fix anything instead of looking for new places to nerf Mythic drop rates by a factor of 7. Fully implement Mythic packs as should have been done 2 years ago. Put promo/commander cards down with banned cards at the end of packs. Have the rare(mythic) slot in a store pack give a bonus sheet/list/SPG rare(mythic) instead of gems when you’re rare(mythic) complete in the base set. That’s 90% of the uncollectible Mythic problem solved right there, and the other 10% isn’t that hard if you’re a little creative, and also doesn’t matter that much without the other massive Mythic wildcard drains running wild.

Update to the MtG: Arena Mythic Ranking System

I’d noticed things weren’t right in November 2023, and according to other people, things had appeared different for a couple of months before that (I had played very little ranked). After investigating further, it appears that the core system is largely the same while the actual calculation module is now just three bugs in a trench coat.

What doesn’t seem to have changed:

There’s a cap of 1650 MMR when ranking in each month and you have to either be very new to ranked or expend effort to rank in well below that, and it’s (almost certainly) still based on your Serious Rating.
Games against other Mythics are capped at a 200 MMR difference for rating calculations regardless of how wide the difference is.
Games against non-Mythics are rated as though you’re 100 points above the non-Mythic, regardless of what your MMR is.
MMR doesn’t drift over time
Bo3 matches are worth twice as many points as Bo1
Mythic is run at a fixed Rating Deviation- even though the system is likely still trying to be Glicko-2, Mythic RD doesn’t change with played matches

What has definitely changed:

Bo1 matches against somebody 200+ points lower used to be +5 points for a win and -15.5 points for a loss for a breakeven winrate of 75.6%. Now this is +3.8 points for a win and -5.9 points for a loss for a breakeven winrate of 61.1% and many fewer points at stake.
Bo1 matches against a non-mythic used to be -13 points for a loss and +7.4 points for a win for a breakeven winrate of 63.7%. Now this is -5.4 points for a loss and +4.3 points for a win for a breakeven winrate of 55.6% and many fewer points at stake.
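The breakeven winrates quoted above follow directly from the win/loss point values. A quick sketch (the helper name is mine, and the inputs are the rounded one-decimal values from the text, so results can differ from the quoted figures by a few tenths of a percent):

```python
def breakeven_winrate(win_points: float, loss_points: float) -> float:
    """Winrate p at which p * win_points - (1 - p) * loss_points = 0.

    Both arguments are positive magnitudes.
    """
    return loss_points / (win_points + loss_points)


cases = {
    "old, vs Mythic 200+ lower": (5.0, 15.5),
    "old, vs non-Mythic": (7.4, 13.0),
    "new, vs Mythic 200+ lower": (3.8, 5.9),
    "new, vs non-Mythic": (4.3, 5.4),
}
for label, (win, loss) in cases.items():
    print(f"{label}: {breakeven_winrate(win, loss):.1%}")
```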

Even though I consider it highly likely that this output is just the result of silly bugs, this system does fix the Mythic limited rating problem, albeit in a quite dumb way. Reducing the number of points at stake in each match is also probably a good thing. On the flip side, only needing a 55.6% breakeven rate against non-Mythics in constructed is even more advantageous than before, and having a 61.1% maximum breakeven against all Mythics in constructed is downright atrocious. Approximately, to compete for #1/very high mythic in constructed before, you needed to maximize #gamesplayed * (winrate – 75.6%), and having a sustained winrate much over 75.6% in high-ish Mythic is fairly difficult. Now the formula is approximately #gamesplayed * (winrate – 61.1%), which lets mediocre-winrate players reach very high mythic just by playing a lot.

The other interesting question- to me, at least- is where the hell any of these numbers come from. The system definitely isn’t Elo, and there’s no simple way I see to bastardize Glicko-1 to give these outputs, and the 100/200 point things still seem untouched, so I figured the simplest possibility was that it was likely to be some kind of Glicko-2 modification. In any Elo-like system, a matchup between players with particular ratings can be characterized in two ways- the obvious one being the points gained or lost on a win/loss, and the other being the ratio between the win and loss amounts, which determines the breakeven winning percentage. -15.5/+5 and -31/+10 differ by a factor of two in points at stake, but have the same breakeven percentage.

In Glicko-2, the knob to adjust the breakeven percentage for a fixed rating difference is the rating deviation, which is a measure of how uncertain the ratings are. Under the old Mythic system with a RD of 60, 200 points was a 75.6% breakeven. Lowering that to 61.1% would require increasing the rating deviation to about 741.5, and given that the value for a completely unknown player is initialized at 350 (remember that number), it seemed almost impossible that 741.5 could have been coded in. And that would have given 77.2 times as many points out (454.8 vs. 5.9), which is another nonsense number. So that seemed unlikely. Getting the points given out to decrease to anything like the new low numbers required massively reducing the RD, but that only made the breakeven winrate higher than 75.6%, so that wasn’t any good either.

Basically, these numbers aren’t even close to normal Glicko-2 numbers- reducing the points at stake would require increasing the already-far-too-high breakeven percentage, and reducing the breakeven percentage would require increasing the already-far-too-high points at stake. I didn’t have any great ideas- I’d noticed that a 200-point difference under the new system was close to an 80-point difference under the old system, as far as breakeven goes, which was about a factor of 2.5. I was looking at the Glicko-2 algorithm to see if there were anything they could have plausibly screwed up to account for that, and.. Glicko-2 has a scale constant of ln(10)/400 that ratings (and RDs) are multiplied by. If they’d miscoded that with the base-10 logarithm (log10(10) = 1) instead of ln(10), that’s a factor of about 2.3. So I coded that mistake into my Glicko-2 algorithm and checked what rating deviation I would need the players to be to come out with the 61.1% breakeven.. and the answer came back.. you guessed it.. 350! (350.36). That’s an almost impossible coincidence- that the RD to get the observed breakeven % after using the incorrect logarithm would come out at almost exactly the new-player initialization level (which they actually do/did use to initialize the play and serious ratings for actual new players).

So if we assume that’s what the algorithm is doing- using the wrong logarithm and the wrong RD initialization value (Mythics used to initialize at RD 60, which is somewhat reasonable, not 350)- how do we get from there to the actual point values? That would give out far too many points- 147.6 vs 5.9- but in this case, the multiplier is almost exactly a nice round number, 25 (25.02).
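For anyone who wants to poke at this, here is a rough sketch of the hypothesis in code. It is a single-game Glicko-2 update with both players at the same fixed RD, with the volatility update skipped (0.06 is simply folded into the pre-game variance). Those simplifications, the function names, and the `divisor` parameter standing in for the mystery divide-by-25 are all my assumptions, not anything confirmed about Arena's code.

```python
import math

CORRECT_SCALE = 400 / math.log(10)    # ~173.72, the standard Glicko-2 scale
BUGGED_SCALE = 400 / math.log10(10)   # 400.0, the hypothesised wrong-logarithm scale


def one_game(diff, rd, scale, sigma=0.06, divisor=1.0):
    """Breakeven winrate and win/loss points for the higher-rated player.

    Assumes both players share the same fixed RD and a fixed volatility.
    """
    phi = rd / scale
    g = 1 / math.sqrt(1 + 3 * phi**2 / math.pi**2)
    expected = 1 / (1 + math.exp(-g * diff / scale))  # also the breakeven winrate
    v_inv = g**2 * expected * (1 - expected)
    new_var = 1 / (1 / (phi**2 + sigma**2) + v_inv)
    k = new_var * g * scale / divisor                  # effective "K"
    return expected, k * (1 - expected), k * expected


def rd_for_breakeven(target, diff, scale):
    """Invert the expected-score formula: which RD yields this breakeven?"""
    g = math.log(target / (1 - target)) / (diff / scale)
    phi = math.sqrt((1 / g**2 - 1) * math.pi**2 / 3)
    return phi * scale


old = one_game(200, 60, CORRECT_SCALE)
bugged = one_game(200, 350, BUGGED_SCALE, divisor=25)
print("old system:     breakeven %.1f%%, +%.2f / -%.2f" % (old[0] * 100, old[1], old[2]))
print("bug hypothesis: breakeven %.1f%%, +%.2f / -%.2f" % (bugged[0] * 100, bugged[1], bugged[2]))
print("RD needed without the log bug: %.0f" % rd_for_breakeven(bugged[0], 200, CORRECT_SCALE))
```

Under these assumptions the old system comes out at about 75.6% with roughly +5.0/-15.5, and the wrong-log, RD-350, divide-by-25 version comes out at about 61.1% with roughly +3.8/-5.9. Small mismatches in later decimals are expected, in line with the footnote below.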

I don’t know if there’s a third huge bug that mimics a divide-by-25 in rating changes, whether somebody put in a divide-by-25 to get reasonably-sized numbers in lieu of actual debugging, or if this is all part of some new system and the log(10)/350 thing is just a total coincidence, but it’s hard to imagine somebody went to the trouble of completely redesigning the Mythic rating system and ended up with.. this.

*I can’t reproduce the point values exactly using any values for scale constant, initial RD, and multiplier, but I couldn’t exactly match the values before this change either. It seems fairly likely that there’s another smaller bug somewhere in their algorithm, and/or a large bug that’s mimicking divide-by-25 somehow.

The Five MTG: Arena Rankings

This ties up a few loose ends with the Mythic ranking system and explains how the other behind-the-scenes MMRs work. As it turns out, there are five completely distinct rankings in Arena: Mythic Constructed, Mythic Limited, Serious Constructed, Serious Limited, and Play (these are my names for them).

Non-Mythic

All games played in ranked (including Mythic) affect the Serious ratings, *as do the corresponding competitive events- Traditional Draft, Historic Constructed Event, etc*. Serious ratings are conserved from month to month. Play rating includes the Play and Brawl queues, as well as Jump In (despite Jump In being a limited event), and is also conserved month to month. Nonsense events (e.g. the Momir and Artisan MWMs) don’t appear to be rated in any way.

The Serious and Play ratings appear to be intended to be a standard implementation of the Glicko-2 rating system except with ratings updated after every game. I say intended because they have a gigantic bug that screws everything up, but we’ll talk about that later. New accounts start with a rating of 1500, RD of 350, and volatility of 0.06. These update after every game and there doesn’t seem to be any kind of decay or RD increase over time (an account that hadn’t even logged in since before rotation still had a low RD). Bo1 and Bo3 match wins are rated the same for these.

Mythic

Only games played while in Mythic count towards the Mythic rankings, and the gist of the system is exactly as laid out in One last rating update although I have a slight formula update. These rankings appear to come into existence when Mythic is reached each month and disappear again at the end of the month (season).

I got the bright idea that they may be using the same code to calculate Mythic changes, and this appears to be true (I say appears because I have irreconcilable differences in the 5th decimal place for all of their Glicko-2 computations that I can’t resolve with any (tau, Glicko scale constant) pair. There’s either a small bug or some kind of rounding issue on one of our ends, but it’s really tiny regardless). The differences are that Mythic uses a fixed RD of 60 and a fixed volatility of 0.06 (or extremely extremely close) that doesn’t change after matches and that the rating changes are multiplied by 2 for Bo3 matches. Glicko with a fixed RD is very similar to Elo with a fixed K on matchups within 200 points of each other.

In addition, the initial Mythic rating is seeded *as a function of the Serious rating*. [Edit 8/1/2022: The formula is: if Serious Rating >=3000, Mythic Rating = 1650. Otherwise Mythic Rating = 1650 – ((3000 – Serious Rating) / 10). Using Serious Rating from before you win your play-in game to Mythic, not after you win it, because reasons] That’s the one piece I never exactly had a handle on. I knew that tanking my rating gave me easy pairings and a trash initial Mythic rating the next month, but I didn’t know how that happened. The existence of a separate conserved Serious rating explains it all. [Edit 8/1/22: rest of paragraph deleted since it’s no longer relevant with the formula given]
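The seeding formula from the edit above, written out as a function. The constant and function names are mine; the numbers are straight from the formula.

```python
MYTHIC_SEED_CAP = 1650.0
SERIOUS_CAP_THRESHOLD = 3000.0


def initial_mythic_rating(serious_rating: float) -> float:
    """Initial Mythic rating from the Serious rating held *before* the
    play-in win that promotes the account to Mythic."""
    if serious_rating >= SERIOUS_CAP_THRESHOLD:
        return MYTHIC_SEED_CAP
    return MYTHIC_SEED_CAP - (SERIOUS_CAP_THRESHOLD - serious_rating) / 10


for serious in (4000, 3000, 2500, 1500):
    print(serious, "->", initial_mythic_rating(serious))
```

Every 10 points of Serious rating below 3000 costs 1 point of initial Mythic rating, which is why it takes tanking thousands of points to rank in meaningfully below the cap.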

The Mythic system also had two fixed constants that previously appeared to be randomly arbitrary- the minimum win of 5.007 points and the +7.4 win/-13.02 loss when playing a non-Mythic. Using the Glicko formula with both players having RD 60 and Volatility 0.06, the first number appears when the rating difference is restricted to a maximum of exactly 200 points. Even if you’re 400 points above your opponent, the match is rated as though you’re only 200 points higher. The second number appears when you treat the non-Mythic player as rated exactly 100 points lower than you are (regardless of what your rating actually is) with the same RD=60. This conclusively settles the debate as to whether or not that system/those numbers were empirically derived or rectally derived.
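Here is a minimal sketch of how those two constants fall out, assuming a single-game Glicko-2 update with RD fixed at 60 and volatility fixed at 0.06 for both players (so the volatility-update step is skipped). The function names and the convention of passing `None` for a non-Mythic opponent are illustrative, not Arena's actual interface.

```python
import math

SCALE = 400 / math.log(10)   # Glicko-2 rating scale, ~173.72
RD = 60.0                    # fixed Mythic rating deviation
SIGMA = 0.06                 # fixed volatility
MAX_DIFF = 200.0             # Mythic-vs-Mythic rating gap is clamped to this
NON_MYTHIC_DIFF = 100.0      # non-Mythics are treated as exactly 100 below you


def rated_diff(my_mmr, opp_mmr):
    """Rating difference actually fed to the formula (opp_mmr=None: non-Mythic)."""
    if opp_mmr is None:
        return NON_MYTHIC_DIFF
    return max(-MAX_DIFF, min(MAX_DIFF, my_mmr - opp_mmr))


def bo1_change(my_mmr, opp_mmr, won):
    """Bo1 rating change for the player with rating my_mmr."""
    phi = RD / SCALE
    g = 1 / math.sqrt(1 + 3 * phi**2 / math.pi**2)
    expected = 1 / (1 + math.exp(-g * rated_diff(my_mmr, opp_mmr) / SCALE))
    v_inv = g**2 * expected * (1 - expected)
    new_var = 1 / (1 / (phi**2 + SIGMA**2) + v_inv)
    score = 1.0 if won else 0.0
    return new_var * g * SCALE * (score - expected)


print("win when 400 above a Mythic: %+.3f" % bo1_change(2050, 1650, True))
print("win vs non-Mythic:           %+.3f" % bo1_change(1700, None, True))
print("loss vs non-Mythic:          %+.3f" % bo1_change(1700, None, False))
```

Doubling the result would give the Bo3 values, per the multiplier described above.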

The Huge Bug

As I reported on Twitter last month, the Serious and Play ratings have a problem (but not Mythic.. at least not this problem). If you lose, you’re rated as though you lost to your opponent. If you win, you’re rated as though you beat a copy of yourself (rating and RD). And, of course, even though correcting the formula/function call is absolutely trivial, it still hasn’t been fixed after weeks. This bug dates to at least January, almost certainly to the back-end update last year, and quite possibly back into Beta.

Glicko-2 isn’t zero-sum by design (if two players with the same rating play, the one with higher RD will gain/lose more points), but it doesn’t rapidly go batshit. With this bug, games are now positive-sum in expectation. When the higher-rated player wins, points are created (the higher-rated player wins more points than they should). When the lower-rated player wins, points are destroyed (the lower-rated player wins fewer points than they should). Handwaving away some distributional things that the data show don’t matter, since the higher-rated player wins more often, points are created more than destroyed, and the entire system inflates over time.

My play rating is 5032 (and I’m not spending my life tryharding the play queue), and the median rating is supposed to be around 1500. In other words, if the system were functioning correctly, my rating would mean that I’d be expected to only lose to the median player 1 time per tens of millions of games. I’m going to go out on a limb and say that even if I were playing against total newbies with precons, I wouldn’t make it anywhere close to 10 million games before winding up on the business end of a Gigantosaurus. And I encountered a player in a constructed event who had a Serious Constructed rating over 9000, so I’m nowhere near the extreme here. It appears from several reports that the cap is exactly 10000.

In addition to inflating everybody who plays, it also lets players go infinite (up to 10000) on rating. Because beating a copy of yourself is worth ~half as many points as a loss to a player you’re expected to beat ~100% of the time, anybody who can win more than 2/3 of their matches will increase their rating without bound. Clearly some players (like the 9000 I played) have been doing this for a while.
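Both effects, the inflation and the 2/3 threshold, can be seen in a toy model. This is a deliberately simplified sketch: it uses plain Elo with a fixed K instead of Glicko-2 with per-player RDs, and the K of 20 is an arbitrary illustrative value, so only the signs and the threshold matter, not the magnitudes.

```python
K = 20.0  # illustrative fixed K; the real system is Glicko-2 with per-player RD


def expected_score(diff):
    """Elo-style expected score for a player rated `diff` points higher."""
    return 1 / (1 + 10 ** (-diff / 400))


def bugged_changes(diff, higher_wins, k=K):
    """(higher-rated change, lower-rated change) for one game under the bug."""
    e_high = expected_score(diff)
    if higher_wins:
        # Winner "beat a copy of itself"; loser is rated against the real opponent.
        return 0.5 * k, -(1 - e_high) * k
    return -e_high * k, 0.5 * k


def expected_points_created(diff, k=K):
    """Average total rating change per game across both players."""
    e_high = expected_score(diff)
    return (e_high * sum(bugged_changes(diff, True, k))
            + (1 - e_high) * sum(bugged_changes(diff, False, k)))


def drift_vs_weak_field(winrate, k=K):
    """Expected change per game when every loss is to someone you 'should' beat ~100%."""
    return winrate * 0.5 * k - (1 - winrate) * 1.0 * k


for diff in (0, 100, 200, 400):
    print(f"gap {diff}: {expected_points_created(diff):+.3f} points created per game")
for wr in (0.60, 2 / 3, 0.70):
    print(f"winrate {wr:.3f}: drift {drift_vs_weak_field(wr):+.3f} per game")
```

In this model the expected points created per game works out to 2K(E - 0.5)^2, which is zero only for evenly matched players and positive otherwise, and the drift against a much weaker field crosses zero at exactly a 2/3 winrate.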

If the system were functioning properly, most people would be in a fairly narrow range. In Mythic constructed, something like 99% of players are between 1200-2100 (this is a guess, but I’m confident those endpoints aren’t too narrow by much, if at all), and that’s with a system that artificially spreads people out by letting high-rated players win too many points and low-rated players lose too many points. Serious Constructed would be spread out a bit more because it includes all the non-Mythics as well, but it’s not going to be that huge a gap down to the people who can at least make it out of Silver. And while the Play queue has much wider deck-strength, the deck-strength matching, while very far from perfect, should at least make the MMR difference more about play skill than deck strength, so it also wouldn’t spread out too far.

Instead, because of the rampant inflation, the center of the distribution is at like 4000 MMR instead of ~1500. Strong players are going infinite on the high side, and because new accounts still start at 1500, and there’s no way to make it to 4000 without winning lots of games (especially if you screwed around with precons or something for a few games at some point), there’s a constant trickle of relatively new (and some truly atrocious) accounts on the left tail spanning the thousands-of-points gap between new players and established players, and that gap only exists because of the rating bug. It looks something like this.

The three important features are that the bulk of the distribution is off by thousands of points from where it should be, noobs start thousands of points below the bulk, and players are spread over an absurdly wide range. The curves are obviously hand-drawn in Paint, so don’t read anything beyond that into the precise shapes.

This is why new players often make it to Mythic one time easily with functional-but-quite-underpowered decks- they make it before their Serious Constructed rating has crossed the giant gap from 1500 to the cluster where the established players reside. Then once they’ve crossed the gap, they mostly get destroyed. This is also why tanking the rating is so effective. It’s possible to tank thousands of points and get all the way back below the noob entry point. I’ve said repeatedly that my matchups after doing that were almost entirely obvious noobs and horrific decks, and now it’s clear why.

It shouldn’t be possible (without spending a metric shitton of time conceding matches, since there should be rapidly diminishing returns once you separate from the bulk) to tank far enough to do an entire Mythic climb with a high winrate without at least getting the rating back to the point where you’re bumping into the halfway competent players/decks on a regular basis, but because of the giant gap, it is. The Serious Constructed rating system wouldn’t be entirely fixed without the bug- MMR-based pairing still means a good player is going to face tougher matchups than a bad player on the way to Mythic, and tanking would still result in some very easy matches at the start- but those effects are greatly magnified because of the artificial gap that the bug has created.

What I still don’t know

I don’t know whether Serious or Mythic rating is used for pairing within Mythic. I don’t have the data to know if Serious is used to pair ranked drafts or any other events (Traditional Draft, Standard Constructed Event, etc). It would take a 17lands-data-dump amount of match data to conclusively show that one way or another. AFAIK, WotC has said that it isn’t, but they’ve made incorrect statements about ratings and pairings before. And I certainly don’t know, in so, so many different respects, how everything made it to this point.

The Mythic Limited Rating Problem

TL;DR

Thanks to @twoduckcubed for reading my previous work and being familiar enough with high-end limited winrates to see that there was likely to be a real problem here, and there is. If you haven’t read my previous work, it’s at One last rating update and Inside the MTG: Arena Rating System, but as long as you know anything at all about any MMR system, this post is intended to be readable by itself.

Mythic MMR starts from scratch each month and each player is assigned a Mythic MMR when they first make Mythic that month. Most people start at the initial cap, 1650, and a few start a bit below that. It takes losing an awful lot to be assigned an initial rating very far below that, and since losing a bunch of limited matches costs resources (while doing it in ranked constructed is basically free), it’s mostly 1650 or close. When two people with a Mythic rating play in Premier or Quick, it’s approximately an Elo system with a K value of 20.4, and the matches are zero-sum. When one player wins points, the other player loses exactly the same number of points.

Most games that Mythic limited players play aren’t against other Mythics though. Diamonds are the most common opponents, with significant numbers of games against Platinums as well (and a handful against Gold/Silver/Bronze). In this case, since the non-Mythic opponents literally don’t have a Mythic MMR to plug into the Elo formula, Arena, in a decision that’s utterly incomprehensible on multiple levels, rates all of these matches exactly the same regardless of the Mythic’s rating or the non-Mythic’s rank or match history. +7.4 points for a win and -13 points for a loss, and this is *not* zero-sum because the non-Mythic doesn’t have a Mythic rating yet. The points are simply created out of nothing or lost to the void.

+7.4 for a win and -13 for a loss means that the Mythics need to win 13/(13+7.4) = 63.7% of the time against non-Mythics to break even. And, well, thanks to the 17lands data dumps, I found that they won 58.3% in SNC and 59.4% in NEO (VOW and MID didn’t seem to have opponent rank data available). Nowhere close to breakeven. ~57% vs. Diamonds and ~63% vs Plats. Not even breakeven playing down two ranks. And this is already a favorable sample for multiple reasons. It’s 17lands users, who are above average Mythics (their Mythic-Mythic winrate is 52.4%). It’s also a game-averaged sample instead of a player-averaged sample, and better players play more games on average in Mythic because they get there faster and have more resources to keep paying entry fees with.
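To put a number on "nowhere close to breakeven", here is the expected Mythic MMR change per game against non-Mythics at the observed winrates. The helper name is mine and the point values are the rounded +7.4/-13 from above.

```python
WIN_POINTS = 7.4
LOSS_POINTS = 13.0


def ev_per_game(winrate: float) -> float:
    """Expected Mythic MMR change for one Bo1 game against a non-Mythic."""
    return winrate * WIN_POINTS - (1 - winrate) * LOSS_POINTS


observed = {"SNC (58.3%)": 0.583, "NEO (59.4%)": 0.594, "breakeven (63.7%)": 13 / 20.4}
for label, winrate in observed.items():
    print(f"{label}: {ev_per_game(winrate):+.2f} MMR per game")
```

At those winrates each game against a non-Mythic costs roughly 0.9 to 1.1 MMR in expectation.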

Because of this, to a reasonable approximation, every time a Mythic Limited player hits the play button, 1 MMR vanishes into the void. And since 1% of Mythic in limited is only ~16.5 MMR, 1% Mythic in expectation is lost every 2-3 drafts just for playing. The more they play, the more MMR they lose into the void. The very best players- those who can win 5% more than the average 17lands-using Mythic drafter- can outrun this and profit from playing lower ranks- but the vast majority can’t, hence the video at the top of the post. Instead of Mythic MMR being a zero-sum game, it’s like gambling against a house edge, and playing at all is clearly disincentivized for most people.

Obviously this whole implementation is just profoundly flawed and needs to be fixed. The 17lands data is anonymized, so I don’t know how many Mythic-Mythic games appeared from both sides, so I don’t know exactly what percentage of a Mythic’s games are against each level, but it’s something like 51% against Diamond, 29% against Mythic, 19% against Plat, 1% Gold and below. Clearly games vs Diamonds need to be handled responsibly, and games vs. Golds and below don’t matter much.

A simple fix that keeps most of the system intact (which may not be the best idea, but hey, at least it’s progress) is to assign the initial Mythic MMR upon making Platinum (instead of Mythic) and to not Mythic-rate any games involving a Gold or below. You wouldn’t get leaderboard position or anything until actually making Mythic, but the rating would be there behind the scenes doing its thing and this silliness would be avoided since all the rated games would be zero-sum and all the Diamond opponents would be reasonably rated for quality after playing enough games to get out of Plat.

Constructed has the same implementation, but it’s mostly not as big a deal because outside of Alchemy, cross-rank pairing isn’t very common except at the beginning of the month, and even if top-1200 quality players are getting scammed out of points by lower ranks at the start of the month (and they may well not be), they have all the time in the world to reequilibrate their rating against a ~100% Mythic opponent lineup later. Drafters play against bunches of non-Mythics throughout. Cross-rank pairing in Alchemy ranked may be/become enough of a problem to warrant a similar fix (although likely for the opposite reason, farming lower ranks instead of losing points to them), and it’s not like assigning the initial Mythic rating upon reaching Diamond and ignoring games against lower ranks actually hurts anything there either.

New MTG:A Event Payouts

With the new Premier Play announcement, we also got two new constructed event payouts and a slightly reworked Traditional Draft.

This is the analysis of the EV of those events for various winrates (game winrate for Bo1 and match winrate for Bo3). I give the expected gem return, the expected number of packs, the expected number of play points, and two ROIs, one counting packs as 200 gems (store price), the other counting packs as 22.5 gems (if you have all the cards). These are ROIs for gem entry. For gold entry, multiply by whatever you’d otherwise use gold for. If you’d buy packs, then multiply by 3/4. If you’d otherwise draft, then the constructed event entries are the same exchange rate.

Traditional Draft (1500 gem entry)

Winrate	Gems	Packs	Play Points	ROI (packs at 200 gems)	ROI (packs at 22.5 gems)
0.4	578	1.9	0.13	63.8%	41.4%
0.45	681	2.13	0.18	73.7%	48.6%
0.5	794	2.38	0.25	84.6%	56.5%
0.55	917	2.65	0.33	96.4%	65.1%
0.6	1050	2.94	0.43	109.3%	74.4%
0.65	1194	3.26	0.55	123.1%	84.5%
0.7	1348	3.6	0.69	137.9%	95.3%
0.75	1513	3.95	0.84	153.6%	106.8%

THIS DOES NOT INCLUDE ANY VALUE FROM THE CARDS TAKEN DURING THE DRAFT, which, if you value the cards, is a bit under 3 packs on average (no WC progress).

Bo1 Constructed Event (375 gem entry)

Winrate	Gems	Packs	Play Points	ROI (packs at 200 gems)	ROI (packs at 22.5 gems)
0.4	129	0.65	0.03	68.7%	38.2%
0.45	160	0.81	0.05	85.9%	47.4%
0.5	195	1	0.09	105.6%	58.1%
0.55	235	1.22	0.15	128.0%	70.0%
0.6	278	1.47	0.23	152.7%	83.0%
0.65	323	1.74	0.34	179.0%	96.5%
0.7	367	2.03	0.46	205.9%	109.9%
0.75	408	2.31	0.6	231.8%	122.7%

Bo3 Constructed Event (750 gem entry)

Winrate	Gems	Packs	Play Points	ROI (packs at 200 gems)	ROI (packs at 22.5 gems)
0.4	292	1.67	0.04	83.5%	43.9%
0.45	348	1.76	0.07	93.3%	51.6%
0.5	408	1.84	0.125	103.5%	59.9%
0.55	471	1.92	0.2	113.9%	68.5%
0.6	535	1.99	0.31	124.4%	77.3%
0.65	600	2.06	0.464	135.0%	86.2%
0.7	664	2.14	0.67	145.6%	95.0%
0.75	727	2.22	0.95	156.1%	103.5%
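For reference, the two ROI columns in the tables above can be recomputed from the gem and pack columns. A small sketch using the 50% winrate rows; the ROI figures appear to leave play points out, so this does too, and because the inputs are the rounded table values the results can be off by a few tenths of a percent.

```python
def roi(gems: float, packs: float, entry_gems: float, gems_per_pack: float) -> float:
    """Return on a gem entry, valuing each pack at gems_per_pack."""
    return (gems + packs * gems_per_pack) / entry_gems


# (event, entry fee in gems, expected gems, expected packs) at a 50% winrate
rows_at_50_percent = [
    ("Traditional Draft", 1500, 794, 2.38),
    ("Bo1 Constructed", 375, 195, 1.0),
    ("Bo3 Constructed", 750, 408, 1.84),
]
for name, entry, gems, packs in rows_at_50_percent:
    at_store_price = roi(gems, packs, entry, 200)
    at_dupe_value = roi(gems, packs, entry, 22.5)
    print(f"{name}: {at_store_price:.1%} / {at_dupe_value:.1%}")
```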

One last rating update

Summary of everything I know about the constructed rating system first. (Edit 6/16/22: Mythic Limited appears to be exactly the same) Details of newly-discovered things below that.

1. Bo1 K in closely matched Mythic-Mythic matches is 20.37. The “K” for Mythic vs. non-Mythic matches is 20.41, and 20.52 for capped Mythic vs. Mythic matches (see #2). These are definitely three different numbers. Go figure.
2. The minimum MMR change for a Bo1 win/loss is +/-5 and the maximum change is +/-15.5
3. All Bo3 K values/minimum changes are exactly double the Bo1 K values.
4. Every number is in the uncanny valley of being very close to a nice round number but not actually being a nice round number (e.g. the 13 below is 13.018).
5. Every match between a Mythic and a Diamond/Plat (and probably true for Gold and lower as well) is rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1 +7.4 points for a win and ~13 points for a loss (double for Bo3)
6. As of November 2021, all draws are +/- 0 MMR
7. Glicko-ness isn’t detectable at Mythic. The precalculated/capped rating changes don’t vary at all based on number of games played, and controlled “competitive” Mythic matches run at exactly the same K at different times/on different accounts.
8. Mythic vs. Mythic matches are zero-sum
9. MMR doesn’t drift over time
10. MMR when entering Mythic is pretty rangebound regardless of how high or low it “should” be. It’s capped on the low end at what I think is 1400 (~78%) and capped on the high end at 1650 (~92%). Pre-December, this range was 1485-1650. January #1500 MMR was 1782.
11. Having an atrocious Mythic MMR in one month gets you easier matchmaking the next month and influences where you rank into Mythic. [Edit: via the Serious Rating, read more here]
12. Conceding during sideboarding in a Bo3 rates the game at its current match score with the Bo1 K value. (concede up 1-0, it’s like you won a Bo1 EVEN THOUGH YOU CONCEDED THE MATCH. Concede 1-1 and it’s a 0-change draw. Concede down 0-1 and lose half the rating points). This is absolutely batshit insane behavior. (edit: finally fixed as of 3/17/2022)
13. There are other implementation issues. If WotC is going to ever take organized play seriously again with Arena as a part, or if Arena ever wants to take itself seriously at all, somebody should talk to me.

MMR for Mythic vs. Non-Mythic

Every match between a Mythic and a Diamond/Plat (and probably Gold and lower as well) was rated *exactly the same way* regardless of either the Mythic’s MMR or the non-Mythic’s MMR. In Bo1 +7.4 points for a win, ~13 points for a loss, and -5.6 points for a draw**(??!??!). Bo3 values were exactly double that for win/loss.

That’s not something I was expecting to see. Mythic MMR is not a separate thing that’s preserved in any meaningful way between months, so I have no idea what’s going on underneath the hood that led to that being the choice of implementations. The draw value was obviously bugged- it’s *exactly* double the loss it should be, so somebody evidently used W+L instead of (W+L)/2. **The 11/11/2021 update, instead of fixing the root cause of the instant-draw bug, changed all draws to +/-0 MMR and fixed the -5.6 point bug by accident.
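The W+L vs. (W+L)/2 guess is easy to check against the observed numbers. A tiny sketch, with function names of my own invention and the rounded +7.4/-13 values as inputs:

```python
def bugged_draw_change(win_change: float, loss_change: float) -> float:
    """Suspected bug: the draw is rated as win change plus loss change."""
    return win_change + loss_change


def intended_draw_change(win_change: float, loss_change: float) -> float:
    """A draw scored as half a win and half a loss."""
    return (win_change + loss_change) / 2


WIN, LOSS = 7.4, -13.0  # Bo1 Mythic vs. non-Mythic values
print("bugged draw:   %+.1f" % bugged_draw_change(WIN, LOSS))
print("intended draw: %+.1f" % intended_draw_change(WIN, LOSS))
```

The bugged version reproduces the observed -5.6, and the halved version gives the -2.8 that a draw should have cost.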

That led to me slightly overestimating Bo3 K before- I got paired against more Diamonds in Bo3 than Bo1 and lost more points- and to my general confusion about how early-season MMR worked. It now looks like it’s the exact same system the whole month, just with more matches against non-Mythics thrown in early on and rated strangely.

Ranking into Mythic

Edit: Read this

Before it got changed in December, everybody ranked into Mythic between 1650 and 1485 MMR, and it took concerted effort to be anywhere near the bottom of that range. That’s roughly, most months, an MMR that would finish at ~83-93% if no more games were played, and in reality for anybody playing normally 88-93%. The end-of-month #1500 MMR that I’ve observed was in the 1800s before November, ~1797 in November, ~1770 in December, and ~1780 in January. So no matter how well you do pre-Mythic, you have work to do (in effect, grinding up from 92-93%) to finish top-1500.

In December, the floor got reduced to what appears to be ~1400, although the exact value is unattainable because you have to end on a winning streak to get out of Diamond. An account with atrocious MMR in the previous month that also conceded a metric shitton of games at Diamond 4 ranked in at ~1410 under the new system. The ceiling is still 1650. These lower initial Mythic MMRs are almost certainly a big part of the end-of-season #1500 MMR dropping by 20-30 points.

Season Reset

Edit: Read this, section is obsolete now.

MMR is preserved, at least to some degree, and there isn’t a clearly separate Mythic and non-Mythic MMR. I’d done some work to quantify this, but then things changed in December, and I’m unlikely to repeat it. What has changed is that previous-season Mythic MMR seems to have a bigger impact now. I had an account with a trash MMR in January make Mythic in February without losing a single game (finally!), and it still only ranked in at 1481. It would have been near or above 1600 under the old system, and now it’s ranking in below the previous system’s floor.

I hope everybody enjoyed this iteration, and if you’re chasing #1 Mythic this month or in the future, it’s not me. Or if it is me, I won’t be #1 and above 1650.

Monster Exploit(s) Available In M:tG Arena MMR

Edit: FINALLY fixed as of 3/17/2022 (both conceding during sideboarding and losing match due to roping out, and presumably to the chess clock running out as well)

Not just the “concede a lot for easy pairings” idea detailed in Inside the MTG: Arena Rating System, which still works quite well, as another Ivana 49-1 cakewalk from Plat 4 to Mythic the past couple of days would attest, but this time exploits that can be used for the top-1200 race itself.

In Bo3, conceding on the sideboarding screen ends the match *and only considers previous games when rating the match*. Conceding down 0-1 makes you lose half-K instead of full-K (Bo1 K-value is ~1/2 Bo3 K-value). If the matchup is bad, you can use this to cut your losses. Conceding at 1-1 treats the match as a (half-K) draw- literally adding a draw to your logfile stats as well as rating the match as a draw. If game 3 is bad for you, you can lock in a draw this way instead of playing it.

If you win game 1 and concede, it gets rated as a half-K match WIN (despite showing a Defeat screen). This means that you can always force a match to play exactly as Bo1 if you want to- half K, 1 game, win or lose- so you don’t have to worry about post-sideboard games, can safely play glass-cannon strategies that get crushed by a lot of decks post-board, etc.- and you still have the option to play on if you like the Game 2 matchup.
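Put as a lookup, the pre-fix behavior described above looks like this. It’s a sketch of the observed outcomes only; the function name and return shape are invented for illustration, and nothing here reflects how the client actually implements it.

```python
def rate_sideboard_concede(my_game_wins, opp_game_wins):
    """Observed rating outcome of conceding a Bo3 on the sideboarding screen
    (behavior before the 3/17/2022 fix).

    Only completed games counted, and the match was rated with the Bo1
    K-value, which is roughly half the Bo3 K-value.
    Returns (rated_result, fraction_of_bo3_k).
    """
    if my_game_wins > opp_game_wins:
        result = "win"    # conceding up 1-0: rated as a win despite the Defeat screen
    elif my_game_wins < opp_game_wins:
        result = "loss"   # conceding down 0-1: lose half-K instead of full-K
    else:
        result = "draw"   # conceding at 1-1: rated and logged as a draw
    return result, 0.5


for score in [(1, 0), (1, 1), (0, 1)]:
    print(score, rate_sideboard_concede(*score))
```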

Draws from the draw bug (failure to connect, match instantly ending in a draw) are also rated as a draw. I believe that’s a new bug from the big update (edit: apparently not, unless it got patched and unpatched since February- see the comment below). It’s rated as a normal draw in Bo1 and a half-K draw in Bo3.

Inside the MTG: Arena Rating System

Edit: A summary, including some updates, is now at https://hareeb.com/2022/02/02/one-last-rating-update/

Note: If you’re playing in numbered Mythic Constructed during the rest of May, and/or you’d like to help me crowdsource enough logfiles to get a full picture of the Rank # – Rating relationship during the last week, please visit https://twitter.com/Hareeb_alSaq/status/1397022404363395079 and DM/share. If I get enough data, I can make a rank-decay curve for every rank at once, among other things.

Brought to you by the all-time undisputed king of the percent gamers

Apologies for the writing- Some parts I’d written before, some I’m just writing now, but there’s a ton to get out, a couple of necessary experiments weren’t performed or finished yet, and I’m sure I’ll find things I could have explained more clearly. The details are also seriously nerdy, so reading all of this definitely isn’t for everybody. Or maybe anybody.

TL;DR

1. There is rating-based pairing in ranked constructed below Mythic (as well as in Mythic).
2. It’s just as exploitable as you should think it is.
3. There is no detectable Glicko-ness to Mythic constructed ratings in the second half of the month. It’s indistinguishable from base-Elo.
   1. Expected win% is constrained to a ~25%-~75% range, regardless of rating difference, for both Bo1 and Bo3. That comes out to around 11% Mythic later in the month.
   2. After convergence, the Bo1 K-value is ~20.5. Bo3 K is ~45.
   3. The minimum change in rating is ~5 points in a Bo1 match and ~10 points in a Bo3 match.
4. Early in the month, the system is more complicated.
5. Performance before Mythic seems to have only slight impact on where you’re initially placed in Mythic.
6. Giving everybody similar initial ratings when they make Mythic leads to issues at the end of the month.
7. The change making Gold +2 per win/-1 per loss likely turbocharged the issues from #6.

It’s well known that the rank decay near the end of the month in Mythic Constructed is incredibly severe. These days, a top-600 rating with 24 hours left is insufficient to finish top-1200, and it’s not just a last-day effect. There’s significant decay in the days leading up to the last day, just not at that level of crazy. The canonical explanations were that people were grinding to mythic at the end of the month and that people were playing more in the last couple of days. While both true, neither seemed sufficient to me to explain that level of decay. Were clones of the top-600 all sitting around waiting until the last day to make Mythic and kick everybody else out? If they were already Mythic and top-1200 talent level, why weren’t they mostly already rated as such? The decay is also much, much worse than it was in late 2019, and those explanations give no real hint as to why.

The only two pieces of information we have been given are that 1) Mythic Percentile is the percentage (Int(Your Rating/#1500 rating)) of the actual internal rating of the #1500 player. This is true. 2) Arena uses a modified Glicko system. Glicko is a modification of the old Elo system. This is, at best, highly misleading. The actual system does multiple things that are not Glicko and does not do at least one thing that is in Glicko.
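As a sketch, the percentile formula in point 1 is just an integer percentage of the #1500 player’s rating. The function name is made up, and using `int()` (truncation toward zero) is an assumption; I don’t know whether the client floors or truncates. The example ratings are illustrative.

```python
def mythic_percentile(rating, rating_1500):
    """Displayed Mythic percentile: your rating as an integer percentage
    of the #1500 player's rating. Truncation is an assumption."""
    return int(100 * rating / rating_1500)


# Illustrative values: a #1500 rating of 1782 and two possible ratings.
print(mythic_percentile(1650, 1782))
print(mythic_percentile(1400, 1782))
```

Because the denominator is the #1500 rating, the same internal rating displays as a lower percentile as the #1500 rating climbs during the month.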

I suspected that WotC might be rigging the rating algorithm as the month progressed, either deliberately increasing variance by raising the K-value of matches or by making each match positive-sum instead of zero-sum (i.e. calculating the correct rating changes, then giving one or both players a small boost to reward playing). Either of these would explain the massive collision of people outside the top-1200, who are playing, into the people inside the top-1200 who are trying to camp on their rating. As it turns out, neither of those appears to be directly true. The rating system seems to be effectively the same throughout the last couple of weeks of the month, at least in Mythic. The explanations for what’s actually going on are more technical, and the next couple of sections are going to be a bit dry. Scroll down- way down- to the Problems section if you want to skip how I wasted too much of my time.

I’ve decided to structure this as a journal-of-my-exploration style post, so it’s clear why it was necessary to do what I was doing if I wanted to get the information that WotC has continually failed to provide for years.

Experiments:

Background

I hoped that the minimum win/loss would be quantized at a useful level once the rating difference got big enough, and if true, it would allow me to probe the algorithm. Thankfully, this guess turned out to be correct. Deranking to absurdly low levels let me run several experiments.

Under the assumption that the #1500 rating does not change wildly over a few hours in the middle of the month when there are well over 1500 players, it’s possible to benchmark a rating without seeing it directly. For instance, a minimum value loss that knocks you from 33% to 32% at time T will leave you with a similar rating, within one minimum loss value, as a 33%-32% loss several hours later. Also, if nothing else is going on, like a baseline drift, the rating value of 0 is equivalent over any timescale within the same season. This sort of benchmarking was used throughout.

Relative win-loss values

Because at very low rating, every win would be a maximum value win and every loss would be a minimum value loss, the ratio I needed to maintain the same percentile would let me calculate the win% used to quantize the minimum loss. As it turned out, it was very close to 3 losses for every 1 win, or a 25%-75% cap, no matter how big the rating difference (at Mythic). This was true for both Bo1 and Bo3, although I didn’t measure Bo3 super-precisely because it’s a royal pain in the ass to win a lot of Bo3s compared to spamming Mono-R in Bo1 on my phone, but I’m not far off whatever it is. My return benchmark was reached at 13 wins and 39 losses, which is 3:1, and I assumed it would be a nice round number. Unfortunately, as I discovered later, it was not *exactly* 3:1, or everybody’s life would have been much easier.
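The logic of that measurement, sketched under the assumption that the system is plain Elo with the expected score pinned at a floor: a max-value win is worth K*(1-floor), a min-value loss costs K*floor, so the breakeven win fraction equals the floor itself. The function names and the K of 20 are illustrative only.

```python
def capped_changes(k, floor=0.25):
    """Rating changes for the underdog when its expected score is pinned at `floor`.
    Returns (max_value_win, min_value_loss)."""
    return k * (1 - floor), -k * floor


def breakeven_win_fraction(wins, losses):
    """Win fraction over a stretch of matches that returned the rating to its benchmark."""
    return wins / (wins + losses)


max_win, min_loss = capped_changes(k=20.0)      # K=20 is an illustrative value
print(max_win / -min_loss)                      # losses needed to offset one win
print(13 * max_win + 39 * min_loss)             # net change over 13 wins, 39 losses
print(breakeven_win_fraction(13, 39))           # implied floor on expected score
```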

Relative Bo1-Bo3 K values

Bo3 has about 2.2 times the K value of Bo1. By measuring how many min-loss matches I had to concede in each mode to drop the same percentage, it was clear that the Bo3 K-value was a little over double the Bo1 K-value. In a separate experiment, losing 2-0 or 2-1 in Bo3 made no difference (as expected, but no reason not to test it). Furthermore, being lower rated and having lost the last match (or N matches) had no effect on the coin toss in Bo3. Again, it shouldn’t have, but that was an easy test.

Elo value of a percentage point

This is not a constant value throughout the month because the rating of the #1500 player increases through the month, but it’s possible to get an approximate snapshot value of it. Measuring this, the first way I did it, was much more difficult because it required playing matches inside the 25%-75% range, and that comes with a repeated source of error. If you go 1-1 against players with mirrored percentile differences, those matches appear to offset, except because the ratings are only reported as integers, it’s possible that you went 1-1 against players who were on average 0.7% below you (meaning that 1-1 is below expectation) or vice versa. The SD of the noise term from offsetting matches would keep growing and my benchmark would be less and less accurate the more that happened.

I avoided that by conceding every match that was plausibly in the 25-75% range and only playing to beat much higher rated players (or much lower rated, but I never got one, if one even existed). Max-value wins have no error term, so the unavoidable aggregate uncertainty was kept as small as possible. Using the standard Elo formula value of 400 (who knows what it is internally, but Elo is scale-invariant), the 25%-75% cap is reached at a 191-point difference, and by solving for how many points/% returned my variable-value losses to the benchmark where I started, I got a value of 17.3 pts/% on 2/16 for Bo1.
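For reference, here is where the 191-point figure comes from, as a sketch using the standard Elo expected-score formula on the 400-point scale (the helper names are mine). Dividing by the 17.3 pts/% measured above converts the cap into Mythic percentage points.

```python
import math


def elo_expected(diff, scale=400.0):
    """Standard Elo expected score for a player rated `diff` points above the opponent."""
    return 1 / (1 + 10 ** (-diff / scale))


def diff_at_expected(p, scale=400.0):
    """Rating difference at which the expected score equals p (inverse of elo_expected)."""
    return scale * math.log10(p / (1 - p))


cap_points = diff_at_expected(0.75)   # rating gap where the 75% cap kicks in
cap_percent = cap_points / 17.3       # same gap in Mythic %, using the 2/16 measurement
print(round(cap_points, 2), round(cap_percent, 2))
```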

I did a similar experiment for Bo3 to see if the 25%-75% threshold kicked in at the same rating difference (basically if Bo3 used a number bigger than 400). Gathering data was much more time-consuming this way, and I couldn’t measure with nearly the same precision, but I got enough data to where I could exclude much higher values. It’s quite unlikely that the value could have been above 550, and it was exactly consistent with 400, and it’s unlikely that they would have bothered to make a change smaller than that, so the Bo3 value is presumably just 400 as well.

This came out to a difference of around 11% mythic being the 25-75% cap for Bo1 and Bo3, and combined with earlier deranking experiments, a K-value likely between 20-24 for Bo1 and 40-48 for Bo3. Similar experiments on 2/24 gave similar numbers. I thought I’d solved the puzzle in February. Despite having the cutoffs incorrect, I still came pretty close to the right answer here.

Initial Mythic Rating/Rating-based pairing

My main account made Mythic on 3/1 with a 65-70% winrate in Diamond. I made two burners for March, played them normally through Plat, and then diverged in Diamond. Burner #1 played through diamond normally (42-22 in diamond, 65-9 before that). Burner #2 conceded hundreds of matches at diamond 4 before trying to win, then went something like 27-3 playing against almost none of the same decks- almost entirely against labors of jank love, upgraded precons, and total nonsense. The two burners made Mythic within minutes of each other. Burner #1 started at 90%. Burner #2 started at 86%. My main account was 89% at that point (I’d accidentally played and lost one match in ranked because the dogshit client reverted preferences during an update and stuck me in ranked instead of the play queue when I was trying to get my 4 daily wins). I have no idea what the Mythic seeding algorithm is, but there was minimal difference between solid performance and intentionally being as bad as possible.

It’s also difficult to overstate the difference in opponent difficulty that rating-based pairing presents. A trash rating carries over from month to month, so being a horrendous Mythic means you get easy matches after the reset, and conceding a lot of matches at any level gives you an easy path to Mythic (conceding in Gold 4 still gets you easy matches in Diamond, etc)

Lack of Glicko-ness

In Glicko, rating deviation (a higher rating deviation leads to a higher “K-value”) is supposed to decrease with number of games played and increase with inactivity. My main account and the two burners from above should have produced different behavior. The main account had craploads of games played lifetime, a near-minimum to reach Mythic in the current season, and had been idle in ranked for over 3 weeks with the exception of that 1 mistake game. Burner #1 had played a near-minimum number of games to reach Mythic (season and lifetime) and was currently active. Burner #2 had played hundreds more games (like 3x as many as Burner #1) and was also currently active.

My plan was to concede bunches of matches on each account and see how much curvature there was in the graph of Mythic % vs. expected points lost (using the 25-75 cap and the 11% approximation) and how different it was between accounts. Glicko-ness would manifest as a bigger drop in Mythic % earlier for the same number of expected points lost because the rating deviation would be higher early in the conceding session. As it turned out, all three accounts just produced straight lines with the same slope (~2.38%/K on 3/25). Games played before Mythic didn’t matter. Games played in Mythic didn’t matter. Inactivity didn’t matter. No Glicko-ness detected.

Lack of (explicit) inactivity penalty

I deranked two accounts to utterly absurd levels and benchmarked them at a 2:1 percentage ratio. They stayed in 2:1 lockstep throughout the month (changes reflecting the increase in the #1500 rating, as expected). I also sat an account just above 0 (within 5 points), and it stayed there for like 2 weeks, and then I lost a game and it dropped below 0, meaning it hadn’t moved any meaningful amount. Not playing appears to do absolutely nothing to rating during the month, and there doesn’t seem to be any kind of baseline drift.

At this point (late March), I believed that the system was probably just Elo (because the Glicko features I should have detected were clearly absent), and that the win:loss ratio was exactly 3:1, because why would it be so close to a round number without being a round number. Assuming that, I’d come up with a way to measure the actual K-value to high precision.

Measuring K-Value more precisely

Given that the system literally never tells you your rating, it may sound impossible to determine a K-value directly, but assuming that we’re on the familiar 400-point scale that Arpad Elo published that’s in common usage (and that competitive MTG used to use when they had such a thing), it actually is possible, albeit barely.

Assume you control the #1500-rated player and the #1501 player, and that #1501 is rated much lower than #1500. #1501 will be displayed as a percentile instead of a ranking. If you call the first percentile displayed you see 1501A, then lose a (minimum value) match with the #1500 player, you’ll get a new percentile displayed, 1501B. Call the #1500’s initial rating X, and the #1501’s rating Y. This gives a solvable system of equations.

Y/X = 1501A and Y/(X-1 min loss) = 1501B.

This gives X and Y in terms of min-losses (e.g. X went from +5.3 minlosses to +4.3 minlosses).
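Solving that pair of equations by hand gives X = B/(B-A) and Y = A*X, with A and B as ratios and everything measured in minlosses. A sketch, with a made-up function name and made-up example ratings; it assumes the displayed percentiles are plain percentages, so they get divided by 100 first.

```python
def solve_benchmark(pct_before, pct_after):
    """Recover X (#1500's rating) and Y (#1501's rating), both in minlosses,
    from #1501's displayed percentile before and after #1500 takes one
    minimum-value loss.

    Solves  Y/X = a  and  Y/(X - 1) = b  for X and Y.
    """
    a = pct_before / 100
    b = pct_after / 100
    x = b / (b - a)
    y = a * x
    return x, y


# Hypothetical example: X = +5.3 minlosses, Y = -2000 minlosses.
true_x, true_y = 5.3, -2000.0
before = 100 * true_y / true_x
after = 100 * true_y / (true_x - 1)
print(solve_benchmark(before, after))
```

With real data the percentiles are integers, which is why Y has to be huge in magnitude and X tiny for the rounding error not to swamp the answer.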

Because 1501A and 1501B are reported as integers, the only way to get that number reported to a useful precision is for Y to be very large in magnitude and X to be very small. And of course getting Y large in magnitude means losing a crapload of matches. Getting X to be very small was accomplished via the log files. The game doesn’t tell you your mythic percentile when you’re top-1500, but the logfile stores your percentage of the lowest-rated Mythic. So the lowest-rated Mythic is 100% in the logfile, but once the lowest-rated Mythic goes negative from losing a lot of matches, every normal Mythic will report a negative percentile. Conceding until the exact match where the percentile flips from -1.0 to 0 puts that account’s rating within 1 minloss of 0. So you have a very big number divided by a very small number, and you get good precision.

Doing a similar thing controlling the #1499, #1500, and #1501 allows benchmarking all 3 accounts in terms of minloss. Playing the #1499 and #1500 against each other then creates a match where you know the initial rating and the final rating of each participant (as a multiple of minloss). Since the win:loss ratio is 3:1, K=4*minloss, and plugging that into the Elo formula gives

RatingChange*minloss= 4*minloss/(1+ 10^(InitialRatingDifference*minloss/400))

and you can solve for minloss, and then for K. As long as nobody randomly makes Mythic right when you’re trying to measure, which would screw everything up and make you wait another month to try again… It also meant that I’d have multiple accounts whose rating in terms of minloss I knew exactly, and by playing them against each other and accounts nowhere close in rating (min losses and max wins), and logging exactly when each account went from positive to negative, I could make sure I had the right K-value.
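The minloss factor cancels on both sides of that equation, so it has a closed-form solution. This sketch assumes exactly what the equation assumes: a 400-point Elo scale, K = 4*minloss, and a match between two accounts close enough in rating that the 25%-75% cap isn’t in play. The names are mine, and the example uses a made-up minloss of 5.0625 (K = 20.25) just to show the round trip.

```python
import math


def winner_gain_in_minlosses(diff_minlosses, minloss, scale=400.0):
    """Winner's rating gain, in minlosses, when the winner started
    `diff_minlosses` above the loser (negative if below). Assumes K = 4 * minloss."""
    return 4 / (1 + 10 ** (diff_minlosses * minloss / scale))


def solve_minloss(gain_minlosses, diff_minlosses, scale=400.0):
    """Invert the formula above: size of one minloss in rating points."""
    return scale * math.log10(4 / gain_minlosses - 1) / diff_minlosses


# Round trip with a made-up minloss value.
gain = winner_gain_in_minlosses(diff_minlosses=10, minloss=5.0625)
minloss = solve_minloss(gain, diff_minlosses=10)
print(minloss, 4 * minloss)   # recovered minloss and the implied K
```

The inversion needs a nonzero rating difference and a measured gain strictly between 1 and 3 minlosses; outside that range the match was capped and carries no information about K.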

That latter part didn’t work. I got a reasonable value out of the first measured match- K of about 20.25- but it was clear that subsequent matches were not behaving exactly as expected, and there was no value of K, and no combination of K and minloss, that would fix things. I couldn’t find a mistake in my match logging, (although I knew better than to completely rule it out), and the only other obvious simple source of error was the 3:1 assumption.

I’d only measured 13 wins offsetting 39 losses, which looked good, but certainly wasn’t a definitive 3.0000:1. So, of course the only way to measure this more precisely was to lose a crapload of games and see exactly how many wins it took to offset them. And that came out to a breakeven win% of 24.32%. And I did it again on bigger samples, and came out with 24.37% and 24.40%, and in absolutely wonderful news, there was no single value that was consistent with all measurements. The breakeven win% in those samples really had slightly increased. FML.

Now that the system clearly wasn’t just Elo, and the breakeven W:L ratio was somehow at least slightly dynamic, I went in for another round of measurements in May. The first thing I noticed was that I got from my initial Mythic seed to a 0 rating MUCH faster than I had when deranking later in the month. And by later in the month, I mean anything after the first day or 2 of the season, not just actually late in the month.

When deranking my reference account (the big negative number I need for precise measurements), the measured number of minlosses was about 1.6 times as high as expected from the number of matches conceded, and I had 4 other accounts hovering around a 0 rating who benchmarked and played each other in the short window of time when I controlled the #1500 player, and all of those measurements were consistent with each other. The calculated reference ratings were different by 1 in the 6th significant digit, so I have confidence in that measurement.

I got a similar K-value as the first time, but I noticed something curious when I was setting up the accounts for measurements. Before, with the breakeven win% at 24.4%, 3 losses and 1 win (against much-higher rated players, i.e. everybody but me) was a slight increase in rating. Early in May, it was a slight *decrease* in rating, so the breakeven win% against the opponents I played was slightly OVER 25%, the first time I’d seen that. And as of a few days ago, it was back to being an increase in rating. I still don’t have a clear explanation for that, although I do have an idea or two.

Once I’d done my measurements and calculations, I had a reference account with a rating equal to a known number of minlosses-at-that-time, and a few other accounts with nothing better to do than to lose games to see how or if the value of a minloss changed over a month. If I started at 0, and took X minlosses, and my reference account was at -Y minlosses, then if the value of a minloss is constant, the Mythic Percentile ratio and X/Y ratio should be the same, which is what I was currently in the process of measuring. And, obviously, measuring that precisely requires.. conceding craploads of games. What I got was consistent with no change, but not to the precision I was aiming for before this all blew up.

So this meant that the rating change from a minloss was not stable throughout the month- it was much higher at the very beginning, as seen from my reference account, but it had probably stabilized- at least for my accounts playing each other- by the time the 1500th Mythic arrived on May 7 or 8. That’s quite strange. Combined with the prior observation where approximately the bare minimum number of games to make mythic did NOT cause an increase in the minloss value, this wasn’t a function of my games played, which were already far above the games played on that account from deranking to 0.

In Glicko, the “K-value” of a match depends on the number of games you’ve played (more=lower, but we know that’s irrelevant after this many games), the inactivity period (more=higher, but also known to be irrelevant here), and the number of games your opponent has played (more=higher, which is EXACTLY BACKWARDS here). So the only Glicko-relevant factor left is behaving strongly in the wrong direction (obviously opponents on May 1 have fewer games played, on average, than opponents on May 22).

So something else is spiking the minloss value at the beginning of the month, and I suspect it’s simply a quickly decaying function of time left/elapsed in the month. Instead of an inactivity term, I suspect WotC just runs a super-high K value/change multiplier/whatever at the start of the month that calms down pretty fast over the first week or so. I had planned to test that by speedrunning a couple of accounts to Mythic at the start of June, deranking them to 0 rating, and then having each account concede some number of games sequentially (Account A scoops a bunch of matches on 6/2, Account B scoops a bunch of matches on 6/3, etc) and then seeing what percentile they ended up at after we got 1500 mythics. Even though they would have lost the same number of matches from 0, I expected to see A with a lower percentile than B, etc, because of that decaying function. Again, something that can only be measured by conceding a bunch of matches, and something in the system completely unrelated to the Glicko they told us they were running. If you’re wondering why it’s taking months to try to figure this stuff out, well, it’s annoying when every other test reveals some new “feature” that there was no reason to suspect existed.

Problems

Rating-based pairing below Mythic is absurdly exploitable and manifestly unfair

I’m not the first person to discover it. I’ve seen a couple of random reddit posts suggesting conceding a bunch of matches at the start of the season, then coasting to Mythic. This advice is clearly correct if you just want to make Mythic. It’s not super-helpful trying to make Mythic on day 1, because there’s not that much nonsense (or really weak players) in Diamond that early, but later in the month, the Play button may as well say Click to Win if you’re decent and your rating is horrible.

When you see somebody post about their total jankfest making Mythic after going 60% in Diamond or something, it’s some amount of luck, but they probably played even worse decks, tanked their rating hard at Diamond 4, and then found something marginally playable and crushed the bottom of the barrel after switching decks. Meanwhile, halfway decent players are preferentially paired against other decent players and don’t get anywhere.

Rating-based pairing might be appropriate at the bottom level of each rank (Diamond 4, Plat 4, etc), just so people can try nonsense in ranked and not get curbstomped all the time, but after that, it should be random same-rank pairing with no regard to rating (using ratings to pair in Draft, to some extent, has valid reasons that don’t exist in Constructed, and the Play Queue is an entirely different animal altogether).

Of course, my “should” is from the perspective of wanting a fairer and unexploitable ladder climb, and WotC’s “should” is from the perspective of making it artificially difficult for more invested players to rank up by giving them tougher pairings (in the same rank), presumably causing them to spend more time and money to make progress in the game.

Bo3 K is WAY too high

Several things should jump out at you if you’re familiar with either Magic or Elo. First, given the same initial participant ratings, winning consecutive Bo1 games rewards fewer points (X + slightly fewer than X) than winning one Bo3 (~2.2X), even though going 2-0 is clearly a more convincing result. There’s no rating-accuracy justification whatsoever for Bo3 being double the K value of Bo1. 1.25x or 1.33x might be reasonable, although the right multiplier could be even lower than that. Second, while a K-value of 20.5 might be a bit on the aggressive side for Bo1 among well-established players (chess, sports), ~45 for a Bo3 is absolutely batshit.
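To put numbers on the first point, here’s a sketch using plain Elo with the measured K-values (20.5 for Bo1, 45 for Bo3). The 1700 starting rating is arbitrary, and the second Bo1 win is assumed to come against a fresh opponent who is also rated 1700.

```python
def elo_gain(winner, loser, k, scale=400.0):
    """Points the winner gains under standard Elo."""
    return k / (1 + 10 ** ((winner - loser) / scale))


K_BO1, K_BO3 = 20.5, 45.0
me, opp = 1700.0, 1700.0                      # arbitrary equal starting ratings

first = elo_gain(me, opp, K_BO1)              # X
second = elo_gain(me + first, opp, K_BO1)     # slightly fewer than X
two_bo1_wins = first + second
one_bo3_win = elo_gain(me, opp, K_BO3)        # ~2.2X

print(round(first, 2), round(second, 2), round(two_bo1_wins, 2), round(one_bo3_win, 2))
```

Two straight Bo1 wins come out a couple of points behind a single Bo3 win, even though the Bo3 could have been a 2-1.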

Back when WotC used Elo for organized play, random events had K values of 16, PTQs used 32, and Worlds/Pro Tours used 48. All for one Bo3. The current implementation on Arena is using ~20.5 for Bo1 and a near-pro-tour K-value for one random Bo3 ladder match. Yeah.

The ~75%-25% cap is far too narrow

While not many people have overall 75% winrates in Mythic, it seems utterly implausible, both from personal experience and from things like the MtG Elo Project, that when strong players play weaker players, the aggregate matchup isn’t more lopsided than that. After conceding bunches of games at Plat 4 to get a low rating, my last three accounts went 51-3, 49-1, 48-2 to reach Mythic from Plat 4. When doing my massive “measure the W:L ratio” experiment last month, I was just over 87% winrate (in almost 750 matches) in Mythic when trying to win, and that’s in Bo1, mostly on my phone while multitasking, and I’m hardly the second coming of Finkel, Kai, or PVDDR (and I didn’t “cheat” and concede garbage and play good starting hands- I was either playing to win every game or to snap-concede every game). Furthermore, having almost the same ~75%-25% cap for both Bo1 and Bo3 is self-evidently nonsense when the cap is possibly in play.

The Elo formula is supposed to ensure that any two swaths of players are going to be close to equilibrium at any given time, with minimal average point flow if they keep getting paired against each other, but with WotC’s truncated implementation, when one group actually beats another more than 75% of the time, and keeps getting rewarded as though they were only supposed to win 75%, the good players farm (expected) points off the weaker players every time they’re paired up. I reached out to the makers of several trackers to try to get a large sample of the actual results when two mythic %s played each other, but the only one who responded didn’t have the data. I can certainly believe that Magic needs something that declines in a less extreme fashion than the Elo curve for large rating differences, but a 75%-25% cap is nowhere close to the correct answer.
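The size of that point flow is easy to sketch: if the system rates you as a 75% favorite but you actually win more often than that, the expected gain per match is K times the gap. The function name and the 80% true winrate are illustrative assumptions; K=20.5 is the measured Bo1 value.

```python
def expected_gain_per_match(true_win_rate, k, rated_win_prob=0.75):
    """Average rating change per match for a favorite whose expected score
    is capped at `rated_win_prob` but who really wins `true_win_rate` of the time."""
    gain_on_win = k * (1 - rated_win_prob)
    change_on_loss = -k * rated_win_prob
    return true_win_rate * gain_on_win + (1 - true_win_rate) * change_on_loss


print(expected_gain_per_match(0.75, k=20.5))   # at the cap: no net flow
print(expected_gain_per_match(0.80, k=20.5))   # above the cap: free points every match
```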

An Overlooked Change

With the Ikoria release in April 2020, Gold was changed to be 2 pips of progress per win instead of 1, making it like Silver. This had the obvious effect of letting weak/new players make Platinum while before they got stuck in Gold. I suspected that this may have allowed a bunch more weaker players to make it to Mythic late in the month, and this looks extremely likely to be correct.

I obviously don’t have population data for each rank, but since Mythic resets to Plat, I created a toy model of 30k Plats ~N(1600,85), 90k Golds ~N(1450,85), 150k Silvers ~N(1300,85) constant talent level, started each player at rating=talent, and simulated what happened as it got back to 30k people in Mythic. In each “iteration”, people played random Bo1 K=22 matches in the same rank, and Diamonds played 4 matches, Plats 3, Golds/Silvers 2 per iteration. None of these are going to be exact obviously, but the basic conclusions below are robust over huge ranges of possibly reasonable assumptions.

As anybody should expect, the players who make Mythic later in the month are much weaker on average than the ones who make it early. In the toy model, the average Mythic talent was 1622, the first 20% to make Mythic are over 1700 talent on average (and almost nobody got stuck in Gold). The last 20% are about 1560. The cutoff for the top-10% talentwise (Rank 3000 out of 30000) is about 1790. You may be able to see where this is going.

I reran the simulation using two different parameters- first, I made Gold the way it used to be- 1 pip per win and per loss. About 40% of people got stuck in Gold in this simulation, and the average Mythic player was MUCH stronger- 1695 vs 1622. There were also under 1/3 as many, 8800 vs 30,000 (running for the same number of iterations). The late-month Mythics are obviously still weaker here, but 1650 here on average instead of 1560. That’s a huge difference.

I also ran a model where Silver/Gold populations were 1/4 of their starting size (representing lots of people making Plat since it’s easy and then quitting before they play against those in the higher ranks). That’s 30k starting in Plat and 60k starting below Plat who continue to play in Plat, which seems like a quite conservative ratio to me. This came out roughly in the middle of the previous two. The average Mythic was 1660 and the late-season Mythics were around 1607 on average. It doesn’t require an overwhelming infusion into Plat to create a big effect on who makes it to Mythic late in the month.

Influx of Players and Overrated Players

The first part is obvious from the previous paragraph. A lot more people in Mythic is going to push the #1500 rating higher by variance alone, even if the newbies mostly aren’t that good.

Because WotC doesn’t use anything like a provisional rating, where a Mythic rating is based on the first X number of games at Mythic, and instead seems to give everybody fairly similar ratings throughout the month when they first make Mythic, the players who make it late in the month are MASSIVELY overrated relative to the existing population, on the order of 100+ Elo. Treating early-season Mythics and late-season Mythics as separate populations, when two players from the same group play each other, the group keeps the same average rating. When cross-group play happens, the early-season Mythics farm the hell out of the late-season Mythics (because they’re weaker, but rated the same) until a new equilibrium is reached. And with lots more (weaker) players making Mythic because of the change to Gold, there’s a lot of farming to be done.
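To put a rough number on the farming, here is a minimal sketch using the standard Elo expected-score formula. It assumes talent sits on the same Elo scale as rating (as in the toy model), plugs in the toy model's 1700 and 1560 talent averages and the Bo1 K=22, and ignores the ~75-25 cap. The function names are mine for illustration, not anything WotC has published.

```python
def elo_expected(rating_diff):
    """Standard Elo expected score for a player who is rating_diff points ahead."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))


def expected_gain_per_match(talent_gap, rating_gap=0.0, k=22):
    """Average rating points the stronger player collects per match.

    talent_gap: true strength difference (assumed to be on the Elo scale).
    rating_gap: difference in displayed ratings (0 = seeded the same).
    k: K-factor; 22 is the Bo1 value used in the toy model.
    """
    true_win = elo_expected(talent_gap)   # how often the stronger player really wins
    priced_in = elo_expected(rating_gap)  # how often the ratings say they should win
    return k * (true_win - priced_in)


# Early-season Mythic (~1700 talent) vs late-season Mythic (~1560 talent), equal ratings
gain = expected_gain_per_match(talent_gap=1700 - 1560)
print(f"{gain:.1f} rating points per match on average")
```

Under those assumptions that works out to roughly 4 points per match handed from the late arrival to the early one, every match, until the ratings catch up.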

This effectively makes playing late in the month positive-sum for good players because there are tons of new fish to farm showing up every day. It also indirectly punishes people who camp at the end of the month because they can’t collect the free points if they aren’t playing. This was likely always a significant cause of rank decay, but the easier path to Mythic gives a clear explanation of why rank decay is so much more severe now than it was pre-Ikoria: more players and lots more fish. The influx of weak players also means more people in the queue for good players to 75-25 farm, even after equilibration, but I expect that effect is smaller than the direct point donation.

New-player ratings are a solved problem in chess and were implemented in a proper Glicko framework in the mid-90s. WotC used the dumb implementation, “everybody starts at 1600”, for competitive paper magic back in the day, and that had the exact same problem then as their Mythic seeding procedure does now- people late to the party are weaker than average, by a lot, and while their MTG:A implementation added a fancy wrapper, it still appears to be making the same fundamental mistake that they made 25 years ago.

This is a graph of the #1500 rating in April as the month progressed. I got it from my reference account’s percentile changing (with a constant actual rating) as the month progressed.

The part on the left is when there are barely more than 1500 people in Mythic at all, and on the right is the late-month rating inflation. Top-1200 inflation was likely even worse (it was in January at least). The middle section of approximately a straight line is more interesting than it seems. In a normal-ish distribution, once you get out of the stupid area equivalent to the left of this graph, adding more people to the distribution increases the #1500 rating in a highly sub-linear way. To keep a line going, and to actually go above linear near the end, requires some combination of beyond-exponential growth in the Mythic population through the whole month and/or lots of fish-farming by the top end. I have no way to try to measure how much of each without bulk tracker data, but I expect both to matter. And both would be tamped down if Gold were still +1/-1.
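The sub-linear part is easy to sanity-check. This is a minimal sketch, assuming Mythic ratings are exactly normal; the mean of 1650 and SD of 100 are made-up illustrative values, not measurements. It just asks what rating sits at rank 1500 as the population doubles.

```python
from statistics import NormalDist


def cutoff_rating(population, rank=1500, mean=1650.0, sd=100.0):
    """Rating of the player at `rank` in a normally distributed population.

    mean and sd are hypothetical placeholders; population must exceed rank.
    """
    return NormalDist(mean, sd).inv_cdf(1 - rank / population)


populations = [10_000, 20_000, 40_000, 80_000]
cutoffs = [cutoff_rating(n) for n in populations]
# Rating gained at the #1500 spot by each successive doubling of the population
gains = [later - earlier for earlier, later in zip(cutoffs, cutoffs[1:])]
for n, c in zip(populations, cutoffs):
    print(n, round(c))
```

Each doubling adds fewer points than the one before it, so holding a straight line over time takes faster and faster population growth, point donation from the newcomers, or both.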

Conclusions

Cutting way back on rating-based pairing in Constructed would create a much fairer ladder climb before Mythic and take away the easy-mode exploit. Bringing the Bo3 K way down would create a more talent-based distribution at the top of Mythic instead of a giant crapshoot. A better Mythic seeding algorithm would offset the increase in weak players making it late in the month. The ~75-25 cap.. I just don’t even. I’ll leave it to the reader’s imagination as to why their algorithm does what it does and why the details have been kept obfuscated for years now.

P.S. Apologies to anybody who was annoyed by queueing into me. I was hoping a quick free win wouldn’t be that bad. At Bo3 K-values, the rating result of any match is 95% gone inside 50 matches, so conceding to somebody early in the month is completely irrelevant to the final positioning, and due to rating-based pairing, I didn’t get matched very often against real T-1200 contenders later on. Going over 100 games without seeing a single 90% or higher was not strange.
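For anybody wondering where "95% gone inside 50 matches" comes from, here is a rough sketch using a linearized Elo update. It assumes roughly even matchups and an opponent pool big enough to absorb the points. The K of 45 is an illustrative stand-in for a Bo3 K-value and is not taken from anything above; 22 is the Bo1 K from the toy model.

```python
import math


def residual_fraction(k, matches, win_prob=0.5):
    """Fraction of a one-off rating bump still present after `matches` games.

    If a rating is off by d points, the expected correction per match is about
    k * d * ln(10)/400 * p * (1 - p), so the error shrinks geometrically.
    """
    slope = math.log(10) / 400 * win_prob * (1 - win_prob)
    return (1 - k * slope) ** matches


print(f"K=45 after 50 matches: {residual_fraction(45, 50):.1%} left")
print(f"K=22 after 50 matches: {residual_fraction(22, 50):.1%} left")
```

With a K in the 40s, less than 5% of an early concession survives 50 matches, while at the Bo1 K a noticeably bigger chunk would still be hanging around.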

A 16-person format that doesn’t suck

This is in response to the Magic: the Gathering World Championship that just finished, which featured some great Magic played in a highly questionable format. It had three giant flaws:

It buried players far too quickly. Assuming every match was a coinflip, each of the 16 players started with a 6.25% chance to win. Going 2-0 or 2-1 in draft meant you were just over 12% to win and going 0-2 or 1-2 in draft meant you were just under 0.5% (under 1 in 200) to win. Ouch. In turn, this meant we were watching a crapload of low-stakes games and the players involved were just zombies drawing to worse odds than a 1-outer even if they won.
It treated 2-0 and 2-1 match record in pods identically. That’s kind of silly.
The upper bracket was Bo1 matches, with each match worth >$100,000 in equity. The lower bracket was Bo3 matches, with encounters worth 37k (lower round 1), 49k, 73k, and 97k (lower finals). Why were the more important matches more luck-based?


and the generic flaw that the structure just didn’t have a whole lot of play to it. 92% of the equity was accounted for on day 1 by players who already made the upper semis with an average of only 4.75 matches played, and the remaining 12 players were capped at 9 pre-bracket matches with an average of only 6.75 played.

Whatever the format is, it needs to try to accomplish several things at the same time:

Fit in the broadcast window
Pair people with equal stakes in the match (avoid somebody on a bubble playing somebody who’s already locked or can’t make it, etc)
Try not to look like a total luckbox format- it should take work to win AND work to get eliminated
Keep players alive and playing awhile and not just by having them play a bunch of zombie magic with microscopic odds of winning the tournament in the end
Have matches with clear stakes and minimize the number with super-low stakes, AKA be exciting
Reward better records pre-bracket (2-0 is better than 2-1, etc)
Minimize win-order variance, at least before an elimination bracket (4-2 in the M:tG Worlds format could be upper semis (>23% to win) or lower round 1 (<1% to win) depending on result ordering. Yikes.)
Avoid tiebreakers
Matches with more at stake shouldn’t be shorter (e.g. Bo1 vs Bo3) than matches with less at stake.
Be comprehensible

To be clear, there’s no “simple” format that doesn’t fail one of the first 4 rules horribly. Swiss has huge problems with point 2 late in the event, as well as tiebreakers. Round robin is even worse. 16-player double elimination, or structures isomorphic to that (which the M:tG format was), bury early losers far too quickly, plus most of the games are between zombies. Triple elimination (or more) Swiss runs into a hell match that can turn the pairings into nonsense with a bye if it goes the wrong way. Given that nobody could understand this format, even though it was just a dressed-up 16-player double-elim bracket, and any format that doesn’t suck is going to be legitimately more complicated than that, we’re just going to punt on point 10 and settle for anything simpler than the tax code if we can make the rest of it work well. And I think we can.

Hareeb Format for 16 players:

Day 1:

Draft like the opening draft in Worlds (win-2-before-lose-2). The players will be split into 4 four-player pods based on record (2-0, 2-1, 1-2, 0-2).

Each pod plays a win-2-before-lose-2 of constructed. The 4-0 player makes top-8 as the 1-seed. The 0-4 player is eliminated in 16th place.

The two 4-1 players play a qualification match of constructed. The winner makes top-8 as the #2 seed. The two 1-4 players play an elimination match of constructed. The loser is eliminated in 15th place.

This leaves 4 players with a winning record (Group A tomorrow), 4 players with an even record (2-2 or 3-3) (Group B tomorrow), and 4 players with a losing record (Group C tomorrow).
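The Day 1 bookkeeping can be checked mechanically. This is a small sketch (function names are mine) that relies on one fact: a win-2-before-lose-2 pod of four players who all start on the same record always produces one player at +2 wins, one at +2/+1, one at +1/+2, and one at +2 losses, no matter who beats whom.

```python
from collections import Counter


def pod_results(wins, losses):
    """Final records from a win-2-before-lose-2 pod of 4 players starting at wins-losses."""
    return [(wins + 2, losses), (wins + 2, losses + 1),
            (wins + 1, losses + 2), (wins, losses + 2)]


def day_one_groups():
    records = Counter()
    # Four players finish the draft on each record, and each record forms one constructed pod
    for draft_record in pod_results(0, 0):
        records.update(pod_results(*draft_record))

    # Qualification match between the two 4-1s: winner 5-1 (#2 seed), loser 4-2
    records[(4, 1)] -= 2
    records[(5, 1)] += 1
    records[(4, 2)] += 1
    # Elimination match between the two 1-4s: winner 2-4, loser 1-5 (out in 15th)
    records[(1, 4)] -= 2
    records[(2, 4)] += 1
    records[(1, 5)] += 1

    done = {(4, 0), (5, 1), (0, 4), (1, 5)}  # already in the top 8 or eliminated
    groups = {"A": [], "B": [], "C": []}
    for (w, l), count in records.items():
        if count == 0 or (w, l) in done:
            continue
        key = "A" if w > l else "B" if w == l else "C"
        groups[key] += [(w, l)] * count
    return {name: sorted(recs) for name, recs in groups.items()}


groups = day_one_groups()
for name, recs in groups.items():
    print(name, recs)
```

Group A comes out as two 3-2s and two 4-2s, Group B as two 2-2s and two 3-3s, and Group C as two 2-3s and two 2-4s, so every group has exactly 4 players every time.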

Day 2:

Each group plays a win-2-before-lose-2 of constructed, and instead of wall-of-texting the results, it’s easier to show them graphically, which also makes it clear that something is at stake with every match in every group.

[Figure: hareebworldsformat, the Day 2 group and play-in structure]

with the losers of the first round of the lower play-in finishing 11th-12th and the losers of the second round finishing 9th-10th. So now we have a top-8 bracket seeded. The first round of the top-8 bracket should be played on day 2 as well, broadcast willing (2 of the matches are available after the upper play-in while the 7-8 seeds are still being decided, so it’s only extending by ~1 round for 7-8 “rounds” total).

Before continuing, I want to show the possible records of the various seeds. The #1 seed is always 4-0 and the #2 seed is always 5-1. The #3 seed will either be 6-2 or 5-2. The #4 seed will either be 5-2, 6-3, or 7-3. In the exact case of 7-3 vs 5-2, the #4 seed will have a marginally more impressive record, but since the only difference is being on the same side of the bracket as the 4-0 instead of the 5-1, it really doesn’t matter much.

The #5-6 seeds will have a record of 7-4, 6-4, 5-4, or 5-3, a clean break from the possible top 4 records. The #7-8 seeds will have winning or even records and the 9th-10th place finishers will have losing or even records. This is the only meaningful “tiebreak” in the system. Only the players in the last round of the lower play-in can finish pre-bracket play at .500. Ideally, everybody at .500 will either all advance or all be eliminated, or there just won’t be anybody at .500. Less ideally, but still fine, either 2 or 4 players will finish at .500, and the last round of the lower play-in can be paired so that somebody 1 match above .500 is paired against somebody one match below .500. In that case, the player who advances at .500 will have just defeated the eliminated player in the last round. This covers 98% of the possibilities. 2% of the time, exactly 3 players will finish at .500. Two of them will have just played a win-and-in against each other, and the other .500 player will have advanced as a #7-8 seed with a last-round win or been eliminated 9th-10th with a last-round loss.

As far as the top-8 bracket itself, it can go a few ways. It can’t be Bo1 single elim, or somebody could get knocked out of Worlds losing 1 match, which is total BS (point 3), plus any system where somebody can go 4-1 and finish 5th-8th in a 16-player event is automatically a horseshit system. Even 5-2 or 6-3 5th-8th place (Bo3 or Bo5 single elim) is crap, but if we got to 4-3 or 5-4 finishing 7th-8th place, that’s totally fine. It also takes at least 5 losses pre-bracket (or an 0-4 start) to get eliminated there, so it should take some work here too. And we still need to deal with the top-4 having better records than 5-8 without creating a bunch of zombie Magic. There’s a solution that solves all of this reasonably well at the same time IMO.

Hareeb format top-8 Bracket:

Double-elimination bracket
All upper bracket matchups are Bo3 matches
In the upper quarters, the higher-seeded player starts up 1 match
Grand finals are Bo5 matches with the upper-bracket representative starting up 1-0 (same as we just did)
Lower bracket matches before lower finals are Bo1 (necessary for timing unless we truly have all day)
Lower bracket finals can be Bo1 match or Bo3 matches depending on broadcast needs. (Bo1 lower finals is max 11 sequential matches on Sunday, which is the same max we had at Worlds. If there’s time for a potential 13, lower finals should definitely be Bo3 because they’re actually close to as important as upper-bracket matches, unlike the rest of the lower bracket)
The more impressive match record gets the play-draw choice in the first game 1, then if Bo3/5, the loser of the previous match gets the choice in the next game 1. (if tied, head to head record decides the first play, if that’s tied, random)

This keeps the equity a lot more reasonably dispersed (I didn’t try to calculate play advantage throughout the bracket, but it’s fairly minor). This format is a game of accumulating equity throughout the two days instead of 4 players hoarding >92% of it after day 1 and 8 zombies searching for scraps. Making the top 8 as a 5-8 seed is a bit better than the pre-tournament win probability under this format, instead of the 1.95% in the Worlds format.
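The bracket equity can be worked backwards from the grand finals. This is a sketch assuming every match is a coin flip (same as the analysis of the Worlds format), no play/draw edge, and a standard 8-player double-elimination layout: upper-quarters losers meet in lower round 1, upper-semis losers drop into lower round 2, and the upper-finals loser drops into lower finals. The function and key names are mine.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def race_win_prob(need, opp_need):
    """Chance of getting `need` match wins before the opponent gets `opp_need`, all coin flips."""
    if need == 0:
        return Fraction(1)
    if opp_need == 0:
        return Fraction(0)
    return HALF * race_win_prob(need - 1, opp_need) + HALF * race_win_prob(need, opp_need - 1)


def top8_equity():
    gf_upper = race_win_prob(2, 3)        # Bo5 grand finals, upper rep starts up 1-0
    lower_finals = HALF * (1 - gf_upper)  # win lower finals, then win from behind
    lower_r3 = HALF * lower_finals
    lower_r2 = HALF * lower_r3
    lower_r1 = HALF * lower_r2
    upper_finals = HALF * gf_upper + HALF * lower_finals
    upper_semis = HALF * upper_finals + HALF * lower_r2
    p_high = race_win_prob(1, 2)          # Bo3 upper quarters, higher seed starts up 1-0
    return {
        "seed_1_to_4": p_high * upper_semis + (1 - p_high) * lower_r1,
        "seed_5_to_8": (1 - p_high) * upper_semis + p_high * lower_r1,
        "upper_semis": upper_semis,
        "lower_r1": lower_r1,
    }


equity = top8_equity()
for stage, value in equity.items():
    print(f"{stage}: {float(value):.2%}")
```

Under those assumptions a 5-8 seed is worth 37/512, about 7.2%, against the 6.25% everybody starts with, and a 1-4 seed is worth about 17.8%. The lower round 1 value of 5/256 is the 1.95% from the Worlds format, and halving the 5-8 seed value gives the lower play-in round 2 number.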

[Figure: hareebtop8, the top-8 bracket]

As far as win% at previous stages goes,

Day 2 qualification match: 11.60%
Upper play-in: 5.42%
Lower play-in round 2: 3.61%
Lower play-in round 1: 1.81%
Day 2 elimination match: 0.90%

Day 2 Group A: 9.15%
Day 2 Group B: 4.93%
Day 2 Group C: 2.03%

Day 1 qualification match: 13.46%
Day 1 2-0 Pod: 11.32%
Day 1 2-1 Pod: 7.39%
Day 1 1-2 Pod: 4.28%
Day 1 0-2 Pod: 2.00%
Day 1 Elimination match: 1.02%

2-0 in the draft is almost as good as before, but 2-1 and 1-2 are much more modest changes, and going 0-2 preserves far more equity (2% vs <0.5%). Even starting 1-4 in this format has twice as much equity as starting 1-2 in the Worlds format. It’s not an absolutely perfect format or anything- given enough tries, somebody will Javier Dominguez it and win going 14-10 in matches- but the equity changes throughout the stages feel a lot more reasonable here while maintaining perfect stake-parity in matches, and players get to play longer before being eliminated, literally or virtually.

Furthermore, while there’s some zombie-ish Magic in the 0-2 pod and Group C (although still nowhere near as bad as the Worlds format), it’s simultaneous with important matches so coverage isn’t stuck showing it. Saturday was the upper semis (good) and a whole bunch of nonsense zombie matches (bad), because that’s all that was available, but there’s always something meaningful to be showing in this format. It looks like it fits well enough with the broadcast parameters this weekend as well with 7 “rounds” of coverage the first day and 8 the second (or 6 and 9 if that sounds nicer), and a same/similar maximum number of matches on Sunday to what we had for Worlds.

It’s definitely a little more complicated, but it’s massive gains in everything else that matters.

*** The draft can always be paired without rematches. For a pod, group, upper play-in, lower play-in, or loser’s bracket round 1, look at the 3 possible first-round pairings, minimize total times those matchups have been seen, then minimize total times those matchups have been seen in constructed, then choose randomly from whatever’s tied. For assigning 5-6 seeds or 7-8 seeds in the top-8 bracket or pairings in lower play-in round 2 or loser’s round 2, do the same considering the two possible pairings, except for the potential double .500 scenario in lower play-in round 2 which must be paired.

On the London Mulligan

Zvi says ban it, and the pros I’ve seen talking about it lean towards the ban camp, but there are dissenters like BenS. People also almost universally like it in limited. Are they right? Are they highly confused? What’s really going on?

From the baseline of the Paris mulligan (draw 6, draw 5, etc), on a 6-card keep, the Vancouver mulligan adds scry 1 and the London mulligan adds Loot 1 (discard to bottom of library). London is clearly better, but plenty of times you’ll scry an extra land away like you would have with a loot, or the top card will be the one you would loot away anyway and there’s no real difference. Other times you’re stuck with a clearly worse card in hand. It’s better on 6-card keeps, but it’s not OMFG better.

Except that’s not quite the actual procedure.. on the London, you (effectively) loot, THEN you decide whether or not to keep. That lets you make much better decisions, seeing all 7 cards instead of just 6 before deciding, and the difference on a 5-card keep is that Vancouver still just adds scry 1, but London adds Loot 2. That adds up to a HUGE difference in starting hand quality. And you can still go to 4 if your top 7 cards are total ass again. I’d argue that the London is fine at 6 but goes totally bonkers at 5 and lower.
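A crude way to see the "always look at 7" effect is to count how often you find one specific card. This is a minimal sketch, assuming a 60-card deck with 4 copies of the card and a player who ships every hand without it until they hit 5 cards; the deck size, copy count, and policy are all simplifying assumptions. It says nothing about the rest of the hand, which is where the loot matters most.

```python
from math import comb


def p_at_least_one(cards_seen, copies=4, deck=60):
    """Hypergeometric chance of seeing at least one copy among cards_seen cards."""
    return 1 - comb(deck - copies, cards_seen) / comb(deck, cards_seen)


def p_found(cards_seen_per_hand, copies=4, deck=60):
    """Chance of finding the card in at least one of several independent fresh hands."""
    miss = 1.0
    for seen in cards_seen_per_hand:
        miss *= 1 - p_at_least_one(seen, copies, deck)
    return 1 - miss


# Mulligan down to 5 if needed: London looks at 7 every time,
# Paris/Vancouver look at 7, then 6, then 5 before deciding to keep.
london = p_found([7, 7, 7])
paris_or_vancouver = p_found([7, 6, 5])
print(f"London: {london:.1%}  Paris/Vancouver: {paris_or_vancouver:.1%}")
```

London comes out ahead by several percentage points on just finding the card, and on top of that the London 5 is the best 5 of 7 instead of whatever 5 showed up.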

If you have decks that rely on card quantity more than a couple of specific quality cards, going to 5 cards, even best-5-out-of-7, is still a big punishment. That’s most limited decks, where a 90th percentile 5 is going to play out like a 40th percentile 7, or something like that depending on archetype. Barring something absurd like Pack Rat, aggressive mulligans aren’t a strategy. You mulligan dysfunctional hands, not to find great hands. London just lets you be a bit more liberal with the “dysfunctional” label in limited, and it’s generally fine there.

For Eternal formats, where lots of decks are trying to do something powerful and plan B is go to the next game, London rewarded all-in-on-plan-A strategies like Tron, Amulet, and now Whirza (which also just got a decent Plan B-roko). Before rotation, and for most of 2019, it looks to me like Standard was a lot closer to Limited, at least game 1 in the dark. Aggro decks really don’t want to go to 5 (although they’re better at it than the rest of these). Esper really doesn’t want to go to 5. Scapeshift really doesn’t want to go to 5. Jeskai really doesn’t want to go to 5. Not that they won’t if their hands are garbage, but their hand quality is far more smoothly distributed compared to a Tron deck’s highly polarized Tron-or-not, nuts-or-garbage, and that means keeping more OK hands because the odds of beating it (or beating it by a lot) with fewer cards aren’t as high. Aggro decks need a density of cheap beaters and usually its other flavor (pump in white, burn in red, Obsession/counters in blue, etc). Midrange needs lands and 4-5 drops and something to do before that. Control needs enough answers.

There just aren’t that many good 5-card combinations that cover the bases. Even looking at 5-out-of-7, you’re quite reliant on the top of the deck to keep delivering whatever you’re light on. There wasn’t any way for most of the decks to get powerful nut draws on 5 with any real consistency, even with London, so they couldn’t abuse the 5-card hand advantage because going to 5 really sucked. Then came Eldraine.. Guess who doesn’t need a 7-card hand to do busted work?

Innkeeper doesn’t get to 5 that often, but any 5 or 6 with him is better than basically any 6 or 7 without, so the idea still applies. Hands with these starts are MUCH stronger than hands without, and because of London and OUaT, they can be found with much more regularity. If you take something like Torbran in mono-R on a 5-carder, WTF are you keeping that doesn’t have to draw near-perfectly to make T4 good? Same with Embercleave in non-adventure Gruul, you can only keep pieces and hope to draw perfectly.

Oko not only has a self-contained nut draw on 5 cards, its backup of T3 Nissa is a hell of a lot easier to assemble on 5 than, say, a useful Torbran or Embercleave hand or a useful Fires or Reclamation hand. Furthermore, thanks to OUaT (and Veil for indirectly keeping G1 interaction in check), it can actually assemble and play a great hand on 5 far too often. Innkeeper can also start going off from a wide range of hands. The ability to go bananas on a reasonable number of 5-card London hands certainly stretches things compared to where they were with Vancouver.

Maybe that will make for playable (albeit different) Eternal formats with a wide variety of decks trying to nut draw each other, kind of like Modern 1-2 years ago before Faithless Looting really broke out, with enough variance in the pairings lottery and sideboard cards that tier 2 and 3 decks can still put up regular results. I have my doubts though- Modern was already collapsing away from that, and reducing the fail rates of the most powerful decks certainly doesn’t seem likely to foster diversity from where I sit- and if there is a direct gain, it’ll be something degenerate that’s now consistent enough to play. Yippee.

It’s possible that some Standards will be okay, but even besides the obvious mistakes in Oko and Veil, this one has some issues. You can’t ever have a cheap build-around unless it’s trivially dealt with by most of the meta (Innkeeper could be if Shock, Disfigure, Glass Casket, etc were big in the meta), in which case why even bother printing it? You can’t have functionally more than 4x 1-cost acceleration without polarizing draws to 3-drop-on-turn-2 (or 5+ drop on turn 3) or garbage. With only one card, and especially one card that might actually die, you can’t deckbuild all-in on it or mulligan to it. With the 8x + OUaT available now, you can and likely should if you’re in that acceleration market at all.

I don’t trust Wizards to not print broken cheap stuff, and they probably don’t even trust themselves at this point, assuming it’s not actually on purpose, which it likely kind of is. I barely mentioned postboard games where draws are naturally more polarized (and that polarization is known during mulligans), which leads to more mulligan death spiral games. Nobody’s freaking out when a draft deck keeps 7 because it keeps plenty of average-ish hands as well as the good ones- you just have to mulligan slightly more aggressively. When Tron or Oko keeps 7, you know damn well you’re in for something busted because they would have shipped all their mediocre hands and you have to mulligan to a hand that can play.. until we get a deck that can actually bluff keep a reasonable-but-not-broken plan B/C sometimes to get free equity off scared mulligans/fearless non-mulligans.

I wish I had a clean answer, but I don’t. If all I were worried about were ladder-type things, I’d say you just get one mulligan, and have it be a London plus a scry, or even look at 8 and bottom 2, and you’re stuck with it. If your hand is nonfunctional, then you just lose super-fast and go to the next game or match, no big deal. That’s a lot of feels bad on a tournament schedule though where you lost and didn’t even get the illusion of playing a game and you’re doing nothing but moping for the next 30-40 minutes, and a lot of people aren’t even playing Magic in the way pros and grinders do.

To use a slightly crude analogy, they approach Magic like two guys who are too fat to reach their own cocks and agree to lay side-by-side and jerk each other off. Some like to show off, a few like to watch, but it’s mainly about experiencing their dick, er, deck, doing what it’s built to do, and they can’t just play with themselves. For those people, the London mulligan is like free Viagra making sure their deck is always ready to perform, so they absolutely love it, and that player type is approximately infinity times more common than the hardcore spikes who can enjoy a good struggle with a semi…functional hand.

For those reasons, I think we’re stuck with it, for better or for worse, and the best we can hope for is that WotC is cognizant of not allowing anything in Standard to do broken things on 5 with any frequency and banning ASAP when something gets through.