A 16-person format that doesn’t suck

This is in response to the Magic: the Gathering World Championship that just finished, which featured some great Magic played in a highly questionable format.  It had three giant flaws:

  1. It buried players far too quickly.  Assuming every match was a coinflip, each of the 16 players started with a 6.25% chance to win.  Going 2-0 or 2-1 in draft meant you were just over 12% to win and going 0-2 or 1-2 in draft meant you were just under 0.5% (under 1 in 200) to win.  Ouch.  In turn, this meant we were watching a crapload of low-stakes games and the players involved were just zombies drawing to worse odds than a 1-outer even if they won.
  2. It treated 2-0 and 2-1 match record in pods identically.  That’s kind of silly.
  3. The upper bracket was Bo1 matches, with each match worth >$100,000 in equity.  The lower bracket was Bo3 matches, with encounters worth $37k (lower round 1), $49k, $73k, and $97k (lower finals).  Why were the more important matches more luck-based?


and the generic flaw that the structure just didn’t have a whole lot of play to it.  After day 1, 92% of the equity belonged to the four players who had already made the upper semis, having played an average of only 4.75 matches, and the remaining 12 players were capped at 9 pre-bracket matches with an average of only 6.75 played.

Whatever the format is, it needs to try to accomplish several things at the same time:

  1. Fit in the broadcast window
  2. Pair people with equal stakes in the match (avoid somebody on a bubble playing somebody who’s already locked or can’t make it, etc)
  3. Try not to look like a total luckbox format- it should take work to win AND work to get eliminated
  4. Keep players alive and playing awhile and not just by having them play a bunch of zombie magic with microscopic odds of winning the tournament in the end
  5. Have matches with clear stakes and minimize the number with super-low stakes, AKA be exciting
  6. Reward better records pre-bracket (2-0 is better than 2-1, etc)
  7. Minimize win-order variance, at least before an elimination bracket (4-2 in the M:tG Worlds format could be upper semis (>23% to win) or lower round 1 (<1% to win) depending on result ordering).  Yikes.
  8. Avoid tiebreakers
  9. Matches with more at stake shouldn’t be shorter (e.g. Bo1 vs Bo3) than matches with less at stake.
  10. Be comprehensible


To be clear, there’s no “simple” format that doesn’t fail one of the first 4 rules horribly. Swiss has huge problems with point 2 late in the event, as well as tiebreakers.  Round robin is even worse.  16-player double elimination, or structures isomorphic to that (which the M:tG format was), bury early losers far too quickly, plus most of the games are between zombies.  Triple elimination (or more) Swiss runs into a hell match that can turn the pairings into nonsense with a bye if it goes the wrong way.  Given that nobody could understand this format, even though it was just a dressed-up 16-player double-elim bracket, and any format that doesn’t suck is going to be legitimately more complicated than that, we’re just going to punt on point 10 and settle for anything simpler than the tax code if we can make the rest of it work well.  And I think we can.

Hareeb Format for 16 players:

Day 1:

Draft like the opening draft in Worlds (win-2-before-lose-2).  The players will be split into 4 four-player pods based on record (2-0, 2-1, 1-2, 0-2).

Each pod plays a win-2-before-lose-2 of constructed.  The 4-0 player makes top-8 as the 1-seed.  The 0-4 player is eliminated in 16th place.

The two 4-1 players play a qualification match of constructed.  The winner makes top-8 as the #2 seed.  The two 1-4 players play an elimination match of constructed.  The loser is eliminated in 15th place.

This leaves 4 players with a winning record (Group A tomorrow), 4 players with an even record (2-2 or 3-3) (Group B tomorrow), and 4 players with a losing record (Group C tomorrow).

Day 2:

Each group plays a win-2-before-lose-2 of constructed, and instead of wall-of-texting the results, it’s easier to see the results graphically, and to see that something is at stake with every match in every group.

[figure: hareebworldsformat (day 2 groups and play-in structure)]


The losers of the first round of the lower play-in finish 11th-12th and the losers of the second round finish 9th-10th.  So now we have a seeded top-8 bracket.  The first round of the top-8 bracket should be played on day 2 as well, broadcast willing (2 of the matches are available after the upper play-in while the 7-8 seeds are still being decided, so it only extends the day by ~1 round, for 7-8 “rounds” total).


Before continuing, I want to show the possible records of the various seeds.  The #1 seed is always 4-0 and the #2 seed is always 5-1.  The #3 seed will either be 6-2 or 5-2.  The #4 seed will either be 5-2, 6-3, or 7-3.  In the exact case of 7-3 vs 5-2, the #4 seed will have a marginally more impressive record, but since the only difference is being on the same side of the bracket as the 4-0 instead of the 5-1, it really doesn’t matter much.

The #5-6 seeds will have records of 7-4, 6-4, 5-4, or 5-3, a clean break from the possible top-4 records.  The #7-8 seeds will have winning or even records and the 9th-10th place finishers will have losing or even records. This is the only meaningful “tiebreak” in the system.  Only the players in the last round of the lower play-in can finish pre-bracket play at .500.  Ideally, everybody at .500 will either all advance or all be eliminated, or there just won’t be anybody at .500.  Less ideally, but still fine, either 2 or 4 players will finish at .500, and the last round of the lower play-in can be paired so that somebody one match above .500 is paired against somebody one match below .500.  In that case, the player who advances at .500 will have just defeated the eliminated player in the last round.  This covers 98% of the possibilities.  2% of the time, exactly 3 players will finish at .500.  Two of them will have just played a win-and-in against each other, and the other .500 player will have advanced as a #7-8 seed with a last-round win or been eliminated 9th-10th with a last-round loss.

As far as the top-8 bracket itself, it can go a few ways.  It can’t be Bo1 single elim, or somebody could get knocked out of Worlds by losing 1 match, which is total BS (point 3); plus, any system where you can go 4-1 and finish 5th-8th in a 16-player event is automatically a horseshit system.  Even 5-2 or 6-3 finishing 5th-8th (Bo3 or Bo5 single elim) is crap, but if we get to 4-3 or 5-4 finishing 7th-8th, that’s totally fine.  It takes at least 5 losses pre-bracket (or an 0-4 start) to get eliminated there, so it should take some work in the bracket too.  And we still need to deal with the top 4 having better records than 5-8 without creating a bunch of zombie Magic.  There’s a solution that solves all of this reasonably well at the same time, IMO.

Hareeb format top-8 Bracket:

  1. Double-elimination bracket
  2. All upper bracket matchups are Bo3 matches
  3. In the upper quarters, the higher-seeded player starts up 1 match
  4. Grand finals are Bo5 matches with the upper-bracket representative starting up 1-0 (same as we just did)
  5. Lower bracket matches before lower finals are Bo1 (necessary for timing unless we truly have all day)
  6. Lower bracket finals can be Bo1 match or Bo3 matches depending on broadcast needs.  (Bo1 lower finals is max 11 sequential matches on Sunday, which is the same max we had at Worlds.  If there’s time for a potential 13, lower finals should definitely be Bo3 because they’re actually close to as important as upper-bracket matches, unlike the rest of the lower bracket)
  7. The player with the more impressive match record gets the play-draw choice in the first game 1; then, in a Bo3/Bo5, the loser of the previous match gets the choice in the next game 1 (if match records are tied, head-to-head record decides the first choice; if that’s also tied, it’s random).


This keeps the equity a lot more reasonably dispersed (I didn’t try to calculate play advantage throughout the bracket, but it’s fairly minor).  This format is a game of accumulating equity throughout the two days instead of 4 players hoarding >92% of it after day 1 and 8 zombies searching for scraps.  Making the top 8 as a 5-8 seed in this format is worth a bit more than the 6.25% pre-tournament win probability, instead of the 1.95% it was worth in the Worlds format.
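For a rough sense of what the “starts up 1 match” head starts are worth on their own (separate from the on-the-play advantage), here’s a minimal sketch under the same everyone’s-a-coinflip assumption used for the equity numbers above; the function and names are mine, not anything from the actual equity calculation.

```python
# Value of starting a series up one match when every individual match is a coinflip.
# This is just the standard series-race calculation, not the full bracket equity model.
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob(need_a: int, need_b: int, p: float = 0.5) -> float:
    """P(A wins the series) when A needs need_a more match wins, B needs need_b,
    and A wins each individual match with probability p."""
    if need_a == 0:
        return 1.0
    if need_b == 0:
        return 0.0
    return p * series_win_prob(need_a - 1, need_b, p) + (1 - p) * series_win_prob(need_a, need_b - 1, p)

print(series_win_prob(2, 2))  # even Bo3:                    0.5
print(series_win_prob(1, 2))  # Bo3 upper quarters, up 1-0:  0.75
print(series_win_prob(2, 3))  # Bo5 grand finals, up 1-0:    0.6875
```

So between even players, the upper-quarters head start turns a 50/50 into 75/25, and the grand finals head start is roughly 69/31.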

[figure: hareebtop8 (the top-8 bracket)]

As far as win% at previous stages goes (a player’s chance of winning the whole tournament from that stage):

  1. Day 2 qualification match: 11.60%
  2. Upper play-in: 5.42%
  3. Lower play-in round 2: 3.61%
  4. Lower play-in round 1: 1.81%
  5. Day 2 elimination match: 0.90%

  1. Day 2 Group A: 9.15%
  2. Day 2 Group B: 4.93%
  3. Day 2 Group C: 2.03%

  1. Day 1 qualification match: 13.46%
  2. Day 1 2-0 Pod: 11.32%
  3. Day 1 2-1 Pod: 7.39%
  4. Day 1 1-2 Pod: 4.28%
  5. Day 1 0-2 Pod: 2.00%
  6. Day 1 Elimination match: 1.02%

2-0 in the draft is almost as good as before, but 2-1 and 1-2 are much more modest changes, and going 0-2 preserves far more equity (2% vs <0.5%).  Even starting 1-4 in this format has twice as much equity as starting 1-2 in the Worlds format.  It’s not an absolutely perfect format or anything- given enough tries, somebody will Javier Dominguez it and win going 14-10 in matches- but the equity changes throughout the stages feel a lot more reasonable here while maintaining perfect stake-parity in matches, and players get to play longer before being eliminated, literally or virtually.

Furthermore, while there’s some zombie-ish Magic in the 0-2 pod and Group C (although still nowhere near as bad as the Worlds format), it’s simultaneous with important matches, so coverage isn’t stuck showing it.  At Worlds, Saturday was the upper semis (good) and a whole bunch of nonsense zombie matches (bad), because that’s all that was available, but there’s always something meaningful to be showing in this format.  It also looks like it fits well enough with this weekend’s broadcast parameters, with 7 “rounds” of coverage the first day and 8 the second (or 6 and 9 if that sounds nicer), and the same or a similar maximum number of matches on Sunday as we had for Worlds.

It’s definitely a little more complicated, but it’s massive gains in everything else that matters.

*** The draft can always be paired without rematches.  For a pod, group, upper play-in, lower play-in, or loser’s bracket round 1, look at the 3 possible first-round pairings, minimize total times those matchups have been seen, then minimize total times those matchups have been seen in constructed, then choose randomly from whatever’s tied.  For assigning 5-6 seeds or 7-8 seeds in the top-8 bracket or pairings in lower play-in round 2 or loser’s round 2, do the same considering the two possible pairings, except for the potential double .500 scenario in lower play-in round 2 which must be paired.
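For what it’s worth, here’s a minimal sketch of that pairing rule for a single 4-player pod or group, with hypothetical match-history counters standing in for whatever the real bookkeeping would be:

```python
# Among the 3 possible pairings of a 4-player pod, minimize total prior meetings,
# then prior constructed meetings, then choose randomly among whatever's tied.
import random

def pick_pairing(players, times_played, times_played_constructed):
    """players: 4 names.  times_played[frozenset({a, b})] = prior matches between a and b;
    times_played_constructed counts only constructed matches.  Returns a tuple of two matchups."""
    a, b, c, d = players
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        total = sum(times_played.get(frozenset(m), 0) for m in pairing)
        constructed = sum(times_played_constructed.get(frozenset(m), 0) for m in pairing)
        return (total, constructed)

    best = min(cost(o) for o in options)
    return random.choice([o for o in options if cost(o) == best])

# Example: P1 already played P2 in the draft, so the other two pairings are preferred.
history = {frozenset({"P1", "P2"}): 1}
print(pick_pairing(["P1", "P2", "P3", "P4"], history, {}))
```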


Postseason Baseballs and Bauer Units

With the new MLB report on baseball drag leaving far more questions than answers as far as the 2019 postseason goes, I decided to look at the mystery from a different angle.  My hypothesis was a temporary storage issue with the baseballs, and it’s not clear, without knowing more about the methodology of the MLB testing, whether or not that’s still plausible.  But what I did find in pursuit of that was absolutely fascinating.

I started off by looking at the Bauer units (RPM/MPH) on FFs thrown by starting pitchers (only games with >10 FF used) and comparing to the seasonal average of Bauer units in their starts.  Let’s just say the results stand out.  This is excess Bauer units by season week, with the postseason all grouped into week 28 each year (2015 week 27 didn’t exist and is a 0).

[figure: excessbauerunits (excess Bauer units by season week)]

And the postseasons are clearly different, with the wildcard and divisional rounds in 2019 clocking in at 0.64 excess Bauer units before coming back down closer to normal.  For the 2019 postseason as a whole, +1.7% in Bauer units.  Since Bauer units are RPM/MPH, it was natural to see what drove the increase.  In the 2019 postseason, vFF was up slightly, 0.4mph over seasonal average, which isn’t unexpected given shorter hooks and important games, but the spin was up +2.2% or almost 50 RPM.  That’s insane.  Park effects on Bauer units are small, and TB was actually the biggest depressor in 2019.
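For concreteness, this is roughly the calculation, sketched against a hypothetical per-pitch table with Statcast-style columns (pitcher, season, game_date, release_spin_rate, release_speed); the real study’s plumbing obviously differs.

```python
# Excess Bauer units: per-start average of spin/velocity on four-seamers, minus that
# pitcher's seasonal average over his qualifying starts (>10 FF in the game, as above).
import pandas as pd

def excess_bauer_units(ff: pd.DataFrame, min_ff: int = 11) -> pd.DataFrame:
    ff = ff.assign(bauer=ff["release_spin_rate"] / ff["release_speed"])
    starts = (ff.groupby(["pitcher", "season", "game_date"])
                .agg(bauer=("bauer", "mean"), n_ff=("bauer", "size"))
                .reset_index())
    starts = starts[starts["n_ff"] >= min_ff].copy()
    starts["excess_bauer"] = starts["bauer"] - starts.groupby(["pitcher", "season"])["bauer"].transform("mean")
    return starts
```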

In regular season games, variation in Bauer units from seasonal average was almost *entirely* spin-driven.  The correlation to spin was 0.93 and to velocity was -0.04.  Some days pitchers can spin the ball, and some days they can’t.  I couldn’t be much further from a professional baseball pitcher, but it doesn’t make any sense to me that that’s a physical talent that pitchers leave dormant until the postseason.. well, 3 out of 5 postseasons.  Spin rate is normally only very slightly lower late in games, so it’s not an effect of cutting out the last inning of a start or anything like that.

Assuming pitchers are throwing fastballs similarly, which vFF seems to indicate, what could cause the ball to spin more?  One answer is sticky stuff on the ball, as Bauer himself knows, but it’s odd to only dial it up in the postseason- all of the numbers are comparisons to the pitcher’s own seasonal averages. Crap on the ball may increase the drag coefficient (AFAIK), so it’s not a crazy explanation.

Another one is that the balls are simply smaller.  It’s easier to spin a slightly smaller ball because the same contact path along the hand is more revolutions and imparts more angular velocity to the ball.  Likewise, it should be slightly easier to spin a lighter ball because it takes less torque to accelerate it to the same angular velocity.  Neither one of these would change the ACTUAL drag coefficient, but they would both APPEAR to have a higher drag coefficient in pitch trajectory (and ball flight) measurement and fly poorly off the bat.  Taking a same-size-but-less-dense ball, the drag force on a pitch would be (almost) the same, but since F=ma, and m is smaller, the measured drag acceleration would be higher, and since the calculations don’t know the ball is lighter, they think that the bigger acceleration actually means a higher drag force and therefore a higher Cd, and it comes out the same in the end for a smaller ball.

Both of those explanations seem plausible as well, not knowing exactly what the testing protocol was with postseason balls, and they could be the result of manufacturing differences (smaller balls) or possibly temporary storage in low humidity (lighter balls).  Personal experiments with baseballs rule both of these in.  Individual balls I’ve measured come with more than enough weight/cross-section variation for a bad batch to put up those results.  The leather is extremely responsive to humidity changes (it can gain more than 100% of its dry weight in high-humidity conditions), and losing maybe 2 grams of moisture from the exterior parts of the ball is enough to spike the imputed drag coefficient the observed amount without changing the CoR much, and that’s well within the range of temporary crappy storage.  It’s possible that they’re both ruled out by the MLB-sponsored analysis, but they didn’t report a detailed enough methodology for me to know.
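To put a toy number on the F=ma point (illustrative masses only, not measurements from anything above):

```python
# A slightly lighter ball decelerates more under the same true drag force, and a tracker
# that back-solves drag assuming the nominal mass inflates the imputed Cd accordingly.
M_NOMINAL = 0.145  # kg, roughly a regulation baseball (assumed round number)
M_ACTUAL = 0.143   # kg, the same ball down ~2 g of exterior moisture (assumed)

# a_actual = F / M_ACTUAL, so F_imputed = M_NOMINAL * a_actual = F * (M_NOMINAL / M_ACTUAL),
# and the imputed drag coefficient is inflated by the same ratio.
inflation = M_NOMINAL / M_ACTUAL
print(f"imputed Cd inflated by {100 * (inflation - 1):.1f}%")  # about 1.4%
```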

The early-season “spike” is also strange.  Pitchers throw FFs a bit more slowly than seasonal average but also with more spin, which makes absolutely no sense without the baseballs being physically different then as well, just to a lesser degree (or another short-lived pine tar outbreak, I guess).

It could be something entirely different from my suggestions, maybe something with the surface of the baseball causing it to be easier to spin and also higher drag, but whatever it is, given the giant spin rate jump at the start of the postseason, and the similarly odd behavior in other postseasons, it seems highly unlikely to me that the imputed drag coefficient spike and the spin rate spike don’t share a common cause.  Given that postseason balls presumably don’t follow the same supply chain- they’re processed differently to get the postseason stamp, they’re not sent to all parks, they may follow a different shipping path, and they’re most likely used much sooner after ballpark arrival, on average, than regular-season balls, this strongly suggests manufacture/processing or storage issues to me.

On the London Mulligan

Zvi says ban it, and the pros I’ve seen talking about it lean towards the ban camp, but there are dissenters like BenS.  People also almost universally like it in limited.  Are they right? Are they highly confused?  What’s really going on?

From the baseline of the Paris mulligan (draw 6, draw 5, etc), on a 6-card keep, the Vancouver mulligan adds scry 1 and the London mulligan adds Loot 1 (discard to bottom of library).  London is clearly better, but plenty of times you’ll scry an extra land away like you would have with a loot, or the top card will be the one you would loot away anyway and there’s no real difference.  Other times you’re stuck with a clearly worse card in hand.  It’s better on 6-card keeps, but it’s not OMFG better.

Except that’s not quite the actual procedure.. on the London, you (effectively) loot, THEN you decide whether or not to keep.  That lets you make much better decisions, seeing all 7 cards instead of just 6 before deciding, and the difference on a 5-card keep is that Vancouver still just adds scry 1, but London adds Loot 2.  That adds up to a HUGE difference in starting hand quality.  And you can still go to 4 if your top 7 cards are total ass again.  I’d argue that the London is fine at 6 but goes totally bonkers at 5 and lower.

If you have decks that rely on card quantity more than a couple of specific quality cards, going to 5 cards, even best-5-out-of-7, is still a big punishment.  That’s most limited decks, where a 90th percentile 5 is going to play out like a 40th percentile 7, or something like that depending on archetype.  Barring something absurd like Pack Rat, aggressive mulligans aren’t a strategy.  You mulligan dysfunctional hands, not to find great hands.  London just lets you be a bit more liberal with the “dysfunctional” label in limited, and it’s generally fine there.

For Eternal formats, where lots of decks are trying to do something powerful and plan B is go to the next game, London rewarded all-in-on-plan-A strategies like Tron, Amulet, and now Whirza (which also just got a decent Plan B-roko).  Before rotation, and for most of 2019, it looks to me like Standard was a lot closer to Limited, at least game 1 in the dark.  Aggro decks really don’t want to go to 5 (although they’re better at it than the rest of these).  Esper really doesn’t want to go to 5.  Scapeshift really doesn’t want to go to 5.  Jeskai really doesn’t want to go to 5.  Not that they won’t if their hands are garbage, but their hand quality is far more smoothly distributed than a Tron deck’s highly polarized Tron-or-not, nuts-or-garbage distribution, and that means keeping more OK hands because the odds of beating it (or beating it by a lot) with fewer cards aren’t as high.  Aggro decks need a density of cheap beaters and usually their color’s other ingredient (pump in white, burn in red, Obsession/counters in blue, etc).  Midrange needs lands and 4-5 drops and something to do before that.  Control needs enough answers.

There just aren’t that many good 5-card combinations that cover the bases; even looking at 5-out-of-7, you’re quite reliant on the top of the deck to keep delivering whatever you’re light on.  There wasn’t any way for most of the decks to get powerful nut draws on 5 with any real consistency, even with London, so they couldn’t abuse the 5-card hand advantage because going to 5 really sucked.  Then came Eldraine.. Guess who doesn’t need a 7-card hand to do busted work?


Innkeeper doesn’t get to 5 that often, but any 5 or 6 with him is better than basically any 6 or 7 without, so the idea still applies.  Hands with these starts are MUCH stronger than hands without, and because of London and OUaT, they can be found with much more regularity.  If you take something like Torbran in mono-R on a 5-carder, WTF are you keeping that doesn’t have to draw near-perfectly to make T4 good?  Same with Embercleave in non-adventure Gruul, you can only keep pieces and hope to draw perfectly.

Oko not only has a self-contained nut draw on 5 cards, its backup of T3 Nissa is a hell of a lot easier to assemble on 5 than, say, a useful Torbran or Embercleave hand or a useful Fires or Reclamation hand.  Furthermore, thanks to OUaT (and Veil for indirectly keeping G1 interaction in check), it can actually assemble and play a great hand on 5 far too often.  Innkeeper can also start going off from a wide range of hands.  The ability to go bananas on a reasonable number of 5-card London hands certainly stretches things compared to where they were with Vancouver.
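A minimal sketch of the “found with much more regularity” point, using an 8-copy package (a playset of the key one-drop plus 4 OUaT, my stand-in numbers) and comparing how many cards each mulligan rule actually lets you see:

```python
# P(at least one copy of an 8-of package among the cards you get to look at) in a 60-card deck.
# London to 5 still looks at 7 cards; the old Paris mulligan to 5 looked at only 5.
from math import comb

def p_at_least_one(copies: int, looks: int, deck: int = 60) -> float:
    return 1 - comb(deck - copies, looks) / comb(deck, looks)

print(f"{p_at_least_one(8, 7):.0%} when you see 7 cards (London mull to 5)")
print(f"{p_at_least_one(8, 5):.0%} when you see 5 cards (Paris mull to 5)")
# roughly 65% vs 52%, and London gets another full 7-card look on the mull to 4
```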

Maybe that will make for playable (albeit different) Eternal formats with a wide variety of decks trying to nut draw each other, kind of like Modern 1-2 years ago before Faithless Looting really broke out, with enough variance in the pairings lottery and sideboard cards that tier 2 and 3 decks can still put up regular results.  I have my doubts though- Modern was already collapsing away from that, and reducing the fail rates of the most powerful decks certainly doesn’t seem likely to foster diversity from where I sit- and if there is a direct gain, it’ll be something degenerate that’s now consistent enough to play.  Yippee.

It’s possible that some Standards will be okay, but even besides the obvious mistakes in Oko and Veil, this one has some issues.  You can’t ever have a cheap build-around unless it’s trivially dealt with by most of the meta (Innkeeper could be if Shock, Disfigure, Glass Casket, etc were big in the meta), in which case why even bother printing it?  You can’t have functionally more than 4x 1-cost acceleration without polarizing draws to 3-drop-on-turn-2 (or 5+ drop on turn 3) or garbage.  With only one card, and especially one card that might actually die, you can’t deckbuild all-in on it or mulligan to it.  With the 8x + OuAT available now, you can and likely should if you’re in that acceleration market at all.

I don’t trust Wizards to not print broken cheap stuff, and they probably don’t even trust themselves at this point, assuming it’s not actually on purpose, which it likely kind of is.  I barely mentioned postboard games where draws are naturally more polarized (and that polarization is known during mulligans), which leads to more mulligan death spiral games.  Nobody’s freaking out when a draft deck keeps 7 because it keeps plenty of average-ish hands as well as the good ones- you just have to mulligan slightly more aggressively.  When Tron or Oko keeps 7, you know damn well you’re in for something busted because they would have shipped all their mediocre hands and you have to mulligan to a hand that can play.. until we get a deck that can actually bluff keep a reasonable-but-not-broken plan B/C sometimes to get free equity off scared mulligans/fearless non-mulligans.

I wish I had a clean answer, but I don’t.  If all I were worried about were ladder-type things, I’d say you just get one mulligan, and have it be a London plus a scry, or even look at 8 and bottom 2, and you’re stuck with it.  If your hand is nonfunctional, then you just lose super-fast and go to the next game or match, no big deal.  That’s a lot of feels bad on a tournament schedule though where you lost and didn’t even get the illusion of playing a game and you’re doing nothing but moping for the next 30-40 minutes, and a lot of people aren’t even playing Magic in the way pros and grinders do.

To use a slightly crude analogy, they approach Magic like two guys who are too fat to reach their own cocks and agree to lay side-by-side and jerk each other off.  Some like to show off, a few like to watch, but it’s mainly about experiencing their dick, er, deck, doing what it’s built to do, and they can’t just play with themselves.  For those people, the London mulligan is like free Viagra making sure their deck is always ready to perform, so they absolutely love it, and that player type is approximately infinity times more common than the hardcore spikes who can enjoy a good struggle with a semi…functional hand.

For those reasons, I think we’re stuck with it, for better or for worse, and the best we can hope for is that WotC is cognizant of not allowing anything in Standard to do broken things on 5 with any frequency and banning ASAP when something gets through.

Reliever Sequencing, Real or Not?

I read this first article on reliever sequencing, and it seemed like a reasonable enough hypothesis, that batters would do better seeing pitches come from the same place and do worse seeing them come from somewhere else, but the article didn’t discuss the simplest variable that should have a big impact- does it screw batters up to face a lefty after a righty or does it really not matter much at all?  I don’t have their arm slot data, and I don’t know what their exact methodology was, so I just designed my own little study to measure the handedness switch impact.

Using PAs from 2015-18 where the batter is facing a different pitcher than the previous PA in this game (this excludes the first PA in the game for all batters, of course), I noted the handedness of the pitcher, the stance of the batter, and the standard wOBA result of the PA.  To determine the impact of the handedness switch, I compared pairs of data: (RHB vs RHP where the previous pitcher was a LHP) to (RHB vs RHP where the previous pitcher was a RHP), etc, which also controls for platoon effects without having to try to quantify them everywhere.  The raw data is

Table 1

Bats   Throws   Prev P   wOBA    N
L      L        L        0.302   16162
L      L        R        0.296   54160
L      R        R        0.329   137190
L      R        L        0.333   58959
R      L        L        0.339   19612
R      L        R        0.337   63733
R      R        R        0.315   191871
R      R        L        0.313   82190

which looks fairly minor, and the differences (following same hand – following opposite hand) come out to

Table 2

Bats    Throws   wOBA Diff     SD       Harmonic mean of N
L       L         0.006        0.0045   24895
L       R        -0.0046       0.0025   82474
R       L         0.002        0.0041   29994
R       R         0.002        0.0021   115083
Total   Total     0.000000752           252446

which is in the noise range in every bucket and overall no difference between same and opposite hand as the previous pitcher.  Just in case there was miraculously a player-quality effect exactly offsetting a real handedness effect, for each PA in the 8 groups in table 1, I calculated the overall (all 4 years) batter performance against the pitcher’s handedness and the pitcher’s overall performance against batters of that stance, then compared the quality of the group that followed same-handed pitching to the group that followed opposite-handed pitching.

As it turned out there was an effect… quality effects offset some of the observed differential in 3 of the buckets, and now the difference in every individual bucket is less than 1 SD away from 0.000 while the overall effect is still nonexistent.

Table 3

Bats    Throws   wOBA Diff   Q diff    Adj Diff   SD       Harmonic mean of N
L       L         0.0057      0.0037    0.0020    0.0045   24895
L       R        -0.0046     -0.0038   -0.0008    0.0025   82474
R       L         0.0018     -0.0022    0.0040    0.0041   29994
R       R         0.0016      0.0033   -0.0017    0.0021   115083
Total   Total     0            0.0004   -0.0004            252446

Q Diff means that LHP + LHB following a LHP were a combination of better batters/worse pitchers by 3.7 points of wOBA compared to LHP + LHB following a RHP, etc.  So of the observed 5.7 points of wOBA difference, 3.7 of it was expected from player quality and the 2 points left over is the adjusted difference.

I also looked at only the performance against the second pitcher the batter faced in the game using the first pitcher’s handedness, but in that case, following the same-handed pitcher actually LOWERED adjusted performance by 1.7 points of wOBA (third and subsequent pitcher faced was a 1 point benefit for samehandedness), but these are still nothing.  I just don’t see anything here.  If changing pitcher characteristics made a meaningful difference, it would almost have to show up in flipped handedness, and it just doesn’t.
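For concreteness, here’s a minimal sketch of the same-hand-vs-opposite-hand comparison behind Table 2, assuming a hypothetical PA-level table with columns batter_hand, pitcher_hand, prev_pitcher_hand, and woba (the quality adjustments behind Table 3 aren’t included):

```python
# Group PAs by (batter hand, pitcher hand, previous pitcher's hand), then take
# wOBA following a same-handed previous pitcher minus wOBA following an opposite-handed one.
import pandas as pd

def handedness_switch_diffs(pa: pd.DataFrame) -> pd.DataFrame:
    grouped = (pa.groupby(["batter_hand", "pitcher_hand", "prev_pitcher_hand"])["woba"]
                 .agg(["mean", "size"])
                 .reset_index())
    same = grouped[grouped["pitcher_hand"] == grouped["prev_pitcher_hand"]]
    opp = grouped[grouped["pitcher_hand"] != grouped["prev_pitcher_hand"]]
    merged = same.merge(opp, on=["batter_hand", "pitcher_hand"], suffixes=("_same", "_opp"))
    merged["woba_diff"] = merged["mean_same"] - merged["mean_opp"]
    return merged[["batter_hand", "pitcher_hand", "woba_diff", "size_same", "size_opp"]]
```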

Update:

There was one other obvious thing to check, velocity, and it does show the makings of a real (and potentially somewhat actionable) effect.  Bucketing pitchers into fast (average fastball velocity > 94.5), slow (< 89.5), or medium, and doing the same quality/handedness controls as above gave the following:

First reliever   Starter   Quality-adjusted wOBA   SD       N
F                F         0.319                   0.0047   11545
F                M         0.311                   0.0019   65925
F                S         0.306                   0.0037   17898
M                F         0.318                   0.0033   23476
M                M         0.321                   0.0012   167328
M                S         0.320                   0.0022   50625
S                F         0.321                   0.0074   4558
S                M         0.318                   0.0025   39208
S                S         0.330                   0.0043   13262
(F = fast, M = medium, S = slow)

Harder-throwing relievers do better, which isn’t a surprise, but it looks like there’s extra advantage when the starter was especially soft-tossing, and at the other end, slow-throwing relievers are max punished immediately following soft-tossing starters.  This deserves a more in-depth look with more granular tools than aggregate PA wOBA, but two independent groups both showing a >1SD effect in the hypothesized direction is.. something, at least, and an effect size on the order of .2-.3 RA/9 isn’t useless if it holds up.  I’m intrigued again.

How Casting Spells Should Really Work

Warning: extreme MTG nerdery ahead.

Starting with the Magic Origins update bulletin, I’ve observed the rules manager/team trying to work out the process of casting a spell, trying to allow what should be legal (using Bestow when you can’t cast creatures), disallow what shouldn’t be legal (casting Squee out of Ixalan’s Binding), and not twist themselves into a pretzel of incoherent nonsense in the process… which is what happened with the latest update to 601.3e.  Well, actually it started in the previous update with Mystic Forge rulings, but they doubled down on that mistake here and made it really bad.

Assuming (as always) no other relevant cards/effects, and speaking normatively throughout, if you have a Mystic Forge out and a Deathmist Raptor on top, you shouldn’t be able to cast it.  You can’t normally cast the top card of your library.  It’s not an artifact or colorless nonland card, so Mystic Forge shouldn’t let you cast it.  Done.  Full stop.  Nothing lets you cast that object, so you can’t take the game action of starting to cast it.  This shouldn’t even be a question.***

With the new update, Cascade on a 3 mana spell, which reads “… until you exile a nonland card whose converted mana cost is less than this spell’s converted mana cost” still (correctly) doesn’t recognize Beck//Call as a card with CMC<3 because it’s CMC 2+6=8…. while at the same time, Kari Zev’s Expertise, which says “You may cast a card with converted mana cost 2 or less from your hand…” somehow DOESN’T even recognize that it’s an 8 CMC card.  And Brazen Borrower’s Petty Theft can be cast from the graveyard using Wrenn and Six’s emblem. What in the actual fuck.  Eli- C’mon man.

In the explanation, Eli said “if you’re allowed to cast a spell with a certain mana cost or color…” and I don’t know if that’s just a typo or an actual misunderstanding, but Kari Zev’s Expertise says CARD, not SPELL, and it has to SAY card and MEAN card for anything to make any sense, and the failure to properly separate thoughts about cards and thoughts about spells seems to be at the root of all of the issues.


The process of casting a spell can be reduced to

  1. Casting an object
  2. as a spell with certain characteristics/targets/etc
  3. at a particular time

The tricky part is that the properties of the spell are not known at the start of casting, and the allowable times (and other things) depend on those properties.  Furthermore, since the gamestate changes between 1 and 2 when the object goes on the stack, what’s allowed and prohibited can change in the middle of casting (e.g. Squee and Ixalan’s Binding).  None of this is a problem, or even particularly complicated, but the algorithm to process it all has to be correct or it can spit out some fantastic levels of nonsense.



Let’s start with part 1, casting an object.  Their original attempt was that you can literally cast anything- a card from your hand, from your opponent’s library, whatever, and it would be dealt with later if it was illegal.  This was.. not a good idea, although it had the seed of one.  Since we don’t know what the spell is going to look like at this point, ignoring spell prohibitions here is correct, but something has to control what cards can be cast.  As it turns out, the correct answer to that is the same as the answer to why you can’t just put your opponent’s entire hand into his graveyard whenever you feel like it- you can only take legal game actions, and that isn’t one.  The path they went down instead, trying to determine what objects can be put on the stack based on what their spells might end up looking like, is effectively a category error.  The beginning of casting is about CARDS, not SPELLS.

There are *no other prohibitions* on what objects can be cast.  The set of legal objects should be defined *constructively*- cards in your hand by rule and whatever other objects effects allow you to cast (cards in the graveyard with Flashback, cards CMC<=2 in hand while resolving Kari Zev’s Expertise, copies from Isochron Scepter, etc).  Combined with the rules on timing and priority, which needs a small (obvious) rewrite to 117.1 to comply, we get the proper result

“A player with priority can cast a castable card” with the set of castable cards defined constructively as above.  There’s no way to cast an object at a random time because the only times casting something is an allowed game action are when you have priority or when something is telling you to during its resolution.  There’s no way to cast a random object because it’s never a legal game action.  This *does* allow you to start casting a sorcery during your opponent’s turn, or even start casting a land from hand, but this is actually fine and no different than starting to cast a spell you can’t choose targets for, which has been CompRules-legal the whole time.  The rest of 117 needs to be updated to “normally”, etc. to fit with this paradigm.

There are no cards that prohibit casting *cards*, so there are no prohibitions to worry about here (there are a couple that prohibit playing lands based on characteristics, but playing lands uses a special action and they don’t change characteristics in the middle of that action, so it’s not a problem).  All affirmative prohibitions (e.g. “Players can’t play creature spells”) affect *spells*, not *cards*, so there’s nothing else to worry about here.



As far as dealing with the spell aspect, once the object is on the stack, there’s no reason at all to intervene before the legality check in 601.2e as long as the previous steps can be followed (immediate rewind if you can’t choose legal targets, etc). No other objects move, so no more information can be leaked, and 601.3 is basically completely wrong/useless as it currently exists.  The only necessary checks in 601.2e are

  1. Land spells are illegal to cast (regardless of other types)
  2. If the spell was cast while the player had priority, the casting is illegal unless it is an instant, has flash, or was cast during the player’s main phase when the stack was empty.  (“as though it had flash” gets by this, obviously)
  3. If an effect prohibits the spell’s casting, considering the spell’s properties at this time, the casting is illegal.

That’s it… almost.  Squee is still escaping from Ixalan’s Binding.  A Grafdigger’s Cage variant that functioned while in the graveyard could (conceivably) be successfully cast from the graveyard.  The fix for that is trivial though.  The set of spell prohibitions is locked in before the object is put on the stack, so prohibitions that depend somehow on the to-be-cast object itself still apply, and then that set of prohibitions is what’s checked in 601.2e.  Squee is stuck.  So is my cage.

And that’s it for real.  To summarize:

  1. The player picks a castable CARD (object), either a card in hand or an object an effect allows to be cast, completely ignoring any properties the resulting spell may or may not have, any spell-casting prohibitions, and the possibility/impossibility of completing announcement successfully.
  2. The set of spell-casting prohibitions is locked in.
  3. The object is put on the stack
  4. Announcement proceeds up to 601.2e (if possible, rewind if not)
  5. Legality vs. the prohibitions in step 2 is checked using the spell’s current properties
  6. If legal, proceed to 601.2f. If illegal, rewind.
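To make the sequence concrete, here’s a toy, self-contained model of those six steps.  It’s my own illustration, not Comprehensive Rules text or a real rules engine, and the classes and fields are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Card:
    name: str
    is_artifact: bool
    zone: str                      # "hand", "graveyard", "library_top", "stack", ...

@dataclass
class Game:
    cards: list
    extra_castable: list = field(default_factory=list)      # objects effects allow you to cast (step 1)
    spell_prohibitions: list = field(default_factory=list)  # predicates over a spell on the stack

def castable_objects(game):
    # Step 1: defined constructively -- cards in hand by rule, plus whatever effects allow.
    return [c for c in game.cards if c.zone == "hand"] + list(game.extra_castable)

def cast(game, card):
    if card not in castable_objects(game):
        return "not a legal game action"                     # nothing lets you cast that object
    prohibitions = list(game.spell_prohibitions)             # step 2: lock in the prohibition set
    origin, card.zone = card.zone, "stack"                   # step 3: the object becomes a spell
    # step 4: announcement (modes, targets, costs) would go here; rewind if it can't be completed
    if any(forbids(card) for forbids in prohibitions):       # step 5: check the locked-in set
        card.zone = origin                                   #         against the spell's current properties
        return "rewound"
    return "proceed to 601.2f"                               # step 6

# Mystic Forge with Deathmist Raptor on top: the Forge only adds artifact/colorless nonland
# cards on top of the library to the castable set, so the Raptor is never a castable object.
raptor = Card("Deathmist Raptor", is_artifact=False, zone="library_top")
forge_game = Game(cards=[raptor], extra_castable=[c for c in [raptor] if c.is_artifact])
print(cast(forge_game, raptor))   # "not a legal game action"
```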

***If you’re wondering how to verify a morph was properly cast with Mystic Forge, sure, it’s ugly, but verifying that ANY morph was properly cast in any way is ugly, and we survived multiple formats with morph and manifest legal and played at the same time, so this is nothing in comparison.

The Bill James support measure is not objective

As discussed in The Independent Chip Model of Politics and HoF Voting and Bill James and the Trump polarization problem, the system doesn’t work.  There was a twitter exchange where Bill seemed hard-pressed to say anything more profound than the trivial and self-referential “the algorithm prints the output of the algorithm”.  If Bill were right, and the algorithm actually measured any objective value, then it shouldn’t much matter what sets of polls were run as long as there was enough mixing in the poll groups to have a link to everybody (polls of only R candidates and polls of only D candidates don’t say anything about how Rs would do against Ds, etc).

If the ICM distribution in the other post held, it wouldn’t matter if the population were sampled with one poll including everybody, sets of 4-person polls, sets of 3-person polls, sets of 2-person polls where one matchup was polled 100x more than the rest, or anything else.  They would all generate the same support scores.  This is the only distribution with that property.  We know the ICM distribution doesn’t reflect reality, but how much does that make the support scores sampling-method-dependent?  Well…..

As a toy model of the election, start with 4 candidates A/B/C/D who follow the ICM distribution with starting stacks/support in a 4:3:2:1 ratio.  From this, we can calculate the probabilities of each of the 4! order preferences.  For example, ABCD order has a probability of (4/(4+3+2+1)) * (3/(3+2+1)) * (2/(2+1)) = 13.33% and DCBA has a probability of (1/(4+3+2+1)) * (2/(4+3+2)) * (3/(4+3)) = 0.95%.  Since we know everybody’s order preferences, and we make the friendly assumption that we can always sample the population completely and that the preferences never change, we can generate the result of any poll and calculate support scores.  In this example, no matter how we poll, the 4:3:2:1 ratio holds and the support scores (normalized to add up to 10,000) are A: 4,000 B: 3,000 C: 2,000 D: 1,000.
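A minimal sketch of where those order probabilities and the “same scores no matter how you poll” property come from; this is my own toy code for the setup described here, not anything of Bill’s.

```python
# ICM order-preference probabilities for stacks A:B:C:D = 4:3:2:1, plus "polling" a subset
# by giving each voter their first choice among the candidates included in the poll.
from itertools import permutations

STACKS = {"A": 4, "B": 3, "C": 2, "D": 1}

def icm_order_prob(order, stacks):
    remaining, p = dict(stacks), 1.0
    for cand in order:
        p *= remaining[cand] / sum(remaining.values())
        remaining.pop(cand)
    return p

ORDERS = {o: icm_order_prob(o, STACKS) for o in permutations(STACKS)}
print(f"P(ABCD) = {ORDERS[('A', 'B', 'C', 'D')]:.4f}")   # 0.1333
print(f"P(DCBA) = {ORDERS[('D', 'C', 'B', 'A')]:.4f}")   # 0.0095

def poll(candidates):
    shares = {c: 0.0 for c in candidates}
    for order, p in ORDERS.items():
        first = next(c for c in order if c in candidates)
        shares[first] += p
    return shares

print(poll(("B", "C", "D")))   # still a 3:2:1 ratio -- the ICM property in action
```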

Now let’s throw a wrench in this by adding candidate T who gets 40% in every poll he’s involved in and is everybody’s first or last choice.  For simplicity, and to make a point later, we’ll treat this as a population with 48 order preferences, 24 with T in 1st followed by the ABCD ICM distribution above for 2nd-5th and 24 with the ABCD ICM distribution for 1st-4th followed by T in 5th (P(TABCD) = .4 * 13.33%, P(ABCDT) = .6* 13.33%, etc).  Now we’ll poll this in different ways.  Because we can generate any poll we want, we can poll every possible combination once and see what the support scores are.  The only variable is how many candidates are included in each poll, and that gives the following support scores:

Candidates per poll (# of polls)   5 (1)   4 (5)   3 (10)   2 (10)
T                                  4000    3340    2513     1429
A                                  2400    2628    2891     3189
B                                  1800    2001    2243     2525
C                                  1200    1351    1551     1813
D                                   600     681     802     1044
A/ABCD (%)                           40    39.4    38.6     37.2
B/ABCD (%)                           30    30.0    30.0     29.5
C/ABCD (%)                           20    20.3    20.7     21.2
D/ABCD (%)                           10    10.2    10.7     12.2

Even with T thrown in, the relative behavior of the ICM-compliant ABCD group stays mostly reasonable regardless of which poll size is used.   T, however, ranges from a commanding first place to a distant 4th place depending on the poll size.  Even without trying to define “support” in a meaningful, non-self-referential way, it’s obvious that claiming that any one of the 4 aggregated support numbers *is* T’s support (and that the other 3 are not) is ludicrous.  The aggregation clearly isn’t measuring anything when it can massively flip 3 ordinal rankings based only on changing poll sizes.

Integrating different factions (that are strongly ICM-noncompliant with each other) into one list doesn’t work at all- the algorithm can spit out a random number, but it’s hugely dependent on procedural choices that shouldn’t make much difference if the methodology actually worked, so any particular output for any particular choice clearly doesn’t mean anything, and there’s almost no point in even calculating or reporting it.

The Independent Chip Model of Politics and HoF Voting

I’d talked about the Bill James presidential polls before, and he’s running a similar set of polls for HoF candidates that have a similar kind of issue.  For whatever reason, this time around I realized that his assumptions are the same as the Independent Chip Model (ICM) for poker tournament equity based on stack sizes.  If you aren’t familiar with that, then assume we have 4 players, A with 40 chips, B with 30, C with 20, and D with 10.  Everything else being equal, A should win 40% of the time.  The ICM goes further than that, and for predicting the probability of second place, uses calculations of the form

Assuming A wins, what are the odds B gets second:  Remove A’s chips, and then B has 30/(30+20+10)=50% of the remaining chips, so B is 50% to get second *assuming A wins*.

If you don’t see the analogy yet, the ICM takes as input the stack sizes, which are identical to the probability of finishing first, and uses the first-place percentages to calculate the results of every poll subset.  Bill James runs polls and uses the (first-place) percentages to calculate every head to head subset.  The ICM assumption that to calculate the result between B/C/D, you just ignore A’s chips, is equivalent to the Bill James assumption that A’s support, if A is not an option, will break evenly among B/C/D based on their poll percentage.

That assumption doesn’t hold in politics, for reasons discussed before, and it doesn’t hold for HoF voting because different people prefer different player types even beyond the roid/no roid dichotomy.  As it stands, in the linked poll, Beltre would almost certainly be the leader in 4th-place rankings with ~70% 4th place votes and an average finishing position near or even above 3.0.  He’d likely get stomped in every head-to-head matchup, lose the overall rating, etc, but by using only first-place%, he looks like the clear second-preferred candidate, which is obviously very, very wrong.

It could have gotten even worse if Bonds didn’t dominate the roid vote.  Let’s say we had a different poll, Beltre 30%, Generic Roidmonster 1 (23.33%), Generic Roidmonster 2 (23.33%), Generic Roidmonster 3 (23.33%), where (if people ranked 1-4) the Roidmonsters were ranked randomly 1-3 or 2-4 depending on whether or not the voter was a never-roider.  In this case, each Roidmonster would have an average finishing position of 2.3 (Beltre 3.1) and would win the head-to-head with Beltre 70-30… yet Beltre wins the poll only counting first-place votes.
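The arithmetic behind those numbers, for anyone who wants to check it (my own verification of the toy poll, nothing more):

```python
# 30% of voters (never-roiders) rank Beltre 1st and the three Roidmonsters randomly 2nd-4th;
# the other 70% rank the Roidmonsters randomly 1st-3rd and Beltre 4th.
p_never_roider = 0.30

beltre_avg_finish = p_never_roider * 1 + (1 - p_never_roider) * 4   # 3.1
roid_avg_finish = p_never_roider * 3 + (1 - p_never_roider) * 2     # 2.3 (a random 2-4 averages 3, a random 1-3 averages 2)
beltre_first_place = p_never_roider                                 # 30% of first-place votes
roid_first_place = (1 - p_never_roider) / 3                         # ~23.33% each
roid_beats_beltre_h2h = 1 - p_never_roider                          # 70-30 head to head

print(beltre_avg_finish, roid_avg_finish)
print(beltre_first_place, roid_first_place, roid_beats_beltre_h2h)
```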

It’s clear that the ICM/James assumptions are violated, and violated to where they’re nowhere close to reality, in polls like this. In the same poll without Bonds, the Bonds votes would go overwhelmingly to Clemens and A-Rod, even though ICM/James assume a majority would go to Beltre.  Aggregating sets of votes is going to keep a lot of the same problems because the vote share of any two people in a poll is (well, can be) strongly dependent upon who else is in the poll.  The ICM/James model is built on an assumption of independence there, but that assumption isn’t close to true in HoF voting or in politics.