Be very, very skeptical of Judge Academy

For the non-Magic: the Gathering audience, some tournament judges filed a lawsuit against Wizards of the Coast (WotC) alleging that they should have been classified as employees.  IANAL, and a lot of the issues are beyond what a non-expert can discuss with any confidence, but one aspect- that for-profit companies cannot accept volunteer labor- is a rule that, by a plain English reading (again, IANAL), WotC clearly violated thousands of times in the past, including with me personally.  Of course, I was “volunteering” with the implicit understanding that I was going to get “gifted” promotional cards and game product worth well into the hundreds of dollars afterwards, and I did.

That’s shady enough domestically with US citizens, but when US citizens went abroad to “work as a volunteer”, and WotC brought foreigners to the US to “work as volunteers”, they were almost certainly running afoul of various labor laws, and on at least one occasion, a judge made the mistake of saying “work” at the border and despite further explanation still wasn’t allowed in the country.  So when there’s talk about WotC playing fast and loose with legal obligations around labor.. it’s because they clearly did.  And still might, I don’t know.

To avoid these headaches, WotC (purportedly) ditched its judge program and there is a new organization to replace it, Judge Academy.  JA held an AMA on Reddit and while it isn’t worth reading the whole thing, they were incredibly evasive on a large number of questions, said they couldn’t disclose financials (for no actual legal reason), and their responses included some gems like (in response to why they aren’t a non-profit), “We also felt it was import not to compete with organizations like the Red Cross for the charitable support being given by these companies.”  Really.  That happened.  So the whole thing looks.. uh.. shady AF.

Looking into the incentives of all those involved and the methods of leverage that can be exercised makes it look even worse than just a shady money grab. Going down the line of what each party wants:

WotC: be free from the legal headache, still effectively control the judge program, spend as little money as possible

Major tournament organizers: still have a competent potential staff pool without spending any new money to train it, keep judges from organizing to ask for more money

High-level judges: still have paying jobs, improve the pay-to-mentoring/training ratio, not have to spend infinite time on the logistics of certification

Now, JA provides certification testing and foils for $100.  Without the foils, effectively nobody- and certainly not enough people to bother running a business with staff- would pay $100 to certify.  Major tournament organizers would have to pick up the slack and do it at no cost to the trainees.  So JA’s business is *entirely* dependent on WotC providing foils that can be resold for over $100 on the secondary market, and they have *zero* recourse if WotC decides they don’t want to do that anymore, either by stopping the foil supply altogether or intentionally sending them garbage to distribute.  If WotC does, JA disappears instantly, and both sides are well aware of that.  There’s also no chance (for various reasons outside the scope of this post) that JA actually has a contract stipulating a minimum resale value of foils.

So JA *cannot cause trouble for WotC* or WotC just kills it.  JA *cannot disobey WotC* or WotC just kills it.  Despite being legally separate entities, JA is 100% WotC’s butt muppet.  JA has less wiggle room than if they were actually all WotC employees because at least then they’d have some workplace protections, which is kind of ironic given the whole context.

Big TOs will be happy with this for several reasons. The first is that they don’t have to put more of their own resources into maintaining a qualified staff pool around their regions. The second is that because JA is essentially existentially forbidden from causing any trouble, it’s not going to agitate for better working conditions/compensation, and any energy spent lobbying JA, or wasted on the mistaken belief that JA might ever do that, is energy not directed at anything that could affect a TO’s bottom line.

There’s not going to be any elected representation for obvious reasons. JA can’t cause trouble, so they’re going to vet staff carefully and only work with people who “get it”. And by “get it”, I mean understand that JA is not an organization for judges, it’s an organization that exists to be WotC’s butt muppet, keep big TOs staffed and happy, and get Tim and some high-level judges paid.

There’s no financial transparency- and some combination of incompetence/misrepresentations/blatant lies whenever financials are discussed- because the whole arrangement is shady AF on every level. There’s no way they’re going to go from “shady and opaque” to providing a line by line accounting of their revenue and expenditures that shows everybody exactly how messed up the whole situation is. If JA were an organization for judges, they’d be happy to prove it with financials- and if they were legally registered as one of several types of organizations for judges, they would HAVE TO prove it with financials- but again, they’re not an organization for judges and they’re simply choosing not to be transparent.

When an organization is de facto funded by somebody else (by foils laundered to cash through subscriptions), is incentivized to act in somebody else’s interests, has declined to enter into any obligation to act in your interests, has refused to allow you any ability to determine that it is acting in your interests, and has dodged/obfuscated/misrepresented/outright lied to you repeatedly when these issues are raised, you have to be an absolute fool to trust that it really is going to act in your interests and not somebody else’s.

 

Mythic Championship III Day 1-Blatant viewer manipulation and group breakdowns

First off, the level of view-count fraud was absolutely out of control today. The bullshit today (ht: darrenoc on reddit) isn’t particularly different than the bullshit they pulled with the Mythic Invitational, but the actual viewership today was anemic to begin with.  From the time I first checked, around the start of round 2, until the end of round 8, the number of people in chat (chat being sub-only is ~irrelevant to this) was between 9,000 and 11,500.  Since 70-75% of viewers in most large channels are logged in to chat, that’s a real viewership of 12k-16k. Going slightly above that isn’t impossible, but not by too much.

The nominal viewership I saw got as high as 65k, which means that literally 75-80%, or very close, of the reported viewer count was completely fake.  Once WotC stopped paying for new fake views, and the numbers started decaying as the day wound down, total views dropped from the 60-thousands to the 20-thousands while the actual people logged into chat- representative of real viewers- stayed in the same 9k-11.5k range.  It’s utterly and blatantly fraudulent. There’s a long section about WotC’s viewer fraud in this Kotaku article (open it and ctrl-f magic), and if it’s correct, WotC is spending *hundreds of thousands of dollars per event* for the sole purpose of creating transparently fraudulent viewer numbers.
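The arithmetic behind those percentages can be sketched in a few lines of Python. The function names are made up for illustration, and the 70-75% logged-in share is the rule of thumb quoted above, not a measured value for this particular stream.

```python
def real_viewer_range(chatters_low, chatters_high,
                      logged_in_low=0.70, logged_in_high=0.75):
    """Estimate real viewers from the number of accounts in chat.

    Assumes 70-75% of real viewers are logged in to chat (the rule of
    thumb from the post, not a measured figure for this stream).
    """
    # Fewest viewers: fewest chatters, highest logged-in share.
    low = chatters_low / logged_in_high
    # Most viewers: most chatters, lowest logged-in share.
    high = chatters_high / logged_in_low
    return low, high


def fake_share_range(reported, real_low, real_high):
    """Share of the reported count that real viewers cannot account for."""
    return 1 - real_high / reported, 1 - real_low / reported


low, high = real_viewer_range(9_000, 11_500)
fake_low, fake_high = fake_share_range(65_000, low, high)
print(f"real viewers: {low:,.0f} to {high:,.0f}")
print(f"fake share at 65k reported: {fake_low:.0%} to {fake_high:.0%}")
```

With the chat counts above, this lands at roughly 12,000 to 16,400 real viewers, which leaves about three quarters or more of a 65k reported count unaccounted for.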

That’s utterly disgusting.


On to the actual day 1 results.. I’m sure there will be several metagame breakdowns posted elsewhere, so I’m not bothering with that, especially since I had to go derive and input round 7 and 8 results by hand because the official page had this……………..

lolround7

and round 8 results still aren’t up, but I was mainly curious how the different kinds of players did.

I classified the players into 4 groups:  MPL members, pros/ex-pros, challengers, and invited personalities from the extra 16 invites (lists at the bottom of the post).  The only questionable classification was former PT champion Simon Görtzen, who does commentary now instead of playing.  I put him with the pros/ex-pros based on his pro history and that he wasn’t one of the extra invites.  These are the performances of each group vs. each other group.

left vs. top | MPL   | Pro/ex-pro | Challenger | Personality
MPL          | 42-42 | 19-22      | 27-18      | 18-12
Pro/ex-pro   | 22-19 | 11-11      | 7-6        | 6-3
Challenger   | 18-27 | 6-7        | 8-8        | 5-5
Personality  | 12-18 | 3-6        | 5-5        | 9-9


Combining the group performances and looking at day 2 conversion rates (not counting the 4 MPL players with byes into day 2) gives

group       | vs. out of group | out of group win% | day 2 | day 2 advance
MPL         | 64-52            | 55.2%             | 6/28  | 21.4%
Pro/ex-pro  | 35-28            | 55.6%             | 5/13  | 38.5%
Challenger  | 29-39            | 42.6%             | 1/13  | 7.7%
Personality | 20-29            | 40.8%             | 0/10  | 0%

Looks like the pros crushed it, taking it to the MPL 22-19 while the MPL went 45-30 against the challengers and personalities.  There’s a marked difference between those who are/have been at the top of the game and those who’ve never come close.
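For anyone who wants to check the combined numbers, here is a minimal Python sketch that rebuilds the out-of-group records from the head-to-head table. The dictionary layout and function name are my own choices for illustration; the records themselves are copied from the first table.

```python
GROUPS = ["MPL", "Pro/ex-pro", "Challenger", "Personality"]

# RECORDS[row][col] = (wins, losses) for the row group against the column
# group, copied from the head-to-head table.
RECORDS = {
    "MPL":         {"MPL": (42, 42), "Pro/ex-pro": (19, 22), "Challenger": (27, 18), "Personality": (18, 12)},
    "Pro/ex-pro":  {"MPL": (22, 19), "Pro/ex-pro": (11, 11), "Challenger": (7, 6),   "Personality": (6, 3)},
    "Challenger":  {"MPL": (18, 27), "Pro/ex-pro": (6, 7),   "Challenger": (8, 8),   "Personality": (5, 5)},
    "Personality": {"MPL": (12, 18), "Pro/ex-pro": (3, 6),   "Challenger": (5, 5),   "Personality": (9, 9)},
}


def out_of_group(group):
    """Return (wins, losses, win fraction) against every other group."""
    wins = sum(w for opp, (w, _) in RECORDS[group].items() if opp != group)
    losses = sum(l for opp, (_, l) in RECORDS[group].items() if opp != group)
    return wins, losses, wins / (wins + losses)


# Sanity check: each cell should mirror the opposite cell.
for a in GROUPS:
    for b in GROUPS:
        assert RECORDS[a][b] == RECORDS[b][a][::-1]

for g in GROUPS:
    w, l, pct = out_of_group(g)
    print(f"{g}: {w}-{l} ({pct:.1%})")
```

The mirror check is there because the round 7 and 8 results were entered by hand, and a mismatched cell is the easiest kind of input error to catch.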

 

————————————————————————————————————

Player lists (Bold = day 2)

MPL:

Alexander Hayne
Andrea Mengucci
Andrew Cuneo
Autumn Burchett
Ben Stark
Carlos Romao
Christian Hauck
Eric Froehlich
Grzegorz Kowalski
Janne Mikkonen
Javier Dominguez
Jean Emmanuel Depraz
Jessica Estephan
John Rolf
Lee Shi Tian
Lucas Esper Berthoud
Luis Salvatto
Marcio Carvalho
Martin Juza
Matthew Nass
Mike Sigrist
Paulo Vitor Damo da Rosa
Piotr Glogowski
Reid Duke
Seth Manfield
Shahar Shenhar
Shota Yasooka
William Jensen

Pros/ex-pros:

Allen Wu
Andrew Elenbogen
Ben Hull
Corey Burkhart
Greg Orange
Kai Budde
Kentaro Yamamoto
Luis Scott Vargas
Noah Walker
Ondrej Strasky
Raphaël Lévy
Simon Görtzen
Wyatt Darby

Challengers:

Alexey Shashov
André Santos
CJ Steele
Eric Oresick
Evan Gascoyne
Marcin Tokajuk
Matias Leveratto
Montserrat Ayensa
Nicholas Carlson
Patrick Fernandes
Takashi Iwasaki
Yuki Matsuda
Yuma Koizumi

Personalities:

Amy Demicco
Ashley Espinoza
Audrey Zoschak
Emma Handy
Giana Kaplan
Jason Chan
Jeffrey Brusi
Nhi Pham
Teresa Pho
Vanessa Hinostroza

 

Bill James and the Trump polarization problem

Bill has been running a series of polls with various candidates for president matched against each other, in an attempt to create a ranking system like this current one at time of writing.  There’s no question that the polls can be used to create the rankings, but for the rankings to be meaningful, preferences need to have certain properties, and they don’t appear to.

Starting with Bill’s college football example, if a team is expected to be remotely competitive against Clemson (national champion), they’re also expected to crush UTEP (horrific), and if a team’s game is expected to be remotely competitive against UTEP, they’re expected to get obliterated by Clemson.  There’s no concept of a team that’s 30% against both, or 60-40 vs UTEP and 40-60 vs Clemson (while Clemson is 99%+ against UTEP).  The football season works as a reasonable approximation because the teams can be given a rating, on one axis, and every pairwise comparison of teams is expected to play out “close” to the rating difference.

If teams with pathological properties in the previous paragraph existed, it would be *impossible* to give them any ratings that wouldn’t produce a bunch of wildly inaccurate predictions, and at that point, it’s not clear what the numbers would represent (since they can’t be used for pairwise or group predictions without a bunch of grievous errors), or what the point of the exercise would be.  Any time the pairwise matchups depend on multiple axes- something beyond the assigned rating from Bill’s system- it’s possible for the exercise to go completely haywire.
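To put a number on how badly the "30% against both" team breaks a one-axis rating, here is a minimal sketch assuming a Bradley-Terry style model, where each team gets one strength number and win odds are the ratio of strengths. That model is my assumption for illustration; nothing here says which formula Bill's system actually uses, and the function name is made up.

```python
def implied_win_prob(p_a_beats_b, p_b_beats_c):
    """P(A beats C) implied by a one-axis Bradley-Terry model.

    With a single strength per team, odds chain multiplicatively:
    odds(A over C) = odds(A over B) * odds(B over C).
    """
    odds_ab = p_a_beats_b / (1 - p_a_beats_b)
    odds_bc = p_b_beats_c / (1 - p_b_beats_c)
    odds_ac = odds_ab * odds_bc
    return odds_ac / (1 + odds_ac)


# A team that is 30% against Clemson, with Clemson 99% against UTEP.
p = implied_win_prob(0.30, 0.99)
print(f"implied win probability vs UTEP: {p:.1%}")
```

Under these assumptions, a team that is 30% against Clemson has to be about 98% against UTEP. A team that is also 30% against UTEP simply has no place on the axis, whatever rating it gets.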

In college football, the secondary axes, the matchup-specific details, aren’t nonexistent, but they’re much smaller than the primary axis overall quality difference, so the rating method basically works.  If a major secondary axis were added where every team were randomly assigned one of rock-paper-scissors at the beginning of the season, and the “winner” started ahead 14-0, the overall ratings wouldn’t be particularly different (treating the 14 points as legitimately scored for the rating calculations), but the pairwise rating-based game predictions would be utterly haywire because they’d have no idea when one of the teams was going to start with two free touchdowns, and *every* possible set of ratings is going to go totally bonkers with game predictions under that setup.  That example is totally contrived of course, but the general point is that it’s *impossible* to represent a system with multiple significant axes with one rating number and have it reliably mean anything prediction-wise.

Politics has the obvious multiple axes of party affiliation and candidate preference within the party, and while party affiliation is not absolute, a significant number of people are going to order their preferences as either (almost any D > almost any R) or (almost any R > almost any D), which means that you run into the “team that’s 30% against both UTEP and Clemson” problem.  There were 6 polls with Trump against 3 Democrats and he polled 26%-29% against fields headlined by everybody from Warren or Biden down to Booker or Abrams.  My guess is he wouldn’t go far above 29%, if at all, even in a field headlined by Inslee.  Warren and Biden smash Booker and Abrams head to head, but Trump polls the same against all of them (the only variability is when a second R is included).

There’s no single rating to give Trump that doesn’t go completely bonkers with his predictions against most of the range of D candidates.  The system just doesn’t work at all with Trump.  Trump is basically his own axis, even stronger than R/D alone, because he’s so polarizing.   26-29% rank him above all Ds, and 70%+ of the poll respondents rank him near last place.  The system looks reasonable for Ds relative to other Ds because none of the leaders are super-polarizing relative to the others right now, but that’s not a thing that ever has to be true or stay true.

Because Trump can’t be rated properly by this method, and rating other Rs alongside Ds is also going to be super sketchy for similar reasons, either always including Trump and ignoring his number for ranking the Ds against each other or only including Ds to begin with both seem like improvements, although the latter comes with guaranteed-R voters voting on Ds, which isn’t necessarily ideal.

 

The quantitative effect of voting machine vulnerabilities in the US

TL;DR Democrats can’t win the presidency in 2020 without flipping either a deep red state (R+15 or more in 2016) or at least one, and probably two, of the states that Rs won in 2016 that have multiple severe election security vulnerabilities.

 

Election security has been a hot-button topic lately, but I have yet to see any articles about how much these vulnerabilities allow the 2020 election to be manipulated.  As an introduction to election security issues, I highly recommend watching why electronic voting is a bad idea (short and entertaining, trust me), and if you want a more academic take, this recent paper discusses the issues with ballot-marking devices (BMD).  This blog is in complete agreement with the paper that the only legitimate use for ballot-marking devices is for those who are physically incapable of hand-marking a paper ballot by themselves, but still doesn’t consider them a necessity for that purpose (states can have voter-assistance protocols and only use hand marked paper ballots).  BMDs and other voting machines are technologies that have absolutely no reason to exist for the general population, but thanks to ignorance and good old-fashioned corruption, we’ve given corporate handouts of hundreds of millions of dollars in return for machines that are worse than worthless and compromise the very ability of a fair and verifiable election to exist in many jurisdictions.

This post only covers threats that result in the final vote count not reflecting the votes that were cast.  Compromising the list of eligible voters, engaging in a variety of forms of voter suppression, and packed courts simply refusing to accept the results even after a recount are also dangers, but they’re beyond the scope of this post.  This post would not be possible without the resources at Verified Voting, and unless sourced otherwise, information about voting methods in use are from there.  Electoral college maps are from 270towin.com.  Links to recount statutes were mostly found on Ballotpedia.

There are three major types of voting equipment in use.

The hand-marked paper ballot, generally read by an optical scanning machine.  While the video above is correct that the scanning machine is vulnerable to attacks, the defense to these attacks is the ability to hand-count ballots, and having a candidate-funded recount always available by law is the ultimate backstop against scanner attacks.  The robustness of various recount schemes will be discussed in the state-by-state section later.

Ballot-marking devices are extremely expensive pencils that fill out a ballot that a scanner then reads.  While useful for the small number of physically impaired voters, as the paper above notes, in practice it’s difficult to make sure that they’re working properly on election day.  Quoting from the paper, “half of voters don’t look at their ballot printout at all, and those who do look for an average of 4 seconds”.  They’re brutally vulnerable for down-ballot races, and even for the top race (e.g. President), attacks to change the overall margin by 1-2% are close to undetectable under optimistic assumptions and possibly even 5% or more under real-world conditions.  The defense to BMD attacks is simply to not use them, or at worst for only the physically impaired to use them.  Because they create a ballot that then has to be scanned, scanner attacks from the previous section are also still in play.  These are bad- really bad- but at least they aren’t…

Direct recording electronic systems (DRE) record votes directly on the machine itself.  This is obviously a complete security disaster.  Some machines also create a paper ballot, which would make them similar to the BMD group- if they work properly.  The ES&S ExpressVote XL and Dominion ImageCast Evolution have a ridiculous security flaw that allows users to irrevocably decline to review their ballot, AND THE MACHINE ONLY PRINTS THE BALLOT AFTER THAT- whatever ballot it feels like printing because the voter can’t detect it anymore.  The Dominion ImageCast X can’t do that, but it can print on the ballot after it has been “verified” for the last time. Because the ImageCast X can only fill in races where the voter didn’t record a vote, that flaw is much more limited in scope, especially for top-of-the-ballot races, but it, and all other DREs with a paper trail, are at *least* as bad as BMDs above.  The defense to DREs is to dump them all in the bottom of the sea.

 

This post is going to focus on these weaknesses, but if you really feel like being depressed, there are a lot of security weaknesses that we’re not addressing here.  It’s really difficult to overstate the attack surface all of these electronics allow, and none of them would be more than mere annoyances if hand-marked paper ballots with a hand recount always available were adopted everywhere.  But alas..

This is a map of the 2016 election (deep red (e.g. OK, LA) = Trump crushed, pink (e.g. FL, PA) = Trump won small, etc).

2016base

Deep blue and deep red states aren’t going to be examined- if any are flipped legitimately, the election is almost certainly over, and hacking them is unnecessary and too obvious.  Looking at the competitive states, some are pretty boring from a security perspective.

These states all have hand-marked paper ballots and hand recounts always available (if the candidate is willing to pay for it, of course).

Oregon (D+11)

New Mexico (D+8)

Colorado (D+4)

Maine (D+3) / Maine District 2 (R+10)

Minnesota (D+1.5)

Michigan (R+0.2)

Nebraska District 2 (R+2)

These states use all or predominantly hand-marked paper ballots, but have issues with their recount protocol (ranging from likely inconsequential to extremely vulnerable)

New Hampshire (D+0.3) All paper ballots, hand recount available if the margin of victory is within 20%.  This is a stupid rule, but cheating blatantly enough in 2020 to produce R or D +20 in a competitive national election would be even more stupid.

Connecticut (D+13) All paper ballots, hand recount mandatory in very close races *or* if the election moderators suspect shenanigans.  Any R win outside the margin of error would qualify as shenanigans unless the national election is a bloodbath in Trump’s favor.  This is a horrible system that just isn’t likely to be exploited here.  Discretion on whether or not a recount is ever performed shouldn’t exist and certainly shouldn’t belong to one party.

Iowa (R+9) All paper ballots, recount always available, but the election officials have the discretion to recount ballots with machines again instead of by hand.. which removes the ability to correct machine attacks.  As in Connecticut, this discretion shouldn’t exist.

Virginia (D+5) All paper ballots, recount only available in very close elections.  Furthermore, recounts for optically scanned ballots are *rescanned by machine only*.  The machines are supposedly tested before the recount, but that obviously doesn’t defend against certain attacks.  This statute is completely insane in two ways.  If one side simply cheats *a lot*, there’s no recount available at all, and never doing a hand recount, when a major purpose of a recount is to fix machine screwups and defend against machine cheats, is inexcusable.

Arizona (R+3) Mostly paper ballots by mail (~80% of votes) and a mix of paper and BMDs at precincts.  The only recount available is if the vote tally is within 0.1%.  There’s a mandated pre-count check on some machines that compares machine and hand counts for 1-2% of ballots before counting all the ballots.  That’s trivially defeated by telling the machines to be honest for the first X ballots, and the centralized locations of vote-by-mail counting makes it possible to change *a lot* of votes by compromising a very small number of machines in one place, and the lack of a recount allows it to work.  This is effectively the exact system that the video was warning against re: blindly trusting scanning machines.

These states use significant numbers of BMDs or BMD-equivalent DREs, but have hand recounts always available.

Once the electronics are introduced for the act of marking a ballot, they become an attack surface along with the scanners used to count those ballots.  In this group, the scanner vulnerabilities are mitigated by the hand recount availability, but the BMD vulnerabilities discussed above remain.  And for those who trust in machines and state election officials and all, these machines were already a disaster without being maliciously attacked.

Nevada (D+2) Almost entirely Dominion ImageCast X (one that can still print on ballots after voter verification) with no plans that I can find to dump this before 2020. Amusingly, a quirk in Nevada law dating to the 1970s requires a “None of These Candidates” option in statewide races, President included, allowing voters to affirmatively mark a ballot for “Nobody” instead of simply leaving the contest blank…. which, for those races, mitigates the ImageCast X design flaw of being able to print votes in contests the voter left blank, making it effectively a BMD.

Ohio (R+8) has a mix of paper and BMD/DREs now, but it’s pushing towards paper ballots and simple BMDs.  No DREs are certified so far for 2020, and hopefully that will continue to be true.

These states use significant numbers of BMDs or BMD-equivalent DREs and have no hand recount always available, making them doubly vulnerable

North Carolina (R+3) Mix of paper ballots, BMDs, and DREs with a paper trail in 2016.  The main current DRE (iVotronic) is getting decertified for 2020.  Counties appear to be individually responsible for selecting new systems that either use hand-marked paper ballots or mark a paper ballot, and that leaves the possibility of a significant number of ExpressVote XLs and ImageCasts appearing on the scene, which would warrant an even worse grouping.  Even if it’s “just” a lot of new BMDs, recounts are only available for races within 1%, leaving the scanners vulnerable as well.

These states will have no ability to conduct a verifiable close election in a statewide race, either because they use enough machines with no paper trail or use enough ExpressVote XLs and/or ImageCast Evolutions to render the paper trail meaningless.

Wisconsin (R+0.7) Mix of paper ballots, a wide variety of BMDs, and DREs with a paper trail, including 10.7% of municipalities using ImageCast Evolutions. Furthermore, Wisconsin only allows recounts in very close races, so all three avenues are vulnerable- the scanners, the BMDs, and the ImageCast Evolutions.

Delaware (D+11)  Wasting tens of millions of dollars to replace everything with ExpressVote XLs for 2020. This is completely insane. Recounts are only available in close races (as useless as recounting fabricated ballots would be).

New Jersey (D+14) Mishmash of different DREs with no paper trail and no coherent plan to be any less vulnerable in 2020.  A couple of counties might move to something less awful, but not enough to matter.  Candidates can pay for a recount… except there aren’t any ballots to count again.

Texas (R+9) Has tons of DREs with no paper trail and no plan to change that for 2020.

Florida (R+1) Mostly paper, but enough DREs with no paper trail to flip any legitimately close election.  If this is somehow remedied by 2020, the recount law is terribly deficient- only races within 0.25% get hand recounts and races within 0.25%-0.50% get a machine recount, which would put Florida in the Arizona group.

Georgia (R+5) currently uses all DREs with no paper trail.  There is talk of replacing these machines by 2020, but given the entanglements between Georgia politicians and ES&S, it would almost certainly be with ES&S ExpressVote XLs, which means replacing no paper trail with a potentially fake paper trail.

And then there’s Pennsylvania..

Pennsylvania is currently a dumpster fire like Georgia or Texas, overrun by DREs with no paper trail.  The governor is making a hard push to replace these by 2020, which may or may not work.  And when it does “work”, places like Philadelphia can just buy ExpressVote XLs and not really make progress on the security problem.  It’s not clear exactly how bad election vulnerability in PA will be, but it’s horrible now and it seems unlikely that what needs to happen to secure it- almost all current DREs removed and almost no ExpressVote XLs and ImageCasts installed- will actually happen, or probably even come close to happening.

What does this all mean?

Under the hypothetical 2020 scenarios where Democrats do a bit better across the board (if Rs do, voting security doesn’t matter since they’re winning in a landslide), let’s look at what happens with a little malfeasance that benefits Republicans.  Under the following rules:

Democrats hold every state they won in 2016 (Rs don’t try to rig NJ or DE because it’s too obvious and leave NV alone since it’s a bit risky to flip, say, 5% off BMD deficiencies on the top-ballot race)

Republicans or Republican-aligned interests take the low-hanging fruit and rig the horribly vulnerable elections in states they control (TX/GA/AZ/FL/IA) as well as holding all the deep red states.

We get this map, marking “decided” states in deep color, states/districts with R>=+5 in 2016 in pink, and everything else a tossup

2016simplerig2

Democrats need 38 more (269-269 is an R win), which is not a simple ask.  Ohio (R+8) was significantly red and will be BMD-infested at best.  Michigan, Nebraska District 2, and Maine District 2 are fair, but WI and likely PA are both dumpster fires in terms of security, and NC is a real mess as well.

It’s *literally impossible*, unless D’s flip one or more deep red states somehow, to win in 2020 without taking a state that has serious-to-extreme election security flaws, and most likely more than one.  Breaking this down a little further, if D’s lose Michigan (a fair state), they’re also almost guaranteed to lose OH that they lost by 8% more in 2016, meaning they have to sweep WI/PA/NC, all of which have major security issues.  I wouldn’t want to be the Ds there even if those elections were fairly counted.  Assuming the Ds win MI, which they likely do in a legitimate win, gives this map

winmich

which clarifies things considerably.  Ds have 2 paths, winning PA, NE 2 (R+2), and ME 2 (R+10), which is a really tough ask, or winning any 2 of OH, WI, PA, NC.  That’s a 2016 R+8 state and 3 states that Rs won in 2016 with disaster-level election security problems.  If you take election fairness seriously at all, this has to be terrifying.
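The path counting can be checked by brute force. This is a minimal sketch, not anything the maps were built from: the electoral vote counts are the 2020 allocations, which I'm supplying myself, and the 232 starting point is just 270 minus the 38 that Ds need after holding everything they won in 2016. The function and variable names are made up for illustration.

```python
from itertools import combinations

NEEDED = 270  # 269-269 is an R win, so Ds need an outright majority

# Assumption: 2020 electoral votes for the contested states/districts.
EV = {"MI": 16, "OH": 18, "WI": 10, "PA": 20, "NC": 15, "NE-2": 1, "ME-2": 1}

D_BASE = 232  # assumption: everything Ds carried in 2016 (270 minus the 38 needed)


def minimal_winning_sets(base, contested):
    """Smallest combinations of contested states that reach NEEDED from base."""
    names = sorted(contested)
    winners = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Skip supersets of a combination that already wins.
            if any(set(w) <= set(combo) for w in winners):
                continue
            if base + sum(EV[s] for s in combo) >= NEEDED:
                winners.append(combo)
    return winners


# Scenario 1: Ds lose MI, and OH along with it.
lose_mi = minimal_winning_sets(D_BASE, ["WI", "PA", "NC", "NE-2", "ME-2"])
# Scenario 2: Ds win MI.
win_mi = minimal_winning_sets(D_BASE + EV["MI"],
                              ["OH", "WI", "PA", "NC", "NE-2", "ME-2"])

print("lose MI:", lose_mi)
for path in win_mi:
    print("win MI:", path)
```

Under those assumptions, losing MI leaves only the WI/PA/NC sweep, and winning MI leaves the six pairs from OH/WI/PA/NC plus the PA, NE 2, ME 2 route.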

There is a path out for Ds here, but it requires strong leadership and decisive action on something that’s still a fringe issue to most people in and out of government, and it goes against entrenched corporate interests, so I know damn well it’s never going to happen, but the proper course of action security-wise is for WI and NC to immediately decertify all electronics, including scanners, and for PA to decertify all electronics except for scanners.  PA can allow scanners because it has a solid recount statute, but WI and NC don’t (unless their Republican-controlled legislatures decide to pass one), so the only safe method for them for now is good old hand-counting.

 

And don’t forget the various forms of attacks on the voter rolls that are already happening that weren’t discussed here……..

 

P.S. Comments are moderated.  Corrections or additions are welcome. Legitimate questions and interesting content are welcome.  JAQing off and random partisan hackery won’t be approved.

 

 

Trust the barrels

Inspired by the curious case of Harrison Bader

baderbarrels

whose average exit velocity is horrific, hard hit% is average, and barrel/contact% is great (not shown, but a little better than the xwOBA marker), I decided to look at which one of these metrics was more predictive.  Barrels are significantly more descriptive of current-season wOBAcon (wOBA on batted balls/contact), and average exit velocity is sketchy because the returns on harder-hit balls are strongly nonlinear. The game rewards hitting the crap out of the ball, and one rocket and one trash ball come out a lot better than two average balls.

Using consecutive seasons with at least 150 batted balls (there’s some survivor bias based on quality of contact, but it’s pretty much even across all three measures), which gave 763 season pairs, barrel/contact% led the way with r=0.58 to next season’s wOBAcon, followed by hard-hit% at r=0.53 and average exit velocity at r=0.49.  That’s not a huge win, but it is a win.  Since these are three ways of measuring a similar thing (quality of contact), they’re likely to be highly correlated, and we can do a little more work to figure out where the information lies.

evvehardhit

I split the sample into tenths based on average exit velocity rank, and hard-hit% and average exit velocity track an almost perfect line at the group (76-77 player) level.  Barrels deviate from linearity pretty measurably with the outliers on either end, so I interpolated and extrapolated on the edges to get an “expected” barrel% based on the average exit velocity, and then I looked at how players who overperformed and underperformed their expected barrel% by more than 1 SD (of the barrel% residual) did with next season’s wOBAcon.

Avg EV decile | >2.65% more barrels than expected | average-ish barrels | >2.65% fewer barrels than expected | whole group
0             | 0.362                             | 0.334               | none                               | 0.338
1             | 0.416                             | 0.356               | 0.334                              | 0.360
2             | 0.390                             | 0.377               | 0.357                              | 0.376
3             | 0.405                             | 0.386               | 0.375                              | 0.388
4             | 0.389                             | 0.383               | 0.380                              | 0.384
5             | 0.403                             | 0.389               | 0.374                              | 0.389
6             | 0.443                             | 0.396               | 0.367                              | 0.402
7             | 0.434                             | 0.396               | 0.373                              | 0.401
8             | 0.430                             | 0.410               | 0.373                              | 0.405
9             | 0.494                             | 0.428               | 0.419                              | 0.441

That’s.. a gigantic effect.  Knowing barrel/contact% provides a HUGE amount of information on top of average exit velocity going forward to the next season.  I also looked at year-to-year changes in non-contact wOBA (K/BB/HBP) for these groups just to make sure and it’s pretty close to noise, no real trend and nothing close to this size.
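The decile-and-residual procedure can be sketched roughly like this. The piecewise-linear fit through decile means is my assumption about one reasonable way to "interpolate and extrapolate", not necessarily the exact fit used above, and the dict keys are hypothetical names.

```python
import statistics
from collections import defaultdict

def decile_points(players, n_bins=10):
    """Sort by average EV, cut into n_bins equal-count groups, and return
    each group's (mean EV, mean barrel%) plus every player's bin index."""
    order = sorted(range(len(players)), key=lambda i: players[i]["avg_ev"])
    bins = [0] * len(players)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(players)
    pts = []
    for b in range(n_bins):
        grp = [p for p, pb in zip(players, bins) if pb == b]
        pts.append((statistics.mean(p["avg_ev"] for p in grp),
                    statistics.mean(p["barrel_pct"] for p in grp)))
    return pts, bins

def expected_barrel(ev, pts):
    """Piecewise-linear through the decile points; the first and last
    segments are extended to extrapolate past the end deciles."""
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if ev <= x1:
            break
    return y0 + (y1 - y0) * (ev - x0) / (x1 - x0)

def split_by_residual(players):
    """Mean next-season wOBAcon by (EV decile, over/avg/under) where
    over/under means more than 1 SD of the barrel% residual."""
    pts, bins = decile_points(players)
    resid = [p["barrel_pct"] - expected_barrel(p["avg_ev"], pts) for p in players]
    sd = statistics.pstdev(resid)
    groups = defaultdict(list)
    for p, b, r in zip(players, bins, resid):
        label = "over" if r > sd else "under" if r < -sd else "avg"
        groups[(b, label)].append(p["next_wobacon"])
    return {k: statistics.mean(v) for k, v in groups.items()}
```

Swapping the roles of `avg_ev` and `barrel_pct` gives the opposite-direction version of the same split.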

It’s also possible to look at this in the opposite direction- find the expected average exit velocity based on the barrel%, then look at players who hit the ball more than 1 SD (of the average EV residual) harder or softer than they “should” have and see how much that tells us.

| Barrel% decile | >1.65 mph faster than expected | average-ish EV | >1.65 mph slower than expected | whole group |
| --- | --- | --- | --- | --- |
| 0 | 0.358 | 0.339 | 0.342 | 0.344 |
| 1 | 0.362 | 0.359 | 0.316 | 0.354 |
| 2 | 0.366 | 0.364 | 0.361 | 0.364 |
| 3 | 0.389 | 0.377 | 0.378 | 0.379 |
| 4 | 0.397 | 0.381 | 0.376 | 0.384 |
| 5 | 0.388 | 0.395 | 0.418 | 0.397 |
| 6 | 0.429 | 0.400 | 0.382 | 0.403 |
| 7 | 0.394 | 0.398 | 0.401 | 0.398 |
| 8 | 0.432 | 0.414 | 0.409 | 0.417 |
| 9 | 0.449 | 0.451 | 0.446 | 0.450 |


There’s still some information there, but while the average difference between the good and bad EV groups here is 12 points of next season’s wOBAcon, the average difference for good and bad barrel groups was 50 points.  Knowing barrels on top of average EV tells you a lot.  Knowing average EV on top of barrels tells you a little.

Back to Bader himself, a month of elite barreling doesn’t mean he’s going to keep smashing balls like Stanton or anything silly, and trying to project him based on contact quality so far is way beyond the scope of this post, but if you have to be high on one and low on the other, lots of barrels and a bad average EV is definitely the way to go, both for YTD and expected future production.

 

Uncertainty in baseball stats (and why DRC+ SD is a category error)

What does it mean to talk about the uncertainty in, say, a pitcher’s ERA or a hitter’s OBP?  You know exactly how many ER were allowed, exactly how many innings were pitched, exactly how many times the batter reached base, and exactly how many PAs he had.  Outside of MLB deciding to retroactively flip a hit/error decision, there is no uncertainty in the value of the stat.  It’s an exact measurement.  Likewise, there’s no uncertainty in Trout’s 2013 wOBA or wRC+.  They reflect things that happened, calculated in deterministic fashion from exact inputs.  Reporting a measurement uncertainty for any of these wouldn’t make any sense.

The Statcast metrics are a little different- EV, LA, sprint speed, hit distance, etc. all have a small amount of random error in each measurement, but since those errors are small and opportunities are numerous, the impact of random error is small to start with and totally meaningless quickly when aggregating measurements.  There’s no point in reporting random measurement uncertainty in a public-facing way because it may as well be 0 (checking for systematic bias is another story, but that’s done with the intent of being fixed/corrected for, not of being reported as metric uncertainty).

Point 1:

So we can’t be talking about the uncertainties in measuring/calculating these kinds of metrics- they’re irrelevant-to-nonexistent.  When we’re talking about the uncertainty in somebody’s ERA or OBP or wRC+, we’re talking about the uncertainty of the player’s skill at the metric in question, not the uncertainty of the player’s observed value.  That alone makes it silly to report such metrics as “observed value +/- something”, like ERA 0.37 +/- 3.95, because it’s implicitly treating the observed value as some kind of meaningful central-ish point in the player’s talent distribution.  There’s no reason for that to be true *because these aren’t talent metrics*.  They’re simply a measure of something over a sample, and such metrics frequently give values where a better true talent is astronomically unlikely (a wRC+ over 300) or even impossible (an ERA below 0), along with many less extreme but equally silly examples.

Point 2:

Expressing something non-stupidly in the A +/- B format (or listing percentiles if it’s significantly non-normal, whatever) requires a knowledge of the player’s talent distribution after the observed performance, and that can’t be derived solely from the player’s data.  If something happens 25% of the time, talent could cluster near 15% and the player is doing it more often, talent could cluster near 35% and the player is doing it less often, or talent could cluster near 25% and the player is average.  There’s no way to tell the difference from just the player’s stat line and therefore no way to know what number to report as the mean, much less the uncertainty.  Reporting a 25% mean might be correct (the last case) or as dumb as reporting a mean wRC+ of 300 (if talent clusters quite tightly around 15%).

Once you build a prior talent distribution (based on what other players have done and any other material information), then it’s straightforward to use the observed performance at the metric in question and create a posterior distribution for the talent, and from that extract the mean and SD.  When only the mean is of interest, it’s common to regress by adding some number of average observations, more for a tighter talent distribution and fewer for a looser talent distribution, and this approximates the full Bayesian treatment.  If the quantity in the previous paragraph were HR/FB% (league average a little under 15%), then 25% for a pitcher would be regressed down a lot more than for a batter over the same number of PAs because pitcher HR/FB% allowed talent is much more tightly distributed than hitter HR/FB% talent, and the uncertainty reported would be a lot lower for the pitcher because of that tighter talent distribution.  None of that is accessible by just looking at a 25% stat line.
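The "add some number of average observations" shortcut looks like this in code. The regression amounts (400 for the pitcher, 100 for the batter), the 100-trial sample, and the 0.145 league rate are hypothetical numbers picked to show the direction of the effect, not fitted values. `k_from_talent_sd` is exact only if the talent prior is a Beta distribution, which is an assumption.

```python
def regress(successes, trials, league_rate, k):
    """Shrink an observed rate toward league_rate by adding k
    league-average observations to the sample."""
    return (successes + k * league_rate) / (trials + k)

def k_from_talent_sd(mean, talent_sd):
    """Regression amount implied by a Beta talent prior with this mean and
    SD: a tighter talent distribution gives a larger k."""
    return mean * (1 - mean) / talent_sd ** 2 - 1

# Same 25% stat line, hypothetical regression amounts:
pitcher = regress(25, 100, 0.145, k=400)  # tight talent spread, heavy regression
batter = regress(25, 100, 0.145, k=100)   # wide talent spread, light regression
```

The same 25-for-100 line lands much closer to league average for the pitcher than for the batter, which is the point: the stat line alone can't tell you which answer is right.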

Actual talent metrics/projections, like Steamer and ZiPS, do exactly this (well, more complicated versions of this) using talent distributions and continually updating with new information, so when they spit out mean and SD, or mean and percentiles, they’re using a process where those numbers are meaningful, getting them as the result of using a reasonable prior for talent and therefore a reasonable posterior after observing some games.  Their means are always going to be “in the middle” of a reasonable talent posterior, not nonsense like wRC+ 300.

Which brings us to DRC+.. I’ve noted previously that the DRC+ SDs don’t make any sense, but I didn’t really have any idea how they were coming up with those numbers until  this recent article, and a reference to this old article on bagging.  My last two posts pointed out that DRC+ weights way too aggressively in small samples to be a talent metric and that DRC+ has to be heavily regressed to make projections, so when we see things in that article like Yelich getting assigned a DRC+ over 300 for a 4PA 1HR 2BB game, that just confirms what we already knew- DRC+ is happy to assign means far, far outside any reasonable distribution of talent and therefore can’t be based on a Bayesian framework using reasonable talent priors.

So DRC+ is already violating point 1 above, using the A +/- B format when A takes ridiculous values because DRC+ isn’t a talent metric.  Given that it’s not even using reasonable priors to get *means*, it’s certainly not shocking that it’s not using them to get SDs either, but what it’s actually doing is bonkers in a way that turns out kind of interesting.  The bagging method they use to get SDs is (roughly) treating the seasonal PA results as the exact true talent distribution of events, drawing  from them over and over (with replacement) to get a fake seasonal line, doing that a bunch of times and taking the SD of the fake seasonal lines as the SD of the metric.
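A bare-bones version of that resampling idea, applied to a plain rate stat, is sketched below. This is only the generic bootstrap described above, not BP's actual code (which runs the resampled lines through the DRC+ model), and the 40-PA line is made up.

```python
import random
import statistics

def bagged_sd(outcomes, stat, n_resamples=2000, seed=0):
    """Treat the observed outcomes as the true distribution, redraw
    len(outcomes) of them with replacement many times, and return the SD
    of the statistic across the fake lines."""
    rng = random.Random(seed)
    n = len(outcomes)
    draws = [stat(rng.choices(outcomes, k=n)) for _ in range(n_resamples)]
    return statistics.pstdev(draws)

def obp(xs):
    return sum(xs) / len(xs)

# Hypothetical line: reached base 14 times in 40 PA (1 = reached, 0 = out).
pa = [1] * 14 + [0] * 26
sd = bagged_sd(pa, obp)
```

Nothing about the league's talent distribution enters the calculation anywhere, so the result is just the binomial sampling SD of the observed rate, roughly sqrt(p(1-p)/n).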

That’s obviously just a category error.  As I explained in point 2, the posterior talent uncertainty depends on the talent distribution and can’t be calculated solely from the stat line, but such obstacles don’t seem to worry Jonathan Judge.  When talking about Yelich’s 353 +/- 6  DRC+, he said “The early-season uncertainties for DRC+ are high. At first there aren’t enough events to be uncertain about, but once we get above 10 plate appearances or so the system starts to work as expected, shooting up to over 70 points of probable error. Within a week, though, the SD around the DRC+ estimate has worked its way down to the high 30s for a full-time player.”  That’s just backwards about everything.  I don’t know (or care) why their algorithm fails under 10 PAs, but writing “not having enough events to be uncertain about” shows an amazing misunderstanding of everything.

The accurate statement- assuming you’re going in DRC+ style using only YTD knowledge of a player- is “there aren’t enough events to be CERTAIN about much of anything”, and an accurate DRC+ value for Yelich- if DRC+ functioned properly as a talent metric- would be around 104 +/- 13 after that nice first game.  104 because a 4PA 1HR 2BB game preferentially selects- but not absurdly so- for above average hitters, and a SD of 13 because that’s about the SD of position player projections this year.  SDs of 70 don’t make any sense at all and are the artifact of an extremely high SD in observed wOBA (or wRC+) over 10-ish PAs, and remember that their bagging algorithm is using such small samples to create the values.  It’s clear WHY they’re getting values that high, but they just don’t make any sense because they’re approaching the SD from the stat line only and ignoring the talent distribution that should keep them tight.  When you’re reporting a SD 5 times higher than what you’d get just picking a player talent at random, you might have problems.

The Bayesian Central Limit Theorem

I promised there was something kind of interesting, and I didn’t mean bagging on DRC+ for the umpteenth time, although catching an outright category error is kind of cool.  For full-time players after a full season, the DRC+ SDs are actually in the ballpark of correct, even though the process they use to create them obviously has no logical justification (and fails beyond miserably for partial seasons, as shown above).  What’s going on is an example of the Bayesian Central Limit Theorem (formally, the Bernstein-von Mises theorem), which states that for any priors that aren’t intentionally super-obnoxious, repeatedly observing i.i.d. variables will cause the posterior to converge to a normal distribution.  At the same time, the regular Central Limit Theorem means that the distribution of outcomes that their bagging algorithm generates should also approach a normal distribution.

Without the DRC+ processing baggage, these would be converging to the same normal distribution, as I’ll show with binomials in a minute, but of course DRC+ gonna DRC+ and turn virtually identical stat lines into significantly different numbers:

NAME YEAR PA 1B 2B 3B HR TB BB IBB SO HBP AVG OBP SLG OPS ISO oppOPS DRC+ DRC+ SD
Pablo Sandoval 2014 638 119 26 3 16 244 39 6 85 4 0.279 0.324 0.415 0.739 0.136 0.691 113 7
Jacoby Ellsbury 2014 635 108 27 5 16 241 49 5 93 3 0.271 0.328 0.419 0.747 0.148 0.696 110 11

Ellsbury is a little more TTO-based and gets an 11 SD to Sandoval’s 7.  Seems legit.  Regardless of these blips, high single digits is about right for a DRC+ (wRC+) SD after observing a full season.

Getting rid of the DRC+ layer to show what’s going on, assume talent is uniform on [.250-.400] (SD of 0.043) and we’re dealing with 1000 Bernoulli observations.  Let’s say we observe 325 successes (.325), then when we plot the Bayesian posterior talent distribution and the binomial for 1000 p=.325 events (the distribution that bagging produces)

[Image: 325posterior]

They overlap so closely you can’t even see the other line.  Going closer to the edge, we get, for 275 and 260 observed successes,

At 275, we get a posterior SD of .013 vs the binomial .014, and at 260, we start to break the thing, capping how far to the left the posterior can go, and *still* get a posterior SD of .011 vs .014.  What’s going on here is that the weight for a posterior value is the prior-weighted probability that that value (say, .320) produces an observation of .325 in N attempts, while the binomial bagging weight at that point is the probability that .325 produces an observation of .320 in N attempts.  These aren’t the same, but under a lot of circumstances, they’re pretty damn close, and as N grows, and the numbers that take the place of .320 and .325 in the meat of the distributions get closer and closer together, the posterior converges to the same normal that describes the binomial bagging.  Bayesian CLT meets normal CLT.
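The uniform-prior comparison is easy to reproduce on a grid. This is a minimal sketch using the setup from the text (uniform talent on [.250, .400], 1000 Bernoulli observations); the grid resolution and function names are my choices.

```python
import math

def posterior_uniform(successes, n, lo=0.250, hi=0.400, steps=3000):
    """Posterior mean and SD of a Bernoulli rate under a uniform prior on
    [lo, hi], computed on a midpoint grid."""
    grid = [lo + (hi - lo) * (i + 0.5) / steps for i in range(steps)]
    # Log-likelihood at each grid point; the flat prior adds a constant.
    logw = [successes * math.log(p) + (n - successes) * math.log(1 - p)
            for p in grid]
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    mean = sum(p * wi for p, wi in zip(grid, w)) / total
    var = sum((p - mean) ** 2 * wi for p, wi in zip(grid, w)) / total
    return mean, math.sqrt(var)

def bagging_sd(successes, n):
    """SD of the observed rate when resampling n events at the observed rate."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

for k in (325, 275, 260):
    post_mean, post_sd = posterior_uniform(k, 1000)
    print(k, round(post_mean, 4), round(post_sd, 4), round(bagging_sd(k, 1000), 4))
```

In the middle of the prior the two SDs are nearly identical; near the .250 edge the prior cuts off the left tail and the posterior SD drops below the bagging SD.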

When the binomial bagging variance starts dropping well below the prior population variance, this convergence starts to happen enough to where the numbers can loosely be called “close” for most observed success rates, and that transition point happens to come out around a full season of somewhat regressed observation of baseball talent. In the example above, the prior population SD was 0.043 and the binomial SD was 0.014, so it converged excellently until we ran too close to the edge of the prior.  It’s never going to work every time, because a low end talent can get unlucky, or a high end talent can get lucky, and observed performance can be more extreme than the talent distribution (super-easy in small samples, still happens in seasonal ones) but for everybody in the middle, it works out great.

Let’s make the priors more obnoxious and see how well this works- this is with a triangle distribution, max weight at .250 straight line down to a 0 weight at .400.

 

The left-weighted prior shifts the means, but the standard deviations are obviously about the same again here.  Let’s up the difficulty even more, starting with a N(.325,.020) prior (0.020 standard deviation), which is pretty close to the actual mean/SD wOBA talent distribution among position players (that distribution is left-weighted like the triangle too, but we already know that doesn’t matter much for the SD)

Even now that the bagging distributions are *completely* wrong and we’re using observations almost 2 SD out, the standard deviations are still .014-.015 bagging and .012 for the posterior.  Observing 3 SD out isn’t significantly worse.  The prior population SD was 0.020, and the binomial bagging SD was 0.014, so it was low enough that we were close to converging when the observation was in the bulk of the distribution but still nowhere close when we were far outside, although the SDs of the two were still in the ballpark everywhere.

Using only 500 observations on the N(.325,.020) prior isn’t close to enough to pretend there’s convergence even when observing in the bulk.

[Image: 325500pa]

The posterior has narrowed to a SD of .014 (around 9 points of wRC+ if we assume this is wOBA and treat wOBA like a Bernoulli, which is handwavy close enough here), which is why I said above that high-single-digits was “right”, but the binomial SD is still at .021, 50% too high.  The regression in DRC+ tightens up the tails compared to “binomial wOBA”, and it happens to come out to around a reasonable SD after a full season.
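The normal-prior numbers can be checked with the standard normal-normal update. This sketch assumes the binomial likelihood is approximated by a normal with SD sqrt(p(1-p)/N), which is my simplification rather than the exact calculation behind the plots.

```python
import math

def normal_update_sds(prior_sd, rate, n):
    """Return (bagging SD, posterior SD) for n Bernoulli observations at the
    observed rate, combining precisions of prior and likelihood."""
    like_sd = math.sqrt(rate * (1 - rate) / n)
    post_sd = 1 / math.sqrt(1 / prior_sd ** 2 + 1 / like_sd ** 2)
    return like_sd, post_sd

bag_1000, post_1000 = normal_update_sds(0.020, 0.325, 1000)
bag_500, post_500 = normal_update_sds(0.020, 0.325, 500)
```

Under this approximation the posterior SD doesn't depend on where the observation falls, only on N and the prior SD, which fits with the posterior SD staying put even for observations 2-3 SD out.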

Just to be clear, the bagging numbers are always wrong and logically unjustified here, but they’re a hackjob that happens to be “close” a lot of the time when working with the equivalent of full-season DRC+ numbers (or more).  Before that point, when the binomial bagging variance is higher than the completely naive population variance (the mechanism for DRC+ reporting SDs in the 70s, 30s, or whatever for partial seasons), the bagging procedure isn’t close at all.  This is just another example of DRC+ doing nonsense that looks like baseball analysis to produce a number that looks like a baseball stat, sometimes, if you don’t look too closely.

 

Revisiting the DRC+ team switcher claim

The algorithm has changed a fair bit since I investigated that claim- at the least, it’s gotten rid of most of its park factor and regresses (effectively) less than it used to.  It’s not impossible that it could grade out differently now than it did before, and I told somebody on twitter that I’d check it out again, so here we are.  First of all, let’s remind everybody what their claim is.  From https://www.baseballprospectus.com/news/article/45383/the-performance-case-for-drc/, Jonathan Judge says:


Table 2: Reliability of Team-Switchers, Year 1 to Year 2 (2010-2018); Normal Pearson Correlations[3]

Metric Reliability Error Variance Accounted For
DRC+ 0.73 0.001 53%
wOBA 0.35 0.001 12%
wRC+ 0.35 0.001 12%
OPS+ 0.34 0.001 12%
OPS 0.33 0.002 11%
True Average 0.30 0.002 9%
AVG 0.30 0.002 9%
OBP 0.30 0.002 9%

With this comparison, DRC+ pulls far ahead of all other batting metrics, park-adjusted and unadjusted. There are essentially three tiers of performance: (1) the group at the bottom, ranging from correlations of .3 to .33; (2) the middle group of wOBA and wRC+, which are a clear level up from the other metrics; and finally (3) DRC+, which has almost double the reliability of the other metrics.

You should pay attention to the “Variance Accounted For” column, more commonly known as r-squared. DRC+ accounts for over three times as much variance between batters than the next-best batting metric. In fact, one season of DRC+ explains over half of the expected differences in plate appearance quality between hitters who have switched teams; wRC+ checks in at a mere 16 percent.  The difference is not only clear: it is not even close.

Let’s look at Predictiveness.  It’s a very good sign that DRC+ correlates well with itself, but games are won by actual runs, not deserved runs. Using wOBA as a surrogate for run-scoring, how predictive is DRC+ for a hitter’s performance in the following season?

Table 3: Reliability of Team-Switchers, Year 1 to Year 2 wOBA (2010-2018); Normal Pearson Correlations

Metric Predictiveness Error
DRC+ 0.50 0.001
wOBA 0.37 0.001
wRC+ 0.37 0.002
OPS+ 0.37 0.001
OPS 0.35 0.002
True Average 0.34 0.002
OBP 0.30 0.002
AVG 0.25 0.002

If we may, let’s take a moment to reflect on the differences in performance we see in Table 3. It took baseball decades to reach consensus on the importance of OBP over AVG (worth five points of predictiveness), not to mention OPS (another five points), and finally to reach the existing standard metric, wOBA, in 2006. Over slightly more than a century, that represents an improvement of 12 points of predictiveness. Just over 10 years later, DRC+ now offers 13 points of improvement over wOBA alone.


 

Reading that, you’re pretty much expecting a DIPS-level revelation.  So let’s see how good DRC+ really is at predicting team switchers.  I put DRC+ on the wOBA scale, normalized each performance to the league-average wOBA that season (it ranged from .315 to .326), and measured the mean absolute error (MAE) of wOBA projections for the next season, weighted by the harmonic mean of the PAs in each season.  DRC+ had a MAE of 34.2 points of wOBA for team-switching position players.  Projecting every team-switching position player to be exactly league average had a MAE of 33.1 points of wOBA.  That’s not a mistake.  After all that build-up, DRC+ is literally worse at projecting team-switching position players than assuming that they’re all league average.
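The scoring rule is simple enough to write down. Here is a minimal sketch of a harmonic-mean-weighted MAE; the tuple layout and the two toy rows are hypothetical, and the per-season league normalization is assumed to have been applied to the inputs already.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two PA totals; dominated by the smaller one."""
    return 2 * a * b / (a + b)

def weighted_mae(rows):
    """rows: iterable of (projected_woba, actual_next_woba, pa_t, pa_t1).
    Each player's absolute error is weighted by the harmonic mean of his
    PAs in the two seasons."""
    num = den = 0.0
    for proj, actual, pa_t, pa_t1 in rows:
        w = harmonic_mean(pa_t, pa_t1)
        num += w * abs(proj - actual)
        den += w
    return num / den

# Made-up rows to show the call shape (NOT real players).
rows = [(0.350, 0.330, 600, 600), (0.300, 0.340, 100, 300)]
metric_mae = weighted_mae(rows)
# "Everybody is league average" baseline: same rows, constant projection.
baseline_mae = weighted_mae([(0.320, a, p0, p1) for _, a, p0, p1 in rows])
```

The harmonic mean keeps a 600 PA / 20 PA pair from counting like a full-time player in both seasons.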

If you want to say something about pitchers at the plate…
[Image: i-dont-think-so-homey-dont-play-that]

 

Even though Jonathan Judge felt like calling me a total asshole incompetent troll last night, I’m going to show how his metric could be not totally awful at this task if it were designed and quality-tested better.  As I noted yesterday, DRC+’s weightings are *way* too aggressive on small numbers of PAs.  DRC+ shouldn’t *need* to be regressed after the fact- the whole idea of the metric is that players should only be getting credit for what they’ve shown they deserve (in the given season), and after a few PAs, they barely deserve anything, but DRC+ doesn’t grasp that at all and its creator doesn’t seem to realize or care that it’s a problem.

If we regress DRC+ after the fact to see what happens in an attempt to correct that flaw, it’s actually not a dumpster fire.  All weightings are harmonic means of the PAs.  Every position player pair of consecutive 2010-18 seasons with at least 1 PA in each is eligible.  All tables are MAEs in points of wOBA trying to project year T+1 wOBA.

First off, I determined the regression amounts for DRC+ and wOBA to minimize the weighted MAE for all position players, and that came out to adding 416 league average PAs for wOBA and 273 league average PAs for DRC+.  wOBA assigns 100% credit to the batter.  DRC+ *still* needs to be regressed 65% as much as wOBA.  DRC+ is ridiculously overaggressive assigning “deserved” credit.
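Finding the regression amount is a one-dimensional search. The sketch below uses a brute-force scan over whole numbers of PAs with a single league-average value; both of those are my simplifications, and the function names are hypothetical.

```python
def regressed(rate, pa, league_avg, k):
    """Observed rate over pa PAs after adding k league-average PAs."""
    return (rate * pa + league_avg * k) / (pa + k)

def best_regression_pa(rows, league_avg, candidates=range(0, 1001)):
    """rows: (year_t_value, pa_t, next_woba, pa_t1) tuples on the wOBA scale.
    Returns the k that minimizes the harmonic-mean-weighted MAE."""
    def mae(k):
        num = den = 0.0
        for rate, pa_t, nxt, pa_t1 in rows:
            w = 2 * pa_t * pa_t1 / (pa_t + pa_t1)
            num += w * abs(regressed(rate, pa_t, league_avg, k) - nxt)
            den += w
        return num / den
    return min(candidates, key=mae)
```

Run once with wOBA in the first slot and once with DRC+ on the wOBA scale, and the ratio of the two answers is the "still needs 65% as much regression" number.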

Table 1.  MAEs for all players

| lgavg | raw DRC+ | raw wOBA | reg wOBA | reg DRC+ |
| --- | --- | --- | --- | --- |
| 33.21 | 31.00 | 33.71 | 29.04 | 28.89 |

Table 2. MAEs for all players broken down by year T PAs

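| Year T PA | lgavg | raw DRC+ | raw wOBA | reg wOBA | reg DRC+ | T+1 wOBA |
| --- | --- | --- | --- | --- | --- | --- |
| 1-99 PAs | 51.76 | 48.84 | 71.82 | 49.32 | 48.91 | 0.284 |
| 100-399 PA | 36.66 | 36.64 | 40.16 | 34.12 | 33.44 | 0.304 |
| 400+ PA | 30.77 | 27.65 | 28.97 | 25.81 | 25.91 | 0.328 |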

Didn’t I just say DRC+ had a problem with being too aggressive in small samples?  Well, this is one area where that mistake pays off.  Because the group of hitters who have 1-99 PA over a full season are terrible, being overaggressive in crediting their suckiness pays off.  But in a situation like now, where the real players instead of just the scrubs and callups have 1-99 PAs, being overaggressive is terribly inaccurate.  Once the population mean approaches league-average quality, the need for- and benefit of- regression is clear. If we cheat and regress each bucket to its population mean, it’s clear that DRC+ wasn’t actually doing anything special in the low-PA bucket; it’s just that regressing to a league average 36 points of wOBA higher than the group’s actual mean wasn’t a great corrector.

Table 3. (CHEATING) MAEs for all players broken down by year T PAs, regressed to their group means (same regression amounts as above).

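| Year T PA | lgavg | raw DRC+ | raw wOBA | reg wOBA | reg DRC+ | T+1 wOBA |
| --- | --- | --- | --- | --- | --- | --- |
| 1-99 PAs | 51.76 | 48.84 | 71.82 | 46.17 | 46.30 | 0.284 |
| 100-399 PA | 36.66 | 36.64 | 40.16 | 33.07 | 33.03 | 0.304 |
| 400+ PA | 30.77 | 27.65 | 28.97 | 26.00 | 25.98 | 0.328 |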

There’s very little difference between regressed wOBA and regressed DRC+ here.  DRC+ “wins” over wOBA by 0.00015 wOBA MAE over all position players, clearly justifying the massive amount of hype Jonathan Judge pumped us up with.  If we completely ignore the trash position players and only optimize over players who had 100+PA in year T, then the regression amounts increase slightly- 437 PA for wOBA and 286 for DRC+, and we get this chart:

Table 4. MAEs for all players broken down by year T PAs, optimized on 100+ PA players

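| Year T PA | lgavg | raw DRC+ | raw wOBA | reg wOBA | reg DRC+ | T+1 wOBA |
| --- | --- | --- | --- | --- | --- | --- |
| 100+ PA | 32.55 | 30.37 | 32.36 | 28.32 | 28.19 | 0.321 |
| 100-399 PA | 36.66 | 36.64 | 40.16 | 34.12 | 33.45 | 0.304 |
| 400+ PA | 30.77 | 27.65 | 28.97 | 25.81 | 25.91 | 0.328 |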

Nothing to see here either, DRC+ with a 0.00013 MAE advantage again.  Using only 400+PA players to optimize over only changes the DRC+ entry to 25.90, so regressed wOBA wins a 0.00009 MAE victory here.

In conclusion, regressed wOBA and regressed DRC+ are so close that there’s no meaningful difference, and I’d grade DRC+ a microscopic winner.  Raw DRC+ is completely awful in comparison, even though DRC+ shouldn’t need anywhere near this amount of extra regression if it were working correctly to begin with.

I’ve slowrolled the rest of the team-switcher nonsense.  It’s not very exciting either.  I defined 3 classes of players, Stay = played both years entirely for the same team, Switch = played year T entirely for 1 team and year T+1 entirely for 1 other team, Midseason = switched midseason in at least one of the years.

Table 5. MAEs for all players broken down by stay/switch, any number of year T PAs

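| stay/switch | lgavg | raw DRC+ | raw wOBA | reg wOBA | reg DRC+ | T+1 wOBA |
| --- | --- | --- | --- | --- | --- | --- |
| stay | 33.21 | 29.86 | 32.19 | 27.91 | 27.86 | 0.325 |
| switch | 33.12 | 34.20 | 37.89 | 31.57 | 31.53 | 0.312 |
| mid | 33.29 | 33.01 | 36.47 | 31.67 | 31.00 | 0.305 |
| sw+mid | 33.21 | 33.60 | 37.17 | 31.62 | 31.26 | 0.309 |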

It’s the same story as before.  Raw DRC+ sucks balls at projecting T+1 wOBA and is actually worse than “everybody’s league average” for switchers, regressed DRC+ wins a microscopic victory over regressed wOBA for stayers and switchers.  THERE’S (STILL) LITERALLY NOTHING TO THE CLAIM THAT DRC+, REGRESSED OR OTHERWISE, IS ANYTHING SPECIAL WITH RESPECT TO PROJECTING TEAM SWITCHERS.  These are the same conclusions I found the first time I looked, and they still hold for the current version of the DRC+ algorithm.

 

 

DRC+ weights TTO relatively *less* than BIP after 10 games than after a full season

This is a cut-out from a longer post I was running some numbers for, but it’s straightforward enough and absurd enough that it deserves a standalone post.  I’d previously looked at DRAA linear weights and the relevant chart for that is reproduced here.  This is using seasons with 400+PA.

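| relative to average PA | 1b | 2b | 3b | hr | bb | hbp | k | bip out |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| old DRAA | 0.22 | 0.38 | 0.52 | 1.16 | 0.28 | 0.24 | -0.24 | -0.13 |
| new DRAA | 0.26 | 0.45 | 0.62 | 1.17 | 0.26 | 0.30 | -0.24 | -0.15 |
| wRAA | 0.44 | 0.74 | 1.01 | 1.27 | 0.27 | 0.33 | -0.26 | -0.27 |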

 

I reran the same analysis on 2019 YTD stats, with all position players and with a 25 PA minimum, and these are the values I recovered.  Full year is the new DRAA row above, and the percentages are the percent relative to those values.

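| | 1b | 2b | 3b | hr | bb | hbp | k | BIP out |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YTD | 0.13 | 0.21 | 0.29 | 0.59 | 0.11 | 0.08 | -0.14 | -0.10 |
| min 25 PA | 0.16 | 0.27 | 0.37 | 0.63 | 0.12 | 0.09 | -0.15 | -0.11 |
| Full Year | 0.26 | 0.45 | 0.62 | 1.17 | 0.26 | 0.30 | -0.24 | -0.15 |
| YTD %s | 48% | 47% | 46% | 50% | 41% | 27% | 57% | 64% |
| min 25PA %s | 61% | 59% | 59% | 54% | 46% | 30% | 61% | 74% |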

So.. this is quite something.  First of all, events are “more-than-half-deserved” relative to the full season after only 25-50 PA.  There’s no logical or mathematical reason for that to be true, for any reasonable definition of “deserved”, that quickly.  Second, BIP hits are discounted *LESS* in a small sample than walks are, and BIP outs are discounted *LESS* in a small sample than strikeouts are.  The whole premise of DRC+ is that TTO outcomes belong to the player more than the outcomes of balls in play, and are much more important in small samples, but here we are, with small samples, and according to DRC+, the TTO OUTCOMES ARE RELATIVELY LESS IMPORTANT NOW THAN THEY ARE AFTER A FULL SEASON.  Just to be sure, I reran with wRAA and extracted almost the exact same values as chart 1, so there’s nothing super weird going on here.  This is complete insanity- it’s completely backwards from what’s actually true, and even from what BP has stated is true.  The algorithm has to be complete nonsense to “come to that conclusion”.
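The comparison can be checked directly from the rounded table values. This small sketch recomputes the share of the full-season weight that each event keeps in the min-25-PA sample; because it starts from two-decimal weights, the ratios can differ by a point or two from the percentage rows above.

```python
events = ["1b", "2b", "3b", "hr", "bb", "hbp", "k", "bip_out"]
min25 = [0.16, 0.27, 0.37, 0.63, 0.12, 0.09, -0.15, -0.11]  # min 25 PA row
full = [0.26, 0.45, 0.62, 1.17, 0.26, 0.30, -0.24, -0.15]   # Full Year row

# Fraction of the full-season weight retained in the small sample.
retained = {e: s / f for e, s, f in zip(events, min25, full)}

for e in events:
    print(f"{e:8s} {retained[e]:.0%}")

# The backwards part: ball-in-play events keep MORE of their weight than
# the matching TTO events do.
hits_vs_walks = retained["1b"] > retained["bb"]
outs_vs_ks = retained["bip_out"] > retained["k"]
```

Both flags come out True from the published weights, which is exactly the opposite of what a metric built on "TTO belongs to the hitter" should produce in a small sample.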

Reading the explanation article, I kept thinking the same thing over and over.  There’s no clear logical or mathematical justification for most steps involved, and it’s just a pile of junk thrown together and tinkered with enough to output something resembling a baseball stat most of the time if you don’t look too closely. It’s not the answer to any articulable, well-defined question.  It’s not a credible run-it-back projection (I’ll show that unmistakably in the next post, even though it’s already ruled out by the.. interesting.. weightings above).

Whenever a hodgepodge model is thrown together like DRC+ is, it becomes difficult-to-impossible to constrain it to obey things that you know are true.  At what point in the process did it “decide” that TTO outcomes were relatively less important now?  Probably about 20 different places where it was doing nonsense-that-resembles-baseball-analysis and optimizing functions that have no logical link to reality.  When it’s failing basic quality testing- and even worse, when obvious quality assurance failures are observed and not even commented on (next post)- it’s beyond irresponsible to keep running it out as something useful solely on the basis of a couple of apples-to-oranges comparisons on rigged tests.

 

A new look at the TTOP, plus a mystery

I had the bright idea to look at the familiarity vs. fatigue TTOP debate, which has MGL on the familiarity side and Pizza Cutter on the fatigue side, by measuring performance based on the number of pitches the batter had seen previously and the number of pitches that the pitcher had thrown to other players in between the PAs in question.  After all, a fatigue effect on the TTOP shouldn’t be from “fatigue”, but “relative change in fatigue”, and that seemed like a cleaner line of inquiry than just total pitch count.  Not a perfect one, but one that should pick up a signal if it’s there.  Then I realized MGL had already done the first part of that experiment, which I’d somehow completely forgotten even though I’d read that article and the followup around the time they came out.  Oh well.  It never hurts to redo the occasional analysis to make sure conclusions still hold true.

I found a baseline 15 point PA1-PA2 increase as well as another 15 point PA2-PA3 increase.  I didn’t bother looking at PA4+ because the samples were tiny and usage is clearly changing.  In news that should be surprising to absolutely nobody reading this, PAs given to starters are on the decline overall and the number of PA4+ is absolutely imploding lately.

Season Total PAs 1st TTO 2nd 3rd 4th 5th
2008 116960 42614 40249 30731 3359 7
2009 116963 42628 40186 30736 3406 7
2010 119130 42621 40457 32058 3990 4
2011 119462 42588 40458 32333 4080 3
2012 116637 42506 40336 30741 3050 4
2013 116872 42570 40422 31026 2851 3
2014 117325 42612 40618 31235 2856 4
2015 114797 42658 40245 29580 2314 0
2016 112480 42461 40128 28193 1698 0
2017 110195 42478 39912 26476 1329 0
2018 106051 42146 38797 24057 1051 0

Looking specifically at PA2 based on the number of pitches seen in PA1, I found a more muted effect than MGL did using 2008-2018 data with pitcher-batters and IBB/sac-bunt PAs removed.  My data set consisted of (game,starter,batter,pa1,pa2,pa3) rows where the batter had to face the starter at least twice, the batter wasn’t the pitcher, and any ibb/sac bunt PA in the first three trips disqualified the row (pitch counts do include pitches to non-qualified rows where relevant).  For a first pass, that seemed less reliant on individual batter-pitcher projections than allowing each set of PAs to be biased by crap hitters sac-bunting and good hitters getting IBBd would have been.
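For concreteness, here is a rough sketch of how rows like that could be built from one starter's ordered list of PAs in a game, including the intervening-pitch count to other batters. The input dict keys (`batter`, `pitches`, `woba_value`, `is_pitcher`, `ibb_or_sac`) are hypothetical names, this version only fills in PA1 and PA2, and it is not my actual query.

```python
from collections import defaultdict

def build_rows(pas):
    """pas: every PA against one starter in one game, in order.
    Returns one row per qualifying batter with pitches seen in PA1 and
    pitches thrown to OTHER batters between his PA1 and PA2."""
    by_batter = defaultdict(list)
    for idx, pa in enumerate(pas):
        by_batter[pa["batter"]].append(idx)
    # before[i] = pitches thrown before PA i, counting every PA, so pitches
    # to disqualified batters still show up in the counts.
    before = [0]
    for pa in pas:
        before.append(before[-1] + pa["pitches"])
    rows = []
    for batter, idxs in by_batter.items():
        first3 = idxs[:3]
        if len(first3) < 2:
            continue  # must face the starter at least twice
        if any(pas[i]["is_pitcher"] or pas[i]["ibb_or_sac"] for i in first3):
            continue  # pitcher batting, or an IBB/sac bunt in the first 3 trips
        i1, i2 = first3[0], first3[1]
        rows.append({
            "batter": batter,
            "pitches_pa1": pas[i1]["pitches"],
            "intervening": before[i2] - before[i1 + 1],
            "woba_pa1": pas[i1]["woba_value"],
            "woba_pa2": pas[i2]["woba_value"],
        })
    return rows
```

Bucketing the rows by `pitches_pa1` or by `intervening` and averaging the wOBA columns gives tables of the kind shown in this post.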

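| Pitches in PA 1 | wOBA in PA 2 | Expected** | n |
| --- | --- | --- | --- |
| 1 | 0.338 | 0.336 | 39832 |
| 2 | 0.341 | 0.335 | 69761 |
| 3 | 0.336 | 0.335 | 79342 |
| 4 | 0.334 | 0.335 | 82847 |
| 5 | 0.339 | 0.337 | 74786 |
| 6 | 0.347 | 0.338 | 51374 |
| 7+ | 0.349 | 0.337 | 36713 |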

MGL found a 15 point bonus for seeing 5+ pitches the first time up (on top of the baseline 10 he found), but I only get about an 11 point bonus on 6+ pitches and 3 points of that are from increased batter/worse pitcher quality (“Expected” is just a batter/pitcher quality measure, not an actual 2nd TTO prediction). The SD of each bucket is on the order of .002, so it’s extremely likely that this effect is real, and also likely that it’s legitimately smaller than it was in MGL’s dataset, assuming I’m using a similar enough sampling/exclusion method, which I think I am.  It’s not clear to me that that has to be an actual familiarity effect, because I would naively expect to see more of a monotonic increase throughout the number of pitches seen instead of the J-curve, but the buckets have just enough noise that the J-curve might simply be an artifact anyway, and short PAs are an odd animal in their own right as we’ll see later.

Doing the new part of the analysis, looking at the wOBA difference in PA2-PA1 based on the number of intervening pitches to other batters, I wasn’t sure I was going to find much fatigue evidence early in the game, but as it turns out, the relationship is clear and huge.

| Intervening pitches | wOBA PA2-PA1 | vs base .015 TTOP | n |
|---|---|---|---|
| <=20 | -0.021 | -0.036 | 9476 |
| 21 | -0.005 | -0.020 | 5983 |
| 22 | -0.005 | -0.020 | 8652 |
| 23 | 0.004 | -0.011 | 11945 |
| 24 | 0.000 | -0.015 | 15683 |
| 25 | 0.004 | -0.011 | 19592 |
| 26 | 0.001 | -0.014 | 23057 |
| 27 | 0.005 | -0.010 | 26504 |
| 28 | 0.009 | -0.006 | 29690 |
| 29 | 0.015 | 0.000 | 31453 |
| 30 | 0.021 | 0.006 | 32356 |
| 31 | 0.014 | -0.001 | 32250 |
| 32 | 0.020 | 0.005 | 30723 |
| 33 | 0.018 | 0.003 | 28390 |
| 34 | 0.027 | 0.012 | 25745 |
| 35 | 0.028 | 0.013 | 22407 |
| 36 | 0.023 | 0.008 | 18860 |
| 37 | 0.030 | 0.015 | 15429 |
| 38 | 0.025 | 0.010 | 12420 |
| 39 | 0.012 | -0.003 | 9558 |
| 40 | 0.045 | 0.030 | 7362 |
| 41-42 | 0.032 | 0.017 | 9241 |
| 43+ | 0.027 | 0.012 | 7879 |
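One way to put a single number on the trend in the table above is an n-weighted least-squares slope of the PA2-PA1 difference against intervening pitches. This is only a sketch: it uses just the single-pitch buckets (21 through 40), because the open-ended and merged buckets don't have one x-value, and the choice of a straight-line fit is mine.

```python
# (intervening pitches, wOBA PA2-PA1, n) for the single-pitch buckets in the table.
ROWS = [
    (21, -0.005, 5983), (22, -0.005, 8652), (23, 0.004, 11945), (24, 0.000, 15683),
    (25, 0.004, 19592), (26, 0.001, 23057), (27, 0.005, 26504), (28, 0.009, 29690),
    (29, 0.015, 31453), (30, 0.021, 32356), (31, 0.014, 32250), (32, 0.020, 30723),
    (33, 0.018, 28390), (34, 0.027, 25745), (35, 0.028, 22407), (36, 0.023, 18860),
    (37, 0.030, 15429), (38, 0.025, 12420), (39, 0.012, 9558), (40, 0.045, 7362),
]


def weighted_slope(rows):
    """Weighted least-squares slope of y on x, using n as the weight."""
    total_weight = sum(n for _, _, n in rows)
    mean_x = sum(x * n for x, _, n in rows) / total_weight
    mean_y = sum(y * n for _, y, n in rows) / total_weight
    sxy = sum(n * (x - mean_x) * (y - mean_y) for x, y, n in rows)
    sxx = sum(n * (x - mean_x) ** 2 for x, _, n in rows)
    return sxy / sxx


SLOPE = weighted_slope(ROWS)
print(f"about {SLOPE * 1000:.1f} points of wOBA per intervening pitch")
```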

That’s a monster effect, 2 points of TTOP wOBA per intervening pitch with an unmistakable trend.  Jackpot.  Hareeb’s a genius.  That’s big enough that it should result in actionable game situations all the time.  Let’s look at it in terms of actual 2nd time wOBAs (quality-adjusted).

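| Intervening pitches | PA2 wOBA (adj) |
|---|---|
| <=20 | 0.339 |
| 21 | 0.346 |
| 22 | 0.343 |
| 23 | 0.344 |
| 24 | 0.340 |
| 25 | 0.341 |
| 26 | 0.339 |
| 27 | 0.339 |
| 28 | 0.337 |
| 29 | 0.340 |
| 30 | 0.341 |
| 31 | 0.338 |
| 32 | 0.347 |
| 33 | 0.336 |
| 34 | 0.345 |
| 35 | 0.344 |
| 36 | 0.336 |
| 37 | 0.340 |
| 38 | 0.328 |
| 39 | 0.335 |
| 40 | 0.340 |
| 41-42 | 0.338 |
| 43+ | 0.344 |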

Wait what??!?!? Those look almost the same everywhere.  If you look closely, the higher-pitch-count PA2 wOBAs even average out to be a tad (4-5 points) *lower* than the low-pitch-count ones (and the same for PA1-PA3, though that needs a closer look). If I didn’t screw anything up, that can only mean..

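| Intervening pitches | PA1 wOBA (adj) |
|---|---|
| <=20 | 0.361 |
| 21 | 0.351 |
| 22 | 0.348 |
| 23 | 0.339 |
| 24 | 0.340 |
| 25 | 0.336 |
| 26 | 0.338 |
| 27 | 0.335 |
| 28 | 0.327 |
| 29 | 0.325 |
| 30 | 0.320 |
| 31 | 0.325 |
| 32 | 0.326 |
| 33 | 0.319 |
| 34 | 0.318 |
| 35 | 0.316 |
| 36 | 0.312 |
| 37 | 0.311 |
| 38 | 0.303 |
| 39 | 0.323 |
| 40 | 0.295 |
| 41-42 | 0.306 |
| 43+ | 0.318 |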

Yup.  The number of intervening pitches TO OTHER BATTERS between somebody’s first and second PA has a monster “effect on” the PA1 wOBA.  I couldn’t believe this was possibly real, so I started hand-checking more rows of pitch counts and PA results, you name it.  I asked one of my friends to verify it, and he did. I also mentioned the “effect” to Tango, and he observed the same pattern.  This is actually real.  It also works the same way between PA2 and PA3. I couldn’t keep looking at other TTOP stuff with this staring me in the face, so the rest of this post goes down this rabbit hole, showing my path to figuring out what was going on.  If you want to stop here and try to work it out for yourself, or just think about it for a while before reading on, I thought it was an interesting puzzle.

It’s conventional sabermetric wisdom that the box-score-level outcome of one PA doesn’t impart giant predictive effects, but let’s make sure that still holds up.

| Reached base safely in PA1 | PA2 wOBA (adj) | Batter quality | Pitcher quality |
|---|---|---|---|
| Yes | 0.348 | 0.338 | 0.339 |
| No | 0.336 | 0.334 | 0.336 |

That’s a 12 point effect, but 7 of it is immediately explained by talent differences, and given the plethora of other factors I didn’t control for, all of which will also skew hitter-friendly like the batter and pitcher quality did, there’s just nothing of any significance here.    Maybe the effect is shorter-term than that?

| Reached base safely in PA1 | Next batter wOBA (adj) | Next batter quality | Pitcher quality |
|---|---|---|---|
| Yes | 0.330 | 0.337 | 0.339 |
| No | 0.323 | 0.335 | 0.336 |

A 7 point effect where 5 is immediately explained by talent.  Also nothing here.  Maybe there’s some effect on intervening pitch count somehow?

| Reached base safely in PA1 | Average intervening pitches | Intervening wOBA (adj) |
|---|---|---|
| Yes | 30.58 | 0.3282 |
| No | 30.85 | 0.3276 |

Barely, and the intervening batters don’t even hit quite as well as expected given that we know the average pitcher is 3 points worse in the Yes group.  Alrighty then.  There’s a big “effect” from intervening pitch count on PA1 wOBA, but PA1 wOBA has minimal to no effect on intervening pitch count, intervening wOBA, PA2 wOBA, or the very next hitter’s wOBA.  That’s… something.
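The "12 points, 7 explained" and "7 points, 5 explained" bookkeeping from the first two Yes/No tables can be written out as below. This sketch assumes, as the text implicitly does, that batter and pitcher quality gaps simply add and count point for point against the raw gap.

```python
def split_gap(yes_row, no_row):
    """Each row is (observed wOBA, batter quality, pitcher quality).

    Returns (raw gap, gap explained by talent, leftover), in points of wOBA.
    """
    raw = yes_row[0] - no_row[0]
    talent = (yes_row[1] - no_row[1]) + (yes_row[2] - no_row[2])
    return round(raw * 1000), round(talent * 1000), round((raw - talent) * 1000)


# Same batter's PA2, split by whether he reached base in PA1.
SAME_BATTER = split_gap((0.348, 0.338, 0.339), (0.336, 0.334, 0.336))
# The very next batter, split the same way.
NEXT_BATTER = split_gap((0.330, 0.337, 0.339), (0.323, 0.335, 0.336))

print(SAME_BATTER)  # (12, 7, 5)
print(NEXT_BATTER)  # (7, 5, 2)
```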

In another curious note to this effect,

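| Intervening pitches | Intervening wOBA (adj) |
|---|---|
| <=20 | 0.381 |
| 21 | 0.373 |
| 22 | 0.363 |
| 23 | 0.358 |
| 24 | 0.351 |
| 25 | 0.344 |
| 26 | 0.343 |
| 27 | 0.335 |
| 28 | 0.333 |
| 29 | 0.328 |
| 30 | 0.324 |
| 31 | 0.322 |
| 32 | 0.319 |
| 33 | 0.316 |
| 34 | 0.316 |
| 35 | 0.312 |
| 36 | 0.310 |
| 37 | 0.310 |
| 38 | 0.307 |
| 39 | 0.311 |
| 40 | 0.308 |
| 41-42 | 0.309 |
| 43+ | 0.311 |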

Another monster correlation, but that one has a much simpler explanation: short PAs show better results for hitters.

| Pitches in PA | wOBA (adj) | n |
|---|---|---|
| 1 | 0.401 | 133230 |
| 2 | 0.383 | 195614 |
| 3 | 0.317 | 215141 |
| 4 | 0.293 | 220169 |
| 5 | 0.313 | 198238 |
| 6 | 0.328 | 133841 |
| 7 | 0.347 | 57396 |
| 8+ | 0.369 | 37135 |

Throw a bunch of shorter PAs together and you get the higher aggregate wOBA seen in the table right above this one. It seems like the PA length effect has to be a key.  Maybe there’s a difference in the next batter’s pitch distribution depending on PA1?

| Pitches in PA | Fraction of PA after reached base | Fraction of PA after out | wOBA after reached base | wOBA after out | OBP after reached base | OBP after out |
|---|---|---|---|---|---|---|
| 1 | 0.109 | 0.089 | 0.394 | 0.402 | 0.362 | 0.359 |
| 2 | 0.164 | 0.158 | 0.375 | 0.376 | 0.348 | 0.343 |
| 3 | 0.183 | 0.182 | 0.308 | 0.303 | 0.284 | 0.278 |
| 4 | 0.186 | 0.191 | 0.289 | 0.276 | 0.299 | 0.281 |
| 5 | 0.165 | 0.174 | 0.311 | 0.301 | 0.339 | 0.323 |
| 6 | 0.112 | 0.120 | 0.323 | 0.320 | 0.367 | 0.360 |
| 7 | 0.049 | 0.052 | 0.346 | 0.339 | 0.393 | 0.386 |
| 8+ | 0.032 | 0.034 | 0.356 | 0.360 | 0.401 | 0.405 |

Now we’re cooking with gas.  That’s a huge likelihood ratio difference for 1-pitch PAs. Using our PA1 OBP of about .324 and the 1-pitch likelihood ratio (.109/.089), we’d expect to see a PA1 OBP of .370 given a 1-pitch followup PA, which is exactly what we get. The longer PAs are weighted more toward previous outs because the odds ratio favors outs once we get to 4 pitches.

| Next PA pitches | This PA1 OBP | This PA1 wOBA |
|---|---|---|
| 1 | 0.370 | 0.373 |
| 2 | 0.333 | 0.332 |
| 3 | 0.326 | 0.325 |
| 4 | 0.319 | 0.318 |
| 5 | 0.313 | 0.313 |
| 6 | 0.311 | 0.313 |
| 7 | 0.314 | 0.310 |
| 8 | 0.313 | 0.309 |
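The .370 figure comes from the odds form of Bayes' rule: prior odds of reaching base in PA1, multiplied by how much more often a followup PA of that length happens after reaching base than after an out. A short sketch using the .324 prior and the two "fraction of PA" columns from the earlier table (the function name is mine):

```python
PRIOR_OBP = 0.324  # PA1 OBP quoted in the text

# Pitches in the next PA -> (fraction of PAs after reached base, fraction after out).
NEXT_PA_LENGTH = {
    "1": (0.109, 0.089), "2": (0.164, 0.158), "3": (0.183, 0.182),
    "4": (0.186, 0.191), "5": (0.165, 0.174), "6": (0.112, 0.120),
    "7": (0.049, 0.052), "8+": (0.032, 0.034),
}


def posterior_obp(prior, frac_after_reach, frac_after_out):
    """P(reached base in PA1 | length of the next PA), via prior odds x likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * (frac_after_reach / frac_after_out)
    return posterior_odds / (1.0 + posterior_odds)


POSTERIORS = {
    pitches: posterior_obp(PRIOR_OBP, reach, out)
    for pitches, (reach, out) in NEXT_PA_LENGTH.items()
}

for pitches, p in POSTERIORS.items():
    print(pitches, round(p, 3))
```

Because the fractions are only given to three decimals, the longer buckets come out within a few points of the observed column rather than matching it exactly.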

It seems like this should be a big cause of the observed effect. I used the 2nd/6th and 3rd/7th columns from two tables up to create a process that would “play through” the next 8 PAs starting after an out or a successful PA, deciding on the number of pitches and then whether it was an out or not based on the average values.  Then I calculated the expected OBP for PA1 based on the likelihood ratios of each number of total pitches to happen (the same way I got .370 from the odds ratio for a 1-pitch followup PA).
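Here is one way that kind of "play through" could be sketched. This is a reconstruction under my own assumptions, not the exact process used for the numbers in this post: it computes the pitch-total distribution exactly instead of simulating, treats "8+" as exactly 8 pitches, and only lets each PA depend on whether the previous batter reached base.

```python
PRIOR_OBP = 0.324      # PA1 OBP quoted in the text
N_FOLLOWUP_PAS = 8     # the next 8 PAs after PA1

# Index i holds the value for an (i + 1)-pitch PA, keyed by the previous PA's result.
# Assumption: the "8+" bucket is treated as exactly 8 pitches.
LENGTH = {
    "reach": [0.109, 0.164, 0.183, 0.186, 0.165, 0.112, 0.049, 0.032],
    "out":   [0.089, 0.158, 0.182, 0.191, 0.174, 0.120, 0.052, 0.034],
}
OBP = {
    "reach": [0.362, 0.348, 0.284, 0.299, 0.339, 0.367, 0.393, 0.401],
    "out":   [0.359, 0.343, 0.278, 0.281, 0.323, 0.360, 0.386, 0.405],
}


def total_pitch_distribution(start_state, n_pas=N_FOLLOWUP_PAS):
    """P(total pitches over the next n_pas PAs | PA1 result)."""
    # dist maps (previous PA result, pitches so far) -> probability.
    dist = {(start_state, 0): 1.0}
    for _ in range(n_pas):
        nxt = {}
        for (state, total), prob in dist.items():
            for i, p_len in enumerate(LENGTH[state]):
                p_reach = OBP[state][i]
                for new_state, p_res in (("reach", p_reach), ("out", 1.0 - p_reach)):
                    key = (new_state, total + i + 1)
                    nxt[key] = nxt.get(key, 0.0) + prob * p_len * p_res
        dist = nxt
    totals = {}
    for (_, total), prob in dist.items():
        totals[total] = totals.get(total, 0.0) + prob
    return totals


AFTER_REACH = total_pitch_distribution("reach")
AFTER_OUT = total_pitch_distribution("out")


def model_pa1_obp(total_pitches):
    """Expected PA1 OBP given the intervening pitch total, via the likelihood ratio."""
    prior_odds = PRIOR_OBP / (1.0 - PRIOR_OBP)
    odds = prior_odds * AFTER_REACH[total_pitches] / AFTER_OUT[total_pitches]
    return odds / (1.0 + odds)


for pitches in range(21, 41):
    print(pitches, round(model_pa1_obp(pitches), 3))
```

The point of the sketch is the direction of the effect: low intervening totals get an expected PA1 OBP above .324 and high totals get one below it, with no causal link from the intervening pitches back to PA1 at all.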

As it turns out, that effect alone can reproduce the shape and a little over half the spread

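| Intervening pitches | PA1 OBP (adj) | Model PA1 OBP |
|---|---|---|
| <=20 | 0.366 | 0.340 |
| 21 | 0.351 | 0.336 |
| 22 | 0.349 | 0.329 |
| 23 | 0.339 | 0.338 |
| 24 | 0.343 | 0.332 |
| 25 | 0.336 | 0.328 |
| 26 | 0.335 | 0.327 |
| 27 | 0.335 | 0.328 |
| 28 | 0.328 | 0.328 |
| 29 | 0.325 | 0.326 |
| 30 | 0.320 | 0.326 |
| 31 | 0.324 | 0.323 |
| 32 | 0.324 | 0.323 |
| 33 | 0.318 | 0.321 |
| 34 | 0.318 | 0.324 |
| 35 | 0.317 | 0.323 |
| 36 | 0.312 | 0.317 |
| 37 | 0.313 | 0.318 |
| 38 | 0.307 | 0.320 |
| 39 | 0.320 | 0.310 |
| 40 | 0.300 | 0.317 |
| 41-42 | 0.308 | 0.309 |
| 43+ | 0.320 | 0.317 |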

and that simple model is deficient in a number of ways (correlations longer than 1 PA, different batters, base-out states, etc).  I don’t know everything that’s causing the effect, but I have a good chunk of it, and that reverse pitch count selection bias isn’t something I’ve ever seen mentioned before.  It’s also a caution for any kind of analysis involving pitch counts: be very careful to avoid walking into this effect.

 

A look at DRC+’s guts (part 1 of N)

In trying to better understand what DRC+ changed with this iteration, I extracted the “implied” run values for each event by finding the best linear fit to DRAA over the last 5 seasons.  To avoid regression hell (and the nonsense where walks can be worth negative runs when pitchers draw them), I only used players with 400+ PA.  To make sure this approach actually produces reasonable values, I did the same for wRAA.

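| Relative to average out | 1b | 2b | 3b | hr | bb | hbp | k | bip out |
|---|---|---|---|---|---|---|---|---|
| old DRAA | 0.419 | 0.416 | 0.75 | 1.37 | 0.44 | 0.41 | -0.08 | 0.03 |
| new DRAA | 0.48 | 0.57 | 0.56 | 1.36 | 0.44 | 0.49 | -0.06 | 0.02 |
| wRAA | 0.70 | 1.00 | 1.27 | 1.53 | 0.54 | 0.60 | 0.01 | 0.00 |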
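For anyone who wants to try the same kind of fit, here is a minimal sketch of an ordinary least-squares regression of a run total on event counts, restricted to 400+ PA players. The dictionary keys are hypothetical, and the step that re-bases the coefficients (relative to an average out or an average PA) is not shown, so treat this as an illustration of the idea rather than the exact procedure behind the table above.

```python
EVENTS = ("1b", "2b", "3b", "hr", "bb", "hbp", "k", "bip_out")


def least_squares(rows, targets):
    """Solve min ||X b - y|| through the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def implied_run_values(players, runs_key="draa", min_pa=400):
    """players: dicts with a 'pa' count, one count per event, and a runs total.

    The key names ('pa', 'draa', the event names) are assumptions for this sketch.
    """
    qualified = [p for p in players if p["pa"] >= min_pa]
    rows = [[p[event] for event in EVENTS] for p in qualified]
    targets = [p[runs_key] for p in qualified]
    return dict(zip(EVENTS, least_squares(rows, targets)))
```

Running the same function with `runs_key` pointed at a wRAA column is the sanity check: if the approach is sound, it should hand back something close to the accepted linear weights.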

Those are basically the accepted linear weights in the wRAA row, but DRAA seems to have some confusion around the doubles.  In the first iteration, doubles came out worth fewer runs than singles, and in the new iteration, triples come out worth fewer runs than doubles.  Pepsi might be ok, but that’s not.

If we force the 1b/2b/3b ratio to conform to the wRAA ratios and regress again (on 6 free variables instead of 8), then we get something else interesting.

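| Relative to average PA | 1b | 2b | 3b | hr | bb | hbp | k | bip out |
|---|---|---|---|---|---|---|---|---|
| old DRAA | 0.22 | 0.38 | 0.52 | 1.16 | 0.28 | 0.24 | -0.24 | -0.13 |
| new DRAA | 0.26 | 0.45 | 0.62 | 1.17 | 0.26 | 0.30 | -0.24 | -0.15 |
| wRAA | 0.44 | 0.74 | 1.01 | 1.27 | 0.27 | 0.33 | -0.26 | -0.27 |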

Old DRAA was made up of about 90% of TTO runs and 50% of BIP runs, and that changed to about 90% of TTO runs and 60% of BIP runs in the new iteration.  So it’s like the component wOBA breakdown Tango was doing recently, except regressing the TTO component 10% and the BIP part 40% (down from 50%).
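The "about 90% / 50% / 60%" figures are just each DRAA coefficient divided by the matching wRAA coefficient from the constrained fit. A quick sketch of that arithmetic, using the values from the "relative to average PA" table:

```python
EVENTS = ("1b", "2b", "3b", "hr", "bb", "hbp", "k", "bip out")

# Coefficients from the constrained fit (relative to average PA).
OLD_DRAA = (0.22, 0.38, 0.52, 1.16, 0.28, 0.24, -0.24, -0.13)
NEW_DRAA = (0.26, 0.45, 0.62, 1.17, 0.26, 0.30, -0.24, -0.15)
WRAA = (0.44, 0.74, 1.01, 1.27, 0.27, 0.33, -0.26, -0.27)


def share_of_wraa(draa):
    """Fraction of the wRAA run value that DRAA credits, event by event."""
    return {event: round(d / w, 2) for event, d, w in zip(EVENTS, draa, WRAA)}


OLD_SHARE = share_of_wraa(OLD_DRAA)
NEW_SHARE = share_of_wraa(NEW_DRAA)

for event in EVENTS:
    print(f"{event:>7}: old {OLD_SHARE[event]:.2f}  new {NEW_SHARE[event]:.2f}")
```

The ball-in-play events (1b, 2b, 3b, bip out) cluster around 0.5 in the old version and 0.6 in the new one, while hr and k sit a little above 0.9 in both. The bb and hbp ratios are noisier, so "about 90%" for the TTO group is a loose summary.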

I also noticed that there was something strange about the total DRAA itself.  In theory, the aggregate runs above average should be 0 each year, but the new version of DRAA managed to uncenter itself by a couple of percent (about -2% of total runs scored each season).

| Year | old DRAA | new DRAA |
|---|---|---|
| 2010 | 210.8 | -559.1 |
| 2011 | 127.9 | -550.0 |
| 2012 | 226.8 | -735.9 |
| 2013 | 190.4 | -447.5 |
| 2014 | 33.7 | -659.9 |
| 2015 | 60.1 | -89.1 |
| 2016 | 63.3 | -401.2 |
| 2017 | -37.8 | -318.3 |
| 2018 | -50.2 | -240.4 |

Breaking that down into full-time players (400+ PA), part-time position players (<400 PA), and pitchers, we get

| 2010-18 runs | old DRAA | new DRAA | wRAA |
|---|---|---|---|
| Full-time | 13912 | 11223 | 15296 |
| Part-time | -6033 | -7850 | -9202 |
| Pitchers | -7054 | -7369 | -6730 |
| Total | 825 | -3996 | -636 |

I don’t know why it decided players suddenly deserved 4800 fewer runs, but here we are, and it took 520 offensive BWARP (10% of their total) away from the batters in this iteration too, so it didn’t recalibrate at that step either.  This isn’t an intentional change in replacement level or anything like that. It’s just the machine going haywire again without sufficient internal or external quality control.
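The "4800 fewer runs" is the change in the total row between the old and new versions. A few lines confirm that the group rows add up to the totals and give the size of the swing:

```python
# 2010-18 runs by group: (old DRAA, new DRAA, wRAA), from the table above.
GROUPS = {
    "full-time": (13912, 11223, 15296),
    "part-time": (-6033, -7850, -9202),
    "pitchers": (-7054, -7369, -6730),
}

# Column sums across the three groups.
TOTALS = tuple(sum(column) for column in zip(*GROUPS.values()))
print(TOTALS)  # (825, -3996, -636)

# Change in total DRAA from the old version to the new one.
SWING = TOTALS[1] - TOTALS[0]
print(SWING)  # -4821
```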