As discussed in “The Independent Chip Model of Politics and HoF Voting” and “Bill James and the Trump polarization problem”, the system doesn’t work. There was a Twitter exchange where Bill seemed hard-pressed to say anything more profound than the trivial and self-referential “the algorithm prints the output of the algorithm”. If Bill were right, and the algorithm actually measured any objective value, then it shouldn’t much matter which sets of polls were run, as long as there was enough mixing in the poll groups to link everybody together (polls of only R candidates and polls of only D candidates don’t say anything about how Rs would do against Ds, etc).

If the ICM distribution in the other post held, it wouldn’t matter whether the population were sampled with one poll including everybody, sets of 4-person polls, sets of 3-person polls, sets of 2-person polls where one matchup was polled 100x more than the rest, or anything else. They would all generate the same support scores. This is the only distribution with that property. We know the ICM distribution doesn’t reflect reality, but how dependent on the sampling method does that make the support scores? Well…

As a toy model of the election, start with 4 candidates A/B/C/D who follow the ICM distribution with starting stacks/support in a 4:3:2:1 ratio. From this, we can calculate the probabilities of each of the 4! order preferences. For example, ABCD order has a probability of (4/(4+3+2+1)) * (3/(3+2+1)) * (2/(2+1)) = 13.33% and DCBA has a probability of (1/(4+3+2+1)) * (2/(4+3+2)) * (3/(4+3)) = 0.95%. Since we know everybody’s order preferences, and we make the friendly assumption that we can always sample the population completely and that the preferences never change, we can generate the result of any poll and calculate support scores. In this example, no matter how we poll, the 4:3:2:1 ratio holds and the support scores (normalized to add up to 10,000) are A: 4,000 B: 3,000 C: 2,000 D: 1,000.
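The arithmetic above can be checked with a short sketch. It assumes the ICM order probabilities are built by repeatedly picking the next candidate in proportion to remaining stacks, as in the two worked examples, and that each voter in a poll backs the highest-ranked polled candidate on their list. The names `icm_order_probability` and `poll_shares` are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Starting stacks/support from the toy model (4:3:2:1).
STACKS = {"A": 4, "B": 3, "C": 2, "D": 1}


def icm_order_probability(order, stacks):
    """Probability of one complete preference order under ICM."""
    remaining = sum(stacks[c] for c in order)
    prob = Fraction(1)
    for c in order:
        prob *= Fraction(stacks[c], remaining)  # chance c is picked next
        remaining -= stacks[c]
    return prob


def poll_shares(order_probs, candidates):
    """First-choice share of each polled candidate.

    Assumption: every voter picks the highest-ranked polled candidate
    in their own preference order.
    """
    shares = {c: Fraction(0) for c in candidates}
    for order, prob in order_probs.items():
        top = next(c for c in order if c in candidates)
        shares[top] += prob
    return shares


# All 4! = 24 preference orders and their probabilities.
order_probs = {
    "".join(p): icm_order_probability(p, STACKS) for p in permutations(STACKS)
}

print(float(order_probs["ABCD"]))      # about 0.1333
print(float(order_probs["DCBA"]))      # about 0.0095
print(poll_shares(order_probs, "AD"))  # A and D split 4:1
```

Any subset of candidates polled this way splits in proportion to the starting stacks, which is why the poll structure cannot move the scores in the pure ICM case.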

Now let’s throw a wrench in this by adding candidate T, who gets 40% in every poll he’s involved in and is everybody’s first or last choice. For simplicity, and to make a point later, we’ll treat this as a population with 48 order preferences: 24 with T in 1st followed by the ABCD ICM distribution above for 2nd-5th, and 24 with the ABCD ICM distribution for 1st-4th followed by T in 5th (P(TABCD) = 0.4 * 13.33%, P(ABCDT) = 0.6 * 13.33%, etc). Now we’ll poll this in different ways. Because we can generate any poll we want, we can poll every possible combination once and see what the support scores are. The only variable is how many candidates are included in each poll, and that gives the following support scores:

Candidates per poll (number of polls) | 5 (1) | 4 (5) | 3 (10) | 2 (10) |
T | 4000 | 3340 | 2513 | 1429 |
A | 2400 | 2628 | 2891 | 3189 |
B | 1800 | 2001 | 2243 | 2525 |
C | 1200 | 1351 | 1551 | 1813 |
D | 600 | 681 | 802 | 1044 |
A/ABCD (%) | 40 | 39.4 | 38.6 | 37.2 |
B/ABCD (%) | 30 | 30.0 | 30.0 | 29.5 |
C/ABCD (%) | 20 | 20.3 | 20.7 | 21.2 |
D/ABCD (%) | 10 | 10.2 | 10.7 | 12.2 |
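The poll results that feed these scores follow mechanically from the 48-order population. The sketch below builds that population and generates every poll of a given size, assuming each voter backs the highest-ranked polled candidate on their list. It deliberately stops at the raw poll shares: the aggregation that turns polls into support scores isn’t spelled out in this post, so it isn’t reproduced here. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations

STACKS = {"A": 4, "B": 3, "C": 2, "D": 1}  # ICM stacks for the ABCD group
T_FIRST = Fraction(2, 5)                   # 40% of voters rank T first


def icm_order_probability(order, stacks):
    """Probability of one complete preference order under ICM."""
    remaining = sum(stacks[c] for c in order)
    prob = Fraction(1)
    for c in order:
        prob *= Fraction(stacks[c], remaining)
        remaining -= stacks[c]
    return prob


def build_population():
    """48 orders: T first or T last, ABCD in ICM order otherwise."""
    population = {}
    for perm in permutations(STACKS):
        base = "".join(perm)
        p = icm_order_probability(perm, STACKS)
        population["T" + base] = T_FIRST * p        # e.g. P(TABCD)
        population[base + "T"] = (1 - T_FIRST) * p  # e.g. P(ABCDT)
    return population


def poll_shares(population, candidates):
    """Share of each polled candidate (assumption: voters pick their
    highest-ranked candidate among those polled)."""
    shares = {c: Fraction(0) for c in candidates}
    for order, prob in population.items():
        top = next(c for c in order if c in candidates)
        shares[top] += prob
    return shares


def all_polls(population, size):
    """Every possible poll with `size` of the five candidates, run once."""
    return {
        "".join(combo): poll_shares(population, combo)
        for combo in combinations("TABCD", size)
    }


population = build_population()
for size in (5, 4, 3, 2):
    polls = all_polls(population, size)
    print(size, "candidates per poll:", len(polls), "polls")
```

In every poll that includes T, his share comes out at exactly 40%, while polls without him split in pure ICM proportions.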

Even with T thrown in, the relative behavior of the ICM-compliant ABCD group stays mostly reasonable regardless of which poll size is used. T, however, ranges from a commanding first place to a distant fourth place depending on the poll size. Even without trying to define “support” in a meaningful, non-self-referential way, it’s obvious that claiming that any one of the 4 aggregated support numbers *is* T’s support (and that the other 3 are not) is ludicrous. The aggregation clearly isn’t measuring anything when it can massively flip 3 ordinal rankings based only on changing poll sizes.
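As a quick check on the ranking claim, the snippet below takes the support scores straight from the table and computes T’s ordinal rank for each poll size. The dictionary layout and the `rank_of` helper are made up for illustration; the numbers are the ones reported above.

```python
# Support scores copied from the table, keyed by candidates per poll.
SCORES = {
    5: {"T": 4000, "A": 2400, "B": 1800, "C": 1200, "D": 600},
    4: {"T": 3340, "A": 2628, "B": 2001, "C": 1351, "D": 681},
    3: {"T": 2513, "A": 2891, "B": 2243, "C": 1551, "D": 802},
    2: {"T": 1429, "A": 3189, "B": 2525, "C": 1813, "D": 1044},
}


def rank_of(candidate, column):
    """1-based rank of a candidate within one column of scores."""
    ordered = sorted(column, key=column.get, reverse=True)
    return ordered.index(candidate) + 1


t_ranks = {size: rank_of("T", column) for size, column in SCORES.items()}
print(t_ranks)  # {5: 1, 4: 1, 3: 2, 2: 4}
```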

Integrating different factions (that are strongly ICM-noncompliant with each other) into one list doesn’t work at all. The algorithm can spit out a random number, but that number is hugely dependent on procedural choices that shouldn’t make much difference if the methodology actually worked. Any particular output for any particular choice clearly doesn’t mean anything, and there’s almost no point in even calculating or reporting it.