July 2026

Diversity is the strength of the AI crowd

Abstract

Top AI forecasting systems perform similarly to skilled humans by combining frontier LLMs with forecasting-specific context gathering and scaffolding. We study how to improve this recipe through ensembling: given a fixed number of samples, which off-the-shelf model forecasts should be combined to maximize forecasting performance? On binary questions from the Metaculus AI Benchmark, we find that standalone performance is not enough for strong ensembling: additional forecasts add little when they come from frontier LLMs with highly correlated predictions. Instead, the strongest ensembles combine accurate but diverse forecasters. Among the off-the-shelf frontier models we study, Grok 4 is especially valuable because its predictions are less correlated with those of Gemini 3 Pro and GPT-5. These results suggest that the strength of the AI crowd comes not from sampling more forecasts indiscriminately, but from combining forecasts across models with complementary errors. This motivates forecasting systems that explicitly optimize for both model quality and diversity.
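The claim that accurate but diverse forecasters ensemble best can be illustrated with a standard identity for mean-aggregated forecasts under squared error (the ambiguity decomposition): the Brier score of the ensemble mean equals the average individual Brier score minus the average squared disagreement with the ensemble mean. The sketch below is only an illustration under stated assumptions, not the paper's method. It assumes unweighted mean aggregation and Brier scoring, and the forecasts, outcomes, and function names are hypothetical toy values, not data or results from the paper.

```python
from statistics import mean


def brier(probs, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def ensemble_mean(forecasts):
    """Per-question unweighted mean across forecasters (rows = forecasters)."""
    return [mean(column) for column in zip(*forecasts)]


def diversity(forecasts):
    """Average squared deviation of each forecaster from the ensemble mean."""
    ens = ensemble_mean(forecasts)
    return mean(
        mean((p - e) ** 2 for p, e in zip(row, ens)) for row in forecasts
    )


def decompose(forecasts, outcomes):
    """Return (ensemble Brier, average individual Brier, diversity)."""
    ens_brier = brier(ensemble_mean(forecasts), outcomes)
    avg_individual = mean(brier(row, outcomes) for row in forecasts)
    return ens_brier, avg_individual, diversity(forecasts)


# Hypothetical toy data: four binary questions, three forecasters.
outcomes = [1, 0, 1, 0]
model_a = [0.8, 0.3, 0.6, 0.2]
model_b = [0.8, 0.3, 0.6, 0.2]  # copy of model_a: perfectly correlated
model_c = [0.6, 0.2, 0.8, 0.3]  # same Brier score as model_a, different errors

correlated = decompose([model_a, model_b], outcomes)
diverse = decompose([model_a, model_c], outcomes)

for name, (ens, avg, div) in [("correlated", correlated), ("diverse", diverse)]:
    print(f"{name}: ensemble={ens:.5f} avg_individual={avg:.5f} diversity={div:.5f}")
```

In this toy setup both pairs have the same average individual score, so any difference in ensemble score comes entirely from the diversity term: the duplicated forecaster adds nothing, while the equally accurate but differently wrong forecaster lowers the ensemble's Brier score.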

Authors

Matthew Aitchison, Scott Jeen, Toby Shevlane, Ben Day

Venue

ICML 2026 Workshop on Forecasting as a New Frontier of Intelligence