Transparent & Open
Methodology
A framework for evaluating AI chemistry capabilities from subjective community preferences. User votes shape the rankings through rigorous statistical methods.
Community-Driven
Rankings from real user votes
Statistically Rigorous
Bradley-Terry pairwise model
Bias-Free
Anonymous model evaluation
Bradley-Terry Model
The Bradley-Terry model is a probability model that predicts the outcome of pairwise comparisons. For any two models i and j, the probability that i beats j is:
P(i > j) = pᵢ / (pᵢ + pⱼ)
where pᵢ and pⱼ are the "strength" parameters. We use maximum likelihood estimation with iterative updates to find optimal strength estimates.
The algorithm converges when strength estimates stabilize (threshold: 0.0001) or after 200 iterations. Final ratings are converted to an Elo-style scale centered at 1500.
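The fitting procedure described above can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the `Comparison` record shape, the function names, the geometric-mean normalization, and the small floor for models with no wins are all assumptions. It uses the standard iterative update pᵢ ← Wᵢ / Σⱼ nᵢⱼ / (pᵢ + pⱼ), where Wᵢ is the number of wins for model i and nᵢⱼ is the number of comparisons between i and j, and then maps strengths to an Elo-style scale with rating = 1500 + 400 · log₁₀(pᵢ).

```typescript
// Hypothetical record shape: one entry per pairwise vote.
interface Comparison {
  winner: string;
  loser: string;
}

// Fit Bradley-Terry strengths by repeated application of
//   p_i <- W_i / sum_j n_ij / (p_i + p_j)
// stopping when no strength moves by more than `tolerance`
// or after `maxIterations` passes.
function fitBradleyTerry(
  comparisons: Comparison[],
  tolerance: number = 1e-4,
  maxIterations: number = 200
): Record<string, number> {
  // Count wins per model; losers are registered with zero wins.
  const wins: Record<string, number> = {};
  for (const c of comparisons) {
    wins[c.winner] = (wins[c.winner] || 0) + 1;
    wins[c.loser] = wins[c.loser] || 0;
  }
  const models = Object.keys(wins);

  // Start every model at the same strength.
  let strengths: Record<string, number> = {};
  for (const m of models) strengths[m] = 1;

  for (let iter = 0; iter < maxIterations; iter++) {
    // Each comparison adds 1 / (p_i + p_j) to both participants' denominators.
    const denominators: Record<string, number> = {};
    for (const m of models) denominators[m] = 0;
    for (const c of comparisons) {
      const share = 1 / (strengths[c.winner] + strengths[c.loser]);
      denominators[c.winner] += share;
      denominators[c.loser] += share;
    }

    const next: Record<string, number> = {};
    let logSum = 0;
    for (const m of models) {
      // Assumption: floor the strength so a winless model stays finite on the log scale.
      next[m] = Math.max(wins[m] / denominators[m], 1e-9);
      logSum += Math.log(next[m]);
    }

    // Assumption: normalize to geometric mean 1 so the average rating is 1500.
    const geoMean = Math.exp(logSum / models.length);
    let maxChange = 0;
    for (const m of models) {
      next[m] /= geoMean;
      maxChange = Math.max(maxChange, Math.abs(next[m] - strengths[m]));
    }
    strengths = next;
    if (maxChange < tolerance) break;
  }
  return strengths;
}

// Map a strength to an Elo-style rating centered at 1500.
function toEloScale(strength: number): number {
  return 1500 + 400 * (Math.log(strength) / Math.LN10);
}

// Usage: A beats B three times, B beats A once.
const exampleVotes: Comparison[] = [
  { winner: "A", loser: "B" },
  { winner: "A", loser: "B" },
  { winner: "A", loser: "B" },
  { winner: "B", loser: "A" },
];
const exampleStrengths = fitBradleyTerry(exampleVotes);
// With only two models, the fitted ratio of A's strength to B's matches the 3:1 win ratio.
console.log(toEloScale(exampleStrengths["A"]), toEloScale(exampleStrengths["B"]));
```

With the 400 · log₁₀ conversion, the Elo-style win probability 1 / (1 + 10^((R_B − R_A) / 400)) reduces to p_A / (p_A + p_B), so the two scales describe the same model.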
Tournament Structure
Each voting session follows a structured tournament:
1. A prompt is randomly selected from the category pool
2. Four distinct models are chosen from the active model pool
3. All models receive identical prompts with temperature 0.8
4. The first two models to complete are presented anonymously
5. Users vote on the better response
6. Winners face winners, losers face losers
7. This creates 5 pairwise comparisons per session
Each comparison feeds directly into the Bradley-Terry calculations.
Anonymization & Fairness
To ensure unbiased evaluation:
• Model identities are hidden throughout voting
• Responses are presented in randomized order (left/right)
• No identifying information is included in prompts
• Model configurations are standardized where possible
• All methodologies are publicly documented
Models are revealed only after voting to prevent brand bias.
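The left/right randomization and delayed reveal listed above could look roughly like this. It is a sketch under assumptions: the `ModelResponse` and `AnonymizedPair` shapes and the `anonymizePair` function are hypothetical names, and the injectable `random` parameter exists only to make the behavior easy to check.

```typescript
// Hypothetical shapes for illustration only.
interface ModelResponse {
  modelId: string;
  text: string;
}

interface AnonymizedPair {
  // Only the response texts are shown to the voter.
  left: string;
  right: string;
  // Identities are held back and disclosed only after the vote is cast.
  reveal: { left: string; right: string };
}

// Randomly assign the two responses to the left and right slots.
function anonymizePair(
  a: ModelResponse,
  b: ModelResponse,
  random: () => number = Math.random
): AnonymizedPair {
  const swap = random() < 0.5;
  const first = swap ? b : a;
  const second = swap ? a : b;
  return {
    left: first.text,
    right: second.text,
    reveal: { left: first.modelId, right: second.modelId },
  };
}
```

Keeping the model identifiers in a separate `reveal` field makes it straightforward to send only `left` and `right` to the voter until the vote is recorded.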
Real-Time Updates
Rankings update continuously as votes come in:
• Each vote immediately affects the leaderboard
• Bradley-Terry ratings are recalculated in real time
• New models appear with "New" status until 50+ comparisons
• Confidence intervals shrink as more data accumulates
• Historical data is preserved for trend analysis
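Two of the points above can be illustrated with a short sketch. The "New" threshold of 50 comparisons comes from the list; the "Ranked" label, the function names, and the interval formula are assumptions. In particular, the half-width below is the textbook normal approximation for an observed win rate, used only to show how uncertainty shrinks with more data, and is not necessarily how the leaderboard computes its intervals.

```typescript
// Label shown next to a model; "Ranked" is a hypothetical name for the non-new state.
function leaderboardStatus(comparisonCount: number): "New" | "Ranked" {
  return comparisonCount >= 50 ? "Ranked" : "New";
}

// Approximate 95% half-width for an observed win rate after `n` comparisons,
// using the normal approximation 1.96 * sqrt(p * (1 - p) / n).
function winRateHalfWidth(winRate: number, n: number): number {
  return 1.96 * Math.sqrt((winRate * (1 - winRate)) / n);
}

// Quadrupling the number of comparisons halves the interval width.
console.log(winRateHalfWidth(0.5, 100), winRateHalfWidth(0.5, 400));
```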
Implementation Example
// Bradley-Terry probability calculation
function predictWinProbability(ratingA: number, ratingB: number): number {
  // Convert Elo ratings to win probability
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Example: Model A (Elo 1600) vs Model B (Elo 1500)
const probability = predictWinProbability(1600, 1500);
// Result: 0.64 (64% chance A wins)
Limitations & Considerations
• Rankings reflect subjective human preferences, not objective correctness
• Models with fewer comparisons have higher rating uncertainty
• Prompt selection and phrasing can influence model performance
• Voter expertise varies; chemistry experts may evaluate differently than novices
• Model versions are updated regularly; historical comparisons may not reflect current capabilities
Questions about our methodology? Reach us at contact@chemistryarena.ai