Transparent & Open

Methodology

A framework for evaluating AI chemistry capabilities from subjective human preferences. Community votes are aggregated into rankings with a Bradley-Terry pairwise comparison model.

Community-Driven

Rankings from real user votes

Statistically Rigorous

Bradley-Terry pairwise model

Bias-Free

Anonymous model evaluation

Bradley-Terry Model

The Bradley-Terry model is a probability model that predicts the outcome of pairwise comparisons. For any two models i and j, the probability that i beats j is:

P(i > j) = pᵢ / (pᵢ + pⱼ)

where pᵢ and pⱼ are the "strength" parameters of the two models. We use maximum likelihood estimation with iterative updates to find optimal strength estimates. The algorithm converges when strength estimates stabilize (threshold: 0.0001) or after 200 iterations. Final ratings are converted to an Elo-style scale centered at 1500.
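The fitting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: it assumes the common iterative (minorization-maximization) update for Bradley-Terry strengths, assumes the Elo-style conversion is 1500 + 400 · log10(pᵢ) with strengths rescaled to a geometric mean of 1, and floors strengths at a small positive value so that models with no wins do not produce log(0). The names `Comparison`, `FittedModel`, and `fitBradleyTerry` are hypothetical.

```typescript
// Hypothetical record of one vote; field names are illustrative only.
interface Comparison {
  winner: string;
  loser: string;
}

interface FittedModel {
  model: string;
  strength: number; // Bradley-Terry strength p_i, geometric mean rescaled to 1
  rating: number;   // assumed mapping: 1500 + 400 * log10(p_i)
}

function fitBradleyTerry(
  comparisons: Comparison[],
  maxIterations: number = 200, // iteration cap stated in the text
  tolerance: number = 0.0001   // convergence threshold stated in the text
): FittedModel[] {
  // Assign a dense index to every model that appears in a comparison.
  const indexOf = new Map<string, number>();
  const names: string[] = [];
  const idFor = (name: string): number => {
    let id = indexOf.get(name);
    if (id === undefined) {
      id = names.length;
      indexOf.set(name, id);
      names.push(name);
    }
    return id;
  };

  const pairs = comparisons.map((c) => ({ w: idFor(c.winner), l: idFor(c.loser) }));
  const n = names.length;
  const wins: number[] = names.map(() => 0);
  const games: number[][] = names.map(() => names.map(() => 0));
  for (const { w, l } of pairs) {
    wins[w] += 1;
    games[w][l] += 1;
    games[l][w] += 1;
  }

  // Rescale so the geometric mean is 1. The floor is an assumption of this
  // sketch: it keeps log() finite for models that have not won any comparison.
  const normalize = (p: number[]): number[] => {
    const floored = p.map((x) => Math.max(x, 1e-9));
    const meanLog = floored.reduce((s, x) => s + Math.log(x), 0) / floored.length;
    return floored.map((x) => x / Math.exp(meanLog));
  };

  let p: number[] = names.map(() => 1);
  for (let iter = 0; iter < maxIterations; iter++) {
    // MM update: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
    const updated = p.map((pi, i) => {
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j !== i) denom += games[i][j] / (pi + p[j]);
      }
      return denom > 0 ? wins[i] / denom : pi;
    });
    const next = normalize(updated);
    const maxChange = Math.max(...next.map((x, i) => Math.abs(x - p[i])));
    p = next;
    if (maxChange < tolerance) break; // estimates have stabilized
  }

  return names.map((model, i) => ({
    model,
    strength: p[i],
    rating: 1500 + 400 * Math.log10(p[i]),
  }));
}
```

With only two models, the fitted strength ratio equals the ratio of their wins, so three wins for A against one for B yields strengths in a 3:1 ratio and ratings placed symmetrically around 1500.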

Tournament Structure

Each voting session follows a structured tournament:

1. A prompt is randomly selected from the category pool
2. Four distinct models are chosen from the active model pool
3. All models receive identical prompts with temperature 0.8
4. The first two models to complete are presented anonymously
5. Users vote on the better response
6. Winners face winners, losers face losers
7. This creates 5 pairwise comparisons per session

Each comparison feeds directly into the Bradley-Terry calculations.
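Steps 1 to 3 of the session setup might look something like the sketch below. It assumes uniform random sampling without replacement (a partial Fisher-Yates shuffle), which the text does not specify, and it does not attempt to model the pairing rounds. `sampleDistinct`, `setUpSession`, and `SessionSetup` are hypothetical names; the `rng` parameter exists only so the draw can be made deterministic in tests.

```typescript
// Draw `count` distinct items from `pool` with a partial Fisher-Yates shuffle.
function sampleDistinct<T>(
  pool: T[],
  count: number,
  rng: () => number = Math.random
): T[] {
  if (count > pool.length) {
    throw new Error("pool is smaller than the requested sample");
  }
  const copy = pool.slice(); // leave the caller's array untouched
  for (let i = 0; i < count; i++) {
    // Pick a random index from the not-yet-chosen tail [i, copy.length).
    const j = i + Math.floor(rng() * (copy.length - i));
    const tmp = copy[i];
    copy[i] = copy[j];
    copy[j] = tmp;
  }
  return copy.slice(0, count);
}

// Hypothetical description of one voting session before any votes are cast.
interface SessionSetup {
  prompt: string;
  models: string[];
  temperature: number;
}

function setUpSession(
  categoryPrompts: string[],
  activeModels: string[],
  rng: () => number = Math.random
): SessionSetup {
  return {
    prompt: sampleDistinct(categoryPrompts, 1, rng)[0], // step 1
    models: sampleDistinct(activeModels, 4, rng),       // step 2
    temperature: 0.8,                                   // step 3
  };
}
```

Sampling without replacement is what guarantees the four models are distinct; drawing four independent random indices would occasionally pick the same model twice.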

Anonymization & Fairness

To ensure unbiased evaluation:
• Model identities are hidden throughout voting
• Responses are presented in randomized order (left/right)
• No identifying information is included in prompts
• Model configurations are standardized where possible
• All methodologies are publicly documented

Models are revealed only after voting to prevent brand bias.

Real-Time Updates

Rankings update continuously as votes come in:
• Each vote immediately affects the leaderboard
• Bradley-Terry recalculation happens in real time
• New models carry a "New" status until they have 50 or more comparisons
• Confidence intervals shrink as more data accumulates
• Historical data is preserved for trend analysis
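As a rough illustration of the "New" status rule and of why intervals narrow with more data, consider the sketch below. It assumes the status switches once a model reaches 50 comparisons, and it uses a simple normal-approximation interval on an observed win rate purely to show the 1/√n behavior; the site's actual confidence-interval method is not described here. The label "Ranked" and both function names are hypothetical.

```typescript
// Assumed cutoff from the text: "New" until a model has 50 or more comparisons.
const NEW_MODEL_THRESHOLD = 50;

type LeaderboardStatus = "New" | "Ranked"; // "Ranked" is a hypothetical label

function leaderboardStatus(comparisonCount: number): LeaderboardStatus {
  return comparisonCount < NEW_MODEL_THRESHOLD ? "New" : "Ranked";
}

// Half-width of a 95% normal-approximation interval for an observed win rate.
// Illustrative only: it shows that the width scales with 1 / sqrt(n).
function winRateHalfWidth(winRate: number, comparisonCount: number): number {
  if (comparisonCount <= 0) return Infinity;
  return 1.96 * Math.sqrt((winRate * (1 - winRate)) / comparisonCount);
}
```

Under this approximation, quadrupling the number of comparisons halves the interval width, which is why early ratings for new models should be read with caution.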

Implementation Example

// Win probability from Elo-style ratings
// (the Bradley-Terry probability expressed on the rating scale)
function predictWinProbability(ratingA: number, ratingB: number): number {
  // A 400-point rating gap corresponds to 10:1 odds in favor of the higher-rated model
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Example: Model A (Elo 1600) vs Model B (Elo 1500)
const probability = predictWinProbability(1600, 1500);
// Result: ≈ 0.64 (64% chance A wins)
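
The function above works on Elo-style ratings, while the model definition earlier is written in terms of strengths. The two forms agree if ratings and strengths are related by pᵢ = 10^(Rᵢ/400); that mapping is an assumption here, since the exact conversion is not spelled out. Under it, the derivation is:

```latex
\[
P(i > j)
  = \frac{p_i}{p_i + p_j}
  = \frac{10^{R_i/400}}{10^{R_i/400} + 10^{R_j/400}}
  = \frac{1}{1 + 10^{(R_j - R_i)/400}}
\]
```

For a 100-point gap this gives 1 / (1 + 10^(-0.25)) ≈ 0.64, matching the example.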

Limitations & Considerations

• Rankings reflect subjective human preferences, not objective correctness
• Models with fewer comparisons have higher rating uncertainty
• Prompt selection and phrasing can influence model performance
• Voter expertise varies; chemistry experts may evaluate differently than novices
• Model versions are updated regularly; historical comparisons may not reflect current capabilities

Ready to contribute?

Your votes help build the most authentic chemistry AI benchmark.

Questions about our methodology? Reach us at contact@chemistryarena.ai