Gong uses AI to give you deeper insights into what’s going on in your business. We use a variety of AI models and measure performance per model, in different ways, to make sure the AI is accurate and optimized. How we measure performance depends on the type of model we’re using. Smart trackers, for example, are based on in-house models, and we measure hit rate (how many snippets were correctly detected) and precision (how many of the snippets we detected as positive were actually positive). To learn more about how we estimate performance for smart trackers, see this article.
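To make these two metrics concrete, here’s a minimal sketch of how hit rate and precision could be computed over a labeled evaluation set. This is an illustration only, not Gong’s actual evaluation code; the snippet IDs and function names are assumptions.

```python
# Minimal sketch: computing hit rate and precision over labeled snippets.
# Not Gong's evaluation code; the data below is purely illustrative.

def hit_rate(detected: set, truth: set) -> float:
    """Share of truly positive snippets that the model detected."""
    return len(detected & truth) / len(truth) if truth else 0.0

def precision(detected: set, truth: set) -> float:
    """Share of detected snippets that were actually positive."""
    return len(detected & truth) / len(detected) if detected else 0.0

# Example: 10 snippets are truly positive; the model flags 8, of which 7 are correct.
truth = set(range(10))
detected = set(range(7)) | {42}

print(f"hit rate:  {hit_rate(detected, truth):.0%}")   # 70%
print(f"precision: {precision(detected, truth):.1%}")  # 87.5%
```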
Features like Call spotlight and Ask anything use generative AI based on large language models (LLMs). Multiple elements go into these models, including different LLMs, the type of data they receive, and the types of prompts used to generate responses. We evaluate all of these elements on an ongoing basis using the Elo rating system to make sure that the model we’re using is optimized.
Making sure we are always improving
The Elo rating system was originally developed for calculating the relative skill level of chess players. Today, this system is used to rank a variety of competitive games, as well as LLM performance. At Gong, we’ve adopted the Elo system to compare and measure the performance of AI models at a per-feature level. This includes measuring the systems that support the models, what data is brought into the models, and how that data is handled.
The Elo rating system has a variety of benefits that make it particularly suited to measuring AI performance:
Objectivity: It provides a clear, numerical value, making it easy for everyone to understand the advancements we’re making.
Simple comparison: It enables us to directly compare new model versions with older ones, so that when we update a model we know that the new one is an improvement.
Widely recognized: It’s a universally accepted rating method.
How we implement the Elo rating system
We use the Elo system to determine if we should adopt a new model in a number of situations. These include:
When we upgrade the underlying LLMs
When we update prompts
When we improve the input or data we give to the prompt
When we migrate from off-the-shelf models to fine-tuned proprietary models
In all of these situations, we follow a structured process to assess performance and see whether the new model outperforms the old one:
Set up a comparison: We create a comparison scenario in which each AI model is tested under similar conditions. Outcomes are categorized as ‘old win’, ‘new win’, or ‘tie’.
Calculate win rate: Each 'new win' scores a point; each 'tie' scores half a point. The total points are then divided by the total number of possible points to get the win rate.
Calculate the Elo difference: Using the win rate, we calculate the Elo difference between the old and new models. This number indicates the level of improvement.
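Steps 2 and 3 can be expressed compactly in code. Here’s a short sketch, assuming the standard Elo expected-score formula (E = 1 / (1 + 10^(-D/400)), inverted to solve for the rating difference D); the function names are ours, for illustration.

```python
import math

def win_rate(new_wins: int, old_wins: int, ties: int) -> float:
    """Each new-model win is worth 1 point, each tie 0.5 points;
    divide by the total number of comparisons (1 possible point each)."""
    total = new_wins + old_wins + ties
    return (new_wins + 0.5 * ties) / total

def elo_difference(rate: float) -> float:
    """Invert the standard Elo expected-score formula,
    E = 1 / (1 + 10**(-D / 400)), giving D = 400 * log10(E / (1 - E))."""
    return 400 * math.log10(rate / (1 - rate))
```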
A typical Elo comparison
Here’s how a typical Elo comparison may look after we update an AI model:
Ask Anything
Old model wins: 23
New model wins: 33
Ties: 41
Win rate calculation for the new model
Points for wins: 33
Points for ties: 20.5 (1/2 point for each tie)
Total points: 33 + 20.5 = 53.5 (out of 97 possible points)
Win rate: 53.5/97 = 55.15%
Applying the Elo formula to this win rate gives an Elo difference of 36 points, indicating that the new model outperforms the old one by 36 points.
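Plugging the numbers from this example into the sketch above reproduces both figures:

```python
rate = win_rate(new_wins=33, old_wins=23, ties=41)
print(f"win rate: {rate:.2%}")                        # win rate: 55.15%
print(f"Elo difference: {elo_difference(rate):.0f}")  # Elo difference: 36
```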
Transparency, accuracy and reliability
We use Elo to ensure that when we change our models, we know the new models perform better than the old ones. We don’t just update a model for the sake of updating it; we confirm that the new model is better for a specific use case, and we let you know when we’ve chosen a new model. For you, this means transparency, accurate and reliable insights, and an even deeper understanding of your customer conversations.