Gong uses AI to give you deeper insights into what’s going on in your business. We use a variety of AI models and measure performance per model, in different ways, to make sure the AI is accurate and optimized. How we measure performance depends on the type of model we’re using. Smart trackers, for example, are based on in-house models, and we measure hit rate (how many snippets were correctly detected) and precision (how many of the snippets we detected as positive were actually positive). To learn more about how we estimate performance for smart trackers, see this article.
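To make these two metrics concrete, here’s a minimal sketch of how hit rate and precision could be computed over a labeled evaluation set. This is an illustration only, not Gong’s actual evaluation code; the snippet IDs and function names are assumptions.

```python
# Minimal sketch: computing hit rate and precision over labeled snippets.
# Not Gong's evaluation code; the data below is purely illustrative.

def hit_rate(detected: set, truth: set) -> float:
    """Share of truly positive snippets that the model detected."""
    return len(detected & truth) / len(truth) if truth else 0.0

def precision(detected: set, truth: set) -> float:
    """Share of detected snippets that were actually positive."""
    return len(detected & truth) / len(detected) if detected else 0.0

# Example: 10 snippets are truly positive; the model flags 8, of which 7 are correct.
truth = set(range(10))
detected = set(range(7)) | {42}

print(f"hit rate:  {hit_rate(detected, truth):.0%}")   # 70%
print(f"precision: {precision(detected, truth):.1%}")  # 87.5%
```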
Features like Call spotlight and Ask anything use generative AI based on large language models (LLMs). Multiple elements go into these models, including different LLMs, the type of data they receive, and the types of prompts used to generate responses. We evaluate all of these elements on an ongoing basis using the Elo rating system to make sure that the model we’re using is optimized.
Making sure we are always improving
The Elo rating system was originally developed for calculating the relative skill level of chess players. Today, this system is used to rank a variety of competitive games, as well as LLM performance. At Gong, we’ve adopted the Elo system to compare and measure the performance of AI models at a per-feature level. This includes measuring the systems that support the models, what data is brought into the models, and how that data is handled.
The Elo rating system has a variety of benefits that make it particularly suited to measuring AI performance:
Objectivity: It provides a clear, numerical value, making it easy for everyone to understand the advancements we’re making.
Simple comparison: It enables us to directly compare new model versions with older ones, so that when we update a model we know that the new one is an improvement.
Widely recognized: It’s a universally accepted rating method.
How we implement the Elo rating system
We use the Elo system to determine if we should adopt a new model in a number of situations. These include:
When we upgrade the underlying LLMs
When we update prompts
When we improve the input or data we give to the prompt
When we migrate from off-the-shelf models to fine-tuned proprietary models
In all of these situations, we follow a structured process to assess performance and see whether the new model outperforms the old one:
Set up a comparison: We create a comparison scenario in which each AI model is tested under similar conditions. Outcomes are categorized as ‘old win’, ‘new win’, or ‘tie’.
Calculate win rate: Each 'new win' scores a point; each 'tie' scores half a point. The total points are then divided by the total number of possible points to get the win rate.
Calculate the Elo difference: Using the win rate, we calculate the Elo difference between the old and new models. This number indicates the level of improvement.
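Steps 2 and 3 can be expressed compactly in code. Here’s a short sketch, assuming the standard Elo expected-score formula (E = 1 / (1 + 10^(-D/400)), inverted to solve for the rating difference D); the function names are ours, for illustration.

```python
import math

def win_rate(new_wins: int, old_wins: int, ties: int) -> float:
    """Each new-model win is worth 1 point, each tie 0.5 points;
    divide by the total number of comparisons (1 possible point each)."""
    total = new_wins + old_wins + ties
    return (new_wins + 0.5 * ties) / total

def elo_difference(rate: float) -> float:
    """Invert the standard Elo expected-score formula,
    E = 1 / (1 + 10**(-D / 400)), giving D = 400 * log10(E / (1 - E))."""
    return 400 * math.log10(rate / (1 - rate))
```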
A typical Elo comparison
Here’s how a typical Elo comparison may look after we update an AI model:
Ask Anything
Old model wins: 23
New model wins: 33
Ties: 41
Win rate calculation for the new model
Points for wins: 33
Points for ties: 20.5 (1/2 point for each tie)
Total points: 33 + 20.5 = 53.5 (out of 97 possible points)
Win rate: 53.5/97 = 55.15%
Applying the Elo formula to this win rate gives an Elo difference of 36 points, indicating that the new model outperforms the old one by 36 points.
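Plugging the numbers from this example into the sketch above reproduces both figures:

```python
rate = win_rate(new_wins=33, old_wins=23, ties=41)
print(f"win rate: {rate:.2%}")                        # win rate: 55.15%
print(f"Elo difference: {elo_difference(rate):.0f}")  # Elo difference: 36
```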
Transparency, accuracy and reliability
We use Elo to ensure that when we change our models, we know the new models perform better than the old ones. We don’t just update a model for the sake of updating it; we confirm that the new model is better for a specific use case, and we let you know when we’ve chosen a new model. For you, this means transparency, accurate and reliable insights, and an even deeper understanding of your customer conversations.