How Gong identifies speakers

Gong analyzes calls using multiple methods to determine who was on the call and when they spoke. This information is used to calculate stats, and is shown on the call page to help you navigate to the relevant parts of the call.

Dividing calls into speaker segments

The first step in speaker identification is dividing the call into segments, each associated with a single speaker. Gong does this in one of two ways:

Conference calls: When Gong records a web conference call, we read the participant list during the call to get a rough estimate of who is present and when each participant speaks. Conferencing systems often report speaker switches with a large delay, so their timing information is frequently inaccurate. To compensate, we apply a proprietary speaker separation algorithm that detects short speech segments (for example, "Yes" or "OK") and attributes them to the correct speaker, even when the conferencing system reported the switch late or not at all.
Telephony calls: When Gong receives a stereo recording, we treat each of the two channels as one speaker and do not attempt to divide the call further.
When Gong receives a mono recording, we separate the single audio channel into as many channels as there are speakers, based on the differences between their voices, in a process known as diarization.
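
As a rough illustration of the stereo case, the sketch below treats each channel as one speaker and marks the frames where that channel is active. It is not Gong's algorithm, which is proprietary. The energy threshold, the frame size, and all function names are assumptions made for this example only.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (speaker label, start frame, end frame exclusive)


def active_frames(samples: List[float], frame_size: int, threshold: float) -> List[bool]:
    """Flag a frame as speech when its mean absolute amplitude exceeds the threshold."""
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        flags.append(energy > threshold)
    return flags


def to_segments(flags: List[bool], speaker: str) -> List[Segment]:
    """Collapse runs of consecutive active frames into (speaker, start, end) segments."""
    segments = []
    start = None
    for i, active in enumerate(flags):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((speaker, start, i))
            start = None
    if start is not None:
        segments.append((speaker, start, len(flags)))
    return segments


def segment_stereo(left: List[float], right: List[float],
                   frame_size: int = 4, threshold: float = 0.1) -> List[Segment]:
    """Assumption: one speaker per channel, so no further splitting is needed."""
    segments = to_segments(active_frames(left, frame_size, threshold), "channel_0")
    segments += to_segments(active_frames(right, frame_size, threshold), "channel_1")
    return sorted(segments, key=lambda seg: seg[1])


left = [0.5] * 8 + [0.0] * 8    # first party talks, then goes quiet
right = [0.0] * 8 + [0.5] * 8   # second party answers
print(segment_stereo(left, right))  # [('channel_0', 0, 2), ('channel_1', 2, 4)]
```

A mono recording has no such shortcut: the speakers must first be separated by voice, which is what diarization does.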

How participant identification works

Gong applies different methods of participant identification, according to the type of call.

Conference calls

Participants join conference calls in one of two ways:

Using their computer: Participants identify themselves using their full name or a nickname.
By dialing in: The conferencing system typically shows the partial or full phone number of the participant.

To identify participants on conference calls, Gong matches the names entered by the participants to those of the call invitees, using:

Their full name
Shorter combinations of their first and last name, if they are unique (for example, Mary, marys)
Their phone number
Over time, Gong learns the nicknames and phone numbers used by recorded users, so that we can identify them correctly going forward.
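
The matching rules above can be sketched as follows. This is a minimal illustration, not Gong's implementation: the invitee record layout, the `identify` function, and the rule that a partial phone number needs at least four digits and must match the end of exactly one invitee's number are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so 'Mary S.' becomes 'marys'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def digits_of(text: str) -> str:
    return "".join(ch for ch in text if ch.isdigit())


def short_forms(full_name: str) -> set:
    """First name, last name, and first name plus last initial."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {normalize(first), normalize(last), normalize(first + last[0])}


def build_index(invitees: List[Dict[str, str]]) -> Dict[str, str]:
    index: Dict[str, str] = {}
    # Full names and full phone numbers always identify an invitee.
    for inv in invitees:
        index[normalize(inv["name"])] = inv["name"]
        if inv.get("phone"):
            index[digits_of(inv["phone"])] = inv["name"]
    # Shorter combinations count only when exactly one invitee produces them.
    counts = Counter(form for inv in invitees for form in short_forms(inv["name"]))
    for inv in invitees:
        for form in short_forms(inv["name"]):
            if counts[form] == 1:
                index.setdefault(form, inv["name"])
    return index


def identify(label: str, invitees: List[Dict[str, str]]) -> Optional[str]:
    """Return the invitee matching a display name or phone number, or None."""
    index = build_index(invitees)
    key = normalize(label)
    if key in index:
        return index[key]
    # Assumed rule for partially shown phone numbers: match on trailing digits.
    digits = digits_of(label)
    if len(digits) >= 4:
        matches = [inv["name"] for inv in invitees
                   if digits_of(inv.get("phone", "")).endswith(digits)]
        if len(matches) == 1:
            return matches[0]
    return None


invitees = [
    {"name": "Mary Smith", "phone": "+1 415 555 0100"},
    {"name": "Mary Jones"},
    {"name": "Raj Patel", "phone": "+44 20 7946 0958"},
]
print(identify("marys", invitees))  # Mary Smith
print(identify("Mary", invitees))   # None, because two invitees are named Mary
```

In this sketch "Mary" alone is rejected because it is not unique among the invitees, while "marys" resolves to a single person.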

Note: Gong doesn’t limit the number of speakers identified on a call, but each speaker needs to be identified either through the web conferencing application or by their phone number.
If several people sit in the same room and share a single device, they all appear as a single speaker track, identified as the person who joined from that web conferencing application or phone (with the exception of Zoom Rooms).
We also sort participants by their speech volume; for example, speaker tracks aren't shown for silent participants.
If there is no other form of identification, sampled voice recordings of the user can be used to identify the speaker (see Voice identification on mono calls, below). This identification applies to any participant in your organization who uses Gong.

Stereo phone calls

When Gong analyzes phone calls imported from modern telephony systems, audio is provided to Gong in two channels: one channel includes the recorded Gong user, and the other channel includes the customer. In addition, Gong knows which extension (or, in the more general case, which recorded Gong user) the call is associated with.

In this case, Gong associates one side of the call with the recorded Gong user and the other side with the other party. This ensures maximum accuracy as long as the telephony system records audio consistently across the channels.

Mono phone calls

Some telephony systems record audio in a single channel: the audio from the two parties is merged, and there's no easy way to tell who is speaking on the call. Gong uses diarization to separate the audio into two channels, and then uses machine learning to assess which of the two channels matches the company side. When the model does not identify the Gong user correctly (for example, when the recording is not in English), Gong can use samples of previous recordings of the recorded Gong user to identify them. (See Voice identification on mono calls, below, for more details.)

Once the company-side channel is identified, Gong marks it as the recorded Gong user known to be on the call and marks the other channel as the customer.
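
To make the sample-based fallback concrete, here is a minimal sketch that compares each diarized channel with earlier voice samples of the recorded user and labels the closest channel as that user. Gong's actual model is not documented here. The fixed-length voice vectors, the cosine similarity measure, and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_roles(channels: Dict[str, Sequence[float]],
                 user_samples: List[Sequence[float]]) -> Dict[str, str]:
    """Label the channel most similar to the user's samples; the rest are the customer.

    `channels` maps a diarized channel name to a hypothetical voice vector.
    `user_samples` holds vectors from the user's earlier calls (must be non-empty).
    """
    def score(vector: Sequence[float]) -> float:
        return sum(cosine(vector, s) for s in user_samples) / len(user_samples)

    user_channel = max(channels, key=lambda name: score(channels[name]))
    return {name: ("recorded user" if name == user_channel else "customer")
            for name in channels}


channels = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
samples = [[0.9, 0.1], [0.8, 0.2]]
print(assign_roles(channels, samples))
```

Averaging over several samples is one simple way to stay robust when a single earlier recording is noisy.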

Note: If you're using a telephony system that records the call in mono, we recommend using Gong’s voice identification feature. This feature must be enabled by your Gong administrator, and then expressly opted into by individual team members. When enabled, Gong performs on-the-fly analysis to identify the user from a small, continuously refreshed sample of previously recorded calls.

Voice identification on mono calls

In mono call recordings, we only store voice identification for users who have opted into the voice identification feature. Voice identification is not stored for any other call participants.

Gong collects up to 5 short recordings of each user who has opted into the feature. For best results, we look for calls that:
- Are mono telephony calls
- Include at least 2 minutes of recorded speech
Typically, Gong can accurately identify individuals from their second recorded call, based on the sample collected during their first call.
Gong replaces these samples on an ongoing basis to keep them fresh and to increase recognition accuracy. This helps us identify the Gong user in variable conditions, such as when they start the call from a different environment, use a different telephony system, or use a new headset.
As soon as we have enough samples for an individual, we revisit earlier calls where recorded team members were not identified, and use the samples to rerun voice identification. All of this analysis is performed on-the-fly, so no file containing a user’s voice identification is retained.
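
The sample rotation described above can be pictured as a small, bounded pool per opted-in user. The sketch below is illustrative only. The `Call` fields, the oldest-first replacement policy, and the choice to store call IDs rather than any voice data are assumptions, not a description of Gong's internals.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque

MAX_SAMPLES = 5           # "up to 5 short recordings"
MIN_SPEECH_SECONDS = 120  # "at least 2 minutes of recorded speech"


@dataclass
class Call:
    call_id: str
    is_mono_telephony: bool
    speech_seconds: float


@dataclass
class SamplePool:
    """Holds references to the most recent qualifying calls for one opted-in user."""
    call_ids: Deque[str] = field(default_factory=lambda: deque(maxlen=MAX_SAMPLES))

    def offer(self, call: Call) -> bool:
        """Add the call if it qualifies; the oldest entry is dropped when the pool is full."""
        if call.is_mono_telephony and call.speech_seconds >= MIN_SPEECH_SECONDS:
            self.call_ids.append(call.call_id)
            return True
        return False


pool = SamplePool()
for n in range(7):
    pool.offer(Call(call_id=f"call-{n}", is_mono_telephony=True, speech_seconds=180))
pool.offer(Call(call_id="too-short", is_mono_telephony=True, speech_seconds=45))
print(list(pool.call_ids))  # ['call-2', 'call-3', 'call-4', 'call-5', 'call-6']
```

A bounded `deque` keeps the pool fresh without extra bookkeeping: each new qualifying call pushes out the oldest one.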

For information on how to enable Voice Identification, see this.