How Gong identifies speakers

Gong uses multiple methods to determine who was on a call and when they spoke. This information is used to calculate stats and is shown on the call page to help people navigate to the relevant parts of the call.

Dividing calls into speaker segments

The first step in speaker identification is dividing the call into segments, each associated with a single speaker. Gong does this in the following ways:

Conference calls: We separate the single audio channel into as many channels as there are speakers, according to voice variance. Next, we apply a proprietary refined speaker separation algorithm that identifies smaller speech segments (for example, "Yes", "OK") so that speech is attributed to the right speaker more accurately, even when the conferencing system doesn’t report a speaker switch, or reports a switch with a delay.
Telephony recordings: When Gong receives stereo recordings, we assume the two channels are the two speakers, and don’t try to divide the call further.
Mono recordings: When Gong receives mono recordings, we separate the single audio channel into as many channels as there are speakers, according to voice variance.
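The three cases above can be read as a simple dispatch on recording type. The following is a minimal illustrative sketch, not Gong's implementation: the `RecordingType` enum, the `segmentation_steps` function, and the step names are all hypothetical labels for the steps described in the text.

```python
from enum import Enum
from typing import List


class RecordingType(Enum):
    # Hypothetical labels for the three recording types described above.
    CONFERENCE = "conference"
    STEREO_TELEPHONY = "stereo_telephony"
    MONO = "mono"


def segmentation_steps(recording_type: RecordingType) -> List[str]:
    """Return the (hypothetically named) steps applied to a recording."""
    if recording_type is RecordingType.CONFERENCE:
        # Separate by voice variance, then refine short segments ("Yes", "OK").
        return ["separate_by_voice_variance", "refine_short_segments"]
    if recording_type is RecordingType.STEREO_TELEPHONY:
        # Each channel is assumed to be one speaker; no further division.
        return ["one_speaker_per_channel"]
    if recording_type is RecordingType.MONO:
        # A single channel is split into as many channels as there are speakers.
        return ["separate_by_voice_variance"]
    raise ValueError(f"Unknown recording type: {recording_type}")
```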

How participant identification works

Gong applies different methods of participant identification, according to the type of call.

Conference calls

Participants join conference calls in one of two ways:

Using their computer: Participants identify themselves using their full name or a nickname.
By dialing in: The conferencing system typically shows the partial or full phone number of the participant.

To identify participants on conference calls, Gong matches the names entered by the participants to those of the call invitees, using:

Their full name
Shorter combinations of their first and last name, if they are unique (for example, Mary, marys)
Their phone number
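To make the matching order concrete, here is a rough sketch of how a display name or dial-in number might be matched against the invitee list. It is an assumption-laden illustration, not Gong's matching logic: the `Invitee` type, the `match_participant` function, the specific short-name combinations, and the rule that a partial phone number matches on its trailing digits are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Invitee:
    # Hypothetical record for a call invitee.
    first_name: str
    last_name: str
    phone: str = ""


def _norm(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def _digits(text: str) -> str:
    """Keep only the digits of a phone-like string."""
    return re.sub(r"\D", "", text)


def _name_keys(invitee: Invitee) -> Set[str]:
    """Full name plus a few shorter combinations (assumed set)."""
    first, last = _norm(invitee.first_name), _norm(invitee.last_name)
    keys = {first + last, first, last, first + last[:1], first[:1] + last}
    keys.discard("")
    return keys


def match_participant(display: str, invitees: List[Invitee]) -> Optional[Invitee]:
    """Match what the conferencing system shows to exactly one invitee."""
    key = _norm(display)
    # 1. Full name.
    full = [i for i in invitees if key and _norm(i.first_name + i.last_name) == key]
    if len(full) == 1:
        return full[0]
    # 2. Shorter combinations, accepted only when they are unique.
    short = [i for i in invitees if key and key in _name_keys(i)]
    if len(short) == 1:
        return short[0]
    # 3. Phone number; a partial number is assumed to show the last digits.
    digits = _digits(display)
    if len(digits) >= 4:
        by_phone = [i for i in invitees if _digits(i.phone).endswith(digits)]
        if len(by_phone) == 1:
            return by_phone[0]
    return None


invitees = [
    Invitee("Mary", "Smith", "+1 415 555 0199"),
    Invitee("John", "Doe", "+1 415 555 0123"),
    Invitee("Mary", "Jones"),
]
```

With this invitee list, `match_participant("marys", invitees)` resolves to Mary Smith, while `match_participant("Mary", invitees)` returns `None` because two invitees share that first name and the short form is not unique.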

Speaker separation in Zoom Native calls

In a Zoom Native call, when multiple speakers join the call from the same conference room, we detect most of the active speakers and identify them as separate speakers. We don't know the name associated with each voice, so each one is assigned the room name and an increasing index (Speaker 1, Speaker 2, and so on). Once the call has been analyzed, a call participant or business admin can go to the call page and edit the speaker name.
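As an illustration of the naming scheme, the sketch below generates default labels for voices detected in one room and applies any names edited later on the call page. The `label_room_speakers` function, the `renames` mapping, and the exact label format are assumptions made for this example; the text only states that the room name and an increasing index are used.

```python
from typing import Dict, List, Optional


def label_room_speakers(
    room_name: str,
    voice_count: int,
    renames: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Build one label per detected voice in a shared conference room.

    `renames` maps a speaker index to a name edited after analysis
    (hypothetical representation of the call-page edit).
    """
    renames = renames or {}
    labels = []
    for index in range(1, voice_count + 1):
        # Default label: room name plus an increasing index (format assumed).
        default = f"{room_name} (Speaker {index})"
        labels.append(renames.get(index, default))
    return labels
```

For example, `label_room_speakers("Boardroom", 3, renames={2: "Dana Lee"})` keeps the default labels for the first and third voices and replaces the second with the edited name.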

Considerations

We sort speakers by their speech volume, so speaker tracks aren't shown for silent participants, and may not be shown for participants who speak very briefly.
For web conference providers that are not Zoom Native: When several people sit in the same room and use a single device, they appear as a single talk track, identified by the person who opened the web conferencing application or phone.
If there is no other form of identification, sampled voice recordings of a Gong-using participant at your company may be used to identify the speaker (see below).
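The first consideration can be sketched as a filter followed by a sort. This example assumes that "speech volume" means the amount of time each participant spoke, and it uses a hypothetical `min_seconds` cutoff to stand in for "very briefly"; neither the threshold nor the `visible_tracks` function comes from Gong.

```python
from typing import Dict, List


def visible_tracks(talk_seconds: Dict[str, float], min_seconds: float = 1.0) -> List[str]:
    """Return speaker names ordered from most to least speech.

    Participants below `min_seconds` (an assumed cutoff) get no track.
    """
    spoken = {name: secs for name, secs in talk_seconds.items() if secs >= min_seconds}
    return sorted(spoken, key=lambda name: spoken[name], reverse=True)


tracks = visible_tracks({"Ana": 300.0, "Ben": 0.0, "Cy": 0.4, "Dee": 620.5})
```

Here `tracks` contains only Dee and Ana, in that order: Ben was silent and Cy spoke for less than the assumed cutoff.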

Stereo phone calls

When Gong analyzes phone calls imported from modern telephony systems, audio is provided to Gong in two channels: one channel includes the recorded Gong user, and the other channel includes the customer. In addition, Gong knows which extension (or, in the more general case, which recorded Gong user) the call is associated with.

In this case, Gong associates one side of the call with the recorded Gong user and the other side with the other party. This ensures maximum accuracy as long as the telephony system records audio consistently across the channels.

Mono phone calls

Some telephony systems record audio in a single channel. That is, the audio from the two parties is merged into a single channel, and there's no easy way to tell who is speaking on the call. Gong separates the audio into two channels and then, based on the transcript, assesses which of the two channels matches the company side. In cases where the model does not identify the Gong user correctly, Gong can use samples of previous recordings of the recorded Gong user to identify them. See Voice identification on mono calls for more details.

Once identified, Gong marks that channel as the recorded Gong user known to be on the call and the other channel as the customer.

Note: If you're using a telephony system that records the call in mono, we recommend using Gong’s voice identification feature. This feature must be enabled by your Gong administrator, and then expressly opted into by individual team members. When enabled, Gong performs on-the-fly analytics to identify the user from a small sample of previously recorded and continuously refreshed calls.

Voice identification on mono calls

In mono call recordings, we only store voice identification for users who have opted into the voice identification feature. Voice identification is not stored for any other call participants.

Gong collects up to 5 short recordings of subscribed users who have opted into the feature. For best results, we look for calls that:

- Are mono telephony calls
- Include at least 2 minutes of recorded speech

Typically, Gong can accurately identify individuals from their second recorded call, based on the sample collected during their first call.

Gong replaces these samples on an ongoing basis to keep them fresh and to increase recognition accuracy. This helps us identify the Gong user in variable conditions, such as when they start the call from a different environment, use a different telephony system, or use a new headset.

As soon as we have enough samples for an individual, we revisit earlier calls where recorded team members were not identified, and use the samples to rerun voice identification. All of this analysis is performed on the fly, so no file containing a user’s voice identification is retained.
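The sample-collection rules in this section can be pictured as a small rolling pool per user. The sketch below is only an illustration under stated assumptions: `CallInfo`, `VoiceSamplePool`, and `offer` are hypothetical names, the oldest-first replacement policy is an assumption (the text says only that samples are replaced on an ongoing basis), and the "best results" criteria are treated as hard filters for simplicity.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_SAMPLES = 5           # "up to 5 short recordings" (from the text)
MIN_SPEECH_SECONDS = 120  # "at least 2 minutes of recorded speech" (from the text)


@dataclass
class CallInfo:
    # Hypothetical summary of a recorded call.
    call_id: str
    is_mono_telephony: bool
    speech_seconds: float


@dataclass
class VoiceSamplePool:
    opted_in: bool
    # A bounded deque drops the oldest entry when full (assumed policy).
    samples: deque = field(default_factory=lambda: deque(maxlen=MAX_SAMPLES))

    def offer(self, call: CallInfo) -> bool:
        """Consider a call as a sample source; return True if it was kept."""
        if not self.opted_in:
            # Nothing is collected for users who have not opted in.
            return False
        if not call.is_mono_telephony or call.speech_seconds < MIN_SPEECH_SECONDS:
            return False
        self.samples.append(call.call_id)
        return True


pool = VoiceSamplePool(opted_in=True)
for n in range(7):
    pool.offer(CallInfo(f"call-{n}", is_mono_telephony=True, speech_seconds=150.0))
```

After seven qualifying calls, the pool holds only the five most recent call IDs, which mirrors the idea of keeping the sample set small and fresh.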

For information on how to enable voice identification, see this.