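Being able to recognize “who said what,” known as speaker diarization, is a critical step in understanding audio of human dialog through automated means. For instance, in a medical conversation between a doctor and a patient, a “Yes” from the patient in answer to “Do you take your heart medication regularly?” means something substantially different from a rhetorical “Yes?” from the physician.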
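Conventional speaker diarization systems use two stages: the first detects changes in the acoustic spectrum to determine when the speaker changes during a conversation, and the second identifies the individual speakers across the conversation. This basic multi-stage approach has been in use for almost two decades, and during that time only the speaker change detection component has improved.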
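With the recent development of a novel neural network model, the recurrent neural network transducer (RNN-T), we now have a suitable architecture for improving speaker diarization performance, one that addresses limitations of the previous diarization system we presented recently. As reported in our paper, “Joint Speech Recognition and Speaker Diarization via Sequence Transduction,” to be presented at Interspeech 2019, we have developed an RNN-T based speaker diarization system and demonstrated a breakthrough in performance, from about 20% to 2% in word diarization error rate, a factor of 10 improvement.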
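The post does not spell out how word diarization error rate is computed. As a rough illustration only, the sketch below assumes it is the fraction of aligned words that carry the wrong speaker label; the function name and the input format are hypothetical.

```python
def word_diarization_error_rate(aligned_words):
    """Fraction of aligned words whose speaker label is wrong.

    aligned_words: list of (reference_speaker, hypothesis_speaker) pairs,
    one per hypothesis word that was aligned to a reference word.
    This is an assumed, simplified definition, not the paper's exact formula.
    """
    if not aligned_words:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for ref, hyp in aligned_words if ref != hyp)
    return wrong / len(aligned_words)


# Ten aligned words, one of which is attributed to the wrong speaker.
pairs = [("doctor", "doctor")] * 8 + [("patient", "doctor"), ("patient", "patient")]
print(word_diarization_error_rate(pairs))
```

Under this reading, moving from about 20% to 2% means going from roughly one mislabeled word in five to one in fifty.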
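In practice, the speaker diarization system runs in parallel with the automatic speech recognition (ASR) system, and the outputs of the two systems are combined to attribute speaker labels to the recognized words. The conventional speaker diarization system infers speaker labels in the acoustic domain and then overlays those labels on the words generated by a separate ASR system.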
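The overlay step in a conventional pipeline can be pictured as a join on time stamps. The sketch below is a minimal illustration, assuming the diarizer emits time-stamped speaker segments and the ASR system emits time-stamped words; the `SpeakerSegment` and `Word` types and the maximum-overlap rule are assumptions made for this example, not details from the post.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    start: float  # seconds
    end: float
    speaker: str


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def overlay_speaker_labels(words, segments):
    """Give each ASR word the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best = max(
            segments,
            key=lambda seg: overlap(word.start, word.end, seg.start, seg.end),
            default=None,
        )
        if best is None or overlap(word.start, word.end, best.start, best.end) == 0.0:
            speaker = "unknown"  # no segment covers this word at all
        else:
            speaker = best.speaker
        labeled.append((word.text, speaker))
    return labeled


segments = [SpeakerSegment(0.0, 1.2, "spk1"), SpeakerSegment(1.2, 2.0, "spk2")]
words = [Word("when", 0.1, 0.4), Word("due", 0.9, 1.3), Word("tomorrow", 1.4, 1.9)]
print(overlay_speaker_labels(words, segments))
```

A word such as "due" that straddles a speaker boundary shows why this step is fragile: its label depends entirely on where the acoustic segmenter placed the boundary.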
There are a few exceptions to the conventional speaker diarization system, and one such exception was reported in our recent blog post. In that work, the hidden states of the recurrent neural network (RNN) tracked the speakers, circumventing the weakness of the clustering stage. The work reported in this post takes a different approach and incorporates linguistic cues, as well.
An Integrated Speech Recognition and Speaker Diarization System
We developed a novel and simple model that not only combines acoustic and linguistic cues seamlessly, but also combines speaker diarization and speech recognition into one system. The integrated model does not significantly degrade speech recognition performance compared to an equivalent recognition-only system.
The key insight in our work was to recognize that the RNN-T architecture is well-suited to integrate acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the output of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note, there is a feedback loop in the architecture (diagram below) where previously recognized words are fed back as input, and this allows the RNN-T model to incorporate linguistic cues, such as the end of a question.
An integrated speech recognition and speaker diarization system where the system jointly infers who spoke when and what.
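To make the feedback loop described above concrete, here is a toy greedy decoding loop in the RNN-T style. It is only a sketch: `toy_predict` and `toy_joint` are hypothetical stand-ins for trained prediction and joint networks, the "encoded frames" are plain strings rather than latent vectors, and greedy decoding is one common choice that the post does not claim to use.

```python
BLANK = "<blank>"


def greedy_decode(encoded_frames, predict, joint, max_labels_per_frame=5):
    """Greedy RNN-T style decoding.

    encoded_frames: per-frame outputs of the transcription network.
    predict(labels): prediction-network output given the labels emitted so far.
    joint(frame, pred): dict mapping each output label (or BLANK) to a probability.
    """
    labels = []
    for frame in encoded_frames:
        for _ in range(max_labels_per_frame):
            scores = joint(frame, predict(labels))
            best = max(scores, key=scores.get)
            if best == BLANK:
                break  # nothing more to emit for this frame; move to the next one
            labels.append(best)  # fed back through predict() on the next step
    return labels


def toy_predict(labels):
    """Toy prediction network: summarizes the history as the last label."""
    return labels[-1] if labels else None


def toy_joint(frame, pred):
    """Toy joint network: emit the frame's word unless it was just emitted."""
    if frame is None or frame == pred:
        return {BLANK: 0.9, "other": 0.1}
    return {BLANK: 0.1, frame: 0.9}


print(greedy_decode(["hi", "hi", None, "there"], toy_predict, toy_joint))
```

Because the joint network sees the label history as well as the current frame, the second "hi" frame does not produce a duplicate word. In the real model the same mechanism lets previously recognized words influence when a speaker tag is emitted.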
Training the RNN-T model on accelerators like graphics processing units (GPUs) or tensor processing units (TPUs) is non-trivial, as computation of the loss function requires running the forward-backward algorithm, which considers all possible alignments of the input and output sequences. This issue was addressed recently in a TPU-friendly implementation of the forward-backward algorithm, which recasts the problem as a sequence of matrix multiplications. We also took advantage of an efficient implementation of the RNN-T loss in TensorFlow, which allowed quick iterations of model development and let us train a very deep network.
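For intuition about why the loss is expensive, the forward half of the computation can be written as a dynamic program over a lattice of (frame, label-count) nodes. The sketch below is a plain nested-loop version in log space, not the TPU-friendly matrix formulation mentioned above; the argument names `log_blank` and `log_label` are hypothetical, and the values would normally come from the joint network.

```python
import math


def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_neg_log_likelihood(log_blank, log_label):
    """Forward pass over the RNN-T alignment lattice.

    log_blank[t][u]: log-prob of emitting blank at frame t after u labels.
    log_label[t][u]: log-prob of emitting the (u+1)-th target label at
                     frame t after u labels (only u < U is read).
    Both have T rows and U + 1 columns, for T frames and U target labels.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    # alpha[t][u]: log-prob of all partial alignments reaching node (t, u).
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by a blank (advance one frame) or a label (advance one label).
            from_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else -math.inf
            from_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else -math.inf
            alpha[t][u] = logsumexp(from_blank, from_label)
    # Every complete alignment ends with a final blank from the last node.
    return -(alpha[T - 1][U] + log_blank[T - 1][U])


# Two frames, one target label, every emission has probability 0.5.
half = math.log(0.5)
grid = [[half, half], [half, half]]
print(rnnt_neg_log_likelihood(grid, grid))
```

The loops visit every one of the T × (U + 1) lattice nodes in a fixed order, which is the kind of data dependency that motivates recasting the computation as matrix operations on accelerators.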
The integrated model can be trained just like a speech recognition system. The reference transcripts for training contain words spoken by a speaker followed by a tag that defines the role of the speaker. For example, “When is the homework due?” ≺student≻, “I expect you to turn them in tomorrow before class,” ≺teacher≻. Once the model is trained with examples of audio and corresponding reference transcripts, a user can feed in the recording of a conversation and expect to see an output in a similar form. Our analyses show that improvements from the RNN-T system impact all categories of errors, including short speaker turns, splitting at the word boundaries, incorrect speaker assignment in the presence of overlapping speech, and poor audio quality. Moreover, the RNN-T system exhibited consistent performance across conversations, with substantially lower variance in average error rate per conversation compared to the conventional system.
A comparison of errors committed by the conventional system vs. the RNN-T system, as categorized by human annotators.
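Because the model emits words followed by a role tag, as in the training transcripts described earlier, turning its output into speaker turns is a simple parsing step. The sketch below assumes the output is a single string in which each turn ends with a tag such as `<student>`; the function name, the lowercase unpunctuated text, and the exact tag delimiters are assumptions for illustration.

```python
import re

# A run of words followed by a role tag. Both ASCII angle brackets and the
# ≺ ≻ delimiters are accepted, since the exact tag syntax is an assumption.
TURN_PATTERN = re.compile(r"(.*?)\s*[<≺](\w+)[>≻]", re.DOTALL)


def split_turns(tagged_transcript):
    """Return a list of (role, words) pairs from a tagged transcript."""
    turns = []
    for match in TURN_PATTERN.finditer(tagged_transcript):
        words = match.group(1).strip()
        if words:  # skip a tag that has no words in front of it
            turns.append((match.group(2), words))
    return turns


hypothesis = (
    "when is the homework due <student> "
    "i expect you to turn them in tomorrow before class <teacher>"
)
for role, words in split_turns(hypothesis):
    print(f"{role}: {words}")
```

Any words after the last tag are dropped by this sketch; a production parser would need to decide how to handle such a trailing, untagged fragment.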
Furthermore, this integrated model can predict other labels necessary for generating more reader-friendly ASR transcripts. For example, we have been able to improve our transcripts with punctuation and capitalization symbols using appropriately matched training data. Our outputs have fewer punctuation and capitalization errors than our previous models, which were trained separately and added as a post-processing step after ASR.
This model has now become a standard component in our project on understanding medical conversations and is also being adopted more widely in our non-medical speech services.
We would like to thank Hagen Soltau, without whose contributions this work would not have been possible. This work was performed in collaboration with the Google Brain and Speech teams.