Google’s new AI: SoundStorm

SoundStorm is a model developed by Google Research that specializes in efficient, non-autoregressive audio generation. It takes the semantic tokens of AudioLM as input and uses bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, the model produces audio of the same quality, with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster: SoundStorm can generate 30 seconds of audio in 0.5 seconds on a TPU-v4. It can also scale audio generation to longer sequences, synthesizing high-quality, natural dialogue segments from a transcript annotated with speaker turns and a short prompt containing the speakers’ voices.
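
SoundStorm’s parallel decoder works in the spirit of MaskGIT: start from a fully masked sequence of codec tokens and, at each step, predict all positions in parallel but commit only the predictions the model is most confident about, leaving the rest masked for later steps. The NumPy sketch below illustrates that idea for a single token level; the MASK sentinel, the cosine unmasking schedule, and greedy candidate selection are illustrative assumptions, not the exact SoundStorm recipe.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a not-yet-decoded codec token

def parallel_decode(logits_fn, seq_len, num_steps=8):
    """Confidence-based parallel decoding (MaskGIT-style), sketched.

    logits_fn maps a partially masked token sequence (shape [seq_len])
    to per-position logits (shape [seq_len, vocab]); it stands in for
    SoundStorm's bidirectional-attention network.
    """
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(num_steps):
        logits = logits_fn(tokens)
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)
        pick = probs.argmax(axis=-1)   # greedy candidate per position
        conf = probs.max(axis=-1)      # confidence in each candidate

        masked = tokens == MASK
        if not masked.any():
            break
        # Cosine schedule: fewer tokens stay masked as steps progress.
        still_masked = int(np.cos(np.pi / 2 * (step + 1) / num_steps) * masked.sum())
        n_commit = int(masked.sum()) - still_masked
        # Commit the most confident masked positions; keep the rest masked.
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:n_commit]] = pick[order[:n_commit]]
    return tokens

# Toy usage: random logits stand in for the trained network.
rng = np.random.default_rng(0)
out = parallel_decode(lambda t: rng.normal(size=(t.shape[0], 1024)), seq_len=64)
assert (out != MASK).all()  # every position is decoded by the final step
```

The speedup over AudioLM’s acoustic stage comes from this structure: each step fills in many positions at once, so the number of forward passes is a small constant rather than one per generated token.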

Coupled with the text-to-semantic modeling stage of SPEAR-TTS, SoundStorm can synthesize high-quality, natural dialogue, allowing one to control the spoken content (via transcripts), the speaker voices (via short voice prompts), and the speaker turns (via transcript annotations). When synthesizing dialogue segments of 30 seconds, a runtime of 2 seconds was measured on a single TPU-v4.
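
The resulting two-stage pipeline can be pictured as follows. Everything in this sketch is a stub: the function names, token rates, and shapes are illustrative assumptions rather than a published API; the stubs only document the flow from an annotated transcript and a voice prompt to a waveform.

```python
import numpy as np

# Illustrative stubs: the real stages are large neural models; these
# only document the data flow (names and shapes are assumptions).

def text_to_semantic(transcript: str) -> np.ndarray:
    """SPEAR-TTS-style stage: annotated transcript -> semantic tokens."""
    return np.zeros(600, dtype=np.int64)  # e.g. ~50 tokens/s for 12 s

def soundstorm_generate(semantic: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    """SoundStorm: semantic tokens + voice prompt -> RVQ codec tokens."""
    num_levels = 12  # RVQ depth is an assumption for illustration
    return np.zeros((semantic.shape[0], num_levels), dtype=np.int64)

def codec_decode(codec_tokens: np.ndarray) -> np.ndarray:
    """Neural codec decoder (e.g. SoundStream): tokens -> waveform."""
    return np.zeros(codec_tokens.shape[0] * 320, dtype=np.float32)

transcript = ("[Speaker A] Did you hear about the new audio model? "
              "[Speaker B] Yes, it decodes codec tokens in parallel.")
voice_prompt = np.zeros((2, 48000), dtype=np.float32)  # two short voice clips

semantic = text_to_semantic(transcript)              # content and speaker turns
codec = soundstorm_generate(semantic, voice_prompt)  # voices from the prompt
audio = codec_decode(codec)
```

The three control surfaces described above map directly onto these inputs: the transcript carries the content, the bracketed annotations carry the turns, and the prompt carries the voices.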

SoundStorm can generate audio conditioned on the semantic tokens of AudioLM either with or without a 3-second voice prompt. In the unprompted case it samples different speakers; in the prompted case it maintains the speaker’s voice with high consistency, while generating audio two orders of magnitude faster than AudioLM’s acoustic generator.
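
One way to picture the two conditioning modes: the prompt’s codec tokens occupy a fixed prefix that is never masked, so iterative decoding must stay acoustically consistent with that voice, whereas a fully masked canvas leaves the model free to sample a speaker. The frame rate and vocabulary size below are assumptions for illustration.

```python
import numpy as np

MASK = -1
FRAME_RATE = 50                # assumed codec frames per second
prompt_len = 3 * FRAME_RATE    # 3-second voice prompt
total_len = 30 * FRAME_RATE    # 30 seconds of audio to generate

# Prompted: the prompt's tokens form a fixed, never-masked prefix, so
# every decoding step must remain consistent with the prompted voice.
prompt_tokens = np.random.default_rng(0).integers(0, 1024, prompt_len)
prompted = np.full(total_len, MASK, dtype=np.int64)
prompted[:prompt_len] = prompt_tokens

# Unprompted: a fully masked canvas; the model samples a speaker freely,
# which is why different runs produce different voices.
unprompted = np.full(total_len, MASK, dtype=np.int64)
```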

When generating audio in the prompted case, SoundStorm’s generations have higher acoustic consistency and preserve the speaker’s voice from the prompt better than AudioLM does. Compared to RVQ level-wise greedy decoding with the same model, SoundStorm produces audio of higher quality.
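
The comparison with level-wise greedy decoding can be made concrete by treating the number of refinement iterations per RVQ level as a knob: one iteration per level is exactly the greedy baseline, while spending several iterations, at least on the coarse levels, gives the confidence-based variant. The sketch below extends the earlier single-level idea across levels; the level count and schedule are illustrative, not the paper’s exact settings.

```python
import numpy as np

MASK = -1  # sentinel for a not-yet-decoded token (an assumption)

def decode_levels(logits_fn, seq_len, num_levels, steps_per_level):
    """Coarse-to-fine decoding over RVQ levels, sketched.

    steps_per_level[q] == 1 reproduces level-wise greedy decoding: one
    forward pass commits every token at level q. Values > 1 refine the
    level with confidence-based iterations instead.
    """
    tokens = np.full((seq_len, num_levels), MASK, dtype=np.int64)
    for q in range(num_levels):                 # levels filled coarse -> fine
        for step in range(steps_per_level[q]):
            logits = logits_fn(tokens, q)       # [seq_len, vocab]
            e = np.exp(logits - logits.max(axis=-1, keepdims=True))
            probs = e / e.sum(axis=-1, keepdims=True)
            pick, conf = probs.argmax(axis=-1), probs.max(axis=-1)

            masked = tokens[:, q] == MASK
            remaining = steps_per_level[q] - step
            n_commit = int(np.ceil(masked.sum() / remaining))
            order = np.argsort(-np.where(masked, conf, -np.inf))
            tokens[order[:n_commit], q] = pick[order[:n_commit]]
    return tokens

# Toy comparison with random logits standing in for the network:
rng = np.random.default_rng(1)
fake = lambda toks, q: rng.normal(size=(toks.shape[0], 1024))
greedy = decode_levels(fake, 100, 12, [1] * 12)          # greedy baseline
refined = decode_levels(fake, 100, 12, [16] + [1] * 11)  # iterate on coarse level
```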

SoundStorm serves as a replacement for the acoustic generation pipeline of AudioLM and SPEAR-TTS. Google Research acknowledges that the audio samples produced by the model may be influenced by biases present in the training data, for instance in terms of represented accents and voice characteristics; in the generated examples, however, it is demonstrated that speaker characteristics can be reliably controlled via prompting.

Because the ability to mimic a voice has numerous malicious applications, including bypassing biometric identification and impersonation, it is crucial to put safeguards against potential misuse in place. To this end, it has been verified that, after such a replacement, the generated audio remains detectable by a dedicated classifier (98.5% accuracy, using the same classifier as Borsos et al. (2022)). As a component of a larger system, SoundStorm is therefore unlikely to introduce additional risks beyond those discussed by Borsos et al. (2022) and Kharitonov et al. (2023). At the same time, it is hoped that by relaxing the memory and computational requirements of AudioLM, research in audio generation will become accessible to a wider community. In the future, Google Research plans to explore other approaches for detecting synthesized speech, e.g., audio watermarking, so that any potential product usage of this technology strictly follows Google’s responsible AI principles.