Hi, the paper mentions using the Speech LLM's output (the last hidden state) as the input to the BigVGAN2 decoder. My questions: during training, how are the input latents for BigVGAN2 obtained?
The BigVGAN2 decoder's output is the waveform of a real audio clip. So during training, given a target waveform, how should the corresponding SpeechLLM output (the BigVGAN2 input) be obtained?
Is it a teacher-forcing scheme, i.e., first extract the speech tokens from the real audio, then run the LLM under teacher forcing to obtain the latent for each frame?
If it is teacher forcing, does the LLM inference run in the zero-shot setup, or is only this utterance's text used as the prompt for the teacher-forced pass?
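To make the question concrete, here is a minimal sketch of what I mean by the teacher-forcing scheme. The `ToySpeechLLM` class, its layer sizes, and all tensor shapes are hypothetical stand-ins for illustration, not the paper's actual model or API:

```python
import torch
import torch.nn as nn

class ToySpeechLLM(nn.Module):
    """Hypothetical stand-in for the SpeechLLM: embeds speech tokens,
    runs a causal decoder stack, and returns the last hidden state
    for every frame."""
    def __init__(self, vocab_size=4096, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        # Causal mask so each frame only attends to past frames,
        # mimicking autoregressive decoding under teacher forcing.
        T = token_ids.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.decoder(self.embed(token_ids), mask=mask)

# Teacher forcing: feed the ground-truth speech tokens of the target
# utterance in a single forward pass; the per-frame hidden states would
# then be the vocoder's training-time latents, time-aligned with the
# target waveform.
speech_token_ids = torch.randint(0, 4096, (1, 125))  # 5 s at 25 Hz (assumed)
llm = ToySpeechLLM()
with torch.no_grad():
    latents = llm(speech_token_ids)  # (1, 125, 512): one latent per frame
```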
The paper describes it as follows:
The second approach is to directly convert the SpeechLLM output, conditioned on speaker embedding, into the final waveform. We adopts the second approach, based on the BigVGAN2[14] vocoder, directly reconstructing the audio based on the last hidden state of the SpeechLLM, which is conditioned with speaker embedding. The latent sampling rate is 25Hz. It is interpolated to 100Hz and then input into BigVGAN2. Subsequently, the signal is decoded by BigVGAN2 and finally outputs at a frequency of 24KHz.
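For reference, here is how I understand the resampling step in the quoted passage, as a small sketch. The interpolation mode (`linear`) is my assumption, since the paper does not specify it, and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

latents = torch.randn(1, 125, 512)   # (B, T, C): 5 s of 25 Hz latents (assumed shapes)
x = latents.transpose(1, 2)          # (B, C, T), the layout F.interpolate expects
# 25 Hz -> 100 Hz is a fixed 4x upsampling along the time axis.
x = F.interpolate(x, scale_factor=4, mode="linear", align_corners=False)
print(x.shape)  # torch.Size([1, 512, 500])
# At a 24 kHz output rate, each 100 Hz frame must cover 24000 / 100 = 240
# samples, so BigVGAN2 would upsample by 240x: 500 frames -> 120000 samples (5 s).
```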