[index-tts] Short text generates excessively long audio

2025-10-28 742 views
7

After several rounds of testing, the results are already excellent compared with the other models that are popular right now! My project has migrated over from f5-tts.

However, when running inference repeatedly with the same reference audio, the generated output still frequently comes out far too long (other inference libraries have the same problem):

For example, the text is only about ten characters and the reference audio's speaking rate is normal, yet the generated audio sometimes runs to well over ten or even 30+ seconds. Sometimes a long stretch of silence appears out of nowhere; other times there are extremely long gaps between every word.

The problem is not deterministic, but it almost always shows up within about ten inference runs. It also occurs frequently in other inference libraries, and I can't tell whether inference is getting stuck or something else is going on.

Any good suggestions? I need this for batch processing. Slower speed is fine; quality comes first.

Thanks for the contribution and looking forward to suggestions; my project needs this urgently.

Answers

2

We added a remove_long_silence function to infer.py as a temporary fix for the overly long silences. Longer term, we will still address the root cause by optimizing the model and improving its stability.
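For anyone curious how such a post-processing step can work, here is a minimal sketch of the idea. It is not the actual remove_long_silence in infer.py; the librosa-based silence detection, top_db, and max_silence_sec values are assumptions.

import numpy as np
import librosa

def remove_long_silence_sketch(wav: np.ndarray, sr: int,
                               max_silence_sec: float = 0.5,
                               top_db: float = 40.0) -> np.ndarray:
    """Cap any silent gap in the waveform at max_silence_sec (illustrative only)."""
    # Non-silent [start, end) sample intervals, detected by energy.
    intervals = librosa.effects.split(wav, top_db=top_db)
    if len(intervals) == 0:
        return wav

    max_gap = int(max_silence_sec * sr)
    pieces = []
    prev_end = 0
    for start, end in intervals:
        gap = start - prev_end
        # Keep short gaps as-is; cap long gaps at max_gap samples.
        pieces.append(wav[prev_end:prev_end + min(gap, max_gap)])
        pieces.append(wav[start:end])
        prev_end = end
    # Keep at most max_gap of trailing silence.
    pieces.append(wav[prev_end:prev_end + max_gap])
    return np.concatenate(pieces)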

2

In my case, the overly long silent region shows up only once every few hundred runs. When I detect it and automatically re-run inference, the long silence usually appears again, just with a different length, and the re-generated audio also tends to sound odd. So I have largely given up on automatic retries. Instead, when I detect this kind of wav output, I strip the punctuation, re-run inference on the short clauses one by one, insert a pause between them, and finally concatenate them into a single utterance. This partially works around the problem, but splitting into short punctuation-free clauses can make the synthesized passage sound less natural than inferring the whole thing in one pass.
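A rough sketch of that fallback path, assuming a hypothetical synthesize(ref_wav, text) wrapper around the real inference call (the actual index-tts API differs), with the pause length and punctuation set as illustrative choices:

import re
import numpy as np

def synthesize_by_short_clauses(synthesize, ref_wav: str, text: str,
                                sr: int = 24000,
                                pause_sec: float = 0.3) -> np.ndarray:
    """Fallback: strip punctuation, infer clause by clause, join clips with a fixed pause.

    `synthesize(ref_wav, text) -> np.ndarray` is a hypothetical wrapper;
    replace it with the real TTS inference call.
    """
    # Split on common Chinese/English punctuation, drop empty parts.
    clauses = [c.strip() for c in re.split(r"[,.!?;:，。！？；：、]+", text) if c.strip()]
    pause = np.zeros(int(pause_sec * sr), dtype=np.float32)

    clips = []
    for clause in clauses:
        wav = synthesize(ref_wav, clause)   # re-infer the short clause on its own
        clips.append(wav.astype(np.float32))
        clips.append(pause)                 # insert a pause between clauses
    # Drop the trailing pause before concatenating.
    return np.concatenate(clips[:-1]) if clips else pause[:0]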

9

In my tests, certain samples consistently trigger this behavior. The sample I tested was a 15-second game voice clip; after switching to a 30-second clip from another source it worked normally. When the issue occurs, generation also takes noticeably longer, because the output contains a large number of 52 tokens.

4

infer.py has been updated and this result should no longer occur. Please give it a try.

5

@index-tts self.stop_mel_token is not defined. Is it also 8193?

6

Yes.

2
Input text: "LOOK AT THIS PICTURE. NOW, LET'S READ, SICK."
tensor([[  52,   52,   52, 7809, 4879,  421, 5715, 3147, 3628, 4711, 3808, 2128,
         7068,  101, 1744, 4446,  821, 1319, 6021, 1003, 8022, 4461, 5358, 3695,
          139, 2927, 7777, 7041,  988, 4891, 7524, 1151, 5654, 7207,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
         6478, 2854, 6338, 5541, 3568, 3332,  469, 4242, 2374, 2527, 4530, 6490,
         4365, 4856,  224, 2818, 8134,   10,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52, 7342, 5183, 5343, 1011, 3358,
         1592, 5930, 1784,  971, 7733, 5018, 1773, 1337, 7603, 8069, 1234, 4489,
         7557, 1287, 2017, 4435, 4135, 1257, 7232, 1019, 1258,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,   52,
           52,   52,   52,   52,   52,   52,   52,   52,   52,   52]],
       device='cuda:0')
codes shape: torch.Size([1, 598])

For some sentences, replacing the 52 tokens does yield proper silence, but in a run like the one above, part of the sentence content is simply lost. It seems the model never finished inference.
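A small heuristic along those lines, assuming 52 is the silence code seen in the dump above and 8193 is stop_mel_token as confirmed earlier. If the stop token has already been stripped from the returned codes, only the silence-run check applies; the run-length threshold is an assumption.

import torch

SILENCE_CODE = 52        # silence mel code observed in the dump above
STOP_MEL_TOKEN = 8193    # confirmed above

def looks_degenerate(codes: torch.Tensor, max_silence_run: int = 30) -> bool:
    """Heuristic check on a [1, T] generated code sequence.

    Flags output that (a) never emitted the stop token, i.e. generation
    ran to the length limit, or (b) contains a very long run of the
    silence code. Thresholds are illustrative.
    """
    seq = codes.squeeze(0).tolist()
    if STOP_MEL_TOKEN not in seq:
        return True                      # model never finished the sentence
    run = longest = 0
    for tok in seq:
        run = run + 1 if tok == SILENCE_CODE else 0
        longest = max(longest, run)
    return longest >= max_silence_run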

9

This bug has been fixed; please pull the latest code and give it a try.

7

This bug has been fixed; please pull the latest code and give it a try.

I've run several rounds of tests with the new code and the problem indeed hasn't occurred again; I'll keep observing. Thanks for the prompt fix, and thanks to everyone for the suggestions.

5

It happened again, and it reproduces every time across several tries. Material below:

text="Like I said, on your body, for me",

clip_8.zip

6

It happened again, and it reproduces every time across several tries. Material below:

text="Like I said, on your body, for me",

clip_8.zip

@index-tts This urgently needs a fix for production. Could you please take a look?

8

This looks like the same kind of repetition problem LLMs have in their replies; increasing the randomness of the GPT-2 token sampling at inference time should probably be enough.
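For reference, here is what "raising the sampling randomness" looks like with a generic Hugging Face-style generate call; the knob names and values are illustrative, and index-tts's own GPT wrapper may expose them differently. A repetition penalty is a closely related knob for discouraging runs of the same token (like the 52s above).

import torch
from transformers import GenerationConfig

def sample_codes(gpt_model, input_ids: torch.Tensor,
                 stop_mel_token: int = 8193) -> torch.Tensor:
    """Decode with more sampling randomness; all values are illustrative."""
    cfg = GenerationConfig(
        do_sample=True,            # stochastic sampling instead of near-greedy decoding
        temperature=1.0,           # higher temperature -> more randomness
        top_p=0.8,                 # nucleus sampling
        top_k=30,
        repetition_penalty=1.3,    # discourage long runs of the same token (e.g. 52)
        max_new_tokens=600,        # hard cap so a stuck run cannot grow unbounded
        eos_token_id=stop_mel_token,
        pad_token_id=stop_mel_token,
    )
    return gpt_model.generate(input_ids, generation_config=cfg)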

6

It happened again, and it reproduces every time across several tries. Material below: text="Like I said, on your body, for me", clip_8.zip

@index-tts This urgently needs a fix for production. Could you please take a look?

Hello, has this problem been solved yet?