[index-tts] Inference is still very slow. Can anyone optimize it further?

2025-11-11 · 124 views

self.gpt.inference_speech is where almost all the time goes.

Reference audio length: 12.95 seconds
gpt_gen_time: 47.16 seconds
gpt_forward_time: 0.11 seconds
bigvgan_time: 1.95 seconds
Total inference time: 49.82 seconds
Generated audio length: 13.14 seconds

@imwxc You could try exporting checkpoints\gpt.pth to ONNX and then accelerating it with TensorRT. From a quick look, the model itself is a transformers GPT2PreTrainedModel.

Also, onnxruntime officially ships a script for converting GPT-2 models to ONNX: onnxruntime.transformers.models.gpt2.convert_to_onnx
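
For reference, a minimal sketch of driving that script from Python. The flag names follow the onnxruntime docs but vary between versions, so treat the exact arguments as assumptions and check --help first; "-m gpt2" stands in for the index-tts GPT wrapper, which is likely where it breaks (see the traceback below).

```python
import subprocess
import sys

# Hedged sketch: invoke onnxruntime's GPT-2 export script as a module.
subprocess.run(
    [
        sys.executable, "-m",
        "onnxruntime.transformers.models.gpt2.convert_to_onnx",
        "-m", "gpt2",             # model name or path (assumption)
        "--output", "gpt2.onnx",
        "-p", "fp32",             # precision: fp32 / fp16 / int8
    ],
    check=True,
)
```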

The export failed, though. If anyone has a better idea, please share.

onnxruntime\transformers\models\gpt2\gpt2_helper.py", line 68, in post_process
    if isinstance(result[1][0], (tuple, list)):
IndexError: tuple index out of range
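
Since the helper script trips over post_process, a plain torch.onnx.export is another route. A minimal hedged sketch against stock GPT-2; the real target would be the GPT2PreTrainedModel inside index-tts, loaded from checkpoints\gpt.pth.

```python
import torch
from transformers import GPT2LMHeadModel

# Hedged fallback sketch: export a GPT-2 style model directly with
# torch.onnx.export instead of the onnxruntime helper script.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
model.config.use_cache = False    # export logits only, no KV-cache outputs
model.config.return_dict = False  # tuple outputs trace more cleanly

dummy_ids = torch.randint(0, model.config.vocab_size, (1, 16))
torch.onnx.export(
    model,
    (dummy_ids,),
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```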

I'll give it a try when I have time. My current idea is still to see whether the source can be modified to support sdpa.
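
For what it's worth, recent transformers versions can request SDPA at load time without patching the modeling code. A hedged sketch; GPT-2 only gained SDPA support in newer releases, so the exact minimum version is an assumption.

```python
from transformers import GPT2LMHeadModel

# Hedged sketch: ask transformers for PyTorch scaled_dot_product_attention
# instead of the eager attention path. Needs a transformers version where
# GPT-2 has an SDPA implementation (roughly v4.38+; treat as an assumption).
model = GPT2LMHeadModel.from_pretrained("gpt2", attn_implementation="sdpa")
print(model.config._attn_implementation)  # expect "sdpa"
```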

I did what I could.

> self.gpt.inference_speech is where almost all the time goes. (timing log quoted above)

@xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time, which is very slow.

The audio from batch processing is far worse in quality.

For comparison, ChatTTS streaming gets the first packet out in about 100 ms, and inferring 17 s of audio takes only about 2.6 s. The gap relative to that is pretty large.

You can try tuning the generate parameters. Below are measured numbers on a MacBook Pro (mps):

| do_sample | num_beams | top_k        | GenerationMode | RTF    |
|-----------|-----------|--------------|----------------|--------|
| True      | 3         | top_k > 1    | beam_search    | 6.2184 |
| False     | 3         | top_k > 1    | beam_search    | 6.4713 |
| False     | 1         | top_k > 1    | greedy_search  | 2.8936 |
| False     | 1         | top_k = None | sample         | 2.9    |

On GenerationMode: https://huggingface.co/docs/transformers/v4.51.3/en/generation_strategies

On my machine I tested a few parameter combinations; the following was somewhat better.

num_beams = 1  
do_sample = False 
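
A minimal sketch of how those flags map onto generate(). Shown on stock GPT-2 as an assumption for illustration; in index-tts they would be passed through to the GPT wrapper's generate call.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hedged sketch: num_beams=1 + do_sample=False selects greedy search,
# the cheapest GenerationMode in the table above.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("hello world", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(
        ids,
        do_sample=False,
        num_beams=1,
        max_new_tokens=16,
        pad_token_id=tok.eos_token_id,  # silence the missing-pad warning
    )
print(tok.decode(out[0]))
```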

I tweaked a DeepSpeed parameter: changing replace_with_kernel_inject to True gives a sizeable speedup. On my machine, synthesizing 29 seconds of audio went from 12.34 s down to 6.97 s. It sounds fine in my own tests; try it out and see whether anything breaks. https://docs.deepspeed.org.cn/en/latest/inference-init.html

With replace_with_kernel_inject=False:

Reference audio length: 4.01 seconds
gpt_gen_time: 10.98 seconds
gpt_forward_time: 0.12 seconds
bigvgan_time: 1.14 seconds
Total inference time: 12.34 seconds
Generated audio length: 29.14 seconds
RTF: 0.4235

With replace_with_kernel_inject=True:

Reference audio length: 4.01 seconds
gpt_gen_time: 5.41 seconds
gpt_forward_time: 0.11 seconds
bigvgan_time: 1.35 seconds
Total inference time: 6.97 seconds
Generated audio length: 27.61 seconds
RTF: 0.2525
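
For anyone reproducing this, a minimal sketch of what the change amounts to, assuming the standard deepspeed.init_inference entry point from the docs linked above (shown on stock GPT-2 rather than the index-tts GPT wrapper):

```python
import torch
import deepspeed
from transformers import GPT2LMHeadModel

# Hedged sketch: wrap a GPT-2 style module with DeepSpeed inference and
# enable kernel injection, the switch described above. Needs a CUDA build
# of DeepSpeed.
model = GPT2LMHeadModel.from_pretrained("gpt2").half().cuda().eval()
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # the parameter flipped to True above
)
model = engine.module  # use as a normal torch module afterwards
```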

@superfl Confirmed, it works. After changing it to True, batched inference is even faster; I hit no problems in my tests.

>> gpt_gen_time: 5.23 seconds
>> gpt_forward_time: 2.57 seconds
>> bigvgan_time: 1.92 seconds
>> Total fast inference time: 11.12 seconds
>> Generated audio length: 172.72 seconds
>> [fast] RTF: 0.0644

> @superfl: changing replace_with_kernel_inject to True cut a 29 s synthesis from 12.34 s to 6.97 s (full post and logs quoted above)

@superfl What hardware are you running on? That speed is flying.

Results on my potato Mac:

tensor([[10201, 2044, 10201, 1973, 10201, 6377, 10201, 6377, 10201, 208, 10201, 146, 10202, 10201, 203, 10201, 1547, 10201, 3288, 10201, 855, 10201, 167, 10201, 208, 10201, 146, 10201, 2588, 10201, 694, 10215]], dtype=torch.int32)
text_tokens shape: torch.Size([1, 32]), text_tokens type: torch.int32
text_token_syms is same as sentence tokens True
tensor([[5826, 5556, 7585, 7684, 6738, 6506, 7184, 5574, 7020, 6018, 4831, 1331, 6663, 3043, 6148, 8170, 6507, 7303, 6515, 6270, 3583, 7382, 3180, 30, 6040, 8021, 7569, 4298, 436, 4955, 7404, 7309, 6874, 7043, 7555, 103, 7030, 194, 5682, 6289, 691, 6269, 5996, 5817, 6426, 5719, 6583, 3694, 7921, 6333, 2980, 1385, 4446, 5137, 1026, 6109, 6905, 7273, 1271, 1401, 7397, 5859, 7813, 7480, 6763, 5481, 1460, 5025, 6502, 5796, 2082, 7627, 1057, 4457, 4358, 4359, 3038, 5883, 6893, 5677, 6505, 6930, 222, 7941, 5808, 5659, 7553, 2398, 6431, 2396, 2586, 8193]]) <class 'torch.Tensor'>
codes shape: torch.Size([1, 92]), codes type: torch.int64
code len: tensor([92])
tensor([[5826, 5556, 7585, 7684, 6738, 6506, 7184, 5574, 7020, 6018, 4831, 1331, 6663, 3043, 6148, 8170, 6507, 7303, 6515, 6270, 3583, 7382, 3180, 30, 6040, 8021, 7569, 4298, 436, 4955, 7404, 7309, 6874, 7043, 7555, 103, 7030, 194, 5682, 6289, 691, 6269, 5996, 5817, 6426, 5719, 6583, 3694, 7921, 6333, 2980, 1385, 4446, 5137, 1026, 6109, 6905, 7273, 1271, 1401, 7397, 5859, 7813, 7480, 6763, 5481, 1460, 5025, 6502, 5796, 2082, 7627, 1057, 4457, 4358, 4359, 3038, 5883, 6893, 5677, 6505, 6930, 222, 7941, 5808, 5659, 7553, 2398, 6431, 2396]]) <class 'torch.Tensor'>
fix codes shape: torch.Size([1, 90]), codes type: torch.int64
code len: tensor([90])
wav shape: torch.Size([1, 92160]) min: tensor(-18609.9023) max: tensor(16730.5020)

Reference audio length: 25.50 seconds
gpt_gen_time: 76.50 seconds
gpt_forward_time: 1.42 seconds
bigvgan_time: 6.84 seconds
Total inference time: 84.85 seconds
Generated audio length: 3.84 seconds
RTF: 22.0963


> Confirmed, it works. After changing it to True, batched inference is even faster ([fast] RTF: 0.0644; logs quoted above)

Can anyone help compile a DeepSpeed build for CUDA 11.8 + PyTorch 2.0.1? The official builds that support replace_with_kernel_inject only target CUDA 12.4.
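
One hedged note while waiting for a matching wheel: the stock pip package normally JIT-compiles its ops against the local CUDA toolkit at first use (it needs ninja and a matching nvcc), which can sidestep a prebuilt wheel targeting a different CUDA version. ds_report ships with DeepSpeed and shows what the current environment can build:

```python
import subprocess

# Hedged sketch: ds_report prints, per op, whether it is compatible with the
# installed torch/CUDA combination; useful before attempting a source build
# for CUDA 11.8 + PyTorch 2.0.1.
subprocess.run(["ds_report"], check=True)
```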

@JiapengLi For the replace_with_kernel_inject parameter, is your deployment on Linux? Windows doesn't seem to get full DeepSpeed support.


> @superfl: changing replace_with_kernel_inject to True cut a 29 s synthesis from 12.34 s to 6.97 s (full post and logs quoted above)

Is it a Windows or a Linux deployment that gets full DeepSpeed support?

> @xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time. (quoted from above)

How did you get bigvgan_time down to around 0.1 s? Saving roughly 0.5 s still seems well worth it, but I can't work out from the link you posted what was actually changed.

> @xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time. (quoted from above)

Pretty fast! How did you optimize bigvgan and the GPT part?