[index-tts] Inference is still very slow. Can anyone optimize it further?

2025-11-11 · 124 views

self.gpt.inference_speech is where almost all the time goes.

Reference audio length: 12.95 seconds
gpt_gen_time: 47.16 seconds
gpt_forward_time: 0.11 seconds
bigvgan_time: 1.95 seconds
Total inference time: 49.82 seconds
Generated audio length: 13.14 seconds

@imwxc You could try exporting checkpoints\gpt.pth to ONNX and then accelerating it with TensorRT. From a quick look, the model itself is a transformers GPT2PreTrainedModel.

Also, onnxruntime officially ships a script for converting GPT-2 models to ONNX: onnxruntime.transformers.models.gpt2.convert_to_onnx
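
For reference, a minimal sketch of driving that script from Python. The flag names follow the onnxruntime docs but vary between versions, so treat the exact arguments as assumptions and check --help first; "-m gpt2" stands in for the index-tts GPT wrapper, which is likely where it breaks (see the traceback below).

```python
import subprocess
import sys

# Hedged sketch: invoke onnxruntime's GPT-2 export script as a module.
subprocess.run(
    [
        sys.executable, "-m",
        "onnxruntime.transformers.models.gpt2.convert_to_onnx",
        "-m", "gpt2",             # model name or path (assumption)
        "--output", "gpt2.onnx",
        "-p", "fp32",             # precision: fp32 / fp16 / int8
    ],
    check=True,
)
```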

The export failed, though. If anyone has a better idea, please share.

onnxruntime\transformers\models\gpt2\gpt2_helper.py", line 68, in post_process
    if isinstance(result[1][0], (tuple, list)):
IndexError: tuple index out of range
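
Since the helper script trips over post_process, a plain torch.onnx.export is another route. A minimal hedged sketch against stock GPT-2; the real target would be the GPT2PreTrainedModel inside index-tts, loaded from checkpoints\gpt.pth.

```python
import torch
from transformers import GPT2LMHeadModel

# Hedged fallback sketch: export a GPT-2 style model directly with
# torch.onnx.export instead of the onnxruntime helper script.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
model.config.use_cache = False    # export logits only, no KV-cache outputs
model.config.return_dict = False  # tuple outputs trace more cleanly

dummy_ids = torch.randint(0, model.config.vocab_size, (1, 16))
torch.onnx.export(
    model,
    (dummy_ids,),
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```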

I'll give it a try when I have time. My current idea is still to see whether the source can be modified to support sdpa.
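
For what it's worth, recent transformers versions can request SDPA at load time without patching the modeling code. A hedged sketch; GPT-2 only gained SDPA support in newer releases, so the exact minimum version is an assumption.

```python
from transformers import GPT2LMHeadModel

# Hedged sketch: ask transformers for PyTorch scaled_dot_product_attention
# instead of the eager attention path. Needs a transformers version where
# GPT-2 has an SDPA implementation (roughly v4.38+; treat as an assumption).
model = GPT2LMHeadModel.from_pretrained("gpt2", attn_implementation="sdpa")
print(model.config._attn_implementation)  # expect "sdpa"
```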

I did what I could.

> self.gpt.inference_speech is where almost all the time goes. (timing log quoted above)

@xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time, which is very slow.

The audio from batch processing is far worse in quality.

For comparison, ChatTTS streaming gets the first packet out in about 100 ms, and inferring 17 s of audio takes only about 2.6 s. The gap relative to that is pretty large.

You can try tuning the generate parameters. Below are measured numbers on a MacBook Pro (mps):

| do_sample | num_beams | top_k        | GenerationMode | RTF    |
|-----------|-----------|--------------|----------------|--------|
| True      | 3         | top_k > 1    | beam_search    | 6.2184 |
| False     | 3         | top_k > 1    | beam_search    | 6.4713 |
| False     | 1         | top_k > 1    | greedy_search  | 2.8936 |
| False     | 1         | top_k = None | sample         | 2.9    |

On GenerationMode: https://huggingface.co/docs/transformers/v4.51.3/en/generation_strategies

On my machine I tested a few parameter combinations; the following was somewhat better.

num_beams = 1  
do_sample = False 
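
A minimal sketch of how those flags map onto generate(). Shown on stock GPT-2 as an assumption for illustration; in index-tts they would be passed through to the GPT wrapper's generate call.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hedged sketch: num_beams=1 + do_sample=False selects greedy search,
# the cheapest GenerationMode in the table above.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("hello world", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(
        ids,
        do_sample=False,
        num_beams=1,
        max_new_tokens=16,
        pad_token_id=tok.eos_token_id,  # silence the missing-pad warning
    )
print(tok.decode(out[0]))
```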

I tweaked a DeepSpeed parameter: changing replace_with_kernel_inject to True gives a sizeable speedup. On my machine, synthesizing 29 seconds of audio went from 12.34 s down to 6.97 s. It sounds fine in my own tests; try it out and see whether anything breaks. https://docs.deepspeed.org.cn/en/latest/inference-init.html

With replace_with_kernel_inject=False:

Reference audio length: 4.01 seconds
gpt_gen_time: 10.98 seconds
gpt_forward_time: 0.12 seconds
bigvgan_time: 1.14 seconds
Total inference time: 12.34 seconds
Generated audio length: 29.14 seconds
RTF: 0.4235

With replace_with_kernel_inject=True:

Reference audio length: 4.01 seconds
gpt_gen_time: 5.41 seconds
gpt_forward_time: 0.11 seconds
bigvgan_time: 1.35 seconds
Total inference time: 6.97 seconds
Generated audio length: 27.61 seconds
RTF: 0.2525
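
For anyone reproducing this, a minimal sketch of what the change amounts to, assuming the standard deepspeed.init_inference entry point from the docs linked above (shown on stock GPT-2 rather than the index-tts GPT wrapper):

```python
import torch
import deepspeed
from transformers import GPT2LMHeadModel

# Hedged sketch: wrap a GPT-2 style module with DeepSpeed inference and
# enable kernel injection, the switch described above. Needs a CUDA build
# of DeepSpeed.
model = GPT2LMHeadModel.from_pretrained("gpt2").half().cuda().eval()
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # the parameter flipped to True above
)
model = engine.module  # use as a normal torch module afterwards
```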

@superfl Confirmed, it works. After changing it to True, batched inference is even faster; I hit no problems in my tests.

>> gpt_gen_time: 5.23 seconds
>> gpt_forward_time: 2.57 seconds
>> bigvgan_time: 1.92 seconds
>> Total fast inference time: 11.12 seconds
>> Generated audio length: 172.72 seconds
>> [fast] RTF: 0.0644

> @superfl: changing replace_with_kernel_inject to True cut a 29 s synthesis from 12.34 s to 6.97 s (full post and logs quoted above)

@superfl What hardware are you running on? That speed is flying.

Results on my potato Mac:

tensor([[10201, 2044, 10201, 1973, 10201, 6377, 10201, 6377, 10201, 208, 10201, 146, 10202, 10201, 203, 10201, 1547, 10201, 3288, 10201, 855, 10201, 167, 10201, 208, 10201, 146, 10201, 2588, 10201, 694, 10215]], dtype=torch.int32)
text_tokens shape: torch.Size([1, 32]), text_tokens type: torch.int32
text_token_syms is same as sentence tokens True
tensor([[5826, 5556, 7585, 7684, 6738, 6506, 7184, 5574, 7020, 6018, 4831, 1331, 6663, 3043, 6148, 8170, 6507, 7303, 6515, 6270, 3583, 7382, 3180, 30, 6040, 8021, 7569, 4298, 436, 4955, 7404, 7309, 6874, 7043, 7555, 103, 7030, 194, 5682, 6289, 691, 6269, 5996, 5817, 6426, 5719, 6583, 3694, 7921, 6333, 2980, 1385, 4446, 5137, 1026, 6109, 6905, 7273, 1271, 1401, 7397, 5859, 7813, 7480, 6763, 5481, 1460, 5025, 6502, 5796, 2082, 7627, 1057, 4457, 4358, 4359, 3038, 5883, 6893, 5677, 6505, 6930, 222, 7941, 5808, 5659, 7553, 2398, 6431, 2396, 2586, 8193]]) <class 'torch.Tensor'>
codes shape: torch.Size([1, 92]), codes type: torch.int64
code len: tensor([92])
tensor([[5826, 5556, 7585, 7684, 6738, 6506, 7184, 5574, 7020, 6018, 4831, 1331, 6663, 3043, 6148, 8170, 6507, 7303, 6515, 6270, 3583, 7382, 3180, 30, 6040, 8021, 7569, 4298, 436, 4955, 7404, 7309, 6874, 7043, 7555, 103, 7030, 194, 5682, 6289, 691, 6269, 5996, 5817, 6426, 5719, 6583, 3694, 7921, 6333, 2980, 1385, 4446, 5137, 1026, 6109, 6905, 7273, 1271, 1401, 7397, 5859, 7813, 7480, 6763, 5481, 1460, 5025, 6502, 5796, 2082, 7627, 1057, 4457, 4358, 4359, 3038, 5883, 6893, 5677, 6505, 6930, 222, 7941, 5808, 5659, 7553, 2398, 6431, 2396]]) <class 'torch.Tensor'>
fix codes shape: torch.Size([1, 90]), codes type: torch.int64
code len: tensor([90])
wav shape: torch.Size([1, 92160]) min: tensor(-18609.9023) max: tensor(16730.5020)

Reference audio length: 25.50 seconds
gpt_gen_time: 76.50 seconds
gpt_forward_time: 1.42 seconds
bigvgan_time: 6.84 seconds
Total inference time: 84.85 seconds
Generated audio length: 3.84 seconds
RTF: 22.0963


> Confirmed, it works. After changing it to True, batched inference is even faster ([fast] RTF: 0.0644; logs quoted above)

Can anyone help compile a DeepSpeed build for CUDA 11.8 + PyTorch 2.0.1? The official builds that support replace_with_kernel_inject only target CUDA 12.4.
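
One hedged note while waiting for a matching wheel: the stock pip package normally JIT-compiles its ops against the local CUDA toolkit at first use (it needs ninja and a matching nvcc), which can sidestep a prebuilt wheel targeting a different CUDA version. ds_report ships with DeepSpeed and shows what the current environment can build:

```python
import subprocess

# Hedged sketch: ds_report prints, per op, whether it is compatible with the
# installed torch/CUDA combination; useful before attempting a source build
# for CUDA 11.8 + PyTorch 2.0.1.
subprocess.run(["ds_report"], check=True)
```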

@JiapengLi For the replace_with_kernel_inject parameter, is your deployment on Linux? Windows doesn't seem to get full DeepSpeed support.


> @superfl: changing replace_with_kernel_inject to True cut a 29 s synthesis from 12.34 s to 6.97 s (full post and logs quoted above)

Is it a Windows or a Linux deployment that gets full DeepSpeed support?

> @xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time. (quoted from above)

How did you get bigvgan_time down to around 0.1 s? Saving roughly 0.5 s still seems well worth it, but I can't work out from the link you posted what was actually changed.

> @xzw168 Yes. bigvgan_time can already be optimized down to around 0.1 s; the main cost is still gpt_gen_time. (quoted from above)

Pretty fast! How did you optimize bigvgan and the GPT part?