[2noise/ChatTTS] Zero-shot synthesis produces empty audio when the input is short

2025-11-17
Problem description

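I uploaded a sample audio and a sample text through the webUI to synthesize audio. With the Input set to 「四川美食确实以辣闻名」 ("Sichuan cuisine is indeed famous for its spiciness"), the refine and synthesis steps both complete, but the synthesized audio is empty.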

If I instead upload the audio to ChatTTS Voice Cloning to obtain a speaker embedding and then synthesize with that speaker embedding, short inputs synthesize correctly.

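I have already switched to the dev branch. (Previously, on the main branch, I ran into the "regenerate in order to ensure non-empty" problem, the same situation as #673. After switching to dev that problem went away, but short inputs still fail to synthesize.)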

Answers

1

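The "ensure non-empty" message is simply the outward symptom of short inputs failing to synthesize; the dev branch just no longer regenerates in a loop by default. I tested 四川美食确实以辣闻名 in the webui with all default settings and it generated normally, so if generation fails for you, please first check whether your parameters are reasonable. Unreasonable model parameters can make the model output the eos token right away, which shows up as empty audio.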
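
The "regenerate to ensure non-empty" behavior described above can be pictured as a retry loop around the generation call. The following is only a rough sketch under stated assumptions: `synthesize` is a hypothetical callable standing in for whatever produces the waveform, and `is_empty` is a hypothetical check that treats a waveform as empty when it has no samples or is effectively silent. Neither is taken from the ChatTTS source.

```python
from typing import Callable, Optional, Sequence


def is_empty(wav: Sequence[float], eps: float = 1e-4) -> bool:
    """Hypothetical check: no samples at all, or every sample is near zero."""
    return len(wav) == 0 or max(abs(s) for s in wav) < eps


def synthesize_with_retry(
    synthesize: Callable[[str], Sequence[float]],  # assumed stand-in for the TTS call
    text: str,
    max_tries: int = 3,
) -> Optional[Sequence[float]]:
    """Call `synthesize` until it returns non-empty audio or tries run out."""
    for _ in range(max_tries):
        wav = synthesize(text)
        if not is_empty(wav):
            return wav
    # Every attempt came back empty; let the caller decide what to do.
    return None
```

Retrying only hides the symptom. If the parameters push the model to emit the eos token immediately, each attempt is likely to come back empty as well, which is why checking the parameters comes first.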

2

In the webui I only uploaded the sample audio and sample text; every other parameter was left at its default (refine enabled, audio temperature 0.3, top_P 0.7, top_K 20; DVAE and the rest unmodified).

Input Text 四川美食确实以辣闻名

Sample Text The following is a conversation with the founding members of the Cursor team, Michael Truel, Swale Asif, Arvid Lundmark, and Aman Sanger.

Sample Audio lex_ai_cursor_team-00.00.00.000-00.00.09.945.mp3.zip
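
The Input Text here is Chinese while the Sample Text is English. A quick way to flag that kind of mismatch before synthesis is sketched below. The function names are hypothetical and not part of ChatTTS, and the script detection is deliberately crude: it only counts CJK Unified Ideographs against ASCII letters.

```python
def dominant_script(text: str) -> str:
    """Return "cjk", "latin", or "other" depending on which letters dominate."""
    # Count characters in the basic CJK Unified Ideographs block.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Count plain ASCII letters.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cjk == 0 and latin == 0:
        return "other"
    return "cjk" if cjk >= latin else "latin"


def scripts_match(input_text: str, sample_text: str) -> bool:
    """True when both texts are dominated by the same script."""
    return dominant_script(input_text) == dominant_script(sample_text)


if __name__ == "__main__":
    input_text = "四川美食确实以辣闻名"
    sample_text = "The following is a conversation with the founding members of the Cursor team"
    if not scripts_match(input_text, sample_text):
        print("warning: input text and sample text use different scripts")
```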

The terminal shows:

* Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
C:\Users\xxx\anaconda3\envs\chattts\Lib\site-packages\gradio\blocks.py:1746: UserWarning: A function returned too many output values (needed: 0, returned: 1). Ignoring extra values.
    Output components:
        []
    Output values returned:
        [None]
  warnings.warn(
text:   0%|                                                                                 | 0/384(max) [00:00, ?it/s]C:\Users\xxx\anaconda3\envs\chattts\Lib\site-packages\transformers\models\llama\modeling_llama.py:655: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
text:   0%|▏                                                                            | 1/384(max) [00:00,  2.85it/s]We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
text:   4%|██▊                                                                         | 14/384(max) [00:00, 15.51it/s]
code:   0%|▏                                                                           | 4/2048(max) [00:00, 19.47it/s]
9

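Check what the text stage output. Also, try to avoid mixing English and Chinese. Note that the Sample Text must exactly match what is spoken in the sample audio, which means you need to write it in the same style that refine_text produces. Please try the following transcription: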

the following is a conversation [uv_break] with the founding members of the cursor team, [uv_break] michael truel, [uv_break] swale asif, [uv_break] arvid lundmark, [uv_break] and aman sanger. [uv_break]
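
As a starting point for writing a transcription in this style, the mechanical part (lowercasing and adding `[uv_break]` after commas and periods) can be approximated with a small helper. This is a minimal sketch, not how refine_text itself works: `to_refined_style` is a hypothetical name, and the punctuation rule is an assumption inferred from the example above.

```python
import re


def to_refined_style(text: str, break_tag: str = "[uv_break]") -> str:
    """Lowercase `text` and append a break tag after each comma or period."""
    # Collapse runs of whitespace and lowercase everything.
    lowered = " ".join(text.lower().split())
    # Tag only punctuation followed by whitespace or end of string,
    # so decimals such as "3.5" are left alone.
    return re.sub(
        r"([,.])(?=\s|$)",
        lambda m: f"{m.group(1)} {break_tag}",
        lowered,
    )


if __name__ == "__main__":
    sample = (
        "The following is a conversation with the founding members of the "
        "Cursor team, Michael Truel, Swale Asif, Arvid Lundmark, and Aman Sanger."
    )
    print(to_refined_style(sample))
```

This rule only inserts breaks at commas and periods, so it does not reproduce the break after "conversation" in the hand-written transcription above. Pauses that fall mid-phrase still have to be added by listening to the sample audio.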