- When fine-tuning with ds_train_finetune.sh on my own data, the OVERFLOW messages shown below appear right at the start of training and keep recurring throughout.
- After training for a while, the loss spikes sharply following an OVERFLOW message.
- If training continues, it eventually aborts with `Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.`
- A checkpoint saved before the loss spike predicts normally; a checkpoint saved after the spike produces garbled output.

Could someone explain what causes the OVERFLOW, and how to avoid it? Thanks!
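For context, the messages below come from DeepSpeed's dynamic loss scaler: on each fp16 overflow it skips the optimizer step, first consumes its hysteresis budget, then halves the loss scale, and raises the fatal exception once the scale hits its minimum. A rough sketch of that logic (class and method names are illustrative, not the real `loss_scaler.py` API):

```python
class DynamicLossScaler:
    """Illustrative sketch of DeepSpeed-style dynamic loss scaling."""

    def __init__(self, init_scale=65536, min_scale=1, hysteresis=2):
        self.scale = init_scale
        self.min_scale = min_scale
        self.hysteresis = hysteresis            # consecutive overflows tolerated before shrinking
        self._hysteresis_left = hysteresis

    def update(self, overflow: bool):
        """Return True if the optimizer step should be skipped."""
        if overflow:
            if self._hysteresis_left > 1:
                self._hysteresis_left -= 1      # "Reducing hysteresis to 1"
            else:
                if self.scale <= self.min_scale:
                    # Matches the fatal error seen after repeated overflows
                    raise RuntimeError(
                        "Current loss scale already at minimum - "
                        "cannot decrease scale anymore.")
                self.scale = max(self.scale // 2, self.min_scale)  # "reducing to ..."
            return True                         # skip this step
        self._hysteresis_left = self.hysteresis # healthy step resets hysteresis
        return False
```

Skipped steps are harmless in isolation, but a long run of overflows (as in the log) means gradients are producing Inf/NaN faster than the scaler can recover.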
```
[INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
```

ds_train_finetune.sh is as follows:
```shell
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --num_gpus=8 --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file /data/data_train.json \
    --prompt_column instruction \
    --response_column output \
    --preprocessing_num_workers 8 \
    --cache_dir ./cache/self-ft-$LR \
    --overwrite_cache \
    --model_name_or_path /data/chatglm-6b \
    --output_dir ./output/self-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 350 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 200000 \
    --logging_steps 10 \
    --save_steps 5000 \
    --learning_rate $LR \
    --fp16
```

- OS: Ubuntu 20.04
- Python: 3.9.12
- Transformers: 4.28.0
- PyTorch: 2.0.0
- CUDA Support: True
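The `deepspeed.json` the script passes via `--deepspeed` is not shown, but that is where the loss-scaling behavior in the log is configured. A hypothetical fp16 section (values here are illustrative defaults, not the questioner's actual file; `initial_scale_power: 16` yields the 65536 starting scale seen in the log):

```json
{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

Raising `min_loss_scale` or `hysteresis` only changes when the scaler gives up; persistent overflows usually point at the gradients themselves (e.g. learning rate or data issues) rather than the scaler settings.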