This recipe demonstrates how to fine-tune the CosyVoice2 large language model with reinforcement learning algorithms—specifically GRPO—using the veRL framework. Our experiments show that applying GRPO reduces the character error rate (CER) on the CosyVoice3 zero_shot_zh set from 4.08 % to 3.36 %.
We recommend using the pre-built Docker image below. Alternatively, you can manually install the dependencies following the Dockerfile.
docker pull soar97/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
prepare_data.py expects a JSON/JSONL file with at least the following schema:
{
  "text": "An example sentence to be synthesized."
}
You can download the JSONL files from the metadata directory of the SparkAudio/voxbox dataset on Hugging Face.
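Alternatively, you can write such files yourself. The sketch below is only an illustration; the file names and sentences are placeholders, not part of the recipe:

```python
import json

# Hypothetical example: write a few sentences in the {"text": ...} layout
# that prepare_data.py expects, one JSON object per line.
samples = [
    {"text": "今天天气怎么样？"},
    {"text": "请把这句话合成为语音。"},
]

for name, items in [("train.jsonl", samples), ("test.jsonl", samples[:1])]:
    with open(name, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```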
Stage 0 converts raw JSONL files into the parquet format expected by veRL:
bash run.sh 0 0
Before running this stage, create the two JSONL files, train.jsonl and test.jsonl.
The script will then generate two Parquet files:
data/parquet_tiny/train.parquet
data/parquet_tiny/test.parquet
Each sample is automatically wrapped into a cosyvoice2-style prompt so that the LLM learns to output CosyVoice2 speech tokens.
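To sanity-check the output, you can load one of the Parquet files and look at a wrapped sample. The column name `prompt` below is an assumption; inspect the printed schema to see what prepare_data.py actually emits:

```python
import pandas as pd

# Hypothetical sanity check: column names are assumptions and may differ
# from what prepare_data.py actually writes.
df = pd.read_parquet("data/parquet_tiny/train.parquet")
print(df.columns.tolist())  # inspect the real schema first
print(df.iloc[0])           # look at one wrapped sample
```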
To compute rewards we run a lightweight server (token2wav_asr_server.py) that converts the generated CosyVoice2 speech tokens back into audio and transcribes it with an ASR model, so each sample can be scored against its target text.
Start the server (stage 1) in a dedicated terminal or on a separate GPU:
bash run.sh 1 1
# Triton server listens on ports 8000/8001/8002
The custom reward implementation lives in reward_tts.py and calls the server to obtain the reward score.
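As a rough illustration of what such a reward hook can look like: the function name and signature follow veRL's usual custom-reward convention, but the endpoint, payload fields, and scoring rule below are assumptions (the real server is a Triton server, so reward_tts.py likely uses a Triton client). See reward_tts.py for the actual logic.

```python
import requests

# Hypothetical sketch of a custom reward function for veRL.
# The endpoint and payload are placeholders, not the real Triton API.
ASR_SERVER_URL = "http://localhost:8000/score"  # placeholder endpoint

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward for one generated sample."""
    # Ask the token2wav + ASR server to synthesize and transcribe the tokens.
    resp = requests.post(
        ASR_SERVER_URL,
        json={"speech_tokens": solution_str, "reference_text": ground_truth},
        timeout=60,
    )
    cer = resp.json().get("cer", 1.0)
    # Lower CER -> higher reward, clipped to [0, 1].
    return max(0.0, 1.0 - cer)
```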
Run stage 2 to start GRPO training:
bash run.sh 2 2
Key CLI arguments passed to verl.trainer.main_ppo:
- `algorithm.adv_estimator=grpo` – use GRPO instead of PPO.
- `data.train_files=data/parquet_aishell3/train.parquet` and `data.val_files=data/parquet_aishell3/test.parquet`
- `custom_reward_function.path=reward_tts.py` – custom reward function described above.

Adjust `CUDA_VISIBLE_DEVICES`, batch sizes, and other hyperparameters to match your hardware.
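For orientation, the training stage boils down to a launch along these lines; only the overrides listed above are shown, while run.sh passes many more:

```python
import subprocess

# Hypothetical launch sketch: run.sh adds many additional Hydra overrides
# (batch sizes, GPU counts, rollout settings, ...).
cmd = [
    "python", "-m", "verl.trainer.main_ppo",
    "algorithm.adv_estimator=grpo",
    "data.train_files=data/parquet_aishell3/train.parquet",
    "data.val_files=data/parquet_aishell3/test.parquet",
    "custom_reward_function.path=reward_tts.py",
]
subprocess.run(cmd, check=True)
```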
After training completes, collect the sharded FSDP weights and export a Hugging Face-style checkpoint (stage 3):
bash run.sh 3 3 # merges weights into $llm_path/merged_hf_model
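If you want to inspect the exported checkpoint, it should load like any Hugging Face causal LM. The path below is a placeholder for your actual `$llm_path/merged_hf_model` directory, and the Auto classes assume a standard HF layout:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes stage 3 wrote a standard Hugging Face checkpoint; replace the
# path with your actual $llm_path/merged_hf_model directory.
ckpt = "checkpoints/merged_hf_model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
print(model.config.architectures, model.num_parameters())
```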
You can then evaluate the model on the CosyVoice3 zero-shot Chinese test set (stage 4):
bash run.sh 4 4
This command launches distributed inference via infer_dataset.py and computes WER with scripts/compute_wer.sh.
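For reference, character error rate is the character-level edit distance between the ASR transcript and the reference text, divided by the reference length. The snippet below is a minimal illustration of that metric, not the actual scripts/compute_wer.sh:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (r[i - 1] != h[j - 1]))
            prev = cur
    return dp[len(h)] / max(len(r), 1)

print(cer("今天天气很好", "今天天汽很好"))  # 1 substitution out of 6 chars
```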
> [!TIP]
> The script also supports the Seed-TTS test set by setting `dataset=test_zh`.
To use the RL-trained model with the official CosyVoice repository:
bash run.sh 5 5
The script converts the Hugging Face checkpoint back into the format expected by the CosyVoice repository.
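Conceptually this is the inverse of stage 3. The sketch below only illustrates the idea of dumping the Hugging Face weights back into a single PyTorch checkpoint; the actual key remapping and output layout live in huggingface_to_pretrained.py, and the file name here is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM

# Conceptual sketch only: the real conversion (huggingface_to_pretrained.py)
# renames keys to match CosyVoice's LLM module layout before saving.
hf_model = AutoModelForCausalLM.from_pretrained("checkpoints/merged_hf_model")
torch.save(hf_model.state_dict(), "llm_from_hf.pt")  # placeholder output path
```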
| Model | Seed-TTS test_zh CER | CosyVoice3 zero_shot_zh CER | Comment |
|---|---|---|---|
| CosyVoice2 LLM (official) | 1.45 % | 4.08 % | See the paper |
| CosyVoice2 LLM + GRPO | 1.37 % | 3.36 % | See the decoding results |
This work was inspired by the implementation in ch-tts-llasa-rl-grpo.