This recipe demonstrates how to fine-tune the CosyVoice2 large language model with a reinforcement learning algorithm, specifically GRPO (Group Relative Policy Optimization), using the veRL framework. Our experiments show that applying GRPO reduces the character error rate (CER) on the CosyVoice3 zero_shot_zh test set from 4.08% to 3.36%.
We recommend using the pre-built Docker image below. Alternatively, you can manually install the dependencies following the Dockerfile.
```bash
docker pull soar97/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
```
If Docker is not available, you can refer to run.sh stage -2 to install the dependencies locally.
`prepare_data.py` expects a JSON/JSONL file with at least the following schema:

```json
{
  "text": "An example sentence to be synthesized."
}
```
You can download the JSONL files from the metadata directory of the SparkAudio/voxbox dataset on Hugging Face.
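Alternatively, for a quick local smoke test, you can write two tiny conforming files yourself. The sketch below assumes only the "text" field shown above; the sentences are placeholders:

```python
import json

# Placeholder sentences; replace with your own corpus.
splits = {
    "train.jsonl": ["今天天气真不错。", "请把灯打开。"],
    "test.jsonl": ["明天我们一起去公园散步。"],
}

for path, sentences in splits.items():
    with open(path, "w", encoding="utf-8") as f:
        for text in sentences:
            # Each line is a JSON object with at least the "text" field.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```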
Stage 0 converts raw JSONL files into the Parquet format expected by veRL. Create two JSONL files, train.jsonl and test.jsonl, then run:

```bash
bash run.sh 0 0
```

The script generates two Parquet files:

- data/parquet_tiny/train.parquet
- data/parquet_tiny/test.parquet
Each sample is automatically wrapped into a CosyVoice2-style prompt so that the LLM learns to output CosyVoice2 speech tokens.
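To sanity-check the conversion, you can inspect one of the generated files with pandas. The exact column layout is defined by `prepare_data.py`, so treat this as a rough inspection sketch rather than a description of the schema:

```python
import pandas as pd

# Load the training split produced by stage 0.
df = pd.read_parquet("data/parquet_tiny/train.parquet")

print(df.columns.tolist())  # column layout expected by veRL
print(df.iloc[0])           # first sample, including the CosyVoice2-style prompt
```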
To compute rewards, we run a lightweight Triton-based reward server.
Start the server (stage 1) in a dedicated terminal or on a separate GPU:
```bash
bash run.sh 1 1
# Triton server listens on ports 8000/8001/8002
```
The custom reward implementation is located in `reward_tts.py` and calls the server to obtain the reward score.
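The rough shape of such a function, assuming veRL's default `compute_score` entry point and a hypothetical HTTP endpoint exposed by the reward server, is sketched below; `reward_tts.py` is the actual implementation used in this recipe:

```python
import requests

# Hypothetical endpoint on the stage-1 reward server; the real reward_tts.py
# may use a different URL, payload, and client (e.g. the Triton client API).
REWARD_URL = "http://localhost:8000/reward"

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return a scalar reward (higher is better) for one generated sample."""
    response = requests.post(
        REWARD_URL,
        json={"generated": solution_str, "reference": ground_truth},
        timeout=60,
    )
    response.raise_for_status()
    return float(response.json()["reward"])
```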
Run stage 2 to start GRPO training:
```bash
bash run.sh 2 2
```
Key CLI arguments passed to `verl.trainer.main_ppo`:

- `algorithm.adv_estimator=grpo` – use GRPO instead of PPO.
- `data.train_files=data/parquet_aishell3/train.parquet` and `data.val_files=data/parquet_aishell3/test.parquet`
- `custom_reward_function.path=reward_tts.py` – the custom reward function described above.

Adjust `CUDA_VISIBLE_DEVICES`, batch sizes, and other hyperparameters to match your hardware.
> [!TIP]
> The lm_head bias is disabled during training to keep the model compatible with vLLM and Transformers' Qwen implementation.
After training is complete, collect the sharded FSDP weights and export a Hugging Face-style checkpoint (stage 3):
```bash
bash run.sh 3 3  # merges weights into $llm_path/merged_hf_model
```
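Once the merge finishes, you can load the exported checkpoint with Transformers to confirm it is usable as a standard causal LM (the path is a placeholder for `$llm_path/merged_hf_model`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "merged_hf_model"  # placeholder: use $llm_path/merged_hf_model from run.sh

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir)

# The lm_head bias is disabled during training (see the tip above),
# so for this Qwen-style checkpoint the bias should be None.
print(model.lm_head.bias)
```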
You can then evaluate the model on the CosyVoice3 zero-shot Chinese test set (stage 4):
```bash
bash run.sh 4 4
```
This command launches distributed inference via `infer_dataset.py` and computes WER with `scripts/compute_wer.sh`.
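If you want to spot-check individual hypotheses outside of `scripts/compute_wer.sh`, the character error rate can be computed with a generic library such as jiwer; this is only an illustration and not necessarily the tooling the script itself uses:

```python
import jiwer

reference = "今天天气真不错"
hypothesis = "今天天气正不错"

# jiwer.cer computes the character error rate between reference and hypothesis.
print(jiwer.cer(reference, hypothesis))
```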
> [!TIP]
> The script also supports the Seed-TTS test set by setting `dataset=test_zh`.
To use the RL-trained model with the official CosyVoice repository:
```bash
bash run.sh 5 5
```
The script converts the Hugging Face checkpoint back into the format expected by the CosyVoice repository.
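After conversion, the checkpoint can be used like any other CosyVoice2 model. A zero-shot synthesis sketch following the official CosyVoice2 usage example (model directory, prompt audio, and texts are placeholders):

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Point this at the CosyVoice2 model directory containing the converted LLM.
cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")

# 16 kHz prompt audio and its transcript (placeholders).
prompt_speech_16k = load_wav("prompt.wav", 16000)

for i, out in enumerate(cosyvoice.inference_zero_shot(
        "这是一段用于测试合成效果的文本。", "提示音频对应的文本。",
        prompt_speech_16k, stream=False)):
    torchaudio.save(f"zero_shot_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)
```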
> [!TIP]
> We observed a slight accuracy drop when using the RL-trained model after this conversion, compared with the Hugging Face-format checkpoint.
| Model | Seed-TTS test_zh CER | CosyVoice3 zero_shot_zh CER | Comment |
|---|---|---|---|
| CosyVoice2 LLM (official) | 1.45% | 4.08% | See the paper |
| CosyVoice2 LLM + GRPO | 1.37% | 3.36% | See the decoding results, Hugging Face-format model |
This work was inspired by the implementation in ch-tts-llasa-rl-grpo.