Contributed by Yuekai Zhang (NVIDIA).
Launch the service directly with Docker Compose:

```sh
docker compose -f docker-compose.cosyvoice3.yml up
```
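If you prefer not to keep a foreground process attached, Compose's standard detached mode works here as well:

```sh
# Start the service in the background, then follow its logs separately
docker compose -f docker-compose.cosyvoice3.yml up -d
docker compose -f docker-compose.cosyvoice3.yml logs -f
```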
To build the image from scratch:

```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

Then start a container from the image, mounting a host directory and enabling all GPUs:

```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
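Once inside the container, a quick sanity check with standard CUDA tooling (nothing specific to this image) confirms the GPUs and shared memory are set up as expected:

```sh
# Inside the container: verify that the GPUs are visible
nvidia-smi
# Verify the shared-memory size requested via --shm-size
df -h /dev/shm
```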
The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.
You can run a subset of stages with:

```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.

Stages:
- **Stage 0**: Clone the CosyVoice repository.
- **Stage 1**: Download the Fun-CosyVoice3-0.5B-2512 model and its HuggingFace LLM checkpoint.
- **Stage 2**: Create the Triton model repository with the models `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- **Stage 3**: Launch `trtllm-serve` to deploy the CosyVoice3 LLM.

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
```
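Once stage 3 finishes, you can poll Triton's standard HTTP health endpoint to confirm the server is ready. This sketch assumes the default HTTP port 8000; adjust if the script configures a different one:

```sh
# Prints 200 once all models are loaded and the server is ready to serve requests
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```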
To benchmark the running Triton server, run stage 4:
```sh
bash run_cosyvoice3.sh 4 4
# You can customize parameters such as the number of tasks inside the script.
```
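The exact parameter names live inside the script. As a sketch, assuming a `num_task`-style variable controls benchmark concurrency (the real name may differ in your copy), you could adjust it non-interactively before re-running the stage:

```sh
# Hypothetical: raise benchmark concurrency to 8 before re-running stage 4.
# `num_task` is an assumed variable name; check run_cosyvoice3.sh for the real one.
sed -i 's/^num_task=.*/num_task=8/' run_cosyvoice3.sh
bash run_cosyvoice3.sh 4 4
```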
The following results were obtained by decoding on a single L20 GPU.
First Chunk Latency
| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
|---|---|---|---|---|---|
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |
To benchmark offline inference mode, run stage 5:

```sh
bash run_cosyvoice3.sh 5 5
```
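In the table below, RTF (real-time factor) is the end-to-end pipeline time divided by the duration of the generated audio, so lower is better. For example, at LLM batch size 16, a pipeline time of 8.83 s with an RTF of 0.0501 corresponds to roughly 176 s of synthesized audio (8.83 / 0.0501 ≈ 176).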
| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---|---|---|---|---|---|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |