
# Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

## Quick Start

Launch the service directly with Docker Compose:

```bash
docker compose -f docker-compose.cosyvoice3.yml up
```
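If you prefer to keep the service in the background, standard Docker Compose flags apply (nothing here is specific to this repository):

```bash
# Start detached, then follow the service logs.
docker compose -f docker-compose.cosyvoice3.yml up -d
docker compose -f docker-compose.cosyvoice3.yml logs -f
```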

## Build the Docker Image

To build the image from scratch:

```bash
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

## Run a Docker Container

Set `your_mount_dir` to the `host_path:container_path` pair you want mounted inside the container:

```bash
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
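If you exit the container later, you can restart and re-attach to it with standard Docker commands rather than recreating it:

```bash
docker start cosyvoice-server
docker exec -it cosyvoice-server /bin/bash
```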

## Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:

```bash
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: the first stage to run.
- `<stop_stage>`: the last stage to run (inclusive).
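For example, to run only the checkpoint conversion and model-repository stages described below:

```bash
# Runs stages 1 and 2 only.
bash run_cosyvoice3.sh 1 2
```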

Stages:

- Stage -1: Clones the CosyVoice repository.
- Stage 0: Downloads the Fun-CosyVoice3-0.5B-2512 model and its HuggingFace LLM checkpoint.
- Stage 1: Converts the HuggingFace LLM checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- Stage 2: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- Stage 3: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- Stage 4: Runs the gRPC benchmark client for performance testing.
- Stage 5: Runs the offline TTS inference benchmark.
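Each stage is typically guarded by a start/stop range check, which is how the two arguments select a contiguous subset. A minimal sketch of that common shell pattern (an assumption about the script's internals, not a verbatim excerpt):

```bash
#!/usr/bin/env bash
# Assumed gating pattern; see run_cosyvoice3.sh for the actual logic.
start_stage=${1:--1}
stop_stage=${2:-5}

if [ "$start_stage" -le 0 ] && [ "$stop_stage" -ge 0 ]; then
  echo "Stage 0: download the model checkpoints"
fi

if [ "$start_stage" -le 1 ] && [ "$stop_stage" -ge 1 ]; then
  echo "Stage 1: convert the checkpoint and build TensorRT-LLM engines"
fi
```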

## Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0 through 3:

```bash
# This command runs stages 0, 1, 2, and 3.
bash run_cosyvoice3.sh 0 3
```
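Once the server is up, you can sanity-check it through Triton's standard KServe v2 HTTP health endpoint (assuming the default HTTP port 8000):

```bash
# Returns HTTP 200 once the server and its models are ready.
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```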

## Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:

```bash
bash run_cosyvoice3.sh 4 4
```

You can customize parameters such as the number of concurrent tasks inside the script.
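For instance, the concurrency level is usually set by a variable near the top of the script; the name below is illustrative only (check the script for the actual variable):

```bash
# Hypothetical knob; open run_cosyvoice3.sh to find the real variable name.
num_task=4   # number of concurrent benchmark tasks
```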

The following results were obtained by decoding on a single NVIDIA L20 GPU.

### Streaming TTS (Concurrent Tasks = 4)

First-chunk latency:

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
|---|---|---|---|---|---|
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |

## Benchmark with Offline Inference Mode

To benchmark offline inference, run stage 5:

```bash
bash run_cosyvoice3.sh 5 5
```

### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---|---|---|---|---|---|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |
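RTF (real-time factor) is conventionally computed as processing time divided by the duration of the audio produced, so lower is better. Reading the table under that definition, the batch-size-1 row implies roughly 19.48 / 0.1091 ≈ 179 s of synthesized audio for the test set; this is an inference from the numbers above, not a figure reported by the benchmark.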