## Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

### Quick Start

Launch the service directly with Docker Compose:

```sh
docker compose -f docker-compose.cosyvoice3.yml up
```

### Build the Docker Image

To build the image from scratch:

```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container

```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```

### Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages. You can run a subset of stages with:

```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.

**Stages:**

- **Stage -1**: Clones the `CosyVoice` repository.
- **Stage 0**: Downloads the `Fun-CosyVoice3-0.5B-2512` model and its HuggingFace LLM checkpoint.
- **Stage 1**: Converts the HuggingFace checkpoint for the LLM to the TensorRT-LLM format and builds the TensorRT engines (a sketch of this flow appears after the benchmark results below).
- **Stage 2**: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- **Stage 3**: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark.

### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:

```sh
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
```

### Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:

```sh
bash run_cosyvoice3.sh 4 4
# You can customize parameters such as the number of tasks inside the script.
```

The following results were obtained by decoding on a single L20 GPU.

#### Streaming TTS (Concurrent Tasks = 4)

**First Chunk Latency**

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| ---------------- | ------------ | -------------------- | -------------------- | -------------------- | -------------------- |
| 4                | 750.42       | 740.31               | 941.05               | 977.55               | 1002.37              |

### Benchmark with Offline Inference Mode

For the offline inference benchmark, run stage 5:

```sh
bash run_cosyvoice3.sh 5 5
```

#### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF    |
| ------- | -------------- | ------------ | ------------------ | ----------------- | ------ |
| TRTLLM  | 1              | 13.21        | 5.72               | 19.48             | 0.1091 |
| TRTLLM  | 2              | 8.46         | 6.02               | 14.91             | 0.0822 |
| TRTLLM  | 4              | 5.07         | 5.95               | 11.43             | 0.0630 |
| TRTLLM  | 8              | 2.98         | 6.11               | 9.53              | 0.0562 |
| TRTLLM  | 16             | 2.12         | 6.27               | 8.83              | 0.0501 |
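Here RTF is the real-time factor: total pipeline time divided by the duration of the synthesized audio, so lower is better. As a back-of-envelope check against the first row (a sketch of the arithmetic only, not output from the benchmark script):

```sh
# RTF = pipeline_time / audio_duration, so the audio duration implied
# by the first row is pipeline_time / RTF:
awk 'BEGIN { printf "%.1f s of audio\n", 19.48 / 0.1091 }'
# => 178.6 s of audio, i.e. the pipeline runs roughly 9x faster than real time
```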
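For reference, the checkpoint conversion and engine build in stage 1 follow the standard TensorRT-LLM flow: convert the HuggingFace weights into a TensorRT-LLM checkpoint, then compile engines with `trtllm-build`. The sketch below is illustrative only; the converter script path, model directories, and dtype are assumptions (it presumes a Qwen-style LLM checkpoint), and `run_cosyvoice3.sh` contains the exact commands this recipe uses.

```sh
# Illustrative sketch of a standard TensorRT-LLM engine build; all paths
# and the choice of converter/dtype are assumptions, not this recipe's values.
python3 TensorRT-LLM/examples/qwen/convert_checkpoint.py \
  --model_dir ./Fun-CosyVoice3-0.5B-2512/llm_hf \
  --output_dir ./tllm_checkpoint \
  --dtype bfloat16

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./trt_engines
```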
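Before launching the stage 4 or stage 5 benchmarks against a live deployment, it can help to confirm that both servers started in stage 3 are reachable. A minimal sketch, assuming Triton's HTTP endpoint on port 8000 and `trtllm-serve` on port 8080 (both port numbers are assumptions; `run_cosyvoice3.sh` defines the real ones):

```sh
# Triton readiness via the KServe v2 HTTP endpoint (port assumed):
curl -fsS http://localhost:8000/v2/health/ready && echo "Triton ready"

# trtllm-serve exposes an OpenAI-compatible API; listing models is a
# cheap liveness check (port assumed):
curl -fsS http://localhost:8080/v1/models
```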