Directly launch the service using docker compose.
```sh
docker compose up
```
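Once it is up, you can check that the Triton server is ready. This uses Triton's standard HTTP health endpoint and assumes the default HTTP port 8000 is reachable (it is with `--net host` or the compose defaults):

```sh
# Returns HTTP 200 once the server and all loaded models are ready to serve requests.
curl -v http://localhost:8000/v2/health/ready
```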
Build the docker image from scratch.
```sh
docker build . -f Dockerfile.server -t soar97/triton-spark-tts:25.02
```
Then create and enter a container from the image, mounting a host directory:
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "spark-tts-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-spark-tts:25.02
```
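Inside the container, it is worth confirming that the GPUs are visible before building any engines:

```sh
# Should list the same GPUs as on the host when the container was started with --gpus all.
nvidia-smi
```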
The `run.sh` script automates the whole workflow in stages. You can run specific stages using:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>`: The stage to begin execution from (0-5).
- `<stop_stage>`: The stage to end execution at (0-5).
- `[service_type]`: Optional; specifies the service type (`streaming` or `offline`; defaults may apply based on script logic). Required for stages 4 and 5.
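The stage-range arguments follow a common shell idiom: each stage block runs only if its number falls within `[<start_stage>, <stop_stage>]`. A minimal sketch of the pattern (variable names are illustrative, not copied from the actual script):

```sh
stage=$1
stop_stage=$2
service_type=$3

# Each block runs only when its stage number falls inside [stage, stop_stage].
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
  echo "Stage 0: download the original model"
  # ... download commands ...
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  echo "Stage 1: convert the checkpoint to TensorRT-LLM format"
  # ... conversion commands ...
fi
```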
Stages:

Inside the docker container, you can prepare the models and launch the Triton server by running stages 0 through 3. This involves downloading the original model, converting it to TensorRT-LLM format, building the optimized TensorRT engines, creating the necessary model repository structure for Triton, and finally starting the server.
```sh
# This runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
Note: Stage 2 prepares the model repository differently based on whether you intend to run streaming or offline inference later. You might need to re-run stage 2 if switching service types.
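For example, switching to streaming might only require re-running stage 2 (whether stage 2 honors the `service_type` argument is an assumption here; check `run.sh` to confirm):

```sh
# Rebuild the Triton model repository for streaming inference.
bash run.sh 2 2 streaming
```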
Run a single inference request. Specify streaming or offline as the third argument.
Streaming Mode (gRPC):
```sh
bash run.sh 5 5 streaming
```
This executes the `client_grpc.py` script with predefined example text and prompt audio in streaming mode.
Offline Mode (HTTP):
```sh
bash run.sh 5 5 offline
```
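This presumably runs `client_http.py` against the server. A sketch of a direct invocation is below; the flag names and paths are illustrative assumptions, so check `client_http.py` for the actual interface:

```sh
# Hypothetical direct call; flags and paths are placeholders, not the script's confirmed interface.
python3 client_http.py \
    --reference-audio ./prompt_audio.wav \
    --reference-text "transcript of the prompt audio" \
    --target-text "text to synthesize" \
    --model-name spark_tts
```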
Run the benchmark client against the running Triton server. Specify streaming or offline as the third argument.
```sh
# Run benchmark in streaming mode
bash run.sh 4 4 streaming

# Run benchmark in offline mode
bash run.sh 4 4 offline
```
```sh
# You can also customize parameters like num_tasks directly in client_grpc.py or via args if supported
# Example from run.sh (streaming):
# python3 client_grpc.py \
#     --server-addr localhost \
#     --model-name spark_tts \
#     --num-tasks 2 \
#     --mode streaming \
#     --log-dir ./log_concurrent_tasks_2_streaming_new

# Example customizing dataset (requires modifying client_grpc.py or adding args):
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts --split-name wenetspeech4tts --mode [streaming|offline]
```
Benchmark setup: decoding on a single L20 GPU, using 26 different prompt_audio/target_text pairs, with a total audio duration of 169 seconds.
| Mode | Note | Concurrency | Avg Latency | First Chunk Latency (P50) | RTF |
|-----------|-------------|-------------|-------------|---------------------------|--------|
| Offline   | Code Commit | 1 | 876.24 ms  | -          | 0.1362 |
| Offline   | Code Commit | 2 | 920.97 ms  | -          | 0.0737 |
| Offline   | Code Commit | 4 | 1611.51 ms | -          | 0.0704 |
| Streaming | Code Commit | 1 | 913.28 ms  | 210.42 ms  | 0.1501 |
| Streaming | Code Commit | 2 | 1009.23 ms | 226.08 ms  | 0.0862 |
| Streaming | Code Commit | 4 | 1793.86 ms | 1017.70 ms | 0.0824 |
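RTF here is the real-time factor: processing time divided by the duration of the generated audio, so lower is better (this is the conventional definition and an assumption about how the benchmark script computes it). For instance, an RTF of 0.0737 over 169 s of audio corresponds to roughly 0.0737 × 169 ≈ 12.5 s of total synthesis time.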