Thanks to Yuekai Zhang from NVIDIA for contributing this work.
Launch the service directly with Docker Compose:

```bash
docker compose up
```
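Once the container is up, you can wait for the server with Triton's standard readiness endpoint. A minimal sketch, assuming the default HTTP port 8000 is reachable from the host:

```python
# Poll Triton's KServe-v2 readiness endpoint until the server reports ready.
import time

import requests

URL = "http://localhost:8000/v2/health/ready"  # default Triton HTTP port

while True:
    try:
        if requests.get(URL, timeout=2).status_code == 200:
            print("Triton is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(2)
```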
Build the image from scratch:

```bash
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

Then start a container, mounting a host directory:

```bash
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
The run.sh script orchestrates the entire workflow through numbered stages.
Run a subset of stages with:

```bash
bash run.sh <start_stage> <stop_stage> [service_type]
```

- `<start_stage>` – stage to start from (0-5).
- `<stop_stage>` – stage to stop after (0-5).
- `[service_type]` – optional; `streaming` or `offline` (this choice decides whether `Decoupled=True` or `Decoupled=False` will be used later).

Stages:

- Stages 0-1: download and prepare the models.
- Stage 2: build the Triton model repository (its layout depends on the service type).
- Stage 3: launch the Triton server.
- Stage 4: send a single HTTP inference request.
- Stage 5: benchmark the running server.

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```bash
# Runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
Note: Stage 2 prepares the model repository differently depending on whether you intend to run with `Decoupled=False` or `Decoupled=True`. Rerun stage 2 if you switch the service type.
Send a single HTTP inference request:

```bash
bash run.sh 4 4
```
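Under the hood, stage 4 calls client_http.py. For reference, here is a minimal sketch of such a request against Triton's KServe-v2 HTTP API; the model and tensor names below are assumptions, so check client_http.py for the exact contract:

```python
# Hypothetical single-request client; model/tensor names are assumptions --
# see client_http.py for the real request format.
import requests
import soundfile as sf

samples, sr = sf.read("prompt.wav")  # mono reference audio

payload = {
    "inputs": [
        {"name": "reference_wav", "shape": [1, len(samples)],
         "datatype": "FP32", "data": samples.tolist()},
        {"name": "reference_wav_len", "shape": [1, 1],
         "datatype": "INT32", "data": [len(samples)]},
        {"name": "reference_text", "shape": [1, 1],
         "datatype": "BYTES", "data": ["transcript of the reference audio"]},
        {"name": "target_text", "shape": [1, 1],
         "datatype": "BYTES", "data": ["text to synthesise"]},
    ]
}
resp = requests.post("http://localhost:8000/v2/models/cosyvoice2/infer", json=payload)
resp.raise_for_status()
waveform = resp.json()["outputs"][0]["data"]  # synthesized samples as a flat list
```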
Benchmark the running Triton server. Pass either `streaming` or `offline` as the third argument:

```bash
bash run.sh 5 5
# You can also customise parameters such as --num-tasks and the dataset split directly:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```
> [!TIP]
> Only offline CosyVoice TTS is currently supported. Setting the client to `streaming` simply enables NVIDIA Triton's decoupled mode so that responses are returned as soon as they are ready.
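In decoupled mode the gRPC client receives each response through a callback as soon as it is produced. A rough sketch of that flow with `tritonclient`, reusing the (assumed) tensor names from the HTTP example above; client_grpc.py is the authoritative reference:

```python
# Decoupled (streaming) gRPC sketch. Model/tensor names are assumptions.
import queue

import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Called once per decoupled response (or error) as soon as it is ready.
    responses.put(error if error is not None else result)

samples, sr = sf.read("prompt.wav")
samples = samples.astype(np.float32).reshape(1, -1)

inputs = [
    grpcclient.InferInput("reference_wav", list(samples.shape), "FP32"),
    grpcclient.InferInput("reference_wav_len", [1, 1], "INT32"),
    grpcclient.InferInput("reference_text", [1, 1], "BYTES"),
    grpcclient.InferInput("target_text", [1, 1], "BYTES"),
]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(np.array([[samples.shape[1]]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([["reference transcript"]], dtype=object))
inputs[3].set_data_from_numpy(np.array([["text to synthesise"]], dtype=object))

client = grpcclient.InferenceServerClient("localhost:8001")  # default gRPC port
client.start_stream(callback=callback)
client.async_stream_infer("cosyvoice2", inputs)
client.stop_stream()  # returns once the stream is closed and callbacks have fired

chunks = []
while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    chunks.append(item.as_numpy("waveform"))  # assumed output tensor name
audio = np.concatenate(chunks, axis=-1)
```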
Decoding on a single L20 GPU with 26 prompt_audio/target_text pairs (≈221 s of audio):
| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|---|---|---|---|---|---|
| Decoupled=False | Commit | 1 | 758.04 | 615.79 | 0.0891 |
| Decoupled=False | Commit | 2 | 1025.93 | 901.68 | 0.0657 |
| Decoupled=False | Commit | 4 | 1914.13 | 1783.58 | 0.0610 |
| Decoupled=True | Commit | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | Commit | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | Commit | 4 | 1790.91 | 1668.63 | 0.0604 |
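RTF (real-time factor) is synthesis wall-clock time divided by the duration of the audio produced, so lower is better. A quick sanity check against the figures above:

```python
# RTF = synthesis time / audio duration (lower is better).
total_audio_s = 221.0                # ~221 s of audio over the 26 test pairs
rtf = 0.0891                         # Decoupled=False, concurrency 1
synthesis_s = rtf * total_audio_s
print(f"{synthesis_s:.1f} s of compute for {total_audio_s:.0f} s of audio")
# ~19.7 s, i.e. roughly 11x faster than real time
```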
To launch an OpenAI-compatible service, run:

```bash
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt

# After the Triton service is up, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000

# Test with curl
bash test/test_cosyvoice.sh
```
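Once the bridge is running, any OpenAI-style client can talk to it. A hypothetical request, assuming the bridge exposes the standard /v1/audio/speech route (see test/test_cosyvoice.sh for the exact fields):

```python
# Hypothetical OpenAI-compatible TTS request; the route and field names are
# assumptions -- test/test_cosyvoice.sh shows the actual contract.
import requests

resp = requests.post(
    "http://localhost:10086/v1/audio/speech",
    json={
        "model": "cosyvoice",           # assumed model identifier
        "input": "Hello from Triton!",  # text to synthesise
        "voice": "default",             # assumed voice/reference-audio name
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # audio bytes, 24 kHz by default per the bridge flag
```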
This section originates from the NVIDIA CISI project. For other multimodal resources, see mair-hub.