Automate the deployment of multiple Triton Inference Server instances in Kubernetes with resource-efficient model orchestration.
NVIDIA Triton™, part of the NVIDIA® AI platform, offers a new capability called Triton Management Service that automates the deployment of multiple Triton Inference Server instances in Kubernetes with resource-efficient model orchestration on GPUs and CPUs. This software application manages the deployment of Triton Inference Server instances, each serving one or more AI models; allocates models to individual GPUs and CPUs; and efficiently co-locates models by framework. Triton Management Service enables large-scale inference deployment with high performance and hardware utilization. It will soon be available exclusively with NVIDIA AI Enterprise, an enterprise-grade AI software platform.
Automates deploying and managing Triton Inference Server instances on Kubernetes and groups models from different frameworks to use memory efficiently.
Loads models on demand, unloads idle models via a lease system, and packs as many models as possible onto a single GPU server.
Monitors each Triton Inference Server instance’s health and capacity, and autoscales based on latency and hardware utilization.
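The lease-based loading and model-packing behavior described above can be sketched conceptually. This is a minimal illustration, not the Triton Management Service API: the `Orchestrator`, `Gpu`, and `Model` names, the first-fit packing policy, and the lease bookkeeping are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field

# Hypothetical sketch of lease-based model orchestration, not the actual
# Triton Management Service implementation or API.

@dataclass
class Model:
    name: str
    mem_mb: int  # approximate GPU memory the model needs

@dataclass
class Gpu:
    total_mb: int
    loaded: dict = field(default_factory=dict)  # model name -> Model

    def free_mb(self):
        return self.total_mb - sum(m.mem_mb for m in self.loaded.values())

class Orchestrator:
    """Load models on demand, lease them to callers, unload when leases lapse."""

    def __init__(self, gpus, lease_seconds=60):
        self.gpus = gpus
        self.lease_seconds = lease_seconds
        self.leases = {}  # model name -> lease expiry timestamp

    def acquire(self, model, now=None):
        now = time.time() if now is None else now
        # If the model is already loaded somewhere, just renew its lease.
        for gpu in self.gpus:
            if model.name in gpu.loaded:
                self.leases[model.name] = now + self.lease_seconds
                return gpu
        # First-fit packing: place the model on the first GPU with room,
        # so as many models as possible share each device.
        for gpu in self.gpus:
            if gpu.free_mb() >= model.mem_mb:
                gpu.loaded[model.name] = model
                self.leases[model.name] = now + self.lease_seconds
                return gpu
        raise RuntimeError("no GPU has capacity for " + model.name)

    def evict_expired(self, now=None):
        now = time.time() if now is None else now
        for gpu in self.gpus:
            expired = [n for n in gpu.loaded if self.leases.get(n, 0) <= now]
            for name in expired:
                del gpu.loaded[name]
                self.leases.pop(name, None)
```

In this sketch, every request renews the model's lease, and a periodic sweep (`evict_expired`) frees GPU memory held by models whose leases have lapsed.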
Use Triton Management Service to manage inference deployment from a single model to hundreds of models efficiently. Deploy on-premises or on any public cloud.
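The latency- and utilization-driven autoscaling mentioned above can be illustrated with a simple scaling policy. The `scale_decision` function and all thresholds below are hypothetical, shown only to make the idea concrete; they are not the algorithm Triton Management Service uses.

```python
# Illustrative autoscaling policy: scale out when latency breaches the SLO
# or GPUs run hot; scale in when the fleet is cold with latency headroom.
# All names and thresholds are assumptions for this sketch.

def scale_decision(p99_latency_ms, gpu_util, replicas,
                   latency_slo_ms=100.0, util_high=0.85, util_low=0.30,
                   min_replicas=1, max_replicas=10):
    """Return the new replica count for an inference server deployment."""
    if (p99_latency_ms > latency_slo_ms or gpu_util > util_high) \
            and replicas < max_replicas:
        return replicas + 1  # scale out one step
    if gpu_util < util_low and p99_latency_ms < latency_slo_ms / 2 \
            and replicas > min_replicas:
        return replicas - 1  # scale in one step
    return replicas  # hold steady
```

A controller would evaluate such a policy on each monitoring interval, feeding it the observed tail latency and aggregate GPU utilization.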