NVIDIA Triton Inference Server

Deploy, run, and scale AI for any application on any platform.

Inference for Every AI Workload

Run inference on trained machine learning or deep learning models from any framework on any processor—GPU, CPU, or other—with NVIDIA Triton™ Inference Server. Part of the NVIDIA AI platform and available with NVIDIA AI Enterprise, Triton Inference Server is open-source software that standardizes AI model deployment and execution across every workload.

Explore the Benefits of Triton Inference Server

Supports All Training and Inference Frameworks

Deploy AI models from any major framework with Triton Inference Server, including TensorFlow, PyTorch, Python, ONNX, NVIDIA® TensorRT™, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more.

High-Performance Inference on Any Platform

Maximize throughput and utilization with dynamic batching, concurrent execution, optimal configuration, and streaming audio and video. Triton Inference Server supports all NVIDIA GPUs, x86 and Arm® CPUs, and AWS Inferentia.

Open Source and Designed for DevOps and MLOps

Integrate Triton Inference Server into DevOps and MLOps solutions such as Kubernetes for scaling and Prometheus for monitoring. It can also be used in all major cloud and on-premises AI and MLOps platforms.

Enterprise-Grade Security, Manageability, and API Stability

NVIDIA AI Enterprise, including NVIDIA Triton Inference Server and Triton Management Service, is a secure, production-ready AI software platform designed to accelerate time to value with support, security, and API stability.

Get Started With Triton

Purchase NVIDIA AI Enterprise With Triton for Production Deployment

Purchase NVIDIA AI Enterprise, which includes NVIDIA Triton Inference Server and Triton Management Service for production inference.

Download Containers and Code for Development

Triton Inference Server containers are available on NVIDIA NGC™ and as open-source code on GitHub.

Triton Management Service

Automate the deployment of multiple Triton Inference Server instances in Kubernetes with resource-efficient model orchestration on GPUs and CPUs.

Features and Tools

Large Language Model Inference

TensorRT-LLM is an open-source library for defining, optimizing, and executing large language models (LLMs) for inference in production. It combines the core functionality of FasterTransformer with TensorRT's Deep Learning Compiler in an open-source Python API, making it quick to support new models and customizations.
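
For illustration, recent TensorRT-LLM releases expose a high-level Python LLM API; the sketch below assumes such a release, and the checkpoint name is only an example of a supported model.

    # Minimal sketch of TensorRT-LLM's high-level LLM API (recent releases).
    # The checkpoint name is an example; any supported model can be used.
    from tensorrt_llm import LLM, SamplingParams

    prompts = ["Deploying large language models in production requires"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds or loads a TensorRT engine
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.outputs[0].text)

Engines built with TensorRT-LLM can also be served at scale through Triton's TensorRT-LLM backend.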

Model Ensembles

Many modern AI workloads require executing multiple models, often with pre- and postprocessing steps for each query. Triton supports model ensembles and pipelines, can execute different parts of the ensemble on CPU or GPU, and allows multiple frameworks inside the ensemble.
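
Because an ensemble is exposed as a single model, a client sends one request and Triton runs the whole pipeline server-side. Below is a minimal sketch using the Triton Python HTTP client; the ensemble name and tensor names are hypothetical.

    # Hypothetical ensemble "preprocess_classify": image decoding on CPU, classification on GPU.
    # The client makes a single call; Triton executes every step of the pipeline.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    raw = np.fromfile("example.jpg", dtype=np.uint8)              # raw encoded image bytes
    image_input = httpclient.InferInput("RAW_IMAGE", [1, raw.size], "UINT8")
    image_input.set_data_from_numpy(raw.reshape(1, -1))

    result = client.infer(model_name="preprocess_classify", inputs=[image_input])
    print(result.as_numpy("CLASSIFICATION").argmax())             # top-1 class index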

Tree-Based Models

The Forest Inference Library (FIL) backend in Triton provides support for high-performance inference of tree-based models with explainability (SHAP values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML RandomForest, and others in Treelite format.
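
As a sketch of the workflow, the snippet below trains an XGBoost classifier and saves it into a Triton model repository layout for the FIL backend; the repository path, model name, and dataset are assumptions, and the config.pbtxt that selects the FIL backend is not shown.

    # Illustrative only: train an XGBoost model and place it in a hypothetical
    # Triton model repository layout (<repository>/<model>/<version>/).
    import os
    import xgboost as xgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=10_000, n_features=32, random_state=0)
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6).fit(X, y)

    version_dir = "model_repository/fraud_xgb/1"
    os.makedirs(version_dir, exist_ok=True)
    model.save_model(os.path.join(version_dir, "xgboost.json"))   # JSON format readable by FIL
    # A config.pbtxt for "fraud_xgb" selects the "fil" backend and declares the
    # input/output tensors (omitted here).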

NVIDIA PyTriton

PyTriton provides a simple interface that lets Python developers use Triton to serve anything: models, simple processing functions, or entire inference pipelines. This native support for Triton in Python enables rapid prototyping and testing of machine learning models without sacrificing performance or efficiency. A single line of code brings up Triton, providing benefits such as dynamic batching, concurrent model execution, and support for GPUs and CPUs. It eliminates the need to set up model repositories and convert model formats, and existing inference pipeline code can be used without modification.
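
A minimal PyTriton sketch (the function, tensor shapes, and model name below are arbitrary) looks roughly like this:

    # Minimal PyTriton sketch: serve an arbitrary Python function through Triton.
    import numpy as np
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    @batch
    def infer_fn(**inputs):
        (values,) = inputs.values()
        return [values * 2]        # any Python/NumPy logic can go here

    with Triton() as triton:
        triton.bind(               # exposes the function as a Triton model
            model_name="Doubler",
            infer_func=infer_fn,
            inputs=[Tensor(dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=128),
        )
        triton.serve()             # serves HTTP/gRPC like a regular Triton instance

Requests arriving on Triton's standard HTTP/gRPC endpoints are batched and passed to the bound function as NumPy arrays.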

NVIDIA Triton Model Analyzer

Triton Model Analyzer is a tool that automatically evaluates model deployment configurations in Triton Inference Server, such as batch size, precision, and the number of concurrent execution instances on the target processor. It helps select the configuration that best meets application quality-of-service (QoS) constraints, such as latency, throughput, and memory requirements, and reduces the time needed to find it. The tool also supports model ensembles and multi-model analysis.
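
Model Analyzer is typically driven from the command line; a minimal sketch of a profiling run (the repository paths and model name are placeholders) might look like:

    model-analyzer profile \
      --model-repository /path/to/model_repository \
      --profile-models my_model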

Customer Stories

Discover how Amazon improved customer satisfaction with NVIDIA AI by accelerating its inference by 5X.

Learn how American Express improved fraud detection by analyzing tens of millions of daily transactions 50X faster.

Discover how Siemens Energy augmented inspections by providing AI-based remote monitoring for leaks, abnormal noises, and more.

See how Microsoft Teams used Triton Inference Server to optimize live captioning and transcription in multiple languages with very low latency.

See how NIO achieved a low-latency inference workflow by integrating NVIDIA Triton Inference Server into their autonomous driving inference pipeline.

More Resources

Get an Introduction

Understand the key features in Triton Inference Server that help you deploy, run, and scale AI models in production with ease.

Hear from Experts

Explore GTC sessions on inference and getting started with Triton Inference Server.

Explore Technical Blogs

Read blogs about Triton Inference Server.

Check Out an Ebook

Discover the modern landscape of AI inference, production use cases from companies, and real-world challenges and solutions.

Stay up to date on the latest AI inference news from NVIDIA.