TGI

Large Language Model Text Generation Inference

Latest: v3.3.7 · 15 releases · 2 breaking changes
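Releases are distributed as versioned Docker images on GHCR. As a deployment sketch following TGI's standard quick-start pattern (the model ID, host port, and volume path below are illustrative assumptions, not taken from this changelog):

```shell
# Sketch: launch the v3.3.7 image. GPU access, shm size, model ID, and
# the 8080:80 port mapping are illustrative; adjust for your setup.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:3.3.7 \
    --model-id meta-llama/Llama-3.1-8B-Instruct
```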

Release History

v3.3.7 (1 fix, 1 feature)
Dec 19, 2025

This release adds support for limiting the image fetch size and fixes an issue with automatic device-count computation. The project is also entering maintenance mode.

v3.3.6 (2 fixes)
Sep 17, 2025

This patch release focuses primarily on bug fixes, including a flashinfer masking issue and the removal of Azure references, alongside minor documentation and code cleanup.

v3.3.5 (breaking; 5 fixes, 8 features)
Sep 2, 2025

This release migrates to Pydantic V2 and brings hardware acceleration updates, including XPU LoRA support and various Gaudi optimizations for models like Gemma 3 and Deepseek v2. It also bumps core dependencies such as transformers and huggingface_hub.

v3.3.4 (1 fix, 2 features)
Jun 19, 2025

This release introduces initial support for Gemma 3 models on Gaudi and fixes a bug related to Neuron models exported with batch_size 1.

v3.3.3 (4 fixes, 1 feature)
Jun 18, 2025

This release focuses on updating the Neuron backend, including bumping the SDK version and adding support for the Qwen3_moe model on Gaudi. Several Gaudi-specific fixes and performance optimizations were also implemented.

v3.3.2 (3 fixes, 2 features)
May 30, 2025

This release focuses on Gaudi improvements, including OOM fixes and new hardware support, alongside an upgrade to vllm extension operations and the addition of the Qwen3 model.

v3.3.1 (2 fixes, 2 features)
May 22, 2025

This release updates TGI to Torch 2.7 and CUDA 12.8, incorporating HPU warmup logic refinements, kernel updates, and bug fixes.

v3.3.0 (15 fixes, 4 features)
May 9, 2025

This release introduces prefill chunking for VLMs and includes numerous stability fixes across various hardware backends like Gaudi and L4. Key updates involve dependency bumps and specific model support enhancements.

v3.2.3 (1 fix, 1 feature)
Apr 8, 2025

This release patches Llama 4 support and updates underlying dependencies such as ROCm and transformers. It also fixes a compute-type typo.

v3.2.2 (1 fix, 3 features)
Apr 6, 2025

This release introduces support for the Llama 4 model, adds a configurable termination timeout, and includes several fixes, notably for Gaudi hardware.

v3.2.1 (2 fixes, 2 features)
Mar 18, 2025

This release introduces support for the Gemma 3 text model type and the official release of the Gaudi Backend. It also includes necessary updates for Triton kernel compilation and various bug fixes.

v3.2.0 (breaking; 6 fixes, 3 features)
Mar 12, 2025

This release introduces support for the Gemma 3 model and brings significant updates to tool calling behavior, aligning it more closely with OpenAI's specification, alongside various backend and model-specific bug fixes.
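Since this release aligns tool calling with OpenAI's specification, a request to TGI's OpenAI-compatible chat endpoint can carry standard `tools`/`tool_choice` fields. A minimal sketch of such a payload, assuming a locally served instance (the endpoint URL and the `get_weather` tool are illustrative, not from this changelog):

```python
import json

# Illustrative endpoint for a local TGI instance exposing the
# OpenAI-compatible /v1/chat/completions route.
TGI_ENDPOINT = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "tgi",  # TGI serves one model; the name is informational
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                # Hypothetical tool used only to illustrate the schema
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call a tool
}

# Serialize the request body as it would be POSTed to TGI_ENDPOINT.
body = json.dumps(payload)
print(body[:40])
```

The payload shape mirrors OpenAI's chat-completions schema, which is the point of the v3.2.0 alignment: existing OpenAI client code can target a TGI endpoint with little change.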

v3.1.1 (14 fixes, 9 features)
Mar 4, 2025

This release focuses on backend expansion, adding support for Llamacpp, Neuron, and Gaudi backends, alongside significant improvements to Qwen VL handling and template features. It also includes various stability fixes and dependency updates.

v3.1.0 (4 fixes, 3 features)
Jan 31, 2025

This release introduces full hardware support for Deepseek R1 on AMD and Nvidia, adds fp8 support for MoE models, and includes several stability fixes and dependency updates.

v3.0.2 (14 fixes, 11 features)
Jan 24, 2025

This release introduces a major new transformers backend that enables FlashAttention for models without native TGI support, and adds several new models including Cohere2 and OLMo variants. Numerous bug fixes target specific model issues, VLM handling, and hardware acceleration across the CUDA, ROCm, and XPU platforms.