Migrating to vLLM v0.12.0
vLLM v0.12.0 introduces six breaking changes. This guide details how to update your code.
Released: 12/3/2025
⚠️ Check Your Code
If you use any of these symbols, you need to read this guide:
GPUModelRunnerV2, ParallelConfig, CompilationConfig.use_inductor, SamplingParams, model.load_weights, AiterFlashAttentionBackend, FusedMoE, ToolServer
Breaking Changes
● Issue #1: PyTorch has been upgraded to 2.9.0, which requires a CUDA 12.9 environment.
● Issue #2: The 'num_lookahead_slots' parameter has been removed.
● Issue #3: The 'best_of' parameter has been removed (see the sketch after this list).
● Issue #4: LoRA extra vocab support has been removed.
● Issue #5: Mistral format auto-detection is now applied during model loading.
● Issue #6: Online quantization logic has moved into 'model.load_weights'.
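The removals in issues #2 and #3 affect request and engine arguments directly. Below is a minimal sketch, assuming the standard LLM/SamplingParams entry points; it uses `n` as the nearest replacement for the removed 'best_of' and simply drops 'num_lookahead_slots'. The model name is a placeholder, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Before v0.12.0 (no longer valid):
#   SamplingParams(n=1, best_of=4, ...)          # 'best_of' has been removed
#   LLM(model=..., num_lookahead_slots=8, ...)   # 'num_lookahead_slots' has been removed

# After: request the number of sequences you actually want with `n`.
sampling = SamplingParams(n=4, temperature=0.8, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
for request_output in llm.generate(["vLLM v0.12.0 migration"], sampling):
    for completion in request_output.outputs:
        print(completion.text)
```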
Migration Steps
1. Update the host environment to CUDA 12.9.
2. Remove 'best_of' and 'num_lookahead_slots' from API calls; both parameters are gone (see the sketch after the Breaking Changes list above).
3. Update LoRA configurations to remove extra vocab dependencies.
4. Transition away from the 'xformers' backend to a supported alternative such as FlashInfer or Triton (see the sketch after this list).
5. Update GGUF loading code to use the new 'repo_id:quant_type' syntax (also shown below).
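A hedged sketch of steps 4 and 5: it assumes the attention backend is selected via the VLLM_ATTENTION_BACKEND environment variable, and the GGUF repository and quantization names are hypothetical placeholders. Check the release notes for the exact backend identifiers your build supports.

```python
import os

# Step 4: pick a supported attention backend instead of the removed xformers path.
# The backend name is illustrative; your build may expose different identifiers.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Step 5: GGUF loading with the new 'repo_id:quant_type' syntax.
# Repo and quant type below are hypothetical placeholders.
llm = LLM(model="TheBloke/Llama-2-7B-GGUF:Q4_K_M")
print(llm.generate(["Hello"])[0].outputs[0].text)
```

Setting the environment variable before constructing the LLM ensures the engine picks up the backend choice when it initializes.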
Release Summary
vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
Need More Details?
View the full release notes and all changes for vLLM v0.12.0.