Migrating to vLLM v0.12.0
vLLM v0.12.0 introduces six breaking changes. This guide details how to update your code.
Released: 12/3/2025
⚠️ Check Your Code
If you use any of these symbols, you need to read this guide:
GPUModelRunnerV2, ParallelConfig, CompilationConfig.use_inductor, SamplingParams, model.load_weights, AiterFlashAttentionBackend, FusedMoE, ToolServer
Breaking Changes
● Issue #1: PyTorch has been upgraded to 2.9.0, which requires a CUDA 12.9 environment.
● Issue #2: The 'num_lookahead_slots' parameter has been removed.
● Issue #3: The 'best_of' parameter has been removed (see the sketch after this list).
● Issue #4: LoRA extra vocab support has been removed.
● Issue #5: Mistral format auto-detection is now applied during model loading.
● Issue #6: Online quantization logic has moved into 'model.load_weights'.
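The removals in issues #2 and #3 affect request and engine arguments directly. Below is a minimal sketch, assuming the standard LLM/SamplingParams entry points; it uses `n` as the nearest replacement for the removed 'best_of' and simply drops 'num_lookahead_slots'. The model name is a placeholder, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Before v0.12.0 (no longer valid):
#   SamplingParams(n=1, best_of=4, ...)          # 'best_of' has been removed
#   LLM(model=..., num_lookahead_slots=8, ...)   # 'num_lookahead_slots' has been removed

# After: request the number of sequences you actually want with `n`.
sampling = SamplingParams(n=4, temperature=0.8, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
for request_output in llm.generate(["vLLM v0.12.0 migration"], sampling):
    for completion in request_output.outputs:
        print(completion.text)
```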
Migration Steps
1. Update the host environment to CUDA 12.9.
2. Remove 'best_of' and 'num_lookahead_slots' from API calls; both parameters are gone (see the sketch after the Breaking Changes list above).
3. Update LoRA configurations to remove extra vocab dependencies.
4. Transition away from the 'xformers' backend to a supported alternative such as FlashInfer or Triton (see the sketch after this list).
5. Update GGUF loading code to use the new 'repo_id:quant_type' syntax (also shown below).
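A hedged sketch of steps 4 and 5: it assumes the attention backend is selected via the VLLM_ATTENTION_BACKEND environment variable, and the GGUF repository and quantization names are hypothetical placeholders. Check the release notes for the exact backend identifiers your build supports.

```python
import os

# Step 4: pick a supported attention backend instead of the removed xformers path.
# The backend name is illustrative; your build may expose different identifiers.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Step 5: GGUF loading with the new 'repo_id:quant_type' syntax.
# Repo and quant type below are hypothetical placeholders.
llm = LLM(model="TheBloke/Llama-2-7B-GGUF:Q4_K_M")
print(llm.generate(["Hello"])[0].outputs[0].text)
```

Setting the environment variable before constructing the LLM ensures the engine picks up the backend choice when it initializes.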
Release Summary
vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
Need More Details?
View the full release notes and all changes for vLLM v0.12.0.