
Migrating to vLLM v0.12.0

vLLM v0.12.0 introduces 6 breaking changes. This guide details how to update your code.

Released: 12/3/2025

6 breaking changes · 5 migration steps · 8 affected symbols

⚠️ Check Your Code

If you use any of these symbols, you need to read this guide:

GPUModelRunnerV2, ParallelConfig, CompilationConfig.use_inductor, SamplingParams, model.load_weights, AiterFlashAttentionBackend, FusedMoE, ToolServer

Breaking Changes

Issue #1

PyTorch has been upgraded to 2.9.0, which requires a CUDA 12.9 environment.
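To confirm the environment after upgrading, check the PyTorch build that vLLM pulls in (a quick sanity check; exact version strings depend on your install):

```python
import torch

# After upgrading to vLLM v0.12.0, the bundled PyTorch should report a
# 2.9.x build against CUDA 12.9.
print(torch.__version__)   # e.g. "2.9.0+cu129"
print(torch.version.cuda)  # e.g. "12.9"
```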

Issue #2

Removed 'num_lookahead_slots' parameter.
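If 'num_lookahead_slots' still appears in your engine setup (it was previously exposed as an engine argument), drop it; a minimal before/after sketch with a placeholder model name:

```python
from vllm import LLM

# Before (pre-v0.12.0): 'num_lookahead_slots' could be passed through the
# engine arguments, e.g.
#   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", num_lookahead_slots=4)

# After (v0.12.0): the parameter is gone; simply drop it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # model name is a placeholder
```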

Issue #3

Removed 'best_of' parameter.
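A common replacement for 'best_of' is to request several candidates with SamplingParams(n=...) and pick the best one client-side; a sketch under that assumption, with a placeholder model and prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Before: SamplingParams(best_of=4, ...) asked the engine to keep the best
# of four candidates. After: request n candidates and rank them yourself.
params = SamplingParams(n=4, temperature=0.8, max_tokens=64, logprobs=1)
outputs = llm.generate(["Write a haiku about GPUs."], params)

for request_output in outputs:
    # Each request returns all n completions; pick one client-side, here by
    # cumulative log probability (requesting logprobs above helps ensure
    # this field is populated).
    best = max(request_output.outputs, key=lambda c: c.cumulative_logprob)
    print(best.text)
```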

Issue #4

Removed LoRA extra vocab support.
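If your setup reserved extra vocabulary slots for LoRA adapters (exposed in earlier releases as the 'lora_extra_vocab_size' engine argument), remove that argument along with any adapters that add new tokens; a minimal sketch, assuming that argument name and a placeholder model:

```python
from vllm import LLM

# Before (earlier releases): extra vocabulary slots could be reserved for
# LoRA adapters, e.g.
#   llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
#             enable_lora=True, lora_extra_vocab_size=256)

# After (v0.12.0): extra vocab support is removed, so adapters that add new
# tokens are no longer supported and the argument should be dropped.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
```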

Issue #5

Mistral format auto-detection is now applied during model loading.
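Previously, Mistral-format checkpoints usually required explicit format flags; with auto-detection those hints should no longer be necessary. A sketch under that assumption, with an illustrative model:

```python
from vllm import LLM

# Before: Mistral-format checkpoints typically needed explicit format hints.
#   llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3",
#             tokenizer_mode="mistral",
#             config_format="mistral",
#             load_format="mistral")

# After (v0.12.0): the Mistral format is detected automatically at load time.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")  # illustrative model
```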

Issue #6

Online quantization logic moved to 'model.load_weights'.
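This mostly affects out-of-tree model integrations: quantization that used to run as a separate pass after loading now happens while weights stream through the model's load_weights hook. A rough, hypothetical sketch of that shape (the quantization helper and model class are illustrative, not vLLM APIs):

```python
from typing import Iterable, Tuple

import torch


def my_online_quantize(weight: torch.Tensor) -> torch.Tensor:
    """Placeholder for a real quantization routine (e.g. fp8/int8)."""
    return weight


class MyCustomModel(torch.nn.Module):
    """Illustrative out-of-tree model integration, not a real vLLM class."""

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
        params = dict(self.named_parameters())
        for name, loaded_weight in weights:
            if name not in params:
                continue
            # Assumption for illustration: quantization is now applied per
            # tensor here, inside load_weights, rather than in a separate
            # engine-level pass after loading completes.
            params[name].data.copy_(my_online_quantize(loaded_weight))
```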

Migration Steps

  1. Update the host environment to CUDA 12.9.
  2. Remove the 'best_of' and 'num_lookahead_slots' parameters from API calls; they no longer exist (see the examples under Breaking Changes above).
  3. Update LoRA configurations to remove extra-vocab dependencies.
  4. Transition away from the 'xformers' backend to supported alternatives such as FlashInfer or Triton (see the first sketch after this list).
  5. Update GGUF loading code to use the new 'repo_id:quant_type' syntax (see the second sketch after this list).
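For step 4, the attention backend is typically selected through the VLLM_ATTENTION_BACKEND environment variable; a minimal sketch, assuming the FlashInfer backend is installed and supported for your model:

```python
import os

from vllm import LLM

# Select a supported attention backend; "FLASHINFER" assumes the flashinfer
# package is installed, and a Triton-based backend is another option.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
```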
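For step 5, the new GGUF syntax combines the repository and quantization type in one string; the repository and quantization tag below are placeholders only:

```python
from vllm import LLM

# Hypothetical example of the 'repo_id:quant_type' form; check the model
# card for the quantization names actually published.
llm = LLM(model="TheBloke/SomeModel-GGUF:Q4_K_M")
```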

Release Summary

vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.

Need More Details?

View the full release notes and all changes for vLLM v0.12.0.
