
Migrating to llama.cpp b7739

Version b7739 introduces 2 breaking changes. This guide details how to update your code.

Released: 1/15/2026

At a glance: 2 breaking changes · 2 migration steps · 9 affected symbols

⚠️ Check Your Code

If you use any of these symbols, you need to read this guide:

`two_stage_warp_reduce`, `block_reduce`, softmax kernel, `group_norm_f32`, `rms_norm_f32`, `l2_norm_f32`, `norm_f32`, `RMS_NORM_BACK`, `block_reduce_method`

Breaking Changes

Issue #1

The CUDA function `two_stage_warp_reduce` has been renamed to `block_reduce`. Code using the old name must be updated.
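The rename is mechanical and can be scripted. A hedged sketch, demonstrated on a throwaway file (the path and file contents below are illustrative, not actual llama.cpp sources; in a real checkout you would point `find` at the CUDA source directory):

```shell
# Set up a demo file standing in for a CUDA source tree (hypothetical).
mkdir -p /tmp/demo_b7739
cat > /tmp/demo_b7739/norm.cu <<'EOF'
float sum = two_stage_warp_reduce<float>(val);
EOF

# Rename all usages of the old symbol in-place.
find /tmp/demo_b7739 -name '*.cu' -exec sed -i 's/two_stage_warp_reduce/block_reduce/g' {} +

cat /tmp/demo_b7739/norm.cu
# prints: float sum = block_reduce<float>(val);
```

A plain textual rename is safe here because the old symbol was removed rather than deprecated, so any leftover usage fails at compile time anyway.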

Issue #2

Shared memory (smem) handling was moved out of the `__device__` function `two_stage_warp_reduce` (now `block_reduce`) into the calling `__global__` kernel. This change was needed because the compiler/runtime was not freeing smem correctly, which caused failures when configuring shared-memory limits via `cudaFuncSetAttribute`.
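A hedged sketch of the new division of labor, with illustrative names and signatures (not the actual llama.cpp code): the `__global__` kernel now owns the shared buffer and passes it to `block_reduce`, rather than the helper declaring it internally.

```cuda
// Illustrative only; the real llama.cpp b7739 signatures may differ.
template <typename T>
__device__ T block_reduce(T val, T * smem) {
    // Stage 1: reduce within each warp using shuffles (no smem needed).
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    const int warp_id = threadIdx.x / warpSize;
    const int lane_id = threadIdx.x % warpSize;
    // Stage 2: warp leaders write partials to the caller-provided smem.
    if (lane_id == 0) smem[warp_id] = val;
    __syncthreads();
    // First warp reduces the per-warp partial sums.
    val = (threadIdx.x < blockDim.x / warpSize) ? smem[lane_id] : T(0);
    if (warp_id == 0) {
        for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
            val += __shfl_xor_sync(0xffffffff, val, offset);
        }
    }
    return val;
}

__global__ void norm_like_kernel(const float * x, float * dst, int n) {
    // smem now lives in the __global__ kernel, not inside block_reduce,
    // so cudaFuncSetAttribute-based smem sizing applies to the kernel.
    __shared__ float smem[32];
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) sum += x[i] * x[i];
    sum = block_reduce(sum, smem);
    // ... normalization using sum elided ...
}
```

If your code called the old helper directly, the practical consequence is that each calling kernel must now declare (or dynamically allocate) the shared buffer and pass it in.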

Migration Steps

  1. Rename usages of `two_stage_warp_reduce` to `block_reduce` in CUDA code.
  2. Review and update shared memory allocation/management logic around the new `block_reduce` function calls, ensuring smem is managed in the calling `__global__` kernel.

Release Summary

This release focuses heavily on refactoring CUDA kernels by extracting and renaming the warp reduction logic to `block_reduce`, improving smem handling, and integrating it across various normalization functions. It also fixes a potential build issue related to template instantiation and static assertions.

Need More Details?

View the full release notes and all changes for llama.cpp b7739.
