Migrating to llama.cpp b7739
Version b7739 introduces two breaking changes. This guide explains how to update your code.
Released: 1/15/2026
⚠️ Check Your Code
If you use any of these symbols, you need to read this guide:
- `two_stage_warp_reduce`
- `block_reduce`
- `softmax` kernel
- `group_norm_f32`
- `rms_norm_f32`
- `l2_norm_f32`
- `norm_f32`
- `RMS_NORM_BACK`
- `block_reduce_method`

Breaking Changes
● Issue #1
The CUDA function `two_stage_warp_reduce` has been renamed to `block_reduce`. Code using the old name must be updated.
● Issue #2
Shared memory (smem) handling was moved out of the `__device__` function `two_stage_warp_reduce` (now `block_reduce`) and into the calling `__global__` function. This change was necessary because the compiler/runtime was not freeing smem correctly, which led to failures when configuring kernels with `cudaFuncSetAttribute`.
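The pattern described above can be sketched roughly as follows. This is an illustrative example only, not the actual llama.cpp code: the `block_reduce` signature, the `sum_kernel` caller, and the buffer sizing are assumptions made for demonstration. The key point it shows is that the `__shared__` buffer is declared in the `__global__` kernel and passed down to the `__device__` helper, instead of being declared inside the helper.

```cuda
// Hypothetical sketch of the new smem ownership pattern (names and
// signatures are illustrative, not the exact llama.cpp b7739 API).
template <int block_size>
static __device__ float block_reduce(float v, float * smem) {
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;
    // stage 1: warp-level tree reduction via shuffles
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_xor_sync(0xffffffff, v, offset);
    }
    if (lane == 0) smem[warp] = v; // one partial result per warp
    __syncthreads();
    // stage 2: first warp reduces the per-warp partials
    v = (lane < block_size/32) ? smem[lane] : 0.0f;
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_xor_sync(0xffffffff, v, offset);
    }
    return v;
}

template <int block_size>
static __global__ void sum_kernel(const float * x, float * out, int n) {
    // smem is now owned by the __global__ kernel, not the __device__
    // helper, so attributes set with cudaFuncSetAttribute behave
    // predictably.
    __shared__ float smem[block_size/32];
    float v = 0.0f;
    for (int i = threadIdx.x; i < n; i += block_size) {
        v += x[i];
    }
    v = block_reduce<block_size>(v, smem);
    if (threadIdx.x == 0) {
        *out = v;
    }
}
```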
Migration Steps
1. Rename usages of `two_stage_warp_reduce` to `block_reduce` in CUDA code.
2. Review and update shared memory allocation/management logic around the new `block_reduce` calls, ensuring smem is allocated and managed in the calling `__global__` function.
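The two steps above amount to a rename plus a signature change at each call site, along the lines of the before/after sketch below. The exact `block_reduce` signature here is an assumption for illustration; check the actual b7739 source for the real parameter list.

```cuda
// Before (illustrative): the helper declared its own shared memory.
//
//     float s = two_stage_warp_reduce<block_size>(v);
//
// After (illustrative): rename the call and pass a caller-owned
// shared buffer declared in the __global__ function.
//
//     __shared__ float smem[block_size/32];
//     float s = block_reduce<block_size>(v, smem);
```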
Release Summary
This release focuses heavily on refactoring CUDA kernels by extracting and renaming the warp reduction logic to `block_reduce`, improving smem handling, and integrating it across various normalization functions. It also fixes a potential build issue related to template instantiation and static assertions.
Need More Details?
View the full release notes and all changes for llama.cpp b7739.