Unlocking SVM-Optimal MemOps

Unlock optimal memory performance on the SVM by combining libcall overrides with compiler optimizations


Introduction

We introduced libcalls in our Accelerating u128 Math with Libcalls and JIT Intrinsics article. In this piece, we explore another dimension of libcalls: using them to generate performant memory operations tailored specifically to the Solana Virtual Machine (SVM) - without requiring any upstream changes to the compiler.

This serves as a concrete example that the limitation does not lie in the upstream LLVM BPF backend. Instead, the libcall model, combined with compiler optimizations, gives developers the tools to produce bytecode-optimal programs.

In this article, we walk through the two layers of lowering memcmp in the compiler. We first show how libcalls can be overridden to precisely control memory operations at the bytecode level, and then how compiler optimizations take over to eliminate overhead and produce optimal code.

Recap: What are Libcalls?

When handling non-trivial low-level operations that cannot be directly lowered into machine instructions, the Rust compiler relies on a mechanism called library calls (libcalls). Libcalls allow compiler engineers to provide custom implementations for such functionality without needing to fork or modify the entire compiler toolchain.

A Minimal Example: 32-Byte Memory Comparison

The following program simulates a Solana entrypoint that reads a 32-byte Blake3 value directly from raw input (starting at offset 16) and compares it against an expected hash:

rust
pub struct Blake3Hash(pub [u8; 32]);

pub fn entrypoint(input: *mut u8) -> u64 {
    let pubkey: Blake3Hash = unsafe { Blake3Hash(*(input.add(16) as *const [u8; 32])) };
    if pubkey.0.ne(&Blake3Hash([3u8; 32]).0) {
        return 1;
    }
    0
}
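For completeness, here is a host-side driver for this sketch. It is not on-chain code; `Blake3Hash` is assumed to be a simple newtype over `[u8; 32]`, and the type is repeated so the snippet compiles standalone:

```rust
pub struct Blake3Hash(pub [u8; 32]);

pub fn entrypoint(input: *mut u8) -> u64 {
    let pubkey: Blake3Hash = unsafe { Blake3Hash(*(input.add(16) as *const [u8; 32])) };
    if pubkey.0.ne(&Blake3Hash([3u8; 32]).0) {
        return 1;
    }
    0
}

fn main() {
    // 16 bytes of header, then the 32-byte hash at offset 16.
    let mut buf = [0u8; 48];
    buf[16..48].fill(3);
    assert_eq!(entrypoint(buf.as_mut_ptr()), 0); // matching hash
    buf[16] = 0;
    assert_eq!(entrypoint(buf.as_mut_ptr()), 1); // mismatch
    println!("ok");
}
```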

From Rust to Libcall: How Memcmp Emerges

Compilers lower high-level programs in stages before emitting machine code. In Rust, this pipeline goes from HIR to MIR, then into LLVM IR, and finally into target-specific machine code.

To understand how memory comparisons are implemented, let’s trace how our example is lowered through these stages.

At the MIR level, the equality check on [u8; 32] is expanded into a series of inlined trait calls, eventually reaching core::array::equality:

rust
let mut _9: &Blake3Hash;
scope 3 (inlined array::equality::<impl PartialEq for [u8; 32]>::ne) {
    scope 4 (inlined <u8 as array::equality::SpecArrayEq<u8, 32>>::spec_ne) {
        let mut _10: bool;
        scope 5 (inlined <u8 as array::equality::SpecArrayEq<u8, 32>>::spec_eq) {
        }
    }
}

While this looks abstract, the key point is that the compiler recognizes this pattern as a fixed-size memory comparison. By the time we reach pre-link LLVM IR, this abstraction is lowered into a direct call to memcmp:

llvm
; <u8 as core::array::equality::SpecArrayEq<u8, 32>>::spec_eq
%1 = call i32 @memcmp(ptr %a, ptr %b, i64 32)
%2 = icmp eq i32 %1, 0

From Generic to Specialized: Overriding Memcmp

Without overriding memcmp, the compiler falls back to a default builtin implementation. This typically expands into a byte-by-byte comparison loop, which is correct but highly inefficient in the SVM execution model where every instruction contributes directly to cost.

To take control over this lowering, we can provide our own implementation of memcmp:

rust
// Note: returns only 0 or 1; this is sufficient when callers only test for
// equality, as in our example. A full memcmp would return the sign of the
// first differing byte.
pub unsafe extern "C" fn memcmp(a: *const u8, b: *const u8, n: usize) -> i32 {
    let mut i = 0usize;
    // compare 8 bytes at a time
    while i + 8 <= n {
        let wa = core::ptr::read_unaligned(a.add(i) as *const u64);
        let wb = core::ptr::read_unaligned(b.add(i) as *const u64);
        if wa != wb {
            return 1;
        }
        i += 8;
    }
    // handle remaining bytes
    while i < n {
        if *a.add(i) != *b.add(i) {
            return 1;
        }
        i += 1;
    }
    0
}

This implementation performs wide (64-bit) comparisons whenever possible, significantly reducing the number of load and branch instructions compared to a naive byte-by-byte loop.
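As a quick host-side sanity check (a standalone sketch, not the on-chain override itself, hence the distinct name `memcmp_words`), the word-wise loop agrees with byte-wise comparison across both the 8-byte body and the byte tail:

```rust
// Word-wise comparison, mirroring the custom memcmp above.
unsafe fn memcmp_words(a: *const u8, b: *const u8, n: usize) -> i32 {
    let mut i = 0usize;
    while i + 8 <= n {
        let wa = core::ptr::read_unaligned(a.add(i) as *const u64);
        let wb = core::ptr::read_unaligned(b.add(i) as *const u64);
        if wa != wb {
            return 1;
        }
        i += 8;
    }
    while i < n {
        if *a.add(i) != *b.add(i) {
            return 1;
        }
        i += 1;
    }
    0
}

fn main() {
    let a = [3u8; 32];
    let mut b = [3u8; 32];
    assert_eq!(unsafe { memcmp_words(a.as_ptr(), b.as_ptr(), 32) }, 0);
    b[31] = 4; // mismatch in the final 8-byte word
    assert_eq!(unsafe { memcmp_words(a.as_ptr(), b.as_ptr(), 32) }, 1);
    // A non-multiple-of-8 size exercises the byte tail:
    let c = [3u8; 13];
    let mut d = [3u8; 13];
    d[12] = 0;
    assert_eq!(unsafe { memcmp_words(c.as_ptr(), d.as_ptr(), 13) }, 1);
    println!("ok");
}
```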

Crucially, this @memcmp resolves to a local symbol rather than an external libcall symbol. Because libcalls are typically emitted as weak symbols, the linker will prefer this definition when present, allowing us to override the default implementation without modifying the compiler.

Compiler Optimizations: Where the Magic Happens

In our example, the comparison size is a constant (n = 32) known at compile time. This allows the optimizer to inline our custom memcmp, propagate the constant, and simplify the control flow before code generation.

As a result, what started as a generic loop-based implementation becomes a fully specialized sequence tailored to a fixed-size comparison.

Concretely, the compiler performs a series of transformations:

  • Inlines the memcmp implementation into the caller
  • Propagates the constant size (n = 32)
  • Unrolls or eliminates loop structures
  • Simplifies control flow and removes dead branches

By the time lowering completes, only the optimal sequence of load/compare instructions remains, without any of the original abstraction overhead.

In the following sections, we examine each of these optimizations in detail using LLVM IR snippets.

1. Initial libcall (memcmp)

At this stage, the comparison is still expressed as a libcall, with intermediate stack copies. Everything is still generic: memory is copied, and the comparison is delegated to memcmp:

llvm
%_5 = getelementptr inbounds nuw i8, ptr %input, i64 16
call void @llvm.memcpy.p0.p0.i64(ptr align 1 %_3, ptr align 1 %_5, i64 32, i1 false)
call void @llvm.memcpy.p0.p0.i64(ptr align 1 %pubkey, ptr align 1 %_3, i64 32, i1 false)
%1 = call i32 @memcmp(ptr %pubkey, ptr @alloc_5bdec7133f381be14aeeec5080d97c71, i64 32)
%2 = icmp eq i32 %1, 0

2. Inline local memcmp

The compiler inlines our custom memcmp into the caller in place of the default builtin:

llvm
bb3.i:
    %_10.i = getelementptr inbounds nuw i8, ptr %pubkey, i64 %i.sroa.0.0.i
    %_27.sroa.0.0.copyload.i = load i64, ptr %_10.i, align 1
    %_13.i = getelementptr inbounds nuw i8, ptr @alloc_5bdec7133f381be14aeeec5080d97c71, i64 %i.sroa.0.0.i
    %_33.sroa.0.0.copyload.i = load i64, ptr %_13.i, align 1
    %_15.not.i = icmp eq i64 %_27.sroa.0.0.copyload.i, %_33.sroa.0.0.copyload.i
    br i1 %_15.not.i, label %bb1.i, label %memcmp.exit

3. Constant propagation (n = 32)

The loop bounds become constant, so the compiler now knows exactly how many iterations are required:

llvm
bb1.i:
    %i.sroa.0.0.i = phi i64 [ 0, %start ], [ %_8.0.i, %bb3.i ]
    %_8.0.i = add i64 %i.sroa.0.0.i, 8
    %_5.not.i = icmp ugt i64 %_8.0.i, 32
    br i1 %_5.not.i, label %bb8.preheader.i, label %bb3.i
bb8.preheader.i:
    %_1713.i = icmp ult i64 %i.sroa.0.0.i, 32

4. Constant folding of RHS

The expected value [3u8; 32] is folded into a constant, replacing repeated memory loads with a single immediate value (0x0303030303030303, i.e. 217020518514230019):

llvm
bb3.i:
    %_8.0.i = add nuw nsw i64 %i.sroa.0.0.i, 8
    %_10.i = getelementptr inbounds nuw i8, ptr %pubkey, i64 %i.sroa.0.0.i
    %_27.sroa.0.0.copyload.i = load i64, ptr %_10.i, align 1
    %_15.not.i = icmp eq i64 %_27.sroa.0.0.copyload.i, 217020518514230019
    br i1 %_15.not.i, label %bb1.i, label %memcmp.exit
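The magic constant in the icmp is easy to verify: it is simply the eight expected bytes interpreted as a little-endian u64. A quick host-side check:

```rust
fn main() {
    // Eight 0x03 bytes, read as a little-endian u64, give exactly the
    // immediate the optimizer folded into the comparison.
    let word = u64::from_le_bytes([3u8; 8]);
    assert_eq!(word, 0x0303030303030303);
    assert_eq!(word, 217020518514230019);
    println!("{word:#x}");
}
```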

5. Loop unrolling and tail elimination

With fixed bounds, the loop is fully unrolled and the byte tail is eliminated:

llvm
bb3.i:
    %_27.sroa.0.0.copyload.i = load i64, ptr %pubkey, align 1
    %_15.not.i = icmp eq i64 %_27.sroa.0.0.copyload.i, 217020518514230019
    br i1 %_15.not.i, label %bb1.i.1, label %memcmp.exit
bb3.i.1:
    %_10.i.1 = getelementptr inbounds nuw i8, ptr %pubkey, i64 8
    %_27.sroa.0.0.copyload.i.1 = load i64, ptr %_10.i.1, align 1
    %_15.not.i.1 = icmp eq i64 %_27.sroa.0.0.copyload.i.1, 217020518514230019
bb1.i.4:
    br i1 true, label %bb8.preheader.i, label %bb3.i.4

By combining the libcall override with compiler optimizations, what remains in the final bytecode is only the essential sequence of loads and comparisons, nothing more.

asm
entrypoint:
  mov64 r0, 1
  ldxdw r3, [r1+16]
  lddw r2, 217020518514230019
  jne r3, r2, jmp_0060
  ldxdw r3, [r1+24]
  jne r3, r2, jmp_0060
  ldxdw r3, [r1+32]
  jne r3, r2, jmp_0060
  ldxdw r1, [r1+40]
  jne r1, r2, jmp_0060
  mov64 r0, 0

Misaligned Memory Access: No Longer a Limitation in Upstream

One of the biggest historical limitations of the upstream BPF backend was the lack of support for misaligned memory access. Even with custom libcalls, any misaligned load or store would be conservatively lowered into byte-by-byte operations by the backend.

That is no longer the case.

With LLVM 22, support for misaligned memory access has been upstreamed behind an opt-in target feature (+allows-misaligned-mem-access). When enabled, the compiler can emit efficient wide loads and stores directly, even for unaligned addresses.

Without misaligned access support, the generated code reconstructs values byte by byte:

asm
// Build a 16-bit little-endian chunk from bytes at offsets 24..25
ldxb r3, [r1+25]  // Load high byte
lsh64 r3, 8
ldxb r2, [r1+24]  // Load low byte
or64 r3, r2

// Same pattern repeats for the remaining bytes...

This results in multiple instructions per value, significantly increasing instruction count.
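The reconstruction pattern maps to a plain shift-and-or. A host-side sketch (the byte values are illustrative, not taken from the example program):

```rust
fn main() {
    // Two bytes at consecutive offsets, as in the asm above.
    let bytes = [0xAAu8, 0xBB];
    let lo = bytes[0] as u64; // ldxb (low byte)
    let hi = bytes[1] as u64; // ldxb (high byte)
    let chunk = (hi << 8) | lo; // lsh64 + or64
    // Equivalent to a single little-endian 16-bit load:
    assert_eq!(chunk, u16::from_le_bytes(bytes) as u64);
    assert_eq!(chunk, 0xBBAA);
    println!("{chunk:#x}");
}
```

For a 64-bit value this pattern costs roughly four instructions per byte pair, which is what makes the single `ldxdw` below such a large win.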

With misaligned access enabled, the compiler emits direct 64-bit loads:

asm
// Direct 64-bit loads
ldxdw r2, [r1+40]
stxdw [r10-8], r2
ldxdw r2, [r1+32]
stxdw [r10-16], r2

This replaces several instructions with a single load, drastically simplifying the generated code.

In this example, the upstream compiler with misaligned access enabled produces 12 CU, compared to 19 CU from the current Solana toolchain with a hardcoded memcmp implementation.

Threshold-Based Lowering: Best of Both Worlds

Can we push this further?

Yes.

The Solana toolchain already applies a heuristic: for larger inputs, it switches to the sol_memcmp_ syscall once the cost of inline load/store pairs exceeds the syscall cost.

We can implement the same strategy upstream, directly in our custom memcmp.

rust
pub unsafe extern "C" fn memcmp(a: *const u8, b: *const u8, n: usize) -> i32 {
    if n > 64 {
        // Delegate large inputs to the syscall, which writes its result
        // through an out-pointer.
        let mut result: i32 = 0;
        sol_memcmp_(a, b, n as u64, &mut result as *mut i32);
        result
    } else {
        // same optimized inline implementation as before
    }
}

At first glance, this introduces a runtime branch. But in our case, the size (n) is a compile-time constant.

This means the compiler can:

  • evaluate the condition at compile time
  • eliminate the dead branch entirely
  • inline only the optimal path into the final program

As a result, the generated bytecode contains only the chosen strategy, with zero overhead from the unused path, even though both paths exist in the source code.
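The dead-branch elimination can be sketched with a const size parameter. This is an illustrative host-side model, not the on-chain override: the syscall arm is stubbed out, and the threshold of 64 is taken from the snippet above:

```rust
// With a compile-time size N, the threshold check is a constant condition,
// so the optimizer keeps only one arm in the final code.
fn compare<const N: usize>(a: &[u8; N], b: &[u8; N]) -> i32 {
    if N > 64 {
        // On-chain this arm would call sol_memcmp_; stubbed for the host.
        if a.as_slice() == b.as_slice() { 0 } else { 1 }
    } else {
        // Inline path, as in the small-size implementation.
        if a == b { 0 } else { 1 }
    }
}

fn main() {
    assert_eq!(compare(&[3u8; 32], &[3u8; 32]), 0);   // inline arm (N = 32)
    assert_eq!(compare(&[3u8; 128], &[0u8; 128]), 1); // "syscall" arm (N = 128)
    println!("ok");
}
```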

For our example (n = 32), the inline path is selected. For larger inputs, only the syscall path remains:

asm
entrypoint:

 // push input data to stack and point r1 to input data on stack

 lddw r2, data_0000
 mov64 r3, 128
 call sol_memcmp_
 mov64 r1, r0

Conclusion

The takeaway is simple: by combining libcall overrides with compiler optimizations, we can precisely control how high-level abstractions are lowered, while still relying on the optimizer to eliminate overhead and specialize the final code. With recent upstream improvements such as misaligned memory access support, the LLVM BPF backend, when properly configured, is already capable of generating more efficient code than the current Solana toolchain.

© 2026 Blueshift Labs Limited