Wasm SIMD: a primitive JavaScript doesn’t have

WebAssembly’s fixed-width SIMD proposal exposes 128-bit vector registers and lane-wise arithmetic intrinsics (f32x4_mul, i32x4_add, v128_load, …). They’ve been baseline in every major browser since 2023. From Rust we get them as plain functions in std::arch::wasm32.

JavaScript has no equivalent. The original SIMD.js proposal was withdrawn in favour of “use Wasm SIMD instead”.

For data-parallel numeric loops, that makes Wasm SIMD the route to explicit vector instructions in the browser, and it can be critical to getting maximum performance.

The workload

Per call:

  • take two Float32Arrays of length N
  • return sum_{i=0..N-1} a[i] * b[i]

To keep the chart honest, all three variants use the same I/O pattern:

  • The JavaScript variant’s arrays live on the JS heap and are filled in place each round (no per-round allocation churn).
  • Both Wasm variants pre-allocate two Vec<f32> of length N in their respective module’s linear memory and expose them as Float32Array views. This emulates the pre-allocated buffer pattern.
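
The Rust listings below read from DOT_A and DOT_B without showing how they are declared. A minimal sketch of what such pre-allocated buffers could look like follows. The declarations and the helper names dot_buffers_init, dot_a_ptr and dot_b_ptr are assumptions for illustration, not taken from the original module, and the #[wasm_bindgen] attributes are left off so the sketch also compiles on a native target.

```rust
use std::cell::RefCell;

thread_local! {
    // Assumed declarations: the source uses DOT_A / DOT_B but does not show them.
    static DOT_A: RefCell<Vec<f32>> = RefCell::new(Vec::new());
    static DOT_B: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

/// Allocate both buffers once, up front, so the timed loop never allocates.
/// (Hypothetical helper; in a Wasm build it would carry `#[wasm_bindgen]`.)
pub fn dot_buffers_init(n: usize) {
    DOT_A.with(|a| *a.borrow_mut() = vec![0.0; n]);
    DOT_B.with(|b| *b.borrow_mut() = vec![0.0; n]);
}

/// Address of the first element of DOT_A. On the JS side this offset could be
/// wrapped as `new Float32Array(memory.buffer, ptr, n)` to get a view.
pub fn dot_a_ptr() -> *mut f32 {
    DOT_A.with(|a| a.borrow_mut().as_mut_ptr())
}

/// Address of the first element of DOT_B (same idea as `dot_a_ptr`).
pub fn dot_b_ptr() -> *mut f32 {
    DOT_B.with(|b| b.borrow_mut().as_mut_ptr())
}
```

One caveat with this pattern: if the module’s linear memory grows, existing Float32Array views become detached, so views should be re-created after anything that might allocate.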

So the duration on the chart is the cost of the loop itself, not bridge or allocation overhead.

JavaScript variant

function jsDotProduct(a: Float32Array, b: Float32Array): number {
    let acc = 0;
    for (let i = 0; i < a.length; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

Rust scalar variant (compiled without +simd128)

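use wasm_bindgen::prelude::*;

// DOT_A and DOT_B are the thread-local, pre-allocated Vec<f32> buffers
// described above (their declarations are not shown here).
#[wasm_bindgen]
pub fn dot_product_scalar(n: u32) -> f32 {
    let n = n as usize;
    DOT_A.with(|a| {
        DOT_B.with(|b| {
            let a = a.borrow();
            let b = b.borrow();
            // Slicing up front panics if n exceeds the buffer length and
            // lets the compiler drop the per-iteration bounds checks.
            let a = &a[..n];
            let b = &b[..n];
            let mut acc = 0.0_f32;
            for i in 0..n {
                acc += a[i] * b[i];
            }
            acc
        })
    })
}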

Rust SIMD variant (compiled with +simd128)

The kernel loads 16 bytes at a time with v128_load, multiplies lane-wise with f32x4_mul, accumulates into a v128 running sum, horizontally reduces once at the end, and handles the leftover (fewer than 4) elements with a scalar tail.

// Only compiled when the simd128 target feature is enabled.
#[cfg(target_feature = "simd128")]
#[wasm_bindgen]
pub fn dot_product_simd(n: u32) -> f32 {
    use std::arch::wasm32::{
        f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat, v128_load,
    };

    let n = n as usize;
    DOT_A.with(|a| {
        DOT_B.with(|b| {
            let a = a.borrow();
            let b = b.borrow();
            let a = &a[..n];
            let b = &b[..n];

            // Four independent partial sums, one per lane.
            let mut acc = f32x4_splat(0.0);
            let chunks = n / 4;
            // SAFETY: `a` and `b` are `&[f32]` of length `n`, and
            // `chunks * 4 <= n`, so every 16-byte load stays in bounds.
            // `v128_load` has no alignment requirement. The scalar tail
            // covers the remainder.
            unsafe {
                for i in 0..chunks {
                    let va = v128_load(a.as_ptr().add(i * 4) as *const _);
                    let vb = v128_load(b.as_ptr().add(i * 4) as *const _);
                    acc = f32x4_add(acc, f32x4_mul(va, vb));
                }
            }

            // Horizontal reduce: fold the four lanes into one scalar.
            let mut sum = f32x4_extract_lane::<0>(acc)
                + f32x4_extract_lane::<1>(acc)
                + f32x4_extract_lane::<2>(acc)
                + f32x4_extract_lane::<3>(acc);
            // Scalar tail for the last n % 4 elements.
            for i in (chunks * 4)..n {
                sum += a[i] * b[i];
            }
            sum
        })
    })
}
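
The chunk, accumulate, reduce, tail structure can also be sketched in portable Rust, which is handy for checking the logic on a native target where std::arch::wasm32 is unavailable. This is an illustrative stand-in, not the benchmarked code: a [f32; 4] plays the role of the v128 accumulator, and the names dot_product_lanes and dot_product_sequential are hypothetical.

```rust
/// Portable emulation of the SIMD kernel's structure.
/// `acc` holds four per-lane partial sums, like the v128 accumulator.
pub fn dot_product_lanes(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let mut acc = [0.0_f32; 4];
    let mut a_chunks = a.chunks_exact(4);
    let mut b_chunks = b.chunks_exact(4);
    // Main loop: one "vector" multiply-add per group of four elements.
    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        for lane in 0..4 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }

    // Horizontal reduce, in the same lane order as the extract_lane chain.
    let mut sum = acc[0] + acc[1] + acc[2] + acc[3];
    // Scalar tail for the last n % 4 elements.
    for (x, y) in a_chunks.remainder().iter().zip(b_chunks.remainder()) {
        sum += x * y;
    }
    sum
}

/// Plain left-to-right accumulation, as in the scalar variants.
pub fn dot_product_sequential(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

The lane-wise version adds the products in a different order from the sequential loop. Because floating-point addition is not associative, the two results can differ in the last bits for general inputs, so comparisons between variants are better done with a tolerance than with exact equality.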

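The chart

[Interactive chart: duration per call against N, one line each for JavaScript, Wasm scalar, and Wasm SIMD.]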

What you should see

Three lines, all linear in N:

  • JavaScript and Wasm scalar sit close together. JS performs the same scalar multiply-accumulate as the Wasm scalar version.
  • Wasm SIMD drops to roughly a quarter of either scalar line. The ceiling is 4x because each f32x4_mul performs four f32 multiplies at once.
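
The “roughly a quarter” figure follows from counting multiply instructions. A back-of-the-envelope sketch, with multiply_counts as a hypothetical helper; it counts instructions only and says nothing about loads, the horizontal reduce, or memory bandwidth, all of which affect measured time.

```rust
/// Multiply instructions issued for an input of length `n`:
/// (scalar loop, SIMD loop including its scalar tail).
pub fn multiply_counts(n: usize) -> (usize, usize) {
    let vector_muls = n / 4; // one f32x4_mul per full group of four
    let tail_muls = n % 4; // leftover elements are multiplied one by one
    (n, vector_muls + tail_muls)
}
```

For n = 1000 this gives 1000 scalar multiplies against 250 vector multiplies. For n = 1003 it gives 1003 against 253, so the ratio approaches 4 as n grows but never exceeds it.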

The interesting takeaway isn’t the size of the speedup. It’s the shape: for this kind of straight-line numeric loop, just moving from JavaScript to scalar Wasm doesn’t buy you much. The win comes from Wasm SIMD, which is something JavaScript can’t express at all.