Wasm threading: parallelism without paying for postMessage

The saving grace of low-tier devices is that they still have multiple cores. Wasm takes better advantage of that than JavaScript does.

In Rust + Wasm with wasm-bindgen-rayon, every thread is a Web Worker spawned over the same WebAssembly.Memory. A &mut [f32] handed to Rayon's par_iter_mut is a pointer into shared linear memory; every worker thread can read and write it directly. There is no copy. The standard library's std::sync::Mutex, RwLock, mpsc channels, and atomics all just work.

In JavaScript, the equivalent fan-out is postMessage to a pool of Web Workers. The benchmarks below show two flavours of it: structured cloning, and transferables, which avoid copying as far as possible.

The workload

The lightest meaningful parallel map: SAXPY, out[i] = a * x[i] + y[i]. Each element costs one f32 multiply and one f32 add. The arithmetic is deliberately cheap, so the benchmark measures messaging overhead rather than computation speed.

We test five variants:

1. JavaScript variant (single-threaded)

A baseline comparison that simply executes everything on one thread.

const n = x.length;
for (let i = 0; i < n; i++) {
  output[i] = a * x[i] + y[i];
}

2. JavaScript worker (structured clone)

A persistent pool of K workers, where K = navigator.hardwareConcurrency, capped at 8.

On each call:

  • The main thread posts {requestId, a, x, y} to each worker, where x and y are that worker's chunks (structured-clone alloc + memcpy).
  • The worker allocates an output Float32Array.
  • The worker performs the computation.
  • The worker posts the output back, which is cloned again.
  • The main thread glues the K output chunks together.

self.onmessage = (event: MessageEvent<SaxpyCloneRequest>) => {
    const { requestId, a, x, y } = event.data;
    const n = x.length;
    const output = new Float32Array(n);
    for (let i = 0; i < n; i++) {
        output[i] = a * x[i] + y[i];
    }
    const response: SaxpyCloneResponse = { requestId, output };
    self.postMessage(response);
};
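
The main-thread half of this variant is not shown above. A minimal sketch of the split and glue steps, assuming near-equal contiguous chunks: chunkRanges, saxpyChunk, glueChunks, and saxpyChunked are hypothetical names, and the worker round trip is replaced by an in-process call so the snippet runs on its own.

```typescript
// Hypothetical helpers for illustration; not taken from the benchmark source.

/** Split [0, n) into at most k contiguous ranges of near-equal size (k >= 1). */
function chunkRanges(n: number, k: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  const size = Math.ceil(n / k);
  for (let start = 0; start < n; start += size) {
    ranges.push([start, Math.min(start + size, n)]);
  }
  return ranges;
}

/** Stand-in for the worker's onmessage body, called in-process here. */
function saxpyChunk(a: number, x: Float32Array, y: Float32Array): Float32Array {
  const output = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    output[i] = a * x[i] + y[i];
  }
  return output;
}

/** Copy the per-worker outputs into one contiguous result. */
function glueChunks(chunks: Float32Array[], n: number): Float32Array {
  const out = new Float32Array(n);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

function saxpyChunked(a: number, x: Float32Array, y: Float32Array, k: number): Float32Array {
  const outputs = chunkRanges(x.length, k).map(([start, end]) =>
    // slice() copies only this chunk. Posting a subarray() view instead would
    // make structured clone serialize the whole underlying buffer.
    saxpyChunk(a, x.slice(start, end), y.slice(start, end)),
  );
  return glueChunks(outputs, x.length);
}
```

In the real pool, each saxpyChunk call would be a postMessage to a worker and a wait for its response. The slicing on the way in and the set() on the way out are copies this variant pays for on top of the structured clone itself.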

3. JavaScript worker (transferables)

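The same as the structured-clone version, but nothing is allocated per call. The x, y, and output buffers are transferred to the worker and transferred back with the response, so ownership moves instead of the bytes being copied.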

self.onmessage = (event: MessageEvent<SaxpyTransferRequest>) => {
    const { requestId, a, x, y, output } = event.data;
    const n = x.length;
    for (let i = 0; i < n; i++) {
        output[i] = a * x[i] + y[i];
    }
    const response: SaxpyTransferResponse = { requestId, x, y, output };
    self.postMessage(response, [x.buffer, y.buffer, output.buffer]);
};
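
Transferring a buffer detaches it on the sending side, which is why the worker above posts x and y back along with output. A small single-threaded sketch of that behaviour, using structuredClone with a transfer list as a stand-in for postMessage (transferDemo is a hypothetical name; it assumes a runtime that provides structuredClone, such as a current browser or Node 17+):

```typescript
// Single-threaded demonstration: structuredClone with a transfer list applies
// the same transfer semantics as postMessage(message, transferList).
function transferDemo(): { senderLength: number; receiverLength: number; third: number } {
  const x = new Float32Array([1, 2, 3, 4]);

  // Ownership of x's ArrayBuffer moves to the returned view rather than
  // the contents being copied.
  const moved = structuredClone(x, { transfer: [x.buffer] });

  return {
    senderLength: x.length,       // the sender's view is detached after the transfer
    receiverLength: moved.length, // the receiver sees all four elements
    third: moved[2],
  };
}
```

After the call, the original view has length 0 and only the returned view can read the data. A pool built on transferables therefore has to take its buffers back from each response before it can reuse them.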

4. Rust scalar (single-threaded)

The same as variant 1, but entirely in Wasm.

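use wasm_bindgen::prelude::*;

// SAXPY_X, SAXPY_Y and SAXPY_OUT are thread-local RefCell buffers defined
// elsewhere in the crate (not shown).
#[wasm_bindgen]
pub fn saxpy_scalar(n: u32, a: f32) {
    let n = n as usize;
    SAXPY_X.with(|x| {
        SAXPY_Y.with(|y| {
            SAXPY_OUT.with(|o| {
                let x = x.borrow();
                let y = y.borrow();
                let mut o = o.borrow_mut();
                // Slice all three buffers to n so the loop covers exactly n elements.
                let x = &x[..n];
                let y = &y[..n];
                let o = &mut o[..n];
                for i in 0..n {
                    o[i] = a * x[i] + y[i];
                }
            })
        })
    });
}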

5. Rust parallel (Rayon + Atomics)

Threads are built on SharedArrayBuffer and the Atomics API. Zero bytes cross any boundary: the buffers already live where every thread can access them.

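use rayon::prelude::*;
use wasm_bindgen::prelude::*;

// SAXPY_X, SAXPY_Y and SAXPY_OUT are thread-local RefCell buffers defined
// elsewhere in the crate (not shown).
#[wasm_bindgen]
pub fn saxpy_parallel(n: u32, a: f32) {
    let n = n as usize;
    SAXPY_X.with(|x| {
        SAXPY_Y.with(|y| {
            SAXPY_OUT.with(|o| {
                let x = x.borrow();
                let y = y.borrow();
                let mut o = o.borrow_mut();
                let x = &x[..n];
                let y = &y[..n];
                let o = &mut o[..n];
                o.par_iter_mut()
                    // Ask Rayon not to split below 8192 elements per job, so
                    // scheduling overhead stays small relative to the work.
                    .with_min_len(8192)
                    .zip(x.par_iter())
                    .zip(y.par_iter())
                    .for_each(|((out, &xv), &yv)| {
                        *out = a * xv + yv;
                    });
            })
        })
    });
}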

The chart

Analysis

Wasm threading performs best because it eliminates message overhead entirely. Rayon dispatches batches of work to the available threads, which all operate over the same memory.

What about SharedArrayBuffer + Atomics in JavaScript?

Yes, you can have shared memory in pure JavaScript. The cost is that you stop writing JavaScript and start writing a byte-level protocol.
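
To give a sense of what that protocol looks like, here is a minimal sketch under stated assumptions. The layout (one Int32 completion counter followed by x, y, and out) and the names SharedSaxpy, createShared, and runChunk are invented for illustration. The chunks run sequentially on one thread here; in a real pool each worker would receive the same SharedArrayBuffer via postMessage, build its own views over it, and call runChunk on its range.

```typescript
// Hypothetical layout, invented for illustration:
//   Int32 header : [0] = number of chunks finished
//   Float32 body : x[0..n) | y[0..n) | out[0..n)
const HEADER_WORDS = 1;
const BYTES = 4; // both Int32 and Float32 elements are 4 bytes wide

interface SharedSaxpy {
  header: Int32Array;
  x: Float32Array;
  y: Float32Array;
  out: Float32Array;
}

/** Allocate one shared buffer and carve typed views out of it by byte offset. */
function createShared(n: number): SharedSaxpy {
  const sab = new SharedArrayBuffer((HEADER_WORDS + 3 * n) * BYTES);
  const body = HEADER_WORDS * BYTES;
  return {
    header: new Int32Array(sab, 0, HEADER_WORDS),
    x: new Float32Array(sab, body, n),
    y: new Float32Array(sab, body + n * BYTES, n),
    out: new Float32Array(sab, body + 2 * n * BYTES, n),
  };
}

/** What each worker would run over its own [start, end) range. */
function runChunk(s: SharedSaxpy, a: number, start: number, end: number): void {
  for (let i = start; i < end; i++) {
    s.out[i] = a * s.x[i] + s.y[i];
  }
  Atomics.add(s.header, 0, 1);  // publish "one more chunk done"
  Atomics.notify(s.header, 0);  // wake any thread waiting on the counter
}
```

Every offset, counter, and wake-up here is something the Rust version gets from Rayon and the standard library. Note also that browsers only expose SharedArrayBuffer to cross-origin isolated pages.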