Why WebAssembly in the browser?
Although JavaScript can be written to be extremely fast, it’s non-trivial to squeeze performance out of it. Often it requires writing the JavaScript like C code, and you still need to be acutely aware of the performance cliffs that exist in the underlying JavaScript engines.
This book comes with a built-in benchmark runner so we can test directly in your browser. Because it
runs on your specific hardware and browser engine, your results will be unique. If the data looks
noisy, hit the ↺ Restart button.
To test out the benchmark system in pure JavaScript, let’s explicitly measure the performance degradation of function deoptimisation.
Although JavaScript is dynamically typed, object literals are still assigned a hidden class. So,
declaring {a: 1, b: 2} in JavaScript gets a different hidden class from the object {b: 2, a: 1}
even though the objects are otherwise identical. If a function is called with different hidden class
arguments, it can deoptimise and become much slower to call.
Given this trivial function that sums some fields on a JavaScript object:
function sum_fields(obj) {
  return obj.a + obj.b + obj.c + obj.d + obj.e;
}
We can benchmark and compare the speed of the function based on simply what input objects are generated for the benchmark.
The monomorphic benchmark creates input data using a single factory. With one shape and one hidden class, the function is expected to stay on the fast path:
const objectWithNumbersFactory = (a: number, b: number, c: number, d: number, e: number) => ({ a, b, c, d, e });
The megamorphic benchmark generates input data by randomly choosing from one of eight factories. Each factory generates the object literal with its fields in a different order, causing the function to deoptimise:
const objectWithNumbersFactories: Array<(a: number, b: number, c: number, d: number, e: number) => ObjectWithNumbers> = [
  (a, b, c, d, e) => ({ a, b, c, d, e }),
  (a, b, c, d, e) => ({ b, a, c, d, e }),
  (a, b, c, d, e) => ({ c, b, a, d, e }),
  (a, b, c, d, e) => ({ d, c, b, a, e }),
  (a, b, c, d, e) => ({ e, d, c, b, a }),
  (a, b, c, d, e) => ({ a, c, e, b, d }),
  (a, b, c, d, e) => ({ b, d, a, e, c }),
  (a, b, c, d, e) => ({ e, c, a, d, b }),
];
The graph below is calculated on your machine so may show slightly different results on each try.
The graph above shows that the inputs into otherwise identical functions can have a huge impact on the performance of the function. In the worst case the megamorphic benchmark can be 10 times slower. In fact, this kind of optimisation (keeping object shapes consistent) was specifically used to improve the performance of the TypeScript compiler.
Unlike JavaScript, which contains these complex runtime behaviors and heuristics, WebAssembly is statically typed and compiled ahead of time. This allows WebAssembly to achieve its design goal of deterministic high performance.
WebAssembly to JavaScript Bridge Myths
WebAssembly is designed to interoperate with the existing web ecosystem, but also embed in external systems outside the browser. Thus, WebAssembly does not natively operate directly on JavaScript data structures but instead operates on numbers passed into Wasm functions, or on its shared linear memory.
This means that JavaScript data that needs to be inspected by WebAssembly must be first copied into the linear memory. This fundamental truth has created some myths:
- Calling from Wasm to JavaScript and vice-versa is expensive.
- Strings are expensive to pass across the Wasm bridge and ruin performance.
- Large complex objects are tricky to work with across the Wasm bridge.
The copy cost itself is unavoidable whenever WebAssembly needs to inspect JavaScript data. The real questions are how expensive it is, and whether it can be reduced or mitigated.
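As a rough mental model of what "copying into linear memory" means, the sketch below treats linear memory as a plain byte array in native Rust. The `LinearMemory` type and its `copy_in` and `count_byte` methods are hypothetical names invented for this illustration; they are not part of wasm-bindgen or any WebAssembly API, and a real module would use its allocator rather than appending to a `Vec`.

```rust
/// Toy stand-in for Wasm linear memory: a flat, growable byte array.
struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    fn new() -> Self {
        Self { bytes: Vec::new() }
    }

    /// Host side: copy `data` in and return its (offset, length) pair.
    /// This copy is the "tax" paid before the guest can inspect the data.
    fn copy_in(&mut self, data: &[u8]) -> (u32, u32) {
        let offset = self.bytes.len() as u32;
        self.bytes.extend_from_slice(data);
        (offset, data.len() as u32)
    }

    /// Guest side: the function only ever receives numbers, here an
    /// offset and a length, and reads the bytes they describe.
    fn count_byte(&self, offset: u32, len: u32, needle: u8) -> usize {
        let start = offset as usize;
        let end = start + len as usize;
        self.bytes[start..end].iter().filter(|&&b| b == needle).count()
    }
}
```

The point of the model is that the guest function signature contains only numbers; any richer value has to be written into the byte array first.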
Myth: Calling from JavaScript to WebAssembly and vice-versa is expensive
Every call to a WebAssembly function crosses the FFI boundary. Does that add up to measurable overhead?
We can compare an identity function implemented in JavaScript directly against a Rust identity
function.
The JavaScript identity never leaves JavaScript:
function identity(val: unknown): unknown {
  return val;
}
The WebAssembly identity crosses the bridge on every call:
#[wasm_bindgen]
pub fn identity(val: JsValue) -> JsValue {
    val
}
Both benchmarks receive a pre-generated string of length \(N\) and call their respective identity function once. Generation is unmeasured.
Results for this benchmark are extremely noisy because in both cases the computation is almost instant regardless of the length of the generated string being passed in.
Did you notice that we are passing a generated string through WebAssembly using a JsValue type? This type allows wasm-bindgen to pass a lightweight reference to the JavaScript string and entirely skip copying the JavaScript string into WebAssembly’s linear memory.
However, we don’t need many benchmarks to prove that WebAssembly functions are extremely optimised. All the way back in 2018, one year before WebAssembly was made the fourth language of the Web, calls between JavaScript and WebAssembly became fast.
More recently, JavaScript engines have continued to add optimisations for WebAssembly. For example, V8, the JavaScript engine running in Chrome, can speculatively inline all WebAssembly function call instructions, e.g. call, call_indirect, and call_ref.
Thus we can safely say, myth busted!
Myth: Wasm string overhead ruins performance
In the previous myth we showed that calling a WebAssembly function from
JavaScript is nearly free using an identity function. But, by using JsValue to pass the string
through Wasm we cannot inspect the contents of the string.
#[wasm_bindgen]
pub fn identity(val: JsValue) -> JsValue {
    val
}
JsValue is a lightweight handle to the JavaScript value, thus function execution time is constant
regardless of the string length.
#[wasm_bindgen]
pub fn string_identity(val: &str) -> String {
    val.to_owned()
}
This second Rust function also does no real work, but because it’s consuming its argument as a &str
and returning it as a String, it must copy the JavaScript string into Wasm and decode the Wasm string
back into JavaScript when the value is returned.
Unsurprisingly, the duration of the JsValue → JsValue identity remains flat regardless of string size input.
There is nothing to copy, so string length doesn’t matter.
The &str → String identity scales linearly with string length: every additional character costs
additional encode and decode work.
String copying across the WebAssembly bridge is not free: it is a memory copy proportional to the number of bytes. Additionally, JavaScript strings are UTF-16 whilst Rust strings are UTF-8, so the text encoding and decoding process must also perform this transcoding.
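To make the transcoding step concrete, here is a minimal native Rust sketch using only the standard library. It does not show what wasm-bindgen does internally; the helper names `utf16_to_utf8` and `is_ascii_utf16` are hypothetical and exist only for this illustration.

```rust
/// Transcode UTF-16 code units into an owned UTF-8 Rust `String`.
/// Returns `None` if the input contains an unpaired surrogate.
fn utf16_to_utf8(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// True when every code unit is ASCII. In that case each UTF-16 unit
/// corresponds to exactly one UTF-8 byte with the same numeric value.
fn is_ascii_utf16(units: &[u16]) -> bool {
    units.iter().all(|&u| u < 0x80)
}
```

For an ASCII input such as `#ffdd00` the unit count and the UTF-8 byte count are the same, while text containing accented letters or emoji changes size during transcoding, which is why the conversion cannot be a plain `memcpy` in general.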
A real world example
Because Canva works with designs and colors every day, a common but trivial task is parsing CSS colors that are strings into their numeric representation.
Thus the problem statement is, given some well-formed CSS hex color such as #ffdd00, extract the
red, green, and blue channels. So, for an input of #ffdd00, the expected output is [255, 221, 0].
This case study has also been chosen because it reflects a seemingly worst-case scenario for using WebAssembly. The input is a small string that must be copied into the WebAssembly linear memory (paying a Wasm tax), and there is very little processing within WebAssembly to pay off the tax.
The JavaScript implementation graphed on the blue line is as follows:
function parseHexColor(hex) {
  return [
    parseInt(hex.slice(1, 3), 16),
    parseInt(hex.slice(3, 5), 16),
    parseInt(hex.slice(5, 7), 16),
  ];
}
And a sensible Rust implementation graphed on the orange line is implemented as:
#[wasm_bindgen]
pub fn parse_hex_color_str(hex: &str) -> Vec<u8> {
    let b = hex.as_bytes();
    vec![
        (hex_nibble(b[1]) << 4) | hex_nibble(b[2]),
        (hex_nibble(b[3]) << 4) | hex_nibble(b[4]),
        (hex_nibble(b[5]) << 4) | hex_nibble(b[6]),
    ]
}

#[inline(always)]
fn hex_nibble(byte: u8) -> u8 {
    match byte {
        b'0'..=b'9' => byte - b'0',
        b'A'..=b'F' => byte - b'A' + 10,
        b'a'..=b'f' => byte - b'a' + 10,
        _ => 0,
    }
}
This benchmark measures the total duration for a given number of colors parsed, so it intuitively makes sense that the Wasm implementation is a constant factor slower than the JavaScript implementation. This reinforces the myth that copying strings is expensive, but can we beat JavaScript?
Before optimising this function, what is the WebAssembly implementation doing?
- wasm-bindgen must allocate a slice of linear memory to copy the JavaScript string to.
- wasm-bindgen then text encodes the string into that recently allocated linear memory (with a UTF-16 to UTF-8 conversion).
- Our function parse_hex_color_str is called, which allocates a Rust vector.
- The vector is copied out into JavaScript.
- The Rust vector is freed.
“If you’re willing to restrict the flexibility of your approach, you can almost always do something better” ~ John Carmack
wasm-bindgen is extremely ergonomic and general, but it doesn’t know the specifics of our function.
We can do better by leveraging problem-specific invariants. We know that:
- We are running in a single threaded JavaScript environment.
- The function can pre-allocate 7 bytes and re-use those bytes for the string copy.
- CSS hex colors consist entirely of ASCII characters, so we do not need to pay for a UTF-16 to UTF-8 conversion. ASCII characters are identical in both UTF-16 and UTF-8.
- The output of three 8 bit unsigned integers can be packed into a single 32 bit number avoiding a vector allocation and free.
With this problem specific info, we can write a third Wasm implementation of parseHexColor that
has exactly the same user visible behavior as the JavaScript implementation whilst avoiding the
memory allocations.
On my laptop running Chrome, the near-zero allocation variant of the Wasm color parsing function is almost twice as fast as the JavaScript implementation whilst still paying a string copy wasm tax.
Let’s take a look at how this has been done.
use std::cell::RefCell;
use js_sys::{Uint8Array, WebAssembly};
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

thread_local! {
    // A fixed 7-byte scratch buffer ("#rrggbb") at a stable address.
    static HEX_STRING_BUF: RefCell<[u8; 7]> = const { RefCell::new([0; 7]) };
}

#[wasm_bindgen]
pub fn get_hex_buffer_view() -> Uint8Array {
    let ptr = HEX_STRING_BUF.with(|buf| buf.as_ptr() as u32);
    let memory: WebAssembly::Memory = wasm_bindgen::memory().unchecked_into();
    Uint8Array::new_with_byte_offset_and_length(&memory.buffer(), ptr, 7)
}

#[wasm_bindgen]
pub fn parse_hex_color_no_alloc() -> u32 {
    HEX_STRING_BUF.with(|buf| {
        let b = buf.borrow();
        let r = ((hex_nibble(b[1]) << 4) | hex_nibble(b[2])) as u32;
        let g = ((hex_nibble(b[3]) << 4) | hex_nibble(b[4])) as u32;
        let b_val = ((hex_nibble(b[5]) << 4) | hex_nibble(b[6])) as u32;
        // Pack the three channels as 0x00RRGGBB.
        (r << 16) | (g << 8) | b_val
    })
}
The Rust code has two large changes that impact the input and output of the
parse_hex_color_no_alloc function. First, the input argument is gone. Instead of directly passing
an input and relying on wasm-bindgen to implement all of the allocations and copying, we can more
finely control this behavior by preallocating a 7 byte array. JavaScript can then copy the string
into this stable location in linear memory avoiding a memory allocation with the string copy.
Additionally, by packing the returned RGB values into a number, we can avoid the allocation and free cost of a returned vector.
import {
  get_hex_buffer_view,
  parse_hex_color_no_alloc
} from "../../../generated_wasm/rustweek_2026_wasm_myths.js";

let view = undefined;

function parseHexColor(hex) {
  // Safety: refresh the view if Wasm memory grew so prior memory is detached.
  if (view === undefined || view.byteLength === 0) {
    view = get_hex_buffer_view();
  }
  for (let j = 0; j < 7; j++) {
    view[j] = hex.charCodeAt(j);
  }
  const colorInt = parse_hex_color_no_alloc();
  return [
    (colorInt >> 16) & 255, // R
    (colorInt >> 8) & 255,  // G
    colorInt & 255          // B
  ];
}
This more complex Rust implementation can be paired with a JavaScript facade function that allows
the public parseHexColor API to remain unchanged. This function still takes in a hex string and
returns an array of three RGB numbers.
The internals now take a stable view of the linear memory and directly copy the hex string using
charCodeAt skipping the UTF-16 to UTF-8 conversion. The returned RGB number is also unpacked
into a JavaScript array.
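The packing and unpacking described above can be checked natively, without a browser. The following is a sketch under that assumption: `parse_packed` and `unpack_rgb` are hypothetical helper names, and `unpack_rgb` mirrors in Rust what the JavaScript facade does with shifts and masks.

```rust
/// Decode one ASCII hex digit; anything else maps to 0, mirroring the
/// lenient `hex_nibble` shown earlier.
fn hex_nibble(byte: u8) -> u8 {
    match byte {
        b'0'..=b'9' => byte - b'0',
        b'A'..=b'F' => byte - b'A' + 10,
        b'a'..=b'f' => byte - b'a' + 10,
        _ => 0,
    }
}

/// Parse a 7-byte `#rrggbb` buffer and pack the channels as 0x00RRGGBB.
fn parse_packed(buf: &[u8; 7]) -> u32 {
    let channel = |hi: u8, lo: u8| u32::from((hex_nibble(hi) << 4) | hex_nibble(lo));
    (channel(buf[1], buf[2]) << 16) | (channel(buf[3], buf[4]) << 8) | channel(buf[5], buf[6])
}

/// Unpack 0x00RRGGBB into `[r, g, b]`; the `as u8` casts keep the low
/// 8 bits, which is the same as masking with 255.
fn unpack_rgb(packed: u32) -> [u8; 3] {
    [(packed >> 16) as u8, (packed >> 8) as u8, packed as u8]
}
```

Calling `unpack_rgb(parse_packed(b"#ffdd00"))` gives `[255, 221, 0]`, matching the expected output stated in the problem statement. A single `u32` return value needs no allocation on either side of the bridge.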
From doing this exercise I hope I’ve convinced you that it is possible to both pay the Wasm data copy tax and outperform JavaScript, for a small WebAssembly function while preserving the user-facing public API.
The high level takeaway here is that wasm-bindgen facilitates very high level interactions between Wasm and JavaScript, but pays a performance penalty for its flexibility. By trading the ergonomic abstraction for specific and intentional memory management, we can avoid the majority of the Wasm tax and outperform JavaScript even on small functions.
Myth: Working with objects in WebAssembly is expensive
This myth is a natural extension of the myth that strings are expensive.
Objects must also navigate the wasm tax of having their data be copied into WebAssembly linear
memory. The additional complexity is that there are extra crates available such as
serde_wasm_bindgen that can further assist with copying JavaScript objects into Rust, and
vice-versa.
The mental model for achieving performance remains exactly the same as strings. Performance is achieved by reducing the amount of data copied, and the amount of memory allocations. For the object myth I think it’s instructive to jump straight into a benchmarked case study.
Case Study
The JavaScript function we will be benchmarking against is the following:
function sum_object_a_b(obj) {
  return obj.a + obj.b
}
// Usage example:
sum_object_a_b({a: 6, b: 7}) // returns 13
However, JavaScript is a dynamically typed language. Although the sum_object_a_b function expects
the fields a and b, additional fields may also be present and are ignored. The objects generated
for benchmarking contain an additional id field containing a 10,000 length random string.
Thus an example input ends up looking like:
sum_object_a_b({
  id: 'aaaaaaaaaaaa[... 1,000 length string truncated]',
  a: 6,
  b: 7
}) // returns 13
Instead of benchmarking the variants up top and then explaining them, below are three different WebAssembly implementations. As I introduce each one try and guess how it compares to the JavaScript implementation given the input data.
A: Pass object argument with serde_wasm_bindgen crate
#[derive(Deserialize)]
struct SumABData {
    a: f64,
    b: f64,
}

#[wasm_bindgen]
pub fn sum_ab_serde(val: JsValue) -> f64 {
    let obj: SumABData = serde_wasm_bindgen::from_value(val).unwrap();
    obj.a + obj.b
}
The serde_wasm_bindgen crate provides an incredibly ergonomic, “Rust-native” developer experience.
You define a strongly typed struct, and the library handles the translation.
This abstraction comes with a cost as the WebAssembly module must dynamically inspect the JavaScript object across the Wasm and JavaScript boundary, handle type checking, and allocate a new Rust struct.
B: Use wasm_bindgen structural access to object fields
#[wasm_bindgen]
extern "C" {
    pub type SumABObj;

    #[wasm_bindgen(method, getter)]
    pub fn a(this: &SumABObj) -> f64;

    #[wasm_bindgen(method, getter)]
    pub fn b(this: &SumABObj) -> f64;
}

#[wasm_bindgen]
pub fn sum_ab_structural(obj: &SumABObj) -> f64 {
    obj.a() + obj.b()
}
Using wasm-bindgen directly, we can annotate an extern "C" type directly with the getters we
expect to be present. This allows SumABObj to have fields a and b directly accessed.
wasm-bindgen generates these getters for us whilst also avoiding copying the object and allocating
a new struct.
C: Use a JavaScript facade to destructure the object and pass through the fields directly
import { sum } from "../../../generated_wasm/rustweek_2026_wasm_myths.js";

function sum_object_facade(obj: BenchObj): number {
  return sum(obj.a, obj.b)
}
Where sum is defined as:
#[wasm_bindgen]
pub fn sum(a: f64, b: f64) -> f64 {
    a + b
}
This approach applies a similar optimisation strategy to the one used in the CSS hex color example. By relying on the host environment for its heavily optimised property access and then passing the a and b values directly into the Wasm function, we avoid passing objects across the bridge at all.
This option ends up very performant but it’s not always viable for deeply nested or highly complex objects.
So, are objects expensive to work with in WebAssembly? It entirely depends on how you handle them.
If you attempt to generically deserialize objects using serde_wasm_bindgen on a hot path, the
overhead will definitely ruin your performance. The tax is too high. On my machine
serde_wasm_bindgen runs about six times slower than B: structural access.
Option B initially made me concerned about the multiple function calls, and the performance degradation they might cause, when calling the wasm_bindgen generated getters. I actually went to an open source Rust library where I know there is a lot of web_sys usage and essentially pulled the JavaScript access out of the Rust, similar to option C, the facade pattern. But this ends up being a micro-optimisation so small that I couldn’t achieve any measurable difference.
Thus I think B is acceptable for the vast majority of application code where developer ergonomics and
maintainability are the primary goals.
Conclusion
It’s critical to keep in mind exactly what these benchmarks are measuring: pure Wasm tax overhead.
In the objects case study, the Wasm functions do almost zero actual work. We’re only adding two numbers together. This means measurements are heavily skewed to show only the fee of crossing the JavaScript to WebAssembly bridge, without reaping any of the WebAssembly rewards.
As shown by the CSS string color parsing examples, it doesn’t take much computation for WebAssembly to pay off the Wasm tax. The Wasm tax is real but, provided you consider your data access patterns, it is a small entry fee to a much faster and much more predictable execution environment.
WebAssembly Performance
This section outlines some areas where Wasm+Rust outperforms JavaScript.
GC jitter: how JS GC breaks predictable performance
This section shows how JavaScript’s garbage collection can negatively impact predictable performance.
The workload
A deeply nested, pointer-heavy graph that tries to emulate some real-world application state. Per call, the benchmark will:
- Build the tree.
- Sum every node’s value via a recursive walk.
- Push the resulting tree onto a rolling retention buffer.
JavaScript variant
Plain { value, children } objects on the JS heap. The GC owns the graph.
interface JsTreeNode {
  value: number;
  children: JsTreeNode[];
}

function jsBuildTree(depth: number, branching: number, seed: number): JsTreeNode {
  const children: JsTreeNode[] = [];
  if (depth > 0) {
    for (let i = 0; i < branching; i++) {
      const childSeed = ((seed * 31) + i) >>> 0;
      children.push(jsBuildTree(depth - 1, branching, childSeed));
    }
  }
  return { value: seed, children };
}

function jsSumTree(node: JsTreeNode): number {
  let sum = node.value;
  const children = node.children;
  for (let i = 0; i < children.length; i++) {
    sum = (sum + jsSumTree(children[i])) >>> 0;
  }
  return sum;
}
Rust → Wasm variant
TreeNode records linked through Vecs in Wasm linear memory. Memory is freed deterministically when a tree is dropped from the VecDeque (the rolling retention buffer).
struct TreeNode {
    value: u32,
    children: Vec<TreeNode>,
}

fn build_tree(depth: u32, branching: u32, seed: u32) -> TreeNode {
    let mut children: Vec<TreeNode> = Vec::new();
    if depth > 0 {
        children.reserve_exact(branching as usize);
        for i in 0..branching {
            let child_seed = seed.wrapping_mul(31).wrapping_add(i);
            children.push(build_tree(depth - 1, branching, child_seed));
        }
    }
    TreeNode { value: seed, children }
}

fn sum_tree(node: &TreeNode) -> u32 {
    let mut sum = node.value;
    for child in &node.children {
        sum = sum.wrapping_add(sum_tree(child));
    }
    sum
}
What you should see
- Top chart (per-call work time, ms). The JS half bobs around the same baseline most of the time, but sees regular spikes. The Wasm half stays close to a flat band.
- Bottom chart (heap MB). The JS half climbs in a sawtooth pattern. A slow ramp occurs as memory fills, then a sharp drop every time a major GC cycle completes. Each drop on the heap chart lines up with a spike on the work-time chart above. The Wasm half stays almost flat.
performance.memory is a Chromium-only API. In Firefox and Safari the heap
chart will stay empty.
Analysis
Rust + Wasm does not require garbage collection. Memory is freed deterministically, so, unlike in JavaScript, reclamation work does not accumulate into a single moment that can cause a dropped frame.
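The deterministic freeing can be observed directly in native Rust. The sketch below is not the benchmark's code: `Tracked` is a hypothetical stand-in for a retained tree, and `run_rolling_buffer` is an invented helper that models the rolling retention buffer with a `VecDeque`, assuming a capacity of at least 1.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::rc::Rc;

/// Stand-in for a retained tree; bumps a shared counter when freed.
struct Tracked {
    drops: Rc<Cell<usize>>,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Push `total` values through a buffer holding at most `capacity`
/// (assumed >= 1) and return how many values were freed along the way,
/// counted while the buffer itself is still alive.
fn run_rolling_buffer(total: usize, capacity: usize) -> usize {
    let drops = Rc::new(Cell::new(0));
    let mut buffer: VecDeque<Tracked> = VecDeque::with_capacity(capacity);
    for _ in 0..total {
        if buffer.len() == capacity {
            // The evicted value is freed right here, at a known point.
            drop(buffer.pop_front());
        }
        buffer.push_back(Tracked { drops: Rc::clone(&drops) });
    }
    drops.get()
}
```

Each eviction frees exactly one value at the point where it leaves the buffer, so the cost of freeing is spread evenly across calls instead of being batched into a later collection pause.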
Wasm SIMD: a primitive JavaScript doesn’t have
WebAssembly’s fixed-width SIMD proposal exposes 128-bit vector registers and lane-wise arithmetic intrinsics (f32x4_mul, i32x4_add, v128_load, …). They’ve been baseline in every major browser since 2023. From Rust we get them as plain functions in std::arch::wasm32.
JavaScript has no equivalent. The original SIMD.js proposal was withdrawn in favour of “use Wasm SIMD instead”.
Wasm SIMD can be critical to providing maximum performance to users.
The workload
Per call:
- take two Float32Arrays of length N
- return sum_{i=0..N} a[i] * b[i]
To keep the chart honest, all three variants are doing the same I/O pattern:
- The JavaScript variant’s arrays live on the JS heap and are filled in place each round (no per-round allocation churn).
- Both Wasm variants pre-allocate two Vec<f32> of length N in their respective module’s linear memory and expose them as Float32Array views. This emulates the pre-allocated buffer pattern.
So the duration on the chart is the cost of the loop itself, not bridge or allocation overhead.
JavaScript variant
function jsDotProduct(a: Float32Array, b: Float32Array): number {
  let acc = 0;
  for (let i = 0; i < a.length; i++) {
    acc += a[i] * b[i];
  }
  return acc;
}
Rust scalar variant (compiled without +simd128)
#[wasm_bindgen]
pub fn dot_product_scalar(n: u32) -> f32 {
    let n = n as usize;
    DOT_A.with(|a| {
        DOT_B.with(|b| {
            let a = a.borrow();
            let b = b.borrow();
            let a = &a[..n];
            let b = &b[..n];
            let mut acc = 0.0_f32;
            for i in 0..n {
                acc += a[i] * b[i];
            }
            acc
        })
    })
}
Rust SIMD variant (compiled with +relaxed-simd)
The loop uses v128_load to read 16 bytes at a time, multiplies lane-wise with f32x4_mul, and accumulates into a v128 running sum. The sum is horizontally reduced once at the end, and a scalar tail handles any leftover (fewer than 4) elements.
#[cfg(target_feature = "simd128")]
#[wasm_bindgen]
pub fn dot_product_simd(n: u32) -> f32 {
    use std::arch::wasm32::{
        f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat, v128_load,
    };
    let n = n as usize;
    DOT_A.with(|a| {
        DOT_B.with(|b| {
            let a = a.borrow();
            let b = b.borrow();
            let a = &a[..n];
            let b = &b[..n];
            let mut acc = f32x4_splat(0.0);
            let chunks = n / 4;
            // SAFETY: `a` and `b` are `&[f32]` of length `n`; we read
            // exactly `chunks * 4` lanes and the scalar tail covers the
            // remainder.
            unsafe {
                for i in 0..chunks {
                    let va = v128_load(a.as_ptr().add(i * 4) as *const _);
                    let vb = v128_load(b.as_ptr().add(i * 4) as *const _);
                    acc = f32x4_add(acc, f32x4_mul(va, vb));
                }
            }
            // Horizontal reduction of the four lanes.
            let mut sum = f32x4_extract_lane::<0>(acc)
                + f32x4_extract_lane::<1>(acc)
                + f32x4_extract_lane::<2>(acc)
                + f32x4_extract_lane::<3>(acc);
            // Scalar tail for the last `n % 4` elements.
            for i in (chunks * 4)..n {
                sum += a[i] * b[i];
            }
            sum
        })
    })
}
The chart
What you should see
Three lines, all linear in N:
- JavaScript and Wasm scalar sit close together. JS is doing the same scalar multiply-accumulate the Wasm scalar version is.
- Wasm SIMD drops to roughly a quarter of either scalar line. The ceiling is 4× because we’re packing four f32 multiplies into one f32x4_mul.
The interesting takeaway isn’t the size of the speedup. It’s the shape: at this kind of straight-line numeric loop, just moving from JavaScript to scalar Wasm doesn’t buy you much. The win comes from Wasm SIMD, which is something JavaScript can’t express at all.
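The shape of the SIMD loop (four lane accumulators, one horizontal reduction, a scalar tail) can be emulated in portable Rust that runs on any target. This is only an illustration of the data flow, not a SIMD implementation: `dot_scalar` and `dot_four_lanes` are hypothetical names, and whether the compiler vectorises the lane loop is not guaranteed.

```rust
/// Scalar reference: one multiply-add per element, in order.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Portable emulation of the SIMD shape: four independent lane
/// accumulators, one horizontal reduction, then a scalar tail.
fn dot_four_lanes(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let mut lanes = [0.0_f32; 4];
    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // Elements that do not fill a whole group of four.
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());
    for (ca, cb) in chunks_a.zip(chunks_b) {
        for lane in 0..4 {
            lanes[lane] += ca[lane] * cb[lane];
        }
    }
    // Horizontal reduction, done once after the main loop.
    let mut sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (x, y) in tail_a.iter().zip(tail_b) {
        sum += x * y;
    }
    sum
}
```

Because the lane version adds the products in a different order from the scalar loop, the two results can differ in the last bits for general floating-point input. They agree exactly only when every intermediate value is exactly representable, for example with small integer-valued inputs.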
Wasm threading: parallel without paying postMessage
The saving grace of low-tier devices is that they’re multithreaded. Wasm takes better advantage of that than JavaScript does.
In Rust + Wasm with wasm-bindgen-rayon, every thread is a Web Worker spawned over the same WebAssembly.Memory. A &mut [f32] iterated with Rayon’s par_iter_mut is a pointer into shared linear memory; every worker thread can read and write it directly. There is no copy. The standard library’s std::sync::Mutex, RwLock, MPSC channels, and atomics all just work.
In JavaScript, the equivalent fan-out is postMessage to a pool of Web Workers. Two flavours of it are shown in the benchmarks below: structured cloning, and transferables, which minimise copying as much as possible.
The workload
The lightest meaningful parallel map: SAXPY, out[i] = a * x[i] + y[i]. Per element it’s one f32 multiply + one f32 add. It’s a simple operation where we aim to measure message overhead rather than computation speed.
We test 5 variants, as detailed below:
1. JavaScript variant (single-threaded)
A baseline comparison that simply executes everything on one thread.
const n = x.length;
for (let i = 0; i < n; i++) {
  output[i] = a * x[i] + y[i];
}
2. JavaScript worker (structured clone)
A persistent pool of K = navigator.hardwareConcurrency (capped at 8) workers.
Each call, we:
- postMessage {x_chunk, y_chunk, a} to each worker (structured-clone alloc + memcpy).
- the worker performs the computation.
- the worker allocates an output Float32Array.
- the worker posts it back.
- the main thread glues the K output chunks together.
self.onmessage = (event: MessageEvent<SaxpyCloneRequest>) => {
  const { requestId, a, x, y } = event.data;
  const n = x.length;
  const output = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    output[i] = a * x[i] + y[i];
  }
  const response: SaxpyCloneResponse = { requestId, output };
  self.postMessage(response);
};
3. JavaScript worker (transferables)
The same as the structured clone version, but with no per-call allocations: the input and output buffers are moved between threads as transferables instead of being copied.
self.onmessage = (event: MessageEvent<SaxpyTransferRequest>) => {
  const { requestId, a, x, y, output } = event.data;
  const n = x.length;
  for (let i = 0; i < n; i++) {
    output[i] = a * x[i] + y[i];
  }
  const response: SaxpyTransferResponse = { requestId, x, y, output };
  // Hand ownership of all three buffers back to the main thread.
  self.postMessage(response, [x.buffer, y.buffer, output.buffer]);
};
4. Rust scalar single threaded
The same as variant 1, but entirely in Wasm.
#[wasm_bindgen]
pub fn saxpy_scalar(n: u32, a: f32) {
    let n = n as usize;
    SAXPY_X.with(|x| {
        SAXPY_Y.with(|y| {
            SAXPY_OUT.with(|o| {
                let x = x.borrow();
                let y = y.borrow();
                let mut o = o.borrow_mut();
                let x = &x[..n];
                let y = &y[..n];
                let o = &mut o[..n];
                for i in 0..n {
                    o[i] = a * x[i] + y[i];
                }
            })
        })
    });
}
5. Rust parallel (Rayon + Atomics)
We perform multithreading using the SharedArrayBuffer and Atomics Web APIs. Zero bytes cross any boundary, because the buffers live where the threads can already access them.
#[wasm_bindgen]
pub fn saxpy_parallel(n: u32, a: f32) {
    let n = n as usize;
    SAXPY_X.with(|x| {
        SAXPY_Y.with(|y| {
            SAXPY_OUT.with(|o| {
                let x = x.borrow();
                let y = y.borrow();
                let mut o = o.borrow_mut();
                let x = &x[..n];
                let y = &y[..n];
                let o = &mut o[..n];
                o.par_iter_mut()
                    .with_min_len(8192)
                    .zip(x.par_iter())
                    .zip(y.par_iter())
                    .for_each(|((out, &xv), &yv)| {
                        *out = a * xv + yv;
                    });
            })
        })
    });
}
The chart
Analysis
Wasm threading sees the best performance because message overhead is eliminated entirely. Rayon efficiently dispatches batches of work to the available threads, which all operate over the same memory.
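The "same memory, no messages" idea can be sketched in native Rust with standard library scoped threads. This swaps in `std::thread::scope` in place of Rayon and Web Workers, so it illustrates the sharing model rather than the benchmark's implementation; `saxpy_scoped` and its `workers` parameter are hypothetical.

```rust
use std::thread;

/// Compute out[i] = a * x[i] + y[i], splitting the work across up to
/// `workers` scoped threads that all borrow the same buffers.
fn saxpy_scoped(a: f32, x: &[f32], y: &[f32], out: &mut [f32], workers: usize) {
    assert!(x.len() == out.len() && y.len() == out.len());
    if out.is_empty() {
        return;
    }
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk = (out.len() + workers - 1) / workers;
    thread::scope(|scope| {
        let parts = out
            .chunks_mut(chunk)
            .zip(x.chunks(chunk))
            .zip(y.chunks(chunk));
        for ((o, xs), ys) in parts {
            // Each thread gets a disjoint `&mut` output slice and shared
            // `&` input slices; nothing is copied or sent as a message.
            scope.spawn(move || {
                for ((out_v, &xv), &yv) in o.iter_mut().zip(xs).zip(ys) {
                    *out_v = a * xv + yv;
                }
            });
        }
    });
}
```

The borrow checker guarantees the output chunks do not overlap, which is what lets every thread write into the shared buffer without locks. Rayon adds work stealing and a persistent pool on top of the same idea.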
What about SharedArrayBuffer + Atomics in JavaScript?
Yes, you can have shared memory in pure JavaScript. The cost is that you stop writing JavaScript and start writing a byte-level protocol.