[Web] Improve large tensor loading in wasm runtime#19771
Conversation
There was a problem hiding this comment.
Code Review
This pull request restructures TVM-FFI includes in wasm_runtime.cc to prevent static initialization crashes and updates ArrayDecodeStorage to tolerate uncompressed float32 weights under the 'f32-to-bf16' format. In runtime.ts, it introduces chunked record loading and copying (up to 128MB chunks) to handle large tensors efficiently, and adds support for kTVMFFIShape types. The review feedback suggests optimizing these chunking loops by utilizing the cached makeShapeTuple method on the Instance class rather than invoking the FFI this.ctx.makeShapeTuple repeatedly, which reduces redundant FFI round-trips.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| this.ctx.makeShapeTuple( | ||
| ...chunkShape.map((value) => new Scalar(value, "int")), | ||
| ), |
There was a problem hiding this comment.
We can leverage the cached makeShapeTuple method on the Instance class instead of directly calling the FFI this.ctx.makeShapeTuple on every chunk. This avoids redundant FFI round-trips to create the same shape tuple multiple times across chunks and records, improving performance.
this.makeShapeTuple(chunkShape),| const chunkShapeTuple = this.ctx.makeShapeTuple( | ||
| ...chunkShape.map((value) => new Scalar(value, "int")), | ||
| ); |
There was a problem hiding this comment.
We can leverage the cached makeShapeTuple method on the Instance class instead of directly calling the FFI this.ctx.makeShapeTuple on every chunk. This avoids redundant FFI round-trips to create the same shape tuple multiple times across chunks and records, improving performance.
const chunkShapeTuple = this.makeShapeTuple(chunkShape);012380a to
9c4334d
Compare
| artifactCache: ArtifactCacheTemplate, | ||
| signal?: AbortSignal, | ||
| ) { | ||
| const maxChunkBytes = 128 * 1024 * 1024; |
There was a problem hiding this comment.
how is the number determined, would be good to have a sense of what WebGPU runtime supports, the motivation here is not as clear
|
@akaashrp would be great if you can help review |
|
Please elaborate on "reorder FFI implementation includes before runtime implementation includes in the wasm single-translation-unit build, avoiding static initialization ordering issues during module startup" Since static init ordering should not impact the startup, if there is an impact we need to know details |
|
would be great to get elaboration on "load large tensor-cache records in chunks to avoid oversized JS-to-wasm decode/copy calls", is there a specific scenario that motivate this, generally, we would favor simplicity as long as such case is covered by webgpu runtime |
9c4334d to
1fea1f5
Compare
|
Thanks for the review. I updated the PR to narrow the scope and make the motivation more explicit:
Local checks still pass after the update: |
|
Thanks, is there a real usecase for the chunking? .eg. if maxStorageBufferBindingSize limit is set really to 128MB, seems that means we canot allocate tensor larger than that anyway as a result we won't have copy in such shape. And in cases where maxStorageBufferBindingSize is larger, we don't need chunking. |
1fea1f5 to
a83d3a4
Compare
|
You are right that The concrete use case is a very large tensor-cache record that is valid for the target device but fragile as one JS/Wasm staging call. The Gemma 4 E2B artifact has three records above 128 MiB:
In the local Chrome/WebGPU validation lane, the failure was deterministic on the 1120 MiB record. Instrumentation reached |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces chunked loading for large tensor-cache records to avoid massive single JS-to-wasm byte-array calls, and updates the WASM runtime to handle uncompressed float32 payloads labeled as 'f32-to-bf16'. Feedback on these changes highlights several critical issues: the custom f32-to-bf16 decoding path in C++ ignores cpu_arr->byte_offset which causes data corruption during chunked loading; the storageBytes helper in TypeScript fails for "bool" data types; a potential resource leak exists in the GPU chunked copy path if an exception occurs during view creation; and a defensive null check is needed for shapeObjPtr before loading memory.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // the "f32-to-bf16" tag without performing the bf16 truncation. | ||
| if (size == byte_size / 2) { | ||
| const uint16_t* bf16 = reinterpret_cast<const uint16_t*>(byte_data); | ||
| uint32_t* data = static_cast<uint32_t*>(cpu_arr->data); |
There was a problem hiding this comment.
The custom f32-to-bf16 decoding path directly uses cpu_arr->data as the destination pointer, completely ignoring cpu_arr->byte_offset. When loading chunked records, chunkView is a tensor view with a non-zero byte_offset. Ignoring this offset causes all decoded chunks to be written to the very beginning of the underlying tensor, leading to silent data corruption. To fix this, the destination pointer must be offset by cpu_arr->byte_offset.
| uint32_t* data = static_cast<uint32_t*>(cpu_arr->data); | |
| uint32_t* data = reinterpret_cast<uint32_t*>(static_cast<char*>(cpu_arr->data) + cpu_arr->byte_offset); |
| case TypeIndex.kTVMFFIShape: { | ||
| const shapeObjPtr = this.memory.loadPointer(valuePtr); | ||
| if (callbackArg) { |
There was a problem hiding this comment.
Add a defensive null check for shapeObjPtr before performing pointer arithmetic and loading memory. If shapeObjPtr is null (0), accessing it will cause an out-of-bounds memory access or return garbage.
| case TypeIndex.kTVMFFIShape: { | |
| const shapeObjPtr = this.memory.loadPointer(valuePtr); | |
| if (callbackArg) { | |
| case TypeIndex.kTVMFFIShape: { | |
| const shapeObjPtr = this.memory.loadPointer(valuePtr); | |
| if (shapeObjPtr === 0) { | |
| return null; | |
| } | |
| if (callbackArg) { |
| const storageBytes = (dtype: string) => { | ||
| const match = dtype.match(/(\d+)(?:x(\d+))?$/); | ||
| if (match === null) { | ||
| throw new Error("Cannot determine storage width of dtype " + dtype); | ||
| } | ||
| const bits = Number(match[1]); | ||
| const lanes = match[2] === undefined ? 1 : Number(match[2]); | ||
| return (bits * lanes + 7) >> 3; | ||
| }; |
There was a problem hiding this comment.
The storageBytes helper uses a regular expression to extract the bit width from the dtype string. However, if the dtype is "bool", the regex will fail to match any digits and throw an error. In TVM, boolean tensors are commonly represented with the "bool" dtype (which is 1 byte / 8 bits in DLPack). Adding an explicit fallback for "bool" makes the helper more robust.
| const storageBytes = (dtype: string) => { | |
| const match = dtype.match(/(\d+)(?:x(\d+))?$/); | |
| if (match === null) { | |
| throw new Error("Cannot determine storage width of dtype " + dtype); | |
| } | |
| const bits = Number(match[1]); | |
| const lanes = match[2] === undefined ? 1 : Number(match[2]); | |
| return (bits * lanes + 7) >> 3; | |
| }; | |
| const storageBytes = (dtype: string) => { | |
| if (dtype === "bool") { | |
| return 1; | |
| } | |
| const match = dtype.match(/(\d+)(?:x(\d+))?$/); | |
| if (match === null) { | |
| throw new Error("Cannot determine storage width of dtype " + dtype); | |
| } | |
| const bits = Number(match[1]); | |
| const lanes = match[2] === undefined ? 1 : Number(match[2]); | |
| return (bits * lanes + 7) >> 3; | |
| }; |
| const [cpuView, gpuView] = this.withNewScope(() => { | ||
| return [ | ||
| this.detachFromCurrentScope( | ||
| this.ctx.tensorCreateView( | ||
| cpu_arr, | ||
| chunkShapeTuple, | ||
| rec.dtype, | ||
| new Scalar(targetByteOffset, "int"), | ||
| ) | ||
| ), | ||
| this.detachFromCurrentScope( | ||
| this.ctx.tensorCreateView( | ||
| gpu_arr, | ||
| chunkShapeTuple, | ||
| rec.dtype, | ||
| new Scalar(targetByteOffset, "int"), | ||
| ) | ||
| ), | ||
| ]; | ||
| }); |
There was a problem hiding this comment.
In the GPU copy path, cpuView and gpuView are created and immediately detached from the current scope inside the withNewScope block. If the second tensorCreateView (for gpuView) throws an exception, the first view (cpuView) will have already been detached from the scope's auto-disposal list, but the assignment to [cpuView, gpuView] will never complete. This results in a resource leak of cpuView because it is neither auto-disposed by the scope nor disposed in the finally block. To prevent this, detach both views only after both have been successfully created.
const [cpuView, gpuView] = this.withNewScope(() => {
const cView = this.ctx.tensorCreateView(
cpu_arr,
chunkShapeTuple,
rec.dtype,
new Scalar(targetByteOffset, "int"),
);
const gView = this.ctx.tensorCreateView(
gpu_arr,
chunkShapeTuple,
rec.dtype,
new Scalar(targetByteOffset, "int"),
);
return [
this.detachFromCurrentScope(cView),
this.detachFromCurrentScope(gView),
];
});a83d3a4 to
f1e69f2
Compare
|
Addressed the Gemini follow-up comments in the latest commit:
Local checks still pass: |
f1e69f2 to
1d2a3f1
Compare
|
Pushed a small follow-up on the new head
Local validation passes: |
1d2a3f1 to
a2359c1
Compare
|
Pushed a small follow-up on head The smoke first exposed one more scope issue in the chunked path: Local TVM web checks pass on the pushed head:
Downstream Chrome/WebGPU smoke also passes with a WebLLM JS bundle rebuilt against the pushed TVM source:
Prompt smoke results:
Boundary note: this validates the current PR-head TypeScript/WebLLM loading path with the latest available Apache model wasm. It is not a fresh model-wasm rebuild from |
This splits out the Web/WebGPU runtime-only portion of #19766 into a smaller PR, following reviewer feedback that the compiler-side changes should be handled separately.
This PR keeps the scope to
web/runtime code:ArrayDecodeStoragetoleratef32-to-bf16records whose payload is already native float32-sized, while preserving the existing packed-bf16 expansion pathkTVMFFIShapecallback results as JS number arrays so chunked tensor views can pass explicit shape tuplesConcrete motivation: the Gemma 4 E2B MLC artifact has tensor-cache records at 192 MiB, 140 MiB, and 1120 MiB. The 1120 MiB record is
model.embed_tokens_per_layer.q_weightwith shape[262144, 1120]. Local browser validation on an Apple Silicon Chrome/WebGPU lane reproduced an abort during CPU-sidearrayDecodeStorageof that record before GPU copy began. This is a JS/Wasm staging issue for very large tensor-cache records, not a claim that the target WebGPU device cannot allocate the final tensor.The 128 MiB chunk cap is intentionally conservative and only applies as a per-call staging size for records above that threshold. Smaller records continue to use the existing full-record path.
Local validation:
npm run lintfromweb/npx tsc --noEmit --pretty falsefromweb/git diff --checkI could not run the full local
npm run prepwasm && npm run buildpath on this machine because Emscripten (emcc/emsdk) is not installed. The earlier broad PR had the Apache wasm CI pass before the compiler-side CI failures; this PR is intended to let the wasm job validate the runtime-only split independently.