Skip to content

Speed up .sng header parsing (~3.3×) and add readSongIni for metadata-only scans#16

Open
elicwhite wants to merge 6 commits intoGeomitron:masterfrom
elicwhite:perf/header-parse-speedups
Open

Speed up .sng header parsing (~3.3×) and add readSongIni for metadata-only scans#16
elicwhite wants to merge 6 commits intoGeomitron:masterfrom
elicwhite:perf/header-parse-speedups

Conversation

@elicwhite
Copy link
Copy Markdown

Summary

A series of focused, independently-reviewable perf improvements to the .sng header parse path, plus one small additive API (readSongIni) for the common "I only want the song.ini data" use case.

Each commit is scoped to one change, with profiling rationale and before/after benchmarks in the commit body. Nothing breaks existing behavior — SngStream's public contract is unchanged.

Cumulative impact (204 .sng files, concurrency=32, node 24.11):

Scenario Before After Delta
Header parse via SngStream (cold cache) 11.51 ms / 2,540 files/s 3.32 ms / 8,303 files/s −71% latency, +227% throughput
Metadata-only scan (warm cache, readSongIni) 2.23 ms / 13,510 files/s (via event API) 1.63 ms / 18,820 files/s −27% latency, +39% throughput

Commits, in review order

  1. Replace chunked header buffer with single growing buffer — the biggest win. The old getHeaderBuffer scanned every byte of every chunk on each of three length probes, and mergeUint8Arrays used an O(n²) .slice(0, i).reduce(...) inside .forEach. Replaced with a single geometrically-growing Uint8Array. −48% latency, +88% throughput.
  2. Hoist binary-parser Parser definitions to module scopeparseSngHeader was constructing three binary-parser.Parser instances on every call, throwing away binary-parser's internal compiled-function cache. Moved to module scope. −8% latency, +9% throughput.
  3. Skip generateIniFileText when generateSongIni is disabledgenerateIniFileText was building + encoding ini text on every header parse even when the result was discarded. Moved inside the generateSongIni: true branch. −10% latency, +9% throughput.
  4. Defer leftover-chunk unmask from start() to pull() — when the header's final read() overshoots into the first file's bytes, readFile's ReadableStream.start was synchronously XOR-unmasking those leftover bytes for ~40–60 KB per file, before a metadata-only consumer's cancel() could land. Deferred to the first pull() so cancel-before-pull skips the XOR entirely. −33% latency, +47% throughput for metadata-only scanners; identical behavior for full-extraction consumers.
  5. Add readSongIni() for metadata-only scanners — a single new top-level export. Reads just enough of a stream to parse the header, cancels the source, returns { [key: string]: string } (same shape as SngHeader.metadata). Skips EventEmitter, file ReadableStream construction, chunk-unmasker allocation, and readFile dispatch — all unnecessary when the caller only wants the header. −27% latency, +39% throughput vs. using SngStream + cancel for the same workload.
  6. Add changeset for header-parse perf improvements — bumps the package minor (new public API).

Testing

  • tsc --noEmit clean
  • tsup build succeeds
  • End-to-end extraction of a multi-file .sng with generateSongIni: true produces byte-identical output at each commit (MIDI/JPEG/Opus magic bytes intact, sizes match, generated song.ini well-formed)
  • readSongIni returns the same metadata object shape as SngHeader.metadata
  • CPU profiling was used at each step to verify the expected hotspot was eliminated

Before, in-flight header bytes were kept as an array of chunks, and
every header-length probe (3 per file) rebuilt a fresh Uint8Array by
scanning every byte of every chunk. `mergeUint8Arrays` then reassembled
the full buffer with an O(n^2) `buffers.slice(0, i).reduce(...)` inside
a forEach. CPU profiling on 204 .sng files showed these two functions
accounting for 57ms of self-time — more than the rest of the header
parse combined.

This commit stores the header as a single `Uint8Array` that grows
geometrically as chunks arrive. Length probes become a `byteCount`
comparison + `DataView.getBigUint64`, the full-header parse is a
zero-copy `subarray`, and leftover bytes for the next file are a single
`slice`. The old `getHeaderBuffer`, `mergeUint8Arrays`, and free-function
`readBigUint64LE` helpers are gone.

Benchmark (204 .sng files, concurrency=32, node 24.11):

  before: mean 11.51 ms/file, 2540 files/s
  after:  mean  5.98 ms/file, 4785 files/s
  delta:  -48% latency, +88% throughput

Verified end-to-end extraction still produces byte-identical output
(MIDI/JPEG/Opus magic bytes intact; generated song.ini well-formed).
`parseSngHeader` was constructing three `binary-parser.Parser` instances
on every call. `binary-parser` generates its parse function via
`new Function()` on first `.parse()` invocation and caches it on the
Parser instance, so rebuilding the Parser per call threw away that cache
and paid allocation for the method-chain state each time.

Move the three Parser definitions to module scope so they're built (and
their parse functions compiled) exactly once per process.

Benchmark (204 .sng files, concurrency=32, node 24.11):

  before: mean 5.98 ms/file, 4785 files/s
  after:  mean 5.51 ms/file, 5197 files/s
  delta:  -8% latency, +9% throughput

The bulk of the remaining header-parse cost is inside `binary-parser`'s
generated code, not in Parser construction; this is a small-but-free
incremental win on top of the header-buffer rewrite.
`generateIniFileText` was called on every parsed header, even when
`config.generateSongIni` was false — in which case its output was only
used inside branches gated by the same flag and otherwise discarded.
This meant every header parse allocated a `TextEncoder`, built the ini
text string, and encoded it into a Uint8Array for nothing on the common
scanner path (browser/CLI tools that only read metadata).

Move the call inside the `if (this.config.generateSongIni)` branch, and
restructure the surrounding `emit('header')` / `readFile` dispatch so
the disabled path no longer runs any ini-related work.

Benchmark (204 .sng files with generateSongIni: false, concurrency=32,
node 24.11):

  before: mean 5.51 ms/file, 5197 files/s
  after:  mean 4.97 ms/file, 5648 files/s
  delta:  -10% latency, +9% throughput

Behavior is unchanged when `generateSongIni: true`; the ini text is
still generated and the song.ini file is still emitted first.
When the header's final read() overshoots into the first file's bytes,
`readFile` stored those bytes in `leftoverFileChunk` and the
ReadableStream's `start` handler unmasked (XOR'd) them synchronously at
construction time. Metadata-only consumers (browser/CLI scanners that
only read the `header` event and then `fileStream.cancel()` each file)
still paid the full XOR cost on ~40–60 KB of leftover per file before
their cancel had a chance to run. CPU profiling on 204 .sng files
showed this unmasker loop as the top self-time hotspot at 13ms — 35% of
wall time.

Move the leftover enqueue into `pull` and short-circuit before any
source read. If the consumer cancels before pulling, the leftover is
never unmasked. Full-extraction consumers see identical behavior: pull
is called immediately on a non-cancelled stream, so the leftover is
still processed first.

Benchmark (204 .sng files, concurrency=32, node 24.11):

  before: mean 4.97 ms/file, 5648 files/s
  after:  mean 3.32 ms/file, 8303 files/s
  delta:  -33% latency, +47% throughput

Verified end-to-end extraction still produces byte-identical output for
a multi-file .sng (MIDI/JPEG/Opus magic bytes intact, sizes unchanged,
generated song.ini well-formed).
Add a single new export that reads a .sng's header from a stream,
cancels the source, and returns the song.ini key/value data as a plain
`{ [key: string]: string }` — the same object that lives at
`SngHeader.metadata` on the full header.

This is the right shape for library indexers / song.ini scanners: no
`EventEmitter`, no file-content `ReadableStream`, no chunk-unmasker
allocation, no `readFile` dispatch. The function awaits `reader.read()`
in a tight loop until the header is complete, parses it, and cancels
the source.

Benchmark (204 .sng files, concurrency=32, warm page cache, node 24.11):

                      mean     throughput
  SngStream events    2.23 ms   13,510 files/s
  readSongIni         1.63 ms   18,820 files/s
  delta                -27%      +39%

Wrapping a direct `fs.open`+`fd.read(1MB)` into a single-chunk
ReadableStream fed into `readSongIni` lands in the same range after the
cold allocation settles — the Web-stream adapter is not the bottleneck,
so no Node-specific helper is needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant