
Reduce method #46

Draft
caolan wants to merge 5 commits into holepunchto:main from caolan:map-reduce2

@caolan caolan commented May 5, 2026

Contributor

A reducer API for Hyperbee2

I'm not recommending you merge this yet. I'm opening the pull request as a place to discuss the feature. Take a look in the examples directory to get a feel for how it works.

You'll notice this only includes the 'reduce' part of 'map/reduce'. That is intentional. A map is essentially another index/tree built on top of this one via a map function. I imagine, if required, we would create a new Hyperbee2 that watches the source tree for changes and applies a map function to any updates. This allows querying the mapped data with range requests, reducers, etc. using the regular Hyperbee2 API.

The intermediate output of reducers can be ephemeral (temporary), cached in memory, or cached on disk (written to a batch in Hypercore). Caching the output of reducers for subtrees greatly improves query performance.

Writing to disk requires providing the desired reducer functions (and names) to flush() on the WriteBatch. It is possible with this API to stop using old reducers without incurring the cost of their ongoing recalculation and write overhead, and to introduce new reducers as necessary. The API does not rely on eagerly writing reducer output to nodes on every operation because this would be inefficient, but more importantly, it would make it difficult to layer new reducers on top of trees forked from a remote peer. Recalculation on demand during flush() and writing to your own batch avoids those issues.

When written to a batch, the cached results are included directly on the tree node inside the batch (JSON encoded for now). There is currently no indirection (as there is with value pointers); the reducer output is written directly into the node. Since any update to the node or its descendants will invalidate the output of the reducer, I didn't see a reason to link them across batches via a pointer.

The API asks that any reducers using a cache provide a unique name as a string. This facilitates dropping old reducers and introducing new ones as the application develops. Changing your reducer in a backwards-incompatible way requires you to change its name string to invalidate the cache.
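
To illustrate the naming rule, here is a minimal sketch using a toy in-memory node, not Hyperbee2's actual node layout. `cachedReduce`, `node.entries` and `node.reduced` are hypothetical names. The only ideas taken from the description above are that results are stored on the node as JSON under the reducer's name, and that a new name misses the old cache.

```javascript
// Toy model only: `node.reduced` stands in for the per-node cache described above.
function cachedReduce (node, name, reducer) {
  if (name !== null && name in node.reduced) {
    return JSON.parse(node.reduced[name]) // cache hit: the reducer is not called
  }
  const result = reducer(node.entries, false)
  if (name !== null) node.reduced[name] = JSON.stringify(result)
  return result
}

const node = {
  entries: [
    { key: Buffer.from('a'), value: Buffer.from('1') },
    { key: Buffer.from('b'), value: Buffer.from('2') }
  ],
  reduced: {}
}

let calls = 0 // counts how often the underlying computation actually runs
const totalV1 = (values) => {
  calls++
  return values.reduce((sum, e) => sum + Number(e.value.toString()), 0)
}
// Backwards-incompatible change: the output is now an object, so it gets a new name.
const totalV2 = (values) => ({ total: totalV1(values) })

const first = cachedReduce(node, 'total', totalV1) // computed, then cached
const second = cachedReduce(node, 'total', totalV1) // served from the cache
const third = cachedReduce(node, 'total-v2', totalV2) // new name, so recomputed
```

Had `totalV2` kept the name 'total', the third call would have returned the stale number from the cache instead of the new object shape.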

API

await db.reduce(name, reducer)

Calculates an accumulated value for all entries in the tree.

To cache intermediate results between calls, provide a string as the
name argument. This must be unique to the reducer function and be
updated if the reducer changes in a backwards-incompatible way.

To perform a one-off temporary reduce, provide null as the name.

The reducer is a function that will calculate the accumulated value.
It takes two arguments: values and rereduce.

When rereduce is false, the values argument will be an Array of
tree entries with key and value properties. When the rereduce
argument is true, the values argument will be an Array of values
returned from previous reducer calls.

The output of a reducer must be a JSON-compatible value.

Note: while entries provided to the reducer are in sorted key order,
those entries might not be contiguous across reducer calls. For example,
a reducer might receive [3,4], [6,7], [9], [5,8] across 4 calls.

import Hyperbee from '../index.js'
import Corestore from 'corestore'

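// Sums number strings. On rereduce, values are sums from earlier calls.
const total = (values, rereduce) => {
  let sum = 0
  for (const v of values) {
    if (rereduce) {
      sum += v
    } else {
      // v is a tree entry; its value is a Buffer holding a number string
      sum += Number(v.value.toString())
    }
  }
  return sum
}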

const b = new Hyperbee(new Corestore('./sandbox'))
await b.ready()

// Calculate total of all number strings
console.log(
  await b.reduce('total', total)
)
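
The rereduce branch matters most when the reducer's output differs in kind from its input. A count reducer, for instance, has to add up earlier counts rather than count them. The sketch below calls the reducer by hand instead of going through db.reduce(); `count`, `entry` and the sample groups are illustrative assumptions, not part of this PR.

```javascript
const count = (values, rereduce) => {
  if (rereduce) {
    // values are counts returned by earlier calls, so add them up
    return values.reduce((sum, n) => sum + n, 0)
  }
  // values are tree entries ({ key, value }), so count them
  return values.length
}

// Hand-rolled stand-in for the calls the tree would make
const entry = (k) => ({ key: Buffer.from(k), value: Buffer.from('x') })
const partials = [
  count([entry('a'), entry('b')], false),
  count([entry('c')], false),
  count([entry('d'), entry('e'), entry('f')], false)
]
const totalCount = count(partials, true)
console.log(totalCount)
```

Returning `values.length` in both branches would report the number of groups (3 here) rather than the number of entries.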

await db.reduceRange(name, reducer, start, end)

Calculates an accumulated value for a range of entries in the tree.
start can be null (to begin with the first entry) or a Buffer less
than or equal to the first key to include. end can be null (to end
after the last entry), or a Buffer greater than the last key to include.

The name and reducer arguments are described in the documentation for
db.reduce().
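
Read literally, the bounds above make start inclusive and end exclusive. A small predicate shows that reading. `inRange` is a hypothetical helper, and the assumption that keys order the same way as Buffer.compare is mine rather than something stated in this PR.

```javascript
// Assumed semantics: start <= key < end, with null meaning "unbounded".
function inRange (key, start, end) {
  if (start !== null && Buffer.compare(key, start) < 0) return false
  if (end !== null && Buffer.compare(key, end) >= 0) return false
  return true
}

const keys = ['a', 'b', 'c', 'd'].map((k) => Buffer.from(k))

// Keys from 'b' up to, but not including, 'd'
const selected = keys
  .filter((k) => inRange(k, Buffer.from('b'), Buffer.from('d')))
  .map((k) => k.toString())

// Both bounds null selects every key
const all = keys.filter((k) => inRange(k, null, null))
```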

await batch.flush([reducers])

The reducers argument is an Object with reducer functions (as described by
the documentation for db.reduce()) keyed by unique strings. If provided,
these will be recalculated for all nodes lacking a cached reduce result and
updated values will be written to Hypercore as part of the batch. This greatly
improves query time for reducers at the expense of writing a larger batch.

Limitations

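Because Hyperbee2 is a B-tree and not a B+tree, internal nodes hold entries of their own as well as pointing to children. The entries passed to a single reducer call are sorted, but the groups seen across calls might not be contiguous. This is probably not what the author of a reducer expects. For example, a reducer might receive [3,4], [6,7], [9], [5,8] across 4 calls.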
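
A toy walk shows where that grouping can come from. This is a sketch under assumptions: the tree shape, the `reduceNode` helper and the children-first visiting order are all hypothetical, and entries are plain numbers here rather than objects with key and value properties.

```javascript
// Toy B-tree: the root holds entries 5 and 8 itself, between its children.
const tree = {
  entries: [5, 8],
  children: [
    { entries: [3, 4], children: [] },
    { entries: [6, 7], children: [] },
    { entries: [9], children: [] }
  ]
}

const seen = [] // records the values of every non-rereduce call

const sum = (values, rereduce) => {
  if (!rereduce) seen.push(values)
  return values.reduce((a, b) => a + b, 0)
}

// Reduce each child subtree, then this node's own entries, then combine.
function reduceNode (node, reducer) {
  const partials = node.children.map((child) => reduceNode(child, reducer))
  partials.push(reducer(node.entries, false))
  return partials.length === 1 ? partials[0] : reducer(partials, true)
}

const result = reduceNode(tree, sum)
console.log(JSON.stringify(seen), result)
```

Each group is sorted, yet [5,8] arrives last and straddles the other groups, so a reducer that assumes each call continues where the previous one stopped would be wrong.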

@caolan caolan marked this pull request as draft May 5, 2026 16:10
@caolan caolan force-pushed the map-reduce2 branch 2 times, most recently from bb76da5 to 1324d78 Compare May 5, 2026 16:20
caolan commented May 5, 2026

Contributor Author

Note: this proposal needs at least a test suite before it can be merged.
