An LSM-tree key-value store built in Go from scratch.
Every write goes to the WAL first (so nothing is lost on a crash), then into the memtable. When the memtable is full it gets flushed to disk as an SSTable. Reads check the memtable first, then go through SSTables newest to oldest until the key is found.
```
Write -> WAL -> Memtable (B-tree)
                    |
                  flush
                    |
               SSTable (disk)

Read  -> Memtable -> SSTable[N] -> SSTable[N-1] -> ...
```
Keys are stored in a B-tree so they stay sorted. This makes flushing cheap since the SSTable is just a sequential write of an already-sorted tree, no extra sorting pass needed.
Deletes don't remove the key, they write a tombstone (IsDeleted=true). The key disappears from reads but the tombstone sticks around until compaction cleans it up. Compaction isn't implemented yet.
Each flush creates a directory with two files:
- `data`: every row as `key,value,deleted\n`, written in key order
- `index`: `key,byte_offset\n` for every row in `data`
The index is kept in memory as a map. A read looks up the offset, seeks to that spot in the data file, and reads one line. SSTables never change after they're written, so newer SSTables always win over older ones on reads.
Each SSTable has a bloom filter so reads can skip SSTables that definitely don't have the key without touching the index.
Interestingly, whether this actually helps depends on how big the SSTables are. With small SSTables the whole index fits in CPU cache so a map lookup is faster than hashing the key 3 times. With large SSTables the index is too big for cache and each lookup has to go to RAM. At that point the bloom filter wins because its bitset is small enough to stay cache-hot.
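A minimal bloom filter with three probes might look like this. The sizing and the salted FNV-1a hashing are assumptions for illustration, not the store's actual parameters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: m bits, 3 hash probes derived from
// FNV-1a by salting the key with the probe index.
type bloom struct {
	bits []uint64
	m    uint64
}

func newBloom(m uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m}
}

func (b *bloom) positions(key string) [3]uint64 {
	var pos [3]uint64
	for i := 0; i < 3; i++ {
		h := fnv.New64a()
		h.Write([]byte{byte(i)}) // salt each probe differently
		h.Write([]byte(key))
		pos[i] = h.Sum64() % b.m
	}
	return pos
}

func (b *bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain returns false only when the key is definitely absent;
// true means "maybe", so the index still has to be checked.
func (b *bloom) MayContain(key string) bool {
	for _, p := range b.positions(key) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	b := newBloom(1024)
	b.Add("hello")
	fmt.Println(b.MayContain("hello")) // prints "true"
}
```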
500,001 writes, 50,000 reads on keys that exist, 50,000 on keys that don't.
Small SSTables (maxSize=10k, ~50 SSTables)
| | Writes | Read hits | Read misses |
|---|---|---|---|
| Bloom on | 4.19s | 135ms | 58ms |
| Bloom off | 4.16s | 112ms | 39ms |
Bloom is slower here. The index fits in cache so a plain map lookup is cheaper than running 3 hashes.
Large SSTables (maxSize=100k, ~5 SSTables)
| | Writes | Read hits | Read misses |
|---|---|---|---|
| Bloom on | 4.11s | 68ms | 8.5ms |
| Bloom off | 3.91s | 70ms | 12ms |
Bloom wins on misses. The index maps are too big for cache; bloom's bitset isn't.
Without compaction (maxSize=10k, ~50 small SSTables)
| | Writes | Read hits | Read misses |
|---|---|---|---|
| Bloom on | 4.25s | 125ms | 52ms |
| Bloom off | 4.44s | 114ms | 34ms |
Reads walk 50 SSTables. Each index fits in CPU cache so map lookups are fast and bloom adds more overhead than it saves.
With compaction (maxSize=100k, ~5 large SSTables)
| | Writes | Read hits | Read misses |
|---|---|---|---|
| Bloom on | 15.4s | 62ms | 8.3ms |
| Bloom off | 15.7s | 68ms | 9.1ms |
Fewer SSTables so reads are faster. Writes are slower since each flush is much larger. Index maps no longer fit in CPU cache, so bloom's compact bitset beats a plain map lookup on misses.
Run on an Apple M4 with `-benchtime=5s`. Numbers are per-operation.
Sequential writes

| Config | ns/op | B/op |
|---|---|---|
| bloom=true, lock=true | 58,165 | 28,180 |
| bloom=true, lock=false | 73,167 | 36,454 |
| bloom=false, lock=true | 70,526 | 35,228 |
| bloom=false, lock=false | 80,792 | 39,588 |
Read hits (10k keys)

| Config | ns/op |
|---|---|
| bloom=true, lock=true | 5,445 |
| bloom=true, lock=false | 5,364 |
| bloom=false, lock=true | 5,351 |
| bloom=false, lock=false | 5,333 |
Read misses (10k keys)

| Config | ns/op |
|---|---|
| bloom=true, lock=true | 65.4 |
| bloom=true, lock=false | 65.1 |
| bloom=false, lock=true | 67.5 |
| bloom=false, lock=false | 67.4 |
Reads (100k keys)

| Config | ns/op |
|---|---|
| hit, bloom=true | 5,418 |
| hit, bloom=false | 5,378 |
| miss, bloom=true | 97.5 |
| miss, bloom=false | 110.4 |
Concurrent writes (10 goroutines)

| Config | ns/op |
|---|---|
| bloom=true, lock=true | 18,639 |
| bloom=false, lock=true | 18,889 |
Concurrent reads

| Config | ns/op |
|---|---|
| bloom=true, lock=true | 4,696 |
| bloom=false, lock=true | 4,754 |
The bloom filter only helps misses. On a hit, bloom says "maybe" and you still have to read the index, so you pay the hash cost for nothing. On a miss it can reject the key without touching the index at all, which is where the speedup comes from.
The advantage scales with index size. With 10k keys the SSTable index maps fit in CPU cache, so a plain map lookup beats running 3 hashes and bloom only saves about 3% on misses (65ns vs 67ns). With 100k keys the indices spill out of cache and each lookup costs a RAM round-trip. At that point bloom's compact bitset stays cache-hot and the saving grows to about 12% (97ns vs 110ns).
Bloom never helps hits. Hit latency is nearly identical with and without bloom across both dataset sizes (~5.3-5.4µs). The SSTable data file read dominates and bloom just adds a small hash overhead before it.
An uncontended lock is basically free. Sequential writes with lock=true are actually faster than lock=false (58µs vs 73µs). There's no contention so the mutex costs nothing, and the two code paths end up with different allocation patterns. lock=false allocates about 30% more per op, which is where the difference comes from.
Concurrent writes scale to about 3x, not 10x. Sequential puts cost ~58µs; parallel drops to ~18µs across 10 goroutines. WAL appends and B-tree inserts both serialize on the write lock, so more goroutines can't help past that point.
Concurrent reads barely move. ~5.4µs drops to ~4.7µs. Reads take a shared RLock so they can run in parallel, but the bottleneck is SSTable file I/O which doesn't fan out well on a single drive.
```sh
go run .
```