diff --git a/sei-db/db_engine/litt/Makefile b/sei-db/db_engine/litt/Makefile
new file mode 100644
index 0000000000..5939765f4f
--- /dev/null
+++ b/sei-db/db_engine/litt/Makefile
@@ -0,0 +1,23 @@
+SHELL := /bin/bash
+
+
+# Build the litt CLI tool.
+build:
+	go build -o ./bin/litt ./cli
+
+# Remove the bin directory if it exists.
+clean:
+	rm -rf ./bin
+
+# Build the litt CLI tool with debug flags.
+debug-build: clean
+	go mod tidy
+	go build -gcflags "all=-N -l" -o ./bin/litt ./cli
+
+# Run all LittDB unit tests.
+test: build
+	go test ./... -timeout=10m -p=1 -parallel=8
+
+# Run all LittDB unit tests with verbose output.
+test-verbose: build
+	go test ./... -v -timeout=10m -p=1 -parallel=8
diff --git a/sei-db/db_engine/litt/README.md b/sei-db/db_engine/litt/README.md
new file mode 100644
index 0000000000..5566694448
--- /dev/null
+++ b/sei-db/db_engine/litt/README.md
@@ -0,0 +1,532 @@
+![](docs/resources/littdb-logo.png)
+
+# Work-in-progress guard
+
+This tree is a raw import from the upstream LittDB project. It has not yet been
+adapted to build inside this module — imports still point at the origin repo
+(`github.com/Layr-Labs/eigenda/...`) and external dependencies have not been
+reconciled with this repo's `go.mod`.
+
+To keep CI green during incremental integration, every `.go` file under
+`sei-db/db_engine/litt/` starts with:
+
+```go
+//go:build littdb_wip
+```
+
+Without `-tags=littdb_wip` (the default in CI and in `make build`), the Go
+toolchain skips the entire tree — `go build ./...`, `go test ./...`, `go vet
+./...`, and `golangci-lint` all treat it as empty.
+
+To see the current (failing) state of the code locally:
+
+```bash
+go build -tags=littdb_wip ./sei-db/db_engine/litt/...
+```
+
+This is expected to fail until each package is adapted to this module.
+
+# Contents
+
+- [License](docs/licenses/README.md)
+- [What is LittDB?](#what-is-littdb)
+  - [Features](#features)
+  - [Consistency Guarantees](#consistency-guarantees)
+  - [Planned/Possible Features](#plannedpossible-features)
+  - [Anti-Features](#anti-features)
+- [API](#api)
+  - [Overview](#overview)
+  - [Getting Started](#getting-started)
+  - [Configuration Options](#configuration-options)
+  - [CLI](#littdb-cli)
+- [Definitions](#definitions)
+- [Architecture](docs/architecture.md)
+- [Filesystem Layout](docs/filesystem_layout.md)
+
+# What is LittDB?
+
+LittDB is a highly specialized embedded key-value store that is optimized for the following workload:
+
+- high write throughput
+- low read latency
+- low memory usage
+- write once, never update
+- data is only deleted via a [TTL](#ttl) (time-to-live) mechanism
+
+In order to achieve these goals, LittDB provides an intentionally limited feature set. For workloads
+that are capable of being handled with this limited feature set, LittDB is going to be more performant
+than just about any other key-value store on the market. For workloads that require more advanced
+features, "sorry, not sorry". LittDB is able to do what it does precisely because it doesn't provide
+a lot of the features that a more general-purpose key-value store would provide, and adding those
+can only be done by sacrificing the performance that LittDB is designed to provide.
+
+## Features
+
+The following features are currently supported by LittDB:
+
+- writing values (once)
+- reading values
+- [TTLs](#ttl) and automatic (lazy) deletion of expired values
+- [tables](#table) with non-overlapping namespaces
+- multi-drive support (data can be spread across multiple physical volumes)
+- incremental backups (both local and remote)
+- keys and values up to 2^32 bytes in size
+- incremental snapshots
+
+## Consistency Guarantees
+
+The consistency guarantees provided by LittDB are more limited than those provided by typical general-purpose
+transactional databases. This is intentional, as the intended use cases of LittDB do not require higher-order
+consistency guarantees.
+
+- thread safety
+- [read-your-writes consistency](#read-your-writes-consistency)
+- crash [durability](#durability) for data that has been [flushed](#flushing)
+- [atomic](#atomicity) writes
+  - Although [batched writes](#batched-writes) are supported (for performance), batches are not [atomic](#atomicity).
+    Each individual write within a batch is [atomic](#atomicity), but the batch as a whole is not. That is to say,
+    if the computer crashes after a [batch](#batched-writes) has been written but before [flushing](#flushing),
+    some of the writes in the [batch](#batched-writes) may be [durable](#durability) on disk, while others may
+    not be.
+
+## Planned/Possible Features
+
+The following features are planned for future versions of LittDB, or are technically feasible if a strong
+enough need is demonstrated:
+
+- dynamic multi-drive support: Drives can currently only be added/removed with a DB restart.
+  It's currently fast, but not instantaneous. With this feature, drives can be added/removed on the fly.
+- read-only mode from an outside process
+- DB iteration (this is plausible to implement without high overhead, but we don't currently have
+  a good use case to justify the implementation effort)
+- more keymap implementations (e.g.
badgerDB, a custom solution, etc.) +- data check-summing and verification (to protect/detect disk corruption) +- keys and values up to 2^64 bytes in size + +## Anti-Features + +These are the features that LittDB specifically does not provide, and will never provide. This is +not done because we're lazy, but because these features would significantly impact the performance +of the database, and because they are simply not needed for the intended use cases of LittDB. LittDB +is a highly specialized tool for a very specific task, and it is not intended to be a general-purpose +key-value store. + +- mutating existing values (once a value is written, it cannot be changed) +- deleting values (values only leave the DB when they expire via a TTL) +- transactions (individual operations are atomic, but there is no way to group operations atomically) +- fine granularity for [TTL](#ttl) (all data in the same table must have the same TTL) +- multi-computer replication (LittDB is designed to run on a single machine) +- data encryption +- data compression +- any sort of query language other than "get me the value associated with this key" +- ordered data iteration + +# API + +## Overview + +Below is a high level overview of the LittDB API. For more detailed information, see the inline documentation in the +interface files. 
+
+Source: [db.go](db.go)
+
+```go
+type DB interface {
+	GetTable(name string) (Table, error)
+	DropTable(name string) error
+	Stop() error
+	Destroy() error
+}
+```
+
+Source: [table.go](table.go)
+
+```go
+type Table interface {
+	Name() string
+	Put(key []byte, value []byte) error
+	PutBatch(batch []*types.KVPair) error
+	Get(key []byte) ([]byte, bool, error)
+	Exists(key []byte) (bool, error)
+	Flush() error
+	Size() uint64
+	SetTTL(ttl time.Duration) error
+	SetCacheSize(size uint64) error
+}
+```
+
+Source: [kv_pair.go](types/kv_pair.go)
+
+```go
+type KVPair struct {
+	Key   []byte
+	Value []byte
+}
+```
+
+## Getting Started
+
+Below is a functional example showing how to use LittDB.
+
+```go
+// Configure and build the database.
+config, err := littbuilder.DefaultConfig("path/to/where/data/is/stored")
+if err != nil {
+	return err
+}
+
+db, err := config.Build(context.Background())
+if err != nil {
+	return err
+}
+
+// This works whether the table is new or already exists.
+myTable, err := db.GetTable("my-table")
+if err != nil {
+	return err
+}
+
+// Write a key-value pair to the table.
+key := []byte("this is a key")
+value := []byte("this is a value")
+
+err = myTable.Put(key, value)
+if err != nil {
+	return err
+}
+
+// Flush the data to disk.
+err = myTable.Flush()
+if err != nil {
+	return err
+}
+
+// Congratulations! Your data is now durable on disk.
+
+// Read the value back. This works before or after a flush.
+val, ok, err := myTable.Get(key)
+if err != nil {
+	return err
+}
+```
+
+## Configuration Options
+
+For more information about configuration, see [littdb_config.go](littdb_config.go).
+
+## LittDB CLI
+
+LittDB has a CLI utility for offline manipulation of DB files. See the [LittDB CLI](docs/littdb_cli.md) docs
+for more information on how to use it.
+
+# Definitions
+
+This section contains an alphabetized list of technical definitions for a number of terms used by LittDB.
This +list is not intended to be read in order, but rather to be used as a reference when reading other parts of the +documentation. + +## Address + +An address partially describes the location on disk where a [value](#value) is stored. Together with a [key](#key), +the [value](#value) associated with a [key](#key) can be retrieved from disk. + +An address is encoded in a 64-bit integer. It contains two pieces of information: + +- the [segment](#segment) [index](#segment-index) where the [value](#value) is stored +- the offset within the [value file](#segment-value-files) where the first byte of + the [value](#value) is stored + +This information is not enough by itself to retrieve the [value](#value) from disk if there is more than one +[shard](#shard) in the [table](#table). When there is more than one [shard](#shard), the following information +must also be known in order to retrieve the [value](#value) (i.e. to figure out which [shard](#shard) to look in): + +- the [sharding factor](#sharding-factor) for the [segment](#segment) where the [value](#value) is stored + (stored in the [segment metadata file](#segment-metadata-file)) +- the [sharding salt](#sharding-salt) for the [table](#table) where the [value](#value) is stored + (stored in the [table metadata file](#table-metadata-file)) +- the [key](#key) that the [value](#value) is associated with + +## Atomicity + +In the context of this document, atomicity means that an operation is either done completely or not at all. That is +to say, if there is a crash while an operation is in progress, the operation will either be completed when the +database is restarted, or it will not be completed at all. + +As a specific example, if writing a [value](#value) and there is a crash, either the entire [value](#value) will be +written to disk and available when the database is restarted, or the [value](#value) will be completely absent. +It will never be the case that only part of the [value](#value) is written to disk. 
+
+## Batched Writes
+
+LittDB supports batched write operations. Multiple write operations can be grouped together and passed to the database
+as a single operation. This may have positive performance implications, but is semantically equivalent to writing each
+value individually. A batch of writes is not [atomic](#atomicity) as a whole, but each individual write within the
+batch is [atomic](#atomicity). That is to say, if there is a crash after a batch of writes has been written but before
+it has been [flushed](#flushing), some of the writes in the batch may be [durable](#durability) on disk, while others
+may not be.
+
+## Cache
+
+LittDB maintains an in-memory cache of [key](#key)-[value](#value) pairs. Data is stored in this cache when a value
+is first written, as well as when it is read from disk. This is not needed for correctness, but is rather a performance
+optimization. The cache is not persistent, and is lost when the database is restarted. The size of the cache is
+configurable.
+
+## Durability
+
+In this context, the term "durable" is used to mean that data is stored on disk in such a way that it will not be lost
+in the event of a crash. Data that has been [flushed](#flushing) is considered durable. Data that has not been flushed
+is not considered durable. That doesn't mean that the data will be lost in the event of a crash, but rather that it
+is not guaranteed to be present after a crash.
+
+There are some limits to the strength of the durability guarantee provided by LittDB. For example, some drives buffer
+data in internal buffers before writing it to disk, and do not necessarily write data to disk immediately. LittDB is
+only as robust as the OS/hardware it is running on. This is true for any database, but it is worth mentioning here
+for the sake of completeness.
+
+## Flushing
+
+Calling `Flush()` causes all data previously written to be written [durably](#durability) to disk.
A call to `Flush()` +blocks until all data that was written prior to the call to `Flush()` has been written to disk. + +It is ok to never call `Flush()`. As internal buffers fill, data is written to disk automatically. However, calling +`Flush()` can be useful in some cases, such as when you want to ensure that data is written to disk before proceeding +with other operations. + +If `Flush()` is never called, data becomes durable through two mechanisms: + +- When a [segment](#segment) becomes full, it is made immutable and a new segment is created. As part of the process + of making a segment immutable, all data in the segment is fully written to disk. +- When the database is cleanly stopped via a call to `Stop()`, all unflushed data is written to disk. `Stop()` blocks + until this has been completed. + +`Flush()` makes no guarantees about the [durability](#durability) of data written concurrently with the call to +`Flush()` or after the call to `Flush()` has returned. It's not harmful to write data concurrently with a call to +`Flush()` as long as it is understood that this data may or may not be [durable](#durability) on disk when the call +to `Flush()` returns. + +The following example demonstrates the consistency guarantees provided by the `Flush()` operation: + +![](docs/resources/flush-visual.png) + +In this example there are two threads performing operations, `Thread 1` and `Thread 2`. `Thread 1` writes `A`, `B`, +and `C`, calls `Flush()`, and then writes `D`. `Thread 2` writes `W`, `X`, `Y`, and `Z`. `Time α` is the moment +when the flush operation is invoked, and `Time β` is the moment when the flush operation returns. + +All write operations that have completed at `Time α` before the flush operation is invoked are [durable](#durability) +when the flush operation returns at `Time β`. These are `A`, `B`, `C`, and `W`. 
Although writing `X` begins prior to
+`Time α`, since it is not complete at `Time α`, the flush operation does not guarantee that `X` is
+[durable](#durability) when it returns at `Time β`. The same is true for `Y`, `Z`, and `D`.
+
+Note that just because an operation is not guaranteed to be [durable](#durability) when `Flush()` returns does not mean
+that it is guaranteed not to be [durable](#durability). If the computer crashes after `Time β` but before the next call
+to `Flush()`, then `X`, `Y`, `Z`, and `D` may or may not be lost as a result.
+
+## Key
+
+A key in a key-[value](#value) store. A key is a byte slice that is used to look up a [value](#value) in the database.
+
+LittDB is agnostic to the contents of the key, other than requiring that keys be unique within a [table](#table).
+Although large keys are supported, performance has been tuned under the assumption that keys are generally small
+compared to [values](#value). The use case LittDB was originally intended for uses 32-byte keys.
+
+## Keymap
+
+At a conceptual level, a keymap is a mapping from [keys](#key) to [addresses](#address). In order to look up a
+[value](#value) in the database one needs to know two things: the [key](#key) and the [address](#address). The keymap
+is therefore necessary to look up data given a specific [key](#key).
+
+There are currently two implementations of the keymap in LittDB: an in-memory keymap and a keymap that uses levelDB.
+There are tradeoffs to each implementation. The in-memory keymap is faster, but has higher memory usage and longer
+startup times (it has to be rebuilt at boot time). The levelDB keymap is slower, but has a lower memory footprint and
+faster startup times.
+
+From a thread safety point of view, if a mapping is present in the keymap, the [value](#value) associated with the
+entry is guaranteed to be present on disk.
+
+- When writing a new [value](#value), it is first written to disk, and when that is complete the [key](#key) and
+  [address](#address) are written to the keymap.
+- When deleting a [value](#value), the [key](#key) and [address](#address) are first removed from the keymap, and
+  then the [value](#value) is deleted from disk.
+
+LittDB supports reading [values](#value) immediately after they are written, and during that period there may not
+be a corresponding entry in the keymap. For more information on how this edge case is handled, see the
+[unflushed data map](#unflushed-data-map).
+
+## Read-Your-Writes Consistency
+
+The definition of read-your-writes consistency is well summarized by its name. If a thread writes a [value](#value)
+to the database and then turns around and attempts to read that [value](#value) back, it will either
+
+1. read the [value](#value) that was just written, or
+2. read an updated [value](#value) that was written AFTER the [value](#value) that was just written
+
+Note that in LittDB, values are never permitted to be mutated. But when values grow older than their [TTL](#ttl),
+the value can be deleted. From a consistency point of view, the garbage collection process is equivalent to an update.
+That is to say, if a thread writes a [value](#value), waits a very long time, then reads that same [value](#value)
+back again, it is not a violation of read-your-writes consistency if the [value](#value) is not present because the
+garbage collector has deleted it.
+
+An "eventually consistent" database does not necessarily provide read-your-writes consistency. In the author's
+experience, such systems can be very difficult to reason about, and can lead to subtle bugs that are difficult to
+track down. Read-your-writes consistency is simple, yet powerful and intuitive. Since providing this level of
+consistency does not hurt performance, the complexity of its implementation is justified.
+
+## Segment
+
+Data in a LittDB [table](#table) can be visualized as a linked list. Each element in that linked list is called a
+"segment". A segment can hold many individual [values](#value). Old data is near the beginning of the list, and new
+data is near the end. Old, [expired](#ttl) data is always deleted from the first segment currently in the list. New
+data is always written to the last segment currently in the list.
+
+Segments are deleted as a whole. That is, when a segment is deleted, all data in that segment is deleted at the same
+time. Segments are only deleted when all data contained within them has [expired](#ttl).
+
+Segments have a target data size. When a segment is full, that segment is made immutable, and a new segment is created
+and added to the end of the list.
+
+Note that the maximum size of a segment file is not a hard limit. As long as the first byte of a [value](#value) is
+written to a segment file before the segment is full, the segment is permitted to hold it. An [address](#address)
+points to that first byte of a value. Since there are 32 bits in an [address](#address) used to store the offset
+within the file, the maximum offset for the first byte of a value is 2^32 bytes (4GB).
+
+A natural side effect of only requiring the first byte of a [value](#value) to be written before the segment is full is
+that LittDB can support arbitrarily large [values](#value). Doing so may result in a large amount of data in a single
+segment, but this does not violate any correctness invariants.
+
+Each segment may split its data into multiple [shards](#shard). The number of shards in a segment is called the
+[sharding factor](#sharding-factor). The [sharding factor](#sharding-factor) is configurable, and different segments
+may use different [sharding factors](#sharding-factor).
+
+There are three types of files that contain data for a segment:
+
+- [metadata](#segment-metadata-file)
+- [keys](#segment-key-file)
+- [values](#segment-value-files)
+
+### Segment Index
+
+Each segment has a serial number called a "segment index". The first segment ever created has index `0`, the next
+segment created has index `1`, and so on. Segment `N` is always deleted before segment `N+1`, meaning there will
+never be a gap in the segment indices currently in use.
+
+### Segment Key File
+
+A segment key file contains the [keys](#key) and [addresses](#address) for all the [values](#value) stored in the
+segment. At runtime, [key](#key)-[address](#address) pairs are appended to the key file. It is not read except in the
+following circumstances:
+
+- when a [segment](#segment) is deleted, the file is iterated to delete entries from the [keymap](#keymap)
+- when the DB is loaded from disk, the data is used to rebuild the [keymap](#keymap). This may not be needed
+  in situations where the keymap has durably stored data, and does not need to be rebuilt.
+
+The file name of a key file is `X.keys`, where `X` is the [segment index](#segment-index).
+
+### Segment Metadata File
+
+This file contains metadata about the segment. This metadata is small, and so it can be kept in memory. The file is
+read at startup to rebuild the in-memory representation of the segment.
+
+Each metadata file contains the following information:
+
+- the [segment index](#segment-index)
+- serialization version (in case the format changes in the future)
+- the [sharding factor](#sharding-factor) for the segment
+- the [salt](#sharding-salt) used for the segment
+- the [timestamp](#segment-timestamp) of the last element written in the segment, used to determine when the segment
+  can be deleted without violating the [TTL](#ttl) of any data contained within it
+- whether or not the segment is [immutable](#segment-mutability)
+
+The file name of a metadata file is `X.metadata`, where `X` is the [segment index](#segment-index).
+ +### Segment Mutability + +Only the last segment in the "linked list" is mutable. All other segments are immutable. + +### Segment Timestamp + +The timestamp of the last element written to the segment. This is used to determine when it is safe to delete a +segment without violating the [TTL](#ttl) of any data contained within it. This value is unset for the last segment +in the list, as it is still being written to. + +### Segment Value Files + +Each segment has one value file for each [shard](#shard) in the segment. Values are appended to the value files. +The [address](#address) of a [value](#value) is the offset within the value file where the [value](#value) begins. + +The file name of a value file is `X-Y.values`, where `X` is the [segment index](#segment-index) and `Y` is the +[shard](#shard) index. + +## Shard + +LittDB supports sharding. That is to say, it can break the data into smaller pieces and spread those pieces across +multiple locations. + +In order to determine the shard that a particular [key](#key) is in, a hash function is used. The data that goes +into the hash function is the [key](#key) itself, as well as a [sharding salt](#sharding-salt) that is unique to +each [segment](#segment). + +The [sharding salt](#sharding-salt) is chosen randomly. Its purpose is to make the mapping between [keys](#key) and +shards unpredictable to an outside attacker. Without this sort of randomness, an attacker could intentionally craft +keys that all map to the same shard, causing a hot spot in the database and potentially degrading performance. + +### Sharding Factor + +The number of [shards](#shard) in a [segment](#segment) is called the "sharding factor". The sharding factor must be +a positive, non-zero integer. The sharding factor can be changed at runtime without restarting the database or +performing a data migration. + +### Sharding Salt + +A random number chosen to make the [shard](#shard) hash function unpredictable to an outside attacker. 
This number
+does not need to be chosen via a cryptographically secure random number generator, as long as it is not publicly
+known.
+
+## Table
+
+A table in LittDB is a unique namespace. Two identical [keys](#key) do not conflict with each other as
+long as they are in different tables.
+
+Each table has its own [TTL](#ttl), and all data in the table is subject to that [TTL](#ttl). Each table has its
+own [keymap](#keymap) and its own set of [segments](#segment). [Flushing](#flushing) one table does not affect
+any other table. Aside from hardware, tables do not share any resources.
+
+In many ways, a table is a stand-alone database. The higher level [API](#api) that works with multiple tables is
+provided as a convenience, but does not enhance the performance of the DB in any way.
+
+### Table Metadata File
+
+A [table](#table) metadata file contains configuration for the table. It is intended to preserve high level
+configuration between restarts.
+
+## TTL
+
+TTL stands for "time-to-live". If data is configured to have a TTL of X hours, the data is automatically deleted
+approximately X hours after it is written.
+
+Note that TTL is the only way LittDB supports removing data from the database. Although it is legal to configure
+a table with a TTL of 0 (i.e. where data never expires), such a table will never be able to remove data.
+
+## Unflushed Data Map
+
+An in-memory map that contains [key](#key)-[value](#value) pairs that are not yet [durable](#durability) on disk.
+Entries are added to the map when a [value](#value) is written, and removed when the [value](#value) is fully
+written to both the [keymap](#keymap) and the [segment](#segment) files.
+
+This data structure is not to be confused with the [cache](#cache). Its purpose is not to improve performance, but
+rather to provide [read-your-writes consistency](#read-your-writes-consistency).
+
+## Value
+
+The value in a key-[value](#value) store.
A value is a byte slice that is associated with a [key](#key) in the database.
+LittDB is optimized to support large values, although small values are perfectly fine as well. Writing X bytes
+of data as a single large value is more efficient than writing the same X bytes as Y smaller values.
+
+# Architecture
+
+For a detailed overview of the architecture of LittDB, see the [Architecture](docs/architecture.md) docs.
+
+# Filesystem Layout
+
+For information about how LittDB arranges its internal files, see the [Filesystem Layout](docs/filesystem_layout.md)
+docs.
\ No newline at end of file
diff --git a/sei-db/db_engine/litt/benchmark/benchmark_engine.go b/sei-db/db_engine/litt/benchmark/benchmark_engine.go
new file mode 100644
index 0000000000..1b09a61b82
--- /dev/null
+++ b/sei-db/db_engine/litt/benchmark/benchmark_engine.go
@@ -0,0 +1,315 @@
+//go:build littdb_wip
+
+package benchmark
+
+import (
+	"context"
+	"fmt"
+	"math/rand"
+	"os"
+	"os/signal"
+	"syscall"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/Layr-Labs/eigenda/litt"
+	"github.com/Layr-Labs/eigenda/litt/benchmark/config"
+	"github.com/Layr-Labs/eigenda/litt/littbuilder"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+	"github.com/docker/go-units"
+	"golang.org/x/time/rate"
+)
+
+// BenchmarkEngine is a tool for benchmarking LittDB performance.
+type BenchmarkEngine struct {
+	ctx    context.Context
+	cancel context.CancelFunc
+	logger logging.Logger
+
+	// The configuration for the benchmark.
+	config *config.BenchmarkConfig
+
+	// The database to be benchmarked.
+	db litt.DB
+
+	// The table in the database where data is stored.
+	table litt.Table
+
+	// Keeps track of data to read and write.
+	dataTracker *DataTracker
+
+	// The maximum write throughput in bytes per second for each worker thread.
+	writeBytesPerSecondPerThread uint64
+
+	// The maximum read throughput in bytes per second for each worker thread.
+ readBytesPerSecondPerThread uint64 + + // The burst size for write rate limiting. + writeBurstSize uint64 + + // The burst size for read rate limiting. + readBurstSize uint64 + + // Records benchmark metrics. + metrics *metrics + + // errorMonitor is used to handle fatal errors in the benchmark engine. + errorMonitor *util.ErrorMonitor +} + +// NewBenchmarkEngine creates a new BenchmarkEngine with the given configuration. +func NewBenchmarkEngine(configPath string) (*BenchmarkEngine, error) { + cfg, err := config.LoadConfig(configPath) + if err != nil { + return nil, fmt.Errorf("failed to load config file %s: %w", configPath, err) + } + + cfg.LittConfig.Logger, err = common.NewLogger(cfg.LittConfig.LoggerConfig) + if err != nil { + return nil, fmt.Errorf("failed to create logger: %w", err) + } + + cfg.LittConfig.ShardingFactor = uint32(len(cfg.LittConfig.Paths)) + + db, err := littbuilder.NewDB(cfg.LittConfig) + if err != nil { + return nil, fmt.Errorf("failed to create db: %w", err) + } + + table, err := db.GetTable("benchmark") + if err != nil { + return nil, fmt.Errorf("failed to create table: %w", err) + } + + ttl := time.Duration(cfg.TTLHours * float64(time.Hour)) + err = table.SetTTL(ttl) + if err != nil { + return nil, fmt.Errorf("failed to set TTL for table: %w", err) + } + + ctx, cancel := context.WithCancel(context.Background()) + + errorMonitor := util.NewErrorMonitor(ctx, cfg.LittConfig.Logger, nil) + + dataTracker, err := NewDataTracker(ctx, cfg, errorMonitor) + if err != nil { + cancel() + return nil, fmt.Errorf("failed to create data tracker: %w", err) + } + + writeBytesPerSecond := uint64(cfg.MaximumWriteThroughputMB * float64(units.MiB)) + writeBytesPerSecondPerThread := writeBytesPerSecond / uint64(cfg.WriterParallelism) + + // If we set the write burst size smaller than an individual value, then the rate limiter will never + // permit any writes. 
Ideally, we'd just set the burst size to 0 since we don't want bursty/volatile writes, + // but since we are using the rate.Limiter utility, we are required to set a burst size, and a burst size + // smaller than an individual value will cause the rate limiter to never permit writes. + writeBurstSize := uint64(cfg.ValueSizeMB * float64(units.MiB)) + + readBytesPerSecond := uint64(cfg.MaximumReadThroughputMB * float64(units.MiB)) + readBytesPerSecondPerThread := readBytesPerSecond / uint64(cfg.ReaderParallelism) + + // If we set the read burst size smaller than an individual value we need to read, then the rate limiter will + // never permit us to read that value. + readBurstSize := dataTracker.LargestReadableValueSize() + + return &BenchmarkEngine{ + ctx: ctx, + cancel: cancel, + logger: cfg.LittConfig.Logger, + config: cfg, + db: db, + table: table, + dataTracker: dataTracker, + writeBytesPerSecondPerThread: writeBytesPerSecondPerThread, + readBytesPerSecondPerThread: readBytesPerSecondPerThread, + writeBurstSize: writeBurstSize, + readBurstSize: readBurstSize, + metrics: newMetrics(ctx, cfg.LittConfig.Logger, cfg), + errorMonitor: errorMonitor, + }, nil +} + +// Logger returns the logger used by the benchmark engine. +func (b *BenchmarkEngine) Logger() logging.Logger { + return b.logger +} + +// Run executes the benchmark. This method blocks forever, or until the benchmark is stopped via control-C or +// encounters an error. 
+func (b *BenchmarkEngine) Run() error { + + if b.config.TimeLimitSeconds > 0 { + // If a time limit is set, create a timer to cancel the context after the specified duration + timeLimit := time.Duration(b.config.TimeLimitSeconds * float64(time.Second)) + timer := time.NewTimer(timeLimit) + + b.logger.Infof("Benchmark will auto-terminate after %s", timeLimit) + + go func() { + select { + case <-timer.C: + b.logger.Infof("Time limit reached, stopping benchmark.") + b.cancel() + case <-b.ctx.Done(): + timer.Stop() + } + }() + } + + // multiply by 2 to make configured value the average + sleepFactor := b.config.StartupSleepFactorSeconds * float64(time.Second) * 2.0 + + for i := 0; i < b.config.WriterParallelism; i++ { + // Sleep a short time to prevent all goroutines from starting in lockstep. + time.Sleep(time.Duration(sleepFactor * rand.Float64())) + + go b.writer() + } + + for i := 0; i < b.config.ReaderParallelism; i++ { + // Sleep a short time to prevent all goroutines from starting in lockstep. + time.Sleep(time.Duration(sleepFactor * rand.Float64())) + + go b.reader() + } + + // Create a channel to listen for OS signals + sigChan := make(chan os.Signal, 1) + signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM) + + // Wait for signal + select { + case <-b.ctx.Done(): + b.logger.Infof("Received shutdown signal, stopping benchmark.") + return nil + case <-sigChan: + // Cancel the context when signal is received + b.cancel() + } + + return nil +} + +// writer runs on a goroutine and writes data to the database. 
+func (b *BenchmarkEngine) writer() { + maxBatchSize := uint64(b.config.BatchSizeMB * float64(units.MiB)) + throttle := rate.NewLimiter(rate.Limit(b.writeBytesPerSecondPerThread), int(b.writeBurstSize)) + + for { + select { + case <-b.errorMonitor.ImmediateShutdownRequired(): + return + default: + batchSize := uint64(0) + + writtenIndices := make([]uint64, 0) + + for batchSize < maxBatchSize { + writeInfo := b.dataTracker.GetWriteInfo() + batchSize += uint64(len(writeInfo.Value)) + + reservation := throttle.ReserveN(time.Now(), len(writeInfo.Value)) + if !reservation.OK() { + b.errorMonitor.Panic(fmt.Errorf("failed to reserve write quota for key %s", writeInfo.Key)) + return + } + if reservation.Delay() > 0 { + time.Sleep(reservation.Delay()) + } + + start := time.Now() + + err := b.table.Put(writeInfo.Key, writeInfo.Value) + if err != nil { + b.errorMonitor.Panic(fmt.Errorf("failed to write data: %v", err)) + return + } + + b.metrics.reportWrite(time.Since(start), uint64(len(writeInfo.Value))) + writtenIndices = append(writtenIndices, writeInfo.KeyIndex) + } + + start := time.Now() + + err := b.table.Flush() + if err != nil { + b.errorMonitor.Panic(fmt.Errorf("failed to flush data: %v", err)) + return + } + + b.metrics.reportFlush(time.Since(start)) + + for _, index := range writtenIndices { + b.dataTracker.ReportWrite(index) + } + } + } +} + +// verifyValue checks if the actual value read from the database matches the expected value. +func (b *BenchmarkEngine) verifyValue(expected *ReadInfo, actual []byte) error { + if len(actual) != len(expected.Value) { + return fmt.Errorf("read value size %d does not match expected size %d for key %s", + len(actual), len(expected.Value), expected.Key) + } + for i := range actual { + if actual[i] != expected.Value[i] { + return fmt.Errorf("read value does not match expected value for key %s", expected.Key) + } + } + return nil +} + +// reader runs on a goroutine and reads data from the database. 
+func (b *BenchmarkEngine) reader() {
+	throttle := rate.NewLimiter(rate.Limit(b.readBytesPerSecondPerThread), int(b.readBurstSize))
+
+	for {
+		select {
+		case <-b.errorMonitor.ImmediateShutdownRequired():
+			return
+		default:
+			readInfo := b.dataTracker.GetReadInfo()
+			if readInfo == nil {
+				// This can happen when the context gets cancelled.
+				return
+			}
+
+			reservation := throttle.ReserveN(time.Now(), len(readInfo.Value))
+			if !reservation.OK() {
+				b.errorMonitor.Panic(fmt.Errorf("failed to reserve read quota for key %s", readInfo.Key))
+				return
+			}
+			if reservation.Delay() > 0 {
+				time.Sleep(reservation.Delay())
+			}
+
+			start := time.Now()
+
+			value, exists, err := b.table.Get(readInfo.Key)
+			if err != nil {
+				b.errorMonitor.Panic(fmt.Errorf("failed to read data: %v", err))
+				return
+			}
+
+			// Report the number of bytes actually returned; this is zero for a missing key,
+			// so missing keys do not inflate the read-throughput numbers.
+			b.metrics.reportRead(time.Since(start), uint64(len(value)))
+
+			if !exists {
+				if b.config.PanicOnReadFailure {
+					b.errorMonitor.Panic(fmt.Errorf("key %s not found in database", readInfo.Key))
+					return
+				} else {
+					b.logger.Errorf("key %s not found in database", readInfo.Key)
+					continue
+				}
+			}
+			err = b.verifyValue(readInfo, value)
+			if err != nil {
+				b.errorMonitor.Panic(err)
+				return
+			}
+		}
+	}
+}
diff --git a/sei-db/db_engine/litt/benchmark/benchmark_metrics.go b/sei-db/db_engine/litt/benchmark/benchmark_metrics.go
new file mode 100644
index 0000000000..27ddcb8211
--- /dev/null
+++ b/sei-db/db_engine/litt/benchmark/benchmark_metrics.go
@@ -0,0 +1,224 @@
+//go:build littdb_wip
+
+package benchmark
+
+import (
+	"context"
+	"fmt"
+	"sync/atomic"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/Layr-Labs/eigenda/litt/benchmark/config"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+)
+
+// metrics holds various performance counters for the benchmark. If configured, it periodically
+// writes a summary to the log. The intention is to expose data about the benchmark's performance even if
+// Prometheus is not available or configured.
+type metrics struct { + ctx context.Context + logger logging.Logger + + // The configuration for the benchmark. + config *config.BenchmarkConfig + + // The time when the benchmark started. + startTime time.Time + + // The number of bytes written since the benchmark started. + bytesWritten atomic.Uint64 + + // The number of bytes read since the benchmark started. + bytesRead atomic.Uint64 + + // The number of write operations performed since the benchmark started. + writeCount atomic.Uint64 + + // The number of read operations performed since the benchmark started. + readCount atomic.Uint64 + + // The number of flush operations performed since the benchmark started. + flushCount atomic.Uint64 + + // The amount of time spent writing data. + nanosecondsSpentWriting atomic.Uint64 + + // The amount of time spent reading data. + nanosecondsSpentReading atomic.Uint64 + + // The amount of time spent flushing data. + nanosecondsSpentFlushing atomic.Uint64 + + // Longest write duration observed. + longestWriteDuration atomic.Uint64 + + // Longest read duration observed. + longestReadDuration atomic.Uint64 + + // Longest flush duration observed. + longestFlushDuration atomic.Uint64 +} + +// newMetrics initializes a new metrics object. +func newMetrics( + ctx context.Context, + logger logging.Logger, + config *config.BenchmarkConfig, +) *metrics { + + m := &metrics{ + ctx: ctx, + logger: logger, + config: config, + startTime: time.Now(), + } + + go m.reportGenerator() + return m +} + +// reportWrite records a write operation. +func (m *metrics) reportWrite(writeDuration time.Duration, bytesWritten uint64) { + m.writeCount.Add(1) + m.bytesWritten.Add(bytesWritten) + m.nanosecondsSpentWriting.Add(uint64(writeDuration.Nanoseconds())) + + // Update the longest write duration if this one is longer. 
+ currentLongest := m.longestWriteDuration.Load() + for writeDuration.Nanoseconds() > int64(currentLongest) { + swapped := m.longestWriteDuration.CompareAndSwap(currentLongest, uint64(writeDuration.Nanoseconds())) + if swapped { + break + } + currentLongest = m.longestWriteDuration.Load() + } +} + +// reportRead records a read operation. +func (m *metrics) reportRead(readDuration time.Duration, bytesRead uint64) { + m.readCount.Add(1) + m.bytesRead.Add(bytesRead) + m.nanosecondsSpentReading.Add(uint64(readDuration.Nanoseconds())) + + // Update the longest read duration if this one is longer. + currentLongest := m.longestReadDuration.Load() + for readDuration.Nanoseconds() > int64(currentLongest) { + swapped := m.longestReadDuration.CompareAndSwap(currentLongest, uint64(readDuration.Nanoseconds())) + if swapped { + break + } + currentLongest = m.longestReadDuration.Load() + } +} + +// reportFlush records a flush operation. +func (m *metrics) reportFlush(flushDuration time.Duration) { + m.flushCount.Add(1) + m.nanosecondsSpentFlushing.Add(uint64(flushDuration.Nanoseconds())) + + // Update the longest flush duration if this one is longer. + currentLongest := m.longestFlushDuration.Load() + for flushDuration.Nanoseconds() > int64(currentLongest) { + swapped := m.longestFlushDuration.CompareAndSwap(currentLongest, uint64(flushDuration.Nanoseconds())) + if swapped { + break + } + currentLongest = m.longestFlushDuration.Load() + } +} + +// reportGenerator runs in a goroutine and periodically logs the metrics to the console. +func (m *metrics) reportGenerator() { + if m.config.MetricsLoggingPeriodSeconds <= 0 { + return // Metrics logging is disabled. + } + + ticker := time.NewTicker(time.Duration(m.config.MetricsLoggingPeriodSeconds * float64(time.Second))) + defer ticker.Stop() + + for { + select { + case <-m.ctx.Done(): + return // Context cancelled, stop reporting. + case <-ticker.C: + m.logMetrics() + } + } +} + +// logMetrics logs the current metrics to the console. 
+func (m *metrics) logMetrics() { + + averageWriteLatency := uint64(0) + writeCount := m.writeCount.Load() + if writeCount > 0 { + averageWriteLatency = + uint64((time.Duration(m.nanosecondsSpentWriting.Load()) / time.Duration(writeCount)).Nanoseconds()) + } + + averageReadLatency := uint64(0) + readCount := m.readCount.Load() + if readCount > 0 { + averageReadLatency = + uint64((time.Duration(m.nanosecondsSpentReading.Load()) / time.Duration(readCount)).Nanoseconds()) + } + + averageFlushLatency := uint64(0) + flushCount := m.flushCount.Load() + if flushCount > 0 { + averageFlushLatency = + uint64((time.Duration(m.nanosecondsSpentFlushing.Load()) / time.Duration(flushCount)).Nanoseconds()) + } + + elapsedTimeNanoseconds := uint64(time.Since(m.startTime).Nanoseconds()) + elapsedTimeSeconds := float64(elapsedTimeNanoseconds) / float64(time.Second) + + bytesWritten := m.bytesWritten.Load() + writeThroughput := uint64(0) + if elapsedTimeSeconds > 0 { + writeThroughput = uint64(float64(bytesWritten) / elapsedTimeSeconds) + } + + readThroughput := uint64(0) + if elapsedTimeSeconds > 0 { + readThroughput = uint64(float64(m.bytesRead.Load()) / elapsedTimeSeconds) + } + + totalTime := "" + if m.config.TimeLimitSeconds > 0 { + totalTime = fmt.Sprintf(" / %s", + common.PrettyPrintTime(uint64(m.config.TimeLimitSeconds*float64(time.Second)))) + } + + m.logger.Infof("Benchmark Metrics (since most recent restart):\n"+ + " Elapsed Time: %s%s\n\n"+ + " Write Throughput: %s/s\n"+ + " Bytes Written: %s\n"+ + " Write Count: %s\n"+ + " Average Write Latency: %s\n"+ + " Longest Write Duration: %s\n\n"+ + " Read Throughput: %s/s\n"+ + " Bytes Read: %s\n"+ + " Read Count: %s\n"+ + " Average Read Latency: %s\n"+ + " Longest Read Duration: %s\n\n"+ + " Flush Count: %s\n"+ + " Average Flush Latency: %s\n"+ + " Longest Flush Duration: %s", + common.PrettyPrintTime(elapsedTimeNanoseconds), + totalTime, + common.PrettyPrintBytes(writeThroughput), + common.PrettyPrintBytes(bytesWritten), + 
common.CommaOMatic(writeCount), + common.PrettyPrintTime(averageWriteLatency), + common.PrettyPrintTime(m.longestWriteDuration.Load()), + common.PrettyPrintBytes(readThroughput), + common.PrettyPrintBytes(m.bytesRead.Load()), + common.CommaOMatic(readCount), + common.PrettyPrintTime(averageReadLatency), + common.PrettyPrintTime(m.longestReadDuration.Load()), + common.CommaOMatic(flushCount), + common.PrettyPrintTime(averageFlushLatency), + common.PrettyPrintTime(m.longestFlushDuration.Load())) +} diff --git a/sei-db/db_engine/litt/benchmark/cmd/main.go b/sei-db/db_engine/litt/benchmark/cmd/main.go new file mode 100644 index 0000000000..2911497fbd --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/cmd/main.go @@ -0,0 +1,40 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "log" + "os" + + "github.com/Layr-Labs/eigenda/litt/benchmark" +) + +func main() { + // Check for required argument + if len(os.Args) != 2 { + _, _ = fmt.Fprintf(os.Stderr, "Usage: run.sh \n") + _, _ = fmt.Fprintf(os.Stderr, "\nExample:\n") + _, _ = fmt.Fprintf(os.Stderr, " run.sh config/basic-config.json\n") + os.Exit(1) + } + + configPath := os.Args[1] + + // Create the benchmark engine + engine, err := benchmark.NewBenchmarkEngine(configPath) + if err != nil { + log.Fatalf("Failed to create benchmark engine: %v", err) + } + + // Run the benchmark + engine.Logger().Infof("Configuration loaded from %s", configPath) + engine.Logger().Info("Press Ctrl+C to stop the benchmark") + + err = engine.Run() + if err != nil { + engine.Logger().Fatalf("Benchmark failed: %v", err) + } else { + engine.Logger().Info("Benchmark Terminated") + } +} diff --git a/sei-db/db_engine/litt/benchmark/cohort.go b/sei-db/db_engine/litt/benchmark/cohort.go new file mode 100644 index 0000000000..c814bf9d25 --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/cohort.go @@ -0,0 +1,378 @@ +//go:build littdb_wip + +package benchmark + +import ( + "encoding/binary" + "fmt" + "math/rand" + "os" + "path" + 
"path/filepath" + "strconv" + "strings" + "time" + + "github.com/Layr-Labs/eigenda/litt/util" +) + +// CohortFileExtension is the file extension used for cohort files. +const CohortFileExtension = ".cohort" + +// CohortSwapFileExtension is the file extension used for cohort swap files. Used to atomically update cohort files. +const CohortSwapFileExtension = CohortFileExtension + util.SwapFileExtension + +/* The lifecycle of a cohort: + + +-----+ +-----------+ +----------+ +---------+ + | new | --> | exhausted | --> | complete | --> | expired | + +-----+ +-----------+ +----------+ +---------+ + | | + v | + +-----------+ | + | abandoned | <---| + +-----------+ + +- new: the cohort was just created and is currently being used to supply keys for writing. +- exhausted: all keys in the cohort have been scheduled for writing, but the DB may not have ingested them all yet. +- complete: all keys in the cohort have been written to the DB and are safe to read. +- abandoned: before becoming complete, the benchmark was restarted. It will never be thread safe to read or write + any keys in this cohort. +- expired: the cohort has been marked as complete, but it can no longer be read because the TTL has expired + (or is about to expire). +*/ + +// A Cohort is a grouping of key-value pairs used for benchmarking. +// +// If a benchmark wants to read values, it must somehow figure out which keys have been written to the database. +// If it wants to verify the validity of the data it reads, it must also be able to determine the correct value +// that should be associated with any particular key, and it must also be able to determine when keys are +// expected to be removed from the database due to TTL expiration. +// +// Tracking the sort of metadata required to do reads in a benchmark is not a trivial thing, especially when +// the scale of the benchmark is large (i.e. tens or hundreds of millions of keys over weeks or months of time). 
+// Storing this information in memory is simply not feasible, and storing it on disk naively requires a database of
+// a scale similar to the one LittDB itself is handling, unless we are clever about it. A "cohort" is that clever
+// mechanism. Each cohort tracks a large collection of key-value pairs in the database, and it does so in a way that
+// uses very little disk space.
+//
+// Key-value pairs each have unique indices, and knowing the index of a key-value pair allows the data to be
+// regenerated deterministically. All key-value pairs in a cohort have sequential indices. A single cohort can
+// track multiple gigabytes worth of key-value pairs, but on disk it only requires a few dozen bytes of data.
+type Cohort struct {
+	// The directory where the cohort file is stored.
+	parentDirectory string
+
+	// The unique ID of this cohort.
+	cohortIndex uint64
+
+	// The index of the first key-value pair in the cohort.
+	lowKeyIndex uint64
+
+	// The index of the last key-value pair in the cohort.
+	highKeyIndex uint64
+
+	// The size of the values written in this cohort.
+	valueSize uint64
+
+	// The next available index to be written. Only relevant for a new cohort that is currently being written to
+	// the DB. This value is undefined for cohorts that have been completely written or loaded from disk. This value
+	// is NOT serialized to disk.
+	nextKeyIndex uint64
+
+	// True iff all key-value pairs in the cohort have been written to the database.
+	allValuesWritten bool
+
+	// A timestamp that is guaranteed to come before the first value in the cohort is written to the database.
+	firstValueTimestamp time.Time
+
+	// True iff the cohort has been loaded from disk. This value is NOT serialized to disk.
+	loadedFromDisk bool
+
+	// Whether fsync mode is enabled. Disable for faster unit tests.
+	fsync bool
+}
+
+// NewCohort creates a new cohort with the given index range.
+func NewCohort(
+	parentDirectory string,
+	cohortIndex uint64,
+	lowIndex uint64,
+	highIndex uint64,
+	valueSize uint64,
+	fsync bool) (*Cohort, error) {
+
+	cohort := &Cohort{
+		parentDirectory:     parentDirectory,
+		cohortIndex:         cohortIndex,
+		lowKeyIndex:         lowIndex,
+		highKeyIndex:        highIndex,
+		valueSize:           valueSize,
+		nextKeyIndex:        lowIndex,
+		allValuesWritten:    false,
+		firstValueTimestamp: time.Now(),
+		fsync:               fsync,
+	}
+
+	err := cohort.Write()
+	if err != nil {
+		return nil, fmt.Errorf("failed to write cohort file: %w", err)
+	}
+
+	return cohort, nil
+}
+
+// LoadCohort loads a cohort from the given path.
+func LoadCohort(path string) (*Cohort, error) {
+
+	parentDirectory := filepath.Dir(path)
+	// Cohort file names are in the format "X.cohort", where X is the cohort index.
+	// Replacing ".cohort" with an empty string gives us the cohort index in string form.
+	indexString := strings.Replace(filepath.Base(path), CohortFileExtension, "", 1)
+	cohortIndex, err := strconv.ParseUint(indexString, 10, 64)
+	if err != nil {
+		return nil, fmt.Errorf("failed to parse cohort file %s: %w", path, err)
+	}
+
+	cohort := &Cohort{
+		parentDirectory: parentDirectory,
+		cohortIndex:     cohortIndex,
+		loadedFromDisk:  true,
+	}
+
+	filePath := cohort.Path()
+	if err = util.ErrIfNotExists(filePath); err != nil {
+		return nil, fmt.Errorf("cohort file does not exist: %s", filePath)
+	}
+
+	// os.ReadFile opens, reads, and closes the file itself; no separate open/close is needed.
+	data, err := os.ReadFile(filePath)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read cohort file: %w", err)
+	}
+
+	err = cohort.deserialize(data)
+	if err != nil {
+		return nil, fmt.Errorf("failed to deserialize cohort file: %w", err)
+	}
+
+	return cohort, nil
+}
+
+// NextCohort creates the next cohort in the sequence with the given number of keys.
+func (c *Cohort) NextCohort(keyCount uint64, valueSize uint64) (*Cohort, error) { + nextIndex := c.cohortIndex + 1 + nextLowKeyIndex := c.highKeyIndex + 1 + nextHighKeyIndex := nextLowKeyIndex + keyCount - 1 + + nextCohort, err := NewCohort( + c.parentDirectory, + nextIndex, + nextLowKeyIndex, + nextHighKeyIndex, + valueSize, + c.fsync) + if err != nil { + return nil, fmt.Errorf("failed to create next cohort: %w", err) + } + return nextCohort, nil +} + +// CohortIndex returns the index of the cohort. +func (c *Cohort) CohortIndex() uint64 { + return c.cohortIndex +} + +// LowKeyIndex returns the index of the first key in the cohort. +func (c *Cohort) LowKeyIndex() uint64 { + return c.lowKeyIndex +} + +// HighKeyIndex returns the index of the last key in the cohort. +func (c *Cohort) HighKeyIndex() uint64 { + return c.highKeyIndex +} + +func (c *Cohort) ValueSize() uint64 { + return c.valueSize +} + +// FirstValueTimestamp returns the timestamp of the first value in the cohort. +func (c *Cohort) FirstValueTimestamp() time.Time { + return c.firstValueTimestamp +} + +// IsComplete returns true if all key-value pairs in the cohort have been written to the database. Only complete +// cohorts are safe to read from. +func (c *Cohort) IsComplete() bool { + return c.allValuesWritten +} + +// IsExhausted returns true if the cohort has been exhausted, i.e. it has produced all keys for writing that it is +// capable of producing. Once exhausted, a cohort should be marked as completed once all key-value pairs have been +// written to the database, thus making all keys in the cohort safe to read. +func (c *Cohort) IsExhausted() bool { + return c.nextKeyIndex > c.highKeyIndex +} + +// IsLoadedFromDisk returns true if the cohort has been loaded from disk. +func (c *Cohort) IsLoadedFromDisk() bool { + return c.loadedFromDisk +} + +// GetKeyIndexForWriting gets the next key to be written to the database. 
+func (c *Cohort) GetKeyIndexForWriting() (uint64, error) { + if c.loadedFromDisk { + return 0, fmt.Errorf("cannot allocate key for writing: cohort has been loaded from disk") + } + if c.allValuesWritten { + return 0, fmt.Errorf("cannot allocate key for writing: cohort is already complete") + } + if c.IsExhausted() { + return 0, fmt.Errorf("cannot allocate key for writing: cohort is exhausted") + } + + key := c.nextKeyIndex + c.nextKeyIndex++ + + return key, nil +} + +// GetKeyIndexForReading gets a random key from the cohort that is safe to read. This function should only be called +// after the cohort has been marked as complete. +func (c *Cohort) GetKeyIndexForReading(rand *rand.Rand) (uint64, error) { + if !c.allValuesWritten { + return 0, fmt.Errorf("cannot allocate key for reading: cohort is not complete") + } + + choice := (rand.Uint64() % (c.highKeyIndex - c.lowKeyIndex + 1)) + c.lowKeyIndex + + // sanity check + if choice < c.lowKeyIndex || choice > c.highKeyIndex { + return 0, fmt.Errorf("invalid choice: %d not in range [%d, %d]", choice, c.lowKeyIndex, c.highKeyIndex) + } + + return choice, nil +} + +// MarkComplete marks that all key-value pairs in the cohort have been written to the database. Once done, +// all key-value pairs in the cohort become safe to read, so long as the cohort has not yet expired. A cohort +// is said to have expired when it is possible that at least one key in the cohort may be deleted from the DB +// due to the TTL. 
+func (c *Cohort) MarkComplete() error { + if c.allValuesWritten { + return fmt.Errorf("cannot mark cohort complete: cohort is already complete") + } + if c.loadedFromDisk { + return fmt.Errorf("cannot mark cohort complete: cohort has been loaded from disk") + } + if c.nextKeyIndex <= c.highKeyIndex { + return fmt.Errorf("cannot mark cohort complete: cohort is not exhausted") + } + + c.allValuesWritten = true + err := c.Write() + if err != nil { + return fmt.Errorf("failed to mark cohort complete: %w", err) + } + return nil +} + +// Path returns the file path of the cohort file. +func (c *Cohort) Path() string { + return path.Join(c.parentDirectory, fmt.Sprintf("%d%s", c.cohortIndex, CohortFileExtension)) +} + +// Write the data in this cohort to its file on disk. When this method returns, the cohort file is guaranteed to be +// crash durable. +func (c *Cohort) Write() error { + err := util.AtomicWrite(c.Path(), c.serialize(), c.fsync) + if err != nil { + return fmt.Errorf("failed to write cohort file: %w", err) + } + + return nil +} + +// serialize serializes the cohort to a byte array. 
+func (c *Cohort) serialize() []byte {
+	// Data size:
+	// - cohortIndex (8 bytes)
+	// - lowKeyIndex (8 bytes)
+	// - highKeyIndex (8 bytes)
+	// - valueSize (8 bytes)
+	// - firstValueTimestamp (8 bytes)
+	// - allValuesWritten (1 byte)
+	// Total: 41 bytes
+
+	data := make([]byte, 41)
+	binary.BigEndian.PutUint64(data[0:8], c.cohortIndex)
+	binary.BigEndian.PutUint64(data[8:16], c.lowKeyIndex)
+	binary.BigEndian.PutUint64(data[16:24], c.highKeyIndex)
+	binary.BigEndian.PutUint64(data[24:32], c.valueSize)
+	binary.BigEndian.PutUint64(data[32:40], uint64(c.firstValueTimestamp.Unix()))
+	if c.allValuesWritten {
+		data[40] = 1
+	} else {
+		data[40] = 0
+	}
+
+	return data
+}
+
+func (c *Cohort) deserialize(data []byte) error {
+	if len(data) != 41 {
+		return fmt.Errorf("invalid data length: %d", len(data))
+	}
+
+	cohortIndex := binary.BigEndian.Uint64(data[0:8])
+	if cohortIndex != c.cohortIndex {
+		return fmt.Errorf("cohort index mismatch: %d != %d", cohortIndex, c.cohortIndex)
+	}
+
+	c.lowKeyIndex = binary.BigEndian.Uint64(data[8:16])
+	c.highKeyIndex = binary.BigEndian.Uint64(data[16:24])
+	c.valueSize = binary.BigEndian.Uint64(data[24:32])
+	// A single-key cohort has lowKeyIndex == highKeyIndex, so only a strictly inverted range is invalid.
+	if c.lowKeyIndex > c.highKeyIndex {
+		return fmt.Errorf("invalid index range: %d > %d", c.lowKeyIndex, c.highKeyIndex)
+	}
+
+	c.firstValueTimestamp = time.Unix(int64(binary.BigEndian.Uint64(data[32:40])), 0)
+	c.allValuesWritten = data[40] == 1
+
+	return nil
+}
+
+// IsExpired returns true if the cohort has expired (i.e. it is no longer safe to read).
+func (c *Cohort) IsExpired(now time.Time, maxAge time.Duration) bool {
+	if !c.IsComplete() {
+		if c.loadedFromDisk {
+			// Incomplete cohorts loaded from disk are instantly expired.
+			return true
+		} else {
+			// A cohort currently in the process of being written can't expire.
+			return false
+		}
+	}
+
+	age := now.Sub(c.firstValueTimestamp)
+
+	return age > maxAge
+}
+
+// Delete the associated cohort file.
+func (c *Cohort) Delete() error {
+	err := os.Remove(c.Path())
+	if err != nil {
+		return fmt.Errorf("failed to delete cohort file: %w", err)
+	}
+	return nil
+}
diff --git a/sei-db/db_engine/litt/benchmark/cohort_test.go b/sei-db/db_engine/litt/benchmark/cohort_test.go
new file mode 100644
index 0000000000..b135c71c3e
--- /dev/null
+++ b/sei-db/db_engine/litt/benchmark/cohort_test.go
@@ -0,0 +1,313 @@
+//go:build littdb_wip
+
+package benchmark
+
+import (
+	"testing"
+
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigenda/test/random"
+	"github.com/stretchr/testify/require"
+)
+
+func TestCohortSerialization(t *testing.T) {
+	rand := random.NewTestRandom()
+	testDirectory := t.TempDir()
+
+	cohortIndex := rand.Uint64()
+	lowIndex := rand.Uint64Range(1, 1000)
+	highIndex := rand.Uint64Range(1000, 2000)
+	valueSize := rand.Uint64()
+	cohort, err := NewCohort(
+		testDirectory,
+		cohortIndex,
+		lowIndex,
+		highIndex,
+		valueSize,
+		false)
+	require.NoError(t, err)
+
+	require.Equal(t, cohortIndex, cohort.CohortIndex())
+	require.Equal(t, lowIndex, cohort.LowKeyIndex())
+	require.Equal(t, highIndex, cohort.HighKeyIndex())
+	require.Equal(t, valueSize, cohort.ValueSize())
+	require.Equal(t, false, cohort.IsComplete())
+
+	// Check if the cohort file exists
+	filePath := cohort.Path()
+	exists, err := util.Exists(filePath)
+	require.NoError(t, err)
+	require.True(t, exists)
+
+	// Initialize a copy cohort from the file
+	loadedCohort, err := LoadCohort(cohort.Path())
+	require.NoError(t, err)
+	require.Equal(t, cohortIndex, loadedCohort.CohortIndex())
+	require.Equal(t, lowIndex, loadedCohort.LowKeyIndex())
+	require.Equal(t, highIndex, loadedCohort.HighKeyIndex())
+	require.Equal(t, valueSize, loadedCohort.ValueSize())
+	require.Equal(t, false, loadedCohort.IsComplete())
+
+	// Mark the cohort as written
+	loadedCohort.allValuesWritten = true
+	require.True(t, loadedCohort.IsComplete())
+	err = loadedCohort.Write()
+	require.NoError(t, err)
+
+	// Load the cohort again.
+	loadedCohort, err = LoadCohort(cohort.Path())
+	require.NoError(t, err)
+	require.Equal(t, cohortIndex, loadedCohort.CohortIndex())
+	require.Equal(t, lowIndex, loadedCohort.LowKeyIndex())
+	require.Equal(t, highIndex, loadedCohort.HighKeyIndex())
+	require.Equal(t, valueSize, loadedCohort.ValueSize())
+	require.Equal(t, true, loadedCohort.IsComplete())
+
+	err = loadedCohort.Delete()
+	require.NoError(t, err)
+
+	// The file should no longer exist.
+	exists, err = util.Exists(filePath)
+	require.NoError(t, err)
+	require.False(t, exists)
+}
+
+func TestStandardCohortLifecycle(t *testing.T) {
+	rand := random.NewTestRandom()
+	testDirectory := t.TempDir()
+
+	cohortIndex := rand.Uint64()
+	lowIndex := rand.Uint64Range(1, 1000)
+	highIndex := rand.Uint64Range(1000, 2000)
+	valueSize := rand.Uint64()
+	cohort, err := NewCohort(
+		testDirectory,
+		cohortIndex,
+		lowIndex,
+		highIndex,
+		valueSize,
+		false)
+	require.NoError(t, err)
+
+	require.Equal(t, cohortIndex, cohort.CohortIndex())
+	require.Equal(t, lowIndex, cohort.LowKeyIndex())
+	require.Equal(t, highIndex, cohort.HighKeyIndex())
+	require.Equal(t, valueSize, cohort.ValueSize())
+	require.Equal(t, false, cohort.IsComplete())
+
+	// Extract all keys from the cohort.
+	for i := lowIndex; i <= highIndex; i++ {
+		key, err := cohort.GetKeyIndexForWriting()
+		require.NoError(t, err)
+		require.Equal(t, i, key)
+
+		shouldBeExhausted := i == highIndex
+		require.Equal(t, shouldBeExhausted, cohort.IsExhausted())
+
+		if i < highIndex {
+			// Attempting to mark as complete now should fail.
+			err = cohort.MarkComplete()
+			require.Error(t, err)
+		}
+		require.Equal(t, false, cohort.IsComplete())
+
+		// Attempting to get a key for reading should fail.
+		_, err = cohort.GetKeyIndexForReading(rand.Rand)
+		require.Error(t, err)
+	}
+
+	// Attempting to allocate another key for writing should fail.
+ _, err = cohort.GetKeyIndexForWriting() + require.Error(t, err) + + // We can now mark the cohort as complete. + err = cohort.MarkComplete() + require.NoError(t, err) + require.Equal(t, true, cohort.IsComplete()) + + // We can now get keys for reading. + for i := 0; i < 100; i++ { + key, err := cohort.GetKeyIndexForReading(rand.Rand) + require.NoError(t, err) + require.GreaterOrEqual(t, key, lowIndex) + require.LessOrEqual(t, key, highIndex) + } + + // Marking complete again should fail. + err = cohort.MarkComplete() + require.Error(t, err) +} + +func TestIncompleteCohortAllKeysExtractedLifecycle(t *testing.T) { + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + cohortIndex := rand.Uint64() + lowIndex := rand.Uint64Range(1, 1000) + highIndex := rand.Uint64Range(1000, 2000) + valueSize := rand.Uint64() + cohort, err := NewCohort( + testDirectory, + cohortIndex, + lowIndex, + highIndex, + valueSize, + false) + require.NoError(t, err) + + require.Equal(t, cohortIndex, cohort.CohortIndex()) + require.Equal(t, lowIndex, cohort.LowKeyIndex()) + require.Equal(t, highIndex, cohort.HighKeyIndex()) + require.Equal(t, valueSize, cohort.ValueSize()) + require.Equal(t, cohort.IsComplete(), false) + + // Extract all keys from the cohort. + for i := lowIndex; i <= highIndex; i++ { + key, err := cohort.GetKeyIndexForWriting() + require.NoError(t, err) + require.Equal(t, i, key) + + shouldBeExhausted := i == highIndex + require.Equal(t, shouldBeExhausted, cohort.IsExhausted()) + + if i < highIndex { + // Attempting to mark as complete now should fail. + err = cohort.MarkComplete() + require.Error(t, err) + } + require.Equal(t, false, cohort.IsComplete()) + + // Attempting to get a key for reading should fail. + _, err = cohort.GetKeyIndexForReading(rand.Rand) + require.Error(t, err) + } + + // Simulate a benchmark restart by reloading the cohort from disk. 
+ loadedCohort, err := LoadCohort(cohort.Path()) + require.NoError(t, err) + + require.Equal(t, loadedCohort.CohortIndex(), cohortIndex) + require.False(t, loadedCohort.IsComplete()) + + // Attempting to allocate another key for writing should fail. + _, err = loadedCohort.GetKeyIndexForWriting() + require.Error(t, err) + + // Attempting to get a key for reading should fail. + _, err = loadedCohort.GetKeyIndexForReading(rand.Rand) + require.Error(t, err) + + // We shouldn't be able to mark the cohort as complete. + err = loadedCohort.MarkComplete() + require.Error(t, err) +} + +func TestIncompleteCohortSomeKeysExtractedLifecycle(t *testing.T) { + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + cohortIndex := rand.Uint64() + lowIndex := rand.Uint64Range(1, 1000) + highIndex := rand.Uint64Range(1000, 2000) + valueSize := rand.Uint64() + cohort, err := NewCohort( + testDirectory, + cohortIndex, + lowIndex, + highIndex, + valueSize, + false) + require.NoError(t, err) + + require.Equal(t, cohortIndex, cohort.CohortIndex()) + require.Equal(t, lowIndex, cohort.LowKeyIndex()) + require.Equal(t, highIndex, cohort.HighKeyIndex()) + require.Equal(t, valueSize, cohort.ValueSize()) + require.Equal(t, false, cohort.IsComplete()) + + // Extract all keys from the cohort. + for i := lowIndex; i <= (lowIndex+highIndex)/2; i++ { + key, err := cohort.GetKeyIndexForWriting() + require.NoError(t, err) + require.Equal(t, i, key) + + require.Equal(t, false, cohort.IsExhausted()) + + // Attempting to mark as complete now should fail. + err = cohort.MarkComplete() + require.Error(t, err) + require.Equal(t, false, cohort.IsComplete()) + + // Attempting to get a key for reading should fail. + _, err = cohort.GetKeyIndexForReading(rand.Rand) + require.Error(t, err) + } + + // Simulate a benchmark restart by reloading the cohort from disk. 
+ loadedCohort, err := LoadCohort(cohort.Path()) + require.NoError(t, err) + + require.Equal(t, loadedCohort.CohortIndex(), cohortIndex) + require.False(t, loadedCohort.IsComplete()) + + // Attempting to allocate another key for writing should fail. + _, err = loadedCohort.GetKeyIndexForWriting() + require.Error(t, err) + + // Attempting to get a key for reading should fail. + _, err = loadedCohort.GetKeyIndexForReading(rand.Rand) + require.Error(t, err) + + // We shouldn't be able to mark the cohort as complete. + err = loadedCohort.MarkComplete() + require.Error(t, err) +} + +func TestNextCohort(t *testing.T) { + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + cohortIndex := rand.Uint64() + lowIndex := rand.Uint64Range(1, 1000) + highIndex := rand.Uint64Range(1000, 2000) + valueSize := rand.Uint64() + cohort, err := NewCohort( + testDirectory, + cohortIndex, + lowIndex, + highIndex, + valueSize, + false) + require.NoError(t, err) + + require.Equal(t, cohortIndex, cohort.CohortIndex()) + require.Equal(t, lowIndex, cohort.LowKeyIndex()) + require.Equal(t, highIndex, cohort.HighKeyIndex()) + require.Equal(t, valueSize, cohort.ValueSize()) + require.Equal(t, false, cohort.IsComplete()) + + // Check if the cohort file exists + filePath := cohort.Path() + exists, err := util.Exists(filePath) + require.NoError(t, err) + require.True(t, exists) + + newKeyCount := rand.Uint64Range(1, 1000) + newValueSize := rand.Uint64Range(1, 1000) + nextCohort, err := cohort.NextCohort(newKeyCount, newValueSize) + + require.NoError(t, err) + + require.Equal(t, cohortIndex+1, nextCohort.CohortIndex()) + require.Equal(t, highIndex+1, nextCohort.LowKeyIndex()) + require.Equal(t, highIndex+newKeyCount, nextCohort.HighKeyIndex()) + require.Equal(t, newValueSize, nextCohort.ValueSize()) + require.Equal(t, false, nextCohort.IsComplete()) + + // Check if the next cohort file exists + nextFilePath := nextCohort.Path() + exists, err = util.Exists(nextFilePath) + 
require.NoError(t, err) + require.True(t, exists) +} diff --git a/sei-db/db_engine/litt/benchmark/config/basic-config.json b/sei-db/db_engine/litt/benchmark/config/basic-config.json new file mode 100644 index 0000000000..875ab584a6 --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/config/basic-config.json @@ -0,0 +1,8 @@ +{ + "LittConfig": { + "Paths": ["~/benchmark/volume1", "~/benchmark/volume2", "~/benchmark/volume3"], + "SnapshotDirectory": "~/snapshot" + }, + "MaximumWriteThroughputMB": 1024, + "MetricsLoggingPeriodSeconds": 1 +} \ No newline at end of file diff --git a/sei-db/db_engine/litt/benchmark/config/benchmark-grafana-dashboard.json b/sei-db/db_engine/litt/benchmark/config/benchmark-grafana-dashboard.json new file mode 100644 index 0000000000..b0b14ef474 --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/config/benchmark-grafana-dashboard.json @@ -0,0 +1,1982 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 1, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, 
+ "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 0 + }, + "id": 1, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "litt_table_size_bytes", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{table}}", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "title": "Disk Footprint", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "locale" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 0 + }, + "id": 2, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": 
"bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "litt_table_key_count", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{table}}", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "title": "Key Count", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 8 + }, + "id": 3, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "rate(litt_bytes_written[$__rate_interval])", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{table}}", + "range": true, + "refId": "A", + 
"useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "expr": "", + "hide": false, + "instant": false, + "range": true, + "refId": "B" + } + ], + "title": "Bytes Written / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 8 + }, + "id": 4, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "rate(litt_keys_written[$__rate_interval])", + "fullMetaSearch": false, + "includeNullMetadata": false, + "legendFormat": "__auto", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "title": "Keys Written / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": 
"palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 16 + }, + "id": 5, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "rate(litt_flush_count[$__rate_interval])", + "fullMetaSearch": false, + "includeNullMetadata": false, + "legendFormat": "__auto", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "title": "Flushes / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + 
"lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 16 + }, + "id": 6, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "litt_write_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "avg(litt_write_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Write Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 
1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 24 + }, + "id": 7, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_flush_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "avg(litt_flush_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Flush Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + 
"type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 24 + }, + "id": 8, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_segment_flush_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "avg(litt_segment_flush_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Segment Flush Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + 
"showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 32 + }, + "id": 9, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_keymap_flush_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "avg(litt_keymap_flush_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Keymap Flush Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + 
"spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 32 + }, + "id": 10, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_garbage_collection_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "avg(litt_garbage_collection_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "GC Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + 
"stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 40 + }, + "id": 11, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "rate(litt_bytes_read[$__rate_interval])", + "legendFormat": "{{table}}", + "range": true, + "refId": "A" + } + ], + "title": "Bytes Read / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "locale" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 40 + }, + "id": 12, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, 
+ "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "rate(litt_keys_read[$__rate_interval])", + "legendFormat": "{{table}}", + "range": true, + "refId": "A" + } + ], + "title": "Keys Read / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 48 + }, + "id": 13, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_read_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": 
"code", + "expr": "avg(litt_read_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Read Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "locale" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 48 + }, + "id": 14, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "rate(litt_cache_hits[$__rate_interval])", + "legendFormat": "{{table}}", + "range": true, + "refId": "A" + } + ], + "title": "Cache Hits / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": 
false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "locale" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 56 + }, + "id": 15, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "rate(litt_cache_misses[$__rate_interval])", + "legendFormat": "{{table}}", + "range": true, + "refId": "A" + } + ], + "title": "Cache Misses / Second", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": 
false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 56 + }, + "id": 16, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "code", + "expr": "litt_cache_miss_latency_ms", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{quantile}}", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "avg(litt_cache_miss_latency_ms)", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "instant": false, + "legendFormat": "average", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Cache Miss Latency", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": 
"A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 64 + }, + "id": 19, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "process_resident_memory_bytes", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Memory", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 64 + }, + "id": 18, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": 
"none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "rate(process_cpu_seconds_total[$__rate_interval])", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "CPU Seconds", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "denye6lsft2bka" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 72 + }, + "id": 20, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.0.1", + "targets": [ + { + "editorMode": "code", + "expr": "process_open_fds", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Open File Descriptors", + "type": "timeseries" + } + ], + "preload": false, + "refresh": "5s", + "schemaVersion": 41, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-15m", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Benchmark Metrics", + "uid": 
"6d768bdc-8863-48d9-a38f-d06cecc4f3e5", + "version": 6 +} \ No newline at end of file diff --git a/sei-db/db_engine/litt/benchmark/config/benchmark_config.go b/sei-db/db_engine/litt/benchmark/config/benchmark_config.go new file mode 100644 index 0000000000..75e387c18d --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/config/benchmark_config.go @@ -0,0 +1,155 @@ +//go:build littdb_wip + +package config + +import ( + "encoding/json" + "fmt" + "os" + "strings" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/docker/go-units" +) + +// BenchmarkConfig is a struct that holds the configuration for the benchmark. +type BenchmarkConfig struct { + + // Configuration for the LittDB instance. + LittConfig *litt.Config + + // The location where the benchmark stores test metadata. + MetadataDirectory string + + // The maximum target write throughput in MB/s. + MaximumWriteThroughputMB float64 + + // The maximum read throughput in MB/s. + MaximumReadThroughputMB float64 + + // The number of parallel write goroutines. + WriterParallelism int + + // The number of parallel read goroutines. + ReaderParallelism int + + // The size of the values in MB. + ValueSizeMB float64 + + // Data is written to the DB in batches and then flushed. This determines the size of those batches, in MB. + BatchSizeMB float64 + + // The frequency at which the benchmark does cohort garbage collection, in seconds + CohortGCPeriodSeconds float64 + + // The size of the write info channel. Controls the max number of keys to prepare for writing ahead of time. + WriteInfoChanelSize uint64 + + // The size of the read info channel. Controls the max number of keys to prepare for reading ahead of time. + ReadInfoChanelSize uint64 + + // The number of keys in a new cohort. + CohortSize uint64 + + // The time-to-live (TTL) for keys in the database, in hours. 
+	TTLHours float64
+
+	// If data is within this many minutes of its expiration time, it will not be read.
+	ReadSafetyMarginMinutes float64
+
+	// A seed for the random number generator used to generate keys and values. When restarting the benchmark,
+	// it's important to always use the same seed.
+	Seed int64
+
+	// The size of the pool of random data. Instead of generating random data for each key/value pair
+	// (which is expensive), data from this pool is reused. When restarting the benchmark,
+	// it's important to always use the same pool size.
+	RandomPoolSize uint64
+
+	// When the benchmark starts, it sleeps for a length of time. The average amount of time spent sleeping is equal
+	// to this value, in seconds. The purpose of this sleeping is to stagger the start of the workers so that they
+	// don't all operate in lockstep.
+	StartupSleepFactorSeconds float64
+
+	// The frequency at which the benchmark logs metrics, in seconds. If zero, then metrics logging is disabled.
+	MetricsLoggingPeriodSeconds float64
+
+	// If true, the benchmark will panic and halt if there is a read failure.
+	// There is currently a rare bug somewhere, I suspect in metadata tracking. The bug can cause
+	// the benchmark to read a key that is no longer present in the database. Until that bug is fixed,
+	// do not halt the benchmark on read failures by default.
+	PanicOnReadFailure bool
+
+	// If true, fsync cohort files to ensure atomicity. Can be set to false for unit tests that need to be fast.
+	Fsync bool
+
+	// If non-zero, then the benchmark will run for this many seconds and then stop. If zero,
+	// the benchmark will run until it is manually stopped.
+	TimeLimitSeconds float64
+}
+
+// DefaultBenchmarkConfig returns a default BenchmarkConfig.
+func DefaultBenchmarkConfig() *BenchmarkConfig {
+
+	littConfig := litt.DefaultConfigNoPaths()
+	littConfig.LoggerConfig = common.DefaultConsoleLoggerConfig()
+	littConfig.MetricsEnabled = true
+
+	return &BenchmarkConfig{
+		LittConfig:                  littConfig,
+		MetadataDirectory:           "~/benchmark",
+		MaximumWriteThroughputMB:    10,
+		MaximumReadThroughputMB:     10,
+		WriterParallelism:           4,
+		ReaderParallelism:           32,
+		ValueSizeMB:                 2.0,
+		BatchSizeMB:                 32,
+		CohortGCPeriodSeconds:       10.0,
+		WriteInfoChanelSize:         1024,
+		ReadInfoChanelSize:          1024,
+		CohortSize:                  1024,
+		TTLHours:                    1.0,
+		ReadSafetyMarginMinutes:     5.0,
+		Seed:                        1337,
+		RandomPoolSize:              units.GiB,
+		StartupSleepFactorSeconds:   0.5,
+		MetricsLoggingPeriodSeconds: 60.0,
+		PanicOnReadFailure:          false,
+		TimeLimitSeconds:            0.0,
+	}
+}
+
+// LoadConfig loads the benchmark configuration from the json file at the given path.
+func LoadConfig(path string) (*BenchmarkConfig, error) {
+	config := DefaultBenchmarkConfig()
+
+	path, err := util.SanitizePath(path)
+	if err != nil {
+		return nil, fmt.Errorf("failed to sanitize path: %w", err)
+	}
+
+	// Read the file
+	data, err := os.ReadFile(path)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read config file: %w", err)
+	}
+
+	// Create a decoder that will return an error if there are unknown fields
+	decoder := json.NewDecoder(strings.NewReader(string(data)))
+	decoder.DisallowUnknownFields()
+
+	// Decode the JSON into the config struct
+	err = decoder.Decode(config)
+	if err != nil {
+		return nil, fmt.Errorf("failed to unmarshal config file: %w", err)
+	}
+
+	config.MetadataDirectory, err = util.SanitizePath(config.MetadataDirectory)
+	if err != nil {
+		return nil, fmt.Errorf("failed to sanitize metadata directory: %w", err)
+	}
+
+	return config, nil
+}
diff --git a/sei-db/db_engine/litt/benchmark/config/benchmark_config_test.go b/sei-db/db_engine/litt/benchmark/config/benchmark_config_test.go
new file mode 100644
index 0000000000..0e11eb1f57
--- /dev/null
+++
b/sei-db/db_engine/litt/benchmark/config/benchmark_config_test.go @@ -0,0 +1,62 @@ +//go:build littdb_wip + +package config + +import ( + "os" + "path/filepath" + "testing" + + "github.com/stretchr/testify/require" +) + +func TestLoadConfig(t *testing.T) { + // Create a temporary directory for the test + tempDir := t.TempDir() + + testConfigJSON := `{ + "MetadataDirectory": "/test/dir", + "MaximumWriteThroughputMB": 20.0, + "ValueSizeMB": 3.0, + "BatchSizeMB": 15 + }` + + testConfigPath := filepath.Join(tempDir, "test-config.json") + err := os.WriteFile(testConfigPath, []byte(testConfigJSON), 0644) + require.NoError(t, err) + + // Expected config for comparison + expectedConfig := &BenchmarkConfig{ + MetadataDirectory: "/test/dir", + MaximumWriteThroughputMB: 20.0, + ValueSizeMB: 3.0, + BatchSizeMB: 15, + } + + // Test loading the config + loadedConfig, err := LoadConfig(testConfigPath) + require.NoError(t, err) + require.Equal(t, expectedConfig.MetadataDirectory, loadedConfig.MetadataDirectory) + require.Equal(t, expectedConfig.MaximumWriteThroughputMB, loadedConfig.MaximumWriteThroughputMB) + require.Equal(t, expectedConfig.ValueSizeMB, loadedConfig.ValueSizeMB) + require.Equal(t, expectedConfig.BatchSizeMB, loadedConfig.BatchSizeMB) + + // Test loading a non-existent file + _, err = LoadConfig("/non/existent/path.json") + require.Error(t, err) + + // Test that unknown fields cause an error + unknownFieldConfig := []byte(`{ + "MetadataDirectory": "/test/dir", + "MaximumWriteThroughputMB": 20.0, + "UnknownField": "this field doesn't exist in the struct" + }`) + + unknownFieldPath := filepath.Join(tempDir, "unknown-field.json") + err = os.WriteFile(unknownFieldPath, unknownFieldConfig, 0644) + require.NoError(t, err) + + _, err = LoadConfig(unknownFieldPath) + require.Error(t, err) + require.Contains(t, err.Error(), "unknown field") +} diff --git a/sei-db/db_engine/litt/benchmark/data_generator.go b/sei-db/db_engine/litt/benchmark/data_generator.go new file mode 
100644
index 0000000000..1e0705e426
--- /dev/null
+++ b/sei-db/db_engine/litt/benchmark/data_generator.go
@@ -0,0 +1,76 @@
+//go:build littdb_wip
+
+package benchmark
+
+import (
+	"math/rand"
+	"sync"
+)
+
+// DataGenerator is responsible for generating key-value pairs to be inserted into the database, for the sake of
+// benchmarking.
+type DataGenerator struct {
+	// Pool of random number generators
+	randPool *sync.Pool
+
+	// A pool of randomness. Used to generate values.
+	dataPool []byte
+
+	// The seed that determines the key/value pairs generated.
+	seed int64
+}
+
+// NewDataGenerator builds a data generator instance.
+func NewDataGenerator(seed int64, poolSize uint64) *DataGenerator {
+
+	randPool := &sync.Pool{
+		New: func() interface{} {
+			return rand.New(rand.NewSource(seed))
+		},
+	}
+
+	dataPool := make([]byte, poolSize)
+	rng := randPool.Get().(*rand.Rand)
+	rng.Read(dataPool)
+	randPool.Put(rng)
+
+	return &DataGenerator{
+		randPool: randPool,
+		dataPool: dataPool,
+		seed:     seed,
+	}
+}
+
+// Key generates a new key. The key is deterministic for the same index and seed.
+func (g *DataGenerator) Key(index uint64) []byte {
+	rng := g.randPool.Get().(*rand.Rand)
+	rng.Seed(g.seed + int64(index))
+
+	key := make([]byte, 32)
+	rng.Read(key)
+	g.randPool.Put(rng)
+
+	return key
+}
+
+// Value generates a new value. The value is deterministic for the same index, seed, and value size.
+func (g *DataGenerator) Value(index uint64, valueLength uint64) []byte {
+	rng := g.randPool.Get().(*rand.Rand)
+	rng.Seed(g.seed + int64(index))
+
+	var value []byte
+
+	if valueLength >= uint64(len(g.dataPool)) {
+		// Special case: we don't have enough data in the pool to satisfy the request (the >= guard also
+		// keeps rand.Intn below from being handed a zero argument when the value exactly fills the pool).
+		// For the sake of completeness, just generate the data if this happens.
+		// This shouldn't be encountered for sane configurations (i.e. with a pool size much larger than value sizes).
+ value = make([]byte, valueLength) + rng.Read(value) + } else { + startIndex := rng.Intn(len(g.dataPool) - int(valueLength)) + value = g.dataPool[startIndex : startIndex+int(valueLength)] + } + + g.randPool.Put(rng) + + return value +} diff --git a/sei-db/db_engine/litt/benchmark/data_generator_test.go b/sei-db/db_engine/litt/benchmark/data_generator_test.go new file mode 100644 index 0000000000..0460d79e3a --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/data_generator_test.go @@ -0,0 +1,53 @@ +//go:build littdb_wip + +package benchmark + +import ( + "testing" + + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestDeterminism(t *testing.T) { + rand := random.NewTestRandom() + + seed := rand.Int63() + bufferSize := 1024 * rand.Uint64Range(1, 10) + + generator1 := NewDataGenerator(seed, bufferSize) + generator2 := NewDataGenerator(seed, bufferSize) + + k1, v1 := generator1.Key(0), generator1.Value(0, 32) + k2, v2 := generator1.Key(0), generator1.Value(0, 32) + k3, v3 := generator2.Key(0), generator2.Value(0, 32) + require.Equal(t, k1, k2) + require.Equal(t, v1, v2) + require.Equal(t, k1, k3) + require.Equal(t, v1, v3) + + require.Equal(t, 32, len(v1)) + + index := rand.Uint64() + size := rand.Uint64Range(1, 100) + k1, v1 = generator1.Key(index), generator1.Value(index, size) + k2, v2 = generator1.Key(index), generator1.Value(index, size) + k3, v3 = generator2.Key(index), generator2.Value(index, size) + require.Equal(t, k1, k2) + require.Equal(t, v1, v2) + require.Equal(t, k1, k3) + require.Equal(t, v1, v3) + + require.Equal(t, size, uint64(len(v1))) + + index = rand.Uint64() + k1, v1 = generator1.Key(index), generator1.Value(index, bufferSize*2) + k2, v2 = generator1.Key(index), generator1.Value(index, bufferSize*2) + k3, v3 = generator2.Key(index), generator2.Value(index, bufferSize*2) + require.Equal(t, k1, k2) + require.Equal(t, v1, v2) + require.Equal(t, k1, k3) + require.Equal(t, v1, v3) + + require.Equal(t, 
bufferSize*2, uint64(len(v1))) +} diff --git a/sei-db/db_engine/litt/benchmark/data_tracker.go b/sei-db/db_engine/litt/benchmark/data_tracker.go new file mode 100644 index 0000000000..402d67370d --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/data_tracker.go @@ -0,0 +1,536 @@ +//go:build littdb_wip + +package benchmark + +import ( + "context" + "fmt" + "math" + "math/rand" + "os" + "path" + "strings" + "time" + + "github.com/Layr-Labs/eigenda/litt/benchmark/config" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/docker/go-units" +) + +// WriteInfo contains information needed to perform a write operation. +type WriteInfo struct { + // The index of the key to write. + KeyIndex uint64 + // The key to write. + Key []byte + // The value to write. + Value []byte +} + +// ReadInfo contains information needed to perform a read operation. +type ReadInfo struct { + // The key to read. + Key []byte + // The value we expect to read. + Value []byte +} + +// DataTracker is responsible for tracking key-value pairs that have been written to the database, and for generating +// new key-value pairs to be written. +type DataTracker struct { + ctx context.Context + cancel context.CancelFunc + + // A source of randomness. + rand *rand.Rand + + // The configuration for the benchmark. + config *config.BenchmarkConfig + + // The directory where cohort files are stored. + cohortDirectory string + + // A map from cohort index to information about the cohort. + cohorts map[uint64]*Cohort + + // The cohort that is currently being used to generate keys for writing. + activeCohort *Cohort + + // A set of cohorts that have been completely written to the database (i.e. cohorts that are safe to read). + completeCohortSet map[uint64]struct{} + + // A set of keys passed to ReportWrite() that have not yet been fully processed. + writtenKeysSet map[uint64]struct{} + + // The index of the oldest cohort being tracked. 
+	lowestCohortIndex uint64
+
+	// The index of the newest cohort being tracked.
+	highestCohortIndex uint64
+
+	// Consider all key indices that have been generated this session (i.e. ignore key indices generated prior to the
+	// most recent restart). We want to find the highest key index that has been written to the database AND
+	// where all lower key indices have also been written as well.
+	highestWrittenKeyIndex int64
+
+	// Consider all cohorts that have been generated this session (i.e. ignore cohorts generated prior to the most
+	// recent restart). We want to find the highest cohort index that has been fully written to the database AND
+	// where all cohorts with lower indices have also been written as well.
+	highestWrittenCohortIndex int64
+
+	// A channel containing key-value pairs that are ready to be written.
+	writeInfoChan chan *WriteInfo
+
+	// A channel containing keys that are ready to be read.
+	readInfoChan chan *ReadInfo
+
+	// A channel containing information about keys that have been written to the database.
+	writtenKeyIndicesChan chan uint64
+
+	// Responsible for producing "random" data for key-value pairs.
+	generator *DataGenerator
+
+	// The TTL minus a safety margin. Cohorts are considered to be expired if keys in them are older than this.
+	safeTTL time.Duration
+
+	// The size of the values in bytes for new cohorts.
+	valueSize uint64
+
+	// This channel has capacity one and initially has one value in it. This value is drained when the DataTracker is
+	// fully stopped. Other threads can use this to block until the DataTracker is fully stopped.
+	closedChan chan struct{}
+
+	// Used to handle fatal errors in the DataTracker.
+	errorMonitor *util.ErrorMonitor
+}
+
+// NewDataTracker creates a new DataTracker instance, loading all relevant cohorts from disk.
+func NewDataTracker( + ctx context.Context, + config *config.BenchmarkConfig, + errorMonitor *util.ErrorMonitor, +) (*DataTracker, error) { + + cohortDirectory := path.Join(config.MetadataDirectory, "cohorts") + + // Create the cohort directory if it doesn't exist. + err := util.EnsureDirectoryExists(cohortDirectory, config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to create cohort directory: %w", err) + } + + lowestCohortIndex, highestCohortIndex, cohorts, err := gatherCohorts(cohortDirectory) + if err != nil { + return nil, fmt.Errorf("failed to gather cohorts: %w", err) + } + + // Gather the set of complete cohorts. These are the cohorts we can read from. + completeCohortSet := make(map[uint64]struct{}) + if len(cohorts) != 0 { + for i := lowestCohortIndex; i <= highestCohortIndex; i++ { + if cohorts[i].IsComplete() { + completeCohortSet[i] = struct{}{} + } + } + } + + valueSize := uint64(config.ValueSizeMB * float64(units.MiB)) + + // Create an initial active cohort. + var activeCohort *Cohort + if len(cohorts) == 0 { + // Starting fresh, create a new cohort starting from key index 0. 
+ activeCohort, err = NewCohort( + cohortDirectory, + 0, + 0, + config.CohortSize, + valueSize, + config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to create genesis cohort: %w", err) + } + } else { + activeCohort, err = cohorts[highestCohortIndex].NextCohort(config.CohortSize, valueSize) + if err != nil { + return nil, fmt.Errorf("failed to create next cohort: %w", err) + } + } + highestCohortIndex = activeCohort.CohortIndex() + cohorts[highestCohortIndex] = activeCohort + + writeInfoChan := make(chan *WriteInfo, config.WriteInfoChanelSize) + readInfoChan := make(chan *ReadInfo, config.ReadInfoChanelSize) + writtenKeyIndicesChan := make(chan uint64, 64) + + ttl := time.Duration(config.TTLHours * float64(time.Hour)) + safetyMargin := time.Duration(config.ReadSafetyMarginMinutes * float64(time.Minute)) + safeTTL := ttl - safetyMargin + + closedChan := make(chan struct{}, 1) + closedChan <- struct{}{} // Will be drained when the DataTracker is closed. + + ctx, cancel := context.WithCancel(ctx) + + tracker := &DataTracker{ + ctx: ctx, + cancel: cancel, + rand: rand.New(rand.NewSource(time.Now().UnixNano())), + config: config, + cohortDirectory: cohortDirectory, + cohorts: cohorts, + completeCohortSet: completeCohortSet, + writtenKeysSet: make(map[uint64]struct{}), + writeInfoChan: writeInfoChan, + readInfoChan: readInfoChan, + writtenKeyIndicesChan: writtenKeyIndicesChan, + activeCohort: activeCohort, + lowestCohortIndex: lowestCohortIndex, + highestCohortIndex: highestCohortIndex, + highestWrittenKeyIndex: int64(activeCohort.LowKeyIndex()) - 1, + highestWrittenCohortIndex: int64(highestCohortIndex) - 1, + safeTTL: safeTTL, + valueSize: valueSize, + generator: NewDataGenerator(config.Seed, config.RandomPoolSize), + closedChan: closedChan, + errorMonitor: errorMonitor, + } + + go tracker.dataGenerator() + + return tracker, nil +} + +// gatherCohorts loads cohorts from files on disk. 
The lowest/highest cohort indices are valid if and only if the
+// cohorts map is not empty. If no cohorts are found, the lowest and highest cohort indices will be 0.
+func gatherCohorts(cohortDirPath string) (
+	lowestCohortIndex uint64,
+	highestCohortIndex uint64,
+	cohorts map[uint64]*Cohort,
+	err error) {
+
+	cohorts = make(map[uint64]*Cohort)
+
+	// Walk over the files in the directory. For each file, check whether it is a cohort file;
+	// if it is, load the cohort and add it to the map, otherwise ignore it.
+	files, err := os.ReadDir(cohortDirPath)
+	if err != nil {
+		return 0,
+			0,
+			nil,
+			fmt.Errorf("failed to read directory: %w", err)
+	}
+
+	lowestCohortIndex = math.MaxUint64
+	highestCohortIndex = 0
+
+	for _, file := range files {
+		filePath := path.Join(cohortDirPath, file.Name())
+
+		if strings.HasSuffix(filePath, CohortFileExtension) {
+			cohort, err := LoadCohort(filePath)
+			if err != nil {
+				return 0,
+					0,
+					nil,
+					fmt.Errorf("failed to load cohort: %w", err)
+			}
+			cohorts[cohort.CohortIndex()] = cohort
+
+			if cohort.CohortIndex() < lowestCohortIndex {
+				lowestCohortIndex = cohort.CohortIndex()
+			}
+			if cohort.CohortIndex() > highestCohortIndex {
+				highestCohortIndex = cohort.CohortIndex()
+			}
+		} else if strings.HasSuffix(filePath, CohortSwapFileExtension) {
+			// Delete any swap files discovered
+			err = os.Remove(filePath)
+			if err != nil && !os.IsNotExist(err) {
+				return 0,
+					0,
+					nil,
+					fmt.Errorf("failed to delete swap file: %w", err)
+			}
+		}
+	}
+
+	if len(cohorts) == 0 {
+		// Special case, no cohorts found.
+		return 0, 0, cohorts, nil
+	}
+
+	return lowestCohortIndex, highestCohortIndex, cohorts, nil
+}
+
+// LargestReadableValueSize returns the size of the largest value possible to read from the database,
+// given current configuration. Considers both values previously written and stored
+// (possibly with different configurations), and values that may be written in the future with the
+// current configuration.
+func (t *DataTracker) LargestReadableValueSize() uint64 {
+	largestValue := uint64(t.config.ValueSizeMB * float64(units.MiB))
+
+	if len(t.cohorts) > 0 {
+		for i := t.lowestCohortIndex; i <= t.highestCohortIndex; i++ {
+			cohort := t.cohorts[i]
+			if cohort.IsComplete() {
+				if cohort.ValueSize() > largestValue {
+					largestValue = cohort.ValueSize()
+				}
+			}
+		}
+	}
+
+	return largestValue
+}
+
+// GetWriteInfo returns information required to perform a write operation. It returns the key index (which is needed
+// to call ReportWrite()), the key, and the value. Data is generated on background goroutines in order to
+// make this method very fast. Will not block as long as data can be generated in the background fast enough.
+// May return nil if the context is cancelled.
+func (t *DataTracker) GetWriteInfo() *WriteInfo {
+	select {
+	case info := <-t.writeInfoChan:
+		return info
+	case <-t.ctx.Done():
+		return nil
+	}
+}
+
+// ReportWrite is called when a key has been written to the database. This means that the key is now safe to be read.
+func (t *DataTracker) ReportWrite(index uint64) {
+	select {
+	case t.writtenKeyIndicesChan <- index:
+		return
+	case <-t.ctx.Done():
+		return
+	}
+}
+
+// GetReadInfo returns information required to perform a read operation. Blocks until there is data eligible to be
+// read. May return nil if the context is cancelled.
+func (t *DataTracker) GetReadInfo() *ReadInfo {
+	select {
+	case info := <-t.readInfoChan:
+		return info
+	case <-t.ctx.Done():
+		return nil
+	}
+}
+
+// GetReadInfoWithTimeout returns information required to perform a read operation. Waits the specified timeout for
+// data to be eligible to be read. If no data is available within the time limit, returns nil.
+func (t *DataTracker) GetReadInfoWithTimeout(timeout time.Duration) *ReadInfo {
+	ctx, cancel := context.WithTimeout(t.ctx, timeout)
+	defer cancel()
+
+	select {
+	case info := <-t.readInfoChan:
+		return info
+	case <-ctx.Done():
+		return nil
+	}
+}
+
+// Close stops the DataTracker's background tasks.
+func (t *DataTracker) Close() {
+	t.cancel()
+	t.closedChan <- struct{}{}
+	<-t.closedChan
+}
+
+// dataGenerator is responsible for generating data in the background.
+func (t *DataTracker) dataGenerator() {
+	ticker := time.NewTicker(time.Duration(t.config.CohortGCPeriodSeconds * float64(time.Second)))
+	defer func() {
+		ticker.Stop()
+		<-t.closedChan
+	}()
+
+	nextWriteInfo := t.generateNextWriteInfo()
+	nextReadInfo := t.generateNextReadInfo()
+
+	for {
+		if nextReadInfo == nil {
+			// Edge case: when started up for the first time, there won't be any values eligible to be read.
+			// We have to handle this in a special manner to prevent nil values from being inserted into
+			// the readInfoChan.
+
+			select {
+			case <-t.errorMonitor.ImmediateShutdownRequired():
+				return
+			case <-t.ctx.Done():
+				return
+			case keyIndex := <-t.writtenKeyIndicesChan:
+				// track keys that have been written so that we can read them in the future
+				t.handleWrittenKey(keyIndex)
+			case t.writeInfoChan <- nextWriteInfo:
+				// prepare a value to be eventually written
+				nextWriteInfo = t.generateNextWriteInfo()
+			case <-ticker.C:
+				// perform garbage collection on cohorts
+				t.DoCohortGC()
+			}
+
+			nextReadInfo = t.generateNextReadInfo()
+
+		} else {
+			// Standard case.
+ + select { + case <-t.errorMonitor.ImmediateShutdownRequired(): + return + case <-t.ctx.Done(): + return + case keyIndex := <-t.writtenKeyIndicesChan: + // track keys that have been written so that we can read them in the future + t.handleWrittenKey(keyIndex) + case t.writeInfoChan <- nextWriteInfo: + // prepare a value to be eventually written + nextWriteInfo = t.generateNextWriteInfo() + case t.readInfoChan <- nextReadInfo: + // prepare a value to be eventually read + nextReadInfo = t.generateNextReadInfo() + case <-ticker.C: + // perform garbage collection on cohorts + t.DoCohortGC() + } + } + } +} + +// handleWrittenKey handles a key that has been written to the database. +func (t *DataTracker) handleWrittenKey(keyIndex uint64) { + // Add key index to the set of written keys we are tracking. + t.writtenKeysSet[keyIndex] = struct{}{} + + // Determine the highest key index written so far that also has all lower key indices written. + for { + nextKeyIndex := uint64(t.highestWrittenKeyIndex + 1) + if _, ok := t.writtenKeysSet[nextKeyIndex]; ok { + // The next key has been written, mark it as such. + t.highestWrittenKeyIndex = int64(nextKeyIndex) + delete(t.writtenKeysSet, nextKeyIndex) + } else { + // Once we find the first key that has not been written, we can stop checking. + // We want t.highestWrittenKeyIndex to be the highest key index that has been written + // without any gaps in the sequence. + break + } + } + + // Determine the highest cohort index written so far that also has all lower cohorts written. + for { + nextCohortIndex := uint64(t.highestWrittenCohortIndex + 1) + if nextCohortIndex >= t.activeCohort.CohortIndex() { + // Don't ever mark the active cohort as complete. + break + } + nextCohort := t.cohorts[nextCohortIndex] + if int64(nextCohort.HighKeyIndex()) <= t.highestWrittenKeyIndex { + // We've found a cohort that has all keys written. 
+ t.highestWrittenCohortIndex = int64(nextCohort.CohortIndex()) + t.completeCohortSet[nextCohort.CohortIndex()] = struct{}{} + err := nextCohort.MarkComplete() + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to mark cohort as complete: %v", err)) + return + } + } else { + // Once we find the first cohort that does not have all keys written, we can stop checking. + break + } + } +} + +// generateNextWriteInfo generates the next write info to be placed into the writeInfoChan. +func (t *DataTracker) generateNextWriteInfo() *WriteInfo { + var err error + + if t.activeCohort.IsExhausted() { + t.activeCohort, err = t.cohorts[t.highestCohortIndex].NextCohort(t.config.CohortSize, t.valueSize) + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to generate next cohort for highest cohort: %v", err)) + return nil + } + t.highestCohortIndex = t.activeCohort.CohortIndex() + t.cohorts[t.highestCohortIndex] = t.activeCohort + } + + keyIndex, err := t.activeCohort.GetKeyIndexForWriting() + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to get key index for writing: %v", err)) + return nil + } + + return &WriteInfo{ + KeyIndex: keyIndex, + Key: t.generator.Key(keyIndex), + Value: t.generator.Value(keyIndex, t.activeCohort.valueSize), + } +} + +// generateNextReadInfo generates the next read info to be placed into the readInfoChan. +func (t *DataTracker) generateNextReadInfo() *ReadInfo { + if len(t.completeCohortSet) == 0 { + // No cohorts are complete, so we can't read anything. + return nil + } + + var cohortIndexToRead uint64 + for cohortIndexToRead = range t.completeCohortSet { + // map iteration is random in golang, so this will yield a random complete cohort. 
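+		// This idiom can be sketched in isolation (hypothetical, standalone; not part of the benchmark):
+		//
+		//	set := map[uint64]struct{}{1: {}, 2: {}, 3: {}}
+		//	var pick uint64
+		//	for k := range set {
+		//		pick = k // iteration start is randomized by the Go runtime
+		//		break
+		//	}
+		//
+		// Note the language spec only guarantees the order is unspecified; the runtime's randomization is
+		// not a uniform random choice, merely good enough for sampling here.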
+ break + } + cohortToRead := t.cohorts[cohortIndexToRead] + + keyIndex, err := cohortToRead.GetKeyIndexForReading(t.rand) + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to get key index for reading: %v", err)) + return nil + } + + return &ReadInfo{ + Key: t.generator.Key(keyIndex), + Value: t.generator.Value(keyIndex, cohortToRead.ValueSize()), + } +} + +// DoCohortGC performs garbage collection on the cohorts, removing cohorts with entries that are nearing expiration. +func (t *DataTracker) DoCohortGC() { + now := time.Now() + + // Check all cohorts except for the active cohort (i.e. the one with index t.highestCohortIndex). + for i := t.lowestCohortIndex; i < t.highestCohortIndex; i++ { + cohort := t.cohorts[i] + + if cohort.IsExpired(now, t.safeTTL) { + err := cohort.Delete() + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to delete expired cohort: %v", err)) + return + } + t.lowestCohortIndex++ + delete(t.cohorts, cohort.CohortIndex()) + delete(t.completeCohortSet, cohort.CohortIndex()) + } else { + // Stop once we find the first cohort that is not eligible for deletion. + break + } + } + + if len(t.cohorts) == 0 { + // Edge case: we've been writing data slow enough that the active cohort has expired. + // Create a new active cohort. 
+ activeCohort, err := t.activeCohort.NextCohort(t.config.CohortSize, t.valueSize) + if err != nil { + t.errorMonitor.Panic(fmt.Errorf("failed to create new active cohort: %v", err)) + return + } + + t.activeCohort = activeCohort + t.highestCohortIndex = activeCohort.CohortIndex() + t.cohorts[activeCohort.CohortIndex()] = activeCohort + } +} diff --git a/sei-db/db_engine/litt/benchmark/data_tracker_test.go b/sei-db/db_engine/litt/benchmark/data_tracker_test.go new file mode 100644 index 0000000000..4b4d3e4f5d --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/data_tracker_test.go @@ -0,0 +1,256 @@ +//go:build littdb_wip + +package benchmark + +import ( + "os" + "testing" + "time" + + config2 "github.com/Layr-Labs/eigenda/litt/benchmark/config" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/docker/go-units" + "github.com/stretchr/testify/require" +) + +func TestTrackerDeterminism(t *testing.T) { + ctx := t.Context() + rand := random.NewTestRandom() + directory := t.TempDir() + + config := config2.DefaultBenchmarkConfig() + config.RandomPoolSize = units.MiB + config.CohortSize = rand.Uint64Range(10, 20) + config.MetadataDirectory = directory + config.Seed = rand.Int63() + config.ValueSizeMB = 1.0 / 1024 // 1kb + config.TTLHours = 1 + + // Generate enough data to fill 10ish cohorts. + keyCount := 10*config.CohortSize + rand.Uint64Range(0, 10) + + errorMonitor := util.NewErrorMonitor(ctx, config.LittConfig.Logger, nil) + + dataTracker, err := NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + // map from indices to keys + expectedKeys := make(map[uint64][]byte) + + // map from indices to values + expectedValues := make(map[uint64][]byte) + + // Get a bunch of values. 
+ for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + require.Equal(t, i, writeInfo.KeyIndex) + require.Equal(t, 32, len(writeInfo.Key)) + require.Equal(t, units.KiB, len(writeInfo.Value)) + + expectedKeys[i] = writeInfo.Key + expectedValues[i] = writeInfo.Value + } + + dataTracker.Close() + + // Rebuild the tracker at genesis. We should get the same sequence of keys and values. + err = os.RemoveAll(directory) + require.NoError(t, err) + err = os.MkdirAll(directory, os.ModePerm) + require.NoError(t, err) + dataTracker, err = NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + require.Equal(t, i, writeInfo.KeyIndex) + require.Equal(t, 32, len(writeInfo.Key)) + require.Equal(t, units.KiB, len(writeInfo.Value)) + require.Equal(t, expectedKeys[i], writeInfo.Key) + require.Equal(t, expectedValues[i], writeInfo.Value) + } + + dataTracker.Close() + + err = os.RemoveAll(directory) + require.NoError(t, err) + ok, _ := errorMonitor.IsOk() + require.True(t, ok) +} + +func TestTrackerRestart(t *testing.T) { + ctx := t.Context() + rand := random.NewTestRandom() + directory := t.TempDir() + + config := config2.DefaultBenchmarkConfig() + config.RandomPoolSize = units.MiB + config.CohortSize = rand.Uint64Range(10, 20) + config.MetadataDirectory = directory + config.Seed = rand.Int63() + config.ValueSizeMB = 1.0 / 1024 // 1kb + + // Generate enough data to fill 10ish cohorts. + keyCount := 10*config.CohortSize + rand.Uint64Range(0, 10) + + errorMonitor := util.NewErrorMonitor(ctx, config.LittConfig.Logger, nil) + + dataTracker, err := NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + indexSet := make(map[uint64]struct{}) + + // Generate a bunch of values. 
+ for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + require.Equal(t, i, writeInfo.KeyIndex) + require.Equal(t, 32, len(writeInfo.Key)) + require.Equal(t, units.KiB, len(writeInfo.Value)) + + indexSet[writeInfo.KeyIndex] = struct{}{} + } + + // All indices should be unique. + require.Equal(t, keyCount, uint64(len(indexSet))) + + // Restart. + dataTracker.Close() + dataTracker, err = NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + // Generate more values. + for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + indexSet[writeInfo.KeyIndex] = struct{}{} + } + + // If we aren't reusing indices after the restart, then the set should now be equal to 2*keyCount. + require.Equal(t, 2*keyCount, uint64(len(indexSet))) + + dataTracker.Close() + + err = os.RemoveAll(directory) + require.NoError(t, err) + + ok, _ := errorMonitor.IsOk() + require.True(t, ok) +} + +func TestTrackReads(t *testing.T) { + ctx := t.Context() + rand := random.NewTestRandom() + directory := t.TempDir() + + config := config2.DefaultBenchmarkConfig() + config.RandomPoolSize = units.MiB + config.CohortSize = rand.Uint64Range(10, 20) + config.MetadataDirectory = directory + config.Seed = rand.Int63() + config.ValueSizeMB = 1.0 / 1024 // 1kb + + // Generate enough data to fill exactly 10 cohorts. + keyCount := 10 * config.CohortSize + + errorMonitor := util.NewErrorMonitor(ctx, config.LittConfig.Logger, nil) + + dataTracker, err := NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + keyToIndexMap := make(map[string]uint64) + + // When reading, we should only ever read from indices that have been confirmed written. + highestWrittenIndex := -1 + highestIndexReportedWritten := -1 + readCount := uint64(0) + + // Generate a bunch of values. 
+ for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + require.Equal(t, i, writeInfo.KeyIndex) + require.Equal(t, 32, len(writeInfo.Key)) + require.Equal(t, units.KiB, len(writeInfo.Value)) + + keyToIndexMap[string(writeInfo.Key)] = writeInfo.KeyIndex + + if rand.Float64() < 0.1 && i > 2*config.CohortSize { + // Advance the highest written index. + possibleIndex := rand.Uint64Range(i-config.CohortSize*2, i) + if int(possibleIndex) > highestWrittenIndex { + highestWrittenIndex = int(possibleIndex) + } else { + highestWrittenIndex++ + } + for highestIndexReportedWritten < highestWrittenIndex { + highestIndexReportedWritten++ + dataTracker.ReportWrite(uint64(highestIndexReportedWritten)) + } + + // Give the data tracker time to ingest data. Not required for the test to pass. + time.Sleep(10 * time.Millisecond) + } + + // Read a random value. + var readInfo *ReadInfo + if readCount == 0 { + // We are reading the first value, so one might not be available yet. Don't block forever. + readInfo = dataTracker.GetReadInfoWithTimeout(time.Millisecond) + } else { + // After we read the first value, we should never block. + readInfo = dataTracker.GetReadInfo() + } + if readInfo != nil { + readCount++ + index := keyToIndexMap[string(readInfo.Key)] + + // we should not read values we haven't told the data tracker we've written. + require.True(t, int(index) <= highestWrittenIndex) + } + } + + require.True(t, readCount > 0) + + // Mark all data as having been written so far. + highestWrittenIndex = int(keyCount - 1) + for highestIndexReportedWritten < highestWrittenIndex { + highestIndexReportedWritten++ + dataTracker.ReportWrite(uint64(highestIndexReportedWritten)) + } + + unwrittenKeys := make(map[string]struct{}) + + // Write a bunch more data, but do not mark any of it as having been written. 
+ for i := uint64(0); i < keyCount; i++ { + writeInfo := dataTracker.GetWriteInfo() + unwrittenKeys[string(writeInfo.Key)] = struct{}{} + } + + // Restart the tracker without marking any of the new data as having been written. + dataTracker.Close() + dataTracker, err = NewDataTracker(ctx, config, errorMonitor) + require.NoError(t, err) + + // Read a bunch of data. + readDataSet := make(map[string]struct{}) + for i := uint64(0); i < keyCount*10; i++ { + readInfo := dataTracker.GetReadInfo() + require.NotNil(t, readInfo) + + if _, ok := unwrittenKeys[string(readInfo.Key)]; ok { + // We should not be able to read data that we haven't marked as having been written. + require.Fail(t, "read unwritten data") + } + + readDataSet[string(readInfo.Key)] = struct{}{} + } + + // The data we read is random, but the following heuristic should hold with high probability. + require.True(t, len(readDataSet) > int(0.5*float64(keyCount))) + + dataTracker.Close() + + err = os.RemoveAll(directory) + require.NoError(t, err) + ok, _ := errorMonitor.IsOk() + require.True(t, ok) +} diff --git a/sei-db/db_engine/litt/benchmark/run.sh b/sei-db/db_engine/litt/benchmark/run.sh new file mode 100755 index 0000000000..53abe88744 --- /dev/null +++ b/sei-db/db_engine/litt/benchmark/run.sh @@ -0,0 +1,19 @@ +#!/usr/bin/env bash + +# This script is used to run the LittDB benchmark. + +# Find the directory of this script +SCRIPT_DIR=$(dirname "$(readlink -f "$0")") + +# Get the absolute path to the binary. 
+BINARY_PATH="$SCRIPT_DIR/../bin/benchmark"
+BINARY_PATH="$(cd "$(dirname "$BINARY_PATH")" && pwd)/$(basename "$BINARY_PATH")"
+
+CONFIG_PATH="${1}"
+if [ -z "$CONFIG_PATH" ]; then
+  echo "Usage: $0 <path to benchmark config file>"
+  exit 1
+fi
+CONFIG_PATH="$(cd "$(dirname "$CONFIG_PATH")" && pwd)/$(basename "$CONFIG_PATH")"
+
+"$BINARY_PATH" "$CONFIG_PATH"
diff --git a/sei-db/db_engine/litt/cli/benchmark.go b/sei-db/db_engine/litt/cli/benchmark.go
new file mode 100644
index 0000000000..387b929741
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/benchmark.go
@@ -0,0 +1,38 @@
+//go:build littdb_wip
+
+package main
+
+import (
+	"fmt"
+
+	"github.com/Layr-Labs/eigenda/litt/benchmark"
+	"github.com/urfave/cli/v2"
+)
+
+// A launcher for the benchmark.
+func benchmarkCommand(ctx *cli.Context) error {
+	if ctx.NArg() != 1 {
+		return cli.Exit("benchmark command requires exactly one argument: <path to benchmark config file>", 1)
+	}
+
+	configPath := ctx.Args().Get(0)
+
+	// Create the benchmark engine.
+	engine, err := benchmark.NewBenchmarkEngine(configPath)
+	if err != nil {
+		return fmt.Errorf("failed to create benchmark engine: %w", err)
+	}
+
+	// Run the benchmark.
+	engine.Logger().Infof("Configuration loaded from %s", configPath)
+	engine.Logger().Info("Press Ctrl+C to stop the benchmark")
+
+	err = engine.Run()
+	if err != nil {
+		return err
+	}
+	engine.Logger().Info("Benchmark terminated")
+
+	return nil
+}
diff --git a/sei-db/db_engine/litt/cli/litt_cli.go b/sei-db/db_engine/litt/cli/litt_cli.go
new file mode 100644
index 0000000000..9392a4f57e
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/litt_cli.go
@@ -0,0 +1,357 @@
+//go:build littdb_wip
+
+package main
+
+import (
+	"bufio"
+	"fmt"
+	"os"
+
+	"github.com/Layr-Labs/eigenda/common/pprof"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+	"github.com/urfave/cli/v2"
+)
+
+// TODO (cody.littley): convert all commands to use flags stored in these variables
+var (
+	srcFlag = &cli.StringSliceFlag{
+		Name:     "src",
+		Aliases:  []string{"s"},
+		Usage:    "Source paths where the DB data 
is found, at least one is required.",
+		Required: true,
+	}
+	forceFlag = &cli.BoolFlag{
+		Name:    "force",
+		Aliases: []string{"f"},
+		Usage:   "Force the operation without prompting for confirmation.",
+	}
+	knownHostsFileFlag = &cli.StringFlag{
+		Name:     "known-hosts",
+		Aliases:  []string{"k"},
+		Usage:    "Path to a file containing known hosts for SSH connections.",
+		Required: false,
+		Value:    "~/.ssh/known_hosts",
+	}
+)
+
+// buildCLIParser creates a command line parser for the LittDB CLI tool.
+func buildCLIParser(logger logging.Logger) *cli.App {
+	app := &cli.App{
+		Name:  "litt",
+		Usage: "LittDB command line interface",
+		Flags: []cli.Flag{
+			&cli.BoolFlag{
+				Name:    "debug",
+				Aliases: []string{"d"},
+				Usage:   "Enable debug mode. Program will pause for a debugger to attach.",
+			},
+			&cli.BoolFlag{
+				Name:    "pprof",
+				Aliases: []string{"p"},
+				Usage:   "Starts a pprof server for profiling.",
+			},
+			&cli.IntFlag{
+				Name:    "pprof-port",
+				Aliases: []string{"P"},
+				Usage:   "Port for the pprof server.",
+				Value:   6060,
+			},
+		},
+		Before: buildBeforeAction(logger),
+		Commands: []*cli.Command{
+			{
+				Name:      "ls",
+				Usage:     "List tables in a LittDB instance.",
+				ArgsUsage: "--src <path1> ... --src <pathN>",
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the DB data is found, at least one is required.",
+						Required: true,
+					},
+				},
+				Action: lsCommand,
+			},
+			{
+				Name: "table-info",
+				Usage: "Get information about a LittDB table. " +
+					"If the DB is spread across multiple paths, all paths must be provided.",
+				ArgsUsage: "--src <path1> ... --src <pathN>",
+				Args:      true,
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the DB data is found, at least one is required.",
+						Required: true,
+					},
+				},
+				Action: tableInfoCommand,
+			},
+			{
+				Name:  "rebase",
+				Usage: "Restructure LittDB file system layout.",
+				ArgsUsage: "--src <path1> ... --src <pathN> " +
+					"--dst <path1> ... 
--dst <pathN> [--preserve] [--quiet]",
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the data is found, at least one is required.",
+						Required: true,
+					},
+					&cli.StringSliceFlag{
+						Name:     "dst",
+						Aliases:  []string{"d"},
+						Usage:    "Destination paths for the rebased LittDB, at least one is required.",
+						Required: true,
+					},
+					&cli.BoolFlag{
+						Name:    "preserve",
+						Aliases: []string{"p"},
+						Usage:   "If enabled, then the old files are not removed.",
+					},
+					&cli.BoolFlag{
+						Name:    "quiet",
+						Aliases: []string{"q"},
+						Usage:   "Reduces the verbosity of the output.",
+					},
+				},
+				Action: rebaseCommand,
+			},
+			{
+				Name:      "benchmark",
+				Usage:     "Run a LittDB benchmark.",
+				ArgsUsage: "<path to benchmark config file>",
+				Args:      true,
+				Action:    benchmarkCommand,
+			},
+			{
+				Name:  "prune",
+				Usage: "Delete data from a LittDB database/snapshot.",
+				ArgsUsage: "--src <path1> ... --src <pathN> --max-age <seconds> " +
+					"[--table <name1> ... --table <nameN>]",
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the DB data is found, at least one is required.",
+						Required: true,
+					},
+					&cli.StringSliceFlag{
+						Name:    "table",
+						Aliases: []string{"t"},
+						Usage:   "Prune this table. If not specified, all tables will be pruned.",
+					},
+					&cli.Uint64Flag{
+						Name:    "max-age",
+						Aliases: []string{"a"},
+						Usage: "Maximum age of segments to keep, in seconds. " +
+							"Segments older than this will be deleted.",
+						Required: true,
+					},
+				},
+				Action: pruneCommand,
+			},
+			{
+				Name:  "push",
+				Usage: "Push data to a remote location using ssh and rsync.",
+				ArgsUsage: "--src <path1> ... --src <pathN> " +
+					"--dst <path1> ... 
--dst <pathN> " +
+					"[-i path/to/key] [-p port] [--no-gc] [--quiet] [--threads <count>] " +
+					"[--throttle <MB/s>] <user>@<remote host>",
+				Args: true,
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the data is found, at least one is required.",
+						Required: true,
+					},
+					&cli.StringSliceFlag{
+						Name:     "dst",
+						Aliases:  []string{"d"},
+						Usage:    "Remote destination paths, at least one is required.",
+						Required: true,
+					},
+					&cli.Uint64Flag{
+						Name:    "port",
+						Aliases: []string{"p"},
+						Usage:   "SSH port to connect to the remote host.",
+						Value:   22,
+					},
+					knownHostsFileFlag,
+					&cli.StringFlag{
+						Name:    "key",
+						Aliases: []string{"i"},
+						Usage:   "Path to the SSH private key file for authentication.",
+						Value:   "~/.ssh/id_rsa",
+					},
+					&cli.BoolFlag{
+						Name:    "no-gc",
+						Aliases: []string{"n"},
+						Usage:   "If true, do not delete files pushed to the remote host.",
+					},
+					&cli.BoolFlag{
+						Name:    "quiet",
+						Aliases: []string{"q"},
+						Usage:   "Reduces the verbosity of the output.",
+					},
+					&cli.Uint64Flag{
+						Name:    "threads",
+						Aliases: []string{"t"},
+						Usage:   "Number of parallel rsync operations.",
+						Value:   8,
+					},
+					&cli.Float64Flag{
+						Name:    "throttle",
+						Aliases: []string{"T"},
+						Usage:   "Max network utilization, in MB/s.",
+						Value:   0,
+					},
+				},
+				Action: pushCommand,
+			},
+			{ // TODO (cody.littley) test in preprod
+				Name: "sync",
+				Usage: "Periodically run 'litt push' to keep a remote backup in sync with local data. " +
+					"Optionally calls 'litt prune' remotely to manage data retention.",
+				ArgsUsage: "--src <path1> ... --src <pathN> " +
+					"--dst <path1> ... 
--dst <pathN> " +
+					"[-i <key path>] [-p <port>] [--no-gc] [--quiet] [--threads <count>] " +
+					"[--throttle <MB/s>] [--max-age <seconds>] [--litt-binary <path>] " +
+					"[--period <seconds>] " +
+					"<user>@<remote host>",
+				Flags: []cli.Flag{
+					&cli.StringSliceFlag{
+						Name:     "src",
+						Aliases:  []string{"s"},
+						Usage:    "Source paths where the data is found, at least one is required.",
+						Required: true,
+					},
+					&cli.StringSliceFlag{
+						Name:     "dst",
+						Aliases:  []string{"d"},
+						Usage:    "Remote destination paths, at least one is required.",
+						Required: true,
+					},
+					&cli.Uint64Flag{
+						Name:    "port",
+						Aliases: []string{"p"},
+						Usage:   "SSH port to connect to the remote host.",
+						Value:   22,
+					},
+					&cli.StringFlag{
+						Name:    "key",
+						Aliases: []string{"i"},
+						Usage:   "Path to the SSH private key file for authentication.",
+						Value:   "~/.ssh/id_rsa",
+					},
+					knownHostsFileFlag,
+					&cli.BoolFlag{
+						Name:    "no-gc",
+						Aliases: []string{"n"},
+						Usage:   "If true, do not delete files pushed to the remote host.",
+					},
+					&cli.BoolFlag{
+						Name:    "quiet",
+						Aliases: []string{"q"},
+						Usage:   "Reduces the verbosity of the output.",
+					},
+					&cli.Uint64Flag{
+						Name:    "threads",
+						Aliases: []string{"t"},
+						Usage:   "Number of parallel rsync operations.",
+						Value:   8,
+					},
+					&cli.Float64Flag{
+						Name:    "throttle",
+						Aliases: []string{"T"},
+						Usage:   "Max network utilization, in MB/s.",
+						Value:   0,
+					},
+					&cli.Uint64Flag{
+						Name:    "max-age",
+						Aliases: []string{"a"},
+						Usage: "If non-zero, remotely run 'litt prune' to delete segments " +
+							"older than this age in seconds.",
+						Value: 0, // Default to 0, meaning no age limit
+					},
+					&cli.StringFlag{
+						Name:    "litt-binary",
+						Aliases: []string{"b"},
+						Usage:   "The remote location of the 'litt' CLI binary to use for pruning.",
+						Value:   "litt",
+					},
+					&cli.Uint64Flag{
+						Name:    "period",
+						Aliases: []string{"P"},
+						Usage:   "The period in seconds between sync operations.",
+						Value:   300,
+					},
+				},
+				Action: syncCommand,
+			},
+			{
+				Name:      "unlock",
+				Usage:     "Manually delete LittDB lock files. Dangerous if used improperly, use with caution.",
+				ArgsUsage: "--src <path1> ... 
--src <pathN> [--force]",
+				Flags: []cli.Flag{
+					srcFlag,
+					forceFlag,
+				},
+				Action: unlockCommand,
+			},
+		},
+	}
+	return app
+}
+
+// Builds a function that is called before any command is executed.
+func buildBeforeAction(logger logging.Logger) func(*cli.Context) error {
+	return func(ctx *cli.Context) error {
+		handleDebugMode(ctx, logger)
+
+		err := handlePProfMode(ctx, logger)
+		if err != nil {
+			return fmt.Errorf("failed to start pprof: %w", err)
+		}
+
+		return nil
+	}
+}
+
+// If debug mode is enabled, this function will block until the user presses Enter.
+func handleDebugMode(ctx *cli.Context, logger logging.Logger) {
+	debugModeEnabled := ctx.Bool("debug")
+	if !debugModeEnabled {
+		return
+	}
+
+	pid := os.Getpid()
+	logger.Infof("Waiting for debugger to attach (pid: %d).", pid)
+
+	logger.Infof("Press Enter to continue...")
+	reader := bufio.NewReader(os.Stdin)
+	_, _ = reader.ReadString('\n') // block until a newline is read
+}
+
+// If pprof is enabled, this function starts the pprof server. 
+func handlePProfMode(ctx *cli.Context, logger logging.Logger) error { + pprofEnabled := ctx.Bool("pprof") + if !pprofEnabled { + return nil + } + + pprofPort := ctx.Int("pprof-port") + if pprofPort <= 0 || pprofPort > 65535 { + return fmt.Errorf("invalid pprof port: %d", pprofPort) + } + + logger.Infof("pprof enabled on port %d", pprofPort) + profiler := pprof.NewPprofProfiler(fmt.Sprintf("%d", pprofPort), logger) + go profiler.Start() + + return nil +} diff --git a/sei-db/db_engine/litt/cli/ls.go b/sei-db/db_engine/litt/cli/ls.go new file mode 100644 index 0000000000..63efd9939d --- /dev/null +++ b/sei-db/db_engine/litt/cli/ls.go @@ -0,0 +1,122 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "os" + "path" + "path/filepath" + "sort" + "strings" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/urfave/cli/v2" +) + +func lsCommand(ctx *cli.Context) error { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + if err != nil { + return fmt.Errorf("failed to create logger: %w", err) + } + + sources := ctx.StringSlice("src") + if len(sources) == 0 { + return fmt.Errorf("no sources provided") + } + for i, src := range sources { + var err error + sources[i], err = util.SanitizePath(src) + if err != nil { + return fmt.Errorf("invalid source path: %s", src) + } + } + + tables, err := lsPaths(logger, sources, true, true) + if err != nil { + return fmt.Errorf("failed to list tables in paths %v: %w", sources, err) + } + + sb := &strings.Builder{} + for _, table := range tables { + sb.WriteString(table) + sb.WriteString("\n") + } + + logger.Infof("Tables found:\n%s", sb.String()) + + return nil +} + +// Similar to ls, but searches for tables in multiple paths. 
+func lsPaths(logger logging.Logger, rootPaths []string, lock bool, fsync bool) ([]string, error) {
+	tableSet := make(map[string]struct{})
+
+	for _, rootPath := range rootPaths {
+		tables, err := ls(logger, rootPath, lock, fsync)
+		if err != nil {
+			return nil, fmt.Errorf("error finding tables: %w", err)
+		}
+		for _, table := range tables {
+			tableSet[table] = struct{}{}
+		}
+	}
+
+	tableNames := make([]string, 0, len(tableSet))
+	for tableName := range tableSet {
+		tableNames = append(tableNames, tableName)
+	}
+
+	sort.Strings(tableNames)
+
+	return tableNames, nil
+}
+
+// Returns a list of LittDB tables at the specified LittDB path. Tables are alphabetically sorted by their names.
+// Returns an error if the path cannot be read or if the lock cannot be acquired.
+func ls(logger logging.Logger, rootPath string, lock bool, fsync bool) ([]string, error) {
+
+	if lock {
+		// Forbid touching tables in active use.
+		lockPath := path.Join(rootPath, util.LockfileName)
+		fLock, err := util.NewFileLock(logger, lockPath, fsync)
+		if err != nil {
+			return nil, fmt.Errorf("failed to acquire lock on %s: %w", rootPath, err)
+		}
+		defer fLock.Release()
+	}
+
+	// LittDB has one directory under the root directory per table, with the name
+	// of the table being the name of the directory.
+	possibleTables, err := os.ReadDir(rootPath)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read dir %s: %w", rootPath, err)
+	}
+
+	// Each table directory will contain a "segments" directory. Infer that any directory containing this directory
+	// is a table. If we are looking at a real LittDB instance, there shouldn't be any other directories, but
+	// there is no need to enforce that here. 
+ tables := make([]string, 0, len(possibleTables)) + for _, entry := range possibleTables { + if !entry.IsDir() { + continue + } + + segmentPath := filepath.Join(rootPath, entry.Name(), segment.SegmentDirectory) + isDirectory, err := util.IsDirectory(segmentPath) + if err != nil { + return nil, fmt.Errorf("failed to check if segment path %s is a directory: %w", segmentPath, err) + } + if isDirectory { + tables = append(tables, entry.Name()) + } + } + + // Alphabetically sort the tables. + sort.Strings(tables) + + return tables, nil +} diff --git a/sei-db/db_engine/litt/cli/ls_test.go b/sei-db/db_engine/litt/cli/ls_test.go new file mode 100644 index 0000000000..5749eccc95 --- /dev/null +++ b/sei-db/db_engine/litt/cli/ls_test.go @@ -0,0 +1,128 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "sort" + "testing" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestLs(t *testing.T) { + t.Parallel() + + logger := test.GetLogger() + rand := random.NewTestRandom() + directory := t.TempDir() + + // Spread data across several root directories. + rootCount := rand.Uint32Range(2, 5) + roots := make([]string, 0, rootCount) + for i := 0; i < int(rootCount); i++ { + roots = append(roots, fmt.Sprintf("%s/root-%d", directory, i)) + } + + config, err := litt.DefaultConfig(roots...) + require.NoError(t, err) + + // Make it so that we have at least as many shards as roots. + config.ShardingFactor = rootCount * rand.Uint32Range(1, 4) + + // Settings that should be enabled for LittDB unit tests. + config.DoubleWriteProtection = true + config.Fsync = false + + // Use small segments to ensure that we create a few segments per table. + config.TargetSegmentFileSize = 100 + + // Enable snapshotting. 
+ snapshotDir := t.TempDir() + config.SnapshotDirectory = snapshotDir + + // Build the DB and a handful of tables. + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint32Range(2, 5) + tables := make([]litt.Table, 0, tableCount) + expectedData := make(map[string]map[string][]byte) + tableNames := make([]string, 0, tableCount) + for i := 0; i < int(tableCount); i++ { + tableName := fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8)) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables = append(tables, table) + expectedData[table.Name()] = make(map[string][]byte) + tableNames = append(tableNames, tableName) + } + + // Alphabetize table names. ls should always return tables in this order. + sort.Strings(tableNames) + + // Insert some data into the tables. + for _, table := range tables { + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 200) + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put key-value pair in table %s", table.Name()) + } + err = table.Flush() + require.NoError(t, err, "Failed to flush table %s", table.Name()) + } + + // Verify that the data is correctly stored in the tables. + for _, table := range tables { + for key, expectedValue := range expectedData[table.Name()] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name()) + require.True(t, ok, "Key %s not found in table %s", key, table.Name()) + require.Equal(t, expectedValue, value, + "Value mismatch for key %s in table %s", key, table.Name()) + } + } + + // We should not be able to call ls on the core directories while the table holds a lock. 
+ for _, root := range roots { + _, err = ls(logger, root, true, false) + require.Error(t, err) + } + _, err = lsPaths(logger, roots, true, false) + require.Error(t, err) + + // Even when the DB is running, it should always be possible to ls the snapshot directory. + lsResult, err := ls(logger, snapshotDir, true, false) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) + + lsResult, err = lsPaths(logger, []string{snapshotDir}, true, false) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) + + err = db.Close() + require.NoError(t, err) + + // Now that the DB is closed, we should be able to ls it. We should find all tables defined regardless of which + // root directory we peer into. + for _, root := range roots { + lsResult, err = ls(logger, root, true, false) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) + } + + lsResult, err = lsPaths(logger, roots, true, true) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) + + // Data should still be present in the snapshot directory. + lsResult, err = ls(logger, snapshotDir, true, false) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) +} diff --git a/sei-db/db_engine/litt/cli/main.go b/sei-db/db_engine/litt/cli/main.go new file mode 100644 index 0000000000..00aaee6926 --- /dev/null +++ b/sei-db/db_engine/litt/cli/main.go @@ -0,0 +1,25 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "os" + + "github.com/Layr-Labs/eigenda/common" +) + +// main is the entry point for the LittDB cli. 
+func main() {
+	logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+	if err != nil {
+		_, _ = fmt.Fprintf(os.Stderr, "Failed to create logger: %v\n", err)
+		os.Exit(1)
+	}
+
+	err = buildCLIParser(logger).Run(os.Args)
+	if err != nil {
+		logger.Errorf("Execution failed: %v", err)
+		os.Exit(1)
+	}
+}
diff --git a/sei-db/db_engine/litt/cli/prune.go b/sei-db/db_engine/litt/cli/prune.go
new file mode 100644
index 0000000000..1eb5ed157b
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/prune.go
@@ -0,0 +1,245 @@
+//go:build littdb_wip
+
+package main
+
+import (
+	"context"
+	"fmt"
+	"os"
+	"path"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/Layr-Labs/eigenda/litt/disktable"
+	"github.com/Layr-Labs/eigenda/litt/disktable/keymap"
+	"github.com/Layr-Labs/eigenda/litt/disktable/segment"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+	"github.com/urfave/cli/v2"
+)
+
+// pruneCommand can be used to remove data from a LittDB instance/snapshot.
+func pruneCommand(ctx *cli.Context) error {
+
+	logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+	if err != nil {
+		return fmt.Errorf("failed to create logger: %w", err)
+	}
+
+	sources := ctx.StringSlice("src")
+	if len(sources) == 0 {
+		return fmt.Errorf("no sources provided")
+	}
+	for i, src := range sources {
+		var err error
+		sources[i], err = util.SanitizePath(src)
+		if err != nil {
+			return fmt.Errorf("invalid source path: %s", src)
+		}
+	}
+
+	tables := ctx.StringSlice("table")
+
+	maxAgeSeconds := ctx.Uint64("max-age")
+
+	return prune(logger, sources, tables, maxAgeSeconds, true)
+}
+
+// prune deletes data from a LittDB database/snapshot. 
+func prune(logger logging.Logger, sources []string, allowedTables []string, maxAgeSeconds uint64, fsync bool) error { + allowedTablesSet := make(map[string]struct{}) + for _, table := range allowedTables { + allowedTablesSet[table] = struct{}{} + } + + // Forbid touching tables in active use. + releaseLocks, err := util.LockDirectories(logger, sources, util.LockfileName, fsync) + if err != nil { + return fmt.Errorf("failed to acquire locks on paths %v: %w", sources, err) + } + defer releaseLocks() + + // Determine which tables to prune. + var tables []string + foundTables, err := lsPaths(logger, sources, false, fsync) + if err != nil { + return fmt.Errorf("failed to list tables in paths %v: %w", sources, err) + } + if len(allowedTables) == 0 { + tables = foundTables + } else { + for _, table := range foundTables { + if _, ok := allowedTablesSet[table]; ok { + tables = append(tables, table) + } + } + } + + // Prune each table. + for _, table := range tables { + bytesDeleted, err := pruneTable(logger, sources, table, maxAgeSeconds, fsync) + if err != nil { + return fmt.Errorf("failed to prune table %s in paths %v: %w", table, sources, err) + } + + logger.Infof("Deleted %s from table '%s'.", common.PrettyPrintBytes(bytesDeleted), table) + } + + return nil +} + +// pruneTable performs offline garbage collection on a LittDB database/snapshot. 
+func pruneTable( + logger logging.Logger, + sources []string, + tableName string, + maxAgeSeconds uint64, + fsync bool) (uint64, error) { + + errorMonitor := util.NewErrorMonitor(context.Background(), logger, nil) + + segmentPaths, err := segment.BuildSegmentPaths(sources, "", tableName) + if err != nil { + return 0, fmt.Errorf("failed to build segment paths for table %s at paths %v: %w", + tableName, sources, err) + } + + lowestSegmentIndex, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + segmentPaths, + false, + time.Now(), + true, + fsync) + if err != nil { + return 0, fmt.Errorf("failed to gather segment files for table %s at paths %v: %w", + tableName, sources, err) + } + + if len(segments) == 0 { + return 0, fmt.Errorf("no segments found for table %s at paths %v", tableName, sources) + } + + // Determine if we are working on the snapshot directory (i.e. the directory with symlinks to the segments). + isSnapshot, err := segments[lowestSegmentIndex].IsSnapshot() + if err != nil { + return 0, fmt.Errorf("failed to check if segment %d is a snapshot: %w", lowestSegmentIndex, err) + } + + if isSnapshot { + // If we are dealing with a snapshot, respect the snapshot upper bound specified by LittDB. + if len(sources) > 1 { + return 0, fmt.Errorf("this is a symlinked snapshot directory, " + + "snapshot directory cannot be spread across multiple sources") + } + upperBoundFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(sources[0], tableName)) + if err != nil { + return 0, fmt.Errorf("failed to load boundary file for table %s at path %s: %w", + tableName, sources[0], err) + } + if upperBoundFile.IsDefined() { + highestSegmentIndex = upperBoundFile.BoundaryIndex() + } + } + + // Delete old segments. 
+	bytesDeleted := uint64(0)
+	deletedSegments := make([]*segment.Segment, 0)
+	for segmentIndex := lowestSegmentIndex; segmentIndex <= highestSegmentIndex; segmentIndex++ {
+		seg := segments[segmentIndex]
+		segmentAge := time.Since(seg.GetSealTime())
+
+		if segmentAge < time.Duration(maxAgeSeconds)*time.Second {
+			// We've pruned all segments that we can.
+			break
+		}
+
+		deletedSegments = append(deletedSegments, seg)
+		bytesDeleted += seg.Size()
+		seg.Release()
+	}
+
+	// Wait for deletion to complete.
+	for _, seg := range deletedSegments {
+		err = seg.BlockUntilFullyDeleted()
+		if err != nil {
+			return 0, fmt.Errorf("failed to block until segment %d is fully deleted: %w",
+				seg.SegmentIndex(), err)
+		}
+	}
+
+	if ok, err := errorMonitor.IsOk(); !ok {
+		return 0, fmt.Errorf("error monitor reports errors: %w", err)
+	}
+
+	if isSnapshot {
+		// This is a snapshot. Write a lower bound file to tell the DB not to re-snapshot files that have been pruned.
+		err = writeLowerBoundFile(sources[0], tableName, deletedSegments)
+		if err != nil {
+			return 0, fmt.Errorf("failed to write lower bound file for table %s at path %s: %w",
+				tableName, sources[0], err)
+		}
+	} else {
+		// If we are doing GC on a table that isn't a snapshot, then we need to delete the snapshots/keymap
+		// for the table. The DB will automatically rebuild the snapshots directory & keymap on the next startup.
+		err = deleteSnapshots(sources, tableName)
+		if err != nil {
+			return 0, fmt.Errorf("failed to delete snapshots/keymap for table %s at paths %v: %w",
+				tableName, sources, err)
+		}
+	}
+
+	return bytesDeleted, nil
+}
+
+// Updates the lower bound file after segments have been deleted.
+func writeLowerBoundFile(snapshotRoot string, tableName string, deletedSegments []*segment.Segment) error {
+	if len(deletedSegments) == 0 {
+		// No segments were deleted, no need to write a lower bound file. 
+		return nil
+	}
+	lowerBoundFile, err := disktable.LoadBoundaryFile(disktable.LowerBound, path.Join(snapshotRoot, tableName))
+	if err != nil {
+		return fmt.Errorf("failed to load boundary file for table %s at path %s: %w",
+			tableName, snapshotRoot, err)
+	}
+	err = lowerBoundFile.Update(deletedSegments[len(deletedSegments)-1].SegmentIndex())
+	if err != nil {
+		return fmt.Errorf("failed to update lower bound file for table %s at path %s: %w",
+			tableName, snapshotRoot, err)
+	}
+
+	return nil
+}
+
+// deleteSnapshots deletes the snapshot and keymap directories in all sources for the given table.
+func deleteSnapshots(sources []string, tableName string) error {
+	for _, source := range sources {
+		snapshotsPath := path.Join(source, tableName, segment.HardLinkDirectory)
+		exists, err := util.Exists(snapshotsPath)
+		if err != nil {
+			return fmt.Errorf("failed to check if snapshots path %s exists: %w", snapshotsPath, err)
+		}
+		if exists {
+			err = os.RemoveAll(snapshotsPath)
+			if err != nil {
+				return fmt.Errorf("failed to remove snapshots path %s: %w", snapshotsPath, err)
+			}
+		}
+
+		keymapPath := path.Join(source, tableName, keymap.KeymapDirectoryName)
+		exists, err = util.Exists(keymapPath)
+		if err != nil {
+			return fmt.Errorf("failed to check if keymap path %s exists: %w", keymapPath, err)
+		}
+		if exists {
+			err = os.RemoveAll(keymapPath)
+			if err != nil {
+				return fmt.Errorf("failed to remove keymap path %s: %w", keymapPath, err)
+			}
+		}
+	}
+
+	return nil
+}
diff --git a/sei-db/db_engine/litt/cli/prune_test.go b/sei-db/db_engine/litt/cli/prune_test.go
new file mode 100644
index 0000000000..7ba02788be
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/prune_test.go
@@ -0,0 +1,342 @@
+//go:build littdb_wip
+
+package main
+
+import (
+	"encoding/binary"
+	"fmt"
+	"os"
+	"path"
+	"testing"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/litt"
+	"github.com/Layr-Labs/eigenda/litt/disktable/segment"
+	"github.com/Layr-Labs/eigenda/litt/littbuilder"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	
"github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestPrune(t *testing.T) { + t.Parallel() + ctx := t.Context() + logger := test.GetLogger() + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + errorMonitor := util.NewErrorMonitor(ctx, logger, nil) + + rootPathCount := rand.Uint64Range(2, 5) + rootPaths := make([]string, rootPathCount) + for i := uint64(0); i < rootPathCount; i++ { + rootPaths[i] = path.Join(testDirectory, fmt.Sprintf("root-%d", i)) + } + + // Use a standard test configuration for LittDB. + config, err := litt.DefaultConfig(rootPaths...) + require.NoError(t, err) + config.Fsync = false + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(rand.Uint64Range(rootPathCount, 2*rootPathCount)) + config.TargetSegmentFileSize = 100 + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint64Range(2, 5) + tables := make(map[string]litt.Table, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableName := fmt.Sprintf("table-%d", i) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables[tableName] = table + } + + // map from table name to keys to values + expectedData := make(map[string]map[string][]byte) + for _, table := range tables { + expectedData[table.Name()] = make(map[string][]byte) + } + + // Write some data into the DB. + for i := 0; i < 1000; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + tableName := fmt.Sprintf("table-%d", tableIndex) + table := tables[tableName] + + key := rand.String(32) + value := rand.PrintableVariableBytes(1, 100) + + err = table.Put([]byte(key), value) + require.NoError(t, err) + + expectedData[tableName][key] = value + } + + // Flush all tables to ensure data is written to disk. + for _, table := range tables { + err = table.Flush() + require.NoError(t, err) + } + + // Close the DB. 
Once this is done, override the timestamps on some of the segment files.
+ // We can then ask prune() to get rid of these segments without fear of race conditions.
+ err = db.Close()
+ require.NoError(t, err)
+
+ // After pruning, the segment indexes in this map should be the lowest segment index that we keep for each table.
+ firstSegmentIndexToKeepByTable := make(map[string]uint32)
+ // A map from table name to a set of keys that are expected to be pruned.
+ expectedPrunedKeys := make(map[string]map[string]struct{})
+
+ // This is the time we will assign to the "old" segments that we want to prune.
+ sixHoursAgo := uint64(time.Now().Add(-6 * time.Hour).UnixNano())
+
+ for tableName := range tables {
+ segmentPaths, err := segment.BuildSegmentPaths(rootPaths, "", tableName)
+ require.NoError(t, err)
+
+ lowSegmentIndex, highSegmentIndex, segments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ segmentPaths,
+ false,
+ time.Now(),
+ false,
+ false)
+ require.NoError(t, err)
+
+ firstSegmentIndexToKeep := lowSegmentIndex + (highSegmentIndex-lowSegmentIndex)/2
+ firstSegmentIndexToKeepByTable[tableName] = firstSegmentIndexToKeep
+
+ for i := lowSegmentIndex; i < firstSegmentIndexToKeep; i++ {
+ seg := segments[i]
+ metadataPath := seg.GetMetadataFilePath()
+
+ // Overwrite the old metadata file. The timestamp is encoded at [24:32] in nanoseconds since the epoch.
+ data, err := os.ReadFile(metadataPath)
+ require.NoError(t, err)
+ binary.BigEndian.PutUint64(data[24:32], sixHoursAgo)
+
+ // Write the modified metadata file back to disk.
+ err = os.WriteFile(metadataPath, data, 0644)
+ require.NoError(t, err)
+
+ // Record the keys in this segment. We shouldn't see them after pruning.
+ segmentKeys, err := seg.GetKeys()
+ require.NoError(t, err)
+ for _, key := range segmentKeys {
+ if _, exists := expectedPrunedKeys[tableName]; !exists {
+ expectedPrunedKeys[tableName] = make(map[string]struct{})
+ }
+ expectedPrunedKeys[tableName][string(key.Key)] = struct{}{}
+ }
+ }
+ }
+
+ // Now that we've doctored the segment files, tell prune to delete segments older than 1 hour.
+ // Strictly speaking there is a race condition in this test, but since the test suite will time
+ // out long before 1 hour elapses, in practice it can never be observed.
+ err = prune(logger, rootPaths, []string{}, 60*60 /* seconds */, false)
+ require.NoError(t, err)
+
+ // Reopen the DB and verify its contents.
+ db, err = littbuilder.NewDB(config)
+ require.NoError(t, err)
+
+ for tableName := range tables {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err)
+ tables[tableName] = table
+ }
+
+ for tableName, expected := range expectedData {
+ for key, value := range expected {
+ actual, ok, err := tables[tableName].Get([]byte(key))
+ require.NoError(t, err)
+
+ if _, pruned := expectedPrunedKeys[tableName][key]; pruned {
+ // The key should have been pruned.
+ require.False(t, ok)
+ require.Nil(t, actual)
+ } else {
+ // The key should still exist.
+ require.True(t, ok)
+ require.Equal(t, value, actual)
+ }
+ }
+ }
+
+ // Tear down.
+ err = db.Close()
+ require.NoError(t, err)
+}
+
+func TestPruneSubset(t *testing.T) {
+ t.Parallel()
+
+ ctx := t.Context()
+ logger := test.GetLogger()
+ rand := random.NewTestRandom()
+ testDirectory := t.TempDir()
+
+ errorMonitor := util.NewErrorMonitor(ctx, logger, nil)
+
+ rootPathCount := rand.Uint64Range(2, 5)
+ rootPaths := make([]string, rootPathCount)
+ for i := uint64(0); i < rootPathCount; i++ {
+ rootPaths[i] = path.Join(testDirectory, fmt.Sprintf("root-%d", i))
+ }
+
+ // Use a standard test configuration for LittDB.
+ config, err := litt.DefaultConfig(rootPaths...)
+ require.NoError(t, err) + config.Fsync = false + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(rand.Uint64Range(rootPathCount, 2*rootPathCount)) + config.TargetSegmentFileSize = 100 + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint64Range(2, 5) + tables := make(map[string]litt.Table, tableCount) + // we will only prune data from these tables. + tablesToPrune := make([]string, 0, tableCount/2) + tablesToPruneSet := make(map[string]struct{}, tableCount/2) + for i := uint64(0); i < tableCount; i++ { + tableName := fmt.Sprintf("table-%d", i) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables[tableName] = table + if i%2 == 0 { + // Only prune even-numbered tables. + tablesToPrune = append(tablesToPrune, tableName) + tablesToPruneSet[tableName] = struct{}{} + } + } + + // map from table name to keys to values + expectedData := make(map[string]map[string][]byte) + for _, table := range tables { + expectedData[table.Name()] = make(map[string][]byte) + } + + // Write some data into the DB. + for i := 0; i < 1000; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + tableName := fmt.Sprintf("table-%d", tableIndex) + table := tables[tableName] + + key := rand.String(32) + value := rand.PrintableVariableBytes(1, 100) + + err = table.Put([]byte(key), value) + require.NoError(t, err) + + expectedData[tableName][key] = value + } + + // Flush all tables to ensure data is written to disk. + for _, table := range tables { + err = table.Flush() + require.NoError(t, err) + } + + // Close the DB. Once this is done, override the timestamps on some of the segment files. + // We can then ask prune() to get rid of these segments without fear of race conditions. + err = db.Close() + require.NoError(t, err) + + // After pruning, the segment indexes in this map should be the lowest segment index that we keep for each table. 
+ firstSegmentIndexToKeepByTable := make(map[string]uint32)
+ // A map from table name to a set of keys that are expected to be pruned.
+ expectedPrunedKeys := make(map[string]map[string]struct{})
+
+ // This is the time we will assign to the "old" segments that we want to prune.
+ sixHoursAgo := uint64(time.Now().Add(-6 * time.Hour).UnixNano())
+
+ for tableName := range tables {
+ segmentPaths, err := segment.BuildSegmentPaths(rootPaths, "", tableName)
+ require.NoError(t, err)
+
+ lowSegmentIndex, highSegmentIndex, segments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ segmentPaths,
+ false,
+ time.Now(),
+ false,
+ false)
+ require.NoError(t, err)
+
+ firstSegmentIndexToKeep := lowSegmentIndex + (highSegmentIndex-lowSegmentIndex)/2
+ firstSegmentIndexToKeepByTable[tableName] = firstSegmentIndexToKeep
+
+ for i := lowSegmentIndex; i < firstSegmentIndexToKeep; i++ {
+ seg := segments[i]
+ metadataPath := seg.GetMetadataFilePath()
+
+ // Overwrite the old metadata file. The timestamp is encoded at [24:32] in nanoseconds since the epoch.
+ data, err := os.ReadFile(metadataPath)
+ require.NoError(t, err)
+ binary.BigEndian.PutUint64(data[24:32], sixHoursAgo)
+
+ // Write the modified metadata file back to disk.
+ err = os.WriteFile(metadataPath, data, 0644)
+ require.NoError(t, err)
+
+ // Record the keys in this segment. We shouldn't see them after pruning.
+ if _, pruneTable := tablesToPruneSet[tableName]; pruneTable {
+ segmentKeys, err := seg.GetKeys()
+ require.NoError(t, err)
+ for _, key := range segmentKeys {
+ if _, exists := expectedPrunedKeys[tableName]; !exists {
+ expectedPrunedKeys[tableName] = make(map[string]struct{})
+ }
+ expectedPrunedKeys[tableName][string(key.Key)] = struct{}{}
+ }
+ }
+ }
+ }
+
+ // Now that we've doctored the segment files, tell prune to delete segments older than 1 hour.
+ // Strictly speaking there is a race condition in this test, but since the test suite will time
+ // out long before 1 hour elapses, in practice it can never be observed.
+ err = prune(logger, rootPaths, tablesToPrune, 60*60 /* seconds */, false)
+ require.NoError(t, err)
+
+ // Reopen the DB and verify its contents.
+ db, err = littbuilder.NewDB(config)
+ require.NoError(t, err)
+
+ for tableName := range tables {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err)
+ tables[tableName] = table
+ }
+
+ for tableName, expected := range expectedData {
+ for key, value := range expected {
+ actual, ok, err := tables[tableName].Get([]byte(key))
+ require.NoError(t, err)
+
+ if _, pruned := expectedPrunedKeys[tableName][key]; pruned {
+ // The key should have been pruned.
+ require.False(t, ok)
+ require.Nil(t, actual)
+ } else {
+ // The key should still exist.
+ require.True(t, ok)
+ require.Equal(t, value, actual)
+ }
+ }
+ }
+
+ // Tear down.
+ err = db.Close()
+ require.NoError(t, err)
+}
diff --git a/sei-db/db_engine/litt/cli/push.go b/sei-db/db_engine/litt/cli/push.go
new file mode 100644
index 0000000000..a0fa3a94ce
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/push.go
@@ -0,0 +1,368 @@
+//go:build littdb_wip
+
+package main
+
+import (
+ "context"
+ "fmt"
+ "path"
+ "strings"
+ "sync/atomic"
+ "time"
+
+ "github.com/Layr-Labs/eigenda/common"
+ "github.com/Layr-Labs/eigenda/common/enforce"
+ "github.com/Layr-Labs/eigenda/litt/disktable"
+ "github.com/Layr-Labs/eigenda/litt/disktable/segment"
+ "github.com/Layr-Labs/eigenda/litt/util"
+ "github.com/Layr-Labs/eigensdk-go/logging"
+ "github.com/urfave/cli/v2"
+)
+
+func pushCommand(ctx *cli.Context) error {
+ if ctx.NArg() < 1 {
+ return fmt.Errorf("not enough arguments provided, must provide USER@HOST")
+ }
+
+ logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+ if err != nil {
+ return fmt.Errorf("failed to create logger: %w", err)
+ }
+
+ sources := 
ctx.StringSlice("src") + if len(sources) == 0 { + return fmt.Errorf("no sources provided") + } + for i, src := range sources { + var err error + sources[i], err = util.SanitizePath(src) + if err != nil { + return fmt.Errorf("invalid source path: %s", src) + } + } + + destinations := ctx.StringSlice("dest") + if len(destinations) == 0 { + return fmt.Errorf("no destinations provided") + } + + userHost := ctx.Args().First() + parts := strings.Split(userHost, "@") + if len(parts) != 2 { + return fmt.Errorf("invalid USER@HOST format: %s", userHost) + } + user := parts[0] + host := parts[1] + + port := ctx.Uint64("port") + + keyPath := ctx.String("key") + keyPath, err = util.SanitizePath(keyPath) + if err != nil { + return fmt.Errorf("invalid key path: %s", keyPath) + } + + knownHosts := ctx.String(knownHostsFileFlag.Name) + + deleteAfterTransfer := !ctx.Bool("no-gc") + threads := ctx.Uint64("threads") + verbose := !ctx.Bool("quiet") + throttleMB := ctx.Float64("throttle") + + return push( + logger, + sources, + destinations, + user, + host, + port, + keyPath, + knownHosts, + deleteAfterTransfer, + true, + threads, + throttleMB, + verbose) +} + +// push uses rsync to transfer LittDB data to the remote location(s) +func push( + logger logging.Logger, + sources []string, + destinations []string, + user string, + host string, + port uint64, + keyPath string, + knownHosts string, + deleteAfterTransfer bool, + fsync bool, + threads uint64, + throttleMB float64, + verbose bool) error { + + if len(sources) == 0 { + return fmt.Errorf("no source paths provided") + } + if len(destinations) == 0 { + return fmt.Errorf("no destination paths provided") + } + if threads == 0 { + return fmt.Errorf("threads must be greater than 0") + } + + // split bandwidth between workers + throttleMB /= float64(threads) + + // Lock source files. It would be nice to also lock the remote directories, but that's tricky given that + // we are interacting with the remote machine via SSH and rsync. 
+ releaseSourceLocks, err := util.LockDirectories(logger, sources, util.LockfileName, fsync) + if err != nil { + return fmt.Errorf("failed to lock source directories: %w", err) + } + defer releaseSourceLocks() + + // Create an SSH session to the remote host. + connection, err := util.NewSSHSession(logger, user, host, port, keyPath, knownHosts, verbose) + if err != nil { + return fmt.Errorf("failed to create SSH session to %s@%s port %d: %w", user, host, port, err) + } + + tables, err := lsPaths(logger, sources, false, fsync) + if err != nil { + return fmt.Errorf("failed to list tables in source paths %v: %w", sources, err) + } + + for _, tableName := range tables { + err = pushTable( + logger, + tableName, + sources, + destinations, + connection, + deleteAfterTransfer, + fsync, + throttleMB, + threads, + ) + + if err != nil { + return fmt.Errorf("failed to push table %s: %w", tableName, err) + } + } + + return nil +} + +// Figure out which files are already present at the destination(s). Although these files may be partial, we always +// want to preserve any pre-existing arrangements of files at the destination(s). +// +// The returned map is a map from file name (e.g. 1234.metadata) to the destination path (e.g. /path/to/remote/dir). +func mapExistingFiles( + destinations []string, + tableName string, + connection *util.SSHSession) (map[string]string, error) { + + existingFiles := make(map[string]string) + + extensions := []string{segment.MetadataFileExtension, segment.KeyFileExtension, segment.ValuesFileExtension} + + for _, dest := range destinations { + tableDestination := path.Join(dest, tableName, segment.SegmentDirectory) + filePaths, err := connection.FindFiles(tableDestination, extensions) + if err != nil { + return nil, fmt.Errorf("failed to list files in destination %s: %w", dest, err) + } + + for _, filePath := range filePaths { + // Extract the file name from the path. 
+ fileName := path.Base(filePath) + + enforce.MapDoesNotContainKey(existingFiles, fileName, + "duplicate file found: %s and %s", fileName, existingFiles[fileName]) + existingFiles[fileName] = dest + } + } + + return existingFiles, nil +} + +// Push the data in a single table to the remote location(s). +func pushTable( + logger logging.Logger, + tableName string, + sources []string, + destinations []string, + connection *util.SSHSession, + deleteAfterTransfer bool, + fsync bool, + throttleMB float64, + threads uint64) error { + + // Figure out where data currently exists at the destination(s). We don't want this operation to cause a file + // to exist in multiple places. + existingFilesMap, err := mapExistingFiles(destinations, tableName, connection) + if err != nil { + return fmt.Errorf("failed to map existing files at destinations: %w", err) + } + + segmentPaths, err := segment.BuildSegmentPaths(sources, "", tableName) + if err != nil { + return fmt.Errorf("failed to build segment paths for table %s at paths %v: %w", tableName, sources, err) + } + + errorMonitor := util.NewErrorMonitor(context.Background(), logger, nil) + + // Gather segment files to send. + lowestSegmentIndex, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + segmentPaths, + false, + time.Now(), + false, + fsync) + if err != nil { + return fmt.Errorf("failed to gather segment files for table %s at paths %v: %w", + tableName, sources, err) + } + + if len(segments) == 0 { + logger.Infof("No segments found for table %s", tableName) + return nil + } + + // Special handling if we are transferring data from a snapshot. 
+ isSnapshot, err := segments[lowestSegmentIndex].IsSnapshot()
+ if err != nil {
+ return fmt.Errorf("failed to check if segment %d is a snapshot: %w", lowestSegmentIndex, err)
+ }
+ if isSnapshot {
+ if len(sources) > 1 {
+ return fmt.Errorf("table %s is a snapshot, but more than one source directory was found: %v",
+ tableName, sources)
+ }
+
+ boundaryFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(sources[0], tableName))
+ if err != nil {
+ return fmt.Errorf("failed to load boundary file for table %s at path %s: %w",
+ tableName, sources[0], err)
+ }
+
+ if boundaryFile.IsDefined() {
+ highestSegmentIndex = boundaryFile.BoundaryIndex()
+ }
+ } else if deleteAfterTransfer {
+ return fmt.Errorf("--no-gc is required when pushing a non-snapshot table")
+ }
+
+ // Ensure the remote segment directories exist.
+ for _, dest := range destinations {
+ segmentDir := path.Join(dest, tableName, segment.SegmentDirectory)
+ err = connection.Mkdirs(segmentDir)
+ if err != nil {
+ return fmt.Errorf("failed to create segment directory %s at destination %s: %w",
+ segmentDir, dest, err)
+ }
+ }
+
+ // Used to limit rsync concurrency.
+ rsyncLimiter := make(chan struct{}, threads)
+
+ rsyncsInProgress := atomic.Int64{}
+
+ // Transfer the files.
+ for i := lowestSegmentIndex; i <= highestSegmentIndex; i++ {
+ seg := segments[i]
+ filesToTransfer := seg.GetFilePaths()
+
+ for _, filePath := range filesToTransfer {
+ fileName := path.Base(filePath)
+
+ destination := ""
+ if existingDest, exists := existingFilesMap[fileName]; exists {
+ destination = existingDest
+ } else {
+ destination, err = determineDestination(fileName, destinations)
+ if err != nil {
+ return fmt.Errorf("failed to determine destination for file %s: %w", fileName, err)
+ }
+ }
+
+ targetLocation := path.Join(destination, tableName, segment.SegmentDirectory, fileName)
+
+ rsyncLimiter <- struct{}{}
+ rsyncsInProgress.Add(1)
+
+ boundFilePath := filePath
+ go func() {
+ // Use a goroutine-local error; writing to the enclosing err from
+ // concurrent goroutines would be a data race.
+ if rsyncErr := connection.Rsync(boundFilePath, targetLocation, throttleMB); rsyncErr != nil {
+ errorMonitor.Panic(rsyncErr)
+ }
+ <-rsyncLimiter
+ rsyncsInProgress.Add(-1)
+ }()
+ }
+ }
+
+ // Wait for all rsyncs to complete.
+ for rsyncsInProgress.Load() > 0 {
+ time.Sleep(100 * time.Millisecond)
+ }
+
+ // Check if there were any errors during the transfer.
+ if ok, err := errorMonitor.IsOk(); !ok {
+ return fmt.Errorf("error detected during transfer: %w", err)
+ }
+
+ // Now that we have transferred the files, we can delete them if requested.
+ if deleteAfterTransfer {
+ enforce.True(isSnapshot, "we should have already returned an error if this is a non-snapshot table")
+
+ err = deleteLocalSegments(segments, tableName, true, sources, highestSegmentIndex)
+ if err != nil {
+ return fmt.Errorf("failed to delete segments after transfer: %w", err)
+ }
+ }
+
+ return nil
+}
+
+// Deletes local segments after they have been successfully transferred to the remote destination(s).
+func deleteLocalSegments(
+ segments map[uint32]*segment.Segment,
+ tableName string,
+ isSnapshot bool,
+ sources []string,
+ highestSegmentIndex uint32) error {
+
+ // Delete the segments.
+ for _, seg := range segments {
+ seg.Release()
+ }
+ // Wait for deletion to complete.
+ for _, seg := range segments { + err := seg.BlockUntilFullyDeleted() + if err != nil { + return fmt.Errorf("failed to delete segment %d for table %s: %w", + seg.SegmentIndex(), tableName, err) + } + } + + if isSnapshot { + // If we are dealing with a snapshot, update the lower bound file. + boundaryFile, err := disktable.LoadBoundaryFile(disktable.LowerBound, path.Join(sources[0], tableName)) + if err != nil { + return fmt.Errorf("failed to load boundary file for table %s at path %s: %w", + tableName, sources[0], err) + } + + err = boundaryFile.Update(highestSegmentIndex) + if err != nil { + return fmt.Errorf("failed to update boundary file for table %s at path %s: %w", + tableName, sources[0], err) + } + } + return nil +} diff --git a/sei-db/db_engine/litt/cli/push_test.go b/sei-db/db_engine/litt/cli/push_test.go new file mode 100644 index 0000000000..bf391a5479 --- /dev/null +++ b/sei-db/db_engine/litt/cli/push_test.go @@ -0,0 +1,710 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "os" + "path" + "path/filepath" + "strconv" + "strings" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func pushTest( + t *testing.T, + sourceDirs uint64, + destDirs uint64, + verbose bool, +) { + logger := test.GetLogger() + rand := random.NewTestRandom() + testDir := t.TempDir() + sourceRoot := path.Join(testDir, "source") + destRoot := path.Join(testDir, "dest") + + err := os.MkdirAll(sourceRoot, 0755) + require.NoError(t, err) + err = os.MkdirAll(destRoot, 0755) + require.NoError(t, err) + + // Start a container that is running an SSH server. 
The push() command will communicate with this server. + container := util.SetupSSHTestContainer(t, destRoot) + defer container.Cleanup() + + sourceDirList := make([]string, 0, sourceDirs) + // The destination directories relative to the test's perspective of the filesystem. + destDirList := make([]string, 0, destDirs) + // The destination directories relative to the container's perspective of the filesystem. + dockerDestDirList := make([]string, 0, destDirs) + + for i := uint64(0); i < sourceDirs; i++ { + sourceDirList = append(sourceDirList, path.Join(sourceRoot, fmt.Sprintf("source-%d", i))) + } + for i := uint64(0); i < destDirs; i++ { + dir := fmt.Sprintf("dest-%d", i) + destDirList = append(destDirList, path.Join(destRoot, dir)) + dockerDestDirList = append(dockerDestDirList, path.Join(container.GetDataDir(), dir)) + } + + tableCount := rand.Uint64Range(2, 4) + tableNames := make([]string, 0, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableNames = append(tableNames, rand.String(32)) + } + + shardingFactor := sourceDirs + rand.Uint64Range(0, 4) + + config, err := litt.DefaultConfig(sourceDirList...) + require.NoError(t, err) + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(shardingFactor) + config.Fsync = false + config.TargetSegmentFileSize = 1024 + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + expectedData := make(map[string] /*table*/ map[string] /*value*/ []byte) + for _, tableName := range tableNames { + expectedData[tableName] = make(map[string][]byte) + } + + // Insert data into the tables. 
+ keyCount := uint64(1024) + for i := uint64(0); i < keyCount; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + table, err := db.GetTable(tableNames[tableIndex]) + require.NoError(t, err) + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 100) + + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "failed to put key %s in table %s", key, table.Name()) + } + + // Flush all tables. + for _, tableName := range tableNames { + table, err := db.GetTable(tableName) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err, "failed to flush table %s", table.Name()) + } + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + // Verify expected directories. + for _, sourceDir := range sourceDirList { + // We should see each source dir. + exists, err := util.Exists(sourceDir) + require.NoError(t, err) + require.True(t, exists, "source directory %s does not exist", sourceDir) + } + for _, destDir := range destDirList { + // We should not see dest dirs yet. + exists, err := util.Exists(destDir) + require.NoError(t, err) + require.False(t, exists, "destination directory %s exists", destDir) + } + + // pushing with the DB still open should fail. 
+ err = push(logger, sourceDirList, dockerDestDirList, container.GetUser(), container.GetHost(),
+ container.GetSSHPort(), container.GetPrivateKeyPath(), "", false,
+ false, 2, 1, verbose)
+ require.Error(t, err)
+
+ // None of the source dirs should have been deleted.
+ for _, sourceDir := range sourceDirList {
+ // We should see each source dir.
+ exists, err := util.Exists(sourceDir)
+ require.NoError(t, err)
+ require.True(t, exists, "source directory %s does not exist", sourceDir)
+ }
+
+ // The failed push should not have changed the data in the DB.
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "failed to get table %s", tableName)
+ for key := range expectedData[tableName] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err, "failed to get key %s in table %s", key, tableName)
+ require.True(t, ok, "key %s not found in table %s", key, tableName)
+ require.Equal(t, expectedData[tableName][key], value,
+ "value for key %s in table %s does not match expected value", key, tableName)
+ }
+ }
+
+ // Shut down the DB and push it.
+ err = db.Close()
+ require.NoError(t, err, "failed to close DB")
+
+ // Deleting after transfer is only supported for snapshots (which we are not testing here).
+ err = push(logger, sourceDirList, dockerDestDirList, container.GetUser(), container.GetHost(),
+ container.GetSSHPort(), container.GetPrivateKeyPath(), "", true,
+ false, 2, 1, verbose)
+ require.Error(t, err)
+
+ // Actually push it correctly now.
+ err = push(logger, sourceDirList, dockerDestDirList, container.GetUser(), container.GetHost(),
+ container.GetSSHPort(), container.GetPrivateKeyPath(), "", false,
+ false, 8, 1, verbose)
+ require.NoError(t, err, "failed to push")
+
+ // Verify the new directories.
+ for _, sourceDir := range sourceDirList { + exists, err := util.Exists(sourceDir) + require.NoError(t, err) + + // Even if we are deleting after transfer, the source directories should still exist. + require.True(t, exists, "source directory %s does not exist but should", sourceDir) + } + for _, destDir := range destDirList { + // We should see all destination dirs. + exists, err := util.Exists(destDir) + require.NoError(t, err) + require.True(t, exists, "destination directory %s does not exist", destDir) + } + + // Push works when there is nothing at the destination. It also works when some of the files are present or + // corrupted. Let's mess with the files at the destination and make sure that the push command is able to fix + // things afterward. + filesInTree := make([]string, 0) + err = filepath.Walk(destRoot, func(path string, info os.FileInfo, err error) error { + if err != nil { + return err + } + + if info.IsDir() { + // Skip directories. + return nil + } + + filesInTree = append(filesInTree, path) + + return nil + }) + require.NoError(t, err) + + for _, segmentFile := range filesInTree { + choice := rand.Float64() + + if choice < 0.3 { + // Delete the file. Push will copy it over again. + err = os.Remove(segmentFile) + require.NoError(t, err, "failed to delete file %s", segmentFile) + } else if choice < 0.6 { + // Overwrite the file with random data. Push will replace it with the correct data. + randomData := rand.Bytes(128) + // use broad file permissions to avoid issues with container user having different UID/GID. + err = os.WriteFile(segmentFile, randomData, 0666) + require.NoError(t, err, "failed to overwrite file %s", segmentFile) + } else if choice < 0.9 { + // Attempt to move the file to another legal location. + + if len(destDirList) == 1 { + // We can't move a file to a different directory if there is only one destination directory. 
+ continue
+ }
+
+ // Segment files will have the following format: destRoot/dest-N/tableName/segments/segmentFileName
+ // We want to change the "dest-N" part. This is a legal location for the data, since it doesn't matter
+ // which destination directory the data is in, as long as it is in one of them.
+
+ parts := strings.Split(segmentFile, string(os.PathSeparator))
+ require.Greater(t, len(parts), 3, "unexpected path format: %s", segmentFile)
+
+ oldDir := parts[len(parts)-4] // This is the "dest-N" part.
+ oldDirIndexString := strings.Replace(oldDir, "dest-", "", 1)
+ oldDirIndex, err := strconv.Atoi(oldDirIndexString)
+ require.NoError(t, err)
+ newDirIndex := (oldDirIndex + 1) % len(destDirList) // Move to the next destination directory.
+ newPath := strings.Replace(segmentFile, oldDir, fmt.Sprintf("dest-%d", newDirIndex), 1)
+
+ err = os.Rename(segmentFile, newPath)
+ require.NoError(t, err)
+ }
+ }
+
+ // Push again, should fix the messed up files.
+ err = push(logger, sourceDirList, dockerDestDirList, container.GetUser(), container.GetHost(),
+ container.GetSSHPort(), container.GetPrivateKeyPath(), "", false,
+ false, 2, 1, verbose)
+ require.NoError(t, err)
+
+ // Reopen the old DB, verify no data is missing.
+ db, err = littbuilder.NewDB(config)
+ require.NoError(t, err, "failed to reopen DB after push")
+
+ // Verify the data in the DB.
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "failed to get table %s", tableName)
+ for key := range expectedData[tableName] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err, "failed to get key %s in table %s", key, tableName)
+ require.True(t, ok, "key %s not found in table %s", key, tableName)
+ require.Equal(t, expectedData[tableName][key], value,
+ "value for key %s in table %s does not match expected value", key, tableName)
+ }
+ }
+
+ // Fully delete the old DB.
The new DB should be a copy of the old one, so this should not affect copied data. + err = db.Destroy() + require.NoError(t, err) + + // Push should NOT copy the keymap. Verify that there is no keymap directory in destRoot. + err = filepath.Walk(destRoot, func(path string, info os.FileInfo, err error) error { + if err != nil { + return err + } + require.False(t, strings.Contains(path, keymap.KeymapDirectoryName)) + return nil + }) + require.NoError(t, err) + + // Reopen the DB at the new destination directories. + config.Paths = destDirList + db, err = littbuilder.NewDB(config) + require.NoError(t, err, "failed to open DB after rebase") + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + err = db.Close() + require.NoError(t, err, "failed to close DB after rebase") +} + +func TestPush1to1(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + sourceDirs := uint64(1) + destDirs := uint64(1) + + pushTest(t, sourceDirs, destDirs, false) +} + +func TestPush1toN(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + sourceDirs := uint64(1) + destDirs := uint64(4) + + pushTest(t, sourceDirs, destDirs, false) +} + +func TestPushNto1(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + sourceDirs := uint64(4) + destDirs := uint64(1) + + pushTest(t, sourceDirs, destDirs, false) +} + +func TestPushNtoN(t *testing.T) { + 
t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + sourceDirs := uint64(4) + destDirs := uint64(4) + + // This test is run in verbose mode to make sure we don't crash when that is enabled. + // Other tests in this file are not run in verbose mode to reduce log clutter. + pushTest(t, sourceDirs, destDirs, true) +} + +func TestPushSnapshot(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + ctx := t.Context() + logger := test.GetLogger() + + rand := random.NewTestRandom() + sourceRoot := t.TempDir() + destRoot := t.TempDir() + snapshotDir := path.Join(t.TempDir(), "snapshot") + + sourceDirs := rand.Uint64Range(2, 4) + destDirs := rand.Uint64Range(2, 4) + + // Start a container that is running an SSH server. The push() command will communicate with this server. + container := util.SetupSSHTestContainer(t, destRoot) + defer container.Cleanup() + + sourceDirList := make([]string, 0, sourceDirs) + // The destination directories relative to the test's perspective of the filesystem. + destDirList := make([]string, 0, destDirs) + // The destination directories relative to the container's perspective of the filesystem. + dockerDestDirList := make([]string, 0, destDirs) + + for i := uint64(0); i < sourceDirs; i++ { + sourceDirList = append(sourceDirList, path.Join(sourceRoot, fmt.Sprintf("source-%d", i))) + } + for i := uint64(0); i < destDirs; i++ { + dir := fmt.Sprintf("dest-%d", i) + destDirList = append(destDirList, path.Join(destRoot, dir)) + dockerDestDirList = append(dockerDestDirList, path.Join(container.GetDataDir(), dir)) + } + + tableCount := rand.Uint64Range(2, 4) + tableNames := make([]string, 0, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableNames = append(tableNames, rand.String(32)) + } + + shardingFactor := sourceDirs + rand.Uint64Range(0, 4) + + config, err := litt.DefaultConfig(sourceDirList...) 
+ require.NoError(t, err) + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(shardingFactor) + config.Fsync = false + config.TargetSegmentFileSize = 1024 + config.SnapshotDirectory = snapshotDir + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + expectedData := make(map[string] /*table*/ map[string] /*value*/ []byte) + for _, tableName := range tableNames { + expectedData[tableName] = make(map[string][]byte) + } + + // Insert data into the tables. + keyCount := uint64(1024) + for i := uint64(0); i < keyCount; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + table, err := db.GetTable(tableNames[tableIndex]) + require.NoError(t, err) + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 100) + + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "failed to put key %s in table %s", key, table.Name()) + } + + // Flush all tables. + for _, tableName := range tableNames { + table, err := db.GetTable(tableName) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err, "failed to flush table %s", table.Name()) + } + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + // Verify expected directories. + for _, sourceDir := range sourceDirList { + // We should see each source dir. 
+ exists, err := util.Exists(sourceDir) + require.NoError(t, err) + require.True(t, exists, "source directory %s does not exist", sourceDir) + } + for _, destDir := range destDirList { + // We should not see dest dirs yet. + exists, err := util.Exists(destDir) + require.NoError(t, err) + require.False(t, exists, "destination directory %s exists", destDir) + } + + // pushing with the DB still open should fail. + err = push(logger, sourceDirList, dockerDestDirList, container.GetUser(), container.GetHost(), + container.GetSSHPort(), container.GetPrivateKeyPath(), "", false, + false, 2, 1, false) + require.Error(t, err) + + // None of the source dirs should have been deleted. + for _, sourceDir := range sourceDirList { + // We should see each source dir. + exists, err := util.Exists(sourceDir) + require.NoError(t, err) + require.True(t, exists, "source directory %s does not exist", sourceDir) + } + + // The failed push should not have changed the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + // Power cycle the DB twice. After the first shutdown, the last segment with data will not have been copied + // to the snapshot directory. When the database starts a second time, it will seal the last segment and make + // sure the snapshot directory includes it. + err = db.Close() + require.NoError(t, err, "failed to close DB") + + // Find the highest segment index for each table. We will use it to do verification later. 
+ errorMonitor := util.NewErrorMonitor(ctx, logger, nil)
+ highestSegmentIndexForTable := make(map[string]uint32)
+ for tableName := range expectedData {
+ segmentPaths, err := segment.BuildSegmentPaths(sourceDirList, "", tableName)
+ require.NoError(t, err, "failed to build segment paths for table %s", tableName)
+ _, highestSegmentIndex, _, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ segmentPaths,
+ false,
+ time.Now(),
+ false,
+ false)
+ require.NoError(t, err)
+ highestSegmentIndexForTable[tableName] = highestSegmentIndex
+ }
+ ok, err := errorMonitor.IsOk()
+ require.NoError(t, err)
+ require.True(t, ok)
+
+ // Second power cycle
+ db, err = littbuilder.NewDB(config)
+ require.NoError(t, err)
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "failed to get table %s", tableName)
+ err = table.Flush()
+ require.NoError(t, err, "failed to flush table %s", table.Name())
+ }
+ err = db.Close()
+ require.NoError(t, err, "failed to close DB after second open")
+
+ // Push the data. Do not delete the snapshot yet.
+ err = push(logger, []string{snapshotDir}, dockerDestDirList, container.GetUser(), container.GetHost(),
+ container.GetSSHPort(), container.GetPrivateKeyPath(), "", false,
+ false, 8, 1, false)
+ require.NoError(t, err, "failed to push snapshot data")
+
+ // Verify the new directories.
+ for _, sourceDir := range sourceDirList {
+ exists, err := util.Exists(sourceDir)
+ require.NoError(t, err)
+
+ // Even if we are deleting after transfer, the source directories should still exist.
+ require.True(t, exists, "source directory %s does not exist but should", sourceDir)
+ }
+ for _, destDir := range destDirList {
+ // We should see all destination dirs.
+ exists, err := util.Exists(destDir)
+ require.NoError(t, err)
+ require.True(t, exists, "destination directory %s does not exist", destDir)
+ }
+
+ // Push works when there is nothing at the destination. 
It also works when some of the files are present or + // corrupted. Let's mess with the files at the destination and make sure that the push command is able to fix + // things afterward. + filesInTree := make([]string, 0) + err = filepath.Walk(destRoot, func(path string, info os.FileInfo, err error) error { + if err != nil { + return err + } + + if info.IsDir() { + // Skip directories. + return nil + } + + filesInTree = append(filesInTree, path) + + return nil + }) + require.NoError(t, err) + + for _, segmentFile := range filesInTree { + choice := rand.Float64() + + if choice < 0.3 { + // Delete the file. Push will copy it over again. + err = os.Remove(segmentFile) + require.NoError(t, err, "failed to delete file %s", segmentFile) + } else if choice < 0.6 { + // Overwrite the file with random data. Push will replace it with the correct data. + randomData := rand.Bytes(128) + err = os.WriteFile(segmentFile, randomData, 0644) + require.NoError(t, err, "failed to overwrite file %s", segmentFile) + } else if choice < 0.9 { + // Attempt to move the file to another legal location. + + if len(destDirList) == 1 { + // We can't move a file to a different directory if there is only one destination directory. + continue + } + + // Segment files will have the following format: destRoot/dest-N/tableName/segments/segmentFileName + // We want to change the "dest-N" part. This is a legal location for the data, since it doesn't matter + // which destination directory the data is in, as long as it is in one of them. + + parts := strings.Split(segmentFile, string(os.PathSeparator)) + require.Greater(t, len(parts), 3, "unexpected path format: %s", segmentFile) + + oldDir := parts[len(parts)-4] // This is the "dest-N" part. + oldDirIndexString := strings.Replace(oldDir, "dest-", "", 1) + oldDirIndex, err := strconv.Atoi(oldDirIndexString) + require.NoError(t, err) + newDirIndex := (oldDirIndex + 1) % len(destDirList) // Move to the next destination directory. 
+ newPath := strings.Replace(segmentFile, oldDir, fmt.Sprintf("dest-%d", newDirIndex), 1) + + err = os.Rename(segmentFile, newPath) + require.NoError(t, err) + } + } + + // Push again, should fix the messed up files. This time, tell the push command to clean up after itself. + err = push(logger, []string{snapshotDir}, dockerDestDirList, container.GetUser(), container.GetHost(), + container.GetSSHPort(), container.GetPrivateKeyPath(), "", true, + false, 2, 1, false) + require.NoError(t, err) + + // We instructed push() to delete files after pushing. For each table, we should observe a "lower bound" file + // with a segment index that matches the expected highest segment index for that table. This boundary file signals + // to LittDB that it shouldn't recreate the snapshot files that have been copied and deleted by push(). + for tableName, highestSegmentIndex := range highestSegmentIndexForTable { + tableSnapshotDir := path.Join(snapshotDir, tableName) + boundaryFile, err := disktable.LoadBoundaryFile(false, tableSnapshotDir) + require.NoError(t, err) + require.True(t, boundaryFile.IsDefined(), "boundary file for table %s is not defined", tableName) + require.Equal(t, highestSegmentIndex, boundaryFile.BoundaryIndex()) + } + + // There should be no segment files remaining in the snapshot directory. + err = filepath.Walk(snapshotDir, func(path string, info os.FileInfo, err error) error { + require.NoError(t, err) + require.False(t, strings.Contains(path, segment.MetadataFileExtension), + "unexpected file: %s", path) + require.False(t, strings.Contains(path, segment.KeyFileExtension), + "unexpected file: %s", path) + require.False(t, strings.Contains(path, segment.ValuesFileExtension), + "unexpected file: %s", path) + return nil + }) + require.NoError(t, err) + + // There should also not be any segment files in the hard link directories. 
+ err = filepath.Walk(sourceRoot, func(path string, info os.FileInfo, err error) error { + require.NoError(t, err) + + inHardLinkDir := strings.Contains(path, segment.HardLinkDirectory) + if !inHardLinkDir { + return nil + } + + require.False(t, strings.Contains(path, segment.MetadataFileExtension), + "unexpected file: %s", path) + require.False(t, strings.Contains(path, segment.KeyFileExtension), + "unexpected file: %s", path) + require.False(t, strings.Contains(path, segment.ValuesFileExtension), + "unexpected file: %s", path) + return nil + }) + require.NoError(t, err) + + // Reopen the old DB, verify no data is missing. + db, err = littbuilder.NewDB(config) + require.NoError(t, err, "failed to open DB after rebase") + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + // Fully delete the old DB. The new DB should be a copy of the old one, so this should not affect copied data. + err = db.Destroy() + require.NoError(t, err) + + // Push should NOT copy the keymap. Verify that there is no keymap directory in destRoot. + err = filepath.Walk(destRoot, func(path string, info os.FileInfo, err error) error { + if err != nil { + return err + } + require.False(t, strings.Contains(path, keymap.KeymapDirectoryName)) + return nil + }) + require.NoError(t, err) + + // Reopen the DB at the new destination directories. 
+ config.Paths = destDirList + config.SnapshotDirectory = "" + db, err = littbuilder.NewDB(config) + require.NoError(t, err, "failed to open DB after rebase") + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + err = db.Close() + require.NoError(t, err, "failed to close DB after rebase") +} diff --git a/sei-db/db_engine/litt/cli/rebase.go b/sei-db/db_engine/litt/cli/rebase.go new file mode 100644 index 0000000000..3a17004232 --- /dev/null +++ b/sei-db/db_engine/litt/cli/rebase.go @@ -0,0 +1,593 @@ +//go:build littdb_wip + +package main + +import ( + "bufio" + "errors" + "fmt" + "hash/fnv" + "os" + "path" + "path/filepath" + "sync/atomic" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/urfave/cli/v2" +) + +// rebaseCommand is the command to rebase a LittDB database. 
+func rebaseCommand(ctx *cli.Context) error {
+ logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+ if err != nil {
+ return fmt.Errorf("failed to create logger: %w", err)
+ }
+
+ sources := ctx.StringSlice("src")
+ if len(sources) == 0 {
+ return fmt.Errorf("no sources provided")
+ }
+ for i, src := range sources {
+ var err error
+ sources[i], err = util.SanitizePath(src)
+ if err != nil {
+ return fmt.Errorf("failed to sanitize path %s: %w", src, err)
+ }
+ }
+
+ destinations := ctx.StringSlice("dst")
+ if len(destinations) == 0 {
+ return fmt.Errorf("no destinations provided")
+ }
+ for i, dest := range destinations {
+ var err error
+ destinations[i], err = util.SanitizePath(dest)
+ if err != nil {
+ return fmt.Errorf("failed to sanitize path %s: %w", dest, err)
+ }
+ }
+
+ preserveOriginal := ctx.Bool("preserve")
+ verbose := !ctx.Bool("quiet")
+
+ return rebase(logger, sources, destinations, preserveOriginal, true, verbose)
+}
+
+// rebase moves LittDB database files from one location to another (locally). This function is idempotent. If it
+// crashes part of the way through, just run it again and it will continue where it left off.
+func rebase(
+ logger logging.Logger,
+ sources []string,
+ destinations []string,
+ preserveOriginal bool,
+ fsync bool,
+ verbose bool,
+) error {
+
+ sourceSet := make(map[string]struct{})
+ for _, src := range sources {
+ exists, err := util.Exists(src)
+ if err != nil {
+ return fmt.Errorf("error checking if source path %s exists: %w", src, err)
+ }
+ // Ignore non-existent source paths. They could have been deleted by a prior run of this command. 
+ if exists { + sourceSet[src] = struct{}{} + } + } + + destinationSet := make(map[string]struct{}) + for _, dest := range destinations { + destinationSet[dest] = struct{}{} + + err := util.EnsureDirectoryExists(dest, fsync) + if err != nil { + return fmt.Errorf("error ensuring destination path %s exists: %w", dest, err) + } + } + // Don't immediately take a lock on the source directories. Each source directory will be locked individually + // before its data is transferred. Because source directories are deleted after their data is transferred, + // it is inconvenient to hold the locks in this outer scope (since we need to release the lock to + // delete the directory). + + // Acquire locks on all destination directories. + releaseDestinationLocks, err := util.LockDirectories(logger, destinations, util.LockfileName, fsync) + if err != nil { + return fmt.Errorf("failed to acquire locks on destination directories %v: %w", destinations, err) + } + defer releaseDestinationLocks() + + // Figure out which directories are going away. We will need to transfer their data to new locations. + directoriesGoingAway := make([]string, 0, len(sourceSet)) + for source := range sourceSet { + // If the source directory is not in the destination set, it is going away. + if _, ok := destinationSet[source]; !ok { + directoriesGoingAway = append(directoriesGoingAway, source) + } + } + + var segmentFileCount atomic.Int64 + totalSegmentFileCount, symlinkFound, err := countSegmentFiles(directoriesGoingAway) + if err != nil { + return fmt.Errorf("failed to count segment files in sources %v: %w", sources, err) + } + + if symlinkFound { + // If any of the segment files are symlinks, that means that we are dealing with a snapshot. + return errors.New( + "snapshot detected (source files contain symlinks). Rebasing from a snapshot is not supported") + } + + // For each directory that is going away, transfer its data to the new destination. 
+ for _, source := range directoriesGoingAway { + err := transferDataInDirectory( + logger, + source, + destinations, + preserveOriginal, + fsync, + verbose, + totalSegmentFileCount, + &segmentFileCount) + if err != nil { + return fmt.Errorf("error transferring data from %s to %v: %w", + source, destinations, err) + } + } + + return nil +} + +// Get a count of the segment files in the source directories. +// Also checks whether any of the segment files are symlinks. +func countSegmentFiles(sources []string) (count int64, symlinkFound bool, err error) { + for _, source := range sources { + exists, err := util.Exists(source) + if err != nil { + return 0, false, fmt.Errorf("failed to check if source directory %s exists: %w", source, err) + } + if !exists { + continue + } + + // Walk the file tree to find all files ending with .metadata, .keys, or .values. + err = filepath.WalkDir(source, func(path string, d os.DirEntry, err error) error { + if err != nil { + return fmt.Errorf("error walking directory %s: %w", path, err) + } + + if d.IsDir() { + // Skip directories + return nil + } + + // Ignore "table.metadata" files, as they are not segment files. + if d.Name() == disktable.TableMetadataFileName { + return nil + } + + // Check if the file is a segment file. + extension := filepath.Ext(path) + if extension == segment.MetadataFileExtension || + extension == segment.KeyFileExtension || + extension == segment.ValuesFileExtension { + + fileInfo, err := os.Lstat(path) + if err != nil { + return fmt.Errorf("failed to get file info for %s: %w", path, err) + } + isSymlink := fileInfo.Mode()&os.ModeSymlink != 0 + symlinkFound = isSymlink || symlinkFound + + count++ + } + + return nil + }) + + if err != nil { + return 0, false, fmt.Errorf("error counting segment files in source directories: %w", err) + } + } + + return count, symlinkFound, nil +} + +// transfers all data in a directory to the specified destinations. 
+func transferDataInDirectory( + logger logging.Logger, + source string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, + totalSegmentFileCount int64, + segmentFileCount *atomic.Int64, +) error { + exists, err := util.Exists(source) + if err != nil { + return fmt.Errorf("failed to check if source directory %s exists: %w", source, err) + } + if !exists { + return nil + } + + // Acquire a lock on the source directory. + lockPath := path.Join(source, util.LockfileName) + lock, err := util.NewFileLock(logger, lockPath, fsync) + if err != nil { + return fmt.Errorf("failed to acquire lock on %s: %w", source, err) + } + defer lock.Release() // double release is a no-op + + // Transfer each table stored in this directory. + children, err := os.ReadDir(source) + if err != nil { + return fmt.Errorf("failed to read directory %s: %w", source, err) + } + for _, child := range children { + if !child.IsDir() { + continue + } + + err = transferDataInTable( + logger, + source, + child.Name(), + destinations, + preserveOriginal, + fsync, + verbose, + totalSegmentFileCount, + segmentFileCount) + if err != nil { + return fmt.Errorf("error transferring data in table %s: %w", child.Name(), err) + } + } + + // Release the lock so we can delete the directory. + lock.Release() + + if !preserveOriginal { + // Delete the directory. 
+ err = os.Remove(source) + if err != nil { + return fmt.Errorf("failed to remove source directory %s: %w", source, err) + } + } + + return nil +} + +func transferDataInTable( + logger logging.Logger, + source string, + tableName string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, + totalSegmentFileCount int64, + segmentFileCount *atomic.Int64, +) error { + + err := createDestinationTableDirectories(destinations, tableName, fsync) + if err != nil { + return fmt.Errorf("failed to create destination table directories for table %s: %w", tableName, err) + } + + err = transferKeymap(source, tableName, destinations, preserveOriginal, fsync, verbose) + if err != nil { + return fmt.Errorf("failed to transfer keymap for table %s: %w", tableName, err) + } + + err = transferTableMetadata(source, tableName, destinations, preserveOriginal, fsync, verbose) + if err != nil { + return fmt.Errorf("failed to transfer table metadata for table %s: %w", tableName, err) + } + + err = transferSegmentData( + source, + tableName, + destinations, + preserveOriginal, + fsync, + verbose, + totalSegmentFileCount, + segmentFileCount) + if err != nil { + return fmt.Errorf("failed to transfer segment data for table %s: %w", tableName, err) + } + + if !preserveOriginal { + err = deleteSnapshotDirectory(source, tableName) + if err != nil { + return fmt.Errorf("failed to delete snapshot directory for table %s: %w", tableName, err) + } + + err = deleteBoundaryFiles(logger, source, tableName, verbose) + if err != nil { + return fmt.Errorf("failed to delete boundary files for table %s: %w", tableName, err) + } + + // Once all data in a table is transferred, delete the table directory. + sourceTableDir := filepath.Join(source, tableName) + err = os.Remove(sourceTableDir) + if err != nil { + return fmt.Errorf("failed to remove table directory %s: %w", sourceTableDir, err) + } + } + + return nil +} + +// deleteBoundaryFiles deletes the boundary files for a table. 
These will only be present if the source
+// directory contains symlink snapshots.
+func deleteBoundaryFiles(logger logging.Logger, source string, tableName string, verbose bool) error {
+ lowerBoundPath := path.Join(source, tableName, disktable.LowerBoundFileName)
+ exists, err := util.Exists(lowerBoundPath)
+ if err != nil {
+ return fmt.Errorf("failed to check if lower bound file %s exists: %w", lowerBoundPath, err)
+ }
+ if exists {
+ if verbose {
+ logger.Infof("Deleting lower bound file: %s", lowerBoundPath)
+ }
+ err = os.Remove(lowerBoundPath)
+ if err != nil {
+ return fmt.Errorf("failed to remove lower bound file %s: %w", lowerBoundPath, err)
+ }
+ }
+
+ upperBoundPath := path.Join(source, tableName, disktable.UpperBoundFileName)
+ exists, err = util.Exists(upperBoundPath)
+ if err != nil {
+ return fmt.Errorf("failed to check if upper bound file %s exists: %w", upperBoundPath, err)
+ }
+ if exists {
+ if verbose {
+ logger.Infof("Deleting upper bound file: %s", upperBoundPath)
+ }
+ err = os.Remove(upperBoundPath)
+ if err != nil {
+ return fmt.Errorf("failed to remove upper bound file %s: %w", upperBoundPath, err)
+ }
+ }
+
+ return nil
+}
+
+// deleteSnapshotDirectory deletes the old snapshot directory for a table. This will be reconstructed the next
+// time the DB is loaded.
+func deleteSnapshotDirectory(source string, tableName string) error {
+ snapshotDir := filepath.Join(source, tableName, segment.HardLinkDirectory)
+
+ exists, err := util.Exists(snapshotDir)
+ if err != nil {
+ return fmt.Errorf("failed to check if snapshot directory %s exists: %w", snapshotDir, err)
+ }
+ if !exists {
+ return nil
+ }
+
+ err = os.RemoveAll(snapshotDir)
+ if err != nil {
+ return fmt.Errorf("failed to remove snapshot directory %s: %w", snapshotDir, err)
+ }
+
+ return nil
+}
+
+// In the destination directories, create directories for the tables (if they don't exist). 
+func createDestinationTableDirectories(destinations []string, tableName string, fsync bool) error { + for _, destination := range destinations { + destinationTableDir := filepath.Join(destination, tableName) + + err := util.EnsureDirectoryExists(destinationTableDir, fsync) + if err != nil { + return fmt.Errorf("failed to ensure destination table directory %s exists: %w", + destinationTableDir, err) + } + } + + return nil +} + +// Transfer the keymap (if it is present in the source directory). +func transferKeymap( + source string, + tableName string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, +) error { + + sourceKeymapPath := filepath.Join(source, tableName, keymap.KeymapDirectoryName) + exists, err := util.Exists(sourceKeymapPath) + if err != nil { + return fmt.Errorf("failed to check if keymap directory %s exists: %w", sourceKeymapPath, err) + } + if !exists { + return nil + } + + destination, err := determineDestination(sourceKeymapPath, destinations) + if err != nil { + return fmt.Errorf("failed to determine destination for keymap %s: %w", sourceKeymapPath, err) + } + + destinationKeymapPath := filepath.Join(destination, tableName, keymap.KeymapDirectoryName) + + if verbose { + text := fmt.Sprintf("Transferring table '%s' keymap", tableName) + writer := bufio.NewWriter(os.Stdout) + _, _ = fmt.Fprintf(writer, "\r%-100s", text) + _ = writer.Flush() + } + + err = util.RecursiveMove(sourceKeymapPath, destinationKeymapPath, preserveOriginal, fsync) + if err != nil { + return fmt.Errorf("failed to copy keymap from %s to %s: %w", + sourceKeymapPath, destinationKeymapPath, err) + } + + return nil +} + +// transfers data in the segments/ directory +func transferSegmentData( + source string, + tableName string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, + totalSegmentFileCount int64, + segmentFileCount *atomic.Int64, +) error { + + sourceTableDir := filepath.Join(source, tableName) + + 
sourceSegmentDir := filepath.Join(sourceTableDir, segment.SegmentDirectory) + exists, err := util.Exists(sourceSegmentDir) + if err != nil { + return fmt.Errorf("failed to check if segment directory %s exists: %w", sourceSegmentDir, err) + } + if !exists { + return nil + } + + segmentFiles, err := os.ReadDir(sourceSegmentDir) + if err != nil { + return fmt.Errorf("failed to read segment directory %s: %w", sourceSegmentDir, err) + } + + for _, segmentFile := range segmentFiles { + segmentFilePath := filepath.Join(sourceSegmentDir, segmentFile.Name()) + err = transferSegmentFile( + segmentFile.Name(), + segmentFilePath, + tableName, + destinations, + preserveOriginal, + fsync, + verbose, + totalSegmentFileCount, + segmentFileCount) + if err != nil { + return fmt.Errorf("failed to transfer segment file %s for table %s: %w", + segmentFilePath, tableName, err) + } + } + + if !preserveOriginal { + // Now that we've copied the segment files, we can delete the original directory. + err = os.Remove(sourceSegmentDir) + if err != nil { + return fmt.Errorf("failed to remove segment directory %s: %w", sourceSegmentDir, err) + } + } + + return nil +} + +// Transfer a single segment file (i.e. *.metadata, *.keys, *.values). 
+func transferSegmentFile( + segmentName string, + segmentFilePath string, + tableName string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, + totalSegmentFileCount int64, + segmentFileCount *atomic.Int64, +) error { + + destination, err := determineDestination(segmentFilePath, destinations) + if err != nil { + return fmt.Errorf("failed to determine destination for segment file %s: %w", segmentFilePath, err) + } + + destinationSegmentPath := filepath.Join(destination, tableName, segment.SegmentDirectory, segmentName) + + if verbose { + count := segmentFileCount.Add(1) + text := fmt.Sprintf("Transferring Segment File %d/%d from table '%s': %s", + count, totalSegmentFileCount, tableName, filepath.Base(segmentFilePath)) + writer := bufio.NewWriter(os.Stdout) + _, _ = fmt.Fprintf(writer, "\r%-100s", text) + _ = writer.Flush() + } + + err = util.RecursiveMove(segmentFilePath, destinationSegmentPath, preserveOriginal, fsync) + if err != nil { + return fmt.Errorf("failed to copy segment file from %s to %s: %w", + segmentFilePath, destinationSegmentPath, err) + } + + return nil +} + +// transfers the table metadata file, if it is present. 
+func transferTableMetadata( + source string, + tableName string, + destinations []string, + preserveOriginal bool, + fsync bool, + verbose bool, +) error { + + sourceTableDir := filepath.Join(source, tableName) + + sourceMetadataPath := filepath.Join(sourceTableDir, disktable.TableMetadataFileName) + exists, err := util.Exists(sourceMetadataPath) + if err != nil { + return fmt.Errorf("failed to check if table metadata file %s exists: %w", sourceMetadataPath, err) + } + + if !exists { + return nil + } + + destination, err := determineDestination(sourceTableDir, destinations) + if err != nil { + return fmt.Errorf("failed to determine destination for table metadata %s: %w", sourceMetadataPath, err) + } + + destinationMetadataPath := filepath.Join(destination, tableName, disktable.TableMetadataFileName) + + if verbose { + text := fmt.Sprintf("Transferring table '%s' metadata", tableName) + writer := bufio.NewWriter(os.Stdout) + _, _ = fmt.Fprintf(writer, "\r%-100s", text) + _ = writer.Flush() + } + + err = util.RecursiveMove(sourceMetadataPath, destinationMetadataPath, preserveOriginal, fsync) + if err != nil { + return fmt.Errorf("failed to copy table metadata from %s to %s: %w", + sourceMetadataPath, destinationMetadataPath, err) + } + + return nil +} + +// Determines the location where a file should be transferred given a list of options. +// This function is deterministic. This is important! If a rebase is interrupted, the +// second attempt should always transfer the file to the same location as the first attempt. 
+func determineDestination(source string, destinations []string) (string, error) {
+ hasher := fnv.New64a()
+ _, err := hasher.Write([]byte(source))
+ if err != nil {
+ return "", fmt.Errorf("failed to hash source path %s: %w", source, err)
+ }
+
+ return destinations[hasher.Sum64()%uint64(len(destinations))], nil
+}
diff --git a/sei-db/db_engine/litt/cli/rebase_test.go b/sei-db/db_engine/litt/cli/rebase_test.go
new file mode 100644
index 0000000000..d4532cdd3e
--- /dev/null
+++ b/sei-db/db_engine/litt/cli/rebase_test.go
@@ -0,0 +1,395 @@
+//go:build littdb_wip
+
+package main
+
+import (
+ "path"
+ "testing"
+
+ "github.com/Layr-Labs/eigenda/litt"
+ "github.com/Layr-Labs/eigenda/litt/littbuilder"
+ "github.com/Layr-Labs/eigenda/litt/util"
+ "github.com/Layr-Labs/eigenda/test"
+ "github.com/Layr-Labs/eigenda/test/random"
+ "github.com/stretchr/testify/require"
+)
+
+func rebaseTest(
+ t *testing.T,
+ sourceDirs uint64,
+ destDirs uint64,
+ overlap uint64,
+ preserveOriginal bool,
+ verbose bool,
+) {
+ t.Helper()
+ logger := test.GetLogger()
+
+ if overlap > 0 && preserveOriginal {
+ require.Fail(t, "Invalid test configuration, cannot preserve original when there is overlap")
+ }
+
+ rand := random.NewTestRandom()
+ testDir := t.TempDir()
+
+ sourceDirList := make([]string, 0, sourceDirs)
+ sourceDirSet := make(map[string]struct{}, sourceDirs)
+ destDirList := make([]string, 0, destDirs)
+ destDirSet := make(map[string]struct{}, destDirs)
+
+ for i := uint64(0); i < sourceDirs; i++ {
+ sourceDir := path.Join(testDir, rand.String(32))
+ sourceDirList = append(sourceDirList, sourceDir)
+ sourceDirSet[sourceDir] = struct{}{}
+
+ if i < overlap {
+ // Reuse this directory for the destination as well. 
+ destDirList = append(destDirList, sourceDir) + destDirSet[sourceDir] = struct{}{} + } + } + for len(destDirList) < int(destDirs) { + destDir := path.Join(testDir, rand.String(32)) + destDirList = append(destDirList, destDir) + destDirSet[destDir] = struct{}{} + } + + // Randomize the order of the source and destination directories. This ensures that the first directories + // are not always the ones that overlap. + rand.Shuffle(len(sourceDirList), func(i, j int) { + sourceDirList[i], sourceDirList[j] = sourceDirList[j], sourceDirList[i] + }) + rand.Shuffle(len(destDirList), func(i, j int) { + destDirList[i], destDirList[j] = destDirList[j], destDirList[i] + }) + + tableCount := rand.Uint64Range(2, 4) + tableNames := make([]string, 0, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableNames = append(tableNames, rand.String(32)) + } + + shardingFactor := sourceDirs + rand.Uint64Range(0, 4) + + config, err := litt.DefaultConfig(sourceDirList...) + require.NoError(t, err) + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(shardingFactor) + config.Fsync = false + config.TargetSegmentFileSize = 100 + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + expectedData := make(map[string] /*table*/ map[string] /*value*/ []byte) + for _, tableName := range tableNames { + expectedData[tableName] = make(map[string][]byte) + } + + // Insert data into the tables. + keyCount := uint64(1024) + for i := uint64(0); i < keyCount; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + table, err := db.GetTable(tableNames[tableIndex]) + require.NoError(t, err) + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 100) + + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "failed to put key %s in table %s", key, table.Name()) + } + + // Flush all tables. 
+ for _, tableName := range tableNames {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err)
+ err = table.Flush()
+ require.NoError(t, err, "failed to flush table %s", table.Name())
+ }
+
+ // Verify the data in the DB.
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "failed to get table %s", tableName)
+ for key := range expectedData[tableName] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err, "failed to get key %s in table %s", key, tableName)
+ require.True(t, ok, "key %s not found in table %s", key, tableName)
+ require.Equal(t, expectedData[tableName][key], value,
+ "value for key %s in table %s does not match expected value", key, tableName)
+ }
+ }
+
+ // Verify expected directories.
+ for _, sourceDir := range sourceDirList {
+ // We should see each source dir.
+ exists, err := util.Exists(sourceDir)
+ require.NoError(t, err)
+ require.True(t, exists, "source directory %s does not exist", sourceDir)
+ }
+ for _, destDir := range destDirList {
+ // We should not see dest dirs unless they overlap with source dirs.
+ exists, err := util.Exists(destDir)
+ require.NoError(t, err)
+ if _, ok := sourceDirSet[destDir]; !ok {
+ require.False(t, exists, "destination directory %s exists but should not", destDir)
+ } else {
+ require.True(t, exists, "destination directory %s does not exist but should", destDir)
+ }
+ }
+
+ // Rebasing with the DB still open should fail.
+ err = rebase(logger, sourceDirList, destDirList, preserveOriginal, false, verbose)
+ require.Error(t, err)
+
+ // None of the source dirs should have been deleted.
+ for _, sourceDir := range sourceDirList {
+ // We should see each source dir.
+ exists, err := util.Exists(sourceDir)
+ require.NoError(t, err)
+ require.True(t, exists, "source directory %s does not exist", sourceDir)
+ }
+
+ // The failed rebase should not have changed the data in the DB.
+ for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + // Shut down the DB and rebase it. + err = db.Close() + require.NoError(t, err, "failed to close DB") + + err = rebase(logger, sourceDirList, destDirList, preserveOriginal, false, verbose) + require.NoError(t, err, "failed to rebase DB") + + // Verify the new directories. + for _, sourceDir := range sourceDirList { + exists, err := util.Exists(sourceDir) + require.NoError(t, err) + + if preserveOriginal { + // We should see each source dir if preserveOriginal is true. + require.True(t, exists, "source directory %s does not exist", sourceDir) + } else { + // If we aren't preserving the original, then a source directory should only exist if it overlaps. + if _, ok := destDirSet[sourceDir]; !ok { + require.False(t, exists, "source directory %s exists but should not", sourceDir) + } else { + require.True(t, exists, "source directory %s does not exist but should", sourceDir) + } + } + } + for _, destDir := range destDirList { + // We should see all destination dirs. + exists, err := util.Exists(destDir) + require.NoError(t, err) + require.True(t, exists, "destination directory %s does not exist", destDir) + } + + // Reopen the DB at the new destination directories. + config.Paths = destDirList + db, err = littbuilder.NewDB(config) + require.NoError(t, err, "failed to open DB after rebase") + + // Verify the data in the DB. 
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "failed to get table %s", tableName)
+ for key := range expectedData[tableName] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err, "failed to get key %s in table %s", key, tableName)
+ require.True(t, ok, "key %s not found in table %s", key, tableName)
+ require.Equal(t, expectedData[tableName][key], value,
+ "value for key %s in table %s does not match expected value", key, tableName)
+ }
+ }
+
+ err = db.Close()
+ require.NoError(t, err, "failed to close DB after rebase")
+}
+
+func TestRebase1to1(t *testing.T) {
+ t.Parallel()
+
+ sourceDirs := uint64(1)
+ destDirs := uint64(1)
+
+ t.Run("preserve", func(t *testing.T) {
+ // This is the only test that runs with verbose=true. We want to make sure this doesn't crash,
+ // but don't want too much spam in the logs.
+ rebaseTest(t, sourceDirs, destDirs, 0, true, true)
+ })
+
+ t.Run("do not preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, false, false)
+ })
+}
+
+func TestRebase1toN(t *testing.T) {
+ t.Parallel()
+
+ sourceDirs := uint64(1)
+ destDirs := uint64(4)
+
+ t.Run("preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, true, false)
+ })
+
+ t.Run("do not preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, false, false)
+ })
+}
+
+func TestRebaseNto1(t *testing.T) {
+ t.Parallel()
+
+ sourceDirs := uint64(4)
+ destDirs := uint64(1)
+
+ t.Run("preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, true, false)
+ })
+
+ t.Run("do not preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, false, false)
+ })
+}
+
+func TestRebaseNtoN(t *testing.T) {
+ t.Parallel()
+
+ sourceDirs := uint64(4)
+ destDirs := uint64(4)
+
+ t.Run("preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, true, false)
+ })
+
+ t.Run("do not preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, false, false)
+ })
+}
+
+func TestRebaseNtoNOverlap(t *testing.T) {
+ t.Parallel()
+
+ sourceDirs := uint64(4)
+ destDirs := uint64(4)
+
+ // Overlap cannot be combined with preserveOriginal (rebaseTest rejects that combination),
+ // so only the non-preserving variant exercises overlapping directories.
+ t.Run("preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 0, true, false)
+ })
+
+ t.Run("do not preserve", func(t *testing.T) {
+ rebaseTest(t, sourceDirs, destDirs, 2, false, false)
+ })
+}
+
+// Verify the behavior when we attempt to rebase a snapshot directory.
+func TestRebaseSnapshot(t *testing.T) {
+ t.Parallel()
+
+ logger := test.GetLogger()
+ rand := random.NewTestRandom()
+ testDir := t.TempDir()
+
+ tableCount := rand.Uint64Range(2, 4)
+ tableNames := make([]string, 0, tableCount)
+ for i := uint64(0); i < tableCount; i++ {
+ tableNames = append(tableNames, rand.String(32))
+ }
+
+ shardingFactor := rand.Uint32Range(1, 4)
+ roots := make([]string, 0, shardingFactor)
+ for i := uint32(0); i < shardingFactor; i++ {
+ roots = append(roots, path.Join(testDir, rand.String(32)))
+ }
+
+ snapshotDir := path.Join(testDir, "snapshot")
+
+ config, err := litt.DefaultConfig(roots...)
+ require.NoError(t, err)
+ config.DoubleWriteProtection = true
+ config.ShardingFactor = shardingFactor
+ config.Fsync = false
+ config.SnapshotDirectory = snapshotDir
+ config.TargetSegmentFileSize = 100
+
+ db, err := littbuilder.NewDB(config)
+ require.NoError(t, err)
+
+ expectedData := make(map[string] /*table*/ map[string] /*value*/ []byte)
+ for _, tableName := range tableNames {
+ expectedData[tableName] = make(map[string][]byte)
+ }
+
+ // Insert data into the tables.
+ keyCount := uint64(1024) + for i := uint64(0); i < keyCount; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + table, err := db.GetTable(tableNames[tableIndex]) + require.NoError(t, err) + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 100) + + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "failed to put key %s in table %s", key, table.Name()) + } + + // Flush all tables. + for _, tableName := range tableNames { + table, err := db.GetTable(tableName) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err, "failed to flush table %s", table.Name()) + } + + // Verify the data in the DB. + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "failed to get table %s", tableName) + for key := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "failed to get key %s in table %s", key, tableName) + require.True(t, ok, "key %s not found in table %s", key, tableName) + require.Equal(t, expectedData[tableName][key], value, + "value for key %s in table %s does not match expected value", key, tableName) + } + } + + destinationDir := path.Join(testDir, "destination") + + // Begin the rebase without shutting down the DB. Lock files on the snapshot directory shouldn't interfere, + // but we still expect it to fail, since we don't support rebasing a snapshot directory. + err = rebase( + logger, + []string{snapshotDir}, + []string{destinationDir}, + true, + false, + false) + require.Error(t, err) + + err = db.Close() + require.NoError(t, err, "failed to close DB after rebase") + + // It won't matter that the DB is closed, we still expect the rebase to fail. 
+ err = rebase( + logger, + []string{snapshotDir}, + []string{destinationDir}, + true, + false, + false) + require.Error(t, err) +} diff --git a/sei-db/db_engine/litt/cli/sync.go b/sei-db/db_engine/litt/cli/sync.go new file mode 100644 index 0000000000..9c77a973cd --- /dev/null +++ b/sei-db/db_engine/litt/cli/sync.go @@ -0,0 +1,269 @@ +//go:build littdb_wip + +package main + +import ( + "context" + "fmt" + "os" + "os/signal" + "strings" + "syscall" + "time" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/urfave/cli/v2" +) + +func syncCommand(ctx *cli.Context) error { + if ctx.NArg() < 1 { + return fmt.Errorf("not enough arguments provided, must provide USER@HOST") + } + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + if err != nil { + return fmt.Errorf("failed to create logger: %w", err) + } + + sources := ctx.StringSlice("src") + if len(sources) == 0 { + return fmt.Errorf("no sources provided") + } + for i, src := range sources { + var err error + sources[i], err = util.SanitizePath(src) + if err != nil { + return fmt.Errorf("invalid source path: %s", src) + } + } + + destinations := ctx.StringSlice("dest") + if len(destinations) == 0 { + return fmt.Errorf("no destinations provided") + } + + userHost := ctx.Args().First() + parts := strings.Split(userHost, "@") + if len(parts) != 2 { + return fmt.Errorf("invalid USER@HOST format: %s", userHost) + } + user := parts[0] + host := parts[1] + + port := ctx.Uint64("port") + + keyPath := ctx.String("key") + keyPath, err = util.SanitizePath(keyPath) + if err != nil { + return fmt.Errorf("invalid key path: %s", keyPath) + } + + deleteAfterTransfer := !ctx.Bool("no-gc") + threads := ctx.Uint64("threads") + verbose := !ctx.Bool("quiet") + throttleMB := ctx.Float64("throttle") + periodSeconds := ctx.Int64("period") + period := time.Duration(periodSeconds) * time.Second + + maxAgeSeconds := 
ctx.Uint64("max-age")
+ remoteLittBinary := ctx.String("litt-binary")
+
+ knownHostsFile := ctx.String(knownHostsFileFlag.Name)
+ knownHostsFile, err = util.SanitizePath(knownHostsFile)
+ if err != nil {
+ return fmt.Errorf("invalid known hosts path: %s", ctx.String(knownHostsFileFlag.Name))
+ }
+
+ return newSyncEngine(
+ context.Background(),
+ logger,
+ sources,
+ destinations,
+ user,
+ host,
+ port,
+ keyPath,
+ knownHostsFile,
+ deleteAfterTransfer,
+ true,
+ threads,
+ throttleMB,
+ period,
+ maxAgeSeconds,
+ remoteLittBinary,
+ verbose).run()
+}
+
+// A utility that periodically transfers data from a local database to a remote backup using rsync.
+type syncEngine struct {
+ ctx context.Context
+ cancel context.CancelFunc
+ logger logging.Logger
+ sources []string
+ destinations []string
+ user string
+ host string
+ port uint64
+ keyPath string
+ knownHostsFile string
+ deleteAfterTransfer bool
+ fsync bool
+ threads uint64
+ throttleMB float64
+ period time.Duration
+ maxAgeSeconds uint64
+ remoteLittBinary string
+ verbose bool
+}
+
+// newSyncEngine creates a new syncEngine instance with the provided parameters.
+func newSyncEngine( + ctx context.Context, + logger logging.Logger, + sources []string, + destinations []string, + user string, + host string, + port uint64, + keyPath string, + knownHostsFile string, + deleteAfterTransfer bool, + fsync bool, + threads uint64, + throttleMB float64, + period time.Duration, + maxAgeSeconds uint64, + remoteLittBinary string, + verbose bool, +) *syncEngine { + + ctx, cancel := context.WithCancel(ctx) + + return &syncEngine{ + ctx: ctx, + cancel: cancel, + logger: logger, + sources: sources, + destinations: destinations, + user: user, + host: host, + port: port, + keyPath: keyPath, + knownHostsFile: knownHostsFile, + deleteAfterTransfer: deleteAfterTransfer, + fsync: fsync, + threads: threads, + throttleMB: throttleMB, + period: period, + maxAgeSeconds: maxAgeSeconds, + remoteLittBinary: remoteLittBinary, + verbose: verbose, + } +} + +// run the sync engine. This method blocks until the context is cancelled or an unrecoverable error occurs. +func (s *syncEngine) run() error { + go s.syncLoop() + + // Create a channel to listen for OS signals + sigChan := make(chan os.Signal, 1) + signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM) + + // Wait for signal + select { + case <-s.ctx.Done(): + s.logger.Infof("Received shutdown signal, stopping") + case <-sigChan: + // Cancel the context when signal is received + s.cancel() + } + + return nil +} + +// syncLoop is the main loop of the sync engine. It runs indefinitely until the context is cancelled. 
+func (s *syncEngine) syncLoop() {
+
+ ticker := time.NewTicker(s.period)
+ defer ticker.Stop()
+
+ for {
+ select {
+ case <-s.ctx.Done():
+ return
+ case <-ticker.C:
+ s.sync()
+ }
+ }
+}
+
+func (s *syncEngine) sync() {
+ s.logger.Info("Pushing data to remote.")
+
+ err := push(
+ s.logger,
+ s.sources,
+ s.destinations,
+ s.user,
+ s.host,
+ s.port,
+ s.keyPath,
+ s.knownHostsFile,
+ s.deleteAfterTransfer,
+ s.fsync,
+ s.threads,
+ s.throttleMB,
+ s.verbose)
+
+ if err != nil {
+ s.logger.Errorf("Push failed: %v", err)
+ return
+ }
+ s.logger.Info("Push completed successfully.")
+
+ if s.maxAgeSeconds == 0 {
+ s.logger.Info("No max age configured, remote data will not be automatically pruned.")
+ return
+ }
+
+ s.logger.Infof("Pruning remote data older than %d seconds.", s.maxAgeSeconds)
+
+ command := fmt.Sprintf("%s prune --max-age %d", s.remoteLittBinary, s.maxAgeSeconds)
+ sshSession, err := util.NewSSHSession(
+ s.logger,
+ s.user,
+ s.host,
+ s.port,
+ s.keyPath,
+ s.knownHostsFile,
+ s.verbose)
+ if err != nil {
+ s.logger.Errorf("Failed to create SSH session to %s@%s port %d: %v", s.user, s.host, s.port, err)
+ return
+ }
+ defer func() {
+ if closeErr := sshSession.Close(); closeErr != nil {
+ s.logger.Errorf("Failed to close SSH session: %v", closeErr)
+ }
+ }()
+ stdout, stderr, err := sshSession.Exec(command)
+ if s.verbose {
+ s.logger.Infof("prune stdout: %s", stdout)
+ }
+ if stderr != "" {
+ s.logger.Errorf("prune stderr: %s", stderr)
+ }
+
+ if err != nil {
+ s.logger.Errorf("failed to execute command '%s': %v", command, err)
+ }
+}
+
+// Stop stops the sync engine by cancelling the context.
+func (s *syncEngine) Stop() { + s.cancel() +} diff --git a/sei-db/db_engine/litt/cli/table_info.go b/sei-db/db_engine/litt/cli/table_info.go new file mode 100644 index 0000000000..09458ccc47 --- /dev/null +++ b/sei-db/db_engine/litt/cli/table_info.go @@ -0,0 +1,201 @@ +//go:build littdb_wip + +package main + +import ( + "context" + "fmt" + "path" + "time" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/urfave/cli/v2" +) + +// TableInfo contains high level information about a table in LittDB. +type TableInfo struct { + // The number of key-value pairs in the table. + KeyCount uint64 + // The size of the table in bytes. + Size uint64 + // If true, the table at the specified path is a snapshot of another table. + IsSnapshot bool + // The time when the oldest segment was sealed. + OldestSegmentSealTime time.Time + // The time when the newest segment was sealed. + NewestSegmentSealTime time.Time + // The index of the oldest segment in the table. + LowestSegmentIndex uint32 + // The index of the newest segment in the table. + HighestSegmentIndex uint32 + // The type of the keymap used by the table. If "", then this table doesn't have a keymap (i.e. it will rebuild + // a keymap the next time it is loaded). + KeymapType string +} + +// tableInfoCommand is the CLI command handler for the "table-info" command. 
+func tableInfoCommand(ctx *cli.Context) error {
+ if ctx.NArg() != 1 {
+ return fmt.Errorf(
+ "table-info command requires exactly one argument: the table name")
+ }
+
+ logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+ if err != nil {
+ return fmt.Errorf("failed to create logger: %w", err)
+ }
+
+ tableName := ctx.Args().Get(0)
+
+ sources := ctx.StringSlice("src")
+ if len(sources) == 0 {
+ return fmt.Errorf("no sources provided")
+ }
+ for i, src := range sources {
+ var err error
+ sources[i], err = util.SanitizePath(src)
+ if err != nil {
+ return fmt.Errorf("invalid source path: %s", src)
+ }
+ }
+
+ info, err := tableInfo(logger, tableName, sources, true)
+ if err != nil {
+ return fmt.Errorf("failed to get table info for table %s at paths %v: %w", tableName, sources, err)
+ }
+
+ oldestSegmentAge := uint64(time.Since(info.OldestSegmentSealTime).Nanoseconds())
+ newestSegmentAge := uint64(time.Since(info.NewestSegmentSealTime).Nanoseconds())
+ segmentSpan := oldestSegmentAge - newestSegmentAge
+
+ // Print table information in a human-readable format
+ logger.Infof("Table: %s", tableName)
+ logger.Infof("Key count: %s", common.CommaOMatic(info.KeyCount))
+ logger.Infof("Size: %s", common.PrettyPrintBytes(info.Size))
+ logger.Infof("Is snapshot: %t", info.IsSnapshot)
+ logger.Infof("Oldest segment age: %s", common.PrettyPrintTime(oldestSegmentAge))
+ logger.Infof("Oldest segment seal time: %s", info.OldestSegmentSealTime.Format(time.RFC3339))
+ logger.Infof("Newest segment age: %s", common.PrettyPrintTime(newestSegmentAge))
+ logger.Infof("Newest segment seal time: %s", info.NewestSegmentSealTime.Format(time.RFC3339))
+ logger.Infof("Segment span: %s", common.PrettyPrintTime(segmentSpan))
+ logger.Infof("Lowest segment index: %d", info.LowestSegmentIndex)
+ logger.Infof("Highest segment index: %d", info.HighestSegmentIndex)
+ logger.Infof("Keymap type: %s", info.KeymapType)
+
+ return nil
+}
+
+// tableInfo retrieves information about a
table at the specified path. +func tableInfo(logger logging.Logger, tableName string, paths []string, fsync bool) (*TableInfo, error) { + if !litt.IsTableNameValid(tableName) { + return nil, fmt.Errorf("table name '%s' is invalid, "+ + "must be at least one character long and contain only letters, numbers, underscores, and dashes", + tableName) + } + + // Forbid touching tables in active use. + releaseLocks, err := util.LockDirectories(logger, paths, util.LockfileName, fsync) + if err != nil { + return nil, fmt.Errorf("failed to acquire locks on paths %v: %w", paths, err) + } + defer releaseLocks() + + segmentPaths, err := segment.BuildSegmentPaths(paths, "", tableName) + if err != nil { + return nil, fmt.Errorf( + "failed to build segment paths for table %s at paths %v: %w", tableName, paths, err) + } + + for _, segmentPath := range segmentPaths { + if err = util.ErrIfNotExists(segmentPath.SegmentDirectory()); err != nil { + return nil, fmt.Errorf("segment directory %s does not exist", segmentPath.SegmentDirectory()) + } + } + + errorMonitor := util.NewErrorMonitor(context.Background(), logger, nil) + + lowestSegmentIndex, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + segmentPaths, + false, + time.Now(), + false, + fsync) + + if err != nil { + return nil, fmt.Errorf("failed to gather segment files for table %s at paths %v: %w", + tableName, paths, err) + } + if ok, err := errorMonitor.IsOk(); !ok { + // This should be impossible since we aren't doing anything on background threads that report to the + // error monitor, but it doesn't hurt to check. 
+ return nil, fmt.Errorf("error monitor reports errors: %w", err) + } + + if len(segments) == 0 { + return nil, fmt.Errorf("no segments found for table %s at paths %v", tableName, paths) + } + + isSnapshot, err := segments[lowestSegmentIndex].IsSnapshot() + if err != nil { + return nil, fmt.Errorf("failed to check if segment %d is a snapshot: %w", lowestSegmentIndex, err) + } + + if isSnapshot { + if len(paths) != 1 { + return nil, fmt.Errorf("table %s is a snapshot, but multiple paths were provided: %v", + tableName, paths) + } + + upperBoundFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(paths[0], tableName)) + if err != nil { + return nil, fmt.Errorf("failed to load boundary file for table %s at path %s: %w", + tableName, paths[0], err) + } + + if upperBoundFile.IsDefined() { + highestSegmentIndex = upperBoundFile.BoundaryIndex() + } + } + + keyCount := uint64(0) + size := uint64(0) + for _, seg := range segments { + if seg.SegmentIndex() > highestSegmentIndex { + // Do not attempt to read segments outside the limit set by the boundary file. 
+ break + } + + keyCount += uint64(seg.KeyCount()) + size += seg.Size() + } + + _, _, keymapTypeFile, err := littbuilder.FindKeymapLocation(paths, tableName) + if err != nil { + return nil, fmt.Errorf("failed to find keymap location for table %s at paths %v: %w", + tableName, paths, err) + } + + keymapType := "none (will be rebuilt on next LittDB startup)" + if keymapTypeFile != nil { + keymapType = (string)(keymapTypeFile.Type()) + } + + return &TableInfo{ + KeyCount: keyCount, + Size: size, + IsSnapshot: isSnapshot, + OldestSegmentSealTime: segments[lowestSegmentIndex].GetSealTime(), + NewestSegmentSealTime: segments[highestSegmentIndex].GetSealTime(), + LowestSegmentIndex: lowestSegmentIndex, + HighestSegmentIndex: highestSegmentIndex, + KeymapType: keymapType, + }, nil +} diff --git a/sei-db/db_engine/litt/cli/table_info_test.go b/sei-db/db_engine/litt/cli/table_info_test.go new file mode 100644 index 0000000000..73faf7256a --- /dev/null +++ b/sei-db/db_engine/litt/cli/table_info_test.go @@ -0,0 +1,129 @@ +//go:build littdb_wip + +package main + +import ( + "fmt" + "testing" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestTableInfo(t *testing.T) { + t.Parallel() + + rand := random.NewTestRandom() + directory := t.TempDir() + logger := test.GetLogger() + + // Spread data across several root directories. + rootCount := rand.Uint32Range(2, 5) + roots := make([]string, 0, rootCount) + for i := 0; i < int(rootCount); i++ { + roots = append(roots, fmt.Sprintf("%s/root-%d", directory, i)) + } + + config, err := litt.DefaultConfig(roots...) + require.NoError(t, err) + + // Make it so that we have at least as many shards as roots. + config.ShardingFactor = rootCount * rand.Uint32Range(1, 4) + + // Settings that should be enabled for LittDB unit tests. 
+ config.DoubleWriteProtection = true + config.Fsync = false + + // Use small segments to ensure that we create a few segments per table. + config.TargetSegmentFileSize = 100 + + // Enable snapshotting. + snapshotDir := t.TempDir() + config.SnapshotDirectory = snapshotDir + + // Build the DB and a handful of tables. + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint32Range(2, 5) + tables := make([]litt.Table, 0, tableCount) + expectedData := make(map[string]map[string][]byte) + tableNames := make([]string, 0, tableCount) + for i := 0; i < int(tableCount); i++ { + tableName := fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8)) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables = append(tables, table) + expectedData[table.Name()] = make(map[string][]byte) + tableNames = append(tableNames, tableName) + } + + // Insert some data into the tables. + for _, table := range tables { + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 200) + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put key-value pair in table %s", table.Name()) + } + err = table.Flush() + require.NoError(t, err, "Failed to flush table %s", table.Name()) + } + + // Verify that the data is correctly stored in the tables. + for _, table := range tables { + for key, expectedValue := range expectedData[table.Name()] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name()) + require.True(t, ok, "Key %s not found in table %s", key, table.Name()) + require.Equal(t, expectedValue, value, + "Value mismatch for key %s in table %s", key, table.Name()) + } + } + + // We should not be able to call table-info on the core directories while the table holds a lock. 
+ _, err = tableInfo(logger, tableNames[0], config.Paths, false) + require.Error(t, err) + + // Even when the DB is running, it should always be possible to check the snapshot directory. + lsResult, err := ls(logger, snapshotDir, true, false) + require.NoError(t, err) + require.Equal(t, tableNames, lsResult) + + for _, tableName := range tableNames { + info, err := tableInfo(logger, tableName, []string{snapshotDir}, false) + require.NoError(t, err) + + require.True(t, info.IsSnapshot) + require.Greater(t, info.Size, uint64(0)) + require.Greater(t, info.KeyCount, uint64(0)) + require.LessOrEqual(t, info.KeyCount, uint64(100)) + require.Equal(t, "none (will be rebuilt on next LittDB startup)", info.KeymapType) + } + + // Getting info on a table that doesn't exist should return an error. + _, err = tableInfo(logger, "nonexistent-table", config.Paths, false) + require.Error(t, err) + + err = db.Close() + require.NoError(t, err) + + // Now that the DB is closed, we should be able to call table-info on the core directories. + for _, tableName := range tableNames { + info, err := tableInfo(logger, tableName, config.Paths, false) + require.NoError(t, err) + + require.False(t, info.IsSnapshot) + require.Greater(t, info.Size, uint64(0)) + require.Equal(t, info.KeyCount, uint64(100)) + require.Equal(t, "LevelDBKeymap", info.KeymapType) + } + + // A non-existent table should return an error for the core directories as well. 
+ _, err = tableInfo(logger, "nonexistent-table", config.Paths, false) + require.Error(t, err, "Expected error when querying info for a non-existent table after DB close") +} diff --git a/sei-db/db_engine/litt/cli/unlock.go b/sei-db/db_engine/litt/cli/unlock.go new file mode 100644 index 0000000000..d567b9493e --- /dev/null +++ b/sei-db/db_engine/litt/cli/unlock.go @@ -0,0 +1,50 @@ +//go:build littdb_wip + +package main + +import ( + "bufio" + "fmt" + "os" + "strings" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/urfave/cli/v2" +) + +// called by the CLI to unlock a LittDB file system. +func unlockCommand(ctx *cli.Context) error { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + if err != nil { + return fmt.Errorf("failed to create logger: %w", err) + } + sources := ctx.StringSlice(srcFlag.Name) + + if len(sources) == 0 { + return fmt.Errorf("at least one source path is required") + } + + force := ctx.Bool(forceFlag.Name) + if !force { + magicString := "I know what I am doing" + logger.Warnf("About to delete LittDB lock files. This is potentially dangerous. "+ + "Type \"%s\" to continue, or use "+ + "the --force flag.", magicString) + reader := bufio.NewReader(os.Stdin) + input, err := reader.ReadString('\n') + if err != nil { + return fmt.Errorf("failed to read input: %w", err) + } + input = strings.TrimSuffix(input, "\n") + if input != magicString { + return fmt.Errorf("unlock operation aborted") + } + } + + err = disktable.Unlock(logger, sources) + if err != nil { + return fmt.Errorf("failed to unlock LittDB files: %w", err) + } + return nil +} diff --git a/sei-db/db_engine/litt/db.go b/sei-db/db_engine/litt/db.go new file mode 100644 index 0000000000..9ed1bdcaa4 --- /dev/null +++ b/sei-db/db_engine/litt/db.go @@ -0,0 +1,60 @@ +//go:build littdb_wip + +package litt + +// DB is a highly specialized key-value store. 
It is intentionally very feature poor, sacrificing +// unnecessary features for simplicity, high performance, and low memory usage. +// +// Litt: adjective, slang, a synonym for "cool" or "awesome". e.g. "Man, that database is litt, bro!". +// +// Supported features: +// - writing values +// - reading values +// - TTLs and automatic (lazy) deletion of expired values +// - tables with non-overlapping namespaces +// - thread safety: all methods are safe to call concurrently, and all key-value pair modifications are +// individually atomic +// - dynamic multi-drive support (data can be spread across multiple physical volumes, and +// volume membership can be changed at runtime without stopping the DB) +// - incremental backups (both local and remote) +// +// Unsupported features: +// - mutating existing values (once a value is written, it cannot be changed) +// - multi-entity atomicity (there is no supported way to atomically write multiple key-value pairs as a group) +// - deleting values (values only leave the DB when they expire via a TTL) +// - transactions (individual operations are atomic, but there is no way to group operations atomically) +// - fine granularity for TTL (all data in the same table must have the same TTL) +type DB interface { + // GetTable gets a table by name, creating one if it does not exist. + // + // Table names appear as directories on the file system, and so table names are restricted to be + // ASCII alphanumeric characters, dashes, and underscores. The name must be at least one character long. + // + // The first time a table is fetched (either a new table or an existing one loaded from disk), its TTL is always + // set to 0 (i.e. it has no TTL, meaning data is never deleted). If you want to set a TTL, you must call + // Table.SetTTL() to do so. This is necessary after each time the database is started/restarted. + GetTable(name string) (Table, error) + + // DropTable deletes a table and all of its data. 
This is a no-op if the table does not exist. + // + // Note that it is NOT thread safe to drop a table concurrently with any operation that accesses the table. + // The table returned by GetTable() before DropTable() is called must not be used once DropTable() is called. + DropTable(name string) error + + // Size returns the on-disk size of the database in bytes. + // + // Note that this size may not accurately reflect the size of the keymap. This is because some third party + // libraries used for certain keymap implementations do not provide an accurate way to measure size. + Size() uint64 + + // KeyCount returns the number of keys in the database. + KeyCount() uint64 + + // Close stops the database. This method must be called when the database is no longer needed. + // Close ensures that all non-flushed data is crash durable on disk before returning. Calls to + // Put() concurrent with Close() may not be crash durable after Close() returns. + Close() error + + // Destroy deletes all data in the database. + Destroy() error +} diff --git a/sei-db/db_engine/litt/disktable/boundary_file.go b/sei-db/db_engine/litt/disktable/boundary_file.go new file mode 100644 index 0000000000..05c31399a4 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/boundary_file.go @@ -0,0 +1,185 @@ +//go:build littdb_wip + +package disktable + +import ( + "fmt" + "os" + "path" + "strconv" + "strings" + + "github.com/Layr-Labs/eigenda/litt/util" +) + +// The name of the file that defines the lower bound of a LittDB snapshot directory. +const LowerBoundFileName = "lower-bound.txt" + +// The name of the file that defines the upper bound of a LittDB snapshot directory. +const UpperBoundFileName = "upper-bound.txt" + +// BoundaryType is an enum that describes the type of boundary file. +type BoundaryType bool + +const ( + // A boundary file that defines the lowest valid segment index in a snapshot directory. 
+ LowerBound BoundaryType = true + // A boundary file that defines the highest valid segment index in a snapshot directory. + UpperBound BoundaryType = false +) + +type BoundaryFile struct { + // The type of this boundary file. + boundaryType BoundaryType + + // The parent directory where this file is stored. + parentDirectory string + + // If true, then the boundary is defined, otherwise it is undefined. + // If undefined, the boundary index should be considered invalid. + defined bool + + // The segment index of the boundary. Describes a lower/upper segment index. If this is a lower bound file, + // it describes the lowest segment index that is valid within the snapshot directory (inclusive). If this is + // an upper bound file, it describes the highest segment index that is valid within the snapshot directory + // (also inclusive). + boundaryIndex uint32 +} + +// LoadBoundaryFile loads a boundary file from the specified parent directory. If the boundary file does not exist, +// then this method returns an object that can be used to create a new boundary file at the specified path (i.e. by +// calling Write() or Update()). +func LoadBoundaryFile(boundaryType BoundaryType, parentDirectory string) (*BoundaryFile, error) { + boundary := &BoundaryFile{ + boundaryType: boundaryType, + parentDirectory: parentDirectory, + } + + exists, err := util.Exists(boundary.Path()) + if err != nil { + return nil, fmt.Errorf("failed to check if boundary file %s exists: %v", boundary.Path(), err) + } + + if exists { + data, err := os.ReadFile(boundary.Path()) + if err != nil { + return nil, fmt.Errorf("failed to read boundary file %s: %v", boundary.Path(), err) + } + + data = []byte(strings.TrimSpace(string(data))) + + err = boundary.deserialize(data) + if err != nil { + return nil, fmt.Errorf("failed to deserialize boundary file %s: %v", boundary.Path(), err) + } + boundary.defined = true + } + + return boundary, nil +} + +// Atomically update the value of the boundary file. 
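+//
+// A hypothetical usage sketch (the directory value is illustrative, not from this codebase):
+//
+//	bf, err := LoadBoundaryFile(LowerBound, "/some/snapshot/dir")
+//	if err == nil {
+//		// The boundary may only move forward; Update writes the new value atomically.
+//		err = bf.Update(bf.BoundaryIndex() + 1)
+//	}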
+func (b *BoundaryFile) Update(newBoundary uint32) error {
+	if b == nil {
+		return nil
+	}
+
+	if newBoundary < b.boundaryIndex {
+		return fmt.Errorf("boundary index may only increase, cannot set to %d (current: %d)",
+			newBoundary, b.boundaryIndex)
+	}
+
+	b.defined = true
+	b.boundaryIndex = newBoundary
+	err := b.Write()
+	if err != nil {
+		return fmt.Errorf("failed to update boundary file %s: %v", b.Path(), err)
+	}
+	return nil
+}
+
+// Get the file name of the boundary file.
+func (b *BoundaryFile) Name() string {
+	if b == nil {
+		return ""
+	}
+
+	if b.boundaryType == LowerBound {
+		return LowerBoundFileName
+	}
+	return UpperBoundFileName
+}
+
+// Get the full path where the boundary file is stored.
+func (b *BoundaryFile) Path() string {
+	if b == nil {
+		return ""
+	}
+
+	return path.Join(b.parentDirectory, b.Name())
+}
+
+// Serialize the boundary file to a byte slice.
+func (b *BoundaryFile) serialize() []byte {
+	if b == nil {
+		return nil
+	}
+
+	// Serialize the boundary file to a byte slice. Since end users may interact with this file,
+	// serialize in a human-readable format.
+	return []byte(fmt.Sprintf("%d\n", b.boundaryIndex))
+}
+
+func (b *BoundaryFile) deserialize(data []byte) error {
+	if b == nil {
+		return nil
+	}
+
+	// ParseUint with a 32-bit size rejects negative and out-of-range values outright.
+	boundaryIndex, err := strconv.ParseUint(string(data), 10, 32)
+	if err != nil {
+		return fmt.Errorf("failed to parse boundary index from data: %v", err)
+	}
+	b.boundaryIndex = uint32(boundaryIndex)
+	return nil
+}
+
+// Write the boundary file to disk.
+func (b *BoundaryFile) Write() error {
+	if b == nil {
+		return nil
+	}
+
+	data := b.serialize()
+	// fsync is not necessary; in the event of a crash the boundary files get repaired
+	err := util.AtomicWrite(b.Path(), data, false)
+	if err != nil {
+		return fmt.Errorf("failed to write boundary file %s: %v", b.Path(), err)
+	}
+
+	return nil
+}
+
+// Returns true if this boundary file is defined. If undefined, it means that the boundary index is invalid
+// and should not be used.
+func (b *BoundaryFile) IsDefined() bool { + if b == nil { + return false + } + + return b.defined +} + +// Get the boundary index described by this file. +// +// If this is a lower bound, then it describes the highest segment index in a snapshot directory that has been garbage +// collected. As a result, LittDB will not snapshot any segments with this index or lower. +// +// If this is an upper bound, then it describes the highest segment index that LittDB has fully taken a snapshot of. +// External processes using the snapshot should ignore any segment with an index greater than this. +func (b *BoundaryFile) BoundaryIndex() uint32 { + if b == nil { + return 0 + } + + return b.boundaryIndex +} diff --git a/sei-db/db_engine/litt/disktable/boundary_file_test.go b/sei-db/db_engine/litt/disktable/boundary_file_test.go new file mode 100644 index 0000000000..2e331127a2 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/boundary_file_test.go @@ -0,0 +1,265 @@ +//go:build littdb_wip + +package disktable + +import ( + "os" + "path/filepath" + "testing" + + "github.com/stretchr/testify/require" +) + +func TestLoadBoundaryFileNonExistentFile(t *testing.T) { + tempDir := t.TempDir() + + // Test loading lower bound file that doesn't exist + lowerBoundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.NotNil(t, lowerBoundary) + require.False(t, lowerBoundary.IsDefined()) + require.Equal(t, uint32(0), lowerBoundary.BoundaryIndex()) + + // Test loading upper bound file that doesn't exist + upperBoundary, err := LoadBoundaryFile(UpperBound, tempDir) + require.NoError(t, err) + require.NotNil(t, upperBoundary) + require.False(t, upperBoundary.IsDefined()) + require.Equal(t, uint32(0), upperBoundary.BoundaryIndex()) +} + +func TestLoadBoundaryFileExistingFile(t *testing.T) { + tempDir := t.TempDir() + + // Create a lower bound file + lowerBoundPath := filepath.Join(tempDir, LowerBoundFileName) + err := os.WriteFile(lowerBoundPath, []byte("123\n"), 
0644) + require.NoError(t, err) + + // Create an upper bound file + upperBoundPath := filepath.Join(tempDir, UpperBoundFileName) + err = os.WriteFile(upperBoundPath, []byte("456"), 0644) + require.NoError(t, err) + + // Load lower bound file + lowerBoundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.NotNil(t, lowerBoundary) + require.True(t, lowerBoundary.IsDefined()) + require.Equal(t, uint32(123), lowerBoundary.BoundaryIndex()) + + // Load upper bound file + upperBoundary, err := LoadBoundaryFile(UpperBound, tempDir) + require.NoError(t, err) + require.NotNil(t, upperBoundary) + require.True(t, upperBoundary.IsDefined()) + require.Equal(t, uint32(456), upperBoundary.BoundaryIndex()) +} + +func TestLoadBoundaryFileInvalidContent(t *testing.T) { + tempDir := t.TempDir() + + // Create a file with invalid content + boundaryPath := filepath.Join(tempDir, LowerBoundFileName) + err := os.WriteFile(boundaryPath, []byte("not_a_number"), 0644) + require.NoError(t, err) + + // Loading should fail + _, err = LoadBoundaryFile(LowerBound, tempDir) + require.Error(t, err) +} + +func TestName(t *testing.T) { + tempDir := t.TempDir() + + // Test lower bound file name + lowerBoundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.Equal(t, LowerBoundFileName, lowerBoundary.Name()) + + // Test upper bound file name + upperBoundary, err := LoadBoundaryFile(UpperBound, tempDir) + require.NoError(t, err) + require.Equal(t, UpperBoundFileName, upperBoundary.Name()) + + // Test nil boundary + var nilBoundary *BoundaryFile + require.Equal(t, "", nilBoundary.Name()) +} + +func TestPath(t *testing.T) { + tempDir := t.TempDir() + + // Test lower bound file path + lowerBoundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + expectedLowerPath := filepath.Join(tempDir, LowerBoundFileName) + require.Equal(t, expectedLowerPath, lowerBoundary.Path()) + + // Test upper bound file path + 
upperBoundary, err := LoadBoundaryFile(UpperBound, tempDir) + require.NoError(t, err) + expectedUpperPath := filepath.Join(tempDir, UpperBoundFileName) + require.Equal(t, expectedUpperPath, upperBoundary.Path()) + + // Test nil boundary + var nilBoundary *BoundaryFile + require.Equal(t, "", nilBoundary.Path()) +} + +func TestUpdate(t *testing.T) { + tempDir := t.TempDir() + + // Load boundary file (non-existent initially) + boundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.False(t, boundary.IsDefined()) + + // Update the boundary + err = boundary.Update(42) + require.NoError(t, err) + require.True(t, boundary.IsDefined()) + require.Equal(t, uint32(42), boundary.BoundaryIndex()) + + // Verify file was written + expectedPath := filepath.Join(tempDir, LowerBoundFileName) + content, err := os.ReadFile(expectedPath) + require.NoError(t, err) + require.Equal(t, "42\n", string(content)) + + // Update again with different value + err = boundary.Update(100) + require.NoError(t, err) + require.Equal(t, uint32(100), boundary.BoundaryIndex()) + + // Verify file was updated + content, err = os.ReadFile(expectedPath) + require.NoError(t, err) + require.Equal(t, "100\n", string(content)) +} + +func TestUpdateNilBoundary(t *testing.T) { + var nilBoundary *BoundaryFile + err := nilBoundary.Update(42) + require.NoError(t, err) // Should not error on nil +} + +func TestWrite(t *testing.T) { + tempDir := t.TempDir() + + // Create boundary file + boundary := &BoundaryFile{ + boundaryType: LowerBound, + parentDirectory: tempDir, + defined: true, + boundaryIndex: 999, + } + + // Write the file + err := boundary.Write() + require.NoError(t, err) + + // Verify file content + expectedPath := filepath.Join(tempDir, LowerBoundFileName) + content, err := os.ReadFile(expectedPath) + require.NoError(t, err) + require.Equal(t, "999\n", string(content)) +} + +func TestWriteNilBoundary(t *testing.T) { + var nilBoundary *BoundaryFile + err := 
nilBoundary.Write() + require.NoError(t, err) // Should not error on nil +} + +func TestIsDefined(t *testing.T) { + tempDir := t.TempDir() + + // Test undefined boundary (newly loaded, no file exists) + boundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.False(t, boundary.IsDefined()) + + // Update to make it defined + err = boundary.Update(50) + require.NoError(t, err) + require.True(t, boundary.IsDefined()) + + // Test nil boundary + var nilBoundary *BoundaryFile + require.False(t, nilBoundary.IsDefined()) +} + +func TestBoundaryIndex(t *testing.T) { + tempDir := t.TempDir() + + // Test undefined boundary + boundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.Equal(t, uint32(0), boundary.BoundaryIndex()) + + // Update and test defined boundary + err = boundary.Update(789) + require.NoError(t, err) + require.Equal(t, uint32(789), boundary.BoundaryIndex()) + + // Test nil boundary + var nilBoundary *BoundaryFile + require.Equal(t, uint32(0), nilBoundary.BoundaryIndex()) +} + +func TestSerialize(t *testing.T) { + boundary := &BoundaryFile{ + boundaryType: UpperBound, + parentDirectory: "/tmp", + defined: true, + boundaryIndex: 12345, + } + + data := boundary.serialize() + require.Equal(t, []byte("12345\n"), data) + + // Test nil boundary + var nilBoundary *BoundaryFile + require.Nil(t, nilBoundary.serialize()) +} + +func TestDeserialize(t *testing.T) { + boundary := &BoundaryFile{ + boundaryType: LowerBound, + parentDirectory: "/tmp", + defined: false, + boundaryIndex: 0, + } + + // Test valid data + err := boundary.deserialize([]byte("54321")) + require.NoError(t, err) + require.Equal(t, uint32(54321), boundary.boundaryIndex) + + // Test invalid data + err = boundary.deserialize([]byte("invalid")) + require.Error(t, err) + + // Test nil boundary + var nilBoundary *BoundaryFile + err = nilBoundary.deserialize([]byte("123")) + require.NoError(t, err) // Should not error on nil +} + +func 
TestRoundTrip(t *testing.T) { + tempDir := t.TempDir() + + // Create and update a boundary file + boundary, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + + err = boundary.Update(98765) + require.NoError(t, err) + + // Load the same file again and verify + boundary2, err := LoadBoundaryFile(LowerBound, tempDir) + require.NoError(t, err) + require.True(t, boundary2.IsDefined()) + require.Equal(t, uint32(98765), boundary2.BoundaryIndex()) +} diff --git a/sei-db/db_engine/litt/disktable/control_loop.go b/sei-db/db_engine/litt/disktable/control_loop.go new file mode 100644 index 0000000000..e24631542f --- /dev/null +++ b/sei-db/db_engine/litt/disktable/control_loop.go @@ -0,0 +1,439 @@ +//go:build littdb_wip + +package disktable + +import ( + "fmt" + "math/rand" + "sync" + "sync/atomic" + "time" + + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// controlLoop runs a goroutine that handles control messages for the disk table. +type controlLoop struct { + logger logging.Logger + + // diskTable is the disk table that this control loop is associated with. + diskTable *DiskTable + + // errorMonitor is used to react to fatal errors anywhere in the disk table. + errorMonitor *util.ErrorMonitor + + // controllerChannel is the channel for messages sent to the control loop. + controllerChannel chan any + + // The index of the lowest numbered segment. After initial creation, only the garbage collection + // thread is permitted to read/write this value for the sake of thread safety. + lowestSegmentIndex uint32 + + // The index of the highest numbered segment. All writes are applied to this segment. + highestSegmentIndex uint32 + + // This value mirrors highestSegmentIndex, but is thread safe to read from external goroutines. 
+ // There are several unit tests that read this value, and so there needs to be a threadsafe way + // to access it. Since new segments are added on an infrequent basis and this is never read in + // production, maintaining this atomic variable has negligible overhead. + threadsafeHighestSegmentIndex atomic.Uint32 + + // segmentLock protects access to the variables segments and highestSegmentIndex. + // Does not protect the segments themselves. + segmentLock sync.RWMutex + + // All segments currently in use. Only the control loop modifies this map, but other threads may read from it. + // The control loop does not need to hold a lock when doing read operations on this map, since no other thread + // will modify it. The control loop does need to hold a lock when modifying this map, though, and other threads + // must hold a lock when reading from it. + segments map[uint32]*segment.Segment + + // The number of bytes contained within the immutable segments. This tracks the number of bytes that are + // on disk, not bytes in memory. For thread safety, this variable may only be read/written in the constructor + // and in the control loop. + immutableSegmentSize uint64 + + // The target size for value files. + targetFileSize uint32 + + // The maximum number of keys in a segment. + maxKeyCount uint32 + + // The target size for key files. + targetKeyFileSize uint64 + + // The size of the disk table is stored here. + size *atomic.Uint64 + + // The number of keys in the table. + keyCount *atomic.Int64 + + // clock is the time source used by the disk table. + clock func() time.Time + + // The locations where segment files are stored. + segmentPaths []*segment.SegmentPath + + // Controls if snapshotting is enabled or not. + snapshottingEnabled bool + + // The table's metadata. + metadata *tableMetadata + + // A source of randomness used for generating sharding salt. + saltShaker *rand.Rand + + // whether fsync mode is enabled. 
+ fsync bool + + // If true, then the control loop has been stopped. + stopped atomic.Bool + + // Encapsulates metrics for the database. + metrics *metrics.LittDBMetrics + + // The table's name. + name string + + // The maximum number of keys that can be garbage collected in a single batch. + gcBatchSize uint64 + + // The keymap used to store key-to-address mappings. + keymap keymap.Keymap + + // The goroutine responsible for blocking on flush operations. + flushLoop *flushLoop + + // garbageCollectionPeriod is the period at which garbage collection is run. + garbageCollectionPeriod time.Duration +} + +// enqueue enqueues a request to the control loop. Returns an error if the request could not be sent due to the +// database being in a panicked state. Only types defined in control_loop_messages.go are permitted to be sent +// to the control loop. +func (c *controlLoop) enqueue(request controlLoopMessage) error { + return util.Send(c.errorMonitor, c.controllerChannel, request) +} + +// run runs the control loop for the disk table. It has sole responsibility for scheduling all operations that +// mutate the data in the disk table. 
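+//
+// A hypothetical sketch of how another goroutine drives this loop (message types are
+// defined in control_loop_messages.go; the channel capacity is an assumption):
+//
+//	gcDone := make(chan struct{}, 1)
+//	if err := c.enqueue(&controlLoopGCRequest{completionChan: gcDone}); err == nil {
+//		<-gcDone // garbage collection has completed
+//	}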
+func (c *controlLoop) run() {
+	ticker := time.NewTicker(c.garbageCollectionPeriod)
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-c.errorMonitor.ImmediateShutdownRequired():
+			c.diskTable.logger.Infof("context done, shutting down disk table control loop")
+			return
+		case message := <-c.controllerChannel:
+			switch req := message.(type) {
+			case *controlLoopWriteRequest:
+				c.handleWriteRequest(req)
+			case *controlLoopFlushRequest:
+				c.handleFlushRequest(req)
+			case *controlLoopSetShardingFactorRequest:
+				c.handleControlLoopSetShardingFactorRequest(req)
+			case *controlLoopShutdownRequest:
+				c.handleShutdownRequest(req)
+				return
+			case *controlLoopGCRequest:
+				c.doGarbageCollection()
+				req.completionChan <- struct{}{}
+			default:
+				c.errorMonitor.Panic(fmt.Errorf("unknown control message type %T", message))
+				return
+			}
+		case <-ticker.C:
+			c.doGarbageCollection()
+		}
+	}
+}
+
+// doGarbageCollection performs garbage collection on all segments, deleting old ones as necessary.
+func (c *controlLoop) doGarbageCollection() {
+	start := c.clock()
+	ttl := c.metadata.GetTTL()
+	if ttl.Nanoseconds() <= 0 {
+		// No TTL set, so nothing to do.
+		return
+	}
+
+	defer func() {
+		if c.metrics != nil {
+			end := c.clock()
+			delta := end.Sub(start)
+			c.metrics.ReportGarbageCollectionLatency(c.name, delta)
+		}
+		c.updateCurrentSize()
+	}()
+
+	for index := c.lowestSegmentIndex; index <= c.highestSegmentIndex; index++ {
+		seg := c.segments[index]
+		if !seg.IsSealed() {
+			// We can't delete an unsealed segment.
+			return
+		}
+
+		sealTime := seg.GetSealTime()
+		segmentAge := start.Sub(sealTime)
+		if segmentAge < ttl {
+			// Segment is not old enough to be deleted.
+			return
+		}
+
+		// Segment is old enough to be deleted.
+		keys, err := seg.GetKeys()
+		if err != nil {
+			c.errorMonitor.Panic(fmt.Errorf("failed to get keys: %w", err))
+			return
+		}
+
+		for keyIndex := uint64(0); keyIndex < uint64(len(keys)); keyIndex += c.gcBatchSize {
+			lastIndex := keyIndex + c.gcBatchSize
+			if lastIndex > uint64(len(keys)) {
+				lastIndex = uint64(len(keys))
+			}
+			err = c.keymap.Delete(keys[keyIndex:lastIndex])
+			if err != nil {
+				c.errorMonitor.Panic(fmt.Errorf("failed to delete keys: %w", err))
+				return
+			}
+		}
+
+		if seg.Size() > c.immutableSegmentSize {
+			c.logger.Errorf("segment %d size %d is larger than immutable segment size %d, "+
+				"reported DB size will not be accurate", index, seg.Size(), c.immutableSegmentSize)
+		}
+
+		c.immutableSegmentSize -= seg.Size()
+		c.keyCount.Add(-1 * int64(seg.KeyCount()))
+
+		// Deletion of segment files will happen when the segment is released by all reservation holders.
+		seg.Release()
+		c.segmentLock.Lock()
+		delete(c.segments, index)
+		c.segmentLock.Unlock()
+
+		c.lowestSegmentIndex++
+	}
+}
+
+// getReservedSegment returns the segment with the given index. The segment is reserved, and it is the caller's
+// responsibility to release the reservation when done. Returns true if the segment was found and reserved,
+// and false if the segment could not be found or could not be reserved.
+func (c *controlLoop) getReservedSegment(index uint32) (*segment.Segment, bool) {
+	c.segmentLock.RLock()
+	defer c.segmentLock.RUnlock()
+
+	seg, ok := c.segments[index]
+	if !ok {
+		return nil, false
+	}
+
+	ok = seg.Reserve()
+	if !ok {
+		// the segment was deleted out from under us
+		return nil, false
+	}
+
+	return seg, true
+}
+
+// getSegments returns the segments of the disk table. It is only legal to call this after the control loop has been
+// stopped.
+func (c *controlLoop) getSegments() (map[uint32]*segment.Segment, error) { + if !c.stopped.Load() { + return nil, fmt.Errorf("cannot get segments until control loop has stopped") + } + return c.segments, nil +} + +// updateCurrentSize updates the size of the table. +func (c *controlLoop) updateCurrentSize() { + size := c.immutableSegmentSize + + c.segments[c.highestSegmentIndex].Size() + + c.metadata.Size() + + c.size.Store(size) +} + +// handleWriteRequest handles a controlLoopWriteRequest control message. +func (c *controlLoop) handleWriteRequest(req *controlLoopWriteRequest) { + for _, kv := range req.values { + // Do the write. + seg := c.segments[c.highestSegmentIndex] + keyCount, keyFileSize, err := seg.Write(kv) + shardSize := seg.GetMaxShardSize() + if err != nil { + c.errorMonitor.Panic( + fmt.Errorf("failed to write to segment %d: %w", c.highestSegmentIndex, err)) + return + } + + // Check to see if the write caused the mutable segment to become full. + if shardSize > uint64(c.targetFileSize) || keyCount >= c.maxKeyCount || keyFileSize >= c.targetKeyFileSize { + // Mutable segment is full. Before continuing, we need to expand the segments. + err = c.expandSegments() + if err != nil { + c.errorMonitor.Panic(fmt.Errorf("failed to expand segments: %w", err)) + return + } + } + } + + c.updateCurrentSize() +} + +// expandSegments seals the latest segment and creates a new mutable segment. +func (c *controlLoop) expandSegments() error { + now := c.clock() + + // Seal the previous segment. + flushLoopResponseChan := make(chan struct{}, 1) + request := &flushLoopSealRequest{ + now: now, + segmentToSeal: c.segments[c.highestSegmentIndex], + responseChan: flushLoopResponseChan, + } + err := c.flushLoop.enqueue(request) + if err != nil { + return fmt.Errorf("failed to send seal request: %w", err) + } + + // Unfortunately, it is necessary to block until the sealing has been completed. 
Although this may result + // in a brief interruption in new write work being sent to the segment, expanding the number of segments is + // infrequent, even for very high throughput workloads. + _, err = util.Await(c.errorMonitor, flushLoopResponseChan) + if err != nil { + return fmt.Errorf("failed to seal segment: %w", err) + } + + // Record the size of the segment. + c.immutableSegmentSize += c.segments[c.highestSegmentIndex].Size() + + // Create a new segment. + salt := [16]byte{} + _, err = c.saltShaker.Read(salt[:]) + if err != nil { + return fmt.Errorf("failed to read salt: %w", err) + } + newSegment, err := segment.CreateSegment( + c.logger, + c.errorMonitor, + c.highestSegmentIndex+1, + c.segmentPaths, + c.snapshottingEnabled, + c.metadata.GetShardingFactor(), + salt, + c.fsync) + if err != nil { + return err + } + c.segments[c.highestSegmentIndex].SetNextSegment(newSegment) + c.highestSegmentIndex++ + c.threadsafeHighestSegmentIndex.Add(1) + + c.segmentLock.Lock() + c.segments[c.highestSegmentIndex] = newSegment + c.segmentLock.Unlock() + + c.updateCurrentSize() + + return nil +} + +// handleFlushRequest handles the part of the flush that is performed on the control loop. +// The control loop is responsible for enqueuing the flush request in the segment's work queue (thus +// ensuring a serial ordering with respect to other operations on the control loop), but not for +// waiting for the segment to finish the flush. +func (c *controlLoop) handleFlushRequest(req *controlLoopFlushRequest) { + // This method will enqueue a flush operation within the segment. Once that is done, + // it becomes the responsibility of the flush loop to wait for the flush to complete. + flushWaitFunction, err := c.segments[c.highestSegmentIndex].Flush() + if err != nil { + c.errorMonitor.Panic(fmt.Errorf("failed to flush segment %d: %w", c.highestSegmentIndex, err)) + return + } + + // The flush loop is responsible for the remaining parts of the flush. 
+ request := &flushLoopFlushRequest{ + flushWaitFunction: flushWaitFunction, + responseChan: req.responseChan, + } + err = c.flushLoop.enqueue(request) + if err != nil { + c.logger.Errorf("failed to send flush request to flush loop: %v", err) + } +} + +// handleControlLoopSetShardingFactorRequest updates the sharding factor of the disk table. If the requested +// sharding factor is the same as before, no action is taken. If it is different, the sharding factor is updated, +// the current mutable segment is sealed, and a new mutable segment is created. +func (c *controlLoop) handleControlLoopSetShardingFactorRequest(req *controlLoopSetShardingFactorRequest) { + + if req.shardingFactor == c.metadata.GetShardingFactor() { + // No action necessary. + return + } + err := c.metadata.SetShardingFactor(req.shardingFactor) + if err != nil { + c.errorMonitor.Panic(fmt.Errorf("failed to set sharding factor: %w", err)) + return + } + + // This seals the current mutable segment and creates a new one. The new segment will have the new sharding factor. + err = c.expandSegments() + if err != nil { + c.errorMonitor.Panic(fmt.Errorf("failed to expand segments: %w", err)) + return + } +} + +// handleShutdownRequest performs tasks necessary to cleanly shut down the disk table. +func (c *controlLoop) handleShutdownRequest(req *controlLoopShutdownRequest) { + // Instruct the flush loop to stop. 
+	shutdownCompleteChan := make(chan struct{})
+	request := &flushLoopShutdownRequest{
+		shutdownCompleteChan: shutdownCompleteChan,
+	}
+	err := c.flushLoop.enqueue(request)
+	if err != nil {
+		c.logger.Errorf("failed to send shutdown request to flush loop: %v", err)
+		return
+	}
+
+	_, err = util.Await(c.errorMonitor, shutdownCompleteChan)
+	if err != nil {
+		c.logger.Errorf("failed to shutdown flush loop: %v", err)
+		return
+	}
+
+	// Seal the mutable segment
+	durableKeys, err := c.segments[c.highestSegmentIndex].Seal(c.clock())
+	if err != nil {
+		c.errorMonitor.Panic(fmt.Errorf("failed to seal mutable segment: %w", err))
+		return
+	}
+
+	// Flush the keys that are now durable in the segment.
+	err = c.diskTable.writeKeysToKeymap(durableKeys)
+	if err != nil {
+		c.errorMonitor.Panic(fmt.Errorf("failed to flush keys: %w", err))
+		return
+	}
+
+	// Stop the keymap
+	err = c.keymap.Stop()
+	if err != nil {
+		c.errorMonitor.Panic(fmt.Errorf("failed to stop keymap: %w", err))
+		return
+	}
+
+	c.stopped.Store(true)
+	req.shutdownCompleteChan <- struct{}{}
+}
diff --git a/sei-db/db_engine/litt/disktable/control_loop_messages.go b/sei-db/db_engine/litt/disktable/control_loop_messages.go
new file mode 100644
index 0000000000..8726fdbff5
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/control_loop_messages.go
@@ -0,0 +1,55 @@
+//go:build littdb_wip
+
+package disktable
+
+import "github.com/Layr-Labs/eigenda/litt/types"
+
+// This file contains various messages that can be sent to the disk table's control loop.
+
+// controlLoopMessage is an interface for messages sent to the control loop via controlLoop.enqueue.
+type controlLoopMessage interface {
+	// If this were an empty interface, the Go type system would not complain when non-implementing
+	// types were passed to the control loop; this unexported method prevents that.
+	unimplemented()
+}
+
+// controlLoopFlushRequest is a request to flush the writer that is sent to the control loop.
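+//
+// A hypothetical sketch of a sender issuing a flush and blocking until the flush
+// loop signals completion (the channel capacity is an assumption):
+//
+//	respChan := make(chan struct{}, 1)
+//	if err := c.enqueue(&controlLoopFlushRequest{responseChan: respChan}); err == nil {
+//		<-respChan
+//	}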
+type controlLoopFlushRequest struct {
+	controlLoopMessage
+
+	// responseChan produces a value when the flush is complete.
+	responseChan chan struct{}
+}
+
+// controlLoopWriteRequest is a request to write a key-value pair that is sent to the control loop.
+type controlLoopWriteRequest struct {
+	controlLoopMessage
+
+	// values is a slice of key-value pairs to write.
+	values []*types.KVPair
+}
+
+// controlLoopSetShardingFactorRequest is a request to set the sharding factor that is sent to the control loop.
+type controlLoopSetShardingFactorRequest struct {
+	controlLoopMessage
+
+	// shardingFactor is the new sharding factor to set.
+	shardingFactor uint32
+}
+
+// controlLoopShutdownRequest is a request to shut down the table that is sent to the control loop.
+type controlLoopShutdownRequest struct {
+	controlLoopMessage
+
+	// shutdownCompleteChan will produce a single struct{} when the control loop has stopped
+	// (i.e. when handleShutdownRequest is complete).
+	shutdownCompleteChan chan struct{}
+}
+
+// controlLoopGCRequest is a request to run garbage collection that is sent to the control loop.
+type controlLoopGCRequest struct {
+	controlLoopMessage
+
+	// completionChan produces a value when the garbage collection is complete.
+ completionChan chan struct{} +} diff --git a/sei-db/db_engine/litt/disktable/disk_table.go b/sei-db/db_engine/litt/disktable/disk_table.go new file mode 100644 index 0000000000..c66f94e586 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/disk_table.go @@ -0,0 +1,920 @@ +//go:build littdb_wip + +package disktable + +import ( + "errors" + "fmt" + "math" + "math/rand" + "os" + "path" + "sync" + "sync/atomic" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +var _ litt.ManagedTable = (*DiskTable)(nil) + +// keymapReloadBatchSize is the size of the batch used for reloading keys from segments into the keymap. +const keymapReloadBatchSize = 1024 + +const tableFlushChannelCapacity = 8 + +// DiskTable manages a table's Segments. +type DiskTable struct { + // The logger for the disk table. + logger logging.Logger + + // errorMonitor is a struct that permits the DB to "panic". There are many goroutines that function under the + // hood, and many of these threads could, in theory, encounter errors which are unrecoverable. In such situations, + // the desirable outcome is for the DB to report the error and then refuse to do additional work. If the DB is in a + // broken state, it is much better to refuse to do work than to continue to do work and potentially corrupt data. + errorMonitor *util.ErrorMonitor + + // The root directories for the disk table. Each of these directories' name matches the name of the table. + roots []string + + // Configures the location where segment data is stored. + segmentPaths []*segment.SegmentPath + + // The table's name. + name string + + // The table's metadata. + metadata *tableMetadata + + // A map of keys to their addresses. 
+ keymap keymap.Keymap + + // The path to the keymap directory. + keymapPath string + + // The type file for the keymap. + keymapTypeFile *keymap.KeymapTypeFile + + // unflushedDataCache is a map of keys to their values that may not have been flushed to disk yet. This is used as a + // lookup table when data is requested from the table before it has been flushed to disk. + unflushedDataCache sync.Map + + // clock is the time source used by the disk table. + clock func() time.Time + + // The number of bytes contained within all segments, including the mutable segment. This tracks the number of + // bytes that are on disk, not bytes in memory. + size atomic.Uint64 + + // The number of keys in the table. + keyCount atomic.Int64 + + // The control loop is a goroutine responsible for scheduling operations that mutate the table. + controlLoop *controlLoop + + // The flush loop is a goroutine responsible for blocking on flush operations. + flushLoop *flushLoop + + // Encapsulates metrics for the database. + metrics *metrics.LittDBMetrics + + // Set to true when the table is closed. This is used to prevent double closing. + closed atomic.Bool + + // Set to true when the table is destroyed. This is used to prevent double destroying. + destroyed atomic.Bool + + // If true then ensure file operations are synced to disk. + fsync bool + + // Manages flush requests and flush request batching. This is a performance optimization. + flushCoordinator *flushCoordinator +} + +// NewDiskTable creates a new DiskTable. 
+func NewDiskTable( + config *litt.Config, + name string, + keymap keymap.Keymap, + keymapPath string, + keymapTypeFile *keymap.KeymapTypeFile, + roots []string, + reloadKeymap bool, + metrics *metrics.LittDBMetrics) (litt.ManagedTable, error) { + + if config.GCPeriod <= 0 { + return nil, errors.New("garbage collection period must be greater than 0") + } + + qualifiedRoots := make([]string, len(roots)) + for i, root := range roots { + qualifiedRoots[i] = path.Join(root, name) + } + + // For each root directory, create a segment directory if it doesn't exist. + segmentPaths, err := segment.BuildSegmentPaths(roots, config.SnapshotDirectory, name) + if err != nil { + return nil, fmt.Errorf("failed to build segment paths: %w", err) + } + for _, segmentPath := range segmentPaths { + err = segmentPath.MakeDirectories(config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to create segment directories: %w", err) + } + } + + // Delete any orphaned swap files: + for _, root := range qualifiedRoots { + err = util.DeleteOrphanedSwapFiles(root) + if err != nil { + return nil, fmt.Errorf("failed to delete orphaned swap files in %s: %w", root, err) + } + } + + var metadataFilePath string + var metadata *tableMetadata + + // Find the table metadata file or create a new one. + for _, root := range qualifiedRoots { + possibleMetadataPath := metadataPath(root) + exists, err := util.Exists(possibleMetadataPath) + if err != nil { + return nil, fmt.Errorf("failed to check if metadata file exists: %w", err) + } + if exists { + if metadataFilePath != "" { + return nil, fmt.Errorf("multiple metadata files found: %s and %s", + metadataFilePath, possibleMetadataPath) + } + + // We've found an existing metadata file. Use it. + metadataFilePath = possibleMetadataPath + } + } + if metadataFilePath == "" { + // No metadata file exists yet. Create a new one in the first root. 
+ var err error + metadataDir := qualifiedRoots[0] + metadata, err = newTableMetadata(config.Logger, metadataDir, config.TTL, config.ShardingFactor, config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to create table metadata: %w", err) + } + } else { + // Metadata file exists, so we need to load it. + var err error + metadataDir := path.Dir(metadataFilePath) + metadata, err = loadTableMetadata(config.Logger, metadataDir) + if err != nil { + return nil, fmt.Errorf("failed to load table metadata: %w", err) + } + } + + errorMonitor := util.NewErrorMonitor(config.CTX, config.Logger, config.FatalErrorCallback) + + table := &DiskTable{ + logger: config.Logger, + errorMonitor: errorMonitor, + clock: config.Clock, + roots: qualifiedRoots, + segmentPaths: segmentPaths, + name: name, + metadata: metadata, + keymap: keymap, + keymapPath: keymapPath, + keymapTypeFile: keymapTypeFile, + metrics: metrics, + fsync: config.Fsync, + } + table.flushCoordinator = newFlushCoordinator(errorMonitor, table.flushInternal, config.MinimumFlushInterval) + + snapshottingEnabled := config.SnapshotDirectory != "" + + // Load segments. 
+ lowestSegmentIndex, highestSegmentIndex, segments, err := + segment.GatherSegmentFiles( + config.Logger, + errorMonitor, + table.segmentPaths, + snapshottingEnabled, + config.Clock(), + true, + config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to gather segment files: %w", err) + } + + keyCount := int64(0) + for _, seg := range segments { + keyCount += int64(seg.KeyCount()) + } + table.keyCount.Store(keyCount) + + immutableSegmentSize := uint64(0) + for _, seg := range segments { + immutableSegmentSize += seg.Size() + } + + // Create the mutable segment + creatingFirstSegment := len(segments) == 0 + + var nextSegmentIndex uint32 + if creatingFirstSegment { + nextSegmentIndex = 0 + } else { + nextSegmentIndex = highestSegmentIndex + 1 + } + salt := [16]byte{} + _, err = config.SaltShaker.Read(salt[:]) + if err != nil { + return nil, fmt.Errorf("failed to read salt: %w", err) + } + + mutableSegment, err := segment.CreateSegment( + config.Logger, + errorMonitor, + nextSegmentIndex, + segmentPaths, + snapshottingEnabled, + metadata.GetShardingFactor(), + salt, + config.Fsync) + if err != nil { + return nil, fmt.Errorf("failed to create mutable segment: %w", err) + } + if !creatingFirstSegment { + segments[highestSegmentIndex].SetNextSegment(mutableSegment) + highestSegmentIndex++ + } + segments[nextSegmentIndex] = mutableSegment + + if reloadKeymap { + config.Logger.Infof("reloading keymap from segments") + err = table.reloadKeymap(segments, lowestSegmentIndex, highestSegmentIndex) + if err != nil { + return nil, fmt.Errorf("failed to load keymap from segments: %w", err) + } + } + + tableSaltShaker := rand.New(rand.NewSource(config.SaltShaker.Int63())) + + var upperBoundSnapshotFile *BoundaryFile + if config.SnapshotDirectory != "" { + // Initialize snapshot files if snapshotting is enabled. 
+ upperBoundSnapshotFile, err = table.repairSnapshot( + config.SnapshotDirectory, + lowestSegmentIndex, + highestSegmentIndex, + segments) + if err != nil { + return nil, fmt.Errorf("failed to repair snapshot: %w", err) + } + } + + // Start the flush loop. + fLoop := &flushLoop{ + logger: config.Logger, + diskTable: table, + errorMonitor: errorMonitor, + flushChannel: make(chan any, tableFlushChannelCapacity), + metrics: metrics, + clock: config.Clock, + name: name, + upperBoundSnapshotFile: upperBoundSnapshotFile, + } + table.flushLoop = fLoop + go fLoop.run() + + // Start the control loop. + cLoop := &controlLoop{ + logger: config.Logger, + diskTable: table, + errorMonitor: errorMonitor, + controllerChannel: make(chan any, config.ControlChannelSize), + lowestSegmentIndex: lowestSegmentIndex, + highestSegmentIndex: highestSegmentIndex, + segments: segments, + size: &table.size, + keyCount: &table.keyCount, + targetFileSize: config.TargetSegmentFileSize, + targetKeyFileSize: config.TargetSegmentKeyFileSize, + maxKeyCount: config.MaxSegmentKeyCount, + clock: config.Clock, + segmentPaths: segmentPaths, + snapshottingEnabled: snapshottingEnabled, + saltShaker: tableSaltShaker, + metadata: metadata, + fsync: config.Fsync, + metrics: metrics, + name: name, + gcBatchSize: config.GCBatchSize, + keymap: keymap, + flushLoop: fLoop, + garbageCollectionPeriod: config.GCPeriod, + immutableSegmentSize: immutableSegmentSize, + } + cLoop.threadsafeHighestSegmentIndex.Store(highestSegmentIndex) + table.controlLoop = cLoop + cLoop.updateCurrentSize() + go cLoop.run() + + return table, nil +} + +func (d *DiskTable) KeyCount() uint64 { + return uint64(d.keyCount.Load()) +} + +func (d *DiskTable) Size() uint64 { + return d.size.Load() +} + +// repairSnapshot is responsible for making any required repairs to the snapshot directories. This is needed +// if there is a crash, resulting in a segment not being fully snapshotted. 
It is also needed if LittDB has +// been rebased (which breaks symlinks) or manually modified (e.g. by the LittDB cli). Returns the new upper bound +// file for the repaired snapshot. +func (d *DiskTable) repairSnapshot( + symlinkDirectory string, + lowestSegmentIndex uint32, + highestSegmentIndex uint32, + segments map[uint32]*segment.Segment) (*BoundaryFile, error) { + + symlinkTableDirectory := path.Join(symlinkDirectory, d.name) + + err := util.EnsureDirectoryExists(symlinkTableDirectory, d.fsync) + if err != nil { + return nil, fmt.Errorf("failed to ensure symlink table directory exists: %w", err) + } + + upperBoundSnapshotFile, err := LoadBoundaryFile(UpperBound, symlinkTableDirectory) + if err != nil { + return nil, fmt.Errorf("failed to load snapshot boundary file: %w", err) + } + + // Prevent other processes from messing with the symlink table directory while we are working on it. + lockPath := path.Join(symlinkTableDirectory, util.LockfileName) + lock, err := util.NewFileLock(d.logger, lockPath, false) + if err != nil { + return nil, fmt.Errorf("failed to acquire lock on symlink table directory: %w", err) + } + defer lock.Release() + + symlinkSegmentsDirectory := path.Join(symlinkTableDirectory, segment.SegmentDirectory) + exists, err := util.Exists(symlinkSegmentsDirectory) + if err != nil { + return nil, fmt.Errorf("failed to check if symlink segments directory exists: %w", err) + } + if exists { + // Delete all data from the previous snapshot. This directory will contain a bunch of symlinks. It's a lot + // simpler to just rebuild this from scratch than it is to try to figure out which symlinks are valid + // and which are not. Building this is super fast, so this is not a performance concern. 
+ err = os.RemoveAll(symlinkSegmentsDirectory) + if err != nil { + return nil, fmt.Errorf("failed to remove symlink segments directory: %w", err) + } + } + + err = os.MkdirAll(symlinkSegmentsDirectory, 0755) + if err != nil { + return nil, fmt.Errorf("failed to create symlink segments directory: %w", err) + } + + if len(segments) <= 1 { + // There is only the mutable segment, nothing else to do. + return upperBoundSnapshotFile, nil + } + + lowerBoundSnapshotFile, err := LoadBoundaryFile(LowerBound, symlinkTableDirectory) + if err != nil { + return nil, fmt.Errorf("failed to load snapshot boundary file: %w", err) + } + + firstSegmentToConsider := lowestSegmentIndex + if lowerBoundSnapshotFile.IsDefined() { + // The lower bound file contains the index of the highest segment that has been GC'd by an external process. + // We should ignore the segment at this index, and all segments with lower indices. + firstSegmentToConsider = lowerBoundSnapshotFile.BoundaryIndex() + 1 + } + + // Skip iterating over the highest segment index (i.e. don't do i <= highestSegmentIndex). The highest segment + // index is mutable and cannot be snapshotted until it has been sealed. + for i := firstSegmentToConsider; i < highestSegmentIndex; i++ { + seg := segments[i] + err = seg.Snapshot() + if err != nil { + return nil, fmt.Errorf("failed to snapshot segment %d: %w", i, err) + } + } + + // Signal that the segment files are now fully snapshotted and safe to use. + // The highest segment index is the mutable segment, which is not snapshotted. + err = upperBoundSnapshotFile.Update(highestSegmentIndex - 1) + if err != nil { + return nil, fmt.Errorf("failed to update upper bound snapshot file: %w", err) + } + + return upperBoundSnapshotFile, nil +} + +// reloadKeymap reloads the keymap from the segments. This is necessary when the keymap is lost, the keymap doesn't +// save its data on disk, or we are migrating from one keymap type to another. 
+func (d *DiskTable) reloadKeymap( + segments map[uint32]*segment.Segment, + lowestSegmentIndex uint32, + highestSegmentIndex uint32) error { + + start := d.clock() + defer func() { + d.logger.Infof("spent %v reloading keymap", d.clock().Sub(start)) + }() + + batch := make([]*types.ScopedKey, 0, keymapReloadBatchSize) + + for i := lowestSegmentIndex; i <= highestSegmentIndex; i++ { + if !segments[i].IsSealed() { + // ignore unsealed segment, this will have been created in the current session and will not + // yet contain any data. + continue + } + + keys, err := segments[i].GetKeys() + if err != nil { + return fmt.Errorf("failed to get keys from segment %d: %w", i, err) + } + for keyIndex := len(keys) - 1; keyIndex >= 0; keyIndex-- { + key := keys[keyIndex] + + batch = append(batch, key) + if len(batch) == keymapReloadBatchSize { + err = d.keymap.Put(batch) + if err != nil { + return fmt.Errorf("failed to put keys for segment %d: %w", i, err) + } + batch = make([]*types.ScopedKey, 0, keymapReloadBatchSize) + } + } + } + + if len(batch) > 0 { + err := d.keymap.Put(batch) + if err != nil { + return fmt.Errorf("failed to put keys: %w", err) + } + } + + // Now that the keymap is loaded, write the marker file that indicates that the keymap is fully loaded. + // If we crash prior to writing this file, the keymap will reload from the segments again. + keymapInitializedFile := path.Join(d.keymapPath, keymap.KeymapInitializedFileName) + err := os.MkdirAll(d.keymapPath, 0755) + if err != nil { + return fmt.Errorf("failed to create keymap directory: %w", err) + } + + f, err := os.Create(keymapInitializedFile) + if err != nil { + return fmt.Errorf("failed to create keymap initialized file after reload: %w", err) + } + err = f.Close() + if err != nil { + return fmt.Errorf("failed to close keymap initialized file after reload: %w", err) + } + + return nil +} + +func (d *DiskTable) Name() string { + return d.name +} + +// Close stops the disk table. Flushes all data out to disk. 
+func (d *DiskTable) Close() error {
+	firstTimeClosing := d.closed.CompareAndSwap(false, true)
+	if !firstTimeClosing {
+		return nil
+	}
+
+	if ok, err := d.errorMonitor.IsOk(); !ok {
+		return fmt.Errorf("cannot process Close() request, DB is in panicked state due to error: %w", err)
+	}
+
+	d.errorMonitor.Shutdown()
+
+	shutdownCompleteChan := make(chan struct{}, 1)
+	request := &controlLoopShutdownRequest{
+		shutdownCompleteChan: shutdownCompleteChan,
+	}
+
+	err := d.controlLoop.enqueue(request)
+	if err != nil {
+		return fmt.Errorf("failed to send shutdown request: %w", err)
+	}
+
+	_, err = util.Await(d.errorMonitor, shutdownCompleteChan)
+	if err != nil {
+		return fmt.Errorf("failed to shut down: %w", err)
+	}
+
+	return nil
+}
+
+// Destroy stops the disk table and deletes all of its files.
+func (d *DiskTable) Destroy() error {
+	firstTimeDestroying := d.destroyed.CompareAndSwap(false, true)
+	if !firstTimeDestroying {
+		return nil // already destroyed
+	}
+
+	err := d.Close()
+	if err != nil {
+		return fmt.Errorf("failed to close: %w", err)
+	}
+
+	d.logger.Infof("deleting disk table at path(s): %v", d.roots)
+
+	// Release all segments.
+	segments, err := d.controlLoop.getSegments()
+	if err != nil {
+		return fmt.Errorf("failed to get segments: %w", err)
+	}
+	for _, seg := range segments {
+		seg.Release()
+	}
+	// Wait for the segments to delete themselves.
+	for _, seg := range segments {
+		err = seg.BlockUntilFullyDeleted()
+		if err != nil {
+			return fmt.Errorf("failed to delete segment: %w", err)
+		}
+	}
+
+	// Delete all segment directories (snapshots are ignored -- cleaning those up is the
+	// responsibility of an outside process).
+	for _, segmentPath := range d.segmentPaths {
+		err = os.Remove(segmentPath.SegmentDirectory())
+		if err != nil {
+			return fmt.Errorf("failed to remove segment directory: %w", err)
+		}
+	}
+
+	// Delete the snapshot hardlink directory.
+	for _, root := range d.roots {
+		snapshotDir := path.Join(root, segment.HardLinkDirectory)
+		exists, err := util.Exists(snapshotDir)
+		if err != nil {
+			return fmt.Errorf("failed to check if snapshot directory exists: %w", err)
+		}
+		if exists {
+			err = os.RemoveAll(snapshotDir)
+			if err != nil {
+				return fmt.Errorf("failed to remove snapshot directory: %w", err)
+			}
+		}
+	}
+
+	// Destroy the keymap.
+	err = d.keymap.Destroy()
+	if err != nil {
+		return fmt.Errorf("failed to destroy keymap: %w", err)
+	}
+	err = d.keymapTypeFile.Delete()
+	if err != nil {
+		return fmt.Errorf("failed to delete keymap type file: %w", err)
+	}
+	exists, err := util.Exists(d.keymapPath)
+	if err != nil {
+		return fmt.Errorf("failed to check if keymap directory exists: %w", err)
+	}
+	if exists {
+		err = os.RemoveAll(d.keymapPath)
+		if err != nil {
+			return fmt.Errorf("failed to remove keymap directory: %w", err)
+		}
+	}
+
+	// Delete the metadata file.
+	err = d.metadata.delete()
+	if err != nil {
+		return fmt.Errorf("failed to delete metadata: %w", err)
+	}
+
+	// Delete the root directories for the table.
+	for _, root := range d.roots {
+		err = os.Remove(root)
+		if err != nil {
+			return fmt.Errorf("failed to remove root directory: %w", err)
+		}
+	}
+
+	return nil
+}
+
+// SetTTL sets the TTL for the disk table. If set to 0, no TTL is enforced. This setting affects both new
+// data and data already written.
+func (d *DiskTable) SetTTL(ttl time.Duration) error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf("cannot process SetTTL() request, DB is in panicked state due to error: %w", err) + } + + err := d.metadata.SetTTL(ttl) + if err != nil { + return fmt.Errorf("failed to set TTL: %w", err) + } + return nil +} + +func (d *DiskTable) SetShardingFactor(shardingFactor uint32) error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf( + "cannot process SetShardingFactor() request, DB is in panicked state due to error: %w", err) + } + + if shardingFactor == 0 { + return fmt.Errorf("sharding factor must be greater than 0") + } + + request := &controlLoopSetShardingFactorRequest{ + shardingFactor: shardingFactor, + } + err := d.controlLoop.enqueue(request) + if err != nil { + return fmt.Errorf("failed to send sharding factor request: %w", err) + } + + return nil +} + +func (d *DiskTable) Get(key []byte) (value []byte, exists bool, err error) { + if ok, err := d.errorMonitor.IsOk(); !ok { + return nil, false, fmt.Errorf( + "cannot process Get() request, DB is in panicked state due to error: %w", err) + } + + // First, check if the key is in the unflushed data map. + // If so, return it from there. + if value, ok := d.unflushedDataCache.Load(util.UnsafeBytesToString(key)); ok { + bytes := value.([]byte) + return bytes, true, nil + } + + // Look up the address of the data. + address, ok, err := d.keymap.Get(key) + if err != nil { + return nil, false, fmt.Errorf("failed to get address: %w", err) + } + if !ok { + return nil, false, nil + } + + // Reserve the segment that contains the data. + seg, ok := d.controlLoop.getReservedSegment(address.Index()) + if !ok { + return nil, false, nil + } + defer seg.Release() + + // Read the data from disk. 
+	data, err := seg.Read(key, address)
+	if err != nil {
+		return nil, false, fmt.Errorf("failed to read data: %w", err)
+	}
+
+	return data, true, nil
+}
+
+func (d *DiskTable) CacheAwareGet(
+	key []byte,
+	onlyReadFromCache bool,
+) (value []byte, exists bool, hot bool, err error) {
+
+	if ok, err := d.errorMonitor.IsOk(); !ok {
+		return nil, false, false, fmt.Errorf(
+			"cannot process CacheAwareGet() request, DB is in panicked state due to error: %w", err)
+	}
+
+	// First, check if the key is in the unflushed data map. If so, return it from there.
+	// Performance-wise, this has semantics equivalent to reading the value from
+	// a cache, so we might as well count it as a cache hit.
+	var rawValue any
+	if rawValue, exists = d.unflushedDataCache.Load(util.UnsafeBytesToString(key)); exists {
+		value = rawValue.([]byte)
+		return value, true, true, nil
+	}
+
+	// Look up the address of the data.
+	var address types.Address
+	address, exists, err = d.keymap.Get(key)
+	if err != nil {
+		return nil, false, false, fmt.Errorf("failed to get address: %w", err)
+	}
+	if !exists {
+		return nil, false, false, nil
+	}
+
+	if onlyReadFromCache {
+		// The value exists, but we are not allowed to read it from disk.
+		return nil, true, false, nil
+	}
+
+	// Reserve the segment that contains the data.
+	seg, ok := d.controlLoop.getReservedSegment(address.Index())
+	if !ok {
+		// This can happen if there is a race between this thread and the GC thread, i.e.
+		// if we start reading a value just as the garbage collector decides to delete it.
+		return nil, false, false, nil
+	}
+	defer seg.Release()
+
+	// Read the data from disk.
+	value, err = seg.Read(key, address)
+	if err != nil {
+		return nil, false, false, fmt.Errorf("failed to read data: %w", err)
+	}
+
+	return value, true, false, nil
+}
+
+func (d *DiskTable) Put(key []byte, value []byte) error {
+	return d.PutBatch([]*types.KVPair{{Key: key, Value: value}})
+}
+
+func (d *DiskTable) PutBatch(batch []*types.KVPair) error {
+	if ok, err := d.errorMonitor.IsOk(); !ok {
+		return fmt.Errorf("cannot process PutBatch() request, DB is in panicked state due to error: %w", err)
+	}
+
+	if d.metrics != nil {
+		start := d.clock()
+		totalSize := uint64(0)
+		for _, kv := range batch {
+			totalSize += uint64(len(kv.Value))
+		}
+		defer func() {
+			end := d.clock()
+			delta := end.Sub(start)
+			d.metrics.ReportWriteOperation(d.name, delta, uint64(len(batch)), totalSize)
+		}()
+	}
+
+	// Validate the entire batch before caching anything, so that a validation failure partway through
+	// does not leave earlier entries stranded in the unflushed data cache.
+	for _, kv := range batch {
+		if kv.Key == nil {
+			return fmt.Errorf("nil keys are not supported")
+		}
+		if kv.Value == nil {
+			return fmt.Errorf("nil values are not supported")
+		}
+		if len(kv.Key) > math.MaxUint32 {
+			return fmt.Errorf("key is too large, length must fit in a uint32: %d bytes", len(kv.Key))
+		}
+		if len(kv.Value) > math.MaxUint32 {
+			return fmt.Errorf("value is too large, length must fit in a uint32: %d bytes", len(kv.Value))
+		}
+	}
+
+	for _, kv := range batch {
+		d.unflushedDataCache.Store(util.UnsafeBytesToString(kv.Key), kv.Value)
+	}
+
+	request := &controlLoopWriteRequest{
+		values: batch,
+	}
+	err := d.controlLoop.enqueue(request)
+	if err != nil {
+		return fmt.Errorf("failed to send write request: %w", err)
+	}
+
+	d.keyCount.Add(int64(len(batch)))
+
+	return nil
+}
+
+func (d *DiskTable) Exists(key []byte) (bool, error) {
+	_, ok := d.unflushedDataCache.Load(util.UnsafeBytesToString(key))
+	if ok {
+		return true, nil
+	}
+
+	_, ok, err := d.keymap.Get(key)
+	if err != nil {
+		return false, fmt.Errorf("failed to get address: %w", err)
+	}
+
+	return ok, nil
+}
+
+// Flush flushes all data to disk.
Blocks until all data previously submitted to Put has been written to disk. +func (d *DiskTable) Flush() error { + // The flush coordinator batches flush requests together to improve performance if + // flushes are being requested very frequently. + err := d.flushCoordinator.Flush() + if err != nil { + return fmt.Errorf("failed to flush: %w", err) + } + return nil +} + +// actually flushes the internal DB +func (d *DiskTable) flushInternal() error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf("cannot process Flush() request, DB is in panicked state due to error: %w", err) + } + + if d.metrics != nil { + start := d.clock() + defer func() { + end := d.clock() + delta := end.Sub(start) + d.metrics.ReportFlushOperation(d.name, delta) + }() + } + + flushReq := &controlLoopFlushRequest{ + responseChan: make(chan struct{}, 1), + } + err := d.controlLoop.enqueue(flushReq) + if err != nil { + return fmt.Errorf("failed to send flush request: %w", err) + } + + _, err = util.Await(d.errorMonitor, flushReq.responseChan) + if err != nil { + return fmt.Errorf("failed to flush: %w", err) + } + + return nil +} + +func (d *DiskTable) SetWriteCacheSize(size uint64) error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf( + "cannot process SetWriteCacheSize() request, DB is in panicked state due to error: %w", err) + } + + // this implementation does not provide a cache, if a cache is needed then it must be provided by a wrapper + return nil +} + +func (d *DiskTable) SetReadCacheSize(size uint64) error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf( + "cannot process SetReadCacheSize() request, DB is in panicked state due to error: %w", err) + } + + // this implementation does not provide a cache, if a cache is needed then it must be provided by a wrapper + return nil +} + +func (d *DiskTable) RunGC() error { + if ok, err := d.errorMonitor.IsOk(); !ok { + return fmt.Errorf( + "cannot process RunGC() request, DB is in panicked 
state due to error: %w", err) + } + + request := &controlLoopGCRequest{ + completionChan: make(chan struct{}, 1), + } + + err := d.controlLoop.enqueue(request) + if err != nil { + return fmt.Errorf("failed to send GC request: %w", err) + } + + _, err = util.Await(d.errorMonitor, request.completionChan) + if err != nil { + return fmt.Errorf("failed to await GC completion: %w", err) + } + + return nil +} + +// writeKeysToKeymap flushes all keys to the keymap. Once they are flushed, it also removes the keys from the +// unflushedDataCache. +func (d *DiskTable) writeKeysToKeymap(keys []*types.ScopedKey) error { + if len(keys) == 0 { + // Nothing to flush. + return nil + } + + if d.metrics != nil { + start := d.clock() + defer func() { + end := d.clock() + delta := end.Sub(start) + d.metrics.ReportKeymapFlushLatency(d.name, delta) + }() + } + + err := d.keymap.Put(keys) + if err != nil { + return fmt.Errorf("failed to flush keys: %w", err) + } + + // Keys are now durably written to both the segment and the keymap. It is therefore safe to remove them from the + // unflushed data cache. 
+ for _, ka := range keys { + d.unflushedDataCache.Delete(util.UnsafeBytesToString(ka.Key)) + } + + return nil +} diff --git a/sei-db/db_engine/litt/disktable/disk_table_flush_loop.go b/sei-db/db_engine/litt/disktable/disk_table_flush_loop.go new file mode 100644 index 0000000000..e036d73e54 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/disk_table_flush_loop.go @@ -0,0 +1,3 @@ +//go:build littdb_wip + +package disktable diff --git a/sei-db/db_engine/litt/disktable/disk_table_test.go b/sei-db/db_engine/litt/disktable/disk_table_test.go new file mode 100644 index 0000000000..43dc991c7d --- /dev/null +++ b/sei-db/db_engine/litt/disktable/disk_table_test.go @@ -0,0 +1,2307 @@ +//go:build littdb_wip + +package disktable + +import ( + "fmt" + "os" + "path" + "path/filepath" + "strings" + "sync/atomic" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +// This file contains tests that are specific to the disk table implementation. Other more general test scenarios +// are defined in litt/test/table_test.go. + +type tableBuilder struct { + name string + builder func(clock func() time.Time, name string, paths []string) (litt.ManagedTable, error) +} + +// This test executes against different table implementations. This is useful for distinguishing between bugs that +// are present in an implementation, and bugs that are present in the test scenario itself. 
+var tableBuilders = []*tableBuilder{ + { + name: "MemKeyDiskTableSingleShard", + builder: buildMemKeyDiskTableSingleShard, + }, + { + name: "MemKeyDiskTableMultiShard", + builder: buildMemKeyDiskTableMultiShard, + }, + { + name: "LevelDBKeyDiskTableSingleShard", + builder: buildLevelDBKeyDiskTableSingleShard, + }, + { + name: "LevelDBKeyDiskTableMultiShard", + builder: buildLevelDBKeyDiskTableMultiShard, + }, +} + +func setupKeymapTypeFile(keymapPath string, keymapType keymap.KeymapType) (*keymap.KeymapTypeFile, error) { + exists, err := keymap.KeymapFileExists(keymapPath) + if err != nil { + return nil, fmt.Errorf("failed to check if keymap file exists: %w", err) + } + var keymapTypeFile *keymap.KeymapTypeFile + if exists { + keymapTypeFile, err = keymap.LoadKeymapTypeFile(keymapPath) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + } else { + err = os.MkdirAll(keymapPath, os.ModePerm) + if err != nil { + return nil, fmt.Errorf("failed to create keymap directory: %w", err) + } + keymapTypeFile = keymap.NewKeymapTypeFile(keymapPath, keymapType) + err = keymapTypeFile.Write() + if err != nil { + return nil, fmt.Errorf("failed to create keymap type file: %w", err) + } + } + + return keymapTypeFile, nil +} + +func buildMemKeyDiskTableSingleShard( + clock func() time.Time, + name string, + paths []string) (litt.ManagedTable, error) { + + logger := test.GetLogger() + + keymapPath := filepath.Join(paths[0], keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.MemKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewMemKeymap(logger, "", true) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + roots := make([]string, 0, len(paths)) + roots = append(roots, paths...) + + config, err := litt.DefaultConfig(paths...) 
+ if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + + config.Clock = clock + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.GCPeriod = time.Millisecond + config.Fsync = false + config.SaltShaker = random.NewTestRandom().Rand + config.Logger = logger + + table, err := NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + roots, + true, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func buildMemKeyDiskTableMultiShard( + clock func() time.Time, + name string, + paths []string) (litt.ManagedTable, error) { + + logger := test.GetLogger() + + keymapPath := filepath.Join(paths[0], keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.MemKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewMemKeymap(logger, "", true) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + config, err := litt.DefaultConfig(paths...) 
+ if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + + config.Clock = clock + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.GCPeriod = time.Millisecond + config.Fsync = false + config.SaltShaker = random.NewTestRandom().Rand + config.ShardingFactor = 4 + config.Logger = logger + + table, err := NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + paths, + true, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func buildLevelDBKeyDiskTableSingleShard( + clock func() time.Time, + name string, + paths []string) (litt.ManagedTable, error) { + + logger := test.GetLogger() + keymapPath := filepath.Join(paths[0], keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.UnsafeLevelDBKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewUnsafeLevelDBKeymap(logger, keymapPath, false) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + config, err := litt.DefaultConfig(paths...) 
+ if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + + config.Clock = clock + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.GCPeriod = time.Millisecond + config.Fsync = false + config.SaltShaker = random.NewTestRandom().Rand + config.Logger = logger + + table, err := NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + paths, + false, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func buildLevelDBKeyDiskTableMultiShard( + clock func() time.Time, + name string, + paths []string) (litt.ManagedTable, error) { + + logger := test.GetLogger() + keymapPath := filepath.Join(paths[0], name, keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.UnsafeLevelDBKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewUnsafeLevelDBKeymap(logger, keymapPath, true) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + config, err := litt.DefaultConfig(paths...) 
+ if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + + config.Clock = clock + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.GCPeriod = time.Millisecond + config.Fsync = false + config.SaltShaker = random.NewTestRandom().Rand + config.ShardingFactor = 4 + config.Logger = logger + + table, err := NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + paths, + false, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func restartTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + iterations := 1000 + restartIteration := iterations/2 + int(rand.Int64Range(-10, 10)) + + for i := 0; i < iterations; i++ { + + // Somewhere in the middle of the test, restart the table. + if i == restartIteration { + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Do a full scan of the table to verify that all expected values are still present. + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok, "key %s not found", expectedKey) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. 
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestRestart(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + restartTest(t, tb) + }) + } +} + +// This test deletes a random file from a middle segment. This is considered unrecoverable corruption, and should +// cause the table to fail to restart. +func middleFileMissingTest(t *testing.T, tableBuilder *tableBuilder, typeToDelete string) { + rand := random.NewTestRandom() + + logger := test.GetLogger() + + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + errorMonitor := table.(*DiskTable).errorMonitor + + // Delete a file in the middle of the sequence of segments. + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + lowestSegmentIndex, highestSegmentIndex, _, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + middleIndex := lowestSegmentIndex + (highestSegmentIndex-lowestSegmentIndex)/2 + + filePath := "" + if typeToDelete == "key" { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, middleIndex, segment.KeyFileExtension) + } else if typeToDelete == "value" { + shardingFactor := table.(*DiskTable).metadata.GetShardingFactor() + shard := rand.Uint32Range(0, shardingFactor) + filePath = fmt.Sprintf("%s/%s/segments/%d-%d%s", + directory, tableName, middleIndex, shard, segment.ValuesFileExtension) + } else { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, middleIndex, segment.MetadataFileExtension) + } + + exists, err := util.Exists(filePath) + require.NoError(t, err) + require.True(t, exists) + + err = 
+ os.Remove(filePath)
+ require.NoError(t, err)
+
+ // Record the contents of the segments directory; the failed restart below must not add or remove any files.
+ files, err := os.ReadDir(fmt.Sprintf("%s/%s/segments", directory, tableName))
+ require.NoError(t, err)
+
+ // Restart the table. This should fail.
+ table, err = tableBuilder.builder(time.Now, tableName, []string{directory})
+ require.Error(t, err)
+ require.Nil(t, table)
+
+ // Ensure that no files were added or removed from the segments directory.
+ filesAfterRestart, err := os.ReadDir(fmt.Sprintf("%s/%s/segments", directory, tableName))
+ require.NoError(t, err)
+ require.Equal(t, len(files), len(filesAfterRestart))
+ filesSet := make(map[string]struct{})
+ for _, file := range files {
+ filesSet[file.Name()] = struct{}{}
+ }
+ for _, file := range filesAfterRestart {
+ require.Contains(t, filesSet, file.Name())
+ }
+}
+
+func TestMiddleFileMissing(t *testing.T) {
+ t.Parallel()
+ for _, tb := range tableBuilders {
+ t.Run("key-"+tb.name, func(t *testing.T) {
+ middleFileMissingTest(t, tb, "key")
+ })
+ t.Run("value-"+tb.name, func(t *testing.T) {
+ middleFileMissingTest(t, tb, "value")
+ })
+ t.Run("metadata-"+tb.name, func(t *testing.T) {
+ middleFileMissingTest(t, tb, "metadata")
+ })
+ }
+}
+
+// This test deletes a random file from the first segment. This is considered recoverable, since it can happen
+// if the table crashes during garbage collection.
+func initialFileMissingTest(t *testing.T, tableBuilder *tableBuilder, typeToDelete string) {
+ rand := random.NewTestRandom()
+
+ logger := test.GetLogger()
+ directory := t.TempDir()
+
+ tableName := rand.String(8)
+ table, err := tableBuilder.builder(time.Now, tableName, []string{directory})
+ if err != nil {
+ t.Fatalf("failed to create table: %v", err)
+ }
+
+ require.Equal(t, tableName, table.Name())
+
+ expectedValues := make(map[string][]byte)
+
+ // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + lowestSegmentIndex, _, segments, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + // All keys in the initial segment are expected to be missing after the restart. + missingKeys := make(map[string]struct{}) + segmentKeys, err := segments[lowestSegmentIndex].GetKeys() + require.NoError(t, err) + for _, key := range segmentKeys { + missingKeys[string(key.Key)] = struct{}{} + } + + // Delete a file in the initial segment. 
+ filePath := "" + if typeToDelete == "key" { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, lowestSegmentIndex, segment.KeyFileExtension) + } else if typeToDelete == "value" { + shardingFactor := table.(*DiskTable).metadata.GetShardingFactor() + shard := rand.Uint32Range(0, shardingFactor) + filePath = fmt.Sprintf( + "%s/%s/segments/%d-%d%s", + directory, tableName, lowestSegmentIndex, shard, segment.ValuesFileExtension) + } else { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, lowestSegmentIndex, segment.MetadataFileExtension) + } + exists, err := util.Exists(filePath) + require.NoError(t, err) + require.True(t, exists) + + err = os.Remove(filePath) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Check the data in the table. + for expectedKey, expectedValue := range expectedValues { + if _, expectedToBeMissing := missingKeys[expectedKey]; expectedToBeMissing { + _, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.False(t, ok) + } else { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + + // Remove the missing values from the expected values map. Simplifies following checks. + for key := range missingKeys { + delete(expectedValues, key) + } + + // Make additional modifications to the table to ensure that it is still working. + for i := 0; i < iterations; i++ { + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. 
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ = table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestInitialFileMissing(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run("key-"+tb.name, func(t *testing.T) { + initialFileMissingTest(t, tb, "key") + }) + t.Run("value-"+tb.name, func(t *testing.T) { + initialFileMissingTest(t, tb, "value") + }) + t.Run("metadata-"+tb.name, func(t *testing.T) { + initialFileMissingTest(t, tb, "metadata") + }) + } +} + +// This test deletes a random file from the last segment. This can happen if the table crashes prior to the +// last segment being flushed. +func lastFileMissingTest(t *testing.T, tableBuilder *tableBuilder, typeToDelete string) { + rand := random.NewTestRandom() + + logger := test.GetLogger() + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + _, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + // All keys in the final segment are expected to be missing after the restart. + missingKeys := make(map[string]struct{}) + segmentKeys, err := segments[highestSegmentIndex].GetKeys() + require.NoError(t, err) + for _, key := range segmentKeys { + missingKeys[string(key.Key)] = struct{}{} + } + + // Delete a file in the final segment. 
+ filePath := "" + if typeToDelete == "key" { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.KeyFileExtension) + } else if typeToDelete == "value" { + shardingFactor := table.(*DiskTable).metadata.GetShardingFactor() + shard := rand.Uint32Range(0, shardingFactor) + filePath = fmt.Sprintf("%s/%s/segments/%d-%d%s", + directory, tableName, highestSegmentIndex, shard, segment.ValuesFileExtension) + } else { + filePath = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.MetadataFileExtension) + } + exists, err := util.Exists(filePath) + require.NoError(t, err) + require.True(t, exists) + + err = os.Remove(filePath) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Manually remove the keys from the last segment from the keymap. If this happens in reality (as opposed + // to the files being artificially deleted in this test), the keymap will not hold any value that has not + // yet been durably flushed to disk. + for key := range missingKeys { + err = table.(*DiskTable).keymap.Delete([]*types.ScopedKey{{Key: []byte(key)}}) + require.NoError(t, err) + } + + // Check the data in the table. + for expectedKey, expectedValue := range expectedValues { + if _, expectedToBeMissing := missingKeys[expectedKey]; expectedToBeMissing { + _, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.False(t, ok) + } else { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + + // Remove the missing values from the expected values map. Simplifies following checks. + for key := range missingKeys { + delete(expectedValues, key) + } + + // Make additional modifications to the table to ensure that it is still working. 
+ for i := 0; i < iterations; i++ { + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. 
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ = table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestLastFileMissing(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run("key-"+tb.name, func(t *testing.T) { + lastFileMissingTest(t, tb, "key") + }) + t.Run("value-"+tb.name, func(t *testing.T) { + lastFileMissingTest(t, tb, "value") + }) + t.Run("metadata-"+tb.name, func(t *testing.T) { + lastFileMissingTest(t, tb, "metadata") + }) + } +} + +// This test simulates the scenario where a key file is truncated. +func truncatedKeyFileTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + logger := test.GetLogger() + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + err = table.Flush() + require.NoError(t, err) + + // If the last segment is empty, write a final value to make it non-empty. This test isn't interesting + // if there is no data to be truncated. + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + _, highestSegmentIndex, _, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + keyFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.KeyFileExtension) + keyFileBytes, err := os.ReadFile(keyFileName) + require.NoError(t, err) + + if len(keyFileBytes) == 0 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 64) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + _, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + // Truncate the last key file. 
+ keysInLastFile, err := segments[highestSegmentIndex].GetKeys() + require.NoError(t, err) + + keyFileName = fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.KeyFileExtension) + keyFileBytes, err = os.ReadFile(keyFileName) + require.NoError(t, err) + + bytesRemaining := int32(0) + if len(keyFileBytes) > 0 { + bytesRemaining = rand.Int32Range(1, int32(len(keyFileBytes))) + } + + keyFileBytes = keyFileBytes[:bytesRemaining] + err = os.WriteFile(keyFileName, keyFileBytes, 0644) + require.NoError(t, err) + + keysInLastFileAfterTruncate, err := segments[highestSegmentIndex].GetKeys() + require.NoError(t, err) + + missingKeyCount := len(keysInLastFile) - len(keysInLastFileAfterTruncate) + require.True(t, missingKeyCount > 0) + remainingKeyCount := len(keysInLastFileAfterTruncate) + + missingKeys := make(map[string]struct{}) + for i := 0; i < missingKeyCount; i++ { + missingKeys[string(keysInLastFile[remainingKeyCount+i].Key)] = struct{}{} + } + + // Mark the last segment as non-sealed. This will be the case if the file is truncated. + metadataFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.MetadataFileExtension) + metadataBytes, err := os.ReadFile(metadataFileName) + require.NoError(t, err) + // The last byte of the metadata file is the sealed flag. + metadataBytes[len(metadataBytes)-1] = 0 + err = os.WriteFile(metadataFileName, metadataBytes, 0644) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Manually remove the keys from the last segment from the keymap. If this happens in reality (as opposed + // to the files being artificially deleted in this test), the keymap will not hold any value that has not + // yet been durably flushed to disk. 
+ for key := range missingKeys { + err = table.(*DiskTable).keymap.Delete([]*types.ScopedKey{{Key: []byte(key)}}) + require.NoError(t, err) + } + + // Check the data in the table. + for expectedKey, expectedValue := range expectedValues { + if _, expectedToBeMissing := missingKeys[expectedKey]; expectedToBeMissing { + _, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.False(t, ok) + } else { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + + // Remove the missing values from the expected values map. Simplifies following checks. + for key := range missingKeys { + delete(expectedValues, key) + } + + // Make additional modifications to the table to ensure that it is still working. + for i := 0; i < iterations; i++ { + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. 
+ if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ = table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestTruncatedKeyFile(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + truncatedKeyFileTest(t, tb) + }) + } +} + +// This test simulates the scenario where a value file is truncated. +func truncatedValueFileTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + logger := test.GetLogger() + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + err = table.Flush() + require.NoError(t, err) + + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + _, highestSegmentIndex, _, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + keyFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.KeyFileExtension) + keyFileBytes, err := os.ReadFile(keyFileName) + require.NoError(t, err) + + if len(keyFileBytes) == 0 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 64) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + _, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + // Truncate a random shard of the last value file. 
+ // Find a shard that has at least one key in the last segment (truncating an empty file is boring)
+ keysInLastFile, err := segments[highestSegmentIndex].GetKeys()
+ require.NoError(t, err)
+ diskTable := table.(*DiskTable)
+ nonEmptyShards := make(map[uint32]struct{})
+ for _, scopedKey := range keysInLastFile {
+ keyShard := diskTable.controlLoop.segments[highestSegmentIndex].GetShard(scopedKey.Key)
+ nonEmptyShards[keyShard] = struct{}{}
+ }
+ var shard uint32
+ for shard = range nonEmptyShards {
+ // map iteration order is unspecified, so this picks an arbitrary non-empty shard
+ break
+ }
+
+ valueFileName := fmt.Sprintf("%s/%s/segments/%d-%d%s",
+ directory, tableName, highestSegmentIndex, shard, segment.ValuesFileExtension)
+ valueFileBytes, err := os.ReadFile(valueFileName)
+ require.NoError(t, err)
+
+ bytesRemaining := int32(0)
+ if len(valueFileBytes) > 0 {
+ bytesRemaining = rand.Int32Range(1, int32(len(valueFileBytes)))
+ }
+
+ valueFileBytes = valueFileBytes[:bytesRemaining]
+ err = os.WriteFile(valueFileName, valueFileBytes, 0644)
+ require.NoError(t, err)
+
+ // Figure out which keys are expected to be missing
+ missingKeys := make(map[string]struct{})
+ for _, key := range keysInLastFile {
+ keyShard := diskTable.controlLoop.segments[diskTable.controlLoop.highestSegmentIndex].GetShard(key.Key)
+ if keyShard != shard {
+ // key does not belong to the shard that was truncated
+ continue
+ }
+
+ offset := key.Address.Offset()
+ valueSize := len(expectedValues[string(key.Key)])
+ // If there are not at least this many bytes remaining in the value file, the value is missing.
+ requiredLength := offset + uint32(valueSize) + 4
+ if requiredLength > uint32(len(valueFileBytes)) {
+ missingKeys[string(key.Key)] = struct{}{}
+ }
+ }
+
+ // Mark the last segment as non-sealed. This will be the case if the file is truncated. 
+ metadataFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.MetadataFileExtension) + metadataBytes, err := os.ReadFile(metadataFileName) + require.NoError(t, err) + // The last byte of the metadata file is the sealed flag. + metadataBytes[len(metadataBytes)-1] = 0 + err = os.WriteFile(metadataFileName, metadataBytes, 0644) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Manually remove the keys from the last segment from the keymap. If this happens in reality (as opposed + // to the files being artificially deleted in this test), the keymap will not hold any value that has not + // yet been durably flushed to disk. + for key := range missingKeys { + err = table.(*DiskTable).keymap.Delete([]*types.ScopedKey{{Key: []byte(key)}}) + require.NoError(t, err) + } + + // Check the data in the table. + for expectedKey, expectedValue := range expectedValues { + if _, expectedToBeMissing := missingKeys[expectedKey]; expectedToBeMissing { + _, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.False(t, ok) + } else { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + + // Remove the missing values from the expected values map. Simplifies following checks. + for key := range missingKeys { + delete(expectedValues, key) + } + + // Make additional modifications to the table to ensure that it is still working. + for i := 0; i < iterations; i++ { + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. 
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ = table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestTruncatedValueFile(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + truncatedValueFileTest(t, tb) + }) + } +} + +// This test simulates the scenario where keys have not been flushed to the key store. The important thing +// is to ensure that garbage collection doesn't explode when it encounters keys that are not in the key store. +func unflushedKeysTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + logger := test.GetLogger() + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + // Fill the table with random data. 
+ iterations := 100 + for i := 0; i < iterations; i++ { + batchSize := rand.Int32Range(1, 10) + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + } + + err = table.Flush() + require.NoError(t, err) + + // If the last segment is empty, write a final value to make it non-empty. This test isn't interesting + // if there is no data left unflushed. + segmentPath, err := segment.NewSegmentPath(directory, "", tableName) + require.NoError(t, err) + _, highestSegmentIndex, _, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + keyFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.KeyFileExtension) + keyFileBytes, err := os.ReadFile(keyFileName) + require.NoError(t, err) + if len(keyFileBytes) == 0 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 64) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + _, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + table.(*DiskTable).errorMonitor, + []*segment.SegmentPath{segmentPath}, + false, + time.Now(), + true, + false) + require.NoError(t, err) + + // Identify keys in the last file. 
These will be removed from the keymap to simulate keys that have not + // been flushed to the key store. + keysInLastFile, err := segments[highestSegmentIndex].GetKeys() + require.NoError(t, err) + + missingKeys := make(map[string]struct{}) + for _, key := range keysInLastFile { + missingKeys[string(key.Key)] = struct{}{} + } + + // Mark the last segment as non-sealed. This will be the case if the file is truncated. + metadataFileName := fmt.Sprintf("%s/%s/segments/%d%s", + directory, tableName, highestSegmentIndex, segment.MetadataFileExtension) + metadataBytes, err := os.ReadFile(metadataFileName) + require.NoError(t, err) + // The last byte of the metadata file is the sealed flag. + metadataBytes[len(metadataBytes)-1] = 0 + err = os.WriteFile(metadataFileName, metadataBytes, 0644) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Manually remove the keys from the last segment from the keymap. If this happens in reality (as opposed + // to the files being artificially deleted in this test), the keymap will not hold any value that has not + // yet been durably flushed to disk. + for key := range missingKeys { + err = table.(*DiskTable).keymap.Delete([]*types.ScopedKey{{Key: []byte(key)}}) + require.NoError(t, err) + } + + // Check the data in the table. + for expectedKey, expectedValue := range expectedValues { + if _, expectedToBeMissing := missingKeys[expectedKey]; expectedToBeMissing { + _, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.False(t, ok) + } else { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + + // Remove the missing values from the expected values map. Simplifies following checks. 
+ for key := range missingKeys { + delete(expectedValues, key) + } + + // Make additional modifications to the table to ensure that it is still working. + for i := 0; i < iterations; i++ { + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + // Enable a TTL for the table. 
The goal is to force the keys that were removed from the keymap artificially to + // become eligible for garbage collection. + err = table.SetTTL(1 * time.Millisecond) + require.NoError(t, err) + + // Sleep for a short time to allow the TTL to expire, and to give the garbage collector a chance to + // do bad things if it is going to. Nothing bad should happen if the GC is implemented correctly. + time.Sleep(50 * time.Millisecond) + + ok, _ = table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestUnflushedKeys(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + unflushedKeysTest(t, tb) + }) + } +} + +func metadataPreservedOnRestartTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + require.Equal(t, tableName, table.Name()) + + ttl := time.Duration(rand.Int63n(1000)) * time.Millisecond + err = table.SetTTL(ttl) + require.NoError(t, err) + shardingFactor := rand.Uint32Range(1, 100) + err = table.SetShardingFactor(shardingFactor) + require.NoError(t, err) + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Check the table metadata. 
+ actualTTL := (table.(*DiskTable)).metadata.GetTTL() + require.Equal(t, ttl, actualTTL) + + actualShardingFactor := (table.(*DiskTable)).metadata.GetShardingFactor() + require.Equal(t, shardingFactor, actualShardingFactor) + + err = table.Destroy() + require.NoError(t, err) +} + +func TestMetadataPreservedOnRestart(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + metadataPreservedOnRestartTest(t, tb) + }) + } +} + +func orphanedMetadataTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + require.Equal(t, tableName, table.Name()) + + ttl := time.Duration(rand.Int63n(1000)) * time.Millisecond + err = table.SetTTL(ttl) + require.NoError(t, err) + shardingFactor := rand.Uint32Range(1, 100) + err = table.SetShardingFactor(shardingFactor) + require.NoError(t, err) + + // Stop the table + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + // Simulate an orphaned metadata file. + orphanedMetadataFileName := fmt.Sprintf("%s/%s/table.metadata.swap", directory, tableName) + orphanedFileBytes := rand.PrintableVariableBytes(1, 1024) + err = os.WriteFile(orphanedMetadataFileName, orphanedFileBytes, 0644) + require.NoError(t, err) + + // Restart the table. + table, err = tableBuilder.builder(time.Now, tableName, []string{directory}) + require.NoError(t, err) + + // Check the table metadata. + actualTTL := (table.(*DiskTable)).metadata.GetTTL() + require.Equal(t, ttl, actualTTL) + + actualShardingFactor := (table.(*DiskTable)).metadata.GetShardingFactor() + require.Equal(t, shardingFactor, actualShardingFactor) + + // The swap file we created should not be present anymore. 
+ exists, err := util.Exists(orphanedMetadataFileName) + require.NoError(t, err) + require.False(t, exists) + + err = table.Destroy() + require.NoError(t, err) +} + +func TestOrphanedMetadata(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + orphanedMetadataTest(t, tb) + }) + } +} + +func restartWithMultipleStorageDirectoriesTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directoryCount := rand.Uint32Range(5, 10) + + directory := t.TempDir() + directories := make([]string, 0, directoryCount) + for i := uint32(0); i < directoryCount; i++ { + directories = append(directories, path.Join(directory, fmt.Sprintf("dir%d", i))) + } + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, directories) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + iterations := 1000 + restartIteration := iterations/2 + int(rand.Int64Range(-10, 10)) + + for i := 0; i < iterations; i++ { + + // Somewhere in the middle of the test, restart the table. + if i == restartIteration { + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + // Shuffle around the segment files. This should not cause problems. + files := make([]string, 0) + for _, dir := range directories { + segmentDir := path.Join(dir, tableName, "segments") + + entries, err := os.ReadDir(segmentDir) + require.NoError(t, err) + for _, entry := range entries { + files = append(files, path.Join(dir, tableName, "segments", entry.Name())) + } + } + for _, file := range files { + destination := path.Join( + directories[rand.Uint32Range(0, uint32(len(directories)))], + tableName, + "segments", + path.Base(file)) + err = os.Rename(file, destination) + require.NoError(t, err) + } + + // Shuffle the table metadata location. 
This should not cause problems. + metadataDir := path.Join(directories[0], tableName) + mPath := path.Join(metadataDir, TableMetadataFileName) + newMetadataDir := path.Join(directories[rand.Uint32Range(1, uint32(len(directories)))], tableName) + newMPath := path.Join(newMetadataDir, TableMetadataFileName) + err = os.MkdirAll(newMetadataDir, 0755) + require.NoError(t, err) + err = os.Rename(mPath, newMPath) + require.NoError(t, err) + + table, err = tableBuilder.builder(time.Now, tableName, directories) + require.NoError(t, err) + + // Change the sharding factor. This should not cause problems. + shardingFactor := rand.Uint32Range(1, 10) + err = table.SetShardingFactor(shardingFactor) + require.NoError(t, err) + + // Do a full scan of the table to verify that all expected values are still present. + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. 
For tables that do garbage collection, the garbage
+ // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give
+ // the garbage collector a chance to run.
+ if rand.BoolWithProbability(0.01) {
+ time.Sleep(5 * time.Millisecond)
+ }
+
+ // Once in a while, scan the table and verify that all expected values are present.
+ // Don't do this every time for the sake of test runtime.
+ if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ {
+
+ for expectedKey, expectedValue := range expectedValues {
+ value, ok, err := table.Get([]byte(expectedKey))
+ require.NoError(t, err)
+ require.True(t, ok)
+ require.Equal(t, expectedValue, value)
+ }
+
+ // Try fetching a value that isn't in the table.
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64))
+ require.NoError(t, err)
+ require.False(t, ok)
+ }
+ }
+
+ ok, _ := table.(*DiskTable).errorMonitor.IsOk()
+ require.True(t, ok)
+ err = table.Destroy()
+ require.NoError(t, err)
+
+ // ensure that the test directories are empty
+ for _, dir := range directories {
+ entries, err := os.ReadDir(dir)
+ require.NoError(t, err)
+ require.Empty(t, entries)
+ }
+}
+
+func TestRestartWithMultipleStorageDirectories(t *testing.T) {
+ t.Parallel()
+ for _, tb := range tableBuilders {
+ t.Run(tb.name, func(t *testing.T) {
+ restartWithMultipleStorageDirectoriesTest(t, tb)
+ })
+ }
+}
+
+// checkShardsInSegment checks the number of shards in a particular segment and compares it to the expected
+// number of shards in the segment.
+func checkShardsInSegment(
+ t *testing.T,
+ roots []string,
+ segmentIndex uint32,
+ expectedShardCount uint32) {
+
+ // For each shard, there should be exactly one value file named <segment index>-<shard index>.values
+ expectedValueFiles := make(map[string]struct{})
+ for i := uint32(0); i < expectedShardCount; i++ {
+ expectedValueFiles[fmt.Sprintf("%d-%d.values", segmentIndex, i)] = struct{}{}
+ }
+
+ discoveredShardFiles := make(map[string]struct{})
+ for _, root := range roots {
+ err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
+ if err != nil {
+ return err
+ }
+ fileName := filepath.Base(path)
+ if _, ok := expectedValueFiles[fileName]; ok {
+ discoveredShardFiles[fileName] = struct{}{}
+ }
+
+ return nil
+ })
+ require.NoError(t, err)
+ }
+
+ require.Equal(t, expectedValueFiles, discoveredShardFiles)
+}
+
+// checkShardsInSegments verifies the shard count of every segment listed in expectedShardCounts.
+func checkShardsInSegments(
+ t *testing.T,
+ roots []string,
+ expectedShardCounts map[uint32]uint32) {
+
+ for segmentIndex, expectedShardCount := range expectedShardCounts {
+ checkShardsInSegment(t, roots, segmentIndex, expectedShardCount)
+ }
+}
+
+// getLatestSegmentIndex returns the index of the latest segment in the table.
+func getLatestSegmentIndex(table litt.Table) uint32 {
+ return table.(*DiskTable).controlLoop.threadsafeHighestSegmentIndex.Load()
+}
+
+func changingShardingFactorTest(t *testing.T, tableBuilder *tableBuilder) {
+ rand := random.NewTestRandom()
+
+ directory := t.TempDir()
+ rootCount := rand.Uint32Range(1, 5)
+ roots := make([]string, 0, rootCount)
+ for i := uint32(0); i < rootCount; i++ {
+ roots = append(roots, path.Join(directory, fmt.Sprintf("root%d", i)))
+ }
+
+ tableName := rand.String(8)
+ table, err := tableBuilder.builder(time.Now, tableName, roots)
+ if err != nil {
+ t.Fatalf("failed to create table: %v", err)
+ }
+
+ require.Equal(t, tableName, table.Name())
+
+ // Contains the expected number of shards in various segments.
We won't check all segments, just the segments + // immediately before and immediately after a sharding factor change. + expectedShardCounts := make(map[uint32]uint32) + + // Before data is written, change the sharding factor to a random value. + expectedShardCounts[getLatestSegmentIndex(table)] = table.(*DiskTable).metadata.GetShardingFactor() + shardingFactor := rand.Uint32Range(2, 10) + err = table.SetShardingFactor(shardingFactor) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err) + expectedShardCounts[getLatestSegmentIndex(table)] = shardingFactor + + expectedValues := make(map[string][]byte) + + iterations := 1000 + restartIteration := iterations/2 + int(rand.Int64Range(-10, 10)) + + for i := 0; i < iterations; i++ { + + // Somewhere in the middle of the test, restart the table. + if i == restartIteration { + expectedShardCounts[getLatestSegmentIndex(table)] = shardingFactor + + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + err = table.Close() + require.NoError(t, err) + + table, err = tableBuilder.builder(time.Now, tableName, roots) + require.NoError(t, err) + + expectedShardCounts[getLatestSegmentIndex(table)] = shardingFactor + + // Do a full scan of the table to verify that all expected values are still present. + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok, "key %s not found", expectedKey) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, change the sharding factor to a random value. + if rand.BoolWithProbability(0.01) { + expectedShardCounts[getLatestSegmentIndex(table)] = shardingFactor + shardingFactor = rand.Uint32Range(1, 10) + err = table.SetShardingFactor(shardingFactor) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err) + expectedShardCounts[getLatestSegmentIndex(table)] = shardingFactor + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. 
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + ok, _ := table.(*DiskTable).errorMonitor.IsOk() + require.True(t, ok) + + err = table.Close() + require.NoError(t, err) + + checkShardsInSegments(t, roots, expectedShardCounts) +} + +func TestChangingShardingFactor(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + changingShardingFactorTest(t, tb) + }) + } +} + +// verifies that the size reported by the table matches the actual size of the table on disk +func tableSizeTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + startTime := rand.Time() + + var fakeTime atomic.Pointer[time.Time] + fakeTime.Store(&startTime) + + clock := func() time.Time { + return *fakeTime.Load() + } + + tableName := rand.String(8) + table, err := tableBuilder.builder(clock, tableName, []string{directory}) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + ttlSeconds := rand.Int32Range(20, 30) + ttl := time.Duration(ttlSeconds) * time.Second + err = table.SetTTL(ttl) + require.NoError(t, err) + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + creationTimes := make(map[string]time.Time) + expiredValues := make(map[string][]byte) + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Advance the clock. + now := *fakeTime.Load() + secondsToAdvance := rand.Float64Range(0.0, 1.0) + newTime := now.Add(time.Duration(secondsToAdvance * float64(time.Second))) + fakeTime.Store(&newTime) + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + creationTimes[string(key)] = newTime + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + creationTimes[string(key)] = newTime + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, change the TTL. To avoid introducing test flakiness, only decrease the TTL + // (increasing the TTL risks causing the expected deletions as tracked by this test to get out + // of sync with what the table is doing) + if rand.BoolWithProbability(0.01) { + ttlSeconds -= 1 + ttl = time.Duration(ttlSeconds) * time.Second + err = table.SetTTL(ttl) + require.NoError(t, err) + } + + // Once in a while, pause for a brief moment to give the garbage collector a chance to do work in the + // background. This is not required for the test to pass. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + + // Force garbage collection to run in order to remove expired values from counts. + err = table.Flush() + require.NoError(t, err) + err = (table).(*DiskTable).RunGC() + require.NoError(t, err) + + // Remove expired values from the expected values. 
+ newlyExpiredKeys := make([]string, 0) + for key, creationTime := range creationTimes { + age := newTime.Sub(creationTime) + if age > ttl { + newlyExpiredKeys = append(newlyExpiredKeys, key) + } + } + for _, key := range newlyExpiredKeys { + expiredValues[key] = expectedValues[key] + delete(expectedValues, key) + delete(creationTimes, key) + } + + // Check the keys that are expected to still be in the table + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok, "key %s not found in table", expectedKey) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + + for key, expectedValue := range expiredValues { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err) + if !ok { + // value is not present in the table + continue + } + + // If the value has not yet been deleted, it should at least return the expected value. + require.Equal(t, expectedValue, value, "unexpected value for key %s", key) + + } + } + } + + err = table.Flush() + require.NoError(t, err) + err = table.RunGC() + require.NoError(t, err) + + // disable garbage collection + err = table.SetTTL(0) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err) + + // Write some data that won't expire, just to be sure that the table is not empty. 
+ for i := 0; i < 10; i++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } + + err = table.Flush() + require.NoError(t, err) + + reportedSize := table.Size() + reportedKeyCount := table.KeyCount() + + // The exact key count is hard to predict for the sake of this unit test, since GC is "lazy" and may not + // immediately remove all values that are legal to be removed. But at the very least, all unexpired + // values should be present, and the key count should not exceed the number of total inserted values. + require.GreaterOrEqual(t, reportedKeyCount, uint64(len(expectedValues))) + require.LessOrEqual(t, reportedKeyCount, uint64(len(expectedValues)+len(expiredValues))) + + err = table.Close() + require.NoError(t, err) + + // Walk the "directory" file tree and calculate the actual size of the table. + // There is some asynchrony in file deletion, so we retry a reasonable number of times. + test.AssertEventuallyTrue(t, func() bool { + actualSize := uint64(0) + + err = filepath.Walk(directory, func(path string, info os.FileInfo, err error) error { + if err != nil { + // files may be deleted in the middle of the walk + return nil + } + if info.IsDir() { + // directory sizes are not factored into the table size + return nil + } + if strings.Contains(path, "keymap") { + // table size does not currently include the keymap size + return nil + } + actualSize += uint64(info.Size()) + return nil + }) + require.NoError(t, err) + return actualSize == reportedSize + }, time.Second) + + // Restart the table. The size should be accurately reported. 
+ table, err = tableBuilder.builder(clock, tableName, []string{directory}) + require.NoError(t, err) + + newReportedSize := table.Size() + newReportedKeyCount := table.KeyCount() + + // New size should be greater than the old size, since GC is disabled and + // we will have started a new segment upon restart. + require.LessOrEqual(t, reportedSize, newReportedSize) + + // The number of keys should be the same as before. + require.Equal(t, reportedKeyCount, newReportedKeyCount) + + err = table.Close() + require.NoError(t, err) + + // Walk the "directory" file tree and calculate the actual size of the table. + // There is some asynchrony in file deletion, so we retry a reasonable number of times. + test.AssertEventuallyTrue(t, func() bool { + actualSize := uint64(0) + err = filepath.Walk(directory, func(path string, info os.FileInfo, err error) error { + if err != nil { + // files may be deleted in the middle of the walk + return nil + } + if info.IsDir() { + // directory sizes are not factored into the table size + return nil + } + if strings.Contains(path, "keymap") { + // table size does not currently include the keymap size + return nil + } + + actualSize += uint64(info.Size()) + return nil + }) + require.NoError(t, err) + + return actualSize == newReportedSize + }, time.Second) +} + +func TestTableSize(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + tableSizeTest(t, tb) + }) + } +} diff --git a/sei-db/db_engine/litt/disktable/flush_coordinator.go b/sei-db/db_engine/litt/disktable/flush_coordinator.go new file mode 100644 index 0000000000..edab229bf1 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/flush_coordinator.go @@ -0,0 +1,145 @@ +//go:build littdb_wip + +package disktable + +import ( + "fmt" + "time" + + "github.com/Layr-Labs/eigenda/common/structures" + "github.com/Layr-Labs/eigenda/litt/util" + "golang.org/x/time/rate" +) + +// Size of the request channel buffer. 
This should be large enough to handle bursts of flush requests without
+// blocking the caller, but not so large that it wastes memory.
+const requestChanBufferSize = 128
+
+// flushCoordinator makes very rapid flushes more efficient by coalescing multiple requested flushes into a
+// single flush. If the coordinator allows at most one flush per time period and multiple flushes are requested
+// within that period, only one flush is performed, at the end of the period. This does not change the semantics
+// of flush from the caller's perspective, only its performance/timing.
+type flushCoordinator struct {
+	// Used to manage the lifecycle of LittDB threading resources.
+	errorMonitor *util.ErrorMonitor
+
+	// The function that actually performs the flush on the underlying database.
+	internalFlush func() error
+
+	// Channel to send flush requests to the control loop.
+	requestChan chan any
+
+	// used to rate limit flushes
+	rateLimiter *rate.Limiter
+}
+
+// A request to flush the underlying database. When the flush is eventually performed, a response is sent on
+// the request's channel. The response is nil if the flush was successful, or an error if it failed.
+type flushCoordinatorRequest chan error
+
+// newFlushCoordinator creates a new flush coordinator.
+//
+// - internalFlush: the function that actually performs the flush on the underlying database
+// - flushPeriod: the minimum time period between flushes, if zero then no batching is performed
+func newFlushCoordinator(
+	errorMonitor *util.ErrorMonitor,
+	internalFlush func() error,
+	flushPeriod time.Duration,
+) *flushCoordinator {
+
+	fc := &flushCoordinator{
+		errorMonitor:  errorMonitor,
+		internalFlush: internalFlush,
+		requestChan:   make(chan any, requestChanBufferSize),
+	}
+
+	if flushPeriod > 0 {
+		fc.rateLimiter = rate.NewLimiter(rate.Every(flushPeriod), 1)
+		go fc.controlLoop()
+	}
+
+	return fc
+}
+
+// Flush flushes the underlying database. It may delay the underlying flush based on the configured flush period.
+func (c *flushCoordinator) Flush() error {
+	if c.rateLimiter == nil {
+		// No flush period is configured, so there is nothing to batch: short circuit and
+		// call the internal flush directly.
+		return c.internalFlush()
+	}
+
+	request := make(flushCoordinatorRequest, 1)
+
+	// send the request
+	err := util.Send(c.errorMonitor, c.requestChan, request)
+	if err != nil {
+		return fmt.Errorf("error sending flush coordinator request: %w", err)
+	}
+
+	// await the response
+	response, err := util.Await(c.errorMonitor, request)
+	if err != nil {
+		return fmt.Errorf("error awaiting flush coordinator response: %w", err)
+	}
+
+	if response != nil {
+		return fmt.Errorf("flush failed: %w", response)
+	}
+	return nil
+}
+
+// The control loop that manages flush timing.
+func (c *flushCoordinator) controlLoop() {
+	// Note: the request channel is deliberately left open. Closing it here would be a receiver-side
+	// close, and a caller racing into Flush() during shutdown could then panic on a send to a
+	// closed channel.
+
+	// requests that are waiting for a flush to be performed
+	waitingRequests := structures.NewQueue[flushCoordinatorRequest](1024)
+
+	// timer used to wait until the next flush can be performed
+	timer := time.NewTimer(0)
+	defer timer.Stop()
+	var timerActive bool
+
+	for {
+		if timerActive {
+			// There are pending flushes we want to handle, but we need to wait until the timer expires.
+			select {
+			case <-c.errorMonitor.ImmediateShutdownRequired():
+				return
+			case request := <-c.requestChan:
+				waitingRequests.Push(request.(flushCoordinatorRequest))
+			case <-timer.C:
+				// we can now perform a flush
+				err := c.internalFlush()
+				// send a response to each waiting caller
+				for request, ok := waitingRequests.TryPop(); ok; request, ok = waitingRequests.TryPop() {
+					request <- err
+				}
+
+				timerActive = false
+			}
+		} else {
+			// We don't have any pending flush requests, so we aren't waiting on the timer. If we get a new
+			// request, check whether the rate limiter allows it to be flushed immediately, and do so if possible.
+			select {
+			case <-c.errorMonitor.ImmediateShutdownRequired():
+				return
+			case request := <-c.requestChan:
+				if c.rateLimiter.Allow() {
+					// we can flush immediately, it's been long enough since the last flush
+					request.(flushCoordinatorRequest) <- c.internalFlush()
+				} else {
+					// we need to wait before flushing, add the request to the queue
+					waitingRequests.Push(request.(flushCoordinatorRequest))
+
+					timeUntilPermitted := c.rateLimiter.Reserve().Delay()
+					timer.Reset(timeUntilPermitted)
+					timerActive = true
+				}
+			}
+		}
+	}
+}
diff --git a/sei-db/db_engine/litt/disktable/flush_coordinator_test.go b/sei-db/db_engine/litt/disktable/flush_coordinator_test.go
new file mode 100644
index 0000000000..3769dab61d
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/flush_coordinator_test.go
@@ -0,0 +1,133 @@
+//go:build littdb_wip
+
+package disktable
+
+import (
+	"sync/atomic"
+	"testing"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/stretchr/testify/require"
+)
+
+// Note from the author (cody.littley): it's really tricky to validate rate limiting behavior without writing
+// tests that rely, to some extent, on timing. If these tests flake, let me know, and we can either loosen
+// the timing requirements or disable them.
+
+// Flush 1000 times in a second, but limit the actual flush rate to 10 times a second.
+func TestRapidFlushes(t *testing.T) {
+	// This test is inherently timing sensitive, don't parallelize it.
+
+	logger, err := common.NewLogger(common.DefaultLoggerConfig())
+	require.NoError(t, err)
+
+	errorMonitor := util.NewErrorMonitor(t.Context(), logger, nil)
+
+	flushCount := atomic.Uint64{}
+	flushFunction := func() error {
+		flushCount.Add(1)
+		return nil
+	}
+
+	desiredFlushPeriod := 100 * time.Millisecond
+	encounteredFlushPeriod := time.Millisecond
+
+	fc := newFlushCoordinator(errorMonitor, flushFunction, desiredFlushPeriod)
+
+	// Collect each Flush result on the main test goroutine; require must not be
+	// called from background goroutines.
+	completionChan := make(chan error)
+
+	// Send a bunch of rapid flush requests on background goroutines.
+	ticker := time.NewTicker(encounteredFlushPeriod)
+	defer ticker.Stop()
+	for i := 0; i < 1000; i++ {
+		<-ticker.C
+		go func() {
+			completionChan <- fc.Flush()
+		}()
+	}
+
+	// Wait for all flushes to unblock and complete.
+	timer := time.NewTimer(2 * time.Second)
+	for i := 0; i < 1000; i++ {
+		select {
+		case err := <-completionChan:
+			require.NoError(t, err)
+		case <-timer.C:
+			require.Fail(t, "Timed out waiting for flushes to complete")
+		}
+	}
+
+	// We should expect to see 11 flushes (one at t=0, then one per 100ms for the remaining second).
+	// But assert weaker bounds to avoid test flakiness.
+	lowerBound := 5
+	upperBound := 25
+	require.True(t, flushCount.Load() >= uint64(lowerBound),
+		"Expected at least %d flushes, got %d", lowerBound, flushCount.Load())
+	require.True(t, flushCount.Load() <= uint64(upperBound),
+		"Expected at most %d flushes, got %d", upperBound, flushCount.Load())
+
+	ok, _ := errorMonitor.IsOk()
+	require.True(t, ok)
+	errorMonitor.Shutdown()
+}
+
+// If we flush less often than the rate limit allows, a flush should never have to wait long.
+func TestInfrequentFlushes(t *testing.T) {
+	// This test is inherently timing sensitive, don't parallelize it.
+ + logger, err := common.NewLogger(common.DefaultLoggerConfig()) + require.NoError(t, err) + + errorMonitor := util.NewErrorMonitor(t.Context(), logger, nil) + + flushCount := atomic.Uint64{} + flushFunction := func() error { + flushCount.Add(1) + return nil + } + + desiredFlushPeriod := 100 * time.Millisecond + + fc := newFlushCoordinator(errorMonitor, flushFunction, desiredFlushPeriod) + + // The time to flush when unblocked is likely to be less than a millisecond, but only assert + // that it is less than this value to avoid test flakiness. + minimumFlushTime := desiredFlushPeriod / 2 + + // The first flush should be very fast, since we can't be in violation of the rate limit at t=0. + startTime := time.Now() + err = fc.Flush() + require.NoError(t, err) + duration := time.Since(startTime) + require.True(t, duration < minimumFlushTime, + "Expected first flush to take less than %v, took %v", minimumFlushTime, duration) + require.Equal(t, uint64(1), flushCount.Load()) + + // The second flush should be delayed. + startTime = time.Now() + err = fc.Flush() + require.NoError(t, err) + duration = time.Since(startTime) + require.True(t, duration >= minimumFlushTime, + "Expected second flush to take at least %v, took %v", minimumFlushTime, duration) + require.Equal(t, uint64(2), flushCount.Load()) + + // Wait for 2x the flush period. The next flush should be able to happen immediately. 
+ time.Sleep(2 * desiredFlushPeriod) + + startTime = time.Now() + err = fc.Flush() + require.NoError(t, err) + duration = time.Since(startTime) + require.True(t, duration < minimumFlushTime, + "Expected third flush to take less than %v, took %v", minimumFlushTime, duration) + require.Equal(t, uint64(3), flushCount.Load()) + + ok, _ := errorMonitor.IsOk() + require.True(t, ok) + errorMonitor.Shutdown() +} diff --git a/sei-db/db_engine/litt/disktable/flush_loop.go b/sei-db/db_engine/litt/disktable/flush_loop.go new file mode 100644 index 0000000000..896ebaf82e --- /dev/null +++ b/sei-db/db_engine/litt/disktable/flush_loop.go @@ -0,0 +1,138 @@ +//go:build littdb_wip + +package disktable + +import ( + "fmt" + "time" + + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// flushLoop is a struct that runs a goroutine that is responsible for blocking on flush operations. +type flushLoop struct { + logger logging.Logger + + // the parent disk table + diskTable *DiskTable + + // Responsible for handling fatal DB errors. + errorMonitor *util.ErrorMonitor + + // flushChannel is a channel used to enqueue work on the flush loop. + flushChannel chan any + + // metrics encapsulates metrics for the DB. + metrics *metrics.LittDBMetrics + + // provides the current time + clock func() time.Time + + // the name of the table + name string + + // This file stores the highest segment index that is fully snapshot. Written as segments are sealed + // and copied to the snapshot directory, read by the external snapshot consumer. + upperBoundSnapshotFile *BoundaryFile +} + +// enqueue sends work to be handled on the flush loop. Will return an error if the DB is panicking. +func (f *flushLoop) enqueue(request flushLoopMessage) error { + return util.Send(f.errorMonitor, f.flushChannel, request) +} + +// run is responsible for handling operations that flush data (i.e. 
calls to Flush() and when the mutable segment
+// is sealed). In theory, this work could be done on the main control loop, but doing so would block new writes while
+// a flush is in progress. In order to keep the writing threads busy, it is critical that flushes do not block the
+// control loop.
+func (f *flushLoop) run() {
+	for {
+		select {
+		case <-f.errorMonitor.ImmediateShutdownRequired():
+			f.logger.Infof("context done, shutting down disk table flush loop")
+			return
+		case message := <-f.flushChannel:
+			switch req := message.(type) {
+			case *flushLoopFlushRequest:
+				f.handleFlushRequest(req)
+			case *flushLoopSealRequest:
+				f.handleSealRequest(req)
+			case *flushLoopShutdownRequest:
+				req.shutdownCompleteChan <- struct{}{}
+				return
+			default:
+				f.errorMonitor.Panic(fmt.Errorf("unknown flush message type %T", message))
+				return
+			}
+		}
+	}
+}
+
+// handleSealRequest handles the part of the seal operation that is performed on the flush loop.
+// We don't want to send a flush request to a segment that has already been sealed. By performing the sealing
+// on the flush loop, we ensure that this can never happen. Any previously scheduled flush requests against the
+// segment that is being sealed will be processed prior to this request being processed due to the FIFO nature
+// of the flush loop channel. When a sealing operation begins, the control loop blocks, and does not unblock until
+// the seal is finished and a new mutable segment has been created. This means that no future flush requests will be
+// sent to the segment that is being sealed, since only the control loop can schedule work for the flush loop.
+func (f *flushLoop) handleSealRequest(req *flushLoopSealRequest) { + durableKeys, err := req.segmentToSeal.Seal(req.now) + if err != nil { + f.errorMonitor.Panic(fmt.Errorf("failed to seal segment %s: %w", req.segmentToSeal.String(), err)) + return + } + + // Flush the keys that are now durable in the segment. + err = f.diskTable.writeKeysToKeymap(durableKeys) + if err != nil { + f.errorMonitor.Panic(fmt.Errorf("failed to flush keys: %w", err)) + return + } + + req.responseChan <- struct{}{} + + // Snapshotting can wait until after we have sent a response. No need for the Flush() caller to wait for + // snapshotting. Flush() only cares about the data's crash durability, and is completely independent of + // snapshotting. + err = req.segmentToSeal.Snapshot() + if err != nil { + f.errorMonitor.Panic(fmt.Errorf("failed to snapshot segment %s: %w", req.segmentToSeal.String(), err)) + return + } + + // Update the boundary file. The consumer of the snapshot uses this information to determine when segments + // are fully copied to the snapshot directory. + err = f.upperBoundSnapshotFile.Update(req.segmentToSeal.SegmentIndex()) + if err != nil { + f.errorMonitor.Panic(fmt.Errorf("failed to update upper bound snapshot file: %w", err)) + } +} + +// handleFlushRequest handles the part of the flush that is performed on the flush loop. 
+func (f *flushLoop) handleFlushRequest(req *flushLoopFlushRequest) {
+	var segmentFlushStart time.Time
+	if f.metrics != nil {
+		segmentFlushStart = f.clock()
+	}
+
+	durableKeys, err := req.flushWaitFunction()
+	if err != nil {
+		f.errorMonitor.Panic(fmt.Errorf("failed to flush mutable segment: %w", err))
+		return
+	}
+
+	if f.metrics != nil {
+		segmentFlushEnd := f.clock()
+		delta := segmentFlushEnd.Sub(segmentFlushStart)
+		f.metrics.ReportSegmentFlushLatency(f.name, delta)
+	}
+
+	err = f.diskTable.writeKeysToKeymap(durableKeys)
+	if err != nil {
+		f.errorMonitor.Panic(fmt.Errorf("failed to flush keys: %w", err))
+		return
+	}
+
+	req.responseChan <- struct{}{}
+}
diff --git a/sei-db/db_engine/litt/disktable/flush_loop_messages.go b/sei-db/db_engine/litt/disktable/flush_loop_messages.go
new file mode 100644
index 0000000000..6b3c8278b1
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/flush_loop_messages.go
@@ -0,0 +1,46 @@
+//go:build littdb_wip
+
+package disktable
+
+import (
+	"time"
+
+	"github.com/Layr-Labs/eigenda/litt/disktable/segment"
+)
+
+// flushLoopMessage is the interface implemented by messages sent to the flush loop via flushLoop.enqueue.
+type flushLoopMessage interface {
+	// unimplemented is a marker method; embedding flushLoopMessage is enough to satisfy the interface.
+	unimplemented()
+}
+
+// flushLoopFlushRequest is a request to flush the writer that is sent to the flush loop.
+type flushLoopFlushRequest struct {
+	flushLoopMessage
+
+	// flushWaitFunction is the function that will wait for the flush to complete.
+	flushWaitFunction segment.FlushWaitFunction
+
+	// responseChan sends an object when the flush is complete.
+	responseChan chan struct{}
+}
+
+// flushLoopSealRequest is a request to seal the mutable segment that is sent to the flush loop.
+type flushLoopSealRequest struct {
+	flushLoopMessage
+
+	// the time when the segment is sealed
+	now time.Time
+	// segmentToSeal is the segment that is being sealed.
+	segmentToSeal *segment.Segment
+	// responseChan sends an object when the seal is complete.
+	responseChan chan struct{}
+}
+
+// flushLoopShutdownRequest is a request to shut down the flush loop.
+type flushLoopShutdownRequest struct {
+	flushLoopMessage
+
+	// shutdownCompleteChan will produce a single struct{} when the flush loop has stopped.
+	shutdownCompleteChan chan struct{}
+}
diff --git a/sei-db/db_engine/litt/disktable/keymap/keymap.go b/sei-db/db_engine/litt/disktable/keymap/keymap.go
new file mode 100644
index 0000000000..a4b0598e45
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/keymap/keymap.go
@@ -0,0 +1,55 @@
+//go:build littdb_wip
+
+package keymap
+
+import (
+	"github.com/Layr-Labs/eigenda/litt/types"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+)
+
+// KeymapDirectoryName is the name of the directory where the keymap stores its files. One keymap directory is
+// created per table.
+const KeymapDirectoryName = "keymap"
+
+// KeymapDataDirectoryName is the name of the directory where the keymap implementation stores its data files.
+// This directory will be created inside the keymap directory.
+const KeymapDataDirectoryName = "data"
+
+// KeymapInitializedFileName is the name of the file that indicates that the keymap has been initialized.
+// This file contains no data, and serves as a flag that is set when the keymap has been fully initialized.
+const KeymapInitializedFileName = "initialized"
+
+// Keymap maintains a mapping between keys and addresses. Implementations of this interface are goroutine safe.
+type Keymap interface {
+	// Put adds keys to the keymap as a batch. This method is required to store the address, but can ignore
+	// other fields in the ScopedKey struct such as the value length.
+	//
+	// A keymap provides atomicity for individual key-address pairs, but not for the batch as a whole.
+	//
+	// It is not safe to modify the contents of any slices passed to this function after the call.
+ // This includes the byte slices containing the keys. + Put(pairs []*types.ScopedKey) error + + // Get returns the address for a key. Returns true if the key exists, and false otherwise (i.e. does not + // return an error if the key does not exist). + // + // It is not safe to modify key byte slice after it is passed to this method. + Get(key []byte) (types.Address, bool, error) + + // Delete removes keys from the keymap. Deleting non-existent keys is a no-op. + // + // Deletion of keys is atomic, but deletion is not atomic across the entire batch. + // + // It is not safe to modify the contents of any slices passed to this function after the call. + // This includes the byte slices containing the keys. + Delete(keys []*types.ScopedKey) error + + // Stop stops the keymap. + Stop() error + + // Destroy stops the keymap and permanently deletes all data. + Destroy() error +} + +// BuildKeymap is a function that builds a Keymap. +type BuildKeymap func(logger logging.Logger, keymapPath string, doubleWriteProtection bool) (Keymap, bool, error) diff --git a/sei-db/db_engine/litt/disktable/keymap/keymap_test.go b/sei-db/db_engine/litt/disktable/keymap/keymap_test.go new file mode 100644 index 0000000000..5c0aa291b2 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/keymap/keymap_test.go @@ -0,0 +1,287 @@ +//go:build littdb_wip + +package keymap + +import ( + "os" + "path" + "testing" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/stretchr/testify/require" +) + +var builders = []keymapBuilder{ + buildMemKeymap, + buildLevelDBKeymap, +} + +type keymapBuilder func(logger logging.Logger, path string) (Keymap, error) + +func buildMemKeymap(logger logging.Logger, path string) (Keymap, error) { + kmap, _, err := NewMemKeymap(logger, path, true) + if err != nil { + return nil, err + } + + return 
kmap, nil +} + +func buildLevelDBKeymap(logger logging.Logger, path string) (Keymap, error) { + kmap, _, err := NewUnsafeLevelDBKeymap(logger, path, true) + if err != nil { + return nil, err + } + + return kmap, nil +} + +func testBasicBehavior(t *testing.T, keymap Keymap) { + rand := random.NewTestRandom() + + expected := make(map[string]types.Address) + + operations := 1000 + for i := 0; i < operations; i++ { + choice := rand.Float64() + if choice < 0.5 { + // Write a random value + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + + err := keymap.Put([]*types.ScopedKey{{Key: key, Address: address}}) + require.NoError(t, err) + expected[string(key)] = address + } else if choice < 0.75 { + // Delete a few random values + numberToDelete := rand.Int32Range(1, 10) + numberToDelete = min(numberToDelete, int32(len(expected))) + keysToDelete := make([]*types.ScopedKey, 0, numberToDelete) + for key := range expected { + if numberToDelete == int32(len(keysToDelete)) { + break + } + keysToDelete = append(keysToDelete, &types.ScopedKey{Key: []byte(key)}) + numberToDelete-- + } + + err := keymap.Delete(keysToDelete) + require.NoError(t, err) + for _, key := range keysToDelete { + delete(expected, string(key.Key)) + } + } else { + // Write a batch of random values + numberToWrite := rand.Int32Range(1, 10) + pairs := make([]*types.ScopedKey, numberToWrite) + for i := 0; i < int(numberToWrite); i++ { + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + pairs[i] = &types.ScopedKey{Key: key, Address: address} + expected[string(key)] = address + } + err := keymap.Put(pairs) + require.NoError(t, err) + } + + // Every once in a while, verify that the keymap is correct + if rand.BoolWithProbability(0.1) { + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + } + } + + for key, expectedAddress := 
range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + + err := keymap.Destroy() + require.NoError(t, err) +} + +func TestBasicBehavior(t *testing.T) { + t.Parallel() + testDir := t.TempDir() + dbDir := path.Join(testDir, "keymap") + + logger := test.GetLogger() + for _, builder := range builders { + keymap, err := builder(logger, dbDir) + require.NoError(t, err) + testBasicBehavior(t, keymap) + + // verify that test dir is empty (destroy should have deleted everything) + exists, err := util.Exists(dbDir) + require.NoError(t, err) + + if exists { + // Directory exists. Make sure it's empty. + entries, err := os.ReadDir(dbDir) + require.NoError(t, err) + require.Empty(t, entries) + } + } +} + +func TestRestart(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + logger := test.GetLogger() + testDir := t.TempDir() + dbDir := path.Join(testDir, "keymap") + + keymap, _, err := NewUnsafeLevelDBKeymap(logger, dbDir, true) + require.NoError(t, err) + + expected := make(map[string]types.Address) + + operations := 1000 + for i := 0; i < operations; i++ { + choice := rand.Float64() + if choice < 0.5 { + // Write a random value + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + + err := keymap.Put([]*types.ScopedKey{{Key: key, Address: address}}) + require.NoError(t, err) + expected[string(key)] = address + } else if choice < 0.75 { + // Delete a few random values + numberToDelete := rand.Int32Range(1, 10) + numberToDelete = min(numberToDelete, int32(len(expected))) + keysToDelete := make([]*types.ScopedKey, 0, numberToDelete) + for key := range expected { + if numberToDelete == int32(len(keysToDelete)) { + break + } + keysToDelete = append(keysToDelete, &types.ScopedKey{Key: []byte(key)}) + numberToDelete-- + } + + err := keymap.Delete(keysToDelete) + require.NoError(t, err) + for _, key := range keysToDelete { + delete(expected, 
string(key.Key)) + } + } else { + // Write a batch of random values + numberToWrite := rand.Int32Range(1, 10) + pairs := make([]*types.ScopedKey, numberToWrite) + for i := 0; i < int(numberToWrite); i++ { + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + pairs[i] = &types.ScopedKey{Key: key, Address: address} + expected[string(key)] = address + } + err := keymap.Put(pairs) + require.NoError(t, err) + } + + // Every once in a while, verify that the keymap is correct + if rand.BoolWithProbability(0.1) { + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + } + } + + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + + // Shut down the keymap and reload it + err = keymap.Stop() + require.NoError(t, err) + + keymap, _, err = NewUnsafeLevelDBKeymap(logger, dbDir, true) + require.NoError(t, err) + + // Expected data should be present + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + + for i := 0; i < operations; i++ { + choice := rand.Float64() + if choice < 0.5 { + // Write a random value + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + + err := keymap.Put([]*types.ScopedKey{{Key: key, Address: address}}) + require.NoError(t, err) + expected[string(key)] = address + } else if choice < 0.75 { + // Delete a few random values + numberToDelete := rand.Int32Range(1, 10) + numberToDelete = min(numberToDelete, int32(len(expected))) + keysToDelete := make([]*types.ScopedKey, 0, numberToDelete) + for key := range expected { + if numberToDelete == int32(len(keysToDelete)) { + break + } + keysToDelete = 
append(keysToDelete, &types.ScopedKey{Key: []byte(key)}) + numberToDelete-- + } + + err := keymap.Delete(keysToDelete) + require.NoError(t, err) + for _, key := range keysToDelete { + delete(expected, string(key.Key)) + } + } else { + // Write a batch of random values + numberToWrite := rand.Int32Range(1, 10) + pairs := make([]*types.ScopedKey, numberToWrite) + for i := 0; i < int(numberToWrite); i++ { + key := []byte(rand.String(32)) + address := types.Address(rand.Uint64()) + pairs[i] = &types.ScopedKey{Key: key, Address: address} + expected[string(key)] = address + } + err := keymap.Put(pairs) + require.NoError(t, err) + } + + // Every once in a while, verify that the keymap is correct + if rand.BoolWithProbability(0.1) { + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + } + } + + for key, expectedAddress := range expected { + address, ok, err := keymap.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedAddress, address) + } + + err = keymap.Destroy() + require.NoError(t, err) +} diff --git a/sei-db/db_engine/litt/disktable/keymap/keymap_type.go b/sei-db/db_engine/litt/disktable/keymap/keymap_type.go new file mode 100644 index 0000000000..c30e6b8bca --- /dev/null +++ b/sei-db/db_engine/litt/disktable/keymap/keymap_type.go @@ -0,0 +1,16 @@ +//go:build littdb_wip + +package keymap + +// KeymapType represents the type of a keymap. +type KeymapType string + +// LevelDBKeymapType is the type of a LevelDBKeymap. +const LevelDBKeymapType = "LevelDBKeymap" + +// UnsafeLevelDBKeymapType is similar to LevelDBKeymapType, but it is not safe to use in production. +// It runs a lot faster, but with weaker crash recovery guarantees. +const UnsafeLevelDBKeymapType = "UnsafeLevelDBKeymap" + +// MemKeymapType is the type of a MemKeymap. 
+const MemKeymapType = "MemKeymap" diff --git a/sei-db/db_engine/litt/disktable/keymap/keymap_type_file.go b/sei-db/db_engine/litt/disktable/keymap/keymap_type_file.go new file mode 100644 index 0000000000..5611f4acb9 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/keymap/keymap_type_file.go @@ -0,0 +1,121 @@ +//go:build littdb_wip + +package keymap + +import ( + "fmt" + "os" + "path" + + "github.com/Layr-Labs/eigenda/litt/util" +) + +// KeymapTypeFileName is the name of the file that contains the keymap type. +const KeymapTypeFileName = "keymap-type.txt" + +// KeymapTypeFile is a text file that contains the name of the keymap type. This is used to determine if the keymap +// needs to reload when littDB is restarted, or if the data structures in the keymap directory are still valid. +type KeymapTypeFile struct { + // keymapPath is the path to the keymap directory. + keymapPath string + + // KeymapType is the type of the keymap currently stored in the keymap directory. + keymapType KeymapType +} + +// KeymapFileExists checks if the keymap type file exists in the target directory. +func KeymapFileExists(keymapPath string) (bool, error) { + return util.Exists(path.Join(keymapPath, KeymapTypeFileName)) +} + +// NewKeymapTypeFile creates a new KeymapTypeFile. +func NewKeymapTypeFile(keymapPath string, keymapType KeymapType) *KeymapTypeFile { + return &KeymapTypeFile{ + keymapPath: keymapPath, + keymapType: keymapType, + } +} + +// LoadKeymapTypeFile loads the keymap type from the keymap directory. 
+func LoadKeymapTypeFile(keymapPath string) (*KeymapTypeFile, error) { + filePath := path.Join(keymapPath, KeymapTypeFileName) + + if err := util.ErrIfNotExists(filePath); err != nil { + return nil, fmt.Errorf("keymap type file does not exist: %v", filePath) + } + + fileContents, err := os.ReadFile(filePath) + if err != nil { + return nil, fmt.Errorf("unable to read keymap type file: %v", err) + } + + var keymapType KeymapType + switch string(fileContents) { + case MemKeymapType: + keymapType = MemKeymapType + case LevelDBKeymapType: + keymapType = LevelDBKeymapType + case UnsafeLevelDBKeymapType: + keymapType = UnsafeLevelDBKeymapType + default: + return nil, fmt.Errorf("unknown keymap type: %s", string(fileContents)) + } + + return &KeymapTypeFile{ + keymapPath: keymapPath, + keymapType: keymapType, + }, nil +} + +// Type returns the type of the keymap. +func (k *KeymapTypeFile) Type() KeymapType { + return k.keymapType +} + +// Write writes the keymap type to the keymap directory. +func (k *KeymapTypeFile) Write() error { + filePath := path.Join(k.keymapPath, KeymapTypeFileName) + + exists, _, err := util.ErrIfNotWritableFile(filePath) + if err != nil { + return fmt.Errorf("unable to open keymap type file: %v", err) + } + + if exists { + return fmt.Errorf("keymap type file already exists: %v", filePath) + } + + keymapFile, err := os.Create(filePath) + if err != nil { + return fmt.Errorf("unable to create keymap type file: %v", err) + } + + _, err = keymapFile.WriteString(string(k.keymapType)) + if err != nil { + return fmt.Errorf("unable to write keymap type file: %v", err) + } + + err = keymapFile.Close() + if err != nil { + return fmt.Errorf("unable to close keymap type file: %v", err) + } + + return nil +} + +// Delete deletes the keymap type file. 
+func (k *KeymapTypeFile) Delete() error { + exists, err := util.Exists(path.Join(k.keymapPath, KeymapTypeFileName)) + if err != nil { + return fmt.Errorf("error checking for keymap type file: %w", err) + } + if !exists { + return nil + } + + err = os.Remove(path.Join(k.keymapPath, KeymapTypeFileName)) + if err != nil { + return fmt.Errorf("unable to delete keymap type file: %v", err) + } + return nil +} diff --git a/sei-db/db_engine/litt/disktable/keymap/level_db_keymap.go b/sei-db/db_engine/litt/disktable/keymap/level_db_keymap.go new file mode 100644 index 0000000000..75926f6f1d --- /dev/null +++ b/sei-db/db_engine/litt/disktable/keymap/level_db_keymap.go @@ -0,0 +1,179 @@ +//go:build littdb_wip + +package keymap + +import ( + "errors" + "fmt" + "os" + "sync/atomic" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/syndtr/goleveldb/leveldb" + "github.com/syndtr/goleveldb/leveldb/opt" +) + +var _ Keymap = &LevelDBKeymap{} + +// LevelDBKeymap is a keymap that uses LevelDB as the underlying storage. Methods on this struct are goroutine safe. +type LevelDBKeymap struct { + logger logging.Logger + db *leveldb.DB + // if true, then return an error if an update would overwrite an existing key + doubleWriteProtection bool + keymapPath string + alive atomic.Bool + // This is a "test mode only" flag. Should be true in production use cases or anywhere that data consistency + // is critical. Unit tests write lots of little values, and syncing each one is slow, so it may be desirable + // to set this to false in some tests. + syncWrites bool +} + +var _ BuildKeymap = NewLevelDBKeymap + +// NewLevelDBKeymap creates a new LevelDBKeymap instance. 
+func NewLevelDBKeymap( + logger logging.Logger, + keymapPath string, + doubleWriteProtection bool) (kmap Keymap, requiresReload bool, err error) { + + return newLevelDBKeymap(logger, keymapPath, doubleWriteProtection, true) +} + +// NewUnsafeLevelDBKeymap creates a new LevelDBKeymap instance. It does not use sync writes. This makes it faster, +// but unsafe if data consistency is critical (i.e. production use cases). +func NewUnsafeLevelDBKeymap( + logger logging.Logger, + keymapPath string, + doubleWriteProtection bool) (kmap Keymap, requiresReload bool, err error) { + + return newLevelDBKeymap(logger, keymapPath, doubleWriteProtection, false) +} + +// newLevelDBKeymap creates a new LevelDBKeymap instance. +func newLevelDBKeymap( + logger logging.Logger, + keymapPath string, + doubleWriteProtection bool, + syncWrites bool) (kmap *LevelDBKeymap, requiresReload bool, err error) { + + exists, err := util.Exists(keymapPath) + if err != nil { + return nil, false, fmt.Errorf("error checking for keymap directory: %w", err) + } + + if !exists { + err = os.MkdirAll(keymapPath, 0755) + if err != nil { + return nil, false, fmt.Errorf("error creating keymap directory: %w", err) + } + } + requiresReload = !exists + + db, err := leveldb.OpenFile(keymapPath, nil) + if err != nil { + return nil, false, fmt.Errorf("failed to open LevelDB: %w", err) + } + + kmap = &LevelDBKeymap{ + logger: logger, + db: db, + keymapPath: keymapPath, + doubleWriteProtection: doubleWriteProtection, + syncWrites: syncWrites, + } + kmap.alive.Store(true) + + return kmap, requiresReload, nil +} + +func (l *LevelDBKeymap) Put(keys []*types.ScopedKey) error { + + if l.doubleWriteProtection { + for _, k := range keys { + _, ok, err := l.Get(k.Key) + if err != nil { + return fmt.Errorf("failed to get key: %w", err) + } + if ok { + return fmt.Errorf("key %s already exists", k.Key) + } + } + } + + batch := new(leveldb.Batch) + for _, k := range keys { + batch.Put(k.Key, k.Address.Serialize()) + } + + 
writeOptions := &opt.WriteOptions{ + Sync: l.syncWrites, + } + + err := l.db.Write(batch, writeOptions) + if err != nil { + return fmt.Errorf("failed to put batch to LevelDB: %w", err) + } + return nil +} + +func (l *LevelDBKeymap) Get(key []byte) (types.Address, bool, error) { + addressBytes, err := l.db.Get(key, nil) + if err != nil { + if errors.Is(err, leveldb.ErrNotFound) { + return 0, false, nil + } + return 0, false, fmt.Errorf("failed to get key from LevelDB: %w", err) + } + + address, err := types.DeserializeAddress(addressBytes) + if err != nil { + return 0, false, fmt.Errorf("failed to deserialize address: %w", err) + } + + return address, true, nil +} + +func (l *LevelDBKeymap) Delete(keys []*types.ScopedKey) error { + batch := new(leveldb.Batch) + for _, key := range keys { + batch.Delete(key.Key) + } + + err := l.db.Write(batch, nil) + if err != nil { + return fmt.Errorf("failed to delete keys from LevelDB: %w", err) + } + + return nil +} + +func (l *LevelDBKeymap) Stop() error { + alive := l.alive.Swap(false) + if !alive { + return nil + } + + err := l.db.Close() + if err != nil { + return fmt.Errorf("failed to close LevelDB: %w", err) + } + return nil +} + +func (l *LevelDBKeymap) Destroy() error { + err := l.Stop() + if err != nil { + return fmt.Errorf("failed to stop LevelDB: %w", err) + } + + l.logger.Info(fmt.Sprintf("deleting LevelDB keymap at path: %s", l.keymapPath)) + err = os.RemoveAll(l.keymapPath) + if err != nil { + return fmt.Errorf("failed to remove LevelDB data directory: %w", err) + } + + return nil +} diff --git a/sei-db/db_engine/litt/disktable/keymap/mem_keymap.go b/sei-db/db_engine/litt/disktable/keymap/mem_keymap.go new file mode 100644 index 0000000000..1107b3a8aa --- /dev/null +++ b/sei-db/db_engine/litt/disktable/keymap/mem_keymap.go @@ -0,0 +1,92 @@ +//go:build littdb_wip + +package keymap + +import ( + "fmt" + "sync" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + 
"github.com/Layr-Labs/eigensdk-go/logging" +) + +var _ Keymap = &memKeymap{} + +// An in-memory keymap implementation. When a table using a memKeymap is restarted, it loads all keys from +// the segment files. Methods on this struct are goroutine safe. +// +// - potentially high memory usage for large keymaps +// - potentially slow startup time for large keymaps +// - very fast after startup +type memKeymap struct { + logger logging.Logger + data map[string]types.Address + // if true, then return an error if an update would overwrite an existing key + doubleWriteProtection bool + lock sync.RWMutex +} + +var _ BuildKeymap = NewMemKeymap + +// NewMemKeymap creates a new in-memory keymap. +func NewMemKeymap(logger logging.Logger, + _ string, + doubleWriteProtection bool) (kmap Keymap, requiresReload bool, err error) { + + return &memKeymap{ + logger: logger, + data: make(map[string]types.Address), + doubleWriteProtection: doubleWriteProtection, + }, true, nil +} + +func (m *memKeymap) Put(keys []*types.ScopedKey) error { + m.lock.Lock() + defer m.lock.Unlock() + + for _, k := range keys { + stringKey := util.UnsafeBytesToString(k.Key) + + if m.doubleWriteProtection { + _, ok := m.data[stringKey] + if ok { + return fmt.Errorf("key %s already exists", k.Key) + } + } + + m.data[stringKey] = k.Address + } + return nil +} + +func (m *memKeymap) Get(key []byte) (types.Address, bool, error) { + m.lock.RLock() + defer m.lock.RUnlock() + + address, ok := m.data[util.UnsafeBytesToString(key)] + return address, ok, nil +} + +func (m *memKeymap) Delete(keys []*types.ScopedKey) error { + m.lock.Lock() + defer m.lock.Unlock() + + for _, key := range keys { + delete(m.data, util.UnsafeBytesToString(key.Key)) + } + + return nil +} + +func (m *memKeymap) Stop() error { + // nothing to do here + return nil +} + +func (m *memKeymap) Destroy() error { + m.lock.Lock() + defer m.lock.Unlock() + m.data = nil + return nil +} diff --git 
a/sei-db/db_engine/litt/disktable/segment/address_test.go b/sei-db/db_engine/litt/disktable/segment/address_test.go new file mode 100644 index 0000000000..716d064977 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/address_test.go @@ -0,0 +1,23 @@ +//go:build littdb_wip + +package segment + +import ( + "testing" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestAddress(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + + index := rand.Uint32() + offset := rand.Uint32() + address := types.NewAddress(index, offset) + + require.Equal(t, index, address.Index()) + require.Equal(t, offset, address.Offset()) +} diff --git a/sei-db/db_engine/litt/disktable/segment/key_file.go b/sei-db/db_engine/litt/disktable/segment/key_file.go new file mode 100644 index 0000000000..117816c36d --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/key_file.go @@ -0,0 +1,362 @@ +//go:build littdb_wip + +package segment + +import ( + "bufio" + "encoding/binary" + "fmt" + "os" + "path" + "strconv" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// KeyFileExtension is the file extension for the keys file. This file contains the keys for the data segment, +// and is used for performing garbage collection on the keymap. It can also be used to rebuild the keymap. +const KeyFileExtension = ".keys" + +// KeyFileSwapExtension is the file extension for the keys swap file. This file is used to atomically +// update key files. +const KeyFileSwapExtension = KeyFileExtension + util.SwapFileExtension + +// keyFile tracks the keys in a segment. It is used to do garbage collection on the keymap. +// +// This struct is NOT goroutine safe. It is unsafe to concurrently call write, flush, or seal on the same key file. +// It is not safe to read a key file until it is sealed. 
Once sealed, read-only operations are goroutine safe.
+type keyFile struct {
+ // The logger for the key file.
+ logger logging.Logger
+
+ // The segment index.
+ index uint32
+
+ // Path data for the segment file.
+ segmentPath *SegmentPath
+
+ // The writer for the file. If the file is sealed, this value is nil.
+ writer *bufio.Writer
+
+ // The size of the key file in bytes.
+ size uint64
+
+ // The segment version. Determines serialization format.
+ segmentVersion SegmentVersion
+
+ // If true, then this key file is intended to replace another key file. It is written to a temporary
+ // file, and then atomically renamed to the final file name.
+ swap bool
+}
+
+// createKeyFile creates a new key file.
+func createKeyFile(
+ logger logging.Logger,
+ index uint32,
+ segmentPath *SegmentPath,
+ swap bool,
+) (*keyFile, error) {
+
+ keys := &keyFile{
+ logger: logger,
+ index: index,
+ segmentPath: segmentPath,
+ segmentVersion: ValueSizeSegmentVersion,
+ swap: swap,
+ }
+
+ filePath := keys.path()
+
+ exists, _, err := util.ErrIfNotWritableFile(filePath)
+ if err != nil {
+ return nil, fmt.Errorf("cannot write to file: %w", err)
+ }
+
+ if exists {
+ return nil, fmt.Errorf("key file %s already exists", filePath)
+ }
+
+ flags := os.O_RDWR | os.O_CREATE
+ file, err := os.OpenFile(filePath, flags, 0644)
+ if err != nil {
+ return nil, fmt.Errorf("failed to open key file: %w", err)
+ }
+
+ writer := bufio.NewWriter(file)
+ keys.writer = writer
+
+ return keys, nil
+}
+
+// loadKeyFile loads the key file from disk, looking in the given parent directories until it finds the file.
+// If the file is not found, it returns an error.
+func loadKeyFile(
+ logger logging.Logger,
+ index uint32,
+ segmentPaths []*SegmentPath,
+ segmentVersion SegmentVersion,
+) (*keyFile, error) {
+
+ keyFileName := fmt.Sprintf("%d%s", index, KeyFileExtension)
+ keysPath, err := lookForFile(segmentPaths, keyFileName)
+ if err != nil {
+ return nil, fmt.Errorf("failed to find key file: %w", err)
+ }
+ if keysPath == nil {
+ return nil, fmt.Errorf("failed to find key file %s", keyFileName)
+ }
+
+ keys := &keyFile{
+ logger: logger,
+ index: index,
+ segmentPath: keysPath,
+ segmentVersion: segmentVersion,
+ }
+
+ filePath := keys.path()
+
+ exists, size, err := util.ErrIfNotWritableFile(filePath)
+ if err != nil {
+ return nil, fmt.Errorf("cannot write to file: %w", err)
+ }
+
+ if !exists {
+ return nil, fmt.Errorf("key file %s does not exist", filePath)
+ }
+
+ keys.size = uint64(size)
+
+ return keys, nil
+}
+
+// Size returns the size of the key file in bytes.
+func (k *keyFile) Size() uint64 {
+ return k.size
+}
+
+// name returns the name of the key file.
+func (k *keyFile) name() string {
+ extension := KeyFileExtension
+ if k.swap {
+ extension = KeyFileSwapExtension
+ }
+
+ return fmt.Sprintf("%d%s", k.index, extension)
+}
+
+// path returns the full path to the key file.
+func (k *keyFile) path() string {
+ return path.Join(k.segmentPath.SegmentDirectory(), k.name())
+}
+
+// atomicSwap atomically renames the swap file over the key file it replaces.
+func (k *keyFile) atomicSwap(sync bool) error {
+ if !k.swap {
+ return fmt.Errorf("key file is not a swap file")
+ }
+
+ swapPath := k.path()
+ k.swap = false
+ newPath := k.path()
+
+ err := util.AtomicRename(swapPath, newPath, sync)
+ if err != nil {
+ return fmt.Errorf("failed to atomically swap key file %s with %s: %w", swapPath, newPath, err)
+ }
+
+ return nil
+}
+
+// write writes a key to the key file.
+func (k *keyFile) write(scopedKey *types.ScopedKey) error { + if k.writer == nil { + return fmt.Errorf("key file is sealed") + } + + // Write the length of the key. + err := binary.Write(k.writer, binary.BigEndian, uint32(len(scopedKey.Key))) + if err != nil { + return fmt.Errorf("failed to write key length to key file: %w", err) + } + + // Write the key itself. + _, err = k.writer.Write(scopedKey.Key) + if err != nil { + return fmt.Errorf("failed to write key to key file: %w", err) + } + + // Write the address. + err = binary.Write(k.writer, binary.BigEndian, scopedKey.Address) + if err != nil { + return fmt.Errorf("failed to write address to key file: %w", err) + } + + // Write the size of the value. + err = binary.Write(k.writer, binary.BigEndian, scopedKey.ValueSize) + if err != nil { + return fmt.Errorf("failed to write value size to key file: %w", err) + } + + k.size += uint64( + 4 /* uint32 size of key */ + + len(scopedKey.Key) + + 8 /* uint64 address */ + + 4 /* uint32 size of value */) + + return nil +} + +// getKeyFileIndex returns the index of the key file from the file name. Key file names have the form "X.keys", +// where X is the segment index. +func getKeyFileIndex(fileName string) (uint32, error) { + baseName := path.Base(fileName) + indexString := baseName[:len(baseName)-len(KeyFileExtension)] + index, err := strconv.Atoi(indexString) + if err != nil { + return 0, fmt.Errorf("failed to parse index from file name %s: %w", fileName, err) + } + + return uint32(index), nil +} + +// flush flushes the key file to disk. +func (k *keyFile) flush() error { + if k.writer == nil { + return fmt.Errorf("key file is sealed") + } + + return k.writer.Flush() +} + +// seal seals the key file, preventing further writes. 
+func (k *keyFile) seal() error {
+ if k.writer == nil {
+ return fmt.Errorf("key file is already sealed")
+ }
+
+ err := k.flush()
+ if err != nil {
+ return fmt.Errorf("failed to flush key file: %w", err)
+ }
+ k.writer = nil
+
+ return nil
+}
+
+// readKeys reads all keys from the key file. This method returns an error if the key file is not sealed.
+// If there are keys that were only partially written (i.e. keys being written when the process crashed), then
+// those keys may not be returned. If a key is returned, it is guaranteed to be "whole" (i.e. a partial key will
+// never be returned).
+func (k *keyFile) readKeys() ([]*types.ScopedKey, error) {
+ if !k.isSealed() {
+ return nil, fmt.Errorf("key file is not sealed")
+ }
+
+ // Key files are small as long as key length is sane. Safe to read the whole file into memory.
+ keyBytes, err := os.ReadFile(k.path())
+ if err != nil {
+ return nil, fmt.Errorf("failed to read key file: %w", err)
+ }
+ keys := make([]*types.ScopedKey, 0)
+
+ index := 0
+ for {
+ // We need at least 4 bytes to read the length of the key.
+ if index+4 > len(keyBytes) { //nolint:staticcheck // QF1006
+ // There are fewer than 4 bytes left in the file.
+ break
+ }
+ keyLength := int(binary.BigEndian.Uint32(keyBytes[index : index+4]))
+ index += 4
+
+ if k.segmentVersion < ValueSizeSegmentVersion {
+ // We need to read the key, as well as the 8 byte address.
+ if index+keyLength+8 > len(keyBytes) {
+ // There are insufficient bytes left in the file to read the key and address.
+ break
+ }
+ } else {
+ // We need to read the key, as well as the 8 byte address and 4 byte value size.
+ if index+keyLength+12 > len(keyBytes) { + // There are insufficient bytes left in the file to read the key, address, and value size. + break + } + } + + key := keyBytes[index : index+keyLength] + index += keyLength + + address := types.Address(binary.BigEndian.Uint64(keyBytes[index : index+8])) + index += 8 + + var valueSize uint32 + if k.segmentVersion >= ValueSizeSegmentVersion { + valueSize = binary.BigEndian.Uint32(keyBytes[index : index+4]) + index += 4 + } + + keys = append(keys, &types.ScopedKey{ + Key: key, + Address: address, + ValueSize: valueSize, + }) + } + + if index != len(keyBytes) { + // This can happen if there is a crash while we are writing to the key file. + // Recoverable, but best to note the event in the logs. + k.logger.Warnf("key file %s has %d partial bytes", k.path(), len(keyBytes)-index) + } + + return keys, nil +} + +// snapshot creates a hard link to the file in the snapshot directory, and a soft link to the hard linked file in the +// soft link directory. Requires that the file is sealed and that snapshotting is enabled. +func (k *keyFile) snapshot() error { + if !k.isSealed() { + return fmt.Errorf("file %s is not sealed, cannot take Snapshot", k.path()) + } + + err := k.segmentPath.Snapshot(k.name()) + if err != nil { + return fmt.Errorf("failed to create Snapshot: %w", err) + } + + return nil +} + +// delete deletes the key file. If this key_file is a snapshot file (i.e. it is backed by a symlink), this method will +// also delete the file pointed to by the symlink. +func (k *keyFile) delete() error { + if !k.isSealed() { + return fmt.Errorf("key file %s is not sealed, cannot delete", k.path()) + } + + err := util.DeepDelete(k.path()) + if err != nil { + return fmt.Errorf("failed to delete key file %s: %w", k.path(), err) + } + + return nil +} + +// isSealed returns true if the key file is sealed, and false otherwise. 
+func (k *keyFile) isSealed() bool { + return k.writer == nil +} diff --git a/sei-db/db_engine/litt/disktable/segment/key_file_test.go b/sei-db/db_engine/litt/disktable/segment/key_file_test.go new file mode 100644 index 0000000000..b0bdeb9a4e --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/key_file_test.go @@ -0,0 +1,294 @@ +//go:build littdb_wip + +package segment + +import ( + "os" + "testing" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestReadWriteKeys(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + logger := test.GetLogger() + directory := t.TempDir() + + index := rand.Uint32() + + keyCount := rand.Int32Range(100, 200) + keys := make([]*types.ScopedKey, keyCount) + for i := 0; i < int(keyCount); i++ { + key := rand.VariableBytes(1, 100) + address := types.Address(rand.Uint64()) + valueSize := rand.Uint32() + keys[i] = &types.ScopedKey{Key: key, Address: address, ValueSize: valueSize} + } + + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + file, err := createKeyFile(logger, index, segmentPath, false) + require.NoError(t, err) + + for _, key := range keys { + err := file.write(key) + require.NoError(t, err) + } + + // Reading the file prior to sealing it is forbidden. + _, err = file.readKeys() + require.Error(t, err) + + err = file.seal() + require.NoError(t, err) + + // Verify that file size is correctly reported. + reportedSize := file.Size() + stat, err := os.Stat(file.path()) + require.NoError(t, err) + actualSize := uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + // Reading the file after sealing it is allowed. 
+ readKeys, err := file.readKeys() + require.NoError(t, err) + + for i, key := range keys { + assert.Equal(t, key, readKeys[i]) + } + + // Create a new in-memory instance from the on-disk file and verify that it behaves the same. + file2, err := loadKeyFile(logger, index, []*SegmentPath{segmentPath}, ValueSizeSegmentVersion) + require.NoError(t, err) + require.Equal(t, file.Size(), file2.Size()) + + readKeys, err = file2.readKeys() + require.NoError(t, err) + for i, key := range keys { + assert.Equal(t, key, readKeys[i]) + } + + // delete the file + filePath := file.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = file.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} + +func TestReadingTruncatedKeyFile(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + logger := test.GetLogger() + directory := t.TempDir() + + index := rand.Uint32() + + keyCount := rand.Int32Range(100, 200) + keys := make([]*types.ScopedKey, keyCount) + for i := 0; i < int(keyCount); i++ { + key := rand.VariableBytes(1, 100) + address := types.Address(rand.Uint64()) + valueSize := rand.Uint32() + keys[i] = &types.ScopedKey{Key: key, Address: address, ValueSize: valueSize} + } + + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + file, err := createKeyFile(logger, index, segmentPath, false) + require.NoError(t, err) + + for _, key := range keys { + err := file.write(key) + require.NoError(t, err) + } + + err = file.seal() + require.NoError(t, err) + + // Truncate the file. Chop off some bytes from the last key, but do not corrupt the length prefix. 
+ lastKeyLength := len(keys[keyCount-1].Key)
+
+ filePath := file.path()
+
+ originalBytes, err := os.ReadFile(filePath)
+ require.NoError(t, err)
+
+ bytesToRemove := rand.Int32Range(1, int32(lastKeyLength)+1)
+ bytes := originalBytes[:len(originalBytes)-int(bytesToRemove)]
+
+ err = os.WriteFile(filePath, bytes, 0644)
+ require.NoError(t, err)
+
+ // We should be able to read the keys up to the point where the file was truncated.
+ readKeys, err := file.readKeys()
+ require.NoError(t, err)
+
+ require.Equal(t, int(keyCount-1), len(readKeys))
+ for i, key := range keys[:keyCount-1] {
+ assert.Equal(t, key, readKeys[i])
+ }
+
+ // Truncate the file. This time, chop off some of the last entry.
+ prefixBytesToRemove := rand.Int32Range(1, 8)
+ bytes = originalBytes[:len(originalBytes)-int(prefixBytesToRemove)]
+
+ err = os.WriteFile(filePath, bytes, 0644)
+ require.NoError(t, err)
+
+ // The last entry can no longer be read in full, so it is dropped; the keys before it are still readable.
+ readKeys, err = file.readKeys()
+ require.NoError(t, err)
+
+ require.Equal(t, int(keyCount-1), len(readKeys))
+ for i, key := range keys[:keyCount-1] {
+ assert.Equal(t, key, readKeys[i])
+ }
+
+ // delete the file
+ _, err = os.Stat(filePath)
+ require.NoError(t, err)
+
+ err = file.delete()
+ require.NoError(t, err)
+
+ _, err = os.Stat(filePath)
+ require.True(t, os.IsNotExist(err))
+}
+
+func TestSwappingKeyFile(t *testing.T) {
+ t.Parallel()
+ rand := random.NewTestRandom()
+ logger := test.GetLogger()
+ directory := t.TempDir()
+
+ index := rand.Uint32()
+
+ keyCount := rand.Int32Range(100, 200)
+ keys := make([]*types.ScopedKey, keyCount)
+ for i := 0; i < int(keyCount); i++ {
+ key := rand.VariableBytes(1, 100)
+ address := types.Address(rand.Uint64())
+ valueSize := rand.Uint32()
+ keys[i] = &types.ScopedKey{Key: key, Address: address, ValueSize: valueSize}
+ }
+
+ segmentPath, err := NewSegmentPath(directory, "", "table")
+ require.NoError(t, err)
+ err = segmentPath.MakeDirectories(false)
+ require.NoError(t,
err) + file, err := createKeyFile(logger, index, segmentPath, false) + require.NoError(t, err) + + for _, key := range keys { + err := file.write(key) + require.NoError(t, err) + } + + // Reading the file prior to sealing it is forbidden. + _, err = file.readKeys() + require.Error(t, err) + + err = file.seal() + require.NoError(t, err) + + // Verify that file size is correctly reported. + reportedSize := file.Size() + stat, err := os.Stat(file.path()) + require.NoError(t, err) + actualSize := uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + // Reading the file after sealing it is allowed. + readKeys, err := file.readKeys() + require.NoError(t, err) + + for i, key := range keys { + assert.Equal(t, key, readKeys[i]) + } + + // Create a new in-memory instance from the on-disk file and verify that it behaves the same. + file2, err := loadKeyFile(logger, index, []*SegmentPath{segmentPath}, ValueSizeSegmentVersion) + require.NoError(t, err) + require.Equal(t, file.Size(), file2.Size()) + + readKeys, err = file2.readKeys() + require.NoError(t, err) + for i, key := range keys { + assert.Equal(t, key, readKeys[i]) + } + + // Create a new version of the key file that only contains the keys at even indices. The intention is to replace + // the on-disk file with this new version. + swapFile, err := createKeyFile(logger, index, segmentPath, true) + require.NoError(t, err) + for i := 0; i < int(keyCount); i += 2 { + err := swapFile.write(keys[i]) + require.NoError(t, err) + } + err = swapFile.seal() + require.NoError(t, err) + + // Verify that the swap file is present on disk. + swapFilePath := swapFile.path() + _, err = os.Stat(swapFilePath) + require.NoError(t, err) + + // The swap file path should be different from the original file path. + originalFilePath := file.path() + require.NotEqual(t, swapFilePath, originalFilePath) + + // Replace the old file with the new one. 
+ err = swapFile.atomicSwap(false) + require.NoError(t, err) + + // The old swap file should no longer be present. + _, err = os.Stat(swapFilePath) + require.True(t, os.IsNotExist(err)) + + // The "regular" file should still be present. + _, err = os.Stat(originalFilePath) + require.NoError(t, err) + + // Verify that the file size is correctly reported after the swap. + reportedSize = swapFile.Size() + stat, err = os.Stat(swapFile.path()) + require.NoError(t, err) + actualSize = uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + // Verify the contents of the new file. Reload it from disk just to ensure that we aren't "cheating" somehow. + file2, err = loadKeyFile(logger, index, []*SegmentPath{segmentPath}, ValueSizeSegmentVersion) + require.NoError(t, err) + readKeys, err = file2.readKeys() + require.NoError(t, err) + for i, key := range keys { + if i%2 == 0 { + assert.Equal(t, key, readKeys[i/2]) + } + } + + // delete the file + filePath := file.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = file.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} diff --git a/sei-db/db_engine/litt/disktable/segment/metadata_file.go b/sei-db/db_engine/litt/disktable/segment/metadata_file.go new file mode 100644 index 0000000000..42e2a1af10 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/metadata_file.go @@ -0,0 +1,383 @@ +//go:build littdb_wip + +package segment + +import ( + "encoding/binary" + "fmt" + "os" + "path" + "strconv" + "time" + + "github.com/Layr-Labs/eigenda/litt/util" +) + +const ( + + // MetadataFileExtension is the file extension for the metadata file. + MetadataFileExtension = ".metadata" + + // MetadataSwapExtension is the file extension for the metadata swap file. This file is used to atomically update + // the metadata file by doing an atomic rename of the swap file to the metadata file. 
If this file is ever + // present when the database first starts, it is an artifact of a crash during a metadata update, and should be + // deleted. + MetadataSwapExtension = MetadataFileExtension + util.SwapFileExtension + + // V0MetadataSize is the size the metadata file at version 0 (aka OldHashFunctionSegmentVersion) + // This is a constant, so it's convenient to have it here. + // - 4 bytes for version + // - 4 bytes for the sharding factor + // - 4 bytes for salt + // - 8 bytes for lastValueTimestamp + // - and 1 byte for sealed. + V0MetadataSize = 21 + + // V1MetadataSize is the size of the metadata file at version 1 (aka SipHashSegmentVersion). + // This is a constant, so it's convenient to have it here. + // - 4 bytes for version + // - 4 bytes for the sharding factor + // - 16 bytes for salt + // - 8 bytes for lastValueTimestamp + // - and 1 byte for sealed. + V1MetadataSize = 33 + + // V2MetadataSize is the size of the metadata file at version 2 (aka ValueSizeSegmentVersion). + // This is a constant, so it's convenient to have it here. + // - 4 bytes for version + // - 4 bytes for the sharding factor + // - 16 bytes for salt + // - 8 bytes for lastValueTimestamp + // - 4 bytes for keyCount + // - and 1 byte for sealed. + V2MetadataSize = 37 +) + +// metadataFile contains metadata about a segment. This file contains metadata about the data segment, such as +// serialization version and the lastValueTimestamp when the file was sealed. +type metadataFile struct { + // The segment index. This value is encoded in the file name. + index uint32 + + // The serialization version for this segment, used to permit smooth data migrations. + // This value is encoded in the file. + segmentVersion SegmentVersion + + // The sharding factor for this segment. This value is encoded in the file. + shardingFactor uint32 + + // A random number, used to make the sharding hash function hard for an attacker to predict. + // This value is encoded in the file. 
Note: after the hash function change, this value is + // only used for data written with the old hash function. + legacySalt uint32 + + // A random byte array, used to make the sharding hash function hard for an attacker to predict. + // This value is encoded in the file. + salt [16]byte + + // The time when the last value was written into the segment, in nanoseconds since the epoch. A segment can + // only be deleted when all values within it are expired, and so we only need to keep track of the + // lastValueTimestamp of the last value (which always expires last). This value is irrelevant if the segment is + // not yet sealed. This value is encoded in the file. + lastValueTimestamp uint64 + + // The number of keys in the segment. This value is undefined if the segment is not yet sealed. + // This value is encoded in the file. + keyCount uint32 + + // If true, the segment is sealed and no more data can be written to it. If false, then data can still be written + // to this segment. This value is encoded in the file. + sealed bool + + // Path data for the segment file. This information is not serialized in the metadata file. + segmentPath *SegmentPath + + // If true, then use fsync to make metadata updates atomic. Should always be true in production, but can be + // set to false in tests to speed up unit tests. Not serialized to the file. + fsync bool +} + +// createMetadataFile creates a new metadata file. When this method returns, the metadata file will +// be durably written to disk. 
+func createMetadataFile(
+ index uint32,
+ shardingFactor uint32,
+ salt [16]byte,
+ path *SegmentPath,
+ fsync bool,
+) (*metadataFile, error) {
+
+ file := &metadataFile{
+ index: index,
+ segmentPath: path,
+ fsync: fsync,
+ }
+
+ file.segmentVersion = LatestSegmentVersion
+ file.shardingFactor = shardingFactor
+ file.salt = salt
+ err := file.write()
+ if err != nil {
+ return nil, fmt.Errorf("failed to write metadata file: %v", err)
+ }
+
+ return file, nil
+}
+
+// loadMetadataFile loads the metadata file from disk, looking in the given parent directories until it finds the file.
+// If the file is not found, it returns an error.
+func loadMetadataFile(index uint32, segmentPaths []*SegmentPath, fsync bool) (*metadataFile, error) {
+ metadataFileName := fmt.Sprintf("%d%s", index, MetadataFileExtension)
+ metadataPath, err := lookForFile(segmentPaths, metadataFileName)
+ if err != nil {
+ return nil, fmt.Errorf("failed to find metadata file: %w", err)
+ }
+ if metadataPath == nil {
+ return nil, fmt.Errorf("failed to find metadata file %s", metadataFileName)
+ }
+
+ file := &metadataFile{
+ index: index,
+ segmentPath: metadataPath,
+ fsync: fsync,
+ }
+
+ filePath := file.path()
+
+ data, err := os.ReadFile(filePath)
+ if err != nil {
+ return nil, fmt.Errorf("failed to read metadata file %s: %v", filePath, err)
+ }
+ err = file.deserialize(data)
+ if err != nil {
+ return nil, fmt.Errorf("failed to deserialize metadata file %s: %v", filePath, err)
+ }
+
+ return file, nil
+}
+
+// getMetadataFileIndex returns the index of the metadata file from the file name. Metadata file names have the form
+// "X.metadata", where X is the segment index.
+const MetadataFileExtension = ".metadata"
+
+// getMetadataFileIndex extracts the segment index from a metadata file name of the form "X.metadata".
+func getMetadataFileIndex(fileName string) (uint32, error) {
+	// Take the base name first; fileName may include a directory prefix, in which case
+	// slicing with len(fileName) would use the wrong bounds.
+	baseName := path.Base(fileName)
+	indexString := baseName[:len(baseName)-len(MetadataFileExtension)]
+	index, err := strconv.Atoi(indexString)
+	if err != nil {
+		return 0, fmt.Errorf("failed to parse index from file name %s: %v", fileName, err)
+	}
+
+	return uint32(index), nil
+}
+
+// Size returns the size of the metadata file in bytes.
+func (m *metadataFile) Size() uint64 {
+	switch m.segmentVersion {
+	case OldHashFunctionSegmentVersion:
+		return V0MetadataSize
+	case SipHashSegmentVersion:
+		return V1MetadataSize
+	default:
+		return V2MetadataSize
+	}
+}
+
+// name returns the file name for this metadata file.
+func (m *metadataFile) name() string {
+	return fmt.Sprintf("%d%s", m.index, MetadataFileExtension)
+}
+
+// path returns the full path to this metadata file.
+func (m *metadataFile) path() string {
+	return path.Join(m.segmentPath.SegmentDirectory(), m.name())
+}
+
+// seal seals the segment. This action will atomically write the metadata file to disk one final time,
+// and should only be performed when all data that will be written to the key/value files has been made durable.
+func (m *metadataFile) seal(now time.Time, keyCount uint32) error { + m.sealed = true + m.lastValueTimestamp = uint64(now.UnixNano()) + m.keyCount = keyCount + err := m.write() + if err != nil { + return fmt.Errorf("failed to write sealed metadata file: %v", err) + } + return nil +} + +func (m *metadataFile) serializeV0Legacy() []byte { + data := make([]byte, V0MetadataSize) + + // Write the version + binary.BigEndian.PutUint32(data[0:4], uint32(m.segmentVersion)) + + // Write the sharding factor + binary.BigEndian.PutUint32(data[4:8], m.shardingFactor) + + // Write the salt + binary.BigEndian.PutUint32(data[8:12], m.legacySalt) + + // Write the lastValueTimestamp + binary.BigEndian.PutUint64(data[12:20], m.lastValueTimestamp) + + // Write the sealed flag + if m.sealed { + data[20] = 1 + } else { + data[20] = 0 + } + + return data +} + +func (m *metadataFile) serializeV1Legacy() []byte { + data := make([]byte, V1MetadataSize) + + // Write the version + binary.BigEndian.PutUint32(data[0:4], uint32(m.segmentVersion)) + + // Write the sharding factor + binary.BigEndian.PutUint32(data[4:8], m.shardingFactor) + + // Write the salt + copy(data[8:24], m.salt[:]) + + // Write the lastValueTimestamp + binary.BigEndian.PutUint64(data[24:32], m.lastValueTimestamp) + + // Write the sealed flag + if m.sealed { + data[32] = 1 + } else { + data[32] = 0 + } + + return data +} + +// serialize serializes the metadata file to a byte array. 
+func (m *metadataFile) serialize() []byte { + if m.segmentVersion == OldHashFunctionSegmentVersion { + return m.serializeV0Legacy() + } else if m.segmentVersion == SipHashSegmentVersion { + return m.serializeV1Legacy() + } + + data := make([]byte, V2MetadataSize) + + // Write the version + binary.BigEndian.PutUint32(data[0:4], uint32(m.segmentVersion)) + + // Write the sharding factor + binary.BigEndian.PutUint32(data[4:8], m.shardingFactor) + + // Write the salt + copy(data[8:24], m.salt[:]) + + // Write the lastValueTimestamp + binary.BigEndian.PutUint64(data[24:32], m.lastValueTimestamp) + + // Write the key count + binary.BigEndian.PutUint32(data[32:36], m.keyCount) + + // Write the sealed flag + if m.sealed { + data[36] = 1 + } else { + data[36] = 0 + } + + return data +} + +func (m *metadataFile) deserializeV0Legacy(data []byte) error { + // TODO (cody.littley): delete this after all data is migrated + if len(data) != V0MetadataSize { + return fmt.Errorf("metadata file is not the correct size, expected %d, got %d", + V0MetadataSize, len(data)) + } + + m.shardingFactor = binary.BigEndian.Uint32(data[4:8]) + m.legacySalt = binary.BigEndian.Uint32(data[8:12]) + m.lastValueTimestamp = binary.BigEndian.Uint64(data[12:20]) + m.sealed = data[20] == 1 + return nil +} + +func (m *metadataFile) deserializeV1Legacy(data []byte) error { + // TODO (cody.littley): delete this after all data is migrated + if len(data) != V1MetadataSize { + return fmt.Errorf("metadata file is not the correct size, expected %d, got %d", + V1MetadataSize, len(data)) + } + + m.shardingFactor = binary.BigEndian.Uint32(data[4:8]) + m.salt = [16]byte(data[8:24]) + m.lastValueTimestamp = binary.BigEndian.Uint64(data[24:32]) + m.sealed = data[32] == 1 + return nil +} + +// deserialize deserializes the metadata file from a byte array. 
+func (m *metadataFile) deserialize(data []byte) error { + if len(data) < 4 { + return fmt.Errorf("metadata file is not the correct size, expected at least 4 bytes, got %d", len(data)) + } + + m.segmentVersion = SegmentVersion(binary.BigEndian.Uint32(data[0:4])) + if m.segmentVersion > LatestSegmentVersion { + return fmt.Errorf("unsupported serialization version: %d", m.segmentVersion) + } + + if m.segmentVersion == OldHashFunctionSegmentVersion { + return m.deserializeV0Legacy(data) + } else if m.segmentVersion == SipHashSegmentVersion { + return m.deserializeV1Legacy(data) + } + + if len(data) != V2MetadataSize { + return fmt.Errorf("metadata file is not the correct size, expected %d, got %d", + V2MetadataSize, len(data)) + } + + m.shardingFactor = binary.BigEndian.Uint32(data[4:8]) + m.salt = [16]byte(data[8:24]) + m.lastValueTimestamp = binary.BigEndian.Uint64(data[24:32]) + m.keyCount = binary.BigEndian.Uint32(data[32:36]) + m.sealed = data[36] == 1 + + return nil +} + +// write atomically writes the metadata file to disk. +func (m *metadataFile) write() error { + err := util.AtomicWrite(m.path(), m.serialize(), m.fsync) + if err != nil { + return fmt.Errorf("failed to write metadata file %s: %v", m.path(), err) + } + + return nil +} + +// snapshot creates a hard link to the file in the snapshot directory, and a soft link to the hard linked file in the +// soft link directory. Requires that the file is sealed and that snapshotting is enabled. +func (m *metadataFile) snapshot() error { + if !m.sealed { + return fmt.Errorf("file %s is not sealed, cannot take Snapshot", m.path()) + } + + err := m.segmentPath.Snapshot(m.name()) + if err != nil { + return fmt.Errorf("failed to create Snapshot: %v", err) + } + + return nil +} + +// delete deletes the metadata file from disk. If the file is a snapshot (i.e., a symlink), this method will also +// delete the actual file that the symlink points to. 
+func (m *metadataFile) delete() error { + err := util.DeepDelete(m.path()) + if err != nil { + return fmt.Errorf("failed to delete metadata file %s: %w", m.path(), err) + } + return nil +} diff --git a/sei-db/db_engine/litt/disktable/segment/metadata_file_test.go b/sei-db/db_engine/litt/disktable/segment/metadata_file_test.go new file mode 100644 index 0000000000..de362d9bc7 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/metadata_file_test.go @@ -0,0 +1,192 @@ +//go:build littdb_wip + +package segment + +import ( + "os" + "testing" + + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestUnsealedSerialization(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + index := rand.Uint32() + shardingFactor := rand.Uint32() + salt := ([16]byte)(rand.Bytes(16)) + timestamp := rand.Uint64() + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + m := &metadataFile{ + index: index, + segmentVersion: LatestSegmentVersion, + shardingFactor: shardingFactor, + salt: salt, + lastValueTimestamp: timestamp, + sealed: false, + segmentPath: segmentPath, + } + err = m.write() + require.NoError(t, err) + + deserialized, err := loadMetadataFile(index, []*SegmentPath{segmentPath}, false) + require.NoError(t, err) + require.Equal(t, *m, *deserialized) + + reportedSize := m.Size() + stat, err := os.Stat(m.path()) + require.NoError(t, err) + actualSize := uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + // delete the file + filePath := m.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = m.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} + +func TestSealedSerialization(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + index := rand.Uint32() + 
shardingFactor := rand.Uint32() + salt := ([16]byte)(rand.Bytes(16)) + timestamp := rand.Uint64() + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + m := &metadataFile{ + index: index, + segmentVersion: LatestSegmentVersion, + shardingFactor: shardingFactor, + salt: salt, + lastValueTimestamp: timestamp, + sealed: true, + segmentPath: segmentPath, + } + err = m.write() + require.NoError(t, err) + + reportedSize := m.Size() + stat, err := os.Stat(m.path()) + require.NoError(t, err) + actualSize := uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + deserialized, err := loadMetadataFile(index, []*SegmentPath{segmentPath}, false) + require.NoError(t, err) + require.Equal(t, *m, *deserialized) + + // delete the file + filePath := m.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = m.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} + +func TestFreshFileSerialization(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + salt := ([16]byte)(rand.Bytes(16)) + + index := rand.Uint32() + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + m, err := createMetadataFile(index, 1234, salt, segmentPath, false) + require.NoError(t, err) + + require.Equal(t, index, m.index) + require.Equal(t, LatestSegmentVersion, m.segmentVersion) + require.False(t, m.sealed) + require.Zero(t, m.lastValueTimestamp) + + reportedSize := m.Size() + stat, err := os.Stat(m.path()) + require.NoError(t, err) + actualSize := uint64(stat.Size()) + require.Equal(t, actualSize, reportedSize) + + deserialized, err := loadMetadataFile(index, []*SegmentPath{segmentPath}, false) + require.NoError(t, err) + require.Equal(t, *m, *deserialized) + + // delete the file + 
filePath := m.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = m.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} + +func TestSealing(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + salt := ([16]byte)(rand.Bytes(16)) + + index := rand.Uint32() + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + m, err := createMetadataFile(index, 1234, salt, segmentPath, false) + require.NoError(t, err) + + // seal the file + sealTime := rand.Time() + err = m.seal(sealTime, 987) + require.NoError(t, err) + + require.Equal(t, index, m.index) + require.Equal(t, LatestSegmentVersion, m.segmentVersion) + require.True(t, m.sealed) + require.Equal(t, uint64(sealTime.UnixNano()), m.lastValueTimestamp) + require.Equal(t, salt, m.salt) + require.Equal(t, uint32(1234), m.shardingFactor) + require.Equal(t, uint32(987), m.keyCount) + + // load the file + deserialized, err := loadMetadataFile(index, []*SegmentPath{segmentPath}, false) + require.NoError(t, err) + require.Equal(t, *m, *deserialized) + + // delete the file + filePath := m.path() + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = m.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} diff --git a/sei-db/db_engine/litt/disktable/segment/segment.go b/sei-db/db_engine/litt/disktable/segment/segment.go new file mode 100644 index 0000000000..b784ecf94b --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/segment.go @@ -0,0 +1,882 @@ +//go:build littdb_wip + +package segment + +import ( + "errors" + "fmt" + "math" + "os" + "path" + "sync/atomic" + "time" + + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// unflushedKeysInitialCapacity is the 
initial capacity of the unflushedKeys slice. This slice is used to store keys +// that have been written to the segment but have not yet been flushed to disk. +const unflushedKeysInitialCapacity = 128 + +// shardControlChannelCapacity is the capacity of the channel used to send messages to the shard control loop. +const shardControlChannelCapacity = 32 + +// Segment is a chunk of data stored on disk. All data in a particular data segment is expired at the same time. +// +// This struct is not safe for operations that mutate the segment, access control must be handled by the caller. +type Segment struct { + // The logger for the segment. + logger logging.Logger + + // Used to signal an unrecoverable error in the segment. If errorMonitor.Panic() is called, the entire DB + // enters a "panic" state and will refuse to do additional work. + errorMonitor *util.ErrorMonitor + + // The index of the data segment. The first data segment ever created has index 0, the next has index 1, and so on. + index uint32 + + // This file contains metadata about the segment. + metadata *metadataFile + + // This file contains the keys for the data segment, and is used for performing garbage collection on the key index. + keys *keyFile + + // The value files, one for each shard in the segment. Indexed by shard number. + shards []*valueFile + + // shardSizes is a list of the current sizes of each shard in the segment. Indexed by shard number. This + // value is only tracked for mutable segments (i.e. the unsealed segment), meaning that if this segment was loaded + // from disk, the values in this slice will be zero. + shardSizes []uint64 + + // The current size of the key file in bytes. This is only tracked for mutable segments, meaning that if this + // segment was loaded from disk, this value will be zero. + keyFileSize uint64 + + // The maximum size of all shards in this segment. + maxShardSize uint64 + + // The number of keys written to this segment. 
+ keyCount uint32 + + // shardChannels is a list of channels used to send messages to the goroutine responsible for writing to + // each shard. Indexed by shard number. + shardChannels []chan any + + // keyFileChannel is a channel used to send messages to the goroutine responsible for writing to the key file. + keyFileChannel chan any + + // deletionChannel permits a caller to block until this segment is fully deleted. An element is inserted into + // the channel when the segment is fully deleted. + deletionChannel chan struct{} + + // reservationCount is the number of reservations on this segment. The segment will not be deleted until this count + // reaches zero. + reservationCount atomic.Int32 + + // nextSegment is the next segment in the chain (i.e. the segment with index+1). Each segment takes a reservation + // on the next segment in the sequence. This reservation is released when the segment is fully deleted. This + // ensures that segments are always deleted strictly in sequence. This makes it impossible for a crash to cause + // segment X to be missing while segment X-1 is present. + nextSegment *Segment + + // Used as a sanity checker. For each value written to the segment, the segment must eventually return + // a key to be written to the keymap. This value tracks the number of values that have been written to the + // segment but have not yet been flushed to the keymap. When the segment is eventually sealed, the code + // asserts that this value is zero. This check should never fail, but is a nice safety net. + unflushedKeyCount atomic.Int64 + + // If true, then take a snapshot of the segment when it is sealed. + snapshottingEnabled bool + + // If true, then sync the file system for atomic operations. Should always be true in production, but can + // be set to false for tests to save time. + fsync bool +} + +// CreateSegment creates a new data segment. 
+func CreateSegment( + logger logging.Logger, + errorMonitor *util.ErrorMonitor, + index uint32, + segmentPaths []*SegmentPath, + snapshottingEnabled bool, + shardingFactor uint32, + salt [16]byte, + fsync bool) (*Segment, error) { + + if len(segmentPaths) == 0 { + return nil, errors.New("no segment paths provided") + } + + metadata, err := createMetadataFile(index, shardingFactor, salt, segmentPaths[0], fsync) + if err != nil { + return nil, fmt.Errorf("failed to open metadata file: %v", err) + } + + keys, err := createKeyFile(logger, index, segmentPaths[0], false) + if err != nil { + return nil, fmt.Errorf("failed to open key file: %v", err) + } + + keyFileSize := keys.Size() + + shards := make([]*valueFile, metadata.shardingFactor) + for shard := uint32(0); shard < metadata.shardingFactor; shard++ { + // Assign value files to available segment paths in a round-robin fashion. + // Assign the first shard to the directory at index 1. The first directory + // is used by the keymap, so if we have enough directories we don't want to + // use it for value files too. + segmentPath := segmentPaths[int(shard+1)%len(segmentPaths)] + + values, err := createValueFile(logger, index, shard, segmentPath, fsync) + if err != nil { + return nil, fmt.Errorf("failed to open value file: %v", err) + } + shards[shard] = values + } + + shardSizes := make([]uint64, metadata.shardingFactor) + + shardChannels := make([]chan any, metadata.shardingFactor) + for shard := uint32(0); shard < metadata.shardingFactor; shard++ { + shardChannels[shard] = make(chan any, shardControlChannelCapacity) + } + + // If at all possible, we want to size this channel so that the goroutines writing data to the sharded value files + // do not block on insertion into this channel. Scale the size of this channel by the number of shards, as more + // shards mean there may be a higher rate of writes to this channel. 
+ keyFileChannel := make(chan any, shardControlChannelCapacity*metadata.shardingFactor) + + segment := &Segment{ + logger: logger, + errorMonitor: errorMonitor, + index: index, + metadata: metadata, + keys: keys, + shards: shards, + shardSizes: shardSizes, + keyFileSize: keyFileSize, + shardChannels: shardChannels, + keyFileChannel: keyFileChannel, + deletionChannel: make(chan struct{}, 1), + snapshottingEnabled: snapshottingEnabled, + fsync: fsync, + } + + // Segments are returned with an initial reference count of 1, as the caller of the constructor is considered to + // have a reference to the segment. + segment.reservationCount.Store(1) + + // Start up the control loops. + for shard := uint32(0); shard < metadata.shardingFactor; shard++ { + go segment.shardControlLoop(shard) + } + + go segment.keyFileControlLoop() + + return segment, nil +} + +// LoadSegment loads an existing segment from disk. If that segment is unsealed, this method will seal it. +func LoadSegment(logger logging.Logger, + errorMonitor *util.ErrorMonitor, + index uint32, + segmentPaths []*SegmentPath, + snapshottingEnabled bool, + now time.Time, + fsync bool, +) (*Segment, error) { + + if len(segmentPaths) == 0 { + return nil, errors.New("no segment paths provided") + } + + // Look for the metadata file. + metadata, err := loadMetadataFile(index, segmentPaths, fsync) + if err != nil { + return nil, fmt.Errorf("failed to open metadata file: %w", err) + } + + // Look for the key file. + keys, err := loadKeyFile(logger, index, segmentPaths, metadata.segmentVersion) + if err != nil { + return nil, fmt.Errorf("failed to open key file: %v", err) + } + keyFileSize := keys.Size() + + // Look for the value files. There should be one for each shard. 
+ shards := make([]*valueFile, metadata.shardingFactor) + for shard := uint32(0); shard < metadata.shardingFactor; shard++ { + values, err := loadValueFile(logger, index, shard, segmentPaths) + if err != nil { + return nil, fmt.Errorf("failed to open value file: %v", err) + } + shards[shard] = values + } + + segment := &Segment{ + logger: logger, + errorMonitor: errorMonitor, + index: index, + metadata: metadata, + keys: keys, + shards: shards, + keyFileSize: keyFileSize, + keyCount: metadata.keyCount, + deletionChannel: make(chan struct{}, 1), + snapshottingEnabled: snapshottingEnabled, + fsync: fsync, + } + + // Segments are returned with an initial reference count of 1, as the caller of the constructor is considered to + // have a reference to the segment. + segment.reservationCount.Store(1) + + if !metadata.sealed { + err = segment.sealLoadedSegment(now) + if err != nil { + return nil, fmt.Errorf("failed to seal segment: %w", err) + } + } + + return segment, nil +} + +// SegmentIndex returns the index of the segment. +func (s *Segment) SegmentIndex() uint32 { + return s.index +} + +// sealLoadedSegment is responsible for sealing a segment loaded from disk that is not already sealed. +// While doing this, it is responsible for making the key file consistent with the values present in the +// value files. 
+func (s *Segment) sealLoadedSegment(now time.Time) error {
+	scopedKeys, err := s.keys.readKeys()
+	if err != nil {
+		return fmt.Errorf("failed to read keys: %w", err)
+	}
+
+	// keys whose values are fully present in the value files
+	goodKeys := make([]*types.ScopedKey, 0, len(scopedKeys))
+
+	// keys whose values weren't flushed out to the value files before the DB crashed
+	badKeys := make([]*types.ScopedKey, 0, len(scopedKeys))
+
+	for _, scopedKey := range scopedKeys {
+		shard := s.GetShard(scopedKey.Key)
+
+		requiredValueFileLength := uint64(scopedKey.Address.Offset()) +
+			4 /* value size uint32 */ +
+			uint64(scopedKey.ValueSize)
+
+		if s.shards[shard].Size() < requiredValueFileLength {
+			badKeys = append(badKeys, scopedKey)
+		} else {
+			goodKeys = append(goodKeys, scopedKey)
+		}
+	}
+
+	if len(badKeys) > 0 {
+		// We have at least one bad key. Rewrite the key file with only the good keys.
+		s.logger.Warnf("segment %d has %d unflushed value(s)",
+			s.index, len(badKeys))
+
+		swapFile, err := createKeyFile(s.logger, s.index, s.keys.segmentPath, true)
+		if err != nil {
+			return fmt.Errorf("failed to create swap key file: %w", err)
+		}
+
+		for _, scopedKey := range goodKeys {
+			err = swapFile.write(scopedKey)
+			if err != nil {
+				return fmt.Errorf("failed to write key to swap file: %w", err)
+			}
+		}
+		err = swapFile.seal()
+		if err != nil {
+			return fmt.Errorf("failed to seal swap file: %w", err)
+		}
+
+		err = swapFile.atomicSwap(s.fsync)
+		if err != nil {
+			return fmt.Errorf("failed to swap key file: %w", err)
+		}
+
+		s.keys = swapFile
+	}
+
+	err = s.metadata.seal(now, uint32(len(goodKeys)))
+	if err != nil {
+		return fmt.Errorf("failed to seal metadata file: %w", err)
+	}
+	s.keyCount = uint32(len(goodKeys))
+
+	return nil
+}
+
+// Size returns the size of the segment in bytes. Counts bytes that are on disk or that will eventually end up on disk.
+// This method is not thread safe, and should not be called concurrently with methods that modify the segment. +func (s *Segment) Size() uint64 { + size := s.metadata.Size() + + if s.IsSealed() { + // This segment is immutable, so it's thread safe to query the files directly. + size += s.keys.Size() + for _, shard := range s.shards { + size += shard.Size() + } + } else { + // This segment is mutable. We must use our local reckoning of the sizes of the files. + size += s.keyFileSize + for _, shardSize := range s.shardSizes { + size += shardSize + } + } + + return size +} + +// KeyCount returns the number of keys in the segment. +func (s *Segment) KeyCount() uint32 { + return s.keyCount +} + +// lookForFile looks for a file in a list of directories. It returns an error if the file appears +// in more than one directory, and nil if the file is not found. If the file is found and +// there are no errors, this method returns the SegmentPath where the file was found. +func lookForFile(paths []*SegmentPath, fileName string) (*SegmentPath, error) { + locations := make([]*SegmentPath, 0, 1) + for _, possiblePath := range paths { + potentialLocation := path.Join(possiblePath.segmentDirectory, fileName) + exists, err := util.Exists(potentialLocation) + if err != nil { + return nil, fmt.Errorf("failed to check if file %s exists: %v", potentialLocation, err) + } + if exists { + locations = append(locations, possiblePath) + } + } + + if len(locations) > 1 { + return nil, fmt.Errorf("file %s found in multiple directories: %v", fileName, locations) + } + + if len(locations) == 0 { + return nil, nil + } + return locations[0], nil +} + +// SetNextSegment sets the next segment in the chain. +func (s *Segment) SetNextSegment(nextSegment *Segment) { + nextSegment.Reserve() + s.nextSegment = nextSegment +} + +// GetShard returns the shard number for a key. 
+func (s *Segment) GetShard(key []byte) uint32 {
+	if s.metadata.shardingFactor == 1 {
+		// Shortcut: if we have one shard, we don't need to hash the key to figure out the mapping.
+		return 0
+	}
+
+	if s.metadata.segmentVersion == OldHashFunctionSegmentVersion {
+		return util.LegacyHashKey(key, s.metadata.legacySalt) % s.metadata.shardingFactor
+	}
+
+	hash := util.HashKey(key, s.metadata.salt)
+
+	return hash % s.metadata.shardingFactor
+}
+
+// Write records a key-value pair in the data segment, returning the updated key count and key file size for
+// this segment.
+//
+// This method does not ensure that the key-value pair is actually written to disk, only that it will eventually be
+// written to disk. Flush must be called to ensure that all data previously passed to Write is written to disk.
+func (s *Segment) Write(data *types.KVPair) (keyCount uint32, keyFileSize uint64, err error) {
+	if s.metadata.sealed {
+		return 0, 0, fmt.Errorf("segment is sealed, cannot write data")
+	}
+
+	shard := s.GetShard(data.Key)
+	currentSize := s.shardSizes[shard]
+
+	if currentSize > math.MaxUint32 {
+		// No matter the configuration, we absolutely cannot permit a value to be written if the first byte of the
+		// value would be beyond position 2^32. This is because we only have 32 bits in an address to store the
+		// position of a value's first byte.
+		return 0, 0,
+			fmt.Errorf("value file already contains %d bytes, cannot add a new value", currentSize)
+	}
+	s.unflushedKeyCount.Add(1)
+	firstByteIndex := uint32(currentSize)
+
+	s.shardSizes[shard] += uint64(len(data.Value)) + 4 /* uint32 length */
+	if s.shardSizes[shard] > s.maxShardSize {
+		s.maxShardSize = s.shardSizes[shard]
+	}
+	s.keyCount++
+	s.keyFileSize += uint64(len(data.Key)) + 4 /* uint32 length */ + 8 /* uint64 Address */ + 4 /* uint32 ValueSize */
+
+	// Forward the value to the shard control loop, which asynchronously writes it to the value file.
+	shardRequest := &valueToWrite{
+		value:                  data.Value,
+		expectedFirstByteIndex: firstByteIndex,
+	}
+	err = util.Send(s.errorMonitor, s.shardChannels[shard], shardRequest)
+	if err != nil {
+		return 0, 0,
+			fmt.Errorf("failed to send value to shard control loop: %v", err)
+	}
+
+	// Forward the key and its address to the key file control loop, which asynchronously writes them to the key file.
+	keyRequest := &types.ScopedKey{
+		Key:       data.Key,
+		Address:   types.NewAddress(s.index, firstByteIndex),
+		ValueSize: uint32(len(data.Value)),
+	}
+
+	err = util.Send(s.errorMonitor, s.keyFileChannel, keyRequest)
+	if err != nil {
+		return 0, 0,
+			fmt.Errorf("failed to send key to key file control loop: %v", err)
+	}
+
+	return s.keyCount, s.keyFileSize, nil
+}
+
+// GetMaxShardSize returns the maximum size of all shards in this segment.
+func (s *Segment) GetMaxShardSize() uint64 {
+	return s.maxShardSize
+}
+
+// Read fetches the data for a key from the data segment.
+//
+// It is only thread safe to read from a segment if the key being read has previously been flushed to disk.
+func (s *Segment) Read(key []byte, dataAddress types.Address) ([]byte, error) {
+	shard := s.GetShard(key)
+	values := s.shards[shard]
+
+	value, err := values.read(dataAddress.Offset())
+	if err != nil {
+		return nil, fmt.Errorf("failed to read value: %w", err)
+	}
+	return value, nil
+}
+
+// GetKeys returns all keys in the data segment. Only permitted to be called after the segment has been sealed.
+func (s *Segment) GetKeys() ([]*types.ScopedKey, error) {
+	if !s.metadata.sealed {
+		return nil, fmt.Errorf("segment is not sealed, cannot read keys")
+	}
+
+	keys, err := s.keys.readKeys()
+	if err != nil {
+		return nil, fmt.Errorf("failed to read keys: %w", err)
+	}
+	return keys, nil
+}
+
+// FlushWaitFunction is a function that waits for a flush operation to complete. It returns the addresses of the data
+// that was flushed, or an error if the flush operation failed.
+type FlushWaitFunction func() ([]*types.ScopedKey, error) + +// Flush schedules a flush operation. Flush operations are performed serially in the order they are scheduled. +// This method returns a function that, when called, will block until the flush operation is complete. The function +// returns the addresses of the data that was flushed, or an error if the flush operation failed. +func (s *Segment) Flush() (FlushWaitFunction, error) { + return s.flush(false) +} + +func (s *Segment) flush(seal bool) (FlushWaitFunction, error) { + // Schedule a flush for all shards. + shardResponseChannels := make([]chan struct{}, s.metadata.shardingFactor) + for shard, shardChannel := range s.shardChannels { + shardResponseChannels[shard] = make(chan struct{}, 1) + request := &shardFlushRequest{ + seal: seal, + completionChannel: shardResponseChannels[shard], + } + err := util.Send(s.errorMonitor, shardChannel, request) + if err != nil { + return nil, fmt.Errorf("failed to send flush request to shard %d: %w", shard, err) + } + } + + // Schedule a flush for the key channel. + // Now that all shards have sent their key/address pairs to the key file, flush the key file. + keyResponseChannel := make(chan *keyFileFlushResponse, 1) + request := &keyFileFlushRequest{ + seal: seal, + completionChannel: keyResponseChannel, + } + err := util.Send(s.errorMonitor, s.keyFileChannel, request) + if err != nil { + return nil, fmt.Errorf("failed to send flush request to key file: %w", err) + } + + return func() ([]*types.ScopedKey, error) { + // Wait for each shard to finish flushing. 
+ for i := range s.shardChannels { + _, err := util.Await(s.errorMonitor, shardResponseChannels[i]) + if err != nil { + return nil, fmt.Errorf("failed to flush shard %d: %w", i, err) + } + } + + keyFlushResponse, err := util.Await(s.errorMonitor, keyResponseChannel) + if err != nil { + return nil, fmt.Errorf("failed to flush key file: %w", err) + } + + s.unflushedKeyCount.Add(-int64(len(keyFlushResponse.addresses))) + return keyFlushResponse.addresses, nil + }, nil +} + +// Snapshot takes a snapshot of the files in the segment if snapshotting is enabled. If snapshotting is not enabled, +// then this method is a no-op. +func (s *Segment) Snapshot() error { + if !s.snapshottingEnabled { + return nil + } + + err := s.metadata.snapshot() + if err != nil { + return fmt.Errorf("failed to snapshot metadata file: %w", err) + } + + err = s.keys.snapshot() + if err != nil { + return fmt.Errorf("failed to snapshot key file: %w", err) + } + + for shardIndex, shard := range s.shards { + err = shard.snapshot() + if err != nil { + return fmt.Errorf("failed to snapshot value file for shard %d: %w", shardIndex, err) + } + } + + return nil +} + +// Check if this segment is actually a snapshot. A snapshot will be backed up by symlinks, while a real segment +// will have real files. +func (s *Segment) IsSnapshot() (bool, error) { + metadataPath := s.metadata.path() + + fileInfo, err := os.Lstat(metadataPath) + if err != nil { + return false, fmt.Errorf("failed to get file info for metadata path %s: %w", metadataPath, err) + } + + return fileInfo.Mode()&os.ModeSymlink != 0, nil +} + +// Seal flushes all data to disk and finalizes the metadata. Returns addresses that became durable as a result of +// this method call. After this method is called, no more data can be written to this segment. 
+func (s *Segment) Seal(now time.Time) ([]*types.ScopedKey, error) {
+	flushWaitFunction, err := s.flush(true)
+	if err != nil {
+		return nil, fmt.Errorf("failed to flush segment: %w", err)
+	}
+	addresses, err := flushWaitFunction()
+	if err != nil {
+		return nil, fmt.Errorf("failed to flush segment: %w", err)
+	}
+
+	// Seal the metadata file.
+	err = s.metadata.seal(now, s.keyCount)
+	if err != nil {
+		return nil, fmt.Errorf("failed to seal metadata file: %w", err)
+	}
+
+	unflushedKeyCount := s.unflushedKeyCount.Load()
+	if unflushedKeyCount != 0 {
+		return nil, fmt.Errorf("segment %d has %d unflushed keys", s.index, unflushedKeyCount)
+	}
+
+	return addresses, nil
+}
+
+// IsSealed returns true if the segment is sealed, and false otherwise.
+func (s *Segment) IsSealed() bool {
+	return s.metadata.sealed
+}
+
+// GetSealTime returns the time at which the segment was sealed. If the segment is not sealed, this method will return
+// the zero time.
+func (s *Segment) GetSealTime() time.Time {
+	return time.Unix(0, int64(s.metadata.lastValueTimestamp))
+}
+
+// Reserve reserves the segment, preventing it from being deleted. Returns true if the reservation was successful, and
+// false otherwise.
+func (s *Segment) Reserve() bool {
+	for {
+		reservations := s.reservationCount.Load()
+		if reservations <= 0 {
+			return false
+		}
+
+		if s.reservationCount.CompareAndSwap(reservations, reservations+1) {
+			return true
+		}
+	}
+}
+
+// Release releases a reservation held on this segment. A segment cannot be deleted until all reservations on it
+// have been released. The last call to Release() that releases the final reservation schedules the segment for
+// asynchronous deletion.
+func (s *Segment) Release() {
+	reservations := s.reservationCount.Add(-1)
+
+	if reservations > 0 {
+		return
+	}
+
+	if reservations < 0 {
+		// This should be impossible.
+ s.errorMonitor.Panic( + fmt.Errorf("segment %d has negative reservation count: %d", s.index, reservations)) + } + + go func() { + err := s.delete() + if err != nil { + s.errorMonitor.Panic(fmt.Errorf("failed to delete segment: %w", err)) + } + }() +} + +// BlockUntilFullyDeleted blocks until the segment is fully deleted. If the segment is not yet fully released, +// this method will block until it is. This method should only be called once per segment (the second call +// will block forever!). +func (s *Segment) BlockUntilFullyDeleted() error { + _, err := util.Await(s.errorMonitor, s.deletionChannel) + if err != nil { + return fmt.Errorf("failed to await segment deletion: %w", err) + } + return nil +} + +// delete deletes the segment from disk. +func (s *Segment) delete() error { + defer func() { + s.deletionChannel <- struct{}{} + }() + + err := s.keys.delete() + if err != nil { + return fmt.Errorf("failed to delete key file, segment %d: %w", s.index, err) + } + for shardIndex, shard := range s.shards { + err = shard.delete() + if err != nil { + return fmt.Errorf("failed to delete value file, segment %d, shard %d: %w", s.index, shardIndex, err) + } + } + err = s.metadata.delete() + if err != nil { + return fmt.Errorf("failed to delete metadata file, segment %d: %w", s.index, err) + } + + // The next segment is now eligible for deletion once it is fully released by other reservation holders. + if s.nextSegment != nil { + s.nextSegment.Release() + } + + return nil +} + +func (s *Segment) String() string { + var sealedString string + if s.metadata.sealed { + sealedString = "sealed" + } else { + sealedString = "unsealed" + } + + return fmt.Sprintf("[seg %d - %s]", s.index, sealedString) +} + +// handleShardFlushRequest handles a request to flush a shard to disk. 
+func (s *Segment) handleShardFlushRequest(shard uint32, request *shardFlushRequest) {
+	if request.seal {
+		err := s.shards[shard].seal()
+		if err != nil {
+			s.errorMonitor.Panic(fmt.Errorf("failed to seal value file: %w", err))
+		}
+	} else {
+		err := s.shards[shard].flush()
+		if err != nil {
+			s.errorMonitor.Panic(fmt.Errorf("failed to flush value file: %w", err))
+		}
+	}
+	request.completionChannel <- struct{}{}
+}
+
+// handleShardWrite applies a single write operation to a shard.
+func (s *Segment) handleShardWrite(shard uint32, data *valueToWrite) {
+	firstByteIndex, err := s.shards[shard].write(data.value)
+	if err != nil {
+		s.errorMonitor.Panic(fmt.Errorf("failed to write value to value file: %w", err))
+	}
+
+	if firstByteIndex != data.expectedFirstByteIndex {
+		// This should never happen. But it's a good sanity check.
+		s.errorMonitor.Panic(
+			fmt.Errorf("expected first byte index %d, got %d", data.expectedFirstByteIndex, firstByteIndex))
+	}
+}
+
+// handleKeyFileWrite writes a key to the key file.
+func (s *Segment) handleKeyFileWrite(data *types.ScopedKey) {
+	err := s.keys.write(data)
+	if err != nil {
+		s.errorMonitor.Panic(fmt.Errorf("failed to write key to key file: %w", err))
+	}
+}
+
+// handleKeyFileFlushRequest handles a request to flush the key file to disk.
+func (s *Segment) handleKeyFileFlushRequest(request *keyFileFlushRequest, unflushedKeys []*types.ScopedKey) {
+	if request.seal {
+		err := s.keys.seal()
+		if err != nil {
+			s.errorMonitor.Panic(fmt.Errorf("failed to seal key file: %w", err))
+		}
+	} else {
+		err := s.keys.flush()
+		if err != nil {
+			s.errorMonitor.Panic(fmt.Errorf("failed to flush key file: %w", err))
+		}
+	}
+
+	request.completionChannel <- &keyFileFlushResponse{
+		addresses: unflushedKeys,
+	}
+}
+
+// shardFlushRequest is a message sent to shard control loops to request that they flush their data to disk.
+type shardFlushRequest struct {
+	// If true, seal the shard after flushing. If false, do not seal the shard.
+	seal bool
+
+	// As each shard finishes its flush, it will send an object to this channel.
+	completionChannel chan struct{}
+}
+
+// valueToWrite is a message sent to the shard control loop to request that it write a value to the value file.
+type valueToWrite struct {
+	value                  []byte
+	expectedFirstByteIndex uint32
+}
+
+// shardControlLoop is the main loop for performing modifications to a particular shard. Each shard is managed
+// by its own goroutine, which is running this function.
+func (s *Segment) shardControlLoop(shard uint32) {
+	for {
+		select {
+		case <-s.errorMonitor.ImmediateShutdownRequired():
+			s.logger.Infof("segment %d shard %d control loop exiting, context cancelled", s.index, shard)
+			return
+		case operation := <-s.shardChannels[shard]:
+			if flushRequest, ok := operation.(*shardFlushRequest); ok {
+				s.handleShardFlushRequest(shard, flushRequest)
+				if flushRequest.seal {
+					// After sealing, we can exit the control loop.
+					return
+				}
+			} else if data, ok := operation.(*valueToWrite); ok {
+				s.handleShardWrite(shard, data)
+				continue
+			} else {
+				s.errorMonitor.Panic(
+					fmt.Errorf("unknown operation type in shard control loop: %T", operation))
+			}
+		}
+	}
+}
+
+// keyFileFlushRequest is a message sent to the key file control loop to request that it flush its data to disk.
+type keyFileFlushRequest struct {
+	// If true, seal the key file after flushing. If false, do not seal the key file.
+	seal bool
+
+	// Once the key file finishes its flush, a response containing the addresses of the flushed keys is sent on
+	// this channel.
+	completionChannel chan *keyFileFlushResponse
+}
+
+// keyFileFlushResponse is a message sent from the key file control loop to the caller of Flush to indicate that the
+// key file has been flushed.
+type keyFileFlushResponse struct {
+	addresses []*types.ScopedKey
+}
+
+// keyFileControlLoop is the main loop for performing modifications to the key file. This goroutine is responsible
+// for writing key-address pairs to the key file.
+func (s *Segment) keyFileControlLoop() {
+	unflushedKeys := make([]*types.ScopedKey, 0, unflushedKeysInitialCapacity)
+
+	for {
+		select {
+		case <-s.errorMonitor.ImmediateShutdownRequired():
+			s.logger.Infof("segment %d key file control loop exiting, context cancelled", s.index)
+			return
+		case operation := <-s.keyFileChannel:
+
+			if flushRequest, ok := operation.(*keyFileFlushRequest); ok {
+				s.handleKeyFileFlushRequest(flushRequest, unflushedKeys)
+				unflushedKeys = make([]*types.ScopedKey, 0, unflushedKeysInitialCapacity)
+
+				if flushRequest.seal {
+					// After sealing, we can exit the control loop.
+					return
+				}
+
+			} else if data, ok := operation.(*types.ScopedKey); ok {
+				s.handleKeyFileWrite(data)
+				unflushedKeys = append(unflushedKeys, data)
+
+			} else {
+				s.errorMonitor.Panic(
+					fmt.Errorf("unknown operation type in key file control loop: %T", operation))
+			}
+		}
+	}
+}
+
+// GetMetadataFilePath returns the path to the metadata file for this segment.
+func (s *Segment) GetMetadataFilePath() string {
+	return s.metadata.path()
+}
+
+// GetKeyFilePath returns the path to the key file for this segment.
+func (s *Segment) GetKeyFilePath() string {
+	return s.keys.path()
+}
+
+// GetValueFilePaths returns a list of file paths for all value files in this segment.
+func (s *Segment) GetValueFilePaths() []string {
+	paths := make([]string, 0, len(s.shards))
+	for _, shard := range s.shards {
+		paths = append(paths, shard.path())
+	}
+	return paths
+}
+
+// GetFilePaths returns a list of file paths for all files that make up this segment.
+func (s *Segment) GetFilePaths() []string { + filePaths := make([]string, 0, 2+len(s.shards)) + filePaths = append(filePaths, s.GetMetadataFilePath()) + filePaths = append(filePaths, s.GetKeyFilePath()) + filePaths = append(filePaths, s.GetValueFilePaths()...) + return filePaths +} diff --git a/sei-db/db_engine/litt/disktable/segment/segment_path.go b/sei-db/db_engine/litt/disktable/segment/segment_path.go new file mode 100644 index 0000000000..d223fa1a4b --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/segment_path.go @@ -0,0 +1,152 @@ +//go:build littdb_wip + +package segment + +import ( + "fmt" + "os" + "path" + "path/filepath" + + "github.com/Layr-Labs/eigenda/litt/util" +) + +// The name of the directory where segment files are stored. The segment directory is created at +// "$STORAGE_PATH/$TABLE_NAME/segments". Each table has at least one segment directory. Tables may +// have multiple segment directories if more than one path is provided to Litt.Config.Paths. +const SegmentDirectory = "segments" + +// The name of the directory where hard links to segment files are stored for snapshotting (if enabled). +// The hard link directory is created at "$STORAGE_PATH/$TABLE_NAME/snapshot". +const HardLinkDirectory = "snapshot" + +// SegmentPath encapsulates various file paths utilized by segment files. +type SegmentPath struct { + // The directory where the segment file is stored. + segmentDirectory string + // If snapshotting is enabled, the directory where a Snapshot will put a hard link to the segment file. + // An empty string if snapshotting is not enabled. + hardlinkPath string + // If snapshotting is enabled, the directory where a Snapshot will put a soft link to the hard link of a + // segment file. An empty string if snapshotting is not enabled. + softlinkPath string +} + +// NewSegmentPath creates a new SegmentPath. Each segment file's location on disk is determined by a SegmentPath object. 
+// +// The storageRoot is a location where LittDB is storing data, i.e. one of the paths from Litt.Config.Paths. +// +// softlinkRoot will be an empty string if snapshotting is not enabled, or a path to the root directory where +// Snapshot soft links will be created. The presence (or absence) of this path is used by LittDB to +// determine if snapshotting is enabled. +// +// The tableName is the name of the table that owns the segment file. +func NewSegmentPath( + storageRoot string, + softlinkRoot string, + tableName string, +) (*SegmentPath, error) { + + if storageRoot == "" { + return nil, fmt.Errorf("storage path cannot be empty") + } + + segmentDirectory := path.Join(storageRoot, tableName, SegmentDirectory) + + softlinkPath := "" + hardLinkPath := "" + if softlinkRoot != "" { + softlinkPath = path.Join(softlinkRoot, tableName, SegmentDirectory) + hardLinkPath = path.Join(storageRoot, tableName, HardLinkDirectory) + } + + return &SegmentPath{ + segmentDirectory: segmentDirectory, + hardlinkPath: hardLinkPath, + softlinkPath: softlinkPath, + }, nil +} + +// BuildSegmentPaths creates a list of SegmentPath objects for each storage root provided. +func BuildSegmentPaths( + storageRoots []string, + softlinkRoot string, + tableName string, +) ([]*SegmentPath, error) { + segmentPaths := make([]*SegmentPath, len(storageRoots)) + for i, storageRoot := range storageRoots { + segmentPath, err := NewSegmentPath(storageRoot, softlinkRoot, tableName) + if err != nil { + return nil, fmt.Errorf("error building segment path: %v", err) + } + segmentPaths[i] = segmentPath + } + return segmentPaths, nil +} + +// SegmentDirectory returns the parent directory where segment files are stored. +func (p *SegmentPath) SegmentDirectory() string { + return p.segmentDirectory +} + +// HardlinkPath returns the path where hard links to segment files will be created for snapshotting. 
+func (p *SegmentPath) HardlinkPath() string { + return p.hardlinkPath +} + +// SoftlinkPath returns the path where soft links to hard links of segment files will be created for snapshotting. +func (p *SegmentPath) SoftlinkPath() string { + return p.softlinkPath +} + +// snapshottingEnabled checks if snapshotting is enabled. +func (p *SegmentPath) snapshottingEnabled() bool { + return p.softlinkPath != "" +} + +// MakeDirectories creates the necessary directories described by the SegmentPath if they do not already exist. +func (p *SegmentPath) MakeDirectories(fsync bool) error { + err := util.EnsureDirectoryExists(p.segmentDirectory, fsync) + if err != nil { + return fmt.Errorf("failed to ensure segment directory exists: %w", err) + } + + if p.snapshottingEnabled() { + err = util.EnsureDirectoryExists(p.hardlinkPath, fsync) + if err != nil { + return fmt.Errorf("failed to ensure hard link directory exists: %w", err) + } + + err = util.EnsureDirectoryExists(p.softlinkPath, fsync) + if err != nil { + return fmt.Errorf("failed to ensure soft link directory exists: %w", err) + } + } + + return nil +} + +// Snapshot creates a hard link to the file in the Snapshot directory, and a symlink to that hard link in the soft link +// directory. The fileName should just be the name of the file, not its full path. The file is expected to be in the +// segmentDirectory. 
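+//
+// As an illustration (paths hypothetical), snapshotting "3.metadata" for table "mytable" with storage root "/data"
+// and snapshot root "/snap" produces:
+//
+//	hard link: /data/mytable/snapshot/3.metadata (same inode as /data/mytable/segments/3.metadata)
+//	symlink:   /snap/mytable/segments/3.metadata -> /data/mytable/snapshot/3.metadata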
+func (p *SegmentPath) Snapshot(fileName string) error { + if !p.snapshottingEnabled() { + return fmt.Errorf("snapshotting is not enabled, cannot Snapshot file %s", fileName) + } + + sourcePath := filepath.Join(p.segmentDirectory, fileName) + hardlinkPath := filepath.Join(p.hardlinkPath, fileName) + symlinkPath := filepath.Join(p.softlinkPath, fileName) + + err := os.Link(sourcePath, hardlinkPath) + if err != nil && !os.IsExist(err) { + return fmt.Errorf("failed to create hard link from %s to %s: %v", sourcePath, hardlinkPath, err) + } + + err = os.Symlink(hardlinkPath, symlinkPath) + if err != nil { + return fmt.Errorf("failed to create symlink from %s to %s: %v", hardlinkPath, symlinkPath, err) + } + + return nil +} diff --git a/sei-db/db_engine/litt/disktable/segment/segment_path_test.go b/sei-db/db_engine/litt/disktable/segment/segment_path_test.go new file mode 100644 index 0000000000..91ae927663 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/segment_path_test.go @@ -0,0 +1,84 @@ +//go:build littdb_wip + +package segment + +import ( + "fmt" + "path" + "testing" + + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/stretchr/testify/require" +) + +func TestSegmentPathWithSnapshotDir(t *testing.T) { + dir := t.TempDir() + + snapshotDir := path.Join(dir, "snapshot") + roots := make([]string, 0, 10) + for i := 0; i < 10; i++ { + roots = append(roots, path.Join(dir, fmt.Sprintf("%d", i))) + } + + tableName := "table" + + segmentPaths, err := BuildSegmentPaths(roots, snapshotDir, tableName) + require.NoError(t, err) + + for i, segmentPath := range segmentPaths { + + require.True(t, segmentPath.snapshottingEnabled()) + require.Equal(t, path.Join(roots[i], tableName, SegmentDirectory), segmentPath.SegmentDirectory()) + require.Equal(t, path.Join(roots[i], tableName, HardLinkDirectory), segmentPath.HardlinkPath()) + require.Equal(t, path.Join(snapshotDir, tableName, SegmentDirectory), segmentPath.SoftlinkPath()) + + err = 
segmentPath.MakeDirectories(false) + require.NoError(t, err) + + exists, err := util.Exists(segmentPath.SegmentDirectory()) + require.NoError(t, err) + require.True(t, exists, "Segment directory should exist: %s", segmentPath.SegmentDirectory()) + + exists, err = util.Exists(segmentPath.HardlinkPath()) + require.NoError(t, err) + require.True(t, exists, "Hardlink path should exist: %s", segmentPath.HardlinkPath()) + + exists, err = util.Exists(segmentPath.SoftlinkPath()) + require.NoError(t, err) + require.True(t, exists, "Softlink path should exist: %s", segmentPath.SoftlinkPath()) + } +} + +func TestSegmentPathWithoutSnapshotDir(t *testing.T) { + dir := t.TempDir() + + roots := make([]string, 0, 10) + for i := 0; i < 10; i++ { + roots = append(roots, path.Join(dir, fmt.Sprintf("%d", i))) + } + + tableName := "table" + + segmentPaths, err := BuildSegmentPaths(roots, "", tableName) + require.NoError(t, err) + + for i, segmentPath := range segmentPaths { + + require.False(t, segmentPath.snapshottingEnabled()) + require.Equal(t, path.Join(roots[i], tableName, SegmentDirectory), segmentPath.SegmentDirectory()) + require.Equal(t, "", segmentPath.HardlinkPath()) + require.Equal(t, "", segmentPath.SoftlinkPath()) + + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + + exists, err := util.Exists(segmentPath.SegmentDirectory()) + require.NoError(t, err) + require.True(t, exists, "Segment directory should exist: %s", segmentPath.SegmentDirectory()) + + // Since we are not snapshotting, we shouldn't create this directory. 
+		exists, err = util.Exists(segmentPath.HardlinkPath())
+		require.NoError(t, err)
+		require.False(t, exists, "Hardlink path should not exist: %s", segmentPath.HardlinkPath())
+	}
+}
diff --git a/sei-db/db_engine/litt/disktable/segment/segment_scanner.go b/sei-db/db_engine/litt/disktable/segment/segment_scanner.go
new file mode 100644
index 0000000000..5f516e77ab
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/segment/segment_scanner.go
@@ -0,0 +1,379 @@
+//go:build littdb_wip
+
+package segment
+
+import (
+	"fmt"
+	"math"
+	"os"
+	"path"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+)
+
+// scanDirectories scans directories for segment files and returns a map of metadata, key, and value files.
+// Also returns a list of garbage files that should be deleted. Does not do anything to files with unrecognized
+// extensions.
+func scanDirectories(logger logging.Logger, segmentPaths []*SegmentPath) (
+	metadataFiles map[uint32]string,
+	keyFiles map[uint32]string,
+	valueFiles map[uint32][]string,
+	garbageFiles []string,
+	highestSegmentIndex uint32,
+	lowestSegmentIndex uint32,
+	err error) {
+
+	highestSegmentIndex = uint32(0)
+	lowestSegmentIndex = uint32(math.MaxUint32)
+
+	// key is the file's segment index, value is the file's path
+	metadataFiles = make(map[uint32]string)
+	keyFiles = make(map[uint32]string)
+	valueFiles = make(map[uint32][]string)
+
+	garbageFiles = make([]string, 0)
+
+	for _, segmentPath := range segmentPaths {
+		files, err := os.ReadDir(segmentPath.SegmentDirectory())
+		if err != nil {
+			return nil, nil, nil, nil, 0, 0,
+				fmt.Errorf("failed to read directory %s: %v", segmentPath.SegmentDirectory(), err)
+		}
+
+		for _, file := range files {
+			if file.IsDir() {
+				continue
+			}
+
+			fileName := file.Name()
+			extension := path.Ext(fileName)
+			filePath := path.Join(segmentPath.SegmentDirectory(), fileName)
+			var index uint32
+
+			switch extension {
+			case MetadataSwapExtension,
KeyFileSwapExtension: + garbageFiles = append(garbageFiles, filePath) + continue + case MetadataFileExtension: + index, err = getMetadataFileIndex(fileName) + if err != nil { + return nil, nil, nil, nil, + 0, 0, + fmt.Errorf("failed to get file index: %v", err) + } + metadataFiles[index] = filePath + case KeyFileExtension: + index, err = getKeyFileIndex(fileName) + if err != nil { + return nil, nil, nil, nil, + 0, 0, fmt.Errorf("failed to get file index: %v", err) + } + keyFiles[index] = filePath + case ValuesFileExtension: + index, err = getValueFileIndex(fileName) + if err != nil { + return nil, nil, nil, nil, + 0, 0, fmt.Errorf("failed to get file index: %v", err) + } + valueFiles[index] = append(valueFiles[index], filePath) + default: + logger.Debugf("Ignoring unknown file %s", filePath) + continue + } + + if index > highestSegmentIndex { + highestSegmentIndex = index + } + if index < lowestSegmentIndex { + lowestSegmentIndex = index + } + } + } + + if lowestSegmentIndex == math.MaxUint32 { + // No segments found, fix the index. + lowestSegmentIndex = 0 + } + + return metadataFiles, + keyFiles, + valueFiles, + garbageFiles, + highestSegmentIndex, + lowestSegmentIndex, + nil +} + +// diagnoseMissingFile decides what to do with specific missing files. If the segment is either the segment +// with the lowest index or the segment with the highest index, it is possible for files to be missing due to +// non-catastrophic reasons (i.e. a crash during cleanup). If the segment is neither the lowest nor highest segment, +// then missing files signal non-recoverable DB corruption, and an error is returned. +func diagnoseMissingFile( + logger logging.Logger, + index uint32, + lowestFileIndex uint32, + highestFileIndex uint32, + fileType string, + damagedSegments map[uint32]struct{}) error { + + if index == highestFileIndex { + // This can happen if we crash while creating a new segment. Recoverable. 
+		logger.Warnf("Missing %s file for last segment %d", fileType, index)
+		damagedSegments[index] = struct{}{}
+	} else if index == lowestFileIndex {
+		// This can happen when deleting the oldest segment. Recoverable.
+		logger.Warnf("Missing %s file for first segment %d", fileType, index)
+		damagedSegments[index] = struct{}{}
+	} else {
+		// Database is missing internal files. Catastrophic failure.
+		return fmt.Errorf("missing %s file for segment %d", fileType, index)
+	}
+
+	return nil
+}
+
+// lookForMissingFiles ensures that all files that should be present are actually present. Returns an error
+// if files are missing in a way that cannot be recovered. If recoverable, returns a list of orphaned files.
+// An "orphaned file" is defined as a file on disk for a segment that is missing one or more of its files.
+// For example, if a segment has a metadata file but is missing its key file, the metadata file is considered orphaned.
+func lookForMissingFiles(
+	logger logging.Logger,
+	lowestSegmentIndex uint32,
+	highestSegmentIndex uint32,
+	metadataFiles map[uint32]string,
+	keyFiles map[uint32]string,
+	valueFiles map[uint32][]string,
+	fsync bool,
+) (orphanedFiles []string, damagedSegments map[uint32]struct{}, err error) {
+
+	orphanedFiles = make([]string, 0)
+	damagedSegments = make(map[uint32]struct{})
+
+	for segment := lowestSegmentIndex; segment <= highestSegmentIndex; segment++ {
+
+		if segment == 0 && len(metadataFiles) == 0 && len(keyFiles) == 0 && len(valueFiles) == 0 {
+			// Special case, only happens when starting a table from scratch.
+			// Files aren't actually missing, so no need to log anything.
+			break
+		}
+
+		potentialOrphans := make([]string, 0)
+		segmentMissingFiles := false
+
+		// Check for missing metadata file.
+ _, metadataPresent := metadataFiles[segment] + if metadataPresent { + potentialOrphans = append(potentialOrphans, metadataFiles[segment]) + } else { + segmentMissingFiles = true + err := diagnoseMissingFile( + logger, + segment, + lowestSegmentIndex, + highestSegmentIndex, + "metadata", + damagedSegments) + if err != nil { + return nil, nil, err + } + } + + // Check for missing key file. + _, keysPresent := keyFiles[segment] + if keysPresent { + potentialOrphans = append(potentialOrphans, keyFiles[segment]) + } else { + segmentMissingFiles = true + err := diagnoseMissingFile( + logger, + segment, + lowestSegmentIndex, + highestSegmentIndex, + "key", + damagedSegments) + if err != nil { + return nil, nil, err + } + } + + // Check for missing value files (there should be exactly one value file per shard). + if !metadataPresent { + // If the metadata file is missing but we haven't yet returned an error, all of the value files + // are automatically considered orphaned. + orphanedFiles = append(orphanedFiles, valueFiles[segment]...) + } else { + + // We need to know the sharding factor to check for missing value files. + metadataPath := metadataFiles[segment] + metadataDirectory := path.Dir(metadataPath) + + metadata, err := loadMetadataFile(segment, []*SegmentPath{{segmentDirectory: metadataDirectory}}, fsync) + if err != nil { + return nil, nil, + fmt.Errorf("failed to load metadata file: %v", err) + } + + if uint32(len(valueFiles[segment])) > metadata.shardingFactor { + return nil, nil, + fmt.Errorf("too many value files for segment %d, expected at most %d, got %d", + segment, metadata.shardingFactor, len(valueFiles[segment])) + } + + // Catalogue the shards we do have. 
+ shardsPresent := make(map[uint32]struct{}) + for _, vFile := range valueFiles[segment] { + shard, err := getValueFileShard(vFile) + if err != nil { + return nil, nil, + fmt.Errorf("failed to get shard from value file: %v", err) + } + shardsPresent[shard] = struct{}{} + potentialOrphans = append(potentialOrphans, vFile) + } + + // Check that we have each shard. + for shard := uint32(0); shard < metadata.shardingFactor; shard++ { + _, shardPresent := shardsPresent[shard] + if !shardPresent { + segmentMissingFiles = true + err = diagnoseMissingFile( + logger, + segment, + lowestSegmentIndex, + highestSegmentIndex, + fmt.Sprintf("shard-%d", shard), + damagedSegments) + if err != nil { + return nil, nil, err + } + } + } + } + + if segmentMissingFiles { + // If we are missing a file in this segment, all other files in the segment are considered orphaned. + orphanedFiles = append(orphanedFiles, potentialOrphans...) + } + } + + return orphanedFiles, damagedSegments, nil +} + +// deleteOrphanedFiles deletes any files that are in the orphan set. +func deleteOrphanedFiles(logger logging.Logger, orphanedFiles []string) error { + for _, orphanedFile := range orphanedFiles { + logger.Infof("deleting orphaned file %s", orphanedFile) + err := os.Remove(orphanedFile) + if err != nil { + return fmt.Errorf("failed to remove orphaned file %s: %v", orphanedFile, err) + } + } + return nil +} + +// linkSegments links together adjacent segments via SetNextSegment(). +func linkSegments(lowestSegmentIndex uint32, highestSegmentIndex uint32, segments map[uint32]*Segment) error { + if lowestSegmentIndex == highestSegmentIndex { + // Only one segment, nothing to link. This is checked explicitly to avoid 0-1 underflow. 
+ return nil + } + + for i := lowestSegmentIndex; i < highestSegmentIndex; i++ { + first, ok := segments[i] + if !ok { + return fmt.Errorf("missing segment %d", i) + } + second, ok := segments[i+1] + if !ok { + return fmt.Errorf("missing segment %d", i+1) + } + first.SetNextSegment(second) + } + return nil +} + +// GatherSegmentFiles scans a directory for segment files and loads them into memory. +func GatherSegmentFiles( + logger logging.Logger, + errorMonitor *util.ErrorMonitor, + segmentPaths []*SegmentPath, + snapshottingEnabled bool, + now time.Time, + cleanOrphans bool, + fsync bool, +) (lowestSegmentIndex uint32, highestSegmentIndex uint32, segments map[uint32]*Segment, err error) { + + // Scan the root directories for segment files. + metadataFiles, keyFiles, valueFiles, garbageFiles, highestSegmentIndex, lowestSegmentIndex, err := + scanDirectories(logger, segmentPaths) + if err != nil { + return 0, 0, nil, + fmt.Errorf("failed to scan directory: %v", err) + } + + segments = make(map[uint32]*Segment) + + // Delete any garbage files. Ignore files with unrecognized extensions. + for _, garbageFile := range garbageFiles { + logger.Infof("deleting file %s", garbageFile) + err = os.Remove(garbageFile) + if err != nil { + return 0, 0, nil, + fmt.Errorf("failed to remove garbage file %s: %v", garbageFile, err) + } + } + + // Check for missing files. + orphanedFiles, damagedSegments, err := lookForMissingFiles( + logger, + lowestSegmentIndex, + highestSegmentIndex, + metadataFiles, + keyFiles, + valueFiles, + fsync) + if err != nil { + return 0, 0, nil, + fmt.Errorf("there are one or more missing files: %v", err) + } + + if cleanOrphans { + // Clean up any orphaned segment files. + err = deleteOrphanedFiles(logger, orphanedFiles) + if err != nil { + return 0, 0, nil, + fmt.Errorf("failed to delete orphaned files: %v", err) + } + } + + if len(metadataFiles) > 0 { + // Adjust the segment range to exclude orphaned segments. 
+		if _, ok := damagedSegments[highestSegmentIndex]; ok {
+			highestSegmentIndex--
+		}
+		if _, ok := damagedSegments[lowestSegmentIndex]; ok {
+			lowestSegmentIndex++
+		}
+
+		// Load all healthy segments.
+		for i := lowestSegmentIndex; i <= highestSegmentIndex; i++ {
+			segment, err := LoadSegment(logger, errorMonitor, i, segmentPaths, snapshottingEnabled, now, fsync)
+			if err != nil {
+				return 0, 0, nil,
+					fmt.Errorf("failed to load segment %d: %v", i, err)
+			}
+			segments[i] = segment
+		}
+
+		// Stitch together the segments.
+		err = linkSegments(lowestSegmentIndex, highestSegmentIndex, segments)
+		if err != nil {
+			return 0, 0, nil,
+				fmt.Errorf("failed to link segments: %v", err)
+		}
+	}
+
+	return lowestSegmentIndex, highestSegmentIndex, segments, nil
+}
diff --git a/sei-db/db_engine/litt/disktable/segment/segment_test.go b/sei-db/db_engine/litt/disktable/segment/segment_test.go
new file mode 100644
index 0000000000..d730197263
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/segment/segment_test.go
@@ -0,0 +1,525 @@
+//go:build littdb_wip
+
+package segment
+
+import (
+	"bytes"
+	"os"
+	"sort"
+	"testing"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/litt/types"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigenda/test"
+	"github.com/Layr-Labs/eigenda/test/random"
+	"github.com/stretchr/testify/require"
+)
+
+// countFilesInDirectory returns the number of files in the given directory.
+func countFilesInDirectory(t *testing.T, directory string) int { + files, err := os.ReadDir(directory) + require.NoError(t, err) + return len(files) +} + +func TestWriteAndReadSegmentSingleShard(t *testing.T) { + t.Parallel() + + ctx := t.Context() + rand := random.NewTestRandom() + logger := test.GetLogger() + directory := t.TempDir() + + index := rand.Uint32() + valueCount := rand.Int32Range(1000, 2000) + keys := make([][]byte, valueCount) + values := make([][]byte, valueCount) + for i := 0; i < int(valueCount); i++ { + key := rand.PrintableVariableBytes(1, 100) + keys[i] = key + values[i] = rand.PrintableVariableBytes(1, 100) + } + + // a map from keys to values + expectedValues := make(map[string][]byte) + + // a map from keys to addresses + addressMap := make(map[string]types.Address) + + expectedLargestShardSize := uint64(0) + + salt := ([16]byte)(rand.Bytes(16)) + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + seg, err := CreateSegment( + logger, + util.NewErrorMonitor(ctx, logger, nil), + index, + []*SegmentPath{segmentPath}, + false, + 1, + salt, + false) + + require.NoError(t, err) + + // Write values to the segment. + for i := 0; i < int(valueCount); i++ { + key := keys[i] + value := values[i] + expectedValues[string(key)] = value + + expectedLargestShardSize += uint64(len(value)) + 4 /* uint32 length */ + + _, _, err := seg.Write(&types.KVPair{Key: key, Value: value}) + largestShardSize := seg.GetMaxShardSize() + require.NoError(t, err) + require.Equal(t, expectedLargestShardSize, largestShardSize) + + // Occasionally flush the segment to disk. 
+ if rand.BoolWithProbability(0.25) { + flushFunction, err := seg.Flush() + require.NoError(t, err) + flushedKeys, err := flushFunction() + require.NoError(t, err) + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + } + + // Occasionally scan all addresses and values in the segment. + if rand.BoolWithProbability(0.1) { + flushFunction, err := seg.Flush() + require.NoError(t, err) + flushedKeys, err := flushFunction() + require.NoError(t, err) + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + + for k, addr := range addressMap { + readValue, err := seg.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + } + } + + // Seal the segment and read all keys and values. + require.False(t, seg.IsSealed()) + sealTime := rand.Time() + flushedKeys, err := seg.Seal(sealTime) + require.NoError(t, err) + require.True(t, seg.IsSealed()) + + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + + require.Equal(t, sealTime.UnixNano(), seg.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment, err := seg.GetKeys() + require.NoError(t, err) + for i, ka := range keysFromSegment { + require.Equal(t, ka.Key, keys[i]) + } + + // Reopen the segment and read all keys and values. 
+ seg2, err := LoadSegment( + logger, + util.NewErrorMonitor(ctx, logger, nil), + index, + []*SegmentPath{segmentPath}, + false, + time.Now(), + false) + require.NoError(t, err) + require.True(t, seg2.IsSealed()) + + require.Equal(t, sealTime.UnixNano(), seg2.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg2.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment2, err := seg2.GetKeys() + require.NoError(t, err) + require.Equal(t, keysFromSegment, keysFromSegment2) + + // delete the segment + require.Equal(t, 3, countFilesInDirectory(t, segmentPath.SegmentDirectory())) + + err = seg.delete() + require.NoError(t, err) + + require.Equal(t, 0, countFilesInDirectory(t, segmentPath.SegmentDirectory())) +} + +func TestWriteAndReadSegmentMultiShard(t *testing.T) { + t.Parallel() + + ctx := t.Context() + rand := random.NewTestRandom() + logger := test.GetLogger() + directory := t.TempDir() + + index := rand.Uint32() + valueCount := rand.Int32Range(1000, 2000) + shardCount := rand.Uint32Range(2, 32) + keys := make([][]byte, valueCount) + values := make([][]byte, valueCount) + for i := 0; i < int(valueCount); i++ { + key := rand.PrintableVariableBytes(1, 100) + keys[i] = key + values[i] = rand.PrintableVariableBytes(1, 100) + } + + // a map from keys to values + expectedValues := make(map[string][]byte) + + // a map from keys to addresses + addressMap := make(map[string]types.Address) + + salt := ([16]byte)(rand.Bytes(16)) + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + seg, err := CreateSegment( + logger, + util.NewErrorMonitor(ctx, logger, nil), + index, + []*SegmentPath{segmentPath}, + false, + shardCount, + salt, + false) + + require.NoError(t, err) + + // Write values to the segment. 
+ for i := 0; i < int(valueCount); i++ { + key := keys[i] + value := values[i] + expectedValues[string(key)] = value + + _, _, err := seg.Write(&types.KVPair{Key: key, Value: value}) + require.NoError(t, err) + largestShardSize := seg.GetMaxShardSize() + require.True(t, largestShardSize >= uint64(len(value)+4)) + + // Occasionally flush the segment to disk. + if rand.BoolWithProbability(0.25) { + flushFunction, err := seg.Flush() + require.NoError(t, err) + flushedKeys, err := flushFunction() + require.NoError(t, err) + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + } + + // Occasionally scan all addresses and values in the segment. + if rand.BoolWithProbability(0.1) { + flushFunction, err := seg.Flush() + require.NoError(t, err) + flushedKeys, err := flushFunction() + require.NoError(t, err) + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + + for k, addr := range addressMap { + readValue, err := seg.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + } + } + + // Seal the segment and read all keys and values. 
+ require.False(t, seg.IsSealed()) + sealTime := rand.Time() + flushedKeys, err := seg.Seal(sealTime) + require.NoError(t, err) + require.True(t, seg.IsSealed()) + + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + + require.Equal(t, sealTime.UnixNano(), seg.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment, err := seg.GetKeys() + require.NoError(t, err) + // Sort keys. With more than one shard, keys may have random order. + sort.Slice(keys, func(i, j int) bool { + return bytes.Compare(keys[i], keys[j]) < 0 + }) + sort.Slice(keysFromSegment, func(i, j int) bool { + return bytes.Compare(keysFromSegment[i].Key, keysFromSegment[j].Key) < 0 + }) + for i, ka := range keysFromSegment { + require.Equal(t, ka.Key, keys[i]) + } + + // Reopen the segment and read all keys and values. 
+ seg2, err := LoadSegment( + logger, + util.NewErrorMonitor(ctx, logger, nil), + index, + []*SegmentPath{segmentPath}, + false, + time.Now(), + false) + require.NoError(t, err) + require.True(t, seg2.IsSealed()) + + require.Equal(t, sealTime.UnixNano(), seg2.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg2.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment2, err := seg2.GetKeys() + sort.Slice(keysFromSegment2, func(i, j int) bool { + return bytes.Compare(keysFromSegment2[i].Key, keysFromSegment2[j].Key) < 0 + }) + require.NoError(t, err) + require.Equal(t, keysFromSegment, keysFromSegment2) + + // delete the segment + require.Equal(t, int(2+shardCount), countFilesInDirectory(t, segmentPath.SegmentDirectory())) + + err = seg.delete() + require.NoError(t, err) + + require.Equal(t, 0, countFilesInDirectory(t, segmentPath.SegmentDirectory())) +} + +// Tests writing and reading, but allocates more shards than values written to force some shards to be empty. 
+func TestWriteAndReadColdShard(t *testing.T) {
+	t.Parallel()
+
+	ctx := t.Context()
+	rand := random.NewTestRandom()
+	logger := test.GetLogger()
+	directory := t.TempDir()
+
+	index := rand.Uint32()
+	shardCount := rand.Uint32Range(2, 32)
+	valueCount := shardCount / 2
+	keys := make([][]byte, valueCount)
+	values := make([][]byte, valueCount)
+	for i := 0; i < int(valueCount); i++ {
+		key := rand.PrintableVariableBytes(1, 100)
+		keys[i] = key
+		values[i] = rand.PrintableVariableBytes(1, 100)
+	}
+
+	// a map from keys to values
+	expectedValues := make(map[string][]byte)
+
+	// a map from keys to addresses
+	addressMap := make(map[string]types.Address)
+
+	salt := ([16]byte)(rand.Bytes(16))
+	segmentPath, err := NewSegmentPath(directory, "", "table")
+	require.NoError(t, err)
+	err = segmentPath.MakeDirectories(false)
+	require.NoError(t, err)
+	seg, err := CreateSegment(
+		logger,
+		util.NewErrorMonitor(ctx, logger, nil),
+		index,
+		[]*SegmentPath{segmentPath},
+		false,
+		shardCount,
+		salt,
+		false)
+
+	require.NoError(t, err)
+
+	// Write values to the segment.
+	for i := 0; i < int(valueCount); i++ {
+		key := keys[i]
+		value := values[i]
+		expectedValues[string(key)] = value
+
+		_, _, err := seg.Write(&types.KVPair{Key: key, Value: value})
+		require.NoError(t, err)
+		largestShardSize := seg.GetMaxShardSize()
+		require.True(t, largestShardSize >= uint64(len(value)+4))
+	}
+
+	// Seal the segment and read all keys and values.
+ require.False(t, seg.IsSealed()) + sealTime := rand.Time() + flushedKeys, err := seg.Seal(sealTime) + require.NoError(t, err) + require.True(t, seg.IsSealed()) + + for _, flushedKey := range flushedKeys { + addressMap[string(flushedKey.Key)] = flushedKey.Address + } + + // after flushing, the address map should be the same size as the expected values map + require.Equal(t, len(expectedValues), len(addressMap)) + + require.Equal(t, sealTime.UnixNano(), seg.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment, err := seg.GetKeys() + require.NoError(t, err) + // Sort keys. With more than one shard, keys may have random order. + sort.Slice(keys, func(i, j int) bool { + return bytes.Compare(keys[i], keys[j]) < 0 + }) + sort.Slice(keysFromSegment, func(i, j int) bool { + return bytes.Compare(keysFromSegment[i].Key, keysFromSegment[j].Key) < 0 + }) + for i, ka := range keysFromSegment { + require.Equal(t, ka.Key, keys[i]) + } + + // Reopen the segment and read all keys and values. 
+ seg2, err := LoadSegment( + logger, + util.NewErrorMonitor(ctx, logger, nil), + index, + []*SegmentPath{segmentPath}, + false, + time.Now(), + false) + require.NoError(t, err) + require.True(t, seg2.IsSealed()) + + require.Equal(t, sealTime.UnixNano(), seg2.GetSealTime().UnixNano()) + + for k, addr := range addressMap { + readValue, err := seg2.Read([]byte(k), addr) + require.NoError(t, err) + require.Equal(t, expectedValues[k], readValue) + } + + keysFromSegment2, err := seg2.GetKeys() + sort.Slice(keysFromSegment2, func(i, j int) bool { + return bytes.Compare(keysFromSegment2[i].Key, keysFromSegment2[j].Key) < 0 + }) + require.NoError(t, err) + require.Equal(t, keysFromSegment, keysFromSegment2) + + // delete the segment + require.Equal(t, int(2+shardCount), countFilesInDirectory(t, segmentPath.SegmentDirectory())) + + err = seg.delete() + require.NoError(t, err) + + require.Equal(t, 0, countFilesInDirectory(t, segmentPath.SegmentDirectory())) +} + +func TestGetFilePaths(t *testing.T) { + ctx := t.Context() + rand := random.NewTestRandom() + logger := test.GetLogger() + errorMonitor := util.NewErrorMonitor(ctx, logger, nil) + + index := rand.Uint32() + shardingFactor := rand.Uint32Range(1, 10) + salt := make([]byte, 16) + + segmentPath, err := NewSegmentPath(t.TempDir(), "", "table") + require.NoError(t, err) + + err = os.MkdirAll(segmentPath.SegmentDirectory(), 0755) + require.NoError(t, err) + + segment, err := CreateSegment( + logger, + errorMonitor, + index, + []*SegmentPath{segmentPath}, + false, + shardingFactor, + ([16]byte)(salt), + false) + require.NoError(t, err) + + files := segment.GetFilePaths() + filesSet := make(map[string]struct{}) + for _, file := range files { + filesSet[file] = struct{}{} + } + + expectedCount := 0 + + // metadata + _, found := filesSet[segment.metadata.path()] + require.True(t, found) + expectedCount++ + + // key file + _, found = filesSet[segment.keys.path()] + require.True(t, found) + expectedCount++ + + // value files + 
for i := uint32(0); i < shardingFactor; i++ { + _, found = filesSet[segment.shards[i].path()] + require.True(t, found) + expectedCount++ + } + + // make sure there aren't any additional files + require.Equal(t, expectedCount, len(filesSet)) + + // Compare values to functions that return specific file paths. + require.Equal(t, segment.metadata.path(), segment.GetMetadataFilePath()) + require.Equal(t, segment.keys.path(), segment.GetKeyFilePath()) + valueFiles := segment.GetValueFilePaths() + for i := uint32(0); i < shardingFactor; i++ { + require.Equal(t, segment.shards[i].path(), valueFiles[i]) + } +} diff --git a/sei-db/db_engine/litt/disktable/segment/segment_version.go b/sei-db/db_engine/litt/disktable/segment/segment_version.go new file mode 100644 index 0000000000..b172effb37 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/segment_version.go @@ -0,0 +1,22 @@ +//go:build littdb_wip + +package segment + +// SegmentVersion is used to indicate the serialization version of a segment. Whenever serialization formats change +// in segment files, this version should be incremented. +type SegmentVersion uint32 + +const ( + // OldHashFunctionSegmentVersion is the serialization version for the old hash function. + OldHashFunctionSegmentVersion SegmentVersion = 0 + + // SipHashSegmentVersion is the version when the siphash hash function was introduced for sharding. + SipHashSegmentVersion SegmentVersion = 1 + + // ValueSizeSegmentVersion adds the length of values to the key file. Previously, only the key and the address were + // stored in the key file. It also adds the key count to the segment metadata file. + ValueSizeSegmentVersion SegmentVersion = 2 +) + +// LatestSegmentVersion always refers to the latest version of the segment serialization format. 
+const LatestSegmentVersion = ValueSizeSegmentVersion diff --git a/sei-db/db_engine/litt/disktable/segment/value_file.go b/sei-db/db_engine/litt/disktable/segment/value_file.go new file mode 100644 index 0000000000..36fad15873 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/value_file.go @@ -0,0 +1,343 @@ +//go:build littdb_wip + +package segment + +import ( + "bufio" + "encoding/binary" + "fmt" + "io" + "math" + "os" + "path" + "strconv" + "strings" + "sync/atomic" + + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// ValuesFileExtension is the file extension for the values file. This file contains the values for the data +// segment. Value files are written in the form "X-Y.values", where X is the segment index and Y is the shard number. +const ValuesFileExtension = ".values" + +// valueFile represents a file that stores values. +type valueFile struct { + // The logger for the value file. + logger logging.Logger + + // The segment index. + index uint32 + + // The shard number of this value file. + shard uint32 + + // Path data for the segment file. + segmentPath *SegmentPath + + // The file wrapped by the writer. If the file is sealed, this value is nil. + file *os.File + + // The writer for the file. If the file is sealed, this value is nil. + writer *bufio.Writer + + // The current size of the file in bytes. Includes both flushed and unflushed data. + size uint64 + + // The current size of the file, only including flushed data. Protects against reads of partially written values. + flushedSize atomic.Uint64 + + // Whether fsync mode is enabled. If fsync mode is enabled, then each flush operation will invoke the OS fsync + // operation before returning. An fsync operation is required to ensure that data is not sitting in OS level + // in-memory buffers (otherwise, an OS crash may lead to data loss). 
This option is provided for testing, + // as many test scenarios do lots of tiny writes and flushes, and this workload is MUCH slower with fsync + // mode enabled. In production, fsync mode should always be enabled. + fsync bool +} + +// createValueFile creates a new value file. +func createValueFile( + logger logging.Logger, + index uint32, + shard uint32, + segmentPath *SegmentPath, + fsync bool, +) (*valueFile, error) { + + values := &valueFile{ + logger: logger, + index: index, + shard: shard, + segmentPath: segmentPath, + fsync: fsync, + } + + filePath := values.path() + exists, _, err := util.ErrIfNotWritableFile(filePath) + if err != nil { + return nil, fmt.Errorf("file %s has incorrect permissions: %v", filePath, err) + } + + if exists { + return nil, fmt.Errorf("value file %s already exists", filePath) + } + + // Open the file for writing. + file, err := os.OpenFile(filePath, os.O_RDWR|os.O_CREATE, 0644) + if err != nil { + return nil, fmt.Errorf("failed to open value file %s: %v", filePath, err) + } + + values.file = file + values.writer = bufio.NewWriter(file) + + return values, nil +} + +// loadValueFile loads a value file from disk. It looks for the file in the given parent directories until it finds +// the file. If the file is not found, it returns an error. 
+func loadValueFile( + logger logging.Logger, + index uint32, + shard uint32, + segmentPaths []*SegmentPath) (*valueFile, error) { + + valuesFileName := fmt.Sprintf("%d-%d%s", index, shard, ValuesFileExtension) + valuesPath, err := lookForFile(segmentPaths, valuesFileName) + if err != nil { + return nil, fmt.Errorf("failed to find value file: %v", err) + } + if valuesPath == nil { + return nil, fmt.Errorf("value file %s not found", valuesFileName) + } + + values := &valueFile{ + logger: logger, + index: index, + shard: shard, + segmentPath: valuesPath, + fsync: false, + } + + filePath := values.path() + exists, size, err := util.ErrIfNotWritableFile(filePath) + if err != nil { + return nil, fmt.Errorf("file %s has incorrect permissions: %v", filePath, err) + } + + if !exists { + return nil, fmt.Errorf("value file %s does not exist", filePath) + } + + values.size = uint64(size) + values.flushedSize.Store(values.size) + + return values, nil +} + +// getValueFileIndex returns the index of the value file from the file name. Value file names have the form +// "X-Y.values", where X is the segment index and Y is the shard number. +func getValueFileIndex(fileName string) (uint32, error) { + baseName := path.Base(fileName) + strippedName := baseName[:len(baseName)-len(ValuesFileExtension)] + + parts := strings.Split(strippedName, "-") + if len(parts) != 2 { + return 0, fmt.Errorf("invalid value file name %s", fileName) + } + indexString := parts[0] + + index, err := strconv.Atoi(indexString) + if err != nil { + return 0, fmt.Errorf("failed to parse index from file name %s: %v", fileName, err) + } + + return uint32(index), nil +} + +// getValueFileShard returns the shard number of the value file from the file name. Value file names have the form +// "X-Y.values", where X is the segment index and Y is the shard number. 
+func getValueFileShard(fileName string) (uint32, error) {
+	baseName := path.Base(fileName)
+	strippedName := baseName[:len(baseName)-len(ValuesFileExtension)]
+
+	parts := strings.Split(strippedName, "-")
+	if len(parts) != 2 {
+		return 0, fmt.Errorf("invalid value file name %s", fileName)
+	}
+	shardString := parts[1]
+
+	shard, err := strconv.Atoi(shardString)
+	if err != nil {
+		return 0, fmt.Errorf("failed to parse shard from file name %s: %v", fileName, err)
+	}
+
+	return uint32(shard), nil
+}
+
+// Size returns the size of the value file in bytes.
+func (v *valueFile) Size() uint64 {
+	return v.size
+}
+
+// name returns the name of the value file.
+func (v *valueFile) name() string {
+	return fmt.Sprintf("%d-%d%s", v.index, v.shard, ValuesFileExtension)
+}
+
+// path returns the path to the value file.
+func (v *valueFile) path() string {
+	return path.Join(v.segmentPath.SegmentDirectory(), v.name())
+}
+
+// read reads a value from the value file.
+func (v *valueFile) read(firstByteIndex uint32) ([]byte, error) {
+	flushedSize := v.flushedSize.Load()
+	if uint64(firstByteIndex) >= flushedSize {
+		return nil, fmt.Errorf("index %d is out of bounds (current flushed size is %d)",
+			firstByteIndex, flushedSize)
+	}
+
+	file, err := os.OpenFile(v.path(), os.O_RDONLY, 0644)
+	if err != nil {
+		return nil, fmt.Errorf("failed to open value file: %v", err)
+	}
+	defer func() {
+		err = file.Close()
+		if err != nil {
+			v.logger.Errorf("failed to close value file: %v", err)
+		}
+	}()
+
+	_, err = file.Seek(int64(firstByteIndex), 0)
+	if err != nil {
+		return nil, fmt.Errorf("failed to seek in value file: %v", err)
+	}
+	reader := bufio.NewReader(file)
+
+	// Read the length of the value.
+	var length uint32
+	err = binary.Read(reader, binary.BigEndian, &length)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read value length from value file: %v", err)
+	}
+
+	// Read the value itself.
+ value := make([]byte, length) + bytesRead, err := io.ReadFull(reader, value) + if err != nil { + return nil, fmt.Errorf("failed to read value from value file: %v", err) + } + + if uint32(bytesRead) != length { + return nil, fmt.Errorf("failed to read value from value file: read %d bytes, expected %d", bytesRead, length) + } + + return value, nil +} + +// write writes a value to the value file, returning the index of the first byte written. +func (v *valueFile) write(value []byte) (uint32, error) { + if v.writer == nil { + return 0, fmt.Errorf("value file is sealed") + } + + if v.size > math.MaxUint32 { + // No matter what, we can't start a new value if its first byte would be beyond position 2^32. + // This is because we only have 32 bits in an address to store the position of a value's first byte. + return 0, fmt.Errorf("value file already contains %d bytes, cannot add a new value", v.size) + } + + firstByteIndex := uint32(v.size) + + // First, write the length of the value. + err := binary.Write(v.writer, binary.BigEndian, uint32(len(value))) + if err != nil { + return 0, fmt.Errorf("failed to write value length to value file: %v", err) + } + + // Then, write the value itself. + _, err = v.writer.Write(value) + if err != nil { + return 0, fmt.Errorf("failed to write value to value file: %v", err) + } + + v.size += uint64(len(value) + 4) + + return firstByteIndex, nil +} + +// flush writes all unflushed data to disk. +func (v *valueFile) flush() error { + if v.writer == nil { + return fmt.Errorf("value file is sealed") + } + + err := v.writer.Flush() + if err != nil { + return fmt.Errorf("failed to flush value file: %v", err) + } + + if v.fsync { + err = v.file.Sync() + if err != nil { + return fmt.Errorf("failed to sync value file: %v", err) + } + } + + // It is now safe to read the flushed bytes directly from the file. + v.flushedSize.Store(v.size) + + return nil +} + +// seal seals the value file. 
+func (v *valueFile) seal() error { + if v.writer == nil { + return fmt.Errorf("value file is already sealed") + } + + err := v.flush() + if err != nil { + return fmt.Errorf("failed to flush value file: %v", err) + } + + err = v.file.Close() + if err != nil { + return fmt.Errorf("failed to close value file: %v", err) + } + + v.writer = nil + v.file = nil + return nil +} + +// snapshot creates a hard link to the file in the snapshot directory, and a soft link to the hard linked file in the +// soft link directory. Requires that the file is sealed and that snapshotting is enabled. +func (v *valueFile) snapshot() error { + if v.writer != nil { + return fmt.Errorf("file %s is not sealed, cannot take Snapshot", v.path()) + } + + err := v.segmentPath.Snapshot(v.name()) + if err != nil { + return fmt.Errorf("failed to create Snapshot: %v", err) + } + + return nil +} + +// delete deletes the value file. +func (v *valueFile) delete() error { + if v.writer != nil { + return fmt.Errorf("value file is not sealed") + } + + // As an extra safety check, make it so that all future reads fail before they do I/O. 
+ v.flushedSize.Store(0) + + err := util.DeepDelete(v.path()) + if err != nil { + return fmt.Errorf("failed to delete value file %s: %v", v.path(), err) + } + + return nil +} diff --git a/sei-db/db_engine/litt/disktable/segment/value_file_test.go b/sei-db/db_engine/litt/disktable/segment/value_file_test.go new file mode 100644 index 0000000000..6794766a5f --- /dev/null +++ b/sei-db/db_engine/litt/disktable/segment/value_file_test.go @@ -0,0 +1,195 @@ +//go:build littdb_wip + +package segment + +import ( + "os" + "testing" + + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestWriteThenReadValues(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + logger := test.GetLogger() + directory := t.TempDir() + + index := rand.Uint32() + shard := rand.Uint32() + valueCount := rand.Int32Range(100, 200) + values := make([][]byte, valueCount) + expectedFileSize := uint64(0) + for i := 0; i < int(valueCount); i++ { + values[i] = rand.VariableBytes(1, 100) + expectedFileSize += uint64(len(values[i])) + 4 /* length uint32 */ + } + + // A map from the first byte index of the value to the value itself. + addressMap := make(map[uint32][]byte) + + segmentPath, err := NewSegmentPath(directory, "", "table") + require.NoError(t, err) + err = segmentPath.MakeDirectories(false) + require.NoError(t, err) + file, err := createValueFile(logger, index, shard, segmentPath, false) + require.NoError(t, err) + + for _, value := range values { + address, err := file.write(value) + require.NoError(t, err) + addressMap[address] = value + + // Occasionally flush the file to disk. + if rand.BoolWithProbability(0.25) { + err := file.flush() + require.NoError(t, err) + } + + // Occasionally scan all addresses and values in the file. 
+		if rand.BoolWithProbability(0.1) {
+			err = file.flush()
+			require.NoError(t, err)
+			for key, val := range addressMap {
+				readValue, err := file.read(key)
+				require.NoError(t, err)
+				require.Equal(t, val, readValue)
+			}
+		}
+	}
+
+	// Seal the file and read all values.
+	err = file.seal()
+	require.NoError(t, err)
+	for key, val := range addressMap {
+		readValue, err := file.read(key)
+		require.NoError(t, err)
+		require.Equal(t, val, readValue)
+	}
+
+	reportedFileSize := file.size
+	require.Equal(t, expectedFileSize, reportedFileSize)
+	stat, err := os.Stat(file.path())
+	require.NoError(t, err)
+	actualFileSize := uint64(stat.Size())
+	require.Equal(t, actualFileSize, reportedFileSize)
+
+	// Create a new in-memory instance from the on-disk file and verify that it behaves the same.
+	file2, err := loadValueFile(logger, index, shard, []*SegmentPath{segmentPath})
+	require.NoError(t, err)
+	require.Equal(t, file.size, file2.size)
+	for key, val := range addressMap {
+		readValue, err := file2.read(key)
+		require.NoError(t, err)
+		require.Equal(t, val, readValue)
+	}
+
+	// delete the file
+	filePath := file.path()
+	_, err = os.Stat(filePath)
+	require.NoError(t, err)
+
+	err = file.delete()
+	require.NoError(t, err)
+
+	_, err = os.Stat(filePath)
+	require.True(t, os.IsNotExist(err))
+}
+
+func TestReadingTruncatedValueFile(t *testing.T) {
+	t.Parallel()
+	rand := random.NewTestRandom()
+	logger := test.GetLogger()
+	directory := t.TempDir()
+
+	index := rand.Uint32()
+	shard := rand.Uint32()
+	valueCount := rand.Int32Range(100, 200)
+	values := make([][]byte, valueCount)
+	for i := 0; i < int(valueCount); i++ {
+		values[i] = rand.VariableBytes(1, 100)
+	}
+
+	// A map from the first byte index of the value to the value itself.
+	addressMap := make(map[uint32][]byte)
+
+	segmentPath, err := NewSegmentPath(directory, "", "table")
+	require.NoError(t, err)
+	err = segmentPath.MakeDirectories(false)
+	require.NoError(t, err)
+	file, err := createValueFile(logger, index, shard, segmentPath, false)
+	require.NoError(t, err)
+
+	var lastAddress uint32
+	for _, value := range values {
+		address, err := file.write(value)
+		require.NoError(t, err)
+		addressMap[address] = value
+		lastAddress = address
+	}
+
+	err = file.seal()
+	require.NoError(t, err)
+
+	// Truncate the file. Chop off some bytes from the last value, but do not corrupt the length prefix.
+	lastValueLength := len(values[valueCount-1])
+
+	filePath := file.path()
+
+	originalBytes, err := os.ReadFile(filePath)
+	require.NoError(t, err)
+
+	bytesToRemove := rand.Int32Range(1, int32(lastValueLength)+1)
+	bytes := originalBytes[:len(originalBytes)-int(bytesToRemove)]
+
+	err = os.WriteFile(filePath, bytes, 0644)
+	require.NoError(t, err)
+
+	file, err = loadValueFile(logger, index, shard, []*SegmentPath{segmentPath})
+	require.NoError(t, err)
+
+	// We should be able to read all values except for the last one.
+	for key, val := range addressMap {
+		if key == lastAddress {
+			_, err := file.read(key)
+			require.Error(t, err)
+		} else {
+			readValue, err := file.read(key)
+			require.NoError(t, err)
+			require.Equal(t, val, readValue)
+		}
+	}
+
+	// Truncate the file. Corrupt the length prefix of the last value.
+	prefixBytesToRemove := rand.Int32Range(1, 4)
+	bytes = originalBytes[:len(originalBytes)-lastValueLength-int(prefixBytesToRemove)]
+
+	err = os.WriteFile(filePath, bytes, 0644)
+	require.NoError(t, err)
+
+	file, err = loadValueFile(logger, index, shard, []*SegmentPath{segmentPath})
+	require.NoError(t, err)
+
+	// We should be able to read all values except for the last one.
+ for key, val := range addressMap { + if key == lastAddress { + _, err := file.read(key) + require.Error(t, err) + } else { + readValue, err := file.read(key) + require.NoError(t, err) + require.Equal(t, val, readValue) + } + } + + // delete the file + _, err = os.Stat(filePath) + require.NoError(t, err) + + err = file.delete() + require.NoError(t, err) + + _, err = os.Stat(filePath) + require.True(t, os.IsNotExist(err)) +} diff --git a/sei-db/db_engine/litt/disktable/table_metadata.go b/sei-db/db_engine/litt/disktable/table_metadata.go new file mode 100644 index 0000000000..d759df48a6 --- /dev/null +++ b/sei-db/db_engine/litt/disktable/table_metadata.go @@ -0,0 +1,187 @@ +//go:build littdb_wip + +package disktable + +import ( + "encoding/binary" + "fmt" + "os" + "path" + "sync/atomic" + "time" + + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +const tableMetadataSerializationVersion = 0 +const TableMetadataFileName = "table.metadata" +const tableMetadataSize = 16 + +// tableMetadata contains table data that is preserved across restarts. +type tableMetadata struct { + logger logging.Logger + + tableDirectory string + + // the table's TTL, accessed/modified by concurrent goroutines + ttl atomic.Pointer[time.Duration] + + // the table's sharding factor, accessed/modified by concurrent goroutines + shardingFactor atomic.Uint32 + + // If true, metadata writes will be atomic. Should be set to true in production, but can be set to false + // to speed up unit tests. + fsync bool +} + +// newTableMetadata creates a new table metadata object. 
+func newTableMetadata( + logger logging.Logger, + tableDirectory string, + ttl time.Duration, + shardingFactor uint32, + fsync bool) (*tableMetadata, error) { + + metadata := &tableMetadata{ + logger: logger, + tableDirectory: tableDirectory, + fsync: fsync, + } + metadata.ttl.Store(&ttl) + metadata.shardingFactor.Store(shardingFactor) + + err := metadata.write() + if err != nil { + return nil, fmt.Errorf("failed to write table metadata: %v", err) + } + + return metadata, nil +} + +// loadTableMetadata loads the table metadata from disk. +func loadTableMetadata(logger logging.Logger, tableDirectory string) (*tableMetadata, error) { + mPath := metadataPath(tableDirectory) + + if err := util.ErrIfNotExists(mPath); err != nil { + return nil, fmt.Errorf("table metadata file does not exist: %s", mPath) + } + + data, err := os.ReadFile(mPath) + if err != nil { + return nil, fmt.Errorf("failed to read table metadata file %s: %v", mPath, err) + } + + metadata, err := deserialize(data) + if err != nil { + return nil, fmt.Errorf("failed to deserialize table metadata: %v", err) + } + metadata.logger = logger + metadata.tableDirectory = tableDirectory + + return metadata, nil +} + +// Size returns the size of the table metadata file in bytes. +func (t *tableMetadata) Size() uint64 { + return tableMetadataSize +} + +// GetTTL returns the time-to-live for the table. +func (t *tableMetadata) GetTTL() time.Duration { + return *t.ttl.Load() +} + +// SetTTL sets the time-to-live for the table. +func (t *tableMetadata) SetTTL(ttl time.Duration) error { + t.ttl.Store(&ttl) + err := t.write() + if err != nil { + return fmt.Errorf("failed to update table metadata: %v", err) + } + return nil +} + +// GetShardingFactor returns the sharding factor for the table. +func (t *tableMetadata) GetShardingFactor() uint32 { + return t.shardingFactor.Load() +} + +// SetShardingFactor sets the sharding factor for the table. 
+func (t *tableMetadata) SetShardingFactor(shardingFactor uint32) error {
+	t.shardingFactor.Store(shardingFactor)
+	err := t.write()
+	if err != nil {
+		return fmt.Errorf("failed to update table metadata: %v", err)
+	}
+	return nil
+}
+
+// write atomically stores the table metadata to disk.
+func (t *tableMetadata) write() error {
+	err := util.AtomicWrite(metadataPath(t.tableDirectory), t.serialize(), t.fsync)
+	if err != nil {
+		return fmt.Errorf("failed to write table metadata file: %v", err)
+	}
+
+	return nil
+}
+
+// serialize serializes the table metadata to a byte slice.
+func (t *tableMetadata) serialize() []byte {
+	// 4 bytes for version
+	// 8 bytes for TTL
+	// 4 bytes for sharding factor
+	data := make([]byte, tableMetadataSize)
+
+	// Write the version
+	binary.BigEndian.PutUint32(data[0:4], tableMetadataSerializationVersion)
+
+	// Write the TTL
+	ttlNanoseconds := t.GetTTL().Nanoseconds()
+	binary.BigEndian.PutUint64(data[4:12], uint64(ttlNanoseconds))
+
+	// Write the sharding factor
+	binary.BigEndian.PutUint32(data[12:16], t.GetShardingFactor())
+
+	return data
+}
+
+// deserialize deserializes the table metadata from a byte slice.
+func deserialize(data []byte) (*tableMetadata, error) {
+	// 4 bytes for version
+	// 8 bytes for TTL
+	// 4 bytes for sharding factor
+	if len(data) != tableMetadataSize {
+		return nil, fmt.Errorf("metadata file is not the correct size, expected %d bytes, got %d",
+			tableMetadataSize, len(data))
+	}
+
+	serializationVersion := binary.BigEndian.Uint32(data[0:4])
+	if serializationVersion != tableMetadataSerializationVersion {
+		return nil, fmt.Errorf("unsupported serialization version: %d", serializationVersion)
+	}
+
+	ttl := time.Duration(binary.BigEndian.Uint64(data[4:12]))
+	shardingFactor := binary.BigEndian.Uint32(data[12:16])
+
+	metadata := &tableMetadata{}
+	metadata.ttl.Store(&ttl)
+	metadata.shardingFactor.Store(shardingFactor)
+
+	return metadata, nil
+}
+
+// delete deletes the table metadata from disk.
+func (t *tableMetadata) delete() error {
+	metadataPath := path.Join(t.tableDirectory, TableMetadataFileName)
+	err := os.Remove(metadataPath)
+	if err != nil {
+		return fmt.Errorf("failed to delete table metadata file %s: %v", metadataPath, err)
+	}
+	return nil
+}
+
+// metadataPath returns the path to the table metadata file.
+func metadataPath(tableDirectory string) string {
+	return path.Join(tableDirectory, TableMetadataFileName)
+}
diff --git a/sei-db/db_engine/litt/disktable/unlock.go b/sei-db/db_engine/litt/disktable/unlock.go
new file mode 100644
index 0000000000..9eac46f833
--- /dev/null
+++ b/sei-db/db_engine/litt/disktable/unlock.go
@@ -0,0 +1,46 @@
+//go:build littdb_wip
+
+package disktable
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+)
+
+// Unlock unlocks a LittDB file system by removing its lock files.
+//
+// DANGER: calling this method opens the door for unsafe concurrent operations on LittDB files.
+// With great power comes great responsibility.
+func Unlock(logger logging.Logger, sourcePaths []string) error { + for _, sourcePath := range sourcePaths { + err := filepath.WalkDir(sourcePath, func(path string, d os.DirEntry, err error) error { + if err != nil { + return err + } + if d.IsDir() { + return nil + } + + if strings.HasSuffix(path, util.LockfileName) { + logger.Infof("Removing lock file %s", path) + if removeErr := os.Remove(path); removeErr != nil { + logger.Error("Failed to remove lock file", "path", path, "error", removeErr) + return fmt.Errorf("failed to remove lock file %s: %w", path, removeErr) + } + } + + return nil + }) + + if err != nil { + return fmt.Errorf("failed to walk directory %s: %w", sourcePath, err) + } + } + + return nil +} diff --git a/sei-db/db_engine/litt/docs/architecture.md b/sei-db/db_engine/litt/docs/architecture.md new file mode 100644 index 0000000000..11cbb849c9 --- /dev/null +++ b/sei-db/db_engine/litt/docs/architecture.md @@ -0,0 +1,223 @@ +# LittDB Architecture + +This section explains the high level architecture of LittDB. It starts out by describing a simple (but inefficient) +storage solution, and incrementally adds complexity in order to solve various problems. For the full picture, skip to +[Putting it all together: LittDB](#putting-it-all-together-littdb). + +For each iteration, the database must fulfill the following requirements: + +- must support `put(key, value)`/`get(key)` operations +- must be thread safe +- must support a TTL +- must be crash durable + +## Iteration 1: Appending data to a file + +Let's implement the simplest possible key-value store that satisfies the requirements above. It's going to be super +slow. Ok, fine. We want simple. + +![](resources/iteration1.png) + +When the user writes a key-value pair to the database, append the key and the value to the end of the file, along +with a timestamp. When the user reads a key, scan the file from the beginning until you find the key and +return the value. 
+
+Periodically, scan the data in the file to check for expired data. If a key has expired, remove it from the file
+(this requires the file to be rewritten).
+
+This needs to be thread safe. Keep a global read-write lock around the file. When a write or GC operation is in
+progress, no reads are allowed. GC operations and writes are not permitted to happen in parallel. Allow multiple
+reads to happen concurrently.
+
+In order to provide durability, ensure the file is fully flushed to disk before releasing the write lock.
+
+Congratulations! You've written your very own database!
+
+![](resources/iDidIt.png)
+
+## Iteration 2: Add a cache
+
+Reads against the database in iteration 1 are slow. If there is any way we could reduce the number of times we have
+to iterate over the file, that would be great. Let's add an in-memory cache.
+
+![](resources/iteration2.png)
+
+Let's assume we are using a thread safe map to implement the cache.
+
+When reading data, first check to see if the data is in the cache. If it is, return it. If it is not, acquire a read
+lock and scan the file. Be sure to update the cache with the data you read.
+
+When writing data, write the data to the file, and then update the cache. For many workloads, recently written data
+is read again shortly after.
+
+When deleting data, remove the data from the file, and then remove the data from the cache.
+
+## Iteration 3: Add an index
+
+Reading recent values is a lot faster now. But if you miss the cache, things start getting slow. `O(n)` isn't fun
+when your database holds 100TB. To address this, let's add an index that allows us to jump straight to the data we
+are looking for. For the sake of consistency with other parts of this document, let's call this index a "keymap".
+
+![](resources/iteration3.png)
+
+Inside the keymap, maintain a mapping from each key to the offset in the file where the first byte of the value is
+stored.
+
+When writing a value, take note of the offset in the file where the value was written. Store the key and the offset
+in the keymap.
+
+When reading a value and there is a cache miss, look up the key inside the keymap. If the key is present, jump to
+the start of the value in the file and read the value. If the key is not present, tell the user that the key is not
+present in the database.
+
+When deleting a value, remove the key from the keymap in addition to removing the value from the file.
+
+At startup time, we will have to rebuild the keymap, since we are only storing it in memory. In order to do so,
+iterate over the file and reconstruct the keymap. If this is too slow, consider storing the keymap on disk (perhaps
+using an off-the-shelf key-value store like levelDB).
+
+The database needs to do a little extra bookkeeping when it deletes data from the file. If it deletes X bytes from
+the beginning of the file, then the offsets recorded in the keymap are off by X. The keymap doesn't
+need to be rebuilt in order to fix this. Rather, the database can simply subtract X from all the offsets in the
+keymap to find the actual location of the data in the file. Additionally, it must add X to the offset when computing
+the "offset" of new data that is written to the file.
+
+## Iteration 4: Unflushed data map
+
+In order to be thread safe, the solution above uses a global lock. While one thread is writing, readers must wait
+unless they get lucky and find their data in the cache. It would be really nice if we could permit reads to continue
+uninterrupted while writes are happening in the background.
+
+![](resources/iteration4.png)
+
+Create another key->value map called the "unflushed data map". Use a thread safe map implementation.
+
+When the user writes data to the database, immediately add it to the unflushed data map, but not the key map.
+After that is completed, write it to file. The write doesn't need to be synchronous. For example, you can use file
+stream APIs that buffer data in memory before writing it to disk in larger chunks. The write operation doesn't need
+to block until the data is written to disk; it can return as soon as the data is in the unflushed data map and written
+to the buffer.
+
+Expose a new method in the database called `Flush()`. When `Flush()` is called, first flush all data in buffers to disk,
+then empty out the unflushed data map. Before each entry is removed, write the key-address pair to the key map.
+This flush operation should block until all of this work is done.
+
+When reading data, look for it in the following places, in order:
+
+- the cache
+- the unflushed data map
+- on disk (via the keymap and data file)
+
+Unlike previous iterations, writes no longer need to hold a lock that blocks readers. This is thread safe, and it
+provides read-your-writes consistency.
+
+If a reader is attempting to read data that is currently in the process of being written to disk, then the data will
+be present in the unflushed data map. If the reader finds an entry in the key map, this means that the data has already
+been written out to disk, and is therefore safe to read from the file. Even if the writer is writing later in the file,
+the bytes the reader wants to read will be immutable.
+
+Although the strategy described above allows read operations to execute concurrently with write operations, it does
+not solve the problem for deletions of values that have exceeded their TTL. This operation will still require a global
+lock that blocks all reads and writes.
+
+## Iteration 5: Break the file into segments
+
+One of the biggest inefficiencies in the design, to this point, is that deleting values is exceptionally slow. The
+entire file must be rewritten in order to trim bytes from the beginning. And to make matters worse, we need to hold
+a global lock while we do it. To fix this, let's break apart the data file into multiple data files. We'll call each
+data file a "segment".
+
+![](resources/iteration5.png)
+
+Decide on a maximum file size for each segment. Whenever a file gets "full", close it and open a new one. Let's assign
+each of these files a serial number starting with `0` and increasing monotonically. We'll call this serial number the
+"segment index".
+
+Previously, the address stored in the key map told us the offset in the file where the value was stored. Now, the
+address will also need to keep track of the segment index, as well as the offset.
+
+Deletion of data is now super easy. When all data in the oldest segment file exceeds its TTL, we can delete just that
+segment without modifying any of the other segment files. Iterate over the segment file to delete values from the key
+map.
+
+In order to avoid the race condition where a reader is reading data from a segment that is in the process of being
+deleted, use reference counters for each segment. When a reader goes to read data, it first finds the address in the
+keymap, then increments the reference counter for the segment. When the reader is done reading, it decrements the
+reference counter. When the garbage collector goes to delete a segment, it waits to actually delete the file on disk
+until the reference counter is zero. As a result of this strategy, there is now no longer a need for garbage collection
+to hold a global lock.
+
+## Iteration 6: Metadata files
+
+In the previous iteration, we do garbage collection by deleting a segment once all data contained within that segment
+has expired. But how do we figure out when that actually is? So far, the only way to do that is to
+iterate over the entire segment file and read the timestamp stored with the last value. Let's do better.
+
+For each segment, let's create a metadata file. We'll put the timestamp of the last value written to the segment into
+this file. As a result, we will no longer need to store timestamp information inside the value files, which will
+save us a few bytes per entry.
+
+![](resources/iteration6.png)
+
+Now, all the garbage collector needs to read to decide when it is time to delete a segment is the metadata file for
+that segment.
+
+## Iteration 7: Key files
+
+Storing timestamp information in a metadata file is a good start, but we still need to scan the value file. When a
+segment is deleted, we need to clean up the key map. In order to do this, we need to know which keys are stored in the
+segment. Additionally, when we start up the database, we need to rebuild the key map. This requires us to scan each
+segment file to find the keys.
+
+From an optimization point of view, we can assume that in general keys will be much smaller than values. During the
+operations described above, we don't care about the values, only the keys. So let's separate the keys from the values
+to avoid having to read the values when we don't need them.
+
+![](resources/iteration7.png)
+
+Everything works the same way as before. But instead of iterating huge segment files when deleting a segment
+or rebuilding the key map at startup, we only have to iterate over the key file. The key file is going to be
+significantly smaller than the value file (for sane key-value size ratios), and so this will be much faster.
+
+## Iteration 8: Sharding
+
+A highly desirable property for this database is the capability to spread its data across multiple physical drives.
+In order to do this, we need to shard the data. That is to say, we need to break the data into smaller pieces and
+spread those pieces across multiple locations.
+
+![iteration8](resources/iteration8.png)
+
+Key files and metadata files are small. For the sake of simplicity, let's not bother sharding those. Value files
+are big. Break apart value files, and have one value file per shard.
+ +When writing data, the first thing to do will be to figure out which shard the data belongs in. Do this by taking a +hash of the key modulo the number of shards. + +When reading data, we need to do the reverse. Take a hash of the key modulo the number of shards to figure out which +shard to look in. As a consequence, the address alone is no longer enough information to find the data. We also need +to know the key when looking up data. But this isn't a problem, since we always have access to the key when we are +looking up data. + +From a security perspective, sharding with a predictable hash is dangerous. An attacker could, in theory, craft keys +that all map to the same shard, causing a hot spot in the database. To prevent this, the database chooses a random +"salt" value that it includes in the hash function. As long as an attacker does not know the salt value, they cannot +predict which shard a key will map to. + +We already have a metadata file for each segment. We can go ahead and save the sharding factor and salt in the metadata +file. This will give us enough information to find data contained within the segment. + +## Iteration 9: Multi-table support + +A nice-to-have feature would be the ability to support multiple tables. Each table would have its own namespace, and +data in one table would not conflict with data in another table. + +This is simple! Let's just run a different DB instance for each table. + +![](resources/iteration9.png) + +Since each table might want to have its own configuration, we can store that configuration in a metadata file for each +table. 
+ +## Putting it all together: LittDB + +![littdb](resources/littdb-big-picture.png) \ No newline at end of file diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/README.md b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/README.md new file mode 100644 index 0000000000..dee92d77cd --- /dev/null +++ b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/README.md @@ -0,0 +1,154 @@ +# Test Description + +A long term soak test (2 weeks) with ~200 TB on disk. The goal of this test was to verify that LittDB performance +did not degrade over time and with this quantity of data on disk. + +# Setup + + +| Property | Value | +|-------------------|----------------------------------------------| +| commit | `2625a70cecf0efc239fb9891691b7b179733b5f8` | +| environment | OCI (Oracle Cloud Infrastructure) | +| region | US East (Ashburn) | +| OS | Canonical-Ubuntu-20.04-2025.07.23-0 | +| shape | VM.Optimized3.Flex | +| OCPU count | 1 | +| Network Bandwidth | 4 Gbps | +| Memory | 14 GB | +| Disk | 8x 32TB block volumes, per disk config below | +| Disk Performance | Balanced (VPU/GB:10) | +| Disk Throughput | 480 MB/s | +| Disk IOPS | 25,000 IOPS | +| Disk encryption | disabled | +| Disk backup | disabled | + +# Configuration + +I used the following benchmark configuration: + +```json +{ + "LittConfig": { + "Paths": ["~/mount/b", "~/mount/c", "~/mount/d", "~/mount/e", "~/mount/f", "~/mount/g", "~/mount/h", "~/mount/i"], + "MetricsEnabled": true + }, + "MaximumWriteThroughputMB": 1024, + "MetricsLoggingPeriodSeconds": 1, + "TTLHours": 168 +} +``` + +The block volumes were mounted under `~/mount/b` ... `~/mount/i` and formatted with `ext4` filesystem. +("`/dev/sda`" was already in use, so I started with "`/dev/sdb`".) + +I ran the test for 14 days. The first 7 days (i.e. 168 hours) were spent ramping up, followed by 7 days of steady state. 
+ +# Results + +| | | +|---|---| +| ![Disk Footprint](data/disk-footprint.webp) | ![Key Count](data/key-count.webp) | +| ![Bytes Written / Second](data/bytes-written-second.webp) | ![Keys Written / Second](data/keys-written-second.webp) | +| ![Flushes / Second](data/flushes-second.webp) | ![Write Latency](data/write-latency.webp) | +| ![Flush Latency](data/flush-latency.webp) | ![Segment Flush Latency](data/segment-flush-latency.webp) | +| ![Keymap Flush Latency](data/keymap-flush-latency.webp) | ![GC Latency](data/gc-latency.webp) | +| ![Bytes Read / Second](data/bytes-read-second.webp) | ![Keys Read / Second](data/keys-read-second.webp) | +| ![Read Latency](data/read-latency.webp) | ![Cache Hits / Second](data/cache-hits-second.webp) | +| ![Cache Misses / Second](data/cache-misses-second.webp) | ![Cache Miss Latency](data/cache-miss-latency.webp) | +| ![Memory](data/memory.webp) | ![CPU Seconds](data/cpu-seconds.webp) | + +# Notes and Observations + +## Clean Bill of Health + +The test completed successfully with no errors. All metrics reported healthy values. There were no signs of +performance degradation or resource leaks over the course of the test. Although read latency and memory use did +increase slightly over time, I suspect this can be explained by the growth in size of the keymap (i.e. an internal +LevelDB instance used for tracking metadata). Once the size of the data reached a steady state, this minor growth +in read latency and memory appeared to flatten out and enter a steady state as well. + +## Is the benchmark code available? + +Yes! 
To run this benchmark yourself, do the following: + +- install golang 1.24 +- `git clone https://github.com/Layr-Labs/eigenda.git` +- `cd eigenda/litt && make build` + - this will create the LittDB CLI binary at `./eigenda/litt/bin/litt` + - you can install this CLI by making sure this binary is on your bash PATH, or you can invoke it directly +- create a benchmark config file + - the above example is a good starting point + - a complete list of config options can be found at https://github.com/Layr-Labs/eigenda/blob/master/litt/benchmark/config/benchmark_config.go +- `litt benchmark /path/to/benchmark_config.json` + +## Why OCI? + +It's cheap. + +## What's the current write bottleneck? + +The write throughput observed during this test vastly exceeds what we need, so I didn't spend much time attempting to +further optimize the write throughput. + +I suspect the write bottleneck is one of two things: + +- the benchmark utility itself +- some sort of OCI limitation based on the VM shape + +When I was running this benchmark with a single disk, I observed slightly faster write throughput. If the bottleneck +was the capacity of the disks themselves, I would expect that adding more disks would increase the write throughput. +Additionally, the observed write throughput is well below the theoretical maximum of the disks (even when running with +a single disk). + +It's plausible that there is some other cause for the current write bottleneck. As of now, I've not collected +sufficient data to determine the exact cause. + +## Memory Use + +It's important to point out that the benchmark allocates a fixed size 1 GB memory buffer. Although the system was using +~2 GB of memory, the actual memory use of the DB itself was at most only half of that. + +In a production environment, LittDB can use a lot of memory depending on cache configuration. But modulo caching, +the baseline memory needed for a high capacity LittDB instance is quite low (under 1 GB). 
+
+## Garbage Collection Overhead
+
+One of the major problems with other DBs I've tested with the EigenDA workload is garbage collection. This test
+demonstrates that LittDB garbage collection is exceptionally fast and efficient. Garbage collection runs once
+every 5 minutes, and takes 100-200ms to complete.
+
+## Data Validation
+
+A feature of this benchmark is that when it reads data, it validates that the data read is correct. During the span
+of this two-week benchmark, no data corruption was detected. Note that since the write rate was much larger than
+the read rate, only a small fraction of the data written was actually read and validated. But if there was systemic
+and large scale data corruption, it is very likely that the random sampling would have detected it.
+
+# Future Work
+
+## Test Length
+
+The intended use case of the DB requires continuous uptime over months or years. This test was only 2 weeks long, so
+it's possible that issues could arise over longer time periods. The length of this test was limited by cost
+considerations.
+
+## Read Workload
+
+The read workload of this test was intentionally kept light. The primary purpose of this test was to verify that
+performance did not degrade with large quantities of data on disk. It might be interesting to repeat this test
+with a more realistic read workload.
+
+## Larger Data Set
+
+The target data size for this test was ~200 TB. The test only achieved ~192 TB, but this is close enough for all
+practical purposes. The exact quantity of data stored on disk is a function of the write throughput and the TTL.
+Since the write throughput was dependent on the speed of the underlying disks and the TTL was fixed at 7 days, the
+exact quantity of data stored on disk could not be precisely controlled.
+
+Based on this data, we are confident that LittDB can handle EigenDA workloads for 1/8th stake validators, and then some!
+The scale of this benchmark exceeded the requirements for this EigenDA use case by 2-4x.
+
+A long-term goal is to make the EigenDA protocol capable of bearing 1 GB/s. In order to do so, we will need to validate
+LittDB at a 1-2 petabyte scale. Due to cost considerations, this test was not performed at that scale. Based on observed
+data, I do not anticipate DB problems at this scale. But it's hard to say for sure without actually running the test.
\ No newline at end of file
diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-read-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-read-second.webp
new file mode 100644
index 0000000000..7ef64cc429
Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-read-second.webp differ
diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-written-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-written-second.webp
new file mode 100644
index 0000000000..b745e52f9d
Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/bytes-written-second.webp differ
diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-hits-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-hits-second.webp
new file mode 100644
index 0000000000..96d6ef0d34
Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-hits-second.webp differ
diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-miss-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-miss-latency.webp
new file mode 100644
index 0000000000..c045ea1115
Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-miss-latency.webp differ
diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-misses-second.webp
b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-misses-second.webp new file mode 100644 index 0000000000..79ad928b07 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cache-misses-second.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cpu-seconds.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cpu-seconds.webp new file mode 100644 index 0000000000..67e5250deb Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/cpu-seconds.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/disk-footprint.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/disk-footprint.webp new file mode 100644 index 0000000000..4e0996809b Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/disk-footprint.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flush-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flush-latency.webp new file mode 100644 index 0000000000..24beb94b65 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flush-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flushes-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flushes-second.webp new file mode 100644 index 0000000000..2184710c43 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/flushes-second.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/gc-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/gc-latency.webp new file mode 100644 index 0000000000..da2df2fe5f Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/gc-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/key-count.webp 
b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/key-count.webp new file mode 100644 index 0000000000..d30ce8a8fb Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/key-count.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keymap-flush-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keymap-flush-latency.webp new file mode 100644 index 0000000000..add1dccc26 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keymap-flush-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-read-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-read-second.webp new file mode 100644 index 0000000000..11d92c19e2 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-read-second.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-written-second.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-written-second.webp new file mode 100644 index 0000000000..2eed94bd01 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/keys-written-second.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/memory.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/memory.webp new file mode 100644 index 0000000000..abbe587727 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/memory.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/raw.tar.gz b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/raw.tar.gz new file mode 100644 index 0000000000..58ca9cdd15 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/raw.tar.gz differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/read-latency.webp 
b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/read-latency.webp new file mode 100644 index 0000000000..4f15351c5b Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/read-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/segment-flush-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/segment-flush-latency.webp new file mode 100644 index 0000000000..599fb98d3c Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/segment-flush-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/write-latency.webp b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/write-latency.webp new file mode 100644 index 0000000000..a14d56fb29 Binary files /dev/null and b/sei-db/db_engine/litt/docs/benchmark-data/8-27-2025/data/write-latency.webp differ diff --git a/sei-db/db_engine/litt/docs/filesystem_layout.md b/sei-db/db_engine/litt/docs/filesystem_layout.md new file mode 100644 index 0000000000..7bb742dab0 --- /dev/null +++ b/sei-db/db_engine/litt/docs/filesystem_layout.md @@ -0,0 +1,351 @@ +# Filesystem Layout + +This document provides an overview of how LittDB stores data on disk. + +## Root Directories + +LittDB spreads its data across N root directories. In practice, each root directory will probably be on its +own physical drive, but that's not a hard requirement. + +In the example below, the root directories are `root/root0`, `root/root1`, and `root/root2`. + +## Table Directories + +LittDB supports multiple tables, each with its own namespace. Each table is stored within its own +subdirectory. + +The name of the table's subdirectory is the name of the table (hence the restrictions on characters allowed in +table names). Each table will have one subdirectory per root. + +In the example below, there are three tables: `tableA`, `tableB`, and `tableC`. 
The full paths to the table directories
+in the example below are as follows:
+
+- for `tableA`:
+  - `root/root0/tableA`
+  - `root/root1/tableA`
+  - `root/root2/tableA`
+- for `tableB`:
+  - `root/root0/tableB`
+  - `root/root1/tableB`
+  - `root/root2/tableB`
+- for `tableC`:
+  - `root/root0/tableC`
+  - `root/root1/tableC`
+  - `root/root2/tableC`
+
+## Keymap Directory
+
+All keymap data appears in the directory named `keymap`. There is one keymap per table, so if there are multiple
+tables in a DB then there may be multiple keymap directories.
+
+- The file `keymap/keymap-type.txt` contains the name of the keymap implementation.
+- The file `keymap/initialized` is a marker file used to indicate if a keymap has been fully initialized or not
+  (relevant if the process crashes during keymap initialization).
+- If the keymap writes data to disk (e.g. levelDB, as pictured below), then the data will be stored in the
+  `keymap/data` directory.
+
+Even if there are multiple root paths, each table only has a single keymap directory. The directory will be located
+inside the table directory in exactly one of the root directories. It doesn't matter which root directory contains the
+keymap directory.
+
+In the example below, keymap directories are located at the following paths:
+
+- `root/root0/tableA/keymap`
+- `root/root0/tableB/keymap`
+- `root/root0/tableC/keymap`
+
+If the DB is shut down, it's safe to delete the entire `keymap` directory. On the next startup, LittDB will
+recreate the keymap directory and reinitialize the keymap.
+
+## Segment Files
+
+There are three types of files that contain data for a segment:
+
+- metadata: these files take the form `N.metadata`, where `N` is the segment number. These files contain a small amount
+  of metadata about the segment.
+- keys: these files take the form `N.keys`, where `N` is the segment number. These files contain the keys for the
+  segment.
+- values: these files take the form `N-M.values`, where `N` is the segment number and `M` is the shard number. + These files contain the values for the segment. + +Segment files appear in the `segments` subdirectory of a table directory. Segments for a table may be spread across +different root directories. It's unimportant which root directory contains each segment file. It's perfectly ok +to move a segment file from one root directory to another while the DB is not running. + +In the example below, segment files can be found in the following paths: + +- `root/root0/tableA/segments` +- `root/root1/tableA/segments` +- `root/root2/tableA/segments` +- `root/root0/tableB/segments` +- `root/root1/tableB/segments` +- `root/root2/tableB/segments` +- `root/root0/tableC/segments` +- `root/root1/tableC/segments` +- `root/root2/tableC/segments` + +## Snapshot Files + +If enabled, LittDB will periodically capture a rolling snapshot of its data. This snapshot can be used to make backups. +In the example below, the rolling snapshot is stored in the `root/rolling_snapshot` directory (this is configurable). + +The data in the rolling snapshot directory are symlinks. This is needed since LittDB data may be spread across +multiple physical volumes, and we really don't want to do a deep copy of the data in order to create a snapshot. +LittDB files are immutable, so there is no risk of the data being "pulled out from under" the snapshot. + +The snapshot files point to hard linked copies of the segment files. For each volume, there is a directory named +`snapshot` that contains these hard linked files. The reason for this is to protect the snapshot data from being +deleted by the LittDB garbage collector. LittDB links the snapshot files, and it is the responsibility of the +external user/tooling to delete the snapshot files when they are no longer needed (both the symlinks and the hard +links). 
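+The naming scheme above can be sketched as a small parser. This is illustrative only: the `segmentFile` type and
+`parseSegmentFileName` function below are hypothetical helpers, not LittDB's actual API.
+
+```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// segmentFile is a parsed segment file name (illustrative; LittDB's
// internal representation may differ).
type segmentFile struct {
	Segment uint64 // segment number N
	Shard   uint64 // shard number M (only meaningful for .values files)
	Kind    string // "metadata", "keys", or "values"
}

// parseSegmentFileName interprets the three naming patterns described above:
// N.metadata, N.keys, and N-M.values.
func parseSegmentFileName(name string) (segmentFile, error) {
	base, ext, ok := strings.Cut(name, ".")
	if !ok {
		return segmentFile{}, fmt.Errorf("no extension in %q", name)
	}
	switch ext {
	case "metadata", "keys":
		n, err := strconv.ParseUint(base, 10, 64)
		if err != nil {
			return segmentFile{}, fmt.Errorf("bad segment number in %q: %w", name, err)
		}
		return segmentFile{Segment: n, Kind: ext}, nil
	case "values":
		segStr, shardStr, ok := strings.Cut(base, "-")
		if !ok {
			return segmentFile{}, fmt.Errorf("values file %q is missing a shard suffix", name)
		}
		n, err := strconv.ParseUint(segStr, 10, 64)
		if err != nil {
			return segmentFile{}, fmt.Errorf("bad segment number in %q: %w", name, err)
		}
		m, err := strconv.ParseUint(shardStr, 10, 64)
		if err != nil {
			return segmentFile{}, fmt.Errorf("bad shard number in %q: %w", name, err)
		}
		return segmentFile{Segment: n, Shard: m, Kind: ext}, nil
	default:
		return segmentFile{}, fmt.Errorf("unrecognized extension in %q", name)
	}
}

func main() {
	for _, name := range []string{"2.metadata", "2.keys", "2-3.values"} {
		sf, err := parseSegmentFileName(name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> segment=%d shard=%d kind=%s\n", name, sf.Segment, sf.Shard, sf.Kind)
	}
}
+```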
+
+Within the snapshot directory, there are also files named `lower-bound.txt` and `upper-bound.txt`. These files
+are used for communication between the DB and tooling that manages LittDB snapshots.
+
+## Lock Files
+
+LittDB writes lock files to each root directory it operates on. This acts as a sanity check to ensure that multiple
+processes do not attempt to access/modify the same file tree in an unsafe way. The lock file is named `litt.lock`.
+
+If a LittDB process crashes before cleaning up its lock files, no action is needed. LittDB will automatically
+remove the lock files on the next startup as long as the old process is no longer running. If the old process
+is hanging, then it will be necessary to kill the process before starting a new one.
+
+The LittDB CLI also uses lock files in the same way. This ensures that the CLI does not attempt to operate on LittDB
+files in unsafe ways, such as deleting files that are currently being managed by a running LittDB process.
+
+In the example below, lock files can be found at the following paths:
+
+- `root/root0/litt.lock`
+- `root/root1/litt.lock`
+- `root/root2/litt.lock`
+
+## Example Layout
+
+The following is an example file tree for a simple LittDB instance.
+(This example file tree was generated using generate_example_tree_test.go.)
+
+### Root Directories
+
+There are three directories into which data is written. In theory, these could be located on three separate
+physical drives. Those directories are:
+
+- `root/root0`
+- `root/root1`
+- `root/root2`
+
+Each table is configured to have four shards. That's one more shard than there are root directories, meaning that
+one of the root directories will have two shards, and all the others will have one shard.
+
+### Tables
+
+There are three tables, each with its own namespace. The tables are:
+
+- `tableA`
+- `tableB`
+- `tableC`
+
+### Segments
+
+A little data has been written to the DB.
+ +- `tableA` has enough data to have three segments +- `tableB` has enough data to have two segments +- `tableC` has enough data to have one segment + +### Keymap + +The keymap is implemented using levelDB. + +### Snapshot + +The DB has been configured to take a rolling snapshot, and the target directory is `root/rolling_snapshot`. + +### File Tree + +```text +root +├── rolling_snapshot +│   ├── tableA +│   │   ├── lower-bound.txt +│   │   ├── segments +│   │   │   ├── 0-0.values -> root/root1/tableA/snapshot/0-0.values +│   │   │   ├── 0-1.values -> root/root2/tableA/snapshot/0-1.values +│   │   │   ├── 0-2.values -> root/root0/tableA/snapshot/0-2.values +│   │   │   ├── 0-3.values -> root/root1/tableA/snapshot/0-3.values +│   │   │   ├── 0.keys -> root/root0/tableA/snapshot/0.keys +│   │   │   ├── 0.metadata -> root/root0/tableA/snapshot/0.metadata +│   │   │   ├── 1-0.values -> root/root1/tableA/snapshot/1-0.values +│   │   │   ├── 1-1.values -> root/root2/tableA/snapshot/1-1.values +│   │   │   ├── 1-2.values -> root/root0/tableA/snapshot/1-2.values +│   │   │   ├── 1-3.values -> root/root1/tableA/snapshot/1-3.values +│   │   │   ├── 1.keys -> root/root0/tableA/snapshot/1.keys +│   │   │   ├── 1.metadata -> root/root0/tableA/snapshot/1.metadata +│   │   │   ├── 2-0.values -> root/root1/tableA/snapshot/2-0.values +│   │   │   ├── 2-1.values -> root/root2/tableA/snapshot/2-1.values +│   │   │   ├── 2-2.values -> root/root0/tableA/snapshot/2-2.values +│   │   │   ├── 2-3.values -> root/root1/tableA/snapshot/2-3.values +│   │   │   ├── 2.keys -> root/root0/tableA/snapshot/2.keys +│   │   │   └── 2.metadata -> root/root0/tableA/snapshot/2.metadata +│   │   └── upper-bound.txt +│   ├── tableB +│   │   ├── lower-bound.txt +│   │   ├── segments +│   │   │   ├── 0-0.values -> root/root1/tableB/snapshot/0-0.values +│   │   │   ├── 0-1.values -> root/root2/tableB/snapshot/0-1.values +│   │   │   ├── 0-2.values -> root/root0/tableB/snapshot/0-2.values +│   │   │   ├── 
0-3.values -> root/root1/tableB/snapshot/0-3.values +│   │   │   ├── 0.keys -> root/root0/tableB/snapshot/0.keys +│   │   │   ├── 0.metadata -> root/root0/tableB/snapshot/0.metadata +│   │   │   ├── 1-0.values -> root/root1/tableB/snapshot/1-0.values +│   │   │   ├── 1-1.values -> root/root2/tableB/snapshot/1-1.values +│   │   │   ├── 1-2.values -> root/root0/tableB/snapshot/1-2.values +│   │   │   ├── 1-3.values -> root/root1/tableB/snapshot/1-3.values +│   │   │   ├── 1.keys -> root/root0/tableB/snapshot/1.keys +│   │   │   └── 1.metadata -> root/root0/tableB/snapshot/1.metadata +│   │   └── upper-bound.txt +│   └── tableC +│   ├── lower-bound.txt +│   └── segments +├── root0 +│   ├── litt.lock +│   ├── tableA +│   │   ├── keymap +│   │   │   ├── data +│   │   │   │   ├── 000001.log +│   │   │   │   ├── CURRENT +│   │   │   │   ├── LOCK +│   │   │   │   ├── LOG +│   │   │   │   └── MANIFEST-000000 +│   │   │   ├── initialized +│   │   │   └── keymap-type.txt +│   │   ├── segments +│   │   │   ├── 0-2.values +│   │   │   ├── 0.keys +│   │   │   ├── 0.metadata +│   │   │   ├── 1-2.values +│   │   │   ├── 1.keys +│   │   │   ├── 1.metadata +│   │   │   ├── 2-2.values +│   │   │   ├── 2.keys +│   │   │   ├── 2.metadata +│   │   │   ├── 3-2.values +│   │   │   ├── 3.keys +│   │   │   └── 3.metadata +│   │   ├── snapshot +│   │   │   ├── 0-2.values +│   │   │   ├── 0.keys +│   │   │   ├── 0.metadata +│   │   │   ├── 1-2.values +│   │   │   ├── 1.keys +│   │   │   ├── 1.metadata +│   │   │   ├── 2-2.values +│   │   │   ├── 2.keys +│   │   │   └── 2.metadata +│   │   └── table.metadata +│   ├── tableB +│   │   ├── keymap +│   │   │   ├── data +│   │   │   │   ├── 000001.log +│   │   │   │   ├── CURRENT +│   │   │   │   ├── LOCK +│   │   │   │   ├── LOG +│   │   │   │   └── MANIFEST-000000 +│   │   │   ├── initialized +│   │   │   └── keymap-type.txt +│   │   ├── segments +│   │   │   ├── 0-2.values +│   │   │   ├── 0.keys +│   │   │   ├── 0.metadata +│   │   │   ├── 
1-2.values +│   │   │   ├── 1.keys +│   │   │   ├── 1.metadata +│   │   │   ├── 2-2.values +│   │   │   ├── 2.keys +│   │   │   └── 2.metadata +│   │   ├── snapshot +│   │   │   ├── 0-2.values +│   │   │   ├── 0.keys +│   │   │   ├── 0.metadata +│   │   │   ├── 1-2.values +│   │   │   ├── 1.keys +│   │   │   └── 1.metadata +│   │   └── table.metadata +│   └── tableC +│   ├── keymap +│   │   ├── data +│   │   │   ├── 000001.log +│   │   │   ├── CURRENT +│   │   │   ├── LOCK +│   │   │   ├── LOG +│   │   │   └── MANIFEST-000000 +│   │   ├── initialized +│   │   └── keymap-type.txt +│   ├── segments +│   │   ├── 0-2.values +│   │   ├── 0.keys +│   │   └── 0.metadata +│   ├── snapshot +│   └── table.metadata +├── root1 +│   ├── litt.lock +│   ├── tableA +│   │   ├── segments +│   │   │   ├── 0-0.values +│   │   │   ├── 0-3.values +│   │   │   ├── 1-0.values +│   │   │   ├── 1-3.values +│   │   │   ├── 2-0.values +│   │   │   ├── 2-3.values +│   │   │   ├── 3-0.values +│   │   │   └── 3-3.values +│   │   └── snapshot +│   │   ├── 0-0.values +│   │   ├── 0-3.values +│   │   ├── 1-0.values +│   │   ├── 1-3.values +│   │   ├── 2-0.values +│   │   └── 2-3.values +│   ├── tableB +│   │   ├── segments +│   │   │   ├── 0-0.values +│   │   │   ├── 0-3.values +│   │   │   ├── 1-0.values +│   │   │   ├── 1-3.values +│   │   │   ├── 2-0.values +│   │   │   └── 2-3.values +│   │   └── snapshot +│   │   ├── 0-0.values +│   │   ├── 0-3.values +│   │   ├── 1-0.values +│   │   └── 1-3.values +│   └── tableC +│   ├── segments +│   │   ├── 0-0.values +│   │   └── 0-3.values +│   └── snapshot +└── root2 + ├── litt.lock + ├── tableA + │   ├── segments + │   │   ├── 0-1.values + │   │   ├── 1-1.values + │   │   ├── 2-1.values + │   │   └── 3-1.values + │   └── snapshot + │   ├── 0-1.values + │   ├── 1-1.values + │   └── 2-1.values + ├── tableB + │   ├── segments + │   │   ├── 0-1.values + │   │   ├── 1-1.values + │   │   └── 2-1.values + │   └── snapshot + │   ├── 0-1.values + │   └── 
1-1.values + └── tableC + ├── segments + │   └── 0-1.values + └── snapshot +``` diff --git a/sei-db/db_engine/litt/docs/licenses/README.md b/sei-db/db_engine/litt/docs/licenses/README.md new file mode 100644 index 0000000000..6c775d0ddc --- /dev/null +++ b/sei-db/db_engine/litt/docs/licenses/README.md @@ -0,0 +1,17 @@ +The LittDB database was originally developed by EigenDA. This code was copied from the +[EigenDA repository](https://github.com/Layr-Labs/eigenda) from commit `61019b4e9f91cbbb3dc05ed758674e4bdfeee20e`, +and has since been modified. + +This copy was made after 2026-03-31, i.e. after the EigenDA repository transitioned to an MIT license. + +The [original license](./business-source-license.txt) for the code in the EigenDA repo was the +Business Source License 1.1, the text of which is included here for reference. That license stipulates +that on 2026-03-31, the license for the EigenDA repo switches to an [MIT license](./mit-license.txt). +The maintainers of the EigenDA repo did not update the text of their license file at the time I made the copy, +and so the text of the MIT license was not available from the EigenDA repository. As a reference, I have +copied that text from https://opensource.org/license/MIT and included it here. + +Typically, the first part of the MIT license (i.e. `Copyright `) is filled out by the copyright +holder. However, since the EigenDA maintainers did not update their repo by the time this copy was made, I could only +copy the boilerplate from a third party source. As such, I left these fields un-filled. + diff --git a/sei-db/db_engine/litt/docs/licenses/business-source-license.txt b/sei-db/db_engine/litt/docs/licenses/business-source-license.txt new file mode 100644 index 0000000000..cc06becad3 --- /dev/null +++ b/sei-db/db_engine/litt/docs/licenses/business-source-license.txt @@ -0,0 +1,98 @@ +Business Source License 1.1 + +License text copyright (c) 2017 MariaDB Corporation Ab, All Rights Reserved. 
+"Business Source License" is a trademark of MariaDB Corporation Ab. + +----------------------------------------------------------------------------- + +Parameters + +Licensor: Layr Labs, Inc. + +Licensed Work: EigenDA + The Licensed Work is (c) 2023 Layr Labs, Inc. + +Additional Use Grant: None. + +Change Date: 2026-03-31 (March 31st, 2026) + +Change License: MIT + +----------------------------------------------------------------------------- + +Terms + +The Licensor hereby grants you the right to copy, modify, create derivative +works, redistribute, and make non-production use of the Licensed Work. The +Licensor may make an Additional Use Grant, above, permitting limited +production use. + +Effective on the Change Date, or the fourth anniversary of the first publicly +available distribution of a specific version of the Licensed Work under this +License, whichever comes first, the Licensor hereby grants you rights under +the terms of the Change License, and the rights granted in the paragraph +above terminate. + +If your use of the Licensed Work does not comply with the requirements +currently in effect as described in this License, you must purchase a +commercial license from the Licensor, its affiliated entities, or authorized +resellers, or you must refrain from using the Licensed Work. + +All copies of the original and modified Licensed Work, and derivative works +of the Licensed Work, are subject to this License. This License applies +separately for each version of the Licensed Work and the Change Date may vary +for each version of the Licensed Work released by Licensor. + +You must conspicuously display this License on each original or modified copy +of the Licensed Work. If you receive the Licensed Work in original or +modified form from a third party, the terms and conditions set forth in this +License apply to your use of that work. 
+ +Any use of the Licensed Work in violation of this License will automatically +terminate your rights under this License for the current and all other +versions of the Licensed Work. + +This License does not grant you any right in any trademark or logo of +Licensor or its affiliates (provided that you may use a trademark or logo of +Licensor as expressly required by this License). + +TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON +AN "AS IS" BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, +EXPRESS OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND +TITLE. + +MariaDB hereby grants you permission to use this License’s text to license +your works, and to refer to it using the trademark "Business Source License", +as long as you comply with the Covenants of Licensor below. + +----------------------------------------------------------------------------- + +Covenants of Licensor + +In consideration of the right to use this License’s text and the "Business +Source License" name and trademark, Licensor covenants to MariaDB, and to all +other recipients of the licensed work to be provided by Licensor: + +1. To specify as the Change License the GPL Version 2.0 or any later version, + or a license that is compatible with GPL Version 2.0 or a later version, + where "compatible" means that software provided under the Change License can + be included in a program with software provided under GPL Version 2.0 or a + later version. Licensor may specify additional Change Licenses without + limitation. + +2. To either: (a) specify an additional grant of rights to use that does not + impose any additional restriction on the right granted in this License, as + the Additional Use Grant; or (b) insert the text "None". + +3. To specify a Change Date. + +4. Not to modify this License in any other way. 
+ +----------------------------------------------------------------------------- + +Notice + +The Business Source License (this document, or the "License") is not an Open +Source license. However, the Licensed Work will eventually be made available +under an Open Source License, as stated in this License. diff --git a/sei-db/db_engine/litt/docs/licenses/mit-license.txt b/sei-db/db_engine/litt/docs/licenses/mit-license.txt new file mode 100644 index 0000000000..37d0c22eb8 --- /dev/null +++ b/sei-db/db_engine/litt/docs/licenses/mit-license.txt @@ -0,0 +1,14 @@ +Copyright + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated +documentation files (the “Software”), to deal in the Software without restriction, including without limitation the +rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit +persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial +portions of the Software. + +THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE +WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR +COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR +OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/sei-db/db_engine/litt/docs/littdb_cli.md b/sei-db/db_engine/litt/docs/littdb_cli.md new file mode 100644 index 0000000000..6c22a0c027 --- /dev/null +++ b/sei-db/db_engine/litt/docs/littdb_cli.md @@ -0,0 +1,275 @@ +# Installation + +The LittDB CLI is not currently distributed as a pre-built binary. 
This may change in the future, but for now, +you will need to build it from source. + +## Building from source + +Make sure you have the latest version of Go installed. You can find instructions for installing Go +[here](https://go.dev/doc/install). + +Clone the EigenDA repository: + +```bash +git clone https://github.com/Layr-Labs/eigenda.git +``` +Build the LittDB CLI: + +```bash +cd eigenda/litt +make build +``` + +The LittDB CLI binary will be located at `eigenda/litt/bin/litt`. + +### Optional: Shortcuts + +If you want to be able to run the LittDB CLI from anywhere, you can do one of the following: + +Create an alias in your shell configuration file (e.g. `.bashrc`, `.zshrc`, etc.): + +```bash +alias litt='path/to/eigenda/litt/bin/litt' +``` + +Or, you can add the `eigenda/litt/bin` directory to your `PATH` environment variable: + +```bash +export PATH="$PATH:path/to/eigenda/litt/bin" +``` + +Or you can just copy the `litt` binary to a directory that is already in your `PATH`, such as `/usr/local/bin`: + +```bash +cp eigenda/litt/bin/litt /usr/local/bin/ +``` + +A symlink can also be created to the `litt` binary in a directory that is already in your `PATH`: + +```bash +ln -s path/to/eigenda/litt/bin/litt /usr/local/bin/litt +``` + +### Help! I'm trying to run on Windows! + +Heh, good luck! + +# Sources and Destinations + +Many LittDB commands operate on the concept of "sources" and "destinations". A source/destination is a path where +LittDB data is stored. For commands that require source directories, those directories can be specified using the +`--src` or `-s` flag. For commands that require a destination directory, the `--dst` or `-d` flag is used. + +LittDB can be configured to store data in just a single directory, or it can be configured to store data across +multiple directories. This can be useful if you want to spread data between multiple physical drives. When +using the LittDB CLI, it is important to always provide ALL source directories. 
If you do not do this, the CLI will
+detect the problem and abort the operation.
+
+## EigenDA Validator: Source Directories
+
+If you are running an EigenDA validator node, the source directories are determined by the following flags:
+
+### Recommended: `NODE_LITT_DB_STORAGE_PATHS`
+
+If `NODE_LITT_DB_STORAGE_PATHS` is set, then the source directories will be the paths specified in that variable.
+
+Example:
+```
+export NODE_LITT_DB_STORAGE_PATHS="/data0,/data1,/data2"
+
+litt ls --src /data0 --src /data1 --src /data2
+```
+
+### Deprecated: `NODE_DB_PATH`
+
+If `NODE_LITT_DB_STORAGE_PATHS` is not set, then the source directory will be determined by the value of
+`NODE_DB_PATH`. The source directory will be `$NODE_DB_PATH/chunk_v2_litt`.
+
+Note that this pattern is deprecated. It is suggested that you use the LittDB CLI to refactor your DB as described
+in the "bonus example" [here](#litt-rebase).
+
+Example:
+```
+export NODE_DB_PATH=/data
+
+litt ls --src /data/chunk_v2_litt
+```
+
+# Subcommands
+
+## `litt --help`
+
+Prints a help message.
+
+## `litt ls`
+
+A utility for listing the names of all tables in a LittDB instance.
+
+For documentation on command flags and configuration, run `litt ls --help`.
+
+Example:
+
+Suppose you have a LittDB instance with data stored in `/data0`, `/data1`, and `/data2`, and suppose you have
+tables named `tableA`, `tableB`, and `tableC`. You can list the tables in the instance by running:
+
+```
+$ litt ls --src /data0 --src /data1 --src /data2
+
+Jun 18 11:28:59.732 INF cli/ls.go:47 Tables found:
+tableA
+tableB
+tableC
+```
+
+## `litt table-info`
+
+This utility provides information about the data contained in a LittDB table.
+
+For documentation on command flags and configuration, run `litt table-info --help`.
+
+Example:
+
+Suppose you have a LittDB instance with data stored in `/data0`, `/data1`, and `/data2`, and want to get information
+about the `tableA` table.
You can run: + +``` +$ litt table-info --src /data0 --src /data1 --src /data2 tableA + +Jun 18 11:32:11.236 INF cli/table_info.go:76 Table: tableA +Jun 18 11:32:11.236 INF cli/table_info.go:77 Key count: 95 +Jun 18 11:32:11.236 INF cli/table_info.go:78 Size: 190.01 MiB +Jun 18 11:32:11.236 INF cli/table_info.go:79 Is snapshot: false +Jun 18 11:32:11.236 INF cli/table_info.go:80 Oldest segment age: 1.05 hours +Jun 18 11:32:11.236 INF cli/table_info.go:81 Oldest segment seal time: 2025-06-18T10:29:02-05:00 +Jun 18 11:32:11.236 INF cli/table_info.go:82 Newest segment age: 50.88 minutes +Jun 18 11:32:11.236 INF cli/table_info.go:83 Newest segment seal time: 2025-06-18T10:41:18-05:00 +Jun 18 11:32:11.236 INF cli/table_info.go:84 Segment span: 12.27 minutes +Jun 18 11:32:11.236 INF cli/table_info.go:85 Lowest segment index: 0 +Jun 18 11:32:11.236 INF cli/table_info.go:86 Highest segment index: 95 +Jun 18 11:32:11.236 INF cli/table_info.go:87 Key map type: LevelDBKeymap +``` + +## `litt rebase` + +LittDB can store data in multiple directories. Changing the number of directories after data has been written into +the DB is possible, but not easy to do by hand. The `litt rebase` utility automates this workflow. + +For documentation on command flags and configuration, run `litt rebase --help`. + +Before rebasing, you must know two things: + +- the list of directories where the DB is currently storing its data (called the "source directories") +- the list of directories where you want the DB to store its data after the rebase (called the "destination directories") + +If your destination directories are a superset of the source directories, then the rebase will be a no-op. Adding a new +directory to LittDB does not require a rebase, since LittDB can dynamically add new directories as needed. + +A rebase operation is idempotent. That is to say, running it more than once has the same effect as running it exactly +once. 
If your computer crashes halfway through a rebase, simply run the same command again, and the rebase utility will
+pick up where it left off.
+
+Example:
+
+Suppose you have a LittDB instance with data stored in `/data0`, `/data1`, and `/data2`, and you want to rebase to the
+directories `/data2`, `/data3`, and `/data4`. (Notice that there is overlap between the sources and destinations; this
+is ok!)
+
+You can run the following command:
+
+```
+litt rebase --src /data0 --src /data1 --src /data2 --dst /data2 --dst /data3 --dst /data4
+```
+
+Bonus example:
+
+Suppose you are running an EigenDA validator node and want to change from using the deprecated `NODE_DB_PATH` flag
+to instead using the recommended `NODE_LITT_DB_STORAGE_PATHS` flag. Suppose your old path for `NODE_DB_PATH` was
+`/data` (meaning the LittDB source directory is `/data/chunk_v2_litt`), and you instead use
+`NODE_LITT_DB_STORAGE_PATHS="/data0,/data1,/data2"`. This can be done with the following command:
+
+```
+litt rebase --src /data/chunk_v2_litt --dst /data0 --dst /data1 --dst /data2
+```
+
+## `litt benchmark`
+
+The LittDB benchmark can be launched using the `litt benchmark` command. This may be useful for determining the
+capability of hardware in various configurations, or for testing the performance of LittDB itself.
+
+The LittDB benchmark accepts a single argument, which is a path to a configuration file. An example configuration file
+is shown below:
+
+```json
+{
+  "LittConfig": {
+    "Paths": ["~/benchmark/volume1", "~/benchmark/volume2", "~/benchmark/volume3"]
+  },
+  "MaximumWriteThroughputMB": 1024,
+  "MetricsLoggingPeriodSeconds": 1
+}
+```
+
+For more documentation on possible configuration options, see
+[benchmark_config.go](../benchmark/config/benchmark_config.go).
+
+## `litt prune`
+
+The `litt prune` command is used to delete data from a LittDB database or snapshot.
LittDB snapshots are not
+automatically pruned, so if no action is taken, then the size of the snapshot on disk will grow indefinitely
+(at least until you fill up your disk).
+
+For documentation on command flags and configuration, run `litt prune --help`.
+
+The `--max-age` flag is used to specify the maximum age of data to keep, and is specified in seconds.
+
+Example:
+
+Suppose you have a LittDB instance with data stored in `/data0`, `/data1`, and `/data2`, and you want to prune all
+data that is older than 1 hour. You can run the following command:
+
+```
+litt prune --src /data0 --src /data1 --src /data2 --max-age 3600
+```
+
+## `litt push`
+
+Although it is perfectly safe from a concurrency perspective to make copies of the data in the LittDB snapshot
+directory, there are some nuances involved in doing so. The `litt push` command is a utility that can be used to
+push data from a LittDB snapshot to a remote location using `ssh` and `rsync`. The `litt push` utility also deletes
+data from the snapshot directory after it has been successfully pushed to the remote location.
+
+For documentation on command flags and configuration, run `litt push --help`.
+
+Similar to LittDB's ability to store data in multiple directories, the `litt push` command can also push data to
+multiple remote directories (on the same machine). This may be convenient if your data size is sufficiently large that
+it is difficult to provision a single disk that is large enough to hold the entire data set.
+
+`litt push` makes incremental/rolling backups. That is to say, if you make a backup at time T1, and then make a backup
+at time T2, then `litt push` will only copy data written into the DB between T1 and T2.
+
+As long as you are working from a snapshot directory, there is no need to stop the LittDB instance while you are
+making a backup. Backups made with `litt push` are fully consistent. If a backup fails for some reason
+(e.g.
a network issue or a computer crash), running the same command again will pick up where it left off.
+
+Example:
+
+Suppose your LittDB instance is storing snapshot data in `/snapshot`, and you want to push that data to directories
+`/backup1`, `/backup2`, and `/backup3` on a remote machine with the username `user` and hostname `host`. You can run
+the following command:
+
+```
+litt push --src /snapshot --dst /backup1 --dst /backup2 --dst /backup3 user@host
+```
+
+This command will copy over all data since the previous backup, and will delete data from the snapshot directory
+once it has been successfully transferred.
+
+### Restoring from a Backup
+
+To restore data from a backup, simply use `litt push` on the backup machine to push the data where it needs to go.
+`litt push` can push from multiple source directories if the backup is spread across multiple directories.
+
+### Backup Garbage Collection
+
+If you are using the patterns described above to back up data, then the size of your backup will grow indefinitely.
+To limit the amount of data you keep, use `litt prune` on the backup machine to delete old data. You should not run
+`litt prune` concurrently with `litt push`, as there are race conditions that can occur if you do so.
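+As a sketch of the retention step above, a job on the backup machine might take an exclusive `flock(1)` lock before
+pruning, so it can never overlap a push writing into the same directories. Everything here is hypothetical: the lock
+file, backup paths, and retention window are made-up, `litt` is assumed to be on `PATH`, and the job that triggers
+incoming pushes is assumed to take the same lock.
+
+```shell
#!/usr/bin/env bash
# Hypothetical backup-retention job for the backup machine.
# `litt prune` must never run concurrently with a `litt push` that targets
# the same directories, so both jobs are assumed to serialize on $LOCK.
set -euo pipefail

LOCK=/var/lock/litt-backup.lock
MAX_AGE_SECONDS=$((14 * 24 * 3600))   # keep two weeks of data

exec 9>"$LOCK"
flock -x 9   # block until no push/prune holds the lock

litt prune \
  --src /backup1 --src /backup2 --src /backup3 \
  --max-age "$MAX_AGE_SECONDS"
+```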
diff --git a/sei-db/db_engine/litt/docs/resources/flush-visual.png b/sei-db/db_engine/litt/docs/resources/flush-visual.png new file mode 100644 index 0000000000..9a7fd18cd1 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/flush-visual.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iDidIt.png b/sei-db/db_engine/litt/docs/resources/iDidIt.png new file mode 100644 index 0000000000..8e542c351c Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iDidIt.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration1.png b/sei-db/db_engine/litt/docs/resources/iteration1.png new file mode 100644 index 0000000000..49afa09a3d Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration1.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration2.png b/sei-db/db_engine/litt/docs/resources/iteration2.png new file mode 100644 index 0000000000..93726ff483 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration2.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration3.png b/sei-db/db_engine/litt/docs/resources/iteration3.png new file mode 100644 index 0000000000..8adee1102a Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration3.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration4.png b/sei-db/db_engine/litt/docs/resources/iteration4.png new file mode 100644 index 0000000000..6776b3b665 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration4.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration5.png b/sei-db/db_engine/litt/docs/resources/iteration5.png new file mode 100644 index 0000000000..2010169cd9 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration5.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration6.png b/sei-db/db_engine/litt/docs/resources/iteration6.png new file mode 100644 index 0000000000..fded86873b Binary files /dev/null and 
b/sei-db/db_engine/litt/docs/resources/iteration6.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration7.png b/sei-db/db_engine/litt/docs/resources/iteration7.png new file mode 100644 index 0000000000..9bc9c9339b Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration7.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration8.png b/sei-db/db_engine/litt/docs/resources/iteration8.png new file mode 100644 index 0000000000..1d3209758a Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration8.png differ diff --git a/sei-db/db_engine/litt/docs/resources/iteration9.png b/sei-db/db_engine/litt/docs/resources/iteration9.png new file mode 100644 index 0000000000..58f48a69f3 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/iteration9.png differ diff --git a/sei-db/db_engine/litt/docs/resources/littdb-big-picture.png b/sei-db/db_engine/litt/docs/resources/littdb-big-picture.png new file mode 100644 index 0000000000..c9890919d3 Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/littdb-big-picture.png differ diff --git a/sei-db/db_engine/litt/docs/resources/littdb-logo.png b/sei-db/db_engine/litt/docs/resources/littdb-logo.png new file mode 100644 index 0000000000..9449ffd13a Binary files /dev/null and b/sei-db/db_engine/litt/docs/resources/littdb-logo.png differ diff --git a/sei-db/db_engine/litt/go.mod b/sei-db/db_engine/litt/go.mod new file mode 100644 index 0000000000..fbbeef8c85 --- /dev/null +++ b/sei-db/db_engine/litt/go.mod @@ -0,0 +1,14 @@ +// A nested Go module boundary. The sei-db/db_engine/litt/ tree is a raw import +// from the upstream LittDB project and has not yet been adapted to this +// repo's dependency set (imports still reference github.com/Layr-Labs/...). 
+// +// Declaring this subtree as a separate module hides it from the parent +// module's `go mod tidy`, `go test ./...`, `go build ./...`, `go vet ./...`, +// and `golangci-lint run` — none of which cross module boundaries. See +// `sei-db/db_engine/litt/README.md` ("Work-in-progress guard") for the +// incremental integration policy. +// +// This file can be removed once the litt package fully compiles and passes lint. +module github.com/sei-protocol/sei-chain/sei-db/db_engine/litt + +go 1.25.6 diff --git a/sei-db/db_engine/litt/littbuilder/build_utils.go b/sei-db/db_engine/litt/littbuilder/build_utils.go new file mode 100644 index 0000000000..aebe30f4a1 --- /dev/null +++ b/sei-db/db_engine/litt/littbuilder/build_utils.go @@ -0,0 +1,290 @@ +//go:build littdb_wip + +package littbuilder + +import ( + "fmt" + "net/http" + "os" + "path" + "strings" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/common/cache" + "github.com/Layr-Labs/eigenda/litt" + tablecache "github.com/Layr-Labs/eigenda/litt/cache" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/client_golang/prometheus/collectors" + "github.com/prometheus/client_golang/prometheus/promhttp" +) + +// keymapBuilders contains builders for all supported keymap types. +var keymapBuilders = map[keymap.KeymapType]keymap.BuildKeymap{ + keymap.MemKeymapType: keymap.NewMemKeymap, + keymap.LevelDBKeymapType: keymap.NewLevelDBKeymap, + keymap.UnsafeLevelDBKeymapType: keymap.NewUnsafeLevelDBKeymap, +} + +// cacheWeight is a function that calculates the weight of a cache entry. +func cacheWeight(key string, value []byte) uint64 { + return uint64(len(key) + len(value)) +} + +// Look for a table's keymap directory in the provided segment paths. 
+func FindKeymapLocation(
+	rootPaths []string,
+	tableName string,
+) (keymapDirectory string, keymapInitialized bool, keymapTypeFile *keymap.KeymapTypeFile, err error) {
+
+	if len(rootPaths) == 0 {
+		return "", false, nil,
+			fmt.Errorf("no segment paths provided for keymap search")
+	}
+
+	potentialKeymapDirectories := make([]string, len(rootPaths))
+	for i, rootPath := range rootPaths {
+		potentialKeymapDirectories[i] = path.Join(rootPath, tableName, keymap.KeymapDirectoryName)
+	}
+
+	for _, directory := range potentialKeymapDirectories {
+		exists, err := util.Exists(directory)
+		if err != nil {
+			return "", false, nil,
+				fmt.Errorf("error checking for keymap type file: %w", err)
+		}
+		if exists {
+			if keymapDirectory != "" {
+				return "", false, nil,
+					fmt.Errorf("multiple keymap directories found: %s and %s", keymapDirectory, directory)
+			}
+
+			keymapDirectory = directory
+			keymapTypeFile, err = keymap.LoadKeymapTypeFile(directory)
+			if err != nil {
+				return "", false, nil,
+					fmt.Errorf("error loading keymap type file: %w", err)
+			}
+
+			initializedExists, err := util.Exists(path.Join(keymapDirectory, keymap.KeymapInitializedFileName))
+			if err != nil {
+				return "", false, nil,
+					fmt.Errorf("error checking for keymap initialized file: %w", err)
+			}
+			if initializedExists {
+				keymapInitialized = true
+			}
+		}
+	}
+
+	return keymapDirectory, keymapInitialized, keymapTypeFile, nil
+}
+
+// buildKeymap creates a new keymap based on the configuration.
+func buildKeymap( + config *litt.Config, + logger logging.Logger, + tableName string, +) (kmap keymap.Keymap, keymapPath string, keymapTypeFile *keymap.KeymapTypeFile, requiresReload bool, err error) { + + builderForConfiguredType, ok := keymapBuilders[config.KeymapType] + if !ok { + return nil, "", nil, false, + fmt.Errorf("unsupported keymap type: %v", config.KeymapType) + } + + keymapDirectory, keymapInitialized, keymapTypeFile, err := FindKeymapLocation(config.Paths, tableName) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error finding keymap location: %w", err) + } + + if keymapTypeFile != nil && !keymapInitialized { + // The keymap has not been fully initialized. This is likely due to a crash during the keymap reloading process. + logger.Warnf("incomplete keymap initialization detected. Deleting keymap directory: %s", + keymapDirectory) + + err := os.RemoveAll(keymapDirectory) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error deleting keymap directory: %w", err) + } + + keymapTypeFile = nil + keymapDirectory = "" + } + + newKeymap := false + if keymapTypeFile == nil { + // No previous keymap exists. Either we are starting fresh or the keymap was deleted. + newKeymap = true + + // by convention, always select the first path as the keymap directory + keymapDirectory = path.Join(config.Paths[0], tableName, keymap.KeymapDirectoryName) + keymapTypeFile = keymap.NewKeymapTypeFile(keymapDirectory, config.KeymapType) + + // create the keymap directory + err := os.MkdirAll(keymapDirectory, 0755) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error creating keymap directory: %w", err) + } + + // write the keymap type file + err = keymapTypeFile.Write() + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error writing keymap type file: %w", err) + } + + } else { + // A previous keymap exists. Check if the keymap type has changed. 
+ if config.KeymapType != keymapTypeFile.Type() { + // The previously used keymap type is different from the one in the configuration. + + keymapTypeFile = nil + + // delete the old keymap + err = os.RemoveAll(keymapDirectory) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error deleting keymap files: %w", err) + } + + // write the new keymap type file + err = os.MkdirAll(keymapDirectory, 0755) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error creating keymap directory: %w", err) + } + keymapTypeFile = keymap.NewKeymapTypeFile(keymapDirectory, config.KeymapType) + err = keymapTypeFile.Write() + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error writing keymap type file: %w", err) + } + } + } + + keymapDataDirectory := path.Join(keymapDirectory, keymap.KeymapDataDirectoryName) + kmap, requiresReload, err = builderForConfiguredType(logger, keymapDataDirectory, config.DoubleWriteProtection) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("error building keymap: %w", err) + } + + if !requiresReload { + // If the keymap does not need to be reloaded, then it is already fully initialized. + keymapInitializedFile := path.Join(keymapDirectory, keymap.KeymapInitializedFileName) + f, err := os.Create(keymapInitializedFile) + if err != nil { + return nil, "", nil, false, + fmt.Errorf("failed to create keymap initialized file: %v", err) + } + err = f.Close() + if err != nil { + return nil, "", nil, false, + fmt.Errorf("failed to close keymap initialized file: %v", err) + } + } + + return kmap, keymapDirectory, keymapTypeFile, requiresReload || newKeymap, nil +} + +// buildTable creates a new table based on the configuration. 
+func buildTable( + config *litt.Config, + logger logging.Logger, + name string, + metrics *metrics.LittDBMetrics) (litt.ManagedTable, error) { + + var table litt.ManagedTable + + if config.ShardingFactor < 1 { + return nil, fmt.Errorf("sharding factor must be at least 1") + } + + kmap, keymapDirectory, keymapTypeFile, requiresReload, err := buildKeymap(config, logger, name) + if err != nil { + return nil, fmt.Errorf("error creating keymap: %w", err) + } + + table, err = disktable.NewDiskTable( + config, + name, + kmap, + keymapDirectory, + keymapTypeFile, + config.Paths, + requiresReload, + metrics) + + if err != nil { + return nil, fmt.Errorf("error creating table: %w", err) + } + + writeCache := cache.NewFIFOCache[string, []byte](config.WriteCacheSize, cacheWeight, metrics.GetWriteCacheMetrics()) + writeCache = cache.NewThreadSafeCache(writeCache) + + readCache := cache.NewFIFOCache[string, []byte](config.ReadCacheSize, cacheWeight, metrics.GetReadCacheMetrics()) + readCache = cache.NewThreadSafeCache(readCache) + + cachedTable := tablecache.NewCachedTable(table, writeCache, readCache, metrics) + + return cachedTable, nil +} + +// buildLogger creates a new logger based on the configuration. +func buildLogger(config *litt.Config) (logging.Logger, error) { + if config.Logger != nil { + return config.Logger, nil + } + + return common.NewLogger(config.LoggerConfig) +} + +// buildMetrics creates a new metrics object based on the configuration. If the returned server is not nil, +// then it is the responsibility of the caller to eventually call server.Shutdown(). 
+func buildMetrics(config *litt.Config, logger logging.Logger) (*metrics.LittDBMetrics, *http.Server) { + if !config.MetricsEnabled { + return nil, nil + } + + var registry *prometheus.Registry + var server *http.Server + + if config.MetricsEnabled { + if config.MetricsRegistry == nil { + registry = prometheus.NewRegistry() + registry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{})) + registry.MustRegister(collectors.NewGoCollector()) + + logger.Infof("Starting metrics server at port %d", config.MetricsPort) + addr := fmt.Sprintf(":%d", config.MetricsPort) + mux := http.NewServeMux() + mux.Handle("/metrics", promhttp.HandlerFor( + registry, + promhttp.HandlerOpts{}, + )) + server = &http.Server{ + Addr: addr, + Handler: mux, + } + + go func() { + err := server.ListenAndServe() + if err != nil && !strings.Contains(err.Error(), "http: Server closed") { + logger.Errorf("metrics server error: %v", err) + } + }() + } else { + registry = config.MetricsRegistry + } + } + + return metrics.NewLittDBMetrics(registry, config.MetricsNamespace), server +} diff --git a/sei-db/db_engine/litt/littbuilder/db_impl.go b/sei-db/db_engine/litt/littbuilder/db_impl.go new file mode 100644 index 0000000000..268a6df0e1 --- /dev/null +++ b/sei-db/db_engine/litt/littbuilder/db_impl.go @@ -0,0 +1,321 @@ +//go:build littdb_wip + +package littbuilder + +import ( + "context" + "fmt" + "net/http" + "sync" + "sync/atomic" + "time" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigensdk-go/logging" +) + +var _ litt.DB = &db{} + +// TableBuilderFunc is a function that creates a new table. +type TableBuilderFunc func( + ctx context.Context, + logger logging.Logger, + name string, + metrics *metrics.LittDBMetrics) (litt.ManagedTable, error) + +// db is an implementation of DB. 
+type db struct { + ctx context.Context + logger logging.Logger + + // A function that returns the current time. + clock func() time.Time + + // The default time-to-live for new tables. Once created, the TTL for a table can be changed. + ttl time.Duration + + // The period between garbage collection runs. + gcPeriod time.Duration + + // A function that creates new tables. + tableBuilder TableBuilderFunc + + // A map of all tables in the database. + tables map[string]litt.ManagedTable + + // Protects access to tables and ttl. + lock sync.Mutex + + // True if the database has been stopped. + stopped atomic.Bool + + // Metrics for the database. + metrics *metrics.LittDBMetrics + + // The HTTP server for metrics. nil if metrics are disabled or if an external party is managing the server. + metricsServer *http.Server + + // A function that releases file locks. + releaseLocks func() + + // Set to true when the database is closed. + closed bool +} + +// NewDB creates a new DB instance. After this method is called, the config object should not be modified. +func NewDB(config *litt.Config) (litt.DB, error) { + if config.Logger == nil { + var err error + config.Logger, err = buildLogger(config) + if err != nil { + return nil, fmt.Errorf("error building logger: %w", err) + } + } + + err := config.SanityCheck() + if err != nil { + return nil, fmt.Errorf("error checking config: %w", err) + } + + err = config.SanitizePaths() + if err != nil { + return nil, fmt.Errorf("error expanding tildes in config: %w", err) + } + + if !config.Fsync { + config.Logger.Warnf( + "Fsync is disabled. Ok for unit tests that need to run fast, NOT OK FOR PRODUCTION USE.") + } + + tableBuilder := func( + ctx context.Context, + logger logging.Logger, + name string, + metrics *metrics.LittDBMetrics) (litt.ManagedTable, error) { + + return buildTable(config, logger, name, metrics) + } + + return NewDBUnsafe(config, tableBuilder) +} + +// NewDBUnsafe creates a new DB instance with a custom table builder. 
This is intended for unit test use,
+// and should not be considered a stable API.
+func NewDBUnsafe(config *litt.Config, tableBuilder TableBuilderFunc) (litt.DB, error) {
+	// Build the logger first: the lock-purge and lock-acquisition paths below log through config.Logger,
+	// so it must be non-nil before they run.
+	if config.Logger == nil {
+		var err error
+		config.Logger, err = buildLogger(config)
+		if err != nil {
+			return nil, fmt.Errorf("error building logger: %w", err)
+		}
+	}
+
+	for _, rootPath := range config.Paths {
+		err := util.EnsureDirectoryExists(rootPath, config.Fsync)
+		if err != nil {
+			return nil, fmt.Errorf("error ensuring directory %s exists: %w", rootPath, err)
+		}
+	}
+
+	if config.PurgeLocks {
+		config.Logger.Warnf("Purging LittDB locks from paths %v", config.Paths)
+		err := disktable.Unlock(config.Logger, config.Paths)
+		if err != nil {
+			return nil, fmt.Errorf("error purging locks: %w", err)
+		}
+		config.Logger.Infof("Locks purged successfully")
+	} else {
+		config.Logger.Infof("Not purging locks, continuing with existing locks")
+	}
+
+	releaseLocks, err := util.LockDirectories(config.Logger, config.Paths, util.LockfileName, config.Fsync)
+	if err != nil {
+		return nil, fmt.Errorf("error acquiring locks on paths %v: %w", config.Paths, err)
+	}
+
+	var dbMetrics *metrics.LittDBMetrics
+	var metricsServer *http.Server
+	if config.MetricsEnabled {
+		dbMetrics, metricsServer = buildMetrics(config, config.Logger)
+	}
+
+	if config.SnapshotDirectory != "" {
+		config.Logger.Infof("LittDB rolling snapshots enabled, snapshot data will be stored in %s",
+			config.SnapshotDirectory)
+	}
+
+	database := &db{
+		ctx:           config.CTX,
+		logger:        config.Logger,
+		clock:         config.Clock,
+		ttl:           config.TTL,
+		gcPeriod:      config.GCPeriod,
+		tableBuilder:  tableBuilder,
+		tables:        make(map[string]litt.ManagedTable),
+		metrics:       dbMetrics,
+		metricsServer: metricsServer,
+		releaseLocks:  releaseLocks,
+	}
+
+	if config.MetricsEnabled {
+		go database.gatherMetrics(config.MetricsUpdateInterval)
+	}
+
+	return database, nil
+}
+
+func (d *db) KeyCount() uint64 {
+	d.lock.Lock()
+	defer d.lock.Unlock()
+
+	count := uint64(0)
+	for _, table := range d.tables {
+		count += table.KeyCount()
+	}
+
+	return count
+}
+
+func (d *db) Size() uint64 {
+	d.lock.Lock()
+	defer d.lock.Unlock()
+
+	return d.lockFreeSize()
+}
+
+func (d *db) lockFreeSize() uint64 {
+	size := uint64(0)
+	for _, table := range d.tables {
+		size += table.Size()
+	}
+
+	return size
+}
+
+func (d *db) GetTable(name string) (litt.Table, error) {
+	d.lock.Lock()
+	defer d.lock.Unlock()
+
+	table, ok := d.tables[name]
+	if !ok {
+		if !litt.IsTableNameValid(name) {
+			return nil, fmt.Errorf(
+				"table name '%s' is invalid, must be at least one character long and "+
+					"contain only letters, numbers, underscores, and dashes", name)
+		}
+
+		var err error
+		table, err = d.tableBuilder(d.ctx, d.logger, name, d.metrics)
+		if err != nil {
+			return nil, fmt.Errorf("error creating table: %w", err)
+		}
+		d.logger.Infof(
+			"Table '%s' initialized, table contains %d key-value pairs and has a size of %s.",
+			name, table.KeyCount(), common.PrettyPrintBytes(table.Size()))
+
+		d.tables[name] = table
+	}
+
+	return table, nil
+}
+
+func (d *db) DropTable(name string) error {
+	d.lock.Lock()
+	defer d.lock.Unlock()
+
+	table, ok := d.tables[name]
+	if !ok {
+		// Table does not exist, nothing to do.
+ d.logger.Infof("table %s does not exist, cannot drop", name) + return nil + } + + d.logger.Infof("dropping table %s", name) + err := table.Destroy() + if err != nil { + return fmt.Errorf("error destroying table: %w", err) + } + delete(d.tables, name) + + return nil +} + +func (d *db) Close() error { + d.lock.Lock() + defer d.lock.Unlock() + return d.closeUnsafe() +} + +func (d *db) closeUnsafe() error { + if d.closed { + // closing more than once is a no-op + return nil + } + + d.logger.Infof("Stopping LittDB, estimated data size: %d", d.lockFreeSize()) + d.stopped.Store(true) + + for name, table := range d.tables { + err := table.Close() + if err != nil { + return fmt.Errorf("error stopping table %s: %w", name, err) + } + } + + d.releaseLocks() + + return nil +} + +func (d *db) Destroy() error { + d.lock.Lock() + defer d.lock.Unlock() + + err := d.closeUnsafe() + if err != nil { + return fmt.Errorf("error closing database: %w", err) + } + + for name, table := range d.tables { + err := table.Destroy() + if err != nil { + return fmt.Errorf("error destroying table %s: %w", name, err) + } + } + + return nil +} + +// gatherMetrics is a method that periodically collects metrics. 
+func (d *db) gatherMetrics(interval time.Duration) {
+	if d.metricsServer != nil {
+		defer func() {
+			err := d.metricsServer.Close()
+			if err != nil {
+				d.logger.Errorf("error closing metrics server: %v", err)
+			}
+		}()
+	}
+
+	ticker := time.NewTicker(interval)
+	defer ticker.Stop()
+
+	for !d.stopped.Load() {
+		select {
+		case <-d.ctx.Done():
+			return
+		case <-ticker.C:
+			d.lock.Lock()
+			tablesCopy := make(map[string]litt.ManagedTable, len(d.tables))
+			for name, table := range d.tables {
+				tablesCopy[name] = table
+			}
+			d.lock.Unlock()
+
+			d.metrics.CollectPeriodicMetrics(tablesCopy)
+		}
+	}
+}
diff --git a/sei-db/db_engine/litt/littdb_config.go b/sei-db/db_engine/litt/littdb_config.go new file mode 100644 index 0000000000..5920a10c89 --- /dev/null +++ b/sei-db/db_engine/litt/littdb_config.go @@ -0,0 +1,280 @@
+//go:build littdb_wip
+
+package litt
+
+import (
+	"context"
+	"fmt"
+	"math"
+	"math/rand"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/Layr-Labs/eigenda/litt/disktable/keymap"
+	"github.com/Layr-Labs/eigenda/litt/util"
+	"github.com/Layr-Labs/eigensdk-go/logging"
+	"github.com/docker/go-units"
+	"github.com/prometheus/client_golang/prometheus"
+)
+
+// Config is configuration for a litt.DB.
+type Config struct {
+	// The context for the database. If nil, context.Background() is used.
+	CTX context.Context
+
+	// The paths where the database will store its files. If the path does not exist, it will be created.
+	// If more than one path is provided, then the database will do its best to spread out the data across
+	// the paths. If the database is restarted, it will attempt to load data from all paths. Note: the number
+	// of paths should not exceed the sharding factor, or else data may not be split across all paths.
+	//
+	// Most of the time, providing exactly one path is sufficient. If the data should be spread across multiple
+	// drives, then providing multiple paths permits that.
The number of provided paths should be a small number, perhaps + // a few dozen paths at most. Providing an excessive number of paths may lead to degraded performance. + // + // Providing zero paths will cause the DB to return an error at startup. + Paths []string + + // The logger for the database. If nil, a logger is built using the LoggerConfig. + Logger logging.Logger + + // The logger configuration for the database. Ignored if Logger is not nil. + LoggerConfig *common.LoggerConfig + + // The type of the keymap. Choices are keymap.MemKeymapType and keymap.LevelDBKeymapType. + // Default is keymap.LevelDBKeymapType. + KeymapType keymap.KeymapType + + // The default TTL for newly created tables (either ones with data on disk or new tables). + // The default is 0 (no TTL). TTL can be set individually on each table by calling Table.SetTTL(). + TTL time.Duration + + // The size of the control channel for the segment manager. The default is 64. + ControlChannelSize int + + // The target size for segments. The default is math.MaxUint32. + TargetSegmentFileSize uint32 + + // The maximum number of keys in a segment. The default is 50,000. For workloads with moderately large values + // (i.e. in the kb+ range), this threshold is unlikely to be relevant. For workloads with very small values, + // this constant prevents a segment from accumulating too many keys. A segment with too many keys may have + // undesirable properties such as a very large key file and very slow garbage collection (since no kv-pair in + // a segment can be deleted until the entire segment is deleted). + MaxSegmentKeyCount uint32 + + // The desired maximum size for a key file. The default is 2 MB. When a key file exceeds this size, the segment + // will close the current segment and begin writing to a new one. For workloads with moderately large values, + // this threshold is unlikely to be relevant. For workloads with very small values, this constant prevents a key + // file from growing too large. 
A key file with too many keys may have undesirable properties such as very slow
+	// garbage collection (since no kv-pair in a segment can be deleted until the entire segment is deleted).
+	TargetSegmentKeyFileSize uint64
+
+	// The period between garbage collection runs. The default is 5 minutes.
+	GCPeriod time.Duration
+
+	// The size of the keymap deletion batch for garbage collection. The default is 10,000.
+	GCBatchSize uint64
+
+	// The sharding factor for the database. If the sharding factor is greater than 1, then values will be spread
+	// out across multiple files. (Note that individual values will always be written to a single file, but two
+	// different values may be written to different files.) These shard files are spread evenly across the paths
+	// provided in the Paths field. If the sharding factor is larger than the number of paths, then some paths will
+	// have multiple shard files. If the sharding factor is smaller than the number of paths, then some paths may not
+	// always have an actively written shard file.
+	//
+	// The default is 8. Must be at least 1.
+	ShardingFactor uint32
+
+	// The random number generator used for generating sharding salts. The default is a standard rand.New()
+	// seeded by the current time.
+	SaltShaker *rand.Rand
+
+	// The size of the cache for tables that have not had their write cache size set. A write cache is used
+	// to store recently written values for fast access. The default is 0 (no cache).
+	// Cache size is in bytes, and includes the size of both the key and the value. Cache size can be set
+	// individually on each table by calling Table.SetWriteCacheSize().
+	WriteCacheSize uint64
+
+	// The size of the cache for tables that have not had their read cache size set. A read cache is used
+	// to store recently read values for fast access. The default is 0 (no cache).
+	// Cache size is in bytes, and includes the size of both the key and the value.
Cache size can be set + // individually on each table by calling Table.SetReadCacheSize(). + ReadCacheSize uint64 + + // The time source used by the database. This can be substituted for an artificial time source + // for testing purposes. The default is time.Now. + Clock func() time.Time + + // If true, then flush operations will call fsync on the underlying file to ensure data is flushed out of the + // operating system's buffer and onto disk. Setting this to false means that even after flushing data, + // there may be data loss in the advent of an OS/hardware crash. + // + // The default is true. + // + // Enabling fsync may have performance implications, although this strongly depends on the workload. For large + // batches that are flushed infrequently, benchmark data suggests that the impact is minimal. For small batches + // that are flushed frequently, the difference can be severe. For example, when enabled in unit tests that do + // super tiny and frequent flushes, the difference in performance was an order of magnitude. + Fsync bool + + // If enabled, the database will return an error if a key is written but that key is already present in + // the database. Updating existing keys is illegal and may result in unexpected behavior, and so this check + // acts as a safety mechanism against this sort of illegal operation. Unfortunately, if using a keymap other + // than keymap.MemKeymapType, performing this check may be very expensive. By default, this is false. + DoubleWriteProtection bool + + // If enabled, collect DB metrics and export them to prometheus. By default, this is false. + MetricsEnabled bool + + // The namespace to use for metrics. If empty, the default namespace "litt" is used. + MetricsNamespace string + + // The prometheus registry to use for metrics. If nil and metrics are enabled, a new registry is created. + MetricsRegistry *prometheus.Registry + + // The port to use for the metrics server. 
Ignored if MetricsEnabled is false or MetricsRegistry is not nil.
+	// The default is 9101.
+	MetricsPort int
+
+	// The interval at which various DB metrics are updated. The default is 1 second.
+	MetricsUpdateInterval time.Duration
+
+	// A function that is called if the database experiences a non-recoverable error (e.g. data corruption,
+	// a crashed goroutine, a full disk, etc.). If nil (the default), no callback is called. If called at all,
+	// this method is called exactly once.
+	FatalErrorCallback func(error)
+
+	// If empty, snapshotting is disabled. If not empty, then this directory is used by the database to publish a
+	// rolling sequence of "snapshots". Using the data in the snapshot directory, an external process can safely
+	// get a consistent read-only view of the database.
+	//
+	// The snapshot directory will contain symbolic links to segment files that are safe for external processes to
+	// read/copy. If, at any point in time, an external process takes all data in the snapshot directory and loads
+	// it into a new LittDB instance, then that instance will have a consistent view of the database. (Note that there
+	// are some steps required to load this data into a new database instance.)
+	//
+	// Since data may be spread across multiple physical volumes, it is not possible to create a directory with hard
+	// linked files for all configurations (short of making cost-prohibitive copies). Each symbolic link in the
+	// snapshot directory points to a file that MUST be garbage collected by whatever external process is making use
+	// of database snapshots. Failing to clean up the hard linked files referenced by the symlinks will result in a
+	// disk space leak.
+	SnapshotDirectory string
+
+	// If true, then purge all lock files prior to starting the database. This is potentially dangerous, as it will
+	// permit multiple databases to be opened against the same data directories.
If ever there are two LittDB + // instances running against the same data directories, data corruption is almost a certainty. + PurgeLocks bool + + // If Flush() is called more frequently than this interval, the flushes may be batched together to improve + // performance. If this is set to zero, then no batching is performed and all flushes are executed immediately. + MinimumFlushInterval time.Duration +} + +// DefaultConfig returns a Config with default values. +func DefaultConfig(paths ...string) (*Config, error) { + if len(paths) == 0 { + return nil, fmt.Errorf("at least one path must be provided") + } + + config := DefaultConfigNoPaths() + config.Paths = paths + + return config, nil +} + +// DefaultConfigNoPaths returns a Config with default values, and does not require any paths to be provided. +// If paths are not set prior to use, then the DB will return an error at startup. +func DefaultConfigNoPaths() *Config { + seed := time.Now().UnixNano() + saltShaker := rand.New(rand.NewSource(seed)) + + loggerConfig := common.DefaultLoggerConfig() + + return &Config{ + CTX: context.Background(), + LoggerConfig: loggerConfig, + Clock: time.Now, + GCPeriod: 5 * time.Minute, + GCBatchSize: 10_000, + ShardingFactor: 8, + SaltShaker: saltShaker, + KeymapType: keymap.LevelDBKeymapType, + ControlChannelSize: 64, + TargetSegmentFileSize: math.MaxUint32, + MaxSegmentKeyCount: 50_000, + TargetSegmentKeyFileSize: 2 * units.MiB, + Fsync: true, + DoubleWriteProtection: false, + MetricsEnabled: false, + MetricsNamespace: "litt", + MetricsPort: 9101, + MetricsUpdateInterval: time.Second, + PurgeLocks: false, + } +} + +// SanitizePaths replaces any paths that start with '~' with the user's home directory. 
+func (c *Config) SanitizePaths() error { + for i, path := range c.Paths { + var err error + c.Paths[i], err = util.SanitizePath(path) + if err != nil { + return fmt.Errorf("error sanitizing path %s: %w", path, err) + } + } + + if c.SnapshotDirectory != "" { + var err error + c.SnapshotDirectory, err = util.SanitizePath(c.SnapshotDirectory) + if err != nil { + return fmt.Errorf("error sanitizing snapshot directory %s: %w", c.SnapshotDirectory, err) + } + } + + return nil +} + +// SanityCheck performs a sanity check on the configuration, returning an error if any of the configuration +// settings are invalid. The config returned by DefaultConfig() is guaranteed to pass this check if unmodified. +func (c *Config) SanityCheck() error { + if c.CTX == nil { + return fmt.Errorf("context cannot be nil") + } + if len(c.Paths) == 0 { + return fmt.Errorf("at least one path must be provided") + } + if c.Logger == nil && c.LoggerConfig == nil { + return fmt.Errorf("logger or logger config must be provided") + } + if c.Clock == nil { + return fmt.Errorf("time source cannot be nil") + } + if c.GCBatchSize == 0 { + return fmt.Errorf("gc batch size must be at least 1") + } + if c.ShardingFactor == 0 { + return fmt.Errorf("sharding factor must be at least 1") + } + if c.ControlChannelSize == 0 { + return fmt.Errorf("control channel size must be at least 1") + } + if c.TargetSegmentFileSize == 0 { + return fmt.Errorf("target segment file size must be at least 1") + } + if c.MaxSegmentKeyCount == 0 { + return fmt.Errorf("max segment key count must be at least 1") + } + if c.TargetSegmentKeyFileSize == 0 { + return fmt.Errorf("target segment key file size must be at least 1") + } + if c.GCPeriod == 0 { + return fmt.Errorf("gc period must be at least 1") + } + if c.SaltShaker == nil { + return fmt.Errorf("salt shaker cannot be nil") + } + if (c.MetricsEnabled || c.MetricsRegistry != nil) && c.MetricsUpdateInterval == 0 { + return fmt.Errorf("metrics update interval must be at least 1 
if metrics are enabled")
+	}
+
+	return nil
+}
diff --git a/sei-db/db_engine/litt/memtable/mem_table.go b/sei-db/db_engine/litt/memtable/mem_table.go new file mode 100644 index 0000000000..128b604e07 --- /dev/null +++ b/sei-db/db_engine/litt/memtable/mem_table.go @@ -0,0 +1,215 @@
+//go:build littdb_wip
+
+package memtable
+
+import (
+	"fmt"
+	"sync"
+	"sync/atomic"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common/structures"
+	"github.com/Layr-Labs/eigenda/litt"
+	"github.com/Layr-Labs/eigenda/litt/types"
+)
+
+var _ litt.ManagedTable = &memTable{}
+
+// expirationRecord is a record of when a key was inserted into the table.
+type expirationRecord struct {
+	// The time at which the key was inserted into the table.
+	creationTime time.Time
+	// A stringified version of the key.
+	key string
+}
+
+// memTable is a simple implementation of a Table that stores its data in memory.
+type memTable struct {
+	// A function that returns the current time.
+	clock func() time.Time
+
+	// The name of the table.
+	name string
+
+	// The time-to-live for data in this table.
+	ttl time.Duration
+
+	// The actual data store.
+	data map[string][]byte
+
+	// Keeps track of when data should be deleted.
+	expirationQueue *structures.Queue[*expirationRecord]
+
+	// Protects access to data and expirationQueue.
+	//
+	// This implementation could be made with smaller granularity locks to improve multithreaded performance,
+	// at the cost of code complexity. But since this implementation is primarily intended for use in tests,
+	// such optimization is not necessary.
+	lock sync.RWMutex
+
+	shutdown atomic.Bool
+}
+
+// NewMemTable creates a new in-memory table.
+func NewMemTable(config *litt.Config, name string) litt.ManagedTable { + + table := &memTable{ + clock: config.Clock, + name: name, + ttl: config.TTL, + data: make(map[string][]byte), + expirationQueue: structures.NewQueue[*expirationRecord](1024), + } + + if config.GCPeriod > 0 { + ticker := time.NewTicker(config.GCPeriod) + go func() { + defer ticker.Stop() + for !table.shutdown.Load() { + <-ticker.C + err := table.RunGC() + if err != nil { + panic(err) // this type is designed for use in tests; errors are not worth handling properly + } + } + }() + } + + return table +} + +func (m *memTable) Size() uint64 { + // Technically speaking, this table stores zero bytes on disk, and this method + // is contractually obligated to return only the size of the data on disk. + return 0 +} + +func (m *memTable) Name() string { + return m.name +} + +func (m *memTable) KeyCount() uint64 { + m.lock.RLock() + defer m.lock.RUnlock() + return uint64(len(m.data)) +} + +func (m *memTable) Put(key []byte, value []byte) error { + stringKey := string(key) + expiration := &expirationRecord{ + creationTime: m.clock(), + key: stringKey, + } + + m.lock.Lock() + defer m.lock.Unlock() + + _, ok := m.data[stringKey] + if ok { + return fmt.Errorf("key %x already exists", key) + } + m.data[stringKey] = value + m.expirationQueue.Push(expiration) + + return nil +} + +func (m *memTable) PutBatch(batch []*types.KVPair) error { + for _, kv := range batch { + err := m.Put(kv.Key, kv.Value) + if err != nil { + return err + } + } + return nil +} + +func (m *memTable) Get(key []byte) (value []byte, exists bool, err error) { + value, exists, _, err = m.CacheAwareGet(key, false) + return value, exists, err +} + +func (m *memTable) CacheAwareGet(key []byte, _ bool) (value []byte, exists bool, hot bool, err error) { + m.lock.RLock() + defer m.lock.RUnlock() + + value, exists = m.data[string(key)] + if !exists { + return nil, false, false, nil + } + + return value, true, true, nil +} + +func (m *memTable) 
Exists(key []byte) (exists bool, err error) { + m.lock.RLock() + defer m.lock.RUnlock() + _, exists = m.data[string(key)] + return exists, nil +} + +func (m *memTable) Flush() error { + // This is a no-op for a memory table. Memory tables are ephemeral by nature. + return nil +} + +func (m *memTable) SetTTL(ttl time.Duration) error { + m.lock.Lock() + defer m.lock.Unlock() + m.ttl = ttl + return nil +} + +func (m *memTable) Destroy() error { + m.lock.Lock() + defer m.lock.Unlock() + + m.data = make(map[string][]byte) + m.expirationQueue.Clear() + + return nil +} + +func (m *memTable) Close() error { + m.shutdown.Store(true) + return nil +} + +func (m *memTable) SetWriteCacheSize(size uint64) error { + return nil +} + +func (m *memTable) SetReadCacheSize(size uint64) error { + return nil +} + +func (m *memTable) SetShardingFactor(shardingFactor uint32) error { + // the memory table has no concept of sharding + return nil +} + +func (m *memTable) RunGC() error { + m.lock.Lock() + defer m.lock.Unlock() + + if m.ttl == 0 { + return nil + } + + now := m.clock() + earliestPermittedCreationTime := now.Add(-m.ttl) + + for { + expiration, ok := m.expirationQueue.TryPeek() + if !ok { + break + } + if expiration.creationTime.After(earliestPermittedCreationTime) { + break + } + m.expirationQueue.Pop() + delete(m.data, expiration.key) + } + + return nil +} diff --git a/sei-db/db_engine/litt/metrics/littdb_metrics.go b/sei-db/db_engine/litt/metrics/littdb_metrics.go new file mode 100644 index 0000000000..09e3073461 --- /dev/null +++ b/sei-db/db_engine/litt/metrics/littdb_metrics.go @@ -0,0 +1,388 @@ +//go:build littdb_wip + +package metrics + +import ( + "time" + + "github.com/Layr-Labs/eigenda/common" + "github.com/Layr-Labs/eigenda/common/cache" + "github.com/Layr-Labs/eigenda/litt" + "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/client_golang/prometheus/promauto" +) + +// Metrics to possibly add in the future: +// - total disk used, broken down by 
root + - disk available on each root + - control loop idle fraction + - main control loop + - flush loop + - shard control loops + - keyfile control loop + - total number of segments + - average segment span (i.e. difference in time between first and last values written to a segment) + - segment creation rate + - used/unused segment space (useful for detecting shard assignment issues) + +// LittDBMetrics encapsulates metrics for a LittDB. +type LittDBMetrics struct { + // The size of individual tables in the database. + tableSizeInBytes *prometheus.GaugeVec + + // The number of keys in individual tables in the database. + tableKeyCount *prometheus.GaugeVec + + // The number of bytes read from disk since startup. + bytesReadCounter *prometheus.CounterVec + + // The number of keys read from disk since startup. + keysReadCounter *prometheus.CounterVec + + // The number of cache hits since startup. + cacheHitCounter *prometheus.CounterVec + + // The number of cache misses since startup. + cacheMissCounter *prometheus.CounterVec + + // Reports on the read latency of the database. This metric includes both cache hits and cache misses. + readLatency *prometheus.SummaryVec + + // Reports on the read latency of the database, but only measures the time to read a value when a + // cache miss occurs. + cacheMissLatency *prometheus.SummaryVec + + // The number of bytes written to disk since startup. Only includes values, not metadata. + bytesWrittenCounter *prometheus.CounterVec + + // The number of keys written to disk since startup. + keysWrittenCounter *prometheus.CounterVec + + // Reports on the write latency of the database. + writeLatency *prometheus.SummaryVec + + // The number of times a flush operation has been performed. + flushCount *prometheus.CounterVec + + // Reports on the latency of a flush operation. + flushLatency *prometheus.SummaryVec + + // Reports on the latency of flushing segment files. 
This is a subset of the time spent during a flush operation. + segmentFlushLatency *prometheus.SummaryVec + + // Reports on the latency of a keymap flush operation. This is a subset of the time spent during a flush operation. + keymapFlushLatency *prometheus.SummaryVec + + // The latency of garbage collection operations. + garbageCollectionLatency *prometheus.SummaryVec + + // Metrics for the write cache. + writeCacheMetrics *cache.CacheMetrics + + // Metrics for the read cache. + readCacheMetrics *cache.CacheMetrics +} + +// NewLittDBMetrics creates a new LittDBMetrics instance. +func NewLittDBMetrics(registry *prometheus.Registry, namespace string) *LittDBMetrics { + if registry == nil { + return nil + } + + objectives := map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001} + + tableSizeInBytes := promauto.With(registry).NewGaugeVec( + prometheus.GaugeOpts{ + Namespace: namespace, + Name: "table_size_bytes", + Help: "The size of individual tables in the database in bytes.", + }, + []string{"table"}, + ) + + tableKeyCount := promauto.With(registry).NewGaugeVec( + prometheus.GaugeOpts{ + Namespace: namespace, + Name: "table_key_count", + Help: "The number of keys in individual tables in the database.", + }, + []string{"table"}, + ) + + bytesReadCounter := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "bytes_read", + Help: "The number of bytes read from disk since startup.", + }, + []string{"table"}, + ) + + keysReadCounter := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "keys_read", + Help: "The number of keys read from disk since startup.", + }, + []string{"table"}, + ) + + cacheHitCounter := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "cache_hits", + Help: "The number of cache hits since startup.", + }, + []string{"table"}, + ) + + cacheMissCounter := promauto.With(registry).NewCounterVec( + 
prometheus.CounterOpts{ + Namespace: namespace, + Name: "cache_misses", + Help: "The number of cache misses since startup.", + }, + []string{"table"}, + ) + + readLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "read_latency_ms", + Help: "Reports on the read latency of the database. " + + "This metric includes both cache hits and cache misses.", + Objectives: objectives, + }, + []string{"table"}, + ) + + cacheMissLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "cache_miss_latency_ms", + Help: "Reports on the read latency of the database, " + + "but only measures the time to read a value when a cache miss occurs.", + Objectives: objectives, + }, + []string{"table"}, + ) + + bytesWrittenCounter := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "bytes_written", + Help: "The number of bytes written to disk since startup. Only includes values, not metadata.", + }, + []string{"table"}, + ) + + keysWrittenCounter := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "keys_written", + Help: "The number of keys written to disk since startup.", + }, + []string{"table"}, + ) + + writeLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "write_latency_ms", + Help: "Reports on the write latency of the database.", + Objectives: objectives, + }, + []string{"table"}, + ) + + flushCount := promauto.With(registry).NewCounterVec( + prometheus.CounterOpts{ + Namespace: namespace, + Name: "flush_count", + Help: "The number of times a flush operation has been performed.", + }, + []string{"table"}, + ) + + flushLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "flush_latency_ms", + Help: "Reports on the latency of a flush operation.", + Objectives: objectives, + 
}, + []string{"table"}, + ) + + segmentFlushLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "segment_flush_latency_ms", + Help: "Reports on segment flush latency. This is a subset of the time spent during a flush operation.", + Objectives: objectives, + }, + []string{"table"}, + ) + + keymapFlushLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "keymap_flush_latency_ms", + Help: "Reports on the latency of a keymap flush operation. " + + "This is a subset of the time spent during a flush operation.", + Objectives: objectives, + }, + []string{"table"}, + ) + + garbageCollectionLatency := promauto.With(registry).NewSummaryVec( + prometheus.SummaryOpts{ + Namespace: namespace, + Name: "garbage_collection_latency_ms", + Help: "Reports on the latency of garbage collection operations.", + Objectives: objectives, + }, + []string{"table"}, + ) + + writeCacheMetrics := cache.NewCacheMetrics( + registry, + namespace, + "chunk_write", + ) + + readCacheMetrics := cache.NewCacheMetrics( + registry, + namespace, + "chunk_read", + ) + + return &LittDBMetrics{ + tableSizeInBytes: tableSizeInBytes, + tableKeyCount: tableKeyCount, + bytesReadCounter: bytesReadCounter, + keysReadCounter: keysReadCounter, + cacheHitCounter: cacheHitCounter, + cacheMissCounter: cacheMissCounter, + readLatency: readLatency, + cacheMissLatency: cacheMissLatency, + bytesWrittenCounter: bytesWrittenCounter, + keysWrittenCounter: keysWrittenCounter, + writeLatency: writeLatency, + flushCount: flushCount, + flushLatency: flushLatency, + garbageCollectionLatency: garbageCollectionLatency, + segmentFlushLatency: segmentFlushLatency, + keymapFlushLatency: keymapFlushLatency, + writeCacheMetrics: writeCacheMetrics, + readCacheMetrics: readCacheMetrics, + } +} + +// CollectPeriodicMetrics is a method that is periodically called to collect metrics. 
Tables are not permitted to be +// added or dropped while this method is running. +func (m *LittDBMetrics) CollectPeriodicMetrics(tables map[string]litt.ManagedTable) { + if m == nil { + return + } + + for _, table := range tables { + tableName := table.Name() + + tableSize := table.Size() + m.tableSizeInBytes.WithLabelValues(tableName).Set(float64(tableSize)) + + tableKeyCount := table.KeyCount() + m.tableKeyCount.WithLabelValues(tableName).Set(float64(tableKeyCount)) + } +} + +// ReportReadOperation reports the results of a read operation. +func (m *LittDBMetrics) ReportReadOperation( + tableName string, + latency time.Duration, + dataSize uint64, + cacheHit bool) { + + if m == nil { + return + } + + m.bytesReadCounter.WithLabelValues(tableName).Add(float64(dataSize)) + m.keysReadCounter.WithLabelValues(tableName).Inc() + m.readLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) + + if cacheHit { + m.cacheHitCounter.WithLabelValues(tableName).Inc() + } else { + m.cacheMissCounter.WithLabelValues(tableName).Inc() + m.cacheMissLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) + } +} + +// ReportWriteOperation reports the results of a write operation. +func (m *LittDBMetrics) ReportWriteOperation( + tableName string, + latency time.Duration, + batchSize uint64, + dataSize uint64) { + + if m == nil { + return + } + + m.bytesWrittenCounter.WithLabelValues(tableName).Add(float64(dataSize)) + m.keysWrittenCounter.WithLabelValues(tableName).Add(float64(batchSize)) + m.writeLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) +} + +// ReportFlushOperation reports the results of a flush operation. 
+func (m *LittDBMetrics) ReportFlushOperation(tableName string, latency time.Duration) { + if m == nil { + return + } + + m.flushCount.WithLabelValues(tableName).Inc() + m.flushLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) +} + +// ReportSegmentFlushLatency reports the amount of time taken to flush value files. +func (m *LittDBMetrics) ReportSegmentFlushLatency(tableName string, latency time.Duration) { + if m == nil { + return + } + + m.segmentFlushLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) +} + +// ReportKeymapFlushLatency reports the amount of time taken to flush the keymap. +func (m *LittDBMetrics) ReportKeymapFlushLatency(tableName string, latency time.Duration) { + if m == nil { + return + } + + m.keymapFlushLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) +} + +// ReportGarbageCollectionLatency reports the latency of a garbage collection operation. +func (m *LittDBMetrics) ReportGarbageCollectionLatency(tableName string, latency time.Duration) { + if m == nil { + return + } + + m.garbageCollectionLatency.WithLabelValues(tableName).Observe(common.ToMilliseconds(latency)) +} + +func (m *LittDBMetrics) GetWriteCacheMetrics() *cache.CacheMetrics { + if m == nil { + return nil + } + return m.writeCacheMetrics +} + +func (m *LittDBMetrics) GetReadCacheMetrics() *cache.CacheMetrics { + if m == nil { + return nil + } + return m.readCacheMetrics +} diff --git a/sei-db/db_engine/litt/table.go b/sei-db/db_engine/litt/table.go new file mode 100644 index 0000000000..4e8608c3c8 --- /dev/null +++ b/sei-db/db_engine/litt/table.go @@ -0,0 +1,145 @@ +//go:build littdb_wip + +package litt + +import ( + "regexp" + "time" + + "github.com/Layr-Labs/eigenda/litt/types" +) + +// TableNameRegex is a regular expression that matches valid table names. 
+var TableNameRegex = regexp.MustCompile(`^[a-zA-Z0-9_-]+$`) + +// Table is a key-value store with a namespace that does not overlap with other tables. +// Values may be written to the table, but once written, they may not be changed or deleted (except via TTL). +// +// All methods in this interface are thread safe. +type Table interface { + // Name returns the name of the table. Table names are unique across the database. + Name() string + + // Put stores a value in the database. May not be used to overwrite an existing value. + // Note that when this method returns, data written may not be crash durable on disk + // (although the write does have atomicity). In order to ensure crash durability, call Flush(). + // + // The maximum size of the key is 2^32 bytes. The maximum size of the value is 2^32 bytes. + // This database has been optimized under the assumption that values are generally much larger than keys. + // This affects performance, but not correctness. + // + // It is not safe to modify the byte slices passed to this function after the call + // (both the key and the value). + Put(key []byte, value []byte) error + + // PutBatch stores multiple values in the database. Similar to Put, but allows for multiple values to be written + // at once. This may improve performance, but it otherwise has identical properties to a sequence of Put calls + // (i.e. this method does not atomically write the entire batch). + // + // The maximum size of a key is 2^32 bytes. The maximum size of a value is 2^32 bytes. + // This database has been optimized under the assumption that values are generally much larger than keys. + // This affects performance, but not correctness. + // + // It is not safe to modify the byte slices passed to this function after the call + // (including the key byte slices and the value byte slices). + PutBatch(batch []*types.KVPair) error + + // Get retrieves a value from the database. 
The returned boolean indicates whether the key exists in the database + // (returns false if the key does not exist). If an error is returned, the other return values are + // undefined. + // + // The maximum size of a key is 2^32 bytes. The maximum size of a value is 2^32 bytes. + // This database has been optimized under the assumption that values are generally much larger than keys. + // This affects performance, but not correctness. + // + // For the sake of performance, the returned data is NOT safe to mutate. If you need to modify the data, + // make a copy of it first. It is also not safe to modify the key byte slice after it is passed to this + // method. + Get(key []byte) (value []byte, exists bool, err error) + + // CacheAwareGet is identical to Get, except that it permits the caller to determine whether the value + // should still be read if it is not present in the cache. If read, it also returns whether the value + // was present in the cache. Note that the 'exists' return value is always accurate even if onlyReadFromCache + // is true. If onlyReadFromCache is true and the value exists but is not in the cache, the returned values are + // (nil, true, false, nil). + CacheAwareGet(key []byte, onlyReadFromCache bool) (value []byte, exists bool, hot bool, err error) + + // Exists returns true if the key exists in the database, and false otherwise. This is faster than calling Get. + // + // It is not safe to modify the key byte slice after it is passed to this method. + Exists(key []byte) (exists bool, err error) + + // Flush ensures that all data written to the database is crash durable on disk. When this method returns, + // all data written by Put() operations is guaranteed to be crash durable. Put() operations that overlap with calls + // to Flush() may not be crash durable after this method returns. + // + // Note that data flushed at the same time is not atomic. 
If the process crashes mid-flush, some data + // being flushed may become persistent, while some may not. Each individual key-value pair is atomic + // in the event of a crash, though. This is true even for very large keys/values. + Flush() error + + // Size returns the disk size of the table in bytes. Does not include the size of any data stored only in memory. + // + // Note that the value returned by this method may lag slightly behind the actual size of the table due to the + // pipelined implementation of the database. If an exact size is needed, first call Flush(), then call Size(). + // + // Due to technical limitations, this size may or may not accurately reflect the size of the keymap. This is + // because some third party libraries used for certain keymap implementations do not provide an accurate way to + // measure size. + Size() uint64 + + // KeyCount returns the number of keys in the table. + KeyCount() uint64 + + // SetTTL sets the time to live for data in this table. This TTL is immediately applied to data already in + // the table. Note that deletion is lazy. That is, when the data expires, it may not be deleted immediately. + // + // A TTL less than or equal to 0 means that the data never expires. + SetTTL(ttl time.Duration) error + + // SetShardingFactor sets the number of write shards used. Increasing this value increases the number of parallel + // writes that can be performed. + SetShardingFactor(shardingFactor uint32) error + + // SetWriteCacheSize sets the write cache size, in bytes, for the table. For table implementations without a cache, + // this method does nothing. The cache is used to store recently written data. When reading from the table, + // if the requested data is present in this cache, the cache is used instead of reading from disk. Reading from the + // cache is significantly faster than reading from the disk. + // + // If the cache size is set to 0 (default), the cache is disabled. 
The size of each cache entry is equal to the sum + // of key length and the value length. Note that the actual in-memory footprint of the cache will be slightly + // larger than the cache size due to implementation overhead (e.g. pointers, slice headers, map entries, etc.). + SetWriteCacheSize(size uint64) error + + // SetReadCacheSize sets the read cache size, in bytes, for the table. For table implementations without a cache, + // this method does nothing. The cache is used to store recently read data. When reading from the table, + // if the requested data is present in this cache, the cache is used instead of reading from disk. Reading from the + // cache is significantly faster than reading from the disk. + // + // If the cache size is set to 0 (default), the cache is disabled. The size of each cache entry is equal to the sum + // of key length and the value length. Note that the actual in-memory footprint of the cache will be slightly + // larger than the cache size due to implementation overhead (e.g. pointers, slice headers, map entries, etc.). + SetReadCacheSize(size uint64) error +} + +// IsTableNameValid returns true if the table name is valid. +func IsTableNameValid(name string) bool { + return TableNameRegex.MatchString(name) +} + +// ManagedTable is a Table that can perform garbage collection on its data. This type should not be directly used +// by clients, and is a type that is used internally by the database. +type ManagedTable interface { + Table + + // Close shuts down the table, flushing data to disk. + Close() error + + // Destroy cleans up resources used by the table. All data on disk is permanently and unrecoverably deleted. + Destroy() error + + // RunGC performs a garbage collection run. This method blocks until that run is complete. + // This method is intended for use in tests, where it can be useful to force a garbage collection run to occur + // at a specific time. 
+ RunGC() error +} diff --git a/sei-db/db_engine/litt/test/cache_test.go b/sei-db/db_engine/litt/test/cache_test.go new file mode 100644 index 0000000000..1a049b25d1 --- /dev/null +++ b/sei-db/db_engine/litt/test/cache_test.go @@ -0,0 +1,192 @@ +//go:build littdb_wip + +package test + +import ( + "os" + "testing" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestCache(t *testing.T) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + config, err := litt.DefaultConfig(directory) + require.NoError(t, err) + + config.WriteCacheSize = rand.Uint64Range(1000, 2000) + config.ReadCacheSize = rand.Uint64Range(1000, 2000) + config.Fsync = false + config.DoubleWriteProtection = true + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + table, err := db.GetTable("test_table") + require.NoError(t, err) + + expectedValues := make(map[string][]byte) + + var firstKey []byte + var firstValueSize uint64 + + keySize := uint64(32) + maxValueSize := uint64(50) + + // Write some values to the table. Stop before any values are evicted from the write cache. + bytesWritten := uint64(0) + for bytesWritten <= config.WriteCacheSize-keySize-maxValueSize { + nextValueSize := rand.Uint64Range(1, maxValueSize) + kvSize := keySize + nextValueSize + + bytesWritten += kvSize + + key := rand.PrintableBytes(int(keySize)) + value := rand.PrintableBytes(int(nextValueSize)) + + if firstKey == nil { + firstKey = key + firstValueSize = nextValueSize + } + + expectedValues[string(key)] = value + err = table.Put(key, value) + require.NoError(t, err) + } + err = table.Flush() + require.NoError(t, err) + + // Read all values. All should be hot (i.e. in the read cache). + for expectedKey, expectedValue := range expectedValues { + // Only permit reading from the cache. 
+ value, ok, hot, err := table.CacheAwareGet([]byte(expectedKey), true) + require.NoError(t, err) + require.True(t, ok) + require.True(t, hot) + require.Equal(t, expectedValue, value) + + // Permit reading from disk. Since everything is in the cache, this should be functionally equivalent. + value, ok, hot, err = table.CacheAwareGet([]byte(expectedKey), false) + require.NoError(t, err) + require.True(t, ok) + require.True(t, hot) + require.Equal(t, expectedValue, value) + } + + // Write another value that is large enough to cause the first value to be + // evicted from the write cache. + key := rand.PrintableBytes(int(keySize)) + value := rand.PrintableBytes(int(maxValueSize)) + bytesWritten += keySize + maxValueSize + expectedValues[string(key)] = value + err = table.Put(key, value) + require.NoError(t, err) + + // Read the first value. It should not be hot. For the first request, only permit reading from the cache + // (no trip to disk). + value, ok, hot, err := table.CacheAwareGet(firstKey, true) + require.NoError(t, err) + require.True(t, ok) + require.Nil(t, value) + require.False(t, hot) + + // Try again, but allow a trip to disk. + value, ok, hot, err = table.CacheAwareGet(firstKey, false) + require.NoError(t, err) + require.True(t, ok) + require.False(t, hot) + require.Equal(t, expectedValues[string(firstKey)], value) + + // Reading again should now result in a cache hit. + value, ok, hot, err = table.CacheAwareGet(firstKey, true) + require.NoError(t, err) + require.True(t, ok) + require.True(t, hot) + require.Equal(t, expectedValues[string(firstKey)], value) + + // Write enough values to push all previously written values out of the write cache. 
+ for bytesWritten < 5000 { + nextValueSize := rand.Uint64Range(1, maxValueSize) + kvSize := keySize + nextValueSize + + bytesWritten += kvSize + + key := rand.PrintableBytes(int(keySize)) + value := rand.PrintableBytes(int(nextValueSize)) + + if firstKey == nil { + firstKey = key + } + + expectedValues[string(key)] = value + err = table.Put(key, value) + require.NoError(t, err) + } + err = table.Flush() + require.NoError(t, err) + + // At this point, the number of hot bytes should not exceed the write cache size plus the size of the + // first entry, which is in the read cache. Verify that fact. + maxCacheSize := config.WriteCacheSize + keySize + firstValueSize + hotBytes := uint64(0) + for key, expectedValue := range expectedValues { + value, ok, hot, err = table.CacheAwareGet([]byte(key), true) + require.NoError(t, err) + require.True(t, ok) + + if hot { + require.Equal(t, expectedValue, value) + hotBytes += uint64(len(key)) + uint64(len(value)) + } else { + require.Nil(t, value) + } + } + require.LessOrEqual(t, hotBytes, maxCacheSize) + + // Read enough values to guarantee that the read cache is at full capacity. + for key, expectedValue := range expectedValues { + value, ok, hot, err = table.CacheAwareGet([]byte(key), false) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + + // Reading a cold value pulls it into the read cache, so an immediate second read should be hot. + if !hot { + value, ok, hot, err = table.CacheAwareGet([]byte(key), false) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + require.True(t, hot) + } + } + + // Do a final scan of the values in the DB. The number of hot bytes should not exceed the sizes of the caches. 
+ maxCacheSize = config.WriteCacheSize + config.ReadCacheSize + hotBytes = uint64(0) + for key, expectedValue := range expectedValues { + value, ok, hot, err = table.CacheAwareGet([]byte(key), true) + require.NoError(t, err) + require.True(t, ok) + + if hot { + require.Equal(t, expectedValue, value) + hotBytes += uint64(len(key)) + uint64(len(value)) + } else { + require.Nil(t, value) + } + } + require.LessOrEqual(t, hotBytes, maxCacheSize) + + err = db.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} diff --git a/sei-db/db_engine/litt/test/db_test.go b/sei-db/db_engine/litt/test/db_test.go new file mode 100644 index 0000000000..bf4b854411 --- /dev/null +++ b/sei-db/db_engine/litt/test/db_test.go @@ -0,0 +1,351 @@ +//go:build littdb_wip + +package test + +import ( + "context" + "fmt" + "os" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/memtable" + "github.com/Layr-Labs/eigenda/litt/metrics" + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/Layr-Labs/eigensdk-go/logging" + "github.com/stretchr/testify/require" +) + +type dbBuilder struct { + name string + builder func(t *testing.T, tableDirectory string) (litt.DB, error) +} + +var builders = []*dbBuilder{ + { + name: "mem", + builder: buildMemDB, + }, + { + name: "mem keymap disk table", + builder: buildMemKeyDiskDB, + }, + { + name: "levelDB keymap disk table", + builder: buildLevelDBDiskDB, + }, +} + +var restartableBuilders = []*dbBuilder{ + { + name: "mem keymap disk table", + builder: buildMemKeyDiskDB, + }, + { + name: "levelDB keymap disk table", + builder: buildLevelDBDiskDB, + }, +} + +var flushLimitedBuilder = &dbBuilder{ + name: 
"levelDB keymap disk table with flush limiter", + builder: buildLevelDBDiskDBWithFlushLimiter, +} + +func buildMemDB(t *testing.T, path string) (litt.DB, error) { + config, err := litt.DefaultConfig(path) + require.NoError(t, err) + + config.GCPeriod = 50 * time.Millisecond + config.Logger = test.GetLogger() + + tb := func( + ctx context.Context, + logger logging.Logger, + name string, + metrics *metrics.LittDBMetrics, + ) (litt.ManagedTable, error) { + return memtable.NewMemTable(config, name), nil + } + + return littbuilder.NewDBUnsafe(config, tb) +} + +func buildMemKeyDiskDB(t *testing.T, path string) (litt.DB, error) { + config, err := litt.DefaultConfig(path) + require.NoError(t, err) + config.KeymapType = keymap.MemKeymapType + config.WriteCacheSize = 1000 + config.TargetSegmentFileSize = 100 + config.ShardingFactor = 4 + config.Fsync = false // fsync is too slow for unit test workloads + config.DoubleWriteProtection = true + + return littbuilder.NewDB(config) +} + +func buildLevelDBDiskDB(t *testing.T, path string) (litt.DB, error) { + config, err := litt.DefaultConfig(path) + require.NoError(t, err) + config.KeymapType = keymap.UnsafeLevelDBKeymapType + config.WriteCacheSize = 1000 + config.TargetSegmentFileSize = 100 + config.ShardingFactor = 4 + config.Fsync = false // fsync is too slow for unit test workloads + config.DoubleWriteProtection = true + + return littbuilder.NewDB(config) +} + +func buildLevelDBDiskDBWithFlushLimiter(t *testing.T, path string) (litt.DB, error) { + config, err := litt.DefaultConfig(path) + require.NoError(t, err) + config.KeymapType = keymap.UnsafeLevelDBKeymapType + config.WriteCacheSize = 1000 + config.TargetSegmentFileSize = 100 + config.ShardingFactor = 4 + config.Fsync = false // fsync is too slow for unit test workloads + config.DoubleWriteProtection = true + config.MinimumFlushInterval = 50 * time.Millisecond + + db, err := littbuilder.NewDB(config) + if err != nil { + return nil, fmt.Errorf("failed to build levelDB: 
%w", err) + } + return db, nil +} + +func randomDBOperationsTest(t *testing.T, builder *dbBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + db, err := builder.builder(t, directory) + require.NoError(t, err) + + tableCount := rand.Int32Range(8, 16) + tableNames := make([]string, 0, tableCount) + for i := int32(0); i < tableCount; i++ { + tableNames = append(tableNames, fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8))) + } + + // first key is table name, second key is the key in the kv-pair + expectedValues := make(map[string]map[string][]byte) + for _, tableName := range tableNames { + expectedValues[tableName] = make(map[string][]byte) + } + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Write some data. + tableName := tableNames[rand.Intn(len(tableNames))] + table, err := db.GetTable(tableName) + require.NoError(t, err) + + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[tableName][string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[tableName][string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush tables. + if rand.BoolWithProbability(0.1) { + for _, tableName := range tableNames { + table, err = db.GetTable(tableName) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err) + } + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. 
+ if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the tables and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for tableName, tableValues := range expectedValues { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + for expectedKey, expectedValue := range tableValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + } + } + + err = db.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestRandomDBOperations(t *testing.T) { + t.Parallel() + for _, builder := range builders { + t.Run(builder.name, func(t *testing.T) { + randomDBOperationsTest(t, builder) + }) + } +} + +// Test with flush limiting enabled. This will be slower for the unit test data access pattern, but we need to +// exercise the code pathways. 
+func TestRandomDBOperationsWithFlushLimiter(t *testing.T) { + t.Parallel() + randomDBOperationsTest(t, flushLimitedBuilder) +} + +func dbRestartTest(t *testing.T, builder *dbBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + db, err := builder.builder(t, directory) + require.NoError(t, err) + + tableCount := rand.Int32Range(8, 16) + tableNames := make([]string, 0, tableCount) + for i := int32(0); i < tableCount; i++ { + tableNames = append(tableNames, fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8))) + } + + // first key is table name, second key is the key in the kv-pair + expectedValues := make(map[string]map[string][]byte) + for _, tableName := range tableNames { + expectedValues[tableName] = make(map[string][]byte) + } + + iterations := 1000 + restartIteration := iterations/2 + int(rand.Int64Range(-10, 10)) + + for i := 0; i < iterations; i++ { + // Somewhere in the middle of the test, restart the db. + if i == restartIteration { + err = db.Close() + require.NoError(t, err) + + db, err = builder.builder(t, directory) + require.NoError(t, err) + + // Do a full scan of the table to verify that all expected values are still present. + for tableName, tableValues := range expectedValues { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + for expectedKey, expectedValue := range tableValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + } + + // Write some data. 
+ tableName := tableNames[rand.Intn(len(tableNames))] + table, err := db.GetTable(tableName) + require.NoError(t, err) + + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[tableName][string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[tableName][string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush tables. + if rand.BoolWithProbability(0.1) { + for _, tableName := range tableNames { + table, err = db.GetTable(tableName) + require.NoError(t, err) + err = table.Flush() + require.NoError(t, err) + } + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the tables and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. 
+ if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for tableName, tableValues := range expectedValues { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + for expectedKey, expectedValue := range tableValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + } + } + } + + err = db.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestDBRestart(t *testing.T) { + t.Parallel() + for _, builder := range restartableBuilders { + t.Run(builder.name, func(t *testing.T) { + dbRestartTest(t, builder) + }) + } +} diff --git a/sei-db/db_engine/litt/test/generate_example_tree_test.go b/sei-db/db_engine/litt/test/generate_example_tree_test.go new file mode 100644 index 0000000000..9f0dbf9110 --- /dev/null +++ b/sei-db/db_engine/litt/test/generate_example_tree_test.go @@ -0,0 +1,97 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "log" + "os/exec" + "path" + "strings" + "testing" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +// TestGenerateExampleTree will generate the example file tree displayed in the readme. +func TestGenerateExampleTree(t *testing.T) { + + t.Skip("this should only be run manually") + + rand := random.NewTestRandom() + testDir := t.TempDir() + + rootDirectories := []string{path.Join(testDir, "root0"), path.Join(testDir, "root1"), path.Join(testDir, "root2")} + + config, err := litt.DefaultConfig(rootDirectories...) 
+ require.NoError(t, err)
+
+ config.ShardingFactor = 4
+ config.TargetSegmentFileSize = 100 // use a small value to intentionally create several segments
+ config.SnapshotDirectory = path.Join(testDir, "rolling_snapshot")
+
+ db, err := littbuilder.NewDB(config)
+ require.NoError(t, err)
+
+ tableA, err := db.GetTable("tableA")
+ require.NoError(t, err)
+ tableB, err := db.GetTable("tableB")
+ require.NoError(t, err)
+ tableC, err := db.GetTable("tableC")
+ require.NoError(t, err)
+
+ // Write enough data to tableA to create 3 segments
+ err = tableA.Put([]byte("key1"), rand.Bytes(100))
+ require.NoError(t, err)
+ err = tableA.Put([]byte("key2"), rand.Bytes(100))
+ require.NoError(t, err)
+ err = tableA.Put([]byte("key3"), rand.Bytes(100))
+ require.NoError(t, err)
+
+ // Write enough data to tableB to create 2 segments
+ err = tableB.Put([]byte("key1"), rand.Bytes(100))
+ require.NoError(t, err)
+ err = tableB.Put([]byte("key2"), rand.Bytes(100))
+ require.NoError(t, err)
+
+ // Write enough data to tableC to create 1 segment
+ err = tableC.Put([]byte("key1"), rand.Bytes(50))
+ require.NoError(t, err)
+
+ err = tableA.Flush()
+ require.NoError(t, err)
+ err = tableB.Flush()
+ require.NoError(t, err)
+ err = tableC.Flush()
+ require.NoError(t, err)
+
+ // Simulate lower bound files. These are normally only generated when garbage collection is performed externally.
+ for _, tableName := range []string{"tableA", "tableB", "tableC"} {
+ lowerBoundFile, err := disktable.LoadBoundaryFile(
+ disktable.LowerBound,
+ path.Join(testDir, "rolling_snapshot", tableName))
+ require.NoError(t, err)
+ err = lowerBoundFile.Update(0)
+ require.NoError(t, err)
+ }
+
+ // Run the tree command on testDir
+ output, err := exec.Command("tree", testDir).CombinedOutput()
+ if err != nil {
+ log.Fatalf("command failed: %v", err)
+ }
+ // Convert the output (a byte slice) into a string
+ resultString := string(output)
+
+ // Replace the temporary directory name with "root".
+ resultString = strings.ReplaceAll(resultString, testDir, "root") + + fmt.Println(resultString) + + err = db.Close() + require.NoError(t, err) +} diff --git a/sei-db/db_engine/litt/test/keymap_migration_test.go b/sei-db/db_engine/litt/test/keymap_migration_test.go new file mode 100644 index 0000000000..ba6608849b --- /dev/null +++ b/sei-db/db_engine/litt/test/keymap_migration_test.go @@ -0,0 +1,293 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "os" + "path" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" + "github.com/syndtr/goleveldb/leveldb" +) + +// Tests migration from one type of Keymap to another. +func TestKeymapMigration(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + directoryCount := 8 + shardDirectories := make([]string, 0, directoryCount) + for i := 0; i < directoryCount; i++ { + shardDirectories = append(shardDirectories, path.Join(directory, rand.String(32))) + } + + // Build the table using LevelDBKeymap. + config, err := litt.DefaultConfig(shardDirectories...) + require.NoError(t, err) + config.ShardingFactor = uint32(directoryCount) + config.KeymapType = keymap.UnsafeLevelDBKeymapType + config.Fsync = false // fsync is too slow for unit test workloads + config.DoubleWriteProtection = true + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + table, err := db.GetTable("test") + require.NoError(t, err) + + // Fill the table with some data. + expectedValues := make(map[string][]byte) + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + } + } + + // Shut down the table and move the keymap directory. There shouldn't be any problems caused by this. 
+ err = db.Close() + require.NoError(t, err) + + // By default, the keymap will store its data inside directory 0 + keymapPath := path.Join(shardDirectories[0], "test", "keymap") + newKeymapPath := path.Join(shardDirectories[int(rand.Int64Range(1, int64(directoryCount)))], + "test", "keymap") + + err = os.Rename(keymapPath, newKeymapPath) + require.NoError(t, err) + + // Reload the table and check the data + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + table, err = db.GetTable("test") + require.NoError(t, err) + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Close the table and reopen it using a MemKeymap + err = db.Close() + require.NoError(t, err) + config.KeymapType = keymap.MemKeymapType + + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + table, err = db.GetTable("test") + require.NoError(t, err) + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // The keymap data path should be empty. 
+ keymapDataPath := path.Join(newKeymapPath, keymap.KeymapDataDirectoryName) + _, err = os.Stat(keymapDataPath) + require.True(t, os.IsNotExist(err)) + + // Close the table and reopen it using a LevelDBKeymap + err = db.Close() + require.NoError(t, err) + config.KeymapType = keymap.UnsafeLevelDBKeymapType + + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + table, err = db.GetTable("test") + require.NoError(t, err) + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + err = db.Destroy() + require.NoError(t, err) +} + +func TestFailedKeymapMigration(t *testing.T) { + t.Parallel() + rand := random.NewTestRandom() + directory := t.TempDir() + + directoryCount := 8 + shardDirectories := make([]string, 0, directoryCount) + for i := 0; i < directoryCount; i++ { + shardDirectories = append(shardDirectories, path.Join(directory, rand.String(32))) + } + + // Build the table using LevelDBKeymap. + config, err := litt.DefaultConfig(shardDirectories...) + require.NoError(t, err) + config.ShardingFactor = uint32(directoryCount) + config.KeymapType = keymap.UnsafeLevelDBKeymapType + config.Fsync = false // fsync is too slow for unit test workloads + config.DoubleWriteProtection = true + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + table, err := db.GetTable("test") + require.NoError(t, err) + + // Fill the table with some data. + expectedValues := make(map[string][]byte) + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Write some data. 
+ batchSize := rand.Int32Range(1, 10)
+
+ if batchSize == 1 {
+ key := rand.PrintableVariableBytes(32, 64)
+ value := rand.PrintableVariableBytes(1, 128)
+ err = table.Put(key, value)
+ require.NoError(t, err)
+ expectedValues[string(key)] = value
+ } else {
+ batch := make([]*types.KVPair, 0, batchSize)
+ for j := int32(0); j < batchSize; j++ {
+ key := rand.PrintableVariableBytes(32, 64)
+ value := rand.PrintableVariableBytes(1, 128)
+ batch = append(batch, &types.KVPair{Key: key, Value: value})
+ expectedValues[string(key)] = value
+ }
+ err = table.PutBatch(batch)
+ require.NoError(t, err)
+ }
+
+ // Once in a while, flush the table.
+ if rand.BoolWithProbability(0.1) {
+ err = table.Flush()
+ require.NoError(t, err)
+ }
+
+ // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage
+ // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give
+ // the garbage collector a chance to run.
+ if rand.BoolWithProbability(0.01) {
+ time.Sleep(5 * time.Millisecond)
+ }
+
+ // Once in a while, scan the table and verify that all expected values are present.
+ // Don't do this every time for the sake of test runtime.
+ if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ {
+ for expectedKey, expectedValue := range expectedValues {
+ value, ok, err := table.Get([]byte(expectedKey))
+ require.NoError(t, err)
+ require.True(t, ok)
+ require.Equal(t, expectedValue, value)
+ }
+
+ // Try fetching a value that isn't in the table.
+ _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64))
+ require.NoError(t, err)
+ require.False(t, ok)
+ }
+ }
+
+ err = db.Close()
+ require.NoError(t, err)
+
+ // Simulate a failed reload. A failed reload can be identified by the missing "initialized" flag file.
+ // By deleting the file, the DB is tricked into reloading the keymap.
+ flagFilePath := path.Join(shardDirectories[0], "test", keymap.KeymapDirectoryName, keymap.KeymapInitializedFileName) + + exists, err := util.Exists(flagFilePath) + require.NoError(t, err) + require.True(t, exists) + + err = os.Remove(flagFilePath) + require.NoError(t, err) + + // To verify that the migration works, manually load the old keymap and corrupt it. If things work as they should, + // the keymap should be reloaded from disk, and the corrupted keymap should be deleted. + levelDBPath := path.Join(shardDirectories[0], "test", keymap.KeymapDirectoryName, keymap.KeymapDataDirectoryName) + ldb, err := leveldb.OpenFile(levelDBPath, nil) + require.NoError(t, err) + + for key := range expectedValues { + err = ldb.Put([]byte(key), []byte(fmt.Sprintf("%d", rand.Uint64())), nil) + require.NoError(t, err) + } + + err = ldb.Close() + require.NoError(t, err) + + // Reload the table and check the data + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + table, err = db.GetTable("test") + require.NoError(t, err) + + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } +} diff --git a/sei-db/db_engine/litt/test/lock_test.go b/sei-db/db_engine/litt/test/lock_test.go new file mode 100644 index 0000000000..b6633eec25 --- /dev/null +++ b/sei-db/db_engine/litt/test/lock_test.go @@ -0,0 +1,253 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "os" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +// Verify that we cannot open a second instance of the database with the same root directories while the +// first instance is running. 
+func TestDBLocking(t *testing.T) { + t.Parallel() + + rand := random.NewTestRandom() + directory := t.TempDir() + + // Spread data across several root directories. + rootCount := rand.Uint32Range(2, 5) + roots := make([]string, 0, rootCount) + for i := 0; i < int(rootCount); i++ { + roots = append(roots, fmt.Sprintf("%s/root-%d", directory, i)) + } + + config, err := litt.DefaultConfig(roots...) + require.NoError(t, err) + + // Make it so that we have at least as many shards as roots. + config.ShardingFactor = rootCount * rand.Uint32Range(1, 4) + + // Settings that should be enabled for LittDB unit tests. + config.DoubleWriteProtection = true + config.Fsync = false + + // Use small segments to ensure that we create a few segments per table. + config.TargetSegmentFileSize = 100 + + // Build the DB and a handful of tables. + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint32Range(2, 5) + tables := make([]litt.Table, 0, tableCount) + expectedData := make(map[string]map[string][]byte) + for i := 0; i < int(tableCount); i++ { + tableName := fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8)) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables = append(tables, table) + expectedData[table.Name()] = make(map[string][]byte) + } + + // Insert some data into the tables. + for _, table := range tables { + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 200) + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put key-value pair in table %s", table.Name()) + } + err = table.Flush() + require.NoError(t, err, "Failed to flush table %s", table.Name()) + } + + // Verify that the data is correctly stored in the tables. 
+ for _, table := range tables { + for key, expectedValue := range expectedData[table.Name()] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name()) + require.True(t, ok, "Key %s not found in table %s", key, table.Name()) + require.Equal(t, expectedValue, value, + "Value mismatch for key %s in table %s", key, table.Name()) + } + } + + // Attempt to open a second instance of the database with the same root directories. Locking should prevent this. + shadowConfig, err := litt.DefaultConfig(roots...) + require.NoError(t, err) + shadowConfig.ShardingFactor = config.ShardingFactor + shadowConfig.DoubleWriteProtection = true + shadowConfig.Fsync = false + + _, err = littbuilder.NewDB(shadowConfig) + require.Error(t, err, + "Expected error when opening a second instance of the database with the same root directories") + + // Even sharing just one root should be enough to torpedo the second instance. + shadowConfig, err = litt.DefaultConfig(roots[:1]...) + require.NoError(t, err) + shadowConfig.ShardingFactor = config.ShardingFactor + shadowConfig.DoubleWriteProtection = true + shadowConfig.Fsync = false + + _, err = littbuilder.NewDB(shadowConfig) + require.Error(t, err, + "Expected error when opening a second instance of the database with the same root directories") + + // Shutting down the database should release the locks. + err = db.Close() + require.NoError(t, err, "Failed to close the database") + + // Ensure that we can now open a second instance of the database. 
+ db, err = littbuilder.NewDB(config) + require.NoError(t, err, "Failed to open a second instance of the database after closing the first") + + tables = make([]litt.Table, 0, tableCount) + for tableName := range expectedData { + table, err := db.GetTable(tableName) + require.NoError(t, err, "Failed to get table %s after reopening the database", tableName) + tables = append(tables, table) + } + + // Verify that the data is correctly stored in the tables. + for _, table := range tables { + for key, expectedValue := range expectedData[table.Name()] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name()) + require.True(t, ok, "Key %s not found in table %s", key, table.Name()) + require.Equal(t, expectedValue, value, + "Value mismatch for key %s in table %s", key, table.Name()) + } + } + + err = db.Destroy() + require.NoError(t, err, "Failed to destroy the database after testing locking") +} + +// If the database process is killed, it may leave behind lock files. Simulate this scenario. +func TestDeadProcessSimulation(t *testing.T) { + t.Parallel() + + rand := random.NewTestRandom() + directory := t.TempDir() + + // Spread data across several root directories. + rootCount := rand.Uint32Range(2, 5) + roots := make([]string, 0, rootCount) + for i := 0; i < int(rootCount); i++ { + roots = append(roots, fmt.Sprintf("%s/root-%d", directory, i)) + } + + config, err := litt.DefaultConfig(roots...) + require.NoError(t, err) + + // Make it so that we have at least as many shards as roots. + config.ShardingFactor = rootCount * rand.Uint32Range(1, 4) + + // Settings that should be enabled for LittDB unit tests. + config.DoubleWriteProtection = true + config.Fsync = false + + // Use small segments to ensure that we create a few segments per table. + config.TargetSegmentFileSize = 100 + + // Build the DB and a handful of tables. 
+ db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint32Range(2, 5) + tables := make([]litt.Table, 0, tableCount) + expectedData := make(map[string]map[string][]byte) + for i := 0; i < int(tableCount); i++ { + tableName := fmt.Sprintf("table-%d-%s", i, rand.PrintableBytes(8)) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables = append(tables, table) + expectedData[table.Name()] = make(map[string][]byte) + } + + // Insert some data into the tables. + for _, table := range tables { + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(10, 200) + expectedData[table.Name()][string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put key-value pair in table %s", table.Name()) + } + err = table.Flush() + require.NoError(t, err, "Failed to flush table %s", table.Name()) + } + + // Verify that the data is correctly stored in the tables. + for _, table := range tables { + for key, expectedValue := range expectedData[table.Name()] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name()) + require.True(t, ok, "Key %s not found in table %s", key, table.Name()) + require.Equal(t, expectedValue, value, + "Value mismatch for key %s in table %s", key, table.Name()) + } + } + + err = db.Close() + require.NoError(t, err, "Failed to close the database before simulating dead process") + + // Find a PID that does not have an active process. + pid := int(rand.Int64Range(10000, 20000)) + for util.IsProcessAlive(pid) { + pid = int(rand.Int64Range(10000, 20000)) + } + + // Write lock files for the simulated dead process. 
+ for _, root := range roots {
+ lockFilePath := fmt.Sprintf("%s/%s", root, util.LockfileName)
+ lockFile, err := os.Create(lockFilePath)
+ require.NoError(t, err, "Failed to create lock file for simulated dead process at %s", lockFilePath)
+
+ err = WriteLockFile(lockFile, pid)
+ require.NoError(t, err, "Failed to write lock file for simulated dead process at %s", lockFilePath)
+
+ // Close the file handle so the simulated dead process does not hold it open.
+ err = lockFile.Close()
+ require.NoError(t, err, "Failed to close lock file at %s", lockFilePath)
+ }
+
+ // We should still be able to open a new instance of the database since there is no process running with the PID.
+ db, err = littbuilder.NewDB(config)
+ require.NoError(t, err, "Failed to open a new instance of the database after simulating dead process")
+
+ tables = make([]litt.Table, 0, tableCount)
+ for tableName := range expectedData {
+ table, err := db.GetTable(tableName)
+ require.NoError(t, err, "Failed to get table %s after reopening the database", tableName)
+ tables = append(tables, table)
+ }
+
+ // Verify that the data is correctly stored in the tables.
+ for _, table := range tables {
+ for key, expectedValue := range expectedData[table.Name()] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err, "Failed to get value for key %s in table %s", key, table.Name())
+ require.True(t, ok, "Key %s not found in table %s", key, table.Name())
+ require.Equal(t, expectedValue, value,
+ "Value mismatch for key %s in table %s", key, table.Name())
+ }
+ }
+
+ err = db.Destroy()
+ require.NoError(t, err, "Failed to destroy the database after testing locking")
+}
+
+// WriteLockFile writes the given process ID and a timestamp to the lock file.
+func WriteLockFile(lockFile *os.File, pid int) error { + lockInfo := fmt.Sprintf("PID: %d\nTimestamp: %s\n", pid, time.Now().Format(time.RFC3339)) + _, err := lockFile.WriteString(lockInfo) + return err +} diff --git a/sei-db/db_engine/litt/test/migration_data.go b/sei-db/db_engine/litt/test/migration_data.go new file mode 100644 index 0000000000..a06b6f11ce --- /dev/null +++ b/sei-db/db_engine/litt/test/migration_data.go @@ -0,0 +1,108 @@ +//go:build littdb_wip + +package test + +// This map is used for migration tests. This data is written to a table at the old version, and used to verify that +// the data after migration is the same as the data before migration. +var migrationData = map[string]string{ + "S7MOxfceWW": "oSNhtpEtRb48ntgPkhL", + "uQxQ25apaahwztuOzNi": "Tn2MgaTP5B", + "cdlFwQ3izP6gddTWg": "lrB2OPxXpvA9GEr", + "BUHqRs6XNnk": "XiM14PxeApDwgCwoWl", + "iMV7t0BLFhp8WDt3z": "AtkhY6eBDwJjPC9Yq0", + "9v3kYNhyWqpbXKjB": "fXVDjf4H3LAPZo", + "fZLvo7jDSSlWP": "uhI9oNwGZvOR", + "3pkkNwZmFgO1": "p2EakPC1qFy1Ln7X1gy", + "k30CXpbPH7N": "CJPo06kCod8H5nl", + "bK6ShP3Ji9FN": "dCXgS4SlWnmo", + "lYtAmE5Oe0wYeLTr": "26b4nHzUbnFbragc6D", + "chzmznu42ET4i": "bUHbWNpRnJFmR5zdgMY", + "QWGu2AnfcifYECejE": "26FYmPjkYs51nh98", + "4aEyphJuc5": "6xevs3LFY58gxg", + "aQ0Y9rb1UisYU03FW": "ontvK6EElNxUt", + "kYCtV1TdwjO": "qQZMRlvQ4MJRRST", + "U2E9LMOhu0uY1DL": "5P1OmVO3hI1PI", + "dysi8hDsKj8FF": "w2Fkpvl9PAI", + "LcUMjv2DlnS": "6vZh6B840MN8W8Edx", + "XxAUWO6zyJ": "blcXwtWmVB8Xkzv", + "lWQkLUVEFMS": "K2xRiBNQ5MNb75d3B", + "n64zlB9gKtk": "Arky8MofGkvEhFNc", + "ZEeVJZTz6372d": "BmAwd2EvHw", + "6B1wwUMjTF": "428u9CE6zZlQoWG", + "sg7u1aDylz": "w4XuLp12Gg6pWll", + "ivHrCBthr8qu": "i1BYGFSfM3P", + "f8y4xuM57qFQg": "haThtIFGmQ2a1", + "7Lw3q58svTi4SEAFw": "QQZ9cqPEq2VVR", + "NRrxErIRM4": "MuP0gvMHSbk51W93N", + "zmNLDGiOsX0zzLxgqx": "rIea0vLsQnLpL", + "R12vsDgE9vHSh": "ofNCxSlZx44UPkG8C04", + "UFjhyw212E1HB": "FlWDrgzeshrq", + "ue2g7bcwq1xS": "fbJrgwABL86Kh", + "jrDRPJ1uXPLeJxwbDdp": 
"4TGH4FzHWSUn5oc", + "j8GIOZUCpcotvNs": "D4MBDXATSN", + "3UwjwlxbofoH": "l1R6uK4eCQ", + "dNmMpVGPQpUkcUE": "vaPjmDx1lP", + "2nk7LDEAIiP17i": "3G5RAf58WUmqTEQed", + "LMCzFVEVHL8yozVw3X": "pMyKVDIUyz", + "mvyYTJEO2cJ6oY3L4U": "M5s0cyA2UJ3jstDz", + "Bx0ARO4F4BSg": "NtCNQZAEuJizQhXXL", + "6x45pVeBPckE9Rbb": "CTFHvtahyIn0CAN", + "4Upqz2PKSR1": "6PpFUoLqEtg5QLPf7Q", + "sJtKBhkqXJ8QjPab": "KNhNwNybSgp0hjsayh", + "UxtCua2isEaZAuCEM1": "CV1D4By3PkfctVA8pEA", + "kkVYsbOBrIhrm": "UXtbSmjYPR", + "MfA1l81VnHH": "qECowRfgz0", + "xFSCCXEBQfVB": "jxRBNQOMpHErksJu4", + "EvJlXug4Lj": "xa6IUSXbcqxdo", + "KC9ljchlpJGC": "QH2dqRdzH7Vr", + "C8kiIIMWffu5UH": "ZGzgRuGu55bFY", + "qB8FM7KKVM192bW7c": "R8AEX7ZSVc5Kku", + "2WvlDWvByFAjHGO5": "ToPJqT4cHpuK7j7oHs", + "Y21Q4luB2YR9tkH": "2H41w79yXlFcxg", + "EdLROPjF0lrQR5Y": "VpmOg5d6Ya", + "9OIQkcyEZ4V0hgJT": "3kwfJ9pzGeB67Y", + "eHhgOVn7XZBvp": "3W9GuwG3XH0", + "7PTApk1JZnegET": "0K4RIpQbBU", + "zO3XDUKdmFWhzwL8": "zol4hrMcjKh4wXBW0X4", + "anEZPbHRLgbK8ab8k": "TuVWcQMIUC3w", + "8zjsG3w3mP": "Lus1iBWnndJca1BGPw", + "i1RqPkH2XKRj4wS": "UaaoCv0nA6DuXQ", + "35RKf4sd9a": "GHinZXfMWGfZqfrEUj", + "sX3VM3pdWuTN": "qu1IYzyZXWSrRt0Q7", + "DQXDdUJvMijK": "KJ9lMw28tR3i5CzSOe", + "8G9r4r7hKZs": "zryjRgkY3B9", + "Ge55N78jIGzl4kyWAQ": "IToFVMqwa2woQfsh", + "4KcWZuzvlSMI": "cbBr5XwaDgyduz7lF", + "iHCadisZ2d4Lhh": "RqsHSDNJbX", + "KnHZhDP4EezmNcH6waF": "5qDf9Tg08OHwOyrbV", + "2VFfY7yWW5cEs": "vxwc3n4trq3D", + "Cl74jcT7McogOuI": "zEpiTYqMnM4AEpQecs", + "C3ZqqO4cenvQhUXr5": "ro7MlUTDJt3yCG4I9x", + "J0iTmnA2jc0g": "oImOAez9d2M3LodO", + "Xg9t7f0x9F": "4kD25VKJGYTJXNScjKI", + "2qIhPhR0tqr0sf9n": "67hj2DdNr8", + "c2D85oqCiSFv344vw": "24ptxcYqnwu", + "nSlaWA77r6Dqbl3Lyv": "KcMnVtYPwgcqT", + "EpfdcYJauGI": "XzBcPMUZyryB", + "j0FvUY2kdcFehwSFTPU": "MqA1KDBYG53K", + "MHwGBaYMRtPVX": "cTqqONfvuSAtt", + "x5yJoUs8wOwkiiiao": "syZQNyr47tVH4", + "K3LPe7EsYmzmZfmJSr": "VT0tSNW17vJ", + "snbz01TFonWpok1WQJQ": "dkLkKFlbNsRhgCZGsp", + "KYL5i7mIx6I95dO0": "74ndgZk9ymMxhn0spv", + 
"b2yGXFlpHJuQwpCaa": "ZuvhlCcIRKcdn", + "fycSvFVXdL7": "Al7tASqhEtUxwv8O8", + "UY9YfW75SzDqCPy": "Mz9q5TUxPfkh", + "OGfnB7QR4eQaatXwP": "t3zE0G6XVVG", + "2S3X8sDLwDNk": "kDUv68Hm807FEDCj", + "zMJPfHe0Td4m5JLD": "4XUTqdsnQPtI2Bk", + "4plod7WQcLypxeJ24B4": "flw6IHhUi8NmZ", + "UMlCE2OHHYREl": "QOaCQaRS67dCW6", + "nz7DN3LHVWsjEPVD": "4tndorV1Yltoz", + "dUVvq2B95CkIOHn": "QqgioH4rseg", + "ypMpA354f9xP": "CuskocQHlFcYtG", + "TejKR8aotSlTBW78Mt": "7dvQROKGAjCFfEHmHT", + "hZ9XON4x4WivPJ3": "TuVgbSDFtna5dv", + "Z3IErKLZrStej27": "JLZ1yjpuYQXFRsG", + "azDFe3GvhnR": "fYw79uPHmN", +} diff --git a/sei-db/db_engine/litt/test/migration_test.go b/sei-db/db_engine/litt/test/migration_test.go new file mode 100644 index 0000000000..31b0d7750a --- /dev/null +++ b/sei-db/db_engine/litt/test/migration_test.go @@ -0,0 +1,201 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "os" + "path/filepath" + "strconv" + "testing" + + "github.com/Layr-Labs/eigenda/core" + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +// This file contains tests for data migrations (i.e. when the on-disk format of the data changes). + +// Enable and run this "test" to generate data for a migration test at the current version. 
+func TestGenerateData(t *testing.T) { + t.Skip() // comment out this line to generate data + + version := segment.LatestSegmentVersion + dataDir := fmt.Sprintf("testdata/v%d", version) + + exists, err := util.Exists(dataDir) + require.NoError(t, err) + if exists { + fmt.Printf("deleting existing data at %s\n", dataDir) + err = os.RemoveAll(dataDir) + require.NoError(t, err) + } + + fmt.Printf("generating migration test data at %s\n", dataDir) + + err = os.MkdirAll(dataDir, 0777) + require.NoError(t, err) + + config, err := litt.DefaultConfig(dataDir) + require.NoError(t, err) + config.DoubleWriteProtection = true + config.Fsync = false + config.ShardingFactor = 4 + config.TargetSegmentFileSize = 100 + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + table, err := db.GetTable("test") + require.NoError(t, err) + + for key, value := range migrationData { + err = table.Put([]byte(key), []byte(value)) + require.NoError(t, err) + } + + // verify the data in the table + for key, value := range migrationData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err) + require.True(t, exists) + require.Equal(t, value, string(v)) + } + + // Shut the DB down. + err = db.Close() + require.NoError(t, err) +} + +func TestMigration(t *testing.T) { + + // Find all copies of the table at various versions. We will run a migration test on each of them. 
+ migrationPaths := make([]string, 0)
+
+ // Get direct subdirectories of "testdata/" - only these contain version data
+ entries, err := os.ReadDir("testdata")
+ require.NoError(t, err)
+
+ for _, entry := range entries {
+ if entry.IsDir() {
+ versionDir := filepath.Join("testdata", entry.Name())
+ // Only include directories with 'v' prefix (version directories)
+ if len(entry.Name()) > 0 && entry.Name()[0] == 'v' {
+ migrationPaths = append(migrationPaths, versionDir)
+ }
+ }
+ }
+
+ // Fail the test if no version directories are found.
+ require.NotEmpty(t, migrationPaths, "No version directories found in testdata/")
+
+ currentVersion := segment.LatestSegmentVersion
+ for _, migrationPath := range migrationPaths {
+
+ // Each migration path is in the format "v[version]".
+ oldVersion, err := strconv.Atoi(filepath.Base(migrationPath)[1:])
+ require.NoError(t, err)
+
+ t.Run(fmt.Sprintf("%d->%d", oldVersion, currentVersion), func(t *testing.T) {
+ testMigration(t, migrationPath)
+ })
+ }
+
+}
+
+func testMigration(t *testing.T, migrationPath string) {
+ rand := random.NewTestRandom()
+
+ // Make a copy of the data so we don't modify the original (which is checked into git).
+ testDir := t.TempDir() + + err := os.MkdirAll(testDir, 0777) + require.NoError(t, err) + + // Copy the test data directory to our temporary directory + err = util.RecursiveMove(migrationPath, testDir, true, false) + require.NoError(t, err) + + // Now open the database and verify the data matches our expectations + config, err := litt.DefaultConfig(testDir) + require.NoError(t, err) + config.DoubleWriteProtection = true + config.Fsync = false + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + t.Cleanup(func() { core.CloseLogOnError(db, "littdb", nil) }) + + table, err := db.GetTable("test") + require.NoError(t, err) + + // Verify the data in the table matches our expected data + for key, value := range migrationData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err) + require.True(t, exists) + require.Equal(t, value, string(v)) + } + + // Write some new data to the table to ensure we can read and write after migration + newData := make(map[string]string) + const numNewItems = 50 + for i := 0; i < numNewItems; i++ { + key := fmt.Sprintf("newkey-%d-%s", i, rand.PrintableBytes(32)) + value := rand.PrintableBytes(32) + newData[key] = string(value) + + err := table.Put([]byte(key), value) + require.NoError(t, err, "Failed to write new data after migration") + } + + // Verify all the new data can be read back correctly + for key, value := range newData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err, "Error reading back new data") + require.True(t, exists, "New data doesn't exist") + require.Equal(t, value, string(v), "New data doesn't match") + } + + // Verify the original data. 
+ for key, value := range migrationData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err, "Error reading migration data") + require.True(t, exists, "Migration data doesn't exist") + require.Equal(t, value, string(v), "Migration data doesn't match") + } + + // Close and reopen the database to ensure persistence + err = db.Close() + require.NoError(t, err, "Failed to close database") + + // Reopen the database + db, err = littbuilder.NewDB(config) + require.NoError(t, err, "Failed to reopen database") + + table, err = db.GetTable("test") + require.NoError(t, err, "Failed to get table after reopening") + + // Verify original migration data is still intact + for key, value := range migrationData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err, "Error reading migration data after reopen") + require.True(t, exists, "Migration data doesn't exist after reopen") + require.Equal(t, value, string(v), "Migration data doesn't match after reopen") + } + + // Verify the new data is still intact + for key, value := range newData { + v, exists, err := table.Get([]byte(key)) + require.NoError(t, err, "Error reading new data after reopen") + require.True(t, exists, "New data doesn't exist after reopen") + require.Equal(t, value, string(v), "New data doesn't match after reopen") + } + + err = db.Destroy() + require.NoError(t, err, "Failed to destroy database") +} diff --git a/sei-db/db_engine/litt/test/snapshot_test.go b/sei-db/db_engine/litt/test/snapshot_test.go new file mode 100644 index 0000000000..5ca18bd635 --- /dev/null +++ b/sei-db/db_engine/litt/test/snapshot_test.go @@ -0,0 +1,665 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "os" + "path" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/segment" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + 
"github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +func TestSnapshot(t *testing.T) { + t.Parallel() + + ctx := t.Context() + logger := test.GetLogger() + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + errorMonitor := util.NewErrorMonitor(ctx, logger, nil) + + rootPathCount := rand.Uint64Range(2, 5) + rootPaths := make([]string, rootPathCount) + for i := uint64(0); i < rootPathCount; i++ { + rootPaths[i] = path.Join(testDirectory, fmt.Sprintf("root-%d", i)) + } + + snapshotDir := testDirectory + "/snapshot" + + // Configure the DB to enable snapshots. + config, err := litt.DefaultConfig(rootPaths...) + require.NoError(t, err) + config.Fsync = false + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(rand.Uint64Range(rootPathCount, 2*rootPathCount)) + config.TargetSegmentFileSize = 100 + config.SnapshotDirectory = snapshotDir + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint64Range(2, 5) + tables := make(map[string]litt.Table, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableName := fmt.Sprintf("table-%d", i) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables[tableName] = table + } + + // map from table name to keys to values + expectedData := make(map[string]map[string][]byte) + for _, table := range tables { + expectedData[table.Name()] = make(map[string][]byte) + } + + // Write some data into the DB. + for i := 0; i < 1000; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + tableName := fmt.Sprintf("table-%d", tableIndex) + table := tables[tableName] + + key := rand.String(32) + value := rand.PrintableVariableBytes(1, 100) + + err = table.Put([]byte(key), value) + require.NoError(t, err) + + expectedData[tableName][key] = value + } + + // Flush all tables to ensure data is written to disk. 
+ for _, table := range tables {
+ err = table.Flush()
+ require.NoError(t, err)
+ }
+
+ // Now, let's compare the segment files in the snapshot directory with the segments in the regular directories.
+ for tableName := range tables {
+
+ segmentPaths, err := segment.BuildSegmentPaths(rootPaths, "", tableName)
+ require.NoError(t, err)
+ lowestSegmentIndex, highestSegmentIndex, segments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ segmentPaths,
+ false,
+ time.Now(),
+ false,
+ false)
+
+ require.NoError(t, err)
+ snapshotSegmentPath, err := segment.NewSegmentPath(snapshotDir, "", tableName)
+ require.NoError(t, err)
+ snapshotLowestSegmentIndex, snapshotHighestSegmentIndex, snapshotSegments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ []*segment.SegmentPath{snapshotSegmentPath},
+ false,
+ time.Now(),
+ false,
+ false)
+ require.NoError(t, err)
+
+ // Both the snapshot directory and the regular directories should agree on the lowest segment index.
+ require.Equal(t, lowestSegmentIndex, snapshotLowestSegmentIndex)
+
+ // The snapshot directory should have one fewer segment than the regular directories. The highest segment will
+ // be mutable, and therefore won't appear in the snapshot.
+ require.Equal(t, highestSegmentIndex-1, snapshotHighestSegmentIndex)
+ require.Equal(t, len(segments)-1, len(snapshotSegments))
+
+ // There should be a boundary file in the snapshot directory signaling the highest legal segment index in the
+ // snapshot.
+ boundaryFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(snapshotDir, tableName))
+ require.NoError(t, err)
+ require.True(t, boundaryFile.IsDefined())
+ require.Equal(t, snapshotHighestSegmentIndex, boundaryFile.BoundaryIndex())
+
+ for i := lowestSegmentIndex; i < highestSegmentIndex; i++ {
+ regularSegment := segments[i]
+ snapshotSegment := snapshotSegments[i]
+
+ // The regular segment should know it is not a snapshot.
+ snapshot, err := regularSegment.IsSnapshot() + require.NoError(t, err) + require.False(t, snapshot) + + // None of the regular segment files should be symlinks. + for _, filePath := range regularSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.False(t, info.Mode()&os.ModeSymlink != 0) + } + + // The snapshot segment should realize that it is a snapshot. + snapshot, err = snapshotSegment.IsSnapshot() + require.NoError(t, err) + require.True(t, snapshot) + + // All snapshot files should be symlinks. + for _, filePath := range snapshotSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.True(t, info.Mode()&os.ModeSymlink != 0) + } + + // The keys should be the same in both segments. + regularKeys, err := regularSegment.GetKeys() + require.NoError(t, err) + snapshotKeys, err := snapshotSegment.GetKeys() + require.NoError(t, err) + require.Equal(t, regularKeys, snapshotKeys) + + // The values should be present in both segments. + for _, key := range regularKeys { + regularValue, err := regularSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + snapshotValue, err := snapshotSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + require.Equal(t, regularValue, snapshotValue) + } + } + } + + ok, err := errorMonitor.IsOk() + require.NoError(t, err) + require.True(t, ok) + + // Deleting the snapshot directory should not in any way cause issues with the database. + err = db.Close() + require.NoError(t, err) + + errorMonitor = util.NewErrorMonitor(ctx, logger, nil) + + err = os.RemoveAll(snapshotDir) + require.NoError(t, err) + + // Reopen the database and ensure that it still works. + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + + for tableName := range tables { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + // Ensure that the data is still present in the database. 
+ for key, expectedValue := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok, "Expected key %s to be present in table %s", key, tableName) + require.Equal(t, expectedValue, value) + } + } + + // Cleanup. + err = db.Close() + require.NoError(t, err) + + ok, err = errorMonitor.IsOk() + require.NoError(t, err) + require.True(t, ok) +} + +// This test verifies that LittDB rebuilds the snapshot directory correctly every time it starts up. +func TestSnapshotRebuilding(t *testing.T) { + t.Parallel() + + ctx := t.Context() + logger := test.GetLogger() + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + errorMonitor := util.NewErrorMonitor(ctx, logger, nil) + rootPathCount := rand.Uint64Range(2, 5) + rootPaths := make([]string, rootPathCount) + for i := uint64(0); i < rootPathCount; i++ { + rootPaths[i] = path.Join(testDirectory, fmt.Sprintf("root-%d", i)) + } + + snapshotDir := testDirectory + "/snapshot" + + // Configure the DB to enable snapshots. + config, err := litt.DefaultConfig(rootPaths...) + require.NoError(t, err) + config.Fsync = false + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(rand.Uint64Range(rootPathCount, 2*rootPathCount)) + config.TargetSegmentFileSize = 100 + config.SnapshotDirectory = snapshotDir + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint64Range(2, 5) + tables := make(map[string]litt.Table, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableName := fmt.Sprintf("table-%d", i) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables[tableName] = table + } + + // map from table name to keys to values + expectedData := make(map[string]map[string][]byte) + for _, table := range tables { + expectedData[table.Name()] = make(map[string][]byte) + } + + // Write some data into the DB. 
+ for i := 0; i < 1000; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + tableName := fmt.Sprintf("table-%d", tableIndex) + table := tables[tableName] + + key := rand.String(32) + value := rand.PrintableVariableBytes(1, 100) + + err = table.Put([]byte(key), value) + require.NoError(t, err) + + expectedData[tableName][key] = value + } + + // Flush all tables to ensure data is written to disk. + for _, table := range tables { + err = table.Flush() + require.NoError(t, err) + } + + // Delete all snapshot files with even indices. + for tableName := range tables { + require.NoError(t, err) + snapshotSegmentPath, err := segment.NewSegmentPath(snapshotDir, "", tableName) + require.NoError(t, err) + snapshotLowestSegmentIndex, snapshotHighestSegmentIndex, snapshotSegments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + []*segment.SegmentPath{snapshotSegmentPath}, + false, + time.Now(), + false, + false) + require.NoError(t, err) + + for i := snapshotLowestSegmentIndex; i <= snapshotHighestSegmentIndex; i++ { + if i%2 == 0 { + for _, filePath := range snapshotSegments[i].GetFilePaths() { + err = os.Remove(filePath) + require.NoError(t, err, "Failed to remove file %s in snapshot directory", filePath) + } + } + } + } + + ok, err := errorMonitor.IsOk() + require.NoError(t, err) + require.True(t, ok) + + // Restart the DB. + err = db.Close() + require.NoError(t, err) + + errorMonitor = util.NewErrorMonitor(ctx, logger, nil) + + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + + for tableName := range tables { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + // Ensure that the data is still present in the database. 
+ for key, expectedValue := range expectedData[tableName] {
+ value, ok, err := table.Get([]byte(key))
+ require.NoError(t, err)
+ require.True(t, ok, "Expected key %s to be present in table %s", key, tableName)
+ require.Equal(t, expectedValue, value)
+ }
+ }
+
+ // Now, let's compare the segment files in the snapshot directory with the segments in the regular directories.
+ // Our shenanigans above should have been fully fixed when the DB restarted.
+ for tableName := range tables {
+
+ segmentPaths, err := segment.BuildSegmentPaths(rootPaths, "", tableName)
+ require.NoError(t, err)
+ lowestSegmentIndex, highestSegmentIndex, segments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ segmentPaths,
+ false,
+ time.Now(),
+ false,
+ false)
+
+ require.NoError(t, err)
+ snapshotSegmentPath, err := segment.NewSegmentPath(snapshotDir, "", tableName)
+ require.NoError(t, err)
+ snapshotLowestSegmentIndex, snapshotHighestSegmentIndex, snapshotSegments, err := segment.GatherSegmentFiles(
+ logger,
+ errorMonitor,
+ []*segment.SegmentPath{snapshotSegmentPath},
+ false,
+ time.Now(),
+ false,
+ false)
+ require.NoError(t, err)
+
+ // Both the snapshot directory and the regular directories should agree on the lowest segment index.
+ require.Equal(t, lowestSegmentIndex, snapshotLowestSegmentIndex)
+
+ // The snapshot directory should have one fewer segment than the regular directories. The highest segment will
+ // be mutable, and therefore won't appear in the snapshot.
+ require.Equal(t, highestSegmentIndex-1, snapshotHighestSegmentIndex)
+ require.Equal(t, len(segments)-1, len(snapshotSegments))
+
+ // There should be a boundary file in the snapshot directory signaling the highest legal segment index in the
+ // snapshot.
+ boundaryFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(snapshotDir, tableName)) + require.NoError(t, err) + require.True(t, boundaryFile.IsDefined()) + require.Equal(t, snapshotHighestSegmentIndex, boundaryFile.BoundaryIndex()) + + for i := lowestSegmentIndex; i < highestSegmentIndex; i++ { + regularSegment := segments[i] + snapshotSegment := snapshotSegments[i] + + // The regular segment should know it is not a snapshot. + snapshot, err := regularSegment.IsSnapshot() + require.NoError(t, err) + require.False(t, snapshot) + + // None of the regular segment files should be symlinks. + for _, filePath := range regularSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.False(t, info.Mode()&os.ModeSymlink != 0) + } + + // The snapshot segment should realize that it is a snapshot. + snapshot, err = snapshotSegment.IsSnapshot() + require.NoError(t, err) + require.True(t, snapshot) + + // All snapshot files should be symlinks. + for _, filePath := range snapshotSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.True(t, info.Mode()&os.ModeSymlink != 0) + } + + // The keys should be the same in both segments. + regularKeys, err := regularSegment.GetKeys() + require.NoError(t, err) + snapshotKeys, err := snapshotSegment.GetKeys() + require.NoError(t, err) + require.Equal(t, regularKeys, snapshotKeys) + + // The values should be present in both segments. + for _, key := range regularKeys { + regularValue, err := regularSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + snapshotValue, err := snapshotSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + require.Equal(t, regularValue, snapshotValue) + } + } + } + + // Cleanup. 
+ err = db.Close() + require.NoError(t, err) + + ok, err = errorMonitor.IsOk() + require.NoError(t, err) + require.True(t, ok) +} + +// The DB should not attempt to rebuild snapshot files that are below the specified lower bound. +func TestSnapshotLowerBound(t *testing.T) { + t.Parallel() + + ctx := t.Context() + logger := test.GetLogger() + rand := random.NewTestRandom() + testDirectory := t.TempDir() + + errorMonitor := util.NewErrorMonitor(ctx, logger, nil) + + rootPathCount := rand.Uint64Range(2, 5) + rootPaths := make([]string, rootPathCount) + for i := uint64(0); i < rootPathCount; i++ { + rootPaths[i] = path.Join(testDirectory, fmt.Sprintf("root-%d", i)) + } + + snapshotDir := testDirectory + "/snapshot" + + // Configure the DB to enable snapshots. + config, err := litt.DefaultConfig(rootPaths...) + require.NoError(t, err) + config.Fsync = false + config.DoubleWriteProtection = true + config.ShardingFactor = uint32(rand.Uint64Range(rootPathCount, 2*rootPathCount)) + config.TargetSegmentFileSize = 100 + config.SnapshotDirectory = snapshotDir + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableCount := rand.Uint64Range(2, 5) + tables := make(map[string]litt.Table, tableCount) + for i := uint64(0); i < tableCount; i++ { + tableName := fmt.Sprintf("table-%d", i) + table, err := db.GetTable(tableName) + require.NoError(t, err) + tables[tableName] = table + } + + // map from table name to keys to values + expectedData := make(map[string]map[string][]byte) + for _, table := range tables { + expectedData[table.Name()] = make(map[string][]byte) + } + + // Write some data into the DB. 
+ for i := 0; i < 1000; i++ { + tableIndex := rand.Uint64Range(0, tableCount) + tableName := fmt.Sprintf("table-%d", tableIndex) + table := tables[tableName] + + key := rand.String(32) + value := rand.PrintableVariableBytes(1, 100) + + err = table.Put([]byte(key), value) + require.NoError(t, err) + + expectedData[tableName][key] = value + } + + // Flush all tables to ensure data is written to disk. + for _, table := range tables { + err = table.Flush() + require.NoError(t, err) + } + + // We are going to delete the lower half of snapshot files to simulate a "litt prune" command. The lower bound + // file will be updated to signal that we do not want to reconstruct the deleted segments. We will delete all + // other segments that have even indices, to verify that the DB does rebuild those segments. + lowerBoundsByTable := make(map[string]uint32) + for tableName := range tables { + require.NoError(t, err) + snapshotSegmentPath, err := segment.NewSegmentPath(snapshotDir, "", tableName) + require.NoError(t, err) + snapshotLowestSegmentIndex, snapshotHighestSegmentIndex, snapshotSegments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + []*segment.SegmentPath{snapshotSegmentPath}, + false, + time.Now(), + false, + false) + require.NoError(t, err) + + lowerBound := snapshotLowestSegmentIndex + (snapshotHighestSegmentIndex-snapshotLowestSegmentIndex)/2 + lowerBoundsByTable[tableName] = lowerBound + boundaryFile, err := disktable.LoadBoundaryFile(disktable.LowerBound, path.Join(snapshotDir, tableName)) + require.NoError(t, err) + err = boundaryFile.Update(lowerBound) + require.NoError(t, err) + + for i := snapshotLowestSegmentIndex; i <= snapshotHighestSegmentIndex; i++ { + if i%2 == 0 || i <= lowerBound { + for _, filePath := range snapshotSegments[i].GetFilePaths() { + err = os.Remove(filePath) + require.NoError(t, err, "Failed to remove file %s in snapshot directory", filePath) + } + } + } + } + + ok, err := errorMonitor.IsOk() + require.NoError(t, err) + 
require.True(t, ok) + + // Restart the DB. + err = db.Close() + require.NoError(t, err) + + errorMonitor = util.NewErrorMonitor(ctx, logger, nil) + + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + + for tableName := range tables { + table, err := db.GetTable(tableName) + require.NoError(t, err) + + // Ensure that the data is still present in the database. + for key, expectedValue := range expectedData[tableName] { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err) + require.True(t, ok, "Expected key %s to be present in table %s", key, tableName) + require.Equal(t, expectedValue, value) + } + } + + // Now, let's compare the segment files in the snapshot directory with the segments in the regular directories. + // Our shenanigans above should have been fully fixed for the files above the boundary, but no snapshots + // should have been rebuilt for the files below or at the boundary. + for tableName := range tables { + + segmentPaths, err := segment.BuildSegmentPaths(rootPaths, "", tableName) + require.NoError(t, err) + _, highestSegmentIndex, segments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + segmentPaths, + false, + time.Now(), + false, + false) + + require.NoError(t, err) + snapshotSegmentPath, err := segment.NewSegmentPath(snapshotDir, "", tableName) + require.NoError(t, err) + snapshotLowestSegmentIndex, snapshotHighestSegmentIndex, snapshotSegments, err := segment.GatherSegmentFiles( + logger, + errorMonitor, + []*segment.SegmentPath{snapshotSegmentPath}, + false, + time.Now(), + false, + false) + require.NoError(t, err) + + // We shouldn't see snapshot files with an index less than or equal to the lower bound. + require.Equal(t, lowerBoundsByTable[tableName]+1, snapshotLowestSegmentIndex) + + // The high segment index should be one less than the highest segment index in the regular directories. 
+ require.Equal(t, highestSegmentIndex-1, snapshotHighestSegmentIndex) + + // There should be a boundary file in the snapshot directory signaling the highest legal segment index in the + // snapshot. + boundaryFile, err := disktable.LoadBoundaryFile(disktable.UpperBound, path.Join(snapshotDir, tableName)) + require.NoError(t, err) + require.True(t, boundaryFile.IsDefined()) + require.Equal(t, snapshotHighestSegmentIndex, boundaryFile.BoundaryIndex()) + + // The lower bound file we previously wrote should still be present. + lowerBoundFile, err := disktable.LoadBoundaryFile(disktable.LowerBound, path.Join(snapshotDir, tableName)) + require.NoError(t, err) + require.True(t, lowerBoundFile.IsDefined()) + require.Equal(t, lowerBoundsByTable[tableName], lowerBoundFile.BoundaryIndex()) + + for i := snapshotLowestSegmentIndex; i <= snapshotHighestSegmentIndex; i++ { + regularSegment := segments[i] + snapshotSegment := snapshotSegments[i] + + // The regular segment should know it is not a snapshot. + snapshot, err := regularSegment.IsSnapshot() + require.NoError(t, err) + require.False(t, snapshot) + + // None of the regular segment files should be symlinks. + for _, filePath := range regularSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.False(t, info.Mode()&os.ModeSymlink != 0) + } + + // The snapshot segment should realize that it is a snapshot. + snapshot, err = snapshotSegment.IsSnapshot() + require.NoError(t, err) + require.True(t, snapshot) + + // All snapshot files should be symlinks. + for _, filePath := range snapshotSegment.GetFilePaths() { + info, err := os.Lstat(filePath) + require.NoError(t, err) + require.True(t, info.Mode()&os.ModeSymlink != 0) + } + + // The keys should be the same in both segments. 
+ regularKeys, err := regularSegment.GetKeys() + require.NoError(t, err) + snapshotKeys, err := snapshotSegment.GetKeys() + require.NoError(t, err) + require.Equal(t, regularKeys, snapshotKeys) + + // The values should be present in both segments. + for _, key := range regularKeys { + regularValue, err := regularSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + snapshotValue, err := snapshotSegment.Read(key.Key, key.Address) + require.NoError(t, err) + + require.Equal(t, regularValue, snapshotValue) + } + } + } + + // Cleanup. + err = db.Close() + require.NoError(t, err) + + ok, err = errorMonitor.IsOk() + require.NoError(t, err) + require.True(t, ok) +} diff --git a/sei-db/db_engine/litt/test/table_test.go b/sei-db/db_engine/litt/test/table_test.go new file mode 100644 index 0000000000..b155e97eac --- /dev/null +++ b/sei-db/db_engine/litt/test/table_test.go @@ -0,0 +1,556 @@ +//go:build littdb_wip + +package test + +import ( + "fmt" + "os" + "path/filepath" + "sync/atomic" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/common/cache" + "github.com/Layr-Labs/eigenda/litt" + tablecache "github.com/Layr-Labs/eigenda/litt/cache" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/disktable/keymap" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/memtable" + "github.com/Layr-Labs/eigenda/litt/types" + "github.com/Layr-Labs/eigenda/test" + "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +type tableBuilder struct { + name string + builder func(clock func() time.Time, name string, path string) (litt.ManagedTable, error) +} + +// This test executes against different table implementations. 
+var tableBuilders = []*tableBuilder{
+ {
+ "memtable",
+ buildMemTable,
+ },
+ {
+ "cached memtable",
+ buildCachedMemTable,
+ },
+ {
+ "mem keymap disk table",
+ buildMemKeyDiskTable,
+ },
+ {
+ "cached mem keymap disk table",
+ buildCachedMemKeyDiskTable,
+ },
+ {
+ "leveldb keymap disk table",
+ buildLevelDBKeyDiskTable,
+ },
+ {
+ "cached leveldb keymap disk table",
+ buildCachedLevelDBKeyDiskTable,
+ },
+}
+
+var noCacheTableBuilders = []*tableBuilder{
+ {
+ "memtable",
+ buildMemTable,
+ },
+ {
+ "mem keymap disk table",
+ buildMemKeyDiskTable,
+ },
+ {
+ "leveldb keymap disk table",
+ buildLevelDBKeyDiskTable,
+ },
+}
+
+func buildMemTable(
+ clock func() time.Time,
+ name string,
+ path string) (litt.ManagedTable, error) {
+
+ // Check the error before touching the config; dereferencing a nil config would panic.
+ config, err := litt.DefaultConfig(path)
+ if err != nil {
+ return nil, fmt.Errorf("failed to create config: %w", err)
+ }
+ config.Clock = clock
+ config.GCPeriod = time.Millisecond
+
+ return memtable.NewMemTable(config, name), nil
+}
+
+func setupKeymapTypeFile(keymapPath string, keymapType keymap.KeymapType) (*keymap.KeymapTypeFile, error) {
+ exists, err := keymap.KeymapFileExists(keymapPath)
+ if err != nil {
+ return nil, fmt.Errorf("failed to check if keymap file exists: %w", err)
+ }
+ var keymapTypeFile *keymap.KeymapTypeFile
+ if exists {
+ keymapTypeFile, err = keymap.LoadKeymapTypeFile(keymapPath)
+ if err != nil {
+ return nil, fmt.Errorf("failed to load keymap type file: %w", err)
+ }
+ } else {
+ err = os.MkdirAll(keymapPath, 0755)
+ if err != nil {
+ return nil, fmt.Errorf("failed to create keymap directory: %w", err)
+ }
+ keymapTypeFile = keymap.NewKeymapTypeFile(keymapPath, keymapType)
+ err = keymapTypeFile.Write()
+ if err != nil {
+ return nil, fmt.Errorf("failed to create keymap type file: %w", err)
+ }
+ }
+
+ return keymapTypeFile, nil
+}
+
+func buildMemKeyDiskTable(
+ clock func() time.Time,
+ name string,
+ path string) (litt.ManagedTable, error) {
+
+ logger := test.GetLogger()
+
+ keymapPath := 
filepath.Join(path, name, keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.MemKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewMemKeymap(logger, "", true) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + config, err := litt.DefaultConfig(path) + if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + config.GCPeriod = time.Millisecond + config.Clock = clock + config.Fsync = false + config.DoubleWriteProtection = true + config.SaltShaker = random.NewTestRandom().Rand + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.Logger = logger + + table, err := disktable.NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + []string{path}, + true, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func buildLevelDBKeyDiskTable( + clock func() time.Time, + name string, + path string) (litt.ManagedTable, error) { + + logger := test.GetLogger() + + keymapPath := filepath.Join(path, name, keymap.KeymapDirectoryName) + keymapTypeFile, err := setupKeymapTypeFile(keymapPath, keymap.MemKeymapType) + if err != nil { + return nil, fmt.Errorf("failed to load keymap type file: %w", err) + } + + keys, _, err := keymap.NewUnsafeLevelDBKeymap(logger, keymapPath, true) + if err != nil { + return nil, fmt.Errorf("failed to create keymap: %w", err) + } + + config, err := litt.DefaultConfig(path) + if err != nil { + return nil, fmt.Errorf("failed to create config: %w", err) + } + config.GCPeriod = time.Millisecond + config.Clock = clock + config.Fsync = false + config.DoubleWriteProtection = true + config.SaltShaker = random.NewTestRandom().Rand + config.TargetSegmentFileSize = 100 // intentionally use a very small segment size + config.Logger = logger + + table, err := 
disktable.NewDiskTable( + config, + name, + keys, + keymapPath, + keymapTypeFile, + []string{path}, + true, + nil) + + if err != nil { + return nil, fmt.Errorf("failed to create disk table: %w", err) + } + + return table, nil +} + +func buildCachedMemTable( + clock func() time.Time, + name string, + path string) (litt.ManagedTable, error) { + + baseTable, err := buildMemTable(clock, name, path) + if err != nil { + return nil, err + } + + writeCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + readCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + + return tablecache.NewCachedTable(baseTable, writeCache, readCache, nil), nil +} + +func buildCachedMemKeyDiskTable( + clock func() time.Time, + name string, + path string) (litt.ManagedTable, error) { + + baseTable, err := buildMemKeyDiskTable(clock, name, path) + if err != nil { + return nil, err + } + + writeCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + readCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + + return tablecache.NewCachedTable(baseTable, writeCache, readCache, nil), nil +} + +func buildCachedLevelDBKeyDiskTable( + clock func() time.Time, + name string, + path string) (litt.ManagedTable, error) { + + baseTable, err := buildLevelDBKeyDiskTable(clock, name, path) + if err != nil { + return nil, err + } + + writeCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + readCache := cache.NewFIFOCache[string, []byte](500, func(k string, v []byte) uint64 { + return uint64(len(k) + len(v)) + }, nil) + + return tablecache.NewCachedTable(baseTable, writeCache, readCache, nil), nil +} + +func randomTableOperationsTest(t *testing.T, 
tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + tableName := rand.String(8) + table, err := tableBuilder.builder(time.Now, tableName, directory) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Once in a while, flush the table. + if rand.BoolWithProbability(0.1) { + err = table.Flush() + require.NoError(t, err) + } + + // Once in a while, sleep for a short time. For tables that do garbage collection, the garbage + // collection interval has been configured to be 1ms. Sleeping 5ms should be enough to give + // the garbage collector a chance to run. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. 
+ if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + for expectedKey, expectedValue := range expectedValues { + ok, err := table.Exists([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + nonExistentKey := rand.PrintableVariableBytes(32, 64) + ok, err := table.Exists(nonExistentKey) + require.NoError(t, err) + require.False(t, ok) + _, ok, err = table.Get(nonExistentKey) + require.NoError(t, err) + require.False(t, ok) + } + } + + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestRandomTableOperations(t *testing.T) { + t.Parallel() + for _, tb := range tableBuilders { + t.Run(tb.name, func(t *testing.T) { + randomTableOperationsTest(t, tb) + }) + } +} + +func garbageCollectionTest(t *testing.T, tableBuilder *tableBuilder) { + rand := random.NewTestRandom() + + directory := t.TempDir() + + startTime := rand.Time() + + var fakeTime atomic.Pointer[time.Time] + fakeTime.Store(&startTime) + + clock := func() time.Time { + return *fakeTime.Load() + } + + tableName := rand.String(8) + table, err := tableBuilder.builder(clock, tableName, directory) + if err != nil { + t.Fatalf("failed to create table: %v", err) + } + + ttlSeconds := rand.Int32Range(20, 30) + ttl := time.Duration(ttlSeconds) * time.Second + err = table.SetTTL(ttl) + require.NoError(t, err) + + require.Equal(t, tableName, table.Name()) + + expectedValues := make(map[string][]byte) + creationTimes := make(map[string]time.Time) + expiredValues := make(map[string][]byte) + + iterations := 1000 + for i := 0; i < iterations; i++ { + + // Advance the clock. 
+ now := *fakeTime.Load() + secondsToAdvance := rand.Float64Range(0.0, 1.0) + newTime := now.Add(time.Duration(secondsToAdvance * float64(time.Second))) + fakeTime.Store(&newTime) + + // Write some data. + batchSize := rand.Int32Range(1, 10) + + if batchSize == 1 { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + err = table.Put(key, value) + require.NoError(t, err) + expectedValues[string(key)] = value + creationTimes[string(key)] = newTime + } else { + batch := make([]*types.KVPair, 0, batchSize) + for j := int32(0); j < batchSize; j++ { + key := rand.PrintableVariableBytes(32, 64) + value := rand.PrintableVariableBytes(1, 128) + batch = append(batch, &types.KVPair{Key: key, Value: value}) + expectedValues[string(key)] = value + creationTimes[string(key)] = newTime + } + err = table.PutBatch(batch) + require.NoError(t, err) + } + + // Flush the table. + err = table.Flush() + require.NoError(t, err) + + // Once in a while, change the TTL. To avoid introducing test flakiness, only decrease the TTL + // (increasing the TTL risks causing the expected deletions as tracked by this test to get out + // of sync with what the table is doing) + if rand.BoolWithProbability(0.01) { + ttlSeconds -= 1 + ttl = time.Duration(ttlSeconds) * time.Second + err = table.SetTTL(ttl) + require.NoError(t, err) + } + + // Once in a while, pause for a brief moment to give the garbage collector a chance to do work in the + // background. This is not required for the test to pass. + if rand.BoolWithProbability(0.01) { + time.Sleep(5 * time.Millisecond) + } + + // Once in a while, scan the table and verify that all expected values are present. + // Don't do this every time for the sake of test runtime. + if rand.BoolWithProbability(0.01) || i == iterations-1 /* always check on the last iteration */ { + // Remove expired values from the expected values. 
+ newlyExpiredKeys := make([]string, 0) + for key, creationTime := range creationTimes { + if newTime.Sub(creationTime) > ttl { + newlyExpiredKeys = append(newlyExpiredKeys, key) + } + } + for _, key := range newlyExpiredKeys { + expiredValues[key] = expectedValues[key] + delete(expectedValues, key) + delete(creationTimes, key) + } + + // Check the keys that are expected to still be in the table + for expectedKey, expectedValue := range expectedValues { + value, ok, err := table.Get([]byte(expectedKey)) + require.NoError(t, err) + require.True(t, ok, "key %s not found in table", expectedKey) + require.Equal(t, expectedValue, value) + } + + // Try fetching a value that isn't in the table. + _, ok, err := table.Get(rand.PrintableVariableBytes(32, 64)) + require.NoError(t, err) + require.False(t, ok) + + // Check the values that are expected to have been removed from the table + // Garbage collection happens asynchronously, so we may need to wait for it to complete. + test.AssertEventuallyTrue(t, func() bool { + // keep a running sum of the unexpired data size. Some data may be unable to expire + // due to sharing a file with data that is not yet ready to expire, so it's hard + // to predict the exact quantity of unexpired data. + // + // Math: + // - 100 bytes in each segment (test configuration) + // - max value size of 128 bytes (test configuration) + // - 4 bytes to store the length of the value (default property) + // - max bytes per segment: 100+128+4 = 232 + // - max number of segments per write is equal to max batch size, or 9 + // - max unexpired data size = 9 * 232 = 2088 + unexpiredDataSize := 0 + + for key, expectedValue := range expiredValues { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err) + if !ok { + // value is not present in the table + continue + } + + // If the value has not yet been deleted, it should at least return the expected value. 
+ require.Equal(t, expectedValue, value, "unexpected value for key %s", key) + + unexpiredDataSize += len(value) + 4 // 4 bytes stores the length of the value + } + + // This check passes if the unexpired data size is less than or equal to the maximum plausible + // size of unexpired data. If working as expected, this should always happen within a reasonable + // amount of time. + return unexpiredDataSize <= 2088 + }, time.Second) + } + } + + err = table.Destroy() + require.NoError(t, err) + + // ensure that the test directory is empty + entries, err := os.ReadDir(directory) + require.NoError(t, err) + require.Empty(t, entries) +} + +func TestGarbageCollection(t *testing.T) { + t.Parallel() + for _, tb := range noCacheTableBuilders { + t.Run(tb.name, func(t *testing.T) { + garbageCollectionTest(t, tb) + }) + } +} + +func TestInvalidTableName(t *testing.T) { + t.Parallel() + directory := t.TempDir() + + config, err := litt.DefaultConfig(directory) + require.NoError(t, err) + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + tableName := "invalid name" + table, err := db.GetTable(tableName) + require.Error(t, err) + require.Nil(t, table) + + tableName = "invalid/name" + table, err = db.GetTable(tableName) + require.Error(t, err) + require.Nil(t, table) + + tableName = "" + table, err = db.GetTable(tableName) + require.Error(t, err) + require.Nil(t, table) +} diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/000001.log b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/000001.log new file mode 100644 index 0000000000..02312aebc3 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/000001.log differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/CURRENT b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/CURRENT new file mode 100644 index 0000000000..feda7d6b24 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/CURRENT @@ -0,0 +1 @@ 
+MANIFEST-000000 diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/LOCK b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/LOCK new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/LOG b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/LOG new file mode 100644 index 0000000000..9ff9a197f2 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/LOG @@ -0,0 +1,8 @@ +=============== May 7, 2025 (CDT) =============== +09:33:37.810933 log@legend F·NumFile S·FileSize N·Entry C·BadEntry B·BadBlock Ke·KeyError D·DroppedEntry L·Level Q·SeqNum T·TimeElapsed +09:33:37.824567 db@open opening +09:33:37.825148 version@stat F·[] S·0B[] Sc·[] +09:33:37.828724 db@janitor F·2 G·0 +09:33:37.828751 db@open done T·4.167625ms +09:33:37.859690 db@close closing +09:33:37.859770 db@close done T·79.375µs diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/MANIFEST-000000 b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/MANIFEST-000000 new file mode 100644 index 0000000000..9d54f6733b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/data/MANIFEST-000000 differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/initialized b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/initialized new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/keymap/keymap-type.txt b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/keymap-type.txt new file mode 100644 index 0000000000..02d4ce5d35 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v0/test/keymap/keymap-type.txt @@ -0,0 +1 @@ +LevelDBKeymap \ No newline at end of file diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-0.values new file mode 100644 index 0000000000..c79f0ee40b Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-1.values new file mode 100644 index 0000000000..ff4315b838 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-2.values new file mode 100644 index 0000000000..e06c80df10 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-3.values new file mode 100644 index 0000000000..25ba38b932 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.keys new file mode 100644 index 0000000000..8697025a97 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.metadata new file mode 100644 index 0000000000..c8ff1b3281 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/0.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-0.values new file mode 100644 index 0000000000..ffbfbc24df Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-1.values new file mode 100644 index 
0000000000..9669cde62d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-2.values new file mode 100644 index 0000000000..dbc9dcb172 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-3.values new file mode 100644 index 0000000000..7b738d1c80 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.keys new file mode 100644 index 0000000000..b0f355a99a Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.metadata new file mode 100644 index 0000000000..9cb517c1b2 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/1.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-0.values new file mode 100644 index 0000000000..d1ba0f90aa Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-1.values new file mode 100644 index 0000000000..bc90c4a93d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-2.values 
b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-2.values new file mode 100644 index 0000000000..91d34a9bc9 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-3.values new file mode 100644 index 0000000000..936dbf00cc Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.keys new file mode 100644 index 0000000000..2a9e7d8b2a Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.metadata new file mode 100644 index 0000000000..7ba88722fe Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/2.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-0.values new file mode 100644 index 0000000000..c1cbd70a17 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-1.values new file mode 100644 index 0000000000..9e0790f427 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-2.values new file mode 100644 index 0000000000..c5e4f42629 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-2.values differ diff --git 
a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-3.values new file mode 100644 index 0000000000..b620e6eee4 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.keys new file mode 100644 index 0000000000..ddd63b4959 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.metadata new file mode 100644 index 0000000000..d7b27c40ca Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/3.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-0.values new file mode 100644 index 0000000000..e60641747f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-1.values new file mode 100644 index 0000000000..472a79141e Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-2.values new file mode 100644 index 0000000000..c98bf1774b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-3.values new file mode 100644 index 0000000000..e4859a42dd Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.keys new file mode 100644 index 0000000000..7f78d6d176 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.metadata new file mode 100644 index 0000000000..f0a757d544 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/4.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-0.values new file mode 100644 index 0000000000..ed28004dc9 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-1.values new file mode 100644 index 0000000000..1b5d74a1ec Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-2.values new file mode 100644 index 0000000000..214fc38d5d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5-3.values new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.keys new file mode 100644 index 0000000000..a0f5bf12ef Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.keys differ 
diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.metadata new file mode 100644 index 0000000000..aefc272f27 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/5.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-0.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-0.values new file mode 100644 index 0000000000..d47c40450b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-1.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-1.values new file mode 100644 index 0000000000..8da50c7c96 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-2.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-2.values new file mode 100644 index 0000000000..61fa27d496 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-3.values b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-3.values new file mode 100644 index 0000000000..70ae276532 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.keys b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.keys new file mode 100644 index 0000000000..9760a7224e Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.metadata new file mode 100644 index 0000000000..4e9ad20fd9 Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v0/test/segments/6.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v0/test/table.metadata b/sei-db/db_engine/litt/test/testdata/v0/test/table.metadata new file mode 100644 index 0000000000..ba93fdb0ca Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v0/test/table.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/000001.log b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/000001.log new file mode 100644 index 0000000000..80d36ef36f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/000001.log differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/CURRENT b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/CURRENT new file mode 100644 index 0000000000..feda7d6b24 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/CURRENT @@ -0,0 +1 @@ +MANIFEST-000000 diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/LOCK b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/LOCK new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/LOG b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/LOG new file mode 100644 index 0000000000..883cbc6d45 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/LOG @@ -0,0 +1,8 @@ +=============== May 12, 2025 (CDT) =============== +12:12:52.269858 log@legend F·NumFile S·FileSize N·Entry C·BadEntry B·BadBlock Ke·KeyError D·DroppedEntry L·Level Q·SeqNum T·TimeElapsed +12:12:52.280593 db@open opening +12:12:52.281865 version@stat F·[] S·0B[] Sc·[] +12:12:52.284835 db@janitor F·2 G·0 +12:12:52.284865 db@open done T·4.2475ms +12:12:52.312588 db@close closing +12:12:52.312685 db@close done T·95.916µs diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/MANIFEST-000000 
b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/MANIFEST-000000 new file mode 100644 index 0000000000..9d54f6733b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/data/MANIFEST-000000 differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/initialized b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/initialized new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/keymap/keymap-type.txt b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/keymap-type.txt new file mode 100644 index 0000000000..02d4ce5d35 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v1/test/keymap/keymap-type.txt @@ -0,0 +1 @@ +LevelDBKeymap \ No newline at end of file diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-0.values new file mode 100644 index 0000000000..6131fc5d00 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-1.values new file mode 100644 index 0000000000..f554d4793d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-2.values new file mode 100644 index 0000000000..18c5c08b11 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-3.values new file mode 100644 index 0000000000..df15184564 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0-3.values differ diff --git 
a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.keys new file mode 100644 index 0000000000..7f84279a3f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.metadata new file mode 100644 index 0000000000..b3a083bc5c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/0.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-0.values new file mode 100644 index 0000000000..90b6b5973a Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-1.values new file mode 100644 index 0000000000..b7b640b048 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-2.values new file mode 100644 index 0000000000..d9126a2ff6 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-3.values new file mode 100644 index 0000000000..d483dd7ee0 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.keys new file mode 100644 index 0000000000..2e08ff0c27 Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.metadata new file mode 100644 index 0000000000..c7324a3e05 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/1.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-0.values new file mode 100644 index 0000000000..52f7b9ac80 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-1.values new file mode 100644 index 0000000000..f9fbc8d881 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-2.values new file mode 100644 index 0000000000..b979b71a83 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-3.values new file mode 100644 index 0000000000..92a984f8fa Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.keys new file mode 100644 index 0000000000..b253a651dd Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.metadata new file mode 100644 index 
0000000000..5b4ba83de3 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/2.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-0.values new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-1.values new file mode 100644 index 0000000000..2f690f74e6 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-2.values new file mode 100644 index 0000000000..bc393a1954 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-3.values new file mode 100644 index 0000000000..7e909cc546 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.keys new file mode 100644 index 0000000000..9cc99a1f46 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.metadata new file mode 100644 index 0000000000..bae71de018 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/3.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-0.values new file mode 100644 index 0000000000..363f10f170 Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-1.values new file mode 100644 index 0000000000..e09021c77c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-2.values new file mode 100644 index 0000000000..742b15e9d7 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-3.values new file mode 100644 index 0000000000..d8e781019e Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.keys new file mode 100644 index 0000000000..a5f52ad0ce Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.metadata new file mode 100644 index 0000000000..3bc2900c4b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/4.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-0.values new file mode 100644 index 0000000000..6ff7ba379c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-1.values new file mode 100644 index 
0000000000..da21673e37 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-2.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-2.values new file mode 100644 index 0000000000..ad1b46b8ac Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-3.values new file mode 100644 index 0000000000..9dbfcb7364 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.keys new file mode 100644 index 0000000000..a39cf60325 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.metadata new file mode 100644 index 0000000000..29c76dfc5f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/5.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-0.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-0.values new file mode 100644 index 0000000000..12117338f0 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-1.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-1.values new file mode 100644 index 0000000000..08e5fa7919 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-2.values 
b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-2.values new file mode 100644 index 0000000000..f518be3e20 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-3.values b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-3.values new file mode 100644 index 0000000000..77ac678280 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.keys b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.keys new file mode 100644 index 0000000000..7461c2a9cd Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.metadata new file mode 100644 index 0000000000..2f9d140525 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/segments/6.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v1/test/table.metadata b/sei-db/db_engine/litt/test/testdata/v1/test/table.metadata new file mode 100644 index 0000000000..ba93fdb0ca Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v1/test/table.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/000001.log b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/000001.log new file mode 100644 index 0000000000..02c4ba5bbd Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/000001.log differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/CURRENT b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/CURRENT new file mode 100644 index 0000000000..feda7d6b24 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/CURRENT @@ -0,0 +1 @@ +MANIFEST-000000 diff --git 
a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/LOCK b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/LOCK new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/LOG b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/LOG new file mode 100644 index 0000000000..210f230478 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/LOG @@ -0,0 +1,8 @@ +=============== May 15, 2025 (CDT) =============== +15:54:37.535265 log@legend F·NumFile S·FileSize N·Entry C·BadEntry B·BadBlock Ke·KeyError D·DroppedEntry L·Level Q·SeqNum T·TimeElapsed +15:54:37.556992 db@open opening +15:54:37.557686 version@stat F·[] S·0B[] Sc·[] +15:54:37.566101 db@janitor F·2 G·0 +15:54:37.566141 db@open done T·9.127417ms +15:54:37.602897 db@close closing +15:54:37.602996 db@close done T·95.417µs diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/MANIFEST-000000 b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/MANIFEST-000000 new file mode 100644 index 0000000000..9d54f6733b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/data/MANIFEST-000000 differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/initialized b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/initialized new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/keymap/keymap-type.txt b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/keymap-type.txt new file mode 100644 index 0000000000..02d4ce5d35 --- /dev/null +++ b/sei-db/db_engine/litt/test/testdata/v2/test/keymap/keymap-type.txt @@ -0,0 +1 @@ +LevelDBKeymap \ No newline at end of file diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-0.values new file mode 100644 index 0000000000..7113c9ec1a Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-1.values new file mode 100644 index 0000000000..bcfe881555 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-2.values new file mode 100644 index 0000000000..769fba71df Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-3.values new file mode 100644 index 0000000000..cd7e0b8995 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.keys new file mode 100644 index 0000000000..aa0911cde1 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.metadata new file mode 100644 index 0000000000..6706f4e809 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/0.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-0.values new file mode 100644 index 0000000000..b1192c27f0 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-1.values new file mode 100644 index 
0000000000..b581776a29 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-2.values new file mode 100644 index 0000000000..8e5cd193d5 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-3.values new file mode 100644 index 0000000000..e631469b79 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.keys new file mode 100644 index 0000000000..dc3e476428 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.metadata new file mode 100644 index 0000000000..b514b7381f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/1.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-0.values new file mode 100644 index 0000000000..e7b7736fc0 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-1.values new file mode 100644 index 0000000000..6889215419 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-2.values 
b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-2.values new file mode 100644 index 0000000000..af1500f7c7 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-3.values new file mode 100644 index 0000000000..a2cad8b576 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.keys new file mode 100644 index 0000000000..a3304f056b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.metadata new file mode 100644 index 0000000000..1383ebd864 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/2.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-0.values new file mode 100644 index 0000000000..2d08799bd1 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-1.values new file mode 100644 index 0000000000..effbae8b29 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-2.values new file mode 100644 index 0000000000..098182c3d5 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-2.values differ diff --git 
a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-3.values new file mode 100644 index 0000000000..6ff4f20443 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.keys new file mode 100644 index 0000000000..407d560c6a Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.metadata new file mode 100644 index 0000000000..26f287c9c3 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/3.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-0.values new file mode 100644 index 0000000000..c21a6f4f0c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-1.values new file mode 100644 index 0000000000..733612233f Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-2.values new file mode 100644 index 0000000000..01d8dda56d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-3.values new file mode 100644 index 0000000000..fc14618ca6 Binary files /dev/null and 
b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.keys new file mode 100644 index 0000000000..a44c53136c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.metadata new file mode 100644 index 0000000000..c2566987d8 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/4.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-0.values new file mode 100644 index 0000000000..f4e56928df Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-1.values new file mode 100644 index 0000000000..ffcdaa6c8b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-2.values new file mode 100644 index 0000000000..91d34a9bc9 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-3.values new file mode 100644 index 0000000000..564dd6775b Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.keys new file mode 100644 index 
0000000000..b1379d1daa Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.metadata new file mode 100644 index 0000000000..17c15eacd1 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/5.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-0.values new file mode 100644 index 0000000000..65fd4613e5 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-1.values new file mode 100644 index 0000000000..a6273091a3 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-2.values new file mode 100644 index 0000000000..03c60030aa Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-2.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-3.values new file mode 100644 index 0000000000..fb4009e103 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.keys new file mode 100644 index 0000000000..23ec52adb2 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.metadata 
b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.metadata new file mode 100644 index 0000000000..a5226bc000 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/6.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-0.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-0.values new file mode 100644 index 0000000000..44b7238e30 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-0.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-1.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-1.values new file mode 100644 index 0000000000..f97cb68e17 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-1.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-2.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-2.values new file mode 100644 index 0000000000..e69de29bb2 diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-3.values b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-3.values new file mode 100644 index 0000000000..6c0ad6117d Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7-3.values differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.keys b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.keys new file mode 100644 index 0000000000..f39ce59430 Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.keys differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.metadata new file mode 100644 index 0000000000..a7c788777c Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/segments/7.metadata differ diff --git a/sei-db/db_engine/litt/test/testdata/v2/test/table.metadata b/sei-db/db_engine/litt/test/testdata/v2/test/table.metadata new 
file mode 100644 index 0000000000..ba93fdb0ca Binary files /dev/null and b/sei-db/db_engine/litt/test/testdata/v2/test/table.metadata differ diff --git a/sei-db/db_engine/litt/test/unlock_test.go b/sei-db/db_engine/litt/test/unlock_test.go new file mode 100644 index 0000000000..fe0e800f42 --- /dev/null +++ b/sei-db/db_engine/litt/test/unlock_test.go @@ -0,0 +1,171 @@ +//go:build littdb_wip + +package test + +import ( + "os" + "path" + "path/filepath" + "strings" + "testing" + + "github.com/Layr-Labs/eigenda/litt" + "github.com/Layr-Labs/eigenda/litt/disktable" + "github.com/Layr-Labs/eigenda/litt/littbuilder" + "github.com/Layr-Labs/eigenda/litt/util" + testrandom "github.com/Layr-Labs/eigenda/test/random" + "github.com/stretchr/testify/require" +) + +// Note: this test is defined in the test package to avoid circular dependencies. + +func TestUnlock(t *testing.T) { + testDir := t.TempDir() + rand := testrandom.NewTestRandom() + volumes := []string{path.Join(testDir, "volume1"), path.Join(testDir, "volume2"), path.Join(testDir, "volume3")} + + config, err := litt.DefaultConfig(volumes...) + config.Fsync = false // Disable fsync for faster tests + config.TargetSegmentFileSize = 100 + config.ShardingFactor = uint32(len(volumes)) + require.NoError(t, err) + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + table, err := db.GetTable("test_table") + require.NoError(t, err) + + expectedData := make(map[string][]byte) + + // Write some data + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(1, 100) + + expectedData[string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put data in table") + } + + // Look for lock files. We should see one for each volume. + lockFileCount := 0 + err = filepath.Walk(testDir, func(path string, info os.FileInfo, err error) error { + if err != nil { + // Log but do not fail. LittDB may be shuffling files around concurrently. 
+ t.Logf("Error walking path %s (not necessarily fatal): %v", path, err) + return nil + } + if info.IsDir() { + return nil + } + if strings.HasSuffix(path, util.LockfileName) { + lockFileCount++ + } + return nil + }) + require.NoError(t, err) + require.Equal(t, 3, lockFileCount) + + // Unlock the DB. This should remove all lock files, but leave other files intact. + err = disktable.Unlock(config.Logger, volumes) + require.NoError(t, err, "Failed to unlock the database") + + // There should be no lock files left. + lockFileCount = 0 + err = filepath.Walk(testDir, func(path string, info os.FileInfo, err error) error { + if err != nil { + // Log but do not fail. LittDB may be shuffling files around concurrently. + t.Logf("Error walking path %s (not necessarily fatal): %v", path, err) + return nil + } + if info.IsDir() { + return nil + } + if strings.HasSuffix(path, util.LockfileName) { + lockFileCount++ + } + return nil + }) + require.NoError(t, err) + require.Equal(t, 0, lockFileCount, "There should be no lock files left after unlocking") + + // Calling unlock again should not cause any issues. + err = disktable.Unlock(config.Logger, volumes) + require.NoError(t, err, "Failed to unlock the database again") + + // Verify that the data is still intact. + for key, expectedValue := range expectedData { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get data from table") + require.True(t, ok, "Failed to get data from table") + require.Equal(t, expectedValue, value, "Data mismatch for key %s", key) + } + + // Restart the database and verify the data again. 
+ err = db.Close() + require.NoError(t, err) + + db, err = littbuilder.NewDB(config) + require.NoError(t, err) + + table, err = db.GetTable("test_table") + require.NoError(t, err) + + for key, expectedValue := range expectedData { + value, ok, err := table.Get([]byte(key)) + require.NoError(t, err, "Failed to get data from table after restart") + require.True(t, ok, "Failed to get data from table after restart") + require.Equal(t, expectedValue, value, "Data mismatch for key %s after restart", key) + } + + err = db.Close() + require.NoError(t, err, "Failed to close the database after restart") +} + +func TestPurgeLocks(t *testing.T) { + testDir := t.TempDir() + rand := testrandom.NewTestRandom() + volumes := []string{path.Join(testDir, "volume1"), path.Join(testDir, "volume2"), path.Join(testDir, "volume3")} + + config, err := litt.DefaultConfig(volumes...) + config.Fsync = false // Disable fsync for faster tests + config.TargetSegmentFileSize = 100 + require.NoError(t, err) + + db, err := littbuilder.NewDB(config) + require.NoError(t, err) + + table, err := db.GetTable("test_table") + require.NoError(t, err) + + expectedData := make(map[string][]byte) + + // Write some data + for i := 0; i < 100; i++ { + key := rand.PrintableBytes(32) + value := rand.PrintableVariableBytes(1, 100) + + expectedData[string(key)] = value + err = table.Put(key, value) + require.NoError(t, err, "Failed to put data in table") + } + + // Opening a second instance of the database should fail due to existing locks. + _, err = littbuilder.NewDB(config) + require.Error(t, err, "Expected error when opening a second instance of the database with existing locks") + + // Open a new instance of the database at the same time. Normally this is not possible, but it becomes possible + // when we purge locks.
+ config.PurgeLocks = true + db2, err := littbuilder.NewDB(config) + require.NoError(t, err, "Failed to open a second instance of the database") + + // This test doesn't bother to verify the table data, since we are in unsafe territory now with multiple instances + // of the database running at the same time. + + err = db.Close() + require.NoError(t, err, "Failed to close the first instance of the database") + err = db2.Close() + require.NoError(t, err) +} diff --git a/sei-db/db_engine/litt/types/address.go b/sei-db/db_engine/litt/types/address.go new file mode 100644 index 0000000000..dc3a6dfcb3 --- /dev/null +++ b/sei-db/db_engine/litt/types/address.go @@ -0,0 +1,47 @@ +//go:build littdb_wip + +package types + +import ( + "encoding/binary" + "fmt" +) + +// Address describes the location of data on disk. +// The first 4 bytes are the file ID, and the second 4 bytes are the offset of the data within the file. +type Address uint64 + +// NewAddress creates a new address +func NewAddress(index uint32, offset uint32) Address { + return Address(uint64(index)<<32 | uint64(offset)) +} + +// DeserializeAddress converts a byte slice to an address. +func DeserializeAddress(bytes []byte) (Address, error) { + if len(bytes) != 8 { + return 0, fmt.Errorf("invalid address length: %d", len(bytes)) + } + return Address(binary.BigEndian.Uint64(bytes)), nil +} + +// Index returns the file index of the value address. +func (a Address) Index() uint32 { + return uint32(a >> 32) +} + +// Offset returns the offset of the value address. +func (a Address) Offset() uint32 { + return uint32(a) +} + +// String returns a string representation of the address. +func (a Address) String() string { + return fmt.Sprintf("(%d:%d)", a.Index(), a.Offset()) +} + +// Serialize converts the address to a byte slice. 
+func (a Address) Serialize() []byte { + bytes := make([]byte, 8) + binary.BigEndian.PutUint64(bytes, uint64(a)) + return bytes +} diff --git a/sei-db/db_engine/litt/types/kv_pair.go b/sei-db/db_engine/litt/types/kv_pair.go new file mode 100644 index 0000000000..f7e703f4c6 --- /dev/null +++ b/sei-db/db_engine/litt/types/kv_pair.go @@ -0,0 +1,11 @@ +//go:build littdb_wip + +package types + +// KVPair represents a key-value pair. +type KVPair struct { + // Key is the key. + Key []byte + // Value is the value. + Value []byte +} diff --git a/sei-db/db_engine/litt/types/scoped_key.go b/sei-db/db_engine/litt/types/scoped_key.go new file mode 100644 index 0000000000..418e8aaab7 --- /dev/null +++ b/sei-db/db_engine/litt/types/scoped_key.go @@ -0,0 +1,13 @@ +//go:build littdb_wip + +package types + +// ScopedKey is a key, plus additional information about the value associated with the key. +type ScopedKey struct { + // A key in the DB. + Key []byte + // The location where the value associated with the key is stored. + Address Address + // The length of the value associated with the key. + ValueSize uint32 +} diff --git a/sei-db/db_engine/litt/util/constants.go b/sei-db/db_engine/litt/util/constants.go new file mode 100644 index 0000000000..217b9e0c7a --- /dev/null +++ b/sei-db/db_engine/litt/util/constants.go @@ -0,0 +1,6 @@ +//go:build littdb_wip + +package util + +// LockfileName is the name of the LittDB lockfile. It prevents DBs in multiple processes from accessing the same data directory.
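The bit-packing used by `types.Address` above (high 32 bits hold the file index, low 32 bits hold the byte offset) can be exercised with a short standalone sketch. The `Address` type below is a minimal mirror of the one in this patch, written against the stdlib only; it is not the `litt/types` package itself:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Address mirrors litt/types.Address: the high 32 bits hold the file index
// and the low 32 bits hold the byte offset of the value within that file.
type Address uint64

// NewAddress packs an index/offset pair into a single 64-bit address.
func NewAddress(index uint32, offset uint32) Address {
	return Address(uint64(index)<<32 | uint64(offset))
}

// Index recovers the file index from the high 32 bits.
func (a Address) Index() uint32 { return uint32(a >> 32) }

// Offset recovers the byte offset from the low 32 bits.
func (a Address) Offset() uint32 { return uint32(a) }

// Serialize encodes the address as 8 big-endian bytes.
func (a Address) Serialize() []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, uint64(a))
	return b
}

// DeserializeAddress is the inverse of Serialize.
func DeserializeAddress(b []byte) (Address, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("invalid address length: %d", len(b))
	}
	return Address(binary.BigEndian.Uint64(b)), nil
}

func main() {
	a := NewAddress(7, 4096)
	fmt.Println(a.Index(), a.Offset()) // 7 4096

	round, err := DeserializeAddress(a.Serialize())
	fmt.Println(round == a, err) // true <nil>
}
```

Because the encoding is big-endian, serialized addresses sort lexicographically in the same order as their (index, offset) pairs, which is a common reason for choosing this layout.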
+const LockfileName = "litt.lock" diff --git a/sei-db/db_engine/litt/util/error_monitor.go b/sei-db/db_engine/litt/util/error_monitor.go new file mode 100644 index 0000000000..43846fdbcc --- /dev/null +++ b/sei-db/db_engine/litt/util/error_monitor.go @@ -0,0 +1,121 @@ +//go:build littdb_wip + +package util + +import ( + "context" + "fmt" + "runtime/debug" + "sync/atomic" + + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// ErrorMonitor is a struct that permits the process to "panic" without using the golang panic keyword. +// When there are goroutines that function under the hood that are unable to return errors using the standard pattern, +// this utility provides an elegant way to handle those errors. In such situations, the desirable outcome is for the +// process to report the error and to elegantly spin itself down. +// +// Even though this utility can "panic", it is not the same as the panic that is built into Go. The Panic() method +// should be called in situations where recovery is not possible, i.e. the same situations where one would otherwise +// call golang's panic(). The big difference is that calling Panic() will not result in the process immediately being +// torn down. +type ErrorMonitor struct { + ctx context.Context + cancel context.CancelFunc + + logger logging.Logger + + // callback is called when the Panic() method is called for the first time. + callback func(error) + + // If this is non-nil, the monitor is either in a "panic" state or a "shutdown" state. + error atomic.Pointer[error] +} + +// NewErrorMonitor creates a new ErrorMonitor struct. Executes the callback function when/if Panic() is called. +// The callback is ignored if it is nil. 
+func NewErrorMonitor(
+	ctx context.Context,
+	logger logging.Logger,
+	callback func(error)) *ErrorMonitor {
+
+	ctx, cancel := context.WithCancel(ctx)
+
+	return &ErrorMonitor{
+		ctx:      ctx,
+		cancel:   cancel,
+		logger:   logger,
+		callback: callback,
+	}
+}
+
+// Await waits for a value to be sent on a channel. If the channel sends a value, the value is returned.
+// If Panic() is called before the channel sends a value, an error is returned.
+func Await[T any](handler *ErrorMonitor, channel <-chan T) (T, error) {
+	select {
+	case value := <-channel:
+		return value, nil
+	case <-handler.ImmediateShutdownRequired():
+		var zero T
+		return zero, fmt.Errorf("context cancelled")
+	}
+}
+
+// Send sends a value on a channel. If the value is sent, nil is returned. If Panic() is called before the value
+// is sent, an error is returned.
+func Send[T any](handler *ErrorMonitor, channel chan<- T, value T) error {
+	select {
+	case channel <- value:
+		return nil
+	case <-handler.ImmediateShutdownRequired():
+		return fmt.Errorf("context cancelled")
+	}
+}
+
+// ImmediateShutdownRequired returns an output channel that is closed when Panic() is called. The channel might also be
+// closed if the parent context is cancelled, and so this channel being closed can't be used to infer that we are
+// in a panicked state.
+func (h *ErrorMonitor) ImmediateShutdownRequired() <-chan struct{} {
+	return h.ctx.Done()
+}
+
+// IsOk returns true if the ErrorMonitor is in a good state, and false if in a "panic" or "shutdown" state.
+// If Panic() was called, the error returned is the error that caused the panic, and does not indicate that
+// the call to IsOk() failed. If Panic() has been called multiple times, the error returned will
+// be the first error passed to Panic(). If Shutdown() has been called and Panic() has not, the error
+// returned will describe the shutdown.
+func (h *ErrorMonitor) IsOk() (bool, error) { + err := h.error.Load() + if err != nil { + return false, *err + } + return true, nil +} + +// Shutdown causes the ErrorMonitor to enter a "shutdown" state. Causes ImmediateShutdownRequired() to signal. +func (h *ErrorMonitor) Shutdown() { + err := fmt.Errorf("monitor is shut down") + + // don't overwrite the error if there is already an error stored + h.error.CompareAndSwap(nil, &err) +} + +// Panic time! Something just went very wrong. (╯°□°)╯︵ ┻━┻ +func (h *ErrorMonitor) Panic(err error) { + stackTrace := string(debug.Stack()) + + h.logger.Errorf("monitor encountered an unrecoverable error: %v\n%s", err, stackTrace) + + // only store the error if there isn't already an error stored + firstError := h.error.CompareAndSwap(nil, &err) + + // Always cancel the context, even if this is not the first error. It's possible that the first "error" was + // actually a shutdown request, and we want to make sure that the context is always cancelled in the event + // of an unexpected error. 
+ h.cancel() + + if firstError && h.callback != nil { + h.callback(err) + } +} diff --git a/sei-db/db_engine/litt/util/file_lock.go b/sei-db/db_engine/litt/util/file_lock.go new file mode 100644 index 0000000000..fe90259c15 --- /dev/null +++ b/sei-db/db_engine/litt/util/file_lock.go @@ -0,0 +1,227 @@ +//go:build littdb_wip + +package util + +import ( + "errors" + "fmt" + "os" + "path" + "strconv" + "strings" + "syscall" + "time" + + "github.com/Layr-Labs/eigensdk-go/logging" +) + +// FileLock represents a file-based lock +type FileLock struct { + logger logging.Logger + path string + file *os.File +} + +// IsProcessAlive checks if a process with the given PID is still running +func IsProcessAlive(pid int) bool { + if pid <= 0 { + return false + } + + // Send signal 0 to check if process exists + // This doesn't actually send a signal, just checks if we can send one + err := syscall.Kill(pid, 0) + if err == nil { + return true + } + + // Check the specific error + var errno syscall.Errno + if errors.As(err, &errno) { + switch { + case errors.Is(errno, syscall.ESRCH): + // No such process + return false + case errors.Is(errno, syscall.EPERM): + // Permission denied, but process exists + return true + default: + // Other error, assume process exists to be safe + return true + } + } + + // Unknown error, assume process exists to be safe + return true +} + +// parseLockFile parses a lock file and returns the PID if valid +func parseLockFile(path string) (int, error) { + content, err := os.ReadFile(path) + if err != nil { + return 0, fmt.Errorf("failed to read lock file: %w", err) + } + + lines := strings.Split(string(content), "\n") + for _, line := range lines { + line = strings.TrimSpace(line) + if strings.HasPrefix(line, "PID: ") { + pidStr := strings.TrimPrefix(line, "PID: ") + pid, err := strconv.Atoi(pidStr) + if err != nil { + return 0, fmt.Errorf("invalid PID in lock file: %s", pidStr) + } + return pid, nil + } + } + + return 0, fmt.Errorf("no PID found in lock 
file") +} + +// NewFileLock attempts to create a lock file at the specified path. Fails if another process has already created a +// lock file. Useful for situations where a process wants to hold a mutual exclusion lock on a resource. +// The caller is responsible for calling Release() to release the lock. +func NewFileLock(logger logging.Logger, path string, fsync bool) (*FileLock, error) { + path, err := SanitizePath(path) + if err != nil { + return nil, fmt.Errorf("sanitize path failed: %v", err) + } + + // Try to create the lock file exclusively (O_EXCL ensures it fails if file exists) + file, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644) + if err != nil { + if os.IsExist(err) { + // Lock file exists, check if it's stale + if pid, parseErr := parseLockFile(path); parseErr == nil { + if !IsProcessAlive(pid) { + // Process is dead, remove stale lock file and try again + if removeErr := os.Remove(path); removeErr != nil { + return nil, fmt.Errorf("failed to remove stale lock file %s: %w", path, removeErr) + } + + // Try to create the lock file again + file, err = os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644) + if err != nil { + return nil, fmt.Errorf("failed to create lock file after removing stale lock %s: %w", + path, err) + } + } else { + // Process is still alive, cannot acquire lock + debugInfo := "" + content, readErr := os.ReadFile(path) + if readErr == nil { + debugInfo = fmt.Sprintf(" (existing lock info: %s)", strings.TrimSpace(string(content))) + } else { + debugInfo = fmt.Sprintf(" (failed to read existing lock file: %v)", readErr) + } + return nil, fmt.Errorf("lock file already exists and process %d is still running: %s%s", + pid, path, debugInfo) + } + } else { + // Cannot parse lock file, treat as existing lock with debug info + debugInfo := "" + if content, readErr := os.ReadFile(path); readErr == nil { + debugInfo = fmt.Sprintf(" (existing lock info: %s)", strings.TrimSpace(string(content))) + } + return nil, 
fmt.Errorf("lock file already exists: %s%s", path, debugInfo) + } + } else { + return nil, fmt.Errorf("failed to create lock file %s: %w", path, err) + } + } + + // Write process ID and timestamp to the lock file for debugging + lockInfo := fmt.Sprintf("PID: %d\nTimestamp: %s\n", os.Getpid(), time.Now().Format(time.RFC3339)) + _, err = file.WriteString(lockInfo) + if err != nil { + // Close and remove the file if we can't write to it + secondaryErr := file.Close() + if secondaryErr != nil { + logger.Errorf("failed to close lock file %s after write error: %v", path, secondaryErr) + } + secondaryErr = os.Remove(path) + if secondaryErr != nil { + logger.Errorf("failed to remove lock file %s after write error: %v", path, secondaryErr) + } + return nil, fmt.Errorf("failed to write to lock file %s: %w", path, err) + } + + if fsync { + err = file.Sync() + if err != nil { + // Close and remove the file if we can't sync it + secondaryErr := file.Close() + if secondaryErr != nil { + logger.Errorf("failed to close lock file %s after sync error: %v", path, secondaryErr) + } + secondaryErr = os.Remove(path) + if secondaryErr != nil { + logger.Errorf("failed to remove lock file %s after sync error: %v", path, secondaryErr) + } + return nil, fmt.Errorf("failed to sync lock file %s: %w", path, err) + } + } + + return &FileLock{ + logger: logger, + path: path, + file: file, + }, nil +} + +// Release releases the file lock by closing and removing the lock file. +// This is a no-op if the lock is already released. 
+func (fl *FileLock) Release() {
+	if fl.file == nil {
+		return
+	}
+
+	// Close the file first
+	err := fl.file.Close()
+	fl.file = nil
+
+	if err != nil {
+		fl.logger.Errorf("failed to close lock file %s: %v", fl.path, err)
+		return
+	}
+
+	// Remove the lock file
+	err = os.Remove(fl.path)
+	if err != nil {
+		fl.logger.Errorf("failed to remove lock file %s: %v", fl.path, err)
+		return
+	}
+}
+
+// Path returns the path of the lock file.
+func (fl *FileLock) Path() string {
+	return fl.path
+}
+
+// LockDirectories acquires a lock on each of the given directories. Returns a function that releases all locks.
+func LockDirectories(
+	logger logging.Logger,
+	directories []string,
+	lockFileName string,
+	fsync bool) (func(), error) {
+
+	locks := make([]*FileLock, 0, len(directories))
+	for _, dir := range directories {
+		lockFilePath := path.Join(dir, lockFileName)
+		lock, err := NewFileLock(logger, lockFilePath, fsync)
+		if err != nil {
+			// Release all previously acquired locks before returning an error
+			for _, l := range locks {
+				l.Release()
+			}
+			return nil, fmt.Errorf("failed to acquire lock on directory %s: %v", dir, err)
+		}
+		locks = append(locks, lock)
+	}
+
+	return func() {
+		for _, lock := range locks {
+			lock.Release()
+		}
+	}, nil
+}
diff --git a/sei-db/db_engine/litt/util/file_lock_test.go b/sei-db/db_engine/litt/util/file_lock_test.go
new file mode 100644
index 0000000000..bf1b2569c2
--- /dev/null
+++ b/sei-db/db_engine/litt/util/file_lock_test.go
@@ -0,0 +1,627 @@
+//go:build littdb_wip
+
+package util
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync"
+	"testing"
+	"time"
+
+	"github.com/Layr-Labs/eigenda/common"
+	"github.com/stretchr/testify/require"
+)
+
+func TestNewFileLock(t *testing.T) {
+	tempDir := t.TempDir()
+	logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig())
+	require.NoError(t, err)
+
+	tests := []struct {
+		name        string
+		setup       func() string
+		expectError bool
+	}{
+		{
+			name: "successful lock creation",
+			
setup: func() string { + return filepath.Join(tempDir, "test.lock") + }, + expectError: false, + }, + { + name: "lock already exists with live process", + setup: func() string { + lockPath := filepath.Join(tempDir, "existing.lock") + // Create an existing lock file with current process PID (which is alive) + content := fmt.Sprintf("PID: %d\nTimestamp: 2023-01-01T00:00:00Z\n", os.Getpid()) + err := os.WriteFile(lockPath, []byte(content), 0644) + require.NoError(t, err) + return lockPath + }, + expectError: true, + }, + { + name: "stale lock file gets overridden", + setup: func() string { + lockPath := filepath.Join(tempDir, "stale.lock") + // Create a lock file with a PID that definitely doesn't exist + // Use PID 999999 which is very unlikely to exist + stalePID := 999999 + content := fmt.Sprintf("PID: %d\nTimestamp: 2023-01-01T00:00:00Z\n", stalePID) + err := os.WriteFile(lockPath, []byte(content), 0644) + require.NoError(t, err) + return lockPath + }, + expectError: false, + }, + { + name: "malformed lock file gets treated as existing", + setup: func() string { + lockPath := filepath.Join(tempDir, "malformed.lock") + // Create a lock file without proper PID format + err := os.WriteFile(lockPath, []byte("invalid content"), 0644) + require.NoError(t, err) + return lockPath + }, + expectError: true, + }, + { + name: "invalid directory", + setup: func() string { + return filepath.Join(tempDir, "nonexistent", "test.lock") + }, + expectError: true, + }, + { + name: "tilde expansion", + setup: func() string { + return "~/test.lock" + }, + expectError: false, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + lockPath := tc.setup() + + lock, err := NewFileLock(logger, lockPath, false) + + if tc.expectError { + require.Error(t, err) + require.Nil(t, lock) + } else { + require.NoError(t, err) + require.NotNil(t, lock) + + // Verify lock file was created + _, err := os.Stat(lock.Path()) + require.NoError(t, err) + + // Verify lock file contains 
process info + content, err := os.ReadFile(lock.Path()) + require.NoError(t, err) + contentStr := string(content) + require.Contains(t, contentStr, "PID:") + require.Contains(t, contentStr, "Timestamp:") + + // Clean up + lock.Release() + } + }) + } +} + +func TestFileLockRelease(t *testing.T) { + tempDir := t.TempDir() + lockPath := filepath.Join(tempDir, "test.lock") + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Create a lock + lock, err := NewFileLock(logger, lockPath, false) + require.NoError(t, err) + require.NotNil(t, lock) + + // Verify lock file exists + _, err = os.Stat(lockPath) + require.NoError(t, err) + + // Release the lock + lock.Release() + + // Verify lock file was removed + _, err = os.Stat(lockPath) + require.True(t, os.IsNotExist(err)) + + // Try to release again (should not) + lock.Release() +} + +func TestFileLockPath(t *testing.T) { + tempDir := t.TempDir() + lockPath := filepath.Join(tempDir, "test.lock") + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + lock, err := NewFileLock(logger, lockPath, false) + require.NoError(t, err) + defer lock.Release() + + // Path should be sanitized (absolute) + returnedPath := lock.Path() + require.True(t, filepath.IsAbs(returnedPath)) + require.True(t, strings.HasSuffix(returnedPath, "test.lock")) +} + +func TestFileLockConcurrency(t *testing.T) { + tempDir := t.TempDir() + lockPath := filepath.Join(tempDir, "concurrent.lock") + + const numGoroutines = 10 + const duration = 50 * time.Millisecond + + var successCount int32 + var wg sync.WaitGroup + results := make(chan bool, numGoroutines) + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Launch multiple goroutines trying to acquire the same lock + for i := 0; i < numGoroutines; i++ { + wg.Add(1) + go func(id int) { + defer wg.Done() + + lock, err := NewFileLock(logger, lockPath, false) + if 
err != nil { + results <- false + return + } + + // Hold the lock for a short time + time.Sleep(duration) + + lock.Release() + + results <- true + }(i) + } + + wg.Wait() + close(results) + + // Count successful lock acquisitions + successCount = 0 + for success := range results { + if success { + successCount++ + } + } + + // Only one goroutine should have successfully acquired the lock + require.Equal(t, int32(1), successCount, "Only one goroutine should acquire the lock") +} + +func TestDoubleRelease(t *testing.T) { + tempDir := t.TempDir() + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + lockPath := filepath.Join(tempDir, "double-release.lock") + + lock, err := NewFileLock(logger, lockPath, false) + require.NoError(t, err) + + // First release should succeed + lock.Release() + + // Second release should not panic + lock.Release() +} + +func TestFileLockDebugInfo(t *testing.T) { + tempDir := t.TempDir() + lockPath := filepath.Join(tempDir, "debug-test.lock") + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Create first lock + lock1, err := NewFileLock(logger, lockPath, false) + require.NoError(t, err) + + // Try to create second lock - should fail with debug info + lock2, err := NewFileLock(logger, lockPath, false) + require.Error(t, err) + require.Nil(t, lock2) + + // Error should contain debug information from existing lock + require.Contains(t, err.Error(), "lock file already exists") + require.Contains(t, err.Error(), "existing lock info:") + require.Contains(t, err.Error(), "PID:") + require.Contains(t, err.Error(), "Timestamp:") + + // Clean up + lock1.Release() +} + +func TestIsProcessAlive(t *testing.T) { + tests := []struct { + name string + pid int + expected bool + }{ + { + name: "current process", + pid: os.Getpid(), + expected: true, + }, + { + name: "invalid pid zero", + pid: 0, + expected: false, + }, + { + name: "invalid pid negative", + 
pid: -1, + expected: false, + }, + { + name: "nonexistent pid", + pid: 999999, // Very unlikely to exist + expected: false, + }, + { + name: "init process", + pid: 1, + expected: true, // Init process should always exist on Unix systems + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + result := IsProcessAlive(tc.pid) + require.Equal(t, tc.expected, result) + }) + } +} + +func TestParseLockFile(t *testing.T) { + tempDir := t.TempDir() + + tests := []struct { + name string + content string + expectedPID int + expectError bool + }{ + { + name: "valid lock file", + content: "PID: 12345\nTimestamp: 2023-01-01T00:00:00Z\n", + expectedPID: 12345, + expectError: false, + }, + { + name: "lock file with extra whitespace", + content: " PID: 67890 \n Timestamp: 2023-01-01T00:00:00Z \n", + expectedPID: 67890, + expectError: false, + }, + { + name: "lock file missing PID", + content: "Timestamp: 2023-01-01T00:00:00Z\n", + expectedPID: 0, + expectError: true, + }, + { + name: "lock file with invalid PID", + content: "PID: not-a-number\nTimestamp: 2023-01-01T00:00:00Z\n", + expectedPID: 0, + expectError: true, + }, + { + name: "empty lock file", + content: "", + expectedPID: 0, + expectError: true, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + lockPath := filepath.Join(tempDir, fmt.Sprintf("test-%s.lock", tc.name)) + err := os.WriteFile(lockPath, []byte(tc.content), 0644) + require.NoError(t, err) + + pid, err := parseLockFile(lockPath) + + if tc.expectError { + require.Error(t, err) + } else { + require.NoError(t, err) + require.Equal(t, tc.expectedPID, pid) + } + }) + } +} + +func TestStaleLockRecovery(t *testing.T) { + tempDir := t.TempDir() + lockPath := filepath.Join(tempDir, "stale-recovery.lock") + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Create a stale lock file with a definitely dead PID + stalePID := 999999 + staleContent := fmt.Sprintf("PID: 
%d\nTimestamp: 2023-01-01T00:00:00Z\n", stalePID) + err = os.WriteFile(lockPath, []byte(staleContent), 0644) + require.NoError(t, err) + + // Verify the lock file exists + _, err = os.Stat(lockPath) + require.NoError(t, err) + + // Try to acquire the lock - should succeed by removing stale lock + lock, err := NewFileLock(logger, lockPath, false) + require.NoError(t, err) + require.NotNil(t, lock) + + // Verify the lock file now has our PID + content, err := os.ReadFile(lockPath) + require.NoError(t, err) + require.Contains(t, string(content), fmt.Sprintf("PID: %d", os.Getpid())) + + // Clean up + lock.Release() +} + +func TestLockDirectoriesSuccessfulLocking(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + tempDir := t.TempDir() + + // Create multiple directories + dir1 := filepath.Join(tempDir, "dir1") + dir2 := filepath.Join(tempDir, "dir2") + dir3 := filepath.Join(tempDir, "dir3") + + err = os.MkdirAll(dir1, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir2, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir3, 0755) + require.NoError(t, err) + + directories := []string{dir1, dir2, dir3} + lockFileName := "test.lock" + + // Lock all directories + release, err := LockDirectories(logger, directories, lockFileName, false) + require.NoError(t, err) + require.NotNil(t, release) + + // Verify lock files were created in all directories + for _, dir := range directories { + lockPath := filepath.Join(dir, lockFileName) + _, err := os.Stat(lockPath) + require.NoError(t, err, "lock file should exist in %s", dir) + + // Verify lock file content + content, err := os.ReadFile(lockPath) + require.NoError(t, err) + contentStr := string(content) + require.Contains(t, contentStr, "PID:") + require.Contains(t, contentStr, "Timestamp:") + } + + // Release all locks + release() + + // Verify all lock files were removed + for _, dir := range directories { + lockPath := filepath.Join(dir, lockFileName) + _, 
err := os.Stat(lockPath) + require.True(t, os.IsNotExist(err), "lock file should be removed from %s", dir) + } +} + +func TestLockDirectoriesFailureWhenLockExists(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + tempDir := t.TempDir() + + // Create multiple directories + dir1 := filepath.Join(tempDir, "dir1") + dir2 := filepath.Join(tempDir, "dir2") + dir3 := filepath.Join(tempDir, "dir3") + + err = os.MkdirAll(dir1, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir2, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir3, 0755) + require.NoError(t, err) + + lockFileName := "test.lock" + + // Create an existing lock in dir2 + existingLockPath := filepath.Join(dir2, lockFileName) + content := fmt.Sprintf("PID: %d\nTimestamp: 2023-01-01T00:00:00Z\n", os.Getpid()) + err = os.WriteFile(existingLockPath, []byte(content), 0644) + require.NoError(t, err) + + directories := []string{dir1, dir2, dir3} + + // Try to lock all directories - should fail + release, err := LockDirectories(logger, directories, lockFileName, false) + require.Error(t, err) + require.Nil(t, release) + require.Contains(t, err.Error(), "failed to acquire lock on directory") + require.Contains(t, err.Error(), dir2) + + // Verify that no locks were left behind (all should be cleaned up on failure) + lockPath1 := filepath.Join(dir1, lockFileName) + _, err = os.Stat(lockPath1) + require.True(t, os.IsNotExist(err), "lock file should not exist in %s after failure", dir1) + + lockPath3 := filepath.Join(dir3, lockFileName) + _, err = os.Stat(lockPath3) + require.True(t, os.IsNotExist(err), "lock file should not exist in %s after failure", dir3) + + // Clean up the existing lock + err = os.Remove(existingLockPath) + require.NoError(t, err) +} + +func TestLockDirectoriesFailureWhenDirectoryDoesNotExist(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + tempDir := 
t.TempDir() + + // Create some directories but not all + dir1 := filepath.Join(tempDir, "dir1") + dir2 := filepath.Join(tempDir, "nonexistent") + dir3 := filepath.Join(tempDir, "dir3") + + err = os.MkdirAll(dir1, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir3, 0755) + require.NoError(t, err) + + directories := []string{dir1, dir2, dir3} + lockFileName := "test.lock" + + // Try to lock all directories - should fail on nonexistent directory + release, err := LockDirectories(logger, directories, lockFileName, false) + require.Error(t, err) + require.Nil(t, release) + require.Contains(t, err.Error(), "failed to acquire lock on directory") + require.Contains(t, err.Error(), dir2) + + // Verify that no locks were left behind + lockPath1 := filepath.Join(dir1, lockFileName) + _, err = os.Stat(lockPath1) + require.True(t, os.IsNotExist(err), "lock file should not exist in %s after failure", dir1) + + lockPath3 := filepath.Join(dir3, lockFileName) + _, err = os.Stat(lockPath3) + require.True(t, os.IsNotExist(err), "lock file should not exist in %s after failure", dir3) +} + +func TestLockDirectoriesEmptyList(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + directories := []string{} + lockFileName := "test.lock" + + // Lock empty list should succeed + release, err := LockDirectories(logger, directories, lockFileName, false) + require.NoError(t, err) + require.NotNil(t, release) + + // Release should not panic + release() +} + +func TestLockDirectoriesConcurrentAccessPrevention(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + tempDir := t.TempDir() + + // Create directories + dir1 := filepath.Join(tempDir, "dir1") + dir2 := filepath.Join(tempDir, "dir2") + + err = os.MkdirAll(dir1, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir2, 0755) + require.NoError(t, err) + + directories := []string{dir1, dir2} + lockFileName := 
"test.lock" + + // First process locks directories + release1, err := LockDirectories(logger, directories, lockFileName, false) + require.NoError(t, err) + require.NotNil(t, release1) + + // Second process tries to lock same directories - should fail + release2, err := LockDirectories(logger, directories, lockFileName, false) + require.Error(t, err) + require.Nil(t, release2) + require.Contains(t, err.Error(), "failed to acquire lock on directory") + + // Release first lock + release1() + + // Now second process should be able to lock + release2, err = LockDirectories(logger, directories, lockFileName, false) + require.NoError(t, err) + require.NotNil(t, release2) + + // Clean up + release2() +} + +func TestLockDirectoriesStaleLockRecovery(t *testing.T) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + tempDir := t.TempDir() + + // Create directories + dir1 := filepath.Join(tempDir, "dir1") + dir2 := filepath.Join(tempDir, "dir2") + + err = os.MkdirAll(dir1, 0755) + require.NoError(t, err) + err = os.MkdirAll(dir2, 0755) + require.NoError(t, err) + + lockFileName := "test.lock" + + // Create stale lock files with non-existent PIDs + stalePID := 999999 + staleContent := fmt.Sprintf("PID: %d\nTimestamp: 2023-01-01T00:00:00Z\n", stalePID) + + staleLockPath1 := filepath.Join(dir1, lockFileName) + err = os.WriteFile(staleLockPath1, []byte(staleContent), 0644) + require.NoError(t, err) + + staleLockPath2 := filepath.Join(dir2, lockFileName) + err = os.WriteFile(staleLockPath2, []byte(staleContent), 0644) + require.NoError(t, err) + + directories := []string{dir1, dir2} + + // Should succeed by removing stale locks + release, err := LockDirectories(logger, directories, lockFileName, false) + require.NoError(t, err) + require.NotNil(t, release) + + // Verify lock files now contain our PID + for _, dir := range directories { + lockPath := filepath.Join(dir, lockFileName) + content, err := os.ReadFile(lockPath) + 
require.NoError(t, err) + require.Contains(t, string(content), fmt.Sprintf("PID: %d", os.Getpid())) + } + + // Clean up + release() +} diff --git a/sei-db/db_engine/litt/util/file_utils.go b/sei-db/db_engine/litt/util/file_utils.go new file mode 100644 index 0000000000..90dd0b3c57 --- /dev/null +++ b/sei-db/db_engine/litt/util/file_utils.go @@ -0,0 +1,458 @@ +//go:build littdb_wip + +package util + +import ( + "fmt" + "io" + "os" + "path/filepath" + + "github.com/Layr-Labs/eigenda/core" +) + +// SwapFileExtension is the file extension used for temporary swap files created during atomic writes. +const SwapFileExtension = ".swap" + +// IsSymlink checks if the given path is a symlink. +func IsSymlink(path string) (bool, error) { + info, err := os.Lstat(path) + if err != nil { + if os.IsNotExist(err) { + return false, nil // Path does not exist, so it can't be a symlink + } + return false, fmt.Errorf("failed to stat path %s: %w", path, err) + } + + return info.Mode()&os.ModeSymlink != 0, nil +} + +// ErrIfSymlink checks if the given path is a symlink and returns an error if it is. +func ErrIfSymlink(path string) error { + isSymlink, err := IsSymlink(path) + if err != nil { + return fmt.Errorf("failed to check if path %s is a symlink: %w", path, err) + } + if isSymlink { + return fmt.Errorf("path %s is a symlink, but it should not be", path) + } + return nil +} + +// IsDirectory checks if the given path is a directory. Returns false if the path is not a directory or does not exist. +func IsDirectory(path string) (bool, error) { + info, err := os.Stat(path) + if err != nil { + if os.IsNotExist(err) { + // Path does not exist, so it can't be a directory + return false, nil + } + return false, fmt.Errorf("failed to stat path %s: %w", path, err) + } + return info.IsDir(), nil +} + +// SanitizePath returns a sanitized version of the given path, doing things like expanding +// "~" to the user's home directory, converting to absolute path, normalizing slashes, etc. 
+func SanitizePath(path string) (string, error) { + if len(path) > 0 && path[0] == '~' { + homeDir, err := os.UserHomeDir() + if err != nil { + return "", fmt.Errorf("failed to get user home directory: %w", err) + } + + if len(path) == 1 { + path = homeDir + } else if len(path) > 1 && path[1] == '/' { + path = homeDir + path[1:] + } + } + + path = filepath.Clean(path) + path = filepath.ToSlash(path) + path, err := filepath.Abs(path) + if err != nil { + return "", fmt.Errorf("failed to resolve absolute path: %w", err) + } + + return path, nil +} + +// DeleteOrphanedSwapFiles deletes any swap files in the given directory, i.e. files that end with ".swap". +func DeleteOrphanedSwapFiles(directory string) error { + entries, err := os.ReadDir(directory) + if err != nil { + return fmt.Errorf("failed to read directory %s: %w", directory, err) + } + + for _, entry := range entries { + if !entry.IsDir() && filepath.Ext(entry.Name()) == SwapFileExtension { + swapFilePath := filepath.Join(directory, entry.Name()) + if err := os.Remove(swapFilePath); err != nil { + return fmt.Errorf("failed to remove swap file %s: %w", swapFilePath, err) + } + } + } + + return nil +} + +// AtomicWrite writes data to a file atomically. The parent directory must exist and be writable. +// If the destination file already exists, it will be overwritten. +// +// This method creates a temporary swap file in the same directory as the destination, but with SwapFileExtension +// appended to the filename. If there is a crash during this method's execution, it may leave this swap file behind. +func AtomicWrite(destination string, data []byte, fsync bool) error { + + swapPath := destination + SwapFileExtension + + // Write the data into the swap file. 
+	swapFile, err := os.Create(swapPath)
+	if err != nil {
+		return fmt.Errorf("failed to create swap file: %w", err)
+	}
+
+	_, err = swapFile.Write(data)
+	if err != nil {
+		_ = swapFile.Close()
+		return fmt.Errorf("failed to write to swap file: %w", err)
+	}
+
+	if fsync {
+		// Ensure the data in the swap file is fully written to disk.
+		err = swapFile.Sync()
+		if err != nil {
+			_ = swapFile.Close()
+			return fmt.Errorf("failed to sync swap file: %w", err)
+		}
+	}
+
+	err = swapFile.Close()
+	if err != nil {
+		return fmt.Errorf("failed to close swap file: %w", err)
+	}
+
+	// Rename the swap file to the destination file.
+	err = AtomicRename(swapPath, destination, fsync)
+	if err != nil {
+		return fmt.Errorf("failed to rename swap file: %w", err)
+	}
+
+	return nil
+}
+
+// AtomicRename renames a file from oldPath to newPath atomically.
+func AtomicRename(oldPath string, newPath string, fsync bool) error {
+	err := os.Rename(oldPath, newPath)
+	if err != nil {
+		return fmt.Errorf("failed to rename file: %w", err)
+	}
+
+	if fsync {
+		// Ensure that the rename is committed to disk by syncing the parent directory.
+		parentDirectory := filepath.Dir(newPath)
+		dirFile, err := os.Open(parentDirectory)
+		if err != nil {
+			return fmt.Errorf("failed to open parent directory %s: %w", parentDirectory, err)
+		}
+
+		err = dirFile.Sync()
+		if err != nil {
+			_ = dirFile.Close()
+			return fmt.Errorf("failed to sync parent directory %s: %w", parentDirectory, err)
+		}
+
+		err = dirFile.Close()
+		if err != nil {
+			return fmt.Errorf("failed to close parent directory %s: %w", parentDirectory, err)
+		}
+	}
+
+	return nil
+}
+
+// ErrIfNotWritableFile verifies that a path is either a regular file with read+write permissions,
+// or that it is legal to create a new regular file with read+write permissions in the parent directory.
+//
+// A file is considered to have the correct permissions/type if:
+//   - it exists and is a standard file with read+write permissions
+//   - it does not exist but its parent directory has read+write permissions.
+//
+// The function stats the path itself; there is no need to call os.Stat (or check its error) in the
+// calling context, as this method does that for you.
+func ErrIfNotWritableFile(path string) (exists bool, size int64, err error) {
+	info, err := os.Stat(path)
+	if err != nil {
+		if os.IsNotExist(err) {
+			// The file does not exist. Check the parent.
+			parentPath := filepath.Dir(path)
+			parentInfo, err := os.Stat(parentPath)
+			if err != nil {
+				if os.IsNotExist(err) {
+					return false, -1, fmt.Errorf("parent directory %s does not exist", parentPath)
+				}
+				return false, -1, fmt.Errorf(
+					"failed to stat parent directory %s: %w", parentPath, err)
+			}
+
+			if !parentInfo.IsDir() {
+				return false, -1, fmt.Errorf("parent directory %s is not a directory", parentPath)
+			}
+
+			if parentInfo.Mode()&0700 != 0700 {
+				return false, -1, fmt.Errorf(
+					"parent directory %s has insufficient permissions", parentPath)
+			}
+
+			return false, -1, nil
+		}
+		return false, -1, fmt.Errorf("failed to stat path %s: %w", path, err)
+	}
+
+	// File exists. Check that it is a regular file and that it is readable+writable.
+	if info.IsDir() {
+		return false, -1, fmt.Errorf("file %s is a directory", path)
+	}
+	if info.Mode()&0600 != 0600 {
+		return false, -1, fmt.Errorf("file %s has insufficient permissions", path)
+	}
+
+	return true, info.Size(), nil
+}
+
+// ErrIfNotWritableDirectory checks if a directory exists and is writable, or if it doesn't exist but it would
+// be legal to create it.
+func ErrIfNotWritableDirectory(dirPath string) error {
+	info, err := os.Stat(dirPath)
+	if err != nil {
+		if os.IsNotExist(err) {
+			// Directory doesn't exist, check parent permissions
+			parentDir := filepath.Dir(dirPath)
+			return ErrIfNotWritableDirectory(parentDir)
+		}
+		return fmt.Errorf("failed to access path '%s': %w", dirPath, err)
+	}
+
+	// Path exists, verify it's a directory with write permissions
+	if !info.IsDir() {
+		return fmt.Errorf("path '%s' exists but is not a directory", dirPath)
+	}
+
+	if info.Mode()&0200 == 0 {
+		return fmt.Errorf("directory '%s' is not writable", dirPath)
+	}
+
+	return nil
+}
+
+// ErrIfExists returns an error if the given path exists, otherwise returns nil.
+func ErrIfExists(path string) error {
+	exists, err := Exists(path)
+	if err != nil {
+		return fmt.Errorf("failed to check if path %s exists: %w", path, err)
+	}
+	if exists {
+		return fmt.Errorf("path %s already exists", path)
+	}
+	return nil
+}
+
+// ErrIfNotExists returns an error if the given path does not exist, otherwise returns nil.
+func ErrIfNotExists(path string) error {
+	exists, err := Exists(path)
+	if err != nil {
+		return fmt.Errorf("failed to check if path %s exists: %w", path, err)
+	}
+	if !exists {
+		return fmt.Errorf("path %s does not exist", path)
+	}
+	return nil
+}
+
+// Exists checks if a file or directory exists at the given path. A more ergonomic wrapper around os.Stat.
+func Exists(path string) (bool, error) {
+	_, err := os.Stat(path)
+	if err == nil {
+		return true, nil
+	}
+	if os.IsNotExist(err) {
+		return false, nil
+	}
+	return false, fmt.Errorf("error checking if path %s exists: %w", path, err)
+}
+
+// SyncPath syncs a file or directory to disk.
+func SyncPath(path string) error {
+	file, err := os.Open(path)
+	if err != nil {
+		return fmt.Errorf("failed to open path for sync: %w", err)
+	}
+	defer func() {
+		_ = file.Close()
+	}()
+
+	if err := file.Sync(); err != nil {
+		return fmt.Errorf("failed to sync path: %w", err)
+	}
+
+	return nil
+}
+
+// SyncParentPath syncs the parent directory of the given path.
+func SyncParentPath(path string) error {
+	return SyncPath(filepath.Dir(path))
+}
+
+// CopyRegularFile copies a regular file from src to dst. If a file already exists at dst, it will be removed
+// before copying.
+func CopyRegularFile(src string, dst string, fsync bool) error {
+	// Ensure parent directory exists
+	if err := EnsureParentDirectoryExists(dst, fsync); err != nil {
+		return err
+	}
+
+	// Open source file
+	in, err := os.Open(src)
+	if err != nil {
+		return fmt.Errorf("failed to open source file %s: %w", src, err)
+	}
+	defer core.CloseLogOnError(in, src, nil)
+
+	// If there is already a file at the destination, remove it.
+	// This ensures we don't have issues with file permissions or existing symlinks
+	exists, err := Exists(dst)
+	if err != nil {
+		return fmt.Errorf("failed to check if destination file %s exists: %w", dst, err)
+	}
+	if exists {
+		err = os.Remove(dst)
+		if err != nil {
+			return fmt.Errorf("failed to remove existing destination file %s: %w", dst, err)
+		}
+	}
+
+	// Create destination file
+	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
+	if err != nil {
+		return fmt.Errorf("failed to create destination file %s: %w", dst, err)
+	}
+	defer core.CloseLogOnError(out, dst, nil)
+
+	// Copy content
+	if _, err = io.Copy(out, in); err != nil {
+		return fmt.Errorf("failed to copy file content from %s to %s: %w", src, dst, err)
+	}
+
+	// Sync if requested
+	if fsync {
+		if err = SyncPath(dst); err != nil {
+			return fmt.Errorf("failed to sync destination file %s: %w", dst, err)
+		}
+		if err = SyncParentPath(dst); err != nil {
+			return fmt.Errorf("failed to sync parent directory of %s: %w", dst, err)
+		}
+	}
+
+	return nil
+}
+
+// EnsureParentDirectoryExists ensures the parent directory of the given path exists, creating it
+// (and any missing ancestors) if necessary.
+func EnsureParentDirectoryExists(path string, fsync bool) error {
+	return EnsureDirectoryExists(filepath.Dir(path), fsync)
+}
+
+// EnsureDirectoryExists ensures a directory exists, creating it and any missing ancestors with 0755
+// permissions if necessary. If the path already exists, it verifies that it is a directory.
+// If fsync is true, all newly created directories are synced to disk.
+func EnsureDirectoryExists(dirPath string, fsync bool) error { + // Convert to absolute path to ensure clean processing + absPath, err := filepath.Abs(dirPath) + if err != nil { + return fmt.Errorf("failed to get absolute path for %s: %w", dirPath, err) + } + + // Find the first ancestor that exists + pathsToCreate := []string{} + currentPath := absPath + + for { + // Check if current path exists + info, err := os.Stat(currentPath) + if err == nil { + // Path exists, verify it's a directory + if !info.IsDir() { + return fmt.Errorf("path %s exists but is not a directory", currentPath) + } + break // Found existing ancestor + } + + if !os.IsNotExist(err) { + return fmt.Errorf("failed to check path %s: %w", currentPath, err) + } + + // Path doesn't exist, add to list of paths to create + pathsToCreate = append(pathsToCreate, currentPath) + + // Move to parent directory + parentPath := filepath.Dir(currentPath) + if parentPath == currentPath { + // Reached filesystem root. filepath.Dir("/") returns "/", so we stop here. + break + } + currentPath = parentPath + } + + // Create directories from top-level to bottom-level and possibly sync each one + for i := len(pathsToCreate) - 1; i >= 0; i-- { + dirToCreate := pathsToCreate[i] + + // Create the directory + if err := os.Mkdir(dirToCreate, 0755); err != nil { + return fmt.Errorf("failed to create directory %s: %w", dirToCreate, err) + } + + if fsync { + // Sync the newly created directory + if err := SyncPath(dirToCreate); err != nil { + return fmt.Errorf("failed to sync newly created directory %s: %w", dirToCreate, err) + } + + // Also sync the parent directory to ensure the directory entry is persisted + parentDir := filepath.Dir(dirToCreate) + if err := SyncPath(parentDir); err != nil { + return fmt.Errorf("failed to sync parent directory %s: %w", parentDir, err) + } + } + } + + return nil +} + +// DeepDelete deletes a regular file. 
If the file is a symlink, the symlink and the file pointed to by the symlink
+// are both deleted. This method can delete an empty directory, but will return an error if asked to delete a
+// non-empty directory. For the sake of simplicity, this method does not traverse a chain of symlinks. If the
+// symlink points to another symlink, it will only delete the original symlink and the symlink it points to.
+func DeepDelete(path string) error {
+	isSymlink, err := IsSymlink(path)
+	if err != nil {
+		return fmt.Errorf("failed to check if path %s is a symlink: %w", path, err)
+	}
+
+	if isSymlink {
+		// remove the file where the symlink points
+		actualFile, err := os.Readlink(path)
+		if err != nil {
+			return fmt.Errorf("failed to read symlink %s: %w", path, err)
+		}
+		// os.Readlink may return a relative target; resolve it against the symlink's directory.
+		if !filepath.IsAbs(actualFile) {
+			actualFile = filepath.Join(filepath.Dir(path), actualFile)
+		}
+		if err := os.Remove(actualFile); err != nil {
+			return fmt.Errorf("failed to remove actual file %s: %w", actualFile, err)
+		}
+	}
+
+	err = os.Remove(path)
+	if err != nil {
+		return fmt.Errorf("failed to remove file %s: %w", path, err)
+	}
+
+	return nil
+}
diff --git a/sei-db/db_engine/litt/util/file_utils_test.go b/sei-db/db_engine/litt/util/file_utils_test.go
new file mode 100644
index 0000000000..8afc91aad7
--- /dev/null
+++ b/sei-db/db_engine/litt/util/file_utils_test.go
@@ -0,0 +1,1443 @@
+//go:build littdb_wip
+
+package util
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+
+	"github.com/stretchr/testify/require"
+)
+
+func TestErrIfNotWritableFile(t *testing.T) {
+	// Setup
+	tempDir := t.TempDir()
+
+	// Test cases
+	tests := []struct {
+		name             string
+		setup            func() string
+		expectedExists   bool
+		expectedSize     int64
+		expectError      bool
+		expectedErrorMsg string
+	}{
+		{
+			name: "existing file with correct permissions",
+			setup: func() string {
+				path := filepath.Join(tempDir, "test-file")
+				err := os.WriteFile(path, []byte("test data"), 0600)
+				require.NoError(t, err)
+				return path
+			},
+			expectedExists: true,
+			expectedSize:   9, // "test data" is 9 bytes
+			expectError:    false,
+
}, + { + name: "non-existent file with writable parent", + setup: func() string { + return filepath.Join(tempDir, "non-existent-file") + }, + expectedExists: false, + expectedSize: -1, + expectError: false, + }, + { + name: "non-existent file with non-existent parent", + setup: func() string { + return filepath.Join(tempDir, "non-existent-dir", "non-existent-file") + }, + expectedExists: false, + expectedSize: -1, + expectError: true, + expectedErrorMsg: "parent directory", + }, + { + name: "existing file is a directory", + setup: func() string { + path := filepath.Join(tempDir, "test-dir") + err := os.Mkdir(path, 0755) + require.NoError(t, err) + return path + }, + expectedExists: false, + expectedSize: -1, + expectError: true, + expectedErrorMsg: "is a directory", + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + path := tc.setup() + exists, size, err := ErrIfNotWritableFile(path) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.expectedErrorMsg) + } else { + require.NoError(t, err) + } + + require.Equal(t, tc.expectedExists, exists) + require.Equal(t, tc.expectedSize, size) + }) + } +} + +func TestExists(t *testing.T) { + // Setup + tempDir := t.TempDir() + existingFile := filepath.Join(tempDir, "existing-file") + err := os.WriteFile(existingFile, []byte("test"), 0600) + require.NoError(t, err) + + nonExistentFile := filepath.Join(tempDir, "non-existent-file") + + // Test cases + tests := []struct { + name string + path string + expected bool + expectError bool + }{ + { + name: "existing file", + path: existingFile, + expected: true, + expectError: false, + }, + { + name: "non-existent file", + path: nonExistentFile, + expected: false, + expectError: false, + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + exists, err := Exists(tc.path) + + if tc.expectError { + require.Error(t, err) + } else { + require.NoError(t, err) + } + + 
require.Equal(t, tc.expected, exists) + }) + } +} + +func TestErrIfNotWritableDirectory(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Create a non-writable directory (0500 = read & execute, no write) + nonWritableDir := filepath.Join(tempDir, "non-writable-dir") + err := os.Mkdir(nonWritableDir, 0500) + require.NoError(t, err) + + // Create a writable directory + writableDir := filepath.Join(tempDir, "writable-dir") + err = os.Mkdir(writableDir, 0700) + require.NoError(t, err) + + // Create a regular file + regularFile := filepath.Join(tempDir, "regular-file") + err = os.WriteFile(regularFile, []byte("test"), 0600) + require.NoError(t, err) + + // Test cases + tests := []struct { + name string + path string + expectError bool + errorMsg string + }{ + { + name: "writable directory", + path: writableDir, + expectError: false, + }, + { + name: "non-writable directory", + path: nonWritableDir, + expectError: true, + errorMsg: "not writable", + }, + { + name: "regular file", + path: regularFile, + expectError: true, + errorMsg: "is not a directory", + }, + { + name: "non-existent directory with writable parent", + path: filepath.Join(writableDir, "non-existent"), + expectError: false, + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + err := ErrIfNotWritableDirectory(tc.path) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + } else { + require.NoError(t, err) + } + }) + } + + // Cleanup special permissions + err = os.Chmod(nonWritableDir, 0700) + require.NoError(t, err) +} + +func TestEnsureParentDirExists(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Create a non-writable directory (0500 = read & execute, no write) + nonWritableDir := filepath.Join(tempDir, "non-writable-dir") + err := os.Mkdir(nonWritableDir, 0500) + require.NoError(t, err) + + // Create a test file + testFile := filepath.Join(tempDir, "test-file") + err = os.WriteFile(testFile, 
[]byte("test"), 0600) + require.NoError(t, err) + + // Test cases + tests := []struct { + name string + path string + expectError bool + errorMsg string + }{ + { + name: "parent exists and is writable", + path: filepath.Join(tempDir, "new-file"), + expectError: false, + }, + { + name: "multi-level parent doesn't exist", + path: filepath.Join(tempDir, "new-dir", "subdir", "new-file"), + expectError: false, + }, + { + name: "parent exists but is a file", + path: filepath.Join(testFile, "impossible"), + expectError: true, + errorMsg: "is not a directory", + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + err := EnsureParentDirectoryExists(tc.path, false) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + } else { + require.NoError(t, err) + + // Verify the parent directory was created if needed + parentDir := filepath.Dir(tc.path) + exists, err := Exists(parentDir) + require.NoError(t, err) + require.True(t, exists) + } + }) + } + + // Cleanup special permissions + err = os.Chmod(nonWritableDir, 0700) + require.NoError(t, err) +} + +func TestCopyRegularFile(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Create a source file with specific content, permissions, and time + sourceFile := filepath.Join(tempDir, "source-file") + content := []byte("test content") + err := os.WriteFile(sourceFile, content, 0640) + require.NoError(t, err) + + // Test cases + tests := []struct { + name string + destPath string + expectError bool + }{ + { + name: "copy to a new file", + destPath: filepath.Join(tempDir, "dest-file"), + expectError: false, + }, + { + name: "overwrite existing file", + destPath: filepath.Join(tempDir, "existing-file"), + expectError: false, + }, + { + name: "copy to a new subdirectory", + destPath: filepath.Join(tempDir, "subdir", "dest-file"), + expectError: false, + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + // 
If testing overwrite, create the file first + if tc.name == "overwrite existing file" { + err := os.WriteFile(tc.destPath, []byte("original content"), 0600) + require.NoError(t, err) + } + + err := CopyRegularFile(sourceFile, tc.destPath, false) + + if tc.expectError { + require.Error(t, err) + } else { + require.NoError(t, err) + + // Check content + destContent, err := os.ReadFile(tc.destPath) + require.NoError(t, err) + require.Equal(t, content, destContent) + } + }) + } +} + +func TestEnsureDirectoryExists(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Create a regular file + regularFile := filepath.Join(tempDir, "regular-file") + err := os.WriteFile(regularFile, []byte("test"), 0600) + require.NoError(t, err) + + // Test cases + tests := []struct { + name string + dirPath string + setup func(path string) + expectError bool + errorMsg string + }{ + { + name: "directory doesn't exist", + dirPath: filepath.Join(tempDir, "new-dir"), + setup: func(path string) {}, + expectError: false, + }, + { + name: "directory already exists", + dirPath: filepath.Join(tempDir, "existing-dir"), + setup: func(path string) { + err := os.Mkdir(path, 0755) + require.NoError(t, err) + }, + expectError: false, + }, + { + name: "path exists but is a file", + dirPath: regularFile, + setup: func(path string) {}, + expectError: true, + errorMsg: "is not a directory", + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + tc.setup(tc.dirPath) + + err := EnsureDirectoryExists(tc.dirPath, false) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + } else { + require.NoError(t, err) + + // Verify the directory exists + info, err := os.Stat(tc.dirPath) + require.NoError(t, err) + require.True(t, info.IsDir()) + + // If we created a new directory, verify the mode + if tc.name == "directory doesn't exist" { + // Note: mode comparison can be tricky due to umask and OS differences + // So we just check 
that it's writable
+				require.True(t, info.Mode()&0200 != 0, "Directory should be writable")
+				}
+			}
+		})
+	}
+
+	// Clean up non-writable directory
+	nonWritableDir := filepath.Join(tempDir, "non-writable-dir")
+	if _, err := os.Stat(nonWritableDir); err == nil {
+		err = os.Chmod(nonWritableDir, 0700)
+		require.NoError(t, err)
+	}
+}
+
+func TestEnsureParentDirectoryExists(t *testing.T) {
+	testDir := t.TempDir()
+
+	directoryPath := filepath.Join(testDir, "foo", "bar", "baz")
+	filePath := filepath.Join(directoryPath, "data.txt")
+
+	err := EnsureParentDirectoryExists(filePath, false)
+	require.NoError(t, err, "failed to create directory")
+
+	exists, err := Exists(directoryPath)
+	require.NoError(t, err, "failed to check if directory exists")
+	require.True(t, exists, "directory does not exist")
+
+	// Utility should not have created the file, just the parent.
+	exists, err = Exists(filePath)
+	require.NoError(t, err, "failed to check if file exists")
+	require.False(t, exists, "file should not exist")
+
+	// Calling the same method again should not cause an error.
+ err = EnsureParentDirectoryExists(filePath, false) + require.NoError(t, err) +} + +func TestAtomicWrite(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Test cases + tests := []struct { + name string + setup func() (string, []byte) + expectError bool + errorMsg string + }{ + { + name: "write to new file", + setup: func() (string, []byte) { + path := filepath.Join(tempDir, "new-file.txt") + data := []byte("test content") + return path, data + }, + expectError: false, + }, + { + name: "overwrite existing file", + setup: func() (string, []byte) { + path := filepath.Join(tempDir, "existing-file.txt") + // Create existing file with different content + err := os.WriteFile(path, []byte("old content"), 0644) + require.NoError(t, err) + data := []byte("new content") + return path, data + }, + expectError: false, + }, + { + name: "write to subdirectory", + setup: func() (string, []byte) { + subDir := filepath.Join(tempDir, "subdir") + err := os.Mkdir(subDir, 0755) + require.NoError(t, err) + path := filepath.Join(subDir, "file.txt") + data := []byte("content in subdirectory") + return path, data + }, + expectError: false, + }, + { + name: "write with empty data", + setup: func() (string, []byte) { + path := filepath.Join(tempDir, "empty-file.txt") + data := []byte("") + return path, data + }, + expectError: false, + }, + { + name: "write to non-existent parent directory", + setup: func() (string, []byte) { + path := filepath.Join(tempDir, "non-existent-dir", "file.txt") + data := []byte("content") + return path, data + }, + expectError: true, + errorMsg: "failed to create swap file", + }, + { + name: "write with large data", + setup: func() (string, []byte) { + path := filepath.Join(tempDir, "large-file.txt") + // Create 1MB of data + data := make([]byte, 1024*1024) + for i := range data { + data[i] = byte(i % 256) + } + return path, data + }, + expectError: false, + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + 
path, data := tc.setup() + swapPath := path + SwapFileExtension + + // Ensure swap file doesn't exist before test + _, err := os.Stat(swapPath) + require.True(t, os.IsNotExist(err), "Swap file should not exist before test") + + err = AtomicWrite(path, data, true) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + + // Verify that the destination file wasn't created or modified + if tc.name == "overwrite existing file" { + // Original file should still have old content + content, err := os.ReadFile(path) + require.NoError(t, err) + require.Equal(t, "old content", string(content)) + } + } else { + require.NoError(t, err) + + // Verify the file was written correctly + content, err := os.ReadFile(path) + require.NoError(t, err) + require.Equal(t, data, content) + + // Verify the swap file was cleaned up + _, err = os.Stat(swapPath) + require.True(t, os.IsNotExist(err), "Swap file should be cleaned up after successful write") + + // Verify file permissions are reasonable (at least owner readable/writable) + info, err := os.Stat(path) + require.NoError(t, err) + require.True(t, info.Mode()&0600 != 0, "File should be readable and writable by owner") + } + }) + } +} + +func TestAtomicWriteSwapFileCleanup(t *testing.T) { + // Test that swap files are properly cleaned up even if something goes wrong + tempDir := t.TempDir() + path := filepath.Join(tempDir, "test-file.txt") + swapPath := path + SwapFileExtension + data := []byte("test content") + + // Simulate a scenario where swap file might be left behind + // by creating a swap file manually first + err := os.WriteFile(swapPath, []byte("old swap content"), 0644) + require.NoError(t, err) + + // Verify swap file exists + _, err = os.Stat(swapPath) + require.NoError(t, err) + + // Now run AtomicWrite - it should overwrite the swap file and clean up + err = AtomicWrite(path, data, true) + require.NoError(t, err) + + // Verify the target file has the correct content + content, err 
:= os.ReadFile(path) + require.NoError(t, err) + require.Equal(t, data, content) + + // Verify the swap file was cleaned up + _, err = os.Stat(swapPath) + require.True(t, os.IsNotExist(err), "Swap file should be cleaned up") +} + +func TestAtomicWritePreservesOtherFiles(t *testing.T) { + // Test that AtomicWrite doesn't interfere with other files in the same directory + tempDir := t.TempDir() + + // Create some existing files + file1 := filepath.Join(tempDir, "file1.txt") + file2 := filepath.Join(tempDir, "file2.txt") + targetFile := filepath.Join(tempDir, "target.txt") + + err := os.WriteFile(file1, []byte("content1"), 0644) + require.NoError(t, err) + err = os.WriteFile(file2, []byte("content2"), 0644) + require.NoError(t, err) + + // Perform atomic write on target file + targetData := []byte("target content") + err = AtomicWrite(targetFile, targetData, true) + require.NoError(t, err) + + // Verify all files have correct content + content1, err := os.ReadFile(file1) + require.NoError(t, err) + require.Equal(t, "content1", string(content1)) + + content2, err := os.ReadFile(file2) + require.NoError(t, err) + require.Equal(t, "content2", string(content2)) + + targetContent, err := os.ReadFile(targetFile) + require.NoError(t, err) + require.Equal(t, targetData, targetContent) +} + +func TestAtomicRename(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Test cases + tests := []struct { + name string + setup func() (string, string) + expectError bool + errorMsg string + }{ + { + name: "rename file in same directory", + setup: func() (string, string) { + oldPath := filepath.Join(tempDir, "old-name.txt") + newPath := filepath.Join(tempDir, "new-name.txt") + err := os.WriteFile(oldPath, []byte("test content"), 0644) + require.NoError(t, err) + return oldPath, newPath + }, + expectError: false, + }, + { + name: "rename file to different directory", + setup: func() (string, string) { + subDir := filepath.Join(tempDir, "subdir") + err := os.Mkdir(subDir, 0755) + 
require.NoError(t, err) + + oldPath := filepath.Join(tempDir, "file.txt") + newPath := filepath.Join(subDir, "moved-file.txt") + err = os.WriteFile(oldPath, []byte("content to move"), 0644) + require.NoError(t, err) + return oldPath, newPath + }, + expectError: false, + }, + { + name: "overwrite existing file", + setup: func() (string, string) { + oldPath := filepath.Join(tempDir, "source.txt") + newPath := filepath.Join(tempDir, "target.txt") + + // Create source file + err := os.WriteFile(oldPath, []byte("source content"), 0644) + require.NoError(t, err) + + // Create target file that will be overwritten + err = os.WriteFile(newPath, []byte("target content"), 0644) + require.NoError(t, err) + + return oldPath, newPath + }, + expectError: false, + }, + { + name: "rename non-existent file", + setup: func() (string, string) { + oldPath := filepath.Join(tempDir, "non-existent.txt") + newPath := filepath.Join(tempDir, "new.txt") + return oldPath, newPath + }, + expectError: true, + errorMsg: "failed to rename file", + }, + { + name: "rename to non-existent directory", + setup: func() (string, string) { + oldPath := filepath.Join(tempDir, "existing.txt") + newPath := filepath.Join(tempDir, "non-existent-dir", "file.txt") + err := os.WriteFile(oldPath, []byte("content"), 0644) + require.NoError(t, err) + return oldPath, newPath + }, + expectError: true, + errorMsg: "failed to rename file", + }, + { + name: "rename directory", + setup: func() (string, string) { + oldDir := filepath.Join(tempDir, "old-dir") + newDir := filepath.Join(tempDir, "new-dir") + + err := os.Mkdir(oldDir, 0755) + require.NoError(t, err) + + // Add a file inside the directory + err = os.WriteFile(filepath.Join(oldDir, "file.txt"), []byte("dir content"), 0644) + require.NoError(t, err) + + return oldDir, newDir + }, + expectError: false, + }, + { + name: "rename with same source and destination", + setup: func() (string, string) { + path := filepath.Join(tempDir, "same-file.txt") + err := 
os.WriteFile(path, []byte("content"), 0644) + require.NoError(t, err) + return path, path + }, + expectError: false, // os.Rename typically succeeds for same path + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + oldPath, newPath := tc.setup() + + // Store original content if file exists + var originalContent []byte + var originalInfo os.FileInfo + if info, err := os.Stat(oldPath); err == nil { + if !info.IsDir() { + originalContent, err = os.ReadFile(oldPath) + require.NoError(t, err) + } + originalInfo = info + } + + err := AtomicRename(oldPath, newPath, true) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + + // Verify original file still exists (rename failed) + if originalInfo != nil { + _, err := os.Stat(oldPath) + if tc.errorMsg == "failed to rename file" { + require.NoError(t, err, "Original file should still exist after failed rename") + } + } + } else { + require.NoError(t, err) + + // Verify the rename was successful + if tc.name != "rename with same source and destination" { + // Old path should not exist + _, err := os.Stat(oldPath) + require.True(t, os.IsNotExist(err), "Old path should not exist after successful rename") + } + + // New path should exist + newInfo, err := os.Stat(newPath) + require.NoError(t, err, "New path should exist after successful rename") + + // Verify content and properties if it was a file + if originalInfo != nil && !originalInfo.IsDir() { + if tc.name != "rename with same source and destination" { + // Check content preservation + newContent, err := os.ReadFile(newPath) + require.NoError(t, err) + require.Equal(t, originalContent, newContent, "File content should be preserved") + } + + // Check that it's still a file + require.False(t, newInfo.IsDir(), "Renamed file should still be a file") + } else if originalInfo != nil && originalInfo.IsDir() { + // Check that it's still a directory + require.True(t, newInfo.IsDir(), "Renamed 
directory should still be a directory") + + // Check that directory contents are preserved + if tc.name == "rename directory" { + fileContent, err := os.ReadFile(filepath.Join(newPath, "file.txt")) + require.NoError(t, err) + require.Equal(t, "dir content", string(fileContent)) + } + } + } + }) + } +} + +func TestAtomicRenamePreservesPermissions(t *testing.T) { + // Test that file permissions are preserved during atomic rename + tempDir := t.TempDir() + + oldPath := filepath.Join(tempDir, "source.txt") + newPath := filepath.Join(tempDir, "dest.txt") + + // Create file with specific permissions + err := os.WriteFile(oldPath, []byte("test content"), 0640) + require.NoError(t, err) + + // Get original permissions + originalInfo, err := os.Stat(oldPath) + require.NoError(t, err) + + // Perform atomic rename + err = AtomicRename(oldPath, newPath, true) + require.NoError(t, err) + + // Verify permissions are preserved + newInfo, err := os.Stat(newPath) + require.NoError(t, err) + require.Equal(t, originalInfo.Mode(), newInfo.Mode(), "File permissions should be preserved") +} + +func TestAtomicRenameWithSymlink(t *testing.T) { + tempDir := t.TempDir() + + // Create a target file + targetFile := filepath.Join(tempDir, "target.txt") + err := os.WriteFile(targetFile, []byte("target content"), 0644) + require.NoError(t, err) + + // Create a symlink + oldLink := filepath.Join(tempDir, "old-link") + err = os.Symlink(targetFile, oldLink) + require.NoError(t, err) + + // Rename the symlink + newLink := filepath.Join(tempDir, "new-link") + err = AtomicRename(oldLink, newLink, true) + require.NoError(t, err) + + // Verify the symlink was renamed and still points to the same target + linkTarget, err := os.Readlink(newLink) + require.NoError(t, err) + require.Equal(t, targetFile, linkTarget) + + // Verify old symlink no longer exists + _, err = os.Stat(oldLink) + require.True(t, os.IsNotExist(err)) +} + +const mixedSwapFilesTestName = "delete swap files in directory with mixed files" 
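+
+// TestAtomicWriteUsageSketch is an editor-added usage sketch, not part of the upstream test suite.
+// It demonstrates the intended call pattern documented above: persist a blob with AtomicWrite, then
+// clear any crash-orphaned "*.swap" leftovers with DeleteOrphanedSwapFiles on startup. It relies only
+// on helpers defined in this package.
+func TestAtomicWriteUsageSketch(t *testing.T) {
+	dir := t.TempDir()
+	path := filepath.Join(dir, "state.bin")
+
+	// Persist a blob atomically, fsyncing the swap file and the parent directory.
+	require.NoError(t, AtomicWrite(path, []byte("v1"), true))
+
+	// A crash mid-write may leave a "<name>.swap" file behind; this removes any such leftovers.
+	require.NoError(t, DeleteOrphanedSwapFiles(dir))
+
+	content, err := os.ReadFile(path)
+	require.NoError(t, err)
+	require.Equal(t, "v1", string(content))
+}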
+ +func TestDeleteOrphanedSwapFiles(t *testing.T) { + // Setup + tempDir := t.TempDir() + + // Test cases + tests := []struct { + name string + setup func() string + expectError bool + errorMsg string + }{ + { + name: mixedSwapFilesTestName, + setup: func() string { + testDir := filepath.Join(tempDir, "mixed-files") + err := os.Mkdir(testDir, 0755) + require.NoError(t, err) + + // Create regular files + err = os.WriteFile(filepath.Join(testDir, "regular1.txt"), []byte("content1"), 0644) + require.NoError(t, err) + err = os.WriteFile(filepath.Join(testDir, "regular2.log"), []byte("content2"), 0644) + require.NoError(t, err) + + // Create swap files + err = os.WriteFile(filepath.Join(testDir, "file1.txt"+SwapFileExtension), []byte("swap1"), 0644) + require.NoError(t, err) + err = os.WriteFile(filepath.Join(testDir, "file2.log"+SwapFileExtension), []byte("swap2"), 0644) + require.NoError(t, err) + err = os.WriteFile(filepath.Join(testDir, "orphaned"+SwapFileExtension), []byte("orphaned"), 0644) + require.NoError(t, err) + + // Create a subdirectory (should be ignored) + subDir := filepath.Join(testDir, "subdir") + err = os.Mkdir(subDir, 0755) + require.NoError(t, err) + + // Create a swap file in subdirectory (should not be deleted by this call) + err = os.WriteFile(filepath.Join(subDir, "nested"+SwapFileExtension), []byte("nested"), 0644) + require.NoError(t, err) + + return testDir + }, + expectError: false, + }, + { + name: "empty directory", + setup: func() string { + testDir := filepath.Join(tempDir, "empty-dir") + err := os.Mkdir(testDir, 0755) + require.NoError(t, err) + return testDir + }, + expectError: false, + }, + { + name: "directory with only swap files", + setup: func() string { + testDir := filepath.Join(tempDir, "only-swap") + err := os.Mkdir(testDir, 0755) + require.NoError(t, err) + + // Create only swap files + err = os.WriteFile(filepath.Join(testDir, "swap1"+SwapFileExtension), []byte("content1"), 0644) + require.NoError(t, err) + err = 
os.WriteFile(filepath.Join(testDir, "swap2"+SwapFileExtension), []byte("content2"), 0644) + require.NoError(t, err) + + return testDir + }, + expectError: false, + }, + { + name: "directory with no swap files", + setup: func() string { + testDir := filepath.Join(tempDir, "no-swap") + err := os.Mkdir(testDir, 0755) + require.NoError(t, err) + + // Create only regular files + err = os.WriteFile(filepath.Join(testDir, "file1.txt"), []byte("content1"), 0644) + require.NoError(t, err) + err = os.WriteFile(filepath.Join(testDir, "file2.log"), []byte("content2"), 0644) + require.NoError(t, err) + + return testDir + }, + expectError: false, + }, + { + name: "non-existent directory", + setup: func() string { + return filepath.Join(tempDir, "non-existent") + }, + expectError: true, + errorMsg: "failed to read directory", + }, + { + name: "path is a file not directory", + setup: func() string { + filePath := filepath.Join(tempDir, "not-a-dir.txt") + err := os.WriteFile(filePath, []byte("content"), 0644) + require.NoError(t, err) + return filePath + }, + expectError: true, + errorMsg: "failed to read directory", + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + dirPath := tc.setup() + + // Count files before deletion for verification + var beforeFiles []string + if entries, err := os.ReadDir(dirPath); err == nil { + for _, entry := range entries { + if !entry.IsDir() { + beforeFiles = append(beforeFiles, entry.Name()) + } + } + } + + err := DeleteOrphanedSwapFiles(dirPath) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + } else { + require.NoError(t, err) + + // Verify that all swap files were deleted + entries, err := os.ReadDir(dirPath) + require.NoError(t, err) + + var afterFiles []string + var afterSwapFiles []string + for _, entry := range entries { + if !entry.IsDir() { + afterFiles = append(afterFiles, entry.Name()) + if filepath.Ext(entry.Name()) == SwapFileExtension { + 
afterSwapFiles = append(afterSwapFiles, entry.Name()) + } + } + } + + // No swap files should remain + require.Empty(t, afterSwapFiles, "All swap files should be deleted") + + // Regular files should remain unchanged + var beforeRegularFiles []string + var afterRegularFiles []string + for _, file := range beforeFiles { + if filepath.Ext(file) != SwapFileExtension { + beforeRegularFiles = append(beforeRegularFiles, file) + } + } + for _, file := range afterFiles { + if filepath.Ext(file) != SwapFileExtension { + afterRegularFiles = append(afterRegularFiles, file) + } + } + require.ElementsMatch(t, beforeRegularFiles, afterRegularFiles, "Regular files should be unchanged") + + // Verify subdirectories are not affected + if tc.name == mixedSwapFilesTestName { + subDirPath := filepath.Join(dirPath, "subdir") + subEntries, err := os.ReadDir(subDirPath) + require.NoError(t, err) + require.Len(t, subEntries, 1, "Subdirectory should still contain its swap file") + require.Equal(t, "nested"+SwapFileExtension, subEntries[0].Name()) + } + } + }) + } +} + +func TestDeleteOrphanedSwapFilesPermissions(t *testing.T) { + // Test behavior with permission issues + tempDir := t.TempDir() + + // Create a directory with swap files + testDir := filepath.Join(tempDir, "perm-test") + err := os.Mkdir(testDir, 0755) + require.NoError(t, err) + + // Create a swap file + swapFile := filepath.Join(testDir, "test"+SwapFileExtension) + err = os.WriteFile(swapFile, []byte("content"), 0644) + require.NoError(t, err) + + // Make the directory read-only (no write permissions) + err = os.Chmod(testDir, 0555) // read + execute only + require.NoError(t, err) + + // Attempt to delete swap files should fail + err = DeleteOrphanedSwapFiles(testDir) + require.Error(t, err) + require.Contains(t, err.Error(), "failed to remove swap file") + + // Restore permissions for cleanup + err = os.Chmod(testDir, 0755) + require.NoError(t, err) +} + +func TestSanitizePath(t *testing.T) { + // Get the current working 
directory and home directory for test expectations + cwd, err := os.Getwd() + require.NoError(t, err) + + homeDir, err := os.UserHomeDir() + require.NoError(t, err) + + // Test cases + tests := []struct { + name string + input string + expectedResult func() string // Function to compute expected result + expectError bool + errorMsg string + }{ + { + name: "tilde expansion - home directory only", + input: "~", + expectedResult: func() string { + return homeDir + }, + expectError: false, + }, + { + name: "tilde expansion - home directory with subdirectory", + input: "~/Documents/test.txt", + expectedResult: func() string { + return filepath.Join(homeDir, "Documents/test.txt") + }, + expectError: false, + }, + { + name: "tilde expansion - home directory with nested subdirectories", + input: "~/Documents/Projects/test-project/file.txt", + expectedResult: func() string { + return filepath.Join(homeDir, "Documents/Projects/test-project/file.txt") + }, + expectError: false, + }, + { + name: "absolute path - no changes needed", + input: "/usr/local/bin/test", + expectedResult: func() string { + return "/usr/local/bin/test" + }, + expectError: false, + }, + { + name: "relative path - converted to absolute", + input: "test-file.txt", + expectedResult: func() string { + return filepath.Join(cwd, "test-file.txt") + }, + expectError: false, + }, + { + name: "relative path with subdirectory", + input: "subdir/test-file.txt", + expectedResult: func() string { + return filepath.Join(cwd, "subdir/test-file.txt") + }, + expectError: false, + }, + { + name: "path with redundant elements", + input: "/usr/local/../local/bin/./test", + expectedResult: func() string { + return "/usr/local/bin/test" + }, + expectError: false, + }, + { + name: "path with current directory reference", + input: "./test-file.txt", + expectedResult: func() string { + return filepath.Join(cwd, "test-file.txt") + }, + expectError: false, + }, + { + name: "path with parent directory reference", + input: 
"../test-file.txt", + expectedResult: func() string { + return filepath.Join(filepath.Dir(cwd), "test-file.txt") + }, + expectError: false, + }, + { + name: "empty path", + input: "", + expectedResult: func() string { + return cwd + }, + expectError: false, + }, + { + name: "path with multiple slashes", + input: "/usr//local///bin/test", + expectedResult: func() string { + return "/usr/local/bin/test" + }, + expectError: false, + }, + { + name: "tilde in middle of path - not expanded", + input: "/path/to/~user/file.txt", + expectedResult: func() string { + return "/path/to/~user/file.txt" + }, + expectError: false, + }, + { + name: "complex relative path with redundant elements", + input: "./subdir/../another/./file.txt", + expectedResult: func() string { + return filepath.Join(cwd, "another/file.txt") + }, + expectError: false, + }, + { + name: "tilde with complex path", + input: "~/Documents/../Downloads/./file.txt", + expectedResult: func() string { + return filepath.Join(homeDir, "Downloads/file.txt") + }, + expectError: false, + }, + } + + // Run tests + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + result, err := SanitizePath(tc.input) + + if tc.expectError { + require.Error(t, err) + require.Contains(t, err.Error(), tc.errorMsg) + } else { + require.NoError(t, err) + expected := tc.expectedResult() + require.Equal(t, expected, result) + + // Verify the result is an absolute path + require.True(t, filepath.IsAbs(result), "Result should be an absolute path") + + // Verify the path is clean (no redundant elements) + require.Equal(t, filepath.Clean(result), result, "Result should be clean") + } + }) + } +} + +func TestIsSymlink(t *testing.T) { + testDir := t.TempDir() + + nonExistentPath := "non-existent-file.txt" + isSymlink, err := IsSymlink(nonExistentPath) + require.NoError(t, err) + require.False(t, isSymlink, "Non-existent file should not be a symlink") + err = ErrIfSymlink(nonExistentPath) + require.NoError(t, err, "Non-existent file 
should not be a symlink") + + regularFilePath := filepath.Join(testDir, "file.txt") + err = os.WriteFile(regularFilePath, []byte("test content"), 0644) + require.NoError(t, err) + isSymlink, err = IsSymlink(regularFilePath) + require.NoError(t, err) + require.False(t, isSymlink, "Regular file should not be a symlink") + err = ErrIfSymlink(regularFilePath) + require.NoError(t, err, "Regular file should not raise an error for being a symlink") + + isSymlink, err = IsSymlink(testDir) + require.NoError(t, err) + require.False(t, isSymlink, "Directory should not be a symlink") + err = ErrIfSymlink(testDir) + require.NoError(t, err, "Directory should not raise an error for being a symlink") + + symlinkToRegularFilePath := filepath.Join(testDir, "link-to-file.txt") + err = os.Symlink(regularFilePath, symlinkToRegularFilePath) + require.NoError(t, err) + isSymlink, err = IsSymlink(symlinkToRegularFilePath) + require.NoError(t, err) + require.True(t, isSymlink, "Symlink to regular file should be detected as symlink") + err = ErrIfSymlink(symlinkToRegularFilePath) + require.Error(t, err, "Symlink to regular file should raise an error") + + symlinkToTestDirPath := filepath.Join(testDir, "link-to-dir") + err = os.Symlink(testDir, symlinkToTestDirPath) + require.NoError(t, err) + isSymlink, err = IsSymlink(symlinkToTestDirPath) + require.NoError(t, err) + require.True(t, isSymlink, "Symlink to directory should be detected as symlink") + err = ErrIfSymlink(symlinkToTestDirPath) + require.Error(t, err, "Symlink to directory should raise an error") +} + +// It's hard to know if the sync methods are actually doing what they should be doing. But at the very least, +// ensure that they don't crash. 
+func TestSync(t *testing.T) { + testDir := t.TempDir() + + err := SyncPath(testDir) + require.NoError(t, err, "SyncPath should not return an error") + + nestedDir := filepath.Join(testDir, "nested") + err = os.Mkdir(nestedDir, 0755) + require.NoError(t, err, "Creating nested directory should not return an error") + err = SyncParentPath(nestedDir) + require.NoError(t, err, "SyncParentPath should not return an error") + + regularFilePath := filepath.Join(testDir, "file.txt") + err = os.WriteFile(regularFilePath, []byte("test content"), 0644) + require.NoError(t, err, "Creating regular file should not return an error") + err = SyncPath(regularFilePath) + require.NoError(t, err, "SyncPath should not return an error") +} + +func TestErrIfExists(t *testing.T) { + testDir := t.TempDir() + err := os.MkdirAll(testDir, 0755) + require.NoError(t, err, "Failed to create test directory") + + err = ErrIfExists(testDir) + require.Error(t, err) + err = ErrIfNotExists(testDir) + require.NoError(t, err, "Expected no error for existing directory") + + fooPath := filepath.Join(testDir, "foo") + barPath := filepath.Join(testDir, "bar.txt") + + err = ErrIfExists(fooPath) + require.NoError(t, err) + err = ErrIfNotExists(fooPath) + require.Error(t, err, "Expected error for non-existing directory") + + err = ErrIfExists(barPath) + require.NoError(t, err) + err = ErrIfNotExists(barPath) + require.Error(t, err, "Expected error for non-existing file") + + err = os.MkdirAll(fooPath, 0755) + require.NoError(t, err) + + err = os.WriteFile(barPath, []byte("test content"), 0644) + require.NoError(t, err) + + err = ErrIfExists(fooPath) + require.Error(t, err, "Expected error for existing directory") + err = ErrIfNotExists(fooPath) + require.NoError(t, err, "Expected no error for existing directory") + + err = ErrIfExists(barPath) + require.Error(t, err, "Expected error for existing file") + err = ErrIfNotExists(barPath) + require.NoError(t, err, "Expected no error for existing file") +} + +func 
TestDeepDelete(t *testing.T) { + directory := t.TempDir() + + // Attempt to delete a non-existent path + err := DeepDelete(filepath.Join(directory, "non-existent")) + require.Error(t, err) + + // Delete an empty directory + emptyDir := filepath.Join(directory, "empty-dir") + err = os.Mkdir(emptyDir, 0755) + require.NoError(t, err, "Failed to create empty directory") + exists, err := Exists(emptyDir) + require.NoError(t, err, "Failed to check if empty directory exists") + require.True(t, exists, "Empty directory should exist") + err = DeepDelete(emptyDir) + require.NoError(t, err, "Failed to delete empty directory") + exists, err = Exists(emptyDir) + require.NoError(t, err, "Failed to check if empty directory exists after deletion") + require.False(t, exists, "Empty directory should not exist after deletion") + + // Delete a regular file + filePath := filepath.Join(directory, "file.txt") + err = os.WriteFile(filePath, []byte("test content"), 0644) + require.NoError(t, err, "Failed to create regular file") + exists, err = Exists(filePath) + require.NoError(t, err, "Failed to check if regular file exists") + require.True(t, exists, "Regular file should exist before deletion") + err = DeepDelete(filePath) + require.NoError(t, err, "Failed to delete regular file") + exists, err = Exists(filePath) + require.NoError(t, err, "Failed to check if regular file exists after deletion") + require.False(t, exists, "Regular file should not exist after deletion") + + // Attempt to delete a non-empty directory + nonEmptyDir := filepath.Join(directory, "non-empty-dir") + err = os.Mkdir(nonEmptyDir, 0755) + require.NoError(t, err, "Failed to create non-empty directory") + subFilePath := filepath.Join(nonEmptyDir, "subfile.txt") + err = os.WriteFile(subFilePath, []byte("subfile content"), 0644) + require.NoError(t, err, "Failed to create subfile in non-empty directory") + exists, err = Exists(nonEmptyDir) + require.NoError(t, err, "Failed to check if non-empty directory exists") + 
require.True(t, exists, "Non-empty directory should exist before deletion") + err = DeepDelete(nonEmptyDir) + require.Error(t, err, "Expected error for non-empty directory") + exists, err = Exists(nonEmptyDir) + require.NoError(t, err, "Failed to check if non-empty directory exists after deletion attempt") + require.True(t, exists, "Non-empty directory should still exist after deletion attempt") + + // Delete a symlink that points to a file + targetFile := filepath.Join(directory, "target.txt") + symlinkPath := filepath.Join(directory, "symlink-to-file") + err = os.WriteFile(targetFile, []byte("target content"), 0644) + require.NoError(t, err, "Failed to create target file for symlink") + err = os.Symlink(targetFile, symlinkPath) + require.NoError(t, err, "Failed to create symlink to file") + exists, err = Exists(symlinkPath) + require.NoError(t, err, "Failed to check if symlink to file exists") + require.True(t, exists, "Symlink to file should exist before deletion") + err = DeepDelete(symlinkPath) + require.NoError(t, err, "Failed to delete symlink to file") + exists, err = Exists(symlinkPath) + require.NoError(t, err, "Failed to check if symlink to file exists after deletion") + require.False(t, exists, "Symlink to file should not exist after deletion") + exists, err = Exists(targetFile) + require.NoError(t, err, "Failed to check if original file exists after deleting symlink") + require.False(t, exists, "Original file should not exist after deleting symlink") + + // Delete a symlink that points to a directory + dirToLink := filepath.Join(directory, "dir-to-link") + err = os.Mkdir(dirToLink, 0755) + require.NoError(t, err, "Failed to create directory for symlink") + symlinkDirPath := filepath.Join(directory, "symlink-to-dir") + err = os.Symlink(dirToLink, symlinkDirPath) + require.NoError(t, err, "Failed to create symlink to directory") + exists, err = Exists(symlinkDirPath) + require.NoError(t, err, "Failed to check if symlink to directory exists") + 
require.True(t, exists, "Symlink to directory should exist before deletion") + err = DeepDelete(symlinkDirPath) + require.NoError(t, err, "Failed to delete symlink to directory") + exists, err = Exists(symlinkDirPath) + require.NoError(t, err, "Failed to check if symlink to directory exists after deletion") + require.False(t, exists, "Symlink to directory should not exist after deletion") + exists, err = Exists(dirToLink) + require.NoError(t, err, "Failed to check if original directory exists after deleting symlink") + require.False(t, exists, "Original directory should not exist after deleting symlink") + + // Delete a symlink that points to a non-empty directory + nonEmptyDirForSymlink := filepath.Join(directory, "non-empty-dir-for-symlink") + err = os.Mkdir(nonEmptyDirForSymlink, 0755) + require.NoError(t, err, "Failed to create non-empty directory for symlink") + subFileForSymlink := filepath.Join(nonEmptyDirForSymlink, "subfile-for-symlink.txt") + err = os.WriteFile(subFileForSymlink, []byte("subfile content for symlink"), 0644) + require.NoError(t, err, "Failed to create subfile in non-empty directory for symlink") + symlinkNonEmptyDirPath := filepath.Join(directory, "symlink-to-non-empty-dir") + err = os.Symlink(nonEmptyDirForSymlink, symlinkNonEmptyDirPath) + require.NoError(t, err, "Failed to create symlink to non-empty directory") + exists, err = Exists(symlinkNonEmptyDirPath) + require.NoError(t, err, "Failed to check if symlink to non-empty directory exists") + require.True(t, exists, "Symlink to non-empty directory should exist before deletion") + err = DeepDelete(symlinkNonEmptyDirPath) + require.Error(t, err, "Expected error due to non-empty directory") + exists, err = Exists(symlinkNonEmptyDirPath) + require.NoError(t, err, "Failed to check if symlink to non-empty directory exists after deletion") + require.True(t, exists, "Symlink to non-empty directory should exist after failed deletion") + exists, err = Exists(nonEmptyDirForSymlink) + 
require.NoError(t, err, "Failed to check if original non-empty directory exists after deleting symlink") + require.True(t, exists, "Original non-empty directory should still exist after failed deletion") +} + +func TestIsDirectory(t *testing.T) { + testDir := t.TempDir() + + // non-existent path + nonExistentPath := filepath.Join(testDir, "non-existent-dir") + isDir, err := IsDirectory(nonExistentPath) + require.NoError(t, err, "IsDirectory should not return an error for non-existent path") + require.False(t, isDir, "Non-existent path should not be a directory") + + // path is a file + filePath := filepath.Join(testDir, "file.txt") + err = os.WriteFile(filePath, []byte("test content"), 0644) + require.NoError(t, err, "Failed to create test file") + isDir, err = IsDirectory(filePath) + require.NoError(t, err, "IsDirectory should not return an error for file path") + require.False(t, isDir, "File path should not be a directory") + + // path is a directory + dirPath := filepath.Join(testDir, "test-dir") + err = os.Mkdir(dirPath, 0755) + require.NoError(t, err, "Failed to create test directory") + isDir, err = IsDirectory(dirPath) + require.NoError(t, err, "IsDirectory should not return an error for directory path") + require.True(t, isDir, "Directory path should be recognized as a directory") +} diff --git a/sei-db/db_engine/litt/util/hashing.go b/sei-db/db_engine/litt/util/hashing.go new file mode 100644 index 0000000000..dfd210d59e --- /dev/null +++ b/sei-db/db_engine/litt/util/hashing.go @@ -0,0 +1,74 @@ +//go:build littdb_wip + +package util + +import ( + "encoding/binary" + + "github.com/dchest/siphash" +) + +// Perm64 computes a permutation (invertible function) on 64 bits. +// The constants were found by automated search, to +// optimize avalanche. Avalanche means that for a +// random number x, flipping bit i of x has about a +// 50 percent chance of flipping bit j of perm64(x).
+// For each possible pair (i,j), this function achieves +// a probability between 49.8 and 50.2 percent. +// +// Warning: this is not a cryptographic hash function. This hash function may be suitable for hash tables, but not for +// cryptographic purposes. It is trivially easy to reverse this function. +// +// Algorithm borrowed from https://github.com/hiero-ledger/hiero-consensus-node/blob/main/platform-sdk/swirlds-common/src/main/java/com/swirlds/common/utility/NonCryptographicHashing.java +// (original implementation is under Apache 2.0 license, algorithm designed by Leemon Baird) +func Perm64(x uint64) uint64 { + // This is necessary so that 0 does not hash to 0. + // As a side effect this constant will hash to 0. + x ^= 0x5e8a016a5eb99c18 + + x += x << 30 + x ^= x >> 27 + x += x << 16 + x ^= x >> 20 + x += x << 5 + x ^= x >> 18 + x += x << 10 + x ^= x >> 24 + x += x << 30 + return x +} + +// Perm64Bytes hashes a byte slice using perm64. +func Perm64Bytes(b []byte) uint64 { + x := uint64(0) + + for i := 0; i < len(b); i += 8 { + var next uint64 + if i+8 <= len(b) { + // grab the next 8 bytes + next = binary.BigEndian.Uint64(b[i:]) + } else { + // insufficient bytes, pad with zeros + nextBytes := make([]byte, 8) + copy(nextBytes, b[i:]) + next = binary.BigEndian.Uint64(nextBytes) + } + x = Perm64(next ^ x) + } + + return x +} + +// LegacyHashKey hashes a key using the original LittDB hash function. Once all data stored using the original +// hash function is deleted, this function can be removed. +func LegacyHashKey(key []byte, salt uint32) uint32 { + return uint32(Perm64(Perm64Bytes(key) ^ uint64(salt))) +} + +// HashKey hashes a key using SipHash keyed with a 16-byte salt.
+func HashKey(key []byte, salt [16]byte) uint32 { + leftSalt := binary.BigEndian.Uint64(salt[:8]) + rightSalt := binary.BigEndian.Uint64(salt[8:]) + hash := siphash.Hash(leftSalt, rightSalt, key) + return uint32(hash) +} diff --git a/sei-db/db_engine/litt/util/recursive_move.go b/sei-db/db_engine/litt/util/recursive_move.go new file mode 100644 index 0000000000..31cdff9aa1 --- /dev/null +++ b/sei-db/db_engine/litt/util/recursive_move.go @@ -0,0 +1,187 @@ +//go:build littdb_wip + +package util + +import ( + "fmt" + "io/fs" + "os" + "path/filepath" +) + +// RecursiveMove transfers files/directory trees from the source to the destination. +// +// If preserveOriginal is false, then the files at the source will be deleted when this method returns. +// If preserveOriginal is true, then this function will leave behind a copy of the original files at the source. +// +// This function does not support symlinks. It will return an error if it encounters any symlinks in the source path. +func RecursiveMove( + source string, + destination string, + preserveOriginal bool, + fsync bool, +) error { + // Sanitize paths + source, err := SanitizePath(source) + if err != nil { + return fmt.Errorf("failed to sanitize source path: %w", err) + } + + destination, err = SanitizePath(destination) + if err != nil { + return fmt.Errorf("failed to sanitize destination path: %w", err) + } + + // Verify source exists + sourceInfo, err := os.Stat(source) + if err != nil { + return fmt.Errorf("source path %s does not exist: %w", source, err) + } + + // Verify destination parent directory is writable + if err := ErrIfNotWritableDirectory(filepath.Dir(destination)); err != nil { + return fmt.Errorf("destination parent directory not writable: %w", err) + } + + // If source is a file, handle it directly + if !sourceInfo.IsDir() { + return moveFile(source, destination, preserveOriginal, fsync) + } + + // Source is a directory, handle recursively + return recursiveMoveDirectory(source, destination, 
preserveOriginal, fsync) +} + +// moveFile handles moving a single file +func moveFile(source string, destination string, preserveOriginal bool, fsync bool) error { + // Ensure parent directory exists + if err := EnsureParentDirectoryExists(destination, fsync); err != nil { + return fmt.Errorf("failed to ensure parent directory exists: %w", err) + } + + // If not preserving original, try to move the file first (regardless of deep mode) + if !preserveOriginal { + // Try simple rename first (works if on same filesystem) + if err := os.Rename(source, destination); err == nil { + if fsync { + if err := SyncPath(filepath.Dir(destination)); err != nil { + return fmt.Errorf("failed to sync destination parent directory: %w", err) + } + if err := SyncPath(filepath.Dir(source)); err != nil { + return fmt.Errorf("failed to sync source parent directory: %w", err) + } + } + + return nil + } + // Rename failed (likely different filesystem), fall back to copy+delete + } + + err := ErrIfSymlink(source) + if err != nil { + return fmt.Errorf("symlinks not supported: %w", err) + } + + // Copy the file + if err := CopyRegularFile(source, destination, fsync); err != nil { + return fmt.Errorf("failed to copy file: %w", err) + } + + // Sync if requested + if fsync { + if err := SyncPath(destination); err != nil { + return fmt.Errorf("failed to sync destination file: %w", err) + } + // sync parent directory + if err := SyncPath(filepath.Dir(destination)); err != nil { + return fmt.Errorf("failed to sync parent directory: %w", err) + } + } + + // Remove source if not preserving original + if !preserveOriginal { + if err := os.Remove(source); err != nil { + return fmt.Errorf("failed to remove source file: %w", err) + } + } + + return nil +} + +// recursiveMoveDirectory handles moving a directory and its contents +func recursiveMoveDirectory( + source string, + destination string, + preserveOriginal bool, + fsync bool, +) error { + + // Create destination directory if it doesn't exist + if 
err := EnsureDirectoryExists(destination, fsync); err != nil { + return fmt.Errorf("failed to create destination directory: %w", err) + } + + // Walk through source directory + err := filepath.WalkDir(source, func(path string, d fs.DirEntry, err error) error { + if err != nil { + return fmt.Errorf("failed to walk path %s: %w", path, err) + } + + // Skip the root directory itself + if path == source { + return nil + } + + // Calculate relative path and destination path + relPath, err := filepath.Rel(source, path) + if err != nil { + return fmt.Errorf("failed to get relative path: %w", err) + } + + destPath := filepath.Join(destination, relPath) + + err = ErrIfSymlink(path) + if err != nil { + return fmt.Errorf("symlinks not supported: %w", err) + } + + if d.IsDir() { + // Create directory at destination + if err := EnsureDirectoryExists(destPath, fsync); err != nil { + return fmt.Errorf("failed to create directory %s: %w", destPath, err) + } + } else { + // Move the file + if err := moveFile(path, destPath, preserveOriginal, fsync); err != nil { + return fmt.Errorf("failed to move file: %w", err) + } + } + + return nil + }) + + if err != nil { + return err + } + + // Sync destination directory if requested + if fsync { + if err := SyncPath(destination); err != nil { + return fmt.Errorf("failed to sync destination directory: %w", err) + } + } + + // Remove source directory if not preserving original + if !preserveOriginal { + if err := os.RemoveAll(source); err != nil { + return fmt.Errorf("failed to remove source directory: %w", err) + } + if fsync { + if err := SyncPath(filepath.Dir(source)); err != nil { + return fmt.Errorf("failed to sync parent directory of source: %w", err) + } + } + } + + return nil +} diff --git a/sei-db/db_engine/litt/util/recursive_move_test.go b/sei-db/db_engine/litt/util/recursive_move_test.go new file mode 100644 index 0000000000..25f795c189 --- /dev/null +++ b/sei-db/db_engine/litt/util/recursive_move_test.go @@ -0,0 +1,187 @@
+//go:build littdb_wip + +package util + +import ( + "os" + "path" + "strings" + "testing" + + "github.com/stretchr/testify/require" +) + +func TestRecursiveMoveDoNotPreserve(t *testing.T) { + // Create a small file tree + root1 := t.TempDir() + foo := path.Join(root1, "foo") + bar := path.Join(root1, "bar") + baz := path.Join(root1, "baz") + alpha := path.Join(foo, "alpha") + beta := path.Join(foo, "beta") + gamma := path.Join(foo, "gamma") + + fileA := path.Join(alpha, "fileA.txt") + fileB := path.Join(beta, "fileB.txt") + fileC := path.Join(foo, "fileC.txt") + fileD := path.Join(bar, "fileD.txt") + + err := EnsureDirectoryExists(foo, false) + require.NoError(t, err) + err = EnsureDirectoryExists(bar, false) + require.NoError(t, err) + err = EnsureDirectoryExists(baz, false) + require.NoError(t, err) + err = EnsureDirectoryExists(alpha, false) + require.NoError(t, err) + err = EnsureDirectoryExists(beta, false) + require.NoError(t, err) + err = EnsureDirectoryExists(gamma, false) + require.NoError(t, err) + + dataA := []byte("This is file A") + err = os.WriteFile(fileA, dataA, 0644) + require.NoError(t, err) + + dataB := []byte("This is file B") + err = os.WriteFile(fileB, dataB, 0644) + require.NoError(t, err) + + dataC := []byte("This is file C") + err = os.WriteFile(fileC, dataC, 0644) + require.NoError(t, err) + + dataD := []byte("This is file D") + err = os.WriteFile(fileD, dataD, 0644) + require.NoError(t, err) + + // move the data + root2 := t.TempDir() + err = RecursiveMove(root1, root2, false, false) + require.NoError(t, err) + + // verify that the file tree exists in the new location + require.NoError(t, ErrIfNotExists(strings.Replace(foo, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(bar, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(baz, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(alpha, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(beta, root1, 
root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(gamma, root1, root2, 1))) + + dataInFileA, err := os.ReadFile(strings.Replace(fileA, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataA, dataInFileA) + + dataInFileB, err := os.ReadFile(strings.Replace(fileB, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataB, dataInFileB) + + dataInFileC, err := os.ReadFile(strings.Replace(fileC, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataC, dataInFileC) + + dataInFileD, err := os.ReadFile(strings.Replace(fileD, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataD, dataInFileD) + + // Original directory should be gone + require.NoError(t, ErrIfExists(root1)) +} + +func TestRecursiveMovePreserve(t *testing.T) { + // Create a small file tree + root1 := t.TempDir() + foo := path.Join(root1, "foo") + bar := path.Join(root1, "bar") + baz := path.Join(root1, "baz") + alpha := path.Join(foo, "alpha") + beta := path.Join(foo, "beta") + gamma := path.Join(foo, "gamma") + + fileA := path.Join(alpha, "fileA.txt") + fileB := path.Join(beta, "fileB.txt") + fileC := path.Join(foo, "fileC.txt") + fileD := path.Join(bar, "fileD.txt") + + err := EnsureDirectoryExists(foo, false) + require.NoError(t, err) + err = EnsureDirectoryExists(bar, false) + require.NoError(t, err) + err = EnsureDirectoryExists(baz, false) + require.NoError(t, err) + err = EnsureDirectoryExists(alpha, false) + require.NoError(t, err) + err = EnsureDirectoryExists(beta, false) + require.NoError(t, err) + err = EnsureDirectoryExists(gamma, false) + require.NoError(t, err) + + dataA := []byte("This is file A") + err = os.WriteFile(fileA, dataA, 0644) + require.NoError(t, err) + + dataB := []byte("This is file B") + err = os.WriteFile(fileB, dataB, 0644) + require.NoError(t, err) + + dataC := []byte("This is file C") + err = os.WriteFile(fileC, dataC, 0644) + require.NoError(t, err) + + dataD := []byte("This is file D") + err = 
os.WriteFile(fileD, dataD, 0644) + require.NoError(t, err) + + // move the data + root2 := t.TempDir() + err = RecursiveMove(root1, root2, true, false) + require.NoError(t, err) + + // verify that the file tree exists in the new location + require.NoError(t, ErrIfNotExists(strings.Replace(foo, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(bar, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(baz, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(alpha, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(beta, root1, root2, 1))) + require.NoError(t, ErrIfNotExists(strings.Replace(gamma, root1, root2, 1))) + + dataInFileA, err := os.ReadFile(strings.Replace(fileA, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataA, dataInFileA) + + dataInFileB, err := os.ReadFile(strings.Replace(fileB, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataB, dataInFileB) + + dataInFileC, err := os.ReadFile(strings.Replace(fileC, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataC, dataInFileC) + + dataInFileD, err := os.ReadFile(strings.Replace(fileD, root1, root2, 1)) + require.NoError(t, err) + require.Equal(t, dataD, dataInFileD) + + // Original directory should still be present and intact + require.NoError(t, ErrIfNotExists(foo)) + require.NoError(t, ErrIfNotExists(bar)) + require.NoError(t, ErrIfNotExists(baz)) + require.NoError(t, ErrIfNotExists(alpha)) + require.NoError(t, ErrIfNotExists(beta)) + require.NoError(t, ErrIfNotExists(gamma)) + + dataInFileA, err = os.ReadFile(fileA) + require.NoError(t, err) + require.Equal(t, dataA, dataInFileA) + + dataInFileB, err = os.ReadFile(fileB) + require.NoError(t, err) + require.Equal(t, dataB, dataInFileB) + + dataInFileC, err = os.ReadFile(fileC) + require.NoError(t, err) + require.Equal(t, dataC, dataInFileC) + + dataInFileD, err = os.ReadFile(fileD) + require.NoError(t, err) + require.Equal(t, dataD, 
dataInFileD) +} diff --git a/sei-db/db_engine/litt/util/ssh.go b/sei-db/db_engine/litt/util/ssh.go new file mode 100644 index 0000000000..b7fc16aa65 --- /dev/null +++ b/sei-db/db_engine/litt/util/ssh.go @@ -0,0 +1,234 @@ +//go:build littdb_wip + +package util + +import ( + "bytes" + "fmt" + "os" + "os/exec" + "strings" + + "github.com/Layr-Labs/eigensdk-go/logging" + "golang.org/x/crypto/ssh" + "golang.org/x/crypto/ssh/knownhosts" +) + +// SSHSession encapsulates an SSH session with a remote host. +type SSHSession struct { + logger logging.Logger + client *ssh.Client + user string + host string + port uint64 + keyPath string + knownHostsPath string + verbose bool +} + +// Create a new SSH session to a remote host. +// +// If the knownHosts parameter is provided, it will be used to verify the host's key. If it is absent or empty, +// the host key verification will be skipped. +func NewSSHSession( + logger logging.Logger, + user string, + host string, + port uint64, + keyPath string, + knownHosts string, + verbose bool, +) (*SSHSession, error) { + + var err error + + hostKeyCallback := ssh.InsecureIgnoreHostKey() + if knownHosts != "" { + knownHosts, err = SanitizePath(knownHosts) + if err != nil { + return nil, fmt.Errorf("failed to normalize known hosts path: %w", err) + } + hostKeyCallback, err = knownhosts.New(knownHosts) + if err != nil { + return nil, fmt.Errorf("failed to parse known hosts path: %w", err) + } + } + + config := &ssh.ClientConfig{ + User: user, + HostKeyCallback: hostKeyCallback, + } + + if err := ErrIfNotExists(keyPath); err != nil { + return nil, fmt.Errorf("private key does not exist at path: %s", keyPath) + } + + keyData, err := os.ReadFile(keyPath) + if err != nil { + return nil, fmt.Errorf("failed to read private key: %w", err) + } + + key, err := ssh.ParsePrivateKey(keyData) + if err != nil { + return nil, fmt.Errorf("failed to parse private key: %w", err) + } + config.Auth = []ssh.AuthMethod{ + ssh.PublicKeys(key), + } + + client, err := 
ssh.Dial("tcp", fmt.Sprintf("%s:%d", host, port), config) + if err != nil { + return nil, fmt.Errorf("failed to connect to %s port %d: %w", host, port, err) + } + + return &SSHSession{ + logger: logger, + client: client, + user: user, + host: host, + port: port, + keyPath: keyPath, + knownHostsPath: knownHosts, + verbose: verbose, + }, nil +} + +// Close the SSH session. +func (s *SSHSession) Close() error { + err := s.client.Close() + if err != nil { + return fmt.Errorf("failed to close SSH client: %w", err) + } + + return nil +} + +// FindFiles searches the file tree rooted at the specified path and returns every file whose name ends +// with one of the given extensions. +func (s *SSHSession) FindFiles(root string, extensions []string) ([]string, error) { + command := fmt.Sprintf("find \"%s\" -type f", root) + stdout, stderr, err := s.Exec(command) + + if err != nil { + if !strings.Contains(stderr, "No such file or directory") { + return nil, fmt.Errorf("failed to execute command '%s': %w, stderr: %s", + command, err, stderr) + } + // There are no files since the directory does not exist. + return []string{}, nil + } + + files := strings.Split(stdout, "\n") + + filteredFiles := make([]string, 0, len(files)) + for _, file := range files { + if file == "" { + continue // Skip empty lines + } + for _, ext := range extensions { + if strings.HasSuffix(file, ext) { + filteredFiles = append(filteredFiles, file) + break // Stop checking other extensions once a match is found + } + } + } + + return filteredFiles, nil +} + +// Mkdirs creates the specified directory on the remote machine, including any necessary parent directories. 
+func (s *SSHSession) Mkdirs(path string) error { + _, stderr, err := s.Exec(fmt.Sprintf("mkdir -p '%s'", path)) + if err != nil { + if strings.Contains(stderr, "File exists") { + // Directory already exists, no error needed + return nil + } + return fmt.Errorf("failed to create directory '%s': %w, stderr: %s", path, err, stderr) + } + + return nil +} + +// Rsync transfers files from the local machine to the remote machine using rsync. The throttle is ignored +// if less than or equal to 0. +func (s *SSHSession) Rsync(sourceFile string, destFile string, throttleMB float64) error { + + knownHostsFlag := "" + if s.knownHostsPath == "" { + knownHostsFlag = "-o StrictHostKeyChecking=no" + } else { + knownHostsFlag = fmt.Sprintf("-o UserKnownHostsFile=%s", s.knownHostsPath) + } + + sshCmd := fmt.Sprintf("ssh -i %s -p %d %s", s.keyPath, s.port, knownHostsFlag) + target := fmt.Sprintf("%s@%s:%s", s.user, s.host, destFile) + + // If the source file is a symlink, we actually want to send the thing the symlink points to. + fileInfo, err := os.Lstat(sourceFile) + if err != nil { + return fmt.Errorf("failed to get file info for %s: %w", sourceFile, err) + } + isSymlink := fileInfo.Mode()&os.ModeSymlink != 0 + + if isSymlink { + // Resolve the symlink to get the actual file it points to. On failure, os.Readlink has already + // overwritten sourceFile with "", so report only the wrapped error (which carries the path). + sourceFile, err = os.Readlink(sourceFile) + if err != nil { + return fmt.Errorf("failed to resolve symlink: %w", err) + } + } + + arguments := []string{ + "rsync", + "-z", + } + + if throttleMB > 0 { + // rsync interprets --bwlimit in KB/s, so we convert MB to KB + throttleKB := int(throttleMB * 1024) + arguments = append(arguments, fmt.Sprintf("--bwlimit=%d", throttleKB)) + } + + arguments = append(arguments, "-e", sshCmd, sourceFile, target) + + if s.verbose { + s.logger.Infof("Executing: %s", strings.Join(arguments, " ")) + } + + cmd := exec.Command(arguments[0], arguments[1:]...) 
+ cmd.Stderr = os.Stderr + + err = cmd.Run() + if err != nil { + return fmt.Errorf("failed to rsync data: %w", err) + } + + return nil +} + +// Exec executes a command on the remote machine and returns the output. Returns the result of stdout and stderr. +func (s *SSHSession) Exec(command string) (stdout string, stderr string, err error) { + session, err := s.client.NewSession() + if err != nil { + return "", "", fmt.Errorf("failed to create SSH session: %w", err) + } + defer func() { + _ = session.Close() + }() + + var stdoutBuf bytes.Buffer + var stderrBuf bytes.Buffer + session.Stdout = &stdoutBuf + session.Stderr = &stderrBuf + + if s.verbose { + s.logger.Infof("Executing remotely: %s", command) + } + + if err = session.Run(command); err != nil { + return stdoutBuf.String(), stderrBuf.String(), + fmt.Errorf("failed to execute command '%s': %w", command, err) + } + + return stdoutBuf.String(), stderrBuf.String(), nil +} diff --git a/sei-db/db_engine/litt/util/ssh_self_destruct_test.go b/sei-db/db_engine/litt/util/ssh_self_destruct_test.go new file mode 100644 index 0000000000..40bc37b5c9 --- /dev/null +++ b/sei-db/db_engine/litt/util/ssh_self_destruct_test.go @@ -0,0 +1,91 @@ +//go:build littdb_wip + +package util + +import ( + "context" + "os" + "testing" + "time" + + "github.com/docker/docker/api/types/container" + "github.com/docker/docker/client" + "github.com/stretchr/testify/require" +) + +func TestSSHContainerSelfDestruct(t *testing.T) { + t.Skip("This test takes 5+ minutes to run - only enable for manual testing") + + ctx := context.Background() + + // Create Docker client + cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation()) + require.NoError(t, err) + + // Generate SSH key pair + tempDir := t.TempDir() + privateKeyPath := tempDir + "/test_ssh_key" + publicKeyPath := tempDir + "/test_ssh_key.pub" + + err = GenerateSSHKeyPair(privateKeyPath, publicKeyPath) + require.NoError(t, err) + + publicKeyContent, err := 
os.ReadFile(publicKeyPath) + require.NoError(t, err) + + // Create mount directory for file operations + mountDir := tempDir + "/ssh_mount" + err = os.MkdirAll(mountDir, 0755) + require.NoError(t, err) + + // Build Docker image + imageName := "ssh-test-selfdestruct:latest" + // Get current user's UID/GID for the container + uid, err := getCurrentUserUID() + require.NoError(t, err) + gid, err := getCurrentUserGID() + require.NoError(t, err) + err = BuildSSHTestImage(ctx, cli, tempDir, imageName, string(publicKeyContent), uid, gid) + require.NoError(t, err) + + // Start container + containerID, sshPort, err := StartSSHContainer(ctx, cli, imageName, mountDir, t.Name()) + require.NoError(t, err) + + // Verify container is running + containerInfo, err := cli.ContainerInspect(ctx, containerID) + require.NoError(t, err) + require.True(t, containerInfo.State.Running) + + // Wait for SSH to be ready + WaitForSSH(t, sshPort, privateKeyPath) + + t.Logf("Container %s is running and SSH is ready. Waiting for self-destruct...", containerID[:12]) + + // Wait for 6 minutes (container should self-destruct after 5 minutes) + timeout := time.After(6 * time.Minute) + ticker := time.NewTicker(10 * time.Second) + defer ticker.Stop() + + containerStopped := false + for !containerStopped { + select { + case <-timeout: + t.Fatal("Container did not self-destruct within 6 minutes") + case <-ticker.C: + containerInfo, err := cli.ContainerInspect(ctx, containerID) + require.NoError(t, err) + + if !containerInfo.State.Running { + containerStopped = true + t.Logf("Container self-destructed successfully") + } else { + t.Logf("Container still running...") + } + } + } + + // Clean up the stopped container + err = cli.ContainerRemove(ctx, containerID, container.RemoveOptions{}) + require.NoError(t, err) +} diff --git a/sei-db/db_engine/litt/util/ssh_test.go b/sei-db/db_engine/litt/util/ssh_test.go new file mode 100644 index 0000000000..9bb88d677e --- /dev/null +++ 
b/sei-db/db_engine/litt/util/ssh_test.go @@ -0,0 +1,210 @@ +//go:build littdb_wip + +package util + +import ( + "fmt" + "os" + "path" + "path/filepath" + "testing" + + "github.com/Layr-Labs/eigenda/common" + "github.com/stretchr/testify/require" +) + +func TestSSHSession_NewSSHSession(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + container := SetupSSHTestContainer(t, "") + defer container.Cleanup() + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Test successful connection + session, err := NewSSHSession( + logger, + container.GetUser(), + container.GetHost(), + container.GetSSHPort(), + container.GetPrivateKeyPath(), + "", + true) + require.NoError(t, err) + require.NotNil(t, session) + defer func() { _ = session.Close() }() + + // Test with non-existent key + _, err = NewSSHSession( + logger, + container.GetUser(), + container.GetHost(), + container.GetSSHPort(), + "/nonexistent/key", + "", + false) + require.Error(t, err) + require.Contains(t, err.Error(), "private key does not exist") + + // Test with wrong user + _, err = NewSSHSession( + logger, + "wronguser", + container.GetHost(), + container.GetSSHPort(), + container.GetPrivateKeyPath(), + "", + false) + require.Error(t, err) +} + +func TestSSHSession_Mkdirs(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + dataDir := t.TempDir() + + container := SetupSSHTestContainer(t, dataDir) + defer container.Cleanup() + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + session, err := NewSSHSession( + logger, + container.GetUser(), + container.GetHost(), + container.GetSSHPort(), + container.GetPrivateKeyPath(), + "", + true) + require.NoError(t, err) + defer func() { _ = session.Close() }() + + // Test creating directory + testDir := path.Join(container.GetDataDir(), "foo", "bar", "baz") + err 
= session.Mkdirs(testDir) + require.NoError(t, err) + + // Verify directories were created in the container workspace + exists, err := Exists(path.Join(dataDir, "foo", "bar", "baz")) + require.NoError(t, err) + require.True(t, exists) + + // Recreating the same directory should not error. + err = session.Mkdirs(testDir) + require.NoError(t, err) +} + +func TestSSHSession_FindFiles(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + dataDir := t.TempDir() + + container := SetupSSHTestContainer(t, dataDir) + defer container.Cleanup() + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + session, err := NewSSHSession( + logger, + container.GetUser(), + container.GetHost(), + container.GetSSHPort(), + container.GetPrivateKeyPath(), + "", + true) + require.NoError(t, err) + defer func() { _ = session.Close() }() + + // Create a test subdirectory in the container's data directory + testDir := path.Join(container.GetDataDir(), "search") + err = session.Mkdirs(testDir) + require.NoError(t, err) + + // Create test files via SSH instead of host filesystem to avoid permission issues + // This ensures all files are created with proper container ownership + _, _, err = session.Exec(fmt.Sprintf("echo 'test content' > %s/test.txt", testDir)) + require.NoError(t, err) + _, _, err = session.Exec(fmt.Sprintf("echo 'log content' > %s/test.log", testDir)) + require.NoError(t, err) + _, _, err = session.Exec(fmt.Sprintf("echo 'data content' > %s/other.dat", testDir)) + require.NoError(t, err) + + // Test finding files with specific extensions + files, err := session.FindFiles(testDir, []string{".txt", ".log"}) + require.NoError(t, err) + require.Len(t, files, 2) + require.Contains(t, files, path.Join(testDir, "test.txt")) + require.Contains(t, files, path.Join(testDir, "test.log")) + + // Test with non-existent directory + files, err = session.FindFiles("/nonexistent", 
[]string{".txt"}) + require.NoError(t, err) + require.Empty(t, files) +} + +func TestSSHSession_Rsync(t *testing.T) { + t.Skip() // Docker build is flaky, need to fix prior to re-enabling + + t.Parallel() + + // Create a temporary data directory for testing + dataDir := t.TempDir() + container := SetupSSHTestContainer(t, dataDir) + defer container.Cleanup() + + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + session, err := NewSSHSession( + logger, + container.GetUser(), + container.GetHost(), + container.GetSSHPort(), + container.GetPrivateKeyPath(), + "", + true) + require.NoError(t, err) + defer func() { _ = session.Close() }() + + // Create local test file + localFile := filepath.Join(container.GetTempDir(), "test_rsync.txt") + testContent := []byte("This is test content for rsync") + err = os.WriteFile(localFile, testContent, 0644) + require.NoError(t, err) + + // Test rsync without throttling - sync to data directory + remoteFile := filepath.Join(container.GetDataDir(), "remote_file.txt") + err = session.Rsync(localFile, remoteFile, 0) + require.NoError(t, err) + + // Verify file was transferred via the container workspace directory + transferredFile := filepath.Join(dataDir, "remote_file.txt") + transferredContent, err := os.ReadFile(transferredFile) + require.NoError(t, err) + require.Equal(t, testContent, transferredContent) + + // Test rsync with throttling + localFile2 := filepath.Join(container.GetTempDir(), "test_rsync2.txt") + throttledContent := []byte("throttled content") + err = os.WriteFile(localFile2, throttledContent, 0644) + require.NoError(t, err) + + remoteFile2 := filepath.Join(container.GetDataDir(), "throttled_file.txt") + err = session.Rsync(localFile2, remoteFile2, 1.0) // 1MB/s throttle + require.NoError(t, err) + + // Verify throttled file was transferred via the container workspace directory + transferredFile2 := filepath.Join(dataDir, "throttled_file.txt") + transferredContent2, err 
:= os.ReadFile(transferredFile2) + require.NoError(t, err) + require.Equal(t, throttledContent, transferredContent2) +} diff --git a/sei-db/db_engine/litt/util/ssh_test_utils.go b/sei-db/db_engine/litt/util/ssh_test_utils.go new file mode 100644 index 0000000000..d0cfaca013 --- /dev/null +++ b/sei-db/db_engine/litt/util/ssh_test_utils.go @@ -0,0 +1,729 @@ +//go:build littdb_wip + +package util + +import ( + "archive/tar" + "compress/gzip" + "context" + "crypto/rand" + "crypto/rsa" + "crypto/x509" + "encoding/base64" + "encoding/pem" + "fmt" + "hash/fnv" + "io" + "net" + "os" + "os/user" + "path/filepath" + "runtime" + "strconv" + "strings" + "sync" + "testing" + "time" + + "github.com/Layr-Labs/eigenda/common" + "github.com/docker/docker/api/types" + "github.com/docker/docker/api/types/container" + "github.com/docker/docker/api/types/mount" + "github.com/docker/docker/client" + "github.com/docker/go-connections/nat" + "github.com/stretchr/testify/require" + "golang.org/x/crypto/ssh" +) + +// SSHTestPortBase is the base port used for SSH testing to avoid port collisions in CI +const SSHTestPortBase = 22022 + +const containerDataDir = "/mnt/data" +const username = "testuser" + +// Global variables for shared SSH test image +var ( + sharedImageName string + imageMutex sync.Mutex +) + +// getCurrentUserUID returns the current user's UID +func getCurrentUserUID() (int, error) { + currentUser, err := user.Current() + if err != nil { + return 0, fmt.Errorf("failed to get current user: %w", err) + } + uid, err := strconv.Atoi(currentUser.Uid) + if err != nil { + return 0, fmt.Errorf("failed to convert UID to int: %w", err) + } + return uid, nil +} + +// getCurrentUserGID returns the current user's GID +func getCurrentUserGID() (int, error) { + currentUser, err := user.Current() + if err != nil { + return 0, fmt.Errorf("failed to get current user: %w", err) + } + gid, err := strconv.Atoi(currentUser.Gid) + if err != nil { + return 0, fmt.Errorf("failed to convert GID to 
int: %w", err) + } + return gid, nil +} + +// GetFreeSSHTestPort returns a free port starting from SSHTestPortBase +func GetFreeSSHTestPort() (int, error) { + // Try ports starting from the base port + for port := SSHTestPortBase; port < SSHTestPortBase+100; port++ { + addr := net.JoinHostPort("127.0.0.1", strconv.Itoa(port)) + listener, err := net.Listen("tcp", addr) + if err != nil { + continue // Port is in use, try next one + } + _ = listener.Close() + return port, nil + } + return 0, fmt.Errorf("no free port found in range %d-%d", SSHTestPortBase, SSHTestPortBase+100) +} + +// GetUniqueSSHTestPort returns a unique port based on test name hash to avoid collisions +func GetUniqueSSHTestPort(testName string) (int, error) { + // Create a hash of the test name to get a deterministic port offset + h := fnv.New32a() + _, _ = h.Write([]byte(testName)) + hash := h.Sum32() + + // Try multiple ports starting from the hash-based offset + for i := 0; i < 10; i++ { + portOffset := int((hash + uint32(i)) % 100) + port := SSHTestPortBase + portOffset + + // Check if this port is free with a short timeout + addr := net.JoinHostPort("127.0.0.1", strconv.Itoa(port)) + conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond) + if err != nil { + // Port is free (connection failed) + return port, nil + } + _ = conn.Close() + } + + // If no port found in the hash range, fall back to free port finder + return GetFreeSSHTestPort() +} + +// SSHTestContainer manages a Docker container with SSH server for testing +type SSHTestContainer struct { + t *testing.T + client *client.Client + containerID string + sshPort uint64 + tempDir string + privateKey string + publicKey string + host string + uid int + gid int +} + +// GetSSHPort returns the SSH port of the test container +func (c *SSHTestContainer) GetSSHPort() uint64 { + return c.sshPort +} + +// GetPrivateKeyPath returns the path to the private key file +func (c *SSHTestContainer) GetPrivateKeyPath() string { + return 
c.privateKey +} + +// GetPublicKeyPath returns the path to the public key file +func (c *SSHTestContainer) GetPublicKeyPath() string { + return c.publicKey +} + +// GetTempDir returns the temporary directory used by the container +func (c *SSHTestContainer) GetTempDir() string { + return c.tempDir +} + +// GetUser returns the SSH user for the test container +func (c *SSHTestContainer) GetUser() string { + return username +} + +// Get the UID of the user inside the container. +func (c *SSHTestContainer) GetUID() int { + return c.uid +} + +// Get the GID of the user inside the container. +func (c *SSHTestContainer) GetGID() int { + return c.gid +} + +// GetHost returns the host address for the SSH connection +func (c *SSHTestContainer) GetHost() string { + return c.host +} + +// GetDataDir returns the path to the container-controlled workspace directory +func (c *SSHTestContainer) GetDataDir() string { + return containerDataDir +} + +// delete the mounted data dir from within the container to avoid permission issues +func (c *SSHTestContainer) cleanupDataDir() error { + + // Create a temporary SSH session for cleanup + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + if err != nil { + return fmt.Errorf("failed to create logger for cleanup: %w", err) + } + + session, err := NewSSHSession( + logger, + c.GetUser(), + c.host, + c.sshPort, + c.privateKey, + "", + false) // Don't log connection errors during cleanup + if err != nil { + return fmt.Errorf("failed to create SSH session: %w", err) + } + defer func() { _ = session.Close() }() + + require.NotEqual(c.t, "", containerDataDir, + "if this is an empty string then we will attempt to 'rm -rf /*'... 
let's not do that") + + // Remove the entire workspace directory tree from inside the container + // This ensures container-owned files are removed by the container user + cleanupCmd := fmt.Sprintf("rm -rf %s/*", containerDataDir) + stdout, stderr, err := session.Exec(cleanupCmd) + if err != nil { + return fmt.Errorf("failed to cleanup workspace: %w\nstdout: %s\nstderr: %s", err, stdout, stderr) + } + + return nil +} + +// Cleanup removes the Docker container and cleans up resources +func (c *SSHTestContainer) Cleanup() { + err := c.cleanupDataDir() + require.NoError(c.t, err) + + // Use a context with timeout for cleanup operations + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + // Stop and remove container with timeout + stopTimeout := 10 // seconds + err = c.client.ContainerStop(ctx, c.containerID, container.StopOptions{ + Timeout: &stopTimeout, + }) + if err != nil { + // Log the error but continue with removal + fmt.Printf("Warning: failed to stop container %s: %v\n", c.containerID, err) + } + + // Remove container even if stop failed + err = c.client.ContainerRemove(ctx, c.containerID, container.RemoveOptions{ + Force: true, // Force removal even if container is still running + }) + require.NoError(c.t, err) +} + +// GenerateSSHKeyPair creates an RSA key pair for testing +func GenerateSSHKeyPair(privateKeyPath string, publicKeyPath string) error { + privateKey, err := rsa.GenerateKey(rand.Reader, 2048) + if err != nil { + return fmt.Errorf("failed to generate private key: %w", err) + } + + // Save private key + privateKeyPEM := &pem.Block{ + Type: "RSA PRIVATE KEY", + Bytes: x509.MarshalPKCS1PrivateKey(privateKey), + } + + privateKeyFile, err := os.Create(privateKeyPath) + if err != nil { + return fmt.Errorf("failed to create private key file: %w", err) + } + defer func() { _ = privateKeyFile.Close() }() + + err = pem.Encode(privateKeyFile, privateKeyPEM) + if err != nil { + return fmt.Errorf("failed to encode 
private key: %w", err) + } + + err = os.Chmod(privateKeyPath, 0600) + if err != nil { + return fmt.Errorf("failed to set private key permissions: %w", err) + } + + // Save public key + publicKey, err := ssh.NewPublicKey(&privateKey.PublicKey) + if err != nil { + return fmt.Errorf("failed to create SSH public key: %w", err) + } + + publicKeyBytes := ssh.MarshalAuthorizedKey(publicKey) + err = os.WriteFile(publicKeyPath, publicKeyBytes, 0644) + if err != nil { + return fmt.Errorf("failed to write public key: %w", err) + } + + return nil +} + +// configureContainerSSHKey updates the container's SSH authorized_keys file with the test-specific public key +func configureContainerSSHKey(ctx context.Context, cli *client.Client, containerID string, publicKeyPath string) error { + publicKeyContent, err := os.ReadFile(publicKeyPath) + if err != nil { + return fmt.Errorf("failed to read public key: %w", err) + } + + // Use base64 encoding to safely pass the SSH key content without shell escaping issues + // Base64 encoding ensures no shell metacharacters can cause problems + encodedKey := base64.StdEncoding.EncodeToString(publicKeyContent) + + execConfig := container.ExecOptions{ + Cmd: []string{ + "sh", "-c", + fmt.Sprintf( + "echo '%s' | base64 -d > /home/%s/.ssh/authorized_keys && chmod 600 /home/%s/.ssh/authorized_keys", + encodedKey, username, username), + }, + } + + // Create the exec instance + execIDResp, err := cli.ContainerExecCreate(ctx, containerID, execConfig) + if err != nil { + return fmt.Errorf("failed to create exec instance: %w", err) + } + + // Start the exec instance with Detach: false to ensure it blocks until completion + err = cli.ContainerExecStart(ctx, execIDResp.ID, container.ExecStartOptions{ + Detach: false, // Explicitly set to false to block until completion + }) + if err != nil { + return fmt.Errorf("failed to start exec instance: %w", err) + } + + // With Detach: false, ContainerExecStart should block until completion. 
+ // However, to be absolutely certain, we'll add a brief polling loop. + for i := 0; i < 10; i++ { // Max 10 attempts with 10ms intervals = 100ms max wait + execInspect, err := cli.ContainerExecInspect(ctx, execIDResp.ID) + if err != nil { + return fmt.Errorf("failed to inspect exec instance: %w", err) + } + + // If the command is no longer running, we can check the exit code + if !execInspect.Running { + // Check if the command was successful + if execInspect.ExitCode != 0 { + return fmt.Errorf("SSH key configuration command failed with exit code %d", execInspect.ExitCode) + } + return nil // Success! + } + + // Brief sleep before checking again + time.Sleep(10 * time.Millisecond) + } + + // If still running after polling, something is wrong + return fmt.Errorf("SSH key configuration command is still running after timeout") +} + +// WaitForSSH waits for the SSH server to be ready +func WaitForSSH(t *testing.T, sshPort uint64, privateKeyPath string) { + logger, err := common.NewLogger(common.DefaultConsoleLoggerConfig()) + require.NoError(t, err) + + // Use a context with timeout to prevent indefinite hanging + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + ticker := time.NewTicker(500 * time.Millisecond) + defer ticker.Stop() + + for { + select { + case <-ctx.Done(): + require.Fail(t, "SSH server did not become ready within 30 seconds") + return + case <-ticker.C: + session, err := NewSSHSession( + logger, + username, + "localhost", + sshPort, + privateKeyPath, + "", + false) + if err == nil { + _ = session.Close() + return + } + // Continue trying on error + } + } +} + +// getOrBuildSharedSSHImage returns the name of the shared SSH test image. +// If the image doesn't exist, it builds it. This method is thread-safe. 
+func getOrBuildSharedSSHImage(ctx context.Context, cli *client.Client, t *testing.T) (string, error) { + imageMutex.Lock() + defer imageMutex.Unlock() + + // If we already have a cached image name, verify it still exists + if sharedImageName != "" { + _, err := cli.ImageInspect(ctx, sharedImageName) + if err == nil { + return sharedImageName, nil + } + // Image no longer exists, reset and rebuild + sharedImageName = "" + } + + // Get current user's UID/GID for the shared image + uid, err := getCurrentUserUID() + if err != nil { + return "", fmt.Errorf("failed to get current user UID: %w", err) + } + gid, err := getCurrentUserGID() + if err != nil { + return "", fmt.Errorf("failed to get current user GID: %w", err) + } + + // Generate a unique image name based on UID/GID and current time to avoid conflicts + imageName := fmt.Sprintf("ssh-test-shared:%d-%d-%d", uid, gid, time.Now().Unix()) + + // Create a temporary directory for building the image + tempDir := t.TempDir() + privateKeyPath := filepath.Join(tempDir, "shared_ssh_key") + publicKeyPath := filepath.Join(tempDir, "shared_ssh_key.pub") + + // Generate SSH key pair for the shared image + err = GenerateSSHKeyPair(privateKeyPath, publicKeyPath) + if err != nil { + return "", fmt.Errorf("failed to generate SSH key pair: %w", err) + } + + publicKeyContent, err := os.ReadFile(publicKeyPath) + if err != nil { + return "", fmt.Errorf("failed to read public key: %w", err) + } + + // Build the shared image + t.Logf("Building shared SSH test Docker image: %s", imageName) + err = BuildSSHTestImage(ctx, cli, tempDir, imageName, string(publicKeyContent), uid, gid) + if err != nil { + return "", fmt.Errorf("failed to build shared SSH image: %w", err) + } + + // Cache the image name for future use + sharedImageName = imageName + return sharedImageName, nil +} + +// SetupSSHTestContainer creates and starts a Docker container with SSH server +// If dataDir is not empty, it will be mounted in the container at /mnt/data +func 
SetupSSHTestContainer(t *testing.T, dataDir string) *SSHTestContainer { + // Use a longer timeout for the entire setup process to handle slow CI environments + ctx, cancel := context.WithTimeout(context.Background(), 180*time.Second) + defer cancel() + + // Get current user's UID/GID + uid, err := getCurrentUserUID() + require.NoError(t, err) + gid, err := getCurrentUserGID() + require.NoError(t, err) + + // Create Docker client + cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation()) + require.NoError(t, err) + + // Generate SSH key pair for this specific test + tempDir := t.TempDir() + privateKeyPath := filepath.Join(tempDir, "test_ssh_key") + publicKeyPath := filepath.Join(tempDir, "test_ssh_key.pub") + + err = GenerateSSHKeyPair(privateKeyPath, publicKeyPath) + require.NoError(t, err) + + // Get or build the shared SSH test image + imageName, err := getOrBuildSharedSSHImage(ctx, cli, t) + require.NoError(t, err) + + if dataDir != "" { + // we have to grant broad permissions here because the container may have a different UID + err = os.Chmod(dataDir, 0777) + require.NoError(t, err, "failed to set permissions on data directory") + } + + // Start container and configure it with the test-specific SSH key + containerID, sshPort, err := StartSSHContainer(ctx, cli, imageName, dataDir, t.Name()) + require.NoError(t, err) + + // Configure the container to use the test-specific SSH key + err = configureContainerSSHKey(ctx, cli, containerID, publicKeyPath) + require.NoError(t, err) + + // Wait for SSH to be ready + WaitForSSH(t, sshPort, privateKeyPath) + + return &SSHTestContainer{ + t: t, + client: cli, + containerID: containerID, + sshPort: sshPort, + tempDir: tempDir, + privateKey: privateKeyPath, + publicKey: publicKeyPath, + host: "localhost", + uid: uid, + gid: gid, + } +} + +// BuildSSHTestImage builds the SSH test image with the provided public key and user IDs +func BuildSSHTestImage( + ctx context.Context, + cli 
*client.Client,
+	tempDir string,
+	imageName string,
+	publicKey string,
+	uid int,
+	gid int,
+) error {
+
+	// Get the Dockerfile path
+	_, currentFile, _, ok := runtime.Caller(0)
+	if !ok {
+		return fmt.Errorf("failed to get current file path")
+	}
+	dockerfilePath := filepath.Join(filepath.Dir(currentFile), "testdata", "ssh-test.Dockerfile")
+
+	// Create build context directory
+	buildContext := filepath.Join(tempDir, "docker_build")
+	err := os.MkdirAll(buildContext, 0755)
+	if err != nil {
+		return fmt.Errorf("failed to create build context: %w", err)
+	}
+
+	// Copy Dockerfile to build context
+	dockerfileContent, err := os.ReadFile(dockerfilePath)
+	if err != nil {
+		return fmt.Errorf("failed to read Dockerfile: %w", err)
+	}
+
+	// Copy start.sh script to build context
+	startScriptPath := filepath.Join(filepath.Dir(currentFile), "testdata", "start.sh")
+	startScriptContent, err := os.ReadFile(startScriptPath)
+	if err != nil {
+		return fmt.Errorf("failed to read start.sh script: %w", err)
+	}
+	err = os.WriteFile(filepath.Join(buildContext, "start.sh"), startScriptContent, 0755)
+	if err != nil {
+		return fmt.Errorf("failed to copy start.sh to build context: %w", err)
+	}
+
+	// Append the public key setup to the Dockerfile
+	publicKeySetup := fmt.Sprintf(
+		"\n# Add test SSH public key\n"+
+			"RUN echo '%s' > /home/testuser/.ssh/authorized_keys\n"+
+			"RUN chmod 600 /home/testuser/.ssh/authorized_keys\n"+
+			"RUN chown %d:%d /home/testuser/.ssh/authorized_keys\n", strings.TrimSpace(publicKey), uid, gid)
+	modifiedDockerfile := string(dockerfileContent) + publicKeySetup
+
+	err = os.WriteFile(filepath.Join(buildContext, "Dockerfile"), []byte(modifiedDockerfile), 0644)
+	if err != nil {
+		return fmt.Errorf("failed to write modified Dockerfile: %w", err)
+	}
+
+	// Create tar archive for build context
+	buildCtx, err := ArchiveDirectory(buildContext)
+	if err != nil {
+		return fmt.Errorf("failed to create build context archive: %w", err)
+	}
+	defer func() { _ = buildCtx.Close() }()
+
+	// Build the image with optimized settings for CI
+	uidStr := strconv.Itoa(uid)
+	gidStr := strconv.Itoa(gid)
+	buildOptions := types.ImageBuildOptions{
+		Tags:        []string{imageName},
+		Dockerfile:  "Dockerfile",
+		Remove:      true,
+		ForceRemove: true,
+		NoCache:     false, // Allow caching to speed up builds
+		BuildArgs: map[string]*string{
+			"USER_UID": &uidStr,
+			"USER_GID": &gidStr,
+		},
+	}
+
+	response, err := cli.ImageBuild(ctx, buildCtx, buildOptions)
+	if err != nil {
+		return fmt.Errorf("failed to build image: %w", err)
+	}
+	defer func() { _ = response.Body.Close() }()
+
+	// Drain the build stream, teeing it into a buffer so failures can
+	// include the build output for debugging
+	var buildOutput strings.Builder
+	reader := io.TeeReader(response.Body, &buildOutput)
+
+	_, err = io.Copy(io.Discard, reader)
+	if err != nil {
+		// Include build output in error for debugging
+		buildOutputStr := buildOutput.String()
+		if len(buildOutputStr) > 1000 {
+			buildOutputStr = buildOutputStr[:1000] + "... (truncated)"
+		}
+		return fmt.Errorf("failed to read build response: %w\nBuild output: %s", err, buildOutputStr)
+	}
+
+	// After the build finishes, verify the image actually exists
+	_, err = cli.ImageInspect(ctx, imageName)
+	if err != nil {
+		buildOutputStr := buildOutput.String()
+		if len(buildOutputStr) > 2000 {
+			buildOutputStr = buildOutputStr[:2000] + "... (truncated)"
+		}
+		return fmt.Errorf("docker image build failed - image not found after build: %w\nBuild output: %s",
+			err, buildOutputStr)
+	}
+
+	return nil
+}
+
+// StartSSHContainer starts the SSH container and returns the container ID and SSH port.
+// If dataDir is not empty, it will be mounted at /mnt/data in the container.
+func StartSSHContainer(
+	ctx context.Context,
+	cli *client.Client,
+	imageName string,
+	dataDir string,
+	testName string,
+) (string, uint64, error) {
+
+	// Get a unique port for this test based on test name hash
+	sshPort, err := GetUniqueSSHTestPort(testName)
+	if err != nil {
+		return "", 0, fmt.Errorf("failed to get unique SSH port: %w", err)
+	}
+
+	containerConfig := &container.Config{
+		Image: imageName,
+		ExposedPorts: nat.PortSet{
+			"22/tcp": struct{}{},
+		},
+	}
+
+	hostConfig := &container.HostConfig{
+		PortBindings: nat.PortMap{
+			"22/tcp": []nat.PortBinding{
+				{
+					HostIP:   "127.0.0.1",
+					HostPort: strconv.Itoa(sshPort), // Use custom port to avoid collisions in CI
+				},
+			},
+		},
+	}
+	if dataDir != "" {
+		hostConfig.Mounts = []mount.Mount{
+			{
+				Type:   mount.TypeBind,
+				Source: dataDir,
+				Target: "/mnt/data",
+			},
+		}
+	}
+
+	// Create a container name that includes the test name for easier debugging
+	containerName := fmt.Sprintf("ssh-test-%s-%d",
+		strings.ReplaceAll(testName, "/", "-"), time.Now().Unix())
+
+	resp, err := cli.ContainerCreate(
+		ctx,
+		containerConfig,
+		hostConfig,
+		nil,
+		nil,
+		containerName)
+	if err != nil {
+		return "", 0, fmt.Errorf("failed to create container: %w", err)
+	}
+
+	err = cli.ContainerStart(ctx, resp.ID, container.StartOptions{})
+	if err != nil {
+		return "", 0, fmt.Errorf("failed to start container: %w", err)
+	}
+
+	// Use the custom SSH port (convert to uint64 for compatibility)
+	return resp.ID, uint64(sshPort), nil
+}
+
+// ArchiveDirectory creates a tar.gz archive of a directory for Docker build context
+func ArchiveDirectory(srcDir string) (io.ReadCloser, error) {
+	pr, pw := io.Pipe()
+
+	go func() {
+		gw := gzip.NewWriter(pw)
+		tw := tar.NewWriter(gw)
+
+		walkErr := filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
+			if err != nil {
+				return err
+			}
+
+			relPath, err := filepath.Rel(srcDir, path)
+			if err != nil {
+				return fmt.Errorf("failed to get relative path: %w", err)
+			}
+
+			// Skip the root directory itself
+			if relPath == "." {
+				return nil
+			}
+
+			header, err := tar.FileInfoHeader(info, "")
+			if err != nil {
+				return fmt.Errorf("failed to create tar header: %w", err)
+			}
+			header.Name = relPath
+
+			if err := tw.WriteHeader(header); err != nil {
+				return fmt.Errorf("failed to write tar header for %s: %w", relPath, err)
+			}
+
+			if info.IsDir() {
+				return nil
+			}
+
+			file, err := os.Open(path)
+			if err != nil {
+				return fmt.Errorf("failed to open file %s: %w", path, err)
+			}
+			defer func() { _ = file.Close() }()
+
+			_, err = io.Copy(tw, file)
+			if err != nil {
+				return fmt.Errorf("failed to copy file %s to tar: %w", path, err)
+			}
+			return nil
+		})
+
+		// Close the writers in order so buffered data is flushed, keeping
+		// the first error encountered
+		if err := tw.Close(); walkErr == nil {
+			walkErr = err
+		}
+		if err := gw.Close(); walkErr == nil {
+			walkErr = err
+		}
+		// Propagate any failure to the reader side instead of silently
+		// producing a truncated archive; CloseWithError(nil) closes normally
+		_ = pw.CloseWithError(walkErr)
+	}()
+
+	return pr, nil
+}
diff --git a/sei-db/db_engine/litt/util/testdata/ssh-test.Dockerfile b/sei-db/db_engine/litt/util/testdata/ssh-test.Dockerfile
new file mode 100644
index 0000000000..77f9eb512f
--- /dev/null
+++ b/sei-db/db_engine/litt/util/testdata/ssh-test.Dockerfile
@@ -0,0 +1,43 @@
+FROM ubuntu:22.04
+
+# Build arguments for user IDs
+ARG USER_UID=1337
+ARG USER_GID=1337
+
+# Install required packages
+RUN apt-get update && apt-get install -y \
+    openssh-server \
+    rsync \
+    && rm -rf /var/lib/apt/lists/*
+
+# Create test group and user with provided UID/GID
+# Handle case where group already exists (common on macOS with gid 20 = staff)
+RUN if ! getent group ${USER_GID} >/dev/null; then \
+        groupadd -g ${USER_GID} testgroup; \
+    else \
+        echo "Group with GID ${USER_GID} already exists, using existing group"; \
+    fi
+RUN useradd -m -s /bin/bash -u ${USER_UID} -g ${USER_GID} testuser
+
+# Setup SSH
+RUN mkdir /var/run/sshd
+RUN mkdir -p /home/testuser/.ssh
+
+# Configure SSH daemon
+RUN sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
+RUN sed -i 's/#PubkeyAuthentication yes/PubkeyAuthentication yes/' /etc/ssh/sshd_config
+
+# Set proper permissions - use GID instead of group name to handle existing groups
+RUN chown -R ${USER_UID}:${USER_GID} /home/testuser/.ssh
+RUN chmod 700 /home/testuser/.ssh
+
+# Create mount directories and set ownership
+RUN mkdir -p /mnt/data
+RUN chown ${USER_UID}:${USER_GID} /mnt/data
+
+# Copy startup script with self-destruct mechanism
+COPY start.sh /start.sh
+RUN chmod +x /start.sh
+
+EXPOSE 22
+CMD ["/start.sh"]
\ No newline at end of file
diff --git a/sei-db/db_engine/litt/util/testdata/start.sh b/sei-db/db_engine/litt/util/testdata/start.sh
new file mode 100644
index 0000000000..6671e57b7a
--- /dev/null
+++ b/sei-db/db_engine/litt/util/testdata/start.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+# Start SSH daemon in background
+/usr/sbin/sshd -D &
+SSHD_PID=$!
+
+# Self-destruct after 5 minutes (300 seconds)
+(
+    sleep 300
+    echo "SSH test container self-destructing after 5 minutes..."
+    kill $SSHD_PID
+    exit 0
+) &
+
+# Wait for SSH daemon to finish
+wait $SSHD_PID
\ No newline at end of file
diff --git a/sei-db/db_engine/litt/util/unsafe_string.go b/sei-db/db_engine/litt/util/unsafe_string.go
new file mode 100644
index 0000000000..e5b05f078f
--- /dev/null
+++ b/sei-db/db_engine/litt/util/unsafe_string.go
@@ -0,0 +1,14 @@
+//go:build littdb_wip
+
+package util
+
+import "unsafe"
+
+// UnsafeBytesToString converts a byte slice to a string without copying the data.
+// Note that once converted in this way, it is not safe to modify the byte slice for any reason.
+func UnsafeBytesToString(b []byte) string {
+	if len(b) == 0 {
+		return ""
+	}
+	return unsafe.String(&b[0], len(b))
+}
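
Because `unsafe.String` aliases the slice's backing array rather than copying it, mutating the slice afterwards silently changes the "immutable" string. A minimal sketch of that hazard (self-contained; it reimplements the helper locally rather than importing this WIP package):

```go
package main

import (
	"fmt"
	"unsafe"
)

// unsafeBytesToString mirrors util.UnsafeBytesToString: the returned string
// shares memory with b, so b must be treated as frozen afterwards.
func unsafeBytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}

func main() {
	buf := []byte("hello")
	s := unsafeBytesToString(buf)
	fmt.Println(s) // hello

	// Writing through the slice changes the string in place, breaking
	// Go's usual string-immutability guarantee.
	buf[0] = 'H'
	fmt.Println(s) // Hello
}
```

This is why the helper is only safe for keys and values that are written once and never touched again, which matches LittDB's write-once workload.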