feat(new transform): add a drain log clustering transform #25619
Draft
srstrickland wants to merge 1 commit into
Adds a new `drain` transform that runs the Drain log parsing algorithm
on each event, derives a template string (e.g. `user <*> logged in from
<*>`), and writes it to a configurable field. Mirrors the OpenTelemetry
Collector `drain` processor, including the seed_templates / seed_logs /
warmup_min_clusters knobs for stable templates across deployments.
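To illustrate how a template such as `user <*> logged in from <*>` can arise, here is a minimal sketch of the token-merge step that Drain-style parsers apply when a line matches an existing cluster. It is not the code in `lib/drain-log`: the function name `merge_template` and the whitespace tokenization are assumptions made for illustration only.

```rust
/// Wildcard token, as used in the template examples in this PR.
const WILDCARD: &str = "<*>";

/// Hypothetical helper: merge a log line into an existing template.
/// Positions where the tokens agree are kept; positions where they differ
/// become the wildcard. Returns `None` when the token counts differ, since
/// this sketch only compares messages of equal length.
fn merge_template(template: &str, line: &str) -> Option<String> {
    let t: Vec<&str> = template.split_whitespace().collect();
    let l: Vec<&str> = line.split_whitespace().collect();
    if t.len() != l.len() {
        return None;
    }
    let merged: Vec<&str> = t
        .iter()
        .zip(l.iter())
        .map(|(a, b)| if a == b { *a } else { WILDCARD })
        .collect();
    Some(merged.join(" "))
}
```

Under these assumptions, merging `user alice logged in from 10.0.0.1` with `user bob logged in from 10.0.0.2` yields `user <*> logged in from <*>`.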
The transform is built on a new internal crate at `lib/drain-log`,
vendored from the `drain3` crate (Apache-2.0, akshatagarwl/drain3) with
one substantive addition: true LRU eviction when `max_clusters` is hit.
The upstream crate refuses new clusters at the cap; the OpenTelemetry
processor (and the original logpai/Drain3 Python implementation) instead
evicts the least-recently-used cluster so the matcher keeps adapting to
drifting log vocabularies on long-running streams. The OpenObserve
XDrain post calls this out as essential for production streams ("LRU is
mandatory"), so the vendored copy implements it via an intrusive
doubly-linked list threaded through `Cluster` — O(1) touch on match,
O(1) eviction, no interior mutability, freed cluster ids recycled so
the slot vector stays bounded.
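The eviction scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not the vendored implementation: the type `LruClusters`, its fields, and its method names are hypothetical. It shows list links stored as indices inside each cluster (so no interior mutability is needed), O(1) touch and eviction, and recycling of freed ids.

```rust
/// Hypothetical cluster record; `prev`/`next` are the intrusive list links.
struct Cluster {
    template: String,
    prev: Option<usize>,
    next: Option<usize>,
}

/// Slot vector plus an intrusive LRU list (head = most recently used).
struct LruClusters {
    slots: Vec<Option<Cluster>>,
    free: Vec<usize>, // ids released by eviction, reused before growing `slots`
    head: Option<usize>,
    tail: Option<usize>,
    len: usize,
    max_clusters: usize, // 0 means unlimited
}

impl LruClusters {
    fn new(max_clusters: usize) -> Self {
        Self { slots: Vec::new(), free: Vec::new(), head: None, tail: None, len: 0, max_clusters }
    }

    /// Detach `id` from the list in O(1).
    fn unlink(&mut self, id: usize) {
        let (prev, next) = {
            let c = self.slots[id].as_ref().expect("live cluster");
            (c.prev, c.next)
        };
        match prev {
            Some(p) => self.slots[p].as_mut().unwrap().next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.slots[n].as_mut().unwrap().prev = prev,
            None => self.tail = prev,
        }
    }

    /// Attach `id` at the head (most recently used position) in O(1).
    fn push_front(&mut self, id: usize) {
        let old_head = self.head;
        {
            let c = self.slots[id].as_mut().expect("live cluster");
            c.prev = None;
            c.next = old_head;
        }
        match old_head {
            Some(h) => self.slots[h].as_mut().unwrap().prev = Some(id),
            None => self.tail = Some(id),
        }
        self.head = Some(id);
    }

    /// Mark a cluster as most recently used; called when a line matches it.
    fn touch(&mut self, id: usize) {
        if self.head != Some(id) {
            self.unlink(id);
            self.push_front(id);
        }
    }

    /// Insert a new cluster, evicting the least recently used one at the cap.
    /// Returns the new id and the evicted template, if any.
    fn insert(&mut self, template: &str) -> (usize, Option<String>) {
        let mut evicted = None;
        if self.max_clusters != 0 && self.len == self.max_clusters {
            if let Some(victim) = self.tail {
                self.unlink(victim);
                evicted = self.slots[victim].take().map(|c| c.template);
                self.free.push(victim);
                self.len -= 1;
            }
        }
        let cluster = Cluster { template: template.to_string(), prev: None, next: None };
        let id = match self.free.pop() {
            Some(id) => {
                self.slots[id] = Some(cluster);
                id
            }
            None => {
                self.slots.push(Some(cluster));
                self.slots.len() - 1
            }
        };
        self.push_front(id);
        self.len += 1;
        (id, evicted)
    }
}
```

With `max_clusters = 2`, inserting `A` and `B`, touching `A`, and then inserting `C` evicts `B` and hands its id to `C`, so the slot vector never grows past two entries.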
Notable design choices:
* The transform is task-style (stateful, single-threaded per
instance). The matcher is `Send`, suitable for the per-transform
task model.
* `template_field` defaults to `drain_template` rather than the OTel
`log.record.template` name, since dots are nested paths in Vector;
the OTel-style flat name is reachable via the quoted path
`"log.record.template"` and called out in the field docs.
* Persistence (snapshot to a storage extension) is deliberately
out of scope for v1 — Vector has no equivalent of OTel's storage
extension abstraction, and seeding plus warmup cover most of the
cross-restart stability the OTel processor's snapshot mode targets.
Tests: 6 LRU tests in `drain-log` covering eviction policy, id reuse,
DLL ordering invariants under interleaved touches, and a hot-clusters-
survive-churn stress test; 5 transform-level tests covering annotation,
warmup suppression, missing-field pass-through, and a custom template
field.
Vector configuration
Each emitted event carries an additional `drain_template` field with the
Drain-derived template string (e.g. `<*> <*> <*> systemd[<*>]: Started <*>`).
Watch the templates stabilise across a few seconds of input as the matcher
converges.
How did you test this PR?
Unit tests (cargo test, run on Linux x86_64):
recycling, ordering invariants under interleaved touches, hot-clusters-survive-churn stress, and the unlimited
(max_clusters = 0) path.
a custom template_field, and generate_config.
All tests pass.
`cargo clippy --no-default-features --features transforms-drain -- -D warnings` is clean.
Manual run against demo_logs (syslog + apache_common + apache_error formats simultaneously) for several minutes,
observing template convergence on stdout via the console sink. Verified that:
name.
Build:
`cargo build --bin vector` succeeds with the new feature in the default transforms-logs set.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References
Depth Tree" (the original Drain paper)
streaming workloads, motivating the local addition over the upstream
`drain3` crate (attribution preserved in `lib/drain-log/NOTICE`)