Replace fluentd with fluent-bit in the operator #4910

Draft: hjiawei wants to merge 2 commits into tigera:master from hjiawei:fluent-bit-deploy

Conversation

hjiawei (Contributor) commented Jun 10, 2026

Description

New feature (Calico Enterprise): the LogCollector controller now deploys fluent-bit (calico-fluent-bit in calico-system) in place of fluentd, completing the log-collector migration on the operator side.

  • Resource identity migration: namespace tigera-fluentd → calico-system; DaemonSet/ServiceAccount fluentd-node → calico-fluent-bit; TLS secret → calico-fluent-bit-tls; metrics port 9081 → 2020 (fluent-bit's built-in HTTP server).
  • Configuration is rendered in fluent-bit's YAML schema into per-OS ConfigMaps (calico-fluent-bit-conf/-windows), subPath-mounted on Linux and directory-mounted on Windows, started with -c. The render loads the Go plugins via plugins_file, defines parsers inline, applies the record-transform Lua filter, and inlines user-provided fluent-bit YAML filter lists. A rendered-config hash annotation rolls the pods on config-only changes.
  • Tail inputs use the producing components' real log paths, with SQLite offset DBs and filesystem buffering under /var/log/calico/calico-fluent-bit; a pos-migrator init container (Linux and Windows) seeds offsets from the legacy fluentd .pos files and pre-creates the tailed directories. Windows tails the same log types the fluentd Windows variant shipped.
  • The linseed output matches only Linseed-bound tags, authenticates with mTLS + the pod's ServiceAccount token, and retries without limit against bounded filesystem storage. S3/Splunk/Syslog outputs mirror fluentd's per-type fan-out (standard AWS credential env vars, endpoint scheme honored, syslog ships the whole record as JSON via a per-output Lua processor with TLS properly enabled).
  • NonClusterHost renders the :9880 HTTP input with client-certificate verification; the input Service is cleaned up when the resource is removed.
  • eks-log-forwarder runs the fluent-bit image with a rendered in_eks → linseed pipeline and health probes (no startup init container; the input plugin resolves its resume point from Linseed).
  • Probes hit :2020/api/v1/health; the ServiceMonitor scrapes plain HTTP (fluent-bit's monitoring server has no TLS) with access restricted by the component NetworkPolicy, and legacy fluentd monitors are deleted.
  • The LogCollector controller no longer creates/owns the calico-system namespace (deleting the LogCollector must not garbage-collect it); the deprecated fluentdDaemonSet override is honored as an alias of the new calicoFluentBitDaemonSet field (with container-name translation); legacy tigera-fluentd resources — the namespace last — are cleaned up idempotently.
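
The config-hash annotation mentioned above can be sketched as follows. This is a minimal illustration in Python (the operator itself is written in Go), not the PR's code: the annotation key and function names are hypothetical. It relies only on the fact that changing any field of a DaemonSet's pod template, annotations included, triggers a rollout.

```python
import hashlib

# Hypothetical annotation key, chosen for illustration only.
CONFIG_HASH_ANNOTATION = "example.invalid/fluent-bit-config-hash"


def config_hash(rendered_config: str) -> str:
    """Return a stable hex digest of the rendered configuration text."""
    return hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()


def annotate_pod_template(pod_template: dict, rendered_config: str) -> dict:
    """Return a copy of the pod template carrying the config hash.

    The input template is not mutated. A different rendered config yields a
    different annotation value, so the pod template changes and the pods roll
    even though no other field of the DaemonSet was touched.
    """
    metadata = dict(pod_template.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    annotations[CONFIG_HASH_ANNOTATION] = config_hash(rendered_config)
    metadata["annotations"] = annotations
    return {**pod_template, "metadata": metadata}
```

Rendering the same configuration twice produces the same annotation, so a reconcile that changes nothing does not restart the pods.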

Testing: render, controller, and monitor unit suites updated/extended (ConfigMap-content assertions replacing the env-var assertions); the rendered configuration was validated against the real fluent-bit binary; the full migration was validated end-to-end on a test cluster — all log types flowing to Linseed/Elasticsearch, fluentd resources fully removed, tail-offset handover without re-shipping, NonClusterHost ingestion with client-certificate enforcement, and EKS/Windows render shapes verified.

Release Note

The LogCollector now deploys fluent-bit (calico-fluent-bit in calico-system) in place of fluentd, with operator-rendered configuration and automatic migration of fluentd tail positions.

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

@marvin-tigera added this to the v1.43.0 milestone Jun 10, 2026
@hjiawei added the kind/enhancement (New feature or request) and enterprise (Feature applies to enterprise only) labels Jun 10, 2026

Render the calico-fluent-bit DaemonSet (and its Windows variant) in
place of fluentd, migrating the resource identity and wiring up a
working fluent-bit configuration.

- Namespace tigera-fluentd -> calico-system; DaemonSet/ServiceAccount
  fluentd-node -> calico-fluent-bit; TLS secret -> calico-fluent-bit-tls;
  image ComponentFluentd -> ComponentFluentBit; metrics port 9081 ->
  2020 (fluent-bit's built-in HTTP server).
- Config is rendered in fluent-bit's YAML schema into per-OS ConfigMaps
  (calico-fluent-bit-conf and -windows — a shared name would make the
  two renders overwrite each other on mixed clusters), subPath-mounted
  on Linux and directory-mounted on Windows (which cannot mount single
  files), and started with `-c`. It loads the Go plugins via
  plugins_file, defines parsers inline, applies the record_transformer
  lua filter, and inlines user-provided fluent-bit YAML filter lists. A
  hash of the rendered config on the pod template rolls the daemonset
  on config-only changes.
- Tail inputs use the producing components' real paths (waf/,
  runtime-security/report.log, audit/tsee-audit.log, ids/events.log,
  the compliance.*.reports.log glob, policy/policy_activity.log) with
  SQLite offsets and filesystem buffering under
  /var/log/calico/calico-fluent-bit. The pos-migrator init container
  (Linux and Windows) seeds offsets from the fluentd .pos files and
  pre-creates the tailed directories so glob inputs don't error while a
  feature's log dir is absent. Windows tails the fluentd-windows types
  (flows, audit.tsee, audit.kube) against the C:\fluent-bit image
  layout.
- The linseed output matches only Linseed-bound tags (match_regex; IDS
  events and compliance reports are not Linseed-bound), posts with
  ca_file/cert_file/key_file (Go proxy plugins reject the native tls.*
  namespace) and the in-cluster ServiceAccount token, and retries
  without limit against the bounded filesystem buffer. S3, Splunk and
  Syslog outputs mirror fluentd's per-type fan-out: standard AWS
  credential env vars, endpoint scheme honored, and syslog packs the
  whole record as JSON via a per-output lua processor with TLS actually
  enabled (mode alone only selects framing) and the trusted-bundle CA
  when a user syslog certificate is configured.
- NonClusterHost renders the :9880 http input with client-certificate
  verification (voltron presents its internal certificate, matching
  fluentd's client_cert_auth), and the input Service is cleaned up when
  the resource is removed.
- eks-log-forwarder runs the fluent-bit image with a rendered in_eks ->
  linseed pipeline and health probes; the fluentd-era startup init
  container is gone (the plugin resolves its resume point from Linseed)
  and FetchInterval maps to EKS_CLOUDWATCH_POLL_INTERVAL.
- Health probes hit :2020/api/v1/health (health_check on). The
  ServiceMonitor scrapes plain HTTP — fluent-bit's monitoring server
  has no TLS, unlike fluentd's mTLS exporter — with access restricted
  by the component NetworkPolicy; legacy fluentd monitors are removed.
- The LogCollector controller no longer creates or owns the
  calico-system namespace (deleting the LogCollector must not
  garbage-collect it), the deprecated fluentdDaemonSet override is
  honored as an alias with container-name translation, deepcopy and the
  embedded LogCollector CRD are regenerated, and the legacy
  tigera-fluentd resources — the namespace last — are cleaned up
  idempotently on every reconcile.
- API: CalicoFluentBitDaemonSet added (FluentdDaemonSet deprecated);
  golden policy fixtures and enterprise_versions.yml updated.
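
The "namespace last, idempotent on every reconcile" cleanup can be sketched
as below. This is an illustrative Python model (the operator is Go), with the
cluster reduced to an in-memory set; the helper names are hypothetical and
the object list is a small subset drawn from the names in this message.

```python
# Legacy objects as (kind, namespace, name); a subset for illustration.
LEGACY_OBJECTS = [
    ("Namespace", "", "tigera-fluentd"),
    ("DaemonSet", "tigera-fluentd", "fluentd-node"),
    ("ServiceAccount", "tigera-fluentd", "fluentd-node"),
]


def deletion_order(objects):
    """Order objects so that Namespace entries come after everything else.

    sorted() is stable, so non-namespace objects keep their relative order.
    """
    return sorted(objects, key=lambda obj: obj[0] == "Namespace")


def cleanup(cluster: set, objects) -> list:
    """Delete the legacy objects that still exist and return what was deleted.

    Objects that are already gone are skipped, so running this on every
    reconcile is safe: the second and later runs are no-ops.
    """
    deleted = []
    for obj in deletion_order(objects):
        if obj in cluster:
            cluster.remove(obj)
            deleted.append(obj)
    return deleted
```

Deleting the namespace last means the namespaced objects are removed
explicitly rather than only as a side effect of namespace deletion.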

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hjiawei force-pushed the fluent-bit-deploy branch from b8edebb to 98ab545 on June 10, 2026 03:12

Replace the out_linseed Go proxy output with one built-in http output
block per Linseed-bound tag, for the Linux and Windows daemonsets and
the eks-log-forwarder. The http output is C code built into fluent-bit:
`format json_lines` with the date key disabled produces exactly the
NDJSON body Linseed's bulk APIs expect, native tls.* carries the mTLS
client keypair (with hostname verification enabled — fluent-bit's
tls.verify alone only checks the chain), and bearer_token_file
(re-read per request) carries the ServiceAccount or managed-cluster
token. Multi-tenant clusters send the x-tenant-id header. The Windows
image runs no Go code at all, so the Windows config no longer loads a
plugins file; the Linux configs keep it only for the in_eks EKS
CloudWatch input, which stays a Go plugin and now feeds the http
output instead of out_linseed.
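
For readers unfamiliar with NDJSON, the body shape described above is one
JSON object per line with no extra date key injected by the shipper. The
Python sketch below only illustrates that shape; it is not fluent-bit's
serializer, and the record fields are made up.

```python
import json


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line.

    Each record is emitted exactly as given: no timestamp key is added,
    which mirrors the effect of disabling the date key.
    """
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def from_ndjson(body):
    """Parse an NDJSON body back into a list of records."""
    return [json.loads(line) for line in body.splitlines() if line]
```

Usage: `to_ndjson([{"action": "allow"}, {"action": "deny"}])` yields two
lines, each of which parses on its own as a complete JSON document.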

The optional EksCloudwatchLog streamPrefix/fetchInterval settings are
omitted from the environment when unset as defense in depth (the
logcollector controller defaults them before render, but an empty
prefix or zero interval reaching the plugin would override its own
defaults with settings that match every stream / disable polling).
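
The omit-when-unset rule can be sketched as follows, in Python for brevity
(the operator is Go). EKS_CLOUDWATCH_POLL_INTERVAL is the variable named in
the first commit; EKS_CLOUDWATCH_STREAM_PREFIX and the function name are
hypothetical stand-ins.

```python
def eks_env(stream_prefix: str = "", fetch_interval_seconds: int = 0) -> dict:
    """Build the plugin environment, leaving out settings that are unset.

    An empty prefix or a zero interval is treated as "not configured" and is
    omitted, so the plugin falls back to its own defaults instead of being
    handed a value that matches every stream or disables polling.
    """
    env = {}
    if stream_prefix:
        # Hypothetical variable name, for illustration only.
        env["EKS_CLOUDWATCH_STREAM_PREFIX"] = stream_prefix
    if fetch_interval_seconds > 0:
        env["EKS_CLOUDWATCH_POLL_INTERVAL"] = str(fetch_interval_seconds)
    return env
```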

Per-tag filesystem retry backlogs replace the single shared cap: flows
keeps 500M, low-volume tags get 100M each.
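
A sketch of the per-tag cap lookup, in Python for brevity. The table layout
and helper names are hypothetical; the sizes are the ones stated above, and
the tag names are taken from the Windows tail types listed in the first
commit. Whether each cap is applied through fluent-bit's per-output
storage.total_limit_size option is an assumption here.

```python
DEFAULT_CAP_MB = 100              # low-volume tags
CAP_OVERRIDES_MB = {"flows": 500}  # flows keeps the larger backlog


def storage_cap_mb(tag: str) -> int:
    """Return the filesystem retry backlog cap for a tag, in megabytes."""
    return CAP_OVERRIDES_MB.get(tag, DEFAULT_CAP_MB)


def worst_case_total_mb(tags) -> int:
    """Upper bound on disk used if every tag's backlog fills completely."""
    return sum(storage_cap_mb(tag) for tag in tags)
```

With per-tag caps, a stalled high-volume stream can no longer consume the
backlog space that low-volume tags need.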

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hjiawei force-pushed the fluent-bit-deploy branch from e121ac4 to c81ad06 on June 11, 2026 20:04
@radTuti modified the milestones: v1.43.0, v1.44.0 Jun 12, 2026
Labels

docs-pr-required, enterprise (Feature applies to enterprise only), kind/enhancement (New feature or request), release-note-required
