Context
I'm conducting independent comparative research on architectural patterns in open-source SDN/fabric controllers — specifically how different projects trade off (a) source-of-truth architecture, (b) operator substrate, (c) bare-metal vs virtualized control planes, (d) routing-daemon integration, and (e) operationalizing EVPN-MH at production scale.
Hedgehog Fabric is one of the most thoroughly documented case studies I've found publicly, and the explicit design choices visible in:
- docs/concepts/overview.md
- docs/architecture/overview.md and docs/architecture/fabric.md
- docs/known-limitations/known-limitations.md
- docs/install-upgrade/requirements.md
- the githedgehog/frr hh-master branch (notably commit a9e45514d adding vtysh extensions)

make Hedgehog a unique reference point because the choices made are deliberate and load-bearing — not accidents of legacy code.
What I'm asking
I'd like to ask the maintainers (and the wider community, if anyone has operational experience) for retrospective opinions on five design choices. I'm not requesting code review or work — just whatever architectural opinion you're willing to share publicly.
The research may eventually become a conference talk or blog post; I'd be glad to send a draft for review before publication. If parts of it land as "reference comparative material" in the Hedgehog docs themselves, that would be the ideal outcome.
Why this might belong as docs
docs/concepts/overview.md and docs/architecture/fabric.md describe what Hedgehog is and how it works, but not why these tradeoffs were chosen and not others. A "Design rationale" or "Alternatives considered" section would be valuable for operators evaluating Hedgehog vs alternative patterns. The five questions below are the gaps I keep hitting when comparing.
If maintainers prefer to answer off-list, that's also fine — I can take any single question to email or out-of-band.
The 5 questions
1. Source-of-truth architecture — why CRD-as-SoT vs an external graph?
concepts/overview.md makes an explicit choice to treat Kubernetes CRDs as the operational source of truth:
"all user-facing APIs are Kubernetes Custom Resources (CRDs)"
"Wiring Diagram consists of [Switch, Server, Connection, ...] resources"
This is defensible — a single SoT avoids the dual-write problem, and etcd gives transactional semantics and audit for free. But it's notably different from the pattern adopted by a growing set of network automation projects that put an external graph-based system above the orchestrator — InfraHub (OpsMill, Apache 2.0) is the most visible example, but Nautobot, NetBox + custom plugins, and several homegrown systems fit the same shape.
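For concreteness, my working mental model of the CRD-as-SoT shape is roughly the following kubebuilder-style sketch. The type and field names are my own illustration for comparison purposes, not the actual Hedgehog API:

```go
// Illustrative only: a simplified, hypothetical wiring-diagram resource in the
// CRD-as-SoT style. Field names are assumptions, not Hedgehog's real types.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ConnectionSpec captures intent: which server port attaches to which switch port.
type ConnectionSpec struct {
	Server     string `json:"server"`     // name of the Server resource
	Switch     string `json:"switch"`     // name of the Switch resource
	ServerPort string `json:"serverPort"` // e.g. "enp1s0"
	SwitchPort string `json:"switchPort"` // e.g. "Ethernet4"
	MTU        int    `json:"mtu,omitempty"`
}

// ConnectionStatus is written only by the controller after reconciling the device.
type ConnectionStatus struct {
	Applied            bool   `json:"applied"`
	LastError          string `json:"lastError,omitempty"`
	ObservedGeneration int64  `json:"observedGeneration,omitempty"`
}

// Connection is the single source of truth for one attachment intent: admission
// webhooks validate it, etcd stores it, and the controller converges device
// state toward it.
type Connection struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ConnectionSpec   `json:"spec,omitempty"`
	Status ConnectionStatus `json:"status,omitempty"`
}
```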
The InfraHub-style pattern claims properties that K8s/etcd-as-SoT can't easily deliver:
- Branching and preview — clone the network state, propose a change, render the diff, review, merge. Same workflow as git for code.
- Schema-driven generators — the graph holds intent in a flexible schema; generators render per-target configs (Jinja2 templates driven by GraphQL queries, etc.).
- Decoupled storage scale — Postgres + Neo4j backends grow horizontally; etcd has a well-known practical ceiling around 8 GB.
- Multi-orchestrator — the same graph can drive a K8s operator, an Ansible runner, a Terraform provider, etc., without any being authority.
The cost is dual-SoT reconciliation (graph ↔ CRD ↔ device) and the operational complexity of running another stateful service.
Questions:
a. Did you consider an external graph-based SoT (InfraHub or similar) above the K8s control plane? If so, what made you reject it? If not, was it never on the table, or was the single-SoT property of K8s the explicit goal?
b. At what fabric size do you expect CRD-as-SoT to show stress? Your typical deployment seems to target ~50 switches per control plane. For a hypothetical fabric of 500–1000 switches with thousands of VPCs, VPC peerings, and attachments, do you have empirical data on etcd utilization, controller reconcile latency, and admission webhook throughput? Where does the model break first?
c. The branching/preview workflow is something operators repeatedly ask for in production network changes (drain-then-merge, rollback gates, what-if analysis). Today this lives outside Hedgehog (GitOps on YAML files with ArgoCD/Flux). Do you see a future where the Fabric controller would natively support a "proposed change" CRD distinct from the live one — or is that explicitly out of scope and the answer is "use git, that's what it's for"?
d. For a deployment that already has InfraHub (or equivalent) as the org-wide SoT, what's your recommended integration pattern? Would you (i) drive Hedgehog CRDs from an InfraHub generator and accept the dual-SoT cost, (ii) replace Hedgehog's CRD layer with a custom InfraHub-rendered config delivered to the agent directly, or (iii) argue that the Hedgehog controller and InfraHub are addressing different layers and both should coexist?
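To make option (i) concrete, here is the rough shape I have in mind: a generator queries the external graph and renders Hedgehog-style CRD manifests for a GitOps pipeline to apply. The GraphQL query, schema fields, and API group below are hypothetical placeholders, not real InfraHub or Hedgehog names:

```go
// Sketch of integration pattern (i): an external graph stays the org-wide SoT,
// a generator renders CRD manifests from it, and ArgoCD/Flux applies them so the
// K8s API server remains the only writer to the live fabric. Everything
// schema-specific here is hypothetical.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type graphQLRequest struct {
	Query string `json:"query"`
}

// vpcNode mirrors a hypothetical graph node; real InfraHub schemas are user-defined.
type vpcNode struct {
	Name   string `json:"name"`
	Subnet string `json:"subnet"`
	VLAN   int    `json:"vlan"`
}

func main() {
	// 1. Pull intent from the graph (hypothetical endpoint and query).
	body, _ := json.Marshal(graphQLRequest{Query: `{ vpcs { name subnet vlan } }`})
	resp, err := http.Post("http://infrahub.example.internal/graphql", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, "graph query failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var result struct {
		Data struct {
			VPCs []vpcNode `json:"vpcs"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}

	// 2. Render one manifest per node for the GitOps pipeline to commit/apply.
	for _, v := range result.Data.VPCs {
		fmt.Printf(`---
apiVersion: vpc.example.dev/v1alpha1   # hypothetical group/version
kind: VPC
metadata:
  name: %s
spec:
  subnet: %s
  vlan: %d
`, v.Name, v.Subnet, v.VLAN)
	}
}
```

The dual-SoT cost shows up exactly where you'd expect: the generator output and the live CRDs can drift, so something still has to diff and reconcile them.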
2. K3s as the operator substrate — when does this primitive break down?
You chose K3s + Flatcar on a bare-metal control node, with gateway nodes joining as K3s agents. From the outside, the obvious alternatives are:
- per-zone K8s clusters with an external SoT for cross-zone discovery (cell-based, AWS / Cloudflare pattern);
- a workflow engine (Temporal-style) with stateless workers, since agents and executors are stateless reconcilers by design;
- pure systemd units per node, talking to a central streaming compiler over gRPC — pushing "converging-operator" down to the OS instead of K8s.
What pushed you toward K3s vs those? Did you ever prototype any of the alternatives? Looking back at incidents/post-mortems, what would you warn a team considering K8s for a network control plane to plan for? In particular:
- rolling deploys of the controller mid-reconciliation,
- etcd capacity ceiling vs CRD volume,
- K8s minor-version upgrade churn for the operator codebase,
- blast radius when a single cluster controls many switches/peers.
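On the first of those bullets specifically, the generic mitigation is leader election plus idempotent reconcilers, so a rollout mid-reconciliation hands the loop to the new pod and the new pod replays from CRD state. A minimal controller-runtime sketch of that generic pattern (not Hedgehog's actual controller wiring, which I haven't verified):

```go
// Generic controller-runtime wiring for surviving controller rollouts
// mid-reconciliation. This is a textbook sketch, not Hedgehog's code.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Only one replica reconciles at a time; during a rolling deploy the
		// old pod releases the lease and the new one resumes from CRD state.
		LeaderElection:   true,
		LeaderElectionID: "fabric-controller-leader", // hypothetical lease name
	})
	if err != nil {
		os.Exit(1)
	}

	// Reconcilers would be registered here; each must be safe to re-run on the
	// same object, because a takeover replays in-flight work from scratch.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The interesting failure modes are the ones this doesn't cover, which is why I'm asking about post-mortems rather than the happy path.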
3. Bare-metal mandate for the control node — what specifically forced it?
docs/install-upgrade/install.md states that VM control nodes are "possible but not officially supported." Plausible drivers:
- the control node serves DHCP/PXE/HTTP on the OOB management VLAN, so it has hard L2 dependency on the fabric;
- gateway nodes join as K3s agents on the same cluster, which favors L2 proximity;
- DPDK/SR-IOV on gateway nodes is inherently hardware-bound, so keeping the rest of the stack on physical hardware avoids operating across mixed substrates;
- something else entirely (procurement, audit, deterministic latency, TPM/measured boot).
Which of those was decisive — and how would the answer change for a fabric that explicitly does not anchor PXE/DHCP from the K8s control plane (i.e., switches and gateways are out-of-band provisioned and the K8s cluster is only a gNMI/SSH caller)?
4. EVPN-MH (RFC 7432 + RFC 8584) operational surface area
Your decision to go with EVPN-MH over MC-LAG matches the open-standards direction, and architecture/fabric.md is one of the clearest public write-ups of the rationale. Two operational questions:
a. known-limitations.md lists 4 items today — VPC delete-then-create VNI reuse, port-channel admin-down config refusal on SONiC, MCLAG + externals blackhole, DS5000 CMIS init bug. The first three feel like cousins of the same surface (gNMI/SONiC eventual consistency under transient state). Have you considered exposing a transactional or 2-phase commit layer above gNMI? If so, what blocked it?
b. In community discussions, BFD on EVPN underlay peers has occasionally been reported as fragile when the BGP session is otherwise healthy — particularly on SONiC VS lab setups. Does this reproduce on production hardware SONiC for you? If yes, what's the current operational guidance (BFD on, BFD off, BFD only on overlay peers)?
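To make question 4a concrete, here is the kind of staged-apply layer I mean, sketched against a deliberately hypothetical Device interface rather than the real agent or the gNMI client library: checkpoint, stage, verify operational state, and roll back every participant if any phase fails.

```go
// Sketch of a two-phase "stage / verify / commit-or-rollback" layer of the kind
// question 4a asks about. The Device interface is hypothetical; a real
// implementation would sit on top of gNMI Set/Get against the SONiC agent.
package fabric

import (
	"context"
	"fmt"
)

// Device abstracts the per-switch configuration channel. Snapshot/Restore stand
// in for whatever checkpointing the underlying NOS actually provides.
type Device interface {
	Snapshot(ctx context.Context) (checkpoint string, err error)
	Apply(ctx context.Context, intent map[string]string) error
	Verify(ctx context.Context, intent map[string]string) error // poll operational state
	Restore(ctx context.Context, checkpoint string) error
}

// TransactionalApply pushes intent to every device, verifies convergence, and
// restores all of them if any single device fails either phase.
func TransactionalApply(ctx context.Context, devices []Device, intent map[string]string) error {
	checkpoints := make([]string, len(devices))

	// Phase 1: checkpoint and stage on every participant.
	for i, d := range devices {
		cp, err := d.Snapshot(ctx)
		if err != nil {
			rollback(ctx, devices[:i], checkpoints[:i])
			return fmt.Errorf("snapshot device %d: %w", i, err)
		}
		checkpoints[i] = cp
		if err := d.Apply(ctx, intent); err != nil {
			rollback(ctx, devices[:i+1], checkpoints[:i+1])
			return fmt.Errorf("apply device %d: %w", i, err)
		}
	}

	// Phase 2: only declare success once every device's operational state matches.
	for i, d := range devices {
		if err := d.Verify(ctx, intent); err != nil {
			rollback(ctx, devices, checkpoints)
			return fmt.Errorf("verify device %d: %w", i, err)
		}
	}
	return nil
}

func rollback(ctx context.Context, devices []Device, checkpoints []string) {
	for i, d := range devices {
		// Best effort: rollback failures would surface as alerts, not retries here.
		_ = d.Restore(ctx, checkpoints[i])
	}
}
```

I realize the hard part is that SONiC exposes no universal snapshot/restore primitive, which may be exactly what blocked it; that's what I'm hoping you can confirm or correct.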
5. FRR fork strategy — divergence cost over 15+ months
githedgehog/frr hh-master is currently ~15 months behind upstream master (forked from frr-10.2.1; upstream is at 10.6.1). The Hedgehog-specific delta is essentially one patch — commit a9e45514d adding vtysh extensions, ~132 lines across 5 files.
a. Did you ever attempt to upstream the vtysh extensions patch? The design looks generic and not Hedgehog-specific — it's a vtysh-level dlopen API. Did the proposal get pushback, or was it never sent?
b. What's the rebase cadence plan? Most teams maintaining long-lived forks of routing daemons (Cumulus historically, Arista's internal FRR) have reported the rebase tax as the dominant maintenance cost. Has that matched your experience? If you were starting today, would you still fork — or would you write an out-of-tree integration (FRR northbound plugin, mgmt-fe wire-client, etc.)?
What I can offer back
Happy to share results of the comparative study once written up. If any question is too operational to answer publicly, I'd be glad to take it off-list (any preferred channel).
Thanks for the time, and for the quality of the public documentation — it's substantially better than most for understanding real operational tradeoffs.