Releases: Lekssays/codebadger
v0.6.1-beta
Patch release.
Fixed
- Worker reachability. The compose host-port publish now keys off `MCP_PUBLISH_HOST` (default `127.0.0.1`, loopback-only) and `MCP_PORT`, which drives both the in-container listen port and the published host port. Pool workers now log their reachability mode and warn when `JOERN_DOCKER_NETWORK` is empty while the MCP runs containerized, the misconfiguration that made workers time out despite a healthy JVM.
Docs
- `.env.example` and `docker-compose.yml` clarify the distinction between `MCP_HOST` (in-container bind) and `MCP_PUBLISH_HOST` (host publish interface).
v0.6.0-beta
🦡 codebadger — v0.6.0-beta
Agent-usability release on top of v0.5.1/v0.5.2-beta, driven by feedback from a large agent-run across ~70 C/C++ CVEs. The analysis primitives were already solid; this release closes the gaps that actually slowed agents down — truthful build/load state (no more silent hangs or invisible empties), better coverage (indirect dispatch, gated code, large repos), and a set of frontend-aware build options so high-fidelity parsing is one parameter (or zero) away.
No breaking changes from v0.5.x.
🔭 Truthful state in get_cpg_status
- `generating` timeout reconciliation. A build whose worker died (process restart, OOM kill, lost in-memory job) no longer sits in `generating` forever. Each build stamps a deadline (`generation_timeout` + `generation_deadline_grace`); a status poll past the deadline with no live worker is reconciled to `FAILED` with `error_code=GENERATION_TIMEOUT`. A still-queued/running build is never condemned (the liveness probe fails safe).
- Progress telemetry. Responses now carry `phase` (`queued` → `frontend` → `loading` → `ready`), `elapsed_seconds`, `deadline_seconds` (remaining budget), and `queue_position`, so a poller can tell "queued behind others" from "actively parsing" from "wedged" instead of staring at a bare status.
- Coverage sanity check. `user_method_count` (the verified user-defined method count from load) is surfaced, so a near-empty build is obvious immediately.
- `codebase_label`. A stable, non-sensitive `<project>@<short-hash>` ties a hash back to what it built, despite redacted paths.
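A minimal sketch of a poller that uses the telemetry fields above. It assumes the status response is a dict carrying a `status` key next to the documented `phase`, `elapsed_seconds`, `deadline_seconds`, and `queue_position` fields; `get_status` is a hypothetical callable standing in for whatever MCP client invokes `get_cpg_status`.

```python
import time
from typing import Callable


def describe(status: dict) -> str:
    """Turn the telemetry fields into a one-line progress note."""
    phase = status.get("phase")
    if phase == "queued":
        return f"queued behind others (position {status.get('queue_position')})"
    return (f"{phase}: {status.get('elapsed_seconds')}s elapsed, "
            f"{status.get('deadline_seconds')}s of budget left")


def wait_for_cpg(get_status: Callable[[str], dict], codebase_hash: str,
                 poll_seconds: float = 2.0, max_polls: int = 600,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the build reaches a terminal state, then return the last response."""
    for _ in range(max_polls):
        status = get_status(codebase_hash)
        # Assumption: terminal states are reported as "ready" / "failed" (any case).
        if str(status.get("status", "")).lower() in ("ready", "failed"):
            return status
        print(describe(status))
        sleep(poll_seconds)
    raise TimeoutError(f"{codebase_hash} did not finish within {max_polls} polls")
```

On a `ready` response, checking `user_method_count` before issuing queries is a cheap way to catch a near-empty build early.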
🛠️ New tool: get_backend_status
Read-only backend introspection for self-pacing: build_workers, recommended_max_concurrent_builds, queue depth / in-flight, active vs. max Joern servers, the memory-admission ledger, cpgs_on_disk / disk_mb, and a per-CPG list. Agents (and orchestrators) can now size their fan-out instead of melting the backend by trial and error.
🧩 Coverage gaps closed
- Indirect / virtual / callback callers. When `get_call_graph(..., incoming)` finds 0 direct edges, it now surfaces every site that takes the method's address (function pointer, callback registration, vtable entry) as the likely caller, turning dead-ends (`png_safe_execute`, libtiff `img->put`, registered read callbacks, …) into leads. Clearly labelled as a heuristic.
- Gated-body warnings. `get_call_graph` / `get_program_slice` / `get_variable_flow` now warn when the target method resolves but has no body (0 calls, ≤1-line span: the `#ifdef`/feature-gated signature), pointing at `defines=[…]` / `include_paths=[…]` instead of silently returning empty.
- Slice token budget. `get_program_slice` clips per-node code (one macro-expanded statement could blow the response) and caps node counts with an explicit `TRUNCATED` note, so a macro-heavy sink no longer forces the agent to abandon the slice.
⚙️ Frontend-aware build options on generate_cpg
Every new flag is gated by a per-frontend capability table (FRONTEND_CAPABILITIES), so a flag is passed only to a frontend that accepts it — handing a C-only flag to pysrc2cpg/jssrc2cpg/… can no longer crash a build. (--exclude-regex is universal; --include/--compilation-database/auto-discovery are c2cpg-only; --define is c2cpg + swift.)
- `include_globs`: scoped large-repo builds (all languages). Analyze a subset of a big repo (`['libavcodec/**','libavutil/**']`) without re-rooting `source_path` and losing cross-directory header/macro resolution. The repo root stays the parse base; only out-of-scope source TUs are skipped, and headers stay includable.
- `auto_system_headers` (C/C++, opt-in). Enables c2cpg `--with-include-auto-discovery` so libc/STL headers resolve instead of causing whole files or gated bodies to be dropped. Use it when coverage looks thin.
- `compile_commands` (C/C++): highest fidelity, with auto-detect. Point at a `compile_commands.json` for exact per-file `-I`/`-D`/`-std`. The DB's absolute build-machine paths are auto-rebased onto the analyzed copy. If you don't pass one, a `compile_commands.json` shipped in the source (root, `build/`, `out/`, …) is detected and used automatically (`CPG_AUTODETECT_COMPILE_DB`, default on). Best-effort: if it can't be applied, the build proceeds and logs why.
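The capability gating described above can be sketched roughly as follows. The table contents are reconstructed from the flag list in these notes; the layout of the real `FRONTEND_CAPABILITIES`, the `swiftsrc2cpg` key, and the `gate_flags` helper are assumptions made for illustration.

```python
# Illustrative table only: rebuilt from the release-note text, not copied from the code.
FRONTEND_CAPABILITIES = {
    "c2cpg": {"--exclude-regex", "--include", "--compilation-database",
              "--with-include-auto-discovery", "--define"},
    "swiftsrc2cpg": {"--exclude-regex", "--define"},
    "pysrc2cpg": {"--exclude-regex"},
    "jssrc2cpg": {"--exclude-regex"},
}


def gate_flags(frontend, requested):
    """Split (flag, value) pairs into those the frontend accepts and those it does not."""
    # Unknown frontends only get the universal flag.
    allowed = FRONTEND_CAPABILITIES.get(frontend, {"--exclude-regex"})
    kept, dropped = [], []
    for flag, value in requested:
        (kept if flag in allowed else dropped).append((flag, value))
    return kept, dropped


kept, dropped = gate_flags("pysrc2cpg", [("--define", "HAVE_ZLIB=1"),
                                         ("--exclude-regex", ".*/tests/.*")])
```

Dropping an unsupported flag (and reporting it) rather than passing it through is what keeps a C-only option from crashing a non-C frontend.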
💾 Lifecycle & resource hygiene
- Load-size guard (2 GB). A built `cpg.bin` above `CPG_MAX_LOAD_MB` (default 2048) fails fast with `CPG_TOO_LARGE` and scoping guidance, instead of the opaque "failed to reload into a Joern server" after a long stall.
- Cold-CPG GC (evict-only). A background sweep releases the allocations (server process, port, memory reservation) of CPGs gone cold and marks them `SLEEPING`; the `cpg.bin` is kept on disk and reloads transparently on the next query. Disk deletion is strictly opt-in (`CPG_GC_DELETE_COLD`, default off).
🐳 Containerization
- Docker networking for pooled workers. Joern server/worker networking hardened so pool workers can be reached by container name on an internal Docker network (no published host port required).
✅ Quality
- New unit + integration coverage for status reconciliation/progress, `get_backend_status`, the gated/indirect/slice query changes, frontend capability gating, `include_globs` scoping, and `compile_commands` rebasing/auto-detect. The query changes were also validated live against a running Joern (gated warning, indirect-caller, slice, and scoping all confirmed end-to-end).
v0.5.2-beta
🔧 codebadger — v0.5.2-beta
Feature + hardening release on top of v0.5.1-beta. Adds caller-supplied C/C++ build options (with auto include-detection), makes the CPG cache key correct for build options and remote branches, fixes two query-output corruption bugs, and rebuilds the integration test corpus into a realistic codebase that exercises the detectors end-to-end.
No breaking changes from v0.5.1-beta.
✨ New
- C/C++ build options on `generate_cpg`. New optional `include_paths` and `defines` parameters are passed through to `c2cpg`. Without them, angle-includes of generated headers (e.g. `<libxml/xmlversion.h>`) don't resolve and feature macros stay undefined, so `#ifdef`-gated modules are silently dropped from the CPG. Entries are validated/normalized first (`_sanitize_build_opt_list`): control characters are rejected and relative include paths may not contain `..` (so a joined path can't escape the source root); absolute paths pass through.
- Automatic include-dir discovery. When no include paths are supplied, `_autodetect_c_includes` seeds a lightweight C/C++ search path (the source root, any `include/` directory, and any directory that directly contains `config.h` or a generated `*version*.h`) so common angle-includes resolve out of the box.
🩹 Fixes
- `run_cpgql_query` char-explosion. Query normalization now parenthesises the base query before appending `.take`/`.toJsonPretty`/`.toString`. Previously a `++` chain (e.g. `a.l ++ b.l ++ c.name.toJsonPretty`) bound the tail to its last operand only, so the result came back as an exploded String instead of JSON. Queries that already self-emit (a `<codebadger_result>` envelope, or an explicit `.toJson`/`.toJsonPretty`/`.toString` tail) are left untouched.
- `get_variable_flow` output corruption. The variable-flow query emits a self-delimiting `<codebadger_result>…</codebadger_result>` envelope that `_parse_output` extracts; the executor no longer wraps/trims its tail, which had been corrupting the payload.
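The parenthesisation fix can be illustrated with a small sketch. `normalize_query` and its `limit` parameter are hypothetical names; the actual normalizer likely handles more cases than the two rules shown here.

```python
from typing import Optional

# Tails that mean the query already serialises its own result.
SELF_EMITTING_TAILS = (".toJson", ".toJsonPretty", ".toString")


def normalize_query(query: str, limit: Optional[int] = None) -> str:
    """Wrap the base query in parentheses before appending the output tail."""
    q = query.strip()
    if "<codebadger_result>" in q or q.endswith(SELF_EMITTING_TAILS):
        return q  # self-emitting queries are left untouched
    tail = f".take({limit})" if limit is not None else ""
    # Without the parentheses, the tail would bind to the last operand of a `++` chain only.
    return f"({q}){tail}.toJsonPretty"
```

For example, `normalize_query("a.l ++ b.l ++ c.name")` yields `(a.l ++ b.l ++ c.name).toJsonPretty`, so the tail applies to the whole chain.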
🔁 Cache correctness
- Build options are part of the cache key. Caller-supplied include paths / defines change the produced CPG, so they're now folded into the cache key; two builds of the same source with different `c2cpg` flags no longer collide on one graph.
- Remote branch is part of the cache key. For `github` sources a requested `branch` now keys the CPG, so two branches of the same repo can't collide (the second request previously reused the first branch's graph). Default branch (`None`) leaves the key unchanged for back-compat.
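One way to picture the two cache-key changes, as a sketch under assumptions: the function name, the key layout, and the use of SHA-256 are illustrative, not taken from the codebase. The property that matters is that options and branch only enter the key when they are actually supplied.

```python
import hashlib
import json
from typing import Optional, Sequence


def cache_key(source_id: str, language: str,
              include_paths: Sequence[str] = (),
              defines: Sequence[str] = (),
              branch: Optional[str] = None) -> str:
    """Derive a short key; unset options leave the key as it was before."""
    parts = [source_id, language]
    if include_paths or defines:
        # Order is preserved on purpose: include order can change what the parser sees.
        parts.append(json.dumps({"include_paths": list(include_paths),
                                 "defines": list(defines)}, sort_keys=True))
    if branch is not None:
        parts.append(f"branch={branch}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()[:16]
```

Because the default branch (`None`) and empty option lists contribute nothing, keys computed for existing builds stay valid.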
🐳 Containerized MCP
- Host-path resolution for containerized deployments. `resolve_host_path(host_path, require_local_access=True)` underpins `local` sources. When the MCP is containerized and the path lives on the host filesystem, the caller copies the tree via a host-daemon helper container and passes `require_local_access=False` to skip local existence checks (the helper validates existence on the host instead).
🧪 Test corpus & integration coverage
- Realistic analysis fixture. `playground/codebases/core` was rewritten from a labelled sample collection into a comment-free, OSS-style lightweight VMM (`microvm`): virtio device emulation, a DMA/guest-RAM controller, a vsock/QMP monitor, INI config, and command handling. Telegraphing names (`vuln_*`, `*_unsafe`, `*_untrusted`, `memory_process_untrusted`, `safe_str*`, …) are gone, so detectors must rely on program analysis rather than naming.
- Tricky + edge cases. Each detector now has a true-positive site paired with a precision (true-negative) variant, e.g. `dma_ring_resize` (`malloc(count * sizeof …)`, HIGH) vs `dma_ring_resize_guarded` (`__builtin_mul_overflow` + bound); double-free on a shared path vs mutually-exclusive branches; UAF via interprocedural free + returned dangling pointer vs free-then-reassign; size-mismatch `memcpy` vs bounds-checked copy; a `config_open_checked` access→open TOCTOU; and interprocedural `recv → system` taint.
- Expanded integration suite. New cases cover integer overflow, TOCTOU, null-pointer dereference, uninitialized reads, interprocedural taint flow, and a function-pointer/static callback-chain call graph, alongside the existing UAF / double-free / format-string / heap- & stack-overflow / call-graph / CFG / taint-source/sink tests. Validated live against a running server: 28/28 detector checks pass on the new corpus.
- Unit tests added for cache-key generation, query normalization, build-option validation, host-path resolution, source fingerprinting, and C include auto-detection.
v0.5.1-beta
🩹 codebadger — v0.5.1-beta
Patch release on top of v0.5.0-beta.
Fixes
- Use-After-Free detector regex crash. `find_use_after_free` interpolated the freed-pointer expression into Joern's `cpg.identifier.name(...)`, which is regex-interpreted, so a freed dereference like `free(*ptr)` threw `PatternSyntaxException: Dangling meta character '*'` and the tool returned a stack trace instead of results. It now derives the underlying identifier name and matches it with `nameExact` (no regex). Confirmed by the integration suite (now 22/22).
No API or configuration changes from v0.5.0-beta.
v0.5.0-beta
🚀 codebadger — v0.5.0-beta
v0.5.0-beta — high-scale stability hardening + a CPG-only refactor. This is a breaking release: six file/source-dependent MCP tools and all built-in prompts are removed, and source snapshots are now ephemeral. Driven by a postmortem of a ~300-CVE batch that surfaced load-tier fragility, host-OOM, and a connection-refused storm.
⚠️ Breaking changes
- Removed 6 MCP tools. `list_files`, `get_method_source`, `get_code_snippet`, `get_macro_expansion`, `get_codebase_summary`, and `discover_fixed_vulnerabilities` are gone. codebadger is now CPG-only: it analyzes the Code Property Graph, not files on disk. Read raw source from your own checkout (agents already have grep), use `run_cpgql_query` for graph-level code access (`node.code`), and run git-history recon on demand in your own clone. The 9 pure-CPG browsing tools (`list_methods`, `list_calls`, `get_call_graph`, `list_parameters`, `run_cpgql_query`, `find_bounds_checks`, `get_cpgql_syntax_help`, `get_cfg`, `get_type_definition`) plus the taint and detector tools are unchanged.
- Removed all built-in MCP prompts. Methodology now lives with the calling agent; the server ships tools only.
- Ephemeral source. After a CPG is built, the source snapshot (`playground/codebases/<hash>`, incl. any GitHub clone) is deleted; the CPG is the sole persisted artifact. A later regenerate re-fetches source. Set `CPG_EPHEMERAL_SOURCE=false` to keep snapshots for build debugging.
🩹 Stability & performance (load tier)
- Verify-probe timeout no longer condemns valid CPGs. The post-import readiness probe used a hard-coded 15 s timeout that, under host pressure, marked perfectly valid CPGs as failed/empty mid-load. It's now configurable (`JOERN_VERIFY_TIMEOUT_SECONDS`, default 60) and bounded by the load budget; a query that times out while a CPG is still loading no longer terminates the server.
- Transient load failures are retried, not fatal. A momentary stall during reactivation used to permanently mark a codebase `failed` even though its `cpg.bin` was valid on disk. Reloads now retry up to `JOERN_LOAD_MAX_ATTEMPTS` (default 3) for transient causes; a genuinely empty/broken build is still failed fast.
- Build memory is bounded against the container cap. `build_workers` is auto-clamped at startup so `build_workers × build_heap` fits the build container's memory limit, eliminating OOM-killed builds (exit 137) and the host-memory exhaustion they caused.
- No more connection-refused storm. Stale cached Joern clients (pointing at a re-spawned worker's old port) are now rebuilt against the live registry; queries are no longer dispatched into a loading/generating server, and a `READY` codebase whose worker was reaped is transparently reactivated.
🔁 Concurrency
- Generation single-flight. Concurrent `generate_cpg` calls for the same source no longer race: the source copy is atomic (build-in-temp → `os.replace`, so a half-merged tree can't produce a spurious empty/parse-failed CPG), and a per-hash Redis `codebase_generation_lock` deduplicates the staging+enqueue so identical concurrent requests don't repeat the work.
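The build-in-temp → `os.replace` pattern can be sketched as below. `stage_source_atomically` is a hypothetical helper, not the project's function, and the Redis lock is left out; the sketch assumes the temporary directory and the destination live on the same filesystem, which is what makes the rename atomic.

```python
import os
import shutil
import tempfile


def stage_source_atomically(src_dir: str, dest_dir: str) -> bool:
    """Copy src_dir into place so readers never observe a half-copied tree.

    Returns True if this call published dest_dir, False if a populated
    dest_dir was already there (for instance from a concurrent request).
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    os.makedirs(parent, exist_ok=True)
    # Stage next to the destination so the final rename stays on one filesystem.
    tmp = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        staged = os.path.join(tmp, "tree")
        shutil.copytree(src_dir, staged)
        try:
            os.replace(staged, dest_dir)  # single rename: all-or-nothing
        except OSError:
            return False  # destination already exists and is not empty
        return True
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

A frontend that starts parsing `dest_dir` therefore sees either no tree or the complete tree, never a partial merge.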
⚙️ New configuration
| Env | Default | Purpose |
|---|---|---|
| `JOERN_VERIFY_TIMEOUT_SECONDS` | `60` | Per-poll readiness-probe timeout (bounded by load timeout). |
| `JOERN_LOAD_MAX_ATTEMPTS` | `3` | Reload-from-disk retries for a transient load failure. |
| `CPG_EPHEMERAL_SOURCE` | `true` | Delete the source snapshot once the CPG is built. |
CPG_BUILD_WORKERS is now auto-clamped to the build container's memory cap. See docs/configuration.md.
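The clamp amounts to simple arithmetic. The sketch below assumes whole-gigabyte units and a floor of one worker; both are illustrative choices rather than documented behavior, and `clamp_build_workers` is a hypothetical name.

```python
def clamp_build_workers(requested_workers: int, build_heap_gb: int,
                        container_limit_gb: int) -> int:
    """Largest worker count such that workers * heap fits the container cap."""
    if build_heap_gb <= 0:
        raise ValueError("build_heap_gb must be positive")
    max_fit = container_limit_gb // build_heap_gb
    # Assumption: never clamp below one worker, even if a single heap exceeds the cap.
    return max(1, min(requested_workers, max_fit))
```

With 4 requested workers, a 6 GB build heap, and a 16 GB container cap, this returns 2, because 3 × 6 GB would already exceed the cap.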
v0.4.3-beta
🚀 codebadger — v0.4.3-beta
v0.4.3-beta — security & deployment hardening. (The scalability/port-allocation and queue-depth fixes shipped in v0.4.2-beta.)
🔒 Security
- SSRF-hardened repository URL validation. Remote repos are now restricted to `https://github.com/…` and `https://gitlab.com/…` (incl. `www.`), enforced by two independent gates: a literal case-sensitive `https://<host>/` prefix match and a parsed-hostname allowlist. Rejects non-`https` schemes (`git://`, `ssh://`, `file://`), embedded credentials (`user:tok@`), non-default ports, control chars, userinfo host-smuggling (`https://github.com@evil/…`), internal/metadata hosts, and look-alike domains.
- Snippet language validation + inference. Pasted code is supplied in `<code language="…">` tags (regex-parsed); the language is validated and content-inferred, and a mislabeled or ambiguous snippet is refused with an actionable message instead of building a wrong-language CPG.
- `CHAT_DEPLOY` mode. Set `CHAT_DEPLOY=true` to disable `source_type='local'` entirely so a chat-facing / multi-tenant MCP can't read arbitrary host paths; callers must use an allowlisted repo URL or a pasted snippet.
- Path-traversal hardening. `resolve_host_path` now rejects null bytes / control characters, canonicalizes with `realpath` before any check, and supports an optional `ALLOWED_SOURCE_ROOTS` allowlist for hard containment of local sources.
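A rough sketch of the two-gate repository URL check described above, using only the standard library. `is_allowed_repo_url` and the constant names are hypothetical, and the real validator may apply further rules beyond these.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = ("github.com", "www.github.com", "gitlab.com", "www.gitlab.com")
ALLOWED_PREFIXES = tuple(f"https://{host}/" for host in ALLOWED_HOSTS)


def is_allowed_repo_url(url: str) -> bool:
    """Accept a URL only if both the literal-prefix and parsed-host gates pass."""
    # Reject control characters up front (newlines, tabs, NUL, DEL, ...).
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        return False
    # Gate 1: literal, case-sensitive "https://<host>/" prefix.
    if not url.startswith(ALLOWED_PREFIXES):
        return False
    # Gate 2: parse independently and check scheme, userinfo, port, and hostname.
    try:
        parts = urlsplit(url)
        port = parts.port
    except ValueError:
        return False
    if parts.scheme != "https" or parts.username or parts.password or port is not None:
        return False
    return parts.hostname in ALLOWED_HOSTS
```

The gates overlap on purpose: a parser quirk that fools one of them still has to get past the other.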
⚙️ Configuration & deployment
- New `MCP_PORT` env var (default `4242`).
- Compose env passthrough fixed. `CHAT_DEPLOY`, `ALLOWED_SOURCE_ROOTS`, `CPG_QUEUE_MAXSIZE`, and `MCP_PORT` are now passed into the `codebadger-mcp` container; previously `CPG_QUEUE_MAXSIZE` set in `.env` was silently inert for the containerized MCP.
- `.env` / `.env.example` synced and documented for normal-user defaults.
📚 Docs
- `docs/security.md`: SSRF/repo-URL allowlist, snippet-language, `CHAT_DEPLOY`, and path-confinement controls added to the threat model and hardening checklist.
- `docs/deployment.md`: new "Hardening a chat-facing deployment (`CHAT_DEPLOY`)" section.
Full changelog: v0.4.2-beta...v0.4.3-beta
v0.4.2-beta
🚀 codebadger — v0.4.2-beta
🎯 Highlights
v0.4.2-beta is a reliability-under-load release. After v0.4.1-beta made the stack production-shaped, large batches (hundreds of CVEs, high client concurrency) still hit a cluster of load-tier failures — CPGs that built fine but wouldn't reactivate, ports that raced Docker's teardown, queue rejections under fan-out, and "ready" codebases whose Joern server had quietly died. This release fixes those: a collision-free CPG loader, build-time overlay persistence so memory-capped workers load big CPGs reliably, rotating port allocation, a decoupled queue depth, event-loop-managed server restarts, and classified, API-visible build failures. Net effect: clean RAM no longer means mysterious failures — batches run to completion and the ones that don't tell you why.
⚠️ Breaking Changes
- None. v0.4.2-beta is drop-in over v0.4.1-beta. All new behavior is additive and defaults are safe; the new knobs below only need touching for unattended/batch drivers.
📦 What's New
Collision-Free CPG Loading (fixes the "No projects loaded" load-tier failures)
- Every CPG file is literally named `cpg.bin`, so letting Joern derive the project name from the filename collided when a worker imported a second CPG or reused a workspace; `importCpg` then left no project open ("No projects loaded"), failing reactivation of a perfectly good build.
- `load_cpg` now imports under an explicit, collision-free project name (`workspace.reset; importCpg(path, name); open(name)`), then runs a readiness poll that distinguishes three outcomes: loaded-and-non-empty, genuinely empty build (0 user-defined methods → fail with a distinct reason, no pointless retry), and no-project race (→ re-import once before giving up). A registration race right after `importCpg` no longer reads as a permanent failure.
Build-Time Overlay Persistence (memory safety on reactivation)
- Joern applies the dataflow overlay (`ReachingDefPass`) the first time a CPG is opened and re-saves it into `cpg.bin`. Doing that on every load inside a memory-capped query worker OOMed the worker on large C/C++ trees, surfacing, again, as "No projects loaded".
- CPG generation now applies and persists the overlays once, in the large-heap build container. Later `importCpg` calls just deserialize ("Overlay dataflowOss already exists – skipping"), so even a tiny tier-S (2 GB) worker loads a large CPG reliably. Best-effort: on failure the base CPG is kept and the worker falls back to recompute-on-load.
- Build JVM is now sized from `CPG_BUILD_HEAP_GB` with G1GC + string dedup (`-Xmx{heap}G -Xms2G -XX:+UseG1GC -XX:+UseStringDeduplication`), replacing the tiny `-Xmx2G` default that OOMed the overlay pass.
Rotating Port Allocation (fixes "failed to become ready / connection refused" pile-ups)
- Always handing back the lowest free port republished a just-released host port on the very next spawn, racing Docker's teardown of the old mapping (docker-proxy/iptables DNAT) and the kernel's `TIME_WAIT`. Failures concentrated on the first port (e.g. 14000).
- Both `PortManager` (in-process) and `RedisPoolStore` (pool mode, via an atomic `INCR` cursor under the admit lock) now rotate across the whole range, giving a freed port time to fully release before reuse. A new best-effort `_wait_host_port_free` waits out a lingering mapping before publishing a worker.
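The rotation idea, as a minimal in-process sketch. `RotatingPortAllocator` is a hypothetical class, not the project's `PortManager`; it is not thread-safe, and the pool-mode variant with a Redis `INCR` cursor is not shown.

```python
class RotatingPortAllocator:
    """Hand out ports round-robin so a freed port is not reused immediately."""

    def __init__(self, start: int, end: int) -> None:
        self._ports = list(range(start, end + 1))
        self._cursor = 0          # only ever moves forward
        self._in_use = set()

    def acquire(self) -> int:
        size = len(self._ports)
        for _ in range(size):
            port = self._ports[self._cursor % size]
            self._cursor += 1
            if port not in self._in_use:
                self._in_use.add(port)
                return port
        raise RuntimeError("no free ports in range")

    def release(self, port: int) -> None:
        self._in_use.discard(port)
```

After acquiring and releasing 14000, the next `acquire()` returns 14001; 14000 only comes back once the cursor has walked the rest of the range, which gives the old mapping time to tear down.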
Decoupled Build-Queue Depth (fixes ~30% generation rejections under load)
- Pending-queue depth was tied to `build_workers` (`workers * 4` = 8), so a 12+-way client got ~30% of generations rejected with `queue_full`, even though only `build_workers` builds ever run at once.
- New `CPG_QUEUE_MAXSIZE` (default 64) sizes only the waiting room; concurrent builds, and thus build memory, stay capped at `CPG_BUILD_WORKERS`. Raising it does not increase memory. `<=0` falls back to the old `build_workers * 4`.
Event-Loop-Managed Server Restarts (fixes the restart-fail churn)
- A "ready" codebase whose Joern server had died entered a retry → fail → repeat loop. Sync MCP tools (e.g. `get_cpg_status`) run in worker threads with no running event loop, so background restarts had nowhere to schedule.
- The main server loop is now captured at startup; sync tools schedule Joern server restarts onto it via `run_coroutine_threadsafe`. If a reload fails, the codebase is marked FAILED instead of left "ready with a dead server." New tests cover zombie/restart scenarios.
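The scheduling pattern can be shown with the standard library alone. In this sketch `restart_server` and `sync_tool` are hypothetical stand-ins; the sync side waits for the result to keep the example observable, whereas a real tool might schedule the restart and return immediately.

```python
import asyncio

_main_loop = None  # captured once, when the server starts


async def restart_server(codebase_hash: str) -> str:
    """Stand-in for the real asynchronous Joern server restart."""
    await asyncio.sleep(0)
    return f"restarted:{codebase_hash}"


def sync_tool(codebase_hash: str) -> str:
    """Runs in a worker thread that has no event loop of its own."""
    future = asyncio.run_coroutine_threadsafe(restart_server(codebase_hash), _main_loop)
    return future.result(timeout=5)


async def main() -> str:
    global _main_loop
    _main_loop = asyncio.get_running_loop()               # capture at startup
    return await asyncio.to_thread(sync_tool, "abc123")   # simulate a sync MCP tool call


result = asyncio.run(main())
```

The key point is that the worker thread never creates a loop; it hands the coroutine to the loop that was captured at startup.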
Classified, API-Visible Build Failures
- `get_cpg_status` now surfaces the failure cause (`error_code` + human-readable `error`) on a failed build, instead of a bare `"failed"` that forced digging through container logs.
- New `_classify_cpg_build_failure` distinguishes an out-of-memory build (the dominant large-project failure) from a generic frontend error and from timeouts, with the frontend output tail attached.
Large-Project Guard (opt-out for batch drivers)
- `generate_cpg` now declines a local source above a size/LOC threshold, returning a `large_project_warning` instead of silently committing to a giant full-project build, unless the caller passes `force=True`.
- Thresholds are deliberately high (default 2 GB / 2 M LOC) so only enormous trees warn. New knobs: `CPG_LARGE_PROJECT_GUARD` (set `false` for unattended/batch/eval harnesses that always intend to build and can't pass `force` per call), `CPG_LARGE_PROJECT_MAX_MB`, and `CPG_LARGE_PROJECT_MAX_LOC`, wired through `config.example.yaml`, `.env.example`, and `docker-compose.yml`.
Symlink-Safe Local Copy
- New `_copy_local_source_tree` skips symlinks that escape the source root when staging a local source, closing a path-escape gap in local-source ingestion.
🐳 Deployment & Infrastructure
- `docker-compose.yml` now surfaces the build-sizing and scale knobs inline: `CPG_BUILD_WORKERS` (default 4), `CPG_BUILD_HEAP_GB` (default 6), `MAX_MCP_CONNECTIONS` (default 16, 503 past it), `MAX_REPO_SIZE_MB` (default 1024), plus the large-project-guard vars.
- Pool-mode invariant unchanged: `CPG_BUILD_WORKERS * CPG_BUILD_HEAP_GB ≤ JOERN_MEM_LIMIT` (the build container's cap). Run `python scripts/recommend_config.py` before launching on a new host.
🧪 Testing
- New suites for collision-free project naming (`test_joern_client_load.py`), server restart / zombie handling (`test_restart_scheduling.py`), large-project guard behavior (`test_cpg_generator.py`, `test_mcp_tools.py`), symlink-safe copy, and worker-pool port rotation (`test_worker_pool.py`). 503 tests collected (~77 new).
⚠️ Notes
- This remains a beta release.
- Dedicate the host to codebadger — the MCP container mounts the Docker socket (root-equivalent on the host) and uses host networking. The MCP HTTP endpoint has no built-in auth; front it with a reverse proxy / network policy.
Full Changelog: v0.4.1-beta...v0.4.2-beta
v0.4.1-beta
🚀 codebadger — v0.4.1-beta
🎯 Highlights
v0.4.1-beta turns the v0.4.0 scalability redesign into a production-ready, hardened deployment. The whole stack — including the MCP server itself — now comes up from a single docker compose; Postgres + Redis are required defaults (SQLite is gone); there's a real /health endpoint for orchestrators; idle codebases are evicted to survive long-running batches; you can analyze pasted code snippets, not just repos and paths; and every LLM-supplied input now passes a thorough validation + security layer backed by a documented threat model.
⚠️ Breaking Changes
- Postgres + Redis are now required (and the default). SQLite and the in-process coordinator have been removed. This reverses the v0.4.0 "optional" note: the server now fails fast on a missing/unreachable Postgres or Redis. Stand the stack up with `./scripts/deploy.sh` (or `docker compose up -d`).
- `pgdata` relocated out of `playground/` → now `./pgdata` (override with `POSTGRES_DATA_PATH`) so Joern workers can't reach the database files. Migration: while stopped, `mv playground/pgdata ./pgdata` to preserve an existing catalog, or start fresh.
- redis-py bumped to 8.x (from 5.x). All in-code usage is stable API; review any custom Redis integrations.
📦 What's New
One-Command Full-Stack Deployment (fixes #21)
- New `Dockerfile.mcp` containerizes the MCP server; `docker compose up -d` now brings up MCP + Joern + Postgres + Redis together.
- The MCP drives the host Docker daemon (Docker-out-of-Docker via the mounted socket) and uses host networking, so it builds CPGs and spawns per-CPG pool workers as sibling containers, with no app code changes required.
- New `scripts/deploy.sh` (`up`/`down`/`restart`/`logs`/`status`) builds, launches, exports an absolute playground path for pool mounts, and polls `/health`.
Production /health Endpoint (fixes #20)
- Reports `status` (`up`/`partial`/`down`), an `mcp: "codebadger"` field, and a `dependencies` map (joern, postgres, redis, docker, cpg_queue). It returns 200 for up/partial and 503 for down, so it works directly as a liveness/readiness probe.
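A sketch of how such a response could be assembled. The status-to-HTTP mapping (200 for up/partial, 503 for down) comes from the notes above; the aggregation rule (all dependencies up → `up`, none → `down`, otherwise `partial`) and the per-dependency value strings are guesses made for illustration, as is the `summarize_health` name.

```python
def summarize_health(dependencies: dict) -> tuple:
    """Map per-dependency booleans to (http_status_code, response_body)."""
    healthy = [name for name, ok in dependencies.items() if ok]
    # Assumed aggregation rule; the real endpoint may weigh dependencies differently.
    if len(healthy) == len(dependencies):
        status = "up"
    elif healthy:
        status = "partial"
    else:
        status = "down"
    body = {
        "status": status,
        "mcp": "codebadger",
        "dependencies": {name: ("up" if ok else "down")
                         for name, ok in dependencies.items()},
    }
    return (503 if status == "down" else 200), body
```

An orchestrator probe then only needs the status code, while a human or agent can read the `dependencies` map to see which service is degraded.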
Idle CPG Eviction (memory safety on long runs)
- Servers idle beyond `JOERN_IDLE_TTL_SECONDS` (default 600s) are offloaded by a background reaper and auto-wake on the next query, fixing the long-run idle-worker leak that could exhaust RAM/swap.
- Explicit `JOERN_MEMORY_BUDGET_MB` budgeting plus per-run file logging (`logs/codebadger-<ts>-<pid>.log` + a `codebadger-latest.log` symlink).
Code Snippet Analysis (fixes #19)
- `generate_cpg(source_type="snippet", code=..., language=...)` analyzes code pasted straight into the chat, with no repo or path needed. Staged like any other source, with content-hash dedup so re-pasting the same code reuses the cached CPG.
Security: Threat Model + Hardening (see docs/security.md)
- New `docs/security.md`: threat model, trust boundaries (Mermaid), the controls we provide, and a production hardening checklist.
- Every LLM-supplied input is now validated: `codebase_hash`, `language`, `branch` & `github_token` (anti arg-/URL-injection, e.g. blocks `--upload-pack`), snippet `code`/`filename`/`label`, and regex filters (length + ReDoS-shape guard).
- `run_cpgql_query` now enforces the CPGQL blocklist by default (process exec, file read/write, network, dynamic dependency load, reflection); previously the blocklist was never wired in. Patterns expanded after an audit. (Defense-in-depth; the real boundary is the Joern worker sandbox.)
- Resource caps at the executor: query timeout ≤ 300s, ≤ 10 000 rows, ≤ 5 MB output, ≤ 5 000-line snippet spans, and clamped `take(n)`/`depth`, with visible `truncated` flags (never silent).
- Filesystem hardening: symlink-safe local copy (no dereferencing escapes), `realpath`/`commonpath` confinement, a deletion guard on `remove_cpg`, and host-path redaction in client-facing errors.
📚 Documentation
- New `docs/security.md` (threat model + diagrams), linked from both READMEs.
- `docs/installation.md` and `docs/deployment.md` rewritten as clear, step-by-step guides (full-stack and host-dev paths, day-2 ops).
- New `docs/available-tools.md` cataloguing every MCP tool; snippet flow documented across usage/deployment.
🐳 Deployment & Infrastructure
- `Dockerfile.mcp` (new): `python:3.13-slim` + git + Docker CLI 29.5.3 (client only).
- Dependency bumps: `fastmcp>=3.4.2`, `mcp>=1.27.2`, `aiohttp>=3.14.1`, `uvicorn>=0.49.0`, `psycopg[binary]>=3.3.4`, `redis>=8.0.0`.
- Backing-service URLs now resolve from env (`DATABASE_URL` / `REDIS_URL`, or the component `POSTGRES_*` / `REDIS_*` vars compose uses) instead of hardcoded defaults, so a host-run MCP honors `POSTGRES_PORT` / `REDIS_PORT` overrides.
- Postgres on 55432 / Redis on 56379 (overridable); updated `.env.example`, `docker-compose.yml`, `.dockerignore`, and `cleanup.sh`.
🛠️ Tooling
- `get_cpg_status` description now explicitly documents it as the way to wait for `generate_cpg`: poll with the returned `codebase_hash` until `ready`/`failed` (fixes #22).
🧪 Testing
- New/expanded suites for snippets, the security validators (CPGQL blocklist + bypasses, ReDoS guard, error redaction), executor resource caps & truncation, malformed-hash rejection, and `QueryLoader` numeric clamps. 426 passing / 24 skipped.
⚠️ Notes
- This remains a beta release.
- Dedicate the host to Codebadger — the MCP container mounts the Docker socket (root-equivalent on the host) and uses host networking. The MCP HTTP endpoint has no built-in auth; front it with a reverse proxy / network policy.
- Pool-mode invariant still applies: `build container mem_limit + memory_budget_mb ≤ total Joern budget`. Run `python scripts/recommend_config.py` before launching on a new host.
Full Changelog: v0.4.0-beta...v0.4.1-beta
v0.4.0-beta
🚀 Codebadger Release — v0.4.0-beta
🎯 Highlights
This release is a major scalability redesign. Codebadger can now analyze large batches of codebases without OOM-killing the host, run as multiple coordinated processes against a shared store, and is backed by brand-new end-to-end documentation. RAM — not a fixed server count — is now the binding constraint, and the system manages it explicitly.
📦 What's New
Memory-Aware Server Admission
- Replaced the fixed query-server count with a real memory budget (`memory_budget_mb`); the sum of per-CPG heap reservations is the true concurrency limit.
- CPG size tiers (S/M/L/XL): each Joern server's JVM heap is sized to the CPG's on-disk `.bin` size, so a batch of small CPGs runs far more servers concurrently than a few large ones.
- LRU eviction + RSS backstop: least-recently-used servers are slept to make room, and an RSS-pressure backstop evicts before the kernel OOM-kills.
- Generate-ahead, sleep-on-idle: sleeping servers cost no RAM and wake cheaply by re-loading the disk-cached CPG on the next query.
- Auto-tuned budgets derived from host RAM, with a startup over-commit guard that clamps (and warns) rather than letting the build and query pools jointly over-commit the host.
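A toy model of budget-based admission with LRU sleeping, to make the mechanism above concrete. `MemoryBudget` and its methods are hypothetical; tier selection, the RSS backstop, and the actual server lifecycle are left out.

```python
from collections import OrderedDict


class MemoryBudget:
    """Admit servers by reserved heap; sleep least-recently-used ones to make room."""

    def __init__(self, budget_mb: int) -> None:
        self.budget_mb = budget_mb
        self._awake = OrderedDict()  # codebase_hash -> reserved heap MB, oldest first

    def used_mb(self) -> int:
        return sum(self._awake.values())

    def touch(self, codebase_hash: str) -> None:
        """Record a query so the server counts as recently used."""
        self._awake.move_to_end(codebase_hash)

    def admit(self, codebase_hash: str, heap_mb: int) -> list:
        """Reserve heap for a server and return the hashes slept to make room."""
        if heap_mb > self.budget_mb:
            raise ValueError("reservation exceeds the whole budget")
        if codebase_hash in self._awake:
            self.touch(codebase_hash)
            return []
        slept = []
        while self.used_mb() + heap_mb > self.budget_mb:
            victim, _ = self._awake.popitem(last=False)  # least recently used
            slept.append(victim)
        self._awake[codebase_hash] = heap_mb
        return slept
```

With an 8192 MB budget, four 2048 MB servers fit at once but only two 4096 MB servers do, which is the sense in which RAM rather than a server count becomes the limit.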
Pool Worker Mode
- New `JOERN_WORKER_MODE=pool` runs each CPG's query server in its own cgroup-capped container, so an OOM kills one worker instead of cascading across every server.
- In `shared` mode (default), query servers remain processes inside the single Joern container.
Durable Job Queue
- A DB-backed jobs table replaces the lossy in-memory queue: it survives restarts, dedups (one active job per CPG via a partial unique index), and applies real backpressure (`FOR UPDATE SKIP LOCKED`) instead of silently dropping work.
Postgres + Redis (Multi-Process Operation)
- `DATABASE_URL` swaps the entire store (catalog, tool cache, findings, and the job queue) into Postgres, enabling genuine multi-process operation.
- `REDIS_URL` makes the per-CPG query lock and shared pool state (reservation ledger, warm-worker registry, global LRU, per-CPG spawn lock) cross-process.
- New `PostgresDBManager` / `PostgresJobStore`, plus `RedisCoordinator` and `RedisPoolStore`, with lazy imports so Postgres/Redis are only required when configured.
- Multiple API/scheduler processes can now share one budget, catalog, and queue.
Configuration Recommender
- New `scripts/recommend_config.py` (and `src/utils/recommend.py`) prints a memory-aware config recommendation for the current or a hypothetical host (`--mem`, `--cores`, `--worker-mode`, `--compare`).
- The same recommendation logs at startup (toggle with `RECOMMEND_ON_STARTUP`).
📚 Documentation
- Brand-new `docs/` set: Architecture, Deployment, Configuration, Installation, Usage, Custom Tools, Contributing, and a Roadmap.
- Architecture, deployment, and lifecycle Mermaid diagrams (query flow with auto-wake, CPG/server lifecycle, memory-aware admission, topologies, `shared` vs `pool`).
- Slimmed-down README that points into the new docs; the old `CUSTOM_TOOLS_GUIDE.md` content moved into `docs/custom-tools.md`.
🐳 Deployment & Infrastructure
- `docker-compose.yml` now offers Postgres and Redis behind Compose profiles; a plain `docker compose up -d` starts Joern only.
- Postgres publishes on 55432 and Redis on 56379 (non-default ports) to avoid clashing with system services; override via `POSTGRES_PORT` / `REDIS_PORT`.
- New optional dependencies: `psycopg[binary]>=3.1` (Postgres) and `redis>=5.0` (coordination).
- Updated `.env.example`, `config.example.yaml`, `Dockerfile`, and `cleanup.sh` for the new topology.
⚙️ Stability & Performance
- Substantially reworked JoernServerManager (spawn / sleep / evict / heap-sizing) as the heart of memory-aware scheduling.
- QueryExecutor serializes requests per CPG, caches successful results, and triggers auto-wake.
- New Coordinator abstraction for per-CPG locks (in-process by default, Redis-backed when configured).
- Hardened DB concurrency and storage handling across SQLite and Postgres.
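The per-CPG locking and result caching described above can be pictured with a small in-process sketch. The class and method names below (InProcessCoordinator, cpg_lock, CachingExecutor) are hypothetical stand-ins, not the project's actual Coordinator or QueryExecutor API, and auto-wake is left out entirely.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class InProcessCoordinator:
    """Hands out one lock per CPG identifier (hypothetical sketch)."""

    def __init__(self):
        self._guard = threading.Lock()              # protects the lock table
        self._locks = defaultdict(threading.Lock)   # cpg_id -> lock

    @contextmanager
    def cpg_lock(self, cpg_id: str):
        with self._guard:
            lock = self._locks[cpg_id]
        with lock:  # serializes work on this CPG only; other CPGs proceed
            yield


class CachingExecutor:
    """Runs queries one at a time per CPG and caches successful results."""

    def __init__(self, coordinator, run_query):
        self._coordinator = coordinator
        self._run_query = run_query  # callable(cpg_id, query) -> result
        self._cache = {}

    def execute(self, cpg_id: str, query: str):
        key = (cpg_id, query)
        with self._coordinator.cpg_lock(cpg_id):
            if key in self._cache:
                return self._cache[key]
            # If the query raises, the exception propagates and nothing is cached.
            result = self._run_query(cpg_id, query)
            self._cache[key] = result
            return result
```

A Redis-backed variant would presumably expose the same cpg_lock interface while holding the lock in Redis, which is what would make the executor code independent of the deployment topology.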
🧪 Testing
- New suites covering the redesign: test_coordination, test_db_manager, test_durable_queue, test_memory_admission, test_pool_memory_guard, test_pool_store_redis, test_postgres_db_manager, test_postgres_job_store, test_worker_pool.
- Expanded test_main and test_program_slice coverage.
⚠️ Notes
- This remains a beta release.
- Postgres and Redis are optional: the default single-process setup still uses SQLite and needs only the Joern container. Set DATABASE_URL / REDIS_URL to opt into shared, multi-process operation.
- Pool-mode invariant: build container mem_limit + memory_budget_mb ≤ total Joern budget. Review your Compose memory limits before enabling pool mode.
- Leave memory_budget_mb / rss_eviction_threshold_mb at 0 to auto-derive from host RAM; run python scripts/recommend_config.py before launching on a new host.
- Review custom deployments after upgrading: Compose profiles, new ports, and the optional dependencies may affect existing integrations.
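The pool-mode invariant is simple arithmetic, and a pre-flight check along these lines may help when reviewing Compose limits. The function name and the assumption that all three values are expressed in megabytes are illustrative; a memory_budget_mb of 0 means "auto-derive", so the check only makes sense on the resolved value.

```python
def pool_mode_fits(build_mem_limit_mb: int,
                   memory_budget_mb: int,
                   total_joern_budget_mb: int) -> bool:
    """Return True when build container limit + pool budget fit the total."""
    return build_mem_limit_mb + memory_budget_mb <= total_joern_budget_mb


# 4096 + 8192 = 12288, which fits in a 16384 MB total budget.
fits = pool_mode_fits(4096, 8192, 16384)

# 8192 + 12288 = 20480, which exceeds the same 16384 MB total budget.
overcommitted = not pool_mode_fits(8192, 12288, 16384)
```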
Full Changelog: v0.3.9-beta...v0.4.0-beta
v0.3.9-beta
🚀 Codebadger Release — v0.3.9-beta
Released Jun 5, 2026
📦 What's New
Improved Configuration Management
- Refactored configuration handling for greater consistency and maintainability.
- Added automatic type coercion for configuration values.
- Improved session status tracking and lifecycle management.
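The notes do not say how the type coercion works. One common approach, shown here purely as an assumption, is to coerce a raw string (for example from an environment variable) to the type of the setting's default value. The coerce helper is hypothetical.

```python
def coerce(raw: str, default):
    """Coerce a raw string to the type of `default` (illustrative only)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(default, bool):
        lowered = raw.strip().lower()
        if lowered in {"1", "true", "yes", "on"}:
            return True
        if lowered in {"0", "false", "no", "off"}:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw  # strings and unknown types pass through unchanged
```

Raising on an unrecognized boolean, rather than silently treating it as false, keeps a typo in a deployment file from flipping a setting unnoticed.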
Enhanced CPG Generation
- Added configurable queue size limits to prevent resource exhaustion during CPG generation.
- Introduced warnings when analyzing exceptionally large projects.
- Improved CPG generation workflow and loading reliability.
- Added timeout configuration options for CPG loading operations.
Better Joern Integration
- Refactored Joern server management to use a shared restart callback mechanism.
- Improved startup, restart, and recovery behavior across analysis sessions.
🛡️ Security Improvements
Credential Safety
- Removed embedded Git credentials from repository configuration after cloning operations.
- Reduced risk of accidental credential persistence and exposure.
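To make the credential cleanup concrete, the sketch below strips the user-info part from a remote URL using only the standard library. It illustrates the idea and is not the project's implementation; strip_credentials is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(remote_url: str) -> str:
    """Return the URL without any embedded username or password."""
    parts = urlsplit(remote_url)
    if parts.username is None and parts.password is None:
        return remote_url  # nothing embedded (also covers scp-style remotes)
    host = parts.hostname or ""
    if ":" in host:                 # IPv6 literal: restore the brackets
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

After cloning with a tokenized URL, the cleaned value could then be written back with git remote set-url origin so the token no longer sits in the repository's .git/config.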
Query Validation Hardening
- Expanded CPGQL validation with more comprehensive dangerous-pattern detection.
- Improved validation error reporting to help users identify unsafe queries.
Path Validation
- Strengthened path validation throughout the analysis pipeline.
- Improved handling of filesystem edge cases and invalid paths.
⚙️ Stability & Performance
Database & Concurrency
- Improved DBManager concurrency handling.
- Reduced contention during parallel analysis operations.
- Enhanced overall system stability under load.
Resource & Port Management
- Streamlined port manager initialization.
- Improved service startup reliability.
- Enhanced graceful shutdown behavior and Docker container lifecycle management.
Async Processing
- Improved asynchronous handling across CPG generation and loading workflows.
- Reduced potential race conditions and timeout-related failures.
🔍 Analysis Improvements
- Improved program slicing to better account for operators and macro expansions.
- Added safe Scala string escaping utilities for query rendering.
- Enhanced query handling and execution reliability.
- Multiple correctness fixes across analysis and detection workflows.
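For the Scala string escaping mentioned above, a minimal sketch of the technique is to render a Python string as a double-quoted Scala literal, escaping backslashes, quotes, and control characters so user input cannot terminate the literal inside a rendered query. The helper name scala_string_literal is hypothetical and the project's own utilities may differ.

```python
# Escape sequences that Scala accepts inside a double-quoted string literal.
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}


def scala_string_literal(value: str) -> str:
    """Render `value` as a double-quoted Scala string literal."""
    out = []
    for ch in value:
        if ch in _ESCAPES:
            out.append(_ESCAPES[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"\\u{ord(ch):04x}")  # remaining control characters
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'
```

Escaping the backslash first-class (rather than only the quote) matters: otherwise an input ending in a backslash would swallow the closing quote of the literal.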
🧹 Refactors & Maintenance
- Refactored internal service management components.
- Improved code organization around configuration, session management, and analysis services.
- Upgraded project dependencies.
- General cleanup and maintainability improvements across the codebase.
🧪 Testing
- Expanded validation test coverage.
- Added tests for updated query validation logic.
- Improved reliability testing for configuration and service lifecycle changes.
⚠️ Notes
- This remains a beta release.
- Query validation is now stricter and may reject patterns that were previously accepted.
- Users running large codebases may notice new warnings designed to highlight potentially long-running analysis jobs.
- Configuration handling has been modernized; review custom deployments and integrations after upgrading.
Full Changelog: v0.3.8-beta...v0.3.9-beta