Releases: Lekssays/codebadger

v0.6.1-beta

24 Jun 15:30

Patch release.

Fixed

  • Worker reachability. The compose host-port publish now keys off MCP_PUBLISH_HOST (default 127.0.0.1, loopback-only) and MCP_PORT, which drives both the in-container listen port and the published host port. Pool workers now log their reachability mode and warn when JOERN_DOCKER_NETWORK is empty while the MCP runs containerized — the misconfig that made workers time out despite a healthy JVM.

Docs

  • .env.example and docker-compose.yml clarify the distinction between MCP_HOST (in-container bind) and MCP_PUBLISH_HOST (host publish interface).

v0.6.0-beta

23 Jun 12:00

🦡 codebadger — v0.6.0-beta

Agent-usability release on top of v0.5.1/v0.5.2-beta, driven by feedback from a large agent-run across ~70 C/C++ CVEs. The analysis primitives were already solid; this release closes the gaps that actually slowed agents down — truthful build/load state (no more silent hangs or invisible empties), better coverage (indirect dispatch, gated code, large repos), and a set of frontend-aware build options so high-fidelity parsing is one parameter (or zero) away.

No breaking changes from v0.5.x.

🔭 Truthful state in get_cpg_status

  • generating timeout reconciliation. A build whose worker died (process restart, OOM kill, lost in-memory job) no longer sits in generating forever. Each build stamps a deadline (generation_timeout + generation_deadline_grace); a status poll past the deadline with no live worker is reconciled to FAILED with error_code=GENERATION_TIMEOUT. A still-queued/running build is never condemned (the liveness probe fails safe).
  • Progress telemetry. Responses now carry phase (queued → frontend → loading → ready), elapsed_seconds, deadline_seconds (remaining budget), and queue_position, so a poller can tell "queued behind others" from "actively parsing" from "wedged" instead of staring at a bare status.
  • Coverage sanity check. user_method_count (the verified user-defined method count from load) is surfaced, so a near-empty build is obvious immediately.
  • codebase_label. A stable, non-sensitive <project>@<short-hash> ties a hash back to what it built, despite redacted paths.
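
The deadline reconciliation described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `BuildRecord` type, the `reconcile` function, the lowercase status strings, and the three-valued `worker_alive` argument are all assumptions made for the example. Only `generation_timeout`, `generation_deadline_grace`, and `GENERATION_TIMEOUT` come from the notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildRecord:
    """Hypothetical status record; field names are illustrative."""
    status: str                       # e.g. "generating", "ready", "failed"
    started_at: float                 # epoch seconds when the build was stamped
    generation_timeout: float         # seconds
    generation_deadline_grace: float  # seconds
    error_code: Optional[str] = None


def reconcile(record: BuildRecord, now: float, worker_alive: Optional[bool]) -> BuildRecord:
    """Flip a build to failed only when it is past its deadline AND provably dead.

    worker_alive is None when the liveness probe could not answer; that case is
    treated like "alive", so a queued or running build is never condemned.
    """
    if record.status != "generating":
        return record
    deadline = record.started_at + record.generation_timeout + record.generation_deadline_grace
    if now > deadline and worker_alive is False:
        record.status = "failed"
        record.error_code = "GENERATION_TIMEOUT"
    return record
```

The point of the three-valued probe is that an unanswered probe and a negative probe are different facts; only the latter justifies reconciling the record.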

🛠️ New tool: get_backend_status

Read-only backend introspection for self-pacing: build_workers, recommended_max_concurrent_builds, queue depth / in-flight, active vs. max Joern servers, the memory-admission ledger, cpgs_on_disk / disk_mb, and a per-CPG list. Agents (and orchestrators) can now size their fan-out instead of melting the backend by trial and error.
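
As a sketch of how an orchestrator might use this for self-pacing: the snippet below bounds concurrent builds with a semaphore sized from `recommended_max_concurrent_builds`. It assumes the tool's response has already been parsed into a dict containing that key; `build_all` and `build_one` are hypothetical names, and the real response shape may differ.

```python
import asyncio


async def build_all(sources, backend_status, build_one):
    """Run build_one(source) for every source, never exceeding the backend's hint.

    backend_status: dict assumed to hold "recommended_max_concurrent_builds".
    build_one: caller-supplied coroutine function (hypothetical) that builds one source.
    """
    limit = max(1, int(backend_status.get("recommended_max_concurrent_builds", 1)))
    gate = asyncio.Semaphore(limit)

    async def guarded(source):
        async with gate:  # at most `limit` builds are in flight at once
            return await build_one(source)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(s) for s in sources))
```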

🧩 Coverage gaps closed

  • Indirect / virtual / callback callers. When get_call_graph(..., incoming) finds 0 direct edges, it now surfaces every site that takes the method's address (function pointer, callback registration, vtable entry) as the likely caller — turning dead-ends (png_safe_execute, libtiff img->put, registered read callbacks, …) into leads. Clearly labelled as a heuristic.
  • Gated-body warnings. get_call_graph / get_program_slice / get_variable_flow now warn when the target method resolves but has no body (0 calls, ≤1-line span — the #ifdef/feature-gated signature), pointing at defines=[…] / include_paths=[…] instead of silently returning empty.
  • Slice token budget. get_program_slice clips per-node code (one macro-expanded statement could blow the response) and caps node counts with an explicit TRUNCATED note, so a macro-heavy sink no longer forces the agent to abandon the slice.
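
The slice token budget can be pictured as two independent clamps, one per node and one on the node count. The sketch below is illustrative only: `clip_slice`, the dict layout, and the default limits are assumptions, not the values codebadger uses.

```python
def clip_slice(nodes, max_nodes=200, max_code_chars=240):
    """Clip per-node code and cap the node count, flagging any truncation.

    nodes: list of dicts with at least a "code" key (layout is assumed).
    """
    clipped = []
    for node in nodes[:max_nodes]:
        code = node["code"]
        if len(code) > max_code_chars:
            # one macro-expanded statement should not dominate the response
            code = code[:max_code_chars] + "..."
        clipped.append({**node, "code": code})

    result = {"nodes": clipped, "truncated": len(nodes) > max_nodes}
    if result["truncated"]:
        result["note"] = f"TRUNCATED: showing {max_nodes} of {len(nodes)} nodes"
    return result
```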

⚙️ Frontend-aware build options on generate_cpg

Every new flag is gated by a per-frontend capability table (FRONTEND_CAPABILITIES), so a flag is passed only to a frontend that accepts it — handing a C-only flag to pysrc2cpg/jssrc2cpg/… can no longer crash a build. (--exclude-regex is universal; --include/--compilation-database/auto-discovery are c2cpg-only; --define is c2cpg + swift.)

  • include_globs — scoped large-repo builds (all languages). Analyze a subset of a big repo (['libavcodec/**','libavutil/**']) without re-rooting source_path and losing cross-directory header/macro resolution. The repo root stays the parse base; only out-of-scope source TUs are skipped — headers stay includable.
  • auto_system_headers (C/C++, opt-in). Enables c2cpg --with-include-auto-discovery so libc/STL headers resolve and stop dropping whole files / gated bodies when coverage looks thin.
  • compile_commands (C/C++): highest fidelity, with auto-detect. Point at a compile_commands.json for exact per-file -I/-D/-std. The DB's absolute build-machine paths are auto-rebased onto the analyzed copy. If you don't pass one, a compile_commands.json shipped in the source (root, build/, out/, …) is detected and used automatically (CPG_AUTODETECT_COMPILE_DB, default on). Best-effort: if it can't be applied the build proceeds and logs why.
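
A capability table of the kind described above could look like the following sketch. The flag-to-frontend mapping follows the notes; the exact dictionary layout, the `swiftsrc2cpg` key, and the `build_frontend_args` helper are assumptions for illustration.

```python
# Which flags each frontend accepts (mapping per the release notes;
# the table layout and the swift frontend key are assumed).
FRONTEND_CAPABILITIES = {
    "c2cpg": {
        "--exclude-regex",
        "--include",
        "--define",
        "--compilation-database",
        "--with-include-auto-discovery",
    },
    "swiftsrc2cpg": {"--exclude-regex", "--define"},
    "pysrc2cpg": {"--exclude-regex"},
    "jssrc2cpg": {"--exclude-regex"},
}


def build_frontend_args(frontend, requested):
    """Keep only the flags the frontend accepts.

    requested: list of (flag, value) pairs; value is None for boolean flags.
    Returns (args, skipped) so the caller can log what was dropped.
    """
    allowed = FRONTEND_CAPABILITIES.get(frontend, set())
    args, skipped = [], []
    for flag, value in requested:
        if flag not in allowed:
            skipped.append(flag)
            continue
        args.append(flag)
        if value is not None:
            args.append(value)
    return args, skipped
```

Returning the skipped flags rather than raising keeps a mixed-language caller from failing a build over an option that simply does not apply.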

💾 Lifecycle & resource hygiene

  • Load-size guard (2 GB). A built cpg.bin above CPG_MAX_LOAD_MB (default 2048) fails fast with CPG_TOO_LARGE and scoping guidance, instead of the opaque "failed to reload into a Joern server" after a long stall.
  • Cold-CPG GC (evict-only). A background sweep releases the allocations (server process, port, memory reservation) of CPGs gone cold and marks them SLEEPING; the cpg.bin is kept on disk and reloads transparently on the next query. Disk deletion is strictly opt-in (CPG_GC_DELETE_COLD, default off).

🐳 Containerization

  • Docker networking for pooled workers. Joern server/worker networking hardened so pool workers can be reached by container name on an internal Docker network (no published host port required).

✅ Quality

  • New unit + integration coverage for status reconciliation/progress, get_backend_status, the gated/indirect/slice query changes, frontend capability gating, include_globs scoping, and compile_commands rebasing/auto-detect. The query changes were also validated live against a running Joern (gated warning, indirect-caller, slice, and scoping all confirmed end-to-end).

v0.5.2-beta

18 Jun 13:57

🔧 codebadger — v0.5.2-beta

Feature + hardening release on top of v0.5.1-beta. Adds caller-supplied C/C++ build options (with auto include-detection), makes the CPG cache key correct for build options and remote branches, fixes two query-output corruption bugs, and rebuilds the integration test corpus into a realistic codebase that exercises the detectors end-to-end.

No breaking changes from v0.5.1-beta.

✨ New

  • C/C++ build options on generate_cpg. New optional include_paths and defines parameters are passed through to c2cpg. Without them, angle-includes of generated headers (e.g. <libxml/xmlversion.h>) don't resolve and feature macros stay undefined, so #ifdef-gated modules are silently dropped from the CPG. Entries are validated/normalized first (_sanitize_build_opt_list): control characters are rejected and relative include paths may not contain .. (so a joined path can't escape the source root); absolute paths pass through.
  • Automatic include-dir discovery. When no include paths are supplied, _autodetect_c_includes seeds a lightweight C/C++ search path — the source root, any include/ directory, and any directory that directly contains config.h or a generated *version*.h — so common angle-includes resolve out of the box.
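
The validation rules for build options (reject control characters, forbid `..` in relative include paths, let absolute paths through) might be sketched like this. It is an approximation written from the description, not the body of `_sanitize_build_opt_list`; the function name, the `is_path` switch, and the error messages are assumptions.

```python
import posixpath


def sanitize_build_opt_list(values, *, is_path=False):
    """Validate and normalize caller-supplied include paths or defines."""
    cleaned = []
    for raw in values:
        # Reject before stripping, so an embedded newline is never silently dropped.
        if any(ord(ch) < 32 or ord(ch) == 127 for ch in raw):
            raise ValueError(f"control character in build option: {raw!r}")
        value = raw.strip()
        if not value:
            continue
        if is_path and not posixpath.isabs(value):
            # A relative path containing ".." could escape the source root once joined.
            if ".." in value.split("/"):
                raise ValueError(f"relative include path may not contain '..': {raw!r}")
            value = posixpath.normpath(value)
        cleaned.append(value)
    return cleaned
```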

🩹 Fixes

  • run_cpgql_query char-explosion. Query normalization now parenthesises the base query before appending .take/.toJsonPretty/.toString. Previously a ++ chain (e.g. a.l ++ b.l ++ c.name.toJsonPretty) bound the tail to its last operand only, so the result came back as an exploded String instead of JSON. Queries that already self-emit (a <codebadger_result> envelope, or an explicit .toJson/.toJsonPretty/.toString tail) are left untouched.
  • get_variable_flow output corruption. The variable-flow query emits a self-delimiting <codebadger_result>…</codebadger_result> envelope that _parse_output extracts; the executor no longer wraps/trims its tail, which had been corrupting the payload.
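
The parenthesisation fix is easiest to see as a string transformation. The sketch below assumes a helper named `normalize_query` and an optional `limit` argument; the real normalizer may differ in what it appends and when.

```python
SELF_EMITTING_TAILS = (".toJson", ".toJsonPretty", ".toString")


def normalize_query(query, limit=None):
    """Wrap the base query in parentheses before appending a tail.

    Without the parentheses, `a.l ++ b.l ++ c.name.l` followed by `.toJsonPretty`
    would apply the tail to `c.name.l` only, because method calls bind tighter
    than `++`.
    """
    q = query.strip()
    # Queries that already emit their own output are left untouched.
    if "<codebadger_result>" in q or q.endswith(SELF_EMITTING_TAILS):
        return q
    tail = f".take({limit})" if limit is not None else ""
    return f"({q}){tail}.toJsonPretty"
```

For example, `normalize_query("a.l ++ b.l ++ c.name.l")` yields `(a.l ++ b.l ++ c.name.l).toJsonPretty`, so the serializer sees the whole concatenated list.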

🔁 Cache correctness

  • Build options are part of the cache key. Caller-supplied include paths / defines change the produced CPG, so they're now folded into the cache key — two builds of the same source with different c2cpg flags no longer collide on one graph.
  • Remote branch is part of the cache key. For github sources a requested branch now keys the CPG, so two branches of the same repo can't collide (the second request previously reused the first branch's graph). Default branch (None) leaves the key unchanged for back-compat.
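
One way to fold build options and the branch into a cache key while keeping old keys stable is to add those fields only when they are set. The sketch below is an assumption about the approach, not the project's key format: `cpg_cache_key`, the JSON payload, and the 16-character digest are all illustrative.

```python
import hashlib
import json


def cpg_cache_key(source_id, include_paths=(), defines=(), branch=None):
    """Derive a cache key that changes whenever the produced CPG would change."""
    parts = {"source": source_id}
    if include_paths:
        parts["include_paths"] = list(include_paths)  # order kept: search order matters
    if defines:
        parts["defines"] = list(defines)
    if branch is not None:
        # The default branch (None) adds nothing, so pre-existing keys stay valid.
        parts["branch"] = branch
    payload = json.dumps(parts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```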

🐳 Containerized MCP

  • Host-path resolution for containerized deployments. resolve_host_path(host_path, require_local_access=True) underpins local sources. When the MCP is containerized and the path lives on the host filesystem, the caller copies the tree via a host-daemon helper container and passes require_local_access=False to skip local existence checks (the helper validates existence on the host instead).

🧪 Test corpus & integration coverage

  • Realistic analysis fixture. playground/codebases/core was rewritten from a labelled sample collection into a comment-free, OSS-style lightweight VMM (microvm): virtio device emulation, a DMA/guest-RAM controller, a vsock/QMP monitor, INI config, and command handling. Telegraphing names (vuln_*, *_unsafe, *_untrusted, memory_process_untrusted, safe_str*, …) are gone, so detectors must rely on program analysis rather than naming.
  • Tricky + edge cases. Each detector now has a true-positive site paired with a precision (true-negative) variant — e.g. dma_ring_resize (malloc(count * sizeof …), HIGH) vs dma_ring_resize_guarded (__builtin_mul_overflow + bound); double-free on a shared path vs mutually-exclusive branches; UAF via interprocedural free + returned dangling pointer vs free-then-reassign; size-mismatch memcpy vs bounds-checked copy; a config_open_checked access→open TOCTOU; and interprocedural recv → system taint.
  • Expanded integration suite. New cases cover integer overflow, TOCTOU, null-pointer dereference, uninitialized reads, interprocedural taint flow, and a function-pointer/static callback-chain call graph, alongside the existing UAF / double-free / format-string / heap- & stack-overflow / call-graph / CFG / taint-source/sink tests. Validated live against a running server: 28/28 detector checks pass on the new corpus.
  • Unit tests added for cache-key generation, query normalization, build-option validation, host-path resolution, source fingerprinting, and C include auto-detection.

v0.5.1-beta

12 Jun 17:29

🩹 codebadger — v0.5.1-beta

Patch release on top of v0.5.0-beta.

Fixes

  • Use-After-Free detector regex crash. find_use_after_free interpolated the freed-pointer expression into Joern's cpg.identifier.name(...), which is regex-interpreted — so a freed dereference like free(*ptr) threw PatternSyntaxException: Dangling meta character '*' and the tool returned a stack trace instead of results. It now derives the underlying identifier name and matches it with nameExact (no regex). Confirmed by the integration suite (now 22/22).

No API or configuration changes from v0.5.0-beta.

v0.5.0-beta

12 Jun 17:09

🚀 codebadger — v0.5.0-beta

v0.5.0-beta — high-scale stability hardening + a CPG-only refactor. This is a breaking release: six file/source-dependent MCP tools and all built-in prompts are removed, and source snapshots are now ephemeral. Driven by a postmortem of a ~300-CVE batch that surfaced load-tier fragility, host-OOM, and a connection-refused storm.

⚠️ Breaking changes

  • Removed 6 MCP tools. list_files, get_method_source, get_code_snippet, get_macro_expansion, get_codebase_summary, and discover_fixed_vulnerabilities are gone. codebadger is now CPG-only — it analyzes the Code Property Graph, not files on disk. Read raw source from your own checkout (agents already have grep), use run_cpgql_query for graph-level code access (node .code), and run git-history recon on demand in your own clone. The 9 pure-CPG browsing tools (list_methods, list_calls, get_call_graph, list_parameters, run_cpgql_query, find_bounds_checks, get_cpgql_syntax_help, get_cfg, get_type_definition) plus the taint and detector tools are unchanged.
  • Removed all built-in MCP prompts. Methodology now lives with the calling agent; the server ships tools only.
  • Ephemeral source. After a CPG is built, the source snapshot (playground/codebases/<hash>, incl. any GitHub clone) is deleted — the CPG is the sole persisted artifact. A later regenerate re-fetches source. Set CPG_EPHEMERAL_SOURCE=false to keep snapshots for build debugging.

🩹 Stability & performance (load tier)

  • Verify-probe timeout no longer condemns valid CPGs. The post-import readiness probe used a hard-coded 15 s timeout that, under host pressure, marked perfectly valid CPGs as failed/empty mid-load. It's now configurable (JOERN_VERIFY_TIMEOUT_SECONDS, default 60) and bounded by the load budget; a query that times out while a CPG is still loading no longer terminates the server.
  • Transient load failures are retried, not fatal. A momentary stall during reactivation used to permanently mark a codebase failed even though its cpg.bin was valid on disk. Reloads now retry up to JOERN_LOAD_MAX_ATTEMPTS (default 3) for transient causes; a genuinely empty/broken build is still failed fast.
  • Build memory is bounded against the container cap. build_workers is auto-clamped at startup so build_workers × build_heap fits the build container's memory limit, eliminating OOM-killed builds (exit 137) and the host-memory exhaustion they caused.
  • No more connection-refused storm. Stale cached Joern clients (pointing at a re-spawned worker's old port) are now rebuilt against the live registry; queries are no longer dispatched into a loading/generating server, and a READY codebase whose worker was reaped is transparently reactivated.
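
The build-memory clamp amounts to integer division of the container cap by the per-build heap. A minimal sketch, assuming sizes in GB and a hypothetical `clamp_build_workers` helper:

```python
def clamp_build_workers(requested_workers, build_heap_gb, container_limit_gb):
    """Largest worker count such that workers * heap fits the container cap.

    Never returns less than 1; if even a single heap exceeds the cap, the
    caller still has to lower the heap (this sketch does not do that).
    """
    if build_heap_gb <= 0:
        raise ValueError("build_heap_gb must be positive")
    fit = int(container_limit_gb // build_heap_gb)
    return max(1, min(requested_workers, fit))
```

With 4 requested workers, a 6 GB heap, and a 16 GB cap, this gives 2 workers (12 GB), since 3 would need 18 GB.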

🔁 Concurrency

  • Generation single-flight. Concurrent generate_cpg calls for the same source no longer race: the source copy is atomic (build-in-temp → os.replace, so a half-merged tree can't produce a spurious empty/parse-failed CPG), and a per-hash Redis codebase_generation_lock deduplicates the staging+enqueue so identical concurrent requests don't repeat the work.
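
The build-in-temp then `os.replace` pattern can be sketched as below. This covers only the atomic copy, not the Redis lock; `stage_source_atomically` and the `.staging-` prefix are hypothetical, and the behaviour when the destination already exists is an assumption (the first finished copy is kept).

```python
import os
import shutil
import tempfile


def stage_source_atomically(src_dir, dest_dir):
    """Copy src_dir to dest_dir so readers never observe a half-copied tree."""
    parent = os.path.dirname(os.path.abspath(dest_dir))
    os.makedirs(parent, exist_ok=True)
    if os.path.isdir(dest_dir):
        return dest_dir  # already staged by an earlier or concurrent request

    # Same parent directory => same filesystem, so the final rename is atomic.
    tmp = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        staged = os.path.join(tmp, "tree")
        shutil.copytree(src_dir, staged)
        os.replace(staged, dest_dir)
    except OSError:
        # Losing the race to another request is fine if its copy is in place.
        if not os.path.isdir(dest_dir):
            raise
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
    return dest_dir
```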

⚙️ New configuration

| Env | Default | Purpose |
| --- | --- | --- |
| JOERN_VERIFY_TIMEOUT_SECONDS | 60 | Per-poll readiness-probe timeout (bounded by load timeout). |
| JOERN_LOAD_MAX_ATTEMPTS | 3 | Reload-from-disk retries for a transient load failure. |
| CPG_EPHEMERAL_SOURCE | true | Delete the source snapshot once the CPG is built. |

CPG_BUILD_WORKERS is now auto-clamped to the build container's memory cap. See docs/configuration.md.

v0.4.3-beta

10 Jun 09:56

🚀 codebadger — v0.4.3-beta

v0.4.3-beta — security & deployment hardening. (The scalability/port-allocation and queue-depth fixes shipped in v0.4.2-beta.)

🔒 Security

  • SSRF-hardened repository URL validation. Remote repos are now restricted to https://github.com/… and https://gitlab.com/… (incl. www.), enforced by two independent gates — a literal case-sensitive https://<host>/ prefix match and a parsed-hostname allowlist. Rejects non-https schemes (git://, ssh://, file://), embedded credentials (user:tok@), non-default ports, control chars, userinfo host-smuggling (https://github.com@evil/…), internal/metadata hosts, and look-alike domains.
  • Snippet language validation + inference. Pasted code is supplied in <code language="…"> tags (regex-parsed); the language is validated and content-inferred, and a mislabeled or ambiguous snippet is refused with an actionable message instead of building a wrong-language CPG.
  • CHAT_DEPLOY mode. Set CHAT_DEPLOY=true to disable source_type='local' entirely so a chat-facing / multi-tenant MCP can't read arbitrary host paths — callers must use an allowlisted repo URL or a pasted snippet.
  • Path-traversal hardening. resolve_host_path now rejects null bytes / control characters, canonicalizes with realpath before any check, and supports an optional ALLOWED_SOURCE_ROOTS allowlist for hard containment of local sources.
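
The two-gate repository URL check can be illustrated with the standard library. This is a sketch of the idea (a literal prefix match plus a parsed-hostname allowlist), not the project's validator; `is_allowed_repo_url` and the constants are names chosen for the example.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"github.com", "www.github.com", "gitlab.com", "www.gitlab.com"}
ALLOWED_PREFIXES = tuple(f"https://{host}/" for host in sorted(ALLOWED_HOSTS))


def is_allowed_repo_url(url):
    """Accept a URL only if both independent gates agree."""
    # Whitespace and control characters are rejected up front.
    if any(ord(ch) < 33 or ord(ch) == 127 for ch in url):
        return False

    # Gate 1: literal, case-sensitive prefix match.
    if not url.startswith(ALLOWED_PREFIXES):
        return False

    # Gate 2: parse the URL and check its components.
    try:
        parts = urlsplit(url)
        port = parts.port
    except ValueError:
        return False
    if parts.scheme != "https" or parts.username or parts.password:
        return False
    if port is not None:
        return False
    return parts.hostname in ALLOWED_HOSTS
```

Either gate alone already rejects inputs such as `https://github.com@evil/…`; running both means a parsing quirk in one does not become a bypass.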

⚙️ Configuration & deployment

  • New MCP_PORT env var (default 4242).
  • Compose env passthrough fixed. CHAT_DEPLOY, ALLOWED_SOURCE_ROOTS, CPG_QUEUE_MAXSIZE, and MCP_PORT are now passed into the codebadger-mcp container — previously CPG_QUEUE_MAXSIZE set in .env was silently inert for the containerized MCP.
  • .env / .env.example synced and documented for normal-user defaults.

📚 Docs

  • docs/security.md: SSRF/repo-URL allowlist, snippet-language, CHAT_DEPLOY, and path-confinement controls added to the threat model and hardening checklist.
  • docs/deployment.md: new "Hardening a chat-facing deployment (CHAT_DEPLOY)" section.

Full changelog: v0.4.2-beta...v0.4.3-beta

v0.4.2-beta

10 Jun 08:28

🚀 codebadger — v0.4.2-beta

🎯 Highlights

v0.4.2-beta is a reliability-under-load release. After v0.4.1-beta made the stack production-shaped, large batches (hundreds of CVEs, high client concurrency) still hit a cluster of load-tier failures — CPGs that built fine but wouldn't reactivate, ports that raced Docker's teardown, queue rejections under fan-out, and "ready" codebases whose Joern server had quietly died. This release fixes those: a collision-free CPG loader, build-time overlay persistence so memory-capped workers load big CPGs reliably, rotating port allocation, a decoupled queue depth, event-loop-managed server restarts, and classified, API-visible build failures. Net effect: clean RAM no longer means mysterious failures — batches run to completion and the ones that don't tell you why.


⚠️ Breaking Changes

  • None. v0.4.2-beta is drop-in over v0.4.1-beta. All new behavior is additive and defaults are safe; the new knobs below only need touching for unattended/batch drivers.

📦 What's New

Collision-Free CPG Loading (fixes the "No projects loaded" load-tier failures)

  • Every CPG file is literally named cpg.bin, so letting Joern derive the project name from the filename collided when a worker imported a second CPG or reused a workspace — importCpg then left no project open ("No projects loaded"), failing reactivation of a perfectly good build.
  • load_cpg now imports under an explicit, collision-free project name (workspace.reset; importCpg(path, name); open(name)), then runs a readiness poll that distinguishes three outcomes: loaded-and-non-empty, genuinely empty build (0 user-defined methods → fail with a distinct reason, no pointless retry), and no-project race (→ re-import once before giving up). A registration race right after importCpg no longer reads as a permanent failure.

Build-Time Overlay Persistence (memory safety on reactivation)

  • Joern applies the dataflow overlay (ReachingDefPass) the first time a CPG is opened and re-saves it into cpg.bin. Doing that on every load inside a memory-capped query worker OOMed the worker on large C/C++ trees — surfacing, again, as "No projects loaded".
  • CPG generation now applies and persists the overlays once, in the large-heap build container. Later importCpg calls just deserialize ("Overlay dataflowOss already exists – skipping"), so even a tiny tier-S (2 GB) worker loads a large CPG reliably. Best-effort: on failure the base CPG is kept and the worker falls back to recompute-on-load.
  • Build JVM is now sized from CPG_BUILD_HEAP_GB with G1GC + string dedup (-Xmx{heap}G -Xms2G -XX:+UseG1GC -XX:+UseStringDeduplication), replacing the tiny -Xmx2G default that OOMed the overlay pass.

Rotating Port Allocation (fixes "failed to become ready / connection refused" pile-ups)

  • Always handing back the lowest free port republished a just-released host port on the very next spawn — racing Docker's teardown of the old mapping (docker-proxy/iptables DNAT) and the kernel's TIME_WAIT. Failures concentrated on the first port (e.g. 14000).
  • Both PortManager (in-process) and RedisPoolStore (pool mode, via an atomic INCR cursor under the admit lock) now rotate across the whole range, giving a freed port time to fully release before reuse. A new best-effort _wait_host_port_free waits out a lingering mapping before publishing a worker.
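
Rotating allocation can be sketched with a monotonically advancing cursor. This in-process version is only an illustration of the idea; the class name and API are assumptions, and the Redis-backed variant (INCR cursor under the admit lock) is not shown.

```python
class RotatingPortAllocator:
    """Hand out ports round-robin across [start, end) instead of lowest-free-first."""

    def __init__(self, start, end):
        if end <= start:
            raise ValueError("empty port range")
        self._start = start
        self._size = end - start
        self._cursor = 0
        self._in_use = set()

    def allocate(self):
        # Try each slot at most once, starting where the last allocation stopped.
        for _ in range(self._size):
            port = self._start + self._cursor % self._size
            self._cursor += 1
            if port not in self._in_use:
                self._in_use.add(port)
                return port
        raise RuntimeError("no free port in range")

    def release(self, port):
        self._in_use.discard(port)
```

Because the cursor never moves backwards, a port released a moment ago is the last candidate to be reused, which gives the old mapping time to drain.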

Decoupled Build-Queue Depth (fixes ~30% generation rejections under load)

  • Pending-queue depth was tied to build_workers (workers * 4 = 8), so a 12+-way client got ~30% of generations rejected with queue_full — even though only build_workers builds ever run at once.
  • New CPG_QUEUE_MAXSIZE (default 64) sizes only the waiting room; concurrent builds — and thus build memory — stay capped at CPG_BUILD_WORKERS. Raising it does not increase memory. <=0 falls back to the old build_workers * 4.
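
The decoupling can be shown with a bounded standard-library queue: the queue size governs only how many jobs may wait, while the number of builds that run is decided elsewhere. `make_pending_queue` and `try_enqueue` are hypothetical helpers; only the fallback rule (`<=0` means `build_workers * 4`) and the `queue_full` outcome come from the notes.

```python
import queue


def make_pending_queue(cpg_queue_maxsize, build_workers):
    """Size the waiting room independently of the worker count."""
    if cpg_queue_maxsize > 0:
        size = cpg_queue_maxsize
    else:
        size = build_workers * 4  # legacy behaviour
    return queue.Queue(maxsize=size)


def try_enqueue(pending, job):
    """Return "queued", or "queue_full" when the waiting room is at capacity."""
    try:
        pending.put_nowait(job)
        return "queued"
    except queue.Full:
        return "queue_full"
```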

Event-Loop-Managed Server Restarts (fixes the restart-fail churn)

  • A "ready" codebase whose Joern server had died entered a retry → fail → repeat loop. Sync MCP tools (e.g. get_cpg_status) run in worker threads with no running event loop, so background restarts had nowhere to schedule.
  • The main server loop is now captured at startup; sync tools schedule Joern server restarts onto it via run_coroutine_threadsafe. If a reload fails, the codebase is marked FAILED instead of left "ready with a dead server." New tests cover zombie/restart scenarios.
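
The loop-capture pattern relies on `asyncio.run_coroutine_threadsafe`, which lets a thread without an event loop submit work to a loop running elsewhere. A minimal sketch, with `RestartScheduler`, `restart_server`, and `demo` as illustrative names:

```python
import asyncio


class RestartScheduler:
    """Remember the main loop so sync code in worker threads can schedule onto it."""

    def __init__(self):
        self._loop = None

    def capture_loop(self):
        # Must be called from inside the running main loop (e.g. at startup).
        self._loop = asyncio.get_running_loop()

    def schedule(self, coro):
        if self._loop is None:
            raise RuntimeError("main loop not captured yet")
        # Returns a concurrent.futures.Future usable from any thread.
        return asyncio.run_coroutine_threadsafe(coro, self._loop)


async def demo():
    scheduler = RestartScheduler()
    scheduler.capture_loop()

    async def restart_server(codebase_hash):  # stand-in for a real restart
        await asyncio.sleep(0)
        return f"restarted:{codebase_hash}"

    def sync_tool():
        # Runs in a worker thread with no event loop of its own.
        return scheduler.schedule(restart_server("abc")).result(timeout=5)

    return await asyncio.to_thread(sync_tool)
```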

Classified, API-Visible Build Failures

  • get_cpg_status now surfaces the failure cause (error_code + human-readable error) on a failed build, instead of a bare "failed" that forced digging through container logs.
  • New _classify_cpg_build_failure distinguishes an out-of-memory build (the dominant large-project failure) from a generic frontend error and from timeouts, with the frontend output tail attached.
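
A failure classifier along these lines might look like the sketch below. The signals it uses are assumptions: exit code 137 (a SIGKILL, which is what an OOM kill produces) and the JVM's `OutOfMemoryError` text. The error-code strings and the `classify_build_failure` name are hypothetical and are not taken from codebadger.

```python
def classify_build_failure(exit_code, output, timed_out=False, tail_lines=20):
    """Map a failed build to an error code and attach the frontend output tail."""
    tail = "\n".join(output.splitlines()[-tail_lines:])
    if timed_out:
        code, message = "TIMEOUT", "build exceeded its time budget"
    elif exit_code == 137 or "OutOfMemoryError" in output:
        code, message = "OUT_OF_MEMORY", "build ran out of memory"
    else:
        code, message = "FRONTEND_ERROR", "frontend exited with an error"
    return {"error_code": code, "error": message, "output_tail": tail}
```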

Large-Project Guard (opt-out for batch drivers)

  • generate_cpg now declines a local source above a size/LOC threshold — returning a large_project_warning instead of silently committing to a giant full-project build — unless the caller passes force=True.
  • Thresholds are deliberately high (default 2 GB / 2 M LOC) so only enormous trees warn. New knobs: CPG_LARGE_PROJECT_GUARD (set false for unattended/batch/eval harnesses that always intend to build and can't pass force per call), CPG_LARGE_PROJECT_MAX_MB, CPG_LARGE_PROJECT_MAX_LOC — wired through config.example.yaml, .env.example, and docker-compose.yml.
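
The guard's decision logic is small enough to sketch directly. The defaults below mirror the stated thresholds (2 GB, 2 M LOC); `check_large_project` and the shape of the returned warning are assumptions.

```python
def check_large_project(size_mb, loc, *, force=False, guard_enabled=True,
                        max_mb=2048, max_loc=2_000_000):
    """Return a warning dict when the source is too large, else None.

    force=True (per call) or guard_enabled=False (global) both skip the check.
    """
    if force or not guard_enabled:
        return None
    if size_mb > max_mb or loc > max_loc:
        return {
            "large_project_warning": (
                f"source is {size_mb} MB / {loc} LOC "
                f"(limits: {max_mb} MB / {max_loc} LOC); pass force=True to build anyway"
            )
        }
    return None
```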

Symlink-Safe Local Copy

  • New _copy_local_source_tree skips symlinks that escape the source root when staging a local source, closing a path-escape gap in local-source ingestion.
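
A symlink-safe copy can be sketched with `os.walk` and a `realpath`/`commonpath` containment check. This is a simplified illustration rather than the body of `_copy_local_source_tree`: it assumes a POSIX filesystem, copies the target content of in-root file links, and does not follow symlinked directories at all.

```python
import os
import shutil


def copy_local_source_tree(src_root, dest_root):
    """Copy a tree, skipping symlinks that resolve outside src_root.

    Returns the skipped paths, relative to src_root.
    """
    real_root = os.path.realpath(src_root)
    skipped = []
    for dirpath, dirnames, filenames in os.walk(src_root, followlinks=False):
        rel_dir = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dest_root, rel_dir))
        os.makedirs(out_dir, exist_ok=True)

        # os.walk lists symlinked directories but does not descend into them.
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                skipped.append(os.path.relpath(path, src_root))

        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                resolved = os.path.realpath(path)
                escapes = os.path.commonpath([real_root, resolved]) != real_root
                if escapes or not os.path.exists(resolved):
                    skipped.append(os.path.relpath(path, src_root))
                    continue
            shutil.copy2(path, os.path.join(out_dir, name))
    return skipped
```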

🐳 Deployment & Infrastructure

  • docker-compose.yml now surfaces the build-sizing and scale knobs inline: CPG_BUILD_WORKERS (default 4), CPG_BUILD_HEAP_GB (default 6), MAX_MCP_CONNECTIONS (default 16, 503 past it), MAX_REPO_SIZE_MB (default 1024), plus the large-project-guard vars.
  • Pool-mode invariant unchanged: CPG_BUILD_WORKERS * CPG_BUILD_HEAP_GB ≤ JOERN_MEM_LIMIT (the build container's cap). Run python scripts/recommend_config.py before launching on a new host.

🧪 Testing

  • New suites for collision-free project naming (test_joern_client_load.py), server restart / zombie handling (test_restart_scheduling.py), large-project guard behavior (test_cpg_generator.py, test_mcp_tools.py), symlink-safe copy, and worker-pool port rotation (test_worker_pool.py). 503 tests collected (~77 new).

⚠️ Notes

  • This remains a beta release.
  • Dedicate the host to codebadger — the MCP container mounts the Docker socket (root-equivalent on the host) and uses host networking. The MCP HTTP endpoint has no built-in auth; front it with a reverse proxy / network policy.

Full Changelog: v0.4.1-beta...v0.4.2-beta

v0.4.1-beta

09 Jun 16:00

🚀 codebadger — v0.4.1-beta

🎯 Highlights

v0.4.1-beta turns the v0.4.0 scalability redesign into a production-ready, hardened deployment. The whole stack — including the MCP server itself — now comes up from a single docker compose; Postgres + Redis are required defaults (SQLite is gone); there's a real /health endpoint for orchestrators; idle codebases are evicted to survive long-running batches; you can analyze pasted code snippets, not just repos and paths; and every LLM-supplied input now passes a thorough validation + security layer backed by a documented threat model.


⚠️ Breaking Changes

  • Postgres + Redis are now required (and the default). SQLite and the in-process coordinator have been removed. This reverses the v0.4.0 "optional" note — the server now fails fast on a missing/unreachable Postgres or Redis. Stand the stack up with ./scripts/deploy.sh (or docker compose up -d).
  • pgdata relocated out of playground/ → now ./pgdata (override with POSTGRES_DATA_PATH) so Joern workers can't reach the database files. Migration: while stopped, mv playground/pgdata ./pgdata to preserve an existing catalog, or start fresh.
  • redis-py bumped to 8.x (from 5.x). All in-code usage is stable API; review any custom Redis integrations.

📦 What's New

One-Command Full-Stack Deployment (fixes #21)

  • New Dockerfile.mcp containerizes the MCP server; docker compose up -d now brings up MCP + Joern + Postgres + Redis together.
  • The MCP drives the host Docker daemon (Docker-out-of-Docker via the mounted socket) and uses host networking, so it builds CPGs and spawns per-CPG pool workers as sibling containers — no app code changes required.
  • New scripts/deploy.sh (up / down / restart / logs / status) builds, launches, exports an absolute playground path for pool mounts, and polls /health.

Production /health Endpoint (fixes #20)

  • Reports status (up / partial / down), an mcp: "codebadger" field, and a dependencies map (joern, postgres, redis, docker, cpg_queue) — returns 200 for up/partial and 503 for down, so it works directly as a liveness/readiness probe.
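
The status-to-HTTP mapping can be sketched as a pure function. How the server actually derives up / partial / down is not stated in these notes, so the rule used here (all healthy, some healthy, none healthy) is an assumption, as is the `summarize_health` name.

```python
def summarize_health(dependencies):
    """Build a /health-style payload from a {name: is_healthy} map.

    Assumed rule: all healthy => "up", some => "partial", none => "down".
    """
    flags = list(dependencies.values())
    if all(flags):
        status = "up"
    elif any(flags):
        status = "partial"
    else:
        status = "down"
    http_status = 503 if status == "down" else 200
    body = {"status": status, "mcp": "codebadger", "dependencies": dict(dependencies)}
    return http_status, body
```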

Idle CPG Eviction (memory safety on long runs)

  • Servers idle beyond JOERN_IDLE_TTL_SECONDS (default 600s) are offloaded by a background reaper and auto-wake on the next query — fixing the long-run idle-worker leak that could exhaust RAM/swap.
  • Explicit JOERN_MEMORY_BUDGET_MB budgeting plus per-run file logging (logs/codebadger-<ts>-<pid>.log + a codebadger-latest.log symlink).

Code Snippet Analysis (fixes #19)

  • generate_cpg(source_type="snippet", code=..., language=...) analyzes code pasted straight into the chat — no repo or path needed. Staged like any other source, with content-hash dedup so re-pasting the same code reuses the cached CPG.

Security: Threat Model + Hardening (see docs/security.md)

  • New docs/security.md: threat model, trust boundaries (Mermaid), the controls we provide, and a production hardening checklist.
  • Every LLM-supplied input is now validated: codebase_hash, language, branch & github_token (anti arg-/URL-injection, e.g. blocks --upload-pack), snippet code/filename/label, and regex filters (length + ReDoS-shape guard).
  • run_cpgql_query now enforces the CPGQL blocklist by default (process exec, file read/write, network, dynamic dependency load, reflection) — previously the blocklist was never wired in. Patterns expanded after an audit. (Defense-in-depth; the real boundary is the Joern worker sandbox.)
  • Resource caps at the executor: query timeout ≤ 300s, ≤ 10 000 rows, ≤ 5 MB output, ≤ 5 000-line snippet spans, and clamped take(n)/depth — with visible truncated flags (never silent).
  • Filesystem hardening: symlink-safe local copy (no dereferencing escapes), realpath/commonpath confinement, a deletion guard on remove_cpg, and host-path redaction in client-facing errors.

📚 Documentation

  • New docs/security.md (threat model + diagrams), linked from both READMEs.
  • docs/installation.md and docs/deployment.md rewritten as clear, step-by-step guides (full-stack and host-dev paths, day-2 ops).
  • New docs/available-tools.md cataloguing every MCP tool; snippet flow documented across usage/deployment.

🐳 Deployment & Infrastructure

  • Dockerfile.mcp (new) — python:3.13-slim + git + Docker CLI 29.5.3 (client only).
  • Dependency bumps: fastmcp>=3.4.2, mcp>=1.27.2, aiohttp>=3.14.1, uvicorn>=0.49.0, psycopg[binary]>=3.3.4, redis>=8.0.0.
  • Backing-service URLs now resolve from the environment (DATABASE_URL/REDIS_URL, or the component POSTGRES_*/REDIS_* vars that compose uses) instead of hardcoded defaults, so a host-run MCP honors POSTGRES_PORT/REDIS_PORT overrides.
  • Postgres on 55432 / Redis on 56379 (overridable); updated .env.example, docker-compose.yml, .dockerignore, and cleanup.sh.

🛠️ Tooling

  • get_cpg_status description now explicitly documents it as the way to wait for generate_cpg — poll with the returned codebase_hash until ready/failed (fixes #22).

🧪 Testing

  • New/expanded suites for snippets, the security validators (CPGQL blocklist + bypasses, ReDoS guard, error redaction), executor resource caps & truncation, malformed-hash rejection, and QueryLoader numeric clamps. 426 passing / 24 skipped.

⚠️ Notes

  • This remains a beta release.
  • Dedicate the host to Codebadger — the MCP container mounts the Docker socket (root-equivalent on the host) and uses host networking. The MCP HTTP endpoint has no built-in auth; front it with a reverse proxy / network policy.
  • Pool-mode invariant still applies: build container mem_limit + memory_budget_mb ≤ total Joern budget. Run python scripts/recommend_config.py before launching on a new host.

Full Changelog: v0.4.0-beta...v0.4.1-beta

v0.4.0-beta

08 Jun 14:05

🚀 Codebadger Release — v0.4.0-beta

🎯 Highlights

This release is a major scalability redesign. Codebadger can now analyze large batches of codebases without OOM-killing the host, run as multiple coordinated processes against a shared store, and is backed by brand-new end-to-end documentation. RAM — not a fixed server count — is now the binding constraint, and the system manages it explicitly.


📦 What's New

Memory-Aware Server Admission

  • Replaced the fixed query-server count with a real memory budget (memory_budget_mb) — the sum of per-CPG heap reservations is the true concurrency limit.
  • CPG size tiers (S/M/L/XL): each Joern server's JVM heap is sized to the CPG's on-disk .bin size, so a batch of small CPGs runs far more servers concurrently than a few large ones.
  • LRU eviction + RSS backstop: least-recently-used servers are slept to make room, and an RSS-pressure backstop evicts before the kernel OOM-kills.
  • Generate-ahead, sleep-on-idle: sleeping servers cost no RAM and wake cheaply by re-loading the disk-cached CPG on the next query.
  • Auto-tuned budgets derived from host RAM, with a startup over-commit guard that clamps (and warns) rather than letting the build and query pools jointly over-commit the host.
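
Memory-aware admission with LRU eviction can be sketched with an ordered map of heap reservations. This toy model covers only the budget and the LRU order; the tier sizing, RSS backstop, and auto-tuning are left out, and `MemoryAdmission` with its methods is a hypothetical API.

```python
from collections import OrderedDict


class MemoryAdmission:
    """Admit servers while the sum of heap reservations fits the budget."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        self._awake = OrderedDict()  # cpg_hash -> reserved heap MB, least recent first

    def reserved_mb(self):
        return sum(self._awake.values())

    def admit(self, cpg_hash, heap_mb):
        """Reserve heap for cpg_hash; return the hashes put to sleep to make room."""
        if heap_mb > self.budget_mb:
            raise ValueError("reservation exceeds the whole budget")
        if cpg_hash in self._awake:
            self._awake.move_to_end(cpg_hash)  # mark as most recently used
            return []
        slept = []
        while self.reserved_mb() + heap_mb > self.budget_mb:
            victim, _ = self._awake.popitem(last=False)  # least recently used
            slept.append(victim)
        self._awake[cpg_hash] = heap_mb
        return slept
```

With an 8192 MB budget, admitting a 2048 MB server and then two 4096 MB servers puts the first one to sleep, since the three reservations together would need 10240 MB.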

Pool Worker Mode

  • New JOERN_WORKER_MODE=pool runs each CPG's query server in its own cgroup-capped container, so an OOM kills one worker instead of cascading across every server.
  • In shared mode (default), query servers remain processes inside the single Joern container.

Durable Job Queue

  • A DB-backed jobs table replaces the lossy in-memory queue: it survives restarts, dedups (one active job per CPG via a partial unique index), and applies real backpressure (FOR UPDATE SKIP LOCKED) instead of silently dropping work.

Postgres + Redis (Multi-Process Operation)

  • DATABASE_URL swaps the entire store — catalog, tool cache, findings, and the job queue — into Postgres, enabling genuine multi-process operation.
  • REDIS_URL makes the per-CPG query lock and shared pool state (reservation ledger, warm-worker registry, global LRU, per-CPG spawn lock) cross-process.
  • New PostgresDBManager / PostgresJobStore, plus RedisCoordinator and RedisPoolStore, with lazy imports so Postgres/Redis are only required when configured.
  • Multiple API/scheduler processes can now share one budget, catalog, and queue.

Configuration Recommender

  • New scripts/recommend_config.py (and src/utils/recommend.py) prints a memory-aware config recommendation for the current or a hypothetical host (--mem, --cores, --worker-mode, --compare).
  • The same recommendation logs at startup (toggle with RECOMMEND_ON_STARTUP).

📚 Documentation

  • Brand-new docs/ set: Architecture, Deployment, Configuration, Installation, Usage, Custom Tools, Contributing, and a Roadmap.
  • Architecture, deployment, and lifecycle Mermaid diagrams (query flow with auto-wake, CPG/server lifecycle, memory-aware admission, topologies, shared vs pool).
  • Slimmed-down README that points into the new docs; the old CUSTOM_TOOLS_GUIDE.md content moved into docs/custom-tools.md.

🐳 Deployment & Infrastructure

  • docker-compose.yml now offers Postgres and Redis behind Compose profiles — a plain docker compose up -d starts Joern only.
  • Postgres publishes on 55432 and Redis on 56379 (non-default ports) to avoid clashing with system services; override via POSTGRES_PORT / REDIS_PORT.
  • New optional dependencies: psycopg[binary]>=3.1 (Postgres) and redis>=5.0 (coordination).
  • Updated .env.example, config.example.yaml, Dockerfile, and cleanup.sh for the new topology.

⚙️ Stability & Performance

  • Substantially reworked JoernServerManager (spawn / sleep / evict / heap-sizing) as the heart of memory-aware scheduling.
  • QueryExecutor serializes requests per CPG, caches successful results, and triggers auto-wake.
  • New Coordinator abstraction for per-CPG locks (in-process by default, Redis-backed when configured).
  • Hardened DB concurrency and storage handling across SQLite and Postgres.
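
A rough sketch of per-CPG serialization with result caching, assuming an in-process coordinator built on `threading.Lock`. The class and method names are hypothetical, and auto-wake is left out.

```python
import threading


class InProcessCoordinator:
    """Hand out one lock per CPG id (hypothetical in-process variant)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks = {}

    def lock_for(self, cpg_id: str) -> threading.Lock:
        with self._guard:
            if cpg_id not in self._locks:
                self._locks[cpg_id] = threading.Lock()
            return self._locks[cpg_id]


class QueryExecutorSketch:
    """Serialize queries per CPG and cache only successful results."""

    def __init__(self, run_query, coordinator=None) -> None:
        self._run_query = run_query  # callable(cpg_id, query) -> result
        self._coordinator = coordinator or InProcessCoordinator()
        self._cache = {}

    def execute(self, cpg_id: str, query: str):
        key = (cpg_id, query)
        with self._coordinator.lock_for(cpg_id):
            if key in self._cache:
                return self._cache[key]
            # An exception propagates to the caller and nothing is cached.
            result = self._run_query(cpg_id, query)
            self._cache[key] = result
            return result
```

Queries against different CPGs take different locks and can run in parallel; a Redis-backed coordinator would only need to provide the same `lock_for` contract.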

🧪 Testing

  • New suites covering the redesign: test_coordination, test_db_manager, test_durable_queue, test_memory_admission, test_pool_memory_guard, test_pool_store_redis, test_postgres_db_manager, test_postgres_job_store, test_worker_pool.
  • Expanded test_main and test_program_slice coverage.

⚠️ Notes

  • This remains a beta release.
  • Postgres and Redis are optional — the default single-process setup still uses SQLite and needs only the Joern container. Set DATABASE_URL / REDIS_URL to opt into shared, multi-process operation.
  • Pool-mode invariant: build container mem_limit + memory_budget_mb ≤ total Joern budget. Review your Compose memory limits before enabling pool mode.
  • Leave memory_budget_mb / rss_eviction_threshold_mb at 0 to auto-derive from host RAM; run python scripts/recommend_config.py before launching on a new host.
  • Review custom deployments after upgrading — Compose profiles, new ports, and the optional dependencies may affect existing integrations.
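
The pool-mode invariant above can be checked with a few lines of arithmetic. This is a sketch under the assumption that all three values are expressed in MB; the function name is hypothetical, and it mirrors the "clamp and warn" behaviour only in spirit.

```python
def clamp_query_budget(
    build_mem_limit_mb: int,
    memory_budget_mb: int,
    total_joern_budget_mb: int,
) -> tuple[int, bool]:
    """Return (effective memory_budget_mb, clamped?).

    Enforces: build_mem_limit_mb + memory_budget_mb <= total_joern_budget_mb.
    A memory_budget_mb of 0 (auto-derive) passes through unchanged.
    """
    headroom_mb = total_joern_budget_mb - build_mem_limit_mb
    if headroom_mb <= 0:
        raise ValueError("build container limit leaves no room for query servers")
    if memory_budget_mb <= headroom_mb:
        return memory_budget_mb, False
    return headroom_mb, True  # over-committed: clamp to what is left
```

For example, with a 4096 MB build limit and a 16384 MB total, a requested 14000 MB query budget would be clamped to 12288 MB.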

Full Changelog: v0.3.9-beta...v0.4.0-beta

v0.3.9-beta

05 Jun 17:56

🚀 Codebadger Release — v0.3.9-beta

Released Jun 5, 2026


📦 What's New

Improved Configuration Management

  • Refactored configuration handling for greater consistency and maintainability.
  • Added automatic type coercion for configuration values.
  • Improved session status tracking and lifecycle management.
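
As an illustration of what automatic type coercion can look like, the sketch below converts a string value (for example from an environment variable) to the type of its default. The function and the accepted boolean spellings are assumptions, not the project's implementation.

```python
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def coerce(value: str, default):
    """Coerce a raw string to the type of `default` (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(default, bool):
        lowered = value.strip().lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if isinstance(default, int):
        return int(value)
    if isinstance(default, float):
        return float(value)
    return value  # strings and unknown types pass through unchanged
```

Invalid numbers raise `ValueError` from `int()` / `float()`, so a bad value fails at load time rather than later.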

Enhanced CPG Generation

  • Added configurable queue size limits to prevent resource exhaustion during CPG generation.
  • Introduced warnings when analyzing exceptionally large projects.
  • Improved CPG generation workflow and loading reliability.
  • Added timeout configuration options for CPG loading operations.
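
A queue size limit of this kind can be sketched with the standard library's bounded `queue.Queue`. The wrapper class is hypothetical; the point is that a full queue rejects new work explicitly instead of growing without bound.

```python
import queue


class GenerationQueue:
    """Bounded queue of pending CPG generation requests (illustrative)."""

    def __init__(self, max_size: int) -> None:
        self._pending = queue.Queue(maxsize=max_size)

    def submit(self, cpg_id: str) -> bool:
        """Enqueue a request; return False when the queue is full."""
        try:
            self._pending.put_nowait(cpg_id)
            return True
        except queue.Full:
            return False

    def next(self) -> str:
        """Take the oldest pending request; raises queue.Empty if none."""
        return self._pending.get_nowait()
```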

Better Joern Integration

  • Refactored Joern server management to use a shared restart callback mechanism.
  • Improved startup, restart, and recovery behavior across analysis sessions.

🛡️ Security Improvements

Credential Safety

  • Removed embedded Git credentials from repository configuration after cloning operations.
  • Reduced risk of accidental credential persistence and exposure.
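
One way to drop embedded credentials from a clone URL is to rebuild it without the userinfo part. The helper below is a sketch using `urllib.parse`; how the project actually rewrites the repository configuration is not shown in these notes.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Return `url` without any user:password@ component."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # nothing to remove
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

After a clone, the cleaned URL could be written back with `git remote set-url origin <clean-url>` so the token never persists in `.git/config`.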

Query Validation Hardening

  • Expanded CPGQL validation with more comprehensive dangerous-pattern detection.
  • Improved validation error reporting to help users identify unsafe queries.
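
Dangerous-pattern detection with descriptive errors might look like the sketch below. The pattern list is purely illustrative and far from complete; it is not the validator's actual rule set.

```python
import re

# Hypothetical deny list: label -> pattern. Real validation needs more rules.
DANGEROUS_PATTERNS = {
    "process execution": re.compile(
        r"\bsys\.process\b|\bRuntime\.getRuntime\b|\bProcessBuilder\b"
    ),
    "JVM exit": re.compile(r"\bSystem\.exit\b"),
    "file I/O": re.compile(r"\bjava\.(?:io|nio)\b|\bscala\.io\b"),
}


def validate_query(query: str) -> list:
    """Return a list of human-readable problems; empty means the query passed."""
    problems = []
    for label, pattern in DANGEROUS_PATTERNS.items():
        match = pattern.search(query)
        if match:
            problems.append(f"{label}: matched {match.group(0)!r}")
    return problems
```

Reporting the label and the matched text tells the user which part of the query was rejected and why.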

Path Validation

  • Strengthened path validation throughout the analysis pipeline.
  • Improved handling of filesystem edge cases and invalid paths.
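
A common form of path validation is to resolve the candidate and confirm it stays under an allowed root. The sketch below assumes a single root directory; the function name is hypothetical.

```python
from pathlib import Path


def resolve_inside(root, user_path) -> Path:
    """Resolve `user_path` under `root`; raise ValueError if it escapes."""
    if "\0" in str(user_path):
        raise ValueError("path contains a NUL byte")
    root_resolved = Path(root).resolve()
    # An absolute user_path replaces the root in the join and is then
    # rejected by the containment check unless it really lies under root.
    candidate = (root_resolved / user_path).resolve()
    if candidate != root_resolved and root_resolved not in candidate.parents:
        raise ValueError(f"{user_path!r} escapes {root_resolved}")
    return candidate
```

Resolving before comparing means `..` segments and symlinks are collapsed first, so `src/../../etc` is rejected even though it begins inside the root.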

⚙️ Stability & Performance

Database & Concurrency

  • Improved DBManager concurrency handling.
  • Reduced contention during parallel analysis operations.
  • Enhanced overall system stability under load.

Resource & Port Management

  • Streamlined port manager initialization.
  • Improved service startup reliability.
  • Enhanced graceful shutdown behavior and Docker container lifecycle management.

Async Processing

  • Improved asynchronous handling across CPG generation and loading workflows.
  • Reduced potential race conditions and timeout-related failures.

🔍 Analysis Improvements

  • Improved program slicing to better account for operators and macro expansions.
  • Added safe Scala string escaping utilities for query rendering.
  • Enhanced query handling and execution reliability.
  • Multiple correctness fixes across analysis and detection workflows.
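
For query rendering, escaping a user-supplied value into a Scala string literal could be done roughly as follows. This is a sketch for ordinary double-quoted literals only (not triple-quoted or interpolated strings), and the function name is an assumption.

```python
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}


def scala_string_literal(value: str) -> str:
    """Render `value` as a double-quoted Scala string literal."""
    out = []
    for ch in value:
        if ch in _ESCAPES:
            out.append(_ESCAPES[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"\\u{ord(ch):04x}")  # remaining control characters
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'
```

Because quotes and backslashes are escaped, a value such as `").l; System.exit(1); ("` stays inside the literal instead of terminating it.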

🧹 Refactors & Maintenance

  • Refactored internal service management components.
  • Improved code organization around configuration, session management, and analysis services.
  • Upgraded project dependencies.
  • General cleanup and maintainability improvements across the codebase.

🧪 Testing

  • Expanded validation test coverage.
  • Added tests for updated query validation logic.
  • Improved reliability testing for configuration and service lifecycle changes.

⚠️ Notes

  • This remains a beta release.
  • Query validation is now stricter and may reject patterns that were previously accepted.
  • Users running large codebases may notice new warnings designed to highlight potentially long-running analysis jobs.
  • Configuration handling has been modernized; review custom deployments and integrations after upgrading.

Full Changelog: v0.3.8-beta...v0.3.9-beta