Summary
daemon_manager.py's ensure_bridge_running(probe_only=True) caches _bridge_ok as a module-level global with no expiry mechanism. Once set to False by a transient failure (subprocess creation race, env-var load ordering), it stays False for the lifetime of the gateway process. This causes MemTensorProvider.is_available() to return False permanently, and agent_init.py skips adding the provider — MemOS is silently disabled despite the bridge daemon being perfectly healthy.
Root Cause
In adapters/hermes/memos_provider/daemon_manager.py, the caching logic at lines 130–133:
    if _bridge_ok is not None and probe_only:
        return _bridge_ok  # ← returns stale False forever
The sequence that triggers the bug:
- Hermes gateway starts and loads .env (which sets MEMOS_NODE_BINARY)
- agent_init.py calls is_available() → ensure_bridge_running(probe_only=True)
- If _node_available() fails at this moment (e.g. transient subprocess error on Windows, or .env not yet loaded in os.environ), _bridge_ok is set to False
- Meanwhile, ensure_viewer_daemon() (called during initialize() without probe_only) finds the bridge already running on port 18800 via _probe_viewer() and returns True without ever calling ensure_bridge_running() again, so _bridge_ok stays False forever
- is_available() returns False → agent_init.py:994 skips add_provider() → MemOS offline for the gateway's lifetime
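The sticky-cache behaviour can be reproduced in isolation. The following is a minimal sketch, not the actual daemon_manager.py code: `node_probe_results` and the simplified `ensure_bridge_running` are hypothetical stand-ins that only mirror the cache check quoted earlier.

```python
from typing import Optional

# Hypothetical stand-in for the module-level cache in daemon_manager.py.
_bridge_ok: Optional[bool] = None

# Simulated results of successive _node_available() calls:
# the first call fails transiently, later calls would succeed.
node_probe_results = [False, True, True]


def _node_available() -> bool:
    """Return the next simulated probe result (the real check is assumed to look for Node.js)."""
    return node_probe_results.pop(0)


def ensure_bridge_running(*, probe_only: bool = False) -> bool:
    """Simplified model of the unfixed cache check: the cached value never expires."""
    global _bridge_ok
    if _bridge_ok is not None and probe_only:
        return _bridge_ok  # stale value returned without re-probing
    _bridge_ok = _node_available()
    return _bridge_ok


first = ensure_bridge_running(probe_only=True)   # transient failure gets cached
second = ensure_bridge_running(probe_only=True)  # served from cache, no re-probe
third = ensure_bridge_running(probe_only=True)   # still served from cache
print(first, second, third)
```

In this model the two successful probe results are never consumed, which matches the report: the environment recovers, but the cached verdict does not.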
Symptoms
- Repeated "MemOS: Node.js not found on PATH" warnings in logs (from non-session contexts, no session ID prefix)
- Bridge health endpoint (/api/v1/health) shows llm.available: true and embedder.available: true
- But bridge_client never logs any activity; the provider was never registered
- hermes doctor shows memtensor as unavailable
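To confirm the second symptom while the provider reports itself unavailable, the health endpoint can be queried directly. This is a rough sketch under stated assumptions: that the endpoint is served on localhost port 18800 (the port mentioned for the bridge) and that the payload is JSON with nested `llm` and `embedder` objects. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: host, port, and JSON shape are inferred from the report,
# not taken from the bridge's documented API.
HEALTH_URL = "http://127.0.0.1:18800/api/v1/health"


def bridge_looks_healthy(payload: dict) -> bool:
    """Return True when both llm.available and embedder.available are true."""
    llm_ok = payload.get("llm", {}).get("available") is True
    embedder_ok = payload.get("embedder", {}).get("available") is True
    return llm_ok and embedder_ok


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> dict:
    """Fetch and decode the health payload; raises on network or JSON errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `bridge_looks_healthy(fetch_health())` on an affected machine would be expected to return True even though hermes doctor lists memtensor as unavailable.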
Proposed Fix
Three changes to daemon_manager.py, ~20 lines net new code:
1. TTL-based cache expiry
Add _bridge_ok_at: float = 0.0 timestamp and BRIDGE_OK_TTL_SEC = 60.0. Cached results expire after 60 seconds, forcing revalidation.
2. Running-bridge fallback
When _node_available() fails, check whether a bridge is already alive via _probe_viewer() == "running_memos". A live bridge process is definitive proof the environment is viable (the bridge itself was launched with Node.js).
3. shutdown_bridge() resets both cache variables (_bridge_ok and _bridge_ok_at)
Fixed ensure_bridge_running:
    def ensure_bridge_running(*, probe_only: bool = False) -> bool:
        global _bridge_ok, _bridge_ok_at
        with _lock:
            now = time.time()
            if _bridge_ok is not None and probe_only:
                if (now - _bridge_ok_at) < BRIDGE_OK_TTL_SEC:
                    return _bridge_ok
                # Cache expired: fall through to revalidate.
            script = _bridge_script()
            if not script.exists():
                logger.warning("MemOS: bridge script missing at %s", script)
                _bridge_ok = False
                _bridge_ok_at = now
                return False
            if _node_available():
                _bridge_ok = True
                _bridge_ok_at = now
                return True
            # Node binary check failed. Check if bridge is already running.
            if _probe_viewer() == "running_memos":
                _bridge_ok = True
                _bridge_ok_at = now
                return True
            logger.warning("MemOS: Node.js not found on PATH")
            _bridge_ok = False
            _bridge_ok_at = now
            return False
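The recovery behaviour of the TTL can be checked without waiting 60 seconds by injecting a fake clock. The class below is a toy model of the expiry logic only, not the real function: `BridgeProbeCache`, `probe`, and `clock` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

BRIDGE_OK_TTL_SEC = 60.0


class BridgeProbeCache:
    """Toy model of the TTL cache; probe and clock are injected for testing."""

    def __init__(self, probe: Callable[[], bool], clock: Callable[[], float]) -> None:
        self._probe = probe
        self._clock = clock
        self._ok: Optional[bool] = None
        self._ok_at = 0.0

    def check(self) -> bool:
        now = self._clock()
        if self._ok is not None and (now - self._ok_at) < BRIDGE_OK_TTL_SEC:
            return self._ok  # cached verdict is still fresh
        # No verdict yet, or the cached one expired: probe again.
        self._ok = self._probe()
        self._ok_at = now
        return self._ok


fake_now = [0.0]
results = iter([False, True])  # first probe fails, second succeeds
cache = BridgeProbeCache(probe=lambda: next(results), clock=lambda: fake_now[0])

at_start = cache.check()    # probe fails; False cached at t=0
fake_now[0] = 30.0
within_ttl = cache.check()  # cached False reused, no probe
fake_now[0] = 61.0
after_ttl = cache.check()   # cache expired; probe runs again and succeeds
print(at_start, within_ttl, after_ttl)
```

One design note: for measuring elapsed intervals, `time.monotonic()` may be a safer clock than `time.time()`, since it is unaffected by system clock adjustments.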
Environment
- MemOS version: 2.0.5 (@memtensor/memos-local-plugin)
- Hermes Agent on Windows 10
- Node.js v24.14.1 (path set via MEMOS_NODE_BINARY in .env)
Notes
Happy to submit a PR if this direction looks right. The fix has been tested on Windows — syntax verified, logic tested with fresh import.