[Audit][High] ECS query iterator has data race and staleness bug in next() #732

Description

@MichaelFisher1997

## 🔍 Module Scanned
[…] (automated audit scan)

## 📝 Summary
The ECS query iterator's `next()` function has two critical concurrency and correctness bugs: (1) the epoch validation check at line 90 is not protected by any lock, creating a data race with concurrent structural modifications, and (2) when the epoch changes, `next()` returns without resetting […], causing the query to become permanently stale and skip all remaining entities.

## 📍 Location
- File: […]
- Function/Scope: `next()` function within […]

## 🔴 Severity: High
- Critical: Crashes, data corruption, security vulnerabilities, GPU device loss
- High: Memory leaks, race conditions, incorrect rendering, broken features
- Medium: Performance degradation, missing error handling, suboptimal patterns
- Low: Code style, dead code, minor improvements

## 💥 Impact
When multiple threads use the ECS registry concurrently (e.g., physics system updating on one thread while render system queries on another), the query iterator can return stale or null results even when valid entities exist. Specifically:

1. Data Race: The check at line 90 reads the epoch without synchronization, while other threads may be writing to it via […], […], or […]. This is a plain data race in Zig (concurrent read and write to the same memory).

2. Permanent Staleness: When the epoch check fails, `next()` returns but does NOT reset […]. On the next call to `next()`, the same check fails again (epoch still changed), and the query is permanently stuck returning […], skipping all remaining entities. This breaks the contract that a query should either return valid rows or indicate completion.

## 🔎 Evidence

[…]

The epoch check at line 90 is a data race because:
- The epoch is read without any lock or atomic operation
- It is written in […] (line 40), […] (line 48), and […] (line 56)
- These writes can happen from any thread using the registry

The staleness bug occurs because when […], the function returns immediately without resetting […]. If `next()` is called again (e.g., after the modifying thread completes), the same epoch check fails again, and the query is stuck.

## 🛠️ Proposed Fix

1. Fix the staleness bug by resetting […] when the epoch changes so the query can be re-iterated properly:

   […]

2. Address the data race by either:
   a. Documenting that queries must be completed before any structural modifications (single-threaded iteration), or
   b. Using an atomic load for the epoch check, or
   c. Having the caller explicitly check the epoch before/after iteration

For the data race, the cleanest fix is to make […] an […] and use […] for the check. Alternatively, document that queries are not thread-safe and must be completed before concurrent […]/[…]/[…] calls.

## ✅ Acceptance Criteria
- [ ] The `next()` function correctly resets […] when the epoch changes, allowing re-iteration
- [ ] No data race exists between epoch reads in `next()` and writes in […]/[…]/[…]
- [ ] Existing ECS tests in […] pass
- [ ] A test verifying query behavior after structural modification is added or existing tests cover this case

## 📚 References
- Zig language reference on data races: two or more concurrent accesses to the same memory location, at least one being a write
- Related existing issue: None found covering ECS query staleness
- Component storage uses […], which is also not thread-safe for concurrent writes + reads
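
The staleness fix and the atomic-epoch option from the proposed fix can be sketched together. This is a minimal sketch under stated assumptions, not the project's actual code: `Registry`, `Query`, `structuralChange`, `seen_epoch`, and `index` are hypothetical names, the query walks a plain slice rather than real component storage, and the API assumes Zig 0.12 or later (`std.atomic.Value` and lower-case ordering names such as `.acquire`).

```zig
const std = @import("std");

// Hypothetical stand-in for the registry; only the structural epoch is modeled.
const Registry = struct {
    // Atomic so that readers on other threads never perform a plain racy load.
    epoch: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),

    // Stand-in for any structural modification that bumps the epoch.
    fn structuralChange(self: *Registry) void {
        _ = self.epoch.fetchAdd(1, .release);
    }
};

// Hypothetical query over a fixed slice of entity ids.
const Query = struct {
    registry: *const Registry,
    entities: []const u32,
    index: usize = 0,
    seen_epoch: u64,

    fn init(registry: *const Registry, entities: []const u32) Query {
        return .{
            .registry = registry,
            .entities = entities,
            .seen_epoch = registry.epoch.load(.acquire),
        };
    }

    fn next(self: *Query) ?u32 {
        const current = self.registry.epoch.load(.acquire);
        if (current != self.seen_epoch) {
            // Invalidated: rewind and adopt the new epoch so the following
            // call starts a fresh pass instead of failing this check forever.
            self.index = 0;
            self.seen_epoch = current;
            return null;
        }
        if (self.index >= self.entities.len) return null;
        const id = self.entities[self.index];
        self.index += 1;
        return id;
    }
};

test "query can be re-iterated after a structural change" {
    var reg = Registry{};
    const ids = [_]u32{ 10, 20, 30 };
    var q = Query.init(&reg, &ids);

    try std.testing.expectEqual(@as(?u32, 10), q.next());
    reg.structuralChange();
    // The pass in progress is abandoned once...
    try std.testing.expectEqual(@as(?u32, null), q.next());
    // ...and the next call restarts from the first entity.
    try std.testing.expectEqual(@as(?u32, 10), q.next());
}
```

Making the epoch atomic only removes the race on the epoch read itself. As the references note, component storage is still unsafe for concurrent writes and reads, so callers would still need external synchronization or a documented rule that queries finish before structural changes.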

Labels: automated-audit, bug, documentation, enhancement, hotfix
