
Large query handling for SQLGraph #281

Open
yfukai wants to merge 2 commits into royerlab:main from yfukai:large_query

Conversation

@yfukai
Contributor

@yfukai yfukai commented Apr 14, 2026

This pull request introduces a robust mechanism for handling large lists of IDs in SQL queries, preventing SQL variable overflow errors by dynamically switching between inline IN clauses and temporary scratch tables. It adds a new _SqlIdSet helper class to encapsulate this logic, updates all relevant filtering and degree-calculation code paths to use it, and provides comprehensive tests to ensure correctness, especially around edge cases where the number of IDs approaches backend-imposed limits.

Key changes include:

Core SQL handling improvements:

  • Introduced the _SqlIdSet class, which automatically decides whether to use an inline IN clause or a temporary scratch table based on the number of IDs and the number of times they are used in a query, preventing SQL variable overflow (OperationalError: too many SQL variables). Scratch tables are created and cleaned up as needed, with automatic resource management using weakref.finalize.
  • Updated all filtering logic in SQLGraph.filter, overlaps, and _get_degree to use _SqlIdSet, ensuring consistent handling of large ID sets across all query paths.
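The decision rule described above can be sketched as follows. This is a hypothetical simplification for illustration only: the names `use_scratch_table`, `chunk_size`, and the default limit of 999 bound variables (the classic SQLite default behind "too many SQL variables") are assumptions; the actual logic lives in `_SqlIdSet` in src/tracksdata/graph/_sql_graph.py.

```python
# Hypothetical sketch of the inline-IN vs scratch-table decision.
# Assumption: the backend caps bound variables per statement (999 is the
# classic SQLite default); the real cutoff in _SqlIdSet may differ.
SQLITE_MAX_VARIABLES = 999


def use_scratch_table(num_ids: int, occurrences: int = 1,
                      chunk_size: int = SQLITE_MAX_VARIABLES) -> bool:
    """Return True when the IDs should go into a temporary scratch table.

    Each occurrence of the ID set in a query expands to num_ids bound
    variables, so the effective per-use budget is chunk_size // occurrences.
    """
    return num_ids > chunk_size // occurrences
```

With one occurrence, 999 IDs still fit inline but 1000 do not; if the same set is expanded three times in one query, the per-use budget shrinks to 333.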

Testing and validation:

  • Added new tests in test_subgraph.py that create graphs with enough nodes to trigger the scratch-table code path, verify correct filtering and degree calculations, and ensure the cutoff logic accounts for the number of ID occurrences per query. This includes edge cases near the cutoff boundary.

Internal utilities and cleanup:

  • Added helper functions for dropping scratch tables and closing _SqlIdSet instances, with error handling and logging for robustness during interpreter shutdown or unexpected errors.

These changes make the SQL backend resilient to large filter operations and improve maintainability by centralizing the handling of SQL variable limits.

@yfukai yfukai changed the title from "Large query" to "Large query handling for SQLGraph" Apr 14, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 91.78082% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.69%. Comparing base (6ea4fbf) to head (c43ac58).

Files with missing lines Patch % Lines
src/tracksdata/graph/_sql_graph.py 91.78% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
+ Coverage   87.62%   87.69%   +0.06%     
==========================================
  Files          57       57              
  Lines        4865     4924      +59     
  Branches      858      864       +6     
==========================================
+ Hits         4263     4318      +55     
- Misses        380      384       +4     
  Partials      222      222              

☔ View full report in Codecov by Sentry.


@JoOkuma
Member

JoOkuma commented Apr 21, 2026

Hi @yfukai , thanks for the PR.

Did you notice some performance improvement? Or was this something you could not compute before?

I'm not super familiar with scratch tables, and the PR looks clean, so I asked an LLM to help with the review.

It had some good comments. See below:


The one thing worth fixing before merging: occurrences=3 is hard-coded in SQLFilter.__init__ regardless of whether include_targets/include_sources are set. When both are True, only the Node.node_id filter fires — one expansion, not three — so a scratch table gets created ~3× earlier than necessary. It's never wrong, just wasteful. Easy fix:

occurrences = 1 + (not self._include_targets) + (not self._include_sources)
id_set = _SqlIdSet(self._graph, node_ids, occurrences=occurrences)
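To make the suggested formula concrete, here is a standalone check of how it counts expansions. The function name `occurrences` is just for this illustration; it mirrors the one-liner above, where each disabled flag adds one more place the ID set is expanded.

```python
# Standalone check of the suggested occurrences formula: one expansion for
# the Node.node_id filter, plus one for each of targets/sources that is
# NOT included (each disabled flag adds another sub-filter using the IDs).
def occurrences(include_targets: bool, include_sources: bool) -> int:
    return 1 + (not include_targets) + (not include_sources)
```

When both flags are True only the Node.node_id filter fires, so the count is 1, not the hard-coded 3 that triggers scratch tables roughly three times earlier than necessary.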

On the cycle: _SqlIdSet stores self._graph, and since graph.filter() returns a SQLFilter that holds the _SqlIdSet, there's a graph → SQLFilter → _SqlIdSet → graph cycle. Python's cycle GC handles it, but weakref.finalize only fires after the cycle is collected — not when the last normal reference drops. In a long-running process with many filters this could hold scratch tables open longer than expected. Storing graph._engine instead of graph breaks the cycle and keeps cleanup prompt. Probably fine as-is for now, just something to keep an eye on.
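The lifecycle caveat above can be demonstrated without any tracksdata code. The sketch below (a generic `Node` class standing in for the graph/filter objects) shows that `weakref.finalize` on an object trapped in a reference cycle fires only when the cycle collector runs, not when the last external reference is dropped.

```python
# Demo: weakref.finalize + reference cycles. Automatic GC is disabled so
# the collection timing is explicit and deterministic for this example.
import gc
import weakref

gc.disable()


class Node:
    pass


events = []
a, b = Node(), Node()
a.other, b.other = b, a                  # a <-> b reference cycle
weakref.finalize(a, events.append, "finalized")

del a, b                                 # external references gone, cycle remains
assert events == []                      # finalizer has NOT fired yet
gc.collect()                             # cycle collector reclaims the pair
assert events == ["finalized"]           # finalizer fires only now
gc.enable()
```

This is why storing only `graph._engine` in `_SqlIdSet` keeps cleanup prompt: with no cycle, plain reference counting reclaims the set (and drops the scratch table) as soon as the filter goes away.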


Smaller things:

The _SqlIdSet docstring says the cutoff is "scaled by" occurrences, but it's divided by it (limit = chunk_size // occurrences). "Scaled by" reads like multiplication — one word change fixes it.

_create_id_scratch_table does hasattr(ids, "tolist") again, but _SqlIdSet.__init__ already normalized to list[int] before calling it. Safe to drop.

The comment in SQLFilter.__init__ pins "L159-L160" for the node-attr-filter path. Line numbers drift — describing the condition instead would age better.


On the tests: the two new tests assert filtered._id_sets[0].uses_scratch_table to confirm the scratch-table path ran. Makes sense, though it ties the test to a private implementation detail that's easy to rename in a refactor. If the path-coverage check stays, a small _uses_scratch_tables() helper on SQLFilter gives a more stable surface for it. Alternatively, the output assertions already cover correctness, so it could just go. Also, the context-manager paths (overlaps, _get_degree) have no test that the temp table is actually dropped after the block — a quick check against sqlite_master after the call would give more confidence in the cleanup logic.


_drop_scratch_tables and _close_id_sets being module-level functions rather than methods is the right call for weakref.finalize, since a bound method keeps the object alive. A one-liner comment there would save the next person from "cleaning it up" and breaking the semantics.
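The semantics worth protecting with that comment can be shown in a few lines. This generic `Resource` class is illustrative only (it is not the `_SqlIdSet` API): a bound method passed to `weakref.finalize` holds a strong reference to its instance, so the object is never reclaimed and the finalizer never fires, while a module-level function works as intended.

```python
# Demo: why weakref.finalize callbacks should be module-level functions,
# not bound methods of the object being finalized.
import gc
import weakref

calls = []


def module_level_cleanup(tag):
    calls.append(tag)


class Resource:
    def cleanup(self):                   # bound method -> strong ref to self
        calls.append("method")


ok = Resource()
weakref.finalize(ok, module_level_cleanup, "module")   # safe pattern
del ok
gc.collect()
assert calls == ["module"]               # finalizer ran as expected

leaky = Resource()
fin = weakref.finalize(leaky, leaky.cleanup)  # bound method keeps leaky alive
del leaky
gc.collect()
assert fin.alive                         # still pending: the object leaked
```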


Approve with minor fixes — the occurrences calculation is the only blocking item, everything else is optional.

@yfukai
Contributor Author

yfukai commented Apr 22, 2026

Thanks @JoOkuma! Sorry for the lack of context; yeah, I got the same error as #249 when I did graph.filter(node_ids=node_ids), and this PR is intended to solve that issue. I'm not sure whether a scratch table or chunking the output is the better strategy, but at least this is simple and works... Let me check the comments, thanks!

@JoOkuma
Member

JoOkuma commented Apr 22, 2026

That makes sense @yfukai
