feat: Add RecordBatchLogReader for bounded log reading#446

Open
charlesdong1991 wants to merge 9 commits into apache:main from charlesdong1991:arrow-batch-reader
Conversation

@charlesdong1991
Contributor

Purpose

Move query_latest_offsets and poll-until-offsets logic from Python binding into Rust core as RecordBatchLogReader.

This enables both Python and C++ bindings to share the same bounded-read implementation.

Linked issue: close #406

Tests

Tests pass locally.

API and Format

Documentation

arrow_schema: SchemaRef,
/// Serializes overlapping `poll` / `poll_batches` across clones sharing this `Arc`.
///
/// TODO: Consider an API that consumes
Contributor Author

@charlesdong1991 Mar 19, 2026

It is cheap to clone this record batch log scanner, but all clones share one Arc, so two overlapping polls are not supported under the current usage model. I added a client-side guard with poll_session so overlapping calls can fail fast.

Not sure what you think. I am happy to create a new issue and do a follow-up on that, or if you prefer, I can have a stricter API in this PR?
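A minimal sketch of the fail-fast idea described above (the names `PollGuard` and `PollSession` are illustrative, not the PR's actual types): an atomic flag shared by all clones is flipped with `compare_exchange`, so a second overlapping poll returns an error instead of interleaving, and an RAII token releases the flag when the session ends.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Illustrative shared state: every scanner clone holds the same Arc.
#[derive(Clone)]
pub struct PollGuard {
    in_flight: Arc<AtomicBool>,
}

impl PollGuard {
    pub fn new() -> Self {
        Self {
            in_flight: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Begin a poll session; fails fast if another poll is already running.
    pub fn begin(&self) -> Result<PollSession, &'static str> {
        self.in_flight
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .map_err(|_| "another poll is already in progress")?;
        Ok(PollSession {
            in_flight: Arc::clone(&self.in_flight),
        })
    }
}

/// RAII token: dropping it ends the session and clears the flag.
pub struct PollSession {
    in_flight: Arc<AtomicBool>,
}

impl Drop for PollSession {
    fn drop(&mut self) {
        self.in_flight.store(false, Ordering::Release);
    }
}
```

The RAII token means the guard is released even if the poll errors or panics mid-way.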

Member

Let's do it properly in this PR. The reader should take ownership of the scanner (move, not clone). That way the compiler prevents concurrent polls - no mutex needed.
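The move-based shape suggested here can be sketched like this (the types are illustrative stand-ins, not the PR's real `LogScanner`/`RecordBatchLogReader`): the reader takes the scanner by value, so the caller's binding is invalidated and the borrow checker statically rules out a second concurrent poller.

```rust
/// Illustrative stand-in for the real scanner type.
pub struct LogScanner {
    records: Vec<i64>,
}

impl LogScanner {
    pub fn new(records: Vec<i64>) -> Self {
        Self { records }
    }

    pub fn poll(&mut self) -> Option<i64> {
        self.records.pop()
    }
}

/// The reader consumes the scanner by value (move, not clone),
/// so no other handle can poll it concurrently.
pub struct RecordBatchLogReader {
    scanner: LogScanner,
}

impl RecordBatchLogReader {
    pub fn new(scanner: LogScanner) -> Self {
        // `scanner` is moved in here; the caller's binding is now invalid.
        Self { scanner }
    }

    pub fn poll(&mut self) -> Option<i64> {
        self.scanner.poll()
    }
}

// let scanner = LogScanner::new(vec![1, 2]);
// let mut reader = RecordBatchLogReader::new(scanner);
// scanner.poll(); // compile error: value moved into the reader
```

No mutex or runtime check is needed; misuse simply does not compile.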

Contributor Author

done

Member

@fresh-borzoni May 11, 2026

I don't think we solved this: new_shared_handle surfaces a similar pattern for the bindings again.

Scenario: if a user calls scanner.subscribe(new_bucket) while the reader is iterating, filter_batches silently drops new_bucket's batches because it's not in stopping_offsets.

Contributor Author

You are totally right! 👍

With another look, this new_shared_handle will reopen the hole. I added a guard in LogScannerInner and made all subscribe/unsubscribe* methods check it.

WDYT of this approach? I think it's a lightweight runtime safety check for the binding layer.
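The runtime guard could look roughly like this (illustrative types, not the real LogScannerInner): while a bounded read is active, subscription changes fail fast instead of silently producing batches that filter_batches would drop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Illustrative inner scanner state.
pub struct ScannerInner {
    bounded_read_active: AtomicBool,
    subscribed: Mutex<Vec<i32>>,
}

impl ScannerInner {
    pub fn new() -> Self {
        Self {
            bounded_read_active: AtomicBool::new(false),
            subscribed: Mutex::new(Vec::new()),
        }
    }

    /// Set by the bounded reader when it starts/finishes iterating.
    pub fn set_bounded_read(&self, active: bool) {
        self.bounded_read_active.store(active, Ordering::Release);
    }

    /// Subscription changes fail fast while a bounded read is running,
    /// so a new bucket can never bypass stopping_offsets unnoticed.
    pub fn subscribe(&self, bucket: i32) -> Result<(), &'static str> {
        if self.bounded_read_active.load(Ordering::Acquire) {
            return Err("cannot change subscriptions during a bounded read");
        }
        self.subscribed.lock().unwrap().push(bucket);
        Ok(())
    }
}
```

The same check would apply to the unsubscribe* methods.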

Member

@fresh-borzoni left a comment

@charlesdong1991 Ty for the PR. Left comments, PTAL

Comment thread crates/fluss/src/client/table/reader.rs Outdated
Comment thread bindings/python/src/table.rs Outdated
/// The projected row type to use for record-based scanning
projected_row_type: fcore::metadata::RowType,
/// Cache for partition_id -> partition_name mapping (avoids repeated list_partition_infos calls)
partition_name_cache: std::sync::RwLock<Option<HashMap<i64, String>>>,
Member

Why have we removed this?

Contributor Author

Since it had no remaining caller after the offset/poll loop moved to the Rust core. WDYT?


Comment thread bindings/python/src/lib.rs Outdated
Comment thread bindings/python/src/table.rs Outdated
Comment thread crates/fluss/src/client/table/reader.rs
Comment thread crates/fluss/src/client/table/reader.rs Outdated

@charlesdong1991
Contributor Author

Hi @fresh-borzoni, sorry for the late response, and thanks for the reviews. As I have been travelling without my laptop, I will come back to this in 2 weeks.
In the meantime, I will convert this to draft to avoid confusion. 🙏

@charlesdong1991 marked this pull request as draft March 30, 2026 18:10
@charlesdong1991 marked this pull request as ready for review April 18, 2026 15:17
@charlesdong1991
Contributor Author

Thanks for your reviews, did some refactoring, PTAL @fresh-borzoni 🙏

@fresh-borzoni
Member

fresh-borzoni commented Apr 19, 2026

@charlesdong1991 Ty for the PR. I looked briefly and it looks good now, but let's wait until we decide whether we want to move to a fully async API for Python polls. If that's the case, let's merge that first, rebase/resolve conflicts here, and I'll review one more time.

WDYT?

@charlesdong1991
Contributor Author

let's wait until we decide if we want to move to fully async api for python polls, and then if it's the case - merge it first

oh, that's good to hear, let me take a look too to get some understanding!

@fresh-borzoni
Member

Ty @charlesdong1991 for rebasing, I'll take a look today or tomorrow to unblock you on this.

Member

@fresh-borzoni left a comment

@charlesdong1991 Ty for the changes, left some comments, PTAL

///
/// Returns:
/// ``pyarrow.RecordBatchReader`` yielding ``RecordBatch`` objects
fn to_arrow_batch_reader(&self, py: Python) -> PyResult<Py<PyAny>> {
Member

to_arrow_batch_reader() is a sync Python method but does RPC work via block_on, and iteration also blocks via RecordBatchReader.__next__.

I think this is acceptable only if documented as a blocking/sync Arrow interop API, not an asyncio-native streaming API.

We may wish to provide a proper asyncio-native streaming API as well, as a follow-up. Do you mind filing it?

Contributor Author

Good one, I created #545 as a follow-up, and I can work on that.

.collect();

let table_id = scanner.table_id();
Ok(offsets
Member

nit: should we defensively intersect the returned offsets with the subscribed buckets? Otherwise an unexpected bucket from list_offsets() can enter stopping_offsets and make the reader wait forever.

Just a cheap defensive check; the server doesn't return funny things now, but it's a bit brittle.

Contributor Author

Good catch. Changed both query_latest_offsets and query_partitioned_offsets to skip bucket ids that are not present.
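The defensive intersection amounts to a small filter (function name and types here are illustrative, not the PR's actual code): any bucket id reported by list_offsets() that we never subscribed to is dropped before it can enter stopping_offsets.

```rust
use std::collections::{HashMap, HashSet};

/// Keep only offsets for buckets we actually subscribed to. An unexpected
/// bucket from list_offsets() would otherwise sit in stopping_offsets and
/// make the bounded reader wait forever for data that never arrives.
pub fn intersect_offsets(
    offsets: HashMap<i32, i64>,
    subscribed: &HashSet<i32>,
) -> HashMap<i32, i64> {
    offsets
        .into_iter()
        .filter(|(bucket, _)| subscribed.contains(bucket))
        .collect()
}
```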

await admin.drop_table(table_path, ignore_if_not_exists=False)


async def test_to_arrow_batch_reader(connection, admin):
Member

Shall we add a test for the Drop behaviour, as it is rather sophisticated tbh?

I'm thinking about one integration test that subscribes, starts a reader, drops it mid-iteration, then asserts the original scanner sees no leftover subscriptions for buckets the reader hadn't completed.
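The Drop behaviour under test could look roughly like this (illustrative types; a shared set stands in for the scanner's subscription table): on drop, the reader unsubscribes every bucket it hadn't finished, so no leftover subscriptions survive a mid-iteration abort.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

/// Stand-in for the scanner's subscription table.
type Subscriptions = Rc<RefCell<HashSet<i32>>>;

/// A bounded reader that cleans up its incomplete buckets when dropped.
pub struct BoundedReader {
    subscriptions: Subscriptions,
    remaining: HashSet<i32>,
}

impl BoundedReader {
    pub fn new(subscriptions: Subscriptions, buckets: HashSet<i32>) -> Self {
        Self {
            subscriptions,
            remaining: buckets,
        }
    }

    /// A bucket that reached its stopping offset is unsubscribed normally.
    pub fn complete(&mut self, bucket: i32) {
        self.remaining.remove(&bucket);
        self.subscriptions.borrow_mut().remove(&bucket);
    }
}

impl Drop for BoundedReader {
    fn drop(&mut self) {
        // Unsubscribe everything the reader hadn't completed yet.
        let mut subs = self.subscriptions.borrow_mut();
        for bucket in self.remaining.drain() {
            subs.remove(&bucket);
        }
    }
}
```

The integration test would then assert the subscription table is empty (or back to its pre-reader state) after the drop.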


@charlesdong1991
Contributor Author

Thanks a lot for the reviews @fresh-borzoni, I made some changes, let me know what you think 🙏

Member

@fresh-borzoni left a comment

@charlesdong1991 Ty for the changes, LGTM overall, only minor comments

admin: &FlussAdmin,
) -> Result<Self> {
let subscribed = scanner.get_subscribed_buckets();
if subscribed.is_empty() {
Member

What if unsubscribe is called between get_subscribed_buckets and guard acquisition?
I think we would pass through stale subscription state, try to read something, and hang, so it's a possible race.

Contributor Author

Nice catch, that could indeed cause a race condition 👍
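One way to close this kind of check-then-act race (illustrative sketch, not the PR's actual fix): take the lock first and snapshot the subscriptions while holding it, so no unsubscribe can slip in between the check and the guard acquisition.

```rust
use std::sync::Mutex;

/// Illustrative scanner: the subscription set lives behind one lock.
pub struct Scanner {
    subscribed: Mutex<Vec<i32>>,
}

impl Scanner {
    pub fn new(buckets: Vec<i32>) -> Self {
        Self {
            subscribed: Mutex::new(buckets),
        }
    }
}

/// Race-free construction: acquire the guard FIRST, then read the
/// subscription state from inside it. Checking emptiness before locking
/// would allow an unsubscribe to invalidate the snapshot in between.
pub fn snapshot_under_guard(scanner: &Scanner) -> Result<Vec<i32>, &'static str> {
    let guard = scanner.subscribed.lock().unwrap();
    if guard.is_empty() {
        return Err("no subscribed buckets");
    }
    Ok((*guard).clone())
}
```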

///
/// **Not intended for general use** — prefer the async [`unsubscribe`].
#[doc(hidden)]
pub fn unsubscribe_sync(&self, bucket: i32) {
Member

Shall we use pub(crate) visibility?

/// [`unsubscribe_partition`](Self::unsubscribe_partition). See
/// [`unsubscribe_sync`](Self::unsubscribe_sync) for rationale.
#[doc(hidden)]
pub fn unsubscribe_partition_sync(&self, partition_id: PartitionId, bucket: i32) {
Member

ditto

@charlesdong1991
Contributor Author

Thank you for the nice catch @fresh-borzoni, addressed!

@fresh-borzoni
Member

@charlesdong1991 Ty, can you rebase, pls?
Also, do you mind filing an issue to add the same logic for C++?



Development

Successfully merging this pull request may close these issues.

NO to_arrow_batch_reader support in python binding
