Skip to content

Fix: Make concurrent CSV Read function really thread safe#6694

Open
FSchumacher wants to merge 1 commit intoapache:masterfrom
FSchumacher:fix-csv-read
Open

Fix: Make concurrent CSV Read function really thread safe#6694
FSchumacher wants to merge 1 commit intoapache:masterfrom
FSchumacher:fix-csv-read

Conversation

@FSchumacher
Copy link
Copy Markdown
Contributor

Current implementation could hand out the same row to more than one thread at the same time and could loose track when wrapping around.

To fix #6673 change to a AtomicInteger and a compareAndSet operation to allow thread safe wrap around. To guard against an endless loop in the case of an empty CSV file, we now throw an ISE in such a case.

This fixes #6673

Description

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • I have updated the documentation accordingly.

Comment on lines +127 to +129
while (true) {
int current = nextRow.get();
int next = (current + 1) % size;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of going with AtomicLong.getAndIncrement and java.lang.Math#floorMod(long, int)?
Then it could work without a loop.

E.g.

private final AtomicLong nextRow = new AtomicLong();

      long row = nextRow.getAndIncrement();
      int result = (int) Math.floorMod(row, size);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretical we could still run into an overflow and in that case, we would get wrong values ;)

And probably more to the point: the name nextRow is misleading. It should probably be nextRowToReturn or such, as it contains the number, the function should return and the function sets it to the value the next user should read.

Therefore I think the loop is not that badly written.

If we want to take the AtomicLong-road, we should probably start with a value of -1 instead of 0, so that the first row is still read. But in that case we would have to check all uses of the value of nextRow outside of the changed function.

Copy link
Copy Markdown
Collaborator

@vlsi vlsi Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math.floorMod resolves the overflow, so I don't understand what you mean by "wrong values"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

floorMod should work well within the range of long. But when our long wraps around, the calculation will be wrong.

It might be clearer, when we use a short instead of a long.
There we have (x % 255) % FILE_SIZE as our current value. Further assume that FILE_SIZE is 200.
Then for x == 254 we get 54 and for x == 255 we get 0 (wrap around of short).

The same will happen (really unlikely, as a long can get quite big) with an int or a long. (That might will happen with a long after 286.967.718 years, when we read about 1000 rows a second (for that complete period) :) ) (and that will be one wrong range, that we skip, in the above short example 55..255)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify what exactly would go wrong.
0 is a valid value, isn't it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong with the zero per se, but it should have been the 55 in the example I gave.

And we still would have the other problem, that we would have to either decrement the value before returning or start with -1 and look at other uses of nextRow, which doesn't feel right to me, either.

nextRow = 0;
int size = fileData.size();
if (size <= 0) {
throw new IllegalStateException("CSV file has no data rows");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include the file name, otherwise it would be hard to troubleshoot. An alternative option would be returning 0 and making sure the caller would throw an appropriate error with the file name of the problematic file.

Current implementation could hand out the same row to more than one thread
at the same time and could loose track when wrapping around.

To fix apache#6673 change to a AtomicInteger and a compareAndSet operation to
allow thread safe wrap around. To guard against an endless loop in the case
of an empty CSV file, we now throw an ISE in such a case.

This fixes apache#6673
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Build-Failure in test for CSVReadFunctionTest

2 participants