Fix AssertionError during eval when val set size is not divisible by train_batch_size #1589

rishithayenumula wants to merge 5 commits into NovaSky-AI:main from
Conversation
- Add `is_training` flag to `compute_prompt_mini_batch_boundaries()`
- Allow partial batches during evaluation
- Use `num_prompts` instead of `train_batch_size` in boundary calculations
- Keep strict validation during training for distributed correctness
- Add 4 comprehensive tests for eval partial batch scenarios
- Backward compatible: default `is_training=True` preserves training behavior
Code Review
This pull request introduces support for partial batches during evaluation by adding an is_training flag to the mini-batch boundary computation logic, preventing crashes when validation sets are not perfectly divisible by the batch size. Feedback identifies a NameError in the new test cases where assertions were incorrectly moved, and suggests restoring jaxtyping annotations for better documentation. Additionally, it is recommended to lower the logging level for partial batch detection during evaluation to reduce output noise.
```python
# Non-step-wise boundaries should be uniform
assert non_stepwise_bounds == [(0, 640), (640, 1280)]
```
The last two lines of test_same_step_count_as_non_stepwise were accidentally moved into the body of the new test_eval_stepwise_partial_batch method. This will cause a NameError when running the tests because non_stepwise_bounds is not defined within the scope of the new test. These lines should be moved back to the end of their original test function.
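The scoping bug described above can be reproduced in miniature — a name bound inside one test function is not visible from another, so the misplaced assertion fails at lookup time (test names follow the PR; the bodies below are illustrative stubs):

```python
def test_same_step_count_as_non_stepwise():
    # non_stepwise_bounds is a local variable of this function only
    non_stepwise_bounds = [(0, 640), (640, 1280)]
    assert non_stepwise_bounds == [(0, 640), (640, 1280)]

def test_eval_stepwise_partial_batch():
    # Referencing non_stepwise_bounds here raises NameError, because the
    # name was never defined in this function's (or the global) scope.
    try:
        non_stepwise_bounds  # noqa: F821
        raised = False
    except NameError:
        raised = True
    assert raised, "expected NameError: name not defined in this scope"

test_same_step_count_as_non_stepwise()
test_eval_stepwise_partial_batch()
```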
```python
import torch
from jaxtyping import Float, Integer
from transformers import AutoTokenizer
```
The jaxtyping imports and associated type annotations were removed in this file. These annotations provide valuable documentation regarding tensor shapes and dtypes, which is particularly helpful in complex batching logic. Unless there is a specific reason for their removal, they should be retained to maintain code clarity and type safety.
Suggested change:

```diff
  from transformers import AutoTokenizer
+ from jaxtyping import Float, Integer
```
```python
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
Optional[torch.Tensor],
Optional[torch.Tensor],
```
The detailed shape annotations for the return types were removed. It is recommended to keep these for better maintainability and readability.
Suggested change:

```diff
- torch.Tensor,
- torch.Tensor,
- torch.Tensor,
- torch.Tensor,
- torch.Tensor,
- Optional[torch.Tensor],
- Optional[torch.Tensor],
+ Float[torch.Tensor, "batch seq_len"],
+ Float[torch.Tensor, "batch seq_len"],
+ Float[torch.Tensor, "batch response_len"],
+ Float[torch.Tensor, "batch response_len"],
+ Float[torch.Tensor, "batch response_len"],
+ Optional[Float[torch.Tensor, "batch response_len"]],
+ Optional[Integer[torch.Tensor, "batch seq_len layer_num topk"]],
```
```python
logger.warning(
    f"Partial batch detected during eval: got {num_prompts} prompts but "
    f"train_batch_size={train_batch_size}. Using actual batch size for mini-batch boundaries."
)
```
Logging a warning for partial batches during evaluation is likely too noisy. When drop_last=False (standard for evaluation), the final batch is expected to be partial if the dataset size is not a multiple of the batch size. Consider using logger.info or removing the log entirely to avoid cluttering the output with expected behavior.
Suggested change:

```diff
- logger.warning(
+ logger.info(
      f"Partial batch detected during eval: got {num_prompts} prompts but "
      f"train_batch_size={train_batch_size}. Using actual batch size for mini-batch boundaries."
  )
```
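The practical effect of the suggested downgrade: with the common WARNING-level threshold, `logger.info` messages are filtered out while `logger.warning` messages still appear. A minimal illustration (logger name and messages are made up):

```python
import io
import logging

logger = logging.getLogger("eval_batching")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)  # typical production threshold

logger.info("partial batch detected")        # filtered: below WARNING
logger.warning("something actually wrong")   # emitted

output = stream.getvalue()
```

So after the change, the expected-partial-batch message stays out of eval logs unless the user explicitly enables INFO-level logging.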
Problem

`compute_prompt_mini_batch_boundaries` assumes that all batches have size equal to `train_batch_size`, which holds during training (`drop_last=True`) but not during evaluation (`drop_last=False`). During evaluation, the final batch can be smaller, leading to an `AssertionError` when `convert_to_training_input` is invoked on partial batches.

Solution
- Add an `is_training` flag to `compute_prompt_mini_batch_boundaries`
- During training (`is_training=True`): keep strict batch-size validation for distributed correctness
- During evaluation (`is_training=False`): allow partial batches and use `num_prompts` instead of `train_batch_size` for boundary calculations

Changes
- Added an `is_training` flag to `compute_prompt_mini_batch_boundaries`
- Boundary calculations use `num_prompts` during evaluation
- Backward compatible: the default (`is_training=True`) preserves existing training behavior

Result

Evaluation no longer raises an `AssertionError` when the validation set size is not divisible by `train_batch_size`.
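The flag-based boundary logic described in this PR can be sketched as follows (a simplified reconstruction for illustration — the signature and internals of the actual NovaSky-AI function may differ):

```python
def compute_prompt_mini_batch_boundaries(
    num_prompts: int,
    train_batch_size: int,
    mini_batch_size: int,
    is_training: bool = True,
):
    """Return (start, end) index pairs covering the prompts in mini-batches."""
    if is_training:
        # Strict check: distributed training relies on every rank seeing
        # a uniform, full-size batch (drop_last=True guarantees this).
        assert num_prompts == train_batch_size, (
            f"expected {train_batch_size} prompts, got {num_prompts}"
        )
        total = train_batch_size
    else:
        # Eval uses drop_last=False, so the final batch may be partial;
        # size the boundaries by the prompts actually present.
        total = num_prompts
    return [
        (start, min(start + mini_batch_size, total))
        for start in range(0, total, mini_batch_size)
    ]

# Example: a 100-prompt val set with batch size 32 leaves a final batch
# of 4 prompts, which previously tripped the assertion.
eval_bounds = compute_prompt_mini_batch_boundaries(4, 32, 8, is_training=False)
# -> [(0, 4)]
```

The default `is_training=True` keeps the old strict path, which is what makes the change backward compatible for existing training callers.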