
[2091][performance] Track throughput metrics#2124

Open
florianscheidl wants to merge 67 commits into ecmwf:develop from
florianscheidl:fscheidl/flo-85-first-iteration-of-performance-metric-profiling

Conversation


@florianscheidl florianscheidl commented Mar 27, 2026

Description

Implements optional on-the-fly throughput metrics, logged per training step. The new metrics are named "performance.throughput.*" and track per-device and global throughput in terms of:

  • batches per second,
  • samples per second,
  • MB per second.

In multi-device and multi-node setups, we make an all-reduce call to get the global throughput metrics.
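The PR description does not show the aggregation code itself, but the all-reduce step could be sketched as follows, assuming the per-device rates are summed across ranks and fall back to the local values in single-process runs (the function name and metric keys here are illustrative, not the PR's actual API):

```python
import torch
import torch.distributed as dist


def global_throughput(local_samples_per_s: float, local_mb_per_s: float) -> dict:
    """Aggregate per-device throughput into global figures.

    If torch.distributed is initialized, sum the per-device rates across
    ranks with an all-reduce; otherwise the local values are already global.
    """
    rates = torch.tensor([local_samples_per_s, local_mb_per_s], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(rates, op=dist.ReduceOp.SUM)
    return {
        "performance.throughput.samples_per_s": rates[0].item(),
        "performance.throughput.mb_per_s": rates[1].item(),
    }
```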

Usage

To activate throughput tracking, set track_performance_metrics: True under train_logging in the training config; see the performance_*.yaml configs added in this PR. Then run with a base configuration, e.g.:

../WeatherGenerator-private/hpc/launch-slurm.py --time 8 --nodes=1 --base-config config/config_jepa.yml --config config/performance_jepa_config.yml
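For reference, the relevant config lines would look roughly like this (the surrounding structure is illustrative; only the key placement under train_logging is stated in the description):

```yaml
# illustrative excerpt of a performance_*.yaml override config
train_logging:
  track_performance_metrics: True
```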

Issue Number

Closes #2091.

Preview:

We investigated the effect of batch sizes on throughput, see https://gitlab.jsc.fz-juelich.de/hedgedoc/SUW6Zq-BR3uYCwU3hmIb6w?both#.

Below are screenshots from MLFlow:

(screenshot: MLFlow dashboard, 2026-04-02 17:48)

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@florianscheidl florianscheidl changed the title [2091][performance] Track throughput and utilization metrics (optional) [2091][performance] Track throughput metrics Apr 2, 2026
@florianscheidl florianscheidl marked this pull request as ready for review April 2, 2026 15:46
@pytest.fixture()
def tracker():
"""A tracker with warmup_steps=2 on CPU."""
return ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=2)

Suggested change
return ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=2)
return ThroughputTracker(device=torch.device("cpu"), warmup_steps=2)

The signature of this is wrong, right? world_size doesn't exist in the __init__ of the ThroughputTracker class
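For context, an __init__ consistent with the suggested fixture would derive the world size internally rather than take it as a parameter. A minimal sketch, assuming torch.distributed is queried when available (this is illustrative; the actual class lives in the PR):

```python
import torch
import torch.distributed as dist


class ThroughputTracker:
    """Sketch of a tracker whose world size is derived from the process
    group instead of being passed in, matching the suggested fixture."""

    def __init__(self, device: torch.device, warmup_steps: int = 0):
        self.device = device
        self.warmup_steps = warmup_steps
        # Fall back to 1 when running without a distributed process group.
        self.world_size = (
            dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
        )
```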


def test_warmup_steps_not_counted():
"""Steps during warmup do not contribute to totals."""
tracker = ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=3)

Same comment as the fixture
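With the world_size argument dropped, the test could look like the sketch below. The stand-in class and its update() method are hypothetical, included only so the corrected call pattern is runnable; the real tracker in the PR defines the actual API:

```python
import torch


class ThroughputTracker:
    """Minimal stand-in for the PR's tracker (hypothetical API):
    update() is called once per training step."""

    def __init__(self, device, warmup_steps=0):
        self.warmup_steps = warmup_steps
        self._seen = 0
        self.total_samples = 0

    def update(self, n_samples):
        self._seen += 1
        if self._seen <= self.warmup_steps:
            return  # warmup steps do not contribute to totals
        self.total_samples += n_samples


def test_warmup_steps_not_counted():
    """Steps during warmup do not contribute to totals."""
    tracker = ThroughputTracker(device=torch.device("cpu"), warmup_steps=3)
    for _ in range(3):
        tracker.update(n_samples=8)  # warmup: ignored
    tracker.update(n_samples=8)      # first counted step
    assert tracker.total_samples == 8
```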

@github-project-automation github-project-automation bot moved this to In Progress in WeatherGen-dev Apr 14, 2026
fresh each step via ``compute_source_bytes`` as batch sizes
can vary across samples.
"""
torch.cuda.synchronize()

Do we need to synchronize in every step or should we skip the warm up?
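One way to act on this, sketched under the assumption that the tracker knows its current step and warmup length (names are illustrative): synchronize only once timing actually matters, i.e. after the warmup window, and only when CUDA is in use.

```python
import torch


def maybe_synchronize(step: int, warmup_steps: int) -> None:
    """Pay the synchronization cost only for timed (post-warmup) steps,
    and skip it entirely on CPU-only runs."""
    if step >= warmup_steps and torch.cuda.is_available():
        torch.cuda.synchronize()
```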


Labels

  • data: Anything related to the datasets used in the project
  • eval: anything related to the model evaluation pipeline
  • infra: Issues related to infrastructure
  • performance: Work related to performance improvements

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Performance metrics

3 participants