
[2091][performance] Track throughput metrics#2124

Open
florianscheidl wants to merge 67 commits into ecmwf:develop from
florianscheidl:fscheidl/flo-85-first-iteration-of-performance-metric-profiling

Conversation


@florianscheidl florianscheidl commented Mar 27, 2026

Description

Implements optional on-the-fly throughput metrics, logged per training step. The new metrics are named "performance.throughput.*" and track per-device and global throughput in terms of:

  • batches per second,
  • samples per second,
  • MB per second.

In multi-device and multi-node setups, we make an all-reduce call to get the global throughput metrics.
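The PR description does not show the aggregation code itself, but the all-reduce step could be sketched as follows, assuming the per-device rates are summed across ranks and fall back to the local values in single-process runs (the function name and metric keys here are illustrative, not the PR's actual API):

```python
import torch
import torch.distributed as dist


def global_throughput(local_samples_per_s: float, local_mb_per_s: float) -> dict:
    """Aggregate per-device throughput into global figures.

    If torch.distributed is initialized, sum the per-device rates across
    ranks with an all-reduce; otherwise the local values are already global.
    """
    rates = torch.tensor([local_samples_per_s, local_mb_per_s], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(rates, op=dist.ReduceOp.SUM)
    return {
        "performance.throughput.samples_per_s": rates[0].item(),
        "performance.throughput.mb_per_s": rates[1].item(),
    }
```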

Usage

To activate throughput tracking, set track_performance_metrics: True under train_logging in the training config; see the performance_*.yaml configs added in this PR. Then run with a base configuration, e.g.:

../WeatherGenerator-private/hpc/launch-slurm.py --time 8 --nodes=1 --base-config config/config_jepa.yml --config config/performance_jepa_config.yml
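For reference, the relevant config lines would look roughly like this (the surrounding structure is illustrative; only the key placement under train_logging is stated in the description):

```yaml
# illustrative excerpt of a performance_*.yaml override config
train_logging:
  track_performance_metrics: True
```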

Issue Number

Closes #2091.

Preview:

We investigated the effect of batch sizes on throughput, see https://gitlab.jsc.fz-juelich.de/hedgedoc/SUW6Zq-BR3uYCwU3hmIb6w?both#.

Below are screenshots from MLFlow:

(screenshot: MLFlow dashboard, 2026-04-02 17:48)

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@florianscheidl florianscheidl changed the title [2091][performance] Track throughput and utilization metrics (optional) [2091][performance] Track throughput metrics Apr 2, 2026
@florianscheidl florianscheidl marked this pull request as ready for review April 2, 2026 15:46
@pytest.fixture()
def tracker():
"""A tracker with warmup_steps=2 on CPU."""
return ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=2)

Suggested change
return ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=2)
return ThroughputTracker(device=torch.device("cpu"), warmup_steps=2)

The signature of this is wrong, right? world_size doesn't exist in the __init__ of the ThroughputTracker class
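For context, an __init__ consistent with the suggested fixture would derive the world size internally rather than take it as a parameter. A minimal sketch, assuming torch.distributed is queried when available (this is illustrative; the actual class lives in the PR):

```python
import torch
import torch.distributed as dist


class ThroughputTracker:
    """Sketch of a tracker whose world size is derived from the process
    group instead of being passed in, matching the suggested fixture."""

    def __init__(self, device: torch.device, warmup_steps: int = 0):
        self.device = device
        self.warmup_steps = warmup_steps
        # Fall back to 1 when running without a distributed process group.
        self.world_size = (
            dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
        )
```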


def test_warmup_steps_not_counted():
"""Steps during warmup do not contribute to totals."""
tracker = ThroughputTracker(device=torch.device("cpu"), world_size=1, warmup_steps=3)

Same comment as the fixture
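With the world_size argument dropped, the test could look like the sketch below. The stand-in class and its update() method are hypothetical, included only so the corrected call pattern is runnable; the real tracker in the PR defines the actual API:

```python
import torch


class ThroughputTracker:
    """Minimal stand-in for the PR's tracker (hypothetical API):
    update() is called once per training step."""

    def __init__(self, device, warmup_steps=0):
        self.warmup_steps = warmup_steps
        self._seen = 0
        self.total_samples = 0

    def update(self, n_samples):
        self._seen += 1
        if self._seen <= self.warmup_steps:
            return  # warmup steps do not contribute to totals
        self.total_samples += n_samples


def test_warmup_steps_not_counted():
    """Steps during warmup do not contribute to totals."""
    tracker = ThroughputTracker(device=torch.device("cpu"), warmup_steps=3)
    for _ in range(3):
        tracker.update(n_samples=8)  # warmup: ignored
    tracker.update(n_samples=8)      # first counted step
    assert tracker.total_samples == 8
```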

@github-project-automation github-project-automation bot moved this to In Progress in WeatherGen-dev Apr 14, 2026
fresh each step via ``compute_source_bytes`` as batch sizes
can vary across samples.
"""
torch.cuda.synchronize()

Do we need to synchronize in every step or should we skip the warm up?
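One way to act on this, sketched under the assumption that the tracker knows its current step and warmup length (names are illustrative): synchronize only once timing actually matters, i.e. after the warmup window, and only when CUDA is in use.

```python
import torch


def maybe_synchronize(step: int, warmup_steps: int) -> None:
    """Pay the synchronization cost only for timed (post-warmup) steps,
    and skip it entirely on CPU-only runs."""
    if step >= warmup_steps and torch.cuda.is_available():
        torch.cuda.synchronize()
```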


Labels

  • data: Anything related to the datasets used in the project
  • eval: anything related to the model evaluation pipeline
  • infra: Issues related to infrastructure
  • performance: Work related to performance improvements

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Performance metrics

3 participants