
Move profiler trace collection out of timed loop to avoid polluting perf numbers (#2684)#2684

Closed
tissue3 wants to merge 1 commit into pytorch:main from tissue3:export-D101894273

Conversation


tissue3 (Contributor) commented Apr 21, 2026

Summary:
X-link: pytorch/pytorch#181043

D100006805 added Chrome profiler trace collection to the inductor perf nightly benchmarks. However, the torch.profiler.profile() context wraps the entire timed loop, so _is_profiler_enabled is True during every timed iteration.

This inflates the wall-clock numbers: every kernel launch pays ~1-2µs extra for _RecordFunctionFast enter/exit, adding ~150-300µs per forward pass on models with many kernels (e.g., mobilenet_v2 with 153 kernels). The sketch below illustrates where the overhead comes from.
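The per-launch cost comes from a profiler check guarding each kernel launch in the generated wrapper code. Roughly, and only as an illustrative paraphrase of that pattern (launch_guarded is a made-up helper, not actual Inductor codegen):

```python
import torch
from torch._C._profiler import _RecordFunctionFast  # fast RecordFunction used by generated code

def launch_guarded(kernel, *args):
    # Hypothetical paraphrase: while any profiler is active, each launch
    # pays the _RecordFunctionFast enter/exit cost (~1-2 µs); with no
    # profiler attached, the branch is skipped and the launch pays nothing.
    if torch.autograd.profiler._is_profiler_enabled:
        with _RecordFunctionFast("triton_kernel"):
            kernel(*args)
    else:
        kernel(*args)
```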

Fix: Move profiler trace collection to a separate extra iteration after the timed loop. The timed loop runs without profiler overhead, producing clean perf numbers. The profiler trace is then collected in one additional eager + compiled run, preserving the same trace quality.

Before: profiler wraps timed loop → perf numbers include profiler overhead
After: timed loop runs clean → profiler trace collected separately
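A minimal sketch of the new structure (timed and export_trace are hypothetical names, not the benchmark's actual helpers):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def timed(fn, inputs, iters=100):
    # The timed loop runs with no profiler context, so _is_profiler_enabled
    # stays False and kernel launches skip the _RecordFunctionFast overhead.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def export_trace(fn, inputs, path):
    # One extra profiled iteration after timing: same trace quality,
    # but the profiler overhead never touches the measured numbers.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn(*inputs)
        torch.cuda.synchronize()
    prof.export_chrome_trace(path)
```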

Local validation on mobilenet_v2 (H100, mode/opt, --disable-cudagraphs):

| Mode | Before (profiler in timed loop) | After (profiler separate) |
| --- | --- | --- |
| --export-profiler-trace | 1.868x | 2.168x |
| No profiler | 2.193x | 2.203x |

Reviewed By: huydhn, PaulZhang12

Differential Revision: D101894273

meta-codesync Bot commented Apr 21, 2026

@tissue3 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101894273.

tissue3 added a commit to tissue3/pytorch that referenced this pull request Apr 21, 2026
… polluting perf numbers (pytorch#181043)


Test Plan:
```
# Verify perf numbers are unaffected by --export-profiler-trace:
# These two should give similar speedup (within noise):

CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2

CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace

# Verify profiler trace is still generated:
ls /tmp/torch_dynamo_*/inductor_*_mobilenet_v2.json

# Verify accuracy is unaffected:
CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --accuracy --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace
```

Differential Revision: D101894273
meta-codesync Bot changed the title from "Move profiler trace collection out of timed loop to avoid polluting perf numbers" to "Move profiler trace collection out of timed loop to avoid polluting perf numbers (#2684)" on Apr 21, 2026
tissue3 added a commit to tissue3/benchmark that referenced this pull request Apr 21, 2026
…erf numbers (pytorch#2684)

tissue3 force-pushed the export-D101894273 branch from 2da79d7 to b42c738 on April 21, 2026 at 23:28
tissue3 added a commit to tissue3/pytorch that referenced this pull request Apr 23, 2026
… polluting perf numbers (pytorch#181043)


meta-codesync Bot commented Apr 24, 2026

This pull request has been merged in f6ab775.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 24, 2026
… polluting perf numbers (#181043)


Pull Request resolved: #181043
Approved by: https://github.com/PaulZhang12, https://github.com/huydhn
