Move profiler trace collection out of timed loop to avoid polluting perf numbers (#2684)
Closed
tissue3 wants to merge 1 commit into pytorch:main
Conversation
@tissue3 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101894273.
tissue3 added a commit to tissue3/pytorch that referenced this pull request on Apr 21, 2026:
Move profiler trace collection out of timed loop to avoid polluting perf numbers (pytorch#181043)

Summary:
X-link: pytorch/benchmark#2684

D100006805 added Chrome profiler trace collection to the inductor perf nightly benchmarks. However, the `torch.profiler.profile()` context wraps the entire timed loop, which means `_is_profiler_enabled` is `True` during all timed iterations. This inflates wall-clock numbers: every kernel launch pays ~1-2 µs extra for `_RecordFunctionFast` enter/exit, adding ~150-300 µs per forward pass on models with many kernels (e.g., mobilenet_v2 with 153 kernels).

Fix: move profiler trace collection to a separate extra iteration after the timed loop. The timed loop runs without profiler overhead, producing clean perf numbers. The profiler trace is then collected in one additional eager + compiled run, preserving the same trace quality.

Before: profiler wraps timed loop → perf numbers include profiler overhead
After: timed loop runs clean → profiler trace collected separately

Local validation on mobilenet_v2 (H100, mode/opt, `--disable-cudagraphs`):

| Mode | Before (profiler in timed loop) | After (profiler separate) |
|---|---|---|
| `--export-profiler-trace` | 1.868x | 2.168x |
| No profiler | 2.193x | 2.203x |

Test Plan:

```
# Verify perf numbers are unaffected by --export-profiler-trace:
# These two should give similar speedup (within noise):
CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2

CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace

# Verify profiler trace is still generated:
ls /tmp/torch_dynamo_*/inductor_*_mobilenet_v2.json

# Verify accuracy is unaffected:
CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --accuracy --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace
```

Differential Revision: D101894273
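For context, here is a minimal sketch of the pattern the commit describes, assuming a generic timing harness; the names (`benchmark`, `timed_iters`) and the trace path are illustrative, not the actual torchbench runner code:

```python
import time

import torch


def benchmark(model, inputs, timed_iters=100, export_trace=False):
    # Timed loop runs with no profiler context, so _is_profiler_enabled
    # stays False and kernel launches skip the _RecordFunctionFast
    # enter/exit overhead described above.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(timed_iters):
        model(*inputs)
    torch.cuda.synchronize()
    per_iter = (time.perf_counter() - start) / timed_iters

    # The trace, if requested, is collected in one separate extra
    # iteration after the timed loop, so it cannot pollute the numbers.
    if export_trace:
        with torch.profiler.profile() as prof:
            model(*inputs)
        prof.export_chrome_trace("/tmp/trace.json")  # illustrative path

    return per_iter
```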
tissue3 added a commit to tissue3/benchmark that referenced this pull request on Apr 21, 2026:
Move profiler trace collection out of timed loop to avoid polluting perf numbers (pytorch#2684)
tissue3 force-pushed from 2da79d7 to b42c738.
PaulZhang12 approved these changes on Apr 23, 2026.
huydhn approved these changes on Apr 23, 2026.
tissue3 added a commit to tissue3/pytorch that referenced this pull request on Apr 23, 2026:
Move profiler trace collection out of timed loop to avoid polluting perf numbers (pytorch#181043)
tissue3 force-pushed from b42c738 to 7a32c25.
This pull request has been merged in f6ab775.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Apr 24, 2026:
Move profiler trace collection out of timed loop to avoid polluting perf numbers (#181043)

Reviewed By: huydhn, PaulZhang12
Differential Revision: D101894273
Pull Request resolved: #181043
Approved by: https://github.com/PaulZhang12, https://github.com/huydhn
Summary:
X-link: pytorch/pytorch#181043
D100006805 added Chrome profiler trace collection to the inductor perf nightly benchmarks. However, the `torch.profiler.profile()` context wraps the entire timed loop, which means `_is_profiler_enabled` is `True` during all timed iterations.
This inflates wall-clock numbers: every kernel launch pays ~1-2 µs extra for `_RecordFunctionFast` enter/exit, adding ~150-300 µs per forward pass on models with many kernels (e.g., mobilenet_v2 with 153 kernels); see the sketch below.
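As a sanity check on that per-launch cost, here is a hypothetical micro-benchmark (not part of this PR); absolute numbers will vary by machine and build:

```python
import time

import torch

x = torch.randn(8, device="cuda")

def mean_launch_time(iters=10_000):
    # Time many tiny kernel launches and return the mean per launch.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        x.relu()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

baseline = mean_launch_time()
# With an active profiler, each launch additionally pays the
# _RecordFunctionFast enter/exit cost the summary describes.
with torch.profiler.profile():
    profiled = mean_launch_time()
print(f"profiler overhead per launch: {(profiled - baseline) * 1e6:.2f} µs")
```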
Fix: Move profiler trace collection to a separate extra iteration after the timed loop. The timed loop runs without profiler overhead, producing clean perf numbers. The profiler trace is then collected in one additional eager + compiled run, preserving the same trace quality.
Before: profiler wraps timed loop → perf numbers include profiler overhead
After: timed loop runs clean → profiler trace collected separately
Local validation on mobilenet_v2 (H100, mode/opt, `--disable-cudagraphs`):

| Mode | Before (profiler in timed loop) | After (profiler separate) |
|---|---|---|
| `--export-profiler-trace` | 1.868x | 2.168x |
| No profiler | 2.193x | 2.203x |

With the profiler inside the timed loop, its roughly fixed per-pass cost inflates the short compiled time relatively more than the longer eager time, which is why the Before column shows a lower measured speedup (1.868x vs 2.193x).
Reviewed By: huydhn, PaulZhang12
Differential Revision: D101894273