
Move profiler trace collection out of timed loop to avoid polluting perf numbers (#2684)#2684

Closed
tissue3 wants to merge 1 commit into pytorch:main from tissue3:export-D101894273

Conversation


tissue3 (Contributor) commented Apr 21, 2026

Summary:
X-link: pytorch/pytorch#181043

D100006805 added Chrome profiler trace collection to the inductor perf nightly benchmarks. However, the torch.profiler.profile() context wraps the entire timed loop, so _is_profiler_enabled is True during every timed iteration.

This inflates the wall-clock numbers: every kernel launch pays ~1-2µs extra for _RecordFunctionFast enter/exit, adding ~150-300µs per forward pass on models with many kernels (e.g., mobilenet_v2 with 153 kernels). The sketch below illustrates where the overhead comes from.
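The per-launch cost comes from a profiler check guarding each kernel launch in the generated wrapper code. Roughly, and only as an illustrative paraphrase of that pattern (launch_guarded is a made-up helper, not actual Inductor codegen):

```python
import torch
from torch._C._profiler import _RecordFunctionFast  # fast RecordFunction used by generated code

def launch_guarded(kernel, *args):
    # Hypothetical paraphrase: while any profiler is active, each launch
    # pays the _RecordFunctionFast enter/exit cost (~1-2 µs); with no
    # profiler attached, the branch is skipped and the launch pays nothing.
    if torch.autograd.profiler._is_profiler_enabled:
        with _RecordFunctionFast("triton_kernel"):
            kernel(*args)
    else:
        kernel(*args)
```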

Fix: Move profiler trace collection to a separate extra iteration after the timed loop. The timed loop runs without profiler overhead, producing clean perf numbers. The profiler trace is then collected in one additional eager + compiled run, preserving the same trace quality.

Before: profiler wraps timed loop → perf numbers include profiler overhead
After: timed loop runs clean → profiler trace collected separately
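A minimal sketch of the new structure (timed and export_trace are hypothetical names, not the benchmark's actual helpers):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def timed(fn, inputs, iters=100):
    # The timed loop runs with no profiler context, so _is_profiler_enabled
    # stays False and kernel launches skip the _RecordFunctionFast overhead.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def export_trace(fn, inputs, path):
    # One extra profiled iteration after timing: same trace quality,
    # but the profiler overhead never touches the measured numbers.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn(*inputs)
        torch.cuda.synchronize()
    prof.export_chrome_trace(path)
```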

Local validation on mobilenet_v2 (H100, mode/opt, --disable-cudagraphs):

| Mode | Before (profiler in timed loop) | After (profiler separate) |
| --- | --- | --- |
| --export-profiler-trace | 1.868x | 2.168x |
| No profiler | 2.193x | 2.203x |

Reviewed By: huydhn, PaulZhang12

Differential Revision: D101894273

meta-codesync Bot commented Apr 21, 2026

@tissue3 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101894273.

tissue3 added a commit to tissue3/pytorch that referenced this pull request Apr 21, 2026
… polluting perf numbers (pytorch#181043)


Test Plan:
```
# Verify perf numbers are unaffected by --export-profiler-trace:
# These two should give similar speedup (within noise):

CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2

CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --performance --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace

# Verify profiler trace is still generated:
ls /tmp/torch_dynamo_*/inductor_*_mobilenet_v2.json

# Verify accuracy is unaffected:
CUDA_VISIBLE_DEVICES=0 buck2 run fbcode//mode/opt fbcode//caffe2/benchmarks/dynamo:torchbench -- \
  --accuracy --inference --bfloat16 --inductor --disable-cudagraphs \
  --device cuda --only mobilenet_v2 --export-profiler-trace
```

Differential Revision: D101894273
meta-codesync Bot changed the title from "Move profiler trace collection out of timed loop to avoid polluting perf numbers" to "Move profiler trace collection out of timed loop to avoid polluting perf numbers (#2684)" on Apr 21, 2026
tissue3 added a commit to tissue3/benchmark that referenced this pull request Apr 21, 2026
…erf numbers (pytorch#2684)

tissue3 force-pushed the export-D101894273 branch from 2da79d7 to b42c738 on April 21, 2026 at 23:28
tissue3 added a commit to tissue3/pytorch that referenced this pull request Apr 23, 2026
… polluting perf numbers (pytorch#181043)


meta-codesync Bot commented Apr 24, 2026

This pull request has been merged in f6ab775.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 24, 2026
… polluting perf numbers (#181043)


Pull Request resolved: #181043
Approved by: https://github.com/PaulZhang12, https://github.com/huydhn
