dCDH Phase 3a: placebo SE, non-binary treatment, parity SE assertions by igerber · Pull Request #300 · igerber/diff-diff

igerber · 2026-04-12T22:44:29Z

Summary

Add multi-horizon placebo analytical SE and bootstrap SE via new _compute_per_group_if_placebo_horizon() helper (resolves Phase 2 deferral of placebo inference)
Support non-binary (ordinal and continuous) treatment via the multi-horizon per-group path (L_max >= 1): remove binary enforcement gate, generalize within-cell-varying check, fix astype(int) truncation, rewrite multi-switch detection
Add per-horizon SE and placebo SE parity assertions against R DIDmultiplegtDYN golden values
Sweep stale Phase 2 comments, add TODO.md entries for remaining deferrals
Add REGISTRY.md documentation for non-binary treatment support

Methodology references (required if estimator / math changes)

Method name(s): ChaisemartinDHaultfoeuille (dCDH) - placebo influence function, non-binary treatment generalization
Paper / source link(s): de Chaisemartin & D'Haultfoeuille (2022, NBER WP 29873), Web Appendix Section 3.7.3 (variance formula), Section 2 (general treatment setup)
Any intentional deviations from the source (and why):
- Non-binary treatment requires L_max >= 1 because the per-period DID path uses binary joiner/leaver categorization; the multi-horizon per-group DID_{g,l} handles non-binary correctly (documented in REGISTRY)
- Groups with >1 treatment-change period are dropped (including monotone multi-step paths like 0->1->2) because the per-group building block attributes the full outcome change to the first switch (documented in REGISTRY)
- Cohort definition under non-binary pools groups with different dose magnitudes but same (D_{g,1}, F_g, S_g) - consistent with the paper's general setup

Validation

Tests added/updated: tests/test_chaisemartin_dhaultfoeuille.py (+8 new tests: placebo SE finite, placebo bootstrap SE, ordinal treatment, within-cell heterogeneity rejection, single large dose, multi-switch detection, monotone multi-step drop, normalized effects general formula), tests/test_chaisemartin_dhaultfoeuille_parity.py (+3 SE/placebo parity tests)
93 unit tests + 12 parity tests + 19 methodology tests all passing

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

- Add _compute_per_group_if_placebo_horizon() for multi-horizon placebo analytical SE and bootstrap SE (resolves Phase 2 deferral) - Remove binary treatment enforcement; support ordinal and continuous treatment via L_max path (5 downstream code path fixes) - Generalize multi-switch detection to count non-zero treatment diffs - Add SE/CI/placebo parity assertions for multi-horizon R golden values - Sweep stale Phase 2 comments, add TODO.md entries for remaining deferrals Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add REGISTRY Note for non-binary treatment: paper basis, L_max requirement, float baselines, cohort definition, dose-magnitude pooling - Fix _drop_crossing_cells() docstring to match implementation: >1 treatment-change period (not "direction reversal") - Update _validate_and_aggregate_to_cells() and fit() docstrings to reflect non-binary support - Fix type annotations: Tuple[int,int,int] -> Tuple[float,int,int] for cohort keys - Add test_monotone_multi_step_dropped for 0->1->2 path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-12T22:52:16Z

Overall Assessment

⛔ Blocker

Execution note: I did not run the changed tests in this sandbox because the available Python lacks numpy/pandas; this review is based on diff and source inspection.

Executive Summary

One unmitigated P0 and one unmitigated P1 remain in the estimator changes.
Non-binary treatment is now accepted on code paths that still use the binary-only per-period logic, so unsupported inputs can yield wrong or missing estimates with no explicit guard.
Dynamic placebo SE / p-value / CI are now emitted, but the public methodology contract still says those fields should remain NaN.
The new TODO.md tracking is fine for deferred parity follow-up, but it does not mitigate either blocker.

Methodology

Cross-check against the cited source: the dynamic paper’s event-study estimator compares each switcher to groups with the same period-one treatment, so the baseline-matching part of the non-binary extension is directionally consistent with the paper. The placebo section separately defines DID^{pl}_ℓ as a testable implication of the assumptions, while the confidence-interval theorem and cohort-recentered variance formula are stated for DID_ℓ / δ_ℓ, not for the placebo estimand. (arxiv.org)

Severity P0 — Non-binary inputs can still flow through the unsupported per-period binary path. Impact: _validate_and_aggregate_to_cells() now admits any numeric treatment, but fit() never enforces the documented “non-binary requires L_max” contract, _compute_per_period_dids() still counts only 0→1 / 1→0 transitions, and the per-group dynamic path still starts only at L_max >= 2. On mixed panels this can silently return a finite overall_att that ignores non-binary switches; on pure non-binary panels with L_max=1 it falls back to NaN instead of the advertised per-group estimator. Concrete fix: after aggregation, detect any d_gt outside {0,1} and either raise unless the per-group path is actually used, or wire _compute_multi_horizon_dids(..., L_max=1) into overall_att / event_study_effects[1]. Relevant code: diff_diff/chaisemartin_dhaultfoeuille.py:118, diff_diff/chaisemartin_dhaultfoeuille.py:847, diff_diff/chaisemartin_dhaultfoeuille.py:917, diff_diff/chaisemartin_dhaultfoeuille.py:1040, diff_diff/chaisemartin_dhaultfoeuille.py:2040, tests/test_chaisemartin_dhaultfoeuille.py:133, tests/test_chaisemartin_dhaultfoeuille.py:2249.
Severity P1 — Dynamic placebo inference is now surfaced as analytical and bootstrap SE / p-value / CI, but that is still an undocumented methodology change relative to both the paper cross-check above and the current registry/results contract. Impact: the code now emits finite inference for placebo_event_study, while the registry and results docstrings still say dynamic placebo inference should remain NaN; because the cited source material’s inference result is stated for DID_ℓ / δ_ℓ, this is an untracked change in the estimator’s statistical contract. Concrete fix: either revert dynamic placebo inference back to NaN, or add an explicit **Note:** / **Deviation...** entry in docs/methodology/REGISTRY.md and matching result/docstring updates that clearly label this as an intentional library extension with its supporting derivation. Relevant code: diff_diff/chaisemartin_dhaultfoeuille.py:1191, diff_diff/chaisemartin_dhaultfoeuille.py:1412, diff_diff/chaisemartin_dhaultfoeuille.py:1681, docs/methodology/REGISTRY.md:546, docs/methodology/REGISTRY.md:586, diff_diff/chaisemartin_dhaultfoeuille_results.py:51, diff_diff/chaisemartin_dhaultfoeuille_results.py:149. (arxiv.org)

Code Quality

No additional code-quality finding beyond the methodology blockers above.

Performance

No material performance regressions stood out in the changed paths.

Maintainability

No independent maintainability blocker beyond the contract split already called out between code, registry, docstrings, and tests.

Tech Debt

Severity P3 — The new TODO tracking is acceptable for the mixed-direction SE parity follow-up, but it does not mitigate the P0 / P1 issues above because project rules do not allow TODO tracking to downgrade correctness or methodology defects. Impact: none by itself. Concrete fix: keep the TODOs, but resolve the blockers in code/registry rather than treating them as deferred. Relevant file: TODO.md:59.

Security

No security findings in the changed files.

Documentation/Tests

Severity P2 — The tests currently encode the unsupported non-binary contract. Impact: test_non_binary_treatment_accepted now blesses fit() with non-binary treatment and no L_max, so the suite would preserve the silent P0 above instead of catching it. Concrete fix: replace that test with either an explicit rejection for L_max=None or a finite-estimate test for the correctly wired per-group L_max=1/2 path, and add a mixed 0→1 / 0→2 regression. Relevant files: tests/test_chaisemartin_dhaultfoeuille.py:133, tests/test_chaisemartin_dhaultfoeuille.py:2210.
Severity P2 — The placebo contract is internally inconsistent across docs and tests. Impact: the registry/results docstrings still advertise NaN dynamic-placebo inference, while the new tests require finite analytical and bootstrap placebo SEs. Concrete fix: after resolving the P1 above, align REGISTRY.md, result docstrings, and test expectations to the same contract. Relevant files: docs/methodology/REGISTRY.md:546, docs/methodology/REGISTRY.md:586, diff_diff/chaisemartin_dhaultfoeuille_results.py:51, tests/test_chaisemartin_dhaultfoeuille.py:1964.

Path to Approval

Enforce the non-binary gate before the Phase 1 per-period aggregation, and either route L_max=1 through the per-group DID_{g,1} machinery or reject L_max=1 until that path exists.
Resolve the placebo inference contract: either revert dynamic-placebo inference to NaN, or document the new placebo IF/SE surface as an explicit methodology note/deviation and update every public docstring/result surface to match.
Add regressions for non-binary + L_max=None, non-binary + L_max=1, and mixed binary/non-binary switch panels so the unsupported fallback cannot silently reappear.

P0: Add explicit ValueError for non-binary treatment + L_max=None. P0: Change multi-horizon gate from L_max>=2 to L_max>=1 so per-group DID_{g,1} path activates at L_max=1 (handles non-binary correctly). Populate overall_att from per-group l=1 when per-period path yields NaN. P1: Update REGISTRY Notes to document placebo SE as library extension (paper's Theorem 1 is for DID_l, placebo IF applies same structure to backward outcome differences). Update results docstring. P2: Fix test_non_binary_treatment_accepted to assert ValueError, add test_non_binary_treatment_with_lmax regression. Update test_L_max_1 to test per-group path behavior (not cross-path equality, since per-group and per-period are documented different estimands). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-12T23:02:13Z

/ai-review

github-actions · 2026-04-12T23:15:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: df6db2c415fc4ab8ffb4d0624253bbf30a066686

Overall Assessment
⛔ Blocker

Execution note: I could not run the changed tests in this sandbox because the available Python environment is missing numpy and pytest; this review is based on diff and source inspection.

Executive Summary

The prior placebo-contract blocker is resolved: REGISTRY.md and the results docstrings now document multi-horizon placebo SE/bootstrap as an intentional library extension, so I am not counting that issue again.
The previous non-binary blocker is only partially fixed: mixed 0→1 and 0→2 panels with L_max=1 can still return a top-level overall_att / overall_se built from the binary-only per-period path, even though the supported general-design estimand is the per-group DID_{g,1} path. (nber.org)
[Newly identified] The PR widens non-binary support into twfe_diagnostic / twowayfeweights(), but that diagnostic still hard-codes treated cells as d_gt == 1; the reference TwoWayFEWeights API treats D as the treatment variable and its summary measures over treated (g,t) cells generally, not only D == 1. (rdrr.io)
[Newly identified] Pure non-binary panels with n_bootstrap > 0 still lose all horizon/bootstrap inference because bootstrap execution returns early on the binary-only N_S divisor before processing the valid multi-horizon inputs. The dynamic paper explicitly allows non-binary, non-absorbing treatments, so this new supported input class needs its inference path wired end to end. (nber.org)
The new tests cover pure non-binary panels, but they still miss the mixed 0→1 / 0→2 L_max=1 case and the non-binary bootstrap path, so the remaining correctness issues are not protected.

Methodology
Previous placebo-inference documentation issues are resolved and do not block approval.

Severity P0 — Previous blocker only partially resolved: mixed binary/non-binary panels with L_max=1 still put the wrong estimand on the top-level results surface. Impact: _compute_per_period_dids() still defines switchers only through 0→1 / 1→0, Step 10 computes overall_att from that binary-only subset whenever N_S > 0, and Step 20 only swaps in multi_horizon_inference[1] when overall_att is NaN. On a supported mixed panel, results.overall_att / overall_se can silently exclude the non-binary switchers while event_study_effects[1] includes them. That contradicts both the PR’s own non-binary support claim and the dynamic paper’s general-design DID_{g,\ell} construction, which compares switchers to groups with the same period-one treatment that have not changed yet by the relevant horizon. Concrete fix: when any treatment value outside {0,1} is present and L_max=1, bypass the per-period DID_M path entirely and set the top-level overall_* fields from multi_horizon_inference[1] for all such panels, or reject L_max=1 until that contract is explicitly documented. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:2056, diff_diff/chaisemartin_dhaultfoeuille.py:911, diff_diff/chaisemartin_dhaultfoeuille.py:1577. (nber.org)
Severity P1 — [Newly identified] Non-binary support is now exposed on the TWFE diagnostic path, but the diagnostic math still assumes binary treatment. Impact: the shared validator now accepts non-binary inputs, fit() computes twfe_* by default on that input, and _compute_twfe_diagnostic() normalizes weights, counts negative weights, and computes sigma_fe using treated_mask = d_arr == 1. Cells with dose 2, 3, etc. are dropped from the very diagnostic the PR now blesses via test_twowayfeweights_accepts_non_binary_treatment, so results.twfe_* and standalone twowayfeweights() can silently misstate the diagnostic surface. Concrete fix: either restore a binary-only guard for twfe_diagnostic / twowayfeweights() or implement the general-treatment formula consistent with the reference TwoWayFEWeights package. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:202, diff_diff/chaisemartin_dhaultfoeuille.py:611, diff_diff/chaisemartin_dhaultfoeuille.py:3468, tests/test_chaisemartin_dhaultfoeuille.py:1674. (rdrr.io)

Code Quality

Severity P1 — [Newly identified] The new non-binary path is not propagated through bootstrap orchestration. Impact: fit() correctly builds multi_horizon_inputs and placebo_horizon_inputs, but still calls _compute_dcdh_bootstrap() with divisor_overall=N_S; on pure non-binary panels N_S == 0, the mixin returns immediately with all-NaN bootstrap results before it reaches the event-study or placebo-horizon bootstrap loops. Supported non-binary panels therefore cannot obtain the requested bootstrap SE/CI/p-values for DID_l or dynamic placebos. Concrete fix: make _compute_dcdh_bootstrap() skip only the scalar overall DID_M bootstrap when divisor_overall <= 0, while still processing multi_horizon_inputs and placebo_horizon_inputs if they were provided. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:1521, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:162, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:253. (nber.org)

Performance
No findings in the changed paths.

Maintainability
No independent maintainability finding beyond the contract/propagation issues above.

Tech Debt

Severity P3 — The new TODO entries are appropriately tracked deferred work and do not block merge by project policy. Impact: none by itself; the single-period placebo-SE limitation and mixed-direction parity follow-up are now explicitly recorded. Concrete fix: no action required beyond keeping the entries tracked. Ref: TODO.md:59.

Security
No security findings in the changed files.

Documentation/Tests

Severity P2 — The new tests still miss the two branches that matter for this PR’s risk surface, and one new test encodes the wrong contract. Impact: there is no regression for a mixed 0→1 / 0→2 panel at L_max=1, no non-binary panel with n_bootstrap > 0, and test_twowayfeweights_accepts_non_binary_treatment currently blesses the diagnostic bug above instead of catching it. Concrete fix: add a mixed-panel L_max=1 regression that asserts the intended top-level overall_* contract on non-binary input, add a pure non-binary bootstrap regression expecting finite event_study_ses[1] (and placebo bootstrap when available), and replace the standalone-helper acceptance test with whichever contract you actually support. Refs: tests/test_chaisemartin_dhaultfoeuille.py:133, tests/test_chaisemartin_dhaultfoeuille.py:1674, tests/test_chaisemartin_dhaultfoeuille.py:1823, tests/test_chaisemartin_dhaultfoeuille.py:1988.
Severity P3 — The parity-tolerance documentation is now stale. Impact: REGISTRY.md still says pure-direction SE parity uses a 5% tolerance, while the updated tests now use 10% generally and 15% on the long-panel scenario. Concrete fix: update the registry note to match the new tolerated gap, or tighten the tests back to the documented tolerance if 5% is still the intended contract. Refs: docs/methodology/REGISTRY.md:604, tests/test_chaisemartin_dhaultfoeuille_parity.py:235.

Path to Approval

For any non-binary input with L_max=1, stop using the per-period binary DID_M aggregate as the top-level result. Either map overall_att / overall_se / overall_p_value / overall_conf_int to multi_horizon_inference[1] unconditionally, or reject L_max=1 until that contract is explicit.
Prevent non-binary data from flowing into binary-only TWFE diagnostics. Either reject non-binary twfe_diagnostic / twowayfeweights() calls with a clear error, or implement the general-treatment formulas consistent with TwoWayFEWeights.
Refactor _compute_dcdh_bootstrap() so divisor_overall <= 0 only disables the scalar DID_M bootstrap; it should still compute event_study_ses / placebo_horizon_ses when those inputs exist.
Add regressions for: mixed 0→1 + 0→2 at L_max=1; pure 0→2 with n_bootstrap > 0; and the intended non-binary TWFE-diagnostic contract.

… N_S=0 P0: When L_max >= 1, always set overall_att from per-group DID_{g,1} (not conditional on NaN). Fixes mixed binary/non-binary panels where per-period N_S > 0 but excludes non-binary switches. P1: Gate TWFE diagnostic and twowayfeweights() to binary-only treatment. Emit warning on fit(), raise ValueError on standalone helper. P1: Refactor _compute_dcdh_bootstrap() to skip scalar DID_M when divisor_overall <= 0 but still process multi_horizon_inputs and placebo_horizon_inputs. Fixes non-binary bootstrap path. P2: Add regressions for mixed 0->1/0->2 panel at L_max=1, non-binary bootstrap, and TWFE diagnostic skip on non-binary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-12T23:23:57Z

/ai-review

github-actions · 2026-04-12T23:34:58Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 966a00d1044692b4faaa5cc12ae7c91d3ab13a4c

Overall Assessment

⚠️ Needs changes

Execution note: I could not run the changed tests in this sandbox because the available Python environment is missing numpy; this review is based on diff and source inspection.

Executive Summary

Previously flagged blockers look resolved: mixed non-binary L_max=1 panels now route the top-level effect through horizon 1, non-binary TWFE diagnostics are guarded, and the bootstrap mixin no longer returns early when N_S == 0. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L1585-L1600, diff_diff/chaisemartin_dhaultfoeuille.py:L611-L622, diff_diff/chaisemartin_dhaultfoeuille.py:L3671-L3679, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L163-L190.
The high-level direction of the PR is consistent with the source material: the paper discusses placebo estimators and explicitly notes that its results can extend beyond the binary treatment case to ordered discrete treatments. The remaining issues are about the Python contract around L_max=1, not about widening support per se. citeturn13open0
Severity P1: L_max=1 now changes the top-level overall estimand from Phase 1 DID_M to per-group DID_1, but REGISTRY and the public results surface still describe that top-level quantity as DID_M.
Severity P1: when n_bootstrap > 0, Step 20 overwrites the top-level overall_* inference for L_max=1 with pre-bootstrap analytical horizon-1 values, so results.overall_* can silently disagree with results.event_study_effects[1].
Severity P2/P3: tests and docs still miss the new L_max=1 public-surface contract and contain stale L_max/placebo/parity wording.

Methodology

The paper itself does not make the placebo/non-binary direction suspicious here: it covers placebo estimators and extends beyond the binary case, while REGISTRY.md explicitly labels placebo SE as a documented library extension. citeturn13open0

Severity P1. Impact: fit(..., L_max=1) now sets overall_att/overall_se/overall_p_value from multi_horizon_inference[1], i.e. per-group DID_1, not Phase 1 DID_M; but REGISTRY still says the per-group path starts when L_max >= 2, and the results object still labels the top-level estimand as DID_M in __repr__, summary(), coef_var, is_significant, and to_dataframe(). That is an undocumented methodology change on a public API and it mislabels the estimand users see. Concrete fix: either revert L_max=1 to the Phase 1 DID_M surface, or fully document L_max=1 as DID_1 and relabel every top-level output surface accordingly. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L1585-L1600, docs/methodology/REGISTRY.md:L520-L544, diff_diff/chaisemartin_dhaultfoeuille_results.py:L410-L539, diff_diff/chaisemartin_dhaultfoeuille_results.py:L790-L805.
Severity P1. Impact: the new L_max=1 path breaks the documented bootstrap inference surface. Bootstrap inference is first propagated into the top-level overall_* fields, but Step 20 then immediately overwrites those fields from analytical multi_horizon_inference[1]. As a result, summary(), to_dataframe(), is_significant, and significance_stars can report analytical overall_* values while event_study_effects[1] carries the bootstrap values. That is silent, user-visible inference drift on the new path. Concrete fix: after event-study bootstrap propagation, set the top-level overall fields from event_study_effects[1] whenever the overall estimand is horizon 1, or preserve the earlier bootstrap override instead of resetting from l1_inf. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L1566-L1600, diff_diff/chaisemartin_dhaultfoeuille_results.py:L151-L164, docs/methodology/REGISTRY.md:L592-L592, tests/test_chaisemartin_dhaultfoeuille.py:L2388-L2399.

Code Quality

No findings beyond the two methodology/inference contract issues above.

Performance

No findings.

Maintainability

No findings beyond the documentation drift noted below.

Tech Debt

No findings. The new TODO entries are appropriate tracking for deferrable work, but they do not mitigate the two P1 issues above.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests do not protect the new public L_max=1 contract. test_L_max_1_uses_per_group_path only checks finiteness, and test_nonbinary_bootstrap stops at bootstrap_results.event_study_ses[1]; neither asserts that the top-level overall_* surface matches horizon 1 or that the reported estimand label is correct. That is why the two P1 issues above can slip through. Concrete fix: add binary and non-binary L_max=1 regressions that assert overall_* == event_study_effects[1] and assert the chosen top-level estimand label/contract. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L1824-L1843, tests/test_chaisemartin_dhaultfoeuille.py:L2388-L2399.
Severity P3. Impact: documentation is still internally inconsistent after widening the L_max=1 and placebo surface. The results docstrings and to_dataframe() docs still say event-study/normalized outputs require L_max >= 2; TODO.md still says multi-horizon placebo SE starts at L_max >= 2; and REGISTRY still says pure-direction SE parity uses 5% rtol while the updated tests now use 10% and 15%. Non-blocking, but it makes the intended contract harder to read. Concrete fix: align the docs/tracking text with the implemented thresholds, or tighten the code/tests back to the documented contract. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L304-L310, diff_diff/chaisemartin_dhaultfoeuille_results.py:L770-L781, TODO.md:L59-L59, docs/methodology/REGISTRY.md:L604-L604, tests/test_chaisemartin_dhaultfoeuille_parity.py:L237-L343.

Path to Approval

Pick a single L_max=1 overall-estimand contract. If overall_* should remain Phase 1 DID_M, stop overriding from multi_horizon_inference[1]; if it should be DID_1, update REGISTRY and every top-level label/docstring/output surface to say so.
Preserve bootstrap inference on that same L_max=1 overall surface. After bootstrap propagation, make overall_se, overall_t_stat, overall_p_value, and overall_conf_int match event_study_effects[1] whenever L_max=1.
Add regressions for both a binary and a non-binary L_max=1 panel with n_bootstrap > 0 asserting that the top-level overall surface and horizon-1 event-study surface are identical.

…nment P1: Sync overall_* from event_study_effects[1] AFTER bootstrap propagation so bootstrap SE/p/CI flow to top-level surface for L_max=1. P1: Label overall estimand as "DID_1" (not "DID_M") when L_max=1 in __repr__. Update REGISTRY and results docstrings to document L_max=1 as per-group DID_1 path. P2: Add binary + non-binary L_max=1 bootstrap regressions asserting overall_* == event_study_effects[1]. P3: Update TODO.md, REGISTRY L_max >= 2 references to L_max >= 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-12T23:39:49Z

/ai-review

github-actions · 2026-04-12T23:48:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e59840d2eefeb68a503e00c10eacf4cf4ee1f4d4

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Execution note: I could not run the changed tests in this sandbox because pytest is not installed.

Executive Summary

The previous P1 bootstrap-surface issue looks resolved: fit() now re-syncs overall_* from event_study_effects[1] after bootstrap propagation for L_max >= 1, so the top-level inference no longer drifts from horizon 1. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L1614-L1665
The estimator/math changes themselves are broadly aligned with the source material: the dynamic companion paper explicitly handles non-binary, non-absorbing treatment and develops inference results for DID_l / weighted combinations, while placebo estimators are discussed separately; REGISTRY also documents the placebo-SE path as a library extension. Refs: docs/methodology/REGISTRY.md:L520-L546. citeturn1search0turn2search24turn2search25
One previous P1 remains: the L_max=1 public surface is still inconsistent. fit() and REGISTRY define the top-level estimand as DID_1, but summary() and the overall DataFrame outputs still present it as DID_M, and the joiners_leavers overall row still uses DID_M-oriented counts/metadata. Refs: docs/methodology/REGISTRY.md:L520-L546, diff_diff/chaisemartin_dhaultfoeuille_results.py:L516-L521, diff_diff/chaisemartin_dhaultfoeuille_results.py:L657-L717, diff_diff/chaisemartin_dhaultfoeuille_results.py:L777-L837
The new TODO rows correctly track the remaining non-blocking deferrals around single-period placebo SE and mixed-direction parity SE/CI comparison. Refs: TODO.md:L59-L60
The updated tests still do not cover summary() / to_dataframe() on L_max=1, so the remaining public-surface mismatch is not pinned. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L1824-L1839, tests/test_chaisemartin_dhaultfoeuille.py:L1257-L1290

Methodology

Severity P1. Impact: fit() now implements, and REGISTRY now documents, L_max=1 as top-level DID_1; however summary() still titles the overall block as DID_M, only renders the event-study / dynamic-placebo tables when L_max >= 2, and to_dataframe("overall") / to_dataframe("joiners_leavers") still label the overall row as DID_M. The joiners_leavers overall row also still attaches DID_M-style N_S / switching-cell counts to that row. That leaves users with a mislabeled estimand and mismatched metadata on the new L_max=1 path. Concrete fix: propagate the L_max=1 => DID_1 contract through every public renderer; if the overall row is meant to describe the top-level DID_1, use the horizon-1 label and horizon-1 counts, or expose DID_M as a separate row instead of mixing the two contracts. Refs: docs/methodology/REGISTRY.md:L520-L546, diff_diff/chaisemartin_dhaultfoeuille.py:L1585-L1665, diff_diff/chaisemartin_dhaultfoeuille_results.py:L516-L521, diff_diff/chaisemartin_dhaultfoeuille_results.py:L657-L717, diff_diff/chaisemartin_dhaultfoeuille_results.py:L777-L837
Severity P3 (informational). Impact: I do not treat the new non-binary-treatment path or dynamic-placebo SE path as methodology defects. The paper explicitly covers non-binary, non-absorbing treatment and develops inference for DID_l / weighted combinations, while placebo estimators are handled separately; REGISTRY marks placebo SE as a library extension and keeps single-period DID_M^pl SE as NaN. Concrete fix: none. Refs: docs/methodology/REGISTRY.md:L471-L474, docs/methodology/REGISTRY.md:L546-L546. citeturn1search0turn2search24turn2search25

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the L_max=1 public-surface inconsistency above.

Tech Debt

Severity P3 (informational). Impact: the PR adds the remaining deferrable items to TODO.md, which is the right mitigation under this repo’s review policy. Concrete fix: none required for approval. Refs: TODO.md:L59-L60

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new L_max=1 tests only assert that top-level overall_* matches event_study_effects[1]; they do not assert the rendered/public contract in summary() or to_dataframe(). Existing renderer tests still pin the default DID_M surface only. That gap is exactly where the surviving P1 slipped through. Concrete fix: add L_max=1 regressions for summary(), to_dataframe("overall"), and to_dataframe("joiners_leavers") on both binary and mixed/non-binary panels, with and without bootstrap. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L1824-L1839, tests/test_chaisemartin_dhaultfoeuille.py:L2392-L2423, tests/test_chaisemartin_dhaultfoeuille.py:L1205-L1219, tests/test_chaisemartin_dhaultfoeuille.py:L1257-L1290
Severity P3. Impact: doc drift remains on the widened L_max=1 surface and parity tolerances. to_dataframe() still says event-study / normalized outputs require L_max >= 2, its normalized-level error text still says L_max >= 2, and the REGISTRY parity note still says pure-direction SE parity uses 5% rtol while the changed parity tests now use 10% / 15%. Concrete fix: align the docstrings, error text, and REGISTRY note with the implemented thresholds. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L777-L788, diff_diff/chaisemartin_dhaultfoeuille_results.py:L949-L951, docs/methodology/REGISTRY.md:L604-L604, tests/test_chaisemartin_dhaultfoeuille_parity.py:L232-L343

Path to Approval

Make the L_max=1 public surface internally consistent with the implemented/registered estimand: update summary(), to_dataframe("overall"), and to_dataframe("joiners_leavers") so the overall row is DID_1, not DID_M, and either use horizon-1 counts/metadata for that row or expose DID_M separately.
Add L_max=1 renderer regressions on both binary and mixed/non-binary panels, with and without bootstrap, asserting the chosen contract for summary(), to_dataframe("overall"), and to_dataframe("joiners_leavers").

…udy gate P1: Propagate L_max=1 => DID_1 label through summary(), to_dataframe("overall"), to_dataframe("joiners_leavers"), CV label, and __repr__. Show event study table in summary() when L_max >= 1 (not just >= 2). Update normalized effects error message to reference L_max >= 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-12T23:52:45Z

/ai-review

github-actions · 2026-04-12T23:59:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bb745703c36824d555996be217a4ecc2944523c9

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Execution note: I could not run the changed tests in this sandbox because pytest and pandas are not installed.

Executive Summary

The previous L_max=1 public-surface P1 is only partially fixed: the top-level label now says DID_1, but the decomposition/metadata surfaces still come from the old binary DID_M path.
On mixed or pure non-binary panels, that can silently pair a finite DID_1 with binary-only counts such as Switcher cells (N_S)=0, Treated observations=0, or joiners_leavers counts that exclude valid 0->2 switchers.
I do not count the new non-binary DID_l path or the dynamic-placebo SE work as undocumented methodology defects. The dynamic companion paper is written for general treatment paths and placebo estimands, and the extra Python restrictions/extensions are now documented in REGISTRY.md as notes/deviations. (nber.org)
The new TODO.md rows correctly track the remaining non-blocking deferrals around single-period placebo SE and mixed-direction SE parity.
Tests still miss L_max=1 renderer/metadata assertions, which is where the remaining contract bug lives.

Methodology

Severity P1. Impact: the PR changed fit() so L_max=1 promotes the public overall estimand to per-group DID_1, but the rest of the public surface is still wired to binary per-period DID_M metadata. N_S, joiner/leaver counts, and joiner/leaver observation counts are still built from 0->1 / 1->0 masks, n_treated_obs still counts only D==1, and to_dataframe("joiners_leavers") now labels the overall row as DID_1 anyway. On mixed or pure non-binary panels this silently mismatches the displayed sample metadata with the estimand actually shown, and the “Decomposition: Joiners and Leavers” block is no longer a decomposition of the labeled overall estimand. Concrete fix: for L_max>=1, either keep these surfaces explicitly on DID_M and expose DID_1 separately, or propagate DID_1 semantics end-to-end by recomputing or suppressing the binary-only metadata/decomposition fields. Refs: diff_diff/chaisemartin_dhaultfoeuille.py:L922, diff_diff/chaisemartin_dhaultfoeuille.py:L975, diff_diff/chaisemartin_dhaultfoeuille.py:L1410, diff_diff/chaisemartin_dhaultfoeuille.py:L1588, diff_diff/chaisemartin_dhaultfoeuille.py:L1834, diff_diff/chaisemartin_dhaultfoeuille_results.py:L516, diff_diff/chaisemartin_dhaultfoeuille_results.py:L584, diff_diff/chaisemartin_dhaultfoeuille_results.py:L833, docs/methodology/REGISTRY.md:L520
Severity P3. Impact: I did not find an undocumented math mismatch in the newly added non-binary/event-study path itself. The source paper allows general treatment trajectories and defines the DID_l / placebo estimands; the Python-specific restrictions here, including L_max>=1 for non-binary support, multi-switch dropping, and analytical placebo SE, are documented in REGISTRY.md as notes/deviations, so under the stated policy they are informational rather than blocking. Concrete fix: none. Refs: docs/methodology/REGISTRY.md:L471, docs/methodology/REGISTRY.md:L474, docs/methodology/REGISTRY.md:L546. (nber.org)

Code Quality

No additional findings beyond the L_max=1 public-contract issue above.

Performance

No findings.

Maintainability

No additional findings beyond the public-contract issue above.

Tech Debt

Severity P3. Impact: the PR adds the remaining deferrable dCDH items to TODO.md, so the single-period placebo-SE gap and the mixed-direction SE-parity limitation are properly tracked and non-blocking under this repo’s policy. Concrete fix: none required for approval. Refs: TODO.md:L59, TODO.md:L60

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests exercise the numeric L_max=1 / non-binary paths, but they still do not cover summary() or to_dataframe("joiners_leavers") on mixed or pure non-binary L_max=1 panels. That gap is exactly where the remaining P1 sits. Concrete fix: add renderer/metadata regressions on a mixed 0->1 + 0->2 panel and a pure 0->2 panel, asserting the overall label, treated/switcher counts, and joiners_leavers contents. Refs: tests/test_chaisemartin_dhaultfoeuille.py:L1823, tests/test_chaisemartin_dhaultfoeuille.py:L2355
Severity P3. Impact: doc drift remains. to_dataframe() still says event-study and normalized outputs require L_max>=2, REGISTRY.md still says pure-direction multi-horizon SE parity uses 5% rtol while the changed tests now use 10%/15%, and one test comment still says L_max=1 is “equivalent to None.” Concrete fix: align the docstrings/comments/registry note with the implemented L_max>=1 surface and the new parity tolerances. Refs: diff_diff/chaisemartin_dhaultfoeuille_results.py:L772, docs/methodology/REGISTRY.md:L604, tests/test_chaisemartin_dhaultfoeuille.py:L330, tests/test_chaisemartin_dhaultfoeuille_parity.py:L239

Path to Approval

Make the L_max=1 public surface internally consistent: do not label the overall row/block as DID_1 while still attaching DID_M/binary-only decomposition metadata to it.
Recompute or suppress the binary-only metadata on non-binary fits: n_treated_obs, n_switcher_cells, and the joiners_leavers row counts cannot continue to come from D==1 and 0->1 / 1->0 masks when non-binary switchers are supported.
Add L_max=1 renderer regressions on mixed and non-binary panels for summary() and to_dataframe("joiners_leavers"), asserting label/count consistency with event_study_effects[1].

…tion P1: For non-binary L_max >= 1, use per-group N_l as n_switcher_cells (not binary N_S=0), count non-baseline obs as n_treated_obs, and suppress joiner/leaver decomposition (binary concept). Ensures DID_1 label pairs with consistent per-group metadata. P2: Add test_nonbinary_lmax1_renderer_contract asserting DID_1 label in __repr__/to_dataframe, n_switcher_cells > 0, joiners unavailable. P3: Update to_dataframe() docstring for L_max >= 1 contract. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T00:02:27Z

/ai-review

github-actions · 2026-04-13T00:12:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 56b2526cfbb61fedb52d1cbd73aafd7a7b40f995

Overall Assessment

Needs changes.

Highest unmitigated severity: P1.

Execution note: I could not run the targeted tests in this sandbox because pytest and pandas are not installed.

Executive Summary

The previous L_max=1 blocker is only partially fixed. The PR now correctly promotes the top-level overall estimand to DID_1, but the joiners/leavers reporting surface is still not fully consistent with that estimand.
On mixed-direction binary L_max=1 fits, summary() and to_dataframe("joiners_leavers") still expose DID_+ / DID_- from the old per-period DID_M path even though the reported overall row is now DID_1.
On non-binary L_max=1 fits, the PR suppresses DID_+ / DID_-, but the overall joiners_leavers row still computes n_obs from zeroed joiner/leaver counts, so it can emit a finite DID_1 with n_obs=0.
I did not find an undocumented source-material defect in the new non-binary control-pool logic or the new multi-horizon placebo SE itself. The dynamic paper explicitly allows treatments that can increase, stay constant, or decrease over time, and defines placebo estimators as backward analogues built off the same baseline-matched control logic. citeturn1search0turn1search13
The new tests improve numeric coverage, but they still miss the renderer/dataframe cases where the remaining L_max=1 contract bug lives.

Methodology

Severity P1. Impact: the public L_max=1 surface still violates the semantic contract of the reported estimand. fit() overwrites the top-level overall estimand with per-group DID_1, but joiners_att, leavers_att, and their counts are still computed from the Phase 1 per-period DID_M path; REGISTRY.md already documents that DID_1 can differ from DID_M on mixed-direction panels. On non-binary panels, the PR zeros n_joiner_obs / n_leaver_obs, but to_dataframe("joiners_leavers") still derives the overall row’s n_obs from their sum, so it can report n_obs=0 for a finite DID_1. Concrete fix: either suppress/gate the joiners-leavers surfaces whenever L_max=1 unless the overall row stays on DID_M, or recompute those surfaces so they are explicitly DID_1-consistent. References: diff_diff/chaisemartin_dhaultfoeuille.py:L1588, diff_diff/chaisemartin_dhaultfoeuille.py:L1652, diff_diff/chaisemartin_dhaultfoeuille.py:L956, diff_diff/chaisemartin_dhaultfoeuille.py:L1834, diff_diff/chaisemartin_dhaultfoeuille_results.py:L823, diff_diff/chaisemartin_dhaultfoeuille_results.py:L850, docs/methodology/REGISTRY.md:L520, docs/methodology/REGISTRY.md:L536.
Severity P3. Impact: I did not find an undocumented paper mismatch in the added non-binary support or the multi-horizon placebo SE path. The source paper’s setup is general enough for non-binary, non-absorbing treatment paths, and REGISTRY.md explicitly labels the placebo-SE work as a library extension rather than a paper claim. Concrete fix: none. References: docs/methodology/REGISTRY.md:L471, docs/methodology/REGISTRY.md:L546. citeturn1search0turn1search13

Code Quality

No additional findings beyond the L_max=1 public-surface contract issue above.

Performance

No findings.

Maintainability

No additional findings beyond the L_max=1 surface inconsistency above.

Tech Debt

Severity P3. Impact: the remaining single-period placebo-SE limitation and mixed-direction SE-parity limitation are now explicitly tracked in TODO.md, so they are non-blocking under the repo policy. Concrete fix: none required for approval. References: TODO.md:L59, TODO.md:L60.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests still do not assert the problematic renderer/dataframe cases. The added non-binary renderer test checks repr() and to_dataframe("overall"), but not summary() or to_dataframe("joiners_leavers"); the mixed binary/non-binary test checks only numeric equality of overall_att and event_study_effects[1]. Concrete fix: add regressions for summary() and to_dataframe("joiners_leavers") on both a mixed-direction binary L_max=1 panel and a non-binary L_max=1 panel. References: tests/test_chaisemartin_dhaultfoeuille.py:L2354, tests/test_chaisemartin_dhaultfoeuille.py:L2425.
Severity P3. Impact: to_dataframe() docstrings still say "event_study" and "normalized" are available only when L_max >= 2, but the implementation now exposes both at L_max >= 1. Concrete fix: update the docstring text to match the new surface. References: diff_diff/chaisemartin_dhaultfoeuille_results.py:L789, diff_diff/chaisemartin_dhaultfoeuille_results.py:L792.

Path to Approval

Make the L_max=1 public reporting surface internally consistent: do not present DID_1 as the overall row while still showing Phase 1 DID_M joiner/leaver views as if they decompose it.
For non-binary L_max=1 fits, stop building the joiners_leavers overall row’s n_obs from n_joiner_obs + n_leaver_obs; either compute a DID_1-appropriate count or mark that row/surface unavailable.
Add renderer/dataframe regressions for summary() and to_dataframe("joiners_leavers") on:
1. a mixed-direction binary L_max=1 panel where DID_1 can differ from DID_M;
2. a non-binary L_max=1 panel, asserting the surface does not emit n_obs=0 for a finite overall estimate.

P1: Fix to_dataframe("joiners_leavers") overall row n_obs: use n_treated_obs when joiners/leavers are suppressed (non-binary) instead of summing zeroed joiner/leaver obs counts. P2: Add joiners_leavers and summary() assertions to test_nonbinary_lmax1_renderer_contract. P3: Update to_dataframe() docstring: event_study/normalized available at L_max >= 1 (was >= 2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T00:48:12Z

/ai-review

github-actions · 2026-04-13T00:57:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 46273b6045dae501503ec67711c7658ab27c0097

Overall Assessment
Needs changes.

Highest unmitigated severity: P1.

Execution note: I could not run targeted reproductions in this sandbox because numpy is not installed, so this re-review is based on static inspection of the diff and the surrounding dCDH code.

Executive Summary

The previous L_max=1 blocker is only partially fixed. The PR fixes the non-binary renderer/dataframe contract, but mixed-direction binary L_max=1 fits still expose DID_1 as the overall estimand while leaving DID_+ / DID_- on the old per-period path.
The prior non-binary joiners_leavers n_obs=0 bug appears fixed: non-binary L_max=1 now suppresses the binary-only decomposition and gives the overall row a positive count.
The new non-binary branch introduces a separate edge-case bug: constant-treatment non-binary panels with L_max >= 1 can now fall through to an all-NaN results object instead of raising the documented “no switchers” ValueError.
I did not find an unmitigated paper/registry defect in the new non-binary support or placebo-horizon SE path itself.
Test coverage improved, but it still misses the mixed-direction binary L_max=1 renderer/dataframe path and the new non-binary no-switcher path.

Methodology

Severity P1. Impact: the mixed-direction binary L_max=1 public surface is still internally inconsistent. fit() now promotes the overall estimand to per-group DID_1, but joiners_att / leavers_att are still computed from the Phase 1 per-period DID_+ / DID_- path, and summary() / to_dataframe("joiners_leavers") still present those rows alongside a DID_1 overall row as if they decomposed the same estimand. REGISTRY now explicitly notes that DID_1 can differ from DID_M on mixed-direction panels, so this remains an undocumented semantic-contract violation on the public API. Concrete fix: when L_max=1 and both joiners and leavers are present, either suppress the joiner/leaver surface entirely or recompute a DID_1-consistent decomposition before populating joiners_* and the renderer/dataframe layer. References: diff_diff/chaisemartin_dhaultfoeuille.py#L956, diff_diff/chaisemartin_dhaultfoeuille.py#L1588, diff_diff/chaisemartin_dhaultfoeuille.py#L1865, diff_diff/chaisemartin_dhaultfoeuille_results.py#L580, diff_diff/chaisemartin_dhaultfoeuille_results.py#L823, docs/methodology/REGISTRY.md#L520, docs/methodology/REGISTRY.md#L536.
Severity P3. Impact: I did not find an unmitigated source-material defect in the new non-binary support or the placebo-horizon SE extension itself. The dynamic companion explicitly covers nonbinary, non-absorbing treatments with lagged effects, and REGISTRY now labels placebo-horizon SE as a library extension rather than as a direct paper claim. Concrete fix: none. References: docs/methodology/REGISTRY.md#L471, docs/methodology/REGISTRY.md#L546. (nber.org)

Code Quality

Severity P1. Impact: the new non-binary L_max >= 1 path no longer enforces the documented “no switchers” failure mode. The old N_S == 0 guard is now bypassed for all non-binary panels when L_max is set, but there is no replacement check based on actual treatment changes. A constant-treatment non-binary panel therefore falls through _compute_multi_horizon_dids() with N_l = 0, then fit() still builds event_study_effects[1] and syncs the top-level overall_* fields from that all-NaN entry instead of raising the documented ValueError. This is missing empty-result handling on a new code path. Concrete fix: after switch metadata is computed, raise if no group ever changes treatment, even when L_max >= 1 and treatment is non-binary; reserve the current NaN/warning behavior for cases where switchers exist but a horizon is not estimable. References: diff_diff/chaisemartin_dhaultfoeuille.py#L922, diff_diff/chaisemartin_dhaultfoeuille.py#L1046, diff_diff/chaisemartin_dhaultfoeuille.py#L2480, diff_diff/chaisemartin_dhaultfoeuille.py#L1588, docs/methodology/REGISTRY.md#L577.

Performance

No findings.

Maintainability

No findings beyond the contract issues above.

Tech Debt

Severity P3. Impact: the remaining single-period placebo-SE limitation and the mixed-direction SE-parity limitation are now explicitly tracked in TODO.md, so they are non-blocking under the repo’s deferred-work policy. Concrete fix: none required for approval. References: TODO.md#L59, TODO.md#L60.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests cover the non-binary L_max=1 renderer path and mixed binary/non-binary numeric path, but they still do not pin the unresolved mixed-direction binary L_max=1 renderer/dataframe contract or the new non-binary no-switcher edge case. That leaves both live P1 paths unguarded. Concrete fix: add one mixed-direction binary regression asserting that summary() and to_dataframe("joiners_leavers") do not present legacy DID_+ / DID_- as a decomposition of a DID_1 overall row, and add one constant-treatment non-binary regression with L_max=1 asserting the documented no-switchers ValueError. References: tests/test_chaisemartin_dhaultfoeuille.py#L2354, tests/test_chaisemartin_dhaultfoeuille.py#L2425.
Severity P3. Impact: REGISTRY still says pure-direction SE parity uses a 5% tolerance, but the PR relaxes the parity tests to 10% generally and 15% for the long-panel scenario. That is documentation drift, not a blocker. Concrete fix: either update the REGISTRY parity note or tighten the test tolerances back to the documented thresholds after explaining the new gap. References: docs/methodology/REGISTRY.md#L604, tests/test_chaisemartin_dhaultfoeuille_parity.py#L232.

Path to Approval

Make the mixed-direction binary L_max=1 public surface estimand-consistent by either suppressing joiners_* / leavers_* and the joiners-leavers renderer/dataframe, or recomputing a DID_1-consistent decomposition.
Restore the documented no-switchers error for non-binary L_max >= 1 using actual treatment-change metadata rather than the binary N_S proxy.
Add regressions for both cases above: mixed-direction binary summary() / to_dataframe("joiners_leavers"), and constant-treatment non-binary L_max=1 expecting ValueError.

…itcher guard P1: Suppress joiner/leaver decomposition for ALL L_max >= 1 (not just non-binary). The decomposition is a per-period DID_M concept that can differ from the per-group DID_1 estimand on mixed panels. P1: Add no-switcher guard after multi-horizon computation - raise ValueError if N_l == 0 at horizon 1 (catches constant-treatment non-binary panels). P3: Update REGISTRY SE parity tolerance note (10%/15% for multi-horizon, 5% for single-horizon). Fix stale Step 12c comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T01:16:12Z

/ai-review

github-actions · 2026-04-13T01:24:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 79a01abb790981cc4febb262331c1c1812dcce02

Overall Assessment

✅ Looks good.

Highest unmitigated severity: P2.

Execution note: this is a static re-review. The sandbox does not have pytest, numpy, or pandas, so I could not run the added tests or do numeric reproductions.

Executive Summary

The two prior P1 blockers appear resolved in code: L_max=1 now routes the public surface through DID_1 and suppresses the old joiner/leaver decomposition on that surface, and the non-binary L_max>=1 path now raises on no-switcher data instead of surfacing an all-NaN result.
I did not find an unmitigated methodology mismatch in the new placebo-horizon SE or non-binary support. Both are now documented in REGISTRY.md, with placebo-horizon SE explicitly labeled as a library extension.
The remaining known limitations are properly tracked in TODO.md: single-period placebo DID_M^pl still has NaN SE, and mixed-direction SE parity remains intentionally unasserted.
Remaining gaps are lower severity: the exact regression tests requested in the prior review are still missing, and the new non-binary TWFE-diagnostic behavior is not fully reflected in the registry/docstrings.

Methodology

Severity P3. Impact: No unmitigated source-material defect found. The non-binary L_max>=1 route, exact-baseline control matching, multi-step-change drop rule, and placebo-horizon SE are now documented in the registry, and the implementation follows those documented choices in the estimator and results surface. The placebo SE work is correctly framed as a library extension: the dynamic paper explicitly handles general, non-binary/non-staggered treatment designs and defines placebo estimators, while its inference section is written for the main DID_{+,ℓ} estimands rather than for placebo SE. Concrete fix: none. diff_diff/chaisemartin_dhaultfoeuille.py#L1058, diff_diff/chaisemartin_dhaultfoeuille.py#L1218, diff_diff/chaisemartin_dhaultfoeuille.py#L1597, docs/methodology/REGISTRY.md#L471, docs/methodology/REGISTRY.md#L520, docs/methodology/REGISTRY.md#L546. (nber.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The remaining single-period placebo-SE limitation and mixed-direction SE-parity limitation are now explicitly tracked in TODO.md, so they are non-blocking under the repo’s deferred-work policy. Concrete fix: none. TODO.md#L59, TODO.md#L60

Security

No findings.

Documentation/Tests

Severity P2. Impact: The code fixes for the prior blockers are present, but the new tests still do not pin the two exact edge cases requested last round. I do not see a constant-treatment non-binary fit(..., L_max=1) regression asserting the no-switcher ValueError, or a mixed-direction binary L_max=1 regression asserting that summary() / to_dataframe("joiners_leavers") do not present DID_+ / DID_- as a decomposition of DID_1. That leaves both formerly-broken paths vulnerable to future regression. Concrete fix: add those two regressions next to the new L_max=1 and non-binary test blocks. tests/test_chaisemartin_dhaultfoeuille.py#L1823, tests/test_chaisemartin_dhaultfoeuille.py#L2238, diff_diff/chaisemartin_dhaultfoeuille.py#L1058, diff_diff/chaisemartin_dhaultfoeuille.py#L1843
Severity P3. Impact: The non-binary TWFE-diagnostic behavior is implemented and tested, but the registry/docstrings still overstate fitted-vs-standalone TWFE alignment on identical inputs and still carry stale binary-treatment wording in fit()’s Raises section. This is documentation drift, not a correctness defect. Concrete fix: add a registry/docstring note that TWFE diagnostics are binary-only; fit(twfe_diagnostic=True) warns and skips, while twowayfeweights() raises on non-binary data. diff_diff/chaisemartin_dhaultfoeuille.py#L546, diff_diff/chaisemartin_dhaultfoeuille.py#L611, diff_diff/chaisemartin_dhaultfoeuille.py#L3724, tests/test_chaisemartin_dhaultfoeuille.py#L2466, docs/methodology/REGISTRY.md#L596

P2: Add test_constant_nonbinary_treatment_raises (no-switcher guard) and test_L_max_1_suppresses_joiner_leaver_decomposition (mixed- direction binary L_max=1 renderer contract). P3: Add REGISTRY Note documenting TWFE diagnostic as binary-only (fit() warns+skips, twowayfeweights() raises on non-binary). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T01:53:52Z

/ai-review

github-actions · 2026-04-13T02:04:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 285aee9d82e4eb2fd04d0b5ea0a11b29e7c45fc9

Overall Assessment

⚠️ Needs changes

Static review only: the sandbox does not have pytest, numpy, or pandas, so I could not execute the added tests.

Executive Summary

The prior re-review gaps are largely addressed: this PR adds the missing constant-nonbinary L_max=1 failure test, the L_max=1 joiner/leaver suppression regression, and the non-binary TWFE diagnostic note.
Severity P1: on the new L_max=1 surface, the top-level overall_* fields are remapped to DID_1, but results.bootstrap_results.overall_* is still computed from the old scalar DID_M bootstrap branch. That leaves a public inference object reporting the wrong SE/CI/p-value for the reported overall estimand on the newly added path.
I did not find an unmitigated source-material mismatch in the non-binary DID_{g,l} route, the documented multi-switch drop rule, or the placebo-horizon SE extension; those choices are now explicitly disclosed in the registry and are compatible with the paper’s per-group event-study / placebo constructions. citeturn2view0turn4view0turn2view1turn2view2
The new TODO.md entries correctly track the remaining non-blocking deferrals: single-period DID_M^pl still has no SE, and mixed-direction SE parity is intentionally not asserted.

Methodology

Severity P1. Impact: the PR changes the public L_max=1 behavior so the reported overall estimand is DID_1, but the nested bootstrap object still exposes overall_se / overall_ci / overall_p_value from the scalar DID_M bootstrap path. On mixed-direction binary panels, the registry already notes that DID_1 and phase-1 DID_M can differ, so this is an incorrect variance/inference surface for a newly introduced mode. Concrete fix: when L_max == 1, either overwrite bootstrap_results.overall_* from the horizon-1 event-study bootstrap outputs, or clear those scalar fields and require callers to use event_study_ses[1] instead. Relevant code: diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:163, diff_diff/chaisemartin_dhaultfoeuille.py:1541, diff_diff/chaisemartin_dhaultfoeuille.py:1597, diff_diff/chaisemartin_dhaultfoeuille.py:1661, diff_diff/chaisemartin_dhaultfoeuille_results.py:69.
Severity P3. Impact: no unmitigated paper-registry mismatch found in the new non-binary/per-group route or placebo-horizon SE work. The paper defines per-group event-study and placebo estimands with baseline-matched, not-yet-switched controls, and the PR’s extra restrictions/deviations are now explicitly documented in the registry rather than hidden in code. Concrete fix: none. Relevant docs: docs/methodology/REGISTRY.md:471, docs/methodology/REGISTRY.md:520, docs/methodology/REGISTRY.md:546. citeturn2view0turn4view0turn2view1turn2view2

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P2. Impact: for L_max >= 1, the code repurposes n_switcher_cells to hold horizon-1 eligible switcher groups, but the results layer still labels that value as Switcher cells (N_S) and exports it as n_cells; the fallback n_obs also switches meaning to n_treated_obs. That makes the count columns mode-dependent and easy to misread in audits/parity debugging. Concrete fix: introduce separate N_1 / switcher-group metadata for the L_max >= 1 surface, or relabel the existing summary/dataframe fields when overall is DID_1. Relevant code: diff_diff/chaisemartin_dhaultfoeuille.py:1848, diff_diff/chaisemartin_dhaultfoeuille_results.py:484, diff_diff/chaisemartin_dhaultfoeuille_results.py:849.

Tech Debt

Severity P3. Impact: the remaining single-period placebo-SE limitation and mixed-direction SE-parity limitation are properly tracked under the repo’s deferred-work policy, so they are non-blocking here. Concrete fix: none. Relevant tracking: TODO.md:59, TODO.md:60.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests correctly pin the top-level overall_* == event_study_effects[1] contract, but they never assert the nested bootstrap_results.overall_* contract on L_max=1, which is why the P1 inference mismatch above slips through. Concrete fix: extend the new L_max=1 bootstrap tests to assert bootstrap_results.overall_* too, or assert that those fields are intentionally empty when L_max=1. Relevant tests: tests/test_chaisemartin_dhaultfoeuille.py:1823, tests/test_chaisemartin_dhaultfoeuille.py:2459.

Path to Approval

Make the nested bootstrap surface consistent with the new L_max=1 overall estimand: either sync bootstrap_results.overall_* to horizon-1 event-study bootstrap results, or null those fields and document that event_study_ses[1] is the only valid bootstrap source for DID_1.
Add a binary mixed-direction L_max=1 regression that checks the nested bootstrap object, not just the top-level overall_* fields.
Add the same nested-bootstrap regression on a non-binary L_max=1 panel.

P1: Sync bootstrap_results.overall_se/ci/p_value from event_study horizon 1 when L_max >= 1. The nested bootstrap object now matches the top-level DID_1 surface. P2: Use "Eligible switchers (N_1)" label in summary() when L_max >= 1 (was "Switcher cells (N_S)" which is a DID_M concept). P2: Add test_L_max_1_bootstrap_results_overall_synced asserting bootstrap_results.overall_* == event_study horizon 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T10:09:59Z

/ai-review

github-actions · 2026-04-13T10:20:06Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8813170f4e76c7dae53a1cd8ab76ebba6e361d78

Overall Assessment

⚠️ Needs changes

Static review only: the sandbox does not have pytest, numpy, or pandas, so I could not execute the added tests.

Executive Summary

The prior L_max=1 scalar bootstrap_results.overall_* mismatch is addressed, and the new REGISTRY.md / TODO.md notes correctly mitigate the documented deferrals.
I did not find an unmitigated source-material mismatch in the documented non-binary DID_{g,l} route, the monotone multi-step drop rule, or the dynamic placebo SE extension; those are now explicitly registered.
Non-binary support did not propagate to the binary-only Phase 1 result surfaces: per_period_effects and single-period placebo_effect / placebo_available still run the hard-coded 0/1 path and can silently omit non-binary switchers.
The new nested-bootstrap sync block fixes L_max=1, but it also rewires bootstrap_results.overall_* for all L_max >= 1, so L_max >= 2 fits now expose DID_1 in the nested bootstrap object while top-level overall_* is delta.
Test coverage improved materially, but it still misses both of those public-contract failures.

Methodology

Severity P1. Impact: non-binary support is only documented for the per-group path in docs/methodology/REGISTRY.md:471, but fit() still always computes the Phase 1 binary-only per-period surfaces and returns them on the results object. _compute_per_period_dids() still keys off exact 0->1 / 1->0 masks at diff_diff/chaisemartin_dhaultfoeuille.py:2140, and _compute_placebo() does the same at diff_diff/chaisemartin_dhaultfoeuille.py:2267. Those outputs are still surfaced via the results assignment at diff_diff/chaisemartin_dhaultfoeuille.py:1913 and diff_diff/chaisemartin_dhaultfoeuille.py:1919. On mixed 0->1 / 0->2 panels, results.per_period_effects and results.placebo_effect silently reflect only the binary subset. Concrete fix: when not is_binary, suppress or NaN-gate the Phase 1 per-period outputs (per_period_effects, single-period placebo_effect / placebo_available, and any related binary-only bootstrap fields) with a clear warning and registry note, or implement a documented general-treatment analogue.
Severity P1. Impact: the new sync block at diff_diff/chaisemartin_dhaultfoeuille.py:1661 overwrites bootstrap_results.overall_* from horizon 1 for every L_max >= 1, but top-level overall_* becomes delta when L_max >= 2 at diff_diff/chaisemartin_dhaultfoeuille.py:1701. That leaves the nested bootstrap object reporting the wrong overall estimand on multi-horizon fits. bootstrap_distribution also remains on the old scalar path documented at diff_diff/chaisemartin_dhaultfoeuille_results.py:94. Concrete fix: restrict that sync to L_max == 1; for L_max >= 2, either keep the nested fields on their documented scalar contract at diff_diff/chaisemartin_dhaultfoeuille_results.py:69 or add explicit delta_* fields. If L_max == 1 keeps the sync, also sync or clear bootstrap_distribution.
Severity P3. Impact: the non-binary DID_{g,l} route, multi-switch drop rule, and dynamic placebo IF/SE extension are explicitly documented as notes/deviations at docs/methodology/REGISTRY.md:471, docs/methodology/REGISTRY.md:474, and docs/methodology/REGISTRY.md:546, so I am not counting those as defects. Concrete fix: none.

Code Quality

No material findings.

Performance

No material findings.

Maintainability

No material findings beyond the public-contract ambiguity noted above.

Tech Debt

Severity P3. Impact: the remaining single-period placebo-SE limitation and mixed-direction SE-parity limitation are properly tracked in TODO.md:59 and TODO.md:60, so they are non-blocking under the repo policy. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new regressions pin L_max=1 nested-bootstrap syncing at tests/test_chaisemartin_dhaultfoeuille.py:1866 and general non-binary success paths at tests/test_chaisemartin_dhaultfoeuille.py:2282, but there is still no coverage for non-binary per_period_effects / top-level placebo_effect, and no L_max >= 2 regression for the nested bootstrap contract. Those gaps allowed the two P1 issues above to slip through. Concrete fix: add a mixed 0->1 / 0->2 panel with placebo=True asserting the binary-only Phase 1 outputs are suppressed or warned off, and add an L_max=3, n_bootstrap>0 regression that asserts the intended meaning of bootstrap_results.overall_* and bootstrap_distribution.

Path to Approval

On non-binary fits, stop surfacing binary-only Phase 1 result artifacts unless they are explicitly generalized and documented. At minimum, gate per_period_effects, the single-period placebo_* fields, and any related binary-only bootstrap fields behind a warning/NaN contract.
Narrow the nested-bootstrap overall_* sync to L_max == 1 and define an explicit L_max >= 2 contract for results.bootstrap_results. Do not leave bootstrap_results.overall_* pointing at DID_1 when top-level overall_* is delta.
Add regressions for both contracts: one mixed binary/non-binary placebo/per-period case, and one multi-horizon bootstrap case (L_max >= 2) covering the nested bootstrap object.

…acts P1: Narrow bootstrap_results.overall_* sync to L_max == 1 only. When L_max >= 2, the cost-benefit delta overrides overall_*, so bootstrap_results.overall_* stays on the scalar DID_M bootstrap. P1: Gate binary-only Phase 1 artifacts on non-binary panels: per_period_effects emptied, single-period placebo_* set to NaN, placebo_available set to False. These are DID_M concepts that don't apply to non-binary treatment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T10:23:44Z

/ai-review

github-actions · 2026-04-13T10:31:34Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6f26709a57ad725ba465637b3ddbcc6bd75d2df1

Overall Assessment

Needs changes

Static review only: pytest is not installed in this sandbox, so I could not execute the added tests.

Executive Summary

The prior non-binary Phase 1 artifact issue looks addressed: non-binary L_max >= 1 fits now suppress binary-only per_period_effects / top-level placebo surfaces and skip the TWFE diagnostic.
The prior bootstrap-contract issue is only partially fixed: bootstrap_results.overall_* is now re-synced only for L_max == 1, but bootstrap_results.bootstrap_distribution still keeps the old DID_M bootstrap target.
I did not find an unmitigated source-material mismatch in the documented non-binary general-treatment route, multi-switch drop rule, or placebo-horizon extension itself.
Test coverage improved materially for placebo SEs, non-binary support, and L_max=1, but it still does not pin the nested bootstrap_distribution contract.

Methodology

Severity P1. Impact: results.bootstrap_results is internally inconsistent when L_max == 1. The bootstrap mixin always stores bootstrap_distribution from the scalar Phase 1 DID_M path at diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:163, but fit() later overwrites only bootstrap_results.overall_se, overall_ci, and overall_p_value with horizon-1 DID_1 values at diff_diff/chaisemartin_dhaultfoeuille.py:1661. The public docstring still describes bootstrap_distribution as the stored distribution for the overall target at diff_diff/chaisemartin_dhaultfoeuille_results.py:94. On panels where documented DID_1 != DID_M, this silently exposes the wrong bootstrap distribution for the reported overall estimand. Concrete fix: when L_max == 1, either replace bootstrap_distribution with the horizon-1 event-study distribution, or clear/rename that field so it cannot be read as matching overall_*.
Severity P3. Impact: I did not find an unmitigated source-material defect in the non-binary general-treatment path or the placebo-horizon estimator itself. The cited dynamic paper explicitly allows nonbinary, non-absorbing treatments and defines multi-horizon DID / normalized / cost-benefit estimands; its placebo estimators are introduced separately as assumption tests, while the registry now documents placebo SE as a library extension. Concrete fix: none. See docs/methodology/REGISTRY.md:471, docs/methodology/REGISTRY.md:474, and docs/methodology/REGISTRY.md:546. (nber.org)

Code Quality

No material findings.

Performance

No material findings.

Maintainability

Severity P2. Impact: the L_max=1 public metadata contract is still muddy. The results docstring says n_switcher_cells is the Phase 1 N_S cell count and cost_benefit_delta is populated only when L_max >= 2, but fit() repurposes n_switcher_cells to N_1 and still computes/stores cost_benefit_delta for L_max == 1 at diff_diff/chaisemartin_dhaultfoeuille.py:1191 and diff_diff/chaisemartin_dhaultfoeuille.py:1952. The dataclass docs still describe the old semantics at diff_diff/chaisemartin_dhaultfoeuille_results.py:274 and diff_diff/chaisemartin_dhaultfoeuille_results.py:311. Concrete fix: either keep the old fields on their Phase 1 semantics and add separate n_eligible_switchers / did1_* metadata, or fully update the results contract and suppress delta-related surfaces when L_max == 1.

Tech Debt

Severity P3. Impact: the remaining single-period placebo-SE limitation and mixed-direction parity-SE limitation are properly tracked in TODO.md:59 and TODO.md:60, so they are non-blocking under the repo policy. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new regressions pin overall_* == event_study_effects[1] and the nested overall_* sync for L_max=1 at tests/test_chaisemartin_dhaultfoeuille.py:1823 and tests/test_chaisemartin_dhaultfoeuille.py:1866, but there is still no regression for the estimand carried by bootstrap_results.bootstrap_distribution, and no L_max=1 assertion covering the cost_benefit_delta / renderer contract. Those gaps allowed the remaining inconsistencies through. Concrete fix: add a mixed-direction L_max=1, n_bootstrap>0 regression that asserts the chosen bootstrap_distribution behavior, plus an L_max=1 surface test that asserts whether cost_benefit_delta should or should not exist.

Path to Approval

Make results.bootstrap_results self-consistent for L_max == 1: either store the horizon-1 DID_1 bootstrap distribution in bootstrap_distribution, or clear/rename that field instead of leaving the Phase 1 DID_M distribution attached to DID_1 summary statistics.
Add a regression on a mixed-direction L_max=1, n_bootstrap>0 fit that fails if bootstrap_results.overall_* and bootstrap_distribution drift back onto different estimands.

P1: Clear bootstrap_results.bootstrap_distribution to None when L_max == 1. The stored DID_M distribution doesn't match the DID_1 summary statistics. Per-horizon bootstrap stats are accessible via event_study_ses/cis/p_values. P2: Add bootstrap_distribution is None assertion to the L_max=1 bootstrap sync regression test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T10:34:01Z

/ai-review

github-actions · 2026-04-13T10:43:42Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fc83a528751788a6a2be8435866ebd18e9477049

Overall Assessment

✅ Looks good

Static review only: this sandbox does not have the runtime deps needed to execute the added tests (numpy import failed), so this is based on diff inspection plus surrounding source/registry context.

Executive Summary

The prior blocker around L_max=1 bootstrap contract looks resolved: the code now re-syncs nested bootstrap_results.overall_* to horizon 1 and clears the stale scalar distribution at diff_diff/chaisemartin_dhaultfoeuille.py#L1685, with regression coverage added at tests/test_chaisemartin_dhaultfoeuille.py#L1866.
I did not find an unmitigated source-material mismatch in the changed estimator math. The non-binary path, multi-switch drop rule, and placebo-horizon SE are all explicitly documented in the registry at docs/methodology/REGISTRY.md#L471, docs/methodology/REGISTRY.md#L474, and docs/methodology/REGISTRY.md#L546.
Severity P2: fit(..., L_max=1) still computes/stores/renders cost_benefit_delta, even though the public contract now says delta is only a L_max >= 2 surface.
Severity P2: the n_switcher_cells metadata contract is still mixed for L_max >= 1; code now stores N_1 switcher-group counts in fields still documented/exported as switching-cell counts.
Severity P3: the placebo-bootstrap docs are partially stale; some text still says the placebo path is deferred/unchanged even though placebo_event_study now carries finite analytical/bootstrap inference.

Methodology

Severity P3. Impact: No unmitigated methodology defect found in the estimator changes under review. The non-binary treatment generalization, the broadened multi-switch drop rule, and the placebo-horizon SE extension are all documented as intended behavior/deviations in the registry, and the affected helpers line up with those contracts at diff_diff/chaisemartin_dhaultfoeuille.py#L928, diff_diff/chaisemartin_dhaultfoeuille.py#L2034, and diff_diff/chaisemartin_dhaultfoeuille.py#L2718. Concrete fix: none.

Code Quality

No material findings.

Performance

No material findings.

Maintainability

Severity P2. Impact: fit(..., L_max=1) still computes cost_benefit_result under the L_max >= 1 branch at diff_diff/chaisemartin_dhaultfoeuille.py#L1191 and diff_diff/chaisemartin_dhaultfoeuille.py#L1294, stores it unconditionally on the results object at diff_diff/chaisemartin_dhaultfoeuille.py#L1956, and summary() renders it whenever present at diff_diff/chaisemartin_dhaultfoeuille_results.py#L694. That conflicts with the stated L_max=1 => DID_1, L_max>=2 => delta contract in docs/methodology/REGISTRY.md#L520 and the results docstring at diff_diff/chaisemartin_dhaultfoeuille_results.py#L311. Concrete fix: gate delta computation/storage/rendering behind L_max >= 2, or explicitly document and test that L_max=1 intentionally exposes a trivial one-horizon delta.
Severity P2. Impact: The L_max >= 1 path repurposes n_switcher_cells to mean N_1 eligible switcher groups at diff_diff/chaisemartin_dhaultfoeuille.py#L1884 and exports that value at diff_diff/chaisemartin_dhaultfoeuille.py#L1942, but the public docs still define n_switcher_cells as the old N_S switching-cell count at diff_diff/chaisemartin_dhaultfoeuille_results.py#L276, and to_dataframe("joiners_leavers") still emits it in a column named n_cells at diff_diff/chaisemartin_dhaultfoeuille_results.py#L851. Downstream consumers can silently read a group count as a cell count. Concrete fix: keep n_switcher_cells on its original N_S meaning and add a separate n_eligible_switchers field, or rename/update the exported field/column/docs everywhere to the new N_1 semantics.

Tech Debt

Severity P3. Impact: The remaining Phase 1 per-period placebo-SE limitation and the mixed-direction parity-SE limitation are properly tracked in TODO.md#L59 and TODO.md#L60, so they are non-blocking under the repo’s deferred-work policy. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: The placebo-bootstrap docs are still internally stale. placebo_event_study now receives bootstrap inference via diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L310 and that inference is propagated onto the results surface at diff_diff/chaisemartin_dhaultfoeuille.py#L1805, but the results docstring still says placebo bootstrap is deferred to Phase 2 at diff_diff/chaisemartin_dhaultfoeuille_results.py#L151, and the registry still says “the placebo path is unchanged” at docs/methodology/REGISTRY.md#L592. Concrete fix: explicitly distinguish legacy top-level placebo_* (DID_M^pl, still NaN-inference) from placebo_event_study negative horizons (DID^{pl}_l, now finite analytical/bootstrap inference), and add a regression that checks bootstrap p_value/conf_int on a negative horizon, not just bootstrap SE.

…date placebo text P2: Gate cost_benefit_delta computation to L_max >= 2 (was >= 1). Delta is only meaningful with multiple horizons. P2: Update n_switcher_cells docstring to document dual semantics: N_S cell count for L_max=None, N_1 group count for L_max >= 1. P3: Update stale placebo-bootstrap text in results docstrings and REGISTRY: multi-horizon placebos now have valid bootstrap SE/CI/p via placebo_horizon_ses/cis/p_values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-13T10:52:34Z

/ai-review

github-actions · 2026-04-13T10:59:19Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7bfa4c63d582e3653bac9d28fe6ac698fee63539

Overall Assessment

✅ Looks good

Static review only: this sandbox does not have numpy, pandas, or pytest, so I could not execute the added tests. This assessment is based on diff inspection plus cross-checking against the methodology registry and nearby source/docstrings.

Executive Summary

I did not find an unmitigated methodology defect in the estimator changes. The non-binary treatment support, generalized multi-switch drop rule, L_max=1 -> DID_1 behavior, and placebo-horizon SE extension are all documented in docs/methodology/REGISTRY.md.
The prior L_max=1 delta-surface issue appears resolved: delta is now computed/stored only for L_max >= 2.
The prior n_switcher_cells issue is only partially resolved: docs/summary now say N_1, but the DID_1 dataframe surface still mixes group-count, cell-count, and “changed from baseline” count semantics.
The PR correctly tracks the remaining single-period placebo-SE limitation and mixed-direction parity-SE gap in TODO.md, so those are non-blocking under the repo’s deferred-work policy.
Coverage improved materially for non-binary treatment and placebo SEs, but there is still no regression that negative-horizon bootstrap p_value / conf_int propagate onto placebo_event_study.

Methodology

Severity: P3-informational. Impact: No unmitigated source-material mismatch found. The changed behaviors under review are documented in the registry: non-binary treatment requires L_max >= 1, monotone multi-step paths are dropped as multi-switch groups, L_max=1 surfaces DID_1 rather than per-period DID_M, and placebo-horizon SE is an explicit library extension rather than a claimed paper result. Concrete fix: none. (docs/methodology/REGISTRY.md:L471-L474, docs/methodology/REGISTRY.md:L520-L546, diff_diff/chaisemartin_dhaultfoeuille.py:L923-L935, diff_diff/chaisemartin_dhaultfoeuille.py:L2031-L2087, diff_diff/chaisemartin_dhaultfoeuille.py:L1218-L1282)

Code Quality

No material findings.

Performance

No material findings.

Maintainability

Severity: P2. Impact: The remaining DID_1 result-surface count contract is still internally inconsistent. For L_max >= 1, n_switcher_cells is repurposed to N_1 eligible switcher groups, but to_dataframe("joiners_leavers") still exports that value under n_cells, and on non-binary panels n_treated_obs is overwritten with “observations where treatment differs from baseline,” not “treated observations.” On non-zero-baseline non-binary panels (for example, baseline 1 -> 2 switchers or never-switching baseline-2 controls), the summary/dataframe will silently undercount treated observations and label a group count as a cell count. Concrete fix: keep n_treated_obs on its original post-filter treated-observation contract; add separate n_changed_obs / n_switcher_groups metadata for L_max >= 1, or rename the public dataframe columns/summary labels to match their new units. (diff_diff/chaisemartin_dhaultfoeuille.py:L1420-L1420, diff_diff/chaisemartin_dhaultfoeuille.py:L1870-L1892, diff_diff/chaisemartin_dhaultfoeuille.py:L1941-L1943, diff_diff/chaisemartin_dhaultfoeuille_results.py:L277-L288, diff_diff/chaisemartin_dhaultfoeuille_results.py:L491-L495, diff_diff/chaisemartin_dhaultfoeuille_results.py:L832-L865)

Tech Debt

Severity: P3-informational. Impact: The PR adds the remaining dCDH deferred items to TODO.md, so the single-period placebo-SE limitation and the mixed-direction parity-SE limitation are properly tracked and non-blocking under project policy. Concrete fix: none. (TODO.md:L57-L60)

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The new code propagates negative-horizon bootstrap percentile p_value / conf_int onto placebo_event_study, but the new tests only assert finite placebo_horizon_ses. A regression in the public inference surface could therefore slip through CI. Concrete fix: add a regression that checks, for at least one estimable placebo lag, that placebo_event_study[-l]["p_value"] and ["conf_int"] match bootstrap_results.placebo_horizon_p_values[l] and placebo_horizon_cis[l]. (diff_diff/chaisemartin_dhaultfoeuille.py:L1806-L1838, tests/test_chaisemartin_dhaultfoeuille.py:L2085-L2106)
Severity: P3. Impact: A couple of touched docstrings are still stale after the non-binary / DID_1 changes: _validate_and_aggregate_to_cells() still says non-binary raw treatment values raise ValueError, and DCDHBootstrapResults.overall_* is still documented as always referring to DID_M. Concrete fix: update those docstrings to match the current contracts. (diff_diff/chaisemartin_dhaultfoeuille.py:L137-L142, diff_diff/chaisemartin_dhaultfoeuille_results.py:L69-L74)

igerber and others added 2 commits April 12, 2026 18:43

igerber added the ready-for-ci Triggers CI test workflows label Apr 13, 2026

igerber merged commit 0028102 into main Apr 13, 2026
22 of 24 checks passed

igerber deleted the dcdh-phase3a-cleanup-nonbinary branch April 13, 2026 12:35

Conversation

igerber commented Apr 12, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Apr 12, 2026

Uh oh!

igerber commented Apr 12, 2026

Uh oh!

github-actions bot commented Apr 12, 2026

Uh oh!

igerber commented Apr 12, 2026

Uh oh!

github-actions bot commented Apr 12, 2026

Uh oh!

igerber commented Apr 12, 2026

Uh oh!

github-actions bot commented Apr 12, 2026

Uh oh!

igerber commented Apr 12, 2026

Uh oh!

github-actions bot commented Apr 12, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

igerber commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant