Utilities for preparing and validating data for DiD analysis.
.. module:: diff_diff.prep
Generate synthetic data with known treatment effects for testing.
.. autofunction:: diff_diff.generate_did_data
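Because the true effect is set by the caller, a correct estimator should recover it up to noise. The sketch below illustrates that idea in plain Python with no dependency on diff_diff. The helpers `simulate_panel` and `did_estimate`, and the data-generating process itself (unit effects plus a linear trend of 0.5 per period), are hypothetical stand-ins and not the library's implementation.

```python
import random
from statistics import mean


def simulate_panel(n_units=200, n_periods=10, treatment_period=5,
                   treatment_effect=5.0, noise_sd=1.0, seed=0):
    """Hypothetical stand-in for a DiD data generator (not the library's code)."""
    rng = random.Random(seed)
    rows = []
    for unit in range(n_units):
        treated = int(unit < n_units // 2)   # first half of the units is treated
        unit_effect = rng.gauss(0.0, 2.0)    # time-invariant unit effect
        for period in range(n_periods):
            post = int(period >= treatment_period)
            outcome = (unit_effect + 0.5 * period
                       + treatment_effect * treated * post
                       + rng.gauss(0.0, noise_sd))
            rows.append({"unit": unit, "period": period, "treated": treated,
                         "post": post, "outcome": outcome})
    return rows


def did_estimate(rows):
    """2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    def cell(treated, post):
        return mean(r["outcome"] for r in rows
                    if r["treated"] == treated and r["post"] == post)
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))


rows = simulate_panel()
estimate = did_estimate(rows)
print(f"DiD estimate: {estimate:.2f} (true effect: 5.0)")
```

Unit effects and the common trend cancel in the double difference, so the estimate differs from the true effect only through the idiosyncratic noise.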
from diff_diff import generate_did_data
# Generate basic 2x2 DiD data
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,
    treatment_fraction=0.5,
    noise_sd=1.0
)
print(data.head())
# Columns include: unit, period, outcome, treated, post

Generate synthetic staggered adoption data for testing.
.. autofunction:: diff_diff.generate_staggered_data
from diff_diff import generate_staggered_data
data = generate_staggered_data(
    n_units=200,
    n_periods=10,
    cohort_periods=[4, 6, 8],
    seed=42
)

Generate synthetic event study data for testing.
.. autofunction:: diff_diff.generate_event_study_data
Generate synthetic Triple Difference data.
.. autofunction:: diff_diff.generate_ddd_data
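A triple difference is the difference between two DiD estimates: one for the subgroup the policy targets and one for a subgroup it should not affect. The worked sketch below uses made-up cell means. The `(group, partition, post)` keys and all values are illustrative assumptions, not output of `generate_ddd_data`.

```python
def did(cells, partition):
    """2x2 DiD within one partition: (treated post - pre) - (control post - pre)."""
    return ((cells[(1, partition, 1)] - cells[(1, partition, 0)])
            - (cells[(0, partition, 1)] - cells[(0, partition, 0)]))


# Hypothetical cell means keyed by (group, partition, post).
cells = {
    (1, 1, 0): 10.0, (1, 1, 1): 16.0,  # treated group, targeted partition
    (1, 0, 0): 9.0,  (1, 0, 1): 12.0,  # treated group, untargeted partition
    (0, 1, 0): 8.0,  (0, 1, 1): 10.0,  # control group, targeted partition
    (0, 0, 0): 7.0,  (0, 0, 1): 9.0,   # control group, untargeted partition
}

did_targeted = did(cells, partition=1)    # (16 - 10) - (10 - 8) = 4
did_untargeted = did(cells, partition=0)  # (12 - 9) - (9 - 7) = 1
ddd = did_targeted - did_untargeted       # 4 - 1 = 3
print(ddd)
```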
Generate synthetic data with factor structure for TROP testing.
.. autofunction:: diff_diff.generate_factor_data
Generate generic synthetic panel data.
.. autofunction:: diff_diff.generate_panel_data
Generate synthetic continuous treatment DiD data with known dose-response.
.. autofunction:: diff_diff.generate_continuous_did_data
Generate synthetic reversible-treatment panel data, in which treatment can switch on and off over time. Use this with :class:`~diff_diff.ChaisemartinDHaultfoeuille` to test the dCDH estimator on non-absorbing treatments.
.. autofunction:: diff_diff.generate_reversible_did_data
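With a non-absorbing treatment, each group's path can be described by how it switches: groups that turn treatment on are usually called joiners, and groups that turn it off are called leavers. The sketch below labels a binary treatment path accordingly. The `classify_path` helper and its label strings are hypothetical and only illustrate the terminology; they are not part of diff_diff.

```python
def classify_path(path):
    """Label a binary treatment path by how it switches over time."""
    switches = [(a, b) for a, b in zip(path, path[1:]) if a != b]
    if not switches:
        return "always_treated" if path[0] == 1 else "never_treated"
    if len(switches) > 1:
        return "multiple_switches"
    return "joiner" if switches[0] == (0, 1) else "leaver"


paths = {
    "g1": [0, 0, 1, 1],  # switches on once
    "g2": [1, 1, 0, 0],  # switches off once
    "g3": [0, 1, 0, 0],  # switches on, then off again
    "g4": [0, 0, 0, 0],  # never switches
}
labels = {group: classify_path(path) for group, path in paths.items()}
print(labels)
```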
from diff_diff import generate_reversible_did_data, ChaisemartinDHaultfoeuille
data = generate_reversible_did_data(
    n_groups=80,
    n_periods=6,
    pattern="single_switch",  # or "joiners_only", "leavers_only", "mixed_single_switch"
    treatment_effect=2.0,
    seed=42,
)

est = ChaisemartinDHaultfoeuille()
results = est.fit(
    data, outcome="outcome", group="group",
    time="period", treatment="treatment",
)

Create binary treatment indicator from categorical or numeric columns.
.. autofunction:: diff_diff.make_treatment_indicator
from diff_diff import make_treatment_indicator
# From categorical
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)

# From numeric threshold
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)

Create post-treatment period indicator.
.. autofunction:: diff_diff.make_post_indicator
from diff_diff import make_post_indicator
data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)

Reshape wide panel data to long format.
.. autofunction:: diff_diff.wide_to_long
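Conceptually, the reshape turns each period column into its own row. The plain-Python sketch below shows that transformation on a list of dicts. The `melt_rows` helper and the `y_` prefix convention are assumptions made for illustration, not how `wide_to_long` is implemented.

```python
def melt_rows(rows, id_key, prefix, var_name="year", value_name="outcome"):
    """Turn one wide row per unit into one long row per unit-period."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                long_rows.append({
                    id_key: row[id_key],
                    var_name: int(key[len(prefix):]),  # "y_2019" -> 2019
                    value_name: value,
                })
    return long_rows


wide_rows = [
    {"unit_id": 1, "y_2019": 1.0, "y_2020": 1.5},
    {"unit_id": 2, "y_2019": 2.0, "y_2020": 2.5},
]
long_rows = melt_rows(wide_rows, id_key="unit_id", prefix="y_")
for row in long_rows:
    print(row)
```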
from diff_diff import wide_to_long
# Wide format: each column is a time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    id_col='unit_id',
    value_name='outcome',
    var_name='year'
)

Balance panel data by filling or dropping incomplete observations.
.. autofunction:: diff_diff.balance_panel
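The two strategies differ in what happens to a unit that misses a period: dropping removes the whole unit, while filling keeps it and marks the gap as missing. The sketch below shows both on a toy panel in plain Python. The `balance` helper is hypothetical and mirrors only the `'inner'` and `'fill'` method names used in this section; it is not the library's implementation.

```python
import math


def balance(rows, method="inner"):
    """Return a balanced copy of rows (dicts with unit_id, period, outcome)."""
    if method not in ("inner", "fill"):
        raise ValueError(f"unknown method: {method!r}")
    periods = sorted({r["period"] for r in rows})
    by_unit = {}
    for r in rows:
        by_unit.setdefault(r["unit_id"], {})[r["period"]] = r["outcome"]
    balanced = []
    for unit, seen in sorted(by_unit.items()):
        if method == "inner" and len(seen) < len(periods):
            continue  # drop units that miss any period
        for period in periods:
            # under "fill", unobserved unit-periods get a NaN outcome
            balanced.append({"unit_id": unit, "period": period,
                             "outcome": seen.get(period, math.nan)})
    return balanced


rows = [
    {"unit_id": 1, "period": 0, "outcome": 1.0},
    {"unit_id": 1, "period": 1, "outcome": 1.2},
    {"unit_id": 1, "period": 2, "outcome": 1.4},
    {"unit_id": 2, "period": 0, "outcome": 2.0},
    {"unit_id": 2, "period": 2, "outcome": 2.4},  # period 1 is missing
]
inner = balance(rows, method="inner")
filled = balance(rows, method="fill")
print(len(inner), len(filled))
```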
from diff_diff import balance_panel
# Fill missing periods with NaN
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='fill'
)

# Or keep only units with all periods (default)
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='inner'
)

Create event-time column for staggered adoption designs.
.. autofunction:: diff_diff.create_event_time
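Event time re-centers the calendar on each unit's own adoption date, so that units treated at different times can be compared at the same distance from treatment. The sketch below is a minimal plain-Python version of that calculation. Using `None` to mark never-treated units is an assumption of this sketch; check the function's documentation for the coding the library expects.

```python
import math


def event_time(period, first_treat):
    """Periods since first treatment; NaN when the unit is never treated."""
    if first_treat is None:  # assumption: None marks never-treated units
        return math.nan
    return period - first_treat


rows = [
    {"unit_id": 1, "period": 3, "first_treat": 5},     # two periods before
    {"unit_id": 1, "period": 5, "first_treat": 5},     # treatment period
    {"unit_id": 1, "period": 7, "first_treat": 5},     # two periods after
    {"unit_id": 2, "period": 5, "first_treat": None},  # never treated
]
for row in rows:
    row["event_time"] = event_time(row["period"], row["first_treat"])
    print(row)
```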
from diff_diff import create_event_time
data = create_event_time(
    data,
    time_column='period',
    treatment_time_column='first_treat'
)
# event_time = period - first_treat
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated

Aggregate unit-level data to cohort means.
.. autofunction:: diff_diff.aggregate_to_cohorts
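A cohort here is the set of units that share a first-treatment period, so aggregation amounts to averaging the outcome within each cohort-period cell. The sketch below shows that grouping in plain Python. The `cohort_means` helper and the toy rows are hypothetical, and the real function may return additional columns.

```python
from collections import defaultdict
from statistics import mean


def cohort_means(rows):
    """Average the outcome within each (first_treat, period) cell."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["first_treat"], r["period"])].append(r["outcome"])
    return {cell: mean(values) for cell, values in sorted(cells.items())}


rows = [
    {"unit_id": 1, "period": 0, "first_treat": 4, "outcome": 1.0},
    {"unit_id": 2, "period": 0, "first_treat": 4, "outcome": 3.0},
    {"unit_id": 3, "period": 0, "first_treat": 6, "outcome": 5.0},
    {"unit_id": 1, "period": 1, "first_treat": 4, "outcome": 2.0},
    {"unit_id": 2, "period": 1, "first_treat": 4, "outcome": 4.0},
    {"unit_id": 3, "period": 1, "first_treat": 6, "outcome": 6.0},
]
means = cohort_means(rows)
for cell, value in means.items():
    print(cell, value)
```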
from diff_diff import aggregate_to_cohorts
cohort_data = aggregate_to_cohorts(
    data,
    unit_column='unit_id',
    time_column='period',
    treatment_column='first_treat',
    outcome='outcome'
)

Aggregate survey microdata to geographic-period cells with design-based precision.
.. autofunction:: diff_diff.aggregate_survey
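To give a feel for what a per-cell summary contains, the sketch below computes a weighted mean, the sum of weights, and an effective sample size for a single cell. It is a simplification under stated assumptions: it ignores strata and PSUs entirely, and the Kish formula used for the effective sample size is a common textbook choice that may differ from what `aggregate_survey` computes. The `cell_summary` helper is hypothetical.

```python
def cell_summary(values, weights):
    """Weighted mean, sum of weights, and Kish effective sample size for one cell."""
    sum_w = sum(weights)
    weighted_mean = sum(w * v for v, w in zip(values, weights)) / sum_w
    # Kish approximation (assumption): (sum of w)^2 / (sum of w^2)
    n_eff = sum_w ** 2 / sum(w * w for w in weights)
    return {"mean": weighted_mean, "n": len(values),
            "n_eff": n_eff, "sum_w": sum_w}


# One hypothetical state-year cell: a binary outcome and survey weights.
values = [1, 0, 1, 0]
weights = [1.0, 1.0, 2.0, 2.0]
summary = cell_summary(values, weights)
print(summary)
```

Unequal weights pull the effective sample size below the raw count, which is one reason a cell-level precision measure is more informative than the number of respondents alone.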
from diff_diff import aggregate_survey, SurveyDesign, DifferenceInDifferences
# Define the survey design for the microdata
design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")
# Aggregate to state-year panel with design-based SEs
panel, stage2 = aggregate_survey(
    microdata,
    by=["state", "year"],
    outcomes="smoking_rate",
    covariates=["age", "income"],
    survey_design=design,
)
# panel has: state, year, smoking_rate_mean, smoking_rate_se,
#   smoking_rate_n, smoking_rate_precision, smoking_rate_weight,
#   age_mean, income_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback
#
# *_weight is fit-ready: unit-constant population weight (pweight, default)
# or cleaned precision with NaN/Inf -> 0.0 (aweight opt-in).
# cell_sum_w is a per-cell diagnostic (sum of survey weights per cell).
# Non-estimable cells and zero-weight geos are dropped automatically.
# stage2 is pre-configured: pweights + state-level clustering

# Add treatment/time indicators at the panel level, then fit:
# panel["treated"] = ...  # from policy adoption data
# panel["post"] = (panel["year"] >= treatment_year).astype(int)
# result = DifferenceInDifferences().fit(
#     panel, outcome="smoking_rate_mean",
#     treatment="treated", time="post", survey_design=stage2,
# )

Validate data structure for DiD analysis.
.. autofunction:: diff_diff.validate_did_data
from diff_diff import validate_did_data
result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id'
)

if not result['valid']:
    for error in result['errors']:
        print(f"Error: {error}")

for warning in result['warnings']:
    print(f"Warning: {warning}")

Generate summary statistics for DiD data.
.. autofunction:: diff_diff.summarize_did_data
from diff_diff import summarize_did_data
summary = summarize_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id'
)
print(summary)

Rank control units by suitability for DiD or synthetic control.
.. autofunction:: diff_diff.rank_control_units
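One simple notion of suitability is how closely a control unit tracks the treated group before treatment. The sketch below ranks hypothetical controls by pre-period root-mean-squared distance to the treated mean. This scoring rule, the `pre_period_rmse` helper, and the toy series are assumptions for illustration; the criteria `rank_control_units` actually uses may be different or richer.

```python
from math import sqrt


def pre_period_rmse(control_path, treated_path):
    """Root-mean-squared gap between a control unit and the treated mean."""
    squared_gaps = [(c - t) ** 2 for c, t in zip(control_path, treated_path)]
    return sqrt(sum(squared_gaps) / len(squared_gaps))


# Hypothetical pre-treatment outcome paths over three periods.
treated_mean = [1.0, 2.0, 3.0]
controls = {
    "A": [1.1, 2.1, 3.1],  # same trend, small level gap
    "B": [3.0, 3.0, 3.0],  # flat, poor match
    "C": [1.0, 2.5, 3.0],  # one noticeable deviation
}

# Smaller distance means a better pre-treatment match.
ranked = sorted(controls, key=lambda u: pre_period_rmse(controls[u], treated_mean))
print(ranked)
```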
from diff_diff import rank_control_units, generate_did_data
panel = generate_did_data(n_units=100, n_periods=10, treatment_effect=2.0)
ranked = rank_control_units(
    panel,
    unit_column='unit',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    pre_periods=[0, 1, 2, 3, 4]
)
# Select top 10 control units
best_controls = ranked.head(10)['unit'].tolist()