Data Preparation

Utilities for preparing and validating data for DiD analysis.

.. module:: diff_diff.prep

Data Generation

generate_did_data

Generate synthetic data with known treatment effects for testing.

.. autofunction:: diff_diff.generate_did_data

Example

from diff_diff import generate_did_data

# Generate a basic DiD panel: 100 units, 10 periods, treatment starting in period 5
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,
    treatment_fraction=0.5,
    noise_sd=1.0
)

print(data.head())
# Columns include: unit, period, outcome, treated, post
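
As a rough illustration of what "known treatment effect" means for such data, the sketch below computes the classic difference-in-differences of cell means with plain pandas. The toy frame is hand-built (it is not output from ``generate_did_data``) and only borrows the ``treated``, ``post``, and ``outcome`` column names listed above.

```python
import pandas as pd

# Hand-built toy data: one control and one treated unit, pre and post.
# The treated unit gets an extra +5.0 in the post period (no noise).
toy = pd.DataFrame({
    "treated": [0, 0, 1, 1],
    "post":    [0, 1, 0, 1],
    "outcome": [10.0, 12.0, 11.0, 18.0],
})

# Mean outcome in each (treated, post) cell.
means = toy.groupby(["treated", "post"])["outcome"].mean()

# DiD = (treated post - treated pre) - (control post - control pre)
did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(did)
```

With noisy synthetic data the same calculation should land near the ``treatment_effect`` passed to the generator rather than exactly on it.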

generate_staggered_data

Generate synthetic staggered adoption data for testing.

.. autofunction:: diff_diff.generate_staggered_data

Example

from diff_diff import generate_staggered_data

data = generate_staggered_data(
    n_units=200,
    n_periods=10,
    cohort_periods=[4, 6, 8],
    seed=42
)

generate_event_study_data

Generate synthetic event study data for testing.

.. autofunction:: diff_diff.generate_event_study_data

generate_ddd_data

Generate synthetic Triple Difference data.

.. autofunction:: diff_diff.generate_ddd_data

generate_factor_data

Generate synthetic data with factor structure for TROP testing.

.. autofunction:: diff_diff.generate_factor_data

generate_panel_data

Generate generic synthetic panel data.

.. autofunction:: diff_diff.generate_panel_data

generate_continuous_did_data

Generate synthetic continuous treatment DiD data with known dose-response.

.. autofunction:: diff_diff.generate_continuous_did_data

generate_reversible_did_data

Generate synthetic reversible-treatment panel data, in which treatment can switch on and off over time. Use it with :class:`~diff_diff.ChaisemartinDHaultfoeuille` to test the dCDH estimator on non-absorbing treatments.

.. autofunction:: diff_diff.generate_reversible_did_data

Example

from diff_diff import generate_reversible_did_data, ChaisemartinDHaultfoeuille

data = generate_reversible_did_data(
    n_groups=80,
    n_periods=6,
    pattern="single_switch",  # or "joiners_only", "leavers_only", "mixed_single_switch"
    treatment_effect=2.0,
    seed=42,
)

est = ChaisemartinDHaultfoeuille()
results = est.fit(
    data, outcome="outcome", group="group",
    time="period", treatment="treatment",
)

Indicator Creation

make_treatment_indicator

Create binary treatment indicator from categorical or numeric columns.

.. autofunction:: diff_diff.make_treatment_indicator

Example

from diff_diff import make_treatment_indicator

# From categorical
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)

# From numeric threshold
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)
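
For intuition, both modes can be approximated in plain pandas. This is a sketch, not the library's implementation: the output column name ``treated`` and the strict ``>`` comparison at the threshold are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["treatment", "control", "treatment"],
    "exposure": [0.9, 0.2, 0.5],
})

# Categorical: 1 where the column equals the treated label, else 0.
# (The column name "treated" is an assumption.)
df["treated"] = (df["group"] == "treatment").astype(int)

# Numeric: 1 where exposure is above the cutoff.
# (Whether the boundary value counts as treated is an assumption; here it does not.)
df["high_exposure"] = (df["exposure"] > 0.5).astype(int)

print(df)
```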

make_post_indicator

Create post-treatment period indicator.

.. autofunction:: diff_diff.make_post_indicator

Example

from diff_diff import make_post_indicator

data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)
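
A hand-rolled pandas equivalent might look like the sketch below. It assumes the treatment start period itself counts as post-treatment and that the new column is called ``post``; neither is confirmed by the text above.

```python
import pandas as pd

df = pd.DataFrame({"unit_id": [1] * 4, "period": [3, 4, 5, 6]})

treatment_start = 5
# Periods at or after the start are flagged as post (assumed inclusive).
df["post"] = (df["period"] >= treatment_start).astype(int)

print(df)
```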

Panel Data Utilities

wide_to_long

Reshape wide panel data to long format.

.. autofunction:: diff_diff.wide_to_long

Example

from diff_diff import wide_to_long

# Wide format: each column is a time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    id_col='unit_id',
    value_name='outcome',
    var_name='year'
)
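
The reshape itself can be sketched with ``DataFrame.melt``. The ``y_`` prefix handling below is an illustrative choice for this toy frame, not a description of how ``wide_to_long`` parses column names.

```python
import pandas as pd

wide = pd.DataFrame({
    "unit_id": [1, 2],
    "y_2019": [1.0, 2.0],
    "y_2020": [1.5, 2.5],
})

# One row per (unit, period); former column names land in "year".
long_df = wide.melt(id_vars="unit_id", var_name="year", value_name="outcome")

# Turn "y_2019" into the integer 2019 so the time column sorts numerically.
long_df["year"] = long_df["year"].str.replace("y_", "", regex=False).astype(int)
long_df = long_df.sort_values(["unit_id", "year"]).reset_index(drop=True)

print(long_df)
```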

balance_panel

Balance panel data by filling or dropping incomplete observations.

.. autofunction:: diff_diff.balance_panel

Example

from diff_diff import balance_panel

# Fill missing periods with NaN
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='fill'
)

# Or keep only units with all periods (default)
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='inner'
)
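
The difference between the two methods can be seen on a small unbalanced panel. The pandas sketch below mimics the behaviour described above ("inner" drops incomplete units, "fill" adds NaN rows); it is an approximation for illustration and may differ from ``balance_panel`` in details such as row order.

```python
import pandas as pd

# Unit 2 is missing period 1.
df = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2],
    "period":  [0, 1, 2, 0, 2],
    "outcome": [1.0, 1.1, 1.2, 2.0, 2.2],
})
n_periods = df["period"].nunique()

# "inner"-style: keep only units observed in every period.
periods_per_unit = df.groupby("unit_id")["period"].transform("nunique")
inner = df[periods_per_unit == n_periods]

# "fill"-style: reindex onto the full unit x period grid; gaps become NaN.
grid = pd.MultiIndex.from_product(
    [df["unit_id"].unique(), sorted(df["period"].unique())],
    names=["unit_id", "period"],
)
filled = df.set_index(["unit_id", "period"]).reindex(grid).reset_index()

print(inner)
print(filled)
```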

Staggered Adoption Utilities

create_event_time

Create event-time column for staggered adoption designs.

.. autofunction:: diff_diff.create_event_time

Example

from diff_diff import create_event_time

data = create_event_time(
    data,
    time_column='period',
    treatment_time_column='first_treat'
)

# event_time = period - first_treat
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated
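
The arithmetic in the comments above can be reproduced directly in pandas. The sketch assumes never-treated units carry ``NaN`` in ``first_treat``; the library may accept other encodings, so treat that as an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2, 2],
    "period": [3, 4, 5, 3, 4, 5],
    # Unit 1 is first treated in period 4; unit 2 is never treated (assumed NaN).
    "first_treat": [4, 4, 4, np.nan, np.nan, np.nan],
})

# Subtraction propagates NaN, so never-treated rows get NaN event time.
df["event_time"] = df["period"] - df["first_treat"]

print(df)
```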

aggregate_to_cohorts

Aggregate unit-level data to cohort means.

.. autofunction:: diff_diff.aggregate_to_cohorts

Example

from diff_diff import aggregate_to_cohorts

cohort_data = aggregate_to_cohorts(
    data,
    unit_column='unit_id',
    time_column='period',
    treatment_column='first_treat',
    outcome='outcome'
)
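
Conceptually, the aggregation is a group-by over cohort and period. The sketch below shows one way to do that in pandas; the output column names (``mean_outcome``, ``n_units``) and the use of ``0`` to mark never-treated units are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "unit_id":     [1, 1, 2, 2, 3, 3],
    "period":      [0, 1, 0, 1, 0, 1],
    "first_treat": [1, 1, 1, 1, 0, 0],  # 0 marks never-treated here (an assumption)
    "outcome":     [1.0, 3.0, 2.0, 4.0, 5.0, 5.0],
})

# One row per (cohort, period) with the mean outcome and the number of units.
cohort_means = (
    df.groupby(["first_treat", "period"], as_index=False)
      .agg(mean_outcome=("outcome", "mean"), n_units=("unit_id", "nunique"))
)

print(cohort_means)
```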

Survey Aggregation

aggregate_survey

Aggregate survey microdata to geographic-period cells with design-based precision.

.. autofunction:: diff_diff.aggregate_survey

Example

from diff_diff import aggregate_survey, SurveyDesign, DifferenceInDifferences

# Define the survey design for the microdata
design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")

# Aggregate to state-year panel with design-based SEs
panel, stage2 = aggregate_survey(
    microdata,
    by=["state", "year"],
    outcomes="smoking_rate",
    covariates=["age", "income"],
    survey_design=design,
)

# panel has: state, year, smoking_rate_mean, smoking_rate_se,
#   smoking_rate_n, smoking_rate_precision, smoking_rate_weight,
#   age_mean, income_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback
#
# *_weight is fit-ready: unit-constant population weight (pweight, default)
#   or cleaned precision with NaN/Inf -> 0.0 (aweight opt-in).
# cell_sum_w is a per-cell diagnostic (sum of survey weights per cell).
# Non-estimable cells and zero-weight geos are dropped automatically.

# stage2 is pre-configured: pweights + state-level clustering
# Add treatment/time indicators at the panel level, then fit:
# panel["treated"] = ...  # from policy adoption data
# panel["post"] = (panel["year"] >= treatment_year).astype(int)
# result = DifferenceInDifferences().fit(
#     panel, outcome="smoking_rate_mean",
#     treatment="treated", time="post", survey_design=stage2,
# )
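
To make the cell-level point estimate concrete, the sketch below computes a survey-weighted mean per state-year cell in plain pandas. It deliberately omits the design-based standard errors (strata and PSU handling), which are the harder part of what ``aggregate_survey`` is described as doing. The toy column ``smoker`` and the output names are hypothetical.

```python
import pandas as pd

micro = pd.DataFrame({
    "state":   ["A", "A", "B", "B"],
    "year":    [2020, 2020, 2020, 2020],
    "smoker":  [1.0, 0.0, 1.0, 1.0],   # hypothetical respondent-level outcome
    "finalwt": [1.0, 3.0, 2.0, 2.0],   # survey weight
})

# Weighted mean per cell = sum(w * y) / sum(w).
micro["wy"] = micro["smoker"] * micro["finalwt"]
cells = micro.groupby(["state", "year"], as_index=False).agg(
    sum_wy=("wy", "sum"),
    sum_w=("finalwt", "sum"),
    cell_n=("smoker", "size"),
)
cells["smoker_mean"] = cells["sum_wy"] / cells["sum_w"]

print(cells[["state", "year", "smoker_mean", "cell_n", "sum_w"]])
```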

Data Validation

validate_did_data

Validate data structure for DiD analysis.

.. autofunction:: diff_diff.validate_did_data

Example

from diff_diff import validate_did_data

result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id'
)

if not result['valid']:
    for error in result['errors']:
        print(f"Error: {error}")

# Print warnings separately so they are not hidden when validation passes
for warning in result['warnings']:
    print(f"Warning: {warning}")
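
The kinds of structural checks such a validator might run can be sketched as follows. ``basic_did_checks`` is a hypothetical helper written for this example; it only mirrors the ``valid`` / ``errors`` / ``warnings`` keys used above and does not reproduce the library's actual rules.

```python
import pandas as pd

def basic_did_checks(df, outcome, treatment, time, unit):
    """Hypothetical helper: a few structural checks, not the library's logic."""
    errors, warnings = [], []

    # All referenced columns must exist before anything else can be checked.
    missing = [c for c in (outcome, treatment, time, unit) if c not in df.columns]
    if missing:
        errors.append(f"missing columns: {missing}")
        return {"valid": False, "errors": errors, "warnings": warnings}

    # Treatment indicator should only take the values 0 and 1.
    if not set(df[treatment].dropna().unique()) <= {0, 1}:
        errors.append(f"{treatment} must be binary (0/1)")

    # Each unit should appear at most once per period.
    if df.duplicated([unit, time]).any():
        errors.append("duplicate unit-period rows")

    # Missing outcomes are flagged but do not invalidate the data here.
    if df[outcome].isna().any():
        warnings.append(f"{outcome} contains missing values")

    return {"valid": not errors, "errors": errors, "warnings": warnings}

good = pd.DataFrame({
    "unit_id": [1, 1], "period": [0, 1], "treated": [1, 1], "outcome": [1.0, 2.0],
})
bad = good.assign(treated=[1, 2])

report_good = basic_did_checks(good, "outcome", "treated", "period", "unit_id")
report_bad = basic_did_checks(bad, "outcome", "treated", "period", "unit_id")
print(report_good)
print(report_bad)
```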

summarize_did_data

Generate summary statistics for DiD data.

.. autofunction:: diff_diff.summarize_did_data

Example

from diff_diff import summarize_did_data

summary = summarize_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id'
)

print(summary)

Control Unit Selection

rank_control_units

Rank control units by suitability for DiD or synthetic control.

.. autofunction:: diff_diff.rank_control_units

Example

from diff_diff import rank_control_units, generate_did_data

panel = generate_did_data(n_units=100, n_periods=10, treatment_effect=2.0)
ranked = rank_control_units(
    panel,
    unit_column='unit',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    pre_periods=[0, 1, 2, 3, 4]
)

# Select top 10 control units
best_controls = ranked.head(10)['unit'].tolist()
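
One plausible ranking criterion is how closely each control tracks the treated group before treatment. The sketch below ranks controls by pre-period RMSE against the treated mean path. This is an assumed criterion chosen for illustration; the scoring used by ``rank_control_units`` is not specified in the text above, and ``pre_rmse`` is a hypothetical column name.

```python
import pandas as pd

panel = pd.DataFrame({
    "unit":    ["T", "T", "C1", "C1", "C2", "C2"],
    "period":  [0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 0, 0, 0, 0],
    "outcome": [1.0, 2.0, 1.1, 2.1, 5.0, 9.0],
})

# Restrict to pre-treatment periods.
pre = panel[panel["period"].isin([0, 1])]

# Average treated outcome in each pre-period.
treated_path = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# One row per control unit, one column per pre-period.
controls = pre[pre["treated"] == 0].pivot(index="unit", columns="period", values="outcome")

# RMSE of each control's path against the treated path; smaller is a closer match.
rmse = ((controls - treated_path) ** 2).mean(axis=1) ** 0.5
ranked_sketch = rmse.sort_values().rename("pre_rmse").reset_index()

print(ranked_sketch)
```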
