JanFSchulte/ScoutingVVVTools

 
 

ScoutingVVVTools

CMS scouting VVV analysis workflow for pileup reweighting, branch conversion, optional sample-entry mixing, BDT/NN training, signal-region optimization, QCD ABCD estimation, data/MC plotting, and CMS combine significance/limits.

Pipeline order

  1. mode=1: build pileup-weight CSV files.
  2. mode=0: convert the selected NanoAOD-style inputs into fat2 and fat3 ROOT trees.
  3. mode=6: shuffle each selected MC sample across its ROOT chunks while preserving chunk sizes and filenames, using a deterministic block shuffle that keeps ROOT reads mostly sequential.
  4. mode=2: train the BDT or PyTorch NN and save the model plus copied configs. The checked-in config reads from dataset/{signal_mixed|bkg_mixed} by default and uses model_type: "bdt" unless changed.
  5. mode=3: scan the test split and write signal_region.csv.
  6. mode=5: run the QCD ABCD validation on the MC test split.
  7. mode=4: make data/MC comparison plots.
  8. mode=7: run the CMS combine wrapper to compute expected significance and AsymptoticLimits per signal class / signal sample / combined, both with the MC-true QCD class yields and with the merged ABCD QCD prediction.
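The ordering above can be captured in a small driver. This is a minimal sketch, not part of the repository: it assumes run.sh is in the current directory and that each mode runs with its checked-in default config. The names PIPELINE_ORDER, build_commands, and run_pipeline are hypothetical.

```python
import subprocess

# Mode order copied from the "Pipeline order" list: pileup weights first,
# then conversion, mixing, training, SR scan, ABCD, plotting, and combine.
PIPELINE_ORDER = [1, 0, 6, 2, 3, 5, 4, 7]


def build_commands(order=PIPELINE_ORDER, script="./run.sh"):
    """Return one argv list per mode, relying on run.sh's default configs."""
    return [[script, str(mode)] for mode in order]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later modes never run on incomplete upstream output.
        runner(cmd, check=True)


# Print the planned commands without executing them.
for cmd in build_commands():
    print(" ".join(cmd))
```

Calling run_pipeline(build_commands()) would execute the whole chain. Passing a custom runner makes it possible to dry-run or log the commands instead.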

selections/b_veto/ and systematics/ are separate studies and are not launched through run.sh.

run.sh

Usage:

./run.sh <mode> [config.json] [sample1 sample2 ...]

Examples:

./run.sh 1
./run.sh 0
./run.sh 0 selections/convert/config.json www qcd_ht2000
./run.sh 6
./run.sh 6 selections/mix/config.json www
./run.sh 2
./run.sh 3 selections/signal_region/config.json
./run.sh 5 background_estimation/config.json
./run.sh 4 plotting/config.json
./run.sh 7 combine/config.json

Modes:

  • mode=0: compile and run selections/convert/convert_branch.C.
    Input: selections/convert/config.json, selections/convert/branch.json, selections/convert/selection.json, src/sample.json, and the pileup CSVs from mode=1.
    Output: converted ROOT files under {output_root}/{signal|bkg|data}/, plus selections/convert/log.txt.
    The final writer streams each tree through a TChain clone owned by the chain itself, so branch addresses stay correct when ROOT advances across temp-file boundaries, and it aborts on any chunk/tree entry-count mismatch instead of silently writing too few or repeated entries.
    The convert expression engine includes collection helpers such as sum, max_value, min_value, nth_max_value(collection, expr, rank, default), value_at_max(collection, key_expr, value_expr, default), and value_at_nth_max(collection, key_expr, value_expr, rank, default).
    The checked-in output branch config writes per-AK8 WvsQCD and VvsQCD scores, event-level ScoutingFatPFJetRecluster_{WvsQCD,VvsQCD}_{sum,max} summaries, and AK4 b-tag summaries including ScoutingPFJetRecluster2_scoutUParT_probb_max_pt, ScoutingPFJetRecluster2_scoutUParT_probb_sum, ScoutingPFJetRecluster2_scoutUParT_probb_second_max, and ScoutingPFJetRecluster2_scoutUParT_probb_second_max_pt for both fat2 and fat3.
  • mode=1: compile and run selections/weight/weight.C. Input: selections/weight/config.json and src/sample.json. Output: pileup CSV files under the configured output_root, plus selections/weight/log.txt.
  • mode=6: compile and run selections/mix/mix.C.
    Input: selections/mix/config.json, src/sample.json, and the converted MC ROOT files from mode=0. Missing input files for a requested sample are fatal.
    Output: shuffled ROOT files under the configured output_root/{signal_mixed|bkg_mixed}/, preserving the original chunk filenames and per-chunk entry counts while applying a deterministic per-tree block shuffle for each selected MC sample, plus selections/mix/log.txt.
    The implementation scans input chunk entry counts first (warming ROOT serially, then using OpenMP across the remaining files when available) and then rewrites each tree through a TChain using shuffled contiguous entry blocks plus an in-block cyclic rotation. Each non-empty output chunk is cloned from the TChain itself so ROOT keeps branch addresses synced across input-file boundaries, and the writer validates both per-chunk and total entry counts before closing.
    The block size is chosen as clamp(total_entries / 512, min_block_entries, max_block_entries); the checked-in defaults are 32 and 4096. This keeps the output format unchanged while avoiding the old fully random entry-by-entry ROOT read pattern.
  • mode=2: run selections/BDT/train.py.
    Input: selections/BDT/config.json, the ROOT files resolved from its input_root / input_pattern (the checked-in config points to the mixed mode=6 output under ../../dataset/{sample_group}_mixed), and src/sample.json.
    Output: one trained model directory per tree under output_root. With model_type: "bdt" it contains {tree}_model.json and {tree}_model_stage1.json; with model_type: "nn" it contains PyTorch checkpoints {tree}_model.pt and {tree}_model_stage1.pt and requires PyTorch at training/inference time.
    The rest of the output layout is shared: feature_corr.pdf, loss plots (loss_mlogloss.pdf / loss_classification.pdf / loss_total.pdf for BDT or loss_weighted_ce.pdf / loss_objective.pdf for NN, plus loss_decorrelation.pdf; NN x-axes are epochs, BDT x-axes are boosting rounds), stage-tagged comparison plots such as importance_cls.pdf / importance_decorr.pdf, roc_*_cls.pdf / roc_*_decorr.pdf, score_*_cls.pdf / score_*_decorr.pdf, decor_corr_{train,test}_cls.pdf / decor_corr_{train,test}_decorr.pdf, decor_score_vs_branch_cls.pdf / decor_score_vs_branch_decorr.pdf, and decor_branch_shapes_by_signal_score_cls.pdf / decor_branch_shapes_by_signal_score_decorr.pdf.
    BDT feature importance uses XGBoost gain; NN feature importance uses deterministic permutation importance while preserving the same filenames.
    The output also includes a branches/ subdirectory with one normalized per-class input-distribution PDF per training branch, copied config.json / branch.json / selection.json, test_ranges.json, and the saved test-set prediction references test_reference_signal_region.npz, test_reference_qcd_est.npz, and test_reference_qcd_est_full.npz, plus selections/BDT/log.txt.
    Both model types keep the same sample splitting, clipping/log preprocessing, event-weight construction, dynamic learning-rate reduction, prediction-reference validation, and downstream score interface.
    When a per-tree decorrelate branch also appears in selection.json thresholds, only that branch's threshold is omitted from the training/early-stopping objective splits and their class balancing; ordinary branch, ROC, score, feature-importance, feature-correlation, and downstream reference outputs still use the full threshold set, while decorrelation diagnostics use the training-threshold view without that decorrelate-branch cut.
    BDT logs sum-scale classification_loss and stage 1 monitors test classification_loss. NN uses class-balanced mini-batches with sampling-corrected event weights for backward() and logs epoch-end eval-mode train_objective_loss / test_objective_loss, where stage 1 uses objective_loss = weighted_CE and stage 2 uses objective_loss = weighted_CE + scaled_smooth_CvM_decorrelation. NN early stopping monitors epoch-end test objective_loss in both stages, and AdamW weight_decay remains native decoupled optimizer behavior rather than a logged loss term.
    In NN mode, per-tree fat2_nn / fat3_nn blocks configure the PyTorch MLP (hidden_layers, dropout, batch_norm, batch_size, learning rates, weight decay, epochs, and permutation-importance event count); batch_size is the total size of each class-balanced mini-batch, the checked-in NN blocks set batch_norm: false, and smooth-CvM decorrelation is supported in the second NN stage with decorrelation scale calibrated against the stage-1 batch-averaged weighted CE.
  • mode=3: run selections/signal_region/signal_region.py.
    Input: selections/signal_region/config.json and one trained model directory from mode=2.
    Output: sr_score_*.pdf, scores_no_regions.pdf, scores.pdf, scores_no_regions_radial_equalized.pdf, scores_radial_equalized.pdf, and signal_region.csv inside the configured output_dir, plus selections/signal_region/log.txt. The 2D score PDFs use the shared class palette, hide coordinate axes/ticks/frames, and include both the original regular-polygon simplex projection and a radial-equalized projection that keeps polygon vertices fixed while spreading points outward.
    Before scanning, the script reloads the saved model, reproduces the saved full-threshold signal-region test prediction, and aborts if it does not match test_reference_signal_region.npz within the stored tolerances.
    The script builds one shared candidate pool of general high-dimensional rectangles over the configured BDT score_axes (all in the checked-in config), where every score axis has an independent [low, high) interval and can therefore represent simultaneous lower and upper cuts.
    Candidate generation combines dense per-axis boundary sets from total/signal/background weighted quantiles plus tail-heavy quantiles, explicit low/mid/high-tail single-axis seeds from seed_quantiles, exact single-axis bounded interval seeds, optional bounded multi-axis seed combinations, deterministic beam coordinate search where every per-axis update scans exact [edges[a], edges[b]) intervals, compatibility expansion that optimizes regions constrained to be non-overlapping with high-Z anchor candidates, and local event-threshold refinement.
    Independent edge building, beam updates, compatibility updates, local-refinement workers, and event-mask canonicalization can use max_threads, but each stage merges results in deterministic scan order so the candidate set is unchanged by threading.
    No candidate-pool or global-candidate count limit is applied; candidates are removed only by rounded geometry duplicates and by final exact event-mask/canonical-box deduplication. Before global selection, every retained candidate is shrunk to an event-preserving minimal score box; candidates are deduped only when they select the same exact event mask and shrink to the same canonical box.
    The final selection is global rather than sequential: it searches K mutually non-overlapping candidates and maximizes the existing combined objective sqrt(sum Z_i^2), with Python and OpenMP compatibility checks using the same exact half-open [low, high) overlap semantics as event-mask membership. The OpenMP helper first finds a beam incumbent and then runs branch-and-bound; if the search completes, the result is exact for the finite candidate list, while node caps produce a conservative upper-bound certificate.
    After the K non-overlapping SRs are selected, an empty-bin expansion step runs in order SR1, SR2, …: each SR has its signal-class score-axis upper bounds pushed toward 1.0 and its background-class score-axis lower bounds pushed toward 0.0, but only into space that is empty in MC under the SR's other-axis cuts and only as far as the expanded box remains geometrically non-overlapping with every other selected SR. The expansion is constrained to keep each SR's exact selected event mask, so per-bin S/B/Z are numerically unchanged; only the empty-side score-axis bounds in signal_region.csv are widened.
  • mode=4: run plotting/data_mc.py.
    Input: plotting/config.json, plotting/branch.json, the converted ROOT files from mode=0, and one trained model directory from mode=2.
    Output: one PDF per plotted branch under the configured output_root, plus plotting/log.txt.
    Data and MC both use the configured trained-model input pattern for ordinary ROOT branches; missing sample files, empty MC trees/classes before filtering, and non-positive MC raw_entries are treated as fatal errors.
    In addition to ROOT branches, one derived model score branch per class is plotted as score_{class_name}. For those score branches, MC is read from bdt_root/test_ranges.json, validated against test_reference_signal_region.npz, and normalized to the same full-sample target total, while data is predicted from the full configured data_samples input.
    The script loads .json/.pkl BDT models or .pt NN checkpoints according to the copied model_type.
  • mode=5: run background_estimation/qcd_est.py.
    Input: background_estimation/config.json, one trained model directory from mode=2, and the signal_region.csv written by mode=3.
    Output: ABCD summary PDFs and one ROOT file under the configured output_dir, plus background_estimation/log.txt.
    Before building the ABCD regions, the script reloads the saved model, validates against test_reference_qcd_est_full.npz when present (falling back to the legacy filtered test_reference_qcd_est.npz), and aborts on mismatch.
    The validation PDFs keep the finite-MC / ABCD-propagated uncertainties based on sum(w^2). The qcd_abcd_region_counts.pdf and qcd_abcd_region_fractions.pdf summaries are stacked by the BDT class_groups classes with a shared fixed class palette; additional colors are generated deterministically when there are more classes than base colors, and qcd_abcd_region_counts_linear.pdf saves the same counts with a linear y axis. The QCD-merged summaries (qcd_abcd_region_counts_qcd_merged.pdf, qcd_abcd_region_counts_qcd_merged_linear.pdf, and qcd_abcd_region_fractions_qcd_merged.pdf) combine every class_groups class whose name contains qcd case-insensitively into one QCD stack color and legend entry, leaving non-QCD classes unchanged.
    The ROOT file stores combine-facing bundles for every saved category (samples/*, groups/*, qcd_predict, qcd_true, total_predict, and total_true) as srN/yield, srN/stat_error, and srN/scale_error one-bin histograms plus a category-level covariance_total TH2. The combine-facing convention is stat_error = sqrt(yield), scale_error = 0, and diagonal covariance_total = diag(yield), so the weighted yield is treated as the Poisson event count.
    QCD classes are identified by class_groups names containing qcd case-insensitively; all samples from those classes are summed for the ABCD A/B/C/D totals, qcd_true, qcd_predict, and total_predict, while the per-class groups/* outputs remain MC-true.
    The signal-region score axes are detected from the CSV {axis}_low / {axis}_high columns, so the ABCD step can consume either the legacy independent axes or the new all-score-axis regions.
    The non-score ABCD dimension is configured by required per-tree abcd_branches; every listed branch must have a threshold in the copied BDT selection.json, A/B require all listed branches to pass, C/D require all listed branches to fail, and partial pass/fail events are excluded.
  • mode=7: compile and run combine/combine.C. This mode must be executed inside a CMSSW area that has HiggsAnalysis/CombinedLimit built, so that combine and combineCards.py are on $PATH.
    Input: combine/config.json (lists one or more channels, each pointing to a qcd_abcd_yields.root from mode=5) plus one trained bdt_root directory from mode=2. combine.C reads class_groups from bdt_root/config.json, then resolves that copied config's sample_config the same way as qcd_est.py so the signal/background-class split still comes from sample.json. Group names from class_groups are matched against groups/* in the ROOT file by the same slugified lowercase convention used by qcd_est.py, and QCD classes are the class names containing qcd case-insensitively.
    Output: significance.csv, limits.csv, significance_abcd_mc.csv, limits_abcd_mc.csv, significance_by_channel.csv, limits_by_channel.csv, significance_by_channel_abcd_mc.csv, and limits_by_channel_abcd_mc.csv under the configured output_dir, plus combine/log.txt. Each CSV has one row per signal scenario (combined / per-signal-class / per-signal-sample); the per-channel CSVs add a channel column and rerun the same scenarios using one channel at a time. The _abcd_mc.csv CSVs replace all QCD classes with the single merged qcd_predict block while keeping the non-QCD processes on their MC-true group/sample blocks.
    The wrapper reads every SR from the new srN/yield one-bin histograms and, by default (use_root_covariance=false), writes pure counting datacards with one independent bin per SR (<channel>_sr<N>). Combine's native Poisson likelihood then provides the statistical uncertainty; this matches the signal-region Asimov significance convention, avoids double-counting the same counting fluctuation as a Gaussian nuisance, and avoids the old multi-bin shape-PDF factorization warning.
    The wrapper validates every required signal sample/group and aborts on missing inputs or invalid shapes instead of regularizing the yields.
    If use_root_covariance=true, the full per-process covariance between signal regions is additionally injected via an eigen-decomposition of each process's covariance_total, one Gaussian shape nuisance per retained eigenmode, using one-bin shape templates for each SR. In that optional mode, when rescale_shape_modes_to_positive is true (default), any background process whose nominal yields and covariance are identically zero across all SRs is dropped before writing the datacard, and any eigenmode whose raw ±1σ templates would make a varying bin negative or a total template norm non-positive is shrunk to the largest safe step below that boundary; the datacard shape coefficient is rescaled by 1/a, zero-valued bins are allowed as long as the varied templates stay non-negative and keep strictly positive total norms, and every drop/rescale is written as a warning to combine/log.txt.
    If a per-sample signal process is identically zero in some channels, those channels are skipped with a warning; if it is identically zero in every channel, the wrapper records 0 significance and inf expected limits for that row and continues. If AsymptoticLimits finishes successfully but its ROOT output still lacks the expected quantiles, the wrapper logs a warning, records 0 for that row's significance and inf for its expected limits in the CSVs, and continues.
    A temporary work directory is kept under output_dir/work/ with the generated datacards and combine outputs; optional covariance-nuisance shape ROOT files are written only when use_root_covariance=true.
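The mode=6 block shuffle can be illustrated in a few lines of Python. This is a sketch of the idea only, not the code in mix.C: it assumes integer division for total_entries / 512, uses Python's random module in place of whatever generator mix.C uses, and draws one rotation offset per block. The function names are hypothetical.

```python
import random


def choose_block_size(total_entries, min_block_entries=32, max_block_entries=4096):
    """clamp(total_entries / 512, min, max) with the checked-in default bounds."""
    target = total_entries // 512  # assumption: integer division
    return max(min_block_entries, min(target, max_block_entries))


def block_shuffled_order(total_entries, random_state=0,
                         min_block_entries=32, max_block_entries=4096):
    """Return a deterministic permutation of range(total_entries).

    Contiguous blocks are visited in shuffled order, and each block is read
    with a cyclic rotation, so reads stay sequential inside a block.
    """
    if total_entries <= 0:
        return []
    size = choose_block_size(total_entries, min_block_entries, max_block_entries)
    rng = random.Random(random_state)  # fixed seed gives a reproducible order
    starts = list(range(0, total_entries, size))
    rng.shuffle(starts)
    order = []
    for start in starts:
        length = min(start + size, total_entries) - start
        shift = rng.randrange(length)  # in-block cyclic rotation offset
        order.extend(start + (shift + i) % length for i in range(length))
    return order


order = block_shuffled_order(10_000, random_state=42)
print(order[:8])
```

Because every entry index appears exactly once, per-chunk and total entry counts are unchanged when the output is cut at the original chunk boundaries.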
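For mode=3, two ingredients are easy to show in isolation: the half-open [low, high) overlap test and the combined objective sqrt(sum Z_i^2). The sketch below is illustrative and does not reproduce signal_region.py. The per-region Z uses the standard Asimov formula, which is an assumption here, and the function names are hypothetical.

```python
import math


def asimov_z(signal, background):
    """Standard Asimov significance (assumed form of the per-region Z)."""
    if signal <= 0.0 or background <= 0.0:
        return 0.0
    return math.sqrt(2.0 * ((signal + background) * math.log1p(signal / background) - signal))


def boxes_overlap(box_a, box_b):
    """True if two score boxes share any point under [low, high) semantics.

    Each box maps an axis name to (low, high); both boxes are assumed to
    define the same axes. Boxes overlap only if they overlap on every axis.
    """
    return all(
        box_a[axis][0] < box_b[axis][1] and box_b[axis][0] < box_a[axis][1]
        for axis in box_a
    )


def combined_z(z_values):
    """Combined objective sqrt(sum Z_i^2) over non-overlapping regions."""
    return math.sqrt(sum(z * z for z in z_values))


sr1 = {"score_VVV": (0.8, 1.0), "score_qcd": (0.0, 0.1)}
sr2 = {"score_VVV": (0.6, 0.8), "score_qcd": (0.0, 0.1)}
print(boxes_overlap(sr1, sr2))  # boxes touch at 0.8 but do not overlap
print(combined_z([asimov_z(5.0, 20.0), asimov_z(3.0, 30.0)]))
```

With half-open intervals, two boxes that share only an edge never select the same event, which is why the geometric check and event-mask membership can use the same rule.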
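The mode=5 pass/fail rule for abcd_branches can be sketched as follows. Only the all-pass / all-fail / excluded logic comes from the description above. The assignment of the score dimension (inside the SR score box for A and C, outside for B and D) and the textbook transfer formula A_pred = B * C / D are assumptions for illustration, and every name in the block is hypothetical.

```python
def abcd_side(event, cuts):
    """Return "pass" if every abcd branch passes, "fail" if every one fails.

    cuts maps a branch name to a predicate on that branch's value.
    Events with a mixed pass/fail pattern return None and are excluded.
    """
    flags = [cut(event[name]) for name, cut in cuts.items()]
    if all(flags):
        return "pass"
    if not any(flags):
        return "fail"
    return None


def abcd_region(event, cuts, in_signal_score_box):
    """Assign A/B/C/D; the score-side split is an assumed layout."""
    side = abcd_side(event, cuts)
    if side is None:
        return None
    if side == "pass":
        return "A" if in_signal_score_box else "B"
    return "C" if in_signal_score_box else "D"


def abcd_prediction(n_b, n_c, n_d):
    """Textbook ABCD estimate of the A-region yield."""
    if n_d <= 0.0:
        raise ValueError("D-region yield must be positive")
    return n_b * n_c / n_d


# Hypothetical branch names and thresholds, for illustration only.
cuts = {"tag_1": lambda v: v > 0.5, "tag_2": lambda v: v > 0.5}
print(abcd_region({"tag_1": 0.9, "tag_2": 0.7}, cuts, True))   # all pass, in box
print(abcd_region({"tag_1": 0.9, "tag_2": 0.1}, cuts, True))   # mixed, excluded
print(abcd_prediction(40.0, 10.0, 100.0))
```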

Sample arguments:

  • Extra sample names are supported only for mode=0, mode=1, and mode=6.
  • If no sample names are given, run.sh uses submit_samples from the chosen config.
  • If submit_samples is empty or missing, all MC samples in src/sample.json are used.
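The fallback chain for sample names can be written out explicitly. This sketch assumes the config and src/sample.json have already been parsed, and that the sample registry is a list of objects carrying the name and is_MC fields; the actual JSON layout may differ. resolve_samples is a hypothetical helper, not a function in run.sh.

```python
# Modes that accept sample names on the command line, per the list above.
MODES_WITH_SAMPLE_ARGS = {0, 1, 6}


def resolve_samples(mode, cli_samples, config, sample_registry):
    """Pick the samples to process: CLI names, then submit_samples, then all MC."""
    if cli_samples:
        if mode not in MODES_WITH_SAMPLE_ARGS:
            raise ValueError(f"mode={mode} does not accept sample names")
        return list(cli_samples)
    submit_samples = config.get("submit_samples") or []
    if submit_samples:
        return list(submit_samples)
    # Empty or missing submit_samples: fall back to every MC sample.
    return [entry["name"] for entry in sample_registry if entry.get("is_MC")]


registry = [
    {"name": "www", "is_MC": True},
    {"name": "qcd_ht2000", "is_MC": True},
    {"name": "data_2024", "is_MC": False},  # hypothetical data sample name
]
print(resolve_samples(0, ["www"], {}, registry))
print(resolve_samples(6, [], {"submit_samples": []}, registry))
```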

Log archiving:

  • After the final [finished …] line is written, run.sh copies the per-mode log.txt into the program's configured output directory.
  • Applies to mode=1 (output_root), mode=2 / mode=4 (per-tree output_root expanded over submit_trees), and mode=3 / mode=5 / mode=7 (output_dir). mode=0 (convert_branch) and mode=6 (mix) are intentionally skipped.
  • The copy runs on both success and failure; if the resolved output dir does not yet exist (e.g., the run aborted before creating it), the copy is skipped with a warning instead of erroring.
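The archiving rules amount to a guarded copy. The sketch below mirrors them in Python under the assumption that the caller has already resolved the output directories for the mode; run.sh itself is a shell script and may implement this differently. archive_log is a hypothetical name.

```python
import shutil
from pathlib import Path

# Modes whose logs are intentionally not archived.
SKIPPED_MODES = {0, 6}


def archive_log(mode, log_path, output_dirs):
    """Copy log_path into each existing output directory.

    Returns a list of status messages. Missing directories produce a
    warning instead of an error, matching the behavior described above.
    """
    messages = []
    if mode in SKIPPED_MODES:
        return messages
    log_path = Path(log_path)
    for out_dir in map(Path, output_dirs):
        if not out_dir.is_dir():
            messages.append(f"warning: {out_dir} does not exist, skipping log copy")
            continue
        shutil.copy2(log_path, out_dir / log_path.name)
        messages.append(f"copied {log_path} to {out_dir}")
    return messages
```

For mode=2 and mode=4 the caller would pass one directory per entry in submit_trees. For the other archived modes it would pass a single directory.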

Main JSON files

  • src/sample.json: master sample registry with name, path, sample_ID, is_MC, is_signal, xsection, lumi, and raw_entries.
  • selections/weight/config.json: pileup histogram inputs and pileup-weight output paths.
  • selections/convert/config.json: convert-step paths, threading, file-size splitting, and pileup CSV pattern.
  • selections/mix/config.json: mix-step tree selection, input/output ROOT roots and patterns, sample config, threading for the input-chunk scan, deterministic random_state, and the block-size bounds min_block_entries / max_block_entries (default 32 / 4096).
  • selections/convert/selection.json: event selection and tree split (fat2 / fat3).
  • selections/convert/branch.json: input branches to read and output branches to write.
  • selections/BDT/config.json: model inputs, input_root / input_pattern, class groups, training settings, output directories, top-level model_type (bdt or nn, default bdt), top-level decor_loss_mode, decor_lambda, decor_n_bins, decor_n_thresholds, decor_score_tau, decor_bin_tau_scale, per-tree BDT hyperparameters (n_estimators, n_estimators_decorr, max_depth, learning_rate, learning_rate_decorr, optional min_learning_rate / lr_reduce_patience, gamma, reg_lambda, reg_alpha, min_child_weight, subsample, colsample_bytree, optional colsample_bynode, early_stopping_rounds), per-tree NN hyperparameters in fat2_nn / fat3_nn (epochs, epochs_decorr, hidden_layers, activation, dropout, batch_norm, batch_size, learning_rate, learning_rate_decorr, optional min_learning_rate / lr_reduce_patience, weight_decay, optional grad_clip_norm defaulting off, early_stopping_rounds, permutation_importance_events; checked-in NN values use learning_rate: 0.001, learning_rate_decorr: 0.0005, and min_learning_rate: 0.00002), decorrelate, decor_efficiencies, and event_reweight_branches.
  • selections/signal_region/config.json: signal-region scan settings: lumi, n_signal_regions / N, bdt_root, output_dir, score_axes, min_bkg_weight, min_signal_weight, optional entry minima, max_edge_candidates_per_axis, beam_width, top_intervals_per_axis, coordinate_rounds, seed_intervals_per_axis, multi_axis_seed_max_axes, multi_axis_seed_max_seeds, compatibility_seed_anchors, compatibility_seed_rounds, local_refine_rounds, local_refine_neighbor_edges, local_refine_top_candidates, local_refine_diverse_masks, local_refine_candidate_overscan, global_beam_width, branch_bound_max_nodes, branch_bound_time_limit_seconds, deduplicate_event_masks, require_exact_n_regions, max_threads, progress_every_seconds, and seed_quantiles.
  • background_estimation/config.json: qcd_est.py settings, including bdt_root, signal_region_csv, output_dir, root_file_name, and required per-tree abcd_branches entries whose thresholds are read from the copied BDT selection.json.
  • plotting/config.json: data_mc.py settings, including bdt_root, output_root, data_samples, and per-tree event_reweight_branches (applied to MC events only; data weights stay 1.0).
  • plotting/branch.json: per-tree plot overrides such as skip_branches, bins, x_range, y_range, logx, and logy; derived model score branches use names like score_VVV and accept the same overrides.
  • combine/config.json: combine.C settings: channels (list of {name, root_file, bdt_root} — each ROOT file is the qcd_abcd_yields.root output of mode=5 for one channel, and each bdt_root is that channel's trained tree output directory), output_dir, optional combine_cmd / combine_cards_cmd, use_root_covariance (default false; combine uses binned Poisson statistics from the yields and does not turn ROOT covariance into nuisances), eigen_rel_cutoff (used only when use_root_covariance=true, dropping eigenmodes with λ_k ≤ cutoff × max(diag(cov)); default 1e-10), rescale_shape_modes_to_positive (used only when covariance nuisances are enabled; default true), and keep_work (keep the generated datacards under output_dir/work/; default true).
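As an illustration of the combine/config.json keys and of the eigen_rel_cutoff rule, the sketch below builds a config with the documented defaults and applies the cutoff to a list of eigenvalues. The paths and channel name are placeholders, and both helper functions are hypothetical; the real eigen-decomposition happens inside combine.C.

```python
import json


def default_combine_config(channels, output_dir):
    """Combine config using the defaults listed above."""
    return {
        "channels": channels,
        "output_dir": output_dir,
        "use_root_covariance": False,
        "eigen_rel_cutoff": 1e-10,
        "rescale_shape_modes_to_positive": True,
        "keep_work": True,
    }


def retained_eigenmodes(eigenvalues, covariance_diagonal, eigen_rel_cutoff=1e-10):
    """Indices of eigenmodes kept: those with lambda_k > cutoff * max(diag(cov))."""
    threshold = eigen_rel_cutoff * max(covariance_diagonal)
    return [k for k, lam in enumerate(eigenvalues) if lam > threshold]


config = default_combine_config(
    channels=[{
        "name": "fat2",                                      # placeholder channel
        "root_file": "path/to/fat2/qcd_abcd_yields.root",   # placeholder path
        "bdt_root": "path/to/fat2_model",                    # placeholder path
    }],
    output_dir="path/to/combine_output",                     # placeholder path
)
print(json.dumps(config, indent=2))
print(retained_eigenmodes([4.0, 1e-12, 0.0], [4.0, 1.0]))
```

The cutoff only matters when use_root_covariance is true. With the default false, no covariance nuisances are written and the eigenvalues are never inspected.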

Step-by-step file flow

  • mode=1 writes the pileup CSVs used by mode=0 for MC samples.
  • mode=0 writes the converted fat2 and fat3 ROOT trees used directly by mode=4, and as the input source for mode=6 or any mode=2 config that still points input_root at the unmixed dataset.
  • mode=6 can rewrite those converted MC trees into {signal_mixed|bkg_mixed} directories under the same dataset root with the same chunk layout using a deterministic block shuffle, which the checked-in model-training config uses by default.
  • mode=2 writes the model, copied configs, and saved test-set prediction references used by mode=3, mode=4, and mode=5.
  • mode=3 writes signal_region.csv, which defines the A-region score bins for mode=5.
  • mode=5 writes the ABCD validation ROOT file and PDFs for the chosen tree.
  • mode=4 writes one data/MC comparison PDF per branch.
  • mode=7 reads one or more mode=5 ROOT files (one per channel) plus the BDT and sample configs, generates per-scenario datacards, and calls combine to fill significance.csv, limits.csv, significance_abcd_mc.csv, and limits_abcd_mc.csv under its output_dir; it also reruns the same scenarios per individual channel and writes the matching *_by_channel*.csv files. Combined/per-class scenarios use the stored groups/*/srN/yield one-bin histograms, per-sample scenarios use samples/*/srN/yield, and ABCD-mode scenarios replace every QCD class with the merged qcd_predict/srN/yield. By default each SR is written as an independent counting bin and ROOT covariance fields are not encoded as nuisances; combine's Poisson likelihood supplies the counting statistics from the weighted yields. Setting use_root_covariance=true restores the optional eigen-decomposed covariance shape nuisances.
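The default counting-datacard layout, with one independent bin per SR and no covariance nuisances, can be sketched as below. This is not the text that combine.C writes. It assumes the per-SR yields have already been read from the srN/yield histograms, uses the background sum as the observation (an Asimov-style assumption), and follows the standard combine datacard convention of non-positive process ids for signal. counting_datacard is a hypothetical helper.

```python
def counting_datacard(channel, yields, signal_names):
    """Build a counting datacard string with bins named <channel>_sr<N>.

    yields maps a process name to its list of per-SR yields.
    """
    processes = [p for p in yields if p in signal_names]
    n_signal = len(processes)
    processes += [p for p in yields if p not in signal_names]
    n_sr = len(yields[processes[0]])
    bins = [f"{channel}_sr{i + 1}" for i in range(n_sr)]
    # Assumption: observation is the summed background yield in each bin.
    observed = [sum(yields[p][i] for p in processes[n_signal:]) for i in range(n_sr)]

    bin_row, name_row, id_row, rate_row = [], [], [], []
    for i, bin_name in enumerate(bins):
        for k, proc in enumerate(processes):
            bin_row.append(bin_name)
            name_row.append(proc)
            id_row.append(str(k - n_signal + 1))  # signal ids <= 0, background > 0
            rate_row.append(f"{yields[proc][i]:.6g}")

    lines = [
        f"imax {n_sr}",
        f"jmax {len(processes) - 1}",
        "kmax *",
        "-" * 20,
        "bin " + " ".join(bins),
        "observation " + " ".join(f"{x:.6g}" for x in observed),
        "-" * 20,
        "bin " + " ".join(bin_row),
        "process " + " ".join(name_row),
        "process " + " ".join(id_row),
        "rate " + " ".join(rate_row),
    ]
    return "\n".join(lines) + "\n"


card = counting_datacard(
    "fat2",
    {"VVV": [1.5, 0.5], "qcd": [10.0, 4.0], "ttbar": [2.0, 1.0]},
    signal_names={"VVV"},
)
print(card)
```

No nuisance lines follow the rate row, so the only uncertainty in the fit is the Poisson term that combine attaches to each bin.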

About

Analysis tools for CMS VVV analysis with Scouting data

