CMS scouting VVV analysis workflow for pileup reweighting, branch conversion, optional sample-entry mixing, BDT/NN training, signal-region optimization, QCD ABCD estimation, data/MC plotting, and CMS combine significance/limits.
- `mode=1`: build pileup-weight CSV files.
- `mode=0`: convert the selected NanoAOD-style inputs into `fat2` and `fat3` ROOT trees.
- `mode=6`: shuffle each selected MC sample across its ROOT chunks while preserving chunk sizes and filenames, using a deterministic block shuffle that keeps ROOT reads mostly sequential.
- `mode=2`: train the BDT or PyTorch NN and save the model plus copied configs. The checked-in config reads from `dataset/{signal_mixed|bkg_mixed}` by default and uses `model_type: "bdt"` unless changed.
- `mode=3`: scan the test split and write `signal_region.csv`.
- `mode=5`: run the QCD ABCD validation on the MC test split.
- `mode=4`: make data/MC comparison plots.
- `mode=7`: run the CMS combine wrapper to compute expected significance and AsymptoticLimits per signal class / signal sample / combined, both with the MC-true QCD class yields and with the merged ABCD QCD prediction.

selections/b_veto/ and systematics/ are separate studies and are not launched through run.sh.
Usage:
./run.sh <mode> [config.json] [sample1 sample2 ...]

Examples:
./run.sh 1
./run.sh 0
./run.sh 0 selections/convert/config.json www qcd_ht2000
./run.sh 6
./run.sh 6 selections/mix/config.json www
./run.sh 2
./run.sh 3 selections/signal_region/config.json
./run.sh 5 background_estimation/config.json
./run.sh 4 plotting/config.json
./run.sh 7 combine/config.json

Modes:
`mode=0`: compile and run `selections/convert/convert_branch.C`. Input: `selections/convert/config.json`, `selections/convert/branch.json`, `selections/convert/selection.json`, `src/sample.json`, and the pileup CSVs from `mode=1`. Output: converted ROOT files under `{output_root}/{signal|bkg|data}/`, plus `selections/convert/log.txt`. The final writer streams each tree through a `TChain` clone owned by the chain itself, so branch addresses stay correct when ROOT advances across temp-file boundaries, and it aborts on any chunk/tree entry-count mismatch instead of silently writing too few or repeated entries. The convert expression engine includes collection helpers such as `sum`, `max_value`, `min_value`, `nth_max_value(collection, expr, rank, default)`, `value_at_max(collection, key_expr, value_expr, default)`, and `value_at_nth_max(collection, key_expr, value_expr, rank, default)`; the checked-in output branch config writes per-AK8 `WvsQCD` and `VvsQCD` scores, event-level `ScoutingFatPFJetRecluster_{WvsQCD,VvsQCD}_{sum,max}` summaries, and AK4 b-tag summaries including `ScoutingPFJetRecluster2_scoutUParT_probb_max_pt`, `ScoutingPFJetRecluster2_scoutUParT_probb_sum`, `ScoutingPFJetRecluster2_scoutUParT_probb_second_max`, and `ScoutingPFJetRecluster2_scoutUParT_probb_second_max_pt` for both `fat2` and `fat3`.

`mode=1`: compile and run `selections/weight/weight.C`. Input: `selections/weight/config.json` and `src/sample.json`. Output: pileup CSV files under the configured `output_root`, plus `selections/weight/log.txt`.

`mode=6`: compile and run `selections/mix/mix.C`. Input: `selections/mix/config.json`, `src/sample.json`, and the converted MC ROOT files from `mode=0`. Output: shuffled ROOT files under the configured `output_root/{signal_mixed|bkg_mixed}/`, preserving the original chunk filenames and per-chunk entry counts while applying a deterministic per-tree block shuffle for each selected MC sample, plus `selections/mix/log.txt`. Missing input files for a requested sample are fatal.
The implementation scans input chunk entry counts first (warming ROOT serially, then using OpenMP across the remaining files when available) and then rewrites each tree through a `TChain` using shuffled contiguous entry blocks plus an in-block cyclic rotation. Each non-empty output chunk is cloned from the `TChain` itself so ROOT keeps branch addresses synced across input-file boundaries, and the writer validates both per-chunk and total entry counts before closing. The block size is chosen as `clamp(total_entries / 512, min_block_entries, max_block_entries)`; the checked-in defaults are `32` and `4096`. This keeps the output format unchanged while avoiding the old fully random entry-by-entry ROOT read pattern.

`mode=2`: run `selections/BDT/train.py`. Input: `selections/BDT/config.json`, the ROOT files resolved from its `input_root`/`input_pattern` (the checked-in config points to the mixed `mode=6` output under `../../dataset/{sample_group}_mixed`), and `src/sample.json`. Output: one trained model directory per tree under `output_root`. With `model_type: "bdt"` it contains `{tree}_model.json` and `{tree}_model_stage1.json`; with `model_type: "nn"` it contains PyTorch checkpoints `{tree}_model.pt` and `{tree}_model_stage1.pt` and requires PyTorch at training/inference time. The rest of the output layout is shared: `feature_corr.pdf`, loss plots (`loss_mlogloss.pdf`/`loss_classification.pdf`/`loss_total.pdf` for BDT or `loss_weighted_ce.pdf`/`loss_objective.pdf` for NN, plus `loss_decorrelation.pdf`; NN x-axes are epochs, BDT x-axes are boosting rounds), and stage-tagged comparison plots such as `importance_cls.pdf`/`importance_decorr.pdf`, `roc_*_cls.pdf`/`roc_*_decorr.pdf`, `score_*_cls.pdf`/`score_*_decorr.pdf`, `decor_corr_{train,test}_cls.pdf`/`decor_corr_{train,test}_decorr.pdf`, `decor_score_vs_branch_cls.pdf`/`decor_score_vs_branch_decorr.pdf`, and `decor_branch_shapes_by_signal_score_cls.pdf`/`decor_branch_shapes_by_signal_score_decorr.pdf`.
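For the `mode=6` block shuffle described above, the following is a minimal Python sketch of the block-size rule and of one way to build a deterministic block-shuffled read order. It is not the code in `selections/mix/mix.C`: the integer division, the use of Python's `random.Random`, and the way the in-block rotation offset is drawn are all assumptions made for illustration, and the function names are hypothetical.

```python
import random


def block_size_for(total_entries, min_block_entries=32, max_block_entries=4096):
    """clamp(total_entries / 512, min_block_entries, max_block_entries)."""
    # Assumption: the division in the README formula is an integer division.
    return max(min_block_entries, min(max_block_entries, total_entries // 512))


def block_shuffle_order(total_entries, random_state=0,
                        min_block_entries=32, max_block_entries=4096):
    """Return a deterministic entry order built from shuffled contiguous blocks."""
    size = block_size_for(total_entries, min_block_entries, max_block_entries)
    rng = random.Random(random_state)  # seeded, so the order is reproducible
    starts = list(range(0, total_entries, size))
    rng.shuffle(starts)  # shuffle whole blocks, not single entries
    order = []
    for start in starts:
        length = min(size, total_entries - start)  # the last block may be shorter
        shift = rng.randrange(length)  # hypothetical choice of rotation offset
        # Cyclic rotation inside the block keeps reads within one contiguous range.
        order.extend(start + (i + shift) % length for i in range(length))
    return order
```

Every entry index appears exactly once in the returned order, and entries from one block stay adjacent, which is what keeps reads mostly sequential compared with a fully random per-entry permutation.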
BDT feature importance uses XGBoost gain; NN feature importance uses deterministic permutation importance while preserving the same filenames. The output also includes a `branches/` subdirectory with one normalized per-class input-distribution PDF per training branch, copied `config.json`/`branch.json`/`selection.json`, `test_ranges.json`, and the saved test-set prediction references `test_reference_signal_region.npz`, `test_reference_qcd_est.npz`, and `test_reference_qcd_est_full.npz`, plus `selections/BDT/log.txt`. Both model types keep the same sample splitting, clipping/log preprocessing, event-weight construction, dynamic learning-rate reduction, prediction-reference validation, and downstream score interface. When a per-tree `decorrelate` branch also appears in `selection.json` thresholds, only that branch's threshold is omitted from the training/early-stopping objective splits and their class balancing; ordinary branch, ROC, score, feature-importance, feature-correlation, and downstream reference outputs still use the full threshold set, while decorrelation diagnostics use the training-threshold view without that decorrelate-branch cut. BDT logs sum-scale `classification_loss` and stage 1 monitors test `classification_loss`; NN uses class-balanced mini-batches with sampling-corrected event weights for `backward()` and logs epoch-end eval-mode `train_objective_loss`/`test_objective_loss`, where stage 1 uses `objective_loss = weighted_CE` and stage 2 uses `objective_loss = weighted_CE + scaled_smooth_CvM_decorrelation`. NN early stopping monitors epoch-end test `objective_loss` in both stages, and AdamW `weight_decay` remains native decoupled optimizer behavior rather than a logged loss term.
In NN mode, per-tree `fat2_nn`/`fat3_nn` blocks configure the PyTorch MLP (`hidden_layers`, `dropout`, `batch_norm`, `batch_size`, learning rates, weight decay, epochs, and permutation-importance event count); `batch_size` is the total size of each class-balanced mini-batch, the checked-in NN blocks set `batch_norm: false`, and smooth-CvM decorrelation is supported in the second NN stage with decorrelation scale calibrated against the stage-1 batch-averaged weighted CE.

`mode=3`: run `selections/signal_region/signal_region.py`. Input: `selections/signal_region/config.json` and one trained model directory from `mode=2`. Output: `sr_score_*.pdf`, `scores_no_regions.pdf`, `scores.pdf`, `scores_no_regions_radial_equalized.pdf`, `scores_radial_equalized.pdf`, and `signal_region.csv` inside the configured `output_dir`, plus `selections/signal_region/log.txt`. The 2D score PDFs use the shared class palette, hide coordinate axes/ticks/frames, and include both the original regular-polygon simplex projection and a radial-equalized projection that keeps polygon vertices fixed while spreading points outward. The script builds one shared candidate pool of general high-dimensional rectangles over the configured BDT `score_axes` (`all` in the checked-in config), where every score axis has an independent `[low, high)` interval and can therefore represent simultaneous lower and upper cuts. Candidate generation combines dense per-axis boundary sets from total/signal/background weighted quantiles plus tail-heavy quantiles, explicit low/mid/high-tail single-axis seeds from `seed_quantiles`, exact single-axis bounded interval seeds, optional bounded multi-axis seed combinations, deterministic beam coordinate search where every per-axis update scans exact `[edges[a], edges[b])` intervals, compatibility expansion that optimizes regions constrained to be non-overlapping with high-Z anchor candidates, and local event-threshold refinement.
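The half-open `[low, high)` rectangles described above can be illustrated with a short Python sketch of event membership and of a geometric overlap test. This is an illustration only, not the code in `signal_region.py`; the dictionary layout and the function names are assumptions, and both boxes are assumed to define the same score axes.

```python
def in_box(scores, box):
    """True if every score lies inside its half-open [low, high) interval."""
    return all(low <= scores[axis] < high for axis, (low, high) in box.items())


def boxes_overlap(box_a, box_b):
    """True if two boxes share a region of non-zero extent on every axis."""
    for axis, (low_a, high_a) in box_a.items():
        low_b, high_b = box_b[axis]
        # Half-open intervals intersect only if max(low) < min(high);
        # boxes that merely touch at an edge do not overlap.
        if max(low_a, low_b) >= min(high_a, high_b):
            return False
    return True
```

With this convention two regions that share a boundary, such as `[0.6, 0.8)` and `[0.8, 1.0)` on one axis, are non-overlapping, and an event with a score of exactly `0.8` belongs only to the second.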
Independent edge building, beam updates, compatibility updates, local-refinement workers, and event-mask canonicalization can use `max_threads`, but each stage merges results in deterministic scan order so the candidate set is unchanged by threading. No candidate-pool or global-candidate count limit is applied; candidates are removed only by rounded geometry duplicates and by final exact event-mask/canonical-box deduplication. Before global selection, every retained candidate is shrunk to an event-preserving minimal score box; candidates are deduped only when they select the same exact event mask and shrink to the same canonical box. The final selection is global rather than sequential: it searches for K mutually non-overlapping candidates and maximizes the existing combined objective `sqrt(sum Z_i^2)`, with Python and OpenMP compatibility checks using the same exact half-open `[low, high)` overlap semantics as event-mask membership. The OpenMP helper first finds a beam incumbent and then runs branch-and-bound; if the search completes, the result is exact for the finite candidate list, while node caps produce a conservative upper-bound certificate.

After the K non-overlapping SRs are selected, an empty-bin expansion step runs in order SR1, SR2, …: each SR has its signal-class score-axis upper bounds pushed toward `1.0` and its background-class score-axis lower bounds pushed toward `0.0`, but only into space that is empty in MC under the SR's other-axis cuts and only as far as the expanded box remains geometrically non-overlapping with every other selected SR. The expansion is constrained to keep each SR's exact selected event mask, so per-bin `S`/`B`/`Z` are numerically unchanged; only the empty-side score-axis bounds in `signal_region.csv` are widened. Before scanning, the script reloads the saved model, reproduces the saved full-threshold signal-region test prediction, and aborts if it does not match `test_reference_signal_region.npz` within the stored tolerances.

`mode=4`: run `plotting/data_mc.py`.
Input: `plotting/config.json`, `plotting/branch.json`, the converted ROOT files from `mode=0`, and one trained model directory from `mode=2`. Output: one PDF per plotted branch under the configured `output_root`, plus `plotting/log.txt`. Data and MC both use the configured trained-model input pattern for ordinary ROOT branches; missing sample files, empty MC trees/classes before filtering, and non-positive MC `raw_entries` are treated as fatal errors. In addition to ROOT branches, one derived model score branch per class is plotted as `score_{class_name}`. For those score branches, MC is read from `bdt_root/test_ranges.json`, validated against `test_reference_signal_region.npz`, and normalized to the same full-sample target total while data is predicted from the full configured `data_samples` input. The script loads `.json`/`.pkl` BDT models or `.pt` NN checkpoints according to the copied `model_type`.

`mode=5`: run `background_estimation/qcd_est.py`. Input: `background_estimation/config.json`, one trained model directory from `mode=2`, and the `signal_region.csv` written by `mode=3`. Output: ABCD summary PDFs and one ROOT file under the configured `output_dir`, plus `background_estimation/log.txt`. The validation PDFs keep the finite-MC / ABCD-propagated uncertainties based on `sum(w^2)`. The `qcd_abcd_region_counts.pdf` and `qcd_abcd_region_fractions.pdf` summaries are stacked by the BDT `class_groups` classes with a shared fixed class palette; additional colors are generated deterministically when there are more classes than base colors, and `qcd_abcd_region_counts_linear.pdf` saves the same counts with a linear y axis. The QCD-merged summaries (`qcd_abcd_region_counts_qcd_merged.pdf`, `qcd_abcd_region_counts_qcd_merged_linear.pdf`, and `qcd_abcd_region_fractions_qcd_merged.pdf`) combine every `class_groups` class whose name contains `qcd` case-insensitively into one `QCD` stack color and legend entry, leaving non-QCD classes unchanged.
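The QCD-merging rule above (any class whose name contains `qcd`, case-insensitively, is folded into a single `QCD` entry) can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in `qcd_est.py`: the function name and the flat name-to-yield dictionary are hypothetical, and the class names in the usage note are made up.

```python
def merge_qcd_classes(yields_by_class, merged_name="QCD"):
    """Fold every class whose name contains 'qcd' (any case) into one entry."""
    merged = {}
    qcd_total = 0.0
    found_qcd = False
    for name, value in yields_by_class.items():
        if "qcd" in name.lower():  # case-insensitive substring match
            qcd_total += value
            found_qcd = True
        else:
            merged[name] = value  # non-QCD classes are left unchanged
    if found_qcd:
        merged[merged_name] = qcd_total
    return merged
```

For example, `merge_qcd_classes({"QCD_low": 5.0, "qcd_high": 3.0, "VVV": 1.5})` keeps `VVV` as is and sums the two QCD classes into one `QCD` entry.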
The ROOT file stores combine-facing bundles for every saved category (`samples/*`, `groups/*`, `qcd_predict`, `qcd_true`, `total_predict`, and `total_true`) as `srN/yield`, `srN/stat_error`, and `srN/scale_error` one-bin histograms plus a category-level `covariance_total` TH2. The combine-facing convention is `stat_error = sqrt(yield)`, `scale_error = 0`, and diagonal `covariance_total = diag(yield)`, so the weighted yield is treated as the Poisson event count. QCD classes are identified by `class_groups` names containing `qcd` case-insensitively; all samples from those classes are summed for the ABCD `A`/`B`/`C`/`D` totals, `qcd_true`, `qcd_predict`, and `total_predict`, while the per-class `groups/*` outputs remain MC-true. The signal-region score axes are detected from the CSV `{axis}_low`/`{axis}_high` columns, so the ABCD step can consume either the legacy independent axes or the new all-score-axis regions. The non-score ABCD dimension is configured by required per-tree `abcd_branches`; every listed branch must have a threshold in the copied BDT `selection.json`, A/B require all listed branches to pass, C/D require all listed branches to fail, and partial pass/fail events are excluded. Before building the ABCD regions, the script reloads the saved model, validates against `test_reference_qcd_est_full.npz` when present (falling back to the legacy filtered `test_reference_qcd_est.npz`), and aborts on mismatch.

`mode=7`: compile and run `combine/combine.C`. Input: `combine/config.json` (lists one or more channels, each pointing to a `qcd_abcd_yields.root` from `mode=5`) plus one trained `bdt_root` directory from `mode=2`. `combine.C` reads `class_groups` from `bdt_root/config.json`, then resolves that copied config's `sample_config` the same way as `qcd_est.py` so the signal/background-class split still comes from `sample.json`. Group names from `class_groups` are matched against `groups/*` in the ROOT file by the same slugified lowercase convention used by `qcd_est.py`, and QCD classes are the class names containing `qcd` case-insensitively.
This mode must be executed inside a CMSSW area that has `HiggsAnalysis/CombinedLimit` built, so that `combine` and `combineCards.py` are on `$PATH`. Output: `significance.csv`, `limits.csv`, `significance_abcd_mc.csv`, `limits_abcd_mc.csv`, `significance_by_channel.csv`, `limits_by_channel.csv`, `significance_by_channel_abcd_mc.csv`, and `limits_by_channel_abcd_mc.csv` under the configured `output_dir`, plus `combine/log.txt`. Each CSV has one row per signal scenario (combined / per-signal-class / per-signal-sample); the per-channel CSVs add a `channel` column and rerun the same scenarios using one channel at a time. The wrapper reads every SR from the new `srN/yield` one-bin histograms and, by default (`use_root_covariance=false`), writes pure counting datacards with one independent bin per SR (`<channel>_sr<N>`). Combine's native Poisson likelihood then provides the statistical uncertainty; this matches the signal-region Asimov significance convention, avoids double-counting the same counting fluctuation as a Gaussian nuisance, and avoids the old multi-bin shape-PDF factorization warning. The `_abcd_mc.csv` files replace all QCD classes with the single merged `qcd_predict` block while keeping the non-QCD processes on their MC-true group/sample blocks. The wrapper validates every required signal sample/group and aborts on missing inputs or invalid shapes instead of regularizing the yields. If `use_root_covariance=true`, the full per-process covariance between signal regions is additionally injected via an eigen-decomposition of each process's `covariance_total`, with one Gaussian shape nuisance per retained eigenmode, using one-bin shape templates for each SR.
In that optional mode, when `rescale_shape_modes_to_positive` is `true` (default), any background process whose nominal yields and covariance are identically zero across all SRs is dropped before writing the datacard, and any eigenmode whose raw `±1σ` templates would make a varying bin negative or a total template norm non-positive is shrunk to the largest safe step below that boundary; the datacard `shape` coefficient is rescaled by `1/a`, zero-valued bins are allowed as long as the varied templates stay non-negative and keep strictly positive total norms, and every drop/rescale is written as a warning to `combine/log.txt`. If a per-sample signal process is identically zero in some channels, those channels are skipped with a warning; if it is identically zero in every channel, the wrapper records `0` significance and `inf` expected limits for that row and continues. If `AsymptoticLimits` finishes successfully but its ROOT output still lacks the expected quantiles, the wrapper logs a warning, records `0` for that row's significance and `inf` for its expected limits in the CSVs, and continues. A temporary work directory is kept under `output_dir/work/` with the generated datacards and combine outputs; optional covariance-nuisance shape ROOT files are written only when `use_root_covariance=true`.

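The positivity rescaling described for the optional covariance mode can be illustrated with a small Python sketch. It is a guess at the idea, not the logic in `combine/combine.C`: it assumes `a` is the shrink factor applied to a symmetric `±1σ` shift, that the nominal yields are non-negative, and that "below that boundary" is implemented with a fixed safety margin. The function name and the `margin` parameter are hypothetical.

```python
def largest_safe_step(nominal, shift, margin=0.999):
    """Return a in [0, 1] such that nominal +/- a*shift stays usable.

    Bins must stay non-negative and the varied totals strictly positive.
    A return value of 0.0 means no positive step is safe for this mode.
    """
    bounds = []
    for n_i, d_i in zip(nominal, shift):
        if d_i != 0.0:
            # A varying bin stays >= 0 on both sides while a <= n_i / |d_i|.
            bounds.append(n_i / abs(d_i))
    total_shift = abs(sum(shift))
    if total_shift > 0.0:
        # The varied norm stays > 0 on both sides while a < sum(n) / |sum(d)|.
        bounds.append(sum(nominal) / total_shift)
    boundary = min(bounds) if bounds else float("inf")
    if boundary > 1.0:
        return 1.0  # the raw +/-1 sigma templates are already safe
    return margin * boundary  # hypothetical margin: step just inside the boundary
```

Under these assumptions, a mode shrunk by `a < 1` would be written with its datacard `shape` coefficient multiplied by `1/a`, which is the rescaling the text above mentions.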
Sample arguments:
- Extra sample names are supported only for `mode=0`, `mode=1`, and `mode=6`.
- If no sample names are given, `run.sh` uses `submit_samples` from the chosen config.
- If `submit_samples` is empty or missing, all MC samples in `src/sample.json` are used.
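
The fallback order in the list above can be written out as a short Python sketch. It only illustrates the precedence (command-line names, then `submit_samples`, then every MC sample); the function name and the assumption that `src/sample.json` loads as a list of dictionaries with `name` and `is_MC` keys are hypothetical, and the per-mode restriction on extra sample names is not modelled.

```python
def resolve_samples(cli_samples, config, registry):
    """Pick the samples to process using the documented fallback order."""
    if cli_samples:
        return list(cli_samples)  # 1. explicit names on the command line
    submit = config.get("submit_samples") or []
    if submit:
        return list(submit)  # 2. submit_samples from the chosen config
    # 3. empty or missing submit_samples: every MC sample in the registry
    return [entry["name"] for entry in registry if entry.get("is_MC")]
```

Calling it with `cli_samples=["www"]` returns `["www"]` regardless of the config, mirroring `./run.sh 6 selections/mix/config.json www`.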
Log archiving:
- After the final `[finished …]` line is written, `run.sh` copies the per-mode `log.txt` into the program's configured output directory.
- Applies to `mode=1` (`output_root`), `mode=2`/`mode=4` (per-tree `output_root` expanded over `submit_trees`), and `mode=3`/`mode=5`/`mode=7` (`output_dir`). `mode=0` (convert_branch) and `mode=6` (mix) are intentionally skipped.
- The copy runs on both success and failure; if the resolved output dir does not yet exist (e.g., the run aborted before creating it), the copy is skipped with a warning instead of erroring.

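
The skip-with-warning behaviour above is simple to express; the sketch below shows the idea in Python for consistency with the other examples, although `run.sh` itself is a shell script. The function name, the warning text, and the boolean return value are hypothetical.

```python
import shutil
import sys
from pathlib import Path


def archive_log(log_path, output_dir):
    """Copy a per-mode log.txt into output_dir; warn and skip if it is missing."""
    target = Path(output_dir)
    if not target.is_dir():
        # The run may have aborted before creating its output directory.
        print(f"warning: {target} does not exist, log not archived", file=sys.stderr)
        return False
    shutil.copy2(log_path, target / Path(log_path).name)
    return True
```

The caller would invoke this after the final log line is written, on both the success and the failure path.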

`src/sample.json`: master sample registry with `name`, `path`, `sample_ID`, `is_MC`, `is_signal`, `xsection`, `lumi`, and `raw_entries`.

`selections/weight/config.json`: pileup histogram inputs and pileup-weight output paths.

`selections/convert/config.json`: convert-step paths, threading, file-size splitting, and pileup CSV pattern.

`selections/mix/config.json`: mix-step tree selection, input/output ROOT roots and patterns, sample config, threading for the input-chunk scan, deterministic `random_state`, and the block-size bounds `min_block_entries`/`max_block_entries` (default `32`/`4096`).

`selections/convert/selection.json`: event selection and tree split (`fat2`/`fat3`).

`selections/convert/branch.json`: input branches to read and output branches to write.

`selections/BDT/config.json`: model inputs, `input_root`/`input_pattern`, class groups, training settings, output directories, top-level `model_type` (`bdt` or `nn`, default `bdt`), top-level `decor_loss_mode`, `decor_lambda`, `decor_n_bins`, `decor_n_thresholds`, `decor_score_tau`, `decor_bin_tau_scale`, per-tree BDT hyperparameters (`n_estimators`, `n_estimators_decorr`, `max_depth`, `learning_rate`, `learning_rate_decorr`, optional `min_learning_rate`/`lr_reduce_patience`, `gamma`, `reg_lambda`, `reg_alpha`, `min_child_weight`, `subsample`, `colsample_bytree`, optional `colsample_bynode`, `early_stopping_rounds`), per-tree NN hyperparameters in `fat2_nn`/`fat3_nn` (`epochs`, `epochs_decorr`, `hidden_layers`, `activation`, `dropout`, `batch_norm`, `batch_size`, `learning_rate`, `learning_rate_decorr`, optional `min_learning_rate`/`lr_reduce_patience`, `weight_decay`, optional `grad_clip_norm` defaulting off, `early_stopping_rounds`, `permutation_importance_events`; checked-in NN values use `learning_rate: 0.001`, `learning_rate_decorr: 0.0005`, and `min_learning_rate: 0.00002`), `decorrelate`, `decor_efficiencies`, and `event_reweight_branches`.

`selections/signal_region/config.json`: signal-region scan settings: `lumi`, `n_signal_regions`/`N`, `bdt_root`, `output_dir`, `score_axes`, `min_bkg_weight`, `min_signal_weight`, optional entry minima, `max_edge_candidates_per_axis`, `beam_width`, `top_intervals_per_axis`, `coordinate_rounds`, `seed_intervals_per_axis`, `multi_axis_seed_max_axes`, `multi_axis_seed_max_seeds`, `compatibility_seed_anchors`, `compatibility_seed_rounds`, `local_refine_rounds`, `local_refine_neighbor_edges`, `local_refine_top_candidates`, `local_refine_diverse_masks`, `local_refine_candidate_overscan`, `global_beam_width`, `branch_bound_max_nodes`, `branch_bound_time_limit_seconds`, `deduplicate_event_masks`, `require_exact_n_regions`, `max_threads`, `progress_every_seconds`, and `seed_quantiles`.

`background_estimation/config.json`: `qcd_est.py` settings, including `bdt_root`, `signal_region_csv`, `output_dir`, `root_file_name`, and required per-tree `abcd_branches` entries whose thresholds are read from the copied BDT `selection.json`.

`plotting/config.json`: `data_mc.py` settings, including `bdt_root`, `output_root`, `data_samples`, and per-tree `event_reweight_branches` (applied to MC events only; data weights stay 1.0).

`plotting/branch.json`: per-tree plot overrides such as `skip_branches`, `bins`, `x_range`, `y_range`, `logx`, and `logy`; derived model score branches use names like `score_VVV` and accept the same overrides.

`combine/config.json`: `combine.C` settings: `channels` (list of `{name, root_file, bdt_root}`; each ROOT file is the `qcd_abcd_yields.root` output of `mode=5` for one channel, and each `bdt_root` is that channel's trained tree output directory), `output_dir`, optional `combine_cmd`/`combine_cards_cmd`, `use_root_covariance` (default `false`; combine uses binned Poisson statistics from the yields and does not turn ROOT covariance into nuisances), `eigen_rel_cutoff` (used only when `use_root_covariance=true`, dropping eigenmodes with `λ_k ≤ cutoff × max(diag(cov))`; default `1e-10`), `rescale_shape_modes_to_positive` (used only when covariance nuisances are enabled; default `true`), and `keep_work` (keep the generated datacards under `output_dir/work/`; default `true`).

- `mode=1` writes the pileup CSVs used by `mode=0` for MC samples.
- `mode=0` writes the converted `fat2` and `fat3` ROOT trees used directly by `mode=4`, and as the input source for `mode=6` or any `mode=2` config that still points `input_root` at the unmixed dataset.
- `mode=6` can rewrite those converted MC trees into `{signal_mixed|bkg_mixed}` directories under the same dataset root with the same chunk layout using a deterministic block shuffle, which the checked-in model-training config uses by default.
- `mode=2` writes the model, copied configs, and saved test-set prediction references used by `mode=3`, `mode=4`, and `mode=5`.
- `mode=3` writes `signal_region.csv`, which defines the A-region score bins for `mode=5`.
- `mode=5` writes the ABCD validation ROOT file and PDFs for the chosen tree.
- `mode=4` writes one data/MC comparison PDF per branch.
- `mode=7` reads one or more `mode=5` ROOT files (one per channel) plus the BDT and sample configs, generates per-scenario datacards, and calls `combine` to fill `significance.csv`, `limits.csv`, `significance_abcd_mc.csv`, and `limits_abcd_mc.csv` under its `output_dir`; it also reruns the same scenarios per individual channel and writes the matching `*_by_channel*.csv` files. Combined/per-class scenarios use the stored `groups/*/srN/yield` one-bin histograms, per-sample scenarios use `samples/*/srN/yield`, and ABCD-mode scenarios replace every QCD class with the merged `qcd_predict/srN/yield`. By default each SR is written as an independent counting bin and ROOT covariance fields are not encoded as nuisances; combine's Poisson likelihood supplies the counting statistics from the weighted yields. Setting `use_root_covariance=true` restores the optional eigen-decomposed covariance shape nuisances.