Skip to content

MiraldiLab/InferelatorJL

Repository files navigation

InferelatorJL

A Julia package for inference of transcriptional regulatory networks (GRNs) from gene expression data and prior chromatin accessibility or binding information, using the mLASSO-StARS algorithm.

Pipeline diagram


Overview

InferelatorJL infers genome-scale gene regulatory networks (GRNs) by combining gene expression data with a prior network (e.g., from ATAC-seq or ChIP-seq) to identify transcription factor (TF) → target gene regulatory relationships. The core method is multi-scale LASSO with Stability Approach to Regularization Selection (mLASSO-StARS), a penalized regression framework in which the prior network biases edge selection and bootstrap stability across subsamples determines a data-driven sparsity level. Two additional model-selection methods are available as alternatives to bStARS: EBIC (single-fit, fast) and bEBIC (bootstrap EBIC, produces selection-count confidence scores directly comparable to bStARS).

Key features:

  • Supports bulk RNA-seq, pseudobulk, and single-cell expression data
  • Estimates TF activity (TFA) via least squares as an alternative to raw TF mRNA
  • Builds separate networks for TFA and TF mRNA predictors, then combines them
  • Prior-weighted LASSO penalties reduce false positives for prior-supported edges
  • Three model-selection methods: bStARS (default), EBIC, bEBIC
  • Network-level or gene-level instability thresholding (bStARS)
  • Built-in PR / ROC evaluation against gold-standard interaction sets
  • Fully multithreaded subsampling loop

Installation

Requirements: Julia ≥ 1.7, Python (for PyPlot/matplotlib)

# From the Julia REPL:
] add https://github.com/miraldilab/InferelatorJL.jl

# Install all dependencies:
] instantiate

For development / local installation:

] dev /path/to/InferelatorJL
] instantiate

Quick start

using InferelatorJL

# Step 1 — load expression data
data = loadData(exprFile, targFile, regFile)

# Steps 2–3 — merge degenerate TFs, build prior, estimate TFA
priorData, mergedTFs = loadPrior(data, priorFile)
estimateTFA(priorData, data; outputDir = dirOut)

# Step 3b (optional) — apply time-lag correction (time-series data only)
# applyTimeLag(priorData, data, timeLagFile, timeLag)

# Step 4 — infer GRN (TFA predictors)
buildNetwork(data, priorData;
             tfaMode   = true,
             outputDir = joinpath(dirOut, "TFA"))

# Step 5 — aggregate TFA + mRNA networks
aggregateNetworks([tfaEdges, mrnaEdges];
                  method    = :max,
                  outputDir = joinpath(dirOut, "Combined"))

# Step 6 — refine TFA using the consensus network as a new prior
refineTFA(combinedNetFile, data, mergedTFs; outputDir = dirOut)

# Step 7 — evaluate against a gold standard
evaluateNetwork(networkFile, goldStandard;
                gsRegsFile = regFile, targGeneFile = targFile, saveDir = dirPR)

See examples/interactive_pipeline.jl for a fully annotated step-by-step example.


Pipeline

The full pipeline has seven steps. Each step is a single function call.

Step Function Description
1 loadData Load expression matrix; filter to target genes and regulators
2–3 loadPrior + estimateTFA Process prior, merge degenerate TFs, estimate TF activity via least squares
3b (optional) applyTimeLag Adjust TFA and TF mRNA for the delay between TF transcription and protein activity (time-series data only)
4 buildNetwork Infer GRN for one predictor mode (TFA or TF mRNA); model selection via bStARS, EBIC, or bEBIC
5 aggregateNetworks Combine TFA and mRNA edge lists into a consensus network
6 refineTFA Re-estimate TFA using the consensus network as a data-driven prior
7 evaluateNetwork Evaluate any network against gold standards (PR/ROC curves)

Steps 4–6 are typically run twice (TFA mode and TF mRNA mode) and the results aggregated in Step 5.


Input files

File Format Description
Expression matrix Tab-delimited TSV or Apache Arrow Genes × samples; genes in rows, samples in columns; first column is gene names
Target gene list Plain text, one gene per line Genes to model as regression targets
Regulator list Plain text, one TF per line Candidate transcription factors
Prior network Wide-matrix TSV (gene × TF, empty first header) Prior regulatory connections; rows = target genes, columns = TFs; values are edge weights (binary or continuous)
Prior penalties Same format as prior network Prior(s) used to set per-edge LASSO penalties; can be the same as prior network
TFA gene list Plain text, one gene per line Optional: restrict TFA estimation to a gene subset

Output files

All outputs are written under the directory specified by outputDir.

File Method Description
edges.tsv all Full ranked edge table: TF, gene, signed quantile, stability, partial correlation, inPrior flag
edges_subset.tsv all Top edges after applying the meanEdgesPerGene cap
grnOutMat.jld all Serialised BuildGrn struct (full network object)
instabOutMat.jld bStARS Serialised GrnData struct with full stability arrays across the λ grid
instability_diagnostic_network.png bStARS Two-panel plot: model size vs λ (left) and instability bounds vs λ (right)
instability_selection_network.png bStARS |Instability − target| vs λ with dot at the chosen λ
bebicOutMat.jld bEBIC Serialised GrnData struct — preserves lambdaSS (nGenes × totSS per-subsample lambda matrix) for post-hoc analysis
bebic_lambda_summary.tsv bEBIC Per-gene median and std of EBIC-optimal lambdas across subsamples
ebic_lambda_summary.tsv EBIC Per-gene chosen lambda and number of non-zero coefficients at that lambda
combined_<method>_sp.tsv all Aggregated network — long/edge list format (TF, gene, score; no zeros)
combined_<method>.tsv all Aggregated network — wide/matrix format (genes × TFs; used as prior input to refineTFA)

Key parameters

Parameter Default Description
modelSelection "bStARS" Lambda selection method: "bStARS" (stability-based, default), "EBIC" (single-fit, fast), "bEBIC" (bootstrap EBIC, produces bStARS-comparable confidence scores)
totSS 80 Total bootstrap subsamples (bStARS and bEBIC)
subsampleFrac 0.63 Fraction of samples drawn per subsample (bStARS and bEBIC)
targetInstability 0.05 Instability threshold for λ selection (bStARS only)
gamma 0.5 EBIC sparsity hyperparameter (EBIC and bEBIC only); 0 = standard BIC, 0.5 = recommended for GRN, 1 = maximum sparsity penalty
minLambda nothing Lower bound of the λ grid. For bStARS: defaults to 0.01 when nothing. For EBIC/bEBIC: nothing lets GLMNet choose the solution path automatically (recommended).
maxLambda nothing Upper bound of the λ grid. For bStARS: defaults to 0.5 when nothing. For EBIC/bEBIC: nothing = auto.
lambdaBias [0.5] Penalty reduction factor for prior-supported edges (0 = no prior, 1 = uniform)
meanEdgesPerGene 20 Maximum retained edges per target gene
instabilityLevel "Network" "Network": single λ for all genes; "Gene": per-gene λ (bStARS only)
zScoreTFA true Z-score expression before TFA estimation
zScoreLASSO true Z-score expression before LASSO regression. Always use true with EBIC/bEBIC unless you supply explicit minLambda/maxLambda; raw-scale responses inflate λ_max and cause EBIC to select null models
method :max Network aggregation rule: :max, :mean, or :min stability
timeLagFile "" Path to 4-column TSV (sampleQ, timeQ, sampleP, timeP) for time-lag correction; "" skips step 3b
timeLag 0.0 Lag between TF mRNA and protein activity, in the same time units as timeLagFile
leaveOutSampleList "" Path to a text file (one sample name per line) listing samples to exclude from subsampling; held-out samples form the test set for calcR2predFromStabilities

API reference

Data loading

data                    = loadData(exprFile, targFile, regFile; tfaGeneFile="", epsilon=0.01)
priorData, mergedTFs    = loadPrior(data, priorFile; minTargets=3)

Core pipeline

estimateTFA(priorData, data; edgeSS=0, zScoreTFA=true, outputDir=".")
applyTimeLag(priorData, data, timeLagFile, timeLag)          # optional step 3b
buildNetwork(data, priorData; tfaMode=true, totSS=80, lambdaBias=[0.5], ...)
aggregateNetworks(netFiles; method=:max, meanEdgesPerGene=20, outputDir=".")
refineTFA(combinedNetFile, data, mergedTFs; zScoreTFA=true, outputDir=".")

Internal — bEBIC two-stage (dev scripts only)

buildNetwork calls these internally when modelSelection = "bEBIC". Use them directly in dev scripts when you need to inspect grnData.lambdaSS between stages or swap in a custom aggregation function.

# Stage 1 — expensive: LASSO on every subsample per gene, EBIC selects optimal λ
# Fills: grnData.lambdaSS (nGenes × totSS), buildGrn.networkStability (selection counts)
constructSubsamples(data, grnData; totSS=80, subsampleFrac=0.63)
bebicEstimateLambdas!(grnData, buildGrn; gamma=0.5, zScoreLASSO=true)

# Stage 2 — cheap: aggregate per-subsample λ → one per gene, final full-data fit
# Fills: buildGrn.lambda, buildGrn.betas, grnData.ebicLambdas
# aggregateFn: any AbstractVector → Real function (default median)
bebicFinalFit!(grnData, buildGrn; aggregateFn=median, zScoreLASSO=true, outputDir=".")

Splitting the stages means bebicFinalFit! can be re-run with a different aggregateFn (e.g. mean, x -> quantile(x, 0.25)) without repeating the expensive subsample fits. gamma is applied in Stage 1 — changing it requires re-running bebicEstimateLambdas!.

Evaluation

evaluateNetwork(networkFile, goldStandard;
                gsRegsFile="", targGeneFile="", doPerTF=false, saveDir="")

# Out-of-sample R² prediction: sweep model sizes 1:maxModSize TFs/gene,
# fit OLS on training samples, evaluate on held-out test samples.
# Requires a leave-out set (passed via leaveOutSampleList in buildNetwork).
calcR2predFromStabilities(grnData, buildGrn, data, maxModSize, outputDir;
                          xaxisStepSize=5)

Data preparation (pseudobulk)

# Aggregate single-cell counts from an AnnData (.h5ad) object into pseudobulk
pbRaw = pseudoBulk(adata, "celltype_rep")   # returns features × samples DataFrame

# Normalize a features × samples count matrix
# methods: "none", "tpm", "log2tpm", "zscore", "log2tpm_zscore", "log2fc", "sizefactor"
countsNorm = normalizeMatrix(countsMat, "log2tpm")

# Build a one-hot biological covariate matrix (drop-first encoding)
bioDesign = buildDesignMatrix(meta, "celltype")

# Remove additive batch effects (pure Julia OLS, equivalent to limma)
corrected = removeBatchEffect(countsNorm, meta[!, "replicate"]; designMat=bioDesign)

# Sum replicate columns into per-celltype aggregate columns
# Column names must follow {celltype}_{replicate} convention
limaAgg = aggregateReplicates(limaDf, celltypeOrder, replicateOrder)

For DESeq2 VST normalization and exact limma batch correction, use R directly. See examples/prepareData/pseudobulkWorkflow.jl for a complete workflow using RCall.jl.

Utilities

matrixToEdgeList(df)                         # wide prior matrix → edge list (long format, no zeros)
edgeListToMatrix(df; indices=(2,1,3))        # edge list → wide matrix
frobeniusNormalize(M, :column)               # normalize matrix columns or rows
binarizeNumeric!(df)                         # continuous prior → binary
mergeDFs(dfs, :Gene, "sum")                  # merge prior DataFrames
completeDF(df, :Gene, allGenes, allTFs)      # align to full gene/TF universe
writeTSVWithEmptyFirstHeader(df, path)       # write sparse prior format
check_column_norms(M)                        # verify unit column norms
writeNetworkTable!(buildGrn; outputDir=".")  # write edges.tsv from BuildGrn struct
saveData(data, tfaData, grnData, buildGrn, dir, "checkpoint.jld2")

Examples

File Description
interactive_pipeline.jl Full 7-step pipeline, step-by-step in the REPL, public API only
interactive_pipeline_dev.jl Same pipeline using module-qualified internal calls (for development/debugging)
run_pipeline.jl Full pipeline wrapped in runInferelator() for batch/cluster use, public API
run_pipeline_dev.jl Same wrapped function using internal calls
dataUtils.jl Data utility and partial-correlation functions demonstrated with synthetic data; no input files needed
networkUtils.jl Network I/O and aggregation utilities demonstrated with synthetic data; no input files needed
plotPR.jl Standalone script to load saved PR results and generate PR curve plots
prepareData/pseudobulkWorkflow.jl End-to-end pseudobulk workflow: h5ad → normalize → batch-correct → save TSV/Arrow for InferelatorJL
prepareData/h5adToArrow.jl Convert an h5ad expression matrix directly to the Arrow format expected by InferelatorJL
r2PredWorkflow.jl Leave-out R² prediction workflow: build GRN with held-out samples, compute R²_pred vs model size to guide TF-per-gene selection

Data structures

Struct Populated by Key fields
GeneExpressionData loadData geneExpressionMat, geneNames, cellLabels, targGenes, potRegs
PriorTFAData loadPrior, estimateTFA priorMatrix, medTfas, pRegs, pTargs
mergedTFsResult loadPrior mergedPrior, mergedTFMap
GrnData buildNetwork internals predictorMat, penaltyMat, allPredictors, subsamps, trainInds, stabilityMat, targGenes; bEBIC adds lambdaSS (nGenes × totSS), ebicLambdas (per-gene median λ)
BuildGrn buildNetwork networkMat, networkMatSubset

Citation

If you use InferelatorJL in your work, please cite:

Miraldi ER, Pokrovskii M, Watters A, et al. Leveraging chromatin accessibility for transcriptional regulatory network inference in T Helper 17 Cells. PLOS Computational Biology, 2019. https://doi.org/10.1371/journal.pcbi.1006979


License

See LICENSE for details.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors