Collection of reproducible data analysis scripts in R
This repository provides a curated collection of R scripts designed to support reproducible data analysis.
The scripts have been developed and used in several of my publications and cover a range of analytical approaches including survival analysis, Bayesian model averaging, and machine learning methods such as Extreme Gradient Boosting (XGBoost).
Overview
The primary goal of this repository is to promote transparency, reproducibility, and reusability of statistical analyses.
Each script is fully documented and includes, where applicable:
• Example datasets for demonstration
• Usage instructions
• References to associated publications
Researchers and users are encouraged to adapt these scripts for their own data and workflows.
Contents
• Survival XGBoost Cox analysis.R
Implementation of Extreme Gradient Boosting (XGBoost) adapted for Cox proportional hazards modeling, suitable for time-to-event data.
• XGBoost_binary classification_nested CV.R
Implements XGBoost for binary classification tasks with nested cross-validation, allowing for robust model evaluation and hyperparameter tuning.
• XGBoost_binary classification_k-fold nested CV.R
Implements XGBoost for binary classification tasks with 50x repeated 5-fold nested cross-validation, allowing for robust model evaluation and hyperparameter tuning.
• XGBoost_linear regression_nested CV.R
Implements XGBoost for linear regression task with nested cross-validation, allowing for robust model evaluation and hyperparameter tuning.
• XGBoost_ordinal logistic classification.R
Implements XGBoost for logistic ordinal classification tasks with nested cross-validation, allowing for robust model evaluation and hyperparameter tuning.
• Multi-class XGBoost.R
Implements a multi-class prediction using XGBoost. Three reproducible and generalizable examples are provided:
-Example 1: Predicting epidemic cluster based on epidemic curves
-Example 2: Predicting chemical exposure cluster based on age, BMI, physical activity
-Example 3: Predicting cardiovascular risk cluster based on age, BMI, physical activity, BP, cholesterol
• Continuous_Multioutput_XGBoost.R
Implements a generalizable R and Python workflow for continuous multi-output regression using XGBoost.Three reproducible and generalizable examples are provided:
-Example 1: Predict epidemic parameters (beta, mu, S0) from simulated epidemic curves (SIR model).
-Example 2: Predict chemical exposure (exposure_A, exposure_B) from age, BMI, and physical activity level.
-Example 3: Predict blood pressure and cholesterol from age, BMI, and physical activity level..
• Linear BMA.R
Performs Bayesian Model Averaging (BMA) for probabilistic model selection in linear regression.
• Logistic BMA.R
Performs BMA for probabilistic model selection in logistic regression.
• SHAP_direction.R
Function for determining feature direction based on SHAP values using Rcpp pairwise counting.
• BKMR.R
Performs a Bayesian kernel machine regression (BKMR) analysis.
• WQS.R
Performs a Weighted Quantile Sum regression (WQS) analysis.
• PSM.R
Performs a propensity score matching (PSM) analysis.
• RCS.R
Performs a Restricted Cubic Spline (RCS) analysis.
• RCT.R
Performs a descriptive analysis, linear mixed effect models (continuous variables), logistic mixed effect models (binary variables), and cumulative link mixed effect models (ordinal variables) for analyzing randomized controlled trial (RCT) data.
• Convert_semi transparent color to opaque.R
This function converts a semi-transparent color to its opaque equivalent, considering the visual effect on a white background.
• Loading and using a model.R
This script loads a saved model, prepares new data, and makes predictions based on the model.
• Carbon_footprint.R
Initial attempt to estimate energy consumption and carbon emissions from running R code. This function attempts to monitor CPU, RAM, and GPU usage of R code, estimate energy consumption, and calculate associated carbon emissions (gCO2).
• Randomization.R
Creates random assignments for experiments (e.g., clinical trial).
• gmean.R
Function computing the geometric mean of a numeric vector.
• GSD.R
Function computing the geometric standard deviation of a numeric vector.
• Limit_value_exceedance_plot.R
Assesses chemical concentration exceedance against a specified limit value and visualizes the results.
• Continuous_variable_distribution_fig.R
Script for visualizing continuous variable distributions with multiple plot types.
Requirements
• R version ≥ 4.5.0
• Additional dependencies are listed in the header of each script.
Usage
-
Clone or download this repository:
git clone https://github.com/PetitPascal/R-scripts.git -
Open R or RStudio.
-
Load the desired script and follow the inline documentation and example usage.
-
Ensure all required packages are installed before running the scripts.
Reproducibility and citation
All scripts are provided to ensure transparent and reproducible analyses.
If you use any part of this repository in your research or publications, please cite the corresponding DOI of the release used.
• Initial release: Collection of R scripts for reproducible data analysis.
• Version v1.0.1: Includes additional logistic BMA and binary classification with XGBoost and nested CV scripts.
Refer to the DOI listed in the Releases.
All scripts are provided to promote transparency and reproducibility in data analysis.
If you use or adapt these scripts in your own work, please cite the corresponding release DOI:
Cite as (examples):
• Petit, P. (2025). Collection of reproducible data analysis scripts in R (Version v1.0.1). [R]. Zenodo. https://doi.org/10.5281/zenodo.17047856.
• Petit, P. (2025). Weighted survival XGBoot with R [R script]. In R-scripts (Version 1.0.0) [R]. Zenodo. https://zenodo.org/records/17752875.
• Petit, P. (2025). Logistic Bayesian Model Averaging (BMA) analysis for probabilistic model selection in logistic regression [R script]. In R-scripts (Version 1.0.1) [R]. Zenodo. https://zenodo.org/records/17752875.
• Petit, P. (2025). Extreme Gradient Boosting (XGBoost) for binary classification with nested cross-validation [R script]. In R-scripts (Version 1.0.1) [R]. Zenodo. https://zenodo.org/records/17752875.
• Petit, P. (2025). Bayesian kernel machine regression (BKMR) analysis [R script]. In R-scripts (Version 1.0.2) [R]. Zenodo. https://zenodo.org/records/17752875.
Version History
• v1.0.0: Initial release - XGBoost for Cox proportional hazards modeling
• v1.0.1: Update release - Added Logistic BMA and XGBoost (binary classification with nested CV)
• v1.0.2: Update release - Added several new scripts
Related Publications
These scripts have been developed to support analyses in the following works:
• Petit P, Nübel J, Walter J, Butter C, Heinze M, Ignatyev J, Haase-Fielitz A, Vuillerme N, Muehlensiepen F. Telemedicine adoption in cardiology: determinants and predictors identified using Bayesian model averaging and machine learning. PLOS Digit Health. 2026;5(4):e0001359. doi: 10.1371/journal.pdig.0001359.
• Petit P, Berger F, Bonneterre V, Vuillerme N. Exploring Alzheimer's Disease Risk Factors in Farmers with Explainable Machine Learning and Administrative Health Data. Submitted.
• Petit P, Berger F, Bonneterre V, Vuillerme N. Leveraging Machine Learning with Real-World Data to Identify exposomic Risk Factors in Parkinson’s Disease among Farmers. Submitted.
• Petit P, Vuillerme N, Gehrmann J, Stephan J, Muehlensiepen F, Nübel J, Hahn F, Martens E. Determinants and predictors of telemedicine use among physicians in the German outpatient sector: a secondary analysis of a web-based survey. Submitted.
• Petit P, Bonneterre V, Vuillerme N. Using Machine Learning and Nationwide Population-based Data to Unravel Predictors of Treated Depression in Farmers. Ment Illn. 2025;2025:17. doi: 10.1155/mij/5570491.
• Muehlensiepen F, Petit P, Knitza J, Welcker M, Vuillerme N. Prediction of the acceptance of telemedicine among rheumatic patients. A machine learning powered secondary analysis of German survey data. Rheumatol Int. 2024;44(3):523-534. doi: 10.1007/s00296-023-05518-9.
• Muehlensiepen F, Petit P, Knitza J, Welcker M, Vuillerme N. Identification of Motivational Determinants for Telemedicine Use Among Patients With Rheumatoid Arthritis in Germany: Secondary Analysis of Data From a Nationwide Cross-Sectional Survey Study. J Med Internet Res. 2024;26:e47733. doi: 10.2196/47733.
• Muehlensiepen F, Petit P, Knitza J, Welcker M, Vuillerme N. Factors Associated With Telemedicine Use Among Patients With Rheumatic and Musculoskeletal Disease: Secondary Analysis of Data From a German Nationwide Survey. J Med Internet Res. 2023;25:e40912. doi: 10.2196/40912.
• Muehlensiepen F, Petit P, Knitza J, Welcker M, Vuillerme N. Factors associated with Telemedicine Usage among German General Practitioners and Rheumatologists: Secondary Analysis of Data from a Nationwide Survey. J Med Internet Res. 2022;24(11):e40304. doi: 10.2196/40304.
Contact
For questions regarding the scripts or associated studies:
Pascal Petit
email: pascal.petit@univ-grenoble-alpes.fr
• ORCID: https://orcid.org/0000-0001-9015-5230
• ResearchGate: https://www.researchgate.net/profile/Pascal-Petit-3
• Google Scholar: https://scholar.google.fr/citations?user=ja8PT6MAAAAJ&hl=fr
• Web Of Science: https://www.webofscience.com/wos/author/record/M-4351-2017
• HAL: https://hal.science/search/index/q/*/authIdHal_s/pascal-petit
• Thèse.fr: https://theses.fr/223750166
Current affiliation: Univ. Grenoble Alpes, CNRS, Grenoble INP*, LIG, 38000 Grenoble, France
*Institute of Engineering Univ. Grenoble Alpes
Former affiliations:
• Univ. Grenoble Alpes, AGEIS, 38000 Grenoble, France
• Univ. Grenoble Alpes, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, 38000 Grenoble, France
• CHU Grenoble Alpes, Centre Régional de Pathologies Professionnelles et Environnementales, 38000 Grenoble, France
Funding
My research has been partially supported by:
• The French government, through the National Research Agency (ANR - Agence Nationale de la Recherche), under the France 2030 program (MIAI Cluster), grant ANR-23-IACL-0006 (February 2025 – present).
• The French government, through the ANR, under the Investissements d’avenir program, grants ANR-10-AIRT-0005 and ANR-15-IDEX-0002 (September 2022 – April 2026).
• The Agence nationale de sécurité sanitaire de l’alimentation, de l’environnement et du travail (ANSES), grants 2016-CRD-03_PPV16/534B and 2018-CRD-14_PPV18 (October 2018 – December 2020).
• Mutualité Sociale Agricole (MSA), grant MSA-2020-STOP (January 2021 – December 2022).
• Fondation pour la Recherche sur Alzheimer, grant 2020-A-01 (January 2021 – December 2021).
• MIAI@Grenoble Alpes, grant ANR-19-P3IA-0003 (October 2018 – January 2025).
If you find these scripts useful, please star this repository and cite the DOI in your research!