Data/ML engineer focused on reproducible, leakage-safe machine learning and honest evaluation.
I build models that survive scrutiny: leakage-safe labeling, real baselines, cross-validation, probability calibration, and a clear line between synthetic and real data. Every metric is committed and reproducible from a clean clone. I work end-to-end — from SQL/warehouse modeling through tuned models to Streamlit/FastAPI interfaces.
I optimize for correct methodology: proper train/test discipline, baselines, calibration, and clearly stated limitations.
Pit Wall Intelligence — F1 race-strategy analytics: FastF1 data in a DuckDB + dbt warehouse; tyre-degradation and undercut-success models served via FastAPI and a 6-page Streamlit dashboard. Calibrated LightGBM undercut classifier (AUC 0.66 ± 0.05, 5-fold GroupKFold on 62/21 train/test race split — GREEN-flag stops only). Monte Carlo race simulator. Pit-cost calculator across 33 circuits with bootstrap CIs, SC/VSC regime separation.
SignalForge — Churn modeling on IBM Telco with statistical rigor: Optuna tuning, leakage-free CV, bootstrap 95% CIs, paired t-tests, calibration. The three models land within ~0.003 AUC with overlapping confidence intervals — model choice is a calibration/interpretability call, not an accuracy race.
SaaS Churn Simulator — Leakage-safe churn + retention-ROI pipeline on RetailRocket (2.76M events). Time-windowed labeling, visitor-disjoint splits, Optuna-tuned LightGBM, isotonic calibration. 5-fold CV ROC-AUC 0.88 ± 0.06. Reports honestly that the ~99% base rate caps business lift. Live demo →
Ecommerce Retention & Growth — 30-day churn prediction and LTV segmentation on KKBox data; calibrated XGBoost (ROC-AUC ~0.79), ROI simulator. Ships a synthetic generator so it runs without the large download.
Ticket Intel — Support-ticket routing and summarization on Banking77 using TF-IDF + Naive Bayes by design: fast, interpretable, with a documented rationale for not using an LLM. Live demo →
- MeasureMap — Self-hosted KPI governance registry: define, approve, version, and audit metrics with role-based access, LDAP/AD integration, CSV import with validation, full audit trail. Built to run air-gapped in a hospital network. Next.js · TypeScript · PostgreSQL · Prisma · Docker
- Healthcare SQL Analytics — Production EHR analytics SQL patterns from 6 years of clinical and operational BI on Meditech Paragon: wRVU physician productivity, SDOH screening compliance, 340B drug utilization extract, sepsis missed-identification rate. Synthetic identifiers throughout.
- AutoModeler — Type a ticker, get a fully-linked 3-statement Excel model. FMP API · FastAPI · Python.
Python · SQL · TypeScript · scikit-learn · XGBoost · LightGBM · Optuna · pandas · DuckDB · dbt · FastAPI · Streamlit · Next.js · PostgreSQL · Prisma · pytest · GitHub Actions · Docker · Tableau
- BI Analyst — 4 years of clinical and operational analytics at a community hospital system on Meditech Paragon EHR: physician productivity reporting, clinical quality (SDOH, sepsis, readmissions), 340B compliance, Tableau dashboards, EMR data migration
- MBA + MS Data Science at Eastern University (expected 2027)
- Previous: Manufacturing, law enforcement — learned to find signal in noisy data and explain it to people who need a decision, not a model card


