★ Honorable Mention · DubsTech Datathon 2026

HEALTHCARE ML

Three complementary ML models exposing who's being left behind by healthcare access — and where it's heading.

A comprehensive machine learning analysis of healthcare access barriers across 75 demographic subgroups, spanning 2019–2025. Built three distinct models — predictive, time-series, and clustering — to answer different research questions from one unified dataset.

Timeline: Feb 7–9, 2026
Role: ML Engineer
Event: Datathon Submission
Outcome: Honorable Mention

The Challenge

Healthcare access barriers disproportionately affect different demographic subgroups in ways that aren't visible from aggregate statistics. The DubsTech Datathon asked teams to use ML to uncover which populations are most at risk, where trends are heading, and which groups are falling through the cracks — all from a single dataset spanning 2019–2025.

The Solution

Instead of a single model, I built three complementary ML analyses: a supervised predictive model to score risk by subgroup, a time-series forecasting pipeline to project 2025 trends, and an unsupervised clustering and anomaly detection system to surface hidden at-risk groups. Together, they answer what's predictable, where we're heading, and who's being missed.

01

Predictive Model: R² of 0.937

Trained six supervised regression algorithms (Linear, Ridge, and Lasso Regression, Decision Tree, Random Forest, and Gradient Boosting) on a 70/30 train-test split to predict cost-barrier rates by demographic subgroup.

  • R² of 0.937 (93.7% of variance explained)
  • 0.52 percentage-point average error
  • Real-time subgroup scoring
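
As a rough illustration of the comparison described above, the sketch below fits the same six regressor families on a 70/30 split and reports R² and mean absolute error for each. The data is synthetic, and the function name `compare_regressors`, the hyperparameters, and the choice of mean absolute error as the error metric are all assumptions for illustration, not the project's actual code or dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X, y, seed=0):
    """Fit six regressors on a 70/30 split and score each on the held-out 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    # Hyperparameters are illustrative defaults, not tuned values.
    models = {
        "Linear": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.01),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "r2": r2_score(y_test, pred),
            "mae": mean_absolute_error(y_test, pred),
        }
    return results


# Synthetic stand-in data: 300 rows, 5 numeric features, target in percentage points.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 9.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

results = compare_regressors(X, y)
for name, metrics in sorted(results.items(), key=lambda kv: -kv[1]["r2"]):
    print(f"{name:18s} R2={metrics['r2']:.3f}  MAE={metrics['mae']:.2f} pp")
```

Note that R² measures the share of variance in the target that the model explains. It is a regression score, not a classification accuracy.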

02

Time-Series Forecasting

Multi-model forecasting pipeline (ARIMA-style, Exponential Smoothing, Moving Average, Polynomial Regression) projecting 2025 healthcare barrier trends across all tracked categories.

  • 9.15% predicted avg barrier for 2025
  • 95% CI: 7.93–10.36%
  • Mental health: only worsening category
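
One way a small multi-model forecast like this could be assembled is sketched below: three simple forecasters (moving average, simple exponential smoothing, polynomial trend) are averaged into a point estimate. The yearly values are made up, the function names are hypothetical, and the interval is only a rough normal approximation based on year-over-year changes. How the 95% CI above was actually computed is not stated, so treat this as an assumption.

```python
import numpy as np


def moving_average_forecast(y, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    return float(np.mean(y[-window:]))


def exp_smoothing_forecast(y, alpha=0.5):
    """Simple exponential smoothing: the final level is the one-step forecast."""
    level = float(y[0])
    for value in y[1:]:
        level = alpha * float(value) + (1.0 - alpha) * level
    return level


def polynomial_forecast(years, y, target_year, degree=1):
    """Fit a polynomial trend and extrapolate it to `target_year`."""
    # Shift years so the first one is 0, which keeps the fit well conditioned.
    t = np.asarray(years, dtype=float) - years[0]
    coeffs = np.polyfit(t, y, degree)
    return float(np.polyval(coeffs, target_year - years[0]))


def ensemble_forecast(years, y, target_year):
    """Average the three forecasts and attach a rough 95% interval."""
    y = np.asarray(y, dtype=float)
    forecasts = [
        moving_average_forecast(y),
        exp_smoothing_forecast(y),
        polynomial_forecast(years, y, target_year),
    ]
    point = float(np.mean(forecasts))
    # Assumption: use the spread of year-over-year changes as the one-step error scale.
    sigma = float(np.std(np.diff(y), ddof=1))
    return point, point - 1.96 * sigma, point + 1.96 * sigma


# Hypothetical average barrier rates (%), not the project's data.
years = [2019, 2020, 2021, 2022, 2023, 2024]
rates = [9.8, 8.6, 8.9, 9.0, 9.4, 9.3]

point, low, high = ensemble_forecast(years, rates, target_year=2025)
print(f"2025 forecast: {point:.2f}% (rough 95% interval {low:.2f} to {high:.2f})")
```

With only a handful of yearly observations, any interval of this kind is wide and sensitive to the assumed error model, so it is best read as an indication of uncertainty rather than a guarantee.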

03

Clustering & Anomaly Detection

Applied K-Means, Hierarchical, DBSCAN, Isolation Forest, and Local Outlier Factor with PCA dimensionality reduction to identify hidden at-risk subgroups not obvious from raw data.

  • 11 high-confidence anomalies
  • 19 at-risk subgroups flagged
  • 0.815 silhouette score
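
A minimal sketch of this kind of pipeline follows: standardize, reduce with PCA, cluster with K-Means, score the clustering with the silhouette coefficient, then flag anomalies with Isolation Forest and Local Outlier Factor. The data is synthetic, the function name `cluster_and_flag` and all parameter values are hypothetical, and treating "high-confidence" as "flagged by both detectors" is an assumption, not the project's documented rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler


def cluster_and_flag(X, n_clusters=3, contamination=0.15, seed=0):
    """Cluster subgroups in PCA space and flag points both detectors call outliers."""
    scaled = StandardScaler().fit_transform(X)
    Z = PCA(n_components=2, random_state=seed).fit_transform(scaled)

    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    score = silhouette_score(Z, labels)

    # Both detectors return -1 for outliers and 1 for inliers.
    iso_flags = IsolationForest(
        contamination=contamination, random_state=seed
    ).fit_predict(Z) == -1
    lof_flags = LocalOutlierFactor(
        n_neighbors=10, contamination=contamination
    ).fit_predict(Z) == -1

    # Assumption: "high-confidence" means both detectors agree.
    consensus = iso_flags & lof_flags
    return labels, score, consensus


# Synthetic stand-in: 75 rows (one per subgroup), 4 barrier-rate features.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(25, 4)) for c in centers])

labels, score, consensus = cluster_and_flag(X)
print(f"silhouette={score:.3f}, consensus anomalies={int(consensus.sum())}")
```

Requiring agreement between two detectors trades recall for precision: fewer subgroups are flagged, but each flag is less likely to be an artifact of a single method.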

Language

Python 3.10+

Data & ML

scikit-learn · pandas · numpy · scipy

Visualization

matplotlib · seaborn

Algorithms

Gradient Boosting · Random Forest · DBSCAN · Isolation Forest

75 Demographic subgroups analyzed
13 Distinct ML algorithms applied
1,800+ Data points · 2019–2025
~90s Runtime for all three models

01

The COVID Paradox

Barriers paradoxically decreased in 2020 — likely from expanded coverage and telehealth — but increased sharply post-2022 as policies rolled back. Aggregate trends masked subgroup-level divergence.

02

Mental Health Divergence

While medical and delayed-care barriers trended downward, mental health access was the only category still worsening. A single aggregate metric would have missed this entirely.
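
A per-category trend check of this kind can be as simple as fitting a least-squares slope to each category and looking at its sign. The sketch below does that in plain Python. The category values are invented for illustration, and `trend_slope` is a hypothetical helper, not a function from the project.

```python
def trend_slope(years, values):
    """Ordinary least-squares slope of values against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    den = sum((x - mean_x) ** 2 for x in years)
    return num / den


# Hypothetical barrier rates (%) per category, not the project's data.
years = [2019, 2020, 2021, 2022, 2023, 2024]
series = {
    "medical care": [10.0, 8.4, 8.6, 8.5, 8.3, 8.1],
    "delayed care": [9.0, 7.9, 8.0, 7.8, 7.7, 7.5],
    "mental health": [5.0, 5.2, 5.6, 6.1, 6.5, 7.0],
}

slopes = {name: trend_slope(years, values) for name, values in series.items()}
# A positive slope means the barrier rate is rising, i.e. access is worsening.
worsening = [name for name, slope in slopes.items() if slope > 0]
print(worsening)
```

Averaging the three series first would produce a single downward trend here, which is exactly how a worsening category can disappear inside an aggregate.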

03

Clustering Over Averages

Unsupervised methods revealed at-risk groups (e.g., Native Hawaiian/Pacific Islander and bisexual individuals) that were buried in aggregate statistics but faced barrier rates of 14–23%.
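
The arithmetic behind this effect is easy to show with a population-weighted average. In the sketch below, the group names, population shares, barrier rates, and the 1.5× flagging threshold are all invented for illustration and do not come from the dataset.

```python
# Hypothetical subgroups: (population share, barrier rate in %).
groups = {
    "Group A": (0.60, 7.0),
    "Group B": (0.38, 9.0),
    "Group C": (0.02, 20.0),
}

# The aggregate rate is the share-weighted mean, so small groups barely move it.
aggregate = sum(share * rate for share, rate in groups.values())

# Assumed rule of thumb: flag any group whose rate exceeds 1.5x the aggregate.
threshold = 1.5 * aggregate
flagged = [name for name, (_, rate) in groups.items() if rate > threshold]

print(f"aggregate={aggregate:.2f}%, flagged={flagged}")
```

Group C's rate is more than double the aggregate, yet it contributes well under one percentage point to it. This is why subgroup-level methods are needed to surface small, high-barrier populations.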

04

Three Models, One Story

Each model answered a distinct question. Combining them gave a complete picture: what's predictable, where trends are heading, and who's being missed.
