DubsTech Datathon 2026

Healthcare ML Analysis

Three Complementary ML Models on Healthcare Access Barriers

A machine learning analysis of healthcare access barriers across 75 demographic subgroups, spanning 2019–2025. I built three distinct models (predictive, time-series, and clustering) to answer different research questions from one unified dataset.

Timeline: Feb 2026
Role: ML Engineer
Event: Datathon Submission

The Challenge

Healthcare access barriers affect demographic subgroups unevenly, in ways that aggregate statistics hide. The DubsTech Datathon asked teams to use ML to uncover which populations are most at risk, where trends are heading, and which groups are falling through the cracks, all from a single dataset spanning 2019–2025.

The Solution

Instead of a single model, I built three complementary ML analyses: a supervised predictive model to score risk by subgroup, a time-series forecasting pipeline to project 2025 trends, and an unsupervised clustering and anomaly detection system to surface hidden at-risk groups. Together, they answer what's predictable, where we're heading, and who's being missed.

Key Features

Predictive Model: R² = 0.937

Trained six supervised regression algorithms (Linear, Ridge, and Lasso Regression; Decision Tree; Random Forest; Gradient Boosting) on a 70/30 train-test split to predict cost barriers by demographic subgroup.

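  • R² = 0.937 (about 93.7% of variance explained; R² is a goodness-of-fit measure for regression, not classification accuracy)
  • 0.52 percentage points average prediction error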
  • Production-ready for real-time subgroup scoring
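The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the hyperparameters are scikit-learn defaults or arbitrary picks, and the helper names `make_synthetic_data` and `compare_regressors` are hypothetical. It also assumes that "average error" means mean absolute error, which the write-up does not specify.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def make_synthetic_data(n_rows=600, seed=0):
    """Synthetic stand-in for the real dataset (assumption, not project data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 5))
    # Target loosely resembles a barrier rate in percentage points.
    y = (9.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]
         + 0.5 * X[:, 2] * X[:, 3]
         + rng.normal(scale=0.5, size=n_rows))
    return X, y


def compare_regressors(X, y, seed=0):
    """Fit six regressors on a 70/30 split and score each on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    models = {
        "Linear": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.1),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = {
            "r2": r2_score(y_test, predictions),
            # MAE is in the same units as y (here, percentage points).
            "mae_pp": mean_absolute_error(y_test, predictions),
        }
    return results


X, y = make_synthetic_data()
results = compare_regressors(X, y)
for name, scores in sorted(results.items(), key=lambda kv: -kv[1]["r2"]):
    print(f"{name:18s} R2={scores['r2']:.3f}  MAE={scores['mae_pp']:.2f} pp")
```

Scoring on the held-out 30% rather than the training rows keeps the R² from rewarding models that simply memorize the data, which matters most for the tree-based models.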

Time-Series Forecasting

Multi-model forecasting pipeline (ARIMA-style, Exponential Smoothing, Moving Average, Polynomial Regression) projecting 2025 healthcare barrier trends across all tracked categories.

  • 9.15% predicted average barrier for 2025
  • 95% CI: 7.93–10.36%
  • Mental health identified as the only worsening category
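A multi-model forecast of this kind can be sketched with NumPy alone. The snippet below is an illustration under stated assumptions: the yearly values are made up, the function names are hypothetical, and the ARIMA-style model is left out because it would need an additional library. The low/high range is just the spread between models, not the 95% confidence interval reported above, whose derivation the write-up does not describe.

```python
import numpy as np


def moving_average_forecast(values, window=3):
    """Forecast the next point as the mean of the last `window` observations."""
    return float(np.mean(values[-window:]))


def exponential_smoothing_forecast(values, alpha=0.5):
    """Simple exponential smoothing; the final level is the next-step forecast."""
    level = float(values[0])
    for value in values[1:]:
        level = alpha * float(value) + (1.0 - alpha) * level
    return level


def polynomial_forecast(years, values, target_year, degree=1):
    """Fit a polynomial trend and extrapolate it to `target_year`."""
    x = np.asarray(years, dtype=float)
    offset = x.mean()  # center the years for numerical stability
    coeffs = np.polyfit(x - offset, np.asarray(values, dtype=float), deg=degree)
    return float(np.polyval(coeffs, target_year - offset))


def ensemble_forecast(years, values, target_year):
    """Average the individual forecasts and report their min/max spread."""
    forecasts = {
        "moving_average": moving_average_forecast(values),
        "exp_smoothing": exponential_smoothing_forecast(values),
        "polynomial": polynomial_forecast(years, values, target_year),
    }
    point = float(np.mean(list(forecasts.values())))
    return {
        "forecasts": forecasts,
        "point": point,
        "low": min(forecasts.values()),
        "high": max(forecasts.values()),
    }


# Hypothetical barrier rates (%) per year, for illustration only.
years = [2019, 2020, 2021, 2022, 2023, 2024]
rates = [10.0, 8.5, 8.8, 9.2, 9.6, 9.4]
summary = ensemble_forecast(years, rates, target_year=2025)
print(summary)
```

Averaging several simple forecasters is a common hedge against any single model overreacting to a short series, which is a real risk with only a handful of yearly observations.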

Clustering & Anomaly Detection

Applied K-Means, Hierarchical, DBSCAN, Isolation Forest, and Local Outlier Factor with PCA dimensionality reduction to identify hidden at-risk subgroups not obvious from raw data.

  • 11 high-confidence anomalies detected
  • 19 at-risk subgroups flagged
  • 0.815 silhouette score, indicating well-separated clusters
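One way such a pipeline might be wired together is sketched below. It covers only K-Means, PCA, Isolation Forest, and Local Outlier Factor, and runs on synthetic data. Treating a point as a "high-confidence" anomaly when both detectors flag it is an assumption made for illustration, since the write-up does not define the term. The cluster count, contamination rate, and function names are likewise hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler


def make_synthetic_subgroups(seed=0):
    """75 synthetic 'subgroups' with 4 features each (not project data)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[5.0] * 4, [10.0] * 4, [15.0] * 4])
    blobs = [c + rng.normal(scale=0.5, size=(23, 4)) for c in centers]
    outliers = rng.uniform(20.0, 25.0, size=(6, 4))
    return np.vstack(blobs + [outliers])


def cluster_and_flag(X, n_clusters=3, contamination=0.1, seed=0):
    """Cluster in PCA space and flag anomalies with two detectors."""
    X_scaled = StandardScaler().fit_transform(X)
    X_2d = PCA(n_components=2, random_state=seed).fit_transform(X_scaled)

    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X_2d)
    silhouette = silhouette_score(X_2d, labels)

    # Both detectors return -1 for outliers and 1 for inliers.
    iso_flags = IsolationForest(
        contamination=contamination, random_state=seed
    ).fit_predict(X_scaled) == -1
    lof_flags = LocalOutlierFactor(
        n_neighbors=10, contamination=contamination
    ).fit_predict(X_scaled) == -1

    return {
        "labels": labels,
        "silhouette": silhouette,
        "iso_flags": iso_flags,
        "lof_flags": lof_flags,
        # Assumed definition: flagged by both detectors.
        "consensus": iso_flags & lof_flags,
    }


X = make_synthetic_subgroups()
result = cluster_and_flag(X)
print(f"silhouette={result['silhouette']:.3f}")
print(f"flagged by both detectors: {int(result['consensus'].sum())}")
```

Requiring agreement between a tree-based detector and a density-based one trades recall for precision: fewer subgroups are flagged, but each flag is harder to dismiss as an artifact of a single method.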

Technology Stack

Language: Python 3.10+

Data & ML: scikit-learn, pandas, numpy, scipy

Visualization: matplotlib, seaborn

Algorithms Used: Gradient Boosting, Random Forest, DBSCAN, Isolation Forest

Impact & Results

  • 75 demographic subgroups analyzed across race, income, insurance, and identity
  • 13 distinct ML algorithms applied across three model types
  • 1,800+ data points processed covering 2019–2025
  • ~90s total runtime for all three models end-to-end

Key Learnings

01

The COVID Paradox

Barriers paradoxically decreased in 2020 — likely from expanded coverage and telehealth — but increased sharply post-2022 as policies rolled back. Aggregate trends masked subgroup-level divergence.

02

Mental Health Divergence

While medical and delayed-care barriers trended downward, mental health access was the only category still worsening. A single aggregate metric would have missed this entirely.

03

Clustering Over Averages

Unsupervised methods revealed at-risk groups (e.g., Native Hawaiian/Pacific Islander, bisexual individuals) that were statistically buried in aggregate datasets but faced barriers of 14–23%.
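A small worked example shows why a population-weighted average can hide a small, high-barrier group. The group names, rates, and population shares below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def weighted_average(rates, shares):
    """Population-weighted mean of per-group barrier rates."""
    total_share = sum(shares.values())
    return sum(rates[group] * shares[group] for group in rates) / total_share


# Hypothetical barrier rates (%) and population shares, for illustration only.
rates = {"Group A": 8.0, "Group B": 9.0, "Group C": 20.0}
shares = {"Group A": 0.60, "Group B": 0.38, "Group C": 0.02}

aggregate = weighted_average(rates, shares)
# Gap between each group's rate and the aggregate, in percentage points.
gaps = {group: rate - aggregate for group, rate in rates.items()}

print(f"aggregate barrier rate: {aggregate:.2f}%")
for group, gap in gaps.items():
    print(f"{group}: {rates[group]:.1f}% ({gap:+.2f} pp vs aggregate)")
```

Here the aggregate lands at 8.62% even though one group sits at 20%, because that group makes up only 2% of the population. Clustering and anomaly detection operate on each subgroup's own profile, so a small group carries the same weight as a large one.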

04

Three Models, One Story

Each model answered a distinct question. Combining them gave a complete picture: what's predictable (Model 1), where trends are heading (Model 2), and who's being missed (Model 3).
