June 30, 2026

AI model identifies physical inactivity as strongest county-level predictor of diabetes

Author(s): Rose McNulty

Key Takeaways

  • LightGBM modeling of county-level “multilevel ecology” retained both upstream forcing factors and downstream behaviors, explaining 95% of diabetes prevalence variation across more than 3,000 counties.
  • SHAP importance scores ranked the prevalence of no leisure-time physical activity and racial/ethnic minority share (a social vulnerability index measure) clearly ahead of frequent physical distress, obesity, excessive drinking, and insufficient sleep.

The approach of pairing an ecological framework with AI methods offers a new way to identify where diabetes risk concentrates.

Physical inactivity and racial and ethnic minority status emerged as the two strongest predictors of diabetes prevalence across U.S. counties in a new machine-learning analysis published in Diabetes/Metabolism Research and Reviews.

The cross-sectional study, led by Nicolaas P. Pronk, Ph.D., of the HealthPartners Institute, set out to capture what the authors call the "multilevel ecology" behind diabetes, defined as the mix of broad structural forces and individual-level behaviors that influence diabetes development.

“A wide variety of risks and protective factors raise or lower the probability of diabetes diagnosis,” the authors wrote. “This study confirms some of these ‘forcing factors’ as upstream factors that are present for a defined population as a whole and are not specific to the individual, whereas others are typical downstream risk factors at the individual level (e.g., inactivity or obesity). Forcing factors are attributes of a system that are common to all people and collectively play a role in promoting or inhibiting health outcomes.”

The researchers compiled 27 county-level predictor variables from public sources, including the CDC, County Health Rankings & Roadmaps and U.S. Bureau of Economic Analysis data, then trained a Light Gradient Boosting Machine model on just over 3,000 counties.

The final model retained 17 variables and explained 95% of the variation in county-level diabetes prevalence. Using Shapley values to rank each factor's contribution, the authors found that the prevalence of no leisure-time physical activity and the share of residents from racial and ethnic minority groups (a social vulnerability index measure) were the most impactful, with importance scores that exceeded those of every other variable. Frequent physical distress and obesity ranked next, followed by excessive drinking and insufficient sleep.
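For readers unfamiliar with Shapley-based ranking, the sketch below illustrates the general idea and is not the authors' code. It computes exact Shapley values for a toy stand-in predictor by enumerating feature coalitions against a single baseline row, then ranks features by mean absolute value. The feature names, weights, rows and baseline are all hypothetical and chosen only for illustration; the study itself applied Shapley values to a LightGBM model with 17 retained variables.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row x, measured against one baseline row."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition members take x's values; all others stay at baseline.
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

def mean_abs_importance(predict, rows, baseline):
    """Average absolute Shapley value per feature across all rows."""
    totals = [0.0] * len(baseline)
    for row in rows:
        for i, value in enumerate(shapley_values(predict, row, baseline)):
            totals[i] += abs(value)
    return [t / len(rows) for t in totals]

# Hypothetical feature names and a toy linear predictor (NOT the study's model).
names = ["no_leisure_activity_pct", "obesity_pct", "insufficient_sleep_pct"]

def toy_predict(f):
    return 3.0 + 0.3 * f[0] + 0.1 * f[1] + 0.05 * f[2]

rows = [[30.0, 20.0, 35.0], [20.0, 12.0, 28.0], [28.0, 18.0, 33.0]]  # made-up counties
baseline = [25.0, 15.0, 30.0]                                        # made-up reference row

importance = mean_abs_importance(toy_predict, rows, baseline)
ranking = [name for _, name in sorted(zip(importance, names), reverse=True)]
```

Exact enumeration is only practical for a handful of features; tree-model tooling typically relies on faster, specialized algorithms. As the authors stress, such rankings describe predictive contribution, not causal effect.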

The model also surfaced upstream social and historical factors. Counties with a higher historical enslaved population, lower voter turnout and greater food insecurity were associated with higher predicted diabetes prevalence, though the authors emphasized these features were less influential than the top four in the study.

To test what the findings might mean for intervention, the team ran a county-level simulation that reduced physical inactivity by 10% while holding other characteristics constant. The model predicted diabetes prevalence would fall by up to 0.8 absolute percentage points on average.
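The scenario described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's procedure: the predictor is a toy linear stand-in for the trained model, the county rows and field names are made up, and the 10% reduction is assumed to be relative (the article does not say whether it was relative or absolute).

```python
def simulate_relative_reduction(predict, counties, feature, reduction=0.10):
    """Average change in predicted prevalence when one feature is lowered
    by a relative fraction and every other feature is held constant."""
    changes = []
    for county in counties:
        scenario = dict(county)  # copy so the original row is untouched
        scenario[feature] = county[feature] * (1.0 - reduction)
        changes.append(predict(scenario) - predict(county))
    return sum(changes) / len(changes)

# Toy stand-in for a fitted model (hypothetical coefficients, NOT the study's).
def toy_predict(county):
    return 2.0 + 0.25 * county["inactivity_pct"] + 0.10 * county["obesity_pct"]

# Made-up county rows for illustration only.
counties = [
    {"inactivity_pct": 20.0, "obesity_pct": 30.0},
    {"inactivity_pct": 30.0, "obesity_pct": 36.0},
    {"inactivity_pct": 40.0, "obesity_pct": 40.0},
]

avg_change = simulate_relative_reduction(toy_predict, counties, "inactivity_pct")
```

Here `avg_change` is expressed in absolute percentage points of predicted prevalence. Because only one input is altered, the result reflects what the model predicts under that scenario rather than a causal estimate.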

“This exploratory model-based scenario may be helpful to support public health practice or policy considerations for the role of county-based physical activity levels,” the authors wrote.

Because physical inactivity was the single strongest and most modifiable predictor, they noted that physical activity assessment should be prioritized as an integral part of primary and diabetes care. Clinicians, they suggested, might also give patients "prescriptions" for follow-up care, which could be guidance as simple as walking more and sitting less and could include access to activity resources when available.

The authors cautioned against over-interpreting the rankings. Many variables relied on self-reported data, and ecological, county-level analysis carries inherent limits, including potential reverse causality, correlation among predictors and residual spatial clustering. The associations, they noted, reflect relative predictive contributions at the population level, not independent causal effects.

Still, Pronk and colleagues concluded that the approach of pairing an ecological framework with AI methods able to handle that complexity offers a new way to identify where diabetes risk concentrates.

“Both upstream forcing factors and downstream risk factors were retained in the final predictive model, thereby making the case that pursuit of prevention policies and public health practices should include multilevel thinking,” the authors concluded. “Additional research into the use of this framework for the prediction of health outcomes, identification of best places to intervene, and effectiveness of interventions seems warranted.”

