HIV Incidence Prediction

Development of a Machine Learning Modelling Tool for Predicting Incident HIV Using Public Health Data from a County in the Southern United States

In collaboration with Carlos S. Saldana; Elizabeth Burkhardt; Alfred Pennisi, MA; Kirsten Oliver; John Olmstead; David P. Holland; Jenna Gettings; Daniel Mauck; David Austin; Pascale Wortley


Link to Article

This article explores the application of advanced Machine Learning (ML) techniques to predict incident HIV cases using de-identified public health datasets from a high-incidence Southern U.S. county. Analyzing data from January 2010 to December 2021, incorporating sociodemographic factors, sexually transmitted infections (STIs), and social vulnerability index (SVI) metrics, our ML algorithms achieved an 80% accuracy rate in predicting incident HIV for both males and females among the 85,224 individuals studied. While highlighting the potential of de-identified public health datasets for accurate predictive analyses, the study underscores the need for further research to translate these models into practical public health applications. The insights gained have significant implications for targeted interventions and policymaking in regions with high HIV incidence rates.


Introduction:

The U.S. Ending the HIV Epidemic (EHE) initiative aims to reduce new HIV infections by 90% by 2030, but it faces challenges from disparities driven by social determinants of health. Meeting that goal calls for innovative, data-driven strategies, and Artificial Intelligence (AI) and Machine Learning (ML) can help identify patterns associated with HIV acquisition. This study applies ML tools to public health datasets and social vulnerability indicators in a high-incidence Southern U.S. county to predict and address incident HIV.

Methodology:

The study focused on Fulton County, Georgia, a priority area for the Ending the HIV Epidemic (EHE) initiative due to its high HIV incidence. In 2021, the county had a population of 1,065,000 and a new HIV diagnosis rate of 58 per 100,000 individuals, well above the national average of 13.2 per 100,000. The challenges in meeting EHE targets were evident: 19.4% of new HIV diagnoses were classified as late-stage, only 74.1% of diagnosed individuals accessed care, and 62.4% achieved viral suppression.

Data from January 2010 to December 2021 were collected from two Georgia databases: the State Electronic Notifiable Disease Surveillance System (SendSS) and the Georgia Electronic HIV/AIDS Reporting System (eHARS). These databases provided de-identified information on demographics, diagnosing provider type, risk behaviors, and treatment. Social Vulnerability Index (SVI) data, assessing community vulnerability, were matched by census tract and classified into quintiles for overall score and four SVI themes.
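The quintile classification of SVI scores can be illustrated with a minimal sketch. It assumes each census tract carries an SVI percentile rank between 0 and 1 and that quintiles use fixed cut points at 0.2, 0.4, 0.6, and 0.8; the study may have derived its quintiles differently, and the tract names and scores below are hypothetical.

```python
def svi_quintile(score: float) -> int:
    """Map an SVI percentile rank in [0, 1] to a quintile from 1 (least vulnerable) to 5 (most)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SVI score out of range: {score}")
    # int(score * 5) yields 0..5; min() folds the score == 1.0 edge case into quintile 5.
    return min(int(score * 5), 4) + 1


# Hypothetical census-tract records: tract id -> overall SVI percentile rank.
tract_svi = {"tract_A": 0.07, "tract_B": 0.45, "tract_C": 1.00}

# The same mapping would be repeated for each of the four SVI themes.
tract_quintile = {tract: svi_quintile(score) for tract, score in tract_svi.items()}
```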

Dataset development involved extracting sexually transmitted infection (STI) incidents from SendSS, cross-referencing HIV diagnoses using a probabilistic matching technique against eHARS, and excluding specific cases to ensure accuracy. The dataset was transformed into a patient-focused model, consolidating multiple STI incidents into a comprehensive profile per patient. Participant selection process details are illustrated in Figure 1.
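The transposition from a case-based to a patient-based dataset might look like the following sketch. The field names (`patient_id`, `sti`, `diagnosed`, `age`) and the records are hypothetical, chosen only to show how several STI episodes collapse into one profile per patient; the study's actual schema is not described here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical case-level records: one row per reported STI episode.
cases = [
    {"patient_id": "p1", "sti": "chlamydia", "diagnosed": date(2015, 3, 1), "age": 22},
    {"patient_id": "p2", "sti": "gonorrhea", "diagnosed": date(2016, 7, 9), "age": 31},
    {"patient_id": "p1", "sti": "gonorrhea", "diagnosed": date(2014, 1, 15), "age": 21},
]


def to_patient_profiles(case_rows):
    """Group case rows by patient and summarize each patient's STI history in date order."""
    grouped = defaultdict(list)
    for row in case_rows:
        grouped[row["patient_id"]].append(row)

    profiles = {}
    for pid, rows in grouped.items():
        rows.sort(key=lambda r: r["diagnosed"])  # earliest episode first
        profiles[pid] = {
            "sti_count": len(rows),
            "sti_history": [r["sti"] for r in rows],
            "ages_at_diagnosis": [r["age"] for r in rows],
            "first_diagnosis": rows[0]["diagnosed"],
        }
    return profiles


profiles = to_patient_profiles(cases)
```

Patients with several episodes end up with list-valued fields, which correspond to the "array" features discussed in the feature selection section.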

Figure 1. Methodology for matching the SendSS and eHARS datasets from Fulton County, Georgia, from 2010 to 2021. From 132,928 STI cases, 5,729 were excluded. Transposition from case-based to patient-based records led to 85,224 individuals, who were then matched to a Social Vulnerability quintile and categorized by sex assigned at birth and documented HIV status.

 

Feature selection

Feature selection in our study encompassed a range of sociodemographic variables detailed in Table 1, including sex assigned at birth, race, and ethnicity. We incorporated the age at STI diagnosis and compiled this information into an array for individuals with multiple STI occurrences. We also considered the cumulative non-HIV STI count per patient and cataloged all previous non-HIV STIs (such as gonorrhea, chlamydia, and syphilis), specifying the stage where applicable.

In cases of multiple infections in a single patient, an array was constructed to represent this. Re-infection intervals were categorized, labeling re-infections based on different time intervals from the initial STI(s). Provider types, ranging from urgent care centers to correctional facilities, were organized into an array when multiple STIs occurred. Lastly, we aligned social vulnerability indexes (overall and themes 1-4) with individual census tracts at the time of the first STI diagnosis to assess the socioeconomic context influencing the STI landscape.
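The re-infection interval labeling can be sketched as a simple binning step. Only the "201-365" and ">365" day ranges are mentioned in this article; the two shorter bins below are illustrative assumptions, as are the function names.

```python
from datetime import date

# Upper bounds in days and their labels. Only "201-365" and ">365" appear in the
# article; the "0-100" and "101-200" bins are assumptions made for illustration.
INTERVAL_BINS = [(100, "0-100"), (200, "101-200"), (365, "201-365")]


def reinfection_label(first: date, later: date) -> str:
    """Label a later STI diagnosis by its distance in days from the initial one."""
    days = (later - first).days
    if days < 0:
        raise ValueError("later diagnosis precedes the initial one")
    for upper, label in INTERVAL_BINS:
        if days <= upper:
            return label
    return ">365"


def reinfection_array(diagnosis_dates):
    """Return one interval label per re-infection, measured from the earliest diagnosis."""
    ordered = sorted(diagnosis_dates)
    return [reinfection_label(ordered[0], d) for d in ordered[1:]]


labels = reinfection_array([date(2015, 1, 1), date(2015, 9, 1), date(2017, 2, 1)])
```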

Model Development:

After data pre-processing, patients were stratified by sex assigned at birth, and separate models were built for males and females because different factors influence HIV acquisition in each group. Missing values were imputed with the mean value of the variable. The dataset combined numerical, categorical, and array data, necessitating a three-fold approach. Numerical data were normalized, categorical data were one-hot encoded and normalized, and array data were processed using a Neural Network Autoencoder (NN Autoencoder) and t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.
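A minimal sketch of the numerical and categorical steps follows. It assumes min-max scaling as the normalization method, which the article does not specify, and it leaves out the autoencoder and t-SNE stage used for array data. The helper names and sample values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical numerical feature with one missing value.
ages = [20, None, 30, 40]
ages_scaled = min_max_normalize(impute_mean(ages))

# Hypothetical categorical feature.
provider_categories, provider_encoded = one_hot(["STD clinic", "hospital", "STD clinic"])
```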

For the ML classifiers, we ensured that positive and negative HIV results were balanced in both the training and test sets, addressing the class imbalance by undersampling the majority class. The balanced data were then split into a training set (85%) and a test set (15%) reserved for validation.
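The balancing and splitting step can be sketched as follows. This is an assumption-laden illustration: it undersamples HIV-negative patients at random to match the number of HIV-positive patients, then takes 85% of each class for training, rounding up. The function name, the seed, and the integer patient identifiers are hypothetical.

```python
import random


def balanced_split(positives, negatives, train_percent=85, seed=0):
    """Undersample negatives to match positives, then split both classes train/test."""
    positives, negatives = list(positives), list(negatives)
    if len(negatives) < len(positives):
        raise ValueError("undersampling needs at least as many negatives as positives")

    rng = random.Random(seed)
    rng.shuffle(positives)
    # Undersample the majority class: keep only as many negatives as there are positives.
    sampled_negatives = rng.sample(negatives, len(positives))

    # Ceiling of train_percent % using integer arithmetic (rounding up is an assumption).
    n_train = (len(positives) * train_percent + 99) // 100

    train = [(x, 1) for x in positives[:n_train]] + [(x, 0) for x in sampled_negatives[:n_train]]
    test = [(x, 1) for x in positives[n_train:]] + [(x, 0) for x in sampled_negatives[n_train:]]
    return train, test


# Hypothetical ids: 20 patients with an HIV diagnosis and 200 without.
train, test = balanced_split(range(20), range(100, 300))
```

Because both classes are cut at the same index, the training and test sets each contain equal numbers of positive and negative patients, and no patient appears in both.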


Model Selection and Evaluation:

The "horse race approach" was adopted, training multiple ML algorithms on the same dataset to identify the most suitable predictive model. Established classifiers, including Random Forest, Nearest Neighbors, Logistic Regression, Naive Bayes, and Gradient Boosted Trees, were employed to comprehensively evaluate performances. Model assessment criteria included accuracy, precision, recall, and F1-score.
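The four assessment criteria follow directly from the counts in a confusion matrix. The sketch below uses the standard definitions; the counts passed in at the end are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not taken from the study's confusion matrices).
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
```

In a horse race, the same function would be applied to each classifier's predictions on the shared test set so that the candidates are compared on identical terms.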

Figure 2. Confusion matrices for various machine learning algorithms. True positives, true negatives, false positives, and false negatives are reported for Gradient Boosted Trees, Naive Bayes, Logistic Regression, Nearest Neighbors, and Random Forest, with accuracy scores below each matrix. *Similar performance metrics were seen in the female subgroup (not displayed in this figure).

Results:

From 2010 to 2021, 132,928 cases of sexually transmitted infections (STIs) were recorded. After filtering, 127,169 cases met our criteria. Consolidating cases by patient yielded 85,224 individuals with at least one STI. About 54% were females (45,834), and 46% were males (38,935), as shown in Table 2. Of these, 2,027 individuals (2.37%) had a documented HIV diagnosis during this period: 1,698 males (84%) and 329 females (16%).

To train our models, we divided the data by sex assigned at birth. For males, 85% of those with HIV (1,444 individuals) were matched with 1,444 individuals without HIV, for a total of 2,888 training cases. The same was done for females, resulting in 560 training cases. The remaining 15% of the data were used to test the model: 508 males and 98 females, evenly split between those with and without documented HIV diagnoses, as seen in Table 3.
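The training and test set sizes can be reproduced from the number of individuals with HIV in each group. The sketch below assumes that the 85% share is rounded up to a whole person, which is the rounding that matches the reported counts; the function name is hypothetical.

```python
def balanced_sizes(n_positive, train_percent=85):
    """Return (training size, test size) when each positive is paired with one negative."""
    # Ceiling of train_percent % in integer arithmetic (rounding up is an assumption).
    n_train_pos = (n_positive * train_percent + 99) // 100
    n_test_pos = n_positive - n_train_pos
    # Each set holds equal numbers of positives and negatives, hence the factor of 2.
    return 2 * n_train_pos, 2 * n_test_pos


male_train, male_test = balanced_sizes(1698)     # 2888 training, 508 test
female_train, female_test = balanced_sizes(329)  # 560 training, 98 test
```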

Figure 3. Gradient Boosted Trees confusion matrices for males (left) and females (right). Both models reach an accuracy of 80%, with balanced precision and recall across classes and comparable error rates.

On average, males were diagnosed with STIs at around 28 years of age and females at around 24, with ages ranging from 13 to 95. Most individuals were Black (63% of males and 57% of females), and a substantial share had unspecified race (23% of males and 32% of females). The highest number of STI incidents recorded was 18 for males and 23 for females. Most people had a single STI episode (71% of males and 72% of females), and the next most common count was two episodes (18% of males and 17% of females). Chlamydia was the main STI among females, making up 79% of cases, compared with 54% in males. Gonorrhea, by contrast, was more frequent in males (36%) than in females (19%). Re-infections usually occurred more than a year later for both males (69%) and females (67%). When re-infections occurred within a year, they mostly took place 201-365 days after the initial STI in both sexes (13%).

Males were most often diagnosed at STD clinics (22.32%) and females by private physicians (40%). Private physicians were the second most common diagnosing provider for males (19%), and hospitals were the second most common for females (15%). The least common locations for diagnoses were school-based clinics for males (2%) and correctional facilities for females (2%).


 

Model Performance and Evaluation:

Among the models compared (see Figure 2), Random Forest and Gradient Boosted Trees were the top performers, achieving 80% accuracy in predicting HIV incidence for both males and females (see Figure 3). For the male group, the most important factors for prediction, in order of significance, were: 1. Age at STI diagnosis, 2. Previous non-HIV STI, 3. Provider type, 4. Reinfection interval, 5. SVI theme four, and 6. Non-HIV STI count. Demographic features and social vulnerability indexes played a significant role.

The predictive factors for the female group were slightly different. In order of significance, they were age at STI diagnosis, ethnicity, overall SVI along with SVI themes two and three, and race. These differences highlight the complexity of the model and underscore the need to tailor predictive methods to demographic specifics (see Figure 4).

Figure 4. Influential features of our predictive models for males (left) and females (right). Each bar represents a feature's influence on the model's predictions. Longer bars indicate greater influence.


Discussion:

Our analysis of the dataset reveals sociodemographic disparities in the Southern U.S., consistent with established trends. The developed model accurately predicts HIV incidence from 2010 to 2021, showcasing advancements in data processing for multimodal data, including numerical values, categories, and arrays. This methodology, utilizing machine learning algorithms such as Autoencoders, stands out in addressing challenges posed by Electronic Medical Records (EMR).

Beyond technical contributions, our work identifies crucial determinants of HIV risk, offering insights for public health strategies. ML, "big data," and social vulnerability considerations provide a nuanced understanding of HIV transmission dynamics, surpassing traditional EMR-based and linear methods.

Age at first STI diagnosis emerges as a key predictor for both genders. For males, additional predictors include prior non-HIV STIs, provider type, and social vulnerability aspects. Notably, these factors take precedence over race and ethnicity, potentially influenced by the overrepresentation of Black individuals in our sample. For females, ethnic background is a significant predictor, along with STI-related variables and social vulnerability indices.

However, our study acknowledges limitations, including data integrity concerns, potential biases in undersampling and oversampling, and the challenge of model explainability. The model's applicability is specific to a Southern U.S. county, limiting generalizability. Future research should validate predictive models in diverse populations, explore uncovered aspects of STI and HIV incidence patterns, and ensure ethical deployment in real-world public health interventions, prioritizing equity and privacy.
