Healthcare Price Prediction Model

Posted Nov 29, 2022 Updated Aug 9, 2023

By Eunbi Kim

Objectives

The goal is to build a model that predicts whether or not a customer's healthcare cost will be expensive.

To deliver the results through a Shiny app website.
To generate actionable insights for an HMO (Health Maintenance Organization) on how to lower the cost of healthcare.

Process and Results

1. Data Cleaning

Dealing with missing data points: We removed 158 data points with missing values in the bmi and hypertension variables.
Inspecting the data set: There are 7,502 examples with 14 features related to personal health information (cost, age, bmi, number of children, etc.).
Performing binning and transformation on variables
- Numeric to Categorical: age, bmi
- Categorical to Logical: education level, the number of children
Setting the boundary for expensive or not expensive
- Because 80% of people spent $5,789.40 or less, we set the boundary between expensive and not expensive at $6,000.
  - People who were charged more than $6,000 are labeled as expensive, while everyone else is labeled as not expensive.
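
The binning and labeling steps above can be sketched as follows. This is a minimal illustration in Python, not the code used in the project. Only the $6,000 boundary comes from the analysis; the function names, the decade age bands, the BMI cutoffs (the commonly used 18.5 / 25 / 30 thresholds), and the sample costs are assumptions made for the example.

```python
from statistics import quantiles

EXPENSIVE_BOUNDARY = 6000  # boundary chosen in the post, just above the 80th percentile


def percentile_80(costs):
    """Return the 80th percentile of a list of costs."""
    # quantiles(n=5) returns the 20th, 40th, 60th and 80th percentile cut points
    return quantiles(costs, n=5, method="inclusive")[-1]


def label_expensive(cost, boundary=EXPENSIVE_BOUNDARY):
    """True when the customer was charged more than the boundary."""
    return cost > boundary


def bin_age(age):
    """Numeric to categorical: decade bands such as '50-59' (assumed edges)."""
    decade = (age // 10) * 10
    return f"{decade}-{decade + 9}"


def bin_bmi(bmi):
    """Numeric to categorical: assumed cutoffs, not confirmed by the post."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obesity"


# Made-up costs for illustration only
costs = [1200, 2500, 3100, 4800, 5789, 6100, 15000]
labels = [label_expensive(c) for c in costs]
```

Rounding the observed percentile up to a round number, as done here with $6,000, keeps the label easy to explain while leaving roughly 80% of customers in the not expensive class.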

2. Exploratory Data Analysis with Visualization

Bar Charts demonstrated that the percentages of customers paying more than $6,000 can vary among different groups.

The variables have_child, is_educated, location_type, yearly_physical, and married showed no significant difference between groups.
Histograms and box plots displayed the distribution of customers who pay more or less than $6,000 by each variable and visualized the statistical summary.

People with the following health conditions (or personal characteristics) usually paid more for their healthcare.
- Age: 50-59
- BMI: obesity
- Location type: urban
- Exercise: Not-active (no exercise)
- Smoker: Yes
- yearly_physical: No (no well visit with their doctor during the year)
- gender: male
- is_educated: yes
- married: yes
- have_child: yes
Mapping showed the distribution based on geographical information.
- People who live in New York State have a higher chance of paying more than $6,000 for their healthcare.
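
The comparison behind the bar charts, the share of customers paying more than $6,000 within each group, can be sketched in a few lines. This is an illustrative Python sketch under assumptions: the record layout (dictionaries with a cost key), the function name, and the sample rows are made up and do not come from the project's data set.

```python
from collections import defaultdict


def expensive_share_by_group(records, group_key, boundary=6000):
    """Map each value of group_key to the share of customers whose cost exceeds boundary."""
    totals = defaultdict(int)
    expensive = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row["cost"] > boundary:
            expensive[group] += 1
    return {group: expensive[group] / totals[group] for group in totals}


# Made-up records for illustration only
sample = [
    {"smoker": "yes", "cost": 9200},
    {"smoker": "yes", "cost": 4100},
    {"smoker": "no", "cost": 2300},
    {"smoker": "no", "cost": 7000},
    {"smoker": "no", "cost": 1800},
    {"smoker": "no", "cost": 3500},
]
shares = expensive_share_by_group(sample, "smoker")
```

Comparing shares rather than raw counts matters because the groups differ in size; a small group with few expensive customers can still have a high expensive rate.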

3. Data Modeling

The prediction model simulation is available as a Shiny app: Link

Modeling: We applied supervised and unsupervised algorithms and compared the results, using accuracy and sensitivity rates, to find the best prediction model. The algorithms and results are summarized below.
1. Linear Regression
  - We used numeric predictors (age, bmi, and the number of children)
  - Although all of the predictors are significant, the model explains only 15.69% of the variance in the data set, which is quite low.
  - The sensitivity is 0.8705, which means the model correctly identified about 87% of the expensive healthcare prices in the data set.
  - The accuracy rate is 77%, which is less than the ‘No Information’ rate (80.29%).
  - Based on these results, the linear model is not a good prediction model in general.
2. Decision Tree
  1. Building Decision Tree Model using all features except hypertension
    - The sensitivity is 0.9714, a significant improvement over the linear model. The accuracy rate (90.84%) is also higher than the no-information rate.
    - Because this model used every feature, we tried to simplify it by turning the numeric variables into categorical variables.
  2. Building Decision Tree Model using all features that are converted to categorical.
    - With the simplified features, the sensitivity rises to 0.9725 and the accuracy (91.33%) stays well above the no-information rate.
    - We then removed the less important predictors and built the same model again.
  3. Building Decision Tree Model after removing gender, have_child, and hypertension.
    - The sensitivity rises to 0.9758 after removing these predictors.
3. SVM Model: Building SVM Model using all features except hypertension, converted to categorical.
  - The sensitivity and accuracy rates are lower than those of the Decision Tree Model.
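
The three figures used to compare the models above (accuracy, sensitivity, and the no-information rate) can all be derived from the actual and predicted labels. The following Python sketch is an illustration rather than the project's evaluation code. The function name and the sample labels are assumptions, and because the post does not state which class was treated as positive when sensitivity was computed, the positive class is left as a parameter.

```python
from collections import Counter


def classification_metrics(actual, predicted, positive=True):
    """Return accuracy, sensitivity and no-information rate for two label lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("no positive cases in actual")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return {
        # Share of all cases predicted correctly
        "accuracy": correct / n,
        # Share of actual positive cases that the model caught
        "sensitivity": true_pos / (true_pos + false_neg),
        # Accuracy of always predicting the most common class
        "no_information_rate": max(Counter(actual).values()) / n,
    }


# Made-up labels for illustration only: True means "expensive"
actual = [True, True, False, False, False, False, False, False, False, False]
predicted = [True, False, False, False, False, False, False, False, False, True]
metrics = classification_metrics(actual, predicted)
```

In this toy example the accuracy only matches the no-information rate, which is the situation that led to rejecting the linear model: a model that cannot beat "always predict the majority class" adds little.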
Further Exploration with unsupervised machine learning: We used unsupervised learning models (Association Rule Mining and K-Means clustering) to gain more insight into cost and features.
1. Association Rule Mining: checking the importance of each variable to find the best predictors for the prediction model.
  - The following table shows the top 5 rules (sorted by support) in the Association Rule Mining model.
  - The most supported association indicated that expensiveness relates to bmi and smoker.
2. K-Means Clustering: According to association rule mining and the bar graph, bmi might be the most significant predictor of cost. Therefore, we created clusters using bmi and cost so that the cost boundary between expensive and not expensive could be adjusted if needed.
  - We divided the data set into five clusters. The lowest cost in the first cluster at the top, $12,282, could be considered as the boundary.
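
The clustering idea above, grouping customers by bmi and cost and reading a candidate boundary off the highest-cost cluster, can be sketched with a plain implementation of Lloyd's k-means algorithm. This Python sketch is illustrative only. The function names, the deterministic initialization, the unscaled features (so cost dominates the distance), and the sample points are all assumptions; the post does not say how the clusters were initialized or whether the features were scaled, and the example uses three clusters on toy data instead of five.

```python
def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm on (bmi, cost) pairs; returns a cluster index per point."""
    # Deterministic start (an assumption): k points evenly spaced in cost order.
    # Library implementations usually use random or k-means++ initialization.
    ordered = sorted(points, key=lambda p: p[1])
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [ordered[round(i * step)] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment


def boundary_from_top_cluster(points, assignment):
    """Lowest cost inside the cluster whose members have the highest mean cost."""
    clusters = {}
    for (_bmi, cost), a in zip(points, assignment):
        clusters.setdefault(a, []).append(cost)
    top = max(clusters.values(), key=lambda costs: sum(costs) / len(costs))
    return min(top)


# Made-up (bmi, cost) pairs for illustration only
points = [(22, 1000), (24, 1500), (31, 5000), (29, 5500), (35, 13000), (38, 15000)]
assignment = kmeans(points, k=3)
boundary = boundary_from_top_cluster(points, assignment)
```

A boundary found this way is driven by the data rather than by a chosen percentile, which is why it can differ substantially from the $6,000 cutoff used for labeling.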
Conclusion: Based on the results of the modeling and further exploration steps, the final model contains the following predictors and machine learning algorithm.
1. Predictors: age, bmi, smoker, exercise, expensive
2. Algorithm: Decision Tree
3. Accuracy: 96.84%
4. Sensitivity: 98.81%
5. Notes
  - The simulation requires the train and test data sets as input. Find them at the Link
  - The expensive variable indicates whether or not the healthcare cost of each customer is expensive.

4. Business Suggestions

Only certain factors are needed to build a model with high sensitivity, so the cost of collecting and processing the less significant data could be saved.
Certain groups of people tend to be charged more. Therefore, campaigns such as smoking cessation programs (targeting smokers) and regular yoga sessions (targeting the less active group) could be initiated to promote a healthy lifestyle.
Differences among states drew our attention to how income level, taxes, and socioeconomic status might affect healthcare costs. Our suggestion is that a nationwide standard be set for the maximum cost of healthcare.


This post is licensed under CC BY 4.0 by the author.