Healthcare Price Prediction Model
Objectives
The goal is to build a model that predicts whether or not a customer's healthcare cost will be expensive.
- To deliver the results through a Shiny Apps website.
- To generate actionable insights for the HMO (Health Maintenance Organization) on how to lower the cost of healthcare.
Process and Results
1. Data Cleaning
- Dealing with missing data points: We removed 158 missing data points in the bmi and hypertension variables.
- Inspecting the data set: There are 7,502 examples with 14 features related to personal health information (cost, age, bmi, number of children, etc.).
- Performing binning and transformation on variables
- Numeric to Categorical: age, bmi
- Categorical to Logical: education level, the number of children
- Setting the boundary for expensive or not expensive
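The cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the project's own code (the project is delivered as a Shiny app). The field names `children` and `education_level`, the "No College Degree" label, the decade-wide age bins, the standard BMI cut points (18.5, 25, 30), and the use of $6,000 as the expensive boundary are all assumptions made for the example.

```python
# Assumed boundary, taken from the $6,000 cut-off used in the charts.
EXPENSIVE_THRESHOLD = 6000


def age_group(age):
    """Numeric to categorical: bin age into decades, e.g. 53 -> '50-59'."""
    low = (int(age) // 10) * 10
    return f"{low}-{low + 9}"


def bmi_group(bmi):
    """Numeric to categorical: standard BMI categories (assumed bins)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obesity"


def clean(records):
    """Drop rows with missing bmi or hypertension, then add derived columns."""
    cleaned = []
    for r in records:
        if r.get("bmi") is None or r.get("hypertension") is None:
            continue  # remove rows with missing data points
        cleaned.append({
            **r,
            "age_group": age_group(r["age"]),
            "bmi_group": bmi_group(r["bmi"]),
            # Categorical to logical (hypothetical source column names).
            "have_child": r["children"] > 0,
            "is_educated": r["education_level"] != "No College Degree",
            # Target label: expensive or not expensive.
            "expensive": r["cost"] > EXPENSIVE_THRESHOLD,
        })
    return cleaned
```

Keeping the threshold in one named constant makes it easy to revisit the boundary later, for example after the clustering step.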
2. Exploratory Data Analysis with Visualization
- Bar charts demonstrated that the percentage of customers paying more than $6,000 can vary among different groups.
- have_child, is_educated, location_type, yearly_physical, and married don’t show a significant difference.
- Histograms and box plots displayed the distribution of customers who pay more or less than $6,000 by each variable and visualized the statistical summary.
- People with the following health conditions (or personal information) usually paid more for their healthcare:
- Age: 50-59
- BMI: obesity
- Location type: urban
- Exercise: Not-active (no exercise)
- Smoker: Yes
- yearly_physical: No (no well visit with their doctor during the year)
- gender: male
- is_educated: yes
- married: yes
- have_child: yes
- Mapping showed the distribution based on geographical information.
- People who live in New York State have a higher chance of paying more than $6,000 for their healthcare.
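The group comparisons behind the bar charts amount to computing, for each level of a variable, the share of customers whose cost exceeds $6,000. A minimal sketch in Python, assuming each customer is a dictionary with a `cost` field and a grouping field such as `smoker` (both names are illustrative):

```python
from collections import defaultdict


def expensive_share_by_group(records, group_key, threshold=6000):
    """Return the percentage of records above the threshold for each group."""
    totals = defaultdict(int)
    expensive = defaultdict(int)
    for r in records:
        group = r[group_key]
        totals[group] += 1
        if r["cost"] > threshold:
            expensive[group] += 1
    return {group: 100.0 * expensive[group] / totals[group] for group in totals}
```

Calling `expensive_share_by_group(customers, "smoker")` on the cleaned data would give the numbers a bar chart for the smoker variable plots; a large gap between groups suggests the variable is worth keeping as a predictor.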
3. Data Modeling
The prediction model simulation is available in Shiny Apps: Link
Modeling: We applied supervised and unsupervised algorithms and compared the results, using accuracy and sensitivity rates, to find the best prediction model. The algorithms and results are listed below.
- Linear Regression
- We used numeric predictors (age, bmi, and the number of children)
- Although all of the predictors are significant, the model explains only 15.69% of the variation in the data set, which is quite low.
- The sensitivity is 0.8705, which means the model correctly identified about 87% of the expensive healthcare prices in the data set.
- The accuracy rate is 77%, which is less than the ‘No Information’ rate (80.29%).
- Based on these results, we would say the linear model is not a good prediction model in general.
- Decision Tree
- Building Decision Tree Model using all features except hypertension
- Building Decision Tree Model using all features that are converted to categorical.
- Remove gender, have_child, hypertension.
- SVM Model: Building an SVM model using all features except hypertension, converted to categorical.
- Sensitivity and accuracy are lower than those of the Decision Tree model.
- Further exploration with unsupervised machine learning: We used unsupervised learning models (association rule mining and K-Means clustering) to gain more insight into cost and features.
- Association rule mining: Checking the importance of each variable to find the best predictors for the prediction model.
- K-Means Clustering: According to association rule mining and the bar graphs, bmi might be the most significant predictor of cost. Therefore, we create clusters using bmi and cost and, if needed, adjust the cost boundary that decides whether the healthcare price is expensive or not.
- We divided the data set into five clusters. The lowest cost in the first cluster at the top, $12,282, could be considered as the boundary.
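The idea of reading a cost boundary off the top cluster can be illustrated with a small K-Means sketch. This is a simplification under stated assumptions: it clusters on cost alone (the analysis above clustered on bmi and cost), uses evenly spaced starting centroids so the result is deterministic, and the function names are hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension with deterministic starting points."""
    ordered = sorted(values)
    # Start each centroid at the middle of one of k equal slices of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


def top_cluster_floor(values, k):
    """Lowest value in the highest-cost cluster, a candidate cost boundary."""
    centroids, clusters = kmeans_1d(values, k)
    top = max(range(k), key=lambda i: centroids[i])
    return min(clusters[top])
```

With `k=5` on the real cost column, `top_cluster_floor` would return the kind of figure quoted above. Clustering on both bmi and cost would additionally require deciding whether to rescale the two variables, since cost is several orders of magnitude larger than bmi.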
Conclusion: Based on the results of the modeling and further exploration steps, the final model contains the following predictors and machine learning algorithm.
- Predictors: age, bmi, smoker, exercise, expensive
- Algorithm: Decision Tree
- Accuracy: 96.84%
- Sensitivity: 98.81%
- Notes
- The simulation requires the train and test data sets as input. Find them at the Link
- The expensive predictor indicates whether or not the healthcare cost of each customer is expensive.
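For reference, the metrics used to compare the models (accuracy, sensitivity, and the No Information rate) can be computed from actual and predicted labels as in the sketch below. It assumes boolean labels with True meaning expensive; which class counts as "positive" in the reported figures is not stated in this document, so the `positive` parameter is an assumption.

```python
from collections import Counter


def classification_metrics(actual, predicted, positive=True):
    """Return accuracy, sensitivity and the no-information rate."""
    n = len(actual)
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    positives = true_pos + false_neg
    return {
        # Share of all predictions that are correct.
        "accuracy": correct / n,
        # Share of actual positives that the model caught.
        "sensitivity": true_pos / positives if positives else float("nan"),
        # Accuracy of always guessing the most common class.
        "no_information_rate": max(Counter(actual).values()) / n,
    }
```

A model is only useful if its accuracy beats the no-information rate, which is why the linear model (77% against 80.29%) was rejected.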
4. Business Suggestions
- Only certain factors are needed to build a model with great sensitivity, and therefore the cost for collecting and processing the less significant data could be saved.
- Certain groups of people tend to pay more. Therefore, campaigns such as smoking cessation programs (targeting smokers) and regular yoga sessions (targeting the less active group) could be initiated to promote a healthy lifestyle.
- Differences among states have drawn our attention to how income level, tax, and socioeconomic status might affect health cost. Our suggestion is that a nationwide standard be set for the maximum cost of healthcare.