Exploratory Data Analysis

Motivation

As the final project of the Modeling of Biomedical Systems course, students were tasked with performing an exploratory data analysis in R.

Challenge

Students were to pick a dataset of their choice, then build and evaluate three predictive models on it.

Solution

The K-nearest neighbor model was the most successful at predicting cost: it placed a medical bill in the correct cost group 99.5% of the time.

Approach

Our group chose a dataset of medical costs to explore and decided to use the following predictive models: multiple linear regression, K-nearest neighbor, and decision tree. Since our goal was an accurate predictive model of medical costs, we decided not to predict each bill to the dollar amount. Instead, we split bills into three groups (high, medium, and low cost) based on the existing cost data. After analysis, we determined that the K-nearest neighbor model performed best, placing a bill in the right cost group 99.5% of the time. Although this dataset was not particularly large, we were able to demonstrate our knowledge of R for data analysis.
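
The grouping-then-classifying idea described above can be sketched in R roughly as follows. This is a minimal illustration, not the project's actual code: the data are synthetic, the column names (age, bmi, charges) are hypothetical, and the tercile cut points, the 70/30 split, and k = 5 are all assumptions made only for the example. The accuracy it computes says nothing about the project's reported results.

```r
library(class)  # provides knn(); a recommended package bundled with R

# Synthetic stand-in data. Column names and the cost formula are assumptions.
set.seed(42)
n <- 300
bills <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = round(rnorm(n, mean = 30, sd = 5), 1)
)
bills$charges <- 2000 + 250 * bills$age + 100 * bills$bmi + rnorm(n, sd = 1500)

# Split charges into low / medium / high at the 1/3 and 2/3 quantiles
# (assumed cut points; the original grouping rule is not specified).
breaks <- quantile(bills$charges, probs = c(0, 1/3, 2/3, 1))
bills$cost_group <- cut(
  bills$charges,
  breaks = breaks,
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)

# Hold out 30% of the rows for testing.
test_idx <- sample(n, size = round(0.3 * n))
train_x <- bills[-test_idx, c("age", "bmi")]
test_x  <- bills[test_idx, c("age", "bmi")]

# KNN is distance-based, so standardize predictors using training statistics.
train_scaled <- scale(train_x)
test_scaled  <- scale(
  test_x,
  center = attr(train_scaled, "scaled:center"),
  scale  = attr(train_scaled, "scaled:scale")
)

# Classify each held-out bill by majority vote among its 5 nearest neighbors.
predicted <- knn(
  train = train_scaled,
  test  = test_scaled,
  cl    = bills$cost_group[-test_idx],
  k     = 5
)

# Fraction of held-out bills placed in the correct cost group.
accuracy <- mean(as.character(predicted) == as.character(bills$cost_group[test_idx]))
```

Treating the problem as classification into three groups means a model is scored on whether it picks the right group, rather than on how many dollars its estimate is off by.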

Skills Demonstrated: R, Statistics
Project Artifacts: Exploratory Data Analysis (EDA)

Benefits

Analysis of medical cost data is useful to insurance companies and healthcare practitioners alike. By exploring how different parameters affect the price of a medical bill, we can better predict insurance costs and identify trends in those costs.