Predicting heart disease using the Behavioral Risk Factor Surveillance System 2018
- sam33frodon
- Dec 28, 2020
Updated: Dec 30, 2020
The project used the following R packages:
Data manipulation: foreign, gtools, tidyverse, questionr, MASS, DataExplorer, forcats
Univariate and bivariate analysis: tidyverse, gtools, corrplot, scales, DescTools
Data visualization: ggplot2, ggridges, scales, treemapify, waffle, ggimage, extrafont, png, plotly, treemap, echarts4r, gridExtra, grid
Feature selection: CORElearn, FSelector, RWeka, Boruta, caret, randomForest, MASS
Prediction: caret, rpart, mltools, ipred, C50, randomForest, mlr, e1071
All the code can be found on GitHub:
https://github.com/MinhDg/Predicting-heart-disease-using-the-Behavioral-Risk-Factor-Surveillance-System-BRFSS-2018
The term cardiovascular disease (CVD) refers to a range of diseases that affect the heart and blood vessels, including hypertension, coronary heart disease, cerebrovascular disease, heart failure, and other heart diseases. Heart diseases are the leading cause of death globally. For example, Figure 1 shows that cardiovascular diseases and cancers were the leading causes of death in 2017. The number of deaths caused by CVD increased from 1990 to 2017.

Heart disease has become a significant concern. However, it is challenging to identify because many risk factors contribute to it, such as diabetes, lifestyle, and chronic health conditions. Such constraints have motivated scientists to use data mining and machine learning to predict the illness. It has been found that 80% of cardiovascular disease can be attributed to well-known risk factors, such as age, gender, ethnic group, tobacco use, high blood pressure, high cholesterol, and diabetes. Therefore, it is vital to identify the factors behind the remaining 20%.
A total of 437,436 records were collected from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) database for the year 2018. The raw data contain 275 columns.
The characteristics of the sample were examined by calculating frequencies of categorical variables and the mean value of a continuous variable. The outcome of interest, MICHD, was defined as a “Yes” response to having been “ever diagnosed with” at least one of the following: myocardial infarction/heart attack, angina or coronary heart disease. Of the 279572 individuals included in this analysis, 24594 respondents had MICHD (8.8%, Figure below).
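The prevalence quoted above follows directly from the two reported counts. The base R sketch below reproduces that arithmetic and shows one way to tabulate the frequencies of a categorical variable. The `michd` vector is made-up toy data for illustration, not BRFSS records.

```r
# Counts reported in the text
n_total <- 279572
n_michd <- 24594

# Prevalence of MICHD as a percentage of the analysed sample
prevalence <- 100 * n_michd / n_total
round(prevalence, 1)

# Frequencies of a categorical variable (toy data, hypothetical)
michd <- factor(c("No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No"))
freq <- table(michd)                        # absolute counts per level
pct  <- round(100 * prop.table(freq), 1)    # relative frequencies in percent
```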

For this project, the pre-selection of attributes is based on the literature. It has been shown that there are many risk factors associated with heart diseases, such as sociodemographic and socioeconomic attributes (age, gender, race, education, demography, income, and socioeconomic status), lifestyle (drinking alcohol, smoking, exercise or physical training, and sleep duration), dental health, health-related quality of life (HRQOL) measures, and health issues (obesity, hypertension, diabetes, mental distress, and depression). These variables are included in the project. The initial 275 attributes were reviewed and reduced to 28 relevant features.


The self-reported presence of a chronic health condition was defined as the participant having been told by a doctor that they had that condition. In our analysis, these diseases included asthma (ASTHMA), arthritis (ARTHRITIS), cancer (CANCER), depressive disorder (DEPRESSION), diabetes (DIABETE), kidney disease (KIDNEY), chronic obstructive pulmonary disease (PULMOND), and stroke (STROKE).
33% of respondents had arthritis. One fifth had a depressive disorder. 17% had cancer, 14% were asthmatic, 14% mentioned having diabetes, 8% had pulmonary disease, 4% had a stroke, and 4% had kidney disease. 37.85% of participants in the BRFSS data had no chronic illness. 29.24% of survey respondents stated that they had exactly one of the nine chronic diseases.
Many people suffered from multiple chronic health conditions. For example, 32% of participants had at least two diseases, and about 15% had at least three chronic health conditions.
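Shares such as "at least two diseases" can be obtained by counting conditions per respondent. The following is a minimal base R sketch, assuming each condition is coded as a 0/1 indicator. The data frame is toy data and uses only three of the condition names for brevity.

```r
# Toy indicator data (1 = condition reported); hypothetical, not BRFSS records
conditions <- data.frame(
  ARTHRITIS = c(1, 0, 1, 0, 1, 0),
  DIABETE   = c(0, 0, 1, 0, 1, 0),
  STROKE    = c(0, 0, 1, 0, 0, 0)
)

# Number of chronic conditions reported by each respondent
n_cond <- rowSums(conditions)

# Share of respondents with no condition, at least one, and at least two
share_none   <- mean(n_cond == 0)
share_one_up <- mean(n_cond >= 1)
share_two_up <- mean(n_cond >= 2)
```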
Among chronic health conditions, DIABETE, PULMOND, and STROKE are the most strongly associated with the outcome MICHD. 17% of people with arthritis had MICHD, whereas 7% of arthritis-free respondents mentioned having MICHD. About 18% of cancer patients reported having MICHD, as did 23% of respondents with diabetes and 30% of participants living with pulmonary disease. Among people who said that they had had a stroke, 36% reported having MICHD. Other chronic health conditions, such as asthma (ASTHMA) and depressive disorder (DEPRESSION), are weakly linked with MICHD.
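Figures of the form "X% of people with a condition had MICHD" are row percentages of a two-way table. A small base R sketch of that calculation follows. The `toy` data frame is invented for illustration and does not reproduce the BRFSS percentages.

```r
# Toy data (hypothetical): stroke status and MICHD outcome for 10 respondents
toy <- data.frame(
  STROKE = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  MICHD  = c("Yes", "Yes", "No",  "No",  "Yes", "No", "No", "No", "No", "No")
)

# Cross-tabulate, then take row percentages:
# each row gives the MICHD distribution within one STROKE group
tab     <- table(STROKE = toy$STROKE, MICHD = toy$MICHD)
row_pct <- 100 * prop.table(tab, margin = 1)

# Prevalence of MICHD among respondents without and with a stroke
row_pct[, "Yes"]
```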

The prevalence of heart disease increases substantially with age. Indeed, the rate of MICHD among people younger than 40 years of age is less than 2%. The highest percentage of people experiencing MICHD is observed for the most senior citizens (“80+”, 23.59%). Heart disease was more prevalent among males (11.1%) than females (6.7%).
Compared to people with graduate degrees (“16+”), those with lower academic achievement appeared to have a higher risk of heart disease. Indeed, the prevalence of MICHD was 16.2% for those who have less than eight years of schooling. Only 6.5% of people who have a college education or higher were diagnosed with heart disease.
The prevalence of MICHD among unemployed and retired respondents was twice that among employed respondents. Students and homemakers had the lowest risk of having heart disease (3.74%).
Lifestyle factors such as drinking, physical inactivity, and smoking are modifiable risk factors.

Not only sleep deprivation (≤ 6 hours per day) but also oversleeping (≥ 9 hours per day) was associated with an increased risk of MICHD.
Smoking status was categorized as currently smokes every day, currently smokes some days, former smoker, and has never smoked. Just over one half of the sample (56.54%) never smoked cigarettes (Figure 7b). The proportions of daily smokers and someday smokers are 10.46% and 4.19%, respectively. 28.81% of the participants had quit smoking. Almost three quarters (73.7%) exercised within the past month, while 22.34% were sedentary (Figure 7c). Most interviewees sleep for 6 to 8 hours. The sleep duration variable (SLEPTIM1) is approximately normally distributed (Figure 7d and Figure S6).
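The sleep thresholds mentioned above (≤ 6 hours and ≥ 9 hours) can be turned into a categorical variable with `cut()`. This is only a sketch: it assumes sleep is recorded in whole hours, and the `sleep_hours` values and category labels are invented for illustration.

```r
# Toy sleep durations in whole hours per day (hypothetical values)
sleep_hours <- c(4, 6, 7, 8, 9, 12)

# Right-closed intervals: (0, 6], (6, 8], (8, 24]
sleep_cat <- cut(
  sleep_hours,
  breaks = c(0, 6, 8, 24),
  labels = c("<=6 h", "7-8 h", ">=9 h")
)

# Frequency of each sleep category
table(sleep_cat)
```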

The 2018 BRFSS survey included questions, developed by the CDC, regarding people’s perceptions of their general, physical, and mental health.
We observe a decrease in the prevalence of heart disease as the level of well-being increases from poor to excellent. Indeed, participants who rated their health as poor had the highest prevalence of MICHD (33.2%). In contrast, less than 10% of respondents who were in good health reported having MICHD.

In the BRFSS 2018 data, interviewees reported any loss of permanent teeth because of tooth decay or gum disease, excluding teeth lost because of injury or orthodontics.
Among people with no missing teeth, only 4.5% had MICHD. For people who were missing more than six teeth, the prevalence of MICHD was about four times as high (18.1%). The percentage increased to 24% among people with complete edentulism. A similar trend is observed for the variable LASTDENTV. The prevalence of MICHD among respondents who last visited a dental clinic more than five years ago was twice that among those who visited a dental clinic less than one year ago.

Using values of χ² and Cramér’s V to evaluate the strength of the association between variables, it appears that three variables, namely AGE (age), GENHLTH (self-perceived general health), and RMVTETH (number of removed teeth), have a strong association with most pathological conditions covered in the BRFSS 2018 data.
For example, the prevalence of disease increases with age. Arthritis and cancer appear to be affected the most by age. Arthritis is also the condition most strongly related to self-perceived general health. The prevalence of depression and diabetes decreases in participants reporting their health as very good or excellent. Edentulism is associated with the different diseases to varying degrees. The percentage of participants diagnosed with a disease increases with the number of missing teeth. Once again, arthritis is affected the most by the number of missing teeth.
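For reference, Cramér's V is derived from the χ² statistic of a contingency table as V = sqrt(χ² / (n · (min(r, c) − 1))), where n is the sample size and r and c are the numbers of rows and columns. The base R sketch below implements this formula directly. The helper name `cramers_v` and the toy vectors are illustrative assumptions, not the project's code (the project lists DescTools among its packages, which may have been used instead).

```r
# Cramér's V for two categorical vectors, from the Pearson chi-squared statistic
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # correct = FALSE: no continuity correction, so 2x2 tables use plain Pearson chi-squared
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

# Toy data (hypothetical): arthritis becomes more common in older age groups
age_group <- c(rep("18-39", 40), rep("40-64", 40), rep("65+", 40))
arthritis <- c(rep("Yes", 4),  rep("No", 36),
               rep("Yes", 14), rep("No", 26),
               rep("Yes", 22), rep("No", 18))

v <- cramers_v(age_group, arthritis)
```

V ranges from 0 (no association) to 1 (perfect association), which makes it convenient for comparing pairs of variables with different numbers of categories.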
In summary, heart disease is more prevalent for participants with the following characteristics:
• Older age, with fewer years of education, and a lower income
• Sleep deprivation or over-sleeping, former smoker, daily smoker
• Poor general and physical health
• Underweight, overweight
• Poor oral health
• Having conditions such as stroke, arthritis, kidney disease, and diabetes

Feature selection was performed using several methods, including Information Gain, Recursive Feature Elimination, Boruta, and Random Forest. We also used Backward and Forward Stepwise Elimination. Only the balanced training data was used for the feature selection step. The output of the Boruta method indicated that all 31 variables are important. The Backward and Forward Stepwise Elimination method (package MASS) indicated that the drinking factor (ALCpa30) should be dropped when building models. To select important attributes, those with importance values higher than the median were retained (16 attributes). Furthermore, we found eight attributes common to the five sets of attributes. When running a few machine learners such as Logistic Regression, C5.0, and Random Forest, we observed only a slight difference in performance (such as accuracy and the Kappa statistic). Therefore, the final model contains only eight variables: AGE, EMPL, GENHLTH, DIABETE, PULMOND, STROKE, RMVTETH, and LASTDENTV.
Classification methods are widely used in machine learning, pattern recognition, and artificial intelligence. These methods have numerous applications, which include risk analysis, credit card fraud detection, target marketing, manufacturing, and medical diagnosis. Our work uses classifiers such as Decision Tree, Random Forest, Naïve Bayes, and k-nearest neighbors (KNN) to detect the presence of heart disease in participants. Table 6 shows the performance of the competing algorithms. It appears that these classifiers provided similar performance. However, KNN takes about 26 minutes to run, which is much longer than any other classifier.
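The "above the median, then common to all sets" rule described above can be sketched in a few lines of base R. The importance scores below are invented for illustration and only two methods are shown. The project's actual scores came from packages such as FSelector, Boruta, and randomForest.

```r
# Hypothetical importance scores from two selection methods (not the project's values)
imp_infogain <- c(AGE = 0.09, GENHLTH = 0.07, STROKE = 0.05, EMPL = 0.02, ALCpa30 = 0.01)
imp_rf       <- c(AGE = 120,  GENHLTH = 50,   STROKE = 95,   EMPL = 60,   ALCpa30 = 10)

# Keep the attributes whose importance exceeds the median score for that method
above_median <- function(imp) names(imp)[imp > median(imp)]

sets <- list(
  infogain = above_median(imp_infogain),
  rf       = above_median(imp_rf)
)

# Attributes selected by every method
common <- Reduce(intersect, sets)
```

With more than two methods, `Reduce(intersect, sets)` still returns the attributes present in every set, which corresponds to the "common attributes" idea in the text.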
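Accuracy and the Kappa statistic, the two performance measures named earlier, can both be derived from a confusion matrix. The sketch below computes them by hand in base R on invented labels. The project itself most likely obtained them from caret, so this is only an illustration of the definitions.

```r
# Hypothetical actual and predicted labels for a small test set
lv        <- c("No", "Yes")
actual    <- factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"), levels = lv)
predicted <- factor(c("Yes", "Yes", "Yes", "No",  "No", "No", "No", "No", "Yes", "Yes"), levels = lv)

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(Predicted = predicted, Actual = actual)
n  <- sum(cm)

# Observed agreement (accuracy)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, scaled by the maximum possible
kappa_stat <- (accuracy - expected) / (1 - expected)
```

Kappa is useful alongside accuracy when classes are imbalanced, because it discounts the agreement a classifier would reach by chance alone.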

To evaluate the stability of the classifiers, we fed ten testing sets into each classifier. The performance of each classifier is given in the Appendix (see the report). We used the Friedman test to check whether any of the machine learning algorithms was more accurate than the others. Since the Friedman test indicates significance (χ²(4) = 35.44, p < 0.01), it is meaningful to conduct multiple comparisons in order to identify differences between the algorithms. According to the Nemenyi post hoc test for multiple joint samples, Random Forest differs significantly from Logistic Regression, and Decision Tree differs significantly from Logistic Regression. No other pairwise comparison was significant.
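The comparison described above can be sketched with base R's `friedman.test`, treating each test set as a block and each classifier as a group. The accuracy values below are invented, and the five column names are an assumption based on the classifiers mentioned in the text. The last lines compute the Nemenyi critical difference for mean ranks, using the tabulated critical value 2.728 for five groups at α = 0.05.

```r
# Hypothetical accuracies: 10 test sets (rows) x 5 classifiers (columns)
base <- 0.70 + (0:9) / 100
acc <- cbind(
  LogisticRegression = base,
  NaiveBayes         = base + 0.002,
  KNN                = base + 0.004,
  DecisionTree       = base + 0.006,
  RandomForest       = base + 0.008
)

# Friedman test: rows are blocks (test sets), columns are groups (classifiers)
ft <- friedman.test(acc)
ft$statistic   # Friedman chi-squared
ft$parameter   # degrees of freedom = number of classifiers - 1
ft$p.value

# Mean rank of each classifier across test sets (rank 1 = lowest accuracy)
ranks      <- t(apply(acc, 1, rank))
mean_ranks <- colMeans(ranks)

# Nemenyi critical difference: two classifiers differ significantly
# if their mean ranks differ by more than cd
k       <- ncol(acc)
n       <- nrow(acc)
q_alpha <- 2.728   # tabulated critical value for k = 5, alpha = 0.05
cd      <- q_alpha * sqrt(k * (k + 1) / (6 * n))
```

In this toy example every test set ranks the classifiers in the same order, so the test is as significant as it can be for this design. Real results would show more variation between test sets.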

Descriptive analysis showed that 69% of respondents lived in urban geographic locales, 48% were males, and 63% were older than 50 years. 40.95% were university graduates with at least a four-year degree, 55.14% were employed, and 57.59% were either married or part of an unmarried couple. Drinking in the past 30 days was reported by 56% of respondents, while about 15% reported being current smokers. While 83.31% self-reported their health as good to excellent, 62.1% had at least one chronic disease.
Strong associations between certain variables were found. Age appeared to be strongly connected to many other risk factors, such as relationship status, employment status, mental health, number of removed teeth, arthritis, and diabetes. The income attribute was found to be correlated with the highest number of variables. These variables fall into the following categories: demographics (relationship status, years of schooling, employment situation), lifestyle (physical activity and alcohol consumption), health-related quality of life attributes (general health and physical health), health care access, and some chronic diseases such as arthritis, depressive disorder, and pulmonary disease. Among chronic health conditions, arthritis, diabetes, pulmonary disease, and stroke are strongly associated with the response variable MICHD.