Ensemble Methods for Identifying Heart Disease


Heart disease is the leading cause of death worldwide. Many heart problems stem from a condition called atherosclerosis, in which a substance known as plaque builds up on the inner walls of the blood vessels, narrowing the arteries and making the heart work harder to pump blood. This can lead to a heart attack or a stroke.

A heart attack occurs when the blood flow to a part of the heart is blocked by a blood clot. An ischemic stroke (the most common type of stroke) occurs when a blood vessel that feeds the brain gets blocked, usually from a blood clot. A hemorrhagic stroke occurs when a blood vessel within the brain bursts. This is most often caused by uncontrolled hypertension (high blood pressure).

Heart disease encompasses a wide range of cardiovascular problems; several distinct diseases and conditions fall under its umbrella. Types of heart disease include:

  • Arrhythmia. An arrhythmia is a heart rhythm abnormality.
  • Atherosclerosis. Atherosclerosis is a hardening of the arteries.
  • Cardiomyopathy. This condition causes the heart’s muscles to harden or grow weak.
  • Congenital heart defects. Congenital heart defects are heart irregularities that are present at birth.
  • Coronary artery disease (CAD). CAD is caused by the buildup of plaque in the heart’s arteries. It’s sometimes called ischemic heart disease.
  • Heart infections. Heart infections may be caused by bacteria, viruses, or parasites.

Major risk factors for heart disease include:

  • High blood pressure
  • High cholesterol and low levels of high-density lipoprotein (HDL), the “good” cholesterol
  • Smoking
  • Obesity
  • Physical inactivity

Healthy lifestyle choices can help you prevent heart disease. They can also help you treat the condition and prevent it from getting worse. Your diet is one of the first areas you may seek to change.

A low-sodium, low-fat diet that’s rich in fruits and vegetables may help you lower your risk for heart disease complications. One example is the Dietary Approaches to Stop Hypertension (DASH) diet.

Likewise, getting regular exercise and quitting tobacco can help treat heart disease. Reducing alcohol consumption also helps.

In this project we will perform data analysis on the Heart Disease dataset released by the University of California, Irvine to gather meaningful insights, and then, from the given data, try to predict which parameters lead to a person being diagnosed with heart disease.

We will use standard EDA techniques for data analysis and various machine learning models, viz. Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, Extreme Gradient Boosting, KNN, and Naive Bayes, for initial prediction. We then obtain the best model by ensembling three of them, viz. Extreme Gradient Boosting, KNN, and SVM.

At the end, we evaluate all the models with standard metrics.

import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from collections import Counter
import pandas_profiling as pp

# Train/test splitting and feature scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_curve, classification_report

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stacking ensemble
from mlxtend.classifier import StackingCVClassifier

data = pd.read_csv('../input/heart-disease-uci/heart.csv')
data.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
data.describe()
              age         sex          cp    trestbps        chol         fbs     restecg     thalach       exang     oldpeak       slope          ca        thal      target
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000
mean    54.366337    0.683168    0.966997  131.623762  246.264026    0.148515    0.528053  149.646865    0.326733    1.039604    1.399340    0.729373    2.313531    0.544554
std      9.082101    0.466011    1.032052   17.538143   51.830751    0.356198    0.525860   22.905161    0.469794    1.161075    0.616226    1.022606    0.612277    0.498835
min     29.000000    0.000000    0.000000   94.000000  126.000000    0.000000    0.000000   71.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%     47.500000    0.000000    0.000000  120.000000  211.000000    0.000000    0.000000  133.500000    0.000000    0.000000    1.000000    0.000000    2.000000    0.000000
50%     55.000000    1.000000    1.000000  130.000000  240.000000    0.000000    1.000000  153.000000    0.000000    0.800000    1.000000    0.000000    2.000000    1.000000
75%     61.000000    1.000000    2.000000  140.000000  274.500000    0.000000    1.000000  166.000000    1.000000    1.600000    2.000000    1.000000    3.000000    1.000000
max     77.000000    1.000000    3.000000  200.000000  564.000000    1.000000    2.000000  202.000000    1.000000    6.200000    2.000000    4.000000    3.000000    1.000000
data.isnull().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
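All 14 columns are numeric and none contains missing values, so the data can go straight into the models without encoding or imputation.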
pp.ProfileReport(data)
(pandas-profiling renders an interactive HTML report here; the progress-widget output has been omitted)
# Separate the features from the target label
y = data['target']
x = data.drop('target', axis = 1)

# Hold out 20% of the data for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 0)

# Fit the scaler on the training set only, then apply it to the test set,
# so that no information from the test set leaks into training
scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)

print(ytest.unique())
Counter(ytrain)
[0 1]
Counter({1: 131, 0: 111})
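The training split is close to balanced (131 positive vs. 111 negative), so plain accuracy is a reasonable headline metric. Every model below repeats the same fit/predict/report pattern; purely as a sketch (not part of the original notebook), that pattern could be wrapped in one helper:

# Sketch: the fit/predict/report pattern repeated for each model below.
def evaluate(name, clf):
    clf.fit(xtrain, ytrain)
    ypred = clf.predict(xtest)
    acc = accuracy_score(ytest, ypred)
    print('Confusion Matrix')
    print(confusion_matrix(ytest, ypred))
    print(f'\nAccuracy of {name} : {acc * 100} \n')
    print(classification_report(ytest, ypred))
    return acc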
model1 = 'Logistic Regression'
lr = LogisticRegression()
lr.fit(xtrain, ytrain)
ypred = lr.predict(xtest)
lr_cm = confusion_matrix(ytest, ypred)
lr_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(lr_cm)
print('\n')
print(f'Accuracy of {model1} : {lr_acc *100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[21  6]
 [ 3 31]]


Accuracy of Logistic Regression : 85.24590163934425

              precision    recall  f1-score   support

           0       0.88      0.78      0.82        27
           1       0.84      0.91      0.87        34

    accuracy                           0.85        61
   macro avg       0.86      0.84      0.85        61
weighted avg       0.85      0.85      0.85        61
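Note that roc_curve is imported above but never used. A minimal sketch of applying it to the logistic model, using predict_proba to obtain the positive-class scores:

# Sketch: ROC curve for the logistic regression model.
yscore = lr.predict_proba(xtest)[:, 1]
fpr, tpr, thresholds = roc_curve(ytest, yscore)
plt.plot(fpr, tpr, label = 'Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle = '--', color = 'grey')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()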
model2 = 'Naive Bayes'
nb = GaussianNB()
nb.fit(xtrain, ytrain)
ypred = nb.predict(xtest)
nb_cm = confusion_matrix(ytest, ypred)
nb_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(nb_cm)
print('\n')
print(f'Accuracy of {model2} : {nb_acc * 100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[21  6]
 [ 3 31]]


Accuracy of Naive Bayes : 85.24590163934425

              precision    recall  f1-score   support

           0       0.88      0.78      0.82        27
           1       0.84      0.91      0.87        34

    accuracy                           0.85        61
   macro avg       0.86      0.84      0.85        61
weighted avg       0.85      0.85      0.85        61
model3 = 'Random Forest Classifier'
rf = RandomForestClassifier(n_estimators = 20, random_state = 2, max_depth = 5)
rf.fit(xtrain,ytrain)
ypred = rf.predict(xtest)
rf_cm = confusion_matrix(ytest, ypred)
rf_acc = accuracy_score(ytest, ypred)
print("confussion matrix")
print(rf_cm)
print("\n")
print(f"Accuracy of {model3} : {rf_acc*100}\n")
print(classification_report(ytest,ypred))
Confusion Matrix
[[15  2]
 [ 2 12]]


Accuracy of Random Forest Classifier : 87.09677419354838

              precision    recall  f1-score   support

           0       0.88      0.88      0.88        17
           1       0.86      0.86      0.86        14

    accuracy                           0.87        31
   macro avg       0.87      0.87      0.87        31
weighted avg       0.87      0.87      0.87        31
model4 = 'K Neighbors Classifier'
knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(xtrain, ytrain)
ypred = knn.predict(xtest)
knn_cm = confusion_matrix(ytest, ypred)
knn_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(knn_cm)
print('\n')
print(f'Accuracy of {model4} : {knn_acc * 100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[24  3]
 [ 4 30]]


Accuracy of K Neighbors Classifier : 88.52459016393442

              precision    recall  f1-score   support

           0       0.86      0.89      0.87        27
           1       0.91      0.88      0.90        34

    accuracy                           0.89        61
   macro avg       0.88      0.89      0.88        61
weighted avg       0.89      0.89      0.89        61
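The choice of n_neighbors = 10 is not justified anywhere in the notebook. A quick scan over k, sketched below, can sanity-check it; cross-validating on the training set would be more rigorous than scoring against the test set:

# Sketch: test-set accuracy for k = 1..20, to sanity-check k = 10.
for k in range(1, 21):
    knn_k = KNeighborsClassifier(n_neighbors = k).fit(xtrain, ytrain)
    print(k, accuracy_score(ytest, knn_k.predict(xtest)))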
model5 = 'Decision Tree Classifier'
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0, max_depth = 6)
dt.fit(xtrain, ytrain)
ypred = dt.predict(xtest)
dt_cm = confusion_matrix(ytest, ypred)
dt_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(dt_cm)
print('\n')
print(f'Accuracy of {model5} : {dt_acc * 100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[23  4]
 [ 7 27]]


Accuracy of Decision Tree Classifier : 81.9672131147541

              precision    recall  f1-score   support

           0       0.77      0.85      0.81        27
           1       0.87      0.79      0.83        34

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61
model6 = 'Support Vector Classifier'
svc = SVC(kernel = 'rbf', C = 2)
svc.fit(xtrain, ytrain)
ypred = svc.predict(xtest)
svc_cm = confusion_matrix(ytest, ypred)
svc_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(svc_cm)
print('\n')
print(f'Accuracy of {model6} : {svc_acc * 100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[23  4]
 [ 3 31]]


Accuracy of Support Vector Classifier : 88.52459016393442

              precision    recall  f1-score   support

           0       0.88      0.85      0.87        27
           1       0.89      0.91      0.90        34

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.88        61
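Here, too, C = 2 is hand-picked. A cross-validated grid search, sketched below with an illustrative parameter grid, is a more systematic way to choose C and gamma:

from sklearn.model_selection import GridSearchCV

# Sketch: cross-validated search over C and gamma (grid values are illustrative).
param_grid = {'C': [0.5, 1, 2, 4, 8], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel = 'rbf'), param_grid, cv = 5, scoring = 'accuracy')
grid.fit(xtrain, ytrain)
print(grid.best_params_, grid.best_score_)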
model7 = 'Extreme Gradient Boosting'
xgb = XGBClassifier(learning_rate = 0.01, n_estimators = 25,
                    max_depth = 15, gamma = 0.6,
                    subsample = 0.52, colsample_bytree = 0.6,
                    seed = 27, reg_lambda = 2, booster = 'dart',
                    colsample_bylevel = 0.6, colsample_bynode = 0.5
                   )
xgb.fit(xtrain, ytrain)
ypred = xgb.predict(xtest)
xgb_cm = confusion_matrix(ytest, ypred)
xgb_acc = accuracy_score(ytest, ypred)
print('Confusion Matrix')
print(xgb_cm)
print('\n')
print(f'Accuracy of {model7} : {xgb_acc * 100} \n')
print(classification_report(ytest, ypred))
Confusion Matrix
[[24  3]
 [ 3 31]]


Accuracy of Extreme Gradient Boosting : 90.1639344262295

              precision    recall  f1-score   support

           0       0.89      0.89      0.89        27
           1       0.91      0.91      0.91        34

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61
imp_features = pd.DataFrame(
    {'Feature' : ['age', 'sex', 'cp', 'trestbps', 'chol',
                  'fbs', 'restecg', 'thalach','exang',
                  'oldpeak', 'slope', 'ca', 'thal'],
    'Importance' : xgb.feature_importances_
    }
)
plt.figure(figsize = (10, 4))
plt.title('Feature importances of the XGBoost model')
plt.xlabel('Importance')
plt.ylabel('Feature')
# A bare string like 'cymkbgr' is not a valid matplotlib color; pass a list
# of single-character color codes instead (they cycle across the bars)
plt.barh(imp_features['Feature'], imp_features['Importance'], color = list('cymkbgr'))
plt.show()

(figure: bar plot of XGBoost feature importances)

model_ev = pd.DataFrame({'Model': ['Logistic Regression','Naive Bayes',
                                   'Random Forest','Support Vector Machine',
                                   'K-Nearest Neighbour','Decision Tree',
                                   'Extreme Gradient Boost'],
                         'Accuracy': [lr_acc*100, nb_acc*100,
                                      rf_acc*100, svc_acc*100,
                                      knn_acc*100,dt_acc*100,
                                      xgb_acc*100]})
model_ev
                    Model   Accuracy
0     Logistic Regression  85.245902
1             Naive Bayes  85.245902
2           Random Forest  87.096774
3  Support Vector Machine  88.524590
4     K-Nearest Neighbour  88.524590
5           Decision Tree  81.967213
6  Extreme Gradient Boost  90.163934
colors = ['red', 'green', 'blue', 'gold', 'silver', 'yellow', 'orange']
plt.figure(figsize = (12, 5))
plt.title('Accuracy comparison of the different models')
plt.xlabel('Algorithm')
plt.ylabel('Accuracy (%)')
plt.bar(model_ev['Model'], model_ev['Accuracy'], color = colors)
plt.show()

(figure: bar plot comparing model accuracies)
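Finally, we stack the three strongest models: XGBoost, KNN, and the SVC. mlxtend's StackingCVClassifier fits the base classifiers on cross-validation folds and trains the meta-classifier (here the SVC again) on their out-of-fold predictions, which avoids the leakage that stacking on in-sample predictions would introduce.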

scv = StackingCVClassifier(classifiers = [xgb, knn, svc], meta_classifier = svc, random_state = 42)
scv.fit(xtrain,ytrain)
ypred = scv.predict(xtest)
scv_cm = confusion_matrix(ytest, ypred)
scv_acc = accuracy_score(ytest, ypred)
print("confussion matrix")
print(scv_cm)
print("\n")
print("Accuracy of StackingCVClassifier:",scv_acc*100,'\n')
print(classification_report(ytest,ypred))
Confusion Matrix
[[24  3]
 [ 2 32]]


Accuracy of StackingCVClassifier: 91.80327868852459

              precision    recall  f1-score   support

           0       0.92      0.89      0.91        27
           1       0.91      0.94      0.93        34

    accuracy                           0.92        61
   macro avg       0.92      0.92      0.92        61
weighted avg       0.92      0.92      0.92        61
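The stacked ensemble reaches about 91.8% accuracy on the held-out test set, edging out the best individual model, Extreme Gradient Boosting, at about 90.2%.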