Data Sampling Over Synthetic Minority on Balanced Dataset in Python

Abstract:

This study proposes a novel method for detecting Parkinson's disease from features extracted from speech signals. Detection and early diagnosis of Parkinson's disease are essential for managing disease progression and planning treatment. The Parkinson's disease dataset (PD dataset) used in this study was obtained from the UCI Machine Learning Repository. The proposed hybrid machine learning method consists of two stages: i) data pre-processing (oversampling) and ii) classification. The PD dataset is a two-class dataset with 753 attributes: 192 samples belong to normal (healthy) individuals, and 564 samples belong to the diseased class (PD). The class distribution is therefore imbalanced. To balance it, SMOTE (Synthetic Minority Over-sampling Technique) is applied, and the balanced dataset is then classified with the random forests method. Random forests alone achieved a classification success of 87.037% on the PD dataset, while the proposed hybrid method (the combination of SMOTE and random forests) achieved 94.89%. These results indicate that the hybrid method is promising for discriminating between the classes of the PD dataset.
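
The oversampling stage can be illustrated with a minimal sketch. It is not the implementation used in the study: it is a simplified, pure-Python variant of the SMOTE idea, in which each synthetic point is an interpolation between a randomly chosen minority sample and one of its k nearest minority neighbours. The function name `smote`, the choice k = 5, the random seeds, the four features, and the Gaussian toy data are all assumptions made for illustration; only the class sizes (192 healthy, 564 PD) come from the abstract, which implies 564 - 192 = 372 synthetic healthy samples to reach balance.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of its k nearest minority neighbours, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        # Place the new point at a random position on the segment
        # between the sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy stand-in data (NOT the PD dataset): only the class sizes are taken
# from the abstract; the features are random Gaussian values.
data_rng = random.Random(42)
n_features = 4  # assumption; the real dataset has 753 attributes
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(192)]
diseased = [[data_rng.gauss(1.5, 1.0) for _ in range(n_features)] for _ in range(564)]

# Generate exactly enough synthetic healthy samples to match the PD class.
new_healthy = smote(healthy, n_synthetic=len(diseased) - len(healthy))
balanced_healthy = healthy + new_healthy

print(len(balanced_healthy), len(diseased))
```

In the pipeline described above, the balanced data would then be passed to the random forests classifier. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would likely be preferable to this hand-written sketch.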