Cluster Analysis of Mixed and Missing Chronic Kidney Disease Data in KwaZulu-Natal Province, South Africa

Abstract:

Real-world datasets, particularly Electronic Health Records, are routinely mixed (composed of both categorical and continuous variables) and/or incomplete (containing missing values). Such datasets present particular challenges, both for clustering and for evaluating the clusterings obtained. In this paper, we discuss these challenges in detail, together with the approaches proposed in the literature to address them. We then apply some of these approaches to a multi-racial Chronic Kidney Disease (CKD) dataset comprising 20 continuous and 12 categorical variables with a missingness ratio of over 30%, evaluating our results through external and internal validation as well as cluster stability testing. In our study, the Ahmad-Dey distance measure consistently outperformed Gower's distance on our mixed and missing dataset. In addition, our results show that advanced imputation methods such as multiple imputation, which account for the uncertainty inherent in imputation, should be explored when clustering datasets with missing values. We identified three clusters in our dataset, which were significantly differentiated by age, sex, estimated Glomerular Filtration Rate (eGFR), creatinine, urea, and hemoglobin, but not by race or blood pressure. That proper cluster analysis did not recover five clusters corresponding to the five CKD stages usually used to classify CKD patients indicates that datasets with more than the usual four or six variables used for computing eGFR may contain a latent structure different from this five-group structure. Identifying that structure can provide medical practitioners with valuable insights specific to each cohort.
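
As a rough illustration of how a distance for mixed and missing data can be computed, the following minimal sketch implements the standard form of Gower's distance for a pair of records. It is not the implementation used in this study: the function name, the toy records, and the variable ranges are hypothetical, and missing values are handled in the usual Gower fashion by skipping any variable that is missing in either record.

```python
import math

def gower_distance(a, b, is_categorical, ranges):
    """Gower distance between two records; None marks a missing value."""
    total, weight = 0.0, 0
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if x is None or y is None:
            continue  # a variable missing in either record gets weight 0
        if cat:
            d = 0.0 if x == y else 1.0      # simple matching for categories
        else:
            d = abs(x - y) / rng if rng > 0 else 0.0  # range-scaled difference
        total += d
        weight += 1
    # Undefined when the two records share no observed variable
    return total / weight if weight else math.nan

# Hypothetical toy records with variables (age, eGFR, sex)
is_categorical = [False, False, True]
ranges = [50.0, 100.0, None]  # assumed ranges for the continuous variables
p1 = (60, 45.0, "F")
p2 = (40, None, "M")          # eGFR missing
p3 = (60, 95.0, "F")

d12 = gower_distance(p1, p2, is_categorical, ranges)
d13 = gower_distance(p1, p3, is_categorical, ranges)
```

Because each distance is averaged only over the variables observed in both records, pairs of patients with many missing values are compared on fewer variables, which is one reason imputation is worth considering before clustering.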