Abstract:
Measuring proximity between pairs of expression vectors is a fundamental requirement of machine learning and data mining algorithms. We propose a new metric, Bidirectional Association Similarity (BiAS), to measure the degree of mutual association between a pair of features, and present a generalized formulation for computing BiAS between two vectors. Using non-linear programming optimization, we establish the soundness of BiAS against the Jaccard and cosine similarities and prove that mutually associative features must be similar; the converse, however, does not hold. Finally, we show that BiAS is a transitive relation and, like other metrics, can be incorporated into any clustering algorithm to identify groups of mutually associative features in an ensemble. Experiments on clustering and classification of genome sequences for taxa identification, and on finding biomarkers in large-airway epithelial cell expression data from smokers diagnosed with lung cancer, show that BiAS improves knowledge precision compared with seven other well-established metrics, including the Pearson correlation coefficient, cosine similarity, and Jaccard similarity. Remarkably, 10 of the top 11 lung-cancer biomarkers identified using BiAS have been corroborated by previously reported, clinically backed studies. Thus, bidirectional association mining proves effective for bio-knowledge discovery.