A Data Centric Pipeline for Better Multiple Sequence Alignment

Abstract:

In this paper, we explore a data-centric approach to the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces construction to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses classification models trained on existing benchmark data to guide the construction. We identified two simple classification tasks: one helps choose the better alignment tool for a given input, and the other determines whether, and how much, realignment to carry out. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools score in the range [55.41, 55.91], an improvement of more than 1.8% over the best of them. We also compared MLProbs with other MSA tools in two real-life applications, Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction, and MLProbs had the best performance in both. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can bring further improvements, and we suggest some possible research directions in the conclusion section.
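For readers unfamiliar with the metric, the TC (total column) score is commonly defined as the percentage of columns in a reference alignment that are reproduced exactly in the alignment under test. The following is a minimal sketch under that assumption; it is not the scoring code used in this work, the function names `columns` and `tc_score` are hypothetical, and benchmark suites typically restrict the count to curated core columns rather than all reference columns as done here.

```python
def columns(alignment):
    """Convert gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def tc_score(reference, test):
    """Percentage of non-empty reference columns found unchanged in the test alignment."""
    ref_cols = [c for c in columns(reference) if any(x is not None for x in c)]
    test_cols = set(columns(test))
    matched = sum(1 for c in ref_cols if c in test_cols)
    return 100.0 * matched / len(ref_cols)


# Toy example: rows are listed in the same sequence order in both alignments.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(tc_score(reference, test))  # 3 of the 5 reference columns are reproduced
```

Because a column counts only when every sequence is aligned identically, a single misplaced residue forfeits the whole column, which is why TC is a strict measure on low-similarity families.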