Abstract:
DNA-binding proteins (DBPs) play vital roles in the regulation of biological systems. Although there are already many deep learning methods for predicting the sequence specificities of DBPs, they face two challenges as follows. Classic deep learning methods for DBPs prediction usually fail to capture the dependencies between genomic sequences since their commonly used one-hot codes are mutually orthogonal. Besides, these methods usually perform poorly when samples are inadequate. To address these two challenges, we developed a novel language model for mining DBPs using human genomic data and ChIP-seq datasets with decaying learning rates, named DNA Fine-tuned Language Model (DFLM). It can capture the dependencies between genome sequences based on the context of human genomic data and then fine-tune the features of DBPs tasks using different ChIP-seq datasets. First, we compared DFLM with the existing widely used methods on 69 datasets and we achieved excellent performance. Moreover, we conducted comparative experiments on complex DBPs and small datasets. The results show that DFLM still achieved a significant improvement. Finally, through visualization analysis of one-hot encoding and DFLM, we found that one-hot encoding completely cut off the dependencies of DNA sequences themselves, while DFLM using language models can well represent the dependency of DNA sequences.