Empirical Study of Protein Feature Representation on Deep Belief Networks Trained With Small Data fo

Empirical Study of Protein Feature Representation on Deep Belief Networks Trained With Small Data fo

Abstract:

Protein secondary structure (SS) prediction is a classic problem of computational biology and is widely used in structural characterization and to infer homology. While most SS predictors have been trained on thousands of sequences, a previous approach had developed a compact model of training proteins that used a C - A lpha, C- B eta S ide Chain ( CABS )-algorithm derived energy based feature representation. Here, the previous approach is extended to Deep Belief Networks (DBN). Deep learning methods are notorious for requiring large datasets and there is a wide consensus that training deep models from scratch on small datasets, works poorly. By contrast, we demonstrate a simple DBN architecture containing a single hidden layer, trained only on the CB513 dataset. Testing on an independent set of G Switch proteins improved the Q 3 score of the previous compact model by almost 3%. The findings are further confirmed by comparison to several deep learning models which are trained on thousands of proteins. Finally, the DBN performance is also compared with P osition S pecific S coring M atrix ( PSSM )-profile based feature representation. The importance of (i) structural information in protein feature representation and (ii) complementary small dataset learning approaches for detection of structural fold switching are demonstrated.