Causal Gene Identification Using Non-Linear Regression-Based Independence Tests

Abstract:

With the development of biomedical techniques over the past decades, causal gene identification has become one of the most promising applications in human genome-based business: it can help doctors evaluate the risk of certain genetic diseases and provide further treatment recommendations for potential patients. When controlled experiments cannot be applied, machine learning techniques such as causal inference-based methods are generally used to identify causal genes. Unfortunately, most existing methods detect disease-related genes through ranking-based strategies or feature selection techniques, which generally return a superset of the real causal genes. Some causal inference-based methods can identify a subset of the real causal genes from those supersets, but they return only a few causal genes. This contradicts existing knowledge, as many results from controlled experiments have demonstrated that a given disease, especially cancer, is usually related to dozens or hundreds of genes. In this work, we present an effective approach for identifying causal genes from gene expression data using a new search strategy based on non-linear regression-based independence tests, which greatly reduces the search space and simultaneously establishes the causal relationships from the candidate genes to the disease variable. Extensive experiments on real-world cancer datasets show that our method is superior to existing causal inference-based methods in three respects: 1) our method can identify dozens of causal genes, and 1/3 to 1/2 of the discovered causal genes are verified by existing works as directly related to the corresponding disease; 2) the discovered causal genes are able to distinguish the status or disease subtype of the target patient; 3) most of the discovered causal genes are closely relevant to the disease variable.
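
The abstract does not spell out the independence test itself. As a rough illustration of what a non-linear regression-based independence test can look like, the sketch below regresses one variable on another with a non-linear smoother and then checks whether the residuals are independent of the regressor. Everything specific here is an assumption made for illustration and not the authors' procedure: the Nadaraya-Watson kernel regression, the HSIC statistic with Gaussian kernels, the permutation p-value, the bandwidth rules, and all function names are hypothetical choices.

```python
import math
import random


def gaussian_gram(values, width):
    """Gram matrix of the Gaussian kernel exp(-(a - b)^2 / (2 * width^2))."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * width ** 2)) for b in values]
            for a in values]


def median_width(values):
    """Median heuristic: median pairwise distance (1.0 if degenerate)."""
    dists = sorted(abs(a - b) for i, a in enumerate(values) for b in values[i + 1:])
    med = dists[len(dists) // 2] if dists else 0.0
    return med if med > 0 else 1.0


def center(gram):
    """Double-center a symmetric Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand for j in range(n)]
            for i in range(n)]


def hsic_pvalue(x, r, n_perm=200, seed=0):
    """Permutation p-value for the null hypothesis that x and r are independent."""
    n = len(x)
    kc = center(gaussian_gram(x, median_width(x)))
    l = gaussian_gram(r, median_width(r))

    def statistic(perm):
        # Biased HSIC estimate trace(K H L H) / n^2, with r reordered by `perm`.
        return sum(kc[i][j] * l[perm[i]][perm[j]]
                   for i in range(n) for j in range(n)) / n ** 2

    rng = random.Random(seed)
    identity = list(range(n))
    observed = statistic(identity)
    exceed = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        if statistic(perm) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)


def regression_residuals(x, y):
    """Residuals of a Nadaraya-Watson kernel regression of y on x."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    width = 1.06 * std * n ** (-0.2) if std > 0 else 1.0  # Silverman's rule
    weights = gaussian_gram(x, width)
    fitted = [sum(w * yj for w, yj in zip(row, y)) / sum(row) for row in weights]
    return [yi - fi for yi, fi in zip(y, fitted)]


def regression_independence_pvalue(x, y, n_perm=200, seed=0):
    """Regress y on x non-linearly, then test residuals for independence from x."""
    return hsic_pvalue(x, regression_residuals(x, y), n_perm, seed)


if __name__ == "__main__":
    # Synthetic data only: a hypothetical "gene" driving a "disease" score.
    rng = random.Random(1)
    gene = [rng.uniform(-2, 2) for _ in range(60)]
    disease = [g ** 3 + rng.gauss(0, 0.5) for g in gene]
    print("gene -> disease p-value:", regression_independence_pvalue(gene, disease))
    print("disease -> gene p-value:", regression_independence_pvalue(disease, gene))
```

Under an additive-noise assumption, a large p-value in one direction (independence of the residuals not rejected) together with a small p-value in the reverse direction is usually read as evidence for that causal direction. The smoother's boundary bias and the bandwidth choices can affect the result, so the sketch is a starting point for experimentation only.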