Abstract:
In the age of Big Genomics Data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share large volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems stem from throughput bottlenecks in traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but it is not always possible because of resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, big genomic data is transmitted between two geographical locations using general-purpose protocols, such as hypertext transfer protocol (HTTP) and secure file transfer protocol (FTP). In this paper, we present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels; and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content-encoding) several times for the same dataset: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), so that the shortest codewords are assigned to the highest-frequency characters in each time slot of the transfer. This algorithm transmits big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process.
Results from our heuristic model, simulation, and real implementation indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP, on the tested datasets. The protocol, developed in C#, will be made available to the wider genomics community and domain scientists.
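
The idea of giving the same character different codewords in different parts of a dataset can be illustrated with a small sketch. It is not the algorithm proposed in this paper: plain per-block frequency counting stands in for the CNN, and the block size, the codeword set, and the function names (`rank_symbols`, `encode`, `decode`) are assumptions chosen only for illustration. The input is assumed to contain only the characters A, C, G, and T.

```python
from collections import Counter

ALPHABET = "ACGT"
# Prefix-free codewords, shortest first (an assumed set for this sketch).
CODEWORDS = ("0", "10", "110", "111")


def rank_symbols(block):
    """Order the alphabet by descending frequency in the block (ties: alphabetical)."""
    counts = Counter(block)
    return sorted(ALPHABET, key=lambda s: (-counts[s], s))


def encode(sequence, block_size=8):
    """Return one (ranking, bitstring) pair per block of the sequence."""
    encoded = []
    for start in range(0, len(sequence), block_size):
        block = sequence[start:start + block_size]
        ranking = rank_symbols(block)
        # The most frequent character of this block gets the shortest codeword.
        table = dict(zip(ranking, CODEWORDS))
        encoded.append(("".join(ranking), "".join(table[s] for s in block)))
    return encoded


def decode(encoded):
    """Invert encode() using the ranking stored with each block."""
    out = []
    for ranking, bits in encoded:
        table = dict(zip(CODEWORDS, ranking))
        current = ""
        for bit in bits:
            current += bit
            if current in table:  # prefix-free, so the first match is the symbol
                out.append(table[current])
                current = ""
    return "".join(out)


seq = "AAAAAAACGGGGGGGT"
blocks = encode(seq, block_size=8)
# In the first block "A" maps to "0"; in the second block "G" maps to "0".
assert decode(blocks) == seq
```

For this 16-character input the two blocks take 18 payload bits, against 32 bits for a fixed 2-bit-per-base encoding. The sketch ignores the cost of sending each block's ranking, and on blocks where all four bases are equally frequent this codeword set averages 2.25 bits per base, so any gain depends on how skewed the local frequencies are.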