Abstract:
The increased use of cyber-enabled systems and Internet-of-Things (IoT) led to a massive amount of data with different structures. Most big data solutions are built on top of the Hadoop eco-system or use its distributed file system (HDFS). However, studies have shown inefficiency in such systems when dealing with today's data. Some research overcame these problems for specific types of graph data, but today's data are more than one type of data. Such efficiency issues may lead to large-scale problems, including larger space requirements in data centers, and waste in resources (like power consumption), that in turn lead to environmental problems (such as more carbon emission) [1] , as per scholars. We propose a data-aware module for the Hadoop eco-system. We also propose a distributed encoding technique for genetic algorithms efficient data processing. Our framework allows Hadoop to manage the distribution of data and its placement based on cluster analysis of the data itself. We are able to handle a broad range of data types as well as optimize query time and resource usage. We performed experiments on multiple datasets generated via LUBM (Lehigh University Benchmark) and reported results along with performance analysis.