Abstract:
Data loading is a crucial and well-standardized procedure in the use of modern relational database management systems (RDBMS). If data is generated at high frequency and intensive real-time analyses are required, single rows need to be imported into a columnar table layout as soon as they are generated, e.g., by monitoring applications. The experiment described in this paper evaluates the data import performance of SAP HANA's column store for such single inserts and compares it with the load performance of MySQL's row store. In both cases, we use the command line interface (CLI) that is shipped with the respective database. After comparing the results for sequential imports, we address the research question of how concurrency influences import speed and to what extent performance can be improved by invoking multiple insert operations simultaneously. To build a practical example, we used a data set formatted as comma-separated values (CSV) containing monitoring information from IT infrastructures running in production. The flat data was transformed into 1.89 million independent insert statements and hence simulates a typical asynchronous sensor or monitoring workload. Within our experiment, the best import performance for SAP HANA's column store was achieved by invoking 200 CLIs simultaneously, each importing 5,000 rows. We observe that the best-performing number of parallel CLIs seems to depend strongly on the available CPU cores and that parallel invocation could accelerate the overall import process by up to a factor of 12 compared to the default sequential import. Confirming the general assumption that row stores are more suitable for single inserts, MySQL performed better in 25 out of 34 test cases. However, within a range of 60 to 150 simultaneously invoked CLIs, we were able to accelerate SAP HANA's column store load performance enough to surpass MySQL's row store in certain test cases.
Our results are intended to help practitioners further investigate the suitability of column and row stores for Big Data scenarios involving frequently generated data sets.
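
The workflow summarized above (flat CSV rows turned into independent insert statements, which are then split across several simultaneously invoked CLI processes) can be illustrated with a minimal Python sketch. It is not the tooling used in the experiment. The function names `csv_to_inserts`, `chunk`, and `run_parallel` are hypothetical, every value is quoted as a string literal for simplicity, and the CLI invocation is supplied by the caller as an argument list, since the exact client commands and options are not given here.

```python
import csv
import io
import subprocess
from concurrent.futures import ThreadPoolExecutor


def csv_to_inserts(csv_text, table):
    """Turn each CSV data row into one independent INSERT statement."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line is assumed to hold the column names
    columns = ", ".join(header)
    statements = []
    for row in reader:
        # Simplification: every value is emitted as a quoted string literal,
        # with embedded single quotes doubled.
        values = ", ".join("'" + v.replace("'", "''") + "'" for v in row)
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({values});")
    return statements


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_parallel(chunks, cli_command):
    """Start one CLI process per chunk and feed it the statements on stdin.

    `cli_command` is a caller-supplied argument list for a database client
    that reads SQL from standard input (an assumption of this sketch).
    Returns the exit code of each process, in chunk order.
    """
    def run_one(statements):
        result = subprocess.run(
            cli_command,
            input="\n".join(statements) + "\n",
            text=True,
            capture_output=True,
        )
        return result.returncode

    # One worker thread per chunk, so all CLI processes run simultaneously.
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(run_one, chunks))
```

With a chunk size of 5,000, each invoked process would import 5,000 rows, mirroring the "rows per CLI" setting mentioned above. The number of chunks then determines how many CLIs run at once.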