Finding Persistent Items in Distributed Datasets in Java

Finding Persistent Items in Distributed Datasets in Java

Abstract:

This paper concerns the problem of finding persistent items in distributed datasets, which has many applications such as port scanning and intrusion detection. To the best of our knowledge, there is no existing solution for finding persistent items in distributed datasets. In this paper, we propose DISPERSE, a probabilistic algorithm that can find persistent items in distributed datasets without collecting all the datasets. Our basic idea is that each monitor compresses each item ID in its dataset in a lossy fashion, sends the set of lossily compressed item IDs to the server, then the server recovers the IDs of the persistent items. We design the lossy compression so that given one lossily compressed item ID, the server cannot recover the item ID, but when the number of lossily compressed versions of the same item ID exceeds a threshold, which means that such items are persistent ones, the server can recover the item ID with a high probability. This threshold is exactly the threshold in the definition of persistent items. We implemented DISPERSE and evaluated its performance. In comparison with the straightforward solution, DISPERSE achieves a compression ratio of 26.5% with FNR=3.5% and FPR=O. In comparison with a developed Bloom filter based scheme and the adapted kBF and IBF schemes, our scheme can achieve 7.9, 5.7, and 6.6 times performance gains, respectively, in terms of compression ratio.