Privacy-Preserving Multi-keyword Top-kSimilarity Search Over Encrypted Data

Privacy-Preserving Multi-keyword Top-kSimilarity Search Over Encrypted Data

Privacy-Preserving Multi-keyword Top-kSimilarity Search Over Encrypted Data
Privacy-Preserving Multi-keyword Top-kSimilarity Search Over Encrypted Data


Cloud computing provides individuals and enterprises massive computing power and scalable storage capacities to supporta variety of big data applications in domains like health care and scientific research, therefore more and more data owners are involvedto outsource their data on cloud servers for great convenience in data management and mining. However, data sets like health recordsin electronic documents usually contain sensitive information, which brings about privacy concerns if the documents are released orshared to partially untrusted third-parties in cloud. A practical and widely used technique for data privacy preservation is to encryptdata before outsourcing to the cloud servers, which however reduces the data utility and makes many traditional data analyticoperators like keyword-based top-document retrieval obsolete. In this paper, we investigate the multi-keyword top-search problemfor big data encryption against privacy breaches, and attempt to identify an efficient and secure solution to this problem. Specifically, forthe privacy concern of query data, we construct a special tree-based index structure and design a random traversal algorithm, whichmakes even the same query to produce different visiting paths on the index, and can also maintain the accuracy of queries unchangedunder stronger privacy. For improving the query efficiency, we propose a group multi-keyword top-search scheme based on the ideaof partition, where a group of tree-based indexes are constructed for all documents. Finally, we combine these methods together intoan efficient and secure approach to address our proposed top-similarity search. Extensive experimental results on real-life data setsdemonstrate that our proposed approach can significantly improve the capability of defending the privacy breaches, the scalability andthe time efficiency of query processing over the state-of-the-art methods.


  • Song et al. first defined the problem of searching on encrypted dataand proposed a symmetric searchable encryption schemewith linear complexity.
  • After that, Goh et al. formulateda security definition for SSE and proposed a secure indexwhich is based on pseudo-random functions and Bloomfilters, but the time cost of Goh’s scheme is O(n).
  • Curtmolaet al. introduced two formal definitions of SSE and proposeda method which is based on inverted list to improvethe query performance, their method is proved to be moreefficient than other works. However, most of these workscan only support single keyword boolean search, which isnot advanced enough to support complex functionalities. Inrecent years, many works have been proposed to achievedifferent kinds of complex queries like similarity search,multi-keyword search, etc.
  • Cao et al. proposed the multi-keyword ranked search over encrypted data for the first time and built a searchable index based on the vector space model, and chosen ”coordinate matching” to measure the similarity between queries and documents. However, in their schemes, the time complexity of search is O(nm) (is the number of keywords in dictionary, is the size of the documents that stored in the cloud server), and the time complexity of trapdoor construction is also very high.


  • Applying these approaches for data encryption usually cause tremendous cost in terms of data utility, which makes traditional data processing methods that are designed for plaintext data no longer work well over encrypted data.
  • Most of these methods cannot meet the high search efficiency and the strong data security simultaneously, especially when applying them to big data encryption poses great scalability and efficiency challenges.
  • But the cost of search remains high and the time complexity of creating trapdoor is high.


  • In this paper, we focus on a special type of multi-keyword ranked search, namely the multikeyword top-search, which has been a very popular database operator in many important applications, and only needs to return the documents with the highest relevance scores. For supporting multi-keyword search, we introduce the vector space model which represents documents and queries as vectors. In order to support top-search, the relevance scores between documents and queries should be calculated, therefore, the TF_IDF (term frequency inverse document frequency) model is introduced as a weighting rule to compute the relevance scores for ranking purposes.
  • In addition, to improve the query efficiency for better user experiences, we propose a group multi-keyword top-search scheme (GMTS), which is based on partition and supports top-similarity search over encrypted data. In this scheme, the data owner divides the keywords in the dictionary (suppose that the dictionary contains all the keywords that could be extracted from all documents) into multiple groups and establishes a searchable index for each group.
  • We propose a random traversal algorithm (RTRA) to strengthen the data security, where the data owner builds a binary tree as searchable index and assigns a random switch to each node, so the data user can assign a random key to each query. Therefore, the data user can change the results and visiting paths of queries by using different keys, which maintains high accuracy of queries. Finally, we combine the GMTS and the RTRA together into an efficient and secure solution to our proposed problem.


  • We first propose the random traversal algorithm which makes the cloud server randomly traverse on index and returns different results for the same query, and in the meantime, it maintains the accuracy of queries unchanged for higher security.
  • Based on the random traversal algorithm, we present one both efficient and secure searchable encryption scheme, which can support top-similarity search over encrypted data. In this scheme, the data owner can control the level of query unlinkability without sacrificing accuracy.
  • Our experimental results show that our methods are more efficient than the state-of-the-art methods and can better protect data privacy. Especially, our proposed method has good scalability performancewhen dealing with large data sets.




  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram :


  • Operating system : Windows 7.
  • Coding Language : NET,C#.NET
  • Tool : Visual Studio 2008
  • Database : SQL SERVER 2005


Xiaofeng Ding, Member, IEEE, Peng Liu and Hai Jin, Senior Member, IEEE, “Privacy-Preserving Multi-keyword Top-kSimilarity Search Over Encrypted Data”, IEEETransactions on Dependable and Secure Computing, 2017