Efficient privacy-preserving variable-length substring match for genome sequence

The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that $V[V[\ldots V[p] \ldots]]$ is computed for a given depth of recursion, where $p$ is an initial position and $V$ is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length $N$ after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently of the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.

Keywords: Private genome sequence search, Secure multiparty computation, Secret sharing, FM-index, Suffix array, LCP array, Maximal exact match

Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9

*Correspondence: shimizu.kana@waseda.jp
National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
Full list of author information is available at the end of the article

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

The dramatic reduction in the cost of genome sequencing has prompted increased interest in personal genome sequencing over the last 15 years. Extensive collections of personal genome sequences have been accumulated both in academic and industrial organizations, and there is now a global demand for sharing the data to accelerate scientific research [1, 2]. As discussed in previous studies, disclosing personal genome information has a high privacy risk [3], so it is crucial to ensure that individuals' privacy is protected upon data sharing. At present, the most popular approach for this is to formulate and enforce a privacy policy, but it is a time-consuming process to reach an agreement, especially among stakeholders with different legal backgrounds, which slows down the pace of research. Therefore, there is a strong demand for privacy-preserving technologies that can potentially compensate for or even replace the traditional policy-based approach [4, 5]. One important application that needs a privacy-preserving technology is private genome sequence search, where different stakeholders respectively hold a query sequence and a database sequence and the goal is to let the query holder know the result while simultaneously keeping the query and the database private. Many studies have addressed the problem of how to compute exact or approximate edit distance or the longest common substring (LCS) through techniques based on homomorphic encryption [6–8] and secure multi-party computation (MPC) [9–15], or how to compute sequence similarity based on private set intersection [16]. While these studies can evaluate global sequence similarity for two sequences of similar length, other studies address the problem of finding a substring between a query and a long genome sequence or a set of long genome sequences, with the aim of evaluating local sequence similarity [17–23]. Shimizu et al. proposed an approach to combine an additive homomorphic encryption and index structures such as FM-index [24] and the positional Burrows-Wheeler transform [25] to find the longest prefix of a query that matches a database (LPM) and a set-maximal match for a collection of haplotypes [17]. Sudo et al. used a similar approach and improved the time and communication complexities for LPM on a protein sequence by using a wavelet matrix [19]. Ishimaki et al. improved the round complexity of a set-maximal match, though the search time was more than one order of magnitude slower than [17] due to the heavy computational cost caused by the fully homomorphic encryption [18]. Sotiraki et al. used the Goldreich-Micali-Wigderson protocol to build a suffix tree for a set-maximal match [20]. According to experiments by [21], the search time of [20] is one order of magnitude slower than [17, 21]. Mahdi et al. [21] used a garbled circuit to build a suffix tree for substring match and a set-maximal match under a different security assumption such that the tree-traversal pattern is leaked to the cloud server. Chen et al. [22] and Popic et al. [23] found fixed-length substring matches using a one-way hash function or homomorphic encryption on a public cloud under a security assumption such that the database is a public sequence and a query is leaked to a private cloud server.

In this study, we aim to improve privacy-preserving substring match under the security assumption such that both the query and the database sequence are strictly protected. We first propose a more efficient method for finding LPM, and then extend it to find the longest maximal exact match (LMEM), which is more practically important in bioinformatics. We designed the protocol for LMEM for ease of explanation, and the protocol can be applied to similar problems such as finding all maximal exact matches (MEMs) with a small modification. To our knowledge, this is the first study to address the problem of securely finding MEMs.

Our contribution

The time complexity of the previous studies [17, 19] includes a factor of $N$, and thus they do not scale well to a large database. For a similar reason, using secure matching protocols (e.g., [26]) for the shares (or tags in searchable encryption) of all substrings in a query and database is even worse in terms of time complexity. To achieve a real-time search on an actual genome database, we propose novel secret-sharing-based protocols that do not include the factor of $N$ in the time, communication, and round complexities for the search time (i.e., the time after the input of a query until the end of the search). The basic idea of the protocols is to represent the database string by a compressed index [24, 27] and store the index as a lookup table. LPM and MEMs are found by at most $\ell$ and $2\ell$ table lookups respectively, where $\ell$ is the length of the query. More specifically, the table $V$ is referenced in a recursive manner; i.e., one needs to obtain $V[j]$, where $j = V[i]$, given $i$. To ensure security, we need to compute $V[j]$ without seeing any element of $V$. The key technical contribution of this study is an efficient protocol that achieves this type of recursive reference. We named the protocol secret-shared recursive oblivious transfer (ss-ROT). While the previous studies require $O(N)$ time complexity to ensure security, the time, communication, and round complexities of ss-ROT are all $O(\ell)$ for $\ell$ recursive table lookups, except for the preparation of the table and generation of shares before the query input. Since the entire protocols mainly consist of $\ell$ table lookups for LPM, and $2\ell$ table lookups and $2\ell$ inner product computations for LMEM, the search times for LPM and LMEM do not depend on the database size. In addition to the protocols based on ss-ROT, we developed a protocol to reduce data transfer size in the initial step by using a similar approach taken in ss-ROT. The protocol offers a reasonable trade-off between the amount of reduction in data transfer in the initial step and the increase in computational cost in the later step.

We implemented the proposed protocol and tested it on substrings of a human genome sequence $10^3$ to $10^7$ in length and confirmed that the actual CPU time and data transfer overhead were in good agreement with the theoretical complexities. We also found that the search time of our protocol was three orders of magnitude faster than that of the previous method [17, 19]. For conducting further performance analysis, we designed and implemented baseline protocols using major techniques of secret-sharing-based protocols. The results showed that the search times of our protocols were at least two orders of magnitude faster than those of the baseline protocols.

Preliminaries

Secure computation based on secret sharing

Here, we explain the 2-out-of-2 additive secret sharing ((2, 2)-SS) scheme and how to securely compute arithmetic/Boolean gates (Fig. 1).

Fig. 1 Arithmetic addition and multiplication over secret sharing

Secret sharing and secure computation. In $t$-out-of-$n$ secret sharing (e.g., [28]), we split the secret value $x$ into $n$ pieces, and can reconstruct $x$ by combining $t$ or more pieces. We call each piece a "share". The basic security notion for secret sharing is that we cannot obtain any information about $x$ even if we gather $(t - 1)$ or fewer shares. In this paper, we consider a case with $(t, n) = (2, 2)$. A 2-out-of-2 secret sharing ((2, 2)-SS) scheme over $\mathbb{Z}_{2^n}$ consists of two algorithms: Share and Reconst. Share takes as input $x \in \mathbb{Z}_{2^n}$ and outputs $([[x]]_0, [[x]]_1) \in \mathbb{Z}_{2^n}^2$, where the bracket notation $[[x]]_i$ denotes the arithmetic share of the $i$-th party (for $i \in \{0, 1\}$). We denote $[[x]] = ([[x]]_0, [[x]]_1)$ as their shorthand. Reconst takes as inputs $[[x]]_0$ and $[[x]]_1$ and outputs $x$. For arithmetic sharing $[[x]]$ and Boolean sharing $[[x]]^B$, we consider power-of-two integers $n$ (e.g., $n = 16$) and $n = 1$, respectively.

Depending on the secret sharing scheme, we can compute arithmetic/Boolean gates over shares; that is, we can execute some kind of processing related to $x$ without knowing $x$. This means it is possible to perform some computation without violating the privacy of the secret data, and is called secure (multi-party) computation. It is known that we can execute arbitrary computation by combining basic arithmetic/Boolean gates. In the following paragraphs, we show how to concretely compute these gates over shares.

Semi-honest secure two-party computation based on (2, 2)-Additive SS. We use a standard (2, 2)-additive SS scheme, defined by

• Share($x$): randomly choose $r \in \mathbb{Z}_{2^n}$ and let $[[x]]_0 = r$ and $[[x]]_1 = x - r$.
• Reconst($[[x]]_0, [[x]]_1$): output $[[x]]_0 + [[x]]_1$.

Note that one of the shares of $x$ ($[[x]]_0$ or $[[x]]_1$) does not reveal any information about $x$. In Fig. 1, the secret value $x = 2$ is split into $[[x]]_0 = 4$ and $[[x]]_1 = 6$. These are valid (2, 2)-additive shares because $4 + 6 \equiv 2 \pmod 8$ holds. Even if we can see $[[x]]_0 = 4$, we cannot decide the value of $x$ since we execute a split of $x$ uniformly at random. This means, in Fig. 1, computing nodes $P_0$ and $P_1$ cannot obtain any information about $x$ as long as these two nodes do not collude. On the other hand, we can compute arithmetic ADD/MULT gates over shares as follows:

• $[[z]] \leftarrow$ ADD($[[x]], [[y]]$) can be done locally by just adding each party's share on $x$ and on $y$. In Fig. 1 (left), we show an example of secure addition. $P_0$/$P_1$ obtain shares 6/7 by adding their two shares. In this process, $P_0$/$P_1$ cannot find they are computing $2 + 3$.
• Multiplication is more complex than addition. There are various methods for multiplication over shares, most of which require communication between computing nodes. In this paper, we use the standard method for $[[z]] \leftarrow$ MULT($[[x]], [[y]]$) based on Beaver triples (BT) [29]. Such a triple consists of $bt_0 = (a_0, b_0, c_0)$ and $bt_1 = (a_1, b_1, c_1)$ such that $(a_0 + a_1)(b_0 + b_1) = (c_0 + c_1)$. Hereafter, $a$, $b$, and $c$ denote $a_0 + a_1$, $b_0 + b_1$, and $c_0 + c_1$, respectively. We use these BTs as auxiliary inputs for computing MULT. Note that we can compute them in advance (or in offline phase) since they are independent of inputs $[[x]]$ and $[[y]]$. We adopt a trusted initializer setting (e.g., [30, 31]); that is, BTs are generated by the party other than two computing nodes and then distributed. In the online phase of MULT, each $i$-th party $P_i$ ($i \in \{0, 1\}$) can compute the multiplication share $[[z]]_i = [[xy]]_i$ as follows:

1) $P_i$ first computes $([[x]]_i - a_i)$ and $([[y]]_i - b_i)$, and sends them to $P_{1-i}$.
2) $P_i$ reconstructs $x' = x - a$ and $y' = y - b$.
3) $P_0$ computes $[[z]]_0 = x'y' + x'b_0 + y'a_0 + c_0$, and $P_1$ computes $[[z]]_1 = x'b_1 + y'a_1 + c_1$.

Here, $[[z]]_0$ and $[[z]]_1$ calculated with the above procedures are valid shares of $xy$; that is, Reconst($[[z]]_0, [[z]]_1$) $= xy$. We shorten the notations and write the ADD and MULT protocols simply as $[[x]] + [[y]]$ and $[[x]] \cdot [[y]]$, respectively. We also write ADD(ADD($[[x_A]], [[x_B]]$), $[[x_C]]$) as $\sum_{c \in \{A, B, C\}} [[x_c]]$. Note that, similarly to the ADD protocol, we can also locally compute multiplication by constant $c$, denoted by $c \cdot [[x]]$. We can easily extend the above protocols to Boolean gates. By converting $+$ and $-$ into $\oplus$ in the arithmetic ADD and MULT protocols, we can obtain the XOR and AND protocols, respectively. We can construct NOT and OR protocols from the properties of these gates. When we compute NOT($[[x]]^B_0, [[x]]^B_1$), $P_0$ and $P_1$ output $\neg [[x]]^B_0$ and $[[x]]^B_1$, respectively. When we compute OR($[[x]]^B, [[y]]^B$), we compute $\neg$AND($\neg [[x]]^B, \neg [[y]]^B$). We shorten the notations and write XOR, AND, NOT, and OR simply as $[[x]] \oplus [[y]]$, $[[x]] \wedge [[y]]$, $\neg [[x]]$, and $[[x]] \vee [[y]]$, respectively. By combining the above gates, we can securely compute higher-level protocols. The functionalities of the secure subprotocols [15] used in this paper are shown in Table 1. Due to space limits, we omit the details of their construction. Note that we can compute Choose by $[[z]] = [[y]] + [[e]] \cdot ([[x]] - [[y]])$.

Table 1 Secure subprotocols used in this paper

Protocol | Input | Output
Equality | $[[x]]$, $[[y]]$ | $[[z]]$ s.t. $z = 1$ if $x = y$, otherwise $z = 0$
Comp | $[[x]]$, $[[y]]$ | $[[z]]$ s.t. $z = 1$ if $x < y$, otherwise $z = 0$
CastUp | $[[x]] \in \mathbb{Z}_{2^n}$, $n'$ | $[[x]] \in \mathbb{Z}_{2^{n'}}$ ($n < n'$)
B2A | $[[x]]^B$ | $[[x]]$
Choose | $[[x]]$, $[[y]]$, $[[e \in \{0, 1\}]]$ | $[[z]]$ s.t. $z = x$ if $e = 1$, otherwise ($e = 0$) $z = y$

In this paper, we consider the standard simulation-based security notion in the presence of semi-honest adversaries (for 2PC), as in [32]. We show the definition in Appendix 2. Roughly speaking, this security notion guarantees the privacy of the secret under the condition that computing nodes do not deviate from the protocol; that is, although computing nodes are allowed to execute arbitrary attacks locally, they do not (maliciously) manipulate transmission data to other parties. The building blocks we adopt in this paper satisfy this security notion. Moreover, as described in [32], the composition theorem for the semi-honest model holds; that is, any protocol is privately computed as long as its subroutines are privately computed.

Index structure for string search

Notation and definition. $\Sigma$ denotes a set of ordered symbols. A string consists of symbols in $\Sigma$. We denote a lexicographical order of two strings $S$ and $S'$ by $S \le S'$ (e.g., A < C < G < T and AAA < AAC). We denote the $i$-th letter of a string $S$ by $S[i]$ and a substring starting from the $i$-th letter to the $j$-th letter by $S[i, j]$. The index starts with 0. The length of $S$ is denoted by $|S|$. A reverse string of $S$ (i.e., $S[|S| - 1], \ldots, S[0]$) is denoted by $\hat{S}$. We consider a direction from the $i$-th position to the $j$-th position as rightward if $i < j$ and leftward otherwise.

Given a query $w$ and a database $S$, we define the longest prefix that matches a database string (LPM) by $\max_{(0, j)} \{ j \mid w[0, \ldots, j] = S[k, \ldots, l] \}$, where $0 \le j < \ell$ and $0 \le k \le l < N$, and the longest maximal exact match (LMEM) by $\max_{(i, j)} \{ j - i \mid w[i, \ldots, j] = S[k, \ldots, l] \}$, where $0 \le i \le j < \ell$ and $0 \le k \le l < N$.

FM-Index and related data structures. FM-Index [24] and related data structures [27] are widely used for genome sequence search. Given a query string $w$ of length $\ell$ and a database string $S$ of length $N$, [24] enables LPM to be found in $O(\ell)$ time regardless of $N$, and it also enables LMEM to be found in $O(\ell)$ if auxiliary data structures are used [27]. Given all the suffixes of a string $S$: $S[0, \ldots, |S| - 1]$, $S[1, \ldots, |S| - 1]$, $\ldots$, $S[|S| - 1]$, a suffix array is an array of positions $(p_0, \ldots, p_{|S|-1})$ such that $S[p_0, \ldots, |S| - 1] \le S[p_1, \ldots, |S| - 1] \le S[p_2, \ldots, |S| - 1] \le \cdots \le S[p_{|S|-1}, \ldots, |S| - 1]$. We denote the suffix array of $S$ by $SA$ and denote its $i$-th element by $SA[i]$. A Burrows-Wheeler transform (BWT) is a permutation of the sequence $S$ such that its $i$-th letter becomes $S[SA[i] - 1]$. We denote a BWT of $S$ by $L$ and denote its $i$-th letter by $L[i]$. Let us define a rank of $S$ for a letter $c \in \Sigma$ at position $t$ by $Rank_c(t, S) = |\{ j \mid S[j] = c, 0 \le j < t \}|$ and a count of occurrences of letters that are lexicographically smaller than $c$ in $S$ by $CF_c(S) = \sum_{r < c} Rank_r(|S|, S)$, and the operation $LF_c(i, S) = CF_c(L) + Rank_c(i, L)$. The match between $w$ and $S$ is reported as a form of left-closed and right-open interval on $SA$, and the lower and upper bounds of the interval are respectively computed by $LF$. Given a letter $c$ and an interval $[f, g)$ that corresponds to suffixes that share the prefix $x$ (i.e., $[f, g)$ reports the locations of the substring $x$ in $S$), we can find a new interval that corresponds to all suffixes that share the prefix $cx$ (i.e., locations of the substring $cx$) by

$$[f', g') = [LF_c(f, S), LF_c(g, S)). \tag{1}$$

The leftward extension of the match is called a backward search, which is the main functionality of FM-Index. By starting the search with the initial interval $[0, N)$ and conducting the backward searches for $w[\ell - 1], w[\ell - 2], \ldots$, the longest suffix match is detected when $f = g$. $Rank$ and $CF$ are precomputed and stored in an efficient form that can be searched in constant time. Therefore, the longest suffix match can be computed in $O(\ell)$ time. LPM is found if the search is conducted on $\hat{S}$ and the match is extended by $w[0], w[1], \ldots, w[\ell - 1]$.

Searching LMEM by repeating LPM for $w[0, \ldots, \ell - 1], w[1, \ldots, \ell - 1], w[2, \ldots, \ell - 1], \ldots, w[\ell - 1]$ takes $O(\ell^2)$ time. We can improve it to $O(\ell)$ time by using the longest common prefix (LCP) array and related data structures [27]. The LCP array, denoted by $LCP$, is an array that stores the length of the longest common prefix of $S[SA[i - 1], |S| - 1]$ and $S[SA[i], |S| - 1]$ in $LCP[i]$ for $0 < i \le N$. The lcp-interval $[i, j)$ of lcp-value $d$ is an interval such that it satisfies $LCP[i] < d$, $LCP[j] < d$, $LCP[k] \ge d$ for all $k \in \{i + 1, \ldots, j - 1\}$, and $LCP[k] = d$ for at least one $k \in \{i + 1, \ldots, j - 1\}$, and is denoted by $d$-$[i, j)$. $d$-$[i, j)$ corresponds to all the suffixes that share the prefix $S[SA[i], \ldots, SA[i] + d - 1]$. The parent interval of $d$-$[i, j)$ is the lcp-interval $h$-$[m, n)$ such that $h < d$ and $0 \le m \le i < j \le n < N$, and there is no other lcp-interval $t$-$[r, s)$ such that $h < t < d$ and $0 \le m \le r \le i < j \le s \le n < N$. The parent of the lcp-interval $[f, g)$ can be found by

$$[f', g') = \begin{cases} [PSV[f], NSV[f]) & (LCP[g] \le LCP[f]) \\ [PSV[g], NSV[g]) & (\text{otherwise}), \end{cases} \tag{2}$$

where $PSV[i] = \max\{ j \mid 0 \le j < i \wedge LCP[j] < LCP[i] \}$ and $NSV[i] = \min\{ j \mid i \le j < N \wedge LCP[j] < LCP[i] \}$. By finding a parent interval using $PSV$ and $NSV$ whenever it fails to extend the match, we can avoid useless backward searches, and thus LMEM is found in at most $2\ell$ backward searches. $LCP$, $PSV$, and $NSV$ are precomputed and stored in an efficient form that can be searched in constant time, so we can find LMEM in $O(\ell)$ time. See section 5.2 of [27] for more details of the data structures. Examples of the search by FM-Index, $LCP$, $PSV$, and $NSV$ are provided in Appendix 1.

Proposed protocols

Problem setting and outline of our protocols

We assume that a query holder $A$, a database holder $B$, and two computing nodes $P_0$ and $P_1$ participate in the protocol. $A$ holds a query string $w$ of length $\ell$ and $B$ holds a database string $T$ of length $N$. After the protocol is run, only $A$ knows LPM or LMEM between $w$ and $T$. $P_0$ and $P_1$ do not obtain any information of $w$ and $T$, except for $\ell$ and $N$.

Our protocol consists of offline, DB preparation, and Search phases. In the offline phase, $B$ generates BTs (correlated randomness used for multiplication) and sends them to $P_0$ and $P_1$. In the DB preparation phase, $B$ creates a lookup table and distributes its shares to $P_0$ and $P_1$. In the Search phase, $A$ generates shares of the query and sends them to $P_0$ and $P_1$, and $P_0$ and $P_1$ jointly compute the result without obtaining any information of the lookup table. Finally, $A$ obtains the results. Figure 2 shows the schematic view of our goal and model. Note that the offline and DB preparation phases do not depend on a query string, so they can be computed in advance for multiple queries.

Fig. 2 Schematic view of our goal and model. (0) Server (DB holder) distributes Beaver triples. (A reliable third party can serve as the trusted initializer instead.) (1) Server distributes shares of the database. (2) User (query holder) distributes shares of the query. (3) The computing nodes jointly calculate shares of the result. (4) The results are sent to User. The offline phase is (0), DB preparation phase is (1), and Search phase consists of (2)–(4)

In section "Secret-shared recursive oblivious transfer", we propose the important building block ss-ROT that enables recursive reference to a lookup table. In section "Secure LPM", we describe how to design the lookup table based on FM-Index, and propose an efficient protocol for LPM by using the lookup table and ss-ROT. In section "Secure LMEM", we describe the additional table design for auxiliary data structures, and propose the complete protocol for LMEM. Table 2 summarizes the theoretical complexities of the three protocols. For comparison, the complexities of the baseline protocols and a previous method for LPM based on an additive homomorphic encryption [17, 19] are shown. As we mentioned in section "Introduction", the baseline protocols are designed using major techniques of secret-sharing-based protocols. The detailed algorithms are described in Appendix 3.

Table 2 Summary of complexities for our protocols and related protocols

Btime Bsize Dtime Dsize Stime Comm. Round
ss-ROT (proposed) 0 0 ℓN ℓN ℓ ℓ ℓ
Secure LPM (proposed) ℓ ℓ ℓN ℓN ℓ ℓ ℓ
[17, 19] (LPM by AHE) − − − − ℓN ℓ ℓ N
2 2 2 2
Baseline LPM N N log ℓ + log N ℓ N ℓ N ℓ N ℓ N
2 2 2 2
Secure LMEM (proposed) ℓN ℓN ℓ ℓ ℓ ℓ ℓ
3 3 3 3
Baseline LMEM N N log ℓ + log N ℓ N ℓ N ℓ N ℓ N

Btime and Bsize are generation time and size of BTs. Dtime and Dsize are generation time for the shares of the database and size of the shares. Stime is the time for Search phase. Comm. is the size of data exchanged between computing nodes. Round is the number of data exchanges

Secret-shared recursive oblivious transfer

We define a problem called a secret-shared recursive oblivious transfer (ss-ROT) as follows.

Definition 1 We assume a database holder $B$ and two computing nodes $P_0$ and $P_1$ participate in the protocol. $B$ holds a vector $V$ of length $N$ and $0 \le V[i] < N$. Given the initial position $p_0$ and the depth of recursion $\ell$ ($2 \le \ell$), the secret-shared recursive oblivious transfer protocol outputs shares of

$$V[V[\cdots V[p_0] \cdots]] \tag{3}$$

without leaking $V$ to $P_0$ and $P_1$.

For simplicity, we denote the recursion of Eq. 3 by $V^{(\ell)}[p_0]$ (e.g., $V[V[p_0]]$ is denoted by $V^{(2)}[p_0]$). In our protocol, all the random values are uniformly generated from $\mathbb{Z}_N$.

DB preparation phase. $B$ generates $\ell - 1$ random values $r^0, \ldots, r^{\ell-2}$ and computes the following vectors $R^0, \ldots, R^{\ell-1}$. Each vector $R^j$ has $N$ elements.

$$R^j[i] = \begin{cases} (V[i] + r^0) \bmod N & (j = 0) \\ (V[(i - r^{j-1}) \bmod N] + r^j) \bmod N & (1 \le j \le \ell - 2) \\ V[(i - r^{j-1}) \bmod N] \bmod N & (j = \ell - 1) \end{cases} \tag{4}$$

$B$ computes Share($R^j[i]$) and sends $[[R^j[i]]]_0$ and $[[R^j[i]]]_1$ to $P_0$ and $P_1$, for $i = 0, \ldots, N - 1$ and $j = 0, \ldots, \ell - 1$.

Search phase. The Search phase consists of two steps and is described in Lines 2–5 of Protocol 1. The input is the initial position $p_0$ and shares of $R$. The output is $[[V^{(\ell)}[p_0]]]$. An example of a search is illustrated in Fig. 3.

Fig. 3 Example of a search when $V = (2, 0, 3, 1)$, $p_0 = 2$, and $\ell = 4$. The goal is to compute $[[V^{(4)}[2]]] = [[2]]$. Here we assume $B$ generates $r^0 = 1$, $r^1 = 2$, $r^2 = 1$. In Step 1 of Search phase, $P_0$ and $P_1$ jointly compute Reconst($[[R^0[2]]]_0, [[R^0[2]]]_1$) to obtain $R^0[2] = 0$. ($R^0[2]$ is randomized by $r^0$, so no element of $V$ is leaked.) In a similar way, $P_0$ and $P_1$ compute $R^1[0] = 3$ and $R^2[3] = 1$. In Step 2, $P_0$ and $P_1$ output $[[R^3[1]]]_0$ and $[[R^3[1]]]_1$ respectively. Since $R^0[2] = V[2] + r^0$, $R^1[V[2] + r^0] = V[V[2] + r^0 - r^0] + r^1$, $R^2[V[V[2]] + r^1] = V[V[V[2]] + r^1 - r^1] + r^2$, and $R^3[V[V[V[2]]] + r^2] = V[V[V[V[2]]] + r^2 - r^2]$, ss-ROT successfully computes $[[V^{(4)}[2]]]$

Security intuition. In the DB preparation phase of ss-ROT, $B$ does not disclose any private values, and $P_0$ and $P_1$ receive the shares. In the Search phase, all the messages exchanged between $P_0$ and $P_1$ are shares except for the result of Reconst in Step 1. In the $j$-th step of the loop in Step 1, $p_{j+1} = R^j[p_j] = (V^{(j+1)}[p_0] + r^j) \bmod N$ is reconstructed. Since the reconstructed value is randomized by $r^j$, no information is leaked. Note that for each vector $R^j$, all the elements $R^j[0], \ldots, R^j[N - 1]$ are randomized by the same value $r^j$, but only one of them is reconstructed, and different random numbers $r^0, \ldots, r^{\ell-2}$ are used for $R^0, \ldots, R^{\ell-2}$. In Step 2, $P_0$ and $P_1$ output a result, and no information other than the result is leaked.

Security

Theorem 1 ss-ROT is correct and secure in the semi-honest model.

Proof Correctness and security of ss-ROT protocol are proved as follows.

Correctness. We assume the following equation.

$$p_i = (V^{(i)}[p_0] + r^{i-1}) \bmod N \tag{5}$$

In Step 1, for $j = 0$, the protocol computes $p_1$ by reconstructing $R^0[p_0]$. From the definition of $R^j[i]$ in Eq. 4,

$$p_1 = R^0[p_0] = (V^{(1)}[p_0] + r^0) \bmod N. \tag{6}$$

For $j = k$, the protocol computes $p_{k+1}$ by reconstructing $R^k[p_k]$. From the definition of $R^j[i]$ in Eq. 4 and the assumption of Eq. 5,

$$p_{k+1} = R^k[p_k] = (V[(p_k - r^{k-1}) \bmod N] + r^k) \bmod N = (V[V^{(k)}[p_0]] + r^k) \bmod N = (V^{(k+1)}[p_0] + r^k) \bmod N. \tag{7}$$

Eq. 5 holds for $i = 1$ by Eq. 6. It also holds for $i = k + 1$ under the assumption that Eq. 5 holds for $i = k$. Therefore by induction, Eq. 5 holds for $i = 1, \ldots, \ell - 1$.

In Step 2, $P_0$ and $P_1$ output $[[R^{\ell-1}[p_{\ell-1}]]]$. Since Eq. 5 holds for $i = \ell - 1$, $R^{\ell-1}[p_{\ell-1}] = V[(p_{\ell-1} - r^{\ell-2}) \bmod N] \bmod N$ is transformed into $V^{(\ell)}[p_0] \bmod N$ by plugging in $p_{\ell-1} = V^{(\ell-1)}[p_0] + r^{\ell-2}$. Therefore the final output of ss-ROT becomes $V^{(\ell)}[p_0] \bmod N$. The above argument completes the proof of correctness of Theorem 1.

Security. Since the roles of $P_0$ and $P_1$ are symmetric, it is sufficient to consider the case when $P_0$ is corrupted. The input to $P_0$ is $p_0$ and $\ell$, and the output of $P_0$ is $V^{(\ell)}[p_0]$. The function achieved by Protocol 1 is deterministic and the protocol is correct. Therefore, to ensure the security of Protocol 1, we need to prove the existence of a probabilistic polynomial-time simulator $\mathcal{S}$ such that

$$\{(\mathcal{S}(p_0, \ell, V^{(\ell)}[p_0]), V^{(\ell)}[p_0])\} \equiv \{(X, V^{(\ell)}[p_0])\}, \tag{8}$$

where $X$ is $P_0$'s view. $X$ consists of:

• $[[R^j[i]]]_0$ for $i = 0, \ldots, N - 1$ and $j = 0, \ldots, \ell - 1$ (a message from $B$)
• $[[R^j[p_j]]]_1$ ($j$-th message from $P_1$) for $j = 0, \ldots, \ell - 1$
• $p_j$ ($j$-th value obtained by Reconst($[[R^{j-1}[p_{j-1}]]]_0, [[R^{j-1}[p_{j-1}]]]_1$) in Step 1) for $j = 1, \ldots, \ell - 1$.

All the messages from $B$ and $P_1$ are uniformly at random in $\mathbb{Z}_{2^n}$, as they are generated by Share. $p_{j+1} =$ Reconst($[[R^j[p_j]]]_0, [[R^j[p_j]]]_1$) holds for $j = 0, \ldots, \ell - 2$, and $V^{(\ell)}[p_0] =$ Reconst($[[R^{\ell-1}[p_{\ell-1}]]]_0, [[R^{\ell-1}[p_{\ell-1}]]]_1$) holds. $p_1 = R^0[p_0]$, $p_2 = R^1[p_1]$, $\ldots$, $p_{\ell-1} = R^{\ell-2}[p_{\ell-2}]$ are uniformly at random in $\mathbb{Z}_N$ from the definition of Eq. 4.

Let us denote a random number $u$ chosen from a set $U$ uniformly at random by $u \in U$. We construct $\mathcal{S}$ as described in Protocol 2. The output of $\mathcal{S}$ is $\tilde{R}_0 \in \mathbb{Z}_{2^n}^{\ell \times N}$, $\tilde{R}_1 \in \mathbb{Z}_{2^n}^{\ell}$, and $\tilde{p}_1, \ldots, \tilde{p}_{\ell-1}$. In Line 6 and Line 9, $\tilde{p}_1, \ldots, \tilde{p}_{\ell-1}$ are generated such that they are uniformly at random in $\mathbb{Z}_N$. In Line 7, $\tilde{R}_0^0[p_0]$ and $\tilde{R}_1[0]$ are generated by Share such that they are shares of $\tilde{p}_1$ and uniformly at random in $\mathbb{Z}_{2^n}$. In Line 10, $\tilde{R}_0^j[\tilde{p}_j]$ and $\tilde{R}_1[j]$ are generated by Share such that they are shares of $\tilde{p}_{j+1}$ and uniformly at random in $\mathbb{Z}_{2^n}$ for $j = 1, \ldots, \ell - 2$. In Line 12, $\tilde{R}_0^{\ell-1}[\tilde{p}_{\ell-1}]$ and $\tilde{R}_1[\ell - 1]$ are generated by Share such that they are shares of $V^{(\ell)}[p_0]$ and uniformly at random in $\mathbb{Z}_{2^n}$. All the elements of $\tilde{R}_0$ except for $\tilde{R}_0^0[p_0]$ and $\tilde{R}_0^j[\tilde{p}_j]$ ($j = 1, \ldots, \ell - 1$) are uniformly at random in $\mathbb{Z}_{2^n}$ by Line 3. Therefore, Eq. 8 holds. By the above discussion, we find our ss-ROT satisfies security in the semi-honest model.

Complexities

In the DB preparation phase, $B$ generates shares of $V$ of length $N$ for $\ell$ times. Therefore, time and communication complexities are $O(\ell N)$. For the Search phase, Reconst is computed $\ell$ times in Step 1. Since the time, communication, and round complexities of Reconst are $O(1)$, those of the Search phase become $O(\ell)$.

Secure LPM

Construction of lookup table. The goal is to find LPM securely. To apply FM-Index for a prefix search, the reverse string of $T$ (i.e., $\hat{T}$) is used. The backward search of FM-Index is formulated by Eq. 1. If we precompute $LF_c(i, \hat{T})$ for $i = 0, \ldots, N$ and $c \in \{A, T, G, C\}$, and store them in a lookup table that consists of four vectors $V_A$, $V_C$, $V_G$, and $V_T$ such that $V_c[i] = LF_c(i, \hat{T})$, Eq. 1 is replaced by the following table lookup

$$f_{k+1} = V_{w[k]}[f_k], \quad g_{k+1} = V_{w[k]}[g_k]. \tag{9}$$

I.e., starting with the initial interval $[f_0 = 0, g_0 = N)$, we can compute the match by recursively referring to the lookup table while $f < g$.

Protocol overview. The key idea of Secure LPM is to refer to $V$ by ss-ROT, i.e., $P_0$ and $P_1$ jointly refer to $V$ $\ell$ times in a recursive manner. To achieve backward search, $P_0$ and $P_1$ need to select $V_x[\cdot]$ for each reference, where $x$ is a query letter to be searched with. This is achieved by expressing the query letter by unary code (Eq. 11) and computing the inner product of Eq. 11 and $(V_A[\cdot], V_C[\cdot], V_G[\cdot], V_T[\cdot])$. To find LPM, $P_0$ and $P_1$ need to check $f = g$ for each reference. We use the subprotocol Equality to check it securely. Since $V$ is randomized with different numbers for searching $f$ and $g$, the difference of the random numbers is precomputed and removed securely upon the equality check. $A$ receives only the result of each equality check to know LPM. For example, LPM is the prefix of length $i - 1$ when $f = g$ for the $i$-th reference. If $f \ne g$ for all references, LPM is the entire query.

DB preparation phase. $B$ creates a lookup table and generates the following $4\ell$ vectors in a similar manner to ss-ROT. For simplicity, we denote the length of $V_c$ by $N' = N + 1$.

$$R^j_{c,f}[i] = \begin{cases} (V_c[i] + r^j_f) \bmod N' & (j = 0) \\ (V_c[(i - r^{j-1}_f) \bmod N'] + r^j_f) \bmod N' & (1 \le j < \ell) \end{cases} \tag{10}$$

$R^j_{c,f}[i]$ is used for computing the lower bound $f$ of the interval $[f, g)$. We also generate $R^j_{c,g}[i]$ for the upper bound $g$. $R$ consists of $8\ell$ vectors, each of length $N'$. Since the longest match is found when $f = g$, $B$ also generates a vector $r'[j] = (r^j_f - r^j_g) \bmod N'$ that is used for the equality check of $f$ and $g$. Then, $B$ sends shares of $R^j_{c,f}[i]$, $R^j_{c,g}[i]$, and $r'[j]$ to $P_0$ and $P_1$.

Search phase. Protocol 3 describes the algorithm in detail. $A$ generates four vectors $q_A, q_C, q_G, q_T$, each of length $\ell$, as follows.

$$q_c[j] = \begin{cases} 1 & (c = w[j]) \\ 0 & (c \ne w[j]) \end{cases} \tag{11}$$

For each $j$, $(q_A[j], q_C[j], q_G[j], q_T[j])$ encodes $w[j]$ (e.g., $(q_A[j], q_C[j], q_G[j], q_T[j]) = (1, 0, 0, 0)$ if $w[j] = A$). The aim of this encoding is to compute $[[R^j_x[i]]] = [[\sum_{c \in \Sigma} q_c[j] \cdot R^j_c[i]]]$ when $w[j] = x$. Figure 4 illustrates an example of the table lookup.

Fig. 4 Example of a secure table lookup when $w =$ GCT and $T =$ ACGT. Only the lookup for a lower bound is shown. For simplicity, $R^j_{c,f}$ and $r^j_f$ are denoted by $R^j_c$ and $r^j$. $LF_{w[i]}(f_i, \hat{T})$ ($i = 0, 1, 2$) is computed by $V_G[0]$, $V_C[2]$, and $V_T[1]$. $V$ is referenced securely by using $R$. $R^0_G[0]$ is computed by $\sum_{c \in \Sigma} q_c[0] \cdot R^0_c[0]$. $R^1_C[2 + r^0]$ is computed by $\sum_{c \in \Sigma} q_c[1] \cdot R^1_c[2 + r^0]$. $R^2_T[1 + r^1]$ is computed by $\sum_{c \in \Sigma} q_c[2] \cdot R^2_c[1 + r^1]$

$A$ generates shares of $q_A, q_C, q_G, q_T$ and distributes them to $P_0$ and $P_1$. $P_0$ and $P_1$ compute $LF_{w[j]}(f', \hat{T}) + r^j_f$ and $LF_{w[j]}(g', \hat{T}) + r^j_g$ in Lines 5–8 without leaking $f'$ and $g'$, where $[f', g')$ corresponds to the match of $w[0, j]$ and $T$. In Lines 10–13, the equality of $f'$ and $g'$ is examined for all rounds. Note that different values $r^{j-1}_f$ and $r^{j-1}_g$ are used for $f' = (f_j - r^{j-1}_f) \bmod N'$ and $g' = (g_j - r^{j-1}_g) \bmod N'$ in order to conceal $f'$ and $g'$. Since $f'$, $g'$, $r^{j-1}_f$, $r^{j-1}_g$, $r'[j - 1] \in \{0, \ldots, N' - 1\}$, it is sufficient to check if $f_j - g_j - r'[j - 1]$ is equal to one of $-N'$, $0$, and $N'$. In Lines 16–18, $A$ receives all the results of equality checks (i.e., $[[o[1]]]^B, \ldots, [[o[\ell]]]^B$) from $P_0$ and $P_1$, and knows LPM by reconstructing them. For example, if $w =$ GCT and $o = (0, 0, 1)$, $A$ knows that LPM is GC.

Security

Theorem 2 Protocol 3 is correct and secure in the semi-honest model.

Proof Correctness and security of Protocol 3 are proved as follows.

Correctness. The lookup table $V$ simply stores all possible outputs of $LF$. Therefore, backward search (Eq. …

… in Lines 7–8 (see section "Secure computation based on secret sharing" for details of the subprotocols). In Lines 7–8, reconstructed values are $R^j_{w[j],f}[f_j]$ and $R^j_{w[j],g}[g_j]$. Since the values are $(V_{w[j]}[f_j] + r^j_f) \bmod N'$ and $(V_{w[j]}[g_j] + r^j_g) \bmod N'$ according to Eq. 10, it is obvious that $V$ is randomized for all rounds $j = 0, \ldots, \ell - 1$, and no information is leaked. For Lines 14–17, only the output of Equality at Line 11 is reconstructed. The reconstructed values are either 1 or 0 according to
1) is Equality , and no information other than the result is equivalent to Eq.  9. For the case of querying w, leaked. V [··· V [p ]··· ] becomes lower bound f (for w[0] 0 w[k−1] p = 0 ) or upper bound g (for p = N ) of the interval 0 0 A may reveal T by making many queries. Such a problem that corresponds to the prefix match of length k. In Line is called output privacy. Although output privacy is out- k k 5 of Protocol  3, [[R [f ]× q [k] + R [f ]× q [k] k A k A,f C,f side of the scope of this paper, we should mention here k k +R [f ]× q [k]+ R [f ]× q [k]]] is computed. Since k G k T that A needs to make an unrealistically large number of G,f T,f q [j] = 1 and q [j] = 0 ( c =w[j] ), it is equivalent to queries for obtaining T by such a brute-force attack, con- w[j] c [[R [f ]]] [[R [g ]]] sidering that N is very long. . Line 6 computes in the same k k w[k],f w[k],g manner. Each vector R in Eq.  10 is generated in the c,f Complexities same manner as R in Eq.  4. Since Eq.  10 uses the com- j j−1 j j j j The DB preparation phase generates shares of R and c,f mon random values r and r for R , R , R , R , f f A,f C,f G,f T,f j R ( c ∈  and 0 ≤ j <ℓ ); i.e., 8 × ℓ vectors of length N . c,g we can recursively reference V ( c ∈{ A, C, G, T } ), Therefore, the time and communication complexities are which is obvious from the correctness of ss-ROT. O(ℓN ) . For the Search phase, MULT and Reconst are Therefore, the recursion by Line 5 and Line 7 can k−1 computed twice in Lines 4–9 for ℓ rounds and Equality is compute (V [··· V [f ]··· ] + r ) ′ , and w[k−1] w[0] 0 mod N computed once in Lines 10–13 for ℓ rounds. Note that the recursion by Line 6 and Line 8 can also compute Equality is computed in parallel, and the number of k−1 (V [··· V [g ]··· ] + r ) . w[0] 0 w[k−1] g mod N round can be reduced to a constant number. 
Each time, the communication and round complexities of these sub- The longest match is found when the interval width protocols are O(1), so those of the Search phase become k−1 becomes 0. Since f = (V [··· V [f ]···] + r ) ′ w[0] 0 k w[k−1] mod N O(ℓ). k−1 and g = (V [··· V [g ]··· ] + r ) ′ are w[0] 0 k w[k−1] mod N randomized, Line 11 computes f − g − (r [k − 1] = k k Secure LMEM k−1 k−1 (r − r ) ′ ) to obtain the correct interval width. mod N f g Construction of lookup table As described in sec- When the width is 0, d becomes either one of 0, N and tion  "Index structure for string search", we can find a −N . Therefore, Line 12 computes the equality d and 0, parent interval by a reference to LCP , PSV , and NSV . ′ ′ N and −N respectively. By reconstructing all the results Therefore, in addition to V defined in section  "Secure in Lines 16–18, A knows the round, in which the interval LPM", we prepare lookup tables that simply store all the width becomes 0; i.e., he/she knows LPM. The above outputs of them; i.e., V [i] = LCP[i] , V [i] = PSV[i] , psv lcp argument completes the proof of correctness of and V [i] = NSV[i]. nsv Theorem 2. DB preparation phase B generates randomized vectors j j R , R and r [j] = (r − r ) ′ using the same algo- c,f c,g g mod N Security We only show a sketch of the proof. For Lines rithm in section "Secure LPM" for length 2ℓ . As shown in 1–2 of Protocol 3, A and B do not disclose any private Eq.  2, V is referred by the upper and lower bounds of lcp values, and P and P receive the shares. For Lines [f,  g). Therefore, B generates following circular permuta- 3–14, it is guaranteed by the subprotocols ADD , MULT , tions of V such that W and R , and W and R , are lcp l,f c,f l,g c,g and Equality that all the messages exchanged between permutated by the same random values, respectively. I.e., P and P are shares except for the output of Reconst 0 Nak agawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 11 of 22 otherwise. 
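The effect of permuting a randomized table and an auxiliary table by the same random values can be checked on a toy example. The following is a minimal sketch in plain Python with no secret sharing at all: the masks `r` are drawn in the clear, whereas in the protocols B draws them and $P_0$ and $P_1$ only ever see shares and masked positions. The function names, the toy tables `V` and `A`, and the table size are hypothetical and chosen only for illustration; `build_masked_tables` follows the shape of Eq. 10 and `build_rotated_aux` follows the rotation-only shape used for the LCP table.

```python
import random


def build_masked_tables(V, r):
    """R^0[i] = (V[i] + r[0]) mod n; R^j[i] = (V[(i - r[j-1]) mod n] + r[j]) mod n."""
    n = len(V)
    tables = [[(V[i] + r[0]) % n for i in range(n)]]
    for j in range(1, len(r)):
        tables.append([(V[(i - r[j - 1]) % n] + r[j]) % n for i in range(n)])
    return tables


def build_rotated_aux(A, r):
    """W^0[i] = A[i]; W^j[i] = A[(i - r[j-1]) mod n] (rotation only, no additive mask)."""
    n = len(A)
    tables = [list(A)]
    for j in range(1, len(r)):
        tables.append([A[(i - r[j - 1]) % n] for i in range(n)])
    return tables


def masked_walk(R, p0):
    """Positions visible during the recursion: p0, then p_{j+1} = R^j[p_j]."""
    positions = [p0]
    for table in R:
        positions.append(table[positions[-1]])
    return positions


rng = random.Random(0)
n, depth = 11, 5
V = [rng.randrange(n) for _ in range(n)]      # toy lookup table, values in [0, n)
A = [rng.randrange(100) for _ in range(n)]    # toy auxiliary table (stand-in for V_lcp)
r = [rng.randrange(n) for _ in range(depth)]  # one mask per round

R = build_masked_tables(V, r)
W = build_rotated_aux(A, r)
seen = masked_walk(R, p0=3)

# Reference computation without any masking: plain[j] = V applied j times to p0.
plain = [3]
for _ in range(depth):
    plain.append(V[plain[-1]])

for j in range(depth):
    # The value reconstructed in round j is the true position shifted by r[j] ...
    assert seen[j + 1] == (plain[j + 1] + r[j]) % n
    # ... and the rotated auxiliary table, indexed by the masked position,
    # returns the entry that belongs to the true position.
    assert W[j][seen[j]] == A[plain[j]]
```

In this sketch every intermediate position is offset by a fresh mask, yet the auxiliary table can still be indexed directly with the masked position because it was rotated by the same value.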
\[
W_{l,x}^j[i] =
\begin{cases}
V_{lcp}[i] & (j = 0) \\
V_{lcp}[(i - r_x^{j-1}) \bmod N'] & (1 \le j < 2\ell),
\end{cases}
\]

where $x$ is either $f$ or $g$. $V_{psv}$ is referenced by both $f$ and $g$, and its output is plugged in to $f$. Therefore, B generates $W_{p,f}^j$ and $W_{p,g}^j$ such that both of them are randomized by $r_f^j$, while $W_{p,f}^j$ is permuted by $r_f^{j-1}$ and $W_{p,g}^j$ is permuted by $r_g^{j-1}$, as follows.

\[
W_{p,f}^j[i] =
\begin{cases}
(V_{psv}[i] + r_f^j) \bmod N' & (j = 0) \\
(V_{psv}[(i - r_f^{j-1}) \bmod N'] + r_f^j) \bmod N' & (1 \le j < 2\ell)
\end{cases}
\]

\[
W_{p,g}^j[i] =
\begin{cases}
(V_{psv}[i] + r_f^j) \bmod N' & (j = 0) \\
(V_{psv}[(i - r_g^{j-1}) \bmod N'] + r_f^j) \bmod N' & (1 \le j < 2\ell)
\end{cases}
\]

Similarly, $V_{nsv}$ is referenced by both $f$ and $g$, and its output is plugged in to $g$. Therefore, B generates $W_{n,f}^j[i]$ and $W_{n,g}^j[i]$ as follows.

\[
W_{n,f}^j[i] =
\begin{cases}
(V_{nsv}[i] + r_g^j) \bmod N' & (j = 0) \\
(V_{nsv}[(i - r_f^{j-1}) \bmod N'] + r_g^j) \bmod N' & (1 \le j < 2\ell)
\end{cases}
\]

\[
W_{n,g}^j[i] =
\begin{cases}
(V_{nsv}[i] + r_g^j) \bmod N' & (j = 0) \\
(V_{nsv}[(i - r_g^{j-1}) \bmod N'] + r_g^j) \bmod N' & (1 \le j < 2\ell)
\end{cases}
\]

B distributes shares of $R_{c,f}$, $R_{c,g}$, $r'$, $W_{l,f}$, $W_{l,g}$, $W_{p,f}$, $W_{p,g}$, $W_{n,f}$, and $W_{n,g}$ to $P_0$ and $P_1$.

Search phase. Protocol 4 describes the algorithm in detail. A generates query vectors $q_A$, $q_C$, $q_G$, $q_T$ by Eq. 11 and distributes shares of the vectors to $P_0$ and $P_1$. In Line 6 of Protocol 4, $[\hat{f}, \hat{g})$ is computed by the reference to $R$ (i.e., a search based on a backward search) similarly to Lines 5–6 of Protocol 3. In Line 11, $[f_{ex}, g_{ex})$ is computed by the reference to $W$ (i.e., a search based on LCP, PSV and NSV). In Line 13, the interval is updated by either $[\hat{f}, \hat{g})$ or $[f_{ex}, g_{ex})$ based on the result of $f' = g'$ in Lines 7–9, where $[f', g')$ is the interval that corresponds to a substring match.

In each round, we need to know the query letter to be searched with, so we need to maintain the right end position of the match in the query. The position moves toward the right while the match is extended, but remains the same when the interval is updated based on PSV and NSV. To memorize the position, we prepare shares of a unit bit vector $u$ of length $\ell$, in which the position $t$ is memorized as $u[t] = 1$ and $u[i \neq t] = 0$. In Lines 20–23, $u$ remains the same if the interval is updated based on PSV and NSV, and $u = (0, u[0], u[1], \ldots, u[\ell - 2])$ otherwise. When the search is finished (e.g., the right end of a match exceeds the right end of the query), $u = (0, \ldots, 0)$. Therefore, in Lines 25–28, the value of $x$ distinguishes the case in which the right end of a match does not exceed the right end of the query from the case in which the search has finished. In Lines 29–31, the inner product of $q_c$ ($c \in \Sigma$) and $u$ becomes the encoding of $w[t]$ that is used for the next round.

We also maintain the left end position of the match. While the match is extended, the position remains the same, and it moves toward the right when the interval is updated by $[f_{ex}, g_{ex})$. The new left end position can be computed by $p + m - c$, where $p$ is the current position, $m$ is the length of the current match, and $c$ is the lcp-value of $[f_{ex}, g_{ex})$ (i.e., the longest common prefix length of the suffixes contained in $[f_{ex}, g_{ex})$). The position is computed in Line 33. The match length is incremented by 1 for each extension while the right end of the match does not exceed the query length. When the interval is updated by $[f_{ex}, g_{ex})$, the match length is reduced to the lcp-value of $[f_{ex}, g_{ex})$, which is computed by $\max(LCP[f_{ex}], LCP[g_{ex}])$. The match length is computed in Line 32. In Line 35, the longest match length and the corresponding left end position are updated. After all the positions in the query have been examined, LMEM and its left end position are sent to A in Line 37.

Security

Theorem 3 Protocol 4 is correct and secure in the semi-honest model.

Proof. Correctness and security of Protocol 4 are proved as follows.

Correctness. $V$, $R$, $r'$ and $q$ are generated by the same algorithm used in Protocol 3. Therefore, Line 6 is equivalent to a backward search, and e1 is the result of the equality check of 0 and the width of the obtained interval in Lines 7–8. The lookup tables $V_{lcp}$, $V_{psv}$, and $V_{nsv}$ store all the outputs of LCP, PSV and NSV, and $W_l$, $W_p$, and $W_n$ are generated based on $V_{lcp}$, $V_{psv}$, and $V_{nsv}$, respectively. Since $W_{l,f}^j$ and $W_{l,g}^j$ are circular permutations of $V_{lcp}$ by the same random values $r_f^{j-1}$ and $r_g^{j-1}$ that are used for generating $R_{c,f}^j$ and $R_{c,g}^j$ ($c \in \Sigma$) respectively, Line 8 can compute $LCP[g_j] \le LCP[f_j]$ and e2 holds the result. By using Choose and e2, either $[W_{p,f}^j[f_j], W_{n,f}^j[f_j])$ or $[W_{p,g}^j[g_j], W_{n,g}^j[g_j])$ is selected. $W_{p,f}^j$ and $W_{p,g}^j$ are permuted by $r_f^{j-1}$ and $r_g^{j-1}$, but are randomized by the identical random value $r_f^j$. Similarly, $W_{n,f}^j$ and $W_{n,g}^j$ are permuted by $r_f^{j-1}$ and $r_g^{j-1}$, but are randomized by $r_g^j$. Since $W_{p,f}^j[f_j]$ and $W_{n,g}^j[g_j]$ are generated in the same manner as $R_{c,f}$ and $R_{c,g}$, it is obvious that the reference by them is correct. The reference by $W_{n,f}^j[f_j]$ is transformed into

\[
\begin{aligned}
X_g^{j+1}[W_{n,f}^j[f_j]] &= V_x[W_{n,f}^j[f_j] - r_g^j] + r_g^{j+1} \\
&= V_x[V_{nsv}[f_j - r_f^{j-1}] + r_g^j - r_g^j] + r_g^{j+1} \\
&= V_x[V_{nsv}[f_j - r_f^{j-1}]] + r_g^{j+1}
\end{aligned}
\tag{12}
\]

and the reference by $W_{p,g}^j[g_j]$ is transformed into

\[
\begin{aligned}
X_f^{j+1}[W_{p,g}^j[g_j]] &= V_x[W_{p,g}^j[g_j] - r_f^j] + r_f^{j+1} \\
&= V_x[V_{psv}[g_j - r_g^{j-1}] + r_f^j - r_f^j] + r_f^{j+1} \\
&= V_x[V_{psv}[g_j - r_g^{j-1}]] + r_f^{j+1}
\end{aligned}
\tag{13}
\]

where $X^{j+1}$ is any one of $R_c^{j+1}$, $W_p^{j+1}$ and $W_n^{j+1}$, and $V_x$ is the corresponding lookup table; i.e., one of $V_c$, $V_{psv}$ and $V_{nsv}$. Note that $V_x$ could be a different table for each $j + 1$, but we reuse the same symbol for simplicity of notation. Since $f_j$ and $g_j$ are described in the form of $V_x^{(j)}[p_0] + r_f^{j-1}$ and $V_x^{(j)}[p_0] + r_g^{j-1}$ based on Eq. 5, Eq. 12 and Eq. 13 are transformed into $V_x^{(j+2)}[p_0] + r_g^{j+1}$ and $V_x^{(j+2)}[p_0] + r_f^{j+1}$, which also satisfy the recursion form of Eq. 5. Thus, the intervals $[W_{p,f}^j[f_j], W_{n,f}^j[f_j])$ and $[W_{p,g}^j[g_j], W_{n,g}^j[g_j])$ are correct intervals and Line 11 is equivalent to computing Eq. 2.

In Lines 16–23, $u$ remains the same if e1 = 0 and $u = (0, u[0], u[1], \ldots, u[\ell - 2])$ otherwise. Therefore Lines 29–31 can choose the letter to be searched with. The match length and the start position are obtained based on e1 in Lines 32–33, and the longest value and the corresponding position are selected in Lines 34–35. The shares of the length and start position of LMEM are sent to A, and A reconstructs them. Then, Protocol 4 outputs them. The above argument completes the proof of correctness of Theorem 3.

Security. We only show a sketch of the proof. For Lines 1–2 of Protocol 4, A and B do not disclose any private values, and $P_0$ and $P_1$ receive the shares. For Lines 3–37, it is guaranteed by the subprotocols ADD, MULT, Equality, and Choose that all the messages exchanged between $P_0$ and $P_1$ are shares except for the output of Reconst in Line 14 (see section "Secure computation based on secret sharing" for details of the subprotocols). In Line 14, the reconstructed values are $f_{j+1} = V_x^{(j+1)}[p_0] + r_f^j$ and $g_{j+1} = V_x^{(j+1)}[p_0] + r_g^j$, according to Eq. 5, Eq. 12, and Eq. 13. Since $f_{j+1}$ and $g_{j+1}$ are randomized by $r_f^j$ and $r_g^j$, respectively, for all rounds $j = 0, \ldots, 2\ell - 1$, no information is leaked. In Line 38, A reconstructs only the search result (the length and start position of LMEM).

Complexities

The DB preparation phase generates shares of $R_{c,f}^j$ and $R_{c,g}^j$ ($c \in \Sigma$, $0 \le j < \ell$) and $W_{x,f}^j$ and $W_{x,g}^j$ ($x \in \{l, p, n\}$ and $0 \le j < \ell$); i.e., $14 \times \ell$ vectors of length $N + 1$. Therefore, the time and communication complexities are $O(\ell N)$. For the Search phase, MULT is computed $\ell$ times in parallel in Lines 17–18 (these are not dependent on each other). In Line 30, MULT is computed $\ell$ times in parallel, and Line 30 is computed in parallel four times in Lines 29–31. Lines 17–18 and Lines 29–31 are repeated for $2\ell - 1$ rounds. Other subprotocols are also computed for $2\ell - 1$ rounds. The time, communication, and round complexities are $O(1)$ for MULT, and independent computation of MULT for $\ell$ times does not increase the round complexity. The time, communication and round complexities are $O(1)$ for the other subprotocols used in Protocol 4. Therefore, the complexities of the Search phase are $O(\ell^2)$ for time and communication, and $O(\ell)$ for the number of rounds. The time complexity of the standard (i.e., non-privacy-preserving) LMEM is $O(\ell)$ while that of Secure LMEM is $O(\ell^2)$. The increase in time complexity is caused by the computation for maintaining the match position securely.

Reducing size of shares in DB preparation phase

The protocols based on ss-ROT are quite efficient in the Search phase; however, they require a large data transfer from B to the computing nodes in the DB preparation phase when the number of queries and the length of the database are large. To mitigate the problem, we propose another protocol that can reduce the size of shares in the DB preparation phase.

We use two parameters $m$ and $n$ ($m < n$) for computing shares. When Share outputs $([[x]]_0, [[x]]_1) \in \mathbb{Z}_{2^m}^2$, we denote the share by $([[x]]_0^m, [[x]]_1^m)$. When Share outputs $([[x]]_0, [[x]]_1) \in \mathbb{Z}_{2^n}^2$, we denote the share by $([[x]]_0, [[x]]_1)$. We denote $M = 2^m$. In our protocol, all the random values are uniformly generated from $\mathbb{Z}_{N'}$.

Basic idea. $V_c[i] = LF_c(i, \hat{T})$ is a lookup table used by Protocol 3 and Protocol 4. We sample $V_c[i]$ at $i = 0, M, 2M, \ldots, \lfloor N'/M \rfloor M$, where $N'$ is the length of $V_c$, and store the sampled values in a vector $z$. We compute $x[i] = V_c[i] - V_c[p]$ for $i = 0, \ldots, N' - 1$, where $p$ is the sampled position closest to $i$ with $p \le i$. Given a position $k$, we can compute $V_c[k]$ by $z[\lfloor k/M \rfloor] + x[k]$. Any element in $z$ is non-negative and at most $N' - 1$, while that in $x$ is also non-negative and at most $M - 1$ because $0 \le V_c[i + 1] - V_c[i] \le 1$. Our idea is to use $n$ bits for storing $z[i]$ and $m$ bits for storing $x[i]$. Note that we used $n$ bits for storing $V_c[i]$ in Protocol 3 and Protocol 4. There are $\lceil N'/M \rceil$ sampled positions, so the size of the lookup table becomes $O(n \lceil N'/M \rceil + mN')$, which is $n/m$ times smaller compared to $V_c$ if $M$ is sufficiently large. We use a rotation technique to hide an intermediate position. A plain rotation places $V_c[N' - 1]$ directly before $V_c[0]$, and the difference between these two entries is outside $\{0, 1\}$ in most cases, so we design a rotated table $V'_c$ that satisfies $0 \le V'_c[i + 1] - V'_c[i] \le 1$ by subtracting an offset from $V_c$.

DB preparation phase. B computes the following vectors for $j = 0, \ldots, \ell - 1$:

\[
V'^{\,j}_{c,f}[(i + r_f[j]) \bmod N'] =
\begin{cases}
V_c[i] - o_{c,f}^j & (i \le (i + r_f[j]) \bmod N') \\
V_c[i] - \bar{o}_{c,f}^j & (i > (i + r_f[j]) \bmod N'),
\end{cases}
\tag{14}
\]

where $r_f[j]$ is a random value, $o_{c,f}^j = V_c[(N' - 1 - r_f[j]) \bmod N'] - V_c[N' - 1]$ and $\bar{o}_{c,f}^j = V_c[(N' - 1 - r_f[j]) \bmod N'] - V_c[0]$.

Theorem 4 $0 \le V'^{\,j}_{c,f}[i + 1] - V'^{\,j}_{c,f}[i] \le 1$ for $i = 0, \ldots, N' - 2$.

Proof. The following equation is equivalent to Eq. 14.

\[
V'^{\,j}_{c,f}[i] =
\begin{cases}
V_c[(i - r_f[j]) \bmod N'] - o_{c,f}^j & ((i - r_f[j]) \bmod N' \le i) \\
V_c[(i - r_f[j]) \bmod N'] - \bar{o}_{c,f}^j & ((i - r_f[j]) \bmod N' > i).
\end{cases}
\tag{15}
\]

$0 \le V_c[i + 1] - V_c[i] \le 1$ holds for $i = 0, \ldots, N' - 2$ from the definition of $V_c$.

If $r_f[j] \bmod N' = 0$, $V'^{\,j}_{c,f} = V_c$. Therefore, $0 \le V'^{\,j}_{c,f}[i + 1] - V'^{\,j}_{c,f}[i] \le 1$ holds for $i = 0, \ldots, N' - 2$.

If $r_f[j] \bmod N' \neq 0$ and $i = (r_f[j] - 1) \bmod N'$, then $V'^{\,j}_{c,f}[i + 1] - V'^{\,j}_{c,f}[i] = V_c[0] - o_{c,f}^j - V_c[N' - 1] + \bar{o}_{c,f}^j = 0$. Let us consider the case when $r_f[j] \bmod N' \neq 0$ and $i \neq (r_f[j] - 1) \bmod N'$. We denote $i = (r_f[j] - 1 + a) \bmod N'$ ($0 < a < N'$). Then, $(i + 1 - r_f[j]) \bmod N' = a \bmod N'$ and $i + 1 = (r_f[j] - 1 + a) \bmod N' + 1$. Since $a \bmod N' - ((r_f[j] - 1 + a) \bmod N' + 1) = (a - 1) \bmod N' - (r_f[j] - 1 + a) \bmod N'$ holds because $0 < a$, the offset for $V'^{\,j}_{c,f}[i + 1]$ and that for $V'^{\,j}_{c,f}[i]$ are the same, and $V'^{\,j}_{c,f}[i + 1] - V'^{\,j}_{c,f}[i] = V_c[a \bmod N'] - V_c[(a - 1) \bmod N']$. Therefore, $0 \le V'^{\,j}_{c,f}[i + 1] - V'^{\,j}_{c,f}[i] \le 1$ holds for $i = 0, \ldots, N' - 2$.

Let $Q_{c,f}^j$ be an integer vector of length $\lceil N'/M \rceil$ such that $Q_{c,f}^j[p] = V'^{\,j}_{c,f}[pM]$, and let $R_{c,f}^j[i] = V'^{\,j}_{c,f}[i] - V'^{\,j}_{c,f}[M \lfloor i/M \rfloor]$. Note that $V'^{\,j}_{c,f}[i] = Q_{c,f}^j[\lfloor i/M \rfloor] + R_{c,f}^j[i]$, and $V_c[i]$ is obtained by adding an offset to $V'^{\,j}_{c,f}[i]$.

Since $R_{c,f}^j[i]$ is non-negative and at most $M - 1$, B generates shares $[[R_{c,f}^j[i]]]^m$. B also generates $[[Q_{c,f}^j[p]]]$, $[[o_{c,f}^j]]$, $[[\bar{o}_{c,f}^j]]$ and $[[r_f[j]]]$. The above shares are used for computing the lower bound $f$ of an interval. B generates shares for the upper bound $g$ in the same manner. Then B distributes all the shares to $P_0$ and $P_1$.

Search phase. A generates the table $q$ for a query string $w$ by Eq. 11. A generates shares of $q$ and distributes them to $P_0$ and $P_1$. The entire protocol is described in Protocol 5.

Security

Theorem 5 Protocol 5 is correct and secure in the semi-honest setting.

Proof. Correctness and security of Protocol 5 are proved as follows.

Correctness. In Lines 5–6 of Protocol 5, $p_j = (f_j + r_f[j]) \bmod N'$ is computed. In Line 8, $\mathrm{CastUp}(R_{c,f}^j[p_j])$ is computed to avoid overflow in Line 9. In Line 9, shares of $V'^{\,j}_{c,f}[p_j]$ are computed, which is obvious from the definition of $Q_{c,f}^j$ and $R_{c,f}^j$. In Lines 11–13, $[[f_{w[j]}^{j+1}]]$, $[[o_{w[j],f}^j]]$ and $[[\bar{o}_{w[j],f}^j]]$ are selected. From the definition of $V'_{c,f}$ described in Eq. 14, it is obvious that $V_c[f_j]$ is obtained by $V'^{\,j}_{c,f}[p_j] + o_{c,f}^j$ when $f_j \le p_j$ and by $V'^{\,j}_{c,f}[p_j] + \bar{o}_{c,f}^j$ when $f_j > p_j$, and Line 14 computes $[[V_c[f_j]]]$. $g$ is computed similarly to $f$. Since the reference to $V$ achieved in Lines 4–16 is equivalent to evaluating Eq. 1 and an equality check of $f = g$ is conducted in Lines 17–19, Protocol 5 is correct.

Security. We only show a sketch of the proof. All the messages exchanged between $P_0$ and $P_1$ are shares except for Line 6. In Line 6, the reconstructed value $p_j$ is randomized by $r_f[j]$ in Line 5. Therefore, no information is leaked.

Complexities

In the DB preparation phase, shares of $R_{c,f}^j$ are generated with a parameter $m$ and shares of other values including $Q_{c,f}^j$ are generated with a parameter $n$. The length of $R_{c,f}^j$ is $N + 1$ and that of $Q_{c,f}^j$ is $\lceil (N + 1)/M \rceil$. The total number of other values does not depend on $N$. The query length is $\ell$ and shares of $R_{c,f}^j$, $Q_{c,f}^j$, and other values are necessary for each query character. Therefore, the time complexity is $O(\ell N)$ and the communication complexity is $O(\ell N m + \ell \lceil N/M \rceil n)$.

For the Search phase, ADD, MULT, Reconst, CastUp and Comp are computed a few times for $2\ell$ times in Lines 4–16 and Equality is computed $\ell$ times in Lines 17–19. Since the time, communication, and round complexities of each of these subprotocols are $O(1)$, those of the entire protocol become $O(\ell)$.

Experiment

We implemented Protocol 3 (Secure LPM), Protocol 4 (Secure LMEM) and Protocol 5. For comparison, we also implemented baseline protocols (Baseline LPM and Baseline LMEM). Details of the baseline protocols are provided in Appendix 3. All protocols were implemented in Python 3.5.2. The dataset was created from Chromosome 1 of the human genome. We extracted substrings of
length $N = 10^3$, $10^4$, $10^5$, $10^6$, and $10^7$ for databases, and $\ell = 10$, 25, 50, 75, and 100 for queries. Share was run with $n = 16$ and $n = 32$ for $N < 10^5$ and $10^5 \le N$ in the proposed protocols, and $n = 1$ for a Boolean share and $n = 8$ for an arithmetic share in the baseline protocols. We did not implement a data transfer module, and each protocol is implemented as a single program. Therefore, the search time of the protocols was measured by the time consumed by either one of $P_0$ and $P_1$. To assess the influence of communication in a realistic environment, we theoretically estimated delays caused by network bandwidth and latency. We assume three environments: LAN (0.2 ms/10 Gbps), $\mathrm{WAN}_1$ (10 ms/100 Mbps), and $\mathrm{WAN}_2$ (50 ms/10 Mbps). During the run of the Search phase, we stored all the data that were transferred from $P_0$ to $P_1$ in a file and measured the file size as the actual communication size. Note that the communication is symmetric and the data transfer size from $P_0$ to $P_1$ is equal to that from $P_1$ to $P_0$. Based on the data transfer size $D$ bytes, we estimate the communication delay by $D/k + eT/1000$, where $k$ is the bandwidth, $e$ is the latency and $T$ is the number of communication rounds. All the protocols were run with a single thread on the same machine equipped with an Intel Xeon 2.2 GHz CPU and 256 GB memory. We also tested the C++ implementation of [19], which is based on AHE. The algorithm for LPM in [17] for a string with $|\Sigma| \le 4$ (e.g., genome sequence) is the same as [19]. Sudo et al. [19] is implemented as server-client software, and the client and the server were run with individual single threads on the same machine. Therefore, the results of [19] do not include delays caused by bandwidth limitation and latency, so we also estimated delays based on the data transfer size and the number of communication rounds in the same manner. Each run of the program was terminated if the total runtime of all phases exceeded 20 h.

Fig. 5 Estimated time (actual search time on a local machine + estimated data-transfer time) for various $N$

Table 3 Offline time (Time), offline size (Size), DB preparation time (Time), DB preparation size (Size), Search time on a local machine (Time), Search communication size (Size), and estimated Search time for three environments: LAN (0.2 ms/10 Gbps), $\mathrm{WAN}_1$ (10 ms/100 Mbps), and $\mathrm{WAN}_2$ (50 ms/10 Mbps), for $N = 10^4$ (only for Baseline LMEM), $10^5$, $10^6$, $10^7$, and $\ell = 100$

| Protocol | N | Offline Time | Offline Size | DB prep. Time | DB prep. Size | Search Time | Search Size | LAN | WAN1 | WAN2 |
| Secure LPM (proposed) | 10^5 | 0.166 | 0.013 | 123 | 305 | 0.141 | 0.010 | 0.181 | 2.162 | 10.249 |
| Secure LPM (proposed) | 10^6 | 0.141 | 0.013 | 1248 | 3051 | 0.113 | 0.010 | 0.153 | 2.134 | 10.221 |
| Secure LPM (proposed) | 10^7 | 0.150 | 0.013 | 12628 | 30517 | 0.126 | 0.010 | 0.167 | 2.147 | 10.234 |
| Secure LPM2 (proposed) | 10^5 | 2.318 | 0.162 | 123 | 77 | 2.888 | 0.040 | 3.028 | 9.911 | 38.020 |
| Secure LPM2 (proposed) | 10^6 | 2.317 | 0.162 | 1236 | 774 | 2.878 | 0.040 | 3.018 | 9.901 | 38.010 |
| Secure LPM2 (proposed) | 10^7 | 2.342 | 0.162 | 12387 | 7748 | 2.939 | 0.040 | 3.079 | 9.962 | 38.071 |
| [19] | 10^5 | – | – | – | – | 691 | 163 | 691 | 707 | 838 |
| [19] | 10^6 | – | – | – | – | 7817 | 517 | 7818 | 7863 | 8261 |
| [19] | 10^7 | – | – | – | – | 20 h< | – | – | – | – |
| Baseline (LPM) | 10^5 | 3995 | 184 | 0.146 | 0.095 | 13 | 122 | 13 | 24 | 118 |
| Baseline (LPM) | 10^6 | 38767 | 1841 | 1.522 | 0.954 | 164 | 1227 | 165 | 268 | 1196 |
| Baseline (LPM) | 10^7 | 20 h< | – | – | – | – | – | – | – | – |
| Secure LMEM (proposed) | 10^5 | 7.619 | 1.704 | 435 | 1068 | 4.817 | 0.999 | 5.577 | 42.900 | 195.654 |
| Secure LMEM (proposed) | 10^6 | 7.882 | 1.704 | 4467 | 10681 | 4.926 | 0.999 | 5.686 | 43.009 | 195.763 |
| Secure LMEM (proposed) | 10^7 | 8.457 | 1.704 | 46384 | 106811 | 5.740 | 0.999 | 6.501 | 43.824 | 196.578 |
| Baseline (LMEM) | 10^4 | 12747 | 611 | 0.015 | 0.010 | 46 | 407 | 46 | 80 | 389 |
| Baseline (LMEM) | 10^5 | 20 h< | – | – | – | – | – | – | – | – |

The size unit is MB and the time unit is s except for the cells describing "20 h<". The last three columns are the estimated time on each network.

Comparison to baseline protocols

Table 3 shows the offline time and size, DB preparation time and size, and Search time and communication size for $N = 10^5$, $10^6$, $10^7$, and $\ell = 100$. It also shows the result of Baseline LMEM for $N = 10^4$, as the runs for $N > 10^4$ did not finish within 20 h. The Search times and communication sizes of Secure LPM and Secure LMEM are several orders of magnitude faster and smaller than those of Baseline LPM and Baseline LMEM. Since the round and communication complexities of the proposed protocols do not depend on $N$, their estimated Search time remains small even in WAN environments. Figure 5 shows the estimated Search time on WAN for $N = 10^3, 10^4, \ldots, 10^7$ and $\ell = 100$. The times of Secure LPM and Secure LMEM do not increase, while those of the baseline protocols increase linearly with $N$. Figure 6 shows the estimated Search time on WAN for $\ell = 10, 25, \ldots, 100$ at a fixed $N$. We cannot show the results of Baseline LMEM because none of its runs finished within the time limit. As shown in the graph, the time of Secure LPM increases linearly with $\ell$ and that of Baseline LPM increases proportionally to $\ell^2$, which are in good agreement with the theoretical complexities in Table 2. According to the graph, the time of Secure LMEM also increases linearly with $\ell$ though its time and communication complexities are $O(\ell^2)$. This is because the CPU times are much smaller than the delays caused by network latency, which are governed by the round complexity $O(\ell)$.

Fig. 6 Estimated time (actual search time on a local machine + estimated data-transfer time) for various $\ell$

We have preliminary results for testing Secure LPM and Baseline LPM on an actual network (10 ms/100 Mbps). The results were 40 s for Secure LPM and 1739 s for Baseline LPM. Though both of the preliminary implementations have room for improvement in the performance of data transfer, the results also indicate that our protocol outperforms the baseline protocol and the previous study.

The time and size of Secure LPM and Secure LMEM are several orders of magnitude better than those of the baseline protocols for the offline phase, and vice versa for the DB preparation phase. The total time of the offline and DB preparation phases of our protocols is more than one order of magnitude faster than that of the baseline protocols. For the total size of the offline and DB preparation phases, Secure LMEM was better than Baseline LMEM, but Baseline LPM was better than Secure LPM though the complexity is better for Secure LPM. This is because the majority of the shares were Boolean in the baseline protocols, while all of the shares were arithmetic in the proposed protocols.

Comparison to [19]

[19] is a two-party MPC based on AHE. Each homomorphic operation is time consuming, and the protocol has no offline and DB preparation phases. As shown in Table 3, the Search time of Secure LPM is about four orders of magnitude faster than that of [19]. Since the time complexity of [19] includes a factor of $N$, the difference in Search time becomes greater as $N$ becomes large. Moreover, our protocols have a further advantage in communication for a query response when the network environment is poor, as the round complexities of [19] and our protocols are the same while [19] requires $O(\sqrt{N})$ communication size. The entire runtimes including all the phases are still six times faster for $N = 10^5$ and $N = 10^6$. We can compute LMEM by running [19] for all the positions in a query string, but this approach consumed 3406 s and 2.6 GByte of communication.

Result of the approach in section "Reducing size of shares in DB preparation phase"

We also implemented Protocol 5 (Secure LPM2) to investigate the trade-off between the reduction of the size of shares in the DB preparation phase and the increase in search time and communication overhead in the Search phase. We used the same programming language (i.e., Python 3.5.2) for the implementation and used the same datasets. Share was run with $n = 8$ when generating the arithmetic shares of $R$. For the generation of the rest of the arithmetic shares, Share was run with $n = 16$ and $n = 32$ for $N < 10^5$ and $10^5 \le N$ (i.e., $m = 8$, $n = 16$ ($N < 10^5$), and $n = 32$ ($10^5 \le N$) in the notation used in section "Reducing size of shares in DB preparation phase"). The results are shown in Table 3. The total size of shares in the DB preparation phase was 7.7 GB for Protocol 5 and 30.5 GB for Protocol 3, which is in good agreement with the theoretical complexities discussed in section "Reducing size of shares in DB preparation phase". The search time of Protocol 5 is around 2 s longer than that of Protocol 3. We consider that the increase in search time is mainly caused by using the rather costly subprotocols CastUp, Comp and MULT more times, which also increases the number of communication rounds. Despite the increase in search time, Protocol 5 is still more than two orders of magnitude faster than Baseline LPM and three orders of magnitude faster than [19], so we consider that Protocol 5 offers a reasonable trade-off between performance in the DB preparation phase and the Search phase.

Discussion

As clearly shown by the results, the Search time of the proposed protocols is significantly short. Considering the importance of query response time for real applications, it is realistic to reduce Search time at the cost of DB preparation time. Since the total times for the offline and DB preparation phases of the proposed protocols were significantly better than those of the well-designed baseline protocols, we consider the trade-off between Search and DB preparation times in our approach to be efficient. For further reduction of DB preparation time, parallelizing the share generation is a feasible approach. Regarding the DB preparation phase, the data transfer between the server and the computing nodes is problematic when the number of queries and the length of the database are large. To mitigate the problem, we also proposed the approach that uses arithmetic shares of a shorter bit length, which offers a reasonable trade-off between the reduction of data size in the DB preparation phase and the increase in time and communication overhead in the Search phase. Another solution that potentially mitigates the problem is to use an AES-based random number generation that is similar to the technique used in [33]. To explain it briefly, when the server needs to distribute a share of $x$, (1) the server and $P_0$ generate the same randomness $r$ using a pre-shared key and a pseudorandom function, and (2) the server computes $x - r$ and sends it to $P_1$. Although $P_0$'s computation cost increases, we can remove the data transfer from the server to $P_0$. In our protocols, the generation of shares in the DB preparation phase cannot be outsourced because they are dependent on the database. Designing an efficient algorithm to outsource the share generation is an important open question.

Appendices

Appendix 1: Examples of a search with FM-Index and auxiliary data structures

Let us show examples of a search with FM-Index, LCP array, PSV and NSV. In addition to the data structures defined in section "Index structure for string search", we also define a string $F$ such that $F[i] = S[SA[i]]$. For the case of $S$ = ATGAATGCGA, the indices become $SA = (9, 3, 0, 4, 7, 8, 2, 6, 1, 5)$, $L$ = GGAAGCTTAA, and $F$ = AAAACGGGTT. Figure 7 illustrates the example of a backward search to find the longest suffix of the query (ATG) that matches the database, and Fig. 8 illustrates the search for MEMs with the query (CGC) by using LCP array, PSV, and NSV. As shown in the upper center panel of Fig. 8, the search failed at the backward search with 'C' after finding the interval [7, 8) that corresponds to GC. Since $LCP[8] \le LCP[7]$, the parent lcp-interval becomes $[PSV[7] = 5, NSV[7] = 8)$, which corresponds to 'G'. The match CG is then searched with the backward search with 'C' from the parent lcp-interval.

Fig. 7 An example of search by FM-Index

Fig. 8 An example of search by FM-Index, LCP array, PSV and NSV

Appendix 2: Semi-honest security

Here, we recall the simulation-based security notion in the presence of semi-honest adversaries (for two-party computation), as in [32].

corrupted party $P_i$ if there exists a probabilistic polynomial-time algorithm S such that

\[ \{(S(i, x_i, f_i(\vec{x})), f(\vec{x}))\} \equiv \{(\mathrm{View}_i^{\pi}(\vec{x}), \mathrm{Output}^{\pi}(\vec{x}))\}, \]

where the symbol $\equiv$ means that the two probability distributions are statistically indistinguishable.

As described in [32], the composition theorem for the semi-honest model holds; that is, any protocol is privately computed as long as its subroutines are privately computed.

Appendix 3: Our secure baseline LPM and LMEM

In this section, we show our secure baseline LCP and LMEM based on secret sharing. We explain how to construct LCP, since we can obtain LMEM by (parallelly) executing LCP for all positions in the query. Note that $\vec{x} = (x_1, x_2, \cdots)$, $x_i$ denotes the $i$-th element of $\vec{x}$, $[[\vec{t}]] = ([[\vec{t}]]_0, [[\vec{t}]]_1)$, and $(|\vec{x}|, |\vec{y}|) = (L, N)$.
Here, we 0 1 ∗ 2 ∗ 2 Definition 2 Let f : ({0, 1} ) → ({0, 1} ) be a proba- assume N > L . When [[x �]] = ([[x ]], [[x ]], ··· , [[x ]]) , 1 2 p bilistic 2-ary functionality and f (x ) denote the i-th ele- means . In our protocol, ([[0]], [[x ]], ··· , [[x]] ) ∗ 2 [[x�]] ≫ 1 1 p−1 ment of f (x ) for x� = (x , x ) ∈ ({0, 1} ) and i ∈{0, 1} ; 0 1 we use two subprotocols as follows: f (x�) = (f (x�), f (x�)) . Let  be a 2-party protocol to com- 0 1 pute the functionality f. The view of party P for i ∈{0, 1} • All-AND takes a list [[t]] (with p Boolean shares) as during an execution of  on input x� = (x , x ) ∈ ({0, 1} ) 0 1 input and outputs [[t ∧ ··· ∧ t ]] . We can compute 1 p where |x |=|x | , denoted by View (x ) , consists of 0 1 this function with ⌈p⌉ communication rounds (by (x , r , m , . . . , m ) , where x represents P ’ s input , r i i i,1 i,t i i i appropriate parallelization) and O(p)-bit data transfer. represents its internal random coins, and m repre- i,j • All-OR takes a list [[u �]] (with p Boolean shares) as sents the j-th message that P has received. The out - input and outputs [[u ∨ ··· ∨ u ]] . We can com- 1 p put of all parties after an execution of  on input x  is pute this function with ⌈p⌉ communication rounds (by denoted as Output (x ) . Then, for each party P , we say appropriate parallelization) and O(p)-bit data transfer. that  privately computes f in the presence of semi-honest Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 20 of 22 Our protocol is as in Protocol  A1. In the following, we matches exist”. For these operations, we need to use explain the details of our baseline longest common prefix O(N), O(NL), and O(N) secure AND gates, respectively. search protocol using an example that strings x = “TGA” Since we execute these operations for all L candidates, and � y = “ATTGC”. In this example, w = 2 since there the number of AND gates we need for are O(NL), O(NL ) , exists “TG” in  y , but “TGA” does not. 
For better under- and O(NL), respectively. In these operations, We do not standing, we introduce a more straightforward approach need to compute the letters match for each time since and analyze its efficiency before explaining our baseline the string is fixed. In our baseline protocol, therefore, we protocol. In the straightforward approach, we securely compute whether the letter is matched or not beforehand check whether the first letter of (i.e., “T”) exists in y or and repeatedly use them. Since we can check this check not. Next, we check every pattern up to the second let- with O(NL), however, our baseline still requires O(NL ) ter of x (i.e., “TG”) for a match anywhere in  y . We also AND gates. Although it may be possible to reduce the execute the same operations for up to the third latter of number of AND gates via increasing other costs (e.g., x (i.e., “TGA”). In these processes, we necessary to exe- communication rounds), it will not be easy to construct cute the “check if the characters match”, “check if all the the protocol with N-independent online cost like the pro- characters match”, and “check if at least one of the perfect posed one with this strategy. Nak agawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 21 of 22 References Why the offline cost of our baseline is so significant: In 1. Fiume M, Cupak M, Keenan S, Rambla J, de la Torre S, Dyke SO, Brookes secure computation, it is impossible in principle to change AJ, Carey K, Lloyd D, Goodhand P, et al. Federated discovery and sharing the behavior depending on the computation results in the of genomic data using beacons. Nat Biotechnol. 2019;37(3):220–4. 2. Philippakis AA, Azzariti DR, Beltran S, Brookes AJ, Brownstein CA, middle. In other words, we are always forced to perform the Brudno M, Brunner HG, Buske OJ, Carey K, Doll C, et al. The matchmaker worst-case computation. In the previous example, for exam- exchange: a platform for rare disease gene discovery. Hum Mutat. 
ple, we consider the case for checking whether the first let - 2015;36(10):915–21. 3. Erlich Y, Narayanan A. Routes for breaching and protecting genetic ter of x (i.e., “T”) matches any of the letters in y. If it is done privacy. Nat Rev Genet. 2014;15(6):409–21. in plain text, the moment we find “T” in the second letter 4. Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed of  y , we don’t have to worry about the rest of the letters in N. Privacy-preserving techniques of genomic data—a survey. Briefings Bioinform. 2019;20(3):887–95. y . In secure computation, however, we have to check eve- 5. Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux J-P, Malin BA, rything, including the rest, since we cannot find that the Wang X. Privacy in the genomic era. ACM Comput Surv. 2015;48(1):1–44. match has already existed. In addition, we consider the case 6. Jha S, Kruger L, Shmatikov V Towards practical privacy for genomic com- putation. In: Proc. of IEEE S&P 2000; 2008, p. 216–230. that we check the match for up to the first two letters in x 7. Cheon JH, Kim M, Lauter KE Homomorphic computation of edit distance. (i.e., “TG”) and the first two letters in  y (i.e., “AT”). In this In: Proc. of FC 2015; 2015, p. 194–212. case, the moment we see A, we can decide there is no match 8. Nuida K, Ohata S, Mitsunari S, Attrapadung N. Arbitrary univariate func- tion evaluation and re-encryption protocols over lifted-elgamal type and terminate the process in plaintext computation. In ciphertexts. IACR Cryptology ePrint Archive. 2019;2019:1233. secure computation, however, this is impossible. As we see 9. Huang Y, Evans D, Katz J, Malka L Faster secure two-party computation above, we are always forced to consider the worst-case com- using garbled circuits. In: Proc. of USENIX 2011; 2011. 10. Wang XS, Huang Y, Zhao Y, Tang H, Wang X, Bu D Efficient genome-wide, puting cost in secure computation. 
Note that offline costs privacy-preserving similar patient query based on private edit distance. for secure computation are linear to the number of AND In: Proc. of CCS 2015; 2015, p. 492–503. gates. We need O(NL ) offline cost in our baseline (and 11. Zhu R, Huang Y Efficient and precise secure generalized edit distance and beyond. IEEE Transactions on Dependable and Secure Computing. straightforward) protocol, and N is large in our setting. This 2020;1–1. is why the offline cost of our baseline protocol is so large. 12. Cheng K, Hou Y, Wang L Secure similar sequence query on outsourced Our proposed protocol successfully avoids this problem by genomic data. In: Proc. of AsiaCCS 2018; 2018. p. 237–251. 13. Asharov G, Halevi S, Lindell Y, Rabin T. Privacy-preserving search of similar developing a new secure primitive and combining it with an patients in genomic data. PoPETs. 2018;2018(4):104–24. appropriate data structure. 14. Schneider T, Tkachenko O EPISODE: efficient privacy-preserving similar sequence queries on outsourced genomic databases. In: Proc. of AsiaCCS Acknowledgements 2019, pp. 315–327 (2019) This work is partially supported by JST CREST Grant Number JPMJCR19F6, 15. Ohata S, Nuida K Communication-efficient (client-aided) secure two- MEXT/JSPS KAKENHI grant number 19K12209 and 21H04871/21H05052. party protocols and its application. In: Proc. of FC 2020; 2020, p. 369–385. KS thanks Prof. Kunihiko Sadakane and Mr. Tomoki Uchiyama for giving the 16. Baldi P, Baronio R, Cristofaro E.D., Gasti P, Tsudik G Countering GAT TAC A: important comments for improving the paper. efficient and secure testing of fully-sequenced human genomes. In: Proc. of CCS 2011; 2011, p. 691–702. Authors’ contributions 17. Shimizu K, Nuida K, Rätsch G. Efficient privacy-preserving string search KS designed proposed protocols with the help of SO and YN, and organ- and an application in genomics. Bioinformatics. 2016;32(11):1652–61. ized the study. 
SO implemented a secure multi-party computation library 18. Ishimaki Y, Imabayashi H, Shimizu K, Yamana H Privacy-preserving string equipped with all the sub-protocols necessary for this study and designed search for genome sequences with fhe bootstrapping optimization. In: baseline protocols. YN implemented proposed and baseline protocols and Proc. of IEEE Big Data 2016, pp. 3989–3991 (2016) conducted experiments. KS and SO mainly wrote the manuscript. All the 19. Sudo H, Jimbo M, Nuida K, Shimizu K. Secure wavelet matrix: alphabet- authors contributed to the final form of the manuscript. All authors read and friendly privacy-preserving string search for bioinformatics. IEEE/ACM approved the final manuscript. Trans Comput Biol Bioinform. 2019;16(5):1675–84. 20. Sotiraki K, Ghosh E, Chen H. Privately computing set-maximal matches in genomic data. BMC Med Genom. 2020;13(7):1–8. Declarations 21. Mahdi MSR, Al Aziz MM, Mohammed N, Jiang X. Privacy-preserving string search on encrypted genomic data using a generalized suffix tree. Inform Competing interests Med Unlocked 23, 100525 (2021) The authors declare that they have no competing interests. 22. Chen Y, Peng B, Wang X, Tang H Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds. In: Proc. of NDSS 2012; Author details Department of Computer Science and Engineering, Waseda University, Tokyo, 2 3 23. Popic V, Batzoglou S. A hybrid cloud read aligner based on minhash and Japan. Self- employment, T ok yo, Japan. National Institute of Advanced Indus- kmer voting that preserves privacy. Nat Commun. 2017;8(1):1–7. trial Science and Technology, Tokyo, Japan. 24. Ferragina P, Manzini G Opportunistic data structures with applications. In: Proc. of FOCS 2000; 2000; p. 390–398. Received: 19 November 2021 Accepted: 1 March 2022 Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 22 of 22 25. Durbin R. 
Efficient haplotype matching and storage using the positional burrows-wheeler transform (pbwt). Bioinformatics. 2014;30(9):1266–72. 26. Yasuda M, Shimoyama T, Kogure J, Yokoyama K, Koshiba T Secure pattern matching using somewhat homomorphic encryption. In: Juels, A., Parno, B. (eds.) Proc. of CCSW’13; 2013, p. 65–76. 27. Fischer J, Mäkinen V, Navarro G An(other) entropy-bounded compressed suffix tree. In: Proc. of CPM 2008; 2008, p. 152–165. 28. Shamir A. How to share a secret. Commun ACM. 1979;22(11):612–3. 29. Beaver D Efficient multiparty protocols using circuit randomization. In: Proc. of CRYPTO 1991; 1991, p. 420–432. 30. Mohassel P, Orobets O, Riva B. Efficient server-aided 2pc for mobile phones. PoPETs. 2016;2016(2):82–99. 31. Mohassel P, Zhang Y Secureml: a system for scalable privacy-preserving machine learning. In: Proc. of IEEE S&P 2017; 2017, p. 19–38. 32. Goldreich O. The foundations of cryptography. Basic applications, vol. 2. Cambridge: Cambridge University Press; 2004. 33. Araki T, Furukawa J, Lindell Y, Nof A, Ohara K High-throughput semi- honest secure three-party computation with an honest majority. In: Proc. of CCS 2016; 2016, p. 805–817. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in pub- lished maps and institutional affiliations. Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Algorithms for Molecular Biology Springer Journals

Efficient privacy-preserving variable-length substring match for genome sequence



Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
eISSN
1748-7188
DOI
10.1186/s13015-022-00211-1

Abstract

The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that V[V[...V[p]...]] is computed for a given depth of recursion where p is an initial position, and V is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length N, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently against the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.

Keywords: Private genome sequence search, Secure multiparty computation, Secret sharing, FM-index, Suffix array, LCP array, Maximal exact match

Introduction

The dramatic reduction in the cost of genome sequencing has prompted increased interest in personal genome sequencing over the last 15 years. Extensive collections of personal genome sequences have been accumulated both in academic and industrial organizations, and there is now a global demand for sharing the data to accelerate scientific research [1, 2]. As discussed in previous studies, disclosing personal genome information has a high privacy risk [3], so it is crucial to ensure that individuals’ privacy is protected upon data sharing. At present, the most popular approach for this is to formulate and enforce a privacy policy, but it is a time-consuming process to reach an agreement, especially among stakeholders with different legal backgrounds, which slows down the pace of research. Therefore, there is a strong demand for privacy-preserving technologies that can potentially compensate for or even replace the traditional policy-based approach [4, 5].

One important application that needs a privacy-preserving technology is private genome sequence search, where different stakeholders respectively hold a query sequence and a database sequence and the goal is to let the query holder know the result while simultaneously keeping the query and the database private. Many studies have addressed the problem of how to compute exact or approximate edit distance or the longest common substring (LCS) through techniques based on homomorphic encryption [6–8] and secure multi-party computation (MPC) [9–15], or how to compute sequence similarity based on private set intersection [16]. While these studies can evaluate global sequence similarity for two sequences of similar length,

*Correspondence: shimizu.kana@waseda.jp. National Institute of Advanced Industrial Science and Technology, Tokyo, Japan. Full list of author information is available at the end of the article.

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 2 of 22 other studies address the problem of finding a substring and round complexities for the search time (i.e., the time between a query and a long genome sequence or a set of after the input of a query until the end of the search). long genome sequences, with the aim of evaluating local The basic idea of the protocols is to represent the data - sequence similarity [17–23]. Shimizu et  al. proposed an base string by a compressed index [24, 27] and store the approach to combine an additive homomorphic encryp- index as a lookup table. LPM and MEMs are found by at tion and index structures such as FM-index [24] and the most ℓ and 2ℓ table lookups respectively, where ℓ is the positional Burrows-Wheeler transform [25] to find the length of the query. More specifically, the table V is ref- longest prefix of a query that matches a database (LPM) erenced in a recursive manner; i.e., one needs to obtain and a set-maximal match for a collection of haplotypes V [j] , where j = V [i] , g iven i. To ensure security, we need [17]. Sudo et  al. used a similar approach and improved to compute V [j] without seeing any element of V . The the time and communication complexities for LPM on a key technical contribution of this study is an efficient protein sequence by using a wavelet matrix [19]. Ishimaki protocol that achieves this type of recursive reference. et  al. 
improved the round complexity of a set-maximal We named the protocol secret-shared recursive oblivi- match, though the search time was more than one order ous transfer (ss-ROT). While the previous studies require of magnitude slower than [17] due to the heavy computa- O(N ) time complexity to ensure security, the time, com- tional cost caused by the fully homomorphic encryption munication, and round complexities of ss-ROT  are all [18]. Sotiraki et al. used the Goldreich-Micali-Wigderson O(ℓ) for ℓ recursive table lookups, except for the prepa- protocol to build a suffix tree for a set-maximal match ration of the table and generation of shares before the [20]. According to experiments by [21], the search time query input. Since the entire protocols mainly consist of ℓ of [20] is one order of magnitude slower than [17, 21]. table lookups for LPM, and 2ℓ table lookups and 2ℓ inner Mahdi et  al. [21] used a garbled circuit to build a suffix product computations for LMEM, the search times for tree for substring match and a set-maximal match under LPM and LMEM do not depend on the database size. In a different security assumption such that the tree-tra - addition to the protocols based on ss-ROT, we developed versal pattern is leaked to the cloud server. Chen et  al. a protocol to reduce data transfer size in the initial step [22] and Popic et  al. [23] found fixed-length substring by using a similar approach taken in ss-ROT. The pro - matches using a one-way hash function or homomorphic tocol offers a reasonable trade-off between the amount encryption on a public cloud under a security assump- of reduction in data transfer in the initial step and the tion such that the database is a public sequence and a increase in computational cost in the later step. query is leaked to a private cloud server. 
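As a plaintext point of reference for the recursive reference that ss-ROT is designed to protect, the sketch below simply follows V[V[...V[p]...]] for a given depth. It is not the secure protocol: the table is read in the clear, and the function name `recursive_lookup` and the toy table are assumptions made only for illustration. What it does show is that the number of lookups is set by the recursion depth and not by the table length.

```python
def recursive_lookup(V, p, depth):
    """Return V[V[...V[p]...]] using `depth` nested lookups.

    Plaintext illustration only (hypothetical helper): a secure version
    must hide both the table V and every intermediate position.
    """
    for _ in range(depth):
        p = V[p]
    return p


# Toy table (assumed values): following 0 -> 2 -> 3 -> 1 takes three lookups,
# regardless of how long V is.
V = [2, 0, 3, 1]
result = recursive_lookup(V, 0, 3)
```

In the secure setting described in the text, each of these steps has to be carried out on shares so that neither computing node learns p or V[p].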
We implemented the proposed protocol and tested it 3 7 In this study, we aim to improve privacy-preserving on substrings of a human genome sequence 10 to 10 in substring match under the security assumption such that length and confirmed that the actual CPU time and data both the query and the database sequence are strictly transfer overhead were in good agreement with the theo- protected. We first propose a more efficient method retical complexities. We also found that the search time for finding LPM, and then extend it to find the longest of our protocol was three orders of magnitude faster than maximal exact match (LMEM), which is more practically that of the previous method [17, 19]. For conducting fur- important in bioinformatics. We designed the protocol ther performance analysis, we designed and implemented for LMEM for ease of explanation, and the protocol can baseline protocols using major techniques of secret-shar- be applied to similar problems such as finding all maxi - ing-based protocols. The results showed that the search mal exact matches (MEMs) with a small modification. To times of our protocols were at least two orders of magni- our knowledge, this is the first study to address the prob - tude faster than those of the baseline protocols. lem of securely finding MEMs. Preliminaries Our contribution Secure computation based on secret sharing The time complexity of the previous studies [17, 19] Here, we explain the 2-out-of-2 additive secret sharing include the factor of N , and thus they do not scale well ((2, 2)-SS) scheme and how to securely compute arithme- to a large database. For a similar reason, using secure tic/Boolean gates (Fig. 1). matching protocols (e.g., [26]) for the shares (or tags in Secret sharing and secure computation In t-out-of- searchable encryption) of all substrings in a query and n secret sharing (e.g., [28]), we split the secret value x database is even worse in terms of time complexity. 
To into n pieces, and can reconstruct x by combining more achieve a real-time search on an actual genome database, or an equal number of t pieces. We call the split pieces we propose novel secret-sharing-based protocols that do “share”. The basic security notion for secret sharing is not include the factor of N in the time, communication, that we cannot obtain any information about x even Nak agawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 3 of 22 Fig. 1 Arithmetic addition and multiplication over secret sharing if we gather less than or equal to (t − 1) shares. In this compute arithmetic ADD/MULT gates over shares as paper, we consider a case with (t, n) = (2, 2) . A 2-out- follows: of-2 secret sharing ((2,  2)-SS) scheme over Z consists of two algorithms: Share and Reconst . Share takes as • [[z]] ← ADD([[x]], [[y]]) can be done locally by just input x ∈ Z and outputs ([[x]] , [[x]] ) ∈ Z , where the adding each party’s share on x and on y. In Fig.  1 2 0 1 n bracket notation [[x]] denotes the arithmetic share of the (left), we show an example of secure addition. P /P 0 1 i-th party (for i ∈{0, 1} ). We denote [[x]] = ([[x]] , [[x]] ) obtain shares 6/7 by adding their two shares. In this 0 1 as their shorthand. Reconst takes as inputs [[x]] and [[x]] process, P /P cannot find they are computing 2 + 3. 0 1 0 1 and outputs x. For arithmetic sharing [[x]] and Boolean • Multiplication is more complex than addition. There sharing [[x]] , we consider power-of-two integers n (e.g., are various methods for multiplication over shares, n = 16 ) and n = 1 , resp e ctively . most of which require communication between Depending on the secret sharing scheme, we can com- computing nodes. In this paper, we use the stand- pute arithmetic/Boolean gates over shares; that is, we can ard method for [[w]] ← MULT([[x]], [[y]]) based on execute some kind of processing related to x without x. Beaver triples (BT) [29]. 
Such a triple consists of This means it is possible to perform some computation bt = (a , b , c ) and bt = (a , b , c ) such that 0 0 0 0 1 1 1 1 without violating the privacy of the secret data, and is (a + a )(b + b ) = (c + c ) . Hereaf ter , a, b, and 0 1 0 1 0 1 called secure (multi-party) computation. It is known c denote a + a , b + b , and c + c , resp e ctively . 0 1 0 1 0 1 that we can execute arbitrary computation by combining We use these BTs as auxiliary inputs for computing basic arithmetic/Boolean gates. In the following para- MULT . Note that we can compute them in advance graphs, we show how to concretely compute these gates (or in offline phase) since they are independent of over shares. inputs [[x]] and [[y]] . We adopt a trusted initializer set- Semi-honest secure two-party computation based on ting (e.g., [30, 31]); that is, BTs are generated by the (2,  2)-Additive SS We use a standard (2,  2)-additive SS party other than two computing nodes and then dis- scheme, defined by tributed. In the online phase of MULT , each i-th party P ( i ∈{0, 1} ) can compute the multiplication share • Share(x) : randomly choose r ∈ Z n and let [[x]] = r [[z]] = [[xy]] as follows: 2 0 and [[x]] = x − r. • Reconst([[x]] , [[x]] ) : output [[x]] + [[x]] . 0 1 0 1 1) P first computes ([[x]] − a ) and ([[y]] − b ) , and i i i i i Note that one of the shares of x ([[ x]] or [[x]] ) does not sends them to P . 0 1 1−i ′ ′ reveal any information about x. In Fig. 1, the secret value 2) P reconstructs x = x − a and y = y − b. ′ ′ ′ ′ x = 2 is split into [[x]] = 4 and [[x]] = 6 . These are 3) P computes [[z]] = x y + x b + y a + c , and P 0 1 0 0 0 0 0 1 ′ ′ valid (2,  2)-additive shares because 4 + 6 ≡ 2 (mod 8) computes [[z]] = x b + y a + c . 1 1 1 1 holds. Even if we can see [[x]] = 4 , we cannot decide the Here, [[z]] and [[z]] calculated with the above 0 1 value of x since we execute a split of x uniformly at ran- procedures are valid shares of xy; that is, dom. 
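The Share, Reconst, ADD and Beaver-triple MULT steps described above can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the authors' library: both parties' computations run in one function, the trusted initializer is the hypothetical helper `beaver_triple`, and all function names are chosen here for illustration. The shares of y in the usage lines (2 and 1) are assumed values that are consistent with the sums 6 and 7 quoted for Fig. 1.

```python
import secrets


def share(x, n_bits):
    """Split x into two additive shares over Z_{2^n}."""
    mod = 1 << n_bits
    r = secrets.randbelow(mod)          # [[x]]_0 = r
    return r, (x - r) % mod             # [[x]]_1 = x - r


def reconst(x0, x1, n_bits):
    """Recombine the two shares."""
    return (x0 + x1) % (1 << n_bits)


def add(x_sh, y_sh, n_bits):
    """ADD: each party adds its own shares locally, no communication."""
    mod = 1 << n_bits
    return (x_sh[0] + y_sh[0]) % mod, (x_sh[1] + y_sh[1]) % mod


def beaver_triple(n_bits):
    """Hypothetical trusted initializer: shares of a, b and c = a*b."""
    mod = 1 << n_bits
    a, b = secrets.randbelow(mod), secrets.randbelow(mod)
    a_sh, b_sh, c_sh = share(a, n_bits), share(b, n_bits), share(a * b % mod, n_bits)
    return (a_sh[0], b_sh[0], c_sh[0]), (a_sh[1], b_sh[1], c_sh[1])


def mult(x_sh, y_sh, bt0, bt1, n_bits):
    """MULT with a Beaver triple, following steps 1) to 3) in the text."""
    mod = 1 << n_bits
    (a0, b0, c0), (a1, b1, c1) = bt0, bt1
    # 1) each party masks its shares and sends them to the other party
    dx0, dy0 = (x_sh[0] - a0) % mod, (y_sh[0] - b0) % mod
    dx1, dy1 = (x_sh[1] - a1) % mod, (y_sh[1] - b1) % mod
    # 2) both parties reconstruct x' = x - a and y' = y - b
    xp, yp = (dx0 + dx1) % mod, (dy0 + dy1) % mod
    # 3) only P_0 adds the x'y' term
    z0 = (xp * yp + xp * b0 + yp * a0 + c0) % mod
    z1 = (xp * b1 + yp * a1 + c1) % mod
    return z0, z1


# Fig. 1 style example over Z_8: x = 2 is shared as (4, 6); y = 3 as (2, 1) (assumed).
x_sh, y_sh = (4, 6), (2, 1)
sum_sh = add(x_sh, y_sh, 3)             # local sums of the two parties
bt0, bt1 = beaver_triple(3)
prod_sh = mult(x_sh, y_sh, bt0, bt1, 3)
```

Because the masked values x − a and y − b are uniformly distributed when a and b are, exchanging them in step 1 reveals nothing about x or y as long as the triple is used only once.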
This means, in Fig.  1, computing nodes P and P Reconst([[z]] , [[z]] ) = xy . We shorten the notations 0 1 0 1 cannot obtain any information about x as long as these and write the ADD and MULT protocols simply as two nodes do not collude. On the other hand, we can [[x]] + [[y]] and [[x]] · [[y]] , resp e ctively . Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 4 of 22 We also write ADD(ADD([[x ]], [[x ]]), [[x ]]) as 0. The length of S is denoted by |S|. A reverse string of A B C [[x ]] . Note that, similarly to the ADD protocol, S (i.e., S[|S|− 1], . . . , S[0] ) is denoted by S . We consider c={A,B,C} c we can also locally compute multiplication by constant c, a direction from the i-th position to the j-th position as denoted by c · [[x]] . We can easily extend the above proto- rightward if i < j and leftward otherwise. cols to Boolean gates. By converting + and − into ⊕ in the Given a query w and a database S, we define the long - arithmetic ADD and MULT protocols, we can obtain the est prefix that matches a database string (LPM) by XOR and AND protocols, respectively. We can construct max {j|w[0, . . . , j] = S[k, . . . , l]} , where 0 ≤ j <ℓ and (0,j) NOT and OR protocols from the properties of these 0 ≤ k ≤ l < N , and the longest maximal exact match B B gates. When we compute NOT([[x]] , [[x]] ) , P and P (LMEM) by max {j − i|w[i, . . . , j] = S[k, . . . , l]} , where (i,j) 0 1 0 1 B B output ¬[[x]] and [[x]] , respectively. When we compute 0 ≤ i ≤ j <ℓ and 0 ≤ k ≤ l < N . 0 1 B B B B OR([[x]] , [[y]] ) , we compute ¬AND(¬[[x]] , ¬[[y]] ) . W e FM-Index and related data structures FM-Index [24] shorten the notations and write XOR , AND , NOT , and and related data structures [27] are widely used for OR simply as [[x]] ⊕ [[y]] , [[x]] ∧ [[y]] , ¬[[x]] , and [[x]] ∨ [[y]] , genome sequence search. Given a query string w of respectively. 
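To make the definitions of SA, L, Rank, CF and LF concrete, the following plaintext sketch builds them naively for the toy string S = ATGAATGCGA that the paper uses in its Appendix 1. The function names are assumptions for illustration, the construction is quadratic and unindexed (a real FM-index precomputes Rank and CF), and the wrap-around at SA[i] = 0 is an assumption chosen so that L matches the value given in the appendix.

```python
def suffix_array(S):
    """Positions of all suffixes of S in lexicographical order (naive sort)."""
    return sorted(range(len(S)), key=lambda i: S[i:])


def bwt(S, SA):
    """L[i] = S[SA[i] - 1]; position -1 wraps to the last letter (assumption)."""
    return "".join(S[(p - 1) % len(S)] for p in SA)


def rank(c, t, X):
    """Rank_c(t, X): number of positions j < t with X[j] == c."""
    return sum(1 for j in range(t) if X[j] == c)


def cf(c, X):
    """CF_c(X): number of letters in X that are smaller than c."""
    return sum(1 for ch in X if ch < c)


def lf(c, i, L):
    """LF_c(i) = CF_c(L) + Rank_c(i, L)."""
    return cf(c, L) + rank(c, i, L)


S = "ATGAATGCGA"
SA = suffix_array(S)                    # (9, 3, 0, 4, 7, 8, 2, 6, 1, 5)
L = bwt(S, SA)                          # GGAAGCTTAA
F = "".join(S[p] for p in SA)           # AAAACGGGTT
g_range = (lf("G", 0, L), lf("G", len(S), L))   # SA range of suffixes starting with G
```

Applying LF for the letter G to the two ends of the full range [0, 10) gives [5, 8), which is exactly where F holds the three G's.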
By combining the above gates, we can length ℓ and a database string S of length N, [24] enables securely compute higher-level protocols. The function - LPM to be found in O(ℓ) time regardless of N, and it also ality of the secure subprotocols [15] used in this paper enables LMEM to be found in O(ℓ) if auxiliary data struc- are shown in Table  1. Due to space limits, we omit the tures are used [27]. Given all the suffixes of a string S: details of their construction. Note that we can compute S[0, . . . , |S|− 1] , S[1, . . . , |S|− 1], . . . , S[|S|− 1] , a suffix Choose  by [[z]] = [[y]] + [[e]] · ([[x]] − [[y]]) . In this paper, array is an array of positions (p , . . . , p ) such 0 |S|−1 S[p , . . . , |S|− 1] ≤ S[p , . . . , |S|− 1] ≤ S[p , . . . , we consider the standard simulation-based security that 0 1 2 |S|− 1], . . . , ≤ S[p , . . . , |S|− 1] notion in the presence of semi-honest adversaries (for |S|−1 . We denote the suffix 2PC), as in [32]. We show the definition in Appendix  2. array of S by SA and denote its i-th element by SA[i]. A Roughly speaking, this security notion guarantees the Burrows-Wheeler transform (BWT) is a permutation of privacy of the secret under the condition that computing the sequence S such that its i-th letter becomes nodes do not deviate from the protocol; that is, although S[SA[i] − 1] . We denote a BWT of S by L and denote its computing nodes are allowed to execute arbitrary attacks i-th letter by L[i]. Let us define a rank of S for a letter in their local, they do not (maliciously) manipulate trans- c ∈  at position t by Rank (t, S) = |{j|S[j] = c,0 ≤ j < t}| mission data to other parties. The building blocks we and a count of occurrences of letters that are lexicograph- adopt in this paper satisfy this security notion. Moreo- ically smaller than c in S by CF (S) = Rank (|S|, S) , c r r<c ver, as described in [32], the composition theorem for and the operation LF (i, S) = CF (L) + Rank (i, L) . 
The c c c the semi-honest model holds; that is, any protocol is pri- match between w and S is reported as a form of left- vately computed as long as its subroutines are privately closed and right-open interval on SA, and the lower and computed. upper bounds of the interval are respectively computed by LF . Given a letter c and an interval [f,  g) that corre- Index structure for string search sponds to suffixes that share the prefix x (i.e., [f,  g) reports Notation and definition  denotes a set of ordered sym- the locations of the substring x in S), we can find a new bols. A string consists of symbols in  . We denote a lexi- interval that corresponds to all suffixes that share the ′ ′ cographical order of two strings S and S by S ≤ S (i.e., A prefix cx (i.e., locations of the substring cx) by < C < G < T and AAA < AAC). We denote the i-th letter ′ ′ [f , g ) =[LF (f , S), LF (g , S)). (1) c c of a string S by S[i] and a substring starting from the i- th letter to the j-th letter by S[i,  j]. The index starts with The leftward extension of the match is called a backward search, which is the main functionality of FM-Index. By starting the search with the initial interval [0, N) and con- ducting the backward searches for w[ℓ − 1],w[ℓ − 2], . . . , Table 1 Secure subprotocols used in this paper the longest suffix match is detected when f = g . Rank Input Output and CF are precomputed and stored in an efficient from that can be searched in constant time. Therefore, the Equality [[x]] , [[y]] [[z]] s.t. z = 1 if x = y otherwise z = 0 longest suffix match can be computed in O(ℓ) time. LPM Comp [[x]] , [[y]] [[z]] s.t. z = 1 if x < y otherwise z = 0 is found if the search is conducted on S and match is ′ ′ CastUp n ′ [[x]] ∈ Z , n [[x]] ∈ Z ( n < n ) 2 n extended by w[0],w[1], . . . ,w[ℓ − 1]. B2A [[x]] [[x]] Searching LMEM by repeating LPM for Choose [[x]] , [[y]] , [[e ∈{0, 1}]] [[z]] s.t. z = x if e = 1 , otherwise ( e = 0 ) w[0, ... , ℓ − 1],w[1, ... , ℓ − 1],w[2, ... 
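The backward search of Eq. 1 can be sketched in a few lines of plain (non-secure) Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the suffix array is built by naive sorting, `Rank` and `CF` are recomputed on every call instead of being precomputed, and the sentinel `$` appended to the database string is an assumption of this sketch. The function names are hypothetical.

```python
def suffix_array(s):
    # Naive O(N^2 log N) construction; adequate for a small illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt(s, sa):
    # L[i] = S[SA[i] - 1]; for SA[i] = 0 this wraps to the sentinel.
    return "".join(s[i - 1] for i in sa)

def lf(c, i, L):
    # LF_c(i, S) = CF_c(L) + Rank_c(i, L)
    cf = sum(1 for ch in L if ch < c)   # letters smaller than c
    rank = L[:i].count(c)               # occurrences of c in L[0..i)
    return cf + rank

def longest_suffix_match(w, s):
    """Extend the match leftward from w[-1]; stop when the interval is empty.

    s is assumed to end with the sentinel '$' (smaller than A, C, G, T).
    Returns (number of matched letters, start positions of that match in s).
    """
    sa = suffix_array(s)
    L = bwt(s, sa)
    f, g = 0, len(s)
    matched = 0
    for c in reversed(w):
        nf, ng = lf(c, f, L), lf(c, g, L)
        if nf == ng:                    # f = g: the extension failed
            break
        f, g, matched = nf, ng, matched + 1
    return matched, sorted(sa[f:g])

print(longest_suffix_match("CTACA", "GATTACA$"))
```

For the query CTACA and the database GATTACA, the search extends through A, CA, ACA, and TACA, and stops when C is prepended, so the longest suffix match has length 4 and starts at position 3.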
Repeating LPM for each of these suffixes of the query takes $O(\ell^2)$ time. We can improve it to $O(\ell)$ time by using the longest common prefix (LCP) array and related data structures [27]. The LCP array, denoted by $LCP$, is an array that stores the length of the longest common prefix of $S[SA[i-1], |S|-1]$ and $S[SA[i], |S|-1]$ in $LCP[i]$ for $0 < i \le N$. The lcp-interval $[i, j)$ of lcp-value $d$ is an interval such that it satisfies $LCP[i] < d$, $LCP[j] < d$, $LCP[k] \ge d$ for all $k \in \{i+1, \ldots, j-1\}$, and $LCP[k] = d$ for at least one $k \in \{i+1, \ldots, j-1\}$, and is denoted by $d\text{-}[i, j)$. $d\text{-}[i, j)$ corresponds to all the suffixes that share the prefix $S[SA[i], \ldots, SA[i] + d - 1]$. The parent interval of $d\text{-}[i, j)$ is the lcp-interval $h\text{-}[m, n)$ such that $h < d$ and $0 \le m \le i < j \le n < N$, and there is no other lcp-interval $t\text{-}[r, s)$ such that $h < t < d$ and $0 \le m \le r \le i < j \le s \le n < N$. The parent of the lcp-interval $[f, g)$ can be found by

\[ [f', g') = \begin{cases} [PSV[f], NSV[f]) & (LCP[g] \le LCP[f]) \\ [PSV[g], NSV[g]) & (\text{otherwise}), \end{cases} \tag{2} \]

where $PSV[i] = \max\{ j \mid 0 \le j < i \wedge LCP[j] < LCP[i] \}$ and $NSV[i] = \min\{ j \mid i \le j < N \wedge LCP[j] < LCP[i] \}$. By finding a parent interval using $PSV$ and $NSV$ whenever it fails to extend the match, we can avoid useless backward searches, and thus LMEM is found in at most $2\ell$ backward searches. $LCP$, $PSV$, and $NSV$ are precomputed and stored in an efficient form that can be searched in constant time, so we can find LMEM in $O(\ell)$ time. See section 5.2 of [27] for more details of the data structures. Examples of the search by FM-Index, $LCP$, $PSV$, and $NSV$ are provided in Appendix 1.

Proposed protocols

Problem setting and outline of our protocols
We assume that a query holder $\mathcal{A}$, a database holder $\mathcal{B}$, and two computing nodes $P_0$ and $P_1$ participate in the protocol. $\mathcal{A}$ holds a query string $w$ of length $\ell$ and $\mathcal{B}$ holds a database string $T$ of length $N$. After the protocol is run, only $\mathcal{A}$ knows LPM or LMEM between $w$ and $T$. $P_0$ and $P_1$ do not obtain any information of $w$ and $T$, except for $\ell$ and $N$.

Our protocol consists of offline, DB preparation, and Search phases. In the offline phase, $\mathcal{B}$ generates BTs (correlated randomness used for multiplication) and sends them to $P_0$ and $P_1$. In the DB preparation phase, $\mathcal{B}$ creates a lookup table and distributes its shares to $P_0$ and $P_1$. In the Search phase, $\mathcal{A}$ generates shares of the query and sends them to $P_0$ and $P_1$, and $P_0$ and $P_1$ jointly compute the result without obtaining any information of the lookup table. Finally, $\mathcal{A}$ obtains the results. Figure 2 shows the schematic view of our goal and model. Note that the offline and DB preparation phases do not depend on a query string, so they can be computed in advance for multiple queries.

In section "Secret-shared recursive oblivious transfer", we propose the important building block ss-ROT that enables recursive reference to a lookup table. In section "Secure LPM", we describe how to design the lookup table based on FM-Index, and propose an efficient protocol for LPM by using the lookup table and ss-ROT. In section "Secure LMEM", we describe the additional table design for auxiliary data structures, and propose the complete protocol for LMEM. Table 2 summarizes the theoretical complexities of the three protocols. For comparison, the complexities of the baseline protocols and a previous method for LPM based on an additive homomorphic encryption [17, 19] are shown. As we mentioned in section "Introduction", the baseline protocols are designed using major techniques of secret-sharing-based protocols. The detailed algorithms are described in Appendix 3.

Secret-shared recursive oblivious transfer
We define a problem called a secret-shared recursive oblivious transfer (ss-ROT) in Definition 1.
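The parent-interval rule of Eq. 2 can be illustrated with a brute-force sketch. Several things here are assumptions made only for the illustration: the LCP array is hand-made rather than derived from a real string, it carries a sentinel value of -1 at both ends so that every lookup is defined, the search range of NSV is taken as $i < j \le N$, and PSV and NSV are computed by linear scans instead of being precomputed. The function names are hypothetical.

```python
def psv(lcp, i):
    # Previous smaller value: largest j < i with lcp[j] < lcp[i].
    return max(j for j in range(i) if lcp[j] < lcp[i])

def nsv(lcp, i):
    # Next smaller value: smallest j > i with lcp[j] < lcp[i].
    return min(j for j in range(i + 1, len(lcp)) if lcp[j] < lcp[i])

def parent_interval(lcp, f, g):
    # Eq. 2: use the boundary with the larger LCP value (f on ties).
    k = f if lcp[g] <= lcp[f] else g
    return psv(lcp, k), nsv(lcp, k)

# Hypothetical LCP array for five sorted suffixes, with -1 sentinels.
LCP = [-1, 0, 2, 1, 3, -1]

print(parent_interval(LCP, 3, 5))  # parent of the lcp-interval 3-[3, 5)
print(parent_interval(LCP, 1, 3))  # parent of the lcp-interval 2-[1, 3)
```

Both child intervals map to $[1, 5)$, whose lcp-value is $\max(LCP[f], LCP[g]) = 1$ for either child. Applying the rule once more to $[1, 5)$ yields the root interval $[0, 5)$.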
After the protocol is run, Fig. 2 Schematic view of our goal and model. (0) Server (DB holder) distributes Beaver triples. (A reliable third party can serve as the trusted initializer instead.) (1) Server distributes shares of the database. (2) User (query holder) distributes shares of the query. (3) The computing nodes jointly calculate shares of the result. (4) The results are sent to User. The offline phase is (0), DB preparation phase is (1), and Search phase consists of (2)–(4) Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 6 of 22 Table 2 Summary of complexities for our protocols and related protocols Btime Bsize Dtime Dsize Stime Comm. Round ss-ROT (proposed) 0 0 ℓN ℓN ℓ ℓ ℓ Secure LPM (proposed) ℓ ℓ ℓN ℓN ℓ ℓ ℓ [17, 19] (LPM by AHE) − − − − ℓN ℓ ℓ N 2 2 2 2 Baseline LPM N N log ℓ + log N ℓ N ℓ N ℓ N ℓ N 2 2 2 2 Secure LMEM (proposed) ℓN ℓN ℓ ℓ ℓ ℓ ℓ 3 3 3 3 Baseline LMEM N N log ℓ + log N ℓ N ℓ N ℓ N ℓ N BTime and Bsize are generation time and size of BTs. Dtime and Dsize are generation time for the shares of the database and size of the shares. Stime is the time for Search phase. Comm. is the size of data exchanged between computing nodes. Round is the number of data exchanges j j j Definition 1 We assume a database holder B and two B computes Share(R [i]) and sends [[R [i]]] and [[R [i]]] 0 1 computing nodes P and P participate the protocol. B 0 1 to P and P , for i = 0, . . . , N − 1 and j = 0, . . . , ℓ − 1. 0 1 holds a vector V of length N and 0 ≤ V [i] < N . Given the Search phase The Search phase consists of two steps initial position p and the depth of recursion ℓ (2 ≤ ℓ) , 0 and is described in Lines 2–5 of Protocol  1. The input the secret-shared recursive oblivious transfer protocol is the initial position p and shares of R. The output is (ℓ) outputs shares of [[V [p ]]] . An example of a search is illustrated in Fig. 3. 
V [V [··· V [p ]··· ]] Security intuition (3) In the DB preparation phase of ss-ROT, B does not disclose any private values, and P and P receive the 0 1 without leaking V to P and P . 0 1 shares. In the Search phase, all the messages exchanged between P and P are shares except for the result of 0 1 For simplicity, we denote the recursion of Eq.  3 by (ℓ) (2) Reconst in Step 1. In the j-th step of the loop in Step 1, V [p ] (e.g., V [V [p ]] is denoted by V [p ] ). In our 0 0 0 j (j+1) j p = R [p ] = (V [p ]+ r ) is reconstructed. j+1 j 0 mod N protocol, all the random values are uniformly generated Since the reconstructed value is randomized by r , no from Z . information is leaked. Note that for each vector R , all DB preparation phase B generates ℓ − 1 random val- j j 0 ℓ−2 the elements R [0], . . . , R [N − 1] are randomized by the ues r , . . . , r and computes the following vectors 0 ℓ−1 j same value r , but only one of them is reconstructed, R , . . . , R . Each vector R has N elements. 0 ℓ−1 and different random numbers r , . . . , r are used for 0 ℓ−1  (V [i]+ r ) (j = 0) mod N R , . . . , R . In Step 2, P and P output a result, and no 0 1 j j−1 j R [i] = (V [(i − r ) ]+ r ) (1 ≤ j ≤ ℓ − 2) mod N mod N information other than the result is leaked. j−1 (V [(i − r ) ]) (j = ℓ − 1) mod N mod N (4) Nak agawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 7 of 22 (4) Fig. 3 Example of a search when V = (2, 0, 3, 1) , p = 2 , and ℓ = 4 . The goal is to compute [[V [2] ]] = [[2]] . Here we assume B generates 0 1 2 0 0 0 0 r = 1, r = 2, r = 1 . In Step 1 of Search phase, P and P jointly compute Reconst([[R [2] ]] , [[R [2] ]] ) to obtain R [2]= 0 . ( R [2] is randomized 0 1 0 1 0 1 2 3 by r , so any element of V is leaked.) In a similar way, P and P compute R [0]= 3 and R [3]= 1 . In Step 2, P and P output [[R [1] ]] and 0 1 0 1 0 3 0 0 1 0 0 0 1 2 1 1 1 2 [[R [1] ]] respectively. 
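The table construction of Eq. 4 and the two-step search can be simulated in a single process. The sketch below is an illustration under stated assumptions, not Protocol 1 itself: both computing nodes are played by one program, shares are additive shares over $\mathbb{Z}_{2^{16}}$ (the ring size is an arbitrary choice for the sketch), and the function names `share`, `reconst`, `db_preparation`, and `ss_rot` are hypothetical. The optional argument `r` fixes the random offsets so that the numbers of Fig. 3 can be reproduced.

```python
import random

def share(x, ring):
    # Additive secret sharing: x = (s0 + s1) mod ring.
    s0 = random.randrange(ring)
    return s0, (x - s0) % ring

def reconst(s0, s1, ring):
    return (s0 + s1) % ring

def db_preparation(V, ell, r=None):
    """Build R^0..R^(ell-1) of Eq. 4 from V and offsets r^0..r^(ell-2)."""
    N = len(V)
    if r is None:
        r = [random.randrange(N) for _ in range(ell - 1)]
    R = []
    for j in range(ell):
        row = []
        for i in range(N):
            src = V[i] if j == 0 else V[(i - r[j - 1]) % N]  # undo previous mask
            mask = r[j] if j < ell - 1 else 0                # last table is unmasked
            row.append((src + mask) % N)
        R.append(row)
    return R

def ss_rot(V, p0, ell, r=None, ring=2 ** 16):
    R = db_preparation(V, ell, r)
    shares = [[share(x, ring) for x in row] for row in R]    # sent to P0 and P1
    p, trace = p0, []
    for j in range(ell - 1):          # Step 1: reconstruct masked positions
        s0, s1 = shares[j][p]
        p = reconst(s0, s1, ring)
        trace.append(p)
    return trace, shares[ell - 1][p]  # Step 2: shares of V^(ell)[p0]

trace, (out0, out1) = ss_rot([2, 0, 3, 1], p0=2, ell=4, r=[1, 2, 1])
print(trace, reconst(out0, out1, 2 ** 16))
```

With the values of Fig. 3, the reconstructed positions are $0, 3, 1$ and the final shares reconstruct to $V^{(4)}[2] = 2$. The number of reconstructions depends only on $\ell$, which is why the Search phase cost does not grow with $N$.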
In the example of Fig. 3, since $R^0[2] = V[2] + r^0$, $R^1[V[2] + r^0] = V[V[2] + r^0 - r^0] + r^1$, $R^2[V[V[2]] + r^1] = V[V[V[2]] + r^1 - r^1] + r^2$, and $R^3[V[V[V[2]]] + r^2] = V[V[V[V[2]]] + r^2 - r^2]$, ss-ROT successfully computes $[[V^{(4)}[2]]]$.

Security

Theorem 1 ss-ROT is correct and secure in the semi-honest model.

Proof Correctness and security of the ss-ROT protocol are proved as follows.

Correctness. We assume the following equation.

\[ p_i = (V^{(i)}[p_0] + r^{i-1}) \bmod N \tag{5} \]

In Step 1, for $j = 0$, the protocol computes $p_1$ by reconstructing $R^0[p_0]$. From the definition of $R^j[i]$ in Eq. 4,

\[ p_1 = R^0[p_0] = (V^{(1)}[p_0] + r^0) \bmod N. \tag{6} \]

For $j = k$, the protocol computes $p_{k+1}$ by reconstructing $R^k[p_k]$. From the definition of $R^j[i]$ in Eq. 4 and the assumption of Eq. 5,

\[ \begin{aligned} p_{k+1} = R^k[p_k] &= (V[(p_k - r^{k-1}) \bmod N] + r^k) \bmod N \\ &= (V[V^{(k)}[p_0]] + r^k) \bmod N \\ &= (V^{(k+1)}[p_0] + r^k) \bmod N. \end{aligned} \tag{7} \]

Eq. 5 holds for $i = 1$ by Eq. 6. It also holds for $i = k + 1$ under the assumption that Eq. 5 holds for $i = k$. Therefore by induction, Eq. 5 holds for $i = 1, \ldots, \ell - 1$.

In Step 2, $P_0$ and $P_1$ output $[[R^{\ell-1}[p_{\ell-1}]]]$. Since Eq. 5 holds for $i = \ell - 1$, $R^{\ell-1}[p_{\ell-1}] = V[(p_{\ell-1} - r^{\ell-2}) \bmod N]$ is transformed into $V^{(\ell)}[p_0]$ by plugging in $p_{\ell-1} = (V^{(\ell-1)}[p_0] + r^{\ell-2}) \bmod N$. Therefore the final output of ss-ROT becomes $V^{(\ell)}[p_0]$. The above argument completes the proof of correctness of Theorem 1.

Security. Since the roles of $P_0$ and $P_1$ are symmetric, it is sufficient to consider the case when $P_0$ is corrupted. The input to $P_0$ is $p_0$ and $\ell$, and the output of $P_0$ is $V^{(\ell)}[p_0]$. The function achieved by Protocol 1 is deterministic and the protocol is correct. Therefore, to ensure the security of Protocol 1, we need to prove the existence of a probabilistic polynomial-time simulator $\mathcal{S}$ such that

\[ \{ (\mathcal{S}(p_0, \ell, V^{(\ell)}[p_0]), V^{(\ell)}[p_0]) \} \equiv \{ (X, V^{(\ell)}[p_0]) \}, \tag{8} \]

where $X$ is $P_0$'s view. $X$ consists of:
• $[[R^j[i]]]_0$ for $i = 0, \ldots, N-1$ and $j = 0, \ldots, \ell-1$ (a message from $\mathcal{B}$)
• $[[R^j[p_j]]]_1$ ($j$-th message from $P_1$) for $j = 0, \ldots, \ell-1$
• $p_j$ ($j$-th value obtained by $\mathrm{Reconst}([[R^j[p_j]]]_0, [[R^j[p_j]]]_1)$ in Step 1) for $j = 1, \ldots, \ell-1$.

All the messages from $\mathcal{B}$ and $P_1$ are uniformly at random in $\mathbb{Z}_{2^n}$, as they are generated by Share. $p_{j+1} = \mathrm{Reconst}([[R^j[p_j]]]_0, [[R^j[p_j]]]_1)$ holds for $j = 0, \ldots, \ell-2$, and $V^{(\ell)}[p_0] = \mathrm{Reconst}([[R^{\ell-1}[p_{\ell-1}]]]_0, [[R^{\ell-1}[p_{\ell-1}]]]_1)$ holds. $p_1 = R^0[p_0], p_2 = R^1[p_1], \ldots, p_{\ell-1} = R^{\ell-2}[p_{\ell-2}]$ are uniformly at random in $\mathbb{Z}_N$ from the definition of Eq. 4.

Let us denote a random number $u$ chosen from a set $U$ uniformly at random by $u \in_R U$. We construct $\mathcal{S}$ as described in Protocol 2. The output of $\mathcal{S}$ is $\tilde{R}_0 \in \mathbb{Z}_{2^n}^{\ell \times N}$, $\tilde{R}_1 \in \mathbb{Z}_{2^n}^{\ell}$, and $\tilde{p}_1, \ldots, \tilde{p}_{\ell-1}$. In Line 6 and Line 9, $\tilde{p}_1, \ldots, \tilde{p}_{\ell-1}$ are generated such that they are uniformly at random in $\mathbb{Z}_N$. In Line 7, $\tilde{R}_0^0[p_0]$ and $\tilde{R}_1[0]$ are generated by Share such that they are shares of $\tilde{p}_1$ and uniformly at random in $\mathbb{Z}_{2^n}$. In Line 10, $\tilde{R}_0^j[\tilde{p}_j]$ and $\tilde{R}_1[j]$ are generated by Share such that they are shares of $\tilde{p}_{j+1}$ and uniformly at random in $\mathbb{Z}_{2^n}$ for $j = 1, \ldots, \ell-2$. In Line 12, $\tilde{R}_0^{\ell-1}[\tilde{p}_{\ell-1}]$ and $\tilde{R}_1[\ell-1]$ are generated by Share such that they are shares of $V^{(\ell)}[p_0]$ and uniformly at random in $\mathbb{Z}_{2^n}$. All the elements of $\tilde{R}_0$ except for $\tilde{R}_0^0[p_0]$ and $\tilde{R}_0^j[\tilde{p}_j]$ ($j = 1, \ldots, \ell-1$) are uniformly at random in $\mathbb{Z}_{2^n}$ by Line 3. Therefore, Eq. 8 holds. By the above discussion, we find our ss-ROT satisfies security in the semi-honest model.

Complexities
In the DB preparation phase, $\mathcal{B}$ generates shares of $V$ of length $N$ for $\ell$ times. Therefore, time and communication complexities are $O(\ell N)$. For the Search phase, Reconst is computed $\ell$ times in Step 1. Since the time, communication, and round complexities of Reconst are $O(1)$, those of the Search phase become $O(\ell)$.

Secure LPM

Construction of lookup table
The goal is to find LPM securely. To apply FM-Index for a prefix search, the reverse string of $T$ (i.e., $\hat{T}$) is used. The backward search of FM-Index is formulated by Eq. 1. If we precompute $LF_c(i, \hat{T})$ for $i = 0, \ldots, N$ and $c \in \{A, T, G, C\}$, and store them in a lookup table that consists of four vectors $V_A$, $V_C$, $V_G$, and $V_T$ such that $V_c[i] = LF_c(i, \hat{T})$, Eq. 1 is replaced by the following table lookup:

\[ f_{k+1} = V_{w[k]}[f_k], \quad g_{k+1} = V_{w[k]}[g_k]. \tag{9} \]

I.e., starting with the initial interval $[f_0 = 0, g_0 = N)$, we can compute the match by recursively referring to the lookup table while $f < g$.

Protocol overview
The key idea of Secure LPM is to refer to $V_c$ by ss-ROT, i.e., $P_0$ and $P_1$ jointly refer to $V_c$ $\ell$ times in a recursive manner. To achieve backward search, $P_0$ and $P_1$ need to select $V_x[\cdot]$ for each reference, where $x$ is a query letter to be searched with. This is achieved by expressing the query letter by unary code (Eq. 11) and computing the inner product of Eq. 11 and $(V_A[\cdot], V_C[\cdot], V_G[\cdot], V_T[\cdot])$. To find LPM, $P_0$ and $P_1$ need to check $f = g$ for each reference. We use the subprotocol Equality to check it securely. Since $V_c$ is randomized with different numbers for searching $f$ and $g$, the difference of the random numbers is precomputed and removed securely upon the equality check. $\mathcal{A}$ receives only the result of each equality check to know LPM. For example, LPM is the prefix of length $i - 1$ when $f = g$ for the $i$-th reference. If $f \ne g$ for all references, LPM is the entire query.

DB preparation phase
$\mathcal{B}$ creates a lookup table and generates the following $4\ell$ vectors in a similar manner to ss-ROT. For simplicity, we denote the length of $V_c$ by $N' = N + 1$.

\[ R^j_{c,f}[i] = \begin{cases} (V_c[i] + r_f^j) \bmod N' & (j = 0) \\ (V_c[(i - r_f^{j-1}) \bmod N'] + r_f^j) \bmod N' & (1 \le j < \ell) \end{cases} \tag{10} \]

$R^j_{c,f}[i]$ is used for computing the lower bound $f$ of the interval $[f, g)$. We also generate $R^j_{c,g}[i]$ for the upper bound $g$. $R$ consists of $8\ell$ vectors, each of length $N'$. Since the longest match is found when $f = g$, $\mathcal{B}$ also generates a vector $r'$ of length $\ell$ with $r'[j] = (r_f^j - r_g^j) \bmod N'$ that is used for the equality check of $f$ and $g$. Then, $\mathcal{B}$ sends shares of $R^j_{c,f}[i]$, $R^j_{c,g}[i]$, and $r'[j]$ to $P_0$ and $P_1$.

Search phase
Protocol 3 describes the algorithm in detail. $\mathcal{A}$ generates four vectors $q_A$, $q_C$, $q_G$, $q_T$, each of length $\ell$, as follows.

\[ q_c[j] = \begin{cases} 1 & (c = w[j]) \\ 0 & (c \ne w[j]) \end{cases} \tag{11} \]

For each $j$, $(q_A[j], q_C[j], q_G[j], q_T[j])$ encodes $w[j]$ (e.g., $(q_A[j], q_C[j], q_G[j], q_T[j]) = (1, 0, 0, 0)$ if $w[j] = A$). The aim of the encoding is to compute $[[R^j_x[\cdot]]] = [[\sum_{c \in \Sigma} q_c[j] \cdot R^j_c[\cdot]]]$ when $w[j] = x$. Figure 4 illustrates an example of the table lookup.

$\mathcal{A}$ generates shares of $q_A$, $q_C$, $q_G$, $q_T$ and distributes them to $P_0$ and $P_1$. $P_0$ and $P_1$ compute $LF_{w[j]}(f_j, \hat{T}) + r_f^j$ and $LF_{w[j]}(g_j, \hat{T}) + r_g^j$ in Lines 5–8 without leaking $f'$ and $g'$, where $[f', g')$ corresponds to the match of $w[0, j]$ and $T$. In Lines 10–13, the equality of $f'$ and $g'$ is examined for all rounds. Note that different values $r_f^{j-1}$ and $r_g^{j-1}$ are used for $f' = (f_j - r_f^{j-1}) \bmod N'$ and $g' = (g_j - r_g^{j-1}) \bmod N'$ in order to conceal $f'$ and $g'$. Since $f', g', r_f^{j-1}, r_g^{j-1}, r'[j-1] \in \{0, \ldots, N'-1\}$, it is sufficient to check if $f_j - g_j - r'[j-1]$ is equal to either one of $-N'$, $0$, and $N'$. In Lines 16–18, $\mathcal{A}$ receives all the results of the equality checks (i.e., $[[o[1]]]^B, \ldots, [[o[\ell]]]^B$) from $P_0$ and $P_1$, and knows LPM by reconstructing them. For example, if $w$ = GCT and $o = (0, 0, 1)$, $\mathcal{A}$ knows that LPM is GC.
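The table lookup of Eq. 9 combined with the unary encoding of Eq. 11 can be sketched in the clear, that is, without secret sharing and without the random offsets of Eq. 10. The sketch is only meant to show what Protocol 3 computes, under stated assumptions: the indexed string `t_rev` plays the role of $\hat{T}$, a sentinel `~` that sorts after A, C, G, and T is appended to it, the initial upper bound is the length of the string including that sentinel, and the function names are hypothetical. The indexed string ACGT is chosen so that the references $V_G[0]$, $V_C[2]$, $V_T[1]$ named in the caption of Fig. 4 are reproduced.

```python
SIGMA = "ACGT"

def build_tables(t_rev):
    """V_c[i] = LF_c(i, t_rev) for every position i of the indexed string."""
    s = t_rev + "~"                       # assumed sentinel, sorts after A, C, G, T
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    L = "".join(s[i - 1] for i in sa)     # Burrows-Wheeler transform
    V = {}
    for c in SIGMA:
        cf = sum(1 for ch in L if ch < c)
        V[c] = [cf + L[:i].count(c) for i in range(len(s) + 1)]
    return V

def one_hot(w):
    # Eq. 11: q_c[j] = 1 if c = w[j], otherwise 0.
    return {c: [1 if ch == c else 0 for ch in w] for c in SIGMA}

def lpm_flags(w, t_rev):
    V, q = build_tables(t_rev), one_hot(w)
    f, g = 0, len(t_rev) + 1              # initial interval covers all suffixes
    o = []
    for j in range(len(w)):
        # The inner product with the unary code selects V_{w[j]} without branching.
        f = sum(q[c][j] * V[c][f] for c in SIGMA)
        g = sum(q[c][j] * V[c][g] for c in SIGMA)
        o.append(1 if f == g else 0)      # equality check of the two bounds
    return o

print(lpm_flags("GCT", "ACGT"))
```

For $w$ = GCT the flags are $(0, 0, 1)$, so the first empty interval appears at the third reference and LPM is GC, in agreement with the example in the text. In the protocol itself the same inner product is evaluated on shares with MULT, and the lookups are made on the randomized vectors $R^j_{c,f}$ and $R^j_{c,g}$.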
Fig. 4 Example of a secure table lookup when $w$ = GCT and $T$ = ACGT. Only the lookup for a lower bound is shown. For simplicity, $R^j_{c,f}$ and $r^j_f$ are denoted by $R^j_c$ and $r^j$. $LF_{w[i]}(f_i, \hat{T})$ ($i = 0, 1, 2$) is computed by $V_G[0]$, $V_C[2]$, and $V_T[1]$. $V$ is referenced securely by using $R$. $R^0_G[0]$ is computed by $\sum_{c \in \Sigma} q_c[0] \cdot R^0_c[0]$. $R^1_C[2 + r^0]$ is computed by $\sum_{c \in \Sigma} q_c[1] \cdot R^1_c[2 + r^0]$. $R^2_T[1 + r^1]$ is computed by $\sum_{c \in \Sigma} q_c[2] \cdot R^2_c[1 + r^1]$

Security

Theorem 2 Protocol 3 is correct and secure in the semi-honest model.

Proof Correctness and security of Protocol 3 are proved as follows.

Correctness. The lookup table $V$ simply stores all possible outputs of $LF$. Therefore, backward search (Eq. 1) is equivalent to Eq. 9. For the case of querying $w$, $V_{w[k-1]}[\cdots V_{w[0]}[p_0] \cdots]$ becomes the lower bound $f$ (for $p_0 = 0$) or the upper bound $g$ (for $p_0 = N$) of the interval that corresponds to the prefix match of length $k$. In Line 5 of Protocol 3, $[[R^k_{A,f}[f_k] \times q_A[k] + R^k_{C,f}[f_k] \times q_C[k] + R^k_{G,f}[f_k] \times q_G[k] + R^k_{T,f}[f_k] \times q_T[k]]]$ is computed. Since $q_{w[j]}[j] = 1$ and $q_c[j] = 0$ ($c \ne w[j]$), it is equivalent to $[[R^k_{w[k],f}[f_k]]]$. Line 6 computes $[[R^k_{w[k],g}[g_k]]]$ in the same manner. Each vector $R^j_{c,f}$ in Eq. 10 is generated in the same manner as $R^j$ in Eq. 4. Since Eq. 10 uses the common random values $r_f^{j-1}$ and $r_f^j$ for $R^j_{A,f}$, $R^j_{C,f}$, $R^j_{G,f}$, $R^j_{T,f}$, we can recursively reference $V_c$ ($c \in \{A, C, G, T\}$), which is obvious from the correctness of ss-ROT. Therefore, the recursion by Line 5 and Line 7 can compute $(V_{w[k-1]}[\cdots V_{w[0]}[f_0] \cdots] + r_f^{k-1}) \bmod N'$, and the recursion by Line 6 and Line 8 can also compute $(V_{w[k-1]}[\cdots V_{w[0]}[g_0] \cdots] + r_g^{k-1}) \bmod N'$.

The longest match is found when the interval width becomes 0. Since $f_k = (V_{w[k-1]}[\cdots V_{w[0]}[f_0] \cdots] + r_f^{k-1}) \bmod N'$ and $g_k = (V_{w[k-1]}[\cdots V_{w[0]}[g_0] \cdots] + r_g^{k-1}) \bmod N'$ are randomized, Line 11 computes $d = f_k - g_k - r'[k-1]$, where $r'[k-1] = (r_f^{k-1} - r_g^{k-1}) \bmod N'$, to obtain the correct interval width. When the width is 0, $d$ becomes either one of $0$, $N'$ and $-N'$. Therefore, Line 12 computes the equality of $d$ with $0$, $N'$ and $-N'$, respectively. By reconstructing all the results in Lines 16–18, $\mathcal{A}$ knows the round in which the interval width becomes 0; i.e., he/she knows LPM. The above argument completes the proof of correctness of Theorem 2.

Security. We only show a sketch of the proof. For Lines 1–2 of Protocol 3, $\mathcal{A}$ and $\mathcal{B}$ do not disclose any private values, and $P_0$ and $P_1$ receive the shares. For Lines 3–14, it is guaranteed by the subprotocols ADD, MULT, and Equality that all the messages exchanged between $P_0$ and $P_1$ are shares except for the output of Reconst in Lines 7–8 (see section "Secure computation based on secret sharing" for details of the subprotocols). In Lines 7–8, the reconstructed values are $R^j_{w[j],f}[f_j]$ and $R^j_{w[j],g}[g_j]$. Since the values are $(V_{w[j]}[f'] + r_f^j) \bmod N'$ and $(V_{w[j]}[g'] + r_g^j) \bmod N'$ according to Eq. 10, it is obvious that $V$ is randomized for all rounds $j = 0, \ldots, \ell-1$, and no information is leaked. For Lines 14–17, only the output of Equality at Line 11 is reconstructed. The reconstructed values are either 1 or 0 according to Equality, and no information other than the result is leaked.

$\mathcal{A}$ may reveal $T$ by making many queries. Such a problem is called output privacy. Although output privacy is outside of the scope of this paper, we should mention here that $\mathcal{A}$ needs to make an unrealistically large number of queries for obtaining $T$ by such a brute-force attack, considering that $N$ is very long.

Complexities
The DB preparation phase generates shares of $R^j_{c,f}$ and $R^j_{c,g}$ ($c \in \Sigma$ and $0 \le j < \ell$); i.e., $8 \times \ell$ vectors of length $N'$. Therefore, the time and communication complexities are $O(\ell N)$. For the Search phase, MULT and Reconst are computed twice in Lines 4–9 for $\ell$ rounds and Equality is computed once in Lines 10–13 for $\ell$ rounds. Note that Equality is computed in parallel, and the number of rounds can be reduced to a constant number. Each time, the communication and round complexities of these subprotocols are $O(1)$, so those of the Search phase become $O(\ell)$.

Secure LMEM

Construction of lookup table
As described in section "Index structure for string search", we can find a parent interval by a reference to $LCP$, $PSV$, and $NSV$. Therefore, in addition to $V_c$ defined in section "Secure LPM", we prepare lookup tables that simply store all the outputs of them; i.e., $V_{lcp}[i] = LCP[i]$, $V_{psv}[i] = PSV[i]$, and $V_{nsv}[i] = NSV[i]$.

DB preparation phase
$\mathcal{B}$ generates randomized vectors $R^j_{c,f}$, $R^j_{c,g}$ and $r'[j] = (r_f^j - r_g^j) \bmod N'$ using the same algorithm as in section "Secure LPM" for length $2\ell$. As shown in Eq. 2, $V_{lcp}$ is referred by the upper and lower bounds of $[f, g)$. Therefore, $\mathcal{B}$ generates the following circular permutations of $V_{lcp}$ such that $W_{l,f}$ and $R_{c,f}$, and $W_{l,g}$ and $R_{c,g}$, are permutated by the same random values, respectively. I.e.,

\[ W^j_{l,x}[i] = \begin{cases} V_{lcp}[i] & (j = 0) \\ V_{lcp}[(i - r_x^{j-1}) \bmod N'] & (1 \le j < 2\ell), \end{cases} \]

where $x$ is either $f$ or $g$. $V_{psv}$ is referred by both $f$ and $g$, and is plugged in to $f$. Therefore, $\mathcal{B}$ generates $W^j_{p,f}$ and $W^j_{p,g}$ such that both of them are randomized by $r_f^j$, and $W^j_{p,f}$ is permutated by $r_f^{j-1}$ and $W^j_{p,g}$ is permutated by $r_g^{j-1}$ as follows.

\[ W^j_{p,f}[i] = \begin{cases} (V_{psv}[i] + r_f^j) \bmod N' & (j = 0) \\ (V_{psv}[(i - r_f^{j-1}) \bmod N'] + r_f^j) \bmod N' & (1 \le j < 2\ell) \end{cases} \]

\[ W^j_{p,g}[i] = \begin{cases} (V_{psv}[i] + r_f^j) \bmod N' & (j = 0) \\ (V_{psv}[(i - r_g^{j-1}) \bmod N'] + r_f^j) \bmod N' & (1 \le j < 2\ell) \end{cases} \]

Similarly, $V_{nsv}$ is referred by both $f$ and $g$, and is plugged in to $g$. Therefore, $\mathcal{B}$ generates $W^j_{n,f}[i]$ and $W^j_{n,g}[i]$ as follows.

\[ W^j_{n,f}[i] = \begin{cases} (V_{nsv}[i] + r_g^j) \bmod N' & (j = 0) \\ (V_{nsv}[(i - r_f^{j-1}) \bmod N'] + r_g^j) \bmod N' & (1 \le j < 2\ell) \end{cases} \]

\[ W^j_{n,g}[i] = \begin{cases} (V_{nsv}[i] + r_g^j) \bmod N' & (j = 0) \\ (V_{nsv}[(i - r_g^{j-1}) \bmod N'] + r_g^j) \bmod N' & (1 \le j < 2\ell) \end{cases} \]

$\mathcal{B}$ distributes shares of $R_{c,f}$, $R_{c,g}$, $r'$, $W_{l,f}$, $W_{l,g}$, $W_{p,f}$, $W_{p,g}$, $W_{n,f}$, and $W_{n,g}$ to $P_0$ and $P_1$.

Search phase
Protocol 4 describes the algorithm in detail. $\mathcal{A}$ generates query vectors $q_A$, $q_C$, $q_G$, $q_T$ by Eq. 11 and distributes shares of the vectors to $P_0$ and $P_1$. In Line 6 of Protocol 4, $[\hat{f}, \hat{g})$ is computed by the reference to $R$ (i.e., a search based on a backward search) similarly to Lines 5–6 of Protocol 3. In Line 11, $[f_{ex}, g_{ex})$ is computed by the reference to $W$ (i.e., a search based on $LCP$, $PSV$ and $NSV$). In Line 13, the interval is updated by either $[\hat{f}, \hat{g})$ or $[f_{ex}, g_{ex})$ based on the result of $f' = g'$ in Lines 7–9, where $[f', g')$ is the interval that corresponds to a substring match.

In each round, we need to know a query letter to be searched with, so we need to maintain the right end position of the match in the query. The position moves toward the right while the match is extended, but remains the same when the interval is updated based on $PSV$ and $NSV$. To memorize the position, we prepare shares of a unit bit vector $u$ of length $\ell$, in which the position $t$ is memorized as $u[t] = 1$ and $u[i \ne t] = 0$. In Lines 20–23, $u$ remains the same if the interval is updated based on $PSV$ and $NSV$, and $u = (0, u[0], u[1], \ldots, u[\ell-2])$ otherwise. When the search is finished (e.g., the right end of a match exceeds the right end of the query), $u = (0, \ldots, 0)$. Therefore in Lines 25–28, $x = 1$ while the right end of a match does not exceed the right end of the query and $x = 0$ after finishing the search. In Lines 29–31, the inner product of $q_c$ ($c \in \Sigma$) and $u$ becomes the encoding of $w[t]$ that is used for the next round.

We also maintain the left end position of the match. While the match is extended, the position remains the same, and it moves toward the right when the interval is updated by $[f_{ex}, g_{ex})$. The new left end position can be computed by $p + m - c$ where $p$ is the current position, $m$ is the length of the current match, and $c$ is the lcp-value of $[f_{ex}, g_{ex})$ (i.e., the longest common prefix length of suffixes contained in $[f_{ex}, g_{ex})$). The position is computed in Line 33. The match length is incremented by 1 for each extension while the right end of the match does not exceed the query length. When the interval is updated by $[f_{ex}, g_{ex})$, the match length is reduced to the lcp-value of $[f_{ex}, g_{ex})$, which is computed by $\max(LCP[f], LCP[g])$. The match length is computed in Line 32. In Line 35, the longest match length and the corresponding left end position are updated. After all the positions in the query have been examined, LMEM and its left end position are sent to $\mathcal{A}$ in Line 37.

Security

Theorem 3 Protocol 4 is correct and secure in the semi-honest model.

Proof Correctness and security of Protocol 4 are proved as follows.

Correctness. $V$, $R$, $r'$ and $q$ are generated by the same algorithm used in Protocol 3. Therefore, Line 6 is equivalent to a backward search, and $e1$ is the result of the equality check of 0 and the width of the obtained interval in Lines 7–8. The lookup tables $V_{lcp}$, $V_{psv}$, and $V_{nsv}$ store all the outputs of $LCP$, $PSV$ and $NSV$, and $W_l$, $W_p$, and $W_n$ are generated based on $V_{lcp}$, $V_{psv}$, and $V_{nsv}$, respectively. Since $W^j_{l,f}$ and $W^j_{l,g}$ are circular permutations of $V_{lcp}$ by the same random values $r_f^{j-1}$ and $r_g^{j-1}$ that are used for generating $R^j_{c,f}$ and $R^j_{c,g}$ ($c \in \Sigma$) respectively, Line 8 can compute $LCP[g_j] \le LCP[f_j]$ and $e2$ holds the result. By using Choose and $e2$, either $[W^j_{p,f}[f_j], W^j_{n,f}[f_j])$ or $[W^j_{p,g}[g_j], W^j_{n,g}[g_j])$ is selected. $W^j_{p,f}$ and $W^j_{p,g}$ are permutated by $r_f^{j-1}$ and $r_g^{j-1}$, but are randomized by the identical random value $r_f^j$. Similarly, $W^j_{n,f}$ and $W^j_{n,g}$ are permutated by $r_f^{j-1}$ and $r_g^{j-1}$, but are randomized by $r_g^j$. Since $W_{p,f}[f_j]$ and $W_{n,g}[g_j]$ are generated in the same manner as $R_{c,f}$ and $R_{c,g}$, it is obvious that the reference by them is correct. The reference by $W_{n,f}[f_j]$ is transformed into

\[ \begin{aligned} X^{j+1}_g[W^j_{n,f}[f_j]] &= V_x[W^j_{n,f}[f_j] - r_g^j] + r_g^{j+1} \\ &= V_x[V_{nsv}[f_j - r_f^{j-1}] + r_g^j - r_g^j] + r_g^{j+1} \\ &= V_x[V_{nsv}[f_j - r_f^{j-1}]] + r_g^{j+1} \end{aligned} \tag{12} \]

and the reference by $W_{p,g}[g_j]$ is transformed into

\[ \begin{aligned} X^{j+1}_f[W^j_{p,g}[g_j]] &= V_x[W^j_{p,g}[g_j] - r_f^j] + r_f^{j+1} \\ &= V_x[V_{psv}[g_j - r_g^{j-1}] + r_f^j - r_f^j] + r_f^{j+1} \\ &= V_x[V_{psv}[g_j - r_g^{j-1}]] + r_f^{j+1} \end{aligned} \tag{13} \]

where $X^{j+1}$ is any one of $R^{j+1}_c$, $W^{j+1}_p$ and $W^{j+1}_n$, and $V_x$ is the corresponding lookup table; i.e., either one of $V_c$, $V_{psv}$ and $V_{nsv}$. Note that $V_x$ could be a different table for each $j + 1$, but we abuse the same notation for simplicity. Since $f_j$ and $g_j$ are described in the form of $V_x^{(j)}[p_0] + r_f^{j-1}$ and $V_x^{(j)}[p_0] + r_g^{j-1}$ based on Eq. 5, Eq. 12 and Eq. 13 are transformed into $V_x^{(j+2)}[p_0] + r_g^{j+1}$ and $V_x^{(j+2)}[p_0] + r_f^{j+1}$, which also satisfy the recursion form of Eq. 5. Thus, the intervals $[W^j_{p,f}[f_j], W^j_{n,f}[f_j])$ and $[W^j_{p,g}[g_j], W^j_{n,g}[g_j])$ are correct intervals and Line 11 is equivalent to computing Eq. 2.

In Lines 16–23, $u$ remains the same if $e1 = 0$ and $u = (0, u[0], u[1], \ldots, u[\ell-2])$ otherwise. Therefore Lines 29–31 can choose the letter to be searched with. The match length and the start position are obtained based on $e1$ in Lines 32–33, and the longest value and the corresponding position are selected in Lines 34–35. The shares of the length and start position of LMEM are sent to $\mathcal{A}$, and $\mathcal{A}$ reconstructs them. Then, Protocol 4 outputs them. The above argument completes the proof of correctness of Theorem 3.

Security. We only show a sketch of the proof. For Lines 1–2 of Protocol 4, $\mathcal{A}$ and $\mathcal{B}$ do not disclose any private values, and $P_0$ and $P_1$ receive the shares. For Lines 3–37, it is guaranteed by the subprotocols ADD, MULT, Equality, and Choose that all the messages exchanged between $P_0$ and $P_1$ are shares except for the output of Reconst in Line 14 (see section "Secure computation based on secret sharing" for details of the subprotocols). In Line 14, the reconstructed values are $f_{j+1} = V_x^{(j+1)}[p_0] + r_f^j$ and $g_{j+1} = V_x^{(j+1)}[p_0] + r_g^j$, according to Eq. 5, Eq. 12, and Eq. 13. Since $f_{j+1}$ and $g_{j+1}$ are randomized by $r_f^j$ and $r_g^j$, respectively, for all rounds $j = 0, \ldots, 2\ell-1$, no information is leaked. In Line 38, $\mathcal{A}$ reconstructs only the search result (the length and start position of LMEM).

Complexities
The DB preparation phase generates shares of $R^j_{c,f}$ and $R^j_{c,g}$ ($c \in \Sigma$, $0 \le j < \ell$) and $W^j_{x,f}$ and $W^j_{x,g}$ ($x \in \{l, p, n\}$ and $0 \le j < \ell$); i.e., $14 \times \ell$ vectors of length $N + 1$. Therefore, the time and communication complexities are $O(\ell N)$. For the Search phase, MULT is computed $\ell$ times in parallel in Lines 17–18. (These are not dependent on each other.) In Line 30, MULT is computed $\ell$ times in parallel, and Line 30 is computed in parallel four times in Lines 29–31. Lines 17–18 and Lines 29–31 are repeated for $2\ell - 1$ rounds. Other subprotocols are also computed for $2\ell - 1$ rounds. The time, communication, and round complexities are $O(1)$ for MULT, and independent computation of MULT for $\ell$ times does not increase the round complexity. The time, communication and round complexities are $O(1)$ for the other subprotocols used in Protocol 4. Therefore, the complexities of the Search phase are $O(\ell^2)$ for time and communication, and $O(\ell)$ for the number of rounds. The time complexity of the standard (i.e., non-privacy-preserving) LMEM is $O(\ell)$ while that of Secure LMEM is $O(\ell^2)$. The increase in time complexity is caused by the computation for maintaining the match position securely.

Reducing size of shares in DB preparation phase
The protocols based on ss-ROT are quite efficient in the Search phase; however, they require large data transfer from $\mathcal{B}$ to the computing nodes in the DB preparation phase when the number of queries and the length of the database are large. To mitigate the problem, we propose another protocol that can reduce the size of shares in the DB preparation phase.

We use two parameters $m$ and $n$ ($m < n$) for computing shares. When Share outputs $([[x]]_0, [[x]]_1) \in \mathbb{Z}_{2^m}^2$, we denote the share by $([[x]]^m_0, [[x]]^m_1)$. When Share outputs $([[x]]_0, [[x]]_1) \in \mathbb{Z}_{2^n}^2$, we denote the share by $([[x]]_0, [[x]]_1)$. We denote $M = 2^m$. In our protocol, all the random values are uniformly generated from $\mathbb{Z}_{N'}$.

Basic idea
$V_c[i] = LF_c(i, \hat{T})$ is a lookup table used by Protocols 3 and 4. We sample $V_c[i]$ at $i = 0, M, 2M, \ldots, \lfloor N'/M \rfloor M$, where $N'$ is the length of $V_c$, and store the sampled values in a vector $z$. We compute $x[i] = V_c[i] - V_c[p]$ for $i = 0, \ldots, N'-1$, where $p$ is the sampled position closest to $i$ and $p \le i$. Given a position $k$, we can compute $V_c[k]$ by $z[\lfloor k/M \rfloor] + x[k]$. Any element in $z$ is non-negative and at most $N' - 1$ while that in $x$ is also non-negative and at most $M - 1$ because $0 \le V_c[i+1] - V_c[i] \le 1$. Our idea is to use $n$ bits for storing $z[i]$ and $m$ bits for storing $x[i]$. Note that we used $n$ bits for storing $V_c[i]$ in Protocol 3 and Protocol 4. There are $\lceil N'/M \rceil$ sampled positions, so the size of the lookup table becomes $O(n \lceil N'/M \rceil + m N')$, which is $n/m$ times smaller compared to $V_c$ if $M$ is sufficiently large. We use a rotation technique to hide an intermediate position. Since $1 < V_c[0] - V_c[N'-1]$ for most cases, we design a rotated table $V'^j_{c,f}$ that satisfies $0 \le V'^j_{c,f}[i+1] - V'^j_{c,f}[i] \le 1$ by subtracting an offset from $V_c$.

DB preparation phase
$\mathcal{B}$ computes the following vectors for $j = 0, \ldots, \ell-1$:

\[ V'^j_{c,f}[(i + r_f[j]) \bmod N'] = \begin{cases} V_c[i] - o^j_{c,f} & (i \le (i + r_f[j]) \bmod N') \\ V_c[i] - \bar{o}^j_{c,f} & (i > (i + r_f[j]) \bmod N'), \end{cases} \tag{14} \]

where $r_f[j]$ is a random value, $o^j_{c,f} = V_c[(N' - 1 - r_f[j]) \bmod N'] - V_c[N' - 1]$ and $\bar{o}^j_{c,f} = V_c[(N' - 1 - r_f[j]) \bmod N'] - V_c[0]$.

Theorem 4 $0 \le V'^j_{c,f}[i+1] - V'^j_{c,f}[i] \le 1$ for $i = 0, \ldots, N'-2$.

Proof The following equation is equivalent to Eq. 14.

\[ V'^j_{c,f}[i] = \begin{cases} V_c[(i - r_f[j]) \bmod N'] - o^j_{c,f} & ((i - r_f[j]) \bmod N' \le i) \\ V_c[(i - r_f[j]) \bmod N'] - \bar{o}^j_{c,f} & ((i - r_f[j]) \bmod N' > i). \end{cases} \tag{15} \]

$0 \le V_c[i+1] - V_c[i] \le 1$ holds for $i = 0, \ldots, N'-2$ from the definition of $V_c$.

If $(r_f[j]) \bmod N' = 0$, $V'^j_{c,f} = V_c$. Therefore, $0 \le V'^j_{c,f}[i+1] - V'^j_{c,f}[i] \le 1$ holds for $i = 0, \ldots, N'-2$.

If $(r_f[j]) \bmod N' \ne 0$ and $i = (r_f[j] - 1) \bmod N'$, $V'^j_{c,f}[i+1] - V'^j_{c,f}[i] = V_c[0] - o^j_{c,f} - V_c[N'-1] + \bar{o}^j_{c,f} = 0$. Let us consider the case when $(r_f[j]) \bmod N' \ne 0$ and $i \ne (r_f[j] - 1) \bmod N'$. We denote $i = (r_f[j] - 1 + a) \bmod N'$ ($0 < a < N'$). Then, $(i + 1 - r_f[j]) \bmod N' = (a) \bmod N'$ and $i + 1 = (r_f[j] - 1 + a) \bmod N' + 1$. Since $(a) \bmod N' - ((r_f[j] - 1 + a) \bmod N' + 1) = (a - 1) \bmod N' - (r_f[j] - 1 + a) \bmod N'$ holds because $0 < a$, the offset for $V'^j_{c,f}[i+1]$ and that for $V'^j_{c,f}[i]$ are the same and $V'^j_{c,f}[i+1] - V'^j_{c,f}[i] = V_c[(a) \bmod N'] - V_c[(a - 1) \bmod N']$. Therefore, $0 \le V'^j_{c,f}[i+1] - V'^j_{c,f}[i] \le 1$ holds for $i = 0, \ldots, N'-2$.

Let $Q^j_{c,f}$ be an integer vector of length $\lceil N'/M \rceil$ such that $Q^j_{c,f}[p] = V'^j_{c,f}[pM]$, and $R^j_{c,f}[i] = V'^j_{c,f}[i] - V'^j_{c,f}[M \lfloor i/M \rfloor]$. Note that $V'^j_{c,f}[i] = Q^j_{c,f}[\lfloor i/M \rfloor] + R^j_{c,f}[i]$, and $V_c[i]$ is obtained by adding an offset to $V'^j_{c,f}[i]$.

Since $R^j_{c,f}[i]$ is non-negative and at most $M - 1$, $\mathcal{B}$ generates shares $[[R^j_{c,f}[i]]]^m$. $\mathcal{B}$ also generates $[[Q^j_{c,f}[p]]]$, $[[o^j_{c,f}]]$, $[[\bar{o}^j_{c,f}]]$ and $[[r_f[j]]]$. The above shares are used for computing the lower bound $f$ of an interval. $\mathcal{B}$ generates shares for the upper bound $g$ in the same manner. Then $\mathcal{B}$ distributes all the shares to $P_0$ and $P_1$.

Search phase
$\mathcal{A}$ generates the table $q$ for a query string $w$ by Eq. 11. $\mathcal{A}$ generates shares of $q$ and distributes them to $P_0$ and $P_1$. The entire protocol is described in Protocol 5.

Security

Theorem 5 Protocol 5 is correct and secure in the semi-honest setting.

Proof Correctness and security of Protocol 5 are proved as follows.

Correctness. In Lines 5–6 of Protocol 5, $p_j = (f_j + r_f[j]) \bmod N'$ is computed. In Line 8, $\mathrm{CastUp}(R^j_{c,f}[p_j])$ is computed to avoid overflow in Line 9.

Complexities
In the DB preparation phase, shares of $R^j_{c,f}$ are generated with a parameter $m$ and shares of other values including $Q^j_{c,f}$ are generated with a parameter $n$. The length of $R^j_{c,f}$ is $N + 1$ and that of $Q^j_{c,f}$ is $\lceil (N+1)/M \rceil$. The total number of other values does not depend on $N$. The query length is $\ell$ and shares of $R^j_{c,f}$, $Q^j_{c,f}$, and other values are necessary for each query character. Therefore, the time complexity is $O(\ell N)$ and the communication complexity is $O(\ell N m + \ell \lceil N/M \rceil n)$.
In Line 9, shares of V [p ] are c,f j For Search phase, ADD , MULT , Reconst , CastUp and computed, which is obvious from the definition of Q c,f Comp are computed a few times for 2ℓ times in Line 4-16 j j+1 j j and R . In Line 11-13, [[f ]] , [[o ]] and [[o ¯ ]] are w[j] c,f w[j],f w[j],f and Equality is computed ℓ times in Line 17-19. Since selected. From the definition of V described in Eq. 14, each time and communication and round complexities of c,f j j it is obvious that V [f ] is obtained by V [p ] + o when these subprotocols are O(1), those of the entire protocol c j j c,f c,f j j become O(ℓ). f ≤ p and V [p ] + o¯ when f > p , and Line 14 j j j j j c,f c,f computes [[V [f ]]] . g is computed similarly to f. Since ref- c j Experiment erence to V achieved in Lines 4–16 is equivalent to eval- We implemented Protocol  3 (Secure LPM), Protocol  4 uating Eq. 1 and an equality check of f = g is conducted (Secure LMEM) and Protocol  5. For comparison, we in Lines 17–19, Protocol 5 is correct. also implemented baseline protocols (Baseline LPM and Baseline LMEM). Details of the baseline protocols are Security We only show sketch of the proof. All the mes- provided in Appendix 3. All protocols were implemented sages exchanged between P and P are shares except for 0 1 by Python 3.5.2. The dataset was created from Chromo - Line 6. In Line 6, reconstructed value p is randomized by some 1 of the human genome. We extracted substrings of r [j] in Line 5. Therefore, no information is leaked. f Nakagawa et al. Algorithms for Molecular Biology (2022) 17:9 Page 16 of 22 3 4 5 6 7 length N = 10 , 10 , 10 , 10 , and 10 for databases, and ℓ = 10 , 25, 50, 75, and 100 for queries. Share was run 5 5 with n = 16 and n = 32 for N < 10 and 10 ≤ N in the proposed protocols, and n = 1 for a Boolean share and n = 8 for an arithmetic share in the baseline protocols. We did not implement a data transfer module, and each protocol is implemented as a single program. 
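The bit lengths $n$ and $m$ passed to Share are the sizes of the rings that hold the additive shares. The following is a minimal sketch of that idea, assuming standard 2-out-of-2 additive sharing over $\mathbb{Z}_{2^n}$ and the sampled decomposition $V[k] = Q[\lfloor k/M \rfloor] + R[k]$ with $M = 2^m$ from the section on reducing the size of shares. It omits the rotation, the offsets and CastUp, and it reconstructs values in the clear only to check the arithmetic. The helper names (`share`, `reconst`, `rank_table`, `decompose`, `lookup`) and the form of the table (a character offset plus a running count over the BWT) are assumptions made for illustration, not the authors' implementation.

```python
import secrets


def share(x, bits):
    """Split x into two additive shares modulo 2**bits."""
    mod = 1 << bits
    s0 = secrets.randbelow(mod)
    return s0, (x - s0) % mod


def reconst(s0, s1, bits):
    """Recombine two additive shares modulo 2**bits."""
    return (s0 + s1) % (1 << bits)


def rank_table(bwt, c, offset):
    """Assumed form of V_c: offset + number of c in bwt[:i], for i = 0..len(bwt)."""
    table = [offset]
    for ch in bwt:
        table.append(table[-1] + (ch == c))
    return table


def decompose(table, m):
    """Sample every M = 2**m-th entry (Q) and keep the residuals (R)."""
    M = 1 << m
    Q = [table[p] for p in range(0, len(table), M)]
    R = [table[i] - table[M * (i // M)] for i in range(len(table))]
    return Q, R


n, m = 16, 2  # illustrative bit lengths only (M = 4)
# BWT string L from Appendix 1; offset 5 = number of 'A' and 'C' in the text.
V = rank_table("GGAAGCTTAA", "G", offset=5)
Q, R = decompose(V, m)

# Samples need the full n bits, residuals fit in m bits.
Q_shares = [share(q, n) for q in Q]
R_shares = [share(r, m) for r in R]


def lookup(k):
    """Rebuild V[k] = Q[k // M] + R[k] from the shares (in the clear, for checking)."""
    q = reconst(*Q_shares[k >> m], n)
    r = reconst(*R_shares[k], m)
    return q + r
```

Because consecutive table entries differ by at most 1, every residual stays below $M$, which is why the long vector can be shared with only $m$ bits per entry while the short sampled vector keeps $n$ bits.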
Since each protocol runs as a single program without a data transfer module, the search time of the protocols was measured as the time consumed by either one of $P_0$ and $P_1$. To assess the influence of communication in a realistic environment, we theoretically estimated the delays caused by network bandwidth and latency. We assume three environments: LAN (0.2 ms/10 Gbps), WAN1 (10 ms/100 Mbps), and WAN2 (50 ms/10 Mbps). During the run of the Search phase, we stored all the data that were transferred from $P_0$ to $P_1$ in a file and measured the file size as the actual communication size. Note that the communication is symmetric, so the data transfer size from $P_0$ to $P_1$ is equal to that from $P_1$ to $P_0$. Based on the data transfer size of $D$ bytes, we estimate the communication delay by $D/k + eT/1000$, where $k$ is the bandwidth, $e$ is the latency and $T$ is the number of communication rounds. All the protocols were run with a single thread on the same machine equipped with an Intel Xeon 2.2 GHz CPU and 256 GB memory.

We also tested the C++ implementation of [19], which is based on AHE. The algorithm for LPM in [17] for a string with $|\Sigma| \le 4$ (e.g., a genome sequence) is the same as [19]. Sudo et al. [19] is implemented as server-client software, and the client and the server were run with individual single threads on the same machine. Therefore, the results of [19] do not include delays caused by bandwidth limitation and latency, so we also estimated the delays based on the data transfer size and the number of communication rounds in the same manner. Each run of the program was terminated if the total runtime of all phases exceeded 20 h.

Fig. 5 Estimated time (actual search time on a local machine + estimated data-transfer time) for various N

Fig. 6 Estimated time (actual search time on a local machine + estimated data-transfer time) for various ℓ

Table 3 Offline time (Time), offline size (Size), DB preparation time (Time), DB preparation size (Size), Search time on a local machine (Time), Search communication size (Size), and estimated Search time for three environments: LAN (0.2 ms/10 Gbps), WAN1 (10 ms/100 Mbps), and WAN2 (50 ms/10 Mbps), for $N = 10^4$ (only for Baseline LMEM), $10^5$, $10^6$, $10^7$, and $\ell = 100$

| Protocol | N | Offline time | Offline size | DB prep. time | DB prep. size | Search time | Search size | LAN | WAN1 | WAN2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Secure LPM (proposed) | 10^5 | 0.166 | 0.013 | 123 | 305 | 0.141 | 0.010 | 0.181 | 2.162 | 10.249 |
| Secure LPM (proposed) | 10^6 | 0.141 | 0.013 | 1248 | 3051 | 0.113 | 0.010 | 0.153 | 2.134 | 10.221 |
| Secure LPM (proposed) | 10^7 | 0.150 | 0.013 | 12628 | 30517 | 0.126 | 0.010 | 0.167 | 2.147 | 10.234 |
| Secure LPM2 (proposed) | 10^5 | 2.318 | 0.162 | 123 | 77 | 2.888 | 0.040 | 3.028 | 9.911 | 38.020 |
| Secure LPM2 (proposed) | 10^6 | 2.317 | 0.162 | 1236 | 774 | 2.878 | 0.040 | 3.018 | 9.901 | 38.010 |
| Secure LPM2 (proposed) | 10^7 | 2.342 | 0.162 | 12387 | 7748 | 2.939 | 0.040 | 3.079 | 9.962 | 38.071 |
| [19] | 10^5 | – | – | – | – | 691 | 163 | 691 | 707 | 838 |
| [19] | 10^6 | – | – | – | – | 7817 | 517 | 7818 | 7863 | 8261 |
| [19] | 10^7 | – | – | – | – | 20 h< | – | – | – | – |
| Baseline (LPM) | 10^5 | 3995 | 184 | 0.146 | 0.095 | 13 | 122 | 13 | 24 | 118 |
| Baseline (LPM) | 10^6 | 38767 | 1841 | 1.522 | 0.954 | 164 | 1227 | 165 | 268 | 1196 |
| Baseline (LPM) | 10^7 | 20 h< | – | – | – | – | – | – | – | – |
| Secure LMEM (proposed) | 10^5 | 7.619 | 1.704 | 435 | 1068 | 4.817 | 0.999 | 5.577 | 42.900 | 195.654 |
| Secure LMEM (proposed) | 10^6 | 7.882 | 1.704 | 4467 | 10681 | 4.926 | 0.999 | 5.686 | 43.009 | 195.763 |
| Secure LMEM (proposed) | 10^7 | 8.457 | 1.704 | 46384 | 106811 | 5.740 | 0.999 | 6.501 | 43.824 | 196.578 |
| Baseline (LMEM) | 10^4 | 12747 | 611 | 0.015 | 0.010 | 46 | 407 | 46 | 80 | 389 |
| Baseline (LMEM) | 10^5 | 20 h< | – | – | – | – | – | – | – | – |

The size unit is MB and the time unit is s, except for the cells showing "20 h<". The last three columns give the estimated Search time on each network.

Comparison to baseline protocols

Table 3 shows the offline time and size, the DB preparation time and size, and the Search time and communication size for $N = 10^5$, $10^6$, $10^7$, and $\ell = 100$. It also shows the result of Baseline LMEM for $N = 10^4$, as the runs for $N > 10^4$ did not finish within 20 h. The Search times and communication sizes of Secure LPM and Secure LMEM are several orders of magnitude faster and smaller than those of Baseline LPM and Baseline LMEM. Since the round and communication complexities of the proposed protocols do not depend on $N$, their estimated Search time remains small even in WAN environments. Figure 5 shows the estimated Search time on WAN for $N = 10^3, 10^4, \ldots, 10^7$ and $\ell = 100$. The times of Secure LPM and Secure LMEM do not increase, while those of the baseline protocols increase linearly in $N$. Figure 6 shows the estimated Search time on WAN for $\ell = 10, 25, \ldots, 100$ and $N = 10^{[\ldots]}$. We cannot show the results of Baseline LMEM because none of its runs finished within the time limit. As shown in the graph, the time of Secure LPM increases linearly in $\ell$ and that of Baseline LPM increases proportionally to $\ell^2$, which are in good agreement with the theoretical complexities in Table 2. According to the graph, the time of Secure LMEM also increases linearly in $\ell$ though its time and communication complexities are $O(\ell^2)$. This is because the CPU times are much smaller than the delays caused by network latency, which are governed by the round complexity $O(\ell)$.

We have preliminary results for testing Secure LPM and Baseline LPM on an actual network (10 ms/100 Mbps). The results were 40 s for Secure LPM and 1739 s for Baseline LPM when $N = 10^{[\ldots]}$. Though both of the preliminary implementations have room for improvement in the performance of data transfer, the results also indicate that our protocol outperforms the baseline protocol and the previous study.

The time and size of Secure LPM and Secure LMEM are several orders of magnitude better than those of the baseline protocols for the offline phase, and vice versa for the DB preparation phase. The total time of the offline and DB preparation phases of our protocols is more than one order of magnitude shorter than that of the baseline protocols. For the total size of the offline and DB preparation phases, Secure LMEM was better than Baseline LMEM, but Baseline LPM was better than Secure LPM though the complexity is better for Secure LPM. This is because the majority of the shares were Boolean in the baseline protocols, while all of the shares were arithmetic in the proposed protocols.

Comparison to [19]

[19] is a two-party MPC based on AHE. Each homomorphic operation is time consuming, and the protocol has no offline or DB preparation phases. As shown in Table 3, the Search time of Secure LPM is four orders of magnitude faster than [19] for $N = 10^{[\ldots]}$. Since the time complexity of [19] includes a factor of $N$, the difference in Search time becomes greater as $N$ becomes large. Moreover, our protocols have a further advantage in communication for a query response when the network environment is poor, as the round complexities of [19] and our protocols are the same while [19] requires $O(\sqrt{N})$ communication size. The entire runtimes including all the phases are still six times faster for $N = 10^5$ and $N = 10^6$. We can compute LMEM by running [19] for all the positions in a query string, but this approach consumed 3406 s and 2.6 GByte of communication for $N = 10^{[\ldots]}$.

Result of the approach in section "Reducing size of shares in DB preparation phase"

We also implemented Protocol 5 (Secure LPM2) to investigate the trade-off between the reduction of the size of shares in the DB preparation phase and the increase in search time and communication overhead in the Search phase. We used the same programming language (i.e., Python 3.5.2) for the implementation and used the same datasets. Share was run with $n = 8$ when generating the arithmetic shares of $R$. For the generation of the rest of the arithmetic shares, Share was run with $n = 16$ for $N < 10^5$ and $n = 32$ for $10^5 \le N$ (i.e., $m = 8$, $n = 16$ ($N < 10^5$), and $n = 32$ ($10^5 \le N$) in the notation used in section "Reducing size of shares in DB preparation phase"). The results are shown in Table 3. The total size of shares in the DB preparation phase was 7.7 GB for Protocol 5 and 30.5 GB for Protocol 3, which is in good agreement with the theoretical complexities discussed in section "Reducing size of shares in DB preparation phase". The search time of Protocol 5 is around 2 s longer than that of Protocol 3. We consider that the increase in search time is mainly caused by using the rather costly subprotocols CastUp, Comp and MULT more times, which also increases the number of communication rounds. Despite the increase in search time, Protocol 5 is still more than two orders of magnitude faster than Baseline LPM and three orders of magnitude faster than [19], so we consider that Protocol 5 offers a reasonable trade-off between performance in the DB preparation phase and the Search phase.

Discussion

As clearly shown by the results, the Search time of the proposed protocols is very short. Considering the importance of query response time for real applications, it is realistic to reduce the Search time at the cost of the DB preparation time. Since the total times for the offline and DB preparation phases of the proposed protocols were significantly better than those of the well-designed baseline protocols, we consider the trade-off between Search and DB preparation times in our approach to be efficient. For further reduction of the DB preparation time, parallelizing the share generation is a feasible approach. Regarding the DB preparation phase, the data transfer between the server and the computing nodes is problematic when the number of queries and the length of the database are large. To mitigate the problem, we also proposed the approach that uses arithmetic shares of a shorter bit length, which offers a reasonable trade-off between the reduction of data size in the DB preparation phase and the increase in time and communication overhead in the Search phase. Another solution that potentially mitigates the problem is to use an AES-based random number generation that is similar to the technique used in [33]. To explain it briefly, when the server needs to distribute a share of $x$, (1) the server and $P_0$ generate the same randomness $r$ using a pre-shared key and a pseudorandom function, and (2) the server computes $x - r$ and sends it to $P_1$. Although $P_0$'s computation cost increases, we can remove the data transfer from the server to $P_0$. In our protocols, the generation of shares in the DB preparation phase cannot be outsourced because the shares depend on the database. Designing an efficient algorithm to outsource the share generation is an important open question.

Appendices

Appendix 1: Examples of a search with FM-Index and auxiliary data structures

Let us show examples of a search with FM-Index, LCP array, PSV and NSV. In addition to the data structures defined in section "Index structure for string search", we also define a string $F$ such that $F[i] = S[SA[i]]$. For the case of $S$ = ATGAATGCGA, the indices become $SA = (9, 3, 0, 4, 7, 8, 2, 6, 1, 5)$, $L$ = GGAAGCTTAA, and $F$ = AAAACGGGTT. Figure 7 illustrates the example of a backward search to find the longest suffix of the query (ATG) that matches the database, and Fig. 8 illustrates the search for MEMs with the query (CGC) by using LCP array, PSV, and NSV. As shown in the upper center panel of Fig. 8, the backward search with 'C' fails after finding the interval $[7, 8)$ that corresponds to GC. Since $LCP[8] \le LCP[7]$, the parent lcp-interval becomes $[PSV[7] = 5, NSV[7] = 8)$, which corresponds to 'G'. The match CG is then found by the backward search with 'C' from the parent lcp-interval.

Fig. 7 An example of search by FM-Index

Fig. 8 An example of search by FM-Index, LCP array, PSV and NSV

Appendix 2: Semi-honest security

Here, we recall the simulation-based security notion in the presence of semi-honest adversaries (for two-party computation), as in [32].

Definition 2 Let $f : (\{0, 1\}^*)^2 \to (\{0, 1\}^*)^2$ be a probabilistic 2-ary functionality and let $f_i(\vec{x})$ denote the $i$-th element of $f(\vec{x})$ for $\vec{x} = (x_0, x_1) \in (\{0, 1\}^*)^2$ and $i \in \{0, 1\}$; $f(\vec{x}) = (f_0(\vec{x}), f_1(\vec{x}))$. Let $\Pi$ be a 2-party protocol to compute the functionality $f$. The view of party $P_i$ for $i \in \{0, 1\}$ during an execution of $\Pi$ on input $\vec{x} = (x_0, x_1) \in (\{0, 1\}^*)^2$ where $|x_0| = |x_1|$, denoted by $\mathrm{View}^{\Pi}_i(\vec{x})$, consists of $(x_i, r_i, m_{i,1}, \ldots, m_{i,t})$, where $x_i$ represents $P_i$'s input, $r_i$ represents its internal random coins, and $m_{i,j}$ represents the $j$-th message that $P_i$ has received. The output of all parties after an execution of $\Pi$ on input $\vec{x}$ is denoted as $\mathrm{Output}^{\Pi}(\vec{x})$. Then, for each party $P_i$, we say that $\Pi$ privately computes $f$ in the presence of semi-honest corrupted party $P_i$ if there exists a probabilistic polynomial-time algorithm $\mathcal{S}$ such that

$$\{(\mathcal{S}(i, x_i, f_i(\vec{x})), f(\vec{x}))\} \equiv \{(\mathrm{View}^{\Pi}_i(\vec{x}), \mathrm{Output}^{\Pi}(\vec{x}))\},$$

where the symbol $\equiv$ means that the two probability distributions are statistically indistinguishable.

As described in [32], the composition theorem for the semi-honest model holds; that is, any protocol is privately computed as long as its subroutines are privately computed.

Appendix 3: Our secure baseline LPM and LMEM

In this section, we show our secure baseline LCP and LMEM based on secret sharing. We explain how to construct LCP, since we can obtain LMEM by executing LCP in parallel for all positions in the query. Note that $\vec{x} = (x_1, x_2, \cdots)$, $x_i$ denotes the $i$-th element of $\vec{x}$, $[[t]] = ([[t]]_0, [[t]]_1)$, and $(|\vec{x}|, |\vec{y}|) = (L, N)$. Here, we assume $N > L$. When $[[\vec{x}]] = ([[x_1]], [[x_2]], \cdots, [[x_p]])$, $[[\vec{x}]] \gg 1$ means $([[0]], [[x_1]], \cdots, [[x_{p-1}]])$. In our protocol, we use two subprotocols as follows:

• All-AND takes a list $[[\vec{t}]]$ (with $p$ Boolean shares) as input and outputs $[[t_1 \wedge \cdots \wedge t_p]]$. We can compute this function with $\lceil \log p \rceil$ communication rounds (by appropriate parallelization) and $O(p)$-bit data transfer.
• All-OR takes a list $[[\vec{u}]]$ (with $p$ Boolean shares) as input and outputs $[[u_1 \vee \cdots \vee u_p]]$. We can compute this function with $\lceil \log p \rceil$ communication rounds (by appropriate parallelization) and $O(p)$-bit data transfer.

Our protocol is as in Protocol A1. In the following, we explain the details of our baseline longest common prefix search protocol using an example with the strings $\vec{x}$ = "TGA" and $\vec{y}$ = "ATTGC". In this example, $w = 2$ since "TG" exists in $\vec{y}$, but "TGA" does not. For better understanding, we introduce a more straightforward approach and analyze its efficiency before explaining our baseline protocol. In the straightforward approach, we securely check whether the first letter of $\vec{x}$ (i.e., "T") exists in $\vec{y}$ or not. Next, we check every pattern up to the second letter of $\vec{x}$ (i.e., "TG") for a match anywhere in $\vec{y}$. We also execute the same operations for up to the third letter of $\vec{x}$ (i.e., "TGA"). In these processes, we need to execute "check if the characters match", "check if all the characters match", and "check if at least one of the perfect matches exists". For these operations, we need to use $O(N)$, $O(NL)$, and $O(N)$ secure AND gates, respectively. Since we execute these operations for all $L$ candidates, the numbers of AND gates we need are $O(NL)$, $O(NL^2)$, and $O(NL)$, respectively. In these operations, we do not need to recompute the letter matches each time since the string is fixed. In our baseline protocol, therefore, we compute beforehand whether each letter matches and reuse the results. Although this letter check can be done with $O(NL)$ AND gates, our baseline still requires $O(NL^2)$ AND gates. Although it may be possible to reduce the number of AND gates by increasing other costs (e.g., communication rounds), it will not be easy with this strategy to construct a protocol with an $N$-independent online cost like the proposed one.

Why the offline cost of our baseline is so significant: In secure computation, it is impossible in principle to change the behavior depending on intermediate computation results. In other words, we are always forced to perform the worst-case computation. In the previous example, consider checking whether the first letter of $\vec{x}$ (i.e., "T") matches any of the letters in $\vec{y}$. If this is done in plaintext, the moment we find "T" in the second letter of $\vec{y}$, we do not have to examine the rest of the letters in $\vec{y}$. In secure computation, however, we have to check everything, including the rest, since we cannot learn that a match has already been found. In addition, consider checking the match for the first two letters of $\vec{x}$ (i.e., "TG") against the first two letters of $\vec{y}$ (i.e., "AT"). In this case, the moment we see "A", we can decide there is no match and terminate the process in plaintext computation. In secure computation, however, this is impossible. As we see above, we are always forced to pay the worst-case computing cost in secure computation. Note that offline costs for secure computation are linear in the number of AND gates. We need $O(NL^2)$ offline cost in our baseline (and straightforward) protocol, and $N$ is large in our setting. This is why the offline cost of our baseline protocol is so large. Our proposed protocol successfully avoids this problem by developing a new secure primitive and combining it with an appropriate data structure.

Acknowledgements
This work is partially supported by JST CREST Grant Number JPMJCR19F6, MEXT/JSPS KAKENHI grant number 19K12209 and 21H04871/21H05052. KS thanks Prof. Kunihiko Sadakane and Mr. Tomoki Uchiyama for giving the important comments for improving the paper.

Authors' contributions
KS designed the proposed protocols with the help of SO and YN, and organized the study. SO implemented a secure multi-party computation library equipped with all the sub-protocols necessary for this study and designed the baseline protocols. YN implemented the proposed and baseline protocols and conducted the experiments. KS and SO mainly wrote the manuscript. All the authors contributed to the final form of the manuscript. All authors read and approved the final manuscript.

Declarations

Competing interests
The authors declare that they have no competing interests.

Author details
1 Department of Computer Science and Engineering, Waseda University, Tokyo, Japan. 2 Self-employment, Tokyo, Japan. 3 National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.

Received: 19 November 2021 Accepted: 1 March 2022

References
1. Fiume M, Cupak M, Keenan S, Rambla J, de la Torre S, Dyke SO, Brookes AJ, Carey K, Lloyd D, Goodhand P, et al. Federated discovery and sharing of genomic data using beacons. Nat Biotechnol. 2019;37(3):220–4.
2. Philippakis AA, Azzariti DR, Beltran S, Brookes AJ, Brownstein CA, Brudno M, Brunner HG, Buske OJ, Carey K, Doll C, et al. The matchmaker exchange: a platform for rare disease gene discovery. Hum Mutat. 2015;36(10):915–21.
3. Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nat Rev Genet. 2014;15(6):409–21.
4. Aziz MMA, Sadat MN, Alhadidi D, Wang S, Jiang X, Brown CL, Mohammed N. Privacy-preserving techniques of genomic data: a survey. Briefings Bioinform. 2019;20(3):887–95.
5. Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux J-P, Malin BA, Wang X. Privacy in the genomic era. ACM Comput Surv. 2015;48(1):1–44.
6. Jha S, Kruger L, Shmatikov V. Towards practical privacy for genomic computation. In: Proc. of IEEE S&P 2008; 2008, p. 216–230.
7. Cheon JH, Kim M, Lauter KE. Homomorphic computation of edit distance. In: Proc. of FC 2015; 2015, p. 194–212.
8. Nuida K, Ohata S, Mitsunari S, Attrapadung N. Arbitrary univariate function evaluation and re-encryption protocols over lifted-elgamal type ciphertexts. IACR Cryptology ePrint Archive. 2019;2019:1233.
9. Huang Y, Evans D, Katz J, Malka L. Faster secure two-party computation using garbled circuits. In: Proc. of USENIX 2011; 2011.
10. Wang XS, Huang Y, Zhao Y, Tang H, Wang X, Bu D. Efficient genome-wide, privacy-preserving similar patient query based on private edit distance. In: Proc. of CCS 2015; 2015, p. 492–503.
11. Zhu R, Huang Y. Efficient and precise secure generalized edit distance and beyond. IEEE Transactions on Dependable and Secure Computing. 2020;1–1.
12. Cheng K, Hou Y, Wang L. Secure similar sequence query on outsourced genomic data. In: Proc. of AsiaCCS 2018; 2018, p. 237–251.
13. Asharov G, Halevi S, Lindell Y, Rabin T. Privacy-preserving search of similar patients in genomic data. PoPETs. 2018;2018(4):104–24.
14. Schneider T, Tkachenko O. EPISODE: efficient privacy-preserving similar sequence queries on outsourced genomic databases. In: Proc. of AsiaCCS 2019; 2019, p. 315–327.
15. Ohata S, Nuida K. Communication-efficient (client-aided) secure two-party protocols and its application. In: Proc. of FC 2020; 2020, p. 369–385.
16. Baldi P, Baronio R, De Cristofaro E, Gasti P, Tsudik G. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. In: Proc. of CCS 2011; 2011, p. 691–702.
17. Shimizu K, Nuida K, Rätsch G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics. 2016;32(11):1652–61.
18. Ishimaki Y, Imabayashi H, Shimizu K, Yamana H. Privacy-preserving string search for genome sequences with fhe bootstrapping optimization. In: Proc. of IEEE Big Data 2016; 2016, p. 3989–3991.
19. Sudo H, Jimbo M, Nuida K, Shimizu K. Secure wavelet matrix: alphabet-friendly privacy-preserving string search for bioinformatics. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(5):1675–84.
20. Sotiraki K, Ghosh E, Chen H. Privately computing set-maximal matches in genomic data. BMC Med Genom. 2020;13(7):1–8.
21. Mahdi MSR, Al Aziz MM, Mohammed N, Jiang X. Privacy-preserving string search on encrypted genomic data using a generalized suffix tree. Inform Med Unlocked. 2021;23:100525.
22. Chen Y, Peng B, Wang X, Tang H. Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds. In: Proc. of NDSS 2012;
23. Popic V, Batzoglou S. A hybrid cloud read aligner based on minhash and kmer voting that preserves privacy. Nat Commun. 2017;8(1):1–7.
24. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proc. of FOCS 2000; 2000, p. 390–398.
25. Durbin R. Efficient haplotype matching and storage using the positional burrows-wheeler transform (pbwt). Bioinformatics. 2014;30(9):1266–72.
26. Yasuda M, Shimoyama T, Kogure J, Yokoyama K, Koshiba T. Secure pattern matching using somewhat homomorphic encryption. In: Juels A, Parno B, editors. Proc. of CCSW'13; 2013, p. 65–76.
27. Fischer J, Mäkinen V, Navarro G. An(other) entropy-bounded compressed suffix tree. In: Proc. of CPM 2008; 2008, p. 152–165.
28. Shamir A. How to share a secret. Commun ACM. 1979;22(11):612–3.
29. Beaver D. Efficient multiparty protocols using circuit randomization. In: Proc. of CRYPTO 1991; 1991, p. 420–432.
30. Mohassel P, Orobets O, Riva B. Efficient server-aided 2pc for mobile phones. PoPETs. 2016;2016(2):82–99.
31. Mohassel P, Zhang Y. Secureml: a system for scalable privacy-preserving machine learning. In: Proc. of IEEE S&P 2017; 2017, p. 19–38.
32. Goldreich O. The foundations of cryptography. Basic applications, vol. 2. Cambridge: Cambridge University Press; 2004.
33. Araki T, Furukawa J, Lindell Y, Nof A, Ohara K. High-throughput semi-honest secure three-party computation with an honest majority. In: Proc. of CCS 2016; 2016, p. 805–817.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
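The backward search walked through in Appendix 1 can be reproduced with a short plaintext sketch. It rebuilds $SA$, $L$ and $F$ for $S$ = ATGAATGCGA and runs the backward search with half-open intervals $[f, g)$, as in the appendix. Everything here is non-private and only illustrative: the function names (`suffix_array`, `bwt_from_sa`, `build_c`, `backward_search`) are hypothetical, the suffix array is built by naive sorting, and rank is computed by counting rather than with a precomputed table. Because the appendix example has no terminator symbol, $L[i]$ for the suffix starting at position 0 wraps around to the last character of $S$; a practical index would typically append a sentinel.

```python
def suffix_array(s):
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_sa(s, sa):
    """L[i] = character preceding suffix SA[i] (s[-1] wraps for SA[i] == 0)."""
    return "".join(s[i - 1] for i in sa)


def build_c(s):
    """C[c] = number of characters in s that are smaller than c."""
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def backward_search(bwt, c_table, query):
    """Extend the match from the last query character towards the first.

    Returns (f, g, matched): the half-open SA interval [f, g) of the longest
    matching suffix of the query, and the length of that suffix.
    """
    f, g, matched = 0, len(bwt), 0
    for ch in reversed(query):
        if ch not in c_table:
            break
        nf = c_table[ch] + bwt[:f].count(ch)  # C[c] + rank_c(L, f)
        ng = c_table[ch] + bwt[:g].count(ch)  # C[c] + rank_c(L, g)
        if nf >= ng:  # empty interval: this character cannot extend the match
            break
        f, g, matched = nf, ng, matched + 1
    return f, g, matched


S = "ATGAATGCGA"
SA = suffix_array(S)
L = bwt_from_sa(S, SA)
F = "".join(S[i] for i in SA)
C = build_c(S)

atg = backward_search(L, C, "ATG")   # full match
cgc = backward_search(L, C, "CGC")   # stops after matching the suffix "GC"
```

For the query ATG the search ends with the interval $[2, 4)$, whose suffix array entries 0 and 4 are the two occurrences of ATG in $S$. For CGC it stops at $[7, 8)$ with two characters matched, which is the point where the appendix falls back to the parent lcp-interval.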

Journal

Algorithms for Molecular Biology, Springer Journals

Published: Apr 26, 2022

Keywords: Private genome sequence search; Secure multiparty computation; Secret sharing; FM-index; Suffix array; LCP array; Maximal exact match
