Efficient Skyline Computation on Massive Incomplete Data

Incomplete skyline query is an important operation to filter out Pareto-optimal tuples on incomplete data. It is harder than the skyline query on complete data due to intransitivity and cyclic dominance. Our analysis shows that the existing algorithms cannot process the incomplete skyline on massive data efficiently. This paper proposes a novel table-scan-based algorithm, TSI, to compute the incomplete skyline on massive data with high efficiency. TSI resolves the issues of intransitivity and cyclic dominance in two separate stages. In stage 1, TSI computes the candidate tuples by a sequential scan on the table; the tuples dominated by others are discarded directly. In stage 2, TSI refines the candidates by another sequential scan. A pruning operation is devised to reduce the execution cost of TSI: with the auxiliary structures, TSI can skip the majority of the tuples in stage 1 without actually retrieving them. Extensive experiments conducted on synthetic and real-life data sets show that TSI can compute the skyline on massive incomplete data efficiently.

Keywords: Massive data · TSI · Incomplete skyline · Pruning operation

* Xixian Han
xxhan1981@163.com
School of Computer Science and Technology, Harbin Institute of Technology, No. 92, Xidazhi Street, Harbin, Heilongjiang, China

1 Introduction

The skyline operator filters out a set of interesting tuples from a potentially huge data set. Given the specified skyline criteria, a tuple p is said to dominate another tuple q if p is strictly better than q in at least one attribute, and no worse than q in the other attributes. The skyline query discovers all tuples which are not dominated by any other tuple.

Due to its practical importance, the skyline query has received extensive attention [2, 3, 5, 6, 8, 9, 14–17, 19, 20]. However, the overwhelming majority of the existing algorithms only consider data sets with complete attributes, i.e., all the attributes of every tuple are available. In real-life applications, because of reasons such as delivery failure or deliberate concealment, the data sets we encounter are often incomplete, i.e., some attributes of tuples are unknown [13]. On incomplete data, the existing skyline algorithms cannot be applied directly, since all of them assume the transitivity of the dominance relationship.

On complete data, the transitivity rule states that if p_1 dominates p_2 and p_2 dominates p_3, then obviously p_1 dominates p_3. Transitivity is the basis of the efficiency of the existing skyline algorithms, which utilize indexing, partitioning and pre-sorting operations. On incomplete data, some attributes of tuples are missing, the traditional definition of dominance does not hold any more, and the dominance relationship is re-defined. Given the skyline criteria and two tuples p and q on incomplete data, let C be the common complete attributes of p and q among the skyline criteria; p dominates q if p is no worse than q on every attribute in C and strictly better than q in at least one attribute in C. Under this definition, transitivity does not hold on incomplete data. As illustrated in Fig. 1, the specified skyline criteria are {A_1, A_2, A_3}. In the table, p_1 dominates p_2 since the common complete attribute of p_1 and p_2 among the skyline criteria is A_1 and p_1.A_1 < p_2.A_1. Similarly, p_2 dominates p_3. But p_1 does not dominate p_3, so transitivity does not hold. Besides, it is found that p_3 dominates p_1; thus, on incomplete data we may also face the issue of cyclic dominance. These two issues, intransitivity and cyclic dominance, make the processing of the skyline on incomplete data different from that on complete data.
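The re-defined dominance relationship can be made concrete with a short sketch. This is a minimal illustration, not code from the paper; None marks a missing attribute, and the three tuple values below are hypothetical, chosen only to exhibit the cyclic pattern discussed for Fig. 1.

```python
def dominates(p, q):
    """Dominance on incomplete data: compare p and q only on their common
    complete attributes C; p dominates q if p is no worse on every attribute
    in C and strictly better on at least one (smaller values preferred)."""
    common = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not common:
        return False  # no common complete attribute: the tuples are incomparable
    return all(a <= b for a, b in common) and any(a < b for a, b in common)

# Hypothetical tuples over criteria (A1, A2, A3) showing cyclic dominance:
p1 = (1, None, 5)
p2 = (2, 3, None)
p3 = (None, 4, 2)
```

Here p1 dominates p2 (via A1), p2 dominates p3 (via A2), and p3 dominates p1 (via A3), so dominance is both intransitive and cyclic.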
Fig. 1 Data set of example table

The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms [7], sorted-based algorithms [1], and bucket-based algorithms [7, 10]. The replace-based algorithms first replace the incomplete attributes with a specific value, then compute the traditional skyline on the transformed data, and finally refine the candidates to compute the results by pairwise comparison. Normally, the number of candidates on massive data is large and the pairwise comparison is significantly expensive. Sorted-based algorithms utilize selected tuples with possibly high dominance capability, obtained via pre-sorted structures, one by one to prune the non-skyline tuples. They usually perform many passes of scan on the table and incur a high I/O cost on massive data. Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and the transitivity rule holds; they then compute the local skyline results in every bucket, and finally merge the local skyline results to obtain the final results. In incomplete skyline computation, the skyline criteria size is usually greater than that on complete data due to cyclic dominance, so the number of buckets involved in bucket-based algorithms is often large and the number of local skyline results is relatively great. The computation and merging of the local skyline results often incur high computation cost and I/O cost on massive data. To sum up, the existing algorithms cannot process the incomplete skyline query on massive data efficiently.

Based on the discussion above, this paper proposes the TSI algorithm (Table-scan-based Skyline over Incomplete data) to compute the skyline results on massive incomplete data with high efficiency. In order to reduce the computation cost and I/O cost, the execution of TSI consists of two stages. In stage 1, TSI performs a sequential scan on the table and maintains candidates in memory. For each tuple t retrieved in stage 1, any candidate dominated by t is removed, and if t is not dominated by any candidate, t is added to the candidate set. In stage 1, TSI does not consider the intransitivity and cyclic dominance of incomplete skyline computation, but just discards the tuples which are definitely not final results. In stage 2, another sequential scan is executed to refine the candidates: for each retrieved tuple t, any candidate dominated by t is discarded. When stage 2 terminates, the candidates in memory are the incomplete skyline results. It is found that the cost of stage 1 dominates the overall cost of TSI, so a pruning operation is devised to skip tuples in stage 1. Useful data structures are pre-constructed and used to check whether a tuple is dominated before retrieving it. Extensive experiments are conducted on synthetic and real-life data sets. The experimental results show that the pruning operation can skip the overwhelming majority of the tuples in stage 1 and that TSI outperforms the existing algorithms significantly.

The contributions of this paper are listed as follows:

– This paper proposes a novel two-stage table-scan-based TSI algorithm to process the skyline query on massive incomplete data efficiently.
– Two novel data structures are designed to maintain the information of tuples and to obtain pruning tuples with strong dominance capability.
– This paper devises efficient pruning operations to reduce the execution cost of TSI, which directly skip the tuples dominated by some tuple before actually retrieving them.
– The experimental results show that TSI can compute the incomplete skyline on massive data efficiently.

The rest of the paper is organized as follows. The related work is surveyed in Sect. 2, followed by preliminaries in Sect. 3. The existing algorithms are analyzed in Sect. 4. A baseline algorithm is developed in Sect. 5. Section 6 introduces the TSI algorithm. The performance evaluation is provided in Sect. 7. Section 8 concludes the paper.

2 Related Work

Since [2] first introduced the skyline operator into the database environment, the skyline has been studied extensively by database researchers [2, 3, 5, 6, 8, 11, 14–17, 20]. However, most of the existing skyline algorithms only consider complete data, and they utilize the transitivity of the dominance relationship to acquire significant pruning power. They cannot be used directly for the skyline query on incomplete data, where the dominance relationship is intransitive and cyclic. In the rest of this section, we survey the skyline algorithms on incomplete data. The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms, sorted-based algorithms, and bucket-based algorithms.

2.1 Replace-Based Algorithms

Khalefa et al. [7] propose a set of skyline algorithms for incomplete data. The first two algorithms, replacement and bucket, are extensions of the existing skyline algorithms to accommodate incomplete data. The replacement algorithm first replaces the incomplete attributes by a special value to transform the incomplete data into complete data. A traditional skyline algorithm can then be used to compute the skyline results SKY_comp on the transformed complete data, which is a superset of the skyline results on the incomplete data. Finally, the tuples in SKY_comp are transformed back into their original incomplete form, and the exhaustive pairwise comparison between all tuples in SKY_comp is performed to compute the final results. The bucket algorithm first divides all the tuples of the incomplete data into different buckets so that all tuples in the same bucket have the same bitmap representation. The dominance relationship within the same bucket is then transitive, since the tuples there have the same bitmap representation. The traditional skyline algorithm is utilized to compute the skyline results within each bucket, which are called the local skyline. The local skyline results of all buckets are merged as the candidate skyline. The exhaustive pairwise comparison is performed on the candidate skyline to compute the query answer. The ISkyline algorithm employs two new concepts, virtual points and shadow skylines, to improve the bucket algorithm. The execution of ISkyline consists of three phases. In phase I, each newly retrieved tuple is compared against the local skyline and the virtual points to determine whether the tuple should be (a) stored in the local skyline, (b) stored in the shadow skyline, or (c) discarded directly. In phase II, the tuples newly inserted into the local skyline are compared with the current candidate skyline; ISkyline updates the candidate skyline and the virtual points correspondingly. Every time t tuples are kept in the candidate skyline, ISkyline enters phase III, updates the global skyline, and clears the current candidate skyline. The similar processing continues until the end of the input is reached and ISkyline returns the global skyline.

The replace-based algorithms first replace the incomplete attributes with a specific value, then compute the traditional skyline on the transformed data, and finally refine the candidates to compute the results by pairwise comparison. Normally, the number of candidates on massive data is large and the pairwise comparison is significantly expensive.

2.2 Sorted-Based Algorithms

Bharuka et al. [1] propose a sort-based skyline algorithm SIDS to evaluate the skyline over incomplete data. SIDS first sorts the incomplete data D in non-descending order on each attribute. Let D_i be the sorted list with respect to the ith attribute. Only the ids of the tuples are kept in D_i, and the ids of the tuples whose ith attributes are incomplete are not stored. SIDS performs a round-robin retrieval on the sorted lists. For each retrieved data point p, if it has not been retrieved before, it is compared with each data point q in the candidate set, which is initialized to be the whole incomplete data set. If p and q have already been compared, the next data point in the candidate set is retrieved and processed. Otherwise, if p dominates q, q is removed from the candidate set; and if q dominates p and p is in the candidate set, p is removed from the candidate set also. If the number of times p has been retrieved during the round-robin retrieval is equal to the number of its complete attributes and p has not been pruned yet, p can be reported as one of the skyline results. SIDS terminates when the candidate set becomes empty or all points in the sorted lists have been processed at least once.

Sorted-based algorithms utilize selected tuples with possibly high dominance capability, obtained via pre-sorted structures, one by one to prune the non-skyline tuples. They usually perform many passes of scan on the table and incur a high I/O cost on massive data.

2.3 Bucket-Based Algorithms

Lee et al. [10] propose a sorting-based SOBA algorithm to optimize the bucket algorithm. Similar to the bucket algorithm, SOBA also first divides the incomplete data into a set of buckets according to their bitmap representation, then computes the local skyline of the tuples in each bucket, and finally performs the pairwise comparison for the skyline candidates (the collection of all local skylines). SOBA uses two techniques to reduce the dominance tests for the skyline candidates. The first technique is to sort the buckets in ascending order of the decimal numbers of their bitmap representations; this can identify the non-skyline points as early as possible. The second technique is to rearrange the order of the tuples within each bucket. By sorting the tuples in ascending order of the sum of their complete attributes, the tuples accessed earlier have a higher probability of dominating other tuples, which helps reduce the number of dominance tests further.

Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and the transitivity rule holds; they then compute the local skyline results in every bucket, and finally merge the local skyline results to obtain the final results. In incomplete skyline computation, the skyline criteria size is usually greater than that on complete data due to cyclic dominance, so the number of buckets involved in bucket-based algorithms is often large and the number of local skyline results is relatively great. The computation and merging of the local skyline results often incur high computation cost and I/O cost on massive data.

For the algorithms mentioned above, the dominance over incomplete data is defined on the common complete attributes. There are also other definitions of dominance over incomplete data. Zhang et al. [21] propose a general framework to extend the skyline query. For each attribute, they first retrieve the probability distribution function of the values in the attribute from all the non-missing values on that attribute, and then convert incomplete tuples to complete data by estimating all missing attributes. A mapping dominance is then defined on the converted data. Zhang et al. [18] propose PISkyline to compute the probabilistic skyline on incomplete data. It is considered in [18] that each missing attribute value can be described by a probability density function. The probability is used to measure the preference condition between missing values and valid values. Then, the probability of a tuple being a skyline tuple can be computed. PISkyline returns the K tuples with the highest skyline probability.

Discussion. Throughout this paper, we use the definition of dominance over incomplete data of [7]. Firstly, this dominance notion is commonly used in most skyline algorithms over incomplete data. Secondly, the estimation of the incomplete attribute values may be undesirable in some cases. Therefore, we do not guess the incomplete attributes in this paper and do not consider such algorithms further.

In this paper, we consider the skyline over massive incomplete data, i.e., the data set cannot be kept in memory entirely. It is found that the existing algorithms, including [1, 7, 10], all assume in-memory processing of the data; their performance will be seriously degraded on massive data. Since the cardinality of the skyline query increases exponentially with respect to the size of the skyline criteria [4], the replacement algorithm often generates a large number of skyline candidates, and the pairwise comparison among the candidates incurs a prohibitively expensive cost. Bucket-based algorithms, such as the bucket algorithm, ISkyline and SOBA, have the problem that they have to divide the data set into different buckets. Given the size m of the skyline criteria, the number of buckets can be as high as 2^m − 1; this will cause a serious performance issue when m is not small. SIDS utilizes one selected tuple to prune the non-skyline tuples in the candidate set, and this incurs a pass of sequential scan on the data. Thus, it requires many passes of scan on the data to finish its execution, which will incur a high I/O cost on massive data.
3 Preliminaries

Given an incomplete table T of n tuples with attributes A_1, …, A_M, some attributes of the tuples in T are incomplete. The attributes in T are considered to be of numerical type; let A_1, …, A_m be the specified skyline criteria. Throughout the paper, it is assumed that smaller attribute values are preferred. In this paper, the attributes with known values are called complete attributes, while the attributes with unknown values are called incomplete attributes. ∀t ∈ T, t has at least one complete attribute among A_1, A_2, …, A_m, while each other attribute has a probability p (0 < p ≤ 1) of being incomplete. The frequently used symbols in this paper are listed in Table 1.

Table 1 Symbols description

  T         An incomplete table
  T_part    The current part of T loaded in memory
  t         A tuple in T
  C         The common complete attribute(s)
  PI_X      The positional index of tuple X
  S         The size of the allocated memory for storing tuples of T each time
  S_cnd     A set maintaining candidate tuples
  SL_i      The sorted list built for the ith attribute
  MCR       The bit-vector representing the membership checking result of SL_i
  RIA       The bit-vector representing whether an attribute is complete
  S_c       The set of the complete attributes of t
  NUM       The number of complete attributes of each tuple

The dominance over incomplete data is given in Definition 1. The incomplete skyline returns the tuples in T which are not dominated by any other tuple.

Definition 1 (Dominance over incomplete data) Given table T and skyline criteria A_1, …, A_m, ∀t_1, t_2 ∈ T, let C be their common complete attributes among the skyline criteria; t_1 dominates t_2 (denoted by t_1 ≺ t_2) if ∀A ∈ C, t_1.A ≤ t_2.A, and ∃A ∈ C, t_1.A < t_2.A.

Definition 2 (Positional index) ∀t ∈ T, its positional index (PI) is a if t is the ath tuple in T.

The positional index is given in Definition 2. We denote by T(a) the tuple with PI = a, by T(a, …, b) (a ≤ b) the tuples in T whose PIs are between a and b, and by T(a, …, b).A_i the set of values of attribute A_i in T(a, …, b).

4 The Analysis for the Existing Algorithms

The existing skyline algorithms over incomplete data can be classified into three types: replacement-based algorithms, bucket-based algorithms, and sort-based algorithms. As discussed in Sect. 2, replacement-based algorithms usually generate too many skyline candidates, and sort-based algorithms often need to perform many passes over the table before returning the results; both incur high computation cost and I/O cost on massive data. In the remaining part of this section, we analyze the performance of bucket-based algorithms.

Given table T and the skyline criteria {A_1, A_2, …, A_m}, ∀t ∈ T, t can be encoded by an m-bit vector t.B. ∀i, 1 ≤ i ≤ m, if t.A_i is a complete attribute, the ith bit of t.B is 1 (denoted by t.B(i) = 1); otherwise, the ith bit of t.B is 0 (t.B(i) = 0). Note that the most significant bit is the first bit. A bucket-based algorithm divides the tuples in T according to their encoded vectors. Therefore, the tuples in the same bucket share the same vector, and the transitive dominance relation holds among the tuples in a bucket. A traditional skyline algorithm can be utilized to compute the local skyline within the bucket. Any tuple t_1 dominated by a tuple t_2 in the same bucket can be discarded directly, since it cannot be a skyline result and any tuple which can be dominated by t_1 can naturally be dominated by t_2. Of course, there are other techniques to optimize the pairwise comparison among skyline candidates [10].

As illustrated in Fig. 2, for the analysis, we assume that the bitmap encoding of the buckets consists of m cases with equal likelihood: ∃i, 1 ≤ i ≤ m, the value of A_i must be known, and each other attribute can be unknown with probability p independently.

Fig. 2 Different cases of bitmap encoding

Given the m-bit vector b = (b_1 b_2 … b_m) of some bucket, let Cnt1(b) = r, where Cnt1 is a function returning the number of 1 bits in a bit-vector; in this paper, 1 ≤ r ≤ m. The bit-vector b can occur in r of the cases. In each case, the probability of generating b is (1 − p)^(r−1) × p^(m−r), i.e., besides the selected complete attribute, there are (r − 1) complete attributes and (m − r) incomplete attributes. Therefore, the probability pr_b of generating b among the overall cases is pr_b = (r/m) × (1 − p)^(r−1) × p^(m−r). The number n_b of tuples in T which have the encoded bit-vector b is n_b = n × pr_b.

Theoretically, a bucket-based algorithm can split T into all 2^m − 1 buckets. The size of the skyline criteria on incomplete data is usually greater than that on complete data due to cyclic dominance; this can be verified in the existing skyline algorithms on incomplete data [1, 7, 10]. Then, the number of buckets is not small. For example, given m = 20, there are possibly 1,048,575 buckets. A bucket-based algorithm then has to maintain a large number of buckets. For one thing, this increases the management burden of the file system; for another, this makes each bucket maintain a relatively small number of tuples under a skyline criteria size that is not small.

The size of the skyline candidates for the pairwise comparison, i.e., the local skylines of all buckets, is size_sc = Σ_{b=(00…01)}^{(11…11)} |SKY_b|, where SKY_b is the set of skyline tuples in bucket b. Under the independence assumption, the number of local skyline tuples in the bucket with encoded bit-vector b can be estimated as ((ln n_b) + γ)^(r−1)/(r − 1)! [4], where γ ≈ 0.57721 is the Euler–Mascheroni constant. But in this paper, it is found that this cardinality estimation is much lower than the actual cardinality when m is relatively large. Of course, other cardinality estimation methods [12, 22] can be used in such a case. For simplicity, we still use the cardinality estimation of [4], since it can still provide useful insight for our analysis. Given n = 10, m = 20 and p = 0.5, the total number of all local skyline results is 7,641,060 even by use of the cardinality formula mentioned above, which is much lower than the actual value.
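The bucket-count analysis above can be checked numerically. The sketch below is our reconstruction (the exact expressions for pr_b and the skyline-cardinality estimate are rebuilt from the garbled text, so treat them as assumptions); it sums the per-bucket estimate over the C(m, r) buckets whose bitmaps have r one-bits.

```python
from math import comb, factorial, log

GAMMA = 0.57721  # Euler–Mascheroni constant

def expected_local_skyline_total(n, m, p):
    """Estimated total number of local skyline tuples over all buckets:
    a vector b with r one-bits occurs with probability
    pr_b = (r/m) * (1-p)**(r-1) * p**(m-r), holds n_b = n * pr_b tuples,
    and contributes about (ln(n_b) + GAMMA)**(r-1) / (r-1)! skyline tuples."""
    total = 0.0
    for r in range(1, m + 1):
        pr_b = (r / m) * (1 - p) ** (r - 1) * p ** (m - r)
        n_b = n * pr_b
        if n_b < 1:
            continue  # such buckets are essentially empty
        est = (log(n_b) + GAMMA) ** (r - 1) / factorial(r - 1)
        total += comb(m, r) * est  # comb(m, r) buckets share this r
    return total
```

As a sanity check, the reconstructed pr_b is self-consistent: summing comb(m, r) * pr_b over r gives 1, and with a single criterion (m = 1) the estimate collapses to one skyline tuple, as expected for one-dimensional data.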
The number of local skyline results, which is used to perform the pairwise comparisons, is still too high. To sum up, the existing skyline algorithms on massive incomplete data all have performance issues.

5 Baseline Algorithm

The existing algorithms, as mentioned in Sects. 2 and 4, have rather poor performance and very long execution times on massive incomplete data. Therefore, this section first devises a baseline algorithm BA which can be used as a benchmark for the algorithm proposed in this paper. Different from the existing methods, BA adopts a block-nested-loop-like execution. It first retrieves T from the beginning and loads a part of T into the memory, compares the tuples in memory with all tuples in T, and removes the dominated tuples from the memory. Each time, the tuples left in memory have been compared with all other tuples and can be reported as part of the incomplete skyline results. Then, the next part of T is loaded and the similar processing is executed; the iteration continues until all tuples in T have been loaded into memory once and compared with all other tuples. In this paper, let S be the size of the allocated memory for storing tuples of T each time; the number of table scans in BA is 8 × M × n/S + 1. In order to reduce the I/O cost in BA, an n-bit bit-vector B_ret, with each bit initialized to 1, is maintained. In the first iteration, the tuples of size S bytes are loaded into memory. Let T_part be the current part of T loaded in memory. The tuples in T_part are compared with all tuples in T. ∀t = T(a), if t is dominated by some tuple in T_part, the ath bit in B_ret is set to 0. Then, in the next iteration, suppose that the next tuple to retrieve is T(b): if B_ret(b) = 1, T(b) is retrieved; otherwise, T(b) is skipped directly, since it cannot be an incomplete skyline tuple.

Fig. 3 Illustration of row table in running example

Example 1 In the rest of this paper, we use a running example, as depicted in Fig. 3, to illustrate the execution of the algorithms proposed in this paper. In the running example, we set M to 3, m to 3, n to 16 and S to 256 bytes. The value domain of the attributes is [0, 100). According to the parameters, the execution of BA divides into two iterations. In the first iteration, T(1, …, 8) are loaded into memory. As depicted in Fig. 4, in the first iteration, only T(8) is left and reported as an incomplete skyline tuple. Besides, T(10), T(11) and T(13) are dominated by the in-memory candidates in the first iteration, so they are skipped in the second iteration. At the end of the second iteration, T(12) and T(15) are left and reported as incomplete skyline tuples. On the whole, the skyline results in the running example are {T(8), T(12), T(15)}.

Fig. 4 Illustration of execution of BA algorithm

6 TSI Algorithm

In this paper, we propose a new algorithm TSI (Table-scan-based Skyline over Incomplete data) to process the skyline over massive incomplete data efficiently. TSI performs two passes of scan on the table to compute the skyline results. Section 6.1 describes the basic execution of the TSI algorithm. The pruning operation is presented in Sect. 6.2.

6.1 Basic Process

The basic process of TSI consists of two stages. In stage 1, TSI performs the first-pass scan on T to find the candidate tuples, while in stage 2, TSI scans T again to discard the candidates which are dominated by some tuple. Algorithm 1 gives the pseudo-code of the basic process.

Algorithm 1 TSI_basic(T)
Input: T is an incomplete table
Output: S_cnd, a set maintaining the skyline tuples over T
 1: initialize S_cnd ← ∅
 2: // Stage 1: find the candidate tuples
 3: while T has more tuples do
 4:   retrieve the next tuple t of T
 5:   if S_cnd = ∅ then
 6:     S_cnd ← S_cnd ∪ {t}
 7:   else
 8:     while S_cnd has more tuples do
 9:       retrieve the next tuple p of S_cnd
10:       if p is dominated by t then
11:         remove p from S_cnd
12:       end if
13:     end while
14:     if t is dominated by some tuple in S_cnd then
15:       discard t
16:     else
17:       S_cnd ← S_cnd ∪ {t}
18:     end if
19:   end if
20: end while
21: // Stage 2: discard the candidates which are dominated by some tuple
22: while T has more tuples do
23:   retrieve the next tuple t of T
24:   while S_cnd has more tuples do
25:     retrieve the next candidate can of S_cnd
26:     if can is dominated by t then
27:       remove can from S_cnd
28:     end if
29:   end while
30: end while
31: return S_cnd

In stage 1, TSI retrieves the tuples in T sequentially and maintains the candidate tuples in a set S_cnd, which is empty initially (line 1). Let t be the currently retrieved tuple. If S_cnd is empty, TSI keeps t in S_cnd (lines 5-6). Otherwise, S_cnd is iterated over, and any candidate which is dominated by t is removed from S_cnd (lines 10-11). At the end of the iteration, if t is dominated by some candidate in S_cnd, t is discarded (lines 14-15); otherwise, TSI keeps t in S_cnd (lines 16-17). In stage 1, TSI does not consider the intransitivity and cyclic dominance of the skyline on incomplete data. Any candidate is discarded if it is dominated by some tuple, even though the candidate may dominate the following tuples. In this way, TSI does not need to maintain the dominated tuples and reduces the in-memory maintenance cost significantly. It is proved in Theorem 1 that S_cnd contains a superset of the query results at the end of stage 1.

Theorem 1 When the first-pass scan of TSI is over, S_cnd maintains a superset of the skyline results over T.

Proof ∀t_1 = T(pi_1), if t_1 is a skyline tuple, there is no other tuple in T which can dominate t_1; at the end of stage 1, t_1 obviously will be kept in S_cnd. If t_1 is not a skyline tuple, there is another tuple t_2 = T(pi_2) which can dominate t_1. If pi_1 < pi_2, t_2 will be retrieved after t_1 and will remove t_1 from S_cnd. If pi_1 > pi_2, t_2 is retrieved before t_1; if t_2 is dominated by some tuple and discarded, t_1 will still be kept in S_cnd at the end of stage 1. Q.E.D.

In stage 2, TSI performs another sequential scan on T. Let t be the currently retrieved tuple (lines 22-23); any candidate dominated by t is removed from S_cnd (lines 26-27). It is proved in Theorem 2 that the candidates in S_cnd are the skyline results at the end of stage 2.

Theorem 2 When the second-pass scan of TSI is over, S_cnd maintains the skyline results over T.

Proof ∀t_1 ∈ S_cnd, if t_1 is not a skyline tuple, there is another tuple t_2 = T(pi_2) which can dominate t_1. In the second-pass scan, TSI will discard t_1 when retrieving t_2. Q.E.D.

The existing algorithms utilize many methods, such as replacement, sortedness and buckets, to deal with intransitivity and cyclic dominance; they usually incur high execution cost on massive incomplete data, as analyzed in Sects. 2 and 4. In this paper, TSI neglects the intransitivity and cyclic dominance in the first-pass scan and leaves the refinement of the skyline results to the second-pass scan.

Example 2 The execution of stage 1 of TSI in the running example is illustrated in Fig. 5. Initially, the candidate set S_cnd is empty. Then, as the first sequential scan is performed, S_cnd = {T(8), T(12), T(15), T(16)} at the end of stage 1. In stage 2, another sequential scan is executed to refine the candidates. As depicted in Fig. 6, T(16) in S_cnd is dominated by T(3). Finally, TSI returns {T(8), T(12), T(15)} as the incomplete skyline results.

Fig. 5 Illustration of execution in stage 1 of TSI
Fig. 6 Illustration of execution in stage 2 of TSI

Time complexity. On massive incomplete data, the majority of the execution cost of TSI is consumed in stage 1. The reason is that every tuple retrieved in stage 1 needs to be compared with all candidates in S_cnd, and the size of S_cnd increases during the first-pass scan on T, while the size of S_cnd decreases gradually in stage 2.

Time complexity of stage 1. As shown in Algorithm 1, the time complexity of stage 1 is determined by the nested loop: the outer loop from line 3 to line 20, and the inner loop from line 8 to line 13. Assume that there are n tuples in the incomplete table; in other words, Algorithm 1 needs to retrieve n tuples, so the iteration count of the outer loop is O(n). The inner loop involves one sequential scan on S_cnd, whose size is no more than n, and each iteration of the inner loop takes constant time; thus, the time complexity of the inner loop is O(|S_cnd|). On the whole, the time complexity of stage 1 is determined by the number of tuples in T and the number of candidates in S_cnd, i.e., it is O(n × |S_cnd|).

Time complexity of stage 2. The execution of stage 2 is also described in Algorithm 1. Obviously, the cost of stage 2 is similar to that of stage 1, i.e., the product of n and the size of S_cnd, but it is usually insignificant compared with the cost of stage 1. The reason is that, when the number of skyline candidates is relatively small, the size of S_cnd while generating the skyline superset in stage 1 is much larger than the size of S_cnd holding the skyline tuples in stage 2, so stage 1 often dominates the overall execution cost. On the whole, the time complexity of Algorithm 1 is O(n^2).
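The basic two-stage process of Algorithm 1 can be sketched in a few lines. This is an in-memory illustration, not the out-of-core implementation: scan stands for one sequential pass over the table, and the dominance test is passed in, so the same sketch works for complete or incomplete data.

```python
def tsi_basic(scan, dominates):
    """Two-pass sketch of TSI's basic process (Algorithm 1).
    scan() must return a fresh sequential iterator over the table."""
    cand = []  # S_cnd
    # Stage 1: keep a superset of the skyline (Theorem 1).
    for t in scan():
        cand = [p for p in cand if not dominates(t, p)]  # lines 8-13
        if not any(dominates(p, t) for p in cand):       # lines 14-17
            cand.append(t)
    # Stage 2: refine. A tuple discarded in stage 1 may still dominate a
    # candidate (intransitivity), so every tuple is checked again (Theorem 2).
    for t in scan():
        cand = [p for p in cand if not dominates(t, p)]
    return cand
```

With a complete-data dominance test and a small table, the two passes return exactly the skyline; with cyclic dominance, stage 2 is what removes candidates that survived stage 1 only because their dominators were discarded earlier.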
J. He, X. Han

6.2 Pruning Operation

6.2.1 Intuitive Idea

On massive incomplete data, the majority of the execution cost of TSI is consumed in stage 1, because every tuple retrieved in stage 1 needs to be compared with all candidates in S_cnd. In stage 1, TSI computes the candidates of the skyline over T, and obviously, a tuple cannot be a skyline tuple if it is dominated by some tuple. Therefore, TSI utilizes pre-constructed data structures to skip the tuples of T which are dominated. In this way, TSI speeds up its execution in stage 1, since the pruning operation reduces not only the I/O cost of retrieving tuples but also the computation cost of dominance checking.

6.2.2 Dominance Checking on Incomplete Data

Given t_1 ∈ T and any t_2 ∈ T, let C be the set of common complete attributes among the skyline criteria of t_1 and t_2. For one thing, if t_1 dominates t_2, it means that ∀A ∈ C, t_1.A ≤ t_2.A and ∃A ∈ C, t_1.A < t_2.A; supposing that t_1 is the currently obtained tuple, we can utilize the values of t_1 to skip the tuples dominated by it. For another, if C is empty, t_1 and t_2 cannot be compared in terms of dominance. Therefore, the key to dominance checking on incomplete data is (1) the comparison of complete attributes and (2) the representation of incomplete attributes. In the following, we introduce how to construct data structures to solve these two issues.

In this paper, the value of any incomplete attribute is regarded as positive infinity, since smaller values are preferred. Given the table T(A_1, …, A_M), a sorted list SL_i (1 ≤ i ≤ M) is built for each attribute.

Fig. 7 Illustration of MCR and RIA in the running example

In the running example (Fig. 7), T(1).A_1 and T(4).A_1 are incomplete attribute values; therefore, RIA_1 = 0110111111111111.
The schema of SL_i is SL_i(PI_T, A_i), where PI_T is the positional index of the tuple in T, and the tuples of SL_i are arranged in the ascending order of A_i. By the sorted lists, TSI constructs the structure MCR (Membership Checking Result) to compare the complete attributes. For the sorted list SL_i (1 ≤ i ≤ M), MCR_{i,b} (1 ≤ b ≤ ⌊log_2 n⌋) is an n-bit bit-vector representing the membership checking results of SL_i(1, …, 2^b).PI_T. ∀t = T(a) (1 ≤ a ≤ n), if a ∈ SL_i(1, …, 2^b).PI_T, then MCR_{i,b}(a) = 1; otherwise, MCR_{i,b}(a) = 0, where MCR_{i,b}(a) denotes the a-th bit of MCR_{i,b}. The maximum values of SL_i(1, …, 2^b).A_i (1 ≤ b ≤ ⌊log_2 n⌋) are kept in an array ITV_i, i.e., ITV_i[b] = SL_i(2^b).A_i.

For the representation of incomplete attributes, TSI performs a sequential scan on T and constructs the structure RIA, which consists of M n-bit bit-vectors. For RIA_i (1 ≤ i ≤ M) and ∀t = T(a) (1 ≤ a ≤ n), if T(a).A_i is a complete attribute, RIA_i(a) = 1; otherwise, RIA_i(a) = 0. Similarly to RIA_1 above, RIA_2 and RIA_3 in the running example can be generated.

By the structures MCR and RIA, given t_1 ∈ T, we want to know which tuples in T are dominated by t_1. Let S_c be the set of complete attributes among A_1, A_2, …, A_m of t_1; without loss of generality, assume that S_c = {A_1, …, A_{|S_c|}}. ∀A_i ∈ S_c (1 ≤ i ≤ |S_c|), we determine the first value ITV_i[b_i] of ITV_i which is greater than t_1.A_i, i.e., ITV_i[b_i − 1] ≤ t_1.A_i < ITV_i[b_i], where ITV_i[0] is assigned negative infinity. Let DBV_{t_1} be the n-bit bit-vector of dominance checking corresponding to t_1, whose bits are initialized to 1. It is proved in Theorem 3 that the bit 1s of DBV_{t_1} = (⋀_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (⋁_{i=1}^{|S_c|} RIA_i) correspond to the tuples dominated by t_1.

Theorem 3 The bit 1s of DBV_{t_1} = (⋀_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (⋁_{i=1}^{|S_c|} RIA_i) represent the tuples which are dominated by t_1.
Example 3 The required data structures mentioned above are illustrated in Fig. 7. SL_1, SL_2, SL_3 are three sorted lists, whose elements are arranged in the ascending order of A_1, A_2, A_3, respectively. MCR_{1,1} is a 16-bit bit-vector representing the membership checking results of SL_1(1, 2).PI_T, i.e., tuples 12 and 8. Therefore, the 8th and the 12th bits of MCR_{1,1} are 1, and MCR_{1,1} = 0000000100010000. ITV_1 keeps the attribute values at exponential gaps in SL_1, i.e., SL_1(2^1).A_1, SL_1(2^2).A_1, SL_1(2^3).A_1, SL_1(2^4).A_1; hence ITV_1 = {26, 47, 65, +∞}. The other MCR bit-vectors and the other ITVs can be obtained similarly. The structure RIA_i represents the incomplete values of A_i.

Proof As mentioned above, b_i is determined as the minimum integer satisfying ITV_i[b_i] > t_1.A_i. Therefore, the bit 1s of ¬MCR_{i,b_i} represent the tuples whose A_i values are greater than t_1.A_i. Since we treat the incomplete attribute values as positive infinity, ⋀_{i=1}^{|S_c|} ¬MCR_{i,b_i} represents the tuples whose values of A_1, …, A_{|S_c|} are all greater than those of t_1. Given a tuple t_2 among these tuples, if at least one of A_1, …, A_{|S_c|} of t_2 is a complete attribute, t_2 is dominated by t_1 according to the dominance definition over incomplete data; if all of A_1, …, A_{|S_c|} of t_2 are incomplete, t_1 and t_2 are not comparable from the perspective of the dominance relationship. The bit 1s of ⋁_{i=1}^{|S_c|} RIA_i mean that at least one of A_1, …, A_{|S_c|} is complete, and the bit 0s of ⋁_{i=1}^{|S_c|} RIA_i indicate that all of A_1, …, A_{|S_c|} are incomplete. Consequently, the bit 1s of DBV_{t_1} = (⋀_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (⋁_{i=1}^{|S_c|} RIA_i) represent the tuples which are dominated by t_1. Q.E.D.
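The construction above can be sketched compactly in Python, using arbitrary-precision integers as n-bit bit-vectors (bit a−1 stands for tuple T(a)) and None for an incomplete value, which is treated as positive infinity as in the paper. build_structures and dbv are illustrative names, not the paper's code.

```python
import math

INF = float('inf')

def build_structures(table):
    """Build MCR, ITV and RIA for every attribute of the table."""
    n, m = len(table), len(table[0])
    B = max(1, math.ceil(math.log2(n)))
    val = lambda t, i: INF if t[i] is None else t[i]
    MCR, ITV, RIA = [], [], []
    for i in range(m):
        # SL_i: positional indices sorted in ascending order of A_i
        sl = sorted(range(n), key=lambda a: val(table[a], i))
        mcr_i, itv_i = [], []
        for b in range(1, B + 1):
            prefix = sl[:min(2 ** b, n)]
            bv = 0
            for a in prefix:                # membership of the first 2^b tuples
                bv |= 1 << a
            mcr_i.append(bv)
            itv_i.append(val(table[prefix[-1]], i))   # largest value in the prefix
        ria_i = 0
        for a in range(n):                  # bit 1: A_i of T(a) is complete
            if table[a][i] is not None:
                ria_i |= 1 << a
        MCR.append(mcr_i)
        ITV.append(itv_i)
        RIA.append(ria_i)
    return MCR, ITV, RIA

def dbv(table, t1, MCR, ITV, RIA):
    """DBV = (AND of NOT MCR_{i,b_i}) AND (OR of RIA_i) over t1's complete attributes."""
    n = len(table)
    full = (1 << n) - 1
    conj, disj = full, 0
    for i, v in enumerate(t1):
        if v is None:                       # only the complete attributes of t1 count
            continue
        # b_i: the first entry of ITV_i that is greater than t1.A_i
        b = next((b for b, x in enumerate(ITV[i]) if x > v), None)
        if b is None:
            return 0                        # no usable prefix: nothing can be pruned
        conj &= full & ~MCR[i][b]
        disj |= RIA[i]
    return conj & disj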
6.2.3 The Extraction of the Pruning Tuples

In order to skip the unnecessary tuples of T in stage 1, we first extract some pruning tuples for the following execution of TSI. The number of pruning tuples should not be large, and they should have relatively strong dominance capability. Since the dimensionality of T can be high, we do not extract the pruning tuples with respect to combinations of different attributes, but with respect to the values of a single attribute and the number of complete attributes of each tuple. It is known that the cardinality of skyline results grows exponentially with the size of the skyline criteria [4], and on incomplete data the dominance relationship between two tuples is evaluated over their common complete attributes. Intuitively, if a tuple has a small number of complete attributes and one of its complete attributes is very small, it tends to have a relatively strong dominance capability.

Fig. 8 Illustration of extracting pruning tuples in the running example

The pruning tuples can be extracted from M sorted column files SC_1, SC_2, …, SC_M. The schema of SC_i (1 ≤ i ≤ M) is (PI_T, NUM_c, A_i), where NUM_c is the number of complete attributes of each tuple. The tuples of SC_i are sorted on NUM_c and A_i, i.e., they are first arranged in the ascending order of NUM_c, and then all tuples with the same NUM_c are arranged in the ascending order of A_i.

In the running example (Fig. 8), with f = 12.5 (16 × 12.5% = 2) and n_pt = 1 (these parameters are defined below), one pruning tuple is retrieved for each SC_i. For SC_1, SC_1(1, …, 11) cannot be used to generate pruning tuples since their attribute values are not within the first two smallest values of A_1. Then, SC_1(12) is selected to obtain the pruning tuple T(SC_1(12).PI_T), since it is the first tuple in SC_1 whose A_1 value is among the first two smallest values of A_1.

For each sorted column file SC_i, we retrieve its tuples sequentially. Let sc be the currently retrieved tuple; if sc.A_i
is within the first f% proportion of all A_i values, the PI_T value of sc is maintained in memory; otherwise, the next tuple is retrieved. The process continues until the number of PI_T values maintained in memory reaches n_pt or the end of the file is reached. Then, the corresponding tuples of T are extracted and kept in a separate pruning tuple file PT_i. In this paper, f is set to 5 and n_pt is set to 1000; the pruning effect with this parameter setting is satisfactory in the performance evaluation.

Example 4 Figure 8 illustrates the extraction of pruning tuples in the running example. SC_i (1 ≤ i ≤ 3) is arranged first in the ascending order of NUM_c, and the tuples with the same value of NUM_c are sorted in the ascending order of A_i. The other pruning tuples (T(14) and T(6)) are obtained similarly.

6.2.4 The Execution of the Pruning Operation

By the pre-constructed structures described above, TSI can utilize the pruning operation to reduce the execution cost of stage 1. In order to execute the pruning operation, TSI maintains an n-bit pruning bit-vector PRB in memory, which is initially filled with bit 0.

Algorithm 2 TSI_Pruning(T, S_cnd)
Input: T, an incomplete table; S_cnd, a set maintaining the candidate tuples
Output: S_cnd, a set maintaining the skyline tuples over T
1: MH is a min-heap to keep m pruning tuples with the highest dominance capability
2: initialize S_cnd ← ∅, MH ← ∅
3: // Stage 1: find the candidate tuples
4: extract the involved pruning tuples PT_1, PT_2, …, PT_m for the skyline criteria of T, and put them into MH
5: while MH has more pruning tuples do
6:   retrieve the next tuple pt of MH
7:   S_c is the set of complete attributes of pt, S_c = {A_1, …, A_{|S_c|}}
8:   if PRB(pt.PI_T) = 1 then
9:     pt can be skipped
10:  else
11:    for (i = 1; i ≤ |S_c|; i++) do
12:      compute the first value ITV_i[b_i] of ITV_i greater than pt.A_i, ITV_i[b_i] = SL_i(2^{b_i}).A_i
13:    end for
14:    set the (pt.PI_T)-th bit of PRB to 1
15:    if S_cnd = ∅ then
16:      S_cnd ← S_cnd ∪ {pt}
17:    else
18:      while S_cnd has more tuples do
19:        retrieve the next tuple p of S_cnd
20:        if p is dominated by pt then
21:          remove p from S_cnd
22:        end if
23:      end while
24:      if pt is dominated by some p ∈ S_cnd then
25:        discard pt
26:      else
27:        S_cnd ← S_cnd ∪ {pt}
28:      end if
29:      DBV_pt ← (⋀_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (⋁_{i=1}^{|S_c|} RIA_i)
30:      PRB ← PRB ∨ DBV_pt
31:    end if
32:  end if
33: end while
34: // Stage 2: discard the candidates which are dominated by some tuples
35: while T has more tuples do
36:   retrieve the next tuple t of T
37:   while S_cnd has more tuples do
38:     retrieve the next tuple can of S_cnd
39:     if can is dominated by t then
40:       remove can from S_cnd
41:     end if
42:   end while
43: end while
44: return S_cnd

Algorithm 2 is the pseudo-code of the execution of the pruning operation. At the beginning of stage 1, TSI determines the involved pruning tuple files PT_1, PT_2, …, PT_m according to the current skyline criteria and retrieves pruning tuples from them. In the process of retrieving PT_1, PT_2, …, PT_m, TSI maintains a min-heap MH in memory to keep m pruning tuples with the highest dominance capability (line 4). Given a pruning tuple pt, let S_c be its complete attributes; without loss of generality, assume that S_c = {A_1, …, A_{|S_c|}} (lines 5–6). For each 1 ≤ i ≤ |S_c|, we determine the first value ITV_i[b_i] of ITV_i which is greater than pt.A_i, and the dominance capability of pt is computed as ∏_{i=1}^{|S_c|} b_i/n (lines 11–13). For the retrieved pruning tuple pt, TSI sets the (pt.PI_T)-th bit of PRB to 1, since pt itself has already been retrieved (line 14). Besides, for each pruning tuple pt, TSI removes any candidates in S_cnd which are dominated by pt (lines 18–23); if pt is not dominated by any candidate in S_cnd, TSI keeps it in S_cnd (lines 26–27). For every pt_b ∈ MH (1 ≤ b ≤ m), TSI computes its corresponding bit-vector DBV_{pt_b} of dominance checking as in Sect. 6.2.2 (line 29). The final pruning bit-vector is PRB = PRB ∨ (⋁_{b=1}^{m} DBV_{pt_b}) (line 30).
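The pruning tuples that feed Algorithm 2 are extracted as described in Sect. 6.2.3. A compact sketch of that extraction for one column follows; the (NUM_c, A_i) ordering and the f%/n_pt cut-offs follow the description above, while the function name and the in-memory lists (instead of disk files) are assumptions of this sketch.

```python
def extract_pruning_indices(table, i, f=5.0, n_pt=1000):
    """Positional indices of the pruning tuples chosen from column i."""
    n = len(table)
    num_c = [sum(v is not None for v in t) for t in table]   # complete-attribute counts
    # a tuple qualifies only if its A_i is among the k smallest complete values
    k = max(1, int(n * f / 100))
    complete_vals = sorted(t[i] for t in table if t[i] is not None)
    if not complete_vals:
        return []
    cutoff = complete_vals[min(k, len(complete_vals)) - 1]
    # SC_i: ascending on NUM_c, ties broken by ascending A_i (incomplete last)
    sc = sorted(range(n),
                key=lambda a: (num_c[a],
                               table[a][i] if table[a][i] is not None else float('inf')))
    picked = []
    for a in sc:                     # sequential scan of the sorted column file
        if table[a][i] is not None and table[a][i] <= cutoff:
            picked.append(a)
            if len(picked) == n_pt:
                break
    return picked
```

With f = 12.5 and n_pt = 1 this mirrors the running example: tuples with few complete attributes are considered first, and only a tuple whose A_i value is among the smallest f% of values qualifies. The paper's defaults are f = 5 and n_pt = 1000.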
Fig. 9 Illustration of constructing PRB in the running example

Example 5 The construction of PRB in the running example is illustrated in Fig. 9. For the pruning tuple T(6) = (56, 3, 0), TSI determines MCR_{1,3}, MCR_{2,1}, MCR_{3,1}, which correspond to the values of T(6). The tuples dominated by T(6) can be specified by the bit-vector DBV_6 = 1101001011000000. Similarly, we obtain DBV_12 and DBV_14. Since T(6), T(12), and T(14) are the pruning tuples, after retrieving them, PRB is set to 0000010000010100, i.e., the 6th, the 12th, and the 14th bits are 1. The final pruning bit-vector is PRB = PRB ∨ (DBV_6 ∨ DBV_12 ∨ DBV_14) = 1111111011111100.

In stage 1, ∀1 ≤ a ≤ n, if PRB(a) = 1, T(a) can be skipped; otherwise, TSI needs to retrieve T(a). The rest of the execution of stage 1 is the same as that in Sect. 6.1.

Example 6 In the running example, TSI only needs to retrieve three tuples (T(8), T(15), T(16)) in stage 1 by use of PRB. This reduces the I/O cost and the computation cost significantly.

7 Performance Evaluation

7.1 Experimental Settings

To evaluate the performance of TSI, we implement it in Java with jdk-8u20-windows-x64. For BA, the size S_BA of the allocated memory is 4 GB. We do not use a larger size for BA because, with the assistance of the bit-vector B_ret mentioned in Sect. 5, a larger value of S_BA loads more tuples of T into memory at a time and reduces the number of iterations, but it also reduces the proportion of the retrieval which can use the optimization of the skipping operation.

Table 2 Parameter settings

  Parameter                        Used values
  Tuple number (10^6) (syn)        5, 10, 50, 100, 500
  Skyline criteria size (syn)      10, 15, 20, 25
  Incomplete ratio (syn)           0.3, 0.4, 0.5, 0.6, 0.7
  Correlation coefficient (syn)    -0.8, -0.4, 0, 0.4, 0.8
  Incomplete ratio (real)          0.3, 0.4, 0.5, 0.6, 0.7

In the experiments, we evaluate the performance of TSI in terms of several aspects: the tuple number (n), the used attribute number (m), the incomplete ratio (p), and the correlation coefficient (c). The experiments are executed on three data sets: two synthetic data sets (independent distribution and correlated distribution) and a real data set. The used parameter settings are listed in Table 2. For the correlated distribution, the first two attributes have the specified correlation coefficient, while the remaining attributes follow the independent distribution.

In order to generate two sequences of random numbers with correlation coefficient c, we first generate two sequences of uncorrelated random numbers X_1 and X_2; then a new sequence Y_1 = c × X_1 + √(1 − c²) × X_2 is generated, and X_1 and Y_1 are two sequences with the given correlation coefficient c. When generating the synthetic data, we fix M to 60 and generate data with all attributes complete. Then, according to the used skyline criteria, one attribute is selected first and kept complete, and the other (m − 1) attributes in the skyline criteria each have a probability p of being incomplete independently.

The real data used are the HIGGS Data Set from the UCI Machine Learning Repository, which is provided for a classification problem and includes 11,000,000 instances. The main reasons for using HIGGS are that (1) HIGGS is one of the largest data sets available to our knowledge, so it allows us to compare the performance of the above algorithms on massive data, and (2) it is an open data set that can be obtained expediently. On real data, we evaluate the performance of TSI with varying values of p. The required structures are pre-constructed before the experiments.
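The correlated-sequence construction used for the synthetic data above can be sketched as follows. It assumes X_1 and X_2 are independent standard normal sequences, for which Y_1 = c·X_1 + √(1 − c²)·X_2 has correlation coefficient c with X_1; the helper names are illustrative.

```python
import math
import random

def correlated_pair(n, c, seed=0):
    """Two length-n sequences whose correlation coefficient is (approximately) c."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y1 = [c * a + math.sqrt(1 - c * c) * b for a, b in zip(x1, x2)]
    return x1, y1

def corr(xs, ys):
    """Sample correlation coefficient of two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)
```

The identity holds because X_1 and X_2 have unit variance and zero covariance, so Cov(X_1, Y_1) = c and Var(Y_1) = c² + (1 − c²) = 1.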
Under the default setting of the experiments, on LENOVO ThinkCentre M8400 (Intel (R) Core(TM) i7 CPU @ 3.40GHz (8 CPUs) + 32G memory + 3TB HDD + i.e., M = 60 , n = 50 × 10 , and p = 0.3 , it takes 6840.573 seconds to pre-construct the required data structures. 64 bit windows 7). In the experiments, we implement TSI, BA, SOBA [10] and SIDS [1]. With the experimental setting below, the execution time of SOBA and SIDS is so long that we do not report its experimental results with the settings below, but evaluate it in Sect. 7.8 separately. For BA, the size https://archive.ics.uci.edu/ml/datasets/HIGGS# 1 3 114 J. He, X. Han 1e+006 1e+006 20000 20000 Fig. 10 Comparison between 18000 18000 TSI and TSI 16000 16000 100000 100000 14000 14000 12000 12000 TS TSII B B 10000 10000 TS TSII 10000 10000 8000 8000 1000 1000 6000 6000 TS TSII B B 4000 4000 TS TSII 10 100 0 2000 2000 5 5 10 10 50 50 100 100 500 500 5 5 10 10 50 50 10 100 0 50 500 0 6 6 6 6 tuple number (1 tuple number (10 0 ) ) tuple number (10 tuple number (10 ) ) (a)Execution time (b)Candidate size 250000 250000 5000 5000 S2 S2 S2 S2 4500 4500 S1 S1 S1 S1 200000 200000 4000 4000 BM BM 3500 3500 PT PT 150000 150000 3000 3000 2500 2500 100000 100000 2000 2000 1500 1500 50000 50000 1000 1000 500 500 0 0 0 0 5 5 10 10 50 50 10 100 0 500 500 5 5 10 10 50 50 10 100 0 500 500 6 6 6 6 tuple number (1 tuple number (10 0 ) ) tuple number (10 tuple number (10 ) ) (c)TSI decomposition (d)TSI decomposition 1e+012 1e+012 1e+013 1e+013 1e+012 1e+012 1e+011 1e+011 1e+011 1e+011 1e+010 1e+010 1e+010 1e+010 TS TSII TS TSII B B B B TS TSII TS TSII 1e+009 1e+009 1e+009 1e+009 5 5 10 10 50 50 100 100 500 500 5 5 10 10 50 50 100 100 500 500 6 6 6 6 tuple number (1 tuple number (10 0 ) ) tuple number (10 tuple number (10 ) ) (e)The I/Ocost (f)Comparisonnumber decomposition of TSI, which consists of four parts: the time 7.2 T he Comparison of TSI with and Without to retrieve pruning tuples, the time to load the required bit- Pruning 
vectors, the time in stage 1, and the time in stage 2. The time in stage 2 of TSI is longer than that of TSI due to the greater The performance of TSI and TSI is compared in different number of candidates left. However, the time reduction in aspects, where TSI is the TSI algorithm without pruning stage 1 of TSI is much significant compared with TSI and operation. As depicted in Fig. 10a, TSI runs 18.84 times TSI runs one order of magnitude faster than TSI averagely. faster than TSI and the speedup ratio increases with a As shown in Fig. 10(e and f), the pruning operation makes greater value of n. This significant advantage is due to the TSI incur less I/O cost and perform fewer number of domi- effective pruning operation. The numbers of the candidates nance checking. after stage 1 are illustrated in Fig. 10b. TSI maintains more candidates than TSI after stage 1. This is because the prun- 7.3 Experiment 1: the Eec ff t of Tuple Number ing operation skips most of the tuples in stage 1, and there- fore, many candidates which should be removed by some Given m = 20 , M = 60 , p = 0.3 and c = 0 , experiment 1 tuples are left. But the pruning operation reduces the cost in evaluates the performance of TSI on varying tuple numbers. stage 1 significantly. Figure  10c reports the time decomposi- As shown in Fig. 11a, TSI runs 60.42 times faster than BA tion of TSI . Obviously, the execution time of stage 1 domi- averagely. The speedup ratio of TSI over BA increases with nates its overall time. We even cannot see the time in stage 2 a greater value of n, from 8.31 at n = 5 × 10 to 166.58 at due to its rather small proportion. Figure 10d gives the time 1 3 io cost(bytes io cost(bytes) ) time(s) time(s) time(s) time(s) time(s) time(s) number of dominance checking number of dominance checking time(s) time(s) Efficient Skyline Computation on Massive Incomplete Data 115 1e+006 1e+006 1e+013 1e+013 Fig. 
Fig. 11 Effect of tuple number: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Figure 11b depicts that TSI incurs 6.73 times less I/O cost than BA, and as illustrated in Fig. 11c, TSI performs 38.17 times fewer dominance checks than BA. The performance advantage of TSI over BA widens with a greater value of n. At n = 5 × 10^6, BA can load all of T into memory and perform another table scan on T to compute the incomplete skyline results. At n = 500 × 10^6, BA needs to execute 56 iterations, each loading a part of T and then performing a table scan on T to remove the dominated tuples. On the contrary, TSI shows a slower growth trend with respect to the tuple number due to its execution process and its pruning operation. As illustrated in Fig. 11d, the pruning operation of TSI can skip the vast majority of the tuples in stage 1. The pruning ratio in the experiments is computed by the formula n_skip/n, where n_skip is the number of tuples skipped in stage 1.

Fig. 12 Effect of skyline criteria size: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 13 Effect of incomplete ratio: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 14 Effect of correlation coefficient: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 15 Effect of real data: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio
7.4 Experiment 2: The Effect of Skyline Criteria Size

Given M = 60, n = 50 × 10^6, p = 0.3 and c = 0, experiment 2 evaluates the performance of TSI with varying skyline criteria sizes. As illustrated in Fig. 12a, with a greater value of m, the execution times of BA and TSI both increase significantly; TSI still runs 85.79 times faster than BA on average. For BA, its I/O cost depends on two parts. For one thing, BA needs to retrieve T once to load it into memory. For another, BA performs a sequential scan on T in each iteration to discard the in-memory candidates which are dominated by some tuples and to skip tuples. For the first part, BA may not retrieve all tuples into memory, since the current tuples may be dominated in the previous iterations. For the second part, if the current candidates are all discarded, BA does not have to continue the sequential scan but just performs the next iteration directly. When the value of m increases with the other parameters fixed, the probability that a tuple is dominated by another tuple becomes lower; therefore, the I/O cost increases in both parts, as reported in Fig. 12b. For TSI, its I/O cost also consists of two parts. In stage 1, TSI performs a selective scan on T to obtain the candidates of the incomplete skyline results. In stage 2, TSI performs another sequential scan on T to compute the results, and it can terminate directly if all candidates are removed. As the value of m increases, the pruning effect of TSI in stage 1 becomes worse, which is also verified in Fig. 12d, and TSI has to retrieve more tuples before it terminates in stage 2. This leads to a higher I/O cost for TSI with a greater value of m, as illustrated in Fig. 12b. With a similar explanation, as shown in Fig. 12c, the numbers of dominance checks of both algorithms increase with a greater value of m.

7.5 Experiment 3: The Effect of Incomplete Ratio

Given m = 20, M = 60, n = 50 × 10^6 and c = 0, experiment 3 evaluates the performance of TSI with varying incomplete ratios. As the value of p increases, the execution time of BA decreases quickly, while the execution time of TSI first decreases and then increases gradually. For BA, the decline of the execution time is easy to understand. With a greater value of p, the probability that any tuple is dominated by other tuples increases. This makes more in-memory candidates in each iteration dominated by some tuples during the sequential scan, and it reduces the I/O cost and the dominance checking cost. As illustrated in Fig. 13c, with a greater value of p, the number of dominance checks of BA decreases constantly. And as shown in Fig. 13b, the I/O cost of BA first decreases significantly when p increases from 0.3 to 0.4, and then remains basically unchanged. When p increases from 0.3 to 0.4, the number of in-memory candidates is reduced during the sequential scan, and in each iteration BA terminates earlier; this reduces the I/O cost of BA. When the value of p is greater than 0.4, the number of in-memory candidates is reduced also, but in each iteration BA reaches an approximately equal scan depth before it terminates. For TSI, the effect of the pruning operation depends on two factors. One is the probability that a tuple can be dominated by other tuples; the other is whether all common attributes of two tuples are incomplete. The two factors have different effects in different cases. With a greater value of p, the probability that a tuple is dominated by some tuples increases, and so does the probability that the common attributes of two tuples are all incomplete. When p increases from 0.3 to 0.5, the first factor has the greater impact; after that, the second factor plays the larger role. This explains the trend of the execution time of TSI. Similarly, this explains the variation trends of TSI in the I/O cost (Fig. 13b), the number of dominance checks (Fig. 13c), and the pruning ratio (Fig. 13d).

7.6 Experiment 4: The Effect of Correlation Coefficient

Given m = 20, M = 60, n = 50 × 10^6 and p = 0.3, experiment 4 evaluates the performance of TSI with varying correlation coefficients, ranging from -0.8 to 0.8. As illustrated in Fig. 14a, TSI runs 47.72 times faster than BA. A negative correlation means that there is an inverse relationship between two variables: when one variable decreases, the other increases. A positive correlation means that the variables tend to move in the same direction. Therefore, skyline computation on negatively correlated data is usually more expensive than that on positively correlated data. The variations of TSI and BA both show a downward trend in experiment 4. Here, the trend is not significant because the incomplete attributes in the data set reduce the impact of the correlation. The I/O cost and the number of dominance checks are depicted in Fig. 14(b and c), respectively, and they have similar variation trends. The effect of the pruning operation of TSI is illustrated in Fig. 14d. Due to the impact of the incomplete attributes, the pruning ratio shows considerable change, but it still shows an upward trend overall.

7.7 Experiment 5: Real Data

The real data, the HIGGS Data Set, are obtained from the UCI Machine Learning Repository. It contains 11,000,000 tuples with 28 attributes. We select the first 20 attributes as the skyline criteria and evaluate the performance of TSI with varying incomplete ratios. Before the experiment is executed, one attribute is first chosen to be complete, and the other (m − 1) attributes in the skyline criteria have a probability p of being incomplete independently. As depicted in Fig. 15a, TSI runs 40.46 times faster than BA. The variation trends of the execution times of BA and TSI are very close to those in Sect. 7.5 and can be explained similarly. The I/O cost and the number of dominance checks are depicted in Fig. 15(b and c), respectively, and the pruning ratio of TSI is illustrated in Fig. 15d. The variations in these figures can be explained similarly as in Sect. 7.5.

Fig. 16 Comparison with BA, SOBA, and SIDS: (a) execution time, (b) the I/O cost

7.8 Experiment 6: The Comparison with SOBA and SIDS

In this part, we evaluate the performance of TSI against BA, SOBA and SIDS on a relatively small data set with a relatively small skyline criteria size. Given n = 10 × 10^6, p = 0.3 and c = 0, in order to acquire a better performance for SOBA and SIDS, we set the value of m from 6 to 10 and the value of M equal to that of m. This reduces the length of each tuple and also lowers the cost of bucket partitioning for SOBA and SIDS.

As illustrated in Fig. 16a, SIDS is the slowest among the four algorithms, while TSI is the fastest for the various skyline criteria sizes, and the execution time of SOBA increases significantly with the value of m. When m = 10, SOBA runs 10.96 times slower than BA, the baseline algorithm in this paper, and 200.91 times slower than TSI. As for SIDS, it runs 21.11 times slower than BA and 386.84 times slower than TSI. On disk-resident data, SOBA and SIDS cannot process incomplete skyline efficiently.
The bucket 1 3 Efficient Skyline Computation on Massive Incomplete Data 119 3. Chomicki J, Godfrey P, Gryz J, Liang D (2003) Skyline with pre- partitioning of SOBA involves two passes of table scan, not sorting. In: Proceedings of the 19th international conference on to mention the maintenance cost of the large number of par- data engineering, pp 717–719 titions in the disk if the number of m is not small. Then, the 4. Godfrey P (2004) Skyline cardinality for relational processing. In: computation of local skyline involves another pass of tuple foundations of information and knowledge systems, Third Inter- national Symposium, FoIKS 2004:78–97 retrieval. On the relatively large value of m, the number of 5. Godfrey Parke, Shipley Ryan, Gryz Jarek (2007) Algorithms and local skyline is great also. As depicted in Fig. 16b, the local analyses for maximal vector computation. VLDB J 16(1):5–28 skyline makes up 11.7% of the total tuples at m = 10 . The 6. Xixian H, Jianzhong L, Donghua Y, Jinbao W (2013) Efficient I/O cost of SIDS is much larger than others, SIDS and BA skyline computation on big data. IEEE Trans Knowl Data Eng 25(11):2521–2535 are much close in I/O cost. The growth trend of the execu- 7. Khalefa ME, Mokbel MF, Levandoski JJ (2008) Skyline query tion time of SIDS is fast with respect to skyline criteria size. processing for incomplete data. In: Proceedings of the 24th inter- The performance of TSI is efficient not only for the in-mem- national conference on data engineering, pp 556–565 ory data set with small size of skyline criteria, but also for 8. Kossmann D, Ramsak F, Rost S (2002) Shooting stars in the sky: an online algorithm for skyline queries. In: Proceedings of the disk-resident data with not small size of skyline criteria. the 28th international conference on very large data bases, pp 275–286 9. Lee Jongwuk, Hwang Seung-Won (January 2014) Scalable skyline 8 Conclusion computation using a balanced pivot selection technique. Inf Syst 39:1–21 10. 
Lee Jongwuk, Im Hyeonseung, You Gae-won (2016) Optimizing This paper considers the problem of incomplete skyline skyline queries over incomplete data. Inf Sci 361–362:14–28 computation on massive data. It is analyzed that the existing 11. Lee Ken C, Lee Wang-Chien, Zheng Baihua, Li Huajing, Tian algorithms cannot process the problem efficiently. A table- Yuan (2010) Z-sky: an efficient skyline query processing frame- work based on z-order. VLDB J 19(3):333–362 scan-based algorithm TSI is devised in this paper to deal 12. Luo Cheng, Jiang Zhewei, Hou Wen-Chi, He Shan, Zhu Qiang with the problem efficiently. Its execution consists of two (2012) A sampling approach for skyline query cardinality estima- stages. In stage 1, TSI maintains the candidates by a sequen- tion. Knowl Inf Syst 32(2):281–301 tial scan. And in stage 2, TSI performs another sequential 13. Miao X, Yunjun G, Su G, Wanqi L (2018) Incomplete data man- agement: a survey. Front Comput Sci 12(1):4–25 scan to refine the candidate and acquire the final results. 14. Papadias Dimitris, Tao Yufei, Greg Fu, Seeger Bernhard (2005) In order to reduce the cost in stage 1, which dominates the Progressive skyline computation in database systems. ACM Trans overall cost of TSI, a pruning operation is utilized to skip the Database Syst 30(1):41–82 unnecessary tuples in stage 1. The experimental results show 15. Sheng C, Tao Y(2011) On finding skylines in external memory. In: Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART that TSI outperforms the existing algorithms significantly. symposium on principles of database systems, pp 107–116 16. Tan K-L, Eng P-K, Ooi BC (2001) Efficient progressive skyline computation. In: Proceedings of the 27th international conference Open Access This article is licensed under a Creative Commons Attri- on very large data bases, pp 301–310 bution 4.0 International License, which permits use, sharing, adapta- 17. 
Tao Yufei, Xiao Xiaokui, Pei Jian (2007) Efficient skyline tion, distribution and reproduction in any medium or format, as long and top-k retrieval in subspaces. IEEE Trans Knowl Data Eng as you give appropriate credit to the original author(s) and the source, 19(8):1072–1088 provide a link to the Creative Commons licence, and indicate if changes 18. Zhang K, Gao H, Han X, Cai Z, Li J (2017) Probabilistic skyline were made. The images or other third party material in this article are on incomplete data. In: Proceedings of the 2017 ACM on confer- included in the article's Creative Commons licence, unless indicated ence on information and knowledge management, pp 427–436 otherwise in a credit line to the material. If material is not included in 19. Zhang Kaiqi, Gao Hong, Han Xixian, Cai Zhipeng, Li Jianzhong the article's Creative Commons licence and your intended use is not (2020) Modeling and computing probabilistic skyline on incom- permitted by statutory regulation or exceeds the permitted use, you will plete data. IEEE Trans Knowl Data Eng 32(7):1405–1418 need to obtain permission directly from the copyright holder. To view a 20. Shiming Z, Nikos M, Cheung DW (2009) Scalable skyline com- copy of this licence, visit http://cr eativ ecommons. or g/licen ses/ b y/4.0/ . putation using object-based space partitioning. In: Proceedings of the 2009 ACM SIGMOD international conference on manage- ment of data, pp 483–494 21. Zhenjie Z, Hua L, Beng Chin O, Tung AK (2010) Understanding References the meaning of a shifted sky: a general framework on extending skyline query. The VLDB J 19(2):181–201 1. Bharuka R, Sreenivasa Kumar P (2013) Finding skylines for 22. Zhenjie Z, Yin Y, Ruichu C, Dimitris P, Anthony KHT (2009) incomplete data. In: Proceedings of the 24th australasian database Kernel-based skyline cardinality estimation. In: Proceedings of conference - Vol 137, pp 109–117 the ACM SIGMOD international conference on management of 2. 
Börzsönyi S, Kossmann D, Stocker K (2001) The skyline opera- data, pp 509–522 tor. In: Proceedings of the 17th international conference on data engineering, pp 421–430 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Data Science and Engineering Springer Journals

Efficient Skyline Computation on Massive Incomplete Data
J. He · X. Han
Data Science and Engineering, Volume 7 (2), June 1, 2022
Publisher: Springer Journals
Copyright © The Author(s) 2022
ISSN 2364-1185 · eISSN 2364-1541
DOI 10.1007/s41019-022-00183-7

The skyline query actually discovers all tuples which are not dominated by any other tuples.

Due to its practical importance, the skyline query has received extensive attention [2, 3, 5, 6, 8, 9, 14–17, 19, 20]. However, the overwhelming majority of the existing algorithms only consider data sets of complete attributes, i.e., all the attributes of every tuple are available. In real-life applications, because of reasons such as delivery failure or deliberate concealment, the data sets we encounter often are incomplete, i.e., some attributes of tuples are unknown [13]. On incomplete data, the existing skyline algorithms cannot be applied directly, since all of them assume the transitivity of the dominance relationship.

On incomplete data, the traditional definition of dominance does not hold any more, and the dominance relationship is re-defined. Given the skyline criteria and two tuples p and q on incomplete data, let C be the common complete attributes of p and q among the skyline criteria; p dominates q if p is no worse than q among C and strictly better than q in at least one attribute among C. Under the dominance relationship defined above, transitivity does not hold on incomplete data. As illustrated in Fig. 1, the specified skyline criteria are {A1, A2, A3}. In the table, p1 dominates p2 since the common attribute among the skyline criteria of p1 and p2 is A1, and p1.A1 < p2.A1. Similarly, p2 dominates p3. But p1 does not dominate p3 here, and transitivity does not hold. Besides, it is found that p3 dominates p1, so on incomplete data we may also face the issue of cyclic dominance. The two issues, intransitivity and cyclic dominance, make the processing of skyline on incomplete data different from skyline on complete data.

* Corresponding author: Xixian Han, xxhan1981@163.com
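The re-defined dominance and the cyclic example can be made concrete in code. The sketch below is a minimal illustration, not taken from the paper: Fig. 1's actual values are not reproduced here, so the three tuples are hypothetical, with `None` standing for an incomplete attribute and smaller values preferred.

```python
def dominates(t1, t2):
    """t1 dominates t2 on incomplete data: compare only the common
    complete attributes C; t1 must be no worse than t2 on all of C
    and strictly better on at least one (smaller is better)."""
    common = [i for i in range(len(t1)) if t1[i] is not None and t2[i] is not None]
    return (bool(common)
            and all(t1[i] <= t2[i] for i in common)
            and any(t1[i] < t2[i] for i in common))

# hypothetical tuples over criteria {A1, A2, A3}
p1, p2, p3 = (1, 2, None), (None, 3, 4), (0, None, 5)
print(dominates(p1, p2), dominates(p2, p3), dominates(p3, p1))  # → True True True
print(dominates(p1, p3))  # → False: the chain p1, p2, p3 is not transitive
```

The three `True` results form a dominance cycle p1 → p2 → p3 → p1, since each pair is compared on a different single common attribute.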
Fig. 1 Data set of example table

The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms [7], sorted-based algorithms [1], and bucket-based algorithms [7, 10]. The replace-based algorithms first replace the incomplete attributes with a specific value, then compute traditional skyline on the transformed data, and finally refine the candidates to compute the results by pairwise comparison. Normally, the number of candidates on massive data is large and the pairwise comparison is significantly expensive. Sorted-based algorithms utilize selected tuples with possibly high dominance, obtained via pre-sorted structures one by one, to prune the non-skyline tuples. This usually performs many passes of scan on the table and will incur high I/O cost on massive data. Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and hold the transitivity rule, then compute local skyline results in every bucket, and finally merge the local skyline results to obtain the results. In incomplete skyline computation, the skyline criteria size usually is greater than that on complete data due to the cyclic dominance, the bucket number involved in bucket-based algorithms often is large, and the number of local skyline results is relatively great. The computation and merge operations of local skyline results often incur high computation cost and I/O cost on massive data. To sum up, the existing algorithms cannot process the incomplete skyline query on massive data efficiently.

Based on the discussion above, this paper proposes the TSI algorithm (Table-scan-based Skyline over Incomplete data) to compute skyline results on massive incomplete data with high efficiency. In order to reduce the computation cost and I/O cost, the execution of TSI consists of two stages. In stage 1, TSI performs a sequential scan on the table and maintains candidates in memory. For each tuple t retrieved in stage 1, any candidate dominated by t is removed. And if t is not dominated by any candidate, t is added to the candidate set. In stage 1, TSI does not consider the intransitivity and cyclic dominance in incomplete skyline computation, but just discards the tuples which definitely are not final results. In stage 2, another sequential scan is executed to refine the candidates. For each tuple t retrieved in stage 2, any candidate dominated by t is discarded. When stage 2 terminates, the candidates in memory are the incomplete skyline results. In this paper, it is found that the cost of stage 1 dominates the overall cost of TSI, so a pruning operation is devised to skip the tuples in stage 1. Useful data structures are pre-constructed, which are used to check whether a tuple is dominated before actually retrieving it. Extensive experiments are conducted on synthetic and real-life data sets. The experimental results show that the pruning operation can skip the overwhelming majority of tuples in stage 1 and TSI outperforms the existing algorithms significantly.

The contributions of this paper are listed as follows:

– This paper proposes a novel table-scan-based TSI algorithm of two stages to process the skyline query on massive incomplete data efficiently.
– Two novel data structures are designed to maintain information of tuples and obtain pruning tuples with strong dominance capability.
– This paper devises efficient pruning operations to reduce the execution cost of TSI, which directly skip the tuples dominated by some tuples before actually retrieving them.
– The experimental results show that TSI can compute incomplete skyline on massive data efficiently.

The rest of the paper is organized as follows. The related work is surveyed in Sect. 2, followed by preliminaries in Sect. 3. The existing algorithms are analyzed in Sect. 4. The baseline algorithm is developed in Sect. 5. Section 6 introduces the TSI algorithm. The performance evaluation is provided in Sect. 7. Section 8 concludes the paper.

(School of Computer Science and Technology, Harbin Institute of Technology, No. 92, Xidazhi Street, Harbin, Heilongjiang, China)

2 Related Work

Since [2] first introduced the skyline operator into the database environment, skyline has been studied extensively by database researchers [2, 3, 5, 6, 8, 11, 14–17, 20]. However, most of the existing skyline algorithms only consider complete data, and they utilize the transitivity of the dominance relationship to acquire significant pruning power. They cannot be directly used for the skyline query on incomplete data, where the dominance relationship is intransitive and cyclic. In the rest of this section, we survey the skyline algorithms on incomplete data. The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms, sorted-based algorithms, and bucket-based algorithms.

2.1 Replace-Based Algorithms

Khalefa et al. [7] propose a set of skyline algorithms for incomplete data. The first two algorithms, replacement and bucket, are extensions of the existing skyline algorithms to accommodate incomplete data. The replacement algorithm first replaces the incomplete attributes by a special value to transform the incomplete data to complete data. A traditional skyline algorithm can be used to compute the skyline results SKY_comp on the transformed complete data, which also is a superset of the skyline results on the incomplete data. Finally, the tuples in SKY_comp are transformed into their original incomplete form, and the exhaustive pairwise comparison between all tuples in SKY_comp is performed to compute the final results. The bucket algorithm first divides all the tuples of the incomplete data into different buckets so that all tuples in the same bucket have the same bitmap representation. The dominance relationship within the same bucket is now transitive since the tuples there have the same bitmap representation. The traditional skyline algorithm is utilized to compute the skyline results within each bucket, which are called the local skyline. The local skyline results of all buckets are merged as the candidate skyline. The exhaustive pairwise comparison is performed on the candidate skyline to compute the query answer. The ISkyline algorithm employs two new concepts, virtual points and shadow skylines, to improve the bucket algorithm. The execution of ISkyline consists of three phases. In phase I, each newly retrieved tuple is compared against the local skyline and the virtual points to determine whether the tuple needs to be (a) stored in the local skyline, (b) stored in the shadow skyline, or (c) discarded directly. In phase II, the tuples newly inserted into the local skyline are compared with the current candidate skyline; ISkyline updates the candidate skyline and the virtual points correspondingly. Every time t tuples are kept in the candidate skyline, ISkyline enters phase III, updates the global skyline, and clears the current candidate skyline. The similar processing continues until the end of the input is reached and ISkyline returns the global skyline.

The replace-based algorithms first replace the incomplete attributes with a specific value, then compute traditional skyline on the transformed data, and finally refine the candidates to compute the results by pairwise comparison. Normally, the number of candidates on massive data is large and the pairwise comparison is significantly expensive.

2.2 Sorted-Based Algorithms

Bharuka et al. [1] propose a sort-based skyline algorithm SIDS to evaluate the skyline over incomplete data. SIDS first sorts the incomplete data D in non-descending order on each attribute. Let D_i be the sorted list with respect to the ith attribute. Only the ids of the tuples are kept in D_i, and the ids of the tuples whose ith attributes are incomplete are not stored. SIDS performs a round-robin retrieval on the sorted lists. For each retrieved data point p, if it has not been retrieved before, it is compared with each data point q in the candidate set, which is initialized to be the whole incomplete data. If p and q have been compared already, the next data point in the candidate set is retrieved and processed. Otherwise, if p dominates q, q is removed from the candidate set. And if q dominates p and p is in the candidate set, p is removed from the candidate set also. If the number of times p has been retrieved during the round-robin retrieval is equal to the number of its complete attributes and p has not been pruned yet, p can be reported to be one of the skyline results. SIDS terminates when the candidate set becomes empty or all points in the sorted lists are processed at least once.

Sorted-based algorithms utilize selected tuples with possibly high dominance, obtained via pre-sorted structures one by one, to prune the non-skyline tuples. This usually performs many passes of scan on the table and will incur high I/O cost on massive data.

2.3 Bucket-Based Algorithms

Lee et al. [10] propose a sorting-based SOBA algorithm to optimize the bucket algorithm. Similar to the bucket algorithm, SOBA also first divides the incomplete data into a set of buckets according to their bitmap representation, then computes the local skyline of the tuples in each bucket, and finally performs the pairwise comparison for the skyline candidates (the collection of all local skylines). SOBA uses two techniques to reduce the dominance tests for the skyline candidates. The first technique is to sort the buckets in ascending order of the decimal numbers of their bitmap representation. This can identify the non-skyline points as early as possible. The second technique is to rearrange the order of tuples within the bucket. By sorting tuples in ascending order of the sum of their complete attributes, the tuples accessed earlier have a higher probability of dominating other tuples, and this can help reduce the number of dominance tests further.

Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and hold the transitivity rule, then compute local skyline results in every bucket, and finally merge the local skyline results to obtain the results. In incomplete skyline computation, the skyline criteria size usually is greater than that on complete data due to the cyclic dominance, the bucket number involved in bucket-based algorithms often is large, and the number of local skyline results is relatively great. The computation and merge operations of local skyline results often incur high computation cost and I/O cost on massive data.

For the algorithms mentioned above, the dominance over incomplete data is defined on the common complete attributes. There are also other definitions of dominance over incomplete data. Zhang et al. [21] propose a general framework to extend the skyline query. For each attribute, they first retrieve the probability distribution function of the values in the attribute from all the non-missing values on the attribute, and then convert incomplete tuples to complete data by estimating all missing attributes. A mapping dominance is then defined on the converted data. Zhang et al. [18] propose PISkyline to compute probabilistic skyline on incomplete data. It is considered in [18] that each missing attribute value can be described by a probability density function. The probability is used to measure the preference condition between missing values and the valid values. Then, the probability of a tuple being in the skyline can be computed. PISkyline returns the K tuples with the highest skyline probability.

Discussion. Throughout this paper, we use the definition of dominance over incomplete data as in [7]. Firstly, this dominance notion is commonly used in most skyline algorithms over incomplete data. Secondly, the estimation of the incomplete attribute values may be undesirable in some cases. Therefore, we do not guess the incomplete attributes in this paper and do not consider such algorithms any further.

In this paper, we consider the skyline over massive incomplete data, i.e., the data set cannot be kept in memory entirely. It is found that the existing algorithms, including [1, 7, 10], all assume processing of in-memory data. Their performance will be seriously degraded on massive data. Since the cardinality of the skyline query increases exponentially with respect to the size of the skyline criteria [4], the replacement algorithm often generates a large number of skyline candidates and the pairwise comparison among the candidates incurs a prohibitively expensive cost. Bucket-based algorithms, such as the bucket algorithm, ISkyline and SOBA, have the problem that they have to divide the data set into different buckets. Given the size m of the skyline criteria, the number of buckets can be as high as 2^m − 1; this will cause serious performance issues when m is not small. As for SIDS, it utilizes one selected tuple to prune the non-skyline tuples in the candidate set, and this incurs a pass of sequential scan on the data. Thus, it requires many passes of scan on the data to finish its execution, and this will incur a high I/O cost on massive data.

3 Preliminaries

Given an incomplete table T of n tuples with attributes A1, …, AM, some attributes of the tuples in T are incomplete. The attributes in T are considered to be of numerical type; let A1, …, Am be the specified skyline criteria. Throughout the paper, it is assumed that smaller attribute values are preferred. In this paper, the attributes with known values are called complete attributes, while the attributes with unknown values are called incomplete attributes. ∀t ∈ T, t has at least one complete attribute among A1, A2, …, Am, while all other attributes have a probability p (0 < p ≤ 1) of being incomplete. The frequently used symbols in this paper are listed in Table 1.

Table 1 Symbols description

  Symbol    Description
  T         An incomplete table
  T_part    The current part of T loaded in memory
  t         A tuple in T
  C         The common complete attribute(s)
  PI_X      The positional index of tuple X
  S         The size of allocated memory for storing tuples of T each time
  S_cnd     A set maintaining candidate tuples
  SL_i      The sorted list which is built for the ith attribute
  MCR_i     The bit-vector representing the membership checking result of SL_i
  RIA       The bit-vector representing whether the attribute is complete
  S_t       The set of the complete attributes of t
  NUM       The number of the complete attributes for each tuple

The dominance over incomplete data is given in Definition 1. The incomplete skyline returns the tuples in T which are not dominated by any other tuples.

Definition 1 (Dominance over incomplete data) Given table T and skyline criteria A1, …, Am, ∀t1, t2 ∈ T, let C be their common complete attributes among the skyline criteria. t1 dominates t2 (denoted by t1 ≺ t2) if ∀A ∈ C, t1.A ≤ t2.A, and ∃A ∈ C, t1.A < t2.A.

Definition 2 (Positional index) ∀t ∈ T, its positional index (PI) is a if t is the ath tuple in T.

The positional index is defined in Definition 2. We denote by T(a) the tuple with PI = a, by T(a, …, b) (a ≤ b) the tuples in T whose PIs are between a and b, and by T(a, …, b).A_i the set of attribute A_i values in T(a, …, b).

4 The Analysis for the Existing Algorithms

The existing skyline algorithms over incomplete data can be classified into three types: replacement-based algorithms, bucket-based algorithms, and sort-based algorithms. As discussed in Sect. 2, replacement-based algorithms usually generate too many skyline candidates, and sort-based algorithms often need to perform many passes on the table before returning the results. They both incur very high computation cost and I/O cost on massive data. In the following part of this section, we analyze the performance of the bucket-based algorithm.

Given table T and the skyline criteria {A1, A2, …, Am}, ∀t ∈ T, t can be encoded by an m-bit vector t.B. ∀i, 1 ≤ i ≤ m, if t.A_i is a complete attribute, the ith bit of t.B is 1 (denoted by t.B(i) = 1); otherwise, the ith bit of t.B is 0 (t.B(i) = 0). Note that the most significant bit is the first bit. The bucket-based algorithm divides the tuples in T according to their encoded vectors. Therefore, the tuples in the same bucket share the same vectors, and the transitive dominance relation holds among the tuples in a bucket. A traditional skyline algorithm can be utilized to compute the local skyline within the bucket. Any tuple t1 dominated by a tuple t2 in the same bucket can be discarded directly, since it cannot be a skyline result, and any tuple which can be dominated by t1 can naturally be dominated by t2. Of course, there are other techniques to optimize the pairwise comparison among skyline candidates [10].

As illustrated in Fig. 2 (different cases of bitmap encoding), for the analysis, we assume that the bitmap encoding of the buckets consists of m cases with equal likelihood: ∃i, 1 ≤ i ≤ m, the values of A_i must be known, and the other attributes can be unknown with probability p independently. Given an m-bit vector b = (b1 b2 … bm) of some bucket, let Cnt1(b) = r, where Cnt1 is a function returning the number of 1-bits in a bit-vector; of course, in this paper, 1 ≤ r ≤ m. The bit-vector b can occur in r of the cases. In each case, the probability of generating b is (1 − p)^(r−1) × p^(m−r), i.e., besides the selected complete attribute, there are (r − 1) complete attributes and (m − r) incomplete attributes. Therefore, the probability pr_b of generating b among the overall cases is pr_b = (r/m) × (1 − p)^(r−1) × p^(m−r). The number N_b of tuples which have the encoded bit-vector b in T is n_b = n × pr_b.

Theoretically, the bucket-based algorithm can split T into all 2^m − 1 buckets. The size of the skyline criteria on incomplete data usually is greater than that on complete data due to the cyclic dominance. This can be verified in the existing skyline algorithms on incomplete data [1, 7, 10]. Then, the number of all buckets is not small. For example, given m = 20, there are possibly 1048575 buckets. The bucket-based algorithm then has to maintain a large number of buckets. For one thing, this increases the management burden of the file system; for another, this makes each bucket maintain a relatively small number of tuples when the skyline criteria size is not small.

The size of the skyline candidates for pairwise comparison, i.e., the local skylines of all buckets, is size_sc = Σ_{b=(00…01)}^{(11…11)} |SKY_b|, where SKY_b are the skyline tuples in bucket b. Under the independence assumption, the number of local skyline tuples in the bucket of encoded bit-vector b can be estimated as ((ln n_b) + γ)^(r−1)/(r − 1)! [4], where γ ≈ 0.57721 is the Euler–Mascheroni constant. But in this paper, it is found that this cardinality estimation is much lower than the actual cardinality when m is relatively large. Of course, we can use other cardinality estimation methods [12, 22] in such a case.
For simplicity, we still use the cardinality estimation in [4], since it still can provide useful insight for our analysis. Given n = 10, m = 20 and p = 0.5, the total number of all local skyline results is 7641060 even by use of the cardinality formula mentioned above, which is much lower than the actual value. The number of local skyline results, which is used to perform pairwise comparisons, is still too high.

To sum up, the existing skyline algorithms on massive incomplete data all have performance issues.

5 Baseline Algorithm

The existing algorithms, as mentioned in Sects. 2 and 4, have rather poor performance and very long execution times on massive incomplete data. Therefore, this section first devises a baseline algorithm BA which can be used as a benchmark for the algorithm proposed in this paper.

Different from the existing methods, BA adopts a block-nested-loop-like execution. It first retrieves T from the beginning and loads a part of T into the memory, compares the tuples in memory with all tuples in T, and removes the dominated tuples from the memory. Each time, the tuples left in memory have been compared with all other tuples and can be reported as part of the incomplete skyline results. Then, the next part of T is loaded and the similar processing is executed; the iteration continues until all tuples in T have been loaded into memory once and compared with all other tuples. In this paper, let S be the size of allocated memory for storing tuples of T each time; the number of table scans in BA is (8 × M × n)/S + 1. In order to reduce the I/O cost of BA, an n-bit bit-vector B_ret, with each bit initialized to 1, is maintained. In the first iteration, tuples of size S bytes are loaded into memory. Let T_part be the current part of T loaded in memory. The tuples in T_part are compared with all tuples in T. ∀t = T(a), if t is dominated by some tuple in T_part, the ath bit in B_ret is set to 0. Then, in the next iteration, suppose that the next retrieved tuple is T(b): if B_ret(b) = 1, T(b) is retrieved; otherwise, T(b) is skipped directly since it cannot be an incomplete skyline tuple.

Fig. 3 Illustration of row table in running example

Example 1 In the rest of this paper, we use a running example, as depicted in Fig. 3, to illustrate the execution of the algorithms proposed in this paper. In the running example, we set M to be 3, m to be 3, n to be 16 and S to be 256 bytes. The value field of the attributes is [0, 100). According to the parameters, the execution of BA divides into two iterations. In the first iteration, T(1, …, 8) are loaded into memory. As depicted in Fig. 4 (illustration of the execution of the BA algorithm), in the first iteration, only T(8) is left and reported as an incomplete skyline tuple. Besides, T(10), T(11), T(13) are dominated by the in-memory candidates in the first iteration and they are skipped in the second iteration. At the end of the second iteration, T(12) and T(15) are left and reported as incomplete skyline tuples. On the whole, the skyline results in the running example are {T(8), T(12), T(15)}.

6 TSI Algorithm

In this paper, we propose a new algorithm TSI (Table-scan-based Skyline over Incomplete data) to process skyline over massive incomplete data efficiently. TSI performs two passes of scan on the table to compute the skyline results. Section 6.1 describes the basic execution of the TSI algorithm. The pruning operation is presented in Sect. 6.2.

6.1 Basic Process

The basic process of TSI consists of two stages. In stage 1, TSI performs the first-pass scan on T to find the candidate tuples, while in stage 2, TSI scans T again to discard the candidates which are dominated by some tuple. Algorithm 1 is the pseudo-code of the basic process.
Algorithm 1 TSI_basic(T)
Input: T is an incomplete table
Output: S_cnd, a set maintaining the skyline tuples over T
1:  initialize S_cnd ← ∅
2:  // Stage 1: find the candidate tuples
3:  while T has more tuples do
4:      retrieve the next tuple t of T;
5:      if S_cnd = ∅ then
6:          S_cnd ← S_cnd ∪ {t};
7:      else
8:          while S_cnd has more tuples do
9:              retrieve the next tuple p of S_cnd;
10:             if p is dominated by t then
11:                 remove p from S_cnd;
12:             end if
13:         end while
14:         if t is dominated by some p ∈ S_cnd then
15:             discard t;
16:         else
17:             S_cnd ← S_cnd ∪ {t};
18:         end if
19:     end if
20: end while
21: // Stage 2: discard the candidates which are dominated by some tuple
22: while T has more tuples do
23:     retrieve the next tuple t of T;
24:     while S_cnd has more tuples do
25:         retrieve the next candidate can of S_cnd;
26:         if can is dominated by t then
27:             remove can from S_cnd;
28:         end if
29:     end while
30: end while
31: return S_cnd;

In stage 1, TSI retrieves the tuples of T sequentially and maintains the candidate tuples in a set S_cnd, which is empty initially (line 1). Let t be the currently retrieved tuple. If S_cnd is empty, TSI keeps t in S_cnd (lines 5-6). Otherwise, S_cnd is iterated over, and any candidate which is dominated by t is removed from S_cnd (lines 10-11). At the end of the iteration, if t is dominated by some candidate in S_cnd, t is discarded (lines 14-15); otherwise, TSI keeps t in S_cnd (lines 16-17). In stage 1, TSI does not consider the intransitivity and cyclic dominance of skyline on incomplete data.

Theorem 1 When the first-pass scan of TSI is over, S_cnd maintains a superset of the skyline results over T.

Proof ∀t_1 = T(pi_1), if t_1 is a skyline tuple, there is no other tuple in T which can dominate t_1, so t_1 obviously will be kept in S_cnd at the end of stage 1. If t_1 is not a skyline tuple, there is another tuple t_2 = T(pi_2) which can dominate t_1. If pi_1 < pi_2, t_2 will be retrieved after t_1 and will remove t_1 from S_cnd. If pi_1 > pi_2, t_2 is retrieved before t_1; if t_2 has itself been dominated by some tuple and discarded before t_1 arrives, t_1 will still be kept in S_cnd at the end of stage 1. Hence S_cnd contains every skyline tuple and possibly some non-skyline tuples. Q.E.D.
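Algorithm 1 can be written as a short runnable sketch (Python for illustration; `dominates` repeats the common-complete-attribute test so the sketch is self-contained, and `None` marks a missing value):

```python
def dominates(p, q):
    """Dominance over the attributes complete in both tuples (None = missing)."""
    common = [i for i in range(len(p)) if p[i] is not None and q[i] is not None]
    return (bool(common)
            and all(p[i] <= q[i] for i in common)
            and any(p[i] < q[i] for i in common))

def tsi_basic(T):
    """Two-pass skyline over an incomplete table (Algorithm 1 sketch)."""
    # Stage 1: one sequential scan; keep a superset of the skyline (Theorem 1).
    # A candidate is dropped as soon as any tuple dominates it, ignoring the
    # fact that it may itself dominate later tuples.
    S_cnd = []
    for t in T:
        S_cnd = [p for p in S_cnd if not dominates(t, p)]
        if not any(dominates(p, t) for p in S_cnd):
            S_cnd.append(t)
    # Stage 2: another sequential scan removes every candidate that some tuple
    # of T dominates (Theorem 2), leaving exactly the skyline.
    for t in T:
        S_cnd = [c for c in S_cnd if not dominates(t, c)]
    return S_cnd
```

Because stage 1 never retains dominated tuples, cyclic dominance can leave extra candidates in S_cnd; the full re-scan of stage 2 is what removes them.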
Any candidate is discarded as soon as it is dominated by some tuple, even though that candidate may itself dominate the following tuples. In this way, TSI does not need to maintain the dominated tuples, which reduces the in-memory maintenance cost significantly. It is proved in Theorem 1 that S_cnd contains a superset of the query results at the end of stage 1.

In stage 2, TSI performs another sequential scan on T. Let t be the currently retrieved tuple (lines 22-23); any candidates which are dominated by t are removed from S_cnd (lines 26-27). It is proved in Theorem 2 that the candidates in S_cnd are the skyline results at the end of stage 2.

Theorem 2 When the second-pass scan of TSI is over, S_cnd maintains the skyline results over T.

Proof ∀t_1 ∈ S_cnd, if t_1 is not a skyline tuple, there is another tuple t_2 = T(pi_2) which can dominate t_1. In the second-pass scan, TSI will discard t_1 when retrieving t_2. Q.E.D.

The existing algorithms utilize many methods, such as replacement, sortedness and buckets, to deal with intransitivity and cyclic dominance. They usually incur high execution cost on massive incomplete data, as analyzed in Sects. 2 and 4. In this paper, TSI neglects the intransitivity and cyclic dominance in the first-pass scan and leaves the refinement of the skyline results to the second-pass scan.

Example 2 The execution in stage 1 of TSI in the running example is illustrated in Fig. 5. Initially, the candidate set S_cnd is empty. Then, as the first sequential scan is performed, S_cnd = {T(8), T(12), T(15), T(16)} at the end of stage 1. In stage 2, another sequential scan is executed to refine the candidates. As depicted in Fig. 6, T(16) in S_cnd is dominated by T(3). Finally, TSI returns {T(8), T(12), T(15)} as the incomplete skyline results.

Fig. 5 Illustration of execution in stage 1 of TSI
Fig. 6 Illustration of execution in stage 2 of TSI

Time complexity On massive incomplete data, the majority of the execution cost of TSI is consumed in stage 1. The reason is that every tuple retrieved in stage 1 needs to be compared with all candidates in S_cnd, and the size of S_cnd increases during the first-pass scan on T, while the size of S_cnd decreases gradually in stage 2.

Time complexity of stage 1: As shown in Algorithm 1, the time complexity of stage 1 is determined by the nested loop: the outer loop from line 3 to line 20 and the inner loop from line 8 to line 13. Assume that there are n tuples in the incomplete table; in other words, Algorithm 1 needs to retrieve n tuples. The iteration count of the outer loop is O(n). The inner loop involves one sequential scan on S_cnd, whose size is no more than n, and each iteration of the inner loop takes constant time; thus, the time complexity of the inner loop is O(|S_cnd|). On the whole, the time complexity of stage 1 is determined by the number of tuples in T and the number of candidates in S_cnd, i.e., it is O(n × |S_cnd|).

Time complexity of stage 2: The execution of stage 2 is also described in Algorithm 1. Obviously, the cost of stage 2 has a similar form, i.e., the product of n and the size of S_cnd, but it is usually insignificant compared with the cost of stage 1. The reason is that, if the skyline is relatively small, the size of S_cnd while the candidate superset is being generated in stage 1 is much larger than the size of S_cnd in stage 2, which holds only the skyline tuples; hence the cost of stage 1 often dominates the overall execution cost. On the whole, the time complexity of Algorithm 1 is O(n × |S_cnd|), which is O(n^2) in the worst case.

In Sect. 6.2, we propose a pruning method to skip the unnecessary tuples in the sequential scan and improve the performance of TSI further.

6.2 Pruning Operation

6.2.1 Intuitive Idea

On massive incomplete data, as analyzed above, the majority of the execution cost of TSI is consumed in stage 1, in which TSI computes the candidates of the skyline over T. Obviously, a tuple cannot be a skyline tuple if it is dominated by some other tuple. In stage 1, TSI therefore utilizes some pre-constructed data structures to skip the tuples of T which are dominated. In this way, TSI speeds up its execution in stage 1, since the pruning operation not only reduces the I/O cost of retrieving tuples, but also reduces the computation cost of dominance checking.

6.2.2 Dominance Checking on Incomplete Data

Given t_1 ∈ T and ∀t_2 ∈ T, let C be the set of common complete attributes among the skyline criteria of t_1 and t_2. For one thing, if t_1 dominates t_2, then ∀A ∈ C, t_1.A ≤ t_2.A and ∃A ∈ C, t_1.A < t_2.A; supposing that t_1 is the currently obtained tuple, we can utilize the values of t_1 to skip the tuples dominated by it. For another, if C is empty, t_1 and t_2 cannot be compared in terms of dominance checking. Therefore, the key to dominance checking on incomplete data is (1) the comparison of the complete attributes and (2) the representation of the incomplete attributes. In the following, we introduce how to construct data structures to solve these two issues.

In this paper, the value of any incomplete attribute is regarded as positive infinity, since smaller values are preferred. Given the table T(A_1, …, A_M), a sorted list SL_i (1 ≤ i ≤ M) is built for each attribute. The schema of SL_i is SL_i(PI_T, A_i), where PI_T is the positional index of the tuple in T, and the tuples of SL_i are arranged in the ascending order of A_i. By the sorted lists, TSI constructs the structure MCR (Membership Checking Result) to compare the complete attributes. For the sorted list SL_i (1 ≤ i ≤ M), MCR_{i,b} (1 ≤ b ≤ ⌊log_2 n⌋) is an n-bit bit-vector representing the membership checking results of SL_i(1, …, 2^b).PI_T: ∀t = T(a) (1 ≤ a ≤ n), if a ∈ SL_i(1, …, 2^b).PI_T, then MCR_{i,b}(a) = 1; otherwise, MCR_{i,b}(a) = 0, where MCR_{i,b}(a) is the a-th bit of MCR_{i,b}. The maximum values of SL_i(1, …, 2^b).A_i (1 ≤ b ≤ ⌊log_2 n⌋) are kept in an array ITV_i, i.e., ITV_i[b] = SL_i(2^b).A_i.

For the representation of the incomplete attributes, TSI performs a sequential scan on T and constructs the structure RIA, which consists of M n-bit bit-vectors. For RIA_i (1 ≤ i ≤ M) and ∀t = T(a) (1 ≤ a ≤ n), if T(a).A_i is a complete attribute, then RIA_i(a) = 1; otherwise, RIA_i(a) = 0.

Example 3 The required data structures mentioned above are illustrated in Fig. 7. SL_1, SL_2, SL_3 are three sorted lists, whose elements are arranged in the ascending order of A_1, A_2, A_3, respectively. MCR_{1,1} is a 16-bit bit-vector representing the membership checking results of SL_1(1, …, 2^1).PI_T, i.e., 12 and 8. Therefore, the 8th bit and the 12th bit of MCR_{1,1} are 1: MCR_{1,1} = 0000000100010000. ITV_1 keeps the attribute values at exponential gaps in SL_1, i.e., SL_1(2^1).A_1, SL_1(2^2).A_1, SL_1(2^3).A_1, SL_1(2^4).A_1: ITV_1 = {26, 47, 65, +∞}. The other MCR bit-vectors and the other ITVs can be obtained similarly. The structure RIA_i represents the incomplete values of A_i. In the running example, T(1).A_1 and T(4).A_1 are incomplete attribute values; therefore, RIA_1 = 0110111111111111. Similarly, we can generate RIA_2 and RIA_3.

Fig. 7 Illustration of MCR and RIA in the running example

By the structures MCR and RIA, given t_1 ∈ T, we want to know which tuples in T are dominated by t_1. Let S_c be the set of the complete attributes among A_1, A_2, …, A_m of t_1; without loss of generality, assume that S_c = {A_1, …, A_|S_c|}. ∀A_i ∈ S_c (1 ≤ i ≤ |S_c|), we determine the first value ITV_i[b_i] of ITV_i which is greater than t_1.A_i, i.e., ITV_i[b_i − 1] ≤ t_1.A_i < ITV_i[b_i], where ITV_i[0] is assigned negative infinity. Let DBV_{t_1} be the n-bit bit-vector of dominance checking corresponding to t_1, whose bits are initialized to 1. It is proved by Theorem 3 that the bit 1s of DBV_{t_1} = (∧_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (∨_{i=1}^{|S_c|} RIA_i) correspond to the tuples dominated by t_1.

Theorem 3 The bit 1s of DBV_{t_1} = (∧_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (∨_{i=1}^{|S_c|} RIA_i) represent the tuples which are dominated by t_1.

Proof As mentioned above, the value b_i is determined as the minimum integer value satisfying ITV_i[b_i] > t_1.A_i. Therefore, the bit 1s of ¬MCR_{i,b_i} represent tuples whose A_i values are greater than t_1.A_i. Since we treat the incomplete attribute values as positive infinity, ∧_{i=1}^{|S_c|} ¬MCR_{i,b_i} represents tuples whose values of A_1, …, A_|S_c| are all greater than those of t_1. Given t_2 among these tuples, if at least one of A_1, …, A_|S_c| of t_2 is a complete attribute, t_2 is dominated by t_1 according to the dominance definition over incomplete data. If all of A_1, …, A_|S_c| of t_2 are incomplete, t_1 and t_2 are not comparable from the perspective of the dominance relationship. The bit 1s of ∨_{i=1}^{|S_c|} RIA_i mean that at least one of A_1, …, A_|S_c| is complete, and the bit 0s indicate that all of A_1, …, A_|S_c| are incomplete. Consequently, the bit 1s of DBV_{t_1} = (∧_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (∨_{i=1}^{|S_c|} RIA_i) represent the tuples which are dominated by t_1. Q.E.D.
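The structures above can be sketched with Python integers serving as n-bit bit-vectors. This is an illustrative reconstruction under two stated assumptions: bit positions are 0-based here (the paper counts tuples from 1), and missing values (`None`) sort as positive infinity; `dominated_by` returns the DBV vector, a sound subset of the tuples certainly dominated by t1:

```python
import math

INF = float("inf")

def build_structures(T):
    """Per attribute i: sorted order SL_i (indices ordered by A_i, None last),
    MCR[i][b] = bitmask of the tuples among the 2^(b+1) smallest of SL_i,
    ITV[i][b] = largest A_i value in that prefix, and RIA[i] = bitmask of the
    tuples whose A_i is complete."""
    n, M = len(T), len(T[0])
    val = lambda a, i: INF if T[a][i] is None else T[a][i]
    MCR, ITV, RIA = [], [], []
    for i in range(M):
        order = sorted(range(n), key=lambda a: val(a, i))
        masks, maxima = [], []
        for b in range(1, max(1, math.ceil(math.log2(n))) + 1):
            size = min(2 ** b, n)
            mask = 0
            for a in order[:size]:
                mask |= 1 << a
            masks.append(mask)
            maxima.append(val(order[size - 1], i))
        MCR.append(masks)
        ITV.append(maxima)
        RIA.append(sum(1 << a for a in range(n) if T[a][i] is not None))
    return MCR, ITV, RIA

def dominated_by(t1, T, MCR, ITV, RIA):
    """DBV for t1: AND over t1's complete attributes of ~MCR_{i,b_i}, where
    b_i is the first prefix whose maximum exceeds t1.A_i, AND the OR of the
    matching RIA vectors. Bit a = 1 only for tuples certainly dominated by t1."""
    full = (1 << len(T)) - 1
    dbv, some_complete = full, 0
    for i, v in enumerate(t1):
        if v is None:
            continue                       # A_i is not a complete attribute of t1
        b = next((b for b, m in enumerate(ITV[i]) if m > v), None)
        prefix = MCR[i][b] if b is not None else full
        dbv &= full & ~prefix              # tuples whose A_i is certainly > t1.A_i
        some_complete |= RIA[i]
    return dbv & some_complete
```

Because b_i is rounded to a power-of-two prefix, some dominated tuples are missed; that is acceptable, since DBV only feeds the pruning bit-vector PRB and stage 2 still verifies every candidate.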
6.2.3 The Extraction of the Pruning Tuples

In order to skip the unnecessary tuples of T in stage 1, we first extract some pruning tuples for the following execution of TSI. The number of pruning tuples should not be large, and they should have relatively strong dominance capability. Since the dimensionality of T can be high, we do not extract the pruning tuples with respect to the combinations of different attributes, but with respect to the values of a single attribute and the number of complete attributes of each tuple. It is known that the cardinality of the skyline results grows exponentially with the size of the skyline criteria [4], and on incomplete data the dominance relationship between two tuples is evaluated over their common complete attributes. Intuitively, if a tuple has a small number of complete attributes and one of its complete attributes is very small, it tends to have a relatively strong dominance capability.

The pruning tuples can be extracted from M sorted column files SC_1, SC_2, …, SC_M. The schema of SC_i (1 ≤ i ≤ M) is (PI_T, NUM_c, A_i), where NUM_c is the number of complete attributes of the tuple. The tuples of SC_i (1 ≤ i ≤ M) are sorted on NUM_c and A_i, i.e., they are first arranged in the ascending order of NUM_c, and then all tuples with the same NUM_c are arranged in the ascending order of A_i.

For each sorted column file SC_i, we retrieve its tuples sequentially. Let sc be the currently retrieved tuple. If sc.A_i is within the first f% proportion among all A_i values, the PI_T value of sc is maintained in memory; otherwise, the next tuple is retrieved. The process continues until the number of PI_T values maintained in memory reaches n_pt or the end of the file is reached. Then, the corresponding tuples of T are extracted and kept in a separate pruning tuple file PT_i. In this paper, f is set to 5 and n_pt is set to 1000; the pruning effect with such a parameter setting is satisfactory in the performance evaluation.

Example 4 Figure 8 illustrates the extraction of pruning tuples in the running example. Each SC_i (1 ≤ i ≤ 3) is arranged first in the ascending order of NUM_c, and the tuples with the same value of NUM_c are sorted in the ascending order of A_i. In the running example, f = 12.5 (16 × 12.5% = 2) and n_pt = 1, so one pruning tuple will be retrieved for each SC_i. For SC_1, SC_1(1, …, 11) cannot be used to generate pruning tuples, since their attribute values are not within the first two smallest values of A_1. Then, SC_1(12) is selected to obtain the pruning tuple T(SC_1(12).PI_T), since it is the first tuple in SC_1 whose A_1 value is among the first two smallest values of A_1. The other pruning tuples (T(14) and T(6)) are obtained similarly.

Fig. 8 Illustration of extracting pruning tuples in the running example
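The extraction step can be sketched as follows, under our own simplifications: the f% threshold is taken over the complete A_i values, the sorted column files are simulated in memory, and `None` marks a missing value:

```python
def extract_pruning_tuples(T, f_percent, n_pt):
    """Per attribute i: scan SC_i, i.e., the tuple indices sorted on
    (NUM_c, A_i) with fewer complete attributes first, and keep up to n_pt
    indices whose A_i lies within the smallest f_percent of all A_i values."""
    M = len(T[0])
    pruning_files = []
    for i in range(M):
        complete_vals = sorted(t[i] for t in T if t[i] is not None)
        k = max(1, int(len(complete_vals) * f_percent / 100))
        threshold = complete_vals[k - 1]          # value at the f% proportion
        # SC_i with schema (PI_T, NUM_c, A_i), sorted on (NUM_c, A_i)
        sc = sorted((a for a in range(len(T)) if T[a][i] is not None),
                    key=lambda a: (sum(v is not None for v in T[a]), T[a][i]))
        pruning_files.append([a for a in sc if T[a][i] <= threshold][:n_pt])
    return pruning_files
```

With f = 5 and n_pt = 1000 as in the paper, each PT_i holds at most 1000 positional indices of tuples that combine few complete attributes with a very small A_i value.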
6.2.4 The Execution of Pruning Operation

By the pre-constructed structures described above, TSI can utilize the pruning operation to reduce the execution cost in stage 1. In order to execute the pruning operation, TSI maintains an n-bit pruning bit-vector PRB in memory, which is filled with bit 0 initially.

Algorithm 2 TSI_pruning(T, S_cnd)
Input: T is an incomplete table; S_cnd is a set maintaining the candidate tuples
Output: S_cnd, a set maintaining the skyline tuples over T
1:  MH is a min-heap to keep m pruning tuples with the highest dominance capability;
2:  initialize S_cnd ← ∅, MH ← ∅;
3:  // Stage 1: find the candidate tuples
4:  extract the involved pruning tuple files PT_1, PT_2, …, PT_m for the skyline criteria of T, and put their tuples into MH;
5:  while MH has more pruning tuples do
6:      retrieve the next pruning tuple pt of MH;
7:      let S_c be the complete attributes of pt, S_c = {A_1, …, A_|S_c|};
8:      if PRB(pt.PI_T) = 1 then
9:          pt can be skipped;
10:     else
11:         for (i = 1; i ≤ |S_c|; i++) do
12:             compute the first value ITV_i[b_i] of ITV_i which is greater than pt.A_i;
13:         end for
14:         set the (pt.PI_T)-th bit of PRB to 1;
15:         if S_cnd = ∅ then
16:             S_cnd ← S_cnd ∪ {pt};
17:         else
18:             while S_cnd has more tuples do
19:                 retrieve the next tuple p of S_cnd;
20:                 if p is dominated by pt then
21:                     remove p from S_cnd;
22:                 end if
23:             end while
24:             if pt is dominated by some p ∈ S_cnd then
25:                 discard pt;
26:             else
27:                 S_cnd ← S_cnd ∪ {pt};
28:             end if
29:             DBV_pt ← (∧_{i=1}^{|S_c|} ¬MCR_{i,b_i}) ∧ (∨_{i=1}^{|S_c|} RIA_i);
30:             PRB ← PRB ∨ DBV_pt;
31:         end if
32:     end if
33: end while
34: // Stage 2: discard the candidates which are dominated by some tuple
35: while T has more tuples do
36:     retrieve the next tuple t of T;
37:     while S_cnd has more tuples do
38:         retrieve the next candidate can of S_cnd;
39:         if can is dominated by t then
40:             remove can from S_cnd;
41:         end if
42:     end while
43: end while
44: return S_cnd;

Algorithm 2 is the pseudo-code of the execution of the pruning operation. At the beginning of stage 1, TSI determines the involved pruning tuple files PT_1, PT_2, …, PT_m according to the current skyline criteria and retrieves the pruning tuples from them. In the process of retrieving PT_1, PT_2, …, PT_m, TSI maintains a min-heap MH in memory to keep the m pruning tuples with the highest dominance capability (line 4). Given a pruning tuple pt, let S_c be its complete attributes; without loss of generality, assume that S_c = {A_1, …, A_|S_c|} (lines 5-7). ∀1 ≤ i ≤ |S_c|, we determine the first value ITV_i[b_i] of ITV_i which is greater than pt.A_i; the dominance capability of pt is computed as a product over the resulting prefix sizes, normalized by n (lines 11-13). For the retrieved pruning tuple pt, TSI sets the (pt.PI_T)-th bit of PRB to 1, since pt has already been retrieved (line 14). Besides, for each pruning tuple pt, TSI removes any candidates in S_cnd which are dominated by pt (lines 18-23); if pt is not dominated by any candidate in S_cnd, TSI keeps it in S_cnd (lines 26-27). ∀pt_b ∈ MH (1 ≤ b ≤ m), TSI computes its corresponding bit-vector DBV_{pt_b} of dominance checking as in Sect. 6.2.2 (line 29). The final pruning bit-vector is PRB = PRB ∨ (∨_{b=1}^{m} DBV_{pt_b}) (line 30).

Example 5 The construction of PRB in the running example is illustrated in Fig. 9. For the pruning tuple T(6) = (56, 3, 0), TSI determines MCR_{1,3}, MCR_{2,1}, MCR_{3,1}, which correspond to the values of T(6). The tuples dominated by T(6) can be specified by the bit-vector DBV_{T(6)} = 1101001011000000. Similarly, we obtain DBV_{T(12)} and DBV_{T(14)}. Since T(6), T(12), T(14) are the pruning tuples, after retrieving them, PRB is set to 0000010000010100, i.e., the 6th bit, the 12th bit and the 14th bit are 1. The final pruning bit-vector is PRB = PRB ∨ (DBV_{T(6)} ∨ DBV_{T(12)} ∨ DBV_{T(14)}) = 1111111011111100.

In stage 1, ∀1 ≤ a ≤ n, if PRB(a) = 1, T(a) can be skipped; otherwise, TSI needs to retrieve T(a). The rest of the execution in stage 1 is the same as that in Sect. 6.1.

Example 6 In the running example, TSI only needs to retrieve three tuples (T(8), T(15), T(16)) in stage 1 by use of PRB. This reduces the I/O cost and the computation cost significantly.

Fig. 9 Illustration of constructing PRB in the running example

7 Performance Evaluation

7.1 Experimental Settings

To evaluate the performance of TSI, we implement it in Java with jdk-8u20-windows-x64. The experiments are executed on a LENOVO ThinkCentre M8400 (Intel(R) Core(TM) i7 CPU @ 3.40 GHz (8 CPUs), 32 GB memory, 3 TB HDD, 64-bit Windows 7). In the experiments, we implement TSI, BA, SOBA [10] and SIDS [1]. With the experimental settings below, the execution times of SOBA and SIDS are so long that we do not report their experimental results here, but evaluate them separately in Sect. 7.8. For BA, the size S of the allocated memory is 4 GB. We do not use a larger size for BA because, with the assistance of the bit-vector B_ret mentioned in Sect. 5, a larger value of S makes more tuples of T be loaded into memory at a time and reduces the number of iterations, but it also reduces the proportion of the retrieval which can use the optimization of the skipping operation.

In the experiments, we evaluate the performance of TSI in terms of several aspects: tuple number (n), used attribute number (m), incomplete ratio (p) and correlation coefficient (c). The experiments are executed on three data sets: two synthetic data sets (independent distribution and correlated distribution) and a real data set. The used parameter settings are listed in Table 2.

Table 2 Parameter settings
Parameter                        Used values
Tuple number (10^6) (syn)        5, 10, 50, 100, 500
Skyline criteria size (syn)      10, 15, 20, 25
Incomplete ratio (syn)           0.3, 0.4, 0.5, 0.6, 0.7
Correlation coefficient (syn)    -0.8, -0.4, 0, 0.4, 0.8
Incomplete ratio (real)          0.3, 0.4, 0.5, 0.6, 0.7

For the correlated distribution, the first two attributes have the specified correlation coefficient, while the remaining attributes follow the independent distribution. In order to generate two sequences of random numbers with correlation coefficient c, we first generate two sequences of uncorrelated distributed random numbers X_1 and X_2, then a new sequence Y_1 = c × X_1 + sqrt(1 − c^2) × X_2 is generated, and we get two sequences X_1 and Y_1 with the given correlation coefficient c. When generating the synthetic data, we fix the number of attributes M to 60 and generate data with all complete attributes. Then, according to the used skyline criteria, we first select one attribute, which remains complete; the other (m − 1) attributes in the skyline criteria have a probability p of being incomplete independently.

The real data used are the HIGGS Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HIGGS#); it is provided for a classification problem and includes 11,000,000 instances. The main reasons for using HIGGS are that (1) HIGGS is one of the largest databases known to us, so it gives us better access to compare the performance of the above algorithms, and (2) it is an open data set that we can find and obtain expediently. On the real data, we evaluate the performance of TSI with varying values of p.

The required structures are pre-constructed before the experiments. Under the default setting of the experiments, i.e., M = 60, n = 50 × 10^6 and p = 0.3, it takes 6840.573 seconds to pre-construct the required data structures.

Fig. 10 Comparison between TSI and TSI_basic: (a) execution time, (b) candidate size, (c) TSI_basic decomposition, (d) TSI decomposition, (e) the I/O cost, (f) comparison number

7.2 The Comparison of TSI with and Without Pruning

The performance of TSI and TSI_basic is compared in different aspects, where TSI_basic denotes the TSI algorithm without the pruning operation (Algorithm 1). As depicted in Fig. 10a, TSI runs 18.84 times faster than TSI_basic, and the speedup ratio increases with a greater value of n. This significant advantage is due to the effective pruning operation. The numbers of the candidates after stage 1 are illustrated in Fig. 10b. TSI maintains more candidates than TSI_basic after stage 1. This is because the pruning operation skips most of the tuples in stage 1, and therefore many candidates which would have been removed by some tuples are left. But the pruning operation reduces the cost in stage 1 significantly. Figure 10c reports the time decomposition of TSI_basic; obviously, the execution time of stage 1 dominates its overall time, and the time in stage 2 is hardly visible due to its rather small proportion. Figure 10d gives the time decomposition of TSI, which consists of four parts: the time to retrieve the pruning tuples, the time to load the required bit-vectors, the time in stage 1 and the time in stage 2. The time in stage 2 of TSI is longer than that of TSI_basic due to the greater number of candidates left. However, the time reduction in stage 1 of TSI is much more significant, and TSI runs one order of magnitude faster than TSI_basic on average. As shown in Fig. 10(e and f), the pruning operation makes TSI incur less I/O cost and perform a smaller number of dominance checks.
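The generation of a correlated attribute pair described in Sect. 7.1 can be sketched as follows; using Gaussian sources is our assumption, since the construction Y_1 = c·X_1 + sqrt(1 − c^2)·X_2 yields correlation c for any zero-mean, unit-variance independent sources:

```python
import math
import random

def correlated_pair(n, c, seed=0):
    """Two length-n sequences with (population) correlation coefficient c:
    X1 and an independent X2 are standard normal, and
    Y1 = c*X1 + sqrt(1 - c^2)*X2."""
    rng = random.Random(seed)
    X1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    X2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    Y1 = [c * x1 + math.sqrt(1.0 - c * c) * x2 for x1, x2 in zip(X1, X2)]
    return X1, Y1
```

The design works because cov(X1, Y1) = c·var(X1) and var(Y1) = c^2 + (1 − c^2) = 1, so the correlation coefficient of the pair is exactly c in expectation.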
11 Effect of tuple number 100000 100000 1e+012 1e+012 10000 10000 1e+011 1e+011 1000 1000 1e+010 1e+010 BA BA BA BA TS TSII TS TSII 100 100 1e+009 1e+009 5 5 10 10 50 50 100 100 500 500 5 5 10 10 50 50 100 100 500 500 6 6 6 6 tuple number (1 tuple number (10 0 ) ) tuple number (1 tuple number (10 0 ) ) (a)Execution time (b)The I/Ocost 1e+013 1e+013 1 1 0.99 0.99 1e+012 1e+012 0.98 0.98 0.97 0.97 1e+011 1e+011 0.96 0.96 0.95 0.95 1e+010 1e+010 BA BA 0.94 0.94 TS TSII 1e+009 1e+009 0.93 0.93 5 5 10 10 50 50 100 100 500 500 5 5 10 10 50 50 100 100 50 500 0 6 6 6 6 tuple number (1 tuple number (10 0 ) ) tuple number (1 tuple number (10 0 ) ) (c)Comparisonnumber (d)The pruningratio Fig. 12 Effect of skyline criteria 1e+00 1e+006 6 1e+012 1e+012 size 100000 100000 1e+011 1e+011 10000 10000 1e+010 1e+010 1000 1000 1e+009 1e+009 100 100 BA BA BA BA TSI TSI TSI TSI 10 10 1e+008 1e+008 10 10 15 15 20 20 25 25 10 10 15 15 20 20 25 25 skyline criteria size skyline criteria size skyline criteria siz skyline criteria size e (a)Execution time (b)The I/Ocost 1e+013 1e+013 1 1 1e+012 1e+012 0.99 0.99 1e+011 1e+011 0.98 0.98 1e+010 1e+010 0.97 0.97 1e+009 1e+009 0.96 0.96 1e+008 1e+008 0.95 0.95 1e+007 1e+007 0.94 0.94 1e+006 1e+006 0.93 0.93 BA BA 100000 100000 0.92 0.92 TSI TSI 10000 10000 0.91 0.91 10 10 15 15 20 20 25 25 10 10 15 15 20 20 25 25 skyline criteria size skyline criteria size skyline criteria siz skyline criteria size e (c)Comparison number (d)Thepruning ratio than BA. The performance advantage of TSI over BA is n = 500 × 10 . Figure 11b depicts that TSI incurs 6.73 times widened with the greater value of n. At n = 5 × 10 , BA can less I/O cost than BA. And as illustrated in Fig. 
11c, TSI load all T into memory and perform another table scan on performs 38.17 times fewer number of dominance checking 1 3 number of dominance checkin number of dominance checking g number of dominance checking number of dominance checking time(s time(s) ) time(s) time(s) retrieved byte number retrieved byte number retrieved byte number retrieved byte number pruning ratio pruning ratio pruning rati pruning ratio o 116 J. He, X. Han 100000 100000 1e+012 1e+012 Fig. 13 Effect of incomplete BA BA ratio TS TSII 10000 10000 1e+011 1e+011 1000 1000 1e+010 1e+010 10 100 0 1e+009 1e+009 10 10 BA BA TSI TSI 1 1 1e+008 1e+008 0. 0.3 3 0. 0.4 4 0. 0.5 5 0. 0.6 6 0. 0.7 7 0. 0.3 3 0. 0.4 4 0. 0.5 5 0. 0.6 6 0. 0.7 7 incomplete rati incomplete ratio o incomplete rati incomplete ratio o (a)Execution time (b)The I/Ocost 1e+012 1e+012 1 1 1e+011 1e+011 0.99 0.999 9 1e+010 1e+010 1e+009 1e+009 0.99 0.998 8 1e+008 1e+008 1e+007 1e+007 0.99 0.997 7 1e+006 1e+006 0.99 0.996 6 100000 100000 1000 10000 0 0.99 0.995 5 BA BA 1000 1000 TSI TSI 10 100 0 0.99 0.994 4 0. 0.3 3 0. 0.4 4 0. 0.5 5 0. 0.6 6 0. 0.7 7 0. 0.3 3 0. 0.4 4 0. 0.5 5 0. 0.6 6 0. 0.7 7 incomplete rati incomplete ratio o incomplete rati incomplete ratio o (c)Comparisonnumber (d)The pruningratio Fig. 14 Effect of correlation 100000 100000 1e+012 1e+012 BA BA coefficient TS TSII BA BA 1000 10000 0 1e+011 1e+011 TSI TSI 1000 1000 1e+010 1e+010 -0.8 -0.8 -0.4 -0.4 0 0 0. 0.4 4 0. 0.8 8 -0.8 -0.8 -0.4 -0.4 0 0 0. 0.4 4 0. 0.8 8 correlation coefficient correlation coefficient correlation coefficient correlation coefficient (a)Execution time (b)The I/Ocost 1e+012 1e+012 0.99 0.996 6 0.9955 0.9955 0.99 0.995 5 1e+011 1e+011 0.9945 0.9945 0.99 0.994 4 0.9935 0.9935 1e+010 1e+010 0.99 0.993 3 BA BA 0.9925 0.9925 TSI TSI 1e+009 1e+009 0.99 0.992 2 -0.8 -0.8 -0.4 -0.4 0 0 0. 0.4 4 0. 0.8 8 -0.8 -0.8 -0.4 -0.4 0 0 0. 0.4 4 0. 
T to compute incomplete skyline results. At n = 500 × 10, BA needs to execute 56 iterations, each loading a part of T into memory and then performing a table scan on T to remove the dominated tuples. On the contrary, TSI shows a slower growing trend on the tuple number due to its execution process and pruning operation. As illustrated in Fig. 11d, the pruning operation of TSI can skip the vast majority of tuples in stage 1. The pruning ratio in the experiments is computed by the formula n_skip / n, where n_skip is the number of tuples skipped in stage 1.

7.4 Experiment 2: the Effect of Skyline Criteria Size

Given M = 60, n = 50 × 10, p = 0.3 and c = 0, experiment 2 evaluates the performance of TSI on varying skyline criteria sizes. As illustrated in Fig. 12a, with a greater value of m, the execution times of BA and TSI both increase significantly; TSI still runs 85.79 times faster than BA on average. For BA, its I/O cost depends on two parts. For one thing, BA needs to retrieve T once to load it into memory. For another, BA performs a sequential scan on T in each iteration to discard the candidates in memory which are dominated by some tuples. For the first part, BA may not retrieve all tuples into memory, since the current tuples may be dominated in the previous iterations. For the second part, if the current candidates are all discarded, BA does not have to continue the sequential scan but just performs the next iteration directly. When the value of m increases, given that the other parameters are fixed, the probability that a tuple is dominated by other tuples becomes lower. Therefore, the I/O cost increases in both parts. This is reported in Fig. 12b. For TSI, its I/O cost also consists of two parts. In stage 1, TSI performs a selective scan on T to obtain the candidates of the incomplete skyline results. In stage 2, TSI does another sequential scan on T to compute the results, in which, if all candidates are removed, TSI can terminate directly. As the value of m increases, the pruning effect of TSI in stage 1 becomes worse, which is also verified in Fig. 12d, and TSI has to retrieve more tuples before it terminates in stage 2. This makes a higher I/O cost for TSI with a greater value of m, as illustrated in Fig. 12b. With a similar explanation, as shown in Fig. 12c, the numbers of dominance checking for both algorithms increase with a greater value of m.

7.5 Experiment 3: the Effect of Incomplete Ratio

Given m = 20, M = 60, n = 50 × 10 and c = 0, experiment 3 evaluates the performance of TSI on varying incomplete ratios. As the value of p increases, the execution time of BA decreases quickly, while the execution time of TSI first decreases and then increases gradually. For BA, the decline of the execution time is easy to understand. With a greater value of p, the probability that any tuple is dominated by other tuples increases. This makes more in-memory candidates in each iteration dominated by some tuples in the sequential scan, which reduces both the I/O cost and the dominance checking cost. As illustrated in Fig. 13c, with a greater value of p, the number of dominance checking in BA decreases constantly. And as shown in Fig. 13b, the I/O cost of BA first decreases significantly when p increases from 0.3 to 0.4, and then remains basically unchanged. When p increases from 0.3 to 0.4, the number of in-memory candidates is reduced during the sequential scan, and in each iteration BA terminates earlier. This makes less I/O cost for BA. When the value of p is greater than 0.4, the number of in-memory candidates is reduced also, but in each iteration BA reaches an approximately equal scan depth before it terminates. For TSI, the effect of the pruning operation depends on two factors. One is the probability that one tuple can be dominated by other tuples. The other is whether all common attributes of two tuples are incomplete. The two factors have different effects in different cases. With a greater value of p, the probability of a tuple being dominated by some tuples increases, and so does the probability that the common attributes of two tuples are all incomplete. When p increases from 0.3 to 0.5, the first factor has a greater impact; after that, the second factor plays a larger role. This explains the trend of the execution time of TSI. Similarly, this can explain the variation trends of TSI in I/O cost (Fig. 13b), the number of dominance checking (Fig. 13c), and the pruning ratio (Fig. 13d).

7.6 Experiment 4: the Effect of Correlation Coefficient

Given m = 20, M = 60, n = 50 × 10 and p = 0.3, experiment 4 evaluates the performance of TSI on varying correlation coefficients. As illustrated in Fig. 14a, TSI runs 47.72 times faster than BA. The correlation coefficients considered range from −0.8 to 0.8. A negative correlation means that there is an inverse relationship between two variables: when one variable decreases, the other increases. A positive correlation means that the variables tend to move in the same direction. Therefore, skyline computation on negatively correlated data is usually more expensive than that on positively correlated data. The execution times of TSI and BA both show a downward trend in experiment 4. Here, the trend is not significant because the incomplete attributes in the data set reduce the impact of correlation. The I/O cost and the number of dominance checking are depicted in Fig. 14(b and c), respectively, and they have similar variation trends. The effect of the pruning operation of TSI is illustrated in Fig. 14d. Due to the impact of the incomplete attributes, the pruning ratio shows considerable fluctuation, but it still shows an upward trend overall.

7.7 Experiment 5: Real Data

The real data, the HIGGS Data Set, are obtained from the UCI Machine Learning Repository. It contains 11,000,000 tuples with 28 attributes. We select the first 20 attributes as skyline criteria and evaluate the performance of TSI with varying incomplete ratios. Before the experiment is executed, one attribute is first chosen to be complete, and the other (m − 1) attributes in the skyline criteria have a probability p of being incomplete independently. As depicted in Fig. 15a, TSI runs 40.46 times faster than BA. The variation trends of the execution times of BA and TSI are very close to those in Sect. 7.5 and can be explained similarly. The I/O cost and the number of dominance checking are depicted in Fig. 15(b and c), respectively. The pruning ratio of TSI is illustrated in Fig. 15d. The variation in these figures can be explained similarly as in Sect. 7.5.

Fig. 15 Effect of real data: (a) execution time; (b) the I/O cost; (c) comparison number; (d) the pruning ratio
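Throughout these experiments, the dominance-checking counts refer to pairwise dominance tests on incomplete tuples: two tuples are compared only on the attributes observed in both of them, which is exactly why transitivity is lost and cyclic dominance arises. A minimal sketch of such a test (assuming, for illustration only, that missing values are encoded as None and that smaller values are preferred; the function name is ours, not the paper's):

```python
from typing import Optional, Sequence

def dominates(p: Sequence[Optional[float]], q: Sequence[Optional[float]]) -> bool:
    """Dominance on incomplete data: p dominates q if, over the
    attributes observed in BOTH tuples, p is no worse everywhere and
    strictly better at least once. If the two tuples share no complete
    attribute, neither dominates the other."""
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:   # not a common attribute: skip it
            continue
        if a > b:                    # p is worse on a common attribute
            return False
        if a < b:
            strictly_better = True
    return strictly_better
```

On complete data this test is transitive, but on incomplete data it can cycle: with t1 = (1, 2, None), t2 = (None, 1, 2) and t3 = (2, None, 1), t2 dominates t1, t3 dominates t2, and t1 dominates t3.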
Fig. 16 Comparison with BA, SOBA, and SIDS: (a) execution time; (b) the I/O cost

7.8 Experiment 6: the Comparison with SOBA and SIDS

In this part, we evaluate the performance of TSI against BA, SOBA and SIDS on a relatively small data set with a relatively small skyline criteria size. Given n = 10 × 10, p = 0.3 and c = 0, in order to acquire a better performance for SOBA and SIDS, we set the value of m from 6 to 10, and the value of M equal to that of m. This can reduce the length of each tuple and also lower the cost of bucket partitioning for SOBA and SIDS.

As illustrated in Fig. 16a, SIDS is the slowest among the four algorithms, while TSI is the fastest across the various skyline criteria sizes, and the execution time of SOBA increases significantly with the value of m. When m = 10, SOBA runs 10.96 times slower than BA, the baseline algorithm in this paper, and 200.91 times slower than TSI. As for SIDS, it runs 21.11 times slower than BA and 386.84 times slower than TSI. On disk-resident data, SOBA and SIDS cannot process incomplete skyline efficiently.
The bucket partitioning of SOBA involves two passes of table scan, not to mention the maintenance cost of the large number of partitions on disk if the value of m is not small. Then, the computation of the local skyline involves another pass of tuple retrieval. For a relatively large value of m, the number of local skyline tuples is also great. As depicted in Fig. 16b, the local skyline makes up 11.7% of the total tuples at m = 10. The I/O cost of SOBA is much larger than that of the others, while SIDS and BA are close to each other in I/O cost. The execution time of SIDS grows quickly with respect to the skyline criteria size. The performance of TSI is efficient not only for in-memory data sets with a small skyline criteria size, but also for disk-resident data with a skyline criteria size that is not small.

8 Conclusion

This paper considers the problem of incomplete skyline computation on massive data. It is analyzed that the existing algorithms cannot process the problem efficiently. A table-scan-based algorithm, TSI, is devised in this paper to deal with the problem efficiently. Its execution consists of two stages. In stage 1, TSI maintains the candidates by a sequential scan. And in stage 2, TSI performs another sequential scan to refine the candidates and acquire the final results. In order to reduce the cost of stage 1, which dominates the overall cost of TSI, a pruning operation is utilized to skip the unnecessary tuples. The experimental results show that TSI outperforms the existing algorithms significantly.
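The two-stage execution recapped above can be illustrated with a small in-memory sketch (ours, not the paper's disk-based implementation: it omits TSI's pruning structures and assumes None-encoded missing values with smaller-is-better dominance). It shows why the second scan is necessary: since dominance is not transitive on incomplete data, a candidate that survives stage 1 may still be dominated by a tuple that was itself discarded earlier.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[float], ...]

def dominates(p: Row, q: Row) -> bool:
    # Compare only the attributes observed in both tuples; smaller is better.
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue
        if a > b:
            return False
        if a < b:
            strictly_better = True
    return strictly_better

def two_stage_skyline(table: List[Row]) -> List[Row]:
    # Stage 1: one sequential scan; a tuple dominated by a current
    # candidate is discarded directly, and candidates dominated by the
    # newly read tuple are dropped as well.
    candidates: List[Row] = []
    for t in table:
        if any(dominates(c, t) for c in candidates):
            continue
        candidates = [c for c in candidates if not dominates(t, c)]
        candidates.append(t)
    # Stage 2: a second sequential scan refines the candidates, since a
    # survivor may be dominated by an already-discarded tuple.
    return [c for c in candidates
            if not any(dominates(t, c) for t in table)]
```

For a cyclic instance such as [(1, 2, None), (None, 1, 2), (2, None, 1)], every tuple is dominated by some other tuple, so the refined result is empty; stage 1 alone would wrongly report a survivor.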

J. He, X. Han
Data Science and Engineering, Springer Journals

Published: Jun 1, 2022

