Orzechowski and Boryczko: Rough assessment of GPU capabilities for parallel PCC-based biclustering

Abstract: Parallel computing architectures have been proven to significantly shorten computation time for various clustering algorithms. Nonetheless, some characteristics of these architectures limit the application of graphics processing units (GPUs) to the biclustering task, whose aim is to find focal similarities within the data. This might be one of the reasons why not many parallel biclustering algorithms have been proposed so far. In this article, we verify whether there is any potential for applying heterogeneous architectures (CPU+GPU) to complex biclustering calculations. We introduce MiniMax with Pearson Correlation (MMPC), a complex biclustering method. The algorithm utilizes Pearson's correlation to determine the similarity between rows of the input matrix. We present two implementations of the algorithm, sequential and parallel, the latter dedicated to heterogeneous environments. We verify the weak scaling efficiency to assess whether a heterogeneous architecture may successfully shorten heavy biclustering computation time.

Keywords: biclustering; data mining; graphics processing unit (GPU); OpenCL; parallel algorithms.

*Corresponding author: Patryk Orzechowski, Department of Automatics and Bioengineering, Faculty of Electrical Engineering, Automatics, Computer Science and Biomedical Engineering, AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Cracow, Poland, E-mail: patrick@agh.edu.pl
Krzysztof Boryczko: Department of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Cracow, Poland

Introduction

Biclustering, which is the simultaneous clustering of rows and columns, has gained much interest in recent years due to its applications in gene expression data [1-4]. Apart from biomedicine, the technique has been successfully applied in many other areas, such as missing value imputation, text mining, and the development of target marketing or recommendation systems [5-7]. These days, improvements in computer manufacturing are not sufficient for performing data analysis on continuously increasing volumes of input data. This urges developers to propose novel parallel techniques that significantly reduce execution time or analyze larger amounts of data in a shorter period. Not many parallel biclustering algorithms have been proposed so far [8-11]. The main reason for this could be limitations of parallel architectures or the nature of the biclustering technique, which requires repeatedly reading the same data or distributing it between different devices. In this article, we present a computationally intensive biclustering method developed to verify the potential of utilizing parallel architectures for biclustering. We verify the potential of performing intensive computations on a single device.

Materials and methods

Given a data set A = (X, Y) = [a_ij]_{m×n}, biclustering is the process of identifying a series of triplets B_k(I, J, h), called biclusters, where I ⊆ X, J ⊆ Y, and h: I×J → ℝ is the level function of the bicluster, such that ∀(x_i, y_j) ∈ I×J: h(x_i, y_j) = a_ij [12].

In this section, we introduce MiniMax with Pearson Correlation (MMPC), a biclustering algorithm inspired by the correlated pattern biclusters (CPB) algorithm [13]. We call the algorithm an aggregated biclustering method, as its concept was formed based on four recognizable biclustering methods: CPB, OPSM [14], QUBIC [15], and Bimax [4]. The algorithm uses the same measure as CPB for determining similarity between rows, i.e. the Pearson correlation coefficient (PCC); the major differences include the initialization of biclusters, the method of column incorporation, and how the final biclusters are determined.

In MMPC, each row of the input matrix is assumed to be a potential bicluster. By taking into consideration only the most extreme values within each row, the initial column set of each bicluster is determined. This partially adapts the concept of partial models from OPSM [14] and QUBIC [15]. In contrast with MMPC, CPB predetermines a specified number of initial biclusters randomly. Biclusters in MMPC are expanded based on an area criterion: a specific column is incorporated only if the area of the bicluster increases after its addition. Row and column additions are performed in turns. CPB handles bicluster growth differently: a column is added based on the root mean squared error. The final set of biclusters is determined in MMPC by a filtering procedure similar to the Bimax algorithm [4]: biclusters are sorted by area, and those that substantially overlap (i.e. by more than 25%) are disregarded.

7. Sort all biclusters in order of their area and include only those biclusters that do not overlap with previously accepted biclusters by more than 25%. This step is the Bimax [4] filtering technique.

Sequential implementation (CPU)

Two different versions of MMPC were implemented. The first, purely sequential, was implemented in the C programming language and was used for comparison with the parallel version in terms of execution time. A binary field is used to store biclusters: each row and each column are represented by a single bit, as each row may become the pattern of a single bicluster only. If a row or a column belongs to a bicluster, the corresponding bits are set. Detecting the most extreme values in each row is performed by sorting the values with quicksort.
Further computations involve determining the PCC between different rows with respect to the specified columns, setting the proper bits in the binary field, and finally applying the filtering procedure.

The algorithm

The MMPC algorithm takes four input parameters: A, the input matrix; δ1 and δ2, thresholds that determine the distance from the largest and from the smallest element in each row; and ρ, the Pearson correlation threshold. The details of the algorithm are presented below.

Algorithm 1 MMPC.
1. Determine the extreme (i.e. minimal and maximal) elements in each row (Eq. 1):

   a_i^max = max_{j=1,...,m} a_ij,   a_i^min = min_{j=1,...,m} a_ij   (1)

2. For each row i, select all elements within a radius δ1 from the largest and δ2 from the smallest element (Eq. 2):

   Δ_i^max = { a_ij : a_ij ≥ a_i^max − (a_i^max − a_i^min) δ1 }
   Δ_i^min = { a_ij : a_ij ≤ a_i^min + (a_i^max − a_i^min) δ2 }
   Δ_i = Δ_i^min ∪ Δ_i^max   (2)

3. Form an initial bicluster as a single row (seed) and the Δ_i columns.
4. For each column j, loop through all rows i and calculate the PCC value between row i and the seed row, with respect to all active columns (including j). Determine how many rows meet PCC(seed, i, Δ ∪ {j}) ≥ ρ and the area covered by the bicluster.
5. Choose the column j whose addition causes the bicluster to cover the largest area. Incorporate column j into the bicluster formed by the seed row.
6. Repeat steps 4-6 until the area of the bicluster stops increasing.

GPU implementation issues

There are certain constraints and limitations in parallel GPU programming for biclustering. An extensive discussion of these issues is given in [16]. The major issues include the following:
- Choice of storage structures. Unless the number of biclusters is specified a priori, dynamic structures cause difficulties.
- Limited amount of fast shared memory on the GPU. For example, the NVIDIA Tesla M2050 (Fermi) has only 786,432 bytes of L2 cache in total, whereas the NVIDIA Tesla K20c offers 1,310,720 bytes. This limitation makes it impossible to cache whole data sets.
- Amount of global memory. The NVIDIA Tesla M2050 offers 2687 MB of total (global) memory. This is sufficient for a single biological database popularly used for biclustering, but may be inadequate for biclustering larger data sets, such as social networks. There is, however, a tendency among major GPU vendors to increase global memory volume.
- Relatively slow global memory access. Many data sets used for biclustering are too small to hide the latency associated with global memory access. This may be mitigated by performing excessive computations.
- Need for optimization. A program runs efficiently on a GPU only if global memory accesses are coalesced. Unfortunately, the task of biclustering allows the creation of any subset of rows and columns, which makes the design of such an algorithm very complex and challenging.
- Determining the proper grid size (i.e. the number and size of workgroups) to obtain the highest acceleration. For example, hardware limits the maximum workgroup size to 1024 threads on the NVIDIA Tesla M2050.
- Synchronization difficulties. Although synchronization within a single workgroup poses no problem, there is no costless synchronization between different workgroups. Atomic operations, multiple barriers, or the use of a single workgroup are not good enough, and the only choice is synchronization by kernels, which causes memory overhead.

The final biclusters are stored in a bit array, which is organized into 32-bit groups. The array is then passed to the host, where the filtering procedure is applied and overlapping biclusters are disregarded.

Complexity of the MMPC algorithm

The proposed version of the MMPC algorithm appears to be very complex.
Detecting the extreme elements in each row may be accomplished in O(log m) time, where m is the number of columns. This gives O(n log m) time for the whole matrix, where n is the number of rows. Values in each row (representing step 2 of Algorithm 1) are updated in O(m) time. Determining the PCC value is the most complex operation. In a naive implementation, in steps 4-6, each iteration requires the calculation of the PCC for all m columns, and there may be at most m column additions. In each iteration, a comparison of a single row with each of the n rows takes place. This gives a time complexity in the order of O(m³n²). The final filtering takes up to O(mn²) time. Summing up, we deem the time complexity of MMPC to be polynomial: O(m³n²). The memory complexity of the algorithm is in the order of O(mn).

Parallel implementation (CPU+GPU)

The second implementation, dedicated to heterogeneous infrastructures, was implemented using the OpenCL framework. The data are read on the CPU, stored in the global memory of the GPU, and padded to a column size divisible by 32, as this is the native hardware execution width associated with a workgroup. To optimize device utilization, each row of the data matrix is handled by a different workgroup, and each column corresponds to a single work item. Detecting the most extreme elements is performed in each workgroup separately, using a reduction pattern: comparisons are made between elements separated by an offset, which is halved at each iteration. Afterward, the bits corresponding to the highest and lowest values in each row are concurrently set, as well as those for values within the radius determined by the algorithm's input parameters. The result, in the form of a binary array, is then stored in global memory and the second kernel is invoked, which is responsible for the calculation of the PCC values.
A synchronization point is necessary here, as further calculation requires the values of the binary array to be computed for the whole array. The PCC calculations are performed between the row associated with each workgroup and all other rows. Two local arrays are used to track the size of the bicluster and the index of the best column. If the PCC coefficient is higher than the threshold (i.e. the rows are correlated), the array containing the size of the bicluster is updated. The maximal value is calculated in the same manner as in the first kernel.

Results

One of the purposes of applying the MMPC algorithm was to verify the suitability of heterogeneous architectures for reducing the computation time of a complex biclustering algorithm. Thus, the reported execution times of the parallel version of the algorithm include data transfers to and from the device as well as kernel execution. To assess the algorithm, we used six selected data sets from [2]. As can be noted in Table 1, all the data sets have the same number of rows but a different number of columns. The experiment was performed in two different environments, each time comparing the results obtained by the most optimized sequential and parallel versions:
- a 2.66 GHz Intel Xeon X5650 CPU with an NVIDIA Tesla M2090 (Fermi architecture) GPU
- a 2.80 GHz Intel Xeon X5660 CPU with an NVIDIA Tesla K20c (Kepler architecture) GPU

The accumulated running times of both versions of the MMPC algorithm are shown in Table 2. The weak scaling efficiency of the MMPC algorithm is shown in Figure 1.

Table 1: Data sets used for assessing the execution time of the algorithm.
GEO ID     Genes    Samples   Platform   Description
GSE3585    22,215   12        U133A      Heart
GSE7148    22,215   14        U133A      Peripheral blood leukocytes
GSE5390    22,215   15        U133A      Dorsolateral prefrontal cortex
GSE5090    22,215   17        U133A      Omental adipose tissue
GSE7893    22,215   21        U133A      Peripheral blood CD14+ leukocytes
GSE10161   22,215   27        U133A      Heart

Columns indicate the name of the experiment, the number of genes and samples, the microarray platform, and the tissue description.

Table 2: Average execution time (in minutes) of the main loop of the MMPC biclustering algorithm: the sequential version on the CPU and the parallel version on heterogeneous architectures.

Samples     2.66 GHz CPU with Fermi GPU     2.80 GHz CPU with Kepler GPU
(Columns)   Sequential      Parallel        Sequential      Parallel
12          361.2           145.3           106.2           85.4
14          512.3           168.3           159.4           98.8
15          622.9           179.8           193.5           105.5
17          863.8           203.3           271.0           119.4
21          1488.4          308.2           413.5           139.5
27          2745.3          397.6           847.4           179.9

Figure 1: Weak scaling efficiency of the MMPC algorithm for Fermi (red) and Kepler (green) GPU architectures.

By looking at the execution times of both versions of MMPC in Table 2, we can see that the algorithm is very complex. The highest speedup observed was 6.9× for the data set with 27 columns (see Figure 1). This led us to believe that it may reach 9× for a data set with 32 columns, but verification on larger data sets needs to be performed to confirm this. Note that all the calculations were performed on double-precision numbers instead of single-precision numbers (floats). Performing float arithmetic operations, which are favored by the GPU design, resulted in a twofold increase in the speedup of calculations on Fermi and more than twofold on Kepler, but unfortunately had a side effect on the resulting set of biclusters: rows were added to or deleted from about 10-20 biclusters out of a total of 22,215.
This is probably caused by the multiple arithmetic operations (additions, multiplications, divisions, square roots) needed to determine the PCC. Although this is well below 0.1%, it may have an impact on the final biclustering results and thus should not be ignored. Although the algorithm computes faster on Kepler (compare the total times in Table 2), its weak scaling efficiency is lower than that obtained on Fermi, which is quite surprising. The reason for this cannot be the GPU itself, as the compute capability of Fermi is lower than that of Kepler (Fermi, 2.0; Kepler, 3.5) [17]: the maximum number of resident workgroups per multiprocessor is 48 for Fermi and 64 for Kepler, and Kepler offers twice the double-precision floating-point peak performance of Fermi and reaches 208 GB/s of memory bandwidth compared with Fermi's 177 GB/s. The reason therefore has to be the runtime environment, as the sequential version is over three times slower in the environment with Fermi than in the environment with Kepler. The different configuration of the nodes in the environment with Fermi may lead to the nodes being shared with other processes, which extends running times. To compare the biological enrichment achieved by the algorithm, we used five recognizable biclustering algorithms: Bimax [4], CPB [13], QUBIC [15], Plaid Models [18], and xMotifs [19]. MMPC was the best at discovering the GO term GO:0034341 (response to interferon-γ) compared with the other algorithms (p-value after Benjamini-Hochberg [20] correction of 2.88715e-24). This indicates that the algorithm may be very useful for some biological experiments, and the availability of the parallel implementation provides a chance to obtain the results much faster.
Conclusions and future work

In this article, the MMPC biclustering algorithm was presented with two implementations, sequential and parallel, the latter obtaining a sensible parallel efficiency under weak scaling on GPUs. The algorithm was inspired by a couple of other biclustering algorithms: it combines the PCC metric from the CPB algorithm and the detection of extremes from OPSM and QUBIC, as well as the filtering procedure from Bimax. The major advantage of the MMPC algorithm is its ability to discover, as with CPB, both shifting and scaling patterns simultaneously.

The MMPC algorithm is considered to be very complex. Meanwhile, increases in the availability of computing power, as well as in tools for exploiting parallelism, allow algorithms with excessive execution times to be run. MMPC scales well with increasing data size. Nonetheless, there are a couple of reasons why the algorithm's scaling potential worsened. First, comparisons of rows require repeated access to global memory, which affects performance. The correlation is computed on non-consecutive columns, which does not allow coalesced memory access. Second, the bit representation of rows and columns may favor the sequential version, as the GPU is not optimized for bitwise operations: a single multiprocessor may perform only 16 (Fermi) or 64 (Kepler) integer shifts and 32 (Fermi) or 160 (Kepler with compute capability 3.5) bitwise ORs, XORs, or ANDs in a single clock cycle. A visible obstruction is the appearance of race conditions between threads when multiple threads access the same integer.

Certain optimizations of the algorithm may be proposed to shorten computation time. First, the procedure of column selection could be modified: instead of visiting columns multiple times, only inactive columns may be visited, or a greedy column selection may be applied.
Disregarding multiple column checks may allow the reduction of the computation time to O(m²n²). Further modifications of how the column set is selected could possibly reach the borderline of O(m²n). Finally, there are certain limitations of the algorithm; for example, data sets with more than 1024 columns are not supported. An interesting idea is porting the algorithm to the Many Integrated Core architecture (Intel MIC) and comparing the speedups obtained on Intel MIC and the GPU.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This research was funded by the Polish National Science Center (Narodowe Centrum Nauki, grant no. 2013/11/N/ST6/03204). This research was supported in part by PL-Grid Infrastructure.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.
Bio-Algorithms and Med-Systems – de Gruyter
Published: Dec 1, 2015