email@example.com College of Intelligence and With the increasing of computing ability, large-scale simulations have been generating Computing, Tianjin University, massive amounts of data in aerodynamics. Sort-last parallel rendering is the most Tianjin, China classical image compositing method for large-scale scientific visualization. However, in Full list of author information is available at the end of the article the stage of image compositing, the sort-last method may suffer from scalability problem on large-scale processors. Existing image compositing algorithms tend to perform well in certain situations. For instance, Direct Send is well on small and medium scale; Radix-k gets well performance only when the k-value is appropriate and so on. In this paper, we propose a novel method named mSwap for scientific visualization in aerodynamics, which uses the best scale of processors to make sure its performance at the best. mSwap groups the processors that we can use with a (m, k) table, which records the best combination of m (the number of processors in subgroup of each group) and k (the number of processors in each group). Then in each group, using a m-ary tree to composite the image for reducing the communication of processors. Finally, the image is composited between different groups to generate the final image. The performance and scalability of our mSwap method is demonstrated through experiments with thousands of processors. Keywords: Parallel rendering, Sort-last, Image compositing, mSwap, m-ary tree 1 Introduction With the development of HPC systems, simulations could generate larger amounts of data than before in aerodynamics, which means more powerful techniques should be adopted for data analyzing. Visualization aimed at aerodynamics has been playing an important role in scientific discovery, especially in massive data discovery [1–6]. The increase of data also challenges the traditional methods of scientific visualization. Parallel render- ing is a useful approach to improve the performance of visualization, which has been widely used in scientific visualization to make full use of the HPC systems. As defined by Molnar et al. , all parallel rendering approaches are classified into three categories: sort-first, sort-middle and sort-last based on where the sort from object-space to screen space occurs. Among these classifications, sort-first and sort-last are more appropriate for parallel systems than sort-middle, because of their entire rendering pipeline. Besides, © The Author(s). 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. (2021) 3:3 Hou et al. Advances in Aerodynamics Page 2 of 17 sort-last parallel rendering has been widely used in parallel systems for its well perfor- mance in load balancing. Different from the other parallel rendering approaches, sort-last has the step of image compositing, that’s where the bottleneck is. To settle this problem, several image composition algorithms have been proposed so far. For accelerating sort-last parallel rendering, image compositing algorithms are used to improve the utilization of processors. The classical image compositing algorithms can be divided into three categories according to : Direct Send, Parallel Pipeline and Tree which includes all algorithms based on tree. Direct Send is simple to implement. Every processor will be responsible for part of the dataset. However, its complex communication between processors leads to lower performance on massive scale. Parallel Pipeline hasn’t been widely used for the same reason. The algorithms based on tree have been improved all these years, such as Binary Tree, Binary Swap, 2-3 Swap, Radix-k, 2-3-4 Decomposition and so on. All of these image compositing algorithms are implemented by an image com- positing tree, which is designed using the tree structure. Although, parallel rendering has been studied for so many years, there are still some problems during image compositing. In this paper, we introduce a new image compositing algorithm, called mSwap, for mak- ing full use of processors. The advantage of mSwap algorithm is that it can keep each group running in the best situation for avoiding performance degradation. Besides image compositing using m-ary tree could reduce the communication between processors effec- tively, which is the most time-consume step. Considering the cost of collection, Binary Tree is used at the end of mSwap, which replaces the collection step because of its no image split. On the other hand, the mSwap algorithm is easy to be understood and imple- mented, which makes it very practical. After experiment, the algorithm scales very well in a massively parallel environment. Our algorithm also includes the following contributions: An optimal (m, k) table, which indicates the highest performance combination of m and k. A new image compositing algorithm named mSwap using m-ary tree to reduce the communication of processors. A new method to avoid the collection step at the end of image compositing algorithm. A new environment of Tianhe-3 prototype to prove the reliability of parallel rendering. The remainder of this paper is organized as follows: Section 2 provides several image compositing algorithms. Section 3 introduces mSwap algorithm in detail, followed by descriptions of mSwap, the environment and results of experiments in Section 4.Section 5 discusses existing technical challenges and proposes orientations for future research. Section 6 concludes the paper. 2 Related work Parallel technology has been widely used in many fields of scientific research, including visualization. Parallel rendering is a production of the combination of parallel technol- ogy and visualization. As mentioned before, parallel rendering is classified as sort-first, sort-middle and sort-last, where sort-last is the most popular for its well performance on HPC systems . In sort-last, each processor loads part of the dataset and renders into a whole image, which means there is no communication between processors before image (2021) 3:3 Hou et al. Advances in Aerodynamics Page 3 of 17 compositing. Given that each processor only has few blocks of the dataset, the time of loading and rendering could be ignored. On the contrast, during the image composit- ing each processor must communicate with other processors to produce the final image. Therefore, the cost of communication is very expensive in the image compositing stage. To solve this problem, several algorithms have been proposed for sort-last system. 2.1 Direct send One of the most popular algorithms is Direct Send , which is intuitive and easy to implement. In sort-last parallel rendering, how to control every processor to commu- nicate with the other processors is the main problem. The simplest idea is that each processor is responsible for a part of the whole image and that is what Direct Send done. As shown in Fig. 1, in Direct Send, each processor is assigned a partial area of the final image before the image compositing. Then each processor requires to receive data belonging to itself from the other processors, and send data that it is not responsible to corresponding processor. With the completion of sending and receiving, each processor has the final image that it is responsible for. At last, Direct Send collects the image in order to produce the final whole image. Direct Send performs well on small scale or under some specific situation. For example, when the pixels are concentrated relatively, each processor owns the most pixels of its responsibility, which means there is no need to com- municate with all of the other processors to get the final image. However, it still has the problem of communication between processors. In the worst situation, Direct Send needs N (N − 1) communications to finish the composition, where N is the number of proces- sors. To minimize the communication cost of Direct Send, SLIC  (Scheduled Linear Image Composition) algorithm uses a compositing schedule to control the image com- positing step. Besides, pixels are classified into three categories: background pixels, pixels in the non-overlapping areas, and pixels in the overlapping areas. Only the second case requires image compositing. By pixels classifying, SLIC could reduce the number of pix- els communicating between processors. So SLIC is essentially a highly optimized Direct Send algorithm. There is another reason that Direct Send is still keep alive nowadays. In Direct Send, the process of communication and computing could be parallel completely, which could reduce the time of image compositing further more. 2.2 Parallel pipeline Parallel Pipeline  is another kind of image compositing algorithm, which is used sel- dom in practice. Different from Direct Send, processors are organized on a circular ring in Parallel Pipeline, and every processor will send data to the next processor, and receive data from the previous processor. Besides the whole image is divided into N pieces, where N is the number of processors. Each processor is responsible for one piece at a stage, but different stages mean different pieces. After round N − 1, collection is also required just like Direct Send to produce the final image. Parallel Pipeline is also suitable for small scale of processors, because more processors lead to more communication. Based on Paral- lel Pipeline, there are several optimal algorithms proposed to improve the performance. PPB (Parallel Pipeline using a Bounding Box) selects active pixels by a bounding box to avoid sending inactive pixels. In sparse image, active pixels tend to stay together, which could be contained in a bounding box to reduce the number of pixels when communi- cating. DPF (Direct Pixel Forwarding) assigns fixed areas for each processor to avoid link (2021) 3:3 Hou et al. Advances in Aerodynamics Page 4 of 17 Fig. 1 Direct send. For example, there are four processors involved in image compositing progress. In each processor the data is divided into four parts. The one marked in blue indicates the part that the processor is responsible for. At each stage, processors swap data with each other as framed by red dotted line contention just like Direct Send. The difference is DPF still organizes processors as a circular ring to exchange data. DPFL (DPF with Static Load Balancing) is an advanced optimization of DPF, which assigns horizontal lines among the processors in an inter- leaved fashion to keep tasks balanced. DPFS (DPF with Task Scheduling) avoids waiting among processors by adding a task schedule, which is caused by rendering out of sync. For the reason that all these algorithms only perform well on small scale, they are not widely used as the others. 2.3 Tree Last but not least, there are lots of algorithms based on tree, which are popular nowadays. Binary Tree is the basic image compositing algorithm in this category. In Binary Tree, all processors are grouped, and each group has two processors. The key point is one proces- sor sends all data to the other one in the same group to composite in a stage, which means the processor sending data will be idle in the next stage. It could manage all processors better than before, but the idle processors lead to reduce performance to a certain extent. Binary Swap , as a well-known algorithm, solves this problem by changing the com- positing step. In each stage, one processor sends half of the data to the other processor, and receives the rest from the other one. Then each processor will have half composited image. After logN stages, each processor owns 1/N images and collects to one processor finally. By splitting data, Binary Swap improves the utilization of processors. Meanwhile the splitting data also leads to the communication problem mentioned before. Consid- ering that both in Binary Tree and Binary Swap, processors are grouped into two, the (2021) 3:3 Hou et al. Advances in Aerodynamics Page 5 of 17 number of processors must be power-of-two. For supporting arbitrary number of proces- sors, 2-3 Swap  groups processors into two or three instead of two only. It is based on that any integer can be expressed as the sum of a sequence of integers consisting of 2 and 3. What should be noted is that the image in processors is not aligned when image com- positing between groups. Besides Direct Send is adopted to composite image within and between groups. Radix-k  is the result of the generalization of Binary Swap and Direct Send. The number of processors N is factored in r factors so that k is a vector where k = [k , k , ..., k ]. In each stage, processors are grouped according to k which is the number 1 2 r of processors in each group. However, k is related to the network topology and to physical placement of processes onto that topology, which means there is no clear method to get the k value . The same as 2-3 Swap, Direct Send is used to composite image within and between processors. Given that Binary Swap performs well when the number of proces- sors is power-of-two, 2-3-4 Decomposition [17, 18]isproposedtomakesurethe number of groups is power-of-two so that Binary Swap could be used between groups. 2.4 The others What’s more, there are still some image compositing algorithms, which are optimized in special environments based on the algorithms mentioned before. All the image compositing algorithms perform well in the case of uniform pixels distri- bution, which is called dense image. However, there are all kinds of possibilities in reality. In some images, pixels may lie in a concentrated area called sparse image , which leads to not all of the pixels need to be composited. For example, if active pixels lie in the cen- ter of the whole image, it is wasting time to composite the other area of the whole image. To settle the problem, several solutions have been proposed. Bounding box isone of them, which selects the active area using a bounding box to reduce the scale of compo- sition. Interlacing algorithm  is another solution, which keeps the load-balancing by interlacing assignment. Besides, Shift-based  image compositing method using shift permutation is based on Infiniband Fat-Tree interconnection network. Multi-Step  image compositing approach minimizes undesirable performance degradation to achieve scalability. TOD- Tree  (Task-Overlapped Direct send Tree) image compositing algorithm combines Direct Send and k-ary Tree to improve the entire performance based on hybrid MPI parallelism. Larsen et al.  optimizes for multi-image sort-last parallel rendering. Aly et al.  proposes the Distributed kd-trees for retrieval. After that, Zhang et al.  designs dynamic load balancing based on constrained K-D tree for parallel parti- cle tracing. Cavin et al.  designs a sort-last parallel rendering system based on COTS Cluster. There are also several applications implemented by sort-last parallel rendering [29–32]. Existing image compositing algorithms tend to perform well in some specific situations, which means in most cases, algorithms are not at the best point or even lower the aver- age. The scale of processors is one of the most important factors. With the increasing of processors, the performance of parallel rendering has begun to raise because of the high parallelism. However, once the number of processors reaches a certain value, the per- formance won’t improve again or even decline which is called performance degradation. That is the communication between processors affects the performance. Most algo- rithms prefer to keep processors busy to increase the utilization. However high utilization (2021) 3:3 Hou et al. Advances in Aerodynamics Page 6 of 17 causes frequent communication between processors. Considering that parallel rendering is a communication-intensive approach, this case will lead to performance degradation. Besides that, more processors mean more pieces of image in some image compositing algorithms, which leads to such a phenomenon that each processor is responsible for a few pixels. That will waste the performance of processors. At last, there is an additional image collection step at the end of image composition in most image compositing algo- rithms. When the image is divided into multiple pieces, the image collection would take a lot of time besides the waste of performance according to , which is the next largest time-consume in parallel rendering. In this paper, we propose a new image compositing algorithm named mSwap, which is also based on tree using the best scale of processors to make sure its performance at the best. The details will be discussed in the next section. 3 Methodology In this section, we will introduce our mSwap algorithm in detail. At the beginning of the compositing step, each processor has an image rendered from partial blocks of the dataset, which has both RGBA value and depth information. All images will be compos- ited into one final image according to the depth information with the image compositing algorithm. Before the description of mSwap, there are some definitions need to be known: N : the number of processors. k : the number of processors in each group. m : the number of processors in subgroup of each group. To describe mSwap in detail, this section is divided into three sections according to the different steps of mSwap algorithm. The strategy of grouping introduces the strategy of grouping. In the second section, image compositing based on m-ary tree will be displayed in detail, and the last Section explains the method for avoiding image collection. The entire process of mSwap is shown in Fig. 2. 3.1 The strategy of grouping As most of the existing image compositing algorithms do, grouping is also required in our algorithm. The independence of dataset makes it possible to group processors. On the other hand, grouping could reduce the scale of processors for easy management. The improvement of performance depends on the grouping situation. So in mSwap, we group processors using the optimal (m, k) table. In the classic Binary Swap algorithm, every two nodes are grouped together for data exchanging. However, it is different from that in mSwap. As mentioned before, existing algorithms always perform well in specific case, but in common their behavior is at the average or even under the average, which limits these algorithms greatly. For avoiding this problem, mSwap groups all processors according to the optimal (m, k) table, which makes sure that each group could run at the best situation. There is a point that the performance of every image compositing algorithm is changed with the different scale of processors. At the beginning, with the increase of processors, the advantage of parallel rendering starts to reflect. But there is a threshold. Once the number of processors is beyond the thresh- old, the performance of algorithms will decline. Considering that parallel rendering is a (2021) 3:3 Hou et al. Advances in Aerodynamics Page 7 of 17 Fig. 2 mSwap contains three steps to finish the image compositing process. Step 1 : all processors are grouped according to the optimal (m, k) table. Step 2 : image compositing with m-ary tree in each group. Step 3 : image compositing with Binary Tree between groups communication intensive approach, the increasing communication between processors due to the increasing processors could be the main reason of performance degradation. It is easy to understand that the performance will decline when the number of proces- sors is too large. An extreme situation is that each processor is only responsible for one pixel, but it needs to communicate with the other processors to produce the final image. In that case, so many processors lead to huge communication pressure instead of the performance improvement in parallel rendering. For avoiding performance degradation, mSwap limits the number of processors in each group, and that is what optimal (m, k) table done. The optimal (m, k) table is the key of grouping in mSwap, which indicates the best scale of processors. In the table, m is the number of processors in subgroup of each group. k is the number of processors of each group, which is also the best scale of processors according to m. Since mSwap is based on m-ary tree, there is such a relationship between m and k : n = log k (1) where n is a positive integer, and m could be any positive integer. Given that in subgroup of each group Direct Send is adopted to composite image which will be introduced in the next section, the value of m should be small to guarantee the performance. Especially when m equals two, then k is the best point of Binary Tree. As shown in Fig. 3, the optimal (m, k) table contains two combinations of m and k. In each column, there is a combination of m and k, which indicates the best value of k when m is settled. For example, when m = 3, the best scale of processors k is 9, which means the performance will decline when k increases or decreases. On the other hand, the column with m = 3istothe left of m = 2. That is because the image compositing time of m = 3, k = 9 is shorter than that when m = 2, k = 4. Figure 3 shows the grouping of 13 processors. At first, 13 is both larger than 9 and 4, so we group processors 0-8 into (2021) 3:3 Hou et al. Advances in Aerodynamics Page 8 of 17 Fig. 3 Grouping according to the optimal (m, k) table. In the optimal (m, k) table, the combination of m = 3, k = 9 is better than m = 2, k = 4. mSwap assigns processors 0 to 8 into the group 1 in priority, and the rest is grouped into group 2 group 1. As there are only 4 processors left, which is not enough for 9, processors 9-12 are grouped into group 2. In the optimal (m, k) table, the performance of combination is decreasing gradually from left to right, which means the combination on the left is better than that on the right. So, we may consider using the combination in priority to composite image. However, if we keep assigning the previous combination of m and k in advance, the next combina- tion of m and k may not be used forever when the previous k is small. For avoiding this case, except for the priority need to be noted, we also consider the number of the rest of processors, which should be as small as possible to improve the performance of image compositing. In mSwap, the combination on the left is chosen in priority, and then the next com- bination (m, k) is considered. After traversing the whole table, almost all processors are grouped according to the best scale. Considering that there may be several processors left because of no enough processors for grouping, we assign the left processors to step three in mSwap. 3.2 Image compositing with m-ary tree After grouping, each group will composite image by using m-ary tree to manage pro- cessors. The design of mSwap is different from the existing algorithms, which makes it reduce the communication between processors efficiently. In each group, there are still a lot of processors need to be controlled for image com- positing. Thus, the processors in the same group will be grouped again according to m, which is the number of processors in subgroup of each group. After the second group- ing, each subgroup owns m processors and Direct Send is adopted to composite a whole image among these processors. We still consider some of the other image compositing algorithms, but as shown in , Direct Send performs better than others on small scale. Then the rest of processors continue to group according to m until there is only one pro- cessor on m-ary tree. When step two is done, each group will keep one processor to store the compositing result of the entire group. Although after image compositing by m-ary tree, there are some processors out of computing, the communication between processors reduces efficiently, which is also adopted in other algorithms like 2-3-4 Decomposition. (2021) 3:3 Hou et al. Advances in Aerodynamics Page 9 of 17 Fig. 4 Comparison of Binary Tree and Binary Swap. The cost of communication in Binary Tree is smaller than that in Binary Swap when the number of processors is smaller than 256 mSwap is also based on tree to composite image, which makes it easy to manage processors. When m = 2, the grouping is just like Binary Tree. The reason that we choose Binary Tree instead of Binary Swap is that the number of pieces keeps growing with the scale increasing in Binary Swap, which leads to two potential problems. One is that the greater the number of pieces is, the more time image collection will take. The other is the increasing pieces result in huge communication between processors, which is the most expensive cost in parallel rendering. As shown in Fig. 4, the cost of communication in Binary Tree is smaller than that in Binary Swap when the number of processors is smaller than 256. Considering that the value of k won’t be large, mSwap uses one processor to store the result of each subgroup image compositing, which means in the next stage, each node owns a whole image like the first stage. The details are displayed in Fig. 5. Fig. 5 Image compositing with m-ary tree. In group 1, image compositing with 3-ary tree, and in group 2, image compositing with 2-ary tree, which is also called binary tree (2021) 3:3 Hou et al. Advances in Aerodynamics Page 10 of 17 There is another reason that we use m-ary tree to composite image, instead of binary tree in Binary Swap. It is obviously that the number of stages begins to decrease with the growth of m. However, the communication time in each stage will be longer. To settle the problem, we have compared the cost of image compositing time under different combina- tions of m and k. And it turns out that different image resolutions have different optimal combinations. In step two of mSwap, m-ary tree is adopted to composite image, which means there will be n m-ary trees, where n is the number of groups in step one. Direct Send could parallel the process of computing and communicating as mentioned before. However, image compositing algorithms based on tree couldn’t parallel the both completely. The next stage of image compositing can only continue until the previous stage is finished. In mSwap, n m-ary trees are mutually independent, which is better than those based on one image compositing tree in the other algorithms. 3.3 Binary tree for avoiding image collection At the last step of mSwap, the sub-images in n processors need to be composited into one final image, where n is the number of groups in step one. Considering Binary Tree is the only one algorithm that does not need image collection at the end of image compositing, mSwap uses Binary Tree to avoid image collection. In parallel rendering, the image composition time accounts for the vast majority, and the image collection time accounts for the majority of the rest . Besides the time for loading image data and writing into a file takes up the remaining part. So the total time of parallel rendering should be like this : t = t + t + t + t (2) total load composite collect write With the increasing of processors, the image collection time keeps growing, which effects overall performance in parallel rendering. Considering that t and t are load write far less than t or t , t is mainly decided by t and t .mSwap composite collect total composite collect reduces the t by step two using m-ary tree, which is most of the total time. Besides composite mSwap reduces the t using Binary Tree in step three, which will further increase collect performance. Binary Tree is the original image compositing algorithm based on tree, and it is the only algorithm that does not need image collection. In Binary Tree, processors are grouped like Binary Swap. Each group owns two processors. The difference is one processor will send all its data to the other one in the same group to produce the composite image. Although Binary Tree does not perform as good as the other image compositing algorithms on large scale for its lower processor utilization, it is superior on small scale for its no image col- lection. This is the reason that mSwap chooses Binary Tree finally. After the composition based on m-ary tree in step two, the number of nodes reduces to a small scale, which fits the characteristic of Binary Tree very well. 4 Experimental results We have implemented the proposed mSwap image compositing algorithm using C++ pro- gramming language together with VTK [33, 34] (Visualization Toolkit) based on Tianhe-3 prototype. For proving the performance of mSwap algorithm, we obtain the optimal (m, k) (2021) 3:3 Hou et al. Advances in Aerodynamics Page 11 of 17 table by testing compositing time at different image resolutions. Besides we also com- pared our algorithm against Direct Send, Binary Swap, Radix-k and 2-3-4 Decomposition, which are also designed based on VTK. As for 2-3 Swap, considering that Radix-k is a more general result of 2-3 Swap, we choose Radix-k for testing. 4.1 Tianhe-3 prototype Tianhe-3 prototype has a high-performance cluster environment, which is designed for high-performance computing and massive data processing. It provides two kinds of CPUs, including Phytium MT2000+ and Phytium FT2000+. Tianhe-3 prototype owns 512 boards with three Phytium MT2000+ CPUs on each board, and 128 boards with four Phytium FT2000+ CPUs on each board. Each Phytium MT2000+ CPU is divided into four nodes, which owns 32 cores and 16GB RAM. Each Phytium FT2000+ CPU owns 64 cores and 64GB RAM. Besides, the floating-point computing performance of Tianhe-3 proto- type could reach 3.146PFlops. The capacity of total parallel storage is 1PB, which could meet the needs of users. 4.2 The optimal (m, k) table In sort-last system, each processor owns a full image information before image composit- ing, which means the resolution of image decides the pressure of communication between processors. Therefore, different resolutions mean different optimal combination of m and k. We firstly select the image resolution to 300 ∗ 300 to test the optimal combination, which is the default window size in VTK. Besides the resolution of image, the relationship between m and k also needs to be considered. When testing, we must make sure that k is the power of m as mentioned before. In order to get the optimal (m, k) table, we introduce the performance curves of (m, k) first. Theoretically, the time of image compositing will decline as the number of pro- cessors grows due to its high parallelism. Once the number of processors is beyond a threshold, the time of image compositing has begun to increase because of its over- blocking. However, as shown in Fig. 6, there is a little different in reality. Considering the cost of communication between processors, the time of image compositing keeps grow- ing at the beginning. As the number of processors continues to increase, the advantages of parallelism have appeared, which would cause a rapid decline in the growth rate of image compositing time. And this is the point that we are looking for. To find the point, we define the growth rate of image compositing time : v = δt/δk (3) Where δt is the change of time and δk is thechangeofnumberofprocessors. At the beginning, the time of image compositing grows quickly because of the need of parallel rendering, which means v is a large value. After that, v has begun to slow down, which means the time of rendering is also stable. To make sure the proper k value, we select the smallest v as the best point, which is also the best point of k. For example, when m = 2, the value of k could be 2, 4, 8, ..., 1024 and so on. We have recorded the time of image compositing under different combinations. Then we calculated the growth rate of two adjacent combinations separately to determine the best combination. Another important thing is to choose the k value in Radix-k algorithm. In , it is mentioned that there is no exact way to get the appropriate k value, but when fairly large (2021) 3:3 Hou et al. Advances in Aerodynamics Page 12 of 17 Fig. 6 Performance curves of (m, k) with a resolution of 300 ∗ 300 and 1024 ∗ 1024. The performance curves of (m, k) are used to locate the best point of m-ary tree. With the number of processors increasing, compositing time also rises until the number of processors reaches a threshold radices such as 32 or 16 appear in the k vector, the performance will improve to a certain degree.SoinRadix-k basedonVTK,weprefertochoose32and 16 as the k value. Figure 6a shows the optimal (m, k) table and the performance of each combination of m and k. After that, the larger resolution of image needs to be considered. So, we also test the performance when compositing images of sizes 1024 ∗ 1024. Figure 6bshows the optimal (m, k) table under different image sizes. The optimal (m, k) table is the key to the mSwap algorithm, which indicates the process of grouping. Considering that Direct Send is used in each subgroup of each group, the value of m is set from 1 to 9, where 1 is for the rest of processors that are not grouped. In the best situation, the number of ungrouped processors is zero, which means all of the processors are grouped according to the best point. Although, it is possible that there are unassigned processors left, which is sent to step three for Binary Tree compositing, mSwap still performs well under most circumstances. After m-ary tree compositing in step two, the number of processors has declined to a smaller scale, which is suitable for Binary Tree. So, the performance of mSwap won’t fall under any conditions. 4.3 The performance of mSwap We also execute our mSwap algorithm based on the optimal (m, k) table mentioned before. Since existing algorithms were implemented based on VTK, we compared our algorithm with the other image compositing algorithms in the same situation. There are two factors that affect the performance of image compositing. One is the res- olution of image, which affects the optimal (m, k) table directly. We use the default size of image (300 ∗ 300) to test first. After that, to prove the effectiveness on larger resolution, we also test the image size of 1024 ∗1024 and 2048 ∗2048. The other is the number of pro- cessors. mSwap algorithm supports any number of processors, so we still compare mSwap with the other algorithms when the number of processor is non power-of-two. However, Binary Swap is not tested because it does not support the case where the number of processors is non power-of two. Figure 7 shows the different performance curves of different algorithms when the image size is 300 ∗ 300, Fig. 8 is the result with a resolution of 1024 ∗ 1024, and Fig. 9 is the result (2021) 3:3 Hou et al. Advances in Aerodynamics Page 13 of 17 Fig. 7 Performance curves of different algorithms with a resolution of 300 ∗ 300. a is the compositing time when the number of processors is power-of-two. b is the compositing time when the number of processors is non power-of-two without Binary Swap with a resolution of 2048 ∗ 2048. From the result, we could see that mSwap is better than the other algorithms no matter the number of processors is power-of-two or not. The performance of mSwap is well under most situations. When the number of pro- cessors has begun to grow, the compositing time increases quickly like the other image compositing algorithms. However, the upper limit of time in mSwap is lower than other algorithms for its well performance. Direct Send is popular for its simplicity and aban- doned for its bad behavior on large scale. Thus, Direct Send is more appropriate for mixing with the other algorithms, which could provide better environment for Direct Send. Although Binary Swap has been proposed for so many years, it is also popular when the number of processors is power-of-two. It could make full use of processors to compos- ite image, which leads to performance improvement. 2-3-4 Decomposition algorithm is based on Binary Swap, which adds pre-processing to make sure any number of processors could use Binary Swap to composite image. Therefore, when the number of processors is power-of-two, its performance is slightly worse than Binary Swap on large scale because of the pre-processing progress. What’s more, Radix-k is the combination of Direct Send and Binary Swap, which is affected by the value of k greatly. Although we choose appro- priate k value, it is a little worse than Binary Swap and 2-3-4 Decomposition. Compared to these image compositing algorithms, mSwap shows its advantages both on small or large scale. On the one hand, the optimal (m, k) table provides the best situation when using m- ary tree to composite image, which could keep the processors at the best point or above Fig. 8 Performance curves of different algorithms with a resolution of 1024 ∗ 1024. a is the compositing time when the number of processors is power-of-two. b is the compositing time when the number of processors is non power-of-two without Binary Swap (2021) 3:3 Hou et al. Advances in Aerodynamics Page 14 of 17 Fig. 9 Performance curves of different algorithms with a resolution of 2048 ∗ 2048. a is the compositing time when the number of processors is power-of-two. b is the compositing time when the number of processors is non power-of-two without Binary Swap the average. On the other hand, mSwap composites image by m-ary tree, which reduces the communication between processors. For example, if m = 2 is the best combination in the optimal (m, k) table, mSwap prefers to use Binary Tree for image compositing. It means that mSwap improves the performance of each group by using the optimal (m, k) table to perform better than other algorithms. Besides the performance, the scalability of mSwap is better than other image composit- ing algorithms. When the number of processors keeps growing, mSwap adjusts the result of grouping according to the optimal (m, k) table, which could resize the scale of each group for performance improving. However, the other image compositing algorithms can not adjust grouping flexibly. In Direct Send, the communication between processors increases rapidly with the increasing number of processors, which causes worse perfor- mance than the others. Binary Swap performs well when the number of processors is power-of-two, but it can not support the other situations. As an improved algorithm of Binary Swap, 2-3-4 Decomposition shows its well scalability. By pre-processing, it makes sure the number of processors is power-of-two. Then Binary Swap is adopted to get high performance. Radix-k is a configurable image compositing algorithm, and the value of k is the key to the performance. Radix-k uses the vector k to group in each stage, which is a bit like mSwap. The difference is Radix-k could only group according to each k value, but mSwap could adjust the grouping according to the optimal (m, k) table. It is like that Radix-k owns a dynamic vector k, which indicates the best grouping under different numbers of processors. Figure 10 shows the rendering result of the wing model. 5 Challenges In this section, several possible research challenges will be introduced. With the developing of HPC, there are various of environments both in hardware and software. Special optimization for different environments becomes a challenge gradually, such as Shift-based image composition, Multi-Step image composition, TOD-Tree image composition and so on. Considering that parallel rendering has a variety of application environments, it is important to improve image compositing algorithms based on the specific environments. For scientific analysis, especially that involved real-time data, the efficiency of visualiza- tion is very important. To improve the speed of visualization, the utilization of processors needs to be increased, the communication between processors needs to be reduced and (2021) 3:3 Hou et al. Advances in Aerodynamics Page 15 of 17 Fig. 10 Rendering result of the wing model the performance of processors needs to be kept above the average at least. To solve this problem, one method named in-situ visualization was mentioned. By in-situ visualiza- tion, researchers will visualize the intermediate data in situ for avoiding data movement, which could improve the efficiency of scientific research in some way. However, there are still a lot of problems need to be done in in-situ visualization [35, 36]. 6Conclusions In this paper, we propose an image compositing algorithm called mSwap, which aims to find the best case of processors according to the performance curves. In the past two decades, various of image compositing algorithms have been proposed to improve the performance of parallel rendering by increasing the utilization of proces- sors. Most of them are designed based on tree for its high parallelism, and their sort-last system could provide load-balancing. However, existing algorithms perform well only in some specific situations, which means algorithms are at the average or even under the average in most cases. mSwap is proposed to settle this problem. Firstly, mSwap groups all the processors according to the optimal (m, k) table for making full use of every processor. Considering that each algorithm has its own characteristic, we make sure processors running at their best point by using the table containing the best scale of processors which is obtained by testing. Secondly, instead of using the existing image compositing algorithms, we com- posite image based on m-ary tree in group, which reduces the communication between processors. Finally, Binary Tree is used to composite images in the third phase because of its high performance on small scale and no image collection which is different from the other image compositing algorithms. Furthermore, it is proved that mSwap performs bet- ter than the other image compositing algorithms in a large-scale situation. Future work includes the further optimization for mSwap based on current challenges, which makes mSwap adapt to special environments easily. (2021) 3:3 Hou et al. Advances in Aerodynamics Page 16 of 17 Acknowledgements The work was carried out at National Supercomputer Center in Tianjin, and the calculations were performed on Tianhe 3 prototype. Authors’ contributions MH and CB designed and implemented the image compositing algorithm. FW and LD analyzed and extracted data from the simulation process. GZ and XM provided assistance in cluster environment. The author(s) read and approved the final manuscript. Funding This work was partially supported by the National Numerical Windtunnel Project, partially by the National Natural Science Foundation of China under Grant No. 61702360. Availability of data and materials The data used in this article is in tecplot format, any data in tecplot format can be used as the input of the source code. Competing interests The authors declare that they have no competing interests. Author details 1 2 College of Intelligence and Computing, Tianjin University, Tianjin, China. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China. National Supercomputer Center in Tianjin, Tianjin, China. Received: 19 August 2020 Accepted: 16 November 2020 References 1. Bi C, Wang J, Duan Y, Fu B, Kang J-R, Shi Y (2020) MobileNet based apple leaf diseases identification. Mob Netw Appl.1–9. https://doi.org/10.1007/s11036-020-01640-1 2. Yang L, Xie P, Bi C, Zhang R, Cai B, Shao X, Wang R (2020) Household power consumption pattern modeling through a single power sensor. Renew Energy 155:121–133 3. Bi C, Pan G, Yang L, Lin C-C, Hou M, Huang Y (2019) Evacuation route recommendation using auto-encoder and markov decision process. Appl Soft Comput 84:105741 4. Bi C, Fu B, Chen J, Zhao Y, Yang L, Duan Y, Shi Y (2019) Machine learning based fast multi-layer liquefaction disaster assessment. World Wide Web 22(5):1935–1950 5. Bi C, Yang L, Duan Y, Shi Y (2019) A survey on visualization of tensor field. J Vis 22(3):641–660 6. Yang L, Wang B, Zhang R, Zhou H, Wang R (2017) Analysis on location accuracy for the binocular stereo vision system. IEEE Photon J 10(1):1–16 7. Molnar S, Cox M, Ellsworth D, Fuchs H (1994) A sorting classification of parallel rendering. IEEE Comput Graph Appl 14(4):23–32 8. Bethel EW, Childs H, Hansen C (2012) High performance visualization: enabling extreme-scale scientific insight. CRC Press, New York 9. Nonaka J, Ono K, Miyachi H (2009) Performance evaluation of large-scale parallel image compositing on a t2k open supercomputer. Inf Media Technol 4(4):780–788 10. Eilemann S, Pajarola R (2007) Direct send compositing for parallel sort-last rendering. In: Proc. 7th Eurographics Conf. Parallel Graph. Vis, Piscataway. pp 29–36 11. Stompel A, Ma K-L, Lum EB, Ahrens J, Patchett J (2003) SLIC: scheduled linear image compositing for parallel volume rendering. In: IEEE Symposium on Parallel and Large-Data Visualization and Graphics, 2003. PVG 2003. IEEE, Piscataway. pp 33–40 12. Lee T-Y, Raghavendra CS, Nicholas JB (1996) Image composition schemes for sort-last polygon rendering on 2d mesh multicomputers. IEEE Trans Vis Comput Graph 2(3):202–217 13. Ma K-L, Painter JS, Hansen CD, Krogh MF (1994) Parallel volume rendering using binary-swap compositing. IEEE Comput Graph Appl 14(4):59–68 14. Yu H, Wang C, Ma K-L (2008) Massively parallel volume rendering using 2–3 swap image compositing. In: SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, Piscataway. pp 1–11 15. Peterka T, Goodell D, Ross R, Shen H-W, Thakur R (2009) A configurable algorithm for parallel image-compositing applications. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. IEEE, Piscataway. pp 1–10 16. Kendall W, Peterka T, Huang J, Shen H-W, Ross RB (2010) Accelerating and benchmarking radix-k image compositing at large scale. EGPGV 10:101–110 17. Nonaka J, Bi C, Fujita M, Ono K (2014) 2-3-4 decomposition method for large-scale parallel image composition with arbitrary number of nodes. In: SIMS 2014: International Conference on Systems Informatics, Modelling and Simulation. IEEE, Piscataway. pp 59–64 18. Nonaka J, Ono K, Fujita M (2018) 234compositor: A flexible parallel image compositing framework for massively parallel visualization environments. Futur Gener Comput Syst 82:647–655 19. Yang D-L, Yu J-C, Chung Y-C (2001) Efficient compositing methods for the sort-last-sparse parallel volume rendering system on distributed memory multicomputers. J Supercomput 18(2):201–220 20. Takeuchi A, Ino F, Hagihara K (2003) An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors. Parallel Comput 29(11-12):1745–1762 (2021) 3:3 Hou et al. Advances in Aerodynamics Page 17 of 17 21. Moreland K, Kendall W, Peterka T, Huang J (2011) An image compositing solution at scale. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York. pp 1–10 22. Cavin X, Demengeon O (2012) Shift-based parallel image compositing on InfiniBand fat-trees. PGV - Eurographics Symposium on Parallel Graphics and Visualization 23. Nonaka J, Fujita M, Ono K (2015) Multi-step image composition approach for sort-last massively parallel rendering. J Adv Simul Sci Eng 2(1):108–125 24. Grosset AP, Prasad M, Christensen C, Knoll A, Hansen CD (2015) Tod-tree: Task-overlapped direct send tree image compositing for hybrid mpi parallelism. EGPGV 15:67–76 25. Larsen M, Moreland K, Johnson CR, Childs H (2016) Optimizing multi-image sort-last parallel rendering. In: 2016 IEEE 6th Symposium on Large Data Analysis and Visualization (LDAV). IEEE, Piscataway. pp 37–46 26. Aly M, Munich M, Perona P (2011) Distributed kd-trees for retrieval from very large image collections. In: Proceedings of the British Machine Vision Conference (BMVC), vol. 17. The British Machine Vision Conference, United Kingdom 27. Zhang J, Guo H, Hong F, Yuan X, Peterka T (2017) Dynamic load balancing based on constrained k-d tree decomposition for parallel particle tracing. IEEE Trans Vis Comput Graph 24(1):954–963 28. Cavin X, Mion C, Filbois A (2005) Cots cluster-based sort-last rendering: Performance evaluation and pipelined implementation. In: VIS 05. IEEE Visualization, 2005. IEEE, Piscataway. pp 111–118 29. Muraki S, Ogata M, Ma K-L, Koshizuka K, Kajihara K, Liu X, Nagano Y, Shimokawa K (2001) Next-generation visual supercomputing using pc clusters with volume graphics hardware devices. In: SC’01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing. IEEE, Piscataway. pp 51–58 30. Moreland K, Wylie B, Pavlakos C (2001) Sort-last parallel rendering for viewing extremely large data sets on tile displays. In: Proceedings IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics (Cat. No. 01EX520). IEEE, Piscataway. pp 85–154 31. Eilemann S, Makhinya M, Pajarola R (2009) Equalizer: A scalable parallel rendering framework. IEEE Trans Vis Comput Graph 15(3):436–452 32. Biedert T, Werner K, Hentschel B, Garth C (2017) A task-based parallel rendering component for large-scale visualization applications. In: EGPGV. The Eurographics Association, Germany. pp 63–71 33. Avila LS, Barre S, Blue R, Geveci B, Henderson A, Hoffman WA, King B, Law CC, Martin KM, Schroeder WJ (2010) The VTK User’s Guide. Kitware New York, New York 34. Hanwell MD, Martin KM, Chaudhary A, Avila LS (2015) The visualization toolkit (vtk): Rewriting the rendering code for modern graphics cards. SoftwareX 1:9–12 35. Ma K-L (2009) In situ visualization at extreme scale: Challenges and opportunities. IEEE Comput Graph Appl 29(6):14–19 36. Childs H, Bennett JC, Garth C, Hentschel B, Rhyne T (2019) In situ visualization for computational science. IEEE Comput Graph Appl 39(6):76–85 Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Advances in Aerodynamics – Springer Journals
Published: Jan 27, 2021