Finite Element Simulation of a Flexible Manipulator - Part 2: Parallel Processing Techniques

M.O. Tokhi, M.H. Shaheed, D.N. Ramos-Hernandez and H. Poerwanto
Department of Automatic Control and Systems Engineering, The University of Sheffield, UK.

Received 10th January 1999

ABSTRACT
This paper presents an investigation into the performance evaluation of homogeneous and heterogeneous parallel architectures in the real-time implementation of a finite element (FE) simulation algorithm of a flexible manipulator. The algorithm is implemented on a number of homogeneous and heterogeneous architectures incorporating high-performance processors. The partitioning and mapping of the algorithm on both the homogeneous and heterogeneous architectures are investigated and, finally, a comparative assessment of the performance of the architectures in implementing the algorithm, revealing their capabilities in relation to the nature of the algorithm, is presented in terms of execution time and speedup.

KEYWORDS: Finite element method, flexible manipulator, heterogeneous architectures, homogeneous architectures, parallel processing, performance metrics, simulation, speedup, task to processor allocation.

1. INTRODUCTION
Parallel processing (PP) is a subject of widespread interest for real-time signal processing and control. The concept of applying PP to different problems, or to different parts of the same problem, is not new. Discussions of parallel computing machines are found in the literature at least as far back as the 1920s (Denning, 1986). It is noted that, throughout the years, there has been a continuing research effort to understand parallel computing (Hockney and Jesshope, 1981).
Such effort has intensified dramatically in the last few years, involving various applications in signal processing, control, artificial intelligence, pattern recognition, computer vision, computer-aided design and discrete-event simulation. In a conventional parallel system all the processing elements (PEs) are identical. Such an architecture can be described as homogeneous. However, many algorithms are heterogeneous, as they usually have varying computational requirements. The implementation of an algorithm on a heterogeneous architecture having PEs of different types and features can provide a closer match with the varying hardware requirements and thus lead to performance enhancement. However, the relationship between algorithms and heterogeneous architectures for real-time control systems is not well understood. The mapping of algorithms onto processors in a heterogeneous architecture is, therefore, especially challenging.

[Journal of Low Frequency Noise, Vibration and Active Control, Vol. 18, No. 3, 1999]

To exploit the heterogeneous nature of the hardware it is required to identify the heterogeneity of the algorithm so that a close match can be forged with the hardware resources available (Baxter et al., 1994). One of the challenging aspects of PP, as compared to sequential processing, is how to distribute the computational load across the PEs. This requires a consideration of several issues, including the choice of algorithm, the choice of processing topology, the relative computation and communication capabilities of the processor array, and the partitioning of the algorithm into tasks and the scheduling of these tasks (Crummey et al., 1994).
It is thus essential to note that, in implementing an algorithm on a parallel computing platform, a consideration of the issues related to the interconnection schemes, the scheduling and mapping of the algorithm on the architecture, and the mechanism for detecting parallelism and partitioning the algorithm into modules or sub-tasks, will lead to a computational speedup (Ching and Wu, 1989; Hwang, 1993; Khokhar et al., 1993).

A finite element (FE) simulation algorithm characterising the dynamic behaviour of a flexible manipulator is considered in this paper. Flexible robot manipulators are receiving noticeable attention in comparison to their traditional (rigid) counterparts. This is due to several advantages they offer, such as fast speed of response, efficiency and relatively low cost. However, the dynamic behaviour of such systems comprises both rigid-body and flexible motion. Thus, for control purposes, both motions have to be accounted for at the modelling as well as the control levels. This paper presents an investigation into the real-time performance evaluation of several homogeneous and heterogeneous architectures in implementing the FE simulation algorithm of a flexible manipulator system. The purpose of this investigation is to provide a coherent analysis and evaluation of the performance of parallel computing techniques in implementing the algorithm, within the framework of real-time signal processing and control applications.

2. HARDWARE PLATFORMS AND SOFTWARE RESOURCES
A brief description of the homogeneous and heterogeneous parallel architectures utilised in this work is presented in this section. These incorporate three processor types, namely the Inmos T805 (T8) transputer, the Texas Instruments TMS320C40 (C40) DSP device and the Intel i860 (i860) vector processor (Tokhi et al., 1999). The compilers used in this work include 3L Parallel C (for the C40 and T8) and the Portland Group ANSI C (for the i860).
2.1. Homogeneous architectures
The homogeneous architectures considered include a network of C40s and a network of T8s. A pipeline topology is utilised for these architectures on the basis of the algorithm structure; this is simple to realise and is well reflected as a linear farm (Irwin and Fleming, 1992). The homogeneous architecture of C40s is shown in Figure 1. This comprises a network of C40s resident on a Transtech TDM410 motherboard and a TMB08 motherboard incorporating a T8 as a root processor. The T8 possesses 1 Mbyte of local memory and communicates with the TDM410 (C40 network) via a link adapter using serial-to-parallel communication links. The C40s, on the other hand, communicate with each other via parallel communication links. Each C40 processor possesses 3 Mbytes of DRAM and 1 Mbyte of SRAM.

The homogeneous architecture of T8s used in this work is shown in Figure 2. This comprises a network of T8s resident on a Transtech TMB08 motherboard. The root T8 incorporates 2 Mbytes of local memory, with the rest of the T8s each having 1 Mbyte. The serial links of the processors are used for communication with one another.

Figure 1. Homogeneous architecture of C40s.

Figure 2. Homogeneous architecture of T8s.

Figure 3. The i860+T8 heterogeneous architecture.

2.2 Heterogeneous architectures
The heterogeneous parallel architectures considered include an integrated i860 and T8 system, an integrated C40 and T8 system, and an integrated i860 and C40 system. Figures 3, 4 and 5 show the operational configuration of these architectures. The i860+T8 architecture comprises an IBM-compatible PC, A/D and D/A conversion facility, a TMB16 motherboard and a TTM110 board incorporating a T8 and an i860 processor. The TTM110 board also possesses 16 Mbytes of shared memory accessible only by the T8. The i860 and the T8 communicate with each other via this shared memory.
In the C40+T8 architecture the T8 is used both as the root processor, providing an interface with the host, and as an active PE. The C40 and the T8 communicate with each other via serial-to-parallel or parallel-to-serial links. In the i860+C40 architecture, the communication is established via the root T8, which provides the interface mechanism with the host and can also act as an active PE, as in the C40+T8 architecture. In the above architectures, wherever indicated, the host is utilised for development and downloading of programs. Thus, the host does not take part in the real-time implementation process.

Figure 4. The C40+T8 heterogeneous architecture.

Figure 5. The i860+C40 heterogeneous architecture.

3. THE ALGORITHM
A schematic diagram of a single-link flexible manipulator is shown in Figure 6, where I_h represents the hub inertia of the manipulator. A payload mass M_p with its associated inertia I_p is attached to the end point. A control torque T(t) is applied at the hub by an actuator motor. The angular displacement of the manipulator, in moving in the POQ plane, is denoted by θ(t). The manipulator is assumed to be stiff in vertical bending and torsion, thus allowing it to vibrate (be flexible) dominantly in the horizontal direction (the POQ plane). The shear deformation and rotary inertia effects are also ignored. For an angular displacement θ(t) and an elastic deflection u(x,t), the total (net) displacement y(x,t) of a point along the manipulator at a distance x from the hub can be described as a function of both the rigid-body motion θ(t) and the elastic deflection u(x,t) measured from the line OX:

    y(x,t) = x θ(t) + u(x,t)

This dynamic behaviour of the manipulator can easily be modelled using FE methods.
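The displacement relation above can be read directly as code. The following is a trivial sketch; the position, hub angle and deflection values are invented for illustration:

```python
# Total displacement y(x,t) = x*theta(t) + u(x,t): the rigid-body term
# x*theta plus the elastic deflection u.  The numbers are invented.
def total_displacement(x, theta, u):
    return x * theta + u

# A point 0.5 m from the hub, hub angle 0.1 rad, elastic deflection 2 mm:
y = total_displacement(x=0.5, theta=0.1, u=0.002)
print(y)   # approximately 0.052 m
```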
The steps involved in this process include: (a) discretisation of the structure into elements, (b) selection of an approximation function to interpolate the result, (c) derivation of the basic element equation, (d) incorporation of the boundary conditions and (e) solving the system equation with the inclusion of the boundary conditions. In this manner, the flexible manipulator is treated as an assemblage of n elements, and the development of the algorithm can be divided into three main parts: the FE analysis, the state-space representation and obtaining the system outputs. This process for the flexible manipulator considered yields the matrix differential equation (Tokhi et al., 1999)

    M Q̈(t) + K Q(t) = F(t)    (1)

where M and K are the system mass and stiffness matrices, F(t) is the vector of applied forces and torque, and Q(t) is the vector of angular, nodal flexural and rotational displacements. The M and K matrices in equation (1) are of size m × m and F(t) is of size m × 1, where m = 2n + 1. For the manipulator, considered as a pinned-free arm with the applied torque T at the hub, the flexural and rotational displacement, velocity and acceleration are all zero at the hub at t = 0, and the external force is F(t) = [T 0 ... 0]^T. Moreover, it is assumed that Q(0) = 0. The matrix differential equation (1) can be represented in state-space form as

    v̇ = Av + Bu
    y = Cv + Du

where

    A = [  0_m    I_m ]        B = [ 0_m ]
        [ -M⁻¹K   0_m ],           [ M⁻¹ ]

0_m is an m × m null matrix, I_m is an m × m identity matrix, and

    u = [T 0 ... 0]^T,    v = [θ u_2 θ_2 ... u_{n+1} θ_{n+1}  θ̇ u̇_2 θ̇_2 ... u̇_{n+1} θ̇_{n+1}]^T

Solving the state-space representation gives the vector of states v, i.e. the angular, nodal flexural and rotational displacements and velocities.

4. PERFORMANCE METRICS
A commonly used measure of performance of a processor in an application is speedup.
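Before turning to the performance metrics, the state-space model above can be exercised numerically. The following is a minimal sketch, not the paper's implementation: the 2 × 2 M and K matrices are placeholder values (a real run would use the assembled FE matrices of size m × m), and a simple forward-Euler scheme stands in for whatever integrator the authors used:

```python
import numpy as np

# Numerical sketch of the state-space model v' = A v + B u derived from
# M Q''(t) + K Q(t) = F(t).  The 2x2 M and K below are placeholders chosen
# for illustration -- a real run would use the assembled FE matrices.
m = 2
M = np.array([[2.0, 0.1],
              [0.1, 1.0]])                    # mass matrix (m x m)
K = np.array([[5.0, -1.0],
              [-1.0, 3.0]])                   # stiffness matrix (m x m)
Minv = np.linalg.inv(M)

# A = [ 0_m  I_m ; -M^-1 K  0_m ],  B = [ 0_m ; M^-1 ]
A = np.block([[np.zeros((m, m)), np.eye(m)],
              [-Minv @ K, np.zeros((m, m))]])
B = np.vstack([np.zeros((m, m)), Minv])

u = np.array([1.0, 0.0])                      # external force [T 0 ... 0]^T
v = np.zeros(2 * m)                           # zero initial conditions, Q(0) = 0

# Forward-Euler integration (a stand-in for the integrator actually used)
dt, steps = 1e-3, 1000
for _ in range(steps):
    v = v + dt * (A @ v + B @ u)

print(v[:m])                                  # displacements Q(t) at t = 1 s
```

With a constant torque and zero initial conditions, the hub displacement (first entry of v) rises from zero and oscillates about the static deflection K⁻¹F.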
This is defined as the ratio of the execution time of the processor in implementing the application algorithm relative to the execution time of a reference processor (Tokhi et al., 1995). The speedup thus defined provides a relative performance measure of a processor for a fixed load (task size) and can thus be referred to as fixed-load speedup. It can also be used to obtain a comparative performance measure of a processor in an application with fixed task sizes under different processing conditions.

Speedup is also a commonly used metric for parallel processing. In this context, there are three known speedup performance models: fixed-size (fixed-load) speedup, fixed-time speedup and memory-bounded speedup (Sun and Ni, 1993; Sun and Rover, 1994). Fixed-size speedup fixes the problem size (load) and emphasises how fast a problem can be solved. Fixed-time speedup argues that parallel computers are designed for otherwise intractably large problems; it fixes the execution time and emphasises how much more work can be done with PP within the same time. Memory-bounded speedup assumes that the memory capacity, as a physical limitation of the machine, is the primary constraint on large problem sizes; it allows memory capacity to increase linearly with the number of processors. Both fixed-time and memory-bounded speedup are forms of scaled speedup. The term scaled speedup has been used for memory-bounded speedup by many authors (Gustafson et al., 1988; Nussbaum and Agarwal, 1991).

Speedup S_N is defined as the ratio of the execution time T_1 on a single processor to the execution time T_N on N processors:

    S_N = T_1 / T_N

The theoretical maximum speedup that can be achieved with a parallel architecture of N identical processors working concurrently on a problem is N. This is known as the "ideal speedup".
In practice, the speedup is much less, since some processors are idle at times due to conflicts over memory access, communication delays, algorithm inefficiency and the mapping used for exploiting the natural concurrency in a computing problem (Hwang and Briggs, 1985). In some cases, however, a speedup above the ideal speedup can be obtained, due to anomalies in programming, compilation and architecture usage. For example, a single-processor system may store all its data off-chip, whereas the multi-processor system may store all its data on-chip, leading to an unpredicted increase in performance.

When speed is the goal, the power to solve problems of some magnitude in a reasonably short period of time is sought. Speed is a quantity that ideally would increase linearly with system size. Based on this reasoning, the isospeed approach, described by the average unit speed as the achieved speed of a given computing system divided by the number of processors N, has previously been proposed (Sun and Rover, 1994). This provides a quantitative measure of the behaviour of a parallel algorithm-machine combination as sizes are varied. Another useful measure in evaluating the performance of a parallel system is efficiency, E_N. This can be defined as

    E_N = (S_N / N) × 100%

Efficiency can be interpreted as providing an indication of the average utilisation of the N processors, expressed as a percentage. Furthermore, this measure allows a uniform comparison of the various speedups obtained from systems containing different numbers of processors. It has also been illustrated that the value of efficiency is directly related to the granularity of the system. Although speedup and efficiency and their variants have widely been discussed in relation to homogeneous parallel architectures, not much has been reported on such performance measures for heterogeneous architectures.
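The two definitions above can be captured in a few lines. The following is a sketch with illustrative timings, not measurements from the paper:

```python
# Fixed-load speedup S_N = T_1 / T_N and efficiency E_N = (S_N / N) x 100%.
# The timings below are illustrative, not measurements from the paper.
def speedup(t_single, t_parallel):
    return t_single / t_parallel

def efficiency(t_single, t_parallel, n):
    return speedup(t_single, t_parallel) / n * 100.0

t1, t3 = 12.0, 5.0                 # execution time on one processor vs. three
print(speedup(t1, t3))             # S_3 = 2.4
print(efficiency(t1, t3, 3))       # E_3 = 80%
```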
Due to the substantial variation in computing capabilities of the PEs, the traditional parallel performance metrics of homogeneous architectures are not suitable for heterogeneous architectures in their current form. Note, for example, that speedup and efficiency provide measures of performance of parallel computation relative to sequential computation on a single processor node. In this manner, the processing node is used as a reference node. Such a node, representing the characteristics of all the PEs, is not readily apparent in a heterogeneous architecture. In this investigation such a reference node is identified by proposing the concept of a virtual processor. Moreover, it is argued that a homogeneous architecture can be considered as a sub-class of heterogeneous architectures. In this manner, the performance metrics developed for heterogeneous architectures should be general enough to be applicable to both classes of architecture.

Attempts have previously been made at proposing the speedup of a heterogeneous architecture as the ratio of the minimum sequential execution time among the PEs over the parallel execution time of the architecture (Yan et al., 1996; Zhang and Yan, 1995). In this manner, the best PE in the architecture is utilised as the reference node and the efficiency of the architecture is defined accordingly. Although it has been shown that these definitions transform, under a specific situation, to those of a homogeneous architecture, such a transformation does not hold in general. Moreover, the concept does not fully exploit the capabilities of all the PEs in the architecture. This study attempts to propose a concept that ensures that the capabilities of the PEs are exploited by maximising the efficiency of the architecture.

Consider a heterogeneous parallel architecture of N processors.
To define the speedup and efficiency of the architecture, assume a virtual processor is constructed that would achieve a performance, in terms of average speed, equivalent to the average performance of the N processors. Let the performance characteristics of processor i (i = 1, ..., N) over a task increment ΔW be given by

    ΔW = V_i ΔT_i    (2)

where ΔT_i and V_i represent the corresponding execution time increment and average speed of the processor. Thus, the speed V_v and average execution time increment ΔT_v of the virtual processor, executing the task increment ΔW, are given as

    V_v = (1/N) Σ_{i=1..N} V_i = (1/N) Σ_{i=1..N} ΔW/ΔT_i ,    ΔT_v = ΔW / V_v    (3)

Thus, the fixed-load incremental parallel speedup S_f and generalised speedup S_g of the parallel architecture, over a task increment ΔW, can be defined as

    S_f = ΔT_v / ΔT_p ,    S_g = V_p / V_v

where ΔT_p and V_p are the execution time increment and average speed of the parallel system. In this manner, the (fixed-load) efficiency E_f and generalised efficiency E_g of the parallel architecture can be defined as

    E_f = (S_f / N) × 100% ,    E_g = (S_g / N) × 100%

Note in the above that the concepts of parallel speedup and efficiency defined for heterogeneous architectures are consistent with the corresponding definitions for homogeneous architectures. Thus, these can be referred to as the general definitions of speedup and efficiency of parallel architectures. Note further that the generalised parallel speedup and efficiency defined above are based on the assumption that the execution time to task size relationship for each processor is linear, resulting in constant speeds. If the execution time to task size relationship is not linear, then the characteristic can either be considered as piecewise linear or a variable speed obtained for each processor, and the definitions above applied accordingly.
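The virtual-processor metrics can be illustrated numerically. In the sketch below the PE speeds are invented, and the parallel speed V_p is taken as the sum of the PE speeds, i.e. the ideal case of a perfect load balance with no communication cost:

```python
# Virtual-processor sketch for a heterogeneous architecture: V_v is the mean
# of the PE speeds, eq. (3).  Taking the parallel speed V_p as the sum of the
# PE speeds (perfect load balance, no communication cost) gives the ideal
# generalised speedup and efficiency.  Speeds are invented for illustration.
speeds = [4.0, 1.0, 1.0]            # V_i of each PE, in work units per second
N = len(speeds)

v_virtual = sum(speeds) / N         # V_v = (1/N) * sum of V_i
v_parallel = sum(speeds)            # V_p in the ideal case

s_g = v_parallel / v_virtual        # generalised speedup S_g = V_p / V_v
e_g = s_g / N * 100.0               # generalised efficiency E_g

print(v_virtual, s_g, e_g)          # 2.0 3.0 100.0 -- the ideal case S_g = N
```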
5. TASK TO PROCESSOR ALLOCATION IN PARALLEL ARCHITECTURES
The concept of generalised sequential speedup can be utilised as a guide to the allocation of tasks to processors in parallel architectures so as to achieve maximum efficiency and maximum (parallel) speedup. Let the sequential speedup of processor i (in a parallel architecture) relative to the virtual processor be

    S_{i/v} = V_i / V_v ;    i = 1, ..., N    (4)

Using the processor characterisations of equations (2) and (3) for processor i and the virtual processor, equation (4) can alternatively be expressed in terms of fixed-load execution time increments as

    S_{i/v} = ΔT_v / ΔT_i ;    i = 1, ..., N

Thus, to allow 100% utilisation of the processors in the architecture, the task increments ΔW_i should be such that

    ΔT_p = ΔT_i = ΔW_i / V_i = ΔW / (N V_v) ;    i = 1, ..., N    (5)

or, using equation (4),

    ΔW_i = (V_i / V_v) (ΔW / N) = S_{i/v} ΔW / N ;    i = 1, ..., N    (6)

It follows from equation (5) that, with the distribution of load among the processors according to equation (6), the parallel architecture is characterised by an average speed of V_p = N V_v. Thus, with the distribution of load among the processors according to equation (6), the speedup and efficiency achieved with N processors are N and 100% respectively. These are the ideal speedup and efficiency of the parallel architecture. In practice, due to communication overheads, these will be less than the ideal values.

Note in the above that, in developing the performance metrics for a heterogeneous parallel architecture of N processors, the architecture is conceptually transformed into an equivalent homogeneous architecture incorporating N identical virtual processors, with the load distributed among the actual processors according to their computing capabilities to achieve maximum efficiency. For a homogeneous parallel architecture the virtual processor corresponds to a single PE of the architecture.
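The allocation rule of equation (6) can be sketched as follows, with invented speeds and load: each PE receives work in proportion to its speed, and all PEs then finish in the same time ΔW/(N V_v):

```python
# Allocation rule of eq. (6): each PE receives work in proportion to its
# speed, dW_i = (V_i / V_v) * dW / N, so every PE finishes in the same time
# dW / (N * V_v), eq. (5).  Speeds and load are invented for illustration.
speeds = [4.0, 2.0, 2.0]            # V_i (work units per second)
total_work = 80.0                   # dW
N = len(speeds)
v_virtual = sum(speeds) / N         # V_v

shares = [v / v_virtual * total_work / N for v in speeds]
times = [w / v for w, v in zip(shares, speeds)]

print(shares)                       # shares sum to total_work
print(times)                        # equal finishing times dW / (N * V_v)
```

The faster PE gets the larger share, yet every PE finishes simultaneously, which is precisely the 100% utilisation condition of equation (5).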
6. PERFORMANCE RELATED FACTORS
When contemplating the implementation of algorithms with the aid of associated software on PP systems, it is essential to organise the algorithm to realise the maximum benefits of parallelism. Several performance related factors associated with PP are discussed below.

Where several processors are required to work co-operatively on a single task, frequent exchange of data is expected among the several sub-tasks that comprise the main task. The amount of data, the frequency of data transmission, the speed of transmission and the transmission route are all significant in affecting the inter-communication within the architecture. The first two factors depend on the algorithm itself and how well it has been partitioned. The remaining two factors are a function of the hardware, depending on the inter-connection strategy, whether tightly coupled or loosely coupled. Any evaluation of the performance of the inter-connection must be, to a certain extent, quantitative. However, once a few candidate networks have been selected, detailed (and expensive) evaluation, including simulation, can be carried out and the best one selected for a proposed application.

The homogeneous and heterogeneous architectures considered in this investigation require (a) T8-T8: serial communication, (b) C40-C40: parallel communication, (c) T8-C40: serial-to-parallel communication and (d) T8-i860: shared-memory communication. The performance of all these communication links has been measured uni-directionally and bi-directionally using 4000-point floating-point data (Tokhi et al., 1997). It has been noted that, among these, the C40-C40 double-line parallel communication is the fastest, whereas the T8-C40 single-line serial-to-parallel communication is the slowest. It is also evident that, as compared to serial communication, parallel communication offers a substantial advantage.
In shared-memory communication, additional time is required in accessing and/or writing into the shared memory. In serial-to-parallel communication, on the other hand, an additional penalty is paid during the transformation of data from serial to parallel and vice versa.

There are three different issues to be considered in implementing an algorithm on a PP system: (a) identifying parallelism in the algorithm, (b) partitioning the algorithm into sub-tasks and (c) allocating the tasks to processors. These involve the inter-processor communication, the granularity of the algorithm and of the hardware, and the regularity of the algorithm. Hardware granularity is the ratio of the computational performance to the communication performance of each processor within the architecture. Similarly, task granularity is the ratio of the computational demand to the communication demand of the task. Performance benefits of parallel architectures strongly depend on these ratios (Maguire, 1991; Stone, 1987). When the ratio is very low, it becomes ineffective to use parallelism. When the ratios are very high, parallelism is potentially profitable. Typically, a high compute/communication ratio is desirable.

The concept of task granularity can also be viewed in terms of computation time per task. When this is large, it is a coarse-grain task implementation; when it is small, it is a fine-grain task implementation. Although large grains may ignore potential parallelism, partitioning a problem into the finest possible granularity does not necessarily lead to the fastest solution, as maximum parallelism also has the maximum overhead, particularly due to increased communication requirements. Therefore, when partitioning and mapping the algorithm onto the PEs, it is essential to choose an algorithm granularity that balances useful parallel computation against communication overheads (Nocetti and Fleming, 1991).
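The granularity ratios described above reduce to a simple compute-to-communication quotient. A sketch with invented timings:

```python
# Granularity as a compute-to-communication ratio: parallelism tends to pay
# off only when computation per task dominates the communication it incurs.
# The timings are invented for illustration.
def granularity(compute_time, comm_time):
    return compute_time / comm_time

coarse = granularity(compute_time=50e-3, comm_time=1e-3)   # coarse grain
fine = granularity(compute_time=0.5e-3, comm_time=1e-3)    # fine grain

print(coarse > 1.0, fine > 1.0)    # True False
```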
Regularity is a term used to describe the degree of uniformity in the execution thread of the computation. Many algorithms can be expressed by matrix computations. This leads to the regular iterative (RI) type of algorithm, due to their very regular structure. In implementing these types of algorithms, a vector processor will, in principle, be expected to perform better. Moreover, if a large amount of data is to be handled for computation in this type of algorithm, the performance will be further enhanced if the processor has more internal data cache, instruction cache and/or a built-in maths co-processor. In implementing these algorithms on a PP platform, the tasks could be distributed uniformly among the PEs. However, this may require a large amount of communication between the processors and, therefore, will be a detriment to the performance of the computing platform in both homogeneous and heterogeneous architectures.

There are two main approaches to allocating tasks to processors: static and dynamic. In static allocation, the association of a group of tasks with a processor is resolved before run time and remains fixed throughout the execution, whereas in dynamic allocation, tasks are allocated to processors at run time according to certain criteria, such as processor availability, inter-task dependencies and task priorities. Whatever method is used, a clear appreciation is required of the overheads and the parallelism/communication trade-off. Dynamic allocation offers greater potential for optimum processor utilisation, but it also incurs a performance penalty associated with scheduling software overheads and increased communication requirements, which may prove unacceptable in some real-time applications.
7. IMPLEMENTATIONS AND RESULTS
To implement the FE simulation algorithm, an aluminium-type flexible manipulator of dimensions 960 × 19.23 × 3.2 mm³, mass density 2710 kg/m³, manipulator inertia I = 0.0495 kg·m² and hub inertia I_h = 5.2530 × 10⁻¹¹ kg·m² was considered. The algorithm granularity (task size) was varied by increasing the number of elements from 1 to 20, in steps of 1. The algorithm was implemented on a number of homogeneous and heterogeneous architectures comprising the T8s, C40s and the i860. The FE simulation algorithm is matrix based. Thus, it was possible to parallelise the algorithm by distributing the matrices among the PEs of the homogeneous architectures with little communication overhead involved. In the heterogeneous architectures the tasks were distributed among the PEs in accordance with the performance metrics discussed earlier. In this investigation, the total execution time achieved by the architectures in implementing the algorithm over 1000 iterations was considered.

Figure 6. The flexible robot manipulator system.

7.1. Implementations on the homogeneous architectures
With a view to achieving the shortest possible execution time, the algorithm was implemented on the homogeneous architectures of T8s and C40s. The results of implementing the algorithm on a network of two T8s are shown in Figures 7 and 8 in terms of execution time and average speed respectively. For comparison, results of implementations on a single T8 are also shown. It is noted that, in comparison to a single-T8 implementation, significant enhancement in execution time is achieved by distributing the algorithm on two T8s. It is also noted that better performance is achieved with a higher number of elements than with a lower number of elements. The processor speed of the two T8s, as noted in Figure 8, is nearly twice the speed of a single T8.
The speedup and efficiency achieved with two T8s, relative to a single T8, are 1.8692 and 93.46% respectively.

Figure 7. Performance of the T8s in implementing the algorithm.

Figure 8. Processor speed of the T8s in implementing the algorithm.

Figure 9. Performance of the C40s in implementing the algorithm with code optimisation level-0.

TABLE 1. Speedup and efficiency of the C40s with code optimisation level-0.

    Number of C40s    Two      Three
    Speedup           1.894    2.403
    Efficiency        94.7%    80%

The algorithm was then implemented on a network of three C40 processors. In this process, code optimisation levels 0 and 2 were used. Figure 9 shows the performance of the C40s with code optimisation level-0. It is noted that the performance enhancement achieved with three C40s is not significantly different from that achieved with two C40s. As compared to a single-C40 implementation, the performance with two and three C40s improves as the number of elements increases. The average processor speeds of the C40s implementing the algorithm with code optimisation level-0 are shown in Figure 10. The corresponding execution time speedup and efficiency achieved with two and three C40s relative to a single-C40 implementation are shown in Table 1. It is noted that two C40s and three C40s are 1.894 and 2.403 times faster than a single C40 respectively. The implementation efficiencies achieved with two C40s and three C40s, on the other hand, are 94.7% and 80% respectively. Figure 10.
Processor speed of the C40s in implementing the algorithm with code optimisation level-0.

Figure 11. Performance of the C40s in implementing the algorithm with code optimisation level-2.

Figure 12. Processor speed of the C40s in implementing the algorithm with code optimisation level-2.

The impact of the optimisation facility in implementing the algorithm on the C40s is demonstrated in Figures 11 and 12. These were obtained by optimising the algorithm with optimisation level-2 of the 3L Parallel C compiler. The performance evolution in Figure 11, as noted, is similar in character to that obtained with optimisation level-0 (Figure 9). However, the execution time has reduced significantly. The corresponding enhancement in speed can be seen in Figure 12. It is noted that, in comparison to Figure 10, the performance enhancement achieved in implementing the algorithm with optimisation level-2 on one, two and three C40s is by a factor of 1.8286, 1.9037 and 2.0155 respectively. The execution time speedup and efficiency corresponding to the implementations in Figures 11 and 12 are shown in Table 2. As compared to Table 1, the speedup and efficiency achieved were significantly improved with code optimisation. This shows that, in addition to communication overhead, code optimisation and compiler efficiency play important roles in the performance of a system.

TABLE 2. Speedup and efficiency of the C40s with code optimisation level-2.

    Number of C40s    Two      Three
    Speedup           1.972    2.649
    Efficiency        98.6%    88.32%

7.2.
Implementations on the heterogeneous architectures

The algorithm was implemented on three different types of heterogeneous architectures, namely the C40+T8, i860+T8 and i860+C40. In this process the concept of virtual processor was utilised with the corresponding task allocation strategy as described earlier. Figure 13 shows the results of implementing the algorithm on the C40+T8 architecture. The characteristics of the uni-processors, the virtual processor and the corresponding theoretical parallel architecture are also shown for comparison. It is noted that for implementations up to 5 elements the performance of the actual parallel architecture is better than that of the T8. For implementations over 5-12 elements, the performance of the parallel architecture is better than that of the virtual processor and the speedup achieved is between 1 and 2, giving an efficiency of 50%-100%. For implementations beyond 12 elements, the performance of the parallel architecture matches that of the theoretical model, resulting in a speedup of 2 and an efficiency of 100%. It is also noted in Figure 13 that the characteristics of the theoretical parallel model are closer to those of the single C40 for implementations up to 10 elements. This implies that, due to the large disparity between the capabilities of the C40 and the T8, better performance will be achieved with the parallel architecture by allocating the entire task to the C40 over this range. The disparity between the capabilities of the processors in the i860+T8 architecture is even larger. This is evidenced in Figure 14, which compares the performances of the uni-processors, the corresponding virtual processor and the theoretical parallel model with that of the actual parallel i860+T8 architecture. It is noted that the performance of the i860 is close to that of the theoretical model. Thus,
the task allocation strategy in this case resulted in allocating the entire task to the i860 for implementations up to 12 elements. Beyond this the T8 was also involved, resulting in an increase in the execution time of the parallel architecture. Thus, since the performance enhancement that the T8 is expected to provide beyond 12 elements is significantly smaller than the communication overhead between the processors, the latter becomes the dominant factor. This implies that in such a situation better performance is still achieved by implementing the algorithm on the i860 alone.

Figure 13. Performance of the C40+T8 in implementing the algorithm.

Figure 14. Performance of the i860+T8 in implementing the algorithm.

Figure 15. Performance of the i860+C40 in implementing the algorithm.

Figure 15 shows the results of implementing the algorithm on the i860+C40 architecture. The characteristics of the uni-processors, the corresponding virtual processor and the theoretical parallel model are also shown. In this process, for implementations up to 4 elements the entire task was allocated to the i860. For implementations of 6 elements and beyond, the C40 is also involved.
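The threshold behaviour described for these heterogeneous pairs can be sketched in a few lines: involve the slower processor only when the predicted gain from splitting the load outweighs the communication overhead. The following pure-Python sketch uses illustrative timing figures, not the measured values from the experiments.

```python
# Sketch of the allocation behaviour discussed above: involve the slower PE of
# a heterogeneous pair only when the predicted parallel time beats the faster
# uni-processor. All timing figures below are illustrative assumptions.

def parallel_time(t_fast, t_slow, comm_overhead):
    """Predicted execution time when the load is split in proportion to speed.

    With speeds V_i = W / t_i and load shares proportional to V_i, both PEs
    finish together in t_fast * t_slow / (t_fast + t_slow), plus the overhead.
    """
    return (t_fast * t_slow) / (t_fast + t_slow) + comm_overhead

def allocate(t_fast, t_slow, comm_overhead):
    """Return 'both' if splitting the task pays off, else 'fast-only'."""
    if parallel_time(t_fast, t_slow, comm_overhead) < t_fast:
        return "both"
    return "fast-only"

# Hypothetical numbers: a fast PE paired with a much slower one; the overhead
# swamps the small gain, so the whole task stays on the fast PE.
print(allocate(t_fast=1.0, t_slow=20.0, comm_overhead=0.2))  # -> fast-only
# With more comparable PEs and modest overhead, splitting pays off.
print(allocate(t_fast=1.0, t_slow=2.0, comm_overhead=0.1))   # -> both
```

The crossover task size at which "fast-only" turns into "both" plays the same role as the 10-12 element thresholds observed for the C40+T8 and i860+T8 architectures above.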
As noted, due to the disparity in the performances of the i860 and the C40, the performance of the parallel architecture is worse than that of the virtual processor for implementations over 6-10 elements. This implies that, to overcome the performance barrier due to communication overhead in this range, allocating the entire task to the i860 for implementations up to 10 elements, and including the C40 for implementations beyond 10 elements, will result in better performance of the parallel architecture.

8. CONCLUSION

An investigation into the development of high-performance computing methods within the framework of real-time applications has been presented. A comparative performance evaluation of homogeneous and heterogeneous architectures with contrasting features in implementing an FE algorithm has been carried out. In the case of implementing the algorithm on the homogeneous architecture of C40s, the impact of code optimisation has also been investigated. It has also been demonstrated that, owing to the nature of an algorithm, a close match needs to be made between the computing requirements of the algorithm and the computing capabilities of the architecture. A quantitative measure of the performance for heterogeneous parallel computing has been carried out utilising the concepts of virtual processor and virtual parallel machine. A task allocation strategy for parallel architectures has accordingly been proposed and verified through several experiments. It has been demonstrated that, using this strategy, the capabilities of the processors in a parallel architecture are exploited and a close match is made with the requirements of the algorithm. The linear nature of the evolution of processor performance has led to the introduction of generalised speedup and efficiency, for a more comprehensive performance evaluation of parallel architectures.
These have been shown to provide suitable measures of the performance of a processor over a wide range of loading conditions and thus reflect the real-time computing capabilities of the architectures in a comprehensive manner.

9. REFERENCES

Baxter, M. J., Tokhi, M. O. and Fleming, P. J. (1994). "Parallelising algorithms to exploit heterogeneous architectures for real-time control systems", Proceedings of IEE Control-94 Conference, Coventry, 21-24 March 1994, Vol. 2, pp. 1266-1271.

Ching, P. C. and Wu, S. W. (1989). "Real-time digital signal processing system using a parallel architecture", Microprocessors and Microsystems, Vol. 13, pp. 653-658.

Crummey, T. P., Jones, D. J., Fleming, P. J. and Marnane, W. P. (1994). "A hardware scheduler for parallel processing in control applications", Proceedings of IEE Control-94 Conference, Coventry, 21-24 March 1994, Vol. 2, pp. 1098-1103.

Denning, P. J. (1986). "Parallel computing and its evolution", Communications of the ACM, Vol. 29, pp. 1163-1167.

Gustafson, J. L., Montry, G. and Benner, R. (1988). "Development of parallel methods for a 1024-processor hypercube", SIAM Journal on Scientific and Statistical Computing, Vol. 9, (4), pp. 609-638.

Hockney, R. W. and Jesshope, C. R. (1981). Parallel Computers, Hilger Publishing Co., Bristol.

Hwang, K. (1993). Advanced Computer Architecture: Parallelism, Scalability and Programmability, McGraw-Hill, California.

Hwang, K. and Briggs, F. A. (1985). Computer Architecture and Parallel Processing, McGraw-Hill, California.

Irwin, G. W. and Fleming, P. J. (1992). Transputers in Real-time Control, John Wiley, England.

Khokhar, A. A., Prasanna, V. K., Shaaban, M. E. and Wang, C. (1993). "Heterogeneous computing: challenges and opportunities", Computer, Vol. 26, pp. 18-27.

Maguire, L. P. (1991). Parallel Architecture of Kalman Filtering and Self-tuning Control, PhD thesis, The Queen's University of Belfast, UK.

Nocetti, G. D. F. and Fleming, P. J. (1991).
"Performance studies of parallel real-time controllers", Proceedings of IFAC Workshop on Algorithms and Architectures for Real-time Control, Bangor, UK, pp. 249-254.

Nussbaum, D. and Agarwal, A. (1991). "Scalability of parallel machines", Communications of the ACM, Vol. 34, pp. 57-61.

Stone, H. S. (1987). High-Performance Computer Architecture, Addison-Wesley, USA.

Sun, X.-H. and Ni, L. (1993). "Scalable problems and memory-bounded speedup", Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37.

Sun, X.-H. and Rover, D. T. (1994). "Scalability of parallel algorithm-machine combinations", IEEE Transactions on Parallel and Distributed Systems, Vol. 5, pp. 599-613.

Tokhi, M. O. and Hossain, M. A. (1995). "CISC, RISC and DSP processors in real-time signal processing and control", Microprocessors and Microsystems, Vol. 19, pp. 291-300.

Tokhi, M. O., Hossain, M. A. and Chambers, C. (1997). "Performance evaluation of DSP and transputer based systems in sequential real-time applications", Microprocessors and Microsystems, Vol. 21, pp. 237-248.

Tokhi, M. O., Shaheed, M. H., Ramos-Hernandez, D. N. and Poerwanto, H. (1999). "Finite element simulation of a flexible manipulator - Part 1: Sequential processing techniques", Journal of Low Frequency Noise, Vibration and Active Control (submitted).

Yan, Y., Zhang, X. and Song, Y. (1996). "An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW", Journal of Parallel and Distributed Computing, Vol. 38, (1), pp. 63-80.

Zhang, X. and Yan, Y. (1995). "Modelling and characterising parallel computing performance on heterogeneous networks of workstations", Proceedings of Seventh IEEE Symposium on Parallel and Distributed Processing, San Antonio, Texas, 25-28 October 1995, pp. 25-34.


Publisher: SAGE
Copyright: © 2022 by SAGE Publications Ltd unless otherwise noted. Manuscript content on this site is licensed under Creative Commons licenses.
ISSN: 0263-0923
eISSN: 2048-4046
DOI: 10.1177/026309239901800305

It is, thus, essential to note that in implementing an algorithm on a parallel computing platform, a consideration of the issues related to the interconnection schemes, the scheduling and mapping of the algorithm on the architecture, and the mechanism for detecting parallelism and partitioning the algorithm into modules or sub-tasks, will lead to a computational speedup (Ching and Wu, 1989; Hwang, 1993; Khokhar et al., 1993).
A finite element (FE) simulation algorithm characterising the dynamic behaviour of a flexible manipulator is considered in this paper. Flexible robot manipulators are receiving noticeable attention, in comparison to their traditional (rigid) counterparts. This is due to several advantages they offer, such as fast speed of response, efficiency and relatively low cost. However, the dynamic behaviour of such systems comprises both rigid body and flexible motion. Thus, for control purposes, both motions have to be accounted for at the modelling as well as the control levels. This paper presents an investigation into the real-time performance evaluation of several homogeneous and heterogeneous architectures in implementing the FE simulation algorithm of a flexible manipulator system. The purpose of this investigation is to provide a coherent analysis and evaluation of the performance of parallel computing techniques in implementing the algorithm, within the framework of real-time signal processing and control applications.

2. HARDWARE PLATFORMS AND SOFTWARE RESOURCES

A brief description of the homogeneous and heterogeneous parallel architectures utilised in this work is presented in this section. These incorporate three processor types, namely the Inmos T805 (T8) transputer, the Texas Instruments TMS320C40 (C40) DSP device and the Intel 80i860 (i860) vector processor (Tokhi et al., 1999). The compilers used in this work include the 3L Parallel C (for the C40 and T8) and the Portland Group ANSI C (for the i860).

2.1. Homogeneous architectures

The homogeneous architectures considered include a network of C40s and a network of T8s. A pipeline topology is utilised for these architectures, on the basis of the algorithm structure; this is simple to realise and is well reflected as a linear farm (Irwin and Fleming, 1992). The homogeneous architecture of C40s is shown in Figure 1.
This comprises a network of C40s resident on a Transtech TDM410 motherboard and a TMB08 motherboard incorporating a T8 as a root processor. The T8 possesses 1 Mbyte of local memory and communicates with the TDM410 (C40 network) via a link adapter using serial-to-parallel communication links. The C40s, on the other hand, communicate with each other via parallel communication links. Each C40 processor possesses 3 Mbytes DRAM and 1 Mbyte SRAM.

The homogeneous architecture of T8s used in this work is shown in Figure 2. This comprises a network of T8s resident on a Transtech TMB08 motherboard. The root T8 incorporates 2 Mbytes of local memory, with the rest of the T8s each having 1 Mbyte. The serial links of the processors are used for communication with one another.

Figure 1. Homogeneous architecture of C40s.

Figure 2. Homogeneous architecture of T8s.

Figure 3. The i860+T8 heterogeneous architecture.

2.2 Heterogeneous architectures

The heterogeneous parallel architectures considered include an integrated i860 and T8 system, an integrated C40 and T8 system, and an integrated i860 and C40 system. Figures 3, 4 and 5 show the operational configuration of these architectures. The i860+T8 architecture comprises an IBM compatible PC, A/D and D/A conversion facilities, a TMB16 motherboard and a TTM110 board incorporating a T8 and an i860 processor. The TTM110 board also possesses 16 Mbytes of shared memory, through which the i860 and the T8 communicate with each other. In the C40+T8 architecture the T8 is used both as the root processor providing an interface with the host, and as an active PE. The C40 and the T8 communicate with each other via serial-to-parallel or parallel-to-serial links.
In the i860+C40 architecture, the communication is established via the root T8, which provides an interface mechanism with the host and can also act as an active PE, as in the C40+T8 architecture. In the above architectures, wherever indicated, the host is utilised for development and downloading of programs. Thus, the host does not take part in the real-time implementation process.

Figure 4. The C40+T8 heterogeneous architecture.

Figure 5. The i860+C40 heterogeneous architecture.

3. THE ALGORITHM

A schematic diagram of a single-link flexible manipulator is shown in Figure 6, where I_h represents the hub inertia of the manipulator. A payload mass M_p with its associated inertia I_p is attached to the end point. A control torque T(t) is applied at the hub by an actuator motor. The angular displacement of the manipulator, in moving in the POQ plane, is denoted by θ(t). The manipulator is assumed to be stiff in vertical bending and torsion, thus allowing it to vibrate (be flexible) dominantly in the horizontal direction (POQ plane). The shear deformation and rotary inertia effects are also ignored. For an angular displacement θ(t) and an elastic deflection u(x,t), the total (net) displacement y(x,t) of a point along the manipulator at a distance x from the hub can be described as a function of both the rigid body motion θ(t) and the elastic deflection u(x,t) measured from the line OX:

y(x,t) = x θ(t) + u(x,t)

The dynamic behaviour of the manipulator can easily be modelled using FE methods. The steps involved in this process include: (a) discretisation of the structure into elements, (b) selection of an approximation function to interpolate the result, (c) derivation of the basic element equation, (d) incorporation of the boundary conditions and (e) solving the system equation with the inclusion of the boundary conditions.
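As a rough illustration of how steps (a)-(d) fit together, the following pure-Python sketch assembles global mass and stiffness matrices for a chain of elements. The one-degree-of-freedom-per-node layout and the 2x2 element matrices are simplifying assumptions for brevity (the manipulator element of this paper carries both flexural and rotational degrees of freedom), not the actual FE formulation.

```python
# Illustrative sketch of FE steps (a)-(d): assemble global mass (M) and
# stiffness (K) matrices by overlapping 2x2 element matrices along the
# diagonal. One DOF per node is assumed; the element matrices below are
# placeholders, not the flexible manipulator's actual element matrices.

def assemble(n_elements, m_e, k_e):
    """Assemble global (n+1) x (n+1) matrices from 2x2 element matrices."""
    n_nodes = n_elements + 1
    M = [[0.0] * n_nodes for _ in range(n_nodes)]
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_elements):  # element e couples nodes e and e + 1
        for a in range(2):
            for b in range(2):
                M[e + a][e + b] += m_e[a][b]
                K[e + a][e + b] += k_e[a][b]
    return M, K

# Placeholder element matrices: lumped mass and a unit-stiffness, spring-like element.
m_e = [[0.5, 0.0], [0.0, 0.5]]
k_e = [[1.0, -1.0], [-1.0, 1.0]]
M, K = assemble(3, m_e, k_e)
# Interior nodes accumulate contributions from both adjacent elements,
# e.g. M[1][1] == 1.0 and K[1][1] == 2.0 here. Step (d) boundary conditions
# would then constrain the appropriate rows and columns (e.g. at the hub).
```

Increasing `n_elements` grows the global matrices linearly, which is exactly the task-size (granularity) knob varied from 1 to 20 elements in the implementations reported later.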
In this manner, the flexible manipulator is treated as an assemblage of n elements, and the development of the algorithm can be divided into three main parts: the FE analysis, the state-space representation and obtaining the system outputs. This process for the flexible manipulator considered yields the matrix differential equation (Tokhi et al., 1999)

M (d²Q(t)/dt²) + K Q(t) = F(t)    (1)

where M and K are the system mass and stiffness matrices and F(t) is the vector of applied forces and torque. The M and K matrices in equation (1) are of size m × m and F(t) is of size m × 1, where m = 2n + 1. For the manipulator, considered as a pinned-free arm with the applied torque T at the hub, the flexural and rotational displacement, velocity and acceleration are all zero at the hub at t = 0, and the external force is F(t) = [T 0 ... 0]^T. Moreover, it is assumed that Q(0) = 0. The matrix differential equation (1) can be represented in state-space form as

dv/dt = Av + Bu,  y = Cv + Du

where

A = [0_m  I_m; -M^-1 K  0_m],  B = [0_m; M^-1]

0_m is an m × m null matrix, I_m is an m × m identity matrix, and

u = [T 0 ... 0]^T,  v = [θ u_2 θ_2 ... u_{n+1} θ_{n+1}  θ' u_2' θ_2' ... u_{n+1}' θ_{n+1}']^T

Solving the state-space representation gives the vector of states v, i.e. the angular, nodal flexural and rotational displacements and velocities.
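The block structure of A and B above can be sketched directly. The following pure-Python example builds A = [0_m I_m; -M^-1 K 0_m] and B = [0_m; M^-1] from small illustrative M and K matrices (not the manipulator's actual FE matrices).

```python
# Sketch: build the state-space matrices A (2m x 2m) and B (2m x m) of the
# representation above from M and K. The demo matrices are illustrative only.

def mat_inv(M):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(M)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def state_space(M, K):
    """Return A and B for M q'' + K q = F, with v = [q; q']."""
    m = len(M)
    Minv = mat_inv(M)
    MinvK = matmul(Minv, K)
    A = [[0.0] * m + [float(i == j) for j in range(m)] for i in range(m)]   # [0  I]
    A += [[-MinvK[i][j] for j in range(m)] + [0.0] * m for i in range(m)]   # [-M^-1 K  0]
    B = [[0.0] * m for _ in range(m)] + [row[:] for row in Minv]            # [0; M^-1]
    return A, B

A, B = state_space([[2.0, 0.0], [0.0, 1.0]], [[3.0, -1.0], [-1.0, 1.0]])
# For this M, M^-1 = [[0.5, 0], [0, 1]], so the lower-left block of A is
# -M^-1 K = [[-1.5, 0.5], [1.0, -1.0]].
```

In the actual algorithm these matrix operations, being regular and data-parallel, are exactly the part distributed among the PEs in the implementations described later.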
This can also be used to obtain a comparative performance measure of a processor in an application with fixed task sizes under different processing conditions. Speedup is also a commonly used metric for parallel processing. In this context, there are three known speedup performance models: fixed-size (fixed-load) speedup, fixed-time speedup and memory-bounded speedup (Sun and Ni, 1993; Sun and Rover, 1994). Fixed-size speedup fixes the problem size (load) and emphasises how fast a problem can be solved. Fixed-time speedup argues that parallel computers are designed for otherwise intractably large problems; it fixes the execution time and emphasises how much more work can be done with PP within the same time. Memory-bounded speedup assumes that the memory capacity, as a physical limitation of the machine, is the primary constraint on large problem sizes; it allows memory capacity to increase linearly with the number of processors. Both fixed-time and memory-bounded speedup are forms of scaled speedup. The term scaled speedup has been used for memory-bounded speedup by many authors (Gustafson et al., 1988; Nussbaum and Agarwal, 1991). Speedup (S_N) is defined as the ratio of the execution time (T_1) on a single processor to the execution time (T_N) on N processors:

S_N = T_1 / T_N

The theoretical maximum speedup that can be achieved with a parallel architecture of N identical processors working concurrently on a problem is N. This is known as the "ideal speedup". In practice, the speedup is much less, since some processors are idle at times due to conflicts over memory access, communication delays, algorithm inefficiency and mapping for exploiting the natural concurrency in a computing problem (Hwang and Briggs, 1985). In some cases, however, speedup above the ideal value can be obtained, due to anomalies in programming, compilation and architecture usage.
For example, a single-processor system may store all its data off-chip, whereas the multi-processor system may store all its data on-chip, leading to an unpredicted increase in performance. When speed is the goal, the power to solve problems of some magnitude in a reasonably short period of time is sought. Speed is a quantity that ideally would increase linearly with system size. Based on this reasoning, the isospeed approach, described by the average unit speed as the achieved speed of a given computing system divided by the number of processors N, has previously been proposed (Sun and Rover, 1994). This provides a quantitative measure for describing the behaviour of a parallel algorithm-machine combination as sizes are varied. Another useful measure in evaluating the performance of a parallel system is efficiency (E_N). This can be defined as

E_N = (S_N / N) × 100%

Efficiency can be interpreted as providing an indication of the average utilisation of the N processors, expressed as a percentage. Furthermore, this measure allows a uniform comparison of the various speedups obtained from systems containing different numbers of processors. It has also been illustrated that the value of efficiency is directly related to the granularity of the system. Although speedup and efficiency and their variants have been widely discussed in relation to homogeneous parallel architectures, not much has been reported on such performance measures for heterogeneous architectures. Due to substantial variation in the computing capabilities of the processing elements (PEs), the traditional parallel performance metrics of homogeneous architectures are not suitable for heterogeneous architectures in their current form. Note, for example, that speedup and efficiency provide measures of performance of parallel computation relative to sequential computation on a single processing node. In this manner, the processing node is used as a reference node.
Such a node, representing the characteristics of all the PEs, is not readily apparent in a heterogeneous architecture. In this investigation such a reference node is identified by proposing the concept of a virtual processor. Moreover, it is argued that a homogeneous architecture can be considered as a sub-class of heterogeneous architectures. In this manner, the performance metrics developed for heterogeneous architectures should be general enough to be applicable to both classes of architectures. Attempts have previously been made at proposing the speedup of a heterogeneous architecture as the ratio of the minimum sequential execution time among the PEs to the parallel execution time of the architecture (Yan et al., 1996; Zhang and Yan, 1995). In this manner, the best PE in the architecture is utilised as the reference node and the efficiency of the architecture is defined accordingly. Although it has been shown that these definitions transform, under a specific situation, to those of a homogeneous architecture, such a transformation does not hold in general. Moreover, the concept does not fully exploit the capabilities of all the PEs in the architecture. This study attempts to propose a concept that ensures that the capabilities of the PEs are exploited by maximising the efficiency of the architecture. Consider a heterogeneous parallel architecture of N processors. To define the speedup and efficiency of the architecture, assume a virtual processor is constructed that would achieve a performance, in terms of average speed, equivalent to the average performance of the N processors. Let the performance characteristics of processor i (i = 1, ..., N) over task increments of ΔW be given by

ΔW = V_i ΔT_i    (2)

where ΔT_i and V_i represent the corresponding execution time increment and average speed of the processor.
Thus, the speed V_v and average execution time increment ΔT_v of the virtual processor, executing the task increment ΔW, are given as

V_v = (1/N) Σ_{i=1}^{N} V_i = (1/N) Σ_{i=1}^{N} ΔW/ΔT_i,   ΔT_v = ΔW/V_v    (3)

Thus, the fixed-load increment parallel speedup S_f and generalised speedup S_g of the parallel architecture, over a task increment of ΔW, can be defined as

S_f = ΔT_v / ΔT_p,   S_g = V_p / V_v

where ΔT_p and V_p are the execution time increment and average speed of the parallel system. In this manner, the (fixed-load) efficiency E_f and generalised efficiency E_g of the parallel architecture can be defined as

E_f = (S_f / N) × 100%,   E_g = (S_g / N) × 100%

Note in the above that the concepts of parallel speedup and efficiency defined for heterogeneous architectures are consistent with the corresponding definitions for homogeneous architectures. Thus, these can be referred to as the general definitions of speedup and efficiency of parallel architectures. Note further that the generalised parallel speedup and efficiency defined above are based on the assumption that the execution time to task size relationship for each processor is linear, resulting in constant speeds. If the execution time to task size relationship is not linear, then the characteristic can either be considered as piecewise linear or a variable speed obtained for each processor and the definitions above applied accordingly.

5. TASK TO PROCESSOR ALLOCATION IN PARALLEL ARCHITECTURES

The concept of generalised sequential speedup can be utilised as a guide to the allocation of tasks to processors in parallel architectures so as to achieve maximum efficiency and maximum (parallel) speedup. Let the sequential speedup of processor i (in a parallel architecture) relative to the virtual processor be

S_{i/v} = V_i / V_v;   i = 1, ..., N    (4)

Using the processor characterisations of equations (2) and (3) for processor i and the virtual processor, equation (4) can alternatively be expressed in terms of fixed-load execution time increments as

S_{i/v} = ΔT_v / ΔT_i;   i = 1, ..., N
sr.: i= l ,..·,N 156 FINITE ELEMENT SIMULATION OF A FLEXIBLE MANIPULATOR Thus, to allow 100% utilisation of the processors in the architecture the task increments aWi should be aw. I aw ~T =~T.= --' = =--- (5) i = I ..··,N P I V. N v, or, using equation (4), v. aw aw (6) aw = Vi N =Silv N; i = I ..··,N It follows from equation (5) that, with the distribution of load among the processors according to equation (6) the parallel architecture is characterised by having an average speed of V =NV p I' Thus, with the distribution of load among the processors according to equation (6), the speedup and efficiency achieved with N processors are Nand 100% respectively. These are the ideal speedup and efficiency of the parallel architecture. In practice, due to communication overheads, these will be less than the ideal values. Note in the above that, in developing the performance metrics for a heterogeneous parallel architecture of N processors, the architecture is conceptually transformed into an equivalent homogeneous architecture incorporating N identical virtual processors according to their computing capabilities to achieve maximum efficiency. For a homogeneous parallel architecture the virtual processor corresponds to a single PE of the architecture. 6. PERFORMANCE RELATED FACTORS When contemplating the implementation of algorithms with the aid of associated software on PP systems, it is essential to organise the algorithm to realise the maximum benefits of parallelism. Several performance related factors associated with PP are discussed below. While several processors are required to work co-operatively on a single task, frequent exchange of data is expected among the several sub-tasks that comprise the main task. The amount of data, the frequency of data transmission, the speed of transmission, and the transmission route are all significant in affecting the inter-communication within the architecture. 
The first two factors depend on the algorithm itself and how well it has been partitioned. The remaining two factors are a function of the hardware, depending on the interconnection strategy, whether tightly coupled or loosely coupled. Any evaluation of the performance of the interconnection must be, to a certain extent, quantitative. However, once a few candidate networks have been selected, a detailed (and expensive) evaluation including simulation can be carried out and the best one selected for a proposed application. The homogeneous and heterogeneous architectures considered in this investigation require (a) T8-T8: serial communication, (b) C40-C40: parallel communication, (c) T8-C40: serial-to-parallel communication and (d) T8-i860: shared-memory communication. The performance of all these communication links has been measured uni-directionally and bi-directionally using 4000-point floating-point data (Tokhi et al., 1997). It has been noted that among these the C40-C40 double-line parallel communication is the fastest, whereas the T8-C40 single-line serial-to-parallel communication is the slowest. It is also evident that, as compared to serial communication, parallel communication offers a substantial advantage. In shared-memory communication, additional time is required in accessing and/or writing into the shared memory. In serial-to-parallel communication, on the other hand, an additional penalty is paid during the transformation of data from serial to parallel and vice versa. There are three different issues to be considered in implementing an algorithm on a PP system: (a) identifying parallelism in the algorithm, (b) partitioning the algorithm into sub-tasks and (c) allocating the tasks to processors. These involve the inter-processor communication, the issues of granularity of the algorithm and of the hardware, and the regularity of the algorithm.
Hardware granularity is the ratio of the computational performance to the communication performance of each processor within the architecture. Similarly, task granularity is the ratio of the computational demand to the communication demand of a task. Performance benefits of parallel architectures strongly depend on these ratios (Maguire, 1991; Stone, 1987). When the ratios are very low, it becomes ineffective to use parallelism; when they are very high, parallelism is potentially profitable. Typically, a high compute/communication ratio is desirable. The concept of task granularity can also be viewed in terms of computation time per task. When this is large, the implementation is coarse-grain; when it is small, the implementation is fine-grain. Although large grains may ignore potential parallelism, partitioning a problem into the finest possible granularity does not necessarily lead to the fastest solution, as maximum parallelism also has the maximum overhead, particularly due to increased communication requirements. Therefore, when partitioning and mapping the algorithm onto the PEs, it is essential to choose an algorithm granularity that balances useful parallel computation against communication overheads (Nocetti and Fleming, 1991). Regularity is a term used to describe the degree of uniformity in the execution thread of the computation. Many algorithms can be expressed by matrix computations. This leads to the regular iterative (RI) type of algorithm, owing to its very regular structure. In implementing these types of algorithms, a vector processor will, in principle, be expected to perform better. Moreover, if a large amount of data is to be handled for computation in this type of algorithm, the performance will be further enhanced if the processor has more internal data cache, instruction cache and/or a built-in maths co-processor. In implementing these algorithms on a PP platform, the tasks could be distributed uniformly among the PEs.
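The granularity trade-off described above, useful parallel computation against communication overhead, can be illustrated with a toy cost model (all names and cost figures are hypothetical, not measurements from this study):

```python
def choose_grain(compute_per_item, comm_per_message, items, candidates):
    """Pick the number of grains minimising an estimated run time:
    each grain's compute runs in parallel (items/g per grain), but
    every extra grain costs one more message on the link."""
    def cost(g):
        return compute_per_item * items / g + comm_per_message * g
    return min(candidates, key=cost)

# Cheap communication favours the finest grain; expensive communication
# favours a coarser partition, exactly the balance discussed in the text.
fine = choose_grain(compute_per_item=1.0, comm_per_message=0.001,
                    items=1000, candidates=[1, 2, 4, 8, 16])
coarse = choose_grain(compute_per_item=1.0, comm_per_message=50.0,
                      items=1000, candidates=[1, 2, 4, 8, 16])
```

Under this model the cheap-link case selects 16 grains while the expensive-link case settles on 4, showing why the finest possible partition is not automatically the fastest.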
However, distributing the tasks uniformly in this way may require a large amount of communication between the processors and will therefore be a detriment to the performance of the computing platform in both homogeneous and heterogeneous architectures. There are two main approaches to allocating tasks to processors: static and dynamic. In static allocation, the association of a group of tasks with a processor is resolved before run time and remains fixed throughout the execution, whereas in dynamic allocation, tasks are allocated to processors at run time according to certain criteria, such as processor availability, inter-task dependencies and task priorities. Whatever method is used, a clear appreciation is required of the overheads and the parallelism/communication trade-off. Dynamic allocation offers greater potential for optimum processor utilisation, but it also incurs a performance penalty associated with scheduling software overheads and increased communication requirements, which may prove unacceptable in some real-time applications.

7. IMPLEMENTATIONS AND RESULTS

To implement the FE simulation algorithm, an aluminium-type flexible manipulator with dimensions 960 × 19.23 × 3.2 mm³, mass density 2710 kg/m³, manipulator inertia I = 0.0495 kg m² and hub inertia I_h = 5.2530 × 10⁻¹¹ kg m² was considered. The algorithm granularity (task size) was varied by increasing the number of elements from 1 to 20, in steps of 1. The algorithm was implemented on a number of homogeneous and heterogeneous architectures comprising the T8s, C40s and the i860. The FE simulation algorithm is matrix based. Thus, it was possible to parallelise the algorithm by distributing the matrices among the PEs of the homogeneous architectures with little communication overhead involved. In the heterogeneous architectures the tasks were distributed among the PEs in accordance with the performance metrics discussed earlier. In this investigation,
the total execution time achieved by the architectures in implementing the algorithm over 1000 iterations was considered.

Figure 6. The flexible robot manipulator system.

7.1. Implementations on the homogeneous architectures

With a view to achieving the shortest possible execution time, the algorithm was implemented on the homogeneous architectures of T8s and C40s. The results of implementing the algorithm on a network of two T8s are shown in Figures 7 and 8 in terms of execution time and average speed respectively. For comparison, results of implementation on a single T8 are also shown. It is noted that, in comparison to the single T8 implementation, a significant enhancement in execution time is achieved by distributing the algorithm on two T8s. It is also noted that better performance is achieved with a higher number of elements than with a lower number of elements. The processor speed of the two T8s, as noted in Figure 8, is nearly twice the speed of a single T8. The speedup and efficiency achieved with two T8s, relative to a single T8, are 1.8692 and 93.46% respectively.

Figure 7. Performance of the T8s in implementing the algorithm.

Figure 8. Processor speed of the T8s in implementing the algorithm.

Figure 9. Performance of the C40s in implementing the algorithm with code optimisation level-0.

TABLE 1. Speedup and efficiency of the C40s with code optimisation level-0.
Number of C40s    Two       Three
Speedup           1.894     2.403
Efficiency        94.7%     80%

The algorithm was then implemented on a network of three C40 processors. In this process, code optimisation levels 0 and 2 were used. Figure 9 shows the performance of the C40s with code optimisation level-0. It is noted that the performance enhancement achieved with three C40s is not significantly different from that achieved with two C40s. As compared to a single C40 implementation, the performance with two and three C40s improves as the number of elements increases. The average processor speeds of the C40s in implementing the algorithm with code optimisation level-0 are shown in Figure 10. The corresponding execution-time speedup and efficiency achieved with two and three C40s relative to a single C40 implementation are shown in Table 1. It is noted that two C40s and three C40s are 1.894 and 2.403 times faster than a single C40 respectively. The implementation efficiencies achieved with two C40s and three C40s, on the other hand, are 94.7% and 80% respectively.

Figure 10. Processor speed of the C40s in implementing the algorithm with code optimisation level-0.

Figure 11. Performance of the C40s in implementing the algorithm with code optimisation level-2.

Figure 12. Processor speed of the C40s in implementing the algorithm with code optimisation level-2.

The impact of the optimisation facility in implementing the algorithm on the C40s is demonstrated in Figures 11 and 12. These were obtained by optimising the algorithm with optimisation level-2 of the 3L parallel C compiler.
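The speedup and efficiency figures in Tables 1 and 2 follow from the standard execution-time metrics, sketched below (Python for illustration; the raw timing values are hypothetical, chosen only so the ratios match one Table 1 entry):

```python
def speedup_and_efficiency(t_single, t_parallel, n_procs):
    """Execution-time speedup S = T1/TN and efficiency E = S/N,
    the metrics used for Tables 1 and 2."""
    s = t_single / t_parallel
    return s, s / n_procs

# Hypothetical raw timings whose ratio matches the two-C40,
# optimisation level-0 entry of Table 1 (speedup 1.894).
s, e = speedup_and_efficiency(t_single=1.894, t_parallel=1.0, n_procs=2)
```

A speedup of 1.894 on two processors gives an efficiency of 0.947, i.e. the 94.7% quoted in Table 1.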
The performance evolution in Figure 11, as noted, is similar in character to that obtained with optimisation level-0 (Figure 9). However, the execution time is reduced significantly. The corresponding enhancement in speed can be seen in Figure 12. It is noted that, in comparison to Figure 10, the performance enhancement achieved in implementing the algorithm with optimisation level-2 on single, two and three C40s is by factors of 1.8286, 1.9037 and 2.0155 respectively. The execution-time speedup and efficiency corresponding to the implementations in Figures 11 and 12 are shown in Table 2. As compared to Table 1, the speedup and efficiency achieved were significantly improved with code optimisation. This shows that, in addition to communication overhead, code optimisation and compiler efficiency play important roles in the performance of a system.

TABLE 2. Speedup and efficiency of the C40s with code optimisation level-2.

Number of C40s    Two       Three
Speedup           1.972     2.649
Efficiency        98.6%     88.32%

7.2. Implementations on the heterogeneous architectures

The algorithm was implemented on three different types of heterogeneous architectures, namely the C40+T8, i860+T8 and i860+C40. In this process, the concept of the virtual processor was utilised with the corresponding task allocation strategy as described earlier. Figure 13 shows the results of implementing the algorithm on the C40+T8 architecture. The characteristics of the uni-processors, the virtual processor and the corresponding theoretical parallel architecture are also shown for comparison. It is noted that for implementations up to 5 elements the performance of the actual parallel architecture is better than that of the T8. For implementations over 5-12 elements, the performance of the parallel architecture is better than that of the virtual processor and the speedup achieved is between 1 and 2, giving an efficiency of 50%-100%.
For implementations beyond 12 elements, the performance of the parallel architecture matches that of the theoretical model, resulting in a speedup of 2 and an efficiency of 100%. It is also noted in Figure 13 that the characteristics of the theoretical parallel model are closer to those of the single C40 for implementations up to 10 elements. This implies that, due to the large disparity between the capabilities of the C40 and the T8, better performance will be achieved with the parallel architecture by allocating the entire task to the C40 over this range. The disparity between the capabilities of the processors in the i860+T8 architecture is even larger. This is evidenced in Figure 14, which compares the performances of the uni-processors, the corresponding virtual processor and the theoretical parallel model with that of the actual parallel i860+T8 architecture. It is noted that the performance of the i860 is close to that of the theoretical model. Thus, the task allocation strategy in this case resulted in allocating the entire task to the i860 for implementations up to 12 elements. Beyond this the T8 was also involved, resulting in an increase in the execution time of the parallel architecture. Thus, since the performance enhancement that the T8 is expected to provide beyond 12 elements is significantly smaller than the communication overhead between the processors, the latter becomes a dominant factor. This implies that in such a situation better performance is still achieved by implementing the algorithm on the i860 alone.

Figure 13. Performance of the C40+T8 in implementing the algorithm.
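The allocation behaviour described for the i860+T8 pair can be captured by a simple decision model: involve the slower processor only when its contribution outweighs the communication overhead. The sketch below is illustrative (speeds and overhead are hypothetical, not the authors' measured metric):

```python
def involve_slow_processor(work, v_fast, v_slow, comm_overhead):
    """Decide whether adding the slower processor actually helps.

    Compares the fast processor working alone against a
    speed-proportional split (equation (6)) plus a fixed
    communication overhead; returns True only when the split
    finishes sooner."""
    t_alone = work / v_fast
    t_split = work / (v_fast + v_slow) + comm_overhead
    return t_split < t_alone

# Small problems: the overhead dominates, so keep the task on the fast PE.
small = involve_slow_processor(work=10.0, v_fast=10.0, v_slow=1.0,
                               comm_overhead=0.5)
# Larger problems: the slow PE's contribution outweighs the overhead.
large = involve_slow_processor(work=100.0, v_fast=10.0, v_slow=1.0,
                               comm_overhead=0.5)
```

This reproduces the qualitative behaviour in Figures 13-15: below a problem-size threshold the entire task is best left on the faster processor, and only beyond it does the parallel architecture pay off.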
Figure 14. Performance of the i860+T8 in implementing the algorithm.

Figure 15. Performance of the i860+C40 in implementing the algorithm.

Figure 15 shows the results of implementing the algorithm on the i860+C40 architecture. The characteristics of the uni-processors, the corresponding virtual processor and the theoretical parallel model are also shown. In this process, for implementations up to 4 elements the entire task was allocated to the i860. For implementations of 6 elements and beyond, the C40 was also involved. As noted, due to the disparity in the performances of the i860 and the C40, the performance of the parallel architecture is worse than that of the virtual processor for implementations over 6-10 elements. This implies that, to overcome the performance barrier due to communication overhead in this range, allocating the entire task to the i860 for implementations up to 10 elements and including the C40 for implementations beyond 10 elements will result in better performance of the parallel architecture.

8. CONCLUSION

An investigation into the development of high-performance computing methods within the framework of real-time applications has been presented. A comparative performance evaluation of homogeneous and heterogeneous architectures with contrasting features in implementing an FE algorithm has been carried out. In the case of implementing the algorithm on the homogeneous architecture of C40s, the impact of code optimisation has also been investigated.
It has also been demonstrated that, owing to the nature of an algorithm, a close match needs to be made between the computing requirements of the algorithm and the computing capabilities of the architecture. A quantitative measure of the performance of heterogeneous parallel computing has been developed utilising the concepts of the virtual processor and the virtual parallel machine. A task allocation strategy for parallel architectures has accordingly been proposed and verified through several experiments. It has been demonstrated that, in using this strategy, the capabilities of the processors in a parallel architecture are exploited and a close match is made with the requirements of an algorithm. The linear nature of the evolution of processor performance has led to the introduction of generalised speedup and efficiency for a more comprehensive performance evaluation of parallel architectures. These have been shown to provide suitable measures of the performance of a processor over a wide range of loading conditions and thus reflect the real-time computing capabilities of the architectures in a comprehensive manner.

9. REFERENCES

Baxter, M. J., Tokhi, M. O. and Fleming, P. J. (1994). "Parallelising algorithms to exploit heterogeneous architectures for real-time control systems", Proceedings of IEE Control-94 Conference, Coventry, 21-24 March 1994, Vol. 2, pp. 1266-1271.

Ching, P. C. and Wu, S. W. (1989). "Real-time digital signal processing system using a parallel architecture", Microprocessors and Microsystems, Vol. 13, pp. 653-658.

Crummey, T. P., Jones, D. J., Fleming, P. J. and Marnane, W. P. (1994). "A hardware scheduler for parallel processing in control applications", Proceedings of IEE Control-94 Conference, Coventry, 21-24 March 1994, Vol. 2, pp. 1098-1103.

Denning, P. J. (1986). "Parallel computing and its evolution", Communications of the ACM, Vol. 29, pp. 1163-1167.

Gustafson, J. L., Montry, G. and Benner, R. (1988).
"Development of parallel methods for a 1024-processor hypercube", SIAM Journal on Scientific and Statistical Computing, Vol. 9, (4), pp. 609-638.

Hocney, R. W. and Jesshope, C. R. (1981). Parallel Computers, Hilger Publishing Co, Bristol.

Hwang, K. (1993). Advanced Computer Architecture: Parallelism, Scalability and Programmability, McGraw-Hill, California.

Hwang, K. and Briggs, F. A. (1985). Computer Architecture and Parallel Processing, McGraw-Hill, California.

Irwin, G. W. and Fleming, P. J. (1992). Transputers in Real-time Control, John Wiley, England.

Khokhar, A. A., Prasanna, V. K., Shahban, M. E. and Wang, C. (1993). "Heterogeneous computing: challenges and opportunities", Computer, Vol. 26, pp. 18-27.

Maguire, L. P. (1991). Parallel Architecture of Kalman Filtering and Self-tuning Control, PhD thesis, The Queen's University of Belfast, UK.

Nocetti, G. D. F. and Fleming, P. J. (1991). "Performance studies of parallel real-time controllers", Proceedings of IFAC Workshop on Algorithms and Architectures for Real-time Control, Bangor, UK, pp. 249-254.

Nussbaum, D. and Agrawal, A. (1991). "Scalability of parallel machines", Communications of the ACM, Vol. 34, pp. 57-61.

Stone, H. S. (1987). High Performance Computer Architecture, Addison Wesley, USA.

Sun, X.-H. and Ni, L. (1993). "Scalable problems and memory-bounded speedup", Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37.

Sun, X.-H. and Rover, D. T. (1994). "Scalability of parallel algorithm-machine combinations", IEEE Transactions on Parallel and Distributed Systems, Vol. 5, pp. 599-613.

Tokhi, M. O. and Hossain, M. A. (1995). "CISC, RISC and DSP processors in real-time signal processing and control", Microprocessors and Microsystems, Vol. 19, pp. 291-300.

Tokhi, M. O., Hossain, M. A. and Chambers, C. (1997). "Performance evaluation of DSP and transputer based systems in sequential real-time applications", Microprocessors and Microsystems, Vol. 21, pp. 237-248.

Tokhi, M. O., Shaheed, M.
H., Ramos-Hernandez, D. N. and Poerwanto, H. (1999). "Finite element simulation of a flexible manipulator, part 1: Sequential processing techniques", Journal of Low Frequency Noise, Vibration and Active Control (submitted).

Yan, Y., Zhang, X. and Song, Y. (1996). "An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW", Journal of Parallel and Distributed Computing, Vol. 38, (1), pp. 63-80.

Zhang, X. and Yan, Y. (1995). "Modelling and characterising parallel computing performance on heterogeneous networks of workstations", Proceedings of Seventh IEEE Symposium on Parallel and Distributed Processing, San Antonio, Texas, 25-28 October 1995, pp. 25-34.
