A Novel FPGA-based H.264/AVC Intra Prediction

Fuzzy Inf. Eng. (2011) 2: 183-191
DOI 10.1007/s12543-011-0076-7
ORIGINAL ARTICLE

Guo-yan Ren · Jian-jun Li

Received: 12 January 2010 / Revised: 10 April 2011 / Accepted: 17 May 2011
© Springer-Verlag Berlin Heidelberg and Fuzzy Information and Engineering Branch of the Operations Research Society of China

Guo-yan Ren, College of Electronic Information and Engineering, Chongqing University of Science and Technology, Chongqing 401331, P.R. China; email: rendai206@126.com
Jian-jun Li, Department of Electrical and Computer Engineering, University of Windsor, Windsor, Ontario, N9B 3P4, Canada

Abstract  The advanced video compression standard H.264/AVC adopts Rate Distortion Optimization to enhance coding efficiency at the cost of very high computational complexity. The intra prediction stage is the major processing bottleneck in terms of total time and power consumption. We therefore propose an efficient parallel processing structure for H.264/AVC 4 × 4 intra prediction. Unlike generic architectures that use serial processing with increased time and power consumption, a new processing order is introduced to reduce the data dependencies between consecutively executed blocks within H.264/AVC intra prediction. Our experimental results show that the parallel execution of these blocks saves up to 22.8% of power consumption with only a slight increase in bit rate.

Keywords  H.264/AVC · Rate Distortion Optimization · Parallel processing · Intra prediction

1. Introduction

H.264/AVC is a video compression standard jointly developed by the Joint Video Team (JVT) of the International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) [1, 2]. Unlike previous standards such as MPEG-2 and MPEG-4, H.264/AVC uses aggressive compression techniques that promise improved compression efficiency at the price of increased computational complexity. This improvement is mainly obtained by H.264/AVC intra prediction. While several researchers have published hardware implementations for fast intra prediction [3-5], the computations are essentially performed in serial order, leading to unnecessarily long execution times. To speed up the data processing without a significant reduction of compression efficiency, the parallel features of intra prediction have to be exploited [1].

There are three well-known hardware approaches to generating the predictors for Luma 4 × 4 blocks [6]. The first has a reduced instruction set computing (RISC) structure, which is area efficient but requires a high clock frequency to meet real-time performance requirements. The second is the reconfigurable module approach proposed in [5]. It uses reconfigurable predictor generators that exploit the inherent parallelism within one prediction mode; however, it processes the modes sequentially and thus still requires a relatively high frequency, and it suffers from the overhead of reconfiguring the hardware for different prediction modes. The third design is based on the dedicated module approach, which uses nine different hardware modules, each processing an individual Luma 4 × 4 prediction mode to generate the required predictors. The advantage of this approach is that all modes are processed in parallel; the disadvantage is the increased area.
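To make the dedicated module structure concrete, the following C sketch (an illustration added here, not the circuit described in [6]) lists the nine Luma 4 × 4 mode numbers used by the standard and routes each mode to its own predictor generator through a dispatch table; only the vertical generator is filled in, and the names Intra4x4Mode, PredGen and generate_all_predictors are hypothetical. In hardware each table entry corresponds to a dedicated module and all nine run concurrently; the loop below is sequential only because this is software.

    /* Software analogy of the dedicated module approach: one predictor
       generator per Luma 4x4 intra prediction mode (illustrative only). */
    #include <stdint.h>
    #include <string.h>

    typedef enum {              /* Luma 4x4 intra mode numbering of the standard */
        I4_VERTICAL = 0, I4_HORIZONTAL, I4_DC, I4_DIAG_DOWN_LEFT,
        I4_DIAG_DOWN_RIGHT, I4_VERT_RIGHT, I4_HOR_DOWN, I4_VERT_LEFT, I4_HOR_UP
    } Intra4x4Mode;

    typedef void (*PredGen)(const uint8_t *left, const uint8_t *top,
                            uint8_t pred[4][4]);

    /* Example generator: the vertical mode repeats the four top
       neighbouring samples down every row of the 4x4 block. */
    static void gen_vertical(const uint8_t *left, const uint8_t *top,
                             uint8_t pred[4][4])
    {
        (void)left;
        for (int r = 0; r < 4; r++)
            memcpy(pred[r], top, 4);
    }

    /* One entry per mode; the remaining eight generators would be added
       in the same way.  In a dedicated-module design each entry is its
       own hardware block. */
    static PredGen generators[9] = { [I4_VERTICAL] = gen_vertical };

    void generate_all_predictors(const uint8_t *left, const uint8_t *top,
                                 uint8_t pred[9][4][4])
    {
        for (int m = 0; m < 9; m++)
            if (generators[m])
                generators[m](left, top, pred[m]);
    }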
This paper presents a very large scale integration (VLSI) implementation of the proposed parallel processing for H.264/AVC 4 × 4 intra prediction, which can save up to 22.8% of power consumption based on experiments with the reference software JM10.2 [2]. After logic synthesis, the design is implemented on a Xilinx Virtex-4 FPGA [7, 8]. The design is expected to be a favorable contribution to real-time low-power H.264/AVC encoding and to MPEG-2 to H.264/AVC transcoding of high-definition television (HDTV). The rest of the paper is organized as follows. Section 2 briefly introduces the H.264/AVC intra prediction algorithm and proposes the parallel architecture, Section 3 presents the experimental results, and Section 4 gives the conclusions.

2. H.264/AVC Intra Prediction

In H.264/AVC, intra prediction is performed in two modes: intra 4 × 4 and intra 16 × 16. For a 4 × 4 luma block, H.264 provides nine different prediction modes, as shown in Figure 1(a), where the number attached to each arrow is the mode number; the DC prediction mode (mode 2) has no prediction direction, and its predictor is formed by averaging the adjacent samples. Figure 1(b) shows a 4 × 4 block with its predictor pixels. The sixteen pixels in one 4 × 4 block, labeled from "a" to "p", are the pixels to be predicted. Depending on the prediction mode, some of the neighbouring pixels are chosen and used as the predictors.

Fig.1 Intra 4 × 4 prediction modes
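As a concrete companion to the mode descriptions above, the following minimal C sketch shows the horizontal mode (mode 1), which is the mode for which a block needs the rightmost column of its reconstructed left neighbour, and the DC mode (mode 2), which averages the adjacent samples. It assumes that rec_left[0..3] and rec_top[0..3] already hold reconstructed neighbouring samples and that both neighbours are available; the standard's special cases for missing neighbours are omitted and the function names are illustrative.

    /* Two of the nine Luma 4x4 intra modes, assuming both neighbours
       are available and already reconstructed. */
    #include <stdint.h>

    /* Mode 1 (horizontal): each row of the block repeats its left neighbour. */
    void intra4x4_horizontal(const uint8_t rec_left[4], uint8_t pred[4][4])
    {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                pred[r][c] = rec_left[r];
    }

    /* Mode 2 (DC): the whole block is filled with the rounded average of
       the four top and four left neighbouring samples. */
    void intra4x4_dc(const uint8_t rec_top[4], const uint8_t rec_left[4],
                     uint8_t pred[4][4])
    {
        int sum = 4;                      /* rounding offset: (sum + 4) >> 3 */
        for (int i = 0; i < 4; i++)
            sum += rec_top[i] + rec_left[i];
        uint8_t dc = (uint8_t)(sum >> 3);
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                pred[r][c] = dc;
    }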
For a 16 × 16 luma block, there are four other prediction modes, which are based on a linear spatial interpolation using the upper and left-hand predictors of a macroblock (MB). A typical H.264 encoder has to perform rate distortion optimization (RDO) calculations to find the optimum prediction mode [7]. A data dependency problem arises because the prediction of one 4 × 4 block cannot be done until its neighbours have been processed. Therefore, it is impossible to process the sixteen 4 × 4 blocks within an MB simultaneously, which poses a challenge to efficient hardware implementation.

Parallel processing is one of the widely used methods in low power design. Since high computational complexity and data dependency are involved in H.264/AVC, a generic design methodology using the sequential scan mode fails to achieve parallel processing, and thus is not power-efficient.

2.1. Parallel Processing

The processing order of intra prediction employed by the JM 10.2 algorithm [2] is shown in Figure 2(a), where each box (numbered 0, 1, …, 15) is a 4 × 4 block and the number inside a box represents the order in which it is processed. For example, the block in the upper-left corner (labeled 0) is processed first, the block labeled 1 is processed next, and so on. The intra prediction of block 1 uses the rightmost four pixels of block 0 as the left-side predictors for the horizontal mode. Note that these predictors must be taken from the reconstructed frame, so block 1 can be processed only after block 0 has been reconstructed. To reconstruct block 0 after intra prediction, additional operations including the integer transform, quantization, inverse quantization and inverse integer transform have to be performed. This implies that the intra prediction of block 1 can start only after all these reconstruction operations for block 0 are finished.

Fig.2 Traditional scan mode vs. proposed scan mode

Figure 3 shows the execution sequence of block 0 and block 1, where IP, ITQ, ITQ⁻¹ and IP⁻¹ represent the intra prediction, integer transform with quantization, inverse integer transform with inverse quantization, and inverse intra prediction, respectively. The IP of block 1 depends on the IP⁻¹ of block 0 and can start only after the IP⁻¹ of block 0 is finished.

Fig.3 Serial execution sequence (block 0 runs IP, ITQ, ITQ⁻¹, IP⁻¹, and only then does block 1 run the same chain)

The execution order in Figure 3 does not use the hardware resources efficiently, since all operations are serialized and only one of the hardware modules designed for IP, ITQ, ITQ⁻¹ and IP⁻¹ is utilized at any point in time. To achieve higher efficiency, a parallel execution of these hardware modules is desirable. For example, block 4 depends on block 1 but does not depend on the previously processed block 3; therefore, block 4 can be processed in parallel with block 3. The same situation occurs for other blocks as well. With such parallel processing in the hardware architecture, one can choose a parallel scan mode as shown in Figure 2(b) for the hardware design. Figure 4 shows the parallel execution sequence.

Fig.4 Parallel execution sequence (the IP, ITQ, ITQ⁻¹, IP⁻¹ chains of blocks 0, 1, 2, …, 4, 5, …, 8, … overlap in time)
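The data dependency argument behind the parallel scan mode can be checked with a few lines of code. Assuming every 4 × 4 block depends only on its left, upper and upper-left neighbours (which the mode reduction described in Section 2.2 below guarantees), the earliest stage at which the block at position (row, col) can start is one stage after the latest of those three dependencies, which works out to row + col; blocks on the same anti-diagonal are therefore mutually independent. The sketch below is illustrative and uses plain (row, col) coordinates rather than the block numbering of Figure 2.

    /* Earliest processing stage of each 4x4 block in a macroblock, under
       left / upper / upper-left dependencies only (illustrative sketch). */
    #include <stdio.h>

    int main(void)
    {
        int stage[4][4];
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++) {
                int earliest = 0;
                /* a block may start one stage after the latest of its dependencies */
                if (c > 0 && stage[r][c-1] + 1 > earliest)             earliest = stage[r][c-1] + 1;
                if (r > 0 && stage[r-1][c] + 1 > earliest)             earliest = stage[r-1][c] + 1;
                if (r > 0 && c > 0 && stage[r-1][c-1] + 1 > earliest)  earliest = stage[r-1][c-1] + 1;
                stage[r][c] = earliest;     /* works out to r + c */
            }

        /* Print the stage map: seven stages (0..6) instead of sixteen serial
           steps, with at most four blocks active in any one stage. */
        for (int r = 0; r < 4; r++) {
            for (int c = 0; c < 4; c++)
                printf("%2d ", stage[r][c]);
            printf("\n");
        }
        return 0;
    }

The printed stage map confirms that the longest anti-diagonal of a macroblock contains four blocks, which is why the parallel level of the proposed design is at most 4.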
2.2. Mode Reduction in Intra Prediction

Of the nine intra 4 × 4 prediction modes, only Mode 3 and Mode 7 cannot be applied under the proposed scan mode, since both of them also depend on data outside their left, upper and upper-left blocks (they require samples from the upper-right neighbour). In order to make full use of the parallelism, the proposed prediction therefore excludes Mode 3 and Mode 7. This may affect the compression performance of the intra prediction; however, the major concerns of most portable and remote surveillance applications are efficient execution and/or low power consumption. The experimental conditions are shown in Table 1. Tables 2(a) and 2(b) compare four different sequences - Street car, Europe market, Whale show and Harbor scene - in terms of bit rate and peak signal-to-noise ratio (PSNR). It can be seen that there is no significant difference in bit rate or PSNR between the proposed mode and JM10.2.

Table 1: Simulation conditions.

  Video format                  HDDVD (1920 × 1088)
  Inter prediction              disabled
  Intra prediction              enabled
  Rate distortion optimization  on
  Entropy coding method         CABAC
  Frames to be encoded          90

Figure 5 illustrates the coding efficiency comparison between JM10.2 and the revised encoder with the proposed algorithm for the 1920 × 1088 HDTV Street car clip at 30 frames/sec. It can be seen from the figure that the proposed algorithm causes little degradation in performance compared with H.264/AVC full intra mode prediction.

Table 2(a): Bit rate comparison (Mbit/s) of JM10.2 and the proposed mode (Pro.) at QP 25, 28, 31, 34 and 37.

  Sequence        Algorithm    25       28       31       34       37
  Street car      JM10.2       56.864   35.185   23.343   15.081   11.355
                  Pro.         57.002   35.166   23.540   15.843   11.386
  Europe market   JM10.2       76.478   51.878   36.401   24.813   17.689
                  Pro.         76.789   52.036   36.710   24.978   17.937
  Whale show      JM10.2       74.288   51.219   36.952   25.534   17.970
                  Pro.         74.316   51.244   36.955   25.539   17.979
  Harbor scene    JM10.2       79.449   55.444   40.324   28.557   20.878
                  Pro.         79.668   55.615   40.427   28.664   20.933

Table 2(b): PSNR comparison (dB) of JM10.2 and the proposed mode (Pro.) at QP 25, 28, 31, 34 and 37.

  Sequence        Algorithm    25       28       31       34       37
  Street car      JM10.2       38.51    36.48    35.23    33.84    32.39
                  Pro.         38.4     36.43    35.10    33.71    32.36
  Europe market   JM10.2       38.48    36.21    34.50    32.80    31.32
                  Pro.         38.46    36.20    34.43    32.74    31.29
  Whale show      JM10.2       38.81    36.66    34.92    33.04    31.38
                  Pro.         38.80    36.66    34.92    33.04    31.38
  Harbor scene    JM10.2       38.49    36.24    34.47    32.62    30.93
                  Pro.         38.48    36.24    34.47    32.62    30.93

Fig.5 Performance comparison

2.3. Design Flow

The pseudo code of the proposed processing is listed below. A parallel level is first specified according to the design requirements; a higher value means a deeper level of parallelism. For intra 4 × 4 prediction, the maximum level is 4, since the length of the longest diagonal of a macroblock is 4 (see Figure 2). Note that the deeper the parallel processing, the more hardware resources are required, and more idle time arises when fewer blocks are available to be processed. For example, if the parallel level is set to 4, only block 0 (see Figure 2(b)) is available in the first processing stage, meaning that three processing units are idle. Only in the fourth processing stage are all four parallel processing units in use (for blocks 3, 6, 9 and 12 in Figure 2(b)).

The pseudo code of the proposed processing:

  - Set up parallel levels:
    Max_num = 16; P_level = 4;
  - Initialize the status of the blocks:
    for Num in 0 to P_level-1 loop
        if (D_valid(Num) = 1) then
            Prediction();
        else if (D_valid(neighbours) = 1) then
            D_valid(Num) = 1;
        end if;
    end loop;
  - Prediction processing:
    for nn in 0 to Max_num-1 loop
        if (D_valid(nn) = 1) then
            DCT; Quantization;
            Inverse Quantization; Inverse DCT;
        end if;
    end loop;

The above parallel processing consists of two loops. The first loop finds the available blocks to be processed, and the second performs the prediction processing of each available block. The prediction processing includes the discrete cosine transform, quantization, inverse quantization and inverse discrete cosine transform. The best mode, with the minimum cost for the block, is then sent to the entropy coding stage to produce the compressed video bitstream.
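For readers who prefer running code, the following C analogue (a hedged software sketch, not the VHDL of the actual design) mirrors the two-loop structure of the pseudo code above: the first loop collects the blocks whose dependencies are already reconstructed, taking at most P_LEVEL of them per stage, and the second loop "processes" them, standing in for the prediction, transform, quantization and their inverses. Blocks are numbered 0-15 in raster order here, an assumed numbering under which the fourth stage indeed occupies all four units with blocks 3, 6, 9 and 12, matching the description of Figure 2(b).

    /* Software analogue of the two-loop parallel processing above. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NUM 16        /* 4x4 blocks per macroblock */
    #define P_LEVEL 4         /* parallel level: processing units available */

    static bool done[MAX_NUM];    /* block already reconstructed */

    /* A block is ready when its left, upper and upper-left neighbours
       (those that exist) have been reconstructed. */
    static bool neighbours_done(int n)
    {
        int r = n / 4, c = n % 4;                         /* raster coordinates */
        if (c > 0 && !done[r * 4 + c - 1])                return false;  /* left */
        if (r > 0 && !done[(r - 1) * 4 + c])              return false;  /* upper */
        if (r > 0 && c > 0 && !done[(r - 1) * 4 + c - 1]) return false;  /* upper-left */
        return true;
    }

    int main(void)
    {
        int processed = 0, stage = 0;
        while (processed < MAX_NUM) {
            /* Loop 1: find blocks whose data dependencies are satisfied. */
            int ready[MAX_NUM], n_ready = 0;
            for (int n = 0; n < MAX_NUM; n++)
                if (!done[n] && neighbours_done(n) && n_ready < P_LEVEL)
                    ready[n_ready++] = n;

            /* Loop 2: process the ready blocks; in hardware these run on
               parallel units, here they are simply marked reconstructed. */
            printf("stage %d:", stage++);
            for (int i = 0; i < n_ready; i++) {
                printf(" B%d", ready[i]);
                done[ready[i]] = true;
                processed++;
            }
            printf("\n");
        }
        return 0;
    }

With P_LEVEL set to 4 the program prints seven stages, B0 alone in the first stage and B3, B6, B9, B12 together in the fourth; lowering P_LEVEL trades processing units for additional stages.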
3. Experimental Results with Discussion

The proposed parallel architecture is implemented in VHDL. The implementation is verified with register transfer level (RTL) simulations using Mentor Graphics ModelSim SE 6.1. The VHDL RTL is then synthesized, and the resulting netlist is placed and routed on a Xilinx Virtex-4 FPGA with speed grade 10 using Xilinx ISE 9.1i. The FPGA implementation can code 30 VGA frames (640 × 480) per second. The complete implementation flow is shown in Figure 6, where XPower is the Xilinx tool for estimating the power consumption of a placed and routed FPGA design. Since the Xilinx FPGA design environment is used in this work, XPower is used as the power estimation tool. The outputs from simulation (the .vcd file) and from the place-and-route tool (the .ncd file) are used as inputs to Xilinx XPower [8].

Fig.6 The implementation flow

A testbench with a clock frequency of 92 MHz is applied to the design with traditional serial processing, while lower frequencies are applied to the proposed parallel designs. The results are shown in Table 3 in terms of power savings. Lower power consumption is achieved at higher parallel levels.

Table 3: Total power savings by parallel processing.

  Parallel level   Vcc (V)   Clock (MHz)   Total power (mW)   Savings (%)
  Serial           2.5       92            391                0
  2                2.5       46            333                14.8
  3                2.5       30            311                20.5
  4                2.5       23            302                22.8

Note that the power consumption above contains both dynamic and quiescent power. In the XPower analysis, the static power reported by XPower is almost constant, since it is calculated as the product of the maximum leakage current drawn by the FPGA core and its supply voltage. If only the dynamic power consumption is considered for the parallel design, the power savings are as shown in Table 4.

Table 4: Dynamic power savings.

  Parallel level   Vcc (V)   Clock (MHz)   Dynamic power (mW)   Savings (%)
  Serial           2.5       92            391 - 280 = 111      0
  2                2.5       46            333 - 280 = 53       52.3
  3                2.5       30            311 - 280 = 31       72.1
  4                2.5       23            302 - 280 = 22       80.2

4. Conclusion

We have proposed a low power design for H.264/AVC intra 4 × 4 prediction based on parallel processing. The parallel execution removes part of the data dependency through the mode reduction method. The experimental results have shown that the proposed approach can save up to 22.8% of power in 4-level parallel processing without any significant performance degradation. The design has been implemented on a Xilinx Virtex-4 FPGA with the Xilinx ISE electronic design automation tools for power analysis, and the functional verification has been completed with ModelSim.

Acknowledgements

The authors sincerely acknowledge support for the research work under the China Scholarship Council award.

References

1. Jin G H, Lee H J (2006) A parallel and pipelined execution of H.264/AVC intra prediction. The Sixth IEEE International Conference on Computer and Information Technology (CIT'06): 246-251
2. JVT H.264 (2005) Reference software version JM 10.2
3. Pan F, Lin X, Rahardja S, Lim K P, Li Z G, Wu D J, Wu S (2005) Fast mode decision algorithm for intra prediction in H.264/AVC video coding. IEEE Trans. Circuits and Systems for Video Technology 15(7): 813-822
4. Suh K, Park S, Cho H J (2005) An efficient hardware architecture of intra prediction and TQ/IQIT module for H.264 encoder. ETRI Journal 27(5): 511-524
5. Huang Y W, Hsieh B Y, Chen T C, Chen L G (2005) Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder. IEEE Trans. Circuits and Systems for Video Technology 15(3): 378-401
6. Shafique M, Bauer L, Henkel J (2009) A parallel approach for high performance hardware design of intra prediction in H.264/AVC video codec. Design, Automation & Test in Europe Conference & Exhibition: 1434-1439
7. Elleouet D, Julien N, Houzet D, Cousin J G, Martin M (2004) Power consumption characterization and modeling of embedded memories in XILINX. Digital System Design, Euromicro Symposium: 394-401
8. http://www.xilinx.com/products/design tools/logic design/verification/xpower an.htm
