An Algorithm for Scene Text Detection Using Multibox and Semantic Segmentation

Hongbo Qin 1, Haodi Zhang 1, Hai Wang 2,*, Yujin Yan 1, Min Zhang 2 and Wei Zhao 1

1 Key Laboratory of Electronic Equipment Structure Design, Ministry of Education, Xidian University, Xi'an 710071, China; qhb0920qhb@xidian.edu.cn (H.Q.); haodiz@stu.xidian.edu.cn (H.Z.); yanyujin.xidian@gmail.com (Y.Y.); weizhao@xidian.edu.cn (W.Z.)
2 School of Aerospace Science and Technology, Xidian University, Xi'an 710071, China; minzhanghk@gmail.com
* Correspondence: wanghai@mail.xidian.edu.cn; Tel.: +86-029-8820-3115

Received: 28 January 2019; Accepted: 8 March 2019; Published: 13 March 2019
Appl. Sci. 2019, 9, 1054; doi:10.3390/app9061054

Abstract: An outside mutual correction (OMC) algorithm for natural scene text detection using multibox and semantic segmentation was developed. In the OMC algorithm, semantic segmentation and multibox were processed in parallel, and the text detection results were mutually corrected. The mutual correction process was divided into two steps: (1) the semantic segmentation results were employed in the bounding box enhancement module (BEM) to correct the multibox results; (2) the semantic bounding box module (SBM) was used to optimize the adhesion text boundaries of the semantic segmentation results. Non-maximum suppression (NMS) was adopted to merge the SBM and BEM results. Our algorithm was evaluated on the ICDAR2013 and SVT datasets. The experimental results show that the developed algorithm achieved a maximum increase of 13.62% in the F-measure score, and the highest F-measure score was 81.38%.

Keywords: scene text detection; multibox detector; semantic segmentation

1. Introduction

Scene text detection is an important and challenging task in computer vision [1,2]. The goal is to accurately locate the text regions in a scene image. This technology has a wide range of applications in image retrieval, scene analysis, blind navigation, and other fields [3]. Text found in natural scenes has different fonts, styles, and sizes. It is usually accompanied by geometric distortion, complex backgrounds, and uncontrolled lighting. Therefore, natural scene text detection remains a very open research challenge.

Early mainstream approaches focused on various heuristics that help detect characters or character components. The two most famous are: (1) the maximally stable extremal region (MSER) [4-6]; and (2) the stroke width transform (SWT) [7,8]. MSER extracts character regions with similar intensity, whereas SWT assumes that a text component has a comparable stroke width. Both MSER and SWT must be combined with additional post-processing to produce reasonable text candidates.

In recent works, various convolutional neural network (CNN)-based methods have been proposed to detect scene text [9-14]. These efforts focus on reducing the number of handcrafted features or artificial rules in text detection. Tian et al. [15] propose a text flow method that uses a minimum cost flow network to unify character CNN candidate detection, erroneous character deletion, text line extraction, and text line verification. Cho et al. [16] propose a canny text detector that uses the maximally stable region, edge similarity, text line tracking, and heuristic rule grouping to compute candidate characters.
One of the most important CNN-based methods is the fully convolutional network (FCN). Zhang et al. [17] suggest using an FCN to obtain text block candidates, a character-centroid FCN to generate auxiliary text lines, and a set of heuristic rules based on intensity and geometric consistency to reject wrong candidates. Gupta et al. [18] propose a fully convolutional regression network (FCRN) that, building on the FCN, efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. All of these algorithms use an FCN to generate text detection results containing semantic segmentation information.

Another family of methods, called multibox, uses multiple default candidate boxes to compute the position of text in an image. Since CNN-based general object detection has achieved remarkable results in recent years, scene text detection has been greatly improved by treating text words or lines as objects. High-performance object detection methods such as the faster region-based convolutional neural network (Faster R-CNN) [19], the single shot multibox detector (SSD) [20], and you only look once (YOLO) [21] have been modified to detect horizontal scene text [10,14,22,23] with great success.

In this paper, the outside mutual correction (OMC) algorithm is proposed. Existing fusion approaches use semantic segmentation as a module for extracting features inside a multibox detector, which is referred to here as inside feature extraction (IFE). During IFE processing, the feature maps extracted by semantic segmentation are enlarged first and then reduced, which usually introduces noise and reduces detection accuracy. In the proposed algorithm, semantic segmentation and multibox are processed in parallel, and the text detection results are mutually corrected. To this end, the bounding box enhancement module (BEM) and the semantic bounding box module (SBM) were designed. The proposed algorithm inherits the advantages of both methods and obtains more accurate text detection results through OMC.

The rest of the paper is organized as follows: In Section 2, we provide a brief review of the related theories, including the single shot multibox detector and the FCN. In Section 3, we describe the details of the proposed algorithm. In Section 4, we present experimental results on benchmarks and comparisons with other scene text detection systems. In Section 5, we provide the conclusions of this paper.

2. Related Work

2.1. Multibox Text Detector

The multibox text detector extracts feature maps using convolutional layers and draws multiple default bounding boxes on feature maps of different resolutions. After repeated convolution, the targets in the original image decrease in size on the feature maps; since the default boxes have fixed shapes, a target is captured at the right size on one of the resolutions. SSD is a representative multibox text detector and has several advantages: (1) SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. (2) During prediction, the network generates scores for the presence of each object category in each default box and adjusts the box to better match the object shape. (3) The network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes [20]. Figure 1 shows the SSD network architecture.

Figure 1. (a) Single shot multibox detector (SSD) architecture. From the convolutional layers C1 to C9, a number of bounding boxes and their corresponding confidences are inferred. (b) Fixed-size default bounding boxes on feature maps of different resolutions. The fixed-size default box captures different-sized targets on different-resolution feature maps. This architecture allows the SSD to capture both large and small targets in one shot.
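The default-box mechanism in point (1) can be pictured with a short sketch. The following NumPy code is a minimal, hypothetical example of laying fixed default boxes over one feature-map grid for a few aspect ratios and a single scale; the concrete scales, ratios, and layer names used by SSD or by this paper are not specified here and are assumed only for illustration.

```python
import numpy as np

def default_boxes(fmap_size, image_size, scale, aspect_ratios):
    """Generate SSD-style default boxes (cx, cy, w, h) in pixels for one feature map.

    fmap_size     -- feature map resolution, e.g. 19 for a 19x19 map
    image_size    -- input image resolution, e.g. 300
    scale         -- box scale relative to the image, e.g. 0.2
    aspect_ratios -- iterable of width/height ratios, e.g. (1.0, 2.0, 0.5)
    """
    step = image_size / fmap_size          # pixel stride between box centers
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step   # box center in the image
            for ar in aspect_ratios:
                w = scale * image_size * np.sqrt(ar)
                h = scale * image_size / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

# Example: a coarse 10x10 map of a 300x300 image with three aspect ratios per location.
print(default_boxes(10, 300, scale=0.4, aspect_ratios=(1.0, 2.0, 0.5)).shape)  # (300, 4)
```

Coarser maps use fewer, larger boxes, which is how one network pass covers both small and large text targets.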
He et al. [24] proposed an improved SSD-based network named the single shot text detector (SSTD), in which an inception structure and semantic segmentation are integrated into the SSD. The inception structure shows better performance for extracting convolutional features. A new module, named the text attention module, is used to fuse semantic segmentation information with the convolutional features. The whole semantic segmentation fusing process has three steps: (1) the feature maps are enlarged and scored during semantic segmentation using a deconvolution process; (2) in order to fuse with the original feature map, the semantic segmentation results need to be reduced to smaller sizes through convolution processing; (3) finally, the text attention module combines the semantic segmentation information and the convolutional features. In this process, the feature map is enlarged first and then reduced, introducing noise. This problem is caused by the desire to combine semantic segmentation within the multibox detector; thus, this way of integrating semantic segmentation has limited effect.
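The attention-style fusion in step (3) can be illustrated with a tiny sketch: a per-pixel text probability map, resized to the resolution of a convolutional feature map, re-weights that feature map. This is only a schematic interpretation of the text attention idea, not the SSTD implementation; the (1 + attention) weighting and the shapes are assumptions made for illustration.

```python
import numpy as np

def attention_fuse(features, text_prob):
    """Re-weight convolutional features with a text attention map.

    features  -- (C, H, W) feature map from the detector backbone
    text_prob -- (H, W) text probability map from the segmentation branch,
                 already resized to the feature map resolution
    """
    attention = text_prob[None, :, :]           # broadcast the map over all channels
    return features * (1.0 + attention)         # emphasize text regions, keep the rest

# Toy example: a 256-channel 32x32 feature map re-weighted by a text probability map.
feats = np.random.randn(256, 32, 32)
prob = np.random.rand(32, 32)
print(attention_fuse(feats, prob).shape)        # (256, 32, 32)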
2.2. Semantic Segmentation
Semantic segmentation information can be acquired from the fully convolutional network (FCN) [25]. Typically, a convolutional neural network has several fully connected layers after the convolution layers, and the feature maps generated by the convolution layers are mapped to feature vectors by these fully connected layers. For example, the 1000-dimensional vector output by the ImageNet model in [26] indicates the probability that the input image belongs to each class. The FCN instead classifies the object's class at the pixel level. It uses deconvolution layers to upsample the feature map and restore it to the input image size. After the deconvolution process, a prediction is made for each pixel, and the spatial information of the input image is preserved. The FCN fuses features from layers of different coarseness to refine the segmentation using this spatial information. Finally, the softmax classification loss is calculated pixel by pixel, which is equivalent to one training sample per pixel. The FCN architecture is shown in Figure 2.

Figure 2. (a) Fully convolutional network (FCN) architecture, including convolutional and deconvolutional layers. (b) Overlay of the input image and the FCN processing result.

Based on the FCN, DeepLab-V2 [27], containing atrous spatial pyramid pooling (ASPP), was proposed. In ASPP, parallel atrous convolutions with different rates are applied to the input feature map and fused together. As objects of the same class may have different scales in the image, ASPP helps to account for different object scales, which can improve the accuracy.
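As a rough illustration of the pixel-level prediction described above, the sketch below upsamples a coarse score map back to the input resolution and applies a per-pixel softmax. It is a minimal NumPy mock-up, not the FCN or DeepLab-V2 network used in the paper; the layer shapes and the nearest-neighbor upsampling are assumptions made for brevity (a real FCN uses learned deconvolution filters).

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Upsample a (num_classes, h, w) score map by an integer factor.
    Stands in for the learned deconvolution (transposed convolution) of an FCN."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

def pixelwise_softmax(scores):
    """Convert per-pixel class scores into per-pixel class probabilities."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=0, keepdims=True)

# Toy example: 2 classes (background, text) on an 8x8 score map restored to 64x64.
coarse_scores = np.random.randn(2, 8, 8)
full_scores = upsample_nearest(coarse_scores, factor=8)       # (2, 64, 64)
prob = pixelwise_softmax(full_scores)                         # per-pixel probabilities
text_mask = prob[1] > 0.5                                     # pixels predicted as text
print(full_scores.shape, text_mask.shape)                     # (2, 64, 64) (64, 64)
```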
Usually, semantic segmentation is treated as a special feature extraction process integrated into the convolutional neural network. The feature map is first enlarged and then reduced during the inside feature extraction process, as shown in Figure 3. During this process, noise is introduced and the detection results are affected. Therefore, the proposed method uses the outside mutual correction (OMC) algorithm to fuse semantic segmentation outside the multibox detector. In this way, the pixel-level classification of semantic segmentation can be utilized more effectively to improve detection accuracy. We used SSD and SSTD as the basic multibox text detectors, and DeepLab-V2 was used to generate the semantic segmentation information.

Figure 3. Inside feature extraction (IFE) process. The feature map is enlarged first (32 × 32 to 64 × 64) and then reduced (64 × 64 to 32 × 32).
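To make the IFE data flow in Figure 3 concrete, the sketch below mimics the enlarge-then-reduce step with simple NumPy resizing and a channel concatenation. It is a schematic stand-in under assumed shapes (32 × 32 feature maps with 32 channels, as in the figure); the real IFE path uses learned deconvolution and convolution layers rather than the nearest-neighbor resizing and averaging shown here.

```python
import numpy as np

def enlarge(fmap):
    """Nearest-neighbor 2x upsampling: stand-in for the deconvolution step (32x32 -> 64x64)."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def reduce_(fmap):
    """2x2 average pooling: stand-in for the convolution step that shrinks 64x64 -> 32x32."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# IFE path: a 32-channel 32x32 feature map is enlarged, scored, reduced, and concatenated back.
features = np.random.randn(32, 32, 32)
segmentation_branch = reduce_(enlarge(features))                  # enlarged then reduced
fused = np.concatenate([features, segmentation_branch], axis=0)   # 64 channels, 32x32
print(fused.shape)                                                # (64, 32, 32)
```

The round trip through a higher resolution and back is exactly the step the OMC algorithm avoids by keeping the two branches separate.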
3. Proposed Algorithm

3.1. Overall Framework

The proposed algorithm is shown in Figure 4. Plenty of text candidate bounding boxes are obtained from the multibox processing. Meanwhile, the text semantic segmentation result is obtained from the semantic segmentation processing; a softmax layer is added to the output layer of the semantic segmentation process to obtain the classification probability of each pixel. The text candidate bounding boxes and the semantic segmentation result are merged in the bounding box enhancement module (BEM) to eliminate the false results of the multibox processing. The semantic segmentation result also enters the semantic bounding box module (SBM); after the CRF algorithm optimizes the text boundaries, the text semantic bounding boxes are computed. Finally, the outputs of the BEM and SBM are sent to non-maximum suppression (NMS) to remove duplicate bounding boxes.

Figure 4. Framework of the scene text detection algorithm using multibox and semantic segmentation.
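The final NMS merging step can be sketched as follows. This is a generic IoU-based non-maximum suppression in NumPy, shown only to illustrate how duplicate BEM/SBM boxes would be removed; the IoU threshold of 0.5 is an assumed value, not one reported in the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box and drop any remaining box that overlaps it too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[keep]

# Merge BEM and SBM outputs (toy boxes and confidences), then suppress duplicates.
bem_boxes = np.array([[10, 10, 60, 30], [12, 11, 58, 29]], dtype=float)
sbm_boxes = np.array([[100, 40, 180, 70]], dtype=float)
all_boxes = np.vstack([bem_boxes, sbm_boxes])
all_scores = np.array([0.9, 0.7, 0.8])
print(nms(all_boxes, all_scores))   # the two near-duplicate boxes collapse into one
```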
3.2. Bounding Box Enhancement Module

The BEM merges the multibox results and the semantic segmentation result to eliminate false bounding boxes. The regional median probability of every bounding box is calculated from the semantic segmentation result, and a bounding box is removed if its regional median probability is less than a threshold. The detailed steps of the BEM are listed in Algorithm 1.

Algorithm 1. Bounding box enhancement module (BEM)
Step 1. Acquire a multibox result Rec_i = ((x1, y1), (x2, y2))_i, where i refers to the i-th result, (x1, y1) is the coordinate of the upper-left corner of the text bounding box, and (x2, y2) is the coordinate of the bottom-right corner.
Step 2. Get the rectangular area Area_Rec of Rec_i in the semantic segmentation result.
Step 3. Calculate the regional median probability: P_Area = Median(P_ij), (i, j) ∈ Area_Rec.
Step 4. Compare P_Area with the threshold T: if P_Area < T, delete Rec_i from the multibox results; otherwise, continue.
Step 5. Repeat Steps 1-4 until all multibox results have been processed.

The BEM process is shown in Figure 5. In the figure, region (a) is text and region (b) is background, yet both blocks are detected as text by the multibox processing. In the heatmap of the semantic processing result, region (a) is bright while region (b) is dark. After the processing of the bounding box enhancement module, region (a) is kept and region (b) is discarded.

Figure 5. Flow chart of the BEM.
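A compact sketch of Algorithm 1 is given below. It assumes the semantic branch has already produced a per-pixel text probability map (as described in Section 3.1) and uses an illustrative threshold T = 0.5; neither the variable names nor the threshold value are taken from the paper's implementation.

```python
import numpy as np

def bounding_box_enhancement(boxes, text_prob_map, threshold=0.5):
    """Algorithm 1 (BEM) sketch: keep a multibox result only if the median text
    probability inside its rectangle reaches the threshold.

    boxes         -- array of (x1, y1, x2, y2) candidate boxes from the multibox branch
    text_prob_map -- (H, W) per-pixel probability of the 'text' class from segmentation
    """
    kept = []
    for (x1, y1, x2, y2) in boxes.astype(int):
        region = text_prob_map[y1:y2, x1:x2]          # Step 2: rectangular area in the map
        if region.size == 0:
            continue                                   # degenerate box, discard
        p_area = np.median(region)                     # Step 3: regional median probability
        if p_area >= threshold:                        # Step 4: compare with threshold T
            kept.append((x1, y1, x2, y2))
    return np.array(kept)

# Toy example: one box over a bright (text) region, one over a dark (background) region.
prob = np.zeros((100, 200)); prob[20:40, 20:80] = 0.9
boxes = np.array([[20, 20, 80, 40], [120, 50, 180, 90]], dtype=float)
print(bounding_box_enhancement(boxes, prob))           # only the first box survives
```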
3.3. Semantic Bounding Box Module

The SBM contains two modules: CRF processing and bounding box search. The CRF processing solves the problem of boundary blur and stickiness in the text semantic segmentation. The bounding box search obtains the optimal text semantic segmentation bounding boxes according to the CRF processing result. The SBM process is shown in Figure 6.

Figure 6. Semantic bounding box module flow chart. The text boundaries in the yellow-circled areas are clearer after the CRF processing.

The semantic segmentation result has a defect: it easily causes adhesion when words are close together. The CRF algorithm is used to correct the pixel-level prediction of the semantic segmentation result, so that text edges are sharper and less sticky after CRF processing. CRFs have been employed to smooth noisy segmentation maps [28,29]. These methods use short-range CRFs to couple neighboring nodes, favoring same-label assignments for spatially proximal pixels. In this work, however, the goal is to recover the detailed local structure rather than to smooth it further. Therefore, the fully connected CRF model [30] is integrated into our network. The CRF model employs the energy function

E(x) = \sum_i \theta_i(x_i) + \sum_{i,j} \theta_{ij}(x_i, x_j),   (1)

where x is the label assignment for the pixels and the unary potential is defined as

\theta_i(x_i) = -\log P(x_i),   (2)

where P(x_i) is the label assignment probability of pixel i computed by the semantic processing. The pairwise potential has a form that allows efficient inference while using a fully connected graph, i.e., connecting all pairs of image pixels i, j. In particular, as in [30], the following expression is used:

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[ \omega_1 \exp\left( -\frac{\| p_i - p_j \|^2}{2\sigma_\alpha^2} - \frac{\| I_i - I_j \|^2}{2\sigma_\beta^2} \right) + \omega_2 \exp\left( -\frac{\| p_i - p_j \|^2}{2\sigma_\gamma^2} \right) \right],   (3)

where \mu(x_i, x_j) = 1 if x_i \neq x_j, and zero otherwise. The remaining expression uses two Gaussian kernels in different feature spaces: the first, bilateral kernel depends on both the pixel positions p and the RGB colors I, while the second kernel depends only on the pixel positions. The hyperparameters \sigma_\alpha, \sigma_\beta, and \sigma_\gamma control the scale of the Gaussian kernels. The first kernel encourages pixels with similar colors and positions to take the same label, whereas the second kernel considers spatial proximity while smoothing the boundaries.
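For readers who want to reproduce the fully connected CRF step, the sketch below shows one way to run the dense CRF of [30] on the segmentation probabilities, assuming the commonly used pydensecrf Python package. The kernel widths and compatibility weights shown are placeholder values, not the hyperparameters (\sigma_\alpha, \sigma_\beta, \sigma_\gamma, \omega_1, \omega_2) actually tuned in the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, text_prob, n_iters=5):
    """Refine a 2-class (background/text) probability map with a fully connected CRF.

    image     -- (H, W, 3) uint8 RGB image
    text_prob -- (H, W) float text probability from the semantic segmentation branch
    """
    h, w = text_prob.shape
    softmax = np.stack([1.0 - text_prob, text_prob])          # (2, H, W) class probabilities
    unary = unary_from_softmax(softmax)                       # Eq. (2): -log P(x_i)

    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(np.ascontiguousarray(unary, dtype=np.float32))
    # Eq. (3), second kernel: spatial (position-only) Gaussian.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Eq. (3), first kernel: bilateral Gaussian over positions and RGB colors.
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image), compat=5)

    q = d.inference(n_iters)                                  # mean-field inference
    return np.argmax(q, axis=0).reshape(h, w)                 # refined per-pixel text mask
```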
Figure 7 shows the effectiveness of the CRF processing.

Figure 7. Examples of CRF processing effects. (a) Two lines of text are placed in one bounding box; after CRF processing, each word has a separate bounding box. (b) An arrow is erroneously detected as text; the arrow bounding box is removed after CRF processing.

4. Experimental Results

4.1. Datasets

• ICDAR2013

The ICDAR2013 dataset [31] consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection. Some examples of the ICDAR2013 dataset are shown in Figure 8.

Figure 8. Several examples of difficulties in text detection in the ICDAR2013 dataset. (a) The background destroys the font structure. (b) Perspective transformation and uneven illumination. (c) Text is too small and has low contrast. (d) Complex background and low contrast.
Several examples of difficulties in text detection in the ICDAR2013 dataset. (a) Background destruction font structure. (b) Perspective transformation and uneven illumination. (c) Text is too small destruction font structure. (b) Perspective transformation and uneven illumination. (c) Text is too and low contrast. (d) Complex background and low contrast. small and low contrast. (d) Complex background and low contrast. Street View Text (SVT) • Street View Text (SVT) The SVT dataset is harvested from Google Street View. Image text in this data exhibits high The SVT dataset is harvested from Google Street View. Image text in this data exhibits high variability and often has low resolution [32]. In autonomous driving, the text in these Street View variability and often has low resolution [32]. In autonomous driving, the text in these Street View images helps the system confirm its position. Compared to the ICDAR2013 dataset, text in the SVT images helps the system confirm its position. Compared to the ICDAR2013 dataset, text in the SVT dataset has lower resolution, more complex light conversion, and relatively smaller text, which is more dataset has lower resolution, more complex light conversion, and relatively smaller text, which is challenging. Figure 9 shows some examples of the SVT dataset. more challenging. Figure 9 shows some examples of the SVT dataset. Image Ground Truth Figure 9. Examples of the Street View Text (SVT) dataset. Text with lower resolution, complex light conversion, and smaller text. 4.2. Exploration Study Inside feature extraction (IFE) and outside mutual correction (OMC) algorithms were tested. Since the IFE in SSTD cannot be removed alone, the performance without IFE cannot be tested. Therefore, only IFE and OMC algorithms were tested. The training and testing environments were Appl. Sci. 2019, 9, x FOR PEER REVIEW 7 of 12 Figure 7. Examples of CRF processing effects. (a) Two lines of text are placed in one bounding box. After CRF processing, each word has a separate bounding box. (b) The arrow is erroneously detected as text and the arrow bounding box is removed after CRF processing. 4. Experimental Results 4.1. Datasets • ICDAR2013 The ICDAR 2013 [31] consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection. Some examples of the ICDAR2013 dataset are shown in Figure 8. Image Ground Truth (a) (b) (c) (d) Figure 8. Several examples of difficulties in text detection in the ICDAR2013 dataset. (a) Background destruction font structure. (b) Perspective transformation and uneven illumination. (c) Text is too small and low contrast. (d) Complex background and low contrast. • Street View Text (SVT) The SVT dataset is harvested from Google Street View. Image text in this data exhibits high variability and often has low resolution [32]. In autonomous driving, the text in these Street View images helps the system confirm its position. Compared to the ICDAR2013 dataset, text in the SVT Appl. Sci. 2019, 9, 1054 8 of 13 dataset has lower resolution, more complex light conversion, and relatively smaller text, which is more challenging. Figure 9 shows some examples of the SVT dataset. Image Ground Truth Figure 9. Examples of the Street View Text (SVT) dataset. Text with lower resolution, complex light Figure 9. Examples of the Street View Text (SVT) dataset. Text with lower resolution, complex light conversion, and smaller text. conversion, and smaller text. 4.2. 
4.2. Exploration Study

The inside feature extraction (IFE) and outside mutual correction (OMC) algorithms were tested. Since the IFE in SSTD cannot be removed alone, the performance without IFE cannot be tested; therefore, only the IFE and OMC variants were compared. The training and testing environments were consistent: all methods used the same training datasets, the same number of training epochs, and the same parameter settings. We used two standard evaluation protocols: the IC13 standard and the DetEval standard [33]. The proposed method was implemented with Caffe and Matlab, running on a computer with an 8-core CPU, 32 GB RAM, a Titan Xp GPU, and Ubuntu 16.04. The complete test results are shown in Tables 1 and 2.

Table 1. Comparison of the IFE and outside mutual correction (OMC) algorithms (on ICDAR2013).

Network  | R (IC13) | P (IC13) | F (IC13) | R (DetEval) | P (DetEval) | F (DetEval)
SSD      | 52.20%   | 87.76%   | 65.18%   | 53.06%      | 87.24%      | 65.98%
SSD-IFE  | 49.17%   | 82.79%   | 61.69%   | 49.88%      | 83.29%      | 62.39%
SSD-OMC  | 69.30%   | 81.21%   | 74.78%   | 70.15%      | 85.10%      | 76.91%
SSTD-IFE | 74.54%   | 83.65%   | 78.83%   | 75.39%      | 84.07%      | 79.50%
SSTD-OMC | 80.51%   | 82.27%   | 81.38%   | 80.43%      | 87.60%      | 83.86%

R = recall, P = precision, F = F-measure.

Table 2. Comparison of the IFE and OMC algorithms (on SVT).

Network  | R (IC13) | P (IC13) | F (IC13) | R (DetEval) | P (DetEval) | F (DetEval)
SSD      | 48.61%   | 73.24%   | 58.43%   | 48.17%      | 75.12%      | 58.70%
SSD-IFE  | 50.34%   | 79.91%   | 61.77%   | 50.34%      | 79.91%      | 61.77%
SSD-OMC  | 69.91%   | 73.57%   | 71.69%   | 66.01%      | 77.97%      | 71.48%
SSTD-IFE | 78.60%   | 73.35%   | 75.89%   | 78.60%      | 73.35%      | 75.89%
SSTD-OMC | 85.13%   | 71.68%   | 77.83%   | 81.65%      | 79.78%      | 80.71%

'SSD' refers to the original SSD algorithm without IFE or OMC. 'SSD-IFE' refers to SSD with the IFE algorithm added. 'SSD-OMC' refers to SSD with the OMC algorithm added. 'SSTD-IFE' refers to SSTD with the IFE algorithm, i.e., the original SSTD. 'SSTD-OMC' refers to SSTD with the OMC algorithm added.

The experimental results show that the IFE algorithm reduced the accuracy of text detection on the ICDAR2013 dataset, while for the SVT dataset it slightly improved the accuracy. Compared with the IFE algorithm, the OMC algorithm significantly improved the F-measure score.
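The recall, precision, and F-measure columns in Tables 1-5 are related by the usual formula F = 2PR/(P + R). As a small sanity check, the snippet below recomputes the F-measure for the SSTD-OMC row of Table 1; since the published scores are rounded, the recomputed value is expected to match only to the reported precision.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# SSTD-OMC on ICDAR2013, IC13 standard (Table 1): R = 80.51%, P = 82.27%.
print(round(f_measure(0.8227, 0.8051), 4))   # ~0.8138, i.e. the reported 81.38%
```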
4.3. Experimental Results

Five methods were tested on the ICDAR2013 and SVT datasets: FCN, SSD, SSD-OMC, SSTD, and SSTD-OMC. SSD-OMC and SSTD-OMC use the proposed algorithm to combine semantic segmentation with SSD and SSTD, respectively. Table 3 shows the results of the five methods on the ICDAR2013 dataset under the two standard evaluation protocols. The SSD-OMC and SSTD-OMC algorithms show increases of 17.10% (IC13) and 17.09% (DetEval), and of 5.97% (IC13) and 5.04% (DetEval), in the recall rate relative to the SSD and SSTD algorithms, respectively, which means that more text missed by the multibox processing was detected. The SSD-OMC method is 9.60% (IC13) and 10.93% (DetEval) higher than the SSD method in the F-measure score, and the SSTD-OMC method is 2.55% (IC13) and 4.36% (DetEval) higher than the SSTD method, meaning that the multibox processing optimized by our algorithm has better detection accuracy.

Table 3. Improved algorithm compared with the original algorithms (on ICDAR2013).

Network  | R (IC13) | P (IC13) | F (IC13) | R (DetEval) | P (DetEval) | F (DetEval)
FCN      | 62.54%   | 60.80%   | 61.66%   | 66.97%      | 62.05%      | 64.42%
SSD      | 52.20%   | 87.76%   | 65.18%   | 53.06%      | 87.24%      | 65.98%
SSD-OMC  | 69.30%   | 81.21%   | 74.78%   | 70.15%      | 85.10%      | 76.91%
SSTD     | 74.54%   | 83.65%   | 78.83%   | 75.39%      | 84.07%      | 79.50%
SSTD-OMC | 80.51%   | 82.27%   | 81.38%   | 80.43%      | 87.60%      | 83.86%

Table 4 shows the results of the SSTD-OMC algorithm and four other advanced text detection algorithms on the ICDAR2013 dataset. As can be seen in the table, SSTD-OMC has the highest F-measure score, indicating that it had the best detection accuracy among these five algorithms. Some detection results are shown in Figure 10.

Table 4. Improved algorithm compared with other advanced algorithms (on ICDAR2013).

Network        | R (IC13) | P (IC13) | F (IC13) | R (DetEval) | P (DetEval) | F (DetEval)
Yin [34]       | 0.66     | 0.88     | 0.76     | 0.69        | 0.89        | 0.78
Neumann [35]   | 0.72     | 0.82     | 0.77     | -           | -           | -
Zhang [36]     | 0.74     | 0.88     | 0.80     | 0.76        | 0.88        | 0.82
Textboxes [22] | 0.74     | 0.86     | 0.80     | 0.74        | 0.88        | 0.81
SSTD-OMC       | 0.80     | 0.82     | 0.81     | 0.80        | 0.80        | 0.83

Figure 10. Some detection results and their ground truth from the ICDAR2013 dataset.

Table 5 shows the results of the five methods tested on the SVT dataset. The SSD-OMC method shows an improvement of 13.62% (IC13) and 12.78% (DetEval) in the F-measure score compared with the SSD method, while the SSTD-OMC method improves by 1.94% (IC13) and 4.82% (DetEval) compared with the SSTD method. Figure 11 shows some detection results from the SVT dataset.

Table 5. Improved algorithm compared with the original algorithms (on SVT).

Network  | R (IC13) | P (IC13) | F (IC13) | R (DetEval) | P (DetEval) | F (DetEval)
FCN      | 50.78%   | 54.94%   | 52.78%   | 55.13%      | 54.94%      | 55.03%
SSD      | 48.61%   | 73.24%   | 58.43%   | 48.17%      | 75.12%      | 58.70%
SSD-OMC  | 69.91%   | 73.57%   | 71.69%   | 66.01%      | 77.97%      | 71.48%
SSTD     | 78.60%   | 73.35%   | 75.89%   | 78.60%      | 73.35%      | 75.89%
SSTD-OMC | 85.13%   | 71.68%   | 77.83%   | 81.65%      | 79.78%      | 80.71%
Figure 11. Some detection results and their ground truth from the SVT dataset.

5. Conclusions

Our work provided an OMC algorithm to fuse multibox detection with semantic segmentation. In the OMC algorithm, semantic segmentation and multibox are processed in parallel, and the text detection results are mutually corrected. The mutual correction process has two stages. In the first stage, the pixel-level classification results of the semantic segmentation are adopted to correct the multibox bounding boxes. In the second stage, the CRF algorithm is used to precisely adjust the boundaries of the semantic segmentation results. Then NMS is introduced to merge the text bounding boxes generated by multibox and semantic segmentation. The experimental results show that the proposed OMC algorithm performs better than the original IFE algorithm: the F-measure score increased by a maximum of 13.62%, and the highest F-measure score was 81.38%. Future work will focus on more powerful and faster detection structures, as well as on rotated text detection.
Author Contributions: Conceptualization: H.Q. and H.Z.; formal analysis: H.W. and Y.Y.; investigation: H.Q. and H.Z.; methodology: H.Q. and M.Z.; writing—original draft: H.W. and W.Z.; writing—review and editing: H.Q. and H.Z.

Funding: This research was funded by the National Natural Science Foundation of China, grant number 61801357.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Fletcher, L.A.; Kasturi, R. A robust algorithm for text string separation from mixed text/graphics images. IEEE Trans. Pattern Anal. Mach. Intell. 1988, 10, 910–918.
2. Ye, Q.; Doermann, D. Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1480–1500.
3. Wang, F.; Zhao, L.; Li, X.; Wang, X.; Tao, D. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
4. Chen, H.; Tsai, S.S.; Schroth, G.; Chen, D.M.; Grzeszczuk, R.; Girod, B. Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In Proceedings of the 18th IEEE International Conference on Image Processing (ICIP), Brussels, Belgium, 11–14 September 2011.
5. Neumann, L.; Matas, J. Real-time scene text localization and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012.
6. Shi, C.; Wang, C.; Xiao, B.; Zhang, Y.; Gao, S. Scene text detection using graph model built upon maximally stable extremal regions. Pattern Recognit. Lett. 2013, 34, 107–116.
7. Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010.
8. Mosleh, A.; Bouguila, N.; Hamza, A.B. Image text detection using a bandlet-based edge detector and stroke width transform. In Proceedings of the British Machine Vision Conference (BMVC), 2012.
9. He, W.; Zhang, X.Y.; Yin, F.; Liu, C.L. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
10. Liu, Y.; Jin, L. Deep matching prior network: Toward tighter multi-oriented text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
11. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 2018, 20, 3111–3122.
12. Xiang, D.; Guo, Q.; Xia, Y. Robust text detection with vertically-regressed proposal network. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
13. Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
14. Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
15. Tian, S.; Pan, Y.; Huang, C.; Lu, S.; Yu, K.; Lim Tan, C. Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
16. Cho, H.; Sung, M.; Jun, B. Canny text detector: Fast and robust scene text localization algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
17. Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; Bai, X. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
18. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
22. Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
23. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
24. He, P.; Huang, W.; He, T.; Zhu, Q.; Qiao, Y.; Li, X. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
25. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
27. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
28. Rother, C.; Kolmogorov, V.; Blake, A. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (TOG) 2004.
29. Kohli, P.; Torr, P.H. Robust higher order potentials for enforcing label consistency. Int. J. Comput. Vis. 2009, 82, 302–324.
30. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. Adv. Neural Inf. Process. Syst. 2011, 24, 109–117.
31. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; de las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 25–28 August 2013.
32. Wang, K.; Belongie, S. Word spotting in the wild. In European Conference on Computer Vision; Springer: Berlin, Germany, 2010.
33. Lucas, S.M.; Panaretos, A.; Sosa, L.; Tang, A.; Wong, S.; Young, R.; Ashida, K.; Nagai, H.; Okamoto, M.; Yamamoto, H. ICDAR 2003 robust reading competitions: Entries, results, and future directions. Int. J. Doc. Anal. Recognit. 2005, 7, 105–122.
34. Yin, X.-C.; Yin, X.; Huang, K.; Hao, H.W. Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 970–983.
35. Neumann, L.; Matas, J. Efficient scene text localization and recognition with local character refinement. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015.
36. Zhang, Z.; Shen, W.; Yao, C.; Bai, X. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

An Algorithm for Scene Text Detection Using Multibox and Semantic Segmentation

Publisher: Multidisciplinary Digital Publishing Institute
Copyright: © 1996–2019 MDPI (Basel, Switzerland) unless otherwise stated
ISSN: 2076-3417
DOI: 10.3390/app9061054


One of the most important CNN-based methods is the fully convolutional network (FCN). Zhang et al. [17] suggest using an FCN to obtain text block candidates, a character-centroid FCN to generate auxiliary text lines, and a set of heuristic rules based on intensity and geometric consistency to reject the wrong candidates. Gupta et al. [18] propose a fully convolutional regression network (FCRN) that efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image based on an FCN. All of these algorithms use an FCN to generate text detection results that contain semantic segmentation information.

Another method, called multibox, uses multiple default candidate boxes to calculate the position of text in an image. Since general object detection based on CNNs has achieved remarkable results in recent years, scene text detection has been greatly improved by treating text words or lines as objects. High-performance methods for object detection such as the faster region-based convolutional neural network (R-CNN) [19], the single shot multibox detector (SSD) [20], and you only look once (YOLO) [21] have been modified to detect horizontal scene text [10,14,22,23], with considerable gains.

In this paper, the outside mutual correction (OMC) algorithm is proposed. Existing fusion approaches use semantic segmentation as a module for extracting features inside a multibox detector; this is referred to as inside feature extraction (IFE). During IFE processing, feature maps extracted by semantic segmentation are enlarged first and then reduced, which usually introduces noise and reduces the accuracy of detection. In our proposed algorithm, semantic segmentation and multibox are processed in parallel, and the text detection results are mutually corrected. To this end, the bounding box enhancement module (BEM) and the semantic bounding box module (SBM) were designed. The proposed algorithm inherits the advantages of both methods and obtains more accurate text detection results through OMC.

The rest of the paper is organized as follows: In Section 2, we provide a brief review of the related theories, including the single shot multibox detector and the FCN. In Section 3, we describe the details of our proposed algorithm. In Section 4, we present the experimental results on benchmarks and comparisons to other scene text detection systems. In Section 5, we provide the conclusions of this paper.

2. Related Work

2.1. Multibox Text Detector

The multibox text detector extracts feature maps using convolutional layers. The detector draws multiple default bounding boxes on feature maps of different resolutions. After the convolution process, the targets in the original image decrease in size; since the default boxes have fixed shapes, each target is captured at the right size on some feature map. SSD is a representative multibox text detector. SSD has several advantages: (1) SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. (2) During the prediction procedure, the network generates scores for the presence of each object category in each default box and adjusts the box to better match the object shape. (3) Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes [20]. Figure 1 shows the SSD network architecture.

Figure 1. (a) Single shot multibox detector (SSD) architecture. From the convolutional layers C1 to C9, a number of bounding boxes and their corresponding confidences are inferred. (b) Fixed-size default bounding boxes on feature maps of different resolutions. The fixed-size default box captures different-sized targets on different-resolution feature maps. This architecture allows the SSD to capture both large and small targets in one shot.
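To make the multibox idea concrete, the sketch below generates fixed-size default boxes on feature maps of several resolutions, in the spirit of SSD; the grid sizes, box scales, and aspect ratios are illustrative assumptions rather than the exact values used by SSD or by this paper.

```python
# Minimal sketch of SSD-style default box generation (illustrative values only).
# For each feature map resolution, default boxes are centered on every cell;
# coarser maps get larger boxes, so objects of different sizes are matched
# at the appropriate scale.
from itertools import product

def default_boxes(grid_sizes=(36, 19, 10, 5), scales=(0.05, 0.15, 0.35, 0.6),
                  aspect_ratios=(1.0, 2.0, 3.0)):
    boxes = []  # each box is (cx, cy, w, h) in relative [0, 1] coordinates
    for grid, scale in zip(grid_sizes, scales):
        for i, j in product(range(grid), repeat=2):
            cx, cy = (j + 0.5) / grid, (i + 0.5) / grid
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes

print(len(default_boxes()))  # total number of candidate boxes scored by the detector
```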
He et al. [24] proposed an improved SSD-based network named the single shot text detector (SSTD), in which the inception structure and semantic segmentation are integrated into the SSD. The inception structure shows better performance for extracting convolution features, and a new module, named the text attention module, is used to fuse semantic segmentation information with convolution features. The whole semantic segmentation fusing process has three steps: (1) The feature maps are enlarged and scored during semantic segmentation using a deconvolution process. (2) In order to fuse with the original feature map, the semantic segmentation results need to be reduced to smaller sizes through convolution processing. (3) Finally, the text attention module combines the semantic segmentation information and the convolution features. In this process, the feature map is enlarged first and then reduced, introducing noise. This problem is caused by the desire to combine semantic segmentation within the multibox; thus, this method of integrating semantic segmentation has a limited effect.
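As a rough illustration of the enlarge-then-reduce fusion described above, the following PyTorch-style sketch upsamples a segmentation score map with a deconvolution, shrinks it back with a strided convolution, and concatenates it with the original detection features. Layer sizes and channel counts are assumptions for illustration, not the actual SSTD configuration.

```python
# Sketch of the inside feature extraction (IFE) idea: a 32x32 feature map is
# enlarged to 64x64 by deconvolution, scored per pixel, then reduced back to
# 32x32 by a strided convolution before being concatenated with the detection
# features. The resampling back and forth is where noise can be introduced.
import torch
import torch.nn as nn

class InsideFeatureExtraction(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.enlarge = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)  # 32 -> 64
        self.score = nn.Conv2d(channels, 1, kernel_size=1)                              # per-pixel text score
        self.reduce = nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1)        # 64 -> 32

    def forward(self, feat):                                  # feat: (N, 32, 32, 32)
        seg = torch.sigmoid(self.score(self.enlarge(feat)))   # (N, 1, 64, 64)
        seg_small = self.reduce(seg)                          # (N, 32, 32, 32)
        return torch.cat([feat, seg_small], dim=1)            # (N, 64, 32, 32) fused features

fused = InsideFeatureExtraction()(torch.randn(1, 32, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```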
2.2. Semantic Segmentation

Semantic segmentation information can be acquired from the fully convolutional network (FCN) [25]. Typically, a convolutional neural network has several fully connected layers after the convolution layers. The feature maps generated by the convolution layers are mapped to feature vectors by the fully connected layers; for example, the 1000-dimensional vector output by the ImageNet model in [26] indicates the probability that the input image belongs to each class. The FCN instead classifies at the pixel level. It uses deconvolution layers to upsample the feature map and restore it to the input image size, so that after the deconvolution process a prediction is made for each pixel and the spatial information of the input image is preserved. The FCN fuses features from layers of different coarseness to refine the segmentation using spatial information. Finally, the loss of the softmax classification is calculated pixel by pixel, which is equivalent to one training sample per pixel. The FCN architecture is shown in Figure 2.

Figure 2. (a) Fully convolutional network (FCN) architecture, including convolutional and deconvolution layers. (b) Overlay of the input image and the FCN processing results.
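The pixel-level classification can be pictured as a tiny fully convolutional head: class scores are predicted on a coarse feature map, upsampled back to the input resolution, and turned into per-pixel probabilities with a softmax. This is only a schematic sketch; the backbone, stride, and layer sizes are made-up placeholders, not the network used in the paper.

```python
# Schematic FCN head: coarse features -> per-pixel class probabilities at input size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCNHead(nn.Module):
    def __init__(self, in_channels=64, num_classes=2, stride=8):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=2 * stride, stride=stride,
                                           padding=stride // 2)

    def forward(self, coarse_feat):               # e.g. (N, 64, H/8, W/8)
        scores = self.upsample(self.classifier(coarse_feat))
        return F.softmax(scores, dim=1)           # per-pixel class probabilities

probs = TinyFCNHead()(torch.randn(1, 64, 32, 32))
print(probs.shape)  # torch.Size([1, 2, 256, 256])
```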
Based on the FCN, DeepLab-V2 [27], containing atrous spatial pyramid pooling (ASPP), was proposed. In ASPP, parallel atrous convolutions with different rates are applied to the input feature map and fused together. As objects of the same class may have different scales in the image, ASPP helps to account for different object scales, which can improve the accuracy.
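A minimal sketch of the ASPP idea is shown below: the same 3 × 3 convolution is applied in parallel with different atrous (dilation) rates, and the branch outputs are fused by summation. The rates and channel sizes are illustrative assumptions, not the exact DeepLab-V2 configuration used here.

```python
# Sketch of atrous spatial pyramid pooling (ASPP): parallel dilated 3x3
# convolutions with different rates see different context sizes; their
# outputs are fused (here by summation).
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    def __init__(self, channels=64, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)  # fused multi-scale features

out = SimpleASPP()(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```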
Usually, semantic segmentation is treated as a special feature extraction process integrated into the convolutional neural network. The feature map is first enlarged and then reduced during the inside feature extraction process, as shown in Figure 3. During this process, noise is introduced and the detection results are affected. Therefore, the proposed algorithm uses outside mutual correction (OMC) to fuse semantic segmentation outside the multibox. Through the proposed algorithm, the pixel-level classification of semantic segmentation can be utilized more effectively to improve the detection accuracy. We used SSD and SSTD as the basic multibox text detectors, and DeepLab-V2 was used to generate the semantic segmentation information.

Figure 3. Inside feature extraction (IFE) process. The feature map is enlarged first (32 × 32 to 64 × 64) and then reduced (64 × 64 to 32 × 32).

3. Proposed Algorithm

3.1. Overall Framework

The proposed algorithm is shown in Figure 4. Plenty of text candidate bounding boxes are obtained from the multibox processing. Meanwhile, the text semantic segmentation result is obtained from the semantic segmentation processing, where a softmax layer is added to the output layer to obtain the classification probability of each pixel. The text candidate bounding boxes and the semantic segmentation result are merged in the bounding box enhancement module (BEM) to eliminate the false results of the multibox processing. The semantic segmentation result also enters the semantic bounding box module (SBM); after the CRF algorithm optimizes the text boundaries, the text semantic bounding boxes are computed. Finally, the outputs of the BEM and the SBM are sent to non-maximum suppression (NMS) to remove duplicate bounding boxes.

Figure 4. Framework of the scene text detection algorithm using multibox and semantic segmentation.
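Read as pseudocode, this parallel-then-correct structure could look roughly like the sketch below; `multibox_detector`, `segmentation_net`, `crf_refine`, `bem`, `sbm`, and `nms` are assumed placeholder helpers standing in for the SSD/SSTD detector, the DeepLab-style segmenter, and standard post-processing, not the authors' implementation.

```python
# High-level sketch of the OMC framework: multibox and semantic segmentation run
# in parallel, BEM filters the multibox boxes with the segmentation probabilities,
# SBM derives boxes from the CRF-refined segmentation, and NMS merges both sets.
def omc_detect(image, multibox_detector, segmentation_net, crf_refine,
               bem, sbm, nms, iou_threshold=0.5):
    boxes = multibox_detector(image)           # candidate boxes with scores
    prob_map = segmentation_net(image)         # per-pixel text probability (softmax output)

    kept_boxes = bem(boxes, prob_map)          # drop boxes with low regional median probability
    refined_map = crf_refine(image, prob_map)  # sharpen adhered text boundaries
    semantic_boxes = sbm(refined_map)          # bounding box search on the refined mask

    return nms(kept_boxes + semantic_boxes, iou_threshold)
```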
3.2. Bounding Box Enhancement Module

The BEM merges the multibox results and the semantic segmentation result to eliminate false bounding boxes. The regional median probability of every bounding box is calculated from the semantic segmentation result, and a bounding box is removed if its regional median probability is less than a threshold. The detailed steps of the BEM are shown in Algorithm 1.

Algorithm 1. Bounding box enhancement module (BEM)
Step 1. Acquire a multibox result Rec_i = ((x1, y1), (x2, y2))_i, where i refers to the i-th result, (x1, y1) is the coordinate of the upper-left corner of the text bounding box, and (x2, y2) is the coordinate of the bottom-right corner of the text bounding box.
Step 2. Get the rectangular area Area_Rec of Rec_i in the semantic segmentation result.
Step 3. Calculate the regional median probability: P_Area = Median(P_ij) over the pixels ij in Area_Rec.
Step 4. Compare P_Area with the threshold T: if P_Area < T, delete Rec_i from the multibox results; else, continue.
Step 5. Repeat Steps 1–4 until all multibox results have been processed.
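Algorithm 1 translates almost directly into a few lines of NumPy. The sketch below is a minimal interpretation under the assumption that `prob_map[y, x]` holds the text probability of each pixel and that boxes are given in pixel coordinates; the threshold value is an arbitrary example.

```python
# Minimal NumPy sketch of Algorithm 1 (BEM): keep a candidate box only if the
# median text probability inside it reaches the threshold T.
import numpy as np

def bounding_box_enhancement(boxes, prob_map, threshold=0.5):
    """boxes: list of ((x1, y1), (x2, y2)); prob_map: HxW text probabilities."""
    kept = []
    for (x1, y1), (x2, y2) in boxes:
        region = prob_map[y1:y2, x1:x2]                      # Step 2: rectangular area of the box
        if region.size and np.median(region) >= threshold:   # Steps 3-4: regional median vs. T
            kept.append(((x1, y1), (x2, y2)))
        # boxes below the threshold are discarded (Step 4)
    return kept

prob_map = np.random.rand(100, 200)
print(bounding_box_enhancement([((10, 10), (60, 30)), ((150, 80), (190, 95))], prob_map))
```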
The BEM process is shown in Figure 5. It can be seen from the figure that region (a) is text and region (b) is background, yet both blocks are detected as text in the multibox processing. From the heatmap of the semantic processing results, we found that region (a) was bright while region (b) was dark. After the processing of the bounding box enhancement module, region (a) was retained and region (b) was discarded.

Figure 5. Flow chart of the BEM.

3.3. Semantic Bounding Box Module

The SBM contains two modules: CRF processing and bounding box search. The CRF processing solves the problem of boundary blur and stickiness in the text semantic segmentation. The bounding box search obtains the optimal text semantic segmentation bounding box according to the CRF processing result. The SBM process is shown in Figure 6.

Figure 6. Semantic bounding box module flow chart. The yellow circled areas in the figure show that the text boundary portions are clearer after the CRF processing.
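The paper does not spell out the bounding box search step, but one plausible reading is a connected-component pass over the thresholded, CRF-refined text mask, taking the bounding rectangle of each component; the sketch below follows that assumption using SciPy's labeling utilities.

```python
# Hypothetical bounding box search over a CRF-refined text mask: label connected
# text regions and return each region's bounding rectangle. This is an assumed
# reading of the SBM's "bounding box search", not the authors' exact procedure.
import numpy as np
from scipy import ndimage

def semantic_bounding_boxes(prob_map, threshold=0.5):
    mask = prob_map >= threshold                 # binary text / background mask
    labels, n = ndimage.label(mask)              # connected components = word candidates
    boxes = []
    for sl in ndimage.find_objects(labels):      # one (row-slice, col-slice) pair per component
        if sl is not None:
            y, x = sl
            boxes.append(((x.start, y.start), (x.stop, y.stop)))
    return boxes

print(semantic_bounding_boxes(np.random.rand(64, 64)))
```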
The semantic segmentation result has a defect: it easily causes adhesion when words are close together. The CRF algorithm is used to correct the pixel-level prediction of the semantic segmentation result, and text edges are sharper and less sticky after CRF processing. CRFs have been employed to smooth noisy segmentation maps [28,29]. These methods use short-range CRFs to couple neighboring nodes, favoring same-label assignments for spatially proximal pixels. In this work, the goal is to recover the detailed local structure rather than smooth it further. Therefore, the fully connected CRF model [30] is integrated into our network. The CRF model employs the energy function

E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j), (1)

where x is the label assignment for the pixels and the unary potential is defined as

\theta_i(x_i) = -\log P(x_i), (2)

where P(x_i) is the label assignment probability of pixel i computed by the semantic processing. The pairwise potential has a form that allows efficient inference while using a fully connected graph, i.e., connecting all pairs of image pixels i, j. In particular, as in [30], the following expression is used:

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[ \omega_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right) + \omega_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right) \right], (3)

where \mu(x_i, x_j) = 1 if x_i \neq x_j, and zero otherwise. The remaining expression uses two Gaussian kernels in different feature spaces: the first, bilateral kernel depends on both the pixel positions p and the RGB colors I, while the second kernel depends only on the pixel positions. The hyperparameters \sigma_\alpha, \sigma_\beta, and \sigma_\gamma control the scales of the Gaussian kernels. The first kernel encourages similar labels for pixels with similar colors and positions, whereas the second kernel considers spatial proximity while smoothing the boundaries. Figure 7 shows the effectiveness of the CRF processing.
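To connect Equations (1)–(3) to code, the sketch below evaluates the unary term from a softmax probability and the two Gaussian kernels of the pairwise term for a single pixel pair. The weights and bandwidths are arbitrary example values, and real dense-CRF inference (e.g., the mean-field method of [30]) is considerably more involved than this toy evaluation.

```python
# Toy evaluation of the dense CRF potentials in Equations (1)-(3) for one pixel pair.
# Weights (w1, w2) and bandwidths (sigma_alpha, sigma_beta, sigma_gamma) are
# illustrative assumptions; this is not a full mean-field inference implementation.
import numpy as np

def unary(prob):                       # Eq. (2): theta_i(x_i) = -log P(x_i)
    return -np.log(np.clip(prob, 1e-8, 1.0))

def pairwise(p_i, p_j, I_i, I_j, label_i, label_j,
             w1=10.0, w2=3.0, sigma_alpha=60.0, sigma_beta=10.0, sigma_gamma=3.0):
    if label_i == label_j:             # mu(x_i, x_j) = 0 when the labels agree
        return 0.0
    d_pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    d_rgb = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * sigma_alpha**2) - d_rgb / (2 * sigma_beta**2))
    smoothness = w2 * np.exp(-d_pos / (2 * sigma_gamma**2))
    return appearance + smoothness     # Eq. (3)

print(unary(0.9), pairwise((5, 5), (7, 6), (200, 30, 30), (190, 40, 35), 1, 0))
```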
Figure 7. Examples of CRF processing effects. (a) Two lines of text are placed in one bounding box; after CRF processing, each word has a separate bounding box. (b) An arrow is erroneously detected as text, and the arrow's bounding box is removed after CRF processing.

4. Experimental Results

4.1. Datasets

• ICDAR2013

The ICDAR 2013 dataset [31] consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection. Some examples of the ICDAR2013 dataset are shown in Figure 8.

Figure 8. Several examples of difficulties in text detection in the ICDAR2013 dataset. (a) Background destroying the font structure. (b) Perspective transformation and uneven illumination. (c) Text that is too small, with low contrast. (d) Complex background and low contrast.
• Street View Text (SVT)

The SVT dataset is harvested from Google Street View. Image text in this dataset exhibits high variability and often has low resolution [32]. In autonomous driving, the text in these street view images helps the system confirm its position. Compared with the ICDAR2013 dataset, text in the SVT dataset has lower resolution, more complex lighting variation, and relatively smaller size, which makes it more challenging. Figure 9 shows some examples of the SVT dataset.

Figure 9. Examples of the Street View Text (SVT) dataset: text with lower resolution, complex lighting variation, and smaller size.
4.2. Exploration Study

The inside feature extraction (IFE) and outside mutual correction (OMC) algorithms were tested. Since the IFE in SSTD cannot be removed alone, the performance without IFE cannot be tested; therefore, only the IFE and OMC algorithms were compared. The training and testing environments were consistent: all methods used the same training datasets, the same number of training epochs, and the same parameter settings. We used two standard evaluation protocols, the IC13 standard and the DetEval standard [33]. The proposed method was implemented with Caffe and Matlab, running on a computer with an 8-core CPU, 32 GB RAM, a Titan Xp GPU, and Ubuntu 16.04. The complete test results are shown in Tables 1 and 2.

Table 1. Comparison of the IFE and outside mutual correction (OMC) algorithms (on ICDAR2013). R = recall, P = precision, F = F-measure.
Network | IC13 R | IC13 P | IC13 F | DetEval R | DetEval P | DetEval F
SSD | 52.20% | 87.76% | 65.18% | 53.06% | 87.24% | 65.98%
SSD-IFE | 49.17% | 82.79% | 61.69% | 49.88% | 83.29% | 62.39%
SSD-OMC | 69.30% | 81.21% | 74.78% | 70.15% | 85.10% | 76.91%
SSTD-IFE | 74.54% | 83.65% | 78.83% | 75.39% | 84.07% | 79.50%
SSTD-OMC | 80.51% | 82.27% | 81.38% | 80.43% | 87.60% | 83.86%

Table 2. Comparison of the IFE and OMC algorithms (on SVT). R = recall, P = precision, F = F-measure.
Network | IC13 R | IC13 P | IC13 F | DetEval R | DetEval P | DetEval F
SSD | 48.61% | 73.24% | 58.43% | 48.17% | 75.12% | 58.70%
SSD-IFE | 50.34% | 79.91% | 61.77% | 50.34% | 79.91% | 61.77%
SSD-OMC | 69.91% | 73.57% | 71.69% | 66.01% | 77.97% | 71.48%
SSTD-IFE | 78.60% | 73.35% | 75.89% | 78.60% | 73.35% | 75.89%
SSTD-OMC | 85.13% | 71.68% | 77.83% | 81.65% | 79.78% | 80.71%

'SSD' refers to the original SSD algorithm without IFE or OMC. 'SSD-IFE' refers to SSD with the IFE algorithm added. 'SSD-OMC' refers to SSD with the OMC algorithm added. 'SSTD-IFE' refers to SSTD with the IFE algorithm added, which is the original SSTD. 'SSTD-OMC' refers to SSTD with the OMC algorithm added.

The experimental results show that the IFE algorithm reduces the accuracy of text detection on the ICDAR2013 dataset. For the SVT dataset, the IFE algorithm slightly improves the accuracy of text detection. Compared with the IFE algorithm, the OMC algorithm significantly improves the F-measure score.

4.3. Experimental Results

Five methods were tested on the ICDAR2013 and SVT datasets: FCN, SSD, SSD-OMC, SSTD, and SSTD-OMC. SSD-OMC and SSTD-OMC use the proposed algorithm to combine semantic segmentation with SSD and SSTD, respectively. Table 3 shows the results of the five methods on the ICDAR2013 dataset using the two standard evaluation protocols. It can be seen from the test results that the SSD-OMC and SSTD-OMC algorithms showed increases of 17.10% (IC13) and 17.09% (DetEval), and 5.97% (IC13) and 5.04% (DetEval), in the recall rate relative to the SSD and SSTD algorithms, respectively, which means that more of the text missed by the multibox processing was detected. The SSD-OMC method was 9.60% (IC13) and 10.93% (DetEval) higher than the SSD method in the F-measure score, and the SSTD-OMC method was 2.55% (IC13) and 4.36% (DetEval) higher than the SSTD method in the F-measure score, meaning that the multibox processing optimized by our algorithm had a better detection accuracy.
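For reference, the recall, precision, and F-measure values reported in Tables 1–5 follow the usual definitions, computed from matched detections. The matching rules differ between the IC13 and DetEval protocols; the counts in the sketch below are assumed example inputs, not a re-implementation of either protocol.

```python
# Precision, recall, and F-measure from matched detections (example counts only).
def detection_metrics(num_matched, num_detections, num_ground_truth):
    precision = num_matched / num_detections if num_detections else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

print(detection_metrics(161, 198, 233))
```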
Table 3. Improved algorithm compared with original algorithm (on ICDAR2013). R = recall, P = precision, F = F-measure.
Network | IC13 R | IC13 P | IC13 F | DetEval R | DetEval P | DetEval F
FCN | 62.54% | 60.80% | 61.66% | 66.97% | 62.05% | 64.42%
SSD | 52.20% | 87.76% | 65.18% | 53.06% | 87.24% | 65.98%
SSD-OMC | 69.30% | 81.21% | 74.78% | 70.15% | 85.10% | 76.91%
SSTD | 74.54% | 83.65% | 78.83% | 75.39% | 84.07% | 79.50%
SSTD-OMC | 80.51% | 82.27% | 81.38% | 80.43% | 87.60% | 83.86%

Table 4 shows the results of the SSTD-OMC algorithm and four other advanced text detection algorithms tested on the ICDAR2013 dataset. As can be seen in the table, SSTD-OMC has a higher F-measure score, indicating that SSTD-OMC had the best detection accuracy among these five algorithms. Some detection results are shown in Figure 10.

Table 4. Improved algorithm compared with other advanced algorithms (on ICDAR2013).
Network | IC13 R | IC13 P | IC13 F | DetEval R | DetEval P | DetEval F
Yin [34] | 0.66 | 0.88 | 0.76 | 0.69 | 0.89 | 0.78
Neumann [35] | 0.72 | 0.82 | 0.77 | - | - | -
Zhang [36] | 0.74 | 0.88 | 0.80 | 0.76 | 0.88 | 0.82
Textboxes [22] | 0.74 | 0.86 | 0.80 | 0.74 | 0.88 | 0.81
SSTD-OMC | 0.80 | 0.82 | 0.81 | 0.80 | 0.80 | 0.83

Figure 10. Some detection results and their ground truth from the ICDAR2013 dataset.

Table 5 shows the results of the five methods tested on the SVT dataset. The SSD-OMC method showed an improvement of 13.62% (IC13) and 12.78% (DetEval) in the F-measure score compared with the SSD method, while the SSTD-OMC method improved by 1.94% (IC13) and 4.82% (DetEval) in the F-measure score compared with the SSTD method. Figure 11 shows some detection results from the SVT dataset.

Table 5. Improved algorithm compared with original algorithm (on SVT). R = recall, P = precision, F = F-measure.
Network | IC13 R | IC13 P | IC13 F | DetEval R | DetEval P | DetEval F
FCN | 50.78% | 54.94% | 52.78% | 55.13% | 54.94% | 55.03%
SSD | 48.61% | 73.24% | 58.43% | 48.17% | 75.12% | 58.70%
SSD-OMC | 69.91% | 73.57% | 71.69% | 66.01% | 77.97% | 71.48%
SSTD | 78.60% | 73.35% | 75.89% | 78.60% | 73.35% | 75.89%
SSTD-OMC | 85.13% | 71.68% | 77.83% | 81.65% | 79.78% | 80.71%
Figure 11. Some detection results and their ground truth from the SVT dataset.

5. Conclusions

Our work provided an OMC algorithm to fuse multibox with semantic segmentation. In the OMC algorithm, semantic segmentation and multibox were processed in parallel, and the text detection results were mutually corrected. The mutual correction process had two stages. In the first stage, the pixel-level classification results of the semantic segmentation were adopted to correct the multibox bounding boxes.
In the second stage, the CRF algorithm was used to precisely adjust the boundaries of the semantic segmentation results. Then the NMS was introduced to merge the text bounding boxes generated by multibox and semantic segmentation. The experimental results showed that the proposed OMC algorithm had better performance than the original IFE algorithm. The F-measure score increased by a maximum of 13.62%, and the highest F-measure score was 81.38%. Future work will focus on more powerful and faster detection structures, as well as on rotating text detection research.

Author Contributions: Conceptualization: H.Q. and H.Z.; formal analysis: H.W. and Y.Y.; investigation: H.Q. and H.Z.; methodology: H.Q. and M.Z.; Writing—Original Draft: H.W. and W.Z.; Writing—Review & Editing: H.Q. and H.Z.

Funding: This research was funded by the National Natural Science Foundation of China, grant number 61801357.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Fletcher, L.A.; Kasturi, R. A robust algorithm for text string separation from mixed text/graphics images. IEEE Trans. Pattern Anal. Mach. Intell. 1988, 10, 910–918. [CrossRef]
2. Ye, Q.; Doermann, D. Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1480–1500. [CrossRef] [PubMed]
3. Wang, F.; Zhao, L.; Li, X.; Wang, X.; Tao, D. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018.
4. Chen, H.; Tsai, S.S.; Schroth, G.; Chen, D.M.; Grzeszczuk, R.; Girod, B. Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In Proceedings of the 18th IEEE International Conference on Image Processing (ICIP), Brussels, Belgium, 11–14 September 2011; IEEE: Piscataway, NJ, USA, 2011.
5. Neumann, L.; Matas, J. Real-time scene text localization and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012.
6. Shi, C.; Wang, C.; Xiao, B.; Zhang, Y.; Gao, S. Scene text detection using graph model built upon maximally stable extremal regions. Pattern Recognit. Lett. 2013, 34, 107–116. [CrossRef]
7. Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010.
8. Mosleh, A.; Bouguila, N.; Hamza, A.B. Image text detection using a bandlet-based edge detector and stroke width transform. In BMVC; BMVC: Newcastle, UK, 2012.
9. He, W.; Zhang, X.Y.; Yin, F.; Liu, C.L. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017.
10. Liu, Y.; Jin, L. Deep matching prior network: Toward tighter multi-oriented text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
11. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 2018, 20, 3111–3122. [CrossRef]
12. Xiang, D.; Guo, Q.; Xia, Y. Robust text detection with vertically-regressed proposal network. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
13. Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
14. Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
15. Tian, S.; Pan, Y.; Huang, C.; Lu, S.; Yu, K.; Lim Tan, C. Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015.
16. Cho, H.; Sung, M.; Jun, B. Canny text detector: Fast and robust scene text localization algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016.
17. Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; Bai, X. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016.
18. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016.
19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016.
22. Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast Text Detector with a Single Deep Neural Network; AAAI: Menlo Park, CA, USA, 2017.
23. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
24. He, P.; Huang, W.; He, T.; Zhu, Q.; Qiao, Y.; Li, X. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017.
25. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2015.
26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. ACM 2012, 60, 84–90. [CrossRef]
27. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
28. Rother, C.; Kolmogorov, V.; Blake, A. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG); ACM: New York, NY, USA, 2004.
29. Kohli, P.; Torr, P.H. Robust higher order potentials for enforcing label consistency. Int. J. Comput. Vis. 2009, 82, 302–324. [CrossRef]
30. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. Adv. Neural Inf. Process. Syst. 2011, 24, 109–117.
31. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; de Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 25–28 August 2013; IEEE: Piscataway, NJ, USA, 2013.
32. Wang, K.; Belongie, S. Word spotting in the wild. In European Conference on Computer Vision; Springer: Berlin, Germany, 2010.
33. Lucas, S.M.; Panaretos, A.; Sosa, L.; Tang, A.; Wong, S.; Young, R.; Ashida, K.; Nagai, H.; Okamoto, M.; Yamamoto, H. ICDAR 2003 robust reading competitions: Entries, results, and future directions. Int. J. Doc. Anal. Recognit. 2005, 7, 105–122. [CrossRef]
34. Yin, X.-C.; Yin, X.; Huang, K.; Hao, H.W. Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 970–983. [PubMed]
35. Neumann, L.; Matas, J. Efficient scene text localization and recognition with local character refinement. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; IEEE: Piscataway, NJ, USA, 2015.
36. Zhang, Z.; Shen, W.; Yao, C.; Bai, X. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Journal: Applied Sciences (Multidisciplinary Digital Publishing Institute)

Published: Mar 13, 2019
