3D Wireframe Modeling and Viewpoint Estimation for Multi-Class Objects Combining Deep Neural Network and Deformable Model Matching
Ren, Xiaoyuan;Jiang, Libing;Tang, Xiaoan;Liu, Weichun
2019-05-14
Applied Sciences (Article)

3D Wireframe Modeling and Viewpoint Estimation for Multi-Class Objects Combining Deep Neural Network and Deformable Model Matching

Xiaoyuan Ren, Libing Jiang *, Xiaoan Tang and Weichun Liu

College of Electronic Science, National University of Defense Technology, Changsha HN 731, China; renxiaoyuan10@nudt.edu.cn (X.R.); xatang@nudt.edu.cn (X.T.); liuweichun17@nudt.edu.cn (W.L.)

* Correspondence: jianglibing@nudt.edu.cn; Tel.: +86 15581641708

Received: 25 February 2019; Accepted: 8 May 2019; Published: 14 May 2019

Featured Application: This research is a useful exploration to extend the generalization of deep learning in 3D modeling and viewpoint estimation.

Abstract: The accuracy of 3D viewpoint and shape estimation from 2D images has been greatly improved by machine learning, especially deep learning technology such as the convolutional neural network (CNN). However, current methods are usually valid only for one specific category and exhibit poor performance when generalized to other categories, which means that multiple detectors or networks are needed for multi-class object image cases. In this paper, we propose a method with strong generalization ability, which combines a single CNN with deformable model matching for the 3D viewpoint and shape estimation of multi-class object images. The CNN is used to detect keypoints of the potential object in the image, while a deformable model matching stage is designed to conduct 3D wireframe modeling and viewpoint estimation simultaneously with the support of the detected keypoints. Besides, parameter estimation by deformable model matching is robustly fault-tolerant to keypoint detection results that contain mistaken keypoints. The proposed method is evaluated on the Pascal3D+ dataset. Experiments show that the proposed method performs well in both parameter estimation accuracy and generalization across multi-class objects.
Appl. Sci. 2019, 9, 1975; doi:10.3390/app9101975

Keywords: 3D vision; viewpoint estimation; wireframe modeling; deformable model

1. Introduction

Estimating the 3D geometry of an object from a single image is an important but challenging task in computer vision [1]. Recent years have witnessed an emerging trend towards analyzing the 3D viewpoint and shape instead of merely providing 2D bounding boxes. Previously, 3D primitives were fitted to the image to obtain viewpoint and shape parameters [2,3]. While these primitives can provide detailed descriptions of objects, robustly matching them to real-world images proved to be difficult. Recently, developments in machine learning, especially deep neural networks such as the convolutional neural network (CNN), have contributed greatly to this field. Despite the good performance gained by these methods, they share a common limitation: each network or detector is generally trained for only one specific category of target.

For the 3D shape and viewpoint estimation problem, most existing methods focus on reconstructing 3D models for category-specific objects [4–10]. In general, the deformable model of the specific category, such as a wireframe or mesh, is matched with the image to estimate the shape and the viewpoint. Although current methods using deep learning technology can output the shape and viewpoint parameters in an end-to-end way and perform well in accuracy, most of them are limited to one specific category and exhibit poor performance when generalized to other categories. Consequently, multiple detectors or networks are needed for multi-class cases, which significantly raises the training costs.
Considering that sofas and chairs have similar legs, and that cars and bicycles both contain wheels, it is worth pointing out that different object categories do share rich compositional similarities. Consequently, Zhou [11] took advantage of this characteristic and constructed the StarMap network in 2018, which can extract keypoints for multi-class objects. However, this network cannot be directly adopted here due to its intrinsic disadvantage of lacking semantic information for the extracted keypoints. In this paper, the problem of 3D viewpoint and shape estimation for multi-class objects from a single image is further investigated. Instead of producing multiple detectors or networks as category-specific methods do, the proposed approach uses only one keypoint detection network and incorporates it with deformable model matching. Firstly, a keypoint detection network for multi-class objects is trained. Keypoint locations of multi-class objects can be obtained through this network but, unlike in category-specific methods, the semantic meaning of each detected keypoint is not provided. Subsequently, these extracted keypoints are exploited for deformable model matching, which can be divided into two stages: model selection and model validation. In the first stage, the extracted 3D keypoints are matched against the deformable models of the different object classes to select candidate deformable models. In the second stage, these candidate models are further screened and validated by matching with the extracted 2D keypoints, which provides semantic meaning to each keypoint and allows 3D wireframe modeling and viewpoint estimation to be conducted simultaneously.
The main contributions of our research are as follows. Firstly, only one keypoint detection network is adopted for multi-class objects in the proposed method, which not only reduces the training cost but also captures similarity across different categories. Secondly, deformable model matching is introduced to utilize and supplement the results obtained from the network. Besides, parameter estimation by the deformable model is robustly fault-tolerant to mistaken keypoints. In conclusion, the proposed method combines the advantages of deep learning and a priori models. Compared with methods that depend only on a deep network, the proposed approach has better generalization performance. This paper explores how to extend the generalization of deep learning in a specific task.

2. Related Work

2.1. Viewpoint and Shape Estimation

In the early days of computer vision, single objects, as well as entire scenes, were represented by simple primitives, such as polyhedra [2] and generalized cylinders [3]. These approaches provided rich descriptions of objects and could estimate the viewpoint and shape parameters, but robustly matching them to cluttered real-world images proved to be difficult at the time. With advances in computing and machine learning, it has become feasible to detect objects and their parts robustly. Currently, vision-based methods can be broadly classified into 2D image-based and 3D model-based techniques [4]. Image recognition techniques are employed by 2D image-based methods to attempt to restore pose information directly from the single image [5,6]. These methods usually work with a set of trained model views taken around the known model at different locations and viewpoints, and always suffer from intra-category variations. Pose estimation using 3D model-based methods usually requires an a priori 3D model of the object, and a holistic cost function is defined when the 3D deformable model is fitted to the image features.
Pepik [7] extended the deformable part model (DPM) to 3D, and Xiang [8] introduced a separate DPM component corresponding to each viewpoint. These methods can estimate the viewpoint and the shape parameters simultaneously with different representations, including wireframes and 3D meshes. Lately, estimation accuracy and utility have been greatly improved in the deep learning era. The single image 3D interpreter network (3D-INN) [9] presented a sophisticated convolutional neural network (CNN) architecture to estimate a 3D skeleton containing viewpoint and shape information. Chi Li obtained the 3D object structure by a deep CNN architecture with domain knowledge in hidden layers [10]. However, these methods rely on category-specific keypoint annotation and are not generalizable. When dealing with multi-class objects, different networks need to be trained separately, which ignores inter-category structural similarities and raises training costs significantly. This paper promotes a method to attain such generalization ability.

2.2. Keypoint Detection

Wireframe representation is a concise structure modeling form with strong description ability, which can preserve structural properties in 3D modeling. In order to extract the wireframe from the image, keypoint detection is necessary. Researchers have made significant progress in detecting keypoints. The traditional way is to train a classifier with the hand-crafted features used in DPM [12]. Recently, there have been several attempts to apply CNNs to keypoint detection. Toshev [13] trained a deep neural network for 2D human pose regression. Xiang Yu optimized deformation coefficients based on the principal component analysis (PCA) representation of 2D keypoints to achieve state-of-the-art performance on the face and human body [14]. Despite the good performance of these approaches, they share a common limitation: each keypoint detector is only trained for a specific type from a specific object. Xingyi Zhou [11] proposed a category-agnostic keypoint representation, which combines a multi-peak heatmap (StarMap) for all the types of keypoints in the Pascal3D+ dataset [8] using the hourglass network [15]. This representation provides the flexibility to represent varying numbers of keypoints across different categories. Despite its strong generalization performance, this method cannot provide semantic information for the keypoints.
3. Method

The framework of the proposed method is shown in Figure 1; it consists of two parts: keypoint detection and deformable model matching.

Figure 1. Illustration of the framework. For an input image, 2D keypoints and their 3D coordinates are obtained through the hourglass network. These keypoints are then matched with deformable models. After that, the viewpoint and shape parameters are obtained.

In the first part, the hourglass network [15] is used to predict keypoints from an input image with two components: their 2D locations and their 3D coordinates. Network training is illustrated in Section 3.1. In the second part, the keypoints detected by the network are matched with deformable models, and the parameters of pose and shape are then estimated. The formulation of deformable model matching is given in Section 3.2.1. A priori structures of the multi-class objects, represented by PCA-based deformable wireframe models, are used in the matching process; the building of these models is introduced in Section 3.2.2. Parameter estimation of pose and shape is introduced in Section 3.2.3. The rest of this section describes the keypoint detection and the deformable model matching in detail.

3.1. Keypoint Detection Network

For keypoint detection, the most widely used way is to represent keypoints as multi-channel heatmaps, which associate each keypoint with one channel for a specific object category. In these methods, although each keypoint is semantically meaningful, they are limited to the specific category. In other words, keypoints from different objects are completely separated.
We aim to detect keypoints across different categories, so that a generalized network for multi-class objects can be obtained. This approach is inspired by the category-agnostic keypoint detection network proposed by Xingyi Zhou [11]. That method can locate all keypoints across different categories using only one network, but the keypoints obtained have no semantic meaning. For 3D viewpoint estimation, the semantic meaning of each keypoint is needed to match with an a priori model. As a result, in Zhou's work the keypoints, their 3D locations, and depth are all needed to obtain the semantic meaning of the keypoints. The network in our method is similar to Zhou's, but we only need the 2D keypoints and their 3D locations during training, because the semantic meaning of each keypoint can be given by the subsequent matching with a deformable model. The network used is shown in Figure 2.
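Since the 2D keypoint locations come from a multi-peak heatmap, decoding amounts to finding local maxima. A minimal sketch of such decoding (our own illustration, not the paper's code; the threshold and 8-neighborhood window are assumptions):

```python
import numpy as np

def extract_peaks(heatmap, thresh=0.5):
    """Return (row, col) positions of local maxima above thresh,
    i.e., candidate 2D keypoints in a multi-peak heatmap."""
    H, W = heatmap.shape
    peaks = []
    for r in range(H):
        for c in range(W):
            v = heatmap[r, c]
            if v < thresh:
                continue
            # 8-neighborhood local-maximum test
            window = heatmap[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if v >= window.max():
                peaks.append((r, c))
    return peaks
```

A heatmap with two well-separated peaks then yields two keypoint candidates, independently of the object category.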
Figure 2. Hourglass network for keypoint detection. The network of our method predicts 2D keypoints and their 3D coordinates.

Training the network in our method requires annotations of 2D keypoints and their corresponding 3D locations. Annotations of 2D keypoints per image are widely available in many datasets. Annotating the 3D keypoints of a CAD model is also not hard work with an interactive 3D UI, and has been done in datasets such as Pascal3D+ and ObjectNet3D [16]. Compared with Zhou's work, data preparation for our network is therefore more feasible. A 2-stack hourglass network is used. The 2D keypoints and their 3D locations are allocated to four-channel heatmaps. During training, the L2 distance between the output four-channel heatmaps and their ground truth is minimized.
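The four-channel supervision used in Section 3.1 can be sketched as follows; this is our illustration rather than the paper's code, and the heatmap size and Gaussian radius are assumptions:

```python
import numpy as np

def make_targets(kps_2d, kps_3d, size=64, sigma=2.0):
    """Build the four-channel target: channel 0 is a multi-peak Gaussian
    heatmap of the 2D keypoints; channels 1-3 hold the 3D coordinates at
    each keypoint location. kps_2d: (N, 2) grid coords; kps_3d: (N, 3)."""
    target = np.zeros((4, size, size), dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size]
    for (u, v), xyz in zip(kps_2d, kps_3d):
        g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
        target[0] = np.maximum(target[0], g)   # multi-peak 2D heatmap
        target[1:, int(v), int(u)] = xyz       # 3D coords at the peak
    return target

def l2_loss(pred, target):
    """L2 distance between predicted and ground-truth heatmaps."""
    return float(np.sum((pred - target) ** 2))
```

During training, `l2_loss` would be minimized between the network's four-channel output and targets built this way.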
3.2. Deformable Model Matching

After the keypoint detection network, the next step is to estimate object parameters from the keypoints. The keypoints detected by the network have no semantic meaning. Thus, the keypoints alone are not enough to estimate pose and shape parameters for multiple object classes. Besides, the following parameters are needed for viewpoint estimation and 3D modeling: object category, semantic meaning of each keypoint, shape parameters, and 3D pose. It is difficult to obtain all these parameters simultaneously. In order to solve these problems, we propose the deformable model matching method shown in Figure 3. Deformable model matching can be divided into two stages: model selection and model validation. In the first stage, the extracted 3D keypoints are matched against the deformable models of the different categories to select candidate deformable models. In the second stage, these candidate models are validated by matching with the extracted 2D keypoints, and the shape and viewpoint parameters are obtained by optimization. The final object parameters are obtained from the best fitting deformable model.

Figure 3. Deformable model matching processing. Deformable model matching can be divided into two stages: model selection and model validation. The object parameters are obtained from the best fitting deformable model.

The formulation, model building, and optimization of the deformable model are described in the following sections.

3.2.1. Formulation

The following demonstrates how to obtain the object attribute parameters C, S, and R from keypoints. C indicates the object category, S is the shape parameter vector defined in Equation (7), and R indicates the rotation matrix of the object pose. P(R, S, C) denotes the probability distribution of the object parameters. The optimal target parameters are the solution that maximizes this probability:

{R, S, C} = argmax_{R,S,C} P(R, S, C)    (1)
The deformable model matching consists of model selection and model validation, which correspond to the prior probability and the conditional probability respectively:

P(R, S, C) = P(C) P(R, S | C)    (2)

P(C) is the probability of selecting the deformable model, defined as

P(C) ∝ exp(−λ₁ Σ_i min_S ‖M_i(S) − X_i‖²)    (3)

where M indicates the node point coordinates of the deformable wireframe model, defined in detail in Equations (6) and (7), X represents the 3D keypoint coordinates obtained by the network, and λ₁ is a constant. The probability of belonging to a certain target class is related to the distance between the 3D keypoints and the deformable model point set.

P(R, S | C) is the probability distribution of the object parameters under the deformable model of a certain category, defined as

P(R, S | C) ∝ exp(−λ₂ Σ_i ‖m_i(S) − x_i‖²)    (4)

where m indicates the projection points of M under the camera imaging model, x represents the 2D keypoint coordinates obtained by the network, and λ₂ is a constant.
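Up to the constants λ₁ and λ₂, the exponents of Equations (3) and (4) are sums of squared distances, so both probabilities can be scored through their negative logs. A rough NumPy sketch (our own; the array shapes are assumptions and visibility filtering is omitted for brevity):

```python
import numpy as np

def selection_score(M, X, lam1=1.0):
    """Exponent of Eq. (3): each detected 3D keypoint X_i is matched to its
    nearest wireframe node of M(S). M: (K, 3) nodes, X: (N, 3) keypoints."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # (N, K)
    return lam1 * np.sum(np.min(d, axis=1) ** 2)

def validation_score(m, x, lam2=1.0):
    """Exponent of Eq. (4): reprojection error between projected nodes
    m(S) and 2D keypoints x, assumed already in correspondence."""
    return lam2 * np.sum(np.linalg.norm(m - x, axis=1) ** 2)
```

Lower scores mean higher probability; maximizing Equation (5) then amounts to minimizing the sum of both scores.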
Substituting Equations (2)–(4) into Equation (1), we have

{R, S, C} = argmax_{R,S,C} exp(−λ₁ Σ_i min_S ‖M_i(S) − X_i‖²) exp(−λ₂ Σ_i ‖m_i(S) − x_i‖²)    (5)

It is worth mentioning that only the visible points are considered: invisible points in M and m are discarded in Equations (3)–(5). In our method, the visibility of a node point in M or m is determined by the distance to its nearest keypoint.

3.2.2. Model Building

In this section, the building of the multi-class deformable models is illustrated. The deformable model is expected to capture intra-class variance. We model each object category as a deformable 3D wireframe that is concise and expressive.
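Such a PCA-based deformable wireframe, as formalized in Equations (6) and (7), can be sketched as follows (a minimal illustration with an assumed class name and API, not the paper's code):

```python
import numpy as np

class DeformableWireframe:
    """Mean wireframe plus r principal components (Equations (6) and (7))."""

    def __init__(self, cad_keypoints, r=3):
        """cad_keypoints: (num_models, K, 3) 3D keypoints of CAD models."""
        V = cad_keypoints.reshape(len(cad_keypoints), -1)  # one vector per model
        self.mean = V.mean(axis=0)
        # Principal directions via SVD of the centered data (PCA).
        _, _, Vt = np.linalg.svd(V - self.mean, full_matrices=False)
        self.P = Vt[:r]                                    # (r, 3K)
        self.K = cad_keypoints.shape[1]

    def nodes(self, S):
        """Equation (6): M(S) = mean + sum_k s_k p_k, as a (K, 3) array."""
        return (self.mean + np.asarray(S) @ self.P).reshape(self.K, 3)
```

Setting S = (0, 0, 0) returns the mean wireframe; varying the three weights deforms the shape along the main intra-class variations, as in Figure 4.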
During training, 3D keypoint plus a linear combination of r principal components p with geometry parameters s, where s is the k k coordinates annotated in a CAD model are represented as a vector, and we perform PCA on these weight of the k th principal component: vectors for CAD model library of a certain category. The geometry representation is based on the mean wireframe plus a linear combination of r principal components p with geometry r k parameters s , where s is the weightM of( th S) e = k th princ + ipal s p component: (6) k k k k=1 M() S =+ s p (6) kk The 3D wireframe can be determined by shape parameter S: k=1 The 3D wireframe can be determined by shape parameter S: S = fs g (7) k=1:::r Ss ={} (7) k k=1r r is set as 3 in this paper. An example of a 3D wireframe model is shown in Figure 4. r is set as 3 in this paper. An example of a 3D wireframe model is shown in Figure 4. (0 0 2) (0 2 1) (0 2 0) (2 0 0) Figure 4. Deformable representation of a 3D wireframe. Chair models of dierent shape parameters Figure 4. Deformable representation of a 3D wireframe. Chair models of different shape parameters are generated by PCA. Numbers indicated in the figure are weighting parameters of the first three are generated by PCA. Numbers indicated in the figure are weighting parameters of the first three principal principal component componen dir t di ections, rections, which which ar are e r epr represented esented as as SS . . 3.2.3. Optimization 3.2.3. Optimization We use We random use rando hill m hil climbing l climbing to to solve solvethe the optimization optimization pro prblems oblems in in Equation Equation (5). (5). The pro The bability probability distribution distribution of Equation of Equation (5) consists (5) conof sists two ofr elatively two relati independent vely independent items. items. To make To make the results the res optimized ults optimized and escape from the trap of local minimum, stepwise optimization is our strategy. 
For the first item, the probability in Equation (3) is obtained by matching the 3D keypoints with each deformable model. Qualitative results are shown in Figure 5. From the results, we can see that the keypoints fit the deformable models well after optimization.

Figure 5. Qualitative results of matching 3D keypoints with the deformable model (before and after matching, for the car, chair, plane, bicycle, and sofa categories). It can be seen that the 3D keypoints and the corresponding deformable models are well fitted after matching.

The optimization of the second item in Equation (5) is sensitive to the initial value of the viewpoint parameters.
We propose a method for determining the initial value from the visible keypoints. Taking a car as an example, Figure 6 illustrates the detailed procedure.

Figure 6. Initial viewpoint determination method. The visibility of the keypoints at different viewpoints is counted as a dictionary. For an input point set, the initial viewpoint can be obtained by matching its visibility pattern with the dictionary.
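The dictionary lookup of Figure 6 can be sketched as follows; the visibility patterns and their pairing with the viewpoint triples are illustrative readings of the figure, and the interpretation of the triples as angles is an assumption:

```python
def init_viewpoint(visible, dictionary):
    """visible: tuple of 0/1 visibility flags, one per keypoint.
    dictionary: {viewpoint: visibility pattern} counted offline.
    Returns the viewpoint whose pattern is closest in Hamming distance."""
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))
    return min(dictionary, key=lambda vp: hamming(dictionary[vp], visible))

# Example dictionary (viewpoint triple -> visibility pattern); values are
# illustrative readings of Figure 6, not exact ground truth.
viewpoint_dict = {
    (5, -45, 0):  (1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0),
    (5, -135, 0): (0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1),
    (5, 45, 0):   (1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1),
    (5, 135, 0):  (1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1),
}
```

An input point set whose visibility pattern matches, say, the first entry is initialized at that viewpoint before the second optimization item is run.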
A discrimination mechanism is necessary for objects whose category differs from the priori models. For objects whose category is outside our priori models, the deviation between the deformable model projection and the 2D keypoints is larger than for objects of the same category as the corresponding deformable model. Consequently, a threshold is set on the deviation to judge whether the object belongs to one of the contained categories.

A summary of the method is shown in Figure 7.

Figure 7. The summary of the method. At first, keypoints are detected from the image. The 3D keypoints are then matched with each deformable model, and the 2D keypoints are matched with the projection of the deformable model. After the matching process, the optimal parameters of pose and shape are obtained from the best-fit deformable model. The next step is to judge whether the target type belongs to the existing models according to the matching deviation. Finally, the object parameters are obtained.

4. Experiments

We evaluate the proposed method on the Pascal3D+ dataset from two aspects: wireframe modeling and viewpoint estimation. It is important to note that these two tasks are completed at the same time in our method. Our implementation is done in the PyTorch framework and Matlab 2014.

4.1. Wireframe Modeling

The qualitative results of the wireframe modeling are shown in Figure 8.

Figure 8. Qualitative results of 3D wireframe modeling. The four columns in the figure are the input image, the 2D keypoints detected, the matching between the 3D keypoints and the deformable model, and the wireframe projection.
In the fourth column, the red line represents the projection of the wireframe, and its deviation from the detected 2D keypoints is shown as the yellow line.

The 3D wireframe models are generated after the deformable model matching. In order to test the robustness, we artificially insert error results into the keypoints obtained by the network. The deviation between the wireframe model projection and the ground truth of the 2D keypoints is evaluated. Take the case of a sofa, the results of which are shown in Figure 9.

Case  Deviation   Viewpoint Error  Viewpoint Error of [11]
(a)   39 pixels   8.1              8.3
(b)   39 pixels   8.1              8.3
(c)   39 pixels   8.1              10.8

Figure 9. Fault tolerance test (columns: 2D keypoints, matching with the PCA model, wireframe projection, deviation, viewpoint error, and viewpoint error of [11]). We artificially added error keypoints to test the fault tolerance, as shown in the first and second columns. There is no error keypoint in (a); (b) and (c) show cases with error keypoints added.
Through the corresponding process presented in the second column, the proposed method can identify the mistaken points, as shown in the third column. For the proposed method, the deviation between the wireframe model projection and the ground truth of the 2D keypoints does not change as the number of mistaken points increases, as displayed in the fourth column. The last two columns compare the fault tolerance of the method in [11] and of the proposed method on viewpoint estimation, using the evaluation criterion in Equation (8).

It can be seen that the proposed method tolerates incorrect keypoints well. Because of the deformable model matching, the error results of the network do not cause significant negative effects.

4.2. Viewpoint Estimation

For viewpoint estimation, the angle error between the predicted rotation and the ground-truth rotation is measured as

    Δ(R_pred, R_GT) = ‖log(R_pred^T R_GT)‖_F / √2    (8)

where R_pred and R_GT are the rotation matrices (composed of rotations along the X, Y, and Z axes) of the prediction and the ground truth. We consider Median Error and Accuracy as two evaluation criteria that are commonly applied in the literature. Median Error is the median of the rotation angle error, and Accuracy at θ is the percentage of objects whose error is less than θ; θ is set to π/6 in this paper.
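A minimal sketch of this error measure and of the two criteria follows; the function names are ours. It assumes rotations are given as 3×3 matrices and uses the fact that, for rotation matrices, ‖log(R_pred^T R_GT)‖_F / √2 equals the relative rotation angle, which can be recovered from the trace identity without a matrix logarithm:

```python
import numpy as np

def viewpoint_error(R_pred, R_gt):
    """Geodesic distance Delta(R_pred, R_gt) = ||log(R_pred^T R_gt)||_F / sqrt(2).
    For rotation matrices this equals the relative rotation angle (radians),
    recovered here via the trace identity cos(theta) = (tr(R) - 1) / 2."""
    R_rel = R_pred.T @ R_gt
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))  # clip guards rounding

def median_error(errors):
    """Median Error: median of the rotation angle errors (radians)."""
    return float(np.median(errors))

def accuracy_at(errors, theta=np.pi / 6):
    """Accuracy at theta: fraction of objects whose error is below theta."""
    return float(np.mean(np.asarray(errors) < theta))
```

For instance, a prediction that is off by a 30° rotation about the vertical axis yields a viewpoint error of exactly π/6, i.e., it sits right at the Accuracy threshold used in this paper.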
Any image of multi-class targets can be processed by the proposed method with only one network and the deformable model matching processing. For comparison purposes, the results of each category are counted in Table 1.

From Table 1, it can be seen that the proposed method performs as well as mainstream approaches in viewpoint estimation. It should be noted that the first two methods are class-specific, while our method and Zhou's method are designed for multi-class objects and have better universality.

Table 1. Results of viewpoint estimation.
                             Car   Chair  Aero  Bike  Sofa  Boat  Bottle  Bus   Table  Train  Tv    Motor  Mean
Median Error
  Tulsiani [17]              9.1   14.8   13.8  17.7  13.7  21.3  12.9    5.8   15.2   8.7    15.4  14.7   13.6
  Mousavian [18]             5.8   11.9   13.6  12.5  12.8  22.8  8.3     3.1   12.5   6.3    11.9  12.3   11.1
  Zhou X                     6.5   11.0   10.1  14.5  11.1  30.0  9.1     3.1   23.7   7.4    13.0  14.1   10.4
  Method proposed            6.3   10.8   10.4  14.6  10.9  23    8.9     3.2   14     7.1    12.2  13.9   10
Accuracy
  Tulsiani [17]              0.89  0.80   0.81  0.77  0.82  0.59  0.93    0.98  0.62   0.80   0.80  0.88   0.806
  Mousavian [18]             0.90  0.80   0.78  0.83  0.82  0.57  0.93    0.94  0.68   0.82   0.85  0.86   0.810
  Zhou X                     0.92  0.79   0.82  0.86  0.92  0.50  0.92    0.97  0.62   0.77   0.83  0.88   0.823
  Ours                       0.93  0.81   0.81  0.8   0.92  0.52  0.93    0.97  0.63   0.78   0.84  0.89   0.829

From the result, the accuracy of the proposed method is similar to Zhou's, but we take a completely different solution.
The 3D keypoints from the network are used to estimate the viewpoint directly in [11], while in our method the 3D keypoints are only used to obtain the semantic meaning and the initial shape value; viewpoint estimation is conducted in the deformable model matching. As a result, the method in [11] relies heavily on the 3D keypoints and is vulnerable to mistaken points, while the method proposed here provides better fault tolerance. This conclusion is verified by the following two experiments.

Firstly, the ability to identify mistaken keypoints is compared in Figure 10. It can be seen that the proposed method can distinguish mistaken keypoints by the optimization process, while this is difficult for the method in [11].

Figure 10. Identification of mistaken keypoints. (a) and (b) are two image cases. The keypoints from the network are displayed in the first column; the mistaken points cannot be detected in [11]. The proposed method can identify mistaken keypoints through the optimization process in the second column.
Mistaken keypoints are shown as red points in the last column.

Next, we evaluate the fault tolerance of the viewpoint estimation. As shown in the last two columns of Figure 9, the viewpoint estimation of the method in [11] is sensitive to mistaken keypoints, while our method provides better fault tolerance. More quantitative results are shown in Figure 11.

Figure 11. Comparison of viewpoint estimation with error keypoints, using data from Car in the Pascal3D+ dataset. We artificially add error keypoints to test the method's fault tolerance.

It can be seen that as the number of error keypoints increases, the accuracy of [11] decreases faster than that of the proposed method. This is because the 3D keypoints are used to estimate the viewpoint
directly in [11], while our method conducts the viewpoint estimation by deformable model matching, which is more robust.

Finally, we test the performance of the proposed method for occluded object image cases. In practice, it is impossible for objects to be always fully visible. To evaluate our method, we artificially occlude the images of cars in the Pascal3D+ dataset. The result is shown in Figure 12.

Figure 12. Qualitative results for occluded object image cases. The three columns in the figure are the input image, the 2D keypoints detected, and the wireframe projection. The last instance is a failure case.
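The error-keypoint injection used in the robustness tests above (Figures 9 and 11) can be sketched as follows. The function name, the default image size, and the uniform-replacement scheme are our assumptions, since the paper does not specify how the erroneous detections were generated:

```python
import random

def inject_errors(keypoints_2d, n_errors, img_w=640, img_h=480, seed=0):
    """Replace n_errors randomly chosen 2D keypoints with uniformly
    random image positions, simulating mistaken network detections."""
    rng = random.Random(seed)
    noisy = list(keypoints_2d)
    for i in rng.sample(range(len(noisy)), n_errors):
        noisy[i] = (rng.uniform(0, img_w), rng.uniform(0, img_h))
    return noisy
```

Sweeping n_errors and recomputing the viewpoint accuracy over the test set would reproduce a degradation curve of the kind plotted in Figure 11.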
The median error is 9.5, which is larger than for the cases without occlusion. It can be seen from the results that the proposed method can work with some keypoints missing due to occlusion, although there are failure cases under large occlusion. Because of the priori deformable model, our approach can tolerate the absence of some keypoints to a certain extent.

4.3. Computation

Table 2 presents the elapsed-time statistics of the proposed method.

Table 2. The elapsed time.

Class Number  2    4    6    8    10   12
Time (s)      1.1  1.3  1.5  1.7  1.9  2

To evaluate the effect of the category number, we test the time consumption for different numbers of object categories. The results show that as the number of categories increases, the time consumption rises because the computation of point matching increases.

4.4. Discussion

The experimental results show that the proposed method has the following advantages. Firstly, our method has a strong generalization ability. Compared with category-specific methods, only one CNN with deformable model matching processing is needed for the 3D viewpoint and shape estimation of all the types of objects in the Pascal3D+ dataset. Secondly, the proposed method has a robust fault-tolerant ability. Similar to many methods, we estimate the 3D viewpoint depending on the detection of keypoints. Compared with methods such as [11], our method has better fault tolerance to mistaken keypoints, as shown in Section 4.2. This is a result of the priori object structure and the optimization mechanism in the deformable model matching. Mistaken keypoints from the network can be eliminated after the matching with the deformable models.

5. Conclusions

In this paper, a 3D viewpoint and shape estimation method for multi-class objects is proposed. The proposed method combines the advantages of the data-based method and the model-based method and conducts wireframe modeling and viewpoint estimation through maximizing a probability distribution.
Compared with methods limited to a specific category, our method uses only one keypoint detection network with the deformable model matching processing for multi-class objects. Experiments on the Pascal3D+ dataset show that the proposed method performs well in accuracy and generalization. Besides, due to the deformable model matching processing, the proposed method has robust fault tolerance to mistaken keypoints detected by the network. Our research is a valuable exploration toward extending the generalization of deep learning in specific tasks.

Author Contributions: Conceptualization, X.R.; methodology, X.R.; software, L.J.; validation, W.L.; formal analysis, W.L.; investigation, X.R.; resources, L.J.; data curation, X.T.; writing—original draft preparation, X.R.; writing—review and editing, L.J.; visualization, L.J.; supervision, X.T.; project administration, X.T.; funding acquisition, X.T.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Zhou, X.; Zhu, M.; Leonardos, S.; Daniilidis, K. Sparse representation for 3D shape estimation: A convex relaxation approach. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1648–1661.
2. Roberts, L.G. Machine Perception of Three-Dimensional Solids. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1963.
3. Brooks, R.A. Symbolic reasoning among 3-D models and 2-D images. Artif. Intell. 1981, 17, 285–348.
4. Zhang, X.; Jiang, Z.; Zhang, H.; Wei, Q. Vision-based pose estimation for textureless space objects by contour points matching. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2342–2355.
5. Zhang, H.; Jiang, Z.; Yao, Y.; Meng, G. Vision-based pose estimation for space objects by Gaussian process regression.
In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 7–14 March 2015; pp. 1–9.
6. Cao, Z.; Sheikh, Y.; Banerjee, N.K. Real-time scalable 6DOF pose estimation for textureless objects. In Proceedings of the International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 2441–2448.
7. Pepik, B.; Stark, M.; Gehler, P.; Schiele, B. Teaching 3D geometry to deformable part models. In Proceedings of the Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3362–3369.
8. Xiang, Y.; Mottaghi, R.; Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In Proceedings of the Workshop on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; pp. 75–82.
9. Wu, J.; Xue, T.; Lim, J.J.; Tian, Y.; Tenenbaum, J.B.; Torralba, A.; Freeman, W.T. 3D interpreter networks for viewer-centered wireframe modeling. Int. J. Comput. Vis. 2018, 126, 1009–1026.
10. Li, C.; Zeeshan Zia, M.; Tran, Q.H.; Yu, X.; Hager, G.D.; Chandraker, M. Deep supervision with shape concepts for occlusion-aware 3D object parsing. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 388–397.
11. Zhou, X.; Karpur, A.; Luo, L.; Huang, Q. StarMap for category-agnostic keypoint and viewpoint estimation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 328–345.
12. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
13. Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1653–1660.
14. Yu, X.; Zhou, F.; Chandraker, M. Deep deformation network for object landmark localization.
In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 52–70.
15. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 483–499.
16. Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Savarese, S. ObjectNet3D: A large-scale database for 3D object recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 160–176.
17. Tulsiani, S.; Malik, J. Viewpoints and keypoints. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1510–1519.
18. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5632–5640.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).