Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

LatticeNet: fast spatio-temporal point cloud segmentation using permutohedral lattices

LatticeNet: fast spatio-temporal point cloud segmentation using permutohedral lattices Deep convolutional neural networks have shown outstanding performance in the task of semantically segmenting images. Applying the same methods on 3D data still poses challenges due to the heavy memory requirements and the lack of structured data. Here, we propose LatticeNet, a novel approach for 3D semantic segmentation, which takes raw point clouds as input. A PointNet describes the local geometry which we embed into a sparse permutohedral lattice. The lattice allows for fast convolutions while keeping a low memory footprint. Further, we introduce DeformSlice, a novel learned data-dependent interpolation for projecting lattice features back onto the point cloud. We present results of 3D segmentation on multiple datasets where our method achieves state-of-the-art performance. We also extend and evaluate our network for instance and dynamic object segmentation. Keywords Semantic segmentation · Instance segmentation · Motion segmentation · Sequence segmentation · 3D point cloud 1 Introduction First, 3D data is often represented in an unstructured manner—unlike the grid-like structure of images. This raises Environment understanding is a crucial ability for autonomous difficulties for current approaches which assume a regular agents. Perceiving not only the geometrical structure of the structure upon which convolutions are defined. scene but also distinguishing between different classes of Second, the performance of current 3D networks is limited objects therein enables tasks like manipulation and inter- by their memory requirements. Storing 3D information in a action that were previously not possible. Within this field, dense structure is prohibitive for even high-end GPUs, clearly semantic segmentation of 2D images is a mature research indicating the need for a sparse structure. area, showing outstanding success in dense per pixel cate- Third, discretization issues caused by imposing a regular gorization on images (Long et al. 2015; Chen et al. 2017; grid onto point clouds can negatively affect the network’s Lin et al. 2017). However, the task of semantically labelling performance and interpolation is necessary to cope with 3D data is still an open area of research as it poses several quantization artifacts (Tchapmi et al. 2017). challenges that need to be addressed. In this work, we propose LatticeNet, a novel approach for point cloud segmentation which alleviates the previously mentioned problems. An overview of the input and output of This work has been funded by the Deutsche Forschungsgemeinschaft our method can be seen in Fig. 1. Hence, our contributions (DFG, German Research Foundation) under Germany’s Excellence are: Strategy - EXC 2070 - 390732324 and by the German Federal Ministry of Education and Research (BMBF) in the project ” Kompetenzzentrum: Aufbau des Deutschen Rettungsrobotik-Zentrums” (A-DRZ). – A hybrid architecture which leverages the strength of PointNet to obtain low-level features and sparse 3D con- This is one of the several papers published in Autonomous volutions to aggregate global context, Robotscomprising the Special Issue on Robotics: Science and Systems 2020. – A framework suitable for sparse data onto which all com- mon CNN operators are defined, and B Radu Alexandru Rosu – A novel slicing operator that is end-to-end trainable for rosu@ais.uni-bonn.de mapping features of a regular lattice grid back onto an Friedrich-Hirzebruch-Allee 8, Bonn, Germany unstructured point cloud. 123 Autonomous Robots Voxel networks 3D Convolutions in this category work on discretized cubic or tetrahedral volume elements. SEGCloud (Tchapmi et al. 2017) voxelizes the point cloud into a uniform 3D grid and applies 3D convolutions to obtain per-voxel class probabilities. A conditional ran- dom field (CRF) is used to smooth the labels and enforce global consistency. The class scores are transferred back to Fig. 1 Semantic segmentation: LatticeNet takes raw point clouds as the points using trilinear interpolation. The usage of a dense input and embeds them into a sparse lattice where convolutions are grid results in high memory consumption while our approach applied. Features on the lattice are projected back onto the point cloud uses a permutohedral lattice stored sparsely. Additionally, to yield a final segmentation their voxelization results in a loss of information due to the discretization of the space. We avoid quantization issues by In addition to our Robotics: Science and System con- using a PointNet architecture to summarize the local neigh- ference paper (Rosu et al. 2020) we make the following borhood. additional contributions: Rethage et al. (2018) perform semantic segmentation on a voxelized point cloud and employ a PointNet architecture – An extension with discriminative loss that allows Lat- as a low-level feature extractor. The usage of a dense grid, ticeNet to perform instance segmentation, and however, leads to high memory usage and slow inference, – A network architecture capable of processing temporal requiring various seconds for medium-sized point clouds. information in order to improve semantic segmentation SplatNet (Su et al. 2018) is the work most closely related and to distinguish between dynamic and static objects to ours. It alleviates the computational burden of 3D convo- within the scene. lutions by using a sparse permutohedral lattice, performing convolutions only around the surfaces. It discretizes the space in uniform simplices and accumulates the features of the raw 2 Related work point cloud onto the vertices of the lattice using a splatting operation. Convolutions are applied on the lattice vertices and 2.1 Semantic segmentation a slicing operation barycentrically interpolates the features of the vertices back onto the point cloud. A series of splat-conv- slice operations are applied to obtain contextual information. 3D Semantic segmentation approaches can be categorized depending on data representation upon which they operate. The main disadvantage is that splat and slice operations are Point cloud networks The first category of networks operates not learned and repeated application slowly degrades the directly on the raw point cloud. point clouds features as they act as Gaussian filters (Baek and From this area, PointNet (Qi et al. 2017a) is one of the Adams 2009). Furthermore, storing high-dimensional fea- pioneering works. The method processes raw point clouds by tures for each point in the cloud is memory intensive which individually embedding the points into a higher-dimensional limits the maximum number of points that can be processed. space and applying max-pooling for permutation-invariance In contrast, our approach has learned operations for splatting to obtain a global scene descriptor. The descriptor can be and slicing which brings more representational power to the network. We also restrict their usage to only the beginning used for both classification and semantic segmentation. How- ever, PointNet does not take local information into account and the end of the network, leaving the rest of the architecture fully convolutional. which is essential for the segmentation of highly-detailed objects. This has been partially solved in the subsequent Mesh networks The connectivity of triangular or quadrilateral work of PointNet++ (Qi et al. 2017b) which applies Point- mesh faces enables easy computation of normal vectors and Net hierarchically, capturing both local and global contextual establishes local tangent planes. information. GCNN (Masci et al. 2015) operates on small local patches Chen et al. (2018) use a similar approach but they input which are convolved using a series of rotated filters, followed the point responses w.r.t. a sparse set of radial basis functions by max-pooling to deal with the ambiguity in the patch orien- (RBF) scattered in 3D space. Optimizing jointly for the extent tation. However, the max-pooling disregards the orientation. and center of the RBF kernels allows to obtain a more explicit MoNet(Montietal. 2017) deals with the orientation ambi- guity by aligning the kernels to the principal curvature of modelling of the spatial distribution. PointCNN (Li et al. 2018) deals with the permutation the surface. Yet, this does not solve cases in which the local curvature is not informative, e.g. for walls or ceilings. Tex- invariance not by using a symmetric aggregation function, but by learning a K × K matrix for the K input points that tureNet (Huang et al. 2019) further improves on the idea permutes the cloud into a canonical form. by using a global 4-RoSy orientations field. This provides a 123 Autonomous Robots smooth orientation field at any point on the surface which is Kernel Point Convolution (KPConv) (Thomas et al. aligned to the edges of the mesh and has only a 4-direction 2019) operates directly on the point clouds by facilitating ambiguity. Defining convolution on patches oriented accord- convolution weights that are located in Euclidean space. ing to the 4-RoSy field yields significantly improved results. Points in the vicinity of these kernels are weighted and Graph networks These methods allow arbitrary topologies summed together to feature vectors. KPConv (Thomas et al. to connect vertices and lift the restriction of triangular or 2019), DarkNet53Seg (Behley et al. 2019) and Tangent- quadrilateral meshes. Conv (Tatarchenko et al. 2018) were previously used for the Wang et al. (2018a) and Wu et al. (2019) define a con- segmentation of 4D point clouds by accumulating multiple volution operator over non-grid structured data by having clouds of a sequence. continuous values over the full vector space. The weights of these continuous filters are parametrized by an multi-layer 2.3 Instance segmentation perceptron (MLP). Defferrard et al. (2016) formulate CNNs in the context Researchers extended principles from 2D to obtain instances of spectral graph theory. They define the convolution in in 3D which can be roughly categorized in proposal-based the Fourier domain with Chebyshev polynomials to obtain and proposal-free methods. fast localized filters. However, spectral approaches are not Proposal-based This type solves the problem in two stages. directly transferable to a new graph as the Fourier basis The first network stage generates proposals of bounding changes. Additionally, the learned filters are rotation invari- boxes for the objects in the scene. A second stage performs ant which can be seen as a limitation to the representational foreground-background segmentation on the points within power of the network. the bounding boxes in order to get valid instances. Multi-view networks The convolution operation is well Yang et al. (2019) present a single-stage method for defined in 2D and hence, there is an interest in casting 3D instance segmentation that can train both the proposal and segmentation as a series of single-view segmentations which the point-mask prediction network in an end-to-end manner. are fused together. Yi et al. (2019) alleviate some of the issues associated with Pham et al. (2019a) simultaneously reconstruct the scene wrong bounding box predictions by using an analysis-by- geometry and recover the semantics by segmenting sequences synthesis strategy. of RGB-D frames. The segmentation is transferred from 2D Proposal-free Proposal-free methods tackle instance seg- images to the 3D world and fused with previous segmenta- mentation without the need of generating object proposals. tions. A CRF finally resolves noisy predictions. They usually rely on predicting point embedding and apply Tatarchenko et al. (2018) assumes that the data is sampled clustering to recover the instances. from locally Euclidean surfaces and project the local surface Many proposal-free approaches base their work on the geometry onto a tangent plane to which 2D convolutions can 2D instance segmentation of De Brabandere et al. (2017)in be applied. This requires a heavy preprocessing for normal which pixel embeddings are predicted. There, a discrimina- calculation. In contrast, our approach can deal with raw point tive loss encourages the embeddings that belong to the same clouds without requiring normals. instance to be clustered together while embeddings from dif- ferent instances should be further apart. SPGN (Wang et al. 2018b) learns a similarity matrix for 2.2 Motion segmentation all point pairs, based on which, similar points are merged to instances. VoteNet (Qi et al. 2019) uses a Hough vot- For the task of motion segmentation two approaches have ing mechanism where the points predict the offset towards been widely used: Networks either incorporate multiple point the object center. A clustering algorithm finally recovers the clouds directly or accumulate a sequence of individually seg- object instances. mented point clouds. Neven et al. (2019) alleviate some of the issues associated Shi et al. (2020) present their U-Net based architec- with proposal-free methods by allowing also the clustering ture SpSequenceNet for semantic segmentation on 4D point algorithm to be part of the training by jointly optimizing the clouds. They input two point clouds and generate the output spatial embeddings and the clustering bandwidth. for the later one with a voxel-based method. They designed Wang et al. (2019) proposed a framework that allows for two modules, the Cross-frame Global Attention (CGA) and semantic and instances to be predicted simultaneously and for the Cross-frame Local Interpolation (CLI) module. The CGA the two tasks to mutually benefit from each other. Similarly, acts as a teacher that uses the data from P to focus the Pham et al. (2019b) recover both instances and semantics and t −1 network on the important features of P . The CLI module apply a CRF to improve the predictions accuracy. fuses information between both point clouds by combining Most of these works utilize a PointNet (Qi et al. 2017a) the spatial and temporal information. or PointNet++ (Qi et al. 2017b) network to predict the point 123 Autonomous Robots embeddings. In our case, we extend LatticeNet in a simi- coordinates of these neighbors are separated by a vector of d+1 lar manner to other proposal-free methods but predict the form ± [−1,..., −1, d, −1,... , −1] ∈ Z . embeddings using the lattice convolutions. The vertices of the permutohedral lattice are stored in a sparse manner using a hash map in which the key is the coordinate c and the value is x . Hence, we only allocate v v 3 Notation the simplices that contain the 3D surface of interest. This sparse allocation allows for efficient implementation of all Throughout this paper, we use bold upper-case characters typical operations in CNNs (convolution, pooling, transposed to denote matrices and bold lower-case characters to denote convolution, etc.). vectors. The permutohedral lattice has several advantages w.r.t. stan- The vertices of the d-dimensional permutohedral lattice dard cubic voxels. The number of vertices for each simplex is (d+1) are defined as a tuple v = (c , x ), with c ∈ Z denoting given by d + 1 which scales linearly with increasing dimen- v v v v d the coordinates of the vertex and x ∈ R representing the sion, in contrast to the 2 for standard voxels. This small values stored at vertex v. The full lattice containing n vertices number of vertices per simplex allows for fast splatting and n×(d+1) is denoted with V = (C, X), with C ∈ Z representing slicing operations. Furthermore, splatting and slicing create n×v piece-wise linear outputs as they use barycentric interpola- the coordinate matrix and X ∈ R the value matrix. The points in a cloud are defined as a tuple p = g , f , tion. In contrast, standard quantization in cubic voxels create p p ∈ R denoting the coordinates of the point and piece-wise constant outputs, leading to discretization arte- with g f ∈ R representing the features stored at point p (color, facts. normals, etc.). The full point cloud containing m points is Spatial correspondences between lattice vertices are given m×d denoted by P = (G, F) with G ∈ R being the positions by design and the hashmap: If the hashmap stays the same m× f matrix and F ∈ R the feature matrix. The feature matrix for the whole sequence, spatially identical lattice vertices of F can also be empty in which case f is set to zero. different point clouds are always mapped to the same entries. For motion segmentation we define a sequence of point This is visualized in Fig. 9 where features from two different clouds as P = ( P , P ,..., P ) with P = (G, F).We time-steps are fused together. seq 0 1 n n define a timestep as processing one cloud of this sequence. We denote with I the set of lattice vertices of the simplex that contains point p.Theset I always contains d+1 vertices 5 Method as the lattice tessellates the space in uniform simplices with d + 1 vertices each. Furthermore, we denote with J the The input to our method is a point cloud P = (G, F) con- set of points p for which vertex v is one of the vertices of taining coordinates and per-point features. the containing simplices. Hence, these are the points that We define the scale of the lattice by scaling the positions contribute to vertex v through the splat operation. σ σ G as G = G/σ σ , where σ σ ∈ R is the scaling factor. The We denote with S the splatting operation, with Y the slic- higher the sigma the less number of vertices will be needed ing operation, with Y the deformable slicing, with P the to cover the point cloud and the coarser the lattice will be. PointNet module, with D and D the distribution of the G F For ease of notation, unless otherwise specified, we refer to point positions and the points features, respectively, and with G as G as we usually only need the scaled version. G the gathering operation. 5.1 Common operations on permutohedral lattice 4 Permutohedral lattice In this section, we will explain in detail the standard oper- The d-dimensional permutohedral lattice is formed by pro- ations on a permutohedral lattice that are used in previous d+1 jecting the scaled regular grid (d + 1)Z along the vector works (Su et al. 2018;Guetal. 2019). 1 = [1,..., 1] onto the hyperplane H : p · 1 = 0. Splatting refers to the interpolation of point features onto the The lattice tessellates the space into uniform d-dimensional values of the lattice V using barycentric weighting (Fig. 3a). simplices. Hence, for d = 2 the space is tessellated with tri- Each point splats onto d +1 lattice vertices and their weighted angles and for d = 3 into tetrahedra. The enclosing simplex features are summed onto the vertices. of any point can be found by a simple rounding algorithm Convolving operates analogously to standard spatial convo- (Baek and Adams 2009). lutions in 2D or 3D, i.e. a weighted sum of the vertex values Due to the scaling and projection of the regular grid, the together with its neighbors is computed. We use convolu- coordinates c of each lattice vertex sum up to zero. Each tions that span over the 1-hop ring around a vertex and hence vertex has 2(d + 1) immediate neighboring vertices. The convolve the values of 2(d + 1) + 1 vertices (Fig. 2). 123 Autonomous Robots Fig. 2 Convolution: The neighboring vertices of a lattice are convolved (a) Splat (b) Distribute similarly to standard 2D convolutions. If a neighbor is not allocated in the sparse structure, we assume that it has a value of zero Fig. 3 Splat and Distribute operations: Splatting uses barycentric weighting to add the features of points onto neighboring vertices. The naïve summation can be detrimental to the network as splatting acts as Slicing is the inverse operation to splatting. The vertex val- a Gaussian filter. Distributing stores all the features of the contributing points, causing no loss of information and allows further processing by ues of the lattice are interpolated back for each position with the network the same weights used during splatting. The weighted con- tributions from the simplexes d + 1 vertices are summed up (Fig. 5a). D = D ( P, V ) ={ g − µ | p ∈ J }, (3) v G p v v D = D ( P, V ) ={ f | p ∈ J }, (4) v F p v 5.2 Proposed operations on permutohedral lattice µ = g , (5) v p | J | The operations defined in Sect. 5.1 are typically used in a cas- p∈ J cade of splat-conv-slice to obtain dense predictions (Su et al. 2018). However, splatting and slicing act as Gaussian kernel | J |×d | J |× f v v d where D ∈ R and D ∈ R are matrices con- v v g f low-pass filtering on encoded information (Baek and Adams taining the distributed coordinates and features, respectively, 2009). Their repeated usage at every layer is detrimental to for the contributing points into a vertex v. The matrices are the accuracy of the network. Additionally, splatting acts as a concatenated and processed by a PointNet P to obtain the weighted average on the feature vectors where the weights are final vertex value x .Fig. 3 illustrates the difference between only determined through barycentric interpolation. Includ- splatting and distributing. ing the weights as trainable parameter allows the network to Note that we use a different distribute function for coordi- decide on a better interpolation scheme. Furthermore, as the nates then for point features. For coordinates, we subtract the network grows deeper and feature vectors become higher- mean of the contributing coordinates. The intuition behind dimensional, slicing consumes increasingly more memory, this is that coordinates by themselves are not very informa- as it assigns the features to the points. Since in most cases tive w.r.t. the potential semantic class. However, the local | P||V |, it is more efficient to store the features only in distribution is more informative as it gives a notion of the the lattice vertices. geometry. To address these limitations, we propose four new opera- Downsampling refers to a coarsening of the lattice, by reduc- tors on the permutohedral lattice which are more suitable for ing the number of vertices. This allows the network to CNNs and dense prediction tasks. capture more contextual information. Downsampling con- Distribute is defined as the list of features that each lattice sists of two steps: creation of a coarse lattice and obtaining vertex receives. However, they are not summed as done by its values. Coarse lattices are created by repeatedly divid- splatting: ing the point cloud positions by 2 and using them to create new lattice vertices (Barron et al. 2015). The values of x = S( P, V ) = b f , (1) v pv p the coarse lattice are obtained by convolving over the finer p∈ J v lattice from the previous level (Fig. 4). Hence, we must embed the coarse lattice inside the finer one by scaling where x is the value of lattice vertex v and b is the barycen- v pv the coarse vertices by 2. Afterwards, the neighbors vertices tric weight between point p and lattice vertex v. over which we convolve are separated by a vector of form d+1 Instead, our distribute operators D and D concatenate G F ± [−1,..., −1, d, −1,... , −1] ∈ Z . The downsam- coordinates and features of the contributing points: pling operation effectively performs a strided convolution. Upsampling follows a similar reasoning. The fine vertices x = P(D ; D ), (2) need first to be embedded in the coarse lattice using a v v v g f 123 Autonomous Robots (a) Slice (b) DeformSlice Fig. 4 Coarsen: Downsampling of the lattice is performed by embed- Fig. 5 Slice and DeformSlice: Slicing barycentrically interpolates the ding the coarse lattice in the finer one and convolving over the neighbors. vertex values back onto a point. DeformSlice allows for the network to This effectively performs a strided convolution. Transposed convolution directly affect the interpolated value by learning offsets of the barycen- is performed in an analogous manner by embedding a fine lattice into a tric coordinates coarse one (d+1) (d+1)v division by 2. Afterwards, the neighboring vertices over b and q as vectors in R and R , respectively, p p which we convolve are separated by a vector of form and cast the prediction of offsets as a fully connected layer ± [−0.5,..., −0.5, d/2, −0.5,... , −0.5]. The careful reader followed by a non-linearity: will notice that in this case, the coordinates of the neigh- boring vertices may not be integer anymore; they may have b = F (q ) = σ(q · W + b). (10) p p p a fractional part and will, therefore, lie in the middle of a coarser simplex. In this case we ignore the contribution of However, this prediction has the disadvantage of not being this neighboring vertices and only take the contribution of permutation equivariant; therefore, permutation of the ver- the center vertex. The upsampling operation effectively per- tices would not imply the same permutation in the barycentric forms a transposed convolution. offsets: DeformSlicing While the slicing operation Y barycentrically interpolates the values back to the points by using barycentric coordinates: F (π q ) = π F (q ), (11) p p f = Y( P, V ) = b x , (6) p pv v where π is the set of all permutations of the d + 1 vertices. v∈I It is important for our prediction to be permutation equiv- ariant because the vertices may be arranged in any order and we propose the DeformSlicing Y which allows the network the barycentric offsets need to keep a consistent preference to directly modify the barycentric coordinates and shift the towards a certain vertexes’ features, regardless of its position position within the simplex for data-dependent interpolation: within a simplex. In order for the prediction of the offsets to be consis- f = Y( P, V ) = (b + b )x . (7) p pv pv v tent with permutations of the vertices, we take inspiration v∈I from the work of Ravanbakhsh et al. (2016) and Zaheer et al. (2017) of equivariant layers and design F as: Here, b are offsets that are applied to the original pv barycentric coordinates. A parallel branch within our net- b = σ(b + (b x − max{b x }) · W), (12) work first gathers the values from all the vertices in a simplex pv pv v pd d d∈I and regresses the b : pv b = F (q ) ={ b | v ∈ I }, (13) p p pv p q = G( P, V ) ={ b x | v ∈ I }, (8) p pv v p v ×1 where W ∈ R is a weight matrix and b ∈ R corre- b = F (q ), (9) p p sponds to a scalar bias. In other words, we subtract from where q is a set containing the weighted values of all each weighted vertex the maximum of the weighted values the vertices of the simplex containing p and the prediction of all the other vertices in the simplex. Since the max opera- b ={ b | v ∈ I } is a set of offsets to the barycentric tion is invariant to permutations of the input, the regression p pv p coordinates towards the d + 1 vertices. With a slight abuse of the offsets is equivariant to permutations of the vertices. of notation—due to the fact that the vertices of a simplex are The difference between the slicing and our DeformSlicing always enumerated in a consistent manner, we can regard is visualized in Fig. 5 123 Autonomous Robots 6 Segmentation methods vector for point p and μ as the mean or cluster center for i c cluster c.The δ and δ are the margins for the variance and v d Due to the flexibility of LatticeNet various segmentation distance loss respectively. We set α = β = 1 and γ = 0.001 methods can be implemented. In this section, we detail the A visualization of the pipeline for instance segmentation methods used for each one. can be seen in Fig. 6. 6.3 Motion segmentation 6.1 Semantic segmentation Motion segmentation distinguishes between dynamic and Semantic segmentation uses the default U-Net architecture static objects within a point cloud. For this, the network needs described in the Network Architecture section. It is trained temporal information. We extend the original LatticeNet U- with an equal part combination of cross entropy loss and Net architecture with a recursive architecture that can process Lovász loss (Berman et al. 2018). The Lovász loss acts as a a sequence of point clouds P at times t , t − 1,..., t − n surrogate for the intersection-over-union score and is espe- seq and learn to distinguish for example between a moving car cially useful for dealing with class imbalance. and a parked car. The dynamic objects are considered as additional classes. 6.2 Instance segmentation Hence, we use the same loss as in the case of semantic seg- mentation. We also explore multiple ways to perform the Our instance segmentation network follows the work of fusion of temporal information which we detail in the Net- other proposal-free methods like (De Brabandere et al. work Architecture section. 2017). We use LatticeNet to predict for each 3D point p in the point cloud an embedding x . A discriminative loss encourages closeness in embeddings space for points of the 7 Network architecture same instance while promoting distance between different instances. Finally, we apply mean-shift clustering on the Input to our network is a point cloud P which may contain points in embeddings space. Points belonging to the same per-point features stored in F. The output is class probabilities cluster are defined as an Instances. for each point p. In the recurrent network the input is an This discriminative loss can be expressed with three terms: ordered set of point clouds P and the output are class seq probabilities for the last point cloud of the sequence. Moving – Variance term: The intra-cluster pull force that draws the and static objects are considered as different semantic classes. embeddings towards the mean embedding. Our network architecture has a U-Net structure (Ron- – Distance term: An inter-cluster push force that forces the neberger et al. 2015) and is visualized in Fig. 7 together with clusters to be far apart from each other in embedding the used individual blocks. space. The first layers distribute the point features onto the – Regularization term: A small force that pulls the cluster lattice and use a PointNet to obtain local features. After- centers towards the origin in order to keep the activations wards, a series of ResNet blocks (He et al. 2016a), followed bounded. by repeated downsampling, aggregates global context. The decoder branch mirrors the encoder architecture and upsam- The full loss is then defined as: ples through transposed convolutions. Finally, a DeformSlic- ing propagates lattice features onto the original point cloud. C N 1 1 Skip connections are added by concatenating the encoder L = [μ − x  − δ ] (14) var c i v C N feature maps with matching decoder features. c=1 i =1 C C 7.1 Temporal fusion L = 2δ −μ − μ  (15) dist d c c A B C (C − 1) c =1 c =1 A B c =c A B Incorporating temporal information for motion prediction over a sequence of point clouds relies on fusing information L = μ  (16) reg c between multiple time-steps. For this purpose, the feature c=1 vectors of the timesteps t − 1 and t are passed through a L = α · L + β · L + γ · L (17) Temporal Fusion block, as shown in Fig. 8. This fusion con- var dist reg sists of a concatenation of both feature vectors and a linear We define C as the number of clusters in the ground truth, layer followed by a non-linearity (Fig. 9). Each new time- N as the number of elements in cluster c, x as the embedding step allocates additional vertices in the lattice corresponding c i 123 Autonomous Robots Fig. 6 Instance segmentation: LatticeNet takes raw point clouds as input and embeds them into a sparse lattice where convolutions are applied. Features on the lattice are projected onto a 2D space where clustering is performed. The clusters define the instances of each object type in the original cloud to newly explored areas in the map. For correct fusion, the features from the previous time-step need to be zero-padded so that the sizes match. Additionally, we performed experiments with a single Temporal Fusion block in the network and max-pooling over both feature vectors instead of the linear layer, but found that three Temporal Fusion blocks achieved overall superior results (Fig. 10). It should be noted that our approach for temporal fusion relies on a sequence of clouds that are transformed into a common coordinate frame. The required scan poses for trans- formation can be obtained e.g. from GPS or SLAM. 8 Implementation Our lattice is stored sparsely on a hash map structure, which allows for fast access of neighboring vertices. Unlike (Su et al. 2018), we construct the hash map directly on the GPU, saving us from incurring an expensive CPU to GPU memory copy. For memory savings, we implemented the DeformSlice Fig. 7 Architecture: Our model follows a U-Net structure. For ease of and the last linear classification layer in one fused operation, representation, blocks which are repeated one after another are indicated avoiding the storage of high-dimensional feature vectors for with a multiplier on the right side of the operation each point in the point cloud. All of the lattice operators containing forwards and back- wards passes are implemented on the GPU and exposed to We share the PyTorch implementation of LatticeNet at PyTorch (Paszke et al. 2017). https://github.com/AIS-Bonn/lattice_net. Following recent works (He et al. 2016b; Huang et al. 2017), all convolutions are pre-activated using Group Nor- malization (Wu and He 2018) and a ReLU unit. We chose 9 Experiments Group Normalization instead of the standard batch normal- ization due to greater stability for small batch sizes. We use We evaluate our proposed lattice network on four differ- the default of 32 groups. ent datasets: ShapeNet (Yi et al. 2016), ScanNet (Dai et al. The models were trained using the Adam optimizer with 2017), SemanticKITTI (Behley et al. 2019) and Pheno4D a learning rate of 0.001 and a weight decay of 10 4. The (https://www.ipb.uni-bonn.de/data/pheno4d/). For the task learning rate was reduced by a factor of 10 when the loss of semantic segmentation and motion segmentation we plateaued. report the mean Intersection-over-Union (mIoU). For the 123 Autonomous Robots color jitter. A video with additional footage of the experi- ments is available online . 9.1 Evaluation of segmentation accuracy ShapeNet part segmentation is a subset of the ShapeNet dataset (Yi et al. 2016) which contains objects from 16 dif- ferent categories each segmented into 2–6 parts. The dataset consists of points sampled from the surface of the objects, together with the ground truth label of the corresponding object part. The objects have an average of 2613 points. We train and evaluate our network on each object individually. We use the official train/test splits as defined by the dataset containing a total of 12 137 training objects and 2874 test objects. The results for our and five competing methods are gathered in Table 1 and visualized in Fig. 11. We observe that for some classes, we obtain state-of-the- art performance and for other objects, the IoU is slightly lower than for other approaches. We ascribe this to the fact that training one fixed architecture size for each individual object is suboptimal as some objects like the ”cap” have as few as 55 examples while others like the table have more than 5K. This causes the network to be prone to overfitting on the easy object or underfitting on the difficult ones. A fair evalua- tion would require finding an architecture that performs well for all objects on average. However, due to various issues with mislabeled ground truths (Su et al. 2018) we deem that Fig. 8 Recurrent architecture: The features from previous time-steps are fused in the current time-step at multiple levels of the network. This experimentation with more architectures or with different allows the network to distinguish dynamic objects from static ones regularization strengths for individual objects would overfit the dataset. ScanNet 3D segmentation Daietal. (2017) consists of 3D reconstructions of real rooms. It contains ≈ 1500 rooms seg- mented into 20 classes (bed, furniture, wall, etc.). The rooms have between 9K and 537K points—on average 145K. We segment an entire room at once without cropping. We use the official train/test splits as defined by the dataset containing a total of 1201 training rooms and 100 test objects. Results are gathered in Table 2 and visualized in Fig. 12. We obtain an IoU of 64.0 which is significantly higher than the most similar Fig. 9 Temporal fusion: The features from the previous time-step are related work of SplatNet. It is to be noted that MinkowskiNet zero-padded in order to account for the new vertices that were allocated at the current time-step. The features are afterwards concatenated and achieves a higher IoU but at the expense of an extremely high passed through a linear layer followed by a non-linearity spatial resolution of 2 cm per voxel. In contrast, our approach allocates lattice vertices so that each vertex covers approxi- mately 30 points. On this dataset, this corresponds to a spatial task of instance segmentation, we report the Symmetric Best extent of approximately 10 cm. Dice (SBD) (De Brabandere et al. 2017). SBD measures the SemanticKITTI Behley et al. (2019) consists of semanti- accuracy of the instance segmentation by averaging for each cally annotated LiDAR scans of real urban environments. input label the ground truth label yielding the maximum Dice The annotation covers a total of 19 classes for single scan score. evaluation and a total of 25 classes for multiple scan evalua- We use a shallow model for ShapeNet and Pheno4D and a tion. Each scan contains between 82K and 129K points. We deeper model for ScanNet and SemanticKITTI as the datasets process each scan entirely without any cropping. We use the are larger. We augment all data using random mirroring and translations in space. For ScanNet, we also apply random http://www.ais.uni-bonn.de/videos/RSS_2020_Rosu/. 123 Autonomous Robots For motion segmentation we take as input three point clouds at consecutive time steps and output the segmentation for the final, most recent cloud. We overlap this time window so that every clouds gets to be segmented. For the first few clouds, the time window is reduced as there are no clouds from previous time-steps to give as input. The results for the motion segmentation are provided in Table 4 and visualized in Fig. 14. We observe that for motion segmentation we outperform other approaches except for KPConv (Thomas et al. 2019), Fig. 10 Bonn Activity Maps segmentations. Colored meshes are recon- which has higher IoU. However, it is to be noted that KPconv structed from KinectV2 data using volumetric integration (Nießner et al. cannot process a full point cloud at once due to memory 2013; Stotko et al. 2019) and semantically segmented using LatticeNet. constraints and rather processes sub-clouds centered around Color coding of semantic labels corresponds to the ScanNet dataset (Dai random spheres in the scene. The spheres are chosen ran- et al. 2017) domly in the scene to ensure each point is tested multiple times by different sphere locations. Finally, a voting scheme gives the final prediction. In contrast, our approach can process a full point cloud without requiring neighborhood searching or partitioning in sub-clouds. Bonn Activity Maps (Tanke et al. 2019) is a dataset for human tracking, activity recognition and anticipation of multiple persons. It contains annotations of persons, their trajectories and activities. The 3D reconstruction of the four kitchen sce- narios is however of more interest to us. The environments are reconstructed as 3D colored meshes and have no ground truth semantic annotations. We trained our LatticeNet on the ScanNet dataset and evaluate it on the 4 kitchens in order Fig. 11 ShapeNet (Yi et al. 2016) results of our method to provide an annotation for each vertex of the mesh. The results are shown in Fig. 10. We can observe that our network generalizes well to unseen datasets, recorded with different sensors and with different noise properties as the seman- tic segmentations look plausible and exhibit sharp borders between classes. Pheno4D https://www.ipb.uni-bonn.de/data/pheno4d/ is a spatio-temporal dataset of point clouds of maize and tomato plants with instance annotations of leaves. We use a shallow version of LatticeNet to compute per-point embeddings and cluster them using mean-shift to recover the instances. We Fig. 12 ScanNet results. The left image shows the ground truth and the compare with PointNet and PointNet++ as they are popu- right one our prediction lar methods for computing per-point embeddings. Since the dataset contains 7 maize and 7 tomato plants, we train on the official train/validation splits as defined by the dataset. The first 5 plants for each type and test on the remaining two. The test set is not publicly available and testing can only be done results are gathered in Table 5. We observe that our method through the benchmark server. is capable of computing more meaningful embeddings that The results for single scan are provided in Table 3 and create more distinctive clusters between each plant organ. visualized in Fig. 13. Our LatticeNet outperforms all other methods—in case of the most similar SplatNet by more than a factor of two. It is to be noted that DarkNet53Seg (Behley 9.2 Ablation studies et al. 2019), DarkNet21Seg (Behley et al. 2019) and Squeeze- SegV2 (Wu et al. 2018) are methods that operate on a 2D We perform various ablations regarding our contribution to image by wrapping the LiDAR scans to 2D using spherical judge how much they affect the network’s performance. coordinates. In contrast, our method can operate on general DeformSlice We assess the impact that DeformSlice has on point clouds, directly in 3D. the network by comparing it with the Slice operator which 123 Autonomous Robots Fig. 13 SemanticKITTI results. We compare the prediction from our number of points. Additionally, the network also effectively makes use LatticeNet with the results from TangentConv (Tatarchenko et al. 2018) of contextual information in order to correctly predict the parking place and SplatNet (Su et al. 2018). We can observe that our approach can due to the existence of nearby cars better learn small objects like tree trunks, despite their relatively small Table 1 Results on ShapeNet part segmentation (Yi et al. 2016) #Instances 2690 76 55 898 3758 69 787 392 1547 451 202 184 283 66 152 5271 Instance Air- Bag Cap Car Chair Ear- Guitar Knife Lamp Laptop Motor- Mug Pistol Rocket Skate- Table avg. plane phone bike board PointNet (Qi et al. 2017a) 83.7 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6 PointNet++ (Qi et al. 2017b) 85.1 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6 SplatNet 3D (Su et al. 2018) 84.6 81.9 83.9 88.6 79.5 90.1 73.5 91.3 84.7 84.5 96.3 69.7 95.0 81.7 59.2 70.4 81.3 SplatNet 2D-3D (Su et al. 2018) 85.4 83.2 84.3 89.1 80.3 90.7 75.5 92.1 87.1 83.9 96.3 75.6 95.8 83.8 64.0 75.5 81.8 FCPN (Rethage et al. 2018) 84.0 84.0 82.8 86.4 88.3 83.3 73.6 93.4 87.4 77.4 97.7 81.4 95.8 87.7 68.4 83.6 73.4 Ours 83.9 82.3 84.8 79.1 81.0 86.9 71.0 91.9 89.4 84.7 96.6 77.2 95.8 86.0 70.5 79.3 87.0 Bold signifies the highest mean intersection-over-union (mIoU) Table 2 Results on ScanNet (Dai et al. 2017) We also evaluate a version of DeformSlice which ensures that the new barycentric coordinates still sum up to one by Method mIOU adding an additional loss term: PointNet++ (Qi et al. 2017b) 33.9 ⎛ ⎞ SplatNet (Su et al. 2018) 39.3 ⎝ ⎠ TangetConv (Tatarchenko et al. 2018) 43.8 L = b . (18) pv | P| 3DMV (Dai and Nießner 2018) 48.4 p∈ P v∈I MinkowskiNet42 (5cm) (Choy et al. 2019) 67.9 † However, we observe little change after adding this reg- SparseConvNet (Graham et al. 2018) 72.5 ularization term and hence, use the default version of MinkowskiNet42 (2cm) (Choy et al. 2019) 73.4 DeformSlice for the rest of the experiments. The results are Ours 64.0 gathered in Table 6. Bold signifies the highest mean intersection-over-union (mIoU) Distribute and PointNet Another contribution of our work is the usage of a Distribute operator to provide values to the lattice vertices which are later embedded in a higher- does not use learned barycentric interpolation. We evaluate dimensional space by a PointNet-like architecture. The this on SemanticKITTI, the largest dataset that we are using. positions and features of the point cloud are treated sep- 123 Autonomous Robots Table 3 Results on SemanticKITTI (Behley et al. 2019) Approach mIoU Road Sidewalk Parking Other- Building Car Truck Bicycle Motorcycle Other- Vegetation Trunk Terrain Person Bicyclist Motor- Fence Pole Traffic ground vehicle cyclist sign PointNet (Qietal. 14.6 61.6 35.7 15.8 1.4 41.4 46.3 0.1 1.3 0.3 0.8 31.0 4.6 17.6 0.2 0.2 0.0 12.9 2.4 3.7 2017a) SplatNet (Su et al. 18.4 64.6 39.1 0.4 0.0 58.3 58.2 0.0 0.0 0.0 0.0 71.1 9.9 19.3 0.0 0.0 0.0 23.1 5.6 0.0 2018) PointNet++ (Qi 20.1 72.0 41.8 18.7 5.6 62.3 53.7 0.9 1.9 0.2 0.2 46.5 13.8 30.0 0.9 1.0 0.0 16.9 6.0 8.9 et al. 2017b) Minkowski34(25cm) 33.0 80.8 43.0 36.9 0.5 73.5 83.0 42.9 2.0 2.9 7.8 74.4 42.9 36.7 11.2 22.8 4.4 37.2 35.4 28.6 (Choy et al. 2019) SqueezeSegV2 39.7 88.6 67.6 45.8 17.7 73.7 81.8 13.4 18.5 17.9 14.0 71.8 35.8 60.2 20.1 25.1 3.9 41.1 20.2 36.3 (Wu et al. 2018) TangentConv 40.9 83.9 63.9 33.4 15.4 83.4 90.8 15.2 2.7 16.5 12.1 79.5 49.3 58.1 23.0 28.4 8.1 49.0 35.8 28.5 (Tatarchenko et al. 2018) DarkNet21Seg 47.4 91.4 74.0 57.0 26.4 81.9 85.4 18.6 26.2 26.5 15.6 77.6 48.4 63.6 31.8 33.6 4.0 52.3 36.0 50.0 (Behley et al. 2019) DarkNet53Seg 49.9 91.8 74.6 64.8 27.9 84.1 86.4 25.5 24.5 32.7 22.6 78.3 50.1 64.0 36.2 33.6 4.7 55.0 38.9 52.2 (Behley et al. 2019) Ours 52.9 90.0 74.1 59.4 22.0 88.2 92.9 26.6 16.6 22.2 21.4 81.7 63.6 63.1 35.6 43.0 46.0 58.8 51.9 48.4 Bold signifies the highest mean intersection-over-union (mIoU) Autonomous Robots Table 4 Motion segmentation IoU results on SemanticKITTI (Behley et al. 2019) using a sequence of multiple past scans (in %) Approach 84.9 21.1 18.5 1.6 0.0 0.0 TangentConv [41] 34.1 40.3 42.2 30.1 6.4 1.1 1.9 84.1 20.0 20.7 7.5 0.0 0.0 DarkNet53Seg [4] 41.6 61.5 37.8 28.9 15.2 14.1 0.2 88.5 29.2 22.7 6.3 0.0 0.0 SpSequenceNet [37] 43.1 53.2 0.1 2.3 26.2 41.2 36.2 93.7 70.3 38.6 21.6 0.0 0.0 KPConv [43] 51.2 69.4 5.8 4.7 67.5 67.4 47.2 91.1 65.4 23.1 6.8 0.0 0.0 Ours 45.2 54.8 3.5 0.6 49.9 44.6 64.3 Shaded cells correspond to the IoU of the moving classes, while unshaded entries are the non-moving classes Table 5 Instance segmentation performance on the maize and tomato plants of the Pheno4D dataset SBD Maize Tomato PointNet (Qi et al. 2017a) 69.7 47.3 PointNet++ (Qi et al. 2017b) 74.8 56.1 LatticeNet (ours) 80.6 74.2 Bold signifies the highest mean intersection-over-union (mIoU) Table 6 Ablation study of the various components of LatticeNet. Var- ious features are disabled (indicated in red) and the impact to the IoU Fig. 14 Motion segmentation results on SemanticKITTI. The moving is evaluated car on the road (red) is correctly distinguished from the parked car (orange) (Color figure online) arately where the features (normals, color) are distributed directly. From the positions, we substract the locally aver- aged position as we assume that the local point distribution is more important than the coordinates in the global reference frame. We evaluate the impact of elevating the point features to a higher-dimensional space and subtracting the local mean against a simple splatting operator which just averages the features of the points around each corresponding vertex. We observe that not subtracting the local mean, and just using the xyz coordinates as features, heavily degrades the performance, causing the mIoU to drop from 52.9 to 43.0. Finally, naive application of the splat operation performs This further reinforces the idea that the local point distribu- worst with a mere 37.8 mIoU. tion is a good local feature to use in the first layers of the network. Not elevating the point cloud features to a higher- 9.3 Performance dimensional space before applying the max-pool operation also hurts performance but not as severely. In our experi- We report the time taken for a forward pass and the maximum ments, we elevate the features to 64 dimensions by using a memory used in our shallow and deep network on the first series of fully connected layers. three evaluated datasets. The performance was measured on car truck other-vehicle person bicyclist motorcyclist mIoU Autonomous Robots Table 7 Average time used by ShapeNet ScanNet SemanticKITTI the forward pass and the [ms] [GB] [ms] [GB] [ms] [GB] maximum memory used during training. An X indicates a SplatNet 129 0.6 XX 2931 8.9 method that failed to process the Ours 49 0.5 180 6.5 143 3.5 whole cloud due to memory limitations Bold signifies the highest mean intersection-over-union (mIoU) a NVIDIA Titan X Pascal and the results are gathered in right holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/. Table 7. In the case of motion segmentation, the inference times and memory used are the same as in the case of a single scan, as we use the same backbone network to extract features References and the computational cost of fusing the temporal informa- A large scale spatio-temporal dataset of point clouds of maize tion is minimum. However for training, the network requires and tomato plants. https://www.ipb.uni-bonn.de/data/pheno4d/. more memory with increasing time window due to the back- Accessed: 2021-01-1. propagation through time. This scales linearly with the time Baek, J., & Adams, A. (2009). Some useful properties of the permuto- window size and the amount of points in the cloud. hedral lattice for Gaussian filtering. Other Words 10(1). Barron, J.T., Adams, A., YiChang, S., & Hernández, C. (2015). Fast Despite the reduced memory usage compared to SplatNet bilateral-space stereo for synthetic defocus—Supplemental mate- and increased speed of execution, there are still memory sav- rial. In Proceedings of the IEEE Conference on Computer Vision ings possible by fusing the Distribute and PointNet operators and Pattern Recognition (CVPR), pp. 1–15. into one GPU operation. This is similar to fusing our Deform- Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019) SemanticKITTI: A Dataset for Semantic Slice and the classification layer. Additionally, we expect the Scene Understanding of LiDAR Sequences. In Proceedings of the network to become even faster as further advances on highly IEEE International Conference on Computer Vision (ICCV). optimized kernels for convolution on sparse lattices become Berman, M., Triki, A.R., & Blaschko, M.B. (2018). The Lovász- available. At the moment, the convolutions are performed by softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceed- our custom CUDA kernels. Tighter integration however with ings of the IEEE Conference on Computer Vision and Pattern highly optimized libraries like cuDNN (Chetlur et al. 2014) Recognition (CVPR), pp. 4413–4421. could be beneficial. Chen, L-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethink- ing atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Chen, W., Han, X., Li, G., Chen, C., Xing, J., Zhao, Y., & Li, H. (2018). Deep RBFNet: Point cloud feature learning using radial basis func- 10 Conclusion tions. arXiv preprint arXiv:1812.04302. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catan- zaro, B., & Shelhamer, E. (2014). cuDNN: Efficient primitives for We presented LatticeNet, a novel method for point cloud deep learning. arXiv preprint arXiv:1410.0759. segmentation. A sparse permutohedral lattice allows us to Choy, C., Gwak, J., & Savarese, S. (2019). 4D Spatio-Temporal Con- efficiently process large point clouds. The usage of PointNet vNets: Minkowski Convolutional Neural Networks. arXiv preprint together with a data-dependent interpolation alleviates the arXiv:1904.08755. Dai, A., & Nießner, M. (2018). 3DMV: Joint 3D-multi-view predic- quantization issues of other methods. Experiments on four tion for 3D semantic scene segmentation. In Proceedings of the datasets show state-of-the-art results, at a reduced time and European Conference on Computer Vision (ECCV), pp. 452–468. memory budget. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D reconstruc- Funding Open Access funding enabled and organized by Projekt tions of indoor scenes. In Proceedings of the IEEE Conference DEAL. on Computer Vision and Pattern Recognition (CVPR), pp. 5828– Open Access This article is licensed under a Creative Commons De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic Attribution 4.0 International License, which permits use, sharing, adap- instance segmentation with a discriminative loss function. arXiv tation, distribution and reproduction in any medium or format, as preprint arXiv:1708.02551. long as you give appropriate credit to the original author(s) and the Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional source, provide a link to the Creative Commons licence, and indi- neural networks on graphs with fast localized spectral filtering. In cate if changes were made. The images or other third party material Proceedings of the Advances in Neural Information Processing in this article are included in the article’s Creative Commons licence, Systems (NIPS), pp. 3844–3852. unless indicated otherwise in a credit line to the material. If material Graham, B., Engelcke, M., & van der Maaten, L. (2018). 3D seman- is not included in the article’s Creative Commons licence and your tic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copy- Pattern Recognition (CVPR), pp. 9224–9232. 123 Autonomous Robots Gu, X., Wang, Y., Wu, C., Lee, Y.J., & Wang, P. (2019). HPLFlowNet: Qi, C.R., Yi, L., Su, H., & Guibas, L.J. (2017b). PointNet++: Deep Hierarchical Permutohedral Lattice FlowNet for Scene Flow Esti- hierarchical feature learning on point sets in a metric space. In mation on Large-scale Point Clouds. In Proceedings of the IEEE Proc. of the Advances in Neural Information Processing Systems Conference on Computer Vision and Pattern Recognition (CVPR), (NIPS), pp. 5099–5108. pp 3254–3263. Qi, C.R., Litany, O., He, K., & Guibas, L.J. (2019) Deep Hough voting He, K., Zhang, X., Ren, S., & Sun, J. (2016a) Deep residual learning for 3D object detection in point clouds. In Proceedings of the IEEE for image recognition. In Proceedings of the IEEE Conference on International Conference on Computer Vision (ICCV), pp. 9277– Computer Vision and Pattern Recognition (CVPR), pp. 770–778. 9286. He, K., Zhang, X., Ren, S., & Sun, J. (2016b) Identity mappings in deep Ravanbakhsh, S., Schneider, J.G., & Póczos, B. (2016). Deep Learning residual networks. In Proceedings of the European Conference on with Sets and Point Clouds. arXiv preprint arXiv:1611.04500. Computer Vision (ECCV), pp. 630–645. Rethage, D., Wald, J., Sturm, J., Navab, N., & Tombari, F. (2018). Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q. (2017). Fully-convolutional point networks for large-scale point clouds. Densely connected convolutional networks. In Proceedings of the In Proceedings of the European Conference on Computer Vision IEEE Conference on Computer Vision and Pattern Recognition (ECCV), pp. 596–611. (CVPR), pp. 4700–4708. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Huang, J., Zhang, H., Yi, L., Funkhouser, T., Nießner, M., & Guibas, L.J. networks for biomedical image segmentation. In International (2019). TextureNet: Consistent local parametrizations for learn- Conference on Medical Image Computing and Computer-assisted ing from high-resolution signals on meshes. In Proceedings of the Intervention, pp. 234–241. IEEE Conference on Computer Vision and Pattern Recognition Rosu, R.A., Schütt, P., Quenzel, J., & Behnke, S. (2020). LatticeNet: (CVPR), pp. 4440–4449. Fast point cloud segmentation using permutohedral lattices. Pro- Li, Y., Bu, R., Sun, M., Wu, W., Di, X., & Chen, B. (2018). PointCNN: ceedings of Robotics: Science and Systems. Convolution on x-transformed points. In Proceedings of the Shi, H., Lin, G., Wang, H., Hung, T-Y., & Wang, Z. (2020). SpSe- Advances in Neural Information Processing Systems (NIPS), pp. quenceNet: Semantic Segmentation Network on 4D Point Clouds. 820–830. In Proceedings of the IEEE Conference on Computer Vision and Lin, G., Milan, A., Shen, C., & Reid, I. (2017). RefineNet: Multi-path Pattern Recognition (CVPR), pp. 4574–4583. refinement networks for high-resolution semantic segmentation. Stotko, P., Krumpen, S., Weinmann, M., & Klein, R. (2019). Efficient In Proceedings of the IEEE Conference on Computer Vision and 3D Reconstruction and Streaming for Group-Scale Multi-Client Pattern Recognition (CVPR), pp. 1925–1934. Live Telepresence. In Proceedings of the IEEE International Sym- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional net- posium on Mixed and Augmented Reality (ISMAR), pp. 19–25. works for semantic segmentation. In Proceedings of the IEEE Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M-H., & Conference on Computer Vision and Pattern Recognition (CVPR), Kautz, J. (2018). SplatNet: Sparse lattice networks for point cloud pp. 3431–3440. processing. In Proceedings of the IEEE Conference on Computer Masci, J., Boscaini, D., Bronstein, M., & Vandergheynst, P. (2015). Vision and Pattern Recognition (CVPR), pp. 2530–2539. Geodesic convolutional neural networks on Riemannian mani- Tanke, J., Kwon, O-H., Stotko, P., Rosu, R.A., Weinmann, M., Errami, folds. In Workshop Proceedings of the IEEE International Con- H., Behnke, S., Bennewitz, M., Klein, R., Weber, A., et al. ference on Computer Vision (ICCV Workshops), pp. 37–45. (2019). Bonn Activity Maps: Dataset Description. arXiv preprint Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, arXiv:1912.06354. M.M. (2017). Geometric deep learning on graphs and manifolds Tatarchenko, M., Park, J., Koltun, V., & Zhou, Q-Y. (2018). Tangent using mixture model CNNs. In Proceedings of the IEEE Confer- convolutions for dense prediction in 3D. In Proceedings of the ence on Computer Vision and Pattern Recognition (CVPR), pp. IEEE Conference on Computer Vision and Pattern Recognition 5115–5124. (CVPR), pp. 3887–3896. Neven, D., De Brabandere, B., Proesmans, M., & Van Gool, L. (2019). Tchapmi, L., Choy, C., Armeni, I., Gwak, J., & Savarese, S. (2017). Instance segmentation by jointly optimizing spatial embeddings SEGCloud: Semantic segmentation of 3D point clouds. In Inter- and clustering bandwidth. In Proceedings of the IEEE Conference national Conference on 3D Vision (3DV), pp. 537–547. IEEE. on Computer Vision and Pattern Recognition (CVPR), pp. 8837– Thomas, H., Qi, C.R., Deschaud, J-E., Marcotegui, B., Goulette, F., & 8845. Guibas, L.J. (2019). KPConv: Flexible and deformable convolu- Nießner, M., Zollhöfer, M., Izadi, S., & Stamminger, M. (2013). tion for point clouds. In Proceedings of the IEEE Int. Conference Real-time 3D reconstruction at scale using voxel hashing. ACM on Computer Vision (ICCV), pp. 6411–6420. Transactions on Graphics (ToG), 32(6), 1–11. Wang, S., Suo, S., Ma, W.C., Pokrovsky, A., & Urtasun, R. (2018a). Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Deep parametric continuous convolutional neural networks. In Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic Proceedings of the IEEE Conference on Computer Vision and Pat- Differentiation in PyTorch. In NIPS Autodiff Workshop. tern Recognition (CVPR), pp. 2589–2597. Pham, Q-H., Hua, B-S., Nguyen, T., & Yeung, S-K. (2019a). Real- Wang, W., Yu, R., Huang, Q., & Neumann, U. (2018b). SGPN: time progressive 3D semantic segmentation for indoor scenes. In Similarity group proposal network for 3D point cloud instance Proceedings of the IEEE Workshop on Applications of Computer segmentation. In Proceedings of the IEEE Conference on Com- Vision, pp. 1089–1098. puter Vision and Pattern Recognition (CVPR), pp. 2569–2578. Pham, Q-H., Nguyen, T., Hua, B-S., Roig, G., & Yeung, S-K. (2019b). Wang, X., Liu, S., Shen, X., Shen, C., & Jia, J. (2019). Associatively seg- JSIS3D: Joint semantic-instance segmentation of 3D point clouds menting instances and semantics in point clouds. In Proceedings of with multi-task pointwise networks and multi-value conditional the IEEE Conference on Computer Vision and Pattern Recognition random fields. In Proceedings of the IEEE Conference on Com- (CVPR), pp. 4096–4105. puter Vision and Pattern Recognition (CVPR), pp. 8827–8836. Wu, B., Zhou, X., Zhao, S., Yue, X., & Keutzer, K. (2018). Squeeze- Qi, C.R., Su, H., Mo, K., & Guibas, L.J. (2017a). PointNet: Deep Segv2: Improved model structure and unsupervised domain adap- learning on point sets for 3D classification and segmentation. In tation for road-object segmentation from a lidar point cloud. arXiv Proceedings of the IEEE Conference on Computer Vision and Pat- preprint arXiv:1809.08495. tern Recognition (CVPR), pp. 652–660. Wu, W., Qi, Z., & Fuxin, L. (2019). PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Con- 123 Autonomous Robots Jan Quenzel received his M.Sc. ference on Computer Vision and Pattern Recognition (CVPR), pp. degree in Computer Science from 9621–9630. the University of Lübeck in 2015. Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the Since August 2015, he is a mem- European Conference on Computer Vision (ECCV), pp. 3–19. ber of the Autonomous Intelligent Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., & Trigoni, Systems Group at the University N. (2019). Learning object bounding boxes for 3D instance seg- of Bonn. His research focuses on mentation on point clouds. arXiv preprint arXiv:1906.01140. LiDAR and visual odometry for Yi, L., Kim, L. G., Ceylan, D., Shen, I., Yan, M., Su, H., et al. (2016). micro aerial vehicles, surface A scalable active framework for region annotation in 3D shape reconstruction, and sensor calibra- collections. ACM Transactions on Graphics (ToG), 35(6), 210. tion. Yi, L., Zhao, W., Wang, H., Sung, M., & Guibas, L.J. (2019). GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pp. 3947–3956. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., & Smola, A.J. (2017). Deep sets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 3391–3401. Sven Behnke received his Ph.D. from Freie Universität Berlin in 2002. He worked in 2003 as post- Publisher’s Note Springer Nature remains neutral with regard to juris- doctoral researcher at the Interna- dictional claims in published maps and institutional affiliations. tional Computer Science Institute, Berkeley. From 2004 to 2008, he headed the Humanoid Robots Radu Alexandru Rosu is a PhD stu- Group at Albert-Ludwigs-Universi dent in the group of Autonomous tat Freiburg. Since 2008, he is Intelligent Systems, University of professor for Autonomous Intel- Bonn, Germany. He holds a mas- ligent Systems at the University ter’s degree in computer science of Bonn. His research interests from the University of Bonn and a include micro aerial vehicles, cog- bachelor’s degree in computer sci- nitive robotics, computer vision, ence from the University of Sala- and machine learning. manca, Spain. His research inter- est are in the area of 3D deep learning. He seeks to create novel neural network models capable of understanding, processing and recon- structing 3D data. Peer Schütt is a computer science master’s student at the University of Bonn. He has worked in the Autonomous Intelligent Systems group since 2017. His research interests are in the area of deep learning and augmented reality. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Autonomous Robots Springer Journals

LatticeNet: fast spatio-temporal point cloud segmentation using permutohedral lattices

Loading next page...
 
/lp/springer-journals/latticenet-fast-spatio-temporal-point-cloud-segmentation-using-JRPhf0P7Oo
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2021
ISSN
0929-5593
eISSN
1573-7527
DOI
10.1007/s10514-021-09998-1
Publisher site
See Article on Publisher Site

Abstract

Deep convolutional neural networks have shown outstanding performance in the task of semantically segmenting images. Applying the same methods on 3D data still poses challenges due to the heavy memory requirements and the lack of structured data. Here, we propose LatticeNet, a novel approach for 3D semantic segmentation, which takes raw point clouds as input. A PointNet describes the local geometry which we embed into a sparse permutohedral lattice. The lattice allows for fast convolutions while keeping a low memory footprint. Further, we introduce DeformSlice, a novel learned data-dependent interpolation for projecting lattice features back onto the point cloud. We present results of 3D segmentation on multiple datasets where our method achieves state-of-the-art performance. We also extend and evaluate our network for instance and dynamic object segmentation. Keywords Semantic segmentation · Instance segmentation · Motion segmentation · Sequence segmentation · 3D point cloud 1 Introduction First, 3D data is often represented in an unstructured manner—unlike the grid-like structure of images. This raises Environment understanding is a crucial ability for autonomous difficulties for current approaches which assume a regular agents. Perceiving not only the geometrical structure of the structure upon which convolutions are defined. scene but also distinguishing between different classes of Second, the performance of current 3D networks is limited objects therein enables tasks like manipulation and inter- by their memory requirements. Storing 3D information in a action that were previously not possible. Within this field, dense structure is prohibitive for even high-end GPUs, clearly semantic segmentation of 2D images is a mature research indicating the need for a sparse structure. area, showing outstanding success in dense per pixel cate- Third, discretization issues caused by imposing a regular gorization on images (Long et al. 2015; Chen et al. 2017; grid onto point clouds can negatively affect the network’s Lin et al. 2017). However, the task of semantically labelling performance and interpolation is necessary to cope with 3D data is still an open area of research as it poses several quantization artifacts (Tchapmi et al. 2017). challenges that need to be addressed. In this work, we propose LatticeNet, a novel approach for point cloud segmentation which alleviates the previously mentioned problems. An overview of the input and output of This work has been funded by the Deutsche Forschungsgemeinschaft our method can be seen in Fig. 1. Hence, our contributions (DFG, German Research Foundation) under Germany’s Excellence are: Strategy - EXC 2070 - 390732324 and by the German Federal Ministry of Education and Research (BMBF) in the project ” Kompetenzzentrum: Aufbau des Deutschen Rettungsrobotik-Zentrums” (A-DRZ). – A hybrid architecture which leverages the strength of PointNet to obtain low-level features and sparse 3D con- This is one of the several papers published in Autonomous volutions to aggregate global context, Robotscomprising the Special Issue on Robotics: Science and Systems 2020. – A framework suitable for sparse data onto which all com- mon CNN operators are defined, and B Radu Alexandru Rosu – A novel slicing operator that is end-to-end trainable for rosu@ais.uni-bonn.de mapping features of a regular lattice grid back onto an Friedrich-Hirzebruch-Allee 8, Bonn, Germany unstructured point cloud. 123 Autonomous Robots Voxel networks 3D Convolutions in this category work on discretized cubic or tetrahedral volume elements. SEGCloud (Tchapmi et al. 2017) voxelizes the point cloud into a uniform 3D grid and applies 3D convolutions to obtain per-voxel class probabilities. A conditional ran- dom field (CRF) is used to smooth the labels and enforce global consistency. The class scores are transferred back to Fig. 1 Semantic segmentation: LatticeNet takes raw point clouds as the points using trilinear interpolation. The usage of a dense input and embeds them into a sparse lattice where convolutions are grid results in high memory consumption while our approach applied. Features on the lattice are projected back onto the point cloud uses a permutohedral lattice stored sparsely. Additionally, to yield a final segmentation their voxelization results in a loss of information due to the discretization of the space. We avoid quantization issues by In addition to our Robotics: Science and System con- using a PointNet architecture to summarize the local neigh- ference paper (Rosu et al. 2020) we make the following borhood. additional contributions: Rethage et al. (2018) perform semantic segmentation on a voxelized point cloud and employ a PointNet architecture – An extension with discriminative loss that allows Lat- as a low-level feature extractor. The usage of a dense grid, ticeNet to perform instance segmentation, and however, leads to high memory usage and slow inference, – A network architecture capable of processing temporal requiring various seconds for medium-sized point clouds. information in order to improve semantic segmentation SplatNet (Su et al. 2018) is the work most closely related and to distinguish between dynamic and static objects to ours. It alleviates the computational burden of 3D convo- within the scene. lutions by using a sparse permutohedral lattice, performing convolutions only around the surfaces. It discretizes the space in uniform simplices and accumulates the features of the raw 2 Related work point cloud onto the vertices of the lattice using a splatting operation. Convolutions are applied on the lattice vertices and 2.1 Semantic segmentation a slicing operation barycentrically interpolates the features of the vertices back onto the point cloud. A series of splat-conv- slice operations are applied to obtain contextual information. 3D Semantic segmentation approaches can be categorized depending on data representation upon which they operate. The main disadvantage is that splat and slice operations are Point cloud networks The first category of networks operates not learned and repeated application slowly degrades the directly on the raw point cloud. point clouds features as they act as Gaussian filters (Baek and From this area, PointNet (Qi et al. 2017a) is one of the Adams 2009). Furthermore, storing high-dimensional fea- pioneering works. The method processes raw point clouds by tures for each point in the cloud is memory intensive which individually embedding the points into a higher-dimensional limits the maximum number of points that can be processed. space and applying max-pooling for permutation-invariance In contrast, our approach has learned operations for splatting to obtain a global scene descriptor. The descriptor can be and slicing which brings more representational power to the network. We also restrict their usage to only the beginning used for both classification and semantic segmentation. How- ever, PointNet does not take local information into account and the end of the network, leaving the rest of the architecture fully convolutional. which is essential for the segmentation of highly-detailed objects. This has been partially solved in the subsequent Mesh networks The connectivity of triangular or quadrilateral work of PointNet++ (Qi et al. 2017b) which applies Point- mesh faces enables easy computation of normal vectors and Net hierarchically, capturing both local and global contextual establishes local tangent planes. information. GCNN (Masci et al. 2015) operates on small local patches Chen et al. (2018) use a similar approach but they input which are convolved using a series of rotated filters, followed the point responses w.r.t. a sparse set of radial basis functions by max-pooling to deal with the ambiguity in the patch orien- (RBF) scattered in 3D space. Optimizing jointly for the extent tation. However, the max-pooling disregards the orientation. and center of the RBF kernels allows to obtain a more explicit MoNet(Montietal. 2017) deals with the orientation ambi- guity by aligning the kernels to the principal curvature of modelling of the spatial distribution. PointCNN (Li et al. 2018) deals with the permutation the surface. Yet, this does not solve cases in which the local curvature is not informative, e.g. for walls or ceilings. Tex- invariance not by using a symmetric aggregation function, but by learning a K × K matrix for the K input points that tureNet (Huang et al. 2019) further improves on the idea permutes the cloud into a canonical form. by using a global 4-RoSy orientations field. This provides a 123 Autonomous Robots smooth orientation field at any point on the surface which is Kernel Point Convolution (KPConv) (Thomas et al. aligned to the edges of the mesh and has only a 4-direction 2019) operates directly on the point clouds by facilitating ambiguity. Defining convolution on patches oriented accord- convolution weights that are located in Euclidean space. ing to the 4-RoSy field yields significantly improved results. Points in the vicinity of these kernels are weighted and Graph networks These methods allow arbitrary topologies summed together to feature vectors. KPConv (Thomas et al. to connect vertices and lift the restriction of triangular or 2019), DarkNet53Seg (Behley et al. 2019) and Tangent- quadrilateral meshes. Conv (Tatarchenko et al. 2018) were previously used for the Wang et al. (2018a) and Wu et al. (2019) define a con- segmentation of 4D point clouds by accumulating multiple volution operator over non-grid structured data by having clouds of a sequence. continuous values over the full vector space. The weights of these continuous filters are parametrized by an multi-layer 2.3 Instance segmentation perceptron (MLP). Defferrard et al. (2016) formulate CNNs in the context Researchers extended principles from 2D to obtain instances of spectral graph theory. They define the convolution in in 3D which can be roughly categorized in proposal-based the Fourier domain with Chebyshev polynomials to obtain and proposal-free methods. fast localized filters. However, spectral approaches are not Proposal-based This type solves the problem in two stages. directly transferable to a new graph as the Fourier basis The first network stage generates proposals of bounding changes. Additionally, the learned filters are rotation invari- boxes for the objects in the scene. A second stage performs ant which can be seen as a limitation to the representational foreground-background segmentation on the points within power of the network. the bounding boxes in order to get valid instances. Multi-view networks The convolution operation is well Yang et al. (2019) present a single-stage method for defined in 2D and hence, there is an interest in casting 3D instance segmentation that can train both the proposal and segmentation as a series of single-view segmentations which the point-mask prediction network in an end-to-end manner. are fused together. Yi et al. (2019) alleviate some of the issues associated with Pham et al. (2019a) simultaneously reconstruct the scene wrong bounding box predictions by using an analysis-by- geometry and recover the semantics by segmenting sequences synthesis strategy. of RGB-D frames. The segmentation is transferred from 2D Proposal-free Proposal-free methods tackle instance seg- images to the 3D world and fused with previous segmenta- mentation without the need of generating object proposals. tions. A CRF finally resolves noisy predictions. They usually rely on predicting point embedding and apply Tatarchenko et al. (2018) assumes that the data is sampled clustering to recover the instances. from locally Euclidean surfaces and project the local surface Many proposal-free approaches base their work on the geometry onto a tangent plane to which 2D convolutions can 2D instance segmentation of De Brabandere et al. (2017)in be applied. This requires a heavy preprocessing for normal which pixel embeddings are predicted. There, a discrimina- calculation. In contrast, our approach can deal with raw point tive loss encourages the embeddings that belong to the same clouds without requiring normals. instance to be clustered together while embeddings from dif- ferent instances should be further apart. SPGN (Wang et al. 2018b) learns a similarity matrix for 2.2 Motion segmentation all point pairs, based on which, similar points are merged to instances. VoteNet (Qi et al. 2019) uses a Hough vot- For the task of motion segmentation two approaches have ing mechanism where the points predict the offset towards been widely used: Networks either incorporate multiple point the object center. A clustering algorithm finally recovers the clouds directly or accumulate a sequence of individually seg- object instances. mented point clouds. Neven et al. (2019) alleviate some of the issues associated Shi et al. (2020) present their U-Net based architec- with proposal-free methods by allowing also the clustering ture SpSequenceNet for semantic segmentation on 4D point algorithm to be part of the training by jointly optimizing the clouds. They input two point clouds and generate the output spatial embeddings and the clustering bandwidth. for the later one with a voxel-based method. They designed Wang et al. (2019) proposed a framework that allows for two modules, the Cross-frame Global Attention (CGA) and semantic and instances to be predicted simultaneously and for the Cross-frame Local Interpolation (CLI) module. The CGA the two tasks to mutually benefit from each other. Similarly, acts as a teacher that uses the data from P to focus the Pham et al. (2019b) recover both instances and semantics and t −1 network on the important features of P . The CLI module apply a CRF to improve the predictions accuracy. fuses information between both point clouds by combining Most of these works utilize a PointNet (Qi et al. 2017a) the spatial and temporal information. or PointNet++ (Qi et al. 2017b) network to predict the point 123 Autonomous Robots embeddings. In our case, we extend LatticeNet in a simi- coordinates of these neighbors are separated by a vector of d+1 lar manner to other proposal-free methods but predict the form ± [−1,..., −1, d, −1,... , −1] ∈ Z . embeddings using the lattice convolutions. The vertices of the permutohedral lattice are stored in a sparse manner using a hash map in which the key is the coordinate c and the value is x . Hence, we only allocate v v 3 Notation the simplices that contain the 3D surface of interest. This sparse allocation allows for efficient implementation of all Throughout this paper, we use bold upper-case characters typical operations in CNNs (convolution, pooling, transposed to denote matrices and bold lower-case characters to denote convolution, etc.). vectors. The permutohedral lattice has several advantages w.r.t. stan- The vertices of the d-dimensional permutohedral lattice dard cubic voxels. The number of vertices for each simplex is (d+1) are defined as a tuple v = (c , x ), with c ∈ Z denoting given by d + 1 which scales linearly with increasing dimen- v v v v d the coordinates of the vertex and x ∈ R representing the sion, in contrast to the 2 for standard voxels. This small values stored at vertex v. The full lattice containing n vertices number of vertices per simplex allows for fast splatting and n×(d+1) is denoted with V = (C, X), with C ∈ Z representing slicing operations. Furthermore, splatting and slicing create n×v piece-wise linear outputs as they use barycentric interpola- the coordinate matrix and X ∈ R the value matrix. The points in a cloud are defined as a tuple p = g , f , tion. In contrast, standard quantization in cubic voxels create p p ∈ R denoting the coordinates of the point and piece-wise constant outputs, leading to discretization arte- with g f ∈ R representing the features stored at point p (color, facts. normals, etc.). The full point cloud containing m points is Spatial correspondences between lattice vertices are given m×d denoted by P = (G, F) with G ∈ R being the positions by design and the hashmap: If the hashmap stays the same m× f matrix and F ∈ R the feature matrix. The feature matrix for the whole sequence, spatially identical lattice vertices of F can also be empty in which case f is set to zero. different point clouds are always mapped to the same entries. For motion segmentation we define a sequence of point This is visualized in Fig. 9 where features from two different clouds as P = ( P , P ,..., P ) with P = (G, F).We time-steps are fused together. seq 0 1 n n define a timestep as processing one cloud of this sequence. We denote with I the set of lattice vertices of the simplex that contains point p.Theset I always contains d+1 vertices 5 Method as the lattice tessellates the space in uniform simplices with d + 1 vertices each. Furthermore, we denote with J the The input to our method is a point cloud P = (G, F) con- set of points p for which vertex v is one of the vertices of taining coordinates and per-point features. the containing simplices. Hence, these are the points that We define the scale of the lattice by scaling the positions contribute to vertex v through the splat operation. σ σ G as G = G/σ σ , where σ σ ∈ R is the scaling factor. The We denote with S the splatting operation, with Y the slic- higher the sigma the less number of vertices will be needed ing operation, with Y the deformable slicing, with P the to cover the point cloud and the coarser the lattice will be. PointNet module, with D and D the distribution of the G F For ease of notation, unless otherwise specified, we refer to point positions and the points features, respectively, and with G as G as we usually only need the scaled version. G the gathering operation. 5.1 Common operations on permutohedral lattice 4 Permutohedral lattice In this section, we will explain in detail the standard oper- The d-dimensional permutohedral lattice is formed by pro- ations on a permutohedral lattice that are used in previous d+1 jecting the scaled regular grid (d + 1)Z along the vector works (Su et al. 2018;Guetal. 2019). 1 = [1,..., 1] onto the hyperplane H : p · 1 = 0. Splatting refers to the interpolation of point features onto the The lattice tessellates the space into uniform d-dimensional values of the lattice V using barycentric weighting (Fig. 3a). simplices. Hence, for d = 2 the space is tessellated with tri- Each point splats onto d +1 lattice vertices and their weighted angles and for d = 3 into tetrahedra. The enclosing simplex features are summed onto the vertices. of any point can be found by a simple rounding algorithm Convolving operates analogously to standard spatial convo- (Baek and Adams 2009). lutions in 2D or 3D, i.e. a weighted sum of the vertex values Due to the scaling and projection of the regular grid, the together with its neighbors is computed. We use convolu- coordinates c of each lattice vertex sum up to zero. Each tions that span over the 1-hop ring around a vertex and hence vertex has 2(d + 1) immediate neighboring vertices. The convolve the values of 2(d + 1) + 1 vertices (Fig. 2). 123 Autonomous Robots Fig. 2 Convolution: The neighboring vertices of a lattice are convolved (a) Splat (b) Distribute similarly to standard 2D convolutions. If a neighbor is not allocated in the sparse structure, we assume that it has a value of zero Fig. 3 Splat and Distribute operations: Splatting uses barycentric weighting to add the features of points onto neighboring vertices. The naïve summation can be detrimental to the network as splatting acts as Slicing is the inverse operation to splatting. The vertex val- a Gaussian filter. Distributing stores all the features of the contributing points, causing no loss of information and allows further processing by ues of the lattice are interpolated back for each position with the network the same weights used during splatting. The weighted con- tributions from the simplexes d + 1 vertices are summed up (Fig. 5a). D = D ( P, V ) ={ g − µ | p ∈ J }, (3) v G p v v D = D ( P, V ) ={ f | p ∈ J }, (4) v F p v 5.2 Proposed operations on permutohedral lattice µ = g , (5) v p | J | The operations defined in Sect. 5.1 are typically used in a cas- p∈ J cade of splat-conv-slice to obtain dense predictions (Su et al. 2018). However, splatting and slicing act as Gaussian kernel | J |×d | J |× f v v d where D ∈ R and D ∈ R are matrices con- v v g f low-pass filtering on encoded information (Baek and Adams taining the distributed coordinates and features, respectively, 2009). Their repeated usage at every layer is detrimental to for the contributing points into a vertex v. The matrices are the accuracy of the network. Additionally, splatting acts as a concatenated and processed by a PointNet P to obtain the weighted average on the feature vectors where the weights are final vertex value x .Fig. 3 illustrates the difference between only determined through barycentric interpolation. Includ- splatting and distributing. ing the weights as trainable parameter allows the network to Note that we use a different distribute function for coordi- decide on a better interpolation scheme. Furthermore, as the nates then for point features. For coordinates, we subtract the network grows deeper and feature vectors become higher- mean of the contributing coordinates. The intuition behind dimensional, slicing consumes increasingly more memory, this is that coordinates by themselves are not very informa- as it assigns the features to the points. Since in most cases tive w.r.t. the potential semantic class. However, the local | P||V |, it is more efficient to store the features only in distribution is more informative as it gives a notion of the the lattice vertices. geometry. To address these limitations, we propose four new opera- Downsampling refers to a coarsening of the lattice, by reduc- tors on the permutohedral lattice which are more suitable for ing the number of vertices. This allows the network to CNNs and dense prediction tasks. capture more contextual information. Downsampling con- Distribute is defined as the list of features that each lattice sists of two steps: creation of a coarse lattice and obtaining vertex receives. However, they are not summed as done by its values. Coarse lattices are created by repeatedly divid- splatting: ing the point cloud positions by 2 and using them to create new lattice vertices (Barron et al. 2015). The values of x = S( P, V ) = b f , (1) v pv p the coarse lattice are obtained by convolving over the finer p∈ J v lattice from the previous level (Fig. 4). Hence, we must embed the coarse lattice inside the finer one by scaling where x is the value of lattice vertex v and b is the barycen- v pv the coarse vertices by 2. Afterwards, the neighbors vertices tric weight between point p and lattice vertex v. over which we convolve are separated by a vector of form d+1 Instead, our distribute operators D and D concatenate G F ± [−1,..., −1, d, −1,... , −1] ∈ Z . The downsam- coordinates and features of the contributing points: pling operation effectively performs a strided convolution. Upsampling follows a similar reasoning. The fine vertices x = P(D ; D ), (2) need first to be embedded in the coarse lattice using a v v v g f 123 Autonomous Robots (a) Slice (b) DeformSlice Fig. 4 Coarsen: Downsampling of the lattice is performed by embed- Fig. 5 Slice and DeformSlice: Slicing barycentrically interpolates the ding the coarse lattice in the finer one and convolving over the neighbors. vertex values back onto a point. DeformSlice allows for the network to This effectively performs a strided convolution. Transposed convolution directly affect the interpolated value by learning offsets of the barycen- is performed in an analogous manner by embedding a fine lattice into a tric coordinates coarse one (d+1) (d+1)v division by 2. Afterwards, the neighboring vertices over b and q as vectors in R and R , respectively, p p which we convolve are separated by a vector of form and cast the prediction of offsets as a fully connected layer ± [−0.5,..., −0.5, d/2, −0.5,... , −0.5]. The careful reader followed by a non-linearity: will notice that in this case, the coordinates of the neigh- boring vertices may not be integer anymore; they may have b = F (q ) = σ(q · W + b). (10) p p p a fractional part and will, therefore, lie in the middle of a coarser simplex. In this case we ignore the contribution of However, this prediction has the disadvantage of not being this neighboring vertices and only take the contribution of permutation equivariant; therefore, permutation of the ver- the center vertex. The upsampling operation effectively per- tices would not imply the same permutation in the barycentric forms a transposed convolution. offsets: DeformSlicing While the slicing operation Y barycentrically interpolates the values back to the points by using barycentric coordinates: F (π q ) = π F (q ), (11) p p f = Y( P, V ) = b x , (6) p pv v where π is the set of all permutations of the d + 1 vertices. v∈I It is important for our prediction to be permutation equiv- ariant because the vertices may be arranged in any order and we propose the DeformSlicing Y which allows the network the barycentric offsets need to keep a consistent preference to directly modify the barycentric coordinates and shift the towards a certain vertexes’ features, regardless of its position position within the simplex for data-dependent interpolation: within a simplex. In order for the prediction of the offsets to be consis- f = Y( P, V ) = (b + b )x . (7) p pv pv v tent with permutations of the vertices, we take inspiration v∈I from the work of Ravanbakhsh et al. (2016) and Zaheer et al. (2017) of equivariant layers and design F as: Here, b are offsets that are applied to the original pv barycentric coordinates. A parallel branch within our net- b = σ(b + (b x − max{b x }) · W), (12) work first gathers the values from all the vertices in a simplex pv pv v pd d d∈I and regresses the b : pv b = F (q ) ={ b | v ∈ I }, (13) p p pv p q = G( P, V ) ={ b x | v ∈ I }, (8) p pv v p v ×1 where W ∈ R is a weight matrix and b ∈ R corre- b = F (q ), (9) p p sponds to a scalar bias. In other words, we subtract from where q is a set containing the weighted values of all each weighted vertex the maximum of the weighted values the vertices of the simplex containing p and the prediction of all the other vertices in the simplex. Since the max opera- b ={ b | v ∈ I } is a set of offsets to the barycentric tion is invariant to permutations of the input, the regression p pv p coordinates towards the d + 1 vertices. With a slight abuse of the offsets is equivariant to permutations of the vertices. of notation—due to the fact that the vertices of a simplex are The difference between the slicing and our DeformSlicing always enumerated in a consistent manner, we can regard is visualized in Fig. 5 123 Autonomous Robots 6 Segmentation methods vector for point p and μ as the mean or cluster center for i c cluster c.The δ and δ are the margins for the variance and v d Due to the flexibility of LatticeNet various segmentation distance loss respectively. We set α = β = 1 and γ = 0.001 methods can be implemented. In this section, we detail the A visualization of the pipeline for instance segmentation methods used for each one. can be seen in Fig. 6. 6.3 Motion segmentation 6.1 Semantic segmentation Motion segmentation distinguishes between dynamic and Semantic segmentation uses the default U-Net architecture static objects within a point cloud. For this, the network needs described in the Network Architecture section. It is trained temporal information. We extend the original LatticeNet U- with an equal part combination of cross entropy loss and Net architecture with a recursive architecture that can process Lovász loss (Berman et al. 2018). The Lovász loss acts as a a sequence of point clouds P at times t , t − 1,..., t − n surrogate for the intersection-over-union score and is espe- seq and learn to distinguish for example between a moving car cially useful for dealing with class imbalance. and a parked car. The dynamic objects are considered as additional classes. 6.2 Instance segmentation Hence, we use the same loss as in the case of semantic seg- mentation. We also explore multiple ways to perform the Our instance segmentation network follows the work of fusion of temporal information which we detail in the Net- other proposal-free methods like (De Brabandere et al. work Architecture section. 2017). We use LatticeNet to predict for each 3D point p in the point cloud an embedding x . A discriminative loss encourages closeness in embeddings space for points of the 7 Network architecture same instance while promoting distance between different instances. Finally, we apply mean-shift clustering on the Input to our network is a point cloud P which may contain points in embeddings space. Points belonging to the same per-point features stored in F. The output is class probabilities cluster are defined as an Instances. for each point p. In the recurrent network the input is an This discriminative loss can be expressed with three terms: ordered set of point clouds P and the output are class seq probabilities for the last point cloud of the sequence. Moving – Variance term: The intra-cluster pull force that draws the and static objects are considered as different semantic classes. embeddings towards the mean embedding. Our network architecture has a U-Net structure (Ron- – Distance term: An inter-cluster push force that forces the neberger et al. 2015) and is visualized in Fig. 7 together with clusters to be far apart from each other in embedding the used individual blocks. space. The first layers distribute the point features onto the – Regularization term: A small force that pulls the cluster lattice and use a PointNet to obtain local features. After- centers towards the origin in order to keep the activations wards, a series of ResNet blocks (He et al. 2016a), followed bounded. by repeated downsampling, aggregates global context. The decoder branch mirrors the encoder architecture and upsam- The full loss is then defined as: ples through transposed convolutions. Finally, a DeformSlic- ing propagates lattice features onto the original point cloud. C N 1 1 Skip connections are added by concatenating the encoder L = [μ − x  − δ ] (14) var c i v C N feature maps with matching decoder features. c=1 i =1 C C 7.1 Temporal fusion L = 2δ −μ − μ  (15) dist d c c A B C (C − 1) c =1 c =1 A B c =c A B Incorporating temporal information for motion prediction over a sequence of point clouds relies on fusing information L = μ  (16) reg c between multiple time-steps. For this purpose, the feature c=1 vectors of the timesteps t − 1 and t are passed through a L = α · L + β · L + γ · L (17) Temporal Fusion block, as shown in Fig. 8. This fusion con- var dist reg sists of a concatenation of both feature vectors and a linear We define C as the number of clusters in the ground truth, layer followed by a non-linearity (Fig. 9). Each new time- N as the number of elements in cluster c, x as the embedding step allocates additional vertices in the lattice corresponding c i 123 Autonomous Robots Fig. 6 Instance segmentation: LatticeNet takes raw point clouds as input and embeds them into a sparse lattice where convolutions are applied. Features on the lattice are projected onto a 2D space where clustering is performed. The clusters define the instances of each object type in the original cloud to newly explored areas in the map. For correct fusion, the features from the previous time-step need to be zero-padded so that the sizes match. Additionally, we performed experiments with a single Temporal Fusion block in the network and max-pooling over both feature vectors instead of the linear layer, but found that three Temporal Fusion blocks achieved overall superior results (Fig. 10). It should be noted that our approach for temporal fusion relies on a sequence of clouds that are transformed into a common coordinate frame. The required scan poses for trans- formation can be obtained e.g. from GPS or SLAM. 8 Implementation Our lattice is stored sparsely on a hash map structure, which allows for fast access of neighboring vertices. Unlike (Su et al. 2018), we construct the hash map directly on the GPU, saving us from incurring an expensive CPU to GPU memory copy. For memory savings, we implemented the DeformSlice Fig. 7 Architecture: Our model follows a U-Net structure. For ease of and the last linear classification layer in one fused operation, representation, blocks which are repeated one after another are indicated avoiding the storage of high-dimensional feature vectors for with a multiplier on the right side of the operation each point in the point cloud. All of the lattice operators containing forwards and back- wards passes are implemented on the GPU and exposed to We share the PyTorch implementation of LatticeNet at PyTorch (Paszke et al. 2017). https://github.com/AIS-Bonn/lattice_net. Following recent works (He et al. 2016b; Huang et al. 2017), all convolutions are pre-activated using Group Nor- malization (Wu and He 2018) and a ReLU unit. We chose 9 Experiments Group Normalization instead of the standard batch normal- ization due to greater stability for small batch sizes. We use We evaluate our proposed lattice network on four differ- the default of 32 groups. ent datasets: ShapeNet (Yi et al. 2016), ScanNet (Dai et al. The models were trained using the Adam optimizer with 2017), SemanticKITTI (Behley et al. 2019) and Pheno4D a learning rate of 0.001 and a weight decay of 10 4. The (https://www.ipb.uni-bonn.de/data/pheno4d/). For the task learning rate was reduced by a factor of 10 when the loss of semantic segmentation and motion segmentation we plateaued. report the mean Intersection-over-Union (mIoU). For the 123 Autonomous Robots color jitter. A video with additional footage of the experi- ments is available online . 9.1 Evaluation of segmentation accuracy ShapeNet part segmentation is a subset of the ShapeNet dataset (Yi et al. 2016) which contains objects from 16 dif- ferent categories each segmented into 2–6 parts. The dataset consists of points sampled from the surface of the objects, together with the ground truth label of the corresponding object part. The objects have an average of 2613 points. We train and evaluate our network on each object individually. We use the official train/test splits as defined by the dataset containing a total of 12 137 training objects and 2874 test objects. The results for our and five competing methods are gathered in Table 1 and visualized in Fig. 11. We observe that for some classes, we obtain state-of-the- art performance and for other objects, the IoU is slightly lower than for other approaches. We ascribe this to the fact that training one fixed architecture size for each individual object is suboptimal as some objects like the ”cap” have as few as 55 examples while others like the table have more than 5K. This causes the network to be prone to overfitting on the easy object or underfitting on the difficult ones. A fair evalua- tion would require finding an architecture that performs well for all objects on average. However, due to various issues with mislabeled ground truths (Su et al. 2018) we deem that Fig. 8 Recurrent architecture: The features from previous time-steps are fused in the current time-step at multiple levels of the network. This experimentation with more architectures or with different allows the network to distinguish dynamic objects from static ones regularization strengths for individual objects would overfit the dataset. ScanNet 3D segmentation Daietal. (2017) consists of 3D reconstructions of real rooms. It contains ≈ 1500 rooms seg- mented into 20 classes (bed, furniture, wall, etc.). The rooms have between 9K and 537K points—on average 145K. We segment an entire room at once without cropping. We use the official train/test splits as defined by the dataset containing a total of 1201 training rooms and 100 test objects. Results are gathered in Table 2 and visualized in Fig. 12. We obtain an IoU of 64.0 which is significantly higher than the most similar Fig. 9 Temporal fusion: The features from the previous time-step are related work of SplatNet. It is to be noted that MinkowskiNet zero-padded in order to account for the new vertices that were allocated at the current time-step. The features are afterwards concatenated and achieves a higher IoU but at the expense of an extremely high passed through a linear layer followed by a non-linearity spatial resolution of 2 cm per voxel. In contrast, our approach allocates lattice vertices so that each vertex covers approxi- mately 30 points. On this dataset, this corresponds to a spatial task of instance segmentation, we report the Symmetric Best extent of approximately 10 cm. Dice (SBD) (De Brabandere et al. 2017). SBD measures the SemanticKITTI Behley et al. (2019) consists of semanti- accuracy of the instance segmentation by averaging for each cally annotated LiDAR scans of real urban environments. input label the ground truth label yielding the maximum Dice The annotation covers a total of 19 classes for single scan score. evaluation and a total of 25 classes for multiple scan evalua- We use a shallow model for ShapeNet and Pheno4D and a tion. Each scan contains between 82K and 129K points. We deeper model for ScanNet and SemanticKITTI as the datasets process each scan entirely without any cropping. We use the are larger. We augment all data using random mirroring and translations in space. For ScanNet, we also apply random http://www.ais.uni-bonn.de/videos/RSS_2020_Rosu/. 123 Autonomous Robots For motion segmentation we take as input three point clouds at consecutive time steps and output the segmentation for the final, most recent cloud. We overlap this time window so that every clouds gets to be segmented. For the first few clouds, the time window is reduced as there are no clouds from previous time-steps to give as input. The results for the motion segmentation are provided in Table 4 and visualized in Fig. 14. We observe that for motion segmentation we outperform other approaches except for KPConv (Thomas et al. 2019), Fig. 10 Bonn Activity Maps segmentations. Colored meshes are recon- which has higher IoU. However, it is to be noted that KPconv structed from KinectV2 data using volumetric integration (Nießner et al. cannot process a full point cloud at once due to memory 2013; Stotko et al. 2019) and semantically segmented using LatticeNet. constraints and rather processes sub-clouds centered around Color coding of semantic labels corresponds to the ScanNet dataset (Dai random spheres in the scene. The spheres are chosen ran- et al. 2017) domly in the scene to ensure each point is tested multiple times by different sphere locations. Finally, a voting scheme gives the final prediction. In contrast, our approach can process a full point cloud without requiring neighborhood searching or partitioning in sub-clouds. Bonn Activity Maps (Tanke et al. 2019) is a dataset for human tracking, activity recognition and anticipation of multiple persons. It contains annotations of persons, their trajectories and activities. The 3D reconstruction of the four kitchen sce- narios is however of more interest to us. The environments are reconstructed as 3D colored meshes and have no ground truth semantic annotations. We trained our LatticeNet on the ScanNet dataset and evaluate it on the 4 kitchens in order Fig. 11 ShapeNet (Yi et al. 2016) results of our method to provide an annotation for each vertex of the mesh. The results are shown in Fig. 10. We can observe that our network generalizes well to unseen datasets, recorded with different sensors and with different noise properties as the seman- tic segmentations look plausible and exhibit sharp borders between classes. Pheno4D https://www.ipb.uni-bonn.de/data/pheno4d/ is a spatio-temporal dataset of point clouds of maize and tomato plants with instance annotations of leaves. We use a shallow version of LatticeNet to compute per-point embeddings and cluster them using mean-shift to recover the instances. We Fig. 12 ScanNet results. The left image shows the ground truth and the compare with PointNet and PointNet++ as they are popu- right one our prediction lar methods for computing per-point embeddings. Since the dataset contains 7 maize and 7 tomato plants, we train on the official train/validation splits as defined by the dataset. The first 5 plants for each type and test on the remaining two. The test set is not publicly available and testing can only be done results are gathered in Table 5. We observe that our method through the benchmark server. is capable of computing more meaningful embeddings that The results for single scan are provided in Table 3 and create more distinctive clusters between each plant organ. visualized in Fig. 13. Our LatticeNet outperforms all other methods—in case of the most similar SplatNet by more than a factor of two. It is to be noted that DarkNet53Seg (Behley 9.2 Ablation studies et al. 2019), DarkNet21Seg (Behley et al. 2019) and Squeeze- SegV2 (Wu et al. 2018) are methods that operate on a 2D We perform various ablations regarding our contribution to image by wrapping the LiDAR scans to 2D using spherical judge how much they affect the network’s performance. coordinates. In contrast, our method can operate on general DeformSlice We assess the impact that DeformSlice has on point clouds, directly in 3D. the network by comparing it with the Slice operator which 123 Autonomous Robots Fig. 13 SemanticKITTI results. We compare the prediction from our number of points. Additionally, the network also effectively makes use LatticeNet with the results from TangentConv (Tatarchenko et al. 2018) of contextual information in order to correctly predict the parking place and SplatNet (Su et al. 2018). We can observe that our approach can due to the existence of nearby cars better learn small objects like tree trunks, despite their relatively small Table 1 Results on ShapeNet part segmentation (Yi et al. 2016) #Instances 2690 76 55 898 3758 69 787 392 1547 451 202 184 283 66 152 5271 Instance Air- Bag Cap Car Chair Ear- Guitar Knife Lamp Laptop Motor- Mug Pistol Rocket Skate- Table avg. plane phone bike board PointNet (Qi et al. 2017a) 83.7 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6 PointNet++ (Qi et al. 2017b) 85.1 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6 SplatNet 3D (Su et al. 2018) 84.6 81.9 83.9 88.6 79.5 90.1 73.5 91.3 84.7 84.5 96.3 69.7 95.0 81.7 59.2 70.4 81.3 SplatNet 2D-3D (Su et al. 2018) 85.4 83.2 84.3 89.1 80.3 90.7 75.5 92.1 87.1 83.9 96.3 75.6 95.8 83.8 64.0 75.5 81.8 FCPN (Rethage et al. 2018) 84.0 84.0 82.8 86.4 88.3 83.3 73.6 93.4 87.4 77.4 97.7 81.4 95.8 87.7 68.4 83.6 73.4 Ours 83.9 82.3 84.8 79.1 81.0 86.9 71.0 91.9 89.4 84.7 96.6 77.2 95.8 86.0 70.5 79.3 87.0 Bold signifies the highest mean intersection-over-union (mIoU) Table 2 Results on ScanNet (Dai et al. 2017) We also evaluate a version of DeformSlice which ensures that the new barycentric coordinates still sum up to one by Method mIOU adding an additional loss term: PointNet++ (Qi et al. 2017b) 33.9 ⎛ ⎞ SplatNet (Su et al. 2018) 39.3 ⎝ ⎠ TangetConv (Tatarchenko et al. 2018) 43.8 L = b . (18) pv | P| 3DMV (Dai and Nießner 2018) 48.4 p∈ P v∈I MinkowskiNet42 (5cm) (Choy et al. 2019) 67.9 † However, we observe little change after adding this reg- SparseConvNet (Graham et al. 2018) 72.5 ularization term and hence, use the default version of MinkowskiNet42 (2cm) (Choy et al. 2019) 73.4 DeformSlice for the rest of the experiments. The results are Ours 64.0 gathered in Table 6. Bold signifies the highest mean intersection-over-union (mIoU) Distribute and PointNet Another contribution of our work is the usage of a Distribute operator to provide values to the lattice vertices which are later embedded in a higher- does not use learned barycentric interpolation. We evaluate dimensional space by a PointNet-like architecture. The this on SemanticKITTI, the largest dataset that we are using. positions and features of the point cloud are treated sep- 123 Autonomous Robots Table 3 Results on SemanticKITTI (Behley et al. 2019) Approach mIoU Road Sidewalk Parking Other- Building Car Truck Bicycle Motorcycle Other- Vegetation Trunk Terrain Person Bicyclist Motor- Fence Pole Traffic ground vehicle cyclist sign PointNet (Qietal. 14.6 61.6 35.7 15.8 1.4 41.4 46.3 0.1 1.3 0.3 0.8 31.0 4.6 17.6 0.2 0.2 0.0 12.9 2.4 3.7 2017a) SplatNet (Su et al. 18.4 64.6 39.1 0.4 0.0 58.3 58.2 0.0 0.0 0.0 0.0 71.1 9.9 19.3 0.0 0.0 0.0 23.1 5.6 0.0 2018) PointNet++ (Qi 20.1 72.0 41.8 18.7 5.6 62.3 53.7 0.9 1.9 0.2 0.2 46.5 13.8 30.0 0.9 1.0 0.0 16.9 6.0 8.9 et al. 2017b) Minkowski34(25cm) 33.0 80.8 43.0 36.9 0.5 73.5 83.0 42.9 2.0 2.9 7.8 74.4 42.9 36.7 11.2 22.8 4.4 37.2 35.4 28.6 (Choy et al. 2019) SqueezeSegV2 39.7 88.6 67.6 45.8 17.7 73.7 81.8 13.4 18.5 17.9 14.0 71.8 35.8 60.2 20.1 25.1 3.9 41.1 20.2 36.3 (Wu et al. 2018) TangentConv 40.9 83.9 63.9 33.4 15.4 83.4 90.8 15.2 2.7 16.5 12.1 79.5 49.3 58.1 23.0 28.4 8.1 49.0 35.8 28.5 (Tatarchenko et al. 2018) DarkNet21Seg 47.4 91.4 74.0 57.0 26.4 81.9 85.4 18.6 26.2 26.5 15.6 77.6 48.4 63.6 31.8 33.6 4.0 52.3 36.0 50.0 (Behley et al. 2019) DarkNet53Seg 49.9 91.8 74.6 64.8 27.9 84.1 86.4 25.5 24.5 32.7 22.6 78.3 50.1 64.0 36.2 33.6 4.7 55.0 38.9 52.2 (Behley et al. 2019) Ours 52.9 90.0 74.1 59.4 22.0 88.2 92.9 26.6 16.6 22.2 21.4 81.7 63.6 63.1 35.6 43.0 46.0 58.8 51.9 48.4 Bold signifies the highest mean intersection-over-union (mIoU) Autonomous Robots Table 4 Motion segmentation IoU results on SemanticKITTI (Behley et al. 2019) using a sequence of multiple past scans (in %) Approach 84.9 21.1 18.5 1.6 0.0 0.0 TangentConv [41] 34.1 40.3 42.2 30.1 6.4 1.1 1.9 84.1 20.0 20.7 7.5 0.0 0.0 DarkNet53Seg [4] 41.6 61.5 37.8 28.9 15.2 14.1 0.2 88.5 29.2 22.7 6.3 0.0 0.0 SpSequenceNet [37] 43.1 53.2 0.1 2.3 26.2 41.2 36.2 93.7 70.3 38.6 21.6 0.0 0.0 KPConv [43] 51.2 69.4 5.8 4.7 67.5 67.4 47.2 91.1 65.4 23.1 6.8 0.0 0.0 Ours 45.2 54.8 3.5 0.6 49.9 44.6 64.3 Shaded cells correspond to the IoU of the moving classes, while unshaded entries are the non-moving classes Table 5 Instance segmentation performance on the maize and tomato plants of the Pheno4D dataset SBD Maize Tomato PointNet (Qi et al. 2017a) 69.7 47.3 PointNet++ (Qi et al. 2017b) 74.8 56.1 LatticeNet (ours) 80.6 74.2 Bold signifies the highest mean intersection-over-union (mIoU) Table 6 Ablation study of the various components of LatticeNet. Var- ious features are disabled (indicated in red) and the impact to the IoU Fig. 14 Motion segmentation results on SemanticKITTI. The moving is evaluated car on the road (red) is correctly distinguished from the parked car (orange) (Color figure online) arately where the features (normals, color) are distributed directly. From the positions, we substract the locally aver- aged position as we assume that the local point distribution is more important than the coordinates in the global reference frame. We evaluate the impact of elevating the point features to a higher-dimensional space and subtracting the local mean against a simple splatting operator which just averages the features of the points around each corresponding vertex. We observe that not subtracting the local mean, and just using the xyz coordinates as features, heavily degrades the performance, causing the mIoU to drop from 52.9 to 43.0. Finally, naive application of the splat operation performs This further reinforces the idea that the local point distribu- worst with a mere 37.8 mIoU. tion is a good local feature to use in the first layers of the network. Not elevating the point cloud features to a higher- 9.3 Performance dimensional space before applying the max-pool operation also hurts performance but not as severely. In our experi- We report the time taken for a forward pass and the maximum ments, we elevate the features to 64 dimensions by using a memory used in our shallow and deep network on the first series of fully connected layers. three evaluated datasets. The performance was measured on car truck other-vehicle person bicyclist motorcyclist mIoU Autonomous Robots Table 7 Average time used by ShapeNet ScanNet SemanticKITTI the forward pass and the [ms] [GB] [ms] [GB] [ms] [GB] maximum memory used during training. An X indicates a SplatNet 129 0.6 XX 2931 8.9 method that failed to process the Ours 49 0.5 180 6.5 143 3.5 whole cloud due to memory limitations Bold signifies the highest mean intersection-over-union (mIoU) a NVIDIA Titan X Pascal and the results are gathered in right holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/. Table 7. In the case of motion segmentation, the inference times and memory used are the same as in the case of a single scan, as we use the same backbone network to extract features References and the computational cost of fusing the temporal informa- A large scale spatio-temporal dataset of point clouds of maize tion is minimum. However for training, the network requires and tomato plants. https://www.ipb.uni-bonn.de/data/pheno4d/. more memory with increasing time window due to the back- Accessed: 2021-01-1. propagation through time. This scales linearly with the time Baek, J., & Adams, A. (2009). Some useful properties of the permuto- window size and the amount of points in the cloud. hedral lattice for Gaussian filtering. Other Words 10(1). Barron, J.T., Adams, A., YiChang, S., & Hernández, C. (2015). Fast Despite the reduced memory usage compared to SplatNet bilateral-space stereo for synthetic defocus—Supplemental mate- and increased speed of execution, there are still memory sav- rial. In Proceedings of the IEEE Conference on Computer Vision ings possible by fusing the Distribute and PointNet operators and Pattern Recognition (CVPR), pp. 1–15. into one GPU operation. This is similar to fusing our Deform- Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019) SemanticKITTI: A Dataset for Semantic Slice and the classification layer. Additionally, we expect the Scene Understanding of LiDAR Sequences. In Proceedings of the network to become even faster as further advances on highly IEEE International Conference on Computer Vision (ICCV). optimized kernels for convolution on sparse lattices become Berman, M., Triki, A.R., & Blaschko, M.B. (2018). The Lovász- available. At the moment, the convolutions are performed by softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceed- our custom CUDA kernels. Tighter integration however with ings of the IEEE Conference on Computer Vision and Pattern highly optimized libraries like cuDNN (Chetlur et al. 2014) Recognition (CVPR), pp. 4413–4421. could be beneficial. Chen, L-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethink- ing atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Chen, W., Han, X., Li, G., Chen, C., Xing, J., Zhao, Y., & Li, H. (2018). Deep RBFNet: Point cloud feature learning using radial basis func- 10 Conclusion tions. arXiv preprint arXiv:1812.04302. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catan- zaro, B., & Shelhamer, E. (2014). cuDNN: Efficient primitives for We presented LatticeNet, a novel method for point cloud deep learning. arXiv preprint arXiv:1410.0759. segmentation. A sparse permutohedral lattice allows us to Choy, C., Gwak, J., & Savarese, S. (2019). 4D Spatio-Temporal Con- efficiently process large point clouds. The usage of PointNet vNets: Minkowski Convolutional Neural Networks. arXiv preprint together with a data-dependent interpolation alleviates the arXiv:1904.08755. Dai, A., & Nießner, M. (2018). 3DMV: Joint 3D-multi-view predic- quantization issues of other methods. Experiments on four tion for 3D semantic scene segmentation. In Proceedings of the datasets show state-of-the-art results, at a reduced time and European Conference on Computer Vision (ECCV), pp. 452–468. memory budget. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D reconstruc- Funding Open Access funding enabled and organized by Projekt tions of indoor scenes. In Proceedings of the IEEE Conference DEAL. on Computer Vision and Pattern Recognition (CVPR), pp. 5828– Open Access This article is licensed under a Creative Commons De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic Attribution 4.0 International License, which permits use, sharing, adap- instance segmentation with a discriminative loss function. arXiv tation, distribution and reproduction in any medium or format, as preprint arXiv:1708.02551. long as you give appropriate credit to the original author(s) and the Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional source, provide a link to the Creative Commons licence, and indi- neural networks on graphs with fast localized spectral filtering. In cate if changes were made. The images or other third party material Proceedings of the Advances in Neural Information Processing in this article are included in the article’s Creative Commons licence, Systems (NIPS), pp. 3844–3852. unless indicated otherwise in a credit line to the material. If material Graham, B., Engelcke, M., & van der Maaten, L. (2018). 3D seman- is not included in the article’s Creative Commons licence and your tic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copy- Pattern Recognition (CVPR), pp. 9224–9232. 123 Autonomous Robots Gu, X., Wang, Y., Wu, C., Lee, Y.J., & Wang, P. (2019). HPLFlowNet: Qi, C.R., Yi, L., Su, H., & Guibas, L.J. (2017b). PointNet++: Deep Hierarchical Permutohedral Lattice FlowNet for Scene Flow Esti- hierarchical feature learning on point sets in a metric space. In mation on Large-scale Point Clouds. In Proceedings of the IEEE Proc. of the Advances in Neural Information Processing Systems Conference on Computer Vision and Pattern Recognition (CVPR), (NIPS), pp. 5099–5108. pp 3254–3263. Qi, C.R., Litany, O., He, K., & Guibas, L.J. (2019) Deep Hough voting He, K., Zhang, X., Ren, S., & Sun, J. (2016a) Deep residual learning for 3D object detection in point clouds. In Proceedings of the IEEE for image recognition. In Proceedings of the IEEE Conference on International Conference on Computer Vision (ICCV), pp. 9277– Computer Vision and Pattern Recognition (CVPR), pp. 770–778. 9286. He, K., Zhang, X., Ren, S., & Sun, J. (2016b) Identity mappings in deep Ravanbakhsh, S., Schneider, J.G., & Póczos, B. (2016). Deep Learning residual networks. In Proceedings of the European Conference on with Sets and Point Clouds. arXiv preprint arXiv:1611.04500. Computer Vision (ECCV), pp. 630–645. Rethage, D., Wald, J., Sturm, J., Navab, N., & Tombari, F. (2018). Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q. (2017). Fully-convolutional point networks for large-scale point clouds. Densely connected convolutional networks. In Proceedings of the In Proceedings of the European Conference on Computer Vision IEEE Conference on Computer Vision and Pattern Recognition (ECCV), pp. 596–611. (CVPR), pp. 4700–4708. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Huang, J., Zhang, H., Yi, L., Funkhouser, T., Nießner, M., & Guibas, L.J. networks for biomedical image segmentation. In International (2019). TextureNet: Consistent local parametrizations for learn- Conference on Medical Image Computing and Computer-assisted ing from high-resolution signals on meshes. In Proceedings of the Intervention, pp. 234–241. IEEE Conference on Computer Vision and Pattern Recognition Rosu, R.A., Schütt, P., Quenzel, J., & Behnke, S. (2020). LatticeNet: (CVPR), pp. 4440–4449. Fast point cloud segmentation using permutohedral lattices. Pro- Li, Y., Bu, R., Sun, M., Wu, W., Di, X., & Chen, B. (2018). PointCNN: ceedings of Robotics: Science and Systems. Convolution on x-transformed points. In Proceedings of the Shi, H., Lin, G., Wang, H., Hung, T-Y., & Wang, Z. (2020). SpSe- Advances in Neural Information Processing Systems (NIPS), pp. quenceNet: Semantic Segmentation Network on 4D Point Clouds. 820–830. In Proceedings of the IEEE Conference on Computer Vision and Lin, G., Milan, A., Shen, C., & Reid, I. (2017). RefineNet: Multi-path Pattern Recognition (CVPR), pp. 4574–4583. refinement networks for high-resolution semantic segmentation. Stotko, P., Krumpen, S., Weinmann, M., & Klein, R. (2019). Efficient In Proceedings of the IEEE Conference on Computer Vision and 3D Reconstruction and Streaming for Group-Scale Multi-Client Pattern Recognition (CVPR), pp. 1925–1934. Live Telepresence. In Proceedings of the IEEE International Sym- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional net- posium on Mixed and Augmented Reality (ISMAR), pp. 19–25. works for semantic segmentation. In Proceedings of the IEEE Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M-H., & Conference on Computer Vision and Pattern Recognition (CVPR), Kautz, J. (2018). SplatNet: Sparse lattice networks for point cloud pp. 3431–3440. processing. In Proceedings of the IEEE Conference on Computer Masci, J., Boscaini, D., Bronstein, M., & Vandergheynst, P. (2015). Vision and Pattern Recognition (CVPR), pp. 2530–2539. Geodesic convolutional neural networks on Riemannian mani- Tanke, J., Kwon, O-H., Stotko, P., Rosu, R.A., Weinmann, M., Errami, folds. In Workshop Proceedings of the IEEE International Con- H., Behnke, S., Bennewitz, M., Klein, R., Weber, A., et al. ference on Computer Vision (ICCV Workshops), pp. 37–45. (2019). Bonn Activity Maps: Dataset Description. arXiv preprint Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, arXiv:1912.06354. M.M. (2017). Geometric deep learning on graphs and manifolds Tatarchenko, M., Park, J., Koltun, V., & Zhou, Q-Y. (2018). Tangent using mixture model CNNs. In Proceedings of the IEEE Confer- convolutions for dense prediction in 3D. In Proceedings of the ence on Computer Vision and Pattern Recognition (CVPR), pp. IEEE Conference on Computer Vision and Pattern Recognition 5115–5124. (CVPR), pp. 3887–3896. Neven, D., De Brabandere, B., Proesmans, M., & Van Gool, L. (2019). Tchapmi, L., Choy, C., Armeni, I., Gwak, J., & Savarese, S. (2017). Instance segmentation by jointly optimizing spatial embeddings SEGCloud: Semantic segmentation of 3D point clouds. In Inter- and clustering bandwidth. In Proceedings of the IEEE Conference national Conference on 3D Vision (3DV), pp. 537–547. IEEE. on Computer Vision and Pattern Recognition (CVPR), pp. 8837– Thomas, H., Qi, C.R., Deschaud, J-E., Marcotegui, B., Goulette, F., & 8845. Guibas, L.J. (2019). KPConv: Flexible and deformable convolu- Nießner, M., Zollhöfer, M., Izadi, S., & Stamminger, M. (2013). tion for point clouds. In Proceedings of the IEEE Int. Conference Real-time 3D reconstruction at scale using voxel hashing. ACM on Computer Vision (ICCV), pp. 6411–6420. Transactions on Graphics (ToG), 32(6), 1–11. Wang, S., Suo, S., Ma, W.C., Pokrovsky, A., & Urtasun, R. (2018a). Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Deep parametric continuous convolutional neural networks. In Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic Proceedings of the IEEE Conference on Computer Vision and Pat- Differentiation in PyTorch. In NIPS Autodiff Workshop. tern Recognition (CVPR), pp. 2589–2597. Pham, Q-H., Hua, B-S., Nguyen, T., & Yeung, S-K. (2019a). Real- Wang, W., Yu, R., Huang, Q., & Neumann, U. (2018b). SGPN: time progressive 3D semantic segmentation for indoor scenes. In Similarity group proposal network for 3D point cloud instance Proceedings of the IEEE Workshop on Applications of Computer segmentation. In Proceedings of the IEEE Conference on Com- Vision, pp. 1089–1098. puter Vision and Pattern Recognition (CVPR), pp. 2569–2578. Pham, Q-H., Nguyen, T., Hua, B-S., Roig, G., & Yeung, S-K. (2019b). Wang, X., Liu, S., Shen, X., Shen, C., & Jia, J. (2019). Associatively seg- JSIS3D: Joint semantic-instance segmentation of 3D point clouds menting instances and semantics in point clouds. In Proceedings of with multi-task pointwise networks and multi-value conditional the IEEE Conference on Computer Vision and Pattern Recognition random fields. In Proceedings of the IEEE Conference on Com- (CVPR), pp. 4096–4105. puter Vision and Pattern Recognition (CVPR), pp. 8827–8836. Wu, B., Zhou, X., Zhao, S., Yue, X., & Keutzer, K. (2018). Squeeze- Qi, C.R., Su, H., Mo, K., & Guibas, L.J. (2017a). PointNet: Deep Segv2: Improved model structure and unsupervised domain adap- learning on point sets for 3D classification and segmentation. In tation for road-object segmentation from a lidar point cloud. arXiv Proceedings of the IEEE Conference on Computer Vision and Pat- preprint arXiv:1809.08495. tern Recognition (CVPR), pp. 652–660. Wu, W., Qi, Z., & Fuxin, L. (2019). PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Con- 123 Autonomous Robots Jan Quenzel received his M.Sc. ference on Computer Vision and Pattern Recognition (CVPR), pp. degree in Computer Science from 9621–9630. the University of Lübeck in 2015. Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the Since August 2015, he is a mem- European Conference on Computer Vision (ECCV), pp. 3–19. ber of the Autonomous Intelligent Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., & Trigoni, Systems Group at the University N. (2019). Learning object bounding boxes for 3D instance seg- of Bonn. His research focuses on mentation on point clouds. arXiv preprint arXiv:1906.01140. LiDAR and visual odometry for Yi, L., Kim, L. G., Ceylan, D., Shen, I., Yan, M., Su, H., et al. (2016). micro aerial vehicles, surface A scalable active framework for region annotation in 3D shape reconstruction, and sensor calibra- collections. ACM Transactions on Graphics (ToG), 35(6), 210. tion. Yi, L., Zhao, W., Wang, H., Sung, M., & Guibas, L.J. (2019). GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pp. 3947–3956. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., & Smola, A.J. (2017). Deep sets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 3391–3401. Sven Behnke received his Ph.D. from Freie Universität Berlin in 2002. He worked in 2003 as post- Publisher’s Note Springer Nature remains neutral with regard to juris- doctoral researcher at the Interna- dictional claims in published maps and institutional affiliations. tional Computer Science Institute, Berkeley. From 2004 to 2008, he headed the Humanoid Robots Radu Alexandru Rosu is a PhD stu- Group at Albert-Ludwigs-Universi dent in the group of Autonomous tat Freiburg. Since 2008, he is Intelligent Systems, University of professor for Autonomous Intel- Bonn, Germany. He holds a mas- ligent Systems at the University ter’s degree in computer science of Bonn. His research interests from the University of Bonn and a include micro aerial vehicles, cog- bachelor’s degree in computer sci- nitive robotics, computer vision, ence from the University of Sala- and machine learning. manca, Spain. His research inter- est are in the area of 3D deep learning. He seeks to create novel neural network models capable of understanding, processing and recon- structing 3D data. Peer Schütt is a computer science master’s student at the University of Bonn. He has worked in the Autonomous Intelligent Systems group since 2017. His research interests are in the area of deep learning and augmented reality.

Journal

Autonomous RobotsSpringer Journals

Published: Oct 19, 2021

Keywords: Semantic segmentation; Instance segmentation; Motion segmentation; Sequence segmentation; 3D point cloud

References