Cystoscopic depth estimation using gated adversarial domain adaptation

Abstract

Monocular depth estimation from camera images is very important for surrounding scene evaluation in many technical fields, from automotive to medicine. However, traditional triangulation methods using stereo cameras or multiple views with the assumption of a rigid environment are not applicable in endoscopic domains. In cystoscopy in particular, it is not possible to produce ground truth depth information with which to directly train machine learning algorithms to predict depth from a monocular image. This work considers first creating a synthetic cystoscopic environment for initial encoding of depth information from synthetically rendered images. Next, the task of predicting pixel-wise depth values for real images is constrained to a domain adaptation between the synthetic and real image domains. This adaptation is done through added gated residual blocks in order to simplify the network task and maintain training stability during adversarial training. Training is done on an internally collected cystoscopy dataset from human patients. The results after training demonstrate the ability to predict reasonable depth estimations from actual cystoscopic videos, and the added stability from using gated residual blocks is shown to prevent mode collapse during adversarial training.

Keywords Neural networks · Domain adaptation · Depth estimation · Endoscopy · Synthetic data

Peter Somers, Simon Holdenried-Krafft, and Johannes Zahn have contributed equally to this work. Corresponding author: Peter Somers (somers@isys.uni-stuttgart.de).

Peter Somers (somers@isys.uni-stuttgart.de), Johannes Schüle (schuele@isys.uni-stuttgart.de), Carina Veil (veil@isys.uni-stuttgart.de), Cristina Tarín (tarin@isys.uni-stuttgart.de), and Oliver Sawodny (sawodny@isys.uni-stuttgart.de): Institute for System Dynamics, University of Stuttgart, Stuttgart, Germany. Simon Holdenried-Krafft (simon.krafft@uni-tuebingen.de) and Hendrik P. A. Lensch (hendrik.lensch@uni-tuebingen.de): Institute for Computer Graphics, University of Tübingen, Tübingen, Germany. Niklas Harland (niklas.harland@uni-tuebingen.de), Simon Walz (simon.walz@uni-tuebingen.de), and Arnulf Stenzl (arnulf.stenzl@uni-tuebingen.de): Urology Clinic, University Hospital of Tübingen, Tübingen, Germany.
1 Introduction

Depth, or distance, information from a sensor is paramount for localization and mapping algorithms, especially when using cameras as the main sensor modality. For this reason, current robotics applications combine LIDAR, or similar, sensors to create what is known as an RGB-D (color and depth) camera image. This provides a dense (or at least partially dense) pixel-wise depth map for each matching image. By using the camera's intrinsic parameters, a point cloud of the scene can be re-projected and, through additional algorithms, the extrinsic position of the camera can be reconstructed while simultaneously mapping the environment. It is also possible to do this without direct depth information by finding matching points in sequential images and using triangulation methods to accomplish the same task with sparser data points.

When operating inside the human body, however, these methods are not as feasible. Particularly during cystoscopic operations, where the camera and instruments must fit through the urethra to reach the bladder, the inclusion of depth-measuring sensors such as LIDAR is out of the question. These restrictions also make it more difficult for the physician to ensure that the entire bladder has been seen, giving rise to the need for methods to map the bladder and ensure full visual coverage. While stereo cameras provide a more robust triangulation-based depth reconstruction, this method is not feasible for the same reason of restricted space, and using a monocular camera for depth estimation is unavoidable to reach the same localization and mapping goals. An additional problem in the cystoscopic environment is that the scene may change fast enough that using sequential image frames for pseudo-triangulation is also not feasible, due to difficulties in matching features between images. Some works have proposed ideas to circumvent these limitations, for example [1], but they require additional information, such as an underlying model, that is not always obtainable. For these reasons, monocular depth estimation remains a hot topic of research, and when using a single image, one method has come to stand out: image domain adaptation.

This work leverages the idea that, instead of measuring, the entire domain can be simulated, in this case a synthetic cystoscopic environment, including the desired output information of depth from the camera. This can be used to compensate for the missing information in a second domain: the real environment. Adaptation between different domains is generally possible when they are similar enough that there exists a feasible transfer function from one to the other. Generative Adversarial Networks (GANs) have shown this to be true by learning this transfer function using neural networks. Under the assumption that the synthetic domain can be constructed accurately enough to form a finite information gap to the real domain, this work aims to find the associated transfer function.

With this in mind, the training method proposed in [2] is used as a foundation and modified. That approach uses adversarial training to retrain the encoder of an encoder-decoder network such that it produces similar latent features from real images as from the synthetic images it was originally trained on. The encoder in this work is modified so that the domain adaptation occurs only in added residual blocks, not through retraining the entire encoder. This approach of using residual blocks for the additional learning was also taken in [3] for transfer learning to improve over comparative GAN approaches, but here it is directly applied to a GAN approach. In addition, learnable gates are included in the added layers to bring additional stability during the adversarial training by smoothly fading in domain-specific features.
1.1 Related work

Depth estimation from images is not a new field of study. Techniques such as structure from motion have been around since at least the 1970s [4] and, more recently, augmented reality requires real-time estimation of this information. Until recently, approaches using only camera images all relied on corresponding points between consecutive images; one state-of-the-art open source tool, COLMAP [5], excels at this. Machine learning has recently been used to enhance the results of these methods for more complete and continuous depth maps [6]. The downside of all of these existing techniques, however, is that they are not universal. Domains that change quickly or lack distinct features between frames, such as endoscopic videos, render existing methods unusable. Therefore, newer techniques focus on making this prediction without the need for distinct feature recognition. These techniques begin with a supervised approach, in which the problem can be seen as a regression problem given a color image as input and a ground truth depth map as output. Two frequently used neural network architectures that do this are DispNet [7] and fully convolutional residual networks (FCRN) [8]. One non-negligible difficulty with formulating the problem this way is the reliance on ground truth data. Unfortunately, the situations where only a camera is desired for extracting dense depth information are precisely those in which using reliable distance sensors for ground truth is not feasible.

This dilemma led to more generalizable approaches capable of using synthesized data for supervised training and using domain adaptation to apply the results to real images. As already mentioned, AdaDepth [9] was the first to effectively do this and demonstrates the capability on various datasets of different domains. Shortly thereafter came works in the areas of colonoscopy [10] and bronchoscopy [2], where depth predictions in endoscopic surgeries could now be done. Both of these works used simplified organ reconstructions and phantom scans to create synthetic data with ground truth depth for initial neural network training before performing a domain adaptation to real images. While these environments also suffer from the aforementioned deformability and working-space problems, the lungs and airways in particular have the advantage that the general shape and images do not vary between patients to the same extent as for the bladder. The bladder is one of the most deformable organs in the body, and during surgeries the fill level is continuously changed to allow for different views or cutting actions, making the scene very dynamic. However, this does not mean that the approaches cannot be applied to the bladder; it has just not yet (to the authors' knowledge) been done.

The contributions of this work can be summarized as:

- the creation of a synthetic cystoscopic environment for rendering images and corresponding depth maps,
- the use of a modified encoder structure for more stabilized GAN training during domain adaptation,
- and evaluation of the prior two contributions on real, clinical endoscopic video data.

2 Materials and methods

The training takes place in two parts. First, a neural network is trained on synthetic data to learn the mapping from synthetic images to depth maps; in the second step, gated residual blocks for domain transfer are inserted into the encoder and adversarial training is performed to adapt the encoder for real cystoscopic images. This section outlines the developed network structure, the data generation, and the training methods used to accomplish this. First, the structure for depth estimation from synthetic images using an encoder-decoder network is explained. Following the synthetic training, the structure is modified for domain transfer from real to synthetic latent features through a modification of the encoder where gated residual blocks are inserted.

2.1 Synthetic domain network structure

The overall depth prediction network structure follows the U-Net in [2], with differences mainly in the decoder and the activation functions. Instead of simple nearest-neighbor upsampling, the ICNR-initialized sub-pixel convolution approach [11] is used (sketched below); in empirical testing on real images, this step drastically reduced checkerboard artifacts. The resulting network (Fig. 1) is the backbone for learning depth estimation from synthetically generated images. As seen in Fig. 1, the decoder is guided to predict depth at multiple levels during training. This encourages the latent features to include information about the depth. Once this network is trained with a standard supervised regression approach, the modifications outlined in the next section are made to handle the domain transfer learning for real cystoscopy images.

Fig. 1 Synthetic data depth prediction network. Solid red arrows indicate the points at which the loss is calculated and include upsampling as needed to match the pixel dimensions of the ground truth depth image D (shown here in color for illustration purposes only)
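To make the upsampling step concrete, the following is a minimal PyTorch sketch of an ICNR-initialized sub-pixel convolution stage in the spirit of [11]. It is an illustration under assumed layer sizes and names, not the released implementation.

```python
import torch
import torch.nn as nn

def icnr_(weight: torch.Tensor, scale: int = 2) -> None:
    """ICNR initialization [11]: fill a sub-pixel conv weight so that,
    right after initialization, PixelShuffle reproduces nearest-neighbor
    upsampling, which suppresses checkerboard artifacts."""
    out_ch, in_ch, kh, kw = weight.shape
    sub = torch.empty(out_ch // scale ** 2, in_ch, kh, kw)
    nn.init.kaiming_normal_(sub)
    # Repeat each sub-kernel scale^2 times so all sub-pixel positions
    # of one output channel start from identical filters.
    weight.data.copy_(sub.repeat_interleave(scale ** 2, dim=0))

class SubPixelUp(nn.Module):
    """One decoder upsampling stage: conv to C*scale^2 channels,
    ICNR-initialized, followed by pixel shuffle."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * scale ** 2, 3, padding=1)
        icnr_(self.conv.weight, scale)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))
```

The design intent is that the network starts from an artifact-free state and only then learns to deviate from plain upsampling where the data demands it.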
2.2 Domain transfer network structure

While direct application of the synthetic depth prediction network to real images without any network modifications or re-training produces plausible depth maps, these are still subject to inaccuracy due to the domain shift (see the third column in Fig. 10). This domain shift is handled through a domain transfer learned via generative adversarial training between a new encoder and multiple discriminators. However, rather than retraining the entire encoder, as is done in [2], which can lead to more unstable GAN training, a gated transfer learning approach is implemented with residual blocks added at each encoder level. These blocks are initially disabled as GAN training starts, and the gates are slowly opened with a learned coefficient $\alpha_l$ for each encoder level $l$ using

$$O_l = R_l \circ \tanh(\alpha_l), \qquad (1)$$

where $R_l$ and $O_l$ are the outputs of the added ResNet block and the resulting gated output, respectively. This follows the same method as in [12], using the idea of ReZero from [13]. The intent here is that the residual blocks will learn how to correct for the domain shift, while the rest of the already trained encoder is left frozen to maintain the image features that contain the depth information. The modified encoder with residual blocks is shown in Fig. 2.

Fig. 2 Modified encoder with gated residual blocks
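As an illustration of Eq. (1), the PyTorch sketch below shows one possible gated residual block. Only the zero-initialized tanh gate is taken from the text; the two-convolution block body and the skip addition into the frozen feature stream are assumptions.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose contribution is faded in by a learned gate.

    Implements O_l = R_l * tanh(alpha_l) from Eq. (1): alpha_l starts at
    zero, so the block is initially disabled and the frozen encoder
    behaves exactly as after synthetic pre-training.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(  # assumed two-conv residual body
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Zero-initialized gate, following the ReZero idea [13].
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated residual correction added to the frozen feature map.
        return x + self.block(x) * torch.tanh(self.alpha)
```

Because the gate starts closed, the adversarial gradients can only grow the domain-specific correction gradually, which is the stabilization mechanism evaluated in Sect. 3.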
2.3 Synthetic domain data and training

Here, the methods for the first step of training depth estimation within the synthetic cystoscopy domain are outlined.

2.3.1 Data generation

The tool of choice for rendering images of the synthetic environment is extremely important and directly influences the quality of the results: the generated images should be as realistic as possible. The tool used in this work to create the synthetically rendered images is the 3D rendering software Blender [14], which also includes a Python interface for automated generation of different scenes and camera positions.

Realism in synthetic images can be divided into three categories: photo, physical, and functional realism [15]. The first refers to whether a rendered image produces the same visual response as a real scene. Physical realism is achieved when a synthetic image produces the same visual stimulation as a real scene. This is harder to achieve than photo realism and requires the render engine to accurately and realistically calculate the spectral properties of the light observed at the viewpoint. Functional realism requires an image to contain the same visual information as a real scene: the observer must be able to extract the relevant properties such as sizes, shapes, motions, positions, and materials. This does not require the image to be physically realistic; for example, technical drawings can provide functional realism. For the task of object detection, it was found that a high level of photo realism is not required for high performance [16]. While this was shown for object detection, it is unknown for other tasks, such as monocular depth estimation.

The scene lighting has a drastic effect on physical cues for depth estimation in an endoscopic environment, which was a driving factor in [10]. The light source within an endoscopic environment is typically attached to the camera and, therefore, moves along with it. To capture illumination effects as well as possible, ray trace rendering is preferred over rasterization for creating the synthetic images in order to model the light transport accurately, capturing effects such as shadowing in a realistic way. Additionally, for depth estimation, the general shapes and sizes of objects need to be accurately represented. For this, all models need to be created within the bounds of physically possible features seen during a cystoscopy. This is accomplished by utilizing reconstructions of actual patient bladders taken from CT scans, a 3D imaging technique, from the study [17]. Examples are seen in Fig. 3, where it is also possible to see that the human bladder is a very irregularly shaped organ, as the only consistent feature between the scans is that they are singular, closed volumes. The lights are simulated as two conical light sources placed on either side of the camera, similar to [10], to simulate a typical endoscope.

Fig. 3 Anatomically accurate 3D bladder models with different filling states obtained by CT scans

Due to the voxel resolution of the scanning method that generated the models, the resulting surfaces needed to be smoothed. Features such as diverticula or polyps are, therefore, not represented. In addition, the walls in an actual bladder tend to be more wrinkly. To account for these missing features, additional geometry modifications are performed to randomly add fake polyps and a Perlin noise displacement texture across the model's surface. Examples of these modifications are shown in Fig. 4 next to similar real images.

Fig. 4 Bladder model geometry augmentations. The 3D bladder models are modified to cover tissue effects such as (top) polyps and (bottom) bumpy bladder walls. The left image shows the general tissue effect to be simulated, the middle image shows the model before augmentation, and the right image shows the model after augmentation with either added bodies or added Perlin noise displacement. Note: the model images here are rendered using Phong shading, so the simulated polyp only appears to be floating in space

To avoid poor generalization due to the uniform texturing (default Blender material) shown in Fig. 4, additional materials are used to provide a closer color representation to the real images, including blood vessel-like structures. Translucent subsurface scattering is also enabled for this texture to better represent the optical properties of human tissue. A final touch of randomly generated texture brightness helps the model learn the difference between reflective properties and shadows. These modifications are shown in Fig. 5.

Fig. 5 Bladder texture modifications, left to right: bladder base color, artificial blood vessels, artificial vessels with randomized texture brightness values, and a real image of blood vessels for comparison

Camera pose generation for the rendered images is a simple procedure since the bladder is a closed, sphere-like volume. Vectors from the center of the volume are randomly generated, and a randomized distance from the intersection with the bladder wall provides the position of the camera. The viewing direction is then varied up to 30° from the intersecting vector (see the sketch at the end of this subsection). The final augmentations come post-rendering in the form of more standard image modification methods. These include: black circular mask generation, random color jitters, random translations, and random rotations from 0° to 360°. The circular black mask is necessary as this information cannot be removed from the real images. Samples of these are seen in Fig. 6. Simultaneously with the color image rendering, depth maps are rendered out and matching transformation augmentations are applied accordingly.

Fig. 6 Data augmentation examples, left to right: no augmentation, color jitter, and color jitter with rotation
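The pose-sampling procedure above can be sketched as follows. This is a hedged reconstruction in plain NumPy: `ray_to_wall` stands in for whatever ray cast the Blender pipeline provides, and the stand-off range is an illustrative guess, not a value from the paper.

```python
import numpy as np

def random_unit_vector(rng: np.random.Generator) -> np.ndarray:
    """Uniformly distributed direction on the unit sphere."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def sample_camera_pose(center, ray_to_wall, rng, max_tilt_deg=30.0):
    """Sample one camera pose inside a closed, sphere-like volume.

    center      -- center of the bladder volume (3-vector)
    ray_to_wall -- callable(origin, direction) -> distance to the wall
                   along that direction (e.g. a Blender ray cast);
                   assumed to exist in the rendering pipeline
    """
    d = random_unit_vector(rng)                    # direction from the volume center
    wall_dist = ray_to_wall(center, d)             # where the ray hits the bladder wall
    standoff = rng.uniform(0.1, 0.9) * wall_dist   # randomized distance back from the wall (illustrative)
    position = center + (wall_dist - standoff) * d

    # Tilt the viewing direction by up to max_tilt_deg away from the
    # intersecting vector d (Rodrigues' formula; axis is orthogonal to d).
    axis = np.cross(d, random_unit_vector(rng))
    axis /= np.linalg.norm(axis)
    theta = np.radians(rng.uniform(0.0, max_tilt_deg))
    view = d * np.cos(theta) + np.cross(axis, d) * np.sin(theta)
    return position, view
```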
2.3.2 Supervised training

The training for the synthetic domain is very straightforward. The goal of the network is to predict a depth map $D^*$ for a given synthetic image $I_S$. It is a standard supervised learning problem, with the caveat that the depth loss is calculated at each level of the multi-resolution decoder. In order to do this at the pixel level, the same technique as in [2] is used, namely upsampling with bilinear interpolation, to reach the ground truth depth image $D$ dimensions. The BerHu loss [18]

$$L_{\text{BerHu}}(D, D^*) = \begin{cases} |D - D^*| & \text{if } |D - D^*| \le c \\[4pt] \dfrac{(D - D^*)^2 + c^2}{2c} & \text{if } |D - D^*| > c \end{cases} \qquad (2)$$

is used, as it has been shown to outperform standard regression losses such as the $L_1$ or $L_2$ loss. The threshold $c = \frac{1}{5}\max_i |D_i - D_i^*|$ is defined as a fixed fraction of the maximum absolute difference for any pixel between the ground truth and prediction. Since the depth maps should be locally similar, as the tissue is generally smooth and connected (excluding situations such as occlusion), an additional loss is calculated, namely the gradient loss

$$L_{\text{grad}} = \frac{1}{N} \sum_{i \in N} \left(\nabla_x y_i\right)^2 + \left(\nabla_y y_i\right)^2, \qquad (3)$$

where $y_i = \log D_i - \log D_i^*$, and $\nabla_x$ and $\nabla_y$ denote the image gradients in the horizontal and vertical directions over the number of valid pixels $N$. This loss term penalizes high image gradients of the difference between prediction and ground truth in log scale, which produces more accurate gradients in the depth prediction without degrading the $L_2$ regression loss [19].

The total resulting loss for the synthetic domain training is given as

$$L(D, D^*) = \sum_{l=1}^{4} c_0\, L_{\text{BerHu}}\!\left(D, u(D_l^*)\right) + c_1\, L_{\text{grad}}\!\left(D, u(D_l^*)\right) \qquad (4)$$

with sensitivity tuning coefficients $c_0$ and $c_1$ between the two loss components. The individual loss components are summed across each decoder level, with $l = 4$ as the lowest-resolution decoder output. Here, $u$ is the bilinear interpolation upsampling to reach the image resolution of $D$.
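A minimal PyTorch sketch of the losses in Eqs. (2)-(4) follows. The 1/5 threshold fraction follows the convention of [8] (the paper only states "fixed fraction"), and the per-axis means in the gradient term are a simplification of the 1/N sum in Eq. (3).

```python
import torch
import torch.nn.functional as F

def berhu_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BerHu loss, Eq. (2): L1 below the threshold c, scaled L2 above it."""
    diff = (pred - target).abs()
    # Fixed fraction of the maximum per-batch error; 1/5 follows [8].
    c = (0.2 * diff.max()).detach().clamp_min(1e-6)
    return torch.where(diff <= c, diff, (diff ** 2 + c ** 2) / (2 * c)).mean()

def gradient_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Log-depth gradient loss, Eq. (3): penalizes image gradients of
    the log-scale prediction error in x and y."""
    y = torch.log(pred) - torch.log(target)
    grad_x = y[..., :, 1:] - y[..., :, :-1]   # horizontal image gradient
    grad_y = y[..., 1:, :] - y[..., :-1, :]   # vertical image gradient
    return (grad_x ** 2).mean() + (grad_y ** 2).mean()

def total_loss(depth_gt, decoder_outputs, c0=1.0, c1=1.0):
    """Eq. (4): both losses at every decoder level l = 1..4, after
    bilinear upsampling u(.) to the ground-truth resolution.
    c0, c1 are the (unspecified) sensitivity coefficients."""
    loss = depth_gt.new_zeros(())
    for d_l in decoder_outputs:  # multi-resolution predictions
        up = F.interpolate(d_l, size=depth_gt.shape[-2:],
                           mode="bilinear", align_corners=False)
        loss = loss + c0 * berhu_loss(up, depth_gt) + c1 * gradient_loss(up, depth_gt)
    return loss
```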
2.4 Domain transfer data and training

As for the synthetic domain, first an overview of the data used is provided, followed by the training procedure for accomplishing the task.

2.4.1 Data acquisition

The dataset for domain adaptation consists of 17 standard cystoscopic videos with an average frame rate of 25 frames per second. The videos comprise both normal diagnostic checks and trans-urethral resections of tumors. Most of these videos are recorded using analog equipment, so before processing, a standard deinterlacing algorithm (YADIF) is run on the associated videos. The videos are then sampled every 5 frames to generate the initial raw data set.

Before the images can be used, they are filtered to exclude irrelevant ones, including frames where the endoscope is outside the body, over-exposed frames, and frames that are too blurry or dark. This process is automated by first finding a fitting circular mask and then using tools such as a red threshold (for inside the bladder), Laplacian variance (blurriness), and a general brightness threshold (a sketch of such a filter follows below). Further, more advanced filtering could be done, including using a neural network classifier to exclude images with bubbles, as these are not part of the actual depth of the scene. Figure 7 shows examples of included and excluded cystoscopic images.

Fig. 7 Frame selection from the clinical cystoscopy dataset. From left to right: included image for training, excluded blurry image, and excluded camera outside the body
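As a sketch of the automated filtering, the snippet below applies the named heuristics with OpenCV. All threshold values are illustrative placeholders rather than the ones used for the dataset, and the circular endoscope mask is assumed to have been found already.

```python
import cv2

def keep_frame(bgr, red_min=60, blur_min=25.0, bright_min=20, bright_max=235):
    """Heuristic frame filter sketched from Sect. 2.4.1.

    bgr -- frame already cropped to the fitted circular mask.
    Thresholds are placeholders, not the paper's values.
    """
    mean_b, mean_g, mean_r = cv2.mean(bgr)[:3]
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    brightness = gray.mean()
    # Variance of the Laplacian: low values indicate a blurry frame.
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    inside_bladder = mean_r > red_min and mean_r > mean_b  # reddish scene
    exposed_ok = bright_min < brightness < bright_max      # not too dark or overexposed
    return inside_bladder and exposed_ok and sharpness > blur_min
```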
2.4.2 Adversarial training

After the network is trained to predict depth from the synthetic images, the network weights are frozen. Two copies of the encoder are used during adversarial training: one, $F_S$, is left unchanged, and the other, $F_R$, receives the gated residual blocks for domain transfer. It is worth noting here that the batch normalization statistics throughout the encoder are also frozen. This decision comes from a separate experimental investigation which found that doing so achieved a better overall depth error after adversarial training. The training approach follows the scheme in [2], where the decoder is not included in the adversarial training; instead, the encoder is forced to learn latent vectors at the lower three levels similar to those output from the synthetic training. This is shown in Fig. 8. The individual discriminators $A_i$ with $i \in \{3, 4, 5\}$ also use the same PatchGAN structure as in [2].

Fig. 8 Adversarial domain adaptation scheme similar to that in [2]. The encoder $F_R$ for the real domain is initialized with weights from $F_S$ and includes the added residual blocks. Adversarial training is then performed where $F_R$ acts as a conditional generator that takes image $I_R$ as input. Discriminators $A_i$ are applied at the skip connections and trained to distinguish $F_S(I_S)$ from $F_R(I_R)$

The standard GAN training as proposed in [20] is used with the adversarial objective function

$$L_A = \sum_{i \in \{3,4,5\}} \mathbb{E}_{I_S \sim X_S}\!\left[\log A_i\!\left(F_{S_i}(I_S)\right)\right] + \mathbb{E}_{I_R \sim X_R}\!\left[\log\!\left(1 - A_i\!\left(F_{R_i}(I_R)\right)\right)\right] \qquad (5)$$

where $I_R$ is an image from the real domain and $I_S$ an image from the synthetic domain.
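The following PyTorch sketch expresses Eq. (5) for a single feature level in the standard non-saturating BCE form of [20]. The PatchGAN discriminators are assumed to emit per-patch logits, and the full objective sums these terms over the levels i ∈ {3, 4, 5}; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc_syn_logits, disc_real_logits):
    """One level of Eq. (5): A_i should score synthetic-encoder
    features F_S(I_S) high and adapted-encoder features F_R(I_R) low."""
    return (F.binary_cross_entropy_with_logits(
                disc_syn_logits, torch.ones_like(disc_syn_logits))
            + F.binary_cross_entropy_with_logits(
                disc_real_logits, torch.zeros_like(disc_real_logits)))

def generator_loss(disc_real_logits):
    """The adapted encoder F_R (the only trainable part, via its gated
    residual blocks) tries to make its features look synthetic."""
    return F.binary_cross_entropy_with_logits(
        disc_real_logits, torch.ones_like(disc_real_logits))
```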
3 Results

For the synthetic data, 5000 images per bladder model and material (textured and non-textured) were rendered. Therefore, 10000 images per bladder model are available with randomized viewpoints, viewing directions, light intensities, etc. Two of the 38 bladder models are used each for the validation and test sets. This amounts to 340000 images for the training set (90%) and 20000 images each for the test and validation sets (5% each). For the domain adaptation, approximately 16600 real cystoscopy images without ground truth labels are used. The images from both datasets are fed to the network at their original resolution of 256 × 256 pixels.

The synthetic data training achieved the lowest validation root mean square error (RMSE) after 22 epochs, at 0.878 mm. The weights acquired after this epoch were saved and used for the domain adaptation. To ensure the gating of the domain adaptation functioned as expected, the adversarial training was performed as presented in Sect. 2.4.2 and also repeated with the gating removed. The resulting distribution of the gate coefficients can be seen in Fig. 9. Sample results after training the domain adaptation with gating are shown in Fig. 10. A prediction using the same image before any adaptation is provided as well (right depth plot). To give an idea of what changes between the two depth plots after domain adaptation training, a difference plot between the two is provided.

Fig. 9 Gate values for the adaptable gates during adversarial training (top) with discriminator (middle) and generator (bottom) training losses over training time in hours (legend: gate coefficients 0-4; adaptive gating vs. no gates). The α values correlate to the adaptive gating (light blue) training shown in the bottom two loss plots, while the dark blue values track the loss for training without any gates included after the residual blocks. It is seen that the gating provides a smoother transition to a stable balance between the generator and discriminator. This subsequently results in better predictions

When comparing to the results in Fig. 10, training the network with the gates removed (ungated training plot in Fig. 9), it is seen that the network almost immediately suffers from a mode collapse and struggles to maintain any information from the provided input image. Instead, only a feasible texture is predicted that can fool the discriminators, and the network becomes stuck at this point. The resulting prediction is seen in Fig. 11.

4 Discussion

As expected, the convergence of the adversarial training to a stable balance between discriminator and generator training losses takes longer with the gating, but it exhibits a much more stable trajectory to the equilibrium in comparison to not using gating. By using adaptive gating, the possibility of needing to restart training due to complete divergence of the network is avoided, which can happen very often when training adversarial networks. In Fig. 10 it is possible to see that including the domain adaptation (left depth plot) produces a more reasonable depth map for the given image. It can be seen that the adversarial training appears to improve upon deciphering the difference between shadows and a darker texture. The checkerboard effect, which often appeared when directly applying the synthetic network to real images, is effectively eliminated, creating a smooth, more continuous depth estimation.

Fig. 10 Examples of (perceived) improved depth estimation over synthetic training by using domain adaptation. From left to right: input image, domain adapted depth prediction, unadapted depth prediction using the network trained on synthetic data, and difference plot between the two depth predictions. Units are provided in mm, and the depth predictions in the second and third columns both use the same scale found in their respective rows

Fig. 11 Example of mode collapse during adversarial training for domain adaptation when not using adaptive gating. Units are provided in mm

4.1 Limitations

It is clear that the proposed domain adaptation through the gated residual blocks accomplishes the intended tasks set forth in this work. Unfortunately, however, as there is no representation for objects such as the resection cutting loop in the current synthetic domain, and the simulated polyps do not differentiate much from their surrounding texture, the network struggles with images containing this content. Examples are seen in Fig. 12. It is apparent that the network relies purely on the brightness of the tool to determine its distance from the camera, and it misses the fact that the polyp is not flat since it does not cast a visible shadow in the given image. These problems should be avoidable by including sample images of this kind in the synthetic domain, such that the network can learn how to map the distinct depth structure of the tool to the latent vector space and learn that drastically different local texture is an indication of a different structure.

Fig. 12 Examples of unimproved depth estimation. Units are provided in mm
5 Conclusion

In this work, an improvement in using deep neural networks for monocular depth estimation for cystoscopy was achieved using a two-step training approach to limit the problem to a domain transfer between a synthetic and a real domain. This was done for the cystoscopic environment inside the human bladder, and the work included the construction of a pseudo-realistic bladder environment for the creation of synthetic camera images. Real cystoscopic videos were used for adversarial training to transfer the depth estimation capabilities from the synthetic domain to the real one. The training for this was stabilized by restricting the domain adaptation to newly added residual blocks, each with a learnable gating parameter. Results showed an improvement in feasible depth estimations once a domain transfer was done; however, this only worked in scenarios where the synthetic domain was able to provide a similar scene. With these results, it can be concluded that the methods shown enable depth estimation in a cystoscopic environment and provide a more stable approach to the adversarial training for domain adaptation.

Acknowledgements This work was sponsored by the Graduate School 2543/1 "Intraoperative Multisensory Tissue Differentiation in Oncology" (project ID 40947457) funded by the German Research Foundation (DFG - Deutsche Forschungsgemeinschaft). This work was also supported in part by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A.

Author contributions All authors contributed to the study conception and design. Material preparation was performed by Johannes Zahn. Video data collection was performed by Niklas Harland and Simon Walz. The first draft of the manuscript was written by Johannes Zahn and Peter Somers under the guidance of Simon Holdenried-Krafft. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding Open Access funding enabled and organized by Projekt DEAL.

Code availability The source code for this work will be made available at https://github.com/cgtuebingen/cystoscopy_depth.
Declarations

Conflicts of interest The authors have no relevant financial or non-financial interests to disclose.

Ethics approval Ethics approval was obtained from the ethics commission at the University Hospital of Tübingen (January 25, 2022, project number 583/2021BO1).

Informed consent Informed consent was obtained from all individual participants included in the study.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

References

1. Schüle J, Haag J, Somers P, Veil C, Tarín C, Sawodny O. A model-based simultaneous localization and mapping approach for deformable bodies. In: 2022 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pp. 607–612 (2022). https://doi.org/10.1109/AIM52237.2022.9863308
2. Karaoglu MA, Brasch N, Stollenga M, Wein W, Navab N, Tombari F, Ladikos A. Adversarial domain feature adaptation for bronchoscopic depth estimation. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Lecture Notes in Computer Science, vol. 12904, pp. 300–310. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-87202-1_29
3. Li S, Liu CH, Lin Q, Wen Q, Su L, Huang G, Ding Z. Deep residual correction network for partial domain adaptation. IEEE Trans Pattern Anal Mach Intell. 2021;43(7):2329–44. https://doi.org/10.1109/tpami.2020.2964173
4. Ullman S. The interpretation of structure from motion. The Royal Society; 1979.
5. Schönberger JL, Frahm J-M. Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
6. Luo X, Huang J-B, Szeliski R, Matzen K, Kopf J. Consistent video depth estimation. arXiv (2020). https://doi.org/10.48550/ARXIV.2004.15021
7. Mayer N, Ilg E, Häusser P, Fischer P, Cremers D, Dosovitskiy A, Brox T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048 (2016). https://doi.org/10.1109/CVPR.2016.438
8. Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N. Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248 (2016). https://doi.org/10.1109/3DV.2016.32
9. Kundu JN, Uppala PK, Pahuja A, Babu RV. AdaDepth: unsupervised content congruent adaptation for depth estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2656–2665 (2018). https://doi.org/10.1109/CVPR.2018.
10. Mahmood F, Durr NJ. Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. Med Image Anal. 2018;48:230–43. https://doi.org/10.1016/j.media.2018.06.005
11. Aitken AP, Ledig C, Theis L, Caballero J, Wang Z, Shi W. Checkerboard artifact free sub-pixel convolution: a note on sub-pixel convolution, resize convolution and convolution resize. CoRR abs/1707.02937 (2017). arXiv:1707.02937
12. Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R, Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for few-shot learning. arXiv (2022). https://doi.org/10.48550/ARXIV.2204.14198
13. Bachlechner T, Majumder BP, Mao HH, Cottrell GW, McAuley J. ReZero is all you need: fast convergence at large depth. In: Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (2021). https://doi.org/10.48550/ARXIV.2003.04887
14. Blender Development Team. Blender 3.1.0 (2022). https://www.blender.org/download/releases/3-1/ Accessed 20.04.2022
15. Peddie J. Ray tracing: a tool for all. Cham: Springer; 2019.
16. Rajpura PS, Hegde RS, Bojinov H. Object detection using deep CNNs trained on synthetic images. arXiv (2017). https://doi.org/10.48550/arXiv.1706.06782
17. Rister B, Yi D, Shivakumar K, Nobashi T, Rubin DL. CT-ORG, a new dataset for multiple organ segmentation in computed tomography. Sci Data. 2020;7(1):381. https://doi.org/10.1038/s41597-020-00715-8
18. Zwald L, Lambert-Lacroix S. The BerHu penalty and the grouped effect. arXiv: Statistics Theory (2012). https://doi.org/10.48550/arXiv.1207.6868
19. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658 (2015). https://doi.org/10.1109/ICCV.2015.304
20. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014). https://doi.org/10.48550/ARXIV.1406.2661

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Loading next page...
 
/lp/springer-journals/cystoscopic-depth-estimation-using-gated-adversarial-domain-adaptation-fhSwOabnJZ
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2023
ISSN
2093-9868
eISSN
2093-985X
DOI
10.1007/s13534-023-00261-3
Publisher site
See Article on Publisher Site

Abstract

Monocular depth estimation from camera images is very important for surrounding scene evaluation in many technical fields from automotive to medicine. However, traditional triangulation methods using stereo cameras or multiple views with the assumption of a rigid environment are not applicable for endoscopic domains. Particularly in cystoscopies it is not possible to produce ground truth depth information to directly train machine learning algorithms for using a monocular image directly for depth prediction. This work considers first creating a synthetic cystoscopic environment for initial encoding of depth information from synthetically rendered images. Next, the task of predicting pixel-wise depth values for real images is con- strained to a domain adaption between the synthetic and real image domains. This adaptation is done through added gated residual blocks in order to simplify the network task and maintain training stability during adversarial training. Training is done on an internally collected cystoscopy dataset from human patients. The results after training demonstrate the ability to predict reasonable depth estimations from actual cystoscopic videos and added stability from using gated residual blocks is shown to prevent mode collapse during adversarial training. Keywords Neural networks · Domain adaptation · Depth estimation · Endoscopy · Synthetic data 1 Introduction current robotics applications combine LIDAR, or similar, sensors to create what is known as an RGB-D (color and Depth, or distance, information from a sensor is paramount depth) camera image. This provides a dense (or at least par- for localization and mapping algorithms, especially when tially dense) pixel-wise depth map for each given matching using cameras as the main sensor modality. For this reason, image. By using the camera’s intrinsic parameters, a point cloud of the scene can be re-projected and, through addi- tional algorithms, the extrinsic position of the camera can Peter Somers, Simon Holdenried-Krafft, and JohannesZahn have contributed equally to this work. * Peter Somers Oliver Sawodny somers@isys.uni-stuttgart.de sawodny@isys.uni-stuttgart.de Simon Holdenried-Krafft Cristina Tarín simon.krafft@uni-tuebingen.de tarin@isys.uni-stuttgart.de Johannes Schüle Hendrik P. A. Lensch schuele@isys.uni-stuttgart.de hendrik.lensch@uni-tuebingen.de Carina Veil Institute for System Dynamics, University of Stuttgart, veil@isys.uni-stuttgart.de Stuttgart, Germany Niklas Harland Institute for Computer Graphics, University of Tübingen, niklas.harland@uni-tuebingen.de Tübingen, Germany Simon Walz Urology Clinic, University Hospital of Tübingen, Tübingen, simon.walz@uni-tuebingen.de Germany Arnulf Stenzl arnulf.stenzl@uni-tuebingen.de Vol.:(0123456789) 1 3 Biomedical Engineering Letters be reconstructed while simultaneously mapping the environ- to a GAN approach. In addition, learnable gates are included ment. It is also possible to do this without the direct depth in the added layers to bring additional stability during the information by finding matching points in sequential images adversarial training by smoothly fading in domain specific and using triangulation methods to accomplish the same task features. with more sparse data points. When operating inside the human body, however, these 1.1 Related work methods are not as feasible and particularly during cysto- scopic operations where the camera and instruments must fit Depth estimation from images is not a new field of study. 
through the urethra to reach the bladder, inclusion of depth Techniques such as structure from motion have been around measuring sensors like LIDAR are out of the question. These since at least the 1970 s [4] and, more recently, augmented restrictions also make it more difficult for the physician to reality requires real-time estimation of this information. ensure that the entire bladder has been seen giving rise for Until recently, the approaches using only camera images all the need of methods to map the bladder to ensure full vis- relied upon using corresponding points between consecutive ual coverage. While using stereo cameras provides a more images. One state-of-the-art open source tool COLMAP [5] robust triangulation-based depth reconstruction, for the same excels at this. Machine learning has recently been used to reason of restricted space this method is not feasible and enhance results of these methods for more complete and using a monocular camera for depth estimation is unavoid- continuous depth maps [6]. The downside to all of these able to obtain the same localization and mapping goals. An existing techniques, however, is that they are not universal. additional problem in the cystoscopic environment is that Domains that change quickly or lack distinct features the scene may change fast enough that also using sequential between frames, such as endoscopic videos, render exist- image frames for pseudo-triangulation is not feasible due ing methods unusable. Therefore, newer techniques focus to difficulties in matching features between images. Some on methods to make this prediction without the need for works have proposed ideas to circumvent these limitations, distinct feature recognition. These techniques begin with for example [1], but they require additional information, a supervised approach, in which the problem can be seen such as an underlying model, that is not always obtainable. as a regression problem given a color image as input and For these reasons, monocular depth estimation remains a hot a ground truth depth map as output. Two frequently used topic of research and when using a single image, one method neural network architectures that do this are DispNet [7] and has come to stand out: image domain adaptation. fully convolutional residual networks (FCRN) [8]. One non- This work leverages the idea that instead of measuring, negligible difficulty with formulating the problem this way the entire domain can be simulated, in this case a synthetic is the reliance on ground truth data. Unfortunately, the situ- cystoscopic environment, including the desired output infor- ations where only a camera is desired for extracting dense mation of depth from the camera. This can be used to com- depth information are ones in which also using reliable dis- pensate for the missing information in a second domain: the tance sensors for ground truth is not feasible. real environment. Adaptation between different domains is This dilemma led to more generalizable approaches generally possible when they are similar enough that there capable of using synthesized data for supervised train- exists a feasible transfer function from one to the other. Gen- ing and using domain adaptation to apply the results to erative Adversarial Networks (GANs) have shown this to be real images. As already mentioned, AdaDepth [9] was the true by learning this transfer function using neural networks. 
first to effectively do this and demonstrates the capability Under the assumption that the synthetic domain can be con- on various datasets of different domains. Shortly there- structed accurately enough to form a finite information gap after came works in the areas of colonoscopy [10] and from the real domain, this work aims to find the associated bronchoscopy [2], where depth predictions in endoscopic transfer function. surgeries could now be done. Both of these works used With this in mind, the training method proposed in [2] is simplified organ reconstructions and phantom scans to cre- used as a foundation and modified. The approach uses adver - ate synthetic data with ground truth depth for initial neural sarial training to retrain an encoder from a encoder-decoder network training before performing a domain adaptation to network such that it produces similar latent features from the real images. While these environments also suffer from real images as from the synthetic images it was originally the aforementioned deformability and working space prob- trained on. The encoder in this work is modified so that the lems, particularly the lungs and airways have the advan- domain adaptation occurs only in added residual blocks, tage that the general shape and images between different not through retraining the entire encoder. This approach of patients does not vary to the same extent as for the bladder. using residual blocks for the additional learning was also The bladder is one of the most deformable organs in the taken in [3] for transfer learning to improve over compara- body and during surgeries, the fill level is continuously tive GAN approaches, but in this work it is directly applied changed to allow for different views or cutting actions 1 3 Biomedical Engineering Letters making the scene very dynamic. However, this does not 2.1 Synthetic domain network structure mean that the approaches cannot be applied to the bladder, it has just not yet (to the authors’ knowledge) been done. The overall depth prediction network structure follows the The contributions of this work can be summarized as U-Net in [2] with differences mainly in the decoder and the activation functions. Instead of simple nearest neigh- the creation of a synthetic cystoscopic environment for bor upsampling, the ICNR initialized sub-pixel convolution rendering images and corresponding depth maps, approach [11] is used. On real images this step drastically the use of a modified encoder structure for more stabi- reduced checkerboard artifacts (from empirical testing). lized GAN training during domain adaptation, The resulting network (Figure 2) is the backbone for learn- and evaluation of the prior two contributions on real, ing depth estimation from synthetically generated images. clinical endoscopic video data. As seen in Fig. 1, the decoder is guided to predict depth at multiple levels during training. This encourages the latent features to include information about the depth. Once this network is trained with a standard supervised regression 2 Materials and methods approach, the modifications outlined in the next section are made to handle the domain transfer learning for real cystos- The training takes place in two parts. First, a neural net- copy images. 
work is trained on synthetic data to learn the mapping from synthetic images to depth maps and in the second 2.2 Domain transfer network structure step, gated residual blocks for domain transfer are inserted into the encoder and adversarial training is performed to While direct application of the synthetic depth prediction adapt the encoder for real cystoscopic images. This sec- network on real images without any network modifications tion will outline the developed network structure, the data or re-training produces plausible depth maps, these are still generation, and training methods used to accomplish this. subject to inaccuracy due to the domain shift (see third First, the structure for depth estimation from synthetic column in Fig. 10). This domain shift is handled through images using an encoder-decoder network is explained. a domain transfer learned through generative adversarial Following the synthetic training, the structure is modified training between a new encoder and multiple discrimina- for domain transfer from real to synthetic latent features tors. However, rather than retrain the entire encoder, as is through a modification of the encoder where gated residual done in [2], which can lead to more unstable GAN training, a blocks are inserted. gated transfer learning approach is implemented with added Fig. 1 Synthetic data depth prediction network. Solid red arrows indicate the points at which the loss is calculated and include upsampling as needed to match the pixel dimensions of the ground truth depth image D (shown here in color for illustration purposes only) 1 3 Biomedical Engineering Letters Fig. 2 Modified encoder with gated residual blocks residual blocks at each encoder level. These blocks are ini- [14], which also includes a python interface for automated tially disabled as GAN training is started and the gates are generation of different scenes and camera positions. slowly opened with a learned coefficient  for each encoder Realism in synthetic images can be divided into three level  using categories: photo, physical, and functional realism [15]. The first refers to whether a rendered image produces the same O = R ◦ tan  , (1) visual response as a real scene. Physical realism is achieved when a synthetic image produces the same visual stimulation where R and O are the outputs of the added ResNet block as a real scene. This is harder to achieve than photo realism and the resulting gated output, respectively. This follows the and requires the render engine to accurately and realisti- same method as in [12] using the idea of ReZero from [13]. cally calculate the spectral properties of the light, observed The intent here is that the residual blocks will learn how at the viewpoint. Functional realism requires an image to to correct for the domain shift and the rest of the already contain the same visual information as a real scene. Hence, trained encoder is left frozen to maintain the image features the observer must be able to extract the relevant proper- that contain the depth information. The modified encoder ties such as sizes, shapes, motions, positions, and materials. with residual blocks is shown in Fig. 2. This does not require the image to be physically realistic. For example, technical drawings can provide functional realism. For the task of object detection, it was found that 2.3 Synthetic domain data and training a high level of photo realism is not required for high per- formance [16]. 
While this was shown for the task of object Here, the methods for the first step of training depth estima- detection it is unknown for other tasks, such as monocular tion within the synthetic cystoscopy domain are outlined. depth estimation. The scene lighting has a drastic effect on physical cues 2.3.1 Data generation for depth estimation in an endoscopic environment, which was a driving factor in [10]. The light source within an The tool of choice for rendering images of the synthetic envi- endoscopic environment is typically attached to the camera and, therefore, moves along with it. To capture illumination ronment is extremely important and directly influences the quality of the results. The generated images should be as real effects as best as possible, ray trace rendering is preferred over rasterization for creating synthetic images in order to as possible. The tool used in this work to create the syntheti- cally rendered images is the 3D rendering software Blender model the light transport accurately capturing effects such 1 3 Biomedical Engineering Letters as shadowing in a realistic way. Additionally, for depth include: black circular mask generation, random color jitters, ◦ ◦ estimation, the general shapes and sizes of objects need to random translations, and random rotations from 0 to  360 . be accurately represented. For this, all models need to be The circular black mask is necessary as this information created within the bounds of physically possible features cannot be removed in the real images. Samples of these are seen during a cystoscopy. This is accomplished by utilizing seen in Fig. 6. Simultaneously to the color image rendering, reconstructions of actual patient bladders taken from CT depth maps are rendered out and matching transformation scans, a 3D imaging technique, from the study [17]. Exam- augmentations are applied accordingly. ples are seen in Fig. 3 where it is also possible to see that the human bladder is a very irregularly shaped organ as the only consistent feature between the scans is that they are singular, 2.3.2 Supervised training closed volumes. The lights are simulated as two conical light sources placed on each side of the camera, similar to [10], to The training for the synthetic domain is very straightfor- simulate a typical endoscope. ward. The goal of the network is to predict a depth map Due to the voxel resolution of the scanning method that D for a given synthetic image I . It is a standard super- generated the models the resulting models’ surfaces needed vised learning problem with the caveat that the depth loss is to be smoothed. Features such as divercula or polyps are, calculated at each level of the multi-resolution decoder. In therefore, not represented. In addition, the walls in an actual order to do this at the pixel level, the same technique as in bladder tend to be more wrinkly. To account for these miss- [2] is used, namely upsampling with bilinear interpolation, ing features, additional geometry modifications are per - to reach the ground truth depth image D dimensions. The formed to randomly add fake polyps and a Perlin noise dis- BerHu loss [18] placement texture across the model’s surface. Examples of ∗ ∗ D − D  if D − D  ≤ c these modifications are shown in Fig.  4 next to similar real L (D, D )= ∗ 2 (2) BerHu (D−D ) +c images. if D − D  > c 2c To avoid poor generalization due to the uniform textur- ing (default blender material) shown in Fig. 
4, additional is used as it has been shown to outperform standard regres- sion losses such as the L or L loss. The threshold materials are used to represent a closer color representation 1 2 to the real images including blood vessel-like structures. c = max D − D  is defined as a fixed fraction of the i i Translucent subsurface scattering is also enabled for this maximum absolute difference for any pixel between the texture to better represent the optical properties of human ground truth and prediction. Since the depth maps should be tissue. A final touch of randomly generated texture bright - locally similar as the tissue is generally smooth and con- ness helps to make the model learn the difference between nected (excluding situations such as occlusion), an addi- the reflective properties and shadows. These modifications tional loss is calculated, namely a gradient loss are shown in Fig. 5. 1 2 2 Camera pose generation for the rendered images is a sim- L = ∇ y + ∇ y , grad x i y i (3) ple procedure since the bladder is a closed sphere-like vol- i∈N ume. Vectors from the center of volume are randomly gener- where y = log D − log D , and ∇ and ∇ denote the image i i x y ated and a randomized distance from the intersection of the i gradients in horizontal and vertical directions for the number bladder wall provides the position of the camera. The view- of valid pixels N. The loss term penalizes high image gra- ing direction is then varied up to 30◦ from the intersecting dients of the difference between the prediction and ground vector. The final augmentations come post rendering in the truth in log scale. This produces more accurate gradients in form of more standard image modification methods. These Fig. 3 Anatomically accurate 3D bladder models with different filling states obtained by CT scans 1 3 Biomedical Engineering Letters Real Endoscopic Image Before Augmentation AfterAugmentation Fig. 4 Bladder model geometry augmentations. The 3D bladder mod- image shows model after augmentation with either added bodies or els are modified to cover tissue effects such as: (top) polyps, (bot- added Perlin noise displacement. Note: the model images here are tom) bumpy bladder walls. Left image shows general tissue effect to rendered using Phong shading, so it only appears that the simulated be simulated, middle image shows model before augmentation, right polyp is floating in space even though it is not Fig. 5 Bladder texture modifications left to right: Bladder base color, artificial blood vessels, artificial vessels and randomized texture brightness values, and real image of blood vessels for comparison the depth prediction without degrading the L2 regression summed across each decoder level with l = 4 as the lowest loss [19]. resolution decoder output. Here, u is the bilinear interpola- The total resulting loss for the synthetic domain training tion upsampling to reach the image resolution of D. is given as 2.4 Domain transfer data and training ∗ ∗ ∗ L(D, D )= c L (D, u(D )) + c L (D, u(D )) (4) 0 BerHu 1 grad l l l=1 As is done for the synthetic domain, first an overview of the with sensitivity tuning coefficients c and c between the 0 1 data used is provided, followed by the training procedure for two loss components. The individual loss components are accomplishing the task. 1 3 Biomedical Engineering Letters Fig. 
2.4 Domain transfer data and training

As is done for the synthetic domain, first an overview of the data used is provided, followed by the training procedure for accomplishing the task.

2.4.1 Data acquisition

The dataset for domain adaptation consists of 17 standard cystoscopic videos with an average frame rate of 25 frames per second. The videos comprise both normal diagnostic checks and trans-urethral resections of tumors. Most of these videos were recorded using analog equipment, so before processing, a standard deinterlacing algorithm (YADIF) is run on the associated videos. The videos are then sampled every 5 frames to generate the initial raw data set.

Before the images can be used, they are filtered to exclude irrelevant ones, including frames where the endoscope is outside the body, over-exposed frames, and images that are too blurry or dark. This process is automated by first finding a fitting circular mask and then using tools such as a red threshold (for inside the bladder), Laplacian variance (blurriness), and a general brightness threshold. Further, more advanced filtering could be done, including using a neural network classifier to exclude images with bubbles, as these are not a part of the actual depth of the scene. Figure 7 shows some examples of included and excluded cystoscopic images.

Fig. 7 Frame selection from the clinical cystoscopy dataset. From left to right: included image for training, excluded blurry image, and excluded image with the camera outside the body
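To make the filtering criteria concrete, the following is an illustrative OpenCV version of the automated frame filter; the circular mask fitting is omitted, and all thresholds are placeholder values, not those used in the study.

```python
import cv2
import numpy as np

def keep_frame(bgr: np.ndarray,
               red_thresh: float = 60.0,
               blur_thresh: float = 50.0,
               bright_range: tuple = (30.0, 220.0)) -> bool:
    """Return True if a (pre-masked) BGR frame should enter the dataset."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    # Red dominance: inside the bladder the scene is strongly red-tinted.
    if bgr[..., 2].mean() < red_thresh:
        return False          # likely outside the body
    # Variance of the Laplacian as a sharpness measure (low = blurry).
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:
        return False
    # Global brightness window rejects over- and under-exposed frames.
    return bright_range[0] < gray.mean() < bright_range[1]
```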
2.4.2 Adversarial training

After the network is trained to predict depth from the synthetic images, the network weights are frozen. Two copies of the encoder are used during adversarial training: one, $F_S$, is left unchanged, and the other, $F_R$, receives the gated residual blocks for domain transfer. It is worth noting here that the batch normalization statistics throughout the encoder are also frozen. This decision comes from a separate experimental investigation, which found that doing so yielded a better overall depth error after adversarial training. The training approach here follows the scheme in [2], where the decoder is not included in the adversarial training and instead the encoder is forced to learn latent vectors at the lower three levels similar to those output from the synthetic training. This is shown in Fig. 8. The individual discriminators $A_i$ with $i \in \{3, 4, 5\}$ also use the same PatchGAN structure as in [2].

The standard GAN training as proposed in [20] is used with the adversarial objective function

$$L_A = \sum_{i \in \{3,4,5\}} \mathbb{E}_{I_S \sim X_S}\big[\log A_i\big(F_{S_i}(I_S)\big)\big] + \mathbb{E}_{I_R \sim X_R}\big[\log\big(1 - A_i\big(F_{R_i}(I_R)\big)\big)\big], \tag{5}$$

where $I_R$ is an image from the real domain and $I_S$ an image from the synthetic domain.

Fig. 8 Adversarial domain adaptation scheme similar to that in [2]. The encoder $F_R$ for the real domain is initialized with weights from $F_S$ and includes the added residual blocks. Adversarial training is then performed where $F_R$ acts as a conditional generator that takes image $I_R$ as input. Discriminators $A_i$ are applied at the skip connections and trained to distinguish $F_S(I_S)$ from $F_R(I_R)$
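To make the adaptation mechanism concrete, the following is a minimal PyTorch sketch of a gated residual block and the per-level terms of Eq. (5). The scalar gate α initialized to zero (in the spirit of ReZero [13]) means the adapted encoder $F_R$ initially reproduces $F_S$ exactly; the block layout and the assumption that each discriminator $A_i$ ends in a sigmoid are illustrative, not taken from the released implementation.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose contribution is faded in by a learnable gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # alpha = 0 at initialization: F_R starts identical to F_S, and
        # domain-specific features are blended in smoothly during training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.body(x)

def discriminator_loss(A_i, feat_syn, feat_real, eps=1e-8):
    """One level i of Eq. (5); A_i is assumed to output probabilities."""
    real_term = torch.log(A_i(feat_syn) + eps).mean()
    fake_term = torch.log(1.0 - A_i(feat_real) + eps).mean()
    return -(real_term + fake_term)      # discriminator maximizes Eq. (5)

def generator_loss(A_i, feat_real, eps=1e-8):
    """Gated encoder update: make real features look synthetic to A_i."""
    return -torch.log(A_i(feat_real) + eps).mean()
```

Because only the added gated blocks receive gradients while the synthetic encoder stays frozen, $F_S$ remains available as a stable reference throughout the adversarial training.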
3 Results

For the synthetic data, 5000 images per bladder model and material (textured and non-textured) were rendered. Therefore, 10000 images per bladder model are available with randomized viewpoints, viewing directions, light intensities, etc. Two of the 38 bladder models are used each for the validation and test sets. This amounts to 340000 images for the train set (90%) and 20000 images each for the test and validation sets (5% each). For the domain adaptation, approximately 16600 real cystoscopy images without ground truth labels are used. The images from both datasets are fed to the network at their original resolution of 256 × 256 pixels.

The synthetic data training achieved the lowest validation root mean square error (RMSE) after 22 epochs, at 0.878 mm. The weights acquired after this epoch were saved and used for the domain adaptation. To ensure the gating of the domain adaptation functioned as expected, the adversarial training was performed as presented in Sect. 2.4.2 and also repeated with the gating removed. The resulting distributions of the gate coefficients can be seen in Fig. 9. Sample results after training the domain adaptation with gating are shown in Fig. 10. A prediction using the same image before any adaptation is provided as well (right depth plot). To illustrate what changes between the two depth predictions after domain adaptation training, a difference plot between them is also provided.

Fig. 9 Gate values for the adaptable gates during adversarial training (top), with discriminator (middle) and generator (bottom) training losses plotted over training time in hours. The α values correspond to the adaptive gating (light blue) training shown in the bottom two loss plots, while the dark blue values track the loss for training without any gates included after the residual blocks. It is seen that the gating provides a smoother transition to a stable balance between the generator and discriminator. This subsequently results in better predictions

In contrast to the results in Fig. 10, when the network is trained with the gates removed (the ungated training curves in Fig. 9), the network almost immediately suffers from a mode collapse and struggles to maintain any information from the provided input image. Instead, only a feasible texture is predicted that can fool the discriminators, and the network becomes stuck at this point. The resulting prediction is seen in Fig. 11.

4 Discussion

As expected, the convergence of the adversarial training to a stable balance between discriminator and generator training losses takes longer with the gating, but it exhibits a much more stable trajectory to the equilibrium in comparison to not using gating. By using adaptive gating, the possibility of needing to restart training due to complete divergence of the network is avoided, which can happen very often when training adversarial networks. In Fig. 10 it is possible to see that including the domain adaptation (left depth plot) produces a more reasonable depth map for the given image. The adversarial training also appears to improve the network's ability to distinguish shadows from darker texture. The checkerboard effect, which often appeared when directly applying the synthetic network to real images, is effectively eliminated, creating a smooth, more continuous depth estimation.

Fig. 10 Examples of (perceived) improved depth estimation over synthetic training by using domain adaptation. From left to right: input image, domain adapted depth prediction, unadapted depth prediction using the network trained on synthetic data, and difference plot between the two depth predictions. Units are provided in mm, and the depth predictions in the second and third columns both use the same scale found in their respective rows

Fig. 11 Example of mode collapse during adversarial training for domain adaptation when not using adaptive gating. Units are provided in mm

4.1 Limitations

It is clear that the proposed domain adaptation through the gated residual blocks accomplishes the intended tasks set forth in this work. Unfortunately, however, as there is no representation for objects such as the resection cutting loop in the current synthetic domain, and as the simulated polyps do not differentiate much from their surrounding texture, the network struggles to handle images containing these features. Examples of this are seen in Fig. 12. It is apparent that the network relies purely on the brightness of the tool to determine its distance from the camera, and it misses the fact that the polyp is not flat, since the polyp does not cast a visible shadow in the given image. These problems should be avoidable by including sample images of this kind in the synthetic domain, such that the network can learn to map the distinct depth structure of the tool to the latent vector space and to treat drastically differing local texture as an indication of a different structure.

Fig. 12 Examples of unimproved depth estimation. Units are provided in mm

5 Conclusion
In this work, an improvement on using deep neural networks for monocular depth estimation in cystoscopy was achieved using a two-step training approach to limit the problem to a domain transfer between a synthetic and a real domain. This was done for the cystoscopic environment inside the human bladder, and the work included the construction of a pseudo-realistic bladder environment for the creation of synthetic camera images. Real cystoscopic videos were used for adversarial training to transfer the depth estimation capabilities from the synthetic domain to the real one. The training for this was stabilized by restricting the domain adaptation to newly added residual blocks, each with a learnable gating parameter. Results showed an improvement in feasible depth estimations once a domain transfer was done; however, this only worked in scenarios where the synthetic domain was able to provide a similar scene. With these results, it can be concluded that the methods shown enabled depth estimation in a cystoscopic environment and provided a more stable approach to the adversarial training for domain adaptation.

Acknowledgements This work was sponsored by the Graduate School 2543/1 "Intraoperative Multisensory Tissue Differentiation in Oncology" (project ID 40947457) funded by the German Research Foundation (DFG - Deutsche Forschungsgemeinschaft). This work was also supported in part by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A.

Author contributions All authors contributed to the study conception and design. Material preparation was performed by Johannes Zahn. Video data collection was performed by Niklas Harland and Simon Walz. The first draft of the manuscript was written by Johannes Zahn and Peter Somers under the guidance of Simon Holdenried-Krafft. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding Open Access funding enabled and organized by Projekt DEAL.

Code availability The source code for this work will be made available at https://github.com/cgtuebingen/cystoscopy_depth.

Declarations

Conflicts of interest The authors have no relevant financial or non-financial interests to disclose.

Ethics approval Ethics approval was obtained from the ethics commission at the University Hospital of Tübingen (January 25, 2022, project number 583/2021BO1).

Informed consent Informed consent was obtained from all individual participants included in the study.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Schüle J, Haag J, Somers P, Veil C, Tarín C, Sawodny O. A model-based simultaneous localization and mapping approach for deformable bodies. In: 2022 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pp. 607–612 (2022). https://doi.org/10.1109/AIM52237.2022.9863308
2. Karaoglu MA, Brasch N, Stollenga M, Wein W, Navab N, Tombari F, Ladikos A. Adversarial domain feature adaptation for bronchoscopic depth estimation. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol 12904, pp. 300–310. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-87202-1_29
3. Li S, Liu CH, Lin Q, Wen Q, Su L, Huang G, Ding Z. Deep residual correction network for partial domain adaptation. IEEE Trans Pattern Anal Mach Intell. 2021;43(7):2329–44. https://doi.org/10.1109/tpami.2020.2964173
4. Ullman S. The interpretation of structure from motion. The Royal Society; 1979.
5. Schönberger JL, Frahm J-M. Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
6. Luo X, Huang J-B, Szeliski R, Matzen K, Kopf J. Consistent video depth estimation. arXiv (2020). https://doi.org/10.48550/ARXIV.2004.15021
7. Mayer N, Ilg E, Häusser P, Fischer P, Cremers D, Dosovitskiy A, Brox T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048 (2016). https://doi.org/10.1109/CVPR.2016.438
8. Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N. Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248 (2016). https://doi.org/10.1109/3DV.2016.32
9. Kundu JN, Uppala PK, Pahuja A, Babu RV. AdaDepth: unsupervised content congruent adaptation for depth estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2656–2665 (2018). https://doi.org/10.1109/CVPR.2018
10. Mahmood F, Durr NJ. Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. Med Image Anal. 2018;48:230–43. https://doi.org/10.1016/j.media.2018.06.005
11. Aitken AP, Ledig C, Theis L, Caballero J, Wang Z, Shi W. Checkerboard artifact free sub-pixel convolution: a note on sub-pixel convolution, resize convolution and convolution resize. CoRR abs/1707.02937 (2017). arXiv:1707.02937
12. Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R, Vinyals O, Zisserman A, Simonyan K. Flamingo: a visual language model for few-shot learning. arXiv (2022). https://doi.org/10.48550/ARXIV.2204.14198
13. Bachlechner T, Majumder BP, Mao HH, Cottrell GW, McAuley J. ReZero is all you need: fast convergence at large depth. In: Thirty-Seventh Conference on Uncertainty in Artificial Intelligence. arXiv (2020). https://doi.org/10.48550/ARXIV.2003.04887
14. Blender Development Team. Blender 3.1.0 (2022). https://www.blender.org/download/releases/3-1/. Accessed 20.04.2022
15. Peddie J. Ray tracing: a tool for all. Cham: Springer; 2019.
16. Rajpura PS, Hegde RS, Bojinov H. Object detection using deep CNNs trained on synthetic images. arXiv 2017. https://doi.org/10.48550/arXiv.1706.06782
17. Rister B, Yi D, Shivakumar K, Nobashi T, Rubin DL. CT-ORG, a new dataset for multiple organ segmentation in computed tomography. Sci Data. 2020;7(1):381. https://doi.org/10.1038/s41597-020-00715-8
18. Zwald L, Lambert-Lacroix S. The BerHu penalty and the grouped effect. arXiv: Statistics Theory 2012. https://doi.org/10.48550/arXiv.1207.6868
19. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658 (2015). https://doi.org/10.1109/ICCV.2015.304
20. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014). https://doi.org/10.48550/ARXIV.1406.2661

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
