Semantic segmentation with deep learning: detection of cracks at the cut edge of glass

Glass Struct. Eng. (2021) 6:21–37
https://doi.org/10.1007/s40940-020-00133-7

RESEARCH PAPER

Michael Drass · Hagen Berthold · Michael A. Kraus · Steffen Müller-Braun

Received: 10 May 2020 / Accepted: 20 August 2020 / Published online: 17 September 2020
© The Author(s) 2020

Abstract In this paper, artificial intelligence (AI) will be applied for the first time in the context of glass processing. The goal is to use an algorithm based on artificial intelligence to detect the fractured edge of a cut glass in order to generate a so-called mask image by AI. In the context of AI, this is a classical problem of semantic segmentation, in which objects (here the cut edge of the cut glass) are automatically outlined, or detected and drawn, by AI. An original image of a cut glass edge is fed into a deep neural net and processed in such a way that a mask image, i.e. an image of the cut edge, is automatically generated. Currently, this is only possible by manually tracing the cut edge, because the crack contour of glass can sometimes only be recognized roughly. After manually marking the crack using an image processing program, the contour is then automatically evaluated further. AI and deep learning may provide the potential to automate the step of manual detection of the cut edge of cut glass to a great extent. In addition to the enormous time savings, the objectivity and reproducibility of detection is an important aspect, which will be addressed in this paper.

Keywords Deep learning · Semantic segmentation · Cut-edge glass · U-Net · UXception

M. Drass · M. A. Kraus
M&M Network-Ing UG (haftungsbeschränkt), Darmstadt, Germany
e-mail: drass@mm-network-ing.com
e-mail: drass@ismd.tu-darmstadt.de

M. Drass · H. Berthold · M. A. Kraus · S. Müller-Braun
TU Darmstadt - Institute for Structural Mechanics and Design, Darmstadt, Germany

1 Introduction

1.1 Cutting of glass

In the production and further processing of annealed float glass, the glass panes are usually brought into the required dimensions by a cutting process. In a first step, a fissure is generated on the glass surface by using a cutting wheel. In the second step, the cut is opened along the fissure by applying a bending stress. This cutting process is influenced by many parameters (Müller-Braun et al. 2020). The edge strength in particular can be reproducibly increased by a proper adjustment of the parameters (Ensslen and Müller-Braun 2017). Furthermore, it could be observed that due to different cutting process parameters, the resulting damage to the edge (the crack system) can differ in its extent (Fig. 1). In addition, this characteristic of the crack system can be brought into a relationship with the strength (Müller-Braun et al. 2018). In particular, it has been found that when the edge is viewed perpendicular to the glass surface (Fig. 2), the so-called lateral cracks, which can be observed here, allow the best predictions for strength (Müller-Braun et al. 2020). The challenge here is to detect these lateral cracks in an accurate way. Currently, this is only possible by manual tracing, because the crack contour can sometimes only be recognized roughly (Fig. 2). After manually marking the crack using an image processing program, the contour is then automatically evaluated further. Methods of Artificial Intelligence (AI), and especially the algorithms from the field of AI in computer vision, may provide the potential to automate the step of manual detection as well. In addition to the enormous time savings, the objectivity and reproducibility of detection is an important aspect.

Fig. 1 View on the cut edge of two 4 mm thick glass specimens, a slight crack system, breaking stress: 78 MPa, b more pronounced crack system, breaking stress: 53 MPa

Fig. 2 Lateral crack to be detected: the crack contour can be difficult to identify

1.2 Artificial intelligence, machine and deep learning

In this section, we define the essential terms of Artificial Intelligence, Machine Learning and Deep Learning in order to bring them closer to the reader. Readers who are familiar with these terms may skip this section.

Artificial intelligence (AI) is the overall term for all subsequent developments, algorithms, forms and measures in which artificially intelligent action occurs. AI is dedicated to the theory and development of computational systems capable of performing tasks that normally require human intelligence, such as visual perception, speech recognition, decision making and translation between languages.

Machine learning (ML) is a sub-class of AI, which enables systems to learn from given data and not to execute or make actions, commands or decisions by explicit programming. The aim of ML is to generate artificial knowledge from experience (the data). A basic premise, however, is that the knowledge gained from the data can be generalized and used for new problem solutions, for the analysis of previously unknown data or for predictions on data not measured (prediction) (Kraus 2019b; Bishop 2006). Excellent examples for classical machine learning are linear classifiers (e.g. linear/logistic regression, linear discriminant analysis), kernel methods (e.g. support vector machines), tree-based methods (e.g. decision trees, random forests), non-parametric regression (e.g. nearest neighbors, local kernel smoothing), etc. (Kraus 2019a; Kraus and Drass 2020b).

Deep neural networks (DNN), or deep learning, use so-called artificial neural networks to recognize patterns and highly non-linear relationships in data. A deep neural network (DNN) is based on a collection of connected nodes (the neurons), which fictitiously resemble the human brain (cf. Fig. 5). Due to their ability to reproduce and model nonlinear processes, deep neural networks have found applications in many areas (Voulodimos et al. 2018). These include material modeling and development, system identification and control (vehicle control, process control), pattern recognition (radar systems, face recognition, signal classification, 3D reconstruction, object recognition and more), sequence recognition (gesture, speech, handwriting and text recognition), medical diagnostics, social network filtering and e-mail spam filtering.

To give the interested reader an even better introduction to the application of Artificial Intelligence in practical relation to structural glass engineering and processing, we refer to the current review article by Kraus and Drass (2020a).

1.3 Problem statement and methodology

The problem is whether it is possible, with the help of AI, first to recognize the fracture pattern of a cut glass edge and second to process it in such a way that AI automatically creates an image, a so-called mask image of the cut edge, without explicitly programming it.

Methodologically, this problem is to be solved with the help of so-called semantic segmentation using deep convolutional networks. This method is preferred because it has wide application for describing and solving image segmentation problems. Image segmentation means creating a contour plot around a specific object, such as a ball, in an image. The goal here is to automatically generate a high-resolution image of the fracture pattern of the cut glass edge, which can be used later to perform statistical analysis regarding the characteristics of the cut glass edge.

Since this paper is the first time that AI algorithms are applied to a specific problem of glass processing, Chapter 2 gives a brief introduction into AI and deep learning. Since we want to focus on the problem of semantic segmentation with the help of Deep Learning, i.e. the task of classifying each and every pixel in an image into a class (here cut glass edge), we will introduce semantic segmentation briefly in Chapter 3. Chapter 4 will then deal specifically with the task of semantic segmentation using images of cut glass edges to determine the contour of the broken edge.

2 Fundamentals on deep learning

This section aims at explaining deep learning and summarizing its main features to help the reader understand the algorithms that follow. Readers who are familiar with Deep Learning may skip this section.

To understand how a neural network works, the example of a so-called multi-layer perceptron (MLP) is explained below. An MLP consists of at least three node layers: an input layer, a hidden layer and an output layer. Within the layers there are neurons, each with different tasks. With the exception of the input nodes, each node is a neuron that uses a nonlinear activation function. For training, the MLP uses a technique of supervised learning called backpropagation to adjust the weights w of the neural network. An MLP is hence a mathematical composition of nonlinear functions of two or more neurons via an activation function. This particular nonlinear nature of NNs is thus able to identify and model nonlinear behaviors, which may not at all or not properly be captured by other ML methods such as regression techniques or PCA. Despite the biological inspiration of the term neural network, an NN in ML is a pure mathematical construct which consists of either feedforward or feedback (recurrent) networks. The neurons at the outputs are called output neurons; the layers in between the input and output neurons are called hidden layers. If there are more than three hidden layers, this NN is called a Deep NN. The development of the right architecture for an NN or Deep NN is problem dependent and only few rules of thumb exist for that setup (Bishop 2006; Frochte 2019; Kim 2017; Paluszek and Thomas 2016).

Fig. 3 Presentation of selected input images of the MNIST datasets (Chen et al. 2018)

Fig. 4 Example of a possible NN architecture for the MNIST dataset

To describe the structure and mathematical processes of an NN during learning (training), the example of handwriting recognition based on the MNIST dataset is used for reasons of comprehensibility. This is about predicting the written numbers (0–9) based on input images without writing an explicit algorithm doing that task. The MNIST database contains 60,000 training images and 10,000 testing images. An example of several input images is summarized in Fig. 3. Each single image has a resolution of 28 × 28 pixels, which are then used as input within the NN. Therefore, the first layer of the NN consists only of the image data, which is fed into the network as initial information. Note that the input image is not read in as an image but as a vector x with a length of 28 × 28 = 784. The values within the input vector of each image are the corresponding gray values of the image. Each value in the input layer corresponds, in the context of the NN, to a single input neuron, which in turn is connected to all following neurons of a hidden layer. The last layer in the NN is the so-called output layer. In this output layer, the NN makes a prediction with respect to which number was identified in a corresponding input image. A typical NN architecture for the problem of handwriting image recognition is displayed in Fig. 4.

As can be seen in Fig. 4, the input vector x has 784 entries, each of which is connected to the neurons of the hidden layer. The output layer is defined as a vector that has exactly 10 entries and predicts the handwriting recognition solution. To understand more precisely how the training of a neural network works, only one neuron of the first hidden layer is shown in the following figure.

Fig. 5 Example of a single neuron for the MNIST dataset

Each neuron receives a set of values (numbered from 1 to i) as an input and computes, within the hidden layers, an activation signal a as output. In the final layer, the output layer, the neurons compute the predicted value ŷ. The vector x actually contains the values of the features in one of m examples from the training set. Just to remember, the MNIST dataset contains 60,000 training images, hence m is equal to that number. In addition, each unit has its own set of parameters, usually referred to as w (column vector of weights) and b (bias), which change during the learning process. During each iteration, the neuron computes a weighted average of the values of vector x, based on its current weighting vector w, and adds a bias b. For each neuron of the hidden layer, the weighted average is calculated as

$$ z = w \cdot x + b. \tag{1} $$

Finally, the result of this calculation is passed through a nonlinear activation function called g. Activation functions are one of the key elements of the neural network. Without them, our neural network would become a combination of linear functions, so it would itself be only a linear function.

Fig. 6 Example of the back propagation algorithm to optimize the loss function

After the first feed-forward run through the NN, a so-called loss function must be computed. The basic source of information about the state of learning is the value of the loss function. In general, the loss function should show how far one is from the "ideal" solution. Since there are many different functions for describing the loss function L, the mean absolute value of the loss function J is generally described by

$$ J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{i}, y^{i}\right). \tag{2} $$
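The single-neuron computation of Eq. (1) and the averaged loss of Eq. (2) can be illustrated with a minimal Python sketch. The text leaves the activation g and the per-example loss L generic, so the sigmoid and the squared error used here are assumptions made only for illustration, and the function names `neuron_forward` and `mean_loss` are hypothetical.

```python
import math


def neuron_forward(w, x, b):
    """Eq. (1): weighted sum z = w . x + b, followed by an activation a = g(z)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    a = 1.0 / (1.0 + math.exp(-z))  # sigmoid, an assumed choice for g
    return z, a


def mean_loss(y_hat, y):
    """Eq. (2): J = (1/m) * sum over the m examples of L(y_hat_i, y_i)."""
    m = len(y)
    # squared error, an assumed choice for the per-example loss L
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / m


# Toy input with three features instead of the 784 gray values of an MNIST image
x = [0.0, 0.5, 1.0]
w = [0.2, -0.4, 0.1]
b = 0.1
z, a = neuron_forward(w, x, b)

# Loss averaged over m = 2 toy examples (prediction vs. true value)
J = mean_loss([1.0, 0.0], [1.0, 1.0])
```

In a full network the same two steps would run for every neuron and every training image, which is why the weight vector w becomes a matrix W and the scalar bias b becomes a vector.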
At this point it should be noted that all neurons of all hidden layers are now considered, so that a matrix of weightings W is calculated from the weighting vector of a single neuron. The scalar of bias b is now also turned into a vector b for all neurons. For the calculation of the mean absolute value of the loss function J all images with the number m are used. The index i corresponds to the length of the input vector x, so the loss function L is a function of the calculated predictor $\hat{y}^{i}$ and the true value $y^{i}$.

Now that the loss function is defined, and we have arrived at the output of the NN in the first iteration of the feed-forward procedure, we have to minimize the loss function by adjusting the matrix of weightings W and the bias vector b. For this purpose, calculus is used and the gradient descent method is applied to find a functional minimum. Throughout each iteration the value of the partial derivatives of the loss function with respect to each of the parameters of the neural network will be computed. The adjusted parameters of the neural network are computed with

$$ W = W - \alpha \, dW \tag{3} $$

and

$$ b = b - \alpha \, db. \tag{4} $$

In the above equations, the learning rate is represented by α, which is a hyperparameter that allows the user to control the value of the adjustment performed. Choosing the learning rate is essential: if it is too low, the NN learns very slowly; if it is too high, the NN cannot reach the minimum. dW and db are calculated using the chain rule, as partial derivatives of the loss function with respect to W and b. The size of dW and db is the same as that of W and b, respectively. Figure 6 shows the sequence of operations within the neural network. It clearly shows how forward and backward propagation interact to optimize the loss function (Fig. 5).

3 Semantic segmentation with deep learning

Today, in the context of computer vision and deep neural networks, the topic of image classification is widely known. Typically, in image classification one tries to classify images based on their visual content. For instance, the classification algorithm of an image can be designed to detect whether an image contains a car, animal, point fixing for facades, etc. or not. Whilst the recognition of an object is trivial for humans, for computer vision applications, robust image classification is still a challenge. An extension of this deals with so-called object detection, in which objects within an image are enclosed by a frame or box. The resolution of object detection is relatively coarse, so in some cases it is desirable to detect the exact contours of objects. Semantic segmentation, in contrast, is the task of classifying each individual pixel in an image into a specific class. Typically, in the task of semantic segmentation, image data is read in and evaluated in such a way that an object to be found is segmented or bordered by a so-called mask. An example of the algorithms listed above is shown in Fig. 7 for the example of the detection of point fixings in a facade.

Fig. 7 Example of image classification, object detection, semantic segmentation and instance segmentation for a facade

In summary, one can say that the greatest information content in an image classification problem is obtained via semantic segmentation. In the past it has been shown that neural networks have become established especially in image classification, because they are very well able to recognize (hidden) patterns in images. These hidden patterns can also be named features, which should be recognized and learned by the neural network in order to make robust predictions on the input images.

4 Semantic segmentation on cut-edge of glass

In the following, the method of semantic segmentation is applied to images of cut edges of glasses. The aim is to detect and trace fracture edges of images of cut glass using the method of semantic segmentation. With the successful application of this method it will be possible in future to statistically evaluate images of broken glass edges in terms of the fracture pattern, crack branching system and their crack width without having to invest a lot of time in image processing and post-processing, which was very time-consuming in the past.

4.1 Data: original input–output

The objects of interest are conventional 8 mm thick annealed soda-lime glass panes, which were industrially cut using a classical carbide cutting wheel. The edges of these panes were photographed perpendicular to the glass surface, with the edge always positioned in the centre of the image (Fig. 8 (top)). Accordingly, this procedure always ensures the same boundary conditions when capturing the images of the cut edge. The input images were taken with a camera with the model number U3-3890CP-M-GL Rev.2 from IDS Imaging Development Systems GmbH. This camera has a monochrome sensor with a resolution of 4000 × 3000 px. As lens, the LM25JC10M of the company Kowa Optimed Deutschland GmbH was used. With a focal length of 25 mm, this allows macro photography with a minimum distance of the object to the lens of 100 mm. Thus, a very high level of detail could be realized with a relatively large image section at the same time. The scale of the input images is 1/8.696 px/µm. For the illumination, a homogeneous white LED transmission light with the model code TH2-63X60SW from CCS Inc. (OPTEX GROUP CO., LTD.) was used together with the Digital Control Unit PD2-3024(A). The images were taken with the lowest brightness level. For further processing, the file size of the images was reduced by trimming the upper and lower 1250 px, as the image information in these areas is irrelevant for the investigations. Thus, image files with a size of 4000 × 500 px were generated.

In a further step, the visible lateral cracks were marked manually using the image processing program Adobe Photoshop. The crack contour was first traced with a 1 px thick line. Afterwards a 1 px straight line was positioned at the edge of the glass pane. Finally, the resulting space between the two lines was filled with black. The result is the output mask shown in Fig. 8 (bottom).

Fig. 8 Illustration of the original images and the created mask of the cut edge

This kind of processing of the individual input images is very tedious and time-consuming, so the goal is to develop an automatic generation of mask images via AI.

4.2 Data: preparation and augmentation

4.2.1 Data preparation

The database is formed by four images with a pixel size of 500 × 4000 px. The color space has already been reduced manually to only have gray values within the images. From the previous investigations it became clear that a square section of 192 × 192 px is a sufficient dimension to represent the size we are looking at to predict the cut edge of the cut glass. Since the currently available database had to be created manually with a considerable amount of time, there is currently only little data available in the form of input and mask images. Therefore the four input images were prepared by a slicing process in such a way that about 4000 input and mask images were obtained per image. The principle is shown in Fig. 9. Here, the window (192 × 192 px) is always shifted by 1 px. This procedure of overlapping slicing was performed for the original input images and the corresponding mask images in order to obtain a coherent data set for the semantic segmentation.

Fig. 9 Schematic illustration of the horizontal overlapping shifting

4.2.2 Data augmentation

Supplementing data is a strategy that allows practitioners to significantly increase the variety of data available for training models without actually collecting new data. Data augmentation techniques such as cropping, zooming, and distortion are often used to train large neural networks. However, most approaches used in neural network training use only basic types of data augmentation. While neural network architectures have been studied in detail, less emphasis has been placed on discovering strong types of data augmentation and data expansion strategies that capture data invariants (Shorten and Khoshgoftaar 2019).

It is important to note that data augmentation is an important tool to artificially increase the size of the data set. Typically, many image classification tasks have shown that data augmentation can produce better results/accuracies, but it is also important to note that it only increases the sample size, not necessarily the information within the images, since it is still the same image. In the case of semantic segmentation, where an original image and a mask or contour image are present, the data augmentation must take place equally on both images. To demonstrate this clearly, we refer to the following example of a skyscraper, which was deliberately chosen as an illustrative example to show the reader what the respective augmentation algorithm does with the respective input and corresponding mask image. The example shows different approaches of data augmentation in the context of semantic segmentation in Fig. 10, where the skyscraper of the Swiss Re, London should be classified and segmented. In the augmented images, the original and masking images were skewed, zoomed, sheared and flipped, the brightness was changed, a Gaussian distortion filter was applied and finally a random erasing was applied. To augment the images, the program Augmentor was used, a Python package that supports the augmentation and artificial generation of image data for machine learning tasks. It is primarily a data magnification tool, but also contains basic image pre-processing functions.

Fig. 10 Different examples for data augmentation of the skyscraper Swiss Re, London

For the present task of predicting the cut glass edge via neural networks in the context of semantic segmentation, both the original and the mask image are equally augmented. Not an arbitrary number of possibilities of data augmentation was used; only the methods of zooming in, shearing and distortion were applied to the input and mask images. This is absolutely necessary in order not to teach the neural network too many parameters and variabilities in the data set. The limited data augmentation is justified by the fact that for the problem at hand the input or original images are always acquired under the same boundary conditions, so that there is no great variability in the input data. Nevertheless, data augmentation ensures that in the case of non-uniform acquisition of the cut glass edge images, good prediction by the NN is still possible.

4.3 Deep neural network models

Before we begin with an introduction to the models of deep neural networks used in this study to solve the problem of semantic segmentation of the cut edge of cut glass, we first explain the main operations typically used in these particular neural networks. Then we move on to the description of the special neural network called U-Net, which is well suited for problems of semantic segmentation. Furthermore an extension of the U-Net with the Xception net is presented, which is based on a technique called transfer learning. For both NNs used for the prediction of mask images of cut glass edges all hyperparameters used are explained and described in detail. Additionally, it is shown which metrics are used to evaluate the quality of the prediction. Finally, the results are summarized and presented.

4.3.1 Convolutional neural networks: general operations

Convolution operation

Usually the input of a Convolutional Neural Network (CNN) is available as a two- or three-dimensional matrix (e.g. the pixels of a grayscale or color image). Convolutional layers sequentially downsample the spatial resolution of images while expanding the depth of their feature maps. The feature map is the output of one filter applied to the previous layer. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map.

This succession of convolution transformations can create much lower-dimensional and more useful representations of images when compared to what could eventually be done manually. CNN's success has increased interest and optimism in applying deep learning to computer vision tasks (Shorten and Khoshgoftaar 2019).

When programming a CNN, the input is a tensor with shape (number of frames) × (frame width) × (frame height) × (frame depth). After the image has passed through a convolution layer, it is abstracted to a feature map, with shape (number of images) × (feature map width) × (feature map height) × (feature map channels). This is comparable to the reaction of a neuron in the visual cortex to a specific stimulus. Each convolutional neuron processes data only for its receptive field. A receptive field is nothing else than the area in the input volume that a particular feature extractor (filter) considers. The activity of each neuron is calculated by a discrete convolution (hence the addition convolutional). Intuitively, a comparatively small convolution matrix (filter kernel) is moved over the input step by step. The input of a neuron in the convolutional layer is calculated as the inner product of the filter kernel with the currently underlying image section. Accordingly, neighboring neurons in the convolutional layer react to overlapping areas (similar frequencies in audio signals or local environments in images).

Max. pooling operation

Basically, the function of pooling is to reduce the size of the feature map so that there are fewer parameters in the network. The idea is to keep only the important features (max. evaluated pixels) from each region and to throw away the information that is not important. Important means in this context the information that best describes the context of the image. Figure 11 can be used as an example, by selecting the maximum pixel value from each 2 × 2 block of the input feature map to obtain a pooled feature map. Note that the size of the filter and the steps are two important hyperparameters in the max. pooling operation.

Fig. 11 Example of max. pooling operation in a convolutional neural network

In summary, both the convolution operation and especially the pooling operation reduce the size of the image, which is generally referred to as down-sampling. In a typical convolution network, attention should be paid to the fact that the height and width of the image gradually decrease (down-sampling, due to pooling), which helps the filters in the deeper layers to focus on a larger receptive field (context). The number of channels/depth (number of filters used), however, gradually increases, which helps to extract more complex features from the image. On an intuitive level, the conclusion to be drawn from the pooling operation is as follows. Through down-sampling the model understands better "WHAT" is present in the image, but it loses the information "WHERE" it is present.

Transposed convolution operation

Since it is important for the process of semantic segmentation to know where the corresponding information is located in the image in order to obtain a complete high-resolution image in which all pixels are classified, up-sampling is necessary. Hence, transposed convolution is an upsampling technique that expands the size of images. Basically, the original image is slightly padded, followed by a convolution operation. The reason behind the upsampling is to combine the information from the previous layers in order to get a more precise prediction.

An adequate technique here is the so-called transposed convolution method, where the transposed convolution at high level is exactly the opposite of a normal convolution, i.e. the input volume is an image with low resolution and the output volume is an image with high resolution. Therefore, transposed convolution is the preferred choice for performing up-sampling, where the parameters are essentially learned by back propagation to convert a low resolution image to a high resolution image.

4.3.2 U-Net architecture

The U-Net got its name from a U-shaped architecture in which a fully convolutional network is used. There are two paths in the architecture (see Fig. 12). The first is the contraction path (also called encoder), which is used to capture the context in the image. The encoder is just a traditional stack of convolution and max-pooling layers. The second path is the symmetric expansion path (also known as the decoder), which is used to enable precise localization using transposed convolution. It is generally known that the process of dimensional reduction in height and width, which is used throughout the entire convolutional neural network (i.e. the pooling layer), is applied in the form of a dimensional increase in the second half of the model.

The main difference between the U-Net and an autoencoder lies in the fact that autoencoders compress the input data (i.e. images) linearly, which results in a bottleneck in which not all features can be transmitted to the decoding process (Kingma et al. 2019). Hence, information is lost, so that the reconstruction of a mask image, i.e. the segmentation of the image, is hardly possible.

For the problem at hand, namely the segmentation of images of cut edges of glass into the classes breakage (black) and undamaged glass (white), the U-Net architecture is shown in Fig. 12.

Fig. 12 U-Net architecture for the problem of image segmentation of cut glass

The U-Net architecture shown was implemented in Python and then trained based on the data presented in Sect. 4.2.

4.3.3 U-Net+XCeption (UXception)

In the previous section, the U-Net has been explained in detail for the application of semantic segmentation, in this case the generation of a binary image to detect the cut glass edge. The U-Net described above is initially untrained, i.e. it starts without any prior knowledge, and many studies have shown that transfer learning is a good method to increase the performance of an NN. Here, a neural network that has previously been trained on an enormously large database is coupled with the U-Net. Thus, the neural network does not start from scratch, but already has knowledge of what a ball, monkey, plane etc. looks like and is therefore able to segment these objects. This preliminary information within the pre-trained NN is used in transfer learning in order to make predictions for other problems with the aim of achieving a performance gain.

Hence, transfer learning (TL) is a research problem in machine learning that focuses on storing the knowledge gained in solving one problem and applying it to another but related problem. TL shall now be applied to couple a trained deep neural network for image classification, the so-called XCeption (Chollet 2017), with the previously presented U-Net in order to generate improved results in semantic segmentation. XCeption is a 71-layer deep convolutional neural network. It is possible to load a pre-trained version of the network, trained on more than one million images, from the ImageNet database (Deng et al. 2009). The pre-trained network can classify images into 1000 object categories like keyboard, mouse, pencil and many animals. As a result, the network has learned rich feature representations for a wide range of images.

To compare the performance of an untrained (U-Net) and a pre-trained NN (UXception), both NNs are used for predicting the cut glass edge into a binary image to solve the present problem via semantic segmentation.

4.4 Hyperparameters for training

Before starting with the training of the U-Net, several hyperparameters have to be determined or stated in advance. A hyperparameter in machine learning is a parameter whose value is set before the learning process begins. In contrast, the values of other parameters are derived by training. A typical hyperparameter in neural networks is the batch size, which determines the number of samples processed before the NN model is updated. The size of a batch must be more than or equal to one and less than or equal to the number of samples in the training dataset. For our study it was found that the best results were achieved with a batch size of 8. Therefore, this hyperparameter is set accordingly.

An additional hyperparameter is given by the number of epochs. The number of epochs is the number of complete passes through the training dataset. In this study the number of epochs was set at 50, which has led to good results.

Each standard convolution process is activated by a ReLU activation function (see Sect. 4). The ReLU (rectified linear unit) is currently the most frequently used activation function, since it is used in almost all convolutional neural networks or in deep learning. The ReLU is half rectified (from below). f(z) is zero if z is less than zero, and f(z) is equal to z if z is above or equal to zero. For the sake of completeness, the ReLU function and its derivative are defined as follows

$$ f(z) = \begin{cases} z & \text{for } z \geq 0 \\ 0 & \text{for } z < 0 \end{cases} \tag{5} $$

and

$$ f'(z) = \begin{cases} 1 & \text{for } z \geq 0 \\ 0 & \text{for } z < 0 \end{cases} \tag{6} $$

which shows that the ReLU function and its derivative both are monotonic. In view of the large number of activation functions, no further description is given.

The goal of machine learning and deep learning is to reduce the difference between the predicted output and the actual output. This is also called the cost function or loss function. Cost functions are convex functions, which must be minimized by finding the optimized value for the weights of the NN. Here the hyperparameter is described by the function that performs the optimization. In deep learning there are many optimization functions, such as Gradient Descent (GD), Stochastic Gradient Descent (SGD), RMSProp (Root Mean Square Propagation) and many more. An overview with the respective advantages and disadvantages is described in Sun et al. (2019). Since the optimizer Adaptive Moment Estimation (Adam) has become generally established for semantic segmentation, it will be briefly described below. Adam is a method that calculates the individual adaptive learning rate for each parameter from estimates of first and second moments of gradients. It also reduces the radically diminishing learning rates of the Adaptive Gradient Algorithm (Adagrad) (Sun et al. 2019). Adam can be viewed as a combination of Adagrad, which works well on sparse

Fig. 13 Typical relationship between capacity and error, marking underfitting zone (left) and overfitting zone (right), from Goodfellow et al.
(2016) gradients and RMSprop which works well in online and non-stationary settings. Adam implements the expo- nential moving average of the gradients to scale the learning rate instead of a simple average as in Ada- compared to the previous calculation, one is visually grad. It keeps an exponentially decaying average of past one step closer to the minimum again. gradients. In addition, Adam is computationally effi- Two central challenges in learning an AI model by cient and has very low memory requirements, making learning algorithms have to be introduced: under- and this optimizer one of the most popular gradient descent overfitting. A model is prone to underfitting if it is optimization algorithms. not able to obtain a sufficiently low loss (error) value The loss function is one of the most important com- on the training set, while overfitting occurs when the ponents of neural networks. The loss is simply a pre- training error is significantly different from the test or diction error of the neural network. And the method validation error (Frochte 2019; Bishop 2006; Goodfel- for calculating the loss is called loss function. Simply low et al. 2016). The generalization error typically pos- speaking, the loss is used to calculate the gradients. sesses an U-shaped curve as a function of model capac- And the gradients are used to update the weights of ity, which is illustrated in Fig. 13. Choosing a simpler the neural network. Typical loss functions in machine model is more likely to generalize well (having a small learning are gap between training and test error) while at the same time still choosing a sufficiently complex hypothesis to – Mean Squared Error (MSE) achieve low training error. Training and test error typi- – Binary Crossentropy (BCE) cally behave differently during training of an AI model – Categorical Crossentropy (CC) by an learning algorithm (Frochte 2019; Bishop 2006; – Sparse Categorical Crossentropy (SCC). Goodfellow et al. 2016). 
Having closer look at Fig. 13, In the context of semantic segmentation a generally the left end of the graph unveils that training error and accepted loss function is the binary crossentropy loss generalization error are both high. Thus, this marks function, which has been applied in the optimization the underfitting regime. Increasing the model capacity, process. BCE loss is used for the binary classification it drives the training error to decreases while the gap tasks. When using the BCE Loss function, the system between training and validation error increases. Further requires only one output node to classify the data in two increasing the capacity above the optimal will eventu- classes. The output value should pass through a ReLU ally lead the size of this gap to outweigh the decrease activation function and the output range is 0–1. in training error, which marks the overfitting regime. In summary, the basic principle of training a neural Increasing model capacity tackles underfitting while network is to update the weights. For this purpose, gra- overfitting may be handled with regularization tech- dient methods are used to determine function values of niques (Frochte 2019; Bishop 2006; Goodfellow et al. discrete points after each individual batch run, so that 2016). Model capacity can be steered by choosing a the best possible search direction can be determined hypothesis space, which is the set of functions that the to the minimum. The loss functions are used for the learning algorithm is allowed to select as being the function values. If the loss improves–loss decreases— solution (Goodfellow et al. 2016). Here, varying the 123 Semantic segmentation with deep learning 33 parameters of that function family is called representa- 4.5 Metric for evaluation tional capacity while the effective capacity takes also into account additional limitations such as optimization In semantic segmentation it is of utmost importance to problems etc. (Goodfellow et al. 2016). 
obtain an adequate measure for the quality of the NN. This balancing act between an overfitting of the Typically there are three metrics for evaluation: training data or a badly trained one (underfitting), – Pixel Accuracy which can neither forecast the training data nor the test – Intersection-Over-Union (IoU, Jaccard Index) data in a robust way, can be solved methodically by the – Dice Coefficient (F1 Score) following procedures: In the following, all three metrics are briefly introduced and compared. – Regularization, – Ensemble, 4.5.1 Pixel accuracy – Early Stopping. Pixel accuracy is perhaps most easily understood con- Regularization involves a wide range of techniques ceptually. It is the percentage of pixels in your image to artificially force the model to be simpler. The method that are correctly classified. A major problem with this depends on the type of learner involved. For example, metric is mainly in the case of so-called imbalanced using dropout in a neural network, or adding a penalty data. Unbalanced data here means that the objects to parameter to the cost function in regression. Often, be classified occupy less than 50% of the image. In the the regularization method is also a hyperparameter, case of unbalanced data, the metric described above which means that it can be adjusted by cross-validation gives a high degree of accuracy, but this does not mean (Raschka 2018). that the object to be found has been correctly classified. Ensembles are machine learning methods for com- Unfortunately, class imbalance is predominant in many bining predictions from multiple separate models. real-world data sets, and cannot be ignored. Therefore, There are a few different methods for ensembling, but two alternative metrics are presented that can tackle the two most common are “bagging” and “boosting”. this problem better. A detailed explanation of these models is not provided but instead we refer to Rokach (2010). 
4.5.2 IoU metric The third method to avoid under- or overfitting is achieved by “Early Stopping”. Hence, in order not to let The Intersection-over-Union (IoU), also known as the the optimization run into insignificance, the algorithm Jaccard index, is one of the most commonly used met- was equipped with the tool “Early Stopping”. This is rics in semantic segmentation. The IoU is in fact a sim- method, in which the error on a validation set is moni- ple metric that is to understand. The IoU is the area of tored during training and (with a little patience) stop if overlap between the predicted segmentation and the the validation error does not improve sufficiently. This ground truth, divided by the area of union between method was used for the present the predicted segmentation and the ground truth (see In summary, the U-Net and UXception is equipped Fig. 14). with the following algorithms: Ground truth means in this context the reality we want to predict with our deep learning model. Typically, Parameter Set of U-Net / UXception the IoU-metric is in the range of 0–1, where 1 indicates an absolute match between original image and mask image and 0 means no match. – Batch size: 8, – Epochs: 50, – Activation function: ReLU, 4.5.3 Dice coefficient (F1 score) – Optimizer: Adam, – Loss function: Binary Crossentropy, The cube coefficient is very similar to the IoU coef- – Early stopping. ficient. It is defined by the square of the overlap area divided by the total number of pixels in both images. 123 34 M. Drass et al. Starting with the results of the U-Net, the accuracy for all training epochs is shown in Fig. 15 . Furthermore the loss (BCE) over the epochs is shown. As can be seen in Fig. 15, the U-Net has already reached a validation accuracy of over 99% after less than 15 epochs during training and validation. After 50 epochs of training, the U-Net has a validation accu- racy for the metric IuO of about 99.25%. 
Even with a training of more than 50 epochs, no improvement in accuracy could be observed. In contrast, if you look at the results for the UXcep- tion model, one can see a gain in performance, which is Fig. 14 Example for the semantic segmentation metric reflected in a validation accuracy of almost 99.4% with Intersection-over-Union or Jaccard index reference to the metric IoU. Thus, the applied model of transfer learning, where the U-Net was combined with For each fixed “ground truth”, the two metrics are a pre-trained XCeption Net, showed that a slight per- always positively correlated. This means that if classi- formance gain can be achieved (see Fig. 16). However, fier A is better than B under one metric, it is also better both models deliver extreme good results for the vali- than classifier B under the other metric. The IoU metric dation accuracy in determining the cut edge of glasses generally tends to penalise individual instances of poor and producing a binary image of the cut edge. classification quantitatively more severely than the dice In order to show again the performance difference coefficient, even if both can agree that this one case is of both neural networks, the validation accuracy IoU bad. Similar to how L norm punishes the largest errors is evaluated for both NN’s in the following graphic. In more than L norm, the IoU metric tends to “square” Fig 17 it can be seen that the UXception gives slight the errors relative to the dice score. So the dice score better results than the U-Net. It should also be noted that tends to measure average performance, while the IoU the application of the transfer learning approach with score tends to measure worst case performance. the UXception has the general advantage of obtaining a higher accuracy of the forecast more quickly than a con- ventional U-Net with a similar computational effort. 
In 4.6 Results: metrics our case, however, both models provide approximately the same results, so that both models can be used for a In this section, the results are presented separately for later evaluation. each algorithm used (U-Net and UXception) under evaluation of the metrics IoU and Dice coefficient. Fig. 15 Results of the training and validation of the U-Net in terms of accuracy and loss over all epochs 123 Semantic segmentation with deep learning 35 Fig. 16 Results of the training and validation of the UXception in terms of accuracy and loss over all epochs colour describes the glass and the black colour the pre- dicted fractured edge. Shown in reddish (yellow-red) colours, the areas in which the NN is not 100% sure whether it is glass or a fractured edge. The final result is the image predicted by the UXcep- tion model with a threshold value, which is called a predicted binary image. I.e. that below the limit of IoU = 20% everything is attributed to the glass, whereas when an gray value of more than 20% is reached, the NN predicts the cut edge. Furthermore, in all images generated by AI, the man-made cut edge is shown as a cyan-coloured line for better orientation. Fig. 17 Comparison of the U-Net and UXception in terms of As shown in Fig. 18, the trained UXception model validation accuracy over all epochs is excellently suited to create a mask image from the original image, without the need for human interac- tion. It is also obvious that the red-yellow areas, where 4.7 Results: computer vision the NN is not sure whether it sees the cut edge or just the pure glass, are very narrow. This means that the In this section the results for the UXception model are areas in which the NN is uncertain play only a minor presented as so-called computer vision results. Here, role. 
A slight improvement of the mask images created we show how well the neural networks are able to auto- by AI could be achieved by the cut-off condition or matically reconstruct an original image of a cut glass binary prediction. The presented NN for predicting the edge, i.e. to convert it into a mask image without the cut glass edge is therefore very accurate and saves an need for human interaction. enormous amount of time in the prediction and pro- As input five images of a cut glass edge are used, duction of mask images. In addition, the mask images which were not previously used for training or valida- can be further processed, for example to make statisti- tion of the neural network. Hence, the virgin images cal analyses of the break structure of the cut glass edge. are used as input and the output, i.e. the mask image, However, this is not part of the present paper, as the aim should be delivered by the neural network. here was to show the application of AI in the context The results first show the original image and the of glass processing. man-made mask image. Then the prediction of the AI model is also displayed as a mask image. Since all neu- ral networks are statistical models, the mask image gen- erated by AI is provided with a colour map. The white 123 36 M. Drass et al. Fig. 18 Results of the semantic segmentation using the UXception net to predict the cut glass edge on the basis of three original images unknown to NN 123 Semantic segmentation with deep learning 37 5 Summary and discussion Ensslen, F., Müller-Braun, S.: Kantenfestigkeit von floatglas in abhängigkeit von wesentlichen schneidprozessparametern. ce/papers 1(1), 189–202 (2017) In this paper the application of AI and especially the Frochte, J.: Maschinelles Lernen: Grundlagen und Algorithmen problem of semantic segmentation was applied to the in Python. Carl Hanser Verlag GmbH Co KG (2019) context of glass processing for the first time. 
The goal Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT press, New York (2016) was to process an image of a cut glass edge using neural Kim, P.: Matlab Deep Learning, With Machine Learning, Neural networks in such a way that a so-called mask image is Networks and Artificial Intelligence, p. 130 (2017) generated by the AI algorithm. Accordingly, the mask Kingma, D.P., Welling, M., et al.: An introduction to variational image should only recognize the cut glass edge from autoencoders. Found. Trends Mach. Learn. 12(4), 307–392 (2019) the original image and display it in black in the mask Kraus, M.A.: Künstliche intelligenz und maschinelles lernen im image. kontext der forschung im konstruktiven glasbau. ce/papers The application of the so-called U-Net and UXcep- 3(1), 161–173 (2019a) tion net showed excellent results in the prediction of the Kraus, M.A.: Machine learning techniques for the material parameter identification of laminated glass in the intact and cut glass edge. The validation accuracies of both mod- post-fracture state. PhD thesis, Universität der Bundeswehr els exceeded 99%, which is sufficient for the generation München (2019b) of the mask image via AI. Kraus, M.A., Drass, M.: Artificial intelligence for structural glass engineering applications: overview, case studies and future Funding Open Access funding provided by Projekt DEAL. potentials. Glass Struct. Eng. (2020a) Kraus, M.A., Drass, M.: Künstliche intelligenz für die gebäude- Open Access This article is licensed under a Creative Com- hülle. Deutsches Ingenieurblatt 04 (2020b) mons Attribution 4.0 International License, which permits use, Müller-Braun, S., Franz, J., Schneider, J., Schneider, F.: Optische sharing, adaptation, distribution and reproduction in any medium merkmale der glaskante nach glaszuschnitt mit schneidräd- or format, as long as you give appropriate credit to the original chen. 
ce/papers 2(1), 99–111 (2018) author(s) and the source, provide a link to the Creative Com- Müller-Braun, S., Seel, M., König, M., Hof, P., Schneider, J., mons licence, and indicate if changes were made. The images or Oechsner, M.: Cut edge of annealed float glass: crack system other third party material in this article are included in the article’s and possibilities to increase the edge strength by adjusting Creative Commons licence, unless indicated otherwise in a credit the cutting process. Glass Struct. Eng. 5(1), 3–25 (2020) line to the material. If material is not included in the article’s Cre- Paluszek, M., Thomas, S.: MATLAB Machine Learning. A press, ative Commons licence and your intended use is not permitted by New York (2016) statutory regulation or exceeds the permitted use, you will need Raschka, S.: Model evaluation, model selection, and algo- to obtain permission directly from the copyright holder. To view rithm selection in machine learning (2018). arXiv preprint a copy of this licence, visit http://creativecommons.org/licenses/ arXiv:181112808 by/4.0/. Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1– 2), 1–39 (2010) Shorten, C., Khoshgoftaar, T.M.: A survey on image data aug- mentation for deep learning. J. Big Data 6(1), 60 (2019) References Sun, S., Cao, Z., Zhu, H., Zhao, J.: A survey of optimization methods from a machine learning perspective. IEEE Trans. Cybern. (2019) Bishop, C.M.: Pattern Recognition and Machine Learning. Infor- Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, mation Science and Statistics, 1st edn. Springer, Berlin E.: Deep learning for computer vision: a brief review. Com- (2006) put. Intell. Neurosci. (2018) Chen, F., Chen, N., Mao, H., Hu, H.: Assessing four neural networks on handwritten digit recognition dataset (mnist) (2018). 
arXiv preprint arXiv:181108278 Publisher’s Note Springer Nature remains neutral with regard Chollet, F.: Xception: deep learning with depthwise separable to jurisdictional claims in published maps and institutional affil- convolutions. In: Proceedings of the IEEE Conference on iations. Computer Vision and Pattern Recognition, pp. 1251–1258 (2017) Deng, J., Dong, W., Socher, R., Li, LJ., Li, K., Fei-Fei, L.: Ima- genet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recog- nition, IEEE, pp. 248–255 (2009) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Glass Structures & Engineering Springer Journals
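The three evaluation metrics of Sect. 4.5 and the 20% cut-off of Sect. 4.7 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the convention that 1 marks a cut-edge pixel and 0 marks glass, and the tiny 4 × 4 example masks are assumptions made purely for illustration. The example shows the imbalance effect described in Sect. 4.5.1, where pixel accuracy stays high although only half of the edge pixels are found.

```python
"""Pixel accuracy, IoU and Dice for binary masks (illustrative sketch only)."""


def confusion_counts(pred, truth):
    """Return (tp, fp, fn, tn) for two equally sized binary masks.

    Assumed convention: 1 = cut-edge pixel, 0 = glass pixel.
    """
    tp = fp = fn = tn = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def pixel_accuracy(pred, truth):
    """Fraction of all pixels that are classified correctly."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)


def iou(pred, truth):
    """Intersection over union (Jaccard index) of the edge class."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 1.0  # two empty masks match perfectly


def dice(pred, truth):
    """Dice coefficient: twice the overlap over the pixel total of both masks."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 1.0


def threshold(prob_map, limit=0.2):
    """Binarise a prediction map: values above `limit` count as cut edge."""
    return [[1 if p > limit else 0 for p in row] for row in prob_map]


# Hypothetical masks: two true edge pixels, one found, one missed, one false alarm.
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

print(pixel_accuracy(pred, truth), iou(pred, truth), dice(pred, truth))
```

For these masks the pixel accuracy is 14/16, whereas the IoU is 1/3 and the Dice coefficient is 1/2, which is why the overlap-based metrics are the more informative ones for thin structures such as a cut edge.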


Publisher: Springer Journals
Copyright: © The Author(s) 2020
ISSN: 2363-5142
eISSN: 2363-5150
DOI: 10.1007/s40940-020-00133-7

Abstract

Glass Struct. Eng. (2021) 6:21–37 https://doi.org/10.1007/s40940-020-00133-7 RESEARCH PAPER Semantic segmentation with deep learning: detection of cracks at the cut edge of glass Michael Drass · Hagen Berthold · Michael A. Kraus · Steffen Müller-Braun Received: 10 May 2020 / Accepted: 20 August 2020 / Published online: 17 September 2020 © The Author(s) 2020 Abstract In this paper, artificial intelligence (AI) will Keywords Deep learning · Semantic segmentation · be applied for the first time in the context of glass pro- Cut-edge glass · U-Net · UXception cessing. The goal is to use an algorithm based on arti- ficial intelligence to detect the fractured edge of a cut glass in order to generate a so-called mask image by AI. In the context of AI, this is a classical problem 1 Introduction of semantic segmentation, in which objects (here the cut-edge of the cut glass) are automatically surrounded 1.1 Cutting of glass by the power of AI or detected and drawn. An orig- inal image of a cut glass edge is implemented into a In the production and further processing of annealed deep neural net and processed in such a way that a float glass, the glass panes are usually brought into mask image, i.e. an image of the cut edge, is auto- the required dimensions by a cutting process. In a first matically generated. Currently, this is only possible by step, a fissure is generated on the glass surface by using manual tracing the cut-edge due to the fact that the a cutting wheel. In the second step, the cut is opened crack contour of glass can sometimes only be recog- along the fissure by applying a bending stress. This cut- nized roughly. After manually marking the crack using ting process is influenced by many parameters (Müller- an image processing program, the contour is then auto- Braun et al. 2020). The edge strength in particular can matically evaluated further. 
AI and deep learning may be reproducibly increased by a proper adjustment of the provide the potential to automate the step of manual parameters (Ensslen and Müller-Braun 2017). Further- detection of the cut-edge of cut glass to great extent. In more, it could be observed that due to different cutting addition to the enormous time savings, the objectivity process parameters, the resulting damage to the edge and reproducibility of detection is an important aspect, (the crack system) can differ in its extent (Fig. 1). In which will be addressed in this paper. addition, this characteristic of the crack system can be brought into a relationship with the strength (Müller- M. Drass ( ) · M. A. Kraus Braun et al. 2018). In particular, it has been found that M&M Network-Ing UG (haftungsbeschränkt), Darmstadt, when the edge is viewed perpendicular to the glass sur- Germany e-mail: drass@mm-network-ing.com face (Fig. 2), the so-called lateral cracks, which can be e-mail: drass@ismd.tu-darmstadt.de observed here, allow the best predictions for strength M. Drass· H. Berthold · M. A. Kraus · (Müller-Braun et al. 2020). The challenge here is to S. Müller-Braun detect these lateral cracks in an accurate way. Cur- TU Darmstadt - Institute for Structural Mechanics and Design, Darmstadt, Germany rently, this is only possible by manual tracing due to 123 22 M. Drass et al. Fig. 1 View on the cut edge of two 4 mm thick glass specimens, a slight crack system, breaking stress: 78 MPa, b more pronounced crack system, breaking stress: 53 MPa 1.2 Artificial intelligence, machine and deep learning In this section, we define the essential terms of Artificial Intelligence, Machine and Deep Learning in order to bring them closer to the reader. In case the reader is familiar with it, then this section should be skipped. Artificial intelligence (AI) is the overall term for all subsequent developments, algorithms, forms and mea- Fig. 
2 Lateral crack to be detected: the crack contour can be sures in which artificially intelligent action occurs. AI difficult to identify is dedicated to the theory and development of com- putational systems capable of performing tasks that normally require human intelligence, such as visual perception, speech recognition, decision making and the fact that the crack contour can sometimes only be translation between languages. recognized roughly (Fig. 2). After manually marking Machine learning (ML) is a sub-class of AI, which the crack using an image processing program, the con- enables systems to learn from given data and not to tour is then automatically evaluated further. Methods execute or make actions, commands or decisions by of Artificial Intelligence (AI) and especially the algo- explicit programming. The aim of ML is to generate rithms from the field of AI in computer vision may artificial knowledge from experience (the data). A basic provide the potential to automate the step of manual premise, however, is that the knowledge gained from detection as well. In addition to the enormous time sav- the data can be generalized and used for new prob- ings, the objectivity and reproducibility of detection is lem solutions, for the analysis of previously unknown an important aspect. 123 Semantic segmentation with deep learning 23 data or for predictions on data not measured (predic- to perform statistical analysis regarding the character- tion) (Kraus 2019b; Bishop 2006). Excellent exam- istics of the cut glass edge. ples for classical machine learning are linear clas- Since in this paper, it is the first time that AI algo- sifiers (e.g. linear/logistic regression, linear discrim- rithm are applied to a specific problem of glass pro- inant analysis), kernel methods (e.g. support vector cessing, Chapter 2 gives a brief introduction into AI machines), tree-based methods (e.g. decision trees, ran- and deep learning. 
Since we want to focus on the prob- dom forests), non-parametric regression (e.g. nearest lem of semantic segmentation in this paper with the neighbors, local kernel smoothing), etc. (Kraus 2019a; help of Deep Learning, i.e. the task of classifying each Kraus and Drass 2020b). and very pixel in an image into a class (here cut glass Deep neural networks (DNN) or Deep learning uses edge), we will introduce semantic segmentation briefly so-called artificial neural networks to recognize pat- in Chapter 3. Chapter 4 will then deal specifically with terns and highly non-linear relationships in data. A the task of semantic segmentation using images of cut deep neural network (DNN) is based on a collection glass edges to determine the contour of the broken edge. of connected nodes (the neurons), which fictitiously resemble the human brain (cf. Fig. 5). Due to their ability to reproduce and model nonlinear processes, deep neural networks have found applications in many 2 Fundamentals on deep learning areas (Voulodimos et al. 2018). These include mate- rial modeling and development, system identification This section aims at explaining deep learning and sum- and control (vehicle control, process control), pat- marizing its main features to help the reader understand tern recognition (radar systems, face recognition, sig- the algorithms that follow. In case one is familiar with nal classification, 3D reconstruction, object recogni- Deep Learning, this section can be skipped. tion and more), sequence recognition (gesture, speech, To understand how a neural network works, the handwriting and text recognition), medical diagnostics, example of a so-called multi-layer perceptron (MLP) social network filtering and e-mail spam filtering. is explained below. An MLP consists of at least three To give the interested reader an even better intro- node layers: an input layer, a hidden layer and an output duction to the application of Artificial Intelligence in layer. 
Within the layers there are neurons, each with dif- practical relation to structural glass engineering and ferent tasks. With the exception of the input nodes, each processing, we refer to the current review article pro- node is a neuron that uses a nonlinear activation func- posed by Kraus and Drass (2020a). tion. For training, MLP uses a technique of supervised learning called backpropagation to adjust the weights w of the neural network. An MLP is hence a mathe- 1.3 Problem statement and methodology matical composition of nonlinear functions of two or more neurons via an activation function. This partic- The problem is defined, whether it is possible with the ular nonlinear nature of NNs thus is able to identify help of AI to firstly recognize the fracture pattern of a and model nonlinear behaviors, which may not at all cut glass edge and secondly to process it in such a way or not properly be captured by other ML methods such that AI automatically creates an image, so-called mask as regression techniques or PCA etc. Despite the bio- image of the cut edge, without explicitly programming logical inspiration of the term neural network aNNin it. ML is a pure mathematical construct which consists of Methodologically, this problem is to be solved with either feedforward or feedback networks (recurrent). the help of so-called semantic segmentation using The neurons at the outputs are called output neurons, deep convolutional networks. This method is preferred the layers in between the input and output neurons are because it has wide application for describing and solv- called hidden layers. If there are more than three hid- ing image segmentation problems. Image segmentation den layers, this NN is called a Deep NN. The develop- means creating a contour plot around a specific object, ment of the right architecture for an NN or Deep NN is such as a ball, in an image. 
The goal here is to automat- problem dependent and only few rules of thumb exist ically generate a high-resolution image of the fracture for that setup (Bishop 2006; Frochte 2019; Kim 2017; pattern of the cut glass edge, which can be used later Paluszek and Thomas 2016). 123 24 M. Drass et al. Fig. 3 Presentation of selected input images of the MNIST Fig. 4 Example of a possible NN architecture for the MNIST datasets (Chen et al. 2018) dataset To describe the structure and mathematical pro- cesses of an NN during learning (training), the exam- ple of handwriting recognition based on the MNIST dataset is used for reasons of comprehensibility. This is about predicting the written numbers (0–9) based on input images without writing an explicit algorithm doing that task. The MNIST database contains 60,000 training images and 10,000 testing images. An example of several input images is summarized in Fig. 3. Each single image has a resolution of 28 × 28 pixels, which Fig. 5 Example of a single neuron for the MNIST dataset are then used as input within the NN. Therefore, the first layer of the NN consists only of Each neuron receives a set of values (numbered from the image data, which is fed into the network as initial 1to i) as an input and compute within the hidden lay- information. It is to note that the input image is not ers activation signals a as output. In the final layer, read in as an image but as a vector x with a length of the output layer, the neurons compute the predicted y ˆ 28 × 8 = 784. The values within the input vector value. The vector x actually contains the values of the of each image is the corresponding gray value of the features in one of m examples from the training set. image. Each value in the input layer corresponds in the Just to remember, the MNIST dataset contains 60,000 context NN to a single input neuron, which in turn is training images, hence m is equal to that number. In connected to all following neurons of a hidden layer. 
The last layer in the NN is the so-called output layer. In this output layer, the NN makes a prediction as to which number was identified in the corresponding input image. A typical NN architecture for the problem of handwriting recognition is displayed in Fig. 4. As can be seen in Fig. 4, the input vector x has 784 entries, each of which is connected to the neurons of the hidden layer. The output layer is defined as a vector that has exactly 10 entries and predicts the handwriting recognition solution. To understand more precisely how the training of a neural network works, only one neuron of the first hidden layer is shown in the following figure.

addition, each unit has its own set of parameters, usually referred to as w (column vector of weights) and b (bias), which change during the learning process. During each iteration, the neuron computes a weighted sum of the values of vector x, based on its current weighting vector w, and adds a bias b. For each neuron of the hidden layer, this weighted sum is calculated as

$$z = w \cdot x + b. \tag{1}$$

Finally, the result of this calculation is passed through a nonlinear activation function called g. Activation functions are one of the key elements of the neural network. Without them, the neural network would become a combination of linear functions, so it would itself be only a linear function. After the first feed-forward run through the NN, a so-called loss function must be computed. The basic source of information about the state of learning is the value of the loss function. In general, the loss function should show how far the prediction is from the "ideal" solution. Since there are many different functions for describing the loss L, the cost function J, i.e. the mean value of the loss over all training examples, is generally described by

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{i}, y^{i}\right). \tag{2}$$

At this point it should be noted that all neurons of all hidden layers are now considered, so that the weighting vector w of a single neuron becomes a matrix of weightings W. The scalar bias b likewise becomes a vector b for all neurons. For the calculation of the cost function J, all m images are used. The index i runs over the m training images, so the loss function L is a function of the predicted value ŷ^i and the true value y^i of the i-th image.

Now that the loss function is defined and we have arrived at the output of the NN in the first iteration of the feed-forward procedure, we have to minimize the loss function by adjusting the matrix of weightings W and the bias vector b. For this purpose, calculus is used and the gradient descent method is applied to find a minimum of the function. In each iteration, the partial derivatives of the loss function with respect to each of the parameters of the neural network are computed. The adjusted parameters of the neural network are computed with

$$W = W - \alpha\, dW \tag{3}$$

and

$$b = b - \alpha\, db. \tag{4}$$

In the above equations, the learning rate is represented by α, which is a hyperparameter that allows the user to control the size of the adjustment performed. Choosing the learning rate is essential: if it is too low, the NN learns very slowly; if it is too high, the NN cannot reach the minimum. dW and db are calculated using the chain rule as partial derivatives of the loss function with respect to W and b. The size of dW and db is the same as that of W and b, respectively. Figure 6 shows the sequence of operations within the neural network. It clearly shows how forward and backward propagation interact to optimize the loss function (Fig. 5).

Fig. 6 Example of the back propagation algorithm to optimize the loss function

3 Semantic segmentation with deep learning

Today, in the context of computer vision and deep neural networks, the topic of image classification is widely known. Typically, in image classification one tries to classify images based on their visual content. For instance, a classification algorithm can be designed to detect whether or not an image contains a car, an animal, a point fixing for facades, etc. Whilst the recognition of an object is trivial for humans, robust image classification is still a challenge for computer vision applications. An extension of this is so-called object detection, in which objects within an image are enclosed by a frame or box. The resolution of object detection is relatively coarse, so in some cases it is desirable to detect the exact contours of objects. Semantic segmentation, in contrast, is the task of classifying each individual pixel in an image into a specific class. Typically, in semantic segmentation, image data is read in and evaluated in such a way that an object to be found is segmented or bordered by a so-called mask. An example of the algorithms listed above is shown in Fig. 7 for the detection of point fixings in a facade.

Fig. 7 Example of image classification, object detection, semantic segmentation and instance segmentation for a facade

In summary, one can say that the greatest information content in an image classification problem is obtained via semantic segmentation. In the past it has been shown that neural networks have become established especially in image classification, because they are very well able to recognize (hidden) patterns in images.

edges of these panes were photographed perpendicular to the glass surface, with the edge always positioned in the centre of the image (Fig. 8 (top)). Accordingly, this procedure always ensures the same boundary conditions when capturing the images of the cut edge. The input images were taken with a camera with the model number U3-3890CP-M-GL Rev.2 from IDS Imaging Development Systems GmbH.
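Returning to the training procedure of Eqs. (1) to (4), the interplay of forward pass, cost and gradient descent update can be sketched for a single neuron. The following is a minimal illustration in plain Python under stated assumptions: a sigmoid is chosen for the activation g, a squared-error loss is chosen for L, and the toy data (logical OR) as well as all function names are hypothetical and not part of the original study.

```python
import math


def sigmoid(z):
    """Assumed nonlinear activation g."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Eq. (1) followed by the activation: a = g(w . x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def cost(w, b, xs, ys):
    """Eq. (2) with the assumed loss L = 0.5 * (y_hat - y)^2."""
    m = len(xs)
    return sum(0.5 * (forward(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / m


def gradient_step(w, b, xs, ys, alpha):
    """One update W = W - alpha*dW, b = b - alpha*db, Eqs. (3) and (4)."""
    m = len(xs)
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(xs, ys):
        a = forward(w, b, x)
        dz = (a - y) * a * (1.0 - a)  # chain rule: dL/da * da/dz
        for j, xj in enumerate(x):
            dw[j] += dz * xj / m
        db += dz / m
    w_new = [wj - alpha * dwj for wj, dwj in zip(w, dw)]
    b_new = b - alpha * db
    return w_new, b_new


# Toy training set (assumption): logical OR of two inputs.
xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0.0, 1.0, 1.0, 1.0]

w, b = [0.0, 0.0], 0.0
cost_before = cost(w, b, xs, ys)
for _ in range(200):
    w, b = gradient_step(w, b, xs, ys, alpha=1.0)
cost_after = cost(w, b, xs, ys)
```

Note that `dw` has the same length as `w`, mirroring the statement that dW and db have the same size as W and b.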
This camera has a monochrome sensor with a resolution of 4000 × 3000 px. As lens the LM25JC10M of the company Kowa Optimed Deutschland GmbH was used. With a focal length of 25 mm, this allows macro photography with a minimum distance of the object to the lens of 100 mm. Thus, a very high level of detail could be realized with a relatively large image section at the same time. The scale of the input images is 1/8.696 px/µm. For the illumination a homogeneous white LED transmission light with the model code TH2-63X60SW from CCS Inc. (OPTEX GROUP CO., LTD.) was used together with the Digital Control Unit PD2-3024(A). The images were taken with the lowest brightness level. For further processing, the file size of the images was reduced by trimming the upper and lower 1250 px, as the image information in these areas are irrelevant for the investigations. Thus, image files with a size of 4000 × 500 px were generated.

In a further step, the visible lateral cracks were marked manually using the image processing program Adobe Photoshop. The crack contour was first traced with a 1 px thick line.

These hidden patterns can be named also as features, which should be recognized and learned by the neural network in order to make robust predictions on the input images.

4 Semantic segmentation on cut-edge of glass

In the following, the method of semantic segmentation is applied to images of cut edges of glasses. The aim is to detect and trace fracture edges of images of cut glass using the method of semantic segmentation. With the successful application of this method it will be possible in future to statistically evaluate images of broken glass edges in terms of the fracture pattern, crack branching system and their crack width without having to invest a lot of time in image processing and post-processing, which was very time-consuming in the past.

4.1 Data: original input–output
The objects of interest are conventional 8 mm thick annealed soda-lime glass panes, which were cut industrially using a classical carbide cutting wheel. The

Afterwards a 1 px straight line was positioned at the edge of the glass pane. Finally, the resulting space between the two lines was filled with black. The result is the output mask shown in Fig. 8 (bottom). This kind of processing of the individual input images is very tedious and time-consuming, so the goal is to develop an automatic generation of mask images via AI.

Fig. 8 Illustration of the original images and the created mask of the cut edge

Fig. 9 Schematic illustration of the horizontal overlapping shifting

4.2 Data: preparation and augmentation

4.2.1 Data preparation

The database is formed by four images with a pixel size of 500 × 4000 px. The color space has already been reduced manually so that the images contain only gray values. From previous investigations it became clear that a square section of 192 × 192 px is sufficient to represent the region needed to predict the cut edge of the cut glass. Since the currently available database had to be created manually with a considerable amount of time, there is currently only little data available in the form of input and mask images. Therefore the four input images were prepared by a slicing process in such a way that about 4000 input and mask images were obtained per image. The principle is shown in Fig. 9. Here, the window (192 × 192 px) is always shifted by 1 px. This procedure of overlapping slicing was performed for the original input images and the corresponding mask images in order to obtain a coherent data set for the semantic segmentation.

4.2.2 Data augmentation

Data augmentation is a strategy that allows practitioners to significantly increase the variety of data available for training models without actually collecting new data. Data augmentation techniques such as cropping, zooming, and distortion are often used to train large neural networks. However, most approaches used in neural network training use only basic types of data augmentation. While neural network architectures have been studied in detail, less emphasis has been placed on discovering strong types of data augmentation and data expansion strategies that capture data invariants (Shorten and Khoshgoftaar 2019).

It is important to note that data augmentation is an important tool to artificially increase the size of the data set. Many image classification tasks have shown that data augmentation can produce better results and accuracies, but it is also important to note that it only increases the sample size, not necessarily the information within the images, since it is still the same image. In the case of semantic segmentation, where an original image and a mask or contour image are present, the data augmentation must be applied equally to both images. To demonstrate this clearly, we refer to the example of a skyscraper, which was deliberately chosen as an illustrative example to show the reader what each augmentation algorithm does with the input image and the corresponding mask image. The example in Fig. 10 shows different approaches of data augmentation in the context of semantic segmentation, where the skyscraper of the Swiss Re, London should be classified and segmented.

Fig. 10 Different examples for data augmentation of the skyscraper Swiss Re, London

In the augmented images, the original and mask images were skewed, zoomed, sheared and flipped, the brightness was changed, a Gaussian distortion filter was applied and finally random erasing was applied. To augment the images, the program Augmentor was used, a Python package that supports the augmentation and artificial generation of image data for machine learning tasks. It is primarily a data augmentation tool, but it also includes basic image pre-processing functions.

For the present task of predicting the cut glass edge via neural networks in the context of semantic segmentation, both the original and the mask image are augmented equally. Not an arbitrary number of data augmentation methods was used; only zooming in, shearing and distortion were applied to the input and mask images. This is absolutely necessary in order not to teach the neural network too many parameters and variabilities in the data set. The limited data augmentation is justified by the fact that, for the problem at hand, the input or original images are always acquired under the same boundary conditions, so that there is no great variability in the input data. Nevertheless, data augmentation ensures that good prediction by the NN is still possible in the case of non-uniform acquisition of the cut glass edge images.

4.3 Deep neural network models

ing to computer vision tasks (Shorten and Khoshgoftaar 2019).

When programming a CNN, the input is a tensor with shape (number of frames) × (frame width) × (frame height) × (frame depth). After the image has passed through a convolution layer, it is abstracted to a feature map with shape (number of images) × (feature map width) × (feature map height) × (feature map channels). This is comparable to the reaction of a neuron in the visual cortex to a specific stimulus. Each convolutional neuron processes data only for its receptive field.
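The overlapping slicing of Sect. 4.2.1 and the requirement of Sect. 4.2.2 that image and mask are always transformed together can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names are hypothetical, the vertical position of the window (`top`) is an assumption because the text only describes the horizontal shift, and the horizontal flip merely stands in for any augmentation applied identically to both images (the study itself reports zooming, shearing and distortion with the Augmentor package).

```python
def window_offsets(length, window, stride=1):
    """Left offsets of all windows of size `window` that fit into `length`."""
    return list(range(0, length - window + 1, stride))


def crop(image, top, left, size):
    """Cut a size x size patch out of a 2-D image given as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def slice_pair(image, mask, window, stride=1, top=0):
    """Slice image and mask with identical windows so the pairs stay aligned."""
    width = len(image[0])
    return [
        (crop(image, top, left, window), crop(mask, top, left, window))
        for left in window_offsets(width, window, stride)
    ]


def flip_pair(image_patch, mask_patch):
    """Apply the same horizontal flip to an image patch and its mask patch."""
    return ([row[::-1] for row in image_patch],
            [row[::-1] for row in mask_patch])


# Number of 192 px windows in a 4000 px wide image at a shift of 1 px.
n_windows = len(window_offsets(4000, 192, 1))

# Tiny synthetic example (assumption): 4 x 6 image, mask marks the right half.
image = [[10 * r + c for c in range(6)] for r in range(4)]
mask = [[1 if c >= 3 else 0 for c in range(6)] for r in range(4)]
pairs = slice_pair(image, mask, window=4)
flipped_image, flipped_mask = flip_pair(*pairs[1])
```

For a 4000 px wide image the horizontal shift yields 4000 − 192 + 1 = 3809 windows, which is consistent with the "about 4000" input and mask images per image stated above.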
Before we begin with an introduction to the models of deep neural networks used in this study to solve the problem of semantic segmentation of the cut edge of cut glass, we first explain the main operations typically used in these particular neural networks. Then we move on to the description of the special neural network called U-Net, which is well suited for problems of semantic segmentation. Furthermore, an extension of the U-Net with the Xception net is presented, which is based on a technique called transfer learning. For both NNs used for the prediction of mask images of cut glass edges, all hyperparameters used are explained and described in detail. Additionally, it is shown which metrics are used to evaluate the quality of the prediction. Finally, the results are summarized and presented.

4.3.1 Convolutional neural networks: general operations

Convolution operation

Usually the input of a Convolutional Neural Network (CNN) is available as a two- or three-dimensional matrix (e.g. the pixels of a grayscale or color image). Convolutional layers sequentially downsample the spatial resolution of images while expanding the depth of their feature maps. The feature map is the output of one filter applied to the previous layer. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron, and the output is collected in the feature map.

This succession of convolution transformations can create much lower-dimensional and more useful representations of images than could be created manually. CNN's success has increased interest and optimism in applying deep learn-

A receptive field is nothing else than the area in the input volume that a particular feature extractor (filter) considers. The activity of each neuron is calculated by a discrete convolution (hence the addition convolutional). Intuitively, a comparatively small convolution matrix (filter kernel) is moved over the input step by step. The input of a neuron in the convolutional layer is calculated as the inner product of the filter kernel with the currently underlying image section. Accordingly, neighboring neurons in the convolutional layer react to overlapping areas (similar frequencies in audio signals or local environments in images).

Max. pooling operation

Basically, the function of pooling is to reduce the size of the feature map so that there are fewer parameters in the network. The idea is to keep only the important features (max. evaluated pixels) from each region and to throw away the information that is not important. Important means in this context the information that best describes the context of the image. Figure 11 can be used as an example: the maximum pixel value is selected from each 2 × 2 block of the input feature map to obtain a pooled feature map. Note that the size of the filter and the step size (stride) are two important hyperparameters in the max. pooling operation.

In summary, both the convolution operation and especially the pooling operation reduce the size of the image, which is generally referred to as down-sampling. In a typical convolutional network, the height and width of the image gradually decrease (down-sampling, due to pooling), which helps the filters in the deeper layers to focus on a larger receptive field (context). The number of channels, i.e. the depth (number of filters used), however, gradually increases, which helps to extract more complex features from the image. On an intuitive level, the conclusion to be drawn from the pooling operation is as follows.

symmetric expansion path (also known as the decoder), which is used to enable precise localization using transposed convolution.
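The convolution and max. pooling operations of Sect. 4.3.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the 5 × 5 input and the 2 × 2 kernel are arbitrary example values, the kernel weights are fixed instead of learned, no padding is used, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image one pixel at a time (no padding).

    Each output value is the inner product of the kernel with the image
    section currently underneath it, i.e. its receptive field.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0
            for u in range(kh):
                for v in range(kw):
                    s += kernel[u][v] * image[i + u][j + v]
            row.append(s)
        feature_map.append(row)
    return feature_map


def max_pool(feature_map, size=2, stride=2):
    """Keep only the maximum value of each size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            row.append(max(feature_map[i + u][j + v]
                           for u in range(size) for v in range(size)))
        pooled.append(row)
    return pooled


# Arbitrary example values (assumption).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]

feature_map = conv2d_valid(image, kernel)  # 5 x 5 input -> 4 x 4 feature map
pooled = max_pool(feature_map)             # 4 x 4 -> 2 x 2 pooled map
```

The filter size and the stride of `max_pool` are exactly the two hyperparameters named in the text; with `size=2` and `stride=2` the height and width are halved.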
It is generally known that the process of dimensional reduction in height and width, which is used throughout the entire convolutional neural network (i.e. the pooling layer), is applied in the form of a dimensional increase in the second half of the model. The main difference between the U-Net and an AutoEncoder lies in the fact that AutoEncoders compress the input data (i.e. images) linearly, which results in a bottleneck through which not all features can be transmitted to the decoding process (Kingma et al. 2019). Hence, information is lost, so that the reconstruction of a mask image in particular, i.e. the segmentation of the image, is hardly possible.

For the problem at hand, namely the segmentation of images of cut edges of glass into the classes breakage (black) and undamaged glass (white), the U-Net architecture is shown in Fig. 12.

The U-Net architecture shown was implemented in Python and then trained based on the data presented in Sect. 4.2.

4.3.3 U-Net+XCeption (UXception)

In the previous section, the U-Net has been explained in detail for the application of semantic segmentation, in this case the generation of a binary image to detect the cut glass edge. Since the U-Net described above is initially untrained, i.e. it starts without any prior knowledge, and many studies have shown that transfer learning is a good method to increase the performance of an NN, a neural network that has previously been trained on an enormously large database is coupled with the U-Net here. Thus, the neural network does not start from scratch, but already has knowledge of what a ball, monkey, plane etc. looks like and is therefore able to segment these objects. This preliminary information within the pre-trained NN is used in transfer learning in order to make predictions for other problems with the aim of achieving a performance gain.

Hence, transfer learning (TL) is a research problem in machine learning that focuses on storing the knowledge gained in solving one problem and applying it to another but related problem. TL shall now be applied to couple a trained deep neural network for image classification, the so-called XCeption (Chollet

Fig. 11 Example of max. pooling operation in a convolutional neural network

Through down-sampling the model understands better "WHAT" is present in the image, but it loses the information "WHERE" it is present.

Transposed convolution operation

Since it is important for semantic segmentation to know where the corresponding information is located in the image in order to obtain a complete high-resolution image in which all pixels are classified, up-sampling is necessary. Transposed convolution is an up-sampling technique that expands the size of images. Basically, the original image is slightly padded, followed by a convolution operation. The reason behind the up-sampling is to combine the information from the previous layers in order to get a more precise prediction.

An adequate technique here is the so-called transposed convolution method. At a high level, the transposed convolution is exactly the opposite of a normal convolution, i.e. the input volume is an image with low resolution and the output volume is an image with high resolution. Therefore, transposed convolution is the preferred choice for performing up-sampling, where the parameters that convert a low-resolution image to a high-resolution image are learned by back propagation.

4.3.2 U-Net architecture

The U-Net got its name from a U-shaped architecture in which a fully convolutional network is used. There are two paths in the architecture (see Fig. 12). The first is the contraction path (also called encoder), which is used to capture the context in the image. The encoder is just a traditional stack of convolution and max-pooling layers.
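The up-sampling effect of a transposed convolution can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: a single channel, no padding, a stride of 2 and a fixed 2 × 2 kernel of ones (in a network such as the U-Net the kernel weights would be learned by back propagation); the function name is hypothetical.

```python
def conv_transpose2d(x, kernel, stride=2):
    """Transposed convolution of a 2-D input (single channel, no padding).

    Every input value is multiplied by the kernel and the result is added
    to the output at a position shifted by `stride`, so the output is
    larger than the input.
    """
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for u in range(kh):
                for v in range(kw):
                    out[i * stride + u][j * stride + v] += x[i][j] * kernel[u][v]
    return out


low_res = [[1.0, 2.0],
           [3.0, 4.0]]
kernel = [[1.0, 1.0],
          [1.0, 1.0]]  # fixed example weights (assumption); learned in practice

high_res = conv_transpose2d(low_res, kernel, stride=2)  # 2 x 2 -> 4 x 4
```

With this particular kernel each low-resolution pixel is simply copied into a 2 × 2 block; with learned weights the network can instead decide how neighboring values are blended during up-sampling.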
The second path is the

2017), with the previously presented U-Net in order to generate improved results in semantic segmentation. XCeption is a 71-layer deep convolutional neural network. It is possible to load a pre-trained version of the network, trained on more than one million images, from the ImageNet database (Deng et al. 2009). The pre-trained network can classify images into 1000 object categories like keyboard, mouse, pencil and many animals. As a result, the network has learned rich feature representations for a wide range of images.

Fig. 12 U-Net architecture for the problem of image segmentation of cut glass

To compare the performance of an untrained (U-Net) and a pre-trained NN (UXception), both NNs are used to predict the cut glass edge as a binary image in order to solve the present problem via semantic segmentation.

4.4 Hyperparameters for training

Before starting with the training of the U-Net, several hyperparameters have to be determined or stated in advance. A hyperparameter in machine learning is a parameter whose value is set before the learning process begins. In contrast, the values of other parameters are derived by training. A typical hyperparameter in neural networks is the batch size, which determines the number of samples processed before the NN model is updated. The size of a batch must be greater than or equal to one and less than or equal to the number of samples in the training dataset. For our study it was found that the best results were achieved with a batch size of 8. Therefore, this hyperparameter is set accordingly.

An additional hyperparameter is the number of epochs, i.e. the number of complete passes through the training dataset. In this study the number of epochs was set at 50, which has led to good results.

Each standard convolution process is activated by a ReLU activation function (see Sect. 4). The ReLU (rectified linear unit) is currently the most frequently used activation function, since it is used in almost all convolutional neural networks or in deep learning. The ReLU is half rectified (from below): f(z) is zero if z is less than zero, and f(z) is equal to z if z is greater than or equal to zero. For the sake of completeness, the ReLU function and its derivative are defined as follows

$$f(z) = \begin{cases} z & \text{for } z \ge 0 \\ 0 & \text{for } z < 0 \end{cases} \tag{5}$$

and

$$f'(z) = \begin{cases} 1 & \text{for } z \ge 0 \\ 0 & \text{for } z < 0 \end{cases} \tag{6}$$

which shows that the ReLU function and its derivative are both monotonic. In view of the large number of activation functions, no further description is given.

The goal of machine learning and deep learning is to reduce the difference between the predicted output and the actual output. This difference is described by the cost function or loss function. The cost function, which for deep networks is generally non-convex, must be minimized by finding optimized values for the weights of the NN. Here the hyperparameter is the function that performs the optimization, the optimizer. In deep learning there are many optimization functions, such as Gradient Descent (GD), Stochastic Gradient Descent (SGD), RMSProp (Root Mean Square Propagation) and many more. An overview with the respective advantages and disadvantages is given in Sun et al. (2019). Since the optimizer Adaptive Moment Estimation (Adam) has become generally established for semantic segmentation, it is briefly described below. Adam is a method that calculates an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients. It also reduces the radically diminishing learning rates of the Adaptive Gradient Algorithm (Adagrad) (Sun et al. 2019). Adam can be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSprop, which works well in online and non-stationary settings. Adam uses an exponential moving average of the gradients to scale the learning rate instead of a simple average as in Adagrad. It keeps an exponentially decaying average of past gradients. In addition, Adam is computationally efficient and has very low memory requirements, making this optimizer one of the most popular gradient descent optimization algorithms.

The loss function is one of the most important components of neural networks. The loss is simply the prediction error of the neural network, and the method for calculating the loss is called the loss function. Simply speaking, the loss is used to calculate the gradients, and the gradients are used to update the weights of the neural network. Typical loss functions in machine learning are

– Mean Squared Error (MSE)
– Binary Crossentropy (BCE)
– Categorical Crossentropy (CC)
– Sparse Categorical Crossentropy (SCC).

Fig. 13 Typical relationship between capacity and error, marking underfitting zone (left) and overfitting zone (right), from Goodfellow et al. (2016)

compared to the previous calculation, one is visually one step closer to the minimum again.

Two central challenges in training an AI model with learning algorithms have to be introduced: underfitting and overfitting. A model is prone to underfitting if it is not able to obtain a sufficiently low loss (error) value on the training set, while overfitting occurs when the training error is significantly different from the test or validation error (Frochte 2019; Bishop 2006; Goodfellow et al. 2016). The generalization error typically possesses a U-shaped curve as a function of model capacity, which is illustrated in Fig. 13. A simpler model is more likely to generalize well (having a small gap between training and test error), while the hypothesis must at the same time be sufficiently complex to achieve a low training error. Training and test error typically behave differently during the training of an AI model by a learning algorithm (Frochte 2019; Bishop 2006; Goodfellow et al. 2016).
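The binary crossentropy from the list of loss functions and the Adam update described in this section can be combined in a small sketch. The following plain Python example is a minimal illustration under stated assumptions, not the authors' training code: a single weight and a single training sample are used, a sigmoid maps the output to the range 0 to 1 as required for the binary crossentropy, the Adam constants (alpha = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are example values, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real value to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(y_true, y_prob, eps=1e-12):
    """Binary crossentropy for one sample with label 0 or 1."""
    y_prob = min(max(y_prob, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_prob)
             + (1.0 - y_true) * math.log(1.0 - y_prob))


def adam_step(theta, grad, m, v, t,
              alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# One hypothetical pixel: input x = 1.0 with label y = 1.
x, y = 1.0, 1.0
w, m, v = 0.0, 0.0, 0.0

loss_before = bce(y, sigmoid(w * x))
for t in range(1, 101):
    a = sigmoid(w * x)
    grad = (a - y) * x  # derivative of BCE(sigmoid(w*x)) with respect to w
    w, m, v = adam_step(w, grad, m, v, t)
loss_after = bce(y, sigmoid(w * x))
```

Because the sigmoid output always lies strictly between 0 and 1, it can be interpreted as the probability of the positive class, which is what the two logarithms in `bce` require.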
In the context of semantic segmentation a generally accepted loss function is the binary crossentropy loss function, which has been applied in the optimization process. BCE loss is used for binary classification tasks. When using the BCE loss function, the system requires only one output node to classify the data into two classes. The output value should pass through a sigmoid activation function, so that the output range is 0–1.

In summary, the basic principle of training a neural network is to update the weights. For this purpose, gradient methods are used to determine function values at discrete points after each individual batch run, so that the best possible search direction towards the minimum can be determined. The loss functions are used for the function values. If the loss improves, i.e. the loss decreases,

Having a closer look at Fig. 13, the left end of the graph reveals that training error and generalization error are both high. Thus, this marks the underfitting regime. Increasing the model capacity drives the training error down, while the gap between training and validation error increases. Further increasing the capacity above the optimum will eventually lead the size of this gap to outweigh the decrease in training error, which marks the overfitting regime. Increasing model capacity tackles underfitting, while overfitting may be handled with regularization techniques (Frochte 2019; Bishop 2006; Goodfellow et al. 2016). Model capacity can be steered by choosing a hypothesis space, which is the set of functions that the learning algorithm is allowed to select as being the solution (Goodfellow et al. 2016). Here, varying the parameters of that function family is called representational capacity, while the effective capacity also takes into account additional limitations such as optimization problems (Goodfellow et al. 2016).

This balancing act between overfitting the training data and a badly trained model (underfitting), which can forecast neither the training data nor the test data in a robust way, can be solved methodically by the following procedures:

– Regularization,
– Ensemble,
– Early Stopping.

Regularization involves a wide range of techniques to artificially force the model to be simpler. The method depends on the type of learner involved, for example using dropout in a neural network, or adding a penalty parameter to the cost function in regression. Often, the regularization method is also a hyperparameter, which means that it can be adjusted by cross-validation (Raschka 2018).

Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are "bagging" and "boosting". A detailed explanation of these models is not provided; instead we refer to Rokach (2010).

The third method to avoid under- or overfitting is "Early Stopping". In order not to let the optimization run into insignificance, the algorithm was equipped with the tool "Early Stopping". This is a method in which the error on a validation set is monitored during training and training is stopped (with a little patience) if the validation error does not improve sufficiently. This method was used for the present study.

In summary, the U-Net and the UXception are equipped with the following settings:

Parameter Set of U-Net / UXception

– Batch size: 8,
– Epochs: 50,
– Activation function: ReLU,
– Optimizer: Adam,
– Loss function: Binary Crossentropy,
– Early stopping.

4.5 Metric for evaluation

In semantic segmentation it is of utmost importance to obtain an adequate measure for the quality of the NN. Typically there are three metrics for evaluation:

– Pixel Accuracy
– Intersection-Over-Union (IoU, Jaccard Index)
– Dice Coefficient (F1 Score)

In the following, all three metrics are briefly introduced and compared.

4.5.1 Pixel accuracy

Pixel accuracy is perhaps the easiest metric to understand conceptually. It is the percentage of pixels in the image that are correctly classified. A major problem with this metric arises in the case of so-called imbalanced data. Imbalanced data here means that the objects to be classified occupy less than 50% of the image. In the case of imbalanced data, the metric described above gives a high degree of accuracy, but this does not mean that the object to be found has been correctly classified. Unfortunately, class imbalance is predominant in many real-world data sets and cannot be ignored. Therefore, two alternative metrics are presented that can tackle this problem better.

4.5.2 IoU metric

The Intersection-over-Union (IoU), also known as the Jaccard index, is one of the most commonly used metrics in semantic segmentation. The IoU is in fact a simple metric that is easy to understand. The IoU is the area of overlap between the predicted segmentation and the ground truth, divided by the area of union between the predicted segmentation and the ground truth (see Fig. 14).

Ground truth means in this context the reality we want to predict with our deep learning model. Typically, the IoU metric is in the range of 0–1, where 1 indicates an absolute match between the predicted image and the mask image and 0 means no match.

4.5.3 Dice coefficient (F1 score)

The Dice coefficient is very similar to the IoU coefficient. It is defined as twice the overlap area divided by the total number of pixels in both images.

Starting with the results of the U-Net, the accuracy for all training epochs is shown in Fig. 15. Furthermore, the loss (BCE) over the epochs is shown.

As can be seen in Fig. 15, the U-Net has already reached a validation accuracy of over 99% after less than 15 epochs during training and validation. After 50 epochs of training, the U-Net has a validation accuracy for the metric IoU of about 99.25%.
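The three metrics of Sect. 4.5 can be computed for binary masks as sketched below. This is a minimal illustration in plain Python under stated assumptions: masks are lists of rows with 1 for the object class and 0 for the background, the 4 × 4 example masks are synthetic, and the function names are hypothetical. The last example shows the imbalance problem: a prediction that finds nothing still reaches a pixel accuracy of 75%, whereas IoU and Dice coefficient are zero.

```python
def flatten(mask):
    """Flatten a 2-D mask (list of rows) into a flat list."""
    return [p for row in mask for p in row]


def pixel_accuracy(pred, truth):
    """Share of pixels whose class is predicted correctly."""
    p, t = flatten(pred), flatten(truth)
    return sum(1 for a, b in zip(p, t) if a == b) / len(t)


def iou(pred, truth):
    """Intersection-over-Union (Jaccard index) of the object class."""
    p, t = flatten(pred), flatten(truth)
    inter = sum(1 for a, b in zip(p, t) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(p, t) if a == 1 or b == 1)
    return inter / union if union else 1.0


def dice(pred, truth):
    """Dice coefficient: twice the overlap over the sum of both areas."""
    p, t = flatten(pred), flatten(truth)
    inter = sum(1 for a, b in zip(p, t) if a == 1 and b == 1)
    total = sum(p) + sum(t)
    return 2 * inter / total if total else 1.0


# Synthetic 4 x 4 masks (assumption): the object covers 4 of 16 pixels.
truth = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
pred = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
empty = [[0] * 4 for _ in range(4)]  # a prediction that finds no object

acc_pred, iou_pred, dice_pred = (pixel_accuracy(pred, truth),
                                 iou(pred, truth), dice(pred, truth))
acc_empty, iou_empty, dice_empty = (pixel_accuracy(empty, truth),
                                    iou(empty, truth), dice(empty, truth))
```

For a single pair of masks the two overlap metrics are linked by Dice = 2 · IoU / (1 + IoU), which the values above reproduce (IoU = 0.75 gives Dice = 6/7).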
Even with a training of more than 50 epochs, no improvement in accuracy could be observed. In contrast, if you look at the results for the UXcep- tion model, one can see a gain in performance, which is Fig. 14 Example for the semantic segmentation metric reflected in a validation accuracy of almost 99.4% with Intersection-over-Union or Jaccard index reference to the metric IoU. Thus, the applied model of transfer learning, where the U-Net was combined with For each fixed “ground truth”, the two metrics are a pre-trained XCeption Net, showed that a slight per- always positively correlated. This means that if classi- formance gain can be achieved (see Fig. 16). However, fier A is better than B under one metric, it is also better both models deliver extreme good results for the vali- than classifier B under the other metric. The IoU metric dation accuracy in determining the cut edge of glasses generally tends to penalise individual instances of poor and producing a binary image of the cut edge. classification quantitatively more severely than the dice In order to show again the performance difference coefficient, even if both can agree that this one case is of both neural networks, the validation accuracy IoU bad. Similar to how L norm punishes the largest errors is evaluated for both NN’s in the following graphic. In more than L norm, the IoU metric tends to “square” Fig 17 it can be seen that the UXception gives slight the errors relative to the dice score. So the dice score better results than the U-Net. It should also be noted that tends to measure average performance, while the IoU the application of the transfer learning approach with score tends to measure worst case performance. the UXception has the general advantage of obtaining a higher accuracy of the forecast more quickly than a con- ventional U-Net with a similar computational effort. 
In 4.6 Results: metrics our case, however, both models provide approximately the same results, so that both models can be used for a In this section, the results are presented separately for later evaluation. each algorithm used (U-Net and UXception) under evaluation of the metrics IoU and Dice coefficient. Fig. 15 Results of the training and validation of the U-Net in terms of accuracy and loss over all epochs 123 Semantic segmentation with deep learning 35 Fig. 16 Results of the training and validation of the UXception in terms of accuracy and loss over all epochs colour describes the glass and the black colour the pre- dicted fractured edge. Shown in reddish (yellow-red) colours, the areas in which the NN is not 100% sure whether it is glass or a fractured edge. The final result is the image predicted by the UXcep- tion model with a threshold value, which is called a predicted binary image. I.e. that below the limit of IoU = 20% everything is attributed to the glass, whereas when an gray value of more than 20% is reached, the NN predicts the cut edge. Furthermore, in all images generated by AI, the man-made cut edge is shown as a cyan-coloured line for better orientation. Fig. 17 Comparison of the U-Net and UXception in terms of As shown in Fig. 18, the trained UXception model validation accuracy over all epochs is excellently suited to create a mask image from the original image, without the need for human interac- tion. It is also obvious that the red-yellow areas, where 4.7 Results: computer vision the NN is not sure whether it sees the cut edge or just the pure glass, are very narrow. This means that the In this section the results for the UXception model are areas in which the NN is uncertain play only a minor presented as so-called computer vision results. Here, role. 
A slight improvement of the mask images created we show how well the neural networks are able to auto- by AI could be achieved by the cut-off condition or matically reconstruct an original image of a cut glass binary prediction. The presented NN for predicting the edge, i.e. to convert it into a mask image without the cut glass edge is therefore very accurate and saves an need for human interaction. enormous amount of time in the prediction and pro- As input five images of a cut glass edge are used, duction of mask images. In addition, the mask images which were not previously used for training or valida- can be further processed, for example to make statisti- tion of the neural network. Hence, the virgin images cal analyses of the break structure of the cut glass edge. are used as input and the output, i.e. the mask image, However, this is not part of the present paper, as the aim should be delivered by the neural network. here was to show the application of AI in the context The results first show the original image and the of glass processing. man-made mask image. Then the prediction of the AI model is also displayed as a mask image. Since all neu- ral networks are statistical models, the mask image gen- erated by AI is provided with a colour map. The white 123 36 M. Drass et al. Fig. 18 Results of the semantic segmentation using the UXception net to predict the cut glass edge on the basis of three original images unknown to NN 123 Semantic segmentation with deep learning 37 5 Summary and discussion Ensslen, F., Müller-Braun, S.: Kantenfestigkeit von floatglas in abhängigkeit von wesentlichen schneidprozessparametern. ce/papers 1(1), 189–202 (2017) In this paper the application of AI and especially the Frochte, J.: Maschinelles Lernen: Grundlagen und Algorithmen problem of semantic segmentation was applied to the in Python. Carl Hanser Verlag GmbH Co KG (2019) context of glass processing for the first time. 
The goal Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT press, New York (2016) was to process an image of a cut glass edge using neural Kim, P.: Matlab Deep Learning, With Machine Learning, Neural networks in such a way that a so-called mask image is Networks and Artificial Intelligence, p. 130 (2017) generated by the AI algorithm. Accordingly, the mask Kingma, D.P., Welling, M., et al.: An introduction to variational image should only recognize the cut glass edge from autoencoders. Found. Trends Mach. Learn. 12(4), 307–392 (2019) the original image and display it in black in the mask Kraus, M.A.: Künstliche intelligenz und maschinelles lernen im image. kontext der forschung im konstruktiven glasbau. ce/papers The application of the so-called U-Net and UXcep- 3(1), 161–173 (2019a) tion net showed excellent results in the prediction of the Kraus, M.A.: Machine learning techniques for the material parameter identification of laminated glass in the intact and cut glass edge. The validation accuracies of both mod- post-fracture state. PhD thesis, Universität der Bundeswehr els exceeded 99%, which is sufficient for the generation München (2019b) of the mask image via AI. Kraus, M.A., Drass, M.: Artificial intelligence for structural glass engineering applications: overview, case studies and future Funding Open Access funding provided by Projekt DEAL. potentials. Glass Struct. Eng. (2020a) Kraus, M.A., Drass, M.: Künstliche intelligenz für die gebäude- Open Access This article is licensed under a Creative Com- hülle. Deutsches Ingenieurblatt 04 (2020b) mons Attribution 4.0 International License, which permits use, Müller-Braun, S., Franz, J., Schneider, J., Schneider, F.: Optische sharing, adaptation, distribution and reproduction in any medium merkmale der glaskante nach glaszuschnitt mit schneidräd- or format, as long as you give appropriate credit to the original chen. 
ce/papers 2(1), 99–111 (2018) author(s) and the source, provide a link to the Creative Com- Müller-Braun, S., Seel, M., König, M., Hof, P., Schneider, J., mons licence, and indicate if changes were made. The images or Oechsner, M.: Cut edge of annealed float glass: crack system other third party material in this article are included in the article’s and possibilities to increase the edge strength by adjusting Creative Commons licence, unless indicated otherwise in a credit the cutting process. Glass Struct. Eng. 5(1), 3–25 (2020) line to the material. If material is not included in the article’s Cre- Paluszek, M., Thomas, S.: MATLAB Machine Learning. A press, ative Commons licence and your intended use is not permitted by New York (2016) statutory regulation or exceeds the permitted use, you will need Raschka, S.: Model evaluation, model selection, and algo- to obtain permission directly from the copyright holder. To view rithm selection in machine learning (2018). arXiv preprint a copy of this licence, visit http://creativecommons.org/licenses/ arXiv:181112808 by/4.0/. Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1– 2), 1–39 (2010) Shorten, C., Khoshgoftaar, T.M.: A survey on image data aug- mentation for deep learning. J. Big Data 6(1), 60 (2019) References Sun, S., Cao, Z., Zhu, H., Zhao, J.: A survey of optimization methods from a machine learning perspective. IEEE Trans. Cybern. (2019) Bishop, C.M.: Pattern Recognition and Machine Learning. Infor- Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, mation Science and Statistics, 1st edn. Springer, Berlin E.: Deep learning for computer vision: a brief review. Com- (2006) put. Intell. Neurosci. (2018) Chen, F., Chen, N., Mao, H., Hu, H.: Assessing four neural networks on handwritten digit recognition dataset (mnist) (2018). 
arXiv preprint arXiv:181108278 Publisher’s Note Springer Nature remains neutral with regard Chollet, F.: Xception: deep learning with depthwise separable to jurisdictional claims in published maps and institutional affil- convolutions. In: Proceedings of the IEEE Conference on iations. Computer Vision and Pattern Recognition, pp. 1251–1258 (2017) Deng, J., Dong, W., Socher, R., Li, LJ., Li, K., Fei-Fei, L.: Ima- genet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recog- nition, IEEE, pp. 248–255 (2009)

Journal

Glass Structures & Engineering, Springer Journals

Published: Sep 17, 2020
