Machine learning based approach for the interpretation of engineering geophysical sounding logs

Armand Abordán; Norbert Péter Szabó

doi:10.1007/s40328-021-00354-4

Abordán, Armand; Szabó, Norbert Péter (2021-12-01)

In this paper, a set of machine learning (ML) tools is applied to estimate the water saturation of shallow unconsolidated sediments at the Bátaapáti site in Hungary. Water saturation is directly calculated from the first factor extracted from a set of direct push logs by factor analysis. The dataset, observed by engineering geophysical sounding tools (special variants of direct-push probes), contains data from a total of 12 shallow penetration holes. Both one- and two-dimensional applications of the suggested method are presented. To improve the performance of factor analysis, particle swarm optimization (PSO) is applied to give a globally optimized estimate for the factor scores. Furthermore, by a hyperparameter estimation approach, some control parameters of the utilized PSO algorithm are automatically estimated by simulated annealing (SA) to ensure the convergence of the procedure. The result of the suggested ML-based log analysis method is compared with and verified by an independent inversion estimate. The study shows that the PSO-based factor analysis aided by hyperparameter estimation provides reliable in situ estimates of water saturation, which may improve the solution of environmental and engineering problems in shallow unconsolidated heterogeneous formations.

Keywords Factor analysis · Particle swarm optimization · Simulated annealing · Direct push logging · Hyperparameter estimation

1 Introduction

Borehole geophysical data contain valuable information about the physical characteristics of the investigated subsurface (Everett 2013). The measured raw data can be processed and turned into petrophysical parameters by several different approaches.
The simplest is deterministic modeling, which is often based on a single empirical equation that transforms an observed variable into a petrophysical parameter (e.g., shale volume estimation along boreholes based only on the natural gamma-ray intensity log). A more advanced way to process geophysical data is inverse modeling (Zhdanov 2015). This approach combines numerical mathematics and optimization theory to derive the physical parameters of the investigated geological structures from the measured data. Geophysical inverse problems can be solved by linearized or global optimization methods (e.g., simulated annealing (SA), genetic algorithm (GA), particle swarm optimization (PSO)). However, in all cases, geophysical inversion is based on a scientific understanding, a set of equations (i.e., response functions) that describe the relationship between the observed data and the petrophysical parameters of the subsurface.

* Armand Abordán, gfaa@uni-miskolc.hu
Department of Geophysics, University of Miskolc, 3515 Miskolc-Egyetemváros, Hungary
MTA-ME Geoengineering Research Group, University of Miskolc, 3515 Miskolc-Egyetemváros, Hungary
Acta Geodaetica et Geophysica (2021) 56:681–696

Inverse problems are solved by updating a starting model iteratively until the synthetic data calculated on the model fit the measured data (Menke 2012). This optimization task is usually solved by linearized methods (e.g., the least squares method). These provide a fast and satisfactory solution, provided that a good initial (starting) model is available. However, the application of these methods to large scale (multivariate) problems is often problematic due to the complexity of the objective function.
During the minimization of the objective function (which measures the misfit between the measured and calculated data), these linearized methods use a gradient-based search, which means that they stabilize in the nearest local minimum of the objective function. The above-mentioned global optimization methods use random search instead and are therefore capable of escaping local minima of the objective function. Thus, they can provide a reliable and convergent solution independently of the chosen starting model. For this reason, global optimization techniques are widely used in geophysics (Sen and Stoffa 2013), e.g., in full-waveform Rayleigh-wave inversion (Xing and Mazzotti 2019), inversion of magnetotelluric data (Wang et al. 2012) and two-dimensional inversion of magnetic data (Liu et al. 2018).

One of their disadvantages is that their computational requirements far exceed those of linearized methods (Kaikkonen and Sharma 2001), but due to the rapid development of computing, this is no longer a problem nowadays. Furthermore, it is worth mentioning that, unlike linearized methods, global optimization methods cannot provide information on the error of parameter estimation based on a single program run. However, with different hybrid solutions, one can combine linearized methods with global methods to create efficient, two-phase algorithms. In general, complex inverse problems can be initialized by some global method to prevent the algorithm from stabilizing in a local optimum of the objective function. Then, when the procedure is close enough to the optimal solution, it can be switched to a linearized method so that the estimation error of the derived petrophysical parameters can be computed and the runtime of the algorithm is reduced as well. Such hybrid solutions were suggested by Soupios et al. (2011) for seismic inversion, Chunduru et al.
(1997) for 2D resistivity profiling data and by Szabó and Dobróka (2020) for the non-linear well logging inverse problem.

Because these global methods do not use linearization, they do not require supplementary information (e.g., derivatives or a good starting model), but their convergence is greatly influenced by their control (hyper) parameters (e.g., combination and parameters of genetic operators in GA; initial temperature, cooling schedule and parameter perturbation in SA; or inertia weight and learning factors in PSO). However, through hyperparameter estimation, it is possible to automatically select the optimal values for these parameters with a secondary optimization algorithm, thus ensuring the convergence of the procedure. Based on a similar principle, the efficiency of geophysical inversion methods can also be increased by hyperparameter estimation. For example, the value of parameters otherwise treated as constants in the response functions (e.g., zone parameters in case of well logging inversion) can be optimized by incorporating an additional program loop (Dobróka et al. 2016).

2 Machine learning tools used in geophysics

In addition to inversion-based data processing methods, another approach is to use machine learning (ML) from the toolbox of artificial intelligence (Dramsch 2020). The aim of ML is to create algorithms that can improve their own effectiveness by utilizing the experience gained during their operation. Examples include artificial neural networks, deep learning, cluster analysis and fuzzy logic. The frequently used regression analysis (linear, non-linear or logistic) is also considered an ML tool. The use of these tools has been gaining ground in the processing and interpretation of geological and geophysical data in recent years (Caté et al. 2017). Some examples from the literature are seismic interpretation (Wang et al. 2018), seismology (Kong et al.
2019), fault detection (Araya-Polo et al. 2017), electrical resistivity tomography (Vu and Jardani 2021) and well logging inversion (Szabó 2018). The use of these methods is advantageous in the absence of the previously mentioned scientific background (a mathematical relationship does not necessarily exist between the petrophysical parameters and the observed data) or when the theoretical description is so complex (e.g., extremely long runtime) that the exact description must be disregarded.

In recent decades a wide variety of machine learning tools have been developed. To select the one that is most suitable for solving a given problem, it is crucial to understand how they work. Therefore, it is best to categorize the different ML methods based on their learning style. This way we can sort them into three main categories.

The most commonly applied tools are based on supervised learning. In this case there is exact information (inputs and outputs) for training the algorithm. In addition to the input data, we also know what kind of results we can expect. Thus, supervised learning-based methods provide exact parameters (i.e., numerical labels) such as porosity, or categorical labels (e.g., rock type). Some of the most often used ML approaches based on supervised learning are regression analysis and classifiers.

In case of unsupervised learning, there is no exact information in the dataset regarding the output. The method uses data without labels, and therefore the procedure has to group the observed data by finding structures within the dataset. A frequently applied tool that uses unsupervised learning is cluster analysis, which classifies the input data based on some distance metric. Clustering is often applied to geophysical datasets, e.g., for rock typing based on wireline logging data (Ali and Sheng-Chang 2020). Dimension reduction methods also utilize the unsupervised learning approach (e.g., factor analysis or principal component analysis).
Big datasets are often difficult to handle; therefore, it can be advantageous to reduce the size (dimensionality) of the problem. By keeping most of the information contained in the original dataset (i.e., statistical sample), the same phenomenon can be described with fewer variables. Thus, the new variables contain the essential features of the investigated object as well as possible new properties that cannot be measured directly. Furthermore, by removing the error factors, dimension reduction can even be used to improve the signal-to-noise ratio. In factor analysis, a large number of interrelated or independent variables are replaced by a smaller number of uncorrelated variables, where the resulting new variables cannot be measured directly. The applicability of the dimensionality reduction methods comes from the fact that the newly extracted variables often show a strong correlation with different petrophysical parameters, thus they can be used, e.g., for lithological classification (Puskarczyk et al. 2019) or for quantitative estimation of petrophysical parameters (Szabó 2011).

The last category of ML methods based on learning style is the semi-supervised approach. A combination of the previous two cases, it can be used when only part of the database contains information about the output (i.e., labeled data) while some of the data does not have all the necessary information available (i.e., unlabeled data) for supervised training of the system. Semi-supervised learning ensures that the latter data is not wasted and contributes in some way to the design of the system. An example of a semi-supervised system can be found in Li et al. (2019) for lithology recognition using a generative adversarial network.

In the following sections, an ML-based system is suggested for direct-push log analysis, which enables the estimation of water saturation in shallow unconsolidated heterogeneous formations.

3 Water saturation estimation in the shallow subsurface by factor analysis

A direct push (DP) logging technology named engineering geophysical sounding was developed in Hungary (Fejes and Jósa 1990) based on cone penetration testing (CPT). Besides cone resistance and sleeve friction, DP logging can also measure the same parameters routinely recorded by wireline logs, including gamma-ray intensity, bulk density, neutron porosity and resistivity. It enables the characterization of the shallow unconsolidated subsurface with high vertical resolution down to approximately 50 m with highly mobile equipment. These measurements are done by advancing steel rods into the ground without drilling; therefore, there is less disturbance of the subsurface, which is also advantageous for geophysical measurements.

The different DP technologies can be used to solve problems of contamination mapping, environmental risk assessment and ground-water investigations (Dietrich and Leven 2006). By processing direct push logs, one can derive quantitative information about the composition of shallow unconsolidated sediments, such as clay content, porosity or water saturation (Balogh 2016). In this paper, an ML-based statistical approach is developed for the quantitative analysis of direct-push logs.

Factor analysis, as mentioned before, is an unsupervised ML tool for describing several measured quantities with fewer uncorrelated variables. In case of DP logging data, the measured logs are the input variables, which are processed jointly to extract new factors. Here, the derived factors can be regarded as factor logs, which can be related to petrophysical parameters by regression analysis (Szabó et al. 2018). In this study, the water saturation ($S_w$) of shallow unconsolidated formations is estimated based on the first factor log extracted from a direct push logging dataset. As a preliminary step, DP logs need to be standardized to serve as input for factor analysis.
Then they are collected into a matrix $\mathbf{D}$, where individual columns contain the recordings of different logging tools. In this paper, the processed logs are the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm).

$$\mathbf{D} = \begin{pmatrix} D_{11} & D_{12} & \cdots & D_{1K} \\ D_{21} & D_{22} & \cdots & D_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ D_{i1} & D_{i2} & \cdots & D_{iK} \\ \vdots & \vdots & \ddots & \vdots \\ D_{N1} & D_{N2} & \cdots & D_{NK} \end{pmatrix}, \tag{1}$$

where K is the number of applied DP logging tools and N is the number of measured depth points in the given sounding hole. The solution of factor analysis is based upon the decomposition of the data matrix

$$\mathbf{D} = \mathbf{F}\mathbf{L}^{T} + \mathbf{E}, \tag{2}$$

where $\mathbf{F}$ is an N-by-M matrix of factor scores (i.e., factor logs), M denotes the number of computed factors (M < K), $\mathbf{L}$ is a K-by-M matrix of factor loadings, which shows the correlation relationship between the measured DP logs and the newly extracted factors, and $\mathbf{E}$ is an N-by-K error matrix. Based on the model of factor analysis in Eq. (2), the derived factors are essentially the weighted sums of the measured direct push logs. The first column of matrix $\mathbf{F}$ (i.e., the first factor) describes most of the data variance of the measured dataset and therefore generally bears the most significance for data interpretation. By assuming that the factors are linearly independent, the correlation matrix is given by

$$\mathbf{R} = N^{-1}\mathbf{D}^{T}\mathbf{D} = \mathbf{L}\mathbf{L}^{T} + \mathbf{\Psi}, \tag{3}$$

where $\mathbf{\Psi}$ is a diagonal matrix of specific variances that is independent of the common factors.
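
The step from the decomposition in Eq. (2) to the correlation matrix in Eq. (3) can be written out explicitly. The derivation below is only a sketch: writing $\mathbf{R}$ for the correlation matrix and $\mathbf{\Psi}$ for the diagonal matrix of specific variances, it assumes (these conditions are the usual factor-analysis assumptions and are not spelled out in the text) that the factor scores are standardized and uncorrelated, $N^{-1}\mathbf{F}^{T}\mathbf{F} = \mathbf{I}$, that the errors are uncorrelated with the factors, $\mathbf{F}^{T}\mathbf{E} = \mathbf{0}$, and that $N^{-1}\mathbf{E}^{T}\mathbf{E} = \mathbf{\Psi}$.

```latex
\begin{aligned}
\mathbf{R} &= N^{-1}\mathbf{D}^{T}\mathbf{D}
            = N^{-1}\left(\mathbf{F}\mathbf{L}^{T}+\mathbf{E}\right)^{T}\left(\mathbf{F}\mathbf{L}^{T}+\mathbf{E}\right) \\
           &= \mathbf{L}\left(N^{-1}\mathbf{F}^{T}\mathbf{F}\right)\mathbf{L}^{T}
              + N^{-1}\left(\mathbf{L}\mathbf{F}^{T}\mathbf{E} + \mathbf{E}^{T}\mathbf{F}\mathbf{L}^{T}\right)
              + N^{-1}\mathbf{E}^{T}\mathbf{E} \\
           &= \mathbf{L}\mathbf{L}^{T} + \mathbf{\Psi}.
\end{aligned}
```

Under these assumptions the cross terms vanish, so the off-diagonal structure of the correlation matrix is carried entirely by the loadings.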
For the estimation of the factor loadings in matrix $\mathbf{L}$, Jöreskog (2007) suggested the following non-iterative formula

$$\mathbf{L} = \left(\mathrm{diag}\,\mathbf{S}^{-1}\right)^{-1/2}\mathbf{\Omega}\left(\mathbf{\Gamma} - \theta\mathbf{I}\right)^{1/2}\mathbf{U}, \tag{4}$$

where $\mathbf{\Gamma}$ is the diagonal matrix of the first M sorted eigenvalues of the sample covariance matrix $\mathbf{S}$, $\mathbf{\Omega}$ denotes the matrix of the first M eigenvectors, $\mathbf{I}$ is the identity matrix, $\mathbf{U}$ is an arbitrary M-by-M orthogonal matrix and θ is an adequately chosen constant that is slightly smaller than 1. Once the factor loadings are available, the factor scores can be estimated by a maximum likelihood method (Bartlett 1937)

$$\mathbf{F}^{T} = \left(\mathbf{L}^{T}\mathbf{\Psi}^{-1}\mathbf{L}\right)^{-1}\mathbf{L}^{T}\mathbf{\Psi}^{-1}\mathbf{D}^{T}. \tag{5}$$

As an advanced approach, in this study, factor scores are estimated by means of global optimization. We utilize the metaheuristic particle swarm optimization (PSO) to estimate the factor scores. This PSO-based solution of factor analysis is referred to as FA-PSO (Abordán and Szabó 2018). In this approach, factor analysis is treated as an inverse problem, thus Eq. (2) must be reformulated

$$\mathbf{d} = \tilde{\mathbf{L}}\mathbf{f} + \mathbf{e}, \tag{6}$$

where the standardized measured DP logs are represented as a column vector $\mathbf{d}$ of length KN, the factor loadings are given in an NK-by-NM matrix $\tilde{\mathbf{L}}$, the factor scores are gathered in a column vector $\mathbf{f}$ of length MN and $\mathbf{e}$ is the residual vector of length KN (Szabó and Dobróka 2018). To initialize this metaheuristic solution, the measured DP logs are first collected into the column vector $\mathbf{d}$, then the matrix of factor loadings can be estimated by Eq. (4). To obtain more meaningful factors, the varimax rotation is applied to the loadings (Kaiser 1958). Then these factor loadings are kept constant, while the optimal values of the factor scores $\mathbf{f}$ are approximated by PSO.
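
To make the role of the Bartlett estimate in Eq. (5) more concrete, the following is a minimal sketch for M = 2 factors in plain Python. It is not the authors' code. The loadings are the rotated SH-7 values reported in Section 4.1; the factor scores `F_true` are hypothetical, and the specific variances are assumed to be one minus the communality of each standardized log. The function name `bartlett_scores` is likewise only illustrative.

```python
def bartlett_scores(D, L, psi):
    """Bartlett estimate of Eq. (5) for M = 2 factors.

    D   : N rows of K standardized log readings
    L   : K rows of 2 factor loadings
    psi : K specific variances (diagonal of Psi)
    """
    K = len(L)
    # A = L^T Psi^-1 L, a symmetric 2-by-2 matrix
    a11 = sum(L[k][0] * L[k][0] / psi[k] for k in range(K))
    a12 = sum(L[k][0] * L[k][1] / psi[k] for k in range(K))
    a22 = sum(L[k][1] * L[k][1] / psi[k] for k in range(K))
    det = a11 * a22 - a12 * a12
    scores = []
    for row in D:
        # b = L^T Psi^-1 d for one depth point
        b1 = sum(L[k][0] * row[k] / psi[k] for k in range(K))
        b2 = sum(L[k][1] * row[k] / psi[k] for k in range(K))
        # f = A^-1 b, using the closed-form inverse of a 2-by-2 matrix
        scores.append([(a22 * b1 - a12 * b2) / det,
                       (a11 * b2 - a12 * b1) / det])
    return scores


# Rotated loadings reported for SH-7 (rows: GR, RCPT, DEN, NPHI, RES).
L = [[0.03, -0.61], [-0.09, 0.74], [0.89, 0.15], [0.82, -0.18], [-0.89, 0.18]]
# Assumption: unit-variance logs, so specific variance = 1 - communality.
psi = [1.0 - (l1 * l1 + l2 * l2) for l1, l2 in L]

# Hypothetical factor scores at three depth points, used to build noise-free logs.
F_true = [[1.0, -0.5], [0.2, 2.0], [-1.5, 0.3]]
D = [[sum(f[m] * L[k][m] for m in range(2)) for k in range(5)] for f in F_true]

F_est = bartlett_scores(D, L, psi)
```

With noise-free synthetic logs the estimate reproduces the generating scores, which is a quick sanity check. On real data the result is only a starting estimate; in the procedure described here it is used to define the search space for PSO.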
For measuring discrepancy, the following $L_2$ norm based objective function is used, which is minimized to estimate the optimal values of the factor scores

$$E = \sqrt{\frac{1}{NK}\sum_{i=1}^{NK}\left(d_i^{(m)} - d_i^{(c)}\right)^{2}}, \tag{7}$$

where $d_i^{(m)}$ and $d_i^{(c)}$ are the ith standardized measured and calculated direct push logging data, respectively. The calculated data come from the product $\tilde{\mathbf{L}}\mathbf{f}$ in Eq. (6), while $\mathbf{d}$ stores the measured DP logs in the same equation. This multiplication of factor loadings and factor scores permits the calculation of synthetic DP logs, which can be regarded as the solution of the forward problem.

For solving this optimization problem PSO is applied, which is a global optimization method inspired by the social behavior of animals. The original method was developed by Kennedy and Eberhart (1995). PSO is often applied for its relatively low computational requirements and easy implementation compared to other optimization methods such as the genetic algorithm (Holland 1975). It is a population based technique where each particle (i.e., possible solution) searches for the optimal solution by adjusting its own position in the search space, taking into account its own best position and the whole swarm's best position in every iteration step. This searching mechanism of PSO is governed by Eqs. (8)–(9). Having an n-dimensional optimization problem, the ith particle's position in the search space can be represented by the vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})^{T}$ and its velocity by the vector $\mathbf{v}_i = (v_{i1}, v_{i2}, \ldots, v_{in})^{T}$. Every particle of the swarm has a memory of its best position, which is continuously updated during the iterations and is stored in vector $\mathbf{p}_i = (p_{i1}, p_{i2}, \ldots, p_{in})^{T}$ for the ith particle. The position and velocity update equations are

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1), \tag{8}$$

$$\mathbf{v}_i(t+1) = w\,\mathbf{v}_i(t) + r_1 c_1\left(\mathbf{p}_i(t) - \mathbf{x}_i(t)\right) + r_2 c_2\left(\mathbf{g}(t) - \mathbf{x}_i(t)\right), \tag{9}$$

where t = 1, …, T is the current iteration step, T is the last iteration, i = 1, 2, …, S is the particle index and S is the population size. In Eq. (9), $r_1$ and $r_2$ denote random variables uniformly distributed between 0 and 1 and w is an inertia weight (Shi and Eberhart 1998) that was introduced to balance between global and local search. Vector $\mathbf{g}$ stores the very best position found by the swarm until the current iteration step. It is continuously updated in each iteration and helps the swarm to find the global optimum of the objective function. Acceleration factors $c_1$ and $c_2$ are positive constants, where $c_1$ is the cognitive scaling parameter and $c_2$ is the social scaling parameter, both generally set to 2 (Kennedy and Eberhart 1995).

Since PSO is a metaheuristic method, the chosen control (hyper-) parameters have a great effect on its performance (Zhang et al. 2005). To increase the reliability of the searching mechanism in finding the global optimum, the automatic selection of acceleration factors $c_1$ and $c_2$ is developed. It is carried out in an additional program loop with the help of simulated annealing (Metropolis et al. 1953) as depicted in Fig. 1.

Fig. 1 Flowchart of the FA-PSO method aided by hyperparameter estimation

For the outer program loop, we initialize the values of both the $c_1$ and $c_2$ hyperparameters as 1 and then let SA select the optimal values automatically for the current optimization problem. In every SA iteration step their value is adjusted by adding a small b parameter to both values.
Parameter b is randomly generated in each iteration from $-b_{max}$ to $b_{max}$, where $b_{max}$ is also reduced iteratively as $b_{max} = b_{max} \times \varepsilon$, where ε is a constant smaller than 1. Once the new $c_1$ and $c_2$ parameters are selected, PSO is run to find the optimal values of the factor scores $\mathbf{f}$. If the difference in energy (ΔE) between two successive iteration steps of the SA program loop, according to the objective function in Eq. (7), is negative (i.e., PSO was more effective with the new hyperparameters), then the current values of parameters $c_1$ and $c_2$ are accepted and the iterations are continued. If the difference in energy is positive (i.e., the new control parameters decreased the efficiency of PSO), then the acceptance probability of the new $c_1$ and $c_2$ is given by $P_a = \exp(-\Delta E/T^{*})$, where $T^{*}$ is the current temperature of the system. Temperature is reduced logarithmically according to $T^{(new)} = T_0/\lg(1+q)$ as suggested by Geman and Geman (1984), where q is the number of iterations computed so far and $T_0$ is the starting temperature of the system. The new hyperparameters are accepted only if a random number generated from 0 to 1 is smaller than $P_a$. This mechanism of accepting worse solutions prevents SA from being stuck in a local optimum near the starting model and thus allows for the optimal selection of parameters $c_1$ and $c_2$ for PSO.

4 Field tests

4.1 One-dimensional application

To test the suggested hyperparameter estimation assisted factor analysis, a direct push logging dataset is used that was measured in Bátaapáti, Southwest Hungary. The dataset contains a total of 12 sounding holes with the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm) logs measured in the upper 20–28 m of unconsolidated loessy-sandy layers. The sounding holes are located along a 550 m long profile approximately 50 m from each other.
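
Before turning to the field results, the two nested loops of Section 3 (the PSO updates of Eqs. (8)–(9) inside an SA loop that perturbs $c_1$ and $c_2$) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation. The "measured" logs are noise-free synthetic data built from hypothetical scores `F_TRUE` and the rotated SH-7 loadings. The random numbers $r_1$, $r_2$ are drawn per dimension and positions are clamped to the search limits. The perturbation b is drawn independently for $c_1$ and $c_2$, and negative values are clipped to zero. "lg" is read as the base-10 logarithm. The values of `t0`, `b_max`, `eps` and the iteration counts are arbitrary choices.

```python
import math
import random

# Rotated loadings reported for SH-7 (rows: GR, RCPT, DEN, NPHI, RES).
LOADINGS = [[0.03, -0.61], [-0.09, 0.74], [0.89, 0.15], [0.82, -0.18], [-0.89, 0.18]]
# Hypothetical factor scores (F1, F2) at three depth points.
F_TRUE = [1.0, -0.5, 0.2, 2.0, -1.5, 0.3]


def forward(f):
    """Synthetic standardized logs from factor scores (noise-free part of Eq. 6)."""
    d = []
    for n in range(0, len(f), 2):
        d += [l1 * f[n] + l2 * f[n + 1] for l1, l2 in LOADINGS]
    return d


D_MEAS = forward(F_TRUE)


def misfit(f):
    """RMS data distance between measured and calculated logs (Eq. 7)."""
    d_calc = forward(f)
    return math.sqrt(sum((m - c) ** 2 for m, c in zip(D_MEAS, d_calc)) / len(D_MEAS))


def pso(c1, c2, rng, particles=30, iters=150, lo=-5.0, hi=5.0, alpha=0.99):
    """Minimize misfit() with the updates of Eqs. (8)-(9); returns (g, E(g))."""
    n = len(F_TRUE)
    x = [[rng.uniform(lo, hi) for _ in range(n)] for _ in range(particles)]
    v = [[0.0] * n for _ in range(particles)]
    p = [xi[:] for xi in x]                 # personal best positions
    p_e = [misfit(xi) for xi in x]          # personal best misfits
    best = min(range(particles), key=p_e.__getitem__)
    g, g_e = p[best][:], p_e[best]          # swarm best position and misfit
    w = 1.0
    for _ in range(iters):
        for i in range(particles):
            for j in range(n):
                r1, r2 = rng.random(), rng.random()
                v[i][j] = (w * v[i][j] + r1 * c1 * (p[i][j] - x[i][j])
                           + r2 * c2 * (g[j] - x[i][j]))        # Eq. (9)
                x[i][j] = min(hi, max(lo, x[i][j] + v[i][j]))   # Eq. (8), clamped
            e = misfit(x[i])
            if e < p_e[i]:
                p[i], p_e[i] = x[i][:], e
                if e < g_e:
                    g, g_e = x[i][:], e
        w *= alpha  # damped inertia weight
    return g, g_e


def select_hyperparameters(sa_iters=10, t0=0.1, b_max=0.5, eps=0.9, seed=0):
    """SA outer loop over (c1, c2); returns the best (c1, c2, misfit) found."""
    rng = random.Random(seed)
    c1 = c2 = 1.0
    _, e_cur = pso(c1, c2, rng)
    best = (c1, c2, e_cur)
    for q in range(1, sa_iters + 1):
        n1 = max(0.0, c1 + rng.uniform(-b_max, b_max))
        n2 = max(0.0, c2 + rng.uniform(-b_max, b_max))
        _, e_new = pso(n1, n2, rng)
        delta = e_new - e_cur
        temp = t0 / math.log10(1 + q)       # logarithmic cooling
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            c1, c2, e_cur = n1, n2, e_new
            if e_cur < best[2]:
                best = (c1, c2, e_cur)
        b_max *= eps                        # shrink the perturbation range
    return best


c1_opt, c2_opt, e_opt = select_hyperparameters()
```

Each SA step costs one full PSO run, which is why the outer loop is kept short in the sketch. The selected values of `c1_opt` and `c2_opt` depend on the seed and on the toy problem, so they should not be compared with the values reported for the field data.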

First, a one-dimensional application is shown for sounding hole 7 (SH-7), where logging data is available for the interval of 0.5–27.7 m. To start off the procedure, DP logging data is standardized and then factor loadings are estimated by Eq. (4). The factor loadings rotated by the varimax algorithm are shown in Table 1. The first factor explains 71% of the total data variance, while the other 29% is explained by the second factor.

Table 1 Rotated factor loadings estimated by the FA-PSO method in SH-7

Direct-push logs              Symbol  Factor 1  Factor 2
Natural gamma-ray intensity   GR        0.03     −0.61
Cone resistance               RCPT     −0.09      0.74
Bulk density                  DEN       0.89      0.15
Neutron porosity              NPHI      0.82     −0.18
Resistivity                   RES      −0.89      0.18

According to the computed factor loadings, the first factor correlates most with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. These factor loadings remain unchanged for the remainder of the procedure. To give an estimate of the factor scores $\mathbf{f}$ by PSO, first, a random population of 60 particles (each representing a solution candidate for vector $\mathbf{f}$) is generated with uniform distribution within the search space previously defined by solving Eq. (5). In this instance, the limits of the factor scores are set as −5 to 5. The inertia weight w for Eq. (9) is initialized as 1 and then adjusted in each iteration based on $w^{(new)} = w^{(old)} \times \alpha$, where α is a damping factor set to 0.99, while hyperparameters $c_1$ and $c_2$ are automatically selected by SA in the outer program loop. SA is run for 50 iteration steps to find the optimal values of $c_1$ and $c_2$ (Fig. 2). The generated hyperparameters are plugged into PSO in each SA iteration step to test their effectiveness.
With the currently set $c_1$ and $c_2$ parameters, PSO is run for 2000 iterations to give an estimate of the optimal values of the factor scores in vector $\mathbf{f}$. During the PSO runs, all 60 particles are adjusted according to Eqs. (8)–(9), and the objective function defined in Eq. (7) is recalculated with the new values of the factor scores $\mathbf{f}$ in every iteration step. Figure 2 on the right depicts the final data distance of each PSO run with the corresponding $c_1$ and $c_2$ control parameters (Fig. 2 on the left). It can be seen that after approximately 15 iterations, SA finds the optimal $c_1$ and $c_2$ parameters; thereafter their values largely stabilize and the data distance reached by PSO does not decrease any further. The optimal values of $c_1$ and $c_2$ in this case are found to be 1.54 and 2.29, respectively, which differ somewhat from the default value of 2 for both (Kennedy and Eberhart 1995).

Fig. 2 Convergence of the FA-PSO method (on the right) by altering control parameters $c_1$ and $c_2$ by SA (on the left) in SH-7

Once the pre-defined number of iteration steps is reached, the final values of the factor scores (represented by the particle with the lowest data misfit) are accepted as the final solution. In this instance, the final data distance reached by PSO with the automatically selected hyperparameters at the end of the procedure was 0.45. Then the first factor ($F_1$) can be used to estimate the water saturation ($S_w$) of the penetrated formations by regression analysis (Szabó et al. 2012). In this paper, an exponential relationship is assumed between the first factor log and water saturation (Szabó et al. 2018)

$$S_w = a e^{bF_1} + c, \tag{10}$$

where a, b and c are area specific regression coefficients. As a reference for regression analysis, the $S_w$ result log taken from the complete quality controlled inversion (Drahos 2005) was applied.
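
As a small worked illustration of Eq. (10), the snippet below evaluates the exponential model for a few first-factor values. The coefficients are the best-fit values the text reports for SH-7 (a = 0.8084, b = 0.1996, c = −0.2046). The function name `water_saturation` and the sample factor scores are only illustrative.

```python
import math


def water_saturation(f1, a, b, c):
    """Eq. (10): S_w = a * exp(b * F1) + c."""
    return a * math.exp(b * f1) + c


# Regression coefficients reported for SH-7.
A, B, C = 0.8084, 0.1996, -0.2046

# Hypothetical first-factor scores spanning part of the -5..5 search range.
for f1 in (-3.0, 0.0, 1.0):
    print(f"F1 = {f1:+.1f}  ->  S_w = {water_saturation(f1, A, B, C):.3f}")
```

At $F_1 = 0$ the model reduces to a + c = 0.6038, and because a and b are both positive, the predicted saturation increases monotonically with the first factor.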

Figure 3 on the left depicts the exponential relationship between the first factor log and water saturation along sounding hole 7 based on Eq. (10). For this DP logging dataset, the regression coefficients with 95% confidence bounds are found to be a = 0.8084 [$a_{min}$ = 0.6445, $a_{max}$ = 0.9723], b = 0.1996 [$b_{min}$ = 0.1606, $b_{max}$ = 0.2387] and c = −0.2046 [$c_{min}$ = −0.3652, $c_{max}$ = −0.0440]. On the right of Fig. 3, water saturation estimated by local inversion and by factor analysis is plotted, where the high Pearson's correlation coefficient (r = 0.98) indicates that the two variables are nearly linearly proportional and thus shows the reliability of the FA-PSO based water saturation estimation.

Fig. 3 The exponential relationship between the first factor and water saturation in SH-7 (on the left), water saturation derived by local inversion and factor analysis (on the right)

The results of the FA-PSO method in SH-7 are depicted in Fig. 4 along with the input direct push logs. The first five tracks contain the measured logs, track 6 contains the first (purple) and second (red) factor logs and the last track shows the water saturation estimates by inverse modeling (green) and by the FA-PSO method (blue) utilizing the first factor log. It can be seen that the two water saturation estimates are almost identical, which validates the applicability of factor analysis for the processing of direct push logs and shows that it can serve as an independent tool for estimating water saturation in the shallow unconsolidated subsurface.

Fig. 4 The results of the FA-PSO method in SH-7

4.2 Two-dimensional application

The processing of direct push logs by factor analysis can also be carried out in two dimensions, which permits the simultaneous processing of DP logs recorded in neighboring sounding holes.
Thus, a 2D section of water saturation can be estimated in one interpretation phase from the DP logs recorded in multiple sounding holes (Szabó et al. 2018). First, gather the measured DP logs from the hth hole (h = 1, 2, …, H) in vector $\mathbf{d}^{(h)}$ as defined in Eq. (6). Then the model of factor analysis can be extended for multiple sounding holes as

$$\begin{pmatrix} \mathbf{d}^{(1)} \\ \vdots \\ \mathbf{d}^{(h)} \\ \vdots \\ \mathbf{d}^{(H)} \end{pmatrix} = \begin{pmatrix} \tilde{\mathbf{L}}^{(1)} & 0 & 0 & 0 & 0 \\ 0 & \ddots & 0 & 0 & 0 \\ 0 & 0 & \tilde{\mathbf{L}}^{(h)} & 0 & 0 \\ 0 & 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & 0 & \tilde{\mathbf{L}}^{(H)} \end{pmatrix} \times \begin{pmatrix} \mathbf{f}^{(1)} \\ \vdots \\ \mathbf{f}^{(h)} \\ \vdots \\ \mathbf{f}^{(H)} \end{pmatrix} + \begin{pmatrix} \mathbf{e}^{(1)} \\ \vdots \\ \mathbf{e}^{(h)} \\ \vdots \\ \mathbf{e}^{(H)} \end{pmatrix}, \tag{11}$$

where $\tilde{\mathbf{L}}^{(h)}$ denotes the matrix of factor loadings and $\mathbf{f}^{(h)}$ denotes the vector of factor scores computed for the hth sounding hole. Since there are $N_h$ logged depth points in the hth sounding hole, the total number of depth points is $N = N_1 + N_2 + \ldots + N_H$. The matrix of factor loadings in Eq. (11) is calculated similarly to the one-dimensional case by Eq. (4), and the optimal factor scores are estimated by PSO. It should be noted that the two-dimensional case differs from the one-dimensional case, because here the measured DP logs from several sounding holes are processed jointly, assuming that the same factor loadings are applicable for the whole exploration area. Once the factor logs are derived and related to water saturation by regression analysis, they can be interpolated (e.g., by kriging) between the sounding holes to derive the map of water saturation for the investigated area.

To test this two-dimensional approach for water saturation estimation by factor analysis, data is collected from 12 sounding holes (SH-1 to SH-12), which are located along a 550 m long profile, approximately 50 m apart from each other.
The natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm) logs are all available along the penetrated sounding holes to serve as the input of the two-dimensional factor analysis. The total number of DP logging data from the 12 sounding holes combined is 15,500. By extracting 2 factors, the rotated factor loadings are given in Table 2. The first factor describes 72% of the total data variance, while the remaining 28% is described by the second factor. The calculated factor loadings for the full dataset of all 12 sounding holes are essentially the same as the factor loadings estimated only in SH-7 (Table 1). The first factor correlates best with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. The remainder of the two-dimensional procedure is identical to the one-dimensional case. Factor loadings in Table 2 are fixed, and then by solving Eq. (5) for the factor scores, the boundaries of the search space for PSO are defined as −5 to 30. Due to the larger dataset and thus increased number of unknowns, PSO here requires 7500 iterations to find the optimal values of the factor scores by utilizing 60 particles. Hyperparameters $c_1$ and $c_2$ are again automatically selected by SA in 50 iteration steps. The minimal data distance reached by PSO in finding the optimal values of the factor scores by utilizing the SA derived hyperparameters is depicted in Fig. 5 on the right for all 50 iteration steps, with the corresponding $c_1$ and $c_2$ parameters shown in Fig. 5 on the left.

It can be seen that after approximately 12 iterations, SA finds the optimal values of $c_1$ and $c_2$; thereafter their values largely stabilize and the data distance reached by PSO does not decrease any further. The optimal values of $c_1$ and $c_2$ are found to be 1.33 and 2.13, respectively.
The final data distance reached by PSO with these parameters is 0.50. Then the resultant first factor log can be related to the water saturation of the investigated area by regression analysis based on Eq. (10), as seen in Fig. 6 on the left. As a reference for regression analysis, the quality checked water saturation values derived by local inversion are used. In this instance, the regression coefficients with 95% confidence bounds are found to be a = 0.8544 [$a_{min}$ = 0.8086, $a_{max}$ = 0.9001], b = 0.1982 [$b_{min}$ = 0.1880, $b_{max}$ = 0.2085] and c = −0.2258 [$c_{min}$ = −0.2706, $c_{max}$ = −0.1810]. On the right of Fig. 6, water saturation estimated by local inversion and by factor analysis is plotted, where the high correlation coefficient (r = 0.97) indicates that the two estimates are nearly linearly proportional and thus shows the reliability of the two-dimensional factor analysis for water saturation estimation. By interpolating the estimated water saturation logs between the sounding holes, one can derive the map of the water saturation of unsaturated formations along the processed profile (Fig. 7). Similarly, the results of local inversion estimates can also be interpolated between the holes, which is depicted in Fig. 8.

Table 2 Rotated factor loadings estimated by the FA-PSO method in sounding holes 1–12

Direct-push logs              Symbol  Factor 1  Factor 2
Natural gamma-ray intensity   GR        0.03     −0.50
Cone resistance               RCPT     −0.08      0.70
Bulk density                  DEN       0.85      0.20
Neutron porosity              NPHI      0.77     −0.24
Resistivity                   RES      −0.88      0.16

Fig. 5 Convergence of the FA-PSO method (on the right) by altering control parameters $c_1$ and $c_2$ by SA (on the left) for multiple sounding holes

Fig. 6 The exponential relationship between the first factor and water saturation in sounding holes 1–12 (on the left), water saturation derived by local inversion and the two-dimensional factor analysis (on the right)

The sounding holes are located approximately 50 m apart; sounding hole 7, processed in the previous section, is located at 300 m. The layers with different water saturations can be easily recognized on the derived maps, which can thus help with site characterization. The good correlation between the two estimates confirms the applicability of the two-dimensional factor analysis for water saturation estimation from direct push logs.

Fig. 7 Water saturation estimated by the two-dimensional factor analysis in sounding holes 1–12

Fig. 8 Water saturation estimated by inverse modeling in sounding holes 1–12

5 Conclusions

The paper presents the results of a machine learning-based log analysis method for direct-push logging data. The suggested factor analysis based approach is shown to effectively reduce the dimension of direct push logging datasets, and the factor logs newly extracted by the FA-PSO method can be used to estimate water saturation along arbitrarily long sounding hole intervals. By incorporating more than one sounding hole into the procedure, even two-dimensional water saturation profiles can be derived by simultaneously processing DP logs from neighboring holes. The result of the presented method is also verified by independent inversion-based estimates of water saturation. It is also shown that simulated annealing is capable of automatically selecting some of the hyperparameters of the utilized particle swarm optimization algorithm. Thus, the uncertainty of the optimization algorithm can be reduced, since the manual selection of the $c_1$ and $c_2$ parameters is no longer necessary. The suggested machine learning tool assures reliable evaluation of unconsolidated and unsaturated near-surface formations, which are the target domain of several engineering and environmental geophysical problems.

Acknowledgements The research was carried out in the Project No.
K-135323 supported by the National Research, Development and Innovation Office (NKFIH). The use of the dataset was permitted by Dezső Drahos from Loránd Eötvös University. The authors are grateful for the continuous support of János Stickel from Elgoscar Ltd.

Authors' contributions Armand Abordán: original draft, software, visualization. Norbert Péter Szabó: conceptualization, methodology, review and editing.

Funding Open access funding provided by University of Miskolc. The research was carried out in the Project No. K-135323 supported by the National Research, Development and Innovation Office (NKFIH).

Declarations

Conflict of interest The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Availability of data and material Because of the data confidentiality, the experimental data is not published.

Code availability Because of the data confidentiality, the code is not published.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References

Abordán A, Szabó NP (2018) Particle swarm optimization assisted factor analysis for shale volume estimation in groundwater formations. Geosci Eng 6(9):87–97
Ali A, Sheng-Chang C (2020) Characterization of well logs using K-mean cluster analysis. J Petrol Explor Prod Technol 10:2245–2256. https://doi.org/10.1007/s13202-020-00895-4
Araya-Polo M, Dahlke T, Frogner C, Zhang C, Poggio T, Hohl D (2017) Automated fault detection without seismic processing. Lead Edge 36(3):208–214. https://doi.org/10.1190/tle36030208.1
Balogh GP (2016) Interval inversion of engineering geophysical sounding logs. Geosci Eng 5(8):22–31
Bartlett MS (1937) The statistical conception of mental factors. Br J Psychol 28:97–104. https://doi.org/10.1111/j.2044-8295.1937.tb00863.x
Caté A, Perozzi L, Gloaguen E, Blouin M (2017) Machine learning as a tool for geologists. Lead Edge 36:215–219. https://doi.org/10.1190/tle36030215.1
Chunduru RK, Sen MK, Stoffa PL (1997) Hybrid optimization methods for geophysical inversion. Geophysics 62:1196–1207. https://doi.org/10.1190/1.1444220
Dietrich P, Leven C (2006) Direct push-technologies. In: Kirsch R (eds) Groundwater geophysics. Springer, Berlin. https://doi.org/10.1007/3-540-29387-6_11
Dobróka M, Szabó NP, Tóth J, Vass P (2016) Interval inversion approach for an improved interpretation of well logs. Geophysics 81:D155–D167. https://doi.org/10.1190/geo2015-0422.1
Drahos D (2005) Inversion of engineering geophysical penetration sounding logs measured along a profile. Acta Geodetica Geophys Hungarica 40:193–202. https://doi.org/10.1556/AGeod.40.2005.2.6
Dramsch JS (2020) 70 years of machine learning in geoscience in review. Adv Geophys. https://doi.org/10.1016/bs.agph.2020.08.002
Everett ME (2013) Near-surface applied geophysics. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781139088435
Fejes I, Jósa E (1990) The engineering geophysical sounding method. Principles, instrumentation, and computerised interpretation. In: SH Ward (ed) Geotechnical and environmental geophysics, Environmental and groundwater, vol 2. SEG, pp 321–331, ISBN 978-0-931830-99-0
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell PAMI-6:721–741. https://doi.org/10.1109/TPAMI.1984.4767596
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press
Jöreskog KG (2007) Factor analysis and its extensions. In: Cudeck R, MacCallum RC (eds) Factor analysis at 100, historical developments and future directions. Lawrence Erlbaum Associates, pp 47–77
Kaikkonen P, Sharma SP (2001) A comparison of performances of linearized and global nonlinear 2-D inversions of VLF and VLF-R electromagnetic data. Geophysics 66:462–475. https://doi.org/10.1190/1.1444937
Kaiser HF (1958) The varimax criterion for analytical rotation in factor analysis. Psychometrika 23:187–200. https://doi.org/10.1007/BF02289233
Kennedy J, Eberhart R (1995) Particle swarm optimization. Proc IEEE Int Conf Neural Netw 4:1942–1948. https://doi.org/10.1109/ICNN.1995.488968
Kong Q, Trugman DT, Ross ZE, Bianco MJ, Meade BJ, Gerstoft P (2019) Machine learning in seismology: turning data into insights. Seismol Res Lett 90(1):3–14. https://doi.org/10.1785/0220180259
Li G, Qiao Y, Zheng Y, Li Y, Wu W (2019) Semi-supervised learning based on generative adversarial network and its applied to lithology recognition. IEEE Access 7:67428–67437. https://doi.org/10.1109/ACCESS.2019.2918366
Liu S, Liang M, Hu X (2018) Particle swarm optimization inversion of magnetic data: field examples from iron ore deposits in China. Geophysics 83(4):J43–J59. https://doi.org/10.1190/geo2017-0456.1
Menke W (2012) Geophysical data analysis: discrete inverse theory, 3rd edn. Academic Press. https://doi.org/10.1016/C2011-0-69765-0
Metropolis N, Rosenbluth MN, Rosenbluth AW, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092. https://doi.org/10.1063/1.1699114
Puskarczyk E, Jarzyna JA, Wawrzyniak-Guz K et al (2019) Improved recognition of rock formation on the basis of well logging and laboratory experiments results using factor analysis. Acta Geophys 67:1809–1822. https://doi.org/10.1007/s11600-019-00337-8
Sen MK, Stoffa PL (2013) Global optimization methods in geophysical inversion. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511997570
Shi Y, Eberhart R (1998) A modified particle swarm optimizer. In: The 1998 IEEE international conference on IEEE world congress on computational intelligence evolutionary computation proceedings, pp 69–73. https://doi.org/10.1109/ICEC.1998.699146
Soupios P, Akca I, Mpogiatzis P, Basokur AT, Papazachos C (2011) Applications of hybrid genetic algorithms in seismic tomography. J Appl Geophys 75(3):479–489. https://doi.org/10.1016/j.jappgeo.2011.08.005
Szabó NP (2011) Shale volume estimation based on the factor analysis of well-logging data. Acta Geophys 59:935–953. https://doi.org/10.2478/s11600-011-0034-0
Szabó NP (2018) A genetic meta-algorithm-assisted inversion approach: hydrogeological study for the determination of volumetric rock properties and matrix and fluid parameters in unsaturated formations. Hydrogeol J 26:1935–1946. https://doi.org/10.1007/s10040-018-1749-7
Szabó NP, Dobróka M, Drahos D (2012) Factor analysis of engineering geophysical sounding data for water saturation estimation in shallow formations. Geophysics 77(3):WA35–WA44. https://doi.org/10.1190/geo2011-0265.1
Szabó NP, Dobróka M (2018) Exploratory factor analysis of wireline logs using a float-encoded genetic algorithm. Math Geosci 50:317–335. https://doi.org/10.1007/s11004-017-9714-x
Szabó NP, Balogh GP, Stickel J (2018) Most frequent value-based factor analysis of direct-push logging data. Geophys Prospect 66(3):530–548. https://doi.org/10.1111/1365-2478.12573
Szabó NP, Dobróka M (2020) Interval inversion as innovative well log interpretation tool for evaluating organic-rich shale formations. J Petrol Sci Eng. https://doi.org/10.1016/j.petrol.2019.106696
Vu MT, Jardani A (2021) Convolutional neural networks with SegNet architecture applied to three-dimensional tomography of subsurface electrical resistivity: CNN-3D-ERT. Geophys J Int 225(2):1319–1331. https://doi.org/10.1093/gji/ggab024
Wang Z, Di H, Shafiq MA, Alaudah Y, AlRegib G (2018) Successful leveraging of image processing and machine learning in seismic structural interpretation: a review. Lead Edge 37(6):451–461. https://doi.org/10.1190/tle37060451.1
Wang R, Yin C, Wang M, Wang G (2012) Simulated annealing for controlled-source audio-frequency magnetotelluric data inversion. Geophysics 77(2):E127–E133. https://doi.org/10.1190/geo2011-0106.1
Xing Z, Mazzotti A (2019) Two-grid full-waveform Rayleigh-wave inversion via a genetic algorithm - Part 1: method and synthetic examples. Geophysics 84(5):R805–R814. https://doi.org/10.1190/geo2018-0799.1
Zhang L-P, Yu H-J, Hu S-X (2005) Optimal choice of parameters for particle swarm optimization. J Zhejiang Univ Sci 6:528–534. https://doi.org/10.1631/jzus.2005.A0528
Zhdanov MS (2015) Inverse theory and applications in geophysics. Elsevier, ISBN 978-0-444-62674-5. https://doi.org/10.1016/C2012-0-03334-0


Publisher: Springer Journals
Copyright: Copyright © The Author(s) 2021
ISSN: 2213-5812
eISSN: 2213-5820
DOI: 10.1007/s40328-021-00354-4

Abstract

In this paper, a set of machine learning (ML) tools is applied to estimate the water saturation of shallow unconsolidated sediments at the Bátaapáti site in Hungary. Water saturation is directly calculated from the first factor extracted from a set of direct push logs by factor analysis. The dataset, observed by engineering geophysical sounding tools as special variants of direct-push probes, contains data from a total of 12 shallow penetration holes. Both one- and two-dimensional applications of the suggested method are presented. To improve the performance of factor analysis, particle swarm optimization (PSO) is applied to give a globally optimized estimate for the factor scores. Furthermore, by a hyperparameter estimation approach, some control parameters of the utilized PSO algorithm are automatically estimated by simulated annealing (SA) to ensure the convergence of the procedure. The result of the suggested ML-based log analysis method is compared with and verified by an independent inversion estimate. The study shows that the PSO-based factor analysis aided by hyperparameter estimation provides reliable in situ estimates of water saturation, which may improve the solution of environmental and engineering problems in shallow unconsolidated heterogeneous formations.

Keywords Factor analysis · Particle swarm optimization · Simulated annealing · Direct push logging · Hyperparameter estimation

* Armand Abordán gfaa@uni-miskolc.hu
Department of Geophysics, University of Miskolc, 3515 Miskolc-Egyetemváros, Hungary
MTA-ME Geoengineering Research Group, University of Miskolc, 3515 Miskolc-Egyetemváros, Hungary

1 Introduction

Borehole geophysical data contain valuable information about the physical characteristics of the investigated subsurface (Everett 2013). The measured raw data can be processed and turned into petrophysical parameters by several different approaches. The simplest is deterministic modeling, which is often based on a single empirical equation that transforms an observed variable into a petrophysical parameter (e.g., shale volume estimation along boreholes based only on the natural gamma-ray intensity log). A more advanced way to process geophysical data is inverse modeling (Zhdanov 2015). This approach combines numerical mathematics and optimization theory to derive the physical parameters of the investigated geological structures from the measured data. Geophysical inverse problems can be solved by linearized or global optimization methods (e.g., simulated annealing (SA), genetic algorithm (GA), particle swarm optimization (PSO)). However, in all cases, geophysical inversion is based on a scientific understanding, a set of equations (i.e., response functions) that describe the relationship between the observed data and the petrophysical parameters of the subsurface.

Inverse problems are solved by updating a starting model iteratively until the synthetic data calculated on the model fit the measured data (Menke 2012). This optimization task is usually solved by linearized methods (e.g., the least squares method). These provide a fast and satisfactory solution provided that a good initial (starting) model is available. However, the application of these methods to large-scale (multivariate) problems is often problematic due to the complexity of the objective function. During the minimization of the objective function (which measures the misfit between the measured and calculated data), these linearized methods use a gradient-based search, which means that they stabilize in the nearest local minimum of the objective function.

The above-mentioned global optimization methods use random search instead and are therefore capable of escaping local minima of the objective function. Thus, they can provide a reliable and convergent solution independently of the chosen starting model.
For this reason, global optimization techniques are widely used in geophysics (Sen and Stoffa 2013), e.g., in full-waveform Rayleigh-wave inversion (Xing and Mazzotti 2019), inversion of magnetotelluric data (Wang et al. 2012) and two-dimensional inversion of magnetic data (Liu et al. 2018). One of their disadvantages is that their computational requirements far exceed those of linearized methods (Kaikkonen and Sharma 2001), but owing to the rapid development of computing, this is no longer a serious obstacle. Furthermore, it is worth mentioning that, based on a single program run, global optimization methods cannot provide information on the error of parameter estimation as linearized methods do. However, with different hybrid solutions, one can combine linearized methods with global methods to create efficient, two-phase algorithms. In general, complex inverse problems can be initialized by some global method to prevent the algorithm from stabilizing in a local optimum of the objective function. Then, when the procedure is close enough to the optimal solution, it can be switched to a linearized method so that the estimation error of the derived petrophysical parameters can be computed and the runtime of the algorithm is reduced as well. Such hybrid solutions were suggested by Soupios et al. (2011) for seismic inversion, by Chunduru et al. (1997) for 2D resistivity profiling data and by Szabó and Dobróka (2020) for the non-linear well logging inverse problem.

Because these global methods do not use linearization, they do not require supplementary information (e.g., derivatives or a good starting model), but their convergence is greatly influenced by their control (hyper) parameters (e.g., the combination and parameters of genetic operators in GA; the initial temperature, cooling schedule and parameter perturbation in SA; or the inertia weight and learning factors in PSO).
However, through hyperparameter estimation, it is possible to automatically select the optimal values for these parameters with a secondary optimization algorithm, thus ensuring the convergence of the procedure. Based on a similar principle, the efficiency of geophysical inversion methods can also be increased by hyperparameter estimation. For example, the value of parameters otherwise treated as constants in the response functions (e.g., zone parameters in the case of well logging inversion) can be optimized by incorporating an additional program loop (Dobróka et al. 2016).

2 Machine learning tools used in geophysics

In addition to inversion-based data processing methods, another approach is to use machine learning (ML) from the toolbox of artificial intelligence (Dramsch 2020). The aim of ML is to create algorithms that can improve their own effectiveness by utilizing the experience gained during their operation. Examples include artificial neural networks, deep learning, cluster analysis and fuzzy logic. The frequently used regression analysis (linear, non-linear or logistic) is also considered an ML tool. The use of these tools has been gaining ground in the processing and interpretation of geological and geophysical data in recent years (Caté et al. 2017). Some examples from the literature are seismic interpretation (Wang et al. 2018), seismology (Kong et al. 2019), fault detection (Araya-Polo et al. 2017), electrical resistivity tomography (Vu and Jardani 2021) and well logging inversion (Szabó 2018). The use of these methods is advantageous when the previously mentioned scientific background is absent (a mathematical relationship does not necessarily exist between the petrophysical parameters and the observed data) or when the theoretical description is so complex (e.g., it leads to an extremely long runtime) that the exact description must be disregarded.
In recent decades, a wide variety of machine learning tools have been developed. To select the one that is most suitable for solving a given problem, it is crucial to understand how they work. Therefore, it is best to categorize the different ML methods based on their learning style. This way we can sort them into three main categories.

The most commonly applied tools are based on supervised learning. In this case, there is exact information (inputs and outputs) for training the algorithm. In addition to the input data, we also know what kind of results we can expect. Thus, supervised learning-based methods provide exact parameters (i.e., numerical labels) such as porosity, or categorical labels (e.g., rock type). Some of the most often used ML approaches based on supervised learning are regression analysis and classifiers.

In the case of unsupervised learning, there is no exact information in the dataset regarding the output. The procedure therefore uses data without labels and has to group the observed data by finding structures within the dataset. A frequently applied tool that uses unsupervised learning is cluster analysis, which classifies the input data based on some distance metric. Clustering is often applied to geophysical datasets, e.g., for rock typing based on wireline logging data (Ali and Sheng-Chang 2020). Dimension reduction methods also utilize the unsupervised learning approach (e.g., factor analysis or principal component analysis). Big datasets are often difficult to handle; therefore, it can be advantageous to reduce the size (dimensionality) of the problem. By keeping most of the information contained in the original dataset (i.e., statistical sample), the same phenomenon can be described with fewer variables. Thus, the new variables contain the essential features of the investigated object as well as possible new properties that cannot be measured directly.
Furthermore, by removing the error factors, it can even be used to improve the signal-to-noise ratio. In factor analysis, a large number of interrelated or independent variables are replaced by a smaller number of uncorrelated variables, where the resulting new variables cannot be measured directly. The applicability of the dimensionality reduction methods comes from the fact that the newly extracted variables often show a strong correlation with different petrophysical parameters; thus, they can be used, e.g., for lithological classification (Puskarczyk et al. 2019) or for quantitative estimation of petrophysical parameters (Szabó 2011).

The last category of ML methods based on learning style is the semi-supervised approach. As a combination of the previous two cases, it can be used when only part of the database contains information about the output (i.e., labeled data), while some of the data do not have all the information needed for supervised training of the system (i.e., unlabeled data). Semi-supervised learning ensures that the latter data are not wasted and contribute in some way to the design of the system. An example of a semi-supervised system can be found in Li et al. (2019) for lithology recognition using a generative adversarial network.

In the following sections, an ML-based system is suggested for direct-push log analysis, which enables the estimation of water saturation in shallow unconsolidated heterogeneous formations.

3 Water saturation estimation in the shallow subsurface by factor analysis

A direct push (DP) logging technology named engineering geophysical sounding was developed in Hungary (Fejes and Jósa 1990) based on cone penetration testing (CPT). Besides cone resistance and sleeve friction, DP logging can also measure the same parameters routinely recorded by wireline logs, including gamma-ray intensity, bulk density, neutron porosity and resistivity.
It enables the characterization of the shallow unconsolidated subsurface with high vertical resolution down to approximately 50 m with highly mobile equipment. These measurements are done by advancing steel rods into the ground without drilling; therefore, there is less disturbance of the subsurface, which is also advantageous for geophysical measurements.

The different DP technologies can be used to solve problems of contamination mapping, environmental risk assessment and groundwater investigations (Dietrich and Leven 2006). By processing direct push logs, one can derive quantitative information about the composition of shallow unconsolidated sediments, such as clay content, porosity or water saturation (Balogh 2016). In this paper, an ML-based statistical approach is developed for the quantitative analysis of direct-push logs.

Factor analysis, as mentioned before, is an unsupervised ML tool for describing several measured quantities with fewer uncorrelated variables. In the case of DP logging data, the measured logs are the input variables, which are processed jointly to extract new factors. Here, the derived factors can be regarded as factor logs, which can be related to petrophysical parameters by regression analysis (Szabó et al. 2018). In this study, the water saturation ($S_w$) of shallow unconsolidated formations is estimated based on the first factor log extracted from a direct push logging dataset.

As a preliminary step, DP logs need to be standardized to serve as input for factor analysis. Then they are collected into a matrix $\mathbf{D}$, where individual columns contain the recordings of different logging tools. In this paper, the processed logs are the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm).
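The preliminary step described above can be sketched in a few lines. The snippet below is a minimal illustration that assumes standardization means subtracting the mean and dividing by the population standard deviation of each log; the helper names and the sample readings are hypothetical and are not field data.

```python
from statistics import mean, pstdev

def standardize_logs(logs):
    """Return z-scored copies of each log (zero mean, unit standard deviation)."""
    standardized = {}
    for name, values in logs.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            raise ValueError(f"log {name!r} is constant and cannot be standardized")
        standardized[name] = [(v - mu) / sigma for v in values]
    return standardized

def to_data_matrix(standardized, order):
    """Arrange the logs column-wise: one row per depth point, one column per tool."""
    n_depths = len(standardized[order[0]])
    return [[standardized[name][i] for name in order] for i in range(n_depths)]

# Invented readings at four depth points (illustrative only)
raw_logs = {
    "GR": [5200.0, 5400.0, 6100.0, 5900.0],   # cpm
    "RCPT": [12.0, 9.5, 4.0, 6.5],            # MPa
    "DEN": [1.85, 1.90, 2.05, 2.00],          # g/cm3
    "NPHI": [0.22, 0.25, 0.33, 0.30],         # v/v
    "RES": [38.0, 30.0, 12.0, 18.0],          # ohmm
}
order = ["GR", "RCPT", "DEN", "NPHI", "RES"]
D = to_data_matrix(standardize_logs(raw_logs), order)
print(len(D), "depth points x", len(D[0]), "logs")
```

Standardization removes the units and the very different numerical ranges of the logs, so that no single tool dominates the correlation structure that factor analysis works with.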
$$
\mathbf{D} =
\begin{pmatrix}
D_{11} & D_{12} & \cdots & D_{1K} \\
D_{21} & D_{22} & \cdots & D_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
D_{i1} & D_{i2} & \cdots & D_{iK} \\
\vdots & \vdots & \ddots & \vdots \\
D_{N1} & D_{N2} & \cdots & D_{NK}
\end{pmatrix},
\tag{1}
$$

where K is the number of applied DP logging tools and N is the number of measured depth points in the given sounding hole. The solution of factor analysis is based upon the decomposition of the data matrix

$$
\mathbf{D} = \mathbf{F}\mathbf{L}^{T} + \mathbf{E},
\tag{2}
$$

where $\mathbf{F}$ is an N-by-M matrix of factor scores (i.e., factor logs), M denotes the number of computed factors (M < K), $\mathbf{L}$ is a K-by-M matrix of factor loadings, which shows the correlation relationship between the measured DP logs and the newly extracted factors, and $\mathbf{E}$ is an N-by-K error matrix. Based on the model of factor analysis in Eq. (2), the derived factors are essentially the weighted sums of the measured direct push logs. The first column of matrix $\mathbf{F}$ (i.e., the first factor) describes most of the data variance of the measured dataset and therefore generally bears the most significance for data interpretation. By assuming that the factors are linearly independent, the correlation matrix is given by

$$
\mathbf{R} = N^{-1}\mathbf{D}^{T}\mathbf{D} = \mathbf{L}\mathbf{L}^{T} + \mathbf{\Psi},
\tag{3}
$$

where $\mathbf{\Psi}$ is a diagonal matrix of specific variances that is independent of the common factors. For the estimation of the factor loadings in matrix $\mathbf{L}$, Jöreskog (2007) suggested the following non-iterative formula

$$
\mathbf{L} = \left(\mathrm{diag}\,\mathbf{S}^{-1}\right)^{-1/2}\mathbf{\Omega}\left(\mathbf{\Gamma} - \theta\mathbf{I}\right)^{1/2}\mathbf{U},
\tag{4}
$$

where $\mathbf{\Gamma}$ is the diagonal matrix of the first M sorted eigenvalues of the sample covariance matrix $\mathbf{S}$, $\mathbf{\Omega}$ denotes the matrix of the first M eigenvectors, $\mathbf{I}$ is the identity matrix, $\mathbf{U}$ is an arbitrary M-by-M orthogonal matrix and θ is an adequately chosen constant that is slightly smaller than 1. Once factor loadings are available, the factor scores can be estimated by a maximum likelihood method (Bartlett 1937)

$$
\mathbf{F}^{T} = \left(\mathbf{L}^{T}\mathbf{\Psi}^{-1}\mathbf{L}\right)^{-1}\mathbf{L}^{T}\mathbf{\Psi}^{-1}\mathbf{D}^{T}.
\tag{5}
$$

As an advanced approach, in this study, factor scores are estimated by means of global optimization.
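The two-step estimate described above (non-iterative loadings followed by Bartlett scores) can be sketched with NumPy on synthetic data. This is a sketch under stated assumptions rather than the authors' code: the eigen-decomposition is applied to the rescaled matrix $(\mathrm{diag}\,\mathbf{S}^{-1})^{1/2}\,\mathbf{S}\,(\mathrm{diag}\,\mathbf{S}^{-1})^{1/2}$, which is one reading of how the eigenvalues in Eq. (4) are obtained; U is taken as the identity (no rotation); θ = 0.99 is an arbitrary choice; and the function names and the synthetic dataset are hypothetical.

```python
import numpy as np

def joreskog_loadings(D, M, theta=0.99):
    """Non-iterative loadings in the spirit of Eq. (4), with U = I (unrotated)."""
    N = D.shape[0]
    S = D.T @ D / N                              # covariance of the standardized logs
    s = np.diag(np.linalg.inv(S))                # diagonal of S^{-1}
    # Assumption: eigen-decompose the rescaled matrix s^{1/2} S s^{1/2}
    S_star = np.sqrt(s)[:, None] * S * np.sqrt(s)[None, :]
    eigvals, eigvecs = np.linalg.eigh(S_star)    # eigh returns ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:M]          # indices of the M largest
    root = np.sqrt(np.clip(eigvals[idx] - theta, 0.0, None))
    return (eigvecs[:, idx] * root) / np.sqrt(s)[:, None]

def bartlett_scores(D, L):
    """Factor scores from Eq. (5): F^T = (L^T Psi^-1 L)^-1 L^T Psi^-1 D^T."""
    N = D.shape[0]
    R = D.T @ D / N
    psi = np.clip(np.diag(R - L @ L.T), 1e-8, None)   # specific variances, Eq. (3)
    W = L / psi[:, None]                              # Psi^{-1} L, shape K-by-M
    return np.linalg.solve(L.T @ W, W.T @ D.T).T      # shape N-by-M

# Synthetic example: two hidden factors driving five noisy "logs"
rng = np.random.default_rng(0)
N, K, M = 300, 5, 2
F_true = rng.standard_normal((N, M))
L_true = np.array([[0.9, 0.0], [0.8, 0.0], [-0.9, 0.0], [0.0, 0.8], [0.0, -0.7]])
D_raw = F_true @ L_true.T + 0.3 * rng.standard_normal((N, K))
D = (D_raw - D_raw.mean(axis=0)) / D_raw.std(axis=0)  # standardize each column

L = joreskog_loadings(D, M)
F = bartlett_scores(D, L)
print(L.round(2))
```

Because the signs and the ordering of eigenvectors are arbitrary, the recovered factors match the hidden ones only up to sign and permutation; in the workflow described here a varimax rotation would follow before interpretation.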
We currently utilize the metaheuristic particle swarm optimization (PSO) to estimate the factor scores. This PSO-based solution of factor analysis is referred to as FA-PSO (Abordán and Szabó 2018). In this approach, factor analysis is treated as an inverse problem, thus Eq. (2) must be reformulated as

$$\mathbf{d} = \tilde{\mathbf{L}}\mathbf{f} + \mathbf{e}, \tag{6}$$

where the standardized measured DP logs are represented as a column vector $\mathbf{d}$ of length KN, the factor loadings are given in an NK-by-NM matrix $\tilde{\mathbf{L}}$, the factor scores are gathered in a column vector $\mathbf{f}$ of length MN and $\mathbf{e}$ is the residual vector of length KN (Szabó and Dobróka 2018). To initialize this metaheuristic solution, the measured DP logs are first collected into the column vector $\mathbf{d}$, then the matrix of factor loadings is estimated by Eq. (4). To obtain more meaningful factors, varimax rotation is applied to the loadings (Kaiser 1958). These factor loadings are then kept constant, while the optimal values of the factor scores $\mathbf{f}$ are approximated by PSO. To measure the discrepancy, the following $L_2$ norm based objective function is minimized to estimate the optimal values of the factor scores

$$E = \frac{1}{NK}\sum_{i=1}^{NK}\left(d_i^{(m)} - d_i^{(c)}\right)^{2}, \tag{7}$$

where $d_i^{(m)}$ and $d_i^{(c)}$ are the ith standardized measured and calculated direct push logging data, respectively. The calculated data come from the product $\tilde{\mathbf{L}}\mathbf{f}$ in Eq. (6), while $\mathbf{d}$ stores the measured DP logs in the same equation. This multiplication of factor loadings and factor scores yields synthetic DP logs, which can be regarded as the solution of the forward problem.

To solve this optimization problem, PSO is applied, which is a global optimization method inspired by the social behavior of animals. The original method was developed by Kennedy and Eberhart (1995).
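The vectorized forward problem of Eq. (6) and the data distance of Eq. (7) can be illustrated with a short NumPy sketch. It assumes a depth-major ordering of the vectors (all K log values of the first depth point, then those of the second, and so on), which the text does not specify; under that assumption the NK-by-NM matrix is block diagonal with one copy of the loadings per depth point. The function names are hypothetical, and the squared misfit follows the $L_2$ norm reading of Eq. (7).

```python
import numpy as np

def expand_loadings(L, N):
    """Block-diagonal NK-by-NM matrix holding one copy of the K-by-M loadings per depth point."""
    return np.kron(np.eye(N), L)

def data_distance(f, L_tilde, d):
    """Mean squared difference between the measured data d and the calculated data L_tilde f."""
    residual = d - L_tilde @ f
    return float(residual @ residual) / d.size
```

With this ordering, `expand_loadings(L, N) @ F.ravel()` gives the same numbers as `(F @ L.T).ravel()`, so in practice the small matrix product of Eq. (2) can replace the large sparse one when the objective function is evaluated repeatedly.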
PSO is often applied for its relatively low computational requirements and easy implementation compared to other optimization methods such as the genetic algorithm (Holland 1975). It is a population-based technique in which each particle (i.e., possible solution) searches for the optimal solution by adjusting its position in the search space, taking into account its own best position and the whole swarm's best position in every iteration step. This searching mechanism of PSO is governed by Eqs. (8)–(9). For an n-dimensional optimization problem, the ith particle's position in the search space can be represented by the vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})^{T}$ and its velocity by the vector $\mathbf{v}_i = (v_{i1}, v_{i2}, \ldots, v_{in})^{T}$. Every particle of the swarm has a memory of its best position, which is continuously updated during the iterations and is stored in the vector $\mathbf{p}_i = (p_{i1}, p_{i2}, \ldots, p_{in})^{T}$ for the ith particle. The position and velocity update equations are

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1), \tag{8}$$

$$\mathbf{v}_i(t+1) = w\,\mathbf{v}_i(t) + r_1 c_1\left(\mathbf{p}_i(t) - \mathbf{x}_i(t)\right) + r_2 c_2\left(\mathbf{g}(t) - \mathbf{x}_i(t)\right), \tag{9}$$

where t = 1, …, T is the current iteration step, T is the last iteration, i = 1, 2, …, S is the particle index and S is the population size. In Eq. (9), $r_1$ and $r_2$ denote random variables uniformly distributed between 0 and 1 and w is an inertia weight (Shi and Eberhart 1998) that was introduced to balance global and local search. Vector $\mathbf{g}$ stores the very best position found by the swarm up to the current iteration step. It is continuously updated in each iteration and helps the swarm to find the global optimum of the objective function. The acceleration factors $c_1$ and $c_2$ are positive constants, where $c_1$ is the cognitive scaling parameter and $c_2$ is the social scaling parameter, both generally set to 2 (Kennedy and Eberhart 1995). Since PSO is a metaheuristic method, the chosen control (hyper-) parameters have a great effect on its performance (Zhang et al. 2005).
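The update rules of Eqs. (8)–(9) can be sketched as a compact NumPy routine. This is a minimal illustration rather than the authors' implementation: the function name and defaults are hypothetical, $r_1$ and $r_2$ are drawn independently for every particle and dimension, positions are clipped to the search limits, and a multiplicative damping of the inertia weight is assumed.

```python
import numpy as np

def pso(objective, dim, lower, upper, n_particles=60, n_iter=200,
        c1=2.0, c2=2.0, w0=1.0, alpha=0.99, seed=0):
    """Minimize objective(x) over the box [lower, upper]^dim with a basic PSO."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                                 # particle velocities
    p = x.copy()                                         # personal best positions
    p_cost = np.array([objective(xi) for xi in x])
    g = p[np.argmin(p_cost)].copy()                      # swarm best position
    g_cost = p_cost.min()
    w = w0
    for _ in range(n_iter):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = w * v + r1 * c1 * (p - x) + r2 * c2 * (g - x)   # Eq. (9)
        x = np.clip(x + v, lower, upper)                    # Eq. (8), kept inside the limits
        cost = np.array([objective(xi) for xi in x])
        better = cost < p_cost
        p[better], p_cost[better] = x[better], cost[better]
        if p_cost.min() < g_cost:
            g, g_cost = p[np.argmin(p_cost)].copy(), p_cost.min()
        w *= alpha                                          # assumed inertia damping
    return g, g_cost
```

For factor analysis, `objective` would evaluate the data distance for a candidate vector of factor scores, with `dim` equal to the number of unknown scores.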
To increase the reliability of the searching mechanism in finding the global optimum, an automatic selection of the acceleration factors $c_1$ and $c_2$ is developed. It is carried out in an additional program loop with the help of simulated annealing (Metropolis et al. 1953), as depicted in Fig. 1.

Fig. 1 Flowchart of the FA-PSO method aided by hyperparameter estimation

For the outer program loop, we initialize both hyperparameters $c_1$ and $c_2$ as 1 and then let SA select the optimal values automatically for the current optimization problem. In every SA iteration step, their values are adjusted by adding a small parameter b to both values. Parameter b is randomly generated in each iteration from $-b_{max}$ to $b_{max}$, where $b_{max}$ is also reduced iteratively as $b_{max} = b_{max} \times \varepsilon$, where ε is a constant smaller than 1. Once the new $c_1$ and $c_2$ parameters are selected, PSO is run to find the optimal values of the factor scores $\mathbf{f}$. If the difference in energy (ΔE) between two successive iteration steps of the SA program loop, measured by the objective function in Eq. (7), is negative (i.e., PSO was more effective with the new hyperparameters), then the current values of $c_1$ and $c_2$ are accepted and the iterations continue. If the difference in energy is positive (i.e., the new control parameters decreased the efficiency of PSO), then the acceptance probability of the new $c_1$ and $c_2$ is given by $P_a = \exp(-\Delta E/T^{*})$, where $T^{*}$ is the current temperature of the system. The temperature is reduced logarithmically according to $T^{*(new)} = T_0/\lg(1+q)$ as suggested by Geman and Geman (1984), where q is the number of iterations computed so far and $T_0$ is the starting temperature of the system. The new hyperparameters are accepted only if a random number generated from 0 to 1 is smaller than $P_a$.
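The outer SA loop described above can be sketched in plain Python as follows. The sketch makes several assumptions that the text leaves open: the function name and the default values of `b_max`, `eps` and `T0` are hypothetical, the perturbation is drawn separately for $c_1$ and $c_2$, negative candidates are clipped to zero, lg is taken as the base-10 logarithm, and the best pair visited is returned rather than the last accepted one. The callable `run_pso(c1, c2)` stands for a complete PSO run that returns the final data distance.

```python
import math
import random

def anneal_acceleration_factors(run_pso, n_iter=50, b_max=0.5, eps=0.95, T0=1.0, seed=0):
    """Select c1 and c2 by simulated annealing; run_pso(c1, c2) returns the energy."""
    rng = random.Random(seed)
    c1, c2 = 1.0, 1.0                                   # both factors start at 1
    energy = run_pso(c1, c2)
    best = (c1, c2, energy)
    for q in range(1, n_iter + 1):
        # Assumption: independent perturbations, clipped so the factors stay non-negative.
        new_c1 = max(c1 + rng.uniform(-b_max, b_max), 0.0)
        new_c2 = max(c2 + rng.uniform(-b_max, b_max), 0.0)
        new_energy = run_pso(new_c1, new_c2)
        delta = new_energy - energy
        temperature = T0 / math.log10(1 + q)            # logarithmic cooling schedule
        # Accept improvements always, worse candidates with probability exp(-delta / T).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            c1, c2, energy = new_c1, new_c2, new_energy
            if energy < best[2]:
                best = (c1, c2, energy)
        b_max *= eps                                    # shrink the perturbation range
    return best
```

Because every call to `run_pso` is a full optimization, the cost of this loop is roughly the number of SA iterations times the cost of a single PSO run.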
This mechanism of accepting worse solutions prevents SA from getting stuck in a local optimum near the starting model and thus allows for the optimal selection of parameters $c_1$ and $c_2$ for PSO.

4 Field tests

4.1 One-dimensional application

To test the suggested hyperparameter estimation assisted factor analysis, a direct push logging dataset is used that was measured in Bátaapáti, Southwest Hungary. The dataset contains a total of 12 sounding holes with the natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm) logs measured in the upper 20–28 m of unconsolidated loessy-sandy layers. The sounding holes are located along a 550 m long profile, approximately 50 m from each other.

First, a one-dimensional application is shown for sounding hole 7 (SH-7), where logging data are available for the interval of 0.5–27.7 m. To start the procedure, the DP logging data are standardized and then the factor loadings are estimated by Eq. (4). The factor loadings rotated by the varimax algorithm are shown in Table 1. The first factor explains 71% of the total data variance, while the other 29% is explained by the second factor. According to the computed factor loadings, the first factor correlates most with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. These factor loadings then remain unchanged for the remainder of the procedure.
Table 1 Rotated factor loadings estimated by the FA-PSO method in SH-7

| Direct-push logs | Symbol | Factor 1 | Factor 2 |
|---|---|---|---|
| Natural gamma-ray intensity | GR | 0.03 | −0.61 |
| Cone resistance | RCPT | −0.09 | 0.74 |
| Bulk density | DEN | 0.89 | 0.15 |
| Neutron porosity | NPHI | 0.82 | −0.18 |
| Resistivity | RES | −0.89 | 0.18 |

To estimate the factor scores $\mathbf{f}$ by PSO, first, a random population of 60 particles (each representing a solution candidate for vector $\mathbf{f}$) is generated with uniform distribution within the search space previously defined by solving Eq. (5). In this instance, the limits of the factor scores are set as −5 to 5. The inertia weight w in Eq. (9) is initialized as 1 and then adjusted in each iteration based on $w^{(new)} = w^{(old)} \times \alpha$, where α is a damping factor set to 0.99, while hyperparameters $c_1$ and $c_2$ are automatically selected by SA in the outer program loop. SA is run for 50 iteration steps to find the optimal values of $c_1$ and $c_2$ (Fig. 2). The generated hyperparameters are plugged into PSO in each SA iteration step to test their effectiveness. With the currently set $c_1$ and $c_2$ parameters, PSO is run for 2000 iterations to estimate the optimal values of the factor scores in vector $\mathbf{f}$. During the PSO runs, all 60 particles are adjusted according to Eqs. (8)–(9), and the objective function defined in Eq. (7) is recalculated with the new values of the factor scores $\mathbf{f}$ in every iteration step. Figure 2 on the right depicts the final data distance of each PSO run with the corresponding $c_1$ and $c_2$ control parameters (Fig. 2 on the left). It can be seen that after approximately 15 iterations, SA finds the optimal $c_1$ and $c_2$ parameters; thereafter their values largely stabilize and the data distance reached by PSO does not decrease any further.
The optimal values of $c_1$ and $c_2$ in this case are found to be 1.54 and 2.29, respectively, which differ somewhat from the default value of 2 for both parameters (Kennedy and Eberhart 1995).

Fig. 2 Convergence of the FA-PSO method (on the right) by altering control parameters $c_1$ and $c_2$ by SA (on the left) in SH-7

Once the pre-defined number of iteration steps is reached, the final values of the factor scores (represented by the particle with the lowest data misfit) are accepted as the final solution. In this instance, the final data distance reached by PSO with the automatically selected hyperparameters at the end of the procedure was 0.45. Then the first factor ($F_1$) can be used to estimate the water saturation ($S_w$) of the penetrated formations by regression analysis (Szabó et al. 2012). In this paper, an exponential relationship is assumed between the first factor log and water saturation (Szabó et al. 2018)

$$S_w = a e^{bF_1} + c, \tag{10}$$

where a, b and c are area-specific regression coefficients. As a reference for regression analysis, the $S_w$ log taken from the complete quality-controlled inversion (Drahos 2005) was applied.

Figure 3 on the left depicts the exponential relationship between the first factor log and water saturation along sounding hole 7 based on Eq. (10). For this DP logging dataset, the regression coefficients with 95% confidence bounds are found to be a = 0.8084 [$a_{min}$ = 0.6445, $a_{max}$ = 0.9723], b = 0.1996 [$b_{min}$ = 0.1606, $b_{max}$ = 0.2387] and c = −0.2046 [$c_{min}$ = −0.3652, $c_{max}$ = −0.0440]. On the right of Fig. 3, the water saturation estimated by local inversion is plotted against that estimated by factor analysis, where the high Pearson's correlation coefficient (r = 0.98) indicates that the two estimates are nearly linearly related and thus shows the reliability of the FA-PSO based water saturation estimation.

The results of the FA-PSO method in SH-7 are depicted in Fig.
4 along with the input direct push logs. The first five tracks contain the measured logs, track 6 contains the first (purple) and second (red) factor logs and the last track shows the water saturation estimates by inverse modeling (green) and by the FA-PSO method (blue) utilizing the first factor log. It can be seen that the two water saturation estimates are almost identical, which validates the applicability of factor analysis for the processing of direct push logs; the method can thus serve as an independent tool for estimating water saturation in the shallow unconsolidated subsurface.

Fig. 3 The exponential relationship between the first factor and water saturation in SH-7 (on the left), water saturation derived by local inversion and factor analysis (on the right)

Fig. 4 The results of the FA-PSO method in SH-7

4.2 Two-dimensional application

The processing of direct push logs by factor analysis can also be carried out in two dimensions, which permits the simultaneous processing of DP logs recorded in neighboring sounding holes. Thus, a 2D section of water saturation can be estimated in one interpretation phase from the DP logs recorded in multiple sounding holes (Szabó et al. 2018). First, the measured DP logs from the hth hole (h = 1, 2, …, H) are gathered in the vector $\mathbf{d}^{(h)}$ defined in Eq. (6). Then the model of factor analysis can be extended for multiple sounding holes as

$$\begin{pmatrix}\mathbf{d}^{(1)}\\ \vdots\\ \mathbf{d}^{(h)}\\ \vdots\\ \mathbf{d}^{(H)}\end{pmatrix} = \begin{pmatrix}\tilde{\mathbf{L}}^{(1)} & 0 & 0 & 0 & 0\\ 0 & \ddots & 0 & 0 & 0\\ 0 & 0 & \tilde{\mathbf{L}}^{(h)} & 0 & 0\\ 0 & 0 & 0 & \ddots & 0\\ 0 & 0 & 0 & 0 & \tilde{\mathbf{L}}^{(H)}\end{pmatrix} \times \begin{pmatrix}\mathbf{f}^{(1)}\\ \vdots\\ \mathbf{f}^{(h)}\\ \vdots\\ \mathbf{f}^{(H)}\end{pmatrix} + \begin{pmatrix}\mathbf{e}^{(1)}\\ \vdots\\ \mathbf{e}^{(h)}\\ \vdots\\ \mathbf{e}^{(H)}\end{pmatrix}, \tag{11}$$

where $\tilde{\mathbf{L}}^{(h)}$ denotes the matrix of factor loadings and $\mathbf{f}^{(h)}$ denotes the vector of factor scores computed for the hth sounding hole. Since there are $N_h$ logged depth points in the hth sounding hole, the total number of depth points is $N = N_1 + N_2 + \ldots + N_H$. The matrix of factor loadings in Eq.
(11) is calculated similarly to the one-dimensional case by Eq. (4), and the optimal factor scores are estimated by PSO. Note that the two-dimensional case differs from the one-dimensional case in that the measured DP logs from several sounding holes are processed jointly, assuming that the same factor loadings are applicable to the whole exploration area. Once the factor logs are derived and related to water saturation by regression analysis, they can be interpolated (e.g., by kriging) between the sounding holes to derive the map of water saturation for the investigated area.

To test this two-dimensional approach for water saturation estimation by factor analysis, data are collected from 12 sounding holes (SH-1 to SH-12), which are located along a 550 m long profile, approximately 50 m apart from each other. The natural gamma-ray intensity, GR (cpm), cone resistance, RCPT (MPa), bulk density, DEN (g/cm³), neutron porosity, NPHI (v/v) and resistivity, RES (ohmm) logs are all available along the penetrated sounding holes to serve as the input of the two-dimensional factor analysis. The total number of DP logging data from the 12 sounding holes combined is 15,500. The rotated factor loadings obtained by extracting 2 factors are given in Table 2. The first factor describes 72% of the total data variance, while the remaining 28% is described by the second factor. The calculated factor loadings for the full dataset of all 12 sounding holes are essentially the same as the factor loadings estimated only in SH-7 (Table 1). The first factor correlates best with the bulk density, neutron porosity and resistivity logs, while the second factor is most influenced by the cone resistance and natural gamma-ray intensity logs. The remainder of the two-dimensional procedure is identical to the one-dimensional case. Factor loadings in Table 2 are fixed, and then by solving Eq.
(5) for the factor scores, the boundaries of the search space for PSO are defined as −5 to 30. Due to the larger dataset and thus the increased number of unknowns, PSO here requires 7500 iterations to find the optimal values of the factor scores by utilizing 60 particles. Hyperparameters $c_1$ and $c_2$ are again automatically selected by SA in 50 iteration steps. The minimal data distance reached by PSO in finding the optimal values of the factor scores with the SA-derived hyperparameters is depicted in Fig. 5 on the right for all 50 iteration steps, with the corresponding $c_1$ and $c_2$ parameters shown in Fig. 5 on the left. It can be seen that after approximately 12 iterations, SA finds the optimal values of $c_1$ and $c_2$; thereafter their values largely stabilize and the data distance reached by PSO does not decrease any further. The optimal values of $c_1$ and $c_2$ are found to be 1.33 and 2.13, respectively. The final data distance reached by PSO with these parameters is 0.50.

Then the resultant first factor log can be related to the water saturation of the investigated area by regression analysis based on Eq. (10), as seen in Fig. 6 on the left. As a reference for regression analysis, the quality-checked water saturation values derived by local inversion are used. In this instance, the regression coefficients with 95% confidence bounds are found to be a = 0.8544 [$a_{min}$ = 0.8086, $a_{max}$ = 0.9001], b = 0.1982 [$b_{min}$ = 0.1880, $b_{max}$ = 0.2085] and c = −0.2258 [$c_{min}$ = −0.2706, $c_{max}$ = −0.1810]. On the right of Fig. 6, the water saturation estimated by local inversion is plotted against that estimated by factor analysis, where the high correlation coefficient (r = 0.97) indicates that the two estimates are nearly linearly related and thus shows the reliability of the two-dimensional factor analysis for water saturation estimation.
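The exponential regression of Eq. (10) can be illustrated with a small NumPy sketch. It is not the authors' regression procedure: for a self-contained example, the exponent b is scanned over a grid (restricted here to positive values, an assumption) while a and c are solved by linear least squares for each trial b. The function names and the grid are hypothetical.

```python
import numpy as np

def fit_exponential(F1, Sw, b_grid=None):
    """Fit Sw = a * exp(b * F1) + c by scanning b and solving a and c linearly."""
    if b_grid is None:
        b_grid = np.linspace(0.01, 1.0, 1000)   # assumed search range for the exponent
    best = None
    for b in b_grid:
        G = np.column_stack([np.exp(b * F1), np.ones_like(F1)])
        coef, *_ = np.linalg.lstsq(G, Sw, rcond=None)
        sse = float(np.sum((G @ coef - Sw) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], b, coef[1])
    _, a, b, c = best
    return a, b, c

def pearson_r(x, y):
    """Pearson's correlation coefficient between two estimates."""
    return float(np.corrcoef(x, y)[0, 1])
```

In use, `F1` would hold the first factor log and `Sw` the inversion-derived reference saturation; the fitted curve `a * np.exp(b * F1) + c` can then be compared with the reference through `pearson_r`.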
By interpolating the estimated water saturation logs between the sounding holes, one can derive the map of the water saturation of unsaturated formations along the processed profile (Fig. 7). Similarly, the local inversion estimates can also be interpolated between the holes, as depicted in Fig. 8. The sounding holes are located approximately 50 m apart; sounding hole 7, processed in the previous section, is located at 300 m. The layers with different water saturations can be easily recognized on the derived maps, which can help with site characterization. The good correlation between the two estimates confirms the applicability of the two-dimensional factor analysis for water saturation estimation from direct push logs.

Table 2 Rotated factor loadings estimated by the FA-PSO method in sounding holes 1–12

| Direct-push logs | Symbol | Factor 1 | Factor 2 |
|---|---|---|---|
| Natural gamma-ray intensity | GR | 0.03 | −0.50 |
| Cone resistance | RCPT | −0.08 | 0.70 |
| Bulk density | DEN | 0.85 | 0.20 |
| Neutron porosity | NPHI | 0.77 | −0.24 |
| Resistivity | RES | −0.88 | 0.16 |

Fig. 5 Convergence of the FA-PSO method (on the right) by altering control parameters $c_1$ and $c_2$ by SA (on the left) for multiple sounding holes

Fig. 6 The exponential relationship between the first factor and water saturation in sounding holes 1–12 (on the left), water saturation derived by local inversion and the two-dimensional factor analysis (on the right)

Fig. 7 Water saturation estimated by the two-dimensional factor analysis in sounding holes 1–12

Fig. 8 Water saturation estimated by inverse modeling in sounding holes 1–12

5 Conclusions

The paper presents the results of a machine learning-based log analysis method for direct-push logging data.
The suggested factor analysis based approach is shown to effectively reduce the dimension of direct push logging datasets, and the factor logs newly extracted by the FA-PSO method can be used to estimate water saturation along arbitrarily long sounding hole intervals. By incorporating more than one sounding hole into the procedure, even two-dimensional water saturation profiles can be derived by simultaneously processing DP logs from neighboring holes. The result of the presented method is also verified by independent inversion based estimates of water saturation. It is also shown that simulated annealing is capable of automatically selecting some of the hyperparameters of the utilized particle swarm optimization algorithm. Thus, the uncertainty of the optimization algorithm can be reduced, since the manual selection of the $c_1$ and $c_2$ parameters is no longer necessary. The suggested machine learning tool assures a reliable evaluation of unconsolidated and unsaturated near-surface formations, which are the target domain of several engineering and environmental geophysical problems.

Acknowledgements The research was carried out in the Project No. K-135323 supported by the National Research, Development and Innovation Office (NKFIH). The use of the dataset was permitted by Dezső Drahos from Loránd Eötvös University. The authors are grateful for the continuous support of János Stickel from Elgoscar Ltd.

Authors' contributions Armand Abordán: original draft, software, visualization. Norbert Péter Szabó: conceptualization, methodology, review and editing.

Funding Open access funding provided by University of Miskolc. The research was carried out in the Project No. K-135323 supported by the National Research, Development and Innovation Office (NKFIH).

Declarations

Conflict of interest The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethics approval Not applicable.
Consent to participate Not applicable.

Consent for publication Not applicable.

Availability of data and material Because of the data confidentiality, the experimental data is not published.

Code availability Because of the data confidentiality, the code is not published.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Abordán A, Szabó NP (2018) Particle swarm optimization assisted factor analysis for shale volume estimation in groundwater formations. Geosci Eng 6(9):87–97
Ali A, Sheng-Chang C (2020) Characterization of well logs using K-mean cluster analysis. J Petrol Explor Prod Technol 10:2245–2256. https://doi.org/10.1007/s13202-020-00895-4
Araya-Polo M, Dahlke T, Frogner C, Zhang C, Poggio T, Hohl D (2017) Automated fault detection without seismic processing. Lead Edge 36(3):208–214. https://doi.org/10.1190/tle36030208.1
Balogh GP (2016) Interval inversion of engineering geophysical sounding logs. Geosci Eng 5(8):22–31
Bartlett MS (1937) The statistical conception of mental factors. Br J Psychol 28:97–104. https://doi.org/10.1111/j.2044-8295.1937.tb00863.x
Caté A, Perozzi L, Gloaguen E, Blouin M (2017) Machine learning as a tool for geologists. Lead Edge 36:215–219. https://doi.org/10.1190/tle36030215.1
Chunduru RK, Sen MK, Stoffa PL (1997) Hybrid optimization methods for geophysical inversion. Geophysics 62:1196–1207. https://doi.org/10.1190/1.1444220
Dietrich P, Leven C (2006) Direct push-technologies. In: Kirsch R (ed) Groundwater geophysics. Springer, Berlin. https://doi.org/10.1007/3-540-29387-6_11
Dobróka M, Szabó NP, Tóth J, Vass P (2016) Interval inversion approach for an improved interpretation of well logs. Geophysics 81:D155–D167. https://doi.org/10.1190/geo2015-0422.1
Drahos D (2005) Inversion of engineering geophysical penetration sounding logs measured along a profile. Acta Geodetica Geophys Hungarica 40:193–202. https://doi.org/10.1556/AGeod.40.2005.2.6
Dramsch JS (2020) 70 years of machine learning in geoscience in review. Adv Geophys. https://doi.org/10.1016/bs.agph.2020.08.002
Everett ME (2013) Near-surface applied geophysics. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781139088435
Fejes I, Jósa E (1990) The engineering geophysical sounding method. Principles, instrumentation, and computerised interpretation. In: SH Ward (ed) Geotechnical and environmental geophysics, Environmental and groundwater, vol 2. SEG, pp 321–331, ISBN 978-0-931830-99-0
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell PAMI-6:721–741. https://doi.org/10.1109/TPAMI.1984.4767596
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press
Jöreskog KG (2007) Factor analysis and its extensions. In: Cudeck R, MacCallum RC (eds) Factor analysis at 100, historical developments and future directions. Lawrence Erlbaum Associates, pp 47–77
Kaikkonen P, Sharma SP (2001) A comparison of performances of linearized and global nonlinear 2-D inversions of VLF and VLF-R electromagnetic data. Geophysics 66:462–475. https://doi.org/10.1190/1.1444937
Kaiser HF (1958) The varimax criterion for analytical rotation in factor analysis. Psychometrika 23:187–200. https://doi.org/10.1007/BF02289233
Kennedy J, Eberhart R (1995) Particle swarm optimization. Proc IEEE Int Conf Neural Netw 4:1942–1948. https://doi.org/10.1109/ICNN.1995.488968
Kong Q, Trugman DT, Ross ZE, Bianco MJ, Meade BJ, Gerstoft P (2019) Machine learning in seismology: turning data into insights. Seismol Res Lett 90(1):3–14. https://doi.org/10.1785/0220180259
Li G, Qiao Y, Zheng Y, Li Y, Wu W (2019) Semi-supervised learning based on generative adversarial network and its applied to lithology recognition. IEEE Access 7:67428–67437. https://doi.org/10.1109/ACCESS.2019.2918366
Liu S, Liang M, Hu X (2018) Particle swarm optimization inversion of magnetic data: field examples from iron ore deposits in China. Geophysics 83(4):J43–J59. https://doi.org/10.1190/geo2017-0456.1
Menke W (2012) Geophysical data analysis: discrete inverse theory, 3rd edn. Academic Press. https://doi.org/10.1016/C2011-0-69765-0
Metropolis N, Rosenbluth MN, Rosenbluth AW, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092. https://doi.org/10.1063/1.1699114
Puskarczyk E, Jarzyna JA, Wawrzyniak-Guz K et al (2019) Improved recognition of rock formation on the basis of well logging and laboratory experiments results using factor analysis. Acta Geophys 67:1809–1822. https://doi.org/10.1007/s11600-019-00337-8
Sen MK, Stoffa PL (2013) Global optimization methods in geophysical inversion. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511997570
Shi Y, Eberhart R (1998) A modified particle swarm optimizer. In: The 1998 IEEE international conference on IEEE world congress on computational intelligence evolutionary computation proceedings, pp 69–73. https://doi.org/10.1109/ICEC.1998.699146
Soupios P, Akca I, Mpogiatzis P, Basokur AT, Papazachos C (2011) Applications of hybrid genetic algorithms in seismic tomography. J Appl Geophys 75(3):479–489. https://doi.org/10.1016/j.jappgeo.2011.08.005
Szabó NP (2011) Shale volume estimation based on the factor analysis of well-logging data. Acta Geophys 59:935–953. https://doi.org/10.2478/s11600-011-0034-0
Szabó NP (2018) A genetic meta-algorithm-assisted inversion approach: hydrogeological study for the determination of volumetric rock properties and matrix and fluid parameters in unsaturated formations. Hydrogeol J 26:1935–1946. https://doi.org/10.1007/s10040-018-1749-7
Szabó NP, Dobróka M, Drahos D (2012) Factor analysis of engineering geophysical sounding data for water saturation estimation in shallow formations. Geophysics 77(3):WA35–WA44. https://doi.org/10.1190/geo2011-0265.1
Szabó NP, Dobróka M (2018) Exploratory factor analysis of wireline logs using a float-encoded genetic algorithm. Math Geosci 50:317–335. https://doi.org/10.1007/s11004-017-9714-x
Szabó NP, Balogh GP, Stickel J (2018) Most frequent value-based factor analysis of direct-push logging data. Geophys Prospect 66(3):530–548. https://doi.org/10.1111/1365-2478.12573
Szabó NP, Dobróka M (2020) Interval inversion as innovative well log interpretation tool for evaluating organic-rich shale formations. J Petrol Sci Eng. https://doi.org/10.1016/j.petrol.2019.106696
Vu MT, Jardani A (2021) Convolutional neural networks with SegNet architecture applied to three-dimensional tomography of subsurface electrical resistivity: CNN-3D-ERT. Geophys J Int 225(2):1319–1331. https://doi.org/10.1093/gji/ggab024
Wang Z, Di H, Shafiq MA, Alaudah Y, AlRegib G (2018) Successful leveraging of image processing and machine learning in seismic structural interpretation: a review. Lead Edge 37(6):451–461. https://doi.org/10.1190/tle37060451.1
Wang R, Yin C, Wang M, Wang G (2012) Simulated annealing for controlled-source audio-frequency magnetotelluric data inversion. Geophysics 77(2):E127–E133. https://doi.org/10.1190/geo2011-0106.1
Xing Z, Mazzotti A (2019) Two-grid full-waveform Rayleigh-wave inversion via a genetic algorithm - Part 1: method and synthetic examples. Geophysics 84(5):R805–R814. https://doi.org/10.1190/geo2018-0799.1
Zhang L-P, Yu H-J, Hu S-X (2005) Optimal choice of parameters for particle swarm optimization. J Zhejiang Univ Sci 6:528–534. https://doi.org/10.1631/jzus.2005.A0528
Zhdanov MS (2015) Inverse theory and applications in geophysics. Elsevier, ISBN 978-0-444-62674-5. https://doi.org/10.1016/C2012-0-03334-0

Journal

"Acta Geodaetica et Geophysica" – Springer Journals

Published: Dec 1, 2021

Keywords: Factor analysis; Particle swarm optimization; Simulated annealing; Direct push logging; Hyperparameter estimation
