Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator

A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for... actuators Article A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator Yangyang Hou, Huajie Hong * , Dasheng Xu, Zhe Zeng, Yaping Chen and Zhaoyang Liu College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; houyangyang14@nudt.edu.cn (Y.H.); ixudasheng@163.com (D.X.); zengzhe@nudt.edu.cn (Z.Z.); yaping_chen2021@163.com (Y.C.); ZhaoyangLiuNUDT@163.com (Z.L.) * Correspondence: honghuajie@nudt.edu.cn; Tel.: +86-138-7313-0046 Abstract: Deep Reinforcement Learning (DRL) has been an active research area in view of its capability in solving large-scale control problems. Until presently, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG), Twin-Delayed Deep Deterministic Policy Gradient (TD3), and so on. However, the converging achievement of DRL often requires extensive collected data sets and training episodes, which is data inefficient and computing resource consuming. Motivated by the above problem, in this paper, we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3), for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and Amnesia mechanism. The main contribution of this paper is that we show a Citation: Hou, Y.; Hong, H.; Xu, D.; biomimetic view to speed up the converging process by biochemical reactions generated by neurons Zeng, Z.; Chen, Y.; Liu, Z. A Deep in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm Reinforcement Learning Algorithm is validated by a simulation example including the comparisons with previously developed DRL Based on Tetanic Stimulation and algorithms. The results indicate that our approach shows performance improvement in terms of Amnesic Mechanisms for Continuous convergence speed and precision. Control of Multi-DOF Manipulator. Actuators 2021, 10, 254. https:// Keywords: multi-DOF manipulator; tetanic stimulation; amnesia mechanism; deep reinforcement doi.org/10.3390/act10100254 learning Academic Editors: Ioan Doroftei and Karsten Berns Received: 11 August 2021 1. Introduction Accepted: 20 September 2021 Deep Reinforcement Learning (DRL) is an advanced intelligent control method. It uses Published: 29 September 2021 a neural network to parameterize the Markov decision processes (MDP). DRL has been successfully applied to the field of robots [1–4], Machine Translation [5], Auto-Driving [6], Publisher’s Note: MDPI stays neutral Target Positioning [7], and shows strong adaptability. There are two kinds of the DRL with regard to jurisdictional claims in algorithms, one is based on value function such as Deep Q Network (DQN) [8] and Nature published maps and institutional affil- DQN [9]. The output of DRL algorithm based on value function is discrete state-action iations. value. The other is policy-based, such as Deep Deterministic Policy Gradient (DDPG) [10], Trust Region Policy Optimization (TRPO) [11], Asynchronous Advantage Actor-Critic (A3C) [12], Distributed Proximal Policy Optimization (DPPO) [13,14], and Twin-Delayed Deep Deterministic Policy Gradient (TD3) [15]. For continuous action space, advanced Copyright: © 2021 by the authors. search policy can improve the sampling efficiency of the underlying algorithms [10]. Many Licensee MDPI, Basel, Switzerland. research results focus on improving exploration policy. Among them, Fortunato et al. [16] This article is an open access article and Plppert et al. [17] put forward a noise-based exploration policy in adding noise to the distributed under the terms and action space and observation space. Bellemare et al. propose an exploratory algorithm conditions of the Creative Commons based on pseudo-counting for efficient exploration. The algorithm evaluates frequency by Attribution (CC BY) license (https:// designing a density model that satisfies certain properties and calculates pseudo-counts creativecommons.org/licenses/by/ that are generalized in continuous space to encourage exploration [18]. In [19], Fox et al. 4.0/). Actuators 2021, 10, 254. https://doi.org/10.3390/act10100254 https://www.mdpi.com/journal/actuators Actuators 2021, 10, 254 2 of 16 innovatively propose a framework, DORA, that uses two parallel MDP processes to inject exploration signals into random tasks. Reward-based exploration results in a slower approximation of functions and failure to provide an intrinsic reward signal in time. So, Badia et al. [20] propose a “never give up” exploration strategy (NGU), which is designed to quickly prevent repeated access to the same state in the same episode. Due to the coupling and complexity, a multi-degree-of-freedom (multi-DOF) manipu- lator is a hot application background of DRL. The direct applications of DRL algorithms have, so far, been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity [21]. Therefore, the application of DRL in multi-DOF robots requires a more effective DRL algorithm. Neuroscience provides a rich source of inspiration for new types of algorithms and architectures, independent of and complemen- tary to the mathematical and logic-based methods and ideas that have largely dominated traditional approaches to AI [22]. In this paper, we design a new DRL algorithm named ATRTD3 based on the research results of neuroscience and the analysis of the human brain memory learning process. Our team shows a biomimetic view to speed up the converging process by biochemical reactions generated by neurons in the biological brain during memory and forgetting. We apply the neural network parameter updating mechanism with Tetanic stimulation and Amnesia mechanism to the DRL algorithm to further improve the efficiency in the application of manipulator. 2. Related Work The advancement of DRL has led to the development of intelligent control of a multi- DOF manipulator in recent years. Kim et al. [23] propose a motion planning algorithm for robot manipulators using TD3, which designed paths are smoother and shorter after 140,000 episodes than those designed by Probabilistic Roadmap. Based on the classic DDPG algorithm, Zhang et al. smoothly add Gaussian parameters to improve the exploratory nature of the algorithm, dynamically sets the robot grasping space parameters to adapt to the workspace of multiple scales, and realizes the accurate grasping of the robot [24]. Robert Kwiatkowski et al. [25] used DL methods to make the manipulator build a self- model after 35 h of training. By comparing the application of DDPG and Proximal Policy Optimization (PPO) in the manipulator, Lriondo et al. [26] concluded that the current DRL algorithm could not obtain robust motion ability and acceptable training efficiency. The difficulty of applying DRL to the motion control of the multi-DOF manipulator lies in how to improve the exploration efficiency and improve the robustness of the output action of the manipulator. Therefore, it is necessary to get inspiration from the research results of neuroscience. For flexible manipulators, some research [27,28] are very interesting, which brings some help to kinematics modeling and control design in this paper. Long-term potentiation (LTP) is a form of activity-dependent plasticity which results in a persistent enhancement of synaptic transmission. LTP has been a source of great fascination to neuroscientists since its discovery in the early 1970s [29] because it satisfies the criteria proposed by Donald Hebb for a synaptic memory mechanism in his influential book ‘The Organization of Behavior ’ [30]. LTP is a persistent enhancement of excitatory synaptic transmission induced by some kinds of preceding operations of high-frequency stimulation (HFS) [31]. In LTP, stimulation changes the synaptic proteins, that is, changes the sensitivity of postsynaptic neurons to presynaptic neurons, thus changing the strength and efficiency of synaptic signal transmis- sion. Memory formation is considered to be the result of long-term synaptic plasticities, such as long-term depression (LTD) and LTP [32]. LTP and LTD have another potentially important role in modern neuroscience, and that is the possibility that they may be exploited to treat disorder and disease in the human central nervous system (CNS). A variety of neurological conditions arise from lost or excessive synaptic drive due to sensory deprivation during childhood, brain damage, or disease [33]. Memory and forgetting are the stages that the human brain must go through in Actuators 2021, 10, x FOR PEER REVIEW  3  of  17  to presynaptic neurons, thus changing the strength and efficiency of synaptic signal trans‐ mission. Memory formation is considered to be the result of long‐term synaptic plastici‐ ties, such as long‐term depression (LTD) and LTP [32].  LTP and LTD have another potentially important role in modern neuroscience, and  that is the possibility that they may be exploited to treat disorder and disease in the human  Actuators 2021, 10, 254 3 of 16 central nervous system (CNS). A variety of neurological conditions arise from lost or ex‐ cessive synaptic drive due to sensory deprivation during childhood, brain damage, or  disease [33]. Memory and forgetting are the stages that the human brain must go through  in the process of accepting knowledge and accumulating experience. This paper will mod‐ the process of accepting knowledge and accumulating experience. This paper will modify ify the Actor‐network module in DRL, and change the Actor‐network module optimized  the Actor-network module in DRL, and change the Actor-network module optimized by by gradient descent into a network module with biological characteristics.  gradient descent into a network module with biological characteristics. 3. Methods 3. Methods  ATRTD3 is a DRL algorithm proposed in this paper to improve the motion ability of a ATRTD3 is a DRL algorithm proposed in this paper to improve the motion ability of  multi-DOF manipulator. The innovation of the algorithm is to transform the research results a multi‐DOF manipulator. The innovation of the algorithm is to transform the research  of neuroscience into DRL. This algorithm is based on the Twin-Delayed Deep Deterministic results of neuroscience into DRL. This algorithm is based on the Twin‐Delayed Deep De‐ Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves the update terministic Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves  of network weight parameters of Actor-network module. The highlight of the algorithm is the update of network weight parameters of Actor‐network module. The highlight of the  that it uses tetanic stimulation and amnesia mechanism to randomly enhance and weaken algorithm is that it uses tetanic stimulation and amnesia mechanism to randomly enhance  the weighting parameters of the neural network, thus realizing the bionic update of the and weaken the weighting parameters of the neural network, thus realizing the bionic  neural network. The Actor-network module obtained through the deterministic policy update of the neural network. The Actor‐network module obtained through the determin‐ gradient needs to be further updated through the above two mechanisms. Compared istic  policy  gradient  needs  to  be further  updated  through  the  above  two  mechanisms.  with other DRL algorithms, ATRTD3 increases the network update part of biological Compared with other DRL algorithms, ATRTD3 increases the network update part of bi‐ characteristics and further expands the scope of exploration. Figure 1 shows the framework ological characteristics and further expands the scope of exploration. Figure 1 shows the  of the overall algorithm. framework of the overall algorithm.   Figure 1. ATRTD3 algorithm framework. Figure 1. ATRTD3 algorithm framework.  The pseudo-code of ATRTD3 is shown in Appendix A at the end of this paper. The The pseudo‐code of ATRTD3 is shown in Appendix A at the end of this paper. The  following is a detailed description of the Tetanic stimulation and Amnesia mechanism. following is a detailed description of the Tetanic stimulation and Amnesia mechanism.   3.1. Tetanic Stimulation     Tetanic stimulation is the memory part of the Actor-network module. In the process of back propagation, the neural network obtains the updating quantity of parameters by gradient descent, to realize the iterative updating of the network. By evaluating the updating of network parameters, we can get which neural nodes’ weights are enhanced and which ones are weakened. For the parameters of the strengthened neurons, there are also differences in the strengthened degree. Therefore, it is necessary to evaluate and sort the parameters of the strengthened neuron nodes, select the ranking of the strengthened degree, and then obtain the neuron nodes qualified for Tetanic stimulation, and conduct Tetanic stimulation on the above parameters of the neuron nodes to achieve LTP, as shown in Figure 2, and the specific pseudo code is shown in Algorithm 1. Directly modifying the parameters of neural nodes will directly affect the nonlinear expression results of neural networks. This is a kind of immediate influence, which will immediately show the change Actuators 2021, 10, 254 4 of 16 effect of neuron weight in the continuous MDP, so we need to control this kind of influence in a reasonable range, and constantly update the parameters of the neural network in the iterative process. The designed Tetanic stimulation mechanism is nested in the fixed delay update step, which can exert the effect of Tetanic stimulation mechanism to a certain extent, and will not affect the overall effect of network update iteration, to ensure that the network converges towards the direction of performance improvement in the training process, and has the function of not weakening the attempt to explore the process. Algorithm 1 Tetanic stimulation 1: Tetanic stimulation coefficient k 2: Load Actor network f, W Actor. f c.weight 3: Update Actor network f f new 4: Load New Actor network, W Actor. f c.weight new 5: DW = W W 6: Select the serial number (row , col ) of the T largest data in DW list list 7: For t = 1 to T do: 8: If random(0, 1) < k: 9: If A.w(row , col ) > 0 t t 10: A.w(row , col ) (1 + random(0, 0.01)) A.w(row , col ) t t t t 11: else: 12: A.w(row , col ) (1 random(0, 0.01)) A.w(row , col ) t t t t Actuators 2021, 10, x FOR PEER REVIEW  5  of  17  13: End if 14: End if 15: End for Tetanic Stimulation Framework Behind _Fc Behind _Fc Behind _Fc Front _Fc Front _Fc Front _Fc 1 1 1 1 1 1 Backward 2 2 2 2 2 2 3 3 3 Figure 2. Schematic diagram of working mechanism of Tetanic stimulation. Figure 2. Schematic diagram of working mechanism of Tetanic stimulation.  3.2. Amnesia Mechanism 3.2. Amnesia Mechanism  The Amnesia mechanism is the forgetting part of the Actor-network module. When The Amnesia mechanism is the forgetting part of the Actor‐network module. When  there are problems in information transmission between neurons, synapses cannot normally there are problems in information transmission between neurons, synapses cannot nor‐ perform the function of neurotransmitter transmission, the brain begins to have problems mally perform the function of neurotransmitter transmission, the brain begins to have  in information transmission, forgetting begins to occur, at this time, the huge redundant problems in information transmission, forgetting begins to occur, at this time, the huge  brain neural network begins to have problems, and some memory and logic units begin to redundant brain neural network begins to have problems, and some memory and logic  have problems. Neuron function is not always in a stable and good working state, there units begin to have problems. Neuron function is not always in a stable and good working  will be all kinds of accidents, just as the world’s best snooker players cannot guarantee state, there will be all kinds of accidents, just as the world’s best snooker players cannot  that every shot will be accurate. So, there is forgetting in the whole period and process guarantee that every shot will be accurate. So, there is forgetting in the whole period and  of neurons. For this reason, the Amnesia mechanism is added to the neural network by process of neurons. For this reason, the Amnesia mechanism is added to the neural net‐ randomly selecting the neurons in the network with small probability, and then weakening work by randomly selecting the neurons in the network with small probability, and then  the parameters of the neuron nodes. When using the Amnesia mechanism to weaken weakening the parameters of the neuron nodes. When using the Amnesia mechanism to  the representation ability of the neural network, the influence of this weakening must be weaken the representation ability of the neural network, the influence of this weakening  controllable, that is to say, it will not affect the convergence trend of the neural network. must be controllable, that is to say, it will not affect the convergence trend of the neural  network. Therefore, to ensure that the influence of the Amnesia mechanism can be con‐ trolled,  the  weights  of  neurons  are  weakened  by  random  force  (a  random  percentage  number) in this paper, as shown in Figure 3, and the specific pseudo code is shown in  Algorithm 2.  Algorithm 2 Amnesia Framework  1: Load Actor network,    W  A.w(Actor. fc.weight) 2:  N   is the number of the  W ’s node  3: For i = 1 to  N :  4:     Random(0, 1) number   , Amnesia threshold value   Mutation coefficient     k 5:     If     :  6:        Actor.fc.weight[i] Actor.fc.weight[i] (1krandom(0,1))  7:     End if  8: End for  Actuators 2021, 10, 254 5 of 16 Therefore, to ensure that the influence of the Amnesia mechanism can be controlled, the weights of neurons are weakened by random force (a random percentage number) in this paper, as shown in Figure 3, and the specific pseudo code is shown in Algorithm 2. Algorithm 2 Amnesia Framework 1: Load Actor network, W A.w(Actor. f c.weight) 2: N is the number of the W’s node 3: For i = 1 to N: 4: Random(0, 1) number x , Amnesia threshold value t, Mutation coefficient k 5: If x < t: Actuators 2021, 10, x FOR PEER REVIEW  6  of  17  6: Actor. f c.weight[i] Actor. f c.weight[i] (1 k random(0, 1))   7: End if 8: End for Figure 3. Schematic diagram of Amnesia framework. Figure 3. Schematic diagram of Amnesia framework.  4. Experiment 4.1. Experiment Setup 4. Experiment  For the control problem of the multi-DOF manipulator, if we only consider the kine- 4.1. Experiment Setup  matics model of the manipulator, and the motion of the multi-DOF manipulator is regarded For the control problem of the multi‐DOF manipulator, if we only consider the kine‐ as a discrete process from one position of the end effector to another position, the DRL matics model of the manipulator, and the motion of the multi‐DOF manipulator is re‐ method of deterministic policy gradient, such as RTD3, can achieve good results. However, garded as a discrete process from one position of the end effector to another position, the  if the motion of the manipulator is regarded as a continuous motion process, a group DRL method of deterministic policy gradient, such as RTD3, can achieve good results.  of new inputs and outputs of DRL must be found. The idea adopted in this paper is to However, if the motion of the manipulator is regarded as a continuous motion process, a  discretize the motion process of the manipulator in time, take the position deviation of the group of new inputs and outputs of DRL must be found. The idea adopted in this paper  end-effector from the target, the angular velocity of the joints, and the angular of the joints is to discretize the motion process of the manipulator in time, take the position deviation  as the input information of the DRL in this paper, and then take the angular acceleration of the end‐effector from the target, the angular velocity of the joints, and the angular of  the joints as the input information of the DRL in this paper, and then take the angular  acceleration command for the next interval of the control joint as the output information,  as shown in Figure 4. In this way, by controlling the angular acceleration of the joint, the  problem of discrete processes in the previous control position process can be solved. How‐ ever, this change will inevitably lead to the increase in model dimensions, which not only  puts forward new requirements for the ability of the DRL algorithm but also puts forward  new requirements for the redesign of the reward function.   UR manipulator is a representative manipulator in industrial production and scien‐ tific research. Therefore, this paper will use the structural size and joint layout of manip‐ ulator to establish the simulation manipulator in this paper.  Actuators 2021, 10, 254 6 of 16 command for the next interval of the control joint as the output information, as shown in Figure 4. In this way, by controlling the angular acceleration of the joint, the problem of discrete processes in the previous control position process can be solved. However, Actuators 2021, 10, x FOR PEER REVIEW  7  of  17  this change will inevitably lead to the increase in model dimensions, which not only puts forward new requirements for the ability of the DRL algorithm but also puts forward new requirements for the redesign of the reward function. Initial State DRL Controller Positi on Deviation Observa tion Input Output Joints Angular Joints Angular Acceleration Velocity ATRTD3 Increment Manipulator Joints Angular Update Reset Intermediate State Termination State No Replay Buffer Termination  Yes Judgment Manipulator Episode Manipulator Figure 4. Schematic diagram of control flow. Figure 4. Schematic diagram of control flow.  UR manipulator is a representative manipulator in industrial production and scientific 4.2. Task Introduction  research. Therefore, this paper will use the structural size and joint layout of manipulator to establish In this pa the per simulation , the DRL al manipulator gorithm is us ined this  to tra paper in a . model controller of the multi‐DOF  manipulator. The model can control the angular acceleration of the joints of the manipu‐ 4.2. Task Introduction lator so that the manipulator can start to move from the initial position of the workspace  In this paper, the DRL algorithm is used to train a model controller of the multi-DOF in the static state, and then move to the target position and stop. In the whole training  manipulator. The model can control the angular acceleration of the joints of the manipulator process, the target position of the task is a fixed position in the workspace of the manipu‐ so that the manipulator can start to move from the initial position of the workspace in the lator. The core of the task is that the manipulator can reach the target position and when  static state, and then move to the target position and stop. In the whole training process, it reaches the target position at the same time, each joint of the manipulator is at static  the target position of the task is a fixed position in the workspace of the manipulator. The state. Finally, the manipulator can reach the target position smoothly by controlling the  core of the task is that the manipulator can reach the target position and when it reaches the angular acceleration of joints. In order to limit the boundaries of the entire task, the entire  target position at the same time, each joint of the manipulator is at static state. Finally, the training process must be restricted. Each episode is divided into twenty steps. This setting  manipulator can reach the target position smoothly by controlling the angular acceleration mainly takes into account that the training convergence process takes a long time and the  of joints. In order to limit the boundaries of the entire task, the entire training process must time of a single episode must be shortened. This task is a simulation experiment to test  be restricted. Each episode is divided into twenty steps. This setting mainly takes into the convergence ability and learning ability of the improved algorithm ATRTD3.   account that the training convergence process takes a long time and the time of a single episode must be shortened. This task is a simulation experiment to test the convergence 4.3. Simulation Environment Construction  ability and learning ability of the improved algorithm ATRTD3. The DRL algorithm establishes the manipulator model through the standard DH [35]  method and uses the positive kinematics solution to solve the spatial pose of the end ef‐ 4.3. Simulation Environment Construction fector through the joint angle. DH modeling method is a general modeling method of  The DRL algorithm establishes the manipulator model through the standard DH [35] multi‐link mechanism. Standard DH models are used for serial structure robots. In Table  method and uses the positive kinematics solution to solve the spatial pose of the end 1, we show the DH parameters of this manipulator, where  a   is the length of the link,  d   effector through the joint angle. DH modeling method is a general modeling method of is the offset of the link,     is the twist angle of the link,     is the joint angle. The units  multi-link mechanism. Standard DH models are used for serial structure robots. In Table 1, of  a   and  d   are meters, and the units of     and     are radians.  we show the DH parameters of this manipulator, where a is the length of the link, d is the offset of the link, a is the twist angle of the link, q is the joint angle. The units of a and d are Table 1. The D‐H parameters of manipulator.  meters, and the units of a and q are radians. 𝜶   𝒅   𝜶   𝜽   Joint  Base  0  0.0892   / 2   Shoulder −0.4250  0  0        Elbow −0.3923  0  0     Wrist1  0  0.1092   / 2     Wrist2  0  0.0946   / 2     Wrist3  0  0.0823  0  6 Actuators 2021, 10, 254 7 of 16 Table 1. The D-H parameters of manipulator. Joint a d a q Base 0 0.0892 p/2 q Shoulder 0.4250 0 0 q Elbow 0.3923 0 0 q Wrist1 0 0.1092 p/2 q Wrist2 0 0.0946 p/2 q Wrist3 0 0.0823 0 q During the experiment, only base, shoulder, and elbow are controlled, wrist1, wrist2, and wrist3 were locked. Because the problem studied in this paper focuses on the position reaching ability of the manipulator, it can be realized only by using three joints, so the three joints of the wrist joints are locked. The homogeneous transformation matrix is established through D–H parameters, as shown in Equation (1). 2 3 cos(q ) sin(q ) cos(a ) sin(q ) sin(a ) a cos(q ) i i i i i i i 6 7 sin(q ) cos(q ) cos(a ) cos(q ) sin(a ) a sin(q ) i1 i i i i i i i 6 7 T = (1) i 4 5 0 sin(a ) cos(a ) d i i i 0 0 0 1 The solution of the forward kinematics of the manipulator can be obtained by mul- tiplying the homogeneous transformation matrix, as shown in Equation (2). In the base coordinate system {B}, the position of the end of the manipulator can be obtained. 0 0 1 2 3 4 5 T = T T T T T T (2) 6 1 2 3 4 5 6 In each episode, the target position is randomly generated in the workspace of the manipulator. In the experiment, the distance difference between the center position of the end effector and the target position in three directions (dx, dy, and dz), the angular veloc- ity (w , w , and w ) and absolute angle of the first three joints Joint_Base Joint_Shoulder Joint_Eblow (q , q , and q ) are used as the input of DRL, and the angular Joint_Base Joint_Shoulder Joint_Eblow . . acceleration control commands of the base, shoulder and elbow (w , w , Joint_Base Joint_Shoulder and w ) are output by DRL. In order to ensure the safe operation of the virtual Joint_Elbow manipulator, the angular acceleration (q/s ) is limited, as shown in Equation (3). w 2 (0.5, 0.5), i 2 (Base, Shoulder, Eblow) (3) When the DRL outputs the angular acceleration control command w , the joint angle increment Dq obtained in this step is calculated by Equation (4) according to the interval time t = 0.1 s and the angular velocity of the previous step w . The current joint angle _i q is updated through the joint angle increment Dq and the joint angle of the previous i_ i step q in Equation (5). The position of the manipulator end effector in {B} coordinate is _i obtained by homogeneous transformation matrix T calculation. The joint angular velocity is updated as shown in Equation (6). 1 . Dq = w t + w t , i 2 (Base, Shoulder, Eblow) (4) i _i i q = q + Dq (5) i_ _i i w = w + w t (6) i_ _i i In this motion process, the DRL model sends the angular acceleration command (the output of the DRL algorithm) of the joints to manipulator according to the perceived environment and its state (the input of the DRL algorithm) and gives the termination command when judging the motion. Actuators 2021, 10, x FOR PEER REVIEW  9  of  17  Actuators 2021, 10, 254 8 of 16 4.4. Rewriting Experience Playback Mechanism and Reward Function Design  The motion process of a multi‐DOF manipulator is no longer multiple discrete spatial  4.4. Rewriting Experience Playback Mechanism and Reward Function Design position points. As can be seen from Figure 5 below, this experiment completes the spec‐ The motion process of a multi-DOF manipulator is no longer multiple discrete spatial ified task by controlling the angular acceleration of the three joints. In the joint angular  position points. As can be seen from Figure 5 below, this experiment completes the specified velocity field drawn by three joint angular velocities, in Case 1, the manipulator can reach  task by controlling the angular acceleration of the three joints. In the joint angular velocity the field  target drawn  position by thr ee and joint stop angular , which velocities,  means  in tha Case t the 1, task the manipulator  is successfu can llyr completed; each the target in Case  position and stop, which means that the task is successfully completed; in Case 2, the 2, the angular velocity of the joint does not stop at all, or it passes through the target po‐ angular velocity of the joint does not stop at all, or it passes through the target position sition quickly and fails to complete the task; for Case 3, although the manipulator stops in  quickly and fails to complete the task; for Case 3, although the manipulator stops in the the scheduled time, the end of the manipulator does not reach the target position, and the  scheduled time, the end of the manipulator does not reach the target position, and the task task fails to complete. Figure 5 also shows that the task of this experiment is more difficult  fails to complete. Figure 5 also shows that the task of this experiment is more difficult than than the task of only achieving the goal through discrete spatial position.  the task of only achieving the goal through discrete spatial position. Figure 5. Schematic diagram of the angular velocity field of joints. x is the threshold for determining joint stop. Figure 5. Schematic diagram of the angular velocity field of joints.    is the threshold for determining joint stop.  Therefore, the design of the reward function must be reconsidered, that is to say, the angular velocity information of the joint must be introduced. In the whole process of Therefore, the design of the reward function must be reconsidered, that is to say, the  the manipulator movement, each instance of acceleration and deceleration has an impact angular velocity information of the joint must be introduced. In the whole process of the  on the angular velocity of each joint when the termination condition is reached and the manipulator movement, each instance of acceleration and deceleration has an impact on  iteration of the round stops. Therefore, it is necessary to further improve the experience the angular velocity of each joint when the termination condition is reached and the iter‐ playback mechanism and change the experience pool storage. In other words, the final ation of the round stops. Therefore, it is necessary to further improve the experience play‐ angular velocity of each joint should be shared by all the previous continuous movements. back mechanism and change the experience pool storage. In other words, the final angular  As shown in Equation (7), here the absolute value of the angular velocity of the joint will velocity be taken,  of multiplied each joint by sho the uld constant  be shared l , divided  by all the by the previous number contin of steps uous T  movem , and the ents. As  i Stop corresponding reward value will be obtained. shown in Equation (7), here the absolute value of the angular velocity of the joint will be  taken, multiplied by the constant  , divided by the number of steps  , and the corre‐ i Stop l jw j + l jw j + l jw j 1 Base 2 Shoulder 3 Elbow R = , l > 0, i = f1, 2, 3g (7) Joint_Vel i sponding reward value will be obtained.  Stop       This part of the reward as a negative reward is added to the corresponding reward in 1 Base 2 Shoulder 3 Elbow R  ,  0,i {1,2,3} (7)  Joint_Vel i the experience pool, so as to realize the feedback of the joint angular velocity state in the Stop neural network update parameters, as shown in Figure 6. This part of the reward as a negative reward is added to the corresponding reward  in the experience pool, so as to realize the feedback of the joint angular velocity state in  the neural network update parameters, as shown in Figure 6.  Actuators 2021, 10, x FOR PEER REVIEW  10  of  17  Actuators 2021, 10, 254 9 of 16 Figure 6. Schematic diagram of rewriting experience playback mechanism. Figure 6. Schematic diagram of rewriting experience playback mechanism.  The design of the reward function in this paper adds the angular velocity reward part The design of the reward function in this paper adds the angular velocity reward part  to the Step-by-Step reward function (r ) [35]. The Step-by-Step reward function StepbyStep to the Step‐by‐Step reward function ( ) [35]. The Step‐by‐Step reward function  StepbyStep mainly includes two parts: the first part is the negative value of the Euclidean distance mainly includes two parts: the first part is the negative value of the Euclidean distance  between the end of the manipulator and the target. The second part is the reward obtained between the end of the manipulator and the target. The second part is the reward obtained  by comparing the distance closed to the target position between the current position and by comparing the distance closed to the target position between the current position and  the last position of the manipulator end during the movement. Therefore, the reward the  last  position  of  the  manipulator  end  during  the  movement.  Therefore,  the  reward  function in this paper is shown in Equation (8): function in this paper is shown in Equation (8):  r = l r + l R (8) 4 StepbyStep Joint_Vel r  r  R   (8)  4 StepbyStep 5 Joint_Vel where l and l are two constants. where     and     are two constants.  4 5 4.5. Simulation Experimental Components In the application of the DRL algorithm, a problem that cannot be avoided is the 4.5. Simulation Experimental Components  random generation of a large number of neural network parameters. It is because of the In the application of the DRL algorithm, a problem that cannot be avoided is the ran‐ randomness of the parameters that we cannot train and learn effectively in the face of dom generation of a large number of neural network parameters. It is because of the ran‐ specific tasks, so we need to explore a more efficient, faster convergence, and more stable domness of the parameters that we cannot train and learn effectively in the face of specific  algorithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are tasks, so we need to explore a more efficient, faster convergence, and more stable algo‐ all improved and innovated on the basis of DDPG, the contrast group of this experiment rithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are  chooses the above four algorithms. In the contrast experiment, we train with the same all improved and innovated on the basis of DDPG, the contrast group of this experiment  task and acquire the ability to solve the target task through learning and training. In the chooses the above four algorithms. In the contrast experiment, we train with the same task  experiment, we specify two kinds of evaluation indexes. The first index is to calculate and acquire the ability to solve the target task through learning and training. In the exper‐ all the reward scores in an episode and divide them by the total number of steps in the iment, we specify two kinds of evaluation indexes. The first index is to calculate all the  episode to get the average score. The second index is to record the absolute value of the reward scores in an episode and divide them by the total number of steps in the episode  angular velocity of base, shoulder, and elbow at the end of an episode. Additionally, in the to get the average score. The second index is to record the absolute value of the angular  same group of experiments, in order to ensure a fair comparison of the practical application velocity of base, shoulder, and elbow at the end of an episode. Additionally, in the same  ability and model convergence ability of the four algorithms, we use the same initialization group of experiments, in order to ensure a fair comparison of the practical application  model as the test algorithm model. ability and model convergence ability of the four algorithms, we use the same initializa‐ tion model as the test algorithm model.  5. Discussion Through the comparison of DDPG, TD3, RTD3, and ATRTD3 in Figure 7a, we can 5. Discussion  clearly see the improvement effect of ATRTD3 learning ability. Therefore, we need to further Through analyze  the comparison the evaluation  of DDPG, index of TD3, the average RTD3, and scor ATRTD3 e when the  in final Figure distance  7a, weerr  can or  is clearly roughly  see the the improveme same. Fromnt Figur  effect e 7of a, ATR we can TD3 see  learning that the abi average lity. Therefore score of, we ATR nTD3 eed to is fu higher r‐ ther analyze the evaluation index of the average score when the final distance error is  than other algorithms when the final distance error is the same in purple areas A, B, and C, roughly which the indicates  same. From that the  Figure learning  7a, we ef fect can see of A tha TRt TD3  the av is better erage. sco The rer of eason  ATRTD is that 3 ispart  high oferthe   tharnewar  other d al function gorithm iss the when negative  the final rewar  dista dnintr ce eoduced rror is the by sa the me absolute  in purple value  areas of A, the B,angular  and  velocity of each joint after each termination of an episode. Through purple areas A, B, and C, which indicates that the learning effect of ATRTD3 is better. The reason is that part of  C of Figure 7b, ATRTD3 can better guide the multi-DOF manipulator to move to the target the reward function is the negative reward introduced by the absolute value of the angular  position and stop better, so ATRTD3 is more efficient and stable than other algorithms. velocity of each joint after each termination of an episode. Through purple areas A, B, and  Actuators 2021, 10, x FOR PEER REVIEW  11  of  17  C of Figure 7b, ATRTD3 can better guide the multi‐DOF manipulator to move to the target  position and stop better, so ATRTD3 is more efficient and stable than other algorithms.  Secondly,  through  Figure  7,  we  can  draw  the  conclusion  that  ATRTD3  shows  stronger stability than other algorithms in the late training period. As can be seen in Figure  8, although DDPG has reached the score level close to that of ATRT3 through the later  training, we can clearly see from the average score curve and the final error distance curve  Actuators 2021, 10, 254 10 of 16 that ATRTD3 has better stability in the later training stage compared with DDPG, and the  two curves of ATRTD3 are straighter, there are many spikes in the two curves of DDPG.  20 15 1 10 RTD3 ATRTD3 0 0 TD3 -5 -5 DDPG -10 -10 -15 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode -5 -5 -10 -10 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode (a)  RTD3 0.8 0.8 ATRTD3 0.6 0.6 TD3 DDPG 0.4 0.4 0.2 0.2 0.0 0.0 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode (b)  Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the perfor- Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the perfor‐ mance of four algorithms (RTD3, ATRTD3, TD3 and DDPG). There are Group1, Group2, Group3, and Group4 in the average mance of four algorithms (RTD3, ATRTD3, TD3 and DDPG). There are Group1, Group2, Group3, and Group4 in the  score (Avg_score) in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special attention. average score (Avg_score) in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special  attention.  Secondly, through Figure 7, we can draw the conclusion that ATRTD3 shows stronger stability than other algorithms in the late training period. As can be seen in Figure 8, although DDPG has reached the score level close to that of ATRT3 through the later training, we can clearly see from the average score curve and the final error distance curve that ATRTD3 has better stability in the later training stage compared with DDPG, and the two curves of ATRTD3 are straighter, there are many spikes in the two curves of DDPG. From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89% compared with the other three algorithms. In terms of stability, ATRTD3 performs better, with an improvement of at least 89.27% compared with the other three algorithms. Avg_Score Final_Dis (m) Avg_Score Final_Dis (m) Avg_Score Avg_Score Final_Dis (m) Final_Dis (m) Actuators 2021, 10, x FOR PEER REVIEW  12  of  17  Group1 Group1 1.0 100 Group2 Group2 89.84 Group3 0.858 Group3 Actuators 2021, 10, x FOR PEER REVIEW  12  of  17  Actuators 2021, 10, 254 11 of 16 Group4 Group4 0.8 0.755 59.62 0.6 0.531 52.06 47.27 Group1 Group1 1.0 Group2 Group2 0.4 89.84 Group3 0.858 Group3 Group4 Group4 0.8 0.755 0.2 59.62 0.053 60 0.6 0.531 52.06 0.0 47.27 RTD3 ATRTD3 TD3 DDPG RTD3 ATRTD3 TD3 DDPG 0.4 Algorithm Algorithm 0.2 (a)  (b)  0.053 0.0 Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3,  RTD3 ATRTD3 TD3 DDPG RTD3 ATRTD3 TD3 DDPG TD3, and DDPG) are compared in a stacked way. From left to right, (a) represents the final error distance; (b) represents  Algorithm Algorithm   the average score.  (a)  (b)  From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89%  Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3, TD3, Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3,  compared with the other three algorithms. In terms of stability, ATRTD3 performs better,  and TD3, DDPG)  and DDPG) are com  are par  com edpin ared a stacked  in a staway cked . way. From From left to left right,  to right, (a) repr  (a)esents  represents the final  the final error error distance;  distan (bce; ) r epr (b) esents represents the   the average score.  with an improvement of at least 89.27% compared with the other three algorithms.  average score. From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89%  Avg_Score compared with the other three algorithms. In terms of stability, ATRTD3 performs better,  Final_Dis 100% with an improvement of at least 89.27% compared with the other three algorithms.  Avg_Score 80% Final_Dis 100% 60% 80% 40% 60% 20% 40% 20 0% % DDPG RTD3 TD3 0% Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the  Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the DDPG RTD3 TD3 ATRTD3 algorithm improves the average score and final error distance performance of the model.  ATRTD3 algorithm improves the average score and final error distance performance of the model. Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the  To further demonstrate the advantages of the ATRTD3 from the underlying control ATRTD3 algorithm improves the average score and final error distance performance of the model.  To further demonstrate the advantages of the ATRTD3 from the underlying control  variables, we will collect the end angular speeds of the three joints during model training variables, we will collect the end angular speeds of the three joints during model training  as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three To further demonstrate the advantages of the ATRTD3 from the underlying control  as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three  joints of the high score model (Group1). In Figure 10b, we show the final angular velocity variables, we will collect the end angular speeds of the three joints during model training  joints of the high score model (Group1). In Figure 10b, we show the final angular velocity  of the three joints of the arm controlled by a low score model (Group2). By placing a local as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three  of the three joints of the arm controlled by a low score model (Group2). By placing a local  magnification in each image, we can clearly see the final angular velocity of the three joints joints of the high score model (Group1). In Figure 10b, we show the final angular velocity  at magni the end ficat ofion the in training  each ima process, ge, we wher  can e the clear curve ly see repr the esented  final an bygu thela rr ed velocity color is of the the speed  three joints  of the three joints of the arm controlled by a low score model (Group2). By placing a local  of each joint under the guidance of the ATRTD3 throughout the training process. ATRTD3 at the end of the training process, where the curve represented by the red color is the  magnification in each image, we can clearly see the final angular velocity of the three joints  has obvious advantages over the other three algorithms. speed of each joint under the guidance of the ATRTD3 throughout the training process.  at the end of the training process, where the curve represented by the red color is the  ATRTD3 speed of ea has ch  joint obvious  unde adv r thaent guid ages ance  over  of  th thee  AT other RT D3 three  throughout  algorithms.  the  training process.  ATRTD3 has obvious advantages over the other three algorithms.  Final_Dis(m) Final_Dis(m) Performance Improvem Performance Improvem ent ent 49.89% 49.89% 93.36% 93.36% 71.67% Avg_Score 71.67% Avg_Score 92.45% 92.45% 89.06% 89.06% 89.27% 89.27% Actuators 2021, 10, x FOR PEER REVIEW  13  of  17  Actuators 2021, 10, 254 12 of 16 RTD3 ATRTD3 0.12 TD3 0.08 0.05 DDPG 0.04 0.00 0.00 -0.04 2 -0.05 40k 50k 60k Episode 0.05 -0.10 1 40k 50k 60k 0.00 Episode -1 -0.05 -0.10 -0.15 -1 40k 50k 60k -1 Episode -2 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode Episode (a)  4 1 4 0.15 0.10 3 0 0.05 0.06 0.00 0.04 0.02 -0.05 0.00 2 -1 -0.10 40k 50k 60k -0.02 0.05 Episode -0.04 40k 50k 60k 0.00 1 -2 Episode -0.05 -0.10 RTD3 -0.15 0 -3 ATRTD3 -0.20 -2 40k 50k 60k TD3 Episode DDPG -1 -4 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode Episode (b)  Figure Figure 10. 10. ((a a)) The The final final angular angular velo velocity city of of th the e joints joints in in the the hi high gh score score mod model el Group1; Group1; ( (b b) ) The The final final angular angular velo velocity city of of the the  joints in the low score model Group2.  joints in the low score model Group2. The final joint angular velocity of the manipulator based on ATRTD3 is always around The  final  joint  angular  velocity  of  the  manipulator  based  on  ATRTD3  is  always  0 (rad/s), with the minimum fluctuation and the best stability. In the high score model around 0 (rad/s), with the minimum fluctuation and the best stability. In the high score  (Group1), only DDPG can achieve a score similar to ATRTD3 in the training score, so model (Group1), only DDPG can achieve a score similar to ATRTD3 in the training score,  only the joint angular velocity of DDPG and ATRTD3 is compared. It can be seen from so only the joint angular velocity of DDPG and ATRTD3 is compared. It can be seen from  Table 2 that DDPG has an order of magnitude disadvantage in the comparison of the Table 2 that DDPG has an order of magnitude disadvantage in the comparison of the an‐ angular velocity of the other two joints except that the final angular velocity of the base gular velocity of the other two joints except that the final angular velocity of the base joint  joint is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is not  not difficult to find that ATRTD3 has absolute advantages in stability, which is generally difficult to find that ATRTD3 has absolute advantages in stability, which is generally one  one order of magnitude higher. Because the smaller the variance is, the more stable the order of magnitude higher. Because the smaller the variance is, the more stable the train‐ training is. In the low model (Group2), the advantages of ATRTD3 cannot be seen in Table 3. ing is. In the low model (Group2), the advantages of ATRTD3 cannot be seen in Table 3.  After accumulating and comparing the angular velocity and angular velocity variance of After accumulating and comparing the angular velocity and angular velocity variance of  the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other  three algorithms. three algorithms.     Baes(rad/s) Baes(rad/s) Baes(rad/s) Baes(rad/s) Shoulder(rad/s) Shoulder(rad/s) Shoulder(rad/s) Shoulder(rad/s) Elbow(rad/s) Elbow(rad/s) Elbow(rad/s) Elbow(rad/s) Actuators 2021, 10, x FOR PEER REVIEW  14  of  17  Table 2. The mean and variance of joints angular velocity in local enlarged images of high score model (Group1).  Joint  Base  Shoulder  Elbow  Angular Veloc‐ Average  Average  Average  Var  Var  Var  ity  (rad/s)  (rad/s)  (rad/s)  Actuators 2021, 10, 254 13 of 16 −3 −5 −3 −5 −4 −5 ATRTD3 −2.50 × 10   2.80 × 10   2.51 × 10   3.09 × 10 −3.90 × 10   2.83 × 10   −3 −4 −2 −3 −3 −4 RTD3 −4.65 × 10   4.58 × 10 −3.13 × 10   1.07 × 10   3.35 × 10   6.81 × 10   −2 −4 −2 −3 −3 −4 Table 2. The mean and variance of joints angular velocity in local enlarged images of high score model (Group1). TD3  3.13 × 10   7.26 × 10 −2.79 × 10   1.32 × 10 −6.69 × 10   4.73 × 10   −3 −5 −2 −5 −2 −5 Joint Base Shoulder Elbow DDPG  2.66 × 10   4.60 × 10   2.52 × 10   6.41 × 10 −2.01 × 10   3.95 × 10   Angular Average Average Average Var Var Var Velocity (rad/s) (rad/s) (rad/s) 3 5 3 5 4 5 Table 3. The mean and ATR TD3 variance2.50  of joints 10  angular 2.80  10 veloci2.51 ty in  10 local enlarged 3.09  10  imag 3.90 es of 10 low score 2.83  10 model (Group2).  3 4 2 3 3 4 RTD3 4.65  10 4.58  10 3.13  10 1.07  10 3.35  10 6.81  10 2 4 2 3 3 4 TD3 3.13  10 7.26  10 2.79  10 1.32  10 6.69  10 4.73  10 Joint  Base  3 5 Shoulder 2   5 2 5 Elbow  DDPG 2.66  10 4.60  10 2.52  10 6.41  10 2.01  10 3.95  10 Angular Veloc‐ Average  Average  Average  Table 3. The meanVar and variance   of joints angular velocity in local enlarged images Var of low score model (Group2). Var  ity  (rad/s)  (rad/s)  (rad/s)  Joint Base Shoulder Elbow −2 −5 −3 −4 −2 −5 ATRTD3 −1.71 × 10   4.17 × 10   8.04 × 10   1.08 × 10 −1.23 × 10   6.18 × 10   Angular Average Average Average Var Var Var Velocity (rad/s) (rad/s) (rad/s) −2 −4 −3 −4 −2 −4 RTD3  3.03 × 10   1.14 × 10 −2.50 × 10   5.87 × 10 −1.73 × 10   4.48 × 10   2 5 3 4 2 5 ATRTD3 1.71  10 4.17  10 8.04  10 1.08  10 1.23  10 6.18  10 −2 2 −5 4 −23 4 −5 2 4−2 −5 RTD3 3.03  10 1.14  10 2.50  10 5.87  10 1.73  10 4.48  10 TD3  1.39 × 10   3.40 × 10   3.67 × 10   9.69 × 10 −6.58 × 10   6.95 × 10   2 5 2 5 2 5 TD3 1.39  10 3.40  10 3.67  10 9.69  10 6.58  10 6.95  10 −2 −5 −2 −5 −3 −4 2 5 2 5 3 4 DDPG  3.20 × 10   5.38 × 10 −1.40 × 10   7.07 × 10   8.21 × 10   1.27 × 10   DDPG 3.20  10 5.38  10 1.40  10 7.07  10 8.21  10 1.27  10 Skill Level Improvement Stability Improvement 100% 80% 60% 40% 20% 0% Group1 Group2 Group1 Group2 Group1 Group2 DDPG RTD3 TD3 Figure 11. In the high score model (Group1) and low score model (Group2), the ATRTD3 algorithm compared with the Figure 11. In the high score model (Group1) and low score model (Group2), the ATRTD3 algo‐ other three algorithms in the control of joint angular velocity skill level and late training stability are two aspects of improvement effect. rithm compared with the other three algorithms in the control of joint angular velocity skill level  and late training stability are two aspects of improvement effect.  Through Figure 11, we can see that ATRTD3 can significantly improve the skill level  and stability compared with the other three algorithms. ATRTD3 generally improves the  skill level by more than 25.27% and improves stability by more than 15.90%. Compared  with TD3, the stability improvement of ATRTD3 in Group 2 is −5.54%. The main reason is  that TD3 falls into a more stable local optimization, so the performance of ATRTD3 cannot  be questioned under this index.  In the Actor neural network, the Tetanic stimulation mechanism changes the param‐ eters of some neurons compulsively and improves the exploration space of action in a  certain  range.  Through  Tetanic  stimulation,  the  weight  of  selected  neurons  can  be  in‐ creased again, which helps to improve the convergence speed of the neural network and  speed up the training speed. The Amnesia mechanism makes the parameters of neurons  changed compulsively, which is in line with the situation of biological cell decay.  Performance Improvement 88.74 41.71 30.94 15.90 86.25 96.05 25.27 81.59 91.80 96.54 67.84 -5.540 Actuators 2021, 10, 254 14 of 16 Through Figure 11, we can see that ATRTD3 can significantly improve the skill level and stability compared with the other three algorithms. ATRTD3 generally improves the skill level by more than 25.27% and improves stability by more than 15.90%. Compared with TD3, the stability improvement of ATRTD3 in Group 2 is 5.54%. The main reason is that TD3 falls into a more stable local optimization, so the performance of ATRTD3 cannot be questioned under this index. In the Actor neural network, the Tetanic stimulation mechanism changes the param- eters of some neurons compulsively and improves the exploration space of action in a certain range. Through Tetanic stimulation, the weight of selected neurons can be increased again, which helps to improve the convergence speed of the neural network and speed up the training speed. The Amnesia mechanism makes the parameters of neurons changed compulsively, which is in line with the situation of biological cell decay. 6. Conclusions In this paper, we propose an algorithm named ATRTD3, for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and Amnesia mechanism. The main contribution of this paper is that we show a biomimetic view to speed up the converging process by biochemical reactions generated by neurons in the biological brain during memory and forgetting. The integration of the two mechanisms is of great significance to expand the scope of exploration, jump out of the local optimal solution, speed up the learning process and enhance the stability of the algorithm. The effectiveness of the proposed algorithm is validated by a simulation example including the comparisons with previously developed DRL algorithms. The results indicate that our approach shows performance improvement in terms of convergence speed and precision in the multi-DOF manipulator. The ATRTD3 we proposed successfully applies the research results of neuroscience to the DRL algorithm to enhance its performance, and uses computer technology to carry out biological heuristic design through code to approximately realize some functions of the biological brain. In the future, we will further learn from neuroscience achievements, improve the learning efficiency of DRL, and use DRL to achieve reliable and accurate manipulator control. Furthermore, we will continue to challenge the scheme of controlling six joints to realize the cooperative control of position and orientation. Author Contributions: Conceptualization, H.H. and Y.H.; methodology, Y.H.; software, Y.H.; valida- tion, Y.H., D.X. and Z.Z.; formal analysis, Y.H.; investigation, Y.H.; resources, H.H.; data curation, Y.C.; writing—original draft preparation, Y.H.; writing—review and editing, Y.C. and Z.L.; visualization, Y.H.; supervision, H.H. All authors have read and agreed to the published version of the manuscript. Funding: This research received no external funding. Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: The authors exclude this statement. Conflicts of Interest: The authors declare no conflict of interest. Actuators 2021, 10, 254 15 of 16 Appendix A Pseudo code of Algorithm A1. Algorithm A1. ATRTD3 1: Initialize critic networks Q , Q ,and actor network p with random parameters q , q , f q q f 1 2 1 2 2: Initialize target networks Q 0 , Q 0 , p 0 q q f 1 2 0 0 0 3: Target network node assignment q ! q ; q ! q ; f ! f 1 2 4: Initialize replay buffer B 5: for t = 1 to T do 6: Amnesia framework 7: Select action with exploration noise a  p + #, #  N(0, s) 8: Temporary store transition tuple (s, a, s , r) in B 9: Fix transition tuple r r + R , angular velocity correction R joint_Vel Joint_Vel 0 0 10: Store transition tuple (s, a, s , r ) in B 11: If Sum < mini_batch then 12 return 13: Sample mini-batch of N transition (s, a, r, s ) from B 14: ea p 0(s ) + #, #  clip(N(0,es),c, c) 0 e 15: y = r + min Q (s , a) i=1,2 16: Statistical calculation of Q value utilization ratio w 0 , w 0 q q 1 2 17: If w < Mini_utilization then 18: Rebirth target networks q 19: End if 20: If w 0 < Mini_utilization then 21: Rebirth target networks q 22: End if 23: Update critics q argmin N (y Q (s, a)) i q q i i 24: If t mod d then 25: Update f by the deterministic policy gradient: 26: r J(f) = N r Q (s, a) r p (s) f a q f f a=p (s) 27: Tetanic stimulation Framework 28: Update target networks: 0 0 29: q tq + (1 t)q i i 0 0 30: f tf + (1 t)f 31: End if 32: End if 33: End for References 1. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection. In International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2016; pp. 173–184. 2. Zhang, M.; Mccarthy, Z.; Finn, C.; Levine, S.; Abbeel, P. Learning deep neural network policies with continuous memory states. In Proceedings of the International Conference on Robotics and Auto-mation, Stockholm, Sweden, 16 May 2016; pp. 520–527. 3. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40. 4. Lenz, I.; Knepper, R.; Saxena, A. DeepMPC:learning deep latent features for model predictive control. In Proceedings of the Robotics Scienceand Systems, Rome, Italy, 13–17 July 2015; pp. 201–209. 5. Satija, H.; Pineau, J. Simultaneous machine translation using deep reinforcement learning. In Proceedings of the Workshops of International Conference on Machine Learning, New York, NY, USA, 24 June 2016; pp. 110–119. 6. Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017, 19, 70–76. [CrossRef] 7. Caicedo, J.; Lazebnik, S. Active Object Localization with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2488–2496. 8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. 9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Figjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] Actuators 2021, 10, 254 16 of 16 10. Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. 11. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. 12. Mnih, V.; Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. 13. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. 14. Heess, N.; Dhruva, T.B.; Sriram, S.; Lemmon, J.; Silver, D. Emergence of Locomotion Behaviours in Rich Environments. arXiv 2017, arXiv:1707.02286. 15. Fujimoto, S.; Hoof, H.V.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. 16. Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy Networks for Exploration. arXiv 2017, arXiv:1706.10295. 17. Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.Y.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter Space Noise for Exploration. arXiv 2017, arXiv:1706.01905. 18. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. Adv. Neural Inf. Process. Syst. 2016, 29, 1471–1479. 19. Choshen, L.; Fox, L.; Loewenstein, Y. DORA The Explorer: Directed Outreaching Reinforcement Action-Selection. arXiv 2018, arXiv:1804.04012. 20. Badia, A.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.; et al. Never Give Up: Learning Directed Exploration Strategies. arXiv 2020, arXiv:2002.06038. 21. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. arXiv 2016, arXiv:1610.00633. 22. Hassabis, D.; Kumaran, D.; Summerfield, C.; Botvinick, M. Neuroscience-Inspired Artificial Intelligence. Neuron 2017, 95, 245–258. [CrossRef] 23. MyeongSeop, K.; DongKi, H.; JaeHan, P.; JuneSu, K. Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay. Appl. Sci. 2020, 10, 575. 24. Zhang, H.; Wang, F.; Wang, J.; Cui, B. Robot Grasping Method Optimization Using Improved Deep Deterministic Policy Gradient Algorithm of Deep Reinforcement Learning. Rev. Sci. Instrum. 2021, 92, 1–11. 25. Kwiatkowski, R.; Lipson, H. Task-agnostic self-modeling machines. Sci. Robot. 2019, 4, eaau9354. [CrossRef] 26. Iriondo, A.; Lazkano, E.; Susperregi, L.; Urain, J.; Fernandez, A.; Molina, J. Pick and Place Operations in Logistics Using a Mobile Manipulator Controlled with Deep Reinforcement Learning. Appl. Sci. 2019, 9, 348. [CrossRef] 27. Giorgio, I.; Del Vescovo, D. Energy-based trajectory tracking and vibration control for multilink highly flexible manipulators. Math. Mech. Complex Syst. 2019, 7, 159–174. [CrossRef] 28. Rubinstein, D. Dynamics of a flexible beam and a system of rigid rods, with fully inverse (one-sided) boundary conditions. Comput. Methods Appl. Mech. Eng. 1999, 175, 87–97. [CrossRef] 29. Bliss, T.V.P.; Lømo, T. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 1973, 232, 331–356. [CrossRef] 30. Hebb, D.O. The Organization of Behavior; Wiley: New York, NY, USA, 1949. 31. Thomas, M.J.; Watabe, A.M.; Moody, T.D.; Makhinson, M.; O’Dell, T.J. Postsynaptic Complex Spike Bursting Enables the Induction of LTP by Theta Frequency Synaptic Stimulation. J. Neurosci. 1998, 18, 7118–7126. [CrossRef] [PubMed] 32. Dayan, P.; Abbott, L.F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems; The MIT Press: New York, NY, USA, 2001. 33. Bliss, T.V.P.; Cooke, S.F. Long-Term Potentiation and Long-Term Depression: A Clinical Perspective. Clinics 2011, 66 (Suppl. 1), 3–17. [CrossRef] 34. Hou, Y.Y.; Hong, H.J.; Sun, Z.M.; Xu, D.S.; Zeng, Z. The Control Method of Twin Delayed Deep Deterministic Policy Gradient with Rebirth Mechanism to Multi-DOF Manipulator. Electronics 2021, 10, 870. [CrossRef] 35. Denavit, J.; Hartenberg, R.S. A Kinematic Notation for Lower-Pair Mechanisms. J. Appl. Mech. 1955, 77, 215–221. [CrossRef] http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Actuators Multidisciplinary Digital Publishing Institute

A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator

Loading next page...
 
/lp/multidisciplinary-digital-publishing-institute/a-deep-reinforcement-learning-algorithm-based-on-tetanic-stimulation-6h73VP8rWP

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Multidisciplinary Digital Publishing Institute
Copyright
© 1996-2021 MDPI (Basel, Switzerland) unless otherwise stated Disclaimer The statements, opinions and data contained in the journals are solely those of the individual authors and contributors and not of the publisher and the editor(s). MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. Terms and Conditions Privacy Policy
ISSN
2076-0825
DOI
10.3390/act10100254
Publisher site
See Article on Publisher Site

Abstract

actuators Article A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator Yangyang Hou, Huajie Hong * , Dasheng Xu, Zhe Zeng, Yaping Chen and Zhaoyang Liu College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; houyangyang14@nudt.edu.cn (Y.H.); ixudasheng@163.com (D.X.); zengzhe@nudt.edu.cn (Z.Z.); yaping_chen2021@163.com (Y.C.); ZhaoyangLiuNUDT@163.com (Z.L.) * Correspondence: honghuajie@nudt.edu.cn; Tel.: +86-138-7313-0046 Abstract: Deep Reinforcement Learning (DRL) has been an active research area in view of its capability in solving large-scale control problems. Until presently, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG), Twin-Delayed Deep Deterministic Policy Gradient (TD3), and so on. However, the converging achievement of DRL often requires extensive collected data sets and training episodes, which is data inefficient and computing resource consuming. Motivated by the above problem, in this paper, we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3), for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and Amnesia mechanism. The main contribution of this paper is that we show a Citation: Hou, Y.; Hong, H.; Xu, D.; biomimetic view to speed up the converging process by biochemical reactions generated by neurons Zeng, Z.; Chen, Y.; Liu, Z. A Deep in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm Reinforcement Learning Algorithm is validated by a simulation example including the comparisons with previously developed DRL Based on Tetanic Stimulation and algorithms. The results indicate that our approach shows performance improvement in terms of Amnesic Mechanisms for Continuous convergence speed and precision. Control of Multi-DOF Manipulator. Actuators 2021, 10, 254. https:// Keywords: multi-DOF manipulator; tetanic stimulation; amnesia mechanism; deep reinforcement doi.org/10.3390/act10100254 learning Academic Editors: Ioan Doroftei and Karsten Berns Received: 11 August 2021 1. Introduction Accepted: 20 September 2021 Deep Reinforcement Learning (DRL) is an advanced intelligent control method. It uses Published: 29 September 2021 a neural network to parameterize the Markov decision processes (MDP). DRL has been successfully applied to the field of robots [1–4], Machine Translation [5], Auto-Driving [6], Publisher’s Note: MDPI stays neutral Target Positioning [7], and shows strong adaptability. There are two kinds of the DRL with regard to jurisdictional claims in algorithms, one is based on value function such as Deep Q Network (DQN) [8] and Nature published maps and institutional affil- DQN [9]. The output of DRL algorithm based on value function is discrete state-action iations. value. The other is policy-based, such as Deep Deterministic Policy Gradient (DDPG) [10], Trust Region Policy Optimization (TRPO) [11], Asynchronous Advantage Actor-Critic (A3C) [12], Distributed Proximal Policy Optimization (DPPO) [13,14], and Twin-Delayed Deep Deterministic Policy Gradient (TD3) [15]. For continuous action space, advanced Copyright: © 2021 by the authors. search policy can improve the sampling efficiency of the underlying algorithms [10]. Many Licensee MDPI, Basel, Switzerland. research results focus on improving exploration policy. Among them, Fortunato et al. [16] This article is an open access article and Plppert et al. [17] put forward a noise-based exploration policy in adding noise to the distributed under the terms and action space and observation space. Bellemare et al. propose an exploratory algorithm conditions of the Creative Commons based on pseudo-counting for efficient exploration. The algorithm evaluates frequency by Attribution (CC BY) license (https:// designing a density model that satisfies certain properties and calculates pseudo-counts creativecommons.org/licenses/by/ that are generalized in continuous space to encourage exploration [18]. In [19], Fox et al. 4.0/). Actuators 2021, 10, 254. https://doi.org/10.3390/act10100254 https://www.mdpi.com/journal/actuators Actuators 2021, 10, 254 2 of 16 innovatively propose a framework, DORA, that uses two parallel MDP processes to inject exploration signals into random tasks. Reward-based exploration results in a slower approximation of functions and failure to provide an intrinsic reward signal in time. So, Badia et al. [20] propose a “never give up” exploration strategy (NGU), which is designed to quickly prevent repeated access to the same state in the same episode. Due to the coupling and complexity, a multi-degree-of-freedom (multi-DOF) manipu- lator is a hot application background of DRL. The direct applications of DRL algorithms have, so far, been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity [21]. Therefore, the application of DRL in multi-DOF robots requires a more effective DRL algorithm. Neuroscience provides a rich source of inspiration for new types of algorithms and architectures, independent of and complemen- tary to the mathematical and logic-based methods and ideas that have largely dominated traditional approaches to AI [22]. In this paper, we design a new DRL algorithm named ATRTD3 based on the research results of neuroscience and the analysis of the human brain memory learning process. Our team shows a biomimetic view to speed up the converging process by biochemical reactions generated by neurons in the biological brain during memory and forgetting. We apply the neural network parameter updating mechanism with Tetanic stimulation and Amnesia mechanism to the DRL algorithm to further improve the efficiency in the application of manipulator. 2. Related Work The advancement of DRL has led to the development of intelligent control of a multi- DOF manipulator in recent years. Kim et al. [23] propose a motion planning algorithm for robot manipulators using TD3, which designed paths are smoother and shorter after 140,000 episodes than those designed by Probabilistic Roadmap. Based on the classic DDPG algorithm, Zhang et al. smoothly add Gaussian parameters to improve the exploratory nature of the algorithm, dynamically sets the robot grasping space parameters to adapt to the workspace of multiple scales, and realizes the accurate grasping of the robot [24]. Robert Kwiatkowski et al. [25] used DL methods to make the manipulator build a self- model after 35 h of training. By comparing the application of DDPG and Proximal Policy Optimization (PPO) in the manipulator, Lriondo et al. [26] concluded that the current DRL algorithm could not obtain robust motion ability and acceptable training efficiency. The difficulty of applying DRL to the motion control of the multi-DOF manipulator lies in how to improve the exploration efficiency and improve the robustness of the output action of the manipulator. Therefore, it is necessary to get inspiration from the research results of neuroscience. For flexible manipulators, some research [27,28] are very interesting, which brings some help to kinematics modeling and control design in this paper. Long-term potentiation (LTP) is a form of activity-dependent plasticity which results in a persistent enhancement of synaptic transmission. LTP has been a source of great fascination to neuroscientists since its discovery in the early 1970s [29] because it satisfies the criteria proposed by Donald Hebb for a synaptic memory mechanism in his influential book ‘The Organization of Behavior ’ [30]. LTP is a persistent enhancement of excitatory synaptic transmission induced by some kinds of preceding operations of high-frequency stimulation (HFS) [31]. In LTP, stimulation changes the synaptic proteins, that is, changes the sensitivity of postsynaptic neurons to presynaptic neurons, thus changing the strength and efficiency of synaptic signal transmis- sion. Memory formation is considered to be the result of long-term synaptic plasticities, such as long-term depression (LTD) and LTP [32]. LTP and LTD have another potentially important role in modern neuroscience, and that is the possibility that they may be exploited to treat disorder and disease in the human central nervous system (CNS). A variety of neurological conditions arise from lost or excessive synaptic drive due to sensory deprivation during childhood, brain damage, or disease [33]. Memory and forgetting are the stages that the human brain must go through in Actuators 2021, 10, x FOR PEER REVIEW  3  of  17  to presynaptic neurons, thus changing the strength and efficiency of synaptic signal trans‐ mission. Memory formation is considered to be the result of long‐term synaptic plastici‐ ties, such as long‐term depression (LTD) and LTP [32].  LTP and LTD have another potentially important role in modern neuroscience, and  that is the possibility that they may be exploited to treat disorder and disease in the human  Actuators 2021, 10, 254 3 of 16 central nervous system (CNS). A variety of neurological conditions arise from lost or ex‐ cessive synaptic drive due to sensory deprivation during childhood, brain damage, or  disease [33]. Memory and forgetting are the stages that the human brain must go through  in the process of accepting knowledge and accumulating experience. This paper will mod‐ the process of accepting knowledge and accumulating experience. This paper will modify ify the Actor‐network module in DRL, and change the Actor‐network module optimized  the Actor-network module in DRL, and change the Actor-network module optimized by by gradient descent into a network module with biological characteristics.  gradient descent into a network module with biological characteristics. 3. Methods 3. Methods  ATRTD3 is a DRL algorithm proposed in this paper to improve the motion ability of a ATRTD3 is a DRL algorithm proposed in this paper to improve the motion ability of  multi-DOF manipulator. The innovation of the algorithm is to transform the research results a multi‐DOF manipulator. The innovation of the algorithm is to transform the research  of neuroscience into DRL. This algorithm is based on the Twin-Delayed Deep Deterministic results of neuroscience into DRL. This algorithm is based on the Twin‐Delayed Deep De‐ Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves the update terministic Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves  of network weight parameters of Actor-network module. The highlight of the algorithm is the update of network weight parameters of Actor‐network module. The highlight of the  that it uses tetanic stimulation and amnesia mechanism to randomly enhance and weaken algorithm is that it uses tetanic stimulation and amnesia mechanism to randomly enhance  the weighting parameters of the neural network, thus realizing the bionic update of the and weaken the weighting parameters of the neural network, thus realizing the bionic  neural network. The Actor-network module obtained through the deterministic policy update of the neural network. The Actor‐network module obtained through the determin‐ gradient needs to be further updated through the above two mechanisms. Compared istic  policy  gradient  needs  to  be further  updated  through  the  above  two  mechanisms.  with other DRL algorithms, ATRTD3 increases the network update part of biological Compared with other DRL algorithms, ATRTD3 increases the network update part of bi‐ characteristics and further expands the scope of exploration. Figure 1 shows the framework ological characteristics and further expands the scope of exploration. Figure 1 shows the  of the overall algorithm. framework of the overall algorithm.   Figure 1. ATRTD3 algorithm framework. Figure 1. ATRTD3 algorithm framework.  The pseudo-code of ATRTD3 is shown in Appendix A at the end of this paper. The The pseudo‐code of ATRTD3 is shown in Appendix A at the end of this paper. The  following is a detailed description of the Tetanic stimulation and Amnesia mechanism. following is a detailed description of the Tetanic stimulation and Amnesia mechanism.   3.1. Tetanic Stimulation     Tetanic stimulation is the memory part of the Actor-network module. In the process of back propagation, the neural network obtains the updating quantity of parameters by gradient descent, to realize the iterative updating of the network. By evaluating the updating of network parameters, we can get which neural nodes’ weights are enhanced and which ones are weakened. For the parameters of the strengthened neurons, there are also differences in the strengthened degree. Therefore, it is necessary to evaluate and sort the parameters of the strengthened neuron nodes, select the ranking of the strengthened degree, and then obtain the neuron nodes qualified for Tetanic stimulation, and conduct Tetanic stimulation on the above parameters of the neuron nodes to achieve LTP, as shown in Figure 2, and the specific pseudo code is shown in Algorithm 1. Directly modifying the parameters of neural nodes will directly affect the nonlinear expression results of neural networks. This is a kind of immediate influence, which will immediately show the change Actuators 2021, 10, 254 4 of 16 effect of neuron weight in the continuous MDP, so we need to control this kind of influence in a reasonable range, and constantly update the parameters of the neural network in the iterative process. The designed Tetanic stimulation mechanism is nested in the fixed delay update step, which can exert the effect of Tetanic stimulation mechanism to a certain extent, and will not affect the overall effect of network update iteration, to ensure that the network converges towards the direction of performance improvement in the training process, and has the function of not weakening the attempt to explore the process. Algorithm 1 Tetanic stimulation 1: Tetanic stimulation coefficient k 2: Load Actor network f, W Actor. f c.weight 3: Update Actor network f f new 4: Load New Actor network, W Actor. f c.weight new 5: DW = W W 6: Select the serial number (row , col ) of the T largest data in DW list list 7: For t = 1 to T do: 8: If random(0, 1) < k: 9: If A.w(row , col ) > 0 t t 10: A.w(row , col ) (1 + random(0, 0.01)) A.w(row , col ) t t t t 11: else: 12: A.w(row , col ) (1 random(0, 0.01)) A.w(row , col ) t t t t Actuators 2021, 10, x FOR PEER REVIEW  5  of  17  13: End if 14: End if 15: End for Tetanic Stimulation Framework Behind _Fc Behind _Fc Behind _Fc Front _Fc Front _Fc Front _Fc 1 1 1 1 1 1 Backward 2 2 2 2 2 2 3 3 3 Figure 2. Schematic diagram of working mechanism of Tetanic stimulation. Figure 2. Schematic diagram of working mechanism of Tetanic stimulation.  3.2. Amnesia Mechanism 3.2. Amnesia Mechanism  The Amnesia mechanism is the forgetting part of the Actor-network module. When The Amnesia mechanism is the forgetting part of the Actor‐network module. When  there are problems in information transmission between neurons, synapses cannot normally there are problems in information transmission between neurons, synapses cannot nor‐ perform the function of neurotransmitter transmission, the brain begins to have problems mally perform the function of neurotransmitter transmission, the brain begins to have  in information transmission, forgetting begins to occur, at this time, the huge redundant problems in information transmission, forgetting begins to occur, at this time, the huge  brain neural network begins to have problems, and some memory and logic units begin to redundant brain neural network begins to have problems, and some memory and logic  have problems. Neuron function is not always in a stable and good working state, there units begin to have problems. Neuron function is not always in a stable and good working  will be all kinds of accidents, just as the world’s best snooker players cannot guarantee state, there will be all kinds of accidents, just as the world’s best snooker players cannot  that every shot will be accurate. So, there is forgetting in the whole period and process guarantee that every shot will be accurate. So, there is forgetting in the whole period and  of neurons. For this reason, the Amnesia mechanism is added to the neural network by process of neurons. For this reason, the Amnesia mechanism is added to the neural net‐ randomly selecting the neurons in the network with small probability, and then weakening work by randomly selecting the neurons in the network with small probability, and then  the parameters of the neuron nodes. When using the Amnesia mechanism to weaken weakening the parameters of the neuron nodes. When using the Amnesia mechanism to  the representation ability of the neural network, the influence of this weakening must be weaken the representation ability of the neural network, the influence of this weakening  controllable, that is to say, it will not affect the convergence trend of the neural network. must be controllable, that is to say, it will not affect the convergence trend of the neural  network. Therefore, to ensure that the influence of the Amnesia mechanism can be con‐ trolled,  the  weights  of  neurons  are  weakened  by  random  force  (a  random  percentage  number) in this paper, as shown in Figure 3, and the specific pseudo code is shown in  Algorithm 2.  Algorithm 2 Amnesia Framework  1: Load Actor network,    W  A.w(Actor. fc.weight) 2:  N   is the number of the  W ’s node  3: For i = 1 to  N :  4:     Random(0, 1) number   , Amnesia threshold value   Mutation coefficient     k 5:     If     :  6:        Actor.fc.weight[i] Actor.fc.weight[i] (1krandom(0,1))  7:     End if  8: End for  Actuators 2021, 10, 254 5 of 16 Therefore, to ensure that the influence of the Amnesia mechanism can be controlled, the weights of neurons are weakened by random force (a random percentage number) in this paper, as shown in Figure 3, and the specific pseudo code is shown in Algorithm 2. Algorithm 2 Amnesia Framework 1: Load Actor network, W A.w(Actor. f c.weight) 2: N is the number of the W’s node 3: For i = 1 to N: 4: Random(0, 1) number x , Amnesia threshold value t, Mutation coefficient k 5: If x < t: Actuators 2021, 10, x FOR PEER REVIEW  6  of  17  6: Actor. f c.weight[i] Actor. f c.weight[i] (1 k random(0, 1))   7: End if 8: End for Figure 3. Schematic diagram of Amnesia framework. Figure 3. Schematic diagram of Amnesia framework.  4. Experiment 4.1. Experiment Setup 4. Experiment  For the control problem of the multi-DOF manipulator, if we only consider the kine- 4.1. Experiment Setup  matics model of the manipulator, and the motion of the multi-DOF manipulator is regarded For the control problem of the multi‐DOF manipulator, if we only consider the kine‐ as a discrete process from one position of the end effector to another position, the DRL matics model of the manipulator, and the motion of the multi‐DOF manipulator is re‐ method of deterministic policy gradient, such as RTD3, can achieve good results. However, garded as a discrete process from one position of the end effector to another position, the  if the motion of the manipulator is regarded as a continuous motion process, a group DRL method of deterministic policy gradient, such as RTD3, can achieve good results.  of new inputs and outputs of DRL must be found. The idea adopted in this paper is to However, if the motion of the manipulator is regarded as a continuous motion process, a  discretize the motion process of the manipulator in time, take the position deviation of the group of new inputs and outputs of DRL must be found. The idea adopted in this paper  end-effector from the target, the angular velocity of the joints, and the angular of the joints is to discretize the motion process of the manipulator in time, take the position deviation  as the input information of the DRL in this paper, and then take the angular acceleration of the end‐effector from the target, the angular velocity of the joints, and the angular of  the joints as the input information of the DRL in this paper, and then take the angular  acceleration command for the next interval of the control joint as the output information,  as shown in Figure 4. In this way, by controlling the angular acceleration of the joint, the  problem of discrete processes in the previous control position process can be solved. How‐ ever, this change will inevitably lead to the increase in model dimensions, which not only  puts forward new requirements for the ability of the DRL algorithm but also puts forward  new requirements for the redesign of the reward function.   UR manipulator is a representative manipulator in industrial production and scien‐ tific research. Therefore, this paper will use the structural size and joint layout of manip‐ ulator to establish the simulation manipulator in this paper.  Actuators 2021, 10, 254 6 of 16 command for the next interval of the control joint as the output information, as shown in Figure 4. In this way, by controlling the angular acceleration of the joint, the problem of discrete processes in the previous control position process can be solved. However, Actuators 2021, 10, x FOR PEER REVIEW  7  of  17  this change will inevitably lead to the increase in model dimensions, which not only puts forward new requirements for the ability of the DRL algorithm but also puts forward new requirements for the redesign of the reward function. Initial State DRL Controller Positi on Deviation Observa tion Input Output Joints Angular Joints Angular Acceleration Velocity ATRTD3 Increment Manipulator Joints Angular Update Reset Intermediate State Termination State No Replay Buffer Termination  Yes Judgment Manipulator Episode Manipulator Figure 4. Schematic diagram of control flow. Figure 4. Schematic diagram of control flow.  UR manipulator is a representative manipulator in industrial production and scientific 4.2. Task Introduction  research. Therefore, this paper will use the structural size and joint layout of manipulator to establish In this pa the per simulation , the DRL al manipulator gorithm is us ined this  to tra paper in a . model controller of the multi‐DOF  manipulator. The model can control the angular acceleration of the joints of the manipu‐ 4.2. Task Introduction lator so that the manipulator can start to move from the initial position of the workspace  In this paper, the DRL algorithm is used to train a model controller of the multi-DOF in the static state, and then move to the target position and stop. In the whole training  manipulator. The model can control the angular acceleration of the joints of the manipulator process, the target position of the task is a fixed position in the workspace of the manipu‐ so that the manipulator can start to move from the initial position of the workspace in the lator. The core of the task is that the manipulator can reach the target position and when  static state, and then move to the target position and stop. In the whole training process, it reaches the target position at the same time, each joint of the manipulator is at static  the target position of the task is a fixed position in the workspace of the manipulator. The state. Finally, the manipulator can reach the target position smoothly by controlling the  core of the task is that the manipulator can reach the target position and when it reaches the angular acceleration of joints. In order to limit the boundaries of the entire task, the entire  target position at the same time, each joint of the manipulator is at static state. Finally, the training process must be restricted. Each episode is divided into twenty steps. This setting  manipulator can reach the target position smoothly by controlling the angular acceleration mainly takes into account that the training convergence process takes a long time and the  of joints. In order to limit the boundaries of the entire task, the entire training process must time of a single episode must be shortened. This task is a simulation experiment to test  be restricted. Each episode is divided into twenty steps. This setting mainly takes into the convergence ability and learning ability of the improved algorithm ATRTD3.   account that the training convergence process takes a long time and the time of a single episode must be shortened. This task is a simulation experiment to test the convergence 4.3. Simulation Environment Construction  ability and learning ability of the improved algorithm ATRTD3. The DRL algorithm establishes the manipulator model through the standard DH [35]  method and uses the positive kinematics solution to solve the spatial pose of the end ef‐ 4.3. Simulation Environment Construction fector through the joint angle. DH modeling method is a general modeling method of  The DRL algorithm establishes the manipulator model through the standard DH [35] multi‐link mechanism. Standard DH models are used for serial structure robots. In Table  method and uses the positive kinematics solution to solve the spatial pose of the end 1, we show the DH parameters of this manipulator, where  a   is the length of the link,  d   effector through the joint angle. DH modeling method is a general modeling method of is the offset of the link,     is the twist angle of the link,     is the joint angle. The units  multi-link mechanism. Standard DH models are used for serial structure robots. In Table 1, of  a   and  d   are meters, and the units of     and     are radians.  we show the DH parameters of this manipulator, where a is the length of the link, d is the offset of the link, a is the twist angle of the link, q is the joint angle. The units of a and d are Table 1. The D‐H parameters of manipulator.  meters, and the units of a and q are radians. 𝜶   𝒅   𝜶   𝜽   Joint  Base  0  0.0892   / 2   Shoulder −0.4250  0  0        Elbow −0.3923  0  0     Wrist1  0  0.1092   / 2     Wrist2  0  0.0946   / 2     Wrist3  0  0.0823  0  6 Actuators 2021, 10, 254 7 of 16 Table 1. The D-H parameters of manipulator. Joint a d a q Base 0 0.0892 p/2 q Shoulder 0.4250 0 0 q Elbow 0.3923 0 0 q Wrist1 0 0.1092 p/2 q Wrist2 0 0.0946 p/2 q Wrist3 0 0.0823 0 q During the experiment, only base, shoulder, and elbow are controlled, wrist1, wrist2, and wrist3 were locked. Because the problem studied in this paper focuses on the position reaching ability of the manipulator, it can be realized only by using three joints, so the three joints of the wrist joints are locked. The homogeneous transformation matrix is established through D–H parameters, as shown in Equation (1). 2 3 cos(q ) sin(q ) cos(a ) sin(q ) sin(a ) a cos(q ) i i i i i i i 6 7 sin(q ) cos(q ) cos(a ) cos(q ) sin(a ) a sin(q ) i1 i i i i i i i 6 7 T = (1) i 4 5 0 sin(a ) cos(a ) d i i i 0 0 0 1 The solution of the forward kinematics of the manipulator can be obtained by mul- tiplying the homogeneous transformation matrix, as shown in Equation (2). In the base coordinate system {B}, the position of the end of the manipulator can be obtained. 0 0 1 2 3 4 5 T = T T T T T T (2) 6 1 2 3 4 5 6 In each episode, the target position is randomly generated in the workspace of the manipulator. In the experiment, the distance difference between the center position of the end effector and the target position in three directions (dx, dy, and dz), the angular veloc- ity (w , w , and w ) and absolute angle of the first three joints Joint_Base Joint_Shoulder Joint_Eblow (q , q , and q ) are used as the input of DRL, and the angular Joint_Base Joint_Shoulder Joint_Eblow . . acceleration control commands of the base, shoulder and elbow (w , w , Joint_Base Joint_Shoulder and w ) are output by DRL. In order to ensure the safe operation of the virtual Joint_Elbow manipulator, the angular acceleration (q/s ) is limited, as shown in Equation (3). w 2 (0.5, 0.5), i 2 (Base, Shoulder, Eblow) (3) When the DRL outputs the angular acceleration control command w , the joint angle increment Dq obtained in this step is calculated by Equation (4) according to the interval time t = 0.1 s and the angular velocity of the previous step w . The current joint angle _i q is updated through the joint angle increment Dq and the joint angle of the previous i_ i step q in Equation (5). The position of the manipulator end effector in {B} coordinate is _i obtained by homogeneous transformation matrix T calculation. The joint angular velocity is updated as shown in Equation (6). 1 . Dq = w t + w t , i 2 (Base, Shoulder, Eblow) (4) i _i i q = q + Dq (5) i_ _i i w = w + w t (6) i_ _i i In this motion process, the DRL model sends the angular acceleration command (the output of the DRL algorithm) of the joints to manipulator according to the perceived environment and its state (the input of the DRL algorithm) and gives the termination command when judging the motion. Actuators 2021, 10, x FOR PEER REVIEW  9  of  17  Actuators 2021, 10, 254 8 of 16 4.4. Rewriting Experience Playback Mechanism and Reward Function Design  The motion process of a multi‐DOF manipulator is no longer multiple discrete spatial  4.4. Rewriting Experience Playback Mechanism and Reward Function Design position points. As can be seen from Figure 5 below, this experiment completes the spec‐ The motion process of a multi-DOF manipulator is no longer multiple discrete spatial ified task by controlling the angular acceleration of the three joints. In the joint angular  position points. As can be seen from Figure 5 below, this experiment completes the specified velocity field drawn by three joint angular velocities, in Case 1, the manipulator can reach  task by controlling the angular acceleration of the three joints. In the joint angular velocity the field  target drawn  position by thr ee and joint stop angular , which velocities,  means  in tha Case t the 1, task the manipulator  is successfu can llyr completed; each the target in Case  position and stop, which means that the task is successfully completed; in Case 2, the 2, the angular velocity of the joint does not stop at all, or it passes through the target po‐ angular velocity of the joint does not stop at all, or it passes through the target position sition quickly and fails to complete the task; for Case 3, although the manipulator stops in  quickly and fails to complete the task; for Case 3, although the manipulator stops in the the scheduled time, the end of the manipulator does not reach the target position, and the  scheduled time, the end of the manipulator does not reach the target position, and the task task fails to complete. Figure 5 also shows that the task of this experiment is more difficult  fails to complete. Figure 5 also shows that the task of this experiment is more difficult than than the task of only achieving the goal through discrete spatial position.  the task of only achieving the goal through discrete spatial position. Figure 5. Schematic diagram of the angular velocity field of joints. x is the threshold for determining joint stop. Figure 5. Schematic diagram of the angular velocity field of joints.    is the threshold for determining joint stop.  Therefore, the design of the reward function must be reconsidered, that is to say, the angular velocity information of the joint must be introduced. In the whole process of Therefore, the design of the reward function must be reconsidered, that is to say, the  the manipulator movement, each instance of acceleration and deceleration has an impact angular velocity information of the joint must be introduced. In the whole process of the  on the angular velocity of each joint when the termination condition is reached and the manipulator movement, each instance of acceleration and deceleration has an impact on  iteration of the round stops. Therefore, it is necessary to further improve the experience the angular velocity of each joint when the termination condition is reached and the iter‐ playback mechanism and change the experience pool storage. In other words, the final ation of the round stops. Therefore, it is necessary to further improve the experience play‐ angular velocity of each joint should be shared by all the previous continuous movements. back mechanism and change the experience pool storage. In other words, the final angular  As shown in Equation (7), here the absolute value of the angular velocity of the joint will velocity be taken,  of multiplied each joint by sho the uld constant  be shared l , divided  by all the by the previous number contin of steps uous T  movem , and the ents. As  i Stop corresponding reward value will be obtained. shown in Equation (7), here the absolute value of the angular velocity of the joint will be  taken, multiplied by the constant  , divided by the number of steps  , and the corre‐ i Stop l jw j + l jw j + l jw j 1 Base 2 Shoulder 3 Elbow R = , l > 0, i = f1, 2, 3g (7) Joint_Vel i sponding reward value will be obtained.  Stop       This part of the reward as a negative reward is added to the corresponding reward in 1 Base 2 Shoulder 3 Elbow R  ,  0,i {1,2,3} (7)  Joint_Vel i the experience pool, so as to realize the feedback of the joint angular velocity state in the Stop neural network update parameters, as shown in Figure 6. This part of the reward as a negative reward is added to the corresponding reward  in the experience pool, so as to realize the feedback of the joint angular velocity state in  the neural network update parameters, as shown in Figure 6.  Actuators 2021, 10, x FOR PEER REVIEW  10  of  17  Actuators 2021, 10, 254 9 of 16 Figure 6. Schematic diagram of rewriting experience playback mechanism. Figure 6. Schematic diagram of rewriting experience playback mechanism.  The design of the reward function in this paper adds the angular velocity reward part The design of the reward function in this paper adds the angular velocity reward part  to the Step-by-Step reward function (r ) [35]. The Step-by-Step reward function StepbyStep to the Step‐by‐Step reward function ( ) [35]. The Step‐by‐Step reward function  StepbyStep mainly includes two parts: the first part is the negative value of the Euclidean distance mainly includes two parts: the first part is the negative value of the Euclidean distance  between the end of the manipulator and the target. The second part is the reward obtained between the end of the manipulator and the target. The second part is the reward obtained  by comparing the distance closed to the target position between the current position and by comparing the distance closed to the target position between the current position and  the last position of the manipulator end during the movement. Therefore, the reward the  last  position  of  the  manipulator  end  during  the  movement.  Therefore,  the  reward  function in this paper is shown in Equation (8): function in this paper is shown in Equation (8):  r = l r + l R (8) 4 StepbyStep Joint_Vel r  r  R   (8)  4 StepbyStep 5 Joint_Vel where l and l are two constants. where     and     are two constants.  4 5 4.5. Simulation Experimental Components In the application of the DRL algorithm, a problem that cannot be avoided is the 4.5. Simulation Experimental Components  random generation of a large number of neural network parameters. It is because of the In the application of the DRL algorithm, a problem that cannot be avoided is the ran‐ randomness of the parameters that we cannot train and learn effectively in the face of dom generation of a large number of neural network parameters. It is because of the ran‐ specific tasks, so we need to explore a more efficient, faster convergence, and more stable domness of the parameters that we cannot train and learn effectively in the face of specific  algorithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are tasks, so we need to explore a more efficient, faster convergence, and more stable algo‐ all improved and innovated on the basis of DDPG, the contrast group of this experiment rithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are  chooses the above four algorithms. In the contrast experiment, we train with the same all improved and innovated on the basis of DDPG, the contrast group of this experiment  task and acquire the ability to solve the target task through learning and training. In the chooses the above four algorithms. In the contrast experiment, we train with the same task  experiment, we specify two kinds of evaluation indexes. The first index is to calculate and acquire the ability to solve the target task through learning and training. In the exper‐ all the reward scores in an episode and divide them by the total number of steps in the iment, we specify two kinds of evaluation indexes. The first index is to calculate all the  episode to get the average score. The second index is to record the absolute value of the reward scores in an episode and divide them by the total number of steps in the episode  angular velocity of base, shoulder, and elbow at the end of an episode. Additionally, in the to get the average score. The second index is to record the absolute value of the angular  same group of experiments, in order to ensure a fair comparison of the practical application velocity of base, shoulder, and elbow at the end of an episode. Additionally, in the same  ability and model convergence ability of the four algorithms, we use the same initialization group of experiments, in order to ensure a fair comparison of the practical application  model as the test algorithm model. ability and model convergence ability of the four algorithms, we use the same initializa‐ tion model as the test algorithm model.  5. Discussion Through the comparison of DDPG, TD3, RTD3, and ATRTD3 in Figure 7a, we can 5. Discussion  clearly see the improvement effect of ATRTD3 learning ability. Therefore, we need to further Through analyze  the comparison the evaluation  of DDPG, index of TD3, the average RTD3, and scor ATRTD3 e when the  in final Figure distance  7a, weerr  can or  is clearly roughly  see the the improveme same. Fromnt Figur  effect e 7of a, ATR we can TD3 see  learning that the abi average lity. Therefore score of, we ATR nTD3 eed to is fu higher r‐ ther analyze the evaluation index of the average score when the final distance error is  than other algorithms when the final distance error is the same in purple areas A, B, and C, roughly which the indicates  same. From that the  Figure learning  7a, we ef fect can see of A tha TRt TD3  the av is better erage. sco The rer of eason  ATRTD is that 3 ispart  high oferthe   tharnewar  other d al function gorithm iss the when negative  the final rewar  dista dnintr ce eoduced rror is the by sa the me absolute  in purple value  areas of A, the B,angular  and  velocity of each joint after each termination of an episode. Through purple areas A, B, and C, which indicates that the learning effect of ATRTD3 is better. The reason is that part of  C of Figure 7b, ATRTD3 can better guide the multi-DOF manipulator to move to the target the reward function is the negative reward introduced by the absolute value of the angular  position and stop better, so ATRTD3 is more efficient and stable than other algorithms. velocity of each joint after each termination of an episode. Through purple areas A, B, and  Actuators 2021, 10, x FOR PEER REVIEW  11  of  17  C of Figure 7b, ATRTD3 can better guide the multi‐DOF manipulator to move to the target  position and stop better, so ATRTD3 is more efficient and stable than other algorithms.  Secondly,  through  Figure  7,  we  can  draw  the  conclusion  that  ATRTD3  shows  stronger stability than other algorithms in the late training period. As can be seen in Figure  8, although DDPG has reached the score level close to that of ATRT3 through the later  training, we can clearly see from the average score curve and the final error distance curve  Actuators 2021, 10, 254 10 of 16 that ATRTD3 has better stability in the later training stage compared with DDPG, and the  two curves of ATRTD3 are straighter, there are many spikes in the two curves of DDPG.  20 15 1 10 RTD3 ATRTD3 0 0 TD3 -5 -5 DDPG -10 -10 -15 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode -5 -5 -10 -10 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode (a)  RTD3 0.8 0.8 ATRTD3 0.6 0.6 TD3 DDPG 0.4 0.4 0.2 0.2 0.0 0.0 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode (b)  Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the perfor- Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the perfor‐ mance of four algorithms (RTD3, ATRTD3, TD3 and DDPG). There are Group1, Group2, Group3, and Group4 in the average mance of four algorithms (RTD3, ATRTD3, TD3 and DDPG). There are Group1, Group2, Group3, and Group4 in the  score (Avg_score) in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special attention. average score (Avg_score) in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special  attention.  Secondly, through Figure 7, we can draw the conclusion that ATRTD3 shows stronger stability than other algorithms in the late training period. As can be seen in Figure 8, although DDPG has reached the score level close to that of ATRT3 through the later training, we can clearly see from the average score curve and the final error distance curve that ATRTD3 has better stability in the later training stage compared with DDPG, and the two curves of ATRTD3 are straighter, there are many spikes in the two curves of DDPG. From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89% compared with the other three algorithms. In terms of stability, ATRTD3 performs better, with an improvement of at least 89.27% compared with the other three algorithms. Avg_Score Final_Dis (m) Avg_Score Final_Dis (m) Avg_Score Avg_Score Final_Dis (m) Final_Dis (m) Actuators 2021, 10, x FOR PEER REVIEW  12  of  17  Group1 Group1 1.0 100 Group2 Group2 89.84 Group3 0.858 Group3 Actuators 2021, 10, x FOR PEER REVIEW  12  of  17  Actuators 2021, 10, 254 11 of 16 Group4 Group4 0.8 0.755 59.62 0.6 0.531 52.06 47.27 Group1 Group1 1.0 Group2 Group2 0.4 89.84 Group3 0.858 Group3 Group4 Group4 0.8 0.755 0.2 59.62 0.053 60 0.6 0.531 52.06 0.0 47.27 RTD3 ATRTD3 TD3 DDPG RTD3 ATRTD3 TD3 DDPG 0.4 Algorithm Algorithm 0.2 (a)  (b)  0.053 0.0 Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3,  RTD3 ATRTD3 TD3 DDPG RTD3 ATRTD3 TD3 DDPG TD3, and DDPG) are compared in a stacked way. From left to right, (a) represents the final error distance; (b) represents  Algorithm Algorithm   the average score.  (a)  (b)  From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89%  Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3, TD3, Figure 8. The results of four experiments (Group1, Group2, Group3, and Group4) of four algorithms (RTD3, ATRTD3,  compared with the other three algorithms. In terms of stability, ATRTD3 performs better,  and TD3, DDPG)  and DDPG) are com  are par  com edpin ared a stacked  in a staway cked . way. From From left to left right,  to right, (a) repr  (a)esents  represents the final  the final error error distance;  distan (bce; ) r epr (b) esents represents the   the average score.  with an improvement of at least 89.27% compared with the other three algorithms.  average score. From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89%  Avg_Score compared with the other three algorithms. In terms of stability, ATRTD3 performs better,  Final_Dis 100% with an improvement of at least 89.27% compared with the other three algorithms.  Avg_Score 80% Final_Dis 100% 60% 80% 40% 60% 20% 40% 20 0% % DDPG RTD3 TD3 0% Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the  Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the DDPG RTD3 TD3 ATRTD3 algorithm improves the average score and final error distance performance of the model.  ATRTD3 algorithm improves the average score and final error distance performance of the model. Figure 9. Combined with four groups of experiments (Group1, Group2, Group3, and Group4), the  To further demonstrate the advantages of the ATRTD3 from the underlying control ATRTD3 algorithm improves the average score and final error distance performance of the model.  To further demonstrate the advantages of the ATRTD3 from the underlying control  variables, we will collect the end angular speeds of the three joints during model training variables, we will collect the end angular speeds of the three joints during model training  as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three To further demonstrate the advantages of the ATRTD3 from the underlying control  as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three  joints of the high score model (Group1). In Figure 10b, we show the final angular velocity variables, we will collect the end angular speeds of the three joints during model training  joints of the high score model (Group1). In Figure 10b, we show the final angular velocity  of the three joints of the arm controlled by a low score model (Group2). By placing a local as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three  of the three joints of the arm controlled by a low score model (Group2). By placing a local  magnification in each image, we can clearly see the final angular velocity of the three joints joints of the high score model (Group1). In Figure 10b, we show the final angular velocity  at magni the end ficat ofion the in training  each ima process, ge, we wher  can e the clear curve ly see repr the esented  final an bygu thela rr ed velocity color is of the the speed  three joints  of the three joints of the arm controlled by a low score model (Group2). By placing a local  of each joint under the guidance of the ATRTD3 throughout the training process. ATRTD3 at the end of the training process, where the curve represented by the red color is the  magnification in each image, we can clearly see the final angular velocity of the three joints  has obvious advantages over the other three algorithms. speed of each joint under the guidance of the ATRTD3 throughout the training process.  at the end of the training process, where the curve represented by the red color is the  ATRTD3 speed of ea has ch  joint obvious  unde adv r thaent guid ages ance  over  of  th thee  AT other RT D3 three  throughout  algorithms.  the  training process.  ATRTD3 has obvious advantages over the other three algorithms.  Final_Dis(m) Final_Dis(m) Performance Improvem Performance Improvem ent ent 49.89% 49.89% 93.36% 93.36% 71.67% Avg_Score 71.67% Avg_Score 92.45% 92.45% 89.06% 89.06% 89.27% 89.27% Actuators 2021, 10, x FOR PEER REVIEW  13  of  17  Actuators 2021, 10, 254 12 of 16 RTD3 ATRTD3 0.12 TD3 0.08 0.05 DDPG 0.04 0.00 0.00 -0.04 2 -0.05 40k 50k 60k Episode 0.05 -0.10 1 40k 50k 60k 0.00 Episode -1 -0.05 -0.10 -0.15 -1 40k 50k 60k -1 Episode -2 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode Episode (a)  4 1 4 0.15 0.10 3 0 0.05 0.06 0.00 0.04 0.02 -0.05 0.00 2 -1 -0.10 40k 50k 60k -0.02 0.05 Episode -0.04 40k 50k 60k 0.00 1 -2 Episode -0.05 -0.10 RTD3 -0.15 0 -3 ATRTD3 -0.20 -2 40k 50k 60k TD3 Episode DDPG -1 -4 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k 0 10k 20k 30k 40k 50k 60k Episode Episode Episode (b)  Figure Figure 10. 10. ((a a)) The The final final angular angular velo velocity city of of th the e joints joints in in the the hi high gh score score mod model el Group1; Group1; ( (b b) ) The The final final angular angular velo velocity city of of the the  joints in the low score model Group2.  joints in the low score model Group2. The final joint angular velocity of the manipulator based on ATRTD3 is always around The  final  joint  angular  velocity  of  the  manipulator  based  on  ATRTD3  is  always  0 (rad/s), with the minimum fluctuation and the best stability. In the high score model around 0 (rad/s), with the minimum fluctuation and the best stability. In the high score  (Group1), only DDPG can achieve a score similar to ATRTD3 in the training score, so model (Group1), only DDPG can achieve a score similar to ATRTD3 in the training score,  only the joint angular velocity of DDPG and ATRTD3 is compared. It can be seen from so only the joint angular velocity of DDPG and ATRTD3 is compared. It can be seen from  Table 2 that DDPG has an order of magnitude disadvantage in the comparison of the Table 2 that DDPG has an order of magnitude disadvantage in the comparison of the an‐ angular velocity of the other two joints except that the final angular velocity of the base gular velocity of the other two joints except that the final angular velocity of the base joint  joint is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is not  not difficult to find that ATRTD3 has absolute advantages in stability, which is generally difficult to find that ATRTD3 has absolute advantages in stability, which is generally one  one order of magnitude higher. Because the smaller the variance is, the more stable the order of magnitude higher. Because the smaller the variance is, the more stable the train‐ training is. In the low model (Group2), the advantages of ATRTD3 cannot be seen in Table 3. ing is. In the low model (Group2), the advantages of ATRTD3 cannot be seen in Table 3.  After accumulating and comparing the angular velocity and angular velocity variance of After accumulating and comparing the angular velocity and angular velocity variance of  the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other  three algorithms. three algorithms.     Baes(rad/s) Baes(rad/s) Baes(rad/s) Baes(rad/s) Shoulder(rad/s) Shoulder(rad/s) Shoulder(rad/s) Shoulder(rad/s) Elbow(rad/s) Elbow(rad/s) Elbow(rad/s) Elbow(rad/s) Actuators 2021, 10, x FOR PEER REVIEW  14  of  17  Table 2. The mean and variance of joints angular velocity in local enlarged images of high score model (Group1).  Joint  Base  Shoulder  Elbow  Angular Veloc‐ Average  Average  Average  Var  Var  Var  ity  (rad/s)  (rad/s)  (rad/s)  Actuators 2021, 10, 254 13 of 16 −3 −5 −3 −5 −4 −5 ATRTD3 −2.50 × 10   2.80 × 10   2.51 × 10   3.09 × 10 −3.90 × 10   2.83 × 10   −3 −4 −2 −3 −3 −4 RTD3 −4.65 × 10   4.58 × 10 −3.13 × 10   1.07 × 10   3.35 × 10   6.81 × 10   −2 −4 −2 −3 −3 −4 Table 2. The mean and variance of joints angular velocity in local enlarged images of high score model (Group1). TD3  3.13 × 10   7.26 × 10 −2.79 × 10   1.32 × 10 −6.69 × 10   4.73 × 10   −3 −5 −2 −5 −2 −5 Joint Base Shoulder Elbow DDPG  2.66 × 10   4.60 × 10   2.52 × 10   6.41 × 10 −2.01 × 10   3.95 × 10   Angular Average Average Average Var Var Var Velocity (rad/s) (rad/s) (rad/s) 3 5 3 5 4 5 Table 3. The mean and ATR TD3 variance2.50  of joints 10  angular 2.80  10 veloci2.51 ty in  10 local enlarged 3.09  10  imag 3.90 es of 10 low score 2.83  10 model (Group2).  3 4 2 3 3 4 RTD3 4.65  10 4.58  10 3.13  10 1.07  10 3.35  10 6.81  10 2 4 2 3 3 4 TD3 3.13  10 7.26  10 2.79  10 1.32  10 6.69  10 4.73  10 Joint  Base  3 5 Shoulder 2   5 2 5 Elbow  DDPG 2.66  10 4.60  10 2.52  10 6.41  10 2.01  10 3.95  10 Angular Veloc‐ Average  Average  Average  Table 3. The meanVar and variance   of joints angular velocity in local enlarged images Var of low score model (Group2). Var  ity  (rad/s)  (rad/s)  (rad/s)  Joint Base Shoulder Elbow −2 −5 −3 −4 −2 −5 ATRTD3 −1.71 × 10   4.17 × 10   8.04 × 10   1.08 × 10 −1.23 × 10   6.18 × 10   Angular Average Average Average Var Var Var Velocity (rad/s) (rad/s) (rad/s) −2 −4 −3 −4 −2 −4 RTD3  3.03 × 10   1.14 × 10 −2.50 × 10   5.87 × 10 −1.73 × 10   4.48 × 10   2 5 3 4 2 5 ATRTD3 1.71  10 4.17  10 8.04  10 1.08  10 1.23  10 6.18  10 −2 2 −5 4 −23 4 −5 2 4−2 −5 RTD3 3.03  10 1.14  10 2.50  10 5.87  10 1.73  10 4.48  10 TD3  1.39 × 10   3.40 × 10   3.67 × 10   9.69 × 10 −6.58 × 10   6.95 × 10   2 5 2 5 2 5 TD3 1.39  10 3.40  10 3.67  10 9.69  10 6.58  10 6.95  10 −2 −5 −2 −5 −3 −4 2 5 2 5 3 4 DDPG  3.20 × 10   5.38 × 10 −1.40 × 10   7.07 × 10   8.21 × 10   1.27 × 10   DDPG 3.20  10 5.38  10 1.40  10 7.07  10 8.21  10 1.27  10 Skill Level Improvement Stability Improvement 100% 80% 60% 40% 20% 0% Group1 Group2 Group1 Group2 Group1 Group2 DDPG RTD3 TD3 Figure 11. In the high score model (Group1) and low score model (Group2), the ATRTD3 algorithm compared with the Figure 11. In the high score model (Group1) and low score model (Group2), the ATRTD3 algo‐ other three algorithms in the control of joint angular velocity skill level and late training stability are two aspects of improvement effect. rithm compared with the other three algorithms in the control of joint angular velocity skill level  and late training stability are two aspects of improvement effect.  Through Figure 11, we can see that ATRTD3 can significantly improve the skill level  and stability compared with the other three algorithms. ATRTD3 generally improves the  skill level by more than 25.27% and improves stability by more than 15.90%. Compared  with TD3, the stability improvement of ATRTD3 in Group 2 is −5.54%. The main reason is  that TD3 falls into a more stable local optimization, so the performance of ATRTD3 cannot  be questioned under this index.  In the Actor neural network, the Tetanic stimulation mechanism changes the param‐ eters of some neurons compulsively and improves the exploration space of action in a  certain  range.  Through  Tetanic  stimulation,  the  weight  of  selected  neurons  can  be  in‐ creased again, which helps to improve the convergence speed of the neural network and  speed up the training speed. The Amnesia mechanism makes the parameters of neurons  changed compulsively, which is in line with the situation of biological cell decay.  Performance Improvement 88.74 41.71 30.94 15.90 86.25 96.05 25.27 81.59 91.80 96.54 67.84 -5.540 Actuators 2021, 10, 254 14 of 16 Through Figure 11, we can see that ATRTD3 can significantly improve the skill level and stability compared with the other three algorithms. ATRTD3 generally improves the skill level by more than 25.27% and improves stability by more than 15.90%. Compared with TD3, the stability improvement of ATRTD3 in Group 2 is 5.54%. The main reason is that TD3 falls into a more stable local optimization, so the performance of ATRTD3 cannot be questioned under this index. In the Actor neural network, the Tetanic stimulation mechanism changes the param- eters of some neurons compulsively and improves the exploration space of action in a certain range. Through Tetanic stimulation, the weight of selected neurons can be increased again, which helps to improve the convergence speed of the neural network and speed up the training speed. The Amnesia mechanism makes the parameters of neurons changed compulsively, which is in line with the situation of biological cell decay. 6. Conclusions In this paper, we propose an algorithm named ATRTD3, for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and Amnesia mechanism. The main contribution of this paper is that we show a biomimetic view to speed up the converging process by biochemical reactions generated by neurons in the biological brain during memory and forgetting. The integration of the two mechanisms is of great significance to expand the scope of exploration, jump out of the local optimal solution, speed up the learning process and enhance the stability of the algorithm. The effectiveness of the proposed algorithm is validated by a simulation example including the comparisons with previously developed DRL algorithms. The results indicate that our approach shows performance improvement in terms of convergence speed and precision in the multi-DOF manipulator. The ATRTD3 we proposed successfully applies the research results of neuroscience to the DRL algorithm to enhance its performance, and uses computer technology to carry out biological heuristic design through code to approximately realize some functions of the biological brain. In the future, we will further learn from neuroscience achievements, improve the learning efficiency of DRL, and use DRL to achieve reliable and accurate manipulator control. Furthermore, we will continue to challenge the scheme of controlling six joints to realize the cooperative control of position and orientation. Author Contributions: Conceptualization, H.H. and Y.H.; methodology, Y.H.; software, Y.H.; valida- tion, Y.H., D.X. and Z.Z.; formal analysis, Y.H.; investigation, Y.H.; resources, H.H.; data curation, Y.C.; writing—original draft preparation, Y.H.; writing—review and editing, Y.C. and Z.L.; visualization, Y.H.; supervision, H.H. All authors have read and agreed to the published version of the manuscript. Funding: This research received no external funding. Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: The authors exclude this statement. Conflicts of Interest: The authors declare no conflict of interest. Actuators 2021, 10, 254 15 of 16 Appendix A Pseudo code of Algorithm A1. Algorithm A1. ATRTD3 1: Initialize critic networks Q , Q ,and actor network p with random parameters q , q , f q q f 1 2 1 2 2: Initialize target networks Q 0 , Q 0 , p 0 q q f 1 2 0 0 0 3: Target network node assignment q ! q ; q ! q ; f ! f 1 2 4: Initialize replay buffer B 5: for t = 1 to T do 6: Amnesia framework 7: Select action with exploration noise a  p + #, #  N(0, s) 8: Temporary store transition tuple (s, a, s , r) in B 9: Fix transition tuple r r + R , angular velocity correction R joint_Vel Joint_Vel 0 0 10: Store transition tuple (s, a, s , r ) in B 11: If Sum < mini_batch then 12 return 13: Sample mini-batch of N transition (s, a, r, s ) from B 14: ea p 0(s ) + #, #  clip(N(0,es),c, c) 0 e 15: y = r + min Q (s , a) i=1,2 16: Statistical calculation of Q value utilization ratio w 0 , w 0 q q 1 2 17: If w < Mini_utilization then 18: Rebirth target networks q 19: End if 20: If w 0 < Mini_utilization then 21: Rebirth target networks q 22: End if 23: Update critics q argmin N (y Q (s, a)) i q q i i 24: If t mod d then 25: Update f by the deterministic policy gradient: 26: r J(f) = N r Q (s, a) r p (s) f a q f f a=p (s) 27: Tetanic stimulation Framework 28: Update target networks: 0 0 29: q tq + (1 t)q i i 0 0 30: f tf + (1 t)f 31: End if 32: End if 33: End for References 1. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection. In International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2016; pp. 173–184. 2. Zhang, M.; Mccarthy, Z.; Finn, C.; Levine, S.; Abbeel, P. Learning deep neural network policies with continuous memory states. In Proceedings of the International Conference on Robotics and Auto-mation, Stockholm, Sweden, 16 May 2016; pp. 520–527. 3. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40. 4. Lenz, I.; Knepper, R.; Saxena, A. DeepMPC:learning deep latent features for model predictive control. In Proceedings of the Robotics Scienceand Systems, Rome, Italy, 13–17 July 2015; pp. 201–209. 5. Satija, H.; Pineau, J. Simultaneous machine translation using deep reinforcement learning. In Proceedings of the Workshops of International Conference on Machine Learning, New York, NY, USA, 24 June 2016; pp. 110–119. 6. Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017, 19, 70–76. [CrossRef] 7. Caicedo, J.; Lazebnik, S. Active Object Localization with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2488–2496. 8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. 9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Figjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] Actuators 2021, 10, 254 16 of 16 10. Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. 11. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. 12. Mnih, V.; Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. 13. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. 14. Heess, N.; Dhruva, T.B.; Sriram, S.; Lemmon, J.; Silver, D. Emergence of Locomotion Behaviours in Rich Environments. arXiv 2017, arXiv:1707.02286. 15. Fujimoto, S.; Hoof, H.V.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. 16. Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy Networks for Exploration. arXiv 2017, arXiv:1706.10295. 17. Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.Y.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter Space Noise for Exploration. arXiv 2017, arXiv:1706.01905. 18. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. Adv. Neural Inf. Process. Syst. 2016, 29, 1471–1479. 19. Choshen, L.; Fox, L.; Loewenstein, Y. DORA The Explorer: Directed Outreaching Reinforcement Action-Selection. arXiv 2018, arXiv:1804.04012. 20. Badia, A.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.; et al. Never Give Up: Learning Directed Exploration Strategies. arXiv 2020, arXiv:2002.06038. 21. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. arXiv 2016, arXiv:1610.00633. 22. Hassabis, D.; Kumaran, D.; Summerfield, C.; Botvinick, M. Neuroscience-Inspired Artificial Intelligence. Neuron 2017, 95, 245–258. [CrossRef] 23. MyeongSeop, K.; DongKi, H.; JaeHan, P.; JuneSu, K. Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay. Appl. Sci. 2020, 10, 575. 24. Zhang, H.; Wang, F.; Wang, J.; Cui, B. Robot Grasping Method Optimization Using Improved Deep Deterministic Policy Gradient Algorithm of Deep Reinforcement Learning. Rev. Sci. Instrum. 2021, 92, 1–11. 25. Kwiatkowski, R.; Lipson, H. Task-agnostic self-modeling machines. Sci. Robot. 2019, 4, eaau9354. [CrossRef] 26. Iriondo, A.; Lazkano, E.; Susperregi, L.; Urain, J.; Fernandez, A.; Molina, J. Pick and Place Operations in Logistics Using a Mobile Manipulator Controlled with Deep Reinforcement Learning. Appl. Sci. 2019, 9, 348. [CrossRef] 27. Giorgio, I.; Del Vescovo, D. Energy-based trajectory tracking and vibration control for multilink highly flexible manipulators. Math. Mech. Complex Syst. 2019, 7, 159–174. [CrossRef] 28. Rubinstein, D. Dynamics of a flexible beam and a system of rigid rods, with fully inverse (one-sided) boundary conditions. Comput. Methods Appl. Mech. Eng. 1999, 175, 87–97. [CrossRef] 29. Bliss, T.V.P.; Lømo, T. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 1973, 232, 331–356. [CrossRef] 30. Hebb, D.O. The Organization of Behavior; Wiley: New York, NY, USA, 1949. 31. Thomas, M.J.; Watabe, A.M.; Moody, T.D.; Makhinson, M.; O’Dell, T.J. Postsynaptic Complex Spike Bursting Enables the Induction of LTP by Theta Frequency Synaptic Stimulation. J. Neurosci. 1998, 18, 7118–7126. [CrossRef] [PubMed] 32. Dayan, P.; Abbott, L.F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems; The MIT Press: New York, NY, USA, 2001. 33. Bliss, T.V.P.; Cooke, S.F. Long-Term Potentiation and Long-Term Depression: A Clinical Perspective. Clinics 2011, 66 (Suppl. 1), 3–17. [CrossRef] 34. Hou, Y.Y.; Hong, H.J.; Sun, Z.M.; Xu, D.S.; Zeng, Z. The Control Method of Twin Delayed Deep Deterministic Policy Gradient with Rebirth Mechanism to Multi-DOF Manipulator. Electronics 2021, 10, 870. [CrossRef] 35. Denavit, J.; Hartenberg, R.S. A Kinematic Notation for Lower-Pair Mechanisms. J. Appl. Mech. 1955, 77, 215–221. [CrossRef]

Journal

ActuatorsMultidisciplinary Digital Publishing Institute

Published: Sep 29, 2021

Keywords: multi-DOF manipulator; tetanic stimulation; amnesia mechanism; deep reinforcement learning

There are no references for this article.