Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning
Xing, Yuan;Young, Riley;Nguyen, Giaolong;Lefebvre, Maxwell;Zhao, Tianchi;Pan, Haowen;Dong, Liang
2022-03-04 00:00:00
Hindawi Wireless Power Transfer, Volume 2022, Article ID 9921885, 10 pages. https://doi.org/10.1155/2022/9921885

Research Article

Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning

Yuan Xing (1), Riley Young (1), Giaolong Nguyen (1), Maxwell Lefebvre (1), Tianchi Zhao (2), Haowen Pan (3), and Liang Dong (4)

(1) Department of Engineering and Technology, University of Wisconsin-Stout, Menomonie, WI 54751, USA
(2) Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA
(3) Changzhou Voyage Electronics Technology LLC, Changzhou, China
(4) Department of Electrical and Computer Engineering, Baylor University, Waco, TX 76706, USA

Correspondence should be addressed to Yuan Xing; xingy@uwstout.edu

Received 26 October 2021; Accepted 31 January 2022; Published 4 March 2022

Academic Editor: Narushan Pillay

Copyright © 2022 Yuan Xing et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: This paper aims to solve the optimization problem in far-field wireless power transfer systems using deep reinforcement learning. The radio-frequency (RF) wireless transmitter is mounted on a mobile robot, which patrols near harvested-energy-enabled Internet of Things (IoT) devices. The transmitter must cruise continuously along a designated path in order to charge all of the stationary IoT devices fairly and in the shortest time. The Deep Q-Network (DQN) algorithm is applied to determine the optimal path for the robot to cruise on. When the number of IoT devices increases, however, the traditional DQN cannot converge to a closed-loop path or achieve the maximum reward. To solve these problems, an area division Deep Q-Network (AD-DQN) is proposed. The algorithm intelligently divides the complete charging field into several areas. In each area, DQN is used to calculate the optimal path; the segmented paths are then combined into a closed-loop path on which the robot cruises, so that it can continuously charge all the IoT devices in the shortest time. Numerical results demonstrate the superiority of AD-DQN in optimizing the proposed wireless power transfer system.

1. Introduction

The wireless power transfer technique is regarded as the most effective solution to the charging problem as the number of IoT devices grows drastically, since it is impossible to replace the batteries of all IoT devices [1]. In recent years' Consumer Electronics Show (CES), a large number of wireless power transfer products have come into consumers' sight. There are two types of wireless power transmission products: near-field and far-field. In near-field wireless power transfer, the IoT devices, which are charged by resonant inductive coupling, have to be placed very close to the wireless transmitters (less than 5 cm) [2]. In far-field wireless power transfer, the IoT devices use the electromagnetic waves from transmitters as the power source, and the effective charging distance ranges from 50 centimeters to 1.5 meters [3-5]. Compared to near-field transmitters, far-field wireless power transmitters can charge IoT devices (including mobile IoT devices) deployed over a larger space.

However, far-field wireless power transfer is still in its infancy for two reasons. First, the level of power supply is very low due to the long distance between the power transmitters and the energy harvesters. In [6], the authors noted that existing far-field RF energy harvesting technologies can only achieve nanowatt-level power transfer, which is too little to power up high-power-consuming electronic devices.
In [3], the authors investigated RF beamforming in the radiative far field for wireless power transfer and demonstrated that beamforming can boost the level of energy harvesting. However, when the distance between the transceivers increases to 1.5 meters, the harvested energy falls below 5 milliwatts, which is still not enough to power high-energy-consuming devices. Second, most existing wireless charging systems can only effectively charge stationary energy harvesters. In [7], a set of wireless chargers (Powercast [8]) is deployed over a square area; the Powercast transmitters can adjust their transmission strategies to optimize the energy harvested at the stationary harvesters. In [9], the Powercast wireless charger is mounted on a moving robot, making it a mobile charger that can adjust its transmission pattern toward the stationary sensors while moving. However, the number of IoT devices that can be charged is too small. To wirelessly charge multiple IoT devices, some researchers proposed using an Unmanned Aerial Vehicle (UAV) to implement wireless power transfer [10-13], where the UAV plans the optimal path to charge the designated IoT devices. Yet UAV-based charging is inefficient, since a UAV has very high power consumption and a very short operational time, and installing a wireless power emitter on the UAV shortens its operational time even further.

In order to raise the level of energy harvesting and the efficiency of charging a large number of energy-hungry IoT devices, in this paper we assemble a wireless power transfer robot and apply a deep reinforcement learning algorithm to optimize its performance. In the system, the goal is to find the optimal path for the wireless power transfer robot: the robot cruises on this path and charges each IoT device in the shortest time. DQN has been widely used to play complicated games that have a large number of system states, even when the environment information is not entirely available [14]. Lately, many researchers have applied DQN to complicated wireless communication optimization problems, because such systems are complex and their environment information is time-varying and hard to capture [15-18]. In particular, deep reinforcement learning has been used to plan optimal paths for self-driving robots [19-22], which converge quickly to the optimal path. Hence, DQN is a natural match for our proposed optimization problem. However, those works either only proposed theoretical models or could not implement wireless power transfer functions. To the best of our knowledge, we are the first to implement an automatic far-field wireless power transfer system in a test field and to develop a DQN-based algorithm to solve it.

In our system, the entire test field is evenly quantized into square cells, and time is slotted with a constant interval. We treat the relative location of the robot in the test field as the system state and the direction to move in the next time slot as the action. At the beginning of each time slot, the wireless power transfer robot generates the system state and feeds it to the DQN. The DQN outputs a Q value for each possible action, and the action with the maximum Q value guides the robot's move during the current time slot. As the number of IoT devices increases and the test field becomes more complicated, the traditional DQN cannot generate a closed-loop path for the robot to cruise on, which does not satisfy the requirement of charging every device at a regular time interval. To deal with this problem, area division deep reinforcement learning is proposed in this paper. First, the algorithm divides the whole test field into several areas. In each area, DQN is used to calculate the optimal path. Next, the entire path is formed from the paths of the separate areas. In this way, a closed loop is guaranteed, and the numerical results show that the calculated path is also the optimal path.

2. System Model

The symbols used in this paper and the corresponding explanations are listed in Table 1.

Table 1: Symbols and explanations.
K: number of energy harvesters
η: rectifier efficiency
G_tx: gain of the transmitter's antenna
G_rx: gain of the receiver's antenna
λ: wavelength of the transmitted signal
l: polarization loss
μ: adjustable parameter in Friis's free-space equation
L: distance between the transmitter and the harvester
p_tx: transmit power
p_k: received power
α: angle between the transmitter and the vertical reference line
S_tx: maximum effective transmit area
S_rx: effective receive area
n: time instant
pos(h, v): position h and v units from the left and upper edges
o_k: position of the kth energy harvester
eff_k: effective charging area of the kth IoT device
s: present system state
s′: next system state
a^n: action taken at time n
T: total time consumption
p_{s,s′}(a): transition probability from state s to state s′ under action a
w(s, a, s′): reward for taking action a in state s
acc_{k−1}: indicator of whether the first k − 1 harvesters have been charged
ζ: unit price in the reward function
π*: optimal strategy
Q(s, a): cost function for taking action a in state s
γ: reward decay
σ(s′, a): learning rate of Q-learning
p_i: selected location for the ith area
W_i: the ith area

As shown in Figure 1, a mobile robot carrying two RF wireless power transmitters cruises on the calculated path to radiate RF power to K nearby RF energy harvesters. Both the power transmitter and the RF power harvesters are equipped with one antenna. The power received at receiver k, k ∈ K = {1, 2, . . . , K}, is

p_k = [η G_tx G_rx (λ/4π)^2 / (l (L + μ)^2)] p_tx,   (1)

where p_tx is the transmit power; G_tx is the gain of the transmitter's antenna; G_rx is the gain of the receiver's antenna; L is the distance between the transmitter and harvester k; η is the rectifier efficiency; λ denotes the wavelength of the transmitted signal; l denotes the polarization loss; and μ is the adjustable parameter due to Friis's free-space equation.

Figure 1: Mobile wireless power transmitter cruises on the calculated path to charge multiple harvested-energy-enabled IoT devices.

Since the effective charging area is critical in determining the level of energy harvesting and it is the parameter adjusted at the transmitter, equation (1) is re-expressed using the effective area:

p_k = [η S_tx S_rx cos α / (l λ^2 (L + μ)^2)] p_tx,   (2)

where S_tx is the maximum effective transmit area, S_rx is the effective receive area, and α is the angle between the transmitter and the vertical reference line.

Since mobile energy harvesters are considered in the system, the distance and the effective charging area may vary over time. We assume that time is slotted and that the position of any mobile device is constant within one time slot. In time slot n, the power harvested at receiver k can be written as

p_k(n) = [η S_tx S_rx cos α(n) / (l λ^2 (L(n) + μ)^2)] p_tx.   (3)

For a mobile energy harvester, the power harvested in different time slots is determined by the angle between the transmitter and the vertical reference line, α(n), together with the distance between the transmitter and the harvester, L(n), in that time slot. In our model, the mobile transmitter is free to adjust the transmit angle α(n) and the distance L(n) as it moves around the IoT devices. We assume that effective charging is counted only when α(n) = 0 and L(n) ≤ 45 cm.
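For concreteness, the received-power model in equations (2) and (3) and the effective-charging test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the numerical parameter values below are placeholders, not measurements from the paper.

```python
import math

def harvested_power(p_tx, eta, s_tx, s_rx, alpha, L, l_loss, wavelength, mu):
    """Received power p_k per equations (2)/(3); lengths in meters, angle in radians."""
    return p_tx * eta * s_tx * s_rx * math.cos(alpha) / (l_loss * wavelength**2 * (L + mu)**2)

def effective_charging(alpha, L, max_range=0.45):
    """Charging counts only when the transmitter faces the harvester (alpha = 0)
    and is within 45 cm, as assumed in the system model."""
    return math.isclose(alpha, 0.0, abs_tol=1e-6) and L <= max_range

# Placeholder parameter values, for illustration only.
p = harvested_power(p_tx=3.0, eta=0.6, s_tx=0.01, s_rx=0.005,
                    alpha=0.0, L=0.4, l_loss=1.5, wavelength=0.328, mu=0.1)
print(f"harvested power: {p * 1e3:.3f} mW, effective: {effective_charging(0.0, 0.4)}")
```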
3. Problem Formulation

In this paper, the optimization problem is formulated as a Markov Decision Process (MDP), and a reinforcement learning (RL) algorithm is utilized to solve it. Furthermore, the DQN algorithm is applied to handle the large number of system states.

3.1. Problem Formulation. In order to model our optimization problem as an RL problem, we define the test field as a grid of identical unit squares with a side length of 30 cm. K = 8 harvested-energy-enabled IoT devices are deployed in the test field, numbered 0, 1, 2, 3, 4, 5, 6, and 7. The map is shown in Figure 2. The system state s^n at time slot n is defined as the position of the square in which the robot is currently located, specified as s = pos(h, v), where h is the distance between the present square and the leftmost edge and v is the distance between the present square and the upmost edge, both counted in squares. For example, the No. 5 IoT device is located at o_5 = pos(2, 0). The shaded area adjacent to the No. k IoT device indicates its effective charging area, denoted eff_k; for example, the boundary of the effective charging area of the No. 6 IoT device is highlighted in red. We define the direction of movement in a particular time slot n as the action a^n. The set of possible actions A consists of 4 elements, A = {U, D, L, R}, where U is moving upward one unit, D is moving downward one unit, L is moving left one unit, and R is moving right one unit.

Figure 2: The entire test field consists of identical unit squares. K = 8 harvested-energy-enabled IoT devices are deployed in the test field. The shaded area adjacent to each IoT device indicates its effective charging area. For example, the boundary of the effective charging area of the No. 6 IoT device is highlighted in red.

Given the above, the mobile wireless charging problem can be formulated as minimizing the time duration T for the robot to complete one loop while passing through at least one square of the effective area of each IoT device:

P:  minimize_{a^n}  T
    subject to  s^0 = s^T,
                ∃ s^n ∈ eff_k, ∀k ∈ K, n = 1, 2, . . . , T.   (4)

The time duration for the robot to complete one loop is defined as T. The starting position is the same as the final position, since the robot cruises in a loop, and within the loop the robot has to pass through at least one of the effective charging areas of each IoT device. At each time slot, the agent chooses an action adapted to its current position.

Henceforth, we can model the proposed system as a Markov chain. The current position specifies a particular state s, and S denotes the system state set. The starting state s^0 and the final state s^T are the same, since the robot needs to return to its starting point. The MDP proceeds as follows: the agent chooses an action a from A at a specific system state s; after that, the system transits into a new state s′. p_{s,s′}(a), with s, s′ ∈ S and a ∈ A, denotes the probability that the system state transits from s to s′ under action a.
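The grid-world state and action space of Section 3.1 can be sketched as follows. This is an illustrative sketch only; the grid dimensions and the clipping behavior at the boundary are assumptions, not the layout of the actual test field in Figure 2.

```python
# Minimal sketch of the grid MDP: states are pos(h, v), actions are A = {U, D, L, R}.
GRID_W, GRID_H = 6, 5          # assumed field size in 30 cm squares (illustrative)
ACTIONS = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(state, action):
    """Deterministic transition: move one unit, clipped to the field boundary."""
    h, v = state
    dh, dv = ACTIONS[action]
    return (min(max(h + dh, 0), GRID_W - 1), min(max(v + dv, 0), GRID_H - 1))

# Example: one move to the right from pos(2, 0), the location of device No. 5.
print(step((2, 0), "R"))       # -> (3, 0)
```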
The reward of the MDP is denoted w(s, a, s′) and is defined for the transition of the system state from s to s′. Because the optimization problem is to reach s^T in the fewest time slots, the reward has to be defined so that the mobile robot does not repeatedly pass through the effective charging area of any IoT device. Moreover, the rewards at different positions are interconnected, since the goal of the optimization is to pass through the effective charging areas of all the IoT devices. We assume that the optimal order in which to pass the IoT devices is o_0, o_1, . . . , o_7, with o_k = 0, 1, . . . , 7. Specifically, the reward function is

w(s, a, s′) = o_k ζ_{o_k},  if s′ ∈ eff_{o_k} and acc_{k−1} = 1;
w(s, a, s′) = −1,  otherwise.   (5)

In the above equation, acc_{k−1} = 1 if the robot has already passed through an effective area of the o_{k−1}th IoT device, and ζ denotes the unit price of the harvested energy.

As we have defined all the necessary elements of the MDP, we can characterize the formulated problem as a stochastic shortest-path search that starts at s^0 and ends at s^T. At each system state s, we derive the best action a(s), which generates the maximum reward. The optimal policy set is defined as π = {a(s): s ∈ S}.
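A direct way to read the reward in equation (5) is as a small function of the next state and of how many devices have been visited so far. The sketch below is an illustration under stated assumptions: the effective-area coordinates and the unit price ζ are placeholders, and the visiting order o_0, . . . , o_7 is taken as given.

```python
# Sketch of the reward function in equation (5).
# EFFECTIVE_AREAS[k] lists grid squares of eff_{o_k}; coordinates and ZETA are placeholders.
EFFECTIVE_AREAS = {0: {(0, 1)}, 1: {(1, 3)}, 2: {(3, 4)}, 3: {(5, 3)},
                   4: {(5, 1)}, 5: {(2, 1)}, 6: {(4, 1)}, 7: {(3, 2)}}
ZETA = 10.0

def reward(next_state, num_charged):
    """o_k * zeta if next_state enters eff_{o_k} while the first k-1 devices are
    already charged (acc_{k-1} = 1); otherwise -1, which penalizes wandering."""
    k = num_charged
    if k < len(EFFECTIVE_AREAS) and next_state in EFFECTIVE_AREAS[k]:
        return k * ZETA
    return -1.0

print(reward((1, 3), 1))   # next device o_1 reached in order -> 10.0
print(reward((1, 3), 0))   # same square reached out of order -> -1.0
```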
3.2. Optimal Path Planning with Reinforcement Learning. If the system dynamics obey a specific transition probability, reinforcement learning is a natural fit for solving the optimization problem. In this section, Q-learning [23] is first introduced to solve the proposed problem. After that, to address the large state and action sets, the DQN algorithm [14] is utilized to determine the optimal action for each particular system state.

3.2.1. Q-Learning Method. The traditional Q-learning method is widely used to solve dynamic optimization problems provided that the number of system states is moderate. For each particular system state, the best action can be determined to generate the highest reward. Q(s, a) denotes the cost function, which uses a numerical value to describe the cost of taking action a at state s. At the beginning of the algorithm, every cost function is zero, Q(s, a) = 0, since no action has yet been taken to generate any consequence. All the Q values are saved in the Q table. Only one cost function is updated in each time slot, as the action is taken and the reward is calculated. The cost function is updated as

Q(s, a) = (1 − σ(s, a)) Q(s, a) + σ(s, a) [w(s, a, s′) + γ f(s′, a′)],   (6)

where

f(s′, a′) = max_{a∈A} Q(s′, a).   (7)

The learning rate is defined as σ(s′, a), and γ is the reward decay.

When the algorithm initializes, the Q table is empty, since no exploration has been made to obtain any useful cost function to fill it. Because the agent has no experience of the environment, random action selection is used at the beginning of the algorithm. A threshold ϵ ∈ [0.5, 1] is designed to control exploration. In each time slot, a value p ∈ [0, 1] is generated and compared with the threshold. If p ≥ ϵ, action a is picked as

a = arg max_{a∈A} Q(s, a).   (8)

However, if p < ϵ, an action is randomly selected from the action set A.

After the values in the Q table are updated iteratively, the Q values converge. We can then obtain the best action for each state by

π*(s) = arg max_{a∈A} Q*(s, a),   (9)

which corresponds to finding the optimal moving direction for each system state explored during the charging process.
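The update in equations (6)-(8) can be written as a short tabular routine. The sketch below keeps the learning rate and exploration threshold constant for simplicity; these values, and the dictionary-based Q table, are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import defaultdict

GAMMA = 0.9        # reward decay
SIGMA = 0.1        # learning rate sigma(s', a), kept constant here
EPSILON = 0.8      # exploration threshold in [0.5, 1]
ACTIONS = ["U", "D", "L", "R"]

Q = defaultdict(float)          # Q table: (state, action) -> value, zero-initialized

def select_action(state):
    """Rule of equation (8): exploit when p >= EPSILON, otherwise explore randomly."""
    if random.random() >= EPSILON:
        return max(ACTIONS, key=lambda a: Q[(state, a)])
    return random.choice(ACTIONS)

def q_update(state, action, rwd, next_state):
    """One Q-learning step per equations (6) and (7)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] = (1 - SIGMA) * Q[(state, action)] + SIGMA * (rwd + GAMMA * best_next)
```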
3.2.2. DQN Algorithm. The increase in the number of IoT devices leads to an increase in the number of system states. If the Q-learning algorithm were used, a very large Q table would have to be created and convergence would be too slow. The DQN algorithm is better suited, since its deep neural network can be trained and then immediately produce the best action to take.

The deep neural network takes the system state as the input and outputs a Q value for each action; that is, the network generates the cost function for a particular state and action. We describe this cost function as Q(s, a, θ), where θ denotes the weights of the neurons in the network. As data are collected while different actions are taken in different time slots, the neural network is trained to update its weights so that it outputs a more precise Q value:

Q(s, a, θ) ≈ Q*(s, a).   (10)

There are two identical neural networks in the DQN structure [24]: one is called the evaluation network eval_net, and the other is called the target network target_net. Since the two networks have the same structure, multiple hidden layers are defined for each. The current system state s is the input to eval_net, and the next system state s′ is the input to target_net. We use Q_e(s, a, θ) and Q_t(s′, a, θ′) to denote the outputs of eval_net and target_net, respectively. To update the weights of the neurons, only the evaluation network eval_net is trained continuously; the target network is not trained but periodically duplicates the weights of the neurons from the evaluation network (i.e., θ′ = θ). The loss function used to train eval_net is

Loss(θ) = E[(y − Q_e(s, a, θ))^2].   (11)

In traditional DQN, as shown in equation (12), the target network target_net alone is used to derive the cost function for a particular system state. Nevertheless, because the weights of target_net are not updated in every training epoch, the training error grows during training, which prolongs the training procedure. In Double DQN, both the target network target_net and the evaluation network eval_net are used to calculate the cost function: the evaluation network eval_net selects the best action for system state s′, and the target network evaluates it:

y = w(s, a, s′) + γ Q_t(s′, arg max_{a∈A} Q_e(s′, a, θ), θ′).   (13)

Recent work shows that the training error can be dramatically reduced with the Double DQN structure [24].
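The eval_net/target_net split and the Double DQN target of equation (13) can be sketched with a small PyTorch model. This is a hedged illustration under assumed layer sizes and hyperparameters; it is not the network architecture or training loop used in the paper.

```python
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA = 2, 4, 0.9   # state = (h, v); actions U, D, L, R (assumed sizes)

def make_net():
    # Small multilayer perceptron; the hidden widths are placeholders.
    return nn.Sequential(nn.Linear(N_STATES, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

eval_net, target_net = make_net(), make_net()
target_net.load_state_dict(eval_net.state_dict())     # periodic copy: theta' <- theta
optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-3)

def double_dqn_loss(s, a, r, s_next):
    """Loss(theta) = E[(y - Q_e(s, a, theta))^2], with y from equation (13):
    eval_net picks the action for s', target_net evaluates it."""
    q_sa = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_e(s, a, theta)
    with torch.no_grad():
        best_a = eval_net(s_next).argmax(dim=1, keepdim=True)      # arg max_a Q_e(s', a, theta)
        y = r + GAMMA * target_net(s_next).gather(1, best_a).squeeze(1)
    return nn.functional.mse_loss(q_sa, y)

# One illustrative update on a random mini-batch of transitions.
s = torch.rand(8, N_STATES); a = torch.randint(0, N_ACTIONS, (8,))
r = torch.rand(8); s_next = torch.rand(8, N_STATES)
loss = double_dqn_loss(s, a, r, s_next)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```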
In traditional DQN, only the cost function Q is defined as the output of the deep neural network. Dueling DQN is invented to speed up the convergence of