Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning

Hindawi Wireless Power Transfer, Volume 2022, Article ID 9921885, 10 pages. https://doi.org/10.1155/2022/9921885

Research Article

Yuan Xing¹, Riley Young¹, Giaolong Nguyen¹, Maxwell Lefebvre¹, Tianchi Zhao², Haowen Pan³, and Liang Dong⁴

¹Department of Engineering and Technology, University of Wisconsin-Stout, Menomonie, WI 54751, USA
²Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA
³Changzhou Voyage Electronics Technology LLC, Changzhou, China
⁴Department of Electrical and Computer Engineering, Baylor University, Waco, TX 76706, USA

Correspondence should be addressed to Yuan Xing; xingy@uwstout.edu

Received 26 October 2021; Accepted 31 January 2022; Published 4 March 2022

Academic Editor: Narushan Pillay

Copyright © 2022 Yuan Xing et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper aims to solve the optimization problems in far-field wireless power transfer systems using deep reinforcement learning techniques. The Radio-Frequency (RF) wireless transmitter is mounted on a mobile robot, which patrols near the harvested energy-enabled Internet of Things (IoT) devices. The wireless transmitter intends to continuously cruise on the designated path in order to fairly charge all the stationary IoT devices in the shortest time. The Deep Q-Network (DQN) algorithm is applied to determine the optimal path for the robot to cruise on. When the number of IoT devices increases, the traditional DQN cannot converge to a closed-loop path or achieve the maximum reward. In order to solve these problems, an area division Deep Q-Network (AD-DQN) is invented. The algorithm can intelligently divide the complete charging field into several areas. In each area, the DQN algorithm is utilized to calculate the optimal path. After that, the segmented paths are combined to create a closed-loop path for the robot to cruise on, which enables the robot to continuously charge all the IoT devices in the shortest time. The numerical results prove the superiority of the AD-DQN in optimizing the proposed wireless power transfer system.

1. Introduction

The wireless power transfer technique has proved to be the most effective solution to the charging problem as the number of IoT devices grows drastically, since it is impossible to replace the batteries of all IoT devices [1]. At recent years' Consumer Electronics Shows (CES), a large number of wireless power transfer products have come into consumers' sight. There are two types of wireless power transmission products: near-field and far-field.
In near-field wireless power transfer, the IoT devices, which are charged by resonant inductive coupling, have to be placed very close to the wireless transmitters (less than 5 cm) [2]. In far-field wireless power transfer, the IoT devices use the electromagnetic waves from the transmitters as the power resource, and the effective charging distance ranges from 50 centimeters to 1.5 meters [3-5]. Compared to the near-field transmitters, the far-field wireless power transmitters can charge IoT devices (including mobile IoT devices) that are deployed in a larger space.

However, far-field wireless power transfer is still in its infancy for two reasons. First, the level of power supply is very low due to the long distance between the power transmitters and the energy harvesters. In [6], the authors mentioned that the existing far-field RF energy harvesting technologies can only achieve nanowatt-level power transfer, which is too tiny to power up high-power-consuming electronic devices. In [3], the authors investigated RF beamforming in the radiative far field for wireless power transfer. The authors demonstrated that, with beamforming techniques, the level of energy harvesting can be boosted. However, as the distance between the transceivers increases to 1.5 meters, the amount of harvested energy is less than 5 milliwatts, which is still not ideal to power up high-energy-consuming devices. Second, most of the existing wireless charging systems can only effectively charge stationary energy harvesters. In [7], a set of wireless chargers (Powercast [8]) are deployed on a square area. The Powercast transmitters can adjust the transmission strategies to optimize the energy harvested at the stationary energy harvesters. In [9], the Powercast wireless charger is mounted on a moving robot. Therefore, the charger is a mobile wireless charger, which can adjust the transmission patterns for the stationary sensors while moving. However, the number of IoT devices to be charged is too small. In order to wirelessly charge multiple IoT devices, some researchers proposed using Unmanned Aerial Vehicles (UAVs) to implement the wireless power transfer [10-13]. The UAV is designed to plan the optimal path to charge the designated IoT devices. However, it is very inefficient to use a UAV to charge the IoT devices, since a UAV has very high power consumption and a very short operational time. Installing the wireless power emitter on the UAV further shortens the operational time of the UAV.

In order to enhance the level of energy harvesting and the efficiency in charging a large number of energy-hungry IoT devices, in this paper we assembled a wireless power transfer robot and applied a deep reinforcement learning algorithm to optimize its performance. In the system, the wireless transmitter aims to find the optimal path for the wireless power transfer robot. The robot cruises on the path, which can charge each IoT device in the shortest time. DQN has been widely used to play complicated games that have a large number of system states, even when the environment information is not entirely available [14]. Lately, many researchers have started to implement DQN to solve complicated wireless communication optimization problems, because the systems are very complicated and the environment information is time-varying and hard to capture [15-18]. In particular, researchers have applied deep reinforcement learning to plan the optimal path for auto-drive robots [19-22], and the robots can quickly converge to the optimal path. Henceforth, we found that DQN is a perfect match for our proposed optimization problem. However, those papers either only proposed the theoretical model or could not implement wireless power transfer functions. To the best of our knowledge, we are the first to implement an automatic far-field wireless power transfer system in the test field and to invent a DQN algorithm to solve it. In our system, the entire test field is evenly quantized into square spaces. The time is slotted with the same interval. We consider the relative location of the robot in the test field as the system state, while we define the direction to move in the next time slot as the action. At the beginning of each time slot, the wireless power transfer robot generates the system state and takes it as the input to the DQN. The DQN generates the Q values for each possible action, and the one with the maximum Q value is picked to guide the robot's move during the current time slot.

As the number of IoT devices increases and the test field becomes more complicated, the traditional DQN cannot generate a closed-loop path for the robot to cruise on, which does not satisfy the requirement of charging in every regular time interval. In order to deal with this problem, area division deep reinforcement learning is proposed in this paper. At first, the algorithm divides the whole test field into several areas. In each area, DQN is utilized to calculate the optimal path. Next, the entire path is formulated with the paths of each separated area. In this way, a closed loop is guaranteed, and the numerical results prove that the calculated path is also the optimal path.
2. System Model

The symbols used in this paper and the corresponding explanations are listed in Table 1.

As shown in Figure 1, a mobile robot that carries two RF wireless power transmitters cruises on the calculated path to radiate RF power to K nearby RF energy harvesters. Both the power transmitter and the RF power harvesters are equipped with one antenna.

Figure 1: Mobile wireless power transmitter cruises on the calculated path to charge multiple harvested energy-enabled IoT devices.

The power received at receiver k, k ∈ K = {1, 2, ..., K}, is

p_k = [η G_tx G_rx (λ/4π)^2 / (l (L + μ)^2)] p_tx,   (1)

where p_tx is the transmit power; G_tx is the gain of the transmitter's antenna; G_rx is the gain of the receiver's antenna; L is the distance between the transmitter and harvester k; η is the rectifier efficiency; λ denotes the wavelength of the transmitted signal; l denotes the polarization loss; and μ is the adjustable parameter due to Friis's free space equation. Since the effective charging area is critical in determining the level of energy harvesting, and since it is the parameter to be adjusted at the transmitter, equation (1) is reexpressed using the effective area:

p_k = [η S_tx S_rx cos α / (l λ^2 (L + μ)^2)] p_tx,   (2)

where S_tx is the maximum effective transmit area, S_rx is the effective received area, and α is the angle between the transmitter and the vertical reference line.

Since we consider mobile energy harvesters in the system, the distance and the effective charging area may vary over time. We assume that the time is slotted and that the position of any mobile device within one time slot is constant. In time slot n, the power harvested at receiver k can be denoted as

p_k(n) = [η S_tx S_rx cos α(n) / (l λ^2 (L(n) + μ)^2)] p_tx.   (3)

For a mobile energy harvester, the power harvested in different time slots is determined by the angle between the transmitter and the vertical reference line, α(n), together with the distance between the transmitter and the harvester, L(n), in that time slot.

In our model, the mobile transmitter is free to adjust the transmit angle α(n) and the distance L(n), as it can move around the IoT devices. We assume that effective charging is counted only when α(n) = 0 and L(n) ≤ 45 cm.
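As a quick illustration of the power model above, the following Python sketch evaluates equation (3) and the effective-charging test for one time slot. The function names and every numeric value (rectifier efficiency, effective areas, and the 915 MHz wavelength matching the Powercast transmitters used in Section 4) are illustrative assumptions, not measurements from the paper.

import math

def received_power(p_tx, s_tx, s_rx, alpha_rad, distance_m,
                   eta=0.5, pol_loss=1.0, wavelength_m=0.328, mu=0.0):
    # Equation (3): p_k(n) = eta*S_tx*S_rx*cos(alpha(n))*p_tx
    #               / (l * lambda^2 * (L(n) + mu)^2)
    numerator = eta * s_tx * s_rx * math.cos(alpha_rad) * p_tx
    denominator = pol_loss * wavelength_m ** 2 * (distance_m + mu) ** 2
    return numerator / denominator

def effectively_charged(alpha_rad, distance_m):
    # The paper's rule: charging counts only when alpha(n) = 0 and L(n) <= 45 cm.
    return alpha_rad == 0.0 and distance_m <= 0.45

# Example: a 3 W transmitter (as in the experiments) at the 45 cm boundary;
# the effective areas 0.01 m^2 and 0.005 m^2 are placeholders.
print(received_power(p_tx=3.0, s_tx=0.01, s_rx=0.005,
                     alpha_rad=0.0, distance_m=0.45),
      effectively_charged(0.0, 0.45))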
Table 1: Symbols and explanations.

K: the number of energy harvesters
η: rectifier efficiency
G_tx: gain of the transmitter's antenna
G_rx: gain of the receiver's antenna
λ: wavelength of the transmitted signal
l: polarization loss
μ: Friis's free space adjustable parameter
L: distance between transmitter and harvester
p_tx: transmit power
p_k: received power
α: angle between the transmitter and the vertical reference line
S_tx: maximum effective transmit area
S_rx: effective received area
n: time instant
pos(h, v): position h and v units from the left and upmost edges
o_k: position of the kth energy harvester
eff_k: effective charging area of the kth IoT device
s: present system state
s': next system state
a^n: action taken at time n
T: total time consumption
p_{s,s'}(a): transition probability from state s to state s' taking action a
w(s, a, s'): reward function at state s taking action a
acc_{k−1}: indicator of whether the first k − 1 harvesters have been charged
ζ: unit price for the reward function
π: optimal strategy
Q(s, a): cost function at state s taking action a
c: reward decay
σ(s', a): learning rate for Q-learning
p_i: selected location for the ith area
W_i: ith area

3. Problem Formulation

In this paper, the optimization problem is formulated as a Markov Decision Process (MDP), and a reinforcement learning (RL) algorithm is utilized to solve the problem. Furthermore, the DQN algorithm is applied to address the large number of system states.

3.1. Problem Formulation. In order to model our optimization problem as an RL problem, we define the test field as consisting of unit squares of the same area, whose side length is 30 cm. K = 8 harvested energy-enabled IoT devices are deployed in the test field, whose orders are 0, 1, 2, 3, 4, 5, 6, and 7, respectively. The map is shown in Figure 2.

Figure 2: The entire test field consists of unit squares of the same size. K = 8 harvested energy-enabled IoT devices are deployed in the test field. The shadow area adjacent to each IoT device indicates the effective charging area for the respective IoT device. For example, the boundary of the effective charging area for the No. 6 IoT device is highlighted in red.

The system state s at time slot n is defined as the position of the particular square where the robot is currently located in the test field, which is specified as s = pos(h, v), where h is the distance between the present square and the leftmost edge, counted by the number of squares, and v is the distance between the present square and the upmost edge, counted by the number of squares. For example, the No. 5 IoT device can be denoted as o_5 = pos(2, 0). The shadow area adjacent to the No. k IoT device indicates the effective charging area for the respective IoT device, which is denoted as eff_k. For example, the boundary of the effective charging area for the No. 6 IoT device is highlighted in red. We define the direction of movement in a particular time slot n as the action a^n. The set of possible actions A consists of 4 different actions, A = {U, D, L, R}, where U is moving upward one unit, D is moving downward one unit, L is moving left one unit, and R is moving right one unit.

Given the above, the mobile wireless charging problem can be formulated as minimizing the time duration T for the robot to complete one loop while the robot passes through one of the effective areas of each IoT device:

P: minimize_{a^n} T,
   subject to s^0 = s^T,
   ∃ s^n ∈ eff_k, ∀k ∈ K, n = 1, 2, ..., T.   (4)

The time duration for the robot to complete one loop is defined as T. The starting position is the same as the last position, since the robot cruises in a loop. In the loop, the robot has to pass through at least one of the effective charging squares of each IoT device.

Adapting to the different positions, the agent chooses a different action at each time slot. Henceforth, we can model our proposed system as a Markov chain. In the system, we use the current position to specify a particular state s, and S denotes the system state set. The starting state s^0 and the final state s^T are the same, since the robot needs to move and return to the starting point.
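The state and action definitions above map directly onto a small grid-world encoding. The sketch below is a minimal illustration of pos(h, v), the action set {U, D, L, R}, and the loop constraint of problem P in (4). The grid size, the boundary handling (staying in place at the field edge), and the example effective squares are assumptions for illustration only.

ACTIONS = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(state, action, width, height):
    # State s = pos(h, v): h squares from the leftmost edge,
    # v squares from the upmost edge. One action moves one unit.
    dh, dv = ACTIONS[action]
    h, v = state
    # Assumption: the robot stays in place if the move would leave the field.
    return (min(max(h + dh, 0), width - 1), min(max(v + dv, 0), height - 1))

def loop_is_feasible(path, effective_areas):
    # Constraint of problem P in (4): the path is a loop and visits at
    # least one effective-charging square of every IoT device.
    closed = path[0] == path[-1]
    covered = all(any(s in area for s in path) for area in effective_areas)
    return closed and covered

# Example: device 5 at pos(2, 0) as in the paper, on an assumed 3x3 field
# with hypothetical effective squares.
eff_5 = {(2, 1), (1, 0)}
print(step((2, 2), "U", width=3, height=3),
      loop_is_feasible([(0, 0), (1, 0), (0, 0)], [eff_5]))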
The MDP can be described as follows: the agent chooses an action a from A at a specific system state s. After that, the system transits into a new state s'. p_{s,s'}(a), with s, s' ∈ S and a ∈ A, denotes the probability that the system state transits from s to s' under action a.

The reward of the MDP is denoted as w(s, a, s'), which is defined for the transition of the system state from s to s'. The optimization problem is formulated as reaching s^T in the fewest transmission time slots; henceforth, the reward has to be defined to motivate the mobile robot so that it does not repeatedly pass through any effective charging area of any IoT device. Besides, the rewards at different positions are interconnected with each other, since the goal of the optimization is to pass through the effective charging areas of all the IoT devices. We assume that the optimal order in which to pass through all the IoT devices is defined as o_0, o_1, ..., o_7, with o = 0, 1, ..., 7. Specifically, the reward function can be expressed as

w(s, a, s') = o_k ζ, if s' ∈ eff_{o_k} and acc_{o_{k−1}} = 1;
              −1, otherwise.   (5)

In the above equation, acc_{o_{k−1}} = 1 if the robot has already passed through an effective area of the o_{k−1}th IoT device, and ζ denotes the unit price of the harvested energy.
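A compact way to read equation (5) is as a small Python function. The sketch below assumes the visiting order o_0, ..., o_7 is given and uses ζ = 4 (the value later chosen for reward_1 in Section 4); the example effective areas are hypothetical.

def reward(next_state, k, effective_areas, already_charged, zeta=4):
    # Equation (5): w(s, a, s') = o_k * zeta if s' lies in eff_{o_k} and the
    # (k-1)-th device in the visiting order has already been charged;
    # -1 otherwise.
    prev_done = True if k == 0 else already_charged[k - 1]
    if next_state in effective_areas[k] and prev_done:
        return k * zeta
    return -1

# Example: the robot reaches an effective square of device 2 after
# devices 0 and 1 have been charged (areas below are placeholders).
eff = {0: {(0, 1)}, 1: {(3, 0)}, 2: {(5, 2)}}
print(reward((5, 2), k=2, effective_areas=eff,
             already_charged={0: True, 1: True}))   # -> 8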
As we have defined all the necessary elements of the MDP, we can characterize the formulated problem as a stochastic shortest path search that starts at s^0 and ends at s^T. At each system state s, we derive the best action a*(s) which can generate the maximum reward. The optimal policy set is defined as π = {a(s): s ∈ S}.

3.2. Optimal Path Planning with Reinforcement Learning. If the system dynamics obey a specific transition probability, reinforcement learning is a perfect match for solving the optimization problem. In this section, Q-learning [23] is first introduced to solve the proposed problem. After that, to address the large state and action sets, the DQN algorithm [14] is utilized to determine the optimal action for each particular system state.

3.2.1. Q-Learning Method. The traditional Q-learning method is widely used to solve dynamic optimization problems provided that the number of system states is moderate. Corresponding to each particular system state, the best action can be determined to generate the highest reward. Q(s, a) denotes the cost function, which uses a numerical value to describe the cost of taking action a at state s. At the beginning of the algorithm, all the cost functions are zero, Q(s, a) = 0, since no action has ever been taken to generate any consequence. All the Q values are saved in the Q table. Only one cost function is updated in each time slot, as the action is taken and the reward function is calculated. The cost function is updated as

Q(s, a) = (1 − σ(s, a)) Q(s, a) + σ(s, a) [w(s, a, s') + c f(s', a')],   (6)

where

f(s', a') = max_{a∈A} Q(s', a).   (7)

The learning rate is defined as σ(s', a).

When the algorithm initializes, the Q table is empty, since no exploration has been made to obtain any useful cost function to fill the Q table. Since the agent has no experience of the environment, random action selection is implemented at the beginning of the algorithm. A threshold ϵ ∈ [0.5, 1] is designed to start the exploration. In each time slot, a numerical value p ∈ [0, 1] is generated and compared with the threshold. If p ≥ ϵ, action a is picked as

a = arg max_{a∈A} Q(s, a).   (8)

However, provided that p < ϵ, an action is randomly selected from the action set A.

After iteratively updating the values in the Q table, the Q values converge. We can calculate the best action corresponding to each state by

π*(s) = arg max_{a∈A} Q*(s, a),   (9)

which corresponds to finding the optimal moving direction for each system state explored during the charging process.
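The update (6)-(7), the threshold-based exploration rule (8), and the policy extraction (9) fit in a short tabular sketch. The env object below is a hypothetical stand-in for the grid world of Section 3.1 (reset() returning s^0, and step(a) returning the next state, the reward w, and a done flag); σ, c, and the threshold ϵ follow the notation of Table 1.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, sigma=0.1, c=0.99, eps=0.8):
    Q = defaultdict(float)                      # Q(s, a), initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Exploration rule (8): exploit only when the drawn p >= eps,
            # otherwise pick a random action from A.
            if random.random() >= eps:
                a = max(actions, key=lambda act: Q[(s, act)])
            else:
                a = random.choice(actions)
            s_next, w, done = env.step(a)
            # Equations (6)-(7): blend the old estimate with the reward
            # plus the decayed best next-state value.
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - sigma) * Q[(s, a)] + sigma * (w + c * best_next)
            s = s_next
    # Equation (9): greedy policy extracted from the converged table.
    policy = lambda state: max(actions, key=lambda act: Q[(state, act)])
    return Q, policy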
3.2.2. DQN Algorithm. The increase in the number of IoT devices leads to an increase in the number of system states. Suppose that the Q-learning algorithm is used; a very large Q table has to be created, and the convergence speed is too slow. The DQN algorithm is more suitable, since there is a deep neural network in the structure that can be well trained and immediately determine the best action to be taken.

The deep neural network in the structure takes the system state as the input, and the Q value of each action is defined as the output. Henceforth, the function of the neural network is to generate the cost function for a particular state and action. We describe the cost function as Q(s, a, θ), where θ is the weight on the neuron nodes in the structure. As we collect data when different actions are taken in different time slots, the neural network is trained to update its weights, so that it outputs a more precise Q value:

Q(s, a, θ) ≈ Q(s, a).   (10)

There are two identical neural networks in the structure of DQN [24]: one is called the evaluation network eval_net, and the other is called the target network target_net. These two deep neural networks have the same structure, with multiple hidden layers defined for each network. We use the current system state s and the next system state s' as the inputs to eval_net and target_net, respectively. We use Q_e(s, a, θ) and Q_t(s', a', θ') to denote the outputs of the two deep neural networks eval_net and target_net. In the structure, in order to update the values of the weights of the neuron nodes, we only continuously train the evaluation network eval_net. The target network is not trained. It periodically duplicates the weights of the neurons from the evaluation network (i.e., θ' = θ). The loss function, which is used to train eval_net, is described as follows:

Loss(θ) = E[(y − Q_e(s, a, θ))^2].   (11)

We use y to represent the real Q value, which can be expressed as

y = w(s, a, s') + ϵ max_{a'∈A} Q_t(s', a', θ').   (12)

We denote the learning rate as ϵ. The idea of backpropagation is utilized to update the weights of eval_net; as a result, the neural network is trained.

The experience replay method is utilized to improve the training effect, since it can effectively eliminate the correlation among the training data. Each single experience includes the system state s, the action a, and the next system state s', together with the reward function w(s, a, s'). We define the experience set as ep = {s, a, w(s, a, s'), s'}. In the algorithm, D individual experiences are saved and, in each training epoch, only D_s (with D_s < D) experiences are selected from D. As the training process is completed, target_net copies the weights of the neurons from the evaluation network (i.e., θ' = θ). D different experiences are generated from ep, while only D_s are picked to train the evaluation network eval_net. The total number of training iterations is denoted as U. Both the evaluation network and the target network share the same structure, in which the deep neural networks have N hidden layers.

3.2.3. Dueling Double DQN. In order to improve the performance of DQN, so that it can effectively select the optimal action to charge multiple harvesters under time-varying channel conditions, we redesign the structure of the deep neural network by using Dueling Double DQN. Double DQN is an advanced version of DQN which can prevent the overestimation problem that appears throughout the training [24]. Dueling Double DQN can efficiently solve the overestimation problem throughout the training process. In the same number of training epochs, Dueling Double DQN is proved to outperform the original DQN in learning efficiency.

In traditional DQN, as shown in equation (12), the target network target_net is designed to derive the cost function for a particular system state. Nevertheless, because we do not update the weights of the target network target_net in each training epoch, the training error increases while training, hence prolonging the training procedure. In Double DQN, both the target network target_net and the evaluation network eval_net are used to calculate the cost functions. We use the evaluation network eval_net to select the best action for system state s':

y = w(s, a, s') + ϵ Q_t(s', arg max_{a∈A} Q_e(s', a, θ), θ').   (13)

The latest research proves that the training error can be dramatically reduced using the Double DQN structure [24].

In traditional DQN, we only define the cost function Q value as the output of the deep neural network. Dueling DQN is invented to speed up the convergence of the deep neural network by designing two individual output streams for the deep neural network. We use the output value V(s, θ, β) to represent the first stream of the neural network; it denotes the cost function for a specific system state. We name the second stream of the output the advantage output A(s', a, θ, α), which is utilized to illustrate the advantage of using a specific action in a system state [25]. We define α and β as the parameters that correlate the outputs of the two streams with the neural network. The cost function can be denoted as

Q(s, a, θ, α, β) = V(s, θ, β) + (A(s', a, θ, α) − (1/|A|) Σ_a A(s', a, θ, α)).   (14)

The latest research proves that Dueling DQN can speed up the training procedure by efficiently annihilating the additional freedom while training the deep neural network [25].
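The targets in equations (12) and (13) and the aggregation in equation (14) can be sketched with plain NumPy, treating the next-state outputs of eval_net and target_net as given arrays. The discount is written here as gamma (the paper denotes it ϵ in (12)-(13) and c in Table 1); this is a hedged illustration, not the authors' TensorFlow implementation.

import numpy as np

def dqn_target(w, q_target_next, gamma=0.99):
    # Equation (12): y = w(s, a, s') + gamma * max_a' Q_t(s', a', theta')
    return w + gamma * np.max(q_target_next)

def double_dqn_target(w, q_eval_next, q_target_next, gamma=0.99):
    # Equation (13): the action is chosen by eval_net and its value is read
    # from target_net, which reduces overestimation.
    a_star = int(np.argmax(q_eval_next))
    return w + gamma * q_target_next[a_star]

def dueling_q(value, advantages):
    # Equation (14): Q = V + (A - mean(A)) over the action set.
    return value + (advantages - np.mean(advantages))

# Example with the 4 actions {U, D, L, R} and made-up network outputs:
q_e = np.array([1.0, 0.2, 0.4, 0.1])
q_t = np.array([0.8, 0.3, 0.5, 0.2])
print(dqn_target(-1.0, q_t),
      double_dqn_target(-1.0, q_e, q_t),
      dueling_q(0.5, q_e))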
3.3. Area Division Deep Reinforcement Learning. In this paper, the optimization problem can be seen as calculating the optimal closed-loop path which generates the maximum accumulated reward. However, the traditional DQN has difficulty converging to the optimal path because of the complicated experimental field. In order to improve on the performance of traditional DQN, we invent the AD-DQN in this paper. At first, the experimental field is divided into multiple separate parts. DQN is run on each part individually to obtain the optimal path for the robot in that part. Finally, the entire closed-loop path is formulated using the path of each part. For the area division, the whole area is defined as W. The whole area is divided at multiple specific locations p_i ∈ P.

The criterion for picking p_i is finding the squares which exist in more than one effective charging area of the IoT devices:

∀ p_i ∈ P: p_i ∈ eff_m and p_i ∈ eff_n, m, n = 0, 1, ..., K − 1, m ≠ n.   (15)

For each p_i, we define N_i = {p_i}. We define the set K_e = {o_j: p_i ∈ eff_j, j = 0, 1, ..., K − 1}. In the clockwise direction, we find the IoT device o_i that has the shortest distance to p_i, and then add both o_i and the effective charging area of o_i to N_i. The new area can be expressed as

N_i = N_i ∪ {o_i} ∪ {eff_i}.   (16)

Next, we find the IoT device having the shortest distance to the IoT device o_i that was just added to set N_i, and then add both the new IoT device and its effective charging area to N_i. Iteratively, all the IoT devices besides the ones in K_e are included in one N_i. Finally, we classify all the remaining squares to the nearest N_i, so that the union of the sets {N_i} covers W.

In each area, DQN is run to determine the optimal path for the robot. In each area, the starting point is the same as the position of p_i; the end point is one of the effective charging squares of the IoT device furthest from the starting point in the same area. After the optimal path is calculated for each individual area, the closed-loop optimal path for the entire area can be synthesized. The algorithm is shown in Algorithm 1.

Algorithm 1: AD-DQN.
(i) Define E = {eff_k, k = 0, 1, ..., K − 1}. Among E, find all the area division points p_i by {p_i} = {pos(h, v) | pos(h, v) ∈ eff_m, pos(h, v) ∈ eff_n, m, n ∈ K}.
(ii) The number of area division points is defined as |P|.
(iii) i = 1, ..., |P|. The number of areas to be divided is |P| + 1.
(iv) K_e = {o_j: p_i ∈ eff_j, j = 0, 1, ..., K − 1}.
(v) for i = 1, ..., |P|:
(vi) r_1 = p_i. r_2 = p_i.
(vii) while ∄ o_g ∈ N_i such that o_g ∈ N_{I\{i}}:
(viii) if i ≤ |P|:
(ix) In the clockwise direction, find the IoT device that has the shortest distance to r_1. The order of that IoT device is g = arg min_{o_i ∉ K_e} |o_i − r_1|. N_i is updated as N_i = N_i ∪ {o_g} ∪ {eff_g}. r_1 = o_g.
(x) else
(xi) In the counterclockwise direction, find the IoT device that has the shortest distance to r_2. The order of that IoT device is g = arg min_{o_i ∉ K_e} |o_i − r_2|. N_i is updated as N_i = N_i ∪ {o_g} ∪ {eff_g}. r_2 = o_g.
(xii) end
(xiii) end while
(xiv) end
(xv) for i = 1, ..., |P|:
(xvi) Define set W_i.
(xvii) W_i = N_i ∪ {pos(h, v) ∈ W | arg min_k |pos(h, v) − o_k|, k ∉ K_e}.
(xviii) end
(xix) for i = 1, 2, ..., |P| + 1:
(xx) for j = 1, 2, ..., |J|:
(xxi) The starting point is defined as p_i. The end point is defined as e_j ∈ eff_c, j ∈ J.
(xxii) The weights θ of the neuron nodes are randomly generated for eval_net, and the weights are copied to target_net, θ' = θ. u = 1. D = d = 1.
(xxiii) while u < U: s = s^0. t = 1.
(xxiv) A probability is generated as a numerical parameter p ∈ [0, 1].
(xxv) if D > 200 and p ≥ ϵ:
(xxvi) a = arg max_{a∈A} Q(s, a)
(xxvii) else
(xxviii) Randomly choose an action from the action set A.
(xxix) end
(xxx) while s ≠ s^T:
(xxxi) The state transits into s' after taking the action. d = d + 1. ep(d) = {s, a, w(s, a, s'), s'}. D keeps unchanged if it goes over the experience pool's limitation and d = 1; otherwise, D = d. t = t + 1. s = s'. After enough data has been collected in the experience pool, eval_net is trained using D_s of the D experiences. Minimize the loss function Loss(θ) using backpropagation. target_net copies the weights from eval_net periodically.
(xxxii) end while
(xxxiii) end while
(xxxiv) The optimal path of the entire test field is synthesized with the optimal path in each W_i.
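The following simplified sketch illustrates only the area-division step: it finds the division points of equation (15) and attaches leftover squares to the nearest area, as in the last step before Algorithm 1. The clockwise/counterclockwise device-by-device growth of (16) and the per-area DQN runs are omitted, and the data layout (sets of (h, v) squares) is an assumption for illustration.

def division_points(effective_areas):
    # Equation (15): squares lying in more than one device's effective area.
    points = set()
    keys = list(effective_areas)
    for i, m in enumerate(keys):
        for n in keys[i + 1:]:
            points |= effective_areas[m] & effective_areas[n]
    return points

def assign_remaining(squares, areas):
    # Final division step: every square not yet in an area is attached to
    # the nearest area, measured by Manhattan distance to its squares.
    def dist(sq, area):
        return min(abs(sq[0] - a[0]) + abs(sq[1] - a[1]) for a in area)
    for sq in squares:
        if not any(sq in area for area in areas):
            min(areas, key=lambda area: dist(sq, area)).add(sq)
    return areas

# Example: devices 2 and 3 share one effective square, so that square is
# the single division point (consistent with the experiment in Section 4).
eff = {2: {(4, 3), (5, 3)}, 3: {(5, 3), (6, 3)}}
print(division_points(eff))   # -> {(5, 3)}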
4. Experimental Results

The implementation of the proposed wireless power transfer system is shown in Figure 3.

Figure 3: Flowchart of the wireless power transfer implementation.

In the test field, 8 harvested energy-enabled IoT devices are placed as Figure 2 indicates. The top view of the test field can be seen as a 2D map. Henceforth, the map is modeled and input into the computer. Then the AD-DQN algorithm is implemented on the computer using Python, and the optimal charging path can be derived. At the same time, a wireless power transfer robot is assembled. Two Powercast RF power transmitters TX91501 [8] are mounted on the two sides of the Raspberry Pi [26] enabled intelligent driving robot. Each transmitter is powered by a 5 V power bank and continuously emits 3 Watts of RF power. An infrared patrol module is installed on the robot to implement autodrive on the test field; henceforth, the robot can automatically cruise along the path and continuously charge the multiple IoT devices, as shown in Figure 1. To the best of our knowledge, we are the first to implement an automatic wireless power transfer system in the test field and to invent the AD-DQN algorithm to design the optimal path for the wireless power transfer robot. Since we are the first to design and implement a mobile far-field wireless power transfer system, there is no hardware reference design we can refer to and use for validation, so the validation of our work is done on the software side. Referring to the flowchart, our mobile wireless power transfer system can be replicated.

For the software, we use TensorFlow 0.13.1 together with Python 3.8 in Jupyter Notebook 5.6.0 as the software simulation environment to train the AD-DQN. The number of hidden layers is 4, and each hidden layer has 100 nodes. The learning rate is less than 0.1. The mini-batch size is 10. The learning frequency is 5. The training starting step is 200. The experience pool is greater than 20000. The exploration interval is 0.001. The target network replacement interval is greater than 100. The reward decay is 0.99.
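For reference, the reported training settings can be collected in a single configuration dictionary. The key names are our own; only the values come from the paragraph above, with the stated "less/greater than" bounds kept as comments.

AD_DQN_CONFIG = {
    "hidden_layers": 4,
    "nodes_per_hidden_layer": 100,
    "learning_rate": 0.1,            # paper: "less than 0.1"
    "mini_batch_size": 10,
    "learning_frequency": 5,         # train eval_net every 5 steps
    "training_start_step": 200,
    "experience_pool_size": 20000,   # paper: "greater than 20000"
    "exploration_interval": 0.001,
    "target_replace_interval": 100,  # paper: "greater than 100"
    "reward_decay": 0.99,
}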
First, different reward functions are tested to find the optimal one. Reward one, reward_1, is defined using equation (5), with the unit price defined as ζ = 4. Reward two, reward_2, is defined as

w(s, a, s') = ζ, if s' ∈ eff_{o_k} and acc_{o_{k−1}} = 1;
              −1, otherwise,   (17)

where ζ = 4. Reward three, reward_3, is defined with equation (5); however, ζ = 2. Two factors are observed for the performance of the different rewards: the average reward during the training and the average time consumption during the training.
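The three candidate rewards can be written as three small functions with the same signature, which makes the comparison of Figures 4-7 easy to reproduce in simulation. reward_1 and reward_3 follow equation (5) with ζ = 4 and ζ = 2, and reward_2 follows equation (17) with ζ = 4; the boolean arguments stand in for the conditions s' ∈ eff_{o_k} and acc_{o_{k−1}} = 1.

def reward_1(k, in_effective_area, prev_charged):
    # Equation (5) with zeta = 4: bonus scales with the device order index.
    return k * 4 if in_effective_area and prev_charged else -1

def reward_2(k, in_effective_area, prev_charged):
    # Equation (17) with zeta = 4: flat bonus, independent of the order index.
    return 4 if in_effective_area and prev_charged else -1

def reward_3(k, in_effective_area, prev_charged):
    # Equation (5) with zeta = 2.
    return k * 2 if in_effective_area and prev_charged else -1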
Based on the procedure of AD-DQN in Algorithm 1, the experimental field is divided into two areas along the only shared effective charging area of device 2 and device 3. In area I, IoT devices 2, 3, 4, 5, and 6 are included, while in area II, IoT devices 0, 1, 2, 6, and 7 are included. In area I, the performances of the three different rewards are compared in Figures 4 and 5. In area II, the performances of the three different rewards are compared in Figures 6 and 7.

Figure 4: The average rewards of reward_1, reward_2, and reward_3 versus the training episodes in area I of the experimental field.

Figure 5: The average time consumption achieved by reward_1, reward_2, and reward_3 versus the training episodes in area I of the experimental field.

Figure 6: The average rewards of reward_1, reward_2, and reward_3 versus the training episodes in area II of the experimental field.

Figure 7: The average time consumption achieved by reward_1, reward_2, and reward_3 versus the training episodes in area II of the experimental field.

From Figures 4 and 5, we can observe that reward_1 is optimal. Since all three rewards perform similarly on the time consumption, reward_1 gives the highest reward among all, which means that reward_1 can effectively charge the most IoT devices compared with the other two rewards. From Figures 6 and 7, we can observe that reward_3 performs best on the time consumption to complete one episode; however, its average reward is much lower than that of reward_1. This can be explained as follows: compared with reward_1, reward_3 can only effectively charge a smaller number of IoT devices. Overall, reward_1 has the best performance in both areas I and II; henceforth, reward_1 is used to define the reward for AD-DQN.

In Figures 8 and 9, the performances of four different algorithms are compared. The random action selection algorithm randomly selects the action in the experimental test field. As for AD-DQN, reward_1 is used as the reward of Q-learning and DQN. We define the successful charging rate as the number of IoT devices that can be successfully charged in one complete charging episode over the total number of IoT devices.

Figure 8: The effective charging rate of random action selection, Q-learning, DQN, and AD-DQN versus the total number of IoT devices.

Figure 9: The average time consumption of random action selection, Q-learning, DQN, and AD-DQN versus the total number of IoT devices.

From Figure 8, we can observe that random action selection has the worst successful charging rate. This can be explained as follows: random action selection never converges to either a suboptimal or an optimal path. Q-learning performs better than random action selection; however, it is outperformed by the other two algorithms, since Q-learning can only deal with simple reinforcement learning models. DQN performs better than Q-learning and random action selection; however, it is outperformed by AD-DQN: since the rewards for different states are defined as interconnected, even though the reward decay is 0.99, DQN still cannot learn the optimal solution. When the total number of IoT devices decreases, DQN and AD-DQN perform the same, since the decrease in the number of IoT devices weakens the interconnections between different system states. From Figure 9, we can observe that, compared with the other algorithms, AD-DQN is not the one consuming the fewest time slots to complete one charging episode; however, AD-DQN is still the optimal algorithm, since all the other algorithms cannot achieve a 100% effective charging rate, and hence they consume fewer time slots to complete one charging episode.
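The metric plotted in Figure 8 can be computed directly from a charging episode, assuming the path and the effective areas are represented as sets of grid squares as in the earlier sketches.

def successful_charging_rate(path, effective_areas):
    # Devices effectively charged at least once in the episode, divided by
    # the total number of devices.
    charged = sum(1 for area in effective_areas
                  if any(s in area for s in path))
    return charged / len(effective_areas)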
In Figure 10, the optimal path determined by AD-DQN is shown as the bold black line. The arrows on the path show the direction for the robot to move, as we assume that the robot is regulated to cruise on the path in the counterclockwise direction. In this way, the robot can continuously charge all the IoT devices. The experimental demonstration is shown in Figure 1.

Figure 10: The optimal path determined by AD-DQN. The bold black line indicates the path for the wireless power transfer robot.

5. Conclusions

In this paper, we invent a novel deep reinforcement learning algorithm, AD-DQN, to determine the optimal path for the mobile wireless power transfer robot to dynamically charge the harvested energy-enabled IoT devices. The invented algorithm can intelligently divide a large area into multiple subareas and implement an individual DQN in each area, finally synthesizing the entire path for the robot. Compared with the state of the art, the proposed algorithm can effectively charge all the IoT devices on the experimental field. The whole system can be used in many application scenarios, such as charging IoT devices in dangerous areas and charging medical devices.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors' Contributions

Yuan Xing designed the proposed wireless power transfer system, formulated the optimization problem, and proposed the innovative reinforcement learning algorithm. Riley Young, Giaolong Nguyen, and Maxwell Lefebvre designed, built, and tested the wireless power transfer robot on the wireless power transfer test field. Tianchi Zhao optimized the performance of the proposed deep reinforcement learning algorithm. Haowen Pan implemented the comparison of the system performance between the proposed algorithm and the state of the art. Liang Dong provided the theoretical support for the far-field RF power transfer technique.

Acknowledgments

This work was supported by the WiSys Technology Foundation under a Spark Grant.

References

[1] F. Giuppi, K. Niotaki, A. Collado, and A. Georgiadis, "Challenges in energy harvesting techniques for autonomous self-powered wireless sensors," in Proceedings of the 2013 European Microwave Conference, pp. 854–857, Nuremberg, Germany, October 2013.
[2] A. M. Jawad, R. Nordin, S. K. Gharghan, H. M. Jawad, and M. Ismail, "Opportunities and challenges for near-field wireless power transfer: a review," Energies, vol. 10, no. 7, p. 1022, 2017.
[3] P. S. Yedavalli, T. Riihonen, X. Wang, and J. M. Rabaey, "Far-field RF wireless power transfer with blind adaptive beamforming for internet of things devices," IEEE Access, vol. 5, pp. 1743–1752, 2017.
[4] Y. Xing and L. Dong, "Passive radio-frequency energy harvesting through wireless information transmission," in Proceedings of the IEEE DCOSS, pp. 73–80, Ottawa, ON, Canada, June 2017.
[5] Y. Xing, Y. Qian, and L. Dong, "A multi-armed bandit approach to wireless information and power transfer," IEEE Communications Letters, vol. 24, no. 4, pp. 886–889, 2020.
[6] Y. L. Lee, D. Qin, L.-C. Wang, and G. H. Sim, "6G massive radio access networks: key applications, requirements and challenges," IEEE Open Journal of Vehicular Technology, vol. 2, 2020.
[7] S. Nikoletseas, T. Raptis, A. Souroulagkas, and D. Tsolovos, "Wireless power transfer protocols in sensor networks: experiments and simulations," Journal of Sensor and Actuator Networks, vol. 6, no. 2, p. 4, 2017.
[8] Powercast, https://www.powercastco.com/documentation/, 2021.
[9] C. Lin, Y. Zhou, F. Ma, J. Deng, L. Wang, and G. Wu, "Minimizing charging delay for directional charging in wireless rechargeable sensor networks," in Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 1819–1827, IEEE, Paris, France, April-May 2019.
[10] H. Yan, Y. Chen, and S.-H. Yang, "UAV-enabled wireless power transfer with base station charging and UAV power consumption," IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12883–12896, 2020.
[11] Y. Liu, K. Xiong, Y. Lu, Q. Ni, P. Fan, and K. B. Letaief, "UAV-aided wireless power transfer and data collection in Rician fading," IEEE Journal on Selected Areas in Communications, vol. 39, no. 10, pp. 3097–3113, 2021.
[12] W. Feng, N. Zhao, S. Ao et al., "Joint 3D trajectory design and time allocation for UAV-enabled wireless power transfer networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 9, pp. 9265–9278, 2020.
[13] X. Yuan, T. Yang, Y. Hu, J. Xu, and A. Schmeink, "Trajectory design for UAV-enabled multiuser wireless power transfer with nonlinear energy harvesting," IEEE Transactions on Wireless Communications, vol. 20, no. 2, pp. 1105–1121, 2020.
[14] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," arXiv:1312.5602, 2013.
[15] Y. He, Z. Zhang, F. R. Yu et al., "Deep reinforcement learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10433–10445, 2017.
[16] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Proceedings of the 29th Conference on Advances in Neural Information Processing Systems, pp. 2137–2145, Barcelona, Spain, May 2016.
[17] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, "A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs," in Proceedings of the 2017 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, Paris, France, May 2017.
[18] Y. Xing, H. Pan, B. Xu et al., "Optimal wireless information and power transfer using deep Q-network," Wireless Power Transfer, vol. 2021, Article ID 5513509, 12 pages, 2021.
[19] C. Chen, J. Jiang, N. Lv, and S. Li, "An intelligent path planning scheme of autonomous vehicles platoon using deep reinforcement learning on network edge," IEEE Access, vol. 8, pp. 99059–99069, 2020.
[20] S. Wen, Y. Zhao, X. Yuan, Z. Wang, D. Zhang, and L. Manfredi, "Path planning for active SLAM based on deep reinforcement learning under unknown environments," Intelligent Service Robotics, vol. 13, pp. 1–10, 2020.
[21] S. Koh, B. Zhou, H. Fang et al., "Real-time deep reinforcement learning based vehicle navigation," Applied Soft Computing, vol. 96, Article ID 106694, 2020.
[22] R. Ding, F. Gao, and X. S. Shen, "3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: a deep reinforcement learning approach," IEEE Transactions on Wireless Communications, vol. 19, no. 12, pp. 7796–7809, 2020.
[23] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, pp. 81–138, 1995.
[24] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, February 2016.
[25] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv:1511.06581, 2015.
[26] Raspberry Pi, documentation, 2021, https://www.raspberrypi.org/documentation/computers/raspberry-pi.html.

Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning

Loading next page...
 
/lp/hindawi-publishing-corporation/optimal-path-planning-for-wireless-power-transfer-robot-using-area-V1Rh9eAhu6
Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2022 Yuan Xing et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
eISSN
2052-8418
DOI
10.1155/2022/9921885
Publisher site
See Article on Publisher Site

Abstract

Hindawi Wireless Power Transfer Volume 2022, Article ID 9921885, 10 pages https://doi.org/10.1155/2022/9921885 Research Article Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning 1 1 1 1 2 Yuan Xing , Riley Young, Giaolong Nguyen, Maxwell Lefebvre, Tianchi Zhao , 3 4 Haowen Pan , and Liang Dong Department of Engineering and Technology, University of Wisconsin-Stout, Menomonie, WI 54751, USA Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA Changzhou Voyage Electronics Technology LLC, Changzhou, China Department of Electrical and Computer Engineering, Baylor University, Waco, TX 76706, USA Correspondence should be addressed to Yuan Xing; xingy@uwstout.edu Received 26 October 2021; Accepted 31 January 2022; Published 4 March 2022 Academic Editor: Narushan Pillay Copyright © 2022 Yuan Xing et al. *is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. *is paper aims to solve the optimization problems in far-field wireless power transfer systems using deep reinforcement learning techniques. *e Radio-Frequency (RF) wireless transmitter is mounted on a mobile robot, which patrols near the harvested energy-enabled Internet of *ings (IoT) devices. *e wireless transmitter intends to continuously cruise on the designated path in order to fairly charge all the stationary IoT devices in the shortest time. *e Deep Q-Network (DQN) algorithm is applied to determine the optimal path for the robot to cruise on. When the number of IoT devices increases, the traditional DQN cannot converge to a closed-loop path or achieve the maximum reward. In order to solve these problems, an area division Deep Q-Network (AD-DQN) is invented. *e algorithm can intelligently divide the complete charging field into several areas. In each area, the DQN algorithm is utilized to calculate the optimal path. After that, the segmented paths are combined to create a closed- loop path for the robot to cruise on, which can enable the robot to continuously charge all the IoTdevices in the shortest time. *e numerical results prove the superiority of the AD-DQN in optimizing the proposed wireless power transfer system. the far-field wireless power transmitters can charge the IoT 1. Introduction devices (including the mobile IoT devices) that are deployed *e wireless power transfer technique is proved to be the in a larger space. most effective solution to the charging problem as the However, the far-field wireless power transfer is still in number of the IoT devices grows drastically, since it is its infancy for two reasons. First, the level of power supply is impossible to replace the batteries of all IoT devices [1]. In very low due to the long distance between the power recent years’ Consumer Electronics Show (CES), a large transmitters and the energy harvesters. In [6], the authors mentioned that the existing far-field RF energy harvesting number of wireless power transfer products have come into consumers’ sights. *ere are two types of wireless power technologies can only achieve nanowatts-level power transmission products: near-field and far-field. In near-field transfer, which is too tiny to power up the high-power- wireless power transfer, the IoT devices, which are charged consuming electronic devices. 
In [3], the authors investi- by resonant inductive coupling, have to be placed very close gated the RF beamforming in radiative far field for wireless to the wireless transmitters (less than 5 cm) [2]. In far-field power transfer. *e authors demonstrated that, with wireless power transfer, the IoT devices use the electro- beamforming techniques, the level of the energy harvesting magnetic waves from transmitters as the power resource and can be boosted. However, as the distance between the the effective charging distance ranges from 50 centimeters to transceivers increases to 1.5 meters, the amount of the 1.5 meters [3–5]. Compared to the near-field transmitters, harvested energy is less than 5 milliwatts, which is still not 2 Wireless Power Transfer on, which does not satisfy the requirement of charging every ideal to power up the high-energy-consuming devices. Second, most of the existing wireless charging systems can regular time interval. In order to deal with this problem, area division deep reinforcement learning is proposed in this only effectively charge stationary energy harvesters. In [7], a set of wireless chargers (Powercast [8]) are deployed on the paper. At first, the algorithm divides the whole test field into square area. *e Powercast transmitters can adjust the several areas. In each area, DQN is utilized to calculate the transmission strategies to optimize energy harvested at optimal path. Next, the entire path is formulated with the the stationary energy harvesters. In [9], the Powercast paths of each separated area. In this way, a closed loop is wireless charger is mounted on the moving robot. guaranteed and the numerical results prove that the cal- *erefore, the charger is a mobile wireless charger, which culated path is also the optimal path. can adjust the transmission patterns of the stationary sensors while moving. However, the number of the IoT 2. System Model devices to be charged is too small. In order to wirelessly charge multiple IoT devices, some researchers proposed *e symbols used in this paper and the corresponding ex- using Unmanned Aerial Vehicle (UAV) to implement the planations are listed in Table 1. wireless power transfer [10–13]. *e UAV is designed to As shown in Figure 1, a mobile robot that carries two RF plan the optimal path to charge the designated IoT de- wireless power transmitters cruises on the calculated path to vices. However, it is very inefficient to use UAV to charge radiate the RF power to K nearby RF energy harvesters. Both the IoT devices, since UAV has very high power con- the power transmitter and the RF power harvesters are sumption and very short operational time. Installing the equipped with one antenna. *e power received at receiver wireless power emitter on the UAV will further shorten k, k ∈ K � {1, 2, . . . , K}, is the operational time of UAV. ηG G (λ/4π) In order to enhance the level of the energy harvesting tx rx p � p , (1) k tx and efficiency in charging a large number of energy-hungry l (L + μ) IoT devices, in this paper, we assembled the wireless power where p is the transmit power; G is the gain of the transfer robot and applied deep reinforcement learning tx tx algorithm to optimize its performance. In the system, the transmitter’s antenna; G is the gain of the receiver’s an- rx wireless transmitter aims to find the optimal path for the tenna; L is the distance between the transmitter and har- wireless power transfer robot. 
*e robot cruises on the path, vester k; η is the rectifier efficiency; λ denotes the wavelength which can charge each IoT device in the shortest time. DQN of the transmitted signal; l denotes the polarization loss; μ is the adjustable parameter due to Friis’s free space equation. has been widely used to play the complicated games which have a large number of system states even when the envi- Since the effective charging area is critical in determining the level of energy harvesting and it is the parameter to be ronment information is not entirely available [14]. Lately, a lot of researchers have started to implement DQN in solving adjusted at the transmitter, equation (1) is reexpressed using the effective area: the complicated wireless communication optimization problems because the systems are very complicated and ηS S cos α tx rx environment information is time-varying and hard to p � p , k tx (2) 2 2 l λ (L + μ) capture [15–18]. In particular, the researchers applied deep reinforcement learning to plan the optimal path for auto- where S is the maximum effective transmit area; S is the tx rx drive robots [19–22] and the robots can quickly converge to effective received area; α is the angle between the transmitter the optimal path. Henceforth, we found that DQN is a and the vertical reference line. perfect match to solve our proposed optimization problem. Since we consider the mobile energy harvesters in the However, those papers either only proposed the theoretical system, the distance and effective charging area may vary model or could not implement wireless power transfer over the time; we assume that the time is slotted and the functions. To the best of our knowledge, we are the first ones position of any mobile device within one time slot is con- to implement the automatic far-field wireless power transfer stant. In time slot n, the power harvested at receiver k can be system in the test field and invent a DQN algorithm to solve denoted as it. In our system, the entire test field is evenly quantified into ηS S cos α(n) the square spaces. *e time is slotted with the same interval. tx rx p (n) � p . (3) k tx 2 2 We consider the relative location of the robot in the test field l λ (L(n) + μ) as the system state, while we define the direction to move in the next time slot. At the beginning of each time slot, the For a mobile energy harvester, the power harvested in wireless power transfer robot generates the system state and different time slots is determined by the angle between the takes it as the input to DQN. *e DQN can generate the Q transmitter and the vertical reference line α(n) together with values for each possible action and the one with the max- the distance between the transmitter and the harvester L(n) imum Q value is picked to guide robot’s move during the in the time slot. current time slot. In our model, the mobile transmitter is free to adjust the As the number of IoT devices increases and the testing transmit angle α(n) and L(n) as it can move around the IoT field becomes more complicated, the traditional DQN devices. We assume that the effective charging is counted cannot generate the close-loop path for the robot to cruise only when α(n) � 0 and L(n) < � 45 cm. Wireless Power Transfer 3 Table 1: Symbols and explanations. Furthermore, DQN algorithm is applied to address the large number of system states. Symbol Explanation K *e number of energy harvesters η Rectifier efficiency 3.1. Problem Formulation. 
In order to model our optimi- G Gain of transmitter’s antenna tx zation problem as an RL problem, we define the test field G Gain of receiver’s antenna rx consisting of same area unit square, whose side length is λ Wavelength of transmitted signal 30 cm. K � 8 harvested energy-enabled IoT devices are l Polarization loss deployed in the test field, whose orders are 0, 1, 2, 3, 4, 5, 6, μ Friis’s free space adjustable parameter and 7, respectively. *e map is shown in Figure 2. *e system L Distance between transmitter and harvester p Transmit power state s at time slot n is defined as the position of a particular tx p Received power square where the robot is currently located at in the test field, Angle between transmitter and the vertical reference which is specified as s � pos(h, v), where h is the distance line between the present square and the leftmost edge, which is S Maximum effective transmit area tx counted by the number of squares. v indicates the distance S Effective received area rx between the present square and the upmost edge, which is n Time instant counted by the number of squares. For example, the No. 5 pos(h, v) Position h and v units to left and upmost edges IoT device can be denoted as o � pos(2, 0). *e shadow o Position of kth energy harvester area adjacent to No. k IoT devices indicates the effective eff Effective charging area for kth IoT devices charging area for the respective IoT devices, which is s Present system state s Next system state denoted as eff . For example, the boundary of effective a Action taken at n charging areas for No. 6 IoT device is highlighted in red. We T Total time consumption define the direction of movement in a particular time slot n ′ n Transition probability from state s to state s taking as the actions a . *e set of possible actions A consists of 4 p (a) s,s action a different A � {U, D, L, R}, where U is moving upward one w(s, a, s ) Reward function at state s taking action a unit, D is moving downward one unit, L is moving left one acc Indicator whether k − 1 harvesters have been charged k−1 unit, and R is moving right one unit. ζ Unit price for reward function Given the above, the mobile wireless charging problem π Optimal strategy can be formulated as minimizing the time duration T for the Q(s, a) Cost function at state s taking action a robot to complete running one loop at the same time the c Reward decay ′ robot has to pass through one of the effective areas of each σ(s , a) Learning rate for Q-learning p Selected location for ith area IoT device. W ith area minimize T, {a } 0 T P : (4) s � s , subject to ∃s ∈ eff , ∀k ∈ K, n � 1, 2, . . . , T. Time duration for the robot to complete running one loop is defined as T. *e starting position is the same as the last position, since the robot cruises in a loop. In the loop, the robot has to pass through at least one of the effective charging areas of each IoT device. Adapting to the different positions, the agent chooses different action at each time slot. Henceforth, we can model our proposed system as a Markov chain. In the system, we use the current position to specify a particular state s. S denotes the system state set. *e starting state s and final state s are the same, since the robot needs to move and return to the starting point. *e MDP process can be de- scribed as the agent chooses an action a from A at a specific Figure 1: Mobile wireless power transmitter cruises on the calculated system state s. 
The MDP process can be described as follows: at a specific system state s, the agent chooses an action a from A; after the action is taken, the system transits into a new state s′. p_{s,s′}(a), with s, s′ ∈ S and a ∈ A, denotes the probability that the system state transits from s to s′ under action a. The reward of the MDP is denoted as w(s, a, s′), which is defined for the transition from state s to state s′. Since the optimization problem is formulated as reaching s^T in the fewest time slots, the reward has to be defined so as to motivate the mobile robot not to repeatedly pass through the effective charging area of any IoT device. Besides, the rewards at different positions are interconnected with each other, since the goal of the optimization is to pass through the effective charging areas of all the IoT devices. We assume that the optimal order in which to pass through all the IoT devices is defined as o_0, o_1, ..., o_7, with o_k = 0, 1, ..., 7. Specifically, the reward function can be expressed as

\[ w(s, a, s') = \begin{cases} o_k\, \zeta, & s' \in \text{eff}_{o_k},\ \text{acc}_{k-1} = 1, \\ -1, & \text{otherwise}. \end{cases} \tag{5} \]

In the above equation, acc_{k−1} = 1 if the robot has already passed through an effective area of the o_{k−1}th IoT device, and ζ denotes the unit price of the harvested energy.

As we have defined all the necessary elements of the MDP, we can characterize the formulated problem as a stochastic shortest-path search that starts at s^0 and ends at s^T. At each system state s, we derive the best action a*(s) which can generate the maximum reward. The optimal policy set is defined as π = {a(s): s ∈ S}.

3.2. Optimal Path Planning with Reinforcement Learning. If the system dynamics obey a specific transition probability, reinforcement learning is a perfect match for solving the optimization problem. In this section, Q-learning [23] is first introduced to solve the proposed problem. After that, to address the large state and action sets, the DQN algorithm [14] is utilized to determine the optimal action for each particular system state.

3.2.1. Q-Learning Method. The traditional Q-learning method is widely used to solve dynamic optimization problems provided that the number of system states is moderate. Corresponding to each particular system state, the best action can be determined to generate the highest reward. Q(s, a) denotes the cost function, which uses a numerical value to describe the cost of taking action a at state s. At the beginning of the algorithm, every cost function is zero, Q(s, a) = 0, since no action has ever been taken to generate any consequence. All the Q values are saved in the Q table. Only one cost function is updated in each time slot, as the action is taken and the reward function is calculated. The cost function is updated as

\[ Q(s, a) = (1 - \sigma(s, a))\, Q(s, a) + \sigma(s, a)\left[ w(s, a, s') + \gamma f(s', a') \right], \tag{6} \]

where

\[ f(s', a') = \max_{a \in A} Q(s', a). \tag{7} \]

The learning rate is defined as σ(s, a), and γ is the reward decay.

When the algorithm initializes, the Q table is empty, since no exploration has been made to obtain any useful cost function to fill it. Because the agent has no experience of the environment, random action selection is implemented at the beginning of the algorithm. A threshold ϵ ∈ [0.5, 1] is designed to control the exploration. In each time slot, a numerical value p ∈ [0, 1] is generated and compared with the threshold. If p ≥ ϵ, action a is picked as

\[ a = \arg\max_{a \in A} Q(s, a). \tag{8} \]

However, provided that p < ϵ, an action is randomly selected from the action set A.

After iteratively updating the values in the Q table, the Q values converge. We can then calculate the best action corresponding to each state by

\[ \pi^*(s) = \arg\max_{a \in A} Q^*(s, a), \tag{9} \]

which corresponds to finding the optimal moving direction for each system state explored during the charging process.
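The tabular update of equations (6) and (7) and the threshold-based action selection of equation (8) can be sketched as follows; the learning rate, reward decay, and threshold values are placeholders, and states and actions are assumed hashable (for example, the (h, v) tuples and U/D/L/R moves above).

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # tabular cost function, Q(s, a) = 0 initially

def q_learning_step(s, a, r, s_next, actions, sigma=0.1, gamma=0.99):
    """Equations (6)-(7): mix the old estimate with the bootstrapped target."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - sigma) * Q[(s, a)] + sigma * target

def select_action(s, actions, eps=0.8):
    """Equation (8) with the paper's threshold rule: draw p in [0, 1] and
    exploit when p >= eps (eps in [0.5, 1]); otherwise explore randomly."""
    if random.random() >= eps:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)
```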
3.2.2. DQN Algorithm. The increase in the number of IoT devices leads to an increase in the number of system states. If the Q-learning algorithm were used, a very large Q table would have to be created and the convergence would be too slow. The DQN algorithm is more suitable, since its deep neural network can be well trained and then immediately determine the best action to take.

The deep neural network in the structure takes the system state as the input, and the Q value of each action is defined as the output. Hence, the function of the neural network is to generate the cost function for a particular state and action. We can describe this cost function as Q(s, a, θ), where θ denotes the weights of the neuron nodes in the structure. As we collect data when different actions are taken in different time slots, the neural network is trained to update its weights, so that it outputs a more precise Q value:

\[ Q(s, a, \theta) \approx Q(s, a). \tag{10} \]

There are two identical neural networks in the structure of DQN [24]: one is called the evaluation network eval_net, and the other is called the target network target_net. These two deep neural networks have the same structure, with multiple hidden layers defined for each network. We use the current system state s and the next system state s′ as the input to eval_net and target_net, respectively, and we use Q_e(s, a, θ) and Q_t(s′, a, θ′) to denote the outputs of eval_net and target_net. In this structure, only the evaluation network eval_net is continuously trained to update the weights of its neuron nodes. The target network is not trained; it periodically duplicates the weights of the neurons from the evaluation network (i.e., θ′ = θ). The loss function used to train eval_net is

\[ \text{Loss}(\theta) = E\left[ \left( y - Q_e(s, a, \theta) \right)^2 \right]. \tag{11} \]

We use y to represent the real Q value, which can be expressed as

\[ y = w(s, a, s') + \epsilon \max_{a' \in A} Q_t(s', a', \theta'). \tag{12} \]

We denote the learning rate as ϵ. The idea of backpropagation is utilized to update the weights of eval_net; as a result, the neural network is trained.

The experience replay method is utilized to improve the training effect, since it can effectively eliminate the correlation among the training data. Each single experience includes the system state s, the action a, and the next system state s′, together with the reward function w(s, a, s′). We define the experience set as ep = {s, a, w(s, a, s′), s′}. In the algorithm, D individual experiences are saved and, in each training epoch, only D_s (with D_s < D) experiences are selected from D. As the training process is completed, target_net copies the weights of the neurons from the evaluation network (i.e., θ′ = θ). D different experiences are generated from ep, while only D_s are picked to train the evaluation network eval_net. The total number of training iterations is denoted as U. Both the evaluation network and the target network share the same structure, in which the deep neural networks have N hidden layers.
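A minimal training-step sketch of this eval_net/target_net arrangement with experience replay is shown below. It uses present-day tf.keras rather than the TensorFlow 0.13.1 reported in Section 4, and the state dimension, optimizer, and learning rate are assumptions of the sketch; the two 100-node hidden layers follow the reported network size.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS, GAMMA = 2, 4, 0.99      # (h, v) state, moves U/D/L/R

def build_net():
    # eval_net and target_net share this structure (hidden layers of 100 nodes).
    return tf.keras.Sequential([
        tf.keras.layers.Dense(100, activation="relu", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS)])

eval_net, target_net = build_net(), build_net()
target_net.set_weights(eval_net.get_weights())        # theta' = theta
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
replay = deque(maxlen=20000)                          # experience pool ep

def train_step(batch):
    """One gradient step on the loss of equation (11) with the target of (12)."""
    s      = np.array([e[0] for e in batch], dtype=np.float32)
    a      = np.array([e[1] for e in batch], dtype=np.int32)
    r      = np.array([e[2] for e in batch], dtype=np.float32)
    s_next = np.array([e[3] for e in batch], dtype=np.float32)
    y = r + GAMMA * tf.reduce_max(target_net(s_next), axis=1)      # equation (12)
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(eval_net(s) * tf.one_hot(a, N_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(y - q))                    # equation (11)
    grads = tape.gradient(loss, eval_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
    return float(loss)

# During training: append (s, a, w, s') tuples to `replay`, call
# train_step(random.sample(replay, D_s)) once enough experiences are collected,
# and periodically refresh the target: target_net.set_weights(eval_net.get_weights()).
```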
3.2.3. Dueling Double DQN. In order to improve the performance of DQN so that it can effectively select the optimal action to charge multiple harvesters under time-varying channel conditions, we redesign the structure of the deep neural network using Dueling Double DQN. Double DQN is an advanced version of DQN which can prevent the overestimation problem that appears throughout the training [24]. Dueling Double DQN can efficiently solve this overestimation problem throughout the training process. For the same number of training epochs, Dueling Double DQN is proved to outperform the original DQN in learning efficiency.

In traditional DQN, as shown in equation (12), the target network target_net is designed to derive the cost function for a particular system state. Nevertheless, because we do not update the weights of the target network target_net in each training epoch, the training error will increase during training, hence prolonging the training procedure. In Double DQN, both the target network target_net and the evaluation network eval_net are used to calculate the cost functions: the evaluation network eval_net is used to select the best action for system state s′, and the target network scores it,

\[ y = w(s, a, s') + \epsilon\, Q_t\!\left( s', \arg\max_{a \in A} Q_e(s', a, \theta), \theta' \right). \tag{13} \]

The latest research proves that the training error can be dramatically reduced using the Double DQN structure [24].

In traditional DQN, we only define the cost function Q value as the output of the deep neural network. Dueling DQN is invented to speed up the convergence of the deep neural network by designing two individual output streams for the network. We use the output value V(s, θ, β) to represent the first stream of the neural network; it denotes the cost function for a specific system state. We name the second stream of the output the advantage output A(s′, a, θ, α), which is utilized to illustrate the advantage of using a specific action in a system state [25]. We define α and β as the parameters that correlate the outputs of the two streams with the neural network. The cost function can be denoted as

\[ Q(s, a, \theta, \alpha, \beta) = V(s, \theta, \beta) + \left( A(s', a, \theta, \alpha) - \frac{1}{|A|} \sum_{a} A(s', a, \theta, \alpha) \right). \tag{14} \]

The latest research proves that Dueling DQN can speed up the training procedure by efficiently eliminating the additional degree of freedom while training the deep neural network [25].
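The two modifications can be sketched compactly as below: the Double DQN target of equation (13), where eval_net selects the next action and target_net scores it, and the dueling aggregation of equation (14). The names used and the single discount-style factor gamma standing in for the ϵ of (13) are assumptions of this sketch.

```python
import tensorflow as tf

def double_dqn_target(reward, s_next, eval_net, target_net, gamma=0.99):
    """Equation (13): the action is chosen by eval_net, its value by target_net."""
    a_star = tf.argmax(eval_net(s_next), axis=1)
    q_next = tf.gather(target_net(s_next), a_star, batch_dims=1)
    return reward + gamma * q_next

class DuelingHead(tf.keras.layers.Layer):
    """Equation (14): a state-value stream V and an advantage stream A,
    recombined with the mean-advantage correction."""
    def __init__(self, n_actions, **kwargs):
        super().__init__(**kwargs)
        self.value = tf.keras.layers.Dense(1)
        self.advantage = tf.keras.layers.Dense(n_actions)

    def call(self, features):
        v = self.value(features)          # V(s)
        a = self.advantage(features)      # A(s, a)
        return v + a - tf.reduce_mean(a, axis=1, keepdims=True)
```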
3.3. Area Division Deep Reinforcement Learning. In this paper, the optimization problem can be seen as calculating the optimal closed-loop path that generates the maximum accumulated reward. However, the traditional DQN has difficulty converging to the optimal path because of the complicated experimental field. In order to improve on the performance of traditional DQN, we propose an AD-DQN in this paper. First, the experimental field is divided into multiple separate parts. DQN is run on each part individually to obtain the optimal path for the robot in that part. Finally, the entire closed-loop path is formed from the paths of the individual parts.

In the area division, the whole area is defined as W. The whole area is divided at multiple specific locations p_i ∈ P. The criterion for picking p_i is to find the squares that lie in more than one effective charging area of the IoT devices:

\[ \forall p_i \in P:\quad p_i \in \text{eff}_m,\ p_i \in \text{eff}_n,\quad m, n = 0, 1, \ldots, K-1,\ m \neq n. \tag{15} \]

For each p_i, we define N_i = {p_i}. We define the set K_e = {o_{arg p_i ∈ eff_j}, j = 0, 1, ..., K − 1}. In the clockwise direction, we find the IoT device o_i that has the shortest distance to p_i, and then add both o_i and the effective charging area of o_i to N_i. The new area can be expressed as

\[ N_i = N_i \cup \{o_i\} \cup \{\text{eff}_i\}. \tag{16} \]

Next, we find the IoT device having the shortest distance to the IoT device o_i that was just added to the set N_i, and then add both this new IoT device and its effective charging area to N_i. Iterating in this way, all the IoT devices besides the ones in K_e are included in one of the N_i. Finally, all the remaining squares are classified to the nearest N_i, so that {N_i} = W.

In each area, DQN is run to determine the optimal path for the robot. In each area, the starting point is the position of p_i, and the end point is one of the effective charging squares of the IoT device furthest from the starting point in the same area. After the optimal path is calculated for each individual area, the closed-loop optimal path for the entire area can be synthesized. The complete procedure is shown in Algorithm 1.

Algorithm 1: AD-DQN.
(i) Define E = {eff_k, k = 0, 1, ..., K − 1}. Among E, find all the area division points p_i by {p_i} = {pos(h, v) | pos(h, v) ∈ eff_m, pos(h, v) ∈ eff_n, m, n ∈ K}.
(ii) The number of area division points is defined as |P|.
(iii) i = 1, ..., |P|. The number of areas to be divided is |P| + 1.
(iv) K_e = {o_{arg p_i ∈ eff_j}, j = 0, 1, ..., K − 1}.
(v) for i = 1, ..., |P|:
(vi) r_1 = p_i. r_2 = p_i.
(vii) while ∄ o_g ∈ N_i, o_g ∈ N_{I∖{i}}:
(viii) if i ≤ |P|:
(ix) In the clockwise direction, find the IoT device that has the shortest distance to r_1. The index of this IoT device is g = argmin_{o_i ∉ K_e} |o_i − r_1|. N_i is updated as N_i = N_i ∪ {o_g} ∪ {eff_g}. r_1 = o_g.
(x) else
(xi) In the counterclockwise direction, find the IoT device that has the shortest distance to r_2. The index of this IoT device is g = argmin_{o_i ∉ K_e} |o_i − r_2|. N_i is updated as N_i = N_i ∪ {o_g} ∪ {eff_g}. r_2 = o_g.
(xii) end
(xiii) end while
(xiv) end
(xv) for i = 1, ..., |P|:
(xvi) Define the set W_i.
(xvii) W_i = N_i ∪ {pos(h, v) | argmin_{pos(h,v) ∈ W} |pos(h, v) − o_k|, k ∉ K_e}.
(xviii) end
(xix) for i = 1, 2, ..., |P| + 1:
(xx) for j = 1, 2, ..., |J|:
(xxi) The starting point is defined as p_i. The end point is defined as e_j ∈ eff_c, j ∈ J.
(xxii) The weights θ of the neuron nodes are randomly generated for eval_net and copied to target_net, θ′ = θ. u = 1. D = d = 1.
(xxiii) while u < U: s = s^0. t = 1.
(xxiv) A probability is generated as a numerical parameter p_ch ∈ [0, 1].
(xxv) if D > 200 and p_ch ≥ ϵ:
(xxvi) a = argmax_{a∈A} Q(s, a)
(xxvii) else
(xxviii) Randomly choose the action from the action set A.
(xxix) end
(xxx) while s ≠ s^T:
(xxxi) The state transits into s′ after taking the action. d = d + 1. ep(d) = {s, a, w(s, a, s′), s′}. D keeps unchanged if it goes over the experience pool's limit and d = 1; otherwise, D = d. t = t + 1. s = s′. After enough data has been collected in the experience pool, eval_net is trained using D_s of the D experiences. Minimize the loss function Loss(θ) using back-propagation. target_net copies the weights from eval_net periodically.
(xxxii) end while
(xxxiii) end while
(xxxiv) The optimal path of the entire test field is synthesized with the optimal path in each W_i.
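The division criterion of equation (15) and the nearest-device growth step can be illustrated with the small helpers below. They are a loose sketch only: Manhattan distance on the grid is assumed as the distance measure, and the clockwise/counterclockwise bookkeeping of Algorithm 1 is not reproduced.

```python
def find_division_points(effective_areas):
    """Equation (15): squares that belong to the effective charging area of
    more than one IoT device are candidate division points p_i."""
    points = []
    squares = {sq for area in effective_areas for sq in area}
    for sq in squares:
        owners = [k for k, area in enumerate(effective_areas) if sq in area]
        if len(owners) > 1:
            points.append((sq, owners))
    return points

def nearest_device(ref, candidates, positions):
    """Pick the not-yet-assigned device closest to a reference square; the
    Manhattan distance used here is a simplifying assumption."""
    return min(candidates, key=lambda k: abs(positions[k][0] - ref[0])
                                         + abs(positions[k][1] - ref[1]))
```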
IoT devices, that has the shortest distance to In the test field, 8 harvested energy-enabled IoT devices r . *e order of the IoT device is: are placed as Figure 2 indicates. *e top view of the test field g � argmin |o − r |. N is updated as: can be seen as a 2D map. Henceforth, the map is modeled o ∉ K i 1 i i e N � N ∪ o ∪ eff . r � o . and inputted into the computer. *en the AD-DQN algo- 􏽮 􏽯 􏽮 􏽯 i i g g 1 g rithm is implemented in computer using Python and the (x) else optimal charging path can be derived. At the same time, a (xi) In the counterclockwise direction, find the wireless power transfer robot is assembled. Two Powercast the IoT devices, that has the shortest distance to RF power transmitters TX91501 [8] are mounted on two r . *e order of the IoT device is: sides of the Raspberry Pi [26] enabled intelligent driving g � argmin |o − r |. N is updated as: o ∉ K i 2 i i e robot. Each transmitter is powered by 5 V power bank and N � N ∪ 􏽮o 􏽯 ∪ 􏽮eff 􏽯. r � o . i i g g 2 g continuously emits 3 Watts RF power. *e infrared patrol Wireless Power Transfer 7 Model the 2D harvested energy enable loT devices map in computer by inputting the position of each device Assemble the robot and configure the Implement Area Division Deep robot with raspberry pi 4B Reinforcement Learning using Python microcontroller and derive the optimal charge path Assemble two RF power transmitters on Build the road corresponding to -1 the raspberry pi 4B controlled robot the calculated path in the test field -2 -3 Implement mobile wireless power Install the infrared patrol modules transfer on test field using well on the robot configured auto drive robot -4 Figure 3: Flowchart of wireless power transfer implementation. -5 0 1000 2000 3000 4000 5000 6000 7000 8000 Training time slots module is installed on the robot to implement the autodrive reward on the test field; henceforth, the robot can automatically reward cruise on along the path and continuously charge the reward multiple IoTdevices, as shown in Figure 1. To the best of our knowledge, we are the first ones to implement the automatic Figure 4: *e average rewards of reward , reward , and reward 1 2 3 wireless power transfer system in the test field and invent versus the training episodes in area I of the experimental field. AD-DQN algorithm to design the optimal path for the wireless power transfer robot. Since we are the first ones to design and implement the mobile far-field wireless power transfer system, there is no hardware reference design we can refer to and use for validation. So the validation of our work is done in the software aspect. But referring to the flowchart, our mobile wireless power transfer system can be replicated. For the software, we use TensorFlow 0.13.1 together with Python 3.8 in Jupyter Notebook 5.6.0 as the software sim- ulation environment to train the AD-DQN. *e number of hidden layers is 4 and each hidden layer owns 100 nodes. *e learning rate is less than 0.1. *e mini-batch size is 10. *e 6 learning frequency is 5. *e training starting step is 200. *e experience pool is greater than 20000. *e exploration in- terval is 0.001. *e target network replacement interval is greater than 100. Reward decay is 0.99. First, different reward functions are tested for the op- 0 500 1000 1500 2000 2500 3000 timal one. Reward one reward is defined using equation (5). Training episodes *e unit price is defined as ζ � 4. 
First, different reward functions are tested to find the optimal one. Reward one, reward_1, is defined using equation (5), with the unit price defined as ζ = 4. Reward two, reward_2, is defined as

\[ w(s, a, s') = \begin{cases} \zeta, & s' \in \text{eff}_{o_k},\ \text{acc}_{k-1} = 1, \\ -1, & \text{otherwise}, \end{cases} \tag{17} \]

where ζ = 4. Reward three, reward_3, is defined with equation (5) but with ζ = 2. Two factors are observed to assess the performance of the different rewards: the average reward during training and the average time consumption during training.

Based on the procedure of AD-DQN in Algorithm 1, the experimental field is divided into two areas along the only shared effective charging area of device 2 and device 3. Area I includes IoT devices 2, 3, 4, 5, and 6, while area II includes IoT devices 0, 1, 2, 6, and 7. In area I, the performances of the three different rewards are compared in Figures 4 and 5. In area II, the performances of the three different rewards are compared in Figures 6 and 7.

Figure 4: The average rewards of reward_1, reward_2, and reward_3 versus the training episodes in area I of the experimental field.
Figure 5: The average time consumption achieved by reward_1, reward_2, and reward_3 versus the training episodes in area I of the experimental field.
Figure 6: The average rewards of reward_1, reward_2, and reward_3 versus the training episodes in area II of the experimental field.
Figure 7: The average time consumption achieved by reward_1, reward_2, and reward_3 versus the training episodes in area II of the experimental field.

From Figures 4 and 5, we can observe that reward_1 is optimal. Since all three rewards perform similarly on time consumption, and reward_1 attains the highest average reward among them, reward_1 can effectively charge the most IoT devices compared with the other two rewards. From Figures 6 and 7, we can observe that reward_1 performs best on the time consumption to complete one episode and achieves a much higher average reward than reward_3. This can be explained as follows: compared with reward_1, reward_3 can effectively charge only a smaller number of IoT devices. Overall, reward_1 has the optimal performance in both areas I and II; hence, reward_1 is used to define the reward for AD-DQN.

In Figures 8 and 9, the performances of four different algorithms are compared. Random action selection randomly selects the action in the experimental test field. The same as for AD-DQN, reward_1 is used as the reward for Q-learning and DQN. We define the successful charging rate as the number of IoT devices that can be successfully charged in one complete charging episode over the total number of IoT devices.

Figure 8: The effective charging rate of random action selection, Q-learning, DQN, and AD-DQN versus the total number of IoT devices.

From Figure 8, we can observe that random action selection has the worst successful charging rate. This can be explained as follows: random action selection never converges to either a suboptimal or an optimal path. Q-learning performs better than random action selection; however, it is outperformed by the other two algorithms, since Q-learning can only deal with simple reinforcement learning models. DQN performs better than Q-learning and random action selection; however, it is outperformed by AD-DQN, because the rewards for different states are defined as interconnected, and even with a reward decay of 0.99, DQN still cannot learn the optimal solution. When the total number of IoT devices decreases, DQN and AD-DQN perform the same, since the decrease in the number of IoT devices weakens the interconnections between different system states.
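The two quantities compared in Figures 8 and 9 can be measured for a single rollout as in the sketch below, which reuses the hypothetical ChargingField environment sketched in Section 3.1; the step limit is an arbitrary safeguard, and policy is any mapping from a state to one of U, D, L, R.

```python
def evaluate_episode(env, policy, max_steps=500):
    """Roll out one charging episode and report the successful charging rate
    (Figure 8) and the number of time slots consumed (Figure 9)."""
    state, done, steps = env.reset(), False, 0
    while not done and steps < max_steps:
        state, done = env.step(policy(state))
        steps += 1
    rate = len(env.charged) / len(env.effective_areas)
    return rate, steps
```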
From Figure 9, we can observe that, compared with the other algorithms, AD-DQN is not the one consuming the fewest time slots to complete one charging episode; however, AD-DQN is still the optimal algorithm, since all the other algorithms cannot achieve a 100% effective charging rate, which is why they consume fewer time slots to complete one charging episode.

Figure 9: The average time consumption of random action selection, Q-learning, DQN, and AD-DQN versus the total number of IoT devices.

In Figure 10, the optimal path determined by AD-DQN is shown as the bold black line. The arrows on the path show the direction in which the robot moves, as we assume that the robot is regulated to cruise on the path in the counterclockwise direction. In this way, the robot can continuously charge all the IoT devices. The experimental demonstration is shown in Figure 1.

Figure 10: The optimal path determined by AD-DQN. The bold black line indicates the path for the wireless power transfer robot.

5. Conclusions

In this paper, we invent a novel deep reinforcement learning algorithm, AD-DQN, to determine the optimal path for the mobile wireless power transfer robot to dynamically charge the harvested energy-enabled IoT devices. The invented algorithm can intelligently divide a large area into multiple subareas, implement an individual DQN in each area, and finally synthesize the entire path for the robot. Compared with the state of the art, the proposed algorithm can effectively charge all the IoT devices on the experimental field. The whole system can be used in many application scenarios, such as charging IoT devices in dangerous areas and charging medical devices.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors' Contributions

Yuan Xing designed the proposed wireless power transfer system, formulated the optimization problem, and proposed the innovative reinforcement learning algorithm. Riley Young, Giaolong Nguyen, and Maxwell Lefebvre designed, built, and tested the wireless power transfer robot on the wireless power transfer test field. Tianchi Zhao optimized the performance of the proposed deep reinforcement learning algorithm. Haowen Pan implemented the comparison of the system performance between the proposed algorithm and the state of the art. Liang Dong provided the theoretical support for the far-field RF power transfer technique.

Acknowledgments

This work was supported by the WiSys Technology Foundation under a Spark Grant.

References

[1] F. Giuppi, K. Niotaki, A. Collado, and A. Georgiadis, "Challenges in energy harvesting techniques for autonomous self-powered wireless sensors," in Proceedings of the 2013 European Microwave Conference, pp. 854–857, Nuremberg, Germany, October 2013.
[2] A. M. Jawad, R. Nordin, S. K. Gharghan, H. M. Jawad, and M. Ismail, "Opportunities and challenges for near-field wireless power transfer: a review," Energies, vol. 10, no. 7, p. 1022, 2017.
[3] P. S. Yedavalli, T. Riihonen, X. Wang, and J. M. Rabaey, "Far-field RF wireless power transfer with blind adaptive beamforming for internet of things devices," IEEE Access, vol. 5, pp. 1743–1752, 2017.
[4] Y. Xing and L. Dong, "Passive radio-frequency energy harvesting through wireless information transmission," in Proceedings of the IEEE DCOSS, pp. 73–80, Ottawa, ON, Canada, June 2017.
[5] Y. Xing, Y. Qian, and L. Dong, "A multi-armed bandit approach to wireless information and power transfer," IEEE Communications Letters, vol. 24, no. 4, pp. 886–889, 2020.
[6] Y. L. Lee, D. Qin, L.-C. Wang, and G. H. Sim, "6G massive radio access networks: key applications, requirements and challenges," IEEE Open Journal of Vehicular Technology, vol. 2, 2020.
[7] S. Nikoletseas, T. Raptis, A. Souroulagkas, and D. Tsolovos, "Wireless power transfer protocols in sensor networks: experiments and simulations," Journal of Sensor and Actuator Networks, vol. 6, no. 2, p. 4, 2017.
[8] Powercast, https://www.powercastco.com/documentation/, 2021.
[9] C. Lin, Y. Zhou, F. Ma, J. Deng, L. Wang, and G. Wu, "Minimizing charging delay for directional charging in wireless rechargeable sensor networks," in Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1819–1827, IEEE, Paris, France, April-May 2019.
[10] H. Yan, Y. Chen, and S.-H. Yang, "UAV-enabled wireless power transfer with base station charging and UAV power consumption," IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12883–12896, 2020.
[11] Y. Liu, K. Xiong, Y. Lu, Q. Ni, P. Fan, and K. B. Letaief, "UAV-aided wireless power transfer and data collection in Rician fading," IEEE Journal on Selected Areas in Communications, vol. 39, no. 10, pp. 3097–3113, 2021.
[12] W. Feng, N. Zhao, S. Ao et al., "Joint 3D trajectory design and time allocation for UAV-enabled wireless power transfer networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 9, pp. 9265–9278, 2020.
[13] X. Yuan, T. Yang, Y. Hu, J. Xu, and A. Schmeink, "Trajectory design for UAV-enabled multiuser wireless power transfer with nonlinear energy harvesting," IEEE Transactions on Wireless Communications, vol. 20, no. 2, pp. 1105–1121, 2020.
[14] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," arXiv:1312.5602, 2013.
[15] Y. He, Z. Zhang, F. R. Yu et al., "Deep reinforcement learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10433–10445, 2017.
[16] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Proceedings of the 29th Conference on Advances in Neural Information Processing Systems, pp. 2137–2145, Barcelona, Spain, May 2016.
[17] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, "A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs," in Proceedings of the 2017 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, Paris, France, May 2017.
[18] Y. Xing, H. Pan, B. Xu et al., "Optimal wireless information and power transfer using deep Q-network," Wireless Power Transfer, vol. 2021, Article ID 5513509, 12 pages, 2021.
[19] C. Chen, J. Jiang, N. Lv, and S. Li, "An intelligent path planning scheme of autonomous vehicles platoon using deep reinforcement learning on network edge," IEEE Access, vol. 8, pp. 99059–99069, 2020.
[20] S. Wen, Y. Zhao, X. Yuan, Z. Wang, D. Zhang, and L. Manfredi, "Path planning for active SLAM based on deep reinforcement learning under unknown environments," Intelligent Service Robotics, vol. 13, pp. 1–10, 2020.
[21] S. Koh, B. Zhou, H. Fang et al., "Real-time deep reinforcement learning based vehicle navigation," Applied Soft Computing, vol. 96, Article ID 106694, 2020.
[22] R. Ding, F. Gao, and X. S. Shen, "3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: a deep reinforcement learning approach," IEEE Transactions on Wireless Communications, vol. 19, no. 12, pp. 7796–7809, 2020.
[23] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, pp. 81–138, 1995.
[24] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, vol. 2, p. 5, Phoenix, AZ, USA, February 2016.
[25] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv:1511.06581, 2015.
[26] RaspberryPi, "Tx91501b-915MHz Powercaster transmitter," 2021, https://www.raspberrypi.org/documentation/computers/raspberry-pi.html.
