Document Type : Research Paper
Authors
^{1} Ph.D. Scholar, School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi110067,
^{2} Assistant Professor, School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi110067.
^{3} Ph.D. Scholar, School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi110067.
^{4} Ph.D., School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi110067.
Abstract
Keywords
The Internet of Things (IoT) is the network of physical objectsdevices, vehicles, buildings and other items embedded with electronics, software, sensors and wireless network connectivity that enable these objects to collect and exchange data (Akyildiz IF et al. (2002), Kashyap, P. K. (2019) & Kashyap, P. K., Kumar, S., and Jaiswal, A. (2019) ). Each of these smart device is uniquely identified by Internet address protocol (IP) to forward the data packet from source to destination (F. Bouabdallah et al. (2008)). The IoT has huge application in almost all the sectors of human being such as healthcare facilities, industrial organization, vehicular network, military operations, business organization and many more (Ishmanov F et al. (2011) & Ahmed AA and Mohammed Y, (2007)). These smart devices have limited battery power to perform complex computation and forward the data packet. Due to tremendous upsurge in the connected number of ubiquitous devices, there is large number of data packets travelled in the IoT network. So, there is need to choose the energy balanced routing path to forward the data packet so that lifetime of the network is improved.
In order to support various complicated application, IoT nodes have to perform the reliable operation with their limited energy, computational resources and bandwidth effectively so that it reduce the timedelay for data transmission using shortest routing path, transmission errors and ultimately improve the lifetime of the network. In this regard software define network is combined with the IoT network that separate the hardware and software control operation efficiently to cope with the mentioned challenged (J. Zhou, et al. (2016)). Therefore, dynamic routing rules for the IoT nodes provide novel data forwarding strategy, but lack in the presence of stochastic nature of channel path.
Machine learning (ML) techniques have been extensively used in the IoT network to finding the optimal route for data forwarding in the recent decade (C. Guestrin et al. (2004), Kashyap, P.K, Kumar, S. (2019), & A. Jaiswal et al. (2020)). Machine learning techniques provides learning ability to IoT network through experience, and reinforcement learning (RL) works on learning agent that improve its learning capability based on received rewards according to their taken action. By exploitation of their gained knowledge and exploration of the environment RL agent maximize its rewards (R. S. Sutton (2018)). Reinforcement learning techniques requires low computation resources with lower implementation efforts to output effective results with higher accuracy. The output of the system nearly optimal and has higher flexibility according to the changes in the environment without prior knowledge of the network. Thus, reinforcement learning and Qlearning are best suited techniques for routing approaches in the IoT network that build path with lower redundancy.
Under these circumstances, there is need energy balanced routing algorithm based on reinforcement learning approach that include residual energy, physical distance and link quality of the channel for data transmission. The major contribution of the paper as follow:
The rest of the paper is divided into following sections. Section II described the system models used in the IoT network using graph approach. Section III explains the Qlearning based routing protocol in the IoT network. In Section IV, simulation and results are analyzed for the proposed algorithm and stateofartalgorithms. Finally, Conclusion of this paper is presented along with future scope in the section V.
Network Setting
We consider an energyconstrained Internet of Thing network that has finite number of sensor nodes, which are randomly deployed in a given monitoring area. Each node in the network can only communicate with the neighbouring nodes that are within its transmission range. Data transmission from one node to another takes place in synchronised time slots. Here, it is considered that each data transmission from a source node to destination takes place by using a number of intermediate nodes present along the route in the network. Each node has a single antenna, a finite battery which can be recharged periodically and works in a halfduplex mode.
The wireless connection between nodes of IoT are affected by many factors, such as residual energy of node, physical distance, channel gain etc. that makes the edge length and network state of dynamic nature in many scenarios. Here, we represent the network as a graph G = (V, E, L)with stochastic edge length, where V is the set of vertices i.e. the sensor nodes and E = (eij), such that vi, vj ϵ V, is the set of edges and L represents the probability distribution of each edge length. An edge exists between vertices vi and vj in the graph only when node j is the neighbor of node i. The nodes in the transmission range of a node constitute its neighbourhood. The length of edge (vi, vj) is denoted as l(vi, vj) and is considered as a random variable. The channel between neighbouring nodes is assumed to follow quasistatic block Rayleigh fading model and the channel gain between neighbouring nodes vi and vj are modelled as Markov chain. The transition probability of from at any time instant t is given as and is unknown to the network or sensors.
Energy Consumption and residual energy Model
In IoT network, the energy is consumed for carrying out sensing, processing and communication (transmitting/receiving) activities. Out of these, data communication consumes most of the energy of a node so energy consumed for communication only is considered during routing. For simplicity, only the energy consumed for transmissions is accounted and energy spent for receiving is ignored as the idle and receiving nodes consume almost same amount of energy [20].According to the first order radio model presented in [20], for a message having b bits, the energy consumed for its transmission from to nodewith edge length l ( , ) is calculated as
(1)
(2)
After data transmission residual energy of a node at any hop, at time slot can be evaluated as follow
(3)
Where is the maximum battery capacity of a node, is used for calculating the threshold distance (l_{0}) which in turn is used to determine the power loss model to be used i.e. whether to use free space model or multipath fading model. Free space model is utilized when the distance between sender and receiver is less than the threshold distance otherwise multipath fading model is applied for calculating the energy consumed for transmission purposes. is the energy requirements of transmitter and receiver circuit, and are the energy consumed for amplifying transmission in order to attain a satisfactory signal to noise ratio (SNR) and l is the communication edge length.
Energy Balanced Routing Problem in IoT
The sensor nodes in the IoT network capture the desired data and forward this data to the destination node. Due to resource constraints in WSNs such as short communication range, limited processing potential and limited battery power, the source node, instead of communicating directly with the destination, communicates indirectly through its neighbours (multihop) as this leads to higher energy efficiency as compared to direct communication. In a multihop communication environment, routing algorithm is needed to find a communication path from source node to the destination node. Multihop communication overcomes the problem of energy inefficiency and shortrange communication faced in direct communication but it steers imbalanced consumption of energy in the network (Khan, Tayyab, et al. (2019)) as the intermediate nodes deplete their battery faster while relaying the data of other nodes. The nodes in the vicinity of sink are the most affected ones. Therefore, the routing algorithm needs to find a path which balances the energy consumption of the nodes in the network so that all the nodes deplete their energy nearly at the same time which in turn results in increased network lifetime. A routing path rp in the network graph is defined as a sequence of distinct sensors in the WSN starting from source node and ending at destination node i.e. such that are adjacent vertices for and =source node and =destination node. The path rp having sensor nodes has a length of The main aim of this paper is to find an optimal routing path between a source and destination node in order to minimize the total energy consumption and transmission delay for a reliable communication.
In this section, we propose a QLearning based efficient routing protocol to find an optimal and reliable route from a source to destination node in order to reduce the total energy consumption and minimize the total transmission delay (based on shortest distance), which can ultimately improve the network lifetime.
Problem Modelling
The stochastic optimal routingpath finding problem is modelled as an MDP. Qlearning updating rules are used to learn an optimal policy. Here, the learning agent selects an action in order to interact with the environment (stochastic graph) to reach the next neighboring node in the route, to get an optimal path from source to destination subject to maximize the expected reward obtained. MDP can be defined as follow:
State ‘ ’:Each sensor node in the network and the corresponding channel gain towards its neighbor nodes is modelled as a state. The current sensor node in the routing pathfinding process and current channel gain is considered as a current state.
Action ‘ ’:All the out link neighbor nodes are considered in the action set of a state.
Transition ‘ ’: The next state is determined by the action selection in current state.
Reward‘ ’: Reward for a stateaction pair (s, a) is calculated by using utility value which is the combination of nodes’ residual energy, edge length and nodes’ energy consumption and link quality.
Definition 1 (edge length: distance between nodes): Following formula is used to compute the edge length between any two node
l( , )= (4) where ( ) and ( ) are coordinates of node , and respectively.
Definition 2 (edge length based path): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on edge length is represented as
(5)
Where is the edge length between two nodes at hop. And, the path with minimum edge length guarantees that the transmission delay is minimum.
Definition 2 (routing path based on energy):Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on residual energy is represented as
(6) Where is the residual energy of transmitting node at hop.Residual energy of a node can be computed using Eq. (3).
Definition 3 (routing path based on link quality): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on better link quality is represented as
(7) Where is the normalized link quality at hopand given as
(8)
Where are signal strength at hop and maximum signal strength. Thus, the reward (utility) obtainedfor transition from to state after taking an action at time slot t being at hop can be computed as
(9)
Where , , , and are some prescribed positive weights parameter which reflect the importance of residual energy, link quality, edge length and energy consumption respectively in calculation of reward, and are number of neighbors for . The weight parameters are closely related to each other, such that if is set to zero, the presented model emphasize on the maximization of the residual energy in the routing path and transmission delay is ignored i.e., independent of number of intermediate hop. When is set to zero, the presented algorithm pays more attention in reducing the transmission delay of packet and ignores residual energy of the sensor node. Thus, in both above case lifetime of the network is not optimal because of tradeoff between . Thus, we can adjust these parameter values according to our needs.
To find an optimal routing path, the learning agent initially perceive the starting state , i.e., the source node and channel gain of the links towards its neighbors, and then selects an action using the current policy, till the agent arrives at the destination node. The statevalue is updated using the temporal difference method. The updating rule for Qlearning is
) (10)
where, ζ ϵ [0,1] denotes learning rate, is discount factor. Qlearning adopts greedy policy for action selection, i.e, it select optimal action having maximum statevalue with probability 1 and a random action with probability . The main aim of the learning agent is to find an optimal policy for selecting an optimal routing path. The optimal policy denotes the statevalue which is greater than other policy’s statevalue. The optimal routing problem can be expressed as
(11)
The formulated problem in (10) can be optimally solved by a QLearning method. The goal of proposed routing problem is to maximize the total reward over all routing paths starting at source node, such that rp and is the set of all paths in G starting from source node and ending at destination.
In routing process, each node in the network keeps a routing table which is used to select the next hop for the data transmission. The routing table contains the information about next possible nodes which can be reachable to all possible destinations in the network. This table is updated after each data transmission to store the information of node which is good for further data forwarding based on the obtained reward.
QLearning based Routing Algorithm
In this section, we present a learning based routing algorithm, particularly using Qlearning approach.
Algorithm QLRA
Initially, we initialize the Qtable with all zero values. After that current state is observed. Based on the current state an action is selected from the available actions at (line no. 15). After executing the action, a reward and next state are obtained (line no. 16). Using the achieved reward, statevalue is updated (line no. 17). And now next state becomes current state. The algorithm converge either source node find a routing path to reach destination node or for the maximum number of episode.
Time Complexity
The time complexity of the Qlearning based routing algorithm mainly has three aspects: (1) the algorithm continues until it find destination node in the line no.9 i.e., number of intermediate node in the routing path is ( ). (2) Selecting an action (choose neighbor node) from the set of neighbors’ nodes subject to maximize the expected discount reward in the network. The set of neighbors is represented in the form of matrix i.e., but for the single node (state) linear search apply on the single desired row takes time in line no.10 to line no.15. Thereafter line no.16 to line no. 19 requires constant time to update the Qvalue (3).The algorithm runs in worst case are equal to the number of episode until convergence (line no.5). Thus, overall time complexity of the algorithm is .
In this section, firstly, the convergence performance of proposed QLearning routing algorithm is analyzed over learning trails and link quality in terms of steps until convergence and reward (utility) respectively. Secondly, comparative analysis of the proposed algorithm against Random learning algorithm and without (w/o) learning algorithm done with respect four metrics: 1) Reward (Utility) 2) Residual energy 3) Energy consumption 4) Edge length. All these algorithms are simulated using same values of parameters for energy model and network conditions.
Simulation Environment
The simulation is carried out using MATLAB in a square area having 50 sensor nodes are randomly distributed. The communication range and initial energy of all the sensor nodes set to be20 m and 0.5 J respectively. The maximum bandwidth for each of the communication link in the network is 100Mbps. Without loss of generality, one source node and one destination node is selected randomly for the performance analysis of all the stateofartalgorithms. The others simulations parameters are shown in Table 1.
Table 1. Simulation Parameters
Parameter 
Value 
Parameter 
Value 


Initial Energy 


0.7 

0.92 

1000 

[0,1] 

3 



50 dBm 



0.7 



0.3 










Result Analysis
Figure 1. Convergence performance over learning trails
Figure 1 illustrates the convergence performance of the proposed QRL based energy balanced algorithm over number of learning trails. It can be clearly observed from result, the RL agent takes 940 steps to converge for the first trails of learning rate thereafter it took less number of steps approximate 560 steps for the very less value of learning rate .The number of steps on average higher than 200 steps for the taking the value of , whereas the best performance to achieve convergence by the RL agent for the value of . Thus, it is necessary to choose the learning rate with caution for convergence in small number of steps towards maximization of reward i.e. reduce the energy consumption and minimize the number of hop counts.
Figure 2. Reward (Utility) over link Quality
Figure 2 illustrates that the reward (utility) of the proposed Qlearning routing algorithm with respect to link quality under different residual energy of the intermediate hops in the routing path. The link quality describes the nature of communication path between the sensor nodes. And further, link quality depends upon the residual energy of the intermediate hops. If the residual energy of the communicating nodes is high then the link quality is also better and viceversa. It can be observed from the results, at the beginning, the reward of the proposed algorithm increases rapidly then become stationary as the value of link quality improves. This is because of the learning capability of the proposed algorithm to optimize the reward faster at link quality’s value . It is also worth to note down reward of the proposed algorithm cannot increase with further increase in the link quality. This is due to the fact other factors such as limited bandwidth and energy consumption in communication also increases and then affects the reward of the proposed algorithm according to Eq. (10).
Figure 3. Reward (Utility) over Episodes
A comparison of reward (utility) between proposed QRL based energy balanced algorithm and stateofartalgorithm over number of episode is presented in the figure.3 using learning rate of . it is clearly observed from the simulation results that Qlearning algorithm converges faster than random learning algorithm within 390 episodes. Whereas random algorithm’s utility converges around 580 episodes. This is due to the fact the proposed QRL algorithm uses greedy technique to select an action rather than randomly selected any action for the current statereward. Also, optimal learning policy helps in the selection of an action in QRL, which ultimately maximize the reward with less number of episodes. It is also worthy to note down that worst performance is shown by without learning algorithm. This is because of neither have learning policy nor any optimization technique involved in the process of reward maximization.
A comparison of convergence characteristic in the terms of residual energy between Qlearning and the stateofartalgorithm is presented in the figure 4 using learning rate of . It is clearly observed from the result as the number of episode increases residual energy increases for all the three algorithms and converges at 400 episodes. Further, it is noticeable that proposed QRL based energy balanced routing algorithm has higher residual energy about (0.87Joule) than other stateofartalgorithm. This is because of the. QRL selects the next hop for the routing purpose based on optimal policy learning strategy subject to maximize the residual energy, better link quality and minimum distance. Whereas random algorithm select the next route based on minimum distance and does not consider residual energy of the next hop that in turns increases the overall energy consumption. And, the without learning based algorithm selects any hop for the routing randomly without considering the residual energy and minimum distance.
Figure 4. Residual energy over Episodes
Figure 5. Energy Consumption over episodes
A comparison of energy consumption of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 5. It can be observed from the results; at the start of learning episodes the energy consumption of the proposed algorithm is 0.225 J and as the algorithm reached to 160 episodes it lower down the energy consumption at 0.03 J. Further, proposed algorithm reached up to 400 episodes, the energy reduces to 0.025 J and consumption becomes stable. Whereas other stateofart algorithm fails to optimize the energy consumption of the nodes involved during routing path. This is because of Qlearning algorithm uses greedy approach for the selection of optimal policy, whereas Random algorithm selects any action randomly to obtain the reward. It is also noted down that the worst performance is shown by without (w/o) learning algorithm, because it does not have any learning policy and compute reward on the current situation of node’s parameters.
Figure 6. Edge Length over episodes
A comparison of edge length of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 6. The edge length describe the distance between the intermediate hops, smaller the edge length (Euclidean distance) corresponds to reduce the transmission delay and improves the also convergence speed of the learning based routing algorithm. The results shows that as the number of episode increases, the edge length of proposed algorithm reduces and stabilized about to 17 m within 400 episodes. Whereas edge length of random and without learning algorithms are fluctuates and fails to convergence. This is because of proposed QLearning routing algorithm at the initialization know the number of hop counts and also in learning phase greedy policy helps to compute less number of intermediate hops count and then compute the shortest distance edge length. It can be also observed that without learning based routing algorithm compute the edge length only on Euclidean distance formula according to Eq. (4) and nothing to do with learning and in turn fail to converge the edge length.
In this paper, we handled the problem of energy balanced routing algorithm using reinforcement learning and proposed QRL algorithm for wireless sensor network. The link quality, residual energy, and distance between the two consecutive hops are used as parameter for selection of an optimal action subject to maximize the reward (utility). To achieve the objective QRL based energy balanced algorithm has been proposed and their time complexity is also analyzed to show the effectives of the proposed algorithm. It is also proved from the simulation results the proposed QRL algorithm converges faster than other stateofartalgorithms. It is also notable from the simulation results that energy consumption and link quality and residual energy also improved compared to random algorithm and without learning algorithm. In the future, we also include the node density as another parameter to estimate the energy balanced routing path using deep learning techniques.
Conflict of interest
The authors declare no potential conflict of interest regarding the publication of this work. In addition, the ethical issues including plagiarism, informed consent, misconduct, data fabrication and, or falsification, double publication and, or submission, and redundancy have been completely witnessed by the authors.
Funding
The Internet of Things (IoT) is the network of physical objectsdevices, vehicles, buildings and other items embedded with electronics, software, sensors and wireless network connectivity that enable these objects to collect and exchange data (Akyildiz IF et al. (2002), Kashyap, P. K. (2019) & Kashyap, P. K., Kumar, S., and Jaiswal, A. (2019) ). Each of these smart device is uniquely identified by Internet address protocol (IP) to forward the data packet from source to destination (F. Bouabdallah et al. (2008)). The IoT has huge application in almost all the sectors of human being such as healthcare facilities, industrial organization, vehicular network, military operations, business organization and many more (Ishmanov F et al. (2011) & Ahmed AA and Mohammed Y, (2007)). These smart devices have limited battery power to perform complex computation and forward the data packet. Due to tremendous upsurge in the connected number of ubiquitous devices, there is large number of data packets travelled in the IoT network. So, there is need to choose the energy balanced routing path to forward the data packet so that lifetime of the network is improved.
In order to support various complicated application, IoT nodes have to perform the reliable operation with their limited energy, computational resources and bandwidth effectively so that it reduce the timedelay for data transmission using shortest routing path, transmission errors and ultimately improve the lifetime of the network. In this regard software define network is combined with the IoT network that separate the hardware and software control operation efficiently to cope with the mentioned challenged (J. Zhou, et al. (2016)). Therefore, dynamic routing rules for the IoT nodes provide novel data forwarding strategy, but lack in the presence of stochastic nature of channel path.
Machine learning (ML) techniques have been extensively used in the IoT network to finding the optimal route for data forwarding in the recent decade (C. Guestrin et al. (2004), Kashyap, P.K, Kumar, S. (2019), & A. Jaiswal et al. (2020)). Machine learning techniques provides learning ability to IoT network through experience, and reinforcement learning (RL) works on learning agent that improve its learning capability based on received rewards according to their taken action. By exploitation of their gained knowledge and exploration of the environment RL agent maximize its rewards (R. S. Sutton (2018)). Reinforcement learning techniques requires low computation resources with lower implementation efforts to output effective results with higher accuracy. The output of the system nearly optimal and has higher flexibility according to the changes in the environment without prior knowledge of the network. Thus, reinforcement learning and Qlearning are best suited techniques for routing approaches in the IoT network that build path with lower redundancy.
Under these circumstances, there is need energy balanced routing algorithm based on reinforcement learning approach that include residual energy, physical distance and link quality of the channel for data transmission. The major contribution of the paper as follow:
The rest of the paper is divided into following sections. Section II described the system models used in the IoT network using graph approach. Section III explains the Qlearning based routing protocol in the IoT network. In Section IV, simulation and results are analyzed for the proposed algorithm and stateofartalgorithms. Finally, Conclusion of this paper is presented along with future scope in the section V.
Network Setting
We consider an energyconstrained Internet of Thing network that has finite number of sensor nodes, which are randomly deployed in a given monitoring area. Each node in the network can only communicate with the neighbouring nodes that are within its transmission range. Data transmission from one node to another takes place in synchronised time slots. Here, it is considered that each data transmission from a source node to destination takes place by using a number of intermediate nodes present along the route in the network. Each node has a single antenna, a finite battery which can be recharged periodically and works in a halfduplex mode.
The wireless connection between nodes of IoT are affected by many factors, such as residual energy of node, physical distance, channel gain etc. that makes the edge length and network state of dynamic nature in many scenarios. Here, we represent the network as a graph G = (V, E, L)with stochastic edge length, where V is the set of vertices i.e. the sensor nodes and E = (eij), such that vi, vj ϵ V, is the set of edges and L represents the probability distribution of each edge length. An edge exists between vertices vi and vj in the graph only when node j is the neighbor of node i. The nodes in the transmission range of a node constitute its neighbourhood. The length of edge (vi, vj) is denoted as l(vi, vj) and is considered as a random variable. The channel between neighbouring nodes is assumed to follow quasistatic block Rayleigh fading model and the channel gain between neighbouring nodes vi and vj are modelled as Markov chain. The transition probability of from at any time instant t is given as and is unknown to the network or sensors.
Energy Consumption and residual energy Model
In IoT network, the energy is consumed for carrying out sensing, processing and communication (transmitting/receiving) activities. Out of these, data communication consumes most of the energy of a node so energy consumed for communication only is considered during routing. For simplicity, only the energy consumed for transmissions is accounted and energy spent for receiving is ignored as the idle and receiving nodes consume almost same amount of energy [20].According to the first order radio model presented in [20], for a message having b bits, the energy consumed for its transmission from to nodewith edge length l ( , ) is calculated as
(1)
(2)
After data transmission residual energy of a node at any hop, at time slot can be evaluated as follow
(3)
Where is the maximum battery capacity of a node, is used for calculating the threshold distance (l_{0}) which in turn is used to determine the power loss model to be used i.e. whether to use free space model or multipath fading model. Free space model is utilized when the distance between sender and receiver is less than the threshold distance otherwise multipath fading model is applied for calculating the energy consumed for transmission purposes. is the energy requirements of transmitter and receiver circuit, and are the energy consumed for amplifying transmission in order to attain a satisfactory signal to noise ratio (SNR) and l is the communication edge length.
Energy Balanced Routing Problem in IoT
The sensor nodes in the IoT network capture the desired data and forward this data to the destination node. Due to resource constraints in WSNs such as short communication range, limited processing potential and limited battery power, the source node, instead of communicating directly with the destination, communicates indirectly through its neighbours (multihop) as this leads to higher energy efficiency as compared to direct communication. In a multihop communication environment, routing algorithm is needed to find a communication path from source node to the destination node. Multihop communication overcomes the problem of energy inefficiency and shortrange communication faced in direct communication but it steers imbalanced consumption of energy in the network (Khan, Tayyab, et al. (2019)) as the intermediate nodes deplete their battery faster while relaying the data of other nodes. The nodes in the vicinity of sink are the most affected ones. Therefore, the routing algorithm needs to find a path which balances the energy consumption of the nodes in the network so that all the nodes deplete their energy nearly at the same time which in turn results in increased network lifetime. A routing path rp in the network graph is defined as a sequence of distinct sensors in the WSN starting from source node and ending at destination node i.e. such that are adjacent vertices for and =source node and =destination node. The path rp having sensor nodes has a length of The main aim of this paper is to find an optimal routing path between a source and destination node in order to minimize the total energy consumption and transmission delay for a reliable communication.
In this section, we propose a QLearning based efficient routing protocol to find an optimal and reliable route from a source to destination node in order to reduce the total energy consumption and minimize the total transmission delay (based on shortest distance), which can ultimately improve the network lifetime.
Problem Modelling
The stochastic optimal routingpath finding problem is modelled as an MDP. Qlearning updating rules are used to learn an optimal policy. Here, the learning agent selects an action in order to interact with the environment (stochastic graph) to reach the next neighboring node in the route, to get an optimal path from source to destination subject to maximize the expected reward obtained. MDP can be defined as follow:
State ‘ ’:Each sensor node in the network and the corresponding channel gain towards its neighbor nodes is modelled as a state. The current sensor node in the routing pathfinding process and current channel gain is considered as a current state.
Action ‘ ’:All the out link neighbor nodes are considered in the action set of a state.
Transition ‘ ’: The next state is determined by the action selection in current state.
Reward‘ ’: Reward for a stateaction pair (s, a) is calculated by using utility value which is the combination of nodes’ residual energy, edge length and nodes’ energy consumption and link quality.
Definition 1 (edge length: distance between nodes): Following formula is used to compute the edge length between any two node
l( , )= (4) where ( ) and ( ) are coordinates of node , and respectively.
Definition 2 (edge length based path): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on edge length is represented as
(5)
Where is the edge length between two nodes at hop. And, the path with minimum edge length guarantees that the transmission delay is minimum.
Definition 2 (routing path based on energy):Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on residual energy is represented as
(6) Where is the residual energy of transmitting node at hop.Residual energy of a node can be computed using Eq. (3).
Definition 3 (routing path based on link quality): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on better link quality is represented as
(7) Where is the normalized link quality at hopand given as
(8)
Where are signal strength at hop and maximum signal strength. Thus, the reward (utility) obtainedfor transition from to state after taking an action at time slot t being at hop can be computed as
(9)
Where , , , and are some prescribed positive weights parameter which reflect the importance of residual energy, link quality, edge length and energy consumption respectively in calculation of reward, and are number of neighbors for . The weight parameters are closely related to each other, such that if is set to zero, the presented model emphasize on the maximization of the residual energy in the routing path and transmission delay is ignored i.e., independent of number of intermediate hop. When is set to zero, the presented algorithm pays more attention in reducing the transmission delay of packet and ignores residual energy of the sensor node. Thus, in both above case lifetime of the network is not optimal because of tradeoff between . Thus, we can adjust these parameter values according to our needs.
To find an optimal routing path, the learning agent initially perceive the starting state , i.e., the source node and channel gain of the links towards its neighbors, and then selects an action using the current policy, till the agent arrives at the destination node. The statevalue is updated using the temporal difference method. The updating rule for Qlearning is
) (10)
where, ζ ϵ [0,1] denotes learning rate, is discount factor. Qlearning adopts greedy policy for action selection, i.e, it select optimal action having maximum statevalue with probability 1 and a random action with probability . The main aim of the learning agent is to find an optimal policy for selecting an optimal routing path. The optimal policy denotes the statevalue which is greater than other policy’s statevalue. The optimal routing problem can be expressed as
(11)
The formulated problem in (10) can be optimally solved by a QLearning method. The goal of proposed routing problem is to maximize the total reward over all routing paths starting at source node, such that rp and is the set of all paths in G starting from source node and ending at destination.
In routing process, each node in the network keeps a routing table which is used to select the next hop for the data transmission. The routing table contains the information about next possible nodes which can be reachable to all possible destinations in the network. This table is updated after each data transmission to store the information of node which is good for further data forwarding based on the obtained reward.
QLearning based Routing Algorithm
In this section, we present a learning based routing algorithm, particularly using Qlearning approach.
Algorithm QLRA
Initially, we initialize the Qtable with all zero values. After that current state is observed. Based on the current state an action is selected from the available actions at (line no. 15). After executing the action, a reward and next state are obtained (line no. 16). Using the achieved reward, statevalue is updated (line no. 17). And now next state becomes current state. The algorithm converge either source node find a routing path to reach destination node or for the maximum number of episode.
Time Complexity
The time complexity of the Qlearning based routing algorithm mainly has three aspects: (1) the algorithm continues until it find destination node in the line no.9 i.e., number of intermediate node in the routing path is ( ). (2) Selecting an action (choose neighbor node) from the set of neighbors’ nodes subject to maximize the expected discount reward in the network. The set of neighbors is represented in the form of matrix i.e., but for the single node (state) linear search apply on the single desired row takes time in line no.10 to line no.15. Thereafter line no.16 to line no. 19 requires constant time to update the Qvalue (3).The algorithm runs in worst case are equal to the number of episode until convergence (line no.5). Thus, overall time complexity of the algorithm is .
In this section, firstly, the convergence performance of proposed QLearning routing algorithm is analyzed over learning trails and link quality in terms of steps until convergence and reward (utility) respectively. Secondly, comparative analysis of the proposed algorithm against Random learning algorithm and without (w/o) learning algorithm done with respect four metrics: 1) Reward (Utility) 2) Residual energy 3) Energy consumption 4) Edge length. All these algorithms are simulated using same values of parameters for energy model and network conditions.
Simulation Environment
The simulation is carried out using MATLAB in a square area having 50 sensor nodes are randomly distributed. The communication range and initial energy of all the sensor nodes set to be20 m and 0.5 J respectively. The maximum bandwidth for each of the communication link in the network is 100Mbps. Without loss of generality, one source node and one destination node is selected randomly for the performance analysis of all the stateofartalgorithms. The others simulations parameters are shown in Table 1.
Table 1. Simulation Parameters
Parameter 
Value 
Parameter 
Value 


Initial Energy 


0.7 

0.92 

1000 

[0,1] 

3 



50 dBm 



0.7 



0.3 










Result Analysis
Figure 1. Convergence performance over learning trails
Figure 1 illustrates the convergence performance of the proposed QRL based energy balanced algorithm over number of learning trails. It can be clearly observed from result, the RL agent takes 940 steps to converge for the first trails of learning rate thereafter it took less number of steps approximate 560 steps for the very less value of learning rate .The number of steps on average higher than 200 steps for the taking the value of , whereas the best performance to achieve convergence by the RL agent for the value of . Thus, it is necessary to choose the learning rate with caution for convergence in small number of steps towards maximization of reward i.e. reduce the energy consumption and minimize the number of hop counts.
Figure 2. Reward (Utility) over link Quality
Figure 2 illustrates that the reward (utility) of the proposed Qlearning routing algorithm with respect to link quality under different residual energy of the intermediate hops in the routing path. The link quality describes the nature of communication path between the sensor nodes. And further, link quality depends upon the residual energy of the intermediate hops. If the residual energy of the communicating nodes is high then the link quality is also better and viceversa. It can be observed from the results, at the beginning, the reward of the proposed algorithm increases rapidly then become stationary as the value of link quality improves. This is because of the learning capability of the proposed algorithm to optimize the reward faster at link quality’s value . It is also worth to note down reward of the proposed algorithm cannot increase with further increase in the link quality. This is due to the fact other factors such as limited bandwidth and energy consumption in communication also increases and then affects the reward of the proposed algorithm according to Eq. (10).
Figure 3. Reward (Utility) over Episodes
A comparison of reward (utility) between proposed QRL based energy balanced algorithm and stateofartalgorithm over number of episode is presented in the figure.3 using learning rate of . it is clearly observed from the simulation results that Qlearning algorithm converges faster than random learning algorithm within 390 episodes. Whereas random algorithm’s utility converges around 580 episodes. This is due to the fact the proposed QRL algorithm uses greedy technique to select an action rather than randomly selected any action for the current statereward. Also, optimal learning policy helps in the selection of an action in QRL, which ultimately maximize the reward with less number of episodes. It is also worthy to note down that worst performance is shown by without learning algorithm. This is because of neither have learning policy nor any optimization technique involved in the process of reward maximization.
A comparison of convergence characteristic in the terms of residual energy between Qlearning and the stateofartalgorithm is presented in the figure 4 using learning rate of . It is clearly observed from the result as the number of episode increases residual energy increases for all the three algorithms and converges at 400 episodes. Further, it is noticeable that proposed QRL based energy balanced routing algorithm has higher residual energy about (0.87Joule) than other stateofartalgorithm. This is because of the. QRL selects the next hop for the routing purpose based on optimal policy learning strategy subject to maximize the residual energy, better link quality and minimum distance. Whereas random algorithm select the next route based on minimum distance and does not consider residual energy of the next hop that in turns increases the overall energy consumption. And, the without learning based algorithm selects any hop for the routing randomly without considering the residual energy and minimum distance.
Figure 4. Residual energy over Episodes
Figure 5. Energy Consumption over episodes
A comparison of energy consumption of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 5. It can be observed from the results; at the start of learning episodes the energy consumption of the proposed algorithm is 0.225 J and as the algorithm reached to 160 episodes it lower down the energy consumption at 0.03 J. Further, proposed algorithm reached up to 400 episodes, the energy reduces to 0.025 J and consumption becomes stable. Whereas other stateofart algorithm fails to optimize the energy consumption of the nodes involved during routing path. This is because of Qlearning algorithm uses greedy approach for the selection of optimal policy, whereas Random algorithm selects any action randomly to obtain the reward. It is also noted down that the worst performance is shown by without (w/o) learning algorithm, because it does not have any learning policy and compute reward on the current situation of node’s parameters.
Figure 6. Edge Length over episodes
A comparison of edge length of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 6. The edge length describe the distance between the intermediate hops, smaller the edge length (Euclidean distance) corresponds to reduce the transmission delay and improves the also convergence speed of the learning based routing algorithm. The results shows that as the number of episode increases, the edge length of proposed algorithm reduces and stabilized about to 17 m within 400 episodes. Whereas edge length of random and without learning algorithms are fluctuates and fails to convergence. This is because of proposed QLearning routing algorithm at the initialization know the number of hop counts and also in learning phase greedy policy helps to compute less number of intermediate hops count and then compute the shortest distance edge length. It can be also observed that without learning based routing algorithm compute the edge length only on Euclidean distance formula according to Eq. (4) and nothing to do with learning and in turn fail to converge the edge length.
In this paper, we handled the problem of energy balanced routing algorithm using reinforcement learning and proposed QRL algorithm for wireless sensor network. The link quality, residual energy, and distance between the two consecutive hops are used as parameter for selection of an optimal action subject to maximize the reward (utility). To achieve the objective QRL based energy balanced algorithm has been proposed and their time complexity is also analyzed to show the effectives of the proposed algorithm. It is also proved from the simulation results the proposed QRL algorithm converges faster than other stateofartalgorithms. It is also notable from the simulation results that energy consumption and link quality and residual energy also improved compared to random algorithm and without learning algorithm. In the future, we also include the node density as another parameter to estimate the energy balanced routing path using deep learning techniques.
Conflict of interest
The authors declare no potential conflict of interest regarding the publication of this work. In addition, the ethical issues including plagiarism, informed consent, misconduct, data fabrication and, or falsification, double publication and, or submission, and redundancy have been completely witnessed by the authors.
Funding
The Internet of Things (IoT) is the network of physical objectsdevices, vehicles, buildings and other items embedded with electronics, software, sensors and wireless network connectivity that enable these objects to collect and exchange data (Akyildiz IF et al. (2002), Kashyap, P. K. (2019) & Kashyap, P. K., Kumar, S., and Jaiswal, A. (2019) ). Each of these smart device is uniquely identified by Internet address protocol (IP) to forward the data packet from source to destination (F. Bouabdallah et al. (2008)). The IoT has huge application in almost all the sectors of human being such as healthcare facilities, industrial organization, vehicular network, military operations, business organization and many more (Ishmanov F et al. (2011) & Ahmed AA and Mohammed Y, (2007)). These smart devices have limited battery power to perform complex computation and forward the data packet. Due to tremendous upsurge in the connected number of ubiquitous devices, there is large number of data packets travelled in the IoT network. So, there is need to choose the energy balanced routing path to forward the data packet so that lifetime of the network is improved.
In order to support various complicated application, IoT nodes have to perform the reliable operation with their limited energy, computational resources and bandwidth effectively so that it reduce the timedelay for data transmission using shortest routing path, transmission errors and ultimately improve the lifetime of the network. In this regard software define network is combined with the IoT network that separate the hardware and software control operation efficiently to cope with the mentioned challenged (J. Zhou, et al. (2016)). Therefore, dynamic routing rules for the IoT nodes provide novel data forwarding strategy, but lack in the presence of stochastic nature of channel path.
Machine learning (ML) techniques have been extensively used in the IoT network to finding the optimal route for data forwarding in the recent decade (C. Guestrin et al. (2004), Kashyap, P.K, Kumar, S. (2019), & A. Jaiswal et al. (2020)). Machine learning techniques provides learning ability to IoT network through experience, and reinforcement learning (RL) works on learning agent that improve its learning capability based on received rewards according to their taken action. By exploitation of their gained knowledge and exploration of the environment RL agent maximize its rewards (R. S. Sutton (2018)). Reinforcement learning techniques requires low computation resources with lower implementation efforts to output effective results with higher accuracy. The output of the system nearly optimal and has higher flexibility according to the changes in the environment without prior knowledge of the network. Thus, reinforcement learning and Qlearning are best suited techniques for routing approaches in the IoT network that build path with lower redundancy.
Under these circumstances, there is need energy balanced routing algorithm based on reinforcement learning approach that include residual energy, physical distance and link quality of the channel for data transmission. The major contribution of the paper as follow:
The rest of the paper is divided into following sections. Section II described the system models used in the IoT network using graph approach. Section III explains the Qlearning based routing protocol in the IoT network. In Section IV, simulation and results are analyzed for the proposed algorithm and stateofartalgorithms. Finally, Conclusion of this paper is presented along with future scope in the section V.
Network Setting
We consider an energyconstrained Internet of Thing network that has finite number of sensor nodes, which are randomly deployed in a given monitoring area. Each node in the network can only communicate with the neighbouring nodes that are within its transmission range. Data transmission from one node to another takes place in synchronised time slots. Here, it is considered that each data transmission from a source node to destination takes place by using a number of intermediate nodes present along the route in the network. Each node has a single antenna, a finite battery which can be recharged periodically and works in a halfduplex mode.
The wireless connection between nodes of IoT are affected by many factors, such as residual energy of node, physical distance, channel gain etc. that makes the edge length and network state of dynamic nature in many scenarios. Here, we represent the network as a graph G = (V, E, L)with stochastic edge length, where V is the set of vertices i.e. the sensor nodes and E = (eij), such that vi, vj ϵ V, is the set of edges and L represents the probability distribution of each edge length. An edge exists between vertices vi and vj in the graph only when node j is the neighbor of node i. The nodes in the transmission range of a node constitute its neighbourhood. The length of edge (vi, vj) is denoted as l(vi, vj) and is considered as a random variable. The channel between neighbouring nodes is assumed to follow quasistatic block Rayleigh fading model and the channel gain between neighbouring nodes vi and vj are modelled as Markov chain. The transition probability of from at any time instant t is given as and is unknown to the network or sensors.
Energy Consumption and residual energy Model
In IoT network, the energy is consumed for carrying out sensing, processing and communication (transmitting/receiving) activities. Out of these, data communication consumes most of the energy of a node so energy consumed for communication only is considered during routing. For simplicity, only the energy consumed for transmissions is accounted and energy spent for receiving is ignored as the idle and receiving nodes consume almost same amount of energy [20].According to the first order radio model presented in [20], for a message having b bits, the energy consumed for its transmission from to nodewith edge length l ( , ) is calculated as
(1)
(2)
After data transmission residual energy of a node at any hop, at time slot can be evaluated as follow
(3)
Where is the maximum battery capacity of a node, is used for calculating the threshold distance (l_{0}) which in turn is used to determine the power loss model to be used i.e. whether to use free space model or multipath fading model. Free space model is utilized when the distance between sender and receiver is less than the threshold distance otherwise multipath fading model is applied for calculating the energy consumed for transmission purposes. is the energy requirements of transmitter and receiver circuit, and are the energy consumed for amplifying transmission in order to attain a satisfactory signal to noise ratio (SNR) and l is the communication edge length.
Energy Balanced Routing Problem in IoT
The sensor nodes in the IoT network capture the desired data and forward this data to the destination node. Due to resource constraints in WSNs such as short communication range, limited processing potential and limited battery power, the source node, instead of communicating directly with the destination, communicates indirectly through its neighbours (multihop) as this leads to higher energy efficiency as compared to direct communication. In a multihop communication environment, routing algorithm is needed to find a communication path from source node to the destination node. Multihop communication overcomes the problem of energy inefficiency and shortrange communication faced in direct communication but it steers imbalanced consumption of energy in the network (Khan, Tayyab, et al. (2019)) as the intermediate nodes deplete their battery faster while relaying the data of other nodes. The nodes in the vicinity of sink are the most affected ones. Therefore, the routing algorithm needs to find a path which balances the energy consumption of the nodes in the network so that all the nodes deplete their energy nearly at the same time which in turn results in increased network lifetime. A routing path rp in the network graph is defined as a sequence of distinct sensors in the WSN starting from source node and ending at destination node i.e. such that are adjacent vertices for and =source node and =destination node. The path rp having sensor nodes has a length of The main aim of this paper is to find an optimal routing path between a source and destination node in order to minimize the total energy consumption and transmission delay for a reliable communication.
In this section, we propose a QLearning based efficient routing protocol to find an optimal and reliable route from a source to destination node in order to reduce the total energy consumption and minimize the total transmission delay (based on shortest distance), which can ultimately improve the network lifetime.
Problem Modelling
The stochastic optimal routingpath finding problem is modelled as an MDP. Qlearning updating rules are used to learn an optimal policy. Here, the learning agent selects an action in order to interact with the environment (stochastic graph) to reach the next neighboring node in the route, to get an optimal path from source to destination subject to maximize the expected reward obtained. MDP can be defined as follow:
State ‘ ’:Each sensor node in the network and the corresponding channel gain towards its neighbor nodes is modelled as a state. The current sensor node in the routing pathfinding process and current channel gain is considered as a current state.
Action ‘ ’:All the out link neighbor nodes are considered in the action set of a state.
Transition ‘ ’: The next state is determined by the action selection in current state.
Reward‘ ’: Reward for a stateaction pair (s, a) is calculated by using utility value which is the combination of nodes’ residual energy, edge length and nodes’ energy consumption and link quality.
Definition 1 (edge length: distance between nodes): Following formula is used to compute the edge length between any two node
l( , )= (4) where ( ) and ( ) are coordinates of node , and respectively.
Definition 2 (edge length based path): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on edge length is represented as
(5)
Where is the edge length between two nodes at hop. And, the path with minimum edge length guarantees that the transmission delay is minimum.
Definition 2 (routing path based on energy):Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on residual energy is represented as
(6) Where is the residual energy of transmitting node at hop.Residual energy of a node can be computed using Eq. (3).
Definition 3 (routing path based on link quality): Data packet is transferred from a source node to destination through nhops towards the destination node. The optimal routing path based on better link quality is represented as
(7) Where is the normalized link quality at hopand given as
(8)
Where are signal strength at hop and maximum signal strength. Thus, the reward (utility) obtainedfor transition from to state after taking an action at time slot t being at hop can be computed as
(9)
Where , , , and are some prescribed positive weights parameter which reflect the importance of residual energy, link quality, edge length and energy consumption respectively in calculation of reward, and are number of neighbors for . The weight parameters are closely related to each other, such that if is set to zero, the presented model emphasize on the maximization of the residual energy in the routing path and transmission delay is ignored i.e., independent of number of intermediate hop. When is set to zero, the presented algorithm pays more attention in reducing the transmission delay of packet and ignores residual energy of the sensor node. Thus, in both above case lifetime of the network is not optimal because of tradeoff between . Thus, we can adjust these parameter values according to our needs.
To find an optimal routing path, the learning agent initially perceive the starting state , i.e., the source node and channel gain of the links towards its neighbors, and then selects an action using the current policy, till the agent arrives at the destination node. The statevalue is updated using the temporal difference method. The updating rule for Qlearning is
) (10)
where, ζ ϵ [0,1] denotes learning rate, is discount factor. Qlearning adopts greedy policy for action selection, i.e, it select optimal action having maximum statevalue with probability 1 and a random action with probability . The main aim of the learning agent is to find an optimal policy for selecting an optimal routing path. The optimal policy denotes the statevalue which is greater than other policy’s statevalue. The optimal routing problem can be expressed as
(11)
The formulated problem in (10) can be optimally solved by a QLearning method. The goal of proposed routing problem is to maximize the total reward over all routing paths starting at source node, such that rp and is the set of all paths in G starting from source node and ending at destination.
In routing process, each node in the network keeps a routing table which is used to select the next hop for the data transmission. The routing table contains the information about next possible nodes which can be reachable to all possible destinations in the network. This table is updated after each data transmission to store the information of node which is good for further data forwarding based on the obtained reward.
QLearning based Routing Algorithm
In this section, we present a learning based routing algorithm, particularly using Qlearning approach.
Algorithm QLRA
Initially, we initialize the Qtable with all zero values. After that current state is observed. Based on the current state an action is selected from the available actions at (line no. 15). After executing the action, a reward and next state are obtained (line no. 16). Using the achieved reward, statevalue is updated (line no. 17). And now next state becomes current state. The algorithm converge either source node find a routing path to reach destination node or for the maximum number of episode.
Time Complexity
The time complexity of the Qlearning based routing algorithm mainly has three aspects: (1) the algorithm continues until it find destination node in the line no.9 i.e., number of intermediate node in the routing path is ( ). (2) Selecting an action (choose neighbor node) from the set of neighbors’ nodes subject to maximize the expected discount reward in the network. The set of neighbors is represented in the form of matrix i.e., but for the single node (state) linear search apply on the single desired row takes time in line no.10 to line no.15. Thereafter line no.16 to line no. 19 requires constant time to update the Qvalue (3).The algorithm runs in worst case are equal to the number of episode until convergence (line no.5). Thus, overall time complexity of the algorithm is .
In this section, firstly, the convergence performance of proposed QLearning routing algorithm is analyzed over learning trails and link quality in terms of steps until convergence and reward (utility) respectively. Secondly, comparative analysis of the proposed algorithm against Random learning algorithm and without (w/o) learning algorithm done with respect four metrics: 1) Reward (Utility) 2) Residual energy 3) Energy consumption 4) Edge length. All these algorithms are simulated using same values of parameters for energy model and network conditions.
Simulation Environment
The simulation is carried out using MATLAB in a square area having 50 sensor nodes are randomly distributed. The communication range and initial energy of all the sensor nodes set to be20 m and 0.5 J respectively. The maximum bandwidth for each of the communication link in the network is 100Mbps. Without loss of generality, one source node and one destination node is selected randomly for the performance analysis of all the stateofartalgorithms. The others simulations parameters are shown in Table 1.
Table 1. Simulation Parameters
Parameter 
Value 
Parameter 
Value 


Initial Energy 


0.7 

0.92 

1000 

[0,1] 

3 



50 dBm 



0.7 



0.3 










Result Analysis
Figure 1. Convergence performance over learning trails
Figure 1 illustrates the convergence performance of the proposed QRL based energy balanced algorithm over number of learning trails. It can be clearly observed from result, the RL agent takes 940 steps to converge for the first trails of learning rate thereafter it took less number of steps approximate 560 steps for the very less value of learning rate .The number of steps on average higher than 200 steps for the taking the value of , whereas the best performance to achieve convergence by the RL agent for the value of . Thus, it is necessary to choose the learning rate with caution for convergence in small number of steps towards maximization of reward i.e. reduce the energy consumption and minimize the number of hop counts.
Figure 2. Reward (Utility) over link Quality
Figure 2 illustrates that the reward (utility) of the proposed Qlearning routing algorithm with respect to link quality under different residual energy of the intermediate hops in the routing path. The link quality describes the nature of communication path between the sensor nodes. And further, link quality depends upon the residual energy of the intermediate hops. If the residual energy of the communicating nodes is high then the link quality is also better and viceversa. It can be observed from the results, at the beginning, the reward of the proposed algorithm increases rapidly then become stationary as the value of link quality improves. This is because of the learning capability of the proposed algorithm to optimize the reward faster at link quality’s value . It is also worth to note down reward of the proposed algorithm cannot increase with further increase in the link quality. This is due to the fact other factors such as limited bandwidth and energy consumption in communication also increases and then affects the reward of the proposed algorithm according to Eq. (10).
Figure 3. Reward (Utility) over Episodes
A comparison of reward (utility) between proposed QRL based energy balanced algorithm and stateofartalgorithm over number of episode is presented in the figure.3 using learning rate of . it is clearly observed from the simulation results that Qlearning algorithm converges faster than random learning algorithm within 390 episodes. Whereas random algorithm’s utility converges around 580 episodes. This is due to the fact the proposed QRL algorithm uses greedy technique to select an action rather than randomly selected any action for the current statereward. Also, optimal learning policy helps in the selection of an action in QRL, which ultimately maximize the reward with less number of episodes. It is also worthy to note down that worst performance is shown by without learning algorithm. This is because of neither have learning policy nor any optimization technique involved in the process of reward maximization.
A comparison of convergence characteristic in the terms of residual energy between Qlearning and the stateofartalgorithm is presented in the figure 4 using learning rate of . It is clearly observed from the result as the number of episode increases residual energy increases for all the three algorithms and converges at 400 episodes. Further, it is noticeable that proposed QRL based energy balanced routing algorithm has higher residual energy about (0.87Joule) than other stateofartalgorithm. This is because of the. QRL selects the next hop for the routing purpose based on optimal policy learning strategy subject to maximize the residual energy, better link quality and minimum distance. Whereas random algorithm select the next route based on minimum distance and does not consider residual energy of the next hop that in turns increases the overall energy consumption. And, the without learning based algorithm selects any hop for the routing randomly without considering the residual energy and minimum distance.
Figure 4. Residual energy over Episodes
Figure 5. Energy Consumption over episodes
A comparison of energy consumption of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 5. It can be observed from the results; at the start of learning episodes the energy consumption of the proposed algorithm is 0.225 J and as the algorithm reached to 160 episodes it lower down the energy consumption at 0.03 J. Further, proposed algorithm reached up to 400 episodes, the energy reduces to 0.025 J and consumption becomes stable. Whereas other stateofart algorithm fails to optimize the energy consumption of the nodes involved during routing path. This is because of Qlearning algorithm uses greedy approach for the selection of optimal policy, whereas Random algorithm selects any action randomly to obtain the reward. It is also noted down that the worst performance is shown by without (w/o) learning algorithm, because it does not have any learning policy and compute reward on the current situation of node’s parameters.
Figure 6. Edge Length over episodes
A comparison of edge length of the proposed Qlearning routing algorithm with stateofartalgorithms over number of learning episodes is shown in the figure 6. The edge length describe the distance between the intermediate hops, smaller the edge length (Euclidean distance) corresponds to reduce the transmission delay and improves the also convergence speed of the learning based routing algorithm. The results shows that as the number of episode increases, the edge length of proposed algorithm reduces and stabilized about to 17 m within 400 episodes. Whereas edge length of random and without learning algorithms are fluctuates and fails to convergence. This is because of proposed QLearning routing algorithm at the initialization know the number of hop counts and also in learning phase greedy policy helps to compute less number of intermediate hops count and then compute the shortest distance edge length. It can be also observed that without learning based routing algorithm compute the edge length only on Euclidean distance formula according to Eq. (4) and nothing to do with learning and in turn fail to converge the edge length.
In this paper, we handled the problem of energy balanced routing algorithm using reinforcement learning and proposed QRL algorithm for wireless sensor network. The link quality, residual energy, and distance between the two consecutive hops are used as parameter for selection of an optimal action subject to maximize the reward (utility). To achieve the objective QRL based energy balanced algorithm has been proposed and their time complexity is also analyzed to show the effectives of the proposed algorithm. It is also proved from the simulation results the proposed QRL algorithm converges faster than other stateofartalgorithms. It is also notable from the simulation results that energy consumption and link quality and residual energy also improved compared to random algorithm and without learning algorithm. In the future, we also include the node density as another parameter to estimate the energy balanced routing path using deep learning techniques.
Conflict of interest
The authors declare no potential conflict of interest regarding the publication of this work. In addition, the ethical issues including plagiarism, informed consent, misconduct, data fabrication and, or falsification, double publication and, or submission, and redundancy have been completely witnessed by the authors.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.