
An Efficient Framework for Peer Selection in Dynamic P2P Network Using Q Learning with Fuzzy Linear Programming

by Mahalingam Anandaraj 1, Tahani Albalawi 2,* and Mohammad Alkhatib 2

1 Department of Information Technology, PSNA College of Engineering and Technology, Dindigul 624 622, Tamilnadu, India
2 Department of Computer Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia
* Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(2), 38; https://doi.org/10.3390/jsan14020038
Submission received: 28 January 2025 / Revised: 24 March 2025 / Accepted: 26 March 2025 / Published: 2 April 2025

Abstract: This paper proposes a new approach that integrates Q learning into the fuzzy linear programming (FLP) paradigm to improve peer selection in P2P networks. Using Q learning, the proposed method employs real-time feedback to adjust and update peer selection policies. The FLP framework enriches this process by dealing with imprecise information through fuzzy logic. It is used to achieve multiple objectives, such as enhancing the throughput rate, reducing delay, and guaranteeing a reliable connection. This integration effectively addresses network uncertainty, making the network configuration more stable and flexible. Through the use of the Q-learning agent in the network, various state metrics, including available bandwidth, latency, packet drop rates, and node connectivity, are observed and recorded. The agent then selects actions by choosing optimal peers for each node and updating a Q table that maps states and actions based on these performance indices. This reward system guides the agent's learning, refining its peer selection policy over time. The FLP framework supports the Q-learning agent by providing optimized solutions that balance conflicting objectives under uncertain conditions. Fuzzy parameters capture variability in network metrics, and the FLP model solves a fuzzy linear programming problem, offering guidelines for the Q-learning agent's decisions. The proposed method is evaluated under different experimental settings to demonstrate its effectiveness. In simulations based on the Erdős–Rényi model, throughput increased by 21% and latency decreased by 40%. Computational efficiency was also notably improved, with computation times reduced by up to five orders of magnitude compared to traditional methods.

1. Introduction

P2P networks today are an integral part of advanced communication systems and allow for efficient, distributed sharing of resources and services [1]. They contrast with the client-server model, in which a central element maintains authority over the others; in a P2P network, peers communicate directly with one another. This feature is beneficial in many ways, providing better resilience to failure, more efficient utilization of resources, and expandability. However, the dynamics and heterogeneity of the environment in any P2P network pose significant issues, primarily related to peer selection [2,3]. Peer selection is a necessary process in P2P networks, whereby suitable peers are chosen for data sharing and exchange. Proper selection of peers leads to increased communication throughput, decreased delay, and increased reliability of the whole network. However, the dynamic nature of P2P networks, with peers joining and leaving frequently, unpredictable network conditions, and differing peer capabilities, makes the selection process difficult [4,5].
The high churn rate causes many connections to be interrupted and results in frequent changes in the peer lists used for content exchange [6]. The peer selection process must be flexible enough to deal with these frequent changes, so that new peers can be integrated quickly and departing peers do not affect the network [7]. P2P networks comprise many participants with different levels of computing power, storage space, and available bandwidth [8]. A P2P network with five nodes is represented in Figure 1. Traditional peer selection methods rely on rule-based strategies that are not flexible when network conditions change. Peer churn also causes disruptions and inefficiencies in data transfer. Another limitation is the inability to handle uncertainty in network parameters. Prior approaches ignore real-world variance and assume deterministic values for computing power, bandwidth, and availability. This rigidity makes peer selection techniques less successful, causing network inefficiencies and higher latency. Scalability is also a significant concern, as many existing peer selection techniques struggle to maintain performance as the network grows. To tackle these research gaps, the proposed framework enhances adaptability, effectively handles uncertainty, and scales efficiently. The proposed method ensures dynamic, real-time peer selection by integrating Q learning with fuzzy constraints, notably improving network performance. This paper presents an optimization approach that incorporates Q learning and an FLP approach for selecting peers in dynamic P2P networks. Compared to conventional hybrid models, which generally use either crisp thresholds or probabilistic models, FLP handles uncertainty in a mathematically robust way, providing a more refined and resilient approach. A key limitation of several hybrid selection models is their inability to optimize peer selection over the long term.
Many traditional approaches rely on fixed optimization functions that do not evolve based on network conditions, often leading to suboptimal resource distribution where a small set of peers is overused, whereas others remain underutilized. The major objectives of this paper are as follows:
  • Introduce a Q-learning-based peer selection framework that adapts to real-time network changes, handles frequent peer arrivals and departures efficiently, ensures stable and optimized peer selection, and improves network reliability and performance.
  • Integrate Q learning with fuzzy logic-based constraints to manage uncertainties in peer attributes like bandwidth, availability, and processing power, with the aim of providing a multi-objective optimization strategy, balancing throughput, latency, and reliability.
  • Design a scalable and adaptive framework capable of handling large-scale P2P networks efficiently and demonstrate significant improvements in network performance, reducing latency and enhancing data transfer rates.
  • Apply the system to file-sharing networks and decentralized cloud storage, ensuring scalable and efficient peer communication.
The main objective of the proposed system is to provide a solution for how the integration of FLP with Q learning transforms peer selection processes in P2P CDN into effective and simple ones. The remaining part of the article is prepared as follows. Section 2 explains the literature review of the proposed work and existing mechanisms and their limitations to resolve the problem of peer selection. Section 3 elaborates on the proposed method, including the fuzzy logic programming technique and Q learning for the peer selection framework. Section 4 presents the results of the experimental analysis and a discussion of the proposed system. Finally, Section 5 provides the paper’s conclusion and outlines future research directions.

2. Related Work

2.1. Traditional Peer Selection Techniques

The criteria for peer selection in a P2P network are essential to optimize data exchange, minimize latency, and increase the network's resilience [9]. Common approaches for peer selection are based on heuristics and static criteria, frequently using simple measures such as distance, available bandwidth, and node status. Proximity-based selection is one such technique, in which peers nearer to the source are favored in order to decrease latency and increase throughput [10]. The rationale is that the physical separation of peers is typically proportional to network delay, and thus small network distances should lead to small delays [2]. Bandwidth-based selection is another technique that tries to pair up peers with the greatest amount of bandwidth, with the goal of achieving maximum throughput. This approach enables users to connect with reliable super peers capable of supporting high-bandwidth transmission [11]. Availability-based peer selection techniques prefer peers that are consistently connected and willing to share data [12]. This approach helps ensure that only stable and reliable connections are established in the network. In this way, the network can invest in peers with high availability, improving the network's overall stability and decreasing the possibility of interrupted transfers. Hybrid methods select peers based on accessibility, bandwidth capacity, and proximity to host peers [4,13].
Structured P2P systems rely on specific algorithms for selecting peers and disseminating data, such as distributed hash tables (DHTs) [14]; examples include Chord, CAN, and Kademlia. DHTs offer deterministic node selection by using hash functions, making them scalable and capable of efficient use of available resources [15]. Gossip protocols, also referred to as epidemic protocols, have been employed for peer selection and information dissemination in P2P networks [16]. In a gossip-based system, each node periodically exchanges information with a randomly selected neighbor within the network. This approach can also improve the quality and reliability of information dissemination across the entire network [17]. Hybrid methods and structured P2P networks provide more robust and scalable solutions, but they lack flexibility [18]. Unstructured networks and gossip protocols provide adaptability and resilience, but they are inefficient under dynamic conditions. While traditional methods have been effective to varying degrees, there remains a need for more adaptive and dynamic approaches to optimizing peer selection in ever-changing P2P network environments [19,20]. This paper builds on these foundations by devising new strategies for improving peer selection to enhance the performance and robustness of the network.

2.2. Heuristic-Based Peer Selection Methods

Traditional techniques can be improved using heuristic-based peer selection methods, which introduce rule-based optimizations and mathematical models. Game-theoretic models frame peer selection as a strategic interaction in which each peer optimizes its utility function, often leading to more stable and efficient selections. By optimally organizing peers, graph-based selection methods such as clustering algorithms and minimum spanning trees (MSTs) enhance network connectivity. In multi-criteria decision making (MCDM), another popular heuristic approach, techniques like the analytic hierarchy process (AHP) and the technique for order preference by similarity to ideal solution (TOPSIS) rank peers according to criteria including bandwidth, latency, and dependability. Although these heuristic models improve efficiency, they require manual parameter tuning and might not adjust effectively to network disturbances in real time. Table 1 summarizes the strengths and weaknesses of each category described above, showing how AI-driven models, particularly the proposed system, offer the most robust and adaptable solution for peer selection in dynamic P2P networks.

3. Proposed System

3.1. Objective Function and Constraints

In order to formulate the objective function and constraints for the proposed system, it is necessary to consider three key components: peer selection, Q learning, and FLP. The peer selection component aims to select the best peer based on multiple factors such as bandwidth, latency, trust level, and availability and handle dynamic changes in network conditions. For the Q learning component, reinforcement learning is used to optimize peer selection over time, and the Q-value function helps in decision making to maximize long-term rewards. The FLP component is used to handle uncertainties in the dynamic network, and constraints are represented as fuzzy inequalities rather than crisp ones. The goal of the framework is to maximize peer selection efficiency, considering multiple dynamic factors.
Maximize ∑_{i=1}^{N} μ_i Q(s, a_i)  (1)
The Q-value is updated as follows:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]  (2)
The objective function aims to maximize long-term rewards based on dynamic peer performance, while the constraints ensure that selected peers meet performance and reliability requirements. The following are the constraints and their formulations.
Bandwidth Constraint: B_i ≥ B_min. Fuzzy constraint: B_i ≥ B_min − e_B
Latency Constraint: L_i ≤ L_max. Fuzzy constraint: L_i ≤ L_max + e_L
Trust-Level Constraint: T_i ≥ T_threshold. Fuzzy constraint: T_i ≥ T_threshold − e_T
Availability Constraint: A_i = 1 if a peer must be available. This is a hard constraint, not a fuzzy one. Table 2 provides a concise summary of the key notations used in the proposed framework.
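To make the fuzzy relaxation concrete, the sketch below scores how well a peer satisfies the bandwidth, latency, and trust constraints with soft boundaries rather than crisp cut-offs; the thresholds, tolerances, and peer values are hypothetical and not taken from the paper.

```python
def satisfaction(value, threshold, tolerance, direction="ge"):
    """Degree in [0, 1] to which a fuzzy constraint holds.

    direction="ge": value should be >= threshold (bandwidth, trust).
    direction="le": value should be <= threshold (latency).
    The tolerance widens the boundary instead of a crisp cut-off.
    """
    if direction == "ge":
        if value >= threshold:
            return 1.0
        if value <= threshold - tolerance:
            return 0.0
        return (value - (threshold - tolerance)) / tolerance
    else:  # "le"
        if value <= threshold:
            return 1.0
        if value >= threshold + tolerance:
            return 0.0
        return ((threshold + tolerance) - value) / tolerance

# Hypothetical peer: 9.2 Mbps bandwidth, 120 ms latency, trust score 0.78
mu_b = satisfaction(9.2, threshold=10.0, tolerance=2.0, direction="ge")
mu_l = satisfaction(120.0, threshold=100.0, tolerance=50.0, direction="le")
mu_t = satisfaction(0.78, threshold=0.8, tolerance=0.1, direction="ge")
overall = min(mu_b, mu_l, mu_t)  # min-aggregation of constraint degrees
```

A crisp check would reject this peer outright (9.2 < 10 Mbps); the fuzzy relaxation instead assigns it a partial feasibility of 0.6.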

3.2. MDP Framework

The selection of the best peer in P2P networks is difficult because network conditions often change, peer reliability varies, and resource availability fluctuates. These dynamics make it hard to define static peer selection standards, so an intelligent approach with adaptive mechanisms must be used in decision making. The peer selection framework is based on the Markov decision process (MDP), a mathematical model that provides tools for making decisions in dynamic and uncertain situations. By expressing peer selection as an MDP and using Q learning as the reinforcement learning algorithm, peer selection is optimized over time as the agent learns from observed network behavior and performance metrics. The peer selection process is defined by four key MDP components: state space, action space, transition probabilities, and reward function. A complete network state model contains all potential conditions that exist within the P2P network over time, using peer bandwidth, latency, trust score, availability, and QoS satisfaction. These factors influence the network's overall efficiency, and an accurate representation of the state space ensures that the model considers real-world variations in peer performance. A defined action space ensures that the model thoroughly evaluates all possible selections before determining its best peer selection plan. When a specific action is taken, the transition probabilities determine how likely the system is to move between states, incorporating parameters such as network congestion, peer failures, and link quality differences. Fuzzy logic-based transition modeling is incorporated into the MDP to address uncertain network conditions, which enables better decision making in unpredictable situations.
The reward function evaluates peer selection quality through numerical scoring of performance measures that include data transfer speed and latency, as well as reliability and resource efficiency. Q learning operates to improve peer selection decisions over time by learning from past experiences to achieve maximum reward accumulation. Through Q learning, the system manages to balance the location and usage of high-performing peers by continuously exploring new choices that help it adapt to dynamic network conditions. The MDP-based Q-learning model differs from static peer selection methods by developing dynamically through experience to improve its selection accuracy. The system becomes more effective for decision making through fuzzy constraints because these flexible threshold values enable optimal choices even in unpredictable network environments.

3.2.1. State Representation in P2P Networks

In P2P networks, accurately representing the state of the network is crucial for effective management and optimization. The state of a P2P network at any given time can be represented by a set of variables capturing essential characteristics such as node status, network topology, resource availability, performance metrics, and peer behavior [21]. For instance, state S can be defined as a vector:
S = [S_n, S_t, S_r, S_p, S_b]
where Sn represents the status of nodes such as whether active or inactive, St captures the network topology as a connection matrix, Sr indicates resource availability such as bandwidth and storage, Sp includes performance metrics like latency and throughput, and Sb denotes peer behaviors such as churn rate and reputation scores. This comprehensive state representation allows for the detailed monitoring and management of the network’s dynamic behavior [22].
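A minimal sketch of such a state snapshot follows; the field names and example values are illustrative, not specified by the paper.

```python
# State vector S = [Sn, St, Sr, Sp, Sb] as a plain dictionary.
state = {
    "Sn": {"n1": "active", "n2": "active", "n3": "inactive"},  # node status
    "St": [[0, 1, 0],                                          # topology as a
           [1, 0, 1],                                          # connection
           [0, 1, 0]],                                         # (adjacency) matrix
    "Sr": {"n1": {"bw_mbps": 50, "storage_gb": 200}},          # resource availability
    "Sp": {"latency_ms": 35.0, "throughput_mbps": 42.0},       # performance metrics
    "Sb": {"churn_rate": 0.12, "reputation": {"n1": 0.9}},     # peer behavior
}
```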

3.2.2. Transition Probability Matrix (TPM)

The TPM is used to describe the probabilities of moving from one state to another over a given period. It is a basic tool for modeling state transitions in a P2P network. P denotes the TPM, which is a square matrix. Each element P_ij represents the probability of transitioning from state i to state j:
P_ij = Pr(S_{t+1} = j | S_t = i)
In order to formulate the TPM, it is necessary to understand the states of the network. Then, data on state transitions are gathered by examining the network over some time slot. The transition probability from state i to state j is calculated as follows:
P_ij = (number of transitions from state i to state j) / (total number of transitions from state i).
The above matrix is used to understand the network’s dynamic behavior and predict future states.
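The counting rule above can be sketched in a few lines; the observed state sequence here is hypothetical.

```python
from collections import Counter

def estimate_tpm(state_sequence):
    """Estimate P[i][j] = (# transitions i -> j) / (# transitions out of i)
    from an observed sequence of discrete network states."""
    counts = Counter(zip(state_sequence, state_sequence[1:]))  # pair counts
    totals = Counter(state_sequence[:-1])                      # outgoing totals
    states = sorted(set(state_sequence))
    return {i: {j: counts[(i, j)] / totals[i] if totals[i] else 0.0
                for j in states}
            for i in states}

# Hypothetical observation: two network states, A and B, over 8 time slots
P = estimate_tpm(["A", "A", "B", "A", "B", "B", "A", "A"])
```

Each row of the estimated matrix sums to 1, as required of a stochastic matrix.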
Following these steps in Algorithms 1 and 2 comprehensively models the state representation and transition probability matrix analysis in a P2P network, revealing the network dynamics, performance, and stability.
Algorithm 1: State Representation and Transition Probability Matrix Analysis in a P2P Network
Input: G = (V, E) (P2P network graph); peer attributes u_i, d_i, s_i, p_i, a_i; content storage and demand matrices C, D
Output: Transition probability matrix P, steady-state distribution π, expected state times, network performance metrics
Step 1. Formulate the state transition-rate matrix Q:
  Step 1.1. For each state i and each possible transition j ≠ i, define q_ij as the transition rate from i to j; if q_ij > 0, update matrix Q.
  Step 1.2. Set the diagonal entries q_ii = −∑_{j≠i} q_ij.
  Step 1.3. Require q_ij ≥ 0 for i ≠ j.
  Step 1.4. Define the arrival rate λ_i and departure rate μ_i for each state.
Step 2. Compute the transition probabilities p_ij(t) = [e^{Qt}]_ij.
Step 3. Solve for the steady-state distribution: πQ = 0, subject to ∑_i π_i = 1.
Step 4. For each state i: if μ_i > 0, compute the expected time in the state, T_i = π_i / μ_i; otherwise set T_i = ∞ (indicating no transition).
Step 5. For each state i, analyze the network dynamics λ_i, μ_i and check the balance condition ∑_i λ_i π_i = ∑_i μ_i π_i; if the equation does not hold, adjust the transition rates.
Step 6. Evaluate network capacity and performance metrics.
Step 7. Assess bandwidth, storage, and processing power.
Step 8. Compute download/upload speeds, content availability, and network latency.
Step 9. Return P, π, T_i, and the network performance metrics.
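Steps 2-4 of Algorithm 1 can be illustrated on a two-state chain, where πQ = 0 has a closed-form solution. The transition rates below are hypothetical, and the per-visit sojourn time uses the standard 1/(exit rate) formula.

```python
# Two-state generator matrix Q = [[-q01, q01], [q10, -q10]].
# Solving pi @ Q = 0 with pi0 + pi1 = 1 gives pi0 = q10 / (q01 + q10).
q01, q10 = 0.2, 0.3            # hypothetical transition rates between states 0 and 1

pi0 = q10 / (q01 + q10)        # steady-state probability of state 0
pi1 = q01 / (q01 + q10)        # steady-state probability of state 1

balance = pi0 * (-q01) + pi1 * q10   # first component of pi @ Q, should be 0

# Expected sojourn time per visit to a state is 1 over its total exit rate
T0 = 1.0 / q01
T1 = 1.0 / q10
```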

3.3. Fuzzy Linear Programming (FLP) for Peer Selection

An effective peer selection mechanism is critical in dynamic and heterogeneous P2P networks in order to optimize performance metrics such as throughput, latency, and network robustness [23]. FLP can be used to model and solve optimization problems in peer selection, where parameters are not precisely known and are better represented as fuzzy numbers. The objective is to choose the optimal set of peers that maximizes network performance while considering uncertainties in network conditions, resource availability, and peer behavior. Fuzzy logic extends classical logic to manage the concept of partial truth, where truth values range between completely true and completely false [24]. Fuzzy sets were introduced by Lotfi Zadeh in 1965. Unlike classical sets, where elements either belong or do not belong to the set, fuzzy sets allow for partial membership, characterized by a membership function μ: X → [0, 1] [25]. Consider the fuzzy set Ā representing high bandwidth in a P2P network. Different bandwidth values have different degrees of membership in Ā.
μ_Ā(x) = 0 if x ≤ 10 Mbps; (x − 10)/(20 − 10) if 10 < x < 20 Mbps; 1 if x ≥ 20 Mbps
Fuzzy values are a special type of fuzzy set used to represent uncertain quantities. A common representation is a triangular fuzzy number (TFN), defined by a triplet (l,m,u), where l is the lower limit, m is the most likely value, and u is the upper limit. The membership function for a TFN A ¯ = (l, m, u) is
μ_Ā(x) = 0 if x ≤ l; (x − l)/(m − l) if l < x ≤ m; (u − x)/(u − m) if m < x ≤ u; 0 if x > u
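The TFN membership function above translates directly into code; the example triplet (10, 20, 30) is an arbitrary illustration, e.g., a bandwidth level in Mbps.

```python
def tfn_membership(x, l, m, u):
    """Membership degree of x in the triangular fuzzy number (l, m, u):
    rises linearly from l to the peak at m, then falls linearly to u."""
    if x <= l or x >= u:
        return 0.0
    if x <= m:
        return (x - l) / (m - l)   # rising edge
    return (u - x) / (u - m)       # falling edge

# Hypothetical TFN for "medium bandwidth": (l, m, u) = (10, 20, 30) Mbps
peak = tfn_membership(20, 10, 20, 30)   # full membership at the mode
half = tfn_membership(15, 10, 20, 30)   # halfway up the rising edge
outside = tfn_membership(35, 10, 20, 30)
```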
Algorithm 2: Fuzzy Linear Programming for Peer Selection
Input: Network parameters (download rate, latency, resource allocation); decision variables x_j (peer selection variables)
Output: Optimal peer selection strategy
Step 1. Define the optimization problem: maximize the download rate, minimize latency, and optimize resource allocation.
Step 2. Define the decision variables: x_j represents the selection of peer j; x_j = 1 if the peer is selected, 0 otherwise.
Step 3. Formulate the objective function: FLP(x) = ∑_{i=1}^{n} λ_i μ_i(x).
Step 4. Formulate the constraints: ∑_{j=1}^{m} a_ij x_j ≤ b_i.
Step 5. For each fuzzy set A, define the membership functions, e.g., μ_Low(x) = f_Low(x) and μ_Medium(x) = f_Medium(x).
Step 6. Aggregate the membership functions across criteria, (μ_1(x), μ_2(x), …, μ_n(x)); if the aggregated membership value is above a threshold, prioritize selection.
Step 7. Defuzzification: CrispValue = D(aggregated criteria); if the crisp value exceeds a threshold, select the peer.
Step 8. Solve the FLP problem using an optimization solver.
Step 9. Return the transition probability matrix P, steady-state distribution π, expected time in each state T_i, network performance metrics, and the optimized peer selection strategy.
FLP can be effectively applied to optimize peer selection in a P2P network, considering various fuzzy parameters and constraints, by following these steps [26,27]. Variables such as peer reliability, download speed, task completion rate, and latency can be defined as fuzzy sets using linguistic variables and membership functions to capture their imprecise characteristics [28]. For download speed characterization, establishing distinct linguistic variables such as {Low, Medium, High} enables the categorization of download speeds into qualitative levels. Corresponding to each linguistic variable, membership functions, denoted μ_Low(x), μ_Medium(x), and μ_High(x), respectively, assign degrees of membership to individual download speed values [29,30]. Figure 2 shows the fuzzified input variables, such as download speed, peer availability, content delivery rate, and delay. Fuzzy constraints play a crucial role in optimizing peer selection in dynamic P2P networks by handling the intrinsic uncertainties associated with network conditions. Since factors such as bandwidth availability, peer reliability, latency, and computational power fluctuate over time, traditional crisp thresholds may not be effective. Instead, fuzzy logic allows states to vary efficiently while maintaining the robustness and adaptability of peer selection decisions. To represent these uncertainties appropriately and guarantee that the system can distinguish between different levels of network characteristics, suitable membership functions (MFs) must be used. Triangular membership functions (TMFs) are chosen for factors with a linear transition, including peer reliability and storage capacity, because of their straightforward computational structure and minimal processing overhead. In Figure 2, the blue lines and diamond symbols denote the very low and very high linguistic variables, and the red lines and square symbols denote the low and high linguistic variables.
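Steps 5-7 of Algorithm 2 (aggregation, thresholding, defuzzification) can be sketched as follows. The criteria, membership degrees, weights, and selection threshold are all hypothetical, and min/weighted-average are just one possible choice of aggregation and defuzzification operators.

```python
def aggregate(memberships):
    """Conservative t-norm aggregation: the weakest criterion dominates."""
    return min(memberships.values())

def defuzzify(memberships, weights):
    """Weighted-average defuzzification to a single crisp score."""
    total = sum(weights[k] * memberships[k] for k in memberships)
    return total / sum(weights.values())

# Hypothetical fuzzified scores for one candidate peer
peer = {"download_speed": 0.8, "availability": 0.6, "delivery_rate": 0.9}
weights = {"download_speed": 0.5, "availability": 0.3, "delivery_rate": 0.2}

agg = aggregate(peer)             # min(0.8, 0.6, 0.9) = 0.6
crisp = defuzzify(peer, weights)  # 0.5*0.8 + 0.3*0.6 + 0.2*0.9
selected = crisp >= 0.7           # hypothetical selection threshold
```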

3.4. Learning for Peer Selection Optimization

The Q-learning approach provides a substantial solution to the peer selection problem, allowing the best strategies to be learned through interaction with the environment. It is a form of learning in which an agent determines how to act in an environment so as to attain the highest cumulative sum of rewards, and it is especially suited to problems where the model of the environment is unknown or complex. States, actions, rewards, and Q-values are the central elements of Q learning. States represent the different configurations and situations of the environment; in a P2P network, a state could encapsulate the current network topology, peer performance metrics, and resource availability. Actions reflect the possible choices of an agent, which are discrete activities it can perform, such as connecting to or disconnecting from other peers. Rewards provide feedback on an action taken in a given state, reflecting performance metrics such as high throughput, low latency, and reliable links. Q-values estimate the expected cumulative reward of taking an action in a state, and the optimal policy is learned iteratively from the agent's experience. The Q-learning algorithm (Algorithm 3) updates the Q-value for a state-action pair as described in Equation (2).
In P2P networks, the state observed by the agent captures the network topology, the performance of peers, the availability of resources, and peer status. Network topology addresses the connectivity and interconnection of peers, and peer performance is characterized by bandwidth, latency, packet loss rate, and computational power. Availability of resources reflects the present status of network resources, such as bandwidth and storage, while peer status informs the system about peer activity, such as joins and leaves. Establishing connections involves choosing a new set of peers to connect with based on their potential to improve network performance, while terminating connections involves deciding which existing connections to drop if they are no longer beneficial. The reward function is designed to reflect the desired performance objectives of the P2P network. It is given below.
r(s,a) = w1 × throughput + w2 × (−latency) + w3 × connectivity − w4 × resource_cost
where w1, w2, w3, and w4 are the weights assigned to each performance metric based on their relative importance.
The agent chooses the action with the highest Q-value for the current state. After performing the chosen action, the agent observes the new state and reward and then updates the Q-value using the formula above, based on the received reward and the maximum Q-value of the new state. This process is iterative and continues until the Q-values stabilize. Figure 3 shows the exploration-exploitation trade-off over episodes. A high learning rate means that new experience strongly affects the Q-values, incorporating new information quickly, while a low learning rate causes new information to be incorporated slowly.
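The reward function above can be written directly as code; the weights and the normalized metric values in the example are hypothetical.

```python
def reward(throughput, latency, connectivity, resource_cost,
           w=(0.4, 0.3, 0.2, 0.1)):
    """r(s, a) = w1*throughput + w2*(-latency) + w3*connectivity
    - w4*resource_cost, with all metrics normalized to [0, 1] here."""
    w1, w2, w3, w4 = w
    return (w1 * throughput + w2 * (-latency)
            + w3 * connectivity - w4 * resource_cost)

# Hypothetical peer: good throughput, low latency, moderate resource cost
r = reward(throughput=0.9, latency=0.2, connectivity=0.8, resource_cost=0.3)
```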
Algorithm 3: Q Learning for Peer Selection Optimization
Initialize the Q table over the set of states S and set of actions A.
Set the parameters: learning rate α, discount factor γ, exploration rate ε.
Define the state s, action a, and reward r.
Output: optimized Q table Q(s, a) with learned values and the optimal policy Π(s).
Step 1. For each episode:
  a. Initialize the state s, and repeat until state s is terminal:
    • Choose action a based on the ε-greedy policy: a random action with probability ε, or argmax_a Q(s, a) with probability 1 − ε.
    • Perform action a and observe the next state s′ and reward r.
    • Update the Q table using Equation (2).
    • Set s ← s′.
  b. End of episode: reduce the exploration rate ε.
Step 2. Derive the optimal policy from the Q table: Π(s) = argmax_a Q(s, a).
Step 3. Return the optimized Q table Q(s, a) and the optimal peer selection policy Π(s).
By following these steps, Q learning can be effectively applied to optimize peer selection in a P2P network, improving network performance through adaptive learning and decision making.
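As a self-contained illustration of Algorithm 3 and the update rule of Equation (2), the sketch below runs tabular Q learning on a deliberately tiny, entirely hypothetical environment with a single state and two candidate peers as actions.

```python
import random

def q_learning(states, actions, step, alpha=0.4, gamma=0.9,
               eps=1.0, eps_decay=0.99, episodes=200, seed=0):
    """Tabular Q learning with an epsilon-greedy policy and epsilon decay."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = states[0]
        for _ in range(10):                        # bounded episode length
            if rng.random() < eps:                 # explore
                a = rng.choice(actions)
            else:                                  # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r = step(s, a)                     # environment feedback
            best_next = max(Q[(s2, a2)] for a2 in actions)
            # Equation (2): Q(s,a) <- Q(s,a) + alpha*(r + gamma*max Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
        eps *= eps_decay                           # reduce exploration rate
    return Q

# Hypothetical environment: peer "fast" always yields reward 1, "slow" yields 0.
def step(state, action):
    return "s0", (1.0 if action == "fast" else 0.0)

Q = q_learning(["s0"], ["fast", "slow"], step)
policy = max(["fast", "slow"], key=lambda a: Q[("s0", a)])  # Pi(s) = argmax_a Q(s, a)
```

After training, the derived policy consistently prefers the higher-reward peer, mirroring Step 2 of Algorithm 3.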

3.5. Optimization of Hyperparameters

Q learning for peer selection needs to be optimized: it is crucial to fine-tune key hyperparameters such as the learning rate (α) and discount factor (γ) to guarantee efficient convergence, stability, and adaptability. The learning rate determines how much newly obtained information dominates previously learned knowledge. If α is high, the model learns too quickly, which can make it unstable, while a low α results in slow convergence, preventing the agent from adjusting efficiently to network changes. To find the optimal α, an experimental approach was adopted, testing values in the range of 0.1 to 0.9. The results showed that α between 0.3 and 0.5 provided the best trade-off, allowing the system to learn efficiently while maintaining stability. Additionally, a decaying learning rate strategy was implemented, where α gradually decreased over time, ensuring that the model became more stable as learning progressed. This is shown in Figure 4, Figure 5 and Figure 6 for different settings. The discount factor γ, shown in Figure 7, plays a vital role in balancing short-term against long-term rewards. A low γ (~0.5) prioritizes immediate rewards, leading to suboptimal long-term peer selection, while a high γ (~0.9–0.99) encourages long-term optimization, although it may slow down convergence. To identify the best value of γ, experiments were conducted using values between 0.5 and 0.99, and the results demonstrated that a range of 0.85 to 0.95 offered the most effective balance. This allowed the algorithm to make decisions that considered both immediate performance and long-term stability, leading to better peer selection strategies. Another important aspect of Q-learning optimization is balancing exploration and exploitation, which was achieved through an ε-greedy strategy.
Initially, the exploration rate ε was set at 1 to encourage broad exploration, but over time, ε was decayed exponentially to shift toward more exploitation of optimal peer selections. This guaranteed that in the early training phase, Algorithm 4 explored various peer selection strategies, whereas in later stages, it mainly focused on using the most effective strategies identified during learning. The combination of optimized α, γ, and ε considerably improved the efficiency of peer selection. These optimizations permitted the Q-learning model to converge faster, adjust to dynamic network changes, reduce selection errors, and improve the overall performance of the P2P network. This is represented in Figure 8 and Figure 9.
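The exponential ε decay described above can be sketched as a simple schedule; the decay rate and the floor value are hypothetical choices, not values from the paper.

```python
def decayed_epsilon(episode, eps0=1.0, decay=0.99, eps_min=0.05):
    """Exploration rate after a given episode: exponential decay
    from eps0 toward a floor eps_min (to retain some exploration)."""
    return max(eps_min, eps0 * (decay ** episode))

eps_start = decayed_epsilon(0)    # early training: mostly exploration
eps_late = decayed_epsilon(500)   # late training: clipped at the floor
```

Early episodes explore broadly (ε near 1), while late episodes mostly exploit the learned peer selections, matching the schedule described in the text.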

3.6. Integration of Fuzzy Linear Programming (FLP) and Q Learning

Integrating fuzzy linear programming with Q learning is a powerful paradigm for improving decision making in uncertain environments, and it is especially suitable for solving peer selection problems in P2P networks. FLP extends conventional linear programming by applying fuzzy set theory to manage imprecise data [31]. Optimality is addressed in FLP models through the use of fuzzy objectives, constraints, and decision variables to resolve conflicting objectives under uncertainty. FLP makes use of fuzzy sets and fuzzy numbers to effectively model and process vagueness in information and, with the help of fuzzy logic operations, solves optimization problems. Q learning, a model-free reinforcement learning method, builds value functions and selects the best policy by learning from interactions with the environment. In Q learning, the value of each state–action pair, or Q-value, is estimated and then updated based on the rewards received by the agent, making it possible for the agent to learn good policies. In this context, FLP helps in modeling states that may have vague or unpredictable information, for instance, peers' performance parameters and network status. By means of fuzzy objectives and constraints, FLP effectively defines and describes the optimization objectives and constraints for peer selection that are characteristic of P2P networks. In turn, the Q-learning agent communicates with the FLP framework and constantly updates its peer selection policies. The agent adapts its strategies for selecting peers and exploits the learned policies to gain as much cumulative reward as possible in the future. This integration enables the achievement of the conflicting optimization objectives while also addressing the inherent uncertainty of real-world P2P network environments with the help of the FLP framework.
Figure 9 shows the proposed model.
Algorithm 4: Integration of FLP and Q Learning
  The integration of FLP and Q learning combines the advantages of handling uncertainty with fuzzy logic and the learning capability of reinforcement learning for optimized peer selection in P2P networks.
Input: Peer attributes, fuzzy membership functions, learning rate α, discount factor γ, exploration rate ϵ.
  Output: Optimized peer selection set for P2P networks.
Step 1.
Define fuzzy sets (low, medium, and high)
μLow(x) = 1 / (1 + e^(x − c1))
μMedium(x) = 1 / (1 + e^(x − c2))
μHigh(x) = 1 / (1 + e^(x − c3))
Step 2.
Formulate FLP problem using objective function
Maximize Z = λ1·f1(x) + λ2·f2(x) + … + λn·fn(x)
Subject to μi(Σj aij·xj) ≥ bi, ∀ i, j
Step 3.
Set α = 0.1, γ = 0.9, and ϵ = 0.1.
Step 4.
Define state space S = {s1, s2, …, sm}, action set A = {a1, a2, …, an}, and reward R(s, a).
Step 5.
Run Q learning with fuzzy adjustments:
For each episode:
Step 5.1.
Initialize state s .
Step 5.2.
Choose action a: a = random action if rand < ε; otherwise a = argmaxa Q(s, a).
Step 5.3.
Perform action a, and observe the next state s′ and reward r.
Step 5.4.
Update Q-value using Equation (2).
Step 5.5.
Ensure satisfaction of fuzzy constraints μi(Σj aij·xj) ≥ bi.
Step 6.
Compute optimal policy π(s) = argmaxa Q(s, a).
Step 7.
Compute OptimalAction = argmaxa [Σi=1..n λi·fi(x)]
Subject to: μi(Σj aij·xj) ≥ bi
Step 8.
Output the set of best peers for content distribution.
This integrated approach effectively combines fuzzy logic and Q learning, enabling robust and adaptive peer selection in P2P networks by handling uncertainty and optimizing performance through learning.
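The core loop of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions (a single fuzzy "low latency" constraint with a sigmoid membership function as in Step 1, and the Step 3 parameter values); the environment dynamics and reward shaping of the actual system are not reproduced here.

```python
import math
import random
from collections import defaultdict

def membership(x, c):
    """Step 1: sigmoid membership function mu(x) = 1 / (1 + exp(x - c))."""
    return 1.0 / (1.0 + math.exp(x - c))

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # Step 3 parameter values
Q = defaultdict(float)              # Q[(state, action)], initialized to 0

def choose_action(state, actions):
    """Step 5.2: epsilon-greedy action selection."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update_q(s, a, r, s_next, actions):
    """Step 5.4: standard Q-learning update."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def feasible(latency_ms, threshold=0.5, c=100.0):
    """Step 5.5 analogue: accept a peer only if its membership in the
    'low latency' fuzzy set meets the constraint threshold b_i.
    The center c and threshold are illustrative values."""
    return membership(latency_ms, c) >= threshold
```

After enough episodes, the Step 6 policy is simply `argmax` over the Q table for each state, restricted to peers passing the fuzzy feasibility check.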

4. Performance Evaluation

The performance of the proposed framework for peer selection optimization in P2P networks is determined using the following key performance metrics.

4.1. Parameters

Throughput: It is measured in packets per second (pps). Higher throughput indicates that network resources are utilized effectively and that data can be transferred at a faster rate.
Latency: It is also referred to as delay, which means the amount of time that a particular data packet takes to travel through a network from the source to the destination. The basic unit of this measurement is the millisecond (ms). Low latency means a shorter interval between data transmission and reception. It is important in everyday applications such as video on demand, online gaming, and VoIP calls. Reducing latency through the choice of optimal peers enhances the P2P system's response rate and QoS.
Connectivity: It measures the closeness of the peers in a network, considering the network topology. It quantifies the connectivity and dependability of the communication links between peers and determines the fault tolerance of the network. Increased density results in multiple connection pathways within a network and decreases the overall probability of network segmentation and isolation due to node or link failure.
Resource Utilization: Bandwidth usage and the scheduling and assignment of load across the distributed sites provide the key quantitative measures of resource usage within the network. They evaluate how effectively the network uses the available bandwidth capacity and computing power for data transmission and computation. Conserving resources minimizes wastage, optimizing the utilization of available resources and improving the efficiency and cost of the network.
QoS Satisfaction: Quality of service (QoS) metrics, such as packet loss rate and jitter, measure the reliability and stability of data transmission in the network. Minimizing packet loss and jitter ensures consistent and reliable communication, particularly for real-time applications.

4.2. Existing Systems

In dynamic P2P networks, the peer selection strategy must be chosen so that it fits the network characteristics, operational needs, and specific application requirements. Each scheme given below has unique advantages and limitations, underscoring the importance of matching the needs to available resources, avoiding excessive latency, and improving the overall quality of the network. The following are some commonly used peer selection schemes against which the proposed system is evaluated:
1. Random Peer Selection (RPS): It is likely the least complex scheme, in which peers are selected arbitrarily for data exchange and resource queries. It entails low implementation costs and is relatively easy to use. However, its main drawback is inefficiency; randomly selected peers do not necessarily make the best use of the available resources. This randomness can result in high latency when accessing resources, especially in a large network, where the chance of randomly selecting an appropriate peer is low.
2. Neighbor Selection (NS): It focuses on choosing peers based on network proximity, which can mean low latency or geographic closeness. In this scheme, nodes that are closer are selected as neighbors, with the aim of reducing latency, since the closest neighbors are more likely to hold resources locally. The neighbor lists must be constantly updated and monitored to remain effective. This can be a serious problem in large networks, because frequent updates introduce overhead in maintaining the neighbor lists.
3. Churn-Aware Selection (CAS): It considers peer churn rates, that is, the probability that peers join or exit the network. This scheme aims to improve network longevity and responsiveness by identifying peers that are less likely to churn. By adjusting to node joins and departures, churn-aware selection maintains an optimized flow and thus preserves network continuity. However, churn patterns may be difficult to predict; they demand accurate algorithms and real-time control procedures that can adapt peer selections on the fly.
4. Social-Based Selection (SBA): This technique uses the characteristics of the social network or the trust relationships between peers to select reliable nodes for resource discovery. Peers with better social connectivity or higher trust scores are favored for data exchange. This scheme improves the efficiency of resource discovery by exploiting existing social structures but places strong demands on trust management. Security threats can arise if trust metrics are corrupted, making peer selections questionable.
5. Utility-Based Selection (UBS): It involves choosing peers that are most likely to offer the resources needed by the requester. Peers that offer higher utility with respect to available bandwidth, storage space, and processing capacity are selected. This scheme enhances the usage of scarce resources and network performance through proper matching of requirements to available resources. However, estimating peer utilities and avoiding free-riding, where peers benefit from resources without contributing sufficiently, remains a major problem.

4.3. Simulation

An Erdos–Renyi (ER) graph is characterized by two parameters: the number of nodes in the network, usually denoted by n, and the probability that any two nodes are linked by an edge, denoted by p. The model, referred to as G(n,p), is preferred for its simplicity and applicability for modeling random connections in different types of real networks, especially P2P networks. In P2P networks, the nodes are the peers or participants, while the edges are the possible channels of data transfer between peers. The structure of Erdos–Renyi graphs is random in nature, which makes them suitable for modeling the dynamic and decentralized nature of P2P networks, where peers join and leave the system frequently and connections and disconnections are unpredictable. ER graphs offer a means to model key network properties, such as connectivity, degree distribution, clustering, and path length, making them valuable for analysis. They provide system designers and analysts with methods to model and schedule P2P networks, suggesting how such networks should be structured to achieve optimal communication efficiency and reliability. Figure 10 shows the Erdos–Renyi graphs with random connections for 100 peers. Simulation factors are presented in Table 3, Table 4 and Table 5 below. The RL agent communicates with the ns3 simulator by monitoring the status of the whole network, including link bandwidth, delay, packet drop probabilities, and the statuses of nodes. Its actions are performed at the ns3 simulation level by changing the routing tables and connection settings of the nodes. The ns3 simulator then responds to the RL agent with feedback in the form of rewards. For this simulation, the rewards consist of network performance parameters, such as throughput, latency, and packet delivery ratios. At the same time, the RL agent is connected with the FLP model, which consists of a fuzzy system that accounts for uncertainty and imprecision in the network data.
The FLP model then uses this fuzzy system to find the optimal or near-optimal peers required to meet the aforementioned goals without violating the constraints to a large extent. The results of these optimizations are provided to the RL agent to guide its decisions.
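The G(n, p) topology used in the simulation can be generated with a few lines of standard-library Python (no networkx required). The value p = 0.05 below is illustrative; the actual simulation parameters are listed in Tables 3–5.

```python
import random

def erdos_renyi(n, p, seed=42):
    """Build a G(n, p) random graph as an adjacency list: each of the
    n*(n-1)/2 possible node pairs is connected independently with
    probability p. A fixed seed makes the topology reproducible."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

graph = erdos_renyi(100, 0.05)  # 100 peers, as in Figure 10
avg_degree = sum(len(v) for v in graph.values()) / len(graph)
# Expected average degree is roughly p * (n - 1) = 4.95 for these values.
```

Each key in `graph` is a peer and each neighbor set represents the possible data transfer channels, matching the interpretation given in the text.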

4.4. Dataset Structure

4.4.1. State Space

In applying Q learning integrated with FLP to peer selection in a P2P network, the state space refers to the different situations and configurations of the network. This state space is useful because it lays out the different conditions under which decisions about peer selection need to be made. For example, when the number of peers is between 100 and 600, the state variables include the number of active peers, network load, bandwidth availability, and trust between peers. Every state captures the condition of the network at a particular time, allowing the Q-learning algorithm to track the network's continually evolving status. Since the algorithm models these states correctly, it can make the best decision for each network condition, thereby enhancing the efficiency and effectiveness of peer selection. This dynamic representation emulates the changing context of real-world systems, helping the Q-learning framework refine its peer selection policies over time. Several significant attributes deserve identification in order to shape the state space and improve decision making in this integration. The number of active peers represents the number of peers at any given time, which determines connectivity and resource sharing at the given network size. Network load characterizes the movement of data and its impact on congestion and on the efficiency of resource usage. Bandwidth availability is the capacity of the communication links, which determines the speed of data transmission. Latency is the time the data take to travel from the source to the intended destination, where a lower value means better network responsiveness. QoS metrics measure the general quality of service, with throughput and error rates influencing end users' satisfaction levels.
Peer availability records how often peers are online, which is necessary for network reliability. Resource demand highlights the needs of peers, ensuring efficient resource distribution. Finally, fuzzy membership values from fuzzy logic provide nuanced insights into how attributes meet specific criteria, facilitating a more adaptable and flexible selection of peers. These attributes collectively enable a robust and adaptive optimization process. The state space representation is given in Table 6.
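The attributes listed above can be grouped into a single state record. The field names and discretization thresholds below are illustrative assumptions; Table 6 defines the paper's actual state encoding. Discretizing the continuous fields keeps the Q table finite.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkState:
    active_peers: int        # number of currently active peers (100-600)
    load: float              # normalized network load in [0, 1]
    bandwidth_mbps: float    # available bandwidth on candidate links
    latency_ms: float        # end-to-end delay
    trust: float             # fuzzy membership value in [0, 1]

    def key(self):
        """Discretize continuous fields into coarse buckets so that
        states can serve as Q-table keys (hypothetical bucket sizes)."""
        return (self.active_peers // 100,
                round(self.load, 1),
                int(self.bandwidth_mbps // 10),
                int(self.latency_ms // 50),
                round(self.trust, 1))

s = NetworkState(250, 0.42, 65.0, 140.0, 0.8)
```

Two snapshots with similar metrics map to the same bucket tuple, so the agent's experience generalizes across nearly identical network conditions.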

4.4.2. Action Space

The action space includes the options through which the system or peers can make changes that improve network performance. It encompasses deciding which peer to communicate with for data transfer, changing resource usage, and reconfiguring peer connections in response to the network scenario. Furthermore, it includes managing replication to achieve high availability, managing network load by reallocating resources, and handling peer churn by adapting strategies as the network changes. The action space also includes applying QoS adjustments and making decisions based on fuzzy logic when dealing with imprecise or unknown information. Every action carried out in this space impacts the network utilization parameters, including throughput, latency, and connectivity. The actions must therefore be examined to identify those that contribute the best results for network performance; the Q-learning algorithm identifies them through the learning process for the given configuration of the P2P system. The characteristics of each peer consequently contribute to the definition of the operating environment and the necessary choices. Such attributes include the peer ID, which identifies each peer in the network precisely. The available bandwidth evaluates the ability of a peer to accommodate data transfer and is used to decide which peer to assign data requests to. Resource contribution measures the storage capacity and other resources a peer can offer, which is important for sharing and replicating data. Connection quality refers to indicators of latency and reliability; these are vital for enabling efficient and timely communication with peers. QoS metrics include parameters such as delay and jitter, ensuring that the provided QoS is in keeping with the expected network QoS.
Peer load indicates the current load on a peer, ensuring that its resources are not overcommitted and that it can continue serving other requests. Historical performance data contain information about past behavior and dependability, which aids decision making based on proven performance. The action space representation is given in Table 7.
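A candidate action (selecting a peer) can be represented by a record of the attributes described above, ranked by a simple utility. The field names and weights here are illustrative assumptions; Table 7 defines the paper's actual action space encoding.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    peer_id: str
    bandwidth_mbps: float   # available bandwidth
    contribution: float     # storage/resources offered, normalized [0, 1]
    latency_ms: float       # connection quality indicator
    load: float             # current load, normalized [0, 1]
    history: float          # historical reliability score in [0, 1]

    def score(self):
        """Hypothetical weighted utility for ranking candidate peers:
        reward bandwidth, contribution, and reliability; penalize
        latency and current load."""
        return (0.3 * self.bandwidth_mbps / 100.0
                + 0.2 * self.contribution
                - 0.2 * self.latency_ms / 1000.0
                - 0.1 * self.load
                + 0.2 * self.history)

peers = [Peer("p1", 80, 0.9, 120, 0.3, 0.95),
         Peer("p2", 30, 0.4, 300, 0.8, 0.50)]
best = max(peers, key=Peer.score)
```

In the full system, these static scores would be one input among many; the learned Q-values ultimately decide which peer is chosen in each state.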

4.4.3. Rewards

The proposed integration includes a reward system that helps peers modify their behavior and, consequently, improve network performance. For instance, peers receive positive rewards for actions that promote data distribution, which benefits their interactions. Likewise, incentives are given for low latency and high throughput, encouraging actions that favor short response times and greater bandwidth. Low resource wastage and high QoS satisfaction are also rewarded, so that peers utilize network resources optimally and provide high-quality service. Sustaining superior connectivity and managing peer turnover are further incentivized in order to maintain network stability in the face of constant peer changes. This reward system is a fundamental component of the Q-learning algorithm because it provides feedback to peers regarding the efficiency of their actions in the learning environment. The reward representation is shown in Table 8.
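The reward signals described above can be combined into a single scalar as sketched below. The weights and normalization constants are illustrative assumptions, not the values defined in Table 8.

```python
def reward(throughput_mbps, latency_ms, packet_loss, qos_satisfied):
    """Hypothetical scalar reward: positive weight on throughput and QoS
    satisfaction, negative weight on latency and packet loss ratio."""
    r = 0.0
    r += 0.5 * (throughput_mbps / 100.0)   # reward high throughput
    r -= 0.3 * (latency_ms / 1000.0)       # penalize delay
    r -= 0.2 * packet_loss                 # penalize loss ratio in [0, 1]
    r += 0.1 if qos_satisfied else -0.1    # bonus for meeting QoS targets
    return r

# A fast, reliable peer interaction earns more than a slow, lossy one.
good = reward(85.0, 140.0, 0.005, True)
bad = reward(20.0, 350.0, 0.05, False)
```

Because the reward is a weighted sum, the relative weights directly encode which objective (throughput, delay, loss, QoS) the agent should prioritize.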

4.4.4. Transition Probability

Transition probability in the proposed integration refers to the likelihood that a peer moves from one state to another over time for a given action. It measures the probability of a peer transitioning from state s to a new state s′ following a particular action a. This probability depends on the action's impact, the network flow, and the behaviors of the other peers. If a peer is currently "Idle" and decides to "Forward Data Request", the transition probability P(s′ | s, a) measures the chance of the peer successfully entering the "Requesting Data" state. Exact estimation of these probabilities is crucial for Q learning, as it helps predict the outcomes of actions and optimize decision-making processes. These probabilities can be derived from historical and simulation data to assist peers in making better decisions, thus improving the efficiency and performance of the P2P network. The transition probability representation is given in Table 9.
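Estimating P(s′ | s, a) from historical or simulation data, as suggested above, reduces to counting observed transitions and normalizing. The state and action names below mirror the "Idle" / "Forward Data Request" example; the observed counts are illustrative.

```python
from collections import defaultdict

# counts[(s, a)][s_next] = number of times (s, a) led to s_next
counts = defaultdict(lambda: defaultdict(int))

def record(s, a, s_next):
    """Log one observed transition from simulation or historical data."""
    counts[(s, a)][s_next] += 1

def transition_prob(s, a, s_next):
    """Maximum-likelihood estimate of P(s' | s, a) from the counts."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Illustrative data: "Forward Data Request" from "Idle" succeeds 8 of 10 times.
for _ in range(8):
    record("Idle", "Forward Data Request", "Requesting Data")
for _ in range(2):
    record("Idle", "Forward Data Request", "Idle")
```

Model-free Q learning does not require these probabilities explicitly, but the estimates are useful for the steady-state analysis discussed in the scalability section.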

4.5. Results and Discussion

The proposed system performs real-time adaptive decision making and learns from network conditions to optimize peer selection, unlike conventional heuristic methods. Compared with rigid threshold-based methods, the incorporation of fuzzy constraints produces a more flexible and seamless selection process by managing uncertainties in peer reliability, latency, and bandwidth. Moreover, the framework improves resource utilization, accelerates learning convergence, and maintains stable connections, even under high rates of peer churn. With its efficient computational design, it offers a scalable, robust, and adaptive alternative to conventional methods, ensuring improved throughput and reduced latency. The results are analyzed in detail in this section.
The throughput parameters show that the proposed framework clearly outperforms the existing peer selection methods. It achieved an average throughput of 65 Mbps, which indicates efficient data transmission across the network. In terms of peak throughput, the system reached 85 Mbps, demonstrating its capability to handle high data loads effectively. The system showed low throughput variability, highlighting its consistency in maintaining stable data transmission rates over time. The data packet delivery ratio was recorded at 95% for the proposed system, ensuring that the vast majority of data packets are successfully delivered to their destinations; this high delivery ratio shows the system's robustness in handling data traffic. The throughput efficiency stood at 90%, indicating that the system effectively uses its maximum potential throughput. Bandwidth usage reached 88%, proving that the system made effective use of all connections and showcasing its ability to make the most of available network resources. The proposed system completed 80% of data transfers successfully during the simulation, which underscores its reliability and robustness. This is recorded and shown in Figure 11. The data transfer rate of 1024 KB/s further confirms the high performance of the system in active data transfers.
The proposed framework demonstrates low latency compared to the other methods for a minimum of 100 peers in the network. The proposed system exhibited a latency of 140 ms, which was much smaller than RPS at 150 ms, NS at 170 ms, CAS at 220 ms, SBA at 300 ms, and UBS at 350 ms. This reduction in latency demonstrates the system's efficiency in minimizing the time taken for data packets to traverse the network. This is shown in Figure 12. The peak latency of the proposed system was 300 ms, compared with 350 ms for RPS, 360 ms for NS, 370 ms for CAS, 400 ms for SBA, and 435 ms for UBS. This lower peak latency indicates the system's robustness in maintaining low delay even under heavy network traffic conditions. The system also showed a small latency jitter of 5 ms, showcasing its capability to provide consistent and predictable performance. In contrast, RPS, NS, CAS, SBA, and UBS showed higher jitter values of 20 ms, 18 ms, 15 ms, 12 ms, and 10 ms, respectively. Low jitter is crucial for applications requiring real-time data transmission. In addition, the proposed system's latency variance was 2 ms, which was less than that of RPS, NS, CAS, SBA, and UBS, at 10, 9, 8, 7, and 6 ms, respectively. This lower variance confirms that the system can sustain the identified latency rates, strengthening the reliability claim under varying network conditions. This is reported and shown in Figure 13. Concerning the RTT measure, the proposed system has an RTT of 30 ms, while RPS has an RTT of 80 ms, NS 75 ms, CAS 70 ms, SBA 65 ms, and UBS 60 ms. A lower RTT enhances user experience by ensuring quicker acknowledgment and response times. This is reported and shown in Figure 14.
The connectivity of the proposed system was found to be better than that of the existing systems. The average connectivity of the proposed system is 8, which is much higher than that of RPS (4), NS (5), CAS (6), SBA (7), and UBS (6). This increases the service's reliability and the extent of interactions between peers in the network. Additionally, the clustering coefficient for the proposed system was 0.75, compared to 0.3 for RPS, 0.35 for NS, 0.4 for CAS, 0.5 for SBA, and 0.45 for UBS, indicating a stronger tendency for peers to form groups, thereby providing more reliable local connectivity and fault tolerance, as the peers have a greater tendency to cluster together. The network diameter, a measure of the longest shortest path between any two nodes, was 5 for the proposed system, a significant improvement over RPS (10), NS (9), CAS (8), SBA (7), and UBS (6). This shorter network diameter implies quicker data transfer and shorter delay times. This is recorded and shown in Figure 15. The average path length in the proposed system was 3, which was less than RPS at 6, NS at 5.5, CAS at 5, SBA at 4.5, and UBS at 4, thereby reducing overall latency and speeding up communication. In addition, the proposed system showed higher redundancy and fault tolerance, with a redundancy factor of 0.85 compared with RPS at 0.4, NS at 0.45, CAS at 0.5, SBA at 0.6, and UBS at 0.55. This relatively high redundancy assures that the network stays alive even with node failures, resulting in low data loss and little disruption of network connectivity. The proposed system achieved a higher stability rate, with connectivity stability at 95%, compared with RPS at 70%, NS at 75%, CAS at 80%, SBA at 85%, and UBS at 90%. This high stability rate clearly indicates that the relations between peers are more stable and that connections are less likely to be interrupted.
In terms of resource utilization, the proposed system outperforms the traditional systems. It shows better CPU utilization efficiency at 85%, compared to RPS at 60%, NS at 65%, CAS at 70%, SBA at 75%, and UBS at 80%. This high CPU utilization efficiency ensures that computational resources are optimally used without excessive overhead. Efficient memory usage is crucial for handling large amounts of data and supporting numerous simultaneous peer connections; the proposed system used less memory than RPS, which averaged 80% memory utilization, and than NS at 60%, CAS at 65%, SBA at 70%, and UBS at 75%. The bandwidth utilization of the proposed system was optimal at 90%, while that of RPS, NS, CAS, SBA, and UBS was much lower at 50%, 55%, 60%, 65%, and 70%, respectively. This confirms that bandwidth is optimally used to improve the flow of data across the network while reducing the congestion that may hinder the performance of the overall P2P network. The proposed system also demonstrated better disk I/O utilization, with an average value of 85%, compared with 50%, 55%, 60%, 65%, and 70% for RPS, NS, CAS, SBA, and UBS, respectively. This is shown in Figure 16. Furthermore, the proposed system's ability to balance load across peers was exceptional, achieving a load balancing efficiency of 90%, higher than RPS (50%), NS (55%), CAS (60%), SBA (70%), and UBS (75%). Load management helps overcome bottlenecks, ensuring that a single overloaded peer does not affect the continuous operation of the entire network.
This is shown in Figure 16, and the upload capacity measure is also shown in Figure 17. The proposed approach increased the level of QoS satisfaction compared to existing systems. The framework attains 95% reliability and 98% availability, ensuring consistent and dependable service delivery and far exceeding the performance of the other systems. Additionally, the proposed system exhibits a low jitter of only 5 ms and a very low packet loss of only 0.5%, which are important requirements for delay-sensitive communication, such as video streaming and VoIP. Such metrics illustrate the smooth and continuous data transmission supported by the system. The proposed system also shows 96% user satisfaction, which strongly supports the intelligent combination of FLP and Q learning in selecting peers dynamically and optimally to match changing network situations and user demands. This combination maximizes resource usage and guarantees better QoS satisfaction, making the proposed framework a strong solution to the problem of P2P network optimization. This is shown in Figure 18. Table 10 shows the comparative analysis of peer selection methods.

4.6. Computational Complexity and Scalability Analysis

The computational complexity of the proposed approach is mainly determined by the Q-learning update process, which follows O(|S| × |A|), where |S| is the number of states and |A| is the number of actions. Given that the number of states is proportional to the number of peers N, the worst-case complexity is approximated as O(N²). The fuzzy logic component also adds some computational overhead; since it operates with a limited set of fuzzy rules and membership functions, its complexity remains O(m × r), where m is the number of input variables and r is the number of fuzzy rules. The overall complexity of the proposed system is expressed as follows:
T(n) = O(n²) + O(m × r)
Since m and r are considerably smaller than n, the dominant term remains O(n²). This makes the approach computationally feasible even for large-scale networks.
The scalability of the proposed approach is evaluated by its ability to handle increasing network size while maintaining computational efficiency, adaptability, and resource optimization. As P2P networks grow, the number of peers fluctuates due to frequent joining and leaving, making scalability a critical factor for the robustness of the proposed system. Peer availability fluctuates due to churn, which leads to frequent changes in the network structure. The proposed method effectively adjusts to these changes by continuously updating the Q table and adjusting the fuzzy parameters. The steady-state probability distribution of peer selection is computed from a state transition matrix, ensuring that the system stays stable despite variations in peer availability. The expected time spent in each state is determined from the peer departure rate and the steady-state probability, allowing the system to maintain an optimized selection process. The proposed system scales efficiently without an increase in computational overhead for large-scale networks. Experimental results show that when the number of peers increases from 100 to 600, the framework continues to make near-optimal selections, effectively balancing response time, network throughput, and resource utilization. The method ensures that even in large-scale environments, peer selection remains robust, minimizing bottlenecks and improving overall network efficiency.

5. P2P Optimization in Sensor Networks and IoT

P2P communication is a crucial means of meeting the low-energy-consumption requirements brought about by the rapid proliferation of sensors in sensor and actuator networks (SANs) and wireless sensor networks (WSNs). The formation of a ubiquitous sensor environment, where several types of sensor devices provide sensor data, has been fueled by recent advancements in wireless and ubiquitous technology. In this setting, building a P2P network is effective for retrieving sensor data efficiently, since a large number of sensor devices join the network. Beyond P2P networks, the issue discussed in this article also applies to the Internet of Things (IoT) and SANs. Similar optimization can be used to improve data distribution, resource usage, and overall network efficiency in SANs and IoT systems, even though P2P networks are mainly concerned with optimizing the distribution of video content. The proposed FLP offers a structured and informed decision-making framework that can be used to address key issues in these domains. This research also incorporates Q learning, which dynamically optimizes decision making in these dynamic network environments. Through adaptive learning from network conditions made possible by Q learning, nodes can gradually improve their routing, data placement, and node selection choices. Q learning combined with fuzzy-based decision making forms an intelligent optimization framework that can adjust to changes in sensor networks, the Internet of Things, and actuator systems in real time. The following aspects illustrate how this research enhances SANs and the IoT.

5.1. Optimized Data Dissemination in IoT

IoT systems contain a vast number of distributed devices that generate and exchange data continuously. Efficient data distribution is essential for guaranteeing low latency, minimal congestion, and reliable communication across the devices in the network. Like P2P networks, IoT networks need intelligent selection of intermediary nodes to optimize data-forwarding paths. The proposed framework models IoT-based sensor and actuator networks (SANs) using a state-space representation, enabling dynamic adaptation to changing network conditions. Each IoT node functions as an intelligent agent that maintains a state representation built from factors such as residual energy, link quality, bandwidth availability, hop count, and queue delay. This allows IoT networks to learn and adapt to changes in network conditions: using Q learning, each node can assess the success of past routing decisions and adjust its future actions to minimize delay and congestion. The system can autonomously identify optimal paths for data transmission, ensuring that high-priority data are delivered quickly and efficiently among the nodes. In smart cities, where IoT sensors monitor weather, traffic, and pollution, Q learning can dynamically route crucial sensor data along optimal network paths, avoiding congested routes and guaranteeing real-time updates for emergency services. Although existing studies in this domain have explored energy-efficient routing and congestion control separately, this work uniquely combines both aspects using machine learning and fuzzy optimization techniques. By using state-space representation, Q learning, and FLP, the proposed framework provides a more adaptive and scalable solution that improves network longevity, reliability, and performance in IoT-driven sensor networks.
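The tabular Q-learning update a node might run when choosing a next-hop peer can be sketched as follows. The hyperparameters mirror Table 4 (α = 0.1, γ = 0.9, ε = 0.2), but the state names, candidate-peer labels, and reward value are illustrative, not taken from the paper.

```python
# Sketch of the tabular Q-learning loop an IoT node could run for next-hop
# selection. States, peers, and the reward shaping here are illustrative.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2       # learning rate, discount, exploration
ACTIONS = ["peer_a", "peer_b", "peer_c"]    # hypothetical candidate next hops

Q = defaultdict(float)                      # Q[(state, action)] -> expected reward

def choose_action(state):
    """Epsilon-greedy: explore randomly, otherwise pick the best-known peer."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning update: Q += alpha * (r + gamma * max Q' - Q)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One simulated transition: forwarding via peer_a succeeded with low delay,
# earning the "successful download" reward of 100 from Table 4.
update("congested", "peer_a", reward=100, next_state="normal")
```

With all Q-values initially zero, this single update raises Q(congested, peer_a) to α × 100 = 10, so subsequent greedy choices in that state prefer peer_a.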

5.2. Adaptive Node Selection in Sensor Networks

WSNs function in resource-constrained settings where energy efficiency and network longevity are the main concerns. In conventional sensor networks, node selection for data forwarding is often based on predefined heuristics, which may not adapt effectively to dynamic changes in network conditions such as energy depletion, congestion, and link-quality variations. In contrast, our framework models sensor nodes as intelligent agents within a state-space representation, where each node assesses its environment, updates its state dynamically, and chooses the most suitable peer for data transmission. The state representation includes key network parameters, such as residual energy, signal strength, queue load, and transmission latency, ensuring that node selection considers both energy efficiency and network stability. The proposed framework is appropriate for these networks because of its real-time status updates, adaptive learning mechanism, and resilience to environmental changes. These features guarantee a longer operating lifetime, improved network dependability, and effective resource usage. Choosing the best possible relay nodes is vital for balancing energy consumption, data priority, and connectivity: if inefficient nodes are selected for data forwarding, the network suffers rapid energy depletion and communication failures. Using the proposed techniques, sensor nodes can learn the best relay selection strategies from past transmissions. Each node maintains a Q table that stores reward values for different routing decisions in the network. Over time, nodes adapt to network changes and dynamically choose the most energy-efficient paths.
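One way the relay reward described above could combine residual energy and latency is sketched below. The weights, normalization, and candidate values are our illustrative assumptions; the paper does not prescribe this exact formula.

```python
# Sketch: an energy-aware reward for relay selection, combining residual
# energy and transmission latency. Weights and attribute values are
# illustrative assumptions, not the paper's exact reward function.

def relay_reward(residual_energy, latency_ms, w_energy=0.6, w_latency=0.4,
                 max_latency=100.0):
    """Higher residual energy and lower latency earn a higher reward in [0, 1]."""
    energy_term = residual_energy                       # normalized to [0, 1]
    latency_term = 1.0 - min(latency_ms / max_latency, 1.0)
    return w_energy * energy_term + w_latency * latency_term

# Candidate relays: (node id, normalized residual energy, observed latency in ms)
candidates = [("n1", 0.9, 40.0), ("n2", 0.5, 10.0), ("n3", 0.2, 5.0)]
best = max(candidates, key=lambda c: relay_reward(c[1], c[2]))
```

Here n1 wins (0.6 × 0.9 + 0.4 × 0.6 = 0.78) despite its higher latency, reflecting the energy-first weighting; such rewards would feed the Q-table updates that each node maintains.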

5.3. Dynamic Resource Allocation in Heterogeneous Networks

WSNs and SANs are heterogeneous in nature and exhibit considerable disparities in node capabilities, demanding an adaptive mechanism for efficient resource distribution. Conventional static resource allocation techniques fail to adjust to these variations, resulting in inefficient data placement and network congestion. The proposed framework's adaptability makes it highly appropriate for dynamic resource allocation in heterogeneous networks, where nodes possess varying computational power, bandwidth, and energy resources. Through its Q-learning component, the framework continuously learns optimal resource allocation strategies by monitoring network conditions, including node availability, energy levels, bandwidth capacity, and processing power. This is achieved through state–action–reward observations, with the learning process incrementally improving decision-making accuracy. The inclusion of FLP improves this process further by allowing soft decision making even when data are incomplete. The framework also dynamically adjusts its policies to balance workload distribution across diverse nodes, enhancing throughput and reducing latency in these networks. The robustness of the Q-learning model, combined with FLP's ability to handle uncertainties, guarantees efficient handling of dynamic topologies and varying resource availability at any moment, with the potential to improve the overall performance of the sensor network. For instance, in edge computing environments, IoT devices often offload data to edge nodes for real-time processing; the suggested framework uses adaptive load balancing to allocate computational resources effectively, minimizing latency and enhancing throughput. It enables self-learning, adaptive, and energy-efficient decision making in distributed network environments.

5.4. Implementation of Proposed Work in SANs

Existing studies in SANs have mainly focused on heuristic-based routing, clustering algorithms, and static optimization techniques to improve energy efficiency and data transmission reliability. Low-energy adaptive clustering hierarchy (LEACH), power-efficient gathering in sensor information systems (PEGASIS), and QoS-aware routing are common approaches [32,33]. These methods operate on fixed decision rules and predefined optimization criteria, making them less adaptive to dynamic network conditions such as node failures, energy exhaustion, and irregular traffic loads [34]. In contrast, our research introduces a Q-learning-based adaptive node selection mechanism combined with FLP, which enables nodes to self-learn and dynamically adjust to real-time network conditions. Furthermore, while many existing SAN protocols rely on periodic clustering, which introduces bottlenecks and single points of failure, our system works in a fully decentralized manner, allowing nodes to make local, intelligent decisions without any centralized coordination. One important factor in implementing any technique in SANs is computational efficiency. Whereas deep reinforcement learning methods are computationally intensive, our approach employs Q learning with a lightweight table-based model, making it appropriate for low-power sensor nodes with restricted processing capacity. FLP likewise requires minimal additional computation, as it mainly involves solving linear equations with fuzzy constraints, which can be processed efficiently even in resource-constrained environments. Furthermore, our system can be fine-tuned and optimized to make it more suitable for SANs. Another crucial factor is hardware compatibility. Modern sensor nodes based on Zigbee, LoRa, and Bluetooth Low Energy (BLE) are capable of handling lightweight machine learning algorithms and optimization techniques.
Our framework can be implemented on low-power microcontrollers such as the ARM Cortex-M series and ESP32, with negligible memory and processing requirements, making it a realistic solution for real-world SAN applications. From an implementation perspective, the proposed system can be integrated into existing SAN architectures without considerable modification. Many sensor networks already employ distributed decision-making methods for routing and resource allocation, so it is straightforward to deploy the proposed system on top of existing frameworks. Furthermore, the system does not rely on centralized control, which guarantees that it remains robust against node failures and network fragmentation.
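A quick back-of-the-envelope check supports the claim that the table-based model fits constrained hardware. The state and action counts come from Tables 6 and 7 (6 states, 5 actions); the 4-byte entry size assumes single-precision floats and is our assumption, not the paper's.

```python
# Back-of-the-envelope memory footprint of a tabular Q model, to illustrate
# its suitability for low-power microcontrollers. The 4-byte entry size
# assumes single-precision floats (our assumption).

def q_table_bytes(n_states, n_actions, bytes_per_entry=4):
    """Total bytes for a dense |S| x |A| Q table."""
    return n_states * n_actions * bytes_per_entry

# 6 states x 5 actions (Tables 6 and 7) -> 120 bytes
size = q_table_bytes(n_states=6, n_actions=5)
```

Even a much larger state space of several hundred entries would occupy only a few kilobytes, comfortably within the SRAM of an ARM Cortex-M or ESP32 class device.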

6. Conclusions

The integration of fuzzy linear programming (FLP) with Q learning presents a promising approach to enhancing decision making in dynamic and uncertain P2P network environments. The combination exploits FLP's ability to handle imprecise and uncertain information while utilizing Q learning's adaptability to dynamic conditions, yielding an efficient and intelligent peer selection mechanism. FLP guarantees that multiple conflicting objectives are addressed simultaneously, achieving the best feasible solution, while Q learning introduces an adaptive learning mechanism that refines decision-making rules based on past experience and observed network conditions. This learning process improves the robustness of the peer selection strategy and allows the system to respond effectively to variations in network behavior. The integration also maintains a well-balanced trade-off between exploration and exploitation by estimating the best possible actions from cumulative future rewards; this balance is essential for an efficient peer selection process that does not over-explore. The system continuously updates its policies based on learned rewards, allowing it to develop and refine selection strategies dynamically. The integrated approach thus has considerable potential to improve the efficiency, robustness, and adaptability of peer selection, as well as the effectiveness and reliability of P2P networks in general. In particular, the presented methodology achieved 21% higher throughput, 40% lower latency, and 30% higher stability compared with traditional systems. It offers a robust solution for intelligent peer selection in P2P networks, paving the way for further research and practical deployment in any kind of distributed computing setup.

Author Contributions

Conceptualization, M.A. (Mahalingam Anandaraj) and T.A.; methodology, M.A. (Mahalingam Anandaraj); software, M.A. (Mahalingam Anandaraj); validation, M.A. (Mahalingam Anandaraj), T.A. and M.A. (Mohammad Alkhatib); formal analysis, T.A.; investigation, M.A. (Mahalingam Anandaraj); resources, M.A. (Mohammad Alkhatib); data curation, T.A.; writing—original draft preparation, M.A. (Mahalingam Anandaraj); writing—review and editing, T.A.; visualization, M.A. (Mohammad Alkhatib); supervision, T.A.; project administration, M.A. (Mahalingam Anandaraj); funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-RG23028).

Data Availability Statement

Data beyond those presented in this article will be supplied by the authors upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gebraselase, B.G.; Helvik, B.E.; Jiang, Y. Bitcoin P2P Network Measurements: A testbed study of the effect of peer selection on transaction propagation and confirmation times. IEEE Trans. Netw. Serv. Manag. 2022, 19, 3975–3987. [Google Scholar] [CrossRef]
  2. Budhkar, S.; Tamarapalli, V. An overlay management strategy to improve QoS in CDN-P2P live streaming systems. Peer-to-Peer Netw. Appl. 2019, 13, 190–206. [Google Scholar] [CrossRef]
  3. Hwang, I.-S.; Rianto, A.; Kharga, R.; Ab-Rahman, M.S. Global P2P BitTorrent Real-Time Traffic Over SDN-Based Local-Aware NG-PON2. IEEE Access 2022, 10, 76884–76894. [Google Scholar] [CrossRef]
  4. Ren, Y.; Zeng, Z.; Wang, T.; Zhang, S.; Zhi, G. A trust-based minimum cost and quality aware data collection scheme in P2P network. Peer-to-Peer Netw. Appl. 2020, 13, 2300–2323. [Google Scholar] [CrossRef]
  5. Nacakli, S.; Tekalp, A.M. Controlling P2P-CDN Live Streaming Services at SDN-Enabled Multi-Access Edge Datacenters. IEEE Trans. Multimed. 2020, 23, 3805–3816. [Google Scholar] [CrossRef]
  6. Luo, S.; Yu, H.; Li, K.; Xing, H. Efficient file dissemination in data center networks with priority-based adaptive multicast. IEEE J. Sel. Areas Commun. 2020, 38, 1161–1175. [Google Scholar] [CrossRef]
  7. Yao, H.; Xiang, Y.; Liu, J. Virtual Prosumers’ P2P Transaction Based Distribution Network Expansion Planning. IEEE Trans. Power Syst. 2023, 39, 1044–1057. [Google Scholar] [CrossRef]
  8. Farahani, R.; Çetinkaya, E.; Timmerer, C.; Shojafar, M.; Ghanbari, M.; Hellwagner, H. ALIVE: A Latency- and Cost-Aware Hybrid P2P-CDN Framework for Live Video Streaming. IEEE Trans. Netw. Serv. Manag. 2023, 21, 1561–1580. [Google Scholar] [CrossRef]
  9. Nie, L.; Yang, S.; Zheng, X.; Wang, X. An Efficient and Adaptive Content Delivery System Based on Hybrid Network. IEEE Trans. Broadcast. 2023, 69, 904–915. [Google Scholar] [CrossRef]
  10. Kumar, D.; Pandey, M. An optimal and secure resource searching algorithm for unstructured mobile peer-to-peer network using particle swarm optimization. Appl. Intell. 2022, 52, 14988–15005. [Google Scholar] [CrossRef]
  11. Safara, F.; Souri, A.; Deiman, S.F. Super peer selection strategy in peer-to-peer networks based on learning automata. Int. J. Commun. Syst. 2020, 33, e4296. [Google Scholar] [CrossRef]
  12. Ali, M.S.; Vecchio, M.; Putra, G.D.; Kanhere, S.S.; Antonelli, F. A Decentralized Peer-to-Peer Remote Health Monitoring System. Sensors 2020, 20, 1656. [Google Scholar] [CrossRef] [PubMed]
  13. D’Alessandro Costa, M.A.; Gonçalves Rubinstein, M. Performance analysis of a locality-aware BitTorrent protocol in enterprise networks. Peer-to-Peer Netw. Appl. 2019, 12, 751–762. [Google Scholar] [CrossRef]
  14. Meng, X. SpeedTrust: A super peer-guaranteed trust model in hybrid P2P networks. J. Supercomput. 2018, 74, 2553–2580. [Google Scholar] [CrossRef]
  15. Geng, J.; Fujita, S. Enhancing Crowd-Sourced Video Sharing through P2P-Assisted HTTP Video Streaming. Electronics 2024, 13, 1270. [Google Scholar] [CrossRef]
  16. Xue, B.; Mao, Y.; Venkatakrishnan, S.B.; Kannan, S. Goldfish: Peer Selection using Matrix Completion in Unstructured P2P Network. In Proceedings of the 2023 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), Dubai, United Arab Emirates, 1–5 May 2023; pp. 1–9. [Google Scholar]
  17. Ghasemkhani, H.; Li, Y.; Moinzadeh, K.; Tan, Y. Contracting Models for P2P Content Distribution. Prod. Oper. Manag. 2018, 27, 1940–1959. [Google Scholar] [CrossRef]
  18. Anandaraj, M.; Selvaraj, K.; Ganeshkumar, P.; Rajkumar, K.; Sriram, S. Genetic Algorithm Based Resource Minimization in Network Code Based Peer-to-Peer Network. J. Circuits Syst. Comput. 2020, 30, 2150092. [Google Scholar] [CrossRef]
  19. Naganandhini, S.; Shanthi, D. Optimizing Replication of Data for Distributed Cloud Computing Environments: Techniques, Challenges, and Research Gap. In Proceedings of the 2023 2nd International Conference on Edge Computing and Applications (ICECAA), Namakkal, India, 19–21 July 2023; pp. 35–41. [Google Scholar]
  20. Shoab, M.; Jubayrin, S.A. Intelligent neighbor selection for efficient query routing in unstructured P2P networks using Q-learning. Appl. Intell. 2022, 52, 6306–6315. [Google Scholar] [CrossRef]
  21. Yang, X.-P.; Zheng, G. Maximum number of line faults in a P2P network system based on the addition-min fuzzy relation inequalities. IEEE Trans. Fuzzy Syst. 2021, 30, 2241–2253. [Google Scholar] [CrossRef]
  22. Goguen, J.A. L. A. Zadeh. Fuzzy sets. Information and control, vol. 8 (1965), pp. 338–353. - L. A. Zadeh. Similarity relations and fuzzy orderings. Information sciences, vol. 3 (1971), pp. 177–200. J. Symb. Log. 1973, 38, 656–657. [Google Scholar] [CrossRef]
  23. Nguyen, A.-T.; Taniguchi, T.; Eciolaza, L.; Campos, V.; Palhares, R.; Sugeno, M. Fuzzy Control Systems: Past, Present and Future. IEEE Comput. Intell. Mag. 2019, 14, 56–68. [Google Scholar] [CrossRef]
  24. Liu, Y.; Sakamoto, S.; Matsuo, K.; Ikeda, M.; Barolli, L.; Xhafa, F. A comparison study for two fuzzy-based systems: Improving reliability and security of JXTA-overlay P2P platform. Soft Comput. 2015, 20, 2677–2687. [Google Scholar] [CrossRef]
  25. Zhang, G.; Chai, S.; Chai, R.; Garcia, M.; Xia, Y. Fuzzy Goal Programming Algorithm for Multi-Objective Trajectory Optimal Parking of Autonomous Vehicles. IEEE Trans. Intell. Veh. 2024, 9, 1909–1918. [Google Scholar] [CrossRef]
  26. Nasseri, S.H.; Verdegay, J.L.; Mahmoudi, F. A New Method to Solve Fuzzy Interval Flexible Linear Programming Using a Multi-Objective Approach. Fuzzy Inf. Eng. 2021, 13, 248–265. [Google Scholar] [CrossRef]
  27. Abdul Hakkeem, S.; Mohamed Assarudeen, S.N. An Algorithm for Solving Fully Fuzzy Linear Fractional Programming Problems in Fuzzy Environment. J. Comput. Anal. Appl. (JoCAAA) 2024, 33, 412–420. [Google Scholar]
  28. Rivaz, S.; Nasseri, S.H.; Ziaseraji, M. A Fuzzy Goal Programming Approach to Multiobjective Transportation Problems. Fuzzy Inf. Eng. 2020, 12, 139–149. [Google Scholar] [CrossRef]
  29. Zhang, L. Max-min fuzzy bi-level programming: Resource sharing system with application. Appl. Math. Sci. Eng. 2024, 32, 2335319. [Google Scholar] [CrossRef]
  30. Anandaraj, M.; Ganeshkumar, P.; Naganandhini, S.; Selvaraj, K. A novel fuzzy programming approach for piece selection problem in P2P content distribution network. PeerJ Comput. Sci. 2024, 10, e1645. [Google Scholar] [CrossRef]
  31. Yu, Y.; Qin, Y.; Gong, H. A Fuzzy Q-Learning Algorithm for Storage Optimization in Islanding Microgrid. J. Electr. Eng. Technol. 2021, 16, 2343–2353. [Google Scholar] [CrossRef]
  32. Ntabeni, U.; Basutli, B.; Alves, H.; Chuma, J. Improvement of the Low-Energy Adaptive Clustering Hierarchy Protocol in Wireless Sensor Networks Using Mean Field Games. Sensors 2024, 24, 6952. [Google Scholar] [CrossRef]
  33. Sadhana, S.; Sivaraman, E.; Daniel, D. Enhanced Energy Efficient Routing for Wireless Sensor Network Using Extended Power Efficient Gathering in Sensor Information Systems (E-PEGASIS) Protocol. Procedia Comput. Sci. 2021, 194, 89–101. [Google Scholar] [CrossRef]
  34. Yuan, J.; Peng, J.; Yan, Q.; He, G.; Xiang, H.; Liu, Z. Deep Reinforcement Learning-Based Energy Consumption Optimization for Peer-to-Peer (P2P) Communication in Wireless Sensor Networks. Sensors 2024, 24, 1632. [Google Scholar] [CrossRef] [PubMed]
Figure 1. P2P network.
Figure 2. Fuzzified input variables: (a) download speed, (b) peer availability, (c) content delivery rate, and (d) delay.
Figure 3. Exploration and exploitation rate.
Figure 4. Learning rate over epochs.
Figure 5. Learning rate over iterations.
Figure 6. Learning rate vs. episodes.
Figure 7. Impact of discount factor.
Figure 8. Impact of exploration rate.
Figure 9. Q learning with FLP.
Figure 10. Erdos–Renyi graph using random connections.
Figure 11. Throughput measurement.
Figure 12. Latency measure.
Figure 13. Jitter measure.
Figure 14. RTT measure.
Figure 15. Connectivity measure.
Figure 16. Resource utilization measure.
Figure 17. Upload capacity measure.
Figure 18. QoS measure.
Table 1. Summary of the literature review categorizing traditional, heuristic, and AI-driven peer selection techniques.

| Category | Techniques | Advantages | Limitations |
|---|---|---|---|
| Traditional Peer Selection | Random Selection; Round-Robin; Latency-Based Selection; Proximity-Based Selection | Simple implementation; Low computational cost | Inefficient for dynamic networks; Cannot adapt to changing network conditions; High churn rate issues |
| Heuristic-Based Peer Selection | Game-Theoretic Models; Graph-Based Selection (MST, Clustering); Multi-Criteria Decision Making (AHP, TOPSIS) | More efficient than traditional methods; Optimized for specific scenarios; Reduces latency and improves connectivity | Requires manual parameter tuning; Less adaptable to real-time network fluctuations; Limited scalability |
| AI-Driven Peer Selection | Supervised Learning (Prediction Models); Reinforcement Learning (Q-Learning, DQN); Fuzzy Logic-Based Selection (FLP); Hybrid AI Models (Q-Learning + FLP) | Self-learning and adaptive; Handles uncertainties and real-time changes; Optimized peer selection strategies; Scalable and efficient for large networks | Higher computational requirements; Requires sufficient training data; Complexity in implementation |
Table 2. Notation summary.

| Symbol | Definition |
|---|---|
| Q-Learning Parameters (Reinforcement Learning for Peer Selection) | |
| s | Current state of the network (peer selection scenario) |
| s′ | Next state after an action is taken |
| S | Set of all possible states |
| a | Action taken (selecting a peer) |
| A | Set of all possible actions (peer selection choices) |
| ai | Action selecting peer i |
| Q(s,a) | Q-value, representing the expected reward for selecting peer i in state s |
| r | Immediate reward based on peer selection quality |
| α | Learning rate in Q learning |
| γ | Discount factor for future rewards in Q learning |
| maxa′Q(s′,a′) | Maximum expected Q-value for the next state |
| π(s) | Policy function that determines the best action for state s |
| R(s,a) | Reward function for selecting action a in state s |
| Pij | Probability of transitioning from state i to state j |
| Peer Attributes and Selection Criteria (Fuzzy Logic Components) | |
| N | Total number of available peers |
| Pi | Peer i in the network |
| Bi | Bandwidth of peer i |
| Li | Latency of peer i |
| Ti | Trust score of peer i |
| Ai | Availability of peer i (1 if available, 0 otherwise) |
| Ei | Energy consumption of peer i (if applicable) |
| Ci | Computational power of peer i (if applicable) |
| Ri | Peer reputation score (aggregated trust score) |
| Fuzzy Membership and Normalization Functions (Handling Uncertainty in Attributes) | |
| μi | Fuzzy membership function representing preference for peer i |
| w1, w2, w3, w4 | Weight coefficients for different peer attributes (sum to 1) |
| Bmin | Minimum required bandwidth |
| Bmax | Maximum bandwidth available |
| Lmin | Minimum latency observed |
| Lmax | Maximum allowable latency |
| Tmax | Maximum possible trust score |
| Tthreshold | Minimum required trust score for selection |
| Amin | Minimum availability requirement (usually 1) |
| eb, el, eT | Tolerance levels for fuzzy constraints |
Table 3. P2P network configuration.

| Parameter | Value |
|---|---|
| Simulation Duration | 100 s |
| Number of Peers | 100 to 600 |
| Network Topology | Erdos–Renyi graph |
| Content Repository Size | 10 GB |
| Bandwidth | 100 Mbps |
| Peer Upload Capacity | 10 Mbps |
| Peer Download Capacity | 20 Mbps |
| Max/Min Arrival Rate | 50/10 peers per minute |
| Max/Min Departure Rate | 30/5 peers per minute |
| Traffic Model | Constant Bit Rate (CBR) |
Table 4. Q learning parameters.

| Parameter | Value |
|---|---|
| Learning Rate (α) | 0.1 |
| Discount Factor (γ) | 0.9 |
| Exploration Rate (ε) | 0.2 |
| Exploration Decay Rate | 0.99 |
| Initial Q-Value | 0 |
| Number of Episodes | 1000 |
| Maximum Steps per Episode | 100 |
| Reward for Successful Download | 100 |
| Penalty for Failed Download | −10 |
Table 5. Fuzzy linear programming parameters.

| Parameter | Value |
|---|---|
| Max Download Speed | 10 Mbps |
| Min Download Speed | 1 Mbps |
| Max Reliability | 0.9 |
| Min Reliability | 0.5 |
| Max Latency | 100 ms |
| Min Latency | 10 ms |
| Max Completion Rate | 95% |
| Min Completion Rate | 80% |
| Membership Functions | Triangular functions |
| Weights | Equal |
Table 6. State space representation.

| State ID | Active Peers | Network Load (%) | Bandwidth Availability (Mbps) | Peer Trust Level |
|---|---|---|---|---|
| S1 | 100 | 50 | 100 | High |
| S2 | 200 | 60 | 80 | Medium |
| S3 | 300 | 40 | 120 | Low |
| S4 | 400 | 70 | 90 | High |
| S5 | 500 | 55 | 110 | Medium |
| S6 | 600 | 65 | 95 | High |
Table 7. Action space representation.

| Action ID | Action Description |
|---|---|
| A1 | Select Peer Based on Bandwidth |
| A2 | Select Peer Based on Trust Level |
| A3 | Select Nearest Peer |
| A4 | Select Peer with Least Load |
| A5 | Random Peer Selection |
Table 8. Reward representation.

| State ID | Action ID | Reward (Q-Value) |
|---|---|---|
| S1 | A1 | 10 |
| S1 | A2 | 7 |
| S1 | A3 | 5 |
| S1 | A4 | 8 |
| S1 | A5 | 3 |
| S2 | A1 | 6 |
| S2 | A2 | 9 |
| S2 | A3 | 4 |
| S2 | A4 | 7 |
| S2 | A5 | 2 |
| S3 | A1 | 8 |
| S3 | A2 | 6 |
| S3 | A3 | 7 |
| S3 | A4 | 5 |
| S3 | A5 | 4 |
| S4 | A1 | 9 |
| S4 | A2 | 8 |
| S4 | A3 | 6 |
| S4 | A4 | 7 |
| S4 | A5 | 3 |
| S5 | A1 | 10 |
| S5 | A2 | 9 |
| S5 | A3 | 8 |
| S5 | A4 | 7 |
| S5 | A5 | 5 |
| S6 | A1 | 12 |
| S6 | A2 | 10 |
| S6 | A3 | 9 |
| S6 | A4 | 8 |
| S6 | A5 | 6 |
Table 9. Transition probability representation.

| Current State | Action | Next State | Probability |
|---|---|---|---|
| S1 | A1 | S2 | 0.4 |
| S1 | A1 | S3 | 0.6 |
| S2 | A2 | S4 | 0.7 |
| S2 | A2 | S5 | 0.3 |
| S3 | A3 | S1 | 0.5 |
| S3 | A3 | S2 | 0.5 |
| S4 | A4 | S3 | 0.8 |
| S4 | A4 | S5 | 0.2 |
| S5 | A5 | S1 | 0.6 |
| S5 | A5 | S4 | 0.4 |
| S6 | A1 | S2 | 0.5 |
| S6 | A1 | S4 | 0.5 |
Table 10. Comparative analysis of peer selection methods.

| Criteria | Traditional Methods | Proposed Method | Improvement (%) |
|---|---|---|---|
| Handling Uncertainty | Fixed thresholds, high sensitivity to fluctuations | Fuzzy constraints provide smooth decision making | 30% lower selection variability |
| Resource Utilization | Inefficient load balancing, often leads to bottlenecks | Optimized allocation using learned policies | +40% better load distribution |
| Convergence Speed | Slow adaptation (avg: 5000 iterations) | Faster convergence (avg: 2000 iterations) | 60% reduction in convergence time |
| Success Rate in Peer Connections | 75% (high failure under churn) | 92% (stable connections) | +22% higher success rate |
| Throughput (Mbps) | 25 Mbps | 45 Mbps | +21% higher throughput |
| Network Latency (ms) | 150 ms | 90 ms | 40% lower latency |
| Stability in High Churn | Unstable, frequent disconnections | Robust, maintains connections efficiently | 30% higher stability |