Article

Dynamical Pseudo-Random Number Generator Using Reinforcement Learning

Sungju Park, Kyungmin Kim, Keunjin Kim and Choonsung Nam
1 Spiceware, 17F, 83, Uisadang-daero, Yeongdeungpo-gu, Seoul 07325, Korea
2 Department of Software Convergence Engineering, Inha University, Incheon 15798, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(7), 3377; https://doi.org/10.3390/app12073377
Submission received: 10 February 2022 / Revised: 5 March 2022 / Accepted: 24 March 2022 / Published: 26 March 2022

Abstract

A pseudo-random number generator (PRNG) is an algorithm that generates a sequence of numbers that appears random. Recently, random numbers have been generated through a reinforcement learning mechanism. This approach secures randomness by exploiting a characteristic of reinforcement learning: at every step, the agent selects the optimal action in consideration of every possible state up to the end of the episode. The LSTM method is used to keep previous patterns in long-term memory and to select new patterns in consideration of those previous patterns. In addition, to overcome the limited length of LSTM long-term memory, the feature vectors extracted from the LSTM are accumulated and converted into images, from which features are extracted using a CNN. The resulting dynamical pseudo-random number generator secures the randomness of the generated numbers.

1. Introduction

Random numbers are numbers that cannot be predicted before they are actually generated. If the process of generating random numbers is flawed, it may compromise the safety of the cipher algorithm itself, specifically the secret key of a symmetric key encryption algorithm, the initialization vector of a stream cipher algorithm, and the generation of large primes for RSA in asymmetric key encryption. Therefore, random numbers should be random, unpredictable, unbiased, and independent. Such numbers are called 'true random numbers' and can only be generated through quantum-based security methods. Random number generation methods other than the quantum-based method produce pseudo-random numbers that are hard to distinguish from true random numbers, and these are used as cryptographic random numbers [1,2,3].
There are two ways of obtaining random numbers: using the randomness of physical phenomena (non-deterministic random bit generator, NRBG) [4] and generating much longer pseudo-random sequences by feeding a short initial value into a deterministic algorithm (deterministic random bit generator, DRBG) [5]. An NRBG can produce output with a relatively high level of entropy, similar to true random numbers, but it is not easy to obtain high-entropy sources and to apply them to various platforms. In contrast, a DRBG can guarantee a sufficient level of safety, even though its entropy is lower than that of an NRBG, as long as its operating state is kept confidential and the underlying cipher algorithm is hard to analyze. Hence, the pseudo-random number generator (PRNG) [5] adopts a DRBG-based mathematical algorithm to generate a random number sequence: starting from a randomly initialized seed state, it generates a series of numbers whose characteristics are similar to those of random numbers.
Recently, random numbers have been generated through reinforcement learning [6]. However, once new patterns are generated or added beyond the existing patterns regarded as random, randomness may no longer be guaranteed. To guarantee randomness, the ability of reinforcement learning to select the optimal behavior at every moment, in consideration of every possible state until the end of each episode, is applied to random number generation. In [6], long short-term memory (LSTM) is added to reinforcement learning in order to keep previous patterns in long-term memory and to select a new pattern with respect to those previous patterns. However, the long-term memory of an LSTM is limited in length [7]. This study utilizes a convolutional neural network (CNN) to store pattern features for the random number generator [8]. The feature vectors extracted from the generator's LSTM are accumulated and converted into images, and the CNN compresses these images to extract long-term pattern features. By utilizing the accumulated long-term patterns, the random number generator can generate random numbers with a higher level of randomness with respect to previous patterns.

2. Proposed Method

This section presents the dynamical pseudo-random generator that generates random numbers with long-term patterns through a CNN. Figure 1 presents the basic structure: the left side is the environment section, and the right side is the agent section. The environment consists of three modules: the state space by action, the evaluation function, and the reward policy. The state space by action converts actions produced by the agent into the next state. The evaluation function converts the randomness of the next state (the random number) into a score, which is then used as the reward. The reward consists of a frequency and a weight: the frequency acts as a reward scheduler, while the weight grows exponentially, so the reward policy increases the weight of the reward exponentially. The agent consists of the RNN, the feature vectors, the CNN, and a fully connected layer. The RNN is a layer that takes the current state value of the environment as input and produces the first feature value. The feature vector module saves the first features over time and converts them into images. The CNN is a layer that takes the feature-vector images as input and produces the second features. The fully connected layer combines these two features and converts them into actions.

2.1. Environment

In the environment, the agent's actions are converted into the next state through the state space by action, and rewards are then generated by the evaluation function. The agent subsequently updates its parameters with respect to the rewards. Once an episode ends at a particular time T, the time becomes T = T + 1, and the next state is designated as the first random number. This first random number indicates the agent's current state.
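To make this loop concrete, the following Python sketch implements a hypothetical environment with the three roles described above. Class and method names are illustrative, and the paper's actual evaluation function (the NIST SP 800-22/900-22 statistics) is replaced here by a simple monobit-frequency score.

import numpy as np

class RandomBitEnvironment:
    def __init__(self, bits_per_state=8):
        self.bits_per_state = bits_per_state
        self.state = np.zeros(bits_per_state, dtype=np.int8)  # first random number

    def evaluate(self, bits):
        # Placeholder randomness score in [0, 1]: 1.0 when 0s and 1s are balanced.
        return 1.0 - abs(float(bits.mean()) - 0.5) * 2.0

    def step(self, action_bits):
        # "State space by action": the agent's action becomes the next state.
        next_state = np.asarray(action_bits, dtype=np.int8)
        reward = self.evaluate(next_state)   # evaluation function -> reward
        self.state = next_state              # shown to the agent as its current state
        return next_state, reward

    def reset(self):
        # At episode end (T = T + 1), the next state is re-seeded as the first random number.
        self.state = np.zeros(self.bits_per_state, dtype=np.int8)
        return self.state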

2.2. Agent

In the agent module, the following steps are taken for each corresponding range. As shown in Figure 1, the random number value of the initial state is entered into the RNN in the agent module, and the first feature value is generated. The generated first features are saved over time in an image form through the feature vectors. The images are entered into the CNN network and converted into the second feature value. The resulting first and second feature values are combined through the fully connected layer and converted into the third feature value. The LSTM model is composed of a forget gate, an input gate, and a fully connected part. The input of the LSTM is the state given by the previous random number value, a 1 × 8 vector. The activation functions are tanh and relu.
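A minimal sketch of this recurrent part is shown below, assuming PyTorch; the hidden size and the final ReLU projection are illustrative choices not given in the text.

import torch
import torch.nn as nn

class RNNFeatureExtractor(nn.Module):
    def __init__(self, state_bits=8, hidden_size=32):
        super().__init__()
        # Two LSTM layers over the 1 x 8 state vector (tanh gating internally).
        self.lstm = nn.LSTM(input_size=state_bits, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_size, state_bits)  # 8-dimensional feature vector 1

    def forward(self, state, hidden=None):
        # state: (batch, 1, 8) -- the previous 8-bit random number as the current state
        out, hidden = self.lstm(state, hidden)
        feature1 = torch.relu(self.proj(out[:, -1]))     # relu activation on the output
        return feature1, hidden

# Example: one 8-bit state produces an 8-dimensional feature vector 1.
f1, h = RNNFeatureExtractor()(torch.zeros(1, 1, 8))
print(f1.shape)  # torch.Size([1, 8])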
The agent produces 8 bits of random numbers at a time, and this process is repeated 100 times to produce 800 bits in total. As shown in Figure 2, two layers of the LSTM network are used in the RNN to convert the current state into the first feature value of 8 bits. The structure of the CNN model for long-term memory is as follows. The input is 30 × 30; the filter size of the CNN model is 3 × 3, and a stride of 2 is applied. No max pooling is used; only convolution layers are applied, and the activation function is relu. After the feature map passes through two layers of the fully connected network, the CNN model obtains feature vector 2 in the form of a 1 × 4 vector. The total size of the feature-vector image is 30 × 30: one hundred 3 × 3 tiles are stacked in a 10 × 10 arrangement. Each 3 × 3 tile consists of 8 feature bits and 1 'zero' padding bit. As the generation step is repeated 100 times, 8 bits are arranged in the 3 × 3 form each time and stacked horizontally; once all 800 bits are accumulated, the whole 30 × 30 feature image is filled. With 5 CNN layers and 2 fully connected layers, the four-dimensional feature vector 2 is generated; it is concatenated with the eight-dimensional feature vector 1 from the RNN, and feature vector 3 is created as a result. A sketch of this construction is given below.
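The following sketch illustrates packing 100 eight-bit chunks into the 30 × 30 feature image and a CNN head matching the stated configuration (5 convolution layers with 3 × 3 filters, stride 2, relu, 2 fully connected layers, 1 × 4 output). PyTorch, the channel counts, the padding of 1, and the fully connected hidden size are assumptions.

import torch
import torch.nn as nn

def build_feature_image(chunks):
    # Pack 100 eight-bit feature vectors into a 30x30 image: each 8-bit chunk
    # plus one zero padding bit fills a 3x3 tile; 100 tiles form a 10x10 grid.
    image = torch.zeros(30, 30)
    for i, chunk in enumerate(chunks):           # chunk: tensor of shape (8,)
        tile = torch.zeros(9)
        tile[:8] = chunk                         # 8 feature bits + 1 zero bit
        r, c = divmod(i, 10)
        image[3*r:3*r+3, 3*c:3*c+3] = tile.view(3, 3)
    return image

class CNNFeatureExtractor(nn.Module):
    def __init__(self, channels=(8, 16, 32, 32, 32)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:                  # five 3x3, stride-2 convolutions with relu
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)       # 30x30 -> 15 -> 8 -> 4 -> 2 -> 1
        self.fc = nn.Sequential(nn.Linear(in_ch, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, image):                    # image: (batch, 1, 30, 30)
        x = self.conv(image).flatten(1)
        return self.fc(x)                        # 1x4 feature vector 2

chunks = [torch.randint(0, 2, (8,)).float() for _ in range(100)]
img = build_feature_image(chunks).view(1, 1, 30, 30)
feature2 = CNNFeatureExtractor()(img)                       # shape (1, 4)
feature3 = torch.cat([torch.zeros(1, 8), feature2], dim=1)  # concat with feature vector 1 (placeholder)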
There are two fully connected layers, which means that learning consists of two sections. The first is the value stream, in which the critic predicts the reward. The second is the policy stream, in which the actor predicts actions. In the critic's fully connected layer, the initially received reward value is used to produce the predicted reward value; this step allows the behavior prediction network to learn from feedback on the reward. In the actor's fully connected layer, the action value is predicted.
In the first step of learning, the critic's fully connected network has the following characteristics. The reward value from the environment is received by the critic's fully connected network of the agent as feedback. The loss, the difference between the reward and the value predicted by the critic's fully connected network, is back-propagated through the entire network. The critic loss is as follows:
Method 1: minimize_{P(θ)} reduce_mean(env_reward − critic_reward)
Here, env_reward is the actual reward calculated in the environment, critic_reward is the value predicted by the critic network, and P is a network parameter. Method 1 is used to determine the agent's parameter values that minimize the difference between env_reward and critic_reward.
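A hedged sketch of this critic update is shown below (PyTorch assumed). The paper writes the loss as the plain mean difference; the sketch uses the mean squared difference, a common reading of such a value loss, so that the minimum lies at critic_reward = env_reward.

import torch

def critic_loss(env_reward, critic_reward):
    # env_reward: reward computed by the environment's evaluation function
    # critic_reward: value predicted by the critic's fully connected network
    # Squared error is an assumption; the paper states the raw mean difference.
    return torch.mean((env_reward - critic_reward) ** 2)

# Usage sketch: optimizer.zero_grad(); critic_loss(r, v).backward(); optimizer.step()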
In the second step of learning, the actor's fully connected network has the following characteristics. Its purpose is to learn the most rewarding actions in the current state. The loss of the actor network is as follows:
Method 2: maximize_{P(θ)} (advantage × [P(a_t|s_t) / P_old(a_t|s_t)])
The advantage is determined from the reward value predicted by the critic's fully connected network; when it is positive, the current state produces the best possible action. With the advantage as a weight, the network is updated from P_old to P so that, given the current state s_t of the actor network, the probability that the P network produces the best possible action, taken as the expectation over every possible action a_t, is increased.
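A sketch of this actor objective is shown below (PyTorch assumed; working with log-probabilities for numerical stability is an implementation choice not stated in the text).

import torch

def actor_loss(advantage, logp, logp_old):
    # advantage: env_reward - critic_reward (positive when the action beats the prediction)
    # logp, logp_old: log P(a_t|s_t) under the current and previous policy networks
    ratio = torch.exp(logp - logp_old.detach())      # P(a_t|s_t) / P_old(a_t|s_t)
    # Maximizing advantage * ratio is implemented as minimizing its negative.
    return -(advantage.detach() * ratio).mean()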

3. Experiment and Results

3.1. Experiment Configuration

The experiment configuration was designed using the open source code [9] of the reinforcement learning approach [6], which was set as the target method and comparatively examined against the proposed method. As the experiment evaluation function, we used NIST SP 800-22/900-22, proposed by the National Institute of Standards and Technology, which evaluates randomness. In addition, in Section 3.3, we present an evaluation function using auto-correlation to evaluate unpredictability. The proposed architecture was run on a Tesla V100 GPU [10]. The learning-related parameters are as follows. The agent has two branches for learning: the learning rate of the policy branch, whose loss function determines the next action, was set to 3 × 10−4, and the learning rate of the advantage (critic) branch, which reflects the reward to the agent, was set to 1 × 10−4. The discount_factor, the parameter of Bellman's equation that determines the agent's reward value, was set to 1.0. The number of episodes per volley was set to 2000, and the number of volleys was set to 35, with every episode running to completion. In each episode, one learning loop is completed in the sequence state → random number generation → reward. The model's basic 8-bit generation step is repeated 100 times so that, finally, 800 bits of final random numbers are created. One volley consists of batch_iter (20) × batch (100) episodes, each beginning from one random number, for 2000 episodes in total. The total learning length is 35 volleys, i.e., 35 × (20 × 100) = 70,000 random numbers, and the total learning time was 19 h, 45 min, and 34 s.
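For reference, the hyperparameters above can be collected into a hypothetical configuration dictionary (names are illustrative; values are taken from the text).

config = {
    "policy_learning_rate": 3e-4,     # policy branch (next-action) update rate
    "advantage_learning_rate": 1e-4,  # advantage/critic branch update rate
    "discount_factor": 1.0,           # Bellman discount for the reward
    "batch_iter": 20,
    "batch": 100,
    "episodes_per_volley": 2000,      # batch_iter (20) x batch (100)
    "volleys": 35,                    # 35 x 2000 = 70,000 random numbers in total
    "bits_per_step": 8,
    "steps_per_episode": 100,         # 100 x 8 = 800-bit random number per episode
}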

3.2. Average Total Reward with 800 Bits of Random Numbers

In this experiment, the total length of the random numbers is 800 bits. In Figure 3, the average total reward of the volleys on the y axis indicates the average randomness of the 2000 random numbers used in each volley; a larger average total reward corresponds to a higher level of randomness. The proposed method produced an average total reward similar to that of the target method up to about 10 training volleys. Beyond 10 training volleys, however, a difference appeared, and with 35 training volleys the difference reached about 4%. The analysis of the validation volleys likewise shows a difference of about 4% when the number of volleys exceeds 35, as in training. This suggests that the proposed method guarantees the randomness of random numbers better than the target method. Since the proposed method's randomness was superior, this result indicates that the next random number can attain a higher level of randomness with respect to the previous random number sequence pattern by using the CNN as an auxiliary storage.
Table 1 compares the maximum test value with the maximum validation value shown in Figure 3. The maximum values show the maximum of the average total reward over every volley, and the average values show the mean of the average total reward over every volley. Table 1 shows that the maximum validation value of the proposed method was 0.046 larger than that of the target method, the maximum test value was 0.036 larger, and the average test value was 0.039 larger.

3.3. Different Sequences from the Same Seed with 800 Bits of Random Numbers

In a study of multiplexed quantum random number generation [11], the correlation of signals from seven homodyne detector channels was examined to verify the independence of each signal. Likewise, this study uses a correlation matrix to verify whether multiple sequences generated from the same seed remain independent in arrangement and pattern. The correlation is close to one when the arrangement of two sequences is the same. All sequences start from the same seed of 0, which allows us to show that different sequences can be generated even from the same seed. A total of 10,000 sequences of 800 bits are generated; pairs of these 10,000 sequences are selected to measure the correlation, and for each sequence the maximum correlation value against the other sequences is recorded. In this way, each of the 10,000 sequences is measured, as sketched below.
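The sketch below illustrates this measurement with NumPy. Since the exact correlation measure is not fully specified in the text, Pearson correlation between bit sequences mapped to {−1, +1} is assumed, and the number of sequences is reduced for illustration.

import numpy as np

def max_pairwise_correlation(sequences):
    # sequences: (n, 800) array of bits generated from the same seed at different times
    x = sequences * 2.0 - 1.0                 # map {0, 1} -> {-1, +1}
    corr = np.corrcoef(x)                     # (n, n) correlation matrix
    np.fill_diagonal(corr, 0.0)               # ignore self-correlation
    return np.abs(corr).max(axis=1)           # max correlation per sequence

rng = np.random.default_rng(0)
seqs = rng.integers(0, 2, size=(1000, 800))   # stand-in for the generated sequences
print(max_pairwise_correlation(seqs).max())   # should stay well below 1.0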
As shown in Figure 4, the red dotted line indicates the maximum correlation value over the 10,000 sequences, and the blue line shows the number of sequences with each correlation value. The maximum correlation value among all of the sequences was 19%, which indicates that the 10,000 sequences generated at different times were not similar; for most sequences, the maximum correlation value was about 13%. The fact that no nearly identical sequence existed suggests that dynamical pseudo-random numbers are all different under the reinforcement learning mechanism, even when the same seed is used. Since the dimension of the random numbers is low, the same pattern would be likely to repeat and the correlation scores would be expected to be relatively high; however, the measured values show that the similarity was not high, and when the test was conducted on 100 sequences of 100,000 bits, the maximum value decreased to 1.1%.

4. Conclusions and Future Works

Research on random number generation around the world currently focuses on hardware-based quantum random numbers, but, as mentioned above, we needed a random number generator consisting only of software that is not constrained by the environment. Therefore, we utilized deep learning technology to create an unpredictable random number module that differs from existing software-based pseudo-random number generators. Because the proposed method determines the optimal behavior for each state and uses reinforcement learning to change the parameters of the deep learning model, it is practically impossible to discover the pattern of the random numbers by tracking, in real time, the numerous parameters that change at every step; we therefore judged this approach to be appropriate as an artificial intelligence model for random number generation. Thus, this study proposes a dynamical pseudo-random generator that utilizes the reinforcement learning mechanism. The structure is divided into an environment section and an agent section. Actions from the agent section produce the next state and reward in the environment section. After the parameters are updated through the reward, the next state becomes the first random number, which is taken as the current state of the agent section. The agent section generates feature vector 1 through the RNN from this current state, generates feature vector 2 from the accumulated feature vectors, and concatenates them; the resulting feature vector 3 is used to predict actions through the fully connected layer. With this structure, it was demonstrated that, for 800-bit random numbers evaluated with the NIST SP 800-22/900-22 experiment evaluation function, the level of randomness was higher than that of the existing method. In addition, when the generated random numbers were compared, the similarity among them was at most 19%. Hence, the proposed method showed better randomness and lower similarity than the existing method.

Author Contributions

Conceptualization, software, validation, formal analysis, resources and draft preparation, S.P. and K.K. (Kyungmin Kim); conceptualization, project administration and supervision, K.K. (Keunjin Kim); methodology, investigation, data curation, writing, review and editing, C.N.; funding acquisition, C.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by INHA UNIVERSITY Research Grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shor, P.W. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, NM, USA, 20–22 November 1994; IEEE: Piscataway, NJ, USA, 1994; pp. 124–134.
  2. Rukhin, A.; Soto, J.; Nechvatal, J.; Barker, E.; Leigh, S.; Levenson, M.; et al. A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications. NIST Spec. Publ. 2002, 800-22.
  3. Herrero-Collantes, M.; Garcia-Escartin, J.C. Quantum random number generators. Rev. Mod. Phys. 2017, 89, 015004.
  4. Barker, E.B.; Kelsey, J.M. Recommendation for Random Bit Generator (RBG) Constructions; US Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, USA, 2012.
  5. Barker, E.B. Guideline for using cryptographic standards in the federal government: Cryptographic mechanisms. NIST Spec. Publ. 2016, 800-175B, 1–82.
  6. Pasqualini, L.; Parton, M. Pseudo random number generation: A reinforcement learning approach. Procedia Comput. Sci. 2020, 170, 1122–1127.
  7. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  8. LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 1995, 3361, 1995.
  9. Pseudo Random Number Generation through Reinforcement Learning and Recurrent Neural Networks. Available online: https://github.com/InsaneMonster/pasqualini2020prngrl (accessed on 2 February 2022).
  10. V100 Tensor Core GPU. Available online: https://www.nvidia.com/en-us/data-center/v100/ (accessed on 2 February 2022).
  11. Haylock, B.; Peace, D.; Lenzini, F.; Weedbrook, C.; Lobino, M. Multiplexed quantum random number generation. Quantum 2019, 3, 141.
Figure 1. Dynamical pseudo-random generator using reinforcement learning.
Figure 2. The sequence of feature vector generation.
Figure 3. Comparative analysis of learning and verification between the proposed method and target method when there are 800 bits of random numbers.
Figure 4. Comparison of correlation in each sequence.
Table 1. Comparative analysis of the maximum and average values between the proposed method and target method when there are 800 bits of random numbers.
Method              Maximum Validation Value    Maximum Test Value    Average Test Value
Proposed method     0.532                       0.522                 0.509
Target method       0.486                       0.486                 0.470
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

