1. Introduction
Stochastic computing (SC) has emerged as a possible solution for Neural Network hardware implementation [1,2] and also as a way to accelerate the computation in different applications such as image processing [3] or Deep Learning (DL) for inference [4,5] and training [6,7]. It presents several advantages compared to traditional computing, such as noise resiliency [8], low signal transmission delay [9], low power consumption, and a small footprint area [10].
Two main SC-based codifications exist for representing variables: unipolar and bipolar coding. In unipolar coding, the bit-stream (BS) value x is seen as the probability of obtaining a 1 at a random position in x. This corresponds to counting the number of 1's ($N_1$) and the number of 0's ($N_0$) along the BS and computing $x = N_1/(N_1 + N_0)$. This value lies in the interval [0,1]. On the other hand, bipolar coding represents the BS values in the interval [−1,1]. To accomplish this, zeroes are weighted as $-1$, while ones are weighted with a $+1$, so that $x^* = 2x - 1$, where the upper index ∗ denotes bipolar coding. Operating over single BSs allows costly operations, such as multiplication, to be implemented with single logic gates. For instance, Figure 1a shows how two unipolar BSs are multiplied using a single AND gate.
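The AND-gate multiplication above can be sketched in a few lines of Python (a minimal simulation assuming ideal, truly random bit-streams; all names and the stream length are illustrative, not from the original circuit):

```python
import random

def unipolar_value(bs):
    """Decode a unipolar bit-stream: the fraction of 1's along the BS."""
    return sum(bs) / len(bs)

# Two independent unipolar BSs encoding 0.75 and 0.5 (ideal random generation).
rng = random.Random(42)
N = 4096
x = [1 if rng.random() < 0.75 else 0 for _ in range(N)]
y = [1 if rng.random() < 0.50 else 0 for _ in range(N)]

# Multiplication is a single AND gate per bit (as in Figure 1a).
prod = [a & b for a, b in zip(x, y)]
print(unipolar_value(prod))  # close to 0.75 * 0.5 = 0.375
```

Because each output bit is 1 only when both input bits are 1, the probability of a 1 in the output stream is the product of the input probabilities, provided the two streams are uncorrelated, which is precisely the property the RNG must guarantee.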
In order to operate in the SC domain, a Stochastic Number Generator (SNG) circuit must be employed. The most commonly used circuit in the literature is based on an RNG circuit and a single comparator [11,12], as shown in Figure 1b. Whenever the digital input value X is greater than the RNG value R, the stochastic output x is set to '1'; otherwise, it is set to '0'. If the RNG sequence is uniformly distributed in the interval of all possible values of X, the probability $p(x = 1)$ is proportional to the number X.
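The comparator-based SNG can be sketched as follows (a minimal model assuming an idealized RNG that sweeps every $2^b$ value exactly once per period; names are illustrative):

```python
def sng(X, rng_values):
    """Comparator-based SNG: emit '1' whenever the input X exceeds the RNG value R."""
    return [1 if X > R else 0 for R in rng_values]

# With an ideal uniform RNG covering every 8-bit value once, the number of
# 1's in the BS equals X, so p(x = 1) = X / 2^b.
b = 8
R_seq = list(range(2 ** b))
bs = sng(100, R_seq)
print(sum(bs) / len(bs))  # 100 / 256 = 0.390625
```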
The overall SC system performance is highly dependent on the RNG quality, for two reasons. Firstly, the area employed for this part of the system is much higher than that of the computational part, occupying around 80% or even 90% of the total footprint of the system [13]. This is especially critical for applications requiring a high fan-in. Secondly, the randomness quality can significantly affect the precision of those operations requiring non-correlated signals, as in the case of stochastic multiplication. For these reasons, finding the best RNG in terms of area and low correlation is a major concern when designing real SC applications.
Different approaches have been proposed to tackle these issues. Low-discrepancy sequences such as Halton [14] or Sobol sequences [15] deal with the low-correlation matter. This sort of sequence produces pulse signals with ones and zeroes uniformly spaced, which mitigates the random fluctuations in the generated BSs. The problem they face is the area employed for generating such sequences: normally, different base-counters [14] or least-significant-zero detectors plus storage arrays [16,17] are utilized, thus increasing the hardware overhead. On the other hand, area saving mainly comes in one form: the Linear-Feedback Shift-Register (LFSR) circuit.
An LFSR is a circuit based on a shift register with a linear function of its previous state connected to its input. Normally, the linear function is produced by connecting exclusive-OR gates to different points (known as taps) in the state registers. The way these taps are connected is described by a polynomial, which can be expressed using two different notations: polynomial or binary notation. In polynomial notation, the tap connections are expressed as $1 + x^{T_1} + x^{T_2} + \cdots + x^{T_n}$, where $T_i$ are the register taps selected. In binary notation, the connected taps are expressed as a binary word with a 1 in each tap position. Figure 2 shows an LFSR circuit where the inputs of the XOR gate are connected to registers 8, 6, 5, and 4; its polynomial is thereby expressed as $x^8 + x^6 + x^5 + x^4 + 1$, or 10111000 (184) in binary notation. Different polynomials allow different deterministic streams with different lengths to be produced. There exists a finite number of polynomial configurations (depending on the LFSR resolution) that produce a maximal-length sequence of $2^b - 1$ cycles, where b is the number of registers used in the circuit (bit-resolution). For instance, a 10-bit LFSR has at least 60 different polynomials that produce maximal-length sequences [18]. Those polynomials which generate a maximal stream length are known as primitive polynomials. The reason the LFSR sequence misses one cycle is that there is a forbidden state in which the LFSR locks up if it ever enters it (the all-zeros state when XOR gates are used).
The starting value of the LFSR is named the seed. Figure 2 shows an LFSR sequence with a seed set to 255. After $N = 2^b - 1$ clock cycles, the sequence rolls over again from the starting seed point. This is one of the reasons LFSRs are not true RNGs; they are, rather, pseudorandom number generators, since we can predict the next state of the sequence if we know the current state (something impossible in a true RNG).
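As an illustration, the LFSR of Figure 2 can be simulated in software. The sketch below assumes the tap set (8, 6, 5, 4) inferred from the binary notation 184, a Fibonacci (external-XOR) topology, and a seed of 255; it verifies the maximal length of $2^8 - 1 = 255$ states:

```python
def lfsr_sequence(seed, b=8, taps=(8, 6, 5, 4)):
    """Fibonacci LFSR: XOR the tapped register bits and shift the result in.
    Returns the full sequence of 2^b - 1 states starting from `seed`."""
    mask = (1 << b) - 1
    state, seq = seed, []
    for _ in range((1 << b) - 1):
        seq.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
    return seq

seq = lfsr_sequence(255)
print(len(set(seq)))  # 255: every nonzero state visited exactly once
print(0 in seq)       # False: the all-zeros lock-up state never appears
```

After the 255th cycle the register returns to the seed value, which is exactly the roll-over behavior described above.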
The LFSR is the easiest and smallest circuit to produce a pseudo-RNG (from now on referred to as RNG; we explicitly use the adjective True when meaning true random), albeit presenting a high-correlation behavior [19]. Different works in the literature have sought to exploit the advantages of the LFSR while mitigating its shortcomings. Basically, most of the works presented focus on sharing the same LFSR between different SNGs while alleviating the correlation effects that this method raises. The work carried out by Z. Li et al. [20] is a good example of this; their approach is based on a DFF insertion technique (adding a DFF to the line to uncorrelate the BS with itself), where the DFF circuit aims to uncorrelate one BS from the other. This technique performs successfully when employing a True Random Number Generator (TRNG), but not for stand-alone LFSRs: the LFSR still shows a high level of autocorrelation when isolated with only a single DFF, as demonstrated in [19]; therefore, this approach is not efficient. Hideyuki Ichihara et al. [21] suggested a technique for sharing as many LFSRs as possible by employing a circular shift at the LFSR output. This method allows the generation of two non-correlated signals with no hardware overhead; therefore, it reduces the area employed and deals with the error at the same time. Along the same line, another relevant study was proposed by H. Joe and Y. Kim [22]. They dug deeper into the different ways of connecting the same LFSR to produce the smallest correlation impact between signals using a wire-exchanging technique. Their results showed better performance in comparison with other LFSR-sharing methods. Another approach, introduced by F. Neugebauer et al. [19], tried to increase the randomness of the LFSR outcome by adding a nonlinear Boolean circuit. This method decreased the correlation impact at the cost of hardware overhead. J. Anderson et al. [23] explored an interesting approach based on the effect of the seeds on the accuracy of SC systems when employing LFSRs. They demonstrated that an efficient selection of seeds in the circuit improved the accuracy. For this, the authors explored the whole space to find the best seeding set. This is affordable as long as the exploration space is bounded (small bit-precision). However, for more complex cases (more complex systems or higher bit-precision), a higher-level procedure must be employed (such as metaheuristic techniques [24,25]). Nevertheless, as previously introduced, SC's advantages are displayed when small bit-precision is used (≤8 bits, for the case of multiplication). Therefore, the whole exploration process is reasonable and there is no need for a higher-level heuristic.
Despite the fact that some solutions have been produced, none of them guarantees good accuracy and a small footprint area at the same time for correlation-sensitive circuits such as the SC quadratic function, the scaled addition, and the multiplication. These circuits are the driving force of the SC realm, and high-demand applications such as image processing or DL employ them.
In this work, we explore in more detail how the LFSR circuit can be better exploited as an RNG source in SC by making a careful selection of the seeds employed, with the purpose of finding the best BS generator technique for different application requirements. Our contribution is four-fold:
We show the LFSR seeding impact over different correlation-sensitive circuits. We demonstrate that, if seeded properly, the LFSR circuit is the most accurate RNG compared to other methods in the literature, as long as it is computed for a complete sequence period. This, in fact, comes with the advantage of using the cheapest RNG found since the early beginnings of SC [11].
We demonstrate that the LFSR may achieve low-autocorrelation behavior when isolated properly, something that has been totally overlooked in the literature up to now. This fact has an impact on the design of commonly used operations such as the stochastic square function $y = x^2$.
We prove our claims in real hardware implementations. Using an FPGA device, we perform a real case application for image processing, implementing an edge detection circuit.
We provide the seeds that must be employed when using LFSRs for different use cases, offering SC designers a direct RNG setting.
2. Seeding Impact on Correlation
LFSRs are valuable in different applications such as fast digital counters [26], whitening sequences [27], cryptography [28], circuit testing [29], and, indeed, SC circuits. Despite its advantages, the use of LFSR sequences for SNGs raises some difficulties. Unlike in the case of TRNGs, for LFSRs the premise that all bits in the stochastic stream are independent of each other no longer applies. Consecutive values of the LFSR sequence are highly dependent, as they are a shifted version of past states. Each bit in the sequence, taken separately, possesses good randomness characteristics, but when the state is seen as a whole binary number (as it is normally used in SC), the ideal randomness quality disappears. This is a real issue for commonly used stochastic functions such as the quadratic function $y = x^2$, which is carried out with a single AND gate (unipolar coding) and a single D-FF register. The purpose of the D-FF is to uncorrelate the stochastic signal from itself by adding a delay cycle. This holds for stochastic signals generated by TRNGs, but it is not fulfilled in the LFSR case. Moreover, when operating multiple BSs, different seed combinations produce different results, and since the sequence is periodic, the same error is observed in each cycle. This contrasts with the TRNG, in which the error fluctuates, converging to zero as the integration time increases. For these reasons, finding optimum seed combinations is a major issue when employing LFSRs as the random number source of the circuit.
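The quadratic function just described can be sketched as follows (a minimal simulation; with an ideal TRNG-style BS, ANDing the stream with its one-cycle-delayed copy converges to $x^2$, which is exactly the property that fails for a raw LFSR stream):

```python
import random

def stochastic_square(bs):
    """y = x^2: AND the BS with a one-cycle-delayed copy of itself (D-FF isolator)."""
    delayed = [0] + bs[:-1]          # the D-FF output lags by one clock cycle
    return [a & b for a, b in zip(bs, delayed)]

rng = random.Random(7)
N, px = 8192, 0.7
x = [1 if rng.random() < px else 0 for _ in range(N)]
sq = stochastic_square(x)
print(sum(sq) / N)  # close to 0.7 ** 2 = 0.49 for a truly random BS
```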
Despite the fact that the LFSR has been the preferred RNG in real SC implementations, previous works do not provide a careful analysis of LFSR seeding; still less do they offer a method to efficiently choose the best seeds to operate. The work that comes closest to doing so was that carried out by J. Anderson et al. [23]. They explored whether there existed a suitable set of seeds to improve the accuracy of stochastic computing systems. They demonstrated that a good selection of seeds increased the accuracy of SC circuits for the same bit precision, and that shorter streams with optimum seeding had the same or better accuracy than longer bit streams with random seeding. Nevertheless, although they present empirical evidence for their results, the authors do not specify what those seed combinations are or how to select them a priori, i.e., without iterating the whole seed-sweeping computation (Monte-Carlo fashion) for the application. One of the problems with their method is the need for a trial-and-error procedure: a quick task if small circuits are evaluated, but a heavy one for a large implementation. In this section, an analysis of the seeds of the LFSR is presented, the aim being to overcome the aforementioned shortcomings.
Suppose two BSs are generated from two independent LFSRs with the same polynomial but different seeds, and that these two BSs are multiplied. The questions that arise are: does every couple of different seeds generate the same result? If not, how does the seeding affect the overall outcome? Take for instance the operation shown in Figure 3. The x signal represents one input value, while y represents the other. The same LFSR polynomial is used to generate each signal but with different seeds. Two versions of y ($y_1$ and $y_2$) are generated for comparison purposes. For $y_1$, we picked as seed the next value in the sequence of x (seed highlighted in red in Figure 3). For $y_2$, we picked the fourth value in the sequence of x (seed highlighted in green). As seen, because the LFSRs share the same polynomial, each y stream is a shifted version of the x stream. This makes the 1's of the x signal match the 1's of $y_1$ and $y_2$ at different times, producing different outcomes in the final operation. In order to see the impact of seeding, the AND operations between x and $y_1$, and between x and $y_2$, are also presented in the figure. As shown, $y_2$ represents the ideal product value more accurately, with a smaller absolute error; that is, $y_2$ is less correlated with x than $y_1$. This short and simple example demonstrates how seeding has an impact on the overall results in SC correlation-sensitive circuits (as is the case of multiplication), so a careful selection of seeds must be carried out to obtain the most accurate results.
Expanding the preceding example, Figure 4 shows the Mean Absolute Error (MAE) for different seeds when x and y are multiplied considering all possible values. Once again, the same polynomial is employed for the two LFSRs. Two different bit resolutions are taken into account in the analysis: 6-bit and 8-bit. Instead of performing the analysis by taking the seed values as reference, as in the previous example, this time we took the difference between the seed positions (seed indexes) in the sequence. Taking Figure 3 as an example, x is generated with seed index 0, $y_1$ is generated with seed index 1, and $y_2$ is generated with seed index 3. In essence, the seed index corresponds to the value taken by the LFSR sequence at the time equal to that index. Formally, the MAE for every seed index is calculated as:

$$\mathrm{MAE} = \frac{1}{2^{2b}} \sum_{X=0}^{2^b-1} \sum_{Y=0}^{2^b-1} \left| \hat{x}\,\hat{y} - \widehat{x \wedge y} \right|$$

where b is the bit-precision; x and y are the BSs generated when converting X and Y to the SC domain; $\hat{x}$ and $\hat{y}$ are the expected values of x and y, respectively; and $\widehat{x \wedge y}$ is the expected value of the AND-gate output stream.
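The seed-index sweep can be reproduced in software. The sketch below uses a hypothetical 6-bit maximal-length LFSR with tap set (6, 5) (not necessarily the polynomial used for Figure 4) and measures the multiplier MAE for two seed-index differences:

```python
def lfsr_sequence(seed, b=6, taps=(6, 5)):
    """Full 2^b - 1 state sequence of a Fibonacci LFSR."""
    mask, state, seq = (1 << b) - 1, seed, []
    for _ in range((1 << b) - 1):
        seq.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask
    return seq

def bitstream(X, rng_values):
    """Comparator-based SNG."""
    return [1 if X > R else 0 for R in rng_values]

def multiplier_mae(seed_index_diff, b=6):
    """MAE of the AND-gate multiplier when the y LFSR is seeded
    `seed_index_diff` positions ahead of the x LFSR in the same sequence."""
    base = lfsr_sequence(1, b)                       # x sequence (seed index 0)
    seq_y = base[seed_index_diff:] + base[:seed_index_diff]
    n = (1 << b) - 1
    total = 0.0
    for X in range(1 << b):
        bx = bitstream(X, base)
        px = sum(bx) / n
        for Y in range(1 << b):
            by = bitstream(Y, seq_y)
            py = sum(by) / n
            pz = sum(a & c for a, c in zip(bx, by)) / n
            total += abs(px * py - pz)
    return total / (1 << (2 * b))

print(multiplier_mae(0), multiplier_mae(31))
```

Identical seeds (difference 0) maximize correlation and therefore the error, while a distant seed index reduces it, reproducing the trend of Figure 4.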
As shown in both plots, the maximum error occurs when x and y are generated with the same seed (seed index difference of 0). However, as the seed index moves away from that of x, the error tends to diminish (with some resonant error peaks throughout), until we reach the farthest seed index. The behaviour is mirrored if we take the center as the point of reference. The takeaway from these figures is that what really matters is the difference between the two seed indexes, not the absolute LFSR seed values.
It is worth noting that, as we move away from seed index 0, some seed indexes present a resonant error behaviour (high peaks in the plots); however, as we move closer to the farthest index, the peaks are mitigated in an exponential way. The reason for this phenomenon can be better understood by observing Figure 5. The blue line shows the normalized value of the 8-bit LFSR sequence for the first 127 cycles. The orange line, in turn, shows the MAE for the first 127 seed indexes. Since both variables share the same horizontal axis, we can plot them in the same figure. As shown, the LFSR sequence presents similar patterns periodically (see arrows in the plot). Considering that the LFSR x seed is at index 0, if the LFSR y seed coincides with one of these initial-pattern values, then a resonant error occurs, indicating that both sequences (the one starting at seed index 0 and the one starting at the initial-pattern) have a high degree of correlation, i.e., they are similar. Therefore, if non-correlated operations are to take place (as is the case of multiplication), it is mandatory to avoid these seed values for the generation of the second BS.
Figure 6 shows the MAE histogram for an 8-bit and a 10-bit LFSR implementation. As can be appreciated from the LFSR-8 instance, selecting random seeds can lead to more than twice the error obtained when the seeds are selected intentionally. According to the measurements, there exists a 79% probability of choosing an inaccurate seed (seeds with an MAE greater than 0.002 in Figure 6 for the LFSR-8 case). Moreover, 90% of the LFSR-10 seeds produce an MAE of less than 0.002, which is the same MAE we obtain when the seeds are efficiently chosen for the LFSR-8. We can thereby achieve similar accuracy by carrying out the pairing seed selection deliberately for a lower-resolution LFSR, instead of doing it randomly for a higher-resolution LFSR, saving hardware resources, latency, and power.
The Absolute Errors (AE) for different seed indexes when varying x and y are presented in Figure 7. The worst case (Figure 7a) occurs when we generate the y BS with seed index 0, producing maximum correlation between both input BSs. As seen, the maximum error is produced around the center, where the variance of the signal is maximum ($x, y \approx 0.5$), taking maximum error values of up to 0.25. The second plot (Figure 7b) shows the error behaviour for the first resonant peak of Figure 5 (seed index 25). On this occasion, the maximum error is spotted when one of the signals is 0.5, reaching error values of up to 0.12, almost half of the worst case. The best seed (index = 97) is presented in Figure 7c, where the maximum error rises to no more than 0.011, an order of magnitude less than at the first resonant peak. Finally, the farthest seed index (idx = 127) is shown in Figure 7d. As can be seen, its behavior is very similar to the best seed index case (Figure 7c), presenting a maximum value of 0.016; 0.005 higher than the best seed case. As shown, the outcome pattern varies depending on the seed employed; the seeding effect is therefore a major concern when utilizing LFSRs as the random source for generating stochastic BSs.
3. Seeding Impact on Autocorrelation
The study of LFSR seeding can be extended to a very important area of research in SC: autocorrelation. A BS is said to be non-autocorrelated when there exists a low dependency between its consecutive bits. When autocorrelation occurs, an isolator circuit can be used to uncorrelate the BS with itself, allowing operations such as the stochastic square function ($y = x^2$) to be performed. In other words, the autocorrelation measures how well an isolator circuit is able to uncorrelate the BS with a delayed version of itself [19]. The most common circuit employed in the literature as an isolator is the D-Flip-Flop (DFF) [11,30]. Inserting a DFF in the BS line uncorrelates the BS with itself or with another BS generated with the same RNG, as explained by Z. Li et al. [20]. This is something especially exploited in their work in the quest to reduce the area overhead produced by the RNG circuits, since inserting a single DFF in the line is much simpler than inserting a complete independent RNG. Nevertheless, although they claim that their technique works for any BS generated with a generator structure made of any RNG (LFSR method included) and a comparator, the truth is that for the LFSR case this property does not apply, as will be analyzed in this section. The LFSR circuit has a high degree of autocorrelation, as demonstrated in [19], and a single DFF isolator insertion is insufficient to uncorrelate the BS with itself.
Let us first see how the LFSR behaves compared to other RNGs found in the literature using an autocorrelation metric. F. Neugebauer et al. [19] define an autocorrelation metric based on the Box-Jenkins function [31], which is employed in the NIST Engineering Statistics Handbook [32]. The definition they provide is as follows: let x be a BS with a sequence $x_1, x_2, \ldots, x_N$, where N is the BS length, and let $\hat{x}$ be the expected value of x. The autocorrelation $A_k$ of x will be:

$$A_k = \frac{\sum_{i=1}^{N-k} (x_i - \hat{x})(x_{i+k} - \hat{x})}{\sum_{i=1}^{N} (x_i - \hat{x})^2}$$

where k is the number of cycles delayed (the number of DFFs inserted in the line). Autocorrelation values close to 0 indicate a good independence of the BS from its k-delayed version, whereas a high absolute value indicates a bad independence.
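The Box-Jenkins metric above translates directly into code (a minimal sketch; Python's pseudo-RNG stands in for a TRNG, and the doubled stream is an illustrative example of a strongly autocorrelated BS):

```python
import random

def autocorrelation(bs, k):
    """Box-Jenkins lag-k autocorrelation A_k of a bit-stream."""
    N = len(bs)
    mean = sum(bs) / N
    num = sum((bs[i] - mean) * (bs[i + k] - mean) for i in range(N - k))
    den = sum((v - mean) ** 2 for v in bs)
    return num / den

rng = random.Random(3)
bs = [1 if rng.random() < 0.5 else 0 for _ in range(2048)]
print(autocorrelation(bs, 1))        # near 0: consecutive bits are independent

# A stream in which every bit is repeated twice is strongly autocorrelated:
doubled = [bit for bit in bs for _ in (0, 1)]
print(autocorrelation(doubled, 1))   # clearly positive
```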
Figure 8 shows a comparison of the autocorrelation between the LFSR, for different k values, and other RNG methods. The measure is taken for different x values using 8-bit precision. As seen, the LFSR for small k presents an autocorrelation value higher than that of the TRNG (TRandom). The average value for the LFSR with k = 1 (blue line) is the highest, showing an increase in A as x moves away from the middle point. That is why inserting a single DFF when using an LFSR produces poor precision results when the BS value moves away from the center point ($x = 0.5$). For the case of inserting 2 DFFs in the line (LFSR k = 2), the average value decreases, but the curve still shows high peaks, performing poorly for most applications. The ideal case, which is the TRandom, presents an average A value close to zero, measured from 1000 different random samples (in the figure, one of the samples chosen arbitrarily is plotted), having a better autocorrelation value than the LFSR for k = 1 and k = 2. This conclusion is supported by [19], which discards the stand-alone LFSR circuit for stochastic operations such as the quadratic function and proposes the SBoNG method as an approach to circumvent the problem. The SBoNG method is based on connecting a nonlinear Boolean function to the output of the LFSR. The function is performed by a combinational circuit called an SBox [33]. The SBox circuit is normally implemented as a LUT (although it can be implemented using logic gates) and is constrained to 4-bit inputs, limiting its use to 4-bit-multiple RNGs. In their paper, the authors compare the SBoNG method with the LFSR circuit, but only for small k, concluding that SBoNG performs better for stochastic operations demanding low-autocorrelated BSs. Nevertheless, as Figure 8 shows, the LFSR with the precise number of delay elements ($k = 5$) performs much better than the TRandom and the SBoNG implementations, presenting the lowest average A value: 2.2 times better than the SBoNG method, and better also than the TRandom one. It must be said that, for the SBoNG measures, our results differ from the ones presented in the original paper [19]. This is because they evaluated the SBoNG circuit by varying randomly, for every iteration of the test, the LFSR seed and the initial state of the circuit (see details in the original paper). However, to conduct real digital circuit implementations without increasing the amount of resources needed to generate random values for every iteration, we fixed the seed and initial-state values. These values were found by running 1000 tests and selecting the best-case result, i.e., the LFSR-seed and initial-state couple which yielded the lowest average autocorrelation value.
The reason the LFSR with 5 DFFs performs better can be understood by analyzing the MAE values of the first seed indexes in the 8-bit LFSR multiplier. Table 1 shows the numerical values. A k-delayed version of x is equivalent to taking seed index k (see the example of Figure 3). That is why the $k = 5$ version has a low autocorrelation value: seed index 5 is the first minimum of the MAE, as can be observed in the table. Additionally, the first two seed indexes, which represent $k = 1$ and $k = 2$ in Figure 8, have a high MAE, corresponding to a high autocorrelation level. It is therefore expected that the LFSR will perform poorly if we employ only one or two isolators, since that corresponds to operating with the first two seed indexes. However, if we embed the correct number of DFFs, we can obtain accurate results.
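The equivalence between a k-cycle delay and seed index k can be checked directly: an LFSR seeded with the value found k positions ahead in the sequence reproduces the original sequence shifted by k. A minimal sketch (assuming the same hypothetical 8-bit tap set (8, 6, 5, 4) as before):

```python
def lfsr_states(seed, cycles, b=8, taps=(8, 6, 5, 4)):
    """First `cycles` states of a Fibonacci LFSR."""
    mask, state, out = (1 << b) - 1, seed, []
    for _ in range(cycles):
        out.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask
    return out

k = 5
a = lfsr_states(255, 60)
b_seq = lfsr_states(a[k], 60)   # seeded with the value at seed index k
print(a[k:] == b_seq[:-k])      # True: k-delay == seed index k
```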