1. Introduction
Nonorthogonal multiple access (NOMA) is an emerging class of multiple-access technologies in 5G telecommunications and the Internet of Things [1]. Interleave division multiple access (IDMA) is one of the NOMA schemes that distinguishes multiple users according to their distinct interleaving patterns [2]. By virtue of its fine scalability and robustness, IDMA is considered a promising NOMA candidate for forthcoming applications [3].
Recent works in the literature have pioneered sophisticated multiuser detector architectures for IDMA systems [4,5,6,7,8,9,10,11]. As generalized in Figure 1, a U-user detector incorporates U user-wise processing blocks (UPBs) and one elementary signal estimator (ESE). Each UPB contains its own address generation unit (AGU) for accessing the memories therein. Since all users employ distinct interleaving patterns and access memories in their own manners, the U AGUs are implemented separately, making the total number of AGUs in a detector U. Accordingly, when a massive number of users are connected, i.e., U >> 1, the AGUs as a whole contribute a significant portion of the overall hardware complexity.
Despite this implementation burden, only one AGU structure [5] has been presented in the literature, and it has never been studied in detail. For the first time, this paper analyzes all possible options for designing AGUs and then applies a complexity reduction technique to each of those architectures. More specifically, some components in the AGUs are relocated so that they can be shared or removed without affecting the functionality. Because this modification is completely transparent, it is applicable to any existing multiuser detector without tailoring the interfacing components therein. All the resulting AGUs are compared with each other in terms of hardware complexity, and a new architecture simpler than the state-of-the-art one is developed. Implementation results in a 65 nm CMOS process demonstrate that the proposed AGU reduces the equivalent gate count and the power consumption of the prior architecture by 13% and 31%, respectively.
The rest of this paper is organized as follows. Section 2 reviews the fundamentals of multiuser detection in IDMA systems. Section 3 compares two addressing modes for AGUs. The proposed complexity reduction technique is presented in Section 4. In Section 5, all possible options for AGUs are evaluated and discussed along with the implementation results. Concluding remarks are made in Section 6.
2. Background
At the transmitter of the u-th user (Tx u) in Figure 1, each of N information bits is first replicated S times by a spreader. The resulting sequence of J = NS chips is then permuted by an interleaver of length J. The sequences of the J chips before and after interleaving can be indexed by {j} and {π_u(j)}, respectively, for j = 0, 1, …, J – 1, where π_u(∙) is the interleaving function of the u-th user. In the case of J = 4 and U = 2, for example, {j} = {0, 1, 2, 3}, {π_1(j)} = {2, 0, 3, 1}, and {π_2(j)} = {1, 3, 2, 0}. Note that the two interleaving patterns are distinct. The chips departing from the U users go through a wireless channel while interfering with each other. In a multiuser detector receiving the chips with interference, the ESE first distributes l_u(π_u(j)) to UPB u for u = 1, 2, …, U, where l_u(π_u(j)) is the log-likelihood ratio (LLR) of the j-th chip from the u-th user. In return, UPB u answers the ESE with e_u(π_u(j)), called an extrinsic LLR. After the ESE and the UPBs exchange their LLRs several times, the final estimates of the N information bits are determined by the signs of the LLRs.
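The transmitter steps can be illustrated with a short Python sketch. It assumes that the chip sent at position j is the spread chip whose index is π_u(j), which is the reading suggested by the indexing {π_u(j)} above. The bit values and the choice N = 2, S = 2 are illustrative assumptions, and the helper names `spread` and `interleave` are hypothetical.

```python
def spread(bits, S):
    """Replicate each information bit S times (repetition spreading)."""
    return [b for b in bits for _ in range(S)]


def interleave(chips, pi):
    """Assumed convention: position j carries the spread chip with index pi[j]."""
    return [chips[k] for k in pi]


bits = [1, 0]            # N = 2 information bits (illustrative values)
S = 2                    # spreading factor, so J = N * S = 4
pi_1 = [2, 0, 3, 1]      # interleaving pattern of user 1 from the text
pi_2 = [1, 3, 2, 0]      # interleaving pattern of user 2 from the text

chips = spread(bits, S)            # [1, 1, 0, 0]
tx_1 = interleave(chips, pi_1)     # [0, 1, 0, 1]
tx_2 = interleave(chips, pi_2)     # [1, 0, 0, 1]
```

Feeding the same chips through the two patterns gives different chip orders, which is what lets the detector tell the users apart.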
The operation of UPB u can be formulated for all j = 0, 1, …, J – 1 as

$$
\begin{aligned}
e_u(\pi_u(j)) &= d_u(p_u(j)) - l_u(\pi_u(j)) \\
&= \sum_{j':\, p_u(j') = p_u(j)} l_u(\pi_u(j')) - l_u(\pi_u(j)) \\
&= \sum_{j':\, \lfloor \pi_u(j')/S \rfloor = \lfloor \pi_u(j)/S \rfloor} l_u(\pi_u(j')) - l_u(\pi_u(j)).
\end{aligned}
\tag{1}
$$

Here, d_u(p_u(j)) is called a despread LLR, and p_u(j) is the index of the despread LLR that corresponds to l_u(π_u(j)). Accordingly, the first line of (1) states that an extrinsic LLR, e_u(π_u(j)), is calculated by subtracting an incoming LLR, l_u(π_u(j)), from its corresponding despread LLR, d_u(p_u(j)). Comparing the first and the second lines of (1) suggests that d_u(p_u(j)) is the sum of S LLRs associated with p_u(j). Since p_u(j) = floor(π_u(j)/S), as rewritten in the third line of (1), {π_u(j)} can be divided into J/S disjoint subsets, each of which has S elements associated with the same p_u(j). Then, d_u(p_u(j)) can be interpreted as the sum of such S LLRs in a subset. Let us exemplify with J = 4, S = 2, and {π_u(j)} = {2, 3, 1, 0}. The elements in subset {π_u(0), π_u(1)} = {2, 3} are associated with p_u(j) = 1, and d_u(p_u(j)) = d_u(1) is the sum of S = 2 LLRs, l_u(2) + l_u(3). The elements in subset {π_u(2), π_u(3)} = {1, 0} are associated with p_u(j) = 0, and d_u(p_u(j)) = d_u(0) = l_u(1) + l_u(0). It is worth noting that obtaining one despread LLR by accumulating S LLRs associated with the same p_u(j) is the inverse of spreading, which makes S replicas of the p_u(j)-th information bit.
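As a reference model of (1), the following Python sketch computes the despread and extrinsic LLRs directly from their definitions, using the same J = 4, S = 2 pattern as the example. The LLR values and the function name `upb_reference` are hypothetical, and the list `l` is assumed to be indexed by the interleaved index π_u(j).

```python
def upb_reference(l, pi, S):
    """Direct evaluation of (1) for one user.

    l  : LLRs indexed by interleaved index, i.e. l[pi[j]] arrives at cycle j
    pi : interleaving pattern
    S  : spreading factor
    """
    J = len(pi)
    p = [pi[j] // S for j in range(J)]           # index of the despread LLR
    d = [0.0] * (J // S)
    for j in range(J):
        d[p[j]] += l[pi[j]]                      # sum of the S LLRs sharing p
    e = [d[p[j]] - l[pi[j]] for j in range(J)]   # extrinsic LLRs, in arrival order
    return p, d, e


l = [0.5, -1.0, 2.0, 0.25]   # illustrative LLR values
pi = [2, 3, 1, 0]            # pattern from the example in the text
p, d, e = upb_reference(l, pi, S=2)
# p == [1, 1, 0, 0], d == [-0.5, 2.25], e == [0.25, 2.0, 0.5, -1.0]
```

Each extrinsic LLR is the sum of the other S – 1 LLRs in the same subset, so with S = 2 the two members of a subset simply swap values.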
The state-of-the-art scheme to calculate (1), which is called on-the-fly despreading [5], has predominantly been employed by the latest UPBs [5,6,7,8]. It comprises two phases: (1) reception and (2) response. In the first phase, UPB u receives l_u(π_u(j)) for all j = 0, 1, …, J – 1 from the ESE and stores them into a memory named M_L. Simultaneously, it adds l_u(π_u(j)) to the p_u(j)-th entry in a memory named M_P1. M_P1 contains J/S entries, each of which holds a partial sum (PS) of d_u(p_u(j)). After accumulating all the J LLRs as stated, the PSs become {d_u(p_u(j))}. Let us exemplify again with J = 4, S = 2, {π_u(j)} = {2, 3, 1, 0}, d_u(0) = l_u(0) + l_u(1), and d_u(1) = l_u(2) + l_u(3).
- (1) For j = 0, l_u(2) is stored into M_L, and the PS of d_u(1) in M_P1 is set to l_u(2).
- (2) For j = 1, l_u(3) is stored into M_L, and the PS of d_u(1) in M_P1, which has been l_u(2), is updated to d_u(1) = l_u(2) + l_u(3).
- (3) For j = 2, l_u(1) is stored into M_L, and the PS of d_u(0) in M_P1 is set to l_u(1).
- (4) For j = 3, l_u(0) is stored into M_L, and the PS of d_u(0) in M_P1, which has been l_u(1), is updated to d_u(0) = l_u(1) + l_u(0).
As a result of J = 4 cycles, {l_u(π_u(j))} and {d_u(p_u(j))} have been prepared in M_L and M_P1, respectively. In the second phase, UPB u returns e_u(π_u(j)) = d_u(p_u(j)) – l_u(π_u(j)) to the ESE for all j = 0, 1, …, J – 1. The minuend and the subtrahend are retrieved from M_P1 and M_L, respectively. The next reception phase, in which new PSs are to be computed, may start in the middle of the response phase. However, since {d_u(p_u(j))} in M_P1 are in use during the response phase, they must not be overwritten. Accordingly, the new PSs are stored into a duplicate memory of M_P1, named M_P2. As the phases iterate, the roles of M_P1 and M_P2 alternate. For example, in even-numbered iterations, M_P1 provides d_u(p_u(j)), while M_P2 manages new PSs. In odd-numbered iterations, the roles are reversed.
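The two phases can be mimicked by a small behavioral model in Python. This is only a sketch under simplifying assumptions: the LLR memory is written with sequential addresses, the partial sums start from zero, and a single partial-sum memory is used, so the alternation between the two duplicate memories is not modeled. The function name `on_the_fly_despread` and the LLR values are hypothetical.

```python
def on_the_fly_despread(llr_stream, pi, S):
    """Behavioral model of one reception phase followed by one response phase.

    llr_stream[j] is the LLR delivered by the ESE at cycle j, i.e. l(pi[j]).
    """
    J = len(pi)
    mem_l = [0.0] * J            # LLR memory, J entries
    mem_p = [0.0] * (J // S)     # partial-sum memory, J/S entries (assumed cleared)

    # Reception phase: store each LLR and accumulate it into its partial sum.
    for j, llr in enumerate(llr_stream):
        mem_l[j] = llr
        mem_p[pi[j] // S] += llr

    # Response phase: minuend from the partial-sum memory, subtrahend from the LLR memory.
    return [mem_p[pi[j] // S] - mem_l[j] for j in range(J)]


pi = [2, 3, 1, 0]
l = [0.5, -1.0, 2.0, 0.25]                  # illustrative LLRs indexed by pi[j]
stream = [l[k] for k in pi]                 # arrival order: l(2), l(3), l(1), l(0)
extrinsic = on_the_fly_despread(stream, pi, S=2)
# extrinsic == [0.25, 2.0, 0.5, -1.0]
```

Every cycle of the reception phase touches both memories once, which is why address generation for both is needed on every cycle.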
3. AGUs Based on Sequential and Interleaved Addresses
As stated above, every UPB intensively accesses M_L, M_P1, and M_P2 every cycle to read and write LLRs and PSs, necessitating the generation of proper read and write addresses. The AGU is responsible for organizing such addresses and interfaces with the memories as depicted in Figure 2. The nomenclature of the signals is as follows. The baseline text stands for the functionality of an address, while the subscript designates the associated memory. For example, RA_L is the read address for M_L, and WA_L is the write address for M_L. In a similar manner, RA_P1 and RA_P2 are the read addresses for M_P1 and M_P2, respectively, and WA_P1 and WA_P2 are the corresponding write addresses. Note that both M_P1 and M_P2 take WA_P as their common write address. c[0] is the least significant bit (LSB) of the current iteration count c; it is 1 if the current iteration is odd-numbered and 0 if it is even-numbered. Table 1 summarizes these meanings for quick reference.
Recall that M_L has J entries to store l_u(π_u(j)) for j = 0, 1, …, J – 1, and each of M_P1 and M_P2 has J/S entries to hold the PSs of d_u(p_u(j)) for p_u(j) = 0, 1, …, J/S – 1. While p_u(j) is the only addressing scheme for the J/S entries of M_P1 and M_P2, two different options are available for accessing the J entries of M_L. One is to use sequential addresses (SAs), {j}, and the other is to adopt interleaved addresses (IAs), {π_u(j)}. Nevertheless, only the former has been presented in the literature [5], and it has never been compared with the latter.
Figure 3 sketches the existing AGU using SAs [5]. The output of the counter at the bottom, which is a cyclic sequence of the elements in {j} = {0, 1, …, J – 1}, is readily used as RA_L. The output of the other counter at the top, which precedes the bottom one by E – 1 cycles, plays the role of n_WA_L. n_WA_L is the next write address for M_L that precedes WA_L by one cycle, and E is the latency of the ESE. R_J denotes a D-type register holding log_2 J bits, where the subscript J represents the argument of the logarithm. Both RA_L and WA_L are log_2 J bits long so as to address all J entries in M_L. WA_L is generated by an R_J that delays n_WA_L by one cycle. Given i as input, interleaver u produces π_u(i), and a division-by-S-and-floor unit (DFU) calculates floor(i/S). Putting it all together, given RA_L, i.e., j, as input, the cascade of interleaver u and a DFU bounded by dotted lines derives RA_P = floor(π_u(j)/S) = p_u(j). RA_P serves as one input of each multiplexer. By the other set of interleaver u and the following DFU, n_WA_L is transformed into n_WA_P, which feeds the remaining input of each multiplexer. According to c[0], RA_P1 and RA_P2 alternate between n_WA_P and RA_P. This implements the aforementioned role exchange of M_P1 and M_P2, i.e., one retrieves d_u(p_u(j)) to compute e_u(π_u(j)) in the response phase, while the other prefetches a PS to accumulate l_u(π_u(j)) during the reception phase. The cascade of an R_J and a DFU that follows the upper interleaver u produces WA_P.
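The signal flow of the SA-based AGU can be traced with a cycle-by-cycle Python model. This is a behavioral sketch, not the circuit of Figure 3: the register reset values of 0, the modelling of "precedes by E – 1 cycles" as an offset of E – 1 on the counter value, the mapping of c[0] = 0 to M_P1 being the memory read for the response, and the function name `s_agu_trace` are all assumptions made for illustration.

```python
def s_agu_trace(pi, S, E, cycles, c0=0):
    """Trace the address signals of an SA-based AGU for a number of cycles.

    pi : interleaving pattern, S : spreading factor, E : ESE latency
    c0 : LSB of the iteration count, held constant here for simplicity
    """
    J = len(pi)
    wal_reg = 0    # register producing WAL (reset value 0 is an assumption)
    ia_reg = 0     # register ahead of the WAP DFU (reset value 0 is an assumption)
    trace = []
    for t in range(cycles):
        ral = t % J                    # bottom counter, used directly as RAL
        n_wal = (t + E - 1) % J        # top counter, assumed E - 1 cycles ahead
        rap = pi[ral] // S             # interleaver followed by a DFU
        n_wap = pi[n_wal] // S         # second interleaver followed by a DFU
        wal = wal_reg                  # n_WAL delayed by one cycle
        wap = ia_reg // S              # delayed interleaver output, then a DFU
        # Assumed mapping: in an even iteration MP1 is read for the response
        # while MP2 prefetches a partial sum; in an odd iteration they swap.
        rap1, rap2 = (rap, n_wap) if c0 == 0 else (n_wap, rap)
        trace.append({"RAL": ral, "n_WAL": n_wal, "WAL": wal, "RAP": rap,
                      "n_WAP": n_wap, "WAP": wap, "RAP1": rap1, "RAP2": rap2})
        wal_reg, ia_reg = n_wal, pi[n_wal]   # clock edge: registers capture inputs
    return trace


trace = s_agu_trace([2, 3, 1, 0], S=2, E=2, cycles=4)
```

In this model, three floor divisions are evaluated per cycle, one each for RAP, n_WAP, and WAP.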
On the other hand, Figure 4 depicts another possible AGU that uses IAs. Unlike the SA-based AGU (S-AGU), RA_L is the output of interleaver u that shuffles the cyclic sequence {j} from a counter, i.e., {π_u(j)}. n_WA_L is also taken from the output of the other interleaver u. The two counters are out of phase by E – 1 cycles, as they are in Figure 3. Since RA_L is already an IA, a DFU is the only remaining stage required to obtain RA_P. Similarly, n_WA_P is produced by a DFU that takes n_WA_L as input. n_WA_P and RA_P are connected to the multiplexers.
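A matching behavioral sketch of the IA-based AGU is given below in Python. As before, the reset value of 0, the counter offset of E – 1, the mapping of c[0] to the multiplexer outputs, and the function name `i_agu_trace` are illustrative assumptions. The single register delays the interleaved n_WAL, and the write address for the partial-sum memories is taken as the floor division of its output.

```python
def i_agu_trace(pi, S, E, cycles, c0=0):
    """Trace the address signals of an IA-based AGU for a number of cycles."""
    J = len(pi)
    wal_reg = 0    # the only register; reset value 0 is an assumption
    trace = []
    for t in range(cycles):
        ral = pi[t % J]                 # interleaved read address
        n_wal = pi[(t + E - 1) % J]     # interleaved next write address
        rap = ral // S                  # a DFU is the only stage after RAL
        n_wap = n_wal // S              # DFU taking n_WAL as input
        wal = wal_reg                   # n_WAL delayed by one cycle
        wap = wal // S                  # DFU following the register
        rap1, rap2 = (rap, n_wap) if c0 == 0 else (n_wap, rap)
        trace.append({"RAL": ral, "n_WAL": n_wal, "WAL": wal, "RAP": rap,
                      "n_WAP": n_wap, "WAP": wap, "RAP1": rap1, "RAP2": rap2})
        wal_reg = n_wal                 # clock edge
    return trace


trace_ia = i_agu_trace([2, 3, 1, 0], S=2, E=2, cycles=4)
```

For the same pattern, the addresses of the partial-sum memories come out as p_u(j) regardless of the addressing mode; only the addresses of the LLR memory change from {j} to {π_u(j)}.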
Both AGUs in Figure 3 and Figure 4 contain two counters, two interleavers, and two multiplexers. Excluding these common components, the remaining ones are colored grey for emphasis. The IA-based AGU (I-AGU) has one fewer R_J than the S-AGU. On the other hand, since the SAs in {j} are independent of u, unlike the IAs in {π_u(j)}, the counters and the R_J that generate RA_L and WA_L can be shared among all UPB u for u = 1, 2, …, U, which is an advantage of the S-AGU. Another noteworthy merit of the S-AGU is that it may employ the simplified memory subsystem in Ref. [7]. More specifically, M_L is usually implemented with a dual-port memory to accommodate two requests per cycle. When M_L is accessed by SAs, however, a pair of adjacent requests can be integrated into one, and the number of memory accesses per cycle is reduced from two to one. Then, M_L can be implemented with a single-port memory instead of a dual-port one, reducing the hardware complexity significantly.
4. DFU-Reduced Architecture
A DFU includes a division that incurs a significant hardware burden. It is therefore important to minimize the number of DFUs. To this end, we now manipulate the AGUs as follows. The top right part of Figure 3 is redrawn in step 1 of Figure 5. We exchange the locations of R_J and the following DFU as illustrated in step 2, as such a change does not affect the functionality at all. Then, instead of using two separate DFUs, the output of one DFU can be shared as shown in step 3, mitigating the complexity. Besides, R_J is reduced to R_(J/S), which is a register holding log_2(J/S) < log_2 J bits. Note that each of RA_P1, RA_P2, and WA_P is log_2(J/S) bits long so as to address the J/S entries in M_P1 and M_P2. Therefore, the relocation results in not only the removal of a DFU but also a reduction of the bit-width. The entire architecture of the DFU-reduced S-AGU is illustrated in Figure 6.
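The reason the swap is harmless can be checked with a few lines of Python: a one-cycle delay and a floor division commute on a stream of addresses, provided the register reset value maps to itself under the division (a reset value of 0 is assumed here). The helper names `delay` and `dfu` and the sample address stream are hypothetical.

```python
def delay(stream, reset=0):
    """One-cycle register: the output at cycle t is the input at cycle t - 1."""
    return [reset] + list(stream[:-1])


def dfu(stream, S):
    """Division-by-S-and-floor applied to every element of a stream."""
    return [x // S for x in stream]


S = 16
interleaved = [37, 5, 120, 64, 15, 16]        # illustrative interleaver outputs

register_then_dfu = dfu(delay(interleaved), S)   # register first, then divide
dfu_then_register = delay(dfu(interleaved, S))   # divide first, then register

# After the swap, the DFU output ahead of the register is the undelayed
# floor-divided address, so one DFU can serve both uses.
shared_dfu_output = dfu(interleaved, S)

# Register width before and after the swap for J = 8192 and S = 16.
J = 8192
bits_before = (J - 1).bit_length()         # 13 bits for values below J
bits_after = (J // S - 1).bit_length()     # 9 bits for values below J/S
```

Both orderings give the same delayed sequence, while the register placed after the DFU only has to hold values below J/S.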
A similar approach can be applied to the I-AGU as well. Step 1 of Figure 7 redraws the top right part of Figure 4. Swapping R_J and the following DFU, we obtain step 2 of Figure 7. Subsequently, we can share the output of the sole DFU and substitute R_J with R_(J/S), as depicted in step 3. However, note that the output of the DFU is now n_WA_P, from which WA_L cannot be retrieved. To secure the indispensable WA_L from n_WA_L, we need to add an R_J as illustrated in step 4. Unlike the original I-AGU, which is of the lowest complexity, the DFU-reduced I-AGU in Figure 8 requires an additional register to hold n_WA_P for one cycle. Thus, while one DFU is eliminated, one register is appended. In short, the S-AGU benefits more from the relocation technique than the I-AGU.
It is worth noting that the complexity of each DFU can be somewhat relieved by confining S to be a power of two, as the division by a power of two can be easily achieved by right-shift operations. In exchange for such a benefit, however, it sacrifices the range of applicability.
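The shift-based shortcut can be illustrated in Python as follows. The function name `dfu_shift` is hypothetical, and the sketch simply rejects any S that is not a power of two, reflecting the restricted applicability mentioned above.

```python
def dfu_shift(i, S):
    """Floor division by S using a right shift; valid only for power-of-two S."""
    if S <= 0 or S & (S - 1):
        raise ValueError("S must be a power of two")
    return i >> (S.bit_length() - 1)   # shift amount equals log2(S)


# Exhaustive check against ordinary floor division for J = 8192 and S = 16.
matches = all(dfu_shift(i, 16) == i // 16 for i in range(8192))
```

In hardware, such a shift by a constant amounts to dropping the log_2 S least significant bits of the address.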
5. Evaluation and Discussion
For all the AGU architectures in Figure 3, Figure 4, Figure 6, and Figure 8, Table 2 summarizes the numbers of DFUs and register bits in a U-user detector. The feasibility of a single-port M_L is also tabulated. Counting the DFUs is straightforward, as the total is the number of DFUs per AGU multiplied by the number of AGUs in a detector, U. In contrast, register bits should be counted while taking into account the following: the R_J that produces WA_L in S-AGUs can be shared among the U UPBs, as the SA-based WA_L is identical in all UPBs. In the case of the original S-AGU, the number of R_J preceding the DFU is U, whereas the number of R_J producing WA_L is 1, making the total number of register bits (U + 1)log_2 J. In the case of the DFU-reduced S-AGU, the numbers of R_(J/S) and R_J are U and 1, respectively, making the total number of register bits U log_2(J/S) + log_2 J. All the R_J's in I-AGUs take different inputs and cannot be shared.
For the common and practical set of parameters used in Refs. [5,6,7,8], e.g., U = 16, J = 8192, and S = 16, Table 3 enumerates the numbers of DFUs and register bits. The percentages in parentheses are calculated with respect to the original S-AGU. The DFU-reduced S-AGU includes the fewest DFUs and register bits. In particular, it requires 33% fewer DFUs and 29% fewer register bits than the original S-AGU, highlighting the benefits of the proposed relocation. On top of that, it may adopt a single-port M_L, making it promising in every aspect of hardware complexity.
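The quoted savings for the S-AGU can be reproduced from the register-bit expressions with a few lines of Python. The per-AGU DFU counts of three for the original S-AGU and two for the DFU-reduced one are read from the textual descriptions of the architectures and should be treated as an assumption here; the helper name `log2_exact` is hypothetical.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


U, J, S = 16, 8192, 16

# Register bits: (U + 1) log2(J) versus U log2(J/S) + log2(J).
bits_original = (U + 1) * log2_exact(J)
bits_reduced = U * log2_exact(J // S) + log2_exact(J)

# DFUs: assumed three per original S-AGU and two per DFU-reduced S-AGU.
dfus_original = 3 * U
dfus_reduced = 2 * U

saving_bits = round(100 * (1 - bits_reduced / bits_original))
saving_dfus = round(100 * (1 - dfus_reduced / dfus_original))
# bits_original == 221, bits_reduced == 157, saving_bits == 29, saving_dfus == 33
```

The resulting 29% and 33% agree with the percentages stated for the DFU-reduced S-AGU.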
In addition to the DFUs and the register bits, the AGUs include other components that contribute to the overall hardware complexity: algebraic interleavers [9,12], counters, and multiplexers. Besides, identical logic can be synthesized differently as fan-in, fan-out, and gate sizing in the circuitry vary. To evaluate more thoroughly by taking such factors into account, the architectures were implemented in a 65 nm CMOS process using a 300 MHz clock and a 1.2 V supply. The corresponding results are summarized in Table 4. Equivalent gates were counted by regarding a two-input NAND gate as one. Power consumption was measured by back-annotating switching activities. The percentages in parentheses are again calculated with respect to the original S-AGU. The original S-AGU and I-AGU have almost the same equivalent gate counts and power dissipation. In contrast, the DFU-reduced S-AGU integrates fewer gates and consumes less power than the DFU-reduced I-AGU, as the latter demands additional registers in order to obtain WA_L. Comparing the original and the DFU-reduced pairs, the DFU-reduced ones dissipate much less power than their counterparts, owing to the removal of computationally intensive DFUs. In particular, the DFU-reduced S-AGU can be realized with 13% fewer gates and consumes 31% less power than the original S-AGU.