Article

An O(log₂ N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures

1 Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, UK
2 Department of Computer Science, University of Liverpool, Liverpool L69 3BX, UK
3 Department of Computer Engineering and Informatics, University of Patras, 26504 Patras, Greece
* Author to whom correspondence should be addressed.
Algorithms 2021, 14(12), 342; https://doi.org/10.3390/a14120342
Submission received: 20 October 2021 / Revised: 20 November 2021 / Accepted: 25 November 2021 / Published: 26 November 2021
(This article belongs to the Collection Parallel and Distributed Computing: Algorithms and Applications)

Abstract

Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications becomes increasingly long. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize if using textbook parallel computing techniques. A state-of-the-art redistribution takes $O((\log_2 N)^2)$ computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in $O(\log_2 N)$ on Shared Memory (SM) architectures, such as GPU or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves an $O(\log_2 N)$ time complexity. We also present empirical results that indicate that our novel approach outperforms the $O((\log_2 N)^2)$ approach.

1. Introduction

1.1. Motivation

In several modern applications, it is often necessary to estimate the state of a system, given a mathematical model for the system and a stream of noisy observations. Particle Filters (PFs) are typically used in this context. The key idea is to sample N hypotheses (i.e., particles) from an arbitrary proposal distribution to approximate the probability density function (pdf) of the true state. However, at some point, the particles experience a numerical error, called particle degeneracy, which makes the estimates diverge from the true state. A resampling algorithm is then applied to correct for degeneracy by replacing the particles that are diverging from the true state with copies of the particles that are not doing so [1]. This sampling–resampling approach is highly flexible, such that PFs find application in a wide range of fields, ranging from machine learning [2] to medical research [3], fault prediction [4], weather forecasting [5], tracking [6] or, broadly speaking, any domain involving decision making in response to streaming data. Resampling is also used in other Monte Carlo methods, such as sequential Monte Carlo samplers [7,8] and PHD filters [9]. For brevity, this paper focuses exclusively on PF contexts.
Modern efforts to make models more detailed have translated into an increasing demand for more accurate PFs. This demand can be satisfied in several ways, ranging from applying better proposal distributions [10] to collecting more measurements [11] and using more particles [12,13]. Using more particles is especially important in settings where we are more interested in computing the probability that the true state falls within a certain state-space region, rather than simply estimating its mean [14]. However, the likely side-effect of any of these approaches is a significant increase in the run-time, and this may be critical, especially in real-time applications [12]. Parallel computing becomes necessary in order to compensate for this side-effect.

1.2. Problem Definition and Related Work

Although the particles can be sampled in embarrassingly parallel fashion, resampling is hard to parallelize globally, i.e., in such a way that the result for $P > 1$ processing units (or cores, as referred to in this paper) is identical to that achieved using a single core. This is because of the difficulties in parallelizing the constituent redistribution step, whose textbook implementation achieves an $O(N)$ time complexity on one core. In order to bypass the need to parallelize resampling, one could use multiple PFs in parallel, each performing resampling locally, i.e., each core performs resampling independently without considering the content of the other cores' particles. However, this approach has shown accuracy, scalability and model-dependent applicability issues [15,16,17].
On Shared Memory (SM) architectures, it has been shown that a global redistribution of the particles takes $O(\log_2 N)$ computations by using a static load balancing approach, in which all cores independently perform up to N binary searches in order to achieve a perfect workload balance. Examples are found in [12,13,18,19] on GPU and mainstream CPUs. High Performance Computing (HPC) applications, however, need to use Distributed Memory (DM) architectures to overcome the limitations of modern SM, namely a low memory capacity and Degree Of Parallelism (DOP).
On DM, parallelization is more complicated, as the cores cannot directly access the other cores' memory without exchanging messages. Three master–worker solutions for DM (along with mixed versions of them) are presented in [20]: Centralized Resampling (C-R), Resampling with Proportional Allocation (RPA) and Resampling with Non-proportional Allocation (RNA). C-R performs resampling globally, but here, a central unit gathers the particles from all cores, performs redistribution sequentially and scatters the result back to the cores, making this algorithm scale as $O(N)$. RPA also performs global resampling, but the network topology is randomized, as the redistribution is partly or potentially entirely (worst-case) centralized to one or a few cores, leading to a strongly data-dependent run-time and an $O(N)$ time complexity in the worst case. RNA has a simpler central unit and communication pattern than RPA, but sacrifices accuracy, as local resampling is performed. For the routing, RNA could use RPA or a ring topology, where the cores cyclically exchange a user-defined number of particles with their neighbors, though such an approach risks redundant communication. These master–worker approaches have been used, re-interpreted or mixed in recent work, such as in [17,21,22,23,24]. In [17,24,25], it is shown that such strategies may have accuracy or scalability issues, especially for a highly unbalanced workload, large N or DOP. In this paper, we then consider only fully-balanced solutions, which we define as follows.
Definition 1.
A fully-balanced redistribution meets the following requirements:
  • All cores perform the same pre-agreed tasks (i.e., no central unit(s) are involved) to balance the workload evenly;
  • The number of messages for the load balancing is data-independent in order to guarantee a stable run-time, as often required in real-time applications;
  • The redistribution of the particles is performed globally in order to ensure the same output as sequential redistribution and that no speed–accuracy trade-off is made when the DOP increases.
In [26], it has been shown that redistribution can be parallelized in a fully-balanced fashion on DM by using a divide-and-conquer approach that recursively sorts and splits the particles. Since Bitonic Sort [27,28,29] is performed $O(\log_2 N)$ times, the achieved time complexity is $O((\log_2 N)^3)$. In [17], sort is also employed recursively in a dynamic scheduler for RPA/RNA. In [30], the time complexity is reduced to $O((\log_2 N)^2)$ by showing that Bitonic Sort is only needed once. In [25], this new idea is ported from MapReduce to the Message Passing Interface (MPI) and then further optimized in [7]. However, the data movement in this fully-balanced redistribution is still the bottleneck, especially for high DOP.

1.3. Our Results

This paper proposes a novel fully-balanced approach for DM that achieves an $O(\log_2 N)$ time complexity and improves on the redistribution algorithms described in [7,25]. Our experimental results demonstrate that our novel redistribution is approximately eight times faster than the $O((\log_2 N)^2)$ ones on a cluster of 256 cores.
The rest of the paper is organized as follows: in Section 2, we briefly describe PFs, with a particular emphasis on resampling (and redistribution). In Section 3, we give brief information about DM and MPI. In Section 4, we describe our novel parallel redistribution in detail and include proof of its algorithmic complexity. In Section 5, we show the numerical results for redistribution first, and then for a PF example. In Section 6, we outline our conclusions and give recommendations for future work.

2. Sequential Importance Resampling

In this section, we briefly describe Sequential Importance Resampling (SIR), one of the most common variants of PFs [1].
PFs employ the Importance Sampling (IS) principle to estimate the true state $X_t \in \mathbb{R}^M$ of a system. IS uses an arbitrary proposal distribution $q(X_t \mid X_{t-1})$ to randomly generate $\mathbf{x}_t \in \mathbb{R}^{N \times M}$, a population of N statistically independent hypotheses for the state, called particles. Each particle $x_t^i$ is then weighted by an unnormalized importance weight $w_t^i \in \mathbb{R}$, $\forall i = 0, 1, \dots, N-1$, which is computed based on the resemblance between $x_t^i$ and $X_t$. In this way, $\mathbf{x}_t$ approximates the posterior of the true state.
To weight each particle correctly, at every time step t, we collect $Y_t \in \mathbb{R}^{M_y}$, a set of measurable quantities that are related to $X_t$ by some pdf. At the initial time $t = 0$, no measurement has been collected, so the particles are initially drawn from the initial distribution $q_0(X_0) = p_0(X_0)$ and weighted equally as $1/N$. Then, for any time step $t = 1, 2, \dots, T_{PF}$, measurements are collected, and each particle is drawn from the proposal distribution as follows:
$$x_t^i \sim q(x_t^i \mid x_{t-1}^i, Y_t) \qquad (1)$$
and weighted by
$$w_t^i = w_{t-1}^i \, \frac{p(x_t^i \mid x_{t-1}^i) \, p(Y_t \mid x_t^i)}{q(x_t^i \mid x_{t-1}^i, Y_t)} \qquad (2)$$
where $p(x_t^i \mid x_{t-1}^i)$ and $p(Y_t \mid x_t^i)$ are known from the model. The IS step performs (1) and (2) in sequence. Then, to effectively represent the pdf of the state, the weights are normalized as follows:
$$\tilde{w}_t^i = w_t^i \Big/ \sum_{z=0}^{N-1} w_t^z \qquad (3)$$
However, the side effect of using IS is degeneracy, a phenomenon that (within a few iterations) makes all weights but one decrease towards 0. The variance of the weights is indeed proven to increase at every t [1]. The most popular strategy for tackling degeneracy is to perform resampling, a task that regenerates the particle population by deleting the particles with low weights and duplicating those with high weights. In SIR, resampling is performed if the Effective Sample Size (ESS)
$$N_{eff} = 1 \Big/ \sum_{i=0}^{N-1} (\tilde{w}_t^i)^2 \qquad (4)$$
decreases below an arbitrary threshold, which is commonly set to $N/2$.
Several resampling schemes exist [31] and can be described as comprising three steps. The first step is to process the normalized weights $\tilde{\mathbf{w}}$ to generate $\mathbf{ncopies} \in \mathbb{Z}^N$, whose i-th element, $\mathrm{ncopies}_i$, defines how many times the i-th particle must be copied, such that
$$\sum_{i=0}^{N-1} \mathrm{ncopies}_i = N \qquad (5)$$
The second step is redistribution, which involves duplicating each particle $x_i$ as many times as $\mathrm{ncopies}_i$. A textbook redistribution is described by Algorithm 1, which we refer to as Sequential Redistribution (S-R). This algorithm achieves an $O(N)$ time complexity on a single core as (5) holds. A practical example for $N = 8$ particles is shown in Figure 1. In the final step of resampling, all weights are reset to $1/N$.
Algorithm 1 Sequential Redistribution (S-R)
Input: x , ncopies , N
Output:  x n e w
1: z ← 0
2: for j ← 0; j < N; j ← j + 1 do
3:     for k ← 0; k < ncopies_j; k ← k + 1 do
4:         x_new_z ← x_j, z ← z + 1
5:     end for
6: end for
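For concreteness, the listing below gives a minimal single-core C++ sketch of Algorithm 1; the container types and names are illustrative rather than taken from the paper's reference implementation.

// A minimal single-core sketch of Algorithm 1 (S-R); names and types are illustrative.
#include <vector>

template <typename Particle>
std::vector<Particle> sequential_redistribution(const std::vector<Particle>& x,
                                                const std::vector<unsigned int>& ncopies) {
    std::vector<Particle> x_new;
    x_new.reserve(x.size());                  // (5) guarantees the output also holds N particles
    for (std::size_t j = 0; j < x.size(); ++j)
        for (unsigned int k = 0; k < ncopies[j]; ++k)
            x_new.push_back(x[j]);            // duplicate x_j exactly ncopies_j times
    return x_new;
}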
We note that all resampling schemes require redistribution, independently of the chosen strategy to perform the first step. Since we focus on proposing a novel fully-balanced redistribution, to perform the first step, we focus on using systematic resampling [31], which is also known as Minimum Variance Resampling (MVR) [7,25,26,30]. The key idea of MVR is to first compute the cumulative density function of $\tilde{\mathbf{w}}_t$, $\mathbf{cdf} \in \mathbb{R}^{N+1}$, as follows:
$$\mathrm{cdf}_i = \sum_{z=0}^{i-1} \tilde{w}_z \qquad \forall i = 0, 1, 2, \dots, N \qquad (6)$$
a random variable $u \sim \mathrm{Uniform}[0, 1)$ is then sampled, such that each $\mathrm{ncopies}_i$ can then be calculated as follows:
$$\mathrm{ncopies}_i = \lceil N \cdot \mathrm{cdf}_{i+1} - u \rceil - \lceil N \cdot \mathrm{cdf}_i - u \rceil \qquad (7)$$
where the brackets represent the ceiling function (e.g., $\lceil 3.3 \rceil = 4$).
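As an illustration, the following single-core C++ sketch implements (6) and (7), assuming the scaling by N inside the ceiling that makes the output sum to N as required by (5); the variable names are ours, not the paper's.

// Hedged single-core sketch of MVR/systematic resampling, Equations (6) and (7).
#include <cmath>
#include <random>
#include <vector>

std::vector<unsigned int> minimum_variance_resampling(const std::vector<double>& wn,
                                                      std::mt19937& rng) {
    const std::size_t N = wn.size();                    // wn: normalized weights
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double u = uniform(rng);
    std::vector<unsigned int> ncopies(N);
    double cdf = 0.0;                                   // cdf_i, built incrementally
    for (std::size_t i = 0; i < N; ++i) {
        const double lower = std::ceil(N * cdf - u);    // ceil(N * cdf_i - u)
        cdf += wn[i];                                   // cdf_{i+1}
        const double upper = std::ceil(N * cdf - u);    // ceil(N * cdf_{i+1} - u)
        ncopies[i] = static_cast<unsigned int>(upper - lower);
    }
    return ncopies;                                     // elements sum to N by telescoping
}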
After resampling, a new estimate of $X_t$ is computed as
$$\Xi_t = E(X_t) = \sum_{i=0}^{N-1} w_t^i x_t^i \qquad (8)$$

3. Distributed Memory Architectures

DM is a type of parallel system that is inherently different to SM. In this environment, the memory is distributed over the cores and each core can only directly access its own private memory. The exchange of information stored in the memory of the other cores is achieved by sending/receiving explicit messages through a common communication network (see Figure 2).
DM provides several advantages over SM, such as: scalable and larger DOP; scalable and larger memory; memory contention (and other issues that stem from multiple cores accessing the same addresses in memory) not being a concern on DM. The main disadvantage of DM is the cost of communication and the consequent data movement. This can affect the speed-up relative to a single core implementation.
Any Application Programming Interface (API) for DM could be used. We choose MPI, since its intuitive syntax means it is arguably the most popular API for DM. MPI is also used in several referenced works on parallel redistribution for DM [7,17,22,23,25]. In MPI, the P cores are uniquely identified by a rank $p = 0, 1, \dots, P-1$, connected via communicators, and use explicit communication routines, e.g., MPI_Sendrecv, to exchange messages of arbitrary size.
Most algorithms that we implement in this paper use the divide-and-conquer paradigm. Therefore, we recommend using a power-of-two number of cores to balance the communication between them. For the same reason, the number of particles N should also be a power-of-two. In this case, the particles $\mathbf{x}$ and every array related to them, e.g., $\mathbf{w}$, $\mathbf{ncopies}$, etc., can then be evenly split between the cores. Every core then always owns exactly $n = N/P$ elements of all arrays, whose global indexes are spread over the cores in increasing order. This means that, given a certain N, P pair, the i-th particle (where $i \in [0, N-1]$) will always belong to the same core with rank $p = \lfloor i/n \rfloor$. More precisely, while all equations in this paper use global indexes for simplicity, each core p actually uses a local index $j \in [0, n-1]$ to address arrays and achieve parallelism, knowing that each j corresponds to global index $i = j + pn$. The space complexity is then $O(N/P)$ for any $P \leq N$.
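The index bookkeeping described above can be captured by a few helper functions; the snippet below is an illustrative C++ sketch (the struct and its names are not part of the paper).

// Illustrative helpers for the global/local index mapping: each of the P cores owns
// n = N/P consecutive elements, so global index i lives on rank p = floor(i/n) and
// corresponds to local index j = i - p*n (equivalently i = j + p*n).
struct IndexMap {
    unsigned int N, P, n;                                                 // n = N / P, powers of two
    unsigned int owner(unsigned int i) const { return i / n; }            // rank p owning global index i
    unsigned int local(unsigned int i) const { return i % n; }            // local index j of global index i
    unsigned int global(unsigned int j, unsigned int p) const { return j + p * n; }
};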

4. Novel O(log₂ N) Fully-Balanced Redistribution

In this section, we prove how it is possible to redistribute N particles in $O(\log_2 N)$ parallel time on DM by using a novel fully-balanced redistribution algorithm, which we name Rotational Nearly Sort and Split (RoSS) redistribution. We also provide details on how to implement RoSS on MPI. We refer any reader who is interested in the description of the resulting algorithms to Algorithms 2–5.
Algorithm 2 Rotational Nearly Sort
Input: x, ncopies, N, P, n = N/P, p
Output: x, ncopies
1: x, ncopies, zeros ← S-NS(x, ncopies, n) ▷ see Algorithm 3
2: shifts ← Exclusive_Cumulative_Sum(zeros)
3: if P < N then ▷ perform leaf stage of the binary tree
4:     partner ← (p − 1) & (P − 1) ▷ i.e., the neighbor
5:     if shifts & (n − 1) > 0 then
6:         for j ← 0; j < n; j ← j + 1 do
7:             if j < shifts & (n − 1) then
8:                 Send x_j, ncopies_j to partner, ncopies_j ← 0
9:             else
10:                 Shift particle to the left by shifts & (n − 1)
11:             end if
12:         end for
13:         shifts ← shifts − shifts & (n − 1)
14:         Send shifts to partner
15:     else
16:         Send arrays of 0s to partner (Message to reject)
17:     end if
18:     Accept or reject the received particles and shifts
19: end if
20: for k ← 1; k ≤ log₂ P; k ← k + 1 do ▷ binary tree
21:     partner ← (p − 2^(k−1)) & (P − 1)
22:     if shifts & (n · 2^(k−1)) > 0 then
23:         for j ← 0; j < n; j ← j + 1 do
24:             Send x_j, ncopies_j to partner, ncopies_j ← 0
25:         end for
26:         shifts ← shifts − shifts & (n · 2^(k−1))
27:         Send shifts to partner
28:     else
29:         Send arrays of 0s to partner (Message to reject)
30:     end if
31:     Accept or reject the received particles and shifts
32: end for
Algorithm 3 Sequential Nearly Sort (S-NS)
Input: x, ncopies, n
Output: x_new, ncopies_new, zeros
1: l ← 0, r ← n − 1
2: for j ← 0; j < n; j ← j + 1 do
3:     if ncopies_j > 0 then
4:         ncopies_new_l ← ncopies_j, x_new_l ← x_j, l ← l + 1
5:     else
6:         ncopies_new_r ← ncopies_j, x_new_r ← x_j, r ← r − 1
7:     end if
8: end for
9: zeros ← n − l
Algorithm 4 Rotational Split
Input: x, ncopies, N, P, n = N/P, p
Output: x, ncopies
1: csum ← Cumulative_Sum(N, P, ncopies)
2: min_shifts_j ← csum_j − ncopies_j − j − n·p, ∀ j < n if ncopies_j > 0 ▷ n·p makes j global
3: max_shifts_j ← csum_j − j − 1 − n·p, ∀ j < n if ncopies_j > 0
4: for k ← 1; k ≤ log₂ P; k ← k + 1 do ▷ binary tree
5:     partner ← (p + P/2^k) & (P − 1)
6:     for j ← 0; j < n; j ← j + 1 do
7:         if max_shifts_j & (N/2^k) > 0 then
8:             if min_shifts_j & (N/2^k) > 0 then
9:                 copies_to_send_j ← ncopies_j, ncopies_j ← 0
10:             else
11:                 copies_to_send_j ← csum_j − (j + N/2^k + n·p)
12:                 ncopies_j ← ncopies_j − copies_to_send_j
13:             end if
14:             starter ← csum_j − copies_to_send_j, if x_j is the first particle to send
15:             Send x_j, copies_to_send_j to partner, and send starter too if x_j is the first particle to send
16:         else
17:             Send 0s to partner (Message to reject)
18:         end if
19:     end for
20:     Accept or reject the received particles and starter, reset starter to 0 if all particles are sent and none is accepted
21:     csum_0 ← starter + ncopies_0, csum_j ← csum_(j−1) + ncopies_j ∀ j = 1, 2, …, n − 1
22:     Update min_shifts and max_shifts as in steps 2 and 3
23: end for
24: if P < N then ▷ perform leaf stage of the binary tree
25:     for j ← n − 1; j ≥ 0; j ← j − 1 do
26:         if csum_j > (p + 1)·n then
27:             copies_to_send_j ← min(csum_j − (p + 1)·n, ncopies_j)
28:             ncopies_j ← ncopies_j − copies_to_send_j
29:             Send x_j, copies_to_send_j to partner
30:         else
31:             Send 0s to partner (Message to reject)
32:         end if
33:         if min_shifts_j > 0 then
34:             Shift particle to the right by min_shifts_j
35:         end if
36:     end for
37:     Accept or reject the received particles
38: end if
Algorithm 5 Rotational Nearly Sort and Split (RoSS) Redistribution
Input: x, ncopies, N, P, n = N/P, p
Output: x
1: if P > 1 then
2:     x, ncopies ← Rotational_Nearly_Sort(x, ncopies, N, P, n, p) ▷ (9) now holds
3:     x, ncopies ← Rotational_Split(x, ncopies, N, P, n, p) ▷ (11) now holds
4: end if
5: x ← S-R(x, ncopies, n)

4.1. General Overview

The algorithm consists of two phases. In the first phase, we want the elements in ncopies to be nearly sorted, a property which is defined as follows.
Definition 2.
A sequence of N non-negative integers, ncopies , is nearly sorted in descending order when it has the following shape:
$$\mathbf{ncopies} = [\lambda_0, \lambda_1, \dots, \lambda_{m-1}, 0, \dots, 0] \qquad (9)$$
where $\mathrm{ncopies}_i > 0$ (marked with λs in (9)) $\forall i = 0, 1, \dots, m-1$ and $0 \leq m \leq N$. On the other hand, ncopies is an ascending nearly sorted sequence if the last m elements are positive and the first are 0.
In this paper, ncopies is nearly sorted in descending order. We also note that, as the elements in ncopies are progressively shifted to achieve (9), the related particles in x are consequently also shifted. The main purpose of this phase is to separate all particles that must be duplicated (i.e., those for which $\mathrm{ncopies}_i > 0$) from those that must be deleted. Here, we prove that (9) can be achieved by using Rotational Nearly Sort, an $O(\log_2 N)$ alternative to the $O((\log_2 N)^2)$ Nearly Sort step in [7] (as also described in Appendix A). We note that (9) can also be achieved with sort, as carried out in [25] by using Bitonic Sort. In theory, one could also employ $O(\log_2 N)$ sorting networks [32,33]. However, the constant time of these networks is notoriously prohibitive in practice, even with some of the most recent advances [34]. Our constant time is significantly smaller; because of that, we do not consider AKS-like sorting networks as a practical approach for (9).
In the second phase, we want to achieve two goals: the first is to make room on the right of each particle that has to be copied; the second is for the P cores to have the same number of particles in their private memory. The first goal easily translates to shifting the particles to the right until ncopies has the following new shape:
$$\mathbf{ncopies} = [\lambda_0, 0, \dots, 0, \lambda_1, 0, \dots, 0, \dots, \lambda_{m-1}, 0, \dots, 0] \qquad (10)$$
where, for each $\mathrm{ncopies}_i > 0$ (again marked with λs in (10)), $\mathrm{ncopies}_i - 1$ zeros follow. The second can be expressed as follows:
$$\sum_{i = pN/P}^{(p+1)N/P - 1} \mathrm{ncopies}_i = \frac{N}{P} \qquad \forall p = 0, \dots, P-1 \qquad (11)$$
which is essentially (5) applied locally. In the next section, we prove it is possible to achieve both (10) and (11) in $O(\log_2 N)$ parallel time by using a single algorithm, which we refer to as Rotational Split. After that, each core completes the task by using S-R independently to redistribute the particles within its private memory.

4.2. Algorithmic Details and Theorems

In this section, we give details for a novel implementation of parallel redistribution and prove that it scales as O ( log 2 N ) . The reader is referred to Figure 3, which illustrates an example for N = 8 and P = 4 .

4.2.1. Rotational Nearly Sort

Theorem 1.
Given an array of N particles, x, and their copies to be created, ncopies, whose elements are evenly distributed across the P cores of a DM, Algorithm 2 (performed by each core p, $p = 0, 1, \dots, P-1$) describes the steps to safely shift the elements in x and ncopies to achieve property (9), and performs that in $O(\log_2 N)$ parallel time for $P = N$.
Proof of Theorem 1.
The first process to carry out is to nearly sort the particles locally by calling Sequential Nearly Sort (S-NS) (see Algorithm 3), which iteratively moves the i-th particle to the left/right side of the output array if $\mathrm{ncopies}_i$ is positive/zero. We note that this routine only takes $O(N/P)$ iterations, since every DM core owns $N/P$ particles. This will start moving the particles locally to the left before doing so globally.
The particles within the core p must now shift to the left by as many positions as the number of zero elements in ncopies owned by the cores with a lower rank. Let $\mathbf{zeros} \in \mathbb{Z}^P$ be the array that counts the number of $\mathrm{ncopies}_i = 0$ within each core; each element of $\mathbf{shifts} \in \mathbb{Z}^P$ (the array to keep track of the remaining shifts) can be initialized as follows:
$$\mathrm{shifts}_p = \sum_{\hat{p}=0}^{p-1} \mathrm{zeros}_{\hat{p}} \qquad (12)$$
Equation (12) can be parallelized by using a parallel exclusive cumulative sum (after each core p has initialized $\mathrm{zeros}_p$ to the sum of zeros within its memory, at the end of S-NS). It is well covered in the literature that the cumulative sum (also known as prefix sum or scan) takes $O(N/P + \log_2 P)$ computations [35,36].
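As a hedged illustration, the exclusive cumulative sum that initializes shifts_p can be obtained with a single collective call; the helper below is a C++/MPI sketch with assumed variable names.

// Hedged sketch: computing shifts_p from the local zero count with MPI_Exscan.
// 'zeros_local' is assumed to have been set at the end of S-NS (Algorithm 3).
#include <mpi.h>

unsigned int compute_shifts(unsigned int zeros_local, MPI_Comm comm) {
    unsigned int shifts = 0;   // exclusive prefix sum of zeros over the ranks, i.e., Equation (12)
    MPI_Exscan(&zeros_local, &shifts, 1, MPI_UNSIGNED, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) shifts = 0; // MPI_Exscan leaves the receive buffer undefined on rank 0
    return shifts;
}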
We now want to express $\mathrm{shifts}_p$ in binary notation and shift the particles by increasing power-of-two numbers of positions, depending on the bits of $\mathrm{shifts}_p$, from the Least Significant Bit (LSB) to the Most Significant Bit (MSB). This translates to using a bottom-up binary tree structure that will complete the task in $O(\log_2 N)$ time.
If $P < N$, then we first need to perform an extra leaf stage, in which the cores send all of their particles to their neighbor if the bitwise AND of $\mathrm{shifts}_p$ and $N/P - 1$ is positive. In simple terms, the leaf stage masks the $\log_2(N/P)$ LSBs of $\mathrm{shifts}_p$ and performs the rotations referring to those bits in one operation per particle. Hence, during this step, each core sends and receives up to $N/P$ particles, meaning that the leaf stage takes $O(N/P)$.
After the leaf stage, the actual tree-structure routine can start. At every k-th stage of the tree (for $k = 1, 2, \dots, \log_2 P$), any core p will send to its partner $p - 2^{k-1}$ all its particles (i.e., x and ncopies) and $\mathrm{shifts}_p - (N/P)2^{k-1}$ (i.e., the number of remaining shifts after the current rotation) if the bitwise AND of $\mathrm{shifts}_p$ and $(N/P)2^{k-1}$ is positive; this corresponds to checking a new bit of $\mathrm{shifts}_p$, precisely the one which is significant at this stage. We note that, in order to balance the communication at all stages, the cores are set to send particles to reject (i.e., having $\mathrm{ncopies}_i = 0$) if the LSBs are 0. Hence, this phase takes $O((N/P)\log_2 P)$ because each core sends and receives $N/P$ particles $\log_2 P$ times.
Before the leaf stage, the particles in the memory of core p must shift at most from one end to the other end, i.e., by $\mathrm{shifts}_p \leq N - 1$ positions. Any non-negative integer $\mathrm{shifts}_p \leq N - 1$ can always be represented in base-2 by $\log_2 N$ bits. Therefore, each particle will be shifted $O(\log_2 N)$ times, each time by an increasing power-of-two number of positions, and, since shifts gets updated in $O(1)$, Algorithm 2 achieves an $O(\log_2 N)$ time complexity. Furthermore, as long as no particle to be copied collides with or gets past another one, Algorithm 2 always achieves (9), since the shifts to perform will progressively decrease, whereas (12) will always hold. Lemma 1 and Corollary 1 prove that collisions and overtaking (defined in the following definitions) can never occur.    □
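The following C++ fragment sketches the per-stage decision taken by each core during the binary-tree phase of Algorithm 2 (the actual message exchange is omitted); all names are illustrative and assumed rather than taken from the paper's code.

// Hedged sketch of the stage-k decision in Algorithm 2; P and n = N/P are powers of two.
// 'shifts' is this core's number of remaining left-rotations. Only the bit worth
// n * 2^(k-1) positions decides whether the whole local block is sent 2^(k-1) cores
// to the left; otherwise reject-data is sent so every core exchanges the same amount.
void nearly_sort_stage_decision(int k, int p, int P, unsigned long n,
                                unsigned long& shifts, int& partner, bool& send_real) {
    partner = (p - (1 << (k - 1))) & (P - 1);   // cyclic neighbour 2^(k-1) ranks to the left
    const unsigned long bit = n << (k - 1);     // bit worth n * 2^(k-1) positions
    send_real = (shifts & bit) != 0;            // send the real particles iff this bit is set
    if (send_real)
        shifts -= bit;                          // these rotations are accounted for by the send
}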
Definition 3
(Collisions). Let $x_i$ be a particle having $\mathrm{ncopies}_i \geq 1$, and $x_j$ be a particle having $\mathrm{ncopies}_j \geq 1$, with $j = i + dist$, where $0 < dist < N - 1$. A collision would occur if dist is a power-of-two number, $x_j$ is rotating to the left by dist and $x_i$ stays where it is. More formally, a collision occurs if the total number of rotations that $x_j$ must perform has significant bit (i.e., the bit corresponding to dist rotations) equal to 1, while the same bit of the number of rotations that $x_i$ must perform is 0. The same definition applies to collisions when rotations are performed to the right, if $x_i$ is rotating to the right (and, hence, has significant bit equal to 1) and $x_j$ is not rotating (and, hence, has significant bit equal to 0).
Definition 4
(Overtaking). Let $x_i$ again be a particle having $\mathrm{ncopies}_i \geq 1$, and $x_j$ again be a particle having $\mathrm{ncopies}_j \geq 1$, with $j = i + dist$, where $0 < dist < N - 1$. The particle $x_j$ can overtake $x_i$ if it is rotating to the left by a power-of-two number greater than dist while $x_i$ stays where it is. The same problem occurs when rotations are performed to the right, if $x_i$ is rotating to the right by a power-of-two number greater than dist while $x_j$ stays where it is.
Lemma 1.
During the k-th iteration of Rotational Nearly Sort, $k = 1, 2, \dots, \log_2 P$, a particle $x_j$, having $\mathrm{ncopies}_j \geq 1$ and rotating to the left by $(N/P)2^{k-1}$ positions, can never collide with a particle $x_i$, having $\mathrm{ncopies}_i \geq 1$ and $j = i + (N/P)2^{k-1}$.
Proof of Lemma 1.
At the k-th iteration, particle $x_i$ has $\mathrm{shifts}_i$ remaining rotations to the left, whereas $x_j$ has $\mathrm{shifts}_j$. Therefore, the necessary and sufficient condition for collisions, defined in Definition 3, can be restated for Rotational Nearly Sort by checking whether the following logical condition
$$\mathrm{shifts}_i \;\&\; \tfrac{N}{P}2^{k-1} = 0 \;\wedge\; \mathrm{shifts}_j \;\&\; \tfrac{N}{P}2^{k-1} > 0$$
is true, which corresponds to checking whether the significant bit of $\mathrm{shifts}_i$ is 0 and the significant bit of $\mathrm{shifts}_j$ is 1. This condition can also be rearranged as follows:
$$\mathrm{shifts}_i \;\&\; \tfrac{N}{P}2^{k-1} = 0 \;\wedge\; (\mathrm{shifts}_i + \mathrm{zeros}_{i+1:j-1}) \;\&\; \tfrac{N}{P}2^{k-1} > 0 \qquad (14)$$
where $\mathrm{zeros}_{i+1:j-1}$ is the number of 0s in ncopies between positions i and j excluded.
Since Rotational Nearly Sort performs rotations using an LSB-to-MSB strategy, it is easy to infer that the bits to the right of the significant one at this iteration (i.e., bit $k-1$) are all 0. This means that, if the significant bit of $\mathrm{shifts}_i$ is 0, the only condition that would make (14) true would be $\mathrm{zeros}_{i+1:j-1} = \frac{N}{P}2^{k-1}$. That is, however, impossible, because, in this case, there are only $j - i - 1 = \frac{N}{P}2^{k-1} - 1$ memory slots between i and j, which means that:
$$\mathrm{zeros}_{i+1:j-1} \leq j - i - 1 = \tfrac{N}{P}2^{k-1} - 1 < \tfrac{N}{P}2^{k-1}$$
   □
Corollary 1.
During the k-th iteration of Rotational Nearly Sort, $k = 1, 2, \dots, \log_2 P$, a particle $x_j$, having $\mathrm{ncopies}_j \geq 1$ and rotating to the left by $(N/P)2^{k-1}$ positions, can never overtake a particle $x_i$, having $\mathrm{ncopies}_i \geq 1$ and $i < j < i + (N/P)2^{k-1}$.
Proof 
(Proof of Corollary 1). Lemma 1 can automatically prove Corollary 1 because, once again, (14) is true only if $\mathrm{zeros}_{i+1:j-1} = \frac{N}{P}2^{k-1}$. However, in this case, $j < i + \frac{N}{P}2^{k-1}$, and hence:
$$\mathrm{zeros}_{i+1:j-1} \leq j - i - 1 < j - i < \tfrac{N}{P}2^{k-1}$$
   □

4.2.2. Rotational Split

Theorem 2.
Given an array of N particles, x, and their copies to be created, ncopies, whose elements are nearly sorted according to (9), and evenly distributed across the P cores of a DM, Algorithm 4 (performed by each core p, $p = 0, 1, \dots, P-1$) describes the steps to shift and/or safely split the elements in x and ncopies to achieve properties (10) and (11), and performs that in $O(\log_2 N)$ parallel time for $P = N$.
Proof of Theorem 2.
Let $\mathbf{csum} \in \mathbb{Z}^N$ be the inclusive cumulative sum of ncopies. As mentioned in the previous section, the cumulative sum achieves an $O(N/P + \log_2 P)$ time complexity for any $P \leq N$.
To achieve (10), we want to move the particles to the right to have $\mathrm{ncopies}_i - 1$ gaps after each index i, such that $\mathrm{ncopies}_i > 0$. It can be inferred that the minimum required number of shifts to the right that the i-th particle must perform is:
$$\mathrm{min\_shifts}_i = \sum_{z=0}^{i-1} (\mathrm{ncopies}_z - 1) = \mathrm{csum}_i - \mathrm{ncopies}_i - i \qquad (17)$$
However, (10) alone does not guarantee (11). This is because, for each particle $x_i$ that must be copied more than once, we are rotating all of its copies by the same minimum number of positions, and we are not considering that some of these copies could be split and rotated further to also fill in the gaps that a strategy for (10) alone would create. Therefore, we also consider the maximum number of rotations that any copy of $x_i$ has to perform, without causing collisions. Since (10) aims to create $\mathrm{ncopies}_i - 1$ gaps for each i, such that $\mathrm{ncopies}_i > 0$, that number is:
$$\mathrm{max\_shifts}_i = \mathrm{min\_shifts}_i + \mathrm{ncopies}_i - 1 = \mathrm{csum}_i - i - 1 \qquad (18)$$
Therefore, the key idea is to consider $\mathrm{min\_shifts}_i$ and $\mathrm{max\_shifts}_i$ in binary notation and use their bits to rotate the particles accordingly, from the MSB to the LSB. This corresponds to using a top-down binary tree structure of $\log_2 P$ stages. At each stage k of the tree (for $k = 1, 2, \dots, \log_2 P$), any core with rank p sends particles $N/2^k$ positions ahead to its partner with rank $p + P/2^k$. At each k-th stage, we check the MSB of both $\mathrm{min\_shifts}_i$ and $\mathrm{max\_shifts}_i$ to infer $\mathrm{copies\_to\_send}_i$, the number of copies of $x_i$ that must rotate by $N/2^k$ positions. For each $x_i$, three possible scenarios may occur:
  • None of its copies must move;
  • All of them must rotate;
  • Some must split and shift, and the others must not move.
Trivially, the copies will not move if the MSB of $\mathrm{max\_shifts}_i$ is 0, which also implies that the MSB of $\mathrm{min\_shifts}_i$ is 0, since $\mathrm{min\_shifts}_i \leq \mathrm{max\_shifts}_i$ at all stages. If both MSBs are 1, we send $\mathrm{copies\_to\_send}_i = \mathrm{ncopies}_i$ copies of $x_i$. However, if only the MSB of $\mathrm{max\_shifts}_i$ is equal to 1, $\mathrm{copies\_to\_send}_i < \mathrm{ncopies}_i$. The number of copies to split is equal to how many of them must be placed from position $i + N/2^k$ onwards to achieve a perfect workload balance. This is equivalent to computing how many copies make
$$\mathrm{csum}_i > i + \frac{N}{2^k} \qquad (19)$$
Therefore, if only the MSB of $\mathrm{max\_shifts}_i$ is equal to 1, we send
$$\mathrm{copies\_to\_send}_i = \mathrm{csum}_i - i - \frac{N}{2^k} \qquad (20)$$
copies of $x_i$ and we keep the remaining ones where they were. Here, as in Section 4.2.1, particles to be rejected are sent if the MSB of $\mathrm{max\_shifts}_i$ is 0 in order to keep all cores equally busy. This phase takes an $O((N/P)\log_2 P)$ time complexity, because each core sends or receives $N/P$ particles at every stage. However, to ensure a logarithmic time complexity, one needs to update csum, min_shifts and max_shifts in $O(N/P)$. This can be achieved if the cores send
$$starter = \mathrm{csum}_s - \mathrm{copies\_to\_send}_s$$
where s is the index of the first particle to send, having $\mathrm{ncopies}_s > 0$. As long as no particle overwrites or moves past another one (which is proven to be impossible by Lemma 2), each core p can safely see the received starter as
$$starter = \sum_{i=0}^{pN/P - 1} \mathrm{ncopies}_i = \mathrm{csum}_{pN/P - 1}$$
and use it to re-initialize csum and update it in $O(N/P)$ as $\mathrm{csum}_i = \mathrm{csum}_{i-1} + \mathrm{ncopies}_i$, $\forall i = pN/P + 1, \dots, (p+1)N/P - 1$. This strategy guarantees csum is always correct for at least any index i, such that $\mathrm{ncopies}_i > 0$, i.e., those indexes we require to be correct. Once csum is updated, Equations (17) and (18) are embarrassingly parallel.
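To make the three scenarios above concrete, the C++ helper below sketches how many copies of a particle are sent at stage k of Rotational Split; csum_j, min_shifts_j and max_shifts_j refer to global index i = j + np, and all names are illustrative assumptions rather than the paper's code.

// Hedged sketch of the per-particle split decision in Algorithm 4 at stage k,
// where rot = N / 2^k is the rotation amount at this stage.
unsigned long copies_to_send(unsigned long ncopies_j, unsigned long csum_j,
                             unsigned long i, unsigned long rot,
                             unsigned long min_shifts_j, unsigned long max_shifts_j) {
    if ((max_shifts_j & rot) == 0) return 0;          // scenario 1: no copy of x_j moves
    if ((min_shifts_j & rot) != 0) return ncopies_j;  // scenario 2: every copy rotates by rot
    return csum_j - i - rot;                          // scenario 3: split; send only the excess copies
}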
If $P < N$, after $\log_2 P$ stages, the cores perform a leaf stage. Here, for each particle, we perform inter-core shifts or splits and shifts only if there is any copy to be sent to the neighbor; this is equivalent to checking if
$$\mathrm{csum}_i > (p+1)\frac{N}{P} \qquad (21)$$
where $(p+1)N/P$ is the first global index in the neighbor's memory. Arrays of 0s are again sent when no inter-core shifts are needed. We also consider internal shifts only if $\mathrm{min\_shifts}_i > 0$, in order to both make room to receive particles from the neighbor, $p - 1$, and guarantee (10). This leaf stage takes $O(N/P)$ because $N/P$ particles are sent or received.
It is easy to infer that $\mathrm{min\_shifts}_i < N - 1$ and $\mathrm{max\_shifts}_i \leq N - 1$, as a particle copy could at most be shifted, or split and shifted, from the first to the last position. Both $\mathrm{min\_shifts}_i$ and $\mathrm{max\_shifts}_i$ can be represented in base-2 by $\log_2 N$ bits. Therefore, as also anticipated above, only up to $\log_2 N$ messages are required, which means that the achieved time complexity of Algorithm 4 is $O(\log_2 N)$. Furthermore, the particles are progressively split according to the MSBs of $\mathrm{min\_shifts}_i$ and $\mathrm{max\_shifts}_i$, until they are split according to (21). Hence, after the final leaf stage, the first and last element of csum in the memory of each core will necessarily meet the following two requirements:
$$\mathrm{csum}_{pN/P} - \mathrm{ncopies}_{pN/P} = p\frac{N}{P} \qquad \wedge \qquad \mathrm{csum}_{(p+1)N/P - 1} = (p+1)\frac{N}{P}$$
which, if subtracted, automatically proves (11).    □
Lemma 2.
Given a nearly sorted input ncopies, at the k-th iteration of Rotational Split, $k = 1, 2, \dots, \log_2 P$, a particle $x_i$, having $\mathrm{ncopies}_i \geq 1$ and rotating to the right by $N/2^k$ positions, can never collide with or get past a particle $x_j$, having $\mathrm{ncopies}_j \geq 1$ and $j = i + dist$ with $1 \leq dist \leq N/2^k$.
Proof of Lemma 2.
This lemma can be proved in two possible complementary cases:
1. There are one or more zeros between i and j;
2. There are no zeros between i and j.
Case 1. Since the particles are initially nearly sorted, at the beginning, there are no zeros in between any pair of particles in position i and j. At the k-th iteration, if one or more zeros are found between $x_i$ and $x_j$, it necessarily means that $dist > \mathrm{max\_shifts}_i \geq N/2^k$. This is because of two reasons. First, zeros between two particles $x_i$ and $x_j$ can only be created if the MSB of $\mathrm{min\_shifts}_i$ is 0 and the MSB of $\mathrm{max\_shifts}_j$ is 1. Second, for any binary number, its MSB, if equal to 1, is a greater number than the one represented by any disposition of all remaining LSBs (e.g., $(1000)_2 = 8 > (0111)_2 = 7$). Hence, if there are any zeros between $x_i$ and $x_j$, it is because, during at least one of the previous stages, $x_j$ rotated by an MSB and $x_i$ did not, such that $x_j$ is now beyond reach of possible collisions with $x_i$.
Case 2. In this case, all particles between i and j are still nearly sorted. Therefore, $x_i$ would collide with (when $dist = N/2^k$) or get past $x_j$ (when $dist < N/2^k$) if
$$\mathrm{max\_shifts}_i \;\&\; \frac{N}{2^k} > 0 \;\wedge\; \mathrm{min\_shifts}_j \;\&\; \frac{N}{2^k} = 0 \qquad (22)$$
is true, which corresponds to checking whether the MSB of $\mathrm{max\_shifts}_i$ is 1 (which also includes those cases where the MSB of $\mathrm{min\_shifts}_i$ is 1) and the MSB of $\mathrm{min\_shifts}_j$ is 0. In other words, (22) can be simplified to checking if $\mathrm{max\_shifts}_i > \mathrm{min\_shifts}_j$. However, for a pair of particles $x_i$ and $x_j$ within a nearly sorted group of particles, this is impossible, because:
$$\mathrm{max\_shifts}_i = \mathrm{csum}_i - i - 1 = \mathrm{csum}_j - \sum_{z=i+1}^{j} \mathrm{ncopies}_z - j + dist - 1 = \mathrm{min\_shifts}_j - \sum_{z=i+1}^{j-1} \mathrm{ncopies}_z + dist - 1 \leq \mathrm{min\_shifts}_j$$
since $\mathrm{csum}_j = \sum_{z=0}^{j} \mathrm{ncopies}_z$, $dist = j - i$ and (in this case) $\sum_{z=i+1}^{j-1} \mathrm{ncopies}_z \geq j - i - 1$.    □

4.2.3. Rotational Nearly Sort and Split Redistribution

Theorem 3.
Given an array of N particles, x, and their copies to be created, ncopies, whose elements are evenly distributed across the P cores of a DM, Algorithm 5 (performed by each core p, $p = 0, 1, \dots, P-1$) redistributes all particles in x for which $\mathrm{ncopies}_i > 0$, and performs that in $O(\log_2 N)$ parallel time for $P = N$.
Proof of Theorem 3.
After the cores perform Rotational Nearly Sort first, and then Rotational Split, Equation (11) holds, as proven by Theorem 2. Therefore, the cores can independently use S-R to redistribute their $N/P$ particle copies in their private memory in $O(N/P)$ iterations. As proven by Theorems 1 and 2, Algorithms 2 and 4 complete their task in $O(\log_2 N)$ parallel time, which means that the time complexity achieved by Algorithm 5 is then $O(N)$ for $P = 1$, $O(\log_2 N)$ for $P = N$ cores and, for any $1 \leq P \leq N$, is:
$$O\!\left(\frac{N}{P} + \frac{N}{P}\log_2 P\right) \qquad (23)$$
The first term in (23) represents S-R, which is always performed, and all of the steps that are only ever called once for any $P > 1$ (e.g., S-NS). The second term in (23) describes the $\log_2 P$ stages of Algorithms 2 and 4, during which we update, send and receive up to $N/P$ particles.    □
With the results of Theorem 3 in view, we conclude that the PF tasks now take either $O(N/P)$, $O(N/P + \log_2 P)$ or $O(N/P + (N/P)\log_2 P)$ time. This is because (3), (4) and (8) require reduction (which notoriously scales as $O(N/P + \log_2 P)$ [36]), the IS step is embarrassingly parallel and (in the resampling algorithm) Equation (6) requires an exclusive cumulative sum; in addition, Equation (7) and resetting the weights to $1/N$ are also embarrassingly parallel tasks. Table 1 summarizes this conclusion. This means that, even if we had an embarrassingly parallel fully-balanced redistribution, the time complexity of PFs would still be no less than $O(N/P + \log_2 P)$, since reduction and the cumulative sum are required elsewhere.

4.3. Implementation on MPI

In this section, we give brief information concerning which MPI routines are needed to implement RoSS and the rest of SIR on MPI.
The exclusive cumulative sum that is required before the leaf stage in Algorithm 2 is parallelized on MPI by calling MPI_Exscan [37]. On the other hand, MPI_Scan is used to parallelize the inclusive cumulative sum of ncopies at the start of Algorithm 4.
During the binary tree and the leaf stages in Algorithms 2 and 4, the cores send and receive N P particles each time. On MPI, MPI_Sendrecv is ideal for these messages, since it requires the ranks to send to and receive from. Temporary arrays should be used on both communication ends to ensure data coherency before accepting or rejecting the received content.
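A hedged sketch of one such exchange is given below: each core sends its n = N/P particles (stored as doubles for M = 1) together with the matching ncopies to its partner and receives the same amount into temporary buffers, which are accepted or rejected afterwards; buffer and function names are ours, not those of the paper's implementation.

// Hedged sketch of one inter-core exchange in the tree/leaf stages of Algorithms 2 and 4.
#include <mpi.h>
#include <vector>

void exchange_block(const std::vector<double>& x_send,
                    const std::vector<unsigned int>& ncopies_send,
                    std::vector<double>& x_recv,
                    std::vector<unsigned int>& ncopies_recv,
                    int partner, MPI_Comm comm) {
    const int n = static_cast<int>(x_send.size());
    x_recv.resize(n);
    ncopies_recv.resize(n);
    // A combined send/receive avoids deadlock when both partners exchange simultaneously.
    MPI_Sendrecv(x_send.data(), n, MPI_DOUBLE, partner, 0,
                 x_recv.data(), n, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(ncopies_send.data(), n, MPI_UNSIGNED, partner, 1,
                 ncopies_recv.data(), n, MPI_UNSIGNED, partner, 1, comm, MPI_STATUS_IGNORE);
}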
All of the other operations or algorithms within RoSS, such as (17) or S-NS, are performed locally by each core and, therefore, do not need to call MPI routines.
For completeness, we point out that, in the other PF tasks, the reduction operation in (3), (4) and (8) is parallelized on MPI by calling MPI_Allreduce, whereas (6) needs MPI_Exscan.

5. Experimental Results

In this section, we show the numerical results for redistribution first and then show results for a PF example. In the experiment for redistribution, we compare RoSS, the novel fully-balanced algorithm presented in this paper, with N-R and B-R, two fully-balanced redistributions that take $O((\log_2 N)^2)$ steps (see Appendix A). These algorithms are compared by passing input arrays with the same values to all algorithms: ncopies and x with $M = 1$. To guarantee (5), ncopies is generated randomly by using MVR (see Equation (7)) with a log-normally distributed input $\tilde{\mathbf{w}}$. For the PF experiment, we consider three versions of a SIR PF that only differ in terms of the constituent redistribution used. Each PF test is run for $T_{PF} = 100$ iterations. Resampling is computed every iteration to ensure that the frequency of redistribution is the same. The model we consider is the stochastic volatility example described in Appendix B.
All experiments are conducted for $N \in \{2^{16}, 2^{20}, 2^{24}\}$ particles and for up to $P = 256$ cores. Each reported run-time is the median of 20 runs collected for each N, P pair.
All algorithms in this paper are coded in C++ with OpenMPI-1.10.7 and compiled with the -O3 optimization flag; double and unsigned int data types are, respectively, used for real and integer numbers. Combinations of MPI_Barrier and MPI_Wtime are used to collect the run-times for Figure 4 and Figure 5, while TAU Performance System 2.7 is used for the profiling in Figure 6. The cluster for the experiments consists of eight machines, interconnected with InfiniBand 100 Gbps, each mounting two Xeon Gold 6138 CPUs that provide 40 cores running at 2 GHz. We note that the results for up to $P = 32$ DM cores have been collected by requesting (at scheduling time) cores from the same node. For $P \geq 64$, we have requested 32 DM cores from each node. While we cannot identify any resulting artifacts in the results, we acknowledge that this feature of the hardware does potentially influence our results when comparing run-times related to $P \leq 32$ with those for $P \geq 64$. We emphasise that, to ensure the experimental results allow for a meaningful comparison between algorithms, all algorithms were assessed on the same hardware for each P. Future work would sensibly investigate these performance differences as part of a broader study on how to maximise performance as a function of hardware configuration.

5.1. RoSS vs. B-R and N-R

In Figure 4, we can see that the gap between the proposed approach and the other algorithms increases with P, particularly with large N. For $P \leq 4$ and all three of the values for N that are considered, RoSS is comparable with N-R and roughly four times as fast as B-R. However, for $P = 256$, RoSS is up to eight times faster than both N-R (which is slightly faster than B-R, as shown in [7]) and B-R. We also note that these fully-balanced approaches may also stop scaling when the computation shrinks and the communication becomes too intensive: e.g., here, RoSS stops scaling at $N = 2^{16}$ and $P = 64$.

5.2. Stochastic Volatility

Figure 5 shows that the PF using RoSS is four to six times faster than a PF using N-R/B-R. The maximum speed-up for a PF with RoSS is approximately 125 for $P = 256$ cores. Furthermore, the gap between a PF with RoSS and a PF with any other considered redistribution also increases with P, in line with the results in Section 5.1. Figure 6 describes profiling information for the three PFs with $N = 2^{24}$. In Figure 6a, it is interesting to see that RoSS is the only redistribution variant that runs faster than IS for any $P \leq 256$, while B-R and N-R emerge as the bottleneck for a relatively low P, i.e., for $2 \leq P \leq 16$. This is mostly due to the larger volume of communication in B-R/N-R, as shown in Figure 6b. In Figure 6b, we have considered the average time spent on all MPI calls per core as the communication time. Since $N \gg P$ in this experiment, the time spent on the MPI_Sendrecv calls in the redistribution step is dominant over the time spent on all the other MPI routines, such as MPI_Allreduce or MPI_Exscan, which only consist of $\log_2 P$ single-scalar messages and up to $2\log_2 P$ arithmetical operations (in this case, sums). Therefore, we can state that the reported average communication is equivalent (with a good degree of confidence) to the percentage of time that the cores are not used for any computation.

6. Conclusions

In this paper, we present RoSS, a novel fully-balanced redistribution for global resampling in SIR PFs on distributed memory environments. The algorithm has been implemented on MPI. The baselines for comparison are B-R and N-R, two similar state-of-the-art fully-balanced redistribution algorithms that both achieve an $O((\log_2 N)^2)$ time complexity and whose implementations are already available on MPI. We prove in this paper that RoSS redistribution achieves an $O(\log_2 N)$ time complexity.
The empirical results show that RoSS redistribution is almost an order of magnitude faster than B-R and N-R for up to $P = 256$ cores. Similar results are observed in the context of an exemplar PF. For the same level of parallelism, a PF using RoSS is up to six times faster than a PF using B-R/N-R and provides a maximum speed-up of 125. We also note that, under the same testing conditions, RoSS is the only option for redistribution that is still faster than IS for such a large P.
Future work should focus on reducing the number of messages between the cores. One way to achieve this is to combine SM with DM: we note that an $O(\log_2 N)$ redistribution for SM already exists [19], and that mixing OpenMP, one of the most common programming models for SM architectures, with MPI is a routine practice in the HPC domain [38]. A second avenue for potential improvement consists of using recent versions of OpenMP that support GPU offload or using CUDA (in place of OpenMP) in order to take advantage of the extra speed-up that GPU cards can offer relative to a CPU.

Author Contributions

Conceptualization, A.V. and S.M.; formal analysis, A.V., P.G.S. and S.M.; software, investigation, validation, visualization and writing original draft, A.V.; review and editing, A.V., P.G.S. and S.M.; supervision, P.G.S. and S.M.; project administration and funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a UK EPSRC Doctoral Training Award (1818639), Schlumberger, and the UK EPSRC “Big Hypotheses” (EP/R018537/1) and “Algorithmic Aspects of Temporal Graphs” (EP/P02002X/1) grants.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. O((log₂ N)²) Fully-Balanced Redistribution

This appendix describes the fully-balanced redistribution algorithm described in [30]. This routine redistributes the particles in $O((\log_2 N)^2)$ parallel time by performing the following steps.
First, the particles x are sorted by the values in ncopies. The chosen sorting algorithm is Bitonic Sort, a comparison-based parallel sorting algorithm that has been implemented on a cluster of graphics cards relatively recently [28]. For any $P \leq N$, Bitonic Sort performs
$$O\!\left(\frac{N}{P}\left(\log_2 \frac{N}{P}\right)^2 + \frac{N}{P}(\log_2 P)^2\right) \qquad (A1)$$
comparisons, where the first term describes the number of steps to perform Bitonic Sort locally, and the second term represents the data movement to merge the keys between the cores. Here, as in [25], Bitonic Sort is performed locally, but it is also possible to replace it with an $O(N\log_2 N)$ single-core sort, such as Mergesort. Quicksort is not recommended in this case, because ncopies contains many zeros, which often results in Quicksort's worst-case ($O(N^2)$) run-time. An example of a sorting network for Bitonic Sort is illustrated in Figure A1. The particles to be duplicated are then separated from those to be deleted.
After this single sort, the algorithm moves on to another top-down routine. Starting from the root node, at each stage of the binary tree, three parallel operations are performed in sequence. First, the cores compute the inclusive cumulative sum over ncopies. Then, they search for a pivot that perfectly divides the node into two balanced leaves; in other words, pivot is the first index where the cumulative sum is equal to or greater than $N/2$. To find and broadcast pivot to all cores of the node, the cores use linear search locally, followed by a sum, which can be performed in parallel by using reduction. In the end, the $N/2$ particles on the right side of the pivot are shifted to the right side of the node. This is achieved by expressing r (the positions to shift by) in base-2 and rotating in $O(\log_2 N)$ steps according to the bits of r. In this way, the root node gets split into two balanced leaves. This top-down routine is recursively performed $\log_2 P$ times until the workload is equally distributed across the cores; then, S-R is called. In this paper, as in [7], we refer to this algorithm as Bitonic-Sort-Based Redistribution (B-R). Since Bitonic Sort is used once, and cumulative sum, reduction and rotational shifts are performed $\log_2 P$ times each, B-R achieves an $O((\log_2 N)^2)$ time complexity for $P = N$, or equal to (A1) for any $P \leq N$. As mentioned in Section 4.1, it would be theoretically possible to replace Bitonic Sort with AKS sort in order to make the asymptotic time complexity of the sorting phase equal to $O(\log_2 N)$, albeit with a large constant. However, since B-R needs a cumulative sum, a sum and rotational shifts up to $\log_2 N$ times, the overall computational complexity that would result from such an approach would still be $O((\log_2 N)^2)$. In [30], B-R has been implemented on MapReduce. In [25], B-R has been ported to MPI.
Figure A1. Bitonic/Nearly Sort for N = 32 and P = 8. The arrows represent inter-core messages (e.g., MPI_Sendrecv) and swap, which, for Nearly Sort, only applies to a zero-positive pair, since pairs of positive keys or zeros are nearly sorted. After each stage, results are in blue/red for Bitonic/Nearly Sort.
A small change to B-R can result in a further 25% improvement, as results in [7] and Section 5.1 show. This is achieved by substituting Bitonic Sort with an alternative algorithm, Nearly Sort. One does not actually need to perfectly sort the particles to divide the workload afterwards, but only needs to guarantee ncopies has shape (9). To achieve this property, the particles are first nearly sorted locally by calling Algorithm 3. We emphasize that doing so is faster than sorting the particles according to ncopies. Then, the particles are recursively merged, as in Bitonic Sort, by using the same sorting network illustrated in Figure A1. Since S-NS takes $O(N/P)$, and the number of sent/received messages per core equals $(\log_2 P)^2$, we can infer that Nearly Sort costs
$$O\!\left(\frac{N}{P} + \frac{N}{P}(\log_2 P)^2\right) \qquad (A2)$$
iterations. Here, as in [7], we refer to a B-R parallelization that uses Nearly Sort instead of Bitonic Sort as Nearly-Sort-Based Redistribution (N-R). Algorithm A1 summarizes both N-R and B-R, which achieve an $O((\log_2 N)^2)$ time complexity. Figure A2 shows an example of N-R for $N = 16$ and $P = 8$; a figure for B-R is omitted due to its similarities with N-R.
Figure A2. N-R example for N = 16 and P = 8. Each x_i is actually a real vector, but is marked with a capital letter for brevity.
Algorithm A1 Bitonic/Nearly-Sort-Based Redistribution (B-R/N-R)
Input: x, ncopies, N, P, n = N/P, p
Output: x
1: if P > 1 then ▷ Bitonic/Nearly Sort the particles
2:     Bitonic/Nearly_Sort(ncopies, x, N, P)
3: end if ▷ ncopies now has shape (9)
4: for k ← 0; k < log₂ P; k ← k + 1 do ▷ binary tree
5:     csum ← Cumulative_Sum(N/2^k, P/2^k, ncopies) ▷ the P/2^k cores in each node perform the cumulative sum over ncopies
6:     pivot ← Linear_Search(ncopies, csum, n) ▷ search for the first pivot s.t. csum_pivot ≥ N/2^(k+1); if not found, pivot ← 0
7:     pivot ← Sum(pivot, P/2^k, p) ▷ the P/2^k cores in each node broadcast pivot to each other
8:     r ← N/2^(k+1) − pivot ▷ rotations to perform within the node
9:     ncopies, x ← Rot_Shifts(ncopies, x, r, pivot, P/2^k, p) ▷ the P/2^k cores in each node rotate the N/2^(k+1) particles on the right of pivot by r positions according to the bits of r
10: end for ▷ ncopies now has shape (11)
11: x ← S-R(x, ncopies, n)

Appendix B. Stochastic Volatility Model

The PF example considered in Section 5.2 is a stochastic volatility model that has previously appeared in the literature [15,19] and estimates the GBP-to-USD exchange rate from 1 October 1981 to 28 June 1985. The model is as follows:
$$X_t = \phi X_{t-1} + \sigma V_t$$
$$Y_t = \beta \exp(0.5 X_t) W_t$$
where $\phi = 0.9731$, $\sigma = 0.1726$, $\beta = 0.6338$ (as selected in [19]). $V_t \sim \mathcal{N}(0, 1)$ and $W_t \sim \mathcal{N}(0, 1)$, which means that $p(x_t^i \mid x_{t-1}^i)$ and $p(Y_t \mid x_t^i)$ in (2) are also Gaussian. The initial state is sampled as $X_0 \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{1-\phi^2}\right)$. The particles are initially drawn from $p_0(X_0)$ and then from the dynamic model. Hence, (2) simplifies to $w_t^i = w_{t-1}^i \, p(Y_t \mid x_t^i)$.
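For illustration, the C++ sketch below performs one IS step for the stochastic volatility model above, propagating each particle through the dynamics and weighting it with the Gaussian likelihood of the observation; it is a minimal sketch with assumed names, not the code used for the experiments.

// Hedged sketch of one IS step for the stochastic volatility model.
#include <cmath>
#include <random>
#include <vector>

void importance_sampling_step(std::vector<double>& x, std::vector<double>& w,
                              double y_t, std::mt19937& rng) {
    const double phi = 0.9731, sigma = 0.1726, beta = 0.6338, pi = 3.14159265358979323846;
    std::normal_distribution<double> std_normal(0.0, 1.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = phi * x[i] + sigma * std_normal(rng);       // X_t = phi * X_{t-1} + sigma * V_t
        const double var = beta * beta * std::exp(x[i]);   // Var(Y_t | X_t) = beta^2 * exp(X_t)
        const double lik = std::exp(-0.5 * y_t * y_t / var) / std::sqrt(2.0 * pi * var);
        w[i] *= lik;                                       // w_t^i = w_{t-1}^i * p(Y_t | x_t^i)
    }
}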

References

  1. Arulampalam, M.; Maskell, S.; Gordon, N.; Clapp, T. A Tutorial on Particle Filters for Online Nonlinear/Non–Gaussian Bayesian Tracking. IEEE Trans. Signal Process. 2002, 50, 174–188.
  2. Ma, X.; Karkus, P.; Hsu, D.; Lee, W.S. Particle Filter Recurrent Neural Networks. Proc. AAAI Conf. Artif. Intell. 2020, 34, 5101–5108.
  3. Costa, J.M.; Orlande, H.; Campos Velho, H.; Pinho, S.; Dulikravich, G.; Cotta, R.; Cunha Neto, S. Estimation of Tumor Size Evolution Using Particle Filters. J. Comput. Biol. A J. Comput. Mol. Cell Biol. 2015, 22, 649–665.
  4. Li, Q.; Liang, S.Y. Degradation Trend Prediction for Rotating Machinery Using Long-Range Dependence and Particle Filter Approach. Algorithms 2018, 11, 89.
  5. van Leeuwen, P.J.; Künsch, H.R.; Nerger, L.; Potthast, R.; Reich, S. Particle Filters for High-Dimensional Geoscience Applications: A Review. Q. J. R. Meteorol. Soc. 2019, 145, 2335–2365.
  6. Zhang, C.; Li, L.; Wang, Y. A Particle Filter Track-Before-Detect Algorithm Based on Hybrid Differential Evolution. Algorithms 2015, 8, 965–981.
  7. Varsi, A.; Kekempanos, L.; Thiyagalingam, J.; Maskell, S. A Single SMC Sampler on MPI that Outperforms a Single MCMC Sampler. arXiv 2019, arXiv:1905.10252.
  8. Jennings, E.; Madigan, M. astroABC: An Approximate Bayesian Computation Sequential Monte Carlo Sampler for Cosmological Parameter Estimation. Astron. Comput. 2017, 19, 16–22.
  9. Liu, J.; Wang, C.; Wang, W.; Li, Z. Particle Probability Hypothesis Density Filter Based on Pairwise Markov Chains. Algorithms 2019, 12, 31.
  10. Naesseth, C.A.; Lindsten, F.; Schön, T.B. High-Dimensional Filtering Using Nested Sequential Monte Carlo. IEEE Trans. Signal Process. 2019, 67, 4177–4188.
  11. Zhang, J.; Ji, H. Distributed Multi-Sensor Particle Filter for Bearings-Only Tracking. Int. J. Electron. 2012, 99, 239–254.
  12. Lopez, F.; Zhang, L.; Beaman, J.; Mok, A. Implementation of a Particle Filter on a GPU for Nonlinear Estimation in a Manufacturing Remelting Process. In Proceedings of the 2014 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Besançon, France, 8–11 July 2014; pp. 340–345.
  13. Lopez, F.; Zhang, L.; Mok, A.; Beaman, J. Particle Filtering on GPU Architectures for Manufacturing Applications. Comput. Ind. 2015, 71, 116–127.
  14. Kreuger, K.; Osgood, N. Particle Filtering Using Agent-Based Transmission Models. In Proceedings of the 2015 Winter Simulation Conference (WSC), Huntington Beach, CA, USA, 6–9 December 2015; pp. 737–747.
  15. Doucet, A.; Johansen, A. A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later. Handb. Nonlinear Filter. 2009, 12, 3.
  16. Djuric, P.M.; Lu, T.; Bugallo, M.F. Multiple Particle Filtering. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing—ICASSP ’07, Honolulu, HI, USA, 15–20 April 2007; Volume 3, pp. III-1181–III-1184.
  17. Demirel, O.; Smal, I.; Niessen, W.; Meijering, E.; Sbalzarini, I. PPF—A Parallel Particle Filtering Library. In Proceedings of the IET Conference on Data Fusion Target Tracking 2014: Algorithms and Applications (DF TT 2014), Liverpool, UK, 30 April 2014; pp. 1–8.
  18. Murray, L.M.; Lee, A.; Jacob, P.E. Parallel Resampling in the Particle Filter. J. Comput. Graph. Stat. 2016, 25, 789–805.
  19. Varsi, A.; Taylor, J.; Kekempanos, L.; Pyzer Knapp, E.; Maskell, S. A Fast Parallel Particle Filter for Shared Memory Systems. IEEE Signal Process. Lett. 2020, 27, 1570–1574.
  20. Bolic, M.; Djuric, P.M.; Hong, S. Resampling Algorithms and Architectures for Distributed Particle Filters. IEEE Trans. Signal Process. 2005, 53, 2442–2450.
  21. Zhu, R.; Long, Y.; Zeng, Y.; An, W. Parallel Particle PHD Filter Implemented on Multicore and Cluster Systems. Signal Process. 2016, 127, 206–216.
  22. Bai, F.; Gu, F.; Hu, X.; Guo, S. Particle Routing in Distributed Particle Filters for Large-Scale Spatial Temporal Systems. IEEE Trans. Parallel Distrib. Syst. 2016, 27, 481–493.
  23. Heine, K.; Whiteley, N.; Cemgil, A. Parallelizing Particle Filters With Butterfly Interactions. Scand. J. Stat. 2020, 47, 361–396.
  24. Sutharsan, S.; Kirubarajan, T.; Lang, T.; Mcdonald, M. An Optimization-Based Parallel Particle Filter for Multitarget Tracking. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 1601–1618.
  25. Varsi, A.; Kekempanos, L.; Thiyagalingam, J.; Maskell, S. Parallelising Particle Filters with Deterministic Runtime on Distributed Memory Systems. In Proceedings of the IET 3rd International Conference on Intelligent Signal Processing (ISP 2017), London, UK, 4–5 December 2017; pp. 1–10.
  26. Maskell, S.; Alun-Jones, B.; Macleod, M. A Single Instruction Multiple Data Particle Filter. In Proceedings of the IEEE Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, 13–15 September 2006; pp. 51–54.
  26. Maskell, S.; Alun-Jones, B.; Macleod, M. A Single Instruction Multiple Data Particle Filter. In Proceedings of the IEEE Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, 13–15 September 2006; pp. 51–54. [Google Scholar] [CrossRef]
  27. Batcher, K.E. Sorting Networks and Their Applications. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April–2 May 1968; Association for Computing Machinery: New York, NY, USA, 1968. AFIPS ’68 (Spring). pp. 307–314. [Google Scholar] [CrossRef]
  28. White, S.; Verosky, N.; Newhall, T. A CUDA-MPI Hybrid Bitonic Sorting Algorithm for GPU Clusters. In Proceedings of the 2012 41st International Conference on Parallel Processing Workshops, Pittsburgh, PA, USA, 10–13 September 2012; pp. 588–589. [Google Scholar] [CrossRef]
  29. Baddar, S.; Batcher, K. Designing Sorting Networks: A New Paradigm; SpringerLink: Bücher; Springer: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  30. Thiyagalingam, J.; Kekempanos, L.; Maskell, S. MapReduce Particle Filtering with Exact Resampling and Deterministic Runtime. EURASIP J. Adv. Signal Process. 2017, 2017, 71–93. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Hol, J.D.; Schon, T.B.; Gustafsson, F. On Resampling Algorithms for Particle Filters. In Proceedings of the 2006 IEEE Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, 13–15 September 2006; pp. 79–82. [Google Scholar] [CrossRef] [Green Version]
  32. Ajtai, M.; Komlós, J.; Szemerédi, E. An 0(N Log N) Sorting Network. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, Boston, MA, USA, 25–27 April 1983; ACM: New York, NY, USA, 1983. STOC ’83. pp. 1–9. [Google Scholar] [CrossRef]
  33. Paterson, M.S. Improved Sorting Networks With O(logN) Depth. Algorithmica 1990, 5, 75–92. [Google Scholar] [CrossRef]
  34. Seiferas, J. Sorting Networks of Logarithmic Depth, Further Simplified. Algorithmica 2009, 53, 374–384. [Google Scholar] [CrossRef]
  35. Ladner, R.E.; Fischer, M.J. Parallel Prefix Computation. J. ACM 1980, 27, 831–838. [Google Scholar] [CrossRef]
  36. Santos, E.E. Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines. J. Parallel Distrib. Comput. 2002, 62, 517–543. [Google Scholar] [CrossRef]
  37. Gropp, W.; Lusk, E.; Skjellum, A. Using MPI: Portable Parallel Programming with the Message-Passing Interface; The MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  38. Li, H.F.; Liang, T.Y.; Chiu, J.Y. A Compound OpenMP/MPI Program Development Toolkit for Hybrid CPU/GPU Clusters. J. Supercomput. 2013, 66, 381–405. [Google Scholar] [CrossRef]
Figure 1. S-R—example for N = 8. Each $x^i$ is actually a real vector, but is marked with a capital letter for brevity.
Figure 2. Distributed memory architecture.
Figure 3. RoSS redistribution—example for N = 8 and P = 4. Each $x^i$ is actually a real vector, but is marked with a letter for brevity.
Figure 4. Redistribution—results for increasing N and P.
Figure 5. Stochastic volatility—results for increasing N and P.
Figure 6. Stochastic volatility—profiling information for N = 2^24 and increasing P.
Table 1. Time complexity of each task of PFs on DM.

Task Name (Parallelization Strategy) | Details | Sequential Time Complexity | Parallel Time Complexity
IS (embarrassingly parallel) | Equations (1) and (2) | $O(N)$ | $O(N/P)$
Normalize (reduction) | Equation (3) | $O(N)$ | $O(N/P + \log_2 P)$
ESS (reduction) | Equation (4) | $O(N)$ | $O(N/P + \log_2 P)$
MVR (cumulative sum) | Equations (6) and (7) | $O(N)$ | $O(N/P + \log_2 P)$
Redistribution (RoSS) | Algorithm 5 | $O(N)$ | $O(N/P + (N/P)\log_2 P)$
Reset (embarrassingly parallel) | $w_t^i = 1/N$ for all $i$ | $O(N)$ | $O(N/P)$
Estimate (reduction) | Equation (8) | $O(N)$ | $O(N/P + \log_2 P)$
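To make the reduction-based rows of Table 1 (Normalize, ESS and Estimate) concrete, the sketch below shows how they can be realized with MPI all-reductions. It is offered under stated assumptions: mpi4py is used only for illustration (the paper's implementation is not necessarily in Python), the function names are hypothetical, each of the P cores is assumed to hold N/P local weights and states, and the formulas used are the standard ones (normalization by the global weight sum, ESS = 1/Σ(w^i)^2, and a weighted-mean estimate). Each call performs O(N/P) local work plus an all-reduction whose communication cost grows as O(log2 P), matching the table.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD  # launch with e.g. "mpiexec -n 4 python script.py" (hypothetical file name)

def normalize_and_ess(local_w):
    """Normalize the local weights by the global sum and return the ESS."""
    total = comm.allreduce(float(local_w.sum()), op=MPI.SUM)          # O(N/P) local + O(log2 P) comm
    local_w = local_w / total
    ess = 1.0 / comm.allreduce(float(np.sum(local_w ** 2)), op=MPI.SUM)
    return local_w, ess

def estimate(local_w, local_x):
    """Weighted-mean state estimate via a single all-reduction."""
    return comm.allreduce(float(np.sum(local_w * local_x)), op=MPI.SUM)

# Illustrative local data on each core (random placeholders, N/P = 1024 here).
local_w = np.random.rand(1024)
local_x = np.random.randn(1024)
local_w, ess = normalize_and_ess(local_w)
x_hat = estimate(local_w, local_x)
```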