In onboard applications, real-time particle identification must address not only identification accuracy but also the unique requirements of in-orbit operation, such as limited FPGA resources and the dynamic variation in space particle flux. Firstly, given the constrained onboard resources and the prevalence of FPGAs in space applications, it is especially critical to develop a lightweight CNN architecture for particle identification that is adapted to FPGA platforms. Secondly, the wide variation in cosmic particle fluxes, especially during solar activity, necessitates an extremely low-latency network architecture to improve the system's responsiveness and dynamic range. Moreover, the network model is expected to be retrained to higher accuracy as data are updated; therefore, developing a modular and reconfigurable FPGA operating architecture is also crucial.
2.1. Particle Waveform Data Software Preprocessing Module
The main function of this module is to process all kinds of particle waveform data obtained from in-orbit operation or ground accelerators, including data normalization, random sampling, random cropping, noise addition, and frequency-domain transformation, in order to improve the robustness and generalization of the model, and to divide the processed data into training, validation, and test sets. The training data are mainly used to train the weights and biases of the entire network; the validation data are used to determine whether the model's hyperparameters are reasonable; and the test data are mainly used to perform the final particle identification inference and, simultaneously, to measure the accuracy of the model's inference.
In order to construct a software preprocessing module for particle waveform data, it is necessary to understand the characteristics of such data in depth and to construct a corresponding training dataset based on these characteristics. Semiconductor detectors are the most commonly used means of charged particle detection. Owing to the internal electric field, the field strength inside the detector gradually increases from the back end to the front end. When charged particles are incident from the back end, the plasma erosion time along the trajectory of heavier ions is longer because of their short range, higher energy loss rate, and the low electric field strength there. In charge carrier transport, electrons are more mobile and holes less so; since the particles enter from the rear end of the detector, the average distance traveled by the holes increases, which lengthens the charge carrier transport time. For a given incident energy, variations in the nuclear charge and mass number of the particles lead to differences in charge collection times: heavier particles produce current pulses with a longer duration, lower amplitude, longer charge rise time, and a greater time to return to zero, so that different particles deposited in the sensor produce distinct waveform differences [10,11,12,13,14].
For the neutral component, take neutron–gamma discrimination as an example. When an organic scintillator is irradiated by neutrons or γ-rays, recoil protons or secondary electrons are produced and the scintillator luminesces: the light intensity rises quickly to its maximum and then decays relatively slowly, approximately exponentially. The luminescence decay time contains two components, one fast and one slow. When neutrons and γ-rays irradiate the same scintillator, the ionization densities they produce differ, resulting in different intensity ratios of the fast and slow decay components. In the proton-excited fluorescence generated by the interaction of neutrons with the scintillator, the share of the fast component is lower, and the share of the slow component higher, than in the electron-excited fluorescence generated by the interaction of γ-rays with the scintillator. It is this factor that causes the difference in the decay time of the fluorescence generated by neutrons and γ-rays, which in turn leads to a difference in the shape of the generated pulses [15,16,17]. The waveforms of a charged particle are shown in Figure 2a, where i represents the current generated by the charged particles in the semiconductor sensor, q represents the charge generated by the charged particles in the semiconductor sensor, and U represents the voltage output by the charged particles in the semiconductor sensor and the subsequent front-end preamplifier circuit. The waveforms of a neutral particle are shown in Figure 2b, where i represents the current generated by neutrons and gamma rays in the scintillator sensor.
Based on the above, the waveforms generated by semiconductor sensors or scintillators interacting with particles exhibit a certain degree of variability in their rising and falling edges. The training dataset therefore needs to include information on the rising edges, falling edges, and full waveforms in order to comprehensively extract the waveform characteristics, and it is further expanded by data enhancement methods such as random sampling, random cropping, noise addition, and frequency-domain transformations [18] to improve the model's generalization and robustness.
The data augmentation methods are implemented as follows: random downsampling of training data by controlling the extraction interval; random cropping of waveforms by randomizing the starting position and fixing the waveform length; adding Gaussian white noise to simulate real-world random noise; and frequency domain transformation using Fourier transform.
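For illustration, a minimal NumPy sketch of these four augmentation steps might look as follows; the window lengths, sampling intervals, and noise level shown are hypothetical placeholders rather than the values used in this work:

```python
import numpy as np

def random_downsample(wave, max_step=4):
    """Random sampling: extract every `step`-th point with a random interval."""
    step = np.random.randint(1, max_step + 1)
    return wave[::step]

def random_crop(wave, crop_len=512):
    """Random cropping: randomize the start position, fix the waveform length."""
    start = np.random.randint(0, len(wave) - crop_len + 1)
    return wave[start:start + crop_len]

def add_gaussian_noise(wave, sigma=0.01):
    """Noise addition: Gaussian white noise simulating real-world random noise."""
    return wave + np.random.normal(0.0, sigma, size=wave.shape)

def to_frequency_domain(wave):
    """Frequency-domain transformation via the Fourier transform (magnitude)."""
    return np.abs(np.fft.rfft(wave))

wave = np.random.rand(2048)  # stand-in for one normalized waveform
augmented = add_gaussian_noise(random_crop(random_downsample(wave)))
```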
The overall implementation roadmap of the software preprocessing is shown in Figure 3.
2.2. Convolutional Neural Network Software Training and Testing Module
The module mainly consists of several parts: the hyperparameter search module, the training module, and the test and validation module. In the process of building a convolutional neural network, the setting of hyperparameters is a key factor affecting the accuracy of training and of the final inference. Conventional hyperparameters include the convolutional kernel size, stride, padding, activation function, number of convolutional layers, number of fully connected layers, loss function, number of iterations, learning rate, optimizer, batch size, and dropout rate; combining and testing these manually is very inefficient. Optuna (1.0.0) is a state-of-the-art automated machine learning framework focused on hyperparameter optimization, aiming to improve the performance of machine learning models. Its key strengths include a highly flexible design that allows users to customize complex optimization logic and parameter search spaces; an intuitive and powerful API that ensures ease of use and integration; and improved optimization efficiency through effective sampling strategies and pruning techniques. These features have made Optuna a widely adopted tool in research and industry [19,20,21].
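As a hedged sketch of how such a search could be set up (the API shown follows current Optuna releases; `build_cnn` and `train_and_validate` are hypothetical helpers standing in for the training module):

```python
import optuna

def objective(trial):
    # Sample one candidate hyperparameter combination.
    kernel_size = trial.suggest_categorical("kernel_size", [3, 5, 7])
    n_conv      = trial.suggest_int("n_conv_layers", 1, 4)
    lr          = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    dropout     = trial.suggest_float("dropout", 0.0, 0.5)
    batch_size  = trial.suggest_categorical("batch_size", [32, 64, 128])

    model = build_cnn(kernel_size, n_conv, dropout)          # assumed helper
    return train_and_validate(model, lr, batch_size, trial)  # validation accuracy

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100)
print(study.best_params)
```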
After determining the optimal hyperparameter combination through Optuna, the network is trained on the particle training and validation waveform data by iterating forward propagation, loss function computation, and backward gradient computation until training is complete.
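A minimal PyTorch training loop implementing these three steps might look as follows; `model`, `train_loader`, and `n_epochs` are assumed to come from the preceding modules, and the optimizer and learning rate are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(n_epochs):
    model.train()
    for waveforms, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(waveforms)         # forward propagation
        loss = criterion(outputs, labels)  # loss function computation
        loss.backward()                    # backward gradient computation
        optimizer.step()                   # weight and bias update
```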
The test validation module is used to analyze the accuracy with which the particle discrimination model achieves classification under the current training state. By inputting a preprocessed hybrid test dataset, the accuracy of the validation and inference results can be obtained, so as to assess the effectiveness of model training and to provide data support for determining the model's hyperparameters and architecture. In practice, the key quantities are the training set accuracy, the loss function value, the validation set accuracy, and the test set accuracy. The training set accuracy is calculated as shown in Equation (1):

$$TRC_{per} = \frac{TRC_{num}}{TR_{num}} \times 100\%$$

where $TRC_{per}$ denotes the training accuracy, $TRC_{num}$ is the number of training data forward inference results that match the labels, and $TR_{num}$ is the total number of training data.
The loss function value is calculated as shown in Equation (2):

$$AVE_{loss} = \frac{T_{loss}}{TR_{num}}$$

where $AVE_{loss}$ is the overall average loss function value for training, $T_{loss}$ is the total loss function value, and $TR_{num}$ is the total number of training data.
The validation set accuracy is calculated as shown in Equation (3):

$$MDC_{per} = \frac{MDC_{num}}{MD_{num}} \times 100\%$$

where $MDC_{per}$ is the validation set accuracy, $MDC_{num}$ is the number of validation set forward inference results that match the labels, and $MD_{num}$ is the total number of validation set data.
The test set accuracy is calculated as shown in Equation (4):

$$TSC_{per} = \frac{TSC_{num}}{TS_{num}} \times 100\%$$

where $TSC_{per}$ is the test set accuracy, $TSC_{num}$ is the number of test set forward inference results that match the labels, and $TS_{num}$ is the total number of test set data.
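Equations (1)–(4) share the same ratio form, so a direct implementation reduces to two small helpers:

```python
def accuracy(num_correct: int, num_total: int) -> float:
    """Shared form of Equations (1), (3), and (4), as a percentage."""
    return 100.0 * num_correct / num_total

def average_loss(total_loss: float, num_train: int) -> float:
    """Equation (2): overall average training loss."""
    return total_loss / num_train

# e.g., TRC_per = accuracy(TRC_num, TR_num); AVE_loss = average_loss(T_loss, TR_num)
```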
2.3. Particle Waveform Data FPGA Forward Inference Module
The particle waveform data FPGA forward inference module is the key link in realizing the real-time identification of particle waveforms. To meet task requirements, it must be reconfigurable and portable, offer flexible configuration, and keep resource consumption and inference time within reasonable bounds.
Combined with the obtained fixed-point weights and bias parameters of each CNN layer, the construction process of the FPGA forward inference module for particle waveform data is shown in Figure 4.
To ensure the correctness of the forward inference of the FPGA-based CNN architecture, the parameters obtained from training with the PyTorch (1.0.0) framework need to be exported, converted to fixed point, and loaded into the FPGA. Given the complexity of verification directly on the FPGA, a set of forward inference models was reconstructed in MATLAB, with functions for importing and exporting the parameters; computing the convolutional, pooling, and fully connected layers; and producing the output-layer scores for each category together with the final classification results. Finally, the PyTorch and MATLAB results are compared to verify the correctness of the exported parameters.
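A hedged sketch of this export-and-verify step is shown below; here a generic `independent_forward` function stands in for the MATLAB reimplementation, and `model`, `test_wave`, and the comparison tolerance are assumptions:

```python
import numpy as np
import torch

model.eval()
state = model.state_dict()

# Export each parameter tensor as plain text so it can be imported by the
# independent (MATLAB) model and by the FPGA fixed-point conversion flow.
for name, tensor in state.items():
    np.savetxt(f"{name}.txt", tensor.detach().cpu().numpy().reshape(-1))

# Compare the PyTorch reference output with the independent forward pass.
with torch.no_grad():
    ref = model(test_wave).cpu().numpy()                # PyTorch result
indep = independent_forward(state, test_wave.numpy())   # assumed reimplementation
assert np.allclose(ref, indep, atol=1e-5), "exported parameters disagree"
```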
Fixed-point quantization is not only a necessary step when migrating models from software environments to hardware implementations, especially on field-programmable gate arrays (FPGAs), but also provides multiple advantages [22,23,24]. Firstly, FPGAs have a limited number of logic cells, and fixed-point quantization helps reduce the logic resources required, enabling more compact hardware designs. Secondly, fixed-point arithmetic simplifies the hardware implementation compared to floating point, reducing processing delays and improving the speed and efficiency of data processing. Furthermore, fixed-point arithmetic is usually more energy efficient than floating point, which is especially important for power-sensitive applications. Through quantization, a balance between accuracy and hardware resources can be found, allowing complex algorithms to be implemented on FPGAs, which is crucial for many resource-constrained embedded systems and high-performance computing applications. The convolutional kernels, biases, and input data in this architecture are quantized using a 16-bit fixed-point strategy to improve inference speed and reduce the resource footprint.
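As a minimal sketch of such a 16-bit fixed-point scheme (the split between integer and fractional bits, here 11 fractional bits plus sign, is an assumption that would in practice be chosen from the dynamic range of the trained parameters):

```python
import numpy as np

FRAC_BITS = 11
SCALE = 1 << FRAC_BITS  # 2**11

def to_fixed(x: np.ndarray) -> np.ndarray:
    """Quantize floating-point values to signed 16-bit fixed point."""
    q = np.round(x * SCALE).astype(np.int64)
    return np.clip(q, -32768, 32767).astype(np.int16)  # saturate to int16 range

def to_float(q: np.ndarray) -> np.ndarray:
    """Dequantize back to floating point for verification."""
    return q.astype(np.float64) / SCALE

w = np.array([0.7213, -1.042, 3.9])
print(to_float(to_fixed(w)))  # values after one quantization round-trip
```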
FPGA forward inference modules at this stage are typically built with platforms such as high-level synthesis (HLS) or Vitis AI, which use compilers and optimizers to turn the model architecture into deployable hardware bitstreams [25,26]. Despite its many advantages, this approach is not fully applicable to satellite-borne devices. Firstly, the transparency and maintainability of the code are high priorities in onboard applications, as readable code facilitates error detection, system validation, and radiation-hardening design. Secondly, onboard resources tend to be more constrained in terms of power consumption than ground-based devices, which requires fine optimization and fine-tuning of the design at the RTL level. In addition, onboard equipment often has customized functional requirements, and designing at the RTL level makes it easier to combine the different parts organically to improve operational efficiency. Finally, it is crucial to note that most current SoC or AI chips are not radiation tolerant and thus cannot meet the requirements of in-orbit operation. Therefore, RTL-level development of the FPGA forward inference CNN module is adopted in this architecture.
The essence of the forward inference calculation of a convolutional neural network is a large number of multiply–accumulate operations, together with the extraction, shifting, splicing, and other manipulations of register arrays; the DSP modules inside the FPGA are therefore used to perform these calculations in parallel. At the same time, considering the large particle fluxes of the actual application scenarios, the inference time must be reduced as much as possible. Where resources allow, the FPGA on-chip registers are used directly to store all parameters, minimizing the number of data access operations and reducing the forward inference delay. All data are quantized as 16-bit fixed-point numbers, which effectively reduces the complexity of the multiply–accumulate operations without causing data overflow, and parallel computation is employed to meet the inference time requirements of the actual application environment. Since the convolutional neural network model may change as the actual model is revised, and the parallel computation strategy will change with the model and with the resource budget of the hardware platform, the modular design of the entire structure should be as flexible as possible to accommodate such changes.
The kernel architecture of the convolutional layer is shown in Figure 5.
The computation of the convolutional layer consists of four main levels: the computation of the elements inside the convolutional kernel, the computation of the kernels corresponding to the different input channels of a single filter, the sliding of a single filter over the one-dimensional data, and the computation of multiple filters. In the current architectural design, the kernel-element computation, the one-dimensional sliding computation, and the computation of the multiple output channels can be fully or partially parallelized, whereas the per-input-channel kernel computation is designed to be serial because the partial results of the multiple channels must be accumulated.
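The four levels can be read directly from a reference loop nest for the one-dimensional convolution; the sketch below is behavioral only, with the parallelizable and serial loops marked in the comments:

```python
import numpy as np

def conv1d_forward(x, w, b):
    """x: (C_in, L), w: (C_out, C_in, K), b: (C_out,) -> y: (C_out, L-K+1)."""
    c_out, c_in, k = w.shape
    l_out = x.shape[1] - k + 1
    y = np.zeros((c_out, l_out))
    for f in range(c_out):            # level 4: multiple filters   (parallelizable)
        for o in range(l_out):        # level 3: sliding window     (parallelizable)
            acc = b[f]
            for c in range(c_in):     # level 2: input channels     (serial: accumulation)
                for k_i in range(k):  # level 1: kernel elements    (parallelizable)
                    acc += w[f, c, k_i] * x[c, o + k_i]
            y[f, o] = acc
    return y
```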
Convolutional computation is a major consumer of chip and timing resources, and quantitative analysis of the computing resources and an understanding of the time delay are key steps in building the platform. Since the convolutional operation is mainly a multiply–accumulate operation with high parallelism requirements, the dedicated DSP modules within the FPGA are used to construct the processing elements (PE units). The relationship between the number of DSP modules consumed and the degree of parallelism is shown in Equation (5):

$$DSP_{Num} = k \times n \times m$$

where $DSP_{Num}$ is the number of FPGA DSP modules consumed, $k$ represents the degree of parallelism within the convolutional kernel, $n$ represents the degree of the sliding window's parallelism, and $m$ represents the degree of the output channel's parallelism.
The inference time is related to how serially the architecture runs, i.e., how many operations each layer requires to produce its output data. Each computation requires one clock cycle, and the final inference time required to compute a convolutional layer is shown in Equation (6):

$$Clk_{Num} = x \times y \times z \times a + b$$

where $Clk_{Num}$ is the number of running clock cycles consumed; $x$ represents the degree of serialization within the convolutional kernel; $y$ represents the degree of serialization of the input channel element computation; $z$ represents the degree of serialization of the sliding window; $a$ represents the degree of serialization of the output channel; and $b$ is the number of extra clock cycles due to synchronization or bias computation, as determined by the actual program.
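As a worked example of Equations (5) and (6) for one hypothetical convolutional layer (all parallelism and serialization degrees below are illustrative, not values from this design):

```python
# Equation (5): DSP modules consumed by the PE array.
k, n, m = 3, 4, 2       # kernel, sliding-window, and output-channel parallelism
dsp_num = k * n * m     # 24 DSP modules

# Equation (6): clock cycles for one convolutional layer.
x, y, z, a, b = 1, 4, 16, 4, 3  # serialization degrees plus extra sync clocks
clk_num = x * y * z * a + b     # 259 clock cycles

print(dsp_num, clk_num)  # -> 24 259
```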
The kernel architecture of the pooling layer is shown in Figure 6.
The operation of the pooling layer is relatively simple: this architecture uses maximum pooling with a pooling size of 2, so the length of each input channel is halved after the pooling operation, with the larger of each pair of neighboring values taken as the output. The pooling operation processes the data of a single input channel in parallel and the different input channels serially. Finally, all data are flattened and output.
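A behavioral sketch of this size-2 maximum pooling and flattening step (NumPy here evaluates all channels at once, whereas the FPGA processes the channels serially):

```python
import numpy as np

def maxpool2_and_flatten(x):
    """x: (C, L) with L even -> flat vector of length C * L // 2."""
    pooled = np.maximum(x[:, 0::2], x[:, 1::2])  # larger of each neighboring pair
    return pooled.reshape(-1)                    # flatten for the fully connected layer
```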
The pooling operation consists purely of comparator logic and therefore consumes no DSP modules; its time delay is calculated in Equation (7):

$$Clk_{Num} = clk_{pool} \times clk_{channel} + c$$

where $clk_{pool}$ is the serial degree of single-channel input data pooling, $clk_{channel}$ is the serial degree of multi-channel input pooling processing, and $c$ is the number of extra clock cycles due to synchronization, as determined by the actual program.
The kernel architecture of the fully connected layer is shown in Figure 7.
The fully connected layer processes all the flattened data obtained after convolution and pooling and computes the mapping to the final classification outputs. Since the computation is fully connected, every flattened input datum corresponds to a weight; the parallelism of this process therefore directly and significantly increases the number of DSP modules used.
The number of DSP modules used in the fully connected layer is shown in Equation (8):

$$DSP_{Num} = DSP_{muti} \times DSP_{outputnode}$$

where $DSP_{muti}$ is the input node multiplication computation parallelism and $DSP_{outputnode}$ is the output category computation parallelism.
The full connectivity layer’s inference time calculation is shown in Equation (9):
where
clkmuti is the multiplicative computation seriality, i.e., the number of times a single channel completes the computation, and +1 is needed since a biased addition is eventually required;
clkoutputnode is the output category computation seriality; and
k is the number of extra clocks due to synchronization as determined by the actual program.
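As a worked example of Equations (8) and (9) for a hypothetical fully connected layer with 128 flattened inputs and 4 output categories (the parallelism degrees and synchronization overhead are illustrative):

```python
# Equation (8): DSP modules consumed by the fully connected layer.
dsp_muti, dsp_outputnode = 8, 4       # input-node and output-category parallelism
dsp_num = dsp_muti * dsp_outputnode   # 32 DSP modules

# Equation (9): clock cycles for the fully connected layer.
clk_muti = 128 // dsp_muti            # 16 serial multiply steps per output node
clk_outputnode = 4 // dsp_outputnode  # all output categories computed in one pass
k_sync = 2                            # extra synchronization clocks (assumed)
clk_num = (clk_muti + 1) * clk_outputnode + k_sync  # 19 clock cycles
```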