1. Introduction
Meeting future energy demand will depend largely on low-carbon sources of electricity. This is especially crucial given the need for a global reduction in fossil fuel usage to limit temperature increases, as declared at the 2015 United Nations Climate Change Conference (known as the Paris Agreement) [1]. One possible solution is nuclear fusion, which is being pursued by many private companies and researchers around the world [2,3]. This article focuses on magnetically confined plasma in a tokamak device, which serves as an energy source.
When plasma reaches the required density and pressure, many unwanted processes can arise, such as runaway electrons, magnetohydrodynamic (MHD) activity, edge-localized modes (ELMs), fast ions, electron instabilities, and sawtooth phenomena [4,5]. These can result in unwanted contact with the inner walls or internal components of the tokamak. This is detrimental to the fusion process, lowering the performance of the reaction, stopping it entirely, or even damaging the tokamak device itself. Monitoring plasma impurities is therefore necessary to study and prevent such processes and ultimately improve experiment stability in tokamaks. This type of measurement is commonly applied to indicate the occurrence of such unwanted processes, their creation, and their changes in time [6,7].
It is essential to monitor the plasma while stably controlling its shape in order to maintain stable energy production. Nevertheless, establishing and maintaining plasma magnetic confinement is inherently complex [8]. Among other things, it requires several specialized diagnostics and measurements. Monitoring plasma states necessitates suitable methodologies encompassing detectors, electronics, software, and integration techniques. One of the diagnostic techniques employed on tokamaks is the measurement of soft X-ray (SXR) radiation emitted by plasma impurities. SXR monitoring is a widely used technique on tokamaks [4,9]. Micropattern gaseous detectors, particularly gas electron multiplier (GEM) detectors, represent a viable option for detecting SXR radiation. This article focuses on this solution, which represents an alternative to silicon detectors due to its enhanced resistance to neutron radiation.
In a simplified form, tokamak device control typically comprises two distinct parts—fast control (e.g., a Discharge Control System—DCS) and slow control (e.g., Control, Data Access, and Communication—CODAC [10,11]). The above citations concern WEST tokamak and its solutions, where the developed system was installed. Similar solutions are used in other tokamaks, especially those planned for ITER [12,13,14]. Diagnostic system output calculated in real time and integrated with the fast control can provide meaningful input for online plasma control. One challenge is implementing a solution that meets the specified latency requirements and allows for seamless integration into the existing control system working in a real-time feedback loop.
Diagnostic systems that exploit the possibilities of current technology within its constraints can comprise many parts. They must combine multiple hardware components, e.g., fast connections, field-programmable gate arrays (FPGAs—programmable logic-integrated circuits that process digital signals), central processing units (CPUs—the main processors of computing systems), general-purpose computing on graphics processing units (GPGPUs—graphics cards used for generic computations), accelerators (add-on devices used for computations), multiple data sources and flows, and different approaches such as classical algorithms or even data science methods [15]. Such system implementations, in terms of software components, integration, and implementation approach, are the main focus of this article.
The primary purpose of this work was to propose, implement, and test a low-latency and asynchronous computational platform, named the Asynchronous Complex Computation Platform (AC2P), built for heterogeneous SXR GEM-based measurement systems and used for online radiation measurements of tokamak plasma impurities.
This original solution was recently developed and provides fast internal communication, processing latency and stability monitoring, the possibility of multi-device usage, and asynchronous and real-time processing. The platform was designed for convenient development and effective improvements.
Various measurement systems and approaches are discussed to facilitate comparisons and the preparation, implementation, and testing of the computation platform. Verification was performed with the developed SXR GEM-based solution in the laboratory and in experimental scenarios on WEST tokamak.
2. System Description
The system on which the platform is executed was thoroughly described in earlier articles [
16,
17]. The digital signal processing work is divided between Artix-7 FPGAs and a computer system based on an Intel Xeon E5-2630 v3 processor on an S2600CW2R motherboard.
The developed system is capable of working in two base acquisition modes:
Global triggering with data quality monitoring that can be used for acquisition quality verification purposes and hardware behavior monitoring on fresh on-site installations. In this mode, a single channel with a proper signal level triggers acquisition on all channels available in the system. This method results in a snapshot of the detector behavior during each event [
18].
Local triggering, with optimized data layout and non-relevant data filtering. Only triggered pulses are stored and sent to the CPU in this mode. Therefore, it can register more intense plasma fluxes and impose a much lower load on the processing part [
19].
2.1. Data Processing
The system’s current design provides 1 cm plasma spatial resolution with up to 128 spatial channels [20]. These parameters are suitable for measuring hot plasma, monitoring impurities at a frequency of 1 kHz, and observing MHD activities at a frequency of 10 kHz [20]. The use of a GEM detector determines the subsequent signal processing steps. During amplification inside the detector, a single photon can be detected on multiple anode strips, resulting in a cluster that needs to be adequately computed across time-correlated channels. Data processing consists of multiple steps:
signal pulse detection, windowing and annotation;
sample integration;
identification of pulse clusters and determination of their charge and position;
separation of coinciding pulses (pileup decomposition);
calibration corrections;
multi-dimensional histogramming [21].
The first step is carried out in hardware (FPGA) as preprocessing, while further computation is carried out in software in an online, real-time manner; sample pulse-related data are archived for further offline adjustments and improvements [20]. The importance of quality analysis was discussed in [18].
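To make the hardware and software split concrete, the sketch below shows how the software-side steps (sample integration through histogramming) could be expressed on pulse-window data. All type names, field layouts and the baseline length are illustrative assumptions, not the system's actual interfaces.

```cpp
// Illustrative sketch of the software-side processing steps; names and formats are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

struct PulseWindow {                  // one triggered measurement window delivered by the FPGA
    uint16_t channel;                 // anode strip index
    uint64_t timestamp;               // window start time (FPGA clock ticks)
    std::vector<uint16_t> samples;    // raw ADC samples of the window
};

struct IntegratedPulse {              // step 2 output: baseline-corrected window integral
    uint16_t channel;
    uint64_t timestamp;
    double   charge;
};

// Step 2: estimate the baseline from the first samples and integrate the pulse part.
IntegratedPulse integrate(const PulseWindow& w, std::size_t baseline_len = 8) {
    double baseline = 0.0;
    const std::size_t n = baseline_len < w.samples.size() ? baseline_len : w.samples.size();
    for (std::size_t i = 0; i < n; ++i) baseline += w.samples[i];
    if (n > 0) baseline /= static_cast<double>(n);
    double integral = 0.0;
    for (std::size_t i = n; i < w.samples.size(); ++i) integral += w.samples[i] - baseline;
    return {w.channel, w.timestamp, integral};
}

// Step 6: accumulate a cluster charge q into a fixed-bin-width charge histogram.
void fill_charge_histogram(std::vector<uint64_t>& hist, double q, double bin_width) {
    if (q < 0.0 || bin_width <= 0.0) return;
    const auto bin = static_cast<std::size_t>(q / bin_width);
    if (bin < hist.size()) ++hist[bin];
}
```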
Pulse sample-based data processing (in software or FPGA), starting from step 2 (sample integration), provides many advantages:
possibility of detecting signal defects and annotating clusters with their quality [18];
rejection of photon clusters with defective signals from the results;
higher quality of results (limiting systematic errors);
increased time resolution in high-flux measurements due to possible multi-pileup decomposition inside the measurement window (an exemplary six-pulse pileup decomposition is presented in [21]);
verification and commissioning of measurement signals throughout the system’s life.
Most of the difference in the quality of the outcome is observable during measurements of high-intensity fluxes, pulse pileup occurrences, or signal disturbances originating from work in harsh environments such as a tokamak site [18,21]. Data quality monitoring has proven essential for providing accurate results and, in the future, this work can be migrated to FPGAs [18] to offload the CPU and optimize processing.
With the rise of computational capabilities in generic computing hardware and off-the-shelf solutions, the migration of computations from FPGAs and application-specific integrated circuits (ASICs—integrated circuits customized for a specific application) to software solutions is becoming more advantageous. Software solutions can be easier to implement when high throughput with complex algorithms is needed.
Combining multiple devices for computation is called heterogeneous computing and is widely used [22,23,24]. Multiple solutions and platforms are being developed to integrate and optimally execute computations on different devices. The following sections compare existing solutions that share similarities with the developed system.
2.2. System Performance
The system’s performance in data transfer and algorithm latency was measured and presented in [
25], providing latency of less than 10 ms. This outcome was achievable with the methods needed for global triggering—double buffering (platform side) and transposition of data (algorithm side).
Reliable local triggering modes provide new possibilities, such as lower processing latency thanks to resolving the memory layout drawbacks of global triggering modes, reduced overhead (only relevant data are sent), and the implementation of new algorithms with improved performance. These improvements can be integrated into online processing during plasma pulses, working in a feedback loop within the tokamak state machine.
2.3. Installation on WEST Tokamak
The system was prepared to be run on WEST tokamak for impurity radiation monitoring [20]. The first unit was installed at a poloidal position and commissioned. The system has provided multiple results to date [26]. Its unique feature is the registration of all pulse-related sample data for offline postprocessing. This enables the development of new or improved data processing algorithms in parallel with constant verification of result quality, since each computation can be checked (in terms of its error) against reference data.
The longest plasma shots will be 1000 s long [27]. Installing both devices in the future will provide many more research possibilities, e.g., tomography reconstruction [20]. Multiple preparations were made in the system in terms of usability and communication. Integration with the tokamak state machine covers two areas:
hardware—real-time triggering of the measurement for each plasma shot, synchronized with the tokamak;
software—integration with WestBox (the middleware between the tokamak state machine and the developed SXR measurement system) and software usability.
3. Requirements of the AC2P for Developed SXR GEM-Based Measurement Systems
Qualitative measurements carried out with the presented system revealed several challenges in earlier works, as well as requirements for the new platform:
Linear processing—a single line of data digestion threads, a processing thread pool, and sink threads [25]. Implementing additional computational devices or non-linear data processing would be challenging and would require optimization on two levels—platform and algorithm.
Double buffering—the global trigger data layout required in-memory data transformation and transposition.
Synchronization primitives used during communication—all buffers were synchronized with pthread mutex primitives. This is excessive for processing with mostly one-to-one communication between nodes, where better solutions such as Single Producer Single Consumer (SPSC) lock-free queues can be used [28].
Integration of many different hardware modules used for, e.g., computing, readout or communication.
Fast implementation of multi-input, non-linear data flow algorithms.
Scale-up (maximizing the capabilities of a single device) rather than scale-out (maximizing the possibilities of the system by adding additional devices).
Operation in a real-time (for now, soft real-time), streaming manner (processing data as soon as they arrive), close to the hardware.
To accomplish the above points and to implement the platform, a set of tools and methods must be chosen. More details on requirements are presented in [
29].
4. Comparison with Other GEM Detector-Based Solutions
Most systems dedicated to measurements with GEM detectors use ASIC solutions with Time of Flight (ToF), Time over Trigger (ToT), peak detection or counting of pulses. Since the introduction by F. Sauli in 1997 [
30], multiple solutions have emerged and are used widely in scientific projects [
31,
32,
33,
34,
35,
36,
37].
Such solutions prioritize channel quantity over quality. Most do not provide multi-second pulse-based sample measurements, which limits pileup detection (or leads to pileups being included as valid pulses) and provides little or no diagnostic output. Only the SAMPA ASIC provided results comparable to the presented system, but with a lower sampling frequency—20 mega samples per second (MSPS) vs. 80 MSPS and 10 vs. 40 samples per measurement window [38].
No comparable processing software that could be used with AC2P was found. Solutions like ToF or ToT can cause a loss of signal characteristics and calculation errors even for low-flux sources such as a laboratory 55Fe source (see Figure 4 in [18]), and especially in a harsh tokamak environment (where multiple pileups can occur). Using those methods can reduce the amount of valid input data and lower the measurement quality (such as a worse full width at half maximum value—FWHM) or lead to invalid conclusions [18].
Many diagnostics use dedicated processing software in which the data flow is typically linear, not described in detail, or written from scratch [39,40,41]. Readily available off-the-shelf digitizers are commonly used, such as those produced by CAEN, D-TACQ or SP Devices [42,43,44].
For fusion devices, the ready-to-use solution is the MARTe2 library [45]. It is the library of choice in many Plasma Control Systems [46]. Based on its architecture, a Plasma Control System for ITER has been implemented [42,47]. This solution is not suited for streaming data flow solutions with multiple inter-node communications, as planned in AC2P, mainly due to its linear processing of data and communication.
All of these libraries must be combined with proper software usage, proper isolation of cores, and system configuration, as mentioned in other works, to ensure low latency and low Linux kernel overhead [46,48].
An overview of available solutions was performed earlier [29]. Two main groups can be determined:
generic, task-based data flow streaming engines aimed at throughput, supporting complex computations and easy to implement;
low-level solutions aimed at low latency, challenging to implement and providing linear processing.
The main disadvantage of the first group stems from its throughput-oriented design, e.g., increased latency caused by core utilization optimizations that use task-based abstractions to swap computing contexts. Its advantage is the data flow abstraction of tasks pinned to hardware, which can be represented as a Directed Acyclic Graph (DAG). The main disadvantage of the second group is linear processing, while its advantage is a design aimed at low processing latency. The goal of the AC2P platform is to combine the positive characteristics of both groups.
The FastFlow library implements such an approach, concentrating on low-latency communication between task vertices, and can outperform other solutions [49]. It is implemented using the SPSC queue type, which can be lock-free, without synchronization primitives [50]. Using a subset of FastFlow with its low-level pipeline abstraction can simplify platform creation.
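As an illustration of this low-level pipeline abstraction, the sketch below builds a minimal two-stage pipeline with the classic untyped ff_node interface. It is a generic example rather than the AC2P code, and header names or end-of-stream constants may differ slightly between FastFlow releases.

```cpp
// Minimal FastFlow pipeline sketch (generic example, not the AC2P implementation).
#include <ff/pipeline.hpp>
using namespace ff;

struct Producer : ff_node {            // first stage: emits items into the pipeline
    void* svc(void*) override {
        for (long i = 0; i < 100; ++i)
            ff_send_out(new long(i));  // push a pointer downstream (zero-copy style)
        return EOS;                    // signal end-of-stream
    }
};

struct Consumer : ff_node {            // second stage: processes items as they arrive
    void* svc(void* task) override {
        long* v = static_cast<long*>(task);
        // ... process the item here ...
        delete v;
        return GO_ON;                  // keep waiting for the next item
    }
};

int main() {
    ff_pipeline pipe;                  // each stage runs on its own thread
    pipe.add_stage(new Producer());
    pipe.add_stage(new Consumer());
    return (pipe.run_and_wait_end() < 0) ? 1 : 0;
}
```

In AC2P, each such stage corresponds to one DAG node pinned to its own hardware thread, as described in Section 5.2.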
With sample-based SXR pulse data and the presented approach, full signal processing algorithm implementation in FPGA or ASICs will be troublesome due to the resources and development time needed to handle hundreds of channels simultaneously. The presented system handles this issue by switching to a hybrid hardware–software solution and offloading part of processing to general computation hardware.
The authors did not find a complex scale-up software solution realized in SXR measurements by GEM detectors. Other diagnostics used in plasma physics are prepared for such problems. However, no solution was found for complex data flow processing with a concentration on low-level real-time processing needed for tokamaks. The best fit is the most popular and widely used MARTe2 framework. However, cycle time-based execution and linear processing are disadvantageous in the current work.
There are two popular general-purpose frameworks to consider for AC2P: FastFlow and Intel TBB. Both start execution based on data arrival, not on cycle execution (like MARTe2). TBB implements work-stealing, task-based parallelism; the latency disturbance caused by its context switches is unfavorable. Research articles comparing TBB and FastFlow indicate lower communication latency in FastFlow (a comparison carried out by the author of FastFlow in 2017) [49]. FastFlow executes tasks on a single core one by one (like MARTe2) and implements DAG-like processing.
TBB and FastFlow have received much work and improvement since that article was published in 2017 [49], and different cases have been tested [51]. Further comparisons of these frameworks could be made to assess their current efficiency and capabilities. Other aspects, such as implementation efficiency measured in time, can also be investigated [52].
Based on the advantages presented, FastFlow was chosen as the primary building block of the AC2P platform, which was built to develop an SXR measurement system for plasma impurity monitoring.
Summarizing the above, there are two groups of currently available processing systems: throughput-oriented, task-based generic data flow streaming engines that support complex computations and are easy to implement; and latency-oriented, low-level solutions that are challenging to implement and provide only linear processing. Research showed that no existing solution can process complex algorithms in a low-latency manner with latency monitoring while combining the advantages of both groups. Moreover, systems used in various diagnostics, especially GEM-based ones, do not process raw pulse samples. The AC2P implementation, along with the system's raw data processing development, solves this issue by providing the highest possible measurement quality outputs.
This work presents the preparation, implementation and verification of AC2P, and thus of the whole SXR measurement system, through deployment in the laboratory and installation on WEST tokamak.
5. AC2P Platform Description
The execution operating system is CentOS 6.10, with a 2.6.32-754 kernel from the repository and a 3.18.44 kernel built from source. Owing to its versatility, driver availability, and the hardware and solutions used, the platform should execute on the GNU/Linux family of operating systems with kernels between 2.6.32 and 3.18. The possibility of migrating the dedicated communication driver was tested in [53].
Due to limitations of the compiler and ecosystem used, it was decided to use FastFlow version 2.0.3, which provides much of the usability of newer versions without modifications to the code, operating system, or compilation flow.
Future updates of Linux distribution and solutions are planned.
The AC2P is designed to orchestrate algorithms and data distribution between data inputs and multiple generic computing devices such as CPUs, GPGPUs or other accelerators (e.g., the Xilinx Alveo family of devices). The implementation uses low-level abstract elements of the FastFlow library together with system optimization and core isolation. Implementing it in a real-time manner is essential for integration with the tokamak real-time network, in order to provide online results and insight into the plasma for other systems.
The work aims to implement a complex, DAG-based, low-latency, high-performance processing software platform. This includes part of the operating system configuration for best results and the introduction of the local triggering mode of the hardware with a preliminary version of the new processing algorithm. It was verified against a reference version prepared in MATLAB for implementation validation and preliminary quality comparison (both the MATLAB reference implementation and the software platform are versioned as v.05.2024). Laboratory tests addressed quality, latency, and configuration using real high-intensity radiation sources.
5.1. Architecture
Figure 1 presents abstract blocks implementing different domains. The SXR measurement system, for the scope of this article, is a single device and can be realized by, e.g., a motherboard with multiple CPUs and GPUs. Many such devices can realize the whole measurement system, one for each installed GEM-based measurement system, but this is out of the scope of the current article. The presented system consists of multiple data inputs (currently two FPGAs) and a single storage/output transmission to the real-time network (currently in-memory storage, ultimately to be distributed to WEST tokamak devices for the feedback loop).
Each DAG node should be as close to the hardware as possible in order to maximize the usage of available resources [54]. Due to the need to process data as soon as they arrive (asynchronous operation) and the real-time, cycle-based connection to the tokamak systems, the architecture of the platform can be divided into two parts:
an asynchronous part, processing data as soon as they arrive;
a real-time, cycle-based part connected to the tokamak systems.
Both are presented in Figure 1. The DAG processing graph is presented and used as the main building block of the whole solution. For the asynchronous part, FastFlow is used through its API (control of execution is performed by the library); for the real-time part, busy-waiting on low-level input buffers is used.
The developed platform is characterized by, among others, the following properties:
Buffering optimization:
Minimal or no data buffering if not needed (zero-copy)—data obtained from data sources are not copied during processing—only pointers. If there is access to the accelerator, data should be copied into its memory directly (CPU bypass, e.g., GPUDirect by NVIDIA GPGPUs) or as soon as possible by CPU.
Turning on non-uniform memory access (NUMA—configuration in which the CPU can distinguish its memory from other CPUs in a multi-CPU system) within a Basic Input/Output System (BIOS—motherboard firmware) or Unified Extensible Firmware Interface (UEFI—motherboard firmware) providing the possibility of allocation of memory buffers close to related CPU.
Turning off hyper-threading (HT)-like technologies if not used by algorithms [
55,
56].
Low-level operating system (OS) configuration—core isolation and system settings ensuring low latency and low kernel overhead (see Section 4).
Implementation optimization:
Construction of processing parts should be designed with latency optimization in mind.
No or minimal dynamic allocation—data used by algorithms should be allocated before execution.
Fast point-to-point communication between computation nodes.
Close to hardware programming language (e.g., C++) [
58].
Simulation of real input streams (mock object of FPGA):
Streaming data archived from earlier real-world measurements.
Data generation based on timestamps from archived data.
Minimization of influence of simulation on the performance of the platform.
Low-disturbance time measurement for algorithm development and latency information for tokamak system feedback loop.
The above solution is an update of the architecture presented earlier for global triggering mode [
25] solving challenges presented in
Section 3.
5.2. Platform Implementation
The requirements described in the earlier sections were developed and implemented in AC2P. The current solution is based on two FPGA inputs. The DAG processing graph is divided into a histogramming part and two channel-processing parts, as presented in Figure 2, which describes the stages of computation, the paths and contents of communication, and the processing logic carried out in the system.
In the figures, there are annotations around graph edges where:
Samples are packages of pulse-related samples.
[q,p,t] are as follows: charge, position and occurrence time of multiple pulses aggregated into clusters.
Time monitoring data are diagnostic data obtained by timestamping each processing part. Circle nodes are processing nodes, whereas square ones are connected to data storage methods such as raw data samples or diagnostic data.
DAG Node Details
The designed DAG nodes are presented in Figure 2. The asynchronous processing node types are:
A—handles interrupts from FPGA, checks data integrity, looks for data errors and archives sample-based data to ramdisk, controls the state of measurement based on flags found in data.
B, C—calculate the signal baseline and the pulse integral for each measurement window, resulting in [q,p,t] values, based on which clusters are detected (this is the most time-consuming part, so two instances run in parallel; see Section 5.3).
E—calculates histograms (charge, position and time) based on input [q,p,t] triplets for each cycle based on timestamps of input data (current histogramming cycle is 10 ms) and archives them.
D—handles efficiency measurements (for tests only), saves to ramdisk all monitoring data related to data path from inputs to this node.
OS—tasks left for operating system (see
Figure 3).
Asterisk symbols mark nodes that confirm the completion of data usage (see Figure 2). This confirmation routine is common to all nodes: based on notifications from the nodes, it checks whether all of them have processed a given raw sample data package from the FPGA. Once the data are no longer needed, the corresponding entry in the circular buffer used for communication between the FPGA and the CPU is freed. It was implemented as synchronized access inside the nodes; the execution time per data package is very short, so a dedicated thread for it is not justified. The green color of the arrows highlights round-robin package sending: each odd package is sent to node B, and each even one to node C.
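The confirmation routine can be pictured as a per-package counter: each consuming node reports when it has finished with the raw data, and the last one releases the circular-buffer entry. The sketch below is a simplified illustration using an atomic counter; the paper states that AC2P implements this as synchronized access inside the nodes, so the actual code differs.

```cpp
// Simplified sketch of packet confirmation (assumed mechanism, not the exact AC2P routine).
#include <atomic>
#include <cstdio>

// One entry of the FPGA-CPU circular buffer with a per-package usage counter.
struct PackageSlot {
    std::atomic<int> pending{0};       // number of nodes that still need this raw data package
};

// Called when a package is read in and distributed to 'consumers' nodes.
void acquire(PackageSlot& slot, int consumers) {
    slot.pending.store(consumers, std::memory_order_release);
}

// Called by each node (archiver, clusterizer, ...) once it has finished with the package.
// The node that brings the counter to zero frees the circular-buffer entry.
void confirm(PackageSlot& slot, int slot_index) {
    if (slot.pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        // In AC2P this is where the entry used for FPGA-CPU communication would be freed;
        // here it is only reported.
        std::printf("circular buffer slot %d released\n", slot_index);
    }
}
```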
Cycle/real-time processing is implemented in the E node: histogram data gathered from the earlier steps are saved at fixed intervals of the CPU clock (the current cycle is 10 ms), together with an additional parameter—the latency of the whole processing chain.
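A minimal sketch of such a cycle-based output loop is given below: every 10 ms the current histogram snapshot is written out together with a processing-latency figure. The clock source, data layout and use of sleep_until are assumptions for illustration; the AC2P real-time part relies on busy-waiting, and in AC2P the E node itself both accumulates and writes, so no extra synchronization is shown.

```cpp
// Sketch of a 10 ms cycle output loop with attached processing latency (illustrative only).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct HistogramSet {                            // histograms accumulated since the last cycle
    std::vector<uint64_t> charge, position, time;
};

// Placeholder for storing one cycle of output (in AC2P: in-memory files on a ramdisk).
void write_cycle(const HistogramSet&, double latency_ms) {
    std::printf("cycle written, processing latency %.3f ms\n", latency_ms);
}

void cycle_loop(std::atomic<bool>& running, HistogramSet& current, std::atomic<double>& latency_ms) {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(10);     // current histogramming cycle
    auto next = clock::now() + period;
    while (running.load(std::memory_order_relaxed)) {
        std::this_thread::sleep_until(next);               // AC2P busy-waits instead of sleeping
        write_cycle(current, latency_ms.load());           // snapshot plus whole-chain latency
        next += period;
    }
}
```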
The presented graph results from optimizing the processing for the available hardware resources. In Figure 2, uppercase letters describe the actual tasks used in the implementation. Each task performs multiple actions, grouped according to their purpose, the available hardware threads, FastFlow usage, and the better latency and performance of aggregated actions. In FastFlow, pinning a single node to a single thread is the preferred solution; internally, a node consists of an infinite loop that waits for input on SPSC queues, processes it, and sends it further.
The current system is built with two CPUs of 8 hardware threads each (HT turned off). The UEFI was configured so that the operating system sees two independent NUMA nodes. One of the CPUs was designated entirely for processing: the one physically connected to the Peripheral Component Interconnect Express (PCIe—high-speed serial bus) slots with the FPGAs attached. The second one was used for low-priority tasks, such as the OS and out-of-platform processes. The A threads were initially dedicated only to data ingestion; however, because there is at least around 2.5 ms of waiting time before the following package (in the maximum-performance tests presented later in the article, e.g., in section Latency Stability and Performance During Simulated Maximal Performance Acquisition, reading a single packet takes around 0.7 ms while the time between packets is 3.3 ms), they were additionally dedicated to saving files with pulse-related data. D implements diagnostics, monitoring and internal latency archiving. E is the actual real-time output of the system, saving histograms to in-memory storage files.
Because each node works on a single hardware thread, no scheduling adjustments were needed. The pinning of DAG nodes to hardware threads is presented in Figure 3.
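The pinning itself can be done on Linux with standard CPU affinity calls, as in the sketch below; this illustrates the general mechanism only, not the way FastFlow performs the mapping internally.

```cpp
// Pinning the calling thread to one hardware core on Linux (generic illustration, not
// FastFlow's internal mapping). pthread_setaffinity_np is a GNU extension.
#include <pthread.h>
#include <sched.h>

int pin_current_thread(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);   // e.g., a core of the CPU wired to the FPGA PCIe slots
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);  // 0 on success
}
```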
Along with the computational processing, there are multiple monitoring, diagnostic and verification parts. Each edge of the graph transports dedicated data, highlighted in the figures, including multiple latency-related processing timestamps (25 different timestamps), part of which are used to calculate the whole processing latency saved along with the histograms; the others are used for stability verification and performance profiling. Multiple additional streams were conveniently added to the D node during development, such as integration output and partial or complete [q,p,t] cluster listings; however, they were used only for tests and algorithm validation.
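Such per-package timestamps can be taken with a monotonic clock at each processing stage, as in the compact sketch below; the number and placement of the 25 actual measurement points are specific to AC2P and are not reproduced here.

```cpp
// Low-overhead per-package timestamping sketch (illustrative; AC2P records 25 such points).
#include <array>
#include <chrono>
#include <cstddef>

using Stamp = std::chrono::steady_clock::time_point;

struct PackageTiming {
    std::array<Stamp, 8> points{};     // e.g., read start/end, clusterization start/end, ...
    void mark(std::size_t i) { points[i] = std::chrono::steady_clock::now(); }
    double span_us(std::size_t from, std::size_t to) const {
        return std::chrono::duration<double, std::micro>(points[to] - points[from]).count();
    }
};
```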
5.3. Implemented Algorithm
The developed algorithm (group task
B/
C in
Figure 2) implements a subset of steps described in
Section 2.1, where:
Sample integration—the pulse part of the measurement window is integrated, the baseline quality is diagnosed, and pulse quality annotation takes place.
Cluster of pulse identification—where multiple windows of pulses are gathered as a single photon event and its position, occurrence time and charge are calculated.
Multi-dimension histogramming—where cluster histogramming takes place, providing position, time, and charge histograms.
In pulse cluster identification, window pulses that are adjacent in time and position are transformed into a graph. A modified depth-first search (DFS) algorithm, adjusted to the physics of pulse detection in the GEM detector, was implemented; it can detect any pattern of pulses forming a cluster resulting from a photon interaction with the GEM detector. The pattern was adjusted to characteristics prepared in other works in MATLAB.
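A simplified version of such a cluster search is sketched below: integrated pulses become graph vertices, edges connect pulses that are adjacent in strip position and close in time, and an iterative DFS collects each connected component as one cluster candidate. The adjacency thresholds and the charge-weighted position are illustrative assumptions; the actual AC2P pattern matching is tuned to the GEM detector physics.

```cpp
// Simplified DFS-based pulse clustering (thresholds and position weighting are assumptions).
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <stack>
#include <vector>

struct Pulse   { int strip; uint64_t time; double charge; };
struct Cluster { double charge = 0.0; double position = 0.0; uint64_t time = 0; };

static bool adjacent(const Pulse& a, const Pulse& b, uint64_t max_dt) {
    const uint64_t dt = a.time > b.time ? a.time - b.time : b.time - a.time;
    return std::abs(a.strip - b.strip) <= 1 && dt <= max_dt;
}

std::vector<Cluster> find_clusters(const std::vector<Pulse>& pulses, uint64_t max_dt) {
    std::vector<Cluster> clusters;
    std::vector<bool> visited(pulses.size(), false);
    for (std::size_t seed = 0; seed < pulses.size(); ++seed) {
        if (visited[seed]) continue;
        Cluster c;
        double weighted_pos = 0.0;
        std::stack<std::size_t> todo;
        todo.push(seed);
        visited[seed] = true;
        while (!todo.empty()) {                        // iterative depth-first search
            const std::size_t i = todo.top(); todo.pop();
            c.charge     += pulses[i].charge;
            weighted_pos += pulses[i].charge * pulses[i].strip;
            c.time        = std::max(c.time, pulses[i].time);
            // O(n^2) neighbour scan kept for clarity; a sorted strip/time index would be used in practice.
            for (std::size_t j = 0; j < pulses.size(); ++j)
                if (!visited[j] && adjacent(pulses[i], pulses[j], max_dt)) {
                    visited[j] = true;
                    todo.push(j);
                }
        }
        if (c.charge > 0.0) c.position = weighted_pos / c.charge;   // charge-weighted centroid
        clusters.push_back(c);
    }
    return clusters;
}
```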
6. Platform Performance and Validation Tests
Multiple verification steps and tests were performed in order to validate the usage of the system and verify real-world scenarios. Tests were performed with two configurations. The first one (later described as the S1 variant) has a detector attached and 64 channels available, with Intel Xeon E5-2630 v4 CPUs (10 hardware threads each), 32 GiB RAM per CPU, and the 2.6.32 kernel. The second one (the S2 variant) has dual Intel Xeon E5-2630 v3 CPUs (8 hardware threads each), channels 49:112 (64 channels over 2 FPGAs) without a detector attached, 64 GiB RAM per CPU, and the 3.18.44 kernel.
Laboratory tests were carried out with variant S1 and radiation generated by an Amptek Mini-X X-ray tube. Below, the new algorithm running on the platform is compared with the reference one, followed by WEST tokamak execution details, performance and stability verification of online processing, and hardware simulation.
6.1. Pulse Samples to Histogram Algorithm
Implementation of the algorithm (B and C tasks in
Section 5.2) was verified against a reference MATLAB solution. A copy of the archived pulse-related data samples read from the FPGA during a plasma pulse at WEST tokamak was used as a test case; it was measured on 23 March 2023 and labelled as shot number 58317. This test confirms that the platform's algorithm implementation is an initially useful processing step in real-world usage, providing usable information for further analysis and a meaningful performance load for platform development and tests.
The test compared outputs by calculating the relative error of the charge, time and position histograms with respect to the MATLAB reference algorithms.
Implementation was successful, and results are summarized in
Figure 4. The mean values of the relative errors are 2.94% for the charge, 3.21% for the time and 3.18% for the position histogram. For this single measurement, the error is less than 5%, within the preliminary acceptable margin. A further reduction in the algorithm error is planned.
In Figure 4, the top row compares the different histograms; the charge histogram differs mainly in the bins containing lower-charge clusters. The central row presents the relative error with respect to the reference values. The bottom row presents histograms of the relative errors, showing the distribution of error percentages.
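For reference, the comparison metric can be computed per bin as in the sketch below, assuming the error is taken against the MATLAB reference bin contents and averaged over non-empty bins; the exact definition used for Figure 4 may differ in detail.

```cpp
// Mean relative error between a platform histogram and the reference one (assumed definition).
#include <cmath>
#include <cstddef>
#include <vector>

double mean_relative_error_percent(const std::vector<double>& h, const std::vector<double>& h_ref) {
    double sum = 0.0;
    std::size_t n = 0;
    const std::size_t bins = h.size() < h_ref.size() ? h.size() : h_ref.size();
    for (std::size_t i = 0; i < bins; ++i) {
        if (h_ref[i] == 0.0) continue;                       // skip empty reference bins
        sum += std::fabs(h[i] - h_ref[i]) / h_ref[i];
        ++n;
    }
    return n ? 100.0 * sum / static_cast<double>(n) : 0.0;   // result in percent
}
```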
Figure 5 presents the change in time of different histograms of 58317 plasma shot.
Algorithm implementation was prepared and implemented as meaningful regarding performance and output quality, thus fulfilling the test’s assumptions.
Multiple performance tests were performed, and they are presented in the following sections. These tests measure global latency and processing performance.
Based on earlier WEST tokamak and laboratory measurements, a flux of 3 MHz (counts per second) was established as valuable for short tests, and 1 MHz for long ones. These values were recreated in the laboratory using the X-ray tube as a radiation source imitating a plasma discharge, in order to evaluate the implemented fast computation platform.
6.2. Performance Verification
The sections below will concentrate on the latency tracking of each package during actual SXR flux, maximal readout capabilities, and simulation.
To process data faster than they are delivered, the platform must free entries of the communication buffer between the FPGA and CPU as soon as possible. Pulse-related data are used for archivization (the buffer entry can be freed after the last step of the reader and data check) and in the clusterization part during integration (the buffer entry can be freed at the end of package processing). A buffer entry can be freed as soon as both parts finish processing the data package—later in the article, this is called packet confirmation.
In the sections below, uppercase letters are used for tasks corresponding to those described in Section DAG Node Details and Figure 2. In the figures, the Y-axis can be limited to exclude outlier values; the first few elements can be misleading due to the low flux measured before X-ray tube stabilization, and the last one due to a partially filled data package.
6.2.1. Latency Stability and Performance During 3 MHz—Real Radiation
Measurements were performed with pulse-related data archivization and the algorithm running. The results are presented in
Figure 6.
The top left plot presents the time spent on data processing in the readout part (task A in Section 5.2). At the beginning, between packages 128 and 1024, there is a rise in latency of around 1.8 ms due to kernel handling of page faults—exceptions raised when accessing a memory page in the process’s virtual address space. The observed rise in latency does not influence data integrity, processing stability, or correctness, and it is planned for further investigation and improvement. There is a 2.5 ms processing latency up to nearly 5000 packages due to the time spent on sample-related data archivization (see Figure 6a). The following packages were not archived due to the RAM storage limit. Reading, checking validity, and preparing data for propagation without saving took around 0.5 ms (packages after 5000 in Figure 6a).
Tube flux was stable, resulting in a stable readout of 3 MHz clusters through a 50 s measurement (left bottom figure) and near uniform radiation over channels (lower right plot). Most of the time is spent on the clusterization part (B, C task)—
Figure 6b.
Histogramming time (E task) and summed up time spent in communication between task nodes, presented in
Figure 6c,e, do not significantly increase the latency of processing.
There is a single short rise in intensity in the bottom left plot, which resulted in a temporary decrease in the time between packets read, visible in the second-from-bottom right plot.
The full processing time does not exceed 8 ms in the page-fault part and 6.75 ms elsewhere.
At 3 MHz flux, data packages arrive around every 10 ms, so with packet confirmation in less than 6.5 ms (ignoring the page-fault part), the flux does not exhaust the computation capabilities of the platform (the processing time is lower than the time between subsequent data packages).
6.2.2. Latency Stability over 8 Min of X-Ray Measurement with the X-Ray Tube in the Laboratory
Measurements were performed with the algorithm running and pulse-related data archivization. The results are presented in
Figure 7.
The performance of the reader (task A—Figure 7a) is similar to that in Figure 6a. Clusterization (Figure 7b) takes slightly less time than in the previous measurement. The platform is stable during the tests, with occasional visible spikes in clusterization (B, C tasks—Figure 7b) due to memory page handling.
At 1 MHz flux, data packages arrive every 24 to 30 ms, so with packet confirmation in less than 10 ms, AC2P has considerable processing capacity left.
There is a single short rise in intensity in the bottom left plot, which resulted in a temporary decrease in time between packets read, similar to earlier measurements.
6.2.3. Latency Stability and Performance During Simulated Maximal Performance Acquisition
Tests with the X-ray tube did not saturate the performance of the presented measurement system. This section presents a test performed to simulate the maximum load on the system. The measurement is performed without the GEM detector and configured so that all signals detected at the input are treated as valid radiation pulses, although they are all noise samples.
Measurements were carried out with measurement system variant S2, and the trigger was adjusted to gather data with maximal throughput by recording noise on channels 49 to 112. The results of 16,154,333,810 events processed from 1.41 TiB of data acquired from two FPGAs in around 1200 s are presented in Figure 8. Most of the detected clusters are rejected by the algorithm as too large to be related to a single photon; however, noise on around 40 channels made it possible to classify these data as a type of cluster (Figure 8h), in cycles of peaks lasting a few seconds (multiple high-intensity spikes in Figure 8g).
In long measurements, page faults in the system resulted in a cyclic single-package latency increase, with peaks of around 2–3 ms after roughly the 20,000th package. Profiling showed that the kernel Transparent Huge Page (THP) defragmentation mechanism (khugepaged) was responsible. The best long-term measurement outcome was obtained by adjusting the time between defragmentation cycles and the number of pages to scan [59]. The cost of this adjustment is a wider standard deviation of the processing time, rising from around 300 μs to around 700 μs.
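The tunables in question are exposed through sysfs; a sketch of adjusting the khugepaged scan interval and the number of pages scanned per cycle is shown below. The values written are placeholders, not the settings used for the reported measurements, and the operation requires root privileges.

```cpp
// Adjusting Transparent Huge Page (khugepaged) tunables via sysfs (placeholder values).
#include <fstream>
#include <string>

static bool write_sysfs(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    if (!f) return false;                                    // typically requires root
    f << value;
    return static_cast<bool>(f);
}

int main() {
    const std::string base = "/sys/kernel/mm/transparent_hugepage/khugepaged/";
    const bool ok = write_sysfs(base + "scan_sleep_millisecs", "30000")   // longer time between scan cycles
                 && write_sysfs(base + "pages_to_scan", "1024");          // fewer pages scanned per cycle
    return ok ? 0 : 1;
}
```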
The whole latency of the software processing does not exceed 7 ms. In the top left plot, there is a spike of 1 ms of additional processing time between packages 128 and 1024, and later cyclic spikes of around 0.5 ms approximately every 1024 packages, while the baseline does not exceed 0.16 ms. The aggregated communication time between nodes very rarely exceeds 40 μs. All of these characteristics need further investigation.
6.2.4. Comparison of Simulator and Real-World Measurement
The simulator was tested based on 3 MHz pulse-related data from
Section 6.2.1. A comparison between real measurement and simulation is presented in
Figure 9.
The simulator works by replacing the real FPGA hardware input to the system with pulse-related data from a buffer in memory, without changes to the interfaces (a so-called stub object). From the perspective of AC2P execution, there are no changes in the code.
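Conceptually, this works because the input source is hidden behind a common interface, so a replay stub can be substituted for the FPGA without touching the processing code. The sketch below illustrates the idea; the interface and class names are assumptions, not the actual AC2P types.

```cpp
// Swapping the real FPGA input for a replay stub behind one interface (illustrative names).
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Package { std::vector<uint8_t> bytes; uint64_t timestamp = 0; };

class InputSource {                           // interface used by the reader node (task A)
public:
    virtual ~InputSource() = default;
    virtual bool next(Package& out) = 0;      // waits until a data package is available
};

class FpgaInput : public InputSource {        // real hardware path (PCIe circular buffer)
public:
    bool next(Package&) override { return false; }  // real DMA readout omitted in this sketch
};

class ReplayStub : public InputSource {       // simulator: replays archived pulse-related data
public:
    explicit ReplayStub(std::vector<Package> archive) : archive_(std::move(archive)) {}
    bool next(Package& out) override {
        if (idx_ >= archive_.size()) return false;
        out = archive_[idx_++];               // pacing by archived timestamps omitted for brevity
        return true;
    }
private:
    std::vector<Package> archive_;
    std::size_t idx_ = 0;
};
```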
The reader and data check part took less than 0.5 ms per package in the simulation. There is no observable page-fault effect between packages 128 and 1024 (Figure 9a); moreover, the simulation is more stable in this part. The clusterization latency is similar to the original, mostly less than 150 μs. During the real flux measurement, the histogramming part processed data faster (Figure 9c); however, the difference is negligible.
Full path latency is very similar, except for the page-fault part. Communication time in the simulation is similar to that in the real one.
The variance in the difference between packet generation times is very low; however, it rises from values close to 0.01 ms up to almost 0.1 ms. A longer measurement would be needed to see whether the difference keeps rising or stays stable. The differences are negligible and do not influence the overall usability of the simulator.
Counts in time and the sum of position histograms are very similar, and the error is much lower than 1%; however, further causes of the difference can be investigated if needed.
6.3. Preparation for Online Execution on WEST Tokamak
The system was installed on WEST tokamak and executed in multiple real-world scenarios for pulse-related data archivization during multiple plasma shots [26]. It was integrated with WestBox and low-level triggering, with a configuration of up to 86 channels of GEM measurement input and 96 GiB of ramdisk storage for online data archivization. The online clusterization and histogramming parts are yet to be integrated and executed.
The current system configuration on WEST tokamak consists of 64 enabled channels (1 FPGA stream) with pulse-related data archivization (local triggering mode).
The system can continuously read events with 96 GiB of RAM reserved for the storage of raw input. In parallel, the platform's processing algorithm runs, producing the measurement results. When the raw sample storage is exhausted, new incoming data packages are no longer archived. A space of 96 GiB provides at least 80 s of maximal measurement throughput (and over 220 s of 3 MHz flux) for the 64-channel configuration. The amount of raw pulse-related signal data that can be gathered in this way is exceptional and not available in other SXR GEM-based systems known to the authors.
7. Discussion
The designed AC2P is the next generation of the earlier fast data acquisition software. It has been successfully integrated with the measurement system based on a GEM detector and FPGAs. Improvements from different perspectives are discussed below, divided into three categories.
7.1. Architecture and Functionality
There is a significant improvement in architecture. The current solution has the following properties:
Low-latency, high-performance, convenient communication implementation—better performance between graph nodes (the FastFlow library is a solution tested by the community over the years), whereas the older solution used simple synchronization primitives (currently, there is no bulky custom code for typical routines).
Convenient way of implementing complex DAG processing—the older one provided strictly linear processing with a thread pool.
Integration of parts of algorithm/processing in whole platform tasks, like per-node communication, monitoring or other platform benefits—the earlier one had an algorithm independent of DAG, and integration required additional modifications.
Thread pool integrated into the platform—now it is possible to use multiple hardware threads with the algorithm without losing platform benefits.
Accurate generation and timing quality tests in the simulator—an improvement of the earlier solution.
Many qualities of the earlier solution persist, such as per-packet latency monitoring, zero-copy operation where possible (sending pointers instead of copies through the system), saving of raw data and histograms, and continued histogram processing after the storage for pulse samples is fully occupied.
7.1.1. Communication
The main improvement and benefits come from migrating to DAG processing built on the long-tested and actively developed FastFlow library, instead of the simple synchronization primitives and thread pool approach.
New elements can be integrated easily by creating a new node in the graph, with only a few code modifications to add communication and other prerequisites.
FastFlow communication is based on lock-free SPSC queues, providing the best latency and throughput for sending data between two single nodes [28]. A limitation is that, to properly load each DAG graph node, a single FastFlow node should be pinned to exactly one hardware core.
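For completeness, the core idea of a lock-free SPSC queue is sketched below: a fixed-size ring buffer with one atomic index written only by the producer and one written only by the consumer. This is a generic textbook illustration, not FastFlow's own, more elaborate queue implementation.

```cpp
// Generic bounded lock-free SPSC ring buffer (illustration; not FastFlow's implementation).
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>     // Capacity must be a power of two
class SpscQueue {
public:
    bool push(const T& item) {                  // called only by the single producer
        const auto head = head_.load(std::memory_order_relaxed);
        const auto next = (head + 1) & (Capacity - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;   // queue full
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& item) {                         // called only by the single consumer
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;   // queue empty
        item = buffer_[tail];
        tail_.store((tail + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }
private:
    T buffer_[Capacity];
    std::atomic<std::size_t> head_{0};          // written by the producer only
    std::atomic<std::size_t> tail_{0};          // written by the consumer only
};
```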
7.1.2. Algorithm and Latency
The earlier algorithm was implemented and tuned for global trigger with additional possibilities like pulse splitting [
25]. It was unsuitable for local triggering mode, so a new general approach was created for AC2P.
The current solution is not precisely comparable to the older one due to different data-related architecture, capabilities, and optimization levels.
Earlier tests with the older solution [25] used a global trigger measurement with around 6.4 MHz flux on the detector, in a short pulse of 62 packages with 3.3 ms between each, on 64 channels. The older solution was parameterized to work with a thread pool organized into waiter threads (the number of groups processing a single package) and worker threads (the amount of hardware designated to a single waiter group).
Each package in the global triggering mode tests consisted of around 21,000 pulses, whereas during the 3 MHz measurement from Section 6.2.1, each package consisted of 43,690 pulses. Thus, local triggering mode, with half the radiation flux of the earlier measurement, provided around two times more relevant data in each package. The new approach requires much less data transfer throughput.
The best result for the global triggering algorithm was obtained with a single group (waiter) of six worker threads working on a single package, resulting in 2.5 to 3.2 ms of computation latency for real data. Processing at the maximal throughput of the system (noise acquisition) was impossible because the latency kept rising (processing was slower than the rate of incoming data).
The AC2P algorithm for local triggering is designed to work on a single package in a single node thread and processed data without rising latency in all performed tests, including artificial noise acquisition. Its thread utilization is more finely controlled and better suited to DAG processing. It uses fewer resources (two cores), resulting in 5.5 to 5.9 ms latency, with both cores working in parallel on different packets.
Due to the switch to local triggering mode, the current platform implementation does not have double buffering and provides less memory load.
The latency of the two solutions is comparable; however, the global triggering solution was saturated, whereas local triggering with AC2P was far from saturation. No long tests are available for the former. Both solutions show similar 0.5 ms latency spikes, which should be addressed later.
Improvement of the currently used algorithm and further AC2P optimizations can reduce latency further.
7.2. Algorithm Quality
The provided algorithm is the first technical result obtained with the new platform that is capable of providing meaningful outcomes for other research. It is prepared to implement any pattern matching of pulse signals into clusters. The results are preliminary and are planned to be further improved.
7.3. Simulator
The simulator proved to recreate the processing accurately. It is a tool for latency, communication and processing tests, allows the development and improvement of the AC2P solution without real hardware, and is capable of replaying original plasma measurements archived earlier.
8. Further Works
Further testing and improvement of the algorithm are planned, as mentioned in
Section 7.1.2. It will be further tuned for cluster detection in local trigger mode, and the algorithm can be expanded with pulse pile-up decomposition.
There are page-fault occurrences between packages 128 and 1024 and memory page utilization issues that do not influence data integrity or quality; these will be addressed in later works.
The current AC2P implementation, based on lock-free communication, with parameterization tuning of in-FPGA timeout (sending packages of data based on maximal time limit between subsequent packages) and package size mechanism (limiting of measured pulses per package), can provide low-latency real-time processing without losing asynchronous processing benefits. Future tests can be performed to check the dependence of different parameter values in tuning on throughput, latency, and platform performance.
Currently, dedicated electronics with FPGA can be partially migrated to other solutions like smart network interface cards (SmartNIC—network interface card with computation capabilities) or data processing unit (DPU—SmartNIC with increased computing capabilities) devices, and software processing can be carried out with GPGPU, other accelerators or utilize a second CPU.
The FastFlow implementation can be compared and verified against other solutions such as MARTe2, TBB, or other communication libraries regarding performance, latency, complexity and ease of use for the SXR computational platform [28].
Processing can be extended with more streams that divide the input stream into high-quality pulse data and different signal abnormalities, such as those mentioned in [18].
The full design is planned to be verified in a real environment during the upcoming research campaign on WEST tokamak in Cadarache, France.
9. Conclusions
The monitoring of impurity emissions is an essential aspect of tokamak plasma research. It enables the identification of methods for removing impurities from critical components, the understanding of ongoing reactions, and the improvement of future plasma stability and energy generation.
The developed system prioritizes the quality of the data gathered by its channels over their quantity. It consists of 128 channels that sample data independently, resulting in a pulse-related sample data acquisition system. To the best of the authors’ knowledge, no other GEM-based solution uses sample-based, low-latency data processing in software for SXR measurements.
The developed and described platform, named Asynchronous Complex Computation Platform (AC2P), provides solutions for low-latency, scale-up, and complex streams in SXR plasma diagnostics. The solution’s needs, implementation, configuration and optimization methods were presented. The architecture and processing DAG were described and implemented. Measurements present stable and valuable output and low-latency processing in different laboratory tests. The platform was used in real-world scenarios on WEST tokamak and has been proven to archive high-quality data. Further improvements are planned for the next campaign.
The algorithm for local triggering mode was developed, integrated into AC2P, and compared with the reference offline MATLAB implementation, providing high-quality output. The results were analyzed in terms of the quality error, which was very low and at an acceptable level. The input simulator software implemented in AC2P has proven usable, with results comparable to real measurements in off-laboratory and off-tokamak tests.
The system has been proven to process a constant 3 MHz flux of SXR radiation (photon events) over 64 channels with a single FPGA and local triggering, corresponding to around 4.85 MHz of detected raw sample-related pulses, in less than 8 ms with the clustering algorithm, producing the final energy and topology spectra online, in real time, during the measurements. The FPGA local triggering mode combined with AC2P demonstrated an upgrade of the system in terms of internal architecture, communication, multiple inputs, complex, non-linear and asynchronous processing, and algorithm integration.
In order to verify AC2P capabilities for long plasma durations with all channels, twenty-minute measurements imitating the system’s maximal throughput with both FPGAs were performed. The measurement resulted in over 16 × 10⁹ events processed from 1.41 TiB of input, with stable processing and output. Therefore, it has been proven that the system is ready for the much more intense plasma scenarios planned in future experimental campaigns on WEST and other tokamaks.
Author Contributions
Conceptualization, P.L., A.W., K.P., G.K. and M.C.; Methodology, P.L., P.K., K.P. and R.C.; Software, P.L., A.W., T.C., P.K., W.M.Z., R.C. and J.C.; Validation, P.L., A.W., T.C. and P.K.; Formal analysis, P.L., A.W., T.C., W.M.Z., K.P., G.K., M.C., K.M., D.M., J.C. and D.G.; Investigation, P.L., A.W., T.C., W.M.Z. and K.P.; Resources, P.L., A.W., T.C., G.K., M.C., K.M., D.M., J.C. and D.G.; Data curation, P.L., A.W., T.C., P.K., R.C., K.M., J.C. and D.G.; Writing—original draft, P.L. and A.W.; Writing—review & editing, P.L., A.W., K.P., R.C., M.C. and D.M.; Visualization, P.L. and A.W.; Supervision, A.W., K.P. and M.C.; Project administration, A.W., K.P., M.C. and D.M.; Funding acquisition, A.W., K.P. and M.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work has been carried out within the framework of the EUROfusion Consortium, funded by the European Union via the Euratom Research and Training Programme (Grant Agreement No 101052200—EUROfusion). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. This scientific paper has been published as part of the international project co-financed by the Polish Ministry of Science and Higher Education within the programme called “PMW” for 2023–2024 (no. 5465/HEU-EURATOM/2023/2). Part of the research was funded by the Mobility PW (edition V, agreement no. 1820/30/Z09/2023)—Excellence Initiative: Research University (IDUB) programme of Warsaw University of Technology.
Data Availability Statement
The datasets presented in this article are not readily available because the data are part of an ongoing study.
Conflicts of Interest
The authors declare no conflict of interest.
References
- International Energy Agency. Energy Technology Perspectives 2015; International Energy Agency: Paris, France, 2015.
- Nature. The Chase for Fusion Energy. Available online: https://media.nature.com/original/magazine-assets/d41586-021-03401-w/d41586-021-03401-w.pdf (accessed on 7 March 2024).
- EUROfusion. The EUROfusion Roadmap—Long Version. 2018. Available online: https://euro-fusion.org/wp-content/uploads/2022/10/2018_Research_roadmap_long_version_01.pdf (accessed on 7 March 2024).
- Albanese, R.; Ambrosino, R.; Ariola, M.; De Tommasi, G.; Pironti, A.; Cavinato, M.; Neto, A.; Piccolo, F.; Sartori, F.; Ranz, R.; et al. Diagnostics, data acquisition and control of the divertor test tokamak experiment. Fusion Eng. Des. 2017, 122, 365–374. [Google Scholar] [CrossRef]
- De Blank, H.J. MHD instabilities in tokamaks. Fusion Sci. Technol. 2010, 57, 124–136. [Google Scholar] [CrossRef]
- Ravensbergen, T.; van Berkel, M.; Perek, A.; Galperti, C.; Duval, B.P.; Février, O.; van Kampen, R.J.; Felici, F.; Lammers, J.T.; Theiler, C.; et al. Real-time feedback control of the impurity emission front in tokamak divertor plasmas. Nat. Commun. 2021, 12, 1105. [Google Scholar] [CrossRef] [PubMed]
- Khodunov, I.; Komm, M.; Havranek, A.; Adamek, J.; Bohm, P.; Cavalier, J.; Seidl, J.; Devitre, A.; Dimitrova, M.; Elmore, S.; et al. Real-time feedback system for divertor heat flux control at COMPASS tokamak. Plasma Phys. Control Fusion 2021, 63, 8. [Google Scholar] [CrossRef]
- Walker, M.L.; De Vries, P.; Felici, F.; Schuster, E. Introduction to Tokamak Plasma Control. Proc. Am. Control Conf. 2020, 2020, 2901–2918. [Google Scholar] [CrossRef]
- Li, Y.L.; Xu, G.S.; Tritz, K.; Lin, X.; Liu, H.; Chen, Y.; Li, S.; Yang, F.; Wu, Z.W.; Wang, L.; et al. Upgrade of the multi-energy soft x-ray diagnostic system for studies of ELM dynamics in the EAST tokamak. Fusion Eng. Des. 2018, 137, 414–419. [Google Scholar] [CrossRef]
- Moreau, P.; Bremond, S.; Bucalossi, J.; Reux, C.; Douai, D.; Loarer, T.; Nardon, E.; Nouailletas, R.; Saint-Laurent, F.; Tamain, P.; et al. The Commissioning of the WEST Tokamak: Experience and Lessons Learned. IEEE Trans. Plasma Sci. 2020, 48, 1376–1381. [Google Scholar] [CrossRef]
- Moreau, P.; Bucalossi, J.; Missirlian, M.; Samaille, F.; Courtois, X.; Gil, C.; Lotte, P.; Meyer, O.; Nardon, E.; Nouailletas, R.; et al. Measurements and controls implementation for WEST. Fusion Eng. Des. 2017, 123, 1029–1032. [Google Scholar] [CrossRef]
- Kurihara, K.; Lister, J.B.; Humphreys, D.A.; Ferron, J.R.; Treutterer, W.; Sartori, F.; Felton, R.; Brémond, S.; Moreau, P. Plasma control systems relevant to ITER and fusion power plants. Fusion Eng. Des. 2008, 83, 959–970. [Google Scholar] [CrossRef]
- Liu, G.; Makijarvi, P.; Pons, N. The ITER CODAC network design. Fusion Eng. Des. 2018, 130, 6–10. [Google Scholar] [CrossRef]
- Snipes, J.A.; De Vries, P.C.; Gribov, Y.; Henderson, M.A.; Hunt, R.; Loarte, A.; Nunes, I.; Pitts, R.A.; Sinha, J.; Zabeo, L.; et al. ITER plasma control system final design and preparation for first plasma. Nucl. Fusion 2021, 61, 106036. [Google Scholar] [CrossRef]
- Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419. [Google Scholar] [CrossRef] [PubMed]
- Kasprowicz, G.; Zabołotny, W.M.; Poźniak, K.; Chernyshova, M.; Czarski, T.; Ga̧ska, M.; Kolasiński, P.; Krawczyk, R.; Linczuk, P.; Wojeński, A. Multichannel Data Acquisition System for GEM Detectors. J. Fusion Energy 2019, 38, 467–479. [Google Scholar] [CrossRef]
- Zabołotny, W.M.; Kasprowicz, G.; Poźniak, K.; Chernyshova, M.; Czarski, T.; Ga̧ska, M.; Kolasiński, P.; Krawczyk, R.; Linczuk, P.; Wojeński, A. FPGA and Embedded Systems Based Fast Data Acquisition and Processing for GEM Detectors. J. Fusion Energy 2019, 38, 480–489. [Google Scholar] [CrossRef]
- Wojenski, A.; Pozniak, K.T.; Linczuk, P.; Chernyshova, M.; Kasprowicz, G.; Mazon, D.; Czarski, T.; Krawczyk, R.; Gaska, M.; Malard, P. Data Quality Monitoring Considerations for Implementation in High Performance Raw Signal Processing Real-time Systems with Use in Tokamak Facilities. J. Fusion Energy 2020, 39, 221–229. [Google Scholar] [CrossRef]
- Kolasinski, P.; Pozniak, K.; Wojenski, A.; Linczuk, P.; Kasprowicz, G.; Chernyshova, M.; Mazon, D.; Czarski, T.; Colnel, J.; Malinowski, K.; et al. High-Performance FPGA streaming data concentrator for GEM electronic measurement system for WEST tokamak. Electronics 2023, 11, 3649. [Google Scholar] [CrossRef]
- Chernyshova, M.; Czarski, T.; Malinowski, K.; Kowalska-Strzȩciwilk, E.; Poźniak, K.; Kasprowicz, G.; Zabołotny, W.; Wojeński, A.; Kolasiński, P.; Mazon, D.; et al. Conceptual design and development of GEM based detecting system for tomographic tungsten focused transport monitoring. J. Instrum. 2015, 10, P10022. [Google Scholar] [CrossRef]
- Czarski, T.; Chernyshova, M.; Malinowski, K.; Pozniak, K.T.; Kasprowicz, G.; Kolasinski, P.; Krawczyk, R.; Wojenski, A.; Linczuk, P.; Zabolotny, W.; et al. Measuring issues in the GEM detector system for fusion plasma imaging. J. Instrum. 2018, 13, C08001. [Google Scholar] [CrossRef]
- Sano, Y.; Kobayashi, R.; Fujita, N.; Boku, T. Performance Evaluation on GPU-FPGA Accelerated Computing Considering Interconnections between Accelerators. In Proceedings of the 12th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, Tsukuba, Japan, 9–10 June 2022; pp. 10–16. [Google Scholar] [CrossRef]
- Kumar, M.; Kaur, G. HPC Workflow on Diverse XPU Architectures with oneAPI. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT 2022), Hubli, India, 24–26 June 2022; pp. 1–5. [Google Scholar] [CrossRef]
- Papadopoulos, L.; Soudris, D.; Kessler, C.; Ernstsson, A.; Ahlqvist, J.; Vasilas, N.; Papadopoulos, A.I.; Seferlis, P.; Prouveur, C.; Haefele, M.; et al. EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 792–804. [Google Scholar] [CrossRef]
- Linczuk, P.; Krawczyk, R.; Wojenski, A.; Zabolotny, W.; Chernyshova, M.; Pozniak, K.; Czarski, T.; Gaska, M.; Kasprowicz, G.; Kolasinski, P.; et al. Latency and throughput of online processing in Soft X-Ray GEM-based measurement system. J. Instrum. 2019, 14, 3–9. [Google Scholar] [CrossRef]
- Chernyshova, M.; Mazon, D.; Malinowski, K.; Czarski, T.; Ivanova-Stanik, I.; Jabłoński, S.; Wojeński, A.; Kowalska-Strzȩciwilk, E.; Poźniak, K.T.; Malard, P.; et al. First exploitation results of recently developed SXR GEM-based diagnostics at the WEST project. Nucl. Mater. Energy 2020, 25, 100850. [Google Scholar] [CrossRef]
- Bourdelle, C.; Artaud, J.; Basiuk, V.; Bécoulet, M.; Brémond, S.; Bucalossi, J.; Bufferand, H.; Ciraolo, G.; Colas, L.; Corre, Y.; et al. WEST Physics Basis. Nucl. Fusion 2015, 55, 063017. [Google Scholar] [CrossRef]
- Egorushkin, M. Atomic_Queue GitHub Repository: Throughput and Latency Benchmarks (of Different Queues). Available online: https://max0x7ba.github.io/atomic_queue/html/benchmarks.html (accessed on 7 March 2024).
- Linczuk, P.; Poźniak, K.; Chernyshova, M. Soft real-time data processing solutions in measurement systems on example of small-scale GEM based x-ray spectrometer. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High Energy Physics Experiments, Lublin, Poland, 15–16 September 2022; Volume 12476. [Google Scholar] [CrossRef]
- Sauli, F. GEM: A new concept for electron amplification in gas detectors. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 1997, 386, 531–534. [Google Scholar] [CrossRef]
- Rivetti, A.; Alexeev, M.; Bugalho, R.; Cossio, F.; Da Rocha Rolo, M.D.; Di Francesco, A.; Greco, M.; Cheng, W.; Maggiora, M.; Marcello, S.; et al. TIGER: A front-end ASIC for timing and energy measurements with radiation detectors. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 2019, 924, 181–186. [Google Scholar] [CrossRef]
- Hernandez, H.; De Souza Sanches, B.C.; Carvalho, D.; Bregant, M.; Pabon, A.A.; Da Silva, R.W.; Hernandez, R.A.; Weber, T.O.; Do Couto, A.L.; Campos, A.; et al. A Monolithic 32-Channel Front End and DSP ASIC for Gaseous Detectors. IEEE Trans. Instrum. Meas. 2020, 69, 2686–2697. [Google Scholar] [CrossRef]
- Aspell, P.; Bravo, C.; Dabrowski, M.; De Lentdecker, G.; De Robertis, G.; Firlej, M.; Fiutowski, T.; Hakkarainen, T.; Idzik, M.; Irshad, A.; et al. VFAT3: A Trigger and Tracking Front-end ASIC for the Binary Readout of Gaseous and Silicon Sensors. In Proceedings of the 2018 IEEE Nuclear Science Symposium and Medical Imaging Conference Proceedings (NSS/MIC), Sydney, NSW, Australia, 10–17 November 2018. [Google Scholar] [CrossRef]
- Iakovidis, G. VMM—An ASIC for Micropattern Detectors. EPJ Web Conf. 2018, 174, 4–8. [Google Scholar] [CrossRef]
- Moraes, D.; Anghinolfi, F.; Deval, P.; Jarron, P.; Riegler, W.; Rivetti, A.; Schmidt, B. CARIOCA-0.25 μm CMOS fast binary front-end for sensor interface using a novel current-mode feedback technique. In Proceedings of the ISCAS 2001. The 2001 IEEE International Symposium on Circuits and Systems, Sydney, NSW, Australia, 6–9 May 2001; Volume 1, pp. 360–363. [Google Scholar] [CrossRef]
- Pezzotta, A.; Croci, G.; Costantini, A.; De Matteis, M.; Tagnani, D.; Corradi, G.; Murtas, F.; Gorini, G.; Baschirotto, A. GEMMA and GEMINI, two dedicated mixed-signal ASICs for Triple-GEM detectors readout. J. Instrum. 2016, 11, C03058. [Google Scholar] [CrossRef]
- Murtas, F. The GEMPix detector. Radiat. Meas. 2020, 138, 1350–4487. [Google Scholar] [CrossRef]
- Jastrzembski, E.; Abbott, D.; Gu, J.; Gyurjyan, V.; Heyes, G.; Moffit, B.; Pooser, E.; Timmer, C.; Hellman, A. SAMPA Based Streaming Readout Data Acquisition Prototype. In Proceedings of the 22nd Virtual IEEE Real Time Conference, Virtual, 12–23 October 2020. [Google Scholar] [CrossRef]
- Li, X.; Chen, C.; Fan, W.; Zhu, R.; Huang, S.; Wen, X.; He, Z.; Yang, Q.; Yin, Z. Development of a real-time magnetic island reconstruction system based on PCIe platform for HL-2A tokamak. Plasma Sci. Technol. 2021, 23, 085103. [Google Scholar] [CrossRef]
- Mindur, B.; Fiutowski, T.; Koperny, S.; Wiacek, P.; Dabrowski, W. DAQ software for GEM-based imaging system. J. Instrum. 2018, 13, C12016. [Google Scholar] [CrossRef]
- Cruz, N.; Santos, B.; Fernandes, A.; Carvalho, P.F.; Sousa, J.; Goncalves, B.; Riva, M.; Centioli, C.; Marocco, D.; Esposito, B.; et al. The Design and Performance of the Real-Time Software Architecture for the ITER Radial Neutron Camera. IEEE Trans. Nucl. Sci. 2019, 66, 1310–1317. [Google Scholar] [CrossRef]
- Kadziela, M.; Jablonski, B.; Perek, P.; Makowski, D. Evaluation of the ITER Real-Time Framework for Data Acquisition and Processing from Pulsed Gigasample Digitizers. J. Fusion Energy 2020, 39, 261–269. [Google Scholar] [CrossRef]
- Kolasinski, P.; Poźniak, K.; Czarski, T.; Chernyshova, M.; Gaska, M.; Linczuk, P.; Kasprowicz, G.; Krawczyk, R.; Wojenski, A.; Zabolotny, W. New directions in the construction of tokamak plasma impurity diagnostics systems. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High Energy Physics Experiments 2021, Wilga, Poland, 31 May–1 June 2021; Volume 12040. [Google Scholar] [CrossRef]
- Pasch, E.; Beurskens, M.N.; Bozhenkov, S.A.; Fuchert, G.; Knauer, J.; Wolf, R.C. The Thomson scattering system at Wendelstein 7-X. Rev. Sci. Instrum. 2016, 87, 11E729. [Google Scholar] [CrossRef] [PubMed]
- Neto, A.C.; Sartori, F.; Piccolo, F.; Vitelli, R.; Tommasi, G.D.; Zabeo, L.; Barbalace, A.; Fernandes, H.; Valcárcel, D.F.; Batista, A.J.N.; et al. MARTe: A Multiplatform Real-Time Framework. IEEE Trans. Nucl. Sci. 2010, 57, 479–486. [Google Scholar] [CrossRef]
- Lourenço, P.D.; Santos, J.M.; Bogar, O.; Havranek, A.; Havlicek, J.; Zajac, J.; Hron, M.; Pánek, R.; Fernandes, H. Real-time multi-threaded reflectometry density profile reconstructions on COMPASS Tokamak. J. Instrum. 2019, 14, 11023. [Google Scholar] [CrossRef]
- Lee, W.; Bauvir, B.; Karlovsek, P.; Knap, M.; Lee, S.J.; Makowski, D.; Perek, P.; Tak, T.; Winter, A.; Žagar, A. Real-Time Framework for ITER Control Systems. In Proceedings of the 18th International Conference on Accelerator and Large Experimental Physics Control Systems, Shanghai, China, 14–22 October 2021; p. MOBL02. [Google Scholar] [CrossRef]
- Huang, Z.M.; Li, B.; Zheng, G.Y.; Yuan, Q.P.; Ji, X.Q.; Xiao, B.J.; Zhou, J.; Huang, J.J.; Zhang, R.R.; Lu, T.C.; et al. A new scheme of plasma control system based on real-time Linux cluster for HL-2M. Fusion Eng. Des. 2023, 192, 113763. [Google Scholar] [CrossRef]
- Aldinucci, M.; Danelutto, M.; Kilpatrick, P.; Torquati, M. Fastflow: High-Level and Efficient Streaming on Multicore. In Programming Multi-Core and Many-Core Computing Systems; Wiley: Hoboken, NJ, USA, 2017; pp. 261–280. [Google Scholar] [CrossRef]
- Aldinucci, M.; Danelutto, M.; Meneghin, M.; Torquati, M.; Kilpatrick, P. Efficient streaming applications on multi-core with FastFlow: The biosequence alignment test-bed. Adv. Parallel Comput. 2010, 19, 273–280. [Google Scholar] [CrossRef]
- Mencagli, G.; Torquati, M.; Griebler, D.; Danelutto, M.; Fernandes, L.G.L. Raising the parallel abstraction level for streaming analytics applications. IEEE Access 2019, 7, 131944–131961. [Google Scholar] [CrossRef]
- Andrade, G.; Griebler, D.; Santos, R.; Fernandes, L.G. A parallel programming assessment for stream processing applications on multi-core systems. Comput. Stand. Interfaces 2023, 84, 103691. [Google Scholar] [CrossRef]
- Linczuk, P.; Zabolotny, W.M.; Wojenski, A.; Krawczyk, R.D.; Pozniak, K.T.; Chernyshova, M.; Czarski, T.; Gaska, M.; Kasprowicz, G.; Kowalska-Strzeciwilk, E.; et al. Evaluation of FPGA to PC feedback loop. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High Energy Physics Experiments 2017, Wilga, Poland, 27 May–5 June 2017; Volume 10445. [Google Scholar] [CrossRef]
- Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual. Available online: https://cdrdv2-public.intel.com/671488/248966-046A-software-optimization-manual.pdf (accessed on 7 March 2024).
- Agner Fog. How Good Is Hyperthreading? Available online: https://www.agner.org/optimize/blog/read.php?i=6 (accessed on 7 March 2024).
- Drepper, U. What Every Programmer Should Know about Memory. 2007. Available online: www.akkadia.org/drepper/cpumemory.pdf (accessed on 7 March 2024).
- Yun, S.; Lee, W.; Lee, T.; Park, M.; Lee, S.; Neto, A.C.; Wallander, A.; Kim, Y.K. Evaluating performance of MARTe as a real-time framework for feed-back control system at tokamak device. Fusion Eng. Des. 2013, 88, 1323–1326. [Google Scholar] [CrossRef]
- Zeuch, S.; Monte, B.D.; Karimov, J.; Lutz, C.; Renz, M.; Traub, J.; Breß, S.; Rabl, T.; Markl, V. Analyzing Efficient Stream Processing on Modern Hardware. Proc. VLDB Endow. 2018, 12, 516–530. [Google Scholar] [CrossRef]
- Google Git Repositories on Kernel. Transparent Hugepage Support. Available online: https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/refs/tags/v3.18.44/Documentation/vm/transhuge.txt (accessed on 7 March 2024).
Figure 1. The platform’s architecture is divided into two processing domains and a connection to the real-time network. Circles represent nodes of the DAG, and arrows represent edges of the DAG or communication paths. The digestion nodes on the left handle multiple data streams, which are the inputs of the DAG processing. The asynchronous processing part processes data as soon as possible and sends the results immediately. Nodes in the cyclic/real-time processing part process data as soon as possible (or in a cyclic manner) and send data synchronously with the outside network. Input and output nodes are depicted outside the DAG because, beyond processing, they handle communication with external systems.
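As an illustration only (this is not the platform's implementation), the following C++ sketch contrasts the two domains: an asynchronous node that forwards each result as soon as it is ready, and a cyclic node that drains an inter-domain queue and emits one aggregated output per cycle. The queue, node names, workload, and the 10 ms cycle length are placeholders.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Result { int value; };

std::mutex mtx;
std::queue<Result> async_out;   // edge between the asynchronous and cyclic domains
bool producer_done = false;

// Asynchronous-domain node: emits each result as soon as it is ready.
void async_node() {
    for (int i = 0; i < 50; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // stand-in for processing
        std::lock_guard<std::mutex> lk(mtx);
        async_out.push(Result{i});
    }
    std::lock_guard<std::mutex> lk(mtx);
    producer_done = true;
}

// Cyclic/real-time-domain node: wakes every cycle, drains whatever has been
// produced, and sends one aggregated message per cycle (here: a printout).
void cyclic_node() {
    const auto cycle = std::chrono::milliseconds(10);  // assumed cycle length
    auto next_tick = std::chrono::steady_clock::now() + cycle;
    while (true) {
        std::this_thread::sleep_until(next_tick);      // synchronize with the cycle boundary
        next_tick += cycle;

        std::vector<Result> batch;
        bool finished;
        {
            std::lock_guard<std::mutex> lk(mtx);
            while (!async_out.empty()) {
                batch.push_back(async_out.front());
                async_out.pop();
            }
            finished = producer_done;
        }
        std::printf("cycle output: %zu results\n", batch.size());
        if (finished && batch.empty()) break;
    }
}

int main() {
    std::thread producer(async_node);
    std::thread consumer(cyclic_node);
    producer.join();
    consumer.join();
}
```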
Figure 2. Optimized processing pipeline. The processing DAG is drawn in black; blue labels and frames collect nodes into the groups used for core pinning. Asterisks mark nodes that free space in the FPGA communication buffer once the pulse-sampled data are no longer needed.
Figure 3. Pinning of task groups to the CPU cores available in the system. The letter refers to the optimized nodes described earlier, and the number corresponds to the FPGA input channel. OS-related and non-platform threads run on the first core, together with the D group, and on the second CPU.
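For readers unfamiliar with core pinning, the minimal Linux-specific sketch below shows one way a task group's thread can be bound to a fixed core using pthread_setaffinity_np. The group-to-core mapping in the example is hypothetical and does not reproduce the layout of Figure 3.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical mapping of task groups to core indices (core 0 also hosts OS threads).
const std::map<std::string, int> kGroupToCore = {
    {"A0", 1}, {"B0", 2}, {"C0", 3}, {"E", 4}, {"D", 0},
};

// Restrict the calling thread to a single CPU core.
void pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        std::perror("pthread_setaffinity_np");
}

int main() {
    std::vector<std::thread> workers;
    for (const auto& entry : kGroupToCore) {
        const std::string group = entry.first;
        const int core = entry.second;
        workers.emplace_back([group, core] {
            pin_current_thread_to_core(core);
            std::printf("group %s pinned to core %d\n", group.c_str(), core);
            // ... the node processing loop of this group would run here ...
        });
    }
    for (auto& t : workers) t.join();
}
```

Compiled with -pthread, each worker thread stays on its assigned core, which is the mechanism behind the group layout shown in the figure.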
Figure 4. Comparison of the platform and reference (MATLAB) algorithms for plasma pulse number 58317 data, performed offline on more than 44 × 10⁶ detected pulse clusters. The differences between the implementations, shown in the middle row of plots, are an order of magnitude lower than the compared values in the top row. The X-axes of the relative-error histograms in the bottom row are limited to one standard deviation around the mean value. Reference values equal to 0 were set to NaN during the relative-error calculation.
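The comparison metrics can be illustrated with a short sketch: a per-bin difference and relative error, with the relative error set to NaN wherever the reference bin equals 0, as stated in the caption. The bin values below are made up for demonstration.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
    // Illustrative histogram bins; not measured data.
    std::vector<double> reference = {100, 250, 0, 42};  // reference (MATLAB) implementation
    std::vector<double> platform  = {101, 249, 1, 42};  // platform implementation

    for (std::size_t i = 0; i < reference.size(); ++i) {
        double diff = platform[i] - reference[i];
        // Reference bins equal to 0 yield NaN and are thus excluded from the statistics.
        double rel = (reference[i] == 0.0)
                         ? std::numeric_limits<double>::quiet_NaN()
                         : diff / reference[i];
        std::printf("bin %zu: diff=%g rel=%g\n", i, diff, rel);
    }
}
```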
Figure 5. Histograms generated in 10 ms cycles for plasma pulse 58317 data, obtained offline with the AC2P algorithm. Summing the values over time reproduces the histograms in Figure 4.
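The relation between the per-cycle and whole-discharge histograms amounts to a bin-wise sum, as the following sketch with synthetic pulse data illustrates; the bin count, cycle count, and pulse rate are placeholders, not the detector's actual parameters.

```cpp
#include <array>
#include <cstdio>
#include <random>
#include <vector>

constexpr std::size_t kBins = 16;          // placeholder histogram size
using Histogram = std::array<long, kBins>;

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> bin(0, kBins - 1);

    std::vector<Histogram> per_cycle(100, Histogram{});  // e.g. 100 cycles of 10 ms
    Histogram total{};

    // Each detected pulse increments one bin of the current cycle's histogram.
    for (auto& cycle_hist : per_cycle)
        for (int pulse = 0; pulse < 1000; ++pulse)
            ++cycle_hist[bin(rng)];

    // Summing the per-cycle histograms bin by bin reproduces the overall histogram.
    for (const auto& cycle_hist : per_cycle)
        for (std::size_t b = 0; b < kBins; ++b)
            total[b] += cycle_hist[b];

    long sum = 0;
    for (long v : total) sum += v;
    std::printf("total counts: %ld (expected %d)\n", sum, 100 * 1000);
}
```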
Figure 6. Measurement of a 3 MHz SXR flux generated with an X-ray tube. The figure is divided into eight subfigures: (a–f) show per-package timing characteristics, with (a) the processing latency in node A of the optimized DAG, (b) in node B or C, (c) in node E, (d) the whole processing latency between data availability in memory and storage of the histograms on the ramdisk, (e) the accumulated latency of inter-node communication during data package processing, and (f) the time between consecutive package arrivals; (g) shows the time histogram and (h) the position histogram.
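The quantities plotted in subfigures (a–f) of Figures 6–9 are obtained by timestamping each data package at node boundaries. A simplified, self-contained sketch of such instrumentation is given below; the workload, package count, and printed quantities are illustrative rather than the platform's actual measurement code.

```cpp
#include <chrono>
#include <cstdio>
#include <ratio>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PackageTiming {
    Clock::time_point arrived;   // data available in memory
    Clock::time_point finished;  // e.g. histogram stored on the ramdisk
};

int main() {
    std::vector<double> node_latency_us;
    Clock::time_point previous_arrival{};

    for (int pkg = 0; pkg < 5; ++pkg) {
        PackageTiming t;
        t.arrived = Clock::now();
        if (pkg > 0) {
            // Inter-arrival time between consecutive packages, cf. subfigure (f).
            double gap = std::chrono::duration<double, std::micro>(t.arrived - previous_arrival).count();
            std::printf("package %d: inter-arrival %.1f us\n", pkg, gap);
        }
        previous_arrival = t.arrived;

        std::this_thread::sleep_for(std::chrono::microseconds(200));  // stand-in for node processing
        t.finished = Clock::now();

        // Per-package processing latency in a node, cf. subfigures (a)-(d).
        double latency = std::chrono::duration<double, std::micro>(t.finished - t.arrived).count();
        node_latency_us.push_back(latency);
        std::printf("package %d: node latency %.1f us\n", pkg, latency);
    }
}
```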
Figure 7. An 8 min measurement of a 1 MHz SXR flux. The figure is divided into eight subfigures: (a–f) show per-package timing characteristics, with (a) the processing latency in node A of the optimized DAG, (b) in node B or C, (c) in node E, (d) the whole processing latency between data availability in memory and storage of the histograms on the ramdisk, (e) the accumulated latency of inter-node communication during data package processing, and (f) the time between consecutive package arrivals; (g) shows the time histogram and (h) the position histogram.
Figure 8. Characteristics of a noise measurement without pulse-related data archiving for two FPGA streams. The figure is divided into eight subfigures: (a–f) show per-package timing characteristics, with (a) the processing latency in node A of the optimized DAG, (b) in node B or C, (c) in node E, (d) the whole processing latency between data availability in memory and storage of the histograms on the ramdisk, (e) the accumulated latency of inter-node communication during data package processing, and (f) the time between consecutive package arrivals; (g) shows the time histogram and (h) the position histogram. Blue denotes the first stream and green the second. Subfigure (c) (histogramming) and the histograms in (g,h) contain only one line because both streams are merged in the histogramming part.
Figure 9. Characteristics of the 3 MHz measurement together with a simulation based on its output. The figure is divided into eight subfigures: (a–f) show per-package timing characteristics, with (a) the processing latency in node A of the optimized DAG, (b) in node B or C, (c) in node E, (d) the whole processing latency between data availability in memory and storage of the histograms on the ramdisk, (e) the accumulated latency of inter-node communication during data package processing, and (f) the time between consecutive package arrivals; (g) shows the time histogram and (h) the position histogram. Blue denotes the original measurement, green the simulated one, and red the difference between the original and the simulated. In plots where the values nearly coincide, only the difference (red) is shown.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).