1. Introduction
As a leading platform for heavy-ion scientific research in China and one of the few large-scale full-ion accelerating systems in the world, the Heavy Ion Research Facility in Lanzhou (HIRFL) [1] and the High Intensity Heavy-ion Accelerator Facility (HIAF) [2] have put forward higher requirements for detector technology. The development of the Monolithic Active Pixel Sensor (MAPS) [3,4] provides an unprecedented signal-to-noise ratio, energy resolution, spatial resolution, and readout speed in particle physics experiments. A MAPS integrates the sensing node and the readout circuit into one chip. It collects the charge deposited by particles passing through the sensor and thereby measures the positions of particle hits. In addition, some newly designed MAPSs [5,6] also record the energy and timing information of the particle hits. Typical MAPS-based detectors at HIRFL and HIAF are the vertex and tracking detectors of the Electron-Ion Collider in China (EicC) [7] and the beam-positioning system at the physics terminals [8]. Despite these benefits, adopting MAPS technology increases the data volume. Therefore, high-performance cluster-locating technology is strongly required in MAPS-based systems. The energy deposited by a particle hit is collected by pixels in the MAPS, and these pixels form a connected region called a cluster. Cluster-locating technology finds the clusters induced by particle hits in a given event; its aim is to compress the data volume and improve the accuracy and speed of particle detection.
Cluster-locating algorithms have already been widely used in physics experiments, such as those at the European Organization for Nuclear Research (CERN) [9,10,11,12] and the Compressed Baryonic Matter (CBM) experiment [13]. In particle physics experiments, charged particles are produced during beam collisions. These charged particles interact with the surrounding material as they pass through the vertex and tracking detectors, depositing energy that can be detected and measured. The track and energy information are critical for studying the properties of these particles. A cluster-locating algorithm helps identify the hit positions of charged particles in each detector layer along their tracks with high efficiency and accuracy, which contributes to the reconstruction of the particles’ trajectories and energy information. In addition, an online cluster-locating algorithm also benefits data compression, since only data that contain clusters need to be stored.
The commonly used cluster-locating algorithms mainly fall into the following categories:
- 1. In regional detection algorithms [14], samples (images) are divided into partitions as a prerequisite for consistency judgment. Each partition contains the maximum number of consistent pixels. For example, to locate the area of a cluster, we start from the seed pixel of the cluster, which has the highest energy, then look for fired pixels in the seed pixel’s neighborhood and take each fired pixel as a new seed. This domain expansion is repeated until no more fired pixels are found. The key to this method is to set reasonable guidelines for domain expansion.
- 2. Edge detection methods, such as Canny [15] and the Sobel operator, aim to find items’ edges. Their performance depends strongly on the quality of the edges’ features. For example, to locate clusters with edge detection, we must obtain the edge information of the clusters. However, the edges of clusters may be blurred, so the continuity and closure of the edges cannot be guaranteed; in that case, the detection accuracy can be relatively poor. Thus, edge detection methods have weak robustness for cluster location.
- 3. Clustering algorithms [16] are unsupervised machine learning algorithms. For example, to locate clusters, a clustering algorithm divides each data frame into two disjoint subsets: the background and the foreground (clusters in our case). Taking the value of each pixel as the input, a distance metric (Euclidean distance, Manhattan distance, etc.) measures the similarity between the data in each subset. This process is iteratively optimized so that the data in the same subset are as similar as possible and the data in different subsets are as different as possible. Finally, we obtain two sets: the pixels of the objects (clusters) and the background pixels. Clustering algorithms are easy to deploy and execute. However, they are usually sensitive to isolated pixels and do not exploit the spatial information provided by the pixels in the samples. As a result, clustering algorithms rely highly on the quality of the features.
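The regional detection approach in category 1 can be sketched as a simple region-growing routine. The frame layout, the zero threshold, and the 8-connected expansion rule below are illustrative assumptions, not the exact guidelines of any particular experiment:

```python
from collections import deque

def grow_cluster(frame, seed, threshold=0.0):
    """Region-growing sketch: starting from the seed pixel (the one
    with the highest energy), repeatedly absorb fired neighbours
    until none remain. `frame` is a 2-D list of pixel energies."""
    rows, cols = len(frame), len(frame[0])
    cluster = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # 8-connected neighbourhood: one possible expansion guideline
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and (nr, nc) not in cluster
                        and frame[nr][nc] > threshold):
                    cluster.add((nr, nc))
                    queue.append((nr, nc))
    return cluster
```

The returned set of coordinates is the located cluster; its bounding box or energy-weighted centroid then gives the hit position.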
As one of the most prominent trends in scientific research, deep learning techniques are widely applied in many fields, such as computer vision, natural language processing, and speech recognition. Object detection techniques using deep learning have been actively studied [17,18]. They demonstrate significant potential to overcome the limitations of the cluster-locating algorithms mentioned above.
Object detection algorithms with deep learning include two-stage and one-stage detection models. A two-stage model generates a series of candidate boxes, then extracts the features of these candidate boxes, and finally classifies and regresses the target. In contrast, a one-stage detection algorithm directly classifies and regresses the target. The RCNN series [17,18,19,20] is an essential representative of the two-stage detection models and presents high accuracy in target location and recognition. On the other hand, the YOLO series [21,22,23,24,25] is a popular family of one-stage detection models with a good balance between speed and accuracy.
Since the advent of AlexNet [26], convolutional neural networks (ConvNets) have been the dominant model architecture for computer vision and a widely researched topic. Several more effective and more scalable ConvNets have since been proposed, such as VGGNet [27], GoogLeNet [28], ResNe(X)t [29,30], DenseNet [31], MobileNet [32], and EfficientNet [33]. ConvNets have been widely used as backbone networks to improve performance in visual tasks. Since a ConvNet has inductive bias and translation invariance, it provides a good design principle for object detection.
On the other hand, self-attention [34] is a more recent advance. Unlike the convolution operation, the key to self-attention is to produce a weighted average of values computed from hidden units. The interactions between the input signals are therefore determined by the signals themselves rather than fixed by their relative positions, as in convolution, which allows self-attention to capture long-range interactions. With the success of the Transformer [34], based on the self-attention mechanism, in natural language processing (NLP), many studies have tried to introduce the self-attention mechanism into computer vision, where the Transformer has also become a mainstream network architecture. Recently, the Vision Transformer (ViT) [35], with only vanilla Transformer layers, achieved good performance on ImageNet-1K [36]. In addition, ViT pre-trained on the large-scale weakly labeled JFT-300M dataset [37] can obtain results comparable to state-of-the-art (SOTA) ConvNets. Swin Transformer [38] was the first to show that the Transformer can be used as a generic visual backbone and achieve SOTA performance in a range of vision tasks.
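The weighted-average behaviour of self-attention described above can be illustrated with a minimal single-head, scaled dot-product sketch in NumPy; the projection matrices `wq`, `wk`, and `wv` are illustrative placeholders for learned weights:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Minimal scaled dot-product self-attention (one head): each
    output row is a weighted average of the value vectors, with
    weights computed from the inputs themselves rather than from
    fixed relative positions as in convolution."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                               # weighted average of values
```

Because every position attends to every other position, the receptive field spans the whole input in a single layer, which is the source of the long-range interactions mentioned above.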
2. Method
To apply a deep learning approach in cluster-locating algorithms for CMOS pixel sensors in physics experiments, we performed the following research:
We performed a beam test of the Topmetal-M [6] silicon pixel sensor at the External Target Facility of the Heavy Ion Research Facility in Lanzhou. In this test, we recorded cluster data induced by a heavy-ion beam with an energy of 320 MeV/u and an average intensity of several thousand particles per second. After pre-processing, we used the images that contained the cluster data to form the dataset for training and verification.
We constructed both one- and two-stage detection algorithms, as shown in Figure 1 and Figure 2. We use Transformer-based and CNN-based backbones for feature extraction at different stages in the one- and two-stage detection algorithms. Additionally, each model comes in four different sizes. The two-stage detection algorithms first generate region proposals from images and then generate the final object boxes from the region proposals, while the one-stage object detection algorithms skip the region proposal stage and directly generate the object’s class probability and position coordinates.
2.1. The Backbone Network and Its Variants
The backbone network is an essential feature extractor for an object detection task. It takes an image as the input and outputs the feature maps of the corresponding input image. However, most backbone networks used for object detection come from classification networks, with the last fully connected layer of the classification task removed. Therefore, a more capable backbone network is needed to meet the requirements of high-precision applications. This study used two backbone networks: the Swin Transformer backbone based on the Transformer and the ConvNeXt backbone based on the convolutional neural network.
In addition, to scale the model’s capacity, backbone variants of different sizes were used, namely, Swin-T/S/B/L and ConvNeXt-T/S/B/L; in essence, the model is progressively deepened and widened. As shown in Table 1, the ConvNeXt variants differ in the number of channels and blocks in each stage. As summarized in Table 2, the Swin Transformer variants differ in the number of channels in the hidden layers, the number of layers, and the number of heads in multi-head attention at each stage. The number of channels and heads doubles at each new stage. ConvNeXt and Swin Transformer have the same number of channels at each stage for variants of the same size, such as Swin-T and ConvNeXt-T.
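As a rough sketch of the scaling pattern described above, the variant configurations as published in the original ConvNeXt and Swin Transformer papers (cf. Tables 1 and 2) can be written out as plain data:

```python
# ConvNeXt variants: (channels per stage, blocks per stage),
# values as given in the original ConvNeXt paper.
CONVNEXT = {
    "T": ((96, 192, 384, 768),   (3, 3, 9, 3)),
    "S": ((96, 192, 384, 768),   (3, 3, 27, 3)),
    "B": ((128, 256, 512, 1024), (3, 3, 27, 3)),
    "L": ((192, 384, 768, 1536), (3, 3, 27, 3)),
}

# Swin variants: (base channel C, layers per stage, heads per stage),
# values as given in the original Swin Transformer paper.
SWIN = {
    "T": (96,  (2, 2, 6, 2),  (3, 6, 12, 24)),
    "S": (96,  (2, 2, 18, 2), (3, 6, 12, 24)),
    "B": (128, (2, 2, 18, 2), (4, 8, 16, 32)),
    "L": (192, (2, 2, 18, 2), (6, 12, 24, 48)),
}
```

Note how the channel (and head) counts double from stage to stage, and how a Swin variant and the ConvNeXt variant of the same size letter start from the same base channel width.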
2.2. The Two-Stage Model
As shown in Figure 1, the two-stage model performs feature extraction on the input image through a neural network with a feature pyramid network (FPN) [39]. The backbone extracts region-of-interest (RoI) features at different scales from different stages of the feature pyramid. The FPN contains a bottom-up path and a top-down path with lateral connections. The bottom-up path builds a hierarchical structure that extracts features at different scales. The top-down path produces higher-resolution features through upsampling, with stronger semantic information. Each lateral connection combines feature maps of the same scale from the top-down and bottom-up paths to form a new pyramid level, and each level makes independent predictions. Higher-resolution feature maps help detect small targets, while lower-resolution feature maps contain rich semantic information, so constructing the FPN is essential.
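The top-down pass with lateral connections can be sketched as follows; `lateral` stands in for the 1×1 lateral convolution, and nearest-neighbour upsampling is an illustrative simplification:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(bottom_up, lateral):
    """Top-down FPN pass: start from the coarsest map, upsample it,
    and merge with the laterally connected map of the same scale.
    `bottom_up` is a list of (C, H, W) maps ordered fine to coarse."""
    pyramid = [lateral(bottom_up[-1])]
    for feat in reversed(bottom_up[:-1]):
        merged = lateral(feat) + upsample2x(pyramid[0])
        pyramid.insert(0, merged)   # keep fine-to-coarse ordering
    return pyramid
```

Each element of the returned pyramid is one level on which independent predictions are made.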
The Region Proposal Network (RPN) [17] has classification and regression branches. The classification branch divides the anchor boxes into positive and negative anchors through the softmax function. The regression branch calculates the bounding-box regression offsets of the anchor boxes to obtain more accurate candidate boxes. Region proposals with a wide range of scales and aspect ratios are generated by the RPN, and the classification of region proposals is completed simultaneously, dividing the candidates into two categories: background and target. The RPN accelerates the generation of candidate object bounding boxes because it shares a common set of convolution layers with the detection network. In addition, objects of different sizes are detected using multi-scale anchors as references. The anchors, which have three aspect ratios, greatly simplify the generation of region proposals of various sizes without needing multiple scales of images or features. This method parameterizes the candidate bounding box relative to the reference anchor box and optimizes the predicted box’s position by measuring the distance between it and its corresponding ground-truth box. The bounding boxes are then post-processed by clipping the part of a positive anchor that extends beyond the image boundary to the image edge, removing tiny positive anchors, and applying non-maximum suppression (NMS) to the remaining positive anchors. The RPN only makes a preliminary prediction of the location of the object to be detected.
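The final post-processing step, greedy NMS, can be sketched as follows; the IoU threshold of 0.7 is a common RPN choice, used here only as an illustrative default:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.7):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thr]
    return keep
```

Only the kept proposals are passed on to the second stage for feature extraction and final classification.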
Then, we use RoIAlign [20] to extract a small feature map from each proposal. RoIAlign converts the area corresponding to the position coordinates of a region proposal into a fixed-size feature, which is convenient for the subsequent classification and bounding-box regression operations. RoI pooling uses two quantization steps, resulting in misalignments between the input and output pixels of the network. To solve this misalignment problem, RoIAlign avoids quantization and keeps the floating-point coordinates of each proposal on the feature map. First, bilinear interpolation computes the exact values at four regularly sampled positions in each bin of the proposal’s feature map. Then, max or average pooling aggregates the results to obtain a fixed-length feature for each proposal. Because it keeps floating-point coordinates, RoIAlign improves the accuracy of small-target detection: features remain aligned with small targets in the original image, so their positions can be recovered more precisely. This property suits our particle cluster detection task well.
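The bilinear interpolation at the heart of RoIAlign can be sketched for a single floating-point sample location on a 2-D feature map:

```python
def bilinear_sample(feature, y, x):
    """Bilinear interpolation at the floating-point location (y, x)
    of a 2-D feature map, as used by RoIAlign instead of quantizing
    the coordinates to the nearest pixel."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feature[y0][x0]
            + (1 - dy) * dx * feature[y0][x1]
            + dy * (1 - dx) * feature[y1][x0]
            + dy * dx * feature[y1][x1])
```

RoIAlign evaluates this at four regularly spaced points per bin and pools the results, so no coordinate is ever rounded.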
Finally, the network head for bounding-box recognition (classification and box regression) is applied to each proposal. The aim of classification here is to identify which category all previous positive anchors belong to, which differs from the only two-category classification in the RPN. Then, bounding box regression is performed on the proposals to obtain a more accurate final predicted box.
In this paper, the backbone variants we use are Swin Transformer, based on Transformer architectures with attention mechanisms, and ConvNeXt [40], based on convolutional neural network (CNN) architectures.
2.3. The One-Stage Model
As shown in Figure 2, the one-stage model generally comprises three parts: a backbone, a neck, and a head. The backbone mainly determines the feature representation ability; meanwhile, its design significantly impacts the inference efficiency, since it carries a high computation cost. The backbone is divided into multiple stages to extract features at different scales and to facilitate feature fusion later. The neck, located between the backbone and the head, further improves the diversity and robustness of the features. It aggregates low-level physical features with high-level semantic features and then builds pyramid feature maps at all levels. A neck is composed of several bottom-up and top-down paths, and building a pyramid on the feature maps better handles changes in the scales of the targets. Famous examples are the feature pyramid network (FPN) [39] and the Path Aggregation Network (PAN) [41]. The FPN fuses feature maps of different resolutions to handle targets of different scales in detection. The PAN increases the network’s depth, improves its robustness, and enhances the target-positioning ability to a certain extent. The FPN propagates strong semantic features from top to bottom, and the PAN conveys strong positioning features from bottom to top; hence, combining these two modules effectively realizes target positioning. The head consists of several convolutional layers, which adjust the number of channels on the three scales of the neck network. It predicts the final detection results according to the multi-level features assembled by the neck. A decoupled head for classification and localization, widely used in object detection [42,43], speeds up convergence.
The network architecture of our one-stage model comprises the following modules:
- 1. The backbones for feature extraction over an entire image are Swin Transformer and ConvNeXt. Swin Transformer constructs hierarchical feature maps, achieves linear computational complexity by computing self-attention locally within non-overlapping windows, and introduces shifted windows for cross-window connections to enhance the modeling power. ConvNeXt adopts a hierarchical structure and uses a larger convolution kernel size to increase the receptive field.
- 2. The neck, used to aggregate feature maps from different stages, consists of an FPN and a PAN, which enhance the entire feature hierarchy with the accurate localization signals of lower layers through bottom-up path augmentation. To produce outputs at three different scales, the FPN is constructed from the feature maps of backbone stages 2∼4.
- 3. The head, used to predict the classes and bounding boxes of objects, is that of YOLOX [44], which is divided into classification, regression, and IoU branches.
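The decoupled-head structure of the third module can be sketched as follows; `conv` is a placeholder for a convolution layer, and the channel widths are illustrative assumptions rather than the exact YOLOX configuration:

```python
import numpy as np

def decoupled_head(feature, n_classes, conv):
    """Sketch of a YOLOX-style decoupled head: a shared stem feeds
    two parallel branches, one for classification and one for
    regression, with the IoU (objectness) output attached to the
    regression branch. `conv(x, c_out)` stands in for a conv layer."""
    stem = conv(feature, 256)
    cls_feat = conv(conv(stem, 256), 256)   # classification branch
    reg_feat = conv(conv(stem, 256), 256)   # regression branch
    cls_out = conv(cls_feat, n_classes)     # per-location class scores
    reg_out = conv(reg_feat, 4)             # box offsets
    iou_out = conv(reg_feat, 1)             # IoU / objectness score
    return cls_out, reg_out, iou_out
```

Separating the two branches lets classification and localization learn their own features, which is the convergence speed-up noted above.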
5. Conclusions
To compress the large volume of online data and improve the accuracy and speed of extracting clusters induced by particle hits at HIRFL and HIAF, we studied the performance of a deep learning approach applied to cluster-locating algorithms. We constructed one-stage and two-stage detection algorithms with Swin Transformer and ConvNeXt as the backbones. Heavy-ion tests were performed on the Topmetal-M silicon pixel sensor to establish a dataset for training and validation. In general, the two-stage detection algorithms demonstrate significantly better accuracy in object localization and recognition, and their weakness in speed is acceptable for the current applications at HIRFL and HIAF. For example, the two-stage detection algorithm (ConvNeXt-L-RCNN) demonstrates the best detection accuracy of 68.0% AP, while the one-stage detection algorithm (ConvNeXt-T-YOLOX) achieves the fastest speed of 57.94 FPS. The deep-learning-based cluster-locating algorithm presents nearly the same detection efficiency as the traditional Selective Search approach while being one order of magnitude faster. Furthermore, a study of the fake rate shows that the one-stage detection algorithms hold great potential for online data compression, while the two-stage detection algorithms can perform high-precision cluster detection tasks. The research in this paper aims to provide practical experience for applying cluster-locating algorithms in physics experiments. In the future, we expect to improve the detection algorithms to reach a good balance between speed and accuracy.