1. Introduction
Cosmic rays [1,2] are high-energy charged particles that originate from outer space. They range from light nuclei (such as protons and helium nuclei, at energies around 10^18 eV (1 EeV)) up to heavy nuclei (such as iron, at increasing energies) [3]. Their energies [1] can reach up to 10^20 eV (≈16 Joules), larger than the energy levels reached in the most advanced particle accelerator on Earth, the Large Hadron Collider (LHC) at CERN, where the maximum collision energy attainable is 14 TeV [4]. Various astrophysical processes accelerate these particles, enabling them to reach such high energies. Based on their origins, cosmic rays are classified as [5] solar energetic particles (originating from the Sun, mainly from solar flares), galactic cosmic rays (originating from our galaxy), or extra-galactic cosmic rays (originating from external galaxies). Possible galactic and extra-galactic sources are accretion shocks in large-scale structures, active galactic nuclei, gamma-ray bursts, and neutron stars or magnetars [6].
Cosmic rays are important agents in multi-messenger astronomy—the coordinated observation and interpretation of signals from various information carriers, called messengers, such as electromagnetic radiation, gravitational waves, neutrinos, and cosmic rays [5,7,8,9,10]. For example, during solar flares, both electromagnetic radiation and cosmic rays are emitted [11]. Their coordinated signals reveal important information that aids researchers in identifying their source.
As the energy increases, the flux of cosmic rays decreases rapidly (Figure 1): the flux of particles with energies above 10^20 eV is less than one particle per km² per century. Two important transition features are observed: the knee (≈3 · 10^15 eV) and the ankle (≈5 · 10^18 eV). The knee is related to galactic cosmic rays, while the ankle marks the shift from galactic to extra-galactic origins. Due to their low flux, only a few events have been detected for cosmic rays with energies around 100 EeV. The highest-energy cosmic ray ever observed was detected in 1991, with an estimated energy of 320 EeV. The flux of cosmic rays is strongly suppressed above the cutoff (≈5 · 10^19 eV), but there is no consensus on how much of this limit is intrinsic to the sources and how much is due to energy losses during intergalactic propagation [12,13].
Direct measurements of cosmic rays at the highest energies are challenging [1]. This is due both to the decreasing flux with increasing energy and to the fact that ultra-high-energy cosmic rays (UHECRs), which are particles with energies exceeding 1 EeV, can pass through detectors placed at the top of the atmosphere without undergoing interactions.
Indirect measurements are preferred for UHECRs. They are conducted by measuring not the primary particle itself but the secondary particles produced in an extensive cascade following the initial collision of the cosmic ray with air molecules in the atmosphere. This process is known as an extensive air shower (EAS) (Figure 2), which consists of hadronic, muonic, and electromagnetic components. The electromagnetic component, composed of electrons, positrons, and photons, carries about 90% of the primary particle’s energy [15], making it the most valuable component for energy measurements. The remaining 10% is carried by muons and neutrinos.
After the initial collision of a cosmic ray with air molecules in our atmosphere, secondary particles are produced in a cascade. Their number increases until it reaches a maximum at a certain atmospheric depth, called X_max. Beyond this depth, the number of secondary particles decreases, due to ionization losses [15].
The footprint of the EAS provides valuable information about the cosmic ray, including its energy, arrival direction, core position on the ground, and the shower maximum (X_max). The reconstruction of X_max, i.e., the depth in the atmosphere at which the energy deposition in the shower is greatest, is essential for determining the mass composition of the cosmic rays (Figure 3). These measurements can be compared with X_max predictions from air shower simulations for the two extremes of primary particle types (proton and iron nuclei) for different hadronic interaction models.
The Pierre Auger Observatory [18] is the world’s largest cosmic ray experiment, operating in the western Mendoza province, Argentina, since 2004. Its main goal is to understand the origin and nature of UHECRs [19]. The observatory is unique in that it combines complementary detection techniques (Figure 4): 27 fluorescence telescopes capture the ultraviolet light emitted by excited nitrogen molecules in the atmosphere; 1660 water Cherenkov detectors (ground-based tanks filled with purified water, each overlooked by three photo-multipliers that detect Cherenkov radiation) are dispersed over an area of 3000 km² with a 1.5 km spacing between them; additionally, a radio antenna array is sensitive in the frequency range of 30–80 MHz. Starting from an initial array of only 153 radio antennas (the Auger Engineering Radio Array), the Auger upgrade (AugerPrime) [20] will include, among other extensions, a radio antenna on top of each ground detector, enabling electric field detection over the full map (Figure 5).
As the flux of UHECRs decreases with increasing energy, Monte Carlo cosmic ray air shower simulation software, such as CORSIKA 7 (COsmic Ray SImulations for KAscade) [22], is used to generate statistically significant datasets. Various radio antenna response studies have been conducted on CoREAS [23] (CORSIKA-based Radio Emission from Air Showers) simulations, part of CORSIKA, with antennas from the full Auger map, on the order of tens or even a few hundred [24,25].
The main emission mechanisms behind the electric field recorded by the antennas are the geomagnetic and Askaryan effects [26]. The former is the dominant contribution and depends on the geomagnetic position of the experiment, while the latter is caused by the negative charge excess that develops in the shower and depends on the shower direction.
The radio pulses captured by antennas provide valuable data for reconstructing shower observables [27], such as energy, X_max, and the zenith and azimuth angles [25,28]. This is because the radio footprint left by an EAS depends significantly on the properties of the UHECR. For example, inclined showers with a larger zenith angle produce an elongated, elliptical footprint, whereas vertical showers with a small zenith angle generate a more circular footprint. The azimuth angle determines the rotation of the footprint’s geometry. Additionally, the energy of the UHECR plays a crucial role in the energy fluence and the electric field strength captured by radio antennas [29].
Machine learning (ML) has become widely used in physics. Depending on their complexity, machine learning models are classified as shallow or deep. Shallow learning algorithms, such as linear regression, logistic regression, decision trees, and support vector machines, are more suited to simpler tasks that require fewer parameters and less computational power. They often excel in scenarios where the relationship between input features and output is relatively straightforward and the dataset is not overly large.
On the other hand, deep learning algorithms leverage neural networks with many layers, which enables them to model complex, non-linear relationships in data. These deep neural networks are particularly well-suited for tasks that involve high-dimensional data and intricate patterns, such as image recognition, natural language processing, and speech recognition. Deep learning models, such as convolutional neural networks (CNNs), require substantial computational resources and large datasets to perform effectively, but at the same time they offer significant improvements in accuracy and capability for these complex tasks.
Since the radio footprint is highly correlated with many properties of UHECRs, it can be effectively used in image recognition techniques to infer the nature of the primary particle. This is where deep learning, particularly CNNs, plays an important role. By leveraging convolutional layers that use information from neighboring pixels, CNNs can detect patterns in images and emphasize the energy deposit and the geometry of the radio footprint left by an induced EAS.
There have been several studies on deep learning processing of radio data using CNNs, such as energy estimation using ZHAireS simulations [30], and classification between signal and background noise using CoREAS simulations [31,32].
We propose a convolutional neural network architecture that classifies simulated CoREAS events between those generated by protons and those generated by iron nuclei. The model was trained and evaluated on a dataset of ≈3000 events (≈2000 proton and ≈1000 iron) from the Pierre Auger Collaboration. The paper is arranged as follows: in Section 2, we describe the datasets used in the analysis, the radio imaging techniques employed, and the distribution of the numerical features. Section 3 presents a first shallow machine learning analysis on the numerical features only and the performance of the algorithms employed. Section 4 describes the architecture of the neural network, the training process, and the metrics used to assess the model’s performance. We conclude in Section 5 with an evaluation of the model’s success in the classification task.
2. Data Exploration and Preprocessing
Different signal properties are used in radio analysis, such as the signal-to-noise ratio (SNR), which indicates how effectively the signal can be distinguished from noise; the electric field strength, measured in μV/m, which describes the voltage induced by the signal in a one-meter-long radio antenna; and the energy fluence, measured in eV/m², which describes the energy deposited per unit area. These properties can be analyzed for each polarization component of the electric field, or as a magnitude (for the SNR and the electric field strength), or as a sum (for the energy fluence) over all components.
We first considered our own dataset: a small simulation library produced with CORSIKA v7.7420 and the CoREAS option for two types of primary particles (proton and iron) at a fixed primary energy, for different air shower geometries defined by zenith (from 70° to 85°, in steps of 5°) and azimuth (from 0° to 360°, in steps of 45°) angles. This resulted in a total of 32 simulated events per primary particle. The radio pulses were simulated for a fixed shower core in the center of the array and for all radio antennas located within a certain distance of the shower axis (in the shower plane), r < max(4 · r_Che(zenith), 1500 m), where r_Che is the radius of the Cherenkov ring [33]. The high-energy hadronic interaction model was QGSJETII-04; the low-energy interaction model was UrQMD, with a thinning parameter of 1 × 10^-6. The atmospheric conditions were representative of the Pierre Auger Observatory in October at Malargüe, with the corresponding magnetic field at the experiment location.
The CoREAS simulations in our dataset were performed on different subsets of radio antennas from the complete ideal Auger map of 3000 km², at 1400 m a.s.l. This ideal map fills the gaps in the real Auger map, ensuring that all radio antennas are evenly spaced. In Auger Phase II, each point also represents a radio antenna. The time series of the simulated electric field captured by each radio antenna was further processed to calculate the energy fluence (Equation (1) [34], Figure 6):

f = ε₀ c Δt Σ_i |E(t_i)|²,   (1)

where ε₀ is the vacuum permittivity, c is the speed of light, Δt is the sampling interval, and E(t_i) is the simulated electric field at time t_i.
The plot can be further transformed into a gray-scale image by applying a linear interpolation function to the energy fluence over the whole map, viewed as a mesh grid of 400 × 400 points (Figure 7, left). Radio antennas that were not simulated were considered to have zero energy fluence over their area. After removing the radio antennas used for the interpolation from the plot, only the energy fluence footprint remains (Figure 7, right).
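A minimal sketch of this interpolation step, with our own (hypothetical) function and variable names, assuming antenna positions on the ground plane and one fluence value per antenna:

```python
import numpy as np
from scipy.interpolate import griddata

def fluence_image(antenna_xy, fluence, grid_size=400, extent=None):
    """Linearly interpolate per-antenna energy fluence onto a
    grid_size x grid_size mesh, yielding a gray-scale image.
    Points outside the antenna hull (NaN after interpolation) are
    set to zero, mimicking non-simulated antennas."""
    xy = np.asarray(antenna_xy, dtype=float)
    if extent is None:  # (xmin, xmax, ymin, ymax) of the map
        extent = (xy[:, 0].min(), xy[:, 0].max(),
                  xy[:, 1].min(), xy[:, 1].max())
    gx = np.linspace(extent[0], extent[1], grid_size)
    gy = np.linspace(extent[2], extent[3], grid_size)
    gxx, gyy = np.meshgrid(gx, gy)
    img = griddata(xy, np.asarray(fluence, float), (gxx, gyy),
                   method="linear")
    return np.nan_to_num(img, nan=0.0)
```

The linear method matches the interpolation named in the text; the zero fill for uncovered regions reflects the zero-fluence assumption for non-simulated antennas.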
The main issue with this imaging technique is that the energy fluence—as well as other signal properties, like the maximum amplitude—leaves a relatively small footprint on the full map. This is because the full Auger map is best suited to more inclined UHECRs. For zenith angles below 80°, the radio footprint becomes very narrow, almost point-like (Figure 8).
We employed four types of imaging techniques:
Max local method: Each radio antenna’s energy fluence is MinMax scaled, with the maximum value determined per simulation. This method presents the radio footprint on each unit area relative to the footprint of the entire simulation. This method is presented in Figure 7 and Figure 8.
Log max local method: Similar to the Max local method, but with a logarithmic transformation applied to the energy fluence, to enhance the visibility of the energy deposit comparison between different unit areas.
Max global method: Each radio antenna’s energy fluence is MinMax scaled, with the maximum value determined across all simulations (in our case, the fluence captured by an antenna in a simulation of a vertical iron UHECR at the highest energies in the dataset, coming from the south-east). This method presents the radio footprint on each unit area relative to the footprint of the entire dataset, aiding the comparison of energy deposits between simulations from the whole dataset.
Log max global method: Similar to the Max global method, but with a logarithmic transformation applied to the energy fluence, to enhance the visibility of the energy deposit comparison between different unit areas.
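The four methods differ only in whether the reference maximum is per-simulation or dataset-wide and whether a logarithmic compression is applied first. A compact sketch with our own naming (the exact log transform is an assumption, not the paper's code):

```python
import numpy as np

def scale_fluence(fluence, method="max_local", global_max=None):
    """MinMax-style scaling of per-antenna energy fluences for imaging.

    max_local / log_max_local   -- maximum taken from this simulation
    max_global / log_max_global -- maximum taken over the whole dataset
    The log variants compress the dynamic range before scaling, making
    faint parts of the footprint more visible.
    """
    f = np.asarray(fluence, dtype=float)
    if method.startswith("log_"):
        f = np.log10(1.0 + f)                  # assumed log transform
        if global_max is not None:
            global_max = np.log10(1.0 + global_max)
    top = f.max() if method.endswith("local") else global_max
    return f / top if top and top > 0 else f
```

For example, `scale_fluence([0, 2, 4])` yields `[0, 0.5, 1]`, while the same input with `method="max_global"` and `global_max=8` yields `[0, 0.25, 0.5]`, making fluences comparable across simulations.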
In order to increase our dataset from the order of tens to at least thousands, we used air shower simulation libraries from the Pierre Auger Collaboration [33] (the Auger simulation dataset). These simulations were performed with CORSIKA/CoREAS v7.7401 for two primary particles: approximately 2000 showers for protons and about 1000 for iron. The core positions were distributed uniformly at random over the entire ideal Auger layout. The simulations covered zenith angles between 65° and 85°, azimuth angles between 0° and 360°, and primary particle energies in log10(E/eV) from 18.4 to 20.1. The high-energy hadronic interaction model was Sibyll-2.3d; the low-energy interaction model was UrQMD. The thinning and atmospheric conditions, as well as the antenna selection algorithm, were the same as those used in our dataset. We then applied the same four radio imaging techniques.
Apart from the aforementioned images, we extracted four observables that significantly impacted the radio footprint: the zenith, azimuth, energy, and X_max (Table 1). Figure 9 shows a nearly uniform distribution for the first three, while X_max followed a more Gaussian distribution. In order to normalize the independent features, we applied the MinMax method to the former three and the Z-score to the latter.
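With scikit-learn, the two normalizations can be applied per column; the column names and the illustrative values below are ours, standing in for the four observables:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# MinMax for the (roughly uniform) zenith, azimuth, and energy;
# Z-score (StandardScaler) for the Gaussian-like Xmax.
preprocess = ColumnTransformer([
    ("minmax", MinMaxScaler(), ["zenith", "azimuth", "log10_energy"]),
    ("zscore", StandardScaler(), ["xmax"]),
])

features = pd.DataFrame({
    "zenith": [65.0, 75.0, 85.0],          # degrees
    "azimuth": [0.0, 180.0, 355.0],        # degrees
    "log10_energy": [18.4, 19.2, 20.1],    # log10(E/eV)
    "xmax": [700.0, 760.0, 820.0],         # g/cm^2, illustrative values
})
normalized = preprocess.fit_transform(features)
```

The MinMax-scaled columns end up in [0, 1], while the Z-scored column has zero mean and unit standard deviation.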
The Pearson correlation coefficient (PCC) also highlighted the importance of X_max in determining the nuclear composition of the UHECRs (Figure 10). While the other observables had a low impact, due to their uniform distribution for both primaries, there was a stronger linear correlation (−0.64) between X_max and the primary particle type (proton encoded as 0, and iron as 1): proton-induced showers reached their maximum at a greater depth in the atmosphere than those induced by iron. Additionally, a significant linear correlation (0.36) existed between X_max and energy: air showers induced by cosmic rays with higher energies reached their maximum at greater depths in the atmosphere.
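The PCC between X_max and the binary particle label can be computed directly; the numbers below are synthetic, tuned only to mimic the reported anti-correlation, not real simulation output:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3000
particle = rng.integers(0, 2, size=n)            # 0 = proton, 1 = iron
# Iron showers reach their maximum higher in the atmosphere (smaller Xmax);
# the 60 g/cm^2 offset and 40 g/cm^2 spread are illustrative values only.
xmax = 760.0 - 60.0 * particle + rng.normal(0.0, 40.0, size=n)
pcc = np.corrcoef(xmax, particle)[0, 1]          # negative, roughly -0.6
```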
The Auger simulation dataset was split into training and testing sets using a 70/30 ratio, with stratification by the particle feature (proton or iron), to ensure balanced representation.
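The split can be reproduced with scikit-learn's `train_test_split`; the arrays below are placeholders with the same 2:1 proton-to-iron class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 2000 + [1] * 1000)        # 0 = proton, 1 = iron
X = np.arange(len(y)).reshape(-1, 1)         # placeholder feature matrix

# 70/30 split, stratified by the particle label for balanced representation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
```

With `stratify=y`, both sets keep the original class proportions exactly.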
3. Shallow Learning Classification
To understand the impact of radio imaging techniques on a deep learning model, we first evaluated how the numerical data performed in the classification task, using several shallow classification algorithms from the scikit-learn [35] Python library. The models considered in our study included decision tree (DecisionTreeClassifier), random forest (RandomForestClassifier), support vector machine—SVM (SVC), k-nearest neighbors (KNeighborsClassifier), logistic regression (LogisticRegression), gradient boosting (GradientBoostingClassifier), and Gaussian naive Bayes (GaussianNB). Each model was initialized with a random state of 42, to ensure reproducibility. The hyper-parameters for each model were tuned using grid search, and the best parameters were selected based on performance metrics.
The evaluation metrics used to provide a comprehensive assessment of the classification performance were Accuracy (Equation (2)), which highlights the ratio between correct and total predictions but can be misleading for class-imbalanced datasets [36]; the Matthews correlation coefficient (MCC, Equation (3)), which is more reliable than Accuracy for class-imbalanced datasets [36]; and the F1 score (Equation (4)), which balances precision and recall without considering true negatives. While the Accuracy and F1 scores range from 0 to 1 (with higher values indicating better predictions), the MCC ranges from −1 to 1, where −1 indicates total disagreement between prediction and actual class, 0 indicates randomness, and 1 indicates a perfect match. Given our class-imbalanced dataset, we used the MCC as the primary metric both for the hyper-parameter tuning of each model and for the comparison between different models.
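In terms of confusion-matrix counts (TP, TN, FP, FN for true/false positives/negatives), these three metrics take their standard forms:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}

F_1 = \frac{2\,TP}{2\,TP + FP + FN}
```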
Table 2 summarizes, based on their MCC score, the best three models with their optimized hyper-parameters and the resulting performance metrics, including the MCC, F1-score, and Accuracy, for different features as input. The logistic regression and the support vector machine with linear kernel performed best in the classification task, indicating that the data had a clear decision boundary that could be effectively captured by these linear classifiers to predict the nuclear composition of the UHECRs. The importance of X_max in the classification was evidenced by the MCC score of 0.69, which denoted good performance. Adding the energy feature increased the MCC to 0.83, while also including the zenith and azimuth angles gave a slight further boost, up to about 0.85.
However, in reality, the values of X_max, the energy, and the zenith and azimuth angles are not known and need to be reconstructed from the detector response data. Since X_max is an indispensable quantity for nuclear composition prediction, and has been shown to be reliably reconstructed by deep neural networks processing data from the water Cherenkov detectors of the Pierre Auger Observatory [37], we investigated the possibility of replacing the energy and the zenith and azimuth angles with the energy fluence of the electric field captured by the radio antennas, as the radio footprint is significantly influenced by these observables (as explained in Section 1). The proposed deep learning model is a CNN that receives as input the radio images discussed in Section 2, along with X_max, and classifies simulated air shower events between those induced by protons and those induced by iron nuclei.
4. Deep Learning Classification
We further discuss the architecture of the neural network, its training process, and, finally, the evaluation of its performance.
The CNN is a slightly modified version of the pretrained ResNet-18 [38]. As ResNet-18 expects three-channel RGB images as input, we replace its first layer with a convolutional layer that receives the four radio images. The rest of the parameters, i.e., the number of output channels, the kernel size, the stride, the padding, and the bias, are preserved. Its last layer (the classification layer) is also replaced with the identity operator, to allow the integration of particle-specific features, such as zenith, azimuth, energy, and X_max.
The numerical features (up to four) are linearly transformed into a higher-dimensional space (the default number of dimensions is 64). The rectified linear unit (ReLU) activation function is then applied, to introduce non-linearity, enabling more complex learning.
The image and numerical-feature processing paths produce two output vectors, which are concatenated and linearly transformed into a final vector. This vector is converted by the softmax function into a probability distribution over the two outcomes, resulting in a probability p that the UHECR is a proton and a probability q that it is an iron nucleus. The final architecture of the CNN model is displayed in Figure 11.
The dataset used for training included only the X_max feature and the four different energy fluence imaging techniques. To artificially enlarge the training dataset, the random resized crop data augmentation technique was used. It randomly selects a portion of an image, resizes it, and then crops it to a specified size of 224 × 224 pixels. The selected portion has a scale between 0.75 and 1 and an aspect ratio between 0.75 and 1.33, providing variability in the training data.
We used CrossEntropyLoss as the loss function and the AdamW optimizer with a learning rate of 0.001 and a weight decay of 0.05. The model parameters were updated after every mini-batch of 64 images. Since the training set had 2166 images, this resulted in a total of 34 gradient descent steps per epoch. An epoch is one complete pass of the training data through the algorithm. The training data were reshuffled at every epoch.
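The corresponding optimization setup, shown here with a tiny dummy model standing in for the full network:

```python
import torch
from torch import nn

model = nn.Linear(8, 2)                    # stand-in for the full network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# One update on a mini-batch of 64 (2166 training images -> 34 steps/epoch).
inputs = torch.randn(64, 8)
targets = torch.randint(0, 2, (64,))
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```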
At each epoch, we plotted the training and test errors, the loss curve, and the MCC, F1, and Accuracy scores, for a total of 50 epochs. The test errors (Equation (5)) for both proton and iron (Figure 12) followed a similar decreasing trend, stabilizing around 10%. This demonstrates the robust ability of the model to predict the nuclear composition of the UHECR based only on the X_max observable and the radio images.
The training and testing loss (Figure 13) both decreased steadily, stabilizing at about 0.25 and indicating that the model was learning effectively.
Regarding the evaluation metrics (Figure 14), the MCC increased rapidly in the first few epochs and then gradually stabilized around 0.8, indicating a strong correlation between predicted and actual values. Accuracy increased quickly and then stabilized around 0.9, suggesting good performance on the classification task. The lower MCC value compared to Accuracy may indicate that the class imbalance affected the predictions. The F1 score increased sharply at the beginning and stabilized around 0.9, indicating a good balance between precision and recall.
Since the logistic regression model using all four numerical features as input achieved an MCC of about 0.85 (as shown in Table 2), and the CNN, which used only X_max and the energy fluence images as input, converged to an MCC of 0.8 (as seen in Figure 14), the neural network model demonstrates good performance. Its classification capability is comparable to a scenario where all the other observables (energy and the zenith and azimuth angles) are already known.
We ran the training again, this time including all four numerical features (X_max, energy, zenith angle, and azimuth angle). The evaluation metrics (Figure 15) showed a slight overall performance increase: the MCC stabilized just above 0.8, while the Accuracy and F1 scores rose just above 0.9. This shows that the radio images already contained information about the newly included features, illustrating that the radio imaging techniques effectively substituted for the unknown energy of the primary particle and the air shower direction.