1. Introduction
The Earth’s atmosphere is constantly impinged upon by astroparticles, giving rise to an atmospheric flux of secondary particles comprising three main components: electromagnetic (e± and γ), muonic (μ±), and hadronic (consisting of various types of mesons and baryons, including nuclei).
The Latin American Giant Observatory (LAGO) (https://lagoproject.net (accessed on 19 August 2024)) is a network of Water Cherenkov Detectors (WCDs) situated across multiple sites in Ibero-America. Among LAGO’s primary objectives are the measurement of high-energy events originating from space using WCDs at ground-level locations [1] and the continuous enhancement of our WCD systems [2]. LAGO WCDs employ a single, large-area photomultiplier tube as the primary sensor. When relativistic charged particles traverse the WCD, they emit Cherenkov radiation within the water volume, which subsequently triggers a detection event in the detector’s data acquisition system. Owing to the large water volume, neutral particles such as photons or neutrons can also be detected indirectly, through Compton scattering or pair creation in the former case, or through nuclear interactions with the various materials present in the latter.
WCDs at LAGO, which are not necessarily homogeneous across the detection network, are deployed at selected (sometimes remote) sites with different altitudes and geomagnetic characteristics, and thus with different rigidity cut-offs. In this scenario, one important goal is the continuous monitoring of the detector’s status and of the factors that could affect the WCD response, such as detector aging or water quality, as they could create artificial increases or decreases in the flux of measured signals. Moreover, registering and transferring the complete flux of secondary particles may not be necessary during quiescent periods, when no astrophysical transients are detected. Thus, being able to characterize the WCD response in a real-time, local, and unattended mode is essential, especially for detectors deployed at challenging sites (e.g., in Antarctica or on very high-altitude mountains). For this reason, one of the goals of this work is to enable the automatic determination of the WCD response to the flux of secondary particles, especially during astrophysical transients such as those produced under disturbed conditions driven by Space Weather phenomena.
Magnetic and plasma interplanetary conditions near Earth, which are of major interest in Space Weather, can significantly modify the transport of low-energy galactic cosmic rays (GCRs). These conditions can produce variability in the primary galactic proton fluxes that can be indirectly observed with particle detectors installed at the Earth’s surface. This variability occurs mainly in the band from ∼hundreds of MeV to slightly more than 10 GeV, and is evident, for example, in the well-known anti-correlation between the ∼11-year solar cycle of sunspots and the long-term variability of the galactic cosmic ray flux, e.g., [3].
There are two major transient interplanetary (IP) perturbations producing decreases in GCRs: Interplanetary Coronal Mass Ejections (ICMEs) and Stream Interaction Regions (SIRs). ICMEs are coronal mass ejections expelled from the Sun, while SIRs are interplanetary structures that develop in the solar wind when fast solar wind streams reach and merge with slower interplanetary plasma, e.g., [4].
When ICMEs or SIRs are detected by spacecraft near geospace, ground-level GCR measurements generally show decreases in the flux of secondary particles, a phenomenon called a Forbush decrease (FD), e.g., [5,6].
The variability of the GCR flux at ground level, on both long (e.g., the ∼11-year solar cycle) and short time scales (∼hours to days for FDs), has been systematically observed over several decades by measuring the secondary neutrons produced in atmospheric cascades using Neutron Monitors (NMs), e.g., [7,8].
Given their characteristics, WCDs have started to be examined as a possible complement to NMs for space weather studies, since they can also observe FDs produced by ICMEs, e.g., [9,10]. In particular, FDs have been observed in different channels of LAGO WCDs, and first explorations with data from the Antarctic node have shown that LAGO WCDs can observe the spatial anisotropy of the low-energy GCR flux, at least during quiescent days [11]. Moreover, WCDs are cheaper and safer than NMs, and a single WCD can observe the variability of the secondary particle flux in different channels of deposited energy.
Note also that having the flux variability for different deposited energies and for different classes of secondary particles could help to better identify the variability of the flux of low-energy primaries, providing more information about the conditions in the heliosphere, which are of major interest in Space Weather. One way to exploit this is to identify the kind of particle detected from the features of the trace observed in the WCD. Thus, there is special interest in discriminating the traces deposited by the different secondary particles at different energies.
The primary objective of this work is to find patterns within each WCD’s data that could allow us to assess the secondary particle contributions that compose the overall charge histogram of the secondary flux. We propose a machine learning (ML) algorithm to perform this task in such a way that each LAGO detector can have its tailored ML model as a result of a learning process using its particular dataset.
ML techniques have been used in many fields, including astroparticle research, with encouraging results. In general, particle discrimination in WCD data is an important task for various kinds of studies, and ML has been applied to analyze WCD data in different scenarios. For example, Jamieson et al. [12] propose a boosted decision tree (XGBoost), a graph convolutional network (GCN), and a dynamic graph convolutional neural network (DGCNN) to optimize the neutron capture detection capability in large-volume WCDs, such as the Hyper-Kamiokande detector. Their work is driven by the need to distinguish the neutron signal from others, such as muon spallation or other background signals. ML techniques are also used to identify muons in Conceição et al. [13] for WCDs with reduced water volume and four photomultipliers (PMTs). In that case, convolutional neural networks (CNNs) were used, showing that the identification of muons in a station depends on the amount of electromagnetic contamination, with nearly no dependence on the configuration of the WCD array. These are two of many examples showing the potential of ML in this area of research [
14,
15,
16,
17].
In Torres Peralta et al. [
18], we proposed the use of a clustering (unsupervised ML) technique to identify each of the components detected by LAGO WCDs using actual observations. We showed that Ordering Points To Identify the Clustering Structure (OPTICS) is suitable for identifying these components [
19]. However, further validation was needed to ensure that the algorithm is robust and obtains the desired outcome with high confidence.
Here, we continue the study of OPTICS applied to LAGO WCDs by using synthetic data from Monte Carlo simulations. Thus, this work serves as a validation of the proposed OPTICS method, since the algorithm is applied to a synthetic dataset where the ground truth is known a priori. Moreover, we implement statistical analyses to ensure robustness and precision.
This work is organized as follows:
Section 2 explains the LAGO software suite, the simulation’s main parameters and limitations, and the outcome in the form of a synthetic dataset.
Section 3 details the ML technique including the main hyperparameters used.
Section 4 is dedicated to explaining the pipeline followed for the ML modeling and decisions on the data treatment. Finally, we present the results and conclusions in
Section 5 and
Section 6, respectively.
2. Simulation Framework
2.1. Atmospheric Radiation Calculations
Cosmic rays (CRs) are defined as particles and atomic nuclei originating from beyond Earth, spanning energies from several GeV to more than 10²⁰ eV [20]. These particles, upon reaching the upper atmosphere, interact with atmospheric nuclei to produce extensive air showers (EAS), a discovery made by Rossi and Auger in the 1930s [21]. An EAS generates new particles, or secondaries, through radiative and decay processes that follow the incoming direction of the CR [22].
The formation and characteristics of an EAS are influenced by the energy (E_p) and type (such as gamma, proton, iron) of the incident primary CR, and an EAS can generate billions of particles at the highest primary energies. The process continues through atmospheric interactions until reaching the ground, where 85–90% of E_p is transferred to the electromagnetic (EM) channel, consisting of e± and γ. Muons are produced by the decay of different mesons during the cascade development, mainly but not exclusively from charged pions and kaons. Hadrons are produced mainly by evaporation and fragmentation during strong-force-mediated interactions with atmospheric nuclei at the core of the EAS, also known as the shower axis, which follows the direction of the incoming primary particle. The particle distribution across the EM, muon, and hadronic channels is approximately 100:1:0.01, respectively [23].
As can be supposed from the above description, the simulation of EAS is a task that demands significant computational resources. This challenge arises not only from the need to model intricate physical interactions but also from tracking an enormous quantity of particles and taking into account their respective interactions with the atmosphere. Among the available simulation tools, CORSIKA [
24] stands out as the most broadly adopted and rigorously tested, benefiting from ongoing enhancements [
25]. CORSIKA allows for the detailed simulation of EAS initiated by individual cosmic rays, with adjustable settings for various parameters such as atmospheric conditions, local Earth’s magnetic field (EMF) variations, and observation altitude. To effectively simulate expected background radiation across different global locations and times using CORSIKA, an auxiliary tool is necessary. This tool should dynamically adjust the parameters based on seasonal changes in the atmospheric profile and the variations in the cosmic ray flux influenced by solar activity, which also impacts the EMF.
To tackle these challenges, over recent years the LAGO Collaboration has been developing, testing, and validating ARTI [26], an accessible toolkit designed to compute and analyze the background radiation and its variability, and to assess the expected detector responses. ARTI is capable of predicting the expected flux of atmospheric cosmic radiation at any location under dynamic atmospheric and geomagnetic conditions [
27,
28], effectively integrating CORSIKA with Magneto-Cosmics [
29] and Geant4 [
30] with its own analysis tools. During its development, ARTI has been extensively tested and validated in the LAGO observatory and at other astroparticle observatories [
26,
31]. More recently, ARTI has been utilized and validated with the corresponding data in a diverse range of applications, including astrophysical gamma source detection [
1], monitoring space weather phenomena like Forbush decreases [
27,
31], estimating atmospheric muon fluxes at subterranean locations, and analyzing volcanic structures using muography [
32]. Additionally, ARTI has been used in conflict zones in Colombia to detect improvised explosive devices, examine the effects of space weather on neutron detection in water Cherenkov detectors, develop neutron detectors for monitoring the transport of fissile materials [
33], and even create ACORDE, a code for calculating radiation exposure during commercial flights.
Calculating the expected flux of the atmospheric radiation at any geographical position, hereafter Ξ, requires long integration times to avoid statistical fluctuations [26]: while a single EAS involves the interaction and tracking of billions of particles during the shower development through the atmosphere, the atmospheric radiation is produced by the interaction of up to billions of CRs impinging on the Earth per square meter each second. For the modeling of EAS, not only the interactions involved but also the corresponding atmospheric profile at each location, which can also vary as a function of time, must be considered, as the atmosphere is the medium where each shower evolves [34]. For this reason, ARTI can handle different available atmospheric models: the MODTRAN model, which sets a general atmospheric profile depending on the seasonal characteristics of large areas of the world (say, tropical, subtropical, arctic, and antarctic) [35]; Linsley’s layered model, which uses atmospheric profiles obtained from measurements at predefined sites [36]; or real-time atmospheric profiles built from data of the Global Data Assimilation System (GDAS) [37] (data assimilation is the adjustment of the parameters of a specific atmospheric model to the real state of the atmosphere, as measured by meteorological observations), characterized with Linsley’s model, and finally an atmospheric profile obtained from the temporal averaging of the GDAS profiles to build a local density profile at each location for a certain period, e.g., one month [28].
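For reference, Linsley-type layered models, as implemented for instance in CORSIKA, describe the vertical atmospheric depth T as a function of the altitude h with a few exponential layers plus a linear uppermost layer. The generic functional form is sketched below; the layer boundaries and the coefficients a_i, b_i, c_i are site- and period-dependent fit parameters (e.g., fitted to the averaged GDAS profiles mentioned above), so no specific values are implied here:

```latex
% Generic Linsley-type parameterization of the vertical atmospheric depth T(h),
% in g/cm^2, as a function of the altitude h: four exponential layers plus a
% linear uppermost layer (a_i, b_i, c_i are fitted per site and period).
T(h) = a_i + b_i \, e^{-h/c_i}, \quad i = 1,\dots,4 \quad \text{(lower layers)},
\qquad
T(h) = a_5 - b_5 \, \frac{h}{c_5} \quad \text{(uppermost layer)}.
```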
Finally, Ξ is also affected by the variable conditions of the heliosphere and the EMF, as both affect the transport of CRs down to the atmosphere. ARTI therefore also incorporates modules to account for changes in the secular magnitude of the EMF and for disturbances due to transient solar phenomena, such as Forbush decreases, as described in Asorey et al. [27].
After establishing the primary spectra, the atmospheric profile, and the secular and occasional disturbances of the Earth’s magnetic field (EMF), it becomes possible to calculate the local expected flux of secondary particles, Ξ. This calculation is carried out by injecting the integrated flux of primary particles into the atmosphere, with energies ranging from the threshold set by the local rigidity cutoff, Z × R_C, up to 10¹⁵ eV. Here, R_C denotes the local directional rigidity cutoff tensor derived from the secular values of the EMF, according to the current International Geomagnetic Reference Field (IGRF) version 13 model [38]. The variable Z represents the charge of the primary particles, which range from protons (Z = 1) to iron (Z = 26). The upper energy limit of 10¹⁵ eV is selected because, above 1 PeV, the primary spectra exhibit the so-called ‘knee’, significantly reducing the primary flux at higher energies and rendering their impact on atmospheric background calculations negligible [26]. These calculations typically cover an area of 1 m² over a time integration period ranging from several hours to days. Post-simulation, secondaries generated by primaries not allowed geomagnetically are discarded by comparing the primaries’ magnetic rigidity with the evolving values of R_C [27].
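As an illustration of this post-simulation geomagnetic filtering, the sketch below applies a directional rigidity cut, keeping a secondary only if the rigidity of its parent primary exceeds the (possibly time-dependent) directional cutoff. It is only a schematic rendering of the procedure described above: the column names and the cutoff_GV callback are hypothetical placeholders, not ARTI’s actual interface.

```python
import numpy as np
import pandas as pd

def magnetic_rigidity_GV(p_GeV: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Rigidity R = p c / (Z e); with p in GeV/c and Z in units of e, R is in GV."""
    return p_GeV / Z

def filter_by_rigidity(secondaries: pd.DataFrame, cutoff_GV) -> pd.DataFrame:
    """Keep only secondaries whose parent primary is geomagnetically allowed.

    Each row is assumed to carry the momentum (GeV/c), charge Z, and arrival
    direction of the parent primary plus a timestamp; cutoff_GV(zenith,
    azimuth, t) is a user-supplied directional rigidity cutoff (e.g.,
    interpolated from Magneto-Cosmics runs).
    """
    rigidity = magnetic_rigidity_GV(secondaries["primary_p_GeV"].to_numpy(),
                                    secondaries["primary_Z"].to_numpy())
    rc = cutoff_GV(secondaries["primary_zenith"].to_numpy(),
                   secondaries["primary_azimuth"].to_numpy(),
                   secondaries["time"].to_numpy())
    return secondaries[rigidity > rc]
```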
This intensive process demands substantial computing resources. For example, estimating the daily flux Ξ of secondary particles per square meter at a high-latitude location involves simulating several hundred million extensive air showers (EAS), each contributing to the production of a comparable number of ground-level secondaries. For this reason, ARTI is designed to operate on high-performance computing (HPC) clusters and within Docker containers on virtualized platforms such as the European Open Science Cloud (EOSC), as well as to manage data storage and retrieval across public and federated cloud servers [39].
2.2. Detector Response Simulations
Meiga [
32] is a software framework built on Geant4, tailored to facilitate the calculation of particle transport through extensive distances, such as hundreds or thousands of meters through rocks of varying densities and compositions, and is also pivotal in the design and characterization of particle detectors for muography. Structurally, Meiga is composed of various C++ classes, each dedicated to a specific functionality. It integrates Geant4 simulations for particle transport and detector response calculations, providing interfaces for users to manage detector descriptions and simulation executions.
Meiga offers a suite of customizable applications that simplify the simulation process for users by utilizing configuration files formatted in XML and JSON. This characteristic allows users to easily adapt the simulation framework to meet their specific project requirements [
32]. Additionally, Meiga includes utilities such as a configuration file parser, physical constants, material properties, and tools for geometric and mathematical calculations. The adoption of JSON for configuration files was motivated by the possibility of complying with the FAIR (for Findable, Accessible, Interoperable and Reusable) Data principles [
40], and by the incorporation of standards used during the creation of digital twins. Moreover, as detailed in Taboada et al. [
32], the framework’s modular and adaptable design includes a set of pre-configured detector models and Geant4 physics lists, which users can easily extend or modify to develop tailored detectors and processes. This modular approach considerably reduces the time and effort required for simulation development.
Once the flux of secondary particles, Ξ, is obtained, it is propagated in Meiga through a detailed model of the LAGO water Cherenkov detector (WCD). This model incorporates variables such as water quality, the photomultiplier tube (PMT) model, its geometric positioning within the detector, the internal coating of the water container, and the detector’s electronic response. As charged particles enter the water, they generate Cherenkov photons, which propagate through the detector’s volume until they are either absorbed or reach the PMT. The PMT is simulated in Meiga as a photosensitive surface, accurately replicating its characteristic spectral response based on the quantum efficiency provided by the manufacturer. Given the substantial water volume of typical WCDs (generally over 1 m³), the system is also sensitive to neutral particles such as neutrons and photons through secondary processes like neutron capture followed by prompt gamma emission, Compton scattering, or pair creation within the water [33].
The detector’s electronic response is simulated to produce the final signal, a pulse representing the time distribution of photo-electrons (PEs) detected by the simulated electronics. This pulse typically resembles a sampled Fast Rise and Exponential Decay (FRED) curve and is captured at the same rate as the detector’s electronics, ranging from 40 to 125 million samples per second, with time bins spanning 25 to 8 ns, respectively, and using 10 to 14 bits for the analog-to-digital converter. As in the physical WCD, the total pulse acquisition time can be set between 300 and 400 ns, depending on the acquisition conditions. Once captured, the pulse is analyzed for its ‘peak’ (the maximum number of PEs registered within a single time bin), its ‘charge’ (the total PEs collected during the event), and characteristic times such as the rise and decay times, determined by the time needed for the integrated signal to reach predefined levels (typically 10→50% and 30→90% of the charge). Each pulse and its associated characteristics are logged for further analysis along with details of the impinging secondary particle. Unlike physical detectors, which record pulses without identifying the type of particle for each event, Meiga allows each pulse to be linked to its corresponding secondary particle. This capability enables the testing of predictions made by unsupervised machine learning analysis techniques, as detailed in the subsequent section.
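To make these pulse-level quantities concrete, the following sketch computes the peak, the charge, and a 10→50% rise time from a sampled pulse; the 25 ns bin width corresponds to the 40 MS/s case, and the array and function names are illustrative rather than part of the Meiga output.

```python
import numpy as np

def pulse_features(pulse_pe: np.ndarray, dt_ns: float = 25.0) -> dict:
    """Basic features of a sampled WCD pulse given as photo-electrons per time bin.

    dt_ns is the sampling bin width (25 ns at 40 MS/s). The rise time is the
    interval over which the integrated signal grows from 10% to 50% of the
    total charge, following the definition given in the text.
    """
    charge = float(pulse_pe.sum())        # total PEs collected in the event
    peak = float(pulse_pe.max())          # maximum PEs within a single time bin
    cumulative = np.cumsum(pulse_pe)
    t10 = np.searchsorted(cumulative, 0.10 * charge) * dt_ns
    t50 = np.searchsorted(cumulative, 0.50 * charge) * dt_ns
    return {"charge_pe": charge, "peak_pe": peak, "rise_time_ns": t50 - t10}

# Toy FRED-like pulse sampled over a 400 ns window (16 bins of 25 ns each).
t = np.arange(16) * 25.0
toy_pulse = np.exp(-(t - 50.0) / 80.0) * (t >= 50.0)
print(pulse_features(toy_pulse))
```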
3. Machine Learning Framework
As mentioned above, the primary goal is to separate the signal contribution of different secondary particles within a given charge histogram from a single WCD. Since the ground truth is unknown for actual data obtained from a WCD, the approach presented by Torres Peralta et al. [
18] was to implement an unsupervised ML clustering algorithm. This type of algorithm deals with the problem of partitioning a dataset into groups using only the structure of the unlabeled data.
In the aforementioned work, the selected dataset consisted of pulses (samples) captured by the data acquisition system (DAQ), digitized at a sampling rate of 40 MHz with 10-bit resolution over a time window of 400 ns. The data originated from LAGO’s “Nahuelito” WCD at the Bariloche site, Argentina.
The originally measured dataset was analyzed using the following features: the total charge deposited (the time integration of the pulse), the maximum value of the pulse, the time taken to deposit 90% of the charge, the pulse duration, and the time difference between the current and the next pulse. These original features were further analyzed using Principal Component Analysis (PCA), which resulted in a set of principal components labeled PCA 1 through PCA 5.
Figure 1 shows a visualization of the resulting components from PCA, where each subplot is a two-dimensional projection of the distribution between the selected components. In each subplot, a darker color indicates that more points fall in that bin, i.e., that region has a higher density. Overall, the data show a complex structure with the potential to form groups of points; in some cases, such as the projection of PCA 2 versus PCA 1, one can observe a potential hierarchy of groups of different densities that together compose a larger group.
We considered different types of clustering algorithms, such as partitioning methods (e.g., K-Means, probably the best-known clustering algorithm), hierarchical methods, density-based methods, grid-based methods, and distribution-based methods. After analyzing the structure of the data, a hierarchical density-based algorithm was chosen as a good candidate. In particular, we used Ordering Points to Identify the Clustering Structure (OPTICS) [19]. OPTICS is a hierarchical density-based clustering algorithm; in our case, its aim is to organize the secondary particle contributions from the cosmic rays into well-separated clusters.
It is worth mentioning that OPTICS has a set of advantages, crucial to our application, over other well-known density-based algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [41]. One of the major advantages is memory efficiency: OPTICS requires O(n) memory while DBSCAN requires O(n²) [42]. This is especially relevant in our case because the high number of samples (over 39 million) demands efficient memory management; under such conditions, DBSCAN fails. Even though OPTICS has poorer execution-time performance (it is highly sequential) while DBSCAN is more efficient and parallelizable at run time, memory management is a hard constraint. As such, OPTICS met the desired scalability requirements for our system.
To achieve a more robust validation, in this paper we apply the same method to synthetic data resulting from simulations. Here, the pipeline begins with a standard preprocessing stage to prepare the data for the subsequent stages. A set of criteria was chosen to filter out data points considered anomalous. From the resulting dataset, features were extracted, normalized, and passed through a PCA stage that produced a new set of orthonormal features called principal components. These new features comprise the final dataset used to feed the main stage of the pipeline. Details of the methodology can be found in the next section.
The main part of the pipeline, the ML modeling, uses the OPTICS algorithm to generate the separated clusters by grouping points that share similarities in their features. What is particular to density-based clustering algorithms is that cluster membership is defined through a distance metric quantifying how close points are to each other.
One of the desired characteristics of clustering algorithms is the capability of discovering arbitrarily shaped clusters, which is one of the most challenging tasks. Density-based algorithms can achieve this goal, but unlike DBSCAN, OPTICS can handle both complex cluster geometries, such as groups within groups (hierarchical structures), and variable cluster density [42]. Many algorithms based on either centroids (such as K-Means) or medoids (such as K-Medoids) fail to satisfy these criteria of recovering widely different clusters, converging on concave-shaped clusters, and grouping hierarchically. In addition, OPTICS does not require any pre-defined number of partitions or clusters. Because of the above-mentioned advantages, we propose OPTICS as the most suitable clustering method.
In OPTICS, there are two main concepts: core distances and reachability distances. The core distance is the minimum distance needed for a given point o to be considered a core point; multiple core points form a core group and the possible beginning of a new cluster. The reachability distance of a point p with respect to o is the minimum distance such that, if o is a core point, p is directly density-reachable from o. If o is not a core point, the reachability distance is undefined [19].
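In the standard notation of Ankerst et al. [19], these two quantities can be written as follows, where N_ε(o) is the ε-neighborhood of o and d is the chosen distance metric:

```latex
% Core distance and reachability distance as defined by Ankerst et al. [19].
\mathrm{core\text{-}dist}_{\varepsilon,\,minPts}(o) =
  \begin{cases}
    \text{UNDEFINED}, & \text{if } |N_\varepsilon(o)| < minPts,\\
    d(o,\, o_{minPts}), & \text{otherwise (distance to the $minPts$-th nearest neighbour)},
  \end{cases}
\qquad
\mathrm{reach\text{-}dist}_{\varepsilon,\,minPts}(p, o) =
  \begin{cases}
    \text{UNDEFINED}, & \text{if } |N_\varepsilon(o)| < minPts,\\
    \max\bigl(\mathrm{core\text{-}dist}_{\varepsilon,\,minPts}(o),\; d(o, p)\bigr), & \text{otherwise}.
  \end{cases}
```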
In addition to these two concepts, two main hyperparameters need to be set a priori before running the algorithm, ε and minPoints: (a) ε is the maximum reachability distance that can be calculated between two points, while (b) minPoints is the minimum number of points required in the neighborhood of a point o for it to be considered a core point (o itself is included in the count). These parameters greatly affect the runtime of the algorithm, which in the absolute worst case can be O(n²). minPoints also controls the granularity when searching for clusters, as a smaller value can help find clusters within clusters [42].
The first output of OPTICS is a visual representation of the calculated reachability distance of each point with respect to its closest core group. The points are ordered along the X-axis, from the smallest to the greatest reachability distance within their corresponding core group, while the Y-axis shows the reachability distance itself. The resulting plot is the so-called reachability plot. To interpret it, points in the valleys are data points that are spatially close to each other, i.e., regions of high local density, and are therefore likely to belong to the same cluster. The valleys are separated by data points with larger reachability distances, which are farther away from the data points in a valley.
Figure 2, from Wang et al. 2019 [
43], illustrates this interpretation. It is worth mentioning, referring to the same figure, that OPTICS can detect a hierarchy of clusters, for example, where the green and light blue clusters are clearly child clusters of a parent cluster in red. Thus, in problems where varying density within clusters is assumed, OPTICS will be able to achieve cluster separation.
As hinted in the previous paragraph, cluster formation depends on the choice of a maximum reachability distance as the cut-off for cluster membership. This can be done in two ways: adaptively choosing a maximum reachability distance depending on the structure of each valley, or choosing a fixed maximum as a cut-off. The latter is the option selected in this work.
As with many ML algorithms, OPTICS is non-deterministic in practice, meaning that different results can be obtained on each run. In particular, the non-deterministic part of OPTICS is related to the order in which the data are processed when each point is selected to belong (or not) to a given cluster; this order varies between runs. If the model is robust enough, the results of independent runs will tend to converge to a similar clustering of the data. We address this issue again when we explain the methodology used in this work.
Finally, it is worth mentioning that the implementation was developed in Python using the scikit-learn library.
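As a minimal sketch (not the exact production code), the two-step procedure described in this section, computing the reachability ordering and then extracting clusters with a fixed reachability cut-off, can be expressed with scikit-learn as follows; the placeholder data and the numerical values anticipate the hyperparameters discussed in Section 4 and Section 5:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Placeholder feature matrix; in the actual pipeline this would hold the
# PCA-transformed pulse features of one hourly dataset (~500,000 rows).
X_pca = np.random.default_rng(0).normal(size=(20_000, 3))

# Hyperparameters as reported in Table 2 (minPoints and epsilon).
optics = OPTICS(min_samples=5000, max_eps=0.5)
optics.fit(X_pca)

# Fixed-threshold extraction of clusters from the reachability ordering;
# 0.08 is the cut-off adopted in Section 5. Points labeled -1 are noise.
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.08,
)
```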
4. Methodology
We propose a methodology based on a data science approach in which ML is used to implement a data-driven model/system. Often, these techniques must not only answer the actual scientific question but also address emerging quality attributes such as debiasing and fairness, explainability, reproducibility, privacy and ethics, sustainability, and scalability.
The collection of data science stages, from acquisition to cleaning/curation, modeling, and so on, is referred to as a data science pipeline (or workflow). Data science pipelines enable flexible but robust development of ML-based modeling and software for later decision-making. We embrace the data science pipeline methodology to analyze and implement our proposed model [
44].
In brief, machine learning pipelines provide a structured, efficient, and scalable approach to developing and maintaining machine learning models. They allow the modularization of the workflow, standardizing processes, and ensuring consistency, which is essential for producing (and reproducing) robust models.
The pipeline designed in this work, shown in Figure 3, can be summarized as follows:
Acquisition: produces the 24-h simulation data using the characteristics of the “Nahuelito” WCD.
Preprocessing:
Filtering: removes anomalies to guarantee the quality of the data used.
Splitting: divides the simulation output into two sets, input and ground truth. This is done because we want to perform the clustering on the input set in a ’blind’ fashion (without the ground truth). The dataset with the ground truth is later used for validation of the results.
Feature Engineering and Feature Selection: creates the initial features to be used; PCA is then performed to select the final feature set.
Parallel running of OPTICS: the input set is divided into 24 one-hour datasets that are fed in parallel to the OPTICS algorithm. As a result, 24 independent models are obtained, and for each independent run the particle composition of each cluster is extracted.
Averaging: the previous two steps are repeated 10 times and the results are aggregated (a minimal sketch of this step follows the list).
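The parallel execution and averaging steps can be organized roughly as in the sketch below; run_optics_on_hour and composition_per_cluster are hypothetical helpers standing in for the pipeline stages described in this section:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical helpers standing in for the stages described in this section.
def run_optics_on_hour(hour_dataset):
    """Fit OPTICS on one hourly dataset and return the cluster labels."""
    raise NotImplementedError  # placeholder

def composition_per_cluster(labels, ground_truth):
    """Fraction of each particle type in each cluster, using the reserved truth."""
    raise NotImplementedError  # placeholder

def one_repetition(hourly_data, hourly_truth):
    # The 24 hourly datasets are independent, so they can be clustered in parallel.
    with ProcessPoolExecutor() as pool:
        label_sets = list(pool.map(run_optics_on_hour, hourly_data))
    return [composition_per_cluster(lbl, truth)
            for lbl, truth in zip(label_sets, hourly_truth)]

# Ten independent repetitions; the final report is the mean and standard
# deviation of each particle fraction over all runs (the averaging step).
# results = [one_repetition(hourly_data, hourly_truth) for _ in range(10)]
```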
The methodology starts at the acquisition stage, where we used the MEIGA simulation framework to produce a dataset with information about every particle event that passed through the simulated WCD. To achieve this, real-time magnetospheric and local atmospheric conditions for February and March of 2012 were analyzed, and the resultant atmospheric secondary-particle flux was injected into a specific MEIGA application featuring a comprehensive Geant4 model of the WCD at a specific LAGO location. The output includes information about the interactions of the particle that produced each event (secondary particles) as well as the particle type itself. Thus, we have a priori knowledge of the particle composition of the simulated events. In particular, we used a 24-h simulation for a WCD with the same characteristics as “Nahuelito” to reproduce conditions similar to those in [18] (e.g., WCD geometry, rigidity cut-off, etc.). The output from MEIGA was restructured for effective integration into our machine learning pipeline. Each simulated day of data consists of approximately 500 million particles arriving at the WCD.
Following the ML pipeline, the pre-processing stage takes the simulation output dataset and transforms it into a curated dataset suitable for extracting and selecting features for the learning process. The pre-processing stage consists of two subtasks: filtering and data splitting.
Regarding the first task, the criterion used to filter out anomalous data points reduces to removing events in which the particle did not have enough energy to interact with the WCD, in other words, did not produce photo-electrons (PEs) in the PMT and therefore would not be detected. Unlike the actual data used in our previous work, where aggressive cleaning/filtering was needed, with synthetic data we perform a simple cleaning by eliminating those particles that do not produce a signal. This is because the generation of synthetic data occurs in a controlled environment: the simulation does not account for external factors, such as ambient noise or external sunlight filtering into the detector, that would be present in real data and would need to be filtered out [18].
For the second part of the preprocessing stage, the cleaned dataset is split into two subsets. The first contains the input data of the events that feed the next stage, while the second contains the ground truth (reserved for later validation). The ground truth consists of the actual particle composition of each event. OPTICS is an unsupervised ML algorithm and, as such, is not a classifier; this means that, after the clustering, the ground truth is used to analyze the particle composition of each cluster.
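A minimal version of this cleaning and splitting step, assuming the Meiga output has been loaded into a pandas DataFrame with hypothetical columns total_pe (photo-electrons produced) and particle_id (true particle type), could look as follows:

```python
import pandas as pd

def preprocess(events: pd.DataFrame):
    """Filter non-detections and separate the input data from the ground truth."""
    # Keep only events that actually produced photo-electrons in the PMT.
    detected = events[events["total_pe"] > 0].copy()

    # The true particle type is set aside and used only after clustering,
    # to compute the particle composition of each cluster.
    ground_truth = detected.pop("particle_id")
    return detected, ground_truth
```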
The next stages in the pipeline correspond to feature engineering and feature selection. Feature engineering refers to the construction of new features from existing ones to improve model performance; it relies on transforming the feature space by applying different transformation functions (e.g., arithmetic and/or aggregate operators) to generate new features. Feature selection corresponds to choosing the most suitable features to enhance the performance of the ML model [45].
In general terms, the features should contain enough information for the algorithm to properly cluster the signals from the secondary particles. At the same time, features that do not aid the learning process should be removed to avoid the so-called ’curse of dimensionality’ [46]. High-dimensional data (features are also referred to as dimensions of the data in ML) can be extremely difficult to analyze, counterintuitive, and usually carry a high computational cost (especially when dealing with Big Data), which in turn can degrade the predictive capabilities of a given ML model. In brief, as the dimensionality increases, the number of data points required for good performance of any ML algorithm increases exponentially. Thus, we have to strike a trade-off between keeping a relatively small number of features and ensuring that the chosen features explain the problem.
The initial feature set proposed can be seen in
Table 1. In order to select the most suitable features, we performed a cross-correlation analysis to remove highly correlated features that may not add significant new information and can negatively affect the performance of ML algorithms, as stated previously. From the resulting cross-correlation matrix, shown in Figure 4, it can be seen that the features Peak and Pulse Duration were two candidates for removal from the final feature set. Peak had a high correlation of 0.94 with Total PEs Deposited and of 0.75 with Pulse Duration, while Pulse Duration had a high correlation of 0.87 with Total PEs Deposited.
After thorough testing using the complete methodology presented here, it was found that removing only the feature Peak produced better results; thus, the final feature set used was Total PEs Deposited, Time to Deposit 90%, and Pulse Duration.
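The correlation-based screening can be reproduced along the following lines; the column names are illustrative stand-ins for the features of Table 1, and the 0.9 threshold is only an example value:

```python
import pandas as pd

def screen_features(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Report strongly correlated feature pairs and drop the 'peak' column."""
    corr = features.corr().abs()
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:]
             if corr.loc[a, b] > threshold]
    print("Highly correlated pairs:", pairs)
    # After testing the full pipeline, only 'peak' was finally removed.
    return features.drop(columns=["peak"])
```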
Before feeding the features into the ML stage, they were normalized and sent through a step of Principal Component Analysis (PCA). PCA is a feature selection and dimension reduction procedure to produce a set of principal components that maximize the variance along each dimension. This assumes a linear relationship between features and produces an ordered dataset where the first principal component is the one with the most variance and each subsequent principal component is orthonormal to the previous. This is a standard process to transform the original dataset to a new dataset that is better suited for ML algorithms [
47]. The components are linear combinations of the original features that maximally explain the variance in each selected dimension. This final transformed dataset is the output of the ‘feature engineering and feature selection’ and passes to the ML stage.
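Normalization and PCA can be chained with standard scikit-learn components, as in this sketch (the placeholder matrix stands in for the three selected features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for the selected feature matrix (Total PEs Deposited,
# Time to Deposit 90%, Pulse Duration) produced by the previous step.
selected = np.random.default_rng(1).normal(size=(1_000, 3))

# Normalize each feature, then project onto orthonormal principal components.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
X_pca = pca_pipeline.fit_transform(selected)

# Fraction of the variance explained by each component, in decreasing order.
print(pca_pipeline.named_steps["pca"].explained_variance_ratio_)
```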
The resulting dataset generated from the 24 h of simulated data was then divided into 24 datasets, one for each hour. The size of each final subset was around 500,000 data points. We performed a grid search to set the OPTICS hyperparameters minPoints and ε. We found that setting minPoints to 5000 produced the best results; with regard to ε, a value of 0.5 ensured a good exploration of the space while considerably reducing the run time. A summary can be seen in Table 2.
The ML modeling stage consisted of running the OPTICS clustering algorithm to produce the reachability plot and subsequently selecting a cut-off threshold for the actual clustering (see Section 5). Since we are interested in classifying the secondary particle contributions, we needed to check whether the generated clusters have a majority contribution from a specific particle; this, essentially, would mean that a cluster becomes a classification of a particular particle. Using the ground truth provided by the simulation, the composition of each cluster was calculated from the OPTICS output (for each hour).
Up to the ML modeling stage of the pipeline, the process is deterministic, so it only needed to be performed once. For the ML stage, as mentioned above, the 24 h of data were divided into 24 one-hour datasets that were processed independently in parallel. This process was then repeated ten times to test both its accuracy and precision. On each run, the output may change (a non-deterministic process) because the algorithm may start ordering the points from different initial points. Ideally, all runs should converge to similar results; thus, we computed the final output as the average of the results of the independent runs. This strategy is used to evaluate whether the algorithm gives both accurate and precise results, hence being robust. The final output of the ML modeling stage is the set of clusters obtained and the average and standard deviation of each particle type within each cluster.
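The aggregation over repeated runs can be expressed, for example, as follows; each per-run composition is assumed to be a table of particle fractions indexed by cluster label:

```python
import pandas as pd

def aggregate_runs(run_compositions: list) -> pd.DataFrame:
    """Mean and standard deviation of particle fractions across repeated runs.

    Each element of `run_compositions` is assumed to be a DataFrame indexed by
    cluster label, with one column per particle type holding its fraction (%).
    """
    stacked = pd.concat(run_compositions, keys=range(len(run_compositions)))
    by_cluster = stacked.groupby(level=1)  # level 0: run id, level 1: cluster label
    return pd.concat({"mean": by_cluster.mean(), "std": by_cluster.std()}, axis=1)

# Example with two toy runs and two clusters:
run_a = pd.DataFrame({"muon": [96.0, 10.0], "photon": [3.0, 80.0]}, index=[0, 1])
run_b = pd.DataFrame({"muon": [97.0, 12.0], "photon": [2.5, 78.0]}, index=[0, 1])
print(aggregate_runs([run_a, run_b]))
```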
In future work, we expect to enhance the pipeline by adding other stages towards achieving better scalability, easy implementation and monitoring in semi-operative mode, and explainability.
Finally, this methodology can be extrapolated and applied to different LAGO sites and WCDs by running the same learning process once to learn their respective actual characteristics. If this model is used after the calibration of the WCD, we can estimate the particle composition in each of the detectors, taking into account the site rigidity cut-off, altitude, and WCD geometry, among other particular characteristics. When, for instance, the water starts aging, the particle grouping will start varying, and the model will then act as an automatic monitoring tool of the WCD health. This is one of the possible applications of an operative version.
5. Results
A total of 240 runs of the ML pipeline were conducted: 10 runs per hour of simulated data for a total of 24 h. Each run employed the OPTICS algorithm, which determined groupings in two steps: (a) generating a reachability plot and (b) performing the actual clustering based on a cut-off threshold to determine cluster membership.
An example of a reachability plot, shown in Figure 5 and obtained from a single hour of data, displays clear cluster structures. Each cluster is marked with a different color, while points that do not belong to any cluster are marked in black. As described in Section 3, in the reachability plot the X-axis represents the ordered points and the Y-axis represents the reachability distance. A visual inspection reveals several ’valleys’ that indicate potential clusters. To define these clusters, a threshold must be selected. In our study, we used a fixed cut-off threshold of 0.08 for cluster membership, resulting in a stable eight-cluster structure across all runs. Although we considered different values for the cut-off, 0.08 consistently provided good results for all the datasets used in this work.
Each identified cluster in Figure 5 is well-defined. However, while the first six clusters exhibit mostly a singular structure, cluster 7 displays a more complex composition with numerous substructures, appearing as small ’valleys’ within the larger group. These subgroups are absorbed into the larger cluster due to the fixed cut-off threshold that was selected. This specific complex case needs further investigation, which will be addressed in future work, where we will incorporate additional data and implement adaptive thresholding.
Using the same example run as a reference, Figure 6 shows the corresponding histogram of the total number of PEs deposited by the events, with the Y-axis in logarithmic scale. Each cluster is labeled and colored according to the same scheme used in the reachability plot, facilitating comparison between the two figures. The histogram shows three groups with similar behavior located at larger charges (between 6000 and 10,000 PEs), in concordance with the muon hump. These groups are clusters 0, 1, and 2.
When analyzing the content of each cluster, Cluster 7 contained the highest number of points, with approximately 431,000 particles, followed by Cluster 0 with around 110,000 particles and Cluster 2 with about 104,000 particles. The remaining clusters contained between 34,000 and 67,000 particles. These numbers represent the number of particles averaged over the output of each run. It is noteworthy that the number of particles not assigned to any cluster was only around 13,000, which is relatively small compared to the number of particles that the algorithm assigns to clusters. These values are aligned with the visual groupings observed in Figure 5.
An initial visual inspection suggested that certain clusters are strong candidates for containing a majority of a particular secondary particle. For instance, clusters 0, 1, and 2 appeared to be predominantly composed of muons, as evidenced by the well-known muon hump visible in each of these clusters. Additionally, each of the one-hour runs consistently produced similar results for the eight clusters, maintaining a similar particle composition.
To validate these results, we used the ground truth dataset. For each run, the particle composition was extracted and recorded. To assess the robustness of our findings, we aggregated the results by calculating the average and variation of the outcomes.
Table 3 presents the summary statistics for the 240 one-hour runs, detailing the distribution of clusters and secondary particles. The most noteworthy results are observed in clusters 0, 1, and 2, where the majority of particles are muons, accounting for approximately 96.58%, 89.25%, and 97.27% of the total particles, respectively. These clusters showed a minimal presence of other particle types. Cluster 3 contained about 58% muons, which we did not consider a significant majority; this cluster also showed an appreciable variation across different runs, indicating some difficulty for the algorithm in consistently grouping these particles. Additionally, Cluster 3 included roughly 23% photons, with other particles present in lower amounts. Cluster 4 was more heterogeneous, with approximately 46% photons, 23% electrons and positrons, 28% muons, and smaller percentages of neutrons and hadrons. The run-to-run variation of its constituent particles (photons and muons in particular) suggested that the algorithm varied slightly in how it formed this cluster across runs. The remaining clusters, situated in the lower energy regions of the histogram (i.e., towards its left side), were predominantly composed of photons: Clusters 5, 6, and 7 had photon compositions of approximately 62%, 70%, and 80%, respectively. They also exhibited similar proportions of electrons and positrons (around 20%), with small percentages of other particles. These results imply that the algorithm lacked sufficient information to adequately separate the types of secondary particles in these lower-energy regions. Nevertheless, the consistency between the independent one-hour runs indicates that the algorithm reliably produces robust results.
In summary, the statistics revealed three distinct categories of clusters: (a) clusters with a majority of muons (Clusters 0, 1, and 2); (b) clusters with a majority of photons (Clusters 5, 6, and 7); and (c) mixed groups (Clusters 3 and 4). These results are illustrated in Figure 7, which presents a stacked bar chart showing the percentage composition of particles for each cluster. This visualization highlights the algorithm’s high accuracy, especially in grouping the muonic contributions of the simulated data.
6. Discussion and Conclusions
In this work, we proposed a machine learning pipeline implementing the OPTICS clustering algorithm to identify individual components within a charge histogram derived from synthetic data. The synthetic dataset was generated with the ARTI and MEIGA frameworks of the LAGO software suite, and the Monte Carlo simulation outputs were tailored to fit the pipeline. The dataset encapsulated the characteristics of the LAGO WCD located at the Bariloche site in Argentina, known as ‘Nahuelito’. The pipeline can be summarized as a set of linked stages whose output is the outcome of the ML model; these stages include filtering/cleaning, feature engineering and selection, and the actual ML model. Unlike the typical unsupervised ML setting, and because the dataset is the output of simulations, the ground truth is known.
Using a 24-h dataset, we developed an end-to-end data science pipeline to implement the OPTICS algorithm, a hierarchical density-based clustering method. Then, the results were validated with the ground truth, and they demonstrated that our pipeline can effectively produce well-separated clusters.
Specifically, clusters 0, 1, and 2 predominantly consisted of muons, contributing to the well-known muon hump present in the charge histogram. These findings help validate the initial results presented in our previous work [18].
Figure 8 presents a zoomed-in view of the charge histogram (in black), alongside clusters 0, 1, and 2. The distinct shapes of the charge distributions of these clusters, which have the highest muon content, reflect the expected differences due to the entry and exit trajectories of muons in the WCDs [48]. Cluster 0 would correspond to signals from muons passing vertically through the detector, associated with the well-known Vertical Equivalent Muon (VEM) parameter, a standard observable for the calibration of this type of detector when no auxiliary detectors are available. Clusters 1 and 2 would correspond to other muon trajectories, e.g., muons arriving at the WCD at different angles.
The predominance of the electromagnetic component (gammas, electrons, and positrons) in clusters 5, 6, and 7, and of muons in clusters 0, 1, and 2, suggests the potential to define bands of maximum content for these components, as well as an intermediate zone of mixed content. This will facilitate not only multispectral analysis but also particle-type analysis, similar to the methodology employed in the LAGO space weather program [27], which relies on the automatic determination of the WCD response to the flux of secondary particles, in particular during astrophysical transients such as those produced during Space Weather events; see Figure 9.
The proposed ML pipeline produced robust results for all 240 independent instances. In addition, the repeated runs showed minimal variation, demonstrating the stability of the algorithm in reproducing results. Furthermore, the lower-energy regions of the charge histogram, as in cluster 7, showed substructures that need further analysis, planned for future work.
Given that the proposed ML pipeline is planned to be implemented as a semi-automated, onboard, and real-time data analysis and calibration tool across the LAGO distributed network of WCDs, it is crucial to analyze its scalability in the context of Big Data with larger datasets and analyze its robustness across WCDs with diverse site characteristics. To achieve these objectives, we plan to develop a comprehensive benchmarking framework to automatically and seamlessly test the ML pipeline under various scenarios. As part of the planned automation, we will include hyperparameter tuning, handling of datasets of increasing size, and further exploration of predictive features. This approach will ensure the model’s adaptability and reliability in varied operational conditions.
Another key advantage of the framework developed for this work is that it can be seamlessly integrated into the current LAGO software suite, directly using the output from MEIGA simulations. Unlike conventional high-performance computing benchmarks, which have a low dependency on datasets, ML benchmarks are highly dependent on the dataset used for training and inference. Thus, we will perform the benchmarking for each of the simulated LAGO sites and report the output and statistics.
Finally, this proposed benchmark will be an important step towards its deployment at LAGO WCD sites, as we want to ensure the effective use and monitoring of ML methods tailored specifically for each site.