1. Introduction
Automatic anomaly detection is widely regarded as one of the significant challenges in the analysis and recognition of measurement data. Anomaly detection concerns the search for those observations that deviate from the definition of normality adopted for the considered set of observations [1,2]. Sometimes interchangeable terms such as outlier detection or novelty detection are also used in this context, although they are not completely synonymous [3,4,5,6,7]. This area has been actively developed in recent years, and many methods have been proposed in this field of research [8,9]. Among the first techniques proposed to deal with anomaly detection were statistical methods [10,11], especially those related to density estimation, such as KDE (Kernel Density Estimation) [12]. Nowadays, many solutions apply various machine learning methods, including both shallow and deep models [5,13,14].
Anomaly detection techniques are applied to analyze and solve a wide range of problems in various areas. Examples of practical applications include cybersecurity (intrusion detection systems) [15,16,17,18,19], economy and healthcare (fraud detection) [20,21,22,23,24,25], industry (fault diagnosis, damage detection) [26,27,28,29,30,31], medicine (medical diagnosis, disease outbreak detection) [32,33,34,35,36], earth sciences (event detection) [37,38,39,40], bioinformatics [36,41,42,43], genetics [44,45], physics [46,47,48,49,50] and astronomy [51,52,53,54,55,56].
The ability to detect non-trivial observations that deviate from a consistent data stream is a particular challenge in particle physics and astronomy [57]. The search for unusual data can lead to the discovery of unknown physical phenomena [58,59,60]. Creating effective tools for this type of automatic analysis and identification is therefore an important research subject in particle physics and astronomy.
Research in particle physics is performed on data acquired from experiments conducted on large-scale stationary particle accelerators in projects such as the LHC at CERN (Large Hadron Collider) [61,62,63,64], SLAC (Stanford Linear Accelerator Center) [65], the Thomas Jefferson National Accelerator Facility [66,67], J-PARC (Japan Proton Accelerator Research Complex) [68,69] and many others. There are also large-scale observatories that measure cosmic radiation arriving from space, among them the Pierre Auger Observatory [70,71], IceCube [72] and the Telescope Array Project [73]. Stationary observatories of this type perform very accurate measurements. However, their observations are limited to the area where their research infrastructure is located, so they observe only a certain fraction of the cosmic radiation reaching the Earth's atmosphere. To overcome the limitations of stationary observatories, several projects have been developed in recent years that allow distributed observations of cosmic radiation. These projects are based on the citizen science paradigm and use CMOS/CCD camera-based particle detectors [74]. Projects of this kind are: CRAYFIS (Cosmic RAYs Found In Smartphones) [75,76], DECO (Distributed Electronic Cosmic-ray Observatory) [77,78] and CREDO [79,80]. CRAYFIS [81] is a globally distributed network of cosmic-ray sensors for the exploration of cosmic rays, with the potential to reveal unexpected or previously unobserved planet-scale phenomena such as widely separated simultaneous extensive air showers. DECO is a similar project that utilizes smartphone-based cosmic-ray detectors; it is conducting advanced research on the detection and classification of particle types based on deep learning models [82]. The CREDO project uses smartphone-based detectors and additionally integrates data from other sources, such as simple scintillation detectors [83,84,85,86,87]. The data collected by the CREDO project are stored in open repositories and are available for scientific purposes. In all of these projects, the use of mobile detectors based on optical sensors that record traces of particle radiation energy offers great flexibility and the possibility of extending observation coverage on a global scale [81]. The detectors acquire a huge amount of measurement data of various types, which requires appropriate analysis, especially automatic recognition in big data streams [88,89].
The main purpose of searching for unusual signals (anomalies) in such data sets is to look for new physical phenomena (unknown physics) [79]. Such new phenomena might be registered as non-typical particle traces observed on detector arrays and might be evidence of new particles or physical interactions. They can occur when ultra-high-energy cosmic radiation strikes the Earth and creates a stream of secondary particles observed by the detectors. It should be noted that the primary particles hitting the atmosphere can have energies far beyond the ranges achievable in Earth's laboratories, creating unique physical conditions. Detection of unusual particle images observed on a globally distributed set of detectors also has the potential to reveal unexpected or previously unobserved phenomena occurring at the planetary scale [81]. Such phenomena might be revealed, for example, if similar types of anomalies occur in remote geographic locations corresponding to independent or simultaneous extensive air showers (EAS). Statistical analysis of anomalies in a large dataset is also a useful tool for tuning detection and filtering algorithms for observed events. Such analysis also makes it possible to study the response of a variety of CMOS sensors to radiation by analyzing a statistically significant number of actual measurements.
The problem addressed in this paper concerns the detection of anomalies in CREDO data acquired from smartphone-based mobile detectors. The primary carriers of information in this case are images of particle tracks recorded on CMOS arrays [74,90]. Since the data are collected in continuous mode, it is necessary to take into account the possibility of searching the incoming data stream for unusual observations. So far, CREDO data have been analyzed for both background signal filtering and artifact removal [91,92,93], as well as classification and recognition [92,94,95,96]. There has also been initial work on detecting abnormal data based on various techniques, such as rough sets [97].
1.1. Novelty of This Research
To our knowledge, this is the first study that proposes a method capable of detecting potential anomalies in a continuous data stream and finding objects with a similar morphological structure in unlabeled cosmic-ray data collected by a distributed network of CMOS sensors embedded in cell phones. The proposed solution has been implemented and validated on the largest dataset of its kind to date, containing over 570,000 images. Importantly, our approach has no limitations due to the size of the dataset, as the embedding can be calculated and updated relatively quickly using small batches of new data. We achieved this by using incremental PCA (Principal Component Analysis) feature extraction [98,99,100,101], appropriate image preprocessing and a density-based anomaly search. In practice, the method presented in this paper has the potential for immediate detection of potential anomalies in the data stream incoming from the entire CREDO observatory network.
1.2. Paper Structure
The rest of the article is organized as follows. Section 2 discusses the structure of the CREDO data subset used in the article, explains the preprocessing of the raw image data, the mathematical basis of Incremental PCA, and the scheme of the anomaly detection algorithm. We have divided the presentation and discussion of the results into two sections: Section 3 presents the technical aspects of the proposed method, and Section 4 contains a detailed discussion and interpretation of the results. Section 5 summarizes the scientific contributions.
3. Results
We implemented our solution in Python 3.8. The source code of the proposed algorithm and the dataset can be downloaded from the GitHub repository https://github.com/browarsoftware/anomalies_bigdata (accessed on 20 December 2023). We used the numba 0.5, numpy 1.22, opencv-python 4.5, scikit-learn 1.0 and scipy 1.8 Python libraries. Plots were made in the R language 3.6.
The purpose of the evaluation was to test the effectiveness of detecting potential anomalies using the image preprocessing (aligning) methods described in Section 2.2, the PCA-based features described in Section 2.3 and the anomaly detection approach described in Section 2.4. The dataset presented in Section 2.1 was shuffled randomly. In Table 2 we present a comparison of the resulting coordinate frames computed using basic PCA for the four different preprocessing algorithms. The comparison of axes is intended to numerically quantify the difference between the potential embeddings and to indicate the effect of using different preprocessing methods on the calculation of PCA. The comparison of coordinate systems is done using the coordinate frames weighted distance (cfd):

cfd([U, λ], [V, μ]) = ( Σᵢ λᵢ · ∡(uᵢ, sc·vᵢ) ) / Σᵢ λᵢ, (7)

where [U, λ] are the eigenvectors matrix and eigenvalues vector of the first PCA, [V, μ] are the eigenvectors matrix and eigenvalues vector of the second PCA, uᵢ is the i-th eigenvector and λᵢ the i-th eigenvalue of the first PCA, sc is a sign correction (see Algorithm 2) and ∡ is an operator for calculating the angle between vectors. Note that all eigenvalues of PCA are non-negative; cfd is measured in radians (rad).
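For reference, the cfd measure can be sketched in a few lines of Python. This is an illustrative reimplementation under our reading of the definition (an eigenvalue-weighted mean angle between sign-corrected axis pairs); the function name and the simple per-axis sign correction used here are ours, not the repository code:

```python
import numpy as np

def cfd(U, lam, V):
    """Coordinate frames weighted distance (in radians) between two PCA bases.

    U, V : (d, m) matrices whose columns are unit eigenvectors.
    lam  : (m,) non-negative eigenvalues of the first PCA, used as weights.
    """
    angles = np.empty(U.shape[1])
    for i in range(U.shape[1]):
        c = float(U[:, i] @ V[:, i])
        sc = -1.0 if c < 0 else 1.0              # eigenvectors are defined up to sign
        angles[i] = np.arccos(np.clip(sc * c, -1.0, 1.0))
    return float(np.sum(lam * angles) / np.sum(lam))  # eigenvalue-weighted mean angle
```

Two identical frames give cfd = 0, and flipping the sign of every axis of one frame also gives 0 thanks to the sign correction.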
To perform the embedding we used 62 features out of 3600; that is, we reduced the dimensionality of the embedding to 62 dimensions. Such a reduction explains, depending on the preprocessing method adopted, between … and … of the total variance in our dataset. We decided to adopt this number of dimensions because it allowed us to more easily manipulate the value of ε in (4), which must be determined depending on the number of dimensions and in practice cannot be determined in any way other than experimentally, as in the DBSCAN algorithm.
We compared the sets of potential anomalies returned by the method described by Equation (4) for the different preprocessing algorithms from Section 2.2 and the embedding calculated with the basic PCA algorithm from Section 2.3. We used the Jaccard index (J) [115] and the Overlap coefficient (OC) [116] to compare the sets of anomalies:

J(A, B) = |A ∩ B| / |A ∪ B|, (8)

OC(A, B) = |A ∩ B| / min(|A|, |B|), (9)

where A and B are the potential anomaly sets to be compared.
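Both coefficients are straightforward to compute for finite sets; a minimal sketch (function names ours):

```python
def jaccard(A, B):
    """Jaccard index: |A intersect B| / |A union B|."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B) if (A or B) else 1.0

def overlap(A, B):
    """Overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    A, B = set(A), set(B)
    return len(A & B) / min(len(A), len(B))
```

Note that OC equals 1 whenever one set is a subset of the other, which is why it is well suited to checking whether shrinking ε only adds anomalies without removing previously found ones.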
Figure 1 and Figure 2 present a comparison of the results of potential anomaly detection with (4), evaluated with (8) and (9). The types of preprocessing and the values of ε are given in Table 3. The k parameter in (4) was arbitrarily set to 3. The first sixteen potential anomalies for each of the four preprocessing methods, calculated for basic PCA with a fixed value of ε, are shown in Figure 3.
The next stage of the evaluation was to test the effectiveness of using Incremental PCA (see Section 2.6) in the procedure for detecting potential anomalies in the dataset under the condition of continuously incoming data (see Section 2.7). To do so, we compared the coordinate frames obtained with basic PCA to the coordinate frames obtained with Incremental PCA for different amounts of data used when approximating PCA with Algorithm 2. The results are shown in Figure 4. Each point on the plot shows the cfd value (7) between the PCA coordinate axes calculated on the whole dataset and the coordinate axes calculated by Incremental PCA on a certain percentage of the whole dataset; for example, the PCA axes calculated on the whole set with B. Replicate preprocessing versus the axes calculated with Incremental PCA, also with B. Replicate preprocessing, on 10%, 20%, 30% of the data, etc. Selected cfd values for this evaluation are presented in Table 4. For Incremental PCA, we assumed a batch size (bs) of 10,000.
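The batch-wise fitting can be reproduced with scikit-learn's IncrementalPCA. The sketch below is deliberately scaled down so that it runs quickly (synthetic 200-dimensional vectors instead of the 3600-dimensional flattened images, batches of 1000 instead of 10,000); only the 62 components match the setting used in the paper:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)

# Scaled-down stand-in for the real stream of flattened track images.
ipca = IncrementalPCA(n_components=62, batch_size=1000)
for _ in range(3):                       # three incoming chunks of the stream
    batch = rng.normal(size=(1000, 200))
    ipca.partial_fit(batch)              # update the PCA basis incrementally

# Embed newly arriving images into the current 62-dimensional feature space.
features = ipca.transform(rng.normal(size=(5, 200)))
```

Because each partial_fit call only touches one batch, memory usage stays bounded regardless of how much data has been seen, which is what makes the streaming scenario of Section 2.7 feasible.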
Then we compared the Jaccard Index and Overlap Coefficient of the method for finding potential anomalies (4), with the same parameters, for PCA and for Incremental PCA calculated on increasing amounts of data. Since the results for each image alignment were very similar, in Figure 5 we present the results for B. Reflect only. We always performed the embedding on the entire dataset and calculated Incremental PCA on a subset of the data, thus simulating a constant increment of new data relative to the data used when computing the embedding. The legend of Figure 5 encodes the number of images used by Incremental PCA, ranging from the full dataset (basic PCA) down to successively smaller subsets of images.
The last step of the evaluation was to check the effectiveness of the method that detects similar objects according to Equation (6). For this purpose, we used the B. Reflect preprocessing algorithm and generated features using basic PCA. We do not present the results obtained with Incremental PCA because, as shown in the discussion, they are virtually identical to those of basic PCA. We selected 9 sample images representing characteristic shape morphologies of particle tracks in the dataset and found the most similar images according to Equation (6). The results are presented in Figure 6.
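Assuming Equation (6) ranks candidates by Euclidean distance in the PCA feature space, a similar-object query reduces to sorting distances to the query embedding; a minimal sketch (naming ours):

```python
import numpy as np

def most_similar(X, query_idx, m):
    """Indices of the m embedded images closest (Euclidean) to X[query_idx]."""
    d = np.linalg.norm(X - X[query_idx], axis=1)
    d[query_idx] = np.inf                # never return the query image itself
    return np.argsort(d)[:m]
```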
4. Discussion
Based on the results shown in Table 2, it can be concluded that the individual coordinate frames calculated on the dataset with different preprocessing methods differ from each other with respect to the cfd measure (7). In the case of no preprocessing (None) and of B. Constant, the differences between the obtained coordinate frames are the largest. This is because B. Constant attaches black pixels to the borders of the resulting image, which are not present in such numbers in the raw images. There is little difference between the coordinate frames calculated on data processed with B. Reflect and with B. Replicate. Although this difference is small, there is no guarantee that the embedding calculated with PCA on a set preprocessed with one method can be used interchangeably with the embedding calculated on a set preprocessed with another method. The choice of a particular preprocessing method therefore determines the necessity of its use in all subsequent stages of dataset analysis.
Although the different methods create different embeddings of the images, the sets of anomalies they find are not significantly different. According to the results in Figure 1 and Figure 2, the number of anomalies found naturally decreases as ε increases. This can be observed when comparing two detection runs with a larger and a smaller ε value: the Jaccard Index is smaller when the difference in ε between the two runs is larger. As expected, for a given preprocessing method, as ε decreases, new objects are added to the set of anomalies without removing those found with a larger ε. This can be observed from the Overlap Coefficient, which always has a value of 1 within a single preprocessing method regardless of ε. The Overlap Coefficient analysis also shows that using a preprocessing method (other than None) results in each of the detection runs returning a very similar set of potential anomalies: OC almost always equals 1, and its smallest value, 0.74, occurs when comparing B. Constant and B. Replicate. If we compare the embedding based on preprocessing None with the other methods, the Overlap Coefficient ranges from 0 to 1, which means that different sets of potential anomalies are returned. Thus, one can conclude that preprocessing affects the anomalies that we detect. In the case of the Jaccard Index, the values are in most cases less than 1. This means that for the same values of ε, the different preprocessing methods affect the embedding in such a way that the returned sets of potential anomalies differ in size. This confirms the results from Table 2: the coordinate frames differ from each other, and so do the distances between objects in the PCA-derived spaces.
When designing the potential anomaly search algorithm using PCA embedding, we do not define the particle trajectory morphologies of interest. We expect that if we do not apply preprocessing but work on an embedding generated from the raw dataset (in our case, preprocessing equal to None), the returned potential anomalies will differ from those found when preprocessing is applied. This expectation is confirmed in Figure 3. The set of the first 16 retrieved anomalies in the case of preprocessing None differs from the sets returned by Replicate, Reflect and Constant, whereas those three preprocessing methods return very similar particle tracks. This means that translating and rotating the objects so that the largest variance of the bright points lies along the horizontal axis significantly affects the result. Based on what is known about PCA in other image domains, it can be concluded that image alignment is a beneficial step for variance analysis. For this reason, we recommend the use of image aligning. The sets of potential anomalies returned by the proposed algorithm do not contain any typical morphologies of particle track shapes (see, for example, the results in Figure 3). Thus, one can conclude that our proposed method effectively filters off typical (in terms of analysis of variance) shapes of particle tracks by searching for those that are significantly different from the others in the dataset.
Based on the results from Table 4, shown in Figure 4, it can be seen that as the amount of data processed by Incremental PCA increases (in our case with the batch size set to 10,000), the cfd between the PCA approximation and basic PCA, expressed in radians, decreases. Already for a fraction of the dataset, the difference between the two coordinate frames is between 0.04 and 0.06 radians. The similarity of the coordinate frames calculated with basic PCA and with Incremental PCA also translates into the similarity of the obtained embeddings and thus of the detected anomalies. We performed such an analysis for preprocessing B. Reflect. As can be seen in Figure 5, for Incremental PCA calculated on a large portion of the data with batch size 10,000, the returned sets of potential anomalies are almost identical to those of basic PCA. As the amount of data used to calculate Incremental PCA decreases, both coefficients also decrease, but not significantly. This means that using Incremental PCA, recalculated on incoming data with a batch size of 10,000, we obtain almost identical anomaly detection results as with basic PCA. It can be concluded that, provided the dataset is shuffled (and representative), our method can detect an almost identical set of anomalies using Incremental PCA calculated on a small portion of the dataset as when using basic PCA calculated on the whole of it. The method we proposed in Section 2.7 for detecting potential anomalies in a large dataset under the condition of continuously incoming objects thus works almost identically to the method using the entire dataset for the PCA calculation. As a result, the approach we have proposed significantly reduces the memory and computational requirements of the anomaly detection algorithm and makes it possible to use it for big datasets.
The use of (6) to detect similar objects also works as expected: it returns objects morphologically similar to the one being searched for. The results shown in Figure 6 for B. Reflect confirm that the returned objects have a very similar shape to the query image. Thanks to image aligning, the method is not sensitive to translation and rotation of objects in the images. As can be seen, the method based on (6) handles well the morphology of dots, lines, worms and various types of complex shapes. When searching for similar objects, the method also returns objects with a similar level of background noise, which may not be entirely beneficial (compare the first and second rows of Figure 6). At the moment, however, with the preprocessing method described by Algorithm 1, it is not possible to remove the background. This is a certain drawback of the method if it is applied to search for morphologically similar objects regardless of background. Despite this, its results are very satisfactory in terms of morphology search.
The proposed anomaly detection algorithm worked as expected. The images it found have anomalous features according to their definition (4); that is, they contain traces of potential particles whose morphology differs significantly from the typical image classes, i.e., dots, lines and worms. The use of PCA as a feature extraction method did not create concentric clusters of objects. Due to this fact, one cannot use distance-based measures to find the central object of potential clusters, e.g., the "most typical dot class trajectory" around which similar objects are located. This behavior was expected, because PCA does not statistically differentiate the correct signal from the background noise present in some images. For this reason, density-based clustering seems to be an appropriate approach for grouping objects with similar morphology. Because (4) defines anomalies using a density-based approach, it is impossible to say which potential particle trace is "more anomalous" than another. By controlling the parameters ε and k in (4) and using various types of preprocessing, we have the ability to search the entire dataset. As we indicated in Figure 1 and Figure 2, the preprocessing method slightly affects the returned sets of anomalies.
We cannot exclude the possibility that some of these images are artifacts caused by visible light reaching the CMOS array. At this stage, we do not yet know the physical interpretation of the anomalies we detect. The main goal of our study was to create a method that allows them to be found efficiently in large data sets. The physical interpretation of the obtained results is beyond the scope of this work and requires further research. Our proposed method is intended to be a useful mathematical tool for defining and finding potential anomalies.
In Figure 7 we present examples of anomalies detected by the proposed method (B. Replicate preprocessing, basic PCA). We chose them because they represent a variety of deviations from the most typical expected trajectory shapes. Figure 7a contains clearly separated trajectories similar in shape to a dot and a worm. Figure 7b contains a circular shape in the center (larger than a typical dot) surrounded by a halo, which affects the CMOS sensor less than the core of the potential hit. Figure 7c looks like a typical worm, but the angle between its two parts is close to a right angle, which is unusual. Figure 7d also morphologically resembles a worm; however, its trajectory forms a closed loop. Figure 7e contains a relatively wide rectilinear band, probably with a low energy deposit, which resembles a cloud. Figure 7f is probably the result of image file corruption, because it looks as if it consists of two images separated horizontally. Figure 7h contains a single circular area, but one much larger than typical dot class representatives. In contrast, Figure 7g contains a large dot with an additional linear tail.