Machine Learning-Based Anomaly Detection on Seawater Temperature Data with Oversampling
Abstract
1. Introduction
2. Related Work
2.1. CTD Error Detection
2.2. Statistics-Based Anomaly Detection
2.3. Machine Learning-Based Anomaly Detection
2.4. Class Imbalanced Problem of Anomaly Detection
3. Methodology
3.1. CTD System
3.2. Dataset
3.3. Oversampling Methods
3.4. Anomaly Detection Models
4. Experiments and Evaluation
4.1. Performance Metrics
4.2. Experimental Setting
4.3. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Model Type | Model Name | Dataset: Scale (0–1) | Dataset: Oversampling (Augmentation) | Score: Sensitivity (Recall) | Score: Precision | Score: F1 Score (Std.) | Score: AUROC (Std.)
---|---|---|---|---|---|---|---
Traditional method | IQR | x | x | 0.153 | 0.188 | 0.168 | 0.523
OCSVM | OCSVM-ND (normal data) | x | x | 0.475 | 0.139 | 0.215 | 0.501
OCSVM | OCSVM-AD (abnormal data) | x | x | 0.508 | 0.128 | 0.205 | 0.476
OCSVM | OCSVM-D-30 | x | Duplication 30% | 0.508 | 0.133 | 0.211 | 0.486
OCSVM | OCSVM-D-50 | x | Duplication 50% | 0.508 | 0.128 | 0.205 | 0.476
OCSVM | OCSVM-D-75 | x | Duplication 75% | 0.508 | 0.129 | 0.206 | 0.478
OCSVM | OCSVM-D-100 | x | Duplication 100% | 0.508 | 0.128 | 0.205 | 0.476
OCSVM | OCSVM-R-30 | x | Uniform random 30% | 0.492 | 0.139 | 0.216 | 0.5
OCSVM | OCSVM-R-50 | x | Uniform random 50% | 0.492 | 0.141 | 0.219 | 0.504
OCSVM | OCSVM-R-75 | x | Uniform random 75% | 0.492 | 0.14 | 0.218 | 0.503
OCSVM | OCSVM-R-100 | x | Uniform random 100% | 0.492 | 0.141 | 0.219 | 0.504
OCSVM | OCSVM-S-30 | x | SMOTE 30% | 0.508 | 0.129 | 0.205 | 0.477
OCSVM | OCSVM-S-50 | x | SMOTE 50% | 0.508 | 0.126 | 0.202 | 0.47
OCSVM | OCSVM-S-75 | x | SMOTE 75% | 0.508 | 0.135 | 0.214 | 0.492
OCSVM | OCSVM-S-100 | x | SMOTE 100% | 0.508 | 0.135 | 0.214 | 0.492
OCSVM | OCSVM-A-30 | x | AE 30% | 0.508 | 0.125 | 0.201 | 0.467
OCSVM | OCSVM-A-50 | x | AE 50% | 0.508 | 0.126 | 0.202 | 0.47
OCSVM | OCSVM-A-75 | x | AE 75% | 0.508 | 0.126 | 0.201 | 0.469
OCSVM | OCSVM-A-100 | x | AE 100% | 0.508 | 0.126 | 0.202 | 0.47
MLP-1 hidden layer sizes (10) | MLP-1 | x | x | 0.812 | 0.936 | 0.869 (0.021) | 0.901 (0.015)
MLP-1 | MLP-1-D-30 | x | Duplication 30% | 0.8 | 0.915 | 0.852 (0.023) | 0.894 (0.023)
MLP-1 | MLP-1-D-50 | x | Duplication 50% | 0.819 | 0.819 | 0.807 (0.062) | 0.891 (0.022)
MLP-1 | MLP-1-D-75 | x | Duplication 75% | 0.82 | 0.781 | 0.796 (0.041) | 0.891 (0.03)
MLP-1 | MLP-1-D-100 | x | Duplication 100% | 0.817 | 0.81 | 0.811 (0.035) | 0.892 (0.014)
MLP-1 | MLP-1-R-30 | x | Uniform random 30% | 0.888 | 0.312 | 0.392 (0.172) | 0.654 (0.145)
MLP-1 | MLP-1-R-50 | x | Uniform random 50% | 0.826 | 0.433 | 0.519 (0.197) | 0.751 (0.116)
MLP-1 | MLP-1-R-75 | x | Uniform random 75% | 0.797 | 0.672 | 0.713 (0.081) | 0.862 (0.052)
MLP-1 | MLP-1-R-100 | x | Uniform random 100% | 0.693 | 0.718 | 0.671 (0.115) | 0.816 (0.073)
MLP-1 | MLP-1-S-30 | x | SMOTE 30% | 0.814 | 0.914 | 0.858 (0.021) | 0.9 (0.025)
MLP-1 | MLP-1-S-50 | x | SMOTE 50% | 0.81 | 0.885 | 0.842 (0.02) | 0.896 (0.023)
MLP-1 | MLP-1-S-75 | x | SMOTE 75% | 0.846 | 0.793 | 0.816 (0.023) | 0.904 (0.013)
MLP-1 | MLP-1-S-100 | x | SMOTE 100% | 0.856 | 0.644 | 0.717 (0.147) | 0.873 (0.071)
MLP-1 | MLP-1-A-30 | x | AE 30% | 0.79 | 0.931 | 0.852 (0.032) | 0.89 (0.032)
MLP-1 | MLP-1-A-50 | x | AE 50% | 0.78 | 0.907 | 0.833 (0.047) | 0.882 (0.036)
MLP-1 | MLP-1-A-75 | x | AE 75% | 0.793 | 0.907 | 0.845 (0.022) | 0.89 (0.019)
MLP-1 | MLP-1-A-100 | x | AE 100% | 0.773 | 0.793 | 0.769 (0.034) | 0.867 (0.042)
MLP-1 | MLP-1-S | o | x | 0.647 | 0.844 | 0.729 (0.029) | 0.813 (0.025)
MLP-1 | MLP-1-D-30-S | o | Duplication 30% | 0.687 | 0.839 | 0.754 (0.016) | 0.832 (0.01)
MLP-1 | MLP-1-D-50-S | o | Duplication 50% | 0.715 | 0.795 | 0.752 (0.021) | 0.842 (0.01)
MLP-1 | MLP-1-D-75-S | o | Duplication 75% | 0.76 | 0.725 | 0.739 (0.018) | 0.856 (0.016)
MLP-1 | MLP-1-D-100-S | o | Duplication 100% | 0.765 | 0.698 | 0.727 (0.031) | 0.855 (0.011)
MLP-1 | MLP-1-R-30-S | o | Uniform random 30% | 0.792 | 0.361 | 0.481 (0.103) | 0.759 (0.07)
MLP-1 | MLP-1-R-50-S | o | Uniform random 50% | 0.724 | 0.66 | 0.687 (0.025) | 0.831 (0.021)
MLP-1 | MLP-1-R-75-S | o | Uniform random 75% | 0.758 | 0.732 | 0.742 (0.024) | 0.856 (0.022)
MLP-1 | MLP-1-R-100-S | o | Uniform random 100% | 0.775 | 0.744 | 0.756 (0.027) | 0.865 (0.025)
MLP-1 | MLP-1-S-30-S | o | SMOTE 30% | 0.688 | 0.853 | 0.76 (0.014) | 0.834 (0.011)
MLP-1 | MLP-1-S-50-S | o | SMOTE 50% | 0.688 | 0.823 | 0.747 (0.028) | 0.832 (0.017)
MLP-1 | MLP-1-S-75-S | o | SMOTE 75% | 0.726 | 0.753 | 0.735 (0.021) | 0.843 (0.014)
MLP-1 | MLP-1-S-100-S | o | SMOTE 100% | 0.739 | 0.664 | 0.697 (0.031) | 0.838 (0.02)
MLP-1 | MLP-1-A-30-S | o | AE 30% | 0.546 | 0.868 | 0.658 (0.069) | 0.765 (0.054)
MLP-1 | MLP-1-A-50-S | o | AE 50% | 0.553 | 0.795 | 0.637 (0.093) | 0.761 (0.051)
MLP-1 | MLP-1-A-75-S | o | AE 75% | 0.687 | 0.808 | 0.731 (0.029) | 0.827 (0.024)
MLP-1 | MLP-1-A-100-S | o | AE 100% | 0.746 | 0.694 | 0.713 (0.026) | 0.845 (0.017)
MLP-2 hidden layer sizes (10,15,10) | MLP-2 | x | x | 0.805 | 0.94 | 0.867 (0.019) | 0.899 (0.017)
MLP-2 | MLP-2-D-30 | x | Duplication 30% | 0.797 | 0.907 | 0.845 (0.03) | 0.891 (0.029)
MLP-2 | MLP-2-D-50 | x | Duplication 50% | 0.832 | 0.888 | 0.857 (0.021) | 0.907 (0.017)
MLP-2 | MLP-2-D-75 | x | Duplication 75% | 0.846 | 0.846 | 0.845 (0.017) | 0.91 (0.01)
MLP-2 | MLP-2-D-100 | x | Duplication 100% | 0.849 | 0.812 | 0.828 (0.032) | 0.908 (0.007)
MLP-2 | MLP-2-R-30 | x | Uniform random 30% | 0.681 | 0.324 | 0.406 (0.134) | 0.686 (0.084)
MLP-2 | MLP-2-R-50 | x | Uniform random 50% | 0.798 | 0.466 | 0.559 (0.129) | 0.798 (0.065)
MLP-2 | MLP-2-R-75 | x | Uniform random 75% | 0.776 | 0.617 | 0.661 (0.11) | 0.834 (0.041)
MLP-2 | MLP-2-R-100 | x | Uniform random 100% | 0.821 | 0.825 | 0.82 (0.027) | 0.896 (0.018)
MLP-2 | MLP-2-S-30 | x | SMOTE 30% | 0.832 | 0.937 | 0.882 (0.013) | 0.912 (0.013)
MLP-2 | MLP-2-S-50 | x | SMOTE 50% | 0.846 | 0.89 | 0.866 (0.011) | 0.914 (0.015)
MLP-2 | MLP-2-S-75 | x | SMOTE 75% | 0.815 | 0.82 | 0.816 (0.031) | 0.893 (0.028)
MLP-2 | MLP-2-S-100 | x | SMOTE 100% | 0.819 | 0.853 | 0.832 (0.025) | 0.898 (0.029)
MLP-2 | MLP-2-A-30 | x | AE 30% | 0.8 | 0.932 | 0.859 (0.024) | 0.895 (0.024)
MLP-2 | MLP-2-A-50 | x | AE 50% | 0.841 | 0.914 | 0.875 (0.02) | 0.914 (0.009)
MLP-2 | MLP-2-A-75 | x | AE 75% | 0.814 | 0.925 | 0.863 (0.019) | 0.901 (0.023)
MLP-2 | MLP-2-A-100 | x | AE 100% | 0.849 | 0.801 | 0.822 (0.027) | 0.907 (0.016)
MLP-2 | MLP-2-S | o | x | 0.676 | 0.806 | 0.728 (0.02) | 0.823 (0.02)
MLP-2 | MLP-2-D-30-S | o | Duplication 30% | 0.69 | 0.82 | 0.747 (0.025) | 0.832 (0.011)
MLP-2 | MLP-2-D-50-S | o | Duplication 50% | 0.698 | 0.798 | 0.74 (0.025) | 0.834 (0.023)
MLP-2 | MLP-2-D-75-S | o | Duplication 75% | 0.731 | 0.724 | 0.722 (0.036) | 0.841 (0.015)
MLP-2 | MLP-2-D-100-S | o | Duplication 100% | 0.775 | 0.676 | 0.719 (0.036) | 0.856 (0.011)
MLP-2 | MLP-2-R-30-S | o | Uniform random 30% | 0.707 | 0.524 | 0.589 (0.07) | 0.794 (0.026)
MLP-2 | MLP-2-R-50-S | o | Uniform random 50% | 0.688 | 0.664 | 0.671 (0.051) | 0.814 (0.024)
MLP-2 | MLP-2-R-75-S | o | Uniform random 75% | 0.755 | 0.709 | 0.726 (0.044) | 0.851 (0.025)
MLP-2 | MLP-2-R-100-S | o | Uniform random 100% | 0.705 | 0.749 | 0.724 (0.036) | 0.833 (0.02)
MLP-2 | MLP-2-S-30-S | o | SMOTE 30% | 0.685 | 0.797 | 0.732 (0.038) | 0.827 (0.02)
MLP-2 | MLP-2-S-50-S | o | SMOTE 50% | 0.707 | 0.76 | 0.729 (0.037) | 0.834 (0.017)
MLP-2 | MLP-2-S-75-S | o | SMOTE 75% | 0.714 | 0.718 | 0.712 (0.034) | 0.833 (0.006)
MLP-2 | MLP-2-S-100-S | o | SMOTE 100% | 0.719 | 0.702 | 0.704 (0.031) | 0.833 (0.012)
MLP-2 | MLP-2-A-30-S | o | AE 30% | 0.622 | 0.822 | 0.702 (0.063) | 0.799 (0.034)
MLP-2 | MLP-2-A-50-S | o | AE 50% | 0.614 | 0.867 | 0.712 (0.052) | 0.798 (0.038)
MLP-2 | MLP-2-A-75-S | o | AE 75% | 0.647 | 0.763 | 0.677 (0.036) | 0.801 (0.033)
MLP-2 | MLP-2-A-100-S | o | AE 100% | 0.727 | 0.712 | 0.714 (0.04) | 0.838 (0.015)
MLP-3 hidden layer sizes (500,100,10) | MLP-3 | x | x | 0.809 | 0.915 | 0.856 (0.017) | 0.898 (0.018)
MLP-3 | MLP-3-D-30 | x | Duplication 30% | 0.836 | 0.852 | 0.841 (0.034) | 0.905 (0.014)
MLP-3 | MLP-3-D-50 | x | Duplication 50% | 0.763 | 0.874 | 0.806 (0.017) | 0.871 (0.033)
MLP-3 | MLP-3-D-75 | x | Duplication 75% | 0.839 | 0.774 | 0.8 (0.053) | 0.898 (0.016)
MLP-3 | MLP-3-D-100 | x | Duplication 100% | 0.817 | 0.802 | 0.799 (0.04) | 0.89 (0.037)
MLP-3 | MLP-3-R-30 | x | Uniform random 30% | 0.907 | 0.457 | 0.58 (0.157) | 0.835 (0.069)
MLP-3 | MLP-3-R-50 | x | Uniform random 50% | 0.815 | 0.491 | 0.55 (0.199) | 0.765 (0.121)
MLP-3 | MLP-3-R-75 | x | Uniform random 75% | 0.6 | 0.721 | 0.477 (0.235) | 0.688 (0.153)
MLP-3 | MLP-3-R-100 | x | Uniform random 100% | 0.696 | 0.713 | 0.594 (0.262) | 0.76 (0.157)
MLP-3 | MLP-3-S-30 | x | SMOTE 30% | 0.81 | 0.873 | 0.835 (0.028) | 0.895 (0.027)
MLP-3 | MLP-3-S-50 | x | SMOTE 50% | 0.787 | 0.895 | 0.834 (0.028) | 0.885 (0.03)
MLP-3 | MLP-3-S-75 | x | SMOTE 75% | 0.719 | 0.824 | 0.719 (0.176) | 0.838 (0.098)
MLP-3 | MLP-3-S-100 | x | SMOTE 100% | 0.851 | 0.672 | 0.746 (0.049) | 0.89 (0.008)
MLP-3 | MLP-3-A-30 | x | AE 30% | 0.81 | 0.912 | 0.856 (0.019) | 0.898 (0.021)
MLP-3 | MLP-3-A-50 | x | AE 50% | 0.778 | 0.912 | 0.833 (0.069) | 0.882 (0.054)
MLP-3 | MLP-3-A-75 | x | AE 75% | 0.821 | 0.908 | 0.861 (0.018) | 0.903 (0.016)
MLP-3 | MLP-3-A-100 | x | AE 100% | 0.822 | 0.8 | 0.801 (0.043) | 0.892 (0.023)
MLP-3 | MLP-3-S | o | x | 0.636 | 0.799 | 0.7 (0.047) | 0.803 (0.031)
MLP-3 | MLP-3-D-30-S | o | Duplication 30% | 0.67 | 0.782 | 0.718 (0.032) | 0.819 (0.018)
MLP-3 | MLP-3-D-50-S | o | Duplication 50% | 0.704 | 0.758 | 0.724 (0.029) | 0.832 (0.015)
MLP-3 | MLP-3-D-75-S | o | Duplication 75% | 0.717 | 0.666 | 0.686 (0.051) | 0.828 (0.024)
MLP-3 | MLP-3-D-100-S | o | Duplication 100% | 0.751 | 0.662 | 0.698 (0.038) | 0.843 (0.016)
MLP-3 | MLP-3-R-30-S | o | Uniform random 30% | 0.702 | 0.507 | 0.566 (0.108) | 0.785 (0.068)
MLP-3 | MLP-3-R-50-S | o | Uniform random 50% | 0.71 | 0.486 | 0.564 (0.112) | 0.781 (0.058)
MLP-3 | MLP-3-R-75-S | o | Uniform random 75% | 0.671 | 0.787 | 0.717 (0.032) | 0.819 (0.03)
MLP-3 | MLP-3-R-100-S | o | Uniform random 100% | 0.714 | 0.744 | 0.711 (0.063) | 0.831 (0.031)
MLP-3 | MLP-3-S-30-S | o | SMOTE 30% | 0.656 | 0.832 | 0.731 (0.04) | 0.816 (0.017)
MLP-3 | MLP-3-S-50-S | o | SMOTE 50% | 0.704 | 0.671 | 0.681 (0.03) | 0.822 (0.021)
MLP-3 | MLP-3-S-75-S | o | SMOTE 75% | 0.697 | 0.752 | 0.719 (0.032) | 0.829 (0.019)
MLP-3 | MLP-3-S-100-S | o | SMOTE 100% | 0.697 | 0.685 | 0.689 (0.039) | 0.822 (0.021)
MLP-3 | MLP-3-A-30-S | o | AE 30% | 0.582 | 0.777 | 0.654 (0.054) | 0.776 (0.044)
MLP-3 | MLP-3-A-50-S | o | AE 50% | 0.663 | 0.737 | 0.681 (0.073) | 0.805 (0.012)
MLP-3 | MLP-3-A-75-S | o | AE 75% | 0.63 | 0.876 | 0.73 (0.052) | 0.808 (0.034)
MLP-3 | MLP-3-A-100-S | o | AE 100% | 0.709 | 0.702 | 0.697 (0.028) | 0.828 (0.017)
Actual values | Predicted Positive (1) | Predicted Negative (0)
---|---|---
Positive (1) | TP (True positive) | FN (False negative)
Negative (0) | FP (False positive) | TN (True negative)
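
The confusion matrix above is what the sensitivity (recall), precision, and F1 columns in the result tables are built from. As a minimal sketch, the standard definitions can be written as follows. The class name `ConfusionMatrix` and the example counts are hypothetical and are not taken from the paper, and the positive class is assumed here to be the anomaly class. AUROC is not included because it needs the ranked scores of a model, not a single set of counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts from a binary confusion matrix (hypothetical helper)."""
    tp: int  # positives predicted positive
    fn: int  # positives predicted negative
    fp: int  # negatives predicted positive
    tn: int  # negatives predicted negative

    def sensitivity(self) -> float:
        """Recall: TP / (TP + FN); 0.0 when there are no actual positives."""
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def precision(self) -> float:
        """TP / (TP + FP); 0.0 when nothing was predicted positive."""
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        """Harmonic mean of precision and recall; 0.0 when both are zero."""
        p, r = self.precision(), self.sensitivity()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only (assumed, not from the dataset).
cm = ConfusionMatrix(tp=80, fn=20, fp=10, tn=890)
print(f"sensitivity={cm.sensitivity():.3f}")
print(f"precision={cm.precision():.3f}")
print(f"f1={cm.f1():.3f}")
```

Note that when scores are averaged over repeated runs, as the standard deviations in the result tables suggest, the mean F1 need not equal the harmonic mean of the mean precision and mean recall.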
Model Type | Model Name | Dataset: Scale (0–1) | Dataset: Oversampling (Augmentation) | Score: Sensitivity (Recall) | Score: Precision | Score: F1 Score (Std.) | Score: AUROC (Std.)
---|---|---|---|---|---|---|---
Traditional method (baseline) | IQR | x | x | 0.153 | 0.188 | 0.168 | 0.523
MLP-2, hidden layer sizes (10,15,10) | MLP-2-S-30 | x | SMOTE 30% | 0.832 | 0.937 | 0.882 (0.013) | 0.912 (0.013)
MLP-2, hidden layer sizes (10,15,10) | MLP-2-A-50 | x | AE 50% | 0.841 | 0.914 | 0.875 (0.020) | 0.914 (0.009)
MLP-1, hidden layer sizes (10) | MLP-1 | x | x | 0.812 | 0.936 | 0.869 (0.021) | 0.901 (0.015)
MLP-2, hidden layer sizes (10,15,10) | MLP-2 | x | x | 0.805 | 0.94 | 0.867 (0.019) | 0.899 (0.017)
MLP-2, hidden layer sizes (10,15,10) | MLP-2-S-50 | x | SMOTE 50% | 0.846 | 0.89 | 0.866 (0.011) | 0.914 (0.015)
MLP-2, hidden layer sizes (10,15,10) | MLP-2-A-75 | x | AE 75% | 0.814 | 0.925 | 0.863 (0.019) | 0.901 (0.023)
MLP-3, hidden layer sizes (500,100,10) | MLP-3-A-75 | x | AE 75% | 0.821 | 0.908 | 0.861 (0.018) | 0.903 (0.016)
# | Model | Dataset: Scale (0–1) | Dataset: Oversampling (Augmentation) | Score: Sensitivity (Recall) | Score: Precision | Score: F1 Score (Std.) | Score: AUROC (Std.)
---|---|---|---|---|---|---|---
1 | IQR | x | x | 0.153 | 0.188 | 0.168 | 0.523 |
2 | OCSVM-ND (normal data) | x | x | 0.475 | 0.139 | 0.215 | 0.501 |
3 | OCSVM-AD (abnormal data) | x | x | 0.508 | 0.128 | 0.205 | 0.476 |
4 | MLP-1 | x | x | 0.812 | 0.936 | 0.869 (0.021) | 0.901 (0.015) |
5 | MLP-1-S | o | x | 0.647 | 0.844 | 0.729 (0.029) | 0.813 (0.025) |
6 | MLP-2 | x | x | 0.805 | 0.94 | 0.867 (0.019) | 0.899 (0.017) |
7 | MLP-2-S | o | x | 0.676 | 0.806 | 0.728 (0.02) | 0.823 (0.02) |
8 | MLP-3 | x | x | 0.809 | 0.915 | 0.856 (0.017) | 0.898 (0.018) |
9 | MLP-3-S | o | x | 0.636 | 0.799 | 0.7 (0.047) | 0.803 (0.031) |
10 | OCSVM-Average | x | All cases | 0.503 | 0.132 | 0.209 (0.007) | 0.484 (0.014) |
11 | MLP-1-Average | x & o | All cases | 0.757 | 0.76 | 0.735 (0.106) | 0.844 (0.055) |
12 | MLP-2-Average | x & o | All cases | 0.755 | 0.774 | 0.751 (0.104) | 0.853 (0.051) |
13 | MLP-3-Average | x & o | All cases | 0.738 | 0.753 | 0.719 (0.099) | 0.836 (0.051) |
14 | MLP-ALL-Average | x | All cases | 0.805 | 0.789 | 0.77 (0.127) | 0.867 (0.062) |
15 | MLP-ALL-Average | o | All cases | 0.695 | 0.735 | 0.701 (0.053) | 0.822 (0.025) |
16 | ALL-D-Average | x & o | All cases of duplication | 0.733 | 0.698 | 0.694 (0.248) | 0.812 (0.164) |
17 | ALL-R-Average | x & o | All cases of uniform random | 0.713 | 0.535 | 0.562 (0.183) | 0.756 (0.118) |
18 | ALL-S-Average | x & o | All cases of SMOTE | 0.723 | 0.698 | 0.687 (0.203) | 0.807 (0.135) |
19 | ALL-A-Average | x & o | All cases of AE | 0.694 | 0.734 | 0.685 (0.209) | 0.795 (0.139) |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kang, H.; Kim, D.; Lim, S. Machine Learning-Based Anomaly Detection on Seawater Temperature Data with Oversampling. J. Mar. Sci. Eng. 2024, 12, 807. https://doi.org/10.3390/jmse12050807