*7.2. Dataset*

To validate the proposed solution in various scenarios and on varied dataset properties, experiments were conducted using a number of different datasets. The characteristics of all the datasets are summarized in Table 1. All datasets were downloaded from the Kaggle website.

Dataset sizes ranged from a few thousand rows to millions of rows, so that the performance of the proposed solution could be evaluated at different scales. A dataset of 54 MB was used for this purpose.


For image detection, and to confirm the model's robustness, two independent datasets were collected and tested. The first dataset was created using the analysis conducted by [60] and can be downloaded at https://github.com/muhammedtalo/COVID-19 (accessed on 20 February 2023). It consisted of 500 pneumonia, 125 COVID-19, and 500 no-findings X-ray images, and was assembled from two separate resources: X-ray images of COVID-19 patients obtained from multiple open-access sources in the Cohen [61] database, and the chest X-ray database provided by Wang et al. [62] for the normal and pneumonia X-ray images. The COVID-19 dataset included 43 female patients and 82 male patients. Metadata were not available for all patients in this dataset. Positive COVID-19 patients were, on average, around 55 years old. The dataset was versatile in that it could be used for both multi-class and binary classification tasks.

The dataset from Harvard Lab [55] was also used in this study. It consisted of non-enhanced chest CT scans of more than 1000 individuals diagnosed with COVID-19. The average age of the CT-scan patients was 47.18 years, with a standard deviation of 16.32 years and a range from 6 to 89 years. The population was 60.9% male and 39.1% female. The most common self-reported co-morbidities among patients were coronary artery disease or hypertension, interstitial pneumonia or emphysema, and diabetes. The images were obtained at in-patient treatment sites from patients with a positive RT-PCR test for COVID-19 and accompanying clinical symptoms, between March 2020 and January 2021. The scans were taken during end-inspiration with the subjects in a supine position.

The CT scans were acquired in helical mode on 16-slice NeuViz equipment, without the use of intravenous contrast. The images were stored in DICOM format as 16-bit gray-scale images of 512 × 512 px. The slice thickness was determined by the operator and ranged from 1.5 to 3 mm, based on the clinical examination requirements. The CT scans were reviewed for the presence of COVID-19 infection by two board-certified radiologists. In cases where these two radiologists were unable to reach a consensus, a third, more experienced radiologist provided the final judgment. The CT images showed a variety of patterns indicative of COVID-19-specific lung infections.

In the third phase of our comparison, two datasets were used. The specifics of the two major subsections of the sourced image graphs were as follows.


A total of 3106 images were utilized for model training, 16% of which were used for model validation. A total of 806 non-augmented images from various categories were used to test the proposed solution and assess the performance.
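As an illustration of the split sizes implied by these figures, the following minimal sketch holds out 16% of 3106 items for validation. The function name `holdout_split`, the random shuffling, the seed, and the rounding rule are assumptions made for illustration only; the text does not state how the validation subset was drawn.

```python
import random


def holdout_split(items, val_fraction, seed=0):
    """Shuffle the items and hold out a fraction of them for validation.

    Assumption: the validation count is rounded to the nearest integer.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Integer indices stand in for the 3106 training images.
train_ids, val_ids = holdout_split(range(3106), val_fraction=0.16)
print(len(train_ids), len(val_ids))
```

Under these assumptions, roughly 497 images are held out for validation and about 2609 remain for fitting the model; the 806 test images are kept entirely separate.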

Furthermore, the large image datasets in Table 2 were used for the big-data evaluation. All the datasets were downloaded from the Kaggle website.


**Table 2.** Clustering datasets.

### Data Preparation

Before clustering, the data had to be prepared and the primary parameters had to be selected, as follows:

• **Noise Removal:** The advanced parallel *k*-means clustering algorithm used mean imputation to handle missing data. In this approach, each missing value was replaced with the mean of the corresponding feature across all samples. This method is simple and computationally efficient, and it has been shown to be effective in practice. However, mean imputation may introduce bias in the clustering results if the data are not missing completely at random (MCAR). If the data are missing at random (MAR) or missing not at random (MNAR), more sophisticated methods, such as regression imputation or multiple imputation, could be required to avoid bias.

• **Number of Clusters:** Selecting the optimal number of clusters in the advanced parallel *k*-means clustering was crucial for achieving effective cluster analysis. This is particularly true in the medical field, where the identification of meaningful clusters can lead to more accurate diagnoses and treatments. However, the traditional methods of finding the value of *k*, such as the Elbow method or the Silhouette method, are not always sufficient in the medical field, where the data are often complex and high-dimensional. In such cases, expert knowledge could be required to identify clinically relevant subgroups, which could then be used to determine the optimal number of clusters. In this paper, the value of *k* was set to 2 for the clustering of numeric and text data and to 5 for image clustering, as there were 5 main gray-scale levels in the X-ray and MRI images.
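The two preparation steps above can be sketched as follows. This is a minimal illustration, not the advanced parallel algorithm itself: `impute_mean`, `kmeans`, and `silhouette` are hypothetical helper names, the *k*-means routine is a plain serial Lloyd iteration seeded with the first *k* points, and the toy data are invented. The Silhouette score is shown only as one of the traditional heuristics mentioned above; in this paper, *k* was fixed from domain knowledge.

```python
import math
from statistics import fmean


def impute_mean(rows):
    """Replace NaN entries with the mean of the observed values in that column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Assumption: a column with no observed value falls back to 0.0.
        means.append(fmean(observed) if observed else 0.0)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


def kmeans(points, k, iters=20):
    """Plain serial Lloyd's k-means, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [fmean(col) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means better separation."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = fmean(same)  # mean distance within the point's own cluster
        b = min(fmean(math.dist(p, q)
                      for q, lab in zip(points, labels) if lab == c)
                for c in clusters if c != labels[i])  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return fmean(scores)


nan = float("nan")
raw = [[1.0, 1.0], [1.2, nan], [0.8, 1.1],
       [8.0, 8.0], [nan, 8.2], [8.1, 7.9]]  # invented toy data
imputed = impute_mean(raw)
labels = kmeans(imputed, k=2)
score = silhouette(imputed, labels)
print(labels, round(score, 3))
```

In this toy example, the imputed values fall between the two groups rather than inside either of them, which illustrates how mean imputation can distort cluster structure when the missingness is related to group membership.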
