Big data applications have increased tremendously owing to technological developments that enlarge data volumes and convert ordinary data into large datasets. Such extensive data requires high-speed servers for rapid processing [1] and large storage servers that make the data available on demand [2]. Big data supports decision-making and validation in organizational processes [3]. Conversely, there is a critical trade-off between application data size and efficiency: the larger the application data, the lower the efficiency of the application [4]. Big data applications require large amounts of data to model a system, yet such data demands substantial storage and efficient handling methods. To address these scenarios, big data handling methods are needed that divide the data into subgroups and process them in the same manner as the source data [5].
Besides data handling, big data is also prone to risks and constraints such as data validity, theoretical relevance, appropriate attribute association, controls, auditability, and precision. These parameters are meant to ensure the quality of the information and of the big data itself [6]. In addition to these constraints, many other factors are associated with big data, such as data security, data sorting, server management, and data-related privileges [7]. The number of digital tools for data handling had exceeded 92% by 2002 and is still increasing, driving a big data business of about 46.4 billion [8].
1.1. Motivation
Despite the abundance of big data applications, big data poses challenges for applications in semantic networks, data mining, social networks, and information fusion [9]. Likewise, research interest has grown in pattern mining, data tracking, data storage, data visualization, user behavior analysis, and data processing [10]. This has led to big data solutions based on technologies such as computational intelligence and machine learning for scanning and processing data. These solutions include data condensation, incremental learning, distributed computing, divide and conquer, data sampling, density-based approaches, and others [8,9,10].
Data sampling has become increasingly important in data handling problems where the tasks under consideration suffer from computational burden, complexity, and inefficiency [11]. In most scenarios, data quality is compromised by biased per-sample estimations [12]. To handle this issue, reverse sampling procedures exploit information from external sources, where big data is combined with probabilistic sampling in an ensemble [13]. Sample size is the most crucial factor for the system's accuracy [14]. Non-probabilistic sampling, Zig Zag sampling, inverse sampling, and cluster sampling have been introduced as solutions for big data sampling [13,15,16].
This study aims to analyze a large amount of data by randomly sampling it into subsets. The experiments analyze the influence of different techniques on model performance by combining feature selection techniques with machine learning classifiers. Two feature selection techniques are used: random subset selection and random projection. It is quite possible that a randomly selected small portion of the data produces results as good as the original data, in which case processing the whole dataset wastes computing resources. In other words, if using the entire dataset yields only a slight performance gain, that gain can be neglected in favor of using just a small portion of the data whose performance is close to that of the whole dataset.
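As a rough illustration of this idea, the sketch below (assuming scikit-learn and NumPy) draws a small random subset of a synthetic feature matrix standing in for the real data, reduces its dimensionality with a random projection, and compares the resulting classifier accuracy with training on the full data. The dataset, subset fraction, projection size, and classifier are illustrative assumptions, not the exact experimental setup of this study.

```python
# Illustrative sketch only: random subset sampling + random projection,
# followed by a classifier, compared against training on the full data.
# Dataset, subset fraction, projection size, and classifier are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.random_projection import GaussianRandomProjection

# Stand-in for a large dataset (placeholder for the real feature matrix).
X, y = make_classification(n_samples=100_000, n_features=200, n_informative=40,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# 1) Random subset: keep only a small fraction of the training rows.
rng = np.random.default_rng(0)
subset_idx = rng.choice(len(X_train), size=int(0.05 * len(X_train)),
                        replace=False)
X_sub, y_sub = X_train[subset_idx], y_train[subset_idx]

# 2) Random projection: reduce the feature dimension of the subset.
proj = GaussianRandomProjection(n_components=50, random_state=0)
X_sub_proj = proj.fit_transform(X_sub)
X_test_proj = proj.transform(X_test)

# Train on the full data and on the reduced subset, then compare accuracy.
full_clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
sub_clf = RandomForestClassifier(random_state=0).fit(X_sub_proj, y_sub)

print("full data accuracy:      ", accuracy_score(y_test, full_clf.predict(X_test)))
print("5% subset + RP accuracy: ", accuracy_score(y_test, sub_clf.predict(X_test_proj)))
```

If the two printed accuracies are close, the subset-plus-projection pipeline delivers comparable quality at a fraction of the training cost, which is exactly the trade-off investigated here.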
1.2. Related Work
Machine learning algorithms can learn information from data to generate decision-making and prediction models [16]. Machine learning techniques learn behavior and trends in data for future predictions by training models through communication, comparisons, problem-solving, discoveries, and strategies [4]. In general, the larger the amount of data, the higher the accuracy of machine learning, although good performance also depends on the simplicity of the data. This creates problems for machine learning on big data that is unstructured, unclassified, and rapidly changing [17].
Deep learning paradigms such as Convolutional Neural Networks (CNNs) can address the problem of data classification [18,19] for images and textual data [20]. CNNs perform excellently in image classification and detection, but they require large amounts of input data and, consequently, high processing power. A CNN comprises multiple layers, such as convolutional, pooling, and fully connected layers, that require substantial resources to perform efficiently. Feature extraction mechanisms have gained significant attention [21,22] owing to their ability to reduce massive data optimally. The dimensionality of data affects the performance of machine learning techniques and data handling mechanisms, and larger data requires powerful processing resources that are unavailable in most scenarios.
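To make the resource argument concrete, the following minimal sketch (assuming PyTorch; the layer sizes are illustrative and not taken from any cited work) builds a small CNN with the convolutional, pooling, and fully connected layers mentioned above and prints its trainable parameter count.

```python
# Minimal illustrative CNN with convolutional, pooling, and fully connected
# layers (PyTorch assumed; layer sizes are illustrative only).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256),                 # fully connected layer
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCNN()
x = torch.randn(1, 3, 224, 224)           # one RGB frame, 224x224 pixels
print(model(x).shape)                      # torch.Size([1, 3])
print(sum(p.numel() for p in model.parameters()), "trainable parameters")
```

Even this toy network has tens of millions of trainable parameters, most of them in the fully connected layer, which illustrates why CNN-based pipelines demand large input data and significant processing power.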
Data attributes play a vital role in machine learning and data handling for developing better models, but they may complicate matters through inappropriate data coverage and classification. Many attributes and instances create complex dimensionality issues in large datasets. Thus, this article explores the NDPI video dataset [23], divided into three categories for analysis: acceptable, flagged, and unacceptable data. The data from image filtering is used for data sampling. The dataset is well suited to this study for three reasons: it is clearly categorized into three classes, it can be converted into numerical values, and a large amount of data, up to 40 GB, is available. This makes it a good case for large-scale data processing and thus appropriate for comprehensive machine learning-based analysis.
Many research articles in the literature, such as [24,25,26,27], aim to solve similar problems in related domains. The work in [22,28] explores large datasets with many classes to increase productivity and build efficient machine learning models. Color transformation methods are studied in [26]. Several evidence-based and adaptive sampling methods are explored for filtering in [29,30,31], website filtering is analyzed in [31], and keyframe analysis is illustrated in [32]. The research works in [33] and [34] associate functions with visual attributes to make multimedia accessible. Content retrieval applications are explored in [35,36,37,38,39,40,41,42]. Feature analysis and reduction in several related areas are also explored in [19,20,41,42,43,44,45,46,47,48]. Neighborhood rough sets are proposed in [49] as tools for reducing the attributes in big data, offering an effective choice for attribute selection. A hierarchical framework based on supervised models is proposed in [50], using a support vector machine as the machine learning algorithm to reduce the attributes in big data; Gabor filters are exploited for noise reduction, and Elephant Herd Optimization is used for feature selection. Two effective feature reduction methods, Principal Component Analysis and Linear Discriminant Analysis, are exploited in [51] to reduce the attribute sets for the machine learning algorithms Random Forest, Naïve Bayes, Support Vector Machine, and Decision Tree (an illustrative sketch of such a reduce-then-classify pipeline is given after this paragraph). The dominance-based neighborhood rough sets (DNRS) method is exploited in [52] for parallel attribute reduction, considering partial order for numerical and categorical attributes. Neighborhood decision consistency is explored for multi-criterion attribute reduction [53]; the classification variations that arise under varying attribute scenarios are handled through neighborhood decision consistency, and reduced error is attained with a new attribute reduction method, a heuristic that derives the reduct.
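The sketch below (scikit-learn assumed) is not the implementation of [51] but a generic recreation of the kind of reduce-then-classify pipeline described there: PCA compresses the attribute set before a Random Forest is trained on the reduced representation; the dataset and component count are placeholders.

```python
# Generic reduce-then-classify sketch (not the implementation of [51]):
# PCA compresses the attribute set, then a Random Forest is trained on it.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)       # 64 attributes per sample
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    PCA(n_components=16),                  # reduce 64 attributes to 16 components
    RandomForestClassifier(random_state=0),
)
pipeline.fit(X_train, y_train)
print("accuracy with 16 PCA components:", pipeline.score(X_test, y_test))
```

Swapping PCA for LDA, or the Random Forest for another classifier, follows the same pattern and covers the other combinations reported in [51].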
Recently, Rostami et al. [54] proposed a community detection-based genetic algorithm for feature selection that operates in three phases. Feature similarities are determined in the first phase. In the second phase, community detection algorithms group the features into clusters. In the third phase, a genetic algorithm selects features using a new community-based repair procedure. The performance of the approach was analyzed on nine benchmark classification problems, and its efficiency was compared with four well-known feature selection algorithms. A comparison with three recent feature selection methods based on the PSO, ACO, and ABC algorithms across three classifiers indicated that the accuracy of the proposed method is on average 0.52% higher than PSO, 1.20% higher than ACO, and 1.57% higher than ABC. Rajendran et al. [55] concentrate on developing a big data classification model that uses chaotic pigeon-inspired optimization (CPIO)-based feature selection in conjunction with an optimal deep belief network (DBN) model. The model runs in the Hadoop MapReduce environment to handle big data. The CPIO method is first employed to pick a subset of valuable features. A Harris Hawks Optimization (HHO)-based DBN model is then used as a classifier to assign suitable class labels; applying HHO to tune the hyperparameters of the DBN model contributes to improved classification performance. Several simulations were conducted to assess the superiority of the approach, and the results were analyzed from many perspectives.
In a separate effort, Rostami et al. [56] conducted a comparative analysis of several feature selection approaches and categorized these methods broadly. In addition, the current state of the art in swarm intelligence is examined, as are the most recent feature selection approaches based on these algorithms, and the merits and limitations of the examined swarm intelligence-based feature selection approaches are appraised. Song et al. [57] present a novel three-phase hybrid feature selection technique (HFS-C-P) based on correlation-guided clustering and particle swarm optimization (PSO) to address these difficulties simultaneously. To this end, the algorithm integrates three types of feature selection approaches according to their benefits. In the first and second phases, a filter feature selection approach and a feature clustering-based method with low computational cost are developed to narrow the search space required in the third phase. The third phase then locates an optimal subset of features using an evolutionary algorithm with global search ability. In addition, a symmetric uncertainty-based feature deletion approach, a rapid correlation-guided feature clustering strategy, and an enhanced integer PSO are proposed to improve the performance of the three phases, respectively. The technique is evaluated on 18 publicly available real-world datasets against nine feature selection algorithms.
Jain et al. [58] proposed a model in which the data first undergoes preprocessing to eliminate unwanted words. A set of feature vectors is then extracted using Term Frequency-Inverse Document Frequency (TF-IDF) as the feature extraction technique. A Binary Brain Storm Optimization (BBSO) algorithm is applied for feature selection, resulting in enhanced classification performance, and Fuzzy Cognitive Maps (FCMs) are used as the classifier to categorize the occurrence of positive or negative emotions. A comprehensive analysis of experimental results confirms that the presented BBSO-FCM model performs better on the benchmark dataset. Abu Khurma et al. [59] present a complete summary of 156 papers concerning improvements of nature-inspired algorithms (NIAs) for feature selection. They supplement the discussion with analytical perspectives, illustrative data, practical examples, and open-source software, and they debate open topics related to feature selection and NIAs. The study concludes with a summary of the fundamentals of NIA-based feature selection, investigating around 34 distinct operators; chaotic maps are the most common operator, and hybridization is the most common kind of modification. There are three forms of hybridization: NIA integration, NIA integration with a classifier, and NIA integration without a classifier, of which the combination of a classifier with an NIA is the most prevalent. Medical and microarray applications account for most NIA-based feature selection modifications and uses. Big data has benefited many fields; recently, besides many new fusions of big data applications, the security paradigm has seen exponential growth in the use of big data [60,61,62]. Security and safety applications need precise, accurate, and expedited decision-making based on big data analytics. Our work also contributes toward rapid model creation from smaller sets and the deployment of these models in several application scenarios.