Article

IchthyNet: An Ensemble Method for the Classification of In Situ Marine Zooplankton Shadowgraph Images

U.S. Naval Research Laboratory, Stennis Space Center, MS 39529, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 24 October 2024 / Revised: 23 December 2024 / Accepted: 6 January 2025 / Published: 24 January 2025

Abstract

This study explores the use of machine learning for the automated classification of the ten most abundant groups of marine organisms (in the size range of 5–12 cm) plus marine snow found in the ecosystem of the U.S. east coast. Images used in this process were collected using a shadowgraph imaging system on a towed, undulating platform capable of collecting continuous imagery over large spatiotemporal scales. As a large quantity (29,818,917) of images was collected, the task of locating and identifying all imaged organisms could not be efficiently achieved by human analysis alone. Several tows of data were collected off the coast of Delaware Bay. The resulting images were then cleaned, segmented into regions of interest (ROIs), and fed through three convolutional neural networks (CNNs): VGG-16, ResNet-50, and a custom model created to find more high-level features in this dataset. These three models were used in a Random Forest Classifier-based ensemble approach to reach the best identification fidelity. The networks were trained on a training set of 187,000 ROIs augmented with random rotations and pixel intensity thresholding to increase data variability and evaluated against two datasets. While the performance of each individual model is examined, the best approach is to use the ensemble, which performed with an F1-score of 98% and an area under the curve (AUC) of 99% on both test datasets while its accuracy, precision, and recall fluctuated between 97% and 98%.

1. Introduction

Zooplankton are heterotrophic organisms that occupy trophic levels between the photosynthetic organisms at the base of the oceanic food web and higher trophic level organisms (fish, large invertebrates, marine mammals, reptiles, and birds). Many of these organisms can produce large swarms and layers that attract larger predators and play important roles in the transfer of organic material from the euphotic zone to the deep ocean (‘biological pump’) [1]. Zooplankton biomass, abundance, and species composition also vary depending on environmental conditions [2,3,4]. Understanding the links between these organisms’ abundance and their environment is crucial for ecosystem forecast models. One such environmental feature connected to these organisms is the ocean chlorophyll level. Diatoms, responsible for nearly half of global photosynthesis, form large spring and fall blooms with high chlorophyll levels [5]. Copepods and appendicularians are known to feed on diatoms, meaning their presence can be expected in such areas [6]. As appendicularians move through cluttered waters, large detritus and food materials clog the filter of their gelatinous “houses”, causing the appendicularians to abandon the housing. This discarded house then becomes a major component of marine snow [7]. Chaetognaths and siphonophores are known to feed on a variety of small marine life, including copepods, which appear to be a favorite [8,9]. Similarly, hydromedusae and ctenophores have been observed to feed on species such as copepods, siphonophores, and shrimp [10,11]. To understand how aggregations of these organisms can be predicted, we must investigate these highly entwined species.
Despite much interest from oceanographers, fisheries, and conservationists, the challenges faced in acquiring quality training data for automated plankton detection have limited detection and classification capabilities for marine zooplankton species [12,13]. Accurate and unbiased datasets are difficult to obtain due to the natural variability of these organisms coupled with the volume of available images, often leading to thousands of samples for common species and only a handful for others [14]. Recently, a Mask-RCNN developed by Bi et al. in 2024 saw success in the detection of abundant organisms from a variety of data sources; however, this approach also reported low precision on the classification of less common specimens [15]. Attempts have been made to address this data imbalance issue, using approaches such as combining images from multiple acquisition systems and supplementing datasets with bench-top data to increase the variability in the training set [16,17]. Researchers found that bench-top imaging can be a useful tool for increasing the number of images of rare species by capturing many postures of the same organisms, although care must be taken to avoid capturing the same specimen too frequently, as this would bias classifiers. The authors of Ref. [18] provide a comprehensive overview of the advancements in plankton image analysis, yet still note the capability gaps due to domain shifts from the different imaging systems, difficulties in identifying and processing unseen classes, and uncertainty in the expert annotations that are necessary for training data.
This research aims to enhance the classification of organisms which form dense marine aggregations. To overcome the data imbalance issue often seen with such applications, the data were balanced by restricting the size of all classes to that of the smallest label population. Augmentation techniques were then used to enlarge the training set and increase data variability. The data preparation process involves collecting shadowgraph images, cleaning up artifacts, and converting them to a format suitable for convolutional neural network analysis. Three CNNs were trained to determine the presence of organisms, outputting the confidence in the detection and the classification of the detected organisms. In a CNN, the depth of the architecture is important as deeper networks learn more complex features. With increasing network depth and filter application, the subsequent layers detect higher-level features [19]. This leads to early layers acting more like edge detectors, while deeper layers identify features that build on these edges. As such, deeper networks have the ability to discern complex features, which correspond to distinct traits such as eyes, tails, and fins. As the organisms examined have a wide range of feature complexity, it was hypothesized that the best approach would be to apply a shallow, mid-level, and deep CNN, thus taking advantage of models which analyze the images using different features. These observations, plus the noted simplicity of some of the organisms, led to the selection of CNN architectures that leverage features of differing complexity levels, utilizing the confidence in low-level, mid-level, and high-level features to classify marine zooplankton species. This is important as these organism groups range from very simple structures such as diatoms to very complex structures such as ctenophores and siphonophores. Distinctions become even more complex as the edge features of the shadowgraph imagery are obscured through the use of filtering. This filtering is often a necessary step in cleaning up noise and debris in the collected data for human review, but it can degrade edge features of the organisms that are necessary for examination by a shallow architecture.
Each model was trained and run on labeled datasets, with the outputs being aggregated and passed to a Random Forest Classifier to create an ensemble approach that uses knowledge of the relationships of the models to create a unified label. This architecture differs from many current approaches to plankton classification in that the shallow network is designed and trained for the purpose of classifying these marine zooplankton species, with the deeper, off-the-shelf networks being used to supplement those decisions. Many ensemble approaches utilize only transfer learning of deep, off-the-shelf CNNs, with the shallowest CNN selected typically having at least 16 layers. The architecture also employs a Random Forest Classifier to determine the correct classification from the three CNN architectures rather than a simple voting system. This is because the architecture was designed with the assumption that the constituent CNNs may specialize in particular classifications, leading to relationships among the model disagreements. As such, the ensemble considers the three predictions as well as the source of each prediction and the confidence in those predictions to determine the best classification. The performance of this ensemble approach was assessed and compared to the performance of the individual models to determine if using CNNs in such a fashion improves the classification accuracy.
Success in using an ensemble of CNNs for phytoplankton classification was observed by Lumini and Nanni [20]. In their initial research, the authors created an ensemble utilizing transfer learning on a set of eight widely adopted image classification networks to classify different phytoplankton species. Rather than using transfer learning on a large set of deep models, the ensemble presented here uses three models with different levels of complexity, with the goal of employing features of different scales to produce a unified classification by examining the images from different perspectives. The ensemble is created by training a Random Forest Classifier to determine the appropriate label based on the information provided by the three models. The use of the Random Forest to examine the labels and confidence from each model was preferred over a simple voting system as the models were selected to specialize in differing complexity levels. This should lead to the shallow model being more reliable on organisms with distinct profiles and the deeper models being preferred for complex organisms which are characterized by very specific, often complex, features. The underlying CNN architectures and their training process are introduced in Section 2.2.
In this paper, we present the development of a Random Forest Classifier-based ensemble of carefully selected CNNs entitled IchthyNet. The results presented demonstrate that this model is highly effective at differentiating ten plankton groups (plus marine snow) frequently detected in ocean shadowgraph imagery. A comparison to the performance of the individual models reveals that, while the underlying CNN architectures perform well individually, the outcome of the developed ensemble is the most reliable. This automated capability is crucial due to the large volume of organisms captured during data collection, which can produce millions of image segments to review. The ability to automatically detect organisms in the collected data speeds up the processing of each tow, allowing for the efficient examination of larger volumes of data. For a human operator, reviewing and labeling all ROIs in the shadowgraph imagery takes months, and such a daunting task can lead to significant human error. However, the ensemble approach can label all data from the two-week-long Delaware expedition in less than one week on the hardware described in this research while maintaining the same level of precision. This is vital in using the collected data to determine complex relationships that exist between the organisms, environmental data, and optical/acoustic properties.

2. Materials and Methods

In this study, regions of interest (ROIs) are extracted from ISIIS video frames so that convolutional neural networks can be applied to determine the classification of the organism depicted in each ROI. As the data were collected in situ off the Delaware coast, we first clean, balance, and augment that dataset to make it more suitable for training. Then, we train three architectures of differing depths with the objective of leveraging the diversity in their feature complexity to construct an ensemble method which makes a final determination on the classification of each ROI.

2.1. Data Collection and Preprocessing

In a 2020 study that evaluated the performance of imaging and acoustic sampling methods against traditional net-based sampling methods, seven tows of shadowgraph data were collected off the continental shelf to the east of Delaware Bay [21]. As part of this study, an In Situ Ichthyoplankton Imaging System (ISIIS) was towed to collect continuous imagery of planktonic organisms and measurements of the water column (see [22] for a description of the prototype). The nature of the sampling is described in detail in [21]. The ISIIS system has two cameras, but for this work, only frames from the large camera were used. The 12 cm wide line scanner captures shadows in the 50 cm space between the light source and imaging lens, effectively imaging any particle that enters this area. The line scanning camera collects data as a continuous strip of imagery, which is then stored by the acquisition software as square frames matching the 12 cm field of view. Each 12 × 12 cm frame comprises 2048 × 2048 pixels, for a pixel resolution of approximately 59.5 microns.
The ISIIS data collected in 2020 are used in the work presented here and introduced as Tows 1–7 (Figure 1). These tows are of varying lengths and are arranged to cover several inshore and offshore areas within the region. These differences in tow length lead to ROI counts ranging from only 1.8 million for Tow 2 to 8.4 million for Tow 3 (Table 1). Tows 1, 2, and 5 cover inshore areas near the bay; Tows 3, 6, and 7 transition between inshore and offshore; and Tow 4 is exclusively offshore.
To structure the ISIIS data in a format suitable for machine learning with convolutional neural networks, the collected frames were cleaned and segmented. To increase the visibility of the organisms of interest, the frames were flat-fielded using a radiometric calibration to remove background noise. These flat-fielded images were then segmented using a thresholding approach established in previous works [23,24,25]. To summarize the approach, a threshold gray-scale value of 185 is used to binarize the pixels. Regions with a sufficient density of dark pixels, a minimum of 750 square pixels, are then extracted from the overall frame as a region of interest (ROI). This process was completed using an application called ImageJ [26]. The resulting ROIs were then resized by scaling the largest dimension down to the input size for the target CNN and padding the rest to ensure equal height and width. This preserves the shape and relative size of each organism while formatting the clipped images to the uniform size required by a CNN.
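The pad-and-resize step can be sketched as follows. This is a minimal illustration assuming OpenCV and a white (255) flat-fielded background; the function name and the 164-pixel target (ShadowNet's input size, Section 2.2.3) are chosen for illustration, and each CNN would use its own input size.

```python
import cv2
import numpy as np

def pad_resize_roi(roi: np.ndarray, target: int = 164) -> np.ndarray:
    """Scale the ROI's largest dimension down to `target` pixels, then pad
    the shorter dimension with background so the output is square."""
    h, w = roi.shape[:2]
    scale = target / max(h, w)
    if scale < 1.0:  # only downscale; smaller ROIs are padded as-is
        roi = cv2.resize(roi, (max(1, round(w * scale)), max(1, round(h * scale))),
                         interpolation=cv2.INTER_AREA)
    h, w = roi.shape[:2]
    top, left = (target - h) // 2, (target - w) // 2
    # Pad with white (255), matching the flat-fielded shadowgraph background.
    return cv2.copyMakeBorder(roi, top, target - h - top, left, target - w - left,
                              cv2.BORDER_CONSTANT, value=255)
```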
Previous studies [23] exclude images from turbid coastal areas, where noise levels are elevated, and focus on images from clearer oceanic water. Our sampling, however, was primarily undertaken on the continental shelf, in water depths of 100 m or shallower, and influenced by the Delaware River outflow. The classification of organisms in such areas can be more challenging as the images present more noise, but training on such images allows for better generalization of the features contributing to organism identification.
The regions of interest were reviewed and labeled as one of the eleven clusters (Figure 2) examined in this study. These groupings are APP (appendicularians), CHA (chaetognaths), COP (copepods), CTE (ctenophores), DIA (diatoms), HYD (hydromedusae), INV (invertebrate larvae), SHR (shrimp), SIP (siphonophores), SNO (marine snow), and VEL (veligers). The labels are used to classify groupings of the most common groups of species in the area where the data were collected.
These shadowgraph images present several challenges for machine learning. The first challenge is that many of the organisms’ profiles change significantly depending on the angle of the organism and the extent to which the organism has twisted. Many of these organism groupings also present a wide variety of sizes and shadow intensities, as some group members may be more translucent than others. Some of these groups also have very similar profiles. This is the case with appendicularians and chaetognaths, which can be easily mistaken for each other depending on the clarity of the image; copepods and shrimp, depending on the angle at which the specimen was captured; and hydromedusae and ctenophores, which are both gelatinous and may exhibit very similar profiles. Differentiating between appendicularians inside their “houses” and marine snow also presents a unique challenge as the marine snow is often a later, dirtier stage of the appendicularian housing. The final hurdle in distinguishing these organisms is that many of the image segments contain more than one type of organism, which makes it difficult for the model to determine which area to focus on during the classification process.
Data availability presents an additional challenge for this type of problem. Many of the groups were found to have substantial support in the labeled data: over 113,000 labeled APP samples; 43,000–50,000 samples each for CHA, COP, CTE, DIA, HYD, SHR, and SIP; and roughly 96,000 SNO samples. These groups were easily amassed due to their abundance across the tow locations, while VEL and INV samples were more difficult to come by, leading to just 5400 VELs and 840 INVs. To overcome this gap in the available labeled data, the models were initially trained and run on this uneven dataset, and those preliminary predictions were used to locate more unlabeled samples of these organisms. The initial predictions were reviewed and used to locate 20,000 INV samples and 80,000 VEL samples. To create a balanced dataset for training and validation, the smallest available sample size, 20,000, was used. Additionally, 85% of the data was taken for training and validation, leading to a dataset of 187,000 training samples with 17,000 per label.
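The balancing step described above could be sketched as follows, assuming the labels are held in a pandas DataFrame; `labels_df` and its column names are hypothetical, while the 20,000-per-class cap and the 85% split are the values reported above.

```python
import pandas as pd

# labels_df is assumed to have columns ["roi_path", "label"].
# Cap every class at the smallest available sample size (20,000)...
balanced = (labels_df.groupby("label", group_keys=False)
                     .apply(lambda g: g.sample(n=20_000, random_state=0)))

# ...then take 85% of each class for training/validation (17,000 per label,
# 187,000 samples across the eleven classes).
train_val = (balanced.groupby("label", group_keys=False)
                     .apply(lambda g: g.sample(frac=0.85, random_state=0)))
holdout = balanced.drop(train_val.index)
```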

2.2. Convolutional Neural Networks

In evaluating the success of machine learning for the classification of the eleven different organism clusters, three model architectures were examined. The architectures were selected to probe differing levels of feature complexity. The first two are well-known convolutional neural networks often selected for processing the MNIST (Modified National Institute of Standards and Technology database) dataset, a commonly used black-and-white dataset of handwritten digits [27], as well as image datasets that present a higher level of complexity [19]. Evaluating models that perform well across this range is useful as the images being presented are flat-fielded, black-and-white images of aquatic organisms—more complex than handwritten digits, but still with fewer features than the typical ImageNet dataset, which comprises more than 20,000 categories of RGB color images [28]. The selected models are VGG-16 [29] and ResNet-50 [30]. Additionally, a shallow CNN performing only five layers of convolutions, henceforth referred to as ShadowNet, was developed. Each model was adapted to output a classification and confidence level so that the decisions could be leveraged in an ensemble approach. These three models are introduced in this section.
For the training and validation process, all models were trained on 60% of the data in the training set (112,200 ROIs), validated against 20% (37,400 ROIs), and tested against the remaining 37,400 ROIs. To increase the variation in this small set of training data, the data were augmented using random rotation and thresholding, with the maximum pixel value randomly selected between 175 and 254. This random thresholding rounded all values above the selected threshold to 255, causing these pixels to blend in with the background. The rotation augmentation allows the model to observe the specimens at angles which may not be captured in situ, increasing the observation variability of the data. The random pixel thresholding was found to be a useful tool as some researchers set higher threshold levels to remove more background noise and some set lower levels. As this threshold is increased, the edge features of the specimens, particularly tentacles and appendicularian housings, are degraded or even removed. The removal of such features, if the model is trained to expect them, can lead to an increase in missed classifications without the use of augmentation. Adding this augmentation layer to the training process allows the models to treat the limited training set as a larger training set due to the increased variability. The models were trained, evaluated, and tested on an Ubuntu system with 128 GB of RAM and an Intel Core i9-10920X with a processing speed of 3.5 GHz. While the models do take advantage of the large RAM available, they were designed with a memory-constrained reader so that only one batch at a time is loaded into memory, allowing the models to be run on systems with lower RAM by configuring the batch size. The models were trained, evaluated, and tested with the GPU enabled for the sake of performance.
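A minimal sketch of the two augmentations follows, assuming TensorFlow/Keras and images stored as float tensors in [0, 255]; the rotation layer and its parameters are illustrative stand-ins rather than the exact implementation.

```python
import tensorflow as tf

# Arbitrary-angle rotation; empty corners are filled with the white background.
rotate = tf.keras.layers.RandomRotation(factor=0.5,  # up to +/-180 degrees
                                        fill_mode="constant", fill_value=255.0)

def augment(image: tf.Tensor) -> tf.Tensor:
    """Random rotation plus random upper-threshold clipping, as described above."""
    image = rotate(image, training=True)
    # Pick a cutoff in [175, 255); pixels above it are pushed to 255 so they
    # blend into the background, mimicking harsher noise filtering.
    thresh = tf.random.uniform([], 175.0, 255.0)
    return tf.where(image > thresh, 255.0, image)
```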
CNNs are specialized artificial neural networks (ANNs) that maintain the spatial relationships of grid-based data. CNNs are composed of layers of neurons that employ non-linear functions, referred to as activation functions, to tune and optimize the weights associated with the features learned by the model [31]. Unlike their predecessors, CNNs process images, as well as other two-dimensional data, in their two-dimensional representation, allowing the network to learn spatial relationships in both the x and y direction [31]. As such, CNNs have seen widespread adoption for image processing and recognition problems.
CNNs are mainly composed of convolution, pooling, and dense layers, also referred to as fully connected layers [32]. Convolution layers extract features of the input array through the application of various kernels, which are mathematical functions used to map the data from one representation to another. These kernels are presented as small two-dimensional arrays of developer-specified size and are applied in a sliding-window fashion with a configurable step size. These parameters are tuned by the creator of the network and are often specific to the data the network will train on [32]. Once the features of interest are computed by the kernels, an activation function is used to extract meaning from those areas of interest by applying a thresholding function to update the weights of the feature. One ubiquitous activation function, and the only one applied in the models used in this research, is the rectified linear unit (ReLU) activation function. ReLU is a common selection for image processing as it sets all negative-valued outputs to zero, leading to a quick reduction in the number of active features. This is easily intuited from Equation (1), where x is the input to the neuron.
$$\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \quad (1)$$
Pooling is often applied following convolution layers to reduce the size of the data being processed, thus increasing the speed of the model. Pooling is a form of down-sampling that reduces the data while maintaining important information on the learned features [31]. The most widely used pooling functions are max-pooling and average-pooling. Max-pooling is applied in all models examined in this paper. This approach represents all values in the applied kernel with the maximum value in that kernel [32]. Max-pooling is generally preferred over average-pooling as it leads to faster convergence and produces better generalizations [33]. This is important for avoiding over-fitting to features that have no true value, which are plentiful in this dataset due to the debris frequently seen in the shadowgraph images.
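As a worked example of 2 × 2 max-pooling with a stride of 2 (an illustration, not code from this study):

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 1],
              [0, 4, 3, 9]])

# Split the array into non-overlapping 2x2 windows and keep the strongest
# response in each window.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]
```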
Before making a final determination from the learned features, the extracted features are flattened into a feature vector where each node is connected to some number of nodes in the fully connected layers. This number of fully connected nodes is again determined by the creator of the CNN and is relative to the data being examined. The fully connected layers compute the dot product of weights and inputs and evaluate the result against an activation function to interpret the features learned by the model. Finally, these layers connect to an output layer responsible for the final determination of the learned features. The activation function used by this layer is again dependent on the input data [32]. Each of the models used in this study employs the Softmax function to give a label and confidence score for the output [34]. The most common form of this function is shown in the following equation, where $z_i$ is the output of the previous layer in the network (logit) for the $i$th class, $K$ is the number of classes, and $z_j$ is the logit for class $j$.
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
The three models are evaluated based on the F1 statistic, accuracy, precision, recall, and area under the curve (AUC). As the training data are a balanced dataset, these metrics tend toward the same value. The evaluation data, however, are unbalanced due to some organisms naturally occurring more than others. In such cases, it is important to measure the performance using several performance measures. Accuracy, precision, and recall give differing looks at the ratio of correct predictions. Accuracy gives the fraction of predictions that matched the original label on the training and validation data, precision indicates the average fraction of each class's positive predictions that are true positives, and recall indicates the average fraction of each class's actual members that are correctly recovered. The F1 score gives the harmonic mean of the precision and recall exhibited by the model. The AUC is computed as the area under the receiver operating characteristic (ROC) curve, which is used to measure how well a classifier separates the true positives and true negatives. The AUC for this problem is computed using a one-vs-one approach so that the AUC score is the average AUC of all possible pairwise combinations. This gives the average degree of separability achieved by the model, meaning it is telling of how capable the model is of distinguishing between classes on average, with a score of 1 being perfection [35].
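These metrics can be computed with scikit-learn as sketched below; macro averaging over the eleven classes is an assumption, and `y_true`, `y_pred`, and `y_prob` are hypothetical names for the labels, predicted labels, and per-class softmax scores.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_prob has shape (n_samples, 11), with each row summing to 1."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        # One-vs-one AUC: the average over all pairwise class combinations.
        "auc":       roc_auc_score(y_true, y_prob, multi_class="ovo"),
    }
```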

2.2.1. VGG-16

VGG-16 is a popular CNN first proposed by Simonyan and Zisserman for the 2014 ImageNet Challenge, where the model placed first in the localization task and second in classification [29]. The VGG model is frequently used for various image classification tasks due to its versatility. The ability to use transfer learning on the preexisting variations of the VGG model has been demonstrated for datasets of various complexity, making it a promising model to evaluate in this research [36,37,38].
The VGG network used in this research is referred to as VGG-16 due to the thirteen convolutional layers and three fully connected layers featured in the original model [29]. When trained for this paper, all convolution layers were maintained; however, the three fully connected layers were replaced with a single fully connected layer. This change was made as three fully connected layers extracted more features than necessary for learning the data presented in this experiment. The adapted VGG-16 architecture used in this study is presented in Figure 3. In the original VGG-16 model, all convolution layers use the ReLU activation function; this is maintained as the convolution layers have not been altered. The fully connected layer also uses the ReLU activation function, while the final output layer uses Softmax to predict one of the eleven known labels.
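A minimal Keras sketch of this adaptation is shown below, assuming grayscale ROIs replicated to three channels to match the ImageNet weights; the dense-layer width, input size, and optimizer are illustrative assumptions rather than the exact values used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen ImageNet convolutional base (this supplies the non-trainable
# parameters noted below).
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False

# The three original dense layers are replaced by one ReLU dense layer
# and an 11-way softmax head.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),  # width is an assumption
    layers.Dense(11, activation="softmax"),
])
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```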
The weights trained by the original authors of VGG-16 on the ImageNet dataset were maintained for all convolution layers so that the training process could benefit from the knowledge that the model already possesses, leaving 14,714,688 non-trainable parameters. The new layers, those being the fully connected layer and output layer, lend 24,096,972 more parameters to be trained, for a total of 38,811,660 parameters in the model. The model was trained for up to 50 epochs, as initial training attempts stabilized well before this cutoff, with a callback being used to end training once the validation accuracy stabilized. This occurred after just 18 epochs, resulting in a training time of 4 h, 21 min, and 21 s and a validation time of 8 min and 44 s. The results of the training, validation, and verification of the model can be viewed in Section 3.

2.2.2. ResNet-50

Since its inception in 2015, ResNet-50 has become one of the most widely adopted convolutional neural network architectures [30,39]. The model was first proposed by He, Zhang, Ren, and Sun to overcome the ever-challenging issue of the vanishing gradient [30]. Vanishing gradients occur as the number of convolutional layers in a CNN increases because each activation function employed squashes its input space toward zero, driving the gradients of the early layers toward zero and making it challenging for the model to continue to learn [40]. To address this problem, ResNet employs residual blocks that allow for direct connections to earlier layers of the model. Because these connections do not have an applied activation function, the derivatives do not experience the “squashing toward zero”, which leads to a higher overall derivative for the residual block [30].
ResNet-50 is the 50-layer adaptation of the original 34-layer ResNet CNN architecture [30]. The 50-layer ResNet model uses 16 residual blocks, with each block being composed of multiple convolution layers [30]. When adapted for the shadowgraph imagery, all pre-trained layers were maintained except for the final four layers, essentially retraining the final residual block. These final layers were adapted to be retrained so that weights corresponding to features from the eleven organism groupings are learned. This change was necessary as attempting to use only pre-trained layers led to an underfitting of the new data. The adapted ResNet-50 architecture described here is presented in Figure 4. In the original ResNet-50 model, all convolution layers use the ReLU activation function; this is maintained as the internal ResNet layers have not been altered. The fully connected, dense layer also uses the ReLU activation function, while the final output layer uses Softmax to predict one of eleven known labels.
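The retraining of the final layers could be sketched as follows; as with the VGG sketch, the head width, input size, and optimizer are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3))
# Keep the pre-trained weights except in the final four layers, which are
# left trainable so the last residual block adapts to shadowgraph features.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),  # width is an assumption
    layers.Dense(11, activation="softmax"),
])
```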
Most of the weights trained by the original authors of the ResNet-50 model for the ImageNet dataset were maintained; however, the last four layers of the model were retrained for the features extracted from the new data. This leads to 23,587,712 non-trainable parameters and 48,174,731 parameters being trained for a total of 71,762,443 parameters in the model. The model was trained for up to 75 epochs, as it was assumed that this model may take longer to stabilize, with a callback being used to end training once the validation accuracy stabilizes. Contrary to initial expectations, this model stabilized early, training for only 14 epochs. This rapid stabilization of the model is attributed to the use of transfer learning to avoid retraining all layers of this very deep network. This resulted in a training time of 3 h 50 min 31 s and a validation time of 8 min 2 s. The results of the training, validation, and verification of the model can be viewed in Section 3.

2.2.3. ShadowNet

The ShadowNet model was designed on the premise that the samples in the shadowgraph images may be most easily learned through the examination of mainly the edge features. This is due to the observation that the majority of the organisms present very few internal features, with some organisms depicted as a solid shadow. The model was created using only five convolution layers, three pooling layers, and one fully connected layer. Shallow networks run at increased speed due to the smaller number of features being considered. This tradeoff in gained speed, however, must be carefully balanced against the loss of accuracy that can result from the removal of more complex features. In this case, a shallow network examining less complex features was desired as some of the target classes can be more easily differentiated by their shapes than by the more complex features which some classes may share. This model takes in images sized 164 × 164 as many of the detected organisms are smaller than this limit. The model was carefully tuned by adding convolution layers until the training accuracy stopped increasing. This led to a model with a final output window of 6 × 6.
The architecture of the ShadowNet model depicted in Figure 5 contains five convolution layers with configurations given in Table 2. The first layer uses a filter sized 7 × 7 with a step size of 3 and no padding, allowing it to immediately reduce the size of the data and increase generality. The remaining convolution layers use same-size padding to avoid losing features near the edges of the image, and all convolution layers use the ReLU activation function. Max-pooling is also applied between most convolution layers to increase the generalization of the learned features. The dense, fully connected layer also uses the ReLU activation function as the model performed best when tuned to this function. Finally, the output layer indicates the most likely label using the Softmax function.
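A sketch of this topology is given below; the filter counts and dense-layer width are assumptions (Table 2 gives the actual configuration), but the kernel sizes, strides, padding, pooling placement, and the 164 × 164 input reproducing the 6 × 6 output window follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

shadownet = models.Sequential([
    layers.Input(shape=(164, 164, 1)),
    # 7x7 kernel, stride 3, no padding: 164 -> 53.
    layers.Conv2D(32, 7, strides=3, padding="valid", activation="relu"),
    layers.MaxPooling2D(2),                                    # 53 -> 26
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                                    # 26 -> 13
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                                    # 13 -> 6
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # 6x6 output window
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dense(11, activation="softmax"),
])
```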
As this architecture is an entirely new model created for the problem presented in this paper, no transfer learning could be used to jump-start the training process. This leads to all 10,099,851 parameters in the model being trainable parameters. The model was trained for up to 75 epochs, assuming that this model would take longer to train than VGG-16 due to the model learning from scratch rather than from prior knowledge. The use of a callback to end training upon stabilization led to 43 training epochs which lasted 4 h 49 min and 50 s, with a validation time of 4 min 40 s. The results of this training, validation, and verification of the model can be viewed in Section 3.

2.3. Random Forest Ensemble

As the organisms present in these data exhibit a wide variety of complexity, the CNN architectures examined are of various depths. The ShadowNet CNN is a shallow architecture performing only five convolutions. This model was designed to use more edge features and less complex internal features. A model like this is anticipated to do well on categories such as INV, COP, and VEL, which are typically dark, solid masses, with well-defined profiles in the shadowgraph imagery. In very clear images, this model should perform well on CHA as well due to the very distinct profile these organisms create. VGG-16 and ResNet-50 were selected for their ability to process more complex features, with VGG-16 being a medium-sized model performing 13 convolutions and ResNet-50 being a very deep model performing 48 convolutions. These extra layers are anticipated to be important in making more complex decisions, such as the difference between APP and SNO or CTE and HYD. The extra complexity of these models, however, can also lead to overfitting on those less complex organisms. As such, an ensemble method, IchthyNet, is introduced to leverage the strengths of each model to overcome their individual weak points. The architecture from the breakdown of the original shadowgraph frame to the output of the ensemble is shown in Figure 6.
The IchthyNet ensemble approach employs a Random Forest Classifier to examine the output label and confidence of each of the three underlying models and determine the best classification for each sample. This ensemble is trained on 10% of the original classifications from the two test datasets to learn which models perform best on which categories and how much their confidence levels contribute to an accurate answer. It then uses the learned relationships between the model predictions and confidence levels per class in these data to deconflict all predictions made by the three models.
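A minimal sketch of this deconfliction step with scikit-learn follows; the feature layout (each model's predicted class index plus its softmax confidence) mirrors the description above, while the forest hyperparameters and the `preds_*`/`confs_*`/`y_train` variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_features(preds: np.ndarray, confs: np.ndarray) -> np.ndarray:
    """preds, confs: shape (n_samples, 3), one column per CNN, holding the
    predicted class index and its softmax confidence, respectively."""
    return np.column_stack([preds, confs])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Fit on the small labeled subset (10% of the test-set classifications)...
rf.fit(to_features(preds_train, confs_train), y_train)
# ...then deconflict the three CNN predictions for every remaining ROI.
final_labels = rf.predict(to_features(preds_all, confs_all))
```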

3. Results

During training and evaluation, the ShadowNet CNN evaluates with the highest performance measures at 95% (Table 3), with VGG-16 right behind it at 94% and ResNet-50 following at 92%. Overall, all models do very well during training and evaluation, meeting our goal of 90%. Examining the loss per epoch and accuracy measured during the training of each model (Figure 7), it is clear that none of the models experience overfitting. This was safeguarded by utilizing the early-stopping mechanism to end training when the validation accuracy stops improving. The gap between the training and validation accuracy and loss per epoch for the ResNet-50 model, however, indicates that, were early stopping not in place, this model would have become overfit. This is evident as the training accuracy continues to increase while the validation accuracy remains stable. The other two models, however, maintain a close relationship between training and validation accuracy until the point of stabilization.
Despite training for varying epochs (Figure 7), all models trained for roughly the same amount of time with each completing training near 4 h. VGG-16 took the longest time to perform the evaluation on the 37,400 ROIs, taking nearly 9 min while ResNet-50 took just over 8 min and ShadowNet completed evaluation in just under 5 min. A test run on the CPU alone showed that the models take roughly 2 s longer per batch, with the batch sizes currently being set to 512 samples.
The accuracy, F1-score, precision, recall, and AUC were also calculated for two independent validation sets (Table 4). The first validation set comprises 103,000 labeled samples from the coast of Delaware Bay in late April of 2018, which were withheld during the training and evaluation process. The second comprises 34,684 samples from the Plankton 1.0 dataset, collected in the Straits of Florida between 3 June and 6 June of 2014 [41]. This dataset is publicly available through the Kaggle National Data Science Bowl held in 2015. All models perform with an improvement over the original evaluation scores on the Delaware data, with the accuracy, F1-score, precision, and recall of the VGG-16 model ranging from 94–95%, 97% for the ShadowNet model, and 93% for the ResNet-50 model. All models, however, have a decreased performance on the Straits of Florida dataset, with the accuracy, F1-score, precision, and recall of the ShadowNet and VGG-16 models dipping to 91–92% and the ResNet-50 model dipping slightly less, to 92%. Even on the less familiar Florida data, however, all models evaluate with an AUC of 98% or greater, meaning, overall, the models separate the classes very well. The ensemble model utilizing all three models performs the best on each dataset, achieving 98% for accuracy, F1-score, precision, and recall on the Delaware Bay data and 96% on the Plankton 1.0 dataset, with an AUC of 99% on both datasets.
Examining the accuracy per label for the ROIs sheds some insight into the shortcomings of the models (Table 5). For the Delaware dataset, the VGG-16 model performs the best on HYD at 99% accuracy, followed closely by SHR, SNO, and VEL at 98% accuracy. While this model achieves a reported 100% accuracy on the DIA category for the Delaware data, it was observed that the model heavily over-classifies this category, making its performance on that category less than ideal. In these data, ShadowNet achieves 99% accuracy for the VEL category, with DIA, HYD, and SHR being the next best categories at 98% accuracy. ResNet-50 also achieves high accuracy on the DIA category at 99%, with its second best category being HYD at 97%. In this dataset, all models struggle the most with the INV category, with the best two models being VGG-16 and ShadowNet at 90% and ResNet-50 achieving just 84%. The combination of these models by use of the ensemble approach leads to very high classification accuracy for DIA, HYD, SHR, and VEL at 99% on the Delaware Bay data, with CHA, COP, SIP, and SNO following right behind at 98%.
The trends in the accuracy per label are slightly different for the data collected in Florida. Here, the VGG-16 model performs the best on CHA and COP, both at 95% accuracy. In these data, ShadowNet achieves 97% accuracy for COP, with its next best category being INV at 96% accuracy. ResNet-50 achieves its highest accuracy on CHA and COP at 95%, with HYD and DIA at 94%. In this dataset, all models struggle the most with the CTE category, with all individual models failing to classify even 50% of these samples correctly. The combination of these models by use of the ensemble approach, however, rectifies this shortcoming, classifying CTE with 71% accuracy. The ensemble also lifts the top classifications on the unseen Florida data to 98% accuracy for CHA and COP and 97% for INV.
Finally, the models are run on all collected data from the seven tows. Each model was run three times, and the resulting time is the average of these three performances. Overall, ShadowNet completes all tows slightly faster than the other models; however, all models run in comparable times (Table 6).
As the tows were not uniform, the models typically took the longest on Tow 3, which had more than twice the number of ROIs of any other individual tow. On this tow, VGG-16 ran for 15 h, ResNet-50 ran for 14 h and 20 min, and ShadowNet ran for 13 h and 13 min. On the smallest tow, Tow 2, the models had more comparable run times, with ResNet-50 taking 3 h and 17 min, VGG-16 taking 3 h and 4 min, and ShadowNet taking 2 h and 30 min. This means that if all models were run in parallel, the ensemble model would complete in under 3 days, and if all models were run in sequence, it would finish processing the dataset of nearly 30 million images in roughly 6 days.

4. Discussion

From the performance of the ensemble on the Delaware Bay data and the Plankton 1.0 dataset, it is evident that utilizing the three CNN architectures together in the form of an ensemble model yields the highest accuracy. This is also confirmed by reviewing the distribution of model calls to correct calls, presented in Figure 8. In both cases, using the ensemble model over the individual models leads to the best preservation of the true labels. This is to be expected as the ensemble approach combines the information from all three models, using the information gained from the training step to learn how the predictions of these models relate to the labeled data. As such, it can be anticipated that the ensemble approach has learned which model is most trustworthy for each label, allowing it to reach a more accurate decision when the models disagree about a classification. There were two cases (INV for the Delaware dataset and DIA for the Florida dataset) where the ensemble missed a few of the labels that the constituent models would have correctly labeled; however, the overall performance was more stable. As the ensemble was trained using a very small subset of labels to learn the relationships among the model predictions, this is likely due to cases where the models disagree in a fashion that the ensemble has not yet seen.
To verify that the performance seen in the two validation datasets is representative across all of the labeled data, the predictions of each model on all of the labeled data from the Delaware Bay area were examined. This is the original, imbalanced dataset of 1,048,511 labeled samples. Across these data, it is again evident (Figure 9) that the ensemble model makes the highest number of correct calls across all categories, achieving an accuracy, precision, recall, and F1 score of 98%. The distribution of the validated calls for this large subset of the original seven tows also matches expectations on the distributions of the organism clusters, with copepods and diatoms being the most abundant detected categories [42]. This distribution of calls is promising, as the data had to be balanced for training by reducing samples per label to the smallest subset of just 17,000 samples per label. The size of the training set was then augmented using random rotation and random thresholding to regrow the dataset. The quality of the predictions for all models, but particularly for the ensemble, provides evidence that this balancing and augmentation succeeded without overfitting the models to particular representations of each organism.
It was noted that the models, including the ensemble, had decreased performance on the Plankton 1.0 dataset. A slight dip in performance on these data was anticipated for the ShadowNet model, as the data appear to have been heavily processed to remove the background from each ROI. This leads to slight deterioration in the edge features, on which the ShadowNet model is more reliant, while the VGG and ResNet models saw a performance increase for certain categories in these data, likely due to that reduction in noise. All models, however, had subpar performance for the CTE category in this dataset. The examination of the missed CTE samples compared to the correct CTE samples from the Plankton 1.0 dataset (Figure 10) shed some light on this. Comparing the missed samples to the samples the IchthyNet ensemble correctly identified, it became clear that the correctly identified CTE samples had less distortion in their features. This is because these samples were larger than the missed samples, leading to less obscuring of features when the up-scaling takes place. This makes sense as the CTE samples used in training, even at the highest threshold point, have well-defined features. This is likely due to the minimum size used when selecting regions of interest. The region of interest selection process for creating the training data ruled out objects smaller than 750 square pixels. While other categories in the Plankton 1.0 dataset do exhibit samples smaller than the 750 square-pixel minimum, the other categories with a significant number of smaller samples (APP, DIA, and COP) seem to be less impacted, as these samples do not exhibit complex internal features which may get lost due to distortion during up-scaling.
A breakdown of the different CTE categories in the missed labels (Figure 11) reveals that the majority of the missed calls are cydippids with tentacles, which account for more than 50% of the missed CTE calls. These are frequently mistaken for either the SIP or HYD category by the models. The bulk of the remaining missed CTE calls are cydippids with no tentacles (31–41%), with lobates being only 5–10% of the missed samples. This, again, makes sense in terms of the size of the missed samples, as cydippids tend to be the smaller samples. Many of the missed cydippid samples from the Plankton 1.0 dataset measure under 20 pixels across at their longest point, meaning no samples exist in the training data that would match the size of these samples. This means that, despite the augmentation to obscure the features by means of filtering, no training samples exist that have undergone the same amount of pixelation due to the up-scaling process. In the original data, however, these categories are split more evenly, with 39% of the samples being cydippids with tentacles, 32% being cydippids without tentacles, and 29% being lobates. This would seem to indicate a lack of equivalent cydippid samples in the labeled training dataset.
While the ensemble performs less than ideally on the CTE data gathered from the Plankton 1.0 dataset, the approach does preserve 21% more of the CTE samples than the next best individual model, VGG-16. The ensemble was also particularly useful for preserving the other categories with subpar individual performance. Based on individual performance, the models lose an average of 44% of the CTE samples, 11% of APP, 15% of SHR, and 19% of SIP. Using the ensemble approach cut the error for each of these categories to less than half, mislabeling only 21% of CTE, 4% of APP, 8% of SHR, and 7% of SIP. These numbers are even lower in the Delaware dataset, which is expected as these data more closely resemble the data used for training.

5. Conclusions

In this paper, we study the use of three CNN architectures for building an ensemble approach to the classification of ten groups of plankton plus marine snow. The models were selected based on their differing levels of feature complexity, which combine in such a way that the ensemble can utilize these differences to achieve higher performance than the constituent networks. In this work, we evaluate two CNN models trained using transfer learning and one smaller network created from scratch. The ensemble model employs a Random Forest algorithm on the output label and confidence of each model to make a final determination of the correct label. Experiments show that each of the models had differing strengths and weaknesses in the two evaluation datasets, and that the use of these models in an ensemble leads to a substantial performance improvement, particularly on external datasets. The experimental results show that the ensemble model obtains state-of-the-art performance on both the in-house prepared dataset covering the Delaware Bay and the Plankton 1.0 dataset obtained from Kaggle’s DataScience Bowl. The ensemble struggled predominantly in a single category of the previously unseen data, hinting that additional data should be collected and labeled to overcome a lack of cydippid samples in the CTE category.
The present research has been an excellent start to determining the species/functional group composition and distribution of the 1.88 mm to 40 mm (ESD) size organisms [21] encountered in the shadowgraph imagery from the Delaware Bay area. We aim to use the predictions of the ensemble model to examine the additional environmental data collected during the tows for relationships to the density of the detected scatterers. As the model presently achieves an anticipated 99% AUC and F1-score of 98% on the data collected in the study area, any strongly noted correlations should be beneficial in determining such relationships.
While initial testing on two ISIIS datasets demonstrated good generalization on the currently supported classifications, future work will be undertaken to extend the application of this ensemble model, such that the segmentation of the image is performed as part of the model rather than as a preprocessing step utilizing external software. In light of the large number of CTE samples which appear to be misclassified as a result of a lack of equivalently sized samples in the training data, this segmentation process will be tuned to encapsulate smaller organisms to improve the generalization of the models. We also aim to extend the number of organisms classified by adding a step to filter out low-confidence predictions so that organisms that do not match the currently supported classifications can be marked for manual review. These new samples, alongside publicly available data, will be used to determine whether a substantial number of samples exist to add other phytoplankton categories to the models, as the modular design of the IchthyNet ensemble allows for the continual growth of supported categories. This will allow the model to be applied to more fine-tuned regions to overcome the current capability gap of classifying small organisms and extend capabilities for the classification of overlapping organisms.

Author Contributions

Conceptualization, B.P.; methodology, B.S. and B.P.; software, B.S.; validation, B.S. and B.P.; formal analysis, B.S. and B.P.; investigation, B.S. and B.P.; data curation, B.S. and B.P.; writing—original draft preparation, B.S.; writing—review and editing, B.S. and B.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Naval Research Laboratory under the NRL Program Element 61153.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

Acknowledgments

We would like to thank the captain and crew of the RV Hugh R. Sharp. Ian Martens supported all aspects of cruise logistics and was instrumental in making for a successful field sampling expedition. Adam Greer, Alexis Hagemeyer, and Audrey Orzech collected and identified the ISIIS imagery. Finally, we thank Christopher Wood for data preparation and marshaling.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Marino, A.; Geneva, A. Deep scattering layer investigation through multi-beam bathymetry. In Proceedings of the OCEANS’94, Brest, France, 13–16 September 1994; IEEE: New York, NY, USA; Volume 3, pp. III–184.
2. Lee, E.H.; Choi, S.Y.; Seo, M.H.; Soh, H.Y. Impacts of hypoxia on the mesozooplankton community structure in a semi-enclosed bay. Front. Mar. Sci. 2022, 9, 1005442.
3. Boldrocchi, G.; Villa, B.; Monticelli, D.; Spanu, D.; Magni, G.; Pachner, J.; Mastore, M.; Bettinetti, R. Zooplankton as an indicator of the status of contamination of the Mediterranean Sea and temporal trends. Mar. Pollut. Bull. 2023, 197, 115732.
4. Parmar, T.K.; Rawtani, D.; Agrawal, Y.K. Bioindicators: The natural indicator of environmental pollution. Front. Life Sci. 2016, 9, 110–118.
5. Nelson, D.M.; Tréguer, P.; Brzezinski, M.A.; Leynaert, A.; Quéguiner, B. Production and dissolution of biogenic silica in the ocean: Revised global estimates, comparison with regional data and relationship to biogenic sedimentation. Glob. Biogeochem. Cycles 1995, 9, 359–372.
6. Acuña, J.L.; Deibel, D.; Saunders, P.A.; Booth, B.; Hatfield, E.; Klein, B.; Mei, Z.-P.; Rivkin, R. Phytoplankton ingestion by appendicularians in the North Water. Deep Sea Res. Part II Top. Stud. Oceanogr. 2002, 49, 5101–5115.
7. Llopiz, J.K.; Richardson, D.E.; Shiroza, A.; Smith, S.L.; Cowen, R.K. Distinctions in the diets and distributions of larval tunas and the important role of appendicularians. Limnol. Oceanogr. 2010, 55, 983–996.
8. Canino, M.F.; Grant, G.C. The feeding and diet of Sagitta tenuis (Chaetognatha) in the lower Chesapeake Bay. J. Plankton Res. 1985, 7, 175–188.
9. Purcell, J.E. Dietary composition and diel feeding patterns of epipelagic siphonophores. Mar. Biol. 1981, 65, 83–90.
10. Main, R.J. Observations of the feeding mechanism of a ctenophore, Mnemiopsis leidyi. Biol. Bull. 1928, 55, 69–78.
11. Fulton, R.S.; Wear, R.G. Predatory feeding of the hydromedusae Obelia geniculata and Phialella quadrata. Mar. Biol. 1985, 87, 47–54.
12. Christie, M.R.; Tissot, B.N.; Albins, M.A.; Beets, J.P.; Jia, Y.; Ortiz, D.M.; Thompson, S.E.; Hixon, M.A. Larval connectivity in an effective network of marine protected areas. PLoS ONE 2010, 5, e15715.
13. Orenstein, E.; Ayata, S.D.; Maps, F.; Biard, T.; Becker, E.; Benedetti, F. Machine learning techniques to characterize functional traits of plankton from image data. Limnol. Oceanogr. 2022, 67, 1647–1669.
14. Ciranni, M.; Murino, V.; Odone, F.; Pastore, V.P. Computer vision and deep learning meet plankton: Milestones and future directions. Image Vis. Comput. 2024, 143, 104934. Available online: https://www.sciencedirect.com/science/article/pii/S0262885624000374 (accessed on 12 December 2024).
15. Bi, H.; Cheng, Y.; Cheng, X.; Benfield, M.C.; Kimmel, D.G.; Zheng, H.; Groves, S.; Ying, K. Taming the data deluge: A novel end-to-end deep learning system for classifying marine biological and environmental images. Limnol. Oceanogr. Methods 2024, 22, 47–64.
16. Ellen, J.S.; Ohman, M.D. Beyond transfer learning: Leveraging ancillary images in automated classification of plankton. Limnol. Oceanogr. Methods 2024, 22, 943–952.
17. Greer, A.T.; Duffy, P.I.; Walles, T.J.; Cousin, C.; Treible, L.M.; Aaron, K.D.; Nejstgaard, J.C. Modular shadowgraph imaging for zooplankton ecological studies in diverse field and mesocosm settings. Limnol. Oceanogr. Methods 2024, 23, 67–86.
18. Eerola, T.; Batrakhanov, D.; Barazandeh, N.V.; Kraft, K.; Haraguchi, L.; Lensu, L.; Suikkanen, S.; Seppälä, J.; Tamminen, T.; Kälviäinen, H. Survey of automatic plankton image recognition: Challenges, existing solutions and future perspectives. Artif. Intell. Rev. 2024, 57, 114.
19. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
20. Lumini, A.; Nanni, L. Deep learning and transfer learning features for plankton classification. Ecol. Inform. 2019, 51, 33–43.
21. Greer, A.T.; Lehrter, J.C.; Binder, B.M.; Nayak, A.R.; Barua, R.; Rice, A.E.; Cohen, J.H.; McFarland, M.; Hagemeyer, A.; Stockley, N.; et al. High-Resolution Sampling of a Broad Marine Life Size Spectrum Reveals Differing Size- and Composition-Based Associations with Physical Oceanographic Structure. Front. Mar. Sci. 2020, 7, 542701.
22. Cowen, R.K.; Guigand, C.M. In situ ichthyoplankton imaging system (ISIIS): System design and preliminary results. Limnol. Oceanogr. Methods 2008, 6, 126–132.
23. Luo, J.Y.; Irisson, J.O.; Graham, B.; Guigand, C.; Sarafraz, A.; Mader, C.; Cowen, R.K. Automated plankton image analysis using convolutional neural networks. Limnol. Oceanogr. Methods 2018, 16, 814–827.
24. Greer, A.T.; Chiaverano, L.M.; Luo, J.Y.; Cowen, R.K.; Graham, W.M. Ecology and behaviour of holoplanktonic scyphomedusae and their interactions with larval and juvenile fishes in the northern Gulf of Mexico. ICES J. Mar. Sci. 2018, 75, 751–763.
25. Faillettaz, R.; Picheral, M.; Luo, J.Y.; Guigand, C.; Cowen, R.K.; Irisson, J.O. Imperfect automatic image classification successfully describes plankton distribution patterns. Methods Oceanogr. 2016, 15, 60–77.
26. Schneider, C.A.; Rasband, W.S.; Eliceiri, K.W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 2012, 9, 671–675.
27. LeCun, Y.; Cortes, C.; Burges, C. The MNIST Database. MNIST Handwritten Digit Database. 2010. Available online: https://wiki.pathmind.com/mnist (accessed on 26 September 2023).
28. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
29. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
31. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
32. Indolia, S.; Goswami, A.K.; Mishra, S.P.; Asopa, P. Conceptual understanding of convolutional neural network—A deep learning approach. Procedia Comput. Sci. 2018, 132, 679–688.
33. Masci, J.; Meier, U.; Ciresan, D.; Schmidhuber, J.; Fricout, G. Steel defect classification with max-pooling convolutional neural networks. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012; IEEE: New York, NY, USA; pp. 1–6.
  34. Gao, B.; Pavel, L. On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning. arXiv 2017, arXiv:1704.00805. [Google Scholar]
  35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  36. Tammina, S. Transfer learning using vgg-16 with deep convolutional neural network for classifying images. Int. J. Sci. Res. Publ. 2019, 9, 143–150. [Google Scholar] [CrossRef]
  37. Islam, S.; Khan, S.I.A.; Abedin, M.M.; Habibullah, K.M.; Das, A.K. Bird species classification from an image using VGG-16 network. In Proceedings of the 7th International Conference on Computer and Communications Management, Bangkok, Thailand, 27–29 July 2019; pp. 38–42. [Google Scholar]
  38. Guan, Q.; Wang, Y.; Ping, B.; Li, D.; Du, J.; Qin, Y.; Xiang, J. Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: A pilot study. J. Cancer 2019, 10, 4876. [Google Scholar] [CrossRef] [PubMed]
  39. He, F.; Liu, T.; Tao, D. Why resnet works? Residuals generalize. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5349–5362. [Google Scholar] [CrossRef] [PubMed]
  40. Borawar, L.; Kaur, R. ResNet: Solving Vanishing Gradient in Deep Networks. In Proceedings of the International Conference on Recent Trends in Computing: ICRTC 2022, Ghaziabad, India, 3–4 June 2022; Springer: Singapore, 2022. [Google Scholar]
  41. Luo, A.; BoozAllen, J.; Sullivan, J.; Mills, S.; Cukierski, W. National Data Science Bowl. Kaggle. 2014. Available online: https://kaggle.com/competitions/datasciencebowl (accessed on 25 July 2014).
  42. Naz, T.; Burhan, Z.; Jamal, P.J.A. Seasonal abundance of diatoms in correlation with the physicochemical parameters from coastal waters of Pakistan. Pak. J. Bot. 2013, 45, 1477–1486. [Google Scholar]
Figure 1. A map of the seven tows where the data were collected off the Delaware coast, with the depth distribution of collected ROIs for each tow.
Figure 2. Sample images from the examined labels (not to scale), where the labels are as follows: APP for appendicularians, CHA for chaetognaths, COP for copepods, CTE for ctenophores, DIA for diatoms, HYD for hydromedusae, INV for invertebrate larvae, SHR for shrimp, SIP for siphonophores, SNO for marine snow, and VEL for veligers.
Figure 3. Adapted VGG-16 architecture for an input of 224 × 224 × 3 and an output of the 11 labels.
Figure 4. Adapted ResNet-50 architecture for an input of 224 × 224 × 3 and an output of 11 labels.
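The adaptations in Figures 3 and 4 replace the ImageNet classification heads of the pretrained backbones with an 11-way softmax output. A minimal Keras sketch of this kind of adaptation is shown below; it is illustrative only, and the frozen-backbone policy, head size, and optimizer are assumptions, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' exact configuration) of adapting a
# pretrained VGG-16 backbone to the 11 shadowgraph labels; ResNet-50
# follows the same pattern via tf.keras.applications.ResNet50.
import tensorflow as tf

NUM_CLASSES = 11  # APP, CHA, COP, CTE, DIA, HYD, INV, SHR, SIP, SNO, VEL

base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # assumption: freeze the backbone, train only the new head

x = tf.keras.layers.Flatten()(base.output)
x = tf.keras.layers.Dense(256, activation="relu")(x)  # head size is an assumption
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(base.input, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```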
Figure 5. ShadowNet CNN architecture for an input of 164 × 164 × 1 and an output of 11 labels.
Figure 6. Flowchart of the IchthyNet ensemble approach beginning with the initial shadowgraph frame and ending with the Random Forest Classifier’s output label.
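The stacking step in Figure 6 can be illustrated as follows: each CNN emits an 11-way probability vector per ROI, the three vectors are concatenated into a 33-dimensional feature vector, and the Random Forest Classifier issues the final label. The sketch below uses placeholder random inputs and hyperparameters in place of the actual CNN outputs and the authors' settings.

```python
# Illustrative sketch of the IchthyNet stacking step in Figure 6: the three
# CNNs' 11-way softmax outputs are concatenated per ROI and a Random Forest
# issues the final label. Dummy data and hyperparameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rois, n_classes = 500, 11

# Placeholder stand-ins for the per-ROI softmax outputs of each CNN.
p_vgg = rng.dirichlet(np.ones(n_classes), size=n_rois)
p_shadownet = rng.dirichlet(np.ones(n_classes), size=n_rois)
p_resnet = rng.dirichlet(np.ones(n_classes), size=n_rois)
y = rng.integers(0, n_classes, size=n_rois)  # placeholder true labels

# Stack the three probability vectors into one 33-dimensional feature per ROI.
X = np.concatenate([p_vgg, p_shadownet, p_resnet], axis=1)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
ensemble_labels = rf.predict(X)  # final ensemble label per ROI
```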
Figure 7. Accuracy and loss progression over all training epochs for each model: (A) VGG-16, (B) ShadowNet, and (C) ResNet-50.
Figure 8. Comparison of the quantity of classifications made by VGG-16, ShadowNet, IchthyNet Ensemble, and ResNet-50 for each category in the Delaware data (top) and the Florida data (bottom) versus the number of calls in each category that actually exist in the data (Real).
Figure 9. Comparison of the quantity of classifications made by VGG-16, ShadowNet, IchthyNet Ensemble, and ResNet-50 for each category in the total set of labeled data from the Delaware Bay area versus the number of calls in each category that actually exist in the data (Real).
Figure 10. Samples taken from the missed CTE specimens of the Florida data (top), the CTE specimens the IchthyNet ensemble correctly identified (middle), and the highest threshold level CTE specimens seen during training (bottom).
Figure 11. Distribution of the incorrect CTE calls from the Plankton 1.0 dataset. This dataset had three different categories for its CTE group: cydippid_tentacles, cydippid_no_tentacles, and lobates.
Table 1. Summary of Tows 1–7, including the tow reference number, the collection date, and the number of ROIs collected during each tow.
| Tow | Collection Date | ROI Count |
|---|---|---|
| 1 | 27 April 2018–28 April 2018 | 4,852,173 |
| 2 | 30 April 2018 | 1,770,401 |
| 3 | 1 May 2018–2 May 2018 | 8,366,266 |
| 4 | 2 May 2018–3 May 2018 | 3,362,544 |
| 5 | 3 May 2018–4 May 2018 | 3,516,418 |
| 6 | 5 May 2018 | 4,135,340 |
| 7 | 6 May 2018 | 3,815,775 |
Table 2. Configurations of the convolution layers in the ShadowNet model developed during this research, which takes in black-and-white images sized 164 × 164 and outputs 11 labels.
| Layer | Input Size | Filters | Kernel Size | Step Size | Output Size |
|---|---|---|---|---|---|
| Conv2D 1 | 164 × 164 | 64 | 7 × 7 | 3 | 55 × 55 |
| Conv2D 2 | 27 × 27 | 128 | 5 × 5 | 1 | 27 × 27 |
| Conv2D 3 | 27 × 27 | 128 | 3 × 3 | 1 | 27 × 27 |
| Conv2D 4 | 13 × 13 | 256 | 3 × 3 | 1 | 13 × 13 |
| Conv2D 5 | 13 × 13 | 256 | 3 × 3 | 1 | 13 × 13 |
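The convolution rows in Table 2 can be assembled into a runnable sketch. Note that the layers bridging the jumps in feature-map size (55 → 27 and 27 → 13) and the classification head are not listed in the table, so the 2 × 2 max-pooling layers and dense output below are assumptions added only to reconcile the listed input/output sizes.

```python
# Sketch of the ShadowNet convolutional stack from Table 2. Only the Conv2D
# rows come from the table; the 2 x 2 max-pooling layers and the dense head
# are assumptions added to bridge the listed feature-map sizes.
import tensorflow as tf
from tensorflow.keras import layers

shadownet = tf.keras.Sequential([
    layers.Input(shape=(164, 164, 1)),                         # black-and-white ROI
    layers.Conv2D(64, 7, strides=3, padding="same",
                  activation="relu"),                          # -> 55 x 55 x 64
    layers.MaxPooling2D(2),                                    # assumed -> 27 x 27
    layers.Conv2D(128, 5, padding="same", activation="relu"),  # -> 27 x 27 x 128
    layers.Conv2D(128, 3, padding="same", activation="relu"),  # -> 27 x 27 x 128
    layers.MaxPooling2D(2),                                    # assumed -> 13 x 13
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # -> 13 x 13 x 256
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # -> 13 x 13 x 256
    layers.Flatten(),
    layers.Dense(11, activation="softmax"),                    # 11 output labels
])
shadownet.summary()
```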
Table 3. Comparison of the reported accuracy, F1-score, precision, recall, and training and evaluation times for each model during the training process.
| Model | Acc | F1 | Precision | Recall | Train Time (h:m:s) | Eval Time (h:m:s) |
|---|---|---|---|---|---|---|
| VGG-16 | 0.94 | 0.94 | 0.94 | 0.94 | 04:21:21 | 0:08:44 |
| ShadowNet | 0.95 | 0.95 | 0.95 | 0.95 | 04:49:50 | 0:04:40 |
| ResNet-50 | 0.92 | 0.92 | 0.92 | 0.92 | 03:50:31 | 0:08:02 |
Table 4. Performance measures (accuracy, F1-score, precision, recall, area under the curve, and evaluation time) for each model on the two test datasets. As the ensemble is the result of a quick-fit Random Forest algorithm applied to the outputs of the three models, its evaluation time is computed as the sum of all models' times plus 10 s.
Delaware

| Model | Acc | F1 | Precision | Recall | AUC | Eval Time (h:m:s) |
|---|---|---|---|---|---|---|
| VGG-16 | 0.95 | 0.95 | 0.95 | 0.94 | 0.99 | 0:19:51 |
| ShadowNet | 0.97 | 0.97 | 0.97 | 0.97 | 0.99 | 0:36:15 |
| ResNet-50 | 0.93 | 0.93 | 0.93 | 0.93 | 0.99 | 1:06:02 |
| IchthyNet Ensemble | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 2:02:17 |

Florida

| Model | Acc | F1 | Precision | Recall | AUC | Eval Time (h:m:s) |
|---|---|---|---|---|---|---|
| VGG-16 | 0.91 | 0.92 | 0.91 | 0.91 | 0.99 | 0:03:38 |
| ShadowNet | 0.91 | 0.92 | 0.91 | 0.91 | 0.99 | 0:01:56 |
| ResNet-50 | 0.92 | 0.92 | 0.92 | 0.92 | 0.98 | 0:04:18 |
| IchthyNet Ensemble | 0.96 | 0.96 | 0.96 | 0.96 | 0.99 | 0:10:02 |
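For reference, the metrics reported in Tables 3 and 4 can be computed with scikit-learn [35] as sketched below. Macro averaging and one-vs-rest AUC are assumptions here, since the tables do not restate the averaging scheme used.

```python
# Sketch of computing the reported metrics with scikit-learn [35].
# Macro averaging and one-vs-rest AUC are assumptions, not a restatement
# of the paper's exact averaging scheme.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluation_report(y_true, y_pred, y_prob):
    """y_true/y_pred: integer labels in 0-10; y_prob: (n_rois, 11) probabilities."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Precision": precision_score(y_true, y_pred, average="macro"),
        "Recall": recall_score(y_true, y_pred, average="macro"),
        "AUC": roc_auc_score(y_true, y_prob, multi_class="ovr"),
    }
```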
Table 5. Accuracy on the test data per category for each model.
Delaware

| Model | APP | CHA | COP | CTE | DIA | HYD | INV | SHR | SIP | SNO | VEL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG-16 | 0.95 | 0.86 | 0.92 | 0.89 | 1.0 | 0.99 | 0.90 | 0.98 | 0.93 | 0.98 | 0.98 |
| ShadowNet | 0.94 | 0.97 | 0.97 | 0.95 | 0.98 | 0.98 | 0.90 | 0.98 | 0.96 | 0.96 | 0.99 |
| ResNet-50 | 0.92 | 0.94 | 0.91 | 0.87 | 0.99 | 0.97 | 0.84 | 0.95 | 0.92 | 0.92 | 0.96 |
| IchthyNet Ensemble | 0.97 | 0.98 | 0.98 | 0.96 | 0.99 | 0.99 | 0.87 | 0.99 | 0.98 | 0.98 | 0.99 |

Florida

| Model | APP | CHA | COP | CTE | DIA | HYD | INV | SHR | SIP | SNO | VEL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG-16 | 0.90 | 0.95 | 0.94 | 0.50 | 0.88 | 0.92 | 0.93 | 0.86 | 0.79 | – | – |
| ShadowNet | 0.89 | 0.93 | 0.97 | 0.42 | 0.87 | 0.87 | 0.96 | 0.80 | 0.82 | – | – |
| ResNet-50 | 0.89 | 0.95 | 0.95 | 0.41 | 0.94 | 0.94 | 0.93 | 0.89 | 0.81 | – | – |
| IchthyNet Ensemble | 0.96 | 0.98 | 0.98 | 0.71 | 0.93 | 0.96 | 0.97 | 0.92 | 0.93 | – | – |
Table 6. Average evaluation time per tow (h:m:s) for each model.
| Model | Tow 1 | Tow 2 | Tow 3 | Tow 4 | Tow 5 | Tow 6 | Tow 7 |
|---|---|---|---|---|---|---|---|
| VGG-16 | 6:44:03 | 3:04:30 | 15:02:55 | 5:18:31 | 4:55:01 | 6:08:58 | 5:36:26 |
| ShadowNet | 6:30:36 | 2:30:34 | 13:13:02 | 4:12:32 | 4:17:57 | 6:02:29 | 5:10:22 |
| ResNet-50 | 7:48:57 | 3:17:16 | 14:20:41 | 7:39:01 | 4:45:42 | 6:44:37 | 6:35:58 |