1. Introduction
In the aquaculture industry, ensuring the precise feeding of bait is crucial [1,2,3,4], as it directly impacts the healthy growth of fish stocks. Traditional feeding methods mainly include manual feeding and timed, quantitative mechanical feeding. Manual feeding relies on the experience of aquaculture personnel and their observations of fish stocks to determine the quantity of feed needed. However, this method depends on individual subjective judgment, requires a high level of experience, and is labor-intensive, making it difficult to promote in large-scale, industrialized farming. The other method uses timed feeders to dispense feed, which overlooks variations in the feeding patterns of fish stocks and can lead to feeding quantities and timings that contradict the growth patterns of the fish. As a result, the bait may not be fully utilized, leading to water pollution [5]. Poorly treated water quality has a detrimental effect on fish growth and may even lead to mass mortality. Therefore, improving the precision and adaptability of feeding is an important challenge in aquaculture management.
In recent years, the development of automated feeding systems for fish stocks based on intelligent technologies and sensors [6] has emerged as a new focus. These systems are capable of dynamically adjusting feeding quantities and timings [7] based on the needs of the fish stocks, which aids in enhancing aquaculture efficiency, reducing resource wastage, and ensuring the healthy growth of fish stocks. It is anticipated that these systems will gradually replace experience-based manual feeding decisions in the future.
The spatial characteristics [8,9] of fish stocks undergo significant changes during the feeding process, making fish feeding behavior an important indicator of fish appetite. Among various behavior analysis methods, computer vision [10] is an efficient, non-contact detection technology [11,12] that has been widely applied in the analysis of fish behavior.
Traditional methods primarily quantify the intensity of fish feeding activities by extracting color and texture features from feeding images. Numerous studies utilize image segmentation [13] or object detection techniques to extract fish coordinates and calculate their swimming speeds [14] and aggregation levels per unit time for the analysis of fish feeding behavior. These methods can rapidly capture the behavioral characteristics of fish. However, given the complexity of lighting conditions in industrialized farming and the randomness of changes in fish behavior [15], conventional methods based on manually designed features struggle to accurately extract the characteristics of fish feeding behavior.
Since the victory of AlexNet [16] in the ImageNet competition in 2012, convolutional neural networks (CNNs) have garnered widespread attention. Unlike traditional feature extraction methods, CNNs do not require features to be manually designed, and in complex tasks they demonstrate higher robustness and accuracy than conventional methods. In recent years, an increasing number of deep learning models have been applied in aquaculture. Zhou et al. [17] proposed a method for analyzing fish feeding activities using LeNet-5, which combines a CNN and computer vision, categorizing fish feeding intensity into four states with a classification accuracy of 90% and effectively assessing fish feeding behavior. Hu et al. [18] employed deep learning technologies, including an improved R(2 + 1)D convolutional neural network, to identify the size of water ripples produced during fish feeding and decide whether to continue feeding. They developed a computer vision-based intelligent fish feeding system that not only recognizes the dynamics of water ripples but also integrates data from water quality sensors, achieving an accuracy of 93.2%. Zheng et al. [19] utilized a Spatio-Temporal Attention Network (STAN) to analyze the feeding behavior of Pompano fish stocks. By innovatively combining spatial images [20] with optical flow images [21,22,23] for video analysis, the study applied the STAN model to extract intuitive and perceptual features to determine the feeding or non-feeding states of Pompano fish stocks. This method incorporated a hierarchical convolutional network (HCN) to extract multi-scale spatial features, and experimental validation showed that the STAN model achieved a test accuracy of 97.97%. ResNet (Residual Network) [24], another widely used deep learning model, has also shown its effectiveness in fish feeding behavior classification tasks. By utilizing residual connections to prevent the vanishing gradient problem, ResNet models can achieve commendable accuracy. However, they typically possess a large number of parameters, necessitating more computational resources and training time, which makes them less suitable for resource-constrained small devices such as smartphones. In addition, the quality of fish feeding behavior images is affected by factors such as the fish pond environment and lighting conditions, leading to issues such as low contrast and uneven illumination, which make foreground features less prominent. Because individual fish occupy a relatively small area of each image, CNN models can also be disrupted by background features during feature extraction. These issues make it challenging for traditional CNN models to focus on fish behavior [25,26,27,28,29], reducing classification accuracy.
To tackle this issue, our research introduces a multi-step image pre-enhancement strategy that sequentially employs Multi-Scale Retinex with Color Restoration (MSRCR), Multi-Metric-Driven Contrast Limited Adaptive Histogram Equalization (mdc), and Unsharp Masking (USM). This approach is designed to eliminate problems related to water reflections, low contrast, and unclear image detail features. Following this, the lightweight [30,31,32] EfficientNet model is utilized to classify fish feeding behaviors. The test results demonstrate that the proposed MIPS (Multi-Step Image Pre-enhancement Strategy) module significantly enhances the accuracy of various CNN models in classification tasks. Furthermore, compared to advanced ResNet [33] models, it markedly reduces model training time.
The key contributions can be summarized as the following three points:
(1) This study proposes a multi-step image pre-enhancement strategy by integrating existing techniques such as color space conversion, Multi-Scale Retinex with Color Restoration (MSRCR), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Unsharp Masking (USM). This integration significantly improves the quality of underwater fish feeding behavior images.
(2) The main improvements include two parts: First, an innovative multi-parameter-driven CLAHE method is devised, which involves the detailed processing of only the luminance channel in the LAB color space. By carefully optimizing the clipLimit and tileGridSize parameters, this method balances local detail enhancement and overall visual coherence. Second, an adaptive image enhancement adjustment mechanism is proposed, based on comprehensive evaluation metrics, including Learned Perceptual Image Patch Similarity (LPIPS), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). This mechanism dynamically adjusts the degree of image enhancement, ensuring that the results meet both subjective visual perception and objective quality standards.
(3) The present study presents a comprehensive system architecture for classifying the feeding intensity of fish. By applying a multi-step image pre-enhancement strategy to optimized models from the EfficientNet and ResNet series, a significant improvement in the accuracy of classifying fish feeding behavior intensity was achieved, validating the effectiveness and practical value of this strategy.
2. Methodology
2.1. Overview
The methodological framework proposed in this paper, as illustrated in Figure 1, primarily consists of three modules: data collection and processing, image pre-enhancement, and classification based on lightweight CNN models.
The data collection and processing module involves first acquiring videos of fish feeding behavior in a production environment, followed by frame selection according to specific rules to form an image dataset. Within this module, data augmentation techniques such as random cropping and color enhancement are employed to increase the diversity of image samples and the model’s adaptability to geometric transformations and illumination changes. This, in turn, enhances the model’s accuracy and generalization ability in recognizing fish feeding behavior in complex natural environments. The image pre-enhancement module includes three main steps: (1) MSRCR, which is primarily used to address issues of water surface reflections in images [34,35]; (2) the Multi-Metric-Driven Contrast Limited Adaptive Histogram Equalization (mdc) technique, which aims to improve the contrast between fish, bait, and the background in the images; and (3) Unsharp Masking (USM) sharpening [36], which is employed to enhance the details within the images.
2.2. Data Collection
The experiment was conducted within a real aquaculture system at the Hai’an aquaculture base, which houses approximately 500 Koi fish, each weighing between 1200 and 1500 g. Throughout this study, the aquaculture pond environment was consistently maintained with dissolved oxygen levels at 7.1 ± 1.0 mg/L and a water temperature within the range of 12–15 °C. Prior to the commencement of the experiment, the fish were trained to feed at a fixed location to facilitate subsequent data collection. Feeding was carried out twice daily, at 9:30 a.m. and 5:30 p.m., with each feeding amount set to 1.5–2.0% of the total weight of the fish stock.
The image collection system ensures optimal shooting effects through the rational layout of light sources and cameras, as well as the use of wide-angle lenses, as illustrated in Figure 2. Firstly, the light source is located directly above the breeding area at a height of approximately 1.2 m, where it can evenly illuminate the entire water surface and the fish, thereby ensuring the clarity and detail of the images. Secondly, the camera is placed vertically above the breeding pool at a height of approximately 1.2 m, approximately 1 m above the water surface of the aquaculture pool. This height and angle cover the entire breeding pool, ensuring that all fish activities are captured without dead corners or blind spots, which is conducive to comprehensive observation and analysis of fish behavior and activities.
The video data collected were encoded in MP4 format, featuring a frame rate of 30 frames per second and a resolution of 1280 × 720 pixels. Image processing was performed using Python, with the PyTorch library utilized for constructing neural networks. The captured videos were segmented into 452 sub-videos, each containing 300 frames. Video frames were extracted from the collected video data at intervals of 100 frames to construct a dataset of fish stock images. Subsequently, the collected dataset of fish stock images was categorized and labeled into four classes based on the fish’s feeding behaviors and the amount of feed dispensed.
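The frame-sampling step described above can be implemented with a short OpenCV routine. The sketch below is illustrative; the file names and output directory are hypothetical and not taken from the original pipeline.

```python
# Hypothetical sketch: sampling every 100th frame from a sub-video with OpenCV.
import cv2
import os

def extract_frames(video_path, out_dir, interval=100):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:                       # keep one frame every `interval` frames
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: extract_frames("sub_video_001.mp4", "dataset/none", interval=100)
```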
As shown in Table 1, the behaviors of the fish vary significantly under different feeding states. At the initial stage of feeding, the fish exhibit a strong desire to feed, competing for bait while causing disturbances such as splashes and noise. Approximately 1–2 min later, as the bait begins to diminish, a portion of the fish ceases feeding, and the overall feeding intensity of the stock [37,38] decreases to a moderate level. After another minute, with a further reduction in bait, only a few fish continue to show feeding interest, and the school begins to disperse further. Around 4 min into the process, almost all fish lose their desire to feed and start to disperse, swimming slowly along the bottom. Images representing the different categories are illustrated in Figure 3.
To enhance the diversity of the data samples and thereby improve model performance, several preprocessing operations are performed on the images of the fish stock. Initially, the images are subjected to random aspect ratio cropping, where they are randomly cropped into various sizes and aspect ratios. Under this premise, these images are then randomly rotated by changing their rotation angles and centers. Subsequently, the rotated images are trimmed to a uniform size. Finally, the images of the fish stock undergo random alterations in brightness, contrast, saturation, and hue.
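A minimal torchvision sketch of the augmentations described above is given below; the specific parameter ranges (crop scale, rotation angle, jitter strength) are assumptions for illustration rather than the values used in this study.

```python
# Illustrative augmentation pipeline; all parameter ranges are assumed values.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0), ratio=(0.75, 1.33)),  # random aspect-ratio crop + resize
    transforms.RandomRotation(degrees=15),                                    # random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),                         # brightness/contrast/saturation/hue jitter
    transforms.ToTensor(),
])
```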
2.3. Multi-Step Image Pre-Enhancement Strategy
As illustrated in Figure 4, the multi-step image pre-enhancement strategy systematically improves the quality of underwater images. The original image displays a school of fish captured under natural lighting conditions in an underwater environment, affected by surface reflection and an uneven distribution of light. In the image processed by the MSRCR algorithm, the impact of surface reflection is reduced, and dynamic range compression and color restoration enhance the visual effect and color fidelity. The subsequent mdc processing step optimizes the local contrast, particularly enhancing the discernibility of details in low-contrast regions. Finally, the application of Unsharp Masking improves the local contrast and edge clarity of the image, especially in delineating the boundary between the fish and their background, thereby enhancing the visual resolution. The entire pre-enhancement process not only significantly improves image quality but also provides clear, enhanced visual information for deeper image analysis and fish behavior recognition without sacrificing color authenticity.
2.3.1. Multi-Scale Retinex with Color Restoration (MSRCR)
In response to the challenges posed by lighting conditions on the fidelity of images capturing piscine feeding behaviors, particularly the pronounced problem of specular reflections on water surfaces, this study advocates the employment of the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm as a remedial measure. The MSRCR algorithm facilitates the amelioration of water surface reflection issues by executing dynamic range compression, augmenting edge definition, and reinstating authentic coloration. Through these mechanisms, MSRCR significantly diminishes the discrepancies induced by uneven luminance, thereby ensuring the preservation of image consistency and verisimilitude across a spectrum of lighting conditions.
The MSRCR algorithm processes images by combining filters of multiple scales (or sizes) to better address variations in illumination across different scales. The algorithm includes the following steps:
Multi-Scale Decomposition: The original image undergoes decomposition into multiple scales, typically facilitated by Gaussian filters. Each scale is associated with a Gaussian filter of distinct size, designed to capture illumination details pertinent to that specific scale.
Single-Scale Retinex (SSR) Processing: The Retinex algorithm is individually applied to each scale. This phase involves the computation of the logarithmic domain discrepancy between the original image and its filtered counterpart, effectively highlighting the luminance contrast while mitigating illumination inconsistencies.
Combination of Processed Results: The SSR results from all scales are combined through a weighted sum (in practice, often a simple arithmetic mean with equal weights) to form the multi-scale Retinex output.
Color Restoration: Given that MSRCR processing may inadvertently alter the color balance of an image, a color restoration function is implemented to preserve the authenticity of the original hues. This ensures that the resultant image maintains a natural appearance, notwithstanding the algorithmic modifications to its luminance and contrast.
The formula for the multi-scale Retinex stage of MSRCR can be expressed as follows:

$$ R(x, y) = \sum_{i=1}^{N} w_i \left[ \log I(x, y) - \log L_i(x, y) \right] \quad (1) $$

In Equation (1), $R(x, y)$ represents the luminance at pixel point $(x, y)$ in the Retinex output image, $I(x, y)$ denotes the luminance of the corresponding pixel in the original image, $L_i(x, y)$ refers to the filtered image at the $i$-th scale, $w_i$ is the weight corresponding to that scale, and $N$ signifies the total number of scales.
The MSRCR algorithm is particularly suited for images captured under uneven lighting conditions, such as scenarios where shadowed areas are overly dark or illuminated regions are excessively bright. By adjusting the local contrast of the image, MSRCR is capable of rendering the details in these areas more visible and discernible. Furthermore, this algorithm excels in maintaining or enhancing the natural colors of the image, resulting in colors that appear more vivid and authentic.
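The following simplified Python sketch illustrates how Equation (1) can be realized with Gaussian filtering at several scales plus a color restoration term; the scale sizes, equal weights, and restoration constants are common defaults and are assumptions here, not the exact settings used in this study.

```python
# Simplified MSRCR sketch: multi-scale Retinex (Equation (1)) with a color restoration term.
import cv2
import numpy as np

def single_scale_retinex(img, sigma):
    # log(I) - log(Gaussian-filtered I) at one scale
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    return np.log(img) - np.log(blurred)

def msrcr(bgr, sigmas=(15, 80, 250), alpha=125.0, beta=46.0):
    img = bgr.astype(np.float64) + 1.0                      # avoid log(0)
    # Multi-scale Retinex: equally weighted sum over the chosen scales
    msr = sum(single_scale_retinex(img, s) for s in sigmas) / len(sigmas)
    # Color restoration factor, comparing each channel to the channel sum
    crf = beta * (np.log(alpha * img) - np.log(img.sum(axis=2, keepdims=True)))
    out = msr * crf
    # Stretch back to the displayable 0-255 range
    out = (out - out.min()) / (out.max() - out.min() + 1e-8) * 255.0
    return out.astype(np.uint8)
```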
2.3.2. Multi-Metric Driven CLAHE and Unsharp Masking
In aquatic environments, illumination is often uneven, leading to areas within images that are either too dark or too bright, with indistinct details. Contrast Limited Adaptive Histogram Equalization (CLAHE) [39,40,41] enhances the clarity of these regions by improving the contrast within local areas. Unlike conventional histogram equalization, CLAHE limits the extent of contrast enhancement to prevent issues of excessive enhancement, resulting in more natural contrast improvements and clearer details, particularly in areas of low contrast. Underwater environments typically exhibit low contrast, especially in turbid waters or situations marred by suspended particulates causing poor visibility. CLAHE effectively enhances local contrast, rendering the boundaries of fish, food particles, and the surrounding environment more discernible.
The application of CLAHE in this study fundamentally achieves the fine-tuning of contrast through a series of sequential operational steps, reflecting its practical operability as an image enhancement tool. Its key steps can be summarized as follows:
Histogram Clipping: If the value of a histogram bin exceeds a predefined contrast limit, the excess is uniformly distributed across other bins. This step prevents any single pixel value from becoming overly prominent, thereby avoiding unnatural image effects.
Histogram Equalization: The clipped histogram is equalized using the Cumulative Distribution Function (CDF). The equalization formula is typically as follows:

$$ s_k = (L - 1) \sum_{j=0}^{k} p(r_j) \quad (2) $$

In the context of Equation (2), $s_k$ represents the pixel value after equalization, $p(r_j)$ denotes the probability of pixel level $r_j$ in the original histogram, and $L$ is the number of gray levels. This step redistributes pixel values, making the brightness distribution of the image more uniform and enhancing the overall contrast of the image. It prevents any particular pixel value from becoming too prominent, which could result in unnatural image effects. Specifically, this formula calculates the value of each pixel in the new image to ensure that the overall contrast enhancement is more natural and uniform.
As illustrated in Figure 5, the algorithm design highlights two primary adjustable parameters within CLAHE: clipLimit and tileGridSize.
clipLimit: This parameter controls the threshold for histogram clipping. It determines the extent of contrast enhancement within each small block. If clipLimit is set lower, the contrast enhancement will be milder, which will help reduce noise but may not sufficiently enhance the image contrast. If clipLimit is set higher, the contrast enhancement will be stronger, which could introduce more noise or create unnatural image effects.
tileGridSize: This parameter determines how many small blocks (tiles) the image is divided into for individual histogram equalization. It is typically represented by two values (width and height). Smaller tile sizes can provide more detailed local enhancement but may result in visible artifacts at tile borders, whereas larger tile sizes give a more uniform overall effect with less pronounced local enhancement.
To ensure that image enhancement improves detail while maintaining color authenticity and overall visual harmony, this study improves upon the traditional CLAHE algorithm by proposing the Multi-Metric-Driven Contrast Limited Adaptive Histogram Equalization (mdc) algorithm. This includes an integrated image quality assessment mechanism to optimize the enhancement process, ensuring that clarity is improved without compromising color fidelity. As demonstrated in Figure 5, the enhancement workflow involves conversion to the LAB color space, independent processing of the luminance channel, and quality control based on the PSNR and SSIM indices, thus achieving a balance between visual effect and technical standards.
The main process is as follows:
Conversion from BGR to LAB Color Space: The LAB color space separates luminance and color information, allowing for operations to be carried out on luminance alone, thus avoiding impacts on color to maintain color fidelity.
Independent Processing of the L Channel: This involves separating the L channel from the LAB image, performing subsequent processing only on the L channel, and applying the CLAHE algorithm to the L channel. This approach focuses on enhancing the luminance details of the image, which is crucial for improving image quality.
Initialization of clipLimit and tileGridSize: Setting these parameters serves as the basis for beginning CLAHE processing, which can be adjusted based on outcomes to achieve desired effects.
CLAHE Application: Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to the L channel using the initial parameters set in the previous step. This stage is the core of the entire process; by limiting contrast enhancement, noise can be reduced, and over-enhancement can be prevented.
Image Quality Assessment: The quality of the enhanced image is evaluated using PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) as metrics for image quality. PSNR measures the similarity between the reconstructed image and the original image, while SSIM assesses the structural information, luminance, and contrast of the image. Through the evaluation with PSNR and SSIM, the quality of the image can be quantified to ensure that the visual improvements of the enhanced image are effective. Setting these conditions as quality thresholds ensures that further processing is only undertaken when the image reaches a certain quality standard, thereby avoiding unnecessary computations and potential quality degradation.
By setting the initial clipLimit and tileGridSize parameters, the algorithm controls the intensity of local contrast enhancement while enhancing the image contrast, preventing over-processing and noise generation. The independent processing of the luminance channel maintains color fidelity, avoiding color distortion. The quality of the enhanced image is evaluated using metrics such as PSNR and SSIM, ensuring that improvements in image quality are made without sacrificing detail. Overall, this processing workflow aims to enhance the visibility and detail of images, especially under uneven lighting conditions, while maintaining the naturalness of color and reducing noise. This approach provides more accurate and authentic visual information for various applications.
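As a hedged illustration of this workflow, the sketch below applies CLAHE to the L channel in LAB space and evaluates the result with PSNR and SSIM, backing off the clipLimit when the quality thresholds are not met; the threshold values and the back-off rule are assumptions, not the exact criteria used in this study.

```python
# Illustrative multi-metric-driven CLAHE (mdc) sketch; thresholds and back-off rule are assumed.
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def mdc_clahe(bgr, clip_limit=2.0, tile=(8, 8),
              psnr_min=20.0, ssim_min=0.8, max_tries=4):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)                       # process only the luminance channel
    for _ in range(max_tries):
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
        l_enh = clahe.apply(l)
        psnr = peak_signal_noise_ratio(l, l_enh, data_range=255)
        ssim = structural_similarity(l, l_enh, data_range=255)
        if psnr >= psnr_min and ssim >= ssim_min:  # quality gate satisfied
            break
        clip_limit *= 0.8                          # enhancement too aggressive: back off and retry
    out = cv2.merge((l_enh, a, b))
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```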
2.3.3. Image Sharpening
Image sharpening is applied to accentuate the edges and textural details of underwater images, particularly in deep water or low-visibility conditions. By enhancing the contrast in local regions, this technique makes the boundaries between the fish contours and the background more distinct and easier to analyze, which in turn helps improve the accuracy of model classification.
To address the issue of low-contrast images lacking clarity and detail, especially in dark or overexposed areas, the Unsharp Masking (USM) sharpening algorithm is introduced on top of the CLAHE algorithm based on the LAB color space. Unsharp Masking is a sharpening technique whose core idea is to subtract a blurred (low-pass filtered) version from the original image and add the weighted difference back to the original, thereby enhancing edges and details. The sharpening process can be represented as

$$ I_{\text{sharp}}(x, y) = I(x, y) + \lambda \left[ I(x, y) - I_{\text{blur}}(x, y) \right] $$

wherein $I_{\text{sharp}}(x, y)$ corresponds to the sharpened image, $I(x, y)$ represents the original image, $I_{\text{blur}}(x, y)$ is the image after blurring, and $\lambda$ signifies the enhancement coefficient. The specific algorithm design process is illustrated in Figure 5.
The specific operations are as follows:
Unsharp Masking: This is a common image sharpening technique used to enhance the visual clarity of an image. In this step, if the quality of the image, after CLAHE processing (based on PSNR and SSIM indices), is considered to be sufficiently good, Unsharp Masking is employed to further enhance the image’s details and edges. This sharpening is achieved by amplifying the high-frequency details within the image, which can make the image appear crisper; however, overuse may lead to noise or an over-sharpened effect in the image.
LPIPS < 0.4: Learned Perceptual Image Patch Similarity (LPIPS) is an advanced metric used to evaluate the perceptual similarity of images. This metric takes into account the characteristics of the human visual system to assess the perceptual similarity between the enhanced and original images. An LPIPS value of less than 0.4 indicates that the enhanced image is visually close enough to the original image, and the perceptual quality is deemed acceptable. This step ensures that the image enhancement does not deviate excessively, thus losing the perceptual characteristics and texture of the original image.
LAB to BGR conversion: After completing image sharpening and quality assessment, the image is converted from the LAB color space back to the BGR color space. This step is intended to translate the results of image processing into a more universally applicable color format for display and use across various devices and applications.
Through this series of processing steps, the adaptive enhancement of the image brightness and contrast is achieved while maintaining the authenticity of colors and minimizing the impact of noise. While enhancing the details and sharpness, the LPIPS ensures the perceptual quality of the post-processed image, preventing over-processing.
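The sketch below illustrates the Unsharp Masking formula together with the LPIPS gate; the blur sigma, the enhancement coefficient, the use of the lpips Python package with an AlexNet backbone, and the fallback behavior when the gate fails are all assumptions for illustration.

```python
# Illustrative USM sharpening with an LPIPS perceptual gate; parameter values are assumed.
import cv2
import numpy as np
import torch
import lpips

_lpips_fn = lpips.LPIPS(net='alex')  # learned perceptual similarity metric

def unsharp_mask(bgr, sigma=3.0, lam=1.5):
    blurred = cv2.GaussianBlur(bgr, (0, 0), sigma)
    # I_sharp = I + lambda * (I - I_blur), implemented as (1 + lambda) * I - lambda * I_blur
    return cv2.addWeighted(bgr, 1.0 + lam, blurred, -lam, 0)

def _to_lpips_tensor(bgr):
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0  # NCHW in [-1, 1]

def sharpen_with_gate(original_bgr, enhanced_bgr, lpips_max=0.4):
    sharpened = unsharp_mask(enhanced_bgr)
    with torch.no_grad():
        d = _lpips_fn(_to_lpips_tensor(original_bgr), _to_lpips_tensor(sharpened)).item()
    # Accept the sharpened result only if it stays perceptually close to the original
    return sharpened if d < lpips_max else enhanced_bgr
```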
2.4. EfficientNet
The EfficientNet model [42], developed by the Google AI team, represents an innovative convolutional neural network architecture that has led a new paradigm in scalable network design. The innovation of this model resides in its application of a systematic method of network scaling known as compound scaling. This method diverges from traditional network scaling practices that typically only enhance the depth or width of the network; EfficientNet employs a fixed scaling coefficient to extend the network’s depth, width, and the resolution of the input image in a balanced manner.
In this research, the base network model EfficientNet-B0 from the EfficientNets series was selected for analysis. EfficientNet-B0 is an efficient convolutional neural network (CNN) designed by utilizing the compound scaling method. This compound scaling method of EfficientNet enables the generation of a range of models, from EfficientNet-B0 to B7, each providing a set of capabilities that achieve a harmonious balance between computational efficiency and predictive accuracy.
EfficientNet-B0 has been optimized under predefined resource constraints through an automated neural architecture search (NAS) process, aimed at finding the optimal balance between efficiency and accuracy.
The core structure of EfficientNet-B0 is the Mobile Inverted Bottleneck Convolution (MBConv) module, which incorporates the attention mechanism from the Squeeze-and-Excitation Network (SENet) [43]. Upon its introduction, SENet achieved the highest accuracy on the ImageNet dataset, highlighting its effectiveness. The MBConv [44] module, also a product of Neural Architecture Search (NAS), is similar in structure to depthwise separable convolution.
These innovations underscore the effectiveness of the EfficientNet architecture in creating models that not only push the boundaries of accuracy in image classification tasks but also do so with considerations for efficiency that make them practical for a wide range of computing environments.
Within the MBConv module, as shown in Figure 6, a 1 × 1 pointwise convolution is first executed to expand the channel dimension according to the expansion ratio. This is followed by a k × k depthwise convolution (3 × 3 or 5 × 5 in EfficientNet-B0). If the squeeze-and-excitation operation (SE module) is required, it is implemented subsequent to the depthwise convolution. The final part of the module is another 1 × 1 pointwise convolution that restores the original channel dimensions. Additionally, the MBConv module integrates dropout connectivity and input skip connections, which have effectively shortened training times and enhanced the overall performance of the model. A structural sketch of this module is given below.
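The following compact PyTorch sketch mirrors the structure just described (1 × 1 expansion, depthwise convolution, squeeze-and-excitation, 1 × 1 projection, and a skip connection); it is illustrative and does not reproduce torchvision’s exact EfficientNet implementation.

```python
# Illustrative MBConv block; layer sizes follow the description above, not any specific release.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduced):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, reduced, 1)
        self.fc2 = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)              # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s                                       # excite: channel-wise reweighting

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, expand_ratio=6):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        self.expand = nn.Sequential(                       # 1 x 1 pointwise expansion
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU())
        self.depthwise = nn.Sequential(                    # k x k depthwise convolution
            nn.Conv2d(mid, mid, kernel_size, stride, kernel_size // 2,
                      groups=mid, bias=False), nn.BatchNorm2d(mid), nn.SiLU())
        self.se = SqueezeExcite(mid, max(1, in_ch // 4))   # squeeze-and-excitation attention
        self.project = nn.Sequential(                      # 1 x 1 pointwise projection
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.project(self.se(self.depthwise(self.expand(x))))
        return x + out if self.use_skip else out           # input skip connection when shapes match
```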
The architecture of EfficientNet-B0 comprises 16 Mobile Inverted Bottleneck Convolution (MBConv) modules, 2 convolutional layers, 1 global average pooling layer, and 1 classification layer. Its structural diagram in Figure 7 displays these components, where different colors represent different stages within the network to facilitate the distinction and understanding of the functions and organization of each part. Specifically, black layers represent the initial convolutional layer and the first MBConv layer, responsible for the initial feature extraction and down-sampling of the input image. Blue layers denote the MBConv layers with a kernel size of 3 × 3, focusing on efficient feature extraction and processing through depthwise separable convolutions. Red layers indicate the MBConv layers with a kernel size of 5 × 5, providing a larger receptive field to capture more complex patterns and features in the image. The use of different colors helps to visually distinguish between the various types of layers and their specific roles within the network architecture. Across multiple standard image recognition benchmarks, EfficientNet has demonstrated significant performance, particularly achieving state-of-the-art accuracy on the ImageNet dataset while maintaining lower computational complexity and a lower number of parameters. This efficiency makes EfficientNet particularly advantageous for application in resource-constrained small devices, enabling high-performance operation within limited computational resources.
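In practice, the EfficientNet-B0 backbone can be adapted to the four feeding-intensity classes by replacing its classification head, as in the hedged torchvision sketch below; the exact fine-tuning configuration of this study may differ.

```python
# Sketch: load ImageNet-pretrained EfficientNet-B0 and swap in a 4-class head.
import torch.nn as nn
from torchvision import models

def build_efficientnet_b0(num_classes=4):
    model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
    in_features = model.classifier[1].in_features      # 1280 for EfficientNet-B0
    model.classifier[1] = nn.Linear(in_features, num_classes)
    return model
```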
3. Experiments and Discussion
3.1. Training Details
To evaluate the classification results of MIPS-EfficientNet, extensive experiments were conducted on the fish feeding behavior dataset. The MIPS-EfficientNet algorithm was compared with other advanced algorithms such as ResNet-18, ResNet-50, and ResNeXt50. The dataset was divided into training, validation, and testing sets in a 7:2:1 ratio. Each model was trained on the training set and adjusted using the validation set. Finally, the accuracy (acc) and prediction time were compared on the testing set.
For the training of the classification networks, the input image size was set to 224 × 224 × 3 (RGB). This paper adopted the widely used Stochastic Gradient Descent (SGD) optimizer, which has shown commendable performance across many deep learning tasks, despite potentially requiring meticulous hyperparameter tuning. The hyperparameter settings were as follows: maximum iterations, max_epoch = 100; initial learning rate, lr0 = 0.004; decay factor, decay_factor = 0.1; momentum = 0.65; weight decay rate, weight_decay = 0.001; and log interval, log_interval = 10. The learning rate adjustment milestones were set at milestones = [25,35]; when the training reaches the specified epochs (here, the 25th and 35th epochs), the learning rate is adjusted according to pre-established rules. To accelerate training, ImageNet pre-trained weights for ResNet-18, ResNet-50, ResNeXt50, and EfficientNet were used. The pre-trained weights of these models can be obtained from “https://github.com/pytorch/vision/tree/main/torchvision/models (accessed on 15 August 2023)”.
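The optimizer and learning-rate schedule listed above correspond to the following PyTorch configuration; this is a minimal sketch, and details such as the data loaders and the training loop are omitted.

```python
# Sketch of the stated hyperparameters: SGD, lr0 = 0.004, momentum = 0.65,
# weight_decay = 0.001, milestones = [25, 35], decay factor = 0.1.
import torch
from torchvision import models

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 4)  # 4 feeding-intensity classes

optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.65, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[25, 35], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Per epoch: train on the training set, validate, then step the scheduler
# for epoch in range(100):
#     train_one_epoch(model, optimizer, criterion, train_loader)
#     scheduler.step()
```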
It is well known that different hyperparameter settings can significantly impact the performance of deep learning models. After a series of experiments in this study, the aforementioned hyperparameters yielded models that were satisfactory for the classification of fish feeding behavior.
The experiments were conducted on a Windows system by utilizing the PyTorch framework, Python version 3.8, and CUDA version 11.7. An NVIDIA 2080 Ti GPU was used in the experiments.
3.2. Evaluation Metrics
In classification tasks, commonly used evaluation metrics include accuracy, precision, recall, and F1 score. Among these, accuracy is a key indicator of the overall performance of an algorithm, reflecting the proportion of samples correctly classified to the total number of samples. When dealing with datasets with imbalanced classes, precision and recall become more critical tools for performance assessment. Precision refers to the proportion of samples correctly predicted as positive out of the total samples predicted as positive, while recall is the proportion of samples correctly predicted as positive out of the total actual positive samples. The F1 score is the harmonic mean of precision and recall, providing a comprehensive evaluation of these two metrics. These indicators collectively assist us in thoroughly understanding and evaluating the performance of classification models. The definitions of these metrics are as follows:

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ Precision = \frac{TP}{TP + FP} $$

$$ Recall = \frac{TP}{TP + FN} $$

$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} $$

where TP represents the number of true positive samples, TN denotes the number of true negative samples, FP stands for the number of false positive samples, and FN refers to the number of false negative samples.
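For reference, these metrics can be computed with scikit-learn as sketched below; the use of macro averaging over the four classes is an assumption, since the averaging scheme is not stated in the text.

```python
# Sketch: computing the four metrics from ground-truth and predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```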
3.3. Performance of Different Training Strategies
When analyzing and comparing the impacts of different optimizers (AdamW and SGD) combined with different loss functions (Focal Loss and Cross-Entropy Loss) on neural network performance, two main perspectives are considered: first, the effect of each optimizer and loss function combination on the model’s fitting ability on the training set, and second, the effect of these combinations on the model’s generalization ability on the validation set. The detailed experimental outcomes are presented in Table 2.
Initially, when considering the use of the AdamW optimizer, its combination with Focal Loss shows only moderate performance on the training set (79.85% training accuracy) and limited generalization capability on the validation set (79.10% validation accuracy). This may indicate that the model, under this combination, struggles to effectively learn key features of the data, or that Focal Loss is not suited for this dataset. Particularly in datasets where categories are relatively balanced, using Focal Loss might not be the best choice. However, when AdamW is combined with Cross Entropy Loss, there is a significant improvement in training accuracy to 99.21%, with validation accuracy reaching 91.04%. This suggests that Cross Entropy Loss is more appropriate for model learning in this scenario. Nonetheless, higher training accuracy with a comparatively lower validation accuracy could imply an overfitting issue, where the model may have learned specific features of the training data too well, lacking generalization capability on unseen data.
On the other hand, when using the SGD optimizer, its combination with Focal Loss performs exceptionally well on the training set (99.74% training accuracy) and also maintains high accuracy on the validation set (90.30%). This indicates that the combination of SGD with FL can effectively handle issues of class imbalance while maintaining good generalization ability. Even more notably, the combination of SGD with Cross Entropy Loss performs the best, achieving 100% training accuracy and validation accuracy of up to 97%. This combination not only achieves perfect fit on the training set but also exhibits excellent generalization ability on the validation set, suggesting that for this specific neural network and dataset, SGD combined with Cross Entropy Loss is the most effective combination.
These results underscore the importance of choosing the right optimizer and loss function suitable for the specific dataset and problem. While AdamW is a newer optimizer that may offer advantages in handling sparse data, in some cases, the traditional SGD optimizer, when combined with the appropriate loss function, may yield better results. This indicates that even advanced optimizers and loss functions need to be adjusted and optimized for specific applications. Moreover, it also shows that the choice of optimizer and loss function can significantly affect the model’s learning capability and generalization performance across different training and validation settings.
3.4. A Comparison with Other CNNs on the Testing Set
To evaluate the performance of the MIPS-EfficientNet algorithm, we conducted a series of experiments on the testing set, comparing it with other well-known convolutional neural networks (CNNs) including EfficientNet, ResNet-18, ResNet-50, and ResNeXt50. The comparison focused on several key metrics: test accuracy (Test-Acc), recall, precision, and F1 score.
As shown in Table 3, optimization with the Multi-Step Image Pre-enhancement Strategy (MIPS) demonstrably enhanced the performance metrics of both EfficientNet and the ResNet series of models. Specifically, the MIPS-enhanced EfficientNet model showed significant improvements in precision, recall, and F1 score, with the F1 score markedly increasing from 80.68% to 93.78%. The F1 score, being the harmonic mean of precision and recall, more comprehensively reflects the model’s performance in handling positive classes, such as target categories of interest. This substantial increase suggests that the original design of EfficientNet, while prioritizing efficiency, may have compromised accuracy to some extent. Although EfficientNet boasts rapid response and shorter training periods, its performance was initially inferior to that of the ResNet models before MIPS optimization.
From Table 3, it can be seen that there is no significant difference in training time between MIPS-EfficientNet and EfficientNet, with both requiring 1.595 h. This is because MIPS is an image pre-enhancement strategy whose preprocessing steps are completed before model training and therefore have no impact on the training time. Compared to the other models, however, MIPS-EfficientNet has a clear advantage: it matches the training time of the original EfficientNet while markedly improving test accuracy, recall, precision, and F1 score, all at a relatively low training cost. This indicates that the MIPS optimization strategy significantly enhances model performance without increasing training time.
After the enhancement with MIPS technology, the EfficientNet model not only improved significantly in performance, but in certain cases, its F1 score even surpassed that of some of the ResNet models. This not only validates the effectiveness of the MIPS optimization technique but also highlights the potential of the EfficientNet model when augmented with appropriate pre-processing strategies. Additionally, the ResNet models also exhibited performance improvements following MIPS pre-processing, further verifying the effectiveness of the MIPS pre-enhancement strategy and underscoring the general applicability and practicality of MIPS technology across multiple deep learning frameworks. These findings provide valuable insights for model selection in practical applications, especially when there is a need to balance limited computational resources with high performance, positioning MIPS pre-processing as a key strategy for balancing these demands.
3.5. Accuracy across Different Categories
To delve deeper into the limitations of the original EfficientNet model in the classification task of fish feeding behavior images, this study compared the confusion matrices of the unoptimized EfficientNet model and the MIPS-enhanced model (MIPS-EfficientNet).
Figure 8a displays the classification efficacy of the original EfficientNet model, which accurately recognizes most categories. However, it exhibits a higher rate of misclassification, particularly in the category corresponding to subtle feeding behavior, likely due to the model’s over-sensitivity to background features in the images.
Moreover, as revealed in Figure 8b, the MIPS-EfficientNet model, after the introduction of MIPS, shows improvements primarily in the recognition of the fourth category, while performance for the other categories remains stable or changes only slightly. This indicates that the MIPS enhancement strategy effectively optimizes the model’s accuracy in classifying fish feeding behavior, especially for subtle behavioral distinctions. By improving the quality of the input images and thus the model’s feature extraction and representation, MIPS diminishes the interference of background noise on classification decisions. Consequently, it enhances the model’s ability to recognize fish behavior against complex backgrounds while improving classification accuracy. These findings affirm the importance and efficacy of the MIPS module in enhancing the performance of image classification tasks in complex underwater environments.
The MIPS-EfficientNet model demonstrated a significant improvement in accuracy, reaching 97.00%, which is substantially higher than the EfficientNet model’s 88.81%. Accuracy represents the overall proportion of correct predictions, indicating that MIPS-EfficientNet is more precise in its overall predictions. MIPS-EfficientNet also showed improvement in recall, achieving 93.56%, compared to EfficientNet’s recall of 82.35%. Recall measures the model’s ability to correctly identify positive instances; thus, MIPS-EfficientNet performs better in correctly identifying instances of each category. In terms of precision, MIPS-EfficientNet also surpassed EfficientNet, with scores of 94.10% and 87.27%, respectively. Precision refers to the proportion of instances predicted as positive that are actually positive, with higher precision indicating fewer false positives (misreports). Lastly, MIPS-EfficientNet significantly improved its F1 score, reaching 93.78%, compared to EfficientNet’s F1 score of 80.68%. The F1 score is the harmonic mean of precision and recall, indicating that MIPS-EfficientNet is more balanced and superior in overall performance. In summary, the MIPS-EfficientNet model shows significant improvements across all primary performance metrics. These metrics suggest that compared to the original EfficientNet model, MIPS-EfficientNet has made progress in reducing misjudgments (improving precision), enhancing the identification rate of positive instances (improving recall), and achieving a better balance between the two (improving the F1 score). These improvements may be attributed to optimizations in the model structure, training process, or data handling by MIPS-EfficientNet. Therefore, when selecting a model for practical application, based on these indicators, MIPS-EfficientNet surpasses the original EfficientNet model in reducing misjudgments, enhancing correct identification rates, and maintaining a good balance between these aspects.
In the process of delving deeper into the aforementioned phenomena, this study further employed t-distributed Stochastic Neighbor Embedding (t-SNE) [45] to visualize the output of the model’s final convolutional layer. As depicted in Figure 9, the impact of the two models on the distribution of the feature space is clearly observable.
Figure 9 provides a t-SNE comparative visualization of feature vectors extracted by the EfficientNet and MIPS-EfficientNet models. In these representations, samples from various categories are encoded in different colors to denote their class: none (yellow), weak (blue), medium (green), and strong (red). This color-coding facilitates the observation of the models’ capabilities to cluster similar features within a three-dimensional feature space. Notably, MIPS-EfficientNet demonstrates a more distinct clustering of points compared to EfficientNet, indicating enhanced capability in distinguishing between categories.
Within the feature space, the distances between categories, represented by unique colors, are pronounced, indicating the model’s effective discriminative ability to differentiate between sample categories. The spatial separation between categories, particularly the discernible division between the strong and medium classes, signifies the superior discriminative power of the MIPS-EfficientNet model. Such a clear demarcation of classes is crucial for precise classification tasks, especially in datasets characterized by complexity or a high degree of overlap.
Furthermore, it can be inferred that the MIPS-EfficientNet model places greater emphasis on the internal consistency of categories and the distinctions between them during the learning process. This is evidenced by the cohesiveness of the category clusters and their separation. Similarly, it can be anticipated that this attribute may lead to a lower classification error rate, as the model has understood the differences between categories on a more granular level.
Taking these observations into account, it can be posited that the MIPS-EfficientNet model may surpass traditional EfficientNet models in specific applications such as the recognition of fish behaviors. The t-SNE visualization in Figure 9 illustrates the potential of MIPS-EfficientNet in handling complex datasets for refined classification tasks, and its efficiency in feature extraction warrants further exploration and application.
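A hedged sketch of such a t-SNE visualization is given below; the perplexity, the two-dimensional embedding, and the plotting details are assumptions for illustration.

```python
# Sketch: project pooled feature vectors with t-SNE and color points by class.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, class_names=("none", "weak", "medium", "strong")):
    # features: (N, D) array of feature vectors; labels: (N,) integer class array
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    for c, name in enumerate(class_names):
        mask = labels == c
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=name)
    plt.legend()
    plt.show()
```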
4. Conclusions
This paper presents an innovative image pre-enhancement strategy designed to optimize the precision and efficiency of fish feeding behavior classification. Initially, a multi-step image pre-enhancement protocol is adopted, encompassing three main processing phases: Multi-Scale Retinex with Color Restoration (MSRCR), Multi-Metric Driven Contrast Limited Adaptive Histogram Equalization (mdc), and Unsharp Masking. This suite of pre-enhancement actions effectively rectifies common image quality issues, such as surface reflections, low contrast, and indistinct detail features. Subsequent to image pre-enhancement, a lightweight EfficientNet neural network model is further employed to accomplish the classification of fish feeding behaviors.
This study involves integrating the Multi-Step Image Pre-enhancement Strategy (MIPS) into the EfficientNet model and comparing its performance against the baseline, unprocessed EfficientNet model. The analysis focuses on the accuracy (acc) after model training and the time required for prediction. Moreover, to assess the universality of the MIPS strategy and its impact on model performance, the strategy is also applied to other leading deep learning frameworks, such as ResNet-18, ResNet-50, and ResNeXt50. By comparing the accuracy and prediction time of these frameworks after applying MIPS, this research aimed to comprehensively evaluate the potential of the preprocessing strategy in enhancing the efficiency and effectiveness of various deep learning models.
The test results indicate that the proposed Multi-Step Image Pre-enhancement (MIPS) module significantly enhances the performance of various convolutional neural network (CNN) models in terms of classification accuracy. Particularly in comparison with the current advanced ResNet series models, this approach not only improves the classification accuracy but also substantially reduces the training times of the models. These achievements demonstrate that the method introduced in this paper is both effective and efficient in processing fish feeding behavior classification tasks.
To validate the efficiency of the algorithm, we evaluated its practical application in the fish feeding process during experiments. By comparing it with other well-known convolutional neural networks (CNNs), we found that the MIPS-enhanced models performed exceptionally well on several key metrics. Below are some practical application examples.
In terms of cost reduction, by integrating the algorithm to optimize the feeding process, feed waste was reduced from 25% to 10%, saving the fish farm approximately 500 kg of feed per month, which is equivalent to monthly savings of USD 1500 (assuming a feed cost of USD 3 per kilogram). Regarding feeding efficiency, the experiment revealed that manual feeding took about 2 h per session, while the automated feeding system, after implementing the algorithm, completed the process in 1.5 h, increasing feeding efficiency by 25%, allowing staff to focus on other critical tasks. In terms of productivity improvement, an experiment monitoring two ponds showed that the pond using the algorithm had a 10% higher fish growth rate and a 15% improvement in the feed conversion ratio (FCR), resulting in an average weight increase of 200 g per fish, enhancing market value and profitability. Regarding the environmental impact, the optimized feeding process reduced the occurrence of feed waste sinking to the bottom of the pond, thus reducing water pollution and the need for frequent pond cleaning, benefiting the environment and lowering maintenance costs for the fish farm. These practical benefits demonstrate the real-world feasibility and effectiveness of the algorithm in improving the fish feeding process, proving its potential as a valuable tool in modern fish farming operations.
In this study, while the proposed Multi-Step Image Pre-Enhancement (MIPS) module combined with the lightweight EfficientNet network achieved significant outcomes in classifying fish feeding behaviors, particularly in enhancing classification accuracy and reducing the training time compared to ResNet series models, a critical limitation was observed. Specifically, the MIPS, as a multi-step image pre-enhancement strategy, incurs longer processing times during the prediction phase. This implies that despite a significant reduction in the overall training time and an improvement in classification accuracy, the prediction speed did not exhibit a competitive advantage.
To address this issue, future research directions may focus on implementing the MIPS steps themselves through convolutional neural network (CNN) models. Such an approach is anticipated to further optimize the overall performance of the model, especially in reducing the time required for the prediction stage. Through such optimization, it is expected to achieve faster prediction speeds while maintaining or even improving the classification accuracy, thereby providing a more efficient solution for practical applications.
Although the proposed method (MIPS-EfficientNet) shows significant improvements over traditional convolutional neural networks (CNNs), it has only been compared with a single class of CNNs. However, it is noteworthy that Vision Transformer-based models currently achieve the highest performance on the ImageNet benchmark, as indicated by the rankings on Papers With Code. It should be noted that the dataset used in this study is relatively small. Vision Transformer models typically perform better on large-scale datasets but may not perform as well on small datasets compared to traditional CNNs. Therefore, despite the superior performance of Vision Transformers on some large-scale datasets, future work should include a comparison of MIPS-EfficientNet with Vision Transformer models on small datasets to comprehensively evaluate their performance.