Article

An Automated Framework Based on Deep Learning for Shark Recognition

Nhat Anh Le, Jucheol Moon, Christopher G. Lowe, Hyun-Il Kim and Sang-Il Choi

1 Department of Computer Engineering and Computer Science, California State University, Long Beach, CA 90840, USA
2 Department of Biological Sciences, California State University, Long Beach, CA 90840, USA
3 Department of Computer Science and Engineering, Dankook University, Yongin 16890, Korea
4 Department of Computer Engineering, Dankook University, Yongin 16890, Korea
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
J. Mar. Sci. Eng. 2022, 10(7), 942; https://doi.org/10.3390/jmse10070942
Submission received: 28 May 2022 / Revised: 3 July 2022 / Accepted: 5 July 2022 / Published: 9 July 2022
(This article belongs to the Special Issue Advances in Autonomous Underwater Robotics Based on Machine Learning)

Abstract

The recent progress in deep learning has given rise to a non-invasive and effective approach for animal biometrics. These modern techniques allow researchers to track animal individuals on a large-scale image database. Typical approaches are suited to a closed-set recognition problem, which is to identify images of known objects only. However, such approaches are not scalable because they misclassify images of unknown objects. To recognize the images of unknown objects as ‘unknown’, a framework should be able to deal with the open-set recognition scenario. This paper proposes a fully automatic, vision-based identification framework capable of recognizing shark individuals, including those that are unknown. The framework first detects and extracts the shark from the original image. After that, we develop a deep network to transform the extracted image into an embedding vector in latent space. The proposed network consists of the Visual Geometry Group-UNet (VGG-UNet) and a modified Visual Geometry Group-16 (VGG-16) network. The VGG-UNet is utilized to detect shark bodies, and the modified VGG-16 is used to learn embeddings of shark individuals. For the recognition task, our framework learns a decision boundary for each shark included in the training phase using a one-class support vector machine (OSVM) fitted to a few of that shark’s embedding vectors; it then determines whether a new shark image belongs to a known shark individual. Our proposed network can recognize shark individuals with high accuracy and can effectively deal with the open-set recognition problem for shark images.

1. Introduction

The ability to reliably identify and track individuals within a population or habitat is essential for understanding population dynamics, behavioral patterns and ecosystem function. Historically, animal marking (e.g., branding, notching, external tags), which began in the 1700s, has been used as a tool for quantifying behavior and identifying individuals previously sampled and counted. In the 1960s, electronic tags such as acoustic, radio and satellite transmitters were developed that could be attached to individuals, allowing researchers to track their movement and habitat use [1]. However, these human-made markings can fade and tags can be shed over time, making them less valuable for population estimation and long-term behavioral-pattern interpretation [2]. While applying these technologies to large, highly mobile species, particularly aquatic species, provided some of the only approaches for studying populations, they require the capture of individuals and have been limited in sample size due to costs. With improved camera technology, other techniques such as photogrammetry have grown in popularity, providing researchers with a permanent record for identifying and counting individuals with unique natural marks, scars and features that are less likely to fade or be lost over time [3,4]. While this technology is now becoming commonplace and datasets continue to grow, the processing, analysis and categorization of unique features have been time consuming and labor intensive [5]. Recently, advances in computer-vision techniques combined with animals’ unique characteristics, known as animal biometrics, have introduced a more effective and non-invasive approach to recognizing animal individuals. These approaches rely on large-scale datasets and involve manual or semi-manual work to identify individuals [6,7,8].
A disadvantage of the above approaches is that they require feature engineering. For example, users have to select reference points for each image manually [7,9]. Therefore, recent research has focused on automating the recognition process to locate, identify and classify images. In general, an automated recognition process includes methods to detect animals against backgrounds and to extract unique characteristics of the detected animals. For example, a deep-learning network model was able to extract unique features from images of cattle muzzles, and a matching scheme based on the extracted features was used for cattle recognition [10,11]. This approach can reduce human biases and help us conduct research at a large scale. When it comes to shark recognition, there have been approaches using sharks’ dorsal fins as the animals’ unique characteristics. For example, Hughes and Burghardt [12] developed a partially automated contour-based visual ID system for shark biometrics. They focused on the contour information of the sharks as their biometric signature. First, they developed an open-contour stroke model to achieve robust fin detection. Second, they used scale-space selective fingerprinting to encode dorsal-fin patterns individually. Finally, they proposed a non-linear model for individual shark recognition.
This paper aims to provide a fully automated, vision-based framework to recognize shark individuals from underwater images. Our framework has two main components: shark detection and shark recognition. For the shark-detection phase, we develop a system to capture the shark body from underwater images. Given a training set of images with labeled shark-body regions, we train a network capable of predicting the shark-body region in a new image. Once the trained network predicts the region in a new image, we crop out the region and remove the background using Gaussian smoothing. These images are the input for the next phase. For the shark-recognition phase, we design a deep neural network that maps an image of the lateral view of a shark’s body to an embedding vector in latent space. To train the network, we use the triplet loss function, which enforces that the distances between embedding vectors of the same shark are smaller than the distances between embedding vectors of different sharks. The embedding vector is a representation of each unique shark.
To adapt our recognition system to the open-set recognition scenario, we split the data into training, known test, and unknown test sets. The training set includes images of sharks and their labels, which are provided to train the network with the triplet loss. After the training phase, a few ($k = 20$) images for every shark in the known test set are selected and their embedding vectors are stored. Using these embedding vectors as inputs to the OSVM [13], we compute a decision function from the $k$ embedding vectors of each known shark. During testing, an image from either the known test set (excluding the ones used to build the decision functions) or the unknown test set is given to the system, and the system should be able to answer whether the image belongs to a shark included in the known test set or to a never before seen shark. The system achieves this by representing an image as an embedding vector and checking whether it is within the decision boundary of the closest known shark’s centroid.
Our contributions are as follows. (1) We develop an automated system to capture shark-body images from underwater images. (2) We propose a network that transforms a shark-body image into an embedding vector, such that embedding vectors of the same shark are closer together in latent space than those of different sharks. (3) We introduce a system that recognizes images of known individuals as those individuals and images of unknown individuals as ‘unknown’ with 81% accuracy.

2. Materials and Methods

2.1. Shark Recognition

We propose an encoder architecture. The encoder $f(\cdot)$ maps images of sharks to embedding vectors in a latent space such that the distance between embeddings of the same shark is smaller than the distance between embeddings of different sharks. The encoder includes the VGG-UNet network and the VGG-16 network. The encoder maps an image $s$ to an embedding vector $\mathbf{v}$:

$$\mathbf{v} = f(s).$$

The dimension of the embedding vector is 256 and the vectors are normalized, i.e., $\mathbf{v} \in \mathbb{R}^{256}$ and $\|f(\cdot)\|_2 = 1$. Let $s_{i,a}$ and $s_{j,a}$ be two images ($i \neq j$) of the shark $id = a$, and let $s_{k,b}$ be an image of the shark $id = b$. Adapting the triplet loss [14], the loss of the proposed network is defined by

$$L = \|v_{i,a} - v_{j,a}\|_2^2 - \|v_{i,a} - v_{k,b}\|_2^2 + \alpha,$$

where $v_{i,a} = f(s_{i,a})$, $v_{j,a} = f(s_{j,a})$, $v_{k,b} = f(s_{k,b})$, and $\alpha$ is a margin. We determined the margin $\alpha$ by comparing the performance of the system with different margin values. We evaluated the system with $\alpha = 0, 0.5, 1.0, 1.5$, and $2.0$, and set $\alpha = 1.0$ because the system performed best with that value. The loss encourages the Euclidean distance between $v_{i,a}$ and $v_{j,a}$ to be smaller than the distance between $v_{i,a}$ and $v_{k,b}$ for all possible triplets in the training set. The encoder architecture is illustrated in Figure 1.
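As a concrete illustration, the following is a minimal sketch of this triplet loss in PyTorch (the framework choice is our assumption; the paper does not name one). The hinge $[\cdot]_+$ follows the FaceNet formulation [14] from which the loss is adapted.

```python
# Minimal sketch of the triplet loss above, assuming PyTorch and
# L2-normalized 256-d embeddings; the hinge follows FaceNet [14].
import torch
import torch.nn.functional as F

def triplet_loss(v_anchor, v_positive, v_negative, alpha=1.0):
    """v_anchor, v_positive: embeddings of two images of the same shark;
    v_negative: embedding of an image of a different shark."""
    d_pos = (v_anchor - v_positive).pow(2).sum(dim=1)  # ||v_ia - v_ja||_2^2
    d_neg = (v_anchor - v_negative).pow(2).sum(dim=1)  # ||v_ia - v_kb||_2^2
    return F.relu(d_pos - d_neg + alpha).mean()        # hinge at zero, alpha = 1.0
```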

2.1.1. VGG-UNet

The input of the network is a three-channel (RGB) image of shape 224 × 224 [15]. As shown in Figure 2, there are two main paths in the network: the contracting path (left side) and the expansive path (right side). The contracting path consists of four blocks; each block includes two 3 × 3 convolutional layers of stride 2 without padding, followed by a rectified linear unit (ReLU) and a 2 × 2 max-pooling operation. At each block, the number of channels is doubled, starting from 64. The expansive path includes three blocks; each block consists of a 2 × 2 up-convolution that halves the number of channels, followed by a concatenation with the corresponding cropped feature map from the contracting path and two 3 × 3 convolution operations. At the final layer, we use a 1 × 1 convolution operation to obtain the two-channel segmentation map from the 64-channel feature map. The output of the network is a two-channel map of shape 112 × 112.
We trained the network to partition images into two regions: the shark and the surrounding area. The training data consist of images captured from the videos and the corresponding segmentation maps, which were manually labeled by humans. Once the network was trained, we predicted masks using our images captured from the videos as input. During the test phase, an image of arbitrary shape was resized to 224 × 224. For each captured image, the network’s output is a tensor of shape 112 × 112 × 2. For $0 \le i < 112$ and $0 \le j < 112$, the probability that the pixel is part of the shark region is denoted by $seg[i][j][1]$, and the probability that the pixel belongs to the background is denoted by $seg[i][j][0]$. From this, the annotation mask $pr$ was created such that $pr[i][j] = 1$ if $seg[i][j][0] \le seg[i][j][1]$ and 0 otherwise. The prediction map was then resized to 224 × 224 using nearest-neighbor interpolation. Combining the initially captured image and the annotation mask, we cropped the region occupied by the shark. We then used a Gaussian filter to generate a smooth boundary between the shark and the background in the final output image.
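The post-processing just described can be sketched as follows, assuming NumPy/OpenCV, a 224 × 224 input image, and a network output `seg` of shape 112 × 112 × 2; the Gaussian kernel size is an illustrative choice, as the paper does not report one.

```python
# Sketch of the mask post-processing: threshold, upsample, smooth, composite.
import cv2
import numpy as np

def extract_shark(image, seg):
    """image: 224x224x3 uint8 frame; seg: 112x112x2 per-pixel probabilities."""
    # pr[i][j] = 1 where the shark-channel probability dominates.
    mask = (seg[:, :, 0] <= seg[:, :, 1]).astype(np.uint8)
    # Upsample the 112x112 mask to 224x224 with nearest-neighbor interpolation.
    mask = cv2.resize(mask, (224, 224), interpolation=cv2.INTER_NEAREST)
    # Gaussian filter softens the shark/background boundary (kernel size assumed).
    soft = cv2.GaussianBlur(mask.astype(np.float32), (15, 15), 0)
    # Keep the shark region and fade the background toward black.
    return (image.astype(np.float32) * soft[..., None]).astype(np.uint8)
```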

2.1.2. VGG-16 Network

As our dataset is relatively small, we used the pre-trained VGG-16 as our base model. The pre-trained VGG-16 is a deep convolutional neural network (CNN) [16] that has shown strong results on large-scale image-recognition tasks. The network was trained on the ImageNet database [17], which includes 1.3 million training images of 1000 classes [18]. The pre-trained VGG-16 includes five blocks of convolutional layers and three fully connected layers. We adopted the pre-trained weights of the five blocks of convolutional layers and added four fully connected layers to extract the features of the sharks. The input to the network is a three-channel image of shape 224 × 224. The network includes five blocks of convolutional layers; each block includes 3 × 3 convolutional layers of stride 1 with padding 1, followed by a ReLU and a 2 × 2 max-pooling operation. At each block, the number of channels is doubled, starting from 64. The first two blocks contain two convolutional layers each, and the last three blocks contain three each. The output of the last max-pooling layer is a tensor of shape 7 × 7 × 512. To extract the unique features of the shark bodies, we flattened this output into a 25,088-dimensional feature vector, then added four fully connected layers with 4096, 1024, 512, and 256 units, in that order. In addition, we added a dropout layer with a dropout rate of 0.3 after each fully connected layer except the last one. Figure 3 shows the proposed network model.
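A minimal Keras sketch of this modified VGG-16 encoder is given below; the framework and layer names are our assumptions, since the paper specifies only the architecture.

```python
# Sketch of the encoder: pre-trained VGG-16 conv blocks plus four new FC layers.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))      # five pre-trained conv blocks

x = layers.Flatten()(base.output)            # 7 x 7 x 512 -> 25,088 features
for units in (4096, 1024, 512):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.3)(x)               # dropout after each FC layer ...
x = layers.Dense(256)(x)                     # ... except the last (256-d embedding)
x = layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=1))(x)  # ||f(.)||_2 = 1

encoder = Model(base.input, x)
```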

2.1.3. Few-Shot Learning

Figure 4 shows the shark-recognition process using the trained model. For $n \ge 1$, let $\{s_{i,a} \mid 1 \le i \le n\}$ be the set of randomly selected images of the shark $id = a$ in the known dataset, and let $\{v_{i,a} = f(s_{i,a}) \mid 1 \le i \le n\}$ be the set of corresponding embedding vectors produced by the trained model. For each shark $id = a$ in the known dataset, the system computes the centroid of the $n$ embedding vectors, defined by $M_a = \frac{1}{n} \sum_{i=1}^{n} v_{i,a}$. The system learns decision functions in the latent space by using the OSVM algorithm [13] for all shark individuals. The algorithm takes $\{v_{i,a} \mid 1 \le i \le n\}$ as input and solves the following optimization problem:
$$\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \alpha_i \alpha_{i'} \, \mathcal{K}(v_{i,a}, v_{i',a}) \quad \text{subject to:} \quad 0 \le \alpha_i \le \frac{1}{\nu n}, \qquad \sum_{i=1}^{n} \alpha_i = 1,$$
where $\mathcal{K}(v, v') = e^{-\gamma \|v - v'\|_2^2}$ is a radial basis function (RBF) kernel, the $\alpha_i$ are Lagrange multipliers, and $\gamma$ and $\nu$ are among the hyper-parameters of the system.
Let $s_{*,u}$ be an image of an unknown shark $u$. For each known shark $id = a$, the decision function of $v_{*,u}$ is defined by $h_a(v_{*,u}) = \sum_{i=1}^{n} \alpha_i \mathcal{K}(v_{i,a}, v_{*,u}) - \rho_a$, where $\rho_a = \sum_{i=1}^{n} \alpha_i \mathcal{K}(v_{i,a}, v_{l,a})$ for any $l$ such that $0 < \alpha_l < \frac{1}{\nu n}$ and $1 \le l \le n$. Considering that an unknown shark can be either one that was included in the training dataset or one that was not, the system determines the prediction for $u$ as follows:
1. Compute $v_{*,u} = f(s_{*,u})$;
2. Find the provisional shark $p = \arg\min_a \|M_a - v_{*,u}\|_2$;
3. If $h_p(v_{*,u}) \ge \tau$, then “$u$ is recognized as $p$”;
4. Otherwise, “$u$ is not recognized as any known shark”;

where $\tau$ is a hyper-parameter of the system.
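A sketch of this recognition procedure, using scikit-learn’s OneClassSVM as the OSVM implementation [13]; the data structures and function names are illustrative, and the hyper-parameter values are the ones reported in Section 3.

```python
# Sketch of OSVM-based open-set recognition over 256-d embeddings.
import numpy as np
from sklearn.svm import OneClassSVM

GAMMA, NU, TAU = 1.9, 0.1, -0.15   # hyper-parameters reported in Section 3

def fit_known_sharks(known_embeddings):
    """known_embeddings: dict mapping shark id -> (n, 256) array of vectors."""
    models, centroids = {}, {}
    for shark_id, V in known_embeddings.items():
        centroids[shark_id] = V.mean(axis=0)                       # centroid M_a
        models[shark_id] = OneClassSVM(kernel="rbf",
                                       gamma=GAMMA, nu=NU).fit(V)  # h_a(.)
    return models, centroids

def recognize(v, models, centroids):
    """Return the recognized shark id for embedding v, or None if unknown."""
    # Step 2: provisional shark = nearest centroid in the latent space.
    p = min(centroids, key=lambda a: np.linalg.norm(centroids[a] - v))
    # Steps 3-4: accept only if v lies within the OSVM decision boundary.
    score = models[p].decision_function(v.reshape(1, -1))[0]
    return p if score >= TAU else None
```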
In addition, we implemented a simple Euclidean-distance-based recognition framework and evaluated its performance. To determine whether a test image belongs to a known individual, we compared the Euclidean distances between the test image’s embedding and the centroids of the known individuals in the latent space, with a distance threshold as a hyper-parameter. However, because its performance was lower than that of our proposed method, we do not include the comparison results here.

3. Results

Datasets and Evaluation Metric

The original videos (4k, 30 fps) were recorded using GoPro Hero 8 underwater cameras at Guadalupe Island, off the west coast of Baja, Mexico. Figure 5 shows how the original videos were collected. From the videos, 12-megapixel frame images were generated and we manually removed images that did not include shark bodies. Our dataset includes images of 13 different shark individuals (subjects).
The data were divided into three sets: training, known test, and unknown test. First, we sampled six of the 13 subjects; for each of these subjects, 75% of its images were allocated to the training set and the remaining 25% to the known test set. Second, for each subject in the known test set, we randomly selected 20 images to determine the decision boundary in latent space using the OSVM algorithm. Finally, all images of the remaining seven subjects were assigned to the unknown test set. The known test, unknown test, and training sets contain approximately 1527, 455, and 456 images, respectively. This data split was generated 20 times independently; for each split, we trained and tested the proposed method, and the averaged evaluation metrics are reported.
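For clarity, the split protocol might be implemented as in the sketch below, where `images_by_id` is an assumed dictionary mapping shark ids to lists of images; names and seed handling are illustrative.

```python
# Sketch of the data-split protocol: 6 known subjects (75%/25% train/known-test),
# 7 unknown subjects; repeated 20 times independently.
import random

def make_split(images_by_id, n_known=6, train_frac=0.75):
    ids = list(images_by_id)
    known_ids = set(random.sample(ids, n_known))
    train, known_test, unknown_test = [], [], []
    for sid in ids:
        imgs = list(images_by_id[sid])
        random.shuffle(imgs)
        if sid in known_ids:
            cut = int(train_frac * len(imgs))
            train += [(sid, im) for im in imgs[:cut]]       # 75% -> training
            known_test += [(sid, im) for im in imgs[cut:]]  # 25% -> known test
        else:
            unknown_test += [(sid, im) for im in imgs]      # all -> unknown test
    return train, known_test, unknown_test
```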
We define an image to be a true positive (TP) if it belongs to an individual in the known test set and the network correctly identifies it, and a false negative (FN) otherwise. Likewise, we define an image to be a true negative (TN) if it belongs to the unknown test set and the network correctly classifies it as such, and a false positive (FP) otherwise. We report the true positive rate $TPR = \frac{TP}{TP + FN}$, the true negative rate $TNR = \frac{TN}{TN + FP}$, and the accuracy $ACC = \frac{TP + TN}{TP + FN + TN + FP}$.
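A direct transcription of these definitions, assuming `predictions` is a list of (true id, predicted id) pairs in which None stands for ‘unknown’:

```python
# Open-set metrics: known-test images split into TP/FN, unknowns into TN/FP.
def open_set_metrics(predictions):
    TP = sum(t is not None and p == t for t, p in predictions)
    FN = sum(t is not None and p != t for t, p in predictions)
    TN = sum(t is None and p is None for t, p in predictions)
    FP = sum(t is None and p is not None for t, p in predictions)
    tpr = TP / (TP + FN)                     # over known-test images
    tnr = TN / (TN + FP)                     # over unknown-test images
    acc = (TP + TN) / (TP + FN + TN + FP)
    return tpr, tnr, acc
```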
The contour plots of ACC, TPR, and TNR as functions of the hyper-parameters $\gamma$ and $\nu$ are shown in Figure 6. All of the evaluation metrics depend on the selection of $\gamma$ and $\nu$; the selection of $\nu$ in particular is critical for the system’s performance. The hyper-parameters need to be tuned considering both TPR and TNR simultaneously: if we select $\gamma$ and $\nu$ values to obtain a high TPR, that selection in general results in a low TNR. More concretely, if the system rejects all images, then we obtain a perfect score for TNR but a zero score for TPR. Therefore, we selected $\gamma$ and $\nu$ values that minimize the difference between TPR and TNR.
The effect of $\tau$ for fixed $\gamma$ and $\nu$ is shown in Figure 7. We can see that the difference between TPR and TNR is considerably reduced when we select a $\tau$ smaller than 0. Based on this, we chose a negative $\tau$ for the decision boundary of the OSVM in the latent space. Overall, the best TPR, TNR, and ACC of the proposed system are all 81%, with the hyper-parameter setting $\gamma = 1.9$, $\nu = 0.1$, $\tau = -0.15$.

4. Discussion

To verify the system’s discriminative performance, we computed the Euclidean distances between genuine images (images of same shark) and between imposter images (images of different sharks). Figure 8A shows the genuine–imposter distributions of shark images. It shows that two distribution curves are clearly separated, which means that the proposed system distinguishes the images of already seen shark individuals.
To verify that the proposed method correctly separates images from different subjects while clustering all images from the same subject, we examined the UMAP [19] plot of embedding vectors from the known test dataset. Figure 8B shows that the proposed system can learn and distinguish the characteristics of shark images. In this study, our proposed method first maps images of sharks to high-dimensional vectors in a latent space and then sets the decision boundaries of the subjects in that space. Any latent vector that does not fall within those boundaries is recognized as an unknown subject. Considering the dimension of the latent vector, the proposed system is highly scalable in terms of the number of subjects.
Figure 9 shows the optimized images computed for a selection of units in the VGG-16 network. To examine how the activations of neurons at intermediate layers correlate with input images, we adapted the method for visualizing and interpreting neural networks by Yosinski et al. [20]. For a given input image $x$ and the activation $a$ of a neuron, we obtained an optimized image using gradient ascent. With a step size $\eta$, a single step updates the image as

$$x \leftarrow x + \eta \frac{\partial a}{\partial x}.$$
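A minimal PyTorch sketch of this update rule, using a forward hook to read one neuron’s activation; the layer, unit index, step size, and step count are illustrative.

```python
# Sketch of activation maximization [20]: gradient ascent on the input image.
import torch

def optimize_image(model, layer, unit, x, eta=1.0, steps=100):
    """Repeatedly apply x <- x + eta * da/dx for the chosen unit's activation a."""
    x = x.clone().requires_grad_(True)
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    for _ in range(steps):
        model(x)                          # forward pass fills acts["a"]
        a = acts["a"][0, unit].mean()     # scalar activation of the chosen unit
        a.backward()                      # da/dx lands in x.grad
        with torch.no_grad():
            x += eta * x.grad             # ascent step toward higher activation
            x.grad.zero_()
    handle.remove()
    return x.detach()
```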
The optimized images in Figure 9 show that some neurons in the network are correlated with the meaningful features of a shark body such as dorsal fin, tail, head, and belly. Therefore, we claim that the proposed system recognizes shark individuals using such important features of shark bodies.
Figure 10 shows visualizations of highly activated regions in input images using gradient-weighted class activation mapping (Grad-CAM) [21]. The VGG-16 was pre-trained on the ImageNet dataset, which includes several shark classes, including white shark, tiger shark, and hammerhead shark [17]. Given an input image, we forward propagate the image through the VGG-16 network and compute the probabilities of every class; when the predicted class is a shark, the gradient of the predicted class is back-propagated to the input layer to obtain the visualization. These visualizations show that VGG-16 activates on meaningful regions of the shark body.
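A compact sketch of the Grad-CAM computation for a pre-trained torchvision VGG-16; targeting `features[28]` (the last convolutional layer) is the common choice, and input preprocessing is omitted.

```python
# Sketch of Grad-CAM [21] on a pre-trained VGG-16; x is a normalized
# (1, 3, 224, 224) input tensor.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

model = vgg16(pretrained=True).eval()
target_layer = model.features[28]   # last conv layer of VGG-16

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(x):
    logits = model(x)
    cls = logits.argmax(dim=1).item()   # predicted ImageNet class (e.g., a shark)
    logits[0, cls].backward()           # back-propagate the predicted class score
    w = grads["g"].mean(dim=(2, 3), keepdim=True)  # channel weights (pooled grads)
    cam = F.relu((w * feats["a"]).sum(dim=1))      # weighted sum of feature maps
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()    # normalized heat map
```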

5. Conclusions

We proposed a fully automated, vision-based framework to recognize shark individuals. Our framework is capable of extracting shark images from snapshots of underwater videos. The framework then learns to map each shark image to an embedding vector in latent space such that the Euclidean distance between images of the same shark is smaller than that between images of different sharks. During the testing phase, the framework can identify whether a new image belongs to a shark individual encountered during the training phase or to an unknown shark, with 81% accuracy. With a database of more named shark individuals, the system could be used to recognize and trace them efficiently and at low cost.

Author Contributions

Conceptualizing and writing—review and editing: J.M. and C.G.L.; Methodology, formal analysis: J.M. and S.-I.C.; Software, investigation: N.A.L. and H.-I.K.; Data curation, validation, and visualization: N.A.L.; Writing—original draft preparation: N.A.L.; Project administration: S.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Research Foundation of Korea (NRF) Grant by the Korean Government through the MSIT under Grant 2021R1A2B5B01001412; and in part by the Ministry of Science and Information and Communication Technology (ICT) (MSIT), South Korea, under the ICT Challenge and Advanced Network of Human Resource Development (HRD) (ICAN) Program, supervised by the Institute of Information and Communications Technology Planning and Evaluation (IITP), under Grant IITP-2021-2020-0-01824.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Lowe, C.G.; Bray, R.N. Movement and activity patterns. In The Ecology of Marine Fishes; University of California Press: Berkeley, CA, USA, 2006; pp. 524–553.
  2. Silvy, N.J.; Lopez, R.R.; Peterson, M.J. Wildlife marking techniques. Tech. Wildl. Investig. Manag. 2005, 6, 339–376.
  3. Trolliet, F.; Vermeulen, C.; Huynen, M.C.; Hambuckers, A. Use of camera traps for wildlife studies: A review. Biotechnol. Agron. Soc. Environ. 2014, 18, 446–454.
  4. Brooks, E.J.; Sloman, K.A.; Sims, D.W.; Danylchuk, A.J. Validating the use of baited remote underwater video surveys for assessing the diversity, distribution and abundance of sharks in the Bahamas. Endanger. Species Res. 2011, 13, 231–243.
  5. Awad, A.I. From classical methods to animal biometrics: A review on cattle identification and tracking. Comput. Electron. Agric. 2016, 123, 423–435.
  6. Finn, C.; Duyck, J.; Hutcheon, A.; Vera, P.; Salas, J.; Ravela, S. Relevance Feedback in Biometric Retrieval of Animal Photographs. In Proceedings of the Pattern Recognition, 6th Mexican Conference, MCPR 2014, Cancun, Mexico, 25–28 June 2014; Springer International Publishing: Cham, Switzerland; pp. 281–290.
  7. Kelly, M.J. Computer-Aided Photograph Matching in Studies Using Individual Identification: An Example from Serengeti Cheetahs. J. Mammal. 2001, 82, 440–449.
  8. Coleman, T.; Moon, J. A biometric for shark dorsal fins based on boundary descriptor matching. In Proceedings of the 32nd International Conference on Computer Applications in Industry and Engineering, San Diego, CA, USA, 30 September–2 October 2019; Volume 63, pp. 63–71.
  9. Van Tienhoven, A.M.; Den Hartog, J.E.; Reijns, R.A.; Peddemors, V.M. A computer-aided program for pattern-matching of natural marks on the spotted raggedtooth shark Carcharias taurus. J. Appl. Ecol. 2007, 44, 273–280.
  10. Kumar, S.; Singh, S.K.; Singh, R.; Singh, A.K. Deep Learning Framework for Recognition of Cattle Using Muzzle Point Image Pattern. In Animal Biometrics: Techniques and Applications; Springer: Singapore, 2017; pp. 163–195.
  11. Shojaeipour, A.; Falzon, G.; Kwan, P.; Hadavi, N.; Cowley, F.C.; Paul, D. Automated muzzle detection and biometric identification via few-shot deep transfer learning of mixed breed cattle. Agronomy 2021, 11, 2365.
  12. Hughes, B.; Burghardt, T. Automated Visual Fin Identification of Individual Great White Sharks. Int. J. Comput. Vis. 2017, 122, 542–557.
  13. Schölkopf, B.; Williamson, R.C.; Smola, A.J.; Shawe-Taylor, J.; Platt, J.C. Support vector method for novelty detection. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 2000; pp. 582–588.
  14. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
  16. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
  17. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  18. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  19. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
  20. Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.; Lipson, H. Understanding neural networks through deep visualization. arXiv 2015, arXiv:1506.06579.
  21. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
Figure 1. Illustration of the encoder architecture. The encoder includes VGG-UNet and VGG-16.
Figure 2. The VGG-UNet architecture.
Figure 3. The architecture of the proposed network model. This network adapts a pre-trained VGG-16 network as a feature extractor.
Figure 4. Illustration of the shark-recognition process using the trained model. The image of a shark $s_{*,u}$ is recognized as the ‘green’ individual, but the image $s_{*,w}$ is not recognized.
Figure 5. Pictures of collecting visual data underwater.
Figure 6. ACC, TPR, and TNR as functions of $\gamma$ and $\nu$ for fixed $\tau$ = 0.1. Yellow represents the highest rates and blue the lowest.
Figure 7. Performance rates as functions of $\tau$ for fixed $\gamma$ and $\nu$. We show the results for ($\gamma$ = 1.9, $\nu$ = 0.1) and ($\gamma$ = 1.0, $\nu$ = 0.02).
Figure 8. Distributions of distances and uniform manifold approximation and projection (UMAP) visualization: (A) genuine–imposter distributions of shark images; (B) UMAP plots of representation vectors of images in the known test set. Different colors represent different shark individuals.
Figure 9. Visualization of the activation of neurons in the conv2 layer of block5 of VGG-16. The leftmost image is the input image, and the six other images are the results of optimization to maximize the activation of some neurons in the layer. The optimized images are slightly darker than the input image because they are re-scaled to visualize optimized regions saliently.
Figure 10. Visualization of activated regions using the Grad-CAM algorithm.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
