Article

Deep-Learning-Based Sequence Causal Long-Term Recurrent Convolutional Network for Data Fusion Using Video Data

Department of Human Intelligence and Robot Engineering, Sangmyung University, Cheonan 03016, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2023, 12(5), 1115; https://doi.org/10.3390/electronics12051115
Submission received: 23 January 2023 / Revised: 22 February 2023 / Accepted: 23 February 2023 / Published: 24 February 2023
(This article belongs to the Special Issue Artificial Intelligence (AI) for Image Processing)

Abstract

The purpose of AI-based schemes in intelligent systems is to advance and optimize system performance. Most intelligent systems operate on sequential data derived from such systems; real-time video data, for example, are continuously updated as a sequence so that the predictions required for efficient system operation can be made. Deep-learning-based network architectures such as long short-term memory (LSTM), data fusion, two-stream networks, and temporal convolutional networks (TCNs) are generally used for sequence data fusion to improve system efficiency and robustness. In this paper, we propose a deep-learning-based neural network architecture for non-fixed (variable-length) sequence data that combines a causal convolutional neural network (CNN) and a long-term recurrent convolutional network (LRCN). Causal CNNs and LRCNs both incorporate convolutional layers for feature extraction, so both architectures can process sequential data, such as time series or video data, used in a variety of applications. Both architectures extract features from the input sequence to reduce its dimensionality, capture the important information, and learn hierarchical representations for sequence processing tasks. We also adopt the concept of a series compact convolutional recurrent neural network (SCCRNN), a neural network architecture for sequential data that compactly combines convolutional and recurrent layers, reducing the number of parameters and the memory usage while maintaining high accuracy. The resulting architecture is suitable for continuously incoming sequence video data and brings together the advantages of LSTM-based and CNN-based networks. To verify the method, we evaluated it with a sequence learning model, considering the network parameters and memory required in real environments, on the UCF-101 dataset, an action recognition dataset of realistic action videos collected from YouTube with 101 action categories. The results show that the proposed model, a sequence causal long-term recurrent convolutional network (SCLRCN), provides an accuracy improvement of approximately 12% or more compared with the existing models (LRCN and TCN).

1. Introduction

Deep-learning-based research using sequential input data is important for effectively extracting features from video streaming data. Most deep neural networks that depend on sequence data have previously shown improvements only in domains such as voice and text. Sequence data are now also used in image and vision research, where deep learning techniques such as action classification and object detection are used to effectively predict an outcome. In prior work, schemes based on the recurrent neural network (RNN) family [1,2,3,4] were generally used to solve vision problems with sequential data flow. Nowadays, a more diverse set of schemes is used for solving such problems.
CNNs have demonstrated remarkable performance on a variety of computer vision tasks with state-of-the-art architectures [5,6,7]. They are designed to learn hierarchical representations of the input data by stacking multiple layers of convolutional filters (CFs) that detect increasingly complex patterns in the data, such as edges, textures, and shapes. A CNN can be applied successfully and efficiently in a variety of deep learning architectures, such as image classification (ImageNet) [8], object detection (Faster R-CNN) [9], semantic segmentation (U-Net) [10], and image generation (generative adversarial networks, GANs) [11]. Effective neural network architectures can learn hierarchical representations of the input data and capture complex patterns, providing a powerful tool for a variety of vision and pattern recognition tasks.
Among the existing models for sequence prediction, learning methods using a CNN [12,13,14,15] have been successfully proposed alongside the RNN family [16]. Another related work [17] aims to increase the accuracy of two-stream networks using single and multiple optical flows. In this paper, many action classification models are presented and compared for efficient prediction, considering factors such as the data used, memory, parameters, and time complexity. These models rely on multiple video data fusion methods built on the same backbone network with diverse neural network architectures that use action classification data. The SCCRNN is suited to extracting features from continuously incoming data rather than from data that arrive as a complete sequence, and it combines the advantages of LSTM-based and CNN-based networks, achieving a remarkable model compression rate at only a limited expense of model performance.
CNN-based sequence data processing, such as causal CNNs, is already used in natural language processing (NLP), for example in text-to-speech (TTS). In this paper, however, we compare action classification methods between a single-frame-based CNN adapted to the sequential 2D visual domain and the SCLRCN, which combines a causal CNN and an LRCN for sequential input data. The SCLRCN inherits the advantage of the LRCN in that it can accommodate a large amount of data, and it can also effectively extract the time-dependent features of a causal CNN and reuse network values from the previous learning process. Therefore, the architecture can produce effective experimental results for the large amount of continuous data generated in the target environment. In addition, the SCLRCN is potentially a good starting point for many AI development areas, since the resulting models can be applied to diverse neural network architectures that use sequential input data and are not confined to a predictor made for a specific output.

2. Related Works

Many CNN-based related works on sequential input data have been successfully conducted together with the RNN family. In particular, an action classification model can be configured in different ways depending on efficiency factors such as memory, parameters, and time complexity, which determine how the neural network should be used in the actual learning process when video data fusion methods are applied to the backbone network.

2.1. Dilated Convolutional Neural Network (Dilated CNN)

A dilated CNN is a CNN-based method that increases the size of the receptive field (RF) of a basic convolutional network so that a convolution filter of the same size can cover a larger area [18]. It has a dilation rate (DR) as a parameter, and the relationship between the RF, DR, and CF sizes is as follows:
RF = (CF × DR) + (DR − 1)        (1)
As shown in Figure 1, a standard convolutional layer needs 121 parameters to obtain an RF of 11 × 11, whereas a dilated 3 × 3 convolution with a DR of 3 covers the same area with only 9 parameters. The dilated CNN therefore has advantages in terms of both computational cost and wide RFs [19]. By applying this advantage to TTS, a sequence data processing problem, WaveNet [16] obtained high accuracy.
Figure 2 shows a CNN-based learning process, including the input, mask filter, and output of the representative features. The input is a multi-dimensional array of values that represents the data being processed. A mask filter in a CNN is a matrix with the same shape as the output of a convolutional layer, used to selectively filter that output; it typically depends on the properties of the specific task being performed on the input data. The output of a CNN can be further processed by additional layers to extract higher-level features and classify the input data.
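As an illustration of Equation (1) and Figure 1, the sketch below computes the RF of a 3 × 3 filter at different dilation rates and builds a dilated convolutional layer. The Keras layer is an assumed implementation detail for illustration, not the authors' code.

```python
# Minimal sketch of Equation (1): RF = (CF x DR) + (DR - 1), as defined in the paper.
import tensorflow as tf

def receptive_field(cf: int, dr: int) -> int:
    """Receptive-field size according to Equation (1)."""
    return (cf * dr) + (dr - 1)

print(receptive_field(3, 1))   # 3  -> plain 3x3 convolution
print(receptive_field(3, 3))   # 11 -> the same 9-parameter filter covers an 11x11 area

# A dilated 3x3 convolution keeps 3*3 weights per input channel, whereas a dense
# 11x11 filter would need 121 weights per channel to reach the same RF (Figure 1).
dilated = tf.keras.layers.Conv2D(filters=64, kernel_size=3,
                                 dilation_rate=3, padding="same")
```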

2.2. Causal Convolutional Neural Network (Causal CNN)

Figure 3 shows the structural difference between a standard CNN and a causal CNN. A standard CNN is a deep learning scheme that determines the relationship between adjacent data features without regard to sequence order. A causal CNN is a deep-learning-based approach in which each output depends only on the current and past data, so it captures the causal relationship between the sequenced data and the past data. Such a causal network is generally used on sequential data, and it can effectively extract feature relationships from multiple datasets. In an RNN or LSTM, for example, the next operation generally cannot be performed until the prior operation is completed during training. The causal CNN, however, has the advantage that it can be trained quickly because the network can operate in parallel, regardless of the time-step order [16,20,21,22].
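A minimal sketch of this difference, assuming Keras 1D convolutions over a feature sequence: standard ("same") padding lets an output position see past and future steps, while causal padding restricts it to the current and past steps, and both are computed in parallel across all time steps.

```python
# Illustrative sketch (not the authors' code): standard vs. causal temporal convolution.
import tensorflow as tf

seq = tf.keras.Input(shape=(None, 128))   # (time, features), variable sequence length
standard = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same")(seq)    # sees past and future steps
causal   = tf.keras.layers.Conv1D(64, kernel_size=3, padding="causal")(seq)  # sees current and past steps only
model = tf.keras.Model(seq, [standard, causal])
model.summary()
```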

2.3. Temporal Convolution Network (TCN)

A TCN is also a CNN-based neural network structure designed to effectively extract features from sequence data. It combines a causal CNN with a dilated CNN, as shown in Figure 4. A TCN has two advantages: (a) it effectively extracts features thanks to the causal CNN, and (b) it obtains the wide-area RF of the dilated CNN. Therefore, the network can achieve high accuracy in TTS models such as WaveNet [16], where features must be extracted from a specific RF, and good performance in other sequence models such as [20,23].
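The combination of causal and dilated convolutions can be sketched as below, assuming Keras layers; the filter count, kernel size, and number of layers are illustrative rather than taken from the paper. Doubling the dilation rate at each layer grows the RF exponentially while every convolution remains causal (Figure 4).

```python
# A minimal TCN-style stack: causal convolutions with exponentially growing dilation.
import tensorflow as tf

def tcn_stack(x, filters=64, kernel_size=3, n_layers=4):
    for i in range(n_layers):
        x = tf.keras.layers.Conv1D(filters, kernel_size,
                                   padding="causal",
                                   dilation_rate=2 ** i,   # 1, 2, 4, 8: RF grows exponentially
                                   activation="relu")(x)
    return x

inp = tf.keras.Input(shape=(32, 256))   # e.g. 32 frames of 256-d features
out = tcn_stack(inp)
tcn = tf.keras.Model(inp, out)
```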

2.4. Other Video Fusion Networks

Diverse feature fusion networks for video-based action recognition are being studied to improve recognition accuracy.

2.4.1. Late Fusion

In late fusion, multiple sources, such as different modalities or sequence features, are combined at a later stage of the processing pipeline. Each source is processed independently into its own representation, and the representations are then merged by a fusion method, such as averaging or concatenation, at a later stage. By combining the features from multiple sources, the resulting representation can be more robust and capture more complete features. Figure 5 shows a late fusion network in which a CNN is applied to each input and the results are merged in a fully connected layer (FCL), fusing the current frame and the frame 15 frames earlier after the CNN [2,24]. The features of object motion, which cannot be detected from a single frame, are calculated in the FCL by comparing the two frames separated in time.
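A hedged sketch of the late fusion idea in Figure 5: two frames separated in time pass through a per-frame CNN and are merged only at the fully connected stage. The tower layers, the choice of shared weights, and the sizes are assumptions for illustration, not the paper's exact network.

```python
# Late fusion sketch: process two frames independently, fuse late by concatenation.
import tensorflow as tf

def frame_tower():
    # Small per-frame CNN; layer sizes are placeholders, not the paper's backbone.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
    ])

tower = frame_tower()                                  # weight sharing between the two streams is a design choice
frame_t    = tf.keras.Input(shape=(212, 212, 3))       # current frame
frame_prev = tf.keras.Input(shape=(212, 212, 3))       # frame 15 steps earlier
fused  = tf.keras.layers.Concatenate()([tower(frame_t), tower(frame_prev)])  # late fusion
logits = tf.keras.layers.Dense(101, activation="softmax")(fused)
late_fusion = tf.keras.Model([frame_t, frame_prev], logits)
```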

2.4.2. Early Fusion

Early fusion is the opposite of late fusion, where the sources are combined at a later stage of processing: the different sources of information are combined into a single representation at an early stage to form one multi-modal feature. In Figure 6, the video frames are fused as soon as they are input, as in [2]. A CNN is generally applied after the causal CNN in the first layer. By connecting the pixel data from the start, local relationships between pixels can be detected effectively.
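A hedged sketch of early fusion: a short clip is fused at the very first layer by stacking its frames along the channel axis, so a single CNN sees all pixels of all frames at once. The clip length and layer sizes are assumptions for illustration.

```python
# Early fusion sketch: frames are concatenated channel-wise before the first convolution.
import tensorflow as tf

T = 10                                             # number of fused frames (assumption)
clip = tf.keras.Input(shape=(212, 212, 3 * T))     # frames stacked along the channel axis
x = tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same")(clip)  # fusion happens here
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
logits = tf.keras.layers.Dense(101, activation="softmax")(x)
early_fusion = tf.keras.Model(clip, logits)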

2.4.3. Slow Fusion

Slow fusion is similar to late fusion in that fusion is repeated over multiple stages, with each level incorporating an additional representation of the data. It is analogous to using an RNN in multi-modal processing, where the features are combined at each time step of the network. As shown in Figure 7, the slow fusion network structure combines early fusion and a TCN. It is implemented to access additional global information by fusing data through multiple layers.
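One way to sketch slow fusion is with 3D convolutions, where the temporal axis is reduced step by step across several layers rather than all at once. The use of Conv3D and the specific sizes below are illustrative assumptions, not the configuration shown in Figure 7.

```python
# Slow fusion sketch: temporal information is merged gradually, layer by layer.
import tensorflow as tf

clip = tf.keras.Input(shape=(16, 112, 112, 3))               # (time, H, W, C)
x = tf.keras.layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(clip)
x = tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2))(x)     # time 16 -> 8: first fusion step
x = tf.keras.layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2))(x)     # time 8 -> 4: second fusion step
x = tf.keras.layers.Conv3D(128, (3, 3, 3), padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling3D()(x)              # remaining temporal extent fused here
logits = tf.keras.layers.Dense(101, activation="softmax")(x)
slow_fusion = tf.keras.Model(clip, logits)
```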

2.4.4. Recurrent Neural Network (RNN)

An RNN is designed to handle sequential data by using recurrence. Whereas a feed-forward architecture passes information in only one direction from input to output, an RNN has a feedback loop that passes information from one step of the network to the next, maintaining a kind of memory. An RNN accepts sequence input data of unspecified length. Its operation is illustrated in Figure 8 and expressed in Equation (2).
h_t = tanh(W_xh x_t + W_hh h_(t−1) + b_h)
y_t = W_hy h_t + b_y        (2)
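Equation (2) can be written directly as a recurrent step. The NumPy sketch below uses randomly initialized weights and illustrative dimensions, purely to show how the hidden state carries information from one step to the next.

```python
# Direct sketch of Equation (2): one recurrent step produces h_t from x_t and h_{t-1},
# then the output y_t. Dimensions and random weights are illustrative assumptions.
import numpy as np

d_in, d_h, d_out = 8, 16, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_h, d_in))
W_hh = rng.normal(size=(d_h, d_h))
W_hy = rng.normal(size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden update, Equation (2)
    y_t = W_hy @ h_t + b_y                            # output at time t
    return h_t, y_t

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):                # unroll over a 5-step sequence
    h, y = rnn_step(x_t, h)
```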

2.4.5. Long Short-Term Memory (LSTM)

In an RNN, gradient vanishing occurs: the contribution of earlier hidden-state values disappears over time. To remedy this defect, the LSTM [2], which adds a cell state to the hidden state, was devised. An LSTM can be divided into three gates: forget, input, and output, as shown in Figure 9. The forget gate is determined by the previous cell state, and the input gate updates the cell state to remember the current input. The output gate then produces the output from both the cell state and the input data.
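A NumPy sketch of the three gates described above; the weight shapes and random values are illustrative assumptions rather than trained parameters.

```python
# LSTM step sketch: forget, input, and output gates plus the cell-state update.
import numpy as np

d_in, d_h = 8, 16
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(d_h, d_in)) for k in "figo"}   # input weights per gate
U = {k: rng.normal(size=(d_h, d_h)) for k in "figo"}    # recurrent weights per gate
b = {k: np.zeros(d_h) for k in "figo"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate: how much old cell state to keep
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate: how much new content to write
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate cell content
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_t = f * c_prev + i * g                              # updated cell state
    h_t = o * np.tanh(c_t)                                # updated hidden state / output
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):                    # run over a short 5-step sequence
    h, c = lstm_step(x_t, h, c)
```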

2.4.6. Long-Term Recurrent Convolutional Network (LRCN)

An LRCN is a neural network architecture that combines the advantages of a CNN and an RNN to process sequential data with spatial and temporal dependencies. Its first few layers are CNN-based and extract useful spatial features from the input data; these features are then fed into an RNN, which models the temporal dependencies in the data and captures long-term relationships. Combining the CNN and LSTM components lets the LRCN learn complex spatial and temporal data representations, making the architecture powerful for processing sequential data. The LRCN [2] is an RNN-based network that was developed to investigate whether the RNN family works effectively on sequence input data (video, streaming data). Each frame of the input is applied sequentially to the CNN to extract features, which are then applied to the LSTM, as seen in Figure 10.
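A minimal LRCN sketch, assuming Keras layers: a per-frame CNN wrapped in TimeDistributed extracts spatial features, and an LSTM aggregates them over time. The backbone and layer sizes are placeholders rather than the VGG-style network listed in Table 2.

```python
# LRCN sketch: per-frame spatial CNN features followed by temporal modeling with an LSTM.
import tensorflow as tf

frames = tf.keras.Input(shape=(None, 212, 212, 3))        # (time, H, W, C), variable length
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.keras.layers.TimeDistributed(cnn)(frames)          # spatial features per frame
x = tf.keras.layers.LSTM(256)(x)                          # temporal aggregation
logits = tf.keras.layers.Dense(101, activation="softmax")(x)
lrcn = tf.keras.Model(frames, logits)
```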

3. Materials and Methods

The proposed method is based on the SCLRCN, which combines the strength of a causal CNN in effectively extracting causal relationships from adjacent data with the strength of an LSTM in effectively retaining older data. The method extracts features across time and space with the causal CNN, and an LSTM is additionally adopted to collect the features of the necessary data over the long term, while keeping a small RF for low memory consumption and parallel operation.

3.1. Sequence Causal Long-Term Recurrent Convolutional Network (SCLRCN)

An SCLRCN is designed to combine the advantages of effectively extracting the causal relationship from adjacent data with a causal CNN and extracting older data with an LSTM. In general, RNN-family models can extract an effective result from the entire network, regardless of the position in the input sequence. However, these RNN-family models can lose information held in the hidden state, because the prior data are affected by the current data. For that reason, data may be lost even when there is a close correlation between adjacent data. CNN models based on sequence data are generally able to keep important features without a specific learning process, but they can only collect correlated data within the RF accepted by the CNN model.
As shown in Figure 11, an SCLRCN extracts short-range features across time and space using a causal CNN, and it adopts an LSTM to collect the features of the necessary data over the long term.
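Under our reading of Figure 11, the SCLRCN can be sketched as per-frame spatial features, a short causal temporal convolution for local spatio-temporal structure, and an LSTM for long-term aggregation. The layer sizes and the exact placement of the causal convolution are assumptions for illustration, not the configuration of Table 2.

```python
# SCLRCN-style sketch: spatial CNN per frame, causal CNN over time (small RF), then LSTM.
import tensorflow as tf

frames = tf.keras.Input(shape=(None, 212, 212, 3))
spatial = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.keras.layers.TimeDistributed(spatial)(frames)               # spatial features per frame
x = tf.keras.layers.Conv1D(256, 3, padding="causal",
                           activation="relu")(x)                   # causal CNN over time, small RF
x = tf.keras.layers.LSTM(256)(x)                                   # long-term recurrent aggregation
logits = tf.keras.layers.Dense(101, activation="softmax")(x)
sclrcn = tf.keras.Model(frames, logits)
```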
For CNN-based sequence models, memory consumption, computational time, and the degree of feature extraction must be considered when configuring a causal CNN. Using a CNN on a single frame has the same effect as an LRCN, although the CNN itself cannot use temporal features. As shown in Figure 12, with late fusion, features cannot be effectively extracted across the gap between the two branches, and temporal feature extraction cannot be performed within the CNN. In addition, late fusion either has to store part of the data that is not currently used or must re-calculate data that have already been computed, as seen in Figure 12.
In the case of early fusion, temporal and spatial features can be collected rapidly at the beginning of the learning process. However, many CNN channels are required for a large amount of sequence data. In the case of a TCN, the network can acquire a wide RF depending on the number of layers, and it can have low memory usage and parallel operation, since the same CNN mask is used within a layer. Figure 13 shows the memory space and parallel operations required when a TCN is combined with an LSTM.
As the layers deepen, a TCN keeps the results of prior operations in memory for reuse in later operations, which increases its memory requirement. On the other hand, because parallel operation is possible, the size of each layer must be selected accordingly.

3.2. Optimization of Learning Performance by SCLRCN

Slow fusion shares two features with early fusion and the TCN: (1) slow fusion using a causal CNN can achieve low memory usage and parallel operation even if its RF is smaller than that of the TCN; (2) there is no need to save data separately when using both a causal CNN and an LSTM, as seen in Figure 14.
As shown in Table 1, a certain amount of computation and memory consumption is required for sequence-based CNN models, whereas an LRCN stores the previous data in memory. Slow fusion, the TCN, and the causal CNN [16,25] achieve high accuracy compared with other sequence-based models, and these methods also have the advantage of parallel operation. As the network layers become deeper and longer, however, slow fusion and the TCN need to calculate a large amount of data for the next level.
The RF needed in the causal CNN, TCN, and slow fusion is small because the sequence data area covered by the RF is already sufficiently wide. In addition, the amount of memory consumed increases, due to the nature of the CNN, when a TCN or slow fusion is used together with an LRCN. From the results, we observe that the accuracy of the network models increases in the order of single frame, early fusion, late fusion, and slow fusion. This shows that the causal CNN, TCN, and slow fusion, which apply a CNN over the sequence, are more effective than the other types of network models. Therefore, it can be confirmed that a causal CNN is well suited to an LRCN in terms of accuracy, memory usage and consumption, and the amount of computation. In image-based neural networks [26,27], increasing the number of vertically stacked (non-parallel) CNN layers increases accuracy, as shown in Figure 14. As the number of CNN layers increases, as in AlexNet [13], VGGNet [15], GoogLeNet [14], and ResNet [28], the RF over the sequence data can likewise be increased in the causal CNN, TCN, and slow fusion. The number of CNN layers for object detection in the YOLO family, such as YOLOv1 [29], YOLO9000 [30], YOLOv3 [31], and YOLOv4 [32], can also be increased.

4. Experimental Results

In this section, we analyze the proposed method using reliable deep learning models (LRCN, TCN, and SCLRCN) on the benchmark dataset UCF-101 [33,34]. The neural network structure of each model is based on convolutional layers through which the target output is connected. The performance of the models was evaluated through the averaged test accuracy.

4.1. Experimental Environment

The performance improvement of the SCLRCN is tested against two different DL-based networks, the LRCN and the TCN, using lightweight networks based on VGGNet [15], as shown in Table 2. As shown in Figure 15, the RF of the TCN is set to 32 frames and that of the SCLRCN to 13 frames. The output of the TCN and the SCLRCN can be modified according to the number of input data. Since all outputs of the networks have the same label, the output was created with the same size as the prediction through the LSTM layer in Figure 14. All of the datasets were based on UCF-101 [33]. For the training dataset, 7272 videos were re-shaped into 32 frames each, tailored to the RF of the TCN, so that all networks could observe images of the same length. Table 2 describes the experimental neural network structures (network summary) of the LRCN, TCN, and SCLRCN. They all essentially use Conv3, which refers to the output of the block3-pool (MaxPooling2D) layer in each block. However, the neural network structure for each layer output is configured slightly differently, connecting each layer.
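The training setup described above can be sketched as follows. Here, load_ucf101_clips is a hypothetical helper for reading fixed-length 32-frame clips, and the optimizer, batch size, and patience are assumptions rather than the paper's settings.

```python
# Hedged sketch of the experimental setup: 32-frame clips, 101 UCF-101 classes,
# and early stopping to end the learning process, as used for Table 3.
import tensorflow as tf

def train(model, load_ucf101_clips):
    # load_ucf101_clips is a hypothetical data loader, not part of any library.
    (x_train, y_train), (x_val, y_val) = load_ucf101_clips(frames=32, size=(212, 212))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=5,
                                                  restore_best_weights=True)
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=100, batch_size=8,
              callbacks=[early_stop])
    return model.evaluate(x_val, y_val)[1]   # averaged test accuracy, as reported in Table 3
```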
In this experiment, we evaluated and compared the prediction accuracy according to the type of neural network. The prediction uses the information of all frames rather than a single prediction for the entire process.

4.2. Experimental Result and Evaluation

Table 3 shows the test accuracy of each network, with early stopping used to end the learning process under the same learning environment. The SCLRCN has approximately 12% higher accuracy than the LRCN and 36% higher accuracy than the TCN.
The experimental results show that, while DL-based networks such as the TCN and causal CNN are effective for sequence feature extraction, image networks that process sequence features with an LSTM are slightly more effective. In addition, the frame-based method using the LSTM achieves higher accuracy than the TCN even though the RF of the CNN in the SCLRCN is smaller than that of the TCN. The results also show that the method using a sequence-based CNN on video images can predict the movement of pixels between adjacent frames. Therefore, it can be confirmed that an RNN-based network such as an LSTM, applied to features extracted by a CNN with an appropriate RF, is more effective.

5. Discussion and Conclusions

In this study, we proposed the SCLRCN, which is designed to combine the advantages of effectively extracting the causal relationship from adjacent data with a causal CNN and effectively retaining older data with an LSTM. The method not only extracts short-range features across time and space using a causal CNN, but also adopts an LSTM to collect the features of the necessary data over the long term. In particular, slow fusion using a causal CNN has a small RF, which reduces memory consumption and enables parallel operation, and it does not need to save data separately for the causal CNN and the LSTM. In addition, the RF of a causal CNN, TCN, and slow fusion has the advantage that large amounts of computational resources do not need to be reserved for future operations and the network size for sequential data remains small.
In this paper, we confirmed that efficient memory usage and higher accuracy can be obtained for sequential visual data such as video through the SCLRCN, which merges a causal CNN and an LRCN. For that reason, it is important both to analyze the relationship between pixels in a local area and to analyze the difference between video frames. The research on the SCLRCN is also useful for hyper-parameter improvement and for methods to improve the performance of current CNN networks. It can additionally be used in action classification and other related fields, where similar experiments can be conducted, since it can be applied effectively to any network that uses sequential visual data. An SCLRCN generally achieves high accuracy in architectures that utilize sequence features in LSTM-based image networks; however, in structures where the network itself extracts sequence features, current networks such as the TCN and the causal CNN still demonstrate slightly better learning accuracy. In future work, we plan to use more complicated CNN-based architectures to further analyze the effect of hyper-parameter improvement and of methods for improving the performance of both CNN-based and LSTM-based architectures that extract sequence features for action classification.

Author Contributions

Conceptualization, D.J. and M.-S.K.; methodology, D.J.; software, D.J.; validation, D.J. and M.-S.K.; formal analysis, D.J.; writing—original draft preparation, D.J. and M.-S.K.; writing—review and editing, M.-S.K.; visualization, M.-S.K.; supervision, M.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a 2021 research grant from Sangmyung University.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306.
2. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE TPAMI 2017, 39, 677–691.
3. Chen, M.; Li, X.; Zhao, T. On Generalization Bounds of a Family of Recurrent Neural Networks. arXiv 2019, arXiv:1910.12947.
4. Tran, Q.H.; Lai, T.; Haffari, G.; Zukerman, I.; Bui, T.; Bui, H. The Context-Dependent Additive Recurrent Neural Net. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018.
5. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
6. Naseer, I.; Akram, S.; Masood, T.; Jaffar, A.; Khan, M.A.; Mosavi, A. Performance Analysis of State-of-the-Art CNN Architectures for LUNA16. Sensors 2022, 22, 4426.
7. Neris, R.; Guerra, R.; López, S.; Sarmiento, R. Performance evaluation of state-of-the-art CNN architectures for the on-board processing of remotely sensed images. In Proceedings of the 2021 XXXVI Conference on Design of Circuits and Integrated Systems (DCIS), Vila do Conde, Portugal, 24–26 November 2021.
8. Huynh, E. Vision Transformers in 2022: An Update on Tiny ImageNet. arXiv 2022, arXiv:2205.10660.
9. Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment. arXiv 2021, arXiv:2104.07719.
10. Guo, J.; Zhou, H.; Wang, L.; Yu, Y. UNet-2022: Exploring Dynamics in Non-isomorphic Architecture. arXiv 2022, arXiv:2210.15566.
11. Yeo, D.; Kim, M.-S.; Bae, J.-H. Adversarial Optimization-Based Knowledge Transfer of Layer-Wise Dense Flow for Image Classification. Appl. Sci. 2021, 11, 3720.
12. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 60, 84–90.
14. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
15. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
16. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
17. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv 2014, arXiv:1406.2199.
18. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
19. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
20. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271.
21. Debbi, H. Causal Explanation of Convolutional Neural Networks; Springer: Cham, Switzerland, 2021; Volume 12976, pp. 633–649.
22. Hamad, R.A.; Kimura, M.; Yang, L.; Woo, W.L.; Wei, B. Dilated causal convolution with multi-head self attention for sensor human activity recognition. Neural Comput. Appl. 2021, 33, 13705–13722.
23. He, Y.; Zhao, J. Temporal Convolutional Networks for Anomaly Detection in Time Series. J. Phys. Conf. Ser. 2019, 1216, 042050.
24. Boulahia, S.Y.; Amamra, A.; Madi, M.R.; Daikh, S. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 2021, 32, 121.
25. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
26. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
27. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2015, arXiv:1409.0575.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
29. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
30. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
31. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
32. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
33. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402.
34. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. arXiv 2018, arXiv:1705.07750v3.
Figure 1. Dilated convolution with 11 × 11 RF. (a) General convolution or convolution with DR value of 1; (b) Convolution with DR of 2; (c) Convolution with DR of 3.
Figure 2. CNN-based learning process including input, mask filter, and output.
Figure 3. Difference between standard CNN and causal CNN. (a) Standard CNN; (b) Causal CNN (Note: Blue denotes the number of nodes needed in current operation).
Figure 4. Learning process of temporal convolution network (TCN) (Note: Blue denotes the number of nodes needed in current operation).
Figure 5. Learning process of late fusion (Note: Blue denotes the number of nodes needed in current operation).
Figure 6. Learning process of early fusion (Note: Blue denotes the number of nodes needed in current operation).
Figure 7. Learning process of slow fusion (Note: Blue denotes the number of nodes needed in current operation).
Figure 8. Learning process of RNN.
Figure 9. Learning process of LSTM.
Figure 10. Learning process of single frame + LRCN (Note: Purple denotes data in the past and currently needed, blue denotes the number of nodes needed in current operation).
Figure 11. Learning process of SCLRCN (Note: Purple denotes data in the past and currently needed, blue denotes the number of nodes needed in current operation).
Figure 12. Memory usage process for late fusion + LSTM (Note: Purple denotes data in the past and currently needed, yellow denotes memory to be stored for the future, and blue denotes the number of nodes needed in current operation).
Figure 13. Memory usage process for TCN + LSTM (Note: Purple denotes data used in the past and currently, yellow denotes memory to be stored for the future, and blue denotes the number of nodes in current operation).
Figure 14. LRCN with causal CNN (SCLRCN) (Note: Purple denotes data used in the past and currently required, blue denotes the number of nodes required by the current operation).
Figure 15. Learning methods: (a) LRCN, (b) TCN + LRCN, (c) SCLRCN.
Table 1. Simple calculation formula for the amount of computation and memory consumption required by CNN-based sequence models.
ModelsCNN-Based Computational UsagePrior Data Computational UsageRequired MemoryMemory without Current Usage
Single FrameN/AN/A00
Late Fusion I _ S + n 1I_S + n − 100
Early Fusion 2 n n2 l a y e r = 0 n 1 F _ D n 2 l a y e r = 0 n 1 F _ D 2 n
TCN2 l a y e r = 0 n 1 2 l a y e r 2 n − 12 l a y e r = 0 n 1 2 n 1 2 l a y e r = 0 n 2 2 l a y e r n 2 l a y e r = 0 n 1 ( 2 n 1 + 2 l a y e r ) 2 l a y e r = 0 n 2 2 l a y e r 2 n
Slow Fusion 2 n 1 I _ S + l a y e r = 0 n 2 2 l a y e r I _ S + 2 n 3 2 n 2 + I _ S + l a y e r = 0 n 2 2 n 1 2 l a y e r = 0 n 3 2 l a y e r n 1 2 n 2 + I _ S + l a y e r = 0 n 2 ( 2 n 1 + 2 l a y e r ) l a y e r = 0 n 3 2 l a y e r 2 n
SCCLRN1 + l a y e r = 1 n 1 2 l a y e r 2 n 1 l a y e r = 1 n l a y e r 0
Table 2. Experimental neural network structures.
LRCN | TCN | SCLRCN
Input Size (212 × 212)
Conv3 × 64 | Conv3 × 64 | Causal Conv3 × 64
Conv3 × 64 | Dilated Conv3 × 64 | Causal Conv3 × 64
Maxpooling (104 × 104)
Conv3 × 128 | Conv3 × 128 | Causal Conv3 × 128
Conv3 × 128 | Dilated Conv3 × 128 | Causal Conv3 × 128
Maxpooling (50 × 50)
Conv3 × 256 | Conv3 × 256 | Causal Conv3 × 256
Conv3 × 256 | Dilated Conv3 × 256 | Causal Conv3 × 256
Conv3 × 256 | Conv3 × 256 | Causal Conv3 × 256
Maxpooling (22 × 22)
Conv3 × 512 | Conv3 × 512 | Causal Conv3 × 512
Conv3 × 512 | Dilated Conv3 × 512 | Causal Conv3 × 512
Conv3 × 512 | Conv3 × 512 | Causal Conv3 × 512
Maxpooling (8 × 8)
Conv3 × 512 | Conv3 × 512 | Causal Conv3 × 512
Conv3 × 512 | Dilated Conv3 × 512 | Causal Conv3 × 512
Conv3 × 512 | Conv3 × 512 | Causal Conv3 × 512
Maxpooling (1 × 1), connected
LSTM (4096) | FCL (4095) | LSTM (4096)
LSTM (101) | FCL (101) | LSTM (101)
Table 3. Averaged test accuracy (%) of each model.
LRCN | TCN | SCLRCN
71.53 | 47.72 | 83.83
