Automated Parts-Based Model for Recognizing Human–Object Interactions from Aerial Imagery with Fully Convolutional Network
Abstract
1. Introduction
- Combining Felzenszwalb’s super-pixel segmentation method with a region-merging algorithm to extract human and object silhouettes from images;
- Introducing an automated parts-based model that identifies twelve human body parts and selects the top five body parts depending upon their involvement in the performed interactions;
- Using remote sensing aerial imagery to extract two types of full-body features, namely oriented FAST and rotated BRIEF (ORB) features and texton maps, along with two types of key-point-based features, namely the Radon transform and the eight-chain Freeman code;
- Applying a fully convolutional network for the classification of human–object interactions in the ground and aerial imagery.
2. Related Work
3. Proposed Methodology
3.1. Image Pre-Processing
3.1.1. Intensity Value Adjustment
3.1.2. Noise Removal
3.2. Silhouette Segmentation
Algorithm 1: Segmentation and Region Merging

Input: Image X = [x1, …, xn]
Output: Cluster centers of merged regions C = [c1, c2, c3]
Repeat
    Regions ← Felzenszwalb's algorithm (X)
    % Extract features %
    For i in len(Regions):
        Mean[i] ← Get_Mean(Regions[i])
        Covar[i] ← Get_Covariance(Regions[i])
        SIFT[i] ← Get_SIFT_descriptors(Regions[i])
        SURF[i] ← Get_SURF_descriptors(Regions[i])
    End
    % Compute similarity %
    For i, j in len(Regions):
        SimMean ← sim(Mean[i], Mean[j])
        SimCovar ← sim(Covar[i], Covar[j])
        SimSIFT ← sim(SIFT[i], SIFT[j])
        SimSURF ← sim(SURF[i], SURF[j])
        If similarity >= threshold:
            NewRegion = MergeRegions(Regions[i], Regions[j])
        End
    End
Until all images have been segmented
Return C = [c1, c2, c3]
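The following Python sketch illustrates one way Algorithm 1 could be realized with scikit-image's implementation of Felzenszwalb's segmentation. It is a minimal sketch, not the authors' implementation: the similarity measure uses only the per-region colour mean and covariance (the SIFT/SURF cues are omitted for brevity), and the scale, threshold, and helper names are illustrative assumptions.

```python
"""Sketch of segmentation plus greedy region merging (assumes an RGB image)."""
import numpy as np
from skimage.segmentation import felzenszwalb


def region_stats(image, labels):
    """Per-region mean colour and colour covariance."""
    stats = {}
    for lab in np.unique(labels):
        pixels = image[labels == lab]          # (N, 3) colour samples
        stats[lab] = (pixels.mean(axis=0), np.cov(pixels, rowvar=False))
    return stats


def similarity(stat_a, stat_b):
    """Cosine similarity of mean colours combined with a covariance-distance term."""
    mean_a, cov_a = stat_a
    mean_b, cov_b = stat_b
    cos = np.dot(mean_a, mean_b) / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b) + 1e-8)
    cov_term = 1.0 / (1.0 + np.linalg.norm(cov_a - cov_b))
    return 0.5 * (cos + cov_term)


def merge_regions(image, threshold=0.95):
    """Felzenszwalb over-segmentation followed by a single greedy merging pass."""
    labels = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
    stats = region_stats(image, labels)
    merged = labels.copy()
    region_ids = list(stats.keys())
    for i, a in enumerate(region_ids):
        for b in region_ids[i + 1:]:
            if similarity(stats[a], stats[b]) >= threshold:
                merged[merged == b] = a        # merge region b into region a
    # centroids of the surviving regions (e.g. human, object, background)
    centres = [np.argwhere(merged == lab).mean(axis=0) for lab in np.unique(merged)]
    return merged, centres
```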
3.3. Automated Parts-Based Model
Algorithm 2: Automated Parts-Based Model

Input: HSS: segmented human silhouette
Output: 12 body parts (head, left elbow, right elbow, left hand, right hand, torso, left hip, right hip, left knee, right knee, left foot, right foot): key_body_points (p1, p2, p3, …, p12) and selected parts: key_body_parts (p1, p2, p3, p4, p5)
For i = 1 to N do
    KeyBodyPoints ← []
    % Detecting 5 key points from the convex hull %
    contour ← GetContour(HSS)
    ConvexHull ← DrawConvexHull(contour)
    For point on ConvexHull do
        If point in contour do
            KeyBodyPoints.append(point)
        End
    End
    % Detecting the 6th key point from the contour center %
    Center ← GetContourCenter(contour)
    KeyBodyPoints.append(Center)
    % Detecting 6 additional key points %
    LE ← FindMidpoint(KeyBodyPoints[0], KeyBodyPoints[1]); lelbow ← FindClosestPointOnContour(LE)
    RE ← FindMidpoint(KeyBodyPoints[2], KeyBodyPoints[1]); relbow ← FindClosestPointOnContour(RE)
    LH ← FindMidpoint(KeyBodyPoints[3], KeyBodyPoints[1]); lhip ← FindClosestPointOnContour(LH)
    RH ← FindMidpoint(KeyBodyPoints[4], KeyBodyPoints[1]); rhip ← FindClosestPointOnContour(RH)
    LK ← FindMidpoint(KeyBodyPoints[3], KeyBodyPoints[5]); lknee ← FindClosestPointOnContour(LK)
    RK ← FindMidpoint(KeyBodyPoints[4], KeyBodyPoints[5]); rknee ← FindClosestPointOnContour(RK)
    KeyBodyPoints.append(lelbow, relbow, lhip, rhip, lknee, rknee)
End
Return key_body_points (p1, p2, p3, …, p12)
% Selecting the five parts most involved in the interaction %
Scores ← []
For point in key_body_points do
    Score ← CosineSimilarity(point, object)
    Scores.append(Score)
End
key_body_parts ← Get_top_5_points(Scores)
Return key_body_parts (p1, p2, p3, p4, p5)
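A minimal OpenCV sketch of Algorithm 2 is given below, assuming the silhouette is a binary uint8 mask and the object centroid is already known. The use of the first five hull points as head/hand/foot extremes, the index pairs used for the midpoints, and the cosine-similarity ranking are simplified stand-ins for the paper's procedure, not its exact implementation.

```python
"""Sketch of the automated parts-based model (OpenCV + NumPy)."""
import cv2
import numpy as np


def closest_contour_point(contour, point):
    """Return the contour point nearest to an arbitrary 2-D point."""
    pts = contour.reshape(-1, 2)
    return pts[np.argmin(np.linalg.norm(pts - point, axis=1))]


def detect_body_points(mask):
    """Estimate 12 key body points from a segmented binary silhouette."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(contour).reshape(-1, 2)

    # 5 extremal points from the hull (assumed order: head, hands, feet)
    key_points = [p.astype(float) for p in hull[:5]]

    # 6th point: contour centre (torso) from image moments
    m = cv2.moments(contour)
    key_points.append(np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]]))

    # 6 additional points: midpoints projected back onto the contour
    pairs = [(0, 1), (2, 1), (3, 1), (4, 1), (3, 5), (4, 5)]  # elbows, hips, knees
    for a, b in pairs:
        mid = (key_points[a] + key_points[b]) / 2.0
        key_points.append(closest_contour_point(contour, mid).astype(float))
    return key_points  # 12 points in total


def select_key_parts(key_points, object_centroid, top_k=5):
    """Rank parts by cosine similarity with the object location and keep the top-k."""
    scores = [np.dot(p, object_centroid) /
              (np.linalg.norm(p) * np.linalg.norm(object_centroid) + 1e-8)
              for p in key_points]
    order = np.argsort(scores)[::-1]
    return [key_points[i] for i in order[:top_k]]
```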
3.4. Multi-Scale Feature Extraction
Algorithm 3: Multi-Scale Feature Extraction

Input: N full-body silhouettes and five key body points
Output: combined feature vector (f1, f2, f3, …, fn)
% Initializing the feature vector for remote sensing HOI classification %
FeatureVector ← []
F_vectorsize ← GetVectorSize()
% Loop over the extracted human silhouettes %
J ← len(silhouettes)
For i = 1:J
    % Extracting ORB and texton features %
    ORB ← GetORBdescriptor(silhouette[i])
    Texton ← GetTextonMap(silhouette[i])
    FeatureVector.append(ORB, Texton)
    % Loop over the five key body points %
    For k = 1:5
        % Extracting chain code and Radon features %
        Code ← GetChainCode(k, k + 1)
        Radon ← GetRadonTransform(silhouette[i], k)
        FeatureVector.append(Code, Radon)
    End
End
FeatureVector ← Normalize(FeatureVector)
Return feature vector (f1, f2, f3, …, fn)
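A rough Python sketch of the four feature extractors named in Algorithm 3 follows. It is only illustrative: the texton step is reduced to k-means over a small bank of Gabor responses, and the patch size, projection angles, descriptor counts, and direction convention of the chain code are assumptions rather than the paper's exact configuration.

```python
"""Sketch of the ORB, texton, Radon, and Freeman chain-code feature extractors."""
import cv2
import numpy as np
from skimage.transform import radon


def orb_features(gray, n_features=128):
    """Full-body ORB descriptors, zero-padded to a fixed length (gray: uint8 image)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    _, desc = orb.detectAndCompute(gray, None)
    out = np.zeros((n_features, 32), np.uint8)
    if desc is not None:
        out[:min(len(desc), n_features)] = desc[:n_features]
    return out.flatten()


def texton_map(gray, n_textons=16):
    """Cluster Gabor filter responses into a texton label map (simplified texton step)."""
    responses = []
    for theta in np.arange(0, np.pi, np.pi / 4):
        kernel = cv2.getGaborKernel((9, 9), 2.0, theta, 8.0, 0.5)
        responses.append(cv2.filter2D(gray.astype(np.float32), -1, kernel))
    stack = np.stack(responses, axis=-1).reshape(-1, len(responses)).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(stack, n_textons, None, criteria, 3, cv2.KMEANS_RANDOM_CENTERS)
    return labels.reshape(gray.shape)


def radon_patch(gray, point, size=32):
    """Radon transform of a square patch centred on one key body point."""
    x, y = int(point[0]), int(point[1])
    patch = gray[max(0, y - size):y + size, max(0, x - size):x + size]
    return radon(patch, theta=np.arange(0, 180, 10), circle=False).flatten()


def chain_code(p_from, p_to):
    """8-direction Freeman code of the step between two key points (one common convention)."""
    directions = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
                  (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}
    step = np.sign(np.asarray(p_to) - np.asarray(p_from)).astype(int)
    return directions.get(tuple(step), 0)
```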
3.4.1. ORB Features
3.4.2. Texton Maps
3.4.3. Radon Transform
3.4.4. Eight-Chain Freeman Codes
3.5. Dimensionality Reduction: t-SNE
3.6. Fully Convolutional Network
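The combined (and dimensionality-reduced) feature vectors are classified with a fully convolutional network. The PyTorch sketch below is only a minimal illustration of such a classifier: the channel widths, kernel sizes, assumed feature length of 512, and global-average-pooling head are illustrative assumptions, not the architecture reported in the paper.

```python
"""Minimal sketch of a 1-D fully convolutional classifier for the feature vectors."""
import torch
import torch.nn as nn


class FCNClassifier(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        # Treat each normalized feature vector as a 1-D signal with one channel.
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # A 1x1 convolution plays the role of a dense layer, keeping the network
        # fully convolutional and independent of the input length.
        self.classifier = nn.Conv1d(128, n_classes, kernel_size=1)

    def forward(self, x):                      # x: (batch, 1, feature_length)
        x = self.features(x)
        x = self.classifier(x)                 # (batch, n_classes, feature_length)
        return x.mean(dim=-1)                  # global average pooling -> class logits


# Example: nine interaction classes, as in the VIRAT experiments
model = FCNClassifier(n_classes=9)
logits = model(torch.randn(4, 1, 512))         # 512 is an assumed feature length
```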
4. Experimental Results
4.1. Dataset Description
4.1.1. VIRAT Video Dataset
4.1.2. YouTube Aerial Dataset
4.1.3. SYSU 3D HOI Dataset
4.2. Experiment I: Interaction Classification Accuracies
4.3. Experiment II: Accuracy and Loss Plots
4.4. Experiment III: Part-Based Model Detection
4.5. Experiment IV: Comparison with Other Classifiers
4.6. Experiment V: Comparison of the Proposed System with State-of-the-Art Techniques
5. Discussion
6. Conclusions
6.1. Theoretical Implications
6.2. Research Limitations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fraser, B.T.; Congalton, R.G. Monitoring Fine-Scale Forest Health Using Unmanned Aerial Systems (UAS) Multispectral Models. Remote Sens. 2021, 13, 4873. [Google Scholar] [CrossRef]
- Mahmood, M.; Jalal, A.; Kim, K. WHITE STAG Model: Wise Human Interaction Tracking and Estimation (WHITE) using Spatio-temporal and Angular-geometric (STAG) Descriptors. Multimed. Tools Appl. 2020, 79, 6919–6950. [Google Scholar] [CrossRef]
- Liu, H.; Mu, T.; Huang, X. Detecting human–object interaction with multi-level pairwise feature network. Comput. Vis. Media 2020, 7, 229–239. [Google Scholar] [CrossRef]
- Cheong, K.H.; Poeschmann, S.; Lai, J.W.; Koh, J.M.; Acharya, U.R.; Yu, S.C.M.; Tang, K.J.W. Practical Automated Video Analytics for Crowd Monitoring and Counting. IEEE Access 2019, 7, 183252–183261. [Google Scholar] [CrossRef]
- Nida, K.; Gochoo, M.; Jalal, A.; Kim, K. Modeling Two-Person Segmentation and Locomotion for stereoscopic Action Identification: A Sustainable Video Surveillance System. Sustainability 2021, 13, 970. [Google Scholar]
- Tahir, B.; Jalal, A.; Kim, K. IMU sensor based automatic-features descriptor for healthcare patient’s daily life-log recognition. In Proceedings of the IBCAST 2021, Bhurban, Pakistan, 12–16 August 2022. [Google Scholar]
- Javeed, M.; Jalal, A.; Kim, K. Wearable sensors based exertion recognition using statistical features and random forest for physical healthcare monitoring. In Proceedings of the IBCAST 2021, Bhurban, Pakistan, 12–16 August 2022; pp. 512–517. [Google Scholar]
- Jalal, A.; Sharif, N.; Kim, J.T.; Kim, T.-S. Human activity recognition via recognized body parts of human depth silhouettes for residents monitoring services at smart homes. Indoor Built Environ. 2013, 22, 271–279. [Google Scholar] [CrossRef]
- Cafolla, D. A 3D visual tracking method for rehabilitation path planning. In New Trends in Medical and Service Robotics; Springer: Cham, Switzerland, 2019; pp. 264–272. [Google Scholar]
- Chaparro-Rico, B.D.; Cafolla, D. Test-retest, inter-rater and intra-rater reliability for spatiotemporal gait parameters using SANE (an eaSy gAit aNalysis systEm) as measuring instrument. Appl. Sci. 2020, 10, 5781. [Google Scholar] [CrossRef]
- Jalal, A.; Kamal, S. Real-time life logging via a depth silhouette-based human activity recognition system for smart home services. In Proceedings of the AVSS 2014, Seoul, Korea, 26–29 August 2014; pp. 74–80. [Google Scholar]
- Jalal, A.; Mahmood, M. Students’ Behavior Mining in E-Learning Environment Using Cognitive Processes with Information Technologies. Educ. Inf. Technol. 2019, 24, 2797–2821. [Google Scholar] [CrossRef]
- Wan, B.; Zhou, D.; Liu, Y.; Li, R.; He, X. Pose-Aware Multi-Level Feature Network for Human Object Interaction Detection. In Proceedings of the ICCV 2019, Seoul, Korea, 27 October–2 November 2019; pp. 9468–9477. [Google Scholar]
- Yan, W.; Gao, Y.; Liu, Q. Human-object Interaction Recognition Using Multitask Neural Network. In Proceedings of the ISAS 2019, Albuquerque, NM, USA, 21–25 July 2019; pp. 323–328. [Google Scholar]
- Wang, T.; Yang, T.; Danelljan, M.; Khan, F.S.; Zhang, X.; Sun, J. Learning Human-Object Interaction Detection Using Interaction Points. In Proceedings of the CVPR 2020, Virtual, 14–19 June 2020; pp. 4116–4125. [Google Scholar]
- Gkioxari, G.; Girshick, R.; Dollár, P.; He, K. Detecting and Recognizing Human-Object Interactions. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8359–8367. [Google Scholar]
- Li, Y.L.; Liu, X.; Lu, H.; Wang, S.; Liu, J.; Li, J.; Lu, C. Detailed 2D-3D Joint Representation for Human-Object Interaction. In Proceedings of the CVPR 2020, Virtual, 14–19 June 2020; pp. 10163–10172. [Google Scholar]
- Jin, Y.; Chen, Y.; Wang, L.; Yu, P.; Liu, Z.; Hwang, J.N. Is Object Detection Necessary for Human-Object Interaction Recognition? arXiv 2021, arXiv:2107.13083. [Google Scholar]
- Girdhar, R.; Ramanan, D. Attentional Pooling for Action Recognition. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Gkioxari, G.; Girshick, R.; Malik, J. Contextual Action Recognition with R*CNN. In Proceedings of the ICCV 2015, Santiago, Chile, 7–13 December 2015; pp. 1080–1088. [Google Scholar]
- Shen, L.; Yeung, S.; Hoffman, J.; Mori, G.; Fei, L. Scaling human-object interaction recognition through zero-shot learning. In Proceedings of the WACV 2018, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1568–1576. [Google Scholar]
- Yao, B.; Li, F.-F. Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1691–1703. [Google Scholar] [PubMed]
- Meng, M.; Drira, H.; Daoudi, M.; Boonaert, J. Human-object interaction recognition by learning the distances between the object and the skeleton joints. In Proceedings of the International Conference and Workshops on Automatic Face and Gesture Recognition 2015, Ljubljana, Slovenia, 4–8 May 2015; pp. 1–6. [Google Scholar]
- Qi, S.; Wang, W.; Jia, B.; Shen, J.; Zhu, S.C. Learning Human-Object Interactions by Graph Parsing Neural Networks. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 407–423. [Google Scholar]
- Fang, H.S.; Cao, J.; Tai, Y.W.; Lu, C. Pairwise Body-Part Attention for Recognizing Human-Object Interactions. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 51–67. [Google Scholar]
- Mallya, A.; Lazebnik, S. Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering. In Proceedings of the CVPR, Virtual, 14–19 June 2020; pp. 414–428. [Google Scholar]
- Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vis. 2004, 59, 167–181. [Google Scholar] [CrossRef]
- Xu, X.; Li, G.; Xie, G.; Ren, J.; Xie, X. Weakly Supervised Deep Semantic Segmentation Using CNN and ELM with Semantic Candidate Regions. Complexity 2019, 2019, 1–12. [Google Scholar] [CrossRef]
- Dargazany, A.; Nicolescu, M. Human Body Parts Tracking Using Torso Tracking: Applications to Activity Recognition. In Proceedings of the ITNG 2012, Las Vegas, NV, USA, 16–18 April 2012; pp. 646–651. [Google Scholar]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the ICCV 2011, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
- Javed, Y.; Khan, M.M. Image texture classification using textons. In Proceedings of the ICET 2011, Islamabad, Pakistan, 5–6 September 2011; pp. 1–5. [Google Scholar]
- Julesz, B. Textons, the elements of texture perception, and their interaction. Nature 1981, 290, 91–97. [Google Scholar] [CrossRef] [PubMed]
- Leung, T.; Malik, J. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. In Proceedings of the ICCV 1999, Corfu, Greece, 20–25 September 1999; pp. 29–44. [Google Scholar]
- Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Oh, S.; Hoogs, A.; Perera, A.; Cuntoor, N.; Chen, C.C.; Lee, J.T.; Mukherjee, S.; Aggarwal, J.; Lee, H.; Swears, D.S.; et al. A large-scale benchmark dataset for event recognition in surveillance video. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 21–23 June 2011; pp. 3153–3160. [Google Scholar]
- Sultani, W.; Shah, M. Human action recognition in drone videos using a few aerial training examples. Comput. Vis. Image Underst. 2021, 206, 103186. [Google Scholar] [CrossRef]
- Jalal, A.; Kamal, S.; Farooq, A.; Kim, D. A spatiotemporal motion variation features extraction approach for human tracking and pose-based action recognition. In Proceedings of the IEEE International Conference on Informatics, Electronics and Vision, Fukuoka, Japan, 15–18 June 2015. [Google Scholar]
- Lee, J.T.; Chen, C.C.; Aggarwal, J.K. Recognizing human-vehicle interactions from aerial video without training. In Proceedings of the CVPR Workshops 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 53–60. [Google Scholar]
- Soomro, K.; Zamir, R.; Shah, M. UCF101: A dataset of 101 human action classes from videos in the wild. In Proceedings of the ICCV 2013, Sydney, Australia, 1–8 December 2013. [Google Scholar]
- Tahir, S.B.; Jalal, A.; Kim, K. IMU Sensor Based Automatic-Features Descriptor for Healthcare Patient’s Daily Life-Log Recognition. In Proceedings of the IEEE International Conference on Applied Sciences and Technology, Pattaya, Thailand, 1–3 April 2021. [Google Scholar]
- Waheed, M.; Javeed, M.; Jalal, A. A Novel Deep Learning Model for Understanding Two-Person Interactions Using Depth Sensors. In Proceedings of the ICIC 2021, Lahore, Pakistan, 9–10 December 2021; pp. 1–8. [Google Scholar]
- Hu, J.F.; Zheng, W.S.; Ma, L.; Wang, G.; Lai, J. Real-Time RGB-D Activity Prediction by Soft Regression. In Proceedings of the ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016; pp. 280–296. [Google Scholar]
- Gao, X.; Hu, W.; Tang, J.; Liu, J.; Guo, Z. Optimized Skeleton-based Action Recognition via Sparsified Graph Regression. In Proceedings of the ACM Multimedia 2019, Nice, France, 21–25 October 2019; pp. 601–610. [Google Scholar]
- Khodabandeh, M.; Vahdat, A.; Zhou, G.T.; Hajimirsadeghi, H.; Roshtkhari, M.J.; Mori, G.; Se, S. Discovering human interactions in videos with limited data labeling. In Proceedings of the CVPR 2015, Boston, MA, USA, 7–12 June 2015; pp. 9–18. [Google Scholar]
- Hu, J.F.; Zheng, W.S.; Lai, J.; Zhang, J. Jointly Learning Heterogeneous Features for RGB-D Activity Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2186–2200. [Google Scholar] [CrossRef] [Green Version]
- Ren, Z.; Zhang, Q.; Gao, X.; Hao, P.; Cheng, J. Multi-modality learning for human action recognition. Multim. Tools Appl. 2021, 80, 16185–16203. [Google Scholar] [CrossRef]
Authors | Main Contribution | Algorithm | Evaluation Metric | Datasets |
---|---|---|---|---|
B. Wan et al. [13] | used global spatial configuration to focus on the action-related parts of humans | PMFNet (a multi-branch deep neural network) | mAP (mean average precision) | V-COCO (verbs-common objects in context) and HICO-DET (humans interacting with common objects-detection) |
W. Yan et al. [14] | a digital glove called ‘WiseGlove’ was used to record hand movements | multitask 2D CNN (convolutional neural network) | recognition accuracy | collected using WiseGlove |
T. Wang et al. [15] | proposed the use of interaction points for recognizing human–object interactions | Hourglass-104 | mAP | V-COCO and HICO-DET |
G. Gkioxari et al. [16] | proposed the detection of humans on the basis of their appearances and that of objects through their action-specific density | ResNet-50-FPN (residual neural network-50-feature pyramid network) | mAP | V-COCO and HICO-DET |
Y.L. Li et al. [17] | a 3D pose-based system and a new benchmark named ambiguous-HOI | R-CNN (regions with convolutional neural network) | mAP | HICO-DET and Ambiguous HOI |
Y. Jin et al. [18] | performed human–object interaction (HOI) recognition without localizing objects or identifying human poses | a pre-trained image encoder and LSE-Sign loss function | mAP | HICO |
R. Girdhar et al. [19] | argued that focusing on humans and their body parts is not always useful and that the background and context can also be helpful | an attentional pooling module that can replace the pooling operation in any standard CNN | mAP | HICO, MPII, and HMDB51 (human motion database) |
G. Gkioxari et al. [20] | made use of multiple cues in an image that revealed the interaction being performed | R*CNN (a variant of R-CNN) | mAP | PASCAL VOC (visual object classes) and MPII (Max-Planck Institute for Informatics) |
L. Shen et al. [21] | a zero-shot learning approach to accurately identify the relationship between verb and object | Faster R-CNN | mAP | HICO-DET |
B. Yao et al. [22] | used two types of contextual data, including co-occurrence context models and the co-occurrence statistics between objects and human poses | CRF (Conditional Random Field) | recognition accuracy | PPMI (people playing musical instruments) and Sports dataset |
M. Meng et al. [23] | a depth sensor-based system that calculated inter-joint and joint–object distances and then extracted pose invariant features | Random Forest | recognition accuracy | ORGBD (online red, green, blue, depth) Action dataset |
S. Qi et al. [24] | used a graph-based approach for HOI recognition | GPNN (Graph Parsing Neural Network) | F1 score | HICO-DET, V-COCO, and CAD-120 (Cornell Activity Dataset) |
H. Fang et al. [25] | a pairwise body-part attention model, which focused on crucial parts and their correlations for HOI recognition | visual geometry group (VGG) convolutional layers until the Conv5 layer for the extraction of full human features | mAP | HICO |
A. Mallya et al. [26] | fused features from a person bounding box and the whole image to recognize HOIs | NCCA (Normalized Canonical Correlation Analysis) | recognition accuracy | HICO and MPII |
Part | Similarity Score |
---|---|
HD | 0.770 |
RE | 0.882 |
LE | 0.850 |
RH | 0.996 |
LH | 0.982 |
TR | 0.947 |
RP | 0.995 |
LP | 0.991 |
RK | 0.990 |
LK | 0.993 |
RF | 0.998 |
LF | 0.997 |
Dataset | No. of Videos | No. of Classes | Aerial Imagery | Modality |
---|---|---|---|---|
VIRAT Video | 1482 | 9 | Yes | RGB |
YouTube Aerial | 400 | 8 | Yes | RGB |
SYSU 3D HOI | 480 | 12 | No | RGB + D |
Part | LAO | ULO | OTK | CTK | GIV | GOV | CAO | ENF | EXF | AVG |
---|---|---|---|---|---|---|---|---|---|---|
HD | 93.21 | 90.34 | 83.03 | 84.21 | 88.02 | 89.01 | 94.32 | 88.05 | 86.45 | 88.52 |
RE | 92.23 | 93.03 | 83.34 | 90.11 | 84.45 | 83.08 | 92.35 | 79.21 | 79.35 | 86.35 |
LE | 86.29 | 85.09 | 87.12 | 85.34 | 82.62 | 84.12 | 93.62 | 77.02 | 79.23 | 84.49 |
RH | 91.45 | 90.51 | 91.63 | 89.04 | 86.45 | 87.02 | 90.56 | 76.81 | 78.93 | 86.93 |
LH | 88.32 | 90.12 | 88.06 | 89.86 | 76.23 | 79.12 | 92.03 | 82.32 | 83.13 | 85.47 |
TR | 89.34 | 87.16 | 85.21 | 86.57 | 85.23 | 82.75 | 91.14 | 78.66 | 78.88 | 84.99 |
RP | 93.62 | 92.72 | 88.02 | 87.24 | 79.13 | 82.06 | 93.35 | 73.73 | 76.03 | 85.10 |
LP | 89.32 | 87.15 | 86.09 | 85.13 | 84.27 | 83.03 | 90.42 | 75.72 | 75.02 | 84.02 |
RK | 90.09 | 91.39 | 85.03 | 86.12 | 82.16 | 84.37 | 93.24 | 71.45 | 72.48 | 84.04 |
LK | 88.43 | 87.23 | 86.09 | 88.25 | 85.09 | 82.45 | 90.76 | 80.03 | 82.54 | 85.65 |
RF | 87.03 | 85.26 | 87.16 | 88.23 | 84.77 | 86.31 | 91.09 | 79.12 | 78.25 | 85.25 |
LF | 89.26 | 88.15 | 89.29 | 87.46 | 79.03 | 82.04 | 93.03 | 81.32 | 80.52 | 85.57 |
Average part detection rate = 85.53%
Part | HR | KK | CD | BM | SK | SF | CL | GF | AVG |
---|---|---|---|---|---|---|---|---|---|
HD | 90.21 | 89.34 | 85.03 | 88.56 | 92.02 | 93.01 | 91.12 | 94.06 | 90.42 |
RE | 86.23 | 89.03 | 83.34 | 70.11 | 82.45 | 81.08 | 82.35 | 86.21 | 82.60 |
LE | 85.29 | 88.09 | 82.12 | 71.34 | 84.62 | 84.12 | 82.20 | 87.02 | 83.10 |
RH | 89.45 | 90.51 | 87.63 | 78.04 | 92.45 | 92.02 | 90.56 | 86.81 | 88.43 |
LH | 90.32 | 91.12 | 85.06 | 79.86 | 91.23 | 91.12 | 89.03 | 84.32 | 87.76 |
TR | 92.34 | 92.16 | 84.21 | 79.57 | 90.23 | 92.75 | 87.14 | 91.66 | 88.76 |
RP | 77.62 | 78.12 | 80.02 | 79.24 | 84.13 | 82.06 | 85.35 | 83.73 | 81.28 |
LP | 78.32 | 76.15 | 79.09 | 80.13 | 84.27 | 83.03 | 82.42 | 85.72 | 81.14 |
RK | 98.09 | 74.39 | 84.03 | 78.12 | 83.16 | 86.37 | 90.24 | 89.45 | 85.48 |
LK | 86.43 | 76.23 | 82.09 | 79.25 | 85.09 | 87.45 | 92.76 | 90.03 | 84.92 |
RF | 91.03 | 90.26 | 89.16 | 89.23 | 90.77 | 90.31 | 87.09 | 89.12 | 89.62 |
LF | 92.26 | 92.15 | 88.9 | 86.46 | 91.03 | 92.04 | 88.03 | 90.32 | 90.15 |
Average part detection rate = 86.14%
Part | SP | MP | TF | TO | MC | SC | PB | WB | PP | CP | PG | DG | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HD | 95.2 | 94.3 | 92.3 | 92.6 | 98.0 | 96.0 | 93.1 | 94.1 | 93.5 | 95.2 | 92.3 | 94.3 | 94.2 |
RE | 91.2 | 90.0 | 94.3 | 90.1 | 92.5 | 91.1 | 92.4 | 90.2 | 92.4 | 92.2 | 90.0 | 90.3 | 91.4 |
LE | 90.3 | 93.1 | 92.1 | 91.3 | 94.6 | 94.1 | 92.2 | 87.0 | 91.2 | 90.3 | 91.1 | 92.1 | 91.6 |
RH | 93.5 | 92.5 | 91.6 | 88.0 | 96.5 | 97.0 | 91.6 | 87.8 | 92.9 | 93.5 | 94.5 | 94.6 | 92.8 |
LH | 94.3 | 93.1 | 92.1 | 89.9 | 96.2 | 97.1 | 92.0 | 89.3 | 93.1 | 94.3 | 93.1 | 95.1 | 93.3 |
TR | 90.3 | 89.2 | 92.2 | 93.6 | 95.2 | 92.8 | 97.1 | 97.7 | 92.9 | 92.3 | 93.2 | 92.2 | 93.2 |
RP | 90.6 | 91.1 | 90.0 | 94.2 | 91.1 | 88.1 | 95.4 | 93.7 | 91.0 | 92.6 | 95.1 | 94.0 | 92.3 |
LP | 89.3 | 92.2 | 92.1 | 92.1 | 92.3 | 89.0 | 92.4 | 95.7 | 92.0 | 94.3 | 96.2 | 93.1 | 92.6 |
RK | 93.1 | 92.4 | 94.0 | 96.1 | 95.2 | 94.4 | 93.2 | 92.5 | 91.5 | 94.1 | 92.4 | 94.0 | 93.6 |
LK | 92.4 | 94.2 | 96.1 | 94.3 | 95.1 | 94.5 | 92.8 | 91.0 | 92.5 | 92.4 | 94.2 | 92.1 | 93.5 |
RF | 91.0 | 91.3 | 93.2 | 93.2 | 94.8 | 93.3 | 95.1 | 94.1 | 94.3 | 94.0 | 95.3 | 97.2 | 93.9 |
LF | 92.3 | 92.2 | 94.9 | 92.5 | 93.0 | 92.0 | 94.0 | 93.3 | 94.5 | 95.3 | 96.2 | 96.9 | 93.9 |
Average detection accuracy rate = 93.02%
Classes | ANN Precision | ANN Recall | ANN F1 | CNN Precision | CNN Recall | CNN F1 | FCN Precision | FCN Recall | FCN F1 |
---|---|---|---|---|---|---|---|---|---|
LAO | 0.78 | 0.79 | 0.78 | 0.80 | 0.81 | 0.80 | 0.84 | 0.83 | 0.83 |
ULO | 0.77 | 0.77 | 0.77 | 0.81 | 0.80 | 0.80 | 0.80 | 0.82 | 0.81 |
OTK | 0.78 | 0.78 | 0.78 | 0.82 | 0.82 | 0.82 | 0.83 | 0.84 | 0.83 |
CTK | 0.79 | 0.80 | 0.79 | 0.83 | 0.81 | 0.82 | 0.81 | 0.85 | 0.83 |
GIV | 0.76 | 0.78 | 0.77 | 0.80 | 0.80 | 0.80 | 0.82 | 0.81 | 0.81 |
GOV | 0.77 | 0.78 | 0.77 | 0.81 | 0.80 | 0.80 | 0.80 | 0.82 | 0.81 |
CAO | 0.80 | 0.79 | 0.79 | 0.80 | 0.81 | 0.80 | 0.83 | 0.84 | 0.83 |
ENF | 0.77 | 0.76 | 0.76 | 0.78 | 0.78 | 0.78 | 0.78 | 0.80 | 0.79 |
EXF | 0.74 | 0.75 | 0.74 | 0.79 | 0.79 | 0.79 | 0.82 | 0.82 | 0.82 |
Mean | 0.77 | 0.78 | 0.78 | 0.80 | 0.80 | 0.80 | 0.81 | 0.83 | 0.82 |
Classes | ANN Precision | ANN Recall | ANN F1 | CNN Precision | CNN Recall | CNN F1 | FCN Precision | FCN Recall | FCN F1 |
---|---|---|---|---|---|---|---|---|---|
HR | 0.77 | 0.78 | 0.77 | 0.80 | 0.81 | 0.80 | 0.83 | 0.83 | 0.83 |
KK | 0.77 | 0.78 | 0.77 | 0.80 | 0.80 | 0.80 | 0.82 | 0.83 | 0.82 |
CD | 0.79 | 0.80 | 0.79 | 0.83 | 0.85 | 0.84 | 0.86 | 0.87 | 0.86 |
BM | 0.88 | 0.89 | 0.88 | 0.84 | 0.85 | 0.84 | 0.90 | 0.91 | 0.90 |
SK | 0.80 | 0.81 | 0.80 | 0.82 | 0.82 | 0.82 | 0.85 | 0.85 | 0.85 |
SF | 0.80 | 0.80 | 0.80 | 0.82 | 0.82 | 0.82 | 0.87 | 0.87 | 0.87 |
CL | 0.82 | 0.83 | 0.82 | 0.86 | 0.88 | 0.87 | 0.90 | 0.90 | 0.90 |
GF | 0.80 | 0.82 | 0.81 | 0.83 | 0.83 | 0.83 | 0.87 | 0.87 | 0.87 |
Mean | 0.80 | 0.81 | 0.81 | 0.83 | 0.83 | 0.83 | 0.86 | 0.87 | 0.86 |
Classes | ANN Precision | ANN Recall | ANN F1 | CNN Precision | CNN Recall | CNN F1 | FCN Precision | FCN Recall | FCN F1 |
---|---|---|---|---|---|---|---|---|---|
SP | 0.81 | 0.82 | 0.81 | 0.82 | 0.83 | 0.82 | 0.92 | 0.92 | 0.92 |
MP | 0.83 | 0.84 | 0.83 | 0.84 | 0.85 | 0.84 | 0.92 | 0.93 | 0.92 |
TF | 0.80 | 0.81 | 0.80 | 0.81 | 0.82 | 0.81 | 0.91 | 0.91 | 0.91 |
TO | 0.81 | 0.82 | 0.81 | 0.81 | 0.80 | 0.80 | 0.90 | 0.90 | 0.90 |
MC | 0.86 | 0.86 | 0.86 | 0.87 | 0.87 | 0.87 | 0.96 | 0.96 | 0.96 |
SC | 0.88 | 0.88 | 0.88 | 0.90 | 0.90 | 0.90 | 0.96 | 0.97 | 0.96 |
PB | 0.85 | 0.86 | 0.85 | 0.88 | 0.89 | 0.88 | 0.93 | 0.94 | 0.93 |
WB | 0.86 | 0.85 | 0.85 | 0.87 | 0.88 | 0.87 | 0.92 | 0.93 | 0.92 |
PP | 0.80 | 0.82 | 0.81 | 0.81 | 0.82 | 0.81 | 0.87 | 0.87 | 0.87 |
CP | 0.81 | 0.82 | 0.81 | 0.82 | 0.83 | 0.82 | 0.90 | 0.89 | 0.89 |
PG | 0.84 | 0.85 | 0.84 | 0.84 | 0.85 | 0.84 | 0.88 | 0.88 | 0.88 |
DG | 0.84 | 0.84 | 0.84 | 0.83 | 0.86 | 0.84 | 0.90 | 0.90 | 0.90 |
Mean | 0.83 | 0.84 | 0.84 | 0.84 | 0.85 | 0.85 | 0.91 | 0.92 | 0.92 |
Dataset | CNN Execution Time (s) | ANN Execution Time (s) | FCN Execution Time (s) |
---|---|---|---|
VIRAT Video | 9430.21 ± 710 | 10,130.12 ± 720 | 8146.53 ± 620 |
YouTube Aerial | 55,313.67 ± 477 | 60,131.00 ± 432 | 4302.11 ± 398 |
SYSU 3D HOI | 5531.32 ± 142 | 6130.05 ± 129 | 4312.62 ± 114 |
Methods | Accuracy on SYSU 3D HOI (%) | Methods | Accuracy on VIRAT Video (%) | Methods | Accuracy on YouTube Aerial (%) |
---|---|---|---|---|---|
Hu et al. [42] | 54.2 | Lee et al. [38] | 77.78 | Sultani et al. [36] | 58.6 |
Gao et al. [43] | 77.9 | Khodabandeh et al. [44] | 81.40 | Sultani et al. [36] | 67.0 |
Hu et al. [45] | 84.89 | - | - | Sultani et al. [36] | 68.2 |
Ren et al. [46] | 86.89 | - | - | - | - |
Proposed method | 91.68 | Proposed method | 82.55 | Proposed method | 86.63 |