Article
Peer-Review Record

Amount Estimation Method for Food Intake Based on Color and Depth Images through Deep Learning

Sensors 2024, 24(7), 2044; https://doi.org/10.3390/s24072044
by Dong-seok Lee 1 and Soon-kak Kwon 2,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 20 January 2024 / Revised: 19 March 2024 / Accepted: 20 March 2024 / Published: 22 March 2024
(This article belongs to the Special Issue Artificial Intelligence for Food Computing and Diet Management)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes an amount estimation method for food intake based on paired color and depth images captured pre- and post-meal. The color images are used to detect food types and regions with Mask R-CNN; the depth images are used for amount estimation. The space volume for each food region is calculated by dividing the space between the food surface and the camera into multiple tetrahedra. The food intake amount is then estimated as the difference between the volumes calculated from the pre-meal and the transformed post-meal depth images. Experiments show that the proposed method achieves a low error rate.
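The camera-to-surface volume computation summarized above can be illustrated with a minimal sketch. This is not the authors' tetrahedron decomposition: instead, each pixel's viewing pyramid (apex at the camera, base on the food surface) is summed under an assumed pinhole model, and the focal lengths `fx`, `fy` and function names are illustrative, not taken from the paper.

```python
import numpy as np

def frustum_volume(depth, mask, fx=600.0, fy=600.0):
    """Volume of the space between the camera and the surface inside
    `mask`, summed per pixel. Each pixel's viewing pyramid with base
    on the surface at depth z has base area z**2 / (fx * fy), hence
    volume z**3 / (3 * fx * fy)."""
    z = depth[mask].astype(float)
    return np.sum(z ** 3) / (3.0 * fx * fy)

# Intake = post-meal volume minus pre-meal volume: as food is removed,
# the surface recedes and the camera-to-surface space grows.
pre  = np.full((10, 10), 0.40)   # 0.40 m from camera to food surface
post = np.full((10, 10), 0.45)   # surface 5 cm lower after the meal
mask = np.ones((10, 10), dtype=bool)
intake = frustum_volume(post, mask) - frustum_volume(pre, mask)  # m^3
```

The same difference-of-volumes structure applies whatever cell shape (pyramid, prism, or tetrahedron) is used to tile the pixel grid.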

 

Strengths:

  1. The paper is well-written and nicely structured.
  2. Correcting the pixel values of the transformed depth images is somewhat novel.

Weaknesses:

  1. The biggest drawback is that only the experimental results of the proposed method are presented, with no comparison against other methods. Calculating volume as in Fig. 11 should be one of the baselines; the methods of [15-17] and the volume calculation from depth images in [a] (reference below) can also serve as baselines.
  2. The contribution of this paper is not clearly stated and the novelty is limited: [a] already estimated food volumes from depth images in 2015.
  3. It seems that both Fig. 11 and Fig. 12 require the container's shape/depth/edge information to accurately calculate the volume. Since an empty-plate image is included in the dataset, why is Fig. 12 better than Fig. 11? Experiments comparing the two are needed.

Detailed questions and suggestions

  1. It would be better to analyze the quality of the transformation: it has 8 free variables and only 4 pairs of points are used to estimate them, so the result could be affected by noise.
  2. It would be better to analyze the relationship between error and the granularity of the tetrahedra when calculating volumes.
  3. When performing the simulation experiment, is prior knowledge (the volume formulas of cuboids and cylinders) used? Since the back side of an object cannot be captured, the space between the object and the background could also be included in the volume.
  4. In Table 3, why is the pre-meal volume of water smaller than the post-meal volume?
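Point 1 above can be probed directly: a projective transformation with 8 free parameters is exactly determined by 4 point pairs, so any noise in those points propagates straight into the estimate, whereas more than 4 correspondences allow a least-squares (DLT) fit that averages noise out. A minimal numpy sketch; the point sets, noise level, and ground-truth matrix are synthetic illustrations, not data from the paper.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate a 3x3 homography (8 free parameters) from N >= 4 point
    correspondences via the direct linear transform: stack two linear
    constraints per pair and take the null vector from the SVD."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# 20 noisy correspondences under a known ground-truth homography.
rng = np.random.default_rng(0)
H_true = np.array([[1.0, 0.1, 5.0], [0.0, 1.1, -3.0], [1e-4, 0.0, 1.0]])
src = rng.uniform(0, 100, size=(20, 2))
p = np.hstack([src, np.ones((20, 1))]) @ H_true.T
dst = p[:, :2] / p[:, 2:] + rng.normal(0, 0.1, size=(20, 2))
H_est = fit_homography(src, dst)
```

With N = 4 the system is square and the fit interpolates the noise; with N = 20 the residual stays near the noise floor, which is one way to quantify the transformation quality the reviewer asks about.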

Reference:

[a] Meyers, Austin, et al. "Im2Calories: Towards an Automated Mobile Vision Food Diary." Proceedings of the IEEE International Conference on Computer Vision, 2015.

Comments on the Quality of English Language

There are minor grammar errors (e.g., L126: "The color images utilized to detect the food types and regions." should read "are utilized"), but they do not affect the overall reading.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors propose an interesting and novel algorithm for estimating the amount of food consumed. The paper utilizes modern computer vision techniques as well as depth-map processing. Nevertheless, several shortcomings need to be addressed before the paper can be published:

1) The introduction could be strengthened, for example by citing works on computer vision in agriculture (10.1016/j.compag.2023.108036, 10.1016/j.heliyon.2023.e14722), including works using depth cameras (10.1016/j.compag.2020.105687, 10.3390/agronomy11091780).

2) Conceptually, Mask R-CNN solves a segmentation problem rather than a detection problem, yet the term "detection" is used throughout the text. A clarifying remark should be added, along the lines of: "generally speaking, the model produces a pixel-by-pixel food selection, but we will use the term detection."

3) Table 3: in the calculation of water in the bowl, the volume after the meal increased. Please explain this result in more detail.

Author Response

Please see the attachment

Author Response File: Author Response.pdf
