Article
Peer-Review Record

Improving the Performance of Automatic Lip-Reading Using Image Conversion Techniques

Electronics 2024, 13(6), 1032; https://doi.org/10.3390/electronics13061032
by Ki-Seung Lee
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 29 December 2023 / Revised: 23 February 2024 / Accepted: 24 February 2024 / Published: 9 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

 

This paper proposes a method that does not use the IR/depth image directly, but instead estimates these images from the optical RGB image. To this end, a modified Unet was adopted to estimate the IR/depth image from an optical RGB image. The results showed that IR and depth images are rarely affected by lighting conditions. The recognition rates for optical, IR, and depth images were 48.29%, 95.76%, and 92.34%, respectively, under various lighting conditions. The paper explains the theory in detail and the data are clearly presented. However, there are some problems, and I have some suggestions regarding the content and structure of the article.

1. In the Introduction, please state your innovation explicitly and outline it in relation to the current state of research and its open problems.

2. The presentation of the optical, NIR, and depth images and of the speech recognition method is adequate, but the methodology as presented does not come across as innovative. Please consider how to improve it so that the reader can see that your work is sufficiently innovative.

3. In Figure 3, your proposed network uses the Unet structure. Please explain how this process was achieved and where the innovation lies.

4. Is it possible to add a comparative analysis against other methods to verify the superiority of your method's performance on the dataset?

5. Have there been any major research advances in your research area in the last three years? Please add more references from the last three years; your current citations are not original enough.

6. Are there any limitations to the methods in your article? Explicitly acknowledge and discuss the limitations of the research, methodology or data. This helps to provide a balanced view of the research and shows awareness of potential weaknesses or limitations.

7. Can the method you designed be transferred to other application areas? How could further research build on this study? A slightly more detailed description would be helpful.

8. Some relevant references are missing, such as "Image Dehazing by an Artificial Image Fusion Method Based on Adaptive Structure Decomposition" and "A Novel Fast Single Image Dehazing Algorithm Based on Artificial Multiexposure Image Fusion".

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Please find my comments in the attachment.

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This article deals with an interesting application, which is the transcription to text of the spoken message, based on the automatic reading of lip movements. In my opinion, this application is the main merit of the article, for the following reasons. This application is built with the block diagram of Figure 2, and the paper should have revolved around this application, which now is just described in Section 3, titled "implementation of visual speech recognition".

However, the author has focused the paper on explaining the generation of IR and depth images from RGB images under varying illumination conditions. This approach falls into the group of applications for image quality enhancement, for the following reasons:

1. First, what is referred to in this article as an IR image is actually an image in the visible spectrum. It has been obtained with a sensor that captures information in infrared wavelengths, but in order to be displayed, there must be a direct translation to the wavelengths of the visible spectrum.

2. Secondly, when such an image is generated from RGB images, we are actually generating a visible image, with gray levels, from a color image. The appearance is that of the image that has been called IR, but obviously, it has information from the visible spectrum.

3. The performance of the lip reading system is better with these images, because variations due to illumination are being avoided, and this results in better learning of the system.

4. To be sure of the advantage of this approach, it should have been compared with other image enhancement techniques. If the problem is due to illumination variations, what techniques are available to correct these variations? If due to poor illumination, the signal-to-noise ratio is low, how can it be improved, for example, by image filtering, or even with complex systems such as those based on deep learning?

In summary, the paper presents the IR and depth imaging techniques as its main novelty. However, I have questioned whether an IR image can really be obtained from the visible spectrum, rather than simply a visible image represented in gray levels on a scale similar to that of the images provided by IR cameras. I therefore suggest that a comparative analysis with other image enhancement techniques be presented.

Comments on the Quality of English Language

I am not a native English speaker, but I have detected some room for improvement. As an example, consider the following sentence: "depth distribution was mostly predicted mostly from the RGB image". Please read the paper carefully to detect any mistakes like that before submitting an improved version.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The author has addressed all my concerns.

Author Response

There were no comments from the reviewer.

Reviewer 3 Report

Comments and Suggestions for Authors

The effort made by the author to improve the paper is to be appreciated. I do not question the usefulness of IR and depth imaging for the application under consideration. Working with these images, captured by the appropriate cameras, good results are obtained in terms of recognition rate in lip reading.

Subsequently, two CNN networks are proposed to estimate the IR and depth images from the color image. For the estimation of the depth image, there is sufficient information in the color image, of geometric type, and also the information related to the differences in brightness. But I doubt that the IR image can be correctly estimated from the visible color image. It is true that the camera has a bandwidth that includes part of the infrared spectrum, but the product you are working with is directly a visible color image that does not take this information into account. The CNN network is trained to generate a black and white image, which is the product that an infrared camera would provide. But that product is a visible image. What the CNN ends up getting is a black and white image, similar in appearance to the product offered by the IR camera, but it is not an image with information from the IR spectrum.

The advantages in terms of recognition rates I believe are due to the fact that in this image the illumination variations are being avoided. In other words, as a conclusion, I believe that a method is being presented to generate an image that does not have large variations due to illumination, and that is the main advantage. The technique seems to be good, considering the comparative analysis included in this version of the paper in section 5.5.

To conclude, I believe that the paper could be accepted for publication, if the conclusions and some statements made are qualified, to make it clear that in reality, what is called the estimated IR image has been generated from the color image in the visible spectrum, and therefore, does not contain real information from the infrared spectrum. The advantage obtained is due to the fact that with this approach to the problem, it is possible to reduce the variations due to illumination, condensing the information in a single image, which in many cases is similar in appearance to that offered by infrared cameras.

 

Comments on the Quality of English Language

I think the quality of English is good. Nevertheless, I am not a native English speaker, therefore there could be some details I cannot detect.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
