Peer-Review Record

Bilateral Ear Acoustic Authentication: A Biometric Authentication System Using Both Ears and a Special Earphone

Appl. Sci. 2022, 12(6), 3167; https://doi.org/10.3390/app12063167
by Masaki Yasuhara 1,2,*, Isao Nambu 3 and Shohei Yano 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 February 2022 / Revised: 15 March 2022 / Accepted: 18 March 2022 / Published: 20 March 2022

Round 1

Reviewer 1 Report

Both the content of the paper and the topic under study are interesting and, in most cases, well presented. However, I would anticipate a more convincing justification of the employed material and its configuration. For instance, someone would expect the rationalization of the used equipment concerning its rated performance (i.e., frequency response, sensitivity, input level, etc.). Is there any specific reason why this earphone/microphone pair was used, or could any high-quality sensors be equivalently selected? However, since we are talking of a measurement system, important parameters expressing systems' performance should have been taken into consideration. In addition, we have a tiny sound field area with relatively large-sized sensors and the microphone placed at literally zero distance from the source. Hence, the attachment of the microphone to the earphone could cause resonance, clipping and harmonic distortion issues. Likewise, ear acoustic responses could be differentiated due to the sensory placement variations (i.e., we cannot ensure identical placement even to the same subject at all times).
Furthermore, external noise contamination variations could be encountered due to improper placement and/or positioning divergencies, affecting the whole analysis/authentication process (and its results). Were some of these measures taken into consideration? At least, the above aspects should have been commented on in the manuscript. Also, a photo with a typical sensory placement would help getting a better picture of the whole setup. Another issue is related to the "Cosine similarity" approach, which, though it seems reasonable, also lacks a convincing motivation and justification. For instance, why a different feature-driven solution was not used instead (i.e., for better performance, simplicity, faster computation or something else)? Could the authors comment on these aspects?   

Author Response

Thank you for reviewing our manuscript and providing valuable comments.

Our responses are shown in red font.

The attached PDF highlights the changes with yellow lines.
Note: Changes made by other reviewers are also included.

 

Comment 1

For instance, someone would expect the rationalization of the used equipment concerning its rated performance (i.e., frequency response, sensitivity, input level, etc.). Is there any specific reason why this earphone/microphone pair was used, or could any high-quality sensors be equivalently selected? However, since we are talking of a measurement system, important parameters expressing systems' performance should have been taken into consideration. In addition, we have a tiny sound field area with relatively large-sized sensors and the microphone placed at literally zero distance from the source. Hence, the attachment of the microphone to the earphone could cause resonance, clipping and harmonic distortion issues.

Thank you for your valuable comment. The earphones are expected to change position each time they are put on and taken off. The earphones we chose have an ear tip shaped to fit the concha, which is thought to increase the reproducibility of the wearing condition. For the microphones, we selected small microphones whose frequency response covers the audible range of 20 Hz to 20 kHz, and we confirmed that no saturation occurred at the chosen sensitivity. In addition, the signal power is normalized to 1 in the preprocessing step, so that slight differences in sensitivity between experiments do not appear as features. We have added these explanations to the text, as they are very informative for the reader. The proximity of the microphone to the speaker was not a problem in our case; since we took no special measures against it, we do not think it needs to be explained in the text. Please check chapter 3.2, which has been revised.
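For reference, the power-normalization step described above can be sketched roughly as follows. The function name and the mean-square convention are our illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def normalize_power(x: np.ndarray) -> np.ndarray:
    """Scale a signal so that its mean power (mean squared sample) equals 1."""
    power = np.mean(x ** 2)
    return x / np.sqrt(power)

# Example: two recordings that differ only in gain normalize to the same signal,
# so a sensitivity change between sessions does not survive as a feature.
x = np.array([0.5, -1.0, 2.0, -0.5])
y = normalize_power(x)
z = normalize_power(3.0 * x)  # the gain factor disappears after normalization
```

This is why a slight session-to-session sensitivity change cannot act as a classification feature: any constant gain cancels in the normalization.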

 

Comment 2

Likewise, ear acoustic responses could be differentiated due to the sensory placement variations (i.e., we cannot ensure identical placement even to the same subject at all times).

Thank you for your comment. Yes, the acoustic responses differ due to placement variations even within the same subject. These differences are discussed in chapter 6.3, where we show that the authentication accuracy depends on them. This is an interesting problem for future research.

 

Comment 3

Furthermore, external noise contamination variations could be encountered due to improper placement and/or positioning divergencies, affecting the whole analysis/authentication process (and its results). Were some of these measures taken into consideration?

Thank you for your valuable comment. We did not exclude erroneous measurement data: even if the dataset contains such data, we use it for training and testing, because in actual use a person may produce mis-measured data. We therefore do not think this explanation is needed in the manuscript.

 

Comment 4

Also, a photo with a typical sensory placement would help getting a better picture of the whole setup.

Thank you for your valuable comment. We agree with your opinion and have attached the picture as suggested.

 

Comment 5

Another issue is related to the "Cosine similarity" approach, which, though it seems reasonable, also lacks a convincing motivation and justification. For instance, why a different feature-driven solution was not used instead (i.e., for better performance, simplicity, faster computation or something else)? Could the authors comment on these aspects?   

Thank you for your comment. We apologize, but we did not fully understand the meaning of “feature-driven solution”. We have explained below why we use cosine similarity. To analyze the data from both ears, there are two possible methods:

  1. Cosine similarity approach
  2. Classification for left vs. right ear data

We think the cosine similarity approach allows a simple, direct comparison of the signals, whereas the classification approach can prove algorithmically whether there is enough difference to be classifiable. We assume that either technique can determine whether there is a difference in the bilateral signals. We did not use the classification approach in order to avoid classifying twice; moreover, the cosine similarity approach is nonparametric. Since the following chapter on evaluation uses an SVM for classification, performing classification in the signal-comparison chapter as well would introduce two similar classifications in a row, which would make the paper harder to follow. Therefore, we used cosine similarity, which simply allows the signals to be compared numerically.
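As a small illustration of the first option, cosine similarity reduces the left/right comparison to a single score between -1 and 1. This is a generic sketch, not the paper's exact pipeline:

```python
import numpy as np

def cosine_similarity(left: np.ndarray, right: np.ndarray) -> float:
    """Cosine of the angle between two signal vectors: 1.0 = identical direction."""
    return float(np.dot(left, right) /
                 (np.linalg.norm(left) * np.linalg.norm(right)))

# Identical signals score 1.0; orthogonal signals score 0.0.
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, -3.0, 2.0])  # orthogonal to a: 1*0 + 2*(-3) + 3*2 = 0
```

Because the norms cancel, the score is also insensitive to overall gain, which fits the nonparametric, assumption-free comparison described above.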

Author Response File: Author Response.pdf

Reviewer 2 Report

In this paper, the authors discussed a biometric authentication system using both ears and a special earphone. In general, the topic is very interesting. However, some points should be discussed further before acceptance. The reviewer believes that these discussions would be much more helpful for readers.

 

  1. The SNR would be an important issue. However, the authors did not discuss the influence of this parameter on their method. The authors should enhance this discussion in section 5.
  2. The reviewer wonders whether the authors’ method depends on the noise distribution. When the noise is non-Gaussian, such as Class B noise [1], alpha-stable noise [2], or Gaussian mixture noise [3], how does the method perform? The authors should discuss these cases in section 6.

 

[1] X. Zhang, et al. Parameter estimation of impulsive noise with Class B model. IET Radar, Sonar and Navigation, 2020. DOI: 10.1049/iet-rsn.2019.0477.

[2] P.G. Georgiou, et al. Alpha-stable modeling of noise and robust time-delay estimation in the presence of impulsive noise. IEEE Transactions on Multimedia. DOI: 10.1109/6046.784467.

[3] X. Xia, et al. Parameter Estimation for Gaussian Mixture Processes Based on Expectation-Maximization Method. 2016 4th International Conference on Machinery, Materials and Information Technology Applications. DOI: 10.2991/icmmita-16.2016.96.

Author Response

Thank you for reviewing our manuscript and providing valuable comments.

Our responses are shown in red font.

The attached PDF highlights the changes with yellow lines.
Note: Changes made by other reviewers are also included.

 

Comment 1

The SNR would be an important issue. However, the authors did not discuss the influence of this parameter on their method. The authors should enhance this discussion in section 5.

Thank you for the great advice. As you suggested, the relationship between SNR and authentication accuracy should be shown. We calculated the authentication accuracy when signals with an SNR below 30 dB are excluded and compared the results; we found that excluding these signals does not improve the accuracy. Please check Chapter 6.3, where the SNR is discussed. We apologize that, due to the structure of the paper, we could not place this discussion in Chapter 5 as you suggested.
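The exclusion rule described above can be sketched as follows. The response does not state how the signal and noise powers were estimated, so the record fields here are illustrative assumptions:

```python
import math

def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from linear power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)

def exclude_low_snr(recordings, threshold_db=30.0):
    """Keep only recordings whose estimated SNR meets the threshold."""
    return [r for r in recordings
            if snr_db(r["signal_power"], r["noise_power"]) >= threshold_db]

recordings = [
    {"id": "a", "signal_power": 1000.0, "noise_power": 1.0},  # exactly 30 dB: kept
    {"id": "b", "signal_power": 100.0,  "noise_power": 1.0},  # 20 dB: excluded
]
kept = exclude_low_snr(recordings)
```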

 

Comment 2

The reviewer wonders whether the authors’ method depends on the noise distribution. When the noise is non-Gaussian, such as Class B noise [1], alpha-stable noise [2], or Gaussian mixture noise, how does the method perform? The authors should discuss these cases in section 6.

Thank you for your comment. We apologize, but we are not sure what is meant by "the authors’ method depends on the noise distribution". By "authors’ method", do you mean the impulse-response measurement method using the MLS signal? If so, our explanation is as follows. The MLS signal has the spectral properties of white noise. Mahto et al. and Gao et al. use TSP signals; based on [1] (we apologize that this document is in Japanese), we do not consider that impulse responses measured with TSP and M-sequence signals differ substantially, so we do not think a discussion of the difference in accuracy between these methods is required. Please let us know if our interpretation is wrong.

[1] M. Kobayashi, Y. Kaneda. Study of the dependence of impulse response measurement error on the measurement signal. IEICE Tokyo Student Conference. https://www.ieice.org/tokyo/gakusei/activity/kenkyuu-happyoukai/happyou-ronbun/21/pdf/12.pdf
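For readers unfamiliar with MLS excitation: a maximum-length sequence is generated by a linear-feedback shift register (LFSR) and has a flat, white-noise-like spectrum over one period. The order and tap positions below are a small textbook example, not the actual excitation signal used in the paper:

```python
def mls(order: int, taps) -> list:
    """Generate one period (2**order - 1 samples) of a maximum-length
    sequence as +/-1 values, using a Fibonacci LFSR. `taps` are 1-indexed
    register positions and must correspond to a primitive polynomial."""
    state = [1] * order          # any nonzero seed works
    seq = []
    for _ in range(2 ** order - 1):
        seq.append(1 if state[-1] else -1)   # output the last register bit
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]      # shift right, insert feedback
    return seq

# x^4 + x^3 + 1 is primitive, so taps (4, 3) yield a full period of 15 samples
seq = mls(4, (4, 3))
```

One period contains exactly one more +1 than -1 (here 8 vs. 7), which is the near-zero-mean, balanced property behind the white-noise-like spectrum.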

Author Response File: Author Response.pdf

Reviewer 3 Report

The work appears well structured and follows correct methodological criteria, so I consider it acceptable for publication, with only a few minor improvements.


My suggestions and points for improvement:
- A schematic of the system used to measure the acoustic channel is shown in Figure 1. It would be useful to put side by side a real picture of the prototype used.
- At line 88, the transformation to minimum phase of the impulse response measured in the auditory canal is mentioned without explanation or bibliographic references for this choice. It would be useful to add some explanation about this.
- In line 114 it is written that there may be some difference between left- and right-channel measurements due to possible differences in the speakers and microphones. I would suggest, for future research, performing an auto-equalization of both channels to reduce differences in the responses due to the transducers.
- The work does not mention the possible influence of ear wax in the measured impulse responses, which could change more or less significantly the response obtained and therefore the correct recognition of the user.

Author Response

Thank you for reviewing our manuscript and providing valuable comments.

Our responses are shown in red font.

The attached PDF highlights the changes with yellow lines.
Note: Changes made by other reviewers are also included.

 

Comment 1

A schematic of the system used to measure the acoustic channel is shown in Figure 1. It would be useful to put side by side a real picture of the prototype used.

Thank you for your valuable comment. We agree with you and have accordingly attached the real picture.

 

Comment 2

At line 88, the transformation to minimum phase of the impulse response measured in the auditory canal is mentioned without explanation or bibliographic references for this choice. It would be useful to add some explanation about this.

Thank you for your kind comment. We added the explanation in lines 92--95.

 

Comment 3
In line 114 it is written that there may be some difference between left- and right-channel measurements due to possible differences in the speakers and microphones. I would suggest, for future research, performing an auto-equalization of both channels to reduce differences in the responses due to the transducers.

Thank you for your kind comment. We plan to consider auto-equalization in the future.

 

Comment 4

The work does not mention the possible influence of ear wax in the measured impulse responses, which could change more or less significantly the response obtained and therefore the correct recognition of the user.

Thank you for your kind comment. We plan to consider the influence of ear wax in the future.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors made some effort to revise their article; however, some remarks were treated rather superficially. Revision comments are not for the reviewers per se but also for the reader who would have similar questions. Hence, some of the revision comments should also be reflected in the manuscript to justify the associated implementation and configuration choices. For instance, the added text "We confirmed that saturation did not occur with respect to sensitivity" does not correspond to what we refer to as scientific justification (at least, some rephrasing is necessary). Likewise, the authors altered their previous text from

"After normalizing the power of the signal to 1, a clipping process was performed"

to

"The power of the signal was normalized to 1 to ensure that the magnitude of the signal is not a classification feature. After normalizing, a clipping process was performed"

I am afraid this cannot be considered a justification of the adopted configuration (and a demonstration of the associated motivation). How was the normalization performed? What do you mean by power (the squared signal?), and what is the unit of the "1" value? Why is the normalization not described in terms of a dBFS level? By normalizing a signal, noise presence might be intensified (comment 3 also included this perspective, but it was not actually addressed during the revision). What is the clipping process used for? I think it is essential that the authors cover these aspects.

 

 

Author Response

Thank you for reviewing the revisions and providing valuable comments.

Our responses are shown in red font.

The attached PDF highlights the changes with yellow lines.

 

Comment 1

Revision comments are not for the reviewers per se but also for the reader who would have similar questions. Hence, some of the revision comments should also be reflected in the manuscript to justify the associated implementation and configuration choices. For instance, the added text "We confirmed that saturation did not occur with respect to sensitivity" does not correspond to what we refer to as scientific justification (at least, some rephrasing is necessary).

Thank you for your valuable comment. We agree, and we changed "We confirmed that saturation did not occur with respect to sensitivity" to "The sensitivity of the sound processor was set as high as possible without saturating the waveform" in lines 88--89. We intend for the reader to understand that the output was 65 dB(A) and that the input was set as high as possible without saturating the waveform.

 

Comment 2

Likewise, the authors altered their previous text from

"After normalizing the power of the signal to 1, a clipping process was performed"

to

"The power of the signal was normalized to 1 to ensure that the magnitude of the signal is not a classification feature. After normalizing, a clipping process was performed"

I am afraid this cannot be considered a justification of the adopted configuration (and a demonstration of the associated motivation). How was the normalization performed? What do you mean by power (the squared signal?), and what is the unit of the "1" value? Why is the normalization not described in terms of a dBFS level? By normalizing a signal, noise presence might be intensified (comment 3 also included this perspective, but it was not actually addressed during the revision). What is the clipping process used for? I think it is essential that the authors cover these aspects.

Thank you for your valuable comment. We removed the expression "power" from the description of the normalization and instead explained it succinctly as an equation (Eq. (1)). Using an equation makes the normalization easier to understand by removing ambiguous terms such as "power" and "1". In addition, we added an explanation of the clipping process in lines 104--108.

> I am afraid this cannot be considered as a justification of the adopted configuration (and demonstration of the associated motivation).

The purpose of our analysis is to achieve high accuracy for acoustic authentication. However, amplitudes may vary due to differences in volume and sensitivity between measurement days. To accommodate this variation and achieve high accuracy, normalization is a popular method in machine learning; therefore, in this paper we normalized the magnitude of the signal as in Eq. (1). We added this explanation in lines 98--102. However, the effect of noise is not considered, so we need to address it in the future, as you suggested; considering noise is a good opportunity to improve accuracy. We have added a discussion of this issue in lines 303--306 so that readers know that noise must be considered to improve accuracy.

>Why normalization is not described in terms of dBFS level?

This is because we did not consider noise contamination due to improper placement or positioning divergence. We think that verifying the noise level using dBFS may improve the accuracy; this is an issue for the future. Thank you for your suggestion.

Author Response File: Author Response.pdf

Reviewer 2 Report

The reviewer has no more comments.

Author Response

Thank you for reviewing the revisions!

We express our deepest gratitude to the reviewer.
