Article
Peer-Review Record

Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention

Appl. Sci. 2021, 11(4), 1548; https://doi.org/10.3390/app11041548
by Hyeongu Yun, Taegwan Kang and Kyomin Jung *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 31 December 2020 / Revised: 2 February 2021 / Accepted: 3 February 2021 / Published: 8 February 2021

Round 1

Reviewer 1 Report

The paper presents an analysis of multi-head attention using state-of-the-art measures: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). The authors state that using these measures to compare the diversity of multi-head attention representation subspaces is a novel approach. The main hypothesis is that optimizing inter-head diversity may lead to better performance. To test this, three techniques for controlling inter-head diversity are investigated. The results are presented and discussed well; using SVCCA and CKA, the authors found that some of the techniques have the opposite effect on diversity than other authors previously expected. In the conclusion, the authors summarize the results and show that their initial hypothesis holds and that the proposed diversity-control techniques lead to improved performance. The results and methods in this article can be useful for the journal's audience.

The article has many strong sides. The introduction is supplemented by a related-works section. All methods are adequately described, including equations, and all references to data and methods are provided. The results are presented clearly in tables and graphs, and the conclusions are supported by the results. However, the structure of the article differs considerably from the "Research Manuscript Sections" recommended in Applied Sciences' "Instructions for Authors": some recommended sections are missing or named differently, so I suggest consulting the editor about this. There are also some small mistakes; see the line-by-line comments.

Line-by-line comments. Note: some are indicated as suggestions only.

Line 43: "the three ...": "the" seems unnecessary; I suggest omitting the article.

Line 46: "improve" -> "improves"

Lines 61 and 293: "... opposed to Li et al. [7] expected, ...": I would suggest changing this to something like "opposed to the expectation of Li et al. [7],".

Line 67: usually "on" and "at" are used when talking about levels. I would recommend changing "in" to "on" or "at".

Lines 150-151: no line numbers in between.

Line 150 + 6: in the centering-matrix formula, the notation 11^T may be unclear to some readers; if it denotes a matrix of all ones, I would recommend adding a description to make this clear.

Line 157: I suggest changing word order "easily be" to "be easily"

Line 177: "...minimizing it.": the referent of "it" is ambiguous; it could be the orientation, the cosine similarity, or the disagreement, and it is not clear which.

Lines 179-180: no line numbers in between.

Line 194: "diversify" -> "diversifies"

Line 197: "identically" -> "identical" or "unchanged"

In Tables 1 and 2: "number of head H" -> "...heads..."

Line 236: "...head..."-> "...heads..."

Table 3: it is unusual to start a section with a table; I would suggest moving it after the first paragraph. It will then also appear after its first mention in the text.

Lines 276, 278: "...hidden sizes...": should this be "hidden size"? Size is a measure rather than a countable thing, so the plural seems odd.


Author Response

We sincerely appreciate your constructive review. 

We will consult the editor about the structure of the article.

Also, we revised our manuscript according to your review as follows:

  • We corrected all grammatical errors that you pointed out.
  • We added a more detailed description of 11^T in the centering-matrix formula, clarifying that 1 is a vector of ones and 11^T is a matrix of ones (a short note on this notation follows this list).
  • We corrected the editing error that made some line numbers disappear.
  • We rearranged the tables and figures to improve readability.
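
For reference, the centering matrix is commonly written as follows (a standard form used in CKA; our manuscript's exact notation may differ slightly):

    C_n = I_n - \frac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \qquad \mathbf{1} = (1, \ldots, 1)^{\top} \in \mathbb{R}^{n},

so that \mathbf{1}\mathbf{1}^{\top} is the n x n matrix of all ones, and multiplying a representation matrix by C_n subtracts its column means.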

We submit the updated manuscript with all the changes highlighted.

Reviewer 2 Report

This paper focuses on the diversity between attention heads in Transformers. It uses the SVCCA and CKA metrics to measure and analyze the diversity between the heads in different models as well as within the same model. The authors also introduce three methods of affecting the inter-head diversity during training that show moderate positive effects on machine translation and language modelling.
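
For readers unfamiliar with the metric, the linear form of CKA (a standard formulation; the paper may use a kernel variant) between column-centered representations X and Y, whose rows are indexed by examples, can be written as

    \mathrm{CKA}(X, Y) = \frac{\lVert Y^{\top} X \rVert_F^{2}}{\lVert X^{\top} X \rVert_F \,\lVert Y^{\top} Y \rVert_F},

which is invariant to permutations and rotations of the neuron axes and to isotropic scaling.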

While written in clearly understandable English, there are several places that need grammatical correction, mainly missing articles or prepositions and tense and number errors. For example:
- Line 35: have proposed the disagreement score
- Line 123: operates H-many single-head attentions
- Line 129: multi-head attention diversifies representation subspaces
- Line 132: we adopt the following advanced tools
- Line 139: SVCCA proceeds in two steps
- Line 147: the authors have pointed out/to a limitation
- Line 207: SVCCA and CKA are a proper tool
- etc.

The main open question after reading the article is: is it desirable to increase or decrease the inter-head diversity in Transformers? The three additional loss terms have different reported effects on the diversity: some of them increase it, some decrease it, and it is only in the conclusions that we see a sentence about "an optimal degree of inter-head diversity". This should be made clearer in the introduction, since currently the whole article uses the ambiguous term "control the diversity" without a clearer conclusion of what it means to "control" it.

Questions to address:

  • What is the reason for a much bigger translation dev set compared to the test set? A common WMT practice is to use about half as many samples for the dev set as for the test set.
  • On lines 206-207 you write that "we show SVCCA and CKA are (a) proper tool to measure the inter-head similarity". What do you mean by "proper"? Further results rely on these metrics but do not show that they are "proper". Or did you mean the increase in BLEU / ppl scores? This could be stated more clearly.
  • What is the reason for the imbalance between the two tasks: in Section 5 and most of Section 6 you focus on machine translation, while at the end of Section 6 you also introduce language modelling and perplexity. Could the analysis of Section 5 also be done for the language modelling task?
  • Your language model is smaller than standard state-of-the-art models (you have 2 encoder layers, while BERT / XLM-R / others have 12 and sometimes 6 or 24 layers). Do you think your conclusions based on the smaller model also apply to bigger models, and if yes, on what grounds do you think so?

Small formatting issues:

  • some lines are missing line numbers, which makes it harder to point to them in the review
  • text on lines 283-285 is hard to find on page 10 with so many figures / tables around it

With those issues addressed, the article is a strong scientific contribution. The math and definitions are sound, the descriptions are easy to understand (despite frequent grammatical mistakes), and the results are interesting.

Author Response

We sincerely appreciate your constructive review.

We address your comments point by point below, answering your questions and introducing the updated points in our manuscript.

  1. Is it desirable to increase or decrease the inter-head diversity in Transformers? The three additional loss terms have different reported effects on the diversity: some of them increase it, some decrease it, and it is only in the conclusions that we see a sentence about "an optimal degree of inter-head diversity". This should be made clearer in the introduction, since currently the whole article uses the ambiguous term "control the diversity" without a clearer conclusion of what it means to "control" it.
    • Our main claim in this article is that optimizing the inter-head diversity in the Transformer model is desirable. As we show with the SVCCA analyses in Section 5 and Section 6, our three methods each optimize the model's inner representations in a different way (i.e., without the HSIC loss, drophead only decreases the inter-head diversity and vice versa, and the orthogonality loss makes the core directions of the representations similar while diversifying the other directions).
    • However, we agree with your concern that the term "control" in the introduction is unclear. We updated the introduction of our manuscript, refraining from using the term "control" and explaining in more detail what it means to optimize the inter-head diversity.
  2. What is the reason for a much bigger translation dev set compared to the test set? A common WMT practice is to use about half as many samples for the dev set as for the test set.
    • We followed the experimental setup of Voita et al. [6] (Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, ACL 2019). They sampled 2.5 M sentences from the training set and used the original dev / test sets.
    • We added the above details in Section 6.
  3. On lines 206-207 you write that "we show SVCCA and CKA are (a) proper tool to measure the inter-head similarity". What do you mean by "proper"? Further results rely on these metrics but do not show that they are "proper". Or did you mean the increase in BLEU / ppl scores? This could be stated more clearly.
    • We had stated that "SVCCA and CKA are proper tools to measure the inter-head similarity" because measuring the similarity between two deep representations is not a trivial task.
    • For a simple example, suppose there is a neural network representation (the output of an intermediate layer), which can be seen as a mapping function f: X -> Y, and another representation that is a permutation of the nodes of the original f, f^{P}: X -> Y. From the viewpoint of neural networks, since the upper layers can easily learn a linear projection that nullifies the permutation, the permutation does not make any difference, and the similarity between f and f^{P} should be very high. However, common similarity measures such as cosine similarity or the l_{2} norm cannot detect that f and f^{P} are similar representations, since those measures are calculated from only two vectors (x, f(x)) and (x, f^{P}(x)). On the other hand, SVCCA and CKA incorporate the whole responses of f and f^{P} over the entire dataset space X. Therefore, SVCCA and CKA can easily find that f and f^{P} are very similar. In that sense, SVCCA and CKA are "proper" tools for measuring the similarity between two deep representations of neural networks (an illustrative sketch of this point follows this list). For further discussion, please refer to Maheswaranathan et al. [21] (NeurIPS 2019), Kudugunta et al. [22] (EMNLP-IJCNLP 2019), and Bau et al. [11] (ICLR 2018). Exactly the same argument applies to measuring the similarity between two heads' representations in multi-head attention.
    • However, we agree with your concern that the word "proper" may confuse readers. We updated the first paragraph of Section 5 with a clearer sentence ("By analyzing the diversity of representation subspaces, we show that how SVCCA and CKA reflect the dynamics of inter-head similarity in terms of the numbers of heads.").
  4. What is the reason for the imbalance between the two tasks: in Section 5 and most of Section 6 you focus on machine translation, while at the end of Section 6 you also introduce language modelling and perplexity. Could the analysis of Section 5 also be done for the language modelling task?
    • We have empirically shown that optimizing the inter-head diversity with our suggested methods leads to improved performance in machine translation tasks. We carried out the experiments on the PTB language modeling task in order to show that our suggested methods are applicable and effective for tasks other than machine translation. The analyses in Section 5 and Section 6 indeed show the same result for the PTB language modeling experiments.
  5. Your language model is smaller than standard state-of-the-art models (you have 2 encoder layers, while BERT / XLM-R / others have 12 and sometimes 6 or 24 layers). Do you think your conclusions based on the smaller model also apply to bigger models, and if yes, on what grounds do you think so?
    • In our analyses and suggested methods, the number of layers is not relevant at all, since the multi-head attention mechanism operates on the output of the previous layer. Also, in Table 1, Table 2, and Table 3 in Section 5, the SVCCA and CKA statistics show a persistent tendency related to the dimension per head (dim/head), but they are not related to the hidden size.
    • On those grounds, we strongly believe that our methods and analyses can be applied to larger language models, including BERT.
    • We updated Section 6 with the above discussion.
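
To make the permutation argument in our answer to question 3 concrete, here is a minimal illustrative sketch (for exposition only, not the code used in our experiments; it assumes the linear form of CKA, synthetic Gaussian data, and a helper linear_cka defined only for this illustration): a per-example cosine similarity does not recognize a representation and its neuron-permuted copy as the same, whereas linear CKA does.

    import numpy as np

    def linear_cka(X, Y):
        """Linear CKA between representations X, Y of shape (n_examples, dim).

        Columns are mean-centered first; CKA = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F).
        """
        X = X - X.mean(axis=0, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
        return (np.linalg.norm(Y.T @ X, "fro") ** 2
                / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))       # responses of one "head" on 1000 inputs (synthetic)
    X_perm = X[:, rng.permutation(64)]    # the same responses with the neuron order shuffled

    # Per-example cosine similarity only compares f(x) with f^P(x) vector by vector,
    # so it does not recognize the permuted copy as the same representation ...
    cos = np.mean(np.sum(X * X_perm, axis=1)
                  / (np.linalg.norm(X, axis=1) * np.linalg.norm(X_perm, axis=1)))

    # ... while linear CKA compares the responses over the whole dataset and returns 1.
    print(f"mean cosine similarity: {cos:.3f}")                    # near 0 for this data
    print(f"linear CKA            : {linear_cka(X, X_perm):.3f}")  # 1.000

Running this sketch should print a mean cosine similarity near zero but a linear CKA of 1.000, mirroring the argument above.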


Also, there are minor revisions according to your review as follows:

  • We corrected all grammatical errors, including those you pointed out.
  • We corrected the editing error that made some line numbers disappear.
  • We rearranged the tables and figures to improve readability.


We submit the updated manuscript with all the changes highlighted.
