Article
Peer-Review Record

Hyperparameter Black-Box Optimization to Improve the Automatic Classification of Support Tickets

Algorithms 2023, 16(1), 46; https://doi.org/10.3390/a16010046
by Renato Bruni 1,*, Gianpiero Bianchi 2 and Pasquale Papa 2
Reviewer 1: Anonymous
Reviewer 3:
Submission received: 29 November 2022 / Revised: 22 December 2022 / Accepted: 1 January 2023 / Published: 10 January 2023
(This article belongs to the Collection Feature Papers in Algorithms)

Round 1

Reviewer 1 Report

 The manuscript “Improving the Automatic Classification of Support Tickets using Hyperparameter Black-box Optimization” describes a computational approach in which text mining and classification algorithms are used to categorize help tickets that were submitted to a customer center. The goal is to assign the right person with the correct expertise to a ticket faster. A secondary goal is to create consistency between answers to tickets with similar topics. 

The authors propose to employ machine learning techniques (convolutional neural nets (CNN) and support vector machines (SVM)) to perform the classification of tickets. In order to ensure high quality classification, the authors use derivative free optimization (DFO) over an integer lattice to tune the hyperparameters of the classifiers. In the end, the authors compare the performance of the DFO method with grid search. 

 

The manuscript is very easy to read and follow. There are only a few issues with language which do not distract from the content. However, reading the paper, several questions arise:

1)    What exactly is the novelty? From what I can tell, the authors applied established techniques to the ticket problem. The formulation of the black box problem (Eq 1) is not new (several developments do the same, e.g., HYPPO, DeepHyper). I am not an expert in applications of text mining and classification, so I cannot judge whether the application problem was novel. The question that immediately came to mind is why the Istat does not maintain a website where users enter their queries and select topics (categories) from a drop-down menu, thus eliminating a lot of the ambiguity of query topics?

2)    Section 2 and 4: In Section 2, the authors mention a set of 15000 tickets they received to work with. The numerical experiments in Section 4 are conducted on 8076 tickets. Why the difference?

3)    Section 2: How did the 31000 lemmas get reduced to 424? The authors said that this was done by cutting words that did not appear frequently enough, but has anyone looked into the importance of those ~98% of deleted words? If the 15000 tickets were reduced to 8076 due to the fact that the authors deleted a large amount of words, I have doubts that the proposed method will be relevant in practice as it leaves 46% of tickets unanswered. Some explanation is needed here. 

4)    An explanation is needed in terms of how the encoding of categorical variables to integer values impacts the outcome of the line search/primitive directions algorithm? The assignment of category to integer is arbitrary and directions may not make sense.

5)    The authors do a good job summarizing previous literature related to hyperparameter tuning. Why did the authors not compare to these methods? Many of these methods come with software. Instead, the authors compare to a simple (incomplete) grid search. 

6)    Certain assumptions were made regarding the hyperparameters of the learning models, e.g., CNN comes with 3 layers. Why was the number of layers not included in the list of tunable hyperparameters? The same question applies to other fixed hyperparameters.

7)    From the description of the tunable hyperparameters for CNN, I get for the black box optimizer 4x4x4x10x3x3=5760 possible combinations. The authors later state that there were 7680 different combinations. Similarly, I get a different number for combinations for the SVM. Why are there differences?

8)    Are 5 epochs really enough to train a CNN to convergence?

9)    The definition of the 7 classes: What is the difference between the indeterminable / unclassifiable entries in class 6 and the part in class 7 that contains tickets that weren’t classifiable in classes 1-6? These 2 classes do not seem distinct.

10) If I sum up the numbers in lines 448-450, I get a total of 8095. This number is larger than the 8076 tickets considered, indicating that some tickets must have been assigned to multiple classes by the experts. This ambiguity likely affects the outcome of the automated classification. How is it taken into account?

11) What are “some default hyperparameters” mentioned in line 455?

12) The results are not convincing. It’s obvious that a targeted hyperparameter search will likely require less time than a complete enumeration of all solutions. However, the final results should be equally good. This is not the case here since the authors excluded the best solution from the search space on which grid search was applied! For example, the filter sizes [5,4,3], activation_conv = elu, and activation_dense = sigmoid are all excluded from the grid search. It is not surprising that the DFO method finds a better solution. Thus, the conclusions drawn from the numerical analysis are not reliable or justified. The authors even note this fact later in the text, but do not address the issue and its implications any further.

13) The authors make the restricting assumptions for the grid search based on the compute time requirements for model training. It would be good to know how many GPUs the authors used for training. 

14) Lines 508-513: The authors briefly mention their experiments with a larger set of tickets for which labels were not available. Therefore, the algorithm’s accuracy could not be assessed either, yet the results were “satisfactory” – on what grounds were the results judged “satisfactory” when no ground truth was available? This short study contains no details and does not add towards convincing me that the proposed method is useful. 

15) It would have overall been nice to at least see some examples of tickets and how they were categorized, or any other data analytical plots would have been good.

 

Given these shortcomings of the manuscript, I recommend rejection. 

Author Response

Answers to Reviewer 1
The authors propose to employ machine learning techniques (convolutional neural nets (CNN) and support vector machines (SVM)) to perform the classification of tickets. In order to ensure high quality classification, the authors use derivative free optimization (DFO) over an integer lattice to tune the hyperparameters of the classifiers. In the end, the authors compare the performance of the DFO method with grid search.
The manuscript is very easy to read and follow. There are only a few issues with language which do not distract from the content. However, reading the paper, several questions arise:


1) What exactly is the novelty? From what I can tell, the authors applied established techniques to the ticket problem. The formulation of the black box problem (Eq 1) is not new (several developments do the same, e.g., HYPPO, DeepHyper). I am not an expert in applications of text mining and classification, so I cannot judge whether the application problem was novel. The question that immediately came to mind is why the Istat does not maintain a website where users enter their queries and select topics (categories) from a drop-down menu, thus eliminating a lot of the ambiguity of query topics?

As explained in the manuscript, the original content is:
a) A methodology to solve the difficult practical problem of categorizing brief texts written in natural language on the basis of their semantic content. We consider the practical application of categorizing support tickets, but this problem arises in many different contexts. Even if in this case Istat maintained a website to select topics from a drop-down menu (which is in any case impossible in the current organization), the problem would still arise in several other contexts (and would also remain for the errors made in the drop-down menu!).
We convert this problem into a multiclass classification problem and solve it by using a number of text mining steps. And of course the fact that the single steps were already known does not mean that the whole approach is not original.
The proposed methodology works at the formal level using a data-driven approach; therefore it can be applied to other text categorization problems in other contexts, and also in different languages, since it works by taking as input a dictionary of the language. If we use a dictionary of another language, the procedure will still work.
b) An approach to the difficult problem of determining the hyperparameter configuration of a generic machine learning procedure (not only neural networks, as opposed to other proposed algorithms). Again, this approach is purely formal, hence it can solve other problems of parameter optimization, dealing with parameters assuming integer, continuous or even categorical values. And it is well known that the world of algorithms is pervaded by parameter tuning problems.
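As a rough illustration of the kind of derivative-free search over an integer lattice discussed in this answer, the sketch below replaces the expensive train-and-evaluate step with a toy quadratic objective; the function names, bounds, and search strategy are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical black-box objective: maps an integer hyperparameter
# vector to a validation error. A toy quadratic stands in for the
# expensive "train the classifier and measure accuracy" step.
def validation_error(x):
    target = (2, 1, 3)  # unknown best configuration in this toy setup
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def coordinate_search(x0, bounds, max_iters=50):
    """Derivative-free search on the integer lattice: repeatedly try
    +/-1 steps along each coordinate and keep any improvement."""
    x, best = tuple(x0), validation_error(x0)
    for _ in range(max_iters):
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for step in (-1, 1):
                cand = list(x)
                cand[i] = min(max(cand[i] + step, lo), hi)
                err = validation_error(cand)
                if err < best:
                    x, best, improved = tuple(cand), err, True
        if not improved:  # no neighbor improves: stop
            break
    return x, best

print(coordinate_search((0, 0, 0), [(0, 5)] * 3))  # → ((2, 1, 3), 0)
```

In the real setting, each objective evaluation trains a model once, which is why limiting the number of evaluations matters so much compared to grid search.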


2) Section 2 and 4: In Section 2, the authors mention a set of 15000 tickets they received to work with. The numerical experiments in Section 4 are conducted on 8076 tickets. Why the difference?
15,000 tickets were used as a corpus to identify the relevant terms and compute their distances. After this, 8076 were labeled by experts of the field, as explained, and used to compute the performance, using 50% of them as training and 50% as test (of course, the labels of the test tickets were unknown to the algorithm in the parameter optimization phase, and only used at the end to compute the final performance). However, as explained, our computational experience is not limited to the 8076 records: a total of 193,419 tickets were handled by the procedure, and only 8076 of them had labels, since labelling with human experts is a costly operation.
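The 50/50 training/test split of the labeled tickets described in this answer can be sketched as follows; the ticket identifiers and the seed are synthetic stand-ins.

```python
import random

labeled = list(range(8076))  # stand-ins for the labeled ticket ids
rng = random.Random(0)       # fixed seed so the split is reproducible
rng.shuffle(labeled)

half = len(labeled) // 2
train, test = labeled[:half], labeled[half:]
print(len(train), len(test))  # → 4038 4038
```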

3) Section 2: How did the 31000 lemmas get reduced to 424? The authors said that this was done by cutting words that did not appear frequently enough, but has anyone looked into the importance of those ~98% of deleted words? If the 15000 tickets were reduced to 8076 due to the fact that the authors deleted a large amount of words, I have doubts that the proposed method will be relevant in practice as it leaves 46% of tickets unanswered. Some explanation is needed here.
This is an automatic procedure designed to carry out a task that would have been impossible with human operators, because of the size of the data. In particular, checking in detail the ~98% of deleted words does not seem a practicable operation; otherwise there would be no need for our automatic procedure. Only the top 5000 most relevant terms in our list could be checked. The result is that there was no evidence of the need to substitute retained words with deleted words. Of course, taking different words would have caused a different evolution of the algorithm, and there is not much ground to say which evolution would have been better. In any case, the 15,000 tickets were not at all reduced to 8076 due to deleted words, as explained in our previous answer.
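The frequency cutoff discussed here (keeping only lemmas that occur often enough across the corpus) can be illustrated with a minimal sketch; the toy tickets and the threshold are assumptions for illustration, not the paper's data.

```python
from collections import Counter

# Toy lemmatized tickets standing in for the real corpus.
ticket_lemmas = [
    ["survey", "login", "error"],
    ["survey", "password", "error"],
    ["survey", "deadline"],
    ["login", "error"],
]

MIN_DOC_FREQ = 2  # keep lemmas appearing in at least this many tickets

# Document frequency: set() so a lemma counts once per ticket.
doc_freq = Counter(lemma for ticket in ticket_lemmas for lemma in set(ticket))
vocabulary = sorted(w for w, c in doc_freq.items() if c >= MIN_DOC_FREQ)
print(vocabulary)  # → ['error', 'login', 'survey']
```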


4) An explanation is needed in terms of how the encoding of categorical variables to integer values impacts the outcome of the line search/primitive directions algorithm? The assignment of category to integer is arbitrary and directions may not make sense.
The assignment of categories to directions may not preserve the exact "distance" between categories, if this is what you mean. However, it is difficult even to define what exactly the "distance" between categories could be, so this aspect does not provide firm ground for speculation. Of course, in case something more could be said about the distances among categories, this could be taken into account by encoding them into directions having some particular relationship to each other.
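A minimal sketch of what this answer describes, i.e. mapping a categorical hyperparameter onto one integer coordinate of the lattice; the category list is a hypothetical example. Adjacent integers decode to adjacent categories, but since the ordering is arbitrary, that adjacency carries no semantic meaning; clamping simply guarantees that every line-search step yields a valid configuration.

```python
# Hypothetical categorical hyperparameter encoded as a lattice index.
ACTIVATIONS = ["relu", "elu", "tanh"]

def decode(index):
    # Clamp so any +/- step of the line search stays a valid category.
    return ACTIVATIONS[min(max(index, 0), len(ACTIVATIONS) - 1)]

print(decode(1))   # → elu
print(decode(99))  # → tanh (out-of-range steps clamp to the last category)
```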


5) The authors do a good job summarizing previous literature related to hyperparameter tuning. Why did the authors not compare to these methods? Many of these methods come with software. Instead, the authors compare to a simple (incomplete) grid search.

Some of these packages were preliminarily tested. However, even if the methodology behind them is reasonable and was published, the available software produced, at least in our experience, quite unstable and not very satisfactory results. Thus, instead of spending time trying to make them work better, we judged it much more "bulletproof" to conduct a grid search.
With regard to the grid search, it was clear enough that a search over the same search space should produce about the same result; on the other hand, the time required by such a grid search would be impracticable. Therefore, we deemed it more informative to consider a grid search limited to a number of combinations decided in advance, such that the computation time would be reasonable. However, in the revised version of our manuscript, we also included a comparison with a grid search working on the largest search space that could be solved in practice with our computational resources of time and space (in the case of SVM it was the same search space as the black-box algorithm). This time, the "winning" parameter configuration was contained in the search space, and it was the one selected by the grid search.


6) Certain assumptions were made regarding the hyperparameters of the learning models, e.g., CNN comes with 3 layers. Why was the number of layers not included in the list of tunable hyperparameters? The same question applies to other fixed hyperparameters.
Again, not all hyperparameters can be tuned for evident reasons of computational time. Hence, when from preliminary tests there was evidence that some configuration could be fixed, we fixed it. That is the case of 3 layers in the CNN.


7) From the description of the tunable hyperparameters for CNN, I get for the black box optimizer 4x4x4x10x3x3=5760 possible combinations. The authors later state that there were 7680 different combinations. Similarly, I get a different number for combinations for the SVM. Why are there differences?
Yes, the correct value is 5760; it was a typo due to a previous alternative design of the experiment. Thanks for pointing this out.
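The corrected count follows directly from the sizes of the six tunable domains quoted in the reviewer's question:

```python
import math

# Domain sizes for the six tunable CNN hyperparameters as quoted in
# the question: 4 x 4 x 4 x 10 x 3 x 3.
domain_sizes = [4, 4, 4, 10, 3, 3]
print(math.prod(domain_sizes))  # → 5760
```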


8) Are 5 epochs really enough to train a CNN to convergence?
This looks like a rhetorical question. From the results we can say that this was quite enough; of course, if time were not a scarce resource, one could use more epochs and obtain some improvement. However, in our preliminary tests, more than 5 epochs produced a very large increase in time and a very modest improvement in accuracy, so we judged it was not worth it.


9) The definition of the 7 classes: What is the difference between the indeterminable / unclassifiable entries in class 6 and the part in class 7 that contains tickets that weren’t classifiable in classes 1-6? These 2 classes do not seem distinct.
Class 6 contains tickets that are ambiguous or that belong at the same time to more than one class, so they are "not classifiable". Class 7, on the contrary, contains tickets belonging to well-defined categories, but these categories have small frequencies (they appear rarely). So, instead of creating another 10 or so classes with such small frequencies, they have all been put into one class: "the rest of the tickets".
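The grouping of rare categories into a single catch-all class (class 7, "the rest of the tickets") can be sketched as follows; the labels and the threshold are toy assumptions, not the paper's actual categories.

```python
from collections import Counter

# Toy expert labels; rare categories are merged into one "other" class.
labels = ["access"] * 5 + ["deadline"] * 4 + ["privacy"] + ["paper-form"]

MIN_COUNT = 3  # categories rarer than this go to the catch-all class
freq = Counter(labels)
merged = [lab if freq[lab] >= MIN_COUNT else "other" for lab in labels]
print(Counter(merged))  # → Counter({'access': 5, 'deadline': 4, 'other': 2})
```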


10) If I sum up the numbers in lines 448-450, I get a total of 8095. This number is larger than the 8076 tickets considered, indicating that some tickets must have been assigned to multiple classes by the experts. This ambiguity likely affects the outcome of the automated classification. How is it taken into account?
Class 7 actually contains 2574 elements, so the total is 8076. It was a typo, thanks for pointing this out.


11) What are “some default hyperparameters” mentioned in line 455?
They are the parameters used by the classifiers when calling them from the scikit libraries without setting any parameters. This was only to clarify that this multiclass classification problem is not trivial.


12) The results are not convincing. It’s obvious that a targeted hyperparameter search will likely require less time than a complete enumeration of all solutions. However, the final results should be equally good. This is not the case here since the authors excluded the best solution from the search space on which grid search was applied! For example, the filter sizes [5,4,3], activation_conv = elu, and activation_dense = sigmoid are all excluded from the grid search. It is not surprising that the DFO method finds a better solution. Thus, the conclusions drawn from the numerical analysis are not reliable or justified. The authors even note this fact later in the text, but do not address the issue and its implications any further.
As we explained in the manuscript, the grid search was not exhaustive because the time required would have been excessive. The search space of the grid search was of course decided before knowing the results of the black-box algorithm. See also our extended answer to point 5).


13) The authors make the restricting assumptions for the grid search based on the compute time requirements for model training. It would be good to know how many GPUs the authors used for training.
The computational resources available for the two approaches were exactly the same, since they ran on the same machine. In particular, the machine had 1 GPU of average performance.


14) Lines 508-513: The authors briefly mention their experiments with a larger set of tickets for which labels were not available. Therefore, the algorithm’s accuracy could not be assessed either, yet the results were “satisfactory” – on what grounds were the results judged “satisfactory” when no ground truth was available? This short study contains no details and does not add towards convincing me that the proposed method is useful.
As we explained in point 2, the large set of 193,419 tickets cannot be labeled by human experts, so the accuracy cannot be computed. However, the results were judged satisfactory by Istat, by means of analyses conducted by experts of the field over a random sample of those tickets. In particular, the accuracy for that sample was judged to be in line with that obtained for the 8076 labeled records.


15) It would have overall been nice to at least see some examples of tickets and how they were categorized, or any other data analytical plots would have been good.
The tickets are in Italian, and they describe difficulties in answering an Italian survey. If the Reviewer thinks it would be useful, we could provide some of them; however, we frankly doubt that this would help the reader.


We are grateful to the Reviewer for the stimulating questions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Improving the Automatic Classification of Support Tickets using Hyperparameter Black-box Optimization

1. Very interesting research entitled “Improving the Automatic Classification of Support Tickets using Hyperparameter Black-box Optimization”.

2. I suggest adapting the article to the suggested structure by algorithms-MDPI.

** Check "Microsoft Word template" from algorithms-MDPI.

 

      https://www.mdpi.com/files/word-templates/algorithms-template.dot

 

3. The design of the tables must be according to the journal algorithms-MDPI. See the image.  Change tables 1, 2 and 3 according to the format. (See attached file).

 

4. The bibliographical references of the article are in this order:

·       Reference [1] is in line 27.

·       Reference [31] is in line 35.

·       Reference [24] is in line 38.

·       Reference [8] is in line 40.

·       Reference [16] is in line 42.

·       Reference [30] is in line 44.    Etc.

References must be placed in ascending order. Correct.

5. The paragraph on lines 142-150 should be deleted. It is not necessary to comment in advance what will be covered in later sections.

6. The figures must be placed after finishing a paragraph. In the paragraph of lines 195-207, figure 1 is in the middle. Correct.

7. The method of capture or scanning of the tickets is not indicated: their format, the data they contain, their pre-processing, and whether they are text or images. Show images of different ticket formats.

8. Figure 1 shows: Classification Algorithm, Blackbox Optimization, Training Set extraction, Training with Hyperparameters and Testing and Performance Evaluation. A demonstration of each of the stages is required.

9. Elaborate an algorithm for the process indicated in Figure 1. I suggest that the algorithm in this article use the following format: (See attached file).

 

 

10. In lines 195-196 it says “corpus composed of the of text of 15,000 real tickets”. It is convenient to show an image of the Tickets. Also, show the process to get “each ticket has been converted into a vector with elements” (lines 200-201).

11. The use of “Convolutional Neural Networks (CNN) [12] and Support Vector Machines (SVM) [6] classifiers” (lines 213-213) is very confusing. Explain their implementation in detail.

12. There is only one equation in the entire article and it doesn't say much. All equations used throughout the “Automatic Classification of Support Tickets” process must be displayed.

 

13. Very good bibliography.

I request to make all the corrections indicated and the corrections of the other reviewers.

Comments for author File: Comments.pdf

Author Response

1. Very interesting research entitled “Improving the Automatic Classification of Support Tickets using Hyperparameter Black-box Optimization”.
Thank you very much for the positive assessment.


2. I suggest adapting the article to the suggested structure by algorithms-MDPI.
** Check "Microsoft Word template" from algorithms-MDPI. https://www.mdpi.com/files/word-templates/algorithms-template.dot
We modified the manuscript in order to take this observation into account.


3. The design of the tables must be according to the journal algorithms-MDPI. See the image. Change tables 1, 2 and 3 according to the format. (See attached file).
We modified the tables in order to take this observation into account.


4. The bibliographical references of the article are in this order:
· Reference [1] is in line 27.
· Reference [31] is in line 35.
· Reference [24] is in line 38.
· Reference [8] is in line 40.
· Reference [16] is in line 42.
· Reference [30] is in line 44. Etc.
References must be placed in ascending order. Correct.
In the revised version of our manuscript we placed the references in ascending order.


5. The paragraph on lines 142-150 should be deleted. It is not necessary to comment in advance what will be covered in later sections.
In the revised version of our manuscript we removed that paragraph.


6. The figures must be placed after finishing a paragraph. In the paragraph of lines 195-207, figure 1 is in the middle. Correct.
In the revised version of our manuscript we placed the figures after the end of the paragraph. However, LaTeX does not allow exact positioning of figures; these issues are usually handled by the editorial team.


7. The method of capture or scanning of the tickets is not indicated: their format, the data they contain, their pre-processing, and whether they are text or images. Show images of different ticket formats.
The tickets are just text; an image of a ticket is not really significant. Most of them were typed or handwritten by operators of the contact center.


8. Figure 1 shows: Classification Algorithm, Blackbox Optimization, Training Set extraction, Training with Hyperparameters and Testing and Performance Evaluation. A demonstration of each of the stages is required.
We explained all those parts in the subsequent parts of our manuscript.


9. Elaborate an algorithm of the process of the process indicated in figure 1. I suggest that the algorithm in this article use the following format: (See attached file).
In the revised version of our manuscript we elaborated an algorithm of the process in the format, kindly suggested by the Reviewer.


10. In lines 195-196 it says “corpus composed of the of text of 15,000 real tickets”. It is convenient to show an image of the Tickets. Also, show the process to get “each ticket has been converted into a vector with elements” (lines 200-201).
As explained before, the tickets are just text, so an image of a ticket is not really significant. The procedure to convert tickets into vectors is the word embedding phase; it was not invented by us, and we provided a reference for it.


11. The use of “Convolutional Neural Networks (CNN) [12] and Support Vector Machines (SVM) [6] classifiers” (lines 213-213) is very confusing. Explain their implementation in detail.
The implementations used in our experiments are those of standard Python packages: the keras library for the CNN and the scikit-learn svm library for the SVM. The scikit-learn package is described in reference Pedregosa et al. (2011).


12. There is only one equation in the entire article and it doesn't say much. All equations used throughout the “Automatic Classification of Support Tickets” process must be displayed.
We did our best to display all relevant equations, however most of the steps are better explained with textual description.


13. Very good bibliography.
Thank you for the appreciation.
We are grateful to the Reviewer for the interesting comments.

Author Response File: Author Response.pdf

Reviewer 3 Report

1. Please highlight the novelty in the sense of Fig. 1.

2. The sentence "at the formal level using as much as possible a data-driven approach" does not carry technical significance.

3. It is not obvious how the authors embed the input data into the 2D layers of the CNN.

4. The paper could also describe more widely, or even compare its results with, modern tuning methods such as NetAdapt or other AutoML techniques.

5. The paper does not touch on the state of the art in NLP in any sense. At least the authors could mention the SOTA in the introduction, or in the conclusion show how their results could be used in modern models.

6. It would be appreciated if the paper included either a repository reference or an in-depth description of the proposed algorithm, as a scheme or in any other form.

7. Informally, it would be better to include more informative materials, such as figures, tables, and algorithms.

Author Response

1. Please highlight the novelty in the sense of Fig.1
As explained in the manuscript, the original content is:
1) A methodology to solve the difficult practical problem of categorizing brief texts written in natural language on the basis of their semantic content. We consider the practical application of categorizing support tickets; however, this problem arises in many different contexts. We convert this problem into a multiclass classification problem and solve it by using a number of text mining steps.
The proposed methodology can also be applied to other text categorization problems in other contexts, and also in different languages, since it works by taking as input a dictionary of the language. If we use a dictionary of another language, the procedure will still work.
2) An approach to the difficult problem of determining the hyperparameter configuration of a generic machine learning procedure (not only neural networks, as opposed to other proposed algorithms). This approach is purely formal, hence it can solve other problems of parameter optimization, dealing with parameters assuming integer, continuous or even categorical values. And it is well known that the world of algorithms is pervaded by parameter tuning problems.


2. The sentence "at the formal level using as much as possible a data-driven approach" does not carry technical significance.
In the revised version of our manuscript we modified that sentence, however we cannot remove the concepts.
With "at the formal level" we mean that our procedure is able to work independently of the content of the tickets, so it does not require predefined knowledge of the context or other specificities of the problem. This type of characterization is often used in computer science; however, we tried to be clearer in the manuscript.
A "data-driven approach" is another well-known characterization in data mining, and our procedure can be defined so because, for several choices, it does not rely on predefined decisions but extracts the information from the data. In particular, the classes are not predefined but are generated from the available data, the final hyperparameters are not fixed by experts of the field but optimized using the actual data that we need to classify, and so on.


3. It is not obvious how the authors embed the input data into the 2D layers of the CNN.
The CNN implementation used is that of the Python library keras. The input to the second layer is the standard one of that library.


4. The paper could also describe more widely, or even compare its results with, modern tuning methods such as NetAdapt or other AutoML techniques.
Some of these packages were preliminarily tested. However, even if the methodology behind them is reasonable and was published, the available software produced, at least in our experience, quite unstable and not very satisfactory results. Thus, instead of spending time trying to make them work better, we judged it much more "bulletproof" to conduct a grid search.
With regard to the grid search, it was clear enough that a search over the same search space should produce the same result; on the other hand, the time required by such a grid search would be impracticable. Therefore, we deemed it more informative to consider a grid search limited to a number of combinations decided in advance, such that the computation time would be reasonable. However, in the revised version of our manuscript, we also included a comparison with a grid search working on the largest search space that could be solved in practice with our computational resources of time and space (in the case of SVM it was the same search space as the black-box algorithm).


5. The paper does not touch on the state of the art in NLP in any sense. At least the authors could mention the SOTA in the introduction, or in the conclusion show how their results could be used in modern models.
We use the word embedding technique of Gensim word2vec to represent and extract terms, and deep learning networks for classification. That seems state of the art to us; perhaps the Reviewer could be more specific on this point.


6. It would be appreciated if the paper included either a repository reference or an in-depth description of the proposed algorithm, as a scheme or in any other form.
In the revised version of our manuscript we tried to go deeper into the algorithm description. Actually, many of the steps reported in Figure 1 are known techniques, so we provided references to the original publications instead of rewriting them.


7. Informally, it would be better to include more informative materials, such as figures, tables, and algorithms.
In the revised version of our manuscript we did our best to take into account this Reviewer's concern.
We are grateful to the Reviewer for the interesting observations.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I am grateful to the authors who have made the indicated corrections.
