In this section, we conduct experiments to evaluate the quality of the proposed datasets. In particular, we want to answer the following questions:
1. Does training a model on a dataset automatically translated into the target language produce better results than using a model trained on a dataset in a different language?
2. Is a dataset created directly in the target language better than a translated one?
3. Is a dataset created directly in Italian more robust than a dataset translated into Italian?
Throughout all of the experiments conducted to answer these questions, we used the following configuration. We used PyTorch on a machine with three NVIDIA GPUs. The input documents were all truncated at 512 tokens, and we used a maximum length of 128 tokens for the generated summaries. We used a learning rate of 0.00005, the Adam optimizer with betas = (0.9, 0.999) and ε = 1 × 10⁻⁸, and a linear learning rate scheduler, training for a maximum of 4 epochs. The batch size was set to six during training, except for mBART, where we used a batch size equal to one.
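For reproducibility, the following minimal sketch shows how this configuration could be set up, assuming the Hugging Face transformers library on top of PyTorch; the checkpoint name, the warm-up setting, and the training-set size below are placeholders, not details stated in the paper:

```python
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          get_linear_schedule_with_warmup)

# Stand-in checkpoint: substitute the IT5 or mBART model actually used.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

MAX_INPUT_TOKENS = 512    # input documents truncated at 512 tokens
MAX_SUMMARY_TOKENS = 128  # maximum length of generated summaries
BATCH_SIZE = 6            # 1 for mBART
EPOCHS = 4
NUM_TRAIN_EXAMPLES = 84_000  # placeholder: size of the training set

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,
                             betas=(0.9, 0.999), eps=1e-8)
num_training_steps = EPOCHS * (NUM_TRAIN_EXAMPLES // BATCH_SIZE)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Inputs are truncated at 512 tokens before being fed to the model.
inputs = tokenizer("Testo del documento...", max_length=MAX_INPUT_TOKENS,
                   truncation=True, return_tensors="pt")
```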
To answer the first question, we trained the IT5 and mBART models on the MLSum-It dataset, translated from Spanish into Italian, and then compared the results with those produced by the two Pegasus models reported in Table 3 and described in Section 2.
Table 3 reports the results in terms of various ROUGE measures and also analyzes the length of the summaries produced. From the numerical results reported in Table 3, we can conclude that the IT5 and mBART models trained on MLSum-It outperform the Pegasus results. In particular, all the metrics are better, with the exception of the average length of the summaries, which is longer for Pegasus. In conclusion, training a model on a dataset in the target language, even one automatically translated from another language, produces better results than those of a model trained on a dataset in a language different from the target one, however powerful that model may be.
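For reference, ROUGE measures such as those in Table 3 can be computed with standard tooling; the paper does not name its implementation, so the snippet below is only one possible choice, using Google's rouge_score package on a toy pair of summaries:

```python
from rouge_score import rouge_scorer

# Toy example: one reference summary and one generated summary.
reference = "Il modello genera un riassunto del documento."
generated = "Il modello produce un riassunto del testo."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(reference, generated)  # signature: score(target, prediction)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))
```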
The Fanpage and IlPost datasets proposed in this paper allow us to answer the second question and, at the same time, to better understand the quality of the proposed datasets. The results reported in Table 4 and Table 5 show higher ROUGE scores than the results in Table 3, so we can deduce that a dataset created in the target language is better than a translated one. Furthermore, the metrics of the models trained on the Italian datasets remain much higher than those reported by Pegasus. Another thing we can notice from Table 4 and Table 5 is that the length of the generated text depends strongly on the dataset (more than on the max length parameter of the model); in fact, on the IlPost dataset, we obtain a lower generation length for both mBART and IT5 than on Fanpage.
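To make this length observation concrete, a small helper such as the hypothetical one below can measure the average length of the summaries generated on a test set; the beam size and the word-level length count are our assumptions, not details stated in the paper:

```python
def average_summary_length(model, tokenizer, documents):
    """Average word count of summaries generated for a list of documents."""
    lengths = []
    for doc in documents:
        inputs = tokenizer(doc, max_length=512, truncation=True,
                           return_tensors="pt")
        # max_length bounds generation at 128 tokens, as in our setup; the
        # observed average length still tracks the training dataset's style.
        output = model.generate(**inputs, max_length=128, num_beams=4)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        lengths.append(len(summary.split()))
    return sum(lengths) / len(lengths)
```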
To answer the third and final question, and to better understand whether a dataset created directly in Italian is more robust than a dataset translated into Italian, we performed the cross-dataset experiments reported in Table 6. In particular, given the test set of a certain dataset, we fed it as input to the models trained on the training sets of the other datasets. From the results shown in Table 6, we can see that, on the IlPost test set, both IT5 and mBART trained on Fanpage exceed the results of the same models trained on MLSum-It. The same can be deduced from the results obtained on the Fanpage test set, i.e., the models trained on IlPost exceed the results of the same models trained on MLSum-It. In the last part of Table 6, on the MLSum-It test set, we see very similar ROUGE values that differ only slightly between the model trained on Fanpage and the one trained on IlPost (the difference in ROUGE-1 is approximately 0.6), so we can deduce that, apart from the average generated length, the two datasets are equivalent. In conclusion, we can say that it is always better to create a dataset directly in the target language. We can, therefore, see the two datasets proposed here as necessary resources for all those who want to create text summarization models for the Italian language.
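For clarity, the cross-dataset protocol behind Table 6 can be summarized with the following sketch; load_trained, load_test_set, summarize, and report_rouge are hypothetical helpers standing in for the actual experiment code:

```python
DATASETS = ["Fanpage", "IlPost", "MLSum-It"]

for train_name in DATASETS:
    model, tokenizer = load_trained(train_name)            # hypothetical helper
    for test_name in DATASETS:
        if test_name == train_name:
            continue  # Table 6 reports only the cross-dataset pairs
        documents, references = load_test_set(test_name)   # hypothetical helper
        predictions = [summarize(model, tokenizer, doc) for doc in documents]
        report_rouge(train_name, test_name, predictions, references)
```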
Human Evaluation
In addition to the evaluation metrics used, we also show and visually analyze some summary examples produced by the various models used in this paper. To allow readers to deepen this visual and numerical analysis, we also make the code, the trained models, and a web page with a directly usable demo available to the scientific community.
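As a minimal sketch of how the released models can be tried out, the following snippet uses the transformers summarization pipeline; the checkpoint identifier is a placeholder for one of the published models, not its actual name:

```python
from transformers import pipeline

# Placeholder identifier: substitute one of the released checkpoints.
summarizer = pipeline("summarization", model="<released-checkpoint>")

article = "Testo di un articolo di cronaca in italiano..."
result = summarizer(article, max_length=128, truncation=True)
print(result[0]["summary_text"])
```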
As the first document to be analyzed, we took the fifth example from the Fanpage test set, and obtained the summary generated by the IT5 model trained on Fanpage. The produced summary is as follows:
“Mary Katherine, una giovane ragazza alle prese con il mondo sconosciuto della flora che ci circonda, si risveglia un giorno e scopre di non essere più a casa sua, ma di essere stata trasportata in un altro universo. un viaggio epico, con un cast di doppiatori originali da brivido.”
(English: “Mary Katherine, a young girl grappling with the unknown world of the flora that surrounds us, wakes up one day and discovers that she is no longer at home, but has been transported to another universe. an epic journey, with a thrilling cast of original voice actors.”)
The summary produced in this case was judged by a human evaluator to be “a great summary with no grammatical errors”, i.e., practically very similar to a summary produced by a human. We can compare the previous summary generated by the model with its expected summary, or ground truth, shown below:
“Chris Wedge, regista de “L’Era Glaciale” e “Rio”, porta nelle sale una fantastica storia ambientata nel mondo, a noi ignoto, dei piccoli esseri che vivono nella natura circostante. Sensazionale.”
(English: “Chris Wedge, director of ‘L’Era Glaciale’ (Ice Age) and ‘Rio’, brings to theaters a fantastic story set in the world, unknown to us, of the small beings that live in the surrounding nature. Sensational.”)
We can see that the model understood the content of the input document very well and generated a correct summary, although it describes the content from a different point of view. We can, therefore, conclude that the model is well trained on the proposed dataset and that it is able to generalize very well.
As a second example to analyze, we selected the first example from the MLSum-It test set. In this case, we want to compare the summary produced by the mBART model trained on the IlPost dataset with the summary produced by the same mBART model trained on the Fanpage dataset. Using mBART trained on Fanpage, we obtain the following summary:
“José Ortega Cano, dopo un mese e mezzo ricoverato nell’ospedale Vergine Macarena di Siviglia dopo aver subito un grave incidente stradale (nel quale è morto Carlos Parra, l’autista dell’altro veicolo coinvolto nel sinistro) ha lasciato ieri mattina il centro ospedaliero dove ha dedicato alcune parole ai numerosi mezzi di comunicazione nazionale.”
(English: “José Ortega Cano, after a month and a half in the Vergine Macarena hospital in Seville following a serious car accident (in which Carlos Parra, the driver of the other vehicle involved in the crash, died), left the hospital yesterday morning, where he addressed a few words to the numerous national media outlets.”)
When we use the same mBART model, but trained on IlPost, we obtain the following summary:
“L’ospedale di Siviglia dove è morto il defunto. José Ortega Cano è in ospedale da un mese e mezzo, dopo un incidente stradale in cui è morto Carlos Parra.”
(English: “The Seville hospital where the deceased died. José Ortega Cano has been in the hospital for a month and a half, after a car accident in which Carlos Parra died.”)
The first difference we can see between these last two generated summaries is their length: the summary generated by the model trained on Fanpage is longer than that generated by the model trained on IlPost, but both are valid summaries for the input example provided. The second summary has a very assertive first sentence and a second sentence that better explains the content of the article, which is exactly the style we find in every summary of the IlPost dataset. The first summary, instead, is more uniform and has a less journalistic style, but it is equally correct. With this last example, we can easily understand that metrics such as ROUGE are not enough to evaluate summarization models, because even very different summaries can correctly represent the same input text; thus, a human evaluation is useful.
As the last example to analyze, we show the summary of the second example extracted from the Fanpage test set and generated by the IT5 model trained on Fanpage:
“Il Pentagono ha appena annunciato che sta testando un sofisticatissima IA che ha l’obiettivo di prevedere con “giorni di anticipo” eventi di grande rilevanza nazionale e internazionale, analizzando dati da molteplici fonti.”
(English: “The Pentagon has just announced that it is testing a highly sophisticated AI that aims to predict, ‘days in advance’, events of great national and international relevance, by analyzing data from multiple sources.”)
The latter is also a good summary, but it contains some grammatical errors, such as “un sofisticatissima” instead of “una sofisticatissima”. Grammatical errors of this type are sometimes generated, so the model is not perfectly trained; however, considering that humans can also make grammatical errors, we can conclude that, in general, good results can be achieved using the proposed datasets.