1. Introduction: AI and Explainability
Presently, transparency is one of the most critical words around the world [1,2]. We perceive it as the quality of being able to easily see or understand the actions of others, implying openness, communication, and accountability [3]. In computer science, transparency has historically been a significant challenge, and it has been addressed in different ways by different disciplines. We can mention here algorithmic transparency, algorithmic accountability [4,5,6,7,8] and, more recently, interpretable/explainable systems [9,10,11,12,13,14,15,16]. The common goal of all the above disciplines is that the actions of algorithms must be easily understood by users (experts and non-experts) when we execute them in a particular context.
On the other hand, Artificial Intelligence (AI) is impacting our everyday lives, improving decision-making and performing intelligent tasks (image and speech recognition, image and speech generation, medical diagnostic systems and more). However, despite the problem-solving capabilities of AI programs, they lack explainability [17]. For example, deep learning algorithms use complex networks and mathematics, which are hard for human users to understand. AI must be accessible to all human users, not only expert ones. Furthermore, this lack of explainability can be a severe problem when an unexpected decision needs to be clarified in critical contexts: the medical domain/health care, judicial systems, the banking/financial domain, bioinformatics, the automobile industry, marketing, election campaigns, precision agriculture, military expert systems, security systems and education [18]. Therefore, it is a current challenge to understand how and why machines make decisions. Even the European Union has fostered a framework of regulations to provide the scientific community with guidelines for obtaining satisfactory explanations from the execution of algorithms governing decision-making [19].
In this regard, a current challenge is to achieve explainability in AI [13]. In the literature, several computational methods have been proposed to explain predictions from supervised models [20,21,22]. On the other hand, there exist works whose objective is to propose visualization and animation techniques. Among the different kinds of visualization and animation methods, we can mention the following:
Projections of network outputs to visualize decisions in a K-dimensional space. This approach aims to present a set of images for all training vectors [23].
Saliency maps (or heatmaps) to simplify and/or change the visual representation of a picture into something that is easier and more meaningful to analyze [24,25].
Colored neural networks to show the changes in the states of neurons [26].
Visualization of programs as images using a self-attention mechanism for extracting image features [27].
Animations using the Grand Tour technique [28], which offer advantages with respect to classical visualization techniques [29,30].
Despite the facilities offered by graphical tools, they are sometimes not easy to interpret, especially when non-expert users are in charge of performing this task. For this reason, Interactive Natural Language Technology for Explainable Artificial Intelligence has recently been proposed [31]:
“the final goal is to make AI self-explaining and thus contribute to translating knowledge into products and services for economic and social benefit, with the support of Explainable AI systems. Moreover, our focus is on the automatic generation of interactive explanations in NL, the preferred modality among humans, with visualization as a complementary modality [...]”. This paper follows this line of work.
On the other hand, NL is not the only, nor always the most efficient, way for humans to communicate with computers. The use of video images in the context of an expert system has shown us that simply attaching pictures is not enough to make communication efficient and attractive [32]. The combination of graphics and text is an effective explanation strategy [33] because users can focus their attention on the explanatory information in the text [34], while graphics allow users to form mental models of the information explained in the text. The combination of visual and textual information is a useful explanation strategy when it meets four conditions [35]: (1) the text is easy for all users to understand; (2) the visuals are created and evaluated based on the user’s comprehension; (3) the visual representations are employed to provide users with good explanations of the texts; and (4) users have little or no previous experience with the content. Additionally, video is a ubiquitous data type not only for viewing (reading) but also for daily communication and composition (writing) [36].
For all the above considerations, we propose Automatic Video Generation (AVG) as a new interactive NL technology for Explainable AI whose objective is to automatically generate step-by-step explainer videos to improve our understanding of complex phenomena dealing with a large amount of information, Hopfield Neural Networks (HNNs) in our case. The idea is to combine in the same framework explanations in NL with visual representations of the data structures and algorithms employed in a machine learning process, and to put them together in a sequence of frames (or slides). The result is a video presentation similar to those created by teachers, experts, or researchers when they need to explain something to an audience. Our approach is strongly context-dependent; hence, we will use this feature to provide users with factual and realistic explanations.
Moreover, storytelling techniques help us to include global commentaries. Additionally, as we produce videos (mp4, slides, ppt) as output, the user can analyze them in the classical way: forward, backward, pause, slow, fast, and so on. Another significant advantage is that a complete understanding of the models could require several examples; our approach can quickly generate several videos from different samples with a small number of changes. As this is a first approximation, we have some limitations; we highlight two: (i) the difficulty of generating videos with large neural networks; and (ii) the limited degree of sophistication of the sentences generated in natural language. These limitations do not diminish the fact that our proposal is promising, and both will be addressed in future works.
It is important to highlight that, although we employ our video-generation method for explaining the execution of an HNN, it could be applied to any problem whose solution involves the implementation of a certain algorithm (classical or machine learning) because we use the execution traces to automatically obtain the sequence of frames that will compose the future explainer video. More formally, our research (hypothesis, main objective, methodology, results and evaluation) is explained and formalized in Figure 1 and Figure 2. Additionally, our proposed method is presented and explained in Figure 3 and Figure 4 using block diagrams [37].
In the literature, we can find several works about automatic video generation which differ considerably from our proposal [38,39]. In [38], a machine learning framework to generate videos called ArrowGAN is presented, and in [39] a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos is proposed. The main difference is that, in our case, videos are automatically generated from execution traces (symbolic information) to obtain a better understanding of machine learning algorithms, whereas in the mentioned approaches videos are employed to generate other videos with different objectives. On the other hand, there is another line of work which aims to automatically generate textual explanations of software projects (code) [40]. That work focuses on the generation of function descriptions from example software projects and is applied to study the related problem of semantic parser induction. It employs code as the main resource of communication; no video is generated. To the best of our knowledge, this is the first time that data-to-text techniques based on fuzzy logic and software visualization techniques have been employed to automatically generate explainer videos from execution traces.
In summary, the most relevant contributions of this paper are:
We propose a new technology to automatically generate explainer videos from HNN execution traces, with the objective of better understanding this process.
Our approach provides users with visual and textual explanations using data-to-text and software visualization techniques.
We explain in detail a methodology and propose a software architecture.
We present an application to generate an explainer video for a real pattern recognition problem.
The rest of the paper is organized as follows. Section 2 introduces several preliminary concepts about Hopfield neural networks and provides the reader with a very brief review of the state of the art in the different areas involved. Then, in Section 3, a methodology and a software architecture based on CTP for the automatic generation of explainer videos are presented in detail. Section 4 explains the most important tasks, namely the content determination of frames (how the information will be delivered to the user) and frames planning (in which order the information will be delivered to the user). Section 5 explains how narration scripts can be used to support the visual and textual explanations during the visualization of the videos. In Section 6, technical details about the implementation are given, and the main algorithms are presented and explained. Afterwards, Section 7 explains the evaluation carried out on a particular video. Finally, Section 8 provides some concluding remarks and presents future work, with special attention to showing how and why we can use AVG as a powerful teaching-learning resource.
4. Frames Generation and Local NLG
4.1. Content Determination of Frames
In this section, we explain the content determination of the frames. Content determination identifies the information we want to communicate through the generated explainer video. Therefore, the video frames will be a set of explanations in NL together with graphical representations. The central concept here is the neural network. To define a neural network data type, we need two types of data: a set of neurons and a set of connections between them. Therefore, the content of each frame provides users with information about the values of these types. We employ the UML language for defining the data types and their relations.
In particular, Figure 5 shows the different data types and their relations (Neuron, Connection, Neural Network). Each of them contains fields (attributes or properties) and code in procedures (often known as methods). We represent and store the information generated during the HNN execution in a declarative log (ontology). We define each fact “frame” as a tuple of four elements: a unique identifier that indicates the order of execution; an action label that tells us which operation was performed (initialization, neurons, connections, iteration, and so on); information about the values of the data structures employed by the HNN; and, in some cases, a set of messages. Different kinds of messages can be generated for each frame. For example, the execution of an HNN can generate a declarative log of this kind (we omit some details for simplicity):
frame(1, neurons, [N0,N1,N2,N3,...,N6,N7,N8,N9],
      "The neurons defined for the Hopfield network are:").
frame(2, connections, [(N0,N1,0.0), ..., (N9,N9,0.0)],
      "The connections between neurons have been established:").
frame(3, pattern, [1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0],
      "The input pattern is:").
frame(4, iteration#1, [w(0,1,1.0)],
      "Neuron N0 is enabled. Neuron N0 is connected with neurons N1 and N2.").
frame(5, iteration#2, [w(0,2,1.0), w(1,2,1.0)]).
...
frame(13, iteration#9, [w(0,9,-1.0), ..., w(8,9,1.0)]).
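In a concrete implementation, each such logged fact can be mirrored by a simple data holder. The following Java sketch is purely illustrative (the class and field names are our assumptions, not the published API):

import java.util.List;

// Illustrative Java 11 counterpart of a logged fact frame(Id, Action, Data, Messages).
public final class Frame {
    final int id;                // unique identifier: order of execution
    final String action;         // e.g., "neurons", "connections", "pattern", "iteration#1"
    final List<Object> data;     // values of the HNN data structures at this step
    final List<String> messages; // optional NL messages attached to the fact

    public Frame(int id, String action, List<Object> data, List<String> messages) {
        this.id = id;
        this.action = action;
        this.data = data;
        this.messages = messages;
    }
}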
Additionally, we employ graphical representations to provide users with useful visual explanations about the changes produced in the internal states of an HNN during execution. These representations will be essential in the next phases to obtain simple, informative and intuitive messages about the neurons, connections, neural network and training/pattern recognition tasks. We define two types of visual explanations, namely: first-order representations, which visually explain low-level data structures, variables or values involved in the execution process; and second-order representations, which model more abstract figures. An essential feature of our approach is that the system generates the second-order representations from the first-order ones automatically, in a similar way to how computational perceptions are created in the LDCP paradigm. Figure 6 shows both types of representations. We represent neurons, connections, and weights as a matrix of weights on the right side. On the left side, a classical graph represents the activation of the neurons: the vertices are the neurons, the edges are the connections, and the labels are the weights. Each visual explanation will be associated with a set of messages explaining the pieces of information shown in it.
4.2. Frames Planning
In essence, a plan is a sequence of actions that brings a system from an initial state to a goal state. A common approach to AI planning involves defining a series of atomic steps to achieve a communication goal. When the communication goal is to explain an execution of HNNs, the plan is partially guided by the implementation and the execution trace, which provides the planner with an ordered sequence of frames. In our case, the steps needed to explain the HNN execution are as follows (see Figure 7):
Initialization. The system (using content determination resources) visually shows and textually explains the neurons, connections, neural network weights and input pattern.
Training. The system (using content determination resources) visually shows and textually explains the training procedure step by step. The system can employ a set of frames to provide the user with useful information about it.
Pattern Recognition. The system (using content determination resources) visually shows and textually explains the pattern recognition procedure step by step. The system can employ a set of frames to provide the user with useful information about it.
Additionally, as the system uses a particular plan to simulate a teacher’s presentation, it incorporates some extra frames before explaining the training and pattern recognition tasks. We add three more slides to the original sequence: an initial presentation (title, contact, date, and so on); an index with brief explanations of each item; and a slide (or several) for conclusions. A minimal sketch of this planning step is shown below.
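As a hedged illustration of this step (reusing the illustrative Frame holder sketched in Section 4.1; all names are hypothetical, not the published API), the planner simply wraps the trace-ordered frames with the three extra presentation slides:

import java.util.ArrayList;
import java.util.List;

// Illustrative planner: the execution trace already fixes the order of the
// explanatory frames, so planning reduces to adding the extra slides.
public final class FramePlanner {

    public static List<Frame> plan(List<Frame> trace) {
        List<Frame> slides = new ArrayList<>();
        slides.add(extra("title"));        // initial presentation (title, contact, date, ...)
        slides.add(extra("index"));        // index with brief explanations of each item
        slides.addAll(trace);              // initialization, training and pattern recognition
        slides.add(extra("conclusions"));  // one (or several) conclusion slides
        return slides;
    }

    private static Frame extra(String kind) {
        return new Frame(-1, kind, List.of(), List.of());
    }
}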
6. Technical Details about the Implementation
In this section, technical details about the implementation are given. The first implementation has been developed in Java 11. Several explainer videos and the source code can be watched and downloaded from the official web page of this paper (youractionsdefineyou.com/videogeneratorhopfield/, accessed on 19 June 2021). The complete implementation is summarized using UML diagrams (see Figure 9).
The implementation is formed by two packages: the Hopfield neural network package and the video-generation package. The first one implements all the functionality required to execute the HNN (training and pattern recognition). The execution of an HNN consists of three steps: (i) receiving as input an array of 1 and −1 values (the pattern); (ii) training the HNN to recognize this pattern; and (iii) recognizing any array received as input. This package has a class called ExecutionTrace for storing all the information about the HNN execution. The result is a structured file containing a sequence of rows with information about the execution. Each row is formed by an identifier (ID) that indicates the order in which the row was generated and an action label (LABEL_ACTION) that tells us which operation was performed (initialization, creation of neurons, creation of connections, creation of the neural network, weight updates, and so on). After that, a set of values D1, ..., DN provides information about the results obtained after the execution of the indicated action. Finally, a set of messages may also be included to support the video-generation process.
ID0;Label_Action;D1;D2;...;DN; M1;...;MN;
ID1;Label_Action;D1;D2;...;DM; M1;...;MM;
...
IDK;Label_Action;D1;D2;...;DL; M1;...;ML;
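A row of this structured file can be parsed straightforwardly; the sketch below is a simplified, hypothetical stand-in for the CSV2Frame class mentioned next (the split between data values and messages is action-dependent, so here they are kept together):

import java.util.Arrays;
import java.util.List;

// Illustrative parser for one execution-trace row of the form
// ID;LABEL_ACTION;D1;...;DN;M1;...;MN;
public final class TraceRowParser {

    public static void parse(String row) {
        String[] fields = row.split(";");
        int id = Integer.parseInt(fields[0].trim());   // order of generation
        String action = fields[1].trim();              // operation performed
        List<String> dataAndMessages =                 // remaining fields
                Arrays.asList(fields).subList(2, fields.length);
        System.out.printf("row %d (%s): %s%n", id, action, dataAndMessages);
    }
}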
The video-generation package implements the functionality needed to generate an explainer video from the information created by the execution trace package. The main classes are CSV2Frame, FrameCreator, First-Order Representation, Second-Order Representation and VideoGenerator. The main Algorithm A1 receives as input an execution trace (a list of rows) and returns a video in mp4 format. It reads each line one by one and creates a slide using the information associated with each row by calling the function Create-Slide(row[]) (Algorithm A1, line 5 in Appendix A).
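The following self-contained Java sketch approximates this main loop under our own simplifying assumptions (the stubs stand in for the real FrameCreator and VideoGenerator classes; Algorithm A1 in Appendix A remains the authoritative version):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative main loop: one slide per execution-trace row, then video assembly.
public final class ExplainerVideoSketch {

    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Path.of("trace.csv")); // execution trace
        List<Object> slides = new ArrayList<>();
        for (String row : rows) {
            slides.add(createSlide(row.split(";")));  // Create-Slide(row[])
        }
        encodeVideo(slides, "hopfield.mp4");
    }

    // Stub: the real implementation dispatches on the action label and calls the
    // first-order, second-order and textual-explanation functions.
    private static Object createSlide(String[] row) {
        return "slide for action " + row[1];
    }

    // Stub: the real VideoGenerator class produces an mp4 from the slide images.
    private static void encodeVideo(List<Object> slides, String out) {
        System.out.println("Encoding " + slides.size() + " slides into " + out);
    }
}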
The function Create-Slide(row[]) creates an empty slide and then calls a sequence of functions whose objective is to create, where appropriate, a frame with visual and textual explanations. The first and second of these functions generate visual representations from the information received as parameters. The third function generates textual explanations from the visual ones. It is important to mention that we do not always employ this third function in the current implementation, but it will be an essential resource for explainability in the future. Algorithm A2 shows pseudocode to create a new frame, and Algorithm A3 shows how we implement the function first-order-representations. As we can observe, this function creates a frame depending on the action received as a parameter. For example, we explain two of them here: a simple one (images-text(pattern,info)) and a more complex one (images-text(iteration,info)), which performs what we call a complete cycle of representation (see Appendix B for more details).
The function images-text(pattern,info) generates an explainer frame about the input pattern from an array of information info[] that contains the explanations generated for the data structures used to implement this element. This function uses 2D atomic primitives to show the messages and create a graphical representation of the input pattern. Figure 10 graphically explains this process. Sometimes atomic primitives are not enough to implement a particular graphical representation, and we need to create additional functions. For example, we define the function drawPattern, which allows us to draw a visual matrix with different options: the matrix alone, showing indexes, or showing values. The options are indicated by the parameters: if wn=true, it creates a visual matrix showing the values; if wi=true, it creates a graphical matrix with indexes (see Figure 10).
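To make this concrete, here is a hedged Java sketch of what such a helper could look like, drawing with java.awt.Graphics2D (the signature and rendering details of the real drawPattern may differ):

import java.awt.Color;
import java.awt.Graphics2D;

// Illustrative drawing helper: renders the input pattern as a row of cells,
// optionally annotated with indexes (wi) and/or numeric values (wn).
public final class PatternDrawer {

    private static final int CELL = 40; // cell size in pixels

    public static void drawPattern(Graphics2D g, double[] pattern,
                                   int x, int y, boolean wn, boolean wi) {
        for (int i = 0; i < pattern.length; i++) {
            int cx = x + i * CELL;
            g.setColor(pattern[i] > 0 ? Color.BLACK : Color.WHITE); // 1 -> black, -1 -> white
            g.fillRect(cx, y, CELL, CELL);
            g.setColor(Color.GRAY);
            g.drawRect(cx, y, CELL, CELL);
            if (wi) { // wi=true: show the index above the cell
                g.drawString(Integer.toString(i), cx + CELL / 2 - 4, y - 5);
            }
            if (wn) { // wn=true: show the numeric value below the cell
                g.drawString(Double.toString(pattern[i]), cx + 2, y + CELL + 14);
            }
        }
    }
}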
For the generation of a group of frames whose objective is to explain the execution of a particular phase, the system must perform a complete cycle of representation; i.e., for each frame, it generates a first-order model from the information provided by the neurons, connections and the matrix of weights at a particular instant of time during the execution, together with textual explanations about it. If the new representation needs to be clarified (a designer’s decision), the system can generate additional descriptions. Next, the system generates a second-order model from the first-order one. It aims to provide users with an alternative, more friendly representation of the first one (the system can also generate explanations for this representation). Figure 10 shows how this cycle is performed, and the concrete details about the implementation can be consulted in the source code on the official web page.
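A compact sketch of this cycle, under our own naming assumptions (the real first-order and second-order generators are those in the published source), could look as follows:

import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Illustrative complete cycle of representation for one frame: first-order
// model -> textual explanation -> second-order model derived automatically.
public final class RepresentationCycle {

    public static List<BufferedImage> cycle(double[][] weights, List<String> messagesOut) {
        BufferedImage firstOrder = drawWeightMatrix(weights);  // low-level view: matrix of weights
        messagesOut.add(explain(weights));                     // NL explanation of the same data
        BufferedImage secondOrder = deriveGraphView(weights);  // friendlier view: activation graph
        List<BufferedImage> views = new ArrayList<>();
        views.add(firstOrder);
        views.add(secondOrder);
        return views;
    }

    private static BufferedImage drawWeightMatrix(double[][] w) {
        // Placeholder canvas; the real system renders the weight values here.
        return new BufferedImage(w.length * 40, w.length * 40, BufferedImage.TYPE_INT_RGB);
    }

    private static BufferedImage deriveGraphView(double[][] w) {
        // Derived automatically from the first-order data, as in the LDCP paradigm.
        return new BufferedImage(400, 400, BufferedImage.TYPE_INT_RGB);
    }

    private static String explain(double[][] w) {
        return "The network has " + w.length + " neurons; their connections are shown.";
    }
}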
7. Evaluation
In this section, we evaluate different aspects of the explainer videos created using our method. To do this, we employ a standard evaluation instrument called LORI (Learning Object Review Instrument) [61]. This instrument allows us to evaluate relevant aspects of the videos by applying multiple-choice questionnaires about them: video quality, content quality, motivation and design/presentation. The questionnaires were answered by a group of nine experts in AI and higher education after they had watched the explainer video.
We have designed and created four questionnaires formed by several questions related to the mentioned aspects (see Table 1, Table 2, Table 3 and Table 4). The possible answers range from Strongly Agree (5) to Strongly Disagree (1).
7.1. Video Quality
To evaluate the quality of the videos, we propose four questions about different aspects, namely: (i) how an expert perceives the similarity between the videos automatically generated using our method and those created by a teacher; (ii) the experts’ perception of the visual and textual explanations generated (two questions); and (iii) the experts’ perception of the usefulness of the videos in helping users understand the execution of an HNN. The questions are as follows: (1) “The automatically generated explainer video is close to or like the videos that could be created by a human (a teacher)”; (2) “The automatically generated explainer video shows useful visual explanations to understand the execution of a Hopfield network”; (3) “The automatically generated explainer video shows useful textual explanations to understand the execution of a Hopfield network”; (4) “The automatically generated explainer video presents the information in a way that favors the understanding of the execution of a Hopfield network”. The questions, answers and results are shown in Table 1 and Figure 11.
Now, we analyze the answers given by the experts. With respect to the first question (1), 44.4% of the experts strongly agreed, 44.4% agreed and 11.1% disagreed. With respect to the second one (2), 33.3% strongly agreed, 44.4% agreed and 22.2% disagreed. With respect to the third question (3), 11.1% strongly agreed, 55.6% agreed, 22.2% were neutral and 11.1% disagreed. Finally, with respect to the last question (4), 44.4% strongly agreed, 44.4% agreed and 11.1% were neutral. We can conclude that our explainer videos are a useful tool for gaining a better understanding of the execution of an HNN, but, as the average score for each aspect indicates, we will need to improve them in the corresponding aspects: the visual and, above all, the textual explanations.
7.2. Content Quality
Now, we evaluate the content of the frames; accuracy, balanced presentation of ideas and an adequate level of detail are the aspects taken into account. We propose five questions about these aspects: (1) “The resource presents the information objectively and with good writing”; (2) “The content does not present errors or omissions that could confuse the reader or lead to misinterpretation”; (3) “The statements are supported by evidence or logical arguments”; (4) “The information emphasizes the key points and the most significant ideas, with an appropriate level of detail”; (5) “Is the language used in the explanation appropriate for you?”.
We can conclude that the content of our explainer videos needs to be improved. This conclusion is directly related to the weaknesses detected in the previous step: the textual explanations are poor. The worst result was identified in the question for which some experts had doubts when choosing an answer. Therefore, all these aspects must be studied and improved in future works.
7.3. Motivation
Motivation is another important aspect to be evaluated. We can define it as the ability to generate interest in the audience. We propose three questions to evaluate it: (1) “The resource offers a representation close to reality that stimulates the interest of the student”; (2) “The duration of the content visualization favors the student’s attention”; (3) “Could students be motivated by this type of resource?”.
We conclude that the experts disagreed with the duration of the videos and their current format. However, most of the experts agreed with the question about whether the resource is motivating.
7.4. Design and Presentation
Finally, we evaluate two important aspects: design and presentation. In particular, we propose a questionnaire formed by six questions: (1) “The presentation of the video requires a minimum number of visual searches”; (2) “The graphs and tables are clear, concise and without errors”; (3) “Videos include narration”; (4) “Paragraphs are headed by meaningful headings”; (5) “The writing is clear, concise and without errors”; (6) “The color and design are aesthetic”.
We conclude that the experts have a neutral opinion about the aspects related to the design/presentation of frames. This confirms that narrative and storytelling are important issues to address in future works.
8. Conclusions and Future Work
A new framework based on data-to-text and software visualization techniques to automatically generate explainer videos about the execution of Hopfield neural networks has been presented. A design based on experience and human-computer interaction has been created and presented. A software architecture for automatic video generation based on the Computational Theory of Perceptions has been defined and explained in detail. It is formed by four modules:
Trace (for capturing the values of the data structures at each instant of time and other relevant information generated during the execution).
Frame generation/Local NLG (for creating slides containing texts and graphical representations about that information).
Global NLG/Storytelling (for incorporating texts that allow us to obtain a global narrative).
Video generation (for generating a video in mp4 format from a set of images).
Each module has been implemented and tested using the object-oriented programming language Java (version 11). We have focused on explaining the content determination and planning phases. Technical details about the implementation have been given, and the main algorithms and data structures implemented in this work have been explained. A real application for the generation of explainer videos from HNN execution traces has been designed, implemented and explained. Finally, we have positively evaluated that the videos are a useful tool for obtaining a better understanding of the execution of an HNN, but we will need to improve the visual and, above all, the textual explanations and the narrative in future works.
Therefore, as future work we would like to address the following challenges:
To improve the videos by incorporating sounds, voices and more advanced avatars.
To create an automatic style module providing users with options for the visual aspects of the video (colors, fonts, speed, and so on).
To investigate new methods for generating richer sentences and narratives.
To be able to work with large neural networks using intelligent analysis and automatic summarization of videos [62].
To employ links through which users can directly interact with the explainer video.
Special mention regarding future work relates to investigating how automatic explainer video generation can be used to produce video tutorials for teaching-learning processes. The main reason is that online videos have become one of the most consumed digital resources, and millions of viewers watch them on YouTube and other platforms. For example, a tutorial on skin retouching in Photoshop has been watched more than a million times, and a YouTube channel on Photoshop tutorials has more than 170 thousand subscribers. In higher education, videos have also become an essential resource for students. They are integrated and employed as a fundamental part of traditional courses, and they serve as a cornerstone of many of them. Moreover, they have become the primary information-delivery mechanism in online classes. In fact, several experts have pointed out that e-learning is the future of education. In this regard, Global Industry Analysts has projected that e-learning will grow in the future [63]. Additionally, the current pandemic has driven the use of video tutorials, which play an important role in the e-learning area. At the same time, it has been shown that presentation software instructions delivered on video have advantages with respect to those delivered on paper [64]. Currently, video tutorials (recorded or live) are a fundamental resource for students because they provide another perspective, which turns learning into a more dynamic, practical and effective experience [65]. For all these reasons, automatic explainer video generation can be a useful approach for digital and classical learning in the future.