In this section, we introduce important work and recent best approaches in the areas we are interested in. To properly understand the approaches used by the proposed systems and their limitations, it is necessary to know the metrics used to evaluate them. We therefore begin by defining them. Next, we introduce predictive justice approaches, and we end with dialogue systems. For more details on the presented techniques, we recommend two reference manuals: [3] for Natural Language Processing (NLP) and [4] for Information Retrieval (IR).
2.1. Evaluation Metrics
In IR and classification, the following metrics are often used for binary classification. For each metric, one can calculate the results by class and aggregate them afterwards if the problem has more than two classes. Instances of the positive class, correctly classified, are called True Positives (TP). True Negatives (TN), False Positives (FP) and False Negatives (FN) are defined in the same way.
The precision is the proportion of truly positive instances among those identified as positive by the system: $P = \frac{TP}{TP + FP}$.
The recall is the proportion of truly positive instances that have been correctly classified by the system: $R = \frac{TP}{TP + FN}$.
The accuracy is the proportion of instances, both positive and negative, that were classified correctly: $A = \frac{TP + TN}{TP + TN + FP + FN}$.
Accuracy can be used if the classes are well balanced, but will give biased results towards the majority class otherwise. For example, with a distribution of 90% of the instances in the positive class and 10% in the negative class, the rule "always assign to the positive class" would get 90% accuracy.
The F-measure is the harmonic mean of precision and recall: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$.
In general, this metric is the best choice if the classes are not well balanced.
Each of the aforementioned metrics is defined in the case of a binary classification, and must be aggregated to evaluate a multi-class classification system. Three different averages are then used: the micro, macro and weighted averages, which are defined and illustrated hereafter.
For example, consider a classification task with three classes A, B and C, where each example belongs to exactly one class; the results provided by the classification system are reported in the confusion matrix in Table 1.
In the case of the micro-average, one evaluates all the classes together. For the micro-precision (microP), one counts the totals of TP and FP over all classes using the confusion matrix, then one calculates $\text{microP} = \frac{TP_A + TP_B + TP_C}{(TP_A + FP_A) + (TP_B + FP_B) + (TP_C + FP_C)}$.
For the macro-precision (macroP), the precision for each class is calculated (noted $P_A$ to $P_C$), then the average of these values is taken. Using the same example, one computes $P_A$ from the confusion matrix, and likewise $P_B$ and $P_C$. Macro-precision is therefore: $\text{macroP} = \frac{P_A + P_B + P_C}{3}$
The weighted average assigns a weight to each class equal to the number of instances associated with it. In our case of 22 instances, there are, respectively, five instances of A, 10 of B and 7 of C. The weighted precision (wP) is therefore $\text{wP} = \frac{5 P_A + 10 P_B + 7 P_C}{22}$.
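As a concrete sketch, the three averages above can be computed directly from a confusion matrix. The matrix below is hypothetical (the counts of Table 1 are not reproduced here); it is merely chosen so that the class supports match the 5/10/7 split of the 22 instances:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted),
# with 5 instances of A, 10 of B and 7 of C as in the example above.
# The diagonal holds the true positives of each class.
cm = np.array([
    [4, 1, 0],   # A
    [1, 8, 1],   # B
    [0, 2, 5],   # C
])

tp = np.diag(cm)              # true positives per class
fp = cm.sum(axis=0) - tp      # false positives per class (column total minus TP)
support = cm.sum(axis=1)      # number of instances per class

per_class_precision = tp / (tp + fp)

micro_p = tp.sum() / (tp.sum() + fp.sum())
macro_p = per_class_precision.mean()
weighted_p = (support * per_class_precision).sum() / support.sum()

print(micro_p, macro_p, weighted_p)
```

With a balanced matrix the three averages coincide; they diverge as the class supports or per-class precisions drift apart.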
Notation: In all this document, a variable in bold, for instance $\mathbf{v}$, denotes a vector.
The cosine similarity between two vectors allows for comparing these two vectors with each other, being equal to their scalar product divided by the product of their Euclidean norms. With two vectors $\mathbf{u}$ and $\mathbf{v}$, the cosine of the angle between $\mathbf{u}$ and $\mathbf{v}$ is: $\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$
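A minimal implementation of this definition, using NumPy (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Scalar product of u and v divided by the product of their Euclidean norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(u, v))   # 0.5: the vectors share one non-zero dimension
```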
2.3. Language Models and Recurrent Networks
While word embeddings are trained with the explicit goal of retaining semantic information in their representations, language models learn it implicitly. Indeed, the objective of language models is to capture the probability distributions of word sequences. Since the nature of language is not well understood and the universe of possible word sequences is extremely large, language models use certain simplifications to approximate the probability of appearance of a sentence. For example, n-gram models maintain a count of occurrences of word sequences of length n or less. To calculate the probability of appearance of a longer sequence, one simply multiplies the probabilities of the n-grams it contains. Once trained, these models can be used in concrete applications to represent words as one would do using word embeddings.
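The counting scheme can be sketched with a toy bigram (n = 2) model; the three-sentence corpus below is invented for illustration:

```python
from collections import Counter

corpus = [
    "i want a student visa".split(),
    "i want a work visa".split(),
    "i need a work permit".split(),
]

# Count unigrams and bigrams over the corpus.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev: str, word: str) -> float:
    """Maximum-likelihood estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence: list) -> float:
    """Approximate the sentence probability as the product of its bigram probabilities."""
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("i want a work visa".split()))   # (2/3)*(2/2)*(2/3)*(1/2) = 2/9
```

A real system would add smoothing for unseen n-grams; the maximum-likelihood counts above assign them probability zero.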
The first neural language models were learned using non-recurrent neural networks [15,16]. To handle contexts of variable size, one now uses recurrent encoder–decoder architectures instead [17]. The first of the two networks, the encoder, is a Recurrent Neural Network (RNN) which reads the words of the sentence to be translated one by one, and predicts the next word at each step. Once the sentence is complete, the hidden state contains a summary of the sentence. That summary, also called the context, is then fed to the second network, the decoder. This model then generates an output sequence based on that context, but also on its previous outputs as it goes on.
These two steps of the recurrent encoder–decoder are illustrated in Figure 3. For each symbol being read, the encoder updates its hidden state using the previous state as well as the current symbol. Once the last symbol has been read, the network has the context in its memory. Each symbol is predicted by the decoder using the context, the previously predicted symbol and the current hidden state. The latter is itself derived from the hidden state and the symbol of the previous time step. In our example, machine translation is the task used to jointly train the encoder and decoder models. While this model can be used as-is, the pre-trained encoder can then be used independently in other tasks with smaller corpora to represent documents.
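These two steps can be sketched as a forward pass with untrained, randomly initialized weights. The vocabulary, dimensions and update equations below are illustrative simplifications, not the exact model of [17]:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<bos>", "<eos>", "hello", "world", "bonjour", "monde"]
V, H = len(vocab), 8          # toy vocabulary and hidden sizes
one_hot = np.eye(V)

# Untrained random weights, for illustration only.
W_enc, U_enc = rng.normal(size=(H, V)), rng.normal(size=(H, H))
W_dec, U_dec, C_dec = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(H, H))
W_out = rng.normal(size=(V, H))

def encode(tokens):
    """Read the input symbols one by one; the final hidden state is the context."""
    h = np.zeros(H)
    for tok in tokens:
        h = np.tanh(W_enc @ one_hot[vocab.index(tok)] + U_enc @ h)
    return h

def decode(context, max_len=4):
    """Generate symbols from the context and the previously generated symbol."""
    h, prev, out = np.zeros(H), "<bos>", []
    for _ in range(max_len):
        h = np.tanh(W_dec @ one_hot[vocab.index(prev)] + U_dec @ h + C_dec @ context)
        logits = W_out @ h
        prev = vocab[int(np.argmax(logits))]   # greedy choice of the next symbol
        if prev == "<eos>":
            break
        out.append(prev)
    return out

context = encode(["hello", "world"])
print(decode(context))   # output is arbitrary until the weights are trained
```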
The recurrent encoder–decoder is based on a variant of the RNN, which uses the recurrence mechanism to model the order of words (or symbols) in the text, while Transformers are a newer type of neural network which uses the attention mechanism instead of recurrence. This mechanism, introduced by Bahdanau et al. [18], originally for automatic text translation, has become the source of performance improvements for many reference tasks in NLP. The general idea is to predict, at each timestep (token) in a sequence, the importance of each of the elements passed as input to the network. This is referred to as the attention that the network attaches to particular elements of the input sequence. In the case of translation, one part of the network learns the translation per se, and the other part learns to align the words or groups of words. Transformers [19] dispense with both recurrence and convolutions, relying entirely on the attention mechanism. This design completely eliminates the sequential dependency of recurrent networks, and thus allows for fully parallelized training. Although their training has a much higher overall computational cost than that of their recurrent equivalents, this advantage makes them faster to train, provided one has access to large parallel computing resources. As with word embeddings (see Section 2.2.2), what allowed the popularization of Transformers is the public availability of pre-trained networks, which can be used as-is to represent sequences of text, or fine-tuned to specific tasks in a reasonable amount of time to achieve even better performance. Devlin et al. [19] introduced a Transformer-based architecture called Bidirectional Encoder Representations from Transformers (BERT), whose pre-trained models are now very popular.
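The core computation can be illustrated with the scaled dot-product form of attention used by Transformers [19]; the dimensions below are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """For each query, weight every input element by its predicted importance."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention distribution over the input
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 output positions
K = rng.normal(size=(5, 4))   # 5 input elements
V = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))   # each row of weights sums to 1
```

Since every output position attends to every input element in a single matrix product, there is no sequential dependency between positions, which is what enables the parallel training discussed above.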
2.4. Chatbots
Chatbots could support the improvement of access to legal information. In this section, we describe some important contributions in this area. Chatbot systems can be divided into three categories: Question Answering Systems (QASs), "social" chatbots and focused dialog systems. The main characteristic of a QAS is the use of data of various types, such as web pages or knowledge graphs, to provide direct answers to user questions. Social chatbots are rather generalist systems, while focused dialog systems are usually dedicated to a specific task.
2.4.1. Question-Answering Systems
Search engines or other search systems return documents in response to user requests. These queries can consist of a combination of keywords, but can also be in natural language (without any particular constraints, notably in terms of vocabulary). As an example, a user seeking information on work authorizations could formulate the following queries:
- Student visa, work authorization.
- I'm on a student visa. Can I work?
While a search engine would return essentially the same documents for both queries, a QAS could use the information from the natural language query to answer it precisely. Rather than having to read the complete documents describing student visa rules, the user of the QAS could receive an answer such as "As a general rule, students can work 20 h per week during the semester, [...]". According to Jurafsky and Martin [20], QASs are of two types. The first type associates queries with logical representations to query structured knowledge bases. In the second, relevant documents are first retrieved using IR techniques, and then text understanding algorithms are used to extract the relevant sections.
In the case of a QAS whose role is to provide information to immigrants, a simple knowledge base could be a table associating a nationality of origin and a visa type to a list of pre-requisites, built from a set of nationalities and a set of visa types (both incomplete, for the purposes of the example). The corresponding database table is shown in Table 3.
Figure 4 presents a flowchart of the interactions with this QAS to answer the question "What are the pre-requisites to get a visa?". The interactions with the user must fill in the two unknowns (nationality and type of visa) before an answer can be given using the database in Table 3. Since the system follows very simple rules, it is entirely predictable and, assuming the information is correct and complete, the correct answer will be provided to the user in all cases.
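The flowchart logic can be sketched as a simple rule-based function; the table contents below are hypothetical stand-ins for Table 3:

```python
# Minimal sketch of the flowchart in Figure 4: the system asks for the two
# unknowns (nationality and visa type), then looks up the answer in a table.
REQUIREMENTS = {
    ("american", "work"): "a job offer",
    ("american", "student"): "proof of student status",
}

def answer(nationality=None, visa_type=None):
    if nationality is None:
        return "What is your nationality?"
    if visa_type is None:
        return "What type of visa are you applying for?"
    reqs = REQUIREMENTS.get((nationality, visa_type))
    return f"You need {reqs}." if reqs else "Sorry, no information for that case."

print(answer())                        # asks for the first unknown
print(answer("american"))              # asks for the second unknown
print(answer("american", "student"))   # answers from the table
```

Because every path through the function ends either in a clarifying question or in a table lookup, the behaviour is fully predictable, as noted above.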
It would be possible to apply this paradigm in a second step to systems such as the ones we are developing, in order to further facilitate access to information. However, this paradigm is not flawless, since it forces interactions to remain within a very rigid framework. The effort required to build the knowledge base is also much greater than that required to bring together an unstructured corpus of data. Work exists on extracting structured information from relatively homogeneous textual data: notably, Auer et al. [21] created an ontology from the set of Wikipedia pages and the relationships between them (the hypertext links). These approaches give access to a large relational knowledge base, at the cost of a decrease in data quality compared to manual work. Despite this limitation, this type of tool is an interesting source of improvement for further work. In particular, we have begun its integration into the NLP tools of the NBC to increase their coverage thanks to the automatic extraction of synonyms from this database.
An IR-based QAS could be designed using a corpus that includes the documents below:
- $d_1$: "If you are an American citizen, you need a job offer to apply for a work visa. To get a student visa, you only need to prove your student status. There is no working holiday visa option."
- $d_2$: "If you are French, you will need […]"
To answer the following queries $q_1$ and $q_2$, the first step consists of identifying the right document:
- $q_1$: "What are the prerequisites for a work permit for U.S. citizens?"
- $q_2$: "What are the prerequisites for Americans to get a work permit?"
This step can be performed using one of the methods presented in Section 2.1. A simple cosine similarity on BoW vectors would work with the $q_1$ query, which has vocabulary in common with $d_1$, but not with the $q_2$ query, which uses slightly different words. Trained embeddings would match both queries with $d_1$.
Various techniques can then be used to extract the relevant part of the document to answer the question. One usually begins by applying a grammatical analysis of the query to determine the type of question ("Who", "What", "When", etc.). It is then possible to use morpho-syntactic labeling tools to filter the results and make a cosine similarity comparison, for example, to select the most relevant sentence(s).
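The last step can be illustrated with a sentence-level cosine similarity over bag-of-words vectors; the document sentences and query echo the running example, and the regular-expression tokenizer is a simplification:

```python
from collections import Counter
import math
import re

def tokens(text: str) -> Counter:
    """Lowercased bag of words, ignoring punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two texts."""
    u, v = tokens(a), tokens(b)
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

document = [
    "If you are an American citizen, you need a job offer to apply for a work visa.",
    "To get a student visa, you only need to prove your student status.",
    "There is no working holiday visa option.",
]
query = "What are the prerequisites for a work visa for American citizens?"

# Select the sentence of the retrieved document closest to the query.
best = max(document, key=lambda s: bow_cosine(query, s))
print(best)
```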
2.4.2. General Chatbots
The design of chatbot systems not restricted to a specific task is difficult, and the evaluation of such systems is complex. The characterization of what constitutes a good conversation has inspired many authors but still remains an open research question [22]. The first generalist chatbot, ELIZA [23], was created to mimic a conversation with a psychotherapist. The algorithm simply consisted of associating certain keywords with predefined phrases. Thanks to Weizenbaum's [23] choice of "persona", the system received very positive feedback from users, without taking great risks, since it answered mainly with new questions.
The ALICE [24] system was the first to handle natural language input. The Artificial Intelligence Markup Language (AIML) was developed within this framework, to design a system of conversational rules. Many chatbots created using this language are listed in Satu et al. [25].
As with other areas of NLP, learning-based dialogue systems are currently the most common approach to conversational systems. Vinyals and Le [26] use a sequence-to-sequence neural network architecture (seq2seq, introduced by Sutskever et al. [27]) to train a chatbot end-to-end without any prior knowledge. The model simply learns how to predict the most probable answer (character by character) to a question from the training corpus. The simplicity of this model has several advantages. The only data needed are discussions: no special annotation is required, so many corpora are usable. Moreover, since this method does not require any domain-specific knowledge, it is very easily applicable in a variety of contexts.
However, this approach has limitations, which the authors acknowledge. First, the objective of predicting the most likely next step in a conversation is not a very good indicator of the purpose of a conversation. Indeed, the goal of a conversation is often a longer-term objective such as sharing a certain piece of information or completing a specific task. Second, the model tends to favour low-risk responses, which often amount to very short and uninteresting answers. Finally, the model does not contain any mechanism ensuring the answers are consistent with each other. This last point is, according to the authors, one of the points that prevent their model from passing the Turing test.
Qiu et al. [28] explain that IR-based methods often fail to deal with long, precise questions. They also point out that text generation-based models may generate inconsistent or meaningless answers. Motivated by the limitations of both of these approaches, they developed a hybrid chatbot, which combines IR and text generation in a commercial setting, to provide customer service and shopping recommendations. This approach relies on three models: an IR model, a text generation model, and a third model that selects which answer to provide to the user according to the confidence score of the first algorithm. When evaluated manually, the hybrid model outperforms the simple model, with 60% answer accuracy versus 40%.
The Turing test, introduced by Turing [29] under the name of "Imitation Game", involves a machine trying to impersonate a human through a written conversation with an examiner. This test was Turing's precise way of answering the question "Can machines think?" through observation. According to Radziwill and Benton [30], this objective has guided the development of chatbots since ELIZA [23]. Ramos [31] as well as Radziwill and Benton [30] suggest, however, that this ability to mimic human behaviour is not necessarily a desirable quality, and argue that even human empathy towards these systems would not suffer from a lack of it.
Deriu et al. [22] list other competitions and metrics which attempt to evaluate the quality of a chatbot system, but no evaluation method stands out as a de facto standard. Metrics such as BLEU [32] and ROUGE [33] measure the overlap between the chatbot dialogue and pre-set phrases. Some approaches, such as Lowe et al. [34], use an RNN to try to predict how judges would rate sentences or the entire conversation. This approach requires extensive manual annotation, but its results correlate quite closely with user judgments. Despite many promising leads, the problem is still open, since it seems difficult to identify the quality factors of a chatbot in general.
2.4.3. Task-Specific Chatbots
For task-specific chatbots, whether informational or transactional, the application domain is known in advance. In the latter case, the role of the chatbot is to automate the exchanges in order to carry out a transaction: the cancellation of a subscription, or the sending of a bank transfer, for example. This has allowed the development of finite-state systems [35] and rule-based systems to govern transitions between these states. The state of a system is composed of variables and their values, which describe all the elements of the environment and chatbot configuration needed to guide the chatbot's decisions. In our very simple example, illustrated in Figure 4, a state of the system could be described as follows:
started: TRUE; citizenship: UNKNOWN; visa_type: STUDENT; requirements_given: FALSE
When users provide their citizenship, the property citizenship is set (to FRENCH, for instance), and the updated properties and their values constitute the new state.
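In code, such a state update can be sketched as follows (the variable names mirror the state shown above):

```python
# A state is a set of variables and their values; providing the citizenship
# produces a new state with the corresponding property updated.
state = {
    "started": True,
    "citizenship": "UNKNOWN",
    "visa_type": "STUDENT",
    "requirements_given": False,
}

def set_citizenship(state: dict, value: str) -> dict:
    """Return the new state obtained when the user provides their citizenship."""
    return {**state, "citizenship": value}

new_state = set_citizenship(state, "FRENCH")
print(new_state["citizenship"])   # FRENCH
```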
These successful techniques, coupled with statistical approaches [36,37], are still in use today because they are capable of supporting the design of simple and robust systems. The system developed by Bobrow et al. [35] is still the basis of many flight reservation systems today [38].
However, these techniques have the disadvantage of not being flexible. Indeed, transitions between states are normally questions (or patterns of questions, as with ALICE). If several questions should lead to the same action, the rule must be duplicated. Other information may also be taken into account when choosing a state transition (e.g., the geographical position of the user or the number of days before the next visa lottery, where applicable). In general, with $n$ binary variables, the size of the tree of possible states will be $2^n$.
For this reason, it is often difficult to keep track of states and their transitions. The concept of "intent" [39] was developed to help decouple a specific user utterance from a state transition. The intent designates the desired outcome of the interaction. The two sentences "I want to know how to get a student visa." and "Requirements for foreigners to study in Canada" both convey a similar information need, which could be an intent called get_information_student_visa. RASA (https://rasa.com) is a framework that helps to develop machine learning-based chatbots. It uses intents, and handles state transitions as described hereafter. An intent classification model is responsible for assigning the correct intent to each user interaction, and then a second model uses the context of the conversation and that intent to identify the best response to that interaction. Assuming that there is no history with relevant additional information, a chatbot could answer with an action such as give_generic_student_visa_info. If additional information is known, the action could be personalized, for example, by returning information specific to French immigrants.
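As a minimal stand-in for the intent classification step (RASA's actual models are learned; here a nearest-example cosine over bags of words, with invented training utterances, conveys the idea):

```python
from collections import Counter
import math

# Hypothetical training utterances for two intents.
TRAIN = {
    "get_information_student_visa": [
        "i want to know how to get a student visa",
        "requirements for foreigners to study in canada",
    ],
    "get_information_work_visa": [
        "how do i get a work permit",
        "can i work with my current visa",
    ],
}

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def classify_intent(utterance: str) -> str:
    """Assign the intent whose closest training example best matches the utterance."""
    bag = Counter(utterance.lower().split())
    scores = {
        intent: max(cosine(bag, Counter(ex.split())) for ex in examples)
        for intent, examples in TRAIN.items()
    }
    return max(scores, key=scores.get)

print(classify_intent("how can i get a student visa"))
```

A second, separate component would then map the predicted intent and the conversation context to an action such as give_generic_student_visa_info.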
As we will explain in Section 4.1 and Section 4.2, the systems we developed do not need to track any state in their initial versions, but could benefit from state tracking in future versions. While the intent classification task could be approached with many techniques, the StarSpace [13] algorithm is one of the most efficient at supporting this task. StarSpace is an algorithm for learning embeddings for entities of different types in the same space, in a supervised manner. For intent classification, the entities are therefore the documents (user-generated text and Frequently Asked Questions (FAQ)), as well as the intent labels. The algorithm consists of bringing together positive document–intent pairs (those where the intent matches the document) and limiting the proximity of negative pairs. In practice, entities are represented by their features, and it is the representation of these features that is updated by minimizing the loss function through stochastic gradient descent. To optimize the training time, the authors use negative instance sampling [10] and a margin parameter in the loss function to avoid focusing on near-perfect instances.
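A toy sketch of this training scheme, with invented features, labels and hyper-parameters (the real StarSpace implementation differs in its feature handling and optimization details):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of document features and intent labels, embedded in the same space.
features = ["student", "visa", "work", "permit"]
labels = ["intent_student_visa", "intent_work_permit"]
dim, lr, margin = 6, 0.1, 0.2
emb = {e: rng.normal(scale=0.1, size=dim) for e in features + labels}

def doc_vec(words):
    """A document is represented by the mean of its feature embeddings."""
    return np.mean([emb[w] for w in words], axis=0)

def sim(u, v):
    return float(np.dot(u, v))

# Positive (document, intent) pairs; negatives are sampled from the other labels.
pairs = [(["student", "visa"], "intent_student_visa"),
         (["work", "permit"], "intent_work_permit")]

for _ in range(200):
    for words, pos in pairs:
        neg = rng.choice([l for l in labels if l != pos])
        d = doc_vec(words)
        # Margin ranking loss: push the positive pair above the negative by `margin`;
        # pairs that already satisfy the margin produce no update.
        loss = max(0.0, margin - sim(d, emb[pos]) + sim(d, emb[neg]))
        if loss > 0:
            emb[pos] += lr * d           # pull the correct label towards the document
            emb[neg] -= lr * d           # push the sampled negative away
            for w in words:
                emb[w] += lr * (emb[pos] - emb[neg]) / len(words)

d = doc_vec(["student", "visa"])
print(sim(d, emb["intent_student_visa"]), sim(d, emb["intent_work_permit"]))
```

The margin makes training stop updating pairs that are already well separated, which is the behaviour the authors use to avoid focusing on near-perfect instances.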
The next section will present the methodology we followed to develop two chatbots by building on these techniques.