Article

Using Machine Learning to Calibrate Automated Performance Assessment in a Virtual Laboratory: Exploring the Trade-Off between Accuracy and Explainability

by Vasilis Zafeiropoulos and Dimitris Kalles *
School of Science and Technology, Hellenic Open University, 263 35 Patras, Greece
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7944; https://doi.org/10.3390/app14177944
Submission received: 9 August 2024 / Revised: 2 September 2024 / Accepted: 3 September 2024 / Published: 6 September 2024
(This article belongs to the Special Issue Recent Applications of Explainable AI (XAI))

Abstract

Hellenic Open University has been developing Onlabs, a virtual biology laboratory simulating its on-site laboratory, so that its students can train before the on-site learning activities. The evaluation of user performance in Onlabs is based on a scoring algorithm, which admits some optimization by means of Genetic Algorithms and Artificial Neural Networks. Moreover, for a particular experimental procedure (microscoping), we have experimented with incorporating into it some background knowledge about the procedure, which allows one to break it down into a series of conceptually linked steps in a hierarchical fashion. In this work, we review the flat and hierarchical modes used for the calibration of the automated assessment mechanism and offer an experimental comparison of both approaches with the aim of devising automated scoring schemes which are fit for training in an at-a-distance learning context. Overall, the genetic algorithm fails to deliver good convergence results in the non-hierarchical setting but performs better in the hierarchical one. On the other hand, the neural network converges most of the time, with the non-hierarchical network achieving slightly better convergence than the hierarchical one; the latter, however, delivers a smoother and more realistic assessment mechanism.

1. Introduction

One of the most delicate tasks that science universities are confronted with is the instruction and assessment of their students in using laboratory equipment and successfully carrying out experiments. Serious risks lurk in the use of an on-site laboratory, such as accidents and equipment damage, restricting, therefore, the full use of the various instruments as well as the possibility of learning by trial and error. Distance learning universities offering science education face another problem, too: that of not having their students on a regular basis (or at all) at their laboratory facilities, making their lab training and evaluation even harder.
The development of virtual laboratories as training platforms complementing on-site lab training can go a long way towards addressing how one becomes familiar with using specialized equipment. In a virtual lab, the user can make unlimited virtual use of the various instruments and perform experiments without the fear of equipment damage or accidents. Users could also have the opportunity to have their (virtual lab) performance evaluated with respect to conducting a particular experiment, either by a human expert supervising the session or by the virtual lab software itself.
This vision (and necessity) has been driving Hellenic Open University’s (HOU) development of Onlabs, a 3D virtual biology laboratory for the at-a-distance lab training and evaluation of undergraduate and postgraduate natural science students. The underlying concept is that students can use the software at home before using the actual equipment on-site.
Unlike with on-site lab training, students can freely experiment with the lab instruments and equipment in virtually unlimited ways, even making improper use of some of them, and learn from their own mistakes. Additionally, they can make use of an instruction mode, where they are guided with voice and text throughout a particular experiment, or of an evaluation mode, where they obtain a system-calculated score of their performance in an experiment.
Onlabs has been developed to resemble a computer game; it contains state-of-the-art 3D graphics, whereas user–environment interaction allows mouse-and-keyboard-based navigation, object handling, knob-turning, switch-clicking, etc. (see Figure 1).
Onlabs was initially developed using Hive3D (https://www.eyelead.com/index, accessed on 1 July 2024), but development switched to Unity in 2016. There exists (http://onlabs.eap.gr/, accessed on 1 July 2024) a collection of versions, either as stable systems or research prototypes, but for the scope of this work, we will focus on one version (2.1.2), which contains two distinct simulated experimental procedures, microscoping and 10X TBE solution preparation, and three different modes of playing, instruction, evaluation and experimentation. In the microscoping of a test specimen, the user is expected to set up the microscope, create a test specimen, and microscope it with all objective lenses of the photonic microscope; in the preparation of 500 mL of 10X TBE water solution, the user measures 17.4 g of Boric Acid and 54 g of Trizma Base powders on the electronic scale, dissolves them in water with a magnetic stirrer, and adds extra water as well as EDTA pH 8.0 (ethylenediamine tetraacetic acid) to the produced solution.
As far as playing modes are concerned, in instruction, a user is guided with voice and text through the steps of the selected procedure, each time only being allowed to perform the suggested move; in experimentation, a user may make any action whatsoever (provided it has been implemented); and in evaluation, a user can, additionally, receive a numerical score reflecting the extent to which the simulated experiment has been completed successfully. It is this latter step (evaluation) that is the focal point of this work.
The Onlabs assessment mechanism mainly revolves around two factors: the specific actions (such as equipment handling) carried out for the completion of the experiment, as well as the order in which those actions were made. As the various actions are, naturally, not of the same importance, different weights, concerning either the completion rate or the order of chosen actions, are assigned to each one of them.
The initialization of such weights has been carried out in close collaboration with subject experts. While these experts (instructors in laboratory-based biology) are central to identifying what is important and what is not, the numbers they give must be treated in the same fashion that knowledge engineers (as early as the 1980s) treated knowledge harnessed during interviews, a central step of expert system development; more often than not, experts take a lot of things for granted and are unable to explicitly express their knowledge [1,2]. As a result, weights suggested by experts, while definitely helpful as anchors and indicative of importance, are only partially accurate; therefore, to improve a weight-based assessment mechanism, we must resort to more nuanced approaches involving, to a certain extent, machine learning based on specific examples provided by experts.
To improve on the original weights suggested by the experts, we have utilized Genetic Algorithms (GAs), where we explore alternative weighting schemes based on the experts' original suggestions, and Artificial Neural Networks (ANNs), where we attempt to learn an altogether new assessment mechanism which, however, is not too far off from the experts' suggestion, both in terms of the final result and in terms of the assessment of individual steps. Using a GA or an ANN depends on the particular experiment completion metric we want to model (more to follow), but both approaches depend on a training set of experiments which have been carried out by knowledgeable lab users (who could tell the difference between poor and good performance) and which have also been evaluated by subject experts.
Additionally, this work presents an improvement attempt that also seeks to utilize domain expert knowledge that can be unequivocally communicated; specifically, the breaking down of an experiment into some (few, several at most) steps which are conceptually distant from each other (for example, preparing a test specimen for microscoping and actually starting to use the knobs with one's eyes on the lenses are two steps that no reasonable user would ever do in parallel). Such domain knowledge utilization thus imposes a hierarchy of importance and conceptual similarity on the various steps of an experiment which, in turn, can be used to guide either the GA or the ANN mechanism towards a more effective search (such a "hierarchical" approach can be directly compared to the original "flat" one).
The rest of this paper is structured in six subsequent sections. First, in Section 2, we briefly review how the problem of assessing progress in virtual lab practice can be seen as being a close relative to the more well-established scoring schemes in adventure games. Then, in Section 3, we show how expert guidance can help in developing a proof-of-concept scoring mechanism and proceed to propose that such expert guidance may be more valuable when used in delivering a final assessment as opposed to a step-wise assessment of individual steps. Thus, in Section 4, we utilize some fundamental machine learning techniques viewing the design of an assessment mechanism as an instance of a knowledge acquisition problem. Then, in Section 5, we review the experimental results, while in Section 6, we proceed to adapt our basic mechanism by incorporating some domain knowledge, aiming to improve its explainability aspects. We close with Section 7, which consists of a short discussion of the potential of our approach for the at-scale deployment of virtual laboratory software.

2. Assessing Progress in Educational Experiments as Games

As mentioned, a virtual biology laboratory could serve as a powerful tool for initiating students to the daily routine in the on-site laboratory, enhancing and further enriching their practical experience (virtually) and offering them the opportunity to experiment safely and unrestrictedly on things they could not do in reality and learn by trial and error. Such realistic and instructive virtual labs are, amongst others, Labster, developed by a Danish multi-national company (https://www.labster.com/, accessed on 1 July 2024), Learnexx 3D, developed by Solvexx Solutions Ltd. (Basingstoke, UK) (http://learnexx.com/, accessed on 1 July 2024), and the MAGES Suite, developed by ORamaVR (Heraklion, Greece) (https://oramavr.com/, accessed on 1 July 2024).
This string of product releases draws on recent technological advances, but also on a lengthy and productive research strand on how to enhance conventional educational activities. For example, learning a subject by interacting through a virtual class is considered to be more efficient than attending a lecture on a subject or reading books on that subject [3]. Furthermore, virtual classes provide shy or restrained students with the opportunity to perform learning actions, such as asking questions, that they would probably be reluctant to do in a real class [4].
A key characteristic of Onlabs is its embedded assessment mechanism, which, as mentioned in the introduction, is used for the evaluation of the user's performance with respect to a particular experiment. In computer games, the embedded mechanisms for the evaluation of the player's performance fall under the term 'scoring'.
Scoring has been present since the first arcade games in the history of personal computers. Since then, the classical 2D arcade computer games have evolved into state-of-the-art 3D multi-player ones, and along with them, scoring has diversified into more abstract measures like “experience points” and “skill points” [5]. However, while computer games have also evolved into serious games and educational virtual worlds, the performance evaluation of the various users in them is mainly completed externally by their tutors and not by hard-coded scoring algorithms [6]. In fact, explicit evaluation mechanisms in serious games have long been proposed as a major research direction [7].
For the purpose of developing reliable and robust automatic assessment methods, Mislevy et al. proposed Evidence-Centered Design (ECD). In ECD, special focus is laid on evidentiary reasoning, i.e., data drawn from information and observations that facilitate inferences about the unobservable aspects of an examinee's competence level in given performance situations [8,9]. Being theoretically based on ECD, stealth assessment in games has been suggested [10,11], which consists of a silent process by which performance data are being collected during playing and inferences are being made with Bayesian Networks about the player's skills. Examples of games into which stealth assessment has been designed and incorporated are Taiga Park, an immersive 3D role-playing game for middle school kids to improve their knowledge in ecology and their skills in scientific inquiry [10]; Oblivion, a first-person 3D role-playing game set in a medieval world developed by Bethesda Softworks [10]; and Newton's Playground, a computer game for students to be acquainted through 2D physics simulations with various physical quantities such as gravity, mass, kinetic energy, and transfer of momentum [11].
Bayesian Networks have also been used for the assessment of university students’ performance in a virtual electronic laboratory simulating in a realistic way various undergraduate electronic engineering curriculum-based laboratory activities at Portsmouth University [12]. In a similar fashion, a performance assessment mechanism with Artificial Neural Networks has been designed and incorporated into the IMMEX simulation at UCLA; the latter originally consisted of a simulated patient whose clinical immunology disorder the university’s medical students were supposed to diagnose, but later its cognitive paradigm of problem solving was extended to other scientific fields as well [13].
While the above approaches point in the direction of a declarative approach as regards scoring assessment, there exist more promising techniques and tools for games with a large number of internal states, which is why we also turned our attention to machine-learning-based optimization (or, more accurately, calibration) approaches.
Resembling an adventure game, Onlabs also constitutes a suitable test-bed for the training of artificial agents, a role for which that particular category of computer games has been proposed [14,15]. Several machine learning applications have been developed within the framework of an adventure game. One such application is Sophie's Kitchen, developed by Thomaz and Breazeal [16,17]. In this, Sophie, a robot-like Non-Playable Character (NPC) agent, tries to prepare a cake by using the appropriate ingredients together and baking it in the oven; in each training session, the expert gives Sophie feedback with which Sophie learns, through Reinforcement Learning, how to properly make a cake.
In recent years, adventure games, particularly the text-based ones, have re-emerged as platforms for machine learning. In those, several techniques have been used for the training of NPCs, such as Deep Reinforcement Learning [18,19] and Artificial Neural Networks [20,21].
Despite ANNs not being popular for computer games because of their black-box nature [22], there have been various studies and applications of them in gaming for the classification of opponents [23], the exploration of arcade game levels [24], and the control of the fight or flight responses of NPCs [22]. In the last two decades, Genetic Algorithms have also been proposed for computer games for the purpose of adjusting the NPCs’ behavior [23,25,26,27].
At the same time, for the weights concerning the order of the actions made in an experiment, a variant of Reinforcement Learning (RL) is used, based on various auto-playing sessions by a Non-Playable Character. This paper deals only with the machine learning techniques applied to the completion rate, i.e., the GA and the ANN.

3. Assessment Modeling in Experiments

We now review various key properties of the underlying environment, which have influenced the design of the assessment mechanism.
A key first property is that the virtual environment (in Onlabs) is discrete; objects are perceived as separate from one another and can be manipulated by an agent only through a finite set of actions. Moreover, the environment is deterministic; when a specific action is performed, the produced new state of the environment is uniquely (i.e., non-stochastically) determined by the action and the environment’s previous state [28].
The virtual environment is populated by several different entities. Those entities can either be characters or “inanimate” objects. The main existing character in Onlabs is the human agent, namely the Ego Character, controlled by the user through the application’s interface. In terms of both design and development, entities are modelled as classes. A class is an abstract representation of one or more kindred entities; for example, all bottles (wash bottle, EDTA bottle) are instances of the bottle class. Furthermore, a class may be a specialization of a base class, or in programming terms, inherit from it; for example, the bottle class inherits from the vessel one, meaning that bottles share vessels’ basic traits as well as having their own particular ones. For the sake of simplicity, henceforth, we will be referring to the various entities with their class names, unless we deal with two or more of a kind; in that case, we will be using the names of the respective instances.
Most inanimate entities in Onlabs have specific actions that can be performed on them by the user. Such actions are Pick Up (collecting an object to the inventory), Press (valid for buttons, switches, and triggers), Rotate (valid for knobs), and Use With (combining an object with another one).
Each class has several features. Those can be qualitative (alphanumeric) or quantitative (numeric) depending on the values they can take; for example, the state feature of the microscope's AC switch class is qualitative, taking the values of 'ON' and 'OFF', while the position feature of the aperture knob class is quantitative, taking values in [0, 40]. The human user can alter those feature values directly by performing various actions on the respective entities, or indirectly by performing actions on other entities, which then trigger value changes in other entities' features; for example, pressing the microscope's AC switch changes the latter's state feature from 'ON' to 'OFF', while using the microscope's plug with the socket changes the microscope entity's connection status feature from 'disconnected' to 'connected to socket' [29,30].
A collection of state–transition diagrams [31] captures value changes; see Figure 2 and Figure 3 for the simple STDs corresponding to the switching and the plugging-in actions.
As mentioned earlier, Onlabs' scoring takes two metrics into consideration: the completion rate, reflecting the steps that the user has made for the completion of a particular experiment, and penalty points, reflecting problematic issues in the order in which those steps have been performed. Both of these score metrics are visible as real-time feedback to the playing user.
The necessary steps for the successful completion of microscoping are shown in Table 1 (alongside the explicit changes shown in italics in values for the relevant features of the involved entities).
As one can see, some values are alphanumeric while most of them are numeric. We now show how we start with the individual scores for the various values and then combine them into an overall scoring. However, before any actual score calculation takes place, we deal with the quantification of the alphanumeric values and with the normalization of the numeric ones.
The quantification of alphanumeric values is quite straightforward. For example, by observing the first assignment in Table 1, we see that it consists of the microscope's connection status feature taking the value of 'connected to socket'; since there are only two states, 'disconnected' and 'connected to socket', we just convert these values to 0 and 1, respectively.
For features which admit numerical values, a further step of normalization is required to transform a value x from [0…xmax] into one in [0…1]. For example, the position feature of the aperture knob of the microscope, which configures the opening of the microscope’s iris and the light getting through it, admits values which range from 0 (the initial value, where the iris is closed) to 40 (the optimal value, where the iris is fully open); therefore, we need a normalization function which would particularly return 0 when the position’s value is 0 and 1 when the position’s value is 40. Such a function is the following (where a = 0.104 and c = 0.006), as shown in Figure 4.
$f(x) = \frac{1+a}{1 + c \cdot (x - x_{max})^2} - a$
As can be seen, normalization need not enforce linearity, and one can use constants a and c to fine tune the relative importance of values closer to the optimum (maximum).
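For concreteness, the following Python sketch implements the normalization function as reconstructed above; the constants are those quoted in the text, while the function itself is our reading of the formula and Figure 4 rather than the exact Onlabs code.

```python
def normalize(x: float, x_max: float = 40.0,
              a: float = 0.104, c: float = 0.006) -> float:
    """Map a raw feature value in [0, x_max] to an individual score in [0, 1]."""
    return (1 + a) / (1 + c * (x - x_max) ** 2) - a

# With these constants, normalize(0) is approximately 0 and normalize(40) is exactly 1.
```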
Having specified the various individual scores, we can now calculate the combined score, the completion rate, as a weighted average of all individual ones.
$\text{completion rate} = \frac{\sum_{k=1}^{n} w_k \cdot x_k}{\sum_{k=1}^{n} w_k}$
We use n to denote the number of steps required for the completion of the procedure (16 in our microscoping) and wk to denote the predefined weight of the k-th individual score. As no individual score can exceed the value of 1, the completion rate cannot be greater than 1, so a simple re-scaling suffices to report the completion rate in the 0–100 range for user-friendliness purposes.
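As a minimal sketch (with illustrative names of our own, not the Onlabs code), the flat completion rate is a one-line weighted average of the already-normalized individual scores:

```python
def completion_rate(scores: list[float], weights: list[float]) -> float:
    """Weighted average of the n individual scores; stays in [0, 1] because each score does."""
    return sum(w * x for w, x in zip(weights, scores)) / sum(weights)
```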
In practice, the completion rate measures the weighted “distance” of the various features’ values from their optimal values, i.e., the values taken when the procedure in question has been successfully completed. However, it fails to capture the order in which those optimal values were achieved; in other words, in which order the necessary steps were made. As mentioned, we make use of the notion of penalty points to have a metric of errors in the order of steps, but their examination is beyond the scope of this paper.
Upon the completion of a play session in evaluation mode, the user is presented with an aggregate score which, like completion rate, ranges from 0 to 1, but is also affected by other factors which are multiplied with the latter. Those factors are penalty points, as briefly mentioned above, Δtime, indicating the number of seconds passed from the beginning of the session until its ending minus the minimum time required for the completion of the experimental procedure, and resetting rate, measuring the extent to which the various instrument components have been reset to their original state (to allow for a new experiment to commence). The latter rate is important because in every experiment the instruments need to be switched off and/or stored for safety and maintenance purposes.
A basic formula of aggregate score combining all of the afore-mentioned factors and ranging from 0 to 1 is (with β and γ being positive constants) as follows.
$\text{aggregate score} = e^{-\frac{\text{penalty points}}{\beta}} \cdot e^{-\frac{|\Delta \text{time}|}{\gamma}} \cdot \text{completion rate} \cdot \text{resetting rate}$
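The formula translates directly into code, as in the sketch below; note that the values of β and γ are not given in the text, so the defaults here are placeholders of ours.

```python
import math

def aggregate_score(penalty_points: float, delta_time: float,
                    completion_rate: float, resetting_rate: float,
                    beta: float = 10.0, gamma: float = 60.0) -> float:
    """Combine the four factors; beta and gamma damp the penalty and delay terms."""
    return (math.exp(-penalty_points / beta)
            * math.exp(-abs(delta_time) / gamma)
            * completion_rate * resetting_rate)
```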

4. Calibrating the Assessment Mechanism with Machine Learning

In the previous section, we described the assessment mechanism used for the evaluation of the human user’s performance. However, the initial approach was to use a human expert’s intuition to define the weights wi for the completion rate. As this is a typical case of a knowledge acquisition bottleneck, we resort to machine learning to explore whether they could be defined in a better way.
The completion rate is a rather static metric, which takes into account only the final states of various objects and not the order of the steps taken for their achievement; for the particular case of the microscoping of a test specimen procedure, it includes 16 different weights, w1 to w16, one for each of the respective feature state changes that need to be made. Nevertheless, even the expert-supplied weights cannot be considered totally reliable, since the experts feel much more at ease giving an overall score as opposed to breaking it down.
Choosing to preserve the particular weighted average structure of the completion rate and to search for optimal weights for the latter, we turned to Genetic Algorithms. After trying out Genetic Algorithms, we then experimented with an Artificial Neural Network, where, this time, the objective was to use the individual scores as inputs and explore the possible non-linear mappings between individual scores and expert-supplied aggregate scores. As we shall explain in Section 4.2, the ANN uses the same inputs as the weighted average and produces an alternative output score to the weighted average's completion rate.

4.1. Calibration Using Genetic Algorithms

The weights w1, w2, …, wn in the completion rate of an experiment represent the importance of their respective individual scores x1, x2, …, xn and they form the weight vector w.
$\mathbf{w} = (w_1, w_2, \ldots, w_n)^T$
For our GA, each weight constitutes a gene and each weight vector constitutes a chromosome. We first randomly produce a population of p weight vectors, w^1, w^2, …, w^p, or, in short, w^i. Figure 5 gives an illustrative description of the genes, chromosomes, and populations of our GA.
Specializing (2) for each weight vector, we obtain the following.
$\mathbf{w}^i = (w_1^i, w_2^i, \ldots, w_n^i)^T$
Now reformulating (1) in vector terms, we obtain the following:
$\text{completion rate} = \frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1}$
where x is the vector of individual scores (formally introduced below) and ‖w^i‖₁ is the first-degree norm of the w^i vector, equal to $\sum_{k=1}^{n} w_k^i$.
The first generation of weight vectors is produced by creating a set of weight vectors, w^1, w^2, …, w^p, with random values from 0 to 100 as their components.
The fitness function of our GA guides (probabilistically) how weight vectors will propagate in the next generation, as individuals or through crossover with other weight vectors. It should, therefore, reflect as accurately as possible the score given by an expert on all of the examples of the training data set. It needs to be mentioned that, by incorporating human feedback, we actually have an interactive GA that differs from the conventional GAs which let the population evolve without interacting with a human expert.
After a play session is completed in training mode, the computer calculates the various individual scores xi and the resulting completion rate while the expert provides their own evaluation.
For the individual scores, we create a score vector x, similar to the weight vector of (2).
$\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$
Assuming, for simplicity reasons, that only one session has been played, so there is only one score vector, x, and denoting the expert's score for this session as ES(x), we proceed by defining a generic form of fitness function as follows.
$Fitness_{generic}(\mathbf{w}^i) = 1 - \left| ES(\mathbf{x}) - \frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1} \right|$
The generic fitness function reaches its maximum value, 1, when the produced completion rate, $\frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1}$, equals the expert's score, and its minimum value when those two quantities are as "far" as possible from each other.
Of course, a single score vector, i.e., a single play session or, in machine learning terms, a single training data point, does not suffice for a GA. We therefore carry out several sessions for the same experiment and obtain a series of score vectors x^j, j = 1, …, h, with their respective expert scores ES(x^j). Adjusting the generic fitness function in (4) for each score vector x^j, we have the following.
$Fitness_j(\mathbf{w}^i) = 1 - \left| ES(\mathbf{x}^j) - \frac{\mathbf{w}^i \cdot \mathbf{x}^j}{\|\mathbf{w}^i\|_1} \right|$
An obvious overall fitness function of weight vector w i is the average of the various fitness functions described in (5). Thus, we define the following:
$Fitness(\mathbf{w}^i) = \frac{\sum_{j=1}^{h} Fitness_j(\mathbf{w}^i)}{h} = \frac{\sum_{j=1}^{h} \left( 1 - \left| ES(\mathbf{x}^j) - \frac{\mathbf{w}^i \cdot \mathbf{x}^j}{\|\mathbf{w}^i\|_1} \right| \right)}{h} = 1 - \frac{\sum_{j=1}^{h} \left| ES(\mathbf{x}^j) - \frac{\mathbf{w}^i \cdot \mathbf{x}^j}{\|\mathbf{w}^i\|_1} \right|}{h}$
where h is the number of the different score vectors or training examples.
The generic fitness function shown above is a negative linear one. We also define three alternative generic fitness functions, which reward in different ways the deviation of the completion rate produced by the current weights from the human expert's score, and which allow us to experiment more effectively with genetic algorithm convergence (a code sketch of all four variants follows the list):
  • negative quadratic: $1 - \left( ES(\mathbf{x}) - \frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1} \right)^2$
  • negative exponential: $e^{-\lambda \cdot \left| ES(\mathbf{x}) - \frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1} \right|}$
  • inverse: $\frac{1}{\lambda \cdot \left| ES(\mathbf{x}) - \frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1} \right|}$
where λ is a constant greater than 1.
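The sketch below shows the four generic fitness variants and the averaging over all h training examples; it is illustrative only (names and the λ default are ours), not the Onlabs implementation.

```python
import math

def deviation(w: list[float], x: list[float], expert_score: float) -> float:
    """|ES(x) - (w · x) / ||w||_1|, the building block of every fitness variant."""
    completion = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    return abs(expert_score - completion)

def fitness(w, xs, es, kind="linear", lam=5.0):
    """Average fitness of weight vector w over all (score vector, expert score) pairs."""
    total = 0.0
    for x, e in zip(xs, es):
        d = deviation(w, x, e)
        if kind == "linear":
            total += 1 - d
        elif kind == "quadratic":
            total += 1 - d ** 2
        elif kind == "exponential":
            total += math.exp(-lam * d)
        elif kind == "inverse":
            total += 1 / (lam * d) if d > 0 else 1.0  # guard the zero-deviation case
        else:
            raise ValueError(kind)
    return total / len(xs)
```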
The new generation contains the same number of weight vectors, where some of them might be “duplicates” of weight vectors from the current generation (selection operation), and most of them will be “created” by performing crossover in random pairs of the existing generation.
At first, we calculate the selection probability for each weight vector of the current generation.
$\Pr(\mathbf{w}^i) = \frac{Fitness(\mathbf{w}^i)}{\sum_{k=1}^{p} Fitness(\mathbf{w}^k)}$
We now define a ratio r representing the proportion of replaceable weight vectors (r being static for all generations), i.e., the fraction of weight vectors which will not survive into the new generation. Then, we apply the selection operation, i.e., we select (1 − r)·p weight vectors from the current generation according to their selection probability and "copy" them into the new generation.
In the crossover operation, we choose r·p/2 pairs of weight vectors from the current generation (including those that were selected before), with respect to their selection probability, to cross over, and put their offspring (two for each pair) into the new generation.
Last comes the mutation operation. Setting the mutation rate to be m (static for all generations, like r), we choose, with uniform probability, a fraction m of the weight vectors that have been created in this new generation to be mutated. We have not defined just one type of mutation, but three: the doubling of a gene (weight value); the halving of a gene; and the swapping of two genes within a chromosome (weight vector). The user may choose any one of them, just as they choose a generic fitness function.
This second generation of weight vectors that has been produced is then used with the exact same score vectors (i.e., the same playing outcomes) and the same expert’s completion rate (i.e., the same evaluation by the human expert for each respective playing outcome) that the first generation was used with; afterwards, it undergoes the same operations of selection, crossover, and mutation producing the third generation, and so on.
We finally need to specify a termination condition for our GA. We could set the latter to halt when the maximum of F i t n e s s w i ,   1 i p at a generation becomes greater than a threshold (e.g., 0.95). However, if a particular limit of generations were reached without any of the weight vectors of the last generation satisfying the termination condition, the algorithm would need to be restarted or the threshold reduced. Thus, in order to secure that our GA eventually ends, we set the termination condition to be an explicit number of generations, specified each time by the user [32,33].
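To tie the three operators together, here is a compact sketch of how one generation could be produced under the scheme just described; parameter names follow the text, while single-point crossover and the rounding details are our assumptions rather than the Onlabs implementation.

```python
import random

def next_generation(pop: list[list[float]], fitnesses: list[float],
                    r: float = 0.6, m: float = 0.1) -> list[list[float]]:
    p = len(pop)
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]         # selection probabilities

    # Selection: copy (1 - r)·p weight vectors into the new generation.
    survivors = random.choices(pop, weights=probs, k=round((1 - r) * p))

    # Crossover: r·p/2 pairs, two offspring per pair.
    offspring = []
    for _ in range(round(r * p / 2)):
        a, b = random.choices(pop, weights=probs, k=2)
        cut = random.randrange(1, len(a))          # single crossover point
        offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]

    # Mutation: halve or double one gene in a fraction m of the offspring.
    for child in random.sample(offspring, round(m * len(offspring))):
        g = random.randrange(len(child))
        child[g] *= random.choice([0.5, 2.0])

    return survivors + offspring
```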

4.2. Calibration Using Artificial Neural Networks

We used a vanilla ANN [34] with three layers: one input layer with n units (one for each individual score), a hidden layer with three units, and an output layer with a single unit, with full connections between units of consecutive layers (see Figure 6). All units feature a standard sigmoid activation function ($\frac{1}{1+e^{-x}}$).
The calculated completion rate of an experiment is set to be equal to the rescaled value of the output layer unit. The rescaling of the ANN output value into the [0,1] interval is necessary because, unlike the weighted average, the ANN might not produce a value within said interval. For the rescaling, two outputs need to be calculated: one output produced by the ANN with all individual scores being 0, or the score vector being <0,0,…,0>, and another one produced with all input values being 1, or the score vector being <1,1,…,1>. The value produced with the <0,0,…,0> vector, i.e., when the user has not yet performed any action and is at their initial state, even though it may not be 0 itself, is set to correspond to a completion rate equal to 0. We name this value outputzero. Likewise, the value produced with the <1,1,…,1> vector, i.e., when the user has successfully performed every necessary action and reached the final state, is set to correspond to a completion rate equal to 1. We name this value outputone. The rescaling is performed through a linear function $y = a \cdot x + b$, where the a and b coefficients are calculated to be $\frac{1}{output_{one} - output_{zero}}$ and $-a \cdot output_{zero}$, respectively. Note here that the output is not rescaled into the [0,1] interval before or during training but only after the latter is completed; only then is the ANN used for the user's evaluation. Before training (with vanilla error back-propagation), all weights are initialized with random values in [−0.05, +0.05] [33].
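The rescaling step can be sketched as follows, where `net` is assumed to be any callable returning the network's raw (sigmoid) output for a list of individual scores; the actual Onlabs code is not shown in the paper.

```python
def make_rescaler(net, n_inputs: int = 16):
    """Pin down the linear map y = a·x + b from the all-zeros and all-ones outputs."""
    output_zero = net([0.0] * n_inputs)   # initial state -> completion rate 0
    output_one = net([1.0] * n_inputs)    # final state   -> completion rate 1
    a = 1.0 / (output_one - output_zero)
    b = -a * output_zero
    return lambda raw: a * raw + b
```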

5. Training Results

As described earlier, a single training example corresponds to one full experimental session and consists of the individual scores achieved by a human user in a play session accompanied by the estimation of the achieved completion rate as given by a human expert (the latter one is given in a numeric mode as well as with a ‘Low’, ‘Medium’, or ‘High’ tag).
We recruited four biology experts in our laboratory to evaluate 15 play sessions each (60 play sessions were played in total). All play sessions were carried out by one of the co-authors, utilizing his knowledge of the experiment itself, which he had also implemented in the software. Of those 15 sessions, 5 were played as 'Low' and classified as such by the particular expert; 5 were played and classified as 'Medium'; and the other 5 were played and classified as 'High'. We therefore have in total 20 sessions classified as 'Low', 20 as 'Medium', and 20 as 'High'.
It is important to note here that our data collection does not aim to definitively calibrate the completion rate weights to actually usable values, but to assist in investigating whether machine learning techniques are suitable for this setting of the scoring mechanism. Essentially, we lay the groundwork so that a statistically sound, designed experiment can then be deployed to produce actually usable weights.
In order to conduct the training and testing, we form various groups of our training examples. For both techniques (GA and ANNs), we ran two types of accuracy measurements to see how close we come to the score indicated by the respective expert for each data point.
Initially, we ran an “optimistic” case by using re-substitution, i.e., by training our system on all data points and testing the results on these same data points and computing the Mean Squared Error (MSE) [33]. Then, we dropped the “optimistic” assumption and applied cross-validation, i.e., by training on some of the data points and testing the trained models on the rest of the data points (with all data points serving once as test data). Additionally, we also ran re-substitution and cross-validation experiments on a by-expert basis, three-fold cross-validation experiments on a Low–Medium–High basis (with each fold consisting of data of only one type), and four-fold cross-validation experiments among different experts (with each fold consisting of data of only one expert).
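The two measurements can be sketched as follows, with `fit` standing in for either the GA or the ANN training routine (a hypothetical placeholder of ours, not the paper's actual pipeline): it takes a list of (score vector, expert score) pairs and returns a scoring function.

```python
def mse(model, data):
    """Mean Squared Error of a trained scoring function over a data set."""
    return sum((model(x) - es) ** 2 for x, es in data) / len(data)

def resubstitution_mse(fit, data):
    # "Optimistic" case: train and test on the very same data points.
    return mse(fit(data), data)

def cross_validation_mse(fit, data, k=4):
    # Each fold serves once as test data; e.g., k=4 for the by-expert folds.
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        train_set = [d for j, fold in enumerate(folds) if j != i for d in fold]
        errors.append(mse(fit(train_set), folds[i]))
    return sum(errors) / k
```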

5.1. Fine Tuning the Completion Rate Weights with a Genetic Algorithm

Let us state that a GA-based training converges if the MSE of the produced weight combinations decreases as the number of generations increases.
We tried numerous combinations of generic fitness functions and mutation methods with various different values for the number of population members, the crossover rate, the mutation rate, and the number of generations. For most of the cases, no convergence was achieved. A typical non-converging behavior is that of the negative linear generic fitness function with the swapping of two genes as the mutation method, as shown in Figure 7.
Figure 8 shows that convergence was achieved in a few cases, one of them being the inverse generic fitness function with the halving of a gene as the mutation method, where we observed a very small MSE decrease, of around 0.013 and 0.004 for re-substitution and cross-validation, respectively.
In addition to the convergence being rare, the weight vectors resulting from training seem unrealistic. For example, in the latter case, the newly produced weight vector is as follows:
<4.69, 2.03, 27.01, 8.85, 23.59, 8.97, 22.01, 7.26, 1, 25, 17.03, 20.53, 3.02, 3.09, 24.64, 21.39>
while the intuitive weight vector (as specified by the expert during implementation) is as follows:
<10, 10, 5, 10, 2, 10, 5, 5, 2, 2, 10, 1, 8, 8, 8, 8>
with the cosine similarity between the two vectors being 63.83%.
One notices several inconsistencies in the new weight vector, such as the weight for testing the specimen holder knob being 25, while the weight for testing the stage knob is 1 (compare the respective components of the two vectors above). This is ample indication that a straightforward application of Genetic Algorithms to improve a basic weighted average model of evaluation is incomplete (we describe an enhancement in Section 6).
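For reference, the 63.83% cosine similarity quoted above can be reproduced with a few lines of Python over the two weight vectors:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))   # dot / (||u|| * ||v||)

ga_w = [4.69, 2.03, 27.01, 8.85, 23.59, 8.97, 22.01, 7.26,
        1, 25, 17.03, 20.53, 3.02, 3.09, 24.64, 21.39]
expert_w = [10, 10, 5, 10, 2, 10, 5, 5, 2, 2, 10, 1, 8, 8, 8, 8]
print(f"{cosine_similarity(ga_w, expert_w):.2%}")    # prints approximately 63.83%
```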

5.2. Estimating the Completion Rate with an Artificial Neural Network

Let us state that ANN-based training converges if the MSE of the final score function decreases as the number of epochs increases.
For a variety of ANN parameters (and in contrast to the GA), we observed that convergence was achieved in nearly 80% of cases of our data set groupings. Below, we show some indicative examples from the performed experiments.
Figure 9 shows the MSE to Epochs graphs for re-substitution and cross-validation on all data sets; both approaches suggest convergence being achieved after 200–400 epochs.
Similar behavior is exhibited in the re-substitution and cross-validation cases within a single expert’s data sets (see Figure 10), with convergence being achieved at about 600 epochs and a noticeable deterioration for cross-validation thereafter (which is, of course, to be expected, since the longer we train on a set of data and test on a different set, the larger the potential for overfitting).
Similarly, cross-validation among experts also converges within the first 400–800 Epochs, as shown in Figure 11.
On the contrary, in cross-validation among different experiment types (Low, Medium, High), training either converges to a higher MSE than the original or does not converge at all, as shown in Figure 12. This is to be expected, of course, since if the network is trained on mediocre performances only, it is highly unlikely to recognize a good one.
Despite the fact that back-propagation training converges in the vast majority of cases, the produced ANN does not correspond to a realistic completion rate, i.e., a realistic output value after being rescaled to the [0,1] interval. In fact, some actions may result in an unrealistic increase, or even decrease, of the completion rate. For our microscoping test case and for training that spans up to 2000 Epochs, we observe that the completion rate decreases from step 3 to step 6, as shown in Figure 13, while normally it should increase for every new step made. We also observe (Figure 13) a slight decrease from step 12 to step 13, instead of the respective expected increase at such an advanced stage of the experiment.
We observed such inconsistencies in the training results for all our cases of data set groupings, which suggests that the ANN is incomplete in terms of the performance evaluation it makes, which paves the way for further enhancements.

6. Hierarchical Learning with a Context

As seen in the previous section, the Genetic Algorithm used for the calibration of the weights of the weighted average type of completion rate did not, in general, converge, i.e., there was no evident decrease in the MSE after training. Even in those few cases where we observed some convergence, the weights produced by the GA were unrealistic and did not contribute to a reasonable scoring. That, inevitably, makes us question whether our weighted average measure of completion rate is appropriate at all. This problem motivated an attempt to modify the definition of the completion rate by taking into account that an experiment, at a high level, can be viewed as a sequence of conceptually similar steps, with conceptual similarity indicating the extent to which a step is part of a multi-step procedure that has some evident sub-goal within an experiment. Such a hierarchical fashion of looking at an experiment can be, in principle, utilized to guide both a GA and an ANN in their search for a better completion rate estimate.

6.1. Defining the Hierarchy for a New Completion Rate

The microscoping procedure, as described in previous sections, consists of 16 actions which need to be carried out and each of which has been assigned a weight, according to its significance, to reflect its contribution to the total score. However, even though those 16 weights can be different from each other, that quantitative difference is the only one we are taking into account; i.e., the weighted average to which they contribute is flat, in the sense that actions are considered to be placed at the same conceptual level, albeit with a difference in significance. This, obviously, is an over-simplification, at least for most of them. For example, testing a particular knob is just a sub-step of the sub-procedure of testing all knobs, while microscoping with a particular objective lens is a sub-step of the sub-procedure of microscoping with all lenses; thus, it seems rather unreasonable to create an artificial relationship between these two steps (testing a knob and using a particular objective lens) instead of allowing that relationship to stand between the actual sub-procedures they belong to.
We have, therefore, valid reasons to expect that the problems of non-convergence and of unrealistic weights produced by training so far can be addressed by exploiting our knowledge of the conceptual similarity of some steps and by using such knowledge to guide the search mechanisms of GAs and ANNs towards more meaningful weights. We thus segmented the microscoping procedure into the four different sub-procedures, as shown in Table 2, and then introduced this segmentation into our scoring mechanism.
With this structure in mind, we define a hierarchical weighted average for the completion rate as follows:
$\text{hierarchical completion rate} = \frac{\sum_{t=1}^{n_s} v_t \cdot \frac{\sum_{k=1}^{n_t} w_{t,k} \cdot x_{t,k}}{\sum_{k=1}^{n_t} w_{t,k}}}{\sum_{t=1}^{n_s} v_t}$
where:
  • wt,k and xt,k are the weight and individual score for the k-th action of the t-th sub-procedure, respectively;
  • nt is the total number of actions contributing to the t-th sub-procedure;
  • vt is the weight of the t-th sub-procedure;
  • ns is the total number of sub-procedures.
As a matter of convention, we refer to the weights vt as external weights.
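A minimal sketch of this hierarchical weighted average follows: an internal weighted average per sub-procedure, then a weighted average of the sub-procedure rates using the external weights (names are ours, not those of the Onlabs code).

```python
def hierarchical_completion_rate(sub_scores: list[list[float]],
                                 sub_weights: list[list[float]],
                                 external_weights: list[float]) -> float:
    """sub_scores[t], sub_weights[t]: scores/weights of the t-th sub-procedure."""
    sub_rates = [sum(w * x for w, x in zip(ws, xs)) / sum(ws)
                 for ws, xs in zip(sub_weights, sub_scores)]
    return (sum(v * s for v, s in zip(external_weights, sub_rates))
            / sum(external_weights))
```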

6.2. Training with Genetic Algorithms and Hierarchical Learning

The only required change in our Genetic Algorithm is to use the newly introduced hierarchical weighted average instead of the conventional flat one ($\frac{\mathbf{w}^i \cdot \mathbf{x}}{\|\mathbf{w}^i\|_1}$). For the sake of brevity, we refer to the GA trainings on the flat and hierarchical completion rates as GA-F and GA-H, respectively.
The results of "hierarchical training" (see Figure 14, Figure 15 and Figure 16) seem to be an improvement over their "flat" counterparts (see Figure 7 and Figure 8), as we observe convergence in more trainings and for a wider range of parameter combinations.
However, again, as in the flat weighted average case, the resulting weights are not realistic. For example, in the last example shown in Figure 16, training on expert 1's data sets results in the following weight vector (the weights corresponding to each sub-procedure are included in the first four pairs of braces, while the fifth contains the external weights).
< {1, 1.11, 1.35, 2.85, 4.16, 1.92},
{3.45, 36.62, 1, 46.41},
{1.1, 1},
{115.96, 1, 11.34, 8.93},
{3.23, 1, 2, 2.22} >
The cosine similarity of the latter weight vector to the intuitive one is calculated to be only 35.09%. One also sees a big discrepancy between the weights regarding the sub-procedure of knob testing; testing the specimen holder knob has a weight of 46.41, while the weight for testing the stage knob is 1. Likewise, in the sub-procedure of focusing with the objective lenses, the weight for focusing with the 4X lens is 115.96, while the weights for focusing with any of the other objective lenses range from 1 to 11.34. Last but not least, the external weights have similar values, mistakenly treating all sub-procedures as roughly equally important.
As a consequence, we conclude that, despite our hierarchical weighted average securing convergence more often than our flat one, it is still not a sufficient model of evaluation. In other words, a deeper approach to modelling (hierarchical instead of flat) might be more productive in terms of accuracy but, eventually, less meaningful; i.e., trying to fine tune the performance might deteriorate the level of explainability.

6.3. Training with Artificial Neural Networks and Hierarchical Learning

We will now try to incorporate our hierarchical redesign of the microscoping procedure into the ANN used as our alternative completion rate metric.
Since the microscoping procedure is split into four sub-procedures, a three-layered ANN with four units in its hidden layer (instead of the three that we had until now) seems applicable. The counterpart of the ANN used so far, but with four units in the hidden layer, is conventionally called 'ANN-F4' (the letter 'F' standing for "flat", i.e., non-hierarchical) and is shown in Figure 17.
By the same convention, we name the previously-used ANN with three units in the hidden layer (shown in Figure 6 of Section 4.2) as ‘ANN-F3’.
As we said, ANN-F4, like ANN-F3, is non-hierarchical; its only difference from ANN-F3 is the number of units in the hidden layer. We will attempt to create a "hierarchical" ANN by modifying the interconnections of the various units between the input and hidden layers of ANN-F4, taking into consideration the hierarchy of microscoping actions introduced in Section 6.1.
In that section, we divided the microscoping procedure into four sub-procedures. Assuming that each unit of the hidden layer corresponds to one of the sub-procedures, a hierarchical reconstruction of ANN-F4 interconnects each unit of the hidden layer only with the respective individual scores from the input layer. The new hierarchical ANN, which we conventionally name 'ANN-H4' (the letter 'H' standing for "hierarchical"), is shown in Figure 18.
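One way to realize such restricted connectivity is a binary mask over the input-to-hidden weight matrix, as in the NumPy sketch below. The sub-procedure group sizes (6, 4, 2, 4) are read off the weight vector shown in Section 6.2; the training loop is omitted, and everything beyond the masking idea is our assumption rather than the Onlabs code.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = [6, 4, 2, 4]                      # actions per sub-procedure, 16 in total
n_in, n_hidden = sum(groups), len(groups)

mask = np.zeros((n_in, n_hidden))
start = 0
for t, size in enumerate(groups):
    mask[start:start + size, t] = 1.0      # connect group t only to hidden unit t
    start += size

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)) * mask   # masked first layer
W2 = rng.uniform(-0.05, 0.05, (n_hidden, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):                            # x: vector of 16 individual scores
    # The mask must also be applied to the gradient updates during training.
    return sigmoid(sigmoid(x @ W1) @ W2)
```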
We will now review how the two new ANNs (ANN-F4 and ANN-H4) perform on the existing data sets and compare their results with each other as well as with those of ANN-F3.
We start by noting that ANN-F4 does no better than ANN-F3. While re-substitution seems to be identical for both variants, cross-validation experiments for ANN-F4 suggest that overfitting sets in as early as within 200–400 Epochs (see Figure 19 and compare it to Figure 9).
Convergence does not improve in the case of ANN-H4 either (see Figure 20).
Despite not actually obtaining better results in terms of convergence with any of those new ANNs, we also need to examine how the completion rate changes according to the various steps performed. A comparison of the performance of all three ANNs is illustrated in Figure 21, which explicitly shows that ANN-H4, albeit producing worse results in terms of convergence, generally exhibits a smoother completion rate increase as well as the fewest decreases.
We observe that both ANN-H4 and ANN-F4, upon the completion of the 12th step, i.e., on entering the microscoping mode (looking through the ocular lenses) with the only remaining steps being those of focusing with each of the four available objective lenses, produce a completion rate of 70–80%, while ANN-F3 produces a completion rate of approximately 55%, which is more consistent with the human expert's evaluation. That issue, combined with the fact that ANN-F4 and ANN-H4 show worse convergence results than ANN-F3, indicates the possibility that, in our domain, ANNs with four units in the hidden layer are subject to overfitting.
So, the hierarchical reconstruction of our ANN type of completion rate results in a more realistic assessment mechanism in terms of score increases and decreases, but also in a less accurate one in terms of correspondence to the final evaluations carried out by the human experts. That is the exact opposite of the hierarchical weighted average type of completion rate trained with a GA; there (Figure 14, Figure 15 and Figure 16), the training of the hierarchically redesigned weighted average resulted in greater accuracy with respect to the human evaluators, but also in less explainable results (with weight vectors being different from what a human expert would suggest).

7. Conclusions and Future Work

Virtual labs are tools which can offer university students remote lab training that is free from the risk of accidents and damages. They also have the capacity to provide the trainees with an automated evaluation of their online performance. It is along these lines that Hellenic Open University has developed its own virtual lab, Onlabs.
The evaluation of the user’s performance in Onlabs is based on a scoring algorithm which has specifically been developed for this purpose. The algorithm provides the user with a two-fold real-time score, a completion rate indicating to what extent the user has made the necessary actions for the successful completion of the experiment, and penalty points, which are assigned whenever the user is performing an action in the wrong order. Those two scores, along with the time spent by the user on the experiment as well as the extent to which the involved instruments have been reset upon the end of the session, are combined to produce an aggregate score for the user’s overall performance.
Initially, the parameters of the scoring algorithm are defined intuitively. In order, however, to explore whether these parameters can be more realistically set and, subsequently, result in a more efficient scoring algorithm, so that informed student guidance can be achieved automatically at a large scale, without the costly presence of human evaluators, machine learning was used. Such an approach can also help with the projected development of more implementations for a series of experiments, each of which will eventually be required to be accompanied with a mechanism to assess the extent to which the experiment has been successfully completed; i.e., when we refer to “scale” we not only mean the number of learners, but also the number of experiments.
We experimented with Genetic Algorithms and Artificial Neural Networks, along with feedback from human experts, which are used for the calibration of the completion rate. Moreover, for those two methods, the completion rate has also been redesigned in a hierarchical fashion, i.e., by grouping the various required actions into sub-groups according to their conceptual similarity. We focused the GA- and ANN-based training on the simulated experimental procedure of microscoping.
The training results vary depending on the learning method and the setting (hierarchical or flat) used. The GA gives poor convergence results in the default, non-hierarchical setting and moderately better results in the hierarchical one, while the produced completion rate assessment mechanism is in both cases unrealistic, yet even more unrealistic in the hierarchical setting. On the contrary, the ANN converges most of the time, with the non-hierarchical ANN achieving a slightly better convergence than the hierarchical one, but with the latter producing a smoother and, to a considerable extent, much more realistic completion rate.
To conclude, we note that the machine learning calibration of the scoring mechanism in terms of completion rate produces an estimate that is at best indicative of the student's actual performance. At the same time, the hierarchical reconstruction of the completion rate measure trades off accuracy for realism and explainability, allowing one to observe an increasing score on the completion rate as one progresses through the steps of the experiment and, gradually, reaches the final goal. This warrants an investigation into the redesign of said hierarchy, aiming to improve both accuracy and realism instead of trading off one for the other. Moreover, we obviously need to collect and use more training data, which should also be as diverse as possible; that is, we should ask for feedback from more biology experts and, probably, over more experiments for each one.
Key among our future goals is to apply machine learning training in the context of the electrophoresis experimental procedure, a complex multi-equipment multi-step procedure. Since such a procedure readily lends itself to hierarchical treatment, we expect it will serve as a more credible test-bed for a broader study of the proposed machine learning techniques for calibrating the scoring mechanism. Additionally, it will also speed up our investigation of other, more sophisticated techniques, such as the discrete Whale Optimization Algorithm [35], which perfectly fits our domain, and reinforcement learning, whose early baseline application [36] suggests that it holds potential, since this is the standard type of problem where it seems to excel (at least in theory).
Lastly, we note that the above line of research has culminated in a series of educational intelligence developments where we investigate aspects of how a suitable online component might be an indispensable tool in advancing the knowledge level of learners [37]. As we have also developed and deployed an analytics-friendly online version of Onlabs [38], where, for microscoping, we aim at a much broader audience than HOU's students of biology, we believe that further applied research for massive-scale automated evaluation will continue to be a top-level priority worldwide. At the same time, expanding Onlabs into the extended reality domain [39], experimenting with formalisms that allow us to compute similarities between experimental trajectories [40], and investigating the extent to which this technology can be used as a formal educational tool for professionals [41] provide an opportunity for research in evaluation algorithms in an even broader domain.

Author Contributions

Conceptualization, V.Z.; methodology, V.Z. and D.K.; software, V.Z.; formal analysis, V.Z.; investigation, V.Z.; resources, V.Z.; data curation, V.Z.; writing—original draft preparation, V.Z.; writing—review and editing, D.K.; visualization, V.Z.; supervision, D.K.; project administration, D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

Funding

Parts of this research have been co-financed by European Union funds and Greek national funds.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The training data used for this paper are included in all variations of our Machine Learning version of Onlabs, which are available for download. In particular, the data are stored in the MIC_training_data.txt file in the /MLData subfolder of the application. For our readers' convenience, along with the variations of the full package of the ML version, we have included separate packages, each one containing just one particular variation of our ML training processes, namely GA-F, GA-H, ANN-F3, ANN-F4, and ANN-H4, none of which contains the simulation of the experimental procedure of microscoping.

Acknowledgments

The text of this paper contains context-setting information and generic descriptions which also appear in other work (properly referenced) by the same authors. The paper contains substantial new material (Section 6 and Section 7).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Clancey, W.J. The Epistemology of a Rule-Based Expert System—A Framework for Explanation. Artif. Intell. 1983, 20, 215–251.
2. Clancey, W.J. The Knowledge Level Reinterpreted: Modeling How Systems Interact. Mach. Learn. 1989, 4, 285–291.
3. de Freitas, S. Serious Virtual Worlds: A Scoping Study; Joint Information Systems Committee: Bristol, UK, 2008.
4. Maratou, V. Implementation of an Educational Virtual World for Software Engineering. Master’s Thesis, Hellenic Open University, Patras, Greece, 2012. Available online: https://apothesis.eap.gr/archive/item/77610 (accessed on 1 July 2024). (In Greek).
5. Score (Game). Wikipedia. Available online: https://en.wikipedia.org/wiki/Score_(game) (accessed on 1 July 2024).
6. Bellotti, F.; Kapralos, B.; Lee, K.; Moreno-Ger, P.; Berta, R. Assessment in and of Serious Games: An Overview. Adv. Hum. Comput. Interact. 2013, 2013, e136864.
7. Bellotti, F.; Berta, R.; Gloria, A.D. Designing Effective Serious Games: Opportunities and Challenges for Research. Int. J. Emerg. Technol. Learn. iJET 2010, 5, 22–35.
8. Mislevy, R.J.; Steinberg, L.S.; Almond, R.G. Focus Article: On the Structure of Educational Assessments. Meas. Interdiscip. Res. Perspect. 2003, 1, 3–62.
9. Mislevy, R.J.; Steinberg, L.S.; Almond, R.G.; Lukas, J.F. Concepts, Terminology and Basic Models of Evidence-Centered Design. In Automated Scoring of Complex Tasks in Computer-Based Testing; Williamson, D.M., Mislevy, R.J., Bejar, I.I., Eds.; Routledge: Mahwah, NJ, USA, 2006; pp. 15–47. ISBN 978-1-135-10921-9.
10. Shute, V. Stealth Assessment in Computer-Based Games to Support Learning. In Computer Games and Instruction; Information Age Publishing: Charlotte, NC, USA, 2011.
11. Shute, V.; Ventura, M. Stealth Assessment: Measuring and Supporting Learning in Video Games; The MIT Press: Cambridge, MA, USA, 2013; ISBN 978-0-262-51881-9.
12. Achumba, I.E. Intelligent Performance Assessment in a Virtual Electronic Laboratory. Ph.D. Thesis, University of Portsmouth, Portsmouth, UK, 2011.
13. Stevens, R.H.; Casillas, A. Artificial Neural Networks. In Automated Scoring of Complex Tasks in Computer-Based Testing; Williamson, D.M., Mislevy, R.J., Bejar, I.I., Eds.; Routledge: Mahwah, NJ, USA, 2006; pp. 259–312. ISBN 978-1-135-10921-9.
14. Amir, E.; Doyle, P. Adventure Games: A Challenge for Cognitive Robotics; American Association for Artificial Intelligence: Washington, DC, USA, 2002; Volume 8. Available online: www.aaai.org (accessed on 1 July 2024).
15. Hlubocky, B.; Amir, E. Knowledge-Gathering Agents in Adventure Games. In Proceedings of the AAAI-04 Workshop on Challenges in Game AI, Menlo Park, CA, USA, 25–29 July 2004.
16. Thomaz, A.L.; Breazeal, C. Reinforcement Learning with Human Teachers: Evidence of Feedback and Guidance with Implications for Learning Performance. In AAAI’06, Proceedings of the 21st National Conference on Artificial Intelligence, Boston, MA, USA, 16–20 July 2006; Cohn, A., Ed.; AAAI Press: Boston, MA, USA, 2006; Volume 1, pp. 1000–1005.
17. Thomaz, A.L.; Breazeal, C. Teachable Robots: Understanding Human Teaching Behavior to Build More Effective Robot Learners. Artif. Intell. 2008, 172, 716–737.
18. Ammanabrolu, P.; Riedl, M. Playing Text-Adventure Games with Graph-Based Deep Reinforcement Learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 3557–3565.
19. Narasimhan, K.; Kulkarni, T.; Barzilay, R. Language Understanding for Text-Based Games Using Deep Reinforcement Learning. arXiv 2015, arXiv:1506.08941.
20. Kostka, B.; Kwiecien, J.; Kowalski, J.; Rychlikowski, P. Text-Based Adventures of the Golovin AI Agent. In Proceedings of the 2017 IEEE Conference on Computational Intelligence and Games (CIG), New York, NY, USA, 22–25 August 2017; pp. 181–188.
21. Robitzski, D. A Neural Network Dreams Up This Text Adventure Game as You Play. Available online: https://futurism.com/text-adventure-game-neural-network (accessed on 5 October 2019).
22. Robbins, M.S. Using Neural Networks to Control Agent Threat Response. In Game AI Pro 360: Guide to Tactics and Strategy; Rabin, S., Ed.; CRC Press: Boca Raton, FL, USA, 2019; p. 242.
23. Charles, D.; Fyfe, C.; Livingstone, D.; McGlinchey, S. Biologically Inspired Artificial Intelligence for Computer Games; IGI Global: Hershey, PA, USA, 2008; ISBN 978-1-59140-646-4.
24. Luo, J.J. An Exploration of Neural Networks Playing Video Games. Available online: https://towardsdatascience.com/an-exploration-of-neural-networks-playing-video-games-3910dcee8e4a (accessed on 26 September 2019).
25. Arora, A. Using Genetic Algorithms to Automate the Chrome Dinosaur Game (Part 2): Our Journey Goes Genetic. Available online: https://heartbeat.fritz.ai/using-genetic-algorithms-to-automate-the-chrome-dinosaur-game-part-2-1c0007334297 (accessed on 26 September 2019).
26. de Mendonça, V.G.; Pozzer, C.T.; Raittz, R.T. A Framework for Genetic Algorithms in Games. In Proceedings of the VII SBGames, Belo Horizonte, Brazil, 10–12 November 2008; p. 4.
27. Martin, M. Using a Genetic Algorithm to Create Adaptive Enemy AI. Available online: https://www.gamasutra.com/blogs/MichaelMartin/20110830/90109/Using_a_Genetic_Algorithm_to_Create_Adaptive_Enemy_AI.php (accessed on 26 September 2019).
28. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 2nd ed.; Pearson Education: Upper Saddle River, NJ, USA, 2003; ISBN 978-0-13-790395-5.
29. Zafeiropoulos, V.; Kalles, D.; Sgourou, A. Learning by Playing: Development of an Interactive Biology Lab Simulation Platform for Educational Purposes. In Experimental Multimedia Systems for Interactivity and Strategic Innovation; Deliyannis, I., Kostagiolas, P., Banou, C., Eds.; IGI Global: Hershey, PA, USA, 2016; pp. 204–221.
30. Zafeiropoulos, V.; Kalles, D.; Sgourou, A. Adventure-Style Game-Based Learning for a Biology Lab. In Proceedings of the 2014 IEEE 14th International Conference on Advanced Learning Technologies (ICALT), Athens, Greece, 7–10 July 2014; pp. 665–667.
31. Yourdon, E. Modern Structured Analysis; Yourdon Press: Upper Saddle River, NJ, USA, 1989; ISBN 978-0-13-598624-0.
32. Zafeiropoulos, V.; Kalles, D. Human-Computer Learning Interaction in a Virtual Laboratory. In Proceedings of the Online, Open and Flexible Higher Education Conference 2019, Madrid, Spain, 14 June 2019.
33. Zafeiropoulos, V.; Kalles, D. Computer-Human Mutual Training in a Virtual Laboratory Environment. In Artificial Intelligence and Assistive Technologies; Tsihrintzis, G., Virvou, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2021.
34. Mitchell, T.M. Machine Learning; McGraw-Hill, Inc.: New York, NY, USA, 1997; ISBN 978-0-07-042807-2.
35. Tian, G.; Zhang, C.; Zhang, X.; Feng, Y.; Yuan, G.; Peng, T.; Pham, D.T. Multi-Objective Evolutionary Algorithm with Machine Learning and Local Search for an Energy-Efficient Disassembly Line Balancing Problem in Remanufacturing. J. Manuf. Sci. Eng. 2023, 145, 051002.
36. Zafeiropoulos, V. Human-Computer Learning Interaction in a Laboratory Environment. Ph.D. Thesis, Hellenic Open University, Patras, Greece, 2021.
37. Paxinou, E.; Georgiou, M.; Kakkos, V.; Kalles, D.; Galani, L. Achieving Educational Goals in Microscopy Education by Adopting Virtual Reality Labs on Top of Face-to-Face Tutorials. Res. Sci. Technol. Educ. 2022, 40, 320–339.
38. Paxinou, E.; Mitropoulos, K.; Tsalapatas, H.; Kalles, D. A Distance Learning VR Technology Tool for Science Labs. In Proceedings of the 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), Corfu, Greece, 18–20 July 2022; pp. 1–5.
39. Kiourt, C.; Kalles, D.; Lalos, A.; Papastamatiou, N.; Silitziris, P.; Paxinou, E.; Theodoropoulou, H.; Zafeiropoulos, V.; Papadopoulos, A.; Pavlidis, G. XRLabs: Extended Reality Interactive Laboratories. In Proceedings of the 12th International Conference on Computer Supported Education, Prague, Czech Republic, 2–4 May 2020; SCITEPRESS—Science and Technology Publications: Prague, Czech Republic, 2020; pp. 601–608.
40. Sypsas, A.; Kalles, D. Computing Similarities Between Virtual Laboratory Experiments Models Using Petri Nets. In Proceedings of the 20th International Conference on Modeling & Applied Simulation, Online, 15–17 September 2021; p. 37.
41. Zafeiropoulos, V.; Anastassakis, G.; Orphanoudakis, T.; Kalles, D.; Fanariotis, A.; Fotopoulos, V. The V-Lab VR Educational Application Framework: A Beacon Application of the XR2Learn Project. In MobileHCI ’23 Companion, Proceedings of the 25th International Conference on Mobile Human-Computer Interaction, Athens, Greece, 26–29 September 2023; Komninos, A., Santoro, C., Eds.; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1–4.
Figure 1. A snapshot of the virtual laboratory (Onlabs).
Figure 2. State-transition diagram of AC switch’s state feature.
Figure 3. State-transition diagram of microscope’s connection status feature.
Figure 4. The graph of the normalization function f(x) = (1 + a)/(1 + c·(x − 40)²) − a for the aperture knob’s position.
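For readers who want to reproduce the curve of Figure 4, a direct transcription of the normalization function is given below; the values of a and c are illustrative placeholders only, since their calibrated values are not spelled out here.

```python
def normalize_aperture(x, a=0.5, c=0.01):
    """Normalization of the aperture knob's position (Figure 4).

    Returns 1 at the ideal position x = 40 and decays towards -a as the
    knob moves away from it.  The values of a and c are illustrative
    placeholders, not calibrated constants.
    """
    return (1 + a) / (1 + c * (x - 40) ** 2) - a

assert abs(normalize_aperture(40) - 1.0) < 1e-12   # perfect position scores 1
```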
Figure 5. Genes, chromosomes, and populations of the Genetic Algorithm.
Figure 6. The three-layer ANN with three units in the hidden layer, used for the assessment of the user’s performance.
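To make the architecture of Figure 6 concrete, here is a minimal numpy sketch of a three-layer network with three sigmoid hidden units and a single output estimating the completion rate. The input dimension (one normalized feature score per step) and the training details are our assumptions for illustration, not the exact Onlabs implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer of normalized instrument-feature scores, one hidden layer with
# three sigmoid units, and a single output unit estimating the completion rate.
N_IN, N_HID = 16, 3
W1, b1 = rng.normal(0, 0.1, (N_HID, N_IN)), np.zeros(N_HID)
W2, b2 = rng.normal(0, 0.1, N_HID), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def completion_rate(x):
    """Forward pass: x is a vector of normalized feature scores in [0, 1]."""
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

def sgd_step(x, target, lr=0.1):
    """One backpropagation step minimizing squared error, as in the MSE plots."""
    global W1, b1, W2, b2
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    dy = (y - target) * y * (1 - y)            # error signal at the output unit
    dh = dy * W2 * h * (1 - h)                 # backpropagated to hidden layer
    W2 -= lr * dy * h; b2 -= lr * dy
    W1 -= lr * np.outer(dh, x); b1 -= lr * dh
```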
Figure 7. MSE to generations graphs for GA—re-substitution and cross-validation on all data sets (population members: 100, crossover rate: 0.9, mutation rate: 0.01, generic fitness function: negative linear, mutation method: permutation of two genes).
Figure 8. MSE to generations graphs (logarithmic scale) for GA—re-substitution and cross-validation on all data sets (population members: 100, crossover rate: 0.9, mutation rate: 0.03, generic fitness function: inverse, mutation method: halving of a gene).
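The GA configurations in the captions of Figures 7 and 8 (and of Figures 14–16 further below) name their generic fitness functions and mutation methods. The Python sketch below spells out what those named operators could look like when the chromosome is a vector of scoring weights, consistent with Figure 5; since the exact formulas are not given here, these definitions are illustrative assumptions rather than the operators actually implemented in Onlabs.

```python
import math
import random

def fitness_negative_linear(mse, k=1.0):
    """'Negative linear' generic fitness: fitness falls linearly with MSE."""
    return -k * mse

def fitness_inverse(mse, eps=1e-9):
    """'Inverse' generic fitness: fitness grows as MSE shrinks."""
    return 1.0 / (mse + eps)

def fitness_negative_exponential(mse, k=1.0):
    """'Negative exponential' generic fitness: exp(-k * MSE)."""
    return math.exp(-k * mse)

def mutate_permute_two_genes(chromosome):
    """Mutation by permutation of two genes (Figure 7): swap two positions."""
    i, j = random.sample(range(len(chromosome)), 2)
    chromosome[i], chromosome[j] = chromosome[j], chromosome[i]

def mutate_halve_gene(chromosome):
    """Mutation by halving of a gene (Figures 8 and 15)."""
    chromosome[random.randrange(len(chromosome))] *= 0.5

def mutate_double_gene(chromosome):
    """Mutation by doubling of a gene (Figures 14 and 16)."""
    chromosome[random.randrange(len(chromosome))] *= 2.0
```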
Figure 9. MSE to Epochs graph for ANN training—re-substitution and cross-validation on all data sets.
Figure 10. MSE to Epochs graph for ANN training—re-substitution and cross-validation within one of our expert’s data sets.
Figure 11. MSE to Epochs graph for ANN training—cross-validation among experts.
Figure 12. MSE to Epochs graph for ANN training—cross-validation among types.
Figure 13. Completion rate increase with respect to performed step for ANN training.
Figure 14. MSE to generations graphs for GA on hierarchical completion rate—re-substitution on expert 4 (population members: 100, crossover rate: 0.2, mutation rate: 0.3, generic fitness function: inverse, mutation method: doubling of a gene).
Figure 15. MSE to generations graphs for GA on hierarchical completion rate—training on experts 2, 3, and 4 and testing on expert 1 (population members: 100, crossover rate: 0.5, mutation rate: 0.5, generic fitness function: negative exponential, mutation method: halving of a gene).
Figure 16. MSE to generations graphs for GA on hierarchical completion rate—re-substitution and cross-validation on expert 1’s data sets (population members: 100, crossover rate: 0.6, mutation rate: 0.2, generic fitness function: inverse, mutation method: doubling of a gene).
Figure 17. A three-layer ANN with four units in the hidden layer (ANN-4F).
Figure 18. A hierarchical three-layer ANN with four units in the hidden layer (ANN-4H).
Figure 19. MSE to Epochs graph for ANN-4F—re-substitution and cross-validation on all data sets.
Figure 20. MSE to Epochs graph for ANN-4H—re-substitution and cross-validation on all data sets.
Figure 21. Completion rate increase with respect to performed step for the training of ANN-3F, ANN-4F, and the “hierarchical” ANN-4H.
Table 1. Steps necessary for the successful completion of the microscoping procedure.
1. Connect the microscope into the socket: microscope|connection ← ‘connected to socket’
2. Turn the microscope light on: AC switch|state ← ‘ON’
3. Set light intensity to two-thirds of maximum: light intensity knob|position ← ⅔ ∙ MaxPosition
4. Set iris to fully open: aperture knob|position ← MaxPosition
5. Lift the condenser lens to the top position: condenser|height ← MaxHeight
6. Set lens 4X active: revolving nosepiece|active focus ← 4
7. Test coarse focus knob: coarse focus knob|tested ← ‘yes’
8. Test fine focus knob: fine focus knob|tested ← ‘yes’
9. Test stage knob: stage knob|tested ← ‘yes’
10. Test specimen holder knob: specimen holder knob|tested ← ‘yes’
11. Put test specimen on stage: stage|attached specimen ← ‘yes’
12. Enter microscope mode: Ego|mode ← ‘microscoping’
13. Focus with lens 4X: microscoping|clearness [revolving nosepiece|active focus = 4] ← 1
14. Focus with lens 10X: microscoping|clearness [revolving nosepiece|active focus = 10] ← 1
15. Focus with lens 40X: microscoping|clearness [revolving nosepiece|active focus = 40] ← 1
16. Focus with lens 100X: microscoping|clearness [revolving nosepiece|active focus = 100] ← 1
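A natural machine-readable form of Table 1 is a checklist of (feature, target value) pairs; the minimal sketch below uses one to evaluate a flat completion rate as the weighted fraction of satisfied steps. The dictionary-based state representation, the equal default weights, and the truncation to the first four steps are illustrative assumptions, not the actual Onlabs encoding.

```python
# Illustrative encoding of Table 1: each step is a (feature path, target value)
# pair checked against the simulator state.  Only the first four steps are
# spelled out here; the remaining twelve follow the same pattern.
STEPS = [
    ("microscope|connection",         "connected to socket"),
    ("AC switch|state",               "ON"),
    ("light intensity knob|position", "2/3 * MaxPosition"),
    ("aperture knob|position",        "MaxPosition"),
    # ... steps 5-16 as in Table 1
]

def flat_completion_rate(state, weights=None):
    """Flat score: weighted fraction of steps whose target value is reached."""
    weights = weights or [1.0] * len(STEPS)
    satisfied = [w for (feat, target), w in zip(STEPS, weights)
                 if state.get(feat) == target]
    return sum(satisfied) / sum(weights)

# example: a session in which only the first two steps have been performed
state = {"microscope|connection": "connected to socket", "AC switch|state": "ON"}
print(flat_completion_rate(state))   # 0.5 with only these 4 steps encoded
```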
Table 2. The 16-step microscoping procedure divided into 4 sub-procedures.
1. Set microscope
  1.1 Connect the microscope into the socket
  1.2 Turn the microscope light on
  1.3 Set light intensity to two-thirds of maximum
  1.4 Set iris to fully open
  1.5 Lift the condenser lens to the top position
  1.6 Set lens 4X active
2. Test microscope knobs
  2.1 Test coarse focus knob
  2.2 Test fine focus knob
  2.3 Test stage knob
  2.4 Test specimen holder knob
3. Prepare actual microscoping
  3.1 Put test specimen on stage
  3.2 Enter microscope mode
4. Perform actual microscoping
  4.1 Focus with lens 4X
  4.2 Focus with lens 10X
  4.3 Focus with lens 40X
  4.4 Focus with lens 100X
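Building on the flat checklist sketched after Table 1, the decomposition of Table 2 suggests a two-level score: each sub-procedure is rated on its own steps, and the overall completion rate aggregates the four sub-scores, so that every completed step yields a visible, interpretable increment. The grouping below follows Table 2, while the equal sub-procedure weights are an illustrative assumption rather than the calibrated values.

```python
# Sub-procedure grouping of Table 2, expressed over the step indices of
# Table 1 (0-based).  Weights per sub-procedure are illustrative.
SUB_PROCEDURES = {
    "set microscope":              range(0, 6),    # steps 1-6
    "test microscope knobs":       range(6, 10),   # steps 7-10
    "prepare actual microscoping": range(10, 12),  # steps 11-12
    "perform actual microscoping": range(12, 16),  # steps 13-16
}

def hierarchical_completion_rate(step_done, sub_weights=None):
    """step_done: list of 16 booleans, one per Table 1 step.

    Each sub-procedure is scored as the fraction of its own completed steps;
    the overall rate is a weighted mean of the four sub-scores, which is why
    the hierarchical measure rises smoothly as the procedure progresses.
    """
    sub_weights = sub_weights or {name: 1.0 for name in SUB_PROCEDURES}
    total, score = 0.0, 0.0
    for name, idx in SUB_PROCEDURES.items():
        sub_score = sum(step_done[i] for i in idx) / len(idx)
        score += sub_weights[name] * sub_score
        total += sub_weights[name]
    return score / total

# example: first sub-procedure fully done, nothing else
done = [True] * 6 + [False] * 10
print(hierarchical_completion_rate(done))   # 0.25 with equal sub-weights
```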
