### **1. Introduction**

Animal welfare is an important societal concern [1,2]. The use of animals in biomedical scientific research is widespread, and globally significant, with approximately 115 million animals used per year [3]. Incontrovertibly, there is an ethical obligation to safeguard welfare of these animals through employing strategies to minimize pain, fear, and distress [4–6], in addition to the promotion of positive welfare states. However, to achieve this, validated methods for identification of animal emotional state are required. Despite significant research attention, ascertaining nature and strength of animal emotion remains a challenging task [7–11].

The study of emotion in laboratory animals has typically focused on aversive states such as pain. This area of study was driven by two perspectives: a scientific standpoint and a welfare standpoint. The scientific viewpoint, based on the extrinsic value of the animal, relates to the robustness of results acquired from animal models. There is an abundance of data on the impact of pain on a wide range of metabolic, immunologic, and other processes in the body. These alterations introduce variability or confound interpretation of results [12–14]. The welfare viewpoint, considering the intrinsic value of the animal, assumes that pain occurs frequently in animal models and should therefore be avoided or minimized for the benefit of the animal. Notwithstanding differences between these viewpoints in terms of underlying motivation for study, the requirement for a reliable, practical method for assessment of pain is shared by both.

Recently, evaluation of complex motor responses, such as facial and corporal expression, was proposed as a neurobiological readout of mammalian brain neuro-circuitry associated with emotional experience [11,15–17]. Facial expression has received significant research attention, especially in rodents, as a potential assessment method for both positive and negative emotional states [9]. There remains controversy as to the communicative function of facial expressions in rodents, since these species tend to prioritize other senses such as olfaction and touch in communication [8]. However, the finding that lesions of the insular cortex modulate facial pain expressions in mice supports the use of facial expression assessment. The insular cortex is associated with human pain perception; hence it is assumed by analogy that facial grimace may represent a negative emotional experience [18]. Furthermore, studies on empathy tend to suggest that rodents communicate the presence of a painful state to others to elicit an empathic response [19]. Although not specifically demonstrated, it is feasible that this occurs through interpretation of facial expression [8]. Additionally, it was recently shown through the use of machine learning methods that facial expressions in mice may indicate not only the direction of effect or valence of emotion (positive or negative), but also its intensity and persistence [20].

Attempts to standardize evaluation of facial expressions for pain assessment have culminated in the development of the "grimace scales". These were developed originally for mice [18] and were adapted for use in rats [21], rabbits [22,23], sheep [24,25], ferrets [26], cats [27] and horses [28]. Grimace scales are simplified methods for evaluating facial expressions specifically related to pain, based on the assessment of action units focusing on the eyes, ears, and cheeks. The utility of the scales is well established across a range of laboratory animal species and animal model types. However, this evaluation has typically focused on their use via retrospective review of video recordings, and as a research tool to obtain data relevant to the animal model. There are fewer dedicated studies of the scales as 'bedside' pain assessment tools for rapid evaluation of pain status in laboratory animals in order to implement humane endpoints or provide analgesia. Therefore, the focus of this review is to discuss the practical utility of grimace scales in a range of laboratory animal species, identifying barriers to their use and potential confounders. The focus will be on laboratory rodents as the most common species used in biomedical research, but research from other species will be drawn upon. It is anticipated that this review will guide biomedical researchers, animal technicians and ethics committees when implementing pain assessment methods as part of research protocols.

### **2. History of Facial Expression Scoring for Pain in Laboratory Animals**

In recognition of the poor translation of outcomes from animal pre-clinical studies on pain physiology and analgesic development to humans [29,30], there has been a recent focus on development of methods for assessment of the affective pain response using non-evoked (spontaneous) responses [31]. Grimace scales are one such method, derived from human facial codification scales [32,33]. The Facial Action Coding System (FACS) systematically catalogs all possible movements of the facial muscles, or combinations of them, such as lowering the eyebrows, tightening and closing the eyelids, wrinkling the nose, and raising the upper lip. Categorization of changes in these muscle movements, or so-called "Facial Action Units" (FAU), enables facial recognition and categorization of emotions [16,34]. The finding that facial codification scales could quantify pain in humans with limited or non-existent verbal communication [35] provided the basis for using FAU in the development of grimace scales (GS) for animals (see [36]).

The mouse grimace scale (MGS) was the first to be developed. Langford et al. [18] in 2010 applied a nociceptive abdominal constriction test, induced by administration of acetic acid, that allowed the elucidation of facial action units that reliably detected pain. Validation was performed using a variety of traditional preclinical pain assays [18]. Five action units were described: (1) orbital tightening, (2) nose bulge, (3) cheek bulge, (4) ear position and (5) whisker change. A year later, Sotocinal et al. [21] published the rat grimace scale (RGS) comprising four action units, due to consolidation of nose and cheek flattening into one unit. Utility of the RGS to detect pain was demonstrated in standard pre-clinical nociceptive tests and following a surgical laparotomy procedure. Furthermore, RGS scores were shown to be modified after analgesic administration, indicating specificity to pain [21]. The development of grimace scales in other common laboratory animal species followed (see Table 1).


**Table 1.** Original studies in which grimace scales were developed for a range of species commonly used as laboratory animals.

### **3. Terminology Around Pain Classification and Assessment**

A variety of terms are used to describe pain and the methods applied to assess it. Pain is usually classified according to the duration of its effect or its originating source within the body [39,40]. Acute pain arises at the time of injury and is often experienced as different in nature from 'chronic' pain. The latter generally refers to pain experienced over a longer duration, although there appears to be no accepted duration marking the transition from acute to chronic pain [41]. An alternative distinction between the two time-course descriptors, based on functionality, has also been suggested. Acute pain is argued to be adaptive, provoking a learned response by the animal to avoid a similar painful insult in the future [39]. Chronic pain, on the other hand, is said to be maladaptive [42]. However, this latter point is controversial, with a variety of studies (see [43] for review) suggesting that pain-related hypervigilance may influence estimation of risk and subsequent behavior, and thus enhance survival.

Pain scales themselves are often described in terms of their validity, reliability, and sensitivity [41]. Validity describes the extent to which the scale measures its intended outcome, i.e., pain. There are several sub-categories of validity. The most commonly referred to in the context of grimace scales are face validity and construct validity. Face validity describes what the test appears to be measuring, i.e., pain. Construct validity relates to the extent to which the scales measure that specific construct. Therefore, the test needs to be both sensitive and specific to pain [44,45]. In pain studies, construct validity is often determined using an applied analgesic test, since this is assumed to reduce pain and thereby reduce grimace scores if the test is truly pain-related [44]. External validity refers to how generalizable the measure is to other settings. In the context of grimace scales, this is relevant in taking the scales from research scenarios to the clinical setting. This relates to the practicability of performing the assessment during the working day, the simplicity of the task, and the need for equipment and training. To date, this is the area that has received the least attention with regard to grimace scales.

Reliability refers to the scale producing the same result each time it is used, both within and between animals and time points [46]. In the context of grimace scales, this is determined by the variability within a single observer's measurements (intra-observer variability), the variation between different observers' measurements (inter-observer variability), and the variability between laboratories or research centers [44]. Sensitivity describes the ability of the scale to accurately identify changes in the degree of pain such that subtle changes are recognized [45]. In the context of pain scales, this is often indicated when changes in the scale correlate in direction and proportion with other measures [45]. It is common in assessment of pain in veterinary species to achieve measurement accuracy in pain scoring by using a smaller number of broad category groups, such as mild, moderate, and severe, rather than expecting sensitivity when small differences in scores are considered. The following sections consider how all of these measurement characteristics may influence the clinical applicability of grimace scales for use in biomedical research.
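As a rough illustration of the intra- and inter-observer variability described above, the sketch below compares paired sets of scores for the same animals. It is an assumption-laden simplification: the function name and the example scores are hypothetical, and the mean absolute difference is used only as a simple stand-in for whichever formal agreement statistic a given study reports.

```python
from statistics import mean


def mean_abs_difference(first_scores, second_scores):
    """Average absolute gap between two paired sets of grimace scores.

    Intra-observer use: the same observer scoring the same animals twice.
    Inter-observer use: two different observers scoring the same animals.
    A simple stand-in for a formal agreement statistic (assumption).
    """
    if len(first_scores) != len(second_scores):
        raise ValueError("score lists must be paired, one entry per animal")
    return mean(abs(a - b) for a, b in zip(first_scores, second_scores))


if __name__ == "__main__":
    # Hypothetical scores for four animals, invented for illustration only.
    observer_a_first_pass = [0.4, 1.2, 0.8, 1.6]
    observer_a_second_pass = [0.4, 1.0, 0.8, 1.6]
    observer_b = [0.6, 1.4, 0.4, 1.8]

    print("intra-observer:", mean_abs_difference(observer_a_first_pass, observer_a_second_pass))
    print("inter-observer:", mean_abs_difference(observer_a_first_pass, observer_b))
```

A value near zero indicates that repeated or independent scoring gives similar results; how large a gap is acceptable depends on the intended use of the scale and is not defined here.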

### **4. Clinical Applicability of Grimace Scales in Biomedical Research**

### *4.1. Development of Real-Time Grimace Scores*

There is now an extensive body of literature on the application of grimace scales in a range of animal models used commonly in biomedical research. The majority of this validation work has occurred in rodent models. It is beyond the scope of this review to describe all of the models used but the range includes oncology (see e.g., [47–50]), infectious disease [51], pain models [48,52,53], neurological conditions [33,54,55], genetic conditions [56], and maxillofacial interventions [49,50,57]. However, the vast majority of research to date has performed grimace scoring retrospectively from captured video footage.

Retrospective scoring is likely superior when using grimace scores to inform research outcomes, for example determining efficacy of analgesics or success of model induction. These methods allow for the possibility of replication, by multiple observers where appropriate, with an increased time available for scoring at the researcher's leisure. A cage-side or 'real-time' method, on the other hand, would ideally provide instant assessment allowing interventions to support welfare, for example by implementing humane endpoints or administering analgesics. Development of the latter is clearly of more interest to ethical review committees and animal carers needing to make rapid clinical decisions. To date there has been substantially less focus on development and validation of real-time methods.

Miller and Leach [58] in 2015 performed the first comprehensive evaluation of a real-time method applied in mice. In this study, retrospective and real-time scoring were compared. Real-time scoring was performed by observing mice three times over a 10 min period, while animals were being filmed for the retrospective analysis. Grimace scores were calculated by summation of each action unit as described by Langford et al. [18], and totals were then averaged across the observation points. Live scores were always found to be significantly lower than corresponding retrospective video scoring. The authors proposed that this could have resulted from the activity levels and changing nature of the face during live scoring. Blinking for instance, resulting in a score of 0 for orbital tightening, will likely be selected at least some of the time as a result of random chance selection of photographs for scoring. In a real-time scenario, the rapid nature of blinking will likely preclude its scoring. Similarly, Chartier et al. [47] in 2020 found consistently lower scores from live scoring compared to retrospective scoring in a mouse model of colitis-associated colo-rectal cancer. One potential explanation for this trend is that the presence of a human observer influences performance of the facial action units; for example, an increased alertness to the human (predator) could lead to wider eyes and 'pricked' ears, lowering the grimace score. Intriguingly, Sorge et al. [59] demonstrated that not all observers are equal, with no impact of a female observer on scores in rats and mice (obtained retrospectively), but a reduction of scores in the presence of a male [59]. In the first investigation of real-time scoring in rats, Leung et al. [60] in 2016 found that interval observations (15 s of observation) were able to discriminate between control and analgesic-treated groups, whereas point observations (conducted several times over a period) showed poor group discrimination. In this study, substantial variability was seen between single observations of either point or interval type. Limits of agreement with a retrospective scoring system were, however, fairly large, with a 0.5 score range either side of the bias, meaning there was a substantial risk of either over- or underestimating the score. Furthermore, point scoring became generally unreliable at discriminating groups when done for less than 2 min, assumed to be due to a loss of power from fewer observations. A later rat study by the same research group [61] compared the interval method with a retrospective method in a colitis model, showing the former to be reliable in predicting pain, with scores similar to the standard method.
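The scoring arithmetic and the agreement statistic discussed above can be sketched as follows. This is a minimal illustration rather than code from any of the cited studies: the function names are hypothetical, each action unit is assumed to be scored 0, 1 or 2 as in the original MGS, and the limits of agreement are assumed to take the conventional Bland-Altman form (bias ± 1.96 standard deviations of the paired differences), which the studies summarized here may or may not have used exactly.

```python
from statistics import mean, stdev

# The five MGS action units named in the text (Langford et al. [18]).
ACTION_UNITS = (
    "orbital_tightening",
    "nose_bulge",
    "cheek_bulge",
    "ear_position",
    "whisker_change",
)


def grimace_score(observations):
    """Sum the action-unit scores at each observation point, then average
    the totals across observation points.

    Assumption: every action unit is scored 0, 1 or 2.
    """
    totals = []
    for obs in observations:
        for unit in ACTION_UNITS:
            if obs[unit] not in (0, 1, 2):
                raise ValueError(f"{unit} must be scored 0, 1 or 2")
        totals.append(sum(obs[unit] for unit in ACTION_UNITS))
    return mean(totals)


def limits_of_agreement(live_scores, video_scores):
    """Return (bias, lower, upper) for paired live and retrospective scores.

    Assumption: conventional Bland-Altman limits, i.e. the mean paired
    difference (bias) plus or minus 1.96 sample standard deviations.
    """
    diffs = [live - video for live, video in zip(live_scores, video_scores)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


if __name__ == "__main__":
    # Hypothetical example: one mouse observed three times in a session.
    session = [
        dict(orbital_tightening=1, nose_bulge=0, cheek_bulge=1, ear_position=1, whisker_change=0),
        dict(orbital_tightening=0, nose_bulge=0, cheek_bulge=1, ear_position=1, whisker_change=0),
        dict(orbital_tightening=1, nose_bulge=1, cheek_bulge=1, ear_position=1, whisker_change=0),
    ]
    print("session grimace score:", grimace_score(session))
```

A negative bias from `limits_of_agreement` would correspond to live scores running lower than retrospective scores, and wide limits would indicate that any single live score could sit well above or below its retrospective counterpart.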

The implications of these findings for clinical pain assessment are several. Firstly, although good discriminant ability was generally found in these studies, results were obtained by statistical combination of multiple scores. In a clinical scenario, an observer is likely to take one score, and not have the means or time to mathematically manipulate the values to arrive at a reliable score. Secondly, the Leung et al. [60] study suggests that variability across the observation period is likely and that at least 2 min of observation is needed. It is unlikely to be practical for a caregiver to spend 2 min per animal performing pain assessment across a study. In this case, some other more general method of distress measurement is likely to be needed to 'triage' animals for secondary grimace assessment. There has been no investigation of the effect of moving animals, in isolation, to the clear cages typically used in grimace studies, as opposed to scoring in the home cage environment. Several factors may influence the grimace scoring between these two scenarios. The novelty of the scoring box may trigger a state of alert, influencing grimace scores in a similar vein to that suggested for the presence of a human observer. This novelty may indeed contribute to the variability seen between scores over time, since habituation will eventually occur. Alternatively, if scoring in the home cage, the presence of cage furniture, a potentially more relaxed state of the animal in its familiar environment, or even the influence of circadian rhythms (see later) may all variously influence the action units or the ability to see them accurately. A further consideration with real-time scoring is that there may be an inherent observer bias, as the animal's overall demeanor or the presence of other pain behaviors such as twitching may be noted, leading the observer to err on the side of higher action unit scores when unsure. This is not necessarily an issue per se in a clinical scenario, since the goal is to recognize sick animals for further evaluation and treatment. However, these other behaviors may not be unique to pain but may represent general sickness behavior that cannot be rectified by analgesic administration, and hence inappropriate medication administration may occur. If such biasing were occurring, differences in grimace scoring would be expected between observers experienced in working with the species in question and more naïve observers [47].

Notwithstanding these findings, some research groups do appear to have been able to use the MGS or RGS in a point observation, real-time scenario to obtain predicted results. For example, in chemotherapy-induced toxicity models in mice [62] and rats [63], single grimace scores distinguished between groups and followed the progression of the disease course as expected after induction of chemotherapy-induced gut toxicity. Conversely, Hsi et al. [64] in 2020 were unable to use point mouse grimace scores to distinguish between groups either supplemented or not with dextrose following bariatric surgery. However, in this experimental design there was no sham group, so it is unknown whether the MGS can reliably determine pain in this model [64].

There is clearly a need for further validation of real-time observation methods with a particular focus on one-off observations versus a series of observations, correlation with other established measures of pain assessment, inter-observer variability, and home cage versus novel area.

### *4.2. Impact of Biology and the Environment*

### 4.2.1. Strain and Sex Differences

There is some evidence that features of biology, performance of routine procedures, or aspects of the environment may influence grimace scores. This has implications for setting of intervention scores (see later), and should be a consideration in driving further research or recommendations for application to clinical practice.

Aspects of biology have perhaps been the most researched with regard to their impact on grimace scores. The greatest implication of such changes likely relates to any differences between rodent strains or stocks, given the wide range typically used in research. In mice, strain differences in MGS scores in animals not exposed to any painful interventions were demonstrated. Miller and Leach [58] in 2015 found that C3H/He mice showed significantly higher scores than CD-1 and C57BL/6 animals, although the order of effect for the latter two strains was different between males and females. In female BALB/c mice the grimace score was even higher than in C3H/He (males were not investigated in this study). Cho et al. [65] in 2019 similarly demonstrated a difference in MGS scores post-craniotomy, with C57BL/6 mice showing lower scores than CD-1 animals [65]. However, in pairwise comparisons of the CBA and DBA/2 strains in two further studies, no differences were found [66,67]. It was suggested by some authors that detection of facial features in dark animals may be more difficult [65,68]. Improving the image quality and providing a contrasting background color when recording appear to mitigate the effects [18], hence this may not be a feature of animal pigmentation per se. It should, however, be noted that in the Miller and Leach [58] 2015 study, female C57BL/6 animals were not scored the lowest; that place was taken by the white CD-1 animals. Brown C3H animals also occupied an intermediate position. In a clinical scenario where real-time scoring is likely to take place, the issue of poor background contrast on videos is not of concern. However, some investigation of the effects of color on live grimace scoring is warranted, since it may be equally difficult for a human observer to distinguish features such as whiskers against a similar coat color background, especially when trying to observe at a distance so as not to influence the animal's behavior.

Differences between sexes have also been uncovered in research to date on the MGS, but results are complex and suggest there may be strain interactions. For example, Miller and Leach [58] observed no differences in MGS scores between male and female C57BL/6 mice [58]. However, in the same study, both CD-1 and C3H/He males had greater scores than their female counterparts [58]. Similarly, male BALB/c mice had higher grimace scores than females [69]. In contrast, Cho et al. [65] found no sex differences in CD-1 mice, although differences in response to analgesic were noted, with females appearing to respond to carprofen with a reduction in grimace score more readily than males [65]. In rats, limited studies were carried out into sex differences, but no differences were found in the original validation study [21] or in a later study [70]. Unfortunately, most grimace studies in rats and mice appear to have been conducted in one sex, with a large proportion using male animals, see e.g., [52,71–73]. This bias in study design toward males, coupled with the enhanced understanding of the existence of different pathways and immune-cell types for pain processing between male and female rodents [74], renders extrapolation of findings to female rodents problematic.

### 4.2.2. Impact of Routine Procedures

It is clear that procedures occurring fairly often as part of vivarium routines may influence responses and should be taken into consideration when considering practical implementation of the grimace scales. For example, several studies evaluated the impact of anesthetics on rodent grimace scales. In general, both inhalational and injectable anesthetics lead to a short-term increase in grimace scores in both rats [73] and mice [66,75,76], although strain differences in the presence of this response were reported [66,75,76]. While this response is generally short-lived, repeated exposures lead to enhanced duration of the increase [68,73]. This is a particular consideration since grimace assessment would typically occur post-operatively to allow rescue analgesia administration, and there is a suggestion that the score increase may persist for up to a few hours post anesthesia [75,76].

There is a growing body of evidence that non-aversive handling of mice leads to reduced anxiety and improved resilience in the face of accompanying pain [77–79]. Cupping or tunnel handling are proposed as alternatives to the traditional method of picking up by the tail [78]. Perhaps somewhat surprisingly, given the reported specificity of the MGS for pain, there is some evidence that the method of routine handling influences the MGS, with increased scores in mice handled by the tail compared to those that were tunnel handled [69]. This contradicts the findings of a previous study where no differences between the two methods were reported [67]. This is an area that should be a priority for further investigation for several reasons. Firstly, since non-aversive methods have not been widely incorporated into laboratory animal practice, especially among researchers [80], it is quite likely that mice even within one study will be subject to different handling techniques. Any effect of handling method on grimace score could therefore confound interpretation of grimace scores used to determine research protocol effects on pain. Secondly, while there appears to have been no dedicated study on whether tail handling induces pain, there is a suggestion that it is non-painful, yet aversive [78]. If the method is actually non-painful, this calls into question the specificity of the MGS for pain, and therefore whether it has construct validity.

Ear tagging or ear notching are routine handling procedures used to permanently identify laboratory animals [81]. These procedures are known to cause acute pain as reflected by alterations in physiological indices such as heart rate and blood pressure [82]. However, the results obtained by Miller and Leach [81] in mice did not reveal any change to MGS scores as a result of ear notching [81]. In a later mouse study, with a factorial study design evaluating handling method with ear tagging or tattooing, MGS increased following ear tagging, but tattooing or restraint had no impact on scores [69]. In contrast, Keating et al. [22] in 2012 showed that ear tattooing in rabbits led to increases in rabbit grimace scale scores that were ameliorated by the application of a local topical anesthetic (lidocaine/prilocaine) [22]. Corticosterone measures in this study suggest that the pain response was short-lived and had resolved by 1 h post-procedure. Given that only three studies, performed in different species, evaluated these common procedures, it would be unwise to draw firm conclusions. However, the lack of grimace score increase in the Miller and Leach [81] study does imply that the scale may not be sensitive to pain of a mild and short-lived nature, either intrinsically or as a result of practical features whereby the pain is missed due to the scoring process required. Conversely, this finding provides some evidence that routine procedures may have minimal effect on grimace scales, thus reducing potential confounding when using the scales for humane endpoint implementation. When reconciling the difference in findings between this study [81] and the later study, Roughan and Sevenoaks [69] in 2019 speculated that ear tagging may be perceived as more painful than notching due to the prolonged irritation by the tag [69].
