### *4.3. Validity*

If grimace scales are to be implemented as a routine clinical assessment tool in biomedical research facilities, there needs to be a clear understanding of whether they are specific to pain and can reliably measure pain in the models being used. This is important because it influences the animal caretaker's decisions regarding treatment options, for example, whether analgesics will be effective in mitigating clinical signs. It can be seen from the above discussion that a range of external factors affect grimace scores, which bears on their validity as a pain assessment tool; anesthetics are a prime example. Setting aside the lack of study into their application in a real-time scenario, which influences their generalizability, another key concern is whether they are valid for all pain types. Results of the original 2010 study by Langford et al. [18] suggested that the technique was applicable only to acute pain states, since changes were not recorded after the application of traditional models of chronic pain, such as chronic constriction injury (CCI). However, there has now been a range of studies, largely performed in mice, which suggest that the grimace scale may be applicable to pain that is chronic or neuropathic in nature, or of a non-surgical origin (see [36] for detailed discussion).

The findings of Akintola et al. [52] in 2017 contradict the earlier results of Langford et al. [18] in 2010, with both the RGS and the MGS increasing after application of the CCI model in the respective species. Pain arising from cancer has also been shown to elevate the MGS, for example in colorectal cancer [49] and in a metastatic breast cancer model [49,50]. The MGS has been used successfully in models expected to produce pain of a neuropathic nature, for example in headache and migraine [55,87] and craniotomy [65]. There is also a suggestion that pain of a visceral nature elevates scores, based on studies evaluating colonic nociception [88], pelvic pain [89], colitis [61], and alimentary mucositis [62,63]. Hereditary sickle cell disease frequently leads to painful episodes in human patients. Cold treatment of transgenic sickle mice led to increased grimace scores, which were alleviated using a known analgesic agent. Furthermore, body changes of decreased length and increased back curvature were also correlated with the change in grimace scores [56]. These findings lend support to the proposition that the grimace scales have good construct validity for non-acute pain.

Despite these findings, results from other studies imply that further evaluation of the grimace techniques is necessary to ascertain validity. For example, in contradiction to later work [62,63] demonstrating elevations in scores in rats and mice with mucositis, Whittaker et al. [90] in 2015 found no change in grimace scores in a rat model, albeit using retrospective rather than real-time scoring. However, this study did find increases in the frequency of established behavioral indicators of pain, such as back arching and twitching [90]. Conversely, Leung et al. [61], in a rat DSS-colitis model, found grimace score increases in the absence of an increase in composite behavioral score.

Other studies also raise the question of whether the grimace scales are truly unique to pain. Caecal ligation and puncture models are commonly used to study sepsis [91]. While sepsis is undoubtedly a painful condition based on human reports [92], there is also an overwhelming cytokine response causing sickness behavior. Studies to date on this model [51,93] have not teased apart the possible contribution of this sickness response to the facial expression changes. One study lends support to this idea: the work of Yamamoto et al. [94] in 2016, while not employing the published rat grimace scale, provides evidence that nausea influences the eye action unit. Toxin administration, which might also be expected to cause dual symptoms of pain and sickness, similarly elevated the MGS [95]. Furthermore, analgesic administration was not always successful in reducing the scores, implying an alternate cause of the facial action unit response. Finally, head injury may, via neural mechanisms, alter the animal's ability to influence the facial action units and render grimace scores unreliable [44].

### *4.4. Automation of Techniques*

One of the main current barriers to widespread clinical application of the grimace scales is the lack of understanding as to their validity and reliability when used for live scoring. However, as illustrated, there is now a wealth of literature on the validity and application of retrospective techniques using video or photo footage. In a clinical scenario these methods have limited application due to the time taken to extract the images, perform the scoring and potentially combine scores using statistical methods. However, there has been some investigation of a range of technologies that minimize the time taken for various aspects of this process. At the simplest level, use of freeware video-to-JPG converter software can reduce the time associated with manual searching and capture of images from recorded video footage by automating the capture process [48]. However, this still requires manual viewing of the selected images to obtain unobstructed head shots. Sotocinal et al. [21] in 2011 developed Rodent Face Finder®, which is able to detect rodent eyes and ears to generate stills of rodent faces. This software has been used in a range of studies measuring grimace scores in both rats and mice (see e.g., [44,52,84,96]). Recently, another research group developed an algorithm to generate repeatable, non-observer-biased, standardized and randomized pictures in one step. The authors suggest that their system offers benefits in scoring animals with dark fur and in allowing several animals to be filmed, and images generated, simultaneously [97]. They further went on to show that the system was robust across several facilities, potentially minimizing issues around inter-laboratory variability as discussed previously [98].

This process of semi-automation makes grimace scoring somewhat more applicable to a clinical environment, but the time taken to manually score images is still likely to be a barrier to implementation. In recent years, there has been some progress on further automation of facial expression recognition using machine learning techniques. Deep learning methods allow classification and predictions on the data without previous feature design [99]. Tuttle et al. [99] in 2018 first trained a neural network using human-scored mouse images. Their system was highly accurate (94% agreement with human scores) for a binary (pain versus no pain) output, with scores correlating highly with human-assigned scores. Other groups have similarly demonstrated the promise of deep learning methods for use with the MGS when based on binary outputs [100,101]. Progress has also been made toward automating a facial pain expression system in sheep using techniques used in human facial recognition [102,103].

These automation methods are in their infancy, and no doubt there will be further development of these techniques over the next few years. A key issue currently is that they lack sensitivity, being able only to distinguish a painful from a non-painful state. This limits their current use for welfare assessment and endpoint implementation. However, given the success and practical implementation of machine learning methods in recognition of human facial expression, it is likely to be only a matter of time before a similar level of sensitivity of scoring will be possible in animal-focused methods [104].

### **5. Practical Considerations**

The above discussion highlights some areas in need of future research, particularly in regard to practical usage of the grimace scales in laboratory animal medicine. A key issue is what to do with the data when it is acquired, and what it means for the animal. In research use of the grimace scales, statistically significant differences in grimace scores in comparison with controls are typically reported. However, in a clinical scenario, a mass of data or control animals' results may not be available to make this comparison on the spot. Moreover, statistical significance may not always equate with clinical significance. The level of grimace score at which pain is actually occurring needs to be ascertained, since the evidence suggests that grimace scores in healthy animals are rarely zero [58]. Some attempts have been made to address this issue with the development of intervention thresholds. Scores above such a threshold signify that the animal is in pain, and consideration should be given to providing rescue analgesia [105]. These thresholds would need to be derived based on the method used to combine individual action unit scores; for example, in the MGS, summation of scores leads to a maximum of 10, whereas averaging leads to a maximum of 2. Oliver et al. [105] in 2014 determined for rats that 0.67/2 was a suitable intervention threshold. An intervention threshold has also been suggested for sheep (above 5/10) [25] and cats (0.39/1) [27]. The authors of the sheep study considered that false positives for pain were unlikely above this cut-off score, although they acknowledged that, due to low test sensitivity, some animals scoring below five may have a painful condition [25]. Since individuals experience pain differently, and there are associated sex differences in both pain experience and response to analgesics, work is needed to tailor intervention thresholds considering these factors.
Additionally, consideration should be given to the fluctuating nature of pain [106], rendering regular monitoring, scoring and comparison with previous scores critical [44]. Monitoring staff need to consider tailoring of analgesic regimes due to animals potentially being in more pain in their active phase (see e.g., [84]), which may fall outside of staffed hours.
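
The dependence of an intervention threshold on the method of combining action unit scores can be made concrete with a minimal sketch. It assumes each action unit is scored 0, 1 or 2, that a scale with five action units is being used (consistent with the MGS maxima of 10 and 2 quoted above), and that a score strictly above the threshold triggers consideration of rescue analgesia. The function names and example scores are hypothetical, and the value 0.67 is used only as an illustrative parameter taken from the rat figure cited above; any threshold would need to be validated for the species and scale in question.

```python
from statistics import mean

MAX_UNIT_SCORE = 2  # assumption: each action unit is scored 0, 1 or 2


def combine_scores(unit_scores, method="mean"):
    """Combine per-action-unit scores into a single grimace score."""
    if not unit_scores:
        raise ValueError("at least one action unit score is required")
    if any(s < 0 or s > MAX_UNIT_SCORE for s in unit_scores):
        raise ValueError("action unit scores must lie between 0 and 2")
    if method == "sum":
        return sum(unit_scores)
    if method == "mean":
        return mean(unit_scores)
    raise ValueError(f"unknown method: {method}")


def to_sum_threshold(mean_threshold, n_units):
    """Re-express a threshold on the averaged scale for the summed scale."""
    return mean_threshold * n_units


def exceeds_threshold(unit_scores, threshold, method="mean"):
    """True when the combined score lies strictly above the threshold."""
    return combine_scores(unit_scores, method) > threshold


# Hypothetical scores for five action units of one animal.
scores = [1, 0, 1, 2, 0]
print(combine_scores(scores, "sum"))   # summed scale, maximum 10
print(combine_scores(scores, "mean"))  # averaged scale, maximum 2
print(exceeds_threshold(scores, 0.67, "mean"))
print(exceeds_threshold(scores, to_sum_threshold(0.67, len(scores)), "sum"))
```

The point of the sketch is that a threshold is only meaningful alongside the combination method: the same animal gives the same decision under both methods only if the threshold is rescaled by the number of action units.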

Given the lack of established intervention thresholds, perhaps the best current advice would be to use a holistic approach in pain assessment and consider grimace scores alongside other measures of well-being, such as standard clinical scoring, and, where possible, look for trends in score progression within the same animal to guide decision-making. Animal carers also need to consider the potential impacts of inter- and intra-observer variability on scoring, which may be significant when statistical methods on group data are not used to smooth out variability. A prudent approach, where possible, would be to use the same scorer throughout a clinical case. This concern also raises the issue of training of scorers, which has received minimal research attention. Some studies have implied that minimal training, such as the provision of online instructions, is all that is necessary to achieve consistent results between expert and novice scorers [69,107]. However, another study has shown that more in-depth training, using practice scoring associated with structured opportunities for discussion, enhanced scoring ability [108].

### **6. Conclusions and Future Directions**

Despite 10 years of investigation, widespread uptake of grimace scoring in biomedical research has not occurred. The grimace scales offer enormous potential for clinical use in biomedical research. They are simple, require no equipment and have been shown in research studies to have good construct validity for most conditions. However, the methodology used in research on grimace scales is unlikely to lend itself to practical implementation due to its time-intensive and retrospective nature. To date, few studies have investigated the validity of grimace scales in scenarios requiring on-the-spot pain assessment and clinical decision-making. Key areas for focus are grimace score validity in animals housed in home cages, the reliability of using a limited number of real-time observation points, the impact of observers on scores, and the need for observer training. This is an area in urgent need of future research to realize the potential value of grimace scales.

One area that has received attention is the automation of scales using machine learning and algorithmic methods. This is a welcome development and will enhance the practical potential of grimace scales. It is hoped that in future years, grimace scale scoring may just be one of several outcome measures acquired routinely through facility-automated systems. This scenario is most likely to address the practical issues inherent when dealing with large numbers of animals, going some way toward addressing public concern around ethical decision-making in biomedical research.

**Author Contributions:** Conceptualization, D.M.-R. and A.L.W.; investigation, D.M.-R., A.L.W., A.O.-H., A.V.-M., E.H., J.M.-B.; writing—original draft preparation, D.M.-R., A.L.W., A.O.-H., A.V.-M., E.H., J.M.-B.; writing—review and editing, D.M.-R. and A.L.W.; project administration, D.M.-R. and A.L.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding. A.W. is supported by an Australian Government, NHMRC Peter Doherty Biomedical Research Fellowship (APP1140072).

**Conflicts of Interest:** The authors declare that they have no conflict of interest.
